In this workshop, we introduced transformers, covering their key components: positional embeddings, multi-head self-attention, residual connections, layer normalization, and feed-forward networks.
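To recap how those pieces fit together, here is a minimal PyTorch sketch of a single encoder block combining multi-head self-attention, residual connections, layer normalization, and a feed-forward network. The dimensions and post-norm layout are illustrative assumptions, not the exact configuration used in the workshop.

```python
import torch
import torch.nn as nn

class EncoderBlock(nn.Module):
    """Minimal post-norm transformer encoder block (illustrative sizes)."""
    def __init__(self, d_model=256, n_heads=8, d_ff=1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Multi-head self-attention with a residual connection, then layer norm
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        # Position-wise feed-forward network with a residual connection
        x = self.norm2(x + self.ff(x))
        return x

x = torch.randn(2, 16, 256)      # (batch, sequence length, embedding dim)
print(EncoderBlock()(x).shape)   # torch.Size([2, 16, 256])
```

Positional embeddings would be added to the token embeddings before the first such block; stacking several blocks yields the full encoder.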
We then surveyed applications of transformers across domains, including time series forecasting, drug discovery, audio processing, and text and image analysis.
Next, we explored how transformers can be adapted to process images and showed how to use Vision Transformers for tasks like depth prediction and classification.
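As a reminder of the classification workflow, the sketch below runs a pretrained Vision Transformer on a single image. It assumes the Hugging Face `transformers` library and the `google/vit-base-patch16-224` checkpoint, with `example.jpg` as a placeholder path; depth prediction follows the same pattern with a depth-estimation model and head.

```python
import torch
from PIL import Image
from transformers import ViTImageProcessor, ViTForImageClassification

# Pretrained ImageNet classifier (assumed checkpoint)
processor = ViTImageProcessor.from_pretrained("google/vit-base-patch16-224")
model = ViTForImageClassification.from_pretrained("google/vit-base-patch16-224")

image = Image.open("example.jpg")  # placeholder path: any RGB image
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits
predicted = logits.argmax(-1).item()
print(model.config.id2label[predicted])
```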
Finally, we discussed how transformers can handle text and image inputs jointly by leveraging CLIP, and we demonstrated in Python how to map images and text into a shared embedding space.
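The sketch below illustrates that image-text matching step with a pretrained CLIP model. It assumes the Hugging Face `transformers` library and the `openai/clip-vit-base-patch32` checkpoint; the image path and candidate captions are placeholders.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Pretrained CLIP model and processor (assumed checkpoint)
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # placeholder path: any RGB image
texts = ["a photo of a cat", "a photo of a dog"]
inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarity scores, converted to probabilities over the captions
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(texts, probs[0].tolist())))
```

The caption with the highest probability is the one whose text embedding lies closest to the image embedding in the shared space.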