In this lab, you’ll trace through parts of the implementation of a Transformer language model, focusing on the self-attention mechanism. You’ll also compare the performance of the Transformer with a baseline that uses only a feedforward network (MLP).
This lab addresses the following course objectives:
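For orientation, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. This is only an illustration of the mechanism you’ll trace; the notebook’s own implementation may differ in details (multi-head attention, masking conventions, and names such as d_model are assumptions here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    def __init__(self, d_model):
        super().__init__()
        # Linear projections that produce query, key, and value embeddings.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) hidden states
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention scores between every pair of positions: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        # Causal mask so each position attends only to itself and earlier positions.
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of value vectors: (batch, seq_len, d_model)
        return weights @ v
```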
- [NC-Embeddings] I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose.
- [NC-SelfAttention] I can explain the purpose and components of a self-attention layer. (Bonus topics - multi-head attention, positional encodings)
- [NC-TransformerDataFlow] I can identify the shapes of data flowing through a Transformer-style language model.
- [MS-LLM-Generation] I can extract and interpret model outputs (token logits) and use them to generate text.
- [MS-LLM-Tokenization] I can explain the purpose, inputs, and outputs of tokenization.
It could also be used to address the following course objectives:
- [MS-LLM-Train] I can describe the overall process of training a state-of-the-art dialogue LLM such as Llama or OLMo.
- [MS-LLM-Compute] I can analyze the computational requirements of training and inference of generative AI systems.
- [NC-Scaling] I can analyze how the computational requirements of a model scale with number of parameters and context size.
- [LM-SelfSupervised] I can explain how self-supervised learning can be used to train foundation models on massive datasets without labeled data.
- [CI-Topic-History] I can trace current AI technologies and ways of thinking back to origins and developments of at least a decade ago.
- [CI-LLM-Failures] I can identify common types of failures in LLMs, such as hallucination (confabulation) and bias.
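Several of these objectives concern concrete shapes and outputs (hidden states, token logits, generated tokens). The following sketch illustrates that data flow with toy sizes; it reuses the SelfAttention sketch above, and all sizes and layer names here are illustrative, not the ones in the notebook:

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
vocab_size, d_model, seq_len, batch = 100, 32, 8, 2

embed = nn.Embedding(vocab_size, d_model)   # token ids -> token embeddings
attn = SelfAttention(d_model)                # from the sketch above
lm_head = nn.Linear(d_model, vocab_size)     # hidden states -> token logits

token_ids = torch.randint(0, vocab_size, (batch, seq_len))   # (2, 8) ids from a tokenizer
hidden = embed(token_ids)                     # (2, 8, 32)
hidden = attn(hidden)                         # (2, 8, 32), one hidden state per position
logits = lm_head(hidden)                      # (2, 8, 100), a score for every vocab token
next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick of the next token, shape (2,)
print(token_ids.shape, hidden.shape, logits.shape, next_token.shape)
```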
Task
Start with this notebook:
Implementing self-attention
(u10n1-implement-transformer.ipynb; open in Colab)
You may find it helpful to refer to Jay Alammar’s The Illustrated GPT-2 (Visualizing Transformer Language Models).
Extension ideas
- Measure how much this network speeds up when you move it to a GPU (you may need to torch.compile it first); a timing sketch follows below.
- Other extensions are described on the Architectural Experimentation page.
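A minimal sketch of how one might time that comparison. It assumes a CUDA GPU is available; the model and input in the commented usage are placeholders for whatever the notebook builds:

```python
import time
import torch

def time_forward(model, x, n_iters=100):
    """Average seconds per forward pass; synchronizes when x is on a GPU."""
    with torch.no_grad():
        for _ in range(10):            # warm-up (also triggers compilation)
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_iters

# Hypothetical usage, assuming `model` and `x` come from the notebook:
# cpu_t = time_forward(model, x)
# gpu_model = torch.compile(model.cuda())
# gpu_t = time_forward(gpu_model, x.cuda())
# print(f"speedup: {cpu_t / gpu_t:.1f}x")
```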