In this lab, you’ll trace through parts of the implementation of a Transformer language model, focusing on the self-attention mechanism. You’ll also compare the performance of the Transformer with a baseline that uses only a feedforward network (MLP).
This lab addresses the following course objectives:
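For orientation, here is a minimal sketch of single-head scaled dot-product self-attention in PyTorch. This is only an illustration of the mechanism you’ll trace; the notebook’s own implementation may differ in details (multi-head attention, masking conventions, and names such as d_model are assumptions here):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfAttention(nn.Module):
    """Single-head scaled dot-product self-attention (illustrative sketch)."""
    def __init__(self, d_model):
        super().__init__()
        # Linear projections that produce query, key, and value embeddings.
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):
        # x: (batch, seq_len, d_model) hidden states
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        # Attention scores between every pair of positions: (batch, seq_len, seq_len)
        scores = q @ k.transpose(-2, -1) / (q.shape[-1] ** 0.5)
        # Causal mask so each position attends only to itself and earlier positions.
        seq_len = x.shape[1]
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        weights = F.softmax(scores, dim=-1)
        # Weighted sum of value vectors: (batch, seq_len, d_model)
        return weights @ v
```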
- [NC-Embeddings] I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose.
- [NC-SelfAttention] I can explain the purpose and components of a self-attention layer. (Bonus topics - multi-head attention, positional encodings)
- [NC-TransformerDataFlow] I can identify the shapes of data flowing through a Transformer-style language model.
- [MS-LLM-Generation] I can extract and interpret model outputs (token logits) and use them to generate text.
- [MS-LLM-Tokenization] I can explain the purpose, inputs, and outputs of tokenization.
It could also be used to address the following course objectives:
- [MS-LLM-Train] I can describe the overall process of training a state-of-the-art dialogue LLM such as Llama or OLMo.
- [MS-LLM-Compute] I can analyze the computational requirements of training and inference of generative AI systems.
- [NC-Scaling] I can analyze how the computational requirements of a model scale with number of parameters and context size.
- [LM-SelfSupervised] I can explain how self-supervised learning can be used to train foundation models on massive datasets without labeled data.
- [CI-Topic-History] I can trace current AI technologies and ways of thinking back to origins and developments of at least a decade ago.
- [CI-LLM-Failures] I can identify common types of failures in LLMs, such as hallucination (confabulation) and bias.
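Several of these objectives concern concrete shapes and outputs (hidden states, token logits, generated tokens). The following sketch illustrates that data flow with toy sizes; it reuses the SelfAttention sketch above, and all sizes and layer names here are illustrative, not the ones in the notebook:

```python
import torch
import torch.nn as nn

# Toy sizes, chosen only for illustration.
vocab_size, d_model, seq_len, batch = 100, 32, 8, 2

embed = nn.Embedding(vocab_size, d_model)   # token ids -> token embeddings
attn = SelfAttention(d_model)                # from the sketch above
lm_head = nn.Linear(d_model, vocab_size)     # hidden states -> token logits

token_ids = torch.randint(0, vocab_size, (batch, seq_len))   # (2, 8) ids from a tokenizer
hidden = embed(token_ids)                     # (2, 8, 32)
hidden = attn(hidden)                         # (2, 8, 32), one hidden state per position
logits = lm_head(hidden)                      # (2, 8, 100), a score for every vocab token
next_token = logits[:, -1, :].argmax(dim=-1)  # greedy pick of the next token, shape (2,)
print(token_ids.shape, hidden.shape, logits.shape, next_token.shape)
```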
Task
Start with this notebook:
Implementing self-attention
(u10n1-implement-transformer.ipynb; open in Colab)
You may find it helpful to refer to Jay Alammar’s The Illustrated GPT-2 (Visualizing Transformer Language Models).
Extension ideas
- Measure how much this network speeds up when you move it to a GPU (you may need to torch.compile it first); a timing sketch follows below.
- Other extensions are described on the Architectural Experimentation page.
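A minimal sketch of how one might time that comparison. It assumes a CUDA GPU is available; the model and input in the commented usage are placeholders for whatever the notebook builds:

```python
import time
import torch

def time_forward(model, x, n_iters=100):
    """Average seconds per forward pass; synchronizes when x is on a GPU."""
    with torch.no_grad():
        for _ in range(10):            # warm-up (also triggers compilation)
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        start = time.perf_counter()
        for _ in range(n_iters):
            model(x)
        if x.is_cuda:
            torch.cuda.synchronize()
        return (time.perf_counter() - start) / n_iters

# Hypothetical usage, assuming `model` and `x` come from the notebook:
# cpu_t = time_forward(model, x)
# gpu_model = torch.compile(model.cuda())
# gpu_t = time_forward(gpu_model, x.cuda())
# print(f"speedup: {cpu_t / gpu_t:.1f}x")
```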