Self-Attention By Hand (in Code)¶
You computed self-attention by hand on the handout for "the cat chases the". Now let's verify those calculations in code, then experiment with different query vectors.
Objectives: TM-SelfAttention, TM-TransformerDataFlow
import torch
Setup: Vectors from the Handout¶
These are the key and value vectors from the handout, plus the query vector for "the" (position 3).
tokens = ["the", "cat", "chases", "the"]
# Key vectors for each token (from the handout)
keys = torch.tensor([
    [1., 0.],  # "the" (pos 0)
    [3., 1.],  # "cat" (pos 1)
    [0., 3.],  # "chases" (pos 2)
    [2., 0.],  # "the" (pos 3)
])
# Value vectors for each token (from the handout)
values = torch.tensor([
    [0., 0.],  # "the" (pos 0)
    [4., 1.],  # "cat" (pos 1)
    [2., 2.],  # "chases" (pos 2)
    [0., 0.],  # "the" (pos 3)
])
# Query vector for "the" at position 3
query = torch.tensor([2., 3.])
Step 1: Compute Attention Scores¶
The attention score between the query and each key is their dot product. Fill in the dot product computation inside the loop below.
Reminder: torch.dot(a, b) computes the dot product of two 1D tensors.
# Compute the attention score (dot product) for each token.
scores = []
for i, token in enumerate(tokens):
    # score = dot product of query with keys[i]
    # your code here
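If you get stuck, here is one possible way to fill in the loop (a sketch using torch.dot as suggested above; the setup vectors are repeated so the cell runs on its own). With the vectors from the handout, the scores come out to 2, 9, 9, 4.

```python
import torch

tokens = ["the", "cat", "chases", "the"]
keys = torch.tensor([[1., 0.], [3., 1.], [0., 3.], [2., 0.]])
query = torch.tensor([2., 3.])  # query for "the" at position 3

# Attention score for each token = dot product of the query with that token's key
scores = []
for i, token in enumerate(tokens):
    score = torch.dot(query, keys[i])
    scores.append(score)
    print(f"{token:7s} score = {score.item():.1f}")
```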
Check: do these match your hand calculations from the handout?
Step 2: Normalize to Get Attention Weights¶
On the handout, we normalized by dividing by the sum. Do that here. (Real transformers use softmax, which we'll try next.)
# Normalize: divide each score by the sum of all scores
# your code here
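One possible solution, starting from the Step 1 scores (hard-coded here so the cell runs standalone):

```python
import torch

scores = torch.tensor([2., 9., 9., 4.])  # dot-product scores from Step 1

# Simple sum-normalization: each weight is its score's share of the total
weights = scores / scores.sum()
print(weights)  # 2/24, 9/24, 9/24, 4/24
```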
Step 3: Compute the Output¶
The output is the weighted sum of the value vectors. Each value vector is multiplied by its attention weight, then they're all added together.
# Compute weighted sum of values
# your code here
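One possible solution (the values and weights are repeated so the cell runs standalone; weights @ values is a compact way to write the weighted sum):

```python
import torch

values = torch.tensor([[0., 0.], [4., 1.], [2., 2.], [0., 0.]])
weights = torch.tensor([2., 9., 9., 4.]) / 24.  # sum-normalized weights from Step 2

# Weighted sum: scale each value vector by its weight, then add them up.
# weights @ values is equivalent to (weights[:, None] * values).sum(dim=0)
output = weights @ values
print(output)  # [2.25, 1.125]
```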
Check: does this match your handout answer (bottom-right cell of the table)?
Step 4: Try Softmax Instead¶
Real transformers use softmax (not simple sum-normalization) to convert scores to weights. Try it and see how the weights change.
# your code here
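One way to try it (a sketch comparing torch.softmax against the simple normalization from Step 2; softmax exponentiates the scores, so the two high-scoring tokens "cat" and "chases" end up with nearly all the weight):

```python
import torch

tokens = ["the", "cat", "chases", "the"]
scores = torch.tensor([2., 9., 9., 4.])

softmax_weights = torch.softmax(scores, dim=0)  # exp(score) / sum of exps
simple_weights = scores / scores.sum()          # score / sum of scores

for token, sw, nw in zip(tokens, softmax_weights, simple_weights):
    print(f"{token:7s} softmax={sw.item():.3f} simple={nw.item():.3f}")
```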
How did softmax change the weights compared to simple normalization? Which tokens got more/less attention?
your answer here
Experiment: Change the Query¶
The handout's discussion questions ask: "How would the attention pattern change if the query was [1, 3] instead?"
Try it! Also try other query vectors and see what happens.
def compute_attention(query, keys, values, tokens, use_softmax=True):
    """Compute and display attention for a given query."""
    scores = keys @ query
    if use_softmax:
        weights = torch.softmax(scores, dim=0)
    else:
        weights = scores / scores.sum()
    output = weights @ values
    print(f"Query: {query.tolist()}")
    for token, s, w in zip(tokens, scores, weights):
        bar = '#' * int(w.item() * 40)
        print(f"  {token:10s} score={s.item():5.1f} weight={w.item():.3f} {bar}")
    print(f"  Output: {output.tolist()}")
    print()
# Original query from the handout
compute_attention(torch.tensor([2., 3.]), keys, values, tokens)
# Discussion question: what if query was [1, 3]?
compute_attention(torch.tensor([1., 3.]), keys, values, tokens)
# Try your own query vectors!
# your code here
Challenge: Design a Query¶
Can you find a query vector that makes "the" attend almost entirely to "cat" (weight > 0.9)? What about one that splits attention roughly equally between "cat" and "chases"?
Hint: look at the key vectors. What query would have a large dot product with k("cat") = [3, 1] but small dot products with everything else?
# your code here: find a query that gives > 0.9 weight to "cat"
# your code here: find a query that splits attention ~equally between "cat" and "chases"
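One possible pair of answers, assuming softmax weights (the specific query vectors here are just my choices; many others work). Following the hint: pointing the query along k("cat") and scaling it up makes softmax saturate on "cat", while an equal split needs equal dot products with k("cat") = [3, 1] and k("chases") = [0, 3], i.e. 3x + y = 3y, so y = 1.5x.

```python
import torch

tokens = ["the", "cat", "chases", "the"]
keys = torch.tensor([[1., 0.], [3., 1.], [0., 3.], [2., 0.]])

# Query along k("cat"), scaled so "cat"'s score dwarfs the others
q_cat = torch.tensor([5., 0.])              # scores: 5, 15, 0, 10
w_cat = torch.softmax(keys @ q_cat, dim=0)
print(f'"cat" weight: {w_cat[1].item():.3f}')   # well above 0.9

# y = 1.5x gives equal "cat"/"chases" scores; scale up so "the" gets ~0
q_split = torch.tensor([4., 6.])            # scores: 4, 18, 18, 8
w_split = torch.softmax(keys @ q_split, dim=0)
print(f'"cat": {w_split[1].item():.3f}, "chases": {w_split[2].item():.3f}')
```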
Scaling Up: From 2D to Real Models¶
We just worked with 2D vectors and 4 tokens. In a real transformer (like Qwen2.5-0.5B):
- Embedding dimension: 896
- Number of attention heads: 14
- Head dimension: 896 / 14 = 64
- Typical sequence length: 50+ tokens
Think about it: If we had 50 tokens with 64-dimensional key, query, and value vectors:
- What would be the shape of the keys matrix?
- What would be the shape of the scores matrix (if we computed attention for all queries at once)?
- What would be the shape of the output matrix?
We'll work through this on Friday with a real model's dimensions!
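One way to check your shape answers empirically: build random tensors with those dimensions (a sketch for a single attention head, not a real model) and see what each matmul produces.

```python
import torch

seq_len, head_dim = 50, 64
queries = torch.randn(seq_len, head_dim)  # one 64-d query per token -> (50, 64)
keys    = torch.randn(seq_len, head_dim)  # (50, 64)
values  = torch.randn(seq_len, head_dim)  # (50, 64)

scores  = queries @ keys.T                 # one score per (query, key) pair -> (50, 50)
weights = torch.softmax(scores, dim=-1)    # still (50, 50); each row sums to 1
output  = weights @ values                 # one output vector per token -> (50, 64)

print(scores.shape, output.shape)
```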
your answer here