Self-Attention By Hand (in Code)¶

You computed self-attention by hand on the handout for "the cat chases the". Now let's verify those calculations in code, then experiment with different query vectors.

Objectives: TM-SelfAttention, TM-TransformerDataFlow

In [ ]:
import torch

Setup: Vectors from the Handout¶

These are the key and value vectors from the handout, plus the query vector for "the" (position 3).

In [ ]:
tokens = ["the", "cat", "chases", "the"]

# Key vectors for each token (from the handout)
keys = torch.tensor([
    [1., 0.],   # "the" (pos 0)
    [3., 1.],   # "cat" (pos 1)
    [0., 3.],   # "chases" (pos 2)
    [2., 0.],   # "the" (pos 3)
])

# Value vectors for each token (from the handout)
values = torch.tensor([
    [0., 0.],   # "the" (pos 0)
    [4., 1.],   # "cat" (pos 1)
    [2., 2.],   # "chases" (pos 2)
    [0., 0.],   # "the" (pos 3)
])

# Query vector for "the" at position 3
query = torch.tensor([2., 3.])

Step 1: Compute Attention Scores¶

The attention score between the query and each key is their dot product. Fill in the dot product computation inside the loop below.

Reminder: torch.dot(a, b) computes the dot product of two 1D tensors.

In [ ]:
# Compute the attention score (dot product) for each token.
scores = []
for i, token in enumerate(tokens):
    # score = dot product of query with keys[i]
    pass  # your code here (replace `pass` with your computation)

Check: do these match your hand calculations from the handout?

Step 2: Normalize to Get Attention Weights¶

On the handout, we normalized by dividing by the sum. Do that here. (Real transformers use softmax, which we'll try next.)

In [ ]:
# Normalize: divide each score by the sum of all scores
# your code here

Step 3: Compute the Output¶

The output is the weighted sum of the value vectors. Each value vector is multiplied by its attention weight, then they're all added together.

In [ ]:
# Compute weighted sum of values
# your code here

Check: does this match your handout answer (bottom-right cell of the table)?

Step 4: Try Softmax Instead¶

Real transformers use softmax (not simple sum-normalization) to convert scores to weights. Try it and see how the weights change.

In [ ]:
# your code here

How did softmax change the weights compared to simple normalization? Which tokens got more/less attention?

your answer here

Experiment: Change the Query¶

The handout's discussion questions ask: "How would the attention pattern change if the query was [1, 3] instead?"

Try it! Also try other query vectors and see what happens.

In [ ]:
def compute_attention(query, keys, values, tokens, use_softmax=True):
    """Compute and display attention for a given query."""
    scores = keys @ query
    if use_softmax:
        weights = torch.softmax(scores, dim=0)
    else:
        weights = scores / scores.sum()
    output = weights @ values
    
    print(f"Query: {query.tolist()}")
    for token, s, w in zip(tokens, scores, weights):
        bar = '#' * int(w.item() * 40)
        print(f"  {token:10s}  score={s.item():5.1f}  weight={w.item():.3f}  {bar}")
    print(f"  Output: {output.tolist()}")
    print()
In [ ]:
# Original query from the handout
compute_attention(torch.tensor([2., 3.]), keys, values, tokens)

# Discussion question: what if query was [1, 3]?
compute_attention(torch.tensor([1., 3.]), keys, values, tokens)

# Try your own query vectors!
# your code here

Challenge: Design a Query¶

Can you find a query vector that makes "the" attend almost entirely to "cat" (weight > 0.9)? What about one that splits attention roughly equally between "cat" and "chases"?

Hint: look at the key vectors. What query would have a large dot product with k("cat") = [3, 1] but small dot products with everything else?

In [ ]:
# your code here: find a query that gives > 0.9 weight to "cat"
In [ ]:
# your code here: find a query that splits attention ~equally between "cat" and "chases"
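One way to start exploring (not the only answer): a query pointing in the same direction as k("cat") = [3, 1], scaled up, makes softmax concentrate sharply on "cat".

```python
import torch

keys = torch.tensor([[1., 0.], [3., 1.], [0., 3.], [2., 0.]])
query = 2.0 * torch.tensor([3., 1.])  # scaled copy of k("cat"); one candidate
weights = torch.softmax(keys @ query, dim=0)
print(weights)  # index 1 ("cat") should dominate
```

Scaling the query up sharpens the softmax; scaling it down flattens the weights, which is useful for the equal-split question.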

Scaling Up: From 2D to Real Models¶

We just worked with 2D vectors and 4 tokens. In a real transformer (like Qwen2.5-0.5B):

  • Embedding dimension: 896
  • Number of attention heads: 14
  • Head dimension: 896 / 14 = 64
  • Typical sequence length: 50+ tokens
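The head split in the list above can be sketched with random tensors (dimensions taken from the Qwen2.5-0.5B figures quoted above; the bare reshape is illustrative, since real implementations also apply learned Q/K/V projections):

```python
import torch

d_model, n_heads = 896, 14
head_dim = d_model // n_heads  # 64

x = torch.randn(4, d_model)             # embeddings for 4 tokens
x_heads = x.view(4, n_heads, head_dim)  # split into 14 heads of 64 dims each
print(x_heads.shape)  # torch.Size([4, 14, 64])
```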

Think about it: If we had 50 tokens with 64-dimensional key, query, and value vectors:

  1. What would be the shape of the keys matrix?
  2. What would be the shape of the scores matrix (if we computed attention for all queries at once)?
  3. What would be the shape of the output matrix?

We'll work through this on Friday with a real model's dimensions!

your answer here