Unit 10: Transformers

The Transformer architecture (sometimes called a self-attention network) has powered many recent advances, not just in NLP but also in vision, audio, and other areas. This week we’ll see how it works!

By the end of this week you should be able to answer the following questions:

What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge? (Bonus: what are some alternative approaches, such as convolutional nets, recurrent nets, and all-MLP (spatial gating) networks?)
How does data flow in a self-attention network? In what sense does it use conditional logic?
What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate to our old friends the dot product and softmax? (Wait, is this logistic classification yet again?)
How do self-attention networks keep track of position?
What are encoders and decoders? Why does the distinction matter? What impact does it have on what you can do with the model?

Preparation

Read and/or watch two things about how Transformers work.

Transformers Study Materials at a range of levels of detail.
Twitter threads more your thing? Part 1, Part 2

Supplemental Material

After all of this, self-attention may not be the best approach. Amazingly (to me), a precomputed token-mixing matrix might outperform self-attention: [2203.06850] Efficient Language Modeling with Sparse all-MLP.

Class Meetings

Monday

Review Transformer layers
Attention: draw it, code it.
- bertviz
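
For the "code it" part, here is a minimal sketch of a single attention head in NumPy. It is not the code we will write in class, just one way to see the pieces: the names (attention_head, W_q, W_k, W_v) and the sizes are made up for illustration, the weights are random rather than trained, and it leaves out batching, masking, multiple heads, and the output projection.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(x, W_q, W_k, W_v):
    """One self-attention head (illustrative sketch).

    x:   (seq_len, d_model) one vector per token
    W_*: (d_model, d_head)  projection matrices (random here, learned in practice)
    Returns the output (seq_len, d_head) and the attention weights (seq_len, seq_len).
    """
    q = x @ W_q  # queries: what each token is looking for
    k = x @ W_k  # keys: what each token offers to be matched against
    v = x @ W_v  # values: what each token passes along if attended to
    # Dot product of every query with every key, scaled by sqrt(d_head).
    scores = q @ k.T / np.sqrt(k.shape[-1])
    # Softmax over keys: each row is a probability distribution over tokens.
    weights = softmax(scores, axis=-1)
    # Each output is a weighted average of the value vectors.
    return weights @ v, weights

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 5, 8, 4  # hypothetical sizes
x = rng.normal(size=(seq_len, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))

out, weights = attention_head(x, W_q, W_k, W_v)
print(out.shape, weights.shape)  # (5, 4) (5, 5)
```

Notice that nothing in attention_head depends on seq_len: the same three matrices work for a sequence of any length, which is one answer to the variable-length question. Each row of weights is a softmax over dot products, so it should look a lot like the classifier outputs we have seen before.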

Wednesday: Advising Break

Friday

Logistics
- Feedback survey
- Project milestones
Transformer architecture: Annotated Transformer
Position embeddings
Encoders and decoders
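
To make the last two topics concrete, here is a rough NumPy sketch of (1) fixed sinusoidal position encodings, which are one common choice (learned position embeddings are another), and (2) the causal mask that distinguishes decoder-style attention from encoder-style attention. The function names and sizes are invented for illustration, and the scores are all zeros so that only the effect of the mask shows up.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Fixed sinusoidal position encodings, shape (seq_len, d_model).

    Assumes d_model is even. Even columns hold sines, odd columns cosines,
    with wavelengths that grow geometrically across the columns.
    """
    pos = np.arange(seq_len)[:, None]       # (seq_len, 1)
    i = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2)
    angles = pos / (10000 ** (i / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

def causal_mask(seq_len):
    """Boolean mask: entry [i, j] is True when token i may attend to token j (j <= i)."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def masked_softmax(scores, mask):
    # Disallowed positions get -inf, so they end up with exactly zero weight.
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

seq_len, d_model = 4, 6  # hypothetical sizes
pe = sinusoidal_positions(seq_len, d_model)  # added to the token embeddings

scores = np.zeros((seq_len, seq_len))  # placeholder scores to isolate the mask
full = np.ones((seq_len, seq_len), dtype=bool)
enc = masked_softmax(scores, full)                  # encoder: every token sees every token
dec = masked_softmax(scores, causal_mask(seq_len))  # decoder: no peeking at later tokens

print(pe.shape)
print(enc.round(2))
print(dec.round(2))
```

With these placeholder scores, every row of enc is uniform over all four tokens, while row i of dec is uniform over tokens 0 through i only. That one mask is much of the practical difference: a decoder can be trained to predict the next token without seeing it, while an encoder gets to use context from both directions.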

Discussion 10: AI and the Environment (due Fri Mar 25)

Due this Week

Discussion 10: AI and the Environment (Fri)
Homework 9 (Thu)