Unit 10: Transformers

The Transformer architecture (sometimes called a self-attention network) has powered many recent advances, not just in NLP but also in vision, audio, and other areas. This week we’ll see how it works!
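As a preview of the core mechanism the readings cover, here is a minimal NumPy sketch of single-head scaled dot-product self-attention. The variable names, shapes, and toy numbers are illustrative assumptions of mine, not taken from any assigned reading.

```python
# Minimal sketch of single-head scaled dot-product self-attention (NumPy only).
# Names and dimensions are illustrative, not from any particular implementation.
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices."""
    Q = X @ Wq                                   # queries
    K = X @ Wk                                   # keys
    V = X @ Wv                                   # values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # (seq_len, seq_len) similarities
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over the keys
    return weights @ V                           # each token: weighted mix of values

# Toy usage: 4 tokens, model dim 8, head dim 8
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
out = self_attention(X, Wq, Wk, Wv)              # shape (4, 8)
```

Note how the mixing weights are recomputed from the input itself on every call; that input dependence is what distinguishes attention from the fixed token-mixing alternatives mentioned under Supplemental Material.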

By the end of this week you should be able to answer the following questions:

Preparation

Read and/or watch two things about how Transformers work.

Supplemental Material

After all of this, self-attention may not be the best choice after all. Amazingly (to me), a precomputed token-mixing matrix might actually outperform self-attention: [2203.06850] Efficient Language Modeling with Sparse all-MLP.
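For intuition only, here is a rough sketch of what "precomputed token mixing" means in contrast to the attention sketch above: the (seq_len × seq_len) mixing weights are fixed parameters rather than being computed from the tokens. This is my own simplification for illustration, not the architecture from the paper.

```python
# Rough sketch: token mixing with an input-independent matrix, in contrast to
# attention weights computed from Q and K. My simplification, not the paper's model.
import numpy as np

seq_len, d_model = 4, 8
rng = np.random.default_rng(1)
X = rng.normal(size=(seq_len, d_model))      # token representations

M = rng.normal(size=(seq_len, seq_len))      # fixed mixing matrix: no queries or keys
mixed = M @ X                                # each output token is a fixed linear
                                             # blend of all input tokens
```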

Class Meetings

Monday

Wednesday: Advising Break

Friday

Contents

Due this Week