Unit 10: Transformers

The Transformer architecture (sometimes called a self-attention network) has been the power behind many recent advances not just in NLP but also in vision, audio, and other domains. That’s because Transformers are currently one of the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words or images. This week we’ll see how they work!
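To make the “self-attention” part concrete, here is a minimal sketch (in NumPy, not taken from the readings) of scaled dot-product self-attention, the core operation inside each Transformer layer. All names, shapes, and values are purely illustrative.

```python
import numpy as np

def self_attention(X, Wq, Wk, Wv):
    """X: (seq_len, d_model); Wq/Wk/Wv: (d_model, d_head) projection matrices."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv             # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])      # how much each position attends to every other
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over positions
    return weights @ V                           # each output is a weighted mix of the values

# Illustrative toy dimensions and random inputs
rng = np.random.default_rng(0)
d_model, d_head, seq_len = 8, 4, 5
X = rng.standard_normal((seq_len, d_model))
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))
print(self_attention(X, Wq, Wk, Wv).shape)       # (5, 4)
```

A full Transformer stacks many of these attention operations (with multiple heads per layer) alongside feed-forward layers, residual connections, and layer normalization.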

By the end of this week you should be able to answer the following questions:

Preparation

Read and/or watch two things about how Transformers work.

Supplemental Material

Class Meetings

Monday class

bertviz, a tool for visualizing attention in Transformer models (see the sketch below)
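For reference, a minimal sketch of how bertviz is typically used in a Jupyter notebook to inspect a model’s attention heads; the model name and example sentence here are illustrative choices, not necessarily what we’ll use in class.

```python
from transformers import AutoTokenizer, AutoModel
from bertviz import head_view

# Load a pretrained model and ask it to return attention weights
model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name, output_attentions=True)

# Run one sentence through the model and collect attention for every layer
inputs = tokenizer("The cat sat on the mat", return_tensors="pt")
attention = model(**inputs).attentions           # tuple: one attention tensor per layer
tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0])

head_view(attention, tokens)                     # renders an interactive attention view
```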

Wednesday: Advising Break

Friday: Midterm
