The Transformer architecture (sometimes called a self-attention network) has powered many recent advances, not just in NLP but also in vision, audio, and beyond.
That’s because Transformers are currently among the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words, or over all possible images.
This week we’ll see how they work!
By the end of this week you should be able to answer the following questions:
Describe practical considerations in handling batches of variable-length sequences, such as padding, attention masking, and truncation (see the padding-and-masking sketch after this list).
Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and to the general idea of partial credit in classifiers); a small worked example follows this list.
What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge? (Bonus: what are some alternative approaches, such as convolutional nets, recurrent nets, or all-MLP (spatial gating) networks?)
How does data flow in a self-attention network? In what sense does it use conditional logic?
What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate to our old friends the dot product and softmax? (Wait, is this logistic classification yet again?) See the scaled dot-product sketch after this list.
How do self-attention networks keep track of position? (A positional-encoding sketch appears after this list.)
What are encoders and decoders? Why does that matter? What impact does that have on what you can do with the model?
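A few small sketches to make some of these questions concrete. First, a minimal padding-and-masking example, assuming PyTorch and made-up integer token ids rather than any particular tokenizer:

```python
import torch

# Toy "tokenized" sequences of different lengths (hypothetical token ids).
sequences = [[5, 9, 2], [7, 1], [3, 8, 4, 6, 2]]

max_len = 4   # truncate anything longer, pad anything shorter
pad_id = 0

batch, mask = [], []
for seq in sequences:
    seq = seq[:max_len]                            # truncation
    pad = [pad_id] * (max_len - len(seq))          # padding to a common length
    batch.append(seq + pad)
    mask.append([1] * len(seq) + [0] * len(pad))   # 1 = real token, 0 = padding

input_ids = torch.tensor(batch)       # shape: (batch_size, max_len)
attention_mask = torch.tensor(mask)   # shape: (batch_size, max_len)
print(input_ids)
print(attention_mask)
```

The mask is what lets the model ignore padded positions when computing attention weights.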
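For the perplexity question, here’s a tiny numerical sketch (with made-up probabilities) of how log-likelihood, cross-entropy, and perplexity relate:

```python
import math

# Hypothetical probabilities the model assigned to each correct next token.
p_correct = [0.5, 0.25, 0.1, 0.8]

log_likelihood = sum(math.log(p) for p in p_correct)   # total log-likelihood
cross_entropy = -log_likelihood / len(p_correct)       # mean negative log-likelihood (nats per token)
perplexity = math.exp(cross_entropy)                   # perplexity = exp(cross-entropy)

print(cross_entropy, perplexity)
# "Partial credit": assigning 0.25 to the right token hurts much less than
# assigning 0.01 would, even though neither makes it the top prediction.
```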
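For the attention-head question, here’s a single-head, single-sequence sketch of scaled dot-product attention, with shapes in the comments; the projection matrices are random stand-ins for learned parameters:

```python
import torch

seq_len, d_model, d_head = 5, 16, 8
x = torch.randn(seq_len, d_model)         # one sequence of token representations

# Random stand-ins for the learned query/key/value projection matrices.
W_q = torch.randn(d_model, d_head)
W_k = torch.randn(d_model, d_head)
W_v = torch.randn(d_model, d_head)

Q = x @ W_q                               # queries: (seq_len, d_head)
K = x @ W_k                               # keys:    (seq_len, d_head)
V = x @ W_v                               # values:  (seq_len, d_head)

scores = Q @ K.T / d_head ** 0.5          # scaled dot products: (seq_len, seq_len)
weights = torch.softmax(scores, dim=-1)   # each row sums to 1 (softmax over positions)
output = weights @ V                      # weighted average of values: (seq_len, d_head)

print(weights.shape, output.shape)
```

Note the softmax over dot products: each position is, in effect, running a softmax classifier over which other positions to attend to.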
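And for the position question, one common approach is the sinusoidal encoding from the original Transformer paper, sketched below; many models instead learn position embeddings directly:

```python
import torch

def sinusoidal_positions(seq_len: int, d_model: int) -> torch.Tensor:
    """Sinusoidal position encodings, shape (seq_len, d_model)."""
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq_len, 1)
    dim = torch.arange(0, d_model, 2, dtype=torch.float32)          # even dimensions
    freq = 1.0 / (10000 ** (dim / d_model))
    enc = torch.zeros(seq_len, d_model)
    enc[:, 0::2] = torch.sin(pos * freq)
    enc[:, 1::2] = torch.cos(pos * freq)
    return enc

# These get added to the token embeddings so attention can tell positions apart.
print(sinusoidal_positions(seq_len=6, d_model=8))
```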
Preparation
Read and/or watch two things about how Transformers work.