Logistics
- Focus on projects
- Short weeks: Good Friday and Easter Monday
- Wednesday lab
- Midterm 2 on the last day of class
- Language Modeling
- Transformers
- Basics of generative modeling and reinforcement learning
Objectives
- Compare and contrast the main types of deep neural network models (Transformers, Convolutional Networks, and Recurrent Networks) in terms of how information flows through them
Deep Neural Net = stack of layers
Modular components, often connected sequentially
- MLP (Feed-forward)
- Self-Attention
- Convolution
- Pooling
- Recurrent (RNN, LSTM)
- Normalization (BatchNorm, LayerNorm)
- Dropout
Feed-Forward / MLP
- Universal function approximator
- But not necessarily efficient
- Fixed information flow within layer
\[f(x) = f_2(\operatorname{ReLU}(f_1(x)))\]
where \(f_1\) and \(f_2\) are both linear transformations (\(f_i(x) = x W_i + b_i\)) and \(\operatorname{ReLU}(x) = \max(0, x)\) elementwise.
(Other nonlinearities are sometimes used instead of ReLU)
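Below is a minimal NumPy sketch of the two-layer MLP defined above. The layer sizes, random initialization, and batch of inputs are illustrative assumptions, not part of the slide.
```python
import numpy as np

# Two-layer MLP: f(x) = f_2(ReLU(f_1(x))), with f_i(x) = x W_i + b_i.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3          # illustrative sizes

W1, b1 = rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_hidden, d_out)), np.zeros(d_out)

def relu(x):
    return np.maximum(0, x)              # elementwise max(0, x)

def mlp(x):
    h = relu(x @ W1 + b1)                # f_1, then the nonlinearity
    return h @ W2 + b2                   # f_2

x = rng.standard_normal((5, d_in))       # a batch of 5 inputs
print(mlp(x).shape)                      # (5, 3)
```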
Example
On what day of the week will your next birthday be?
- Input: birthday, for each student in the class
- Output: day of week, for each student in the class
Information flow between students?
Self-Attention: One Attention Head
- Information flow computed dynamically
- Each token \(i\) computes a query and a key: \(q_i = x_i W_Q\); \(k_i = x_i W_K\)
- Each query is compared with each key: \(S = Q K^T\) (i.e., compute dot product of a token’s query with each other token’s key)
- Compute softmax across each row: \(A = \operatorname{softmax}(S)\)
- When query matches key, info flows (“attends”) via value \(v_i = x_i W_V\): \(\text{out}_i = \sum_j A_{ij} v_j\)
- Masking: during training, need to ensure information flows only along valid paths (e.g., a token should not attend to future tokens it must predict)
- Parallel computation across all tokens
- No explicit representation of neighbors
- Anything can attend to anything
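Below is a minimal NumPy sketch of one attention head following the formulas above. The token count, dimensions, and the optional causal mask are illustrative assumptions; the common \(1/\sqrt{d_k}\) scaling of the scores is noted in a comment but omitted to match the slide's formulas.
```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=False):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values
    S = Q @ K.T                                # compare each query with each key
    # (In practice S is usually scaled by 1/sqrt(d_k); omitted here to match the slide.)
    if causal:
        # Masking: block attention to future tokens so only valid flows remain.
        n = S.shape[0]
        S = np.where(np.tril(np.ones((n, n), dtype=bool)), S, -np.inf)
    A = softmax(S, axis=-1)                    # softmax across each row
    return A @ V                               # out_i = sum_j A_ij v_j

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4            # illustrative sizes
X = rng.standard_normal((n_tokens, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V, causal=True).shape)  # (6, 4)
```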
Example
Find another student whose birthday will be the same day of week as you.
Count the number of other students who will have the same birthday as you.
What are the keys? queries? values?
How could we query the class to get the count of Monday birthdays?
In practice
- Multi-Head Attention (MHA)
- Several attention “heads” (8, 16, 48, …)
- Each head computes a query, key, and value.
- Sum the outputs of all the heads.
- Intersperse MHA and FFN layers
- FFN: local computation
- MHA: share information
- Residual connections: output = input + f(input)
- Gradients flow more easily -> many more layers possible
- Common trick for other architectures also (LSTM, ResNet CNNs)
- Layers need to share a common dimensionality
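Below is a minimal NumPy sketch of one Transformer block in the spirit of the bullets above: several attention heads whose outputs are summed, an FFN, and a residual connection around each sublayer. The head count, sizes, and initialization scale are illustrative assumptions, and normalization layers are omitted for brevity.
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, d_ffn, n_tokens = 16, 4, 4, 32, 6   # illustrative sizes

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V, W_O):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T)
    return (A @ V) @ W_O                 # project back to d_model so heads can be summed

heads = [
    tuple(rng.standard_normal(shape) * 0.1
          for shape in [(d_model, d_head)] * 3 + [(d_head, d_model)])
    for _ in range(n_heads)
]
W1 = rng.standard_normal((d_model, d_ffn)) * 0.1
W2 = rng.standard_normal((d_ffn, d_model)) * 0.1

def transformer_block(X):
    # MHA: share information across tokens; sum the heads' outputs.
    X = X + sum(head(X, *w) for w in heads)        # residual: output = input + f(input)
    # FFN: local (per-token) computation.
    X = X + np.maximum(0, X @ W1) @ W2             # residual again
    return X                                       # same dimensionality in and out

X = rng.standard_normal((n_tokens, d_model))
print(transformer_block(X).shape)                  # (6, 16)
```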
Recurrent (RNN, LSTM)
- One step at a time
- Sequential
- Update a “hidden state”
- Efficient at inference time
- Difficult to learn long-range dependencies
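Below is a minimal NumPy sketch of a vanilla RNN matching the bullets above: process the sequence one step at a time, updating a hidden state. The sizes and the tanh update are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 3, 5, 7                 # illustrative sizes

W_xh = rng.standard_normal((d_in, d_hidden)) * 0.1
W_hh = rng.standard_normal((d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

def rnn(xs):
    h = np.zeros(d_hidden)                        # initial hidden state
    for x_t in xs:                                # sequential: one step at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)  # update the hidden state
    return h                                      # summary of the whole sequence

xs = rng.standard_normal((seq_len, d_in))
print(rnn(xs).shape)                              # (5,)
```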
Example
What is the latest-in-the-year birthday?
Review
Which is the most natural architecture for each of the following tasks?
Architectures:
- Feed-Forward (MLP)
- Self-Attention
- Recurrent (RNN, LSTM)
- Convolutional (CNN) (not yet covered)
Exercises:
- On what day of the week will your next birthday be?
- Find another student whose birthday will be the same day of week as you.
- What is the latest-in-the-year birthday?
- Which part of the class has the farthest-apart birthdays?
Convolution and Pooling
- Feed-Forward Network on a patch; slide the patch around to compute many outputs
- Information flow fixed but local (neighbors)
- Parallel
- Summarize regions of the input
- Efficient inference
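Below is a minimal NumPy sketch of 1-D convolution and pooling as described above: the same small computation is applied to a sliding patch, and pooling then summarizes regions of the output. The kernel width, pool size, and ReLU are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, kernel_w, pool_w = 12, 3, 2   # illustrative sizes

x = rng.standard_normal(seq_len)
w = rng.standard_normal(kernel_w)      # the shared "patch" weights
b = 0.0

def conv1d(x, w, b):
    # Slide the patch along the input; each output sees only local neighbors.
    return np.array([x[i:i + len(w)] @ w + b for i in range(len(x) - len(w) + 1)])

def max_pool(y, size):
    # Summarize non-overlapping regions by their maximum.
    return np.array([y[i:i + size].max() for i in range(0, len(y) - size + 1, size)])

y = np.maximum(0, conv1d(x, w, b))     # convolution followed by a ReLU
print(max_pool(y, pool_w).shape)       # (5,)
```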
Example
Which part of the class has the farthest-apart birthdays?
Compare and Contrast
Which of these exercises could we have solved using a different architecture? How?
Architectures:
- Self-Attention
- Recurrent (RNN, LSTM)
- Convolutional (CNN)
Exercises:
- Find another student whose birthday will be the same day of week as you.
- What is the latest-in-the-year birthday?
- Which part of the class has the farthest-apart birthdays?