Logistics
- Focus on projects
- Short weeks: Good Friday and Easter Monday
- Wednesday lab
- Midterm 2 on the last day of class
- Language Modeling
- Transformers
- Basics of generative modeling and reinforcement learning
Objectives
- Compare and contrast the main types of deep neural network models (Transformers, Convolutional Networks, and Recurrent Networks) in terms of how information flows through them
Deep Neural Net = stack of layers
Modular components, often connected sequentially
- MLP (Feed-forward)
- Self-Attention
- Convolution
- Pooling
- Recurrent (RNN, LSTM)
- Normalization (BatchNorm, LayerNorm)
- Dropout
Feed-Forward / MLP
- Universal function approximator
- But not necessarily efficient
- Fixed information flow within layer
\[f(x) = f_2(\operatorname{ReLU}(f_1(x)))\]
where \(f_1\) and \(f_2\) are both linear transformations (\(f_i(x) = x W_i + b_i\)) and \(\operatorname{ReLU}(x) = \max(0, x)\) elementwise.
(Other nonlinearities are sometimes used instead of ReLU)
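Below is a minimal NumPy sketch of the two-layer MLP defined above. The layer sizes, random initialization, and batch of inputs are illustrative assumptions, not part of the slide.
```python
import numpy as np

# Two-layer MLP: f(x) = f_2(ReLU(f_1(x))), with f_i(x) = x W_i + b_i.
rng = np.random.default_rng(0)
d_in, d_hidden, d_out = 4, 8, 3          # illustrative sizes

W1, b1 = rng.standard_normal((d_in, d_hidden)), np.zeros(d_hidden)
W2, b2 = rng.standard_normal((d_hidden, d_out)), np.zeros(d_out)

def relu(x):
    return np.maximum(0, x)              # elementwise max(0, x)

def mlp(x):
    h = relu(x @ W1 + b1)                # f_1, then the nonlinearity
    return h @ W2 + b2                   # f_2

x = rng.standard_normal((5, d_in))       # a batch of 5 inputs
print(mlp(x).shape)                      # (5, 3)
```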
Example
On what day of the week will your next birthday be?
- Input: birthday, for each student in the class
- Output: day of week, for each student in the class
Information flow between students?
Self-Attention: One Attention Head
- Information flow computed dynamically
- Each token \(i\) computes a query and a key: \(q_i = x_i W_Q\); \(k_i = x_i W_K\)
- Each query is compared with each key: \(S = Q K^T\) (i.e., compute dot product of a token’s query with each other token’s key)
- Compute softmax across each row: \(A = \operatorname{softmax}(S)\)
- When query matches key, info flows (“attends”) via value \(v_i = x_i W_V\): \(\text{out}_i = \sum_j A_{ij} v_j\)
- Masking: during training, need to ensure information flows only along valid paths (e.g., a token should not attend to future tokens it must predict)
- Parallel computation across all tokens
- No explicit representation of neighbors
- Anything can attend to anything
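Below is a minimal NumPy sketch of one attention head following the formulas above. The token count, dimensions, and the optional causal mask are illustrative assumptions; the common \(1/\sqrt{d_k}\) scaling of the scores is noted in a comment but omitted to match the slide's formulas.
```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)    # for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=False):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V        # queries, keys, values
    S = Q @ K.T                                # compare each query with each key
    # (In practice S is usually scaled by 1/sqrt(d_k); omitted here to match the slide.)
    if causal:
        # Masking: block attention to future tokens so only valid flows remain.
        n = S.shape[0]
        S = np.where(np.tril(np.ones((n, n), dtype=bool)), S, -np.inf)
    A = softmax(S, axis=-1)                    # softmax across each row
    return A @ V                               # out_i = sum_j A_ij v_j

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 6, 8, 4            # illustrative sizes
X = rng.standard_normal((n_tokens, d_model))
W_Q, W_K, W_V = (rng.standard_normal((d_model, d_head)) for _ in range(3))
print(attention_head(X, W_Q, W_K, W_V, causal=True).shape)  # (6, 4)
```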
Example
Find another student whose birthday will be the same day of week as you.
Count the number of other students who will have the same birthday as you.
What are the keys? queries? values?
How could we query the class to get the count of Monday birthdays?
In practice
- Multi-Head Attention (MHA)
- Several attention “heads” (8, 16, 48, …)
- Each head computes a query, key, and value.
- Sum the outputs of all the heads.
- Intersperse MHA and FFN layers
- FFN: local computation
- MHA: share information
- Residual connections: output = input + f(input)
- Gradients flow more easily -> many more layers possible
- Common trick for other architectures also (LSTM, ResNet CNNs)
- Layers need to share a common dimensionality
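Below is a minimal NumPy sketch of one Transformer block in the spirit of the bullets above: several attention heads whose outputs are summed, an FFN, and a residual connection around each sublayer. The head count, sizes, and initialization scale are illustrative assumptions, and normalization layers are omitted for brevity.
```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_head, n_heads, d_ffn, n_tokens = 16, 4, 4, 32, 6   # illustrative sizes

def softmax(s):
    s = s - s.max(axis=-1, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=-1, keepdims=True)

def head(X, W_Q, W_K, W_V, W_O):
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    A = softmax(Q @ K.T)
    return (A @ V) @ W_O                 # project back to d_model so heads can be summed

heads = [
    tuple(rng.standard_normal(shape) * 0.1
          for shape in [(d_model, d_head)] * 3 + [(d_head, d_model)])
    for _ in range(n_heads)
]
W1 = rng.standard_normal((d_model, d_ffn)) * 0.1
W2 = rng.standard_normal((d_ffn, d_model)) * 0.1

def transformer_block(X):
    # MHA: share information across tokens; sum the heads' outputs.
    X = X + sum(head(X, *w) for w in heads)        # residual: output = input + f(input)
    # FFN: local (per-token) computation.
    X = X + np.maximum(0, X @ W1) @ W2             # residual again
    return X                                       # same dimensionality in and out

X = rng.standard_normal((n_tokens, d_model))
print(transformer_block(X).shape)                  # (6, 16)
```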
Recurrent (RNN, LSTM)
- One step at a time
- Sequential
- Update a “hidden state”
- Efficient at inference time
- Difficult to learn long-range dependencies
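Below is a minimal NumPy sketch of a vanilla RNN matching the bullets above: process the sequence one step at a time, updating a hidden state. The sizes and the tanh update are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_hidden, seq_len = 3, 5, 7                 # illustrative sizes

W_xh = rng.standard_normal((d_in, d_hidden)) * 0.1
W_hh = rng.standard_normal((d_hidden, d_hidden)) * 0.1
b_h = np.zeros(d_hidden)

def rnn(xs):
    h = np.zeros(d_hidden)                        # initial hidden state
    for x_t in xs:                                # sequential: one step at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)  # update the hidden state
    return h                                      # summary of the whole sequence

xs = rng.standard_normal((seq_len, d_in))
print(rnn(xs).shape)                              # (5,)
```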
Example
What is the latest-in-the-year birthday?
Review
Which is the most natural architecture for each of the following tasks?
Architectures:
- Feed-Forward (MLP)
- Self-Attention
- Recurrent (RNN, LSTM)
- Convolutional (CNN) (not yet covered)
Exercises:
- On what day of the week will your next birthday be?
- Find another student whose birthday will be the same day of week as you.
- What is the latest-in-the-year birthday?
- Which part of the class has the farthest-apart birthdays?
Convolution and Pooling
- Feed-Forward Network on a patch; slide the patch around to compute many outputs
- Information flow fixed but local (neighbors)
- Parallel
- Summarize regions of the input
- Efficient inference
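Below is a minimal NumPy sketch of 1-D convolution and pooling as described above: the same small computation is applied to a sliding patch, and pooling then summarizes regions of the output. The kernel width, pool size, and ReLU are illustrative assumptions.
```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, kernel_w, pool_w = 12, 3, 2   # illustrative sizes

x = rng.standard_normal(seq_len)
w = rng.standard_normal(kernel_w)      # the shared "patch" weights
b = 0.0

def conv1d(x, w, b):
    # Slide the patch along the input; each output sees only local neighbors.
    return np.array([x[i:i + len(w)] @ w + b for i in range(len(x) - len(w) + 1)])

def max_pool(y, size):
    # Summarize non-overlapping regions by their maximum.
    return np.array([y[i:i + size].max() for i in range(0, len(y) - size + 1, size)])

y = np.maximum(0, conv1d(x, w, b))     # convolution followed by a ReLU
print(max_pool(y, pool_w).shape)       # (5,)
```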
Example
Which part of the class has the farthest-apart birthdays?
Compare and Contrast
Which of these exercises could we have solved using a different architecture? How?
Architectures:
- Self-Attention
- Recurrent (RNN, LSTM)
- Convolutional (CNN)
Exercises:
- Find another student whose birthday will be the same day of week as you.
- What is the latest-in-the-year birthday?
- Which part of the class has the farthest-apart birthdays?