Neural Architectures

Welcome

Attention - Perspectives

Why do you look at the speck of sawdust in your brother’s eye and pay no attention to the plank in your own eye?

Luke 6:41 (NIV); other translations contrast “look”/“see” vs “notice”/“consider”/“perceive”/“observe”

Other examples:

  • “You have seen many things, but you pay no attention; your ears are open, but you do not listen.” (Isaiah 42:20 NIV)
  • “My son, pay attention to what I say; turn your ear to my words.” (Proverbs 4:20 NIV)
  • “Daniel, who is one of the exiles from Judah, pays no attention to you, Your Majesty, or to the decree you put in writing” (Daniel 6:13 NIV)

Logistics

  • Reflections Week 2 cut-off is today
  • Homework 1: when should it be due?
  • Project Milestone: chat with me about your project by end of week
  • Quiz 1 on Friday
  • Readings this week: take some time to think them through

Review

Language Modeling: Learning to Mimic Language

Train a model P(token | context) to minimize cross-entropy loss on contexts from, well, all text ever.

Minimizing next-word surprisal is a powerful objective: models learn about:

  • Spelling
  • Common phrases (“one word at a ____”)
  • Subject-verb agreement
  • Rhyming (e.g., children’s books, poetry, song lyrics)
  • Summarizing, translating, sentiment classification, named-entity recognition…
  • Standard structures (e.g., the 5-paragraph essay)
  • Programming: JSON, HTML/JavaScript/Python, diagrams, bugs, vulnerabilities, errors
  • Viewpoints (liberal, conservative, conspiracy, propaganda, …)
  • And all stereotypes that can be expressed in writing
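
Concretely, the objective is just average next-token cross-entropy. A minimal PyTorch sketch, with random logits standing in for a real model's output (all sizes are toy placeholders, not from any actual model):

```python
import torch
import torch.nn.functional as F

# Toy sizes: 2 sequences of 16 token ids, vocabulary of 1000.
batch, seq_len, vocab = 2, 16, 1000
tokens = torch.randint(0, vocab, (batch, seq_len))

# Stand-in for a model's output: one score per vocabulary entry at every position.
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

# Score position t against token t+1: minimizing this cross-entropy is exactly
# minimizing average next-token surprisal.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),          # targets are the tokens that follow
)
loss.backward()  # gradients flow back to whatever produced the logits
```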

ML System for Language Modeling

  • The neural computer takes sequences of vectors and outputs vectors.
  • To use it for generating language, we need a “driver” program that:
    • Turns examples into sequences of numbers (tokenizer)
    • Runs the neural computer (the “model”) on that sequence
    • Interprets the output as a probability distribution over next tokens, samples one of them, and adds it onto the sequence so far

In general:

  • Classical computers: orchestration and control flow
  • Neural computers: parallel vector operations
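
A minimal sketch of that division of labor: the classical "driver" below handles tokenization, control flow, and sampling, while all the parallel vector math hides inside `model`. The `tokenizer.encode`/`decode` and `model(...)` calls are hypothetical stand-ins, not any particular library's API:

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Classical 'driver' loop around the neural computer (hypothetical APIs)."""
    ids = tokenizer.encode(prompt)                    # text -> list of token ids
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                       # the sequence so far
        logits = model(x)                             # neural computer: (1, len, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over next token
        next_id = torch.multinomial(probs, 1).item()  # sample one token
        ids.append(next_id)                           # add it to the sequence, repeat
    return tokenizer.decode(ids)                      # token ids -> text
```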

Sequential and Neural Computer: Training

Sequential and Neural Computer: Inference

Objectives

  • Compare and contrast the main types of deep neural network models (Transformers, Convolutional Networks, and Recurrent Networks) in terms of how information flows through them

Deep Neural Net = stack of layers

Modular components, often connected sequentially

  • Linear transformation (“Dense”, “fully connected”)
  • Multiple linear layers: MLP / “Feed-forward”
  • Convolution and Pooling
  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Normalization (BatchNorm, LayerNorm)
  • Dropout

An Oversimplified History of Neural Architectures

Connectivity Structure

  • Fully Connected
    • Perceptron: single layer, or hidden layers (MLP)
  • Fixed Connections
    • Convolutional networks (CNN): local connections
    • Recurrent networks (RNN): remember what was seen before (temporally connected?)
  • Dynamic Connections: Transformer

Feed-Forward / MLP

  • Universal function approximator
    • But not necessarily efficient
  • Fixed information flow within layer

\[f(x) = f_2(\operatorname{ReLU}(f_1(x)))\]

where \(f_1\) and \(f_2\) are both linear transformations (\(f_i(x) = x W_i + b_i\)) and \(\operatorname{ReLU}(x) = \max(0, x)\) elementwise.

(Other nonlinearities are sometimes used instead of ReLU)
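
In code this is just two linear layers with an elementwise nonlinearity in between (PyTorch sketch; all dimensions are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """f(x) = f2(ReLU(f1(x))): a two-layer feed-forward network."""
    def __init__(self, d_in=512, d_hidden=2048, d_out=512):
        super().__init__()
        self.f1 = nn.Linear(d_in, d_hidden)   # x W1 + b1
        self.f2 = nn.Linear(d_hidden, d_out)  # (.) W2 + b2

    def forward(self, x):
        return self.f2(torch.relu(self.f1(x)))

y = MLP()(torch.randn(4, 512))  # 4 input vectors in, 4 output vectors out
```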

How to apply an MLP to a sequence

  • Option 1: concatenate the sequence into one giant vector
  • Option 2: apply the MLP to each element of the sequence independently

Pros and cons of each?
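
A sketch of both options (sizes are arbitrary placeholders); the comments note the main trade-off:

```python
import torch
import torch.nn as nn

seq = torch.randn(10, 64)  # a sequence of 10 tokens, each a 64-dim vector

# Option 1: concatenate into one giant vector.
# Every position can influence every other, but the network is locked to length 10.
mlp_concat = nn.Sequential(nn.Linear(10 * 64, 256), nn.ReLU(), nn.Linear(256, 10 * 64))
out1 = mlp_concat(seq.reshape(-1)).reshape(10, 64)

# Option 2: apply the same MLP to each token independently.
# Works for any length, but no information flows between positions.
mlp_per_token = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
out2 = mlp_per_token(seq)  # nn.Linear acts on the last dimension, token by token
```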

Example

What’s your birth month?

  • Input: birthday, for each student in the class
  • Output: month, for each student in the class

Information flow between students?

None needed.

Attention Example

Count how many other students were born in the same month as you.

What are the keys? queries? values?

Extensions:

  1. How could we query the class to get the count of birthdays in January?
  2. How could we find the centroid of where students with January birthdays are sitting?

Self-Attention: One Attention Head

  • Information flow computed dynamically
    • Each token \(i\) computes a query and a key: \(q_i = x_i W_Q\); \(k_i = x_i W_K\)
    • Each query is compared with every key: \(S = Q K^T\) (i.e., take the dot product of a token’s query with each token’s key)
    • Compute softmax across each row: \(A = \operatorname{softmax}(S)\)
    • When a query matches a key, information flows (“attends”) via the value \(v_i = x_i W_V\): \(\text{out}_i = \sum_j A_{ij} v_j\)
  • Masking: during training, need to ensure only valid flows (e.g., when modeling language left-to-right, a token must not attend to later tokens)
  • Parallel
    • No explicit representation of neighbors
    • Anything can attend to anything
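
A minimal numpy sketch of one head, following the notation above (no \(1/\sqrt{d_k}\) scaling, which real implementations typically add); the `causal` flag illustrates the masking point:

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=False):
    """One self-attention head; rows of X are the token vectors x_i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # q_i = x_i W_Q, k_i = x_i W_K, v_i = x_i W_V
    S = Q @ K.T                           # compare every query with every key
    if causal:                            # masking: token i may only attend to j <= i
        S = np.where(np.tril(np.ones_like(S)) == 1, S, -1e9)
    A = softmax(S, axis=-1)               # softmax across each row
    return A @ V                          # out_i = sum_j A_ij v_j

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                       # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = attention_head(X, W_Q, W_K, W_V, causal=True)   # shape (n, d_k)
```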

In practice

  • Multi-Head Attention (MHA)
    • Several attention “heads” (8, 16, 48, …)
    • Each head computes a query, key, and value.
    • Sum the outputs of all the heads.
  • Intersperse MHA and FFN layers
    • FFN: local computation
    • MHA: share information
  • Residual connections: output = input + f(input)
    • Gradients flow more easily -> many more layers possible
    • Common trick for other architectures also (LSTM, ResNet CNNs)
    • Layers need to share a common dimensionality
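
Putting those pieces together, one block might look like the PyTorch sketch below (LayerNorm placement, dropout, and causal masking all vary across real implementations):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """MHA then FFN, each wrapped in a residual connection: output = input + f(input)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        a, _ = self.mha(x, x, x)          # MHA: share information across positions
        x = self.norm1(x + a)             # residual keeps gradients flowing
        x = self.norm2(x + self.ffn(x))   # FFN: local computation at each position
        return x

y = Block()(torch.randn(2, 16, 512))      # stacking many Blocks gives a Transformer
```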

Recurrent (RNN, LSTM)

  • One step at a time
  • Sequential
  • Update a “hidden state”
  • Efficient at inference time
  • Difficult to learn long-range dependencies
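
A minimal numpy sketch of a plain (Elman) RNN; an LSTM adds gating but keeps the same one-step-at-a-time structure:

```python
import numpy as np

def rnn(X, W_xh, W_hh, b):
    """Process the sequence one step at a time, updating a hidden state h."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:                               # sequential: step t needs step t-1
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)  # h summarizes everything seen so far
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                               # toy sizes
X = rng.normal(size=(5, d_in))                  # a 5-step input sequence
H = rnn(X, rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
```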

Recurrent Example

Whose birthday is latest in the year?

Could we compute this with a self-attention layer?

Review

Which is the most natural architecture for each of the following tasks?

Architectures:

  • Feed-Forward (MLP)
  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Convolutional (CNN)

Exercises:

  • On what day of the week will your next birthday be?
  • Find another student whose birthday will fall on the same day of the week as yours.
  • What is the latest-in-the-year birthday?
  • Which part of the class has the farthest-apart birthdays?

Convolution and Pooling

  • Feed-Forward Network on a patch; slide the patch around to compute many outputs
  • Information flow fixed but local (neighbors)
  • Parallel
  • Summarize regions of the input
  • Efficient inference
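
A minimal numpy sketch of a 1-D convolution over a token sequence, plus max-pooling to summarize regions (all sizes are arbitrary placeholders):

```python
import numpy as np

def conv1d(X, W, b):
    """Slide a small feed-forward 'patch' along the sequence; each output sees
    only a local window of neighbors, and every window is independent (parallel)."""
    n, d = X.shape
    k, _, d_out = W.shape                  # patch width k, output size d_out
    out = np.empty((n - k + 1, d_out))
    for i in range(n - k + 1):
        out[i] = X[i:i + k].reshape(-1) @ W.reshape(k * d, d_out) + b
    return out

def max_pool(Y, size=2):
    """Pooling: summarize each region by keeping its strongest response."""
    return np.stack([Y[i:i + size].max(axis=0)
                     for i in range(0, len(Y) - size + 1, size)])

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                              # 10 tokens, 8-dim each
Y = conv1d(X, rng.normal(size=(3, 8, 16)), np.zeros(16))  # width-3 patches
Z = max_pool(Y)                                           # coarser summary of regions
```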

Example

Which part of the class has the farthest-apart birthdays?

Visualizations

Compare and Contrast

Which of these exercises could we have solved using a different architecture? How?

Architectures:

  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Convolutional (CNN)

Exercises:

  • Count how many other students were born on the same weekday as you were.
  • What is the latest-in-the-year birthday?
  • Which part of the class has the farthest-apart birthdays?

Transformer Architecture

Attention as Routing

Computing Attention

  • Query: What do I want to know?
  • Key: What information is available?
  • Value: What is the answer?

Attention and MLP

Using Transformers for Language Modeling

Using Transformers for Translation