Neural Architectures

Welcome

Attention - Perspectives

Why do you look at the speck of sawdust in your brother’s eye and pay no attention to the plank in your own eye?

Luke 6:41 (NIV); other translations contrast “look”/“see” vs “notice”/“consider”/“perceive”/“observe”

Other examples:

  • “You have seen many things, but you pay no attention; your ears are open, but you do not listen.” (Isaiah 42:20 NIV)
  • “My son, pay attention to what I say; turn your ear to my words.” (Proverbs 4:20 NIV)
  • “Daniel, who is one of the exiles from Judah, pays no attention to you, Your Majesty, or to the decree you put in writing” (Daniel 6:13 NIV)

Logistics

  • Reflections Week 2 cut-off is today
  • Homework 1: when should it be due?
  • Project Milestone: chat with me about your project by end of week
  • Quiz 1 on Friday
  • Readings this week: take some time to think them through

Review

Language Modeling: Learning to Mimic Language

Train a model P(token | context) to minimize cross-entropy loss on contexts from, well, all text ever.

Minimizing next-word surprisal is a powerful objective: models learn about:

  • Spelling
  • Common phrases (“one word at a ____”)
  • Subject-verb agreement
  • Rhyming (e.g., children’s books, poetry, song lyrics)
  • Summarizing, translating, sentiment classification, named-entity recognition…
  • Standard structures (e.g., the 5-paragraph essay)
  • Programming: JSON, HTML/JavaScript/Python, diagrams, bugs, vulnerabilities, errors
  • Viewpoints (liberal, conservative, conspiracy, propaganda, …)
  • And all stereotypes that can be expressed in writing
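
Concretely, the objective is just average next-token cross-entropy. A minimal PyTorch sketch, with random logits standing in for a real model's output (all sizes are toy placeholders, not from any actual model):

```python
import torch
import torch.nn.functional as F

# Toy sizes: 2 sequences of 16 token ids, vocabulary of 1000.
batch, seq_len, vocab = 2, 16, 1000
tokens = torch.randint(0, vocab, (batch, seq_len))

# Stand-in for a model's output: one score per vocabulary entry at every position.
logits = torch.randn(batch, seq_len, vocab, requires_grad=True)

# Score position t against token t+1: minimizing this cross-entropy is exactly
# minimizing average next-token surprisal.
loss = F.cross_entropy(
    logits[:, :-1].reshape(-1, vocab),  # predictions at positions 0..T-2
    tokens[:, 1:].reshape(-1),          # targets are the tokens that follow
)
loss.backward()  # gradients flow back to whatever produced the logits
```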

ML System for Language Modeling

  • The neural computer takes sequences of vectors and outputs vectors.
  • To use it for generating language, we need a “driver” program that:
    • Turns examples into sequences of numbers (tokenizer)
    • Runs the neural computer (the “model”) on that sequence
    • Interprets the output as a probability distribution over next tokens, samples one of them, and adds it onto the sequence so far

In general:

  • Classical computers: orchestration and control flow
  • Neural computers: parallel vector operations
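
A minimal sketch of that division of labor: the classical "driver" below handles tokenization, control flow, and sampling, while all the parallel vector math hides inside `model`. The `tokenizer.encode`/`decode` and `model(...)` calls are hypothetical stand-ins, not any particular library's API:

```python
import torch

def generate(model, tokenizer, prompt, max_new_tokens=20):
    """Classical 'driver' loop around the neural computer (hypothetical APIs)."""
    ids = tokenizer.encode(prompt)                    # text -> list of token ids
    for _ in range(max_new_tokens):
        x = torch.tensor([ids])                       # the sequence so far
        logits = model(x)                             # neural computer: (1, len, vocab)
        probs = torch.softmax(logits[0, -1], dim=-1)  # distribution over next token
        next_id = torch.multinomial(probs, 1).item()  # sample one token
        ids.append(next_id)                           # add it to the sequence, repeat
    return tokenizer.decode(ids)                      # token ids -> text
```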

Sequential and Neural Computer: Training

Sequential and Neural Computer: Inference

Objectives

  • Compare and contrast the main types of deep neural network models (Transformers, Convolutional Networks, and Recurrent Networks) in terms of how information flows through them

Deep Neural Net = stack of layers

Modular components, often connected sequentially

  • Linear transformation (“Dense”, “fully connected”)
  • Multiple linear layers: MLP / “Feed-forward”
  • Convolution and Pooling
  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Normalization (BatchNorm, LayerNorm)
  • Dropout

An Oversimplified History of Neural Architectures

Connectivity Structure

  • Fully Connected
    • Perceptron: single layer, or hidden layers (MLP)
  • Fixed Connections
    • Convolutional networks (CNN): local connections
    • Recurrent networks (RNN): remember what was seen before (temporally connected?)
  • Dynamic Connections: Transformer

Feed-Forward / MLP

  • Universal function approximator
    • But not necessarily efficient
  • Fixed information flow within layer

\[f(x) = f_2(\operatorname{ReLU}(f_1(x)))\]

where \(f_1\) and \(f_2\) are both linear transformations (\(f_i(x) = x W_i + b_i\)) and \(\operatorname{ReLU}(x) = \max(0, x)\) elementwise.

(Other nonlinearities are sometimes used instead of ReLU)
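
In code this is just two linear layers with an elementwise nonlinearity in between (PyTorch sketch; all dimensions are arbitrary placeholders):

```python
import torch
import torch.nn as nn

class MLP(nn.Module):
    """f(x) = f2(ReLU(f1(x))): a two-layer feed-forward network."""
    def __init__(self, d_in=512, d_hidden=2048, d_out=512):
        super().__init__()
        self.f1 = nn.Linear(d_in, d_hidden)   # x W1 + b1
        self.f2 = nn.Linear(d_hidden, d_out)  # (.) W2 + b2

    def forward(self, x):
        return self.f2(torch.relu(self.f1(x)))

y = MLP()(torch.randn(4, 512))  # 4 input vectors in, 4 output vectors out
```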

How to apply an MLP to a sequence

  • Option 1: concatenate the sequence into one giant vector
  • Option 2: apply the MLP to each element of the sequence independently

Pros and cons of each?
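
A sketch of both options (sizes are arbitrary placeholders); the comments note the main trade-off:

```python
import torch
import torch.nn as nn

seq = torch.randn(10, 64)  # a sequence of 10 tokens, each a 64-dim vector

# Option 1: concatenate into one giant vector.
# Every position can influence every other, but the network is locked to length 10.
mlp_concat = nn.Sequential(nn.Linear(10 * 64, 256), nn.ReLU(), nn.Linear(256, 10 * 64))
out1 = mlp_concat(seq.reshape(-1)).reshape(10, 64)

# Option 2: apply the same MLP to each token independently.
# Works for any length, but no information flows between positions.
mlp_per_token = nn.Sequential(nn.Linear(64, 256), nn.ReLU(), nn.Linear(256, 64))
out2 = mlp_per_token(seq)  # nn.Linear acts on the last dimension, token by token
```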

Example

What’s your birth month?

  • Input: birthday, for each student in the class
  • Output: month, for each student in the class

Information flow between students?

None needed.

Attention Example

Count how many other students were born in the same month as you.

What are the keys? queries? values?

Extensions:

  1. How could we query the class to get the count of birthdays in January?
  2. How could we find the centroid of where students with January birthdays are sitting?

Self-Attention: One Attention Head

  • Information flow computed dynamically
    • Each token \(i\) computes a query and a key: \(q_i = x_i W_Q\); \(k_i = x_i W_K\)
    • Each query is compared with every key: \(S = Q K^T\) (i.e., take the dot product of a token’s query with each token’s key)
    • Compute softmax across each row: \(A = \operatorname{softmax}(S)\)
    • When a query matches a key, information flows (“attends”) via the value \(v_i = x_i W_V\): \(\text{out}_i = \sum_j A_{ij} v_j\)
  • Masking: during training, need to ensure only valid flows (e.g., when modeling language left-to-right, a token must not attend to later tokens)
  • Parallel
    • No explicit representation of neighbors
    • Anything can attend to anything
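
A minimal numpy sketch of one head, following the notation above (no \(1/\sqrt{d_k}\) scaling, which real implementations typically add); the `causal` flag illustrates the masking point:

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def attention_head(X, W_Q, W_K, W_V, causal=False):
    """One self-attention head; rows of X are the token vectors x_i."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V   # q_i = x_i W_Q, k_i = x_i W_K, v_i = x_i W_V
    S = Q @ K.T                           # compare every query with every key
    if causal:                            # masking: token i may only attend to j <= i
        S = np.where(np.tril(np.ones_like(S)) == 1, S, -1e9)
    A = softmax(S, axis=-1)               # softmax across each row
    return A @ V                          # out_i = sum_j A_ij v_j

rng = np.random.default_rng(0)
n, d, d_k = 5, 8, 4                       # toy sizes
X = rng.normal(size=(n, d))
W_Q, W_K, W_V = (rng.normal(size=(d, d_k)) for _ in range(3))
out = attention_head(X, W_Q, W_K, W_V, causal=True)   # shape (n, d_k)
```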

In practice

  • Multi-Head Attention (MHA)
    • Several attention “heads” (8, 16, 48, …)
    • Each head computes a query, key, and value.
    • Sum the outputs of all the heads.
  • Intersperse MHA and FFN layers
    • FFN: local computation
    • MHA: share information
  • Residual connections: output = input + f(input)
    • Gradients flow more easily -> many more layers possible
    • Common trick for other architectures also (LSTM, ResNet CNNs)
    • Layers need to share a common dimensionality
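
Putting those pieces together, one block might look like the PyTorch sketch below (LayerNorm placement, dropout, and causal masking all vary across real implementations):

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """MHA then FFN, each wrapped in a residual connection: output = input + f(input)."""
    def __init__(self, d_model=512, n_heads=8):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.ReLU(),
                                 nn.Linear(4 * d_model, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):                 # x: (batch, seq_len, d_model)
        a, _ = self.mha(x, x, x)          # MHA: share information across positions
        x = self.norm1(x + a)             # residual keeps gradients flowing
        x = self.norm2(x + self.ffn(x))   # FFN: local computation at each position
        return x

y = Block()(torch.randn(2, 16, 512))      # stacking many Blocks gives a Transformer
```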

Recurrent (RNN, LSTM)

  • One step at a time
  • Sequential
  • Update a “hidden state”
  • Efficient at inference time
  • Difficult to learn long-range dependencies
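
A minimal numpy sketch of a plain (Elman) RNN; an LSTM adds gating but keeps the same one-step-at-a-time structure:

```python
import numpy as np

def rnn(X, W_xh, W_hh, b):
    """Process the sequence one step at a time, updating a hidden state h."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in X:                               # sequential: step t needs step t-1
        h = np.tanh(x_t @ W_xh + h @ W_hh + b)  # h summarizes everything seen so far
        states.append(h)
    return np.stack(states)

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                               # toy sizes
X = rng.normal(size=(5, d_in))                  # a 5-step input sequence
H = rnn(X, rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h))
```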

Recurrent Example

Whose birthday is latest in the year?

Could we compute this with a self-attention layer?

Review

Which is the most natural architecture for each of the following tasks?

Architectures:

  • Feed-Forward (MLP)
  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Convolutional (CNN)

Exercises:

  • On what day of the week will your next birthday be?
  • Find another student whose birthday will fall on the same day of the week as yours.
  • What is the latest-in-the-year birthday?
  • Which part of the class has the farthest-apart birthdays?

Convolution and Pooling

  • Feed-Forward Network on a patch; slide the patch around to compute many outputs
  • Information flow fixed but local (neighbors)
  • Parallel
  • Summarize regions of the input
  • Efficient inference
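
A minimal numpy sketch of a 1-D convolution over a token sequence, plus max-pooling to summarize regions (all sizes are arbitrary placeholders):

```python
import numpy as np

def conv1d(X, W, b):
    """Slide a small feed-forward 'patch' along the sequence; each output sees
    only a local window of neighbors, and every window is independent (parallel)."""
    n, d = X.shape
    k, _, d_out = W.shape                  # patch width k, output size d_out
    out = np.empty((n - k + 1, d_out))
    for i in range(n - k + 1):
        out[i] = X[i:i + k].reshape(-1) @ W.reshape(k * d, d_out) + b
    return out

def max_pool(Y, size=2):
    """Pooling: summarize each region by keeping its strongest response."""
    return np.stack([Y[i:i + size].max(axis=0)
                     for i in range(0, len(Y) - size + 1, size)])

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 8))                              # 10 tokens, 8-dim each
Y = conv1d(X, rng.normal(size=(3, 8, 16)), np.zeros(16))  # width-3 patches
Z = max_pool(Y)                                           # coarser summary of regions
```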

Example

Which part of the class has the farthest-apart birthdays?

Visualizations

Compare and Contrast

Which of these exercises could we have solved using a different architecture? How?

Architectures:

  • Self-Attention
  • Recurrent (RNN, LSTM)
  • Convolutional (CNN)

Exercises:

  • Count how many other students were born on the same weekday as you were.
  • What is the latest-in-the-year birthday?
  • Which part of the class has the farthest-apart birthdays?

Transformer Architecture

Attention as Routing

Computing Attention

  • Query: What do I want to know?
  • Key: What information is available?
  • Value: What is the answer?

Attention and MLP

Using Transformers for Language Modeling

Using Transformers for Translation