Neural Models

Ken Arnold

Monday

Reflection: Rich Data

Genesis 1

  • Ctrl-F for “saw”
  • We are surrounded by rich data.

Logistics

  • Discussion 2
  • Quiz this Friday (delayed from last Friday) and also next Friday (as normal)

Lab Review

  1. Which of these plots of training loss is happiest? Why?

  2. On the last step, we observed that the fitted model was different for MAE vs MSE. To get a different line, which had to change? (1) the computation of the loss, (2) the computation of the gradient, (3) both, (4) neither or something else.

  3. If you changed how the predictions were computed, would you need to change how the loss function gradient is computed?

Takeaways

  1. Training curves should look like the one on the left:
    • Often we train until convergence, i.e., loss stops going down (won’t be zero, might be noisy)
    • The second one diverged (in this case, the gradient computation was incorrect, but you can also see this with too high learning rate, poor initialization, etc.)
    • The third one was training too slowly (learning rate was too low); hadn’t yet converged.
  2. Gradients are the source of all learning in neural networks. What mattered wasn’t how the loss function was computed, but how its gradient was computed.
  3. Backpropagation is nicely modular: all computation happens locally, with only a small amount of communication (the gradients) between steps of the computation. It lets us break down the computation into small, automatable steps.

(Stochastic) Gradient Descent algorithm

  1. Get some data
  2. Forward pass: compute model’s predictions
  3. Loss computation: compare predictions to targets
  4. Backward pass: compute gradient of loss with respect to model parameters
  5. Update: adjust model parameters in a direction that reduces loss (“optimization”)
  • Gradient descent: do this on the whole dataset
  • Stochastic Gradient Descent (SGD): do this on subsets (mini-batches) of the dataset
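
The five steps above can be sketched in NumPy for a linear model with MSE loss. This is a minimal illustration, not the lab code: the synthetic data, learning_rate, batch_size, and number of epochs are all assumptions, and the gradient is written out by hand.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1. Get some data (synthetic: y = 2*x + 1 plus a little noise; an assumption for illustration)
x = rng.uniform(-1, 1, size=(200, 1))
y = 2.0 * x + 1.0 + 0.01 * rng.normal(size=(200, 1))

w = np.zeros((1, 1))   # model parameters
b = np.zeros(1)
learning_rate = 0.1    # illustrative value
batch_size = 20        # illustrative value

for epoch in range(100):
    order = rng.permutation(len(x))              # shuffle so each mini-batch is a random subset
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = x[idx], y[idx]

        y_pred = xb @ w + b                      # 2. forward pass
        error = y_pred - yb
        loss = (error ** 2).mean()               # 3. loss computation (MSE)
        grad_w = 2 * xb.T @ error / len(xb)      # 4. backward pass (gradient of MSE w.r.t. w)
        grad_b = 2 * error.mean(axis=0)          #    (gradient of MSE w.r.t. b)
        w -= learning_rate * grad_w              # 5. update
        b -= learning_rate * grad_b

print(w.ravel(), b)  # should end up close to the true slope 2 and intercept 1
```

Setting batch_size to len(x) turns this into plain gradient descent on the whole dataset.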

From Linear Regression to Neural Networks

CS 375:

  • Nonlinear transformations (ReLU, etc.)
  • Extend to classification (softmax, cross-entropy loss)
  • More layers (“deep learning”)

CS 376:

  • Handle structure (locality of pixels in images, etc.)
  • Flexible data flow (attention, recurrences, etc.)

Libraries Help Us

PyTorch

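import torch
import torch.nn as nn

model = nn.Linear(3, 1)  # 3 input features -> 1 output
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # learning rate; value chosen for illustration
# ...
optimizer.zero_grad()      # clear gradients left over from the previous step
y_pred = model(x)          # forward pass
loss = loss_fn(y_pred, y)  # loss computation
loss.backward()            # backward pass
optimizer.step()           # update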

TensorFlow (low-level)

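import tensorflow as tf
import keras

model = keras.layers.Dense(1, input_shape=(3,))  # 3 input features -> 1 output
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
# ...
with tf.GradientTape() as tape:  # record the forward pass so it can be differentiated
    y_pred = model(x)            # forward pass
    loss = loss_fn(y, y_pred)    # loss computation; Keras losses take (y_true, y_pred)
grads = tape.gradient(loss, model.trainable_weights)            # backward pass
optimizer.apply_gradients(zip(grads, model.trainable_weights))  # update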

Automatic Differentiation

  • Programmer computes the forward pass
  • Library computes the backward pass (gradients) using the backpropagation algorithm
    • Start at the end (loss)
    • Work backwards through the computation graph
    • Use the chain rule to compute gradients

Upshot: we can differentiate programs.
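
To make the backward pass concrete, here is a hand-worked sketch of what a library automates, for the tiny program loss = (w*x + b - y)**2. The numbers are made up, and the finite-difference check at the end is only a sanity test, not how autodiff libraries compute gradients.

```python
# Forward pass: record each intermediate value of loss = (w*x + b - y)**2
x, y = 3.0, 10.0          # one data point (illustrative values)
w, b = 2.0, 1.0           # current parameters (illustrative values)

pred = w * x + b          # 7.0
error = pred - y          # -3.0
loss = error ** 2         # 9.0

# Backward pass: start at the loss and apply the chain rule one step at a time
d_loss = 1.0
d_error = d_loss * 2 * error      # d(error**2)/d(error) = 2*error
d_pred = d_error * 1.0            # error = pred - y
d_w = d_pred * x                  # pred = w*x + b
d_b = d_pred * 1.0

# Sanity check against a finite-difference estimate
def loss_at(w, b):
    return (w * x + b - y) ** 2

eps = 1e-6
numeric_d_w = (loss_at(w + eps, b) - loss_at(w - eps, b)) / (2 * eps)
numeric_d_b = (loss_at(w, b + eps) - loss_at(w, b - eps)) / (2 * eps)
print(d_w, numeric_d_w)   # both close to -18
print(d_b, numeric_d_b)   # both close to -6
```

Each backward line uses only its own local derivative and the gradient handed to it from the step after it.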

Multiple Inputs and Outputs

  • y = x1*w1 + x2*w2 + x3*w3 + b
  • Or: y = x @ W + b

Matmul (@) so we can process every example of x at once:

  • x is 100 samples, each with 3 features (x.shape is (100, 3))
  • W holds one weight per (feature, output) pair, mapping 3 features to 4 outputs (W.shape is (3, 4))
  • Then x @ W gives 100 samples, each with 4 outputs ((100, 4))
  • Think: what is b’s shape?
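
A quick NumPy check of these shapes, using random stand-in data. It includes one possible answer to the question about b: the shape (4,) is an assumption that relies on NumPy broadcasting one bias per output across all rows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))   # 100 samples, 3 features each
W = rng.normal(size=(3, 4))     # one weight per (feature, output) pair
b = np.zeros(4)                 # one bias per output (assumed shape; broadcasts across the 100 rows)

y = x @ W + b
print(y.shape)  # (100, 4)

# One entry of the matmul matches the written-out sum for a single sample and a single output
manual = x[0, 0] * W[0, 0] + x[0, 1] * W[1, 0] + x[0, 2] * W[2, 0] + b[0]
print(np.isclose(y[0, 0], manual))  # True
```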

ReLU

Chop off the negative part of its input.

y = max(0, x)

(Gradient is 1 for positive inputs, 0 for negative inputs)
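
A minimal NumPy sketch of ReLU and its gradient. The function names relu and relu_grad are illustrative, and returning 0 for the gradient at exactly x == 0 is a convention chosen here.

```python
import numpy as np

def relu(x):
    # Chop off the negative part: elementwise max(0, x)
    return np.maximum(0, x)

def relu_grad(x):
    # 1 where the input is positive, 0 elsewhere (0 at exactly x == 0 is a convention)
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x).tolist())       # [0.0, 0.0, 0.0, 0.5, 2.0]
print(relu_grad(x).tolist())  # [0.0, 0.0, 0.0, 1.0, 1.0]
```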

Why is ReLU Useful?

In 2D

Interactive Activity

ReLU interactive (name: u04n00-relu.ipynb)

Wednesday

Warm-Up: Classifier Accuracy

Suppose your Homework 1 validation set had 2 images and it got both correct.

  1. What accuracy number would you have reported?
  2. If your classifier was a (fair) coin flip, how likely would it be to get both correct?
  3. If your classifier’s true accuracy was 85%, how likely would it be to get both correct?
  4. If your classifier’s true accuracy was 100%, how likely would it be to get both correct?

How would that have changed if it got one right and one wrong?

Warm-Up Solutions

  • Both correct: reported accuracy is 2 / 2 = 100%
  • One right, one wrong: reported accuracy is 1 / 2 = 50%

Outcome | Probability if 50%  | Probability if 85%    | Probability if 100%
✅✅    | 0.5 * 0.5 = 0.25    | 0.85 * 0.85 = 0.7225  | 1.0 * 1.0 = 1.0
✅❌    | 0.5 * 0.5 = 0.25    | 0.85 * 0.15 = 0.1275  | 1.0 * 0.0 = 0.0
❌✅    | 0.5 * 0.5 = 0.25    | 0.15 * 0.85 = 0.1275  | 0.0 * 1.0 = 0.0
❌❌    | 0.5 * 0.5 = 0.25    | 0.15 * 0.15 = 0.0225  | 0.0 * 0.0 = 0.0
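
The table can be reproduced with a few lines of Python. This is a sketch that assumes the two predictions are independent; outcome_probabilities is an illustrative helper name.

```python
from itertools import product

def outcome_probabilities(accuracy, n_images=2):
    # Probability of each right/wrong pattern, assuming independent predictions
    probs = {}
    for outcome in product([True, False], repeat=n_images):
        p = 1.0
        for correct in outcome:
            p *= accuracy if correct else (1 - accuracy)
        probs[outcome] = p
    return probs

for accuracy in [0.5, 0.85, 1.0]:
    probs = outcome_probabilities(accuracy)
    both_right = probs[(True, True)]
    one_of_each = probs[(True, False)] + probs[(False, True)]
    print(accuracy, round(both_right, 4), round(one_of_each, 4))
```

Even a coin-flip classifier gets both images right a quarter of the time, so 2 out of 2 says little about true accuracy.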

Objectives for Today

  • Set up a neural network to do classification
  • Describe and compute categorical cross-entropy loss
  • Explain the purpose and mathematical properties of the softmax operation.

Think…

Can we use accuracy as a loss function for a classifier? Why or why not?

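No, because its derivative is almost always 0: a small change to the weights usually flips no predictions, so accuracy gives gradient descent no direction to follow.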
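
A small sketch of this point, using a made-up 1-D dataset and a one-parameter sigmoid classifier (both assumptions for illustration). Nudging the weight leaves accuracy unchanged, while cross-entropy, introduced later in this lecture, still moves.

```python
import math

# Tiny 1-D dataset (made up for illustration): label is 1 when x is positive
xs = [-2.0, -1.0, 1.0, 2.0]
labels = [0, 0, 1, 1]

def prob_of_one(w, x):
    return 1 / (1 + math.exp(-w * x))   # sigmoid of the score w*x

def accuracy(w):
    correct = sum((prob_of_one(w, x) > 0.5) == (label == 1) for x, label in zip(xs, labels))
    return correct / len(xs)

def cross_entropy(w):
    total = 0.0
    for x, label in zip(xs, labels):
        p = prob_of_one(w, x)
        total += -math.log(p if label == 1 else 1 - p)
    return total / len(xs)

w, eps = 0.5, 1e-3
print(accuracy(w + eps) - accuracy(w))            # 0.0: a small nudge flips no predictions
print(cross_entropy(w + eps) - cross_entropy(w))  # negative: the loss still registers the improvement
```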

Today: Classification

  • How do we measure how good a classifier is? -> categorical cross-entropy loss
  • Cross-entropy depends on probabilities, but the model gives us scores. How do we turn scores into probabilities? -> softmax

Intuition: Predicting the outcome of a game

  • Suppose you play chess against grandmaster Garry Kasparov. Who wins?
  • Suppose you play someone with roughly equal skill. Who wins?

Good predictions give meaningful probabilities

  • How surprised would you be if you played Garry Kasparov and he won?
  • If you won?
  • Intuition: surprise

What if you played 5 times? What’s the total surprise?

Use surprise to compare two models

Suppose A and B are playing chess. Model M gives them equal odds (50-50), Model Q gives A an 80% win chance.

Player | Model M win prob | Model Q win prob
A      | 50%              | 80%
B      | 50%              | 20%

Now we let them play 5 games, and A wins each time. (data = AAAAA)

What is P(data) for each model?

  • Model M: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = (0.5)^5 = 0.03125
  • Model Q: 0.8 * 0.8 * 0.8 * 0.8 * 0.8 = (0.8)^5 = 0.32768

Which model was better able to predict the outcome?

Likelihood

Likelihood: probability that a model assigns to the data. (The P(AAAAA) we just computed.)

Assumption: data points are independent and identically distributed (i.i.d.), so order doesn’t matter. So P(AAAAA) = P(A) * P(A) * P(A) * P(A) * P(A)
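
The likelihood computation for Models M and Q can be written as a short function. This is a sketch under the independence assumption stated above; likelihood and p_a_wins are illustrative names.

```python
# Each model's probability that A wins a single game
p_a_wins = {"M": 0.5, "Q": 0.8}
data = "AAAAA"   # A won all five games

def likelihood(p_a, games):
    # Multiply per-game probabilities, assuming games are independent
    result = 1.0
    for winner in games:
        result *= p_a if winner == "A" else (1 - p_a)
    return result

for name, p_a in p_a_wins.items():
    print(name, likelihood(p_a, data))   # M: 0.03125, Q: about 0.32768
```

The same function handles mixed results: likelihood(0.8, "AABAA") multiplies in 0.2 for the game that B won.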

Log Likelihood

  • Likelihood numbers can quickly get too small to represent accurately.
  • Computational trick: take the logarithm.
    • log2(.5) = -1 because 2^(-1) = 0.5
    • log2((0.5)^5) = 5 * log2(0.5) = 5 * -1 = -5
    • log of a product = sum of logs

Log likelihood of data for a model:

  1. Compute the model’s probability for each data point
  2. Take the log of each probability
  3. Sum the logs
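
The three steps, sketched in Python with base-2 logs to match the examples above. The last two lines illustrate why the trick matters: the 2000-game scenario is made up to show a product underflowing.

```python
import math

def log_likelihood(probs_of_observed):
    # Steps 2 and 3: take the log of each probability, then sum
    return sum(math.log2(p) for p in probs_of_observed)

# Step 1 for data = AAAAA: each model's probability of the observed winner, game by game
print(log_likelihood([0.5] * 5))   # -5.0
print(log_likelihood([0.8] * 5))   # about -1.61

# A long product underflows to 0.0, while the sum of logs stays representable
print(0.5 ** 2000)                   # 0.0
print(log_likelihood([0.5] * 2000))  # -2000.0
```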

Cross-Entropy Loss

  • Negative of the log likelihood (“NLL”), averaged over data points
  • Intuition: average surprise
    • A good regression model predicts values near the right answer.
    • A good classifier should give high probability to the correct result.
  • Cross-entropy loss = average surprise.

Technical note: minimizing MSE loss is equivalent to minimizing cross-entropy if you model the data as Gaussian around the prediction (with fixed variance).

For technical details, see Goodfellow et al., Deep Learning Book chapters 3 (info theory background) and 5 (application to loss functions).

Categorical Cross-Entropy

Cross-entropy when the data is categorical (i.e., a classification problem).

Definition: Average of negative log of probability of the correct class.

  • Model M: Gave prob of 0.5 to the correct answer. Cross-entropy loss = -log2(0.5) = 1 bit
  • Model Q: Gave prob of 0.8 to the correct answer. Cross-entropy loss = -log2(0.8) = 0.3219 bits

(Usually use natural log, so units are nats.)
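
A sketch of the definition applied to the five-game chess example, in both bits and nats. The cross_entropy helper and its log parameter are illustrative, not a library API.

```python
import math

def cross_entropy(probs_of_correct_class, log=math.log):
    # Average negative log probability assigned to the correct class
    return sum(-log(p) for p in probs_of_correct_class) / len(probs_of_correct_class)

model_m = [0.5] * 5   # probability each model gave to the actual winner in each of 5 games
model_q = [0.8] * 5

print(cross_entropy(model_m, log=math.log2))  # 1.0 bit
print(cross_entropy(model_q, log=math.log2))  # about 0.3219 bits
print(cross_entropy(model_m))                 # about 0.6931 nats
print(cross_entropy(model_q))                 # about 0.2231 nats
```

Bits and nats differ only by a constant factor of ln(2), so the choice of base does not change which model looks better.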

Math aside: Cross-Entropy

  • A general concept: comparing two distributions.
  • Most common use: classification.
    • Classifier outputs a probability distribution over classes.
    • Categorical cross-entropy measures the mismatch between that distribution and the “true” distribution (loosely a distance, though it is not symmetric).
    • Estimate the true distribution using a 1-hot vector with 1 in the correct class and 0 elsewhere.
  • But it applies to any two distributions.
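
A sketch of the general form, H(p, q) = -sum_i p_i * log(q_i), with a one-hot target as the special case used in classification. The probability vectors below are made-up numbers.

```python
import math

def cross_entropy(true_dist, predicted_dist):
    # H(p, q) = -sum_i p_i * log(q_i); terms with p_i == 0 contribute nothing
    return -sum(p * math.log(q) for p, q in zip(true_dist, predicted_dist) if p > 0)

predicted = [0.1, 0.7, 0.2]    # classifier's distribution over 3 classes (made-up numbers)
one_hot = [0.0, 1.0, 0.0]      # "true" distribution: class 1 is correct

# With a one-hot target, only the correct class's term survives
print(cross_entropy(one_hot, predicted), -math.log(0.7))

# The same function accepts any target distribution, and swapping the arguments changes the result
soft_target = [0.2, 0.6, 0.2]
print(cross_entropy(soft_target, predicted), cross_entropy(predicted, soft_target))
```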

How do we turn scores into probabilities?

Intuition: Elo

A measure of relative skill:

  • Higher Elo more likely to win
  • Greater point spread -> more confidence in win

Formal definition:

Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))

EloDiff = A Elo - B Elo

From scores to probabilities

Suppose we have 3 chess players:

Player | Elo
A      | 1000
B      | 2200
C      | 1010

A and B play. Who wins?

A and C play. Who wins?
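
Plugging the three players into the Elo formula gives a probability for each matchup. This is a direct transcription of the formula; elo_win_probability is an illustrative function name.

```python
def elo_win_probability(elo_a, elo_b):
    # Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400)), with EloDiff = A Elo - B Elo
    elo_diff = elo_a - elo_b
    return 1 / (1 + 10 ** (-elo_diff / 400))

elo = {"A": 1000, "B": 2200, "C": 1010}

print(elo_win_probability(elo["A"], elo["B"]))  # about 0.001: B is the overwhelming favorite
print(elo_win_probability(elo["A"], elo["C"]))  # about 0.486: close to a coin flip
```

The two directions of a matchup always sum to 1, so Pr(B beats A) is about 0.999.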

Softmax

See nfelo

  1. Pick a pair of teams from the list.
  2. Compute the difference in their Elo ratings.
  3. Compute the probability that the higher-rated team wins.
  4. Repeat for another pair of teams.

Elo probability formula:

Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))

Softmax

  1. Start with scores (use variable name logits), which can be any numbers (positive, negative, whatever)
  2. Make them all positive by exponentiating:
    • xx = exp(logits) (logits.exp() in PyTorch)
    • Another base works too, e.g., 10 ** logits (the Elo formula uses base 10)
  3. Make them sum to 1: probs = xx / xx.sum()
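
The softmax recipe in NumPy, followed by a check that the base-10 variant reproduces the Elo formula. Dividing the Elo ratings by 400 to get logits is an assumption made here to match that formula's scaling; the example logits are made up.

```python
import numpy as np

def softmax(logits):
    logits = np.asarray(logits, dtype=float)
    xx = np.exp(logits)      # make every score positive
    return xx / xx.sum()     # make them sum to 1

probs = softmax([2.0, 0.0, -1.0])
print(probs, probs.sum())

# Base 10 works the same way: with logits = Elo / 400 this reproduces the Elo formula
elo = np.array([1000.0, 1010.0])
xx = 10 ** (elo / 400)
elo_probs = xx / xx.sum()
print(elo_probs)             # about [0.486, 0.514]
```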

Some properties of softmax

  • Sums to 1 (by construction)
  • The largest input logit gets the largest output probability
  • logits + constant doesn’t change output.
  • logits * constant does change output.
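
Each property can be checked numerically. This sketch uses made-up logits, and the softmax here subtracts the maximum before exponentiating, a common numerical-stability habit that is itself justified by the add-a-constant property.

```python
import numpy as np

def softmax(logits):
    shifted = logits - logits.max()   # allowed because adding a constant doesn't change the output
    xx = np.exp(shifted)
    return xx / xx.sum()

logits = np.array([1.0, 3.0, 2.0])
probs = softmax(logits)

print(np.isclose(probs.sum(), 1.0))               # True: sums to 1
print(probs.argmax() == logits.argmax())          # True: largest logit gets the largest probability
print(np.allclose(softmax(logits + 100), probs))  # True: adding a constant gives the same output
print(np.allclose(softmax(logits * 10), probs))   # False: multiplying by a constant changes the output
```

Multiplying by a constant larger than 1 sharpens the distribution toward the largest logit; a constant between 0 and 1 flattens it.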

Sigmoid

Special case of softmax when you just have one score (binary classification): use logits = [score, 0.0]
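
A numeric check of this special case using only the standard library. The scores are arbitrary, and the two helper functions are written out here rather than taken from a library.

```python
import math

def sigmoid(score):
    return 1 / (1 + math.exp(-score))

def softmax(logits):
    exps = [math.exp(v) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

for score in [-2.0, 0.0, 3.0]:
    # The first softmax output equals the sigmoid of the single score
    print(sigmoid(score), softmax([score, 0.0])[0])
```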

Where do the “right” scores come from?

  • In linear regression we were given the right scores.
  • In classification, we have to learn the scores from data.

Review

Which of the following is a good loss function for classification?

  1. Mean squared error
  2. Softmax (generalization of sigmoid to multiple categories)
  3. Error rate (number of answers that were incorrect)
  4. Average of the probability the classifier assigned to the wrong answer
  5. Average of the negative of the log of the probability the classifier assigned to the right answer

Why?

A Story of A Classifier

Images as Input

  • Each image is a 3D array of numbers (color channel, height, width) with rich structure: adjacent pixels are related, color channels are related, etc.
  • Eventually we’ll handle this, but to simplify for now:
    • Ignore color information: use grayscale images: 28x28 array of numbers from 0 to 255
    • Ignore spatial relationships: flatten to 1D array of 784 numbers
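
The two simplifications can be sketched in NumPy on random stand-in data. Averaging the color channels is just one simple way to get grayscale and is an assumption here, not necessarily what the Chapter 2 notebook does.

```python
import numpy as np

rng = np.random.default_rng(0)
# Stand-in for a batch of 5 color images, shaped (batch, channel, height, width); random pixels, not real data
color_images = rng.integers(0, 256, size=(5, 3, 28, 28))

# Ignore color: average the channels (one simple grayscale conversion; an assumption)
gray = color_images.mean(axis=1)           # shape (5, 28, 28)

# Ignore spatial relationships: flatten each 28x28 image into 784 numbers
flat = gray.reshape(len(gray), -1)         # shape (5, 784)
print(gray.shape, flat.shape)
```

After flattening, pixel (row, col) of an image lands at index row * 28 + col, so vertically adjacent pixels end up 28 positions apart.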

Work through how this is done in the Chapter 2 notebook

Classification the Wrong Way

  1. Compute y = x @ W + b to get a single number y for each image.
  2. Compute the MSE loss between y and the desired number (0-9).
  3. Optimize the weights to minimize this loss.

Discuss with neighbors:

  1. Suppose you’re trying to classify an image of a 4. What might a prediction from this model look like? Compute the loss.
  2. What’s wrong with this approach? What might you do instead?

Fix 1: Multiple Outputs

We want a score for each digit (0-9), so we need 10 outputs.

Each target output is 0 or 1 (“one-hot encoding”).

Discuss with neighbors:

  1. What’s the shape of the output?
  2. What must be the shape of the weights?
  3. What might an output from this model look like when given an image of a 4? What’s the loss?
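
One way to check your answers after discussing: a sketch with a random stand-in image and random weights (not a trained model), so the scores themselves are meaningless and only the shapes and the loss computation matter.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=(1, 784))          # one flattened image (random stand-in, not a real digit)
W = rng.normal(scale=0.01, size=(784, 10))    # 784 inputs -> 10 outputs
b = np.zeros(10)

scores = x @ W + b                            # shape (1, 10): one score per digit

target = np.zeros((1, 10))
target[0, 4] = 1.0                            # one-hot encoding of the digit 4

mse = ((scores - target) ** 2).mean()         # MSE between the 10 outputs and the one-hot target
print(scores.shape, target[0], mse)
```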

Fix 2: Softmax (Large Values)

Suppose the network predicted 1.5 for the correct output.

Suppose instead the network predicted 0.5 for the correct output.

What’s the loss in each case?

Fix: make outputs be probabilities.

  1. Exponentiate each output
  2. Divide by the sum of all exponentiated outputs

Discuss with neighbors:

  1. What might an output from this model look like? What’s the loss for your running example?

Fix 3: Cross-Entropy Loss

Negative log of the probability of the correct class.

Compute the loss for your running example.
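
Putting Fix 2 and Fix 3 together for a running example. The logits below are made up for an image of a 4; this is a sketch of the computation, not output from a trained model.

```python
import numpy as np

# Made-up scores (logits) for one image of a 4, one per digit 0-9
logits = np.array([0.1, -1.2, 0.3, 0.0, 2.5, 0.4, -0.5, 0.2, 0.9, 1.1])
correct_class = 4

exps = np.exp(logits - logits.max())   # exponentiate (shifted by the max for numerical safety)
probs = exps / exps.sum()              # divide by the sum -> probabilities

loss = -np.log(probs[correct_class])   # negative log probability of the correct class, in nats
print(probs[correct_class], loss)

# A confident wrong answer costs far more than a confident right one
print(-np.log(0.99), -np.log(0.01))    # about 0.01 vs about 4.6
```

Unlike MSE on raw outputs, this loss only approaches 0 as the probability of the correct class approaches 1, and it cannot be pushed down by overshooting.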