On the last step, we observed that the fitted model was different for MAE vs MSE. To get a different line, which had to change? (1) the computation of the loss, (2) the computation of the gradient, (3) both, (4) neither or something else.
If you changed how the predictions were computed, would you need to change how the loss function gradient is computed?
CS 375:
CS 376:
PyTorch
TensorFlow (low-level)
```python
import tensorflow as tf
import keras

model = keras.layers.Dense(1, input_shape=(3,))
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
# ...
with tf.GradientTape() as tape:   # record operations so we can differentiate through them
    y_pred = model(x)
    loss = loss_fn(y, y_pred)
grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))
```
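For comparison, a rough PyTorch sketch of the same training step (the data tensors here are made up so the snippet runs on its own):

```python
import torch

# Made-up data: 100 samples, 3 features, 1 target each
x = torch.randn(100, 3)
y = torch.randn(100, 1)

model = torch.nn.Linear(3, 1)                    # plays the role of keras.layers.Dense(1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

y_pred = model(x)                                # forward pass
loss = loss_fn(y_pred, y)                        # compute the loss
loss.backward()                                  # autograd computes gradients of loss w.r.t. the weights
optimizer.step()                                 # apply the gradients
optimizer.zero_grad()                            # clear gradients before the next step
```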
Upshot: we can differentiate programs.
y = x1*w1 + x2*w2 + x3*w3 + b
y = x @ W + b
Matmul (@) so we can process every example of x at once:
- x is 100 samples, each with 3 features (x.shape is (100, 3))
- W maps the 3 input features to 4 outputs (W.shape is (3, 4))
- x @ W gives 100 samples, each with 4 outputs (shape (100, 4))
- What is b’s shape? (see the shape sketch below, after the ReLU notes)
ReLU: chop off the negative part of its input.
y = max(0, x)
(Gradient is 1 for positive inputs, 0 for negative inputs)
ReLU interactive (name: u04n00-relu.ipynb; show preview, open in Colab)
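A minimal NumPy sketch of the shapes above plus the ReLU; it assumes b holds one bias per output, i.e. shape (4,), which is one answer to the b-shape question:

```python
import numpy as np

x = np.random.rand(100, 3)    # 100 samples, 3 features each
W = np.random.rand(3, 4)      # maps 3 features to 4 outputs
b = np.zeros(4)               # one bias per output; broadcasts across all 100 samples

y = x @ W + b                 # shape (100, 4)
relu_y = np.maximum(0, y)     # ReLU: chop off the negative part
print(y.shape, relu_y.shape)  # (100, 4) (100, 4)
```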
Suppose your Homework 1 validation set had 2 images and it got both correct.
How would that have changed if it got one right and one wrong?
Outcome | Probability if accuracy 50% | Probability if accuracy 85% | Probability if accuracy 100% |
---|---|---|---|
✅✅ | 0.5 * 0.5 = 0.25 | 0.85 * 0.85 = 0.7225 | 1.0 * 1.0 = 1.0 |
✅❌ | 0.5 * 0.5 = 0.25 | 0.85 * 0.15 = 0.1275 | 1.0 * 0.0 = 0.0 |
❌✅ | 0.5 * 0.5 = 0.25 | 0.15 * 0.85 = 0.1275 | 0.0 * 1.0 = 0.0 |
❌❌ | 0.5 * 0.5 = 0.25 | 0.15 * 0.15 = 0.0225 | 0.0 * 0.0 = 0.0 |
Can we use accuracy as a loss function for a classifier? Why or why not?
No, because its derivative is almost always 0.
What if you played 5 times? What’s the total surprise?
Suppose A and B are playing chess. Model M gives them equal odds (50-50), Model Q gives A an 80% win chance.
Player | Model M win prob | Model Q win prob |
---|---|---|
A | 50% | 80% |
B | 50% | 20% |
Now we let them play 5 games, and A wins each time. (data = AAAAA)
What is P(data) for each model?
Model M: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = (0.5)^5 = 0.03125
Model Q: 0.8 * 0.8 * 0.8 * 0.8 * 0.8 = (0.8)^5 = 0.32768
Which model was better able to predict the outcome?
Likelihood: probability that a model assigns to the data. (The P(AAAAA) we just computed.)
Assumption: data points are independent and order doesn’t matter (i.i.d.). So P(AAAAA) = P(A) * P(A) * P(A) * P(A) * P(A) = P(A)^5
Log likelihood of data for a model: log P(data) = log P(A) + log P(A) + ... + log P(A), i.e., the sum of the log-probabilities of the individual data points.
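A quick sketch of those computations and their logs (probabilities taken from the table above):

```python
import math

p_M = 0.5    # Model M: P(A wins a game)
p_Q = 0.8    # Model Q: P(A wins a game)
n = 5        # observed data: AAAAA

print(p_M ** n, p_Q ** n)                    # likelihoods: 0.03125 vs 0.32768
print(n * math.log(p_M), n * math.log(p_Q))  # log likelihoods: about -3.47 vs -1.12 nats
```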
Technical note: minimizing MSE is equivalent to minimizing cross-entropy if you model the data as Gaussian.
For technical details, see Goodfellow et al., Deep Learning Book chapters 3 (info theory background) and 5 (application to loss functions).
Use cross-entropy when the data is categorical (i.e., a classification problem).
Definition: Average of negative log of probability of the correct class.
(Usually use natural log, so units are nats.)
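A tiny sketch of that definition, using made-up probabilities that a model assigned to the correct class of three examples:

```python
import math

p_correct = [0.9, 0.6, 0.25]   # made-up probabilities of each example's correct class

# Cross-entropy: average of the negative natural log of those probabilities (in nats)
loss = sum(-math.log(p) for p in p_correct) / len(p_correct)
print(loss)                    # about 0.67 nats
```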
A measure of relative skill:
Formal definition:
Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))
where EloDiff = (A’s Elo) - (B’s Elo)
Suppose we have 3 chess players:
Player | Elo |
---|---|
A | 1000 |
B | 2200 |
C | 1010 |
A and B play. Who wins?
A and C play. Who wins?
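A sketch of the win-probability formula applied to the table above:

```python
def p_win(elo_a, elo_b):
    # Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400)), with EloDiff = (A's Elo) - (B's Elo)
    return 1 / (1 + 10 ** (-(elo_a - elo_b) / 400))

print(p_win(1000, 2200))   # A vs B: about 0.001, so B almost always wins
print(p_win(1000, 1010))   # A vs C: about 0.486, nearly a coin flip
```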
See nfelo
Elo probability formula:
Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))
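One reason this formula reappears here: it can be rewritten as a two-entry softmax (defined just below) with base 10, using Elo/400 as the scores. A quick numerical check of that claim:

```python
a_elo, b_elo = 1000, 2200

# Elo formula directly
p_elo = 1 / (1 + 10 ** (-(a_elo - b_elo) / 400))

# Same value as a base-10 "softmax" over the two scaled ratings
xx = [10 ** (a_elo / 400), 10 ** (b_elo / 400)]
p_softmax = xx[0] / sum(xx)

print(p_elo, p_softmax)    # both about 0.000999
```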
Start with a score for each class (called logits), which can be any numbers (positive, negative, whatever).
xx = exp(logits) (logits.exp() in PyTorch); any base works, e.g., 10 ** logits as in the Elo formula.
probs = xx / xx.sum()
logits + constant doesn’t change the output.
logits * constant does change the output.
Special case of softmax when you just have one score (binary classification): use logits = [score, 0.0]
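A PyTorch sketch of the recipe above, including the shift/scale claims and the one-score special case:

```python
import torch

logits = torch.tensor([2.0, -1.0, 0.5])    # any numbers: positive, negative, whatever
xx = logits.exp()
probs = xx / xx.sum()                       # softmax: probabilities that sum to 1

shifted = (logits + 3.0).exp()
print(probs, shifted / shifted.sum())       # adding a constant: same probabilities

scaled = (logits * 2.0).exp()
print(scaled / scaled.sum())                # multiplying by a constant: different (sharper)

# Binary special case: one score vs. an implicit 0.0 (this is exactly the sigmoid)
score = 1.2
pair = torch.tensor([score, 0.0]).exp()
print(pair / pair.sum())                    # first entry...
print(torch.sigmoid(torch.tensor(score)))   # ...matches sigmoid(score)
```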
Which of the following is a good loss function for classification?
Why?
Work through how this is done in the Chapter 2 notebook
Use y = x @ W + b to compute a one-dimensional output y for each image.
Compute the loss between y and the desired number (0-9).
Discuss with neighbors:
We want a score for each digit (0-9), so we need 10 outputs.
Each desired output is 0 or 1 (“one-hot encoding”).
Discuss with neighbors:
Suppose the network predicted 1.5 for the correct output.
Suppose instead the network predicted 0.5 for the correct output.
What’s the loss in each case?
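One way to work this out, assuming a squared-error loss against a target of 1 for the correct output (the slide may have a different loss in mind):

```python
# Squared error of the correct output against a target of 1 (assumed setup)
for pred in (1.5, 0.5):
    print(pred, (pred - 1.0) ** 2)   # both give 0.25, so this loss can't tell them apart
```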
Fix: make outputs be probabilities.
Discuss with neighbors:
Negative log of the probability of the correct class.
Compute the loss for your running example.
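A sketch of that computation for a made-up 10-logit output, assuming the correct digit is 3; it also checks the result against PyTorch's built-in cross-entropy:

```python
import torch
import torch.nn.functional as F

logits = torch.randn(10)                # made-up scores for digits 0-9
correct = 3                             # assume the correct digit is 3

probs = F.softmax(logits, dim=0)        # make the outputs probabilities
loss = -torch.log(probs[correct])       # negative log probability of the correct class

# Built-in equivalent (expects a batch dimension and integer class labels)
same = F.cross_entropy(logits.unsqueeze(0), torch.tensor([correct]))
print(loss, same)                       # same value
```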