MAE is like the median (it cares about how many predictions fall above/below the target); MSE/RMSE/R² are like the mean (they care about the magnitude of the errors).
All of these are also valid loss functions (i.e., we can use them to train a model).
Answer as a probability distribution.
Suppose A and B are playing chess. Model M gives them equal odds (50-50), Model Q gives A an 80% win chance.
| Player | Model M win prob | Model Q win prob |
|---|---|---|
| A | 50% | 80% |
| B | 50% | 20% |
Now we let them play 5 games, and A wins each time. (data = AAAAA)
What is P(data given model) for each model?
Model M: 0.5 * 0.5 * 0.5 * 0.5 * 0.5 = (0.5)^5 = 0.03125
Model Q: 0.8 * 0.8 * 0.8 * 0.8 * 0.8 = (0.8)^5 = 0.32768
Which model was better able to predict the outcome?
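The computation above can be sketched in a few lines of Python (the outcomes string `"AAAAA"` encodes the five games):

```python
# Likelihood of the observed games under each model.
# Assumes games are independent, so the joint probability is a product.
def likelihood(p_a_wins, outcomes):
    """Probability the model assigns to a sequence of game outcomes."""
    prob = 1.0
    for outcome in outcomes:
        prob *= p_a_wins if outcome == "A" else (1 - p_a_wins)
    return prob

data = "AAAAA"
print(likelihood(0.5, data))  # Model M: 0.5**5 = 0.03125
print(likelihood(0.8, data))  # Model Q: 0.8**5 ≈ 0.32768
```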
Likelihood: probability that a model assigns to the data. (The P(AAAAA) we just computed.)
Assumption: data points are independent and order doesn't matter (i.i.d.). So P(AAAAA) = P(A) * P(A) * P(A) * P(A) * P(A) = P(A)^5
Log likelihood of data for a model:
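Taking the log turns the product into a sum, which is also more numerically stable for long sequences. A minimal sketch using the same chess data:

```python
import math

# Log likelihood: the sum of log-probabilities (log of a product is a
# sum of logs), computed for the chess data AAAAA.
def log_likelihood(p_a_wins, outcomes):
    return sum(math.log(p_a_wins if o == "A" else 1 - p_a_wins)
               for o in outcomes)

print(log_likelihood(0.5, "AAAAA"))  # 5 * log(0.5) ≈ -3.466
print(log_likelihood(0.8, "AAAAA"))  # 5 * log(0.8) ≈ -1.116
```

The better model (Q) has the higher (less negative) log likelihood, matching the raw-probability comparison above.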
Technical note: minimizing MSE is equivalent to minimizing cross-entropy (maximizing likelihood) if you model the data as Gaussian with fixed variance.
For technical details, see Goodfellow et al., Deep Learning Book chapters 3 (info theory background) and 5 (application to loss functions).
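A quick sketch of why the Gaussian assumption gives MSE: if each target $y_i$ is modeled as Gaussian with mean $\hat{y}_i$ (the prediction) and fixed variance $\sigma^2$, the negative log likelihood of one point is

$$-\log p(y_i \mid \hat{y}_i) = \frac{(y_i - \hat{y}_i)^2}{2\sigma^2} + \frac{1}{2}\log(2\pi\sigma^2)$$

Averaged over the dataset, the second term and the factor $1/(2\sigma^2)$ are constants with respect to the model, so minimizing the negative log likelihood is the same as minimizing $\frac{1}{n}\sum_i (y_i - \hat{y}_i)^2$, i.e., MSE.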
Cross-entropy is used when the data is categorical (i.e., a classification problem).
Definition: Average of negative log of probability of the correct class.
(Usually use natural log, so units are nats.)
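The definition above, as a small sketch (the class probabilities here are made up for illustration):

```python
import math

# Cross-entropy loss for classification: the average negative natural
# log of the probability assigned to the correct class (units: nats).
def cross_entropy(probs, labels):
    """probs: per-example lists of class probabilities;
    labels: index of the correct class for each example."""
    return -sum(math.log(p[y]) for p, y in zip(probs, labels)) / len(labels)

# Toy 3-class example.
probs = [[0.7, 0.2, 0.1],   # confident and correct
         [0.1, 0.8, 0.1],   # confident and correct
         [0.3, 0.4, 0.3]]   # unsure
labels = [0, 1, 2]
print(cross_entropy(probs, labels))  # (-ln 0.7 - ln 0.8 - ln 0.3) / 3 ≈ 0.595
```

Note that confident wrong answers are punished heavily: as the probability of the correct class approaches 0, its negative log blows up.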



On the last step, we observed that the fitted model was different for MAE vs MSE. To get a different line, which had to change? (1) the computation of the loss, (2) the computation of the gradient, (3) both, (4) neither or something else.
If you changed how the predictions were computed, would you need to change how the loss function gradient is computed?
Can we use accuracy as a loss function for a classifier? Why or why not?
No: accuracy is piecewise constant in the model parameters, so its derivative is 0 almost everywhere (and undefined at the jumps), giving gradient descent no signal.
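A toy demonstration of this, using a made-up 1-D logistic classifier: a small change to the weight leaves accuracy completely unchanged (zero gradient), while the cross-entropy loss still moves (a usable gradient).

```python
import math

# Made-up 1-D dataset: negative inputs are class 0, positive are class 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def sigmoid(z):
    return 1 / (1 + math.exp(-z))

def accuracy(w):
    # Threshold the predicted probability at 0.5 -> piecewise constant in w.
    preds = [1 if sigmoid(w * x) >= 0.5 else 0 for x in xs]
    return sum(p == y for p, y in zip(preds, ys)) / len(ys)

def cross_entropy(w):
    # Smooth in w: nudging the weight always changes the loss.
    ps = [sigmoid(w * x) for x in xs]
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for p, y in zip(ps, ys)) / len(ys)

print(accuracy(1.0), accuracy(1.001))            # identical -> no gradient
print(cross_entropy(1.0), cross_entropy(1.001))  # different -> gradient exists
```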
Which of the following is a good loss function for classification?
Why?