class: center, middle, inverse, title-slide

.title[
# Learning to Classify
]
.author[
### Ken Arnold
]

---

## This week's objectives

- Describe the difference between a metric and a loss function.
- Describe and compute cross-entropy loss.
- Explain the purpose and mathematical properties of the softmax operation.
- Explain the role of nonlinearities in a neural network (e.g., why they are used between linear layers).

An outline of all this week's material is available on the website.

---

## Logistics Reminders

- Preparation should be done by Monday's class.
- Discussion should be done by Wednesday's class (this week: a revision and a new post).
- Don't forget the check-in quizzes for labs.
- Homework 2 is due Friday.
- Homework 3 is posted, due in 2 weeks.

---

## Review

Can we use *accuracy* as a loss function? Why or why not?

--

No: its derivative with respect to the model's parameters is zero almost everywhere, so it gives gradient descent nothing to work with.

---

## Today: Classification

- How do we measure how good a classifier is? **Cross-entropy loss.**
- Cross-entropy depends on probabilities, but the model gives us scores. *How do we turn scores into probabilities?*

---

## Intuition: Predicting the outcome of a game

- Suppose you play Garry Kasparov in chess. Who wins?
- Suppose you play someone with roughly equal skill. Who wins?

---

## Good predictions give meaningful probabilities

- How surprised would you be if you played Garry Kasparov and he won?
- If you won?
- Intuition: surprise

--

What if you played 5 times? What's the total surprise?

---

## Use surprise to compare two models

Suppose A and B are playing chess. Model M gives them equal odds (50-50); Model Q gives A an 80% win chance.

| Player | Model M win prob | Model Q win prob |
|--------|------------------|------------------|
| A      | 50%              | 80%              |
| B      | 50%              | 20%              |

Now we let them play 5 games, and A wins every time. What is P(AAAAA) under each model?

- Model M: `0.5 * 0.5 * 0.5 * 0.5 * 0.5` = (0.5)^5 = 0.03125
- Model Q: `0.8 * 0.8 * 0.8 * 0.8 * 0.8` = (0.8)^5 = 0.32768

Which model was better able to predict the outcome?

---

## Cross-Entropy Loss

- A good classifier should give high probability to the correct result.
- Cross-entropy loss = **average surprise**. Definition: the negative log of the probability assigned to the correct class.
- Likelihood of the data:
  - Model M (50-50): (0.5)^5 = 0.03125
  - Model Q (80-20): (0.8)^5 = 0.32768
- Negative log-likelihood of the data:
  - Model M: `-5 * log2(0.5)` = 5 bits
  - Model Q: `-5 * log2(0.8)` ≈ 1.61 bits

We usually report the average over the whole dataset: per game, Model M's average cross-entropy loss is 1 bit, Model Q's is about 0.32 bits. (A quick code check of these numbers appears a few slides ahead.)

(Log of a product = sum of logs. Much easier to work with computationally.)

(We usually use the *natural* log, so the units are *nats*.)

---

## Math aside: Cross-Entropy

- A general concept: comparing two distributions.
- Most common use: classification.
  - The classifier outputs a probability distribution over classes.
  - Cross-entropy measures how far that distribution is from the "true" distribution.
  - We represent the true distribution as a one-hot vector: 1 in the correct class, 0 elsewhere.
- But it applies to any two distributions.

---
class: center, middle

## How do we turn scores into probabilities?
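
---

## Aside: checking the chess numbers in code

A minimal sketch (plain Python, not from the course notebook; variable names are just for illustration) verifying the likelihood and surprise numbers from the cross-entropy example:

```python
import math

games = "AAAAA"                  # the observed outcomes: A wins 5 games in a row
p_win = {"M": 0.5, "Q": 0.8}     # each model's probability that A wins one game

for model, p in p_win.items():
    likelihood = p ** len(games)                # P(AAAAA) under this model
    total_nll = -len(games) * math.log2(p)      # total surprise, in bits
    print(model, round(likelihood, 5),
          round(total_nll, 2), "bits total,",
          round(total_nll / len(games), 2), "bits per game")

# M 0.03125 5.0 bits total, 1.0 bits per game
# Q 0.32768 1.61 bits total, 0.32 bits per game
```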

---

## Intuition: Elo

A measure of relative skill:

- Higher Elo -> more likely to win
- Greater point spread -> more confidence in the win

Formal definition:

`Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))`, where `EloDiff = Elo(A) - Elo(B)`

- Uses: [chess](https://en.wikipedia.org/wiki/Elo_rating_system), [NFL](https://projects.fivethirtyeight.com/complete-history-of-the-nfl/)
- 538's [Super Bowl prediction](https://projects.fivethirtyeight.com/2022-nfl-predictions/games/) ([discussion](https://fivethirtyeight.com/methodology/how-our-nfl-predictions-work/))
- Sometimes adjusted: e.g., for playoffs, 538 multiplies EloDiff by 1.2 (wider margins in playoffs).

---

## From scores to probabilities

Suppose we have 3 chess players:

| player | Elo  |
|--------|------|
| A      | 1000 |
| B      | 2200 |
| C      | 1010 |

A and B play. Who wins?

--

A and C play. Who wins?

---

## Softmax

This year: Chiefs vs Eagles. Current Elo estimates: Chiefs at 1715, Eagles at 1679.

Continue [here](https://colab.research.google.com/drive/1CdTEcZP2bOx7zbPltAdTE7xBse6JI9z5)

---

## Softmax

1. Start with scores (variable name `logits`), which can be any numbers (positive, negative, whatever).
2. Make them all positive by exponentiating:
   - `xx = exp(logits)` (`logits.exp()` in PyTorch)
   - or another base, e.g., `10 ** logits`
3. Make them sum to 1: `probs = xx / xx.sum()`

(A code sketch applying these steps to the Elo scores above appears at the end of the deck.)

---

## Some properties of softmax

- Sums to 1 (by construction)
- The largest logit in gets the largest probability out.
- `logits + constant` doesn't change the output.
- `logits * constant` *does* change the output.

---

## Sigmoid

Special case of softmax when you have just one `score` (binary classification): use `logits = [score, 0.0]`

---

## Where do the "right" scores come from?

- In linear regression, we were given the right scores.
- In classification, we have to learn the scores from data.

---

## Review

Which of the following is a good *loss function* for classification?

1. Mean squared error
2. Softmax (generalization of sigmoid to multiple categories)
3. Error rate (fraction of answers that were incorrect)
4. Average of the probability the classifier assigned to the wrong answer
5. Average of the negative log of the probability the classifier assigned to the right answer

Why?

---

## Building a Neural Net

Where do Linear, Cross-Entropy, and Softmax or Sigmoid go? (A sketch appears at the end of the deck.)

---

## Going deeper

```python
from torch import nn

# Two Linear layers with no nonlinearity between them collapse into one:
# compose the weights and biases of linear_1 and linear_2 (defined earlier).
new_linear = nn.Linear(in_features=784, out_features=1)
new_linear.weight.data = (linear_2.weight.data @ linear_1.weight.data)
new_linear.bias.data = linear_2(linear_1.bias).data
new_linear(example_3_flat)  # same output as linear_2(linear_1(example_3_flat))
```

---

## ReLU Intuition

Piecewise linear. Ramps: relu(x) = max(0, x). Stacking Linear layers with ReLUs between them builds piecewise-linear functions.
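
---

## Sketch: softmax on Elo scores

A minimal sketch (PyTorch; the team names and ratings come from the Elo slides, everything else is illustrative) applying the softmax steps to Elo-style scores, using base 10 so the result matches the Elo win formula:

```python
import torch

# Elo ratings from the earlier slide: Chiefs 1715, Eagles 1679.
# Dividing by 400 and using base 10 reproduces Pr(A wins) = 1 / (1 + 10^(-EloDiff/400)).
logits = torch.tensor([1715.0, 1679.0]) / 400

xx = 10 ** logits            # step 2: make the scores positive
probs = xx / xx.sum()        # step 3: make them sum to 1
print(probs)                 # roughly [0.55, 0.45]: Chiefs slightly favored

# Adding a constant to the logits doesn't change the output:
shifted = 10 ** (logits - logits.max())
print(shifted / shifted.sum())   # same probabilities
```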
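
---

## Sketch: putting the pieces together

A minimal sketch (PyTorch; the layer sizes and the target class are illustrative, not from the course notebook) of where Linear, ReLU, softmax, and cross-entropy fit:

```python
import torch
from torch import nn

model = nn.Sequential(
    nn.Linear(in_features=784, out_features=256),  # e.g., a flattened 28x28 image in
    nn.ReLU(),                                     # relu(x) = max(0, x), elementwise
    nn.Linear(in_features=256, out_features=10),   # 10 class scores ("logits") out
)

x = torch.randn(1, 784)                  # stand-in for one flattened input image
logits = model(x)                        # raw scores: can be any real numbers
probs = logits.softmax(dim=-1)           # softmax turns scores into probabilities
print(probs.sum())                       # sums to 1 (up to rounding)

target = torch.tensor([3])               # hypothetical correct class for this example
loss = nn.functional.cross_entropy(logits, target)  # takes logits directly (softmax applied internally)
print(loss)                              # average surprise over the batch, in nats
```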