CS 375 Week 3

Ken Arnold

Overview

In this class we’re studying how Tuneable Machines can play Optimization Games.

So far:

  • Tuneable machines:
    • What inputs look like (pile of numbers, representing images / board games / …)
    • What outputs look like (e.g., vectors of probabilities)
    • Array computing (NumPy / PyTorch)
      • Fundamental data structure: vector / matrix / tensor
      • Fundamental operation: dot product / matrix product
  • Optimization game: supervised learning
    • Score = mimicry
    • Rule: fixed bag of data, all with “right answers”, get to see some at “training” time

Learning Path

“I trained a neural net classifier from scratch.”

  1. Basic array/“tensor” operations in PyTorch
    • Code: array operations
    • Concepts: dot product, mean squared error
  2. Linear Regression “the hard way” (but black-box optimizer)
    • Code: Representing data as arrays
    • Concepts: loss function, forward pass, optimizer
  3. Logistic Regression “the hard way”
    • Concepts: softmax, cross-entropy loss
  4. Multi-layer Perceptron
    • Concepts: nonlinearity (ReLU), initialization
  5. Gradient Descent
    • Concepts: backpropagation, training loop
  6. Data Loaders
    • Concepts: batching, shuffling

Array Programming: numpy

aka np, because it’s canonically imported as:

import numpy as np

numpy

  • Numerical computing library for Python
  • Provides the array data type. Like a list but:
    • Automatic for loops!
    • Supports multiple dimensions
  • …and lots of utilities (quick sketch below)
    • arange: range that makes arrays
    • zeros / ones / full: make new arrays
    • lots of math functions
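A quick sketch of a few of those, with the results shown as comments:

import numpy as np

np.arange(5)           # array([0, 1, 2, 3, 4])
np.zeros(3)            # array([0., 0., 0.])
np.ones((2, 2))        # 2x2 array of ones
np.full(3, 7.0)        # array([7., 7., 7.])
np.sqrt(np.arange(5))  # math functions apply to every element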

Broadcasting (automatic for loops!)

array plus scalar:

import numpy as np
x = np.array([1.0, 2.0, 3.0])
x + 1
array([2., 3., 4.])

array plus array:

x + x
array([2., 4., 6.])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y
array([-2.44929360e-16, -4.89858720e-16, -7.34788079e-16])

Reduction operations

Reduce the dimensionality of an array (e.g., summing over an axis)

x.sum()
np.float64(6.0)
x.mean()
np.float64(2.0)
x.max()
np.float64(3.0)
np.argmax(x)
np.int64(2)
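Reductions can also collapse just one axis of a multi-dimensional array; a small example:

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
m.sum()        # np.float64(21.0) -- reduce over everything
m.sum(axis=0)  # array([5., 7., 9.]) -- one value per column
m.sum(axis=1)  # array([ 6., 15.])  -- one value per row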

Exercise: computing error

Suppose the true values are:

y_true = np.array([1., 2., 3.])

and two model predictions are:

y_pred_a = np.array([1.5, 1.5, 3.5])
y_pred_b = np.array([1.1, 2.1, 1.8])

  1. In what sense is Model A better? In what sense is Model B better? Try to quantify the score of each model in at least 2 different ways.
  2. Write NumPy expressions to compute each of the errors you listed.

Quantifying Error

Error Metrics

MAE: mean absolute error: average of absolute differences

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]

mae_a = np.abs(y_true - y_pred_a).mean()
mae_b = np.abs(y_true - y_pred_b).mean()

print('MAE a: {:.2f}, MAE b: {:.2f}'.format(
  mae_a, mae_b))
MAE a: 0.50, MAE b: 0.47

MSE: mean squared error (RMSE, the root mean squared error, is just its square root)

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

mse_a = ((y_true - y_pred_a) ** 2).mean()
mse_b = ((y_true - y_pred_b) ** 2).mean()

print('MSE a: {:.2f}, MSE b: {:.2f}'.format(
  mse_a, mse_b))
MSE a: 0.25, MSE b: 0.49
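RMSE is just the square root of MSE, which puts the error back in the same units as y; continuing from the values above:

rmse_a = np.sqrt(mse_a)
rmse_b = np.sqrt(mse_b)

print('RMSE a: {:.2f}, RMSE b: {:.2f}'.format(
  rmse_a, rmse_b))
RMSE a: 0.50, RMSE b: 0.70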

Contrasting MAE and MSE

  • Which model had a better MAE? Which had a better MSE?
  • What do you notice about the specific mistakes that the models made?

import pandas as pd

pd.DataFrame({
  'y_true': y_true,
  'y_pred_a': y_pred_a, 'err_a': y_true - y_pred_a,
  'y_pred_b': y_pred_b, 'err_b': y_true - y_pred_b,
}).style.hide(axis='index').format('{:.1f}')
y_true  y_pred_a  err_a  y_pred_b  err_b
   1.0       1.5   -0.5       1.1   -0.1
   2.0       1.5    0.5       2.1   -0.1
   3.0       3.5   -0.5       1.8    1.2

Linear Regression

Model = architecture + loss + data + optimization

From Linear Regression to Neural Networks

CS 375:

  • Nonlinear transformations (ReLU, etc.)
  • Extend to classification (softmax, cross-entropy loss)
  • More layers (“deep learning”)

CS 376:

  • Handle structure (locality of pixels in images, etc.)
  • Flexible data flow (attention, recurrences, etc.)

Linear Regression with One Output

# inputs:
# - x_train (num_samples, num_features_in)
# - y_train (num_samples)
num_features_in = x_train.shape[1]
w = np.random.randn(num_features_in)
b = np.random.randn()

y_pred = x_train @ w + b
loss = ((y_train - y_pred) ** 2).mean()

Check-in question: what loss function is this?
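That snippet assumes x_train and y_train already exist; here’s a minimal sketch that fills them in with made-up synthetic data so the whole block runs (the specific numbers don’t matter, only the shapes):

import numpy as np

# Synthetic data, purely illustrative: 100 samples, 3 features
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = x_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

num_features_in = x_train.shape[1]
w = rng.normal(size=num_features_in)
b = rng.normal()

y_pred = x_train @ w + b
loss = ((y_train - y_pred) ** 2).mean()
print(loss)  # large at random initialization; an optimizer's job is to shrink it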

Multiple Inputs and Outputs

  • y = x1*w1 + x2*w2 + x3*w3 + b
  • Or: y = x @ W + b

Matmul (@) so we can process every example of x at once:

  • x is 100 samples, each with 3 features (x.shape is (100, 3))
  • W maps each sample’s 3 features to 4 outputs (W.shape is (3, 4))
  • Then x @ W gives 100 samples, each with 4 outputs ((100, 4))
  • Think: what is b’s shape? (check against the sketch below)
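A minimal shape check with those sizes (random values; only the shapes matter):

import numpy as np

x = np.random.randn(100, 3)  # 100 samples, 3 features
W = np.random.randn(3, 4)    # maps 3 features to 4 outputs
b = np.random.randn(4)       # one number per output; broadcasts across all samples

out = x @ W + b
print(out.shape)             # (100, 4)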

Linear Regression with Multiple Outputs

# inputs:
# - x_train (num_samples, num_features_in)
# - y_train (num_samples, num_features_out)
num_features_out = y_train.shape[1]
W = np.random.randn(num_features_in, num_features_out)
b = np.random.randn(num_features_out)

y_pred = x_train @ W + b
loss = ((y_train - y_pred) ** 2).mean()

  • W is now a 2-axis array: how much each input contributes to each output
  • b is now a 1-axis array: a number to add to each output

Friday: Classification Outputs

Intuition: Elo

A measure of relative skill:

  • Higher Elo -> more likely to win
  • Greater point spread -> more confidence in win

Formal definition (Wikipedia):

\[ \Pr(\text{A wins}) = \frac{1}{1 + 10^{-\text{EloDiff} / 400}} \]

where EloDiff = A’s Elo - B’s Elo.
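A quick sketch of that formula as a function (the name p_win is just for illustration):

def p_win(elo_a, elo_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** (-(elo_a - elo_b) / 400))

print(p_win(1400, 1000))  # ~0.909: a 400-point edge is roughly a 10:1 favorite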

From scores to probabilities

Suppose we have 3 chess players:

player   Elo
A       1000
B       2200
C       1010

A and B play. Who wins?

A and C play. Who wins?

Elo Exercise

See this Elo ratings table

  1. Pick a pair of teams from the list (e.g., this year’s Super Bowl matchup).
  2. Compute the difference in their Elo ratings.
  3. Compute the probability that the higher-rated team wins using the formula below:

\[ \Pr(\text{A wins}) = \frac{1}{1 + 10^{-\text{EloDiff} / 400}} \]

Now try it again:

  1. Divide both scores by 400.
  2. Compute 10^score for both scores.
  3. Make that a probability distribution by dividing by the sum.

Do you get the same result?
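To check your arithmetic, here’s a sketch with made-up ratings (not from the linked table) that computes the probability both ways:

import numpy as np

elo = np.array([1650.0, 1500.0])  # illustrative ratings for [team A, team B]

# Route 1: the Elo formula directly
p_direct = 1 / (1 + 10 ** (-(elo[0] - elo[1]) / 400))

# Route 2: divide by 400, exponentiate base 10, normalize to a distribution
scores = 10 ** (elo / 400)
p_normalized = scores / scores.sum()

print(p_direct, p_normalized[0])  # both give the same probability that A wins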

Softmax

  1. Start with scores (use the variable name logits), which can be any numbers (positive, negative, whatever)
  2. Make them all positive by exponentiating:
    • xx = exp(logits) (logits.exp() in PyTorch)
    • alternatives: 10 ** logits or 2 ** logits
  3. Make them sum to 1: probs = xx / xx.sum()
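A minimal PyTorch sketch of those three steps (for large logits you’d subtract logits.max() first for numerical stability, or just call torch.softmax):

import torch

def softmax(logits):
    xx = logits.exp()      # step 2: make every score positive
    return xx / xx.sum()   # step 3: normalize so they sum to 1

logits = torch.tensor([2.0, 1.0, -1.0])
probs = softmax(logits)
print(probs)        # the largest logit gets the largest probability
print(probs.sum())  # sums to 1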

Some properties of softmax

  • Sums to 1 (by construction)
  • The largest input logit gets the largest output probability
  • logits + constant doesn’t change the output.
  • logits * constant does change the output (see the check below).
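You can check the last two properties directly with torch.softmax, which does the same exponentiate-and-normalize:

import torch

logits = torch.tensor([2.0, 1.0, -1.0])
print(torch.softmax(logits, dim=0))        # baseline
print(torch.softmax(logits + 5.0, dim=0))  # identical: the added constant cancels out
print(torch.softmax(logits * 2.0, dim=0))  # different: scaling sharpens the distribution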

Sigmoid

Special case of softmax when you just have one score (binary classification): use logits = [score, 0.0]

Exercise for practice: write this out in math and see if you can get it to simplify to the traditional way that the sigmoid function is written.
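A numeric sanity check of that special case (the algebra is still left to you):

import torch

score = torch.tensor(1.3)
logits = torch.stack([score, torch.tensor(0.0)])

print(torch.softmax(logits, dim=0)[0])  # probability from the two-entry softmax
print(torch.sigmoid(score))             # should match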