CS 375 Week 3

Ken Arnold

Overview

In this class we’re studying how Tuneable Machines can play Optimization Games.

So far:

  • Tuneable machines:
    • What inputs look like (pile of numbers, representing images / board games / …)
    • What outputs look like (e.g., vectors of probabilities)
    • Array computing (NumPy / PyTorch)
      • Fundamental data structure: vector / matrix / tensor
      • Fundamental operation: dot product / matrix product
  • Optimization game: supervised learning
    • Score = mimicry
    • Rule: fixed bag of data, all with “right answers”, get to see some at “training” time

Learning Path

“I trained a neural net classifier from scratch.”

  1. Basic array/“tensor” operations in PyTorch
    • Code: array operations
    • Concepts: dot product, mean squared error
  2. Linear Regression “the hard way” (but black-box optimizer)
    • Code: Representing data as arrays
    • Concepts: loss function, forward pass, optimizer
  3. Logistic Regression “the hard way”
    • Concepts: softmax, cross-entropy loss
  4. Multi-layer Perceptron
    • Concepts: nonlinearity (ReLU), initialization
  5. Gradient Descent
    • Concepts: backpropagation, training loop
  6. Data Loaders
    • Concepts: batching, shuffling

Array Programming: numpy

aka np, because it’s canonically imported as:

import numpy as np

numpy

  • Numerical computing library for Python
  • Provides the array data type. Like a list but:
    • Automatic for loops!
    • Supports multiple dimensions
  • …and lots of utilities (quick sketch below)
    • arange: range that makes arrays
    • zeros / ones / full: make new arrays
    • lots of math functions
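A quick sketch of a few of those, with the results shown as comments:

import numpy as np

np.arange(5)           # array([0, 1, 2, 3, 4])
np.zeros(3)            # array([0., 0., 0.])
np.ones((2, 2))        # 2x2 array of ones
np.full(3, 7.0)        # array([7., 7., 7.])
np.sqrt(np.arange(5))  # math functions apply to every element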

Broadcasting (automatic for loops!)

array plus scalar:

import numpy as np
x = np.array([1.0, 2.0, 3.0])
x + 1
array([2., 3., 4.])

array plus array:

x + x
array([2., 4., 6.])

Applying a function to every element:

y = np.sin(2 * np.pi * x)
y
array([-2.44929360e-16, -4.89858720e-16, -7.34788079e-16])

Reduction operations

Reduce the dimensionality of an array (e.g., summing over an axis)

x.sum()
np.float64(6.0)
x.mean()
np.float64(2.0)
x.max()
np.float64(3.0)
np.argmax(x)
np.int64(2)
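Reductions can also collapse just one axis of a multi-dimensional array; a small example:

m = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
m.sum()        # np.float64(21.0) -- reduce over everything
m.sum(axis=0)  # array([5., 7., 9.]) -- one value per column
m.sum(axis=1)  # array([ 6., 15.])  -- one value per row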

Exercise: computing error

Suppose the true values are:

y_true = np.array([1., 2., 3.])

and two model predictions are:

y_pred_a = np.array([1.5, 1.5, 3.5])
y_pred_b = np.array([1.1, 2.1, 1.8])

  1. In what sense is Model A better? In what sense is Model B better? Try to quantify the score of each model in at least 2 different ways.
  2. Write NumPy expressions to compute each of the errors you listed.

Quantifying Error

Error Metrics

MAE: mean absolute error: average of absolute differences

\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]

mae_a = np.abs(y_true - y_pred_a).mean()
mae_b = np.abs(y_true - y_pred_b).mean()

print('MAE a: {:.2f}, MAE b: {:.2f}'.format(
  mae_a, mae_b))
MAE a: 0.50, MAE b: 0.47

MSE: mean squared error (RMSE, the root mean squared error, is just its square root)

\[ \text{MSE} = \frac{1}{n} \sum_{i=1}^n (y_i - \hat{y}_i)^2 \]

mse_a = ((y_true - y_pred_a) ** 2).mean()
mse_b = ((y_true - y_pred_b) ** 2).mean()

print('MSE a: {:.2f}, MSE b: {:.2f}'.format(
  mse_a, mse_b))
MSE a: 0.25, MSE b: 0.49
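RMSE is just the square root of MSE, which puts the error back in the same units as y; continuing from the values above:

rmse_a = np.sqrt(mse_a)
rmse_b = np.sqrt(mse_b)

print('RMSE a: {:.2f}, RMSE b: {:.2f}'.format(
  rmse_a, rmse_b))
RMSE a: 0.50, RMSE b: 0.70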

Contrasting MAE and MSE

  • Which model had a better MAE? Which had a better MSE?
  • What do you notice about the specific mistakes that the models made?

import pandas as pd

pd.DataFrame({
  'y_true': y_true,
  'y_pred_a': y_pred_a, 'err_a': y_true - y_pred_a,
  'y_pred_b': y_pred_b, 'err_b': y_true - y_pred_b,
}).style.hide(axis='index').format('{:.1f}')
y_true  y_pred_a  err_a  y_pred_b  err_b
   1.0       1.5   -0.5       1.1   -0.1
   2.0       1.5    0.5       2.1   -0.1
   3.0       3.5   -0.5       1.8    1.2

Linear Regression

Model = architecture + loss + data + optimization

From Linear Regression to Neural Networks

CS 375:

  • Nonlinear transformations (ReLU, etc.)
  • Extend to classification (softmax, cross-entropy loss)
  • More layers (“deep learning”)

CS 376:

  • Handle structure (locality of pixels in images, etc.)
  • Flexible data flow (attention, recurrences, etc.)

Linear Regression with One Output

# inputs:
# - x_train (num_samples, num_features_in)
# - y_train (num_samples)
num_features_in = x_train.shape[1]
w = np.random.randn(num_features_in)
b = np.random.randn()

y_pred = x_train @ w + b
loss = ((y_train - y_pred) ** 2).mean()

Check-in question: what loss function is this?
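That snippet assumes x_train and y_train already exist; here’s a minimal sketch that fills them in with made-up synthetic data so the whole block runs (the specific numbers don’t matter, only the shapes):

import numpy as np

# Synthetic data, purely illustrative: 100 samples, 3 features
rng = np.random.default_rng(0)
x_train = rng.normal(size=(100, 3))
y_train = x_train @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=100)

num_features_in = x_train.shape[1]
w = rng.normal(size=num_features_in)
b = rng.normal()

y_pred = x_train @ w + b
loss = ((y_train - y_pred) ** 2).mean()
print(loss)  # large at random initialization; an optimizer's job is to shrink it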

Multiple Inputs and Outputs

  • y = x1*w1 + x2*w2 + x3*w3 + b
  • Or: y = x @ W + b

Matmul (@) so we can process every example of x at once:

  • x is 100 samples, each with 3 features (x.shape is (100, 3))
  • W maps each sample’s 3 features to 4 outputs (W.shape is (3, 4))
  • Then x @ W gives 100 samples, each with 4 outputs ((100, 4))
  • Think: what is b’s shape? (check against the sketch below)
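A minimal shape check with those sizes (random values; only the shapes matter):

import numpy as np

x = np.random.randn(100, 3)  # 100 samples, 3 features
W = np.random.randn(3, 4)    # maps 3 features to 4 outputs
b = np.random.randn(4)       # one number per output; broadcasts across all samples

out = x @ W + b
print(out.shape)             # (100, 4)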

Linear Regression with Multiple Outputs

# inputs:
# - x_train (num_samples, num_features_in)
# - y_train (num_samples, num_features_out)
num_features_out = y_train.shape[1]
W = np.random.randn(num_features_in, num_features_out)
b = np.random.randn(num_features_out)

y_pred = x_train @ W + b
loss = ((y_train - y_pred) ** 2).mean()

  • W is now a 2-axis array: how much each input contributes to each output
  • b is now a 1-axis array: a number to add to each output

Friday: Classification Outputs

Intuition: Elo

A measure of relative skill:

  • Higher Elo -> more likely to win
  • Greater point spread -> more confidence in win

Formal definition (Wikipedia):

\[ \Pr(\text{A wins}) = \frac{1}{1 + 10^{-\text{EloDiff} / 400}} \]

where EloDiff = A’s Elo - B’s Elo.
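A quick sketch of that formula as a function (the name p_win is just for illustration):

def p_win(elo_a, elo_b):
    """Probability that A beats B under the Elo model."""
    return 1 / (1 + 10 ** (-(elo_a - elo_b) / 400))

print(p_win(1400, 1000))  # ~0.909: a 400-point edge is roughly a 10:1 favorite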

From scores to probabilities

Suppose we have 3 chess players:

player   Elo
A       1000
B       2200
C       1010

A and B play. Who wins?

A and C play. Who wins?

Elo Exercise

See this Elo ratings table

  1. Pick a pair of teams from the list (e.g., this year’s Super Bowl matchup).
  2. Compute the difference in their Elo ratings.
  3. Compute the probability that the higher-rated team wins using the formula below:

\[ \Pr(\text{A wins}) = \frac{1}{1 + 10^{-\text{EloDiff} / 400}} \]

Now try it again:

  1. Divide both scores by 400.
  2. Compute 10^score for both scores.
  3. Make that a probability distribution by dividing by the sum.

Do you get the same result?
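To check your arithmetic, here’s a sketch with made-up ratings (not from the linked table) that computes the probability both ways:

import numpy as np

elo = np.array([1650.0, 1500.0])  # illustrative ratings for [team A, team B]

# Route 1: the Elo formula directly
p_direct = 1 / (1 + 10 ** (-(elo[0] - elo[1]) / 400))

# Route 2: divide by 400, exponentiate base 10, normalize to a distribution
scores = 10 ** (elo / 400)
p_normalized = scores / scores.sum()

print(p_direct, p_normalized[0])  # both give the same probability that A wins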

Softmax

  1. Start with scores (use the variable name logits), which can be any numbers (positive, negative, whatever)
  2. Make them all positive by exponentiating:
    • xx = exp(logits) (logits.exp() in PyTorch)
    • alternatives: 10 ** logits or 2 ** logits
  3. Make them sum to 1: probs = xx / xx.sum()
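A minimal PyTorch sketch of those three steps (for large logits you’d subtract logits.max() first for numerical stability, or just call torch.softmax):

import torch

def softmax(logits):
    xx = logits.exp()      # step 2: make every score positive
    return xx / xx.sum()   # step 3: normalize so they sum to 1

logits = torch.tensor([2.0, 1.0, -1.0])
probs = softmax(logits)
print(probs)        # the largest logit gets the largest probability
print(probs.sum())  # sums to 1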

Some properties of softmax

  • Sums to 1 (by construction)
  • The largest input logit gets the largest output probability
  • logits + constant doesn’t change the output.
  • logits * constant does change the output (see the check below).
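You can check the last two properties directly with torch.softmax, which does the same exponentiate-and-normalize:

import torch

logits = torch.tensor([2.0, 1.0, -1.0])
print(torch.softmax(logits, dim=0))        # baseline
print(torch.softmax(logits + 5.0, dim=0))  # identical: the added constant cancels out
print(torch.softmax(logits * 2.0, dim=0))  # different: scaling sharpens the distribution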

Sigmoid

Special case of softmax when you just have one score (binary classification): use logits = [score, 0.0]

Exercise for practice: write this out in math and see if you can get it to simplify to the traditional way that the sigmoid function is written.
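A numeric sanity check of that special case (the algebra is still left to you):

import torch

score = torch.tensor(1.3)
logits = torch.stack([score, torch.tensor(0.0)])

print(torch.softmax(logits, dim=0)[0])  # probability from the two-entry softmax
print(torch.sigmoid(score))             # should match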