“I trained a neural net classifier from scratch.”
Open up your Lab 1 notebooks. Discuss with your neighbors:
- Code cells: code (e.g., functions, variable names), often starting from `# your code here`
- Outputs: anything `display()`ed or `print()`ed or `plot()`ted, and the result of the last expression (e.g., `0.9664535356921388`, `0.4407325991753527`)
- Markdown cells: headings (`#` and the heading text), bullets (`- abc`), aka things to make your work look more professional
numpy, aka `np`, because it's canonically imported as `import numpy as np`.
numpy's `array` data type. Like a list, but:

- no `for` loops!
- `arange`: a `range` that makes arrays
- `zeros` / `ones` / `full`: make new arrays
- all ints, or all floats (examples sketched below)
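The examples that originally accompanied these bullets aren't preserved; a minimal sketch of array creation (values are illustrative):

```python
import numpy as np

np.arange(5)                # array([0, 1, 2, 3, 4]) -- all ints
np.arange(0.0, 1.0, 0.25)   # array([0.  , 0.25, 0.5 , 0.75]) -- all floats
np.zeros(3)                 # array([0., 0., 0.])
np.ones(3)                  # array([1., 1., 1.])
np.full(3, 7)               # array([7, 7, 7])
```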
`np.arange`: like `range`, but:

- makes arrays
- supports floats
- (no `for` loops!)

Elementwise operations (see the sketch below):

- array plus scalar
- array plus array
- applying a function to every element
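The code that followed each of these bullets isn't preserved; a minimal sketch with made-up values:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

a + 5        # array([6., 7., 8.])    -- array plus scalar
a + b        # array([11., 22., 33.]) -- array plus array
np.exp(a)    # exp applied to every element, no for loop
```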
Reduce the dimensionality of an array (e.g., summing over an axis)
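For example (illustrative values, not the slide's):

```python
import numpy as np

m = np.array([[1, 2, 3],
              [4, 5, 6]])
m.sum(axis=0)   # array([5, 7, 9]) -- collapse the rows: one sum per column
m.sum(axis=1)   # array([ 6, 15]) -- collapse the columns: one sum per row
```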
Suppose the true values are:
and two model predictions are:
MAE (mean absolute error): the average of the absolute differences
\[ \text{MAE} = \frac{1}{n} \sum_{i=1}^n |y_i - \hat{y}_i| \]
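The slide's actual values aren't shown above; here's a sketch of the computation with hypothetical numbers:

```python
import numpy as np

y_true = np.array([3.0, 5.0, 2.0])    # hypothetical true values
y_pred = np.array([2.5, 5.0, 4.0])    # hypothetical model predictions

mae = np.abs(y_true - y_pred).mean()  # (0.5 + 0.0 + 2.0) / 3 = 0.833...
```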
CS 375:
CS 376:
# inputs:
# - x_train (num_samples, num_features)
# - y_train (num_samples)
num_features_in = x_train.shape[1]
w = np.random.randn(num_features_in)
b = np.random.randn()
y_pred = x_train @ w + b
loss = ((y_train - y_pred) ** 2).mean()

Check-in question: what loss function is this?
y = x1*w1 + x2*w2 + x3*w3 + b
y = x @ W + b

Matmul (@) so we can process every example of x at once:

- x is 100 samples, each with 3 features (x.shape is (100, 3))
- W gives 4 outputs for each feature (W.shape is (3, 4))
- x @ W gives 100 samples, each with 4 outputs ((100, 4))
- b's shape?

# inputs:
# - x_train (num_samples, num_features_in)
# - y_train (num_samples, num_features_out)
num_features_in = x_train.shape[1]
num_features_out = y_train.shape[1]
W = np.random.randn(num_features_in, num_features_out)
b = np.random.randn(num_features_out)
y_pred = x_train @ W + b
loss = ((y_train - y_pred) ** 2).mean()

- W is now a 2-axis array: how much each input contributes to each output
- b is now a 1-axis array: a number to add to each output

# inputs:
# - x_train (num_samples, num_features)
# - y_train (num_samples, num_classes), one-hot encoded
num_features_in = x_train.shape[1]
num_classes = y_train.shape[1]
W = np.random.randn(num_features_in, num_classes)
b = np.random.randn(num_classes)
scores = x_train @ W + b
probs = softmax(scores, axis=1)  # e.g., from scipy.special import softmax
labels = y_train.argmax(axis=1)  # one-hot rows back to integer class indices
probs_of_correct = probs[np.arange(len(y_train)), labels]
loss = -np.log(probs_of_correct).mean()

Check-in question: what loss function is this?
A measure of relative skill:
Formal definition:
Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))
EloDiff = A Elo - B Elo
Suppose we have 3 chess players:
| player | Elo |
|---|---|
| A | 1000 |
| B | 2200 |
| C | 1010 |
A and B play. Who wins?
A and C play. Who wins?
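After discussing, here's a quick sketch (not from the original slides) that plugs the table's ratings into the formula:

```python
def p_win(elo_a, elo_b):
    """Probability that the first player wins, under the Elo formula."""
    return 1 / (1 + 10 ** (-(elo_a - elo_b) / 400))

p_win(1000, 2200)   # ~0.001: A almost never beats B
p_win(1000, 1010)   # ~0.486: A vs. C is close to a coin flip
```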
See nfelo
Elo probability formula:
Pr(A wins) = 1 / (1 + 10^(-EloDiff / 400))
- Start with scores (logits), which can be any numbers (positive, negative, whatever)
- xx = exp(logits) (logits.exp() in PyTorch); you could also use 10 ** logits or 2 ** logits
- probs = xx / xx.sum()
- logits + constant doesn't change the output.
- logits * constant does change the output.
- Special case of softmax when you just have one score (binary classification): use logits = [score, 0.0]
Exercise for practice: write this out in math and see if you can get it to simplify to the traditional way that the sigmoid function is written.
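A minimal numpy sketch of the recipe above, demonstrating the shift-invariance property (not from the original slides):

```python
import numpy as np

def softmax(logits):
    xx = np.exp(logits)
    return xx / xx.sum()

logits = np.array([2.0, -1.0, 0.5])
softmax(logits)           # roughly array([0.79, 0.04, 0.18])
softmax(logits + 100.0)   # same probabilities: adding a constant changes nothing
softmax(logits * 2.0)     # different (sharper) probabilities
```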
Chop off the negative part of its input.
y = max(0, x)
(Gradient is 1 for positive inputs, 0 for negative inputs)
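In numpy this is a one-liner (illustrative values):

```python
import numpy as np

x = np.array([-2.0, -0.5, 0.0, 1.0, 3.0])
np.maximum(0, x)   # array([0., 0., 0., 1., 3.])
```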
ReLU interactive (name: u04n00-relu.ipynb; show preview, open in Colab)
PyTorch
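The PyTorch code isn't preserved in these notes; here is a minimal sketch of an equivalent training step (mirroring the TensorFlow example below, with made-up data of 3 input features and 1 output):

```python
import torch

x = torch.randn(100, 3)   # 100 samples, 3 features (made-up data)
y = torch.randn(100, 1)   # 100 targets (made-up data)

model = torch.nn.Linear(3, 1)
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

optimizer.zero_grad()
y_pred = model(x)
loss = loss_fn(y_pred, y)
loss.backward()      # autograd computes gradients of the loss w.r.t. the parameters
optimizer.step()     # take one gradient-descent step
```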
TensorFlow (low-level)
import tensorflow as tf
import keras
model = keras.layers.Dense(1, input_shape=(3,))
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
# ...
with tf.GradientTape() as tape:
    y_pred = model(x)
    loss = loss_fn(y, y_pred)
grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))

Upshot: we can differentiate programs.
Work through how this is done in the Chapter 2 notebook
Use y = x @ W + b to compute a one-dimensional output y for each image.
Compare y with the desired number (0-9).

Discuss with neighbors:
We want a score for each digit (0-9), so we need 10 outputs.
Each output is 0 or 1 (“one-hot encoding”).
Discuss with neighbors:
Suppose the network predicted 1.5 for the correct output.
Suppose instead the network predicted 0.5 for the correct output.
What’s the loss in each case?
Fix: make outputs be probabilities.
Discuss with neighbors:
Negative log of the probability of the correct class.
Compute the loss for your running example.
The internal data structure of neural networks.
flowchart LR
A[Input] --> B[Feature Extractor]
B --> C[Linear Classifier]
C --> D[Output]
Example:
flowchart LR
A[Input] --> B["Pre-trained CNN"]
B --> C["Linear layer with 3 outputs"]
C --> D["Softmax"]
D --> E["Predicted probabilities"]
style B stroke-width:4px
The feature extractor constructs a representation of the input that’s useful for classification.
Embedding (noun): a vector representation of an object, constructed to be useful for some task (not necessarily human-interpretable). (verb): to construct such a representation.
Similar items get similar embeddings.
Similarity can be defined as:
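The slide's list of similarity measures isn't preserved here; one common choice (shown as an example, not necessarily what the slide listed) is cosine similarity:

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two embedding vectors (1.0 = same direction)."""
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
cosine_similarity(a, b)   # 1.0: parallel vectors are maximally similar
```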
(Note: some sources describe “embedding” as a specific lookup operation, but we’ll use it more broadly.)
Source: Jurafsky and Martin, Speech and Language Processing, 3rd ed.


See also: Word embeddings quantify 100 years of gender and ethnic stereotypes (Garg et al, PNAS 2018)
Source: GloVe project
Similar people end up with similar vectors because they like similar movies.