“I trained a neural net classifier from scratch.”
A notebook cell shows anything display()ed, print()ed, or plot()ted, plus the result of its last expression.
Markdown (aka, things to make your work look more professional): # and the heading text for a heading, - abc for a list item, and code formatting for code (e.g., functions, variable names).
import numpy as np
from scipy.special import softmax

# inputs:
# - x_train (num_samples, num_features)
# - y_train (num_samples, num_classes), one-hot encoded
num_features = x_train.shape[1]
num_classes = y_train.shape[1]
W = np.random.randn(num_features, num_classes)   # random initial weights
b = np.random.randn(num_classes)                 # random initial biases
scores = x_train @ W + b                         # (num_samples, num_classes)
probs = softmax(scores, axis=1)                  # each row sums to 1
correct_class = y_train.argmax(axis=1)           # one-hot rows -> class indices
probs_of_correct = probs[np.arange(len(y_train)), correct_class]
loss = -np.log(probs_of_correct).mean()

Check-in question: what loss function is this?
Chop off the negative part of its input.
y = max(0, x)
(Gradient is 1 for positive inputs, 0 for negative inputs)
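A minimal NumPy sketch of ReLU and its gradient:

import numpy as np

def relu(x):
    # Chop off the negative part: max(0, x), elementwise.
    return np.maximum(0, x)

def relu_grad(x):
    # Gradient of ReLU: 1 where x > 0, 0 elsewhere.
    return (x > 0).astype(float)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(relu(x))       # [0., 0., 0., 0.5, 2.]
print(relu_grad(x))  # [0., 0., 0., 1., 1.]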
ReLU interactive (name: u04n00-relu.ipynb; show preview, open in Colab)
PyTorch
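An assumed minimal PyTorch sketch of one training step, mirroring the TensorFlow version below (x and y stand for a batch of training inputs and targets):

import torch

model = torch.nn.Linear(3, 1)                            # 3 inputs -> 1 output
loss_fn = torch.nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
# ...
y_pred = model(x)             # forward pass
loss = loss_fn(y_pred, y)
optimizer.zero_grad()
loss.backward()               # autograd computes gradients
optimizer.step()              # update the weights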
TensorFlow (low-level)
import tensorflow as tf
import keras

model = keras.layers.Dense(1, input_shape=(3,))   # 3 inputs -> 1 output
loss_fn = keras.losses.MeanSquaredError()
optimizer = keras.optimizers.SGD()
# ...
with tf.GradientTape() as tape:    # record operations for autodiff
    y_pred = model(x)              # forward pass
    loss = loss_fn(y, y_pred)
grads = tape.gradient(loss, model.trainable_weights)
optimizer.apply_gradients(zip(grads, model.trainable_weights))

Upshot: we can differentiate programs.
Work through how this is done in the Chapter 2 notebook
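As a standalone illustration (separate from the notebook), TensorFlow can differentiate an ordinary computation:

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:
    y = x ** 2                # an ordinary computation
dy_dx = tape.gradient(y, x)   # autodiff gives dy/dx = 2x
print(dy_dx.numpy())          # 6.0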
Suppose we use y = x @ W + b to compute a one-dimensional output y for each image, and measure the loss by comparing y with the desired number (0-9).
Discuss with neighbors:
We want a score for each digit (0-9), so we need 10 outputs.
Each output is 0 or 1 (“one-hot encoding”).
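A minimal NumPy sketch of one-hot encoding, assuming digit labels 0-9:

import numpy as np

labels = np.array([3, 0, 9])   # assumed example digit labels
one_hot = np.eye(10)[labels]   # shape (3, 10); row i has a 1 in column labels[i]
print(one_hot[0])              # [0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]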
Discuss with neighbors:
Suppose the network predicted 1.5 for the correct output.
Suppose instead the network predicted 0.5 for the correct output.
What’s the loss in each case?
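One way to check, assuming the loss is squared error against a target of 1 for the correct output:

(1.5 - 1) ** 2   # = 0.25
(0.5 - 1) ** 2   # = 0.25: the same penalty for overshooting as for undershooting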
Fix: make outputs be probabilities.
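For example, the softmax function turns arbitrary scores into probabilities; a minimal sketch with assumed scores:

import numpy as np

scores = np.array([2.0, 1.0, 0.1])             # assumed raw scores for 3 classes
probs = np.exp(scores) / np.exp(scores).sum()  # softmax: nonnegative, sums to 1
print(probs)                                   # roughly [0.66, 0.24, 0.10]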
Discuss with neighbors:
Negative log of the probability of the correct class.
Compute the loss for your running example.
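A minimal worked example, with an assumed probability for the correct class (your running example will differ):

import numpy as np

p_correct = 0.24           # assumed probability the model assigns to the correct class
loss = -np.log(p_correct)  # about 1.43
# The loss is 0 when p_correct = 1 and grows without bound as p_correct approaches 0.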
The internal data structure of neural networks.
flowchart LR
A[Input] --> B[Feature Extractor]
B --> C[Linear Classifier]
C --> D[Output]
Example:
flowchart LR
A[Input] --> B["Pre-trained CNN"]
B --> C["Linear layer with 3 outputs"]
C --> D["Softmax"]
D --> E["Predicted probabilities"]
style B stroke-width:4px
The feature extractor constructs a representation of the input that’s useful for classification.
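A sketch of that pipeline in Keras, assuming MobileNetV2 as the pre-trained CNN:

import keras

backbone = keras.applications.MobileNetV2(
    weights="imagenet", include_top=False, pooling="avg")  # pre-trained feature extractor
backbone.trainable = False                                 # keep its weights fixed

model = keras.Sequential([
    backbone,                                    # input -> embedding vector
    keras.layers.Dense(3, activation="softmax")  # linear layer with 3 outputs + softmax
])
# model(images) returns predicted probabilities over the 3 classes.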
Embedding (noun): a vector representation of an object, constructed to be useful for some task (not necessarily human-interpretable); (verb): to construct such a representation.
Similar items get similar embeddings.
Similarity can be defined in several ways, for example by the dot product or cosine similarity of the embedding vectors.
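For instance, a minimal NumPy sketch of cosine similarity between embedding vectors (the vectors here are made up):

import numpy as np

def cosine_similarity(a, b):
    # 1.0 for vectors pointing the same way, 0.0 for orthogonal vectors.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

cat = np.array([0.9, 0.1, 0.4])     # assumed example embeddings
dog = np.array([0.8, 0.2, 0.5])
car = np.array([-0.1, 0.9, 0.0])
print(cosine_similarity(cat, dog))  # high (similar items)
print(cosine_similarity(cat, car))  # lower (dissimilar items)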
(Note: some sources describe “embedding” as a specific lookup operation, but we’ll use it more broadly.)
Source: Jurafsky and Martin, Speech and Language Processing, 3rd ed.


See also: Word embeddings quantify 100 years of gender and ethnic stereotypes (Garg et al., PNAS 2018)
Source: GloVe project
Similar people end up with similar vectors because they like similar movies.
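A toy sketch of that idea, with made-up two-dimensional embeddings: a person's predicted affinity for a movie is the dot product of their vectors, so people with similar tastes end up near each other.

import numpy as np

# Made-up embeddings; the dimensions might end up meaning something like
# "likes sci-fi" and "likes romance", but nothing forces them to be interpretable.
alice = np.array([0.9, 0.1])
bob = np.array([0.8, 0.2])
scifi_movie = np.array([0.95, 0.05])
romance_movie = np.array([0.1, 0.9])

print(alice @ scifi_movie)    # high: Alice is predicted to like this movie
print(alice @ romance_movie)  # low
print(alice @ bob / (np.linalg.norm(alice) * np.linalg.norm(bob)))  # Alice and Bob: similar vectors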