After this course, I will be able to say:
“I trained a neural net classifier from scratch.”
You might have heard (or experienced) that Python is slow. So how can Python be the language behind basically all of the recent advances in AI, which all require huge amounts of computing? The secret is array computing. The Python code is orchestrating operations that happen on powerful “accelerator” hardware like GPUs and TPUs. Those operations typically involve repeatedly applying an operation to big (usually rectangular) arrays of numbers, hence, array computing.
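To see the difference in style, here's a tiny sketch (not from the course notebooks) of the same computation written first as a Python loop and then as a single array operation in PyTorch:

```python
import torch

x = torch.arange(100_000, dtype=torch.float64)

# Loop version: Python interprets every iteration, one element at a time.
total = 0.0
for value in x:
    total += value.item() ** 2

# Array version: one call; the work runs in fast compiled code (possibly on a GPU).
total_fast = (x ** 2).sum()

print(total)              # same answer either way...
print(total_fast.item())  # ...but the loop is dramatically slower
```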
For those used to writing loops, this sort of coding can take some getting used to. Here are two exercises that previous students have found very helpful in getting their minds around how arrays work in PyTorch. (The concepts are basically identical in other libraries like TensorFlow, NumPy, and JAX.)
(name: u02n1-pytorch.ipynb; show preview, open in Colab)
for loop approach
The reference below is an AI-generated summary of the material in the notebook.
A dot product is a fundamental operation in neural networks, particularly in linear (Dense) layers. Key concepts:
- Basic form: y = w1*x1 + w2*x2 + ... + wN*xN + b
- Each input x[i] is multiplied by its corresponding weight w[i]
- b is the bias (can be omitted for simplicity)
- Ways to compute it in PyTorch: torch.dot(w, x), w @ x, or (w * x).sum() (elementwise multiply w * x, then sum)

A linear transformation is the basic building block of neural networks:

- y = w*x + b
- w represents weights
- b represents bias
- w * x multiplies corresponding elements

Common reduction methods:

- sum(): Adds all elements
- mean(): Computes average
- max(): Finds maximum value
- argmax(): Finds index of maximum value
- Can be called as methods (x.sum()) or functions (torch.sum(x))
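Here's a minimal sketch (with made-up numbers) of those operations in PyTorch:

```python
import torch

w = torch.tensor([0.5, -1.0, 2.0])   # weights
x = torch.tensor([1.0, 2.0, 3.0])    # inputs
b = torch.tensor(0.1)                # bias

# Three equivalent ways to compute the dot product of w and x
y1 = torch.dot(w, x)
y2 = w @ x
y3 = (w * x).sum()        # elementwise multiply, then reduce with sum()

y = y1 + b                # linear transformation: y = w·x + b

# Reductions can be called as methods or as functions
print(x.sum(), x.mean(), torch.max(x), x.argmax())
```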
Mean squared error (MSE) is a common error metric for regression tasks.
Formula: MSE = (1/n)Σ(y_true - y_pred)²
Implementation steps:
- y_true - y_pred
- (y_true - y_pred)**2
- ((y_true - y_pred)**2).mean()

PyTorch provides built-in implementations:

- F.mse_loss(y_pred, y_true)
- nn.MSELoss()

Reductions can take an axis parameter: x.sum(axis=1) sums along axis 1.

The simplest “neural computation” model is linear regression. We’ll implement it today so that we can understand each part of how it works.
To keep things simple we’ll use a “black box” optimizer, a function that just takes the data and the loss function and finds the best values of the parameters.
Later we’ll study how to optimize a function using stochastic gradient descent, and how to compute the gradients of the loss function with respect to the parameters using backpropagation.
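To make this concrete, here's a minimal sketch (not the notebook's code) of single-feature linear regression: the MSE loss is computed step by step as described above, and scipy.optimize.minimize stands in as the "black box" optimizer that finds the best values of w and b for us.

```python
import numpy as np
from scipy.optimize import minimize

# Toy data: y is roughly 3*x + 1 plus a little noise (made-up example)
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=50)
y_true = 3 * x + 1 + 0.1 * rng.normal(size=50)

def mse_loss(params):
    w, b = params
    y_pred = w * x + b                        # the linear model
    return ((y_true - y_pred) ** 2).mean()    # difference -> square -> mean

# The "black box": hand it the loss function and a starting guess,
# get back the parameter values that minimize the loss.
result = minimize(mse_loss, x0=[0.0, 0.0])
print(result.x)   # should be close to [3, 1]
```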
Go to the interactive figures for the Understanding Deep Learning book.
Work through this notebook to practice linear regression with a single feature.
(name: u03n1-linreg-manual.ipynb; show preview, open in Colab)

To reinforce our understanding of linear regression and extend it to multiple features, we’ll work through this notebook:
(name: u04n1-multi-linreg-manual.ipynb; show preview, open in Colab)

- softmax input: a vector x of shape (n,)
- softmax output: a vector y of shape (n,) where y[i] is the probability that x is in class i
- y[i] = exp(x[i]) / sum(exp(x))

Jargon:
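Here's a minimal sketch of that formula (with made-up logits), alongside PyTorch's built-in softmax:

```python
import torch
import torch.nn.functional as F

x = torch.tensor([1.0, 2.0, -1.0])     # logits (the inputs to softmax)

y_manual = x.exp() / x.exp().sum()     # y[i] = exp(x[i]) / sum(exp(x))
y_builtin = F.softmax(x, dim=0)        # built-in version

print(y_manual)         # a probability for each class
print(y_builtin)        # should match the manual version
print(y_manual.sum())   # the probabilities sum to 1
```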
Open the softmax and cross-entropy interactive demo that Prof Arnold created.
Try adjusting the logits (the inputs to softmax) and get a sense for how the outputs change. Describe the outputs when:
Finally, describe the input that gives the largest possible value for output 1.
Softmax, part 1
(name: u04n2-softmax.ipynb; show preview, open in Colab)
- logistic_regression input: an array X of shape (samples, features)
- logistic_regression output: an array y of shape (samples, classes) where y[i, j] is the probability that sample i is in class j
- logits = X @ W + b, where W is an array of shape (features, classes) and b is an array of shape (classes,)
- logits is then passed through the softmax function to get the output probabilities: y = softmax(logits)
- softmax(x) = exp(x) / sum(exp(x)), where the sums are taken across the classes
- loss_i = -sum(y_true_onehot_i * log(y_pred_i))

Jargon:
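Putting those pieces together, here's a minimal sketch (shapes and values are made up) of the logistic regression forward pass and the cross-entropy loss in PyTorch:

```python
import torch
import torch.nn.functional as F

samples, features, classes = 5, 2, 3
X = torch.randn(samples, features)
W = torch.randn(features, classes)
b = torch.zeros(classes)
labels = torch.tensor([0, 2, 1, 1, 0])    # the true class index for each sample

logits = X @ W + b                        # shape: (samples, classes)
y_pred = F.softmax(logits, dim=1)         # probabilities; each row sums to 1

# Cross-entropy written out with one-hot labels...
y_true_onehot = F.one_hot(labels, num_classes=classes).float()
loss_manual = -(y_true_onehot * y_pred.log()).sum(dim=1).mean()

# ...and with the built-in, which applies softmax internally, so it takes raw logits
loss_builtin = F.cross_entropy(logits, labels)
print(loss_manual, loss_builtin)          # these should match
```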
Imports:
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
Building a model object with the desired architecture (structure):
model = nn.Linear(in_features=2, out_features=3, bias=True)
# or
class Model(nn.Module):
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(in_features=2, out_features=3, bias=True)

    def forward(self, x):
        return self.linear(x)

model = Model()

# or
n_hidden = 100
model = nn.Sequential(
    nn.Linear(in_features=2, out_features=n_hidden, bias=True),
    nn.ReLU(),
    nn.Linear(in_features=n_hidden, out_features=3, bias=True),
)
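Whichever form you choose, a quick shape check (with a made-up input batch) confirms that the model maps 2 input features to 3 outputs:

```python
import torch
import torch.nn as nn

model = nn.Linear(in_features=2, out_features=3, bias=True)  # any of the versions above works the same way
x = torch.randn(5, 2)     # a batch of 5 samples, 2 features each
print(model(x).shape)     # torch.Size([5, 3])
```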
Training a model:
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)  # lr is the learning rate (step size)

# in a training loop ...
optimizer.zero_grad()           # clear gradients left over from the previous step
y_pred = model(x)               # forward pass
loss = loss_fn(y_pred, y_true)  # how far off are the predictions?
loss.backward()                 # compute gradients of the loss w.r.t. the parameters
optimizer.step()                # update the parameters
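Here's a minimal end-to-end sketch putting those pieces into a complete loop (toy random data; the sizes, learning rate, and number of steps are made up):

```python
import torch
import torch.nn as nn

# Toy regression data: 100 samples, 2 features in, 3 outputs
x = torch.randn(100, 2)
y_true = torch.randn(100, 3)

model = nn.Linear(in_features=2, out_features=3, bias=True)
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for step in range(1000):
    optimizer.zero_grad()             # clear old gradients
    y_pred = model(x)                 # forward pass
    loss = loss_fn(y_pred, y_true)    # how far off are we?
    loss.backward()                   # compute gradients
    optimizer.step()                  # nudge the parameters downhill

print(loss.item())
```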
Given on paper.
We’re classifying houses as low/medium/high price based on longitude and latitude using logistic regression. The model outputs 3 scores, one for each class. For 100 houses (processed all at once in a “batch” of samples):
a. What shape is X? X.shape =
b. What shape should W (the array of weights) be? W.shape =
c. What shape should b (the array of biases) be? b.shape =
d. What shape will the output have? (X @ W + b).shape =
For one house, if our model outputs scores [1.0, 2.0, -1.0] for low/med/high prices:
Write the steps to convert these scores to probabilities that sum to 1. (You can use words or math notation.)
If the true label for this house is “medium”, what’s the model’s accuracy and loss for this house? (You can use words or math notation.)
From Linear Regression in NumPy to Logistic Regression in PyTorch
(name: u04n3-logreg-pytorch.ipynb; show preview, open in Colab)
Try the Gradient Game: How few calls do you need to get the Loss small? How do you do it?
MNIST with PyTorch
(name: u06n1-mnist-torch.ipynb; show preview, open in Colab)
Example: Sentence Embeddings
(name: u08s1-sentence-embeddings.ipynb; show preview, open in Colab)
Slides (including graphics)
(name: u07n1-image-embeddings.ipynb; show preview, open in Colab)

We’ll discuss this much more in CS 376, but here are some ideas for further exploration:
grad(f, x): the gradient of the function f with respect to its input x.
PyTorch uses automatic differentiation to compute gradients.
Example:
import torch
# Let's call it "w" as if it were a weight in a neural network
w = torch.tensor(2.0, requires_grad=True)
y = w**2
y.backward()
print(w.grad)  # tensor(4.), since dy/dw = 2*w = 4
After calling y.backward(), the gradient of y with respect to w is stored in w.grad.
If we want to minimize a function $f(w)$, we can use gradient descent: repeatedly update $w \leftarrow w - \alpha \cdot \mathrm{grad}(f, w)$, where $\alpha$ is a small step size (the learning rate).
If the function depends on some data (e.g., it’s the loss of a neural network computed on a batch of data), we often use stochastic gradient descent (SGD): the same update, but with the gradient computed on a small randomly chosen batch of data at each step rather than on the whole dataset.
We call it stochastic because it uses a stochastic estimate of the gradient.
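Here's a minimal sketch of that update rule using PyTorch's autograd (the function, learning rate, and number of steps are made up):

```python
import torch

w = torch.tensor(0.0, requires_grad=True)
lr = 0.1                                   # learning rate (step size)

for step in range(100):
    loss = (w - 3) ** 2                    # a toy function whose minimum is at w = 3
    loss.backward()                        # compute the gradient; it lands in w.grad
    with torch.no_grad():
        w -= lr * w.grad                   # the gradient descent update
    w.grad.zero_()                         # clear the gradient for the next step

print(w.item())                            # should be close to 3
```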
Suppose we’re trying to minimize a function $f(p) = p^2 + 2p + 1$ using gradient descent:
from random import random  # random() returns a starting value in [0, 1)

def f(p):
    return p**2 + 2*p + 1

def grad_f(p):
    # gradient of f with respect to p
    return ______________________

print(f(3))       # ______
print(grad_f(3))  # ______

# fill in the blank to minimize
p = random()
for i in range(100):
    p = ______________________
What should we put in the blanks to minimize the function?
Compute gradients using PyTorch
(name: u06n2-compute-grad-pytorch.ipynb; show preview, open in Colab)