Train Simple Image Classifier¶
Task: Train a simple image classifier using logistic regression and cross-entropy loss
(Experimental transparency label for language-model assistance: @assist=Copilot @mode=predictive @level=0.1. This means I used GitHub Copilot as a predictive-text assistant and estimate that it was involved in generating around 10% of the characters here.)
Setup¶
from fastai.vision.all import *
if sys.platform == "darwin":
    # https://stackoverflow.com/a/64855500/69707
    import os
    os.environ['OMP_NUM_THREADS'] = '1'
Load up the MNIST dataset. It has 10 digits.
path = untar_data(URLs.MNIST)
Create a subset of the images, so we train faster. We do this by taking 500 random images of each digit.
set_seed(0)
num_imgs_per_digit = 500
items = L([
    p
    for split in ['training', 'testing']
    for digit in range(10)
    for p in (path/split/str(digit)).ls().shuffle()[:num_imgs_per_digit]
])
Create the dataloaders. We need a slightly special ImageBlock because we want grayscale images.
block = DataBlock(
    blocks=(ImageBlock(PILImageBW), CategoryBlock),
    get_y=parent_label,
    splitter=GrandparentSplitter(train_name='training', valid_name='testing'),
)
dataloaders = block.dataloaders(items, bs=16)
print(f"{dataloaders.train.n} training images, {dataloaders.valid.n} validation images")
5000 training images, 5000 validation images
Let's inspect a batch of data.
dataloaders.train.show_batch()
print(f"Available categories: {dataloaders.train.vocab}")
Available categories: ['0', '1', '2', '3', '4', '5', '6', '7', '8', '9']
Task¶
Let's make a neural network to predict which digit was written, using the raw pixel values. We'll keep it at a single layer today, so this is actually just a fancy way of doing logistic regression. But it'll give us a chance to work with minibatches and loss functions.
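To make the "fancy logistic regression" framing concrete, here is a small optional sketch (not part of the assignment; the demo_ names are just illustrative) of what the model we're about to build computes for a single flattened image:
W_demo = torch.randn(10, 784)        # one row of weights per digit class
b_demo = torch.zeros(10)             # one bias per digit class
x_demo = torch.rand(784)             # stand-in for one flattened 28x28 image
demo_logits = W_demo @ x_demo + b_demo   # 10 scores, one per digit
demo_probs = demo_logits.softmax(dim=-1) # softmax turns scores into probabilities that sum to 1
This is exactly multinomial (softmax) logistic regression on raw pixel values.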
First, let's get a batch of data to try out. (Later we'll do this in a loop.)
images, labels = dataloaders.train.one_batch()
images = cast(images, Tensor) # work around a fastai quirk
labels = cast(labels, Tensor)
images.shape
torch.Size([16, 1, 28, 28])
show_images(images)
What do the values look like? Let's check the range, then the mean and standard deviation.
images.max(), images.min()
(tensor(1.), tensor(0.))
images.mean(), images.std()
(tensor(0.1356), tensor(0.3120))
We'll flatten the images before giving them to the linear layer. They are 28 pixels square, so 28*28 = 784 pixels total.
An implication of this step is that we're throwing away the spatial structure of the image. Convolutional models like ResNet are able to use the spatial structure to improve performance.
flattener = nn.Flatten()
Let's show what the flattener does to our example batch. Make sure you can explain this shape.
flattener(images).shape
torch.Size([16, 784])
Step 1: Create a linear layer of the appropriate dimensionality.
- The classifier will be trying to predict which of 10 options (0 through 9) each digit is. Remember that each output of the linear layer is one logit. So think about how many logit outputs we need.
linear_1 = nn.Linear(in_features=..., out_features=..., bias=True)
# nn.Sequential just connects the output of one function into the input of the next.
# In this case we'll use it to connect the flattener to the linear layer.
model = nn.Sequential(
    flattener,
    linear_1,
)
In case we're running on a machine with a GPU, the data loader might put images on the GPU -- but the model is still on CPU. So we'll move the model to the same device as the images.
model.to(images.device)
Sequential(
  (0): Flatten(start_dim=1, end_dim=-1)
  (1): Linear(in_features=784, out_features=10, bias=True)
)
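If you want to double-check the device placement, here is a quick optional sketch (not part of the assignment):
# Both of these should print the same device (e.g., cpu or cuda:0).
print(next(model.parameters()).device)
print(images.device)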
Let's test our model on one batch of data by calling it and looking at what it returns.
Note: calling a model object basically just calls its forward method.
logits = model(images) # if this fails, check the input dimensionality of the model.
print("Logits shape:", logits.shape)
assert logits.shape == (16, 10) # if this fails, check the output dimensionality of the model.
Logits shape: torch.Size([16, 10])
Make sure you can explain why logits.shape contains those two numbers.
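To see that the earlier note about forward holds for our model, here is a small optional sketch (it assumes no hooks are registered, which is true here):
# Calling the module and calling .forward directly give the same result for this model.
assert torch.allclose(model(images), model.forward(images))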
Let's look at what the predictions of this (randomly initialized) model are for one image, in particular, the first one in the batch. First we look at the logits (what we sometimes call the score of each class).
logits[0]
tensor([ 0.1896, 0.1378, 0.0330, -0.1807, 0.0758, -0.0851, -0.1845, -0.1535,
-0.3112, 0.2156], grad_fn=<SelectBackward0>)
What prediction would this model make for this image? Let's look at the probabilities. (Note: since the model was initialized randomly, the specific values are meaningless here.)
probs = logits.softmax(dim=-1)
plt.barh(torch.arange(10), probs[0].detach().cpu()); plt.xlabel("Probability"); plt.ylabel("Digit");
# side note: the `detach` is needed because we don't need to take the gradient of this *plot*.
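As a quick sanity check (optional sketch), softmax should produce probabilities that sum to 1 for each image:
# Each row of probs is a distribution over the 10 digits, so each row should sum to (approximately) 1.
print(probs.sum(dim=-1))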
Was that correct? Unlikely, since the model was randomly initialized. To make a better model, we'll need to look at the labels. Make sure you can explain both the shape and values of labels.
labels
tensor([0, 9, 8, 6, 0, 6, 8, 3, 1, 6, 5, 0, 0, 7, 7, 9])
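The values are integer indices into the vocab we printed earlier. Here is a small sketch (assuming the labels batch from above) that decodes them back to category strings:
# Map each integer label back to its category string via the dataloaders' vocab.
print([dataloaders.train.vocab[i] for i in labels.tolist()])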
Now let's compute the cross-entropy loss. We'll use F.cross_entropy from PyTorch, which has the following basic signature (simplified somewhat from the official docs):
F.cross_entropy(
    logits: Tensor[Batch, Categories],  # the unnormalized scores of each class, for each item in the batch
    target: TensorCategory[Batch],      # the correct label index (an int) for each item in the batch
    reduction: str = 'mean',            # whether to return a single number for the average loss across the batch ('mean') or not ('none')
    label_smoothing: float = 0.0,       # how much label smoothing to apply (none by default)
)
Let's try it on our logits and labels for this batch.
loss = F.cross_entropy(logits, labels, reduction='none')
loss
tensor([2.1014, 2.2606, 1.9505, 2.4734, 2.4618, 2.4118, 2.1921, 2.2975, 2.1964,
2.4857, 2.2314, 2.2812, 2.3104, 2.3450, 2.3184, 2.2362],
grad_fn=<NllLossBackward0>)
loss.mean()
tensor(2.2846, grad_fn=<MeanBackward0>)
Side note: we could have computed the same result using the following code, but it's less numerically stable than just using F.cross_entropy directly.
# An alternative, less numerically stable way to compute the loss:
probs = logits.softmax(dim=-1)
F.nll_loss(probs.log(), labels, reduction='none').mean()
tensor(2.2846, grad_fn=<MeanBackward0>)
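The numerically stable way to get log-probabilities is F.log_softmax, which is essentially what F.cross_entropy combines with F.nll_loss internally:
# Stable equivalent: compute log-probabilities directly instead of log(softmax(...)).
log_probs = F.log_softmax(logits, dim=-1)
F.nll_loss(log_probs, labels, reduction='none').mean()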
We can use argmax to ask which category got the highest score for each image. This will be useful for computing a metric like accuracy.
predictions = logits.argmax(dim=1) # note: we could use `probs` instead of `logits`. Why?
print(predictions.shape)
predictions
torch.Size([16])
tensor([9, 2, 8, 8, 9, 7, 2, 8, 8, 1, 4, 5, 5, 4, 2, 2])
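For example, accuracy on this batch is just the fraction of images where the prediction matches the label (a small sketch; with random weights, expect roughly chance level):
# Compare predictions to labels and take the mean of the 0/1 matches.
batch_accuracy = (predictions == labels).float().mean()
print(batch_accuracy)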
Now, let's put the pieces together. Fill in the blanks in the code below to train the model. Make sure you understand what each line is doing.
# Make sure we don't accidentally reuse global variables from our example.
# This is a common source of bugs in Jupyter notebooks.
del images, labels, logits, loss, probs, predictions
num_epochs = 1  # increase this to 10 or 20 once your training loop is working
learning_rate = .1
losses = []
# Re-initialize the parameters of the model, so training restarts when this block starts.
linear_1.reset_parameters()
for epoch in range(num_epochs):
    # Keep track of some things for each epoch.
    total_images = 0
    total_correct = 0
    # Loop over the training data in batches.
    for images, labels in dataloaders.train:
        images = cast(images, Tensor)  # work around a quirk in fastai, ignore this
        labels = cast(labels, Tensor)
        logits = ...
        loss = ...
        # Take an SGD step.
        loss.backward()
        for parameter in model.parameters():
            parameter.data -= learning_rate * parameter.grad
        model.zero_grad()
        # Track metrics.
        predictions = logits.argmax(axis=1)
        num_accurate = (predictions == labels).sum()
        total_images += len(labels)
        total_correct += num_accurate
        # Track losses.
        # The .item method converts a 1-element tensor to a Python number.
        losses.append(loss.item())
    # Epoch done, print some stats.
    avg_loss_this_epoch = np.mean(losses[-len(dataloaders.train):])  # average over just this epoch's batches
    print(f"Epoch {epoch:2d}: loss={avg_loss_this_epoch:.2f}, train accuracy {total_correct:3d}/{total_images}")
# Plot the un-smoothed loss
#plt.plot(losses)
# Plot a smoothed version of the loss (easier to see the trend)
pd.Series(losses).ewm(alpha = .1).mean().plot()
plt.xlabel("Iteration")
plt.ylabel("Cross-Entropy Loss");
Epoch 0: loss=0.67, train accuracy 4109/4992
Let's inspect the weights of our trained network. Since we have a single layer, it's relatively easy to do this. First, look at the weights of the linear_1 layer:
linear_1.weight.shape
torch.Size([10, 784])
Now let's show the weights as an image. We'll need to reshape them to be 28x28 pixels. We'll use a color map where blue indicates highly positive weights, red indicates highly negative weights, and white indicates weight values that are close to 0. (You're welcome to try a different diverging colormap.)
weight_images = linear_1.weight.data.view((10, 28, 28))
with matplotlib.rc_context(rc={'image.cmap': 'RdBu'}):
    show_images(weight_images)
Analysis¶
Q1: Why is logits.shape 16 by 10?
your answer here
Q2: Before we trained the model (i.e., while it just had random weights), the cross-entropy loss was roughly the same number for every image. What was that number, and why? Hint:
np.log(10)
2.302585092994046
your answer here
Q3: Adjust the learning rate parameter. Give an example of a learning rate that is too high, one that is too low, and one that is good. For each, explain your answer by describing what the loss curve looks like; how do its shape and its values indicate good or bad training?
your answer here
Q4: Why do the weight images look the way they do? (Why might they look similar to the digits in question? Why might they not look exactly like the digits in question?)
your answer here
Extension¶
PyTorch gives us optimizer objects that do all the work of updating parameters. This not only saves code, it also lets us swap in fancier optimizers. See the torch.optim documentation.
- Compare this code with the code block above.
- Fill in the blanks in the same way and observe how you get the same result with fewer lines of code.
- Add weight_decay=0.01 to the SGD constructor. Rerun the chunk that visualizes the weights. What difference do you notice? What do you notice about the loss? Accuracy?
- Replace SGD with AdamW and compare the results.
num_epochs = 10
learning_rate = .1
losses = []
# Initialize the optimizer.
optimizer = torch.optim.SGD(params=model.parameters(), lr=learning_rate)
# Re-initialize the parameters of the model, so training restarts when this block starts.
linear_1.reset_parameters()
for epoch in range(num_epochs):
    # Keep track of some things for each epoch.
    total_images = 0
    total_correct = 0
    for images, labels in dataloaders.train:
        images = TensorBase(images)  # work around a quirk in fastai, ignore this
        logits = ...
        loss = ...
        # Take an SGD step.
        loss.backward()
        optimizer.step()
        model.zero_grad()
        # Track metrics.
        predictions = logits.argmax(axis=1)
        num_accurate = (predictions == labels).sum()
        total_images += len(labels)
        total_correct += num_accurate
        # Track losses.
        losses.append(loss.item())
    # Epoch done, print some stats.
    avg_loss_this_epoch = np.mean(losses[-len(dataloaders.train):])  # average over just this epoch's batches
    print(f"Epoch {epoch:2d}: loss={avg_loss_this_epoch:.2f}, train accuracy {total_correct:3d}/{total_images}")
# Plot the un-smoothed loss
#plt.plot(losses)
# Plot a smoothed version of the loss (easier to see the trend)
pd.Series(losses).ewm(alpha = .1).mean().plot()
plt.xlabel("Iteration")
plt.ylabel("Cross-Entropy Loss");
Comments from grading this assignment last year:
- Why 10 output features? One score (logit) for each of the 10 digits.
- Describe why you conclude that the loss curves are bad.
- Learning rates: the basic shape will typically be something like 1/x, but with differences in shape and, most significantly, ending values.
- Think more about why the weights sorta look like the digits in this case. See below.
- The weight images won't look exactly like a digit even at convergence--why?
Notes (mostly generic notes I'm giving to everyone):
- The initialization was good because the predicted probabilities before training ended up approximately equal across the 10 categories, so the model was rightly telling us "I don't know" before training. We could see that because the cross-entropy (log loss) was about -ln(1/10), meaning that whatever the true digit was, the classifier was giving it a probability of about 1/10. Because the loss wasn't huge early in training, the gradient updates weren't huge either.
- Look at both the shape of the loss curve and also the values it achieves. Note, for example, that both too-small and too-large learning rates end up with loss values that are much higher than what you see with a reasonable learning rate, though they take a different trajectory to get there. (Why?)
- Importantly: the problem with learning rates like 1.0 or 0.5 isn't the steep initial descent (actually that can be a good thing!), it's the fact that the weights jump around too much between batches so the weights never get close enough to their best values. You can see this because the loss jumps around too.
- The weight images do look like the digits because the dot product that linear_1 does is basically comparing them with the input images, pixel by pixel, to see how much overlap there is (see the short sketch after these notes).
- The weight images don't look like the digits because:
- they're trained on the differences between the digit images, not the images themselves (think about how this relates to the softmax properties we studied in hw5),
- all of the values could have been shifted up or down by a constant without affecting the result, so negative values aren't meaningful per se (if we had used any kind of regularization this wouldn't be true anymore), and
- some digits can be drawn in different ways. Without any hidden layers to build up representations, the only thing this network can do with digits that are drawn different ways is to take the average of the two.
- This model massively underfits--much better performance is possible. (But hm, look at the weight images... could it have also been overfitting? Yes you can have both at the same time.)
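To make the dot-product point above concrete, here is a small optional sketch. It assumes the variables from earlier cells (flattener, linear_1, model, and an images batch) are still defined; the logits for one image are just dot products between that flattened image and each of the ten weight images, plus a bias:
# Reproduce the model's logits for the first image in the batch "by hand".
x = flattener(images)[0]                                        # one flattened image, shape (784,)
manual_logits = linear_1.weight.data @ x + linear_1.bias.data   # ten dot products + biases
print(manual_logits)
print(model(images)[0])  # should match the manual computation (up to floating-point noise)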