In this notebook, I compare several image classification models, built with PyTorch and trained on the MNIST dataset of handwritten digits. To practice the basic concepts of deep learning, I write my own training and evaluation loops.
I train a total of three models:
| Model | Train Accuracy | Test Accuracy |
|---|---|---|
| Linear | 0.92 | 0.92 |
| Two-layer | 0.98 | 0.97 |
| Two-layer (tuned) | 0.99 | 0.98 |
import torch
from torch import nn
%matplotlib inline
import matplotlib.pyplot as plt
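The data-loading cell isn't shown in this export. Here is a minimal sketch of what it might look like, assuming the standard torchvision loaders; the normalization constants are the commonly quoted MNIST mean and standard deviation, and the batch sizes are my choice:

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

# Convert images to tensors and normalize; flattening happens inside the model.
transform = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize((0.1307,), (0.3081,)),  # widely used MNIST statistics
])

train_data = datasets.MNIST("data", train=True, download=True, transform=transform)
test_data = datasets.MNIST("data", train=False, download=True, transform=transform)

train_loader = DataLoader(train_data, batch_size=64, shuffle=True)
test_loader = DataLoader(test_data, batch_size=1000)
```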
We'll be training several models, so I'll define one function to train a model and another to evaluate it.
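The original definitions are elided in this export, so below is a minimal sketch of what the two helpers could look like; the names `train` and `evaluate`, and the batch-averaged bookkeeping, are my assumptions:

```python
def train(model, loader, loss_fn, optimizer):
    """Run one epoch of training and return the mean loss over batches."""
    model.train()
    total_loss = 0.0
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
        total_loss += loss.item()
    return total_loss / len(loader)

@torch.no_grad()
def evaluate(model, loader, loss_fn):
    """Return (mean loss, accuracy) of the model on the given data."""
    model.eval()
    total_loss, correct = 0.0, 0
    for x, y in loader:
        logits = model(x)
        total_loss += loss_fn(logits, y).item()
        correct += (logits.argmax(dim=1) == y).sum().item()
    return total_loss / len(loader), correct / len(loader.dataset)
```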
We define a simple linear model with no hidden layers. The input is a 28x28 image, which we flatten into a 784-dimensional vector. The output is a 10-dimensional vector, where each element is an unnormalized score (logit) for one of the ten digit classes; applying a softmax to these scores would give class probabilities.
We need to decide whether the linear layer should have a bias term. A bias could help if some classes were more common than others, but the MNIST training set is roughly balanced across the ten digits, so we omit it.
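A sketch of this model, assuming the `nn.Sequential` style (the variable name `linear_model` is mine):

```python
linear_model = nn.Sequential(
    nn.Flatten(),                        # (N, 1, 28, 28) -> (N, 784)
    nn.Linear(28 * 28, 10, bias=False),  # one logit per digit class, no bias
)
```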
We use an SGD optimizer with a learning rate of 0.01 and train for 10 epochs, meaning ten full passes over the entire training set.
# the code ...
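The cell above is elided; given the hyperparameters just described and the hypothetical `train` helper sketched earlier, it would look roughly like this (cross-entropy as the loss is my assumption):

```python
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(linear_model.parameters(), lr=0.01)

epoch_losses = []
for epoch in range(10):
    epoch_losses.append(train(linear_model, train_loader, loss_fn, optimizer))

# Learning curve: mean training loss per epoch.
plt.plot(epoch_losses)
plt.xlabel("epoch")
plt.ylabel("training loss")
```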
*(Plot: training loss per epoch)*
The loss stopped decreasing over the last few epochs, so we can be fairly confident that this network has converged. We reached a training loss of 0.26 and a training accuracy of 0.92.
To make sure we're not overfitting, we evaluate the model on the test set.
...
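With the hypothetical `evaluate` helper from earlier, that check is a one-liner:

```python
test_loss, test_acc = evaluate(linear_model, test_loader, loss_fn)
print(f"test loss: {test_loss:.2f}, test accuracy: {test_acc:.2f}")
```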
The choice of hidden dimension seemed arbitrary, so I wanted to try out a few different values.
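The sweep itself is elided here; a sketch of how it might look, where the two-layer architecture and the candidate sizes are my assumptions:

```python
def make_two_layer(hidden_dim):
    # 784 -> hidden_dim -> 10, with a ReLU non-linearity in between
    return nn.Sequential(
        nn.Flatten(),
        nn.Linear(28 * 28, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, 10),
    )

for hidden_dim in [50, 100, 200]:  # candidate values (assumption)
    model = make_two_layer(hidden_dim)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
    for epoch in range(10):
        train(model, train_loader, loss_fn, optimizer)
    loss, acc = evaluate(model, test_loader, loss_fn)
    print(f"hidden_dim={hidden_dim}: test loss {loss:.2f}, test accuracy {acc:.2f}")
```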
Overall, the two-layer networks clearly outperformed the linear model: the tuned two-layer network reached 0.98 test accuracy and the untuned one 0.97, while the linear model, though the simplest, was the least accurate at 0.92.
Results table, with both loss and accuracy:
| Model | Train Loss | Train Accuracy | Test Loss | Test Accuracy |
|---|---|---|---|---|
| Linear | 0.26 | 0.92 | 0.26 | 0.92 |
| Two-layer | 0.08 | 0.98 | 0.10 | 0.97 |
| Two-layer (tuned) | 0.04 | 0.99 | 0.07 | 0.98 |