In this unit, after reviewing where we’ve been, we push towards state-of-the-art models (still focusing on computer vision). We’ll first show how our work over the last two weeks connects to the pre-trained models we used in the opening weeks. Then, we’ll introduce or revisit tools that allow our models to achieve high performance, such as data augmentation and regularization.
The process of completing this assignment will improve your ability to:
Along the way, we’ll participate in a Kaggle competition, so you’ll get to practice with that.
Load up the classifier you trained in Homework 1. Use it to make predictions on a set of images collected by others in the class. You’ll do this by participating in a Kaggle competition.
Click the link provided in Moodle to join the Kaggle competition. Then make a copy of your Homework 1 notebook (in Google Colab, File → Save a copy) to use as your starting point for this assignment.
Download the competition dataset into your Colab notebook:
```python
import urllib.request
import zipfile
from pathlib import Path

competition_url = "https://students.cs.calvin.edu/~ka37/letter-images-26sp.zip"
competition_dir = Path("./data/competition")
competition_dir.mkdir(parents=True, exist_ok=True)

archive_path = competition_dir / "letter-images-26sp.zip"
if not archive_path.exists():
    print(f"Downloading {competition_url}...")
    urllib.request.urlretrieve(competition_url, archive_path)
    with zipfile.ZipFile(archive_path, "r") as z:
        z.extractall(competition_dir)
    print("Done.")
```
Then check what’s inside:
```
!ls {competition_dir}
```
You should see folders and CSV files similar to:
```
sample_submission.csv  test/  test.csv  train/  train.csv
```
The competition images are in a flat folder with a CSV file mapping filenames to labels (not sorted into class subfolders like your Homework 1 data). Here’s how to load them:
```python
import pandas as pd
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset, DataLoader

valid_df = pd.read_csv(competition_dir / 'train.csv').sort_values('filename')
test_df = pd.read_csv(competition_dir / 'test.csv').sort_values('filename')

class CSVImageDataset(Dataset):
    """Load images from a flat folder using a CSV for labels."""
    def __init__(self, df, image_dir, transform, class_names):
        self.df = df.reset_index(drop=True)
        self.image_dir = Path(image_dir)
        self.transform = transform
        self.class_names = class_names

    def __len__(self):
        return len(self.df)

    def __getitem__(self, idx):
        row = self.df.iloc[idx]
        image = Image.open(self.image_dir / row['filename']).convert('RGB')
        image = self.transform(image)
        if 'label' in row and pd.notna(row['label']):
            label = self.class_names.index(row['label'])
            return image, label
        return image, -1  # test set has no labels

valid_dataset = CSVImageDataset(valid_df, competition_dir / 'train', data_transforms, class_names)
valid_dataloader = DataLoader(valid_dataset, batch_size=config.batch_size, shuffle=False)
```
Here data_transforms and class_names should be the same ones from your HW1 notebook (class_names should be ['a', 'b', 'c']).
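Before moving on to test predictions, it’s worth sanity-checking your model on this labeled split, since it was collected by different people than your own training data. Here’s a minimal accuracy helper; the demo model and data at the bottom are stand-ins so the snippet runs on its own, and in your notebook you’d instead pass your trained model, valid_dataloader, and device:

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

def accuracy(model, dataloader, device):
    """Fraction of examples the model labels correctly."""
    correct = total = 0
    model.eval()
    with torch.no_grad():
        for inputs, labels in dataloader:
            inputs, labels = inputs.to(device), labels.to(device)
            preds = model(inputs).argmax(dim=1)
            correct += (preds == labels).sum().item()
            total += labels.numel()
    return correct / total

# Self-contained demo with a tiny untrained model and random data
# (stand-ins for your real model and valid_dataloader):
demo_model = nn.Linear(4, 3)
demo_data = TensorDataset(torch.randn(8, 4), torch.randint(0, 3, (8,)))
acc = accuracy(demo_model, DataLoader(demo_data, batch_size=4), torch.device("cpu"))
print(f"validation accuracy: {acc:.2f}")
```

If this accuracy is much lower than what you saw on your own validation split in Homework 1, that’s worth noting in your analysis.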
```python
test_dataset = CSVImageDataset(test_df, competition_dir / 'test', data_transforms, class_names)
test_dataloader = DataLoader(test_dataset, batch_size=config.batch_size, shuffle=False)

# Get predictions (model and device come from your HW1 notebook)
test_predictions = []
model.eval()
with torch.no_grad():
    for inputs, _ in test_dataloader:
        inputs = inputs.to(device)
        outputs = model(inputs)
        preds = outputs.argmax(dim=1).cpu().numpy()
        test_predictions.extend(preds)

# Map predictions to class names and save
test_df['label'] = [class_names[p] for p in test_predictions]
test_df[['id', 'label']].to_csv('submission.csv', index=False)
```

This saves your predictions as submission.csv. Upload it to the Kaggle competition page. Name your submission “Homework 1 baseline” or the like. Write down in your analysis how well your baseline does on the leaderboard.
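Kaggle will reject a submission whose format doesn’t match sample_submission.csv, so a quick local check can save you a wasted upload. Here’s a sketch of such a check; the expected column names and labels are assumptions based on the files above, and the in-memory demo at the bottom just makes the snippet self-contained (in your notebook, call check_submission('submission.csv')):

```python
import io
import pandas as pd

def check_submission(csv):
    """Basic format checks before uploading to Kaggle."""
    df = pd.read_csv(csv)
    assert list(df.columns) == ['id', 'label'], f"unexpected columns: {list(df.columns)}"
    assert df['label'].isin(['a', 'b', 'c']).all(), "labels must be a, b, or c"
    assert df['id'].is_unique, "duplicate ids"
    return len(df)

# Demo on a tiny in-memory example instead of a real submission file:
n = check_submission(io.StringIO("id,label\n0,a\n1,b\n2,c\n"))
print(f"{n} rows, format looks OK")
```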
Submit your Homework 3 notebook to Moodle. You don’t need to submit a revised Homework 1 notebook, but make sure that your Homework 3 notebook includes details of what you changed in that notebook.
Your notebook should include:
Possible things to adjust:
Think about what other sources of variation might come up, and how you might be systematic about them.
We didn’t write code for image augmentation in the lab, but it’s straightforward with torchvision.transforms. In your Homework 1 notebook, create an augmentation pipeline that applies random transformations before the standard preprocessing:

```python
from torchvision import transforms

augmentation_transforms = transforms.Compose([
    transforms.RandomHorizontalFlip(),
    transforms.RandomRotation(15),
    transforms.ColorJitter(brightness=0.2, contrast=0.2),
    transforms.Resize((config.image_size, config.image_size)),
    # then apply the same normalization as your pretrained model expects:
    config.pretrained_weights.transforms(crop_size=config.image_size),
])
```
Then create a new version of your training dataset that uses these augmented transforms:
```python
from torchvision import datasets

train_dataset_aug = datasets.ImageFolder(
    root=your_data_path,
    transform=augmentation_transforms,
)
```
Because the random transforms are applied each time an image is loaded, each epoch will see different augmented versions of your training images.
I suggest visualizing some example batches from the augmented dataset to make sure the augmentation is working as you expect. When you train the model, use the augmented dataset instead of the original.
Think carefully about which augmentations make sense for handwritten letters. For example, would a horizontal flip be appropriate for distinguishing ‘b’ from ‘d’?
None yet this year.
Try the Gradient Game: How few calls do you need to get the Loss small? How do you do it?
MNIST with PyTorch (name: u06n1-mnist-torch.ipynb; show preview, open in Colab)
Some stragglers:
Critical Theory has monopolized ethical discussions in many areas of society lately, so many ethics researchers tend to highlight aspects of their work that relate to power differentials across race and gender. But other aspects are also important; just a few examples include the relationship of AI moderation to freedom of speech, the environmental impact of AI, the existential risks that we may be taking in developing more powerful AI technology, and systems that optimize themselves to hold our attention.
There are some things in our book that the fastai people make a much bigger deal about than most researchers and practitioners. It’s a lot of reading and stuff to understand, so I’m trying to help you focus on the parts that are going to pay most dividends down the line.
Yes!
Generally very much so. They’re trained with lots of noise added intentionally, so little differences in floating point behavior don’t tend to matter much.
The abs function should work as a nonlinearity, but my intuition is it would be harder to learn (because the effect of increasing an activation flips when you’re on the other side of 0).
A linear layer.
The Wikipedia article is actually pretty good here.
They used to have to be, but modern networks can work with any size. It’s still more efficient to run a batch of images at the same size through at the same time though.
Piecewise linear approximation.
See the Glossary.