Learning Machines

Outcomes

The process of completing this assignment will improve your ability to:

Explain the importance of evaluating image classifiers on unseen data.
Describe characteristics of a dataset that are relevant to downstream performance.
Use data augmentation to improve model generalization.
Describe how the concept of distributions applies to image data.

Along the way, we’ll participate in a Kaggle competition, so you’ll get to practice with that.

Task

Load up the classifier you trained in Homework 1. Use it to make predictions on a set of images collected by others in the class. You’ll do this by participating in a Kaggle competition.

Click the link provided in Moodle to join the Kaggle competition. Create a Kaggle notebook in the Code tab of the competition, and load the following starter notebook:

Letter Classification Starter Notebook (name: letter-classification-starter-notebook-25sp.ipynb; show preview, open in Colab)

Note that the starter notebook includes only the code; it is not a template for your report. You’ll need to add descriptions as explained under Submission below.

Bring up your Homework 1 notebook. Copy and paste your model-training code from the Homework 1 notebook into the Homework 3 notebook in the section indicated. (You might need to add your dataset as an input.) **Note that although the competition has a “training” set, you should (mostly) use your Homework 1 model, including its dataset.
Run the notebook, fixing any path errors. The code will attempt to make use your model to make predictions on the “test” images. Once it completes successfully, Submit your predictions to the competition. Name your submission “Homework 1 baseline” or the like. Write down in your analysis how well your baseline does on the leaderboard.
- Make sure that the weights were actually loaded; if you see a warning about random weights, make sure you’re saving the weights to WEIGHTS_FILENAME.
The Kaggle competition includes a “training” dataset, which we’ll actually use as a validation set. Use this dataset (loaded into the notebook as valid_dataset) to evaluate your top losses and confusion matrix, like you did in Lab 5. Report the most frequent mistakes your classifier makes. Quantify the mistakes (using the confusion matrix) and make an educated guess as to why these might be the most common mistakes (by, for example, studying the top losses).
Make some changes to the training process you used in Homework 1. For example, you might want to add data augmentation or change the foundation model. Experiment as much as you want, but make two more submissions to evaluate on the test set and see what effect your changes had on the leaderboard. Be thoughtful about your changes and explain them in your analysis.
Optionally, try to improve your model’s performance further to try to get a higher score on the leaderboard. You may, for example, train on the training set given in the competition.

Analysis and Submission

Submit your Homework 3 notebook to Moodle. You don’t need to submit a revised Homework 1 notebook, but make sure that your Homework 3 notebook includes details of what you changed in that notebook.

Your notebook should include:

An introduction that summarizes what you did and what you found.
A clear explanation of the source and nature of the data you used for training your initial (homework 1) classifier.
- Include any relevant information you reported in the README in Homework 1.
- Also, read the introduction to Microsoft’s Dataset Documentation (Datasheets for Datasets) template. Then, skim through the questions that follow. Choose two or three questions that are most relevant to how well the model that you trained on that data worked on new data. Include both the question texts and your answers. Good answers are those that would most help someone who is training on your dataset predict how it will work on new data.
An organized summary of your results for the baseline model and the two modified models.
- Recall that in Homework 1 you estimated the accuracy that your classifier would obtain on other people’s images. Compare the accuracy you observed from the baseline model to the accuracy that you thought you’d get.
- Discuss what you noticed from analyzing the mistakes of your baseline model.
- Discuss what changes you tried in order to improve performance, and most importantly, why you tried them.
- Discuss what you learned from the changes you tried.
A clear conclusion that summarizes what you found and interprets what the results mean.

Details

Possible things to adjust:

How big your Homework 1 classifier’s validation set is
Which foundation model to use (see hw1 for a list of options)
What data augmentation (if any) to apply
How many epochs to train
What learning rate to use

Think about what other sources of variation might come up, and how you might be systematic about them.

Augmentation

We didn’t do code for image augmentation in the lab, but it’s actually pretty simple. In your Homework 1 notebook, after you’ve created your train_dataset, create an augmentation pipeline. Refer to Chapter 8 of the book, or look at the Data Augmentation section of the Keras CV guide.

Then, assuming you called your augmentation pipeline augment, you can apply it to your train_dataset like this:

train_dataset_with_aug = train_dataset.map(
  lambda inputs, labels: (augment(inputs), labels),
  num_parallel_calls=tf.data.AUTOTUNE)

It turns out that this does actually apply different augmentations on each epoch.

I suggest looking at example batches from train_dataset_with_aug to make sure the augmentation is working as you expect.

When you fit the model, use train_dataset_with_aug instead of train_dataset.

Errata

None yet this year.

Neural nets are strong performers for data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated to that form), simple models can sometimes perform very well, and can be much faster and sometimes more interpretable. Even if you plan to fit a neural net model, training a decision tree or random forest first can be a good quick first pass.

The Scikit-Learn (sklearn) fit-predict interface for modeling has become the de facto industry standard for this sort of modeling, so it’s highly likely that what you see here will be useful in your future work.

Objectives

Contrast a training set and validation set; explain appropriate uses of both
Use a decision tree for regression tasks
Practice with the fit and predict interface of sklearn models
Get a visual sense of how different regression models work.
Explain underfitting and overfitting (TODO: this is actually in the classification notebook; move it there)

Notebooks

Regression in scikit-learn (name: u02n2-sklearn-regression.ipynb; show preview, open in Colab)

Note: the most important elements are:

The Reflection section of Part 1
The Analysis section at the end.

Upload your .ipynb files to Moodle. Make sure the names are sensible!

Documentation

The sklearn documentation is exemplary. See:

Linear Models for, e.g., linear regression
Decision Trees
Ensemble Methods for, e.g., random forests

Libraries

We use pandas and NumPy for data wrangling, Matplotlib for plotting, and scikit-learn (sklearn) for the models.

Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. For example, we can easily read a CSV file directly off the Internet.

The main object from pandas is a DataFrame:

It holds a table of data.
Each column of data generally has a consistent data type. (Note: object columns are the exception. They usually mean “string”, but could actually hold any Python object.)
It behaves like a dictionary of its columns. Each column is a Series object.
Series support broadcast operations, similar to NumPy arrays and Torch tensors; they also have other functionality. (You can access the underlying NumPy array with the .values attribute.)

Conventions

X is typically used for input data (features)
y is typically used for output data (labels)

Notice that X has two axes and thus is written in uppercase; y has 1 and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)

The first index of both X and y is the sample index: X is a 2D array of shape (n_samples, n_features) and y is a 1D array of shape (n_samples,).

Data Splitting

To make sure we’re evaluating how well the model generalizes (rather than just memorizing the training data), we split the data into a train and valid set. The model is fit on the train set and evaluated on the valid set.

Notes:

Sometimes the valid set is called the test set. Sometimes there’s all three: train, valid, and test.
random_state is how sklearn specifies the random seed (it’s actually slightly more flexible than a seed).

Linear regression

Make an instance of LinearRegression from sklearn.linear_model.
- Making an instance sets up the structure of the model but doesn’t actually fit it to data.
The fit method takes X and y as arguments
The predict method takes X and returns predicted y values (usually we call them y_pred)
- since we have the actual y values in the valid set, we can compare them to the predicted values to compute error metrics
The score method computes the R^2 score by default.
The coef_ attribute holds the coefficients of the model (the weights); the intercept_ attribute holds the bias term.

Linear models construct their predictions as a linear combination of the input features. Viewed in the input space, linear models will always be flat, never bumpy or curvy.

Note: that doesn’t mean that linear models can’t be bumpy or curvy when viewed in a different space. For example, if you have a feature x and you add a feature x^2, the model can fit a parabola as a linear combination of x and x^2. This is the idea behind neural network models; they can fit very complex functions by composing simple functions. We’ll dig into this soon.

Metrics

sklearn has a number of metrics functions in sklearn.metrics. For regression, the most common are:

mean_squared_error
mean_absolute_error
r2_score

The score method of sklearn regression models computes the R^2 score by default.

Decision tree regression

Decision trees are a type of model that makes predictions by following a series of if-then rules. The rules are learned from the data. The tree is built by splitting the data into subsets based on the values of the features. The splits are chosen to minimize the error in the predictions.

In sklearn, decision trees for regression (sometimes called “regression trees”) are implemented in the DecisionTreeRegressor class. The API is almost exactly the same as the LinearRegression class (it has fit, predict, and score methods).

Notice how the tree makes its prediction starting at the top (root) and checking one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a “leaf”), it predicts the value stored there. (Think: how do you think it might have computed that value?)

Random Forest regression

Random Forests take random subsets of the data and fit decision trees to each one. As each tree is fit, it also considers only a random subset of features for each decision. The combination of these two reduces the variance of the model, that is, how much the model’s predictions change if it’s fit on different subsets of data.

Analysis

These are the analysis questions from the notebook. You should answer them in your notebook.

Describe the basic steps for fitting a model in sklearn and making predictions.
Describe parameters that the fit method takes. For each one, describe its purpose and its shape.
Describe, qualitatively, what each of the 3 models here looks like in data space. Describe a characteristic of the visualization that would let you tell immediately which type of model it is from. You might notice differences in the shapes of the boundaries it draws and, if you look more closely, a difference in how the boundaries relate to the data.
Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.

Extension

optional

Compute the loss on the training set for each of these models. Can that help you tell whether the model overfit or not?
Try using more features in the dataset. How well can you predict the price? Be careful about categorical features. (Note that you won’t be able to use plot_model as-is if you add additional features.)

Key questions

Key objectives

Materials

Contents

Objectives

Notebooks

Key Concepts

Reframing Regression as Classification

Cross-Entropy Loss

Working Through the Notebook

Analysis Tips

Outcomes

Task

Analysis and Submission

Details

Augmentation

Errata

Objectives

Notebooks

Documentation

Libraries

Conventions

Data Splitting

Linear regression

Metrics

Decision tree regression

Random Forest regression

Analysis

Extension