Unit 3: ML Fundamentals

Students who complete this unit will demonstrate that they can:

Fundamentals of Machine Learning

Contents

Preparation 3 (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Linear Regression the Hard Way

The simplest “neural computation” model is linear regression. We’ll implement it today so that we can understand each part of how it works.

To keep things simple we’ll use a “black box” optimizer, a function that just takes the data and the loss function and finds the best values of the parameters.

Later we’ll study how to optimize a function using stochastic gradient descent, and how to compute the gradients of the loss function with respect to the parameters using backpropagation.
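To make the "black box" idea concrete, here is a minimal sketch, assuming SciPy's general-purpose `scipy.optimize.minimize` as the optimizer (the notebook may use a different one) and a synthetic dataset made up for illustration. We only write the prediction function and the loss; the optimizer finds the parameters.

```python
import numpy as np
from scipy.optimize import minimize

# Synthetic data (an assumption for illustration): y = 2x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2.0 * x + 1.0 + rng.normal(scale=0.1, size=100)

def predict(params, x):
    """Linear model: one slope and one intercept."""
    slope, intercept = params
    return slope * x + intercept

def mse_loss(params):
    """Mean squared error between predictions and targets."""
    residuals = predict(params, x) - y
    return np.mean(residuals ** 2)

# The optimizer only sees the loss function and a starting guess.
result = minimize(mse_loss, x0=np.zeros(2))
slope, intercept = result.x
print(f"slope={slope:.2f}, intercept={intercept:.2f}, loss={result.fun:.4f}")
```

The recovered slope and intercept should land close to the values used to generate the data, even though we never told the optimizer anything about linear regression.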

Warm-Up Activity

Go to the interactive figures for the Understanding Deep Learning book.

  1. Go to Figure 2.2 (Least squares loss). Adjust the sliders to try to make the loss bigger or smaller. What are the highest and lowest values you can get for the loss? What does the plot look like at those different settings? (consider the line, the data points, and the dashed lines).
  2. How can you tell if you got a good setting for the sliders? Can you tell just by observing the loss (without looking at the plot of the data and the line)?

Alternatively, you may have played with this in the linreg explainer in the readings.

Notebooks

Single Linear Regression

Work through this notebook to practice linear regression with a single feature.

Multiple Linear Regression

To reinforce our understanding of linear regression and extend it to multiple features, we’ll work through this notebook:

Regression Models

Neural nets are strong performers for data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated to that form), simple models can sometimes perform very well, and can be much faster and sometimes more interpretable. Even if you plan to fit a neural net model, training a decision tree or random forest first can be a good quick first pass.

The Scikit-Learn (sklearn) fit-predict interface for modeling has become the de facto industry standard for this sort of modeling, so it’s highly likely that what you see here will be useful in your future work.

Objectives

Notebooks

Regression in scikit-learn (name: u02n2-sklearn-regression.ipynb)

Note: the most important elements are:

Upload your .ipynb files to Moodle. Make sure the names are sensible!

Documentation

The sklearn documentation is exemplary. See:

Libraries

We use pandas and NumPy for data wrangling, Matplotlib for plotting, and scikit-learn (sklearn) for the models.

Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. For example, we can easily read a CSV file directly off the Internet.

The main object from pandas is a DataFrame:
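As a small stand-in, here is a sketch of reading a CSV into a DataFrame. The column names and values are hypothetical, and the CSV text is held in a string so the example runs offline; `pd.read_csv` also accepts a file path or URL.

```python
import io
import pandas as pd

# Hypothetical CSV contents; in practice, pass a file path or URL to pd.read_csv.
csv_text = """latitude,longitude,price
42.03,-93.62,215000
42.05,-93.64,105000
42.06,-93.61,172000
"""

df = pd.read_csv(io.StringIO(csv_text))
print(df.shape)            # (rows, columns)
print(df.dtypes)           # one dtype per column
print(df["price"].mean())  # a single column is a Series
```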

Conventions

Notice that X has two axes and thus is written in uppercase; y has one axis and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)

The first index of both X and y is the sample index: X is a 2D array of shape (n_samples, n_features) and y is a 1D array of shape (n_samples,).
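A quick sketch of these shapes, using a tiny hypothetical DataFrame (the column names and values are made up for illustration):

```python
import pandas as pd

# Hypothetical data: two features and one target.
df = pd.DataFrame({
    "latitude": [42.03, 42.05, 42.06, 42.01],
    "longitude": [-93.62, -93.64, -93.61, -93.66],
    "price": [215000, 105000, 172000, 189000],
})

# Selecting with a list of column names keeps two axes: (n_samples, n_features).
X = df[["latitude", "longitude"]].to_numpy()
# Selecting a single column gives one axis: (n_samples,).
y = df["price"].to_numpy()

print(X.shape)  # (4, 2)
print(y.shape)  # (4,)
```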

Data Splitting

To make sure we’re evaluating how well the model generalizes (rather than just memorizing the training data), we split the data into a train and valid set. The model is fit on the train set and evaluated on the valid set.
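One common way to do this is sklearn's `train_test_split`. The sketch below uses synthetic data and an 80/20 split; both the data and the split fraction are assumptions for illustration, not necessarily what the notebook uses.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X[:, 0] + rng.normal(scale=0.1, size=100)

# Hold out 20% of the samples for validation; random_state makes the split repeatable.
X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.2, random_state=0
)
print(X_train.shape, X_valid.shape)
print(y_train.shape, y_valid.shape)
```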

Notes:

Linear regression

Linear models construct their predictions as a linear combination of the input features. Viewed in the input space, linear models will always be flat, never bumpy or curvy.

Note: that doesn’t mean that linear models can’t be bumpy or curvy when viewed in a different space. For example, if you have a feature x and you add a feature x^2, the model can fit a parabola as a linear combination of x and x^2. This is the idea behind neural network models; they can fit very complex functions by composing simple functions. We’ll dig into this soon.
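Here is a minimal sketch of the x^2 example, assuming noise-free synthetic data generated from a known parabola. The same `LinearRegression` class is fit twice: once on x alone and once on x together with x^2.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free parabola (an assumption for illustration).
x = np.linspace(-3, 3, 50)
y = 1.0 + 2.0 * x + 0.5 * x ** 2

X_line = x.reshape(-1, 1)              # one feature: x
X_quad = np.column_stack([x, x ** 2])  # two features: x and x^2

line_model = LinearRegression().fit(X_line, y)
quad_model = LinearRegression().fit(X_quad, y)

print("R^2 with x only:   ", line_model.score(X_line, y))
print("R^2 with x and x^2:", quad_model.score(X_quad, y))
print("intercept:", quad_model.intercept_, "coefficients:", quad_model.coef_)
```

The second model is still linear in its inputs; it only looks curved when plotted against x because one of its inputs is x^2.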

Metrics

sklearn has a number of metrics functions in sklearn.metrics. For regression, the most common are:

The score method of sklearn regression models computes the R^2 score by default.
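As a sketch, here are three commonly used regression metrics (mean absolute error, mean squared error, and R^2; this selection and the toy numbers are ours), followed by a check that `score` matches `r2_score` on a small fitted model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Toy targets and predictions (made up for illustration).
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.0, 8.0, 9.5])

mae = mean_absolute_error(y_true, y_pred)  # average of |error|
mse = mean_squared_error(y_true, y_pred)   # average of error^2
r2 = r2_score(y_true, y_pred)              # 1 - SS_res / SS_tot
print(mae, mse, r2)

# model.score(X, y) is the same computation as r2_score(y, model.predict(X)).
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y_noisy = np.array([3.2, 4.7, 7.4, 8.8])
model = LinearRegression().fit(X, y_noisy)
print(model.score(X, y_noisy), r2_score(y_noisy, model.predict(X)))
```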

Decision tree regression

Decision trees are a type of model that makes predictions by following a series of if-then rules. The rules are learned from the data. The tree is built by splitting the data into subsets based on the values of the features. The splits are chosen to minimize the error in the predictions.

In sklearn, decision trees for regression (sometimes called “regression trees”) are implemented in the DecisionTreeRegressor class. The API is almost exactly the same as the LinearRegression class (it has fit, predict, and score methods).

Notice how the tree makes its prediction starting at the top (root) and checking one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a “leaf”), it predicts the value stored there. (Think: how do you think it might have computed that value?)
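To see those rules without a plot, here is a small sketch that fits a shallow tree on synthetic data (an assumption for illustration) and prints its structure as text with `export_text`. The feature names `x0` and `x1` are placeholders.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Synthetic data: the target jumps from about 1 to about 3 when x0 crosses 5.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.where(X[:, 0] > 5, 3.0, 1.0) + rng.normal(scale=0.1, size=200)

# Same fit / predict / score interface as LinearRegression.
tree = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["x0", "x1"]))

# Every sample that lands in the same leaf gets the same prediction.
preds = tree.predict(X)
print(np.unique(preds).size, "distinct predictions from", tree.get_n_leaves(), "leaves")
print("R^2 on the training data:", tree.score(X, y))
```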

Random Forest regression

Random Forests take random subsets of the data and fit decision trees to each one. As each tree is fit, it also considers only a random subset of features for each decision. The combination of these two reduces the variance of the model, that is, how much the model’s predictions change if it’s fit on different subsets of data.
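A minimal sketch of this, again on synthetic data with hyperparameter values chosen only for illustration. It checks that the forest's prediction is the average of its individual trees' predictions, and looks at how much the trees disagree with each other.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data (an assumption for illustration).
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = np.sin(X[:, 0]) + 0.1 * X[:, 1] + rng.normal(scale=0.1, size=200)

# max_features=1: each split considers one randomly chosen feature.
forest = RandomForestRegressor(
    n_estimators=50, max_features=1, random_state=0
).fit(X, y)

# Each fitted tree is available in forest.estimators_.
per_tree = np.stack([t.predict(X) for t in forest.estimators_])
manual_average = per_tree.mean(axis=0)

print("forest prediction equals the tree average:",
      np.allclose(manual_average, forest.predict(X)))
print("typical disagreement between trees:", per_tree.std(axis=0).mean())
```

Individual trees disagree with one another, but averaging them smooths out much of that disagreement.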

Analysis

These are the analysis questions from the notebook. You should answer them in your notebook.

  1. Describe the basic steps for fitting a model in sklearn and making predictions.

  2. Describe the parameters that the fit method takes. For each one, describe its purpose and its shape.

  3. Describe, qualitatively, what each of the 3 models here looks like in data space. Describe a characteristic of the visualization that would let you tell immediately which type of model produced it. You might notice differences in the shapes of the boundaries it draws and, if you look more closely, a difference in how the boundaries relate to the data.

  4. Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.

Extension

optional

  1. Compute the loss on the training set for each of these models. Can that help you tell whether the model overfit or not?
  2. Try using more features in the dataset. How well can you predict the price? Be careful about categorical features. (Note that you won’t be able to use plot_model as-is if you add additional features.)

Classification Models

Objectives

Notebooks

Classification in scikit-learn (name: u03n2-sklearn-classification.ipynb)

For reference only, you might want to have your sklearn-regression notebook open: Regression in scikit-learn (name: u02n2-sklearn-regression.ipynb; show preview, open in Colab)

Key Concepts

Reframing Regression as Classification

Rather than predicting exact home prices (regression), we’ll predict price categories (classification):

Why? The price distribution is highly skewed, so errors on a few expensive homes can dominate the loss. Classification evens out the importance. Really, we’re mostly doing this to help you see the difference between these two types of problems.

Watch out: Notice that the class is expressed as a number (0, 1, 2), but this is not a regression problem. The numbers are just labels for the classes.
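One way such a label column could be built is sketched below, assuming three equal-sized quantile bins via `pd.qcut`. The notebook may define its bins differently, and the prices here are made up.

```python
import pandas as pd

# Hypothetical, skewed prices (made up for illustration).
prices = pd.Series(
    [95000, 120000, 135000, 150000, 172000, 189000, 215000, 340000, 755000]
)

# Assumption: three quantile bins, so each class holds about a third of the homes.
# labels=False returns integer codes 0, 1, 2 instead of interval objects.
price_bin = pd.qcut(prices, q=3, labels=False)

print(price_bin.tolist())
print(price_bin.value_counts().sort_index())
```

The codes identify which bin a home falls in. A classifier treats them as names: it never assumes that class 2 is "twice" class 1.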

Cross-Entropy Loss

      True Class
      ↓ 
[0.7, 0.2, 0.1] → -log(prob of true class)
 ↑
Model's predicted probabilities
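The same computation as a small sketch. The true classes below are assumptions chosen for illustration (the first sample's true class is taken to be class 0), and a second made-up sample is added so that the average is over more than one row. sklearn's `log_loss` computes the same quantity.

```python
import numpy as np
from sklearn.metrics import log_loss

# Predicted probabilities: one row per sample, one column per class.
probs = np.array([
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
])
# Assumed true classes for illustration.
true_class = np.array([0, 2])

# Pick out the probability the model gave to each sample's true class.
prob_of_true = probs[np.arange(len(true_class)), true_class]
per_sample = -np.log(prob_of_true)
print(per_sample.round(3))  # the first entry is -log(0.7), about 0.357

# Cross-entropy loss is the average over samples.
manual = per_sample.mean()
library = log_loss(true_class, probs, labels=[0, 1, 2])
print(manual, library)
```

The loss is small when the model puts high probability on the true class and grows quickly as that probability approaches zero.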

Working Through the Notebook

  1. Data Setup

    • Use same X (lat/long) as regression
    • New y: price_bin column (0=low, 1=med, 2=high)
    • Split into train/validation sets
  2. Three Models

    • Logistic Regression: Linear boundaries between classes
    • Decision Tree: Box-like regions
    • Random Forest: Smooth combination of many trees
  3. For Each Model

    • Fit on training data
    • Plot probability contours
    • Compute accuracy and cross-entropy loss
    • Compare training vs validation performance
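The workflow above can be sketched end to end as follows. This uses synthetic two-feature data with three classes as a stand-in for the housing data; the data, the hyperparameters, and the `results` dictionary are all assumptions for illustration. Plotting is omitted.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: two features, three classes (0, 1, 2).
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(600, 2))
signal = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=600)
y = np.digitize(signal, bins=[-0.4, 0.4])

X_train, X_valid, y_train, y_valid = train_test_split(
    X, y, test_size=0.25, random_state=0
)

models = {
    "logistic regression": LogisticRegression(),
    "decision tree": DecisionTreeClassifier(max_depth=4, random_state=0),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

splits = {"train": (X_train, y_train), "valid": (X_valid, y_valid)}
results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    results[name] = {}
    for split, (X_s, y_s) in splits.items():
        acc = accuracy_score(y_s, model.predict(X_s))
        # predict_proba gives one probability per class for each sample.
        ce = log_loss(y_s, model.predict_proba(X_s), labels=model.classes_)
        results[name][split] = (acc, ce)
        print(f"{name:20s} {split:5s} accuracy={acc:.3f} cross-entropy={ce:.3f}")
```

A large gap between the train and valid rows for a model is a hint that it has memorized the training data rather than generalized.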

Analysis Tips

Use an AI to make an AI (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.