Regression in scikit-learn¶

Goals¶

  • Practice with the fit and predict interface of sklearn models
  • Get a visual sense of how different regression models work.

Rationale¶

Neural nets are strong performers for data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated to that form), simple models can sometimes perform very well, and can be much faster and sometimes more interpretable. Even if you plan to fit a neural net model, training a decision tree or random forest first can be a good quick first pass.

The scikit-learn (sklearn) fit-predict interface has become the de facto industry standard for this kind of modeling, so it's highly likely that what you see here will be useful in your future work.

Documentation¶

The sklearn documentation is exemplary. See:

  • Linear Models for, e.g., linear regression
  • Decision Trees
  • Ensemble Methods for, e.g., random forests

Setup¶

Let's import the necessary modules: pandas and NumPy for data wrangling, Matplotlib for plotting, and some sklearn models and utilities.

In [ ]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeRegressor
import sklearn.tree
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

We'll load the data. We're using a dataset of home sale prices from the Ames, Iowa assessor's database, described in De Cock (2011). (DATA 202 students may remember seeing this dataset.)

Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. We'll see here that we can easily read a CSV file directly off the Internet...

In [ ]:
ames = pd.read_csv('https://github.com/kcarnold/AmesHousing/blob/master/data/ames.csv.gz?raw=true', compression="gzip")

The main object from pandas is a DataFrame. It holds a table of data:

In [ ]:
ames.head()
Out[ ]:
MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour Utilities Lot_Config ... Fence Misc_Feature Misc_Val Mo_Sold Year_Sold Sale_Type Sale_Condition Sale_Price Longitude Latitude
0 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 141 31770 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... No_Fence NaN 0 5 2010 WD Normal 215000 -93.619754 42.054035
1 One_Story_1946_and_Newer_All_Styles Residential_High_Density 80 11622 Pave No_Alley_Access Regular Lvl AllPub Inside ... Minimum_Privacy NaN 0 6 2010 WD Normal 105000 -93.619756 42.053014
2 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 81 14267 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... No_Fence Gar2 12500 6 2010 WD Normal 172000 -93.619387 42.052659
3 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 93 11160 Pave No_Alley_Access Regular Lvl AllPub Corner ... No_Fence NaN 0 4 2010 WD Normal 244000 -93.617320 42.051245
4 Two_Story_1946_and_Newer Residential_Low_Density 74 13830 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Inside ... Minimum_Privacy NaN 0 3 2010 WD Normal 189900 -93.638933 42.060899

5 rows × 81 columns

Each column of data generally has a consistent data type. (Note: object columns are the exception. They usually mean "string", but could actually hold any Python object.)

In [ ]:
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 81 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MS_SubClass         2930 non-null   object 
 1   MS_Zoning           2930 non-null   object 
 2   Lot_Frontage        2930 non-null   int64  
 3   Lot_Area            2930 non-null   int64  
 4   Street              2930 non-null   object 
 5   Alley               2930 non-null   object 
 6   Lot_Shape           2930 non-null   object 
 7   Land_Contour        2930 non-null   object 
 8   Utilities           2930 non-null   object 
 9   Lot_Config          2930 non-null   object 
 10  Land_Slope          2930 non-null   object 
 11  Neighborhood        2930 non-null   object 
 12  Condition_1         2930 non-null   object 
 13  Condition_2         2930 non-null   object 
 14  Bldg_Type           2930 non-null   object 
 15  House_Style         2930 non-null   object 
 16  Overall_Qual        2930 non-null   object 
 17  Overall_Cond        2930 non-null   object 
 18  Year_Built          2930 non-null   int64  
 19  Year_Remod_Add      2930 non-null   int64  
 20  Roof_Style          2930 non-null   object 
 21  Roof_Matl           2930 non-null   object 
 22  Exterior_1st        2930 non-null   object 
 23  Exterior_2nd        2930 non-null   object 
 24  Mas_Vnr_Type        1155 non-null   object 
 25  Mas_Vnr_Area        2930 non-null   int64  
 26  Exter_Qual          2930 non-null   object 
 27  Exter_Cond          2930 non-null   object 
 28  Foundation          2930 non-null   object 
 29  Bsmt_Qual           2930 non-null   object 
 30  Bsmt_Cond           2930 non-null   object 
 31  Bsmt_Exposure       2930 non-null   object 
 32  BsmtFin_Type_1      2930 non-null   object 
 33  BsmtFin_SF_1        2930 non-null   int64  
 34  BsmtFin_Type_2      2930 non-null   object 
 35  BsmtFin_SF_2        2930 non-null   int64  
 36  Bsmt_Unf_SF         2930 non-null   int64  
 37  Total_Bsmt_SF       2930 non-null   int64  
 38  Heating             2930 non-null   object 
 39  Heating_QC          2930 non-null   object 
 40  Central_Air         2930 non-null   object 
 41  Electrical          2930 non-null   object 
 42  First_Flr_SF        2930 non-null   int64  
 43  Second_Flr_SF       2930 non-null   int64  
 44  Low_Qual_Fin_SF     2930 non-null   int64  
 45  Gr_Liv_Area         2930 non-null   int64  
 46  Bsmt_Full_Bath      2930 non-null   int64  
 47  Bsmt_Half_Bath      2930 non-null   int64  
 48  Full_Bath           2930 non-null   int64  
 49  Half_Bath           2930 non-null   int64  
 50  Bedroom_AbvGr       2930 non-null   int64  
 51  Kitchen_AbvGr       2930 non-null   int64  
 52  Kitchen_Qual        2930 non-null   object 
 53  TotRms_AbvGrd       2930 non-null   int64  
 54  Functional          2930 non-null   object 
 55  Fireplaces          2930 non-null   int64  
 56  Fireplace_Qu        2930 non-null   object 
 57  Garage_Type         2930 non-null   object 
 58  Garage_Finish       2930 non-null   object 
 59  Garage_Cars         2930 non-null   int64  
 60  Garage_Area         2930 non-null   int64  
 61  Garage_Qual         2930 non-null   object 
 62  Garage_Cond         2930 non-null   object 
 63  Paved_Drive         2930 non-null   object 
 64  Wood_Deck_SF        2930 non-null   int64  
 65  Open_Porch_SF       2930 non-null   int64  
 66  Enclosed_Porch      2930 non-null   int64  
 67  Three_season_porch  2930 non-null   int64  
 68  Screen_Porch        2930 non-null   int64  
 69  Pool_Area           2930 non-null   int64  
 70  Pool_QC             2930 non-null   object 
 71  Fence               2930 non-null   object 
 72  Misc_Feature        106 non-null    object 
 73  Misc_Val            2930 non-null   int64  
 74  Mo_Sold             2930 non-null   int64  
 75  Year_Sold           2930 non-null   int64  
 76  Sale_Type           2930 non-null   object 
 77  Sale_Condition      2930 non-null   object 
 78  Sale_Price          2930 non-null   int64  
 79  Longitude           2930 non-null   float64
 80  Latitude            2930 non-null   float64
dtypes: float64(2), int64(33), object(46)
memory usage: 1.8+ MB
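If you just want the dtype counts, one option is the value_counts method:

In [ ]:
# Count how many columns have each dtype; compare with the summary line above.
ames.dtypes.value_counts()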

It behaves like a dictionary of its columns. Each column is a Series object.

In [ ]:
type(ames['Sale_Price'])
Out[ ]:
pandas.core.series.Series

Series support broadcast operations, similar to NumPy arrays and Torch tensors; they also have other functionality.

In [ ]:
ames['price'] = ames["Sale_Price"] / 100_000 # Make `price` be in units of $100k, to be easier to interpret.
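For example, here is a sketch of a reduction and a broadcast comparison on the price column:

In [ ]:
# A couple of Series operations: a reduction (.mean) and a broadcast comparison.
print(ames['price'].mean())        # average sale price, in units of $100k
print((ames['price'] > 2).mean())  # fraction of homes that sold for over $200k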

Now let's look into this dataset.

We'll define some functions to plot the data and the models. Since we have latitude and longitude for each home, we can plot the data in 2D, with color indicating the sale price.

(Sorry, you'll just have to imagine there's a map underneath.)

In [ ]:
def plot_data():
    # You don't have to know how this function works.
    plt.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=.5)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.colorbar(label="Sale Price ($100k)")
plot_data()
[Figure: scatter plot of home locations (Longitude vs. Latitude), colored by sale price.]

We'll try to predict home price based on location (which the realtors assure us is the most important factor anyway). So we'll grab the Latitude and Longitude columns of the data. We'll call that input data X, by convention. There are several different ways to index into a pandas DataFrame; using a list gives us a DataFrame with just the columns with those names. We'll then access the underlying NumPy data by using .values.

In [ ]:
feature_names = ['Longitude', 'Latitude']
X = ames[feature_names].values
X.shape
Out[ ]:
(2930, 2)

Our target, called y by convention, will be the home price (we'll soon introduce a different y, but start with this one).

In [ ]:
y = ames['price'].values
y.shape
Out[ ]:
(2930,)

Notice that X has two axes and thus is written in uppercase; y has one axis and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)

Now let's split the data into a training set and a validation set (sklearn's function calls the held-out set "test", but we'll use it for validation, which is fine). random_state is how sklearn specifies the random seed (it's actually slightly more flexible than a seed).

In [ ]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.2, random_state=42)

We'll verify that the shapes make sense. Note how many items are in each of the sets.

In [ ]:
X_train.shape, y_train.shape
Out[ ]:
((2344, 2), (2344,))
In [ ]:
X_valid.shape, y_valid.shape
Out[ ]:
((586, 2), (586,))

Here's a function to plot our regression model in "data space" (i.e., what it would predict everywhere on the map).

This function is pretty customized to our specific use case, though you can get inspiration from it for use in other situations.

In [ ]:
def plot_model(model, fig=None):
    # Compute extents
    lat_min = ames.Latitude.min()
    lat_max = ames.Latitude.max()
    lon_min = ames.Longitude.min()
    lon_max = ames.Longitude.max()
    price_min = ames.price.min()
    price_max = ames.price.max()

    # Ask the model for predictions at every point on a grid
    xx, yy = np.meshgrid(np.linspace(lon_min, lon_max, 250), np.linspace(lat_min, lat_max, 250))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if fig is None:
        fig = plt.figure(figsize=plt.figaspect(2))

    # Top: show the predictions in 2D. Superimpose the original data.
    ax = fig.add_subplot(2, 1, 1)
    surf = ax.contourf(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(xlabel="Longitude", ylabel="Latitude", title="2D contour view")
    fig.colorbar(surf, label="Sale Price ($100k)")

    # Bottom: show the predictions in 3D
    ax = fig.add_subplot(2, 1, 2, projection='3d')
    ax.plot_surface(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    #ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(title="3D view")

Task¶

Part A: Linear regression¶

Step A1: Fit a linear regression model (call it linreg) to the training set (X_train, y_train).

In [ ]:
linreg = LinearRegression().fit(X_train, y_train)
print("Prediction equation:")
print('y_pred = '
    + ' + '.join(f'{coef:.3f} * {name}' for coef, name in zip(linreg.coef_, feature_names))
    + f' + {linreg.intercept_:.3f}')
Prediction equation:
y_pred = -7.955 * Longitude + 11.886 * Latitude + -1242.742

Here are the Longitude and Latitude coordinates of the first home in the validation set. Compute what the model would predict for this home, based on the prediction equation shown above.

In [ ]:
example_home = X_valid[0]
example_home
Out[ ]:
array([-93.621065,  42.029038])
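If you want to check your hand computation afterward, one way (a sketch, using the full-precision fitted coefficients rather than the rounded ones printed above) is to apply the prediction equation directly:

In [ ]:
# Apply the prediction equation to the example home: dot(coefficients, features) + intercept.
manual_pred = linreg.coef_ @ example_home + linreg.intercept_
print(f"Manual prediction: {manual_pred:.2f} ($100k units)")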

Now we'll plot the model's predictions in data space. The code for this step is filled in for you because there's no fully generic way to do it; our approach is customized to this particular model and task, so you don't have to understand the details of how it works.

The main thing to observe here is what shapes you see. Think about why you might see those shapes, in light of the prediction equation.

Note: The first plot is a contour plot (aka contour graph) showing both the model's predictions and our training data. The colored dots scattered across the plot are our actual training data points, while the background contour lines show the model's predicted prices across the entire space.

If you've ever seen a topographic map showing elevation, this is the same idea - each line represents locations with the same predicted price, just like topographic lines show locations of the same elevation.

For more information on contour plots, look them up on Wikipedia or Khan Academy.

How to Read This Plot:

  • The x-axis shows longitude
  • The y-axis shows latitude
  • The color and contour lines show the predicted price
  • Darker colors indicate lower predicted prices
  • The scattered dots show where our actual training data points are located

The lines you see are contour lines. Since a linear regression predicts with a single flat plane, its contours are straight, parallel, and evenly spaced; the fit is quite smooth (too smooth?). The second plot shows the predictions in 3D, which may make this clearer.

In [ ]:
plot_model(linreg)
[Figure: linear regression predictions in data space: 2D contour view (top) and 3D surface view (bottom).]

Step A3: Compute the model's predictions on the validation set (call them y_pred). What was the model's prediction for the first house in the validation set? (did you get it right in the previous question?) How does that compare with the actual price for that home?

In [ ]:
y_pred = linreg.predict(X_valid)
In [ ]:
# your code here
Prediction for first house in validation set: 1.55, Actual: 1.61, difference: 0.06

Step A4: Compute and show the mean squared error, the mean absolute error, and the R2 score for the validation set.

  • You may use the mean_absolute_error, mean_squared_error, and r2_score functions (imported from sklearn.metrics above).
  • Use the predictions you already made above.
  • Use Shift-TAB or ? to get the documentation for these functions to ensure you're passing the arguments in the correct order.
In [ ]:
# your code here
Mean absolute error: 0.582
Mean squared error:  0.679
R2 score:            0.154
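One possible solution sketch, assuming the y_pred you computed in Step A3 (note that the true values come first in each call):

In [ ]:
# Compute the three validation-set metrics from the predictions made above.
print(f"Mean absolute error: {mean_absolute_error(y_valid, y_pred):.3f}")
print(f"Mean squared error:  {mean_squared_error(y_valid, y_pred):.3f}")
print(f"R2 score:            {r2_score(y_valid, y_pred):.3f}")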

For convenience, sklearn models have a .score() method, which computes the predictions and a single score in one call. Look at the results below: which of the metrics above does .score() seem to use for regression tasks?

In [ ]:
print(f"Training score: {linreg.score(X_train, y_train):.3f}")
print(f"Validation score: {linreg.score(X_valid, y_valid):.3f}")
Training score: 0.147
Validation score: 0.154
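One way to check your answer (a sketch): compare .score() against the metric functions you used above.

In [ ]:
# If this matches linreg.score(X_valid, y_valid), that's the metric .score() uses.
print(r2_score(y_valid, linreg.predict(X_valid)))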

Reflect: (these questions might make more sense once you have looked at the other two types of models below)

  1. How does a linear regression make a prediction?
  2. Is a linear regression able to capture that homes are typically more expensive on one side of town than the other? (Which side?) Why or why not?
  3. Is a linear regression able to capture that homes in one neighborhood are typically more expensive than those in adjacent neighborhoods? Why or why not?

Part B: Decision tree regression¶

Step B1: Fit a decision tree model (call it dtree_reg) to the training set (X_train, y_train).

We'll use a small max_depth to be able to plot the tree. We'll then fit another one with full depth.

Notice how the tree makes its prediction starting at the top (root) and checking one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a "leaf"), it predicts the value stored there. (Think: how do you think it might have computed that value?)

In [ ]:
dtree_reg_small = DecisionTreeRegressor(max_depth=2, random_state=42).fit(X_train, y_train)
plt.figure(figsize=(20, 15))
sklearn.tree.plot_tree(dtree_reg_small, feature_names=feature_names, filled=True);
[Figure: the fitted depth-2 decision tree, with one split per internal node.]

Exercise: compute by hand the tree's prediction for the same first house in the validation set.

your answer here
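To verify your hand traversal, you can ask the small tree directly (a sketch; predict expects a 2D array, hence the reshape):

In [ ]:
# Predict for the example home from Part A (reshape one sample into a 1-row 2D array).
dtree_reg_small.predict(example_home.reshape(1, -1))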

Now let's let the tree grow as big as it wants.

In [ ]:
dtree_reg = DecisionTreeRegressor(random_state=42).fit(X_train, y_train)

If the tree is big, the graphic may get unreadable. A text export may be easier to read:

In [ ]:
print(sklearn.tree.export_text(dtree_reg, feature_names=feature_names, max_depth=2))
|--- Latitude <= 42.05
|   |--- Longitude <= -93.63
|   |   |--- Latitude <= 42.00
|   |   |   |--- truncated branch of depth 9
|   |   |--- Latitude >  42.00
|   |   |   |--- truncated branch of depth 22
|   |--- Longitude >  -93.63
|   |   |--- Latitude <= 42.04
|   |   |   |--- truncated branch of depth 28
|   |   |--- Latitude >  42.04
|   |   |   |--- truncated branch of depth 20
|--- Latitude >  42.05
|   |--- Longitude <= -93.65
|   |   |--- Longitude <= -93.66
|   |   |   |--- truncated branch of depth 12
|   |   |--- Longitude >  -93.66
|   |   |   |--- truncated branch of depth 24
|   |--- Longitude >  -93.65
|   |   |--- Longitude <= -93.63
|   |   |   |--- truncated branch of depth 22
|   |   |--- Longitude >  -93.63
|   |   |   |--- truncated branch of depth 16

Step B2: Plot the decision tree model in data space. Observe what shapes you see.

In [ ]:
plot_model(dtree_reg)
[Figure: decision tree predictions in data space: 2D contour view (top) and 3D surface view (bottom).]
In [ ]:
# your code here
Mean absolute error: 0.3832485153583618
Mean squared error: 0.4089356941066932

Part C: Random Forest regression¶

Random Forests take random subsets of the data and fit decision trees to each one. As each tree is fit, it also considers only a random subset of features for each decision. The combination of these two reduces the variance of the model, that is, how much the model's predictions change if it's fit on different subsets of data.

Fit a random forest regression model to this data. Use the default hyperparameters and a random_state of 42 (only matters if you want to compare your results with the reference notebook).

In [ ]:
rf_reg = RandomForestRegressor(random_state=42).fit(X_train, y_train)
print(f"We just fit a random forest with {rf_reg.n_estimators} trees.")
We just fit a random forest with 100 trees.
In [ ]:
plot_model(rf_reg)
[Figure: random forest predictions in data space: 2D contour view (top) and 3D surface view (bottom).]

Note: you can use code like this to show all of the different trees in the forest. It may or may not work on your computer, though.

In [ ]:
if False:
    from matplotlib.animation import FuncAnimation
    from IPython.display import HTML
    def frame(i):
        plt.clf()
        plot_model(rf_reg.estimators_[i], fig=fig)
        plt.title(f"Tree {i:03d}")
    fig = plt.figure(figsize=(16, 10))
    anim = FuncAnimation(fig=fig, func=frame, frames=min(10, len(rf_reg.estimators_)))
    # One of these two should work:
    display(HTML(anim.to_html5_video()))
    #display(HTML(anim.to_jshtml()))

Again, compute the predictions and errors.

In [ ]:
# your code here
Mean absolute error: 0.28960648131845396
Mean squared error: 0.20209223072066412
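One possible solution sketch, mirroring Step A4:

In [ ]:
# Validation-set predictions and errors for the random forest.
rf_pred = rf_reg.predict(X_valid)
print("Mean absolute error:", mean_absolute_error(y_valid, rf_pred))
print("Mean squared error: ", mean_squared_error(y_valid, rf_pred))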

Analysis¶

Q1: Describe the basic steps for fitting a model in sklearn and making predictions.

your narrative answer here

Q2: Describe the parameters that the fit method takes. For each one, describe its purpose and its shape.

your narrative answer here

Q3: Describe, qualitatively, what each of the three models here looks like in data space. Describe a characteristic of the visualization that would let you immediately tell which type of model produced it. You might notice differences in the shapes of the boundaries each model draws and, if you look more closely, a difference in how those boundaries relate to the data.

your narrative answer here

Q4: Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.

your narrative answer here

Extension¶

optional

  1. Compute the loss on the training set for each of these models. Can that help you tell whether the model overfit or not?
  2. Try using more features in the dataset. How well can you predict the price? Be careful about categorical features; see the sketch below for one way to handle them. (Note that you won't be able to use plot_model as-is if you add additional features.)
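For item 2, here's one possible starting point (a sketch: the specific columns are just examples drawn from ames.info() above, and the pipeline pieces are standard sklearn tools rather than anything set up earlier in this lab):

In [ ]:
# A sketch for using more features, including categorical ones (example columns only).
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

numeric_features = ['Gr_Liv_Area', 'Year_Built', 'Full_Bath']  # example numeric columns
categorical_features = ['Neighborhood', 'Bldg_Type']           # example categorical columns

X2 = ames[numeric_features + categorical_features]
X2_train, X2_valid, y2_train, y2_valid = train_test_split(X2, y, test_size=.2, random_state=42)

model = make_pipeline(
    make_column_transformer(
        (OneHotEncoder(handle_unknown='ignore'), categorical_features),
        remainder='passthrough'),  # numeric columns pass through unchanged
    RandomForestRegressor(random_state=42))
model.fit(X2_train, y2_train)
print(f"Validation R2 score: {model.score(X2_valid, y2_valid):.3f}")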