Regression in scikit-learn¶

Goals¶

  • Practice with the fit and predict interface of sklearn models
  • Get a visual sense of how different regression models work

Rationale¶

Neural nets are strong performers for data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated to that form), simple models can perform very well, and they are often much faster to train and sometimes more interpretable. Even if you plan to fit a neural net, training a decision tree or random forest first can be a good quick first pass.

The Scikit-Learn (sklearn) fit-predict interface has become the de facto industry standard for this sort of modeling, so it's highly likely that what you see here will be useful in your future work.

Documentation¶

The sklearn documentation is exemplary. See:

  • Linear Models for, e.g., linear regression
  • Decision Trees
  • Ensemble Methods for, e.g., random forests

Setup¶

Let's import necessary modules: pandas and NumPy for data wrangling, Matplotlib for plotting, and some sklearn models.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import sklearn.tree
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, log_loss, accuracy_score, classification_report
from sklearn.model_selection import train_test_split

We'll load the data. We're using a dataset of home sale prices from the Ames, Iowa assessor's database, described in this paper. (DATA 202 students may remember seeing this dataset.)

Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. We'll see here that we can easily read a CSV file directly off the Internet...

In [2]:
ames = pd.read_csv('https://github.com/kcarnold/AmesHousing/blob/master/data/ames.csv.gz?raw=true', compression="gzip")

The main object from pandas is a DataFrame. It holds a table of data:

In [3]:
ames.head()
Out[3]:
MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape Land_Contour Utilities Lot_Config ... Fence Misc_Feature Misc_Val Mo_Sold Year_Sold Sale_Type Sale_Condition Sale_Price Longitude Latitude
0 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 141 31770 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... No_Fence None 0 5 2010 WD Normal 215000 -93.619754 42.054035
1 One_Story_1946_and_Newer_All_Styles Residential_High_Density 80 11622 Pave No_Alley_Access Regular Lvl AllPub Inside ... Minimum_Privacy None 0 6 2010 WD Normal 105000 -93.619756 42.053014
2 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 81 14267 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... No_Fence Gar2 12500 6 2010 WD Normal 172000 -93.619387 42.052659
3 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 93 11160 Pave No_Alley_Access Regular Lvl AllPub Corner ... No_Fence None 0 4 2010 WD Normal 244000 -93.617320 42.051245
4 Two_Story_1946_and_Newer Residential_Low_Density 74 13830 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Inside ... Minimum_Privacy None 0 3 2010 WD Normal 189900 -93.638933 42.060899

5 rows × 81 columns

Each column of data generally has a consistent data type. (Note: object columns are the exception. They usually mean "string", but could actually hold any Python object.)

In [4]:
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 81 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   MS_SubClass         2930 non-null   object 
 1   MS_Zoning           2930 non-null   object 
 2   Lot_Frontage        2930 non-null   int64  
 3   Lot_Area            2930 non-null   int64  
 4   Street              2930 non-null   object 
 5   Alley               2930 non-null   object 
 6   Lot_Shape           2930 non-null   object 
 7   Land_Contour        2930 non-null   object 
 8   Utilities           2930 non-null   object 
 9   Lot_Config          2930 non-null   object 
 10  Land_Slope          2930 non-null   object 
 11  Neighborhood        2930 non-null   object 
 12  Condition_1         2930 non-null   object 
 13  Condition_2         2930 non-null   object 
 14  Bldg_Type           2930 non-null   object 
 15  House_Style         2930 non-null   object 
 16  Overall_Qual        2930 non-null   object 
 17  Overall_Cond        2930 non-null   object 
 18  Year_Built          2930 non-null   int64  
 19  Year_Remod_Add      2930 non-null   int64  
 20  Roof_Style          2930 non-null   object 
 21  Roof_Matl           2930 non-null   object 
 22  Exterior_1st        2930 non-null   object 
 23  Exterior_2nd        2930 non-null   object 
 24  Mas_Vnr_Type        2930 non-null   object 
 25  Mas_Vnr_Area        2930 non-null   int64  
 26  Exter_Qual          2930 non-null   object 
 27  Exter_Cond          2930 non-null   object 
 28  Foundation          2930 non-null   object 
 29  Bsmt_Qual           2930 non-null   object 
 30  Bsmt_Cond           2930 non-null   object 
 31  Bsmt_Exposure       2930 non-null   object 
 32  BsmtFin_Type_1      2930 non-null   object 
 33  BsmtFin_SF_1        2930 non-null   int64  
 34  BsmtFin_Type_2      2930 non-null   object 
 35  BsmtFin_SF_2        2930 non-null   int64  
 36  Bsmt_Unf_SF         2930 non-null   int64  
 37  Total_Bsmt_SF       2930 non-null   int64  
 38  Heating             2930 non-null   object 
 39  Heating_QC          2930 non-null   object 
 40  Central_Air         2930 non-null   object 
 41  Electrical          2930 non-null   object 
 42  First_Flr_SF        2930 non-null   int64  
 43  Second_Flr_SF       2930 non-null   int64  
 44  Low_Qual_Fin_SF     2930 non-null   int64  
 45  Gr_Liv_Area         2930 non-null   int64  
 46  Bsmt_Full_Bath      2930 non-null   int64  
 47  Bsmt_Half_Bath      2930 non-null   int64  
 48  Full_Bath           2930 non-null   int64  
 49  Half_Bath           2930 non-null   int64  
 50  Bedroom_AbvGr       2930 non-null   int64  
 51  Kitchen_AbvGr       2930 non-null   int64  
 52  Kitchen_Qual        2930 non-null   object 
 53  TotRms_AbvGrd       2930 non-null   int64  
 54  Functional          2930 non-null   object 
 55  Fireplaces          2930 non-null   int64  
 56  Fireplace_Qu        2930 non-null   object 
 57  Garage_Type         2930 non-null   object 
 58  Garage_Finish       2930 non-null   object 
 59  Garage_Cars         2930 non-null   int64  
 60  Garage_Area         2930 non-null   int64  
 61  Garage_Qual         2930 non-null   object 
 62  Garage_Cond         2930 non-null   object 
 63  Paved_Drive         2930 non-null   object 
 64  Wood_Deck_SF        2930 non-null   int64  
 65  Open_Porch_SF       2930 non-null   int64  
 66  Enclosed_Porch      2930 non-null   int64  
 67  Three_season_porch  2930 non-null   int64  
 68  Screen_Porch        2930 non-null   int64  
 69  Pool_Area           2930 non-null   int64  
 70  Pool_QC             2930 non-null   object 
 71  Fence               2930 non-null   object 
 72  Misc_Feature        2930 non-null   object 
 73  Misc_Val            2930 non-null   int64  
 74  Mo_Sold             2930 non-null   int64  
 75  Year_Sold           2930 non-null   int64  
 76  Sale_Type           2930 non-null   object 
 77  Sale_Condition      2930 non-null   object 
 78  Sale_Price          2930 non-null   int64  
 79  Longitude           2930 non-null   float64
 80  Latitude            2930 non-null   float64
dtypes: float64(2), int64(33), object(46)
memory usage: 1.8+ MB

It behaves like a dictionary of its columns. Each column is a Series object.

In [5]:
type(ames['Sale_Price'])
Out[5]:
pandas.core.series.Series

Series support broadcast operations, similar to NumPy arrays and Torch tensors; they also have other functionality.

In [6]:
ames['price'] = ames["Sale_Price"] / 100_000 # Make `price` be in units of $100k, to be easier to interpret.
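For instance, here's a sketch of element-wise arithmetic between two Series from this dataset (the particular ratio is just for illustration):

# Element-wise division of two Series, broadcast row by row
price_per_lot_sqft = ames['Sale_Price'] / ames['Lot_Area']
price_per_lot_sqft.describe()  # Series also have summary methods like describe()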

Now let's look into this dataset.

We'll start by defining a function to plot the data (a companion function to plot models comes later). Since we have latitude and longitude for each home, we can plot this data in 2D with a color for the sale price.

(Sorry, you'll just have to imagine there's a map underneath.)

In [7]:
def plot_data():
    # You don't have to know how this function works.
    plt.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=.5)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.colorbar(label="Sale Price ($100k)")
plot_data()

We'll try to predict home price based on location (which the realtors assure us is the most important factor anyway). So we'll grab the Latitude and Longitude columns of the data. We'll call that input data X, by convention. There are several ways to index into a pandas DataFrame; indexing with a list of column names gives us a DataFrame containing just those columns. We'll then access the underlying NumPy data using .values.

In [8]:
X = ames[['Longitude', 'Latitude']].values
X.shape
Out[8]:
(2930, 2)

Our target, called y by convention, will be the home price (we'll soon introduce a different y, but start with this one).

In [9]:
y = ames['price'].values
y.shape
Out[9]:
(2930,)

Notice that X has two axes and thus is written in uppercase; y has 1 and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)

Now let's split the data into a train set and a validation set (sklearn calls this a train/test split, but that's fine). random_state is how sklearn specifies the random seed (it's actually slightly more flexible than a seed).

In [10]:
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.2, random_state=42)
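As a quick sketch of what random_state buys us, passing the same value reproduces exactly the same split:

# The same random_state yields an identical split every time
X_train2, X_valid2, y_train2, y_valid2 = train_test_split(X, y, test_size=.2, random_state=42)
print(np.array_equal(X_train, X_train2))  # expect True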

We'll verify that the shapes make sense. Note how many items are in each of the sets.

In [11]:
X_train.shape, y_train.shape
Out[11]:
((2344, 2), (2344,))
In [12]:
X_valid.shape, y_valid.shape
Out[12]:
((586, 2), (586,))

Here's a function to plot our regression model in "data space" (i.e., what it would predict everywhere on the map).

This function is pretty customized to our specific use case, though you can get inspiration from it for use in other situations.

In [13]:
def plot_model(model, fig=None):
    # Compute extents
    lat_min = ames.Latitude.min()
    lat_max = ames.Latitude.max()
    lon_min = ames.Longitude.min()
    lon_max = ames.Longitude.max()
    price_min = ames.price.min()
    price_max = ames.price.max()

    # Ask the model for predictions everywhere on a grid covering the map
    xx, yy = np.meshgrid(np.linspace(lon_min, lon_max, 250), np.linspace(lat_min, lat_max, 250))
    Z = model.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if fig is None:
        fig = plt.figure(figsize=plt.figaspect(2))

    # Top: show the predictions in 2D. Superimpose the original data.
    ax = fig.add_subplot(2, 1, 1)
    surf = ax.contourf(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(xlabel="Longitude", ylabel="Latitude", title="2D contour view")
    fig.colorbar(surf, label="Sale Price ($100k)")

    # Bottom: show the predictions in 3D
    ax = fig.add_subplot(2, 1, 2, projection='3d')
    ax.plot_surface(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    #ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(title="3D view")

Task¶

Part A: Linear regression¶

Step A1: Fit a linear regression model (call it linreg) to the training set (X_train, y_train).

In [14]:
linreg = LinearRegression().fit(...)
print("Prediction equation:")
print('y_pred = '
    + ' + '.join(f'{coef:.3f} * {name}' for coef, name in zip(linreg.coef_, ['Longitude', 'Latitude']))
    + f' + {linreg.intercept_:.3f}')
Prediction equation:
y_pred = -7.955 * Longitude + 11.886 * Latitude + -1242.742
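If you're stuck, one possible completion of the blank above (the step itself names the arguments):

# Fit on the training split defined earlier
linreg = LinearRegression().fit(X_train, y_train)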

Step A2: Plot the model's predictions in data space. The code for this step is filled in for you because there's no generic way to do it; our approach here is customized to our particular model and task, so you don't have to understand the details of how it works.

The main thing to observe here is what shapes you see. Think about why you might see those shapes in light of the prediction equation.

Note: The first plot is a contour plot (aka contour graph). If you're not familiar with this style of visualization, look at the Wikipedia page or the Khan Academy video. It's like a topographic map; the third dimension in this case is the model's predicted price. The lines you see are contour lines; the linear regression is actually quite smooth (too smooth?). The second plot shows the prediction in 3D, which may make it clearer.

In [15]:
plot_model(linreg)

Step A3: Compute the model's predictions on the validation set (call them y_pred). What does the model predict for the first house in the validation set? How does that compare with the actual price for that home?

In [16]:
y_pred = linreg.predict(...)
In [17]:
# your code here
Prediction for first house in validation set: 1.55, Actual: 1.61, difference: 0.06
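A minimal sketch for this step, assuming you predict on X_valid as the step describes:

# Predict on the validation set, then compare the first prediction to the truth
y_pred = linreg.predict(X_valid)
print(f"Prediction for first house in validation set: {y_pred[0]:.2f}, "
      f"Actual: {y_valid[0]:.2f}, difference: {abs(y_valid[0] - y_pred[0]):.2f}")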

Step A4: Compute and show the mean squared error and the mean absolute error for the validation set.

  • You may use the mean_absolute_error and mean_squared_error functions (imported from sklearn.metrics above).
  • Use the predictions you already made above.
  • Use Shift-TAB or ? to get the documentation for these functions to ensure you're passing the arguments in the correct order.
In [18]:
# your code here
Mean absolute error: 0.582
Mean squared error:  0.679
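A minimal sketch of those metric calls (note the (y_true, y_pred) argument order):

# Both metrics take the true values first, then the predictions
print(f"Mean absolute error: {mean_absolute_error(y_valid, y_pred):.3f}")
print(f"Mean squared error:  {mean_squared_error(y_valid, y_pred):.3f}")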

Part B: Decision tree regression¶

Step B1: Fit a decision tree model (call it dtree_reg) to the training set (X_train, y_train).

We'll use a small max_depth to be able to plot the tree. We'll then fit another one with full depth.

Notice how the tree makes its prediction: starting at the top (the root), it checks one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a "leaf"), it predicts the value stored there. (Think: how might it have computed that value?)

In [19]:
dtree_reg_small = DecisionTreeRegressor(max_depth=3)...
plt.figure(figsize=(20, 15))
sklearn.tree.plot_tree(dtree_reg_small, feature_names=["Longitude", "Latitude"], filled=True);
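To check your intuition about the leaf values, here's a sketch (assuming dtree_reg_small has been fit above; apply reports which leaf each row lands in):

# A regression tree leaf predicts the mean target of the training rows that reach it
leaf_of_row = dtree_reg_small.apply(X_train)      # leaf index for each training row
first_leaf = leaf_of_row[0]
print(y_train[leaf_of_row == first_leaf].mean())  # mean price in that leaf...
print(dtree_reg_small.predict(X_train[:1])[0])    # ...should match the tree's prediction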

Now let's let the tree grow as big as it wants.

In [20]:
dtree_reg = DecisionTreeRegressor()...

If the tree is big, the graphic may get unreadable. A text export may be easier to read:

In [21]:
print(sklearn.tree.export_text(dtree_reg, feature_names=["Longitude", "Latitude"], max_depth=2))
|--- Latitude <= 42.05
|   |--- Longitude <= -93.63
|   |   |--- Latitude <= 42.00
|   |   |   |--- truncated branch of depth 9
|   |   |--- Latitude >  42.00
|   |   |   |--- truncated branch of depth 22
|   |--- Longitude >  -93.63
|   |   |--- Latitude <= 42.04
|   |   |   |--- truncated branch of depth 28
|   |   |--- Latitude >  42.04
|   |   |   |--- truncated branch of depth 20
|--- Latitude >  42.05
|   |--- Longitude <= -93.65
|   |   |--- Longitude <= -93.66
|   |   |   |--- truncated branch of depth 12
|   |   |--- Longitude >  -93.66
|   |   |   |--- truncated branch of depth 24
|   |--- Longitude >  -93.65
|   |   |--- Longitude <= -93.63
|   |   |   |--- truncated branch of depth 22
|   |   |--- Longitude >  -93.63
|   |   |   |--- truncated branch of depth 16

Step B2: Plot the decision tree model in data space. Observe what shapes you see.

In [22]:
plot_model(dtree_reg)
Then compute and show the same validation-set errors as in Step A4, this time for the decision tree.

In [23]:
# your code here
Mean absolute error: 0.3863323208191126
Mean squared error: 0.4096931114556693
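A minimal sketch for that error computation, mirroring Step A4:

# Same metrics as before, using the tree's validation predictions
y_pred_tree = dtree_reg.predict(X_valid)
print("Mean absolute error:", mean_absolute_error(y_valid, y_pred_tree))
print("Mean squared error:", mean_squared_error(y_valid, y_pred_tree))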

Part C: Random Forest regression¶

Random forests take random subsets of the data and fit a decision tree to each one. As each tree is fit, it also considers only a random subset of the features at each decision. The combination of these two kinds of randomness reduces the variance of the model, that is, how much the model's predictions change if it's fit on different subsets of the data.

Fit a random forest regression model to this data. Use the default hyperparameters.

In [24]:
rf_reg = ...
print(f"We just fit a random forest with {rf_reg.n_estimators} trees.")
We just fit a random forest with 100 trees.
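One possible completion for the cell above (random_state is optional; it just makes the fit reproducible):

# Default hyperparameters; random_state only pins down the randomness
rf_reg = RandomForestRegressor(random_state=42).fit(X_train, y_train)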
In [25]:
plot_model(rf_reg)
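To see the "ensemble of trees" idea concretely, here's a sketch verifying that the forest's prediction is (up to floating point) the average of its trees' predictions:

# Each fitted tree lives in rf_reg.estimators_; the forest averages their outputs
tree_preds = np.stack([tree.predict(X_valid) for tree in rf_reg.estimators_])
print(np.allclose(tree_preds.mean(axis=0), rf_reg.predict(X_valid)))  # expect True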

Note: you can use code like this to show each of the different trees in the forest. It may or may not work on your computer, though.

In [26]:
if False:
    from matplotlib.animation import FuncAnimation
    from IPython.display import HTML
    def frame(i):
        plt.clf()
        plot_model(rf_reg.estimators_[i], fig=fig)
        plt.title(f"Tree {i:03d}")
    fig = plt.figure(figsize=(16, 10))
    anim = FuncAnimation(fig=fig, func=frame, frames=len(rf_reg.estimators_))
    # One of these two should work:
    display(HTML(anim.to_html5_video()))
    #display(HTML(anim.to_jshtml()))

Again, compute the predictions and errors.

In [27]:
# your code here
Mean absolute error: 0.2890493789401819
Mean squared error: 0.19982058357560653

Analysis¶

Q1: Describe the basic steps for fitting a model in sklearn and making predictions.

your narrative answer here

Q2: Describe parameters that the fit method takes. For each one, describe its purpose and its shape.

your narrative answer here

Q3: Describe, qualitatively, what each of the 3 models here looks like in data space. Describe a characteristic of the visualization that would let you tell immediately which type of model it is from. You might notice differences in the shapes of the boundaries it draws and, if you look more closely, a difference in how the boundaries relate to the data.

your narrative answer here

Q4: Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.

your narrative answer here

Extension¶

optional

  1. Compute the loss on the training set for each of these models. Can that help you tell whether the model overfit?
  2. Try using more features in the dataset. How well can you predict the price? Be careful about categorical features. (Note that you won't be able to use plot_model as-is if you add additional features.) A sketch touching on both extensions follows this list.
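Here's that sketch; the feature list is just an example subset, and a random forest stands in for whichever model you pick:

# Extension 2: pd.get_dummies one-hot encodes the categorical (object) columns
features = ['Longitude', 'Latitude', 'Lot_Area', 'Gr_Liv_Area', 'Neighborhood']  # example subset
X_more = pd.get_dummies(ames[features])  # Neighborhood becomes one indicator column per level
X_tr, X_va, y_tr, y_va = train_test_split(X_more.values, y, test_size=.2, random_state=42)
rf_more = RandomForestRegressor(random_state=42).fit(X_tr, y_tr)
# Extension 1: compare training-set error to validation-set error
print("Train MAE:", mean_absolute_error(y_tr, rf_more.predict(X_tr)))
print("Valid MAE:", mean_absolute_error(y_va, rf_more.predict(X_va)))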