Regression in scikit-learn
Goals
- Practice with the fit and predict interface of sklearn models.
- Get a visual sense of how different regression models work.
Rationale
Neural nets are strong performers on data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated into that form), simpler models can perform very well, and they are often much faster and sometimes more interpretable. Even if you plan to fit a neural net, training a decision tree or random forest first can be a good quick first pass.
The Scikit-Learn (sklearn) fit-predict interface has become the de facto industry standard for this sort of modeling, so it's highly likely that what you see here will be useful in your future work.
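To preview what that interface looks like, here's a minimal sketch with made-up toy data (not this lab's dataset): construct an estimator, call fit on training data, then call predict on new inputs. Every model in this lab follows the same pattern; only the constructor changes.
# A tiny illustration of the fit/predict pattern, using made-up toy data.
import numpy as np
from sklearn.linear_model import LinearRegression
X_toy = np.array([[0.], [1.], [2.], [3.]])  # 4 examples, 1 feature
y_toy = np.array([1., 3., 5., 7.])          # targets follow y = 2x + 1
toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)    # learn coefficients from the data
toy_model.predict([[4.]])      # predict for a new example: approximately array([9.])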
Documentation
The sklearn documentation is exemplary. See:
- Linear Models for, e.g., linear regression
- Decision Trees
- Ensemble Methods for, e.g., random forests
Setup
Let's import necessary modules: pandas and NumPy for data wrangling, Matplotlib for plotting, and some sklearn models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeRegressor
import sklearn.tree
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor, HistGradientBoostingRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
We'll load the data. We're using a dataset of home sale prices from the Ames, Iowa assessor's database, described in this paper. (DATA 202 students may remember seeing this dataset.)
Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. We'll see here that we can easily read a CSV file directly off the Internet...
ames = pd.read_csv('https://github.com/kcarnold/AmesHousing/blob/master/data/ames.csv.gz?raw=true', compression="gzip")
The main object from pandas is a DataFrame. It holds a table of data:
ames.head()
| | MS_SubClass | MS_Zoning | Lot_Frontage | Lot_Area | Street | Alley | Lot_Shape | Land_Contour | Utilities | Lot_Config | ... | Fence | Misc_Feature | Misc_Val | Mo_Sold | Year_Sold | Sale_Type | Sale_Condition | Sale_Price | Longitude | Latitude |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 141 | 31770 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Corner | ... | No_Fence | NaN | 0 | 5 | 2010 | WD | Normal | 215000 | -93.619754 | 42.054035 |
| 1 | One_Story_1946_and_Newer_All_Styles | Residential_High_Density | 80 | 11622 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Inside | ... | Minimum_Privacy | NaN | 0 | 6 | 2010 | WD | Normal | 105000 | -93.619756 | 42.053014 |
| 2 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 81 | 14267 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Corner | ... | No_Fence | Gar2 | 12500 | 6 | 2010 | WD | Normal | 172000 | -93.619387 | 42.052659 |
| 3 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 93 | 11160 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Corner | ... | No_Fence | NaN | 0 | 4 | 2010 | WD | Normal | 244000 | -93.617320 | 42.051245 |
| 4 | Two_Story_1946_and_Newer | Residential_Low_Density | 74 | 13830 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Inside | ... | Minimum_Privacy | NaN | 0 | 3 | 2010 | WD | Normal | 189900 | -93.638933 | 42.060899 |
5 rows × 81 columns
Each column of data generally has a consistent data type. (Note: object columns are the exception. They usually mean "string", but could actually hold any Python object.)
ames.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 81 columns):
 #   Column              Non-Null Count  Dtype
---  ------              --------------  -----
 0   MS_SubClass         2930 non-null   object
 1   MS_Zoning           2930 non-null   object
 2   Lot_Frontage        2930 non-null   int64
 3   Lot_Area            2930 non-null   int64
 4   Street              2930 non-null   object
 5   Alley               2930 non-null   object
 6   Lot_Shape           2930 non-null   object
 7   Land_Contour        2930 non-null   object
 8   Utilities           2930 non-null   object
 9   Lot_Config          2930 non-null   object
 10  Land_Slope          2930 non-null   object
 11  Neighborhood        2930 non-null   object
 12  Condition_1         2930 non-null   object
 13  Condition_2         2930 non-null   object
 14  Bldg_Type           2930 non-null   object
 15  House_Style         2930 non-null   object
 16  Overall_Qual        2930 non-null   object
 17  Overall_Cond        2930 non-null   object
 18  Year_Built          2930 non-null   int64
 19  Year_Remod_Add      2930 non-null   int64
 20  Roof_Style          2930 non-null   object
 21  Roof_Matl           2930 non-null   object
 22  Exterior_1st        2930 non-null   object
 23  Exterior_2nd        2930 non-null   object
 24  Mas_Vnr_Type        1155 non-null   object
 25  Mas_Vnr_Area        2930 non-null   int64
 26  Exter_Qual          2930 non-null   object
 27  Exter_Cond          2930 non-null   object
 28  Foundation          2930 non-null   object
 29  Bsmt_Qual           2930 non-null   object
 30  Bsmt_Cond           2930 non-null   object
 31  Bsmt_Exposure       2930 non-null   object
 32  BsmtFin_Type_1      2930 non-null   object
 33  BsmtFin_SF_1        2930 non-null   int64
 34  BsmtFin_Type_2      2930 non-null   object
 35  BsmtFin_SF_2        2930 non-null   int64
 36  Bsmt_Unf_SF         2930 non-null   int64
 37  Total_Bsmt_SF       2930 non-null   int64
 38  Heating             2930 non-null   object
 39  Heating_QC          2930 non-null   object
 40  Central_Air         2930 non-null   object
 41  Electrical          2930 non-null   object
 42  First_Flr_SF        2930 non-null   int64
 43  Second_Flr_SF       2930 non-null   int64
 44  Low_Qual_Fin_SF     2930 non-null   int64
 45  Gr_Liv_Area         2930 non-null   int64
 46  Bsmt_Full_Bath      2930 non-null   int64
 47  Bsmt_Half_Bath      2930 non-null   int64
 48  Full_Bath           2930 non-null   int64
 49  Half_Bath           2930 non-null   int64
 50  Bedroom_AbvGr       2930 non-null   int64
 51  Kitchen_AbvGr       2930 non-null   int64
 52  Kitchen_Qual        2930 non-null   object
 53  TotRms_AbvGrd       2930 non-null   int64
 54  Functional          2930 non-null   object
 55  Fireplaces          2930 non-null   int64
 56  Fireplace_Qu        2930 non-null   object
 57  Garage_Type         2930 non-null   object
 58  Garage_Finish       2930 non-null   object
 59  Garage_Cars         2930 non-null   int64
 60  Garage_Area         2930 non-null   int64
 61  Garage_Qual         2930 non-null   object
 62  Garage_Cond         2930 non-null   object
 63  Paved_Drive         2930 non-null   object
 64  Wood_Deck_SF        2930 non-null   int64
 65  Open_Porch_SF       2930 non-null   int64
 66  Enclosed_Porch      2930 non-null   int64
 67  Three_season_porch  2930 non-null   int64
 68  Screen_Porch        2930 non-null   int64
 69  Pool_Area           2930 non-null   int64
 70  Pool_QC             2930 non-null   object
 71  Fence               2930 non-null   object
 72  Misc_Feature        106 non-null    object
 73  Misc_Val            2930 non-null   int64
 74  Mo_Sold             2930 non-null   int64
 75  Year_Sold           2930 non-null   int64
 76  Sale_Type           2930 non-null   object
 77  Sale_Condition      2930 non-null   object
 78  Sale_Price          2930 non-null   int64
 79  Longitude           2930 non-null   float64
 80  Latitude            2930 non-null   float64
dtypes: float64(2), int64(33), object(46)
memory usage: 1.8+ MB
It behaves like a dictionary of its columns. Each column is a Series object.
type(ames['Sale_Price'])
pandas.core.series.Series
Series support broadcast operations, similar to NumPy arrays and Torch tensors; they also have other functionality.
ames['price'] = ames["Sale_Price"] / 100_000 # Make `price` be in units of $100k, to be easier to interpret.
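Besides broadcasting, Series come with lots of built-in methods. For instance (just an illustration; not needed later in the lab):
ames['price'].describe()  # count, mean, std, min, quartiles, and max of the prices (in $100k)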
Now let's look into this dataset. We'll define some functions to plot the data and models. Since we have latitude and longitude for each home, we can plot the homes in 2D, with color indicating sale price.
(Sorry, you'll just have to imagine there's a map underneath.)
def plot_data():
    # You don't have to know how this function works.
    plt.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=.5)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.colorbar(label="Sale Price ($100k)")
plot_data()
We'll try to predict home price based on location (which the realtors assure us is the most important factor anyway). So we'll grab the Latitude and Longitude columns of the data. We'll call that input data X, by convention. There are several different ways to index into a pandas DataFrame; using a list gives us a DataFrame with just the columns with those names. We'll then access the underlying NumPy data by using .values.
feature_names = ['Longitude', 'Latitude']
X = ames[feature_names].values
X.shape
(2930, 2)
Our target, called y by convention, will be the home price (we'll soon introduce a different y, but start with this one).
y = ames['price'].values
y.shape
(2930,)
Notice that X has two axes and thus is written in uppercase; y has one axis and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)
Now let's split the data into a training set and a validation set (which sklearn calls "train" and "test", but that's fine). random_state is how sklearn specifies the random seed (it's actually slightly more flexible than a seed).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.2, random_state=42)
We'll verify that the shapes make sense. Note how many items are in each of the sets.
X_train.shape, y_train.shape
((2344, 2), (2344,))
X_valid.shape, y_valid.shape
((586, 2), (586,))
Here's a function to plot our regression model in "data space" (i.e., what it would predict everywhere on the map).
This function is pretty customized to our specific use case, though you can get inspiration from it for use in other situations.
def plot_model(clf, fig=None):
    # Compute extents
    lat_min = ames.Latitude.min()
    lat_max = ames.Latitude.max()
    lon_min = ames.Longitude.min()
    lon_max = ames.Longitude.max()
    price_min = ames.price.min()
    price_max = ames.price.max()

    # Ask the model for predictions on a grid of points covering the map
    xx, yy = np.meshgrid(np.linspace(lon_min, lon_max, 250), np.linspace(lat_min, lat_max, 250))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)

    if fig is None:
        fig = plt.figure(figsize=plt.figaspect(2))

    # Top panel: show the predictions in 2D. Superimpose the original data.
    ax = fig.add_subplot(2, 1, 1)
    surf = ax.contourf(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(xlabel="Longitude", ylabel="Latitude", title="2D contour view")
    fig.colorbar(surf, label="Sale Price ($100k)")

    # Bottom panel: show the predictions in 3D
    ax = fig.add_subplot(2, 1, 2, projection='3d')
    ax.plot_surface(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    #ax.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    ax.set(title="3D view")
Task
Part A: Linear regression
Step A1: Fit a linear regression model (call it linreg) to the training set (X_train, y_train).
linreg = LinearRegression().fit(...)
print("Prediction equation:")
print('y_pred = '
+ ' + ' .join(f'{coef:.3f} * {name}' for coef, name in zip(linreg.coef_, feature_names))
+ f' + {linreg.intercept_:.3f}')
Prediction equation:
y_pred = -7.955 * Longitude + 11.886 * Latitude + -1242.742
Here are the Longitude and Latitude coordinates of the first home in the validation set. Compute what the model would predict for this home, based on the prediction equation shown above.
example_home = X_valid[0]
example_home
array([-93.621065, 42.029038])
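Work it out by hand first. Afterwards, if you want to check your arithmetic, a sketch like this applies the prediction equation directly (it assumes you fit linreg in Step A1):
# The prediction equation is just a dot product of coefficients and features, plus the intercept.
linreg.coef_ @ example_home + linreg.intercept_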
Now we'll plot the model's predictions in data space. The code for this step is filled in for you because there's no generic way to do this; our approach is customized to our particular model and task, so you don't have to understand the details of how it works.
The main thing to observe is what shapes you see. Think about why you might see those shapes in light of the prediction equation.
Note: The first plot is a contour plot (aka contour graph) showing both the model's predictions and our training data. The colored dots scattered across the plot are our actual training data points, while the background contour lines show the model's predicted prices across the entire space.
If you've ever seen a topographic map showing elevation, this is the same idea - each line represents locations with the same predicted price, just like topographic lines show locations of the same elevation.
For more information on contour plots, look them up on Wikipedia or Khan Academy.
How to Read This Plot:
- The x-axis shows longitude
- The y-axis shows latitude
- The color and contour lines show the predicted price
- Darker colors indicate lower predicted prices
- The scattered dots show where our actual training data points are located
The lines that you see are like contour lines; the linear regression is actually quite smooth (too smooth?). The second plot shows the prediction in 3D space, which may make it clearer.
plot_model(linreg)
Step A3: Compute the model's predictions on the validation set (call them y_pred). What was the model's prediction for the first house in the validation set? (did you get it right in the previous question?) How does that compare with the actual price for that home?
y_pred = linreg.predict(...)
# your code here
Prediction for first house in validation set: 1.55, Actual: 1.61, difference: 0.06
Step A4: Compute and show the mean squared error, the mean absolute error, and the r2 score for the validation set.
- You may use the mean_absolute_error, mean_squared_error, and r2_score functions (imported from sklearn.metrics above).
- Use the predictions you already made above.
- Use Shift-Tab or ? to get the documentation for these functions to ensure you're passing the arguments in the correct order.
# your code here
Mean absolute error: 0.582
Mean squared error: 0.679
R2 score: 0.154
For convenience, sklearn models have a .score() method, which computes the predictions and a score in one call. Look at the results below: which metric does .score() seem to use for regression tasks?
print(f"Training score: {linreg.score(X_train, y_train):.3f}")
print(f"Validation score: {linreg.score(X_valid, y_valid):.3f}")
Training score: 0.147
Validation score: 0.154
Reflect: (these questions might make more sense once you have looked at the other two types of models below)
- How does a linear regression make a prediction?
- Is a linear regression able to capture that homes are typically more expensive on one side of town than the other? (Which side?) Why or why not?
- Is a linear regression able to capture that homes in one neighborhood are typically more expensive than those in adjacent neighborhoods? Why or why not?
Part B: Decision tree regression
Step B1: Fit a decision tree model (call it dtree_reg) to the training set (X_train, y_train).
We'll use a small max_depth to be able to plot the tree. We'll then fit another one with full depth.
Notice how the tree makes its prediction starting at the top (root) and checking one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a "leaf"), it predicts the value stored there. (Think: how do you think it might have computed that value?)
dtree_reg_small = DecisionTreeRegressor(max_depth=2, random_state=42)...
plt.figure(figsize=(20, 15))
sklearn.tree.plot_tree(dtree_reg_small, feature_names=feature_names, filled=True);
Exercise: compute by hand the tree's prediction for the same first house in the validation set.
your answer here
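Once you've traced the tree by hand, one way to check yourself (a sketch, assuming dtree_reg_small has been fit above) is to ask the tree which leaf the example lands in and what value is stored there:
# apply() returns the index of the leaf node each example ends up in;
# tree_.value holds the prediction stored at each node.
leaf_id = dtree_reg_small.apply(example_home.reshape(1, -1))[0]
dtree_reg_small.tree_.value[leaf_id]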
Now let's let the tree grow as big as it wants.
dtree_reg = DecisionTreeRegressor(random_state=42)...
If the tree is big, the graphic may get unreadable. A text export may be easier to read:
print(sklearn.tree.export_text(dtree_reg, feature_names=feature_names, max_depth=2))
|--- Latitude <= 42.05
|   |--- Longitude <= -93.63
|   |   |--- Latitude <= 42.00
|   |   |   |--- truncated branch of depth 9
|   |   |--- Latitude > 42.00
|   |   |   |--- truncated branch of depth 22
|   |--- Longitude > -93.63
|   |   |--- Latitude <= 42.04
|   |   |   |--- truncated branch of depth 28
|   |   |--- Latitude > 42.04
|   |   |   |--- truncated branch of depth 20
|--- Latitude > 42.05
|   |--- Longitude <= -93.65
|   |   |--- Longitude <= -93.66
|   |   |   |--- truncated branch of depth 12
|   |   |--- Longitude > -93.66
|   |   |   |--- truncated branch of depth 24
|   |--- Longitude > -93.65
|   |   |--- Longitude <= -93.63
|   |   |   |--- truncated branch of depth 22
|   |   |--- Longitude > -93.63
|   |   |   |--- truncated branch of depth 16
Step B2: Plot the decision tree model in data space. Observe what shapes you see.
plot_model(dtree_reg)
Now compute the predictions and errors for this model, as you did for the linear regression.
# your code here
Mean absolute error: 0.3832485153583618
Mean squared error: 0.4089356941066932
Part C: Random Forest regression
Random Forests take random subsets of the data and fit decision trees to each one. As each tree is fit, it also considers only a random subset of features for each decision. The combination of these two reduces the variance of the model, that is, how much the model's predictions change if it's fit on different subsets of data.
Fit a random forest regression model to this data. Use the default hyperparameters and a random_state of 42 (only matters if you want to compare your results with the reference notebook).
rf_reg = ...
print(f"We just fit a random forest with {rf_reg.n_estimators} trees.")
We just fit a random forest with 100 trees.
plot_model(rf_reg)
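To connect the plot back to the description above: the individual fitted trees live in rf_reg.estimators_, and the forest's prediction is just the average of their predictions. A quick sketch to see that (assuming rf_reg was fit above):
# Average the per-tree predictions for the first validation example and compare
# with the forest's own prediction; they should agree (up to floating-point error).
per_tree_preds = np.array([tree.predict(X_valid[:1])[0] for tree in rf_reg.estimators_])
per_tree_preds.mean(), rf_reg.predict(X_valid[:1])[0]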
Note: you can use code like this to show all of the different trees in the forest. It may or may not work on your computer, though.
if False:
    from matplotlib.animation import FuncAnimation
    from IPython.display import HTML

    def frame(i):
        plt.clf()
        plot_model(rf_reg.estimators_[i], fig=fig)
        plt.title(f"Tree {i:03d}")

    fig = plt.figure(figsize=(16, 10))
    anim = FuncAnimation(fig=fig, func=frame, frames=min(10, len(rf_reg.estimators_)))
    # One of these two should work:
    display(HTML(anim.to_html5_video()))
    #display(HTML(anim.to_jshtml()))
Again, compute the predictions and errors.
# your code here
Mean absolute error: 0.28960648131845396
Mean squared error: 0.20209223072066412
Analysis
Q1: Describe the basic steps for fitting a model in sklearn and making predictions.
your narrative answer here
Q2: Describe the parameters that the fit method takes. For each one, describe its purpose and its shape.
your narrative answer here
Q3: Describe, qualitatively, what each of the 3 models here looks like in data space. Describe a characteristic of the visualization that would let you tell immediately which type of model it is from. You might notice differences in the shapes of the boundaries it draws and, if you look more closely, a difference in how the boundaries relate to the data.
your narrative answer here
Q4: Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.
your narrative answer here
Extension
(optional)
- Compute the loss on the training set for each of these models. Can that help you tell whether the model overfit or not?
- Try using more features in the dataset. How well can you predict the price? Be careful about categorical features; one possible approach is sketched below. (Note that you won't be able to use plot_model as-is if you add additional features.)
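For the second extension, one common approach (just a sketch of one possibility, with a hand-picked set of columns) is to one-hot encode the categorical columns with a ColumnTransformer and fit a pipeline:
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
from sklearn.pipeline import make_pipeline

# Hypothetical choice of extra features; any mix of numeric and categorical columns works.
numeric_features = ['Longitude', 'Latitude', 'Gr_Liv_Area', 'Year_Built']
categorical_features = ['Neighborhood', 'Bldg_Type']

preprocess = ColumnTransformer([
    ('numeric', 'passthrough', numeric_features),
    ('categorical', OneHotEncoder(handle_unknown='ignore'), categorical_features),
])
pipeline = make_pipeline(preprocess, RandomForestRegressor(random_state=42))

X_more = ames[numeric_features + categorical_features]
Xm_train, Xm_valid, ym_train, ym_valid = train_test_split(X_more, ames['price'], test_size=.2, random_state=42)
pipeline.fit(Xm_train, ym_train)
print(f"Validation R2 with more features: {pipeline.score(Xm_valid, ym_valid):.3f}")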