scikit-learn
The fit and predict interface of sklearn models
Neural nets are strong performers for data that lacks clear features. But for well-structured tabular data with meaningful features (or data that can be translated into that form), simple models can sometimes perform very well, and they can be much faster and sometimes more interpretable. Even if you plan to fit a neural net, training a decision tree or random forest first can be a good quick first pass.
The Scikit-Learn (sklearn) fit-predict interface for modeling has become the de facto industry standard for this sort of modeling, so it's highly likely that what you see here will be useful in your future work.
The sklearn documentation is exemplary; see the user guide and API reference at scikit-learn.org.
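As a quick preview of the fit/predict pattern before we load real data, here's a minimal, self-contained sketch on made-up numbers (the toy arrays are just for illustration and aren't part of the lab):
import numpy as np
from sklearn.linear_model import LinearRegression

X_toy = np.array([[1.0], [2.0], [3.0], [4.0]])  # inputs: shape (n_samples, n_features)
y_toy = np.array([2.0, 4.0, 6.0, 8.0])          # targets: shape (n_samples,)

toy_model = LinearRegression()
toy_model.fit(X_toy, y_toy)         # learn parameters from the training data
print(toy_model.predict([[5.0]]))   # predict for new inputs; prints approximately [10.]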
Let's import necessary modules: pandas and NumPy for data wrangling, Matplotlib for plotting, and some sklearn models.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
import sklearn.tree
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, log_loss, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
We'll load the data. We're using a dataset of home sale prices from the Ames, Iowa assessor's database, described in this paper. (DATA 202 students may remember seeing this dataset.)
Pandas (typically imported as pd, see above) is a very useful library for working with tabular datasets. We'll see here that we can easily read a CSV file directly off the Internet...
ames = pd.read_csv('https://github.com/kcarnold/AmesHousing/blob/master/data/ames.csv.gz?raw=true', compression="gzip")
The main object from pandas is a DataFrame. It holds a table of data:
ames.head()
Each column of data generally has a consistent data type. (Note: object columns are the exception. They usually mean "string", but could actually hold any Python object.)
ames.info()
It behaves like a dictionary of its columns. Each column is a Series object.
type(ames['Sale_Price'])
A Series supports broadcast (elementwise) operations, similar to NumPy arrays and Torch tensors; it also has lots of other functionality.
ames['price'] = ames["Sale_Price"] / 100_000 # Make `price` be in units of $100k, to be easier to interpret.
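For instance, here are a few of the many things a Series can do (this cell is just an illustration and isn't needed for the rest of the lab):
print(ames['price'].mean())        # average sale price, in units of $100k
print((ames['price'] > 3).mean())  # fraction of homes that sold for more than $300k
ames['price'].describe()           # summary statistics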
Now we'll look into this dataset:
We'll define some functions to plot the data and models. Since we have latitude and longitude for each home, we can plot this data in 2D with a color for the sale price.
(Sorry, you'll just have to imagine there's a map underneath.)
def plot_data():
    # You don't have to know how this function works.
    plt.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=.5)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.colorbar(label="Sale Price ($100k)")
plot_data()
We'll try to predict home price based on location (which the realtors assure us is the most important factor anyway). So we'll grab the Latitude and Longitude columns of the data. We'll call that input data X, by convention. There are several different ways to index into a pandas DataFrame; using a list gives us a DataFrame with just the columns with those names. We'll then access the underlying NumPy data by using .values.
X = ames[['Longitude', 'Latitude']].values
X.shape
Our target, called y by convention, will be the home price (we'll soon introduce a different y, but start with this one).
y = ames['price'].values
y.shape
Notice that X has two axes and thus is written in uppercase; y has one axis and thus is written in lowercase. (This is sklearn convention; other libraries are less consistent about this.)
Now let's split the data into a training set and a validation set (sklearn calls the held-out set "test", but that's fine). random_state is how sklearn specifies the random seed (it's actually slightly more flexible than a seed).
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=.2, random_state=42)
We'll verify that the shapes make sense. Note how many items are in each of the sets.
X_train.shape, y_train.shape
X_valid.shape, y_valid.shape
Here's a function to plot our regression model in "data space" (i.e., what it would predict everywhere on the map).
This function is pretty customized to our specific use case, though you can get inspiration from it for use in other situations.
def plot_model(clf, fig=None):
    # Compute extents
    lat_min = ames.Latitude.min()
    lat_max = ames.Latitude.max()
    lon_min = ames.Longitude.min()
    lon_max = ames.Longitude.max()
    price_min = ames.price.min()
    price_max = ames.price.max()
    # Ask the model for predictions on a grid (column order matches X: Longitude, then Latitude)
    xx, yy = np.meshgrid(np.linspace(lon_min, lon_max, 250), np.linspace(lat_min, lat_max, 250))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
    # Show the predictions. Superimpose the original data.
    if fig is None:
        plt.figure(figsize=(16, 8))
    plt.contourf(xx, yy, Z, alpha=.5, cmap=plt.cm.viridis, vmin=price_min, vmax=price_max)
    plt.scatter(ames['Longitude'], ames['Latitude'], c=ames["price"], s=1, cmap='viridis', vmin=price_min, vmax=price_max)
    plt.xlabel("Longitude"); plt.ylabel("Latitude")
    plt.colorbar(label="Sale Price ($100k)")
Step A1: Fit a linear regression model (call it linreg) to the training set (X_train, y_train).
linreg = LinearRegression().fit(...)
print("Prediction equation:")
print('y_pred = '
      + ' + '.join(f'{coef:.3f} * {name}' for coef, name in zip(linreg.coef_, ['Longitude', 'Latitude']))  # names in the same order as the columns of X
      + f' + {linreg.intercept_:.3f}')
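If you get stuck on Step A1, one possible completion is simply (a sketch; the names follow the step's instructions):
# Fit ordinary least squares on the training split (sketch of one possible solution).
linreg = LinearRegression().fit(X_train, y_train)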
Step A2: Plot the model's predictions in data space. The code for this step is filled in for you because there isn't a generic way to do this; our approach is customized to this particular model and task, so you don't have to understand the details of how it works.
The main thing to observe here is what shapes you see. Think about why you might see those shapes in light of the prediction equation.
Note: These are contour plots (aka contour graphs). If you're not familiar with this style of visualization, look at the Wikipedia page or the Khan Academy video. It's like a topographic map; the third dimension in this case is the model's predicted price.
plot_model(linreg)
Step A3: Compute the model's predictions on the validation set (call them y_pred). What does the model predict for the first house in the validation set? How does that compare with the actual price for that home?
y_pred = linreg.predict(...)
# your code here
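One way Step A3 could look (a sketch; it assumes linreg was fit in Step A1):
# Predict on the validation set and compare the first home's prediction to its actual price (sketch).
y_pred = linreg.predict(X_valid)
print("Predicted price of first validation home ($100k):", y_pred[0])
print("Actual price of first validation home ($100k):   ", y_valid[0])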
Step A4: Compute and show the mean squared error and the mean absolute error for the validation set.
Hint: use the mean_absolute_error and mean_squared_error functions (imported from sklearn.metrics above). Use ? (e.g., mean_absolute_error?) to get the documentation for these functions to make sure you're passing the arguments in the correct order.
# your code here
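One possible version of Step A4 (a sketch, assuming y_pred from Step A3; both metric functions take y_true first, then y_pred):
# Validation-set error metrics (sketch).
print("Mean squared error: ", mean_squared_error(y_valid, y_pred))
print("Mean absolute error:", mean_absolute_error(y_valid, y_pred))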
Step B1: Fit a decision tree model (call it dtree_reg) to the training set (X_train, y_train).
We'll use a small max_depth to be able to plot the tree. We'll then fit another one with full depth.
Notice how the tree makes its prediction starting at the top (root) and checking one feature at a time. If the check is True, it goes left; otherwise, it goes right. When it hits a node with no check (a "leaf"), it predicts the value stored there. (Think: how do you think it might have computed that value?)
dtree_reg_small = DecisionTreeRegressor(max_depth=3)...
plt.figure(figsize=(20, 15))
sklearn.tree.plot_tree(dtree_reg_small, feature_names=["Longitude", "Latitude"], filled=True);
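If you're unsure how to complete the placeholder a couple of cells above, one possible version is (a sketch):
# Fit a shallow decision tree on the training split (sketch).
dtree_reg_small = DecisionTreeRegressor(max_depth=3)
dtree_reg_small.fit(X_train, y_train)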
Now let's let the tree grow as big as it wants.
dtree_reg = DecisionTreeRegressor()...
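Again, one possible completion of the placeholder above (a sketch):
# Fit an unconstrained (full-depth) decision tree on the training split (sketch).
dtree_reg = DecisionTreeRegressor().fit(X_train, y_train)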
If the tree is big, the graphic may get unreadable. A text export may be easier to read:
print(sklearn.tree.export_text(dtree_reg, feature_names=["Longitude", "Latitude"], max_depth=2))
Step B2: Plot the decision tree model in data space. Observe what shapes you see.
plot_model(dtree_reg)
# your code here
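If the blank cell here is meant for the tree's validation error (mirroring Steps A3 and A4), one possible version is the following sketch; dtree_pred is just an illustrative name:
# Validation predictions and errors for the decision tree (sketch).
dtree_pred = dtree_reg.predict(X_valid)
print("MSE:", mean_squared_error(y_valid, dtree_pred))
print("MAE:", mean_absolute_error(y_valid, dtree_pred))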
Random Forests take random subsets of the data and fit decision trees to each one. As each tree is fit, it also considers only a random subset of features for each decision. The combination of these two reduces the variance of the model, that is, how much the model's predictions change if it's fit on different subsets of data.
Fit a random forest regression model to this data. Use the default hyperparameters.
rf_reg = ...
print(f"We just fit a random forest with {rf_reg.n_estimators} trees.")
plot_model(rf_reg)
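One possible completion of the random forest fit above (a sketch; it uses the default hyperparameters, as the step asks):
# Fit a random forest regressor with default hyperparameters (sketch).
rf_reg = RandomForestRegressor().fit(X_train, y_train)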
Note: you can use code like this to show all of the different trees in the forest. It may or may not work on your computer, though.
if False:
    from matplotlib.animation import FuncAnimation
    from IPython.display import HTML

    def frame(i):
        plt.clf()
        plot_model(rf_reg.estimators_[i], fig=fig)
        plt.title(f"Tree {i:03d}")

    fig = plt.figure(figsize=(16, 10))
    anim = FuncAnimation(fig=fig, func=frame, frames=len(rf_reg.estimators_))
    # One of these two should work:
    display(HTML(anim.to_html5_video()))
    #display(HTML(anim.to_jshtml()))
Again, compute the predictions and errors.
# your code here
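One way this could look (a sketch; rf_pred is just an illustrative name):
# Validation predictions and errors for the random forest (sketch).
rf_pred = rf_reg.predict(X_valid)
print("MSE:", mean_squared_error(y_valid, rf_pred))
print("MAE:", mean_absolute_error(y_valid, rf_pred))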
Q1: Describe the basic steps for fitting a model in sklearn and making predictions.
your narrative answer here
Q2: Describe the parameters that the fit method takes. For each one, describe its purpose and its shape.
your narrative answer here
Q3: Describe, qualitatively, what each of the 3 models here looks like in data space. Describe a characteristic of the visualization that would let you immediately tell which type of model it came from. You might notice differences in the shapes of the boundaries it draws and, if you look more closely, a difference in how the boundaries relate to the data.
your narrative answer here
Q4: Describe, quantitatively, how the performance of the different models compares. Which performs best? Which performs worst? Explain how the performance numbers make sense in light of the data-space plots.
your narrative answer here
Optional: try adding more features to the model and see how its performance changes. (Note that you won't be able to use plot_model as-is if you add additional features.)
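If you try this extension, a sketch might look like the following. The extra column names (Gr_Liv_Area, Year_Built) are assumptions about this particular CSV; check ames.columns to see what's actually available.
# Optional extension (sketch): add more input features.
# The extra column names below are assumptions; inspect ames.columns for what this CSV actually has.
feature_cols = ['Longitude', 'Latitude', 'Gr_Liv_Area', 'Year_Built']
X_more = ames[feature_cols].values
X_more_train, X_more_valid, y_train2, y_valid2 = train_test_split(X_more, y, test_size=.2, random_state=42)
rf_more = RandomForestRegressor().fit(X_more_train, y_train2)
print("Validation MAE with more features:", mean_absolute_error(y_valid2, rf_more.predict(X_more_valid)))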