scikit-learn
Goals:
- the fit and predict interface of sklearn models

Much of this setup is the same as the regression notebook.
Let's import the necessary modules: Pandas and NumPy for data wrangling, Matplotlib for plotting, and some sklearn models, metrics, and utilities.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn.tree import DecisionTreeClassifier, DecisionTreeRegressor
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier
from sklearn.metrics import mean_absolute_error, mean_squared_error, log_loss, accuracy_score, classification_report
from sklearn.model_selection import train_test_split
We'll load the data. We're using a dataset of home sale prices from the Ames, Iowa assessor's database, described in this paper.
ames = pd.read_csv('https://github.com/kcarnold/AmesHousing/blob/master/data/ames.csv.gz?raw=true', compression="gzip")
ames['price'] = ames["Sale_Price"] / 100_000 # Make `price` be in units of $100k, to be easier to interpret.
ames.head()
We'll try to predict home price based on location (which the realtors assure us is the most important factor anyway). So we'll grab the Latitude and Longitude columns of the data. We'll call that input data X, by convention.
X = ames[['Longitude', 'Latitude']].values
X.shape
We'll do something different for y; see below.
Notice that the distribution of sale prices is skewed.
plt.hist(ames.price);
Skew can make regression hard because errors in the tails (in this case, the expensive houses) can dominate: mispredicting a million-dollar home by 1% is as bad as mispredicting a $100k home by 10%!
One way to resolve this is to transform the target variable to be more evenly distributed. (For example, a log transformation will make all percentage errors equally important.) Another way is to transform it into a classification problem, where we predict whether the home price is low, medium, or high. We'll skip lots of nuance here and just split the prices into 3 equal buckets.
# This is some Pandas trickery. Enjoy, those who dare venture here! Otherwise don't worry about it.
ames['price_rank'] = ames.price.rank(pct=True)
ames['price_bin'] = 0 + (ames.price_rank > 1/3) + (ames.price_rank > 2/3)
ames.price_bin.value_counts()
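(As an aside, the log transformation mentioned above, which we won't use in the rest of this notebook, might look like the following sketch; log_price is a name invented here just for illustration.)
# Hypothetical alternative (not used below): work with log-price so that
# percentage errors count equally for cheap and expensive homes.
ames['log_price'] = np.log(ames['price'])
plt.hist(ames['log_price']);  # much less skewed than the raw prices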
Make a target y using the price_bin column of ames.
y = ...values
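(One possible fill for the cell above, if you want to check your work:)
# Grab the price_bin column of ames as an array of class labels (0, 1, or 2).
y = ames['price_bin'].values
y.shape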
Split the data (X and y) into a training and validation set, using the same fraction and seed as the previous notebook.
# your code here
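(A possible sketch of that split; the previous notebook isn't shown here, so the test_size fraction and random_state seed below are only assumptions standing in for whatever values you used there.)
# Hold out part of the data for validation. The 0.2 fraction and seed of 42
# are placeholders; match whatever the previous notebook used.
X_train, X_valid, y_train, y_valid = train_test_split(X, y, test_size=0.2, random_state=42)
X_train.shape, X_valid.shape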
We could use the plotting mechanism we used for regression in the previous notebook (try it, it does work), but it would be a bit confusing because the model predicts 0, 1, or 2, while the data is in the original price range. It would also omit one cool thing we gain by moving to a classifier: we get probabilities! You can interpret those as the model's confidence about a home price prediction. We could get something like that from regression too, but it's more complex; with classification it comes for free.
Here we define the new plotting function; don't worry about how it works.
def plot_class_probs(clf):
    lat_min = ames.Latitude.min()
    lat_max = ames.Latitude.max()
    lon_min = ames.Longitude.min()
    lon_max = ames.Longitude.max()
    xx, yy = np.meshgrid(np.linspace(lon_min, lon_max, 500), np.linspace(lat_min, lat_max, 500))
    Z = clf.predict_proba(np.c_[xx.ravel(), yy.ravel()])
    n_classes = Z.shape[1]
    fig, axs = plt.subplots(ncols=n_classes, figsize=(16, 6), sharey=True)
    for i, ax in enumerate(axs):
        contour = ax.contourf(xx, yy, Z[:, i].reshape(xx.shape), alpha=.5, cmap=plt.cm.RdBu_r)#, vmin=0., vmax=1.)
        ax.scatter(ames['Longitude'], ames['Latitude'], s=.5, color='k')
        ax.set(title=f"Class {i} probabilities", xlabel="Longitude")
    axs[0].set(ylabel="Latitude")
    fig.colorbar(contour, ax=ax, fraction=.05)
Logistic regression is a classification algorithm, despite the name! It's the classifier version of linear regression.
Fit a logistic regression model (call it logreg) to our training set (X_train and y_train).
logreg = ...
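(A possible solution sketch:)
# Fit a logistic regression classifier on the training split.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)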
Let's plot the class probabilities. Notice the range of values (in the color bar). What do you think the model will classify homes in the northwest (top left) corner as?
plot_class_probs(logreg)
Compute the accuracy and cross-entropy loss. You can use accuracy_score and log_loss. You'll need to use predict_proba for one of these (which one?) to ask the classifier to tell you its probabilities, not just its best guess.
Note: cross-entropy loss is also known as log loss; think about why.
def summarize_classifier(clf, X_valid, y_valid):
    print("Accuracy: {:.3f}".format(accuracy_score(y_valid, ...)))
    print("Log loss: {:.3f}".format(log_loss(y_valid, ...)))
summarize_classifier(logreg, X_valid, y_valid)
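(If you get stuck, here is one way the blanks above could be filled in:)
def summarize_classifier(clf, X_valid, y_valid):
    # accuracy_score compares hard class predictions with the true labels;
    # log_loss needs the predicted probability of each class instead.
    print("Accuracy: {:.3f}".format(accuracy_score(y_valid, clf.predict(X_valid))))
    print("Log loss: {:.3f}".format(log_loss(y_valid, clf.predict_proba(X_valid))))

summarize_classifier(logreg, X_valid, y_valid)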
Fit a decision tree classifier (call it dtree_clf) to the training set. Use the default hyperparameters.
dtree_clf = DecisionTreeClassifier()...
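(A possible solution sketch:)
# Fit a decision tree classifier with the default hyperparameters.
dtree_clf = DecisionTreeClassifier()
dtree_clf.fit(X_train, y_train)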
Let's plot the probabilities for this classifier.
plot_class_probs(dtree_clf)
Compute the accuracy and cross-entropy loss. Be careful to use the correct classifier each time!
summarize_classifier(..., X_valid, y_valid)
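(The blank above should be the decision tree you just fit, for example:)
summarize_classifier(dtree_clf, X_valid, y_valid)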
A random forest consists of many different decision trees, but:
- Each tree is trained on a different random subset of the data.
- To make a decision, the forest averages the decisions of its trees. (This makes it an "ensemble" method.)

The net effect is that a random forest can fit the data well (since each tree is a pretty good predictor) but tends not to overfit, because it averages the predictions of trees trained on different subsets of the data.
Let's try it.
Fit a random forest classifier (call it rf_clf) to the training set.
rf_clf = ...
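(A possible solution sketch:)
# Fit a random forest classifier with the default hyperparameters.
rf_clf = RandomForestClassifier()
rf_clf.fit(X_train, y_train)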
plot_class_probs(rf_clf)
Compute the accuracy and cross-entropy loss. Be careful to use the correct classifier each time!
# your code here
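(A possible fill for the cell above:)
summarize_classifier(rf_clf, X_valid, y_valid)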
How does the accuracy compare between the three classifiers?
your narrative answer here
How does the cross-entropy loss compare between the three classifiers? Why is the ranking different for this loss compared with accuracy? Look at the actual values. Hint:
np.log(3)
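(Why that hint matters: a classifier that assigns probability 1/3 to every class gets a cross-entropy loss of exactly ln(3) ≈ 1.10 on each example, no matter what the data is. A quick sketch that checks this, assuming y_valid is defined as above:)
# A "know-nothing" classifier predicting 1/3 for each of the 3 classes has
# log loss of -log(1/3) = log(3) on every example.
uniform_probs = np.full((len(y_valid), 3), 1/3)
print(log_loss(y_valid, uniform_probs), np.log(3))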
your narrative answer here
Describe, qualitatively, the shapes that each classifier makes in its class probability plots. Explain how the accuracy and cross-entropy numbers make sense in light of these plots.
your narrative answer here
Which of these classifiers overfit? Which ones underfit?
your narrative answer here
Optional