The goal of this exercise is to start getting familiar with modeling. We’ll try the same regression task that we saw in the slides, and we’ll use a decision tree model just like we did there.
If you want to read more about tidymodels
, you can read some of the tidymodels docs or Tidy Modeling with R. But those get deep into the details quickly; stick with us here and you’ll be ok.
Pull your portfolio repo as usual.
We’ll be using the Ames home sales dataset that we saw in class. If you’re curious, you can look at the Data dictionary that the author provided. In the original paper, the author suggests working with a subset of the data. So let’s do that:
# Get the data from the "modeldata" package, which comes with tidymodels.
data(ames, package = "modeldata")
<- ames %>%
ames filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>%
mutate(Sale_Price = Sale_Price / 1000)
Good analysis starts by exploring the data. One compelling way to explore data is by making plots. Let’s make one together.
In your report, use inline code to report how many homes we’re working with (after the above filtering).
Make a plot of how the sale price (Sale_Price
) related to the number of square feet of above-grade living area (Gr_Liv_Area
) for different building types (Bldg_Type
).
Your graph might look something like this:
We will now go through the basic steps of making a predictive model. We will add on to this workflow later, but this is a good start.
Each step will provide the full code for you.
Remember that our goal is to be able to predict what homes will sell for before they’re sold.
But our dataset has only homes that were already sold. How can we possibly figure out how well we’d predict a sale price before it’s sold?
Our strategy, which we’ll discuss more in future weeks, will be to hold out a “testing set” of homes. We won’t let our model see the actual sale price for these homes.
The homes where we do show the model the sale price we’ll call the “training” homes.
We’ll make this split randomly but consistently: we’ll first seed the random number generator so it always gives the same sequence of numbers.
set.seed(1234)
# Split our data randomly
<- initial_split(ames, prop = 2/3)
ames_split <- training(ames_split)
ames_train <- testing(ames_split) ames_test
The basic tidymodels
interface is
fit(model_spec, formula, data)
model_spec
: the kind of model we want to fit. We’ll use a decision_tree
in regression
mode. We’ll tell it we want trees that are at most 3 levels deep.formula
~
var1 +
var2 +
…Gr_Liv_Area
and Bldg_Type
to predict Sale_Price
.data
: we’ll give the tree-builder our training data to use to try to find a good tree.The result is a model object, which can make predictions for us.
<- fit(
decision_tree_fit decision_tree(mode = "regression", tree_depth = 3),
~ Gr_Liv_Area + Bldg_Type,
Sale_Price
data = ames_train)
If we just show the model object, we get a crude representation of its internals:
decision_tree_fit
parsnip model object
Fit time: 10ms
n= 1608
node), split, n, deviance, yval
* denotes terminal node
1) root 1608 8185170.000 176.2939
2) Gr_Liv_Area< 1958.5 1377 3515223.000 159.5994
4) Gr_Liv_Area< 1416.5 767 892288.500 134.9673
8) Gr_Liv_Area< 1112.5 403 270511.200 119.6004 *
9) Gr_Liv_Area>=1112.5 364 421252.100 151.9806 *
5) Gr_Liv_Area>=1416.5 610 1572419.000 190.5712
10) Bldg_Type=TwoFmCon,Duplex,Twnhs 59 34537.080 145.4323 *
11) Bldg_Type=OneFam,TwnhsE 551 1404796.000 195.4046 *
3) Gr_Liv_Area>=1958.5 231 1998436.000 275.8107
6) Gr_Liv_Area< 2443.5 151 780940.600 247.9970
12) Bldg_Type=TwoFmCon,Duplex 12 9379.729 147.3083 *
13) Bldg_Type=OneFam,TwnhsE 139 639399.400 256.6896 *
7) Gr_Liv_Area>=2443.5 80 880195.500 328.3091
14) Gr_Liv_Area< 3157 72 621975.700 312.7948 *
15) Gr_Liv_Area>=3157 8 84921.220 467.9375 *
To make a more visual representation, we can use the rpart.plot
package.
%>%
decision_tree_fit extract_fit_engine() %>%
::rpart.plot(roundint = FALSE) rpart.plot
We can ask the model for its predictions by using predict
. Let’s get predictions for the training homes (the same homes we built the tree with):
<-
ames_train_predictions %>%
decision_tree_fit predict(ames_train)
ames_train_predictions
# A tibble: 1,608 × 1
.pred
<dbl>
1 195.
2 120.
3 152.
4 257.
5 195.
6 152.
7 152.
8 195.
9 195.
10 152.
# … with 1,598 more rows
Let’s compare the predictions with the true sale prices. To do this, we’ll want the predictions and the original data in the same data frame. Since they line up row-by-row, we can just staple the predictions column to the end:
bind_cols(ames_train_predictions, ames_train)
# A tibble: 1,608 × 75
.pred MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
<dbl> <fct> <fct> <dbl> <int> <fct> <fct> <fct>
1 195. One_Story_194… Residentia… 141 31770 Pave No_A… Slightly…
2 120. One_Story_194… Residentia… 80 11622 Pave No_A… Regular
3 152. One_Story_194… Residentia… 81 14267 Pave No_A… Slightly…
4 257. One_Story_194… Residentia… 93 11160 Pave No_A… Regular
5 195. Two_Story_194… Residentia… 74 13830 Pave No_A… Slightly…
6 152. One_Story_PUD… Residentia… 41 4920 Pave No_A… Regular
7 152. One_Story_PUD… Residentia… 43 5005 Pave No_A… Slightly…
8 195. One_Story_PUD… Residentia… 39 5389 Pave No_A… Slightly…
9 195. Two_Story_194… Residentia… 60 7500 Pave No_A… Regular
10 152. One_Story_194… Residentia… 0 7980 Pave No_A… Slightly…
# … with 1,598 more rows, and 67 more variables: Land_Contour <fct>,
# Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
# Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
# Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
# Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
# Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
# Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, …
Usually I do this in a single pipeline:
<-
ames_train_predictions %>%
decision_tree_fit predict(ames_train) %>%
bind_cols(ames_train)
Make the following plot (which I intentionally haven’t labeled well, to help you make it). (note: the column name is .pred
, with a period at the beginning. They named it that way just in case there was already a column named pred
.)
To quantify the model’s error, let’s compute the mean absolute error.
%>%
ames_train_predictions mutate(error = Sale_Price - .pred) %>%
summarize(mean(abs(error)))
# A tibble: 1 × 1
`mean(abs(error))`
<dbl>
1 33.1
The yardstick
package includes functions that compute this and other metrics:
<- yardstick::metric_set(rsq_trad, mae, mape, rmse)
metrics %>%
ames_train_predictions metrics(truth = Sale_Price, estimate = .pred) %>%
select(-.estimator)
# A tibble: 4 × 2
.metric .estimate
<chr> <dbl>
1 rsq_trad 0.574
2 mae 33.1
3 mape 20.4
4 rmse 46.6
So far we’ve been evaluating the model’s performance on the training set.
But we already know the sale prices for those homes, so why do we really care about predicting them?
What we’d really like to know is how well we’d predict the sale price for homes that our model never actually saw.
Good thing we held out a test set!
tree_depth
parameter. Try adding or removing features. How do those affect the model?Here’s the code I used in the slides to plot the model predictions. I needed a few extra things:
sweep_model
is a helper function to try out all combinations of multiple variables: all the living area values for each building type. The tidyr::expand_grid
function helped me out there!geom
s can be given a different data
than the whole ggplot
. I used that to draw the lines.<- function(model, ...) {
sweep_model <- expand_grid(...)
X %>%
model predict(X) %>%
bind_cols(X)
}
ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) +
geom_point(alpha = .5, size = .5) +
geom_line(data = sweep_model(
decision_tree_fit,Gr_Liv_Area = seq(0, 4000, length.out = 500),
Bldg_Type = levels(ames_train$Bldg_Type)
),mapping = aes(y = .pred),
color = "red") +
facet_wrap(vars(Bldg_Type)) +
labs(x = "Living Area", y = "Sale Price ($1k)")
Here’s some results I got. I used this:
<- bind_rows(
all_preds train = ames_train_predictions,
test = ames_test_predictions,
.id = 'set',
)
.metric | train | test |
---|---|---|
rsq_trad | 0.5740134 | 0.4835928 |
mae | 33.1457230 | 34.3370013 |
mape | 20.4110345 | 21.1342666 |
rmse | 46.5659860 | 48.2617657 |