Guided Exercise 07 - Modeling Intro

The goal of this exercise is to start getting familiar with modeling. We’ll try the same regression task that we saw in the slides, and we’ll use a decision tree model just like we did there.

If you want to read more about tidymodels, you can read some of the tidymodels docs or Tidy Modeling with R. But those get deep into the details quickly; stick with us here and you’ll be ok.

Getting started

Pull your portfolio repo as usual.

We’ll be using the Ames home sales dataset that we saw in class. If you’re curious, you can look at the Data dictionary that the author provided. In the original paper, the author suggests working with a subset of the data. So let’s do that:

# Get the data from the "modeldata" package, which comes with tidymodels.
data(ames, package = "modeldata")
ames <- ames %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>% 
  mutate(Sale_Price = Sale_Price / 1000)

Exploratory analysis

Good analysis starts by exploring the data. One compelling way to explore data is by making plots. Let’s make one together.

  1. In your report, use inline code to report how many homes we’re working with (after the above filtering).

  2. Make a plot of how the sale price (Sale_Price) related to the number of square feet of above-grade living area (Gr_Liv_Area) for different building types (Bldg_Type).

Your graph might look something like this:

Modeling

We will now go through the basic steps of making a predictive model. We will add on to this workflow later, but this is a good start.

Each step will provide the full code for you.

Hold out some unseen data

Remember that our goal is to be able to predict what homes will sell for before they’re sold.

But our dataset has only homes that were already sold. How can we possibly figure out how well we’d predict a sale price before it’s sold?

Our strategy, which we’ll discuss more in future weeks, will be to hold out a “testing set” of homes. We won’t let our model see the actual sale price for these homes.

The homes where we do show the model the sale price we’ll call the “training” homes.

We’ll make this split randomly but consistently: we’ll first seed the random number generator so it always gives the same sequence of numbers.

set.seed(1234)

# Split our data randomly
ames_split <- initial_split(ames, prop = 2/3) 
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
  1. How many home sales were in the training set? The testing set?

Construct a model specification

The basic tidymodels interface is

fit(model_spec, formula, data)

  • model_spec: the kind of model we want to fit. We’ll use a decision_tree in regression mode. We’ll tell it we want trees that are at most 3 levels deep.
  • formula
    • this goes thing-to-predict ~ var1 + var2 +
    • In our case, well use Gr_Liv_Area and Bldg_Type to predict Sale_Price.
  • data: we’ll give the tree-builder our training data to use to try to find a good tree.

The result is a model object, which can make predictions for us.

decision_tree_fit <- fit(
  decision_tree(mode = "regression", tree_depth = 3),
  
  Sale_Price ~ Gr_Liv_Area + Bldg_Type,
  
  data = ames_train)

Visualize the Model Internals

If we just show the model object, we get a crude representation of its internals:

decision_tree_fit
parsnip model object

Fit time:  10ms 
n= 1608 

node), split, n, deviance, yval
      * denotes terminal node

 1) root 1608 8185170.000 176.2939  
   2) Gr_Liv_Area< 1958.5 1377 3515223.000 159.5994  
     4) Gr_Liv_Area< 1416.5 767  892288.500 134.9673  
       8) Gr_Liv_Area< 1112.5 403  270511.200 119.6004 *
       9) Gr_Liv_Area>=1112.5 364  421252.100 151.9806 *
     5) Gr_Liv_Area>=1416.5 610 1572419.000 190.5712  
      10) Bldg_Type=TwoFmCon,Duplex,Twnhs 59   34537.080 145.4323 *
      11) Bldg_Type=OneFam,TwnhsE 551 1404796.000 195.4046 *
   3) Gr_Liv_Area>=1958.5 231 1998436.000 275.8107  
     6) Gr_Liv_Area< 2443.5 151  780940.600 247.9970  
      12) Bldg_Type=TwoFmCon,Duplex 12    9379.729 147.3083 *
      13) Bldg_Type=OneFam,TwnhsE 139  639399.400 256.6896 *
     7) Gr_Liv_Area>=2443.5 80  880195.500 328.3091  
      14) Gr_Liv_Area< 3157 72  621975.700 312.7948 *
      15) Gr_Liv_Area>=3157 8   84921.220 467.9375 *

To make a more visual representation, we can use the rpart.plot package.

decision_tree_fit %>% 
  extract_fit_engine() %>% 
  rpart.plot::rpart.plot(roundint = FALSE)

  1. How many groups does the tree divide the data into?

Visualize the Model’s Predictions

We can ask the model for its predictions by using predict. Let’s get predictions for the training homes (the same homes we built the tree with):

ames_train_predictions <- 
  decision_tree_fit %>% 
    predict(ames_train)
ames_train_predictions
# A tibble: 1,608 × 1
   .pred
   <dbl>
 1  195.
 2  120.
 3  152.
 4  257.
 5  195.
 6  152.
 7  152.
 8  195.
 9  195.
10  152.
# … with 1,598 more rows
  1. How many predictions did the model make?

Let’s compare the predictions with the true sale prices. To do this, we’ll want the predictions and the original data in the same data frame. Since they line up row-by-row, we can just staple the predictions column to the end:

bind_cols(ames_train_predictions, ames_train)
# A tibble: 1,608 × 75
   .pred MS_SubClass    MS_Zoning   Lot_Frontage Lot_Area Street Alley Lot_Shape
   <dbl> <fct>          <fct>              <dbl>    <int> <fct>  <fct> <fct>    
 1  195. One_Story_194… Residentia…          141    31770 Pave   No_A… Slightly…
 2  120. One_Story_194… Residentia…           80    11622 Pave   No_A… Regular  
 3  152. One_Story_194… Residentia…           81    14267 Pave   No_A… Slightly…
 4  257. One_Story_194… Residentia…           93    11160 Pave   No_A… Regular  
 5  195. Two_Story_194… Residentia…           74    13830 Pave   No_A… Slightly…
 6  152. One_Story_PUD… Residentia…           41     4920 Pave   No_A… Regular  
 7  152. One_Story_PUD… Residentia…           43     5005 Pave   No_A… Slightly…
 8  195. One_Story_PUD… Residentia…           39     5389 Pave   No_A… Slightly…
 9  195. Two_Story_194… Residentia…           60     7500 Pave   No_A… Regular  
10  152. One_Story_194… Residentia…            0     7980 Pave   No_A… Slightly…
# … with 1,598 more rows, and 67 more variables: Land_Contour <fct>,
#   Utilities <fct>, Lot_Config <fct>, Land_Slope <fct>, Neighborhood <fct>,
#   Condition_1 <fct>, Condition_2 <fct>, Bldg_Type <fct>, House_Style <fct>,
#   Overall_Cond <fct>, Year_Built <int>, Year_Remod_Add <int>,
#   Roof_Style <fct>, Roof_Matl <fct>, Exterior_1st <fct>, Exterior_2nd <fct>,
#   Mas_Vnr_Type <fct>, Mas_Vnr_Area <dbl>, Exter_Cond <fct>, Foundation <fct>,
#   Bsmt_Cond <fct>, Bsmt_Exposure <fct>, BsmtFin_Type_1 <fct>, …

Usually I do this in a single pipeline:

ames_train_predictions <- 
  decision_tree_fit %>% 
    predict(ames_train) %>% 
    bind_cols(ames_train)

Make the following plot (which I intentionally haven’t labeled well, to help you make it). (note: the column name is .pred, with a period at the beginning. They named it that way just in case there was already a column named pred.)

  1. Just looking at the plot: are these good predictions? About how far off are they?

Quantify the Model’s Performance

To quantify the model’s error, let’s compute the mean absolute error.

ames_train_predictions %>% 
  mutate(error = Sale_Price - .pred) %>% 
  summarize(mean(abs(error)))
# A tibble: 1 × 1
  `mean(abs(error))`
               <dbl>
1               33.1

The yardstick package includes functions that compute this and other metrics:

metrics <- yardstick::metric_set(rsq_trad, mae, mape, rmse)
ames_train_predictions %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  select(-.estimator)
# A tibble: 4 × 2
  .metric  .estimate
  <chr>        <dbl>
1 rsq_trad     0.574
2 mae         33.1  
3 mape        20.4  
4 rmse        46.6  

Quantify the model’s performance on the test set

So far we’ve been evaluating the model’s performance on the training set.

But we already know the sale prices for those homes, so why do we really care about predicting them?

What we’d really like to know is how well we’d predict the sale price for homes that our model never actually saw.

Good thing we held out a test set!

  1. Repeat the visualization and quantification for the test set.

Tweak hyperparameters

  1. Try tweaking the tree_depth parameter. Try adding or removing features. How do those affect the model?

Appendix

Here’s the code I used in the slides to plot the model predictions. I needed a few extra things:

sweep_model <- function(model, ...) {
  X <- expand_grid(...)
  model %>% 
    predict(X) %>% 
    bind_cols(X)
}

ggplot(ames_train, aes(x = Gr_Liv_Area, y = Sale_Price)) +
  geom_point(alpha = .5, size = .5) +
  geom_line(data = sweep_model(
      decision_tree_fit,
      Gr_Liv_Area = seq(0, 4000, length.out = 500),
      Bldg_Type = levels(ames_train$Bldg_Type)
    ),
    mapping = aes(y = .pred),
    color = "red") +
  facet_wrap(vars(Bldg_Type)) +
  labs(x = "Living Area", y = "Sale Price ($1k)")

Here’s some results I got. I used this:

all_preds <- bind_rows(
  train = ames_train_predictions,
  test = ames_test_predictions,
  .id = 'set',
)

.metric train test
rsq_trad 0.5740134 0.4835928
mae 33.1457230 34.3370013
mape 20.4110345 21.1342666
rmse 46.5659860 48.2617657