In this lab (turned homework) you will try to make models of two different types and tune their parameters and data choices to do well at a prediction task. This is the central task of data modeling. Many (but far from all) people think it’s the most fun part.
Our learning objectives are:
You'll find lab10 repos (lab10-V2) available to your cohort. Pick one for your team, making sure that no two teams share the same repo. Note that the RStudio server lets multiple people edit a single project at the same time! This can be confusing since it's different from collaborating using GitHub, so only try it if everyone on your team is game.
You have been given a training set with a random subset of the Ames home sale dataset. Your repo includes a command to load it. See the Data dictionary if you need it.
We’ve scaled the sale price to be in units of $1000, to keep the numbers more manageable.
Now that we have techniques for specifying a model, preprocessing data to give to it, and estimating its performance, let's put them together and try to make a good model.
We learned about recipes and workflows but only used them occasionally last week. For this lab, though, they become essential as we compare models that use different variables, different preprocessing steps, different model types, etc., all with the same methodology.
Our basic strategy will be to collect model specs and metrics in a data frame. We’ll run through the approach once for a basic linreg and decision tree, then you’ll try out things on your own.
Overall let’s use 10-fold cross-validation. We need to set a seed here because vfold_cv uses random splitting.
Since the toolkit I usually use, Scikit-Learn, uses deterministic cross-validation by default, I didn't realize that this needs a seed to be reproducible. Sorry to those of you who got confused because our numbers didn't match.
set.seed(2)
ames_resamples <- ames_train %>%
  vfold_cv(v = 10)
These models can be time-consuming to run, so let's collect the results in a modeling_results data frame as we go. Let's start off with no results:
modeling_results <- tibble()
And here's a function to add a new set of results. Don't worry about how it works.
append_new_results <- function(modeling_results, model_name, spec, cv_results) {
if (nrow(modeling_results) > 0 && model_name %in% modeling_results$model_name)
stop(paste0(
"There's already results for a model with the name ", model_name,
". Did you forget to change the name?"))
bind_rows(
modeling_results %>% mutate(model_name = as.character(model_name)),
tibble(
model_name = model_name,
spec = list(spec),
cv_results %>% select(-.estimator)
)
) %>%
mutate(model_name = as_factor(model_name)) # Ensure that factor level matches insertion order.
}
get_spec_for_model <- function(model_name) {
modeling_results %>% filter(model_name == !!model_name) %>% purrr::chuck("spec", 1)
}
Finally, a few utilities:
add_predictions <- function(data, ...) {
imap_dfr(
rlang::dots_list(..., .named = TRUE),
function(model, model_name) {
model %>%
predict(data) %>%
bind_cols(data) %>%
mutate(model = !!model_name)
}
)
}
sweep_model_examples <- function(model, dataset, vars_to_sweep, examples = slice_sample(dataset, n = 10)) {
X <- map_dfr(vars_to_sweep, function(v) {
var_to_sweep <- rlang::as_label(v)
sweep_min <- min(dataset[[var_to_sweep]])
sweep_max <- max(dataset[[var_to_sweep]])
expand_grid(
examples %>% select(-!!var_to_sweep) %>% mutate(.idx = row_number()),
!!enquo(var_to_sweep) := seq(sweep_min, sweep_max, length.out = 500)) %>%
mutate(sweep_var = var_to_sweep, .sweep_val = .data[[var_to_sweep]])
})
model %>%
predict(X) %>%
bind_cols(X)
}
linear_reg <- function(engine = "lm", ...) {
parsnip::linear_reg(...) %>% set_engine(engine)
}
decision_tree <- function(mode = "regression", engine = "rpart", ...) {
  parsnip::decision_tree(mode = mode, ...) %>%
set_engine(engine)
}Let’s first fit a simple linear regression. First we define our spec.
Note that we need to give recipe a data parameter. This is the template data, just used to tell the recipe what columns there are and what data type each one has. It does not actually train on this data!
spec <- workflow() %>%
add_recipe(
recipe(Sale_Price ~ Latitude + Longitude, data = ames_train)
) %>%
add_model(
linear_reg()
)Then, let’s get the cross-validation assessment scores:
cv_results <-
spec %>%
fit_resamples(resamples = ames_resamples, metrics = metric_set(mae)) %>%
collect_metrics(summarize = FALSE)
cv_results
## # A tibble: 10 x 4
## id .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Fold01 mae standard 45.4
## 2 Fold02 mae standard 47.7
## 3 Fold03 mae standard 53.1
## 4 Fold04 mae standard 48.1
## 5 Fold05 mae standard 40.6
## 6 Fold06 mae standard 48.7
## 7 Fold07 mae standard 45.6
## 8 Fold08 mae standard 53.4
## 9 Fold09 mae standard 49.7
## 10 Fold10 mae standard 45.5
Finally, we collect these results into our modeling_results data frame.
modeling_results <- modeling_results %>%
append_new_results(
model_name = "Simple Linreg",
spec = spec,
cv_results = cv_results
)
modeling_results
## # A tibble: 10 x 5
## model_name spec id .metric .estimate
## <fct> <list> <chr> <chr> <dbl>
## 1 Simple Linreg <workflow> Fold01 mae 45.4
## 2 Simple Linreg <workflow> Fold02 mae 47.7
## 3 Simple Linreg <workflow> Fold03 mae 53.1
## 4 Simple Linreg <workflow> Fold04 mae 48.1
## 5 Simple Linreg <workflow> Fold05 mae 40.6
## 6 Simple Linreg <workflow> Fold06 mae 48.7
## 7 Simple Linreg <workflow> Fold07 mae 45.6
## 8 Simple Linreg <workflow> Fold08 mae 53.4
## 9 Simple Linreg <workflow> Fold09 mae 49.7
## 10 Simple Linreg <workflow> Fold10 mae 45.5
Notice that in your repo these three code blocks are all together in the same chunk. That’s intentional, to avoid accidentally getting the three parts out of sync.
Your turn: repeat the above, but now using the default decision_tree(). Start by copying and pasting the linreg chunk from above and changing the chunk name, model spec, and model_name parameter.
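For reference, here's a minimal sketch of what that chunk could look like, assuming you keep the same Latitude + Longitude recipe as the linreg above (your own chunk may well differ):
# Same recipe as before; only the model spec and the results name change.
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude, data = ames_train)
  ) %>%
  add_model(
    decision_tree()
  )
cv_results <-
  spec %>%
  fit_resamples(resamples = ames_resamples, metrics = metric_set(mae)) %>%
  collect_metrics(summarize = FALSE)
modeling_results <- modeling_results %>%
  append_new_results(
    model_name = "Decision Tree",
    spec = spec,
    cv_results = cv_results
  )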
Here’s what my modeling results looked like after I did that:
modeling_results
## # A tibble: 20 x 5
## model_name spec id .metric .estimate
## <fct> <list> <chr> <chr> <dbl>
## 1 Simple Linreg <workflow> Fold01 mae 45.4
## 2 Simple Linreg <workflow> Fold02 mae 47.7
## 3 Simple Linreg <workflow> Fold03 mae 53.1
## 4 Simple Linreg <workflow> Fold04 mae 48.1
## 5 Simple Linreg <workflow> Fold05 mae 40.6
## 6 Simple Linreg <workflow> Fold06 mae 48.7
## 7 Simple Linreg <workflow> Fold07 mae 45.6
## 8 Simple Linreg <workflow> Fold08 mae 53.4
## 9 Simple Linreg <workflow> Fold09 mae 49.7
## 10 Simple Linreg <workflow> Fold10 mae 45.5
## 11 Decision Tree <workflow> Fold01 mae 33.3
## 12 Decision Tree <workflow> Fold02 mae 36.2
## 13 Decision Tree <workflow> Fold03 mae 38.5
## 14 Decision Tree <workflow> Fold04 mae 35.3
## 15 Decision Tree <workflow> Fold05 mae 31.4
## 16 Decision Tree <workflow> Fold06 mae 34.3
## 17 Decision Tree <workflow> Fold07 mae 34.1
## 18 Decision Tree <workflow> Fold08 mae 35.7
## 19 Decision Tree <workflow> Fold09 mae 34.8
## 20 Decision Tree <workflow> Fold10 mae 33.8
Compare the performance of the two models. Which one is better?
modeling_results %>%
ggplot(aes(y = fct_rev(model_name), x = .estimate)) +
geom_boxplot() +
geom_point() +
  labs(x = "Mean Absolute Error ($1000)", y = "")
## # A tibble: 2 x 2
## model_name mae_median
## <fct> <dbl>
## 1 Simple Linreg 47.9
## 2 Decision Tree 34.6
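The median MAE table above was computed separately; here's one way you could get those numbers yourself (a sketch, not necessarily the exact code used):
# Summarize the per-fold MAE estimates down to one median per model.
modeling_results %>%
  group_by(model_name) %>%
  summarize(mae_median = median(.estimate))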
Try getting a better decision tree model:
decision_tree parameters (run ?rpart::rpart.control for details):

- cost_complexity: any split that does not improve the fit by at least this much is not attempted (defaults to 0.01, aka cp)
- tree_depth: maximum depth of any node in the final tree (defaults to 30, aka maxdepth)
- min_n: minimum number of observations that must exist in a node in order for a split to be attempted (defaults to 20, aka minsplit)
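For example, here's a sketch of a spec that adjusts some of these parameters (the particular values and variables are just illustrative starting points, not recommendations):
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude + Gr_Liv_Area, data = ames_train)
  ) %>%
  add_model(
    # Placeholder tuning values; try several combinations and compare.
    decision_tree(cost_complexity = 0.001, tree_depth = 10, min_n = 5)
  )
Give each new spec a fresh model_name (e.g., "Decision Tree 2") when you call append_new_results; it will stop you if you reuse a name.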
First, let’s get a model fit on the whole training set. Remember: the cross validation process trained a bunch of smaller models; now we want to look at the slightly bigger full model.
model <-
get_spec_for_model("Decision Tree 2") %>%
fit(ames_train)Here’s a way to see at a glance what might be helpful features to add becasue there are patterns in the residuals. We just plot the residuals against all of the other numeric features. (It doesn’t handle categorical features, but you can try mapping a categorical feature to color or something.)
ames_train %>%
add_predictions(model) %>%
mutate(resid = Sale_Price - .pred) %>%
# Pivot all of the numeric variables except for the outcomes and residuals.
pivot_longer(c(where(is.numeric) & !any_of(c("Sale_Price", ".pred", "resid")))) %>%
# Plot!
ggplot(aes(x = value, y = resid)) +
geom_point() +
geom_smooth(method = "lm") +
  facet_wrap(vars(name), scales = "free")
## `geom_smooth()` using formula 'y ~ x'
The sweep_model_examples function defined above creates example predictions by starting with 10 randomly selected example data points and sweeping each requested variable (column) across the whole range of values seen in the training set. You give it one or more variables using vars(), the same quoting function that facet_wrap() uses. Note: it currently only works for continuous variables.
The following example shows how to sweep several variables and plot each sweep as a line.
set.seed(0)
sweep_model_examples(model, ames_train, vars(Longitude, Latitude, Gr_Liv_Area)) %>%
ggplot(aes(x = .sweep_val, y = .pred, group = .idx, color = Latitude)) +
geom_line(alpha = .5) +
facet_wrap(vars(sweep_var), scales = "free")
Now, try to get a better linear regression model. You might try:
- step_dummy (and perhaps step_other if needed)
- step_poly, or its better-behaved cousin step_ns
- step_discretize
- step_interact (see hw9 for an example)
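For instance, here is one sketch of a richer linear regression spec that combines a few of the steps above (the variable choices, steps, and parameter values are all just illustrative):
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude + Gr_Liv_Area + Neighborhood, data = ames_train) %>%
      step_other(Neighborhood, threshold = 0.05) %>%  # lump rare neighborhoods into an "other" level
      step_dummy(Neighborhood) %>%                    # dummy-code the categorical predictor
      step_ns(Latitude, Longitude, deg_free = 5)      # natural splines for the coordinates
  ) %>%
  add_model(
    linear_reg()
  )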
Once you have collected performance data for a few different models, compare their performance by plotting data from the modeling_results data frame. (The other visuals are just for you to explore. Keep your model performance visual chunk after all of your modeling results have been collected, so that you can plot the performance here.)
For example, here is the result with some models I was playing around with:
modeling_results %>%
ggplot(aes(y = fct_rev(model_name), x = .estimate)) +
geom_boxplot() +
geom_point() +
  labs(x = "Mean Absolute Error ($1000)", y = "")
Make a list of all of the models that you tried. (Give enough detail that another student would be able to try the same model.) Mark which one is your best decision tree and which one is your best linear regression.
How does the numerical performance of your best linear regression compare with your best decision tree? Refer to both the center and spread of cross-validated performance estimates.
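To put numbers on the center and spread, one option (again just a sketch; use whatever summaries you prefer) is:
# One center measure (median) and one spread measure (IQR) per model.
modeling_results %>%
  group_by(model_name) %>%
  summarize(
    mae_median = median(.estimate),
    mae_iqr = IQR(.estimate)
  )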
Pick a house from the training set. Explain how each of the two models you described in the previous question makes a prediction on that house.
Here’s the example home to use:
example_home <- ames_train %>% head(1)
example_home
## # A tibble: 1 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 One_Story_… Resident… 141 31770 Pave No_A… Slightly…
## # … with 67 more variables: Land_Contour <chr>, Utilities <chr>,
## # Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>, Condition_1 <chr>,
## # Condition_2 <chr>, Bldg_Type <chr>, House_Style <chr>, Overall_Cond <chr>,
## # Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <chr>, Roof_Matl <chr>,
## # Exterior_1st <chr>, Exterior_2nd <chr>, Mas_Vnr_Type <chr>,
## # Mas_Vnr_Area <dbl>, Exter_Cond <chr>, Foundation <chr>, Bsmt_Cond <chr>,
## # Bsmt_Exposure <chr>, BsmtFin_Type_1 <chr>, BsmtFin_SF_1 <dbl>,
## # BsmtFin_Type_2 <chr>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## # Total_Bsmt_SF <dbl>, Heating <chr>, Heating_QC <chr>, Central_Air <chr>,
## # Electrical <chr>, First_Flr_SF <dbl>, Second_Flr_SF <dbl>,
## # Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## # Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## # TotRms_AbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, Garage_Type <chr>,
## # Garage_Finish <chr>, Garage_Cars <dbl>, Garage_Area <dbl>,
## # Garage_Cond <chr>, Paved_Drive <chr>, Wood_Deck_SF <dbl>,
## # Open_Porch_SF <dbl>, Enclosed_Porch <dbl>, Three_season_porch <dbl>,
## # Screen_Porch <dbl>, Pool_Area <dbl>, Pool_QC <chr>, Fence <chr>,
## # Misc_Feature <chr>, Misc_Val <dbl>, Mo_Sold <dbl>, Year_Sold <dbl>,
## # Sale_Type <chr>, Sale_Condition <chr>, Sale_Price <dbl>, Longitude <dbl>,
## # Latitude <dbl>
And here’s how to get your model’s prediction on it.
# Get the fitted model
model <-
get_spec_for_model("Decision Tree") %>%
fit(ames_train)
# Make the prediction
model %>% predict(example_home)
## # A tibble: 1 x 1
## .pred
## <dbl>
## 1 137.
You can use the plots from “Examine Predictions” above to help understand what your model is doing. For a small enough decision tree, you can also use rpart.plot like we did in Lab 9 (but we need very slightly different code since we’re using workflows):
model %>%
workflows::pull_workflow_fit() %>%
pluck("fit") %>%
  rpart.plot::rpart.plot(roundint = FALSE, digits = 5, type = 4)
Finally, fill in your best overall model spec and the MAE you expect it to achieve on the held-out test set:
spec <- YOUR SPEC CODE HERE
expected_test_set_MAE <- YOUR NUMBER HERE
Explain your reasoning.