Introduction

The goal of this exercise is to …

Getting Started

ames_all <- 
  read_builtin("ames", package = "modeldata") %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>%
  mutate(across(where(is.integer), as.double)) %>%
  mutate(Sale_Price = Sale_Price / 1000)
metrics <- yardstick::metric_set(mae, rsq_trad)

set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

ames_with_split_marked <- bind_rows(
  train = ames_train,
  test = ames_test,
  .id = "split"
) %>% mutate(split = as_factor(split))

Preprocessing

One simple but important thing that affects some models’ performance is preprocessing or, more generally, feature engineering. There are two big reasons it matters:

  1. Sometimes a model can do better if it’s given a processed version of a feature. A classic example is dates: a date stored as a string is not likely to be a useful feature, but (as we saw in homework) it becomes much more useful if we extract information from it, such as the day of the week, and supply that as a separate feature.
  2. Sometimes a model needs a feature to be encoded before it can use it at all. This used to be the case for XGBoost: it couldn’t use categorical features unless they were dummy-encoded first. (Linear regression also needs dummy encoding, but the code handles that automatically, so we sometimes don’t notice.) A short sketch of both kinds of steps follows this list.
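
To make both ideas concrete, here’s a minimal sketch using recipes (it assumes the setup from the Appendix has been run). The tiny data frame and its columns (date_sold, neighborhood, price) are invented for illustration; they are not part of the Ames data:

# Hypothetical data with a date column and a categorical column.
example_data <- tibble(
  date_sold    = as.Date(c("2024-01-05", "2024-02-14", "2024-03-21")),
  neighborhood = factor(c("North", "South", "North")),
  price        = c(210, 185, 240)
)

recipe(price ~ ., data = example_data) %>%
  step_date(date_sold, features = c("dow", "month")) %>%  # reason 1: extract day of week and month
  step_dummy(all_nominal_predictors()) %>%                # reason 2: dummy-encode categorical features
  prep() %>%
  bake(new_data = NULL)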

Let’s see a cool example with linear regression. Normally you’d expect linear regression to only be able to fit lines, but with some feature transformations, we can get it to make surprisingly nuanced predictions.

We’ll need some new code concepts, though: a workflow represents a complete modeling process, including both the model specification and, optionally, a recipe that handles the preprocessing.

Here’s what linear regression looks like with a recipe. We’ll start with a blank recipe. (It just extracts the columns we asked for using the formula.)

linreg_recipe <- recipe(
  formula = Sale_Price ~ Latitude + Longitude,
  data = ames_train)

linreg_workflow <- workflow(
  preprocessor = linreg_recipe,
  spec = linear_reg()
)

# Fit the entire workflow
linreg_fit <- fit(linreg_workflow, data = ames_train)

# show the model in data space
show_latlong_model(
  ames_train,
  linreg_fit
)

# Evaluate it on the test set
augment(linreg_fit, ames_with_split_marked) %>% 
  group_by(split) %>% 
  metrics(truth = Sale_Price, estimate = .pred)
# A tibble: 4 × 4
  split .metric  .estimator .estimate
  <fct> <chr>    <chr>          <dbl>
1 train mae      standard      47.4  
2 test  mae      standard      48.3  
3 train rsq_trad standard       0.120
4 test  rsq_trad standard       0.159

We can add preprocessing steps by piping the recipe through one or more step_*() functions. For example, we can add a spline transformation of the Latitude column by adding a %>% and then step_ns(Latitude, deg_free = 3).

(You may have heard of polynomial regression, where we make a feature for x^2, x^3, etc., and that lets us make curve-shaped predictions because our prediction equation now looks like c1 * x^2 + c2 * x + c3 or the like. Spline transformations are the same basic idea, but are better behaved.)

linreg_recipe <- recipe(
  Sale_Price ~ Latitude + Longitude,
  data = ames_train) %>% 
  step_ns(Latitude, deg_free = 3)

# Show an example of what the recipe output looks like. We don't generally need to use these functions.
linreg_recipe %>% 
  prep(training = ames_train) %>% 
  bake(ames_train) %>% 
  select(starts_with("Latitude")) %>% 
  bind_cols(select(ames_train, Latitude)) %>% 
  pivot_longer(cols = !Latitude, names_to = "predictor") %>% 
  ggplot(aes(x = Latitude, y = value, color = predictor)) + geom_line()
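
As an aside on the polynomial idea above, recipes also provides step_poly(). This short sketch (my comparison, not a step the exercise asks for) shows the polynomial features it would create in place of the spline basis:

# Same recipe, but with polynomial features instead of a natural spline.
recipe(Sale_Price ~ Latitude + Longitude, data = ames_train) %>% 
  step_poly(Latitude, degree = 3) %>% 
  prep(training = ames_train) %>% 
  bake(ames_train) %>% 
  select(starts_with("Latitude")) %>% 
  head()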

Exercise: Try adding a few preprocessing steps. For each one, write in your notes (1) How does the visualization of the model’s predictions (in data space) change, and (2) How does the accuracy change? Don’t be overly concerned about the exact accuracy values; just note whether a step seems to help or not.

  1. Add the step_ns() shown above to the recipe.
  2. Replace Latitude by all_numeric_predictors().
  3. Try changing deg_free to 2, 3, or 4.
  4. Replace the entire step_ns step by step_discretize(all_numeric_predictors()).
  5. Replace the step_discretize step by step_interact(~ starts_with("Latitude") : starts_with("Longitude")).
  6. Notice the error you got in the previous step. It happened because the model ran into numerical issues: the features were not on a sensible scale. So before the step_interact, add step_normalize(all_numeric_predictors()) and try again. (A sketch of this combination appears after the list.)
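
For reference, the combination described in step 6 looks roughly like this (a sketch of one reasonable version, not the only valid ordering of steps):

# Normalize first so the interaction term is on a sensible scale.
linreg_recipe <- recipe(
  Sale_Price ~ Latitude + Longitude,
  data = ames_train) %>% 
  step_normalize(all_numeric_predictors()) %>% 
  step_interact(~ starts_with("Latitude") : starts_with("Longitude"))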

Note that we need to give recipe() a data argument. This is template data, used only to tell the recipe what columns there are and what data type each one has. It does not train a model on this data!

Note: this did not use any kind of cross-validation; we just kept peeking at our testing set. Let’s fix that!

Cross-Validation

Here are the sizes of the training and test sets:

# A tibble: 2 × 2
  split     n
  <fct> <int>
1 train  1608
2 test    804

Suppose we declare that we want to evaluate models using 6-fold cross-validation:

ames_resamples <- vfold_cv(ames_train, v = 6)

Exercise: How many homes are in one fold?

Exercise: When performing cross-validation using these resamples, how many homes will be used for fitting models?

Exercise: When performing cross-validation using these resamples, how many homes will be used for evaluating the performance of fitted models?
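
Once you’ve worked these out, you can check your answers by inspecting the resample object directly; each split reports its analysis/assessment/total sizes:

# Each fold prints as <analysis size/assessment size/total size>.
ames_resamples

# Or look at one fold directly:
nrow(analysis(ames_resamples$splits[[1]]))    # rows used for fitting in fold 1
nrow(assessment(ames_resamples$splits[[1]]))  # rows used for evaluation in fold 1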

Tuning

We’ll use a dataset about predicting whether a customer will churn (cancel their service) or not. See mlc_churn for more details.

mlc_churn <- read_builtin("mlc_churn", package = "modeldata")
skimr::skim(mlc_churn)
Data summary
Name mlc_churn
Number of rows 5000
Number of columns 20
_______________________
Column type frequency:
factor 5
numeric 15
________________________
Group variables None

Variable type: factor

skim_variable n_missing complete_rate ordered n_unique top_counts
state 0 1 FALSE 51 WV: 158, MN: 125, AL: 124, ID: 119
area_code 0 1 FALSE 3 are: 2495, are: 1259, are: 1246
international_plan 0 1 FALSE 2 no: 4527, yes: 473
voice_mail_plan 0 1 FALSE 2 no: 3677, yes: 1323
churn 0 1 FALSE 2 no: 4293, yes: 707

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
account_length 0 1 100.26 39.69 1 73.00 100.00 127.00 243.00 ▂▇▇▂▁
number_vmail_messages 0 1 7.76 13.55 0 0.00 0.00 17.00 52.00 ▇▁▂▁▁
total_day_minutes 0 1 180.29 53.89 0 143.70 180.10 216.20 351.50 ▁▃▇▅▁
total_day_calls 0 1 100.03 19.83 0 87.00 100.00 113.00 165.00 ▁▁▇▇▁
total_day_charge 0 1 30.65 9.16 0 24.43 30.62 36.75 59.76 ▁▃▇▅▁
total_eve_minutes 0 1 200.64 50.55 0 166.38 201.00 234.10 363.70 ▁▂▇▅▁
total_eve_calls 0 1 100.19 19.83 0 87.00 100.00 114.00 170.00 ▁▁▇▇▁
total_eve_charge 0 1 17.05 4.30 0 14.14 17.09 19.90 30.91 ▁▂▇▅▁
total_night_minutes 0 1 200.39 50.53 0 166.90 200.40 234.70 395.00 ▁▃▇▃▁
total_night_calls 0 1 99.92 19.96 0 87.00 100.00 113.00 175.00 ▁▁▇▆▁
total_night_charge 0 1 9.02 2.27 0 7.51 9.02 10.56 17.77 ▁▃▇▃▁
total_intl_minutes 0 1 10.26 2.76 0 8.50 10.30 12.00 20.00 ▁▃▇▃▁
total_intl_calls 0 1 4.44 2.46 0 3.00 4.00 6.00 20.00 ▇▅▁▁▁
total_intl_charge 0 1 2.77 0.75 0 2.30 2.78 3.24 5.40 ▁▃▇▃▁
number_customer_service_calls 0 1 1.57 1.31 0 1.00 1.00 2.00 9.00 ▇▅▁▁▁

Note that most customers didn’t churn, so always predicting “no” would be right much of the time. Any model we build should do better than this:

mlc_churn %>% count(churn) %>% mutate(frac = n / sum(n))
# A tibble: 2 × 3
  churn     n  frac
  <fct> <int> <dbl>
1 yes     707 0.141
2 no     4293 0.859
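
If you want that baseline as an explicit model, parsnip provides null_model(), which always predicts the majority class. Here’s a minimal sketch (my illustration; the exercise doesn’t require it):

# A trivial baseline that always predicts the majority class ("no").
null_fit <- workflow(
  preprocessor = recipe(churn ~ ., data = mlc_churn),
  spec = null_model(mode = "classification") %>% set_engine("parsnip")
) %>% 
  fit(data = mlc_churn)

# Its accuracy matches the majority-class fraction computed above.
augment(null_fit, mlc_churn) %>% 
  accuracy(truth = churn, estimate = .pred_class)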

We’ll compare the performance of two models using cross-validation. First, let’s declare the resamples to use:

churn_split <- initial_split(mlc_churn)
churn_resamples <- training(churn_split) %>% vfold_cv(v = 5)

Now let’s define the two models: a decision tree at two different tree depths.

tree_recipe <- recipe(churn ~ ., data = mlc_churn)
tree1 <- decision_tree(mode = "classification", tree_depth = 3)
tree2 <- decision_tree(mode = "classification", tree_depth = 30)

# Tree 1
fit_resamples(
  workflow(preprocessor = tree_recipe, spec = tree1),
  resamples = churn_resamples
) %>% collect_metrics()
# A tibble: 3 × 6
  .metric     .estimator   mean     n std_err .config             
  <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.901      5 0.00381 Preprocessor1_Model1
2 brier_class binary     0.0871     5 0.00297 Preprocessor1_Model1
3 roc_auc     binary     0.728      5 0.00877 Preprocessor1_Model1
# Tree 2
fit_resamples(
  workflow(preprocessor = tree_recipe, spec = tree2),
  resamples = churn_resamples
) %>% collect_metrics()
# A tibble: 3 × 6
  .metric     .estimator   mean     n std_err .config             
  <chr>       <chr>       <dbl> <int>   <dbl> <chr>               
1 accuracy    binary     0.935      5 0.00357 Preprocessor1_Model1
2 brier_class binary     0.0571     5 0.00324 Preprocessor1_Model1
3 roc_auc     binary     0.887      5 0.0123  Preprocessor1_Model1

But that’s a lot of copy-and-paste code. Workflow sets give us an easier way:

# Construct a bunch of workflows by applying the same preprocessing recipe to a bunch of different models.
churn_workflows <- workflow_set(
  preproc = list(tree_recipe),
  models = list(shallow = tree1, deeper = tree2))
churn_workflows
# A workflow set/tibble: 2 × 4
  wflow_id       info             option    result    
  <chr>          <list>           <list>    <list>    
1 recipe_shallow <tibble [1 × 4]> <opts[0]> <list [0]>
2 recipe_deeper  <tibble [1 × 4]> <opts[0]> <list [0]>

A workflow set is just a data frame of workflows. We can now call the fit_resamples function for each of these workflows, giving each one the same set of resamples.
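
To see that each row really holds a complete workflow, you can pull one back out by its wflow_id (a quick check; nothing later relies on it):

extract_workflow(churn_workflows, id = "recipe_shallow")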

In CS, when we apply the same function to a bunch of different items, we call it mapping, hence the name workflow_map.

scores <- churn_workflows %>% 
  workflow_map(
    fn = "fit_resamples",
    resamples = churn_resamples
  )

There’s a convenient function for plotting the results.

autoplot(scores, metric = "accuracy") +
  geom_text(aes(y = mean, label = wflow_id), angle = 90, nudge_x = .05, color = "black")

We see that the deeper tree has higher accuracy.
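
If you’d rather read a table than a plot, the workflowsets package also has rank_results() (not used in the output above):

# Rank the workflows by cross-validated accuracy.
rank_results(scores, rank_metric = "accuracy")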

Use the code area below to try another few models. You might try a different hyperparameter on the decision tree, or a different kind of model entirely.

tree_recipe <- recipe(churn ~ ., data = mlc_churn)
tree1 <- decision_tree(mode = "classification", tree_depth = 3)
tree2 <- decision_tree(mode = "classification", tree_depth = 30)

# Construct a bunch of workflows by applying the same preprocessing recipe to a bunch of different models.
churn_workflows <- workflow_set(
  preproc = list(tree_recipe),
  models = list(shallow = tree1, deeper = tree2))

# Evaluate all the CV scores.
scores <- churn_workflows %>% 
  workflow_map(
    fn = "fit_resamples",
    resamples = churn_resamples
  )

# Plot the results
autoplot(scores, metric = "accuracy") +
  geom_text(aes(y = mean, label = wflow_id), angle = 90, nudge_x = .05, color = "black")
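
As one possible starting point (my suggestion, not the only option), you could add a plain logistic regression to the workflow set and re-run workflow_map() and autoplot():

churn_workflows <- workflow_set(
  preproc = list(tree_recipe),
  models = list(
    shallow  = tree1,
    deeper   = tree2,
    logistic = logistic_reg(mode = "classification")  # default glm engine
  )
)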

Optional: Automated Tuning

# This was created using:
#usemodels::use_glmnet(churn ~ ., data = mlc_churn)
glmnet_recipe <- 
  recipe(formula = churn ~ ., data = mlc_churn) %>% 
  step_novel(all_nominal_predictors()) %>% 
  step_dummy(all_nominal_predictors()) %>% 
  step_zv(all_predictors()) %>% 
  step_normalize(all_numeric_predictors()) 

glmnet_spec <- 
  logistic_reg(penalty = tune(), mixture = tune()) %>% 
  set_mode("classification") %>% 
  set_engine("glmnet") 

glmnet_workflow <- 
  workflow() %>% 
  add_recipe(glmnet_recipe) %>% 
  add_model(glmnet_spec) 

glmnet_grid <- tidyr::crossing(
  penalty = 10^seq(-6, -1, length.out = 20),
  mixture = c(0.05, 0.2, 0.4, 0.6, 0.8, 1)
)

glmnet_tune <- 
  tune_grid(glmnet_workflow, resamples = churn_resamples, grid = glmnet_grid) 

Then we can summarize the results:

collect_metrics(glmnet_tune)
# A tibble: 360 × 8
      penalty mixture .metric     .estimator  mean     n std_err .config        
        <dbl>   <dbl> <chr>       <chr>      <dbl> <int>   <dbl> <chr>          
 1 0.000001      0.05 accuracy    binary     0.861     5 0.00823 Preprocessor1_…
 2 0.000001      0.05 brier_class binary     0.101     5 0.00441 Preprocessor1_…
 3 0.000001      0.05 roc_auc     binary     0.808     5 0.00784 Preprocessor1_…
 4 0.00000183    0.05 accuracy    binary     0.861     5 0.00823 Preprocessor1_…
 5 0.00000183    0.05 brier_class binary     0.101     5 0.00441 Preprocessor1_…
 6 0.00000183    0.05 roc_auc     binary     0.808     5 0.00784 Preprocessor1_…
 7 0.00000336    0.05 accuracy    binary     0.861     5 0.00823 Preprocessor1_…
 8 0.00000336    0.05 brier_class binary     0.101     5 0.00441 Preprocessor1_…
 9 0.00000336    0.05 roc_auc     binary     0.808     5 0.00784 Preprocessor1_…
10 0.00000616    0.05 accuracy    binary     0.861     5 0.00823 Preprocessor1_…
# ℹ 350 more rows
autoplot(glmnet_tune)

And show just the best ones.

show_best(glmnet_tune)
Warning in show_best(glmnet_tune): No value of `metric` was given; "roc_auc"
will be used.
# A tibble: 5 × 8
  penalty mixture .metric .estimator  mean     n std_err .config               
    <dbl>   <dbl> <chr>   <chr>      <dbl> <int>   <dbl> <chr>                 
1 0.00886     0.8 roc_auc binary     0.823     5 0.00933 Preprocessor1_Model096
2 0.00886     1   roc_auc binary     0.823     5 0.00912 Preprocessor1_Model116
3 0.0162      0.4 roc_auc binary     0.822     5 0.00970 Preprocessor1_Model057
4 0.0162      0.6 roc_auc binary     0.822     5 0.00922 Preprocessor1_Model077
5 0.00886     0.6 roc_auc binary     0.822     5 0.00942 Preprocessor1_Model076
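
From here, a typical next step (not shown in this tutorial’s output) is to pick the best parameter combination, finalize the workflow with it, and then fit once on the training set while evaluating on the held-out test set:

# Pick the best penalty/mixture by ROC AUC, plug them into the workflow,
# then fit on the training set and evaluate on the test set in one step.
best_params <- select_best(glmnet_tune, metric = "roc_auc")
final_glmnet <- finalize_workflow(glmnet_workflow, best_params)

last_fit(final_glmnet, churn_split) %>% 
  collect_metrics()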

For more details, especially for doing this on many models at once, see Tidy Modeling with R.

Appendix

To run any code chunk from this tutorial in your own environment, use:

library(tidyverse)
library(tidymodels)
theme_set(theme_bw())
options(scipen = 5) # encourage metrics to print in fixed-point notation
options(dplyr.summarise.inform = FALSE) # silence a warning message

ames_all <- 
  read_builtin("ames", package = "modeldata") %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>%
  mutate(across(where(is.integer), as.double)) %>%
  mutate(Sale_Price = Sale_Price / 1000)
metrics <- yardstick::metric_set(mae, rsq_trad)

set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)

ames_with_split_marked <- bind_rows(
  train = ames_train,
  test = ames_test,
  .id = "split"
) %>% mutate(split = as_factor(split))
ames_resamples <- vfold_cv(ames_train, v = 6)
mlc_churn <- read_builtin("mlc_churn", package = "modeldata")
skimr::skim(mlc_churn)
churn_split <- initial_split(mlc_churn)
churn_resamples <- training(churn_split) %>% vfold_cv(v = 5)
lat_long_grid <- expand_grid(
  Latitude  = modelr::seq_range(ames_train$Latitude,  n = 200, expand = .05),
  Longitude = modelr::seq_range(ames_train$Longitude, n = 200, expand = .05)
)

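# Plot a fitted model's predictions over the lat/long grid as a raster,
# with the actual homes overlaid and colored by their sale price.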
show_latlong_model <- function(dataset, model, model_name = deparse(substitute(model))) {
  ggplot(dataset, aes(x = Longitude, y = Latitude)) +
    geom_raster(
      data = augment(model, lat_long_grid),
      mapping = aes(fill = .pred)
    ) +
    geom_point(color = "black", size = .75) +
    geom_point(aes(color = Sale_Price), size = .5) +
    scale_color_viridis_c(aesthetics = c("color", "fill"), limits = range(dataset$Sale_Price)) +
    coord_equal(expand = FALSE) +
    guides(fill = "none") +
    labs(title = model_name) +
    theme_minimal()
}