class: left, top, title-slide

.title[
# Predictive Analytics Unit 7: Hyperparameter Tuning and Cross-Validation
]

.author[
### Ken Arnold
Calvin University
]

---

.small-code[

```r
data(ames, package = "modeldata")
ames_all <- ames %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>% 
  mutate(across(where(is.integer), as.double)) %>% 
  mutate(Sale_Price = Sale_Price / 1000)
rm(ames)
```

```r
metrics <- yardstick::metric_set(mae, mape, rsq_trad)
set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```

```r
model1 <- decision_tree(mode = "regression", tree_depth = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

```r
model2 <- decision_tree(mode = "regression", tree_depth = 30) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

```r
model3_spec <- decision_tree(mode = "regression",
                             cost_complexity = 1e-6, min_n = 2)
model3 <- model3_spec %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```
]

---

## Location, Location, Location!

- Recall: the Ames housing dataset has sale prices for homes. **Task**: predict how much a home will sell for.
- Previously we used attributes of the house and lot.
- Today (just to illustrate), we'll look at location *only*.

---

class: autosize

## Which model is better?

.pull-left[
<img src="slides07tuning_files/figure-html/show-model1-alg-1.png" width="90%" style="display: block; margin: auto;" />
<img src="slides07tuning_files/figure-html/show-model1-data-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="slides07tuning_files/figure-html/show-model2-alg-1.png" width="90%" style="display: block; margin: auto;" />
<img src="slides07tuning_files/figure-html/show-model2-data-1.png" width="90%" style="display: block; margin: auto;" />
]

---

class: autosize

## How about this model?

.autocontent[
<img src="slides07tuning_files/figure-html/show-model3-data-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## What made these models different? *Hyperparameters*

* Tree depth: how many levels of decisions
* Leaf size: minimum number of observations in each leaf node
* Complexity penalty: how much improvement a split must provide to be "worth it"

```r
model1 <- decision_tree(mode = "regression", tree_depth = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
model2 <- decision_tree(mode = "regression", tree_depth = 30) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
model3 <- decision_tree(mode = "regression", cost_complexity = 1e-6, min_n = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

---

## How do we *train* a decision tree?

Greedy algorithm: make the best single split of the current data, then repeat.

* The model: "choose your own adventure": at each step, check one simple condition on one variable (e.g., `Latitude < 42.05`)
* Goal: find the best tree (for regression: minimize MSE)
* Approach: greedy algorithm: try all possible splits, keep the best one, repeat (see the sketch on the next slide).
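---

## How do we *train* a decision tree?

To make one greedy step concrete, here is a minimal sketch. (This is a toy illustration, not how `rpart` implements it; `best_split` is a hypothetical helper, and a real implementation also respects the complexity penalty and leaf-size limits.)

.small-code[

```r
# Find the threshold on one predictor that minimizes total squared error
# when each side of the split is predicted by its own mean.
best_split <- function(data, predictor, outcome) {
  x <- data[[predictor]]
  y <- data[[outcome]]
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values,
  # so neither side of a split is ever empty.
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(candidates, function(threshold) {
    left  <- y[x <  threshold]
    right <- y[x >= threshold]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}

# The first split a depth-1 tree might make on Latitude:
best_split(ames_train, "Latitude", "Sale_Price")
```
]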
---

## Which one works best?

```r
bind_rows(
  model1 = augment(model1, ames_train),
  model2 = augment(model2, ames_train),
  model3 = augment(model3, ames_train),
  .id = "model"
) %>% 
  group_by(model = as_factor(model)) %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  ggplot(aes(y = .estimate, x = model)) +
  geom_col() +
  facet_wrap(vars(.metric), scales = "free_y")
```

<img src="slides07tuning_files/figure-html/train-metrics-1.png" width="90%" style="display: block; margin: auto;" />

---

## How about on testing data?

```r
bind_rows(
  model1 = augment(model1, ames_test),
  model2 = augment(model2, ames_test),
  model3 = augment(model3, ames_test),
  .id = "model"
) %>% 
  group_by(model = as_factor(model)) %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  ggplot(aes(y = .estimate, x = model)) +
  geom_col() +
  facet_wrap(vars(.metric), scales = "free_y")
```

<img src="slides07tuning_files/figure-html/test-metrics-1.png" width="90%" style="display: block; margin: auto;" />

---

## Why train-test split? Memorizing the eye chart

.floating-source[
[Snellen chart on Wikimedia](https://commons.wikimedia.org/wiki/File:Snellen_chart.svg), CC-BY-SA.
Analogy by [Clem Wang](https://www.linkedin.com/pulse/metaphor-over-fitting-machine-learning-clem-wang)
]

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Snellen_chart.svg/1000px-Snellen_chart.svg.png" width="50%" style="display: block; margin: auto;" />

---

## Cross-Validation

#### Puzzle

* We want to pick the model that works best on *unseen* data...
* ... but as soon as we try one model, **we've peeked at the data!**

#### One Solution

* Divide the training data into *V* piles (e.g., 10).
* Hide one pile from yourself.
* Train on ("analyze") the rest.
* Evaluate ("assess") on the pile you held out.
* Repeat for each of the *V* piles.

---

<!-- TODO: Use workflowsets for these (see tutorial) -->

<img src="slides07tuning_files/figure-html/compare-models-traintest-1.png" width="100%" style="display: block; margin: auto;" />

---

## What is Cross-Validation?

<img src="https://www.tmwr.org/premade/resampling.svg" width="90%" style="display: block; margin: auto;" />

---

<img src="img/tmwr-three-CV-iter.png" width="100%" style="display: block; margin: auto;" />

.floating-source[Source: [Tidy Modeling with R](https://www.tmwr.org/premade/three-CV-iter.svg)]

---

<video width="100%" controls loop><source src="slides07tuning_files/figure-html/ames-cv-anim.mp4" /></video>

---

## How to do CV?

1. Declare the splitting strategy:

```r
ames_resamples <- ames_train %>% 
  vfold_cv(v = 10)
```

```r
ames_resamples
```

```
#  10-fold cross-validation 
# A tibble: 10 × 2
  splits             id    
  <list>             <chr> 
1 <split [1447/161]> Fold01
2 <split [1447/161]> Fold02
3 <split [1447/161]> Fold03
4 <split [1447/161]> Fold04
5 <split [1447/161]> Fold05
6 <split [1447/161]> Fold06
# … with 4 more rows
```

---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.

<video width="70%" controls loop><source src="slides07tuning_files/figure-html/ames-cv-model3-anim.mp4" /></video>

---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.

```r
model3_samples <- model3_spec %>% 
*  fit_resamples(
    Sale_Price ~ Latitude + Longitude,
*    resamples = ames_resamples,
    metrics = metric_set(mae))
model3_samples %>% 
  collect_metrics(summarize = FALSE)
```

```
# A tibble: 10 × 5
  id     .metric .estimator .estimate .config             
  <chr>  <chr>   <chr>          <dbl> <chr>               
1 Fold01 mae     standard        29.0 Preprocessor1_Model1
2 Fold02 mae     standard        36.6 Preprocessor1_Model1
3 Fold03 mae     standard        34.4 Preprocessor1_Model1
4 Fold04 mae     standard        31.8 Preprocessor1_Model1
5 Fold05 mae     standard        32.0 Preprocessor1_Model1
6 Fold06 mae     standard        27.3 Preprocessor1_Model1
# … with 4 more rows
```
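---

## What does `fit_resamples` do?

Roughly this, under the hood. (A minimal sketch using rsample's `analysis()` and `assessment()` accessors, assuming the tidymodels packages are loaded as elsewhere in these slides; the real implementation adds preprocessing, parallelism, and error handling.)

.small-code[

```r
cv_mae <- purrr::map_dbl(ames_resamples$splits, function(split) {
  fold_train <- analysis(split)    # the V-1 piles we train on
  fold_test  <- assessment(split)  # the one held-out pile
  fold_fit <- model3_spec %>% 
    fit(Sale_Price ~ Latitude + Longitude, data = fold_train)
  # Score the held-out pile with the metric we care about
  augment(fold_fit, fold_test) %>% 
    mae(truth = Sale_Price, estimate = .pred) %>% 
    pull(.estimate)
})
cv_mae  # one MAE per fold, like collect_metrics(summarize = FALSE)
```
]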
---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.
3. Plot and/or summarize the metrics.

.pull-left[

```r
model3_samples %>% 
  collect_metrics(summarize = FALSE) %>% 
  ggplot(aes(x = .estimate, y = "model3")) +
  geom_point()
```

<img src="slides07tuning_files/figure-html/crude-plot-folds-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[

```r
model3_samples %>% 
  collect_metrics(summarize = TRUE)
```

```
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    32.2    10    1.20 Preprocessor1_Model1
```
]

---

<video width="90%" controls loop><source src="slides07tuning_files/figure-html/cv-anim.mp4" /></video>

---

## Goal: *Generalize*

*Observation*: Predictive models almost always do better on the data they were trained on than on anything else.

- Problem of *variance*:
  - the model uses a pattern that held only by chance
  - the model uses a pattern that holds only for some data
  - the model uses a real pattern but got a fuzzy picture of it
- Problem of *distribution shift*:
  - the model assumed the world wasn't changing, but it was

---

## Bias and Variance

- Models differ in *bias*: some models just can't learn some types of data (e.g., linear regression trying to fit a sharp boundary).
  - High bias: poor performance on both train and test
- Models differ in *variance*: some models are more sensitive to which specific training data they happened to get.
  - High variance: better performance on train often means worse performance on test (but sometimes ok by chance)

Often a trade-off: more flexible models are more likely to fit quirks in the training set. But not always (e.g., random forest).

---

## Hyperparameters control the bias/variance trade-off

- Decision tree:
  - Let the tree grow deeper: bias goes ....., variance goes .....
  - Require more data in each leaf: bias goes ....., variance goes .....
- Nearest Neighbors:
  - Use more neighbors: bias goes ....., variance goes .....
- Neural net:
  - More regularization: bias goes ....., variance goes .....

---

## Automated methods can tune hyperparameters

- Typically use cross-validation to measure performance under different parameter settings (a sketch follows).
- See the "Tidy Modeling with R" textbook for extended examples.
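---

## Automated tuning: a sketch

A minimal sketch of grid search with the tune package, reusing our resamples. (Illustrative only: `tune()` marks the hyperparameters to search over, and `grid = 10` asks for 10 candidate combinations; see "Tidy Modeling with R" for the full workflow.)

.small-code[

```r
tree_spec <- decision_tree(
  mode = "regression",
  tree_depth = tune(),   # placeholders: tune_grid fills these in
  min_n = tune())

tree_tuned <- tune_grid(
  tree_spec,
  Sale_Price ~ Latitude + Longitude,
  resamples = ames_resamples,  # the same 10-fold CV as before
  grid = 10,                   # try 10 candidate combinations
  metrics = metric_set(mae))

# Which hyperparameter settings gave the lowest cross-validated MAE?
show_best(tree_tuned, metric = "mae")
```
]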