class: left, top, title-slide

.title[
# Predictive Analytics Unit 7: Hyperparameter Tuning and Cross-Validation
]

.author[
### Ken Arnold
Calvin University
]

---

.small-code[

```r
data(ames, package = "modeldata")
ames_all <- ames %>% 
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>% 
  mutate(across(where(is.integer), as.double)) %>% 
  mutate(Sale_Price = Sale_Price / 1000)
rm(ames)
```

```r
metrics <- yardstick::metric_set(mae, mape, rsq_trad)
set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
```

```r
model1 <- decision_tree(mode = "regression", tree_depth = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

```r
model2 <- decision_tree(mode = "regression", tree_depth = 30) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

```r
model3_spec <- decision_tree(mode = "regression",
                             cost_complexity = 1e-6, min_n = 2)
model3 <- model3_spec %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```
]

---

## Location, Location, Location!

- Recall: the Ames housing dataset has sale prices for homes. **Task**: predict how much a home will sell for.
- Previously we used attributes of the house and lot.
- Today (just to illustrate), we'll look at location *only*.

---

class: autosize

## Which model is better?

.pull-left[
<img src="slides07tuning_files/figure-html/show-model1-alg-1.png" width="90%" style="display: block; margin: auto;" />
<img src="slides07tuning_files/figure-html/show-model1-data-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[
<img src="slides07tuning_files/figure-html/show-model2-alg-1.png" width="90%" style="display: block; margin: auto;" />
<img src="slides07tuning_files/figure-html/show-model2-data-1.png" width="90%" style="display: block; margin: auto;" />
]

---

class: autosize

## How about this model?

.autocontent[
<img src="slides07tuning_files/figure-html/show-model3-data-1.png" width="100%" style="display: block; margin: auto;" />
]

---

## What made these models different? *Hyperparameters*

* Tree depth: how many levels of decisions
* Leaf size: minimum number of observations in each leaf node
* Complexity penalty: how much improvement a split must provide to be "worth it"

```r
model1 <- decision_tree(mode = "regression", tree_depth = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
model2 <- decision_tree(mode = "regression", tree_depth = 30) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
model3 <- decision_tree(mode = "regression", cost_complexity = 1e-6, min_n = 2) %>% 
  fit(Sale_Price ~ Latitude + Longitude, data = ames_train)
```

---

## How do we *train* a decision tree?

Greedy algorithm: make the best single split of the current data, then repeat.

* The model: "choose your own adventure": at each step, check one simple condition on one variable (e.g., `Latitude < 42.05`)
* Goal: find the best tree (for regression: minimize MSE)
* Approach: greedy algorithm: try all possible splits, keep the best one, repeat (see the sketch on the next slide).
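---

## How do we *train* a decision tree?

To make one greedy step concrete, here is a minimal sketch. (This is a toy illustration, not how `rpart` implements it; `best_split` is a hypothetical helper, and a real implementation also respects the complexity penalty and leaf-size limits.)

.small-code[

```r
# Find the threshold on one predictor that minimizes total squared error
# when each side of the split is predicted by its own mean.
best_split <- function(data, predictor, outcome) {
  x <- data[[predictor]]
  y <- data[[outcome]]
  xs <- sort(unique(x))
  # Candidate thresholds: midpoints between consecutive distinct values,
  # so neither side of a split is ever empty.
  candidates <- (head(xs, -1) + tail(xs, -1)) / 2
  sse <- sapply(candidates, function(threshold) {
    left  <- y[x <  threshold]
    right <- y[x >= threshold]
    sum((left - mean(left))^2) + sum((right - mean(right))^2)
  })
  candidates[which.min(sse)]
}

# The first split a depth-1 tree might make on Latitude:
best_split(ames_train, "Latitude", "Sale_Price")
```
]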
---

## Which one works best?

```r
bind_rows(
  model1 = augment(model1, ames_train),
  model2 = augment(model2, ames_train),
  model3 = augment(model3, ames_train),
  .id = "model"
) %>% 
  group_by(model = as_factor(model)) %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  ggplot(aes(y = .estimate, x = model)) +
  geom_col() +
  facet_wrap(vars(.metric), scales = "free_y")
```

<img src="slides07tuning_files/figure-html/train-metrics-1.png" width="90%" style="display: block; margin: auto;" />

---

## How about on testing data?

```r
bind_rows(
  model1 = augment(model1, ames_test),
  model2 = augment(model2, ames_test),
  model3 = augment(model3, ames_test),
  .id = "model"
) %>% 
  group_by(model = as_factor(model)) %>% 
  metrics(truth = Sale_Price, estimate = .pred) %>% 
  ggplot(aes(y = .estimate, x = model)) +
  geom_col() +
  facet_wrap(vars(.metric), scales = "free_y")
```

<img src="slides07tuning_files/figure-html/test-metrics-1.png" width="90%" style="display: block; margin: auto;" />

---

## Why train-test split? Memorizing the eye chart

.floating-source[
[Snellen chart on Wikimedia](https://commons.wikimedia.org/wiki/File:Snellen_chart.svg), CC-BY-SA.
Analogy by [Clem Wang](https://www.linkedin.com/pulse/metaphor-over-fitting-machine-learning-clem-wang)
]

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/9/9f/Snellen_chart.svg/1000px-Snellen_chart.svg.png" width="50%" style="display: block; margin: auto;" />

---

## Cross-Validation

#### Puzzle

* We want to pick the model that works best on *unseen* data...
* ... but as soon as we try one model, **we've peeked at the data!**

#### One Solution

* Divide the training data into *V* piles (e.g., 10).
* Hide one pile from yourself.
* Train on ("analyze") the rest.
* Evaluate ("assess") on the pile you held out.
* Repeat for each of the *V* piles.

---

<!-- TODO: Use workflowsets for these (see tutorial) -->

<img src="slides07tuning_files/figure-html/compare-models-traintest-1.png" width="100%" style="display: block; margin: auto;" />

---

## What is Cross-Validation?

<img src="https://www.tmwr.org/premade/resampling.svg" width="90%" style="display: block; margin: auto;" />

---

<img src="img/tmwr-three-CV-iter.png" width="100%" style="display: block; margin: auto;" />

.floating-source[Source: [Tidy Modeling with R](https://www.tmwr.org/premade/three-CV-iter.svg)]

---

<video width="100%" controls loop><source src="slides07tuning_files/figure-html/ames-cv-anim.mp4" /></video>

---

## How to do CV?

1. Declare the splitting strategy:

```r
ames_resamples <- ames_train %>% 
  vfold_cv(v = 10)
```

```r
ames_resamples
```

```
#  10-fold cross-validation 
# A tibble: 10 × 2
  splits             id    
  <list>             <chr> 
1 <split [1447/161]> Fold01
2 <split [1447/161]> Fold02
3 <split [1447/161]> Fold03
4 <split [1447/161]> Fold04
5 <split [1447/161]> Fold05
6 <split [1447/161]> Fold06
# … with 4 more rows
```

---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.

<video width="70%" controls loop><source src="slides07tuning_files/figure-html/ames-cv-model3-anim.mp4" /></video>

---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.

```r
model3_samples <- model3_spec %>% 
*  fit_resamples(
    Sale_Price ~ Latitude + Longitude,
*    resamples = ames_resamples,
    metrics = metric_set(mae))
model3_samples %>% 
  collect_metrics(summarize = FALSE)
```

```
# A tibble: 10 × 5
  id     .metric .estimator .estimate .config             
  <chr>  <chr>   <chr>          <dbl> <chr>               
1 Fold01 mae     standard        29.0 Preprocessor1_Model1
2 Fold02 mae     standard        36.6 Preprocessor1_Model1
3 Fold03 mae     standard        34.4 Preprocessor1_Model1
4 Fold04 mae     standard        31.8 Preprocessor1_Model1
5 Fold05 mae     standard        32.0 Preprocessor1_Model1
6 Fold06 mae     standard        27.3 Preprocessor1_Model1
# … with 4 more rows
```
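---

## What does `fit_resamples` do?

Roughly this, under the hood. (A minimal sketch using rsample's `analysis()` and `assessment()` accessors, assuming the tidymodels packages are loaded as elsewhere in these slides; the real implementation adds preprocessing, parallelism, and error handling.)

.small-code[

```r
cv_mae <- purrr::map_dbl(ames_resamples$splits, function(split) {
  fold_train <- analysis(split)    # the V-1 piles we train on
  fold_test  <- assessment(split)  # the one held-out pile
  fold_fit <- model3_spec %>% 
    fit(Sale_Price ~ Latitude + Longitude, data = fold_train)
  # Score the held-out pile with the metric we care about
  augment(fold_fit, fold_test) %>% 
    mae(truth = Sale_Price, estimate = .pred) %>% 
    pull(.estimate)
})
cv_mae  # one MAE per fold, like collect_metrics(summarize = FALSE)
```
]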
---

## How to do CV?

1. Declare the splitting strategy.
2. Fit on each resample; evaluate using a set of metrics.
3. Plot and/or summarize the metrics.

.pull-left[

```r
model3_samples %>% 
  collect_metrics(summarize = FALSE) %>% 
  ggplot(aes(x = .estimate, y = "model3")) +
  geom_point()
```

<img src="slides07tuning_files/figure-html/crude-plot-folds-1.png" width="90%" style="display: block; margin: auto;" />
]

.pull-right[

```r
model3_samples %>% 
  collect_metrics(summarize = TRUE)
```

```
# A tibble: 1 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    32.2    10    1.20 Preprocessor1_Model1
```
]

---

<video width="90%" controls loop><source src="slides07tuning_files/figure-html/cv-anim.mp4" /></video>

---

## Goal: *Generalize*

*Observation*: Predictive models almost always do better on the data they were trained on than on anything else.

- Problem of *variance*:
  - the model uses a pattern that held only by chance
  - the model uses a pattern that holds only for some data
  - the model uses a real pattern but got a fuzzy picture of it
- Problem of *distribution shift*:
  - the model assumed the world wasn't changing, but it was

---

## Bias and Variance

- Models differ in *bias*: some models just can't learn some types of data (e.g., linear regression trying to fit a sharp boundary).
  - High bias: poor performance on both train and test
- Models differ in *variance*: some models are more sensitive to which specific training data they happened to get.
  - High variance: better performance on train often means worse performance on test (but sometimes ok by chance)

Often a trade-off: more flexible models are more likely to fit quirks in the training set. But not always (e.g., random forest).

---

## Hyperparameters control the bias/variance trade-off

- Decision tree:
  - Let the tree grow deeper: bias goes ....., variance goes .....
  - Require more data in each leaf: bias goes ....., variance goes .....
- Nearest Neighbors:
  - Use more neighbors: bias goes ....., variance goes .....
- Neural net:
  - More regularization: bias goes ....., variance goes .....

---

## Automated methods can tune hyperparameters

- Typically use cross-validation to measure performance under different parameter settings (a sketch follows).
- See the "Tidy Modeling with R" textbook for extended examples.
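---

## Automated tuning: a sketch

A minimal sketch of grid search with the tune package, reusing our resamples. (Illustrative only: `tune()` marks the hyperparameters to search over, and `grid = 10` asks for 10 candidate combinations; see "Tidy Modeling with R" for the full workflow.)

.small-code[

```r
tree_spec <- decision_tree(
  mode = "regression",
  tree_depth = tune(),   # placeholders: tune_grid fills these in
  min_n = tune())

tree_tuned <- tune_grid(
  tree_spec,
  Sale_Price ~ Latitude + Longitude,
  resamples = ames_resamples,  # the same 10-fold CV as before
  grid = 10,                   # try 10 candidate combinations
  metrics = metric_set(mae))

# Which hyperparameter settings gave the lowest cross-validated MAE?
show_best(tree_tuned, metric = "mae")
```
]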