In this lab (turned homework) you will try to make models of two different types and tune their parameters and data choices to do well at a prediction task. This is the central task of data modeling. Many (but far from all) people think it’s the most fun part.
Our learning objectives are:
You'll find lab10 repos (lab10-V2) available to your cohort. Pick one for your team, making sure that no two teams share the same repo. Note that the RStudio server lets multiple people edit a single project at the same time! This can be confusing since it's different from collaborating using GitHub, so only try it if everyone on your team is game.
You have been given a training set with a random subset of the Ames home sale dataset. Your repo includes a command to load it. See the Data dictionary if you need it.
We’ve scaled the sale price to be in units of $1000, to keep the numbers more manageable.
Now that we have techniques for specifying a model, preprocessing data to give to it, and estimating its performance, let's put them together and try to make a good model.
We learned about recipes and workflows but only used them occasionally last week. For this lab, though, they become essential as we compare models that use different variables, different preprocessing steps, different model types, etc., all with the same methodology.
Our basic strategy will be to collect model specs and metrics in a data frame. We’ll run through the approach once for a basic linreg and decision tree, then you’ll try out things on your own.
Overall let’s use 10-fold cross-validation. We need to set a seed here because vfold_cv uses random splitting.
Since the toolkit I usually use, Scikit-Learn, uses deterministic cross-validation by default, I didn't realize that this needs a seed to be reproducible. Sorry to those of you who got confused because our numbers didn't match.
set.seed(2)
ames_resamples <- ames_train %>%
  vfold_cv(v = 10)
These models can be time-consuming to run, so let's collect the results in a modeling_results data frame as we go. Let's start off with no results:
modeling_results <- tibble()
And here's a function to add a new set of results. Don't worry about how it works.
append_new_results <- function(modeling_results, model_name, spec, cv_results) {
if (nrow(modeling_results) > 0 && model_name %in% modeling_results$model_name)
stop(paste0(
"There's already results for a model with the name ", model_name,
". Did you forget to change the name?"))
bind_rows(
modeling_results %>% mutate(model_name = as.character(model_name)),
tibble(
model_name = model_name,
spec = list(spec),
cv_results %>% select(-.estimator)
)
) %>%
mutate(model_name = as_factor(model_name)) # Ensure that factor level matches insertion order.
}
get_spec_for_model <- function(model_name) {
modeling_results %>% filter(model_name == !!model_name) %>% purrr::chuck("spec", 1)
}
Finally, a few utilities:
add_predictions <- function(data, ...) {
imap_dfr(
rlang::dots_list(..., .named = TRUE),
function(model, model_name) {
model %>%
predict(data) %>%
bind_cols(data) %>%
mutate(model = !!model_name)
}
)
}
sweep_model_examples <- function(model, dataset, vars_to_sweep, examples = slice_sample(dataset, n = 10)) {
X <- map_dfr(vars_to_sweep, function(v) {
var_to_sweep <- rlang::as_label(v)
sweep_min <- min(dataset[[var_to_sweep]])
sweep_max <- max(dataset[[var_to_sweep]])
expand_grid(
examples %>% select(-!!var_to_sweep) %>% mutate(.idx = row_number()),
!!enquo(var_to_sweep) := seq(sweep_min, sweep_max, length.out = 500)) %>%
mutate(sweep_var = var_to_sweep, .sweep_val = .data[[var_to_sweep]])
})
model %>%
predict(X) %>%
bind_cols(X)
}
linear_reg <- function(engine = "lm", ...) {
parsnip::linear_reg(...) %>% set_engine(engine)
}
decision_tree <- function(mode = "regression", engine = "rpart", ...) {
  parsnip::decision_tree(mode = mode, ...) %>%
set_engine(engine)
}Let’s first fit a simple linear regression. First we define our spec.
Note that we need to give recipe a data parameter. This is the template data, just used to tell the recipe what columns there are and what data type each one has. It does not actually train on this data!
spec <- workflow() %>%
add_recipe(
recipe(Sale_Price ~ Latitude + Longitude, data = ames_train)
) %>%
add_model(
linear_reg()
)Then, let’s get the cross-validation assessment scores:
cv_results <-
spec %>%
fit_resamples(resamples = ames_resamples, metrics = metric_set(mae)) %>%
collect_metrics(summarize = FALSE)
cv_results
## # A tibble: 10 x 4
## id .metric .estimator .estimate
## <chr> <chr> <chr> <dbl>
## 1 Fold01 mae standard 45.4
## 2 Fold02 mae standard 47.7
## 3 Fold03 mae standard 53.1
## 4 Fold04 mae standard 48.1
## 5 Fold05 mae standard 40.6
## 6 Fold06 mae standard 48.7
## 7 Fold07 mae standard 45.6
## 8 Fold08 mae standard 53.4
## 9 Fold09 mae standard 49.7
## 10 Fold10 mae standard 45.5
Finally, we collect these results into our modeling_results data frame.
modeling_results <- modeling_results %>%
append_new_results(
model_name = "Simple Linreg",
spec = spec,
cv_results = cv_results
)
modeling_results
## # A tibble: 10 x 5
## model_name spec id .metric .estimate
## <fct> <list> <chr> <chr> <dbl>
## 1 Simple Linreg <workflow> Fold01 mae 45.4
## 2 Simple Linreg <workflow> Fold02 mae 47.7
## 3 Simple Linreg <workflow> Fold03 mae 53.1
## 4 Simple Linreg <workflow> Fold04 mae 48.1
## 5 Simple Linreg <workflow> Fold05 mae 40.6
## 6 Simple Linreg <workflow> Fold06 mae 48.7
## 7 Simple Linreg <workflow> Fold07 mae 45.6
## 8 Simple Linreg <workflow> Fold08 mae 53.4
## 9 Simple Linreg <workflow> Fold09 mae 49.7
## 10 Simple Linreg <workflow> Fold10 mae 45.5
Notice that in your repo these three code blocks are all together in the same chunk. That’s intentional, to avoid accidentally getting the three parts out of sync.
Your turn: repeat the above, but now using the default decision_tree(). Start by copying and pasting the linreg chunk from above and changing the chunk name, model spec, and model_name parameter.
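For reference, here's a minimal sketch of what that chunk could look like, assuming you keep the same Latitude + Longitude recipe as the linreg above (your own chunk may well differ):
# Same recipe as before; only the model spec and the results name change.
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude, data = ames_train)
  ) %>%
  add_model(
    decision_tree()
  )
cv_results <-
  spec %>%
  fit_resamples(resamples = ames_resamples, metrics = metric_set(mae)) %>%
  collect_metrics(summarize = FALSE)
modeling_results <- modeling_results %>%
  append_new_results(
    model_name = "Decision Tree",
    spec = spec,
    cv_results = cv_results
  )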
Here’s what my modeling results looked like after I did that:
modeling_results
## # A tibble: 20 x 5
## model_name spec id .metric .estimate
## <fct> <list> <chr> <chr> <dbl>
## 1 Simple Linreg <workflow> Fold01 mae 45.4
## 2 Simple Linreg <workflow> Fold02 mae 47.7
## 3 Simple Linreg <workflow> Fold03 mae 53.1
## 4 Simple Linreg <workflow> Fold04 mae 48.1
## 5 Simple Linreg <workflow> Fold05 mae 40.6
## 6 Simple Linreg <workflow> Fold06 mae 48.7
## 7 Simple Linreg <workflow> Fold07 mae 45.6
## 8 Simple Linreg <workflow> Fold08 mae 53.4
## 9 Simple Linreg <workflow> Fold09 mae 49.7
## 10 Simple Linreg <workflow> Fold10 mae 45.5
## 11 Decision Tree <workflow> Fold01 mae 33.3
## 12 Decision Tree <workflow> Fold02 mae 36.2
## 13 Decision Tree <workflow> Fold03 mae 38.5
## 14 Decision Tree <workflow> Fold04 mae 35.3
## 15 Decision Tree <workflow> Fold05 mae 31.4
## 16 Decision Tree <workflow> Fold06 mae 34.3
## 17 Decision Tree <workflow> Fold07 mae 34.1
## 18 Decision Tree <workflow> Fold08 mae 35.7
## 19 Decision Tree <workflow> Fold09 mae 34.8
## 20 Decision Tree <workflow> Fold10 mae 33.8
Compare the performance of the two models. Which one is better?
modeling_results %>%
ggplot(aes(y = fct_rev(model_name), x = .estimate)) +
geom_boxplot() +
geom_point() +
  labs(x = "Mean Absolute Error ($1000)", y = "")
## # A tibble: 2 x 2
## model_name mae_median
## <fct> <dbl>
## 1 Simple Linreg 47.9
## 2 Decision Tree 34.6
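The median MAE table above was computed separately; here's one way you could get those numbers yourself (a sketch, not necessarily the exact code used):
# Summarize the per-fold MAE estimates down to one median per model.
modeling_results %>%
  group_by(model_name) %>%
  summarize(mae_median = median(.estimate))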
Try getting a better decision tree model:
decision_tree parameters (run ?rpart::rpart.control for details):

- cost_complexity: any split that does not improve the fit by at least this much is not attempted (defaults to 0.01, aka cp)
- tree_depth: maximum depth of any node in the final tree (defaults to 30, aka maxdepth)
- min_n: minimum number of observations that must exist in a node in order for a split to be attempted (defaults to 20, aka minsplit)
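For example, here's a sketch of a spec that adjusts some of these parameters (the particular values and variables are just illustrative starting points, not recommendations):
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude + Gr_Liv_Area, data = ames_train)
  ) %>%
  add_model(
    # Placeholder tuning values; try several combinations and compare.
    decision_tree(cost_complexity = 0.001, tree_depth = 10, min_n = 5)
  )
Give each new spec a fresh model_name (e.g., "Decision Tree 2") when you call append_new_results; it will stop you if you reuse a name.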
First, let’s get a model fit on the whole training set. Remember: the cross validation process trained a bunch of smaller models; now we want to look at the slightly bigger full model.
model <-
get_spec_for_model("Decision Tree 2") %>%
fit(ames_train)Here’s a way to see at a glance what might be helpful features to add becasue there are patterns in the residuals. We just plot the residuals against all of the other numeric features. (It doesn’t handle categorical features, but you can try mapping a categorical feature to color or something.)
ames_train %>%
add_predictions(model) %>%
mutate(resid = Sale_Price - .pred) %>%
# Pivot all of the numeric variables except for the outcomes and residuals.
pivot_longer(c(where(is.numeric) & !any_of(c("Sale_Price", ".pred", "resid")))) %>%
# Plot!
ggplot(aes(x = value, y = resid)) +
geom_point() +
geom_smooth(method = "lm") +
  facet_wrap(vars(name), scales = "free")
## `geom_smooth()` using formula 'y ~ x'
The sweep_model_examples function defined above creates example predictions by starting with 10 randomly selected example data points and sweeping each requested variable (column) across the whole range of values seen in the training set. You give it one or more variables using vars(), the same quoting function that facet_wrap() uses. Note: it currently only works for continuous variables.
The following example shows how to sweep several variables and plot each sweep as a line.
set.seed(0)
sweep_model_examples(model, ames_train, vars(Longitude, Latitude, Gr_Liv_Area)) %>%
ggplot(aes(x = .sweep_val, y = .pred, group = .idx, color = Latitude)) +
geom_line(alpha = .5) +
facet_wrap(vars(sweep_var), scales = "free")
Now, try to get a better linear regression model. You might try:
- step_dummy (and perhaps step_other if needed)
- step_poly, or its better-behaved cousin step_ns
- step_discretize
- step_interact (see hw9 for an example)
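For instance, here is one sketch of a richer linear regression spec that combines a few of the steps above (the variable choices, steps, and parameter values are all just illustrative):
spec <- workflow() %>%
  add_recipe(
    recipe(Sale_Price ~ Latitude + Longitude + Gr_Liv_Area + Neighborhood, data = ames_train) %>%
      step_other(Neighborhood, threshold = 0.05) %>%  # lump rare neighborhoods into an "other" level
      step_dummy(Neighborhood) %>%                    # dummy-code the categorical predictor
      step_ns(Latitude, Longitude, deg_free = 5)      # natural splines for the coordinates
  ) %>%
  add_model(
    linear_reg()
  )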
Once you have collected performance data for a few different models, compare their performance by plotting data from the modeling_results data frame. (The other visuals are just for you to explore. Keep your model performance visual chunk after all of your modeling results have been collected, so that you can plot the performance here.)
For example, here is the result with some models I was playing around with:
modeling_results %>%
ggplot(aes(y = fct_rev(model_name), x = .estimate)) +
geom_boxplot() +
geom_point() +
  labs(x = "Mean Absolute Error ($1000)", y = "")
Make a list of all of the models that you tried. (Give enough detail that another student would be able to try the same model.) Mark which one is your best decision tree and which one is your best linear regression.
How does the numerical performance of your best linear regression compare with your best decision tree? Refer to both the center and spread of cross-validated performance estimates.
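To put numbers on the center and spread, one option (again just a sketch; use whatever summaries you prefer) is:
# One center measure (median) and one spread measure (IQR) per model.
modeling_results %>%
  group_by(model_name) %>%
  summarize(
    mae_median = median(.estimate),
    mae_iqr = IQR(.estimate)
  )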
Pick a house from the training set. Explain how each of the two models you described in the previous question makes a prediction on that house.
Here’s the example home to use:
example_home <- ames_train %>% head(1)
example_home
## # A tibble: 1 x 74
## MS_SubClass MS_Zoning Lot_Frontage Lot_Area Street Alley Lot_Shape
## <chr> <chr> <dbl> <dbl> <chr> <chr> <chr>
## 1 One_Story_… Resident… 141 31770 Pave No_A… Slightly…
## # … with 67 more variables: Land_Contour <chr>, Utilities <chr>,
## # Lot_Config <chr>, Land_Slope <chr>, Neighborhood <chr>, Condition_1 <chr>,
## # Condition_2 <chr>, Bldg_Type <chr>, House_Style <chr>, Overall_Cond <chr>,
## # Year_Built <dbl>, Year_Remod_Add <dbl>, Roof_Style <chr>, Roof_Matl <chr>,
## # Exterior_1st <chr>, Exterior_2nd <chr>, Mas_Vnr_Type <chr>,
## # Mas_Vnr_Area <dbl>, Exter_Cond <chr>, Foundation <chr>, Bsmt_Cond <chr>,
## # Bsmt_Exposure <chr>, BsmtFin_Type_1 <chr>, BsmtFin_SF_1 <dbl>,
## # BsmtFin_Type_2 <chr>, BsmtFin_SF_2 <dbl>, Bsmt_Unf_SF <dbl>,
## # Total_Bsmt_SF <dbl>, Heating <chr>, Heating_QC <chr>, Central_Air <chr>,
## # Electrical <chr>, First_Flr_SF <dbl>, Second_Flr_SF <dbl>,
## # Gr_Liv_Area <dbl>, Bsmt_Full_Bath <dbl>, Bsmt_Half_Bath <dbl>,
## # Full_Bath <dbl>, Half_Bath <dbl>, Bedroom_AbvGr <dbl>, Kitchen_AbvGr <dbl>,
## # TotRms_AbvGrd <dbl>, Functional <chr>, Fireplaces <dbl>, Garage_Type <chr>,
## # Garage_Finish <chr>, Garage_Cars <dbl>, Garage_Area <dbl>,
## # Garage_Cond <chr>, Paved_Drive <chr>, Wood_Deck_SF <dbl>,
## # Open_Porch_SF <dbl>, Enclosed_Porch <dbl>, Three_season_porch <dbl>,
## # Screen_Porch <dbl>, Pool_Area <dbl>, Pool_QC <chr>, Fence <chr>,
## # Misc_Feature <chr>, Misc_Val <dbl>, Mo_Sold <dbl>, Year_Sold <dbl>,
## # Sale_Type <chr>, Sale_Condition <chr>, Sale_Price <dbl>, Longitude <dbl>,
## # Latitude <dbl>
And here’s how to get your model’s prediction on it.
# Get the fitted model
model <-
get_spec_for_model("Decision Tree") %>%
fit(ames_train)
# Make the prediction
model %>% predict(example_home)
## # A tibble: 1 x 1
## .pred
## <dbl>
## 1 137.
You can use the plots from “Examine Predictions” above to help understand what your model is doing. For a small enough decision tree, you can also use rpart.plot like we did in Lab 9 (but we need very slightly different code since we’re using workflows):
model %>%
workflows::pull_workflow_fit() %>%
pluck("fit") %>%
  rpart.plot::rpart.plot(roundint = FALSE, digits = 5, type = 4)
Finally, fill in your best overall model spec and the MAE you expect it to achieve on the held-out test set:
spec <- YOUR SPEC CODE HERE
expected_test_set_MAE <- YOUR NUMBER HERE
Explain your reasoning.