::opts_chunk$set(echo = TRUE)
knitrlibrary(tidyverse)
library(tidymodels)
library(rpart.plot)
theme_set(theme_bw())
Predictive Analytics Homework 3: Prediction with Tree-Based Models
Goals
- Compare and contrast linear regression, decision trees, and tree ensembles (random forest and boosted trees)
- Use visualization to understand the strengths and weaknesses of different models
- Identify distribution shift (bonus)
- Apply predictive modeling using
tidymodels
to a time-series prediction task
The Data
Let’s imagine we’re hired by the administrators of the Capital Bikeshare program program in Washington D.C., to help them understand and predict the hourly demand for rental bikes. We will be working with this data in several activities throughout this semester.
This understanding will help them plan the number of bikes that need to be available at different parts of the system at different times, to avoid the case where someone wants a bike but the station is empty, or someone wants to return a bike but the station is full.
The data for this problem were collected from the Capital Bikeshare program over the course of two years (2011 and 2012). Researchers at the University of Porto processed the data and augmented it with extra information, as described on this page. It has been cleaned up slightly for this assignment.
Getting Started
I used the following setup chunk when producing this assignment:
Include the following code to load the data file.
<- readRDS(url("https://cs.calvin.edu/courses/info/602/06ensembles/hw/bikeshare-day.rds")) daily_rides
Explore the data file in RStudio to get familiar with what data it contains. Refer to the description above for details.
Exploratory Analytics
Optional Exercise: Make a scatter plot of the number of casual
riders by date
, color
-coding whether each day is a workingday
or not. Describe one thing you notice about the data from this plot.
Optional Exercise: When starting an analysis, it’s really useful to look at the distribution of your response variable. Plot the distribution of casual
. Use a histogram or density plot.
Notice the long tail of high-ridership days. These can often throw off a model. Also notice that there are no ride counts below 0, as should be expected for data of counts. But a linear model might end up predicting a negative value, which wouldn’t make sense.
We can address both of those problems by transforming our response variable. Instead of trying to predict casual
, let’s try to predict log(casual)
. That way we’ll never predict a negative value for casual
.
Other transformations may work better, but log-transforming data is simple and common. In a linear model, it means that each term contributes multiplicatively to the prediction: each degree of temperature increase will mulitply the ride count by something (i.e., it goes up by x percent), rather than adding to it.
Optional Exercise: Now, plot the distribution of log(casual)
. Use a histogram or density plot.
```{r log-casual-hist}
%>%
daily_rides ggplot(aes(x = log(casual))) + geom_histogram(binwidth = .25)
```
Notice that the data distribution is less skewed, and the clear outliers are now actually the low-ridership days, which may not matter as much anyway.
For your first time through this assignment, use casual
without transformation. Once you finish, optionally try going back and predicting log(casual)
instead.
One final useful bit of EDA: ggpairs
. Try this on your own. You may need to adjust the figure size to make it big enough. You can always reduce the number of variables you’re plotting. If you have trouble getting an appropriate output, comment this out.
%>%
daily_rides select(month, workingday, temp, atemp, casual) %>%
::ggpairs() GGally
Train-test split
We’re going to do two kinds of train-test split for this assignment.
First, we’re going to pretend that this isn’t a time series prediction task. That is, we’ll ignore the progression of time, considering each day independently. This is the classic setting of supervised learning, and we’ll see how well our models learn to predict this data and generalize to other randomly selected days.
Then, we’ll recognize that the pattern of bike ridership actually does change over time; the phenomenon is called distribution shift. Specifically, we’ll compare 2011 (the first full year of service for which we have data) and 2012 (the second full year).
So we’ll start by splitting the data by year:
<- daily_rides %>% filter(year == 2011)
rides_2011 <- daily_rides %>% filter(year == 2012) rides_2012
Check that you get 365 days in 2011 and 366 days in 2012 (why might this make sense?)
Now we’ll do a train-test split on 2011. We’ll hold out 2012 as a real testing set and return to it later.
set.seed(1234)
<- initial_split(rides_2011, prop = 3/4)
rides_split <- training(rides_split)
train <- testing(rides_split) test
We’ll want to analyze performance by whether a day is in the training or test set. So we’ll combine them back together to make a data frame we can predict on later. Notice that the result has a split
column, which will record whether each row came from the training or test set.
<- bind_rows(
rides_2011_with_split_marked train = train,
test = test,
.id = "split"
%>% mutate(split = as_factor(split)) %>%
) arrange(date)
%>% head() %>% knitr::kable() rides_2011_with_split_marked
split | date | year | workingday | temp | atemp | casual | registered | month | day_since_start |
---|---|---|---|---|---|---|---|---|---|
test | 2011-01-01 | 2011 | weekend | 8.175849 | 7.999250 | 331 | 654 | 1 | 1 |
train | 2011-01-02 | 2011 | weekend | 9.083466 | 7.346774 | 131 | 670 | 1 | 2 |
test | 2011-01-03 | 2011 | workday | 1.229108 | -3.499270 | 120 | 1229 | 1 | 3 |
train | 2011-01-04 | 2011 | workday | 1.400000 | -1.999948 | 108 | 1454 | 1 | 4 |
test | 2011-01-05 | 2011 | workday | 2.666979 | -0.868180 | 82 | 1518 | 1 | 5 |
train | 2011-01-06 | 2011 | workday | 1.604356 | -0.608206 | 88 | 1518 | 1 | 6 |
This is actually a time series prediction problem, sometimes just called forecasting. Methods that are competitive for forecasting almost always look at recent history, not just general trends. There are ways of adapting the models we use here to do that, but we will keep it simple for now.
One aspect of time series prediction bears noting here, though: we need to be mindful of how we evaluate performance. If we picked days at random to evaluate on, most of the evaluation days would be within the time we’ve already seen. So we would be overconfident in our model’s ability to predict future ridership.
The initial_time_split
function can be used for this purpose if there is not a nice division by year like we have here.
More discussion of time series validation approaches can be found here.
Fitting Models
We’re going to compare four different models for predicting this data:
- Linear regression
- Decision tree
- Random Forest
- XGBoost
To keep the comparison simple (though not necessarily fair), we’ll use the same prediction formula for all four models:
<- casual ~ temp + workingday + month
model_formula # Come back and try this one later:
# model_formula <- casual ~ temp + workingday + month + day_since_start
Linear Regression
First we’ll fit a linear regression:
<- fit(
linreg_model linear_reg(),
model_formula,data = train
)
Exercise 1 (1pt): Inspect the model (using tidy()
or otherwise). Complete this sentence: “For every additional degree C, the model predicts ___ additional rides.”
```{r include=TRUE}
%>%
linreg_model tidy() %>%
select(term, estimate)
```
# A tibble: 14 × 2
term estimate
<chr> <dbl>
1 (Intercept) 490.
2 temp 24.2
3 workingdayworkday -636.
4 month2 43.4
5 month3 219.
6 month4 336.
7 month5 497.
8 month6 405.
9 month7 493.
10 month8 297.
11 month9 320.
12 month10 373.
13 month11 187.
14 month12 43.8
Predictions
Let’s plot the model’s predictions on the training set along with the number of casual
rides. Do this by using augment(linreg_model, train)
to make the predictions, then plot the result. You’ll want to add a geom_line
where you specify .pred
as the y
aesthetic.
augment(linreg_model, train) %>%
ggplot(aes(x = date, y = casual, color = workingday)) +
geom_point() +
geom_line(aes(y = .pred))
Exercise 2 (2pt): This is a linear model, so generally we expect to see straight lines. Explain why the prediction is not a straight line. Hint: Think about what the x axis shows here, vs what x axis would show a straight line. (To check your reasoning, try temporarily changing your formula. You might try casual ~ temp
.)
Observed by Predicted
Let’s make an observed-by-predicted plot. Add coord_obs_pred()
to make the axes exactly equal, and geom_abline
to draw the line corresponding to exactly correct predictions. Use the train
ing set again.
augment(linreg_model, new_data = train) %>%
ggplot(aes(x = casual, y = .pred, color = workingday)) +
geom_abline() +
geom_point(alpha = .5) +
coord_obs_pred()
Exercise 3 (1pt): Under what circumstances does this this model typically predict too high? Under what circumstances does it predict too low? Use the plots above to justify your answer.
Model debugging strategy:
- What values do the features have when the prediction is wrong? (only one feature is shown in that plot, but you could imagine making other plots.)
- What value does the target have when the prediction is wrong?
Quantify errors
Use the mae
function (or write your own summarize
call) to compute the Mean Absolute Error for both the train and test sets.
augment(linreg_model, rides_2011_with_split_marked) %>%
group_by(split) %>%
#summarize(mae = mean(abs(casual - .pred)))
mae(truth = casual, estimate = .pred)
# A tibble: 2 × 4
split .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 train mae standard 220.
2 test mae standard 235.
Exercise 4 (1pt): Complete the following sentence using the appropriate result from the table above: “On days that the model had not seen, the predicted number of rides was _____”. (You will need to use several words to fill in the blank.)
Decision Tree Regression
Now, let’s fit a decision tree using the same model_formula
as the linear regression. Use the default settings. Call the result dtree_model
.
<- fit(
dtree_model decision_tree(mode = "regression"),
data = train) model_formula,
We can show the tree using the same method we used in class and lab:
%>%
dtree_model extract_fit_engine() %>%
rpart.plot(roundint = FALSE, digits = 3, type = 4)
Exercise 5 (1pt): Select an arbitrary day from the dataset. Using the tree diagram shown here (which may differ slightly from what you obtain if you run the same code), explain how the model makes its prediction about the casual ridership on that date.
Exercise 6 (2pt): We’ve made another observed-by-predicted plot, this time for the decision tree. Explain why the horizontal lines appear for this model and not the linear regression model.
augment(dtree_model, new_data = train) %>%
ggplot(aes(x = casual, y = .pred, color = workingday)) +
geom_abline() +
geom_point(alpha = .5) +
coord_obs_pred()
Exercise 7 (1pt): Write a succinct sentence comparing the quantitative performance of this model with the performance of the linear regression model. You’ll need the results below:
```{r include=TRUE}
augment(dtree_model, rides_2011_with_split_marked) %>%
group_by(split) %>%
#summarize(mae = mean(abs(casual - .pred)))
mae(truth = casual, estimate = .pred)
```
# A tibble: 2 × 4
split .metric .estimator .estimate
<fct> <chr> <chr> <dbl>
1 train mae standard 180.
2 test mae standard 186.
Ensembles
Finally, let’s fit Random Forest (rand_forest
) and gradient-boosted tree (boost_tree
) regression models on this data. Call them rf_model
and boost_model
.
<-
rf_model rand_forest(mode = "regression") %>%
fit(model_formula, data = train)
<- fit(
boost_model boost_tree(mode = "regression"),
data = train) model_formula,
We can’t show these models because they have too many different trees in them. But we can still make other plots. Make observed-by-predicted plots for both the rf_model
and the boost_model
. (If you know how to write functions, this could be a good place to use them!)
Exercise 8 (1pt): What do you notice about the boosted tree that was not true for any of the other models?
augment(rf_model, new_data = train) %>%
ggplot(aes(x = casual, y = .pred, color = workingday)) +
geom_abline() +
geom_point(alpha = .5) +
coord_obs_pred()
augment(boost_model, new_data = train) %>%
ggplot(aes(x = casual, y = .pred, color = workingday)) +
geom_abline() +
geom_point(alpha = .5) +
coord_obs_pred()
# As an aside, here's a way to use a function to represent what's in common between these plots.
# It uses an advanced tidyverse technique: https://dplyr.tidyverse.org/articles/programming.html#indirection
<- function(model, data, var, ...) {
show_obs_vs_pred augment(model, new_data = data) %>%
ggplot(aes(x = {{var}}, y = .pred, ...)) +
geom_abline() +
geom_point(alpha = .5) +
coord_obs_pred()
}show_obs_vs_pred(rf_model, train, casual, color = workingday)
show_obs_vs_pred(boost_model, train, casual, color = workingday)
Finally, let’s quantify the performance of all these models. Here’s a relatively simple way to combine the results of all models:
<- rides_2011_with_split_marked
eval_dataset
<- bind_rows(
all_predictions linreg_model = augment(linreg_model, new_data = eval_dataset),
dtree_model = augment(dtree_model, new_data = eval_dataset),
rf_model = augment(rf_model, new_data = eval_dataset),
boost_model = augment(boost_model, eval_dataset),
.id = "model"
%>% mutate(model = as_factor(model)) )
Now we can group by model
and split
to compute our metrics:
%>%
all_predictions group_by(model, split) %>%
mae(truth = casual, estimate = .pred) %>%
mutate(mae = .estimate) %>%
ggplot(aes(x = model, y = mae, fill = split)) +
geom_col(position = "dodge")
Exercise 9 (2pt): Summarize the relative performance of these different models. Which performs best on training set data? Which performs best on unseen data? We say that a model underfits when it performs poorly even on its training data, and that a model overfits when its training-set performance stops being a good indicator of its test set performance. Which model(s) underfit? Which model(s) significantly overfit?
Distribution Shift
The goal of this last section is to notice what happens when the data and task break one of the basic assumptions of supervised learning. The models we’re using are trained to try to capture (i.e., “model”) the distribution of data that they were trained on. When tested on more data of the same kind, they tend to do well (to different degrees depending on the data). But when tested on data that came from a different distribution, they might do much worse. This scenario is called distribution shift, because the distribution of the data has changed compared with the data that the model was trained on. Sometimes the more closely the model captures its training distribution, the less robust it is to distribution shift.
Our scenario includes an example of distribution shift: the second full year of operation (2012), ridership patterns were significantly different than they were on comparable days in 2011. In this case, the distribution has shifted for the next year (2012), so some of what we learned in 2011 no longer applied. Think about how learning to fit one distribution better may have come at the expense of robustness to distribution shifts.
Exercise 10 (1pt): Refer to our exploratory plot from the very first exercise. What difference(s) do you notice about the distribution of daily rides in 2011 vs 2012?
```{r casual-by-year-boxplot, include=TRUE}
%>% ggplot(aes(x = casual, y = year)) + geom_boxplot()
daily_rides ```
The code below repeats the same code we used to compare the models on the 2011 data, but this time compares 2011 and 2012.
<- daily_rides %>%
eval_dataset mutate(split = year)
<- bind_rows(
all_predictions linreg_model = augment(linreg_model, new_data = eval_dataset),
dtree_model = augment(dtree_model, new_data = eval_dataset),
rf_model = augment(rf_model, new_data = eval_dataset),
boost_model = augment(boost_model, eval_dataset),
.id = "model"
%>% mutate(model = as_factor(model)) )
%>%
all_predictions group_by(model, split) %>%
mae(truth = casual, estimate = .pred) %>%
mutate(mae = .estimate) %>%
ggplot(aes(x = model, y = mae, fill = split)) +
geom_col(position = "dodge")
Exercise 11 (1pt): Suppose you had actually deployed one of these models (which? why?) at the beginning of 2012, trained on 2011 data, for some business purpose. As the year progresses, your boss notices that the model has more error than you had initially said it would, and asks you why. Explain why, without using the word “distribution”.
The following plot may also be insightful:
```{r casual-by-date-trend, message=FALSE}
%>% ggplot(aes(x = date, y = casual, color=year)) + geom_point() + geom_smooth(method = 'lm')
daily_rides ```
Submit
Follow the same instructions as for the previous homework.
This assignment is worth 14 points total.