class: left, top, title-slide

.title[
# Predictive Analytics Unit 5: Tree-Based Learning
]

.author[
### Ken Arnold
Calvin University
]

---
class: center, middle

# Overview of tree-based learning

---

## Data

We'll use two datasets today:

.pull-left[
The `census` data from MDSR chapter 11, where we want to predict `income`. Here's an example row:

.col2.small[
- **age**: 39
- **workclass**: State-gov
- **fnlwgt**: 77516
- **education**: Bachelors
- **education_1**: 13
- **marital_status**: Never-married
- **occupation**: Adm-clerical
- **relationship**: Not-in-family
- **race**: White
- **sex**: Male
- **capital_gain**: 2174
- **capital_loss**: 0
- **hours_per_week**: 40
- **native_country**: United-States
- **income**: <=50K
]
]

.pull-right[
The flights in `nycflights13` that go to California airports (we'll call this dataset `California`), where we want to predict `arr_delay`. Example:

.col2.small[
- **year**: 2013
- **month**: 1
- **day**: 1
- **dep_time**: 558
- **sched_dep_time**: 600
- **dep_delay**: -2
- **arr_time**: 924
- **sched_arr_time**: 917
- **arr_delay**: 7
- **carrier**: UA
- **flight**: 194
- **tailnum**: N29129
- **origin**: JFK
- **dest**: LAX
- **air_time**: 345
- **distance**: 2475
- **hour**: 6
- **minute**: 0
- **time_hour**: 2013-01-01 06:00:00
- **dow**: Tue
]
]

---

## Data Splitting

As usual, we'll *hold out* some data that we won't train on, so that we can test how well the model works on data it hasn't seen.

First, `census`:

```r
set.seed(364)
census_split <- census %>% initial_split(prop = 0.8)
census_train <- census_split %>% training()
census_test <- census_split %>% testing()
```

Then, `California`:

```r
California_split <- initial_split(California, prop = 0.8)
California_train <- training(California_split)
California_test <- testing(California_split)
```

(The two chunks show two equivalent syntaxes, piped and plain function calls; use whichever you prefer.)

---

## Fitting a tree

We'll start by fitting a tree on `census`. Note: everything is the same as for logistic regression, except that we use `decision_tree()` instead of `logistic_reg()`.

```r
*model1 <- decision_tree(mode = "classification") %>%
  fit(income ~ age + workclass + education + marital_status +
        occupation + relationship + race + sex + capital_gain +
        capital_loss + hours_per_week,
      data = census_train)
model1
```

```
parsnip model object

n= 26048

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 26048 6285 <=50K (0.75871468 0.24128532)
   2) relationship=Not-in-family,Other-relative,Own-child,Unmarried 14231 941 <=50K (0.93387675 0.06612325)
     4) capital_gain< 7073.5 13975 696 <=50K (0.95019678 0.04980322) *
     5) capital_gain>=7073.5 256 11 >50K (0.04296875 0.95703125) *
   3) relationship=Husband,Wife 11817 5344 <=50K (0.54777016 0.45222984)
     6) education=10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,HS-grad,Preschool,Some-college 8294 2774 <=50K (0.66554136 0.33445864)
      12) capital_gain< 5095.5 7875 2364 <=50K (0.69980952 0.30019048) *
      13) capital_gain>=5095.5 419 9 >50K (0.02147971 0.97852029) *
     7) education=Bachelors,Doctorate,Masters,Prof-school 3523 953 >50K (0.27050809 0.72949191) *
```

---

## How a trained decision tree makes a decision

<img src="slides05trees_files/figure-html/show-census-tree-1.png" width="70%" style="display: block; margin: auto;" />

Find which bucket the item goes in. Predict that it's like other items in that bucket. For example, the census row from earlier has `relationship` Not-in-family, so it follows that branch from the root; its `capital_gain` of 2174 is below 7073.5, so it lands in the large leaf where about 95% of items earn <=50K, and that's the prediction.
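
---

## Aside: drawing the tree

A diagram like the one on the previous slide can be drawn directly from the fitted model. Here's a minimal sketch, assuming the `rpart.plot` package is installed; the exact styling of the slide's figure may differ:

```r
library(rpart.plot)

# parsnip stores the underlying rpart model in model1$fit.
# roundint = FALSE avoids warnings about integer-valued predictors.
rpart.plot(model1$fit, roundint = FALSE)
```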
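
---

## Trying the tree on held-out data

This is why we held out `census_test`: we can ask the tree to classify rows it never saw during training and check how often it's right. A minimal sketch, assuming the tidymodels packages (including `yardstick`) are loaded:

```r
# predict() on a parsnip classification model returns a .pred_class column;
# attach it to the test set and compare it against the true income.
census_test %>%
  bind_cols(predict(model1, new_data = census_test)) %>%
  accuracy(truth = income, estimate = .pred_class)
```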
---

## Decision trees for regression, too

```r
model2 <- decision_tree(mode = "regression") %>%
  fit(arr_delay ~ origin + dest + hour + carrier + month + dow,
      data = California_train)
model2
```

```
parsnip model object

n= 23868

node), split, n, deviance, yval
      * denotes terminal node

1) root 23868 46468620  1.570974
  2) hour=6,7,8,9,10,11,12,13,14 12917 14113700 -5.761555 *
  3) hour=5,15,16,17,18,19,20,21,22 10951 30841240 10.219890
    6) month=1,2,3,4,5,8,9,10,11,12 8876 17235060  3.770730 *
    7) month=6,7 2075 11657870 37.806750 *
```

---

## Fitted decision tree for regression

<img src="slides05trees_files/figure-html/show-flights-tree-1.png" width="70%" style="display: block; margin: auto;" />

Predict the mean of the bucket.

---

## What decisions can be made at each node

- Continuous variables: if the value is above a threshold, go right; otherwise go left
- Categorical variables: if the value is in a given set of categories, go right; otherwise go left
- At *leaf* nodes:
  - regression tree: compute the *mean* value of the items there; predict that
  - classification tree: compute the *proportion* of each category among the items there
    - predict the most common category
    - or: predict that other items will follow the same proportions (sketched on the last slide)

---

## Decision trees make stair-step structures

.pull-left[
<img src="slides05trees_files/figure-html/show-model1-data-1.png" width="100%" style="display: block; margin: auto;" />

**Core assumption**: split the items up into boxes; treat everything in a box as the same.
]

.pull-right[
<img src="slides05trees_files/figure-html/show-model2-data-1.png" width="100%" style="display: block; margin: auto;" />

**Core assumption**: the same here; the regression tree predicts a single mean within each box, which is why the fitted values look like stair steps rather than a smooth ramp.
]
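
---

## Measuring the regression tree's error

The same held-out-data logic applies to the regression tree: predict `arr_delay` for the test flights and measure the typical error. A minimal sketch, assuming `yardstick` is loaded:

```r
# predict() on a parsnip regression model returns a .pred column
California_test %>%
  bind_cols(predict(model2, new_data = California_test)) %>%
  rmse(truth = arr_delay, estimate = .pred)  # root-mean-squared error, in minutes
```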
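
---

## Predicting proportions instead of categories

The "same proportions" option from the node-decision list corresponds to asking the model for probability predictions. A minimal sketch, assuming tidymodels is loaded:

```r
# type = "prob" returns one .pred_ column per class, giving the
# proportion of each income category in the leaf the row lands in.
predict(model1, new_data = census_test, type = "prob")
```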