class: left, top, title-slide

.title[
# Predictive Analytics Unit 5: Tree-Based Learning
]

.author[
### Ken Arnold
Calvin University
]

---
class: center, middle

# Overview of tree-based learning

---

## Data

We'll use two datasets today:

.pull-left[
The `census` data from MDSR chapter 11, where we want to predict `income`. Here's an example row:

.col2.small[
- **age**: 39
- **workclass**: State-gov
- **fnlwgt**: 77516
- **education**: Bachelors
- **education_1**: 13
- **marital_status**: Never-married
- **occupation**: Adm-clerical
- **relationship**: Not-in-family
- **race**: White
- **sex**: Male
- **capital_gain**: 2174
- **capital_loss**: 0
- **hours_per_week**: 40
- **native_country**: United-States
- **income**: <=50K
]
]

.pull-right[
The flights in `nycflights13` that go to California airports (we'll call this dataset `California`), where we want to predict `arr_delay`. Example:

.col2.small[
- **year**: 2013
- **month**: 1
- **day**: 1
- **dep_time**: 558
- **sched_dep_time**: 600
- **dep_delay**: -2
- **arr_time**: 924
- **sched_arr_time**: 917
- **arr_delay**: 7
- **carrier**: UA
- **flight**: 194
- **tailnum**: N29129
- **origin**: JFK
- **dest**: LAX
- **air_time**: 345
- **distance**: 2475
- **hour**: 6
- **minute**: 0
- **time_hour**: 2013-01-01 06:00:00
- **dow**: Tue
]
]

---

## Data Splitting

As usual, we'll *hold out* some data that we won't train on, so that we can test how well the model works on data it hasn't seen.

First, `census`:

```r
set.seed(364)
census_split <- census %>% initial_split(prop = 0.8)
census_train <- census_split %>% training()
census_test <- census_split %>% testing()
```

Then, `California`:

```r
California_split <- initial_split(California, prop = 0.8)
California_train <- training(California_split)
California_test <- testing(California_split)
```

(The two chunks show two equivalent syntaxes, piped and plain function calls; use whichever you prefer.)

---

## Fitting a tree

We'll start by fitting a tree on `census`. Note: everything is the same as for logistic regression, except that we use `decision_tree()` instead of `logistic_reg()`.

```r
*model1 <- decision_tree(mode = "classification") %>%
  fit(income ~ age + workclass + education + marital_status +
        occupation + relationship + race + sex + capital_gain +
        capital_loss + hours_per_week,
      data = census_train)
model1
```

```
parsnip model object

n= 26048

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 26048 6285 <=50K (0.75871468 0.24128532)
   2) relationship=Not-in-family,Other-relative,Own-child,Unmarried 14231 941 <=50K (0.93387675 0.06612325)
     4) capital_gain< 7073.5 13975 696 <=50K (0.95019678 0.04980322) *
     5) capital_gain>=7073.5 256 11 >50K (0.04296875 0.95703125) *
   3) relationship=Husband,Wife 11817 5344 <=50K (0.54777016 0.45222984)
     6) education=10th,11th,12th,1st-4th,5th-6th,7th-8th,9th,Assoc-acdm,Assoc-voc,HS-grad,Preschool,Some-college 8294 2774 <=50K (0.66554136 0.33445864)
      12) capital_gain< 5095.5 7875 2364 <=50K (0.69980952 0.30019048) *
      13) capital_gain>=5095.5 419 9 >50K (0.02147971 0.97852029) *
     7) education=Bachelors,Doctorate,Masters,Prof-school 3523 953 >50K (0.27050809 0.72949191) *
```

---

## How a trained decision tree makes a decision

<img src="slides05trees_files/figure-html/show-census-tree-1.png" width="70%" style="display: block; margin: auto;" />

Find which bucket the item goes in. Predict that it's like other items in that bucket. For example, the census row from earlier has `relationship` Not-in-family, so it follows that branch from the root; its `capital_gain` of 2174 is below 7073.5, so it lands in the large leaf where about 95% of items earn <=50K, and that's the prediction.
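
---

## Aside: drawing the tree

A diagram like the one on the previous slide can be drawn directly from the fitted model. Here's a minimal sketch, assuming the `rpart.plot` package is installed; the exact styling of the slide's figure may differ:

```r
library(rpart.plot)

# parsnip stores the underlying rpart model in model1$fit.
# roundint = FALSE avoids warnings about integer-valued predictors.
rpart.plot(model1$fit, roundint = FALSE)
```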
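
---

## Trying the tree on held-out data

This is why we held out `census_test`: we can ask the tree to classify rows it never saw during training and check how often it's right. A minimal sketch, assuming the tidymodels packages (including `yardstick`) are loaded:

```r
# predict() on a parsnip classification model returns a .pred_class column;
# attach it to the test set and compare it against the true income.
census_test %>%
  bind_cols(predict(model1, new_data = census_test)) %>%
  accuracy(truth = income, estimate = .pred_class)
```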
---

## Decision trees for regression, too

```r
model2 <- decision_tree(mode = "regression") %>%
  fit(arr_delay ~ origin + dest + hour + carrier + month + dow,
      data = California_train)
model2
```

```
parsnip model object

n= 23868

node), split, n, deviance, yval
      * denotes terminal node

1) root 23868 46468620  1.570974
  2) hour=6,7,8,9,10,11,12,13,14 12917 14113700 -5.761555 *
  3) hour=5,15,16,17,18,19,20,21,22 10951 30841240 10.219890
    6) month=1,2,3,4,5,8,9,10,11,12 8876 17235060  3.770730 *
    7) month=6,7 2075 11657870 37.806750 *
```

---

## Fitted decision tree for regression

<img src="slides05trees_files/figure-html/show-flights-tree-1.png" width="70%" style="display: block; margin: auto;" />

Predict the mean of the bucket.

---

## What decisions can be made at each node

- Continuous variables: if the value is above a threshold, go right; otherwise go left
- Categorical variables: if the value is in a given set of categories, go right; otherwise go left
- At *leaf* nodes:
  - regression tree: compute the *mean* value of the items there; predict that
  - classification tree: compute the *proportion* of each category among the items there
    - predict the most common category
    - or: predict that other items will follow the same proportions (sketched on the last slide)

---

## Decision trees make stair-step structures

.pull-left[
<img src="slides05trees_files/figure-html/show-model1-data-1.png" width="100%" style="display: block; margin: auto;" />

**Core assumption**: split the items up into boxes; treat everything in a box as the same.
]

.pull-right[
<img src="slides05trees_files/figure-html/show-model2-data-1.png" width="100%" style="display: block; margin: auto;" />

**Core assumption**: the same here; the regression tree predicts a single mean within each box, which is why the fitted values look like stair steps rather than a smooth ramp.
]
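
---

## Measuring the regression tree's error

The same held-out-data logic applies to the regression tree: predict `arr_delay` for the test flights and measure the typical error. A minimal sketch, assuming `yardstick` is loaded:

```r
# predict() on a parsnip regression model returns a .pred column
California_test %>%
  bind_cols(predict(model2, new_data = California_test)) %>%
  rmse(truth = arr_delay, estimate = .pred)  # root-mean-squared error, in minutes
```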
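
---

## Predicting proportions instead of categories

The "same proportions" option from the node-decision list corresponds to asking the model for probability predictions. A minimal sketch, assuming tidymodels is loaded:

```r
# type = "prob" returns one .pred_ column per class, giving the
# proportion of each income category in the leaf the row lands in.
predict(model1, new_data = census_test, type = "prob")
```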