class: left, top, title-slide

.title[
# Predictive Analytics Unit 1: Statistical Foundations
]
.author[
### Ken Arnold
Calvin University
]

---

# Introduction

- How is predictive analytics useful?
- Brief reminder of stats

Main point: we must acknowledge our *uncertainty*.

---

## Wisdom

- "Do not be wise in your own eyes; fear the Lord and shun evil." (Proverbs 3:7)
- "The fear of the Lord is the beginning of wisdom" (Proverbs 9:10)
- "God opposes the proud, but shows favor to the humble." (1 Peter 5:5, among others)

---

## Different Kinds of Analytics

- *Past*:
  - **Descriptive Analytics**: what happened?
  - **Diagnostic Analytics**: why did it happen?
- *Future*:
  - **Predictive Analytics**: what might happen in the future?
  - **Prescriptive Analytics**: what should we do next?

We'll focus on predictive analytics here.

---

### Predictive Analytics for Forecasting Trends

- How much demand will we have next month?
- How will prices change as we approach the holidays?
- Are failure rates about to go up?
- etc.

---

### Predictive Analytics for Labeling Things

- Which customers will churn?
- Which of these transactions might be fraudulent?
- Which other products will this customer buy? (cross-selling recommendations)
- What are our customers saying about our products?

---

## Predictive models can help explore data

- What are our main market segments, defined by behavior?

---

# Review of Statistical Fundamentals

Main point: we must acknowledge our *uncertainty*.

---

## We don't have the whole population

- only a *sample*.
- If we were careful, it's a *good* sample - not corrupted by *sampling bias*

---

### Playground: let's *pretend* we have the whole population

```r
all_SF_flights <- nycflights13::flights %>% # get the flights from NYC in 2013...
  filter(dest == "SFO") %>%                  # ... that went to San Francisco
  filter(!is.na(arr_delay))                  # ... and have an arrival delay recorded
```

(aside: *never blindly filter out missing data!*)

### Get a sample from it

In the real world, this is where we actually collect our data. But here, in the playground...
```r
set.seed(5) # make it so we always get the same "random" sample
sample_size <- 25
sample_SF_flights <- all_SF_flights %>%
  slice_sample(n = sample_size)
```

---

### What does our sample look like?

```r
sample_SF_flights %>% head(5)
```

```
# A tibble: 5 × 19
   year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
  <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
1  2013     9    29     1657         1700      -3    2001    2010
2  2013     9     4     2015         2025     -10    2318    2350
3  2013     8     6      916          853      23    1218    1212
4  2013    11     5      730          730       0    1052    1100
5  2013     4    23      727          730      -3    1034    1105
# … with 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
```

---

## We can summarize a sample by a *statistic*

.pull-left[

Some useful statistics: the *mean*, *spread* (sd), or percentiles: median (`p50`), maximum (`p100`), etc.

```r
sample_SF_flights %>% skim(arr_delay)
```

|var       |  n| na|  mean|       sd|  p0| p25| p50| p75| p100|
|:---------|--:|--:|-----:|--------:|---:|---:|---:|---:|----:|
|arr_delay | 25|  0| -0.52| 34.42276| -44| -16| -13|  12|  120|

]

.pull-right[

What does this say about typical *arrival delays*?

Compare these with the *population* (which we only have because we're in a *playground*):

```r
*all_SF_flights %>% skim(arr_delay)
```

|var       |     n| na|     mean|       sd|  p0| p25| p50| p75| p100|
|:---------|-----:|--:|--------:|--------:|---:|---:|---:|---:|----:|
|arr_delay | 13173|  0| 2.672891| 47.67064| -86| -23|  -8|  12| 1007|

Notice: different *mean*, very different *maximum*.
]

---

## Different samples, different statistics

Let's repeat this sampling many times.
```r
set.seed(12345)
num_trials <- 1000
sampling_distribution <- mosaic::do(num_trials) * {
  all_SF_flights %>%
*   slice_sample(n = sample_size) %>%
    summarize(n = n(), # compute the sample statistics for this sample
*             mean_arr_delay = mean(arr_delay),
              sd_arr_delay = sd(arr_delay))
}
sampling_distribution
```

```
# A tibble: 1,000 × 5
      n mean_arr_delay sd_arr_delay  .row .index
  <int>          <dbl>        <dbl> <int>  <dbl>
1    25          -0.68         46.4     1      1
2    25           2.28         47.7     1      2
3    25          -8.56         27.7     1      3
4    25           4.52         34.0     1      4
5    25           0.64         37.6     1      5
6    25          12.7          42.7     1      6
# … with 994 more rows
```

---

## Different samples, different statistics

.pull-left[

Estimated **sampling distribution** of the mean delay:

<img src="slides01foundations_files/figure-html/sampling-distribution-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- Vertical line shows the *population* statistic: the mean arrival delay for the whole population.
- For *most* samples: mean delay is *near* the true mean delay
- ... but some samples are quite different. (as big as 40, as small as -20)
]

---

## But we don't actually know the sampling distribution!

- Outside the playground, we only get *one* sample!
- So we can't compute or plot `sampling_distribution`
- How can we get any idea what the real average delay is?
- Idea: report a **range**, called a *confidence interval*, where the true value should be.

---

## Bad confidence intervals

What if we guessed that the true value was within 1 minute of the mean of our sample?

.pull-left[

```r
sampling_distribution %>%
  head(100) %>%
  plot_conf_intervals(mean_arr_delay, width = 1)
```

<img src="slides01foundations_files/figure-html/bad-ci-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Oops!** For most samples, the confidence interval doesn't actually contain the true value!
- *Remember, we computed this on a sample*, and *some samples are unusual*.
]

---

## Better confidence intervals

.pull-left[
- The sample itself has a range of values.
  Let's use that range to make a confidence interval.

```r
sampling_distribution %>%
  mutate(standard_err = sd_arr_delay / sqrt(n)) %>%
  head(100) %>%
  plot_conf_intervals(mean_arr_delay, width = 1.96 * standard_err) +
  coord_cartesian(xlim = c(-30, 90))
```

<img src="slides01foundations_files/figure-html/good-ci-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- If you look carefully, a few of the intervals don't actually contain the true value. But maybe we'll be ok with that.
- Depends on assumptions about sampling distribution.
]

---

## Larger sample sizes give narrower confidence intervals

.pull-left[

```r
set.seed(12345)
num_trials <- 1000
sampling_distribution_16x <- mosaic::do(num_trials) * {
  all_SF_flights %>%
    slice_sample(n = 16 * sample_size) %>%
    summarize(n = n(),
              mean_arr_delay = mean(arr_delay),
              sd_arr_delay = sd(arr_delay),
              q98 = quantile(arr_delay, probs = 0.98))
}
```

]

.pull-right[

```r
sampling_distribution_16x %>%
  mutate(standard_err = sd_arr_delay / sqrt(n)) %>%
  head(100) %>%
  plot_conf_intervals(mean_arr_delay, width = 1.96 * standard_err) +
  coord_cartesian(xlim = c(-30, 90))
```

<img src="slides01foundations_files/figure-html/good-ci-16x-1.png" width="90%" style="display: block; margin: auto;" />
]

---

### Definitions

.col2[

What we *actually know*:

- *sample size*: how big is the sample?
- *sample statistic*: a summary we compute based on that sample (mean, 98% quantile, etc.)

What we *would like to know* (but can't outside of our playground):

- *population statistic*: the value of the sample statistic, if we could actually compute it on the whole population
- *sampling distribution*: the value of the sample statistic in *all* possible samples

But since we only have a sample, we can try to estimate:

- *confidence interval*: range of plausible values of the population statistic, given a sample
  - The true value of the population statistic better be within this interval.
  - But some samples may be really strange (you randomly picked only the most-delayed flights!?)
  - So we'll be okay with one that only includes the population statistic for, say, 95% of all samples.

<!-- - *standard error*: standard deviation of the sampling distribution (*not* of a sample) -->
]

---

## How can we estimate variability from a *single* sample?

- We could estimate our uncertainty by taking lots of samples from the population
- Outside of the playground, we can't practically do that!
- But we can pretend that we are back in our playground!
- *Bootstrap resampling*: pretend that our *sample* is actually the *population*.

Remember we'd taken a sample earlier:

```r
sample_SF_flights %>% head()
```

```
# A tibble: 6 × 19
   year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
  <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
1  2013     9    29     1657         1700      -3    2001    2010
2  2013     9     4     2015         2025     -10    2318    2350
3  2013     8     6      916          853      23    1218    1212
4  2013    11     5      730          730       0    1052    1100
5  2013     4    23      727          730      -3    1034    1105
6  2013    10    25      559          600      -1     903     923
# … with 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
```

---

```r
set.seed(12345)
num_trials <- 1000
sampling_distribution_boot <- mosaic::do(num_trials) * {
* sample_SF_flights %>%
    slice_sample(prop = 1.0, replace = TRUE) %>%
    summarize(n = n(),
              mean_arr_delay = mean(arr_delay),
              sd_arr_delay = sd(arr_delay),
              q98 = quantile(arr_delay, probs = 0.98))
}
```

---

.pull-left[

Here's what the **bootstrap** sampling distribution of the <!-- 98th percentile --> mean delay might look like:

```r
pop_mean_arr_delay <- mean(all_SF_flights$arr_delay)
ggplot(sampling_distribution_boot) +
  stat_density(aes(x = mean_arr_delay), geom = "area", fill = "gray") +
  geom_vline(color = "red", xintercept = pop_mean_arr_delay) +
  labs(x = "Mean
arrival delay (min)")
```

<img src="slides01foundations_files/figure-html/sampling-distribution-boot-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- Very similar to the sampling distribution plot we made on the playground
- But now we're not on the playground anymore!
]

---

## Takeaway

- We might wish we had all possible data...
- Main point: we must acknowledge our *uncertainty*.
- Our measurements are partial.
- Our inferences sometimes fail (and we may not know it!)
- But God made a world with *structure* that we can learn about even with imperfect tools.
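
---

### Sketch: an interval from the bootstrap distribution

The bootstrap slides build a distribution of resampled means but stop short of reading an interval off it. Below is a minimal sketch of one common way to do that (the percentile method), which is an assumption here and not necessarily the approach these slides intend. It uses base R only. `arr_delay_sample` is simulated, hypothetical data standing in for the `arr_delay` column of `sample_SF_flights`, so the numbers will not match the flights data.

```r
# Hypothetical stand-in for a sample of 25 arrival delays (minutes).
# Simulated so this block runs without the flights data.
set.seed(5)
arr_delay_sample <- round(rexp(25, rate = 1 / 40) - 30)

# Bootstrap: resample the sample with replacement, same size each time,
# and recompute the mean for every resample.
num_trials <- 1000
boot_means <- replicate(num_trials, {
  resample <- sample(arr_delay_sample,
                     size = length(arr_delay_sample),
                     replace = TRUE)
  mean(resample)
})

# Percentile interval: the middle 95% of the bootstrap means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

# For comparison: mean +/- 1.96 standard errors,
# the same width used on the "Better confidence intervals" slide.
standard_err <- sd(arr_delay_sample) / sqrt(length(arr_delay_sample))
se_ci <- mean(arr_delay_sample) + c(-1, 1) * 1.96 * standard_err

boot_ci
se_ci
```

With a sample this small and skewed, the two intervals will usually be similar but not identical. Either one can still miss the true population mean.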