Predictive Analytics Unit 1: Statistical Foundations

class: left, top, title-slide

.title[
# Predictive Analytics Unit 1: Statistical Foundations
]
.author[
### Ken Arnold<br>Calvin University
]

---

# Introduction

- How is predictive analytics useful?
- Brief reminder of stats

Main point: we must acknowledge our *uncertainty*.

---

## Wisdom

- "Do not be wise in your own eyes; fear the Lord and shun evil." (Proverbs 3:7)
- "The fear of the Lord is the beginning of wisdom" (Proverbs 9:10)
- "God opposes the proud, but shows favor to the humble." (1 Peter 5:5, among others)

---

## Different Kinds of Analytics

- *Past*:
  - **Descriptive Analytics**: what happened?
  - **Diagnostic Analytics**: why did it happen?
- *Future*:
  - **Predictive Analytics**: what might happen in the future?
  - **Prescriptive Analytics**: what should we do next?

We'll focus on predictive analytics here.

---

### Predictive Analytics for Forecasting Trends

- How much demand will we have next month?
- How will prices change as we approach the holidays?
- Are failure rates about to go up?
- etc.

---

### Predictive Analytics for Labeling Things

- Which customers will churn?
- Which of these transactions might be fraudulent?
- Which other products will this customer buy? (cross-selling recommendations)
- What are our customers saying about our products?

---

## Predictive models can help explore data

- What are our main market segments, defined by behavior?

---

# Review of Statistical Fundamentals

Main point: we must acknowledge our *uncertainty*.

---

## We don't have the whole population

- only a *sample*.
- If we were careful, it's a *good* sample
  - not corrupted by *sampling bias*
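
---

### Playground: what sampling bias can do

A minimal sketch, not part of the flights example: the `population` here is simulated (an assumption for illustration), and the "biased" sample pretends we only ever hear about flights that arrived late.

```r
# Hypothetical population: 10,000 simulated arrival delays (minutes).
set.seed(1)
population <- rexp(10000, rate = 1 / 30) - 20

# A simple random sample: every flight is equally likely to be chosen.
random_sample <- sample(population, size = 25)

# A biased sample: drawn only from flights that arrived late.
late_flights <- population[population > 0]
biased_sample <- sample(late_flights, size = 25)

# Compare the three means side by side.
c(population = mean(population),
  random     = mean(random_sample),
  biased     = mean(biased_sample))
```

The biased sample overstates the typical delay no matter how many flights we collect this way.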

---

### Playground: let's *pretend* we have the whole population

```r
library(tidyverse) # dplyr verbs, %>%, and ggplot2 are used throughout
all_SF_flights <- nycflights13::flights %>% # get the flights from NYC in 2013...
  filter(dest == "SFO") %>% # ... that went to San Francisco
  filter(!is.na(arr_delay)) # ... and have an arrival delay recorded
```

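(aside: *never blindly filter out missing data!* Here, a missing arrival delay usually means the flight was cancelled or diverted, which is itself informative.)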

### Get a sample from it

In the real world, this is where we actually collect our data. But here, in the playground...

```r
set.seed(5) # make it so we always get the same "random" sample
sample_size <- 25
sample_SF_flights <- all_SF_flights %>%
  slice_sample(n = sample_size)
```

---

### What does our sample look like?

```r
sample_SF_flights %>%
  head(5)
```

```
# A tibble: 5 × 19
   year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
  <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
1  2013     9    29     1657         1700      -3    2001    2010
2  2013     9     4     2015         2025     -10    2318    2350
3  2013     8     6      916          853      23    1218    1212
4  2013    11     5      730          730       0    1052    1100
5  2013     4    23      727          730      -3    1034    1105
# … with 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
```

---

## We can summarize a sample by a *statistic*

.pull-left[
Some useful statistics: the *mean*, the *spread* (standard deviation, `sd`), and percentiles such as the median (`p50`) and the maximum (`p100`).

```r
sample_SF_flights %>%
  skim(arr_delay)
```

|var       |  n| na|  mean|       sd|  p0| p25| p50| p75| p100|
|:---------|--:|--:|-----:|--------:|---:|---:|---:|---:|----:|
|arr_delay | 25|  0| -0.52| 34.42276| -44| -16| -13|  12|  120|
]

.pull-right[
What does this say about typical *arrival delays*? Compare these with the *population* (which we only have because we're in a *playground*):

```r
*all_SF_flights %>%
  skim(arr_delay)
```

|var       |     n| na|     mean|       sd|  p0| p25| p50| p75| p100|
|:---------|-----:|--:|--------:|--------:|---:|---:|---:|---:|----:|
|arr_delay | 13173|  0| 2.672891| 47.67064| -86| -23|  -8|  12| 1007|

Notice: the means differ (-0.52 vs. 2.67 minutes), and the maxima differ enormously (120 vs. 1007 minutes).
]

---

## Different samples, different statistics

Let's repeat this sampling many times.

```r
set.seed(12345)
num_trials <- 1000
sampling_distribution <- mosaic::do(num_trials) * {
  all_SF_flights %>% 
*   slice_sample(n = sample_size) %>%
    summarize(n = n(), # compute the sample statistics for this sample
*             mean_arr_delay = mean(arr_delay), sd_arr_delay = sd(arr_delay))
}
sampling_distribution
```

```
# A tibble: 1,000 × 5
      n mean_arr_delay sd_arr_delay  .row .index
  <int>          <dbl>        <dbl> <int>  <dbl>
1    25          -0.68         46.4     1      1
2    25           2.28         47.7     1      2
3    25          -8.56         27.7     1      3
4    25           4.52         34.0     1      4
5    25           0.64         37.6     1      5
6    25          12.7          42.7     1      6
# … with 994 more rows
```

---

## Different samples, different statistics

.pull-left[
Estimated **sampling distribution** of the mean delay:

<img src="slides01foundations_files/figure-html/sampling-distribution-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- Vertical line shows the *population* statistic: the mean arrival delay for the whole population.
- For *most* samples: mean delay is *near* the true mean delay
- ... but some samples are quite different (as high as 40 minutes, as low as -20).
]
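
---

### Playground: how wide is the sampling distribution?

A base-R sketch of the same idea on simulated data (the `population` below is a stand-in, not the flights data). It compares the spread of many sample means with the usual rule of thumb, population sd divided by the square root of the sample size.

```r
set.seed(2)
# Hypothetical population of arrival delays (minutes).
population <- rexp(10000, rate = 1 / 30) - 20
sample_size <- 25

# Draw 1000 samples and record the mean of each one.
sample_means <- replicate(1000, mean(sample(population, sample_size)))

# Spread of the simulated sampling distribution...
observed_spread <- sd(sample_means)
# ...versus the rule of thumb: population sd / sqrt(n).
expected_spread <- sd(population) / sqrt(sample_size)

c(observed = observed_spread, expected = expected_spread)
```

The two numbers should land close together, which is why the standard error is a useful yardstick for how far a sample mean can stray.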

---

## But we don't actually know the sampling distribution!

- Outside the playground, we only get *one* sample!
- So we can't compute or plot `sampling_distribution`
- How can we get any idea what the real average delay is?
- Idea: report a **range**, called a *confidence interval*, where we expect the true value to be.

---

## Bad confidence intervals

What if we guessed that the true value was within 1 minute of the mean of our sample?

.pull-left[

```r
# plot_conf_intervals() is a custom helper that is not shown on these slides
sampling_distribution %>% head(100) %>% 
  plot_conf_intervals(mean_arr_delay, width = 1)
```

<img src="slides01foundations_files/figure-html/bad-ci-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- **Oops!** For most samples, the confidence interval doesn't actually contain the true value!
- *Remember, we computed this on a sample*, and *some samples are unusual*.
]

---

## Better confidence intervals

.pull-left[
- The sample itself has a spread of values (its standard deviation). Let's use that spread to make a confidence interval: the sample mean plus or minus 1.96 standard errors.

```r
sampling_distribution %>% 
  mutate(standard_err = sd_arr_delay / sqrt(n)) %>% 
  head(100) %>% 
  plot_conf_intervals(mean_arr_delay, width = 1.96 * standard_err) +
   coord_cartesian(xlim = c(-30, 90))
```

<img src="slides01foundations_files/figure-html/good-ci-1.png" width="100%" style="display: block; margin: auto;" />
]
.pull-right[
- If you look carefully, a few of the intervals don't actually contain the true value. But maybe we'll be ok with that.
- How well this works depends on assumptions about the sampling distribution: the 1.96 multiplier assumes it is approximately normal.
]
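
---

### Playground: how often do the intervals contain the truth?

A hedged sketch on simulated data (the `population` and the helper `covers()` are made up for illustration). It counts how often each kind of interval contains the true mean: a fixed half-width of 1 minute versus 1.96 standard errors.

```r
set.seed(3)
# Hypothetical population of arrival delays (minutes).
population <- rexp(10000, rate = 1 / 30) - 20
true_mean <- mean(population)
sample_size <- 25

# Fraction of 2000 samples whose interval (mean +/- half width)
# contains the true population mean.
covers <- function(half_width_fn) {
  mean(replicate(2000, {
    x <- sample(population, sample_size)
    abs(mean(x) - true_mean) <= half_width_fn(x)
  }))
}

coverage_fixed <- covers(function(x) 1)
coverage_se <- covers(function(x) 1.96 * sd(x) / sqrt(length(x)))

c(fixed_1_minute = coverage_fixed, standard_error = coverage_se)
```

With skewed data and a small sample, the standard-error intervals may not hit exactly 95%, but they should do far better than the fixed-width guess.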

---

## Larger sample sizes give narrower confidence intervals

.pull-left[

```r
set.seed(12345)
num_trials <- 1000
sampling_distribution_16x <- mosaic::do(num_trials) * { 
  all_SF_flights %>% 
    slice_sample(n = 16 * sample_size) %>%
    summarize(n = n(),
              mean_arr_delay = mean(arr_delay),
              sd_arr_delay = sd(arr_delay),
              q98 = quantile(arr_delay, probs = 0.98))
}
```
]

.pull-right[

```r
sampling_distribution_16x %>% 
  mutate(standard_err = sd_arr_delay / sqrt(n)) %>% 
  head(100) %>% 
  plot_conf_intervals(mean_arr_delay, width = 1.96 * standard_err) +
   coord_cartesian(xlim = c(-30, 90))
```

<img src="slides01foundations_files/figure-html/good-ci-16x-1.png" width="90%" style="display: block; margin: auto;" />
]
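
---

### Playground: why 16 times the data?

Because the standard error divides by the square root of the sample size, 16 times as much data should make the interval roughly 4 times narrower. A small sketch on simulated data (the `population` and the helper `ci_half_width()` are assumptions for illustration):

```r
set.seed(4)
# Hypothetical population of arrival delays (minutes).
population <- rexp(10000, rate = 1 / 30) - 20

# Half width of a 1.96-standard-error interval from one sample of size n.
ci_half_width <- function(n) {
  x <- sample(population, n)
  1.96 * sd(x) / sqrt(n)
}

widths_small <- replicate(500, ci_half_width(25))
widths_large <- replicate(500, ci_half_width(16 * 25))

# Ratio of typical interval widths; roughly sqrt(16) = 4.
width_ratio <- mean(widths_small) / mean(widths_large)
width_ratio
```

Halving the width of an interval therefore costs about four times as much data.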

---

### Definitions

.col2[
What we *actually know*:

- *sample size*: how big is the sample?
- *sample statistic*: a summary we compute based on that sample (mean, 98% quantile, etc.)

What we *would like to know* (but can't outside of our playground):

- *population statistic*: the value of the sample statistic, if we could actually compute it on the whole population
- *sampling distribution*: the distribution of the sample statistic's values across *all* possible samples

But since we only have a sample, we can try to estimate:

- *confidence interval*: range of plausible values of the population statistic, given a sample
  - Ideally, the true value of the population statistic lies within this interval.
  - But some samples may be really strange (you randomly picked only the most-delayed flights!?)
  - So we'll settle for a method whose intervals include the population statistic for, say, 95% of all samples.

]

---

## How can we estimate variability from a *single* sample?

- We could estimate our uncertainty by taking lots of samples from the population
- Outside of the playground, we can't practically do that!
- But we can pretend that we are back in our playground!
- *Bootstrap resampling*: pretend that our *sample* is actually the *population*.

Remember we'd taken a sample earlier:

```r
sample_SF_flights %>% head()
```

```
# A tibble: 6 × 19
   year month   day dep_time sched_dep_…¹ dep_d…² arr_t…³ sched…⁴
  <int> <int> <int>    <int>        <int>   <dbl>   <int>   <int>
1  2013     9    29     1657         1700      -3    2001    2010
2  2013     9     4     2015         2025     -10    2318    2350
3  2013     8     6      916          853      23    1218    1212
4  2013    11     5      730          730       0    1052    1100
5  2013     4    23      727          730      -3    1034    1105
6  2013    10    25      559          600      -1     903     923
# … with 11 more variables: arr_delay <dbl>, carrier <chr>,
#   flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>,
#   time_hour <dttm>, and abbreviated variable names
#   ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time
```

---

```r
set.seed(12345)
num_trials <- 1000
sampling_distribution_boot <- mosaic::do(num_trials) * { 
* sample_SF_flights %>%
    slice_sample(prop = 1.0, replace = TRUE) %>%
    summarize(n = n(),
              mean_arr_delay = mean(arr_delay),
              sd_arr_delay = sd(arr_delay),
              q98 = quantile(arr_delay, probs = 0.98))
}
```

---

.pull-left[
Here's what the **bootstrap** sampling distribution of the mean delay might look like:

```r
pop_mean_arr_delay <- mean(all_SF_flights$arr_delay)
ggplot(sampling_distribution_boot) + 
  stat_density(aes(x=mean_arr_delay), geom = "area", fill = "gray")+
  geom_vline(color = "red", xintercept = pop_mean_arr_delay) +
  labs(x = "Mean arrival delay (min)")
```

<img src="slides01foundations_files/figure-html/sampling-distribution-boot-1.png" width="100%" style="display: block; margin: auto;" />
]

.pull-right[
- Similar in shape and spread to the sampling distribution plot we made on the playground (though it is centered on our *sample* mean, not the population mean)
- But now we're not on the playground anymore!
]
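
---

### From bootstrap distribution to confidence interval

One common way to turn bootstrap resamples into an interval is the *percentile* method: take the middle 95% of the resampled means. This is a base-R sketch on a simulated sample (`one_sample` is a stand-in, not the flights sample), offered as one option rather than the only approach.

```r
set.seed(5)
# Hypothetical single sample of 25 arrival delays (minutes).
one_sample <- rexp(25, rate = 1 / 30) - 20

# Resample the sample itself, with replacement, 1000 times,
# recording the mean of each resample.
boot_means <- replicate(1000, mean(sample(one_sample, replace = TRUE)))

# Percentile interval: the 2.5th and 97.5th percentiles of those means.
boot_ci <- quantile(boot_means, probs = c(0.025, 0.975))

boot_ci
mean(one_sample)
```

Everything here uses only the one sample we have, so it works outside the playground too.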

---

## Takeaway

- We might wish we had all possible data...
- Main point: we must acknowledge our *uncertainty*.
  - Our measurements are partial.
  - Our inferences sometimes fail (and we may not know it!)
- But God made a world with *structure* that we can learn about even with imperfect tools.