Predictive Analytics Unit 4: Logistic Classifiers

class: left, top, title-slide

.title[
# Predictive Analytics Unit 4: Logistic Classifiers
]
.author[
### Ken Arnold<br>Calvin University
]

---

## Two Types of Supervised Learning

- **Regression**: fill in a *number*
  - How many customers?
  - What price?
- **Classification**: fill in a *choice*
  - How likely is this transaction to be fraudulent?
  - What is the *risk of default* for this credit application?

---

## Classification example: Blood test for autism?

.col2[
We'll use an example from a [2017 PLOS Computational Biology paper](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005385)

```r
skimr::skim(autism)
```

Table: Data summary

|                         |       |
|:------------------------|:------|
|Name                     |autism |
|Number of rows           |206    |
|Number of columns        |26     |
|_______________________  |       |
|Column type frequency:   |       |
|character                |1      |
|numeric                  |25     |
|________________________ |       |
|Group variables          |None   |

**Variable type: character**

|skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace|
|:-------------|---------:|-------------:|---:|---:|-----:|--------:|----------:|
|Group         |         0|             1|   3|   3|     0|        3|          0|

**Variable type: numeric**

|skim_variable      | n_missing| complete_rate|   mean|    sd|     p0|    p25|    p50|    p75|   p100|hist  |
|:------------------|---------:|-------------:|------:|-----:|------:|------:|------:|------:|------:|:-----|
|Methion.           |         0|          1.00|  21.38|  3.89|  13.43|  18.94|  21.08|  23.61|  33.14|▃▇▆▂▁ |
|SAM                |         0|          1.00|  65.65| 12.33|  37.22|  57.83|  63.92|  71.13| 121.16|▂▇▂▁▁ |
|SAH                |         0|          1.00|  16.68|  4.35|   7.39|  14.14|  16.61|  19.27|  28.12|▃▆▇▃▁ |
|SAM/SAH            |         0|          1.00|   4.26|  1.61|   1.66|   3.24|   3.90|   4.79|  13.10|▇▆▁▁▁ |
|% DNA methylation  |         0|          1.00|   3.85|  0.94|   1.78|   3.14|   3.88|   4.52|   7.73|▃▇▆▁▁ |
|8-OHG              |         0|          1.00|   0.07|  0.03|   0.03|   0.05|   0.07|   0.08|   0.18|▅▇▂▁▁ |
|Adenosine          |         0|          1.00|   0.13|  0.06|   0.04|   0.08|   0.12|   0.17|   0.39|▇▆▃▁▁ |
|Homocysteine       |         0|          1.00|   4.87|  1.14|   3.06|   4.11|   4.67|   5.57|   9.70|▇▇▃▁▁ |
|Cysteine           |         0|          1.00| 201.50| 19.83| 141.31| 188.50| 201.12| 214.61| 261.44|▁▅▇▅▁ |
|Glu.-Cys.          |         0|          1.00|   2.10|  0.55|   1.12|   1.67|   1.95|   2.50|   3.83|▅▇▅▂▁ |
|Cys.-Gly.          |         0|          1.00|  43.01|  6.80|  27.12|  38.83|  42.66|  46.97|  63.75|▂▇▇▃▁ |
|tGSH               |         0|          1.00|   6.24|  1.40|   3.25|   5.24|   6.18|   7.01|  11.18|▃▇▇▁▁ |
|fGSH               |         0|          1.00|   2.02|  0.59|   1.08|   1.68|   1.90|   2.23|   5.44|▇▅▁▁▁ |
|GSSG               |         0|          1.00|   0.18|  0.09|   0.05|   0.12|   0.17|   0.22|   0.65|▇▆▁▁▁ |
|fGSH/GSSG          |         0|          1.00|  13.77|  8.03|   2.15|   8.24|  11.55|  17.78|  57.00|▇▅▁▁▁ |
|tGSH/GSSG          |         0|          1.00|  42.50| 22.76|   8.55|  25.60|  36.30|  55.38| 130.00|▇▆▃▁▁ |
|Chlorotyrosine     |         0|          1.00|  44.06| 25.07|   6.39|  26.53|  38.39|  58.10| 143.28|▇▇▃▁▁ |
|Nitrotyrosine      |         0|          1.00|  88.07| 59.23|  21.56|  47.69|  71.44| 107.00| 467.30|▇▂▁▁▁ |
|Tyrosine           |         0|          1.00|  50.87| 13.78|  22.93|  40.71|  48.97|  60.15|  92.93|▂▇▅▂▁ |
|Tryptophane        |         0|          1.00|  38.16|  8.88|  20.05|  30.92|  37.61|  44.60|  60.12|▃▇▇▆▁ |
|fCystine           |         0|          1.00|  26.21|  8.38|  10.32|  19.80|  25.41|  30.80|  54.11|▅▇▆▂▁ |
|fCysteine          |         0|          1.00|  21.73|  4.72|  10.85|  18.39|  21.04|  24.23|  36.54|▁▇▆▂▁ |
|fCystine/fCysteine |         0|          1.00|   1.25|  0.50|   0.57|   0.91|   1.12|   1.44|   4.06|▇▃▁▁▁ |
|% oxidized         |         0|          1.00|   0.16|  0.07|   0.03|   0.10|   0.15|   0.20|   0.48|▇▇▂▁▁ |
|Vineland ABC       |       159|          0.23|  70.77| 13.91|  46.00|  61.50|  69.00|  76.00| 106.00|▅▇▅▃▂ |
]

---

We have 3 kinds of data about 206 children:

1. The outcome (`Group`): ASD (diagnosed with ASD), SIB (sibling not diagnosed with ASD), and NEU (age-matched neurotypical children, for control)

```r
autism %>% group_by(Group) %>% summarize(n = n()) %>% knitr::kable()
```

|Group |  n|
|:-----|--:|
|ASD   | 83|
|NEU   | 76|
|SIB   | 47|

---

1. The outcome (`Group`): ASD, SIB, NEU
2. Concentrations of various metabolites in a blood sample:

```r
autism %>% select(-1, -last_col()) %>% names()
```

```
 [1] "Methion."           "SAM"               
 [3] "SAH"                "SAM/SAH"           
 [5] "% DNA methylation"  "8-OHG"             
 [7] "Adenosine"          "Homocysteine"      
 [9] "Cysteine"           "Glu.-Cys."         
[11] "Cys.-Gly."          "tGSH"              
[13] "fGSH"               "GSSG"              
[15] "fGSH/GSSG"          "tGSH/GSSG"         
[17] "Chlorotyrosine"     "Nitrotyrosine"     
[19] "Tyrosine"           "Tryptophane"       
[21] "fCystine"           "fCysteine"         
[23] "fCystine/fCysteine" "% oxidized"        
```

---

1. The outcome (`Group`): ASD, SIB, NEU
2. Concentrations of various metabolites in a blood sample
3. For the ASD children only, a measure of life skills ("Vineland ABC")

```r
autism %>% 
  ggplot(aes(x = `Vineland ABC`, y = Group)) + geom_boxplot()
```

```
Warning: Removed 159 rows containing non-finite values
(stat_boxplot).
```

---

## Exploratory Data Analysis (EDA)

What do these metabolites look like?

---

code for the previous plot:

```r
library(ggridges)
autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Measure)) +
  geom_density_ridges() + 
  facet_wrap(vars(Group), scales = "free_x")
```

---

## EDA

Better question: **Can these metabolites help us distinguish autism?**

---

---

code for previous plot:

```r
autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Group)) +
  geom_density_ridges() +
  facet_wrap(vars(Measure), scales = "free_x") + 
  theme_ridges()
```

---

## Can we predict ASD vs non-ASD from metabolites?

* Let's start by (1) ignoring the behavior scores (that's an *outcome*) and comparing
just ASD and NEU.
* We need to drop SIB and encode `Group` as a factor.

```r
data <- autism %>% 
  select(-`Vineland ABC`) %>% 
  filter(Group != "SIB") %>% 
  mutate(Group = as_factor(Group))
```

---

## Linear Model for *Classification*

```r
*model <- logistic_reg() %>%
  fit(Group ~ `tGSH/GSSG` + `fGSH` + `Methion.`, data = data)
```

```r
model %>% tidy() %>% select(term, estimate) %>% kable()
```

|term        |    estimate|
|:-----------|-----------:|
|(Intercept) | -15.3438894|
|`tGSH/GSSG` |   0.1364630|
|fGSH        |   2.1681857|
|Methion.    |   0.2919989|

$$
\log\left[ \frac { \widehat{P( \operatorname{..y} = \operatorname{NEU} )} }{ 1 - \widehat{P( \operatorname{..y} = \operatorname{NEU} )} } \right] = -15.34 + 0.14(\operatorname{tGSH/GSSG}) + 2.17(\operatorname{fGSH}) + 0.29(\operatorname{Methion.})
$$

---

### What do the *predictions* look like?

```r
model %>% predict(data, type = "prob")
```

```
# A tibble: 159 × 2
  .pred_ASD .pred_NEU
      <dbl>     <dbl>
1     0.924   0.0764 
2     0.940   0.0605 
3     0.994   0.00603
4     0.996   0.00361
5     0.868   0.132  
6     0.797   0.203  
# … with 153 more rows
```

---

## Intuition: *Risk*

- First, linear model predicts a *risk score* (negative = low risk, positive = high risk).
  - Technically: *log-odds*: log(prob of *yes* / prob of *no*)
- Then, *logistic* function turns risk score into probability

Other examples:

- Risk of a customer churning
- Risk of a transaction being fraudulent
- Chance of one team winning: difference in skill scores (Elo)

---

### Were those predictions *good*?

.small-code[

```r
model %>%
  predict(data, type = "prob") %>%
  bind_cols(data) %>% 
  mutate(idx = row_number()) %>% 
  ggplot(aes(x = idx, y = .pred_ASD, color = Group, shape = Group)) +
  geom_hline(yintercept = .5) +
  geom_point() 
```

<img src="slides04logistic_files/figure-html/plot-predicted-probs-tree-1.png" width="90%" style="display: block; margin: auto;" />
]

---

### Quantifying that

```r
metrics <- yardstick::metric_set(accuracy, sensitivity, specificity, yardstick::f_meas)
model %>% 
  predict(data, type = "class") %>% 
  bind_cols(data) %>% 
  metrics(truth = Group, estimate = .pred_class)
```

```
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.906
2 sensitivity binary         0.940
3 specificity binary         0.868
4 f_meas      binary         0.912
```

---

## Classification Metrics

|                      | Event happened              | No event happened           |
|----------------------|-------------------------------|-------------------------------|
| Event predicted    | True positive                 | False positive (Type 1 error) |
| No event predicted | False negative (Type 2 error) | True negative                 |

- **Accuracy** (% correct) = (TP + TN) / (# episodes)
- **False negative** ("miss") **rate** = FN / (# actual events)
- **False positive** ("false alarm") **rate** = FP / (# true non-events)

- **Sensitivity** ("true positive rate") = TP / (# actual events)
  - Sensitivity = 1 − False negative rate
- **Specificity** ("true negative rate") = TN / (# actual events)
  - Specificity = 1 − False positive rate
- [Wikipedia article](https://en.wikipedia.org/wiki/Sensitivity_and_specificity)

---

## Predicted probabilities tell confidence

- **Comparing models**: It's better to be *confident* when you're right, it's better to be *un-confident* when you're wrong.
- **Setting thresholds**
  - Model is usually trained as if false negative and false positive are equally bad.
  - But if false negative is worse than false positive, we can adjust the confidence threshold (default: 50%)

```r
metrics <- yardstick::metric_set(mn_log_loss, roc_auc)
model %>% predict(data, type = "prob") %>% 
  bind_cols(data) %>% 
  metrics(truth = Group, .pred_ASD)
```

```
# A tibble: 2 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 mn_log_loss binary         0.262
2 roc_auc     binary         0.953
```