class: left, top, title-slide .title[ # Predictive Analytics Unit 4: Logistic Classifiers ] .author[ ### Ken Arnold
Calvin University ] --- ## Two Types of Supervised Learning - **Regression**: fill in a *number* - How many customers? - What price? - **Classification**: fill in a *choice* - How likely is this transaction to be fraudulent? - What is the *risk of default* for this credit application? --- ## Classification example: Blood test for autism? .col2[ We'll use an example from a [2017 PLOS Computational Biology paper](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1005385) ```r skimr::skim(autism) ``` Table: Data summary | | | |:------------------------|:------| |Name |autism | |Number of rows |206 | |Number of columns |26 | |_______________________ | | |Column type frequency: | | |character |1 | |numeric |25 | |________________________ | | |Group variables |None | **Variable type: character** |skim_variable | n_missing| complete_rate| min| max| empty| n_unique| whitespace| |:-------------|---------:|-------------:|---:|---:|-----:|--------:|----------:| |Group | 0| 1| 3| 3| 0| 3| 0| **Variable type: numeric** |skim_variable | n_missing| complete_rate| mean| sd| p0| p25| p50| p75| p100|hist | |:------------------|---------:|-------------:|------:|-----:|------:|------:|------:|------:|------:|:-----| |Methion. | 0| 1.00| 21.38| 3.89| 13.43| 18.94| 21.08| 23.61| 33.14|▃▇▆▂▁ | |SAM | 0| 1.00| 65.65| 12.33| 37.22| 57.83| 63.92| 71.13| 121.16|▂▇▂▁▁ | |SAH | 0| 1.00| 16.68| 4.35| 7.39| 14.14| 16.61| 19.27| 28.12|▃▆▇▃▁ | |SAM/SAH | 0| 1.00| 4.26| 1.61| 1.66| 3.24| 3.90| 4.79| 13.10|▇▆▁▁▁ | |% DNA methylation | 0| 1.00| 3.85| 0.94| 1.78| 3.14| 3.88| 4.52| 7.73|▃▇▆▁▁ | |8-OHG | 0| 1.00| 0.07| 0.03| 0.03| 0.05| 0.07| 0.08| 0.18|▅▇▂▁▁ | |Adenosine | 0| 1.00| 0.13| 0.06| 0.04| 0.08| 0.12| 0.17| 0.39|▇▆▃▁▁ | |Homocysteine | 0| 1.00| 4.87| 1.14| 3.06| 4.11| 4.67| 5.57| 9.70|▇▇▃▁▁ | |Cysteine | 0| 1.00| 201.50| 19.83| 141.31| 188.50| 201.12| 214.61| 261.44|▁▅▇▅▁ | |Glu.-Cys. | 0| 1.00| 2.10| 0.55| 1.12| 1.67| 1.95| 2.50| 3.83|▅▇▅▂▁ | |Cys.-Gly. | 0| 1.00| 43.01| 6.80| 27.12| 38.83| 42.66| 46.97| 63.75|▂▇▇▃▁ | |tGSH | 0| 1.00| 6.24| 1.40| 3.25| 5.24| 6.18| 7.01| 11.18|▃▇▇▁▁ | |fGSH | 0| 1.00| 2.02| 0.59| 1.08| 1.68| 1.90| 2.23| 5.44|▇▅▁▁▁ | |GSSG | 0| 1.00| 0.18| 0.09| 0.05| 0.12| 0.17| 0.22| 0.65|▇▆▁▁▁ | |fGSH/GSSG | 0| 1.00| 13.77| 8.03| 2.15| 8.24| 11.55| 17.78| 57.00|▇▅▁▁▁ | |tGSH/GSSG | 0| 1.00| 42.50| 22.76| 8.55| 25.60| 36.30| 55.38| 130.00|▇▆▃▁▁ | |Chlorotyrosine | 0| 1.00| 44.06| 25.07| 6.39| 26.53| 38.39| 58.10| 143.28|▇▇▃▁▁ | |Nitrotyrosine | 0| 1.00| 88.07| 59.23| 21.56| 47.69| 71.44| 107.00| 467.30|▇▂▁▁▁ | |Tyrosine | 0| 1.00| 50.87| 13.78| 22.93| 40.71| 48.97| 60.15| 92.93|▂▇▅▂▁ | |Tryptophane | 0| 1.00| 38.16| 8.88| 20.05| 30.92| 37.61| 44.60| 60.12|▃▇▇▆▁ | |fCystine | 0| 1.00| 26.21| 8.38| 10.32| 19.80| 25.41| 30.80| 54.11|▅▇▆▂▁ | |fCysteine | 0| 1.00| 21.73| 4.72| 10.85| 18.39| 21.04| 24.23| 36.54|▁▇▆▂▁ | |fCystine/fCysteine | 0| 1.00| 1.25| 0.50| 0.57| 0.91| 1.12| 1.44| 4.06|▇▃▁▁▁ | |% oxidized | 0| 1.00| 0.16| 0.07| 0.03| 0.10| 0.15| 0.20| 0.48|▇▇▂▁▁ | |Vineland ABC | 159| 0.23| 70.77| 13.91| 46.00| 61.50| 69.00| 76.00| 106.00|▅▇▅▃▂ | ] --- We have 3 kinds of data about 206 children: 1. The outcome (`Group`): ASD (diagnosed with ASD), SIB (sibling not diagnosed with ASD), and NEU (age-matched neurotypical children, for control) ```r autism %>% group_by(Group) %>% summarize(n = n()) %>% knitr::kable() ``` |Group | n| |:-----|--:| |ASD | 83| |NEU | 76| |SIB | 47| --- 1. The outcome (`Group`): ASD, SIB, NEU 2. Concentrations of various metabolites in a blood sample: ```r autism %>% select(-1, -last_col()) %>% names() ``` ``` [1] "Methion." "SAM" [3] "SAH" "SAM/SAH" [5] "% DNA methylation" "8-OHG" [7] "Adenosine" "Homocysteine" [9] "Cysteine" "Glu.-Cys." [11] "Cys.-Gly." "tGSH" [13] "fGSH" "GSSG" [15] "fGSH/GSSG" "tGSH/GSSG" [17] "Chlorotyrosine" "Nitrotyrosine" [19] "Tyrosine" "Tryptophane" [21] "fCystine" "fCysteine" [23] "fCystine/fCysteine" "% oxidized" ``` --- 1. The outcome (`Group`): ASD, SIB, NEU 2. Concentrations of various metabolites in a blood sample 3. For the ASD children only, a measure of life skills ("Vineland ABC") ```r autism %>% ggplot(aes(x = `Vineland ABC`, y = Group)) + geom_boxplot() ``` ``` Warning: Removed 159 rows containing non-finite values (stat_boxplot). ``` <img src="slides04logistic_files/figure-html/show-behavior-score-1.png" width="90%" style="display: block; margin: auto;" /> --- ## Exploratory Data Analysis (EDA) What do these metabolites look like? <img src="slides04logistic_files/figure-html/bad-metabolite-plot-1.png" width="80%" style="display: block; margin: auto;" /> --- code for the previous plot: ```r library(ggridges) autism %>% select(-`Vineland ABC`) %>% pivot_longer(-Group, names_to = "Measure") %>% ggplot(aes(x = value, y = Measure)) + geom_density_ridges() + facet_wrap(vars(Group), scales = "free_x") ``` --- ## EDA Better question: **Can these metabolites help us distinguish autism?** --- <img src="slides04logistic_files/figure-html/ridgeplot-1.png" width="80%" style="display: block; margin: auto;" /> --- code for previous plot: ```r autism %>% select(-`Vineland ABC`) %>% pivot_longer(-Group, names_to = "Measure") %>% ggplot(aes(x = value, y = Group)) + geom_density_ridges() + facet_wrap(vars(Measure), scales = "free_x") + theme_ridges() ``` --- ## Can we predict ASD vs non-ASD from metabolites? * Let's start by (1) ignoring the behavior scores (that's an *outcome*) and comparing just ASD and NEU. * We need to drop SIB and encode `Group` as a factor. ```r data <- autism %>% select(-`Vineland ABC`) %>% filter(Group != "SIB") %>% mutate(Group = as_factor(Group)) ``` --- ## Linear Model for *Classification* ```r *model <- logistic_reg() %>% fit(Group ~ `tGSH/GSSG` + `fGSH` + `Methion.`, data = data) ``` ```r model %>% tidy() %>% select(term, estimate) %>% kable() ``` |term | estimate| |:-----------|-----------:| |(Intercept) | -15.3438894| |`tGSH/GSSG` | 0.1364630| |fGSH | 2.1681857| |Methion. | 0.2919989| $$ \log\left[ \frac { \widehat{P( \operatorname{..y} = \operatorname{NEU} )} }{ 1 - \widehat{P( \operatorname{..y} = \operatorname{NEU} )} } \right] = -15.34 + 0.14(\operatorname{tGSH/GSSG}) + 2.17(\operatorname{fGSH}) + 0.29(\operatorname{Methion.}) $$ --- ### What do the *predictions* look like? ```r model %>% predict(data, type = "prob") ``` ``` # A tibble: 159 × 2 .pred_ASD .pred_NEU <dbl> <dbl> 1 0.924 0.0764 2 0.940 0.0605 3 0.994 0.00603 4 0.996 0.00361 5 0.868 0.132 6 0.797 0.203 # … with 153 more rows ``` --- ## Intuition: *Risk* - First, linear model predicts a *risk score* (negative = low risk, positive = high risk). - Technically: *log-odds*: log(prob of *yes* / prob of *no*) - Then, *logistic* function turns risk score into probability Other examples: - Risk of a customer churning - Risk of a transaction being fraudulent - Chance of one team winning: difference in skill scores (Elo) --- ### Were those predictions *good*? .small-code[ ```r model %>% predict(data, type = "prob") %>% bind_cols(data) %>% mutate(idx = row_number()) %>% ggplot(aes(x = idx, y = .pred_ASD, color = Group, shape = Group)) + geom_hline(yintercept = .5) + geom_point() ``` <img src="slides04logistic_files/figure-html/plot-predicted-probs-tree-1.png" width="90%" style="display: block; margin: auto;" /> ] --- ### Quantifying that ```r metrics <- yardstick::metric_set(accuracy, sensitivity, specificity, yardstick::f_meas) model %>% predict(data, type = "class") %>% bind_cols(data) %>% metrics(truth = Group, estimate = .pred_class) ``` ``` # A tibble: 4 × 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 accuracy binary 0.906 2 sensitivity binary 0.940 3 specificity binary 0.868 4 f_meas binary 0.912 ``` --- ## Classification Metrics | | Event happened | No event happened | |----------------------|-------------------------------|-------------------------------| | Event predicted | True positive | False positive (Type 1 error) | | No event predicted | False negative (Type 2 error) | True negative | -- - **Accuracy** (% correct) = (TP + TN) / (# episodes) - **False negative** ("miss") **rate** = FN / (# actual events) - **False positive** ("false alarm") **rate** = FP / (# true non-events) -- - **Sensitivity** ("true positive rate") = TP / (# actual events) - Sensitivity = 1 − False negative rate - **Specificity** ("true negative rate") = TN / (# actual events) - Specificity = 1 − False positive rate - [Wikipedia article](https://en.wikipedia.org/wiki/Sensitivity_and_specificity) --- ## Predicted probabilities tell confidence - **Comparing models**: It's better to be *confident* when you're right, it's better to be *un-confident* when you're wrong. - **Setting thresholds** - Model is usually trained as if false negative and false positive are equally bad. - But if false negative is worse than false positive, we can adjust the confidence threshold (default: 50%) ```r metrics <- yardstick::metric_set(mn_log_loss, roc_auc) model %>% predict(data, type = "prob") %>% bind_cols(data) %>% metrics(truth = Group, .pred_ASD) ``` ``` # A tibble: 2 × 3 .metric .estimator .estimate <chr> <chr> <dbl> 1 mn_log_loss binary 0.262 2 roc_auc binary 0.953 ```