Predictive Analytics Unit 4: Logistic Classifiers (revised)

Ken Arnold
Calvin University

Two Types of Supervised Learning

  • Regression: fill in a number
    • How many customers?
    • What price?
  • Classification: fill in a choice
    • How likely is this transaction to be fraudulent?
    • What is the risk of default for this credit application?

Classification example: Blood test for autism?

We’ll use an example from a 2017 PLOS Computational Biology paper

skimr::skim(autism)
Data summary
Name autism
Number of rows 206
Number of columns 26
_______________________
Column type frequency:
character 1
numeric 25
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Group 0 1 3 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Methion. 0 1.00 21.38 3.89 13.43 18.94 21.08 23.61 33.14 ▃▇▆▂▁
SAM 0 1.00 65.65 12.33 37.22 57.83 63.92 71.13 121.16 ▂▇▂▁▁
SAH 0 1.00 16.68 4.35 7.39 14.14 16.61 19.27 28.12 ▃▆▇▃▁
SAM/SAH 0 1.00 4.26 1.61 1.66 3.24 3.90 4.79 13.10 ▇▆▁▁▁
% DNA methylation 0 1.00 3.85 0.94 1.78 3.14 3.88 4.52 7.73 ▃▇▆▁▁
8-OHG 0 1.00 0.07 0.03 0.03 0.05 0.07 0.08 0.18 ▅▇▂▁▁
Adenosine 0 1.00 0.13 0.06 0.04 0.08 0.12 0.17 0.39 ▇▆▃▁▁
Homocysteine 0 1.00 4.87 1.14 3.06 4.11 4.67 5.57 9.70 ▇▇▃▁▁
Cysteine 0 1.00 201.50 19.83 141.31 188.50 201.12 214.61 261.44 ▁▅▇▅▁
Glu.-Cys. 0 1.00 2.10 0.55 1.12 1.67 1.95 2.50 3.83 ▅▇▅▂▁
Cys.-Gly. 0 1.00 43.01 6.80 27.12 38.83 42.66 46.97 63.75 ▂▇▇▃▁
tGSH 0 1.00 6.24 1.40 3.25 5.24 6.18 7.01 11.18 ▃▇▇▁▁
fGSH 0 1.00 2.02 0.59 1.08 1.68 1.90 2.23 5.44 ▇▅▁▁▁
GSSG 0 1.00 0.18 0.09 0.05 0.12 0.17 0.22 0.65 ▇▆▁▁▁
fGSH/GSSG 0 1.00 13.77 8.03 2.15 8.24 11.55 17.78 57.00 ▇▅▁▁▁
tGSH/GSSG 0 1.00 42.50 22.76 8.55 25.60 36.30 55.38 130.00 ▇▆▃▁▁
Chlorotyrosine 0 1.00 44.06 25.07 6.39 26.53 38.39 58.10 143.28 ▇▇▃▁▁
Nitrotyrosine 0 1.00 88.07 59.23 21.56 47.69 71.44 107.00 467.30 ▇▂▁▁▁
Tyrosine 0 1.00 50.87 13.78 22.93 40.71 48.97 60.15 92.93 ▂▇▅▂▁
Tryptophane 0 1.00 38.16 8.88 20.05 30.92 37.61 44.60 60.12 ▃▇▇▆▁
fCystine 0 1.00 26.21 8.38 10.32 19.80 25.41 30.80 54.11 ▅▇▆▂▁
fCysteine 0 1.00 21.73 4.72 10.85 18.39 21.04 24.23 36.54 ▁▇▆▂▁
fCystine/fCysteine 0 1.00 1.25 0.50 0.57 0.91 1.12 1.44 4.06 ▇▃▁▁▁
% oxidized 0 1.00 0.16 0.07 0.03 0.10 0.15 0.20 0.48 ▇▇▂▁▁
Vineland ABC 159 0.23 70.77 13.91 46.00 61.50 69.00 76.00 106.00 ▅▇▅▃▂

We have 3 kinds of data about 206 children:

  1. The outcome (Group): ASD (diagnosed with ASD), SIB (sibling not diagnosed with ASD), and NEU (age-matched neurotypical children, for control)
autism %>% group_by(Group) %>% summarize(n = n()) %>% knitr::kable()
Group n
ASD 83
NEU 76
SIB 47
  2. Concentrations of various metabolites in a blood sample:
autism %>% select(-1, -last_col()) %>% names()
 [1] "Methion."           "SAM"                "SAH"               
 [4] "SAM/SAH"            "% DNA methylation"  "8-OHG"             
 [7] "Adenosine"          "Homocysteine"       "Cysteine"          
[10] "Glu.-Cys."          "Cys.-Gly."          "tGSH"              
[13] "fGSH"               "GSSG"               "fGSH/GSSG"         
[16] "tGSH/GSSG"          "Chlorotyrosine"     "Nitrotyrosine"     
[19] "Tyrosine"           "Tryptophane"        "fCystine"          
[22] "fCysteine"          "fCystine/fCysteine" "% oxidized"        
  3. For the ASD children only, a measure of life skills (“Vineland ABC”)
autism %>% 
  ggplot(aes(x = `Vineland ABC`, y = Group)) + geom_boxplot()

Exploratory Data Analysis (EDA)

What do these metabolites look like?

code for the previous plot:

library(ggridges)
autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Measure)) +
  geom_density_ridges() + 
  facet_wrap(vars(Group), scales = "free_x")

EDA

Better question: Can these metabolites help us distinguish autism?

code for previous plot:

autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Group)) +
  geom_density_ridges() +
  facet_wrap(vars(Measure), scales = "free_x") + 
  theme_ridges()

Can we predict ASD vs non-ASD from metabolites?

  • Let’s start by (1) ignoring the behavior scores (that’s an outcome) and comparing just ASD and NEU.
  • We need to drop SIB and encode Group as a factor. (Ensure that “ASD” is the first level.)
data <- autism %>% 
  select(-`Vineland ABC`) %>% 
  filter(Group != "SIB") %>% 
  mutate(Group = as_factor(Group) %>% fct_relevel("ASD"))
data$Group
  [1] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [19] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [37] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [55] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [73] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD NEU NEU NEU NEU NEU NEU NEU
 [91] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[109] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[127] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[145] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
Levels: ASD NEU
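Why does the level order matter? yardstick treats the *first* factor level as the "event" class, so we want "ASD" to come first. A base-R sketch of the same idea (relevel() is base R's analogue of forcats::fct_relevel(); the toy vector below is made up):

```r
# Force the factor's first level to be a specific order, then fix it
g <- factor(c("NEU", "ASD", "NEU", "ASD"), levels = c("NEU", "ASD"))
levels(g)                      # "NEU" "ASD" -- ASD is not first
g2 <- relevel(g, ref = "ASD")  # move "ASD" to the front
levels(g2)                     # "ASD" "NEU"
```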

Linear Model for Classification

model <- logistic_reg() %>%
  fit(Group ~ `tGSH/GSSG` + `fGSH` + `Methion.`, data = data)
model %>% tidy() %>% select(term, estimate) %>% kable()
term estimate
(Intercept) -15.3438894
tGSH/GSSG 0.1364630
fGSH 2.1681857
Methion. 0.2919989

\[ \log\left[ \frac { \widehat{P( \operatorname{Group} = \operatorname{NEU} )} }{ 1 - \widehat{P( \operatorname{Group} = \operatorname{NEU} )} } \right] = -15.34 + 0.14(\operatorname{tGSH/GSSG}) + 2.17(\operatorname{fGSH}) + 0.29(\operatorname{Methion.}) \]
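To make the fitted equation concrete, we can plug in a hypothetical child's values (the metabolite numbers below are invented for illustration, not taken from the dataset):

```r
# Fitted coefficients from the table above (log-odds of NEU)
b0 <- -15.3438894          # intercept
b1 <-   0.1364630          # tGSH/GSSG
b2 <-   2.1681857          # fGSH
b3 <-   0.2919989          # Methion.

# Hypothetical values: tGSH/GSSG = 40, fGSH = 2, Methion. = 21
log_odds <- b0 + b1 * 40 + b2 * 2 + b3 * 21
log_odds           # ~0.58: slightly positive, so slightly favors NEU
plogis(log_odds)   # ~0.64: predicted P(NEU), via the logistic function
```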

What do the predictions look like?

model %>% predict(data, type = "prob")
# A tibble: 159 × 2
   .pred_ASD .pred_NEU
       <dbl>     <dbl>
 1     0.924   0.0764 
 2     0.940   0.0605 
 3     0.994   0.00603
 4     0.996   0.00361
 5     0.868   0.132  
 6     0.797   0.203  
 7     0.521   0.479  
 8     0.773   0.227  
 9     0.901   0.0986 
10     0.986   0.0141 
# … with 149 more rows

Intuition: Risk

  • First, the linear model predicts a risk score (negative = low risk, positive = high risk).
    • Technically, this is the log-odds: log(prob of yes / prob of no)
  • Then, the logistic function turns the risk score into a probability.
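The two steps can be sketched with base R's built-in logistic function, plogis():

```r
risk <- c(-3, -1, 0, 1, 3)   # risk scores (log-odds) from a linear model
round(plogis(risk), 3)       # logistic function: 1 / (1 + exp(-x))
# 0.047 0.269 0.500 0.731 0.953
# A risk score of 0 maps to probability 0.5; large negative or positive
# scores saturate toward 0 and 1.
```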

Other examples:

  • Risk of a customer churning
  • Risk of a transaction being fraudulent
  • Chance of one team winning: difference in skill scores (Elo)

Were those predictions good?

model %>%
  predict(data, type = "prob") %>%
  bind_cols(data) %>% 
  mutate(idx = row_number()) %>% 
  ggplot(aes(x = idx, y = .pred_ASD, color = Group, shape = Group)) +
  geom_hline(yintercept = .5) +
  geom_point() 

Quantifying Classification Quality

decision_metrics <- yardstick::metric_set(accuracy, sensitivity,
                                          specificity, yardstick::f_meas)
model %>% 
  predict(data, type = "class") %>% 
  bind_cols(data) %>% 
  decision_metrics(truth = Group, estimate = .pred_class)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.906
2 sensitivity binary         0.940
3 specificity binary         0.868
4 f_meas      binary         0.912

Note: these are all measures of the quality of the decision. Other metrics can evaluate the scores:

model %>% 
  augment(data) %>% 
  roc_auc(truth = Group, .pred_ASD)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.953

Classification Metrics

Note: this table is the transpose of the one in the Wikipedia article

                     Event happened                   No event happened
Event predicted      True positive                    False positive (Type I error)
No event predicted   False negative (Type II error)   True negative
  • Accuracy (% correct) = (TP + TN) / (total # of cases)
  • False negative (“miss”) rate = FN / (# actual events)
  • False positive (“false alarm”) rate = FP / (# actual non-events)
  • Sensitivity (“true positive rate”) = TP / (# actual events)
    • Sensitivity = 1 − False negative rate
  • Specificity (“true negative rate”) = TN / (# actual non-events)
    • Specificity = 1 − False positive rate
  • Wikipedia article
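These definitions are easy to check on a toy confusion matrix (the counts below are made up):

```r
tp <- 40; fp <- 10; fn <- 5; tn <- 45   # hypothetical counts

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # of actual events, fraction caught
specificity <- tn / (tn + fp)   # of actual non-events, fraction cleared

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
# accuracy 0.85, sensitivity ~0.889, specificity ~0.818

# The complement identities from the bullets above:
stopifnot(abs(sensitivity - (1 - fn / (tp + fn))) < 1e-12,
          abs(specificity - (1 - fp / (tn + fp))) < 1e-12)
```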

Predicted probabilities tell confidence

  • Comparing models: it’s better to be confident when you’re right and less confident when you’re wrong.
  • Setting thresholds
    • The model is usually trained as if false negatives and false positives are equally bad.
    • But if a false negative is worse than a false positive, we can adjust the confidence threshold (default: 50%).
metrics <- yardstick::metric_set(mn_log_loss, roc_auc)
model %>% predict(data, type = "prob") %>% 
  bind_cols(data) %>% 
  metrics(truth = Group, .pred_ASD)
# A tibble: 2 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 mn_log_loss binary         0.262
2 roc_auc     binary         0.953
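To see threshold adjustment in action without tidymodels, here is a base-R sketch on invented predicted probabilities: lowering the threshold trades specificity for sensitivity.

```r
# Hypothetical predicted P(event) and true labels (1 = event happened)
p     <- c(0.95, 0.80, 0.65, 0.45, 0.40, 0.30, 0.20, 0.10)
truth <- c(1,    1,    1,    0,    1,    0,    0,    0)

metrics_at <- function(threshold) {
  pred <- as.integer(p >= threshold)
  c(sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

metrics_at(0.5)   # sensitivity 0.75, specificity 1.00: misses the 0.40 event
metrics_at(0.35)  # sensitivity 1.00, specificity 0.75: catches it, at the
                  # cost of one false alarm (the 0.45 non-event)
```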