Predictive Analytics Unit 4: Logistic Classifiers (revised)

Ken Arnold
Calvin University

Two Types of Supervised Learning

  • Regression: fill in a number
    • How many customers?
    • What price?
  • Classification: fill in a choice
    • How likely is this transaction to be fraudulent?
    • What is the risk of default for this credit application?

Classification example: Blood test for autism?

We’ll use an example from a 2017 PLOS Computational Biology paper

skimr::skim(autism)
Data summary
Name autism
Number of rows 206
Number of columns 26
_______________________
Column type frequency:
character 1
numeric 25
________________________
Group variables None

Variable type: character

skim_variable n_missing complete_rate min max empty n_unique whitespace
Group 0 1 3 3 0 3 0

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
Methion. 0 1.00 21.38 3.89 13.43 18.94 21.08 23.61 33.14 ▃▇▆▂▁
SAM 0 1.00 65.65 12.33 37.22 57.83 63.92 71.13 121.16 ▂▇▂▁▁
SAH 0 1.00 16.68 4.35 7.39 14.14 16.61 19.27 28.12 ▃▆▇▃▁
SAM/SAH 0 1.00 4.26 1.61 1.66 3.24 3.90 4.79 13.10 ▇▆▁▁▁
% DNA methylation 0 1.00 3.85 0.94 1.78 3.14 3.88 4.52 7.73 ▃▇▆▁▁
8-OHG 0 1.00 0.07 0.03 0.03 0.05 0.07 0.08 0.18 ▅▇▂▁▁
Adenosine 0 1.00 0.13 0.06 0.04 0.08 0.12 0.17 0.39 ▇▆▃▁▁
Homocysteine 0 1.00 4.87 1.14 3.06 4.11 4.67 5.57 9.70 ▇▇▃▁▁
Cysteine 0 1.00 201.50 19.83 141.31 188.50 201.12 214.61 261.44 ▁▅▇▅▁
Glu.-Cys. 0 1.00 2.10 0.55 1.12 1.67 1.95 2.50 3.83 ▅▇▅▂▁
Cys.-Gly. 0 1.00 43.01 6.80 27.12 38.83 42.66 46.97 63.75 ▂▇▇▃▁
tGSH 0 1.00 6.24 1.40 3.25 5.24 6.18 7.01 11.18 ▃▇▇▁▁
fGSH 0 1.00 2.02 0.59 1.08 1.68 1.90 2.23 5.44 ▇▅▁▁▁
GSSG 0 1.00 0.18 0.09 0.05 0.12 0.17 0.22 0.65 ▇▆▁▁▁
fGSH/GSSG 0 1.00 13.77 8.03 2.15 8.24 11.55 17.78 57.00 ▇▅▁▁▁
tGSH/GSSG 0 1.00 42.50 22.76 8.55 25.60 36.30 55.38 130.00 ▇▆▃▁▁
Chlorotyrosine 0 1.00 44.06 25.07 6.39 26.53 38.39 58.10 143.28 ▇▇▃▁▁
Nitrotyrosine 0 1.00 88.07 59.23 21.56 47.69 71.44 107.00 467.30 ▇▂▁▁▁
Tyrosine 0 1.00 50.87 13.78 22.93 40.71 48.97 60.15 92.93 ▂▇▅▂▁
Tryptophane 0 1.00 38.16 8.88 20.05 30.92 37.61 44.60 60.12 ▃▇▇▆▁
fCystine 0 1.00 26.21 8.38 10.32 19.80 25.41 30.80 54.11 ▅▇▆▂▁
fCysteine 0 1.00 21.73 4.72 10.85 18.39 21.04 24.23 36.54 ▁▇▆▂▁
fCystine/fCysteine 0 1.00 1.25 0.50 0.57 0.91 1.12 1.44 4.06 ▇▃▁▁▁
% oxidized 0 1.00 0.16 0.07 0.03 0.10 0.15 0.20 0.48 ▇▇▂▁▁
Vineland ABC 159 0.23 70.77 13.91 46.00 61.50 69.00 76.00 106.00 ▅▇▅▃▂

We have 3 kinds of data about 206 children:

  1. The outcome (Group): ASD (diagnosed with ASD), SIB (sibling not diagnosed with ASD), and NEU (age-matched neurotypical children, for control)
autism %>% group_by(Group) %>% summarize(n = n()) %>% knitr::kable()
Group n
ASD 83
NEU 76
SIB 47
  2. Concentrations of various metabolites in a blood sample:
autism %>% select(-1, -last_col()) %>% names()
 [1] "Methion."           "SAM"                "SAH"               
 [4] "SAM/SAH"            "% DNA methylation"  "8-OHG"             
 [7] "Adenosine"          "Homocysteine"       "Cysteine"          
[10] "Glu.-Cys."          "Cys.-Gly."          "tGSH"              
[13] "fGSH"               "GSSG"               "fGSH/GSSG"         
[16] "tGSH/GSSG"          "Chlorotyrosine"     "Nitrotyrosine"     
[19] "Tyrosine"           "Tryptophane"        "fCystine"          
[22] "fCysteine"          "fCystine/fCysteine" "% oxidized"        
  3. For the ASD children only, a measure of life skills (“Vineland ABC”)
autism %>% 
  ggplot(aes(x = `Vineland ABC`, y = Group)) + geom_boxplot()

Exploratory Data Analysis (EDA)

What do these metabolites look like?

code for the previous plot:

library(ggridges)
autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Measure)) +
  geom_density_ridges() + 
  facet_wrap(vars(Group), scales = "free_x")

EDA

Better question: Can these metabolites help us distinguish autism?

code for previous plot:

autism %>%
  select(-`Vineland ABC`) %>% 
  pivot_longer(-Group, names_to = "Measure") %>% 
  ggplot(aes(x = value, y = Group)) +
  geom_density_ridges() +
  facet_wrap(vars(Measure), scales = "free_x") + 
  theme_ridges()

Can we predict ASD vs non-ASD from metabolites?

  • Let’s start by (1) ignoring the behavior scores (that’s an outcome) and comparing just ASD and NEU.
  • We need to drop SIB and encode Group as a factor. (Ensure that “ASD” is the first level.)
data <- autism %>% 
  select(-`Vineland ABC`) %>% 
  filter(Group != "SIB") %>% 
  mutate(Group = as_factor(Group) %>% fct_relevel("ASD"))
data$Group
  [1] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [19] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [37] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [55] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD
 [73] ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD ASD NEU NEU NEU NEU NEU NEU NEU
 [91] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[109] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[127] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
[145] NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU NEU
Levels: ASD NEU
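Why does the level order matter? yardstick treats the *first* factor level as the "event" class, so we want "ASD" to come first. A base-R sketch of the same idea (relevel() is base R's analogue of forcats::fct_relevel(); the toy vector below is made up):

```r
# Force the factor's first level to be a specific order, then fix it
g <- factor(c("NEU", "ASD", "NEU", "ASD"), levels = c("NEU", "ASD"))
levels(g)                      # "NEU" "ASD" -- ASD is not first
g2 <- relevel(g, ref = "ASD")  # move "ASD" to the front
levels(g2)                     # "ASD" "NEU"
```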

Linear Model for Classification

model <- logistic_reg() %>%
  fit(Group ~ `tGSH/GSSG` + `fGSH` + `Methion.`, data = data)
model %>% tidy() %>% select(term, estimate) %>% kable()
term estimate
(Intercept) -15.3438894
tGSH/GSSG 0.1364630
fGSH 2.1681857
Methion. 0.2919989

\[ \log\left[ \frac { \widehat{P( \operatorname{Group} = \operatorname{NEU} )} }{ 1 - \widehat{P( \operatorname{Group} = \operatorname{NEU} )} } \right] = -15.34 + 0.14(\operatorname{tGSH/GSSG}) + 2.17(\operatorname{fGSH}) + 0.29(\operatorname{Methion.}) \]
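To make the fitted equation concrete, we can plug in a hypothetical child's values (the metabolite numbers below are invented for illustration, not taken from the dataset):

```r
# Fitted coefficients from the table above (log-odds of NEU)
b0 <- -15.3438894          # intercept
b1 <-   0.1364630          # tGSH/GSSG
b2 <-   2.1681857          # fGSH
b3 <-   0.2919989          # Methion.

# Hypothetical values: tGSH/GSSG = 40, fGSH = 2, Methion. = 21
log_odds <- b0 + b1 * 40 + b2 * 2 + b3 * 21
log_odds           # ~0.58: slightly positive, so slightly favors NEU
plogis(log_odds)   # ~0.64: predicted P(NEU), via the logistic function
```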

What do the predictions look like?

model %>% predict(data, type = "prob")
# A tibble: 159 × 2
   .pred_ASD .pred_NEU
       <dbl>     <dbl>
 1     0.924   0.0764 
 2     0.940   0.0605 
 3     0.994   0.00603
 4     0.996   0.00361
 5     0.868   0.132  
 6     0.797   0.203  
 7     0.521   0.479  
 8     0.773   0.227  
 9     0.901   0.0986 
10     0.986   0.0141 
# … with 149 more rows

Intuition: Risk

  • First, the linear model predicts a risk score (negative = low risk, positive = high risk).
    • Technically, this is the log-odds: log(prob of yes / prob of no)
  • Then, the logistic function turns the risk score into a probability.
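The two steps can be sketched with base R's built-in logistic function, plogis():

```r
risk <- c(-3, -1, 0, 1, 3)   # risk scores (log-odds) from a linear model
round(plogis(risk), 3)       # logistic function: 1 / (1 + exp(-x))
# 0.047 0.269 0.500 0.731 0.953
# A risk score of 0 maps to probability 0.5; large negative or positive
# scores saturate toward 0 and 1.
```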

Other examples:

  • Risk of a customer churning
  • Risk of a transaction being fraudulent
  • Chance of one team winning: difference in skill scores (Elo)

Were those predictions good?

model %>%
  predict(data, type = "prob") %>%
  bind_cols(data) %>% 
  mutate(idx = row_number()) %>% 
  ggplot(aes(x = idx, y = .pred_ASD, color = Group, shape = Group)) +
  geom_hline(yintercept = .5) +
  geom_point() 

Quantifying Classification Quality

decision_metrics <- yardstick::metric_set(accuracy, sensitivity,
                                          specificity, yardstick::f_meas)
model %>% 
  predict(data, type = "class") %>% 
  bind_cols(data) %>% 
  decision_metrics(truth = Group, estimate = .pred_class)
# A tibble: 4 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 accuracy    binary         0.906
2 sensitivity binary         0.940
3 specificity binary         0.868
4 f_meas      binary         0.912

Note: these are all measures of the quality of the decision. Other metrics can evaluate the scores:

model %>% 
  augment(data) %>% 
  roc_auc(truth = Group, .pred_ASD)
# A tibble: 1 × 3
  .metric .estimator .estimate
  <chr>   <chr>          <dbl>
1 roc_auc binary         0.953

Classification Metrics

Note: this table is the transpose of the one in the Wikipedia article

                     Event happened                   No event happened
Event predicted      True positive                    False positive (Type I error)
No event predicted   False negative (Type II error)   True negative
  • Accuracy (% correct) = (TP + TN) / (total # of cases)
  • False negative (“miss”) rate = FN / (# actual events)
  • False positive (“false alarm”) rate = FP / (# actual non-events)
  • Sensitivity (“true positive rate”) = TP / (# actual events)
    • Sensitivity = 1 − False negative rate
  • Specificity (“true negative rate”) = TN / (# actual non-events)
    • Specificity = 1 − False positive rate
  • Wikipedia article
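These definitions are easy to check on a toy confusion matrix (the counts below are made up):

```r
tp <- 40; fp <- 10; fn <- 5; tn <- 45   # hypothetical counts

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # of actual events, fraction caught
specificity <- tn / (tn + fp)   # of actual non-events, fraction cleared

c(accuracy = accuracy, sensitivity = sensitivity, specificity = specificity)
# accuracy 0.85, sensitivity ~0.889, specificity ~0.818

# The complement identities from the bullets above:
stopifnot(abs(sensitivity - (1 - fn / (tp + fn))) < 1e-12,
          abs(specificity - (1 - fp / (tn + fp))) < 1e-12)
```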

Predicted probabilities tell confidence

  • Comparing models: it’s better to be confident when you’re right and less confident when you’re wrong.
  • Setting thresholds
    • The model is usually trained as if false negatives and false positives are equally bad.
    • But if a false negative is worse than a false positive, we can adjust the confidence threshold (default: 50%).
metrics <- yardstick::metric_set(mn_log_loss, roc_auc)
model %>% predict(data, type = "prob") %>% 
  bind_cols(data) %>% 
  metrics(truth = Group, .pred_ASD)
# A tibble: 2 × 3
  .metric     .estimator .estimate
  <chr>       <chr>          <dbl>
1 mn_log_loss binary         0.262
2 roc_auc     binary         0.953
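To see threshold adjustment in action without tidymodels, here is a base-R sketch on invented predicted probabilities: lowering the threshold trades specificity for sensitivity.

```r
# Hypothetical predicted P(event) and true labels (1 = event happened)
p     <- c(0.95, 0.80, 0.65, 0.45, 0.40, 0.30, 0.20, 0.10)
truth <- c(1,    1,    1,    0,    1,    0,    0,    0)

metrics_at <- function(threshold) {
  pred <- as.integer(p >= threshold)
  c(sensitivity = sum(pred == 1 & truth == 1) / sum(truth == 1),
    specificity = sum(pred == 0 & truth == 0) / sum(truth == 0))
}

metrics_at(0.5)   # sensitivity 0.75, specificity 1.00: misses the 0.40 event
metrics_at(0.35)  # sensitivity 1.00, specificity 0.75: catches it, at the
                  # cost of one false alarm (the 0.45 non-event)
```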