Making Inferences about Features

class: center, middle, inverse, title-slide

# Making Inferences about Features
### DATA 202 21FA

---

## Logistics

- Final projects
- Midterm 2??
- Discussions
- Quizzes and Homework

---

## What's your goal?

* **Predict** unseen labels
  * How much will this house sell for?
  * Does this child have autism?
  * Is this a positive or negative movie review?
* **Infer** relationships between features and labels
  * How much does home size affect price?
  * Is DNA methylation a marker of autism?
  * Does "sick" indicate a positive or negative review?
* Understand the **causal** effect of interventions
  * How much will building an addition increase the price of my home?
  * Will antioxidants prevent autism?
  * Will cutting this scene make my movie get better reviews?

---

## Techniques for Inference

* Classical statistical inference
  * 2-sample t tests, chi squared tests, ANOVA, ...
  * inference about model parameters (coefficient standard errors etc.)
* Variable importance plots
* Benefit of adding each feature

---

## Objectives for Today

* Identify several different approaches for drawing conclusions about features
* Recognize potential challenges in making those inferences

---

## Palmer Penguins

.pull-left[
<img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png" width="50%" style="display: block; margin: auto;" />

```r
library(palmerpenguins)
penguins <- palmerpenguins::penguins %>% 
  filter(!is.na(bill_length_mm), !is.na(bill_depth_mm))
```
]

.pull-right[
<img src="https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png" width="90%" style="display: block; margin: auto;" />
]

.floating-source[
[Artwork by @allison_horst](https://allisonhorst.github.io/palmerpenguins/articles/art.html)
]
---

## How does bill length relate to bill depth?

.small-code[
<img src="w12d2-inference_files/figure-html/penguins-simpsons-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## How does bill length relate to bill depth?

---

###  One big model

.panelset[
.panel[.panel-name[Fit Model]

```r
overall_model <- linear_reg() %>%
  fit(bill_depth_mm ~ bill_length_mm, data = penguins)
tidy(overall_model, conf.int = TRUE)
```

```
# A tibble: 2 × 7
  term   estimate std.error statistic  p.value conf.low conf.high
  <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Inte…  20.9       0.844      24.7  4.72e-78   19.2     22.5   
2 bill_…  -0.0850    0.0191     -4.46 1.12e- 5   -0.123   -0.0475
```
]
.panel[.panel-name[Coefficients]

```r
GGally::ggcoef_model(overall_model$fit) + labs(x = "Coefficient")
```

<img src="w12d2-inference_files/figure-html/overall-model-coef-1.png" width="90%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Predictions]
.small-code[

```r
augment(overall_model, penguins) %>% 
  ggplot(aes(x = bill_length_mm, y = bill_depth_mm)) +
    geom_point() +
    geom_line(aes(y = .pred), color = "blue", size = 1) +
    labs(title = "Penguin bill dimensions", subtitle = "Palmer Station LTER", x = "Bill length (mm)", y = "Bill depth (mm)")
```

<img src="w12d2-inference_files/figure-html/single-model-1.png" width="90%" style="display: block; margin: auto;" />
]]
]

---

### Parallel slopes

.panelset[
.panel[.panel-name[Fit Model]

```r
parallel_slopes_model <- linear_reg() %>%
  fit(bill_depth_mm ~ bill_length_mm
*     + species,
      data = penguins)
tidy(parallel_slopes_model, conf.int = TRUE)
```

```
# A tibble: 4 × 7
  term   estimate std.error statistic  p.value conf.low conf.high
  <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Inte…   10.6      0.683      15.5  2.43e-41    9.25     11.9  
2 bill_…    0.200    0.0175     11.4  8.66e-26    0.165     0.234
3 speci…   -1.93     0.224      -8.62 2.55e-16   -2.37     -1.49 
4 speci…   -5.11     0.191     -26.7  3.65e-85   -5.48     -4.73 
```
]
.panel[.panel-name[Coefficients]

```r
GGally::ggcoef_model(parallel_slopes_model$fit) + labs(x = "Coefficient")
```

<img src="w12d2-inference_files/figure-html/parallel-slopes-coef-1.png" width="90%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Predictions]
.small-code[

```r
augment(parallel_slopes_model, penguins) %>% 
* ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
    geom_point() +
    geom_line(aes(y = .pred), size = 1) +
    labs(title = "Penguin bill dimensions", subtitle = "Palmer Station LTER", x = "Bill length (mm)", y = "Bill depth (mm)")
```

<img src="w12d2-inference_files/figure-html/parallel-slopes-model-1.png" width="90%" style="display: block; margin: auto;" />
]
]]

---

### Interactions

.panelset[
.panel[.panel-name[Fit Model]

```r
interaction_model <- linear_reg() %>%
  fit(bill_depth_mm ~ bill_length_mm
*     * species,
      data = penguins)
tidy(interaction_model, conf.int = TRUE)
```

```
# A tibble: 6 × 7
  term   estimate std.error statistic  p.value conf.low conf.high
  <chr>     <dbl>     <dbl>     <dbl>    <dbl>    <dbl>     <dbl>
1 (Inte…  11.4       1.14      10.0   7.28e-21   9.17      13.6  
2 bill_…   0.179     0.0293     6.11  2.76e- 9   0.121      0.236
3 speci…  -3.84      2.05      -1.87  6.24e- 2  -7.88       0.200
4 speci…  -6.16      1.75      -3.51  5.09e- 4  -9.61      -2.71 
5 bill_…   0.0434    0.0456     0.952 3.42e- 1  -0.0463     0.133
6 bill_…   0.0260    0.0405     0.642 5.22e- 1  -0.0537     0.106
```
]
.panel[.panel-name[Coefficients]

```r
GGally::ggcoef_model(interaction_model$fit) + labs(x = "Coefficient")
```

<img src="w12d2-inference_files/figure-html/interaction-coef-1.png" width="90%" style="display: block; margin: auto;" />
]

.panel[.panel-name[Predictions]
.small-code[

```r
augment(interaction_model, penguins) %>% 
* ggplot(aes(x = bill_length_mm, y = bill_depth_mm, color = species)) +
    geom_point() +
    geom_line(aes(y = .pred), size = 1) +
    labs(title = "Penguin bill dimensions", subtitle = "Palmer Station LTER", x = "Bill length (mm)", y = "Bill depth (mm)")
```

<img src="w12d2-inference_files/figure-html/interaction-model-1.png" width="90%" style="display: block; margin: auto;" />
]
]]

---

## LINE Assumptions for making inference

- **Linearity**: there's actually a linear relationship
- **Independence**: there's no pattern to the errors
- **Normality of residuals**: no major outliers
- **Equal variance of residuals**: variability doesn't change systematically

These are colloquial wordings. See textbook [section E.4](https://mdsr-book.github.io/mdsr2e/ch-regression.html#assumptions-underlying-regression) for technical and how to verify.

---

## Examples of broken assumptions

- **Linearity**: the hotter the temperature, the more each degree of temperature matters
- **Independence**: ridership for adjacent hours is similar
- **Normality of residuals**: spike in demand after championship game
- **Equal variance of residuals**: bigger variability in demand on weekends vs weekdays

---

## Variable Importance Plots

.small-code[

```r
regresion_workflow <- workflow() %>%  add_model(decision_tree(mode = "regression") %>% set_engine('rpart'))

model <- regresion_workflow %>% 
  add_recipe(recipe(Sale_Price ~ ., data = ames_train)) %>% 
  fit(data = ames_train)

model %>% extract_fit_engine() %>% vip::vip(num_features = 15L)
```

<img src="w12d2-inference_files/figure-html/tree-vip-1.png" width="90%" style="display: block; margin: auto;" />
]

---

## How much does it help to have a feature in?

.small-code[

```r
regresion_workflow %>% 
  add_recipe(recipe(Sale_Price ~ ., data = ames_train)) %>% 
  fit_resamples(resamples = resamples, metrics = metric_set(mae, rmse)) %>% collect_metrics()
```

```
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    25.7    10   0.438 Preprocessor1_Model1
2 rmse    standard    36.1    10   1.16  Preprocessor1_Model1
```

```r
regresion_workflow %>% 
* add_recipe(recipe(Sale_Price ~ ., data = ames_train) %>% step_rm(ends_with("Qual"))) %>%
  fit_resamples(resamples = resamples, metrics = metric_set(mae, rmse)) %>% collect_metrics()
```

```
# A tibble: 2 × 6
  .metric .estimator  mean     n std_err .config             
  <chr>   <chr>      <dbl> <int>   <dbl> <chr>               
1 mae     standard    25.6    10   0.627 Preprocessor1_Model1
2 rmse    standard    36.5    10   1.10  Preprocessor1_Model1
```
]

---

## Variable Importance without the Quality Features

```r
regresion_workflow %>% 
* add_recipe(recipe(Sale_Price ~ ., data = ames_train) %>% step_rm(ends_with("Qual"))) %>%
  fit(data = ames_train) %>%
  extract_fit_engine() %>% vip::vip(num_features = 15L)
```

---

## Appendix: code

.small-code[

```r
include_graphics("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/logo.png")
include_graphics("https://raw.githubusercontent.com/allisonhorst/palmerpenguins/master/man/figures/culmen_depth.png")
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm)) +
  geom_point() +
  geom_smooth(method = "lm") +
  labs(title = "Penguin bill dimensions", subtitle = "Palmer Station LTER", x = "Bill length (mm)", y = "Bill depth (mm)")
ggplot(penguins, aes(x = bill_length_mm, y = bill_depth_mm, color = species, shape = species)) +
  geom_point() +
  geom_smooth(method = "lm") +
  scale_color_manual(values = c("darkorange","purple","cyan4")) +
  labs(title = "Penguin bill dimensions",
       subtitle = "Bill length and depth for Adelie, Chinstrap and Gentoo Penguins at Palmer Station LTER",
       x = "Bill length (mm)",
       y = "Bill depth (mm)",
       color = "Penguin species",
       shape = "Penguin species") +
  theme(legend.position = c(0.85, 0.15),
        legend.background = element_rect(fill = "white", color = NA))
#data(ames, package = "modeldata")
ames <- AmesHousing::make_ames()
ames_all <- ames %>%
  filter(Gr_Liv_Area < 4000, Sale_Condition == "Normal") %>%
  mutate(across(where(is.integer), as.double)) %>%
  mutate(Sale_Price = Sale_Price / 1000)
rm(ames)
set.seed(10) # Seed the random number generator
ames_split <- initial_split(ames_all, prop = 2 / 3)
ames_train <- training(ames_split)
ames_test <- testing(ames_split)
set.seed(0)
resamples <- vfold_cv(ames_train, v = 10)
```
]