Tidying Data

# Tidying Data
### DATA 202 21FA (thanks: datasciencebox.org)

---

## Q&A

> Can we use actual maps instead of scatterplots?

Yes, we need a *base layer*. We'll focus more on spatial data in a few weeks.

---

## Review of Exercise 5

Why not just grab `states$population`?

---

# Pivoting

---

## Data: Sales

```
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <int> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
```
]

--
.pull-right[
### .pink[We want...]

```
# A tibble: 6 × 3
  customer_id item_num item        
        <int> <chr>    <chr>       
1           1 item_1   bread       
2           1 item_2   milk        
3           1 item_3   banana      
4           2 item_1   milk        
5           2 item_2   toilet paper
6           2 item_3   <NA>        
```
]

---

## A grammar of data tidying

.pull-left[
<img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" />
]
.pull-right[
The goal of tidyr is to help you tidy your data via

- pivoting for going between wide and long data
- splitting and combining character columns
- nesting and unnesting columns
- clarifying how `NA`s should be treated
]

---

# Pivoting data

---

## Not this...

---

## but this!

.center[
<img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" />
]

---

## Wider vs. longer

```
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <int> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
```
]

--
.pull-right[
### .pink[longer]
more rows

---

## `pivot_longer()`

```r
pivot_longer(
* data,
  cols, 
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

```r
pivot_longer(
  data, 
* cols,
  names_to = "name", 
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
* names_to = "name",
  values_to = "value"
  )
```
]

---

## `pivot_longer()`

.pull-left[
- `data` (as usual)
- `cols`: columns to pivot into longer format 
- `names_to`: name of the column where column names of pivoted variables go (character string)
- `values_to`: name of the column where data in pivoted variables go (character string)
]
.pull-right[

```r
pivot_longer(
  data, 
  cols, 
  names_to = "name", 
* values_to = "value"
  )
```
]

---

## Customers `$\rightarrow$` purchases

```r
purchases <- customers %>%
* pivot_longer(
*   cols = item_1:item_3,  # variables item_1 to item_3
*   names_to = "item_num", # column names -> new column called item_num
*   values_to = "item"     # values in columns -> new column called item
*   )

purchases
```

```
# A tibble: 6 × 3
  customer_id item_num item        
        <int> <chr>    <chr>       
1           1 item_1   bread       
2           1 item_2   milk        
3           1 item_3   banana      
4           2 item_1   milk        
5           2 item_2   toilet paper
6           2 item_3   <NA>        
```

---

## Why pivot?

Most likely, because the next step of your analysis needs it

```r
prices
```

```
# A tibble: 5 × 2
  item         price
  <chr>        <dbl>
1 avocado       0.5 
2 banana        0.15
3 bread         1   
4 milk          0.8 
5 toilet paper  3   
```
]
.pull-right[

```r
purchases %>%
* left_join(prices)
```

```
# A tibble: 6 × 4
  customer_id item_num item         price
        <int> <chr>    <chr>        <dbl>
1           1 item_1   bread         1   
2           1 item_2   milk          0.8 
3           1 item_3   banana        0.15
4           2 item_1   milk          0.8 
5           2 item_2   toilet paper  3   
6           2 item_3   <NA>         NA   
```
]

---

## Purchases `$\rightarrow$` customers

.pull-left-narrow[
- `data` (long)
- `names_from`: tells us what column to put each value in
- `values_from`: tells us what to put in that column
]
.pull-right-wide[

```r
purchases %>%
* pivot_wider(
*   names_from = item_num,
*   values_from = item
* )
```

```
# A tibble: 2 × 4
  customer_id item_1 item_2       item_3
        <int> <chr>  <chr>        <chr> 
1           1 bread  milk         banana
2           2 milk   toilet paper <NA>  
```
]

---

# Case study: Biden Approval Rating

---

---

## Data

```r
approval
```

```
# A tibble: 510 × 4
  subgroup date       approval disapproval
  <chr>    <date>        <dbl>       <dbl>
1 Voters   2021-10-04     45.9        48.7
2 Adults   2021-10-04     44.7        46.6
3 Voters   2021-10-03     45.9        48.7
4 Adults   2021-10-03     44.9        48.1
5 Adults   2021-10-02     44.8        47.8
6 Voters   2021-10-02     45.9        48.7
# … with 504 more rows
```

---

## Goal

.pull-left-wide[
<img src="w06d1-tidy_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" />
]
--
.pull-right-narrow[
**Aesthetic mappings:**  
✅  x = `date`  
❌      y = `rating_value`  
❌      color = `rating_type`

**Facet:**  
✅  `subgroup` (Adults and Voters)
]

---

## Pivot

```r
approval_longer <- approval %>%
  pivot_longer(
    cols = c(approval, disapproval),
    names_to = "rating_type",
    values_to = "rating_value"
  )

approval_longer
```

```
# A tibble: 1,020 × 4
  subgroup date       rating_type rating_value
  <chr>    <date>     <chr>              <dbl>
1 Voters   2021-10-04 approval            45.9
2 Voters   2021-10-04 disapproval         48.7
3 Adults   2021-10-04 approval            44.7
4 Adults   2021-10-04 disapproval         46.6
5 Voters   2021-10-03 approval            45.9
6 Voters   2021-10-03 disapproval         48.7
# … with 1,014 more rows
```

---

## Plot

```r
ggplot(approval_longer, 
       aes(x = date, y = rating_value, color = rating_type)) +
  geom_line() +
  facet_wrap(vars(subgroup))
```

---

```r
ggplot(approval_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type)) +
  geom_line() +
  facet_wrap(vars(subgroup)) +
* scale_color_manual(values = c("darkgreen", "orange")) +
* labs(
*   x = "Date", y = "Rating",
*   color = NULL,
*   title = "How (un)popular is Joe Biden?",
*   subtitle = "Estimates based on polls of all adults and polls of likely/registered voters",
*   caption = "Source: FiveThirtyEight modeling estimates"
* )
```
]

.panel[.panel-name[Plot]
<img src="w06d1-tidy_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

```r
ggplot(approval_longer, 
       aes(x = date, y = rating_value, 
           color = rating_type)) +
  geom_line() +
  facet_wrap(vars(subgroup)) +
  scale_color_manual(values = c("darkgreen", "orange")) + 
  labs( 
    x = "Date", y = "Rating", 
    color = NULL, 
    title = "How (un)popular is Joe Biden??", 
    subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", 
    caption = "Source: FiveThirtyEight modeling estimates" 
  ) + 
* theme_minimal() +
* theme(legend.position = "bottom")
```
]

.panel[.panel-name[Plot]
<img src="w06d1-tidy_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto;" />
]

]

---

# Case study: Gapminder

---

Source: https://www.gapminder.org/data/documentation/gd003/

---

## We want...

```r
gm_pop_wide <- read_csv("data/GM-Population - Dataset - v6 - data-pop-gmv6-in-columns.csv", skip = 2, show_col_types = FALSE)
gm_pop_wide %>% distinct(geo, name)
```

```
# A tibble: 204 × 2
  geo   name               
  <chr> <chr>              
1 afg   Afghanistan        
2 alb   Albania            
3 dza   Algeria            
4 and   Andorra            
5 ago   Angola             
6 atg   Antigua and Barbuda
# … with 198 more rows
```

```r
gm_pop_long <- gm_pop_wide %>% 
  pivot_longer(
    cols = -c(geo, name),
    names_to = "year",
    values_to = "population"
  ) %>% 
  mutate(
    year = parse_number(year)
  )

countries_to_show <- c("China", "India", "United States", "Indonesia", "Pakistan")

#gm_pop_long %>% filter(year == 2021) %>% mutate(pop_rank = rank(population)) %>% arrange(-pop_rank)

#gm_pop_long %>% left_join(gm_pop_long %>% filter(year == 2021) %>% transmute(geo, name, pop_rank = rank(population)))
p <- gm_pop_long %>% 
  filter(between(year, 1900, 2010), name %in% countries_to_show) %>% 
  ggplot(aes(x = year, y = population / 1e6, color = name)) +
    geom_line() +
    scale_x_continuous(expand = expansion(mult = c(0, .2)))
directlabels::direct.label(p, "last.qp")
```

---

# Case Study: is college worth it?

---

```r
# https://www.insidehighered.com/news/2019/06/10/new-data-show-economic-value-earning-bachelors-degree-remains-high
# https://www.newyorkfed.org/medialibrary/media/research/college-labor-market/labor-market-for-recent-college-grads.xlsx?la=en
labor_fed <- read_csv("data/college-labor-data.csv", show_col_types = FALSE)
labor_fed
```

```
# A tibble: 31 × 5
  Date   `Bachelor's degree… `Bachelor's degr… `Bachelor's degre…
  <chr>                <dbl>             <dbl>              <dbl>
1 1/1/90               35369             46311              59434
2 1/1/91               33077             43723              57030
3 1/1/92               33208             44277              55347
4 1/1/93               32250             43000              55542
5 1/1/94               31303             41912              54137
6 1/1/95               29727             40769              52660
# … with 25 more rows, and 1 more variable:
#   High school diploma: median <dbl>
```

```r
labor_fed %>% 
  pivot_longer(
    cols = -Date,
    names_to = "education",
    values_to = "wage"
  ) %>% 
  separate(education, into = c("degree", "measure"), sep = ": ") %>% 
  rename(date = Date) %>% 
  mutate(date = lubridate::parse_date_time(date, "%m/%d/%y!*")) %>% 
  filter(measure == "median") %>% 
  ggplot(aes(x = date, y = wage, color = degree)) + geom_line()
```