class: center, middle, inverse, title-slide # Tidying Data ### DATA 202 21FA (thanks: datasciencebox.org) --- ## Q&A > Can we use actual maps instead of scatterplots? Yes, we need a *base layer*. We'll focus more on spatial data in a few weeks. --- ## Review of Exercise 5 Why not just grab `states$population`? --- class: center, middle # Pivoting --- ## Data: Sales .pull-left[ ### .green[We have...] ``` # A tibble: 2 × 4 customer_id item_1 item_2 item_3 <int> <chr> <chr> <chr> 1 1 bread milk banana 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[We want...] ``` # A tibble: 6 × 3 customer_id item_num item <int> <chr> <chr> 1 1 item_1 bread 2 1 item_2 milk 3 1 item_3 banana 4 2 item_1 milk 5 2 item_2 toilet paper 6 2 item_3 <NA> ``` ] --- ## A grammar of data tidying .pull-left[ <img src="img/tidyr-part-of-tidyverse.png" width="60%" style="display: block; margin: auto;" /> ] .pull-right[ The goal of tidyr is to help you tidy your data via - pivoting for going between wide and long data - splitting and combining character columns - nesting and unnesting columns - clarifying how `NA`s should be treated ] --- class: middle # Pivoting data --- ## Not this... <img src="img/pivot.gif" width="70%" style="display: block; margin: auto;" /> --- ## but this! .center[ <img src="img/tidyr-longer-wider.gif" width="45%" style="background-color: #FDF6E3" style="display: block; margin: auto;" /> ] --- ## Wider vs. longer .pull-left[ ### .green[wider] more columns ``` # A tibble: 2 × 4 customer_id item_1 item_2 item_3 <int> <chr> <chr> <chr> 1 1 bread milk banana 2 2 milk toilet paper <NA> ``` ] -- .pull-right[ ### .pink[longer] more rows ``` # A tibble: 6 × 3 customer_id item_num item <int> <chr> <chr> 1 1 item_1 bread 2 1 item_2 milk 3 1 item_3 banana 4 2 item_1 milk 5 2 item_2 toilet paper 6 2 item_3 <NA> ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) ] .pull-right[ ```r pivot_longer( * data, cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format ] .pull-right[ ```r pivot_longer( data, * cols, names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, * names_to = "name", values_to = "value" ) ``` ] --- ## `pivot_longer()` .pull-left[ - `data` (as usual) - `cols`: columns to pivot into longer format - `names_to`: name of the column where column names of pivoted variables go (character string) - `values_to`: name of the column where data in pivoted variables go (character string) ] .pull-right[ ```r pivot_longer( data, cols, names_to = "name", * values_to = "value" ) ``` ] --- ## Customers `\(\rightarrow\)` purchases ```r purchases <- customers %>% * pivot_longer( * cols = item_1:item_3, # variables item_1 to item_3 * names_to = "item_num", # column names -> new column called item_num * values_to = "item" # values in columns -> new column called item * ) purchases ``` ``` # A tibble: 6 × 3 customer_id item_num item <int> <chr> <chr> 1 1 item_1 bread 2 1 item_2 milk 3 1 item_3 banana 4 2 item_1 milk 5 2 item_2 toilet paper 6 2 item_3 <NA> ``` --- ## Why pivot? Most likely, because the next step of your analysis needs it -- .pull-left[ ```r prices ``` ``` # A tibble: 5 × 2 item price <chr> <dbl> 1 avocado 0.5 2 banana 0.15 3 bread 1 4 milk 0.8 5 toilet paper 3 ``` ] .pull-right[ ```r purchases %>% * left_join(prices) ``` ``` # A tibble: 6 × 4 customer_id item_num item price <int> <chr> <chr> <dbl> 1 1 item_1 bread 1 2 1 item_2 milk 0.8 3 1 item_3 banana 0.15 4 2 item_1 milk 0.8 5 2 item_2 toilet paper 3 6 2 item_3 <NA> NA ``` ] --- ## Purchases `\(\rightarrow\)` customers .pull-left-narrow[ - `data` (long) - `names_from`: tells us what column to put each value in - `values_from`: tells us what to put in that column ] .pull-right-wide[ ```r purchases %>% * pivot_wider( * names_from = item_num, * values_from = item * ) ``` ``` # A tibble: 2 × 4 customer_id item_1 item_2 item_3 <int> <chr> <chr> <chr> 1 1 bread milk banana 2 2 milk toilet paper <NA> ``` ] --- --- class: middle # Case study: Biden Approval Rating --- <img src="img/biden-approval.png" width="70%" style="display: block; margin: auto;" /> .footnote[ Source: [FiveThirtyEight](https://projects.fivethirtyeight.com/biden-approval-rating/) ] --- ## Data ```r approval ``` ``` # A tibble: 510 × 4 subgroup date approval disapproval <chr> <date> <dbl> <dbl> 1 Voters 2021-10-04 45.9 48.7 2 Adults 2021-10-04 44.7 46.6 3 Voters 2021-10-03 45.9 48.7 4 Adults 2021-10-03 44.9 48.1 5 Adults 2021-10-02 44.8 47.8 6 Voters 2021-10-02 45.9 48.7 # … with 504 more rows ``` --- ## Goal .pull-left-wide[ <img src="w06d1-tidy_files/figure-html/unnamed-chunk-21-1.png" width="100%" style="display: block; margin: auto;" /> ] -- .pull-right-narrow[ **Aesthetic mappings:** ✅ x = `date` ❌ y = `rating_value` ❌ color = `rating_type` **Facet:** ✅ `subgroup` (Adults and Voters) ] --- ## Pivot ```r approval_longer <- approval %>% pivot_longer( cols = c(approval, disapproval), names_to = "rating_type", values_to = "rating_value" ) approval_longer ``` ``` # A tibble: 1,020 × 4 subgroup date rating_type rating_value <chr> <date> <chr> <dbl> 1 Voters 2021-10-04 approval 45.9 2 Voters 2021-10-04 disapproval 48.7 3 Adults 2021-10-04 approval 44.7 4 Adults 2021-10-04 disapproval 46.6 5 Voters 2021-10-03 approval 45.9 6 Voters 2021-10-03 disapproval 48.7 # … with 1,014 more rows ``` --- ## Plot ```r ggplot(approval_longer, aes(x = date, y = rating_value, color = rating_type)) + geom_line() + facet_wrap(vars(subgroup)) ``` <img src="w06d1-tidy_files/figure-html/unnamed-chunk-23-1.png" width="60%" style="display: block; margin: auto;" /> --- .panelset[ .panel[.panel-name[Code] ```r ggplot(approval_longer, aes(x = date, y = rating_value, color = rating_type)) + geom_line() + facet_wrap(vars(subgroup)) + * scale_color_manual(values = c("darkgreen", "orange")) + * labs( * x = "Date", y = "Rating", * color = NULL, * title = "How (un)popular is Joe Biden?", * subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", * caption = "Source: FiveThirtyEight modeling estimates" * ) ``` ] .panel[.panel-name[Plot] <img src="w06d1-tidy_files/figure-html/unnamed-chunk-24-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- .panelset[ .panel[.panel-name[Code] ```r ggplot(approval_longer, aes(x = date, y = rating_value, color = rating_type)) + geom_line() + facet_wrap(vars(subgroup)) + scale_color_manual(values = c("darkgreen", "orange")) + labs( x = "Date", y = "Rating", color = NULL, title = "How (un)popular is Joe Biden??", subtitle = "Estimates based on polls of all adults and polls of likely/registered voters", caption = "Source: FiveThirtyEight modeling estimates" ) + * theme_minimal() + * theme(legend.position = "bottom") ``` ] .panel[.panel-name[Plot] <img src="w06d1-tidy_files/figure-html/unnamed-chunk-25-1.png" width="75%" style="display: block; margin: auto;" /> ] ] --- class: middle # Case study: Gapminder --- <img src="img/gapminder_pop.png" width="100%" style="display: block; margin: auto;" /> Source: https://www.gapminder.org/data/documentation/gd003/ --- ## We want... ```r gm_pop_wide <- read_csv("data/GM-Population - Dataset - v6 - data-pop-gmv6-in-columns.csv", skip = 2, show_col_types = FALSE) gm_pop_wide %>% distinct(geo, name) ``` ``` # A tibble: 204 × 2 geo name <chr> <chr> 1 afg Afghanistan 2 alb Albania 3 dza Algeria 4 and Andorra 5 ago Angola 6 atg Antigua and Barbuda # … with 198 more rows ``` ```r gm_pop_long <- gm_pop_wide %>% pivot_longer( cols = -c(geo, name), names_to = "year", values_to = "population" ) %>% mutate( year = parse_number(year) ) countries_to_show <- c("China", "India", "United States", "Indonesia", "Pakistan") #gm_pop_long %>% filter(year == 2021) %>% mutate(pop_rank = rank(population)) %>% arrange(-pop_rank) #gm_pop_long %>% left_join(gm_pop_long %>% filter(year == 2021) %>% transmute(geo, name, pop_rank = rank(population))) p <- gm_pop_long %>% filter(between(year, 1900, 2010), name %in% countries_to_show) %>% ggplot(aes(x = year, y = population / 1e6, color = name)) + geom_line() + scale_x_continuous(expand = expansion(mult = c(0, .2))) directlabels::direct.label(p, "last.qp") ``` <img src="w06d1-tidy_files/figure-html/unnamed-chunk-28-1.png" width="60%" style="display: block; margin: auto;" /> --- class: middle # Case Study: is college worth it? --- ```r # https://www.insidehighered.com/news/2019/06/10/new-data-show-economic-value-earning-bachelors-degree-remains-high # https://www.newyorkfed.org/medialibrary/media/research/college-labor-market/labor-market-for-recent-college-grads.xlsx?la=en labor_fed <- read_csv("data/college-labor-data.csv", show_col_types = FALSE) labor_fed ``` ``` # A tibble: 31 × 5 Date `Bachelor's degree… `Bachelor's degr… `Bachelor's degre… <chr> <dbl> <dbl> <dbl> 1 1/1/90 35369 46311 59434 2 1/1/91 33077 43723 57030 3 1/1/92 33208 44277 55347 4 1/1/93 32250 43000 55542 5 1/1/94 31303 41912 54137 6 1/1/95 29727 40769 52660 # … with 25 more rows, and 1 more variable: # High school diploma: median <dbl> ``` ```r labor_fed %>% pivot_longer( cols = -Date, names_to = "education", values_to = "wage" ) %>% separate(education, into = c("degree", "measure"), sep = ": ") %>% rename(date = Date) %>% mutate(date = lubridate::parse_date_time(date, "%m/%d/%y!*")) %>% filter(measure == "median") %>% ggplot(aes(x = date, y = wage, color = degree)) + geom_line() ``` <img src="w06d1-tidy_files/figure-html/unnamed-chunk-31-1.png" width="60%" style="display: block; margin: auto;" />