class: center, middle, inverse, title-slide # Joining data from multiple sources ### K Arnold, based on DSBox --- ## Logistics * Lab 4 due today * Discussion 2 reply due tomorrow * Not just: "you made a good point" * Some good examples: Krista, Lauren (replies to Kaitlyn) * Final post not officially due till next week, but some done already. Good examples: Advait, Ricky. * hw3 and hw4 due Wed * Quiz 5 on Thursday: no more infinite tries. .center[One month till Election Day! Are you registered to vote?] --- ## Discussion 3: Critique a graph you find Draft proposal: * *Collect* examples of visualizations that have been used to make a political or social argument (whether or not you agree!) * *Post a critique* as Discussion 3 * later (as Homework): replicate the visual yourself, write up your response --- ## A note on `mutate` * Badly named. Think "add_computed_column". * **DON'T** think of it operating on a variable ("mutate the ride's start time"). * **DO** think of it operating on a data frame ("add a column computed by flooring the start time) --- ## Q&A > `count` vs `group_by %>% summarize`? `count(x)` is (mostly) shorthand for `group_by(x) %>% summarize(n = n())` > `select` vs `filter`? Maybe should have been named `select_columns` and `select_rows`. > Where do I get the specific lines of code I need? Good question *only if* you have a clear idea of what you want to do. Sketch out a very specific example of the output you want. --- ## Q&A > What's on the tests? * All assessments are open-everything (except for getting help from other people). * Quizzes will become 1-attempt and timed soon. * Midterm and final are both (mainly) projects. > uh, `lubridate`?? Most confusions I've seen have actually been about data manipulation concepts, not `lubridate` itself. If the assignment instructions don't give you all the syntax and parameters you need, let me know. --- ## Q&A > Cohorts? * If your repo has your own name on it, it's solo. Team name? It's team. > Merge conflicts? * We'll deliberately make one together this week so you know how to deal with it. --- ## Q&A > joins? --- ## Joining data frames * I have a data frame `x` (e.g., Covid confirmed cases) * I want extra information about things in `x` (e.g., population) * Some other table, `y`, has that information. "Joins" let me look it up. * Needs a <span style="background: blue; color: white; padding: 5px;">key</span> : what has to match. Must match *exactly*. <img src="https://raw.githubusercontent.com/gadenbuie/tidyexplain/master/images/static/png/original-dfs.png" width="30%" style="display: block; margin: auto;" /> .small[Graphics thanks to [tidyexplain](https://github.com/gadenbuie/tidyexplain)] --- ## Types of joins What to do when things don't exactly line up? Start with all rows, but: * **full** or **outer**: Leave blanks (`NA`) for mismatches * **inner**: Drop rows with any mismatches * **left** / **right**: Drop rows where one of the sides has a mismatch --- ## Setup .pull-left[ ```r x ``` ``` ## # A tibble: 3 x 2 ## key xdata ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ## 3 3 x3 ``` ```r y ``` ``` ## # A tibble: 3 x 2 ## key ydata ## <dbl> <chr> ## 1 1 y1 ## 2 2 y2 ## 3 4 y4 ``` ] .pull-right[ <!-- --> ] --- ## `full_join()` All rows from both `x` and `y`. Leave `NA` for mismatches. .pull-left[ ```r full_join(x, y, by = "key") ``` ``` ## # A tibble: 4 x 3 ## key xdata ydata ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ## 4 4 <NA> y4 ``` ] .pull-right[ <!-- --> ] --- ## `inner_join()` All matching rows. Drops mismatches. .pull-left[ ```r inner_join(x, y, by = "key") ``` ``` ## # A tibble: 2 x 3 ## key xdata ydata ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ``` ] .pull-right[ <!-- --> ] --- ## `left_join()` All rows from `x`. .pull-left[ ```r left_join(x, y, by = "key") ``` ``` ## # A tibble: 3 x 3 ## key xdata ydata ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 3 x3 <NA> ``` ] .pull-right[ <!-- --> ] --- ## `right_join()` All rows from `y`. .pull-left[ ```r right_join(x, y) ``` ``` ## Joining, by = "key" ``` ``` ## # A tibble: 3 x 3 ## key xdata ydata ## <dbl> <chr> <chr> ## 1 1 x1 y1 ## 2 2 x2 y2 ## 3 4 <NA> y4 ``` ] .pull-right[ <!-- --> ] --- ## Summary - `full_join()`: all rows from both `x` and `y` - `inner_join()`: all *matching* rows from `x` where there are matching values in y. Multiple matches? Return all combinations. - `left_join()`: all rows from x - `right_join()`: all rows from y ^ are called *mutating joins* (by analogy to the `mutate` verb). Sometimes useful: *filtering* joins and *nest* join: - `semi_join()`: include a row from `x` only if there's some match in `y` - `anti_join()`: include a row from `x` only if there's *no* match in `y` - `nest_join()`: get bundles of all matching rows from `y` (most flexible) --- --- ## `semi_join()` .pull-left[ ```r semi_join(x, y, by = "key") ``` ``` ## # A tibble: 2 x 2 ## key xdata ## <dbl> <chr> ## 1 1 x1 ## 2 2 x2 ``` ] .pull-right[ <!-- --> ] --- ## `anti_join()` .pull-left[ ```r anti_join(x, y, by = "key") ``` ``` ## # A tibble: 1 x 2 ## key xdata ## <dbl> <chr> ## 1 3 x3 ``` ] .pull-right[ <!-- --> ] --- .question[ We want to keep all rows and columns from `confirmed_cases` and add a column for corresponding populations from `population`. Which join function should we use? ]