class: center, middle, inverse, title-slide # Grammar of Data Transformation ### K Arnold, based on IntroDS.org --- ## Week 3 * Lab 2 end-of-day today * Discussion 1 replies by tomorrow * Hw 2 due Wednesday * Check that your `.md` file is on GitHub * What does each row represent? * How many rows *should* there be? * Office Hours: [kca] Mon 8-9am, Fri 3-4pm; [yk] Wed 4:30-5:30pm --- ## Question-answers > Am I submitting my labs and homework correctly? * See checklist at end of Lab 1. * Main point: Check your `.md` file on GitHub. > When will we get feedback? * General feedback already given in class. * A bit backed up on specific feedback. > Can we make animated plots like in the video/ * Stay tuned for Plotly. --- ## Questions for you - How is week 3 of Fall 2020? - What's working well in DATA 202? What's challenging? -- - How are the readings and prep exercises? - How long are you spending on labs outside of class? - How hard are labs compared with homework? --- ## So far * *R/RStudio/Rmarkdown/Git*: a toolkit for reproducible collaborative analysis and reporting * `ggplot2`: a *Grammar of Graphics* * a language for describing, and building, visualizations * concepts apply to many other toolkits This week: * `dplyr`: a *Grammar of Data Transformation* * basic concepts will show up again in Python (Pandas) and SQL. --- class: center, middle # Data wrangling and summarizing<br> with *dplyr* --- ## A grammar of data wrangling Functions as verbs that manipulate data frames .pull-left[ <img src="img/dplyr-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" /> ] .pull-right[ .midi[ - `select`: pick columns by name - `arrange`: reorder rows - `slice`: pick rows by index(es) - `slice_sample`: randomly sample rows - `filter`: pick rows matching criteria - `distinct`: filter for unique rows - `mutate`: add new variables - `summarize`: reduce variables to values - `pull`: grab a column as a vector - ... (many more) ] ] --- ## Rules of *dplyr* functions - First argument is *always* a data frame - Subsequent arguments say what to do with that data frame - Always return a data frame - Don't modify in place --- ## Bike crashes in NC 2007 - 2014 ```r ncbikecrash <- read_csv("data/ncbikecrash.csv") ``` ```r glimpse(ncbikecrash) ``` ``` ## Rows: 7,467 ## Columns: 53 ## $ object_id <dbl> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1675, … ## $ city <chr> "None - Rural Crash", "Henderson", "None - Rural… ## $ county <chr> "Wayne", "Vance", "Lincoln", "Columbus", "New Ha… ## $ region <chr> "Coastal", "Piedmont", "Piedmont", "Coastal", "C… ## $ development <chr> "Farms, Woods, Pastures", "Residential", "Farms,… ## $ locality <chr> "Rural (<30% Developed)", "Mixed (30% To 70% Dev… ## $ on_road <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURKHEAD … ## $ rural_urban <chr> "Rural", "Urban", "Rural", "Urban", "Urban", "Ru… ## $ speed_limit <chr> "50 - 55 MPH", "30 - 35 MPH", "50 - 55 MPH", … ## $ traffic_control <chr> "No Control Present", "Stop Sign", "Double Yello… ## $ weather <chr> "Clear", "Clear", "Clear", "Rain", "Clear", "Clo… ## $ workzone <chr> "No", "No", "No", "No", "No", "No", "No", "No", … ## $ bike_age <chr> "52", "66", "33", "52", "22", "15", "41", "14", … ## $ bike_age_group <chr> "50-59", "60-69", "30-39", "50-59", "20-24", "11… ## $ bike_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",… ## $ bike_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ bike_direction <chr> "With Traffic", "With Traffic", "With Traffic", … ## $ bike_injury <chr> "B: Evident Injury", "C: Possible Injury", "C: P… ## $ bike_position <chr> "Bike Lane / Paved Shoulder", "Travel Lane", "Tr… ## $ bike_race <chr> "Black", "Black", "White", "Black", "White", "Na… ## $ bike_sex <chr> "Male", "Male", "Male", "Male", "Female", "Male"… ## $ driver_age <chr> "34", NA, "37", "55", "25", "17", NA, "50", "32"… ## $ driver_age_group <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-19", … ## $ driver_alcohol <chr> "No", "Missing", "No", "No", "No", "No", "Missin… ## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, … ## $ driver_est_speed <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-15 mph… ## $ driver_injury <chr> "O: No Injury", "Unknown Injury", "O: No Injury"… ## $ driver_race <chr> "White", "Unknown/Missing", "Hispanic", "Black",… ## $ driver_sex <chr> "Male", NA, "Female", "Male", "Male", "Female", … ## $ driver_vehicle_type <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "Passe… ## $ crash_alcohol <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",… ## $ crash_date <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14DEC201… ## $ crash_day <chr> "Wednesday", "Wednesday", "Sunday", "Saturday", … ## $ crash_group <chr> "Motorist Overtaking Bicyclist", "Bicyclist Fail… ## $ crash_hour <dbl> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22, 12,… ## $ crash_location <chr> "Non-Intersection", "Intersection", "Non-Interse… ## $ crash_month <chr> "December", "November", "November", "December", … ## $ crash_severity <chr> "B: Evident Injury", "C: Possible Injury", "C: P… ## $ crash_time <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13:27:0… ## $ crash_type <chr> "Motorist Overtaking - Undetected Bicyclist", "B… ## $ crash_year <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, … ## $ ambulance_req <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", … ## $ hit_run <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"… ## $ light_condition <chr> "Dark - Roadway Not Lighted", NA, "Dark - Roadwa… ## $ road_character <chr> "Straight - Level", "Straight - Level", "Straigh… ## $ road_class <chr> "State Secondary Route", "Local Street", "US Rou… ## $ road_condition <chr> "Dry", "Dry", "Dry", "Water (Standing, Moving)",… ## $ road_configuration <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unpro… ## $ road_defects <chr> "None", NA, "None", "None", "None", "None", "Non… ## $ road_feature <chr> "No Special Feature", "T-Intersection", "No Spec… ## $ road_surface <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth Asph… ## $ num_lanes <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", "8 la… ## $ geo_point <chr> "35.3336070056, -77.9955023901", "36.3151872016,… ``` --- ## Variables View the names of variables via ```r names(ncbikecrash) ``` ``` ## [1] "object_id" "city" "county" ## [4] "region" "development" "locality" ## [7] "on_road" "rural_urban" "speed_limit" ## [10] "traffic_control" "weather" "workzone" ## [13] "bike_age" "bike_age_group" "bike_alcohol" ## [16] "bike_alcohol_drugs" "bike_direction" "bike_injury" ## [19] "bike_position" "bike_race" "bike_sex" ## [22] "driver_age" "driver_age_group" "driver_alcohol" ## [25] "driver_alcohol_drugs" "driver_est_speed" "driver_injury" ## [28] "driver_race" "driver_sex" "driver_vehicle_type" ## [31] "crash_alcohol" "crash_date" "crash_day" ## [34] "crash_group" "crash_hour" "crash_location" ## [37] "crash_month" "crash_severity" "crash_time" ## [40] "crash_type" "crash_year" "ambulance_req" ## [43] "hit_run" "light_condition" "road_character" ## [46] "road_class" "road_condition" "road_configuration" ## [49] "road_defects" "road_feature" "road_surface" ## [52] "num_lanes" "geo_point" ``` and see detailed descriptions [here](https://introds.org/hw/hw-02/hw-02-bike-crash.html). --- ## Select columns .midi[ ```r select(ncbikecrash, county, bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 Wayne 52 ## 2 Vance 66 ## 3 Lincoln 33 ## 4 Columbus 52 ## 5 New Hanover 22 ## 6 Robeson 15 ## 7 Richmond 41 ## 8 Wake 14 ## 9 Columbus 16 ## 10 Craven 54 ## # … with 7,457 more rows ``` ] -- .question[ What if we wanted to select these columns, and then arrange the data in ascending order of biker age? ] --- ## Data wrangling, step-by-step .midi[ .pull-left[ Select: ```r ncbikecrash %>% select(county, bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 Wayne 52 ## 2 Vance 66 ## 3 Lincoln 33 ## 4 Columbus 52 ## 5 New Hanover 22 ## 6 Robeson 15 ## 7 Richmond 41 ## 8 Wake 14 ## 9 Columbus 16 ## 10 Craven 54 ## # … with 7,457 more rows ``` ] .pull-right[ Select, then arrange: ```r ncbikecrash %>% select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] ] --- class: center, middle # Pipes --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. -- .pull-left[ - Start with the data frame `ncbikecrash` ] .pull-right[ ```r *ncbikecrash %>% select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. .pull-left[ - Start with the data frame `ncbikecrash`, - then we `select` the variables `county` and `bike_age`, ] .pull-right[ ```r ncbikecrash %>% * select(county, bike_age) %>% arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## What is a pipe? In programming, a pipe is a technique for passing information from one process to another. .pull-left[ - Start with the data frame `ncbikecrash`, - then we `select` the variables `county` and `bike_age`, - and then we `arrange` the data frame by `bike_age` in ascending order. ] .pull-right[ ```r ncbikecrash %>% select(county, bike_age) %>% * arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## county bike_age ## <chr> <chr> ## 1 New Hanover 0 ## 2 Carteret 1 ## 3 Guilford 1 ## 4 Pitt 10 ## 5 Cumberland 10 ## 6 Carteret 10 ## 7 Hoke 10 ## 8 Martin 10 ## 9 New Hanover 10 ## 10 Onslow 10 ## # … with 7,457 more rows ``` ] --- ## How does a pipe work? Conventional (nested functions): ```r arrange(select(ncbikecrash, county, bike_age), bike_age) ``` With pipes: ```r ncbikecrash %>% select(county, bike_age) %>% arrange(bike_age) ``` --- ## What about other arguments? Use the dot (`.`) to - send results to a function argument other than first one or - use the previous result for multiple arguments ```r starwars %>% filter(., species == "Human") %>% * lm(mass ~ height, data = .) ``` ``` ## ## Call: ## lm(formula = mass ~ height, data = .) ## ## Coefficients: ## (Intercept) height ## -116.58 1.11 ``` --- ## A note on piping and layering - The `%>%` operator in **dplyr** functions is called the *pipe* operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code. - The `+` operator in **ggplot2** functions is used for "*layering*". This means you create the plot in layers, separated by `+`. - Many of the styling principles are consistent across `%>%` and `+`: - always a space before - always a line break after (for pipelines with more than 2 lines) --- class: center, middle # Data wrangling with dplyr --- class: center, middle [Exercise: Hotel Wrangling](https://rsconnect.calvin.edu:3939/content/105) --- ## `select` to keep variables ```r ncbikecrash %>% * select(locality, speed_limit) ``` ``` ## # A tibble: 7,467 x 2 ## locality speed_limit ## <chr> <chr> ## 1 Rural (<30% Developed) 50 - 55 MPH ## 2 Mixed (30% To 70% Developed) 30 - 35 MPH ## 3 Rural (<30% Developed) 50 - 55 MPH ## 4 Urban (>70% Developed) 30 - 35 MPH ## 5 Urban (>70% Developed) <NA> ## 6 Rural (<30% Developed) 50 - 55 MPH ## 7 Mixed (30% To 70% Developed) 30 - 35 MPH ## 8 Urban (>70% Developed) 30 - 35 MPH ## 9 Rural (<30% Developed) 30 - 35 MPH ## 10 Urban (>70% Developed) 20 - 25 MPH ## # … with 7,457 more rows ``` --- ## `select` to exclude variables ```r ncbikecrash %>% * select(-object_id) ``` ``` ## # A tibble: 7,467 x 52 ## city county region development locality on_road rural_urban speed_limit ## <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural 50 - 55 M… ## 2 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban 30 - 35 M… ## 3 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural 50 - 55 M… ## 4 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban 30 - 35 M… ## 5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban <NA> ## 6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural 50 - 55 M… ## 7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural 30 - 35 M… ## 8 Rale… Wake Piedm… Commercial Urban (… PERSON… Urban 30 - 35 M… ## 9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban 30 - 35 M… ## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban 20 - 25 M… ## # … with 7,457 more rows, and 44 more variables: traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr> ``` --- ## `select` variables with certain characteristics ```r ncbikecrash %>% * select(starts_with("bike_")) ``` ``` ## # A tibble: 7,467 x 9 ## bike_age bike_age_group bike_alcohol bike_alcohol_dr… bike_direction ## <chr> <chr> <chr> <chr> <chr> ## 1 52 50-59 No <NA> With Traffic ## 2 66 60-69 No <NA> With Traffic ## 3 33 30-39 No <NA> With Traffic ## 4 52 50-59 Yes <NA> <NA> ## 5 22 20-24 No <NA> Facing Traffic ## 6 15 11-15 No <NA> With Traffic ## 7 41 40-49 No <NA> Facing Traffic ## 8 14 11-15 No <NA> <NA> ## 9 16 16-19 No <NA> Facing Traffic ## 10 54 50-59 No <NA> With Traffic ## # … with 7,457 more rows, and 4 more variables: bike_injury <chr>, ## # bike_position <chr>, bike_race <chr>, bike_sex <chr> ``` --- ## `select` variables with certain characteristics ```r ncbikecrash %>% * select(ends_with("age")) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 52 34 ## 2 66 <NA> ## 3 33 37 ## 4 52 55 ## 5 22 25 ## 6 15 17 ## 7 41 <NA> ## 8 14 50 ## 9 16 32 ## 10 54 69 ## # … with 7,457 more rows ``` --- ## Select helpers - `starts_with()`: Starts with a prefix - `ends_with()`: Ends with a suffix - `contains()`: Contains a literal string - `num_range()`: Matches a numerical range like x01, x02, x03 - `one_of()`: Matches variable names in a character vector - `everything()`: Matches all variables - `last_col()`: Select last variable, possibly with an offset - `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text) .footnote[ See help for any of these functions for more info, e.g. `?everything`. ] --- ## `arrange` in ascending / descending order .pull-left[ ```r ncbikecrash %>% select(ends_with("age")) %>% * arrange(bike_age) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 0 47 ## 2 1 70+ ## 3 1 61 ## 4 10 30 ## 5 10 19 ## 6 10 22 ## 7 10 18 ## 8 10 27 ## 9 10 53 ## 10 10 <NA> ## # … with 7,457 more rows ``` ] .pull-right[ ```r ncbikecrash %>% select(ends_with("age")) %>% * arrange(desc(bike_age)) ``` ``` ## # A tibble: 7,467 x 2 ## bike_age driver_age ## <chr> <chr> ## 1 9 23 ## 2 9 35 ## 3 9 70+ ## 4 9 41 ## 5 9 53 ## 6 9 18 ## 7 9 45 ## 8 9 19 ## 9 9 70+ ## 10 9 59 ## # … with 7,457 more rows ``` ] --- ## `slice` for certain row numbers First five ```r ncbikecrash %>% * slice(1:5) ``` ``` ## # A tibble: 5 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 1686 None… Wayne Coast… Farms, Woo… Rural (… SR 1915 Rural ## 2 1674 Hend… Vance Piedm… Residential Mixed (… NICHOL… Urban ## 3 1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321 Rural ## 4 1687 Whit… Colum… Coast… Commercial Urban (… W BURK… Urban ## 5 1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban ## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr> ``` --- ## `slice` for certain row numbers Last five ```r last_row <- nrow(ncbikecrash) ncbikecrash %>% * slice((last_row - 4):last_row) ``` ``` ## # A tibble: 5 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 6989 High… Guilf… Piedm… Residential Urban (… <NA> Urban ## 2 6991 Wilm… New H… Coast… Residential Urban (… <NA> Urban ## 3 6995 Kins… Lenoir Coast… Commercial Urban (… <NA> Urban ## 4 6998 Faye… Cumbe… Coast… Residential Urban (… <NA> Urban ## 5 7000 None… Onslow Coast… Farms, Woo… Rural (… <NA> Rural ## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr> ``` --- ## `sample_n` / `sample_frac` for a random sample - `slice_sample`: randomly sample `n = 5` observations ```r ncbikecrash_n5 <- ncbikecrash %>% * slice_sample(n = 5, replace = FALSE) dim(ncbikecrash_n5) ``` ``` ## [1] 5 53 ``` -- - `sample_frac`: randomly sample `prop = 20%` of observations ```r ncbikecrash_perc20 <-ncbikecrash %>% * slice_sample(prop = 0.2, replace = FALSE) dim(ncbikecrash_perc20) ``` ``` ## [1] 1493 53 ``` --- ## `filter` to select a subset of rows Crashes in Durham County ```r ncbikecrash %>% * filter(county == "Durham") ``` ``` ## # A tibble: 340 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 2452 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 2441 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 3 2466 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 4 549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban ## 5 598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban ## 6 603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban ## 7 3974 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 8 7134 Durh… Durham Piedm… Commercial Urban (… <NA> Urban ## 9 1670 Durh… Durham Piedm… Commercial Urban (… INFINI… Urban ## 10 1773 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## # … with 330 more rows, and 45 more variables: speed_limit <chr>, ## # traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>, ## # bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>, ## # bike_direction <chr>, bike_injury <chr>, bike_position <chr>, ## # bike_race <chr>, bike_sex <chr>, driver_age <chr>, driver_age_group <chr>, ## # driver_alcohol <chr>, driver_alcohol_drugs <chr>, driver_est_speed <chr>, ## # driver_injury <chr>, driver_race <chr>, driver_sex <chr>, ## # driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>, ## # crash_day <chr>, crash_group <chr>, crash_hour <dbl>, crash_location <chr>, ## # crash_month <chr>, crash_severity <chr>, crash_time <time>, ## # crash_type <chr>, crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr> ``` --- ## `filter` for many conditions at once Crashes in Durham County where biker is 0-5 years old ```r ncbikecrash %>% * filter( * county == "Durham", * bike_age_group == "0-5" * ) ``` ``` ## # A tibble: 4 x 53 ## object_id city county region development locality on_road rural_urban ## <dbl> <chr> <chr> <chr> <chr> <chr> <chr> <chr> ## 1 4062 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 2 414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban ## 3 3016 Durh… Durham Piedm… Residential Urban (… <NA> Urban ## 4 1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban ## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>, ## # weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>, ## # bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>, ## # bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>, ## # driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>, ## # driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>, ## # driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>, ## # crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>, ## # crash_hour <dbl>, crash_location <chr>, crash_month <chr>, ## # crash_severity <chr>, crash_time <time>, crash_type <chr>, ## # crash_year <dbl>, ambulance_req <chr>, hit_run <chr>, ## # light_condition <chr>, road_character <chr>, road_class <chr>, ## # road_condition <chr>, road_configuration <chr>, road_defects <chr>, ## # road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr> ``` --- ## Logical operators in R operator | definition || operator | definition ------------|------------------------------||--------------|---------------- `<` | less than ||`x` | `y` | `x` OR `y` `<=` | less than or equal to ||`is.na(x)` | test if `x` is `NA` `>` | greater than ||`!is.na(x)` | test if `x` is not `NA` `>=` | greater than or equal to ||`x %in% y` | test if `x` is in `y` `==` | exactly equal to ||`!(x %in% y)` | test if `x` is not in `y` `!=` | not equal to ||`!x` | not `x` `x & y` | `x` AND `y` || | --- .question[ Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`. ] ```r ncbikecrash %>% filter( county ____ "Durham", crash_year ____ 2014, ____ ) ``` --- .question[ Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`. ] ```r ncbikecrash %>% filter( * county != "Durham", * crash_year > 2014, * !is.na(bike_position) ) ``` ``` ## # A tibble: 0 x 53 ## # … with 53 variables: object_id <dbl>, city <chr>, county <chr>, region <chr>, ## # development <chr>, locality <chr>, on_road <chr>, rural_urban <chr>, ## # speed_limit <chr>, traffic_control <chr>, weather <chr>, workzone <chr>, ## # bike_age <chr>, bike_age_group <chr>, bike_alcohol <chr>, ## # bike_alcohol_drugs <chr>, bike_direction <chr>, bike_injury <chr>, ## # bike_position <chr>, bike_race <chr>, bike_sex <chr>, driver_age <chr>, ## # driver_age_group <chr>, driver_alcohol <chr>, driver_alcohol_drugs <chr>, ## # driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>, ## # driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>, ## # crash_date <chr>, crash_day <chr>, crash_group <chr>, crash_hour <dbl>, ## # crash_location <chr>, crash_month <chr>, crash_severity <chr>, ## # crash_time <time>, crash_type <chr>, crash_year <dbl>, ambulance_req <chr>, ## # hit_run <chr>, light_condition <chr>, road_character <chr>, ## # road_class <chr>, road_condition <chr>, road_configuration <chr>, ## # road_defects <chr>, road_feature <chr>, road_surface <chr>, ## # num_lanes <chr>, geo_point <chr> ``` --- ## `distinct` to filter for unique rows ... and `arrange` to order alphabetically .pull-left[ ```r ncbikecrash %>% * distinct(county) %>% arrange(county) ``` ``` ## # A tibble: 101 x 1 ## county ## <chr> ## 1 Alamance ## 2 Alexander ## 3 Alleghany ## 4 Anson ## 5 Ashe ## 6 Avery ## 7 Beaufort ## 8 Bertie ## 9 Bladen ## 10 Brunswick ## # … with 91 more rows ``` ] .pull-right[ ```r ncbikecrash %>% select(county, city) %>% * distinct() %>% arrange(county, city) ``` ``` ## # A tibble: 391 x 2 ## county city ## <chr> <chr> ## 1 Alamance Alamance ## 2 Alamance Burlington ## 3 Alamance Elon ## 4 Alamance Elon College ## 5 Alamance Gibsonville ## 6 Alamance Graham ## 7 Alamance Green Level ## 8 Alamance Mebane ## 9 Alamance None - Rural Crash ## 10 Alexander None - Rural Crash ## # … with 381 more rows ``` ] --- ## Code Style >"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread." > >Hadley Wickham - Recommended: Tidyverse style guide <https://style.tidyverse.org/> --- ## Summary * File names and code chunks: `data-wrangling`, not `Data Wrangling`. * Variable names: `hourly_rides`, not `hourlyRides` or `hourly.rides` or `rides_by_hour_with_weather` * Informative but short. Don't reuse. --- ## Spacing - Put a space before and after all infix operators (=, +, -, <-, etc.), and when naming arguments in function calls - Always put a space after a comma, and never before (just like in regular English) ```r # Good average <- mean(feet / 12 + inches, na.rm = TRUE) # Bad average<-mean(feet/12+inches,na.rm=TRUE) ``` --- ## ggplot2 - Always end a line with `+` - Always indent the next line ```r # Good ggplot(diamonds, mapping = aes(x = price)) + geom_histogram() # Bad ggplot(diamonds,mapping=aes(x=price))+geom_histogram() ``` --- class: center, middle # Tidy data --- ## Tidy data >Happy families are all alike; every unhappy family is unhappy in its own way. > >Leo Tolstoy -- .pull-left[ **Characteristics of tidy data:** - Each variable forms a column. - Each observation forms a row. - Each type of observational unit forms a table. ] -- .pull-right[ **Characteristics of untidy data:** - Varies. ] --- ## .question[ What makes this data not tidy? ] <img src="img/untidy-data/hyperwar-airplanes-on-hand.png" width="90%" style="display: block; margin: auto;" /> .footnote[ Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/untidy-data/hiv-est-prevalence-15-49.png" width="95%" style="display: block; margin: auto;" /> .footnote[ Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data) ] --- .question[ What makes this data not tidy? ] <br> <img src="img/untidy-data/us-general-economic-characteristic-acs-2017.png" width="95%" style="display: block; margin: auto;" /> .footnote[ Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt) ]