Grammar of Data Transformation

# Grammar of Data Transformation
### K Arnold, based on IntroDS.org

---

## Week 3

* Lab 2 end-of-day today
* Discussion 1 replies by tomorrow
* Hw 2 due Wednesday
  * Check that your `.md` file is on GitHub
  * What does each row represent?
  * How many rows *should* there be?
* Office Hours: [kca] Mon 8-9am, Fri 3-4pm; [yk] Wed 4:30-5:30pm

---

## Question-answers

> Am I submitting my labs and homework correctly?

* See checklist at end of Lab 1.
* Main point: Check your `.md` file on GitHub.

> When will we get feedback?

* General feedback already given in class.
* A bit backed up on specific feedback.

> Can we make animated plots like in the video/

* Stay tuned for Plotly.

---

## Questions for you

- How is week 3 of Fall 2020?
- What's working well in DATA 202? What's challenging?

- How are the readings and prep exercises?
- How long are you spending on labs outside of class?
- How hard are labs compared with homework?

---

## So far

* *R/RStudio/Rmarkdown/Git*: a toolkit for reproducible collaborative analysis and reporting
* `ggplot2`: a *Grammar of Graphics*
  * a language for describing, and building, visualizations
  * concepts apply to many other toolkits

This week:

* `dplyr`: a *Grammar of Data Transformation*
  * basic concepts will show up again in Python (Pandas) and SQL.

---

# Data wrangling and summarizing<br> with *dplyr*

---

## A grammar of data wrangling

Functions as verbs that manipulate data frames

.pull-left[
<img src="img/dplyr-part-of-tidyverse.png" width="80%" style="display: block; margin: auto;" />
]
.pull-right[
.midi[
- `select`: pick columns by name
- `arrange`: reorder rows
- `slice`: pick rows by index(es)
- `slice_sample`: randomly sample rows
- `filter`: pick rows matching criteria
- `distinct`: filter for unique rows
- `mutate`: add new variables
- `summarize`: reduce variables to values
- `pull`: grab a column as a vector
- ... (many more)
]
]

---

## Rules of *dplyr* functions

- First argument is *always* a data frame
- Subsequent arguments say what to do with that data frame
- Always return a data frame
- Don't modify in place

---

## Bike crashes in NC 2007 - 2014

```r
ncbikecrash <- read_csv("data/ncbikecrash.csv")
```

```r
glimpse(ncbikecrash)
```

```
## Rows: 7,467
## Columns: 53
## $ object_id            <dbl> 1686, 1674, 1673, 1687, 1653, 1665, 1642, 1675, …
## $ city                 <chr> "None - Rural Crash", "Henderson", "None - Rural…
## $ county               <chr> "Wayne", "Vance", "Lincoln", "Columbus", "New Ha…
## $ region               <chr> "Coastal", "Piedmont", "Piedmont", "Coastal", "C…
## $ development          <chr> "Farms, Woods, Pastures", "Residential", "Farms,…
## $ locality             <chr> "Rural (<30% Developed)", "Mixed (30% To 70% Dev…
## $ on_road              <chr> "SR 1915", "NICHOLAS ST", "US 321", "W BURKHEAD …
## $ rural_urban          <chr> "Rural", "Urban", "Rural", "Urban", "Urban", "Ru…
## $ speed_limit          <chr> "50 - 55  MPH", "30 - 35  MPH", "50 - 55  MPH", …
## $ traffic_control      <chr> "No Control Present", "Stop Sign", "Double Yello…
## $ weather              <chr> "Clear", "Clear", "Clear", "Rain", "Clear", "Clo…
## $ workzone             <chr> "No", "No", "No", "No", "No", "No", "No", "No", …
## $ bike_age             <chr> "52", "66", "33", "52", "22", "15", "41", "14", …
## $ bike_age_group       <chr> "50-59", "60-69", "30-39", "50-59", "20-24", "11…
## $ bike_alcohol         <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",…
## $ bike_alcohol_drugs   <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ bike_direction       <chr> "With Traffic", "With Traffic", "With Traffic", …
## $ bike_injury          <chr> "B: Evident Injury", "C: Possible Injury", "C: P…
## $ bike_position        <chr> "Bike Lane / Paved Shoulder", "Travel Lane", "Tr…
## $ bike_race            <chr> "Black", "Black", "White", "Black", "White", "Na…
## $ bike_sex             <chr> "Male", "Male", "Male", "Male", "Female", "Male"…
## $ driver_age           <chr> "34", NA, "37", "55", "25", "17", NA, "50", "32"…
## $ driver_age_group     <chr> "30-39", NA, "30-39", "50-59", "25-29", "0-19", …
## $ driver_alcohol       <chr> "No", "Missing", "No", "No", "No", "No", "Missin…
## $ driver_alcohol_drugs <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, …
## $ driver_est_speed     <chr> "51-55 mph", "6-10 mph", "41-45 mph", "11-15 mph…
## $ driver_injury        <chr> "O: No Injury", "Unknown Injury", "O: No Injury"…
## $ driver_race          <chr> "White", "Unknown/Missing", "Hispanic", "Black",…
## $ driver_sex           <chr> "Male", NA, "Female", "Male", "Male", "Female", …
## $ driver_vehicle_type  <chr> "Single Unit Truck (2-Axle, 6-Tire)", NA, "Passe…
## $ crash_alcohol        <chr> "No", "No", "No", "Yes", "No", "No", "No", "No",…
## $ crash_date           <chr> "11DEC2013", "20NOV2013", "03NOV2013", "14DEC201…
## $ crash_day            <chr> "Wednesday", "Wednesday", "Sunday", "Saturday", …
## $ crash_group          <chr> "Motorist Overtaking Bicyclist", "Bicyclist Fail…
## $ crash_hour           <dbl> 6, 20, 18, 18, 13, 17, 17, 7, 15, 2, 12, 22, 12,…
## $ crash_location       <chr> "Non-Intersection", "Intersection", "Non-Interse…
## $ crash_month          <chr> "December", "November", "November", "December", …
## $ crash_severity       <chr> "B: Evident Injury", "C: Possible Injury", "C: P…
## $ crash_time           <time> 06:10:00, 20:41:00, 18:05:00, 18:34:00, 13:27:0…
## $ crash_type           <chr> "Motorist Overtaking - Undetected Bicyclist", "B…
## $ crash_year           <dbl> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013, …
## $ ambulance_req        <chr> "Yes", "No", "Yes", "Yes", "Yes", "Yes", "Yes", …
## $ hit_run              <chr> "No", "Yes", "No", "No", "No", "No", "Yes", "No"…
## $ light_condition      <chr> "Dark - Roadway Not Lighted", NA, "Dark - Roadwa…
## $ road_character       <chr> "Straight - Level", "Straight - Level", "Straigh…
## $ road_class           <chr> "State Secondary Route", "Local Street", "US Rou…
## $ road_condition       <chr> "Dry", "Dry", "Dry", "Water (Standing, Moving)",…
## $ road_configuration   <chr> "Two-Way, Not Divided", "Two-Way, Divided, Unpro…
## $ road_defects         <chr> "None", NA, "None", "None", "None", "None", "Non…
## $ road_feature         <chr> "No Special Feature", "T-Intersection", "No Spec…
## $ road_surface         <chr> "Coarse Asphalt", "Smooth Asphalt", "Smooth Asph…
## $ num_lanes            <chr> "2 lanes", "2 lanes", "2 lanes", "1 lane", "8 la…
## $ geo_point            <chr> "35.3336070056, -77.9955023901", "36.3151872016,…
```

---

## Variables

View the names of variables via

```r
names(ncbikecrash)
```

```
##  [1] "object_id"            "city"                 "county"              
##  [4] "region"               "development"          "locality"            
##  [7] "on_road"              "rural_urban"          "speed_limit"         
## [10] "traffic_control"      "weather"              "workzone"            
## [13] "bike_age"             "bike_age_group"       "bike_alcohol"        
## [16] "bike_alcohol_drugs"   "bike_direction"       "bike_injury"         
## [19] "bike_position"        "bike_race"            "bike_sex"            
## [22] "driver_age"           "driver_age_group"     "driver_alcohol"      
## [25] "driver_alcohol_drugs" "driver_est_speed"     "driver_injury"       
## [28] "driver_race"          "driver_sex"           "driver_vehicle_type" 
## [31] "crash_alcohol"        "crash_date"           "crash_day"           
## [34] "crash_group"          "crash_hour"           "crash_location"      
## [37] "crash_month"          "crash_severity"       "crash_time"          
## [40] "crash_type"           "crash_year"           "ambulance_req"       
## [43] "hit_run"              "light_condition"      "road_character"      
## [46] "road_class"           "road_condition"       "road_configuration"  
## [49] "road_defects"         "road_feature"         "road_surface"        
## [52] "num_lanes"            "geo_point"
```

and see detailed descriptions [here](https://introds.org/hw/hw-02/hw-02-bike-crash.html).

---

## Select columns

```r
select(ncbikecrash, county, bike_age)
```

```
## # A tibble: 7,467 x 2
##    county      bike_age
##    <chr>       <chr>   
##  1 Wayne       52      
##  2 Vance       66      
##  3 Lincoln     33      
##  4 Columbus    52      
##  5 New Hanover 22      
##  6 Robeson     15      
##  7 Richmond    41      
##  8 Wake        14      
##  9 Columbus    16      
## 10 Craven      54      
## # … with 7,457 more rows
```
]

.question[
What if we wanted to select these columns, and then arrange the data in ascending order of biker age?
]

---

## Data wrangling, step-by-step

```r
ncbikecrash %>%
  select(county, bike_age)
```

```r
ncbikecrash %>%
  select(county, bike_age) %>%
  arrange(bike_age)
```

```
## # A tibble: 7,467 x 2
##    county      bike_age
##    <chr>       <chr>   
##  1 New Hanover 0       
##  2 Carteret    1       
##  3 Guilford    1       
##  4 Pitt        10      
##  5 Cumberland  10      
##  6 Carteret    10      
##  7 Hoke        10      
##  8 Martin      10      
##  9 New Hanover 10      
## 10 Onslow      10      
## # … with 7,457 more rows
```
]
]

---

# Pipes

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

```r
*ncbikecrash %>%
  select(county, bike_age) %>%
  arrange(bike_age)
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `ncbikecrash`,
- then we `select` the variables `county` and `bike_age`,
]
.pull-right[

```r
ncbikecrash %>%
* select(county, bike_age) %>%
  arrange(bike_age)
```

---

## What is a pipe?

In programming, a pipe is a technique for passing information from one process to another.

.pull-left[
- Start with the data frame `ncbikecrash`,
- then we `select` the variables `county` and `bike_age`,
- and then we `arrange` the data frame by `bike_age` in ascending order.
]
.pull-right[

```r
ncbikecrash %>%
  select(county, bike_age) %>%
* arrange(bike_age)
```

---

## How does a pipe work?

Conventional (nested functions):

```r
arrange(select(ncbikecrash, county, bike_age), bike_age)
```

With pipes:

```r
ncbikecrash %>%
  select(county, bike_age) %>%
  arrange(bike_age)
```

---

## What about other arguments?

Use the dot (`.`) to

- send results to a function argument other than first one or 
- use the previous result for multiple arguments

```r
starwars %>%
  filter(., species == "Human") %>%
* lm(mass ~ height, data = .)
```

```
## 
## Call:
## lm(formula = mass ~ height, data = .)
## 
## Coefficients:
## (Intercept)       height  
##     -116.58         1.11
```

---

## A note on piping and layering

- The `%>%` operator in **dplyr** functions is called the *pipe* operator. This means you "pipe" the output of the previous line of code as the first input of the next line of code.
- The `+` operator in **ggplot2** functions is used for "*layering*". This means you create the plot in layers, separated by `+`.
- Many of the styling principles are consistent across `%>%` and `+`:
  - always a space before
  - always a line break after (for pipelines with more than 2 lines)

---

# Data wrangling with dplyr

---

[Exercise: Hotel Wrangling](https://rsconnect.calvin.edu:3939/content/105)

---

## `select` to keep variables

```r
ncbikecrash %>%
* select(locality, speed_limit)
```

```
## # A tibble: 7,467 x 2
##    locality                     speed_limit 
##    <chr>                        <chr>       
##  1 Rural (<30% Developed)       50 - 55  MPH
##  2 Mixed (30% To 70% Developed) 30 - 35  MPH
##  3 Rural (<30% Developed)       50 - 55  MPH
##  4 Urban (>70% Developed)       30 - 35  MPH
##  5 Urban (>70% Developed)       <NA>        
##  6 Rural (<30% Developed)       50 - 55  MPH
##  7 Mixed (30% To 70% Developed) 30 - 35  MPH
##  8 Urban (>70% Developed)       30 - 35  MPH
##  9 Rural (<30% Developed)       30 - 35  MPH
## 10 Urban (>70% Developed)       20 - 25  MPH
## # … with 7,457 more rows
```

---

## `select` to exclude variables

```r
ncbikecrash %>%
* select(-object_id)
```

```
## # A tibble: 7,467 x 52
##    city  county region development locality on_road rural_urban speed_limit
##    <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>       <chr>      
##  1 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural       50 - 55  M…
##  2 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban       30 - 35  M…
##  3 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural       50 - 55  M…
##  4 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban       30 - 35  M…
##  5 Wilm… New H… Coast… Residential Urban (… RACINE… Urban       <NA>       
##  6 None… Robes… Coast… Farms, Woo… Rural (… SR 1513 Rural       50 - 55  M…
##  7 None… Richm… Piedm… Residential Mixed (… SR 1903 Rural       30 - 35  M…
##  8 Rale… Wake   Piedm… Commercial  Urban (… PERSON… Urban       30 - 35  M…
##  9 Whit… Colum… Coast… Residential Rural (… FLOWER… Urban       30 - 35  M…
## 10 New … Craven Coast… Residential Urban (… SUTTON… Urban       20 - 25  M…
## # … with 7,457 more rows, and 44 more variables: traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## #   driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## #   driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## #   crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `select` variables with certain characteristics

```r
ncbikecrash %>%
* select(starts_with("bike_"))
```

```
## # A tibble: 7,467 x 9
##    bike_age bike_age_group bike_alcohol bike_alcohol_dr… bike_direction
##    <chr>    <chr>          <chr>        <chr>            <chr>         
##  1 52       50-59          No           <NA>             With Traffic  
##  2 66       60-69          No           <NA>             With Traffic  
##  3 33       30-39          No           <NA>             With Traffic  
##  4 52       50-59          Yes          <NA>             <NA>          
##  5 22       20-24          No           <NA>             Facing Traffic
##  6 15       11-15          No           <NA>             With Traffic  
##  7 41       40-49          No           <NA>             Facing Traffic
##  8 14       11-15          No           <NA>             <NA>          
##  9 16       16-19          No           <NA>             Facing Traffic
## 10 54       50-59          No           <NA>             With Traffic  
## # … with 7,457 more rows, and 4 more variables: bike_injury <chr>,
## #   bike_position <chr>, bike_race <chr>, bike_sex <chr>
```

---

## `select` variables with certain characteristics

```r
ncbikecrash %>%
* select(ends_with("age"))
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 52       34        
##  2 66       <NA>      
##  3 33       37        
##  4 52       55        
##  5 22       25        
##  6 15       17        
##  7 41       <NA>      
##  8 14       50        
##  9 16       32        
## 10 54       69        
## # … with 7,457 more rows
```

---

## Select helpers

- `starts_with()`: Starts with a prefix
- `ends_with()`: Ends with a suffix
- `contains()`: Contains a literal string
- `num_range()`: Matches a numerical range like x01, x02, x03
- `one_of()`: Matches variable names in a character vector
- `everything()`: Matches all variables
- `last_col()`: Select last variable, possibly with an offset
- `matches()`: Matches a regular expression (a sequence of symbols/characters expressing a string/pattern to be searched for within text)

---

## `arrange` in ascending / descending order

```r
ncbikecrash %>%
  select(ends_with("age")) %>%
* arrange(bike_age)
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 0        47        
##  2 1        70+       
##  3 1        61        
##  4 10       30        
##  5 10       19        
##  6 10       22        
##  7 10       18        
##  8 10       27        
##  9 10       53        
## 10 10       <NA>      
## # … with 7,457 more rows
```
]
.pull-right[

```r
ncbikecrash %>%
  select(ends_with("age")) %>%
* arrange(desc(bike_age))
```

```
## # A tibble: 7,467 x 2
##    bike_age driver_age
##    <chr>    <chr>     
##  1 9        23        
##  2 9        35        
##  3 9        70+       
##  4 9        41        
##  5 9        53        
##  6 9        18        
##  7 9        45        
##  8 9        19        
##  9 9        70+       
## 10 9        59        
## # … with 7,457 more rows
```
]

---

## `slice` for certain row numbers

First five

```r
ncbikecrash %>%
* slice(1:5)
```

```
## # A tibble: 5 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      1686 None… Wayne  Coast… Farms, Woo… Rural (… SR 1915 Rural      
## 2      1674 Hend… Vance  Piedm… Residential Mixed (… NICHOL… Urban      
## 3      1673 None… Linco… Piedm… Farms, Woo… Rural (… US 321  Rural      
## 4      1687 Whit… Colum… Coast… Commercial  Urban (… W BURK… Urban      
## 5      1653 Wilm… New H… Coast… Residential Urban (… RACINE… Urban      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## #   driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## #   driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## #   crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `slice` for certain row numbers

Last five

```r
last_row <- nrow(ncbikecrash)
ncbikecrash %>%
* slice((last_row - 4):last_row)
```

```
## # A tibble: 5 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      6989 High… Guilf… Piedm… Residential Urban (… <NA>    Urban      
## 2      6991 Wilm… New H… Coast… Residential Urban (… <NA>    Urban      
## 3      6995 Kins… Lenoir Coast… Commercial  Urban (… <NA>    Urban      
## 4      6998 Faye… Cumbe… Coast… Residential Urban (… <NA>    Urban      
## 5      7000 None… Onslow Coast… Farms, Woo… Rural (… <NA>    Rural      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## #   driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## #   driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## #   crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `sample_n` / `sample_frac` for a random sample

- `slice_sample`: randomly sample `n = 5` observations

```r
ncbikecrash_n5 <- ncbikecrash %>%
* slice_sample(n = 5, replace = FALSE)
dim(ncbikecrash_n5)
```

```
## [1]  5 53
```

- `sample_frac`: randomly sample `prop = 20%` of observations

```r
ncbikecrash_perc20 <-ncbikecrash %>%
* slice_sample(prop = 0.2, replace = FALSE)
dim(ncbikecrash_perc20)
```

```
## [1] 1493   53
```

---

## `filter` to select a subset of rows

Crashes in Durham County

```r
ncbikecrash %>%
* filter(county == "Durham")
```

```
## # A tibble: 340 x 53
##    object_id city  county region development locality on_road rural_urban
##        <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
##  1      2452 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
##  2      2441 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  3      2466 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  4       549 Durh… Durham Piedm… Residential Urban (… PARK A… Urban      
##  5       598 Durh… Durham Piedm… Residential Urban (… BELT S… Urban      
##  6       603 Durh… Durham Piedm… Residential Urban (… HINSON… Urban      
##  7      3974 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  8      7134 Durh… Durham Piedm… Commercial  Urban (… <NA>    Urban      
##  9      1670 Durh… Durham Piedm… Commercial  Urban (… INFINI… Urban      
## 10      1773 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## # … with 330 more rows, and 45 more variables: speed_limit <chr>,
## #   traffic_control <chr>, weather <chr>, workzone <chr>, bike_age <chr>,
## #   bike_age_group <chr>, bike_alcohol <chr>, bike_alcohol_drugs <chr>,
## #   bike_direction <chr>, bike_injury <chr>, bike_position <chr>,
## #   bike_race <chr>, bike_sex <chr>, driver_age <chr>, driver_age_group <chr>,
## #   driver_alcohol <chr>, driver_alcohol_drugs <chr>, driver_est_speed <chr>,
## #   driver_injury <chr>, driver_race <chr>, driver_sex <chr>,
## #   driver_vehicle_type <chr>, crash_alcohol <chr>, crash_date <chr>,
## #   crash_day <chr>, crash_group <chr>, crash_hour <dbl>, crash_location <chr>,
## #   crash_month <chr>, crash_severity <chr>, crash_time <time>,
## #   crash_type <chr>, crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## `filter` for many conditions at once

Crashes in Durham County where biker is 0-5 years old

```r
ncbikecrash %>%
* filter(
*   county == "Durham",
*   bike_age_group == "0-5"
*   )
```

```
## # A tibble: 4 x 53
##   object_id city  county region development locality on_road rural_urban
##       <dbl> <chr> <chr>  <chr>  <chr>       <chr>    <chr>   <chr>      
## 1      4062 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 2       414 Durh… Durham Piedm… Residential Urban (… PVA 90… Urban      
## 3      3016 Durh… Durham Piedm… Residential Urban (… <NA>    Urban      
## 4      1383 Durh… Durham Piedm… Residential Urban (… PVA 62… Urban      
## # … with 45 more variables: speed_limit <chr>, traffic_control <chr>,
## #   weather <chr>, workzone <chr>, bike_age <chr>, bike_age_group <chr>,
## #   bike_alcohol <chr>, bike_alcohol_drugs <chr>, bike_direction <chr>,
## #   bike_injury <chr>, bike_position <chr>, bike_race <chr>, bike_sex <chr>,
## #   driver_age <chr>, driver_age_group <chr>, driver_alcohol <chr>,
## #   driver_alcohol_drugs <chr>, driver_est_speed <chr>, driver_injury <chr>,
## #   driver_race <chr>, driver_sex <chr>, driver_vehicle_type <chr>,
## #   crash_alcohol <chr>, crash_date <chr>, crash_day <chr>, crash_group <chr>,
## #   crash_hour <dbl>, crash_location <chr>, crash_month <chr>,
## #   crash_severity <chr>, crash_time <time>, crash_type <chr>,
## #   crash_year <dbl>, ambulance_req <chr>, hit_run <chr>,
## #   light_condition <chr>, road_character <chr>, road_class <chr>,
## #   road_condition <chr>, road_configuration <chr>, road_defects <chr>,
## #   road_feature <chr>, road_surface <chr>, num_lanes <chr>, geo_point <chr>
```

---

## Logical operators in R

operator    | definition                   || operator     | definition
------------|------------------------------||--------------|----------------
`<`         | less than                    ||`x`&nbsp;&#124;&nbsp;`y`     | `x` OR `y` 
`<=`        |	less than or equal to        ||`is.na(x)`    | test if `x` is `NA`
`>`         | greater than                 ||`!is.na(x)`   | test if `x` is not `NA`
`>=`        |	greater than or equal to     ||`x %in% y`    | test if `x` is in `y`
`==`        |	exactly equal to             ||`!(x %in% y)` | test if `x` is not in `y`
`!=`        |	not equal to                 ||`!x`          | not `x`
`x & y`     | `x` AND `y`                  ||              |

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
    county ____ "Durham", 
    crash_year ____ 2014,
    ____
    )
```

---

.question[
Fill in the blanks for filtering for crashes **not** in Durham County where crash year is after 2014 and `bike_position` is not `NA`.
]

```r
ncbikecrash %>%
  filter(
*   county != "Durham",
*   crash_year > 2014,
*   !is.na(bike_position)
    )
```

```
## # A tibble: 0 x 53
## # … with 53 variables: object_id <dbl>, city <chr>, county <chr>, region <chr>,
## #   development <chr>, locality <chr>, on_road <chr>, rural_urban <chr>,
## #   speed_limit <chr>, traffic_control <chr>, weather <chr>, workzone <chr>,
## #   bike_age <chr>, bike_age_group <chr>, bike_alcohol <chr>,
## #   bike_alcohol_drugs <chr>, bike_direction <chr>, bike_injury <chr>,
## #   bike_position <chr>, bike_race <chr>, bike_sex <chr>, driver_age <chr>,
## #   driver_age_group <chr>, driver_alcohol <chr>, driver_alcohol_drugs <chr>,
## #   driver_est_speed <chr>, driver_injury <chr>, driver_race <chr>,
## #   driver_sex <chr>, driver_vehicle_type <chr>, crash_alcohol <chr>,
## #   crash_date <chr>, crash_day <chr>, crash_group <chr>, crash_hour <dbl>,
## #   crash_location <chr>, crash_month <chr>, crash_severity <chr>,
## #   crash_time <time>, crash_type <chr>, crash_year <dbl>, ambulance_req <chr>,
## #   hit_run <chr>, light_condition <chr>, road_character <chr>,
## #   road_class <chr>, road_condition <chr>, road_configuration <chr>,
## #   road_defects <chr>, road_feature <chr>, road_surface <chr>,
## #   num_lanes <chr>, geo_point <chr>
```

---

## `distinct` to filter for unique rows

... and `arrange` to order alphabetically

```r
ncbikecrash %>% 
* distinct(county) %>%
  arrange(county)
```

```
## # A tibble: 101 x 1
##    county   
##    <chr>    
##  1 Alamance 
##  2 Alexander
##  3 Alleghany
##  4 Anson    
##  5 Ashe     
##  6 Avery    
##  7 Beaufort 
##  8 Bertie   
##  9 Bladen   
## 10 Brunswick
## # … with 91 more rows
```
]
.pull-right[

```r
ncbikecrash %>% 
  select(county, city) %>% 
* distinct() %>%
  arrange(county, city)
```

```
## # A tibble: 391 x 2
##    county    city              
##    <chr>     <chr>             
##  1 Alamance  Alamance          
##  2 Alamance  Burlington        
##  3 Alamance  Elon              
##  4 Alamance  Elon College      
##  5 Alamance  Gibsonville       
##  6 Alamance  Graham            
##  7 Alamance  Green Level       
##  8 Alamance  Mebane            
##  9 Alamance  None - Rural Crash
## 10 Alexander None - Rural Crash
## # … with 381 more rows
```
]

---

## Code Style

>"Good coding style is like correct punctuation: you can manage without it, butitsuremakesthingseasiertoread."
>
>Hadley Wickham

- Recommended: Tidyverse style guide <https://style.tidyverse.org/>

---

## Summary

* File names and code chunks: `data-wrangling`, not `Data Wrangling`.
* Variable names: `hourly_rides`, not `hourlyRides` or `hourly.rides` or `rides_by_hour_with_weather`
  * Informative but short. Don't reuse.

---

## Spacing

- Put a space before and after all infix operators (=, +, -, <-, etc.), and when 
naming arguments in function calls
- Always put a space after a comma, and never before (just like in regular English)

```r
# Good
average <- mean(feet / 12 + inches, na.rm = TRUE)

# Bad
average<-mean(feet/12+inches,na.rm=TRUE)
```

---

## ggplot2

- Always end a line with `+`
- Always indent the next line

```r
# Good
ggplot(diamonds, mapping = aes(x = price)) +
  geom_histogram()

# Bad
ggplot(diamonds,mapping=aes(x=price))+geom_histogram()
```

---

# Tidy data

---

## Tidy data

>Happy families are all alike; every unhappy family is unhappy in its own way. 
>
>Leo Tolstoy

- Each variable forms a column.
- Each observation forms a row.
- Each type of observational unit forms a table.
]
--
.pull-right[
**Characteristics of untidy data:**

- Varies.
]

---

.footnote[
Source: [Army Air Forces Statistical Digest, WW II](https://www.ibiblio.org/hyperwar/AAF/StatDigest/aafsd-3.html)
]

---

<br>

.footnote[
Source: [Gapminder, Estimated HIV prevalence among 15-49 year olds](https://www.gapminder.org/data)
]

---

<br>

.footnote[
Source: [US Census Fact Finder, General Economic Characteristics, ACS 2017](https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_17_5YR_DP03&src=pt)
]