Lab 9.1 - dplyr & SQL

library(tidyverse)
library(DBI)

The goal of this exercise is to learn about how to work with databases and particularly the SQL query language.

Loading the Data

We’ll be working with the flights data from our textbook. Here’s the code to connect to the database:

db <- DBI::dbConnect(
  RMariaDB::MariaDB(), 
  dbname = "airlines", 
  host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com", 
  user = "mdsr_public", 
  password = "ImhsmflMDSwR",
  )

And we’ll get references to two specific tables we’ll be working with.

flights <- tbl(db, "flights")
carriers <- tbl(db, "carriers")

Each of those behaves like a tibble (data frame), but it’s actually a reference to remote data.

flights %>% head()

## # Source:   lazy query [?? x 21]
## # Database: mysql
## #   [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##   <int> <int> <int>    <int>          <int>     <int>    <int>          <int>
## 1  2010    10     1        1           2100       181      159           2320
## 2  2010    10     1        1           1920       281      230           2214
## 3  2010    10     1        3           2355         8      339            334
## 4  2010    10     1        5           2200       125       41           2249
## 5  2010    10     1        7           2245        82      104           2347
## 6  2010    10     1        7             10        -3      451            500
## # ... with 13 more variables: arr_delay <int>, carrier <chr>, tailnum <chr>,
## #   flight <int>, origin <chr>, dest <chr>, air_time <int>, distance <int>,
## #   cancelled <int>, diverted <int>, hour <int>, minute <int>, time_hour <dttm>

The tool that makes this possible is called dbplyr. The objects it gives us to work with are lazy: they don’t actually fetch the data until we absolutely need it. They’re lazy queries in that to get the data we’re going to ask the database for data in a language called “SQL”. (Structured Query Language).

We can ask for the query that would be executed using show_query.

flights %>% show_query()

## <SQL>
## SELECT *
## FROM `flights`

flights %>% 
  filter(year == 2016) %>% 
  show_query()

## <SQL>
## SELECT *
## FROM `flights`
## WHERE (`year` = 2016.0)

Filtering Records

🚧 Based only on looking at the above, what SQL expression corresponds to filter?

Summarizing Data

Because queries are lazy, even finding out how many rows is expensive, since we’d have to run that query. So, somewhat surprisingly:

nrow(flights)

## [1] NA

But we can use summary functions!

flights %>% 
  filter(year == 2016, month == 11, distance > 1000) %>% 
  summarize(
    n = n(),
    mean_dep_delay = mean(dep_delay, na.rm = TRUE)
  ) %>% show_query()

## <SQL>
## SELECT COUNT(*) AS `n`, AVG(`dep_delay`) AS `mean_dep_delay`
## FROM `flights`
## WHERE ((`year` = 2016.0) AND (`month` = 11.0) AND (`distance` > 1000.0))

flights %>% 
  filter(year == 2016, month == 11, distance > 2000) %>% 
  summarize(n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE))

## # Source:   lazy query [?? x 2]
## # Database: mysql
## #   [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
##         n mean_dep_delay
##   <int64>          <dbl>
## 1   30366           5.96

🚧 Based on the above, how does SQL express n and mean?

🚧 What years of data are stored in the flights table? Do this by summarizing the min and max of the year column. (Before actually executing this query, get the instructor to verify your query so that you don’t overload the shared server.)

flights %>% 
  summarize(first_year = min(year, na.rm = TRUE), last_year = max(year, na.rm = TRUE)) %>% 
  collect()

## # A tibble: 1 x 2
##   first_year last_year
##        <int>     <int>
## 1       2010      2017

🚧 How many flights arrived in JFK (dest is JFK) in November of 2016?

🚧 In that period (November of 2016 for JFK), how many flights arrived for each carrier? Sort the result by decreasing number of flights.

Grouping Recordsh

🚧 Look at the SQL query for the previous exercise. How does SQL express group_by? How does SQL express arrange?

🚧 In that same period (November of 2016 for JFK), what fraction of flights from each carrier were on time? Consider a flight to be on time if its arr_delay is no more than 15 (minutes).

The easiest way to do this is the “sum-as-count” pattern: sum(arr_delay <= 15).

Joining Tables

🚧 Modify the query that lists flight counts into JFK grouped by carrier to join in the carrier code with the full name of the carrier from the carriers.