library(tidyverse)
library(DBI)
The goal of this exercise is to learn about how to work with databases and particularly the SQL query language.
We’ll be working with the flights data from our textbook. Here’s the code to connect to the database:
db <- DBI::dbConnect(
RMariaDB::MariaDB(),
dbname = "airlines",
host = "mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com",
user = "mdsr_public",
password = "ImhsmflMDSwR",
)
And we’ll get references to two specific tables we’ll be working with.
flights <- tbl(db, "flights")
carriers <- tbl(db, "carriers")
Each of those behaves like a tibble (data frame), but it’s actually a reference to remote data.
flights %>% head()
## # Source: lazy query [?? x 21]
## # Database: mysql
## # [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <int> <int> <int>
## 1 2010 10 1 1 2100 181 159 2320
## 2 2010 10 1 1 1920 281 230 2214
## 3 2010 10 1 3 2355 8 339 334
## 4 2010 10 1 5 2200 125 41 2249
## 5 2010 10 1 7 2245 82 104 2347
## 6 2010 10 1 7 10 -3 451 500
## # ... with 13 more variables: arr_delay <int>, carrier <chr>, tailnum <chr>,
## # flight <int>, origin <chr>, dest <chr>, air_time <int>, distance <int>,
## # cancelled <int>, diverted <int>, hour <int>, minute <int>, time_hour <dttm>
The tool that makes this possible is called dbplyr. The objects it gives us to work with are lazy: they don’t actually fetch the data until we absolutely need it. They’re lazy queries in that to get the data we’re going to ask the database for data in a language called “SQL”. (Structured Query Language).
We can ask for the query that would be executed using show_query.
flights %>% show_query()
## <SQL>
## SELECT *
## FROM `flights`
flights %>%
filter(year == 2016) %>%
show_query()
## <SQL>
## SELECT *
## FROM `flights`
## WHERE (`year` = 2016.0)
🚧 Based only on looking at the above, what SQL expression corresponds to filter?
Because queries are lazy, even finding out how many rows is expensive, since we’d have to run that query. So, somewhat surprisingly:
nrow(flights)
## [1] NA
But we can use summary functions!
flights %>%
filter(year == 2016, month == 11, distance > 1000) %>%
summarize(
n = n(),
mean_dep_delay = mean(dep_delay, na.rm = TRUE)
) %>% show_query()
## <SQL>
## SELECT COUNT(*) AS `n`, AVG(`dep_delay`) AS `mean_dep_delay`
## FROM `flights`
## WHERE ((`year` = 2016.0) AND (`month` = 11.0) AND (`distance` > 1000.0))
flights %>%
filter(year == 2016, month == 11, distance > 2000) %>%
summarize(n = n(), mean_dep_delay = mean(dep_delay, na.rm = TRUE))
## # Source: lazy query [?? x 2]
## # Database: mysql
## # [mdsr_public@mdsr.cdc7tgkkqd0n.us-east-1.rds.amazonaws.com:NA/airlines]
## n mean_dep_delay
## <int64> <dbl>
## 1 30366 5.96
🚧 Based on the above, how does SQL express n and mean?
🚧 What years of data are stored in the flights table? Do this by summarizing the min and max of the year column. (Before actually executing this query, get the instructor to verify your query so that you don’t overload the shared server.)
flights %>%
summarize(first_year = min(year, na.rm = TRUE), last_year = max(year, na.rm = TRUE)) %>%
collect()
## # A tibble: 1 x 2
## first_year last_year
## <int> <int>
## 1 2010 2017
🚧 How many flights arrived in JFK (dest is JFK) in November of 2016?
🚧 In that period (November of 2016 for JFK), how many flights arrived for each carrier? Sort the result by decreasing number of flights.
🚧 Look at the SQL query for the previous exercise. How does SQL express group_by? How does SQL express arrange?
🚧 In that same period (November of 2016 for JFK), what fraction of flights from each carrier were on time? Consider a flight to be on time if its arr_delay is no more than 15 (minutes).
The easiest way to do this is the “sum-as-count” pattern: sum(arr_delay <= 15).
🚧 Modify the query that lists flight counts into JFK grouped by carrier to join in the carrier code with the full name of the carrier from the carriers.