library(tidyverse)

This document describes a simple analysis of the well-known nycflights13 dataset, which includes data on all flights departing from NYC airports in 2013.

The Data

We start by load the data, which is pre-packaged in the nycflights13 package. There are five dataframes, often called tables.

library(nycflights13)

flights
## # A tibble: 336,776 x 19
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
airports
## # A tibble: 1,458 x 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/~
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/~
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/~
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/~
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/~
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/~
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/~
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/~
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/~
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/~
## # ... with 1,448 more rows
airlines
## # A tibble: 16 x 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
planes
## # A tibble: 3,322 x 9
##    tailnum  year type              manufacturer model engines seats speed engine
##    <chr>   <int> <chr>             <chr>        <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  2 N102UW   1998 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  3 N103US   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  4 N104UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  5 N10575   2002 Fixed wing multi~ EMBRAER      EMB-~       2    55    NA Turbo~
##  6 N105UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  7 N107US   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  8 N108UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
##  9 N109UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
## 10 N110UW   1999 Fixed wing multi~ AIRBUS INDU~ A320~       2   182    NA Turbo~
## # ... with 3,312 more rows
weather
## # A tibble: 26,115 x 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4 
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5 
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7 
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7 
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5 
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0 
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4 
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0 
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8 
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## #   precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>

These are all flights from the New-York-area airports (Newark, La Guardia, & John F. Kennedy).

flights %>%
  distinct(origin)
## # A tibble: 3 x 1
##   origin
##   <chr> 
## 1 EWR   
## 2 LGA   
## 3 JFK
flights %>%
  distinct(year)
## # A tibble: 1 x 1
##    year
##   <int>
## 1  2013

The Structure of the Data

As datasets grow in complexity, they are increasingly difficult to visualize. To help understand the structure of the data, here is what is known as a schema (i.e., structure) of the data tables. It depicts the separate dataframes and the relationships between them.

NYC Flights Database Schema

Key Values

The schema marks the primary key, if any, for each table. Primary key values uniquely identify one record in the table. They cannot be NA/NULL, but they can comprise more than one field. Note that not all these tables have primary keys.

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
## # A tibble: 3 x 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2
flights %>%
  count(year, month, day, hour, flight) %>% 
  filter(n > 1)
## # A tibble: 2,945 x 6
##     year month   day  hour flight     n
##    <int> <int> <int> <dbl>  <int> <int>
##  1  2013     1     1     6    303     2
##  2  2013     1     1     7     11     2
##  3  2013     1     1     7    715     2
##  4  2013     1     1     8    717     2
##  5  2013     1     1    12     32     2
##  6  2013     1     1    12   1443     2
##  7  2013     1     1    15   4105     2
##  8  2013     1     1    16    702     2
##  9  2013     1     1    19     87     2
## 10  2013     1     1    19    645     2
## # ... with 2,935 more rows

The schema also marks the foreign keys, which are fields in other (i.e., “foreign”) tables that reference a table’s primary key. This is how inter-table relationships are specified in multi-table datasets. It is important that the foreign key references a primary key because there needs to be a unique row in the “foreign” table to match. Foreign key values can be reused in the “foreign” table and they can be NA/NULL.

Join Queries

It is efficient to store data in separate tables because it allows us to represent important data only once in the dataset. For example, it’s much better to store the names and locations of the airports in one, central place rather than storing them for every flight that arrives or departs from that airport.

However, when analyzing data, we often need to “join” these table back together, e.g., to find the location of the the destination airport in the flights table.

1-to-Many Joins

Here we try to find all the flights to Gerald R. Ford airport.

If we happened to know the three-character code for the Ford airport (GRR), then we could simply use the flights table. If we don’t, however, we need to join the flights and airports tables to relate flights with the associated airports (including their name).

flights_airports <- inner_join(flights, airports, by = c("dest" = "faa"))
flights_airports
## # A tibble: 329,174 x 26
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      554            600        -6      812            837
##  5  2013     1     1      554            558        -4      740            728
##  6  2013     1     1      555            600        -5      913            854
##  7  2013     1     1      557            600        -3      709            723
##  8  2013     1     1      557            600        -3      838            846
##  9  2013     1     1      558            600        -2      753            745
## 10  2013     1     1      558            600        -2      849            851
## # ... with 329,164 more rows, and 18 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## #   tzone <chr>

We can then query the flights to the Ford airport by airport name.

flights_airports %>%
  filter(str_detect(name, "Ford")) %>%
  select(origin, year, month, day, carrier, name)
## # A tibble: 765 x 6
##    origin  year month   day carrier name              
##    <chr>  <int> <int> <int> <chr>   <chr>             
##  1 EWR     2013     1     1 EV      Gerald R Ford Intl
##  2 EWR     2013     1     1 EV      Gerald R Ford Intl
##  3 EWR     2013     1     2 EV      Gerald R Ford Intl
##  4 LGA     2013     1     2 9E      Gerald R Ford Intl
##  5 EWR     2013     1     2 EV      Gerald R Ford Intl
##  6 EWR     2013     1     2 EV      Gerald R Ford Intl
##  7 EWR     2013     1     3 EV      Gerald R Ford Intl
##  8 LGA     2013     1     3 9E      Gerald R Ford Intl
##  9 EWR     2013     1     3 EV      Gerald R Ford Intl
## 10 EWR     2013     1     3 EV      Gerald R Ford Intl
## # ... with 755 more rows

Many-to-Many Joins

Here, we try to find all the flights to the GR Ford airport operated by Endeavor Airlines.

Again, if we happened to know the code for Endeavor Airlines (EV), then we could just use the flights table. If we don’t, however, we need to join the flights and airlines tables to relate flights with the associated airlines (including their name).

flights_airports_airlines <- flights %>%
  left_join(airports, 
            by = c("dest" = "faa")
            ) %>%
  left_join(airlines, 
            by = c("carrier" = "carrier"), 
            suffix = c(".airport", ".airline"),
            )
flights_airports_airlines
## # A tibble: 336,776 x 27
##     year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##    <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
##  1  2013     1     1      517            515         2      830            819
##  2  2013     1     1      533            529         4      850            830
##  3  2013     1     1      542            540         2      923            850
##  4  2013     1     1      544            545        -1     1004           1022
##  5  2013     1     1      554            600        -6      812            837
##  6  2013     1     1      554            558        -4      740            728
##  7  2013     1     1      555            600        -5      913            854
##  8  2013     1     1      557            600        -3      709            723
##  9  2013     1     1      557            600        -3      838            846
## 10  2013     1     1      558            600        -2      753            745
## # ... with 336,766 more rows, and 19 more variables: arr_delay <dbl>,
## #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## #   name.airport <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## #   tzone <chr>, name.airline <chr>

Here, we needed to specify which keys should match up for the joins. The first join matches flights$dest with airports$faa, the second flights$carrier with airlines$carrier.

We also needed to deal with the issue of name conflicts. Both the airports and airlines tables have a name field. We add the suffix to the last query to distinguish name.airport and name.airline.

Now, we can query flights to Ford airport by Endeavor by name.

flights_airports_airlines %>%
  filter(
    str_detect(name.airport, "Ford"),
    str_detect(name.airline, "Endeavor"),
    ) %>%
  select(year, month, day, carrier, name.airport, name.airline)
## # A tibble: 44 x 6
##     year month   day carrier name.airport       name.airline     
##    <int> <int> <int> <chr>   <chr>              <chr>            
##  1  2013     1     2 9E      Gerald R Ford Intl Endeavor Air Inc.
##  2  2013     1     3 9E      Gerald R Ford Intl Endeavor Air Inc.
##  3  2013     1     4 9E      Gerald R Ford Intl Endeavor Air Inc.
##  4  2013     1     7 9E      Gerald R Ford Intl Endeavor Air Inc.
##  5  2013     1     8 9E      Gerald R Ford Intl Endeavor Air Inc.
##  6  2013     1     9 9E      Gerald R Ford Intl Endeavor Air Inc.
##  7  2013     1    10 9E      Gerald R Ford Intl Endeavor Air Inc.
##  8  2013     1    11 9E      Gerald R Ford Intl Endeavor Air Inc.
##  9  2013     1    14 9E      Gerald R Ford Intl Endeavor Air Inc.
## 10  2013     1    15 9E      Gerald R Ford Intl Endeavor Air Inc.
## # ... with 34 more rows

Assessing Delays

We new attempt to find patterns in delays on flights departing from NYC.

According to Season

Here, we plot average departure delays over time.

library(lubridate)
## 
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
## 
##     date, intersect, setdiff, union
flights %>%
  mutate(date = ymd(paste(year, month, day))) %>%
  group_by(date) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x = date, y = avg_dep_delay) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Average Departure Delays from NYC Airports",
    x = "Date",
    y = "Average Delay (mins)",
  )
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'

The delays show a moderate rise in the summer and holiday months.

According to Location

We now consider the effect of geography on flight delays.

Here is a simple scatter plot based on latitude and longitude of all the airports in the dataset. Perhaps you recognize the pattern.

airports %>%
  filter(lon < 0) %>%
  ggplot() +
  aes(x = lon, y = lat) +
  geom_point()

The pattern is a bit easier if we filter out positive longitude values. These simple lat-long plots are not particularly accurate, geospatially, which we’ll address later in the course, but they can be marginally useful.

Here, we plot the average delay associated with airport, along with the volume of flight traffic.

flights_airports %>%
  group_by(dest, lat, lon) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            count_dep = n(),
            ) %>%
  ggplot() +
  aes(x = lon, 
      y = lat, 
      color = avg_dep_delay, 
      size = count_dep,
      ) +
  geom_point() +
  labs(
    title = "Average Departure Delays From NYC by Destination",
    x = "Date",
    y = "Average Delay (mins)",
  ) 
## `summarise()` has grouped output by 'dest', 'lat'. You can override using the `.groups` argument.

There don’t appear to be obvious geospatial patterns here, but the high-volume airports appear to have smaller delays that some of the less-frequented airports.