library(tidyverse)

This document describes a simple analysis of the well-known nycflights13 dataset, which includes data on all flights departing from NYC airports in 2013.

The Data

We start by load the data, which is pre-packaged in the nycflights13 package. There are five dataframes, often called tables.

library(nycflights13)

flights
## # A tibble: 336,776 × 19
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay
airports
## # A tibble: 1,458 × 8
##    faa   name                             lat    lon   alt    tz dst   tzone    
##    <chr> <chr>                          <dbl>  <dbl> <dbl> <dbl> <chr> <chr>    
##  1 04G   Lansdowne Airport               41.1  -80.6  1044    -5 A     America/…
##  2 06A   Moton Field Municipal Airport   32.5  -85.7   264    -6 A     America/…
##  3 06C   Schaumburg Regional             42.0  -88.1   801    -6 A     America/…
##  4 06N   Randall Airport                 41.4  -74.4   523    -5 A     America/…
##  5 09J   Jekyll Island Airport           31.1  -81.4    11    -5 A     America/…
##  6 0A9   Elizabethton Municipal Airport  36.4  -82.2  1593    -5 A     America/…
##  7 0G6   Williams County Airport         41.5  -84.5   730    -5 A     America/…
##  8 0G7   Finger Lakes Regional Airport   42.9  -76.8   492    -5 A     America/…
##  9 0P2   Shoestring Aviation Airfield    39.8  -76.6  1000    -5 U     America/…
## 10 0S9   Jefferson County Intl           48.1 -123.    108    -8 A     America/…
## # … with 1,448 more rows
airlines
## # A tibble: 16 × 2
##    carrier name                       
##    <chr>   <chr>                      
##  1 9E      Endeavor Air Inc.          
##  2 AA      American Airlines Inc.     
##  3 AS      Alaska Airlines Inc.       
##  4 B6      JetBlue Airways            
##  5 DL      Delta Air Lines Inc.       
##  6 EV      ExpressJet Airlines Inc.   
##  7 F9      Frontier Airlines Inc.     
##  8 FL      AirTran Airways Corporation
##  9 HA      Hawaiian Airlines Inc.     
## 10 MQ      Envoy Air                  
## 11 OO      SkyWest Airlines Inc.      
## 12 UA      United Air Lines Inc.      
## 13 US      US Airways Inc.            
## 14 VX      Virgin America             
## 15 WN      Southwest Airlines Co.     
## 16 YV      Mesa Airlines Inc.
planes
## # A tibble: 3,322 × 9
##    tailnum  year type                   manuf…¹ model engines seats speed engine
##    <chr>   <int> <chr>                  <chr>   <chr>   <int> <int> <int> <chr> 
##  1 N10156   2004 Fixed wing multi engi… EMBRAER EMB-…       2    55    NA Turbo…
##  2 N102UW   1998 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  3 N103US   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  4 N104UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  5 N10575   2002 Fixed wing multi engi… EMBRAER EMB-…       2    55    NA Turbo…
##  6 N105UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  7 N107US   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  8 N108UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
##  9 N109UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
## 10 N110UW   1999 Fixed wing multi engi… AIRBUS… A320…       2   182    NA Turbo…
## # … with 3,312 more rows, and abbreviated variable name ¹​manufacturer
weather
## # A tibble: 26,115 × 15
##    origin  year month   day  hour  temp  dewp humid wind_dir wind_speed wind_g…¹
##    <chr>  <int> <int> <int> <int> <dbl> <dbl> <dbl>    <dbl>      <dbl>    <dbl>
##  1 EWR     2013     1     1     1  39.0  26.1  59.4      270      10.4        NA
##  2 EWR     2013     1     1     2  39.0  27.0  61.6      250       8.06       NA
##  3 EWR     2013     1     1     3  39.0  28.0  64.4      240      11.5        NA
##  4 EWR     2013     1     1     4  39.9  28.0  62.2      250      12.7        NA
##  5 EWR     2013     1     1     5  39.0  28.0  64.4      260      12.7        NA
##  6 EWR     2013     1     1     6  37.9  28.0  67.2      240      11.5        NA
##  7 EWR     2013     1     1     7  39.0  28.0  64.4      240      15.0        NA
##  8 EWR     2013     1     1     8  39.9  28.0  62.2      250      10.4        NA
##  9 EWR     2013     1     1     9  39.9  28.0  62.2      260      15.0        NA
## 10 EWR     2013     1     1    10  41    28.0  59.6      260      13.8        NA
## # … with 26,105 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
## #   visib <dbl>, time_hour <dttm>, and abbreviated variable name ¹​wind_gust

These are all flights from the New-York-area airports (Newark, La Guardia, & John F. Kennedy).

flights %>%
  distinct(origin)
## # A tibble: 3 × 1
##   origin
##   <chr> 
## 1 EWR   
## 2 LGA   
## 3 JFK
flights %>%
  distinct(year)
## # A tibble: 1 × 1
##    year
##   <int>
## 1  2013

The Structure of the Data

As datasets grow in complexity, they are increasingly difficult to visualize. To help understand the structure of the data, here is what is known as a schema (i.e., structure) of the data tables. It depicts the separate dataframes and the relationships between them.

NYC Flights Database Schema
NYC Flights Database Schema

Key Values

The schema marks the primary key, if any, for each table. Primary key values uniquely identify one record in the table. They cannot be NA/NULL, but they can comprise more than one field. Note that not all these tables have primary keys.

planes %>% 
  count(tailnum) %>% 
  filter(n > 1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
weather %>% 
  count(year, month, day, hour, origin) %>% 
  filter(n > 1)
## # A tibble: 3 × 6
##    year month   day  hour origin     n
##   <int> <int> <int> <int> <chr>  <int>
## 1  2013    11     3     1 EWR        2
## 2  2013    11     3     1 JFK        2
## 3  2013    11     3     1 LGA        2
flights %>%
  count(year, month, day, hour, flight) %>% 
  filter(n > 1)
## # A tibble: 2,945 × 6
##     year month   day  hour flight     n
##    <int> <int> <int> <dbl>  <int> <int>
##  1  2013     1     1     6    303     2
##  2  2013     1     1     7     11     2
##  3  2013     1     1     7    715     2
##  4  2013     1     1     8    717     2
##  5  2013     1     1    12     32     2
##  6  2013     1     1    12   1443     2
##  7  2013     1     1    15   4105     2
##  8  2013     1     1    16    702     2
##  9  2013     1     1    19     87     2
## 10  2013     1     1    19    645     2
## # … with 2,935 more rows

The schema also marks the foreign keys. A foreign key is a field in the current table that references the primary key of another table. This allows us to represent more flexible inter-table relationships than were shown the class slides. It is important that the foreign key references a primary key in the other table because the foreign key value must reference a unique row in that other table. Foreign key values in the current table can appear multiple times and can be NA/NULL.

Join Queries

It is efficient to store data in separate tables because it allows us to represent important data only once in the dataset. For example, it’s much better to store the names and locations of the airports in one, central place rather than storing them for every flight that arrives or departs from that airport.

However, when analyzing data, we often need to “join” these table back together, e.g., to find the location of the the destination airport in the flights table.

1-to-Many Joins

Here we try to find all the flights to Gerald R. Ford airport.

If we happened to know the three-character code for the Ford airport (GRR), then we could simply use the flights table. If we don’t, however, we need to join the flights and airports tables to relate flights with the associated airports (including their name).

flights_airports <- inner_join(flights, airports, by = c("dest" = "faa"))
flights_airports
## # A tibble: 329,174 × 26
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      554        600      -6     812     837     -25 DL     
##  5  2013     1     1      554        558      -4     740     728      12 UA     
##  6  2013     1     1      555        600      -5     913     854      19 B6     
##  7  2013     1     1      557        600      -3     709     723     -14 EV     
##  8  2013     1     1      557        600      -3     838     846      -8 B6     
##  9  2013     1     1      558        600      -2     753     745       8 AA     
## 10  2013     1     1      558        600      -2     849     851      -2 B6     
## # … with 329,164 more rows, 16 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
## #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, and abbreviated variable names
## #   ¹​sched_dep_time, ²​dep_delay, ³​arr_time, ⁴​sched_arr_time, ⁵​arr_delay

We can then query the flights to the Ford airport by airport name.

flights_airports %>%
  filter(str_detect(name, "Ford")) %>%
  select(origin, year, month, day, carrier, name)
## # A tibble: 765 × 6
##    origin  year month   day carrier name              
##    <chr>  <int> <int> <int> <chr>   <chr>             
##  1 EWR     2013     1     1 EV      Gerald R Ford Intl
##  2 EWR     2013     1     1 EV      Gerald R Ford Intl
##  3 EWR     2013     1     2 EV      Gerald R Ford Intl
##  4 LGA     2013     1     2 9E      Gerald R Ford Intl
##  5 EWR     2013     1     2 EV      Gerald R Ford Intl
##  6 EWR     2013     1     2 EV      Gerald R Ford Intl
##  7 EWR     2013     1     3 EV      Gerald R Ford Intl
##  8 LGA     2013     1     3 9E      Gerald R Ford Intl
##  9 EWR     2013     1     3 EV      Gerald R Ford Intl
## 10 EWR     2013     1     3 EV      Gerald R Ford Intl
## # … with 755 more rows

Many-to-Many Joins

Here, we try to find all the flights to the GR Ford airport operated by Endeavor Airlines.

Again, if we happened to know the code for Endeavor Airlines (EV), then we could just use the flights table. If we don’t, however, we need to join the flights and airlines tables to relate flights with the associated airlines (including their name).

flights_airports_airlines <- flights %>%
  left_join(airports, 
            by = c("dest" = "faa")
            ) %>%
  left_join(airlines, 
            by = c("carrier" = "carrier"), 
            suffix = c(".airport", ".airline"),
            )
flights_airports_airlines
## # A tibble: 336,776 × 27
##     year month   day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
##    <int> <int> <int>    <int>      <int>   <dbl>   <int>   <int>   <dbl> <chr>  
##  1  2013     1     1      517        515       2     830     819      11 UA     
##  2  2013     1     1      533        529       4     850     830      20 UA     
##  3  2013     1     1      542        540       2     923     850      33 AA     
##  4  2013     1     1      544        545      -1    1004    1022     -18 B6     
##  5  2013     1     1      554        600      -6     812     837     -25 DL     
##  6  2013     1     1      554        558      -4     740     728      12 UA     
##  7  2013     1     1      555        600      -5     913     854      19 B6     
##  8  2013     1     1      557        600      -3     709     723     -14 EV     
##  9  2013     1     1      557        600      -3     838     846      -8 B6     
## 10  2013     1     1      558        600      -2     753     745       8 AA     
## # … with 336,766 more rows, 17 more variables: flight <int>, tailnum <chr>,
## #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## #   minute <dbl>, time_hour <dttm>, name.airport <chr>, lat <dbl>, lon <dbl>,
## #   alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, name.airline <chr>, and
## #   abbreviated variable names ¹​sched_dep_time, ²​dep_delay, ³​arr_time,
## #   ⁴​sched_arr_time, ⁵​arr_delay

Here, we needed to specify which keys should match up for the joins. The first join matches flights$dest with airports$faa, the second flights$carrier with airlines$carrier.

We also needed to deal with the issue of name conflicts. Both the airports and airlines tables have a name field. We add the suffix to the last query to distinguish name.airport and name.airline.

Now, we can query flights to Ford airport by Endeavor by name.

flights_airports_airlines %>%
  filter(
    str_detect(name.airport, "Ford"),
    str_detect(name.airline, "Endeavor"),
    ) %>%
  select(year, month, day, carrier, name.airport, name.airline)
## # A tibble: 44 × 6
##     year month   day carrier name.airport       name.airline     
##    <int> <int> <int> <chr>   <chr>              <chr>            
##  1  2013     1     2 9E      Gerald R Ford Intl Endeavor Air Inc.
##  2  2013     1     3 9E      Gerald R Ford Intl Endeavor Air Inc.
##  3  2013     1     4 9E      Gerald R Ford Intl Endeavor Air Inc.
##  4  2013     1     7 9E      Gerald R Ford Intl Endeavor Air Inc.
##  5  2013     1     8 9E      Gerald R Ford Intl Endeavor Air Inc.
##  6  2013     1     9 9E      Gerald R Ford Intl Endeavor Air Inc.
##  7  2013     1    10 9E      Gerald R Ford Intl Endeavor Air Inc.
##  8  2013     1    11 9E      Gerald R Ford Intl Endeavor Air Inc.
##  9  2013     1    14 9E      Gerald R Ford Intl Endeavor Air Inc.
## 10  2013     1    15 9E      Gerald R Ford Intl Endeavor Air Inc.
## # … with 34 more rows

Assessing Delays

We new attempt to find patterns in delays on flights departing from NYC.

According to Season

Here, we plot average departure delays over time.

library(lubridate)

flights %>%
  mutate(date = ymd(paste(year, month, day))) %>%
  group_by(date) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
  ggplot() +
  aes(x = date, y = avg_dep_delay) +
  geom_point() +
  geom_smooth() +
  labs(
    title = "Average Departure Delays from NYC Airports",
    x = "Date",
    y = "Average Delay (mins)",
  )

The delays show a moderate rise in the summer and holiday months.

According to Location

We now consider the effect of geography on flight delays.

Here is a simple scatter plot based on latitude and longitude of all the airports in the dataset. Perhaps you recognize the pattern.

airports %>%
  filter(lon < 0) %>%
  ggplot() +
  aes(x = lon, y = lat) +
  geom_point()

The pattern is a bit easier if we filter out positive longitude values. These simple lat-long plots are not particularly accurate, geospatially, which we’ll address later in the course, but they can be marginally useful.

Here, we plot the average delay associated with airport, along with the volume of flight traffic.

flights_airports %>%
  group_by(dest, lat, lon) %>%
  summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
            count_dep = n(),
            ) %>%
  ggplot() +
  aes(x = lon, 
      y = lat, 
      color = avg_dep_delay, 
      size = count_dep,
      ) +
  geom_point() +
  labs(
    title = "Average Departure Delays From NYC by Destination",
    x = "Date",
    y = "Average Delay (mins)",
  ) 

There don’t appear to be obvious geospatial patterns here, but the high-volume airports appear to have smaller delays that some of the less-frequented airports.