library(tidyverse)
This document describes a simple analysis of the well-known nycflights13
dataset, which includes data on all flights departing from NYC airports in 2013.
We start by load the data, which is pre-packaged in the nycflights13
package. There are five dataframes, often called tables.
library(nycflights13)
flights
## # A tibble: 336,776 x 19
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 11 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
airports
## # A tibble: 1,458 x 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/~
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/~
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/~
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/~
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/~
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/~
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/~
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/~
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/~
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/~
## # ... with 1,448 more rows
airlines
## # A tibble: 16 x 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
planes
## # A tibble: 3,322 x 9
## tailnum year type manufacturer model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 2 N102UW 1998 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 3 N103US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 4 N104UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 5 N10575 2002 Fixed wing multi~ EMBRAER EMB-~ 2 55 NA Turbo~
## 6 N105UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 7 N107US 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 8 N108UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 9 N109UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## 10 N110UW 1999 Fixed wing multi~ AIRBUS INDU~ A320~ 2 182 NA Turbo~
## # ... with 3,312 more rows
weather
## # A tibble: 26,115 x 15
## origin year month day hour temp dewp humid wind_dir wind_speed
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8
## # ... with 26,105 more rows, and 5 more variables: wind_gust <dbl>,
## # precip <dbl>, pressure <dbl>, visib <dbl>, time_hour <dttm>
These are all flights from the New-York-area airports (Newark, La Guardia, & John F. Kennedy).
flights %>%
distinct(origin)
## # A tibble: 3 x 1
## origin
## <chr>
## 1 EWR
## 2 LGA
## 3 JFK
flights %>%
distinct(year)
## # A tibble: 1 x 1
## year
## <int>
## 1 2013
As datasets grow in complexity, they are increasingly difficult to visualize. To help understand the structure of the data, here is what is known as a schema (i.e., structure) of the data tables. It depicts the separate dataframes and the relationships between them.
NYC Flights Database Schema
The schema marks the primary key, if any, for each table. Primary key values uniquely identify one record in the table. They cannot be NA/NULL, but they can comprise more than one field. Note that not all these tables have primary keys.
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 x 2
## # ... with 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 3 x 6
## year month day hour origin n
## <int> <int> <int> <int> <chr> <int>
## 1 2013 11 3 1 EWR 2
## 2 2013 11 3 1 JFK 2
## 3 2013 11 3 1 LGA 2
flights %>%
count(year, month, day, hour, flight) %>%
filter(n > 1)
## # A tibble: 2,945 x 6
## year month day hour flight n
## <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 6 303 2
## 2 2013 1 1 7 11 2
## 3 2013 1 1 7 715 2
## 4 2013 1 1 8 717 2
## 5 2013 1 1 12 32 2
## 6 2013 1 1 12 1443 2
## 7 2013 1 1 15 4105 2
## 8 2013 1 1 16 702 2
## 9 2013 1 1 19 87 2
## 10 2013 1 1 19 645 2
## # ... with 2,935 more rows
The schema also marks the foreign keys, which are fields in other (i.e., “foreign”) tables that reference a table’s primary key. This is how inter-table relationships are specified in multi-table datasets. It is important that the foreign key references a primary key because there needs to be a unique row in the “foreign” table to match. Foreign key values can be reused in the “foreign” table and they can be NA/NULL.
It is efficient to store data in separate tables because it allows us to represent important data only once in the dataset. For example, it’s much better to store the names and locations of the airports in one, central place rather than storing them for every flight that arrives or departs from that airport.
However, when analyzing data, we often need to “join” these table back together, e.g., to find the location of the the destination airport in the flights table.
Here we try to find all the flights to Gerald R. Ford airport.
If we happened to know the three-character code for the Ford airport (GRR
), then we could simply use the flights
table. If we don’t, however, we need to join the flights and airports tables to relate flights with the associated airports (including their name).
flights_airports <- inner_join(flights, airports, by = c("dest" = "faa"))
flights_airports
## # A tibble: 329,174 x 26
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 554 600 -6 812 837
## 5 2013 1 1 554 558 -4 740 728
## 6 2013 1 1 555 600 -5 913 854
## 7 2013 1 1 557 600 -3 709 723
## 8 2013 1 1 557 600 -3 838 846
## 9 2013 1 1 558 600 -2 753 745
## 10 2013 1 1 558 600 -2 849 851
## # ... with 329,164 more rows, and 18 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # name <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## # tzone <chr>
We can then query the flights to the Ford airport by airport name.
flights_airports %>%
filter(str_detect(name, "Ford")) %>%
select(origin, year, month, day, carrier, name)
## # A tibble: 765 x 6
## origin year month day carrier name
## <chr> <int> <int> <int> <chr> <chr>
## 1 EWR 2013 1 1 EV Gerald R Ford Intl
## 2 EWR 2013 1 1 EV Gerald R Ford Intl
## 3 EWR 2013 1 2 EV Gerald R Ford Intl
## 4 LGA 2013 1 2 9E Gerald R Ford Intl
## 5 EWR 2013 1 2 EV Gerald R Ford Intl
## 6 EWR 2013 1 2 EV Gerald R Ford Intl
## 7 EWR 2013 1 3 EV Gerald R Ford Intl
## 8 LGA 2013 1 3 9E Gerald R Ford Intl
## 9 EWR 2013 1 3 EV Gerald R Ford Intl
## 10 EWR 2013 1 3 EV Gerald R Ford Intl
## # ... with 755 more rows
Here, we try to find all the flights to the GR Ford airport operated by Endeavor Airlines.
Again, if we happened to know the code for Endeavor Airlines (EV
), then we could just use the flights
table. If we don’t, however, we need to join the flights and airlines tables to relate flights with the associated airlines (including their name).
flights_airports_airlines <- flights %>%
left_join(airports,
by = c("dest" = "faa")
) %>%
left_join(airlines,
by = c("carrier" = "carrier"),
suffix = c(".airport", ".airline"),
)
flights_airports_airlines
## # A tibble: 336,776 x 27
## year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
## <int> <int> <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 517 515 2 830 819
## 2 2013 1 1 533 529 4 850 830
## 3 2013 1 1 542 540 2 923 850
## 4 2013 1 1 544 545 -1 1004 1022
## 5 2013 1 1 554 600 -6 812 837
## 6 2013 1 1 554 558 -4 740 728
## 7 2013 1 1 555 600 -5 913 854
## 8 2013 1 1 557 600 -3 709 723
## 9 2013 1 1 557 600 -3 838 846
## 10 2013 1 1 558 600 -2 753 745
## # ... with 336,766 more rows, and 19 more variables: arr_delay <dbl>,
## # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
## # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>,
## # name.airport <chr>, lat <dbl>, lon <dbl>, alt <dbl>, tz <dbl>, dst <chr>,
## # tzone <chr>, name.airline <chr>
Here, we needed to specify which keys should match up for the joins. The first join matches flights$dest
with airports$faa
, the second flights$carrier
with airlines$carrier
.
We also needed to deal with the issue of name conflicts. Both the airports and airlines tables have a name
field. We add the suffix
to the last query to distinguish name.airport
and name.airline
.
Now, we can query flights to Ford airport by Endeavor by name.
flights_airports_airlines %>%
filter(
str_detect(name.airport, "Ford"),
str_detect(name.airline, "Endeavor"),
) %>%
select(year, month, day, carrier, name.airport, name.airline)
## # A tibble: 44 x 6
## year month day carrier name.airport name.airline
## <int> <int> <int> <chr> <chr> <chr>
## 1 2013 1 2 9E Gerald R Ford Intl Endeavor Air Inc.
## 2 2013 1 3 9E Gerald R Ford Intl Endeavor Air Inc.
## 3 2013 1 4 9E Gerald R Ford Intl Endeavor Air Inc.
## 4 2013 1 7 9E Gerald R Ford Intl Endeavor Air Inc.
## 5 2013 1 8 9E Gerald R Ford Intl Endeavor Air Inc.
## 6 2013 1 9 9E Gerald R Ford Intl Endeavor Air Inc.
## 7 2013 1 10 9E Gerald R Ford Intl Endeavor Air Inc.
## 8 2013 1 11 9E Gerald R Ford Intl Endeavor Air Inc.
## 9 2013 1 14 9E Gerald R Ford Intl Endeavor Air Inc.
## 10 2013 1 15 9E Gerald R Ford Intl Endeavor Air Inc.
## # ... with 34 more rows
We new attempt to find patterns in delays on flights departing from NYC.
Here, we plot average departure delays over time.
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
flights %>%
mutate(date = ymd(paste(year, month, day))) %>%
group_by(date) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x = date, y = avg_dep_delay) +
geom_point() +
geom_smooth() +
labs(
title = "Average Departure Delays from NYC Airports",
x = "Date",
y = "Average Delay (mins)",
)
## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The delays show a moderate rise in the summer and holiday months.
We now consider the effect of geography on flight delays.
Here is a simple scatter plot based on latitude and longitude of all the airports in the dataset. Perhaps you recognize the pattern.
airports %>%
filter(lon < 0) %>%
ggplot() +
aes(x = lon, y = lat) +
geom_point()
The pattern is a bit easier if we filter out positive longitude values. These simple lat-long plots are not particularly accurate, geospatially, which we’ll address later in the course, but they can be marginally useful.
Here, we plot the average delay associated with airport, along with the volume of flight traffic.
flights_airports %>%
group_by(dest, lat, lon) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
count_dep = n(),
) %>%
ggplot() +
aes(x = lon,
y = lat,
color = avg_dep_delay,
size = count_dep,
) +
geom_point() +
labs(
title = "Average Departure Delays From NYC by Destination",
x = "Date",
y = "Average Delay (mins)",
)
## `summarise()` has grouped output by 'dest', 'lat'. You can override using the `.groups` argument.
There don’t appear to be obvious geospatial patterns here, but the high-volume airports appear to have smaller delays that some of the less-frequented airports.