library(tidyverse)
This document describes a simple analysis of the well-known nycflights13
dataset, which includes data on all flights departing from NYC
airports in 2013.
We start by load the data, which is pre-packaged in the
nycflights13
package. There are five dataframes, often
called tables.
library(nycflights13)
flights
## # A tibble: 336,776 × 19
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 9 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
airports
## # A tibble: 1,458 × 8
## faa name lat lon alt tz dst tzone
## <chr> <chr> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
## 1 04G Lansdowne Airport 41.1 -80.6 1044 -5 A America/…
## 2 06A Moton Field Municipal Airport 32.5 -85.7 264 -6 A America/…
## 3 06C Schaumburg Regional 42.0 -88.1 801 -6 A America/…
## 4 06N Randall Airport 41.4 -74.4 523 -5 A America/…
## 5 09J Jekyll Island Airport 31.1 -81.4 11 -5 A America/…
## 6 0A9 Elizabethton Municipal Airport 36.4 -82.2 1593 -5 A America/…
## 7 0G6 Williams County Airport 41.5 -84.5 730 -5 A America/…
## 8 0G7 Finger Lakes Regional Airport 42.9 -76.8 492 -5 A America/…
## 9 0P2 Shoestring Aviation Airfield 39.8 -76.6 1000 -5 U America/…
## 10 0S9 Jefferson County Intl 48.1 -123. 108 -8 A America/…
## # … with 1,448 more rows
airlines
## # A tibble: 16 × 2
## carrier name
## <chr> <chr>
## 1 9E Endeavor Air Inc.
## 2 AA American Airlines Inc.
## 3 AS Alaska Airlines Inc.
## 4 B6 JetBlue Airways
## 5 DL Delta Air Lines Inc.
## 6 EV ExpressJet Airlines Inc.
## 7 F9 Frontier Airlines Inc.
## 8 FL AirTran Airways Corporation
## 9 HA Hawaiian Airlines Inc.
## 10 MQ Envoy Air
## 11 OO SkyWest Airlines Inc.
## 12 UA United Air Lines Inc.
## 13 US US Airways Inc.
## 14 VX Virgin America
## 15 WN Southwest Airlines Co.
## 16 YV Mesa Airlines Inc.
planes
## # A tibble: 3,322 × 9
## tailnum year type manuf…¹ model engines seats speed engine
## <chr> <int> <chr> <chr> <chr> <int> <int> <int> <chr>
## 1 N10156 2004 Fixed wing multi engi… EMBRAER EMB-… 2 55 NA Turbo…
## 2 N102UW 1998 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 3 N103US 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 4 N104UW 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 5 N10575 2002 Fixed wing multi engi… EMBRAER EMB-… 2 55 NA Turbo…
## 6 N105UW 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 7 N107US 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 8 N108UW 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 9 N109UW 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## 10 N110UW 1999 Fixed wing multi engi… AIRBUS… A320… 2 182 NA Turbo…
## # … with 3,312 more rows, and abbreviated variable name ¹manufacturer
weather
## # A tibble: 26,115 × 15
## origin year month day hour temp dewp humid wind_dir wind_speed wind_g…¹
## <chr> <int> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 EWR 2013 1 1 1 39.0 26.1 59.4 270 10.4 NA
## 2 EWR 2013 1 1 2 39.0 27.0 61.6 250 8.06 NA
## 3 EWR 2013 1 1 3 39.0 28.0 64.4 240 11.5 NA
## 4 EWR 2013 1 1 4 39.9 28.0 62.2 250 12.7 NA
## 5 EWR 2013 1 1 5 39.0 28.0 64.4 260 12.7 NA
## 6 EWR 2013 1 1 6 37.9 28.0 67.2 240 11.5 NA
## 7 EWR 2013 1 1 7 39.0 28.0 64.4 240 15.0 NA
## 8 EWR 2013 1 1 8 39.9 28.0 62.2 250 10.4 NA
## 9 EWR 2013 1 1 9 39.9 28.0 62.2 260 15.0 NA
## 10 EWR 2013 1 1 10 41 28.0 59.6 260 13.8 NA
## # … with 26,105 more rows, 4 more variables: precip <dbl>, pressure <dbl>,
## # visib <dbl>, time_hour <dttm>, and abbreviated variable name ¹wind_gust
These are all flights from the New-York-area airports (Newark, La Guardia, & John F. Kennedy).
flights %>%
distinct(origin)
## # A tibble: 3 × 1
## origin
## <chr>
## 1 EWR
## 2 LGA
## 3 JFK
flights %>%
distinct(year)
## # A tibble: 1 × 1
## year
## <int>
## 1 2013
As datasets grow in complexity, they are increasingly difficult to visualize. To help understand the structure of the data, here is what is known as a schema (i.e., structure) of the data tables. It depicts the separate dataframes and the relationships between them.
The schema marks the primary key, if any, for each table. Primary key values uniquely identify one record in the table. They cannot be NA/NULL, but they can comprise more than one field. Note that not all these tables have primary keys.
planes %>%
count(tailnum) %>%
filter(n > 1)
## # A tibble: 0 × 2
## # … with 2 variables: tailnum <chr>, n <int>
weather %>%
count(year, month, day, hour, origin) %>%
filter(n > 1)
## # A tibble: 3 × 6
## year month day hour origin n
## <int> <int> <int> <int> <chr> <int>
## 1 2013 11 3 1 EWR 2
## 2 2013 11 3 1 JFK 2
## 3 2013 11 3 1 LGA 2
flights %>%
count(year, month, day, hour, flight) %>%
filter(n > 1)
## # A tibble: 2,945 × 6
## year month day hour flight n
## <int> <int> <int> <dbl> <int> <int>
## 1 2013 1 1 6 303 2
## 2 2013 1 1 7 11 2
## 3 2013 1 1 7 715 2
## 4 2013 1 1 8 717 2
## 5 2013 1 1 12 32 2
## 6 2013 1 1 12 1443 2
## 7 2013 1 1 15 4105 2
## 8 2013 1 1 16 702 2
## 9 2013 1 1 19 87 2
## 10 2013 1 1 19 645 2
## # … with 2,935 more rows
The schema also marks the foreign keys. A foreign key is a field in the current table that references the primary key of another table. This allows us to represent more flexible inter-table relationships than were shown the class slides. It is important that the foreign key references a primary key in the other table because the foreign key value must reference a unique row in that other table. Foreign key values in the current table can appear multiple times and can be NA/NULL.
It is efficient to store data in separate tables because it allows us to represent important data only once in the dataset. For example, it’s much better to store the names and locations of the airports in one, central place rather than storing them for every flight that arrives or departs from that airport.
However, when analyzing data, we often need to “join” these table back together, e.g., to find the location of the the destination airport in the flights table.
Here we try to find all the flights to Gerald R. Ford airport.
If we happened to know the three-character code for the Ford airport
(GRR
), then we could simply use the flights
table. If we don’t, however, we need to join the flights and airports
tables to relate flights with the associated airports (including their
name).
flights_airports <- inner_join(flights, airports, by = c("dest" = "faa"))
flights_airports
## # A tibble: 329,174 × 26
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 554 600 -6 812 837 -25 DL
## 5 2013 1 1 554 558 -4 740 728 12 UA
## 6 2013 1 1 555 600 -5 913 854 19 B6
## 7 2013 1 1 557 600 -3 709 723 -14 EV
## 8 2013 1 1 557 600 -3 838 846 -8 B6
## 9 2013 1 1 558 600 -2 753 745 8 AA
## 10 2013 1 1 558 600 -2 849 851 -2 B6
## # … with 329,164 more rows, 16 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, name <chr>, lat <dbl>, lon <dbl>,
## # alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, and abbreviated variable names
## # ¹sched_dep_time, ²dep_delay, ³arr_time, ⁴sched_arr_time, ⁵arr_delay
We can then query the flights to the Ford airport by airport name.
flights_airports %>%
filter(str_detect(name, "Ford")) %>%
select(origin, year, month, day, carrier, name)
## # A tibble: 765 × 6
## origin year month day carrier name
## <chr> <int> <int> <int> <chr> <chr>
## 1 EWR 2013 1 1 EV Gerald R Ford Intl
## 2 EWR 2013 1 1 EV Gerald R Ford Intl
## 3 EWR 2013 1 2 EV Gerald R Ford Intl
## 4 LGA 2013 1 2 9E Gerald R Ford Intl
## 5 EWR 2013 1 2 EV Gerald R Ford Intl
## 6 EWR 2013 1 2 EV Gerald R Ford Intl
## 7 EWR 2013 1 3 EV Gerald R Ford Intl
## 8 LGA 2013 1 3 9E Gerald R Ford Intl
## 9 EWR 2013 1 3 EV Gerald R Ford Intl
## 10 EWR 2013 1 3 EV Gerald R Ford Intl
## # … with 755 more rows
Here, we try to find all the flights to the GR Ford airport operated by Endeavor Airlines.
Again, if we happened to know the code for Endeavor Airlines
(EV
), then we could just use the flights
table. If we don’t, however, we need to join the flights and airlines
tables to relate flights with the associated airlines (including their
name).
flights_airports_airlines <- flights %>%
left_join(airports,
by = c("dest" = "faa")
) %>%
left_join(airlines,
by = c("carrier" = "carrier"),
suffix = c(".airport", ".airline"),
)
flights_airports_airlines
## # A tibble: 336,776 × 27
## year month day dep_time sched_de…¹ dep_d…² arr_t…³ sched…⁴ arr_d…⁵ carrier
## <int> <int> <int> <int> <int> <dbl> <int> <int> <dbl> <chr>
## 1 2013 1 1 517 515 2 830 819 11 UA
## 2 2013 1 1 533 529 4 850 830 20 UA
## 3 2013 1 1 542 540 2 923 850 33 AA
## 4 2013 1 1 544 545 -1 1004 1022 -18 B6
## 5 2013 1 1 554 600 -6 812 837 -25 DL
## 6 2013 1 1 554 558 -4 740 728 12 UA
## 7 2013 1 1 555 600 -5 913 854 19 B6
## 8 2013 1 1 557 600 -3 709 723 -14 EV
## 9 2013 1 1 557 600 -3 838 846 -8 B6
## 10 2013 1 1 558 600 -2 753 745 8 AA
## # … with 336,766 more rows, 17 more variables: flight <int>, tailnum <chr>,
## # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
## # minute <dbl>, time_hour <dttm>, name.airport <chr>, lat <dbl>, lon <dbl>,
## # alt <dbl>, tz <dbl>, dst <chr>, tzone <chr>, name.airline <chr>, and
## # abbreviated variable names ¹sched_dep_time, ²dep_delay, ³arr_time,
## # ⁴sched_arr_time, ⁵arr_delay
Here, we needed to specify which keys should match up for the joins.
The first join matches flights$dest
with
airports$faa
, the second flights$carrier
with
airlines$carrier
.
We also needed to deal with the issue of name conflicts. Both the
airports and airlines tables have a name
field. We add the
suffix
to the last query to distinguish
name.airport
and name.airline
.
Now, we can query flights to Ford airport by Endeavor by name.
flights_airports_airlines %>%
filter(
str_detect(name.airport, "Ford"),
str_detect(name.airline, "Endeavor"),
) %>%
select(year, month, day, carrier, name.airport, name.airline)
## # A tibble: 44 × 6
## year month day carrier name.airport name.airline
## <int> <int> <int> <chr> <chr> <chr>
## 1 2013 1 2 9E Gerald R Ford Intl Endeavor Air Inc.
## 2 2013 1 3 9E Gerald R Ford Intl Endeavor Air Inc.
## 3 2013 1 4 9E Gerald R Ford Intl Endeavor Air Inc.
## 4 2013 1 7 9E Gerald R Ford Intl Endeavor Air Inc.
## 5 2013 1 8 9E Gerald R Ford Intl Endeavor Air Inc.
## 6 2013 1 9 9E Gerald R Ford Intl Endeavor Air Inc.
## 7 2013 1 10 9E Gerald R Ford Intl Endeavor Air Inc.
## 8 2013 1 11 9E Gerald R Ford Intl Endeavor Air Inc.
## 9 2013 1 14 9E Gerald R Ford Intl Endeavor Air Inc.
## 10 2013 1 15 9E Gerald R Ford Intl Endeavor Air Inc.
## # … with 34 more rows
We new attempt to find patterns in delays on flights departing from NYC.
Here, we plot average departure delays over time.
library(lubridate)
flights %>%
mutate(date = ymd(paste(year, month, day))) %>%
group_by(date) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE)) %>%
ggplot() +
aes(x = date, y = avg_dep_delay) +
geom_point() +
geom_smooth() +
labs(
title = "Average Departure Delays from NYC Airports",
x = "Date",
y = "Average Delay (mins)",
)
The delays show a moderate rise in the summer and holiday months.
We now consider the effect of geography on flight delays.
Here is a simple scatter plot based on latitude and longitude of all the airports in the dataset. Perhaps you recognize the pattern.
airports %>%
filter(lon < 0) %>%
ggplot() +
aes(x = lon, y = lat) +
geom_point()
The pattern is a bit easier if we filter out positive longitude values. These simple lat-long plots are not particularly accurate, geospatially, which we’ll address later in the course, but they can be marginally useful.
Here, we plot the average delay associated with airport, along with the volume of flight traffic.
flights_airports %>%
group_by(dest, lat, lon) %>%
summarize(avg_dep_delay = mean(dep_delay, na.rm = TRUE),
count_dep = n(),
) %>%
ggplot() +
aes(x = lon,
y = lat,
color = avg_dep_delay,
size = count_dep,
) +
geom_point() +
labs(
title = "Average Departure Delays From NYC by Destination",
x = "Date",
y = "Average Delay (mins)",
)
There don’t appear to be obvious geospatial patterns here, but the high-volume airports appear to have smaller delays that some of the less-frequented airports.