This document demonstrates the key dplyr functions in the context of a hotel bookings dataset.
library(tidyverse)
We’d like to analyze hotel bookings data from TidyTuesday Hotels. Note that we’ve converted the cancellation field into a factor, which will be useful in our analysis.
hotels <- read_csv("data/hotels.csv",
col_types = cols(is_canceled = col_factor(levels = c("0", "1")))
)
glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort~
## $ is_canceled <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, ~
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, ~
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201~
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",~
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,~
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ~
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, ~
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, ~
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB~
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR~
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat~
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat~
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",~
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",~
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit~
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "~
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",~
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ customer_type <chr> "Transient", "Transient", "Transient", ~
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,~
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ~
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, ~
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", ~
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20~
Notes: - This dataset is too big to list easily. There are too many: - Rows (i.e., records, observations) - Columns (i.e., fields, variables) - To do analysis, we need to “zoom in” on the data of interest. - We read the is_canceled (stored as a double, values 1-cancelled; 0-not-cancelled) as a factor.
dplyr Provides Data-Wrangling FunctionsWe now demonstrate some of the basic dplyr functions.
The select() function allows us to zoom in on particular columns.
hotels %>%
select(lead_time, hotel)
## # A tibble: 119,390 x 2
## lead_time hotel
## <dbl> <chr>
## 1 342 Resort Hotel
## 2 737 Resort Hotel
## 3 7 Resort Hotel
## 4 13 Resort Hotel
## 5 14 Resort Hotel
## 6 14 Resort Hotel
## 7 0 Resort Hotel
## 8 9 Resort Hotel
## 9 85 Resort Hotel
## 10 75 Resort Hotel
## # ... with 119,380 more rows
Notes: - You can add or remove column names.
The filter() function allows us to zoom in on particular rows.
hotels %>%
filter(hotel == "Resort Hotel")
## # A tibble: 40,060 x 32
## hotel is_canceled lead_time arrival_date_year arrival_date_month
## <chr> <fct> <dbl> <dbl> <chr>
## 1 Resort Hotel 0 342 2015 July
## 2 Resort Hotel 0 737 2015 July
## 3 Resort Hotel 0 7 2015 July
## 4 Resort Hotel 0 13 2015 July
## 5 Resort Hotel 0 14 2015 July
## 6 Resort Hotel 0 14 2015 July
## 7 Resort Hotel 0 0 2015 July
## 8 Resort Hotel 0 9 2015 July
## 9 Resort Hotel 1 85 2015 July
## 10 Resort Hotel 1 75 2015 July
## # ... with 40,050 more rows, and 27 more variables:
## # arrival_date_week_number <dbl>, arrival_date_day_of_month <dbl>,
## # stays_in_weekend_nights <dbl>, stays_in_week_nights <dbl>, adults <dbl>,
## # children <dbl>, babies <dbl>, meal <chr>, country <chr>,
## # market_segment <chr>, distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, ...
Notes: - You can add more filter conditions, e.g., lead_time < 7) - The logical operators are: == < <= > >= !=
The summarize() function allows us to compute a variety of descriptive statistics, including counting, measurements of central tendency (e.g., mean, median, quartiles), etc.
hotels %>%
summarize(avg_lead_time = mean(lead_time))
## # A tibble: 1 x 1
## avg_lead_time
## <dbl>
## 1 104.
Summarize is often used with grouping. Here we count the books in different market segments.
hotels %>%
group_by(market_segment) %>%
summarize(booking_count = n())
## # A tibble: 8 x 2
## market_segment booking_count
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
The arrange() function allows us to reorder the rows.
hotels %>%
group_by(market_segment) %>%
summarize(average_lead_time = mean(lead_time)) %>%
arrange(desc(average_lead_time))
## # A tibble: 8 x 2
## market_segment average_lead_time
## <chr> <dbl>
## 1 Groups 187.
## 2 Offline TA/TO 135.
## 3 Online TA 83.0
## 4 Direct 49.9
## 5 Corporate 22.1
## 6 Complementary 13.3
## 7 Aviation 4.44
## 8 Undefined 1.5
Notes: - Reverse the ordering using desc().
The mutate() function allows us to restructure the data.
hotels %>%
mutate(num_nights = stays_in_week_nights + stays_in_weekend_nights,
week_nights = stays_in_week_nights,
weekend_nights = stays_in_weekend_nights
) %>%
arrange(desc(lead_time)) %>%
select(num_nights,
week_nights,
weekend_nights
)
## # A tibble: 119,390 x 3
## num_nights week_nights weekend_nights
## <dbl> <dbl> <dbl>
## 1 0 0 0
## 2 28 20 8
## 3 1 1 0
## 4 1 1 0
## 5 2 2 0
## 6 2 2 0
## 7 2 2 0
## 8 2 2 0
## 9 2 2 0
## 10 2 2 0
## # ... with 119,380 more rows
Notes: - We’ve focused on the data of interest by: - Computing a new field. - Renaming some fields. - Selecting the new/renamed fields. - Be careful about ordering the pipelines, e.g., the sorting clause can’t go just anywhere.
The distinct() function removes rows with duplicate values of the given columns.
hotels %>%
distinct(hotel) %>%
arrange(hotel)
## # A tibble: 2 x 1
## hotel
## <chr>
## 1 City Hotel
## 2 Resort Hotel
Note: - You can count on lists of fields (e.g., add market_segment).
The count() function counts the rows.
hotels %>%
count(market_segment)
## # A tibble: 8 x 2
## market_segment n
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
Notes: - count(X) is shorthand for group_by(X) %>% summarize(n = n())
As we’ve seen above, dplyr functions can be used in combination.
The following dplyr pipeline plots counts that geom_bar() computes automatically.
# manual pipeline
hotels %>%
group_by(hotel) %>%
summarize(count = n()) %>%
ggplot() +
aes(x = hotel, y = count) %>%
geom_col()
# geom_bar only
hotels %>%
ggplot() +
aes(x = hotel) +
geom_bar()
The order of the wrangling functions in a script can matter. For example, this script fails:
# Fails
# hotels %>%
# select(hotel, children) %>%
# filter(adults == 0)
# Works
hotels %>%
filter(adults == 0) %>%
select(hotel, children)
## # A tibble: 403 x 2
## hotel children
## <chr> <dbl>
## 1 Resort Hotel 0
## 2 Resort Hotel 0
## 3 Resort Hotel 0
## 4 Resort Hotel 0
## 5 Resort Hotel 0
## 6 Resort Hotel 0
## 7 Resort Hotel 0
## 8 Resort Hotel 0
## 9 Resort Hotel 0
## 10 Resort Hotel 0
## # ... with 393 more rows
dplyr provides a variety of data transformation functions that are useful in data wrangling.
See the RStudio dplyr cheat sheet.