This document demonstrates the key dplyr functions in
the context of a hotel-bookings dataset.
library(tidyverse)
We’d like to analyze hotel bookings data from TidyTuesday Hotels. Note that we’ve converted the cancellation field into a factor, which will be useful in our analysis.
hotels <- read_csv("data/hotels.csv",
col_types = cols(is_canceled = col_factor(levels = c("0", "1")))
)
glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
Notes: - This dataset is too big to list easily. There are too many:
- Rows (i.e., records, observations) - Columns (i.e., fields, variables)
- To do analysis, we need to “zoom in” on the data of interest. - We
read the is_canceled (stored as a double, values
1-cancelled; 0-not-cancelled) as a factor.
dplyr Provides Data-Wrangling FunctionsWe now demonstrate some of the basic dplyr
functions.
The select() function allows us to zoom in on particular
columns.
hotels %>%
select(lead_time, hotel)
## # A tibble: 119,390 × 2
## lead_time hotel
## <dbl> <chr>
## 1 342 Resort Hotel
## 2 737 Resort Hotel
## 3 7 Resort Hotel
## 4 13 Resort Hotel
## 5 14 Resort Hotel
## 6 14 Resort Hotel
## 7 0 Resort Hotel
## 8 9 Resort Hotel
## 9 85 Resort Hotel
## 10 75 Resort Hotel
## # … with 119,380 more rows
Notes: - You can add or remove column names.
The filter() function allows us to zoom in on particular
rows.
hotels %>%
filter(hotel == "Resort Hotel")
## # A tibble: 40,060 × 32
## hotel is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
## <chr> <fct> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Resor… 0 342 2015 July 27 1 0 0 2
## 2 Resor… 0 737 2015 July 27 1 0 0 2
## 3 Resor… 0 7 2015 July 27 1 0 1 1
## 4 Resor… 0 13 2015 July 27 1 0 1 1
## 5 Resor… 0 14 2015 July 27 1 0 2 2
## 6 Resor… 0 14 2015 July 27 1 0 2 2
## 7 Resor… 0 0 2015 July 27 1 0 2 2
## 8 Resor… 0 9 2015 July 27 1 0 2 2
## 9 Resor… 1 85 2015 July 27 1 0 3 2
## 10 Resor… 1 75 2015 July 27 1 0 3 2
## # … with 40,050 more rows, 22 more variables: children <dbl>, babies <dbl>,
## # meal <chr>, country <chr>, market_segment <chr>,
## # distribution_channel <chr>, is_repeated_guest <dbl>,
## # previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## # reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## # deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## # customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …
Notes: - You can add more filter conditions, e.g.,
lead_time < 7) - The logical operators are:
== < <= > >= !=
The summarize() function allows us to compute a variety
of descriptive statistics, including counting, measurements of central
tendency (e.g., mean, median, quartiles), etc.
hotels %>%
summarize(avg_lead_time = mean(lead_time))
## # A tibble: 1 × 1
## avg_lead_time
## <dbl>
## 1 104.
Summarize is often used with grouping. Here we count the books in different market segments.
hotels %>%
group_by(market_segment) %>%
summarize(booking_count = n())
## # A tibble: 8 × 2
## market_segment booking_count
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
Here, the n() function counts the observations in each
group and must, therefore, be used in the context of a summarize, mutate
or filter command. See the related count() function
below.
The arrange() function allows us to reorder the
rows.
hotels %>%
group_by(market_segment) %>%
summarize(average_lead_time = mean(lead_time)) %>%
arrange(desc(average_lead_time))
## # A tibble: 8 × 2
## market_segment average_lead_time
## <chr> <dbl>
## 1 Groups 187.
## 2 Offline TA/TO 135.
## 3 Online TA 83.0
## 4 Direct 49.9
## 5 Corporate 22.1
## 6 Complementary 13.3
## 7 Aviation 4.44
## 8 Undefined 1.5
Notes: - Reverse the ordering using desc().
The mutate() function allows us to restructure the
data.
hotels %>%
mutate(num_nights = stays_in_week_nights + stays_in_weekend_nights,
week_nights = stays_in_week_nights,
weekend_nights = stays_in_weekend_nights
) %>%
arrange(desc(lead_time)) %>%
select(num_nights,
week_nights,
weekend_nights
)
## # A tibble: 119,390 × 3
## num_nights week_nights weekend_nights
## <dbl> <dbl> <dbl>
## 1 0 0 0
## 2 28 20 8
## 3 1 1 0
## 4 1 1 0
## 5 2 2 0
## 6 2 2 0
## 7 2 2 0
## 8 2 2 0
## 9 2 2 0
## 10 2 2 0
## # … with 119,380 more rows
Notes: - We’ve focused on the data of interest by: - Computing a new
field. - Renaming some fields. (It may be better to use
rename() for this.) - Selecting the new/renamed fields. -
Be careful about ordering the pipelines, e.g., the sorting clause can’t
go just anywhere.
The distinct() function removes rows with duplicate
values of the given columns.
hotels %>%
distinct(hotel) %>%
arrange(hotel)
## # A tibble: 2 × 1
## hotel
## <chr>
## 1 City Hotel
## 2 Resort Hotel
Note: - You can count on lists of fields (e.g., add
market_segment).
The count() function counts the rows.
hotels %>%
count(market_segment)
## # A tibble: 8 × 2
## market_segment n
## <chr> <int>
## 1 Aviation 237
## 2 Complementary 743
## 3 Corporate 5295
## 4 Direct 12606
## 5 Groups 19811
## 6 Offline TA/TO 24219
## 7 Online TA 56477
## 8 Undefined 2
Notes: - count(X) is shorthand for
group_by(X) %>% summarize(n = n())
As we’ve seen above, dplyr functions can be used in
combination.
The following dplyr pipeline plots counts that
geom_bar() computes automatically.
# manual pipeline
hotels %>%
group_by(hotel) %>%
summarize(count = n()) %>%
ggplot() +
aes(x = hotel, y = count) %>%
geom_col()
# geom_bar only
hotels %>%
ggplot() +
aes(x = hotel) +
geom_bar()
The order of the wrangling functions in a script can matter. For example, this script fails:
# Fails
# hotels %>%
# select(hotel, children) %>%
# filter(adults == 0)
# Works
hotels %>%
filter(adults == 0) %>%
select(hotel, children)
## # A tibble: 403 × 2
## hotel children
## <chr> <dbl>
## 1 Resort Hotel 0
## 2 Resort Hotel 0
## 3 Resort Hotel 0
## 4 Resort Hotel 0
## 5 Resort Hotel 0
## 6 Resort Hotel 0
## 7 Resort Hotel 0
## 8 Resort Hotel 0
## 9 Resort Hotel 0
## 10 Resort Hotel 0
## # … with 393 more rows
dplyr provides a variety of data transformation
functions that are useful in data wrangling.
See the RStudio dplyr cheat sheet.