This document demonstrates the key dplyr functions in the context of a hotel-bookings dataset.

library(tidyverse)

Dataset: Hotel Bookings

We’d like to analyze hotel bookings data from TidyTuesday Hotels. Note that we’ve converted the cancellation field into a factor, which will be useful in our analysis.

hotels <- read_csv("data/hotels.csv", 
    col_types = cols(is_canceled = col_factor(levels = c("0", "1")))
    )
glimpse(hotels)
## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <fct> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

Notes: - This dataset is too big to list easily. There are too many: - Rows (i.e., records, observations) - Columns (i.e., fields, variables) - To do analysis, we need to “zoom in” on the data of interest. - We read the is_canceled (stored as a double, values 1-cancelled; 0-not-cancelled) as a factor.

dplyr Provides Data-Wrangling Functions

We now demonstrate some of the basic dplyr functions.

Selecting Columns

The select() function allows us to zoom in on particular columns.

hotels %>%
  select(lead_time, hotel)
## # A tibble: 119,390 × 2
##    lead_time hotel       
##        <dbl> <chr>       
##  1       342 Resort Hotel
##  2       737 Resort Hotel
##  3         7 Resort Hotel
##  4        13 Resort Hotel
##  5        14 Resort Hotel
##  6        14 Resort Hotel
##  7         0 Resort Hotel
##  8         9 Resort Hotel
##  9        85 Resort Hotel
## 10        75 Resort Hotel
## # … with 119,380 more rows

Notes: - You can add or remove column names.

Filtering Rows

The filter() function allows us to zoom in on particular rows.

hotels %>%
  filter(hotel == "Resort Hotel") 
## # A tibble: 40,060 × 32
##    hotel  is_ca…¹ lead_…² arriv…³ arriv…⁴ arriv…⁵ arriv…⁶ stays…⁷ stays…⁸ adults
##    <chr>  <fct>     <dbl>   <dbl> <chr>     <dbl>   <dbl>   <dbl>   <dbl>  <dbl>
##  1 Resor… 0           342    2015 July         27       1       0       0      2
##  2 Resor… 0           737    2015 July         27       1       0       0      2
##  3 Resor… 0             7    2015 July         27       1       0       1      1
##  4 Resor… 0            13    2015 July         27       1       0       1      1
##  5 Resor… 0            14    2015 July         27       1       0       2      2
##  6 Resor… 0            14    2015 July         27       1       0       2      2
##  7 Resor… 0             0    2015 July         27       1       0       2      2
##  8 Resor… 0             9    2015 July         27       1       0       2      2
##  9 Resor… 1            85    2015 July         27       1       0       3      2
## 10 Resor… 1            75    2015 July         27       1       0       3      2
## # … with 40,050 more rows, 22 more variables: children <dbl>, babies <dbl>,
## #   meal <chr>, country <chr>, market_segment <chr>,
## #   distribution_channel <chr>, is_repeated_guest <dbl>,
## #   previous_cancellations <dbl>, previous_bookings_not_canceled <dbl>,
## #   reserved_room_type <chr>, assigned_room_type <chr>, booking_changes <dbl>,
## #   deposit_type <chr>, agent <chr>, company <chr>, days_in_waiting_list <dbl>,
## #   customer_type <chr>, adr <dbl>, required_car_parking_spaces <dbl>, …

Notes: - You can add more filter conditions, e.g., lead_time < 7) - The logical operators are: == < <= > >= !=

Summarizing Data

The summarize() function allows us to compute a variety of descriptive statistics, including counting, measurements of central tendency (e.g., mean, median, quartiles), etc.

hotels %>%
  summarize(avg_lead_time = mean(lead_time))
## # A tibble: 1 × 1
##   avg_lead_time
##           <dbl>
## 1          104.

Summarize is often used with grouping. Here we count the books in different market segments.

hotels %>%
  group_by(market_segment) %>%
  summarize(booking_count = n())
## # A tibble: 8 × 2
##   market_segment booking_count
##   <chr>                  <int>
## 1 Aviation                 237
## 2 Complementary            743
## 3 Corporate               5295
## 4 Direct                 12606
## 5 Groups                 19811
## 6 Offline TA/TO          24219
## 7 Online TA              56477
## 8 Undefined                  2

Here, the n() function counts the observations in each group and must, therefore, be used in the context of a summarize, mutate or filter command. See the related count() function below.

Arranging Rows

The arrange() function allows us to reorder the rows.

hotels %>%
  group_by(market_segment) %>%
  summarize(average_lead_time = mean(lead_time)) %>%
  arrange(desc(average_lead_time))
## # A tibble: 8 × 2
##   market_segment average_lead_time
##   <chr>                      <dbl>
## 1 Groups                    187.  
## 2 Offline TA/TO             135.  
## 3 Online TA                  83.0 
## 4 Direct                     49.9 
## 5 Corporate                  22.1 
## 6 Complementary              13.3 
## 7 Aviation                    4.44
## 8 Undefined                   1.5

Notes: - Reverse the ordering using desc().

Mutating Structure

The mutate() function allows us to restructure the data.

hotels %>%
  mutate(num_nights = stays_in_week_nights + stays_in_weekend_nights,
         week_nights = stays_in_week_nights,
         weekend_nights = stays_in_weekend_nights
         ) %>%
  arrange(desc(lead_time)) %>%
  select(num_nights,
         week_nights, 
         weekend_nights
         )  
## # A tibble: 119,390 × 3
##    num_nights week_nights weekend_nights
##         <dbl>       <dbl>          <dbl>
##  1          0           0              0
##  2         28          20              8
##  3          1           1              0
##  4          1           1              0
##  5          2           2              0
##  6          2           2              0
##  7          2           2              0
##  8          2           2              0
##  9          2           2              0
## 10          2           2              0
## # … with 119,380 more rows

Notes: - We’ve focused on the data of interest by: - Computing a new field. - Renaming some fields. (It may be better to use rename() for this.) - Selecting the new/renamed fields. - Be careful about ordering the pipelines, e.g., the sorting clause can’t go just anywhere.

Additional Functions

The distinct() function removes rows with duplicate values of the given columns.

hotels %>%
  distinct(hotel) %>%
  arrange(hotel)
## # A tibble: 2 × 1
##   hotel       
##   <chr>       
## 1 City Hotel  
## 2 Resort Hotel

Note: - You can count on lists of fields (e.g., add market_segment).

The count() function counts the rows.

hotels %>%
  count(market_segment) 
## # A tibble: 8 × 2
##   market_segment     n
##   <chr>          <int>
## 1 Aviation         237
## 2 Complementary    743
## 3 Corporate       5295
## 4 Direct         12606
## 5 Groups         19811
## 6 Offline TA/TO  24219
## 7 Online TA      56477
## 8 Undefined          2

Notes: - count(X) is shorthand for group_by(X) %>% summarize(n = n())

Using Functions in Combination

As we’ve seen above, dplyr functions can be used in combination.

The following dplyr pipeline plots counts that geom_bar() computes automatically.

# manual pipeline
hotels %>%
  group_by(hotel) %>%
  summarize(count = n()) %>%
  ggplot() +
  aes(x = hotel, y = count) %>%
  geom_col()

# geom_bar only
hotels %>%
  ggplot() +
  aes(x = hotel) + 
  geom_bar()

The order of the wrangling functions in a script can matter. For example, this script fails:

# Fails
# hotels %>%
#   select(hotel, children) %>%
#   filter(adults == 0)

# Works
hotels %>%
  filter(adults == 0) %>%
  select(hotel, children)
## # A tibble: 403 × 2
##    hotel        children
##    <chr>           <dbl>
##  1 Resort Hotel        0
##  2 Resort Hotel        0
##  3 Resort Hotel        0
##  4 Resort Hotel        0
##  5 Resort Hotel        0
##  6 Resort Hotel        0
##  7 Resort Hotel        0
##  8 Resort Hotel        0
##  9 Resort Hotel        0
## 10 Resort Hotel        0
## # … with 393 more rows

Summary

dplyr provides a variety of data transformation functions that are useful in data wrangling.

See the RStudio dplyr cheat sheet.