Case Study: Hotel Bookings with Children

library(tidyverse)

This document describes a simple analysis of the hotel booking dataset. This example is adapted from A. Scotina’s Taking a Vacation with #TidyTuesday hotel bookings, with an emphasis on data wrangling. Our interest is in understanding booking patterns over the course of the year.

Loading the Hotel Bookings Dataset

We start by loading the TidyTuesday hotel dataset. The dataset is available from the TidyTuesday GitHub repository as hotels.csv, but it’s only about 16 MB, so we’ll download a local copy to reduce traffic on the server.
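One way to avoid repeated downloads is to fetch the file only when no local copy exists. The following is a minimal sketch in base R, not part of the original analysis; the helper name fetch_once and the example.org URL are placeholders chosen for illustration.

```r
# Hypothetical helper: download `url` to `dest` only when `dest` is missing.
fetch_once <- function(url, dest) {
  if (!file.exists(dest)) {
    dir.create(dirname(dest), recursive = TRUE, showWarnings = FALSE)
    download.file(url, destfile = dest, mode = "wb")
  }
  invisible(dest)
}

# Demonstration with a file that already exists, so nothing is downloaded.
cached <- tempfile(fileext = ".csv")
writeLines(c("hotel,adults", "Resort Hotel,2"), cached)

# The URL is a placeholder; it is never contacted because `cached` exists.
path <- fetch_once("https://example.org/hotels.csv", cached)
print(read.csv(path))
```

With the real dataset, the same call would take the dataset’s actual URL and a destination such as "data/hotels.csv".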

hotels_original <- read_csv("data/hotels.csv")
glimpse(hotels_original)

## Rows: 119,390
## Columns: 32
## $ hotel                          <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled                    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time                      <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year              <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month             <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number       <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month      <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights        <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights           <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults                         <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children                       <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies                         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal                           <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country                        <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment                 <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel           <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest              <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations         <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type             <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type             <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes                <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type                   <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent                          <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company                        <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list           <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type                  <chr> "Transient", "Transient", "Transient", …
## $ adr                            <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces    <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests      <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status             <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date        <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…

We can see that the arrival date of each booking is split into multiple columns: arrival_date_year, arrival_date_month, arrival_date_day_of_month.

Wrangling Dates

To plot a time-series showing the distribution of bookings over time, we’ll need to create a full date that can be sequenced. We’ll use lubridate.

library(lubridate)

hotels <- hotels_original %>%
  mutate(full_date = ymd(paste(arrival_date_year, 
                               arrival_date_month, 
                               arrival_date_day_of_month
                               )))

hotels %>%
  select(full_date, arrival_date_year, arrival_date_month, arrival_date_day_of_month)

## # A tibble: 119,390 × 4
##    full_date  arrival_date_year arrival_date_month arrival_date_day_of_month
##    <date>                 <dbl> <chr>                                  <dbl>
##  1 2015-07-01              2015 July                                       1
##  2 2015-07-01              2015 July                                       1
##  3 2015-07-01              2015 July                                       1
##  4 2015-07-01              2015 July                                       1
##  5 2015-07-01              2015 July                                       1
##  6 2015-07-01              2015 July                                       1
##  7 2015-07-01              2015 July                                       1
##  8 2015-07-01              2015 July                                       1
##  9 2015-07-01              2015 July                                       1
## 10 2015-07-01              2015 July                                       1
## # … with 119,380 more rows

This concatenated the year-month-day elements of the data together and used ymd() to produce a date object that can be sequenced chronologically. Here’s a density plot of the bookings over time.

hotels %>%
  ggplot() +
  aes(x = full_date) +
  geom_density()
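The date construction used for full_date can be checked on a few hand-made values. This sketch uses made-up data in the same shape as the arrival_date_* columns, and compares the paste-and-parse approach with an alternative that looks up the month number in month.name and calls lubridate’s make_date(). Both assume English month names.

```r
library(lubridate)

# Made-up values shaped like the arrival_date_* columns.
year  <- c(2015, 2016, 2017)
month <- c("July", "February", "December")
day   <- c(1, 29, 31)

# Approach used above: paste into strings like "2015 July 1" and parse them.
pasted <- ymd(paste(year, month, day))

# Alternative: convert the month name to its number, then build the date.
built <- make_date(year, match(month, month.name), day)

print(pasted)
print(all(pasted == built))
```

The make_date() variant avoids string parsing, which may matter if the session locale does not use English month names.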

For a more general analysis, it would be better to group the months together so that we can see annual trends. To support this, we’ll convert the month character vector to a factor, count the bookings (i.e., group by month and hotel and tally the rows), and then plot the bookings count for each month.

# factor() on a character vector orders the levels alphabetically.
hotels <- hotels %>%
    mutate(arrival_date_month_factor = 
               factor(arrival_date_month))

hotels %>%
  count(arrival_date_month_factor, hotel) %>%
  ggplot() + 
  aes(x = arrival_date_month_factor, y = n, fill = hotel) +
  geom_col() +
  labs(x = NULL, y = "Frequency", fill = NULL) +
  coord_flip()
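The count() step used above is shorthand for grouping and tallying rows. A small sketch on made-up data (the toy tibble is an assumption for illustration) shows the two forms producing the same table:

```r
library(dplyr)

# Made-up bookings: one row per booking.
toy <- tibble(
  month = c("July", "July", "August", "July"),
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "City Hotel")
)

# Shorthand form.
via_count <- toy %>% count(month, hotel)

# Explicit form: group, then count the rows in each group.
via_summarise <- toy %>%
  group_by(month, hotel) %>%
  summarise(n = n(), .groups = "drop")

print(via_count)
print(all.equal(as.data.frame(via_count), as.data.frame(via_summarise)))
```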

Alas, the months are ordered alphabetically, not chronologically. We’ll rebuild the month factor, this time “re-leveling” (i.e., ordering) according to the chronological order of the months.

# Order the levels chronologically using the built-in month.name vector.
hotels <- hotels %>%
    mutate(arrival_date_month_factor = 
             fct_relevel(
               factor(arrival_date_month),
               month.name))

hotels %>%
  count(arrival_date_month_factor, hotel) %>%
  ggplot() + 
  aes(x = fct_rev(arrival_date_month_factor), y = n, fill = hotel) +
  geom_col() +
  labs(x = NULL, y = "Frequency", fill = NULL) +
  coord_flip()

month.name is a character vector that lists the month names (in English) in order: January, February, March, April, May, June, July, August, September, October, November, December.

It appears that there are fewer bookings in winter (December-February) overall.
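The effect of re-leveling can be seen on a tiny vector. This base R sketch uses made-up month values and contrasts the default alphabetical levels with levels taken from month.name:

```r
# Made-up month names, as they might appear in arrival_date_month.
months <- c("July", "August", "January", "July", "December")

# Default: levels are the sorted unique values.
alphabetical <- factor(months)

# Explicit levels: calendar order, including months absent from the data.
chronological <- factor(months, levels = month.name)

print(levels(alphabetical))
print(levels(chronological))
print(as.integer(chronological))
```

Note that the chronological factor carries all twelve levels, so months with no observations would still get a (zero) count in a table.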

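An alternative way to get the months is to extract them directly from full_date with lubridate’s month(), which returns an ordered factor whose levels are already in chronological order: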

hotels <- hotels %>%
    mutate(arrival_date_month_factor = 
               month(full_date, label = TRUE, abbr = FALSE))
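As a quick check of what month() returns with label = TRUE, here is a sketch on a few made-up dates. The printed month names depend on the session locale, so the example inspects only the factor structure:

```r
library(lubridate)

# Made-up dates for illustration.
dates <- as.Date(c("2015-07-01", "2016-01-15", "2016-12-31"))

m <- month(dates, label = TRUE, abbr = FALSE)

print(m)
print(is.ordered(m))   # the result is an ordered factor
print(nlevels(m))      # one level per calendar month
print(as.integer(m))   # underlying codes are the month numbers
```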

Checking Booking Patterns with and without Children

Here, we add a column indicating whether there are any children in the party and then plot the booking patterns with and without children. This code uses a conditional control structure implemented by the ifelse function, which we’ll discuss later in the course.

hotels <- hotels %>%
  mutate(
    any_kids = ifelse(children + babies > 0, 
                      "w/ children", 
                      "w/o children")
    )

hotels %>%
  select(any_kids, adults, children, babies)

## # A tibble: 119,390 × 4
##    any_kids     adults children babies
##    <chr>         <dbl>    <dbl>  <dbl>
##  1 w/o children      2        0      0
##  2 w/o children      2        0      0
##  3 w/o children      1        0      0
##  4 w/o children      1        0      0
##  5 w/o children      2        0      0
##  6 w/o children      2        0      0
##  7 w/o children      2        0      0
##  8 w/o children      2        0      0
##  9 w/o children      2        0      0
## 10 w/o children      2        0      0
## # … with 119,380 more rows

A closer look shows that some of the entries for children are NA, which leads to NA values for any_kids, which in turn leads to problems in the graphing.

hotels %>%
  select(any_kids, adults, children, babies) %>%
  filter(is.na(any_kids))

## # A tibble: 4 × 4
##   any_kids adults children babies
##   <chr>     <dbl>    <dbl>  <dbl>
## 1 <NA>          2       NA      0
## 2 <NA>          2       NA      0
## 3 <NA>          3       NA      0
## 4 <NA>          2       NA      0
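The NA values arise because arithmetic and comparisons involving NA yield NA, and ifelse passes an NA condition through unchanged. A minimal base R sketch with made-up values:

```r
# Made-up party sizes; the third record has a missing children count.
children <- c(0, 2, NA)
babies   <- c(0, 0, 0)

# NA + 0 is NA, and NA > 0 is NA, so the third label is NA as well.
any_kids <- ifelse(children + babies > 0, "w/ children", "w/o children")
print(any_kids)

# Keeping only complete records removes the NA label.
keep <- !is.na(children) & !is.na(babies)
print(any_kids[keep])
```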

We didn’t notice this at first, but it led to issues in the graphing, so we redid this mutation to simply drop the (four) offending records.

hotels <- hotels %>%
  filter(!is.na(children), !is.na(babies)) %>%
  mutate(
    any_kids = ifelse(children + babies > 0, 
                      "w/ children", 
                      "w/o children")
    )

hotels %>%
  select(any_kids, adults, children, babies) %>%
  filter(is.na(any_kids))

## # A tibble: 0 × 4
## # … with 4 variables: any_kids <chr>, adults <dbl>, children <dbl>,
## #   babies <dbl>

Now, we plot the annual distribution of bookings with and without children.

hotels %>%
  count(arrival_date_month_factor, hotel, any_kids) %>%
  group_by(hotel, any_kids) %>%
  mutate(prop = n/sum(n)) %>%
  ggplot() +
  aes(x = fct_rev(arrival_date_month_factor), 
      y = prop, 
      fill = hotel) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = NULL, y = "Percent of hotel stays", fill = NULL) +
  facet_wrap( ~ any_kids)

Here, we see that bookings with children are concentrated in the summer months, while bookings without children are more evenly distributed across the year.
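The proportions in this plot are computed within each group, so each hotel’s bars sum to 100% within a facet; the plot compares seasonal shape rather than booking volume. The following sketch, on made-up counts with a single hotel for brevity, illustrates the group-wise calculation:

```r
library(dplyr)

# Made-up monthly counts for one hotel.
counts <- tibble(
  month    = c("July", "December", "July", "December"),
  any_kids = c("w/ children", "w/ children", "w/o children", "w/o children"),
  n        = c(30, 10, 50, 50)
)

# Divide each count by the total of its own any_kids group.
shares <- counts %>%
  group_by(any_kids) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

print(shares)
```

In this toy data, the group with children has fewer bookings overall but a more seasonal profile, which is the kind of pattern the faceted plot is designed to show.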

Summary

We’ve demonstrated that there are trends in the annual distribution of bookings at the two different hotels, plotting both a continuous time-series over date and a binned time-series over the months of the year. This work required the use of a variety of dplyr wrangling commands, along with helpers from lubridate and forcats.

A. Scotina goes on to bin the bookings over season rather than month and then to use those categories to train a predictive model for hotel bookings. Training a sufficiently predictive model over continuous time, or even over months, is difficult, so the seasonal binning was important.


Keith VanderLinden

Spring, 2022
