library(tidyverse)
This document describes a simple analysis of the hotel booking dataset. This example is adapted from A. Scotina’s “Taking a Vacation with #TidyTuesday hotel bookings”, with an emphasis on data wrangling. Our interest is in understanding booking patterns over the course of the year.
We start by reloading the TidyTuesday hotel bookings dataset. The dataset is available from the TidyTuesday GitHub repository as hotels.csv; the file is only about 16 MB, so we’ve downloaded a local copy to reduce traffic on the server.
hotels_original <- read_csv("data/hotels.csv")
glimpse(hotels_original)
## Rows: 119,390
## Columns: 32
## $ hotel <chr> "Resort Hotel", "Resort Hotel", "Resort…
## $ is_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0, …
## $ lead_time <dbl> 342, 737, 7, 13, 14, 14, 0, 9, 85, 75, …
## $ arrival_date_year <dbl> 2015, 2015, 2015, 2015, 2015, 2015, 201…
## $ arrival_date_month <chr> "July", "July", "July", "July", "July",…
## $ arrival_date_week_number <dbl> 27, 27, 27, 27, 27, 27, 27, 27, 27, 27,…
## $ arrival_date_day_of_month <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, …
## $ stays_in_weekend_nights <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ stays_in_week_nights <dbl> 0, 0, 1, 1, 2, 2, 2, 2, 3, 3, 4, 4, 4, …
## $ adults <dbl> 2, 2, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, …
## $ children <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ babies <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ meal <chr> "BB", "BB", "BB", "BB", "BB", "BB", "BB…
## $ country <chr> "PRT", "PRT", "GBR", "GBR", "GBR", "GBR…
## $ market_segment <chr> "Direct", "Direct", "Direct", "Corporat…
## $ distribution_channel <chr> "Direct", "Direct", "Direct", "Corporat…
## $ is_repeated_guest <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_cancellations <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ previous_bookings_not_canceled <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ reserved_room_type <chr> "C", "C", "A", "A", "A", "A", "C", "C",…
## $ assigned_room_type <chr> "C", "C", "C", "A", "A", "A", "C", "C",…
## $ booking_changes <dbl> 3, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ deposit_type <chr> "No Deposit", "No Deposit", "No Deposit…
## $ agent <chr> "NULL", "NULL", "NULL", "304", "240", "…
## $ company <chr> "NULL", "NULL", "NULL", "NULL", "NULL",…
## $ days_in_waiting_list <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ customer_type <chr> "Transient", "Transient", "Transient", …
## $ adr <dbl> 0.00, 0.00, 75.00, 75.00, 98.00, 98.00,…
## $ required_car_parking_spaces <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
## $ total_of_special_requests <dbl> 0, 0, 0, 0, 1, 1, 0, 1, 1, 0, 0, 0, 3, …
## $ reservation_status <chr> "Check-Out", "Check-Out", "Check-Out", …
## $ reservation_status_date <date> 2015-07-01, 2015-07-01, 2015-07-02, 20…
We can see that the arrival date of each booking is split across three
columns: arrival_date_year, arrival_date_month, and
arrival_date_day_of_month.
To plot a time series showing the distribution of bookings over time,
we’ll need to combine them into a single date that can be ordered
chronologically. We’ll use lubridate.
library(lubridate)
# Paste year, month name, and day into one string and parse it as a date
hotels <- hotels_original %>%
  mutate(full_date = ymd(paste(arrival_date_year,
                               arrival_date_month,
                               arrival_date_day_of_month)))
hotels %>%
  select(full_date, arrival_date_year, arrival_date_month, arrival_date_day_of_month)
## # A tibble: 119,390 × 4
## full_date arrival_date_year arrival_date_month arrival_date_day_of_month
## <date> <dbl> <chr> <dbl>
## 1 2015-07-01 2015 July 1
## 2 2015-07-01 2015 July 1
## 3 2015-07-01 2015 July 1
## 4 2015-07-01 2015 July 1
## 5 2015-07-01 2015 July 1
## 6 2015-07-01 2015 July 1
## 7 2015-07-01 2015 July 1
## 8 2015-07-01 2015 July 1
## 9 2015-07-01 2015 July 1
## 10 2015-07-01 2015 July 1
## # … with 119,380 more rows
This pastes the year, month, and day values together into a single
string and uses ymd() to parse it into a date object that can be
sequenced chronologically. Here’s a density plot of the bookings over
time.
hotels %>%
  ggplot() +
  aes(x = full_date) +
  geom_density()
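To see the date construction in isolation, the following minimal sketch applies the same paste() and ymd() steps to three made-up rows. The vectors year, month, and day are hypothetical stand-ins for the three arrival-date columns, and parsing full month names this way assumes an English locale.

```r
library(lubridate)

# Toy values standing in for the three arrival-date columns (hypothetical data),
# deliberately listed out of chronological order
year  <- c(2016, 2015, 2017)
month <- c("February", "July", "December")
day   <- c(29, 1, 31)

# paste() joins the parts with spaces, e.g. "2016 February 29"
date_strings <- paste(year, month, day)

# ymd() parses year-month-day strings into Date objects
full_date <- ymd(date_strings)

# Unlike the month names alone, Date objects sort chronologically
sort(full_date)
```

Because the result is a Date vector rather than text, ggplot2 can place it on a continuous time axis.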
For a more general analysis, it would be better to group the bookings by month so that we can see annual trends. To support this, we’ll convert the month character vector to a factor, count the bookings (i.e., group by and tally), and then plot the booking count for each month.
# factor() orders the levels alphabetically by default
hotels <- hotels %>%
  mutate(arrival_date_month_factor = factor(arrival_date_month))
hotels %>%
  count(arrival_date_month_factor, hotel) %>%
  ggplot() +
  aes(x = arrival_date_month_factor, y = n, fill = hotel) +
  geom_col() +
  labs(x = NULL, y = "Frequency", fill = NULL) +
  coord_flip()
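The count() step above is shorthand for grouping and tallying. As a minimal sketch on made-up data (the tibble toy and its values are hypothetical), the two forms below produce the same counts:

```r
library(dplyr)

# Five made-up bookings (hypothetical data)
toy <- tibble(
  month = c("July", "July", "August", "July", "August"),
  hotel = c("City Hotel", "Resort Hotel", "City Hotel", "City Hotel", "City Hotel")
)

# Short form: one row per month/hotel combination, with the tally in column n
by_count <- toy %>%
  count(month, hotel)

# Long form: group explicitly, then count the rows in each group
by_group <- toy %>%
  group_by(month, hotel) %>%
  summarise(n = n(), .groups = "drop")

by_count
by_group
```

Either result can be passed to ggplot(); count() simply saves the explicit grouping and ungrouping.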
Alas, the months are ordered alphabetically, not chronologically. We’ll rebuild the month factor, this time “re-leveling” (i.e., reordering its levels) according to the chronological order of the months.
# fct_relevel() moves the listed levels to the front, in the order given
hotels <- hotels %>%
  mutate(arrival_date_month_factor =
           fct_relevel(factor(arrival_date_month), month.name))
# fct_rev() reverses the levels so that January appears at the top after coord_flip()
hotels %>%
  count(arrival_date_month_factor, hotel) %>%
  ggplot() +
  aes(x = fct_rev(arrival_date_month_factor), y = n, fill = hotel) +
  geom_col() +
  labs(x = NULL, y = "Frequency", fill = NULL) +
  coord_flip()
month.name is a character vector that lists the month
names (in English) in order: January, February, March, April, May, June,
July, August, September, October, November, December.
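The difference between the alphabetical and chronological orderings can be seen on a small made-up vector. This is only a sketch using base R's factor(); months_chr is a hypothetical example, not a column of the dataset.

```r
# A few month names in arbitrary order (toy data)
months_chr <- c("July", "August", "January", "April", "July")

# Default: factor() sorts the levels alphabetically
alphabetical <- factor(months_chr)
levels(alphabetical)

# Chronological: supply month.name as the levels;
# months that do not occur in the data remain as empty levels
chronological <- factor(months_chr, levels = month.name)
levels(chronological)

# The underlying integer codes now match the calendar month numbers
as.integer(chronological)
```

Plotting functions order a factor's categories by its levels, which is why the level order determines the order of the bars.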
It appears that there are fewer bookings in winter (December-February) overall.
Alternatively, lubridate’s month() can derive the month factor, with its levels already in chronological order, directly from full_date:
hotels <- hotels %>%
  mutate(arrival_date_month_factor =
           month(full_date, label = TRUE, abbr = FALSE))
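As a quick check of what month() returns with these arguments, here is a minimal sketch on three made-up dates (the vector dates is hypothetical). Note that the month labels follow the session locale, so the level names are English only in an English locale.

```r
library(lubridate)

# Three made-up arrival dates (hypothetical data)
dates <- ymd(c("2015-07-01", "2016-02-29", "2017-12-31"))

# label = TRUE returns a factor instead of a month number;
# abbr = FALSE uses full month names rather than abbreviations
m <- month(dates, label = TRUE, abbr = FALSE)

is.ordered(m)    # the result is an ordered factor
nlevels(m)       # one level per calendar month
as.integer(m)    # the codes are the month numbers
```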
Next, we add a column indicating whether there are any children in the
party and then plot the booking patterns with and without children. This
code uses a conditional control structure implemented by the
ifelse function, which we’ll discuss later in the
course.
hotels <- hotels %>%
  mutate(
    any_kids = ifelse(children + babies > 0,
                      "w/ children",
                      "w/o children")
  )
hotels %>%
  select(any_kids, adults, children, babies)
## # A tibble: 119,390 × 4
## any_kids adults children babies
## <chr> <dbl> <dbl> <dbl>
## 1 w/o children 2 0 0
## 2 w/o children 2 0 0
## 3 w/o children 1 0 0
## 4 w/o children 1 0 0
## 5 w/o children 2 0 0
## 6 w/o children 2 0 0
## 7 w/o children 2 0 0
## 8 w/o children 2 0 0
## 9 w/o children 2 0 0
## 10 w/o children 2 0 0
## # … with 119,380 more rows
The first ten rows look fine, but a few of the entries for children are
NA, which leads to NA values for any_kids, which in turn leads to
problems in the graphing. We can find them by filtering for missing values.
hotels %>%
  select(any_kids, adults, children, babies) %>%
  filter(is.na(any_kids))
## # A tibble: 4 × 4
## any_kids adults children babies
## <chr> <dbl> <dbl> <dbl>
## 1 <NA> 2 NA 0
## 2 <NA> 2 NA 0
## 3 <NA> 3 NA 0
## 4 <NA> 2 NA 0
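The reason a missing children value produces a missing any_kids value is that NA propagates through both the arithmetic and the comparison. A minimal sketch with made-up vectors (children and babies here are hypothetical, not the dataset columns):

```r
# Three made-up parties; the third has a missing child count
children <- c(0, 2, NA)
babies   <- c(0, 0, 0)

# Adding anything to NA gives NA
total_kids <- children + babies

# Comparing NA with 0 also gives NA, not FALSE
has_kids <- total_kids > 0

# ifelse() returns NA wherever its test is NA
any_kids <- ifelse(has_kids, "w/ children", "w/o children")
any_kids
```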
We didn’t notice this at first, but it led to issues in the graphing, so we redo the mutation, this time dropping the (four) offending records first.
hotels <- hotels %>%
  filter(!is.na(children), !is.na(babies)) %>%
  mutate(
    any_kids = ifelse(children + babies > 0,
                      "w/ children",
                      "w/o children")
  )
hotels %>%
  select(any_kids, adults, children, babies) %>%
  filter(is.na(any_kids))
## # A tibble: 0 × 4
## # … with 4 variables: any_kids <chr>, adults <dbl>, children <dbl>,
## # babies <dbl>
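The filter-then-mutate pattern can be checked on a small made-up table. This is a sketch only; the tibble toy and its values are hypothetical.

```r
library(dplyr)

# Four made-up bookings, two of them with a missing count
toy <- tibble(
  adults   = c(2, 2, 3, 1),
  children = c(0, NA, 1, 0),
  babies   = c(0, 0, 0, NA)
)

# Drop rows with a missing count first, then derive any_kids
toy_clean <- toy %>%
  filter(!is.na(children), !is.na(babies)) %>%
  mutate(any_kids = ifelse(children + babies > 0,
                           "w/ children",
                           "w/o children"))

# Number of rows removed by the filter
nrow(toy) - nrow(toy_clean)

# No missing values remain in the derived column
anyNA(toy_clean$any_kids)
```

Filtering before mutating means the comparison never sees an NA, so every remaining row gets one of the two labels.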
Now, we plot the annual distribution of bookings with and without children.
hotels %>%
  count(arrival_date_month_factor, hotel, any_kids) %>%
  group_by(hotel, any_kids) %>%
  mutate(prop = n / sum(n)) %>%
  ggplot() +
  aes(x = fct_rev(arrival_date_month_factor),
      y = prop,
      fill = hotel) +
  geom_col(position = "dodge") +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format()) +
  labs(x = NULL, y = "Percent of hotel stays", fill = NULL) +
  facet_wrap(~ any_kids)
Here, we see that bookings with children are concentrated in the summer months, while bookings without children are more evenly distributed over the year.
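The plot shows proportions rather than raw counts: after group_by(), sum(n) is computed within each group, so each group's proportions add up to one. A minimal sketch with made-up counts (toy_counts and its numbers are hypothetical):

```r
library(dplyr)

# Made-up monthly counts for the two kinds of party (hypothetical data)
toy_counts <- tibble(
  month    = c("July", "August", "July", "August"),
  any_kids = c("w/ children", "w/ children", "w/o children", "w/o children"),
  n        = c(30, 10, 50, 50)
)

# Within each any_kids group, divide each count by that group's total
toy_props <- toy_counts %>%
  group_by(any_kids) %>%
  mutate(prop = n / sum(n)) %>%
  ungroup()

toy_props

# Each group's proportions sum to one
toy_props %>%
  group_by(any_kids) %>%
  summarise(total = sum(prop))
```

Normalizing within each group makes the two facets comparable even though there are far fewer bookings with children than without.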
We’ve demonstrated that there are trends in the annual distribution of
bookings at the two different hotels, plotting both a continuous
time series over date and a binned time series over the months of the
year. This work required the use of a variety of the dplyr
wrangling commands.
A. Scotina goes on to bin the bookings by season rather than by month and then uses those categories to train a predictive model for hotel bookings. Training a sufficiently predictive model over continuous time, or even over months, is difficult, so the seasonal binning was important.
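As a rough illustration of what seasonal binning could look like, here is a minimal sketch using dplyr's case_when(). This is not Scotina's code: the season boundaries (three-month seasons, with winter as December-February as above), the column name season, and the tibble toy are all assumptions made for the example.

```r
library(dplyr)

# A few made-up arrival months (hypothetical data)
toy <- tibble(
  month = c("January", "April", "July", "October", "December")
)

# Map each month name to a season; boundaries are an assumption
toy_seasons <- toy %>%
  mutate(season = case_when(
    month %in% c("December", "January", "February") ~ "Winter",
    month %in% c("March", "April", "May")           ~ "Spring",
    month %in% c("June", "July", "August")          ~ "Summer",
    TRUE                                            ~ "Fall"
  ))

toy_seasons
```

Binning twelve months into four seasons gives a model fewer categories to learn from, at the cost of hiding month-level variation.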