We will be continuing our work with the Capital Bikeshare dataset that we started in Homework 1.
Start by creating a new RMarkdown document named
hw2-bikeshare.Rmd using the standard homework format.
Our goal continues to be to understand ridership patterns to evaluate the current system and suggest potential improvements.
Towards that end, we will construct some more visualizations, this time using more fine-grained ridership data.
The dataset is an updated version of the Capital Bikeshare
dataset used before: data/bikeshare-day.csv. Download it
into a data sub-directory as usual, and then read it as follows. (N.b.
we’ll discuss the mutate() function later in the course;
for now, just note that this code converts the listed field values into
factors.)
daily_rides <- read_csv("data/bikeshare-day.csv") %>%
  mutate(across(c(season, year, holiday, workingday, day_of_week, weather_type, rider_type), as.factor))
glimpse(daily_rides)
## Rows: 1,462
## Columns: 13
## $ date <date> 2011-01-01, 2011-01-01, 2011-01-02, 2011-01-02, 2011-01-…
## $ rider_type <fct> casual, registered, casual, registered, casual, registere…
## $ rides <dbl> 331, 654, 131, 670, 120, 1229, 108, 1454, 82, 1518, 88, 1…
## $ season <fct> W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, …
## $ year <fct> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
## $ holiday <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ workingday <fct> weekend, weekend, weekend, weekend, workday, workday, wor…
## $ day_of_week <fct> 6, 6, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 0, 0, 1, …
## $ weather_type <fct> 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, …
## $ temp <dbl> 8.175849, 8.175849, 9.083466, 9.083466, 1.229108, 1.22910…
## $ feels_like <dbl> 7.999250, 7.999250, 7.346774, 7.346774, -3.499270, -3.499…
## $ humidity <dbl> 0.805833, 0.805833, 0.696087, 0.696087, 0.437273, 0.43727…
## $ wind_speed <dbl> 10.749882, 10.749882, 16.652113, 16.652113, 16.636703, 16…
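To see what the mutate(across(..., as.factor)) step does in isolation, here is a minimal sketch on a made-up three-row table. The table toy and its values are invented for illustration; only the column names mirror daily_rides.

```r
library(dplyr)

# Made-up miniature table; the values are NOT from the homework data.
toy <- tibble(
  rider_type = c("casual", "registered", "casual"),
  year       = c(2011, 2011, 2012),
  rides      = c(10, 20, 30)
)

# Convert only the listed columns to factors; rides is left untouched.
toy_factors <- toy %>%
  mutate(across(c(rider_type, year), as.factor))

glimpse(toy_factors)       # rider_type and year show <fct>, rides stays <dbl>
levels(toy_factors$year)   # the distinct year values, stored as labels
```

Converting to factors tells R to treat these fields as categories rather than as numbers or free text, which matters later when we map them to color, shape, or facets.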
Observe that there are some extra columns in the dataset now.
The id columns are the columns that uniquely
identify an observation (sometimes called a “case” instead of
“observation”). In Homework 1, we only had one id column,
date, because we had one observation for each date. The
dataset for this homework has two id columns:
- date: as before
- rider_type: registered or casual (see below)

The additional id column means that we’ve now broken down the data by
rider type (rider_type). Some riders have
registered for a Capital Bikeshare membership to get better
rates. Other riders just bought a single trip or short-term pass, so we
call them casual riders. (N.b. according to the source data,
“casual” riders include Single Trip, 24-Hour Pass, 3-Day Pass, or 5-Day
Pass.)
So: each row is the count of how many rides were completed on a given day by a given type of rider. For each row, we have the following observed variables:
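One way to check that a set of id columns really does identify each row is to count rows per combination and look for repeats. The sketch below uses a tiny table built from the first four rows shown in the glimpse output above; the names toy and duplicated_ids are ours.

```r
library(dplyr)

# First four rows of the glimpse output, keeping only the id columns and rides.
toy <- tibble(
  date       = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02")),
  rider_type = c("casual", "registered", "casual", "registered"),
  rides      = c(331, 654, 131, 670)
)

# Count rows per (date, rider_type) pair; any n > 1 would be a repeated id.
duplicated_ids <- toy %>%
  count(date, rider_type) %>%
  filter(n > 1)

nrow(duplicated_ids)   # zero rows means the pair identifies each observation

# date alone is no longer enough: each date appears once per rider type.
toy %>% count(date)
```

The same two pipelines should work on the full daily_rides dataframe, assuming it has been read in as shown earlier.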
- rides: the number of rides by that type of rider
- season: Winter, Spring, sUmmer, or Fall
- year: 2011 or 2012
- holiday: N for ordinary days, Y for holidays
- workingday: either “workday” or “weekend” (where “weekend” includes holidays too)
- day_of_week: an integer between 0 and 6 inclusive. In this exercise, you will decode which number represents Monday, etc.
- temp: the average temperature that day, in degrees C
- feels_like: the “feels-like” temperature in degrees C
- humidity: relative humidity, scaled to range from 0 to 1
- weather_type: four coded weather types here
- wind_speed: scaled to 67 max

For a description of the original fields, perhaps with different names, see the source data.
The data set uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means, or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this: first, figure out what day-of-week codes map to what days-of-week (see the glimpse of the dataset given above for evidence); and then do the following.
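- Fill in the blank in the code below with the day labels, in the order of the codes.
- Inspect the daily_rides dataframe to make sure the result is correct.

daily_rides <- daily_rides %>%
  mutate(day_of_week = factor(day_of_week, levels = c(0, 1, 2, 3, 4, 5, 6), labels = c(_____)))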
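If the levels and labels arguments of factor() are unfamiliar, this small sketch shows how they pair up. It deliberately uses hypothetical codes unrelated to the homework data (1, 2, 3 standing for low, medium, high), so it does not give away the day-of-week mapping.

```r
# Hypothetical codes, unrelated to the bikeshare data.
codes <- c(3, 1, 2, 1)

# The i-th entry of `labels` is the name given to the i-th entry of `levels`.
sizes <- factor(codes, levels = c(1, 2, 3), labels = c("low", "medium", "high"))

sizes              # values now print as labels instead of numbers
levels(sizes)      # order follows `levels`, not alphabetical order
as.integer(sizes)  # underlying positions in the level order: 3 1 2 1
```

Because the level order is preserved, plots that use the relabeled factor on an axis keep the categories in code order rather than sorting them alphabetically.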
Describe, in one or two English sentences, the information conveyed
by the first row in the data frame. Focus your description on only the
following fields: date; rider_type;
rides; holiday; workingday; and
day_of_week.
Make a scatterplot of the number of rides by date, broken down by type of rider. Tips:
- Make the points small (size of 1) and partially transparent (alpha of 0.5) to reduce overplotting.

Here is one possibility:
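The point settings from the tips can be sketched on made-up data as follows. Mapping color to rider_type is our assumption about how to break the plot down by rider type; the data frame toy and its values are invented, not taken from the homework file.

```r
library(ggplot2)

# Invented data: 60 days for each of two rider types.
set.seed(1)
toy <- data.frame(
  date       = rep(seq(as.Date("2011-01-01"), by = "day", length.out = 60), times = 2),
  rider_type = rep(c("casual", "registered"), each = 60),
  rides      = c(rpois(60, 300), rpois(60, 1500))
)

# size and alpha are set outside aes() because they are constants, not mappings.
p <- ggplot(toy, aes(x = date, y = rides, color = rider_type)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(x = "Date", y = "Number of rides", color = "Rider type")
p
```

To adapt this to the homework, the toy data frame would be swapped for daily_rides.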
Write a brief interpretation of this plot.
Now let’s look at workdays vs weekends (and holidays), which is
encoded in the workingday variable, like in Homework 1. Try
the following.
- Map shape to workingday.
- Facet by workingday instead.
- Use both rider_type and workingday in the faceted scatterplot.

Once you’re done, pick two plots to leave in this section and remove all the others. Describe the structure of each plot, and compare and contrast the value of the two plots. For which purpose is each one better? What about the design of the plot makes it fit that purpose?
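For reference, here is a hedged sketch of the three kinds of encoding mentioned above (a shape mapping, a single facet variable, and a two-variable facet grid) on invented data. The object names and the choice of facet_wrap() and facet_grid() are our assumptions; other layouts are equally valid.

```r
library(ggplot2)

# Invented data: 28 days for each rider type, with a made-up 5 + 2 day pattern.
set.seed(2)
n_days <- 28
toy <- data.frame(
  date       = rep(seq(as.Date("2011-01-03"), by = "day", length.out = n_days), times = 2),
  rider_type = rep(c("casual", "registered"), each = n_days),
  workingday = rep(rep(c("workday", "weekend"), times = c(5, 2)), length.out = 2 * n_days),
  rides      = rpois(2 * n_days, 500)
)

base <- ggplot(toy, aes(x = date, y = rides))

# Variant 1: one panel, workingday shown by point shape.
by_shape <- base + geom_point(aes(color = rider_type, shape = workingday))

# Variant 2: one panel per workingday value.
by_facet <- base + geom_point(aes(color = rider_type)) +
  facet_wrap(vars(workingday))

# Variant 3: a grid with rider_type in rows and workingday in columns.
by_grid <- base + geom_point() +
  facet_grid(rows = vars(rider_type), cols = vars(workingday))

by_grid
```

Shape keeps everything on shared axes, which eases direct comparison of nearby points, while facets separate the groups so that each pattern can be read without overlap.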
We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked, and about the relationship between which variables. It can help to sketch a visualization on scrap paper.
Here’s one possible plot; try to make it.
Once you have that plot, try a few variations: faceting, using different plot types, etc.
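As one example of a different plot type, a boxplot summarizes the distribution of a numeric variable within each category. The sketch below is not the plot referred to above; it uses invented data with placeholder category labels ("day A", etc.), and the variable name day_label is hypothetical.

```r
library(ggplot2)

# Invented data: three placeholder categories, two rider types, 60 rows.
set.seed(3)
toy <- data.frame(
  day_label  = factor(rep(c("day A", "day B", "day C"), each = 20),
                      levels = c("day A", "day B", "day C")),
  rider_type = rep(c("casual", "registered"), times = 30),
  rides      = rpois(60, 400)
)

# One box per (day_label, rider_type) combination.
p_box <- ggplot(toy, aes(x = day_label, y = rides, fill = rider_type)) +
  geom_boxplot() +
  labs(x = "Day (placeholder labels)", y = "Number of rides", fill = "Rider type")
p_box
```

Replacing geom_boxplot() with another geom, or moving rider_type from fill to a facet, gives further variations to compare.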
Finally, write a one-or-two-sentence description of what the plot tells you about the data.
Pick another variable or two from the list of variables above. Make a plot of their relationship.
Write a one-sentence description of what the plot suggests about ridership based on the data.
*Exercise based on Data Science in a Box