This assignment continues our work with the Capital Bikeshare dataset. Our goal is still to understand ridership patterns so that we can evaluate the current system and suggest potential improvements. Toward that end, we will construct more visualizations, this time using more fine-grained ridership data.
The dataset is an updated version of the Capital Bikeshare
dataset used before: data/bikeshare-day.csv. Download it
into the shared data sub-directory as usual, and then read it as
follows (n.b. we’ll discuss the mutate() function later in
the course; for now, just note that this code converts the listed field
values into factors):
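# read_csv(), mutate(), across(), and %>% all come from the tidyverse.
library(tidyverse)

# Read the data, then convert the listed columns into factors.
daily_rides <- read_csv("data/bikeshare-day.csv") %>%
  mutate(
    across(
      c(season, year, holiday, workingday, day_of_week, weather_type, rider_type),
      as.factor
    )
  )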
daily_rides
## # A tibble: 1,462 × 13
## date rider_t…¹ rides season year holiday worki…² day_o…³ weath…⁴ temp
## <date> <fct> <dbl> <fct> <fct> <fct> <fct> <fct> <fct> <dbl>
## 1 2011-01-01 casual 331 W 2011 N weekend 6 2 8.18
## 2 2011-01-01 register… 654 W 2011 N weekend 6 2 8.18
## 3 2011-01-02 casual 131 W 2011 N weekend 0 2 9.08
## 4 2011-01-02 register… 670 W 2011 N weekend 0 2 9.08
## 5 2011-01-03 casual 120 W 2011 N workday 1 1 1.23
## 6 2011-01-03 register… 1229 W 2011 N workday 1 1 1.23
## 7 2011-01-04 casual 108 W 2011 N workday 2 1 1.4
## 8 2011-01-04 register… 1454 W 2011 N workday 2 1 1.4
## 9 2011-01-05 casual 82 W 2011 N workday 3 1 2.67
## 10 2011-01-05 register… 1518 W 2011 N workday 3 1 2.67
## # … with 1,452 more rows, 3 more variables: feels_like <dbl>, humidity <dbl>,
## # wind_speed <dbl>, and abbreviated variable names ¹rider_type, ²workingday,
## # ³day_of_week, ⁴weather_type
Observe that there are some extra columns in the dataset now.
The id columns are the columns that uniquely
identify an observation (sometimes called a “case” instead of
“observation”). In Homework 1, we only had one id column,
date, because we had one observation for each date. The
dataset for this homework has two id columns:
- date: as before
- rider_type: registered or casual (see below)

The additional id column means that we’ve now broken down the data by
rider type (rider_type). Some riders have
registered for a Capital Bikeshare membership to get better
rates. Other riders just bought a single trip or short-term pass, so we
call them casual riders. (N.b., according to the source data,
“casual” riders include Single Trip, 24-Hour Pass, 3-Day Pass, and 5-Day
Pass.)
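To make the idea of id columns concrete, here is a minimal sketch of one way to check whether a set of columns uniquely identifies each row. The toy_rides tibble and its values are invented for illustration and are not taken from bikeshare-day.csv; the same pattern should carry over to daily_rides.

```r
library(dplyr)

# Toy data invented for illustration; not taken from bikeshare-day.csv.
toy_rides <- tibble(
  date = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02")),
  rider_type = c("casual", "registered", "casual", "registered"),
  rides = c(10, 20, 30, 40)
)

# Count rows for each combination of the candidate id columns.
# If the columns really identify observations, no combination appears twice.
duplicated_ids <- toy_rides %>%
  count(date, rider_type) %>%
  filter(n > 1)

nrow(duplicated_ids)  # 0 here: every (date, rider_type) pair is unique

# date alone is not enough in this toy data: each date appears once per rider type.
date_only <- toy_rides %>%
  count(date) %>%
  filter(n > 1)

nrow(date_only)  # 2 here: both dates are repeated
```

If the first count came back nonzero on real data, that would suggest either duplicate rows or a missing id column.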
So: each row is the count of how many rides were completed on a given day by a given type of rider. For each row, we have the following observed variables:
- rides: the number of rides by that type of rider
- season: Winter, Spring, sUmmer, or Fall
- year: 2011 or 2012
- holiday: N for ordinary days, Y for holidays
- workingday: either “workday” or “weekend” (where “weekend” includes holidays too)
- day_of_week: an integer between 0 and 6 inclusive. In this exercise, you will decode which number represents Monday, etc.
- temp: the average temperature that day, in degrees C
- feels_like: the “feels-like” temperature, in degrees C
- humidity: relative humidity, scaled to range from 0 to 1
- weather_type: one of four coded weather types
- wind_speed: wind speed, scaled to a maximum of 67

For a description of the original fields, perhaps with different names, see the source data.
Do the following data exploration exercises and include descriptions of your work in the document:
1. Label the days of the week.
The dataset uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this, first figure out which day-of-week code maps to which day of the week (see the glimpse of the dataset given above for evidence). Then fill in the blank in the code below with the day labels, and examine the resulting daily_rides dataframe to make sure the result is correct.
daily_rides <- daily_rides %>%
  mutate(
    day_of_week = factor(
      day_of_week,
      levels = c(0, 1, 2, 3, 4, 5, 6),
      labels = c(_____)
    )
  )
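If the levels and labels arguments of factor() are unfamiliar, the following sketch shows how they pair up. It uses made-up size codes rather than day-of-week codes, so the names size_codes and sizes and the labels themselves are purely illustrative.

```r
# Hypothetical integer codes for sizes; unrelated to the bikeshare data.
size_codes <- c(2, 0, 1, 2, 0)

# `levels` lists the raw values in the desired order;
# `labels` gives the display name for each level, position by position.
sizes <- factor(
  size_codes,
  levels = c(0, 1, 2),
  labels = c("small", "medium", "large")
)

sizes          # the relabeled values
levels(sizes)  # the labels, in level order
table(sizes)   # a quick count per label, handy for checking the result
```

Because labels are matched to levels by position, the order of the labels matters: the first label goes with the first level, and so on.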
2. Describe a row.
Describe, in one or two sentences, the information conveyed by the
first row in the data frame. Focus your description on only the following
fields: date; rider_type; rides;
holiday; workingday; and
day_of_week.
3. Visualize rides by date and by rider type.
Make a scatterplot of the number of rides by date, broken down by type of rider. Tips:
- Make the points small (size of 1) and partially transparent (alpha of 0.5) to reduce overplotting.

Here is one possibility.
Write a brief interpretation of this plot.
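As a reminder of how the size and alpha settings from the tips are written, here is a small sketch on a different dataset. It uses economics_long, which ships with ggplot2, as a stand-in for the bikeshare data, so the variable names below are not the ones you will need for this exercise.

```r
library(ggplot2)

# economics_long ships with ggplot2; it stands in for the bikeshare data here.
p <- ggplot(economics_long, aes(x = date, y = value01, color = variable)) +
  geom_point(size = 1, alpha = 0.5) +  # small, half-transparent points
  labs(x = "Date", y = "Scaled value", color = "Series")

p
```

Note that size and alpha are set outside aes() because they are fixed values for every point, while color is inside aes() because it is mapped to a variable.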
4. Experiment with mapping vs faceting.
Now let’s look at workdays vs weekends (and holidays), which are
encoded in the workingday variable, as we did in Homework 1. Try the following.
- Map shape to workingday.
- Facet by workingday instead.
- Try swapping rider_type and workingday in the faceted scatterplot.

Once you’re done, pick two plots to leave in this section and remove all the others. Describe the structure of each plot, and compare and contrast the value of the plots. For which purpose is each one better? What about the design of the plot makes it fit that purpose?
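For reference, the difference between mapping a variable to an aesthetic and faceting by it can be sketched as follows. This example uses the mpg dataset that ships with ggplot2 rather than the bikeshare data, so treat it as an illustration of the syntax only.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the bikeshare data here.
base <- ggplot(mpg, aes(x = displ, y = hwy)) +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

# Mapping: drive type is shown by point shape within a single panel.
p_mapped <- base + geom_point(aes(shape = drv), alpha = 0.5)

# Faceting: each drive type gets its own panel on shared axes.
p_faceted <- base + geom_point(alpha = 0.5) + facet_wrap(vars(drv))

p_mapped
p_faceted
```

Mapping keeps all groups in one panel, which tends to make direct comparison easier; faceting separates the groups, which tends to make each group's own pattern easier to see.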
5. Explore how ridership varies over a typical week.
We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked, and about the relationship between which variables. It can help to sketch a visualization on scrap paper. Then make a plot that helps us answer the question.
Here’s one possible plot.
Once you have that plot, try a few variations: faceting, using different plot types, etc.
Finally, write a one-or-two-sentence description of what the plot tells you about the data.
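When trying different plot types for a numeric variable across categories, two common options are boxplots and jittered points. The sketch below shows both on the mpg dataset that ships with ggplot2; it is not the bikeshare data, and these are only two of many reasonable choices.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the bikeshare data here.
base <- ggplot(mpg, aes(x = class, y = hwy)) +
  labs(x = "Vehicle class", y = "Highway miles per gallon")

# Boxplots summarize each group's median, quartiles, and outliers.
p_box <- base + geom_boxplot()

# Jittered points show every observation; width spreads points sideways only.
p_jitter <- base + geom_jitter(width = 0.2, height = 0, alpha = 0.5)

p_box
p_jitter
```

Setting height = 0 in geom_jitter() keeps the y values exact, so only the horizontal position is randomized.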
6. Create a new plot of your own design.
Pick another variable or two from the list of variables above. Make a plot of their relationship and write a one-sentence description of what the plot suggests about ridership based on the data.
This homework has meandered around a bit, but it has created a few useful visualizations of the bike-share data. Make a general recommendation for Capital Bikeshare, based on your analysis, on how to plan the number of bikes.
*Exercise based on Data Science in a Box