Homework 2 - Capital Bikeshare (Revisited)*

We will be continuing our work with the Capital Bikeshare dataset that we started in Homework 1.

The Document

Start by creating a new RMarkdown document named hw2-bikeshare.Rmd using the standard homework format.

The Purpose

Our goal continues to be to understand ridership patterns to evaluate the current system and suggest potential improvements.

Towards that end, we will construct some more visualizations, this time using more fine-grained ridership data.

The Dataset

The dataset is an updated version of the Capital Bikeshare dataset used before: data/bikeshare-day.csv. Download it into a data sub-directory as usual, and then read it as follows. (N.b., we’ll discuss the mutate() function later in the course; for now, just note that this code converts the listed field values into factors.)

library(tidyverse)

# Read the daily data and convert the categorical columns to factors.
daily_rides <- read_csv("data/bikeshare-day.csv") %>%
    mutate(across(c(season, year, holiday, workingday, day_of_week, weather_type, rider_type), as.factor))
glimpse(daily_rides)

## Rows: 1,462
## Columns: 13
## $ date         <date> 2011-01-01, 2011-01-01, 2011-01-02, 2011-01-02, 2011-01-…
## $ rider_type   <fct> casual, registered, casual, registered, casual, registere…
## $ rides        <dbl> 331, 654, 131, 670, 120, 1229, 108, 1454, 82, 1518, 88, 1…
## $ season       <fct> W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, W, …
## $ year         <fct> 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 2011, 201…
## $ holiday      <fct> N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, N, …
## $ workingday   <fct> weekend, weekend, weekend, weekend, workday, workday, wor…
## $ day_of_week  <fct> 6, 6, 0, 0, 1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6, 0, 0, 1, …
## $ weather_type <fct> 2, 2, 2, 2, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 1, 1, 1, …
## $ temp         <dbl> 8.175849, 8.175849, 9.083466, 9.083466, 1.229108, 1.22910…
## $ feels_like   <dbl> 7.999250, 7.999250, 7.346774, 7.346774, -3.499270, -3.499…
## $ humidity     <dbl> 0.805833, 0.805833, 0.696087, 0.696087, 0.437273, 0.43727…
## $ wind_speed   <dbl> 10.749882, 10.749882, 16.652113, 16.652113, 16.636703, 16…
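
To see what the mutate(across(...)) call does in isolation, here is a minimal sketch on a small made-up tibble. The names toy and toy_factors are hypothetical and exist only for this illustration; the pattern is the same one applied to daily_rides above.

```r
library(dplyr)

# Hypothetical toy data, not the bikeshare file.
toy <- tibble(
  rider_type = c("casual", "registered", "casual"),
  year = c(2011, 2011, 2012),
  rides = c(331, 654, 120)
)

# Apply as.factor to each listed column; rides is left as a number.
toy_factors <- toy %>%
  mutate(across(c(rider_type, year), as.factor))

glimpse(toy_factors)
```

In the glimpse output, the converted columns should show the <fct> type, while rides stays <dbl>.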

What are the columns?

Observe that there are some extra columns in the dataset now.

The id columns are the columns that uniquely identify an observation (sometimes called a “case” instead of “observation”). In Homework 1, we only had one id column, date, because we had one observation for each date. The dataset for this homework has two id columns:

date: as before
rider_type: registered or casual (see below)

The additional id column means that we’ve now broken down the data by rider type (rider_type). Some riders have registered for a Capital Bikeshare membership to get better rates. Other riders just bought a single trip or short-term pass, so we call them casual riders. (N.b., according to the source data, “casual” riders include those with a Single Trip, 24-Hour Pass, 3-Day Pass, or 5-Day Pass.)
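
One way to check that a set of id columns really does identify each row is to count how often each combination occurs. The sketch below uses a hypothetical four-row miniature of the dataset (toy_rides), built from the first values in the glimpse above; the same pipeline could be run on daily_rides.

```r
library(dplyr)

# Hypothetical miniature of the dataset: two rows per date, one per rider type.
toy_rides <- tibble(
  date = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02")),
  rider_type = c("casual", "registered", "casual", "registered"),
  rides = c(331, 654, 131, 670)
)

# Keep only the combinations of id columns that occur more than once.
duplicates <- toy_rides %>%
  count(date, rider_type) %>%
  filter(n > 1)

# Zero rows means date and rider_type together identify each row.
nrow(duplicates)
```

Counting by date alone would report every date twice here, which is why date is no longer sufficient as the only id column.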

So: each row is the count of how many rides were completed on a given day by a given type of rider. For each row, we have the following observed variables:

rides: the number of rides by that type of rider
season: Winter, Spring, sUmmer, or Fall
year: 2011 or 2012.
holiday: N for ordinary days, Y for holidays
workingday: either “workday” or “weekend” (where “weekend” includes holidays too)
day_of_week: an integer between 0 and 6 inclusive. In this exercise, you will decode which number represents Monday, etc.
temp: the average temperature that day, in degrees C
feels_like: the “feels-like” temperature in degrees C
humidity: relative humidity, scaled to range from 0 to 1
weather_type: four coded weather types here
wind_speed: wind speed, scaled to a maximum of 67

For a description of the original fields, perhaps with different names, see the source data.

Exercise 1: Label days of week

The data set uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means, or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this: first, figure out what day-of-week codes map to what days-of-week (see the glimpse of the dataset given above for evidence); and then do the following.

Write a brief description of what the mapping is and your evidence that you got it correct.
Fill in the blank in the given code block to give labels to the weekdays. Use abbreviations (“Wed”, “Fri”, …).
Check your daily_rides dataframe to make sure the result is correct.

daily_rides <- daily_rides %>%
  mutate(day_of_week = factor(day_of_week, levels = c(0, 1, 2, 3, 4, 5, 6), labels = c(_____)))
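
If the levels and labels arguments are unfamiliar: factor() pairs each entry of levels with the label in the same position. Here is a minimal sketch using made-up size codes that have nothing to do with the bikeshare data (size_codes and sizes are hypothetical names).

```r
# Hypothetical codes and labels, chosen only to illustrate the mechanics.
size_codes <- c(2, 0, 1, 2)

# Code 0 becomes "S", 1 becomes "M", and 2 becomes "L".
sizes <- factor(size_codes, levels = c(0, 1, 2), labels = c("S", "M", "L"))

as.character(sizes)
```

The order of labels matters: swapping two labels silently relabels the data, so it is worth checking a few rows by hand afterwards.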

Exercise 2: Describe a row

Describe, in one or two English sentences, the information conveyed by the first row in the data frame. Focus your description on only the following fields: date, rider_type, rides, holiday, workingday, and day_of_week.

Exercise 3: Rides by date, by rider type

Make a scatterplot of the number of rides by date, broken down by type of rider. Tips:

Refer to your Homework 1 solution for a very similar (but not identical) plot.
Make the points smaller (size of 1) and partially transparent (alpha of 0.5) to reduce overplotting.
Fully label your plot to make the context clear.

Here is one possibility:
[Figure: hw2 Date vs. Rides]

Write a brief interpretation of this plot.

Exercise 4: Mapping vs Faceting

Now let’s look at workdays vs weekends (and holidays), which is encoded in the workingday variable, like in Homework 1. Try the following.

Make a scatterplot like in Exercise 3, but map shape to workingday.
Try faceting by workingday instead.
Try swapping the roles of rider_type and workingday in the faceted scatterplot.
Try adjusting parameters for the faceting, considering whether you should use rows or columns, and whether you should use free or fixed y scales.
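
As a reminder of the syntax involved, here is a sketch of a faceted scatterplot on the mpg dataset that ships with ggplot2. It stands in for the bikeshare data, so the variables and labels are placeholders rather than a solution; the object name p is also just for illustration.

```r
library(ggplot2)

# mpg ships with ggplot2; it stands in for the bikeshare data here.
p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  # Small, partially transparent points to reduce overplotting.
  geom_point(size = 1, alpha = 0.5) +
  # One panel per drive type, stacked in a single column,
  # each panel with its own y-axis range.
  facet_wrap(vars(drv), ncol = 1, scales = "free_y") +
  labs(x = "Engine displacement (L)", y = "Highway miles per gallon")

p
```

Changing ncol = 1 to nrow = 1 lays the panels out side by side, and dropping scales = "free_y" returns to a shared y axis, which makes heights comparable across panels.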

Once you’re done, pick two plots to leave in this section and remove all the others. Describe the structure of each plot, and compare and contrast the value of the two plots. For which purpose is each one better? What about the design of the plot makes it fit that purpose?

Exercise 5: How does ridership vary over a typical week?

We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked, and about the relationship between which variables. It can help to sketch a visualization on scrap paper.

Here’s one possible plot; try to make it.
[Figure: hw2 day vs rides]

Once you have that plot, try a few variations: faceting, using different plot types, etc.

Finally, write a one-or-two-sentence description of what the plot tells you about the data.

Exercise 6: Plot of your choice

Pick another variable or two from the list of variables above. Make a plot of their relationship.

Write a one-sentence description of what the plot suggests about ridership based on the data.

^*Exercise based on Data Science in a Box