Homework 2 - Capital Bikeshare (Revisited)*

This assignment continues our work with the Capital Bikeshare dataset. Our goal is still to understand ridership patterns so that we can evaluate the current system and suggest potential improvements. To that end, we will construct more visualizations, this time using finer-grained ridership data.

Data

The dataset is an updated version of the Capital Bikeshare dataset used before: data/bikeshare-day.csv. Download it into the shared data sub-directory as usual, and then read it as follows. (N.b., we’ll discuss the mutate() function later in the course; for now, just note that this code converts the listed fields into factors.)

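library(tidyverse)  # provides read_csv(), mutate(), across(), and %>%

# Read the daily data and convert the categorical fields to factors.
daily_rides <- read_csv("data/bikeshare-day.csv") %>%
  mutate(
    across(
      c(season, year, holiday, workingday, day_of_week, weather_type, rider_type),
      as.factor
    )
  )
daily_rides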

## # A tibble: 1,462 × 13
##    date       rider_t…¹ rides season year  holiday worki…² day_o…³ weath…⁴  temp
##    <date>     <fct>     <dbl> <fct>  <fct> <fct>   <fct>   <fct>   <fct>   <dbl>
##  1 2011-01-01 casual      331 W      2011  N       weekend 6       2        8.18
##  2 2011-01-01 register…   654 W      2011  N       weekend 6       2        8.18
##  3 2011-01-02 casual      131 W      2011  N       weekend 0       2        9.08
##  4 2011-01-02 register…   670 W      2011  N       weekend 0       2        9.08
##  5 2011-01-03 casual      120 W      2011  N       workday 1       1        1.23
##  6 2011-01-03 register…  1229 W      2011  N       workday 1       1        1.23
##  7 2011-01-04 casual      108 W      2011  N       workday 2       1        1.4 
##  8 2011-01-04 register…  1454 W      2011  N       workday 2       1        1.4 
##  9 2011-01-05 casual       82 W      2011  N       workday 3       1        2.67
## 10 2011-01-05 register…  1518 W      2011  N       workday 3       1        2.67
## # … with 1,452 more rows, 3 more variables: feels_like <dbl>, humidity <dbl>,
## #   wind_speed <dbl>, and abbreviated variable names ¹rider_type, ²workingday,
## #   ³day_of_week, ⁴weather_type

Observe that there are some extra columns in the dataset now.

The id columns are the columns that uniquely identify an observation (sometimes called a “case” instead of “observation”). In Homework 1, we only had one id column, date, because we had one observation for each date. The dataset for this homework has two id columns:

date: as before
rider_type: registered or casual (see below)

The additional id column means that the data is now broken down by rider type (rider_type). Some riders have registered for a Capital Bikeshare membership to get better rates. Other riders just bought a single trip or a short-term pass, so we call them casual riders. (N.b., according to the source data, “casual” riders include: Single Trip, 24-Hour Pass, 3-Day Pass, or 5-Day Pass.)
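One way to check that a set of columns really identifies each row is to count the rows per combination and look for duplicates. The sketch below uses a small stand-in tibble (the hypothetical name toy_rides, holding the first four rows of the printout above) rather than the full file; the same count() call should work on daily_rides.

```r
library(dplyr)

# Stand-in for daily_rides: the first four rows shown in the printout above.
toy_rides <- tibble(
  date = as.Date(c("2011-01-01", "2011-01-01", "2011-01-02", "2011-01-02")),
  rider_type = c("casual", "registered", "casual", "registered"),
  rides = c(331, 654, 131, 670)
)

# Count rows per (date, rider_type) combination; any n > 1 is a duplicate.
duplicates <- toy_rides %>%
  count(date, rider_type) %>%
  filter(n > 1)

# Zero rows here means the two columns uniquely identify each observation.
nrow(duplicates)
```

If the result is not zero, the listed columns are not sufficient as id columns and another column is needed to distinguish the rows.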

So each row gives the number of rides completed on a given day by a given type of rider. For each row, we have the following observed variables:

rides: the number of rides by that type of rider
season: Winter, Spring, sUmmer, or Fall
year: 2011 or 2012
holiday: N for ordinary days, Y for holidays
workingday: either “workday” or “weekend” (where “weekend” includes holidays too)
day_of_week: an integer between 0 and 6 inclusive. In Exercise 1, you will work out which number represents Monday, etc.
temp: the average temperature that day, in degrees C
feels_like: the “feels-like” temperature in degrees C
humidity: relative humidity, scaled to range from 0 to 1
weather_type: one of four coded weather types
wind_speed: wind speed, scaled to a maximum of 67

For a description of the original fields, perhaps with different names, see the source data.

Analysis

Do the following data exploration exercises and include descriptions of your work in the document:

1. Label the days of the week.

The dataset uses the integers 0 through 6 to label days of the week. It does not document, however, what 0 means or what 6 means. If we want to make understandable plots, we should label these days of the week. To do this, first figure out which day-of-week codes map to which days of the week (see the glimpse of the dataset given above for evidence), and then do the following.

Write a brief description of what the mapping is and your evidence that you got it correct.
Fill in the blank in the given code block to give labels to the weekdays. Use abbreviations (“Wed”, “Fri”, …).
Check your daily_rides dataframe to make sure the result is correct.

daily_rides <- daily_rides %>%
  mutate(day_of_week = factor(day_of_week, levels = c(0, 1, 2, 3, 4, 5, 6), labels = c(_____)))
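As a reminder of how the levels and labels arguments of factor() pair up, here is a minimal sketch on a toy vector (size_code is a made-up example, not part of the bikeshare data). The i-th label is attached to the i-th level. The last line shows one way to gather evidence for the mapping: base R can report the weekday of any date that appears in the dataset.

```r
# Toy example (not the bikeshare data): recode integer codes as labeled levels.
size_code <- c(2, 0, 1, 2)
size <- factor(size_code, levels = c(0, 1, 2), labels = c("S", "M", "L"))

# Each code is replaced by the label in the same position as its level.
size
levels(size)

# Weekday name of a known date (the printed name depends on your locale).
weekdays(as.Date("2011-01-01"))
```

Comparing the weekday of a few known dates with their day_of_week codes should be enough to pin down the full mapping.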

2. Describe a row.

Describe, in one or two sentences, the information conveyed by the first row in the data frame. Focus your description on only the following fields: date, rider_type, rides, holiday, workingday, and day_of_week.

3. Visualize rides by date and by rider type.

Make a scatterplot of the number of rides by date, broken down by type of rider. Tips:

Refer to your Homework 1 solution for a very similar (but not identical) plot.
Make the points smaller (size of 1) and partially transparent (alpha of 0.5) to reduce overplotting.
Fully label your plot to make the context clear.
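The point-size, transparency, and labeling tips can be sketched on synthetic data as follows. The data frame toy and its columns (day, group, value) are invented for illustration; adapt the aesthetics and labels to the bikeshare variables yourself.

```r
library(ggplot2)

# Synthetic data for illustration only: two groups observed over 100 days.
set.seed(1)
toy <- data.frame(
  day = rep(seq(as.Date("2011-01-01"), by = "day", length.out = 100), times = 2),
  group = rep(c("a", "b"), each = 100),
  value = c(rnorm(100, mean = 50, sd = 10), rnorm(100, mean = 200, sd = 30))
)

# size and alpha are set outside aes() because they are constants,
# not mappings from a variable in the data.
p <- ggplot(toy, aes(x = day, y = value, color = group)) +
  geom_point(size = 1, alpha = 0.5) +
  labs(
    title = "Synthetic values by day",
    x = "Date",
    y = "Value",
    color = "Group"
  )
p
```

Setting size and alpha as constants keeps the legend focused on the one variable that is actually mapped (here, color).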

Here is one possibility.

Write a brief interpretation of this plot.

4. Experiment with mapping vs faceting.

Now let’s look at workdays vs weekends (and holidays), as we did in Homework 1. This distinction is encoded in the workingday variable. Try the following.

Make a scatterplot like in Exercise 3, but map shape to workingday.
Try faceting by workingday instead.
Try swapping the roles of rider_type and workingday in the faceted scatterplot.
Try adjusting parameters for the faceting, considering whether you should use rows or columns, and whether you should use free or fixed y scales.
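The faceting parameters mentioned above can be explored with a sketch like the following, again on invented data (toy, panel, x, and y are hypothetical names). It builds the same scatterplot twice: once with panels stacked in a single column and free y scales, and once with panels side by side and a fixed y scale.

```r
library(ggplot2)

# Synthetic data: two panels whose y values differ greatly in magnitude.
set.seed(2)
toy <- data.frame(
  x = rep(1:50, times = 2),
  panel = rep(c("small", "large"), each = 50),
  y = c(runif(50, 0, 10), runif(50, 0, 1000))
)

base <- ggplot(toy, aes(x = x, y = y)) +
  geom_point(size = 1, alpha = 0.5)

# One column, free y scales: each panel uses its own y range,
# so patterns within a panel are easier to see.
stacked_free <- base + facet_wrap(vars(panel), ncol = 1, scales = "free_y")

# One row, fixed y scale: panels share a y axis,
# so magnitudes can be compared directly across panels.
side_fixed <- base + facet_wrap(vars(panel), nrow = 1, scales = "fixed")

stacked_free
side_fixed
```

Free scales favor reading the shape within each panel, while fixed scales favor comparing sizes between panels; which is better depends on the question the plot is meant to answer.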

Once you’re done, pick two plots to leave in this section and remove all the others. Describe the structure of each plot, and compare and contrast the value of the plots. For which purpose is each one better? What about the design of the plot makes it fit that purpose?

5. Explore how ridership varies over a typical week.

We want to find out how ridership varies over a typical week. Before moving on, consider what question is being asked and which variables it relates. It can help to sketch a visualization on scrap paper. Then make a plot that helps us answer the question.

Here’s one possible plot.

Once you have that plot, try a few variations: faceting, using different plot types, etc.

Finally, write a one-or-two-sentence description of what the plot tells you about the data.

6. Create a new plot of your own design.

Pick another variable or two from the list of variables above. Make a plot of their relationship and write a one-sentence description of what the plot suggests about ridership based on the data.

Conclusion

This homework has meandered around a bit, but it has created a few useful visualizations of the bike-share data. Make a general recommendation for Capital Bikeshare, based on your analysis, on how to plan the number of bikes.

*Exercise based on Data Science in a Box