Homework 1 - Capital Bikeshare*

This exercise walks through an exploratory data analysis of a dataset produced by Capital Bikeshare in Washington, D.C.

The Document

Start by creating a new homework sub-directory (i.e., /homework1) and, in it, a new RMarkdown document named hw1-bikeshare.Rmd. Include the standard assignment header with your name and the date (e.g., Spring 2022). The document should produce HTML output.

The Purpose

Let’s imagine we’re hired by the administrators of the Capital Bikeshare program to help them understand and predict the hourly demand for rental bikes. This understanding will help them plan the number of bikes that need to be available at different parts of the system at different times so that they can avoid cases in which:

- someone wants a bike but the station is empty.
- someone wants to return a bike but the station is full.

Describe this purpose at the beginning of your document.

The Dataset

The data for this problem were collected from the Capital Bikeshare program over the course of two years (2011 and 2012). Researchers at the University of Porto processed the data and augmented it with extra information, as described on this UCI ML Repository webpage.

We’ll use this simplified version of the dataset that we’ve derived from the original data. It is in CSV format. Download a copy into a data sub-directory of your homework solution directory.

Include in your document a description of the source of this dataset, and a code chunk that loads it and prints out the first few rows.
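
As a rough illustration of the loading step, here is a minimal sketch in base R. The file name example.csv is a placeholder, not the name of the course dataset, and a built-in data frame (mtcars) is written to a temporary directory as a stand-in so that the sketch runs on its own. In your document, you would instead point the path at the CSV file you downloaded into your data sub-directory.

```r
# Stand-in setup so the sketch runs anywhere: write a built-in data frame
# to a temporary "data" directory. In your homework, skip this step and
# set csv_path to the file you downloaded (the name below is a placeholder).
data_dir <- file.path(tempdir(), "data")
dir.create(data_dir, showWarnings = FALSE)
csv_path <- file.path(data_dir, "example.csv")
write.csv(head(mtcars, 10), csv_path, row.names = FALSE)

# Load the CSV into a data frame.
rides <- read.csv(csv_path)

# Print the first few rows (head() shows six by default).
head(rides)
```

In a tidyverse workflow, readr::read_csv() is a common alternative to read.csv(). Calling nrow() and names() on the loaded data frame reports its row count and field names.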

The Analysis

Do the following data exploration exercises and include descriptions of your work in the document:

Name and describe the fields of the dataset.
Say how many rows the dataset contains and what each row represents.
Create a scatter plot showing the total number of rides each day. Sample code is provided below, but you will need to fill in the blanks.

____ %>%
  ggplot() +
  aes(x = ___, y = ___) +
  geom_point() +
  geom_smooth() +
  labs(
    x = "___", 
    y = "___"
  )

Notes on this code:

It uses what will be our standard pipeline approach for data wrangling and plotting.
We pipe the dataset into ggplot() which builds our plot, in layers.
We then define the mappings between the variables in the dataset and the aesthetics of the plot (e.g. x and y coordinates, colors, etc.).
In the next two layers, we:
- represent the data with geometric shapes, in this case with points.
- add a smoothing “trend” line.
Try removing geom_point() and geom_smooth() one at a time to make sure you understand what each one does.
In the final layer, we make the visualization more hospitable by adding labels for each aesthetic of the plot.
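
To see the layering in action without filling in the blanks for you, here is a sketch of the same pipeline applied to R's built-in mtcars dataset rather than the bikeshare data. The variables wt and mpg and the label text belong to mtcars and are used only for illustration. The magrittr package is loaded here to provide %>%; loading the tidyverse would work as well.

```r
library(ggplot2)
library(magrittr)  # provides %>%; library(tidyverse) also attaches it

# Pipe a built-in dataset (not the bikeshare data) into ggplot(),
# then add one layer at a time.
p <- mtcars %>%
  ggplot() +
  aes(x = wt, y = mpg) +   # map variables to the x and y aesthetics
  geom_point() +           # draw one point per row
  geom_smooth() +          # add a smoothed trend line
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon"
  )

# Printing the plot object renders it.
p
```

Deleting the geom_point() line leaves only the trend line, and deleting geom_smooth() leaves only the points.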

Your completed plot should look like this: [Figure: hw 1.4 plot]

Add a new code chunk that creates the same plot again, but add a mapping of workingday to the color aesthetic. Your result should look like:

You might start this section by copying and pasting from your previous code chunk.
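
As a hedged illustration of adding a color mapping, the sketch below again uses the built-in mtcars dataset instead of the bikeshare data. The variable am (transmission type) stands in for a two-valued grouping variable. It is wrapped in factor() because it is stored as a number in mtcars; whether workingday needs the same treatment depends on how it is stored in your data.

```r
library(ggplot2)
library(magrittr)  # provides %>%; library(tidyverse) also attaches it

# Same layered plot as before, with one extra mapping: color.
# factor() makes ggplot treat the 0/1 values as two discrete groups.
p_color <- mtcars %>%
  ggplot() +
  aes(x = wt, y = mpg, color = factor(am)) +
  geom_point() +
  geom_smooth() +
  labs(
    x = "Weight (1000 lbs)",
    y = "Miles per gallon",
    color = "Transmission (0 = automatic, 1 = manual)"
  )

p_color
```

Because the color mapping is defined for the whole plot, both the points and the trend lines are split by group, so each group gets its own smoothed line.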

Write a one- or two-sentence interpretation of your graph, focusing on the following question: How does the number of rides compare for weekdays vs. weekends? Based on this, make a guess about how Capital Bikeshare riders use the bikes.

Submit a ZIP of your final files (Rmd, HTML, CSV, etc.) and sub-directories. For instructions on how to do this, see the first lab specification. This will be the workflow for all future homework assignments.

^*Exercise based on Data Science in a Box