1 Tools

1.1 Useful Resources

1.2 Why these tools?

  • R
    • It uses nouns and verbs where Python/Pandas uses syntax
    • Its operations map pretty cleanly onto generalizable concepts
    • It has an extensive library of useful packages
    • It plays well with Python
  • RMarkdown
    • It embodies reproducibility
    • … which helps us practice integrity and humility.
  • Git and GitHub
    • It helps us be hospitable to teammates and future readers
    • It’s very commonly used.

1.2.1 Why not just use Excel?

For data analysis: doing your analysis once in Excel can be instructive, especially for initial exploration or for demoing your ideas to non-technical clients.

But move to a reproducible platform for reporting. The ability to click a button to regenerate a report using new data will be well worth the additional time it takes to set up the report the first time. Also, spreadsheets are difficult to debug. For example, a mistake in a formula contributed to flawed conclusions in a study of economic austerity that was used to make policy decisions in several countries: “[The formula] averaged cells in lines 30 to 44 instead of lines 30 to 49.”, “This spreadsheet error, compounded with other errors, is responsible for a −0.3 percentage-point error…” (Herndon, Ash, and Pollin 2013, Does High Public Debt Consistently Stifle Economic Growth?).

For data management, tread carefully. Spreadsheets might be fine for small data and small teams. But they fall over when either gets big.

One reason is that spreadsheets have capacity limitations. For example, the limited number of rows and columns in an Excel spreadsheet caused the UK to lose data about many Covid cases until it was discovered.

Also, workflows that rely on emailing spreadsheets (or file-based databases) can lead to inconsistency, data corruption, lack of awareness of and accountability for changes, and tedium.

Instead, seek more structure:

  • SQL-style databases
  • Salesforce or similar CRM platforms
  • Other data management platforms (AirTable, Notion, …)

1.3 R and RStudio

1.3.1 rstudio.calvin.edu

Your Calvin login and password should get you onto https://rstudio.calvin.edu

1.3.2 RStudio on the Linux machines

R and RStudio are also installed on the Linux machines in the Maroon and Gold Labs. They are accessible via https://remote.cs.calvin.edu.

1.3.3 Rstudio on your own machine

You can download and install RStudio and R on your personal computer.

1.4 R packages

You will need a number of R packages. We will have these installed for your on https://rstudio.calvin.edu, but if you work on your own machine or on the Linux lab machines, you will need to install these yourself.

Individual packages can be installed from within RStudio by clicking on the install icon in the packages tab. Alternatively, you can run the following code from the command line.

# make sure all current packages are up-to-date
update.packages()

pkgs <- c(
  "rmarkdown",
  "tidyverse",
#  "ggformula",
  "plotly",
  "tidymodels",
#  "pins",
#  "reticulate",
#  "mosaic",
  "devtools",
  "usethis",
#  "keras",
#  "nycflights13",
  "skimr",
  "maps"
#  "timeDate"
#  "qualtRics"
)

# skip package you already have to save time
pkgs <- setdiff(pkgs, installed.packages()[,1])

# install what's left
install.packages(pkgs)

We’re additionally using these packages for development; they’re optional for you.

if (!("emo" %in% installed.packages()[, 1]))
  devtools::install_github("hadley/emo")