3 Data Wrangling

3.1 Resources

First, here are some questions to ask if you’re working with data that you didn’t collect yourself. (That article is one of my favorites from the analytics writings by the Head of Decision Intelligence at Google.)

For more resources, see the previous chapter, as well as:

R for Data Science: Factors
dplyr: cheat sheet
lubridate: cheat sheet
Some tips for working with SPSS data (e.g., Pew)

3.1.1 Practice

TidyTuesday has weekly examples!

David Robinson, contributor to several notable R packages, has done screencasts of analyzing many TidyTuesday examples. Here’s the code.

3.2 SQL and BigQuery

Query languages allow us to query big datasets from our small computers. The most popular by far is SQL.
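To make that concrete, here is a minimal sketch of a SQL query. It runs against a tiny in-memory SQLite database through Python's built-in `sqlite3` module, which is only a convenient local stand-in for a large remote database. The `trips` table and its columns are hypothetical and invented for illustration.

```python
import sqlite3

# Hypothetical toy table standing in for a much larger remote dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trips (city TEXT, distance_km REAL)")
conn.executemany(
    "INSERT INTO trips VALUES (?, ?)",
    [
        ("Boston", 2.0),
        ("Boston", 4.0),
        ("Chicago", 3.0),
        ("Chicago", 9.0),
        ("Denver", 1.0),
    ],
)

# Keep trips of at least 2 km, then count and average them per city.
query = """
SELECT city, COUNT(*) AS n_trips, AVG(distance_km) AS mean_km
FROM trips
WHERE distance_km >= 2
GROUP BY city
ORDER BY mean_km DESC
"""

# Only the small summarized result comes back to us, not the whole table.
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

If you know dplyr, the pieces should look familiar: `WHERE` plays the role of `filter()`, `GROUP BY` of `group_by()`, the aggregates in `SELECT` of `summarize()`, and `ORDER BY` of `arrange()`.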

Google’s BigQuery is a cloud data warehouse that you query with SQL; the datasets are stored on Google’s infrastructure rather than on your own machine. Most of the time you’ll be querying data that are internal to your organization, but Google and other providers have published some open datasets. Some examples:

3.3 Afterward

Arquero is a new JavaScript library that uses almost all of the same basic concepts as the Grammar of Data, though sometimes under different names.