3 Data Wrangling
3.1 Resources
First, here are some questions to ask if you’re working with data that you didn’t collect yourself. (That article is one of my favorites from the analytics writings by the Head of Decision Intelligence at Google.)
For more resources, see the previous chapter, as well as:
- R for Data Science: Factors
- dplyr: cheat sheet
- lubridate: cheat sheet
- Some tips for working with SPSS data (e.g., Pew)
3.1.1 Practice
TidyTuesday has weekly examples!
David Robinson, contributor to several notable R packages, has done screencasts of analyzing many TidyTuesday examples. Here’s the code.
3.2 SQL and BigQuery
Query languages allow us to query big datasets from our small computers. The most popular by far is SQL.
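To make that concrete, here is a minimal sketch of a SQL query, run from Python against an in-memory SQLite database using the built-in `sqlite3` module. The `flights` table, its columns, and its four rows are made up for illustration, and SQLite is only a local stand-in for a large remote database. The point is the shape of the query: `GROUP BY` with aggregate functions plays roughly the role that `group_by()` plus `summarise()` play in dplyr.

```python
import sqlite3

# Hypothetical toy table standing in for a large remote dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE flights (carrier TEXT, dep_delay REAL)")
conn.executemany(
    "INSERT INTO flights VALUES (?, ?)",
    [("AA", 10.0), ("AA", 20.0), ("UA", 5.0), ("UA", None)],
)

# Count flights and average the departure delay per carrier.
# AVG() skips NULL values, while COUNT(*) counts every row.
query = """
    SELECT carrier,
           COUNT(*)       AS n_flights,
           AVG(dep_delay) AS mean_delay
    FROM flights
    GROUP BY carrier
    ORDER BY carrier
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)

conn.close()
```

Only the small summary table comes back to your machine; the grouping and averaging happen wherever the data live, which is what makes this approach workable for datasets far too large to load locally.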
Google’s BigQuery is a cloud data warehouse that you query with SQL; the data are stored and processed on Google’s cloud infrastructure. Most of the time you’ll be querying data that are internal to your organization, but Google and other providers have published some open datasets. Some examples:
3.3 Afterward
- Arquero is a new JavaScript library that uses almost all of the same basic concepts as the Grammar of Data, though sometimes under different names.