Midterm Project: replicate and critique a visual

For this project you will pick some existing data science work (a blog post, report, research paper, etc.) and replicate a visual from it.

The project will explore the intersection between three things:

  • The data
  • The visualization
  • The story that the visualization tells about the data (and, indirectly, about the world)

The three components of this project correspond to these three things:

  • Data: Obtain some real-world dataset. Trace where it came from and how it’s structured. Load it and, using one of the data science toolkits we’re studying, process it into a form that’s appropriate for making the visualization.
  • Visualization: Replicate (re-create) a visualization that someone else already made based on that data. Evaluate some of the choices and assumptions that were made in the visualization process: in what ways does the visualization faithfully represent the data (or not)? What sort of stories does the chosen visualization amplify?
  • Story: Consider the story that the original source told using the visualization. Is that story accurate? Complete? Clearly articulated? How did choices in the data collection, preparation, and visualization affect the storytelling? Are there other stories that the data might also be telling?

This project addresses our course-level learning objectives in this way:

  • Technical skills: manipulating data, constructing visualizations, and creating reports.
  • Communication: analyzing choices made in visualization and text with respect to how they tell a story about data; proposing and implementing changes to improve the clarity of communication.
  • Ethics and Critical Thinking: identifying potential ethical questions (e.g., of transparency, diversity, etc.) that emerge in the process of obtaining, manipulating, and communicating with data.

Report Outline

Your report should be:

  • understandable by itself: a reader should not need to see your discussion posts or prior submissions.
  • reproducible (no paths that only work on your computer, for example)
  • understandable without the code: a reader should be able to skip over all of the code and understand all of the results.
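
As an illustration of the "no paths that only work on your computer" point, here is a minimal sketch in Python (an assumption; use whichever toolkit the course uses). The `data/` folder, the `data_path` helper, and the file name `survey.csv` are hypothetical, chosen only for this example.

```python
from pathlib import Path


def data_path(filename, root=Path(".")):
    """Build a path to a file in the project's data/ folder, relative to the project root."""
    return root / "data" / filename


# Avoid: "C:/Users/me/Desktop/project/data/survey.csv" (works on one machine only).
# Prefer a path relative to the project folder, so it works wherever the project is copied.
print(data_path("survey.csv"))
```

Anyone who downloads the whole project folder can then run the report without editing paths.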

Please follow this general outline (though you may add or remove aspects if needed).

  • Overview: Introduce the article you’ve found and the specific visualization you’re replicating, including:
    • A complete URL to the article
    • An image or screenshot of the visualization.
    • A concise statement (ideally a quote) of the claim that the article uses the visualization to make (or the claim you invented if there wasn’t a clear one)
  • Design
    • What overall type of visualization was chosen? Why might the author have chosen it?
    • What variables are being shown?
    • What retinal variables and/or aesthetics were chosen to represent those data variables?
      • For at least one of these variables, describe what makes that choice appropriate or inappropriate.
    • Overall, what about the visual makes it effective, or ineffective, for making its claim?
  • Data
    • A high-level overview of the data you’re working with, including:
      • Whether you were able to find the original data (if not, why not?)
      • Where the data came from
        • Direct URL and/or specific instructions for how to obtain it.
        • Under what terms is the source allowing you to use the data?
        • Try to trace it upstream as close to the source as you can.
        • Who worked with the data on its way to you? (Include names and roles, if applicable.)
        • What processing may have happened to it: was it aggregated? Anonymized? etc.
      • What might we need to know about the data collection process in order to interpret the data correctly? (e.g., if it’s from a survey, who was surveyed?)
    • A low-level description of the size and structure of the data (include your data-loading code here)
      • What does each row represent?
      • How many rows are there? (use inline code)
      • What might be interesting to know about what information the data does, and doesn’t, provide?
  • Wrangling
    • Describe, at a broad level, what you need to do to the data to make it into the form you need for the plot
    • Include code blocks, with appropriate names, for wrangling steps.
  • Replication
    • Include your code to replicate the visual, and the result.
    • Briefly describe any difficulties you encountered, both those you overcame and those you still have not.
    • It’s ok to not have a perfect graph here. If the essential structure is there, don’t worry if the details are a bit different. Focus your attention on making an interesting and polished alternative design.
  • Alternative Designs
    • Describe at least two alternative design choices that could be made in visualizing your data. For each design, include:
      • What choice did the original visual make? (e.g., to use a particular aesthetic mapping or glyph)
      • What’s an alternative choice? (It should be a reasonable choice, but it doesn’t have to be an improvement.)
      • How does that change affect how the visual supports the original claim? Can your redesign now support some different claim?
    • Make a solid attempt to implement your best alternative design.
      • If creating it using ggplot/plotly/etc is too challenging, you may submit a high-fidelity sketch of what the visualization would look like along with a clear description of what you’d need to figure out in order to produce it with code.
  • Summary
    • Now that you’ve gone through the whole process, how has your understanding of, and belief in, the original article’s claim changed?
    • How faithful was your replication?
    • How successful was your alternative design?
    • What follow-up questions and ideas do you have about the data or visualization you worked with?
    • What opportunities do you see for extending this work into a final project? e.g., is there some way you could apply predictive modeling? Get better data? Do a more rigorous analysis? etc. “This seems like a dead end” is valid.
    • How do you feel about this whole experience?
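
The Data and Wrangling items above (loading code, what a row represents, row count, and a named wrangling step) might look roughly like the following sketch. It uses only the Python standard library, which is an assumption; the toy CSV text, its column names, and the helper names `load_rows` and `total_by` are hypothetical stand-ins for your real dataset and code.

```python
import csv
import io
from collections import defaultdict

# Toy stand-in for a downloaded file (made-up values, for illustration only).
# In a real report this text would be read from a relative path in your project.
RAW_CSV = """year,region,count
2019,North,10
2019,South,14
2020,North,12
2020,South,9
"""


def load_rows(text):
    """Parse CSV text into a list of dicts, one per data row."""
    return list(csv.DictReader(io.StringIO(text)))


def total_by(rows, key, value):
    """Sum a numeric column within each group of `key` (a typical pre-plot wrangling step)."""
    totals = defaultdict(int)
    for row in rows:
        totals[row[key]] += int(row[value])
    return dict(totals)


rows = load_rows(RAW_CSV)
print(f"The data has {len(rows)} rows; each row is one region in one year.")
print("Columns:", list(rows[0].keys()))
print("Total per year:", total_by(rows, "year", "count"))
```

Computing the row count in code, rather than typing the number by hand, keeps the report correct if the data file changes.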

At least one of your visuals (either the original or alternative design) should be high quality, with effort spent getting the details right.

Project 1 Milestones

Milestones are due along with the corresponding week’s homework.

  • Week 2: Propose 2 or 3 potential options that interest you, and write why you find them interesting. The course staff will help you determine what’s feasible.
  • Week 3: Pick a specific visualization to recreate and identify concrete steps towards obtaining the data.
  • Week 4: Create a rough-draft visualization, including brief notes on how the data were obtained and how the story was originally told using that data. We will workshop some visualizations in class.
  • Week 5: Provide feedback to other students.
  • Week 6: Final visual, plus your critique of the original article and some follow-up questions.

Final Project: ask and answer your own data science question

This project will require doing new work, either extending some existing work (like the first project) or starting from scratch. Results will be presented as a report, supporting code, and a brief presentation to the class. Successful outcomes should include visual, analytical, and perspectival components. The report should be at the level of polish and formality of a blog post (more than a class homework assignment, less than an academic paper). The overview and primary visualizations should be intelligible to a non-technical audience; the methods should be described in precise technical language as appropriate.

The final course meeting (during the designated final exam period) will be devoted to final project presentations. Feedback on others’ projects will be part of your final project grade, so attendance is mandatory.

Project 2 Detailed Expectations

NOTES:

  • You will likely need to make some choices regarding what variables to include, whether to do some pre-processing (e.g., addressing missing values, generating new variables), etc. Clearly state:
    • What decisions you made
    • Why you made them
    • What might have been alternative choices.
  • Tell the “rational reconstruction” of the story of how the analysis was done.
    • Don’t give the play-by-play of everything you tried, every idea you had, etc., but…
    • Do include things you tried that led to an important observation later on.
  • If you use any code from the Internet, you should acknowledge its source and provide a link.
  • You should submit all of the code needed to replicate your results, but your report should be understandable without looking at the code.
  • Each project team has a proj repo; please use it.
  • The repo includes a template that gives one possible outline. You may adapt the outline as needed for your project. The template includes an “appendix” block that should cause all code to appear there. You may also set echo=TRUE on any particular code block if you want to highlight something about it.
  • You may make slides if you like (using Xaringan, ioslides, PowerPoint, etc.), Shiny apps, etc., but please also submit a report document in either HTML or PDF format.
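
As one way to document a pre-processing decision such as handling missing values, the sketch below shows two alternatives side by side so the report can state which was chosen and why. It is written in Python with the standard library only (an assumption); the toy values and the helper names `drop_missing` and `impute_median` are hypothetical.

```python
from statistics import median

# Toy values for illustration only; None marks a missing measurement.
values = [3.0, None, 5.0, 4.0, None, 10.0]


def drop_missing(xs):
    """Choice A: discard missing entries (loses rows, keeps only observed values)."""
    return [x for x in xs if x is not None]


def impute_median(xs):
    """Choice B: replace missing entries with the median of the observed values."""
    fill = median(drop_missing(xs))
    return [fill if x is None else x for x in xs]


n_missing = sum(x is None for x in values)
print(f"{n_missing} of {len(values)} values are missing")
print("Dropped:", drop_missing(values))
print("Imputed:", impute_median(values))
```

Reporting how many values were affected, alongside the choice made, gives readers what they need to judge the decision.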

Your report should include the following general elements (though treat this specific outline as a suggestion only; certain reports will need to deviate from this structure in small or large part):

  • A succinct but descriptive title
  • Overview
    • A real-world question that you’d like to explore, and why it’s interesting.
      • This question should be stated in language that is understandable to someone who hasn’t studied data science and doesn’t know the details of your dataset.
      • The best questions include motivation from prior literature that gives, for example, some pattern or relationship that you’d expect to find and why.
    • A brief (2-4 sentences) high-level description of the dataset: what is the dataset about? Where did it come from? What sort of data does it contain?
  • Approach
    • A problem description or specific question
      • This question should be stated in more specific technical terms than the real-world question.
      • It should reference the particular features of your dataset.
      • This question ideally helps answer the real-world question, but it’s okay if it doesn’t.
    • The approach that you’ll take to answer that question, probably using some sort of predictive or statistical modeling.
    • A description of the data’s provenance: as much as you can, trace the path from the events or measurements all the way to the dataset you’re working with. You might answer:
      • Where did the data come from originally?
      • Where did you download it from? As much as you can tell or speculate, how did it end up available there?
    • The number of records (rows) in the dataset, and what each one represents.
      • Give an example of some part of the data in your dataset.
      • Consider writing a simple sentence that conveys the information in the first row, as an example.
    • A list of the features in the dataset and their types
    • An analysis of the appropriateness of your dataset for your approach. (What’s good about it? What do you wish were better?)
    • This section should also discuss the overall approach of any basic data wrangling needed to get the data into an overall usable form. More specific wrangling may be needed for constructing plots or models later.
  • Exploratory Data Analysis (EDA)
    • Show plots or tables illustrating the distribution of at least two variables in your dataset. Comment on anything interesting you observe.
    • Show plots illustrating bivariate relationships for at least 2 pairs of variables. Comment on anything interesting you observe (e.g., strength of relationship, dependence on other factors).
    • Summarize your EDA findings: how do your observations inform the modeling?
  • Modeling
    • This section is written for predictive modeling; if you’re doing inferential modeling or clustering, adapt this section as needed.
    • Describe the modeling setup. Clearly state at least:
      • what is the target variable you are trying to predict
      • which variables (features) you are using to predict it, and why you chose those features
      • how you will measure accuracy (can you give meaningful units?)
      • what validation method did you choose and why
    • Fit a basic predictive model using one of the techniques we discussed in class (for regression: Nearest Neighbors or Linear Regression; for classification: Decision Trees or Logistic Regression; other choices such as Nearest Neighbors or Random Forest are also fine).
      • Describe why you chose that model (and its features and any hyperparameters)
      • Describe what kind of performance you expect from it
    • Report the results of your basic predictive model via cross-validation.
    • Make one or more changes to the predictive model to (attempt to) improve the accuracy. Discuss what changes you made, why you made them, and what the results were.
    • The strongest reports will include insightful visualizations of the model, its predictions, and/or its mistakes, and a discussion of what those plots tell us.
    • Report on the final accuracy of your best model on the test set, if applicable.
    • Alternative: instead of a supervised prediction task, you can define an unsupervised learning task and use clustering. In this case, clearly state what you want to understand through the clustering, and report your observations.
  • Findings: Summarize the analyses you performed and what the results told you. What do your findings say about the real-world and prediction (or clustering) questions you posed?
  • Limitations: What are some limitations of your analyses? Did you notice any potential biases in the data you used or analysis you did? Any other ethical questions raised during this project?
  • Future Directions: What new questions came up following your exploration of this data? Identify at least one question that would require new data or a new analysis approach, and specify what steps would be required.
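
To make the Modeling items above concrete, here is a minimal sketch of reporting a basic model's error via cross-validation. It is plain Python with a hand-written one-feature least-squares fit, which is an assumption made for self-containment; in practice you would likely use the modeling library from class. The toy data and the function names `fit_line` and `cross_val_mae` are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Ordinary least squares with one feature: returns (slope, intercept)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def cross_val_mae(xs, ys, k=3):
    """Mean absolute error averaged over k folds.

    Fold i holds out every k-th point starting at index i; the model is
    fit on the remaining points and scored on the held-out ones.
    """
    fold_errors = []
    for i in range(k):
        pairs = list(enumerate(zip(xs, ys)))
        train = [(x, y) for j, (x, y) in pairs if j % k != i]
        held_out = [(x, y) for j, (x, y) in pairs if j % k == i]
        slope, intercept = fit_line([x for x, _ in train], [y for _, y in train])
        fold_errors.append(mean(abs(y - (slope * x + intercept)) for x, y in held_out))
    return mean(fold_errors)


# Toy data for illustration only.
xs = [1, 2, 3, 4, 5, 6]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.2]
print(f"Cross-validated MAE: {cross_val_mae(xs, ys):.2f} (same units as y)")
```

Mean absolute error is used here because it is expressed in the units of the target variable, which makes "how accurate is the model?" easier to explain to a non-technical reader.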

(Project descriptions originally thanks to Ofra Amir)

Connection to Learning Objectives

Projects should demonstrate proficiency in the primary learning outcomes of this course:

| Learning outcome | Evidence in project |
|---|---|
| Formulate a question or problem that can be answered by data | Includes a statement of the goal of the project that clearly shows what a successful outcome looks like (e.g., how does the work answer the question? Did the work achieve the goal?) |
| Acquire data responsibly | Describes where the data came from and evaluates the suitability of that source for that data |
| Transform data faithfully into usable forms | Includes some data wrangling (merging multiple data sources, aggregating, re-coding data, identifying and dealing with missing data, etc.) |
| Explore datasets to interrogate their representativeness, build intuition, and generate hypotheses | Includes results of exploratory analytics about data values, completeness, etc. (e.g., exploratory visualizations and/or descriptive statistics); includes some reflection on how the actually available data shaped the project goal / question |
| Apply predictive tools to draw conclusions, evaluating the suitability of these models | Includes a predictive model (regression or classification) that uses two or more features to predict an outcome. Includes a discussion of how modeling decisions were made and on how accurate the model should be on unseen (e.g., future) data, based on quantitative evidence (e.g., cross-validation). |
| Communicate visual and textual data-driven narratives that are useful, faithful, and intelligible to both technical and non-technical audiences | Includes a report in the style of a blog post and a brief presentation |
| Analyze considerations of responsibility and justice in all of the above practices | Demonstrates humility and awareness throughout the project, e.g., includes appropriate caveats with claims, includes discussion of implications of decisions that were made during acquisition / modeling / communication. Identifies potential ethical issues raised by existing data. |