The goal of this project is to practice a complete predictive analytics workflow, from honing a good question to communicating your results.

Results will be presented as a report, supporting code, and a brief presentation to the class. The report should be at the level of polish and formality of a blog post (more than a class homework assignment, less than an academic paper). The overview and primary visualizations should be intelligible to a non-technical audience; the methods should be described in precise technical language as appropriate.

An important component of this project is a critique of some related prior work. That is, you are not simply demonstrating that you can perform an analysis, but also that you can evaluate the strengths and weaknesses of others’ analyses. This may be the most important take-away from this class for your future career.

Logistics

Teams: Both individual and team projects are permitted, with a mild encouragement towards teams.

Presentations will be done over VoiceThread.

Grading

The project will be graded as follows:

Choosing your own project

Your project must include predictive analytics of the kind discussed in this class, or other approaches by permission of the instructor.

Students often think of inference questions (e.g., “what is the relationship between X and Y?”) for projects. I encourage you to think of prediction questions (e.g., “can we predict Y from X?”) instead, though either is acceptable.

A Kaggle competition is a good place to start looking for project ideas and data. But:

  1. Kaggle competitions often don’t document the data well; you may need to do some extra digging to understand what the data is and how it was collected.
  2. Kaggle competitions helpfully show you what other people have done, but many of those analyses are flashy and overcomplicated. Read through others’ work for inspiration and ideas, but in your work, keep it simple and focus on interpreting the results and connecting them to the real world.

Your data should be rich enough to support the kind of modeling you want to do, but not so complex that it becomes hard to work with. For example, large datasets (bigger than about 500 MB) may require careful handling to avoid exhausting your computing resources.
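One way to handle a large file carefully is to check its size first and, if it is too big, work with a random sample of rows rather than loading everything. The following is a minimal sketch, assuming the data is a CSV file with a header row and that you are working in Python with only the standard library. The names `SIZE_LIMIT_MB`, `file_size_mb`, and `sample_rows` are hypothetical helpers made up for this illustration, not part of any course tooling.

```python
import csv
import os
import random

SIZE_LIMIT_MB = 500  # rough threshold from the guideline above (an assumption, not a hard rule)


def file_size_mb(path):
    """Return the size of a file on disk in megabytes (1 MB = 10**6 bytes)."""
    return os.path.getsize(path) / 1_000_000


def sample_rows(path, k, seed=0):
    """Reservoir-sample up to k data rows from a CSV without loading the whole file.

    Returns (header, rows). At most k data rows are held in memory at once.
    """
    rng = random.Random(seed)
    with open(path, newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        reservoir = []
        for i, row in enumerate(reader):
            if i < k:
                # Fill the reservoir with the first k rows.
                reservoir.append(row)
            else:
                # Keep this row with probability k / (i + 1).
                j = rng.randint(0, i)  # inclusive on both ends
                if j < k:
                    reservoir[j] = row
    return header, reservoir
```

For example, with a hypothetical file `data.csv`, you might call `file_size_mb("data.csv")` and, if the result exceeds `SIZE_LIMIT_MB`, explore `sample_rows("data.csv", 10_000)` instead of the full file. If you do this, say so in your report, since sampling is one of the decisions a reader needs to know about.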

Deliverables

Report

Overall:

  • Be mindful of decisions. Report:
    • What decisions you made
    • Why you made them
    • What might have been alternative choices.
  • Tell a “rational reconstruction” of how the analysis was done.
    • Don’t give the play-by-play of everything you tried, every idea you had, etc., but…
    • Do include things you tried that led to an important observation later on.
  • If you use any code from the Internet, you should acknowledge its source and provide a link.
  • You should submit all of the code needed to replicate your results, but your report should be understandable without looking at the code.

Elements of a report will vary depending on the specific project, but should generally include:

  • A succinct but descriptive title
  • A one-paragraph executive summary of the main conclusions of your analysis, aimed at a decision-maker. If possible, make a visual to supplement your text.
  • Overview of the vision of the project, the data you use, the kind of analysis you will perform, and the main results you obtained.
    • Include any background information needed to understand the situation.
  • Critique of Prior Work - see details below.
  • Approach you took, including:
    • What data you’re using. Give enough detail for the reader to evaluate how appropriate the data is for your task. (What is its origin, size, and structure? How does it relate to your vision?) The reader should be able to obtain the data themselves.
      • Include some Exploratory Data Analysis (EDA) here: plots, summary tables, etc.
      • Summarize what you observed in your EDA and how that informs your modeling.
    • What analytics question you’re asking (in terms of the specific characteristics of the data you have), and how that relates to your overall vision.
      • Describe the setup (e.g., what target variable you’re trying to predict, how you’ll measure accuracy, how you’ll validate, etc.)
      • Alternative: instead of a supervised prediction task, you can define an unsupervised learning task and use clustering. In this case, clearly state what you want to understand through the clustering, and report your observations.
      • Describe your analytics process
        • What sort of techniques and hyperparameters did you choose? Why?
        • What steps did you perform to get your result?
        • What results did you get? Summarize them quantitatively and qualitatively.
        • How did you refine the process?
  • Describe your findings:
    • The strongest reports will include insightful visualizations of the model, its predictions, and/or its mistakes, and a discussion of what those plots tell us.
    • Report on the final accuracy of your best model on the test set, if applicable.
    • Summarize the analyses you performed and what the results told you. What do your findings say about the real-world and prediction (or clustering) questions you posed?
  • Limitations: What are some limitations of your analyses? Did you notice any potential biases in the data you used or analysis you did? Any other ethical questions raised during this project?
  • Future Directions: What new questions came up following your exploration of this data? Identify at least one question that would require new data or a new analysis approach, and specify what steps would be required.
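The prediction setup described above (a target variable, an accuracy measure, a validation plan, and a final test-set number) can be sketched in a few lines. This is an illustration under stated assumptions, not a required method: it assumes a classification task, uses plain Python with the standard library, and compares against a trivial majority-class baseline. The function names and the toy labels are made up for the example.

```python
import random
from collections import Counter


def split_indices(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Shuffle indices 0..n-1 and split them into train/validation/test lists."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(n * test_frac)
    n_val = int(n * val_frac)
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be the same length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_class(labels):
    """Most common label: the prediction of a trivial baseline model."""
    return Counter(labels).most_common(1)[0][0]


# Toy target variable (made up for illustration): 70 "no" and 30 "yes".
labels = ["no"] * 70 + ["yes"] * 30
train, val, test = split_indices(len(labels), seed=42)

# Fit the baseline on the training rows only, then score it on validation rows.
baseline = majority_class([labels[i] for i in train])
val_acc = accuracy([labels[i] for i in val], [baseline] * len(val))
```

The design point is the separation of roles: compare candidate models on the validation rows, and score only your final chosen model on the test rows, once. A baseline like `majority_class` gives readers a reference point for judging whether your reported accuracy is actually good.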

(Project descriptions originally thanks to Ofra Amir)

Critique of Prior Work

This section of the report should include:

  • A clear citation to the article you’re critiquing, including a link if applicable.
    • You can find examples of predictive analytics work on sites like Kaggle, TowardsDataScience, Reddit, YouTube, and GitHub. Also check out the TidyTuesday project and r-bloggers.
  • An overview of their work (see Overview below)
  • Technical Choices:
    • What was a technical choice that the author(s) made? (note: people are not always aware of the choices that they make)
    • What would have been some alternative options they could have chosen?
    • What are some pros and cons of the option they chose?
  • Trustworthiness
    • What did the author do to get you to trust their results more? (attention to detail? careful validation? clear explanation? etc.)
    • What aspect of the work do you trust the least? Why?
  • In what ways do you intend for this project to extend or enhance that prior work? (Save the details of how for the Approach section of your report.)

You should pick one specific prior work and analyze it in depth. If you find several articles, you may include links and brief discussions of them as well.

Presentation

Prepare a short presentation (around 5 minutes or less) of your work. See the Unit 16 slides for some guidelines.

I suggest using presentation software rather than tools like Quarto slides or Xaringan, but you’re very welcome to try them out if you like.