Choosing a Project

Overall, pick a project where you can:

Here are some ideas to get you started:

Replicate or Extend an Academic Paper

I often come across a paper that makes me think, “my students could have done that!” Here are a few examples:

I have a whole collection of others; just chat with me about what you’re interested in.

BYOTA

I just saw this and it looks like a great project: “Build Your Own Timeline Algorithm: A Blueprint”.

Prof Arnold’s prototypes

Extending anything from class

Take any notebook we do in class. Extend it in some way that’s interesting to you, and write clearly about your experience. That’s a totally valid mini-project. You might need to do a few of these to hit all of the objectives, but that’s great. Some weeks already have “Extension” ideas.

Working with LLM APIs

Working with Models Directly

Compete in a Competition

You may compete in a Kaggle competition or similar competition, like:

If you do this, you should:

Creativity

Use ML to empower human creativity in some way. For example, you might try to generate art, music, or poetry, or to help with the creative process in some other way.

For this project, you should aim for:

Replication and/or Constraints

One way you could extend a replication project is to add constraints: limited compute (e.g., lab computers, your laptop, Raspberry Pi), limited data (a small subset of the original dataset), limited model size (fits in xx MB), etc.

One example I’d really like to see: Train the best language model you can on our lab computers (or your laptop).
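To make a model-size constraint concrete, it can help to estimate the footprint from the parameter count before training anything. The following is a minimal sketch, assuming that weight storage dominates the size and that every parameter is stored at the same width. The function names, the two-layer network, and the default of 4 bytes per parameter (float32) are illustrative assumptions, not course requirements.

```python
def linear_params(n_in: int, n_out: int, bias: bool = True) -> int:
    """Parameter count of one fully connected layer."""
    return n_in * n_out + (n_out if bias else 0)


def model_size_mb(n_params: int, bytes_per_param: int = 4) -> float:
    """Approximate weight storage in megabytes (1 MB = 1e6 bytes)."""
    return n_params * bytes_per_param / 1e6


# Hypothetical two-layer MLP: 784 -> 256 -> 10 (sizes chosen for illustration).
layer_sizes = [(784, 256), (256, 10)]
total = sum(linear_params(n_in, n_out) for n_in, n_out in layer_sizes)

print(total, "parameters")
print(f"{model_size_mb(total):.2f} MB as float32")
print(f"{model_size_mb(total, bytes_per_param=2):.2f} MB as float16")
```

An estimate like this ignores optimizer state, activations, and file-format overhead, so treat it as a lower bound when checking whether a model fits a budget.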

Details on Replication Projects

Expectations for Replication Projects

  • For these projects, we will not expect as much discussion of motivation, since we assume that the original artifact took care of that.
  • Depending on your results, you should either:
    • Demonstrate that you surmounted a significant technical challenge in attaining the result,
    • Provide a thoughtful analysis of the decisions you and the original authors made, or
    • Improve on the quantitative result in some measurable and well-motivated way.

Choosing a Replication Project

If you’re choosing a replication project, ask yourself:

  1. Is there some specific write-up, with quantitative results clearly reported, that I can use to anchor the project?
  2. Can I easily access the same data that the original authors used? (Does it fit on computing hardware I can easily access?)
  3. Do I understand the basic approach? Maybe there’s fancy stuff too, but you should be able to think of how you’d implement a simple version of it.

Expository Notebooks (“Notebookify”)

One strategy to take when starting with existing code is to “Notebookify” it. Most notebooks you’ll find are demo notebooks, designed to show off the best results but hide a lot of details behind opaque code chunks or external libraries. In contrast, an expository notebook walks the reader through what’s going on.

The code part of such a project is relatively straightforward: find a demo notebook, step through it, pull in the contents of the “do-all-the-stuff” functions (test that it still works), split things up into individual cells (test that it still works), and show intermediate results and shapes. But you’ll also write up descriptions of what’s happening.

You will almost certainly want to refer to a paper by the original authors. It’ll usually explain the names of variables and methods, and it’ll show what parameters and data are likely to work well.

If the original has big loops, flatten them. For example, show one example of how the data is prepared, run one minibatch of the model training, and show how the evaluation scores are computed for one datapoint.
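As a sketch of what flattening can look like, the cell below replaces a nested training loop with a single, fully visible step on one minibatch. Everything in it is a hypothetical stand-in: the made-up dataset, the two-parameter linear model, and the names `w`, `b`, and `lr` were chosen so the example runs with only the standard library. In a real notebook you would substitute the original project's data loading and model.

```python
# Toy dataset following y = 2x + 1 (made up for illustration).
data = [(float(x), 2.0 * x + 1.0) for x in range(8)]
batch_size = 4
w, b, lr = 0.0, 0.0, 0.01  # initial weight, bias, and learning rate

# Instead of `for epoch in ...: for batch in ...:`, take ONE minibatch
# and show every intermediate value.
batch = data[:batch_size]
print("batch:", batch)

# Forward pass: one prediction per example in the batch.
preds = [w * x + b for x, _ in batch]
print("predictions:", preds)

# Mean squared error on this batch.
errors = [p - y for p, (_, y) in zip(preds, batch)]
loss_before = sum(e * e for e in errors) / len(batch)
print("loss before update:", loss_before)

# Gradients of the mean squared error with respect to w and b.
grad_w = sum(2 * e * x for e, (x, _) in zip(errors, batch)) / len(batch)
grad_b = sum(2 * e for e in errors) / len(batch)
print("gradients (w, b):", grad_w, grad_b)

# One gradient-descent step.
w -= lr * grad_w
b -= lr * grad_b

# Recompute the loss on the same batch to see the effect of the step.
loss_after = sum((w * x + b - y) ** 2 for x, y in batch) / len(batch)
print("loss after update:", loss_after)
```

Because each stage prints its result, a reader can check the numbers by hand, which is hard to do when the same logic is buried inside a loop.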

Simplify the code as needed. For example, if there are if statements that do different things depending on configuration, remove the branches that aren’t actually run in your case.
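For instance, a demo might choose its behavior through a configuration dictionary. The sketch below is hypothetical (the `config` key and both function names are invented for illustration). It shows a configurable original next to a simplified version that keeps only the branch that actually runs, followed by a quick check that the two still agree.

```python
import math

config = {"activation": "relu"}  # hypothetical configuration used in this notebook


def activate_original(x: float) -> float:
    """Original style: behavior depends on a configuration value."""
    if config["activation"] == "relu":
        return max(0.0, x)
    elif config["activation"] == "tanh":
        return math.tanh(x)
    raise ValueError(f"unknown activation: {config['activation']}")


def activate_simplified(x: float) -> float:
    """Simplified for the notebook: this configuration only ever uses ReLU."""
    return max(0.0, x)


# Test that it still works: both versions agree on a few inputs.
for x in (-1.5, 0.0, 2.0):
    assert activate_original(x) == activate_simplified(x)
print("simplified version matches the original for this configuration")
```

Keeping a small comparison like this in the notebook documents which configuration you assumed when you removed the other branches.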

Most importantly, explain what is going on. Start with an intro about the overall goal of the approach you’re demoing, and the basic outline of what the process looks like. Then dive in. End with a conclusion summarizing the main points that you highlighted about what’s going on. Perhaps end with some questions and future directions: what decisions did the original authors make that aren’t clear to you? What ideas might you have for doing something differently?

How to replicate without duplicating

One strategy: the Benjamin Franklin replication. Here’s how I adapt it to code:

  1. Read the original. Take notes in a separate document. Make them mostly in human language or math; put code in your notes only sparingly.
  2. Close the original. Try to write a replication based on your notes.
  3. Fail at some point because your notes aren’t detailed enough. So close your replication and open the original again, and return to step 1.

Tips for Replication Projects

Here is a basic outline of a project:

  • Get the code running. (This could be very easy if, for example, you find a Colab notebook.)
  • Replicate something interesting that’s already been done.
  • Use an example that you provide instead of one of their pre-built ones.
  • Push the limits a bit.

Ideas of what to replicate

  • Compete in https://babylm.github.io/ (train a language model on only data that a human child plausibly has access to).
    • Suggestion: initialize your model wisely.
  • Train the best LM you can on a lab machine in under a day, on permissively licensed data and/or synthetic data.

See https://paperswithcode.com/ for some examples. Their newsletter is particularly approachable.

Also, see proceedings of general conferences like NeurIPS, ICML, ICLR, …, or domain-focused conferences: text (EMNLP, ACL), speech and music (ISMIR, InterSpeech), computer vision (ICCV, SIGGRAPH), recommender systems (RecSys), etc.

Teaching

Create materials to teach this class about a topic beyond the scope of the course.

Deliverables should include at least 3 of the following:

Examples: Jay Alammar’s blog, many articles on distill.pub, 3Blue1Brown videos, …

Other Project Ideas

Interpretability Initiative