Discussion for Week 8

This short task is intended to help you get a sense of how we might evaluate a model’s performance in a natural language processing task. Clearly a model can’t understand language in the way that humans do, but what happens if we take a reductionist lens and ask how well it does on specific tasks?

Try Out NLP Benchmarks

Try out a benchmark of NLP progress. I suggest BIG-Bench, organized by Google, but you’re welcome to try a different one.

  1. Pick one task, e.g., one of these. Skim the prior postings in this forum first to try to pick a task that hasn’t been done yet.
  2. Pick two example items from that task, arbitrarily.
    • For example, when I used BIG-Bench, I clicked the first task, bbq-lite, opened the first JSON file under Resources, and grabbed an example from there. (The first sketch after this list shows one way to pull items programmatically instead.)
    • If you have the patience, the BIG-Bench Task Testing Notebook is very useful for exploring the tasks, but it takes a while to set up initially.
  3. Test three different models:
    1. Yourself.
    2. ChatGPT or the OpenAI Playground.
    3. A pre-built model from Hugging Face, such as flan-ul2. (The second sketch after this list shows one way to query a model like this.)
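If you'd rather script step 2 than click through the website, here's a minimal sketch that pulls a task file straight from the BIG-Bench GitHub repository and prints two items. The repository path and the task name in it are assumptions on my part; adjust both for the task you actually pick, and note that some tasks split their JSON across subfolders, so check the repository layout first.

# Sketch: fetch one BIG-Bench task file and print two example items.
# The URL layout and the task name are assumptions -- adjust for your task.
import json
import random
import urllib.request

TASK = "emoji_movie"  # hypothetical choice; substitute the task you picked
URL = (
    "https://raw.githubusercontent.com/google/BIG-bench/main/"
    f"bigbench/benchmark_tasks/{TASK}/task.json"
)

with urllib.request.urlopen(URL) as resp:
    task = json.load(resp)

print(task.get("description", "(no description)"))

# JSON tasks keep their items under "examples"; each item has an "input"
# plus either a free-text "target" or multiple-choice "target_scores".
for item in random.sample(task["examples"], 2):
    print("\nINPUT: ", item["input"])
    print("TARGET:", item.get("target") or item.get("target_scores"))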
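And here's a similar sketch for step 3. One caveat: flan-ul2 is roughly 20 billion parameters, so unless you have serious hardware you'll probably want a smaller instruction-tuned stand-in (the flan-t5-base name below is my assumption, not a requirement) or the inference widget on the model's Hugging Face page.

# Sketch: run one example item through a pre-built Hugging Face model.
# google/flan-t5-base is an assumed, lightweight stand-in for flan-ul2;
# swap in "google/flan-ul2" if you have the resources to load it.
from transformers import pipeline

generator = pipeline("text2text-generation", model="google/flan-t5-base")

# Paste the "input" field of the example item you picked in step 2.
prompt = "..."

result = generator(prompt, max_new_tokens=32)
print(result[0]["generated_text"])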

Initial Post

Post a brief reflection on the experience.

Replies

Give the answer to one of the example items that someone else posted. (Pick one that hasn’t already been answered.) Also respond to their comments about the task.
