376 Unit 1: Introduction to Generative Modeling

Contents

376 Preparation 1 (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Discussion 376.1: Are You Sure? An LLM Evaluation

How can we quantify the performance of large language models? Researchers have developed benchmarks to evaluate models on a variety of tasks.

This Discussion addresses the course objective MS-LLM-Eval. With additional thought, you could find connections to CI-LLM-Failures and various CI-Topics objectives here. You may also find connections to MS-LLM-Prompting, MS-LLM-API, and (if you’re really ambitious) LM-ICL.

In this discussion, we’ll try reproducing one interesting result from the paper “Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment” (see the arXiv abstract). The authors studied whether asking a chatbot “Are you sure?” led it to change its answer and, importantly, whether that made it more or less accurate.

The paper is in Perusall; you can also read it by clicking on the “View PDF” link in the arXiv abstract. (Note: “arXiv” is pronounced “archive”; the “X” is the Greek letter “chi”. It’s a preprint server where researchers share papers before they’re peer-reviewed. Lots of AI/ML papers are posted there; note that quality may vary widely.) You don’t need to read the whole paper to participate in this discussion.

Instructions

  1. Pick example questions. Read the Task Selection section (4.2). Pick one of the tasks listed there. (You may need to refer to the cited papers to find the details of the tasks; please use Perusall comments to share details and links as you find them.) Each “task” is actually a collection of questions (with reference answers). Pick two specific example questions that are interesting to you. But try not to peek at the answers yet.

  2. Try to answer the questions yourself. First try doing it without any Internet resources, then use the Internet if you need to. Then ask yourself “are you sure?” and see if you want to change your answer.

  3. Follow the FlipFlop experimental procedure (Section 3.1), by hand, to try your examples on a chatbot. You may use any chatbot you like, but you should ask it the same questions you asked yourself. You can use a commercial LLM like ChatGPT / Claude / Gemini, or an open-weights model; easy ways to run those include the Hugging Face Playground, Meta AI, Google AI Studio, or Perplexity Labs’ Playground. (For simplicity, just use the “Are you sure?” prompt; don’t worry about the other prompts in the FlipFlop experiment.) If you’d rather script this step, see the optional sketch after these instructions.

Record the initial accuracy and final accuracy.
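
If you’d prefer to run step 3 programmatically rather than by hand, here is a minimal sketch using the OpenAI Python client (v1+). The model name is just an example and you’d need your own API key configured; the flow is the same as the by-hand version: ask the question, then send “Are you sure?” as a follow-up in the same conversation.

    # Sketch of the "Are you sure?" follow-up, assuming the openai package (v1+)
    # is installed and OPENAI_API_KEY is set in your environment.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # example model name; use whatever you have access to

    messages = [{"role": "user", "content": "YOUR QUESTION HERE"}]
    first = client.chat.completions.create(model=MODEL, messages=messages)
    initial_answer = first.choices[0].message.content
    print("Initial answer:", initial_answer)

    # Challenge the model within the same conversation.
    messages += [
        {"role": "assistant", "content": initial_answer},
        {"role": "user", "content": "Are you sure?"},
    ]
    second = client.chat.completions.create(model=MODEL, messages=messages)
    print("After the challenge:", second.choices[0].message.content)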

Initial Post

Post a brief reflection on the experience.

Reflect on whether the model flipped its answer when you asked “Are you sure?”, and whether that made it more or less accurate.

Replies

Give the answer to the example item that someone else posted. (Pick one that hasn’t already been answered.) Also respond to their comments about the task.

Rubric

See Moodle for the rubric.

Older benchmarks

A past version of this Discussion had students try out some other benchmarks; you’re welcome to try those too.

In 23SP I suggested BIG-Bench, a large collaborative benchmark suite organized by Google. If you want to try it:

  1. Pick one task, e.g., one of these. Skim the prior postings in this forum first to try to pick a task that hasn’t been done yet.
  2. Pick two example items from that task, arbitrarily.
    • For example, for BIG-Bench, I clicked the first task (bbq-lite), opened the first JSON file under Resources, and grabbed an example from there.
    • If you have the patience, the BIG-Bench Task Testing Notebook is really useful for exploring the tasks, but it takes a while to initially set up.
Exploring Language Models

Revised 2025: we’ll need to use the “Show Internals” page of Writing-Prototypes.

Objectives:

Part 1: Left-to-Right Generation

Go to https://bigprimes.org/RSA-challenge and copy-paste a number from there. By construction, these numbers are the product of two large primes.

  1. Type this into the Playground: “The number NNN is composite because it can be written as the product of”. Replace NNN with your number, and don’t type a space afterwards. Leave all parameters at their defaults. Click Submit to generate. (It should give several numbers; if not, try again.) Check its output using a calculator on your computer, e.g., Python (a sketch appears after this list). Is it correct?

  2. Repeat the previous step a few more times. (The “Regenerate” button makes this easy.) Keep track of what factorizations it generates and whether they are correct.

  3. Now delete everything and change the prompt to “The number NNN is prime because”. Generate again. What do you notice? How does this result relate to the fact that language models generate text one token at a time?
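
For the correctness check in step 1 (and, optionally, the prime claim in step 3), here is a small Python sketch; the number and factors below are placeholders for whatever you and the model come up with:

    import math

    # Placeholder values: substitute your RSA-challenge number and the factors
    # the model generated.
    n = 1234567891011
    claimed_factors = [3, 411522630337]

    # The factorization is correct only if the claimed factors multiply back to n.
    print(math.prod(claimed_factors) == n)

    # Optional, for the "is prime" prompt in step 3 (requires sympy):
    # from sympy import isprime
    # print(isprime(n))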

Part 2: Token Probabilities

  1. Set the Temperature slider to 0. Change the prompt to “Here is a very funny joke:” (again, no space afterwards). What joke is generated?

  2. Compare your response to the previous question with that of a neighboring team. What do you notice?

  3. Now set the Temperature slider to 1. Delete the generated text and generate again (the “Regenerate” button won’t realize you changed the Temperature). What joke is generated?

  4. Repeat the previous step a few times. Summarize what you observe.

  5. Under “Show probabilities”, select “Full spectrum” (you’ll need to scroll down). Generate with a temperature of 0 again. Select the initial “Q”; you should see a table of words with corresponding probabilities. What options was the model considering for how to start the joke?

  6. Click each word in the generated text. (Make sure it was generated with Temperature set to 0.) Notice the words highlighted in red; those are the words that were chosen from the conditional distribution. How do you think the model chooses from among the options it’s considering when Temperature is 0?

  7. Now set Temperature to 1 and Regenerate. How do you think the model chooses from among the options it’s considering when Temperature is 1? Regenerate a few times to check your reasoning.

  8. Observe the highlighting behind each word. Describe what it means when a token is red.

Suppose the LM classifier computed scores of 0.1 and 0.2 for two possible words (in neural-net lingo, these are called logits). Compute e^x for each number (you can use math.exp()) to get two positive numbers. They probably don’t sum to 1, so they’re not a valid probability distribution; divide each by their sum to fix that. (This exponentiate-and-normalize operation is called softmax in NN lingo.) Now divide the logits by 0.001 and compute the softmax again. The number you divide the logits by is the temperature.
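
Here is the same calculation as a short Python sketch you can paste into a notebook; the function name and the example logits are just for illustration:

    import math

    def softmax_with_temperature(logits, temperature=1.0):
        # Scale the logits by the temperature, then exponentiate and normalize.
        scaled = [x / temperature for x in logits]
        exps = [math.exp(x) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [0.1, 0.2]  # scores for two candidate words
    print(softmax_with_temperature(logits, temperature=1.0))    # roughly [0.475, 0.525]
    print(softmax_with_temperature(logits, temperature=0.001))  # essentially [0.0, 1.0]

Notice that a tiny temperature makes the distribution collapse onto the highest-scoring word, which is why Temperature 0 behaves like always picking the most likely token.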

Part 3: Phrase Probabilities

  1. Select the first few words of the generated joke. You should see “Total: xx.xxx logprob on yy tokens”. Write down the logprob number.

  2. Click the first token and observe the corresponding “Total:” statement for that token. Write down the logprobs reported individually for each token, for the first few tokens.

  3. Sum the individual token logprobs and check that this sum matches the total logprob reported for the phrase.

  4. Compute the logprob for one token by taking the natural logarithm of the probability of the chosen word (a sketch of steps 3 and 4 appears after this list).

  5. Type your own joke. Set “maximum length” to the smallest value and click Generate. Ignoring the generated text, highlight your joke and see what probability the model gave to it. Compare joke logprobs with your neighbors; who has the highest and lowest? (You probably need to switch to one of the “Other” models, like davinci-002, for this to work; gpt-3.5-turbo-instruct broke this feature.)
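
If you want to double-check the arithmetic in steps 3 and 4, here is a small sketch; the per-token probabilities are made-up placeholders, so substitute the numbers you actually read off the interface:

    import math

    # Hypothetical per-token probabilities read from the probability display.
    token_probs = [0.92, 0.35, 0.60]

    # The logprob of each token is the natural log of its probability.
    token_logprobs = [math.log(p) for p in token_probs]
    print(token_logprobs)

    # The phrase logprob is the sum of the per-token logprobs, because
    # probabilities multiply and logarithms turn products into sums.
    print(sum(token_logprobs))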

CS 376 Lab 1: Tokenization

This lab is designed to help you make progress towards the following course objectives:

Work through the following notebook. (No accelerator is needed. Either Kaggle or Colab is fine; if you use Colab, remember to “Copy to Drive”.)

If you finish, you may get started on next week’s notebook:

Logits in Causal Language Models (name: u09n1-lm-logits.ipynb; show preview, open in Colab)

Generation Activity (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Optional Extension: Token Efficiency Analysis

This is an optional mini-project that builds on the tokenization exercise that we did this week.

  1. Select 4-5 text samples (100-200 words each) from different domains. Examples include:

    • General prose (e.g., news article)
    • Code or structured data (e.g., HTML, JSON, XML, CSV, …)
    • Technical/scientific text
    • Social media/informal text
    • Non-English or multilingual text
  2. Select two or three tokenizers to compare (e.g., Llama, Gemma, spaCy, sklearn’s CountVectorizer, etc.). You may use the same tokenizers from the previous exercise or try new ones.

  3. For each tokenizer, tokenize the text samples and calculate the following metrics (a starter sketch appears after this list):

    • Characters-per-token ratio
    • 5 longest tokens
    • 5 multi-token words (if applicable)
    • 5 words mapped to an “unknown” token (if applicable)
  4. Analyze the results and answer the following questions:

    • Which types of text tokenize most efficiently and why?
    • How might these insights influence prompt design for different tasks?
    • What specific improvements could be made to the tokenization process for one of your text domains?
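
To get started on step 3, here is a minimal sketch using Hugging Face tokenizers. The model names and sample texts are only examples (the gated Llama and Gemma tokenizers work the same way once you have access), and you would extend it with your own samples and the remaining metrics:

    # Characters-per-token and longest-token metrics for a couple of samples.
    from transformers import AutoTokenizer

    samples = {
        "prose": "Replace this with your 100-200 word news excerpt.",
        "structured": '{"key": [1, 2, 3], "nested": {"html": "<p>hi</p>"}}',
    }

    for model_name in ["gpt2", "bert-base-uncased"]:  # example tokenizers
        tok = AutoTokenizer.from_pretrained(model_name)
        for domain, text in samples.items():
            tokens = tok.tokenize(text)
            chars_per_token = len(text) / len(tokens)
            longest = sorted(tokens, key=len, reverse=True)[:5]
            print(f"{model_name} / {domain}: "
                  f"{chars_per_token:.2f} chars/token, longest tokens: {longest}")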