376 Unit 1: Introduction to Generative Modeling

Contents

376 Preparation 1 (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Discussion 376.1: Are You Sure? An LLM Evaluation

How can we quantify the performance of large language models? Researchers have developed benchmarks to evaluate models on a variety of tasks.

This Discussion addresses the course objective MS-LLM-Eval. With additional thought, you could find connections to CI-LLM-Failures and various CI-Topics objectives here. You may also find connections to MS-LLM-Prompting, MS-LLM-API, and (if you’re really ambitious) LM-ICL.

In this discussion, we’ll try reproducing one interesting result from the paper “Are You Sure? Challenging LLMs Leads to Performance Drops in The FlipFlop Experiment” (see the arXiv abstract). The authors studied whether asking a chatbot “Are you sure?” led it to change its answer and, importantly, whether that made it more or less accurate.

The paper is in Perusall; you can also read it by clicking on the “View PDF” link in the arXiv abstract. (Note: “arXiv” is pronounced “archive”; the “X” is the Greek letter “chi”. It’s a preprint server where researchers share papers before they’re peer-reviewed. Lots of AI/ML papers are posted there; note that quality may vary widely.) You don’t need to read the whole paper to participate in this discussion.

Instructions

  1. Pick example questions. Read the Task Selection section (4.2). Pick one of the tasks listed there. (You may need to refer to the cited papers to find the details of the tasks; please use Perusall comments to share details and links as you find them.) Each “task” is actually a collection of questions (with reference answers). Pick two specific example questions that are interesting to you. But try not to peek at the answers yet.

  2. Try to answer the questions yourself. First try doing it without any Internet resources, then use the Internet if you need to. Then ask yourself “are you sure?” and see if you want to change your answer.

  3. Follow the FlipFlop experimental procedure (Section 3.1), by hand, to try your examples on a chatbot. You may use any chatbot you like, but you should ask it the same questions you asked yourself. You can use a commercial LLM like ChatGPT / Claude / Gemini, or an open-weights model; easy ways to run those include the Hugging Face Playground, Meta AI, Google AI Studio, or Perplexity Labs’ Playground. (For simplicity, just use the “Are you sure?” prompt; don’t worry about the other prompts in the FlipFlop experiment.) If you’d rather script this step, see the optional sketch after these instructions.

Record the initial accuracy and final accuracy.
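
If you’d prefer to run step 3 programmatically rather than by hand, here is a minimal sketch using the OpenAI Python client (v1+). The model name is just an example and you’d need your own API key configured; the flow is the same as the by-hand version: ask the question, then send “Are you sure?” as a follow-up in the same conversation.

    # Sketch of the "Are you sure?" follow-up, assuming the openai package (v1+)
    # is installed and OPENAI_API_KEY is set in your environment.
    from openai import OpenAI

    client = OpenAI()
    MODEL = "gpt-4o-mini"  # example model name; use whatever you have access to

    messages = [{"role": "user", "content": "YOUR QUESTION HERE"}]
    first = client.chat.completions.create(model=MODEL, messages=messages)
    initial_answer = first.choices[0].message.content
    print("Initial answer:", initial_answer)

    # Challenge the model within the same conversation.
    messages += [
        {"role": "assistant", "content": initial_answer},
        {"role": "user", "content": "Are you sure?"},
    ]
    second = client.chat.completions.create(model=MODEL, messages=messages)
    print("After the challenge:", second.choices[0].message.content)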

Initial Post

Post a brief reflection on the experience.

Reflect on whether the model flipped its answer when you asked “Are you sure?”, and whether that made it more or less accurate.

Replies

Give the answer to the example item that someone else posted. (Pick one that hasn’t already been answered.) Also respond to their comments about the task.

Rubric

See Moodle for the rubric.

Older benchmarks

A past version of this Discussion had students try out some other benchmarks; you’re welcome to try those too.

In 23SP I suggested BIG-Bench, a large collaborative benchmark suite organized by Google. If you want to try it:

  1. Pick one task, e.g., one of these. Skim the prior postings in this forum first to try to pick a task that hasn’t been done yet.
  2. Pick two example items from that task, arbitrarily.
    • For example, for BIG-Bench, I clicked the first task (bbq-lite), opened the first JSON file under Resources, and grabbed an example from there.
    • If you have the patience, the BIG-Bench Task Testing Notebook is really useful for exploring the tasks, but it takes a while to initially set up.
Exploring Language Models

Revised 2025: we’ll need to use the “Show Internals” page of Writing-Prototypes.

Objectives:

Part 1: Left-to-Right Generation

Go to https://bigprimes.org/RSA-challenge and copy-paste a number from there. By construction, these numbers are the product of two large primes.

  1. Type this into the Playground: “The number NNN is composite because it can be written as the product of”. Replace NNN with your number, and don’t type a space afterwards. Leave all parameters at their defaults. Click Submit to generate. (It should give several numbers; if not, try again.) Check its output using a calculator on your computer, e.g., Python (a sketch appears after this list). Is it correct?

  2. Repeat the previous step a few more times. (The “Regenerate” button makes this easy.) Keep track of what factorizations it generates and whether they are correct.

  3. Now delete everything and change the prompt to “The number NNN is prime because”. Generate again. What do you notice? How does this result relate to the fact that language models generate text one token at a time?
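
For the correctness check in step 1 (and, optionally, the prime claim in step 3), here is a small Python sketch; the number and factors below are placeholders for whatever you and the model come up with:

    import math

    # Placeholder values: substitute your RSA-challenge number and the factors
    # the model generated.
    n = 1234567891011
    claimed_factors = [3, 411522630337]

    # The factorization is correct only if the claimed factors multiply back to n.
    print(math.prod(claimed_factors) == n)

    # Optional, for the "is prime" prompt in step 3 (requires sympy):
    # from sympy import isprime
    # print(isprime(n))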

Part 2: Token Probabilities

  1. Set the Temperature slider to 0. Change the prompt to “Here is a very funny joke:” (again, no space afterwards). What joke is generated?

  2. Compare your response to the previous question with that of a neighboring team. What do you notice?

  3. Now set the Temperature slider to 1. Delete the generated text and generate again (the “Regenerate” button won’t realize you changed the Temperature). What joke is generated?

  4. Repeat the previous step a few times. Summarize what you observe.

  5. Under “Show probabilities”, select “Full spectrum” (you’ll need to scroll down). Generate with a temperature of 0 again. Select the initial “Q”; you should see a table of words with corresponding probabilities. What options was the model considering for how to start the joke?

  6. Click each word in the generated text. (Make sure it was generated with Temperature set to 0.) Notice the words highlighted in red; those are the words that were chosen from the conditional distribution. How do you think the model chooses from among the options it’s considering when Temperature is 0?

  7. Now set Temperature to 1 and Regenerate. How do you think the model chooses from among the options it’s considering when Temperature is 1? Regenerate a few times to check your reasoning.

  8. Observe the highlighting behind each word. Describe what it means when a token is red.

Suppose the LM classifier computed scores of 0.1 and 0.2 for two possible words (in neural-net lingo, these are called logits). Compute e^x for each number (you can use math.exp()) to get two positive numbers. They probably don’t sum to 1, so they’re not a valid probability distribution; divide each by their sum to fix that. (This exponentiate-and-normalize operation is called softmax in NN lingo.) Now divide the logits by 0.001 and compute the softmax again. The number you divide the logits by is the temperature.
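
Here is the same calculation as a short Python sketch you can paste into a notebook; the function name and the example logits are just for illustration:

    import math

    def softmax_with_temperature(logits, temperature=1.0):
        # Scale the logits by the temperature, then exponentiate and normalize.
        scaled = [x / temperature for x in logits]
        exps = [math.exp(x) for x in scaled]
        total = sum(exps)
        return [e / total for e in exps]

    logits = [0.1, 0.2]  # scores for two candidate words
    print(softmax_with_temperature(logits, temperature=1.0))    # roughly [0.475, 0.525]
    print(softmax_with_temperature(logits, temperature=0.001))  # essentially [0.0, 1.0]

Notice that a tiny temperature makes the distribution collapse onto the highest-scoring word, which is why Temperature 0 behaves like always picking the most likely token.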

Part 3: Phrase Probabilities

  1. Select the first few words of the generated joke. You should see “Total: xx.xxx logprob on yy tokens”. Write down the logprob number.

  2. Click the first token and observe the corresponding “Total:” statement for that token. Write down the logprobs reported individually for each token, for the first few tokens.

  3. Sum the individual token logprobs and check that this sum matches the total logprob reported for the phrase.

  4. Compute the logprob for one token by taking the natural logarithm of the probability of the chosen word (a sketch of steps 3 and 4 appears after this list).

  5. Type your own joke. Set “maximum length” to the smallest value and click Generate. Ignoring the generated text, highlight your joke and see what probability the model gave to it. Compare joke logprobs with your neighbors; who has the highest and lowest? (You probably need to switch to one of the “Other” models, like davinci-002, for this to work; gpt-3.5-turbo-instruct broke this feature.)
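
If you want to double-check the arithmetic in steps 3 and 4, here is a small sketch; the per-token probabilities are made-up placeholders, so substitute the numbers you actually read off the interface:

    import math

    # Hypothetical per-token probabilities read from the probability display.
    token_probs = [0.92, 0.35, 0.60]

    # The logprob of each token is the natural log of its probability.
    token_logprobs = [math.log(p) for p in token_probs]
    print(token_logprobs)

    # The phrase logprob is the sum of the per-token logprobs, because
    # probabilities multiply and logarithms turn products into sums.
    print(sum(token_logprobs))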

CS 376 Lab 1: Tokenization

This lab is designed to help you make progress towards the following course objectives:

Work through the following notebook. (No accelerator is needed. Either Kaggle or Colab is fine; if you use Colab, remember to “Copy to Drive”.)

If you finish, you may get started on next week’s notebook:

Logits in Causal Language Models (name: u09n1-lm-logits.ipynb; show preview, open in Colab)

Generation Activity (draft!)
The content may not be revised for this year. If you really want to see it, click the link above.
Optional Extension: Token Efficiency Analysis

This is an optional mini-project that builds on the tokenization exercise that we did this week.

  1. Select 4-5 text samples (100-200 words each) from different domains. Examples include:

    • General prose (e.g., news article)
    • Code or structured data (e.g., HTML, JSON, XML, CSV, …)
    • Technical/scientific text
    • Social media/informal text
    • Non-English or multilingual text
  2. Select two or three tokenizers to compare (e.g., Llama, Gemma, spaCy, sklearn’s CountVectorizer, etc.). You may use the same tokenizers from the previous exercise or try new ones.

  3. For each tokenizer, tokenize the text samples and calculate the following metrics (a starter sketch appears after this list):

    • Characters-per-token ratio
    • 5 longest tokens
    • 5 multi-token words (if applicable)
    • 5 words mapped to an “unknown” token (if applicable)
  4. Analyze the results and answer the following questions:

    • Which types of text tokenize most efficiently and why?
    • How might these insights influence prompt design for different tasks?
    • What specific improvements could be made to the tokenization process for one of your text domains?
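
To get started on step 3, here is a minimal sketch using Hugging Face tokenizers. The model names and sample texts are only examples (the gated Llama and Gemma tokenizers work the same way once you have access), and you would extend it with your own samples and the remaining metrics:

    # Characters-per-token and longest-token metrics for a couple of samples.
    from transformers import AutoTokenizer

    samples = {
        "prose": "Replace this with your 100-200 word news excerpt.",
        "structured": '{"key": [1, 2, 3], "nested": {"html": "<p>hi</p>"}}',
    }

    for model_name in ["gpt2", "bert-base-uncased"]:  # example tokenizers
        tok = AutoTokenizer.from_pretrained(model_name)
        for domain, text in samples.items():
            tokens = tok.tokenize(text)
            chars_per_token = len(text) / len(tokens)
            longest = sorted(tokens, key=len, reverse=True)[:5]
            print(f"{model_name} / {domain}: "
                  f"{chars_per_token:.2f} chars/token, longest tokens: {longest}")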