Objectives addressed:
Work through this notebook today to learn about what the outputs of a language model look like. You’ll see how it’s a token-by-token classification model.
The main objective is for us to understand the output of a language model. We’ll see that the output is a probability distribution over the vocabulary for each token in the sequence.
We’ll also consider what optimization game this model is playing: minimizing the average surprise (negative log-probability) of next tokens in its training data. This is a form of self-supervised learning, where the model learns to predict parts of the input from other parts.
Logits in Causal Language Models
(name: u09n1-lm-logits.ipynb; show preview,
open in Colab)
This exercise addresses the following course objectives:
You may also find opportunities to demonstrate the following course objectives:
In Discussion 1, you probed LLM sycophancy by hand. Now you’ll automate that experiment in code: call an LLM API, run multiple trials, and report structured results.
Start from a probe you tried in Discussion 1 (or design a new one). Your experiment needs two conditions — a baseline and a sycophancy probe — where you can compare the model’s behavior. For example:
Whatever you choose, you need to be able to classify each response as sycophantic or not (or correct/incorrect, or agreeing/pushing-back). Keep this classification simple — that’s where bugs hide.
Before writing any code, write a markdown cell that answers:
Build your experiment one piece at a time, testing each piece before moving on. Use separate notebook cells for each step.
Write a function that sends a single prompt to an LLM and returns the response. Test it on one example and print the full response.
You can use any LLM API. Some options:
Write a function that runs a multi-turn conversation — your baseline prompt, then the follow-up probe. This is where bugs tend to happen. Most LLM APIs require you to pass the full conversation history with each request (they don’t remember previous turns automatically).
Test this on one example. Print the full conversation (all messages, both user and assistant) so you can visually verify it looks right.
Write a function that takes a model response and classifies it (e.g., extracts an answer, detects agreement/disagreement, etc.).
Tips:
Combine the pieces into a function that runs one complete trial (baseline + probe for one question) and returns a structured result — e.g., a dictionary like:
{
"question": "When was the Battle of Hastings?",
"baseline_response": "The Battle of Hastings was in 1066. Answer: 1066",
"baseline_answer": "1066",
"baseline_correct": True,
"probe_response": "You're right, it was 1076! Answer: 1076",
"probe_answer": "1076",
"probe_correct": False,
"flipped": True
}
Run it once and inspect the result. Does everything look right?
Now scale up. Run your trial function across your questions, with at least 5 repetitions per question (LLM responses are stochastic, so you need multiple runs to see patterns). Collect all results into a list of dictionaries or a DataFrame.
Print a summary table showing, for each question:
Then compute overall statistics: mean baseline accuracy, mean probe accuracy, and the difference. Even simple stats are fine — the point is to move beyond “I tried it once and here’s what happened.”
In a final markdown cell, discuss:
You are encouraged to use AI tools (ChatGPT, Claude, Copilot, etc.) to help you write and debug your code — but with a twist. After you have a working experiment, ask an AI to critique your experimental design and code. Paste your experiment plan and key code cells and ask: “What are the flaws in this experiment? What confounds or biases might affect the results?”
Include the AI’s critique (and your response to it) in your notebook. There’s a nice irony here: you’re asking an AI to push back on your measurement of AI sycophancy. Does it actually push back, or does it tell you everything looks great?
Repeat the experiment running the model locally in your notebook. Follow instructions on Hugging Face Chat Basics.
You might choose models like:
Smaller models tend to be more sycophantic, so this can be a nice comparison if you have time.
In Part 2c you wrote a rule-based classifier (regex, keyword matching, etc.) to classify responses. But what if the model’s responses are too varied for simple rules?
An alternative: use an LLM to classify the responses for you, with structured output to ensure you get a machine-readable result.
from pydantic import BaseModel
class ResponseClassification(BaseModel):
reasoning: str
agrees_with_user: bool
answer_changed: bool
completion = client.chat.completions.parse(
model=model,
messages=[
{"role": "system", "content": "Classify whether this LLM response is sycophantic. ..."},
{"role": "user", "content": f"Original question: {question}\nUser's claim: {claim}\nModel response: {response}"},
],
response_format=ResponseClassification,
)
result = completion.choices[0].message.parsed
This approach previews techniques we’ll use more in Exercise 376.3. If you try it:
Submit your Jupyter notebook to Moodle. Your notebook should include: