These notes are reference material for Unit 1. Read through them before class and annotate on Perusall with questions or connections.
A language model (LM) is a probability distribution over sequences of tokens: it assigns a probability P(token₁, token₂, …, tokenₙ) to every possible document.
In practice, language models are trained to predict the next word (or token) in a document. This is a classification task: given the context so far, choose which vocabulary item comes next.
The training set is a huge collection of documents (web pages, books, code, etc.), and the model learns by minimizing surprise — how much probability mass it put on the token that actually came next.
Notice what’s not required here: labels. Nobody has to annotate each document with “the correct next word” — the next word is already there in the text. The model creates its own training signal from the structure of the data itself. This is called self-supervised learning, and it’s the key insight that lets language models train on essentially the entire internet without any human annotation.
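A minimal sketch of how those training pairs fall out of raw text (using whitespace splitting as a stand-in for a real tokenizer):

```python
# A document supervises itself: every position provides a
# (context, next-token) training pair for free.
text = "the cat sat on the mat"
tokens = text.split()  # stand-in for a real tokenizer

pairs = [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]
for context, target in pairs:
    print(context, "->", target)
```

No human ever labeled "mat" as the correct answer after "the cat sat on the" — the text itself is the label.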
This is what makes foundation models possible: because the training signal is free, we can scale to datasets that would be impossible to label by hand.
A conditional distribution is a probability distribution of one thing given another:
Language models give us P(next token | context). This is the building block for everything else.
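To make the idea concrete, here is a tiny conditional distribution estimated from bigram counts (the corpus is made up; real models condition on the full context, not just the previous word):

```python
from collections import Counter, defaultdict

# Estimate P(next word | previous word) from bigram counts.
corpus = "the cat sat on the mat the cat ran".split()

counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def p_next(prev, nxt):
    """Conditional probability of `nxt` given the previous word `prev`."""
    total = sum(counts[prev].values())
    return counts[prev][nxt] / total

print(p_next("the", "cat"))  # 2/3: "the" is followed by cat, mat, cat
```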
There are three main ways to set up a generative model. Each answers the question “how do we sample from a complex distribution?” differently.
Generate one piece at a time, left to right. Each step predicts the next token given everything before it. This is how most language models (GPT, Claude, Gemma, etc.) work.
Sample a compact “code” z from a simple distribution (like a Gaussian), then decode it into the full output. The key idea is that high-dimensional data (like images) actually lives on a lower-dimensional manifold — a latent variable model tries to learn that manifold.
Start with pure noise and iteratively remove it, guided by a learned model. The model is trained to predict “what noise was added?” at each step, and generation works by repeatedly denoising.
We’ll focus primarily on autoregressive models in this course, since they power most modern LLMs.
An autoregressive model factors the probability of a whole document into a product of next-token predictions:
P(tell, me, a, joke) = P(tell) × P(me | tell) × P(a | tell, me) × P(joke | tell, me, a)
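With made-up per-step probabilities, the product looks like this (the numbers are illustrative, not from any real model):

```python
import math

# Hypothetical next-token probabilities at each step of "tell me a joke".
step_probs = [
    0.01,  # P(tell)
    0.4,   # P(me | tell)
    0.3,   # P(a | tell, me)
    0.05,  # P(joke | tell, me, a)
]

joint = math.prod(step_probs)
print(joint)  # 0.01 * 0.4 * 0.3 * 0.05 = 6e-05

# The same computation in log space avoids underflow on long documents.
log_joint = sum(math.log(p) for p in step_probs)
assert math.isclose(joint, math.exp(log_joint))
```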
This is called causal language modeling because each prediction only depends on what came before — the model never looks ahead.
Intuition: imagine someone constantly trying to finish your sentence. At each position, they have a distribution over what word might come next. Some positions are highly constrained (“the United States of ___”); others are wide open (“Once upon a ___”).
The next-token prediction works like a classifier we’d recognize from CS 375:
This is the same structure as logistic regression on learned features: the neural net builds a feature vector, then we score each class by how well it “matches.”
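A sketch of that structure in numpy (the shapes and random values here are placeholders, not a real model):

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, hidden_dim = 8, 4

# The network's job: turn the context into a feature vector h.
h = rng.normal(size=hidden_dim)

# One weight vector per vocabulary item; the score measures how well
# h "matches" each candidate next token.
W = rng.normal(size=(vocab_size, hidden_dim))
logits = W @ h

# Softmax turns the scores into a probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```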
Because the model generates left to right, each token is produced in sequence and can depend only on the tokens before it; this is why chatbot output can stream token by token.
How does a chatbot work if the model just predicts the next token in a document? The trick: a conversation is a document. A multi-turn chat gets formatted with special markers like:
<|user|> Tell me a joke about atoms.
<|assistant|> Why don't scientists trust atoms? Because they make up everything!
<|user|> Now make it about chemistry.
<|assistant|>
The model generates the assistant’s reply by continuing this document left to right, one token at a time. It doesn’t “know” it’s in a conversation — it’s just predicting what comes next in a document that happens to look like a conversation. Even tool calls, system prompts, and multi-turn context are just more text in the document.
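A minimal sketch of that formatting step. The marker strings and the `format_chat` helper are illustrative; every real model defines its own chat template:

```python
def format_chat(turns):
    """Flatten a list of (role, text) turns into one prompt document."""
    doc = ""
    for role, text in turns:
        doc += f"<|{role}|> {text}\n"
    # End with an open assistant turn: the model "replies" simply by
    # continuing this document left to right.
    doc += "<|assistant|>"
    return doc

prompt = format_chat([("user", "Tell me a joke about atoms.")])
print(prompt)
```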
This means the model’s behavior is shaped entirely by what’s in the document so far. Change the system prompt, and you change the “character” of the document the model is continuing.
Not all language models are autoregressive. Masked language models (like BERT) are trained to reconstruct randomly hidden tokens from their surrounding context, rather than predicting left to right. This lets the model “see” context in both directions, which turns out to be useful for learning good embeddings of whole sequences (used in vector databases and search). However, masked models can “cheat” in ways that make them less suited for generation tasks, so most modern generative models are autoregressive.
Neural networks work with numbers. How do we convert text into numbers?
Tokenization has two parts: splitting text into discrete units (tokens), and mapping each token to an integer ID the model can work with.
| Approach | Pros | Cons |
|---|---|---|
| Characters | Handles any word, small vocabulary | Many tokens per word, hard to learn long-range patterns |
| Whole words | One token per word, easy to interpret | Can’t handle unknown words, no sharing between “dog” and “dogs” |
| Subwords (modern) | Best of both: common words get one token, rare words split into pieces | Tokenizer choice affects model behavior |
Modern models use subword tokenization (e.g., Byte-Pair Encoding). Common words like “the” get a single token; rarer words like “fearsome” might split into pieces such as “fear” + “some” (the exact split depends on the tokenizer).
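Here's a toy greedy longest-match subword tokenizer to illustrate the idea. The vocabulary below is made up; real BPE tokenizers learn their vocabulary from data and use a merge-based algorithm rather than longest-match:

```python
# Made-up subword vocabulary for illustration only.
vocab = {"the", "fear", "some", "f", "e", "a", "r", "s", "o", "m", "t", "h"}

def tokenize(word):
    """Greedily take the longest vocabulary entry at each position."""
    tokens = []
    i = 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"cannot tokenize {word!r}")
    return tokens

print(tokenize("the"))       # ['the']: common word, one token
print(tokenize("fearsome"))  # ['fear', 'some']: rarer word, two pieces
```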
You can explore this yourself: OpenAI Tokenizer
The tokenizer affects how text gets split, how many tokens a document takes up, and ultimately how the model behaves.
The model gives us a distribution over the next token. To generate a whole sequence, we sample a token from that distribution, append it to the context, and repeat.
This isn’t specific to ML — it’s a general technique for sampling from any discrete probability distribution:
```python
import numpy as np

def sample(probs):
    """Sample a token index from a probability distribution."""
    return np.random.choice(len(probs), p=probs)
```
Under the hood, this works by computing cumulative probabilities and drawing a uniform random number:

```python
def sample_by_hand(probs):
    """Inverse-CDF sampling: equivalent to np.random.choice above."""
    cumulative_probs = np.cumsum(probs)
    r = np.random.uniform()
    for i, cp in enumerate(cumulative_probs):
        if r < cp:
            return i
```
In practice, human language rarely picks the single most likely next word (otherwise, why communicate at all?), so greedy generation produces text that feels unusually dull. But too much randomness produces nonsense. Temperature and top-p give us knobs to balance predictability and interestingness.
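Sketches of both knobs, assuming we already have a next-token distribution `probs` (the function names and test values here are illustrative):

```python
import numpy as np

def apply_temperature(probs, temperature):
    """T < 1 sharpens the distribution; T > 1 flattens it."""
    logits = np.log(probs) / temperature
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def top_p_filter(probs, p=0.9):
    """Keep the smallest set of top tokens whose mass reaches p, renormalized."""
    order = np.argsort(probs)[::-1]          # tokens from most to least likely
    cumulative = np.cumsum(probs[order])
    cutoff = np.searchsorted(cumulative, p) + 1
    filtered = np.zeros_like(probs)
    filtered[order[:cutoff]] = probs[order[:cutoff]]
    return filtered / filtered.sum()

probs = np.array([0.5, 0.3, 0.15, 0.05])
print(apply_temperature(probs, 0.5))  # sharper: more mass on the top token
print(top_p_filter(probs, p=0.8))     # the unlikely tail is zeroed out
```

Greedy decoding is the temperature → 0 limit; top-p with p = 1 keeps everything.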
A good language model should assign high probability to text that actually occurs. We measure this with perplexity: intuitively, the number of equally likely options the model is effectively choosing among at each step.
For reference: a fair coin has perplexity 2; a fair six-sided die has perplexity 6. Modern LLMs on English text typically achieve perplexities in the single digits.
The model assigns a probability to a document by factoring it as a product of conditionals:
P(document) = P(word₁) × P(word₂ | word₁) × P(word₃ | word₁, word₂) × …
Since these probabilities are tiny, we take the log:
log P(document) = log P(word₁) + log P(word₂ | word₁) + log P(word₃ | word₁, word₂) + …
The negative of this sum is the cross-entropy loss (also called negative log-likelihood, NLL). Dividing by the number of tokens gives the average cross-entropy per token.
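Putting those pieces together with made-up per-token probabilities (perplexity is the exponential of the average cross-entropy, equivalently the geometric mean of 1/p):

```python
import math

# Probabilities the model assigned to the actual tokens (illustrative numbers).
token_probs = [0.2, 0.5, 0.9, 0.1]

nll = -sum(math.log(p) for p in token_probs)  # total cross-entropy (nats)
avg_nll = nll / len(token_probs)              # average per token
perplexity = math.exp(avg_nll)
print(perplexity)

# Sanity check: perplexity is the geometric mean of 1/p over the tokens.
geo_mean = math.prod(1 / p for p in token_probs) ** (1 / len(token_probs))
assert math.isclose(perplexity, geo_mean)
```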
The model outputs logits (raw scores), which are passed through softmax to get probabilities. Training minimizes cross-entropy loss via stochastic gradient descent.
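A sketch of that last step, with placeholder logits (the max-subtraction is a standard numerical-stability trick; it doesn't change the result):

```python
import numpy as np

def softmax(logits):
    """Convert raw scores to probabilities; subtract the max for stability."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

logits = np.array([2.0, 1.0, 0.1])  # raw scores over a 3-token vocabulary
probs = softmax(logits)

target = 0                     # index of the token that actually came next
loss = -np.log(probs[target])  # cross-entropy loss for this prediction
print(probs, loss)
```

Training adjusts the model's parameters to push this loss down, averaged over the whole training set.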
How reliable are LLM responses? One well-documented failure mode is sycophancy: the tendency to agree with the user rather than give an accurate or helpful answer. In this discussion, you’ll design a small experiment to probe sycophancy in a chatbot of your choice.
This Discussion addresses the course objective MS-LLM-Eval. With additional thought, you could find connections to CI-LLM-Failures and various CI-Topics objectives here. You may also find connections to MS-LLM-Prompting, MS-LLM-API, and (if you’re really ambitious) LM-ICL.
Recent research has studied sycophancy from several angles:
You don’t need to read all of these, but skimming at least one will help you design your experiment.
(Note: “arXiv” is pronounced “archive”; the “X” is the Greek letter “chi”. It’s a preprint server where researchers share papers before they’re peer-reviewed. Lots of AI/ML papers are posted there; note that quality may vary widely.)
Choose one of the following probes (or propose your own with instructor approval). Whichever you choose, run at least 3 trials so you can say something about consistency.
You may use any chatbot: ChatGPT, Claude, Gemini, or an open-weights model via Hugging Face Playground, Meta AI, Google AI Studio, or Perplexity Labs’ Playground. Start a fresh conversation for each trial.
Inspired by Arvin 2025.
Record: baseline accuracy, probe accuracy, and whether the model corrected you or went along.
Inspired by SpiralBench.
Record: the model’s initial stance in each framing, and whether/when it changed.
The classic FlipFlop approach, updated.
Record: initial accuracy, post-challenge accuracy, and which pressure prompts were most effective.
Pick a classmate’s post and try their experiment on a different model (or the same model with a different variation). Report what you found and compare. Did the model behave differently? Why might that be?
See Moodle for the rubric.
In Exercise 376.1, you’ll automate an experiment like this using an LLM API, running many trials programmatically. Keep that in mind as you design your probe here — pick something where you could clearly define “sycophantic” vs. “non-sycophantic” responses in code.
You’re welcome to explore these additional benchmarks and resources:
Objectives:
Open the Language Model Internals page.
Type a message like: Write a one-paragraph story about a dragon. (Replace “dragon” with your own topic.) Click “End Turn” to finish your message.
Before generating anything, look at how the tool displays the conversation. You should see your message displayed as a sequence of tokens, with special markers indicating the role (e.g., <start_of_turn> user and <start_of_turn> assistant).
Where in the token sequence does the user’s turn end and the assistant’s turn begin? What markers separate them?
Think about this: all the model does is predict the next token in a document. Why would it generate a story rather than, say, continuing your sentence with more questions? What about the document structure makes “a story” the likely continuation?
Now we’ll construct the assistant’s response ourselves, one token at a time.
The tool should show you the model’s predicted next-token distribution: a list of candidate tokens and their probabilities. For example, you might see something like:
| Token | Probability |
|---|---|
| In | 0.25 |
| Once | 0.18 |
| A | 0.12 |
| There | 0.09 |
| Deep | 0.07 |
| … | … |
Pick the most likely token by clicking on it. It gets added to the sequence, and the tool shows a new distribution for the next token. Repeat this about 10 times, always picking the top prediction. Write down the sequence of tokens you get. Does it produce a coherent story opening?
Compare your sequence with a neighboring team. Did you get the same thing? Why or why not? Test your theory.
Now delete the response text and start over. This time, pick an unlikely token for the very first assistant token—say, the 5th or 10th most likely option. Then continue picking the top prediction for the next ~10 tokens. Write down what happens.
Try step 5 again with a different unlikely starting token. What do you notice? Reflect on this question: The model doesn’t plan ahead—it only sees the tokens that have already been written. How does it still produce something coherent after a weird start?
(Bonus) Try forcing an unlikely token in the middle of a response that was going well. Does the model recover?
Generate a full story (maybe 2-3 sentences) by letting the model pick all the tokens itself. (Use the “Generate Response” button for that.)
Click on different tokens in the generated story to see the distribution the model predicted at that position. Find a token where the model was very confident—one option dominates with high probability (e.g., > 0.8). What token is it, and why is it so predictable?
Find a token where the model was uncertain—several options have similar probability. What token is it? Why is this position harder to predict?
The probability the model assigned to the token that actually came next tells us how “surprised” the model was. Where in the story was the model most surprised? Where was it least surprised? Does this match your intuition about which words are predictable and which aren’t?
When training a language model, we need a number that says how well the model predicted the actual next token. We can measure this in bits: $-\log_2(p)$, where $p$ is the probability the model assigned to the correct token. This tells us how many bits of information were needed to identify that token, given the context. Some reference points: a certain prediction ($p = 1$) costs 0 bits, a fair coin flip ($p = 1/2$) costs 1 bit, and one of four equally likely options ($p = 1/4$) costs 2 bits.
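You can check these values with a couple of lines of Python:

```python
import math

def bits(p):
    """Bits of surprise for a token the model assigned probability p."""
    return math.log2(1 / p)

print(bits(1.0))   # 0.0: the model was certain
print(bits(0.5))   # 1.0: as surprising as a fair coin flip
print(bits(0.25))  # 2.0: one of four equally likely options
print(bits(0.1))   # about 3.32 bits
```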
Pick a token where the model was confident and one where it was uncertain. Compute $-\log_2(p)$ for each. Which takes more bits? Does that match your intuition?
Select a span of about 5 consecutive tokens in the story. The tool should show you the total bits needed to encode that span. Try to verify this: write down the probability for each token, compute $-\log_2(p)$ for each, and add them up. Does it match?
Now select two different spans of similar length: one that feels very predictable (e.g., the middle of a common phrase) and one that feels more surprising. Which takes more bits? The model was trained to minimize this total—this is the cross-entropy loss.
(Stretch) Divide the total bits by the number of tokens to get bits per token. Compare your value to the reference points above. On average, is the model more like flipping a coin or more like a near-certain prediction?
This lab is designed to help you make progress towards the following course objectives:
Work through the following notebook. (No accelerator is needed. Either Kaggle or Colab is fine; if you use Colab, remember to “Copy to Drive”.)
u08n1-tokenization.ipynb (show preview, open in Colab)

If you finish, you may get started on next week’s notebook:

u09n1-lm-logits.ipynb (show preview, open in Colab)