Logits and Perplexity in Causal Language Models¶

Task: Inspect next-token logits, compute token-level loss, and measure perplexity.

Objectives: OG-LLM-APIs, OG-LLM-Pretrained, OG-LossFunctions

By the end of this notebook, you should be able to:

  • Extract and interpret next-token logits from a pretrained language model
  • Convert logits to probabilities and inspect top-k candidates
  • Compute $-\log(P(\text{token} \mid \text{context}))$ for a specific target token
  • Compute perplexity over a sequence and compare texts by model surprise
  • Measure how quantization changes perplexity

Reminders of some definitions:

  • parameters: the numbers that define the model's behavior, learned during training (e.g., weights and biases, and token embeddings)
  • logits: the raw output scores from a model before applying softmax
    • one for each token in the vocabulary
    • softmax changes logits into probabilities by exponentiating and normalizing
    • relative scores matter (softmax is shift-invariant)
  • perplexity: a measure of how well a probability model predicts a sample.
    • intuition: how "surprised" the model is by the text
    • Lower is better: Lower perplexity means better predictions (less surprise).
    • To compute, average the negative log probabilities of the target tokens, then exponentiate the average loss to get perplexity.
  • a greedy algorithm picks the best thing at each step, without considering future consequences (e.g., picking the token with the highest probability at each step).
  • quantization: reducing the precision of model parameters to save memory and computation, often at the cost of some accuracy.
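To make the softmax and perplexity definitions concrete, here is a tiny standalone sketch in plain Python (no model involved). It shows that shifting all logits by a constant doesn't change the probabilities, and that perplexity is the exponential of the average negative log probability:

```python
import math

def softmax(logits):
    """Convert raw logit scores into probabilities: exponentiate, then normalize."""
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
shifted = softmax([x + 100 for x in logits])  # shift-invariance: same probabilities

# Perplexity: average the negative log probabilities of the targets, then exponentiate.
target_probs = [0.5, 0.5, 0.5]  # suppose the model assigns 1/2 to each target token
avg_nll = sum(-math.log(p) for p in target_probs) / len(target_probs)
print(round(math.exp(avg_nll), 6))  # 2.0 -- like guessing between two equal options
```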

Setup¶

You already practiced tokenization and model.generate in the previous notebook. Here we go one level deeper: we will inspect the raw logits and probabilities the model uses to choose each next token.

In [ ]:
# If the import fails, uncomment the following line:
# !pip install transformers
import torch, os
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer
import pandas as pd

# Avoid a warning message
os.environ["TOKENIZERS_PARALLELISM"] = "false"

One step in this notebook will ask you to write a function. The most common error when turning notebook code into a function is accidentally using a global variable instead of a value computed inside the function. The cell below defines a quick-and-dirty utility to check for that mistake. (For a more polished version, check out localscope.)

In [ ]:
def check_global_vars(func, allowed_globals):
    import inspect
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(f"The function {func.__name__} used unexpected global variables: {list(disallowed_globals)}")

The next cell will download and load the model.

As in the previous notebook, we'll use the Hugging Face Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on OpenAI's GPT-2, perhaps the first "large" language model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

  • Documentation for the model and tokenizer
  • Model Card for GPT-2
In [ ]:
model_name = "openai-community/gpt2"

# Other models you could try:
# model_name = "EleutherAI/pythia-1.4b-deduped"
# model_name = "google/gemma-3-4b"
# model_name = "google/gemma-3-4b-it"
# Note: you'll need to accept the license agreement on https://huggingface.co/google/gemma-7b to use Gemma models

print("Loading tokenizer...")
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
print("Loading model...")
model = AutoModelForCausalLM.from_pretrained(model_name)
model = model.to("cpu")
streamer = TextStreamer(tokenizer)

# Add the EOS token as PAD token to avoid warnings
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = model.generation_config.eos_token_id
# Silence a warning.
tokenizer.decode([tokenizer.eos_token_id]);
print("Loaded on CPU.")
Loading tokenizer...
Loading model...
Loaded on CPU.
In [ ]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")
The tokenizer has 50257 strings in its vocabulary.
The model has 124,439,808 parameters.

Task¶

In the previous notebook, you used generate to produce text. In this notebook, you will manually inspect what generate is based on: next-token logits.

Consider the following phrase:

In [ ]:
phrase = "This weekend I plan to"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

1: Call the tokenizer on the phrase to get a batch. Take a look at what the batch contains, then extract the input_ids.

In [ ]:
batch = tokenizer(ph..., return_tensors='pt')
input_ids = batch['in...']

2: Call the model on the input_ids. Examine the shape of the logits; what does each number mean?

Note: The model returns an object that has multiple values. The logits are in model_output.logits.

In [ ]:
with torch.no_grad(): # This tells PyTorch we don't need it to compute gradients for us.
    model_output = model(...)
print(f"logits shape: {list(model_output.logits.shape)}")
logits shape: [1, 5, 50257]

3: Pull out the logits corresponding to the last token in the input phrase. Hint: Think about what each number in the shape means. Remember that in Python, arr[-1] is shorthand for arr[len(arr) - 1].

In [ ]:
last_token_logits = model_output.logits[...]
assert last_token_logits.shape == (len(tokenizer.get_vocab()),)

4: Identify the token id and corresponding string of the most likely next token.

To find the most likely token, we need to find the index of the largest value in the last_token_logits. The method that does this is called argmax. (It's a common enough operation that it's built into PyTorch.)

Note: The tokenizer has a decode method that takes a token id, or a list of token ids, and returns the corresponding string.
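As a quick standalone illustration of argmax (a toy tensor, not the model's logits):

```python
import torch

scores = torch.tensor([0.1, 2.7, 0.3, 1.5])
best_index = scores.argmax()   # index of the largest value
print(int(best_index))         # 1
```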

In [ ]:
# compute the probability distribution over the next token
last_token_probabilities = last_token_logits.sof...(dim=-1)
# dim=-1 means to compute the softmax over the last dimension
In [ ]:
most_likely_token_id = ...
decoded_token = tokenizer.decode(most_likely_token_id)
probability_of_most_likely_token = last_token_probabilities[...]

print("For the phrase:", phrase)
print(f"Most likely next token: {most_likely_token_id}, which corresponds to {repr(decoded_token)}, with probability {probability_of_most_likely_token:.2%}")
For the phrase: This weekend I plan to
Most likely next token: 467, which corresponds to ' go', with probability 5.79%

5: Use the topk method to find the top-10 most likely choices for the next token.

See the documentation for torch.topk. Calling topk on a tensor returns a named tuple with two tensors: values and indices. The values are the top-k values, and the indices are the indices of those values in the original tensor. (In this case, the indices are the token ids.)
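Here is what topk returns on a toy tensor (a standalone sketch, not the model's logits):

```python
import torch

scores = torch.tensor([0.1, 2.7, 0.3, 1.5], dtype=torch.float64)
top2 = scores.topk(2)
print(top2.values.tolist())    # [2.7, 1.5] -- the two largest scores, descending
print(top2.indices.tolist())   # [1, 3] -- their positions in the original tensor
```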

Note: This uses Pandas to make a nicely displayed table, and a list comprehension to decode the tokens. You don't need to understand every detail, but I encourage you to think through what's going on.

In [ ]:
most_likely_tokens = last_token_logits.topk(...)
print(f"most likely token index from topk is {most_likely_tokens.indices[0]}") # this should be the same as argmax
decoded_tokens = [tokenizer.decode(...) for ... in most_likely_tokens.indices]
probabilities_of_most_likely_tokens = last_token_probabilities[most_likely_tokens.indices]

# Make a nice table to show the results
most_likely_tokens_df = pd.DataFrame({
    'tokens': decoded_tokens,
    'probabilities': probabilities_of_most_likely_tokens,
})
# Show the table, in a nice formatted way (see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Builtin-Styles)
# Caution: this "gradient" has *nothing* to do with gradient descent! (It's a color gradient.)
most_likely_tokens_df.style.hide(axis='index').background_gradient()
most likely token index from topk is 467
Out[ ]:
tokens probabilities
go 0.057940
take 0.053050
attend 0.038625
visit 0.036411
be 0.027352
do 0.024958
make 0.023818
spend 0.021303
play 0.019172
travel 0.017760
6: Write a function that, given a phrase and a number k, returns the most_likely_tokens_df DataFrame with the top-k most likely next tokens. (Don't include the style line.)

Build this function using only code that you've already filled in above. Clean up the code so that it doesn't do or display anything extraneous. Add comments about what each step does.

In [ ]:
def predict_next_tokens(...):
    # your code here

def show_tokens_df(tokens_df):
    return tokens_df.style.hide(axis='index').background_gradient()

check_global_vars(predict_next_tokens, allowed_globals=["torch", "tokenizer", "pd", "model"])
In [ ]:
show_tokens_df(predict_next_tokens("This weekend I plan to", 5))
Out[ ]:
tokens probabilities
go 0.057940
take 0.053050
attend 0.038625
visit 0.036411
be 0.027352
In [ ]:
show_tokens_df(predict_next_tokens("To be or not to", 5))
Out[ ]:
tokens probabilities
be 0.964031
become 0.004372
have 0.004315
Be 0.001392
get 0.000955
In [ ]:
show_tokens_df(predict_next_tokens("For God so loved the", 5))

Perplexity¶

7: Loss for a single token

So far you looked at top predictions. Now flip the question: given text that already exists, how surprised was the model by a specific token? For a target token, compute the negative log-likelihood $-\log(P(\text{token} \mid \text{context}))$. This is exactly the single-token loss used in training.
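The arithmetic itself is simple; a standalone sketch with a made-up probability:

```python
import math

# Suppose the model assigned probability 0.25 to the token that actually occurred:
p = 0.25
token_loss = -math.log(p)
print(round(token_loss, 4))   # 1.3863

# A fully confident correct prediction (p = 1) would have loss -log(1) = 0;
# the rarer the model considered the actual token, the larger the loss.
```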

The code below tokenizes a phrase and runs the model. Your job: extract the probability of the actual last token, then compute its loss.

In [ ]:
text = "Let's stop for lunch because I'm getting hungry"
input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(model.device)

with torch.no_grad():
    logits = model(input_ids=input_ids).logits

# The final token in the sequence is the target token, and logits at -2 predict it.
actual_last_token_id = input_ids[0, ...]
probs_before_last_token = logits[0, -2].softmax(dim=-1)

prob_of_actual_last_token = probs_before_last_token[...]
last_token_loss = -torch.log(...)
actual_last_token_int = int(actual_last_token_id.detach().cpu())

print("Text:", text)
print("Actual last token:", repr(tokenizer.decode([actual_last_token_int])))
print(f"Token with highest probability: {tokenizer.decode([probs_before_last_token.argmax()])!r}, with probability {float(probs_before_last_token.max()):.4f}")
print(f"P(actual token | previous context): {float(prob_of_actual_last_token):.4f}")
print(f"Token loss = -log(P): {float(last_token_loss):.4f}")

Think about the loss value you just computed. What would a loss of 0 mean? What would the loss be if the model were certain the next token would be "hungry"?

8: Per-token surprise across a whole sequence

Now compute surprise for every predicted next token in a sentence. The token loss value $-\log(P(\text{token} \mid \text{context}))$ is larger when the model finds that token less expected.

Fill in the key line in the loop below.

In [ ]:
def token_surprise_table(text):
    input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(model.device)
    with torch.no_grad():
        logits = model(input_ids=input_ids).logits  # shape: (1, seq_len, vocab_size)

    # logits[0, i, :] predicts token at position i + 1
    rows = []
    for i in range(input_ids.shape[1] - 1):
        # Goal: compute the surprise (negative log probability) of the
        # token that actually appears at position i + 1,
        # given the model's predictions at position i.

        probs = logits[0, ...]...
        actual_next_token = input_ids[0, ...]
        token_loss = ...
        actual_next_token_int = int(actual_next_token.detach().cpu())
        rows.append({
            "previous_tokens": tokenizer.decode(input_ids[0, :i+1]),
            "token": tokenizer.decode([actual_next_token_int]),
            "probability": float(probs[actual_next_token]),
            "surprise": float(token_loss),
        })
    return pd.DataFrame(rows)
In [ ]:
surprise_df = token_surprise_table(text)
surprise_df

Look at the surprise values. Which token has the highest surprise? Which has the lowest? Why do you think the model found some tokens harder to predict than others?

9: Sequence perplexity

Perplexity is defined as $\exp(\text{average token loss})$. Intuitively, it is the model's effective branching factor: lower means the model is less surprised by the text.
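The "branching factor" intuition can be checked directly: for a uniform distribution over n equally likely outcomes, the perplexity is exactly n (a standalone sketch):

```python
import math

def perplexity_of_uniform(n):
    # Each outcome has probability 1/n; average the (equal) negative
    # log probabilities, then exponentiate.
    avg_nll = -math.log(1.0 / n)
    return math.exp(avg_nll)

print(round(perplexity_of_uniform(2), 6))  # 2.0  (fair coin)
print(round(perplexity_of_uniform(6), 6))  # 6.0  (fair six-sided die)
```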

Write a compute_perplexity function. You already have all the pieces from the previous step — now wrap it up and compute the final number.

In [ ]:
import math

def compute_perplexity(text, model_to_use=None):
    if model_to_use is None:
        model_to_use = model
    input_ids = tokenizer(text, return_tensors='pt')['input_ids'].to(model_to_use.device)
    # Strategy: collect losses in a list, then average them,
    # then exponentiate the average loss to get perplexity.
    # your code here
    return math.exp(...)

check_global_vars(compute_perplexity, allowed_globals=["torch", "tokenizer", "model", "math"])
In [ ]:
texts = [
    "The cat sat on the mat.",
    "The cat computed the eigenvalue.",
    "Flurb zazzle moop tink wob.",
]

pd.DataFrame({
    "text": texts,
    "perplexity": [compute_perplexity(t) for t in texts],
}).sort_values("perplexity")

A fair coin has perplexity 2 (two equally likely options). A fair 6-sided die has perplexity 6. What does the perplexity you computed mean in terms of "how many equally likely options was the model choosing among"?

10: Break the model, measure the damage

How robust is the model to degraded parameters? Quantize the weights to lower precision and measure how perplexity changes.

Before running the next cell: this model has 124M parameters. If each parameter is stored at a given number of bits, how many bytes would that take? Fill in the table below with ballpark numbers (e.g., one of them will be 124 megabytes; remember 8 bits = 1 byte), then run the cell to see what happens to perplexity at each level.

Bits per parameter    | Model size
----------------------|-------------------------------------
32 (original float32) | fill in model size in familiar units
24                    | fill in model size in familiar units
16                    | fill in model size in familiar units
8                     | fill in model size in familiar units
4                     | fill in model size in familiar units
2                     | fill in model size in familiar units
1                     | fill in model size in familiar units
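If you want to check your ballpark numbers, the arithmetic is a one-liner per row (using the 124,439,808-parameter count printed earlier, and ignoring any storage overhead):

```python
n_params = 124_439_808  # GPT-2's parameter count, printed earlier in this notebook

for bits in [32, 24, 16, 8, 4, 2, 1]:
    size_mb = n_params * bits / 8 / 1e6   # 8 bits = 1 byte
    print(f"{bits:>2} bits/param -> about {size_mb:,.0f} MB")
```

At 8 bits per parameter this gives about 124 MB, matching the hint above.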
In [ ]:
import copy

def quantize_model(original_model, bits):
    """Simulate uniform quantization by rounding parameters to a fixed number of levels."""
    quantized = copy.deepcopy(original_model).cpu()
    with torch.no_grad():
        levels = 2 ** bits - 1
        for param in quantized.parameters():
            pmin = param.min()
            pmax = param.max()
            if torch.isclose(pmax, pmin):
                continue
            scale = (pmax - pmin) / levels
            q = ((param - pmin) / scale).round().clamp(0, levels)
            param.copy_(q * scale + pmin)
    return quantized
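To see concretely what the quantizer does to each parameter tensor, here is the same rounding applied by hand to a tiny made-up tensor (1 bit means every value snaps to either the tensor's min or its max):

```python
import torch

# A hypothetical 4-value "parameter" tensor, quantized to 1 bit by hand:
param = torch.tensor([0.0, 0.3, 0.6, 1.0])
bits = 1
levels = 2 ** bits - 1                       # one interval between min and max
pmin, pmax = param.min(), param.max()
scale = (pmax - pmin) / levels
q = ((param - pmin) / scale).round().clamp(0, levels)
dequantized = q * scale + pmin
print(dequantized)                           # tensor([0., 0., 1., 1.])
```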
In [ ]:
quant_test_text = "The cat sat on the mat."
bit_results = []

for bits in [32, 24, 16, 8, 4, 2, 1]:
    q_model = quantize_model(model, bits=bits)
    q_ppl = compute_perplexity(quant_test_text, model_to_use=q_model)
    bit_results.append({"bits": bits, "perplexity": q_ppl})
    print(f"{bits}-bit perplexity: {q_ppl:.2f}")

pd.DataFrame(bit_results)

Analysis¶

Write your answers to these questions on Moodle.

Q1: Give a specific example of the shape of model_output.logits and explain what each number means.

your answer here

Q2: Change the -1 in the definition of last_token_logits to -3. What does the variable represent now (what would be a better name for it)? What does its argmax represent?

your answer here

Q3: In your per-token surprise table, what was the highest-surprise token? Why do you think the model found it surprising?

your answer here

Q4: How did quantization affect perplexity? At what bit width did the model start to degrade meaningfully? What does this suggest about precision needs?

your answer here

Q5 (Bonus): Without looking back at code, write the expression for -log(P(token | context)) for the word "Michigan" in "I visited Muskegon, Michigan". What values would you need from the model to compute it?

your answer here