In this assignment, you will evaluate language models by computing perplexity, a key metric that reveals how well models predict text. You’ll analyze how performance scales with model size, connecting to fundamental concepts in language model evaluation.
This assignment addresses the following course objectives:
Students may also use this exercise to demonstrate additional objectives, such as:
Your goal is to evaluate how language model performance (measured by perplexity) changes with model size:
Use models from the SmolLM2 family, which are available on the Hugging Face model hub:
- HuggingFaceTB/SmolLM2-135M (135 million parameters)
- HuggingFaceTB/SmolLM2-360M (360 million parameters)
- HuggingFaceTB/SmolLM2-1.7B (1.7 billion parameters, if your system can handle it)

Use the ROCStories dataset, which contains short five-sentence stories. You can load an unofficial mirror from the Hugging Face hub using the datasets library:
```python
from datasets import load_dataset

# Load an unofficial ROCStories mirror (one such mirror; substitute another repo id if needed)
rocstories = load_dataset("Ximing/ROCStories")

# Take a sample of stories for evaluation
stories = rocstories["train"].select(range(50))
```
Here’s the recommended approach for computing perplexity:
Create a function with this signature (you may want to return additional values for token-level analysis, but start with this):
```python
def compute_perplexity(model, tokenizer, text):
    """
    Compute the perplexity of a model on a given text.

    Args:
        model: A language model that returns logits
        tokenizer: The tokenizer associated with the model
        text: The text to evaluate

    Returns:
        float: The perplexity of the model on the text
    """
    # Your implementation here
```
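As a reminder, perplexity is the exponentiated average negative log-likelihood that the model assigns to each token:

$$\mathrm{PPL}(t_1, \ldots, t_N) = \exp\left(-\frac{1}{N} \sum_{i=1}^{N} \log p(t_i \mid t_1, \ldots, t_{i-1})\right)$$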
Key implementation steps:

1. Tokenize the text and run the model to get logits at every position.
2. Convert the logits to log-probabilities (e.g., with a log-softmax).
3. At each position, look up the log-probability assigned to the token that actually came next.
4. Average the negative log-probabilities and exponentiate to get perplexity.

There are shortcuts (such as passing labels into the model, or asking an AI to generate the code for you), but I strongly recommend you do it manually to understand the process.

Note: Refer to Lab 2 for examples of how to extract and work with logits from language models.
Caution about indexing: Pay careful attention to token positions! Remember that when predicting the token at position i, you use the logits from position i-1. This off-by-one error is easy to make.
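For reference, here is one possible shape of the computation, including the shift described above. This is a minimal sketch assuming a Hugging Face causal LM; working out the details yourself is the point of the exercise:

```python
import torch
import torch.nn.functional as F

def compute_perplexity(model, tokenizer, text):
    input_ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids).logits  # shape: (1, seq_len, vocab_size)
    # The logits at position i-1 predict the token at position i,
    # so align logits[:, :-1] with input_ids[:, 1:].
    log_probs = F.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_log_probs = log_probs[torch.arange(len(targets)), targets]
    # Perplexity is the exponentiated mean negative log-likelihood
    return torch.exp(-token_log_probs.mean()).item()
```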
For data collection, consider creating a structure like:
```python
import pandas as pd
from transformers import AutoModelForCausalLM, AutoTokenizer

results = []
for model_name in model_names:
    # Load each model and its tokenizer once, outside the story loop
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForCausalLM.from_pretrained(model_name)
    for story_idx, story in enumerate(stories):
        # Compute perplexity (adjust the field name to your dataset's schema)
        perplexity = compute_perplexity(model, tokenizer, story["text"])
        # Store results
        results.append({
            "model_name": model_name,
            "story_idx": story_idx,
            "perplexity": perplexity,
        })

# Convert to DataFrame for easier analysis
results_df = pd.DataFrame(results)
```
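From there, here is one way you might summarize and plot the scaling relationship. This is a sketch assuming the `results_df` above; the parameter counts come from the model names:

```python
import matplotlib.pyplot as plt

# Parameter counts, in millions, for the SmolLM2 models listed above
model_sizes = {
    "HuggingFaceTB/SmolLM2-135M": 135,
    "HuggingFaceTB/SmolLM2-360M": 360,
    "HuggingFaceTB/SmolLM2-1.7B": 1700,
}
summary = results_df.groupby("model_name")["perplexity"].mean().reset_index()
summary["params_millions"] = summary["model_name"].map(model_sizes)
summary.sort_values("params_millions").plot(
    x="params_millions", y="perplexity", marker="o", logx=True, legend=False
)
plt.xlabel("Parameters (millions, log scale)")
plt.ylabel("Mean perplexity")
```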
Create a Jupyter notebook that includes:
| Criterion | Level P (Progressing) | Level M (Met) | Level E (Excellent) |
|---|---|---|---|
| Implementation | Correctly implements perplexity calculation for at least one model | Correctly implements perplexity for all models and shows proper scaling analysis | Implements additional analyses (e.g., token-level perplexity, visualizations of challenging tokens) |
| Analysis | Presents basic comparison between models | Provides substantive analysis of the relationship between model size and performance | Connects findings to broader concepts in LLM scaling laws and performance patterns |
| Visualization | Creates basic table of results | Creates clear plot showing relationship between model size and perplexity | Creates multiple informative visualizations that effectively communicate patterns in the data |
In this lab, you’ll trace through parts of the implementation of a Transformer language model, focusing on the self-attention mechanism. We’ll compare the performance of a Transformer model with a baseline that only uses a feedforward network (MLP).
This lab addresses the following course objectives:
It could also be used to address the following course objectives:
Start with this notebook: Implementing self-attention (`u10n1-implement-transformer.ipynb`).
You may find it helpful to refer to Jay Alammar’s The Illustrated GPT-2 (Visualizing Transformer Language Models).
Extension idea: measure how fast the model runs (torch.compile it first).

You can use this exercise to demonstrate the following course objectives:
Run and report on a controlled experiment where you change one thing about a Transformer-based language model and plot the results. Your report should have at least 2 plots, one for each of the two y axes you choose (see below). You may have more than 2 plots if you pick two x axes, which you’ll need to do if you choose a parameter that’s already implemented in Lab 3.
I suggest starting with the simple Transformer implementation in Lab 3. Train the model for a bit longer than we did in the lab, though, since the model clearly hadn’t converged. (Also, rather than going through the existing training set multiple times, you should probably just use a bigger training set. You can use the same TinyStories dataset, but use more of it; see the sketch below.) You can also use a different dataset if you prefer.
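For example, assuming the lab loads TinyStories with the datasets library, you might just take a larger slice (the slice size here is illustrative):

```python
from datasets import load_dataset

# A larger slice of TinyStories than the lab used; adjust the size to your compute budget
train_data = load_dataset("roneneldan/TinyStories", split="train[:100000]")
```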
Here are some suggested x axes (things to change); pick one or two:

- Number of attention heads (as in the example data below)
- Simplifying the attention mechanism (e.g., replacing its output with `x + self.value(x)` for some learned `self.value` function, etc.)

Here are some suggested y axes (things to measure). Everyone should do the first one, and then also pick one other:
- Training or validation loss (everyone should report this)
- Inference speed, measured inside a `with torch.no_grad():` block (see the sketch below)

Run multiple trials if you can. Your plots should have at least 3 data points, ideally more.
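Here’s a sketch of one way to time inference in tokens per second; `model` and `val_tokenized` are assumed to come from your lab code:

```python
import time
import torch

model.eval()
n_tokens = 0
start = time.perf_counter()
with torch.no_grad():
    for input_ids in val_tokenized:  # assumed: yields batched token-ID tensors
        model(input_ids)
        n_tokens += input_ids.numel()
elapsed = time.perf_counter() - start
print(f"{n_tokens / elapsed:.0f} tokens/sec")
```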
Write a Jupyter notebook that includes:
Include all code needed to reproduce your results. Your notebook need not re-run all of the experiments; use saved results to make the plot. Example:
```python
import matplotlib.pyplot as plt
import pandas as pd

results_nheads = [
    {"num_heads": 1, "train_loss": 3.2, "val_loss": 3.5, "tokens_per_sec": 400},
    {"num_heads": 1, "train_loss": 3.5, "val_loss": 3.4, "tokens_per_sec": 400},
    {"num_heads": 2, "train_loss": 2.8, "val_loss": 3.0, "tokens_per_sec": 400},
    {"num_heads": 2, "train_loss": 2.8, "val_loss": 3.0, "tokens_per_sec": 400},
    {"num_heads": 4, "train_loss": 2.5, "val_loss": 2.7, "tokens_per_sec": 400},
    {"num_heads": 4, "train_loss": 2.5, "val_loss": 2.7, "tokens_per_sec": 400},
    {"num_heads": 8, "train_loss": 2.3, "val_loss": 2.9, "tokens_per_sec": 400},
    {"num_heads": 8, "train_loss": 2.3, "val_loss": 2.9, "tokens_per_sec": 400},
]
pd.DataFrame(results_nheads).plot(x="num_heads", y="val_loss")
plt.xlabel("Number of attention heads")
plt.ylabel("Validation loss")

# and then in another code chunk
pd.DataFrame(results_nheads).plot(x="num_heads", y="tokens_per_sec")
plt.xlabel("Number of attention heads")
plt.ylabel("Tokens per second")
```
fsspec errors

This is a drastic step to fix an issue with the Kaggle image. It will probably not be necessary for you. If you get strange errors about fsspec when importing datasets, you may need to blow away and reinstall fsspec. Put this at the top of the very first code chunk:

```
!rm -rfv /opt/conda/lib/python3.10/site-packages/fsspec-2024.3.0.dist-info/
```
You can add a progress bar to your training loop by using the tqdm library. Here’s how you can use it:

```python
import tqdm
```

and then in your training loop:

```python
for example in tqdm.tqdm(train_tokenized):
    ...  # your existing loop body
```
You can speed up your Transformer training by using gradient accumulation. This simply means that instead of stepping the optimizer every iteration, you step it every N iterations. In code, instead of this:
```python
loss.backward()
optimizer.step()
optimizer.zero_grad()
losses.append(loss.item())
```
You would do this instead (which gives an effective batch size of 4, since each iteration processes a single sequence):
```python
loss.backward()
# sample_counter counts iterations of your training loop
if sample_counter % 4 == 0:
    optimizer.step()
    optimizer.zero_grad()
losses.append(loss.item())
```
There’s a trade-off between speed and learning here: the more you accumulate gradients, the faster your training will be, but the less often you’ll be updating your model, so it might not actually learn faster. So you may need to change the learning rate a bit, depending on how many iterations you accumulate gradients over. It reduces time per epoch, but I haven’t checked whether this actually improves time to convergence.
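One common refinement (my suggestion, not part of the lab code): scale the loss by the number of accumulation steps so the accumulated gradient is an average rather than a sum, which makes the learning rate less sensitive to how many steps you accumulate over:

```python
ACCUM_STEPS = 4

(loss / ACCUM_STEPS).backward()  # scale so gradients average over the effective batch
if sample_counter % ACCUM_STEPS == 0:
    optimizer.step()
    optimizer.zero_grad()
losses.append(loss.item())
```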