Optional Extension: Architectural Experimentation

Outcomes

You can use this exercise to demonstrate the following course objectives:

Task

Run and report on a controlled experiment where you change one thing about a Transformer-based language model and plot the results. Your report should have at least 2 plots, one for each of the two y axes you choose (see below). You may have more than 2 plots if you pick two x axes, which you’ll need to do if you choose a parameter that is already implemented in Lab 3.

I suggest starting with the simple Transformer implementation in Lab 3. Train the model for longer than we did in the lab, though, since the model clearly hadn’t converged. (Also, rather than going through the existing training set multiple times, you should probably just use a bigger training set. You can use the same TinyStories dataset, but use more of it.) You can also use a different dataset if you prefer.

x axes: things to change

Here are some suggested possible x axes (things to change) - pick one or two:

y axes: things to measure

Here are some suggested y axes (things to measure). Everyone should do the first one, and then also pick one other:

Run multiple trials if you can. Your plots should have at least 3 data points, ideally more.
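If you do run multiple trials, one convenient pattern is to average them before plotting. Here is a sketch, assuming you store results as a list of dicts as in the example below (the trial values here are made up):

```python
import pandas as pd

# Hypothetical results: two trials per setting (values are made up)
results = [
    {"num_heads": 2, "val_loss": 3.0},
    {"num_heads": 2, "val_loss": 3.2},
    {"num_heads": 4, "val_loss": 2.7},
    {"num_heads": 4, "val_loss": 2.9},
]
df = pd.DataFrame(results)
# One point per x value: the mean across trials, plus the spread
summary = df.groupby("num_heads")["val_loss"].agg(["mean", "std"])
print(summary)
```

`summary["mean"]` can then be plotted directly, and `summary["std"]` passed as `yerr=` to `.plot()` to get error bars.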

Analysis and Submission

Write a Jupyter notebook that includes:

Include all code needed to reproduce your results. Your notebook need not re-run all of the experiments; use saved results to make the plot. Example:

import matplotlib.pyplot as plt
import pandas as pd

results_nheads = [
  {"num_heads": 1, "train_loss": 3.2, "val_loss": 3.5, "tokens_per_sec": 400},
  {"num_heads": 1, "train_loss": 3.5, "val_loss": 3.4, "tokens_per_sec": 400},
  {"num_heads": 2, "train_loss": 2.8, "val_loss": 3.0, "tokens_per_sec": 400},
  {"num_heads": 2, "train_loss": 2.8, "val_loss": 3.0, "tokens_per_sec": 400},
  {"num_heads": 4, "train_loss": 2.5, "val_loss": 2.7, "tokens_per_sec": 400},
  {"num_heads": 4, "train_loss": 2.5, "val_loss": 2.7, "tokens_per_sec": 400},
  {"num_heads": 8, "train_loss": 2.3, "val_loss": 2.9, "tokens_per_sec": 400},
  {"num_heads": 8, "train_loss": 2.3, "val_loss": 2.9, "tokens_per_sec": 400},
]
pd.DataFrame(results_nheads).plot(x="num_heads", y="val_loss")
plt.xlabel("Number of attention heads")
plt.ylabel("Validation loss")

# and then in another code chunk
pd.DataFrame(results_nheads).plot(x="num_heads", y="tokens_per_sec")
plt.xlabel("Number of attention heads")
plt.ylabel("Training speed (tokens/sec)")

Tips

fsspec errors

This is a drastic step to fix an issue with the Kaggle image. It will probably not be necessary for you.

If you get strange errors about fsspec when importing datasets, you may need to blow away and reinstall fsspec. Put this at the top of the very first code chunk:

!rm -rfv /opt/conda/lib/python3.10/site-packages/fsspec-2024.3.0.dist-info/

Progress Bar

You can add a progress bar to your training loop by using the tqdm library. Here’s how you can use it:

import tqdm

and then in your training loop:

for example in tqdm.tqdm(train_tokenized):
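A slightly fuller sketch, in case it helps (the `range(1000)` here is just a stand-in for `train_tokenized`):

```python
import tqdm

# tqdm wraps any iterable and prints a live progress bar.
# desc= labels the bar; total= sets its length when the iterable has no len().
count = 0
for example in tqdm.tqdm(range(1000), desc="training"):
    count += 1  # your training step would go here
```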

Gradient Accumulation

You can speed up your Transformer training by using gradient accumulation. This simply means that instead of stepping the optimizer every iteration, you step it every N iterations. In code, instead of this:

loss.backward()
optimizer.step()
optimizer.zero_grad()
losses.append(loss.item())

You would do this, where sample_counter counts training iterations (with % 4, this gives an effective batch size of 4, since each step only used a single sequence):

loss.backward()
if sample_counter % 4 == 0:
    optimizer.step()
    optimizer.zero_grad()
losses.append(loss.item())

There’s a trade-off between speed and learning here: the more iterations you accumulate gradients over, the faster your training will run, but the less often you’ll update your model, so it might not actually learn faster. You may therefore need to adjust the learning rate depending on how many iterations you accumulate over. Accumulation reduces time per epoch, but I haven’t checked whether it actually improves time to convergence.
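One detail the snippet above glosses over: if you sum gradients over N iterations without scaling, the accumulated gradient is roughly N times larger than a single-sample gradient, which interacts with the learning rate. A common fix is to divide each loss by N before calling backward(). Here is a self-contained sketch with a toy model (the model and data are placeholders, not the Lab 3 setup):

```python
import torch

# Toy stand-ins for a real model and dataset
torch.manual_seed(0)
model = torch.nn.Linear(4, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

N = 4  # accumulate gradients over N samples
losses = []
for sample_counter in range(1, 17):
    x = torch.randn(1, 4)   # placeholder input
    y = torch.randn(1, 1)   # placeholder target
    loss = torch.nn.functional.mse_loss(model(x), y)
    # Dividing by N keeps the accumulated gradient on the same scale as a
    # true batch-of-N average, so the learning rate behaves comparably.
    (loss / N).backward()
    if sample_counter % N == 0:
        optimizer.step()
        optimizer.zero_grad()
    losses.append(loss.item())
```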
