Task: Ask a language model how likely each token in its vocabulary is to be the next one.
We start in the same way as the tokenization notebook:
# If the import fails, uncomment the following line:
# !pip install transformers
import torch
from torch import tensor
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
# Avoid a warning message
import os; os.environ["TOKENIZERS_PARALLELISM"] = "false"
One step in this notebook will ask you to write a function. The most common error when function-ifying notebook code is accidentally using a global variable instead of a value computed in the function. This is a quick and dirty little utility to check for that mistake. (For a more polished version, check out localscope.)
def check_global_vars(func, allowed_globals):
    import inspect
    used_globals = set(inspect.getclosurevars(func).globals.keys())
    disallowed_globals = used_globals - set(allowed_globals)
    if len(disallowed_globals) > 0:
        raise AssertionError(f"The function {func.__name__} used unexpected global variables: {list(disallowed_globals)}")
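Here's a quick toy illustration (not part of the assignment) of the kind of mistake this utility catches:
scale = 10
def buggy_double(x):
    return x * scale  # bug: `scale` sneaks in as a global instead of a parameter

try:
    check_global_vars(buggy_double, allowed_globals=[])
except AssertionError as e:
    print(e)  # reports that `scale` is an unexpected global variable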
Download and load the model.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
tokenizer.pad_token = tokenizer.eos_token
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")
The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.
In the tokenization notebook, we simply used the generate method to have the model generate some text. Now we'll do it ourselves.
Consider the following phrase:
phrase = "This weekend I plan to"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"
1: Call the tokenizer on the phrase to get a batch that includes input_ids.
batch = tokenizer(ph..., return_tensors='pt')
input_ids = batch['in...']
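If you're not sure what the tokenizer hands back, it can help to poke at it on a throwaway string first (a toy example; the exact ids don't matter):
toy_batch = tokenizer("a tiny example", return_tensors='pt')
print(toy_batch.keys())        # includes 'input_ids' (and usually 'attention_mask')
print(toy_batch['input_ids'])  # a 2-D tensor: 1 sequence by (number of tokens)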
2: Call the model on the input_ids. Examine the shape of the logits.
with torch.no_grad(): # This tells PyTorch we don't need it to compute gradients for us.
    model_output = model(...)
print(f"logits shape: {list(model_output.lo...)}")
logits shape: [1, 5, 50257]
3: Pull out the logits corresponding to the last token in the input phrase. Hint: Think about what each number in the shape means.
Note: The model returns a dictionary-like object. The logits are in model_output.logits.
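If you're unsure about the indexing, here's the same idea on a made-up tensor with a tiny pretend vocabulary (toy shapes, purely for illustration):
toy_logits = torch.arange(2 * 3 * 4).reshape(2, 3, 4)  # shaped like (batch, positions, vocab size)
print(toy_logits.shape)         # torch.Size([2, 3, 4])
print(toy_logits[0, -1])        # the 4 "vocabulary" scores at the last position of the first sequence
print(toy_logits[0, -1].shape)  # torch.Size([4])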
last_token_logits = model_output.logits[...]
assert last_token_logits.shape == (len(tokenizer.get_vocab()),)
4: Identify the token id and corresponding string of the most likely next token.
To find the most likely token, we need to find the index of the largest value in last_token_logits. The method that does this is called argmax. (It's a common enough operation that it's built into PyTorch.)
Note: The tokenizer has a decode method that takes a token id, or a list of token ids, and returns the corresponding string.
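If softmax or argmax is new to you, here's what each does on a tiny made-up tensor (the numbers are arbitrary):
toy_scores = tensor([2.0, 1.0, 0.1])
print(toy_scores.softmax(dim=-1))  # three probabilities that sum to 1; the biggest score gets the biggest probability
print(toy_scores.argmax())         # tensor(0): the index of the largest score
print(repr(tokenizer.decode(42)))  # decode maps a token id (here an arbitrary one) back to its string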
# compute the probability distribution over the next token
last_token_probabilities = last_token_logits.sof...(dim=-1)
# dim=-1 means to compute the softmax over the last dimension
most_likely_token_id = ...
decoded_token = tokenizer.decode(most_likely_token_id)
probability_of_most_likely_token = last_token_probabilities[...]
print("For the phrase:", phrase)
print(f"Most likely next token: {most_likely_token_id}, which corresponds to {repr(decoded_token)}, with probability {probability_of_most_likely_token:.2%}")
For the phrase: This weekend I plan to
Most likely next token: 467, which corresponds to ' go', with probability 5.98%
5: Use the topk method to find the top-10 most likely choices for the next token.
See the documentation for torch.topk. Calling topk on a tensor returns a named tuple with two tensors: values and indices. The values are the top-k values, and the indices are the indices of those values in the original tensor. (In this case, the indices are the token ids.)
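Here's topk on a small made-up tensor, just to see the shape of what comes back:
toy_scores = tensor([0.1, 4.0, 2.0, 3.0])
top2 = toy_scores.topk(2)
print(top2.values)   # tensor([4., 3.]): the two largest values, biggest first
print(top2.indices)  # tensor([1, 3]): where those values sit in the original tensor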
Note: This uses Pandas to make a nicely displayed table, and a list comprehension to decode the tokens. You don't need to understand how this all works, but I highly encourage thinking about what's going on.
most_likely_tokens = last_token_logits.topk(...)
print(f"most likely token index from topk is {most_likely_tokens.indices[0]}") # this should be the same as argmax
decoded_tokens = [tokenizer.decode(...) for ... in most_likely_tokens.indices]
probabilities_of_most_likely_tokens = last_token_probabilities[most_likely_tokens.indices]
# Make a nice table to show the results
most_likely_tokens_df = pd.DataFrame({
    'tokens': decoded_tokens,
    'probabilities': probabilities_of_most_likely_tokens,
})
# Show the table, in a nice formatted way (see https://pandas.pydata.org/pandas-docs/stable/user_guide/style.html#Builtin-Styles)
# Caution: this "gradient" has *nothing* to do with gradient descent! (It's a color gradient.)
most_likely_tokens_df.style.hide(axis='index').background_gradient()
most likely token index from topk is 467
| tokens | probabilities |
|---|---|
| go | 0.059828 |
| take | 0.043880 |
| spend | 0.031570 |
| make | 0.030519 |
| do | 0.029206 |
| be | 0.027960 |
| attend | 0.025885 |
| visit | 0.025827 |
| run | 0.022074 |
| have | 0.020955 |
6: Build a function predict_next_tokens(phrase, k) that returns a table of the k most likely next tokens and their probabilities, using only code that you've already filled in above. Clean up the code so that it doesn't do or display anything extraneous. Add comments about what each step does.
def predict_next_tokens(...):
    # your code here
check_global_vars(predict_next_tokens, allowed_globals=["torch", "tokenizer", "pd", "model"])
predict_next_tokens("This weekend I plan to", 5).style.hide_index().background_gradient()
| tokens | probabilities |
|---|---|
| go | 0.059828 |
| take | 0.043880 |
| spend | 0.031570 |
| make | 0.030519 |
| do | 0.029206 |
predict_next_tokens("To be or not to", 5).style.hide_index().background_gradient()
| tokens | probabilities |
|---|---|
| be | 0.648473 |
| have | 0.021346 |
| the | 0.012962 |
| do | 0.009471 |
| , | 0.007444 |
predict_next_tokens("For God so loved the", 5).style.hide_index().background_gradient()
Q1: Explain the shape of model_output.logits.
Q2: Change the -1 in the definition of last_token_logits to -3. What does the variable represent now? What does its argmax represent?
Q3: Let's think. The method in this notebook only gets the scores for one next token at a time. What if we wanted to do a whole sentence? We'd have to generate a token for each word in that sentence. What are a few different ways we could adapt the approach used in this notebook to generate a complete sentence?
To think about different ways to do this, think about what decision(s) you have to make when generating each token.
Note: you don't have to write any code to answer this question.
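To make the hint concrete, here is a rough sketch of just one of those ways: greedy decoding, where every step takes the single most likely token and feeds it back in. The interesting part of Q3 is what other choices you could make at the line marked "the decision". (This is only a sketch; you don't need to run or reproduce it.)
# Rough sketch, not a model answer: greedy decoding with distilgpt2.
generated_ids = tokenizer("This weekend I plan to", return_tensors='pt')['input_ids']
for _ in range(10):  # add 10 more tokens
    with torch.no_grad():
        logits = model(generated_ids).logits
    next_id = logits[0, -1].argmax()  # the decision: here we always take the most likely token
    generated_ids = torch.cat([generated_ids, next_id.reshape(1, 1)], dim=1)
print(tokenizer.decode(generated_ids[0]))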