Tokenization¶
Task: Convert text to numbers; interpret subword tokenization.
There are various ways of converting text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokenization).
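For intuition, here's a toy sketch of that idea (the vocabulary and ids below are made up; the real tokenizer we load later learned its subword pieces from a large corpus):
# A made-up miniature vocabulary mapping subword strings to integer ids.
toy_vocab = {"un": 0, "break": 1, "able": 2}
# "unbreakable" isn't in this vocabulary, but its pieces are:
toy_ids = [toy_vocab[piece] for piece in ("un", "break", "able")]
print(toy_ids)  # [0, 1, 2]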
Setup¶
We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.
- Documentation for the model and tokenizer.
- Model Card for GPT-2.
The transformers library is pre-installed on many systems, but in case you need to install it, you can run the following cell.
# Uncomment the following line to install the transformers library
#!pip install -q transformers
import torch
from torch import tensor
Download and load the model¶
This cell downloads the model and tokenizer, and loads them into memory.
# https://huggingface.co/docs/transformers/en/generation_strategies
from transformers import AutoTokenizer, AutoModelForCausalLM, TextStreamer, set_seed
model_name = "openai-community/gpt2"
# Here are a few larger models you could try:
# model_name = "EleutherAI/pythia-1.4b-deduped"
# model_name = "google/gemma-2b"
# model_name = "google/gemma-2b-it"
# Note: you'll need to accept the license agreement on https://huggingface.co/google/gemma-7b to use Gemma models
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
model = AutoModelForCausalLM.from_pretrained(model_name)
# Use the EOS token as the PAD token to avoid warnings.
if model.generation_config.pad_token_id is None:
    model.generation_config.pad_token_id = model.generation_config.eos_token_id
streamer = TextStreamer(tokenizer)
# Silence a warning.
tokenizer.decode([tokenizer.eos_token_id]);
token_to_id_dict = tokenizer.get_vocab()
print(f"The tokenizer has {len(token_to_id_dict)} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")
The tokenizer has 50257 strings in its vocabulary.
The model has 124,439,808 parameters.
# warning: this assumes that there are no gaps in the token ids, which happens to be true for this tokenizer.
id_to_token = [token for token, id in sorted(token_to_id_dict.items(), key=lambda x: x[1])]
print(f"The first 10 tokens are: {id_to_token[:10]}")
print(f"The last 10 tokens are: {id_to_token[-10:]}")
The first 10 tokens are: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
The last 10 tokens are: ['Ġ(/', 'âĢ¦."', 'Compar', 'Ġamplification', 'ominated', 'Ġregress', 'ĠCollider', 'Ġinformants', 'Ġgazed', '<|endoftext|>']
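If you'd rather verify that "no gaps" assumption than take it on faith, here's a quick check (a sketch using only the token_to_id_dict defined above):
# The ids are unique, so min == 0 and max == len - 1 together
# imply the ids are exactly 0, 1, ..., len(vocab) - 1.
assert min(token_to_id_dict.values()) == 0
assert max(token_to_id_dict.values()) == len(token_to_id_dict) - 1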
Demo¶
Run this code cell to see a demo of the language model in action. You can change the text to see how the model continues it. (You can play with the parameters if you want, but don't get side-tracked; we'll explore these in another notebook.)
Notice:
- The model continues the text in a way that seems coherent.
- The model generates one token at a time; tokens include punctuation.
- Some tokens include a space at the beginning.
set_seed(0)
model.generate(
    **tokenizer("A list of colors: red, blue,", return_tensors="pt"),
    max_new_tokens=10, do_sample=True, temperature=0.3, penalty_alpha=.5, top_k=5,
    streamer=streamer);
A list of colors: red, blue, green, blue, yellow, yellow, green,
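One way to see the last two points for yourself is to split a similar string into its tokens (a quick sketch; we'll work with tokenize in the Task below):
# Punctuation gets its own token, and a leading 'Ġ' marks tokens that begin with a space.
print(tokenizer.tokenize("A list of colors: red, blue, green"))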
Task¶
Consider the following phrase:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"
Getting familiar with tokens¶
1: Use tokenizer.tokenize to convert the phrase into a list of tokens. (What do you think the Ġ means?)
tokens = tokenizer.tokenize(phrase)
tokens
['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']
2: Use tokenizer.convert_tokens_to_string to convert the tokens back into a string.
# your code here
' I visited Muskegon'
# for comparison:
''.join(tokens)
'ĠIĠvisitedĠMuskegon'
What is the difference between the output from convert_tokens_to_string and the result of ''.join(tokens)?
your answer here
3: Use tokenizer.encode to convert the original phrase into token ids. (Note: this is equivalent to tokenize followed by convert_tokens_to_ids. Remember, tokenizers have two jobs; these correspond to the two methods.) Call the result input_ids.
input_ids = ...
input_ids
[314, 8672, 2629, 365, 14520]
4: Turn input_ids back into a readable string. Try this two ways: (1) using tokenizer.decode and (2) in two steps: using convert_ids_to_tokens, then a second step that you've already done previously. The result of (1) should be the same as the result of (2).
# using convert_ids_to_tokens
# your code here
' I visited Muskegon'
# using tokenizer.decode
# your code here
' I visited Muskegon'
Applying what you learned¶
5: Use model.generate(input_ids_batch) to generate a completion of this phrase. (Note that we need to wrap input_ids in []s to give the input a "batch" dimension, and convert it to a PyTorch tensor so the model code can use it.) Call the result output_ids. This one is done for you.
input_ids_batch = tensor([input_ids])
output_ids = model.generate(input_ids_batch, max_new_tokens=20)[0] # the [0] is to get the first example in the batch
output_ids
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
tensor([ 314, 8672, 2629, 365, 14520, 11, 290, 314, 373, 1297,
326, 262, 1748, 373, 287, 262, 1429, 286, 852, 3170,
13, 314, 373, 1297, 326])
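The warning above appears because we called generate without an attention mask. It's harmless for this exercise, but one way to silence it (a sketch, not required here) is to let the tokenizer build the mask and pass it along:
# The tokenizer returns both input_ids and an attention_mask;
# passing both to generate avoids the warning above.
encoded = tokenizer(phrase, return_tensors="pt")
output_ids = model.generate(**encoded, max_new_tokens=20)[0]  # same greedy completion, no warning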
6: Convert your output_ids into a readable form.
# your code here
' I visited Muskegon, and I was told that the city was in the process of being built. I was told that'
Note: generate uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try adding the following arguments to generate (there's a sketch after this list):
- Turn on do_sample=True. Run it a few times to see what it gives. Try temperature=0.7 or temperature=1.5.
- With sampling enabled, set top_k=5. Or 50.
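For example, here's a minimal sketch (the seed and parameter values are arbitrary; change them and re-run):
# Sample a completion instead of decoding greedily.
set_seed(42)  # arbitrary seed, just to make the sampled output reproducible
sampled_ids = model.generate(
    input_ids_batch,
    max_new_tokens=20,
    do_sample=True,   # sample from the model's distribution instead of taking the argmax
    temperature=0.7,  # <1 sharpens the distribution; >1 flattens it
    top_k=50,         # restrict sampling to the 50 most likely tokens
)[0]
print(tokenizer.decode(sampled_ids))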
7: What is the largest possible token id for the tokenizer we're using in this notebook? What token does it correspond to? (Hint: at the top of the notebook, we printed out the size of the vocabulary.)
# your code here
Analysis¶
Q1: Write a brief explanation of what a tokenizer does. Specifically, explain the two-step process of tokenization (text→tokens→ids) and how this enables language models to process text.
your response here
Q2: Try having the model complete the prefix "The word water is spelled w a". Explain why the model might struggle with this seemingly simple task by thinking about the tokenization process.
your response here
Q3: Does capitalization affect the output of the tokenizer? i.e., does the result of tokenizing a capitalized word differ from tokenizing a lowercased word? Run a simple test to find out. Then, try out how it handles misspellings.
your response here
# your code here