Tokenization¶

Task: Convert text to numbers; interpret subword tokenization.

There are many ways to convert text to numbers. This assignment works with one popular approach: assigning numbers to parts of words (subword tokenization).
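To make this concrete before we load a real tokenizer, here is a toy sketch. The vocabulary and the greedy longest-match loop below are made up purely for illustration (GPT-2 uses byte-pair encoding, which builds its vocabulary differently), but the core idea is the same: unfamiliar words are broken into familiar pieces, and each piece maps to an integer id.

# a tiny made-up vocabulary of word pieces and their ids
toy_vocab = {"token": 0, "ization": 1, "visit": 2, "ed": 3}
word = "tokenization"
# greedily take the longest known piece at each position
pieces, i = [], 0
while i < len(word):
    for end in range(len(word), i, -1):
        if word[i:end] in toy_vocab:
            pieces.append(word[i:end])
            i = end
            break
    else:
        raise ValueError(f"cannot tokenize {word[i:]!r}")
print(pieces)                          # ['token', 'ization']
print([toy_vocab[p] for p in pieces])  # [0, 1]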

Setup¶

We'll be using the HuggingFace Transformers library, which provides a (mostly) consistent interface to many different language models. We'll focus on the OpenAI GPT-2 model, famous for OpenAI's assertion that it was "too dangerous" to release in full.

Documentation for the model and tokenizer.

The transformers library is pre-installed on many systems, but in case you need to install it, you can run the following cell.

In [1]:
# Uncomment the following line to install the transformers library
#!pip install -q transformers
In [2]:
import torch
from torch import tensor

Download and load the model¶

This cell downloads the model and tokenizer, and loads them into memory.

In [3]:
from transformers import AutoTokenizer, AutoModelForCausalLM
# We'll use this smaller version of GPT-2
model_name = "distilgpt2"
tokenizer = AutoTokenizer.from_pretrained(model_name, add_prefix_space=True)
# An alternative to add_prefix_space is to pass is_split_into_words=True when calling the tokenizer.
# Use the EOS token as the PAD token to avoid warnings.
model = AutoModelForCausalLM.from_pretrained(model_name, pad_token_id=tokenizer.eos_token_id)
In [4]:
token_to_id_dict = tokenizer.get_vocab()
print(f"The tokenizer has {len(token_to_id_dict)} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")
The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.
In [5]:
# warning: this assumes that there are no gaps in the token ids, which happens to be true for this tokenizer.
id_to_token = [token for token, id in sorted(token_to_id_dict.items(), key=lambda x: x[1])]
print(f"The first 10 tokens are: {id_to_token[:10]}")
print(f"The last 10 tokens are: {id_to_token[-10:]}")
The first 10 tokens are: ['!', '"', '#', '$', '%', '&', "'", '(', ')', '*']
The last 10 tokens are: ['Ġ(/', 'â̦."', 'Compar', 'Ġamplification', 'ominated', 'Ġregress', 'ĠCollider', 'Ġinformants', 'Ġgazed', '<|endoftext|>']
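To see how the two directions fit together, here is a small round trip, assuming 'Ġvisited' is one of the vocabulary strings (it is for this tokenizer, as the tasks below will show):

# look up the id of a vocabulary string, then map the id back with the list we just built
example_id = token_to_id_dict["Ġvisited"]
print(example_id, id_to_token[example_id])  # prints the id, then 'Ġvisited' again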

Task¶

Consider the following phrase:

In [6]:
phrase = "I visited Muskegon"
# Another one to try later. This was a famous early example of the GPT-2 model:
# phrase = "In a shocking finding, scientists discovered a herd of unicorns living in"

Getting familiar with tokens¶

1: Use tokenizer.tokenize to convert the phrase into a list of tokens. (What do you think the Ġ means?)

In [7]:
tokens = tokenizer.tokenize(phrase)
tokens
Out[7]:
['ĠI', 'Ġvisited', 'ĠMus', 'ke', 'gon']

2: Use tokenizer.convert_tokens_to_string to convert the tokens back into a string.

In [8]:
# your code here
Out[8]:
' I visited Muskegon'

3: Use tokenizer.encode to convert the original phrase into token ids. (Note: this is equivalent to tokenize followed by convert_tokens_to_ids.) Call the result input_ids.

In [9]:
input_ids = ...
input_ids
Out[9]:
[314, 8672, 2629, 365, 14520]
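The note above claims that encode is equivalent to tokenize followed by convert_tokens_to_ids. If you want to verify that, here is a quick sanity check (a sketch, assuming tokenizer and phrase from the cells above):

# the two-step route should produce exactly the same ids as encode
two_step = tokenizer.convert_tokens_to_ids(tokenizer.tokenize(phrase))
print(two_step == tokenizer.encode(phrase))  # should print True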

4: Turn input_ids back into a readable string. Try this two ways: (1) using convert_ids_to_tokens and (2) using tokenizer.decode.

In [10]:
# using convert_ids_to_tokens
# your code here
Out[10]:
' I visited Muskegon'
In [11]:
# using tokenizer.decode
# your code here
Out[11]:
' I visited Muskegon'

Applying what you learned¶

5: Use model.generate(tensor([input_ids])) to generate a completion of this phrase. (Note that we needed to add []s to give a "batch" dimension to the input.) Call the result output_ids.

In [12]:
# your code here
Out[12]:
tensor([[  314,  8672,  2629,   365, 14520,    11,   290,   314,   373,  6655,
           284,  1064,   326,   262,  1748,   550,   407,   587,  1498,   284,
          2148,   257,  1774,  1171,  9358,  1080,    13,   198,   198,   198,
           198,   464,  1748,   468,   407,   587,  1498,   284,  2148,   257,
          1774,  1171,  9358,  1080,    13,   198,   464,  1748,   468,   407,
           587,  1498,   284,  2148,   257]])
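About the batch dimension mentioned in the task: a quick shape check (assuming input_ids from task 3) shows what the extra []s do:

# generate expects input of shape (batch_size, sequence_length)
print(tensor(input_ids).shape)    # torch.Size([5])    -- just a sequence
print(tensor([input_ids]).shape)  # torch.Size([1, 5]) -- a batch containing one sequence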

6: Convert your output_ids into a readable form. (Note: it has an extra "batch" dimension, so you'll need to use output_ids[0].)

In [13]:
# your code here
Out[13]:
' I visited Muskegon, and I was surprised to find that the city had not been able to provide a proper public transportation system.\n\n\n\nThe city has not been able to provide a proper public transportation system.\nThe city has not been able to provide a'

Note: generate uses greedy decoding by default, but it's highly customizable. We'll play more with it in later exercises. For now, if you want more interesting results, try:

  • Turn on do_sample=True. Run it a few times to see what it gives.
  • Set top_k=5, or 50. (A sketch of sampled decoding follows this list.)
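For example, here is a sketch of sampled decoding, assuming model, tokenizer, and input_ids from the cells above (the max_length value is just an illustrative choice):

# sampling is non-deterministic, so each run can produce a different continuation
sampled = model.generate(
    tensor([input_ids]),
    do_sample=True,  # sample from the predicted distribution instead of taking the argmax
    top_k=50,        # consider only the 50 most likely tokens at each step
    max_length=40,   # cap the total length of the generated sequence
)
print(tokenizer.decode(sampled[0]))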

7: What is the largest possible token id for this tokenizer? What token does it correspond to?

In [14]:
# your code here

Analysis¶

Q1: Write a brief explanation of what a tokenizer does. Note that we worked with two parts of a tokenizer in this exercise (one that deals only with strings, and another that deals with numbers); make sure your explanation addresses both parts.

your response here

Q2: What do you think the Ġ means? (Hint: it replaces a single well-known character.)

your response here

Q3: Suppose you add some personal flair to your writing by doubling some letters. Explain what the tokenizer we have loaded up in this notebook will do with your embellished writing.

your response here