Demo of Logits and Embeddings from a Language Model¶

In [1]:
# Install transformers if it isn't already available:
!pip install transformers
import torch
from torch import tensor
from transformers import AutoTokenizer, AutoModelForCausalLM
import pandas as pd
# Avoid a warning message
import os; os.environ["TOKENIZERS_PARALLELISM"] = "false"
In [2]:
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True) # smaller version of GPT-2
tokenizer.pad_token = tokenizer.eos_token
tokenizer.padding_side = 'left'
model = AutoModelForCausalLM.from_pretrained("distilgpt2")
In [3]:
print(f"The tokenizer has {len(tokenizer.get_vocab())} strings in its vocabulary.")
print(f"The model has {model.num_parameters():,d} parameters.")
The tokenizer has 50257 strings in its vocabulary.
The model has 81,912,576 parameters.

Padding¶

The model can be trained and evaluated on several independent sequences at a time (a batch). It wasn't set up for padding at training time, which is why we set a few flags above; with those in place, this works:

In [4]:
phrase = "This weekend I plan to"
In [5]:
batch = tokenizer(["Hi", phrase], padding=True, return_tensors='pt')
batch
Out[5]:
{'input_ids': tensor([[50256, 50256, 50256, 50256, 15902],
        [  770,  5041,   314,  1410,   284]]), 'attention_mask': tensor([[0, 0, 0, 0, 1],
        [1, 1, 1, 1, 1]])}
In [6]:
batch['input_ids'].shape
Out[6]:
torch.Size([2, 5])

Notice that input_ids is 2 (the number of sequences in the batch) by 5 (the number of tokens in the longest sequence).

The attention_mask tells the model to ignore the padding tokens in all of its calculations. We won't need it in these demos, but it is generally passed in alongside input_ids.
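
For illustration, here's roughly how the mask gets passed along with the ids (the variable name masked_output is just for this sketch; we don't use the result below):

with torch.no_grad():
    masked_output = model(input_ids=batch['input_ids'],
                          attention_mask=batch['attention_mask'])
masked_output.logits.shape  # one logit vector per position: (2, 5, 50257)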

Going forward we'll use this simple example:

In [7]:
input_ids = tokenizer(phrase, return_tensors='pt')['input_ids']; input_ids
Out[7]:
tensor([[ 770, 5041,  314, 1410,  284]])

Embeddings¶

The model includes two modules that are very important: one at the very beginning, one at the very end.

In [8]:
token_embedding_module = model.transformer.wte
token_embedding_module
Out[8]:
Embedding(50257, 768)
In [9]:
lm_head_module = model.lm_head
lm_head_module
Out[9]:
Linear(in_features=768, out_features=50257, bias=False)

Notice that the dimensionality is exactly symmetrical: token_embedding_module maps each token id to one of the 50,257 possible token embeddings (each 768-dimensional); lm_head_module maps a 768-dimensional embedding to a logit for each of the 50,257 vocabulary entries.

It turns out that for this model, the token embeddings are identical on the input and output side. This is called "tied weights" and is now quite common, since it saves parameters. It is easy to check and to implement in PyTorch because a Linear layer already stores its weight matrix as (out_features, in_features), which matches the shape of the Embedding's weight.

In [10]:
(token_embedding_module.weight.data == lm_head_module.weight.data).all()
Out[10]:
tensor(True)
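
For reference, here is a minimal sketch of how such tying is typically wired up in PyTorch (illustrative names, not the library's own code):

import torch.nn as nn
vocab_size, d_model = 50257, 768
wte = nn.Embedding(vocab_size, d_model)            # weight shape: (50257, 768)
head = nn.Linear(d_model, vocab_size, bias=False)  # weight shape: (50257, 768) as well
head.weight = wte.weight                           # both modules now share a single Parameter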

Example of mapping¶

The last token id is:

In [11]:
print(input_ids[0, -1],
    "which corresponds to",
    repr(tokenizer.decode(input_ids[0, -1])))
tensor(284) which corresponds to ' to'

It has vector:

In [12]:
with torch.no_grad():
    vec = token_embedding_module(input_ids[0, -1])
vec.shape
Out[12]:
torch.Size([768])

(The specific numbers in there are illegible, so we hide them.)

Passing a vector through a linear layer (with no bias) is equivalent to taking its dot product with each row of the layer's weight matrix, so below we are effectively computing the dot product of vec with every token embedding (see the sanity check after the next cell).

In [13]:
with torch.no_grad():
    logits = lm_head_module(vec)
logits.shape
Out[13]:
torch.Size([50257])
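
As a quick sanity check of that claim (a sketch, nothing model-specific), the same logits come out of a plain matrix-vector product with the layer's weight:

with torch.no_grad():
    manual_logits = lm_head_module.weight @ vec  # (50257, 768) @ (768,) -> (50257,)
torch.allclose(manual_logits, logits)            # expected: True
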
In [14]:
[tokenizer.decode(x) for x in logits.topk(k=10).indices]
Out[14]:
[' to', 'to', ' To', 'To', ' for', ' in', ' with', ' on', ' TO', ' and']

Astute observers will notice that vocabulary slots are wasted on those minor variants of the same token. Current research has improved on this slightly by letting such related tokens share information, but it doesn't make a big difference.

If we do this for all the input tokens at the same time, we get the most similar tokens for each input token. That will almost always be the token itself, but note that the token embeddings are not explicitly normalized, so the dot product with a different token's embedding can occasionally be largest simply because that embedding has a larger magnitude.
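
To see that the magnitudes really do vary, we can look at the norms of the embedding rows (a quick check; the exact values aren't important):

with torch.no_grad():
    norms = token_embedding_module.weight.norm(dim=1)
norms.min(), norms.median(), norms.max()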

In [15]:
import pandas as pd
logits = lm_head_module(token_embedding_module(input_ids))
pd.DataFrame([
    [tokenizer.decode(x) for x in y]
    for y in logits.topk(k=10).indices[0]
])
Out[15]:
0 1 2 3 4 5 6 7 8 9
0 This This These this THIS It this These The That
1 weekend Weekend weekends week afternoon evening week Sunday Friday Saturday
2 I I we my We they me My My you
3 plan plans plan Plan Plans Plan PLAN intend planning planned
4 to to To To for in with on TO and

What the model does¶

When the model processes its input, it first looks up the embedding for each input token to produce its initial "hidden states". It then applies each layer of the model in turn (consisting, in this case, of a self-attention "mixing" layer followed by a position-wise feed-forward "mapping" layer), producing incrementally more refined hidden states that approach the context vector for the next token.
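
The loop below sketches that process by hand, using the submodules this model exposes (wte and wpe for token and position embeddings, the block list h, and the final layer norm ln_f). It ignores dropout and caching, so treat it as an illustration rather than the library's exact forward pass:

with torch.no_grad():
    positions = torch.arange(input_ids.shape[1]).unsqueeze(0)
    hidden = model.transformer.wte(input_ids) + model.transformer.wpe(positions)
    for block in model.transformer.h:
        hidden = block(hidden)[0]              # each block returns a tuple; element 0 is the hidden states
    hidden = model.transformer.ln_f(hidden)    # final layer norm
    sketch_logits = model.lm_head(hidden)
    full_logits = model(input_ids).logits
torch.allclose(sketch_logits, full_logits, atol=1e-4)  # should be True up to numerical noise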

In [16]:
with torch.no_grad():
    model_output = model(input_ids, output_hidden_states=True)
hidden_states = model_output.hidden_states
In [17]:
len(hidden_states) # this is model.config.n_layer + 1, to include the input embeddings.
Out[17]:
7
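
Each of those hidden states holds one 768-dimensional vector per input token; a quick shape check:

{tuple(h.shape) for h in hidden_states}  # every entry: (batch, sequence, hidden size) = (1, 5, 768)
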
In [18]:
logits = lm_head_module(hidden_states[0])
pd.DataFrame([
    [tokenizer.decode(x) for x in y]
    for y in logits.topk(k=10).indices[0]
]).T
Out[18]:
0 1 2 3 4
0 This weekend I plan to
1 This weekends I plans to
2 These Weekend we plan To
3 <|endoftext|> week my Plan for
4 theless afternoon me Plans To
5 It evening We Plan in
6 THIS week you PLAN on
7 There fortnight myself intend TO
8 this holidays My planning and
9 You Saturdays You proposal of
In [19]:
logits = lm_head_module(hidden_states[-1])
pd.DataFrame([
    [tokenizer.decode(x) for x in y]
    for y in logits.topk(k=10).indices[0]
]).T
Out[19]:
0 1 2 3 4
0 The , was to go
1 A in had on take
2 . was got a spend
3 \n 's went for make
4 The at decided my do
5 , is took not be
6 This I received an attend
7 I � started the visit
8 the we spent and run
9 It the met this have

Note: the logits for the position after the first token look off. I suspect an issue with the "distilling" part of this model's training. All of the other token distributions look reasonable.