Translation as Language Modeling

Goals:

  • Practice getting data into and out of a language model.
    • embeddings (input and output)
    • logits for next words
    • cross-entropy loss
  • Explore different methods of decoding for sequence generation
  • Explain how data flows between the encoder and decoder in a sequence-to-sequence model
  • Interpret attention weights.

Setup

Install libraries.

In [1]:
#%pip install -q datasets transformers[sentencepiece]

Import PyTorch and the HuggingFace Transformers library.

In [2]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import transformers
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

Load a Marian Machine Translation model.

Specifically, we're using one that was trained on the OPUS corpus (opus-mt) to translate text from any Romance language (ROMANCE) to English (en).

In [3]:
from transformers import MarianMTModel, MarianTokenizer
model_name = 'Helsinki-NLP/opus-mt-ROMANCE-en'
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name).to(device)
print(f"The model has {model.num_parameters():,d} parameters.")

Finally, these wrappers will make the code below easier to read. (You can safely ignore how they work.)

In [ ]:
from functools import partial
from transformers.models.marian.modeling_marian import shift_tokens_right
prepend_start_token = partial(
    shift_tokens_right,
    pad_token_id = model.config.pad_token_id, decoder_start_token_id = model.config.decoder_start_token_id)
encoder = model.get_encoder()
decoder = model.get_decoder()
encoder.forward = partial(encoder.forward, output_attentions=True, output_hidden_states=True)
decoder.forward = partial(decoder.forward, output_attentions=True, output_hidden_states=True)

Warm-up

Let's practice with the tokenizer. This should be mostly review, but we'll do it the way the HuggingFace docs do.

In [4]:
spanish_text = "Yo les doy vida eterna."
spanish_batch = tokenizer(spanish_text, return_tensors='pt', padding=True).to(device)
spanish_batch
Out[4]:
{'input_ids': tensor([[ 2554,    29,    73,   131,   860, 21658,     3,     0]],
       device='cuda:0'), 'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1]], device='cuda:0')}

Since we're only translating one sentence, we can ignore attention_mask (which just helps ignore padding tokens) and the extra initial dimension of the input_ids.
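
As a quick aside, here's what attention_mask is for. If we tokenize two sentences of different lengths in one batch, the shorter one gets padded, and its attention_mask is 0 at the padding positions. (This is just a sketch; the short second sentence is a made-up example.)

In [ ]:
# A quick aside: with two inputs of different lengths, the shorter one is padded
# and its attention_mask is 0 at the padding positions.
tokenizer([spanish_text, "Hola."], return_tensors='pt', padding=True).attention_mask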

In [5]:
input_ids = spanish_batch.input_ids
input_ids.shape
Out[5]:
torch.Size([1, 8])
In [6]:
tokenizer.convert_ids_to_tokens(input_ids[0])
Out[6]:
['▁Yo', '▁les', '▁do', 'y', '▁vida', '▁eterna', '.', '</s>']

Now let's ask the model to generate a translation. Lots of magic happens here; we'll peel back the layers shortly.

In [7]:
translated = model.generate(input_ids = input_ids, num_beams=1, do_sample=False)
translated.shape
Out[7]:
torch.Size([1, 8])

Decode the result!

In [11]:
with tokenizer.as_target_tokenizer():
    english_text = tokenizer.decode(translated[0])
english_text
Out[11]:
'<pad> I give them eternal life.'

Generation Options

The generate method can use several different algorithms under the hood. Let's see how each of them behaves:

In [ ]:
def cross_entropy_for_sequences(logits, targets):
    '''
    Standard F.cross_entropy doesn't give us a separate loss for each sequence in a batch.

    This is a slow way to get a per-sequence loss. (A faster approach would be to skip the
    reduction and then combine the per-token losses for each sequence, using
    (targets >= 0).sum(axis=1) to count tokens.)
    '''
    return [
        # Compute the mean cross-entropy of each sequence separately.
        F.cross_entropy(inp, tgt) for inp, tgt in zip(logits.unbind(), targets.unbind())
    ]
In [ ]:
def generate_with_params(input_ids, **kwargs):
    # Generate translations. Tell `generate` to give us the logits (which it calls "scores").
    translations = model.generate(input_ids = input_ids, return_dict_in_generate=True, output_scores=True, **kwargs)

    # `scores` is a tuple with one (batch, vocab_size) tensor per generated step, so stack them
    # into a single (batch, length, vocab_size) tensor. The generated sequences start with the
    # decoder start token, which has no score, so drop it from the targets.
    # (This assumes num_beams=1; beam search returns its scores in a different shape.)
    logits = torch.stack(translations.scores, dim=1)
    targets = translations.sequences[:, 1:]

    # Recompute the cross-entropy (some `generate` outputs give us this, others don't, so we have to recompute).
    logprobs = cross_entropy_for_sequences(logits, targets)

    with tokenizer.as_target_tokenizer():
        return pd.DataFrame({
            'sentence': tokenizer.batch_decode(translations.sequences),
            'logprobs': logprobs
        })
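
For example, here's how you might compare a few decoding strategies with this helper (a sketch; these are standard `generate` keyword arguments, and we keep num_beams=1 because beam search returns its scores in a different shape):

In [ ]:
# A sketch comparing decoding strategies (all with num_beams=1; see note above).
for kwargs in [
    dict(num_beams=1, do_sample=False),            # greedy decoding
    dict(num_beams=1, do_sample=True, top_k=0),    # pure sampling from the full distribution
    dict(num_beams=1, do_sample=True, top_k=10),   # top-k sampling
    dict(num_beams=1, do_sample=True, top_p=0.9),  # nucleus (top-p) sampling
]:
    print(kwargs)
    print(generate_with_params(input_ids, **kwargs))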
        

Predict Next Token

In [ ]:
def predict_next_token(source_input_ids, decoded_so_far=[], k=5):
    decoder_input_ids = torch.tensor([model.config.decoder_start_token_id] + decoded_so_far).unsqueeze(0).to(device)
    assert source_input_ids.shape[0] == 1  # one source sentence at a time
    with torch.no_grad(): # This tells PyTorch we don't need it to compute gradients for us.
        model_output = model(input_ids = source_input_ids, decoder_input_ids=decoder_input_ids)
    last_token_logits = model_output.logits[0, -1].cpu()
    assert len(last_token_logits.shape) == 1
    most_likely_tokens = last_token_logits.topk(k)
    with tokenizer.as_target_tokenizer():
        # Note: we softmax over just the top-k logits, so these probabilities
        # (and the cumulative probability) are renormalized over the top k.
        probs = most_likely_tokens.values.softmax(dim=0)
        return pd.DataFrame({
            'token': [tokenizer.decode(token_id) for token_id in most_likely_tokens.indices],
            'id': most_likely_tokens.indices,
            'probability': probs,
            'logprob': probs.log(),
            'cumulative probability': probs.cumsum(0)
        })


# Keep a running sum of the log-probabilities of the tokens chosen so far ("I give"):
print(-0.028718 + -0.083301, "I give")
predict_next_token(spanish_batch.input_ids, [20, 685])
-0.11201900000000001 I give
         token     id  probability   logprob  cumulative probability
0         them    224     0.760550 -0.273713                0.760550
1          you     31     0.211263 -1.554653                0.971813
2      eternal  16762     0.015364 -4.175735                0.987177
3            '     55     0.009026 -4.707601                0.996203
4  everlasting  29407     0.003797 -5.573581                1.000000
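
As a quick illustration of how this connects to generation (a sketch using the helper above): greedy decoding is just repeatedly taking the most likely next token and feeding it back in.

In [ ]:
# A sketch of greedy decoding by hand, using predict_next_token with k=1.
decoded_so_far = []
for _ in range(20):
    top = predict_next_token(spanish_batch.input_ids, decoded_so_far, k=1)
    next_id = int(top['id'].iloc[0])
    if next_id == tokenizer.eos_token_id:
        break
    decoded_so_far.append(next_id)
with tokenizer.as_target_tokenizer():
    print(tokenizer.decode(decoded_so_far))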

Scoring a candidate translation

The model gives us conditional probabilities of each word: P(word_i | src, word_1, word_2, ..., word_{i-1}). You should recognize these as the softmax of the logits at position i.

We can use those to compute the probability of a complete translation by multiplying the conditional probabilities:

P(translation | src) = P(word_1 | src) * P(word_2 | src, word_1) * ... * P(word_n | src, word_1, ..., word_{n-1})

Products of many small probabilities underflow, so we actually work with logs, which turn the product into a sum:

log P(translation | src) = sum_i log P(word_i | src, word_1, ..., word_{i-1})

Each term is, up to a sign flip, the cross-entropy loss of that token: the loss is -log P(correct token), so summing (or averaging) the per-token losses gives us the negative log-probability of the whole translation.
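
To make that connection concrete, here's a minimal sketch (assuming the model, tokenizer, and spanish_batch defined above): the loss the model reports when you pass labels is the mean per-token cross-entropy, so multiplying by the number of target tokens and negating gives log P(translation | src).

In [ ]:
# A minimal sketch: `loss` is the mean per-token cross-entropy, so
# -loss * (number of target tokens) is log P(translation | src).
with tokenizer.as_target_tokenizer():
    candidate = tokenizer("I give them eternal life.", return_tensors='pt').to(device)
with torch.no_grad():
    loss = model(input_ids=spanish_batch.input_ids, labels=candidate.input_ids).loss
print(-loss * candidate.input_ids.shape[1])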

First, let's look at the loss that the model gives. We'll compare the correct translation with an incorrect one:

In [6]:
def tokenize_target_sentence(sentence):
    with tokenizer.as_target_tokenizer():
        return tokenizer(sentence, return_tensors='pt', padding=True).to(device)
correct_batch = tokenize_target_sentence("I give them eternal life.")
wrong_target_batch = tokenize_target_sentence("I give them eternal death.")

Let's run a forward pass through the full model (encoder and decoder) with the complete candidate translation. First, the correct translation:

In [8]:
@torch.no_grad() # We don't need to compute gradients 
def get_logprob_of_translation(src_ids, tgt_ids):
    model_outputs = model(
        input_ids = src_ids,
        labels = tgt_ids
    )
    return model_outputs.loss # TODO: multiply by num tokens? Replace by manually doing cross_entropy_loss?
get_logprob_of_translation(spanish_batch.input_ids, correct_batch.input_ids)

Now (your turn) the incorrect translation:

In [ ]:
# your code here
tensor(1.3203)

Dig In!

Ok now how did it do that?

You may find it helpful to have the documentation for the MarianMT model in HuggingFace Transformers open. But you can do all of this without referring to it.

The guts of the model

I've ripped out all the plumbing code and things you only need in special situations to just show the guts of the model below. Study this code carefully with the help of the questions below it. Add comments to describe what each line does. Include, where applicable, the shape of the tensors involved.

In [14]:
encoder_input_ids = spanish_batch.input_ids
target_ids = correct_batch.input_ids
decoder_input_ids = prepend_start_token(target_ids)

with torch.no_grad():
    encoder_outputs = encoder(input_ids = encoder_input_ids)
    # (Aside: an alternative to the above)
    # encoder_input_embeddings = encoder.embed_tokens(encoder_input_ids) * encoder.embed_scale
    # encoder_outputs = encoder(inputs_embeds = encoder_input_embeddings)

    decoder_outputs = decoder(
        input_ids = decoder_input_ids,
        encoder_hidden_states = encoder_outputs.last_hidden_state
    )

    output_embedding = decoder_outputs.last_hidden_state
    token_embeddings = model.lm_head.weight
    logits = output_embedding @ token_embeddings.t()
    logits += model.final_logits_bias

    # ignore the batch dimension.
    logits = logits[0]

nlls_of_correct_tokens = F.cross_entropy(logits, target_ids[0], reduction='none')
nlls_of_correct_tokens.mean()
Out[14]:
tensor(0.2088)

Explain logits.shape.

In [15]:
logits.shape
Out[15]:
torch.Size([7, 65001])

your narrative answer here

In [16]:
tokenizer.convert_ids_to_tokens(logits.argmax(dim=1))
Out[16]:
['▁I', '▁give', '▁them', '▁eternal', '▁life', '.', '</s>']
In [17]:
tokenizer.convert_ids_to_tokens(target_ids[0])
Out[17]:
['▁I', '▁give', '▁them', '▁eternal', '▁life', '.', '</s>']

What tensor contains all of the information from the Spanish sentence that is used to generate the English sentence? Explain each element of the shape of that tensor.

(The leading "1" is the batch dimension; you can ignore this unless you're translating multiple sentences simultaneously.)

What is the "shape" of this model? Specifically:

  1. What is the dimensionality of the hidden vectors it uses to represent everything? (How does this relate to the dimensionality of the token embeddings?)
  2. How many internal layers does the model have?
In [18]:
encoder_outputs.last_hidden_state.shape
Out[18]:
torch.Size([1, 8, 512])
In [19]:
model.config.num_hidden_layers
Out[19]:
6

Visualize attentions

Read these as: the row token looks at the column token.

There are actually 8 attention heads in each of the 6 layers, so to keep the visualization simple we take the mean of the attention weights (which are all positive) across heads.

In [20]:
decoder_outputs.cross_attentions[0].shape
Out[20]:
torch.Size([1, 8, 7, 8])
In [21]:
layer = 1
plt.pcolormesh(decoder_outputs.cross_attentions[layer][0].mean(dim=0).cpu().numpy())
plt.title(f"Cross-Attention Weights for layer {layer} (avg over all {model.config.num_attention_heads} heads)")
plt.xticks(torch.arange(8)+.5, tokenizer.convert_ids_to_tokens(encoder_input_ids[0]))
plt.yticks(torch.arange(7)+.5, tokenizer.convert_ids_to_tokens(decoder_input_ids[0]))
plt.colorbar();
In [22]:
layer = -1
plt.pcolormesh(encoder_outputs.attentions[layer][0].mean(dim=0).cpu().numpy())
plt.title(f"Encoder Self-Attention Weights for layer {layer} (avg over all {model.config.num_attention_heads} heads)")
plt.xticks(torch.arange(8)+.5, tokenizer.convert_ids_to_tokens(encoder_input_ids[0]))
plt.yticks(torch.arange(8)+.5, tokenizer.convert_ids_to_tokens(encoder_input_ids[0]))
plt.colorbar();
In [23]:
layer = 0
plt.pcolormesh(decoder_outputs.attentions[layer][0].mean(dim=0).cpu().numpy())
plt.title(f"Decoder Self-Attention Weights for layer {layer} (avg over all {model.config.num_attention_heads} heads)")
plt.xticks(torch.arange(7)+.5, tokenizer.convert_ids_to_tokens(decoder_input_ids[0]))
plt.yticks(torch.arange(7)+.5, tokenizer.convert_ids_to_tokens(decoder_input_ids[0]))
plt.colorbar();
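
If you want to look at an individual head instead of the average, index into the head dimension (a sketch; the layer and head chosen here are arbitrary):

In [ ]:
# A sketch: visualize a single cross-attention head instead of the mean over heads.
layer, head = 1, 0
plt.pcolormesh(decoder_outputs.cross_attentions[layer][0, head].cpu().numpy())
plt.title(f"Cross-Attention Weights for layer {layer}, head {head}")
plt.xticks(torch.arange(8)+.5, tokenizer.convert_ids_to_tokens(encoder_input_ids[0]))
plt.yticks(torch.arange(7)+.5, tokenizer.convert_ids_to_tokens(decoder_input_ids[0]))
plt.colorbar();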

Similarity

Notice that the last step of the model is a dot product with all the token embeddings. Recall that a dot product is a measure of similarity. Let's look at similarity in embedding space.
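
As a side note, Marian models typically tie the output projection to the input token embedding table, so these should literally be the same weights the model uses to embed tokens. Here's a quick sanity check (a sketch; it should print True if the weights are tied):

In [ ]:
# A sketch: check whether the output projection shares its weights with the input embeddings.
print(model.lm_head.weight is model.get_input_embeddings().weight)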

In [24]:
normalized_token_embeddings = token_embeddings / token_embeddings.norm(p=2, dim=1, keepdim=True)
In [25]:
query_word = "London"
with tokenizer.as_target_tokenizer():
    query_ids = tokenizer.encode(query_word, add_special_tokens=False)
print(query_ids)
query = token_embeddings[query_ids].mean(dim=0)
similarities = query @ normalized_token_embeddings.t()
most_similar_indices = similarities.topk(50).indices
tokenizer.convert_ids_to_tokens(most_similar_indices)
[5226]
Out[25]:
['<pad>',
 '▁London',
 '▁Moscow',
 '▁Cambridge',
 '▁Kingston',
 '▁Bremen',
 '▁Windsor',
 '▁Philadelphia',
 '▁Melbourne',
 '▁Baltimore',
 '▁Bristol',
 '▁Cleveland',
 '▁Houston',
 '▁Belfast',
 '▁Denver',
 '▁Baghdad',
 '▁Liverpool',
 '▁Oregon',
 '▁England',
 '▁Edinburgh',
 '▁Tripoli',
 '▁Missouri',
 '▁Flanders',
 '▁Mumbai',
 '▁Churchill',
 '▁Istanbul',
 '▁Bermuda',
 '▁Barcelona',
 '▁Kentucky',
 '▁Detroit',
 '▁Honda',
 '▁Lorraine',
 '▁Tibet',
 '▁Brussels',
 '▁Lusaka',
 '▁Honduran',
 '▁Madison',
 '▁Bordeaux',
 '▁Mormon',
 '▁Maryland',
 '▁Alabama',
 '▁Damascus',
 '▁Tibetan',
 '▁Versailles',
 '▁Iowa',
 '▁Orleans',
 '▁Burgundy',
 '▁Naples',
 '▁Murcia',
 '▁Glasgow']

Your turn: now take query vectors from the output_embedding tensor that was calculated above and find the most similar token embeddings.

Compare the results with the translation output you saw from the model earlier.

In [26]:
# your code here
Out[26]:
['▁them',
 '▁you',
 '▁eternal',
 "▁'",
 '▁to',
 '▁everlasting',
 '▁it',
 '▁the',
 "'",
 ',',
 '▁these',
 '▁unto',
 '▁[',
 '▁him',
 '▁forever',
 '▁all',
 '▁that',
 '▁their',
 '▁those',
 '▁up',
 '▁life',
 '▁they',
 '▁for',
 '▁y',
 '▁You',
 '▁ye',
 '▁out',
 '▁Oh',
 '▁I',
 '▁your',
 '▁an',
 '▁people',
 '▁-',
 '▁eternity',
 '▁"',
 '▁(',
 '▁YOU',
 '▁such',
 '▁her',
 '▁birth',
 '▁us',
 '▁perpetual',
 '▁forth',
 '▁of',
 '▁this',
 '▁a',
 '▁lasting',
 '▁Eternal',
 '▁lifelong',
 '▁Him']

The Logit Lens (optional)

This is an exploration inspired by the "logit lens" article. Intuition: the Transformer iteratively refines its guess at the output.

In [27]:
# http://stephantul.github.io/python/pytorch/2020/09/18/fast_topk/
def get_ranks(values, indices):
    # For each row of `values`, count how many entries are strictly greater than the entry
    # at the given index: that count is the rank of the target token (0 = the top guess).
    targets = values[range(len(values)), indices]
    return (values > targets[:, None]).long().sum(dim=1)
In [28]:
ranks = []
print(tokenizer.convert_ids_to_tokens(decoder_input_ids[0]))
for hidden in decoder_outputs.hidden_states[1:]:
    # Pretend this intermediate layer were the last one: project its hidden states
    # through the output head and see which tokens it would predict.
    x = model.lm_head(hidden)[0]
    print(tokenizer.convert_ids_to_tokens(x.argmax(dim=1)))
    # Track the rank of the correct token at each position, for each layer.
    ranks.append(get_ranks(x, target_ids[0]))
# Stack so that the final layer is the top row.
torch.stack(ranks[::-1])
['<pad>', '▁I', '▁give', '▁them', '▁eternal', '▁life', '.']
['▁prevailed', 'oping', '▁give', '▁them', '▁MR', '▁life', 'dog']
['▁foi', 'quarter', '▁give', '▁themselves', '▁Basket', '▁life', 'com']
['▁foi', "'", '▁them', '▁a', 'ly', '▁-', '▁It']
['▁foi', "'", '▁them', '▁all', 'ly', ',', '▁[']
['▁"', "'", '▁them', '▁life', '▁life', ',', '▁[']
['▁I', '▁give', '▁them', '▁eternal', '▁life', ',', '▁-']
Out[28]:
tensor([[    0,     0,     0,     0,     0,     3,    10],
        [    4,     1,     0,    17,     0,   131,    48],
        [  123,   197,     0, 15263,     3, 12010,   613],
        [ 1519,  1556,     0, 21065,    60, 22421,  2793],
        [ 9279, 15847,     4, 12473,   109, 21495,  9482],
        [40330,  5603,  6987, 18888, 12455, 26906, 25097]])