NLP Tasks

Review

With your neighbors, discuss the following questions about the conditional distribution to the right.


Part A: Which of the following statements are true?

  1. The model will generate this joke 25.71% of the time.
  2. The model will generate "Why" after "Tell me a joke\n\nQ:" 25.71% of the time.
  3. The model is 25.71% confident that this is a joke.
  4. The model would assign a score of 68% to "Tell me a joke\n\nQ: Why don't scientists trust atoms?\nA: They make up everything."

Part B: Was the Temperature slider set at 0 or at 1? How can you tell?

Part C: Where does “-1.36 logprob” come from?
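
(Hint: a logprob is the natural logarithm of a probability. As a quick arithmetic check, using the 25.71% figure from Part A:)

import math
round(math.log(0.2571), 2)
-1.36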

Logistics

  • Project Milestone: Initiatives
  • Mid-Semester Survey
  • Exam: try grading ChatGPT on one of your questions by the end of the week.
  • Homework 3: peer review, possibly on Friday.
  • We'll be in lab on Friday.

Objectives This Week

  • Identify at least two different language understanding tasks that can be addressed using machine learning methods and describe the inputs and targets of each.
  • Explain at least two different approaches for converting text data into a form usable by a machine learning model.
  • Identify both word and character n-grams in a given string.
  • Implement basic data manipulation operations in language processing.

A few kinds of NLP Tasks

  • Classify whole documents
  • Extract parts of the document (named entities, question answers, parts of speech, …)
  • Generate text based on a prompt (summarize, translate, respond in a dialogue)
  • Compute an embedding of a document (for similarity scoring, etc.)
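
As a concrete sketch (not from the original slides), each of these task kinds corresponds to a Hugging Face pipeline. The snippet below is illustrative: it assumes the default models that pipeline downloads on first use, and the input strings are made up.

from transformers import pipeline

# classify whole documents
classify = pipeline("sentiment-analysis")
classify("The food was great!")   # [{'label': 'POSITIVE', 'score': ...}]

# extract parts of the document (here, named entities)
extract = pipeline("ner")
extract("Ada Lovelace was born in London.")

# generate text based on a prompt
generate = pipeline("text-generation", model="distilgpt2")
generate("Tell me a joke", max_new_tokens=20)

# compute an embedding of a document
embed = pipeline("feature-extraction")
embed("Tell me a joke")   # nested list of floats: one vector per token; pooling these gives a document vector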

Which NLP task could you use to…

  1. For a smart home device: given a command, determine which lights to turn on or off.
  2. For a text editor: generate summaries of each paragraph to help writers reflect on their work.
  3. For a travel review site: identify which reviews have a balance of positive and negative sentences.
  4. On Wikipedia: fill in missing infobox data, such as birthdates and birthplaces for people, based on the article text.
  5. For a support system: search for tickets that might be duplicates of the one currently being typed.

For reference:

  • Classify whole documents
  • Extract parts of the document
  • Generate text based on a prompt
  • Compute an embedding of a document

Jargon Note: Few-Shot and Zero-Shot Learning

shot, noun, informal: an example input-output pair

  • few-shot: the model is given a few examples of the task (typically in the prompt).
  • zero-shot: the model is given no examples of the task.
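
For illustration (this example is not from the slide), zero-shot and few-shot prompts for a translation task might look like the following; the translation pairs are in the style of the GPT-3 paper:

zero_shot_prompt = "Translate English to French:\ncheese =>"

few_shot_prompt = (
    "Translate English to French:\n"
    "sea otter => loutre de mer\n"          # shot 1
    "plush giraffe => girafe en peluche\n"  # shot 2
    "cheese =>"                             # the actual query
)

Each input-output pair in the prompt is one "shot"; the zero-shot prompt relies on the task description alone.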

Text to Numbers (and back)

  • Neural nets work with numbers. How do we convert text to numbers that we can feed into our models?

  • Neural nets give us numbers as output. How do we go back from numbers into text?

Tokenization

Two parts:

  • splitting strings into tokens
    • sometimes just called tokenization
    • may or may not be reversible (e.g., some tokenizers strip special characters)
  • converting tokens into numbers
    • vocabulary: the mapping of number to token (e.g., a list)
    • the size and contents of the vocabulary don't change once the tokenizer is built

Tokenization Examples

from transformers import AutoTokenizer

# Load the GPT-2 tokenizer; add_prefix_space=True treats the first word
# like any word that follows a space.
tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
# Part 1: split the string into tokens.
tokens = tokenizer.tokenize("Hello, world!")
tokens
['ĠHello', ',', 'Ġworld', '!']

(The “Ġ” is an internal detail to GPT-2; ignore it for now.)

# Part 2: convert tokens to numbers using the vocabulary.
token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids
[18435, 11, 995, 0]
# decode maps IDs back to a string; note the leading space from add_prefix_space.
tokenizer.decode(token_ids)
' Hello, world!'
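
Two follow-ups, assuming the same tokenizer object as above: len reports the (fixed) vocabulary size, and encode combines the splitting and numbering steps into a single call, reproducing the IDs we got above.

len(tokenizer)
50257
tokenizer.encode("Hello, world!")
[18435, 11, 995, 0]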