What information is contained in each row of the MRPC dataset?
How does the tokenizer tell the model which part of the input is the first sentence vs second sentence?
Why do we need to pad the inputs?
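For reference, a minimal sketch of tokenizing a sentence pair (the bert-base-uncased checkpoint and the example sentences here are just illustrative assumptions, not necessarily what the notebook uses):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

encoded = tokenizer(
    "The cat sat on the mat.",        # first sentence
    "A cat was sitting on a mat.",    # second sentence
    padding="max_length",             # pad every example to the same length
    max_length=16,
    truncation=True,
)
print(encoded["input_ids"])       # token ids, with 0s where padding was added
print(encoded["token_type_ids"])  # 0 for first-sentence tokens, 1 for second-sentence tokens
print(encoded["attention_mask"])  # 1 for real tokens, 0 for padding
```

The token_type_ids (plus the [SEP] token inserted between the sentences) are how the model knows which tokens belong to the first sentence and which to the second.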
Section 3
What does a Trainer do?
What information do you need to pass when constructing a Trainer?
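A hedged sketch of the usual setup, roughly following the Hugging Face course; the checkpoint name and variable names are assumptions, not necessarily the ones in the notebook:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          DataCollatorWithPadding, Trainer, TrainingArguments)

raw_datasets = load_dataset("glue", "mrpc")                   # the MRPC dataset
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def tokenize_fn(example):
    return tokenizer(example["sentence1"], example["sentence2"], truncation=True)

tokenized_datasets = raw_datasets.map(tokenize_fn, batched=True)
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)  # pads each batch dynamically

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
training_args = TrainingArguments("test-trainer")             # output dir + hyperparameters

trainer = Trainer(
    model=model,                                     # the model to fine-tune
    args=training_args,                              # training configuration
    train_dataset=tokenized_datasets["train"],       # tokenized training split
    eval_dataset=tokenized_datasets["validation"],   # tokenized validation split
    data_collator=data_collator,                     # how to batch/pad examples
    tokenizer=tokenizer,
)
trainer.train()
```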
What information do you need to pass when computing a metric? What information is given in the results?
Note: F1 summarizes a model’s accuracy in a way that balances precision and recall. Technically, it is the harmonic mean of precision and recall. It’s not perfect, but it’s very commonly used.
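A small sketch of computing the MRPC metric with the evaluate library (the predictions and labels here are made-up arrays, just to show what gets passed in and what comes back):

```python
import numpy as np
import evaluate

metric = evaluate.load("glue", "mrpc")

preds = np.array([1, 0, 1, 1])    # predicted class ids (e.g., argmax over logits)
labels = np.array([1, 0, 0, 1])   # true labels from the dataset

results = metric.compute(predictions=preds, references=labels)
print(results)  # {'accuracy': 0.75, 'f1': 0.8} for this toy example
```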
Section 4
Note: look at the for loop that just breaks after the first iteration. That’s a useful Python trick for debugging iterable things (like data loaders) in notebooks.
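For example (assuming a train_dataloader built earlier in the notebook):

```python
# Grab a single batch from the DataLoader without looping over all of it.
for batch in train_dataloader:
    break  # stop after the first batch

# Now `batch` is one batch we can inspect.
print({k: v.shape for k, v in batch.items()})
```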
What does model(**batch) give us? (Note: the ** means to pass everything in batch as keyword arguments (“kwargs”) to the function, so the model gets arguments like input_ids=SOMETHING, attention_mask=SOMETHING, labels=SOMETHING.)
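A tiny illustration (model and batch are assumed to come from earlier cells):

```python
outputs = model(**batch)     # same as model(input_ids=..., attention_mask=..., labels=...)
print(outputs.loss)          # the loss is included because labels were in the batch
print(outputs.logits.shape)  # (batch_size, num_labels)
```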
Be able to explain what each line of code in the code chunk right before “The evaluation loop” does.
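For reference, the training loop in that chunk looks roughly like this (a hedged sketch: model, optimizer, lr_scheduler, train_dataloader, and device are assumed to be set up in earlier cells, and details may differ from the notebook):

```python
from tqdm.auto import tqdm

num_epochs = 3
num_training_steps = num_epochs * len(train_dataloader)
progress_bar = tqdm(range(num_training_steps))         # progress bar over all training steps

model.train()                                          # put the model in training mode
for epoch in range(num_epochs):
    for batch in train_dataloader:
        batch = {k: v.to(device) for k, v in batch.items()}  # move tensors to the GPU
        outputs = model(**batch)                       # forward pass (loss included)
        loss = outputs.loss
        loss.backward()                                # compute gradients

        optimizer.step()                               # update the parameters
        lr_scheduler.step()                            # update the learning rate
        optimizer.zero_grad()                          # reset gradients for the next step
        progress_bar.update(1)
```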
What does a token’s hidden state represent?
It’s an embedding of the part of the input needed to continue the current calculation (e.g., predict the next token).
Why is just the start token’s hidden state typically used for classification?
Any other token’s hidden state would otherwise mix information about that specific token with information about the entire sequence.
Note: It can attend to the entire sequence (in a bidirectional model like BERT).
Its hidden state representation doesn’t need to be used for anything else (unlike other tokens, which need to be able to decode to a token.)
There may be better ways to do this, but this is a common approach.
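A hedged sketch of that common approach: run a plain encoder, take the first ([CLS]) token’s hidden state, and feed it to a small classification head. (AutoModelForSequenceClassification does something equivalent internally; the checkpoint name and sentences are assumptions.)

```python
import torch
from transformers import AutoModel, AutoTokenizer

checkpoint = "bert-base-uncased"                       # assumed checkpoint
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
encoder = AutoModel.from_pretrained(checkpoint)        # encoder without a task head
classifier = torch.nn.Linear(encoder.config.hidden_size, 2)  # 2-class head

inputs = tokenizer("The cat sat on the mat.", "A cat was sitting on a mat.",
                   return_tensors="pt")
with torch.no_grad():
    hidden = encoder(**inputs).last_hidden_state       # (batch, seq_len, hidden_size)

cls_hidden = hidden[:, 0]                              # hidden state of the start token
logits = classifier(cls_hidden)                        # (batch, 2)
```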
Why do we need padding?
Everything needs to be a tensor.
Tensors generally need to be rectangular, not ragged.
But when we put multiple sentences into a batch, they might have different lengths.
So we pad them to the same length.
(We could use ragged tensors instead, but hardware support for them isn’t as good.)
This is mainly a concern when training. At inference time we often only have one sequence at a time (unless we’re doing beam search, which we’ll talk about later).
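A small illustration of the problem and the fix (the checkpoint and sentences are just examples): two sequences tokenize to different lengths, so the data collator pads them to the longest in the batch to get a rectangular tensor.

```python
from transformers import AutoTokenizer, DataCollatorWithPadding

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
collator = DataCollatorWithPadding(tokenizer=tokenizer)

features = [
    tokenizer("Short sentence."),
    tokenizer("This one is quite a bit longer than the first sentence."),
]
print([len(f["input_ids"]) for f in features])  # ragged: two different lengths

batch = collator(features)                      # pads to the longest sequence in the batch
print(batch["input_ids"].shape)                 # rectangular: (2, max_len_in_batch)
print(batch["attention_mask"])                  # 0s mark the padded positions
```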