Unit 9: NLP Modeling

Now that we’ve seen the basic capabilities of NLP models, let’s start getting under the hood. How do they work? How do we measure how well they work?

Describe the basic steps in an NLP pipeline and what the data looks like coming into and going out of each step.
Describe practical considerations of handling batches of variable-length sequences, such as padding, attention masking, and truncation.
Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and the general concept of partial credit in classifiers).
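
To make the padding, attention masking, and truncation objective concrete, here is a minimal pure-Python sketch. It is not how the Hugging Face tokenizers are implemented. The function name `pad_and_truncate`, the padding id of 0, the choice to pad on the right, and the token ids are all made up for illustration. Real tokenizers also take care to keep special tokens when truncating, which this sketch ignores.

```python
def pad_and_truncate(sequences, max_length, pad_id=0):
    """Return (input_ids, attention_mask) as equal-length lists of lists.

    Hypothetical helper for illustration only.
    """
    # Truncation: drop any tokens beyond max_length.
    truncated = [seq[:max_length] for seq in sequences]

    # Padding: extend every sequence to the longest one in this batch,
    # so the batch can be stored as one rectangular array.
    batch_length = max(len(seq) for seq in truncated)

    input_ids, attention_mask = [], []
    for seq in truncated:
        n_pad = batch_length - len(seq)
        input_ids.append(seq + [pad_id] * n_pad)
        # Attention mask: 1 marks a real token, 0 marks padding to ignore.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return input_ids, attention_mask


# Three "sentences" of different lengths (token ids are invented).
batch = [
    [11, 12, 13],
    [21, 22, 23, 24, 25],
    [31, 32, 33, 34, 35, 36, 37],
]
input_ids, attention_mask = pad_and_truncate(batch, max_length=6)

for ids, mask in zip(input_ids, attention_mask):
    print(ids, mask)
```

Every row comes out with the same length, which is what lets a batch be a single tensor of shape (batch size, sequence length). The mask is how the model can tell real tokens from filler.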

Preparation

Review chapters 1 and 2 of the Hugging Face NLP Course; do the end-of-chapter quizzes if you have not.
Read chapter 3 of the course. Do the end-of-chapter quiz. Additionally, be able to answer the following questions:
- Section 2
  - In the first code chunk:
    - Was the model given the desired output for each sentence?
    - For how many iterations was the model trained?
    - Review: What does loss.backward() do?
    - Note: We didn’t see optimizer objects before; see the new Extension of Fundamentals u6n1
  - What information is contained in each row of the MRPC dataset?
  - How does the tokenizer tell the model which part of the input is the first sentence vs second sentence?
  - Why do we need to pad the inputs?
- Section 3
  - What does a Trainer do?
  - What information do you need to pass when constructing a Trainer?
  - What information do you need to pass when computing a metric? What information is given in the results?
    - Note: F1 summarizes a model’s accuracy in a way that balances precision and recall. Technically, it is the harmonic mean of precision and recall. It’s not perfect, but it’s very commonly used.
- Section 4
  - Note: look at the for - break. That’s a useful Python trick for debugging iterable things (like data loaders) in notebooks.
  - What does model(**batch) give us? (Note: the ** means to pass everything in batch as keyword arguments (“kwargs”) to the function. So the model gets parameters like input_ids=SOMETHING, attention_mask=SOMETHING, labels=SOMETHING.)
  - Be able to explain what each line of code in the code chunk right before “The evaluation loop” does.
  - You can skip the section on accelerate.
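
The two Section 4 notes above (the for - break trick and the ** unpacking) can be tried without loading a real model. The sketch below uses stand-ins: `fake_dataloader` and `fake_model` are hypothetical names invented here, not part of PyTorch or Transformers, and the batch contents are invented too.

```python
def fake_dataloader():
    """Stand-in for a data loader: yields batches as dictionaries."""
    for start in range(0, 6, 2):
        yield {
            "input_ids": [[start, start + 1], [start + 2, start + 3]],
            "attention_mask": [[1, 1], [1, 1]],
            "labels": [0, 1],
        }


def fake_model(input_ids, attention_mask, labels=None):
    """Stand-in for a model call: just reports what it was given."""
    return {"n_examples": len(input_ids), "got_labels": labels is not None}


# The for - break trick: start the loop, then stop right away.
# Afterwards, `batch` still holds the first item, ready to inspect.
for batch in fake_dataloader():
    break

print(batch.keys())

# ** unpacks the dictionary into keyword arguments, so these two calls
# pass exactly the same thing.
outputs_kwargs = fake_model(**batch)
outputs_explicit = fake_model(
    input_ids=batch["input_ids"],
    attention_mask=batch["attention_mask"],
    labels=batch["labels"],
)
print(outputs_kwargs == outputs_explicit)
```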
Read Evaluation Metrics for Language Modeling; stop at “Reasoning about entropy as a metric”

Supplemental Material

Class Meetings

Monday

Code Together: Inside an NLP pipeline
- What’s the shape of everything? What are batches?
- perplexity = mean_cross_entropy.exp() and what that means
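
As a rough illustration of that formula, here is a pure-Python sketch using `math` instead of tensors. The names `mean_cross_entropy` and `perplexity` as functions, and the probability values, are made up for this example. It assumes natural logarithms, matching `.exp()`, and that we already know the probability the model assigned to each correct next token.

```python
import math


def mean_cross_entropy(probs_of_correct_token):
    """Average negative log-probability the model gave to the right answers."""
    log_likelihood = sum(math.log(p) for p in probs_of_correct_token)
    return -log_likelihood / len(probs_of_correct_token)


def perplexity(probs_of_correct_token):
    """Perplexity is exp(mean cross-entropy)."""
    return math.exp(mean_cross_entropy(probs_of_correct_token))


# A model that spreads its bets evenly over 4 options at every step.
uniform_guesser = [0.25, 0.25, 0.25, 0.25]

# A model that usually puts most of its probability on the right token.
confident_model = [0.9, 0.8, 0.95, 0.7]

print(perplexity(uniform_guesser))
print(perplexity(confident_model))
```

The uniform guesser comes out at a perplexity of 4 (up to floating-point rounding): it is as uncertain as someone choosing among 4 equally likely options. The confident model scores lower. This is the "partial credit" idea: the model is rewarded for how much probability it put on the right answer, not just for whether its top guess was correct.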

Wednesday

Discussion summary
More coding together
- Review what we did last time
  - Review the tensors: what they mean, what their shapes are
  - Label each part.
- How to read the implementation: find the data flow
- Hidden states (output_hidden_states) and word embeddings

Friday

Notes:
- Feedback survey posted
- Study Quizzes posted
- Outcomes grading
Review loss, look at how it’s implemented, clarify perplexity
Review embeddings and hidden states

Discussion 9: AI in Healthcare (due Thu Mar 17)
Homework 9 (due Thu Mar 24)

Due this Week

Discussion 9: AI in Healthcare (Thu)
Homework 8 (Thu)
