Outline

This page will become an outline of the entire class.

Note: this page includes a substantial amount of text generated by GitHub Copilot, mostly in the form of either (1) fleshing out an outline that I’d written or (2) explicitly stating some of the implications and details that are obvious to me but, I realized from the Copilot suggestion, might not be obvious to students. I would love to be able to credit the original sources for the material that it retrieved, but this use case has not been emphasized by those developing the underlying technology; see my blog post on gratitude. Note that there is an ongoing class-action lawsuit against GitHub Copilot around this issue; my hope is that this lawsuit helps encourage the development of pro-gratitude technology. Perhaps this hope is overly optimistic.

Unit 1: Introduction

Overfitting as a problem?

The figure in the book is misleading.

(Figure: the book’s polynomial-fit plot, captioned “overfitting, supposedly”)

  • That’s a polynomial fit. It’s much more sensitive than the functions typically used in NNs.
  • Recent results: a model can completely memorize its training set while still generalizing well (its behavior away from training set points is typically much better behaved than the figure suggests!), and in fact continue to improve generalization performance after reaching 100% accuracy (e.g., Grokking paper).
  • Early stopping might hurt because of “double descent”.
  • So yes, overfitting can be a problem, but don’t angst over it, just keep training.
  • But: memorization might be a problem. “Phone: xxx-yyy-zzzz. SSN: …”

There’s also underfitting. Underfitting means that the model isn’t capturing the patterns even in the training data. It usually means that your model is too small (so the range of functions it can approximate isn’t rich enough) or your training is insufficient (the learning rate is too low, you’re not giving it enough time to train, something is broken in the training process, etc.).

Why Python?

Lots of jargon!

The more often something shows up in class, the more important it is to know.

Do we need math?

Yes! But not all at once. Some highlights:

Can we explore the validation set, or should we leave it totally hidden?

To get some assurance about how our model will work once deployed, we need some data that we intentionally don’t look at until the very end. That’s the test set. But we often need to guess at how well it’s going to work before then—e.g., because we’re adjusting a parameter that might affect how well the model generalizes. The validation set (or, sometimes, validation sets) help us estimate that.

In general, it’s a good idea to look at the validation set to understand how and why the model worked or didn’t, e.g., get an overview of what kinds of images an image classifier tends to misclassify. But it’s probably not a good idea to study it in too much depth, or it will stop being a good proxy for the test set.

What if the randomly selected held-out part was the most unhelpful?

(Note: a validation split of 20% means the training set is the remaining 80%.)
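As a concrete sketch of that 80/20 split (assuming a PyTorch dataset object; the variable names here are made up), torch.utils.data.random_split with a fixed seed gives a reproducible split. If we’re worried that a single random split happens to be unrepresentative, the standard remedy is cross-validation: average results over several different splits.

import torch
from torch.utils.data import random_split

# Hypothetical example: hold out 20% of `dataset` for validation, train on the rest.
n_val = int(0.2 * len(dataset))
n_train = len(dataset) - n_val
train_set, val_set = random_split(
    dataset, [n_train, n_val],
    generator=torch.Generator().manual_seed(42),  # fixed seed: the split is reproducible
)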

What are layers? What does each one do?

Gradually integrating information from a wider area of the image. Lower layers = really zoomed in.
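A minimal sketch (illustrative, not from the course notebooks) of that idea with stacked 3×3 convolutions: each additional layer sees a wider patch of the original image (3×3, then 5×5, then 7×7).

import torch
import torch.nn as nn

# Three stacked 3x3 convolutions. Layer 1 sees 3x3 pixels of the input,
# layer 2 effectively sees 5x5, layer 3 sees 7x7: later layers integrate
# information from a wider area of the image.
layers = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
)
fake_image = torch.randn(1, 3, 64, 64)   # one made-up 64x64 RGB image
print(layers(fake_image).shape)          # torch.Size([1, 64, 64, 64])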

Converting sound to image is a cool idea.

Recently this approach has been replaced by: turn everything into a sequence and pretend it’s language.

Pretrained models are useful.

But can introduce bias, may not actually be as helpful as thought (more later).

How might we prove the universal approximation theorem?

How to collect data?

AI libraries?

Epoch?

One full pass through the training data. Not uncommon to see tens or hundreds of epochs, depending on training set size.
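In code, an epoch is just one trip through the dataloader; multiple epochs are an outer loop around the usual training loop (a sketch, assuming model, loss_fn, optimizer, and dataloader are already defined):

n_epochs = 10   # tens or hundreds are common, depending on training set size
for epoch in range(n_epochs):
    for batch, labels in dataloader:   # one epoch = one full pass through the data
        optimizer.zero_grad()
        loss = loss_fn(model(batch), labels)
        loss.backward()
        optimizer.step()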

SGD?

Unit 2

General

Image Tensors

Random Numbers

Reflections on Homework 1

Unit 5

This section contains text generated by GitHub Copilot, an AI.

Classification = scores (from linear layers), transformed to probabilities via softmax, trained to optimize cross-entropy loss.

Classification Diagram

in features → Model → scores (logits vector) → softmax → probs (vector) → cross-entropy (compared against the correct answer) → loss (1 number)

Note: scores are more commonly called logits.

Models

import torch.nn as nn

# A simple two-layer network: linear → ReLU → linear.
# in_features, n_hidden, and out_features are integers set elsewhere.
model = nn.Sequential(
  nn.Linear(in_features, n_hidden, bias=True),
  nn.ReLU(),
  nn.Linear(n_hidden, out_features, bias=True)
)
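For a quick sanity check with made-up sizes (784 input features, 100 hidden units, 10 classes; none of these numbers come from the notes), a forward pass gives one score per class per example:

import torch
import torch.nn as nn

in_features, n_hidden, out_features = 784, 100, 10   # made-up sizes
model = nn.Sequential(
  nn.Linear(in_features, n_hidden, bias=True),
  nn.ReLU(),
  nn.Linear(n_hidden, out_features, bias=True)
)
batch = torch.randn(32, in_features)   # a made-up batch of 32 examples
scores = model(batch)
print(scores.shape)   # torch.Size([32, 10]): one score (logit) per class, per example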

Basic Implementation of a Classifier

We do the normal training loop (for batch, labels in dataloader:). Inside the training loop:

# (needs: import torch, import torch.nn.functional as F)
logits = model(batch)
probs = F.softmax(logits, dim=1)
loss = F.nll_loss(torch.log(probs), labels)   # nll_loss expects *log*-probabilities
# Or, better replacement for the above two lines:
# loss = F.cross_entropy(logits, labels)
# F.cross_entropy is a PyTorch function that combines the log-softmax and the negative log likelihood loss;
# it is faster and more numerically stable.
loss.backward()
optimizer.step()
model.zero_grad()   # reset gradients before the next batch

We also usually want to keep track of our metrics. So we might do:

prediction = logits.argmax(dim=1)   # index of the highest score = predicted class
num_correct_this_batch = (prediction == labels).float().sum().item()
num_correct += num_correct_this_batch
# after the loop: accuracy = num_correct / len(dataloader.dataset)

Metric vs Loss

Cross-entropy loss

Negative logarithm of the probability of the correct class. (For classification problems.)

Let $y_i$ be the correct class for the $i$th example in a batch. Let $p_i$ be the probability of that class, as computed by the softmax function. Then the cross-entropy loss for that example is $-\log(p_i)$. The cross-entropy loss for the whole batch is the average of the losses for each example.
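A small numerical check (the logits and labels below are made up) that F.cross_entropy really is the average of $-\log(p_i)$ over the batch:

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0],    # example 1: correct class is 0
                       [0.1, 0.2,  3.0]])   # example 2: correct class is 2
labels = torch.tensor([0, 2])

probs = F.softmax(logits, dim=1)
p_correct = probs[torch.arange(len(labels)), labels]   # probability of the correct class, per example
manual_loss = (-torch.log(p_correct)).mean()           # average of -log(p_i)

print(manual_loss, F.cross_entropy(logits, labels))    # the two values match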

Aside: cross-entropy is a general idea, not just for classification. For example, MSE for a regression problem can be viewed as a cross-entropy loss, where we view the model as predicting the mean of a distribution with constant variance. Maximizing the likelihood is the same as minimizing cross-entropy.

Softmax

A way to turn scores (unconstrained) into probabilities (nonnegative, sum to 1).

$$\text{softmax}(\text{scores})_i = \frac{\exp(\text{score}_i)}{\sum_j \exp(\text{score}_j)}$$
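A quick check of that formula against PyTorch (made-up scores):

import torch
import torch.nn.functional as F

scores = torch.tensor([2.0, 1.0, 0.1])            # unconstrained scores (logits)
manual = torch.exp(scores) / torch.exp(scores).sum()
print(manual)                     # nonnegative and sums to 1
print(F.softmax(scores, dim=0))   # same values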

Additional resource: Softmax for neural networks

Where do scores come from?

Why nonlinearities?

Unit 6

Unit 7

Recommender Systems

(no Copilot in this section)

Embeddings

How to compare embeddings

These give similarity (higher is more similar). We could also measure distance (lower is more similar):
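A sketch of the usual options (the vectors here are made up): dot product and cosine similarity score similarity, while Euclidean distance measures distance.

import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 2.0, 3.0])   # two made-up embedding vectors
b = torch.tensor([2.0, 1.5, 2.5])

dot_similarity = a @ b                      # higher = more similar (unbounded)
cosine = F.cosine_similarity(a, b, dim=0)   # higher = more similar, in [-1, 1]
euclidean = torch.dist(a, b)                # lower = more similar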

(some Copilot was used in this section.)

Matrix Factorization View

One-Hot vs Embedding

Neural Networks for Embeddings

Unit 8: NLP Intro

Language Models

Sampling from a Language Model

Text as Input

How to represent text as input to a neural network?

Instruction Fine-Tuning (IFT) and Reinforcement Learning (RLHF)

The text-davinci-003 model we were using was trained by:

  1. LM pretraining: Pretraining via language modeling on a large corpus of text
    • Almost certainly includes a lot of synthesized text, such as math problems and examples of spelling out words. Probably also includes synthetic or curated data on rhyming, word pronunciation, etc.
  2. IFT: Fine-tuning on a dataset of examples of instructions (“tell me a joke”, “summarize this article”, etc.) paired with a “demonstration” of the desired output.
  3. RLHF with PPO: Let the model generate outputs, and then have a human judge whether they’re good or bad. Use this feedback to improve the model.
    • Tweak: Instead of asking humans about every output, they (1) ask a human to rate a few outputs, (2) train a model to predict the human’s rating from the output, and (3) use that model’s rating as the reward signal for RLHF.

For details, see Illustrating Reinforcement Learning from Human Feedback (RLHF) and What Makes a Dialog Agent Useful? on the Hugging Face blog.

For sources for this claim, see https://platform.openai.com/docs/model-index-for-researchers.

Unit 9: NLP

Example of an encoder-decoder model generating a translation: Next-Token Predictions - a Hugging Face Space by kcarnold

Transformer architecture

See the study guide.

Generative Models

How can we generate text, images, sound, etc.?

Learning to Act

Other Topics on Demand

Please feel free to ask to do one of these in class.

Coding

Math

Theory

News

Resources
Glossary