CS 376 Unit 1: Generative Modeling Introduction

Welcome to CS 376

Objectives

  • Understand how modern generative models work (for chatbots, image generation, etc.)
  • Learn to use them skillfully and wisely, as users and developers

Key Questions

Neural Computation:

  • How can we represent text, images, and other data as sequences?
  • How can we process and generate sequences using neural nets?
  • How can models capture and use nuanced long-range relationships?

ML Systems:

  • How do we evaluate language models?
  • Can I run an LLM on my laptop? Can I train one?
  • How do I get good-quality results from an LLM?
  • How can I use an LLM to make a (semi-)autonomous agent?

Key Questions (continued)

Learning Machines

  • How can we learn without labeled data? (self-supervised learning)
  • How do foundation models learn generalizable patterns from massive datasets?
  • How can generative agents learn to improve their behavior from feedback?
  • Some current models can learn at test time (e.g., in-context learning); how does this work?

Context and Implications

  • What are the limits of AI systems? Is superhuman AI imminent?
  • What might happen socially when AI systems are deployed broadly? (effects on work, education, creativity, …)
  • How might we design AI systems to align with human values? to honor each other and our neighbors? What are the risks if we don’t?
  • How do privacy and copyright relate to AI? Is generative AI all theft?
  • What is creativity? Agency? Truth?

Ways the logistics will be different from CS 375

  • We’ll have a final project
  • Meeting objectives will be more incremental and structured, with deadlines for progress

Projects

  • Project showcase instead of final exam
  • Can be in teams (if each member has a clear role and contribution)
  • Should demonstrate:
    • understanding of how something in ML works
    • implementation and experimentation skills
    • communication skills

Some ideas are up on the course website.

This Week’s Readings

On Perusall (graded by participation: watch/read it all, write a few good comments)

  • A nice intro video from 3blue1brown (you may have watched this already)
  • An intro to Transformers for NLP
  • A Communications of the ACM article with some historical context

Generative Modeling

One Approach For Everything?

  • Transformers (Vaswani+ 2017) have transformed language processing … and most of the rest of machine learning as well
  • Radical convergence: a single model architecture for all kinds of data and tasks. Just make everything look like a sequence:
    • Language: sequence of words
    • Vision: sequence of image patches
    • Speech: sequence of audio segments
    • Behavior: sequence of actions and consequences
    • …!

Language Modeling

Tell me a joke.

Q: Why don't scientists trust atoms?
A: They ___

What is the first word that goes in the blank?

Another example

It neighs and rhymes with “course.” What word is it?

Time Series

Sketch a line plot of what you think the outdoor air temperature will be over the next 3 days. (x axis is timestamp, y axis is temperature).

Then, on the same axes, sketch several alternative temperature plots for the same 3 days.

  • The set of alternative temperature plots is a distribution.
  • It’s conditional on the current temperature (the starting point).

How would you quantify how accurate your distribution is?

Language Modeling

Definition: a language model is a probability distribution over sequences of tokens

  • P(tell, me, a, joke) = 0.0001 (or whatever)
  • P(tell, me, a, story) = 0.0002
  • etc.
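
As a minimal sketch (with made-up sequences and probabilities, pretending only three sequences exist), here is what such a distribution looks like in code:

# Toy "language model": an explicit distribution over a handful of 4-token sequences.
# The numbers are made up; a real model spreads probability over every possible sequence.
raw_scores = {
    ("tell", "me", "a", "joke"):  0.0001,
    ("tell", "me", "a", "story"): 0.0002,
    ("tell", "me", "a", "poem"):  0.00005,
}
total = sum(raw_scores.values())
toy_lm = {seq: p / total for seq, p in raw_scores.items()}  # renormalize so probabilities sum to 1
print(sum(toy_lm.values()))  # 1.0: it really is a probability distribution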

Image Conditional Distribution

Image Given Text Conditional Distribution

“A teddy bear on a skateboard in Times Square.”

Conditional Distribution

A conditional distribution is a probability distribution of one variable or set of variables given another.

  • P(ambient temperature at time t | ambient temperature at time t-1)
  • P(word | prior words)
  • P(rest of image | part of image)
  • P(image | text prompt)
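
As a small sketch of how such a conditional distribution can be estimated, here is P(word | prior word) computed by counting bigrams in a made-up toy corpus (the corpus and the function name are just for illustration):

from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = "tell me a joke . tell me a story . tell me more".split()

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def p_word_given_prev(word, prev):
    # Estimated P(word | prev): fraction of the times prev was followed by word.
    total = sum(following[prev].values())
    return following[prev][word] / total if total else 0.0

print(p_word_given_prev("a", "me"))     # 2/3: "me" is followed by "a" twice and "more" once
print(p_word_given_prev("joke", "a"))   # 1/2: "a" is followed by "joke" once and "story" once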

Wednesday

  • Readings, Moodle participation activity
  • Ways of Setting Up Generative Modeling (Autoregressive, Latent Variable, Diffusion)
    • Autoregressive Language Models as Classifiers
    • Perplexity as cumulative surprise
    • Implications of Autoregressive Generation
  • Text <-> Numbers
  • Activity: Next-Token Predictions

Scripture

The one who has knowledge uses words with restraint,
    and whoever has understanding is even-tempered.
Even fools are thought wise if they keep silent,
    and discerning if they hold their tongues.

Proverbs 17:27-28

Logistics

  • Readings: How is Perusall going?
  • Moodle participation activity
  • Preview Discussion 1

Generative AI

  • A general term for the current era of AI
  • Examples: ChatGPT, image / video generation, protein folding prediction, etc.
  • General characteristics
    • Inputs and outputs are complex objects (text, images, etc.)
    • Often trained from very large datasets
    • … because they can learn from unlabeled data

How do we use neural computation for generative AI?

  • We might need to break the problem into steps: autoregressive, latent variable, diffusion
  • Inputs and outputs have to be numbers: we may need tokenization
  • We need to evaluate our models: perplexity, benchmarks.

Setting up Generative Modeling

  • Auto-regressive: generate one thing at a time
  • Latent Variable: sample a latent variable from a simple distribution, generate from that
  • Diffusion: start with a complete-but-noisy thing, iteratively remove noise

Examples:

Causal Language Modeling

Write the joint probability as a product of conditional probabilities:

P(tell, me, a, joke) = P(tell) * P(me | tell) * P(a | tell, me) * P(joke | tell, me, a)

A causal language model gives P(word | prior words)

  • Don’t get to look at “future” words or backtrack
  • Analogy: someone constantly trying to finish your sentence
  • Intuitively: a classifier that predicts the next word in a sentence
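
A minimal numeric sketch of that factorization (the individual conditional probabilities below are made up):

# Made-up conditional probabilities, just to show how the chain-rule product works.
p_tell = 0.01    # P(tell)
p_me   = 0.20    # P(me | tell)
p_a    = 0.15    # P(a | tell, me)
p_joke = 0.05    # P(joke | tell, me, a)

p_sequence = p_tell * p_me * p_a * p_joke   # P(tell, me, a, joke)
print(p_sequence)                           # 1.5e-05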

Causal Language Modeling as Classification

Task: given what came so far, predict the next thing

  • Next character: # possibilities (classes) = ______
  • Next word: # possibilities = _____
  • What else could we use as “next thing”?

How that classifier works

P(word | context) = softmax(wordLogits(word, context)), where the softmax is taken over all words in the vocabulary

  • wordLogits(word, context) = dot(vec(word), vec(context))
  • vec(word): look up in a (learnable) table: embedding
  • vec(context) : computed by a neural network
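
Here is a small sketch of that computation, with made-up sizes and random vectors standing in for the learned word embeddings and the network’s context vector:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Made-up sizes; real vocabularies have tens of thousands of entries.
vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(vocab_size, dim))   # vec(word): a learnable lookup table
context_vec = rng.normal(size=dim)                     # vec(context): output of a neural network

logits = word_embeddings @ context_vec   # one dot product per vocabulary word (wordLogits)
probs = softmax(logits)                  # P(word | context) for every word in the vocabulary
print(probs, probs.sum())                # the probabilities sum to 1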

Do you recognize this structure?

Another Conditional Distribution Example

  • Translation: P(translation | source)
  • But translation is generated incrementally (usually left-to-right)
  • So it’s really P(word | prior words, source)

How to Sample a Sequence

The LM only gives us a distribution over the next token. How do we generate a sequence?

  • Start with a context
  • Compute the probability distribution over the next token
  • Use that distribution to pick (sample) a token
  • Add that token to the context
  • Repeat

Sampling strategies:

  • Greedy: always pick the most likely token
  • Random: sample from the next-token distribution
    • Variants: temperature, top-k, top-p, various penalties, …
  • Beam search: keep track of the top k generated sequences
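
A sketch of that loop, assuming a hypothetical next_token_probs(context) function that stands in for the language model, with greedy and temperature-based sampling as options:

import numpy as np

def generate(next_token_probs, context, n_tokens, temperature=1.0, greedy=False):
    # next_token_probs(context) is a stand-in for the model: it returns one
    # probability per vocabulary entry for the next token.
    for _ in range(n_tokens):
        probs = np.asarray(next_token_probs(context), dtype=float)
        if greedy:
            token = int(np.argmax(probs))           # greedy: always the most likely token
        else:
            logits = np.log(probs) / temperature    # temperature reshapes the distribution
            reshaped = np.exp(logits - logits.max())
            reshaped /= reshaped.sum()
            token = int(np.random.choice(len(reshaped), p=reshaped))
        context = context + [token]                 # add the sampled token to the context and repeat
    return context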

How to Sample from a Discrete Distribution

(This is not specific to ML.)

import numpy as np

def sample(probs):
    # Draw one index i, chosen with probability probs[i].
    return np.random.choice(len(probs), p=probs)

Internally, what’s this doing?

def sample_by_hand(probs):
    # Walk up the cumulative distribution until it passes a uniform random draw in [0, 1).
    cumulative_probs = np.cumsum(probs)
    r = np.random.uniform()
    for i, cp in enumerate(cumulative_probs):
        if r < cp:
            return i
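
A quick check of the sampler, using the sample function defined above and a made-up distribution:

# Made-up 3-token distribution; index 1 should come up about 60% of the time.
probs = [0.1, 0.6, 0.3]
draws = [sample(probs) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # roughly [0.1, 0.6, 0.3]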

Example: Generation Activity

How to Evaluate?

  • Remember: a good classifier gives high probability to the correct choice.
  • A good language model gives high probability to the correct next word: P(word | context) should be high when word is the word that actually came next.
  • But we have sequences, not just single words: P(word1, word2, word3, … | context).
  • We can evaluate by looking at the probability of the whole sequence: the likelihood of the sequence.

Perplexity

  • Likelihood of a sequence = P(word1, word2, word3, … | context) = product of P(each word | context and the words before it)
  • The likelihood will be tiny, so we often use the log-likelihood
  • But log-likelihood is still hard to interpret, so we often use perplexity
  • Perplexity is a measure of how surprised a model is by a sequence of tokens
  • Lower perplexity is better
  • Related to cross-entropy: perplexity = e^cross-entropy (with cross-entropy measured in nats)

Intuition: coin flip has perplexity 2; fair 6-sided die has perplexity 6.
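
A minimal sketch of the computation, with made-up per-token probabilities:

import numpy as np

# Probabilities the model assigned to the tokens that actually occurred (made-up numbers).
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

log_likelihood = np.sum(np.log(token_probs))    # log of the product of the per-token probabilities
cross_entropy = -np.mean(np.log(token_probs))   # average surprise per token, in nats
perplexity = np.exp(cross_entropy)              # perplexity = e^cross-entropy
print(perplexity)

# Sanity checks on the intuition above:
print(np.exp(-np.mean(np.log([0.5, 0.5]))))     # fair coin: 2.0
print(np.exp(-np.mean(np.log([1/6] * 6))))      # fair 6-sided die: 6.0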

Friday

Scripture

What shall I return to the Lord
    for all his goodness to me?
I will lift up the cup of salvation
    and call on the name of the Lord.

Psalm 116:12-13

See also James 1:16-18

Tech News

Logistics

  • Discussion 1
  • Readings: reminder about Perusall
  • Tentative decision on grading scheme: similar to 375, but:
    • Each objective will have two levels: “progressing” and “met”. I’ll track “progressing” for you based on what you turn in.
    • You can “meet” objectives in the same way as 375 (independent project, meeting with instructor, discussion with chatbot, etc.), but with a limit per week.
    • “Project” will be a separate objective, with a separate rubric.

Projects

  • Projects Choice Page is up

Returning to Next-Token Generation Activity

  • Confusion about how to sample from a distribution? (see earlier slide)
    • This is what ChatGPT does when you click “Retry”: it takes another sample from the conditional distribution of response given prompt
  • Compare the loss for greedy generation vs loss for other approaches: could you ever get lower loss than greedy generation?

Text to Numbers (and back)

  • Neural nets work with numbers. How do we convert text to numbers that we can feed into our models?

  • Neural nets give us numbers as output. How do we go back from numbers into text?

Tokenization

Two parts:

  • splitting strings into tokens
    • sometimes just called tokenization
    • may or may not be reversible (e.g., if it strips special characters)
  • converting tokens into numbers
    • vocabulary: a list (gives each token a number)
    • size and contents of vocabulary don’t change

An example: https://platform.openai.com/tokenizer

Tokenization Examples

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
tokens = tokenizer.tokenize("Hello, world!")
tokens
# => ['ĠHello', ',', 'Ġworld', '!']

(The “Ġ” is an internal detail of GPT-2’s tokenizer; ignore it for now.)

token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids
# => [18435, 11, 995, 0]
tokenizer.decode(token_ids)
# => ' Hello, world!'
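
The splitting and id-conversion steps can also be combined using the standard encode/decode methods on the same tokenizer object:

ids = tokenizer.encode("Hello, world!")   # split into tokens and convert to ids in one step
tokenizer.decode(ids)                     # back to text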

Your Turn

  • Handout for today
  • Finish handout/activity from Wednesday
  • Tokenization notebook

Acknowledgments

Some figures from Prince, Understanding Deep Learning, 2023