CS 376 Unit 1: Generative Modeling Introduction

Welcome to CS 376

Objectives

  • Understand how modern generative models work (for chatbots, image generation, etc.)
  • Learn to use them skillfully and wisely, as users and developers

Key Questions

Neural Computation:

  • How can we represent text, images, and other data as sequences?
  • How can we process and generate sequences using neural nets?
  • How can models capture and use nuanced long-range relationships?

ML Systems:

  • How do we evaluate language models?
  • Can I run an LLM on my laptop? Can I train one?
  • How do I get good-quality results from an LLM?
  • How can I use an LLM to make a (semi-)autonomous agent?

Key Questions (continued)

Learning Machines

  • How can we learn without labeled data? (self-supervised learning)
  • How do foundation models learn generalizable patterns from massive datasets?
  • How can generative agents learn to improve their behavior from feedback?
  • Some current models can learn at test time (e.g., in-context learning); how does this work?

Context and Implications

  • What are the limits of AI systems? Is superhuman AI imminent?
  • What might happen socially when AI systems are deployed broadly? (effects on work, education, creativity, …)
  • How might we design AI systems to align with human values? to honor each other and our neighbors? What are the risks if we don’t?
  • How do privacy and copyright relate to AI? Is generative AI all theft?
  • What is creativity? Agency? Truth?

Ways the logistics will be different from CS 375

  • We’ll have a final project
  • Meeting objectives will be more incremental and structured, with deadlines for progress

Projects

  • Project showcase instead of final exam
  • Can be in teams (if each member has a clear role and contribution)
  • Should demonstrate:
    • understanding of how something in ML works
    • implementation and experimentation skills
    • communication skills

Some ideas are up on the course website.

This Week’s Readings

On Perusall (graded by participation: watch/read it all, write a few good comments)

  • A nice intro video from 3blue1brown (you may have watched this already)
  • An intro to Transformers for NLP
  • A Communications of the ACM article with some historical context

Generative Modeling

One Approach For Everything?

  • Transformers (Vaswani+ 2017) have transformed language processing … and most of the rest of machine learning as well
  • Radical convergence: a single model architecture for all kinds of data and tasks. Just make everything look like a sequence:
    • Language: sequence of words
    • Vision: sequence of image patches
    • Speech: sequence of audio segments
    • Behavior: sequence of actions and consequences
    • …!

Language Modeling

Tell me a joke.

Q: Why don't scientists trust atoms?
A: They ___

What is the first word that goes in the blank?

Another example

It neighs and rhymes with “course.” What word is it?

Time Series

Sketch a line plot of what you think the outdoor air temperature will be over the next 3 days. (x axis is timestamp, y axis is temperature).

Then, on the same axes, sketch several alternative temperature plots for the same 3 days.

  • The set of alternative temperature plots is a distribution.
  • It’s conditional on the current temperature (the starting point).

How would you quantify how accurate your distribution is?

Language Modeling

Definition: a language model is a probability distribution over sequences of tokens

  • P(tell, me, a, joke) = 0.0001 (or whatever)
  • P(tell, me, a, story) = 0.0002
  • etc.
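
As a minimal sketch (with made-up sequences and probabilities, pretending only three sequences exist), here is what such a distribution looks like in code:

# Toy "language model": an explicit distribution over a handful of 4-token sequences.
# The numbers are made up; a real model spreads probability over every possible sequence.
raw_scores = {
    ("tell", "me", "a", "joke"):  0.0001,
    ("tell", "me", "a", "story"): 0.0002,
    ("tell", "me", "a", "poem"):  0.00005,
}
total = sum(raw_scores.values())
toy_lm = {seq: p / total for seq, p in raw_scores.items()}  # renormalize so probabilities sum to 1
print(sum(toy_lm.values()))  # 1.0: it really is a probability distribution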

Image Conditional Distribution

Image Given Text Conditional Distribution

“A teddy bear on a skateboard in Times Square.”

Conditional Distribution

A conditional distribution is a probability distribution of one variable or set of variables given another.

  • P(ambient temperature at time t | ambient temperature at time t-1)
  • P(word | prior words)
  • P(rest of image | part of image)
  • P(image | text prompt)
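
As a small sketch of how such a conditional distribution can be estimated, here is P(word | prior word) computed by counting bigrams in a made-up toy corpus (the corpus and the function name are just for illustration):

from collections import Counter, defaultdict

# Tiny made-up corpus, purely for illustration.
corpus = "tell me a joke . tell me a story . tell me more".split()

# Count how often each word follows each previous word.
following = defaultdict(Counter)
for prev, word in zip(corpus, corpus[1:]):
    following[prev][word] += 1

def p_word_given_prev(word, prev):
    # Estimated P(word | prev): fraction of the times prev was followed by word.
    total = sum(following[prev].values())
    return following[prev][word] / total if total else 0.0

print(p_word_given_prev("a", "me"))     # 2/3: "me" is followed by "a" twice and "more" once
print(p_word_given_prev("joke", "a"))   # 1/2: "a" is followed by "joke" once and "story" once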

Wednesday

  • Readings, Moodle participation activity
  • Ways of Setting Up Generative Modeling (Autoregressive, Latent Variable, Diffusion)
    • Autoregressive Language Models as Classifiers
    • Perplexity as cumulative surprise
    • Implications of Autoregressive Generation
  • Text <-> Numbers
  • Activity: Next-Token Predictions

Scripture

The one who has knowledge uses words with restraint,
    and whoever has understanding is even-tempered.
Even fools are thought wise if they keep silent,
    and discerning if they hold their tongues.

Proverbs 17:27-28

Logistics

  • Readings: How is Perusall going?
  • Moodle participation activity
  • Preview Discussion 1

Generative AI

  • A general term for the current era of AI
  • Examples: ChatGPT, image / video generation, protein folding prediction, etc.
  • General characteristics
    • Inputs and outputs are complex objects (text, images, etc.)
    • Often trained from very large datasets
    • … because they can learn from unlabeled data

How do we use neural computation for generative AI?

  • We might need to break the problem into steps: autoregressive, latent variable, diffusion
  • Inputs and outputs have to be numbers: we may need tokenization
  • We need to evaluate our models: perplexity, benchmarks.

Setting up Generative Modeling

  • Auto-regressive: generate one thing at a time
  • Latent Variable: sample a latent variable from a simple distribution, generate from that
  • Diffusion: start with a complete-but-noisy thing, iteratively remove noise

Examples:

Causal Language Modeling

Write the joint probability as a product of conditional probabilities:

P(tell, me, a, joke) = P(tell) * P(me | tell) * P(a | tell, me) * P(joke | tell, me, a)

A causal language model gives P(word | prior words)

  • Don’t get to look at “future” words or backtrack
  • Analogy: someone constantly trying to finish your sentence
  • Intuitively: a classifier that predicts the next word in a sentence
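
A minimal numeric sketch of that factorization (the individual conditional probabilities below are made up):

# Made-up conditional probabilities, just to show how the chain-rule product works.
p_tell = 0.01    # P(tell)
p_me   = 0.20    # P(me | tell)
p_a    = 0.15    # P(a | tell, me)
p_joke = 0.05    # P(joke | tell, me, a)

p_sequence = p_tell * p_me * p_a * p_joke   # P(tell, me, a, joke)
print(p_sequence)                           # 1.5e-05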

Causal Language Modeling as Classification

Task: given what came so far, predict the next thing

  • Next character: # possibilities (classes) = ______
  • Next word: # possibilities = _____
  • What else could we use as “next thing”?

How that classifier works

P(word | context) = softmax(wordLogits(word, context)), where the softmax is taken over all words in the vocabulary

  • wordLogits(word, context) = dot(vec(word), vec(context))
  • vec(word): look up in a (learnable) table: embedding
  • vec(context) : computed by a neural network
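
Here is a small sketch of that computation, with made-up sizes and random vectors standing in for the learned word embeddings and the network’s context vector:

import numpy as np

def softmax(x):
    e = np.exp(x - x.max())   # subtract the max for numerical stability
    return e / e.sum()

# Made-up sizes; real vocabularies have tens of thousands of entries.
vocab_size, dim = 5, 8
rng = np.random.default_rng(0)
word_embeddings = rng.normal(size=(vocab_size, dim))   # vec(word): a learnable lookup table
context_vec = rng.normal(size=dim)                     # vec(context): output of a neural network

logits = word_embeddings @ context_vec   # one dot product per vocabulary word (wordLogits)
probs = softmax(logits)                  # P(word | context) for every word in the vocabulary
print(probs, probs.sum())                # the probabilities sum to 1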

Do you recognize this structure?

Another Conditional Distribution Example

  • Translation: P(translation | source)
  • But translation is generated incrementally (usually left-to-right)
  • So it’s really P(word | prior words, source)

How to Sample a Sequence

The LM only gives us a distribution over the next token. How do we generate a sequence?

  • Start with a context
  • Compute the probability distribution over the next token
  • Use that distribution to pick (sample) a token
  • Add that token to the context
  • Repeat

Sampling strategies:

  • Greedy: always pick the most likely token
  • Random: sample from the next-token distribution
    • Variants: temperature, top-k, top-p, various penalties, …
  • Beam search: keep track of the top k generated sequences
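
A sketch of that loop, assuming a hypothetical next_token_probs(context) function that stands in for the language model, with greedy and temperature-based sampling as options:

import numpy as np

def generate(next_token_probs, context, n_tokens, temperature=1.0, greedy=False):
    # next_token_probs(context) is a stand-in for the model: it returns one
    # probability per vocabulary entry for the next token.
    for _ in range(n_tokens):
        probs = np.asarray(next_token_probs(context), dtype=float)
        if greedy:
            token = int(np.argmax(probs))           # greedy: always the most likely token
        else:
            logits = np.log(probs) / temperature    # temperature reshapes the distribution
            reshaped = np.exp(logits - logits.max())
            reshaped /= reshaped.sum()
            token = int(np.random.choice(len(reshaped), p=reshaped))
        context = context + [token]                 # add the sampled token to the context and repeat
    return context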

How to Sample from a Discrete Distribution

(This is not specific to ML.)

import numpy as np

def sample(probs):
    # Draw one index i, chosen with probability probs[i].
    return np.random.choice(len(probs), p=probs)

Internally, what’s this doing?

def sample_by_hand(probs):
    # Walk up the cumulative distribution until it passes a uniform random draw in [0, 1).
    cumulative_probs = np.cumsum(probs)
    r = np.random.uniform()
    for i, cp in enumerate(cumulative_probs):
        if r < cp:
            return i
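
A quick check of the sampler, using the sample function defined above and a made-up distribution:

# Made-up 3-token distribution; index 1 should come up about 60% of the time.
probs = [0.1, 0.6, 0.3]
draws = [sample(probs) for _ in range(10_000)]
print(np.bincount(draws) / len(draws))   # roughly [0.1, 0.6, 0.3]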

Example: Generation Activity

How to Evaluate?

  • Remember: a good classifier gives high probability to the correct choice.
  • A good language model gives high probability to the correct next word: P(word | context) should be high when word is the word that actually came next.
  • But we have sequences, not just single words: P(word1, word2, word3, … | context).
  • We can evaluate by looking at the probability of the whole sequence: the likelihood of the sequence.

Perplexity

  • Likelihood of a sequence = P(word1, word2, word3, … | context) = product of P(each word | context and the words before it)
  • The likelihood will be tiny, so we often use the log-likelihood
  • But log-likelihood is still hard to interpret, so we often use perplexity
  • Perplexity is a measure of how surprised a model is by a sequence of tokens
  • Lower perplexity is better
  • Related to cross-entropy: perplexity = e^cross-entropy (with cross-entropy measured in nats)

Intuition: coin flip has perplexity 2; fair 6-sided die has perplexity 6.
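
A minimal sketch of the computation, with made-up per-token probabilities:

import numpy as np

# Probabilities the model assigned to the tokens that actually occurred (made-up numbers).
token_probs = np.array([0.2, 0.05, 0.5, 0.1])

log_likelihood = np.sum(np.log(token_probs))    # log of the product of the per-token probabilities
cross_entropy = -np.mean(np.log(token_probs))   # average surprise per token, in nats
perplexity = np.exp(cross_entropy)              # perplexity = e^cross-entropy
print(perplexity)

# Sanity checks on the intuition above:
print(np.exp(-np.mean(np.log([0.5, 0.5]))))     # fair coin: 2.0
print(np.exp(-np.mean(np.log([1/6] * 6))))      # fair 6-sided die: 6.0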

Friday

Scripture

What shall I return to the Lord
    for all his goodness to me?
I will lift up the cup of salvation
    and call on the name of the Lord.

Psalm 116:12-13

See also James 1:16-18

Tech News

Logistics

  • Discussion 1
  • Readings: reminder about Perusall
  • Tentative decision on grading scheme: similar to 375, but:
    • Each objective will have two levels: “progressing” and “met”. I’ll track “progressing” for you based on what you turn in.
    • You can “meet” objectives in the same way as 375 (independent project, meeting with instructor, discussion with chatbot, etc.), but with a limit per week.
    • “Project” will be a separate objective, with a separate rubric.

Projects

  • Projects Choice Page is up

Returning to Next-Token Generation Activity

  • Confusion about how to sample from a distribution? (see earlier slide)
    • This is what ChatGPT does when you click “Retry”: it takes another sample from the conditional distribution of response given prompt
  • Compare the loss for greedy generation vs loss for other approaches: could you ever get lower loss than greedy generation?

Text to Numbers (and back)

  • Neural nets work with numbers. How do we convert text to numbers that we can feed into our models?

  • Neural nets give us numbers as output. How do we go back from numbers into text?

Tokenization

Two parts:

  • splitting strings into tokens
    • sometimes just called tokenization
    • may or may not be reversible (e.g., if it strips special characters)
  • converting tokens into numbers
    • vocabulary: a list (gives each token a number)
    • size and contents of vocabulary don’t change

An example: https://platform.openai.com/tokenizer

Tokenization Examples

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilgpt2", add_prefix_space=True)
tokens = tokenizer.tokenize("Hello, world!")
tokens
# => ['ĠHello', ',', 'Ġworld', '!']

(The “Ġ” is an internal detail of GPT-2’s tokenizer; ignore it for now.)

token_ids = tokenizer.convert_tokens_to_ids(tokens)
token_ids
# => [18435, 11, 995, 0]
tokenizer.decode(token_ids)
# => ' Hello, world!'
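
The splitting and id-conversion steps can also be combined using the standard encode/decode methods on the same tokenizer object:

ids = tokenizer.encode("Hello, world!")   # split into tokens and convert to ids in one step
tokenizer.decode(ids)                     # back to text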

Your Turn

  • Handout for today
  • Finish handout/activity from Wednesday
  • Tokenization notebook

Acknowledgments

Some figures from Prince, Understanding Deep Learning, 2023