Content for future weeks should be considered tentative and subject to change.
Week 1: Intro to Generative Modeling
Some of the most impactful developments in AI recently have come from modeling and generating sequences. How do we model sequences? How do we generate them? This unit will introduce some of the basic concepts and methods for sequence modeling and generation, with a focus on natural language processing (NLP).
Terms
- Generative AI
- Language model
- Tokenization
- Vocabulary
- Autoregressive model
- Conditional distribution
- Latent variable model
- Diffusion model
- Perplexity
Key Questions
- What is one implication of the fact that LMs generate text sequentially (i.e., that most language models are causal)?
- What is a conditional distribution, in the context of language modeling (or another example we looked at in class)?
- Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and the general concept of partial credit and/or surprise in classifiers)
(next year: add a question about how to use a language model as the backend of a chatbot)
Objectives
This week will address course objectives on LM-SelfSupervised, MS-LLM-Tokenization, and MS-LLM-TokenizationImpact.
- Explain what generative modeling is and its uses
- Describe the high-level idea of three basic approaches to generative models: autoregressive, latent variable, and diffusion
- Describe the inputs and outputs of an autoregressive language model
- tokens -> embeddings
- next-token conditional probability distribution
We will also discuss how a language model can be used for a chatbot. (TODO: this needs a course objective and a link to some Chat Templating documentation)
Prep and Readings
Before starting this unit, you should already know the basics of supervised learning. Specifically, you should be comfortable with training a fully-connected neural network on a classification task.
I recommend the following readings (in Perusall):
- Large Language Models explained briefly (3blue1brown)
- the Hugging Face Transformers course, chapter 1:
- Artificial Intelligence Then and Now – Communications of the ACM
If you need some additional background, I recommend Understanding Deep Learning
Extension Opportunities
Notes
Language Models
- Language models (LMs) are trained to predict the next word in a document.
- Next-word prediction = classification
- Input: document up to the current word
- Output: probability distribution over all possible words
- Training set: a huge set of documents from the Internet
- Trained to minimize “surprise” (cross-entropy loss)
- Model predicts a distribution P(word | document so far)
- Surprise = how much probability mass the model gave to the actual next word
- Low surprise = it made a really good guess
- High surprise = its guess was bad (or perhaps the model was rightly unsure)
- Mathematically:
- the model assigns a probability distribution to all possible documents.
- P(document) = P(word 1) * P(word 2 | word 1) * P(word 3 | word 1, word 2) * …
- These probabilities would be tiny, so we take the log:
- log P(document) = log P(word 1) + log P(word 2 | word 1) + log P(word 3 | word 1, word 2) + …
- The log of a product is the sum of the logs
- This is the log-likelihood of the document under the model. The negative of this (NLL for negative log-likelihood) is also called the cross-entropy loss.
- Dividing this by the number of words in the document gives the average log-likelihood per word, or average cross-entropy loss per word
- The model is a function that outputs log P(word | document so far) for each word in the vocabulary
- Typically the model outputs logits, which are then passed through a softmax to get probabilities.
- The model is trained to minimize cross-entropy loss by stochastic gradient descent on a training set of documents.
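To make these quantities concrete, here's a minimal sketch (mine, not from the readings) that computes log-likelihood, average cross-entropy, and perplexity from per-word probabilities; the probability values are made up purely for illustration.

```python
import math

# Hypothetical probabilities a model assigned to each actual next word
# in a 4-word document (made-up numbers, purely for illustration).
p_next = [0.20, 0.05, 0.60, 0.10]

log_likelihood = sum(math.log(p) for p in p_next)   # log P(document)
avg_cross_entropy = -log_likelihood / len(p_next)   # average negative log-likelihood per word
perplexity = math.exp(avg_cross_entropy)            # "how many words' worth of uncertainty"

print(f"log-likelihood: {log_likelihood:.3f}")
print(f"average cross-entropy (nats/word): {avg_cross_entropy:.3f}")
print(f"perplexity: {perplexity:.2f}")
```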
Text as Input
How to represent text as input to a neural network?
- Input: a sequence of token ids. Output: a probability distribution over all possible tokens.
- Token: a single unit. Can be a word, a character, a subword, etc.
- Classical approaches:
- Character-level language models
- Input: a sequence of characters
- Output: probability distribution over all possible characters
- Word-level language models
- Input: a sequence of words
- Output: probability distribution over all possible words
- Pros and cons
- Character-level models:
- Pros: robust to spelling variations or unknown words
- Cons: each word requires many tokens, so requires more computation. Difficult to learn long-range dependencies (many tokens away, info is spread over many tokens). Internals of the model are hard to interpret.
- Word-level models:
- Pros: each word requires only one token, so requires less computation. Relationships between words are easier to learn. Internals of the model are easier to interpret.
- Cons:
- no sharing between obviously related words (e.g., “dog” and “dogs” are completely separate tokens; can only learn their relationship by example)
- any word that doesn’t appear in the training set is completely unknown (even if its spelling is similar to a word that does appear in the training set, e.g., “dog” vs. “dogg”)
- Modern approach: sub-word tokenization (e.g., Byte-Pair Encoding, SentencePiece, etc.)
- Common words are represented by a single token
- Less common words are represented by a sequence of tokens
- e.g., “dogg” might be represented by “dog” + “##g” (where “##” is a special token that indicates a sub-word)
- Alternative to marking sub-words with “##” is to include the leading space in the first token of the sub-word sequence (e.g., “dogg” might be represented by " dog" + “g”)
- Effects of tokenizer choice: Most modern models use some sort of sub-word tokenization, but even so they differ by tokenization strategy (e.g., does each digit get its own token?) and vocabulary size (e.g., how many tokens are used to represent the vocabulary). This affects:
- how much memory is required to store all of the token embeddings
- how much computation is required to do dot products with all of them (for computing logits)
- how efficiently the model can handle morphological variations and generalize across languages
- the total number of tokens that a model needs to process and generate in the course of working with a given text
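To see sub-word tokenization (and the effect of vocabulary choices) in action, here's a hedged sketch using a Hugging Face tokenizer; the `gpt2` checkpoint is just a convenient example, not necessarily the tokenizer we'll use in class. Note the `Ġ` marker, which is GPT-2's way of encoding the leading space mentioned above.

```python
from transformers import AutoTokenizer

# "gpt2" is just an illustrative checkpoint; any sub-word tokenizer behaves similarly.
tok = AutoTokenizer.from_pretrained("gpt2")

for text in ["dog", "dogs", "dogg", "Every morning I wake up and"]:
    ids = tok.encode(text)
    pieces = tok.convert_ids_to_tokens(ids)
    print(f"{text!r} -> {len(ids)} tokens: {pieces}")

print("vocabulary size:", tok.vocab_size)
```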
Sampling from a Language Model
- How to generate text from a language model?
- Start with a prompt (e.g., “Every morning I wake up and”)
- Use the model to predict the next word
- Use the predicted distribution to choose the next word
- Keep adding words until the end-of-text token is generated
- This corresponds to the left-to-right factorization of the joint probability distribution over documents:
- P(document) = P(word 1) * P(word 2 | word 1) * P(word 3 | word 1, word 2) * …
- To sample from that, we can start by sampling P(word 1) from the model, then sample P(word 2 | word 1) from the model, and so on.
- At each step, we’re sampling from a conditional distribution
- That distribution only depends on the words that came before it
- So we can sample from it independently of the words that come after it
- Implications:
- the model never “looks ahead” to see what words come after the current word.
- We can get the model to “rationalize” a statement by including that statement as part of its prompt. The model’s “memory” is externalized in the words it’s already generated, so we can “edit” that memory by changing the document so far.
- We could have factored the joint distribution in a different way, depending on what we want to do. For example, some models are trained to “Fill In the Middle” (FIM), where the model is given context after the section to generate as well. But in practice this is actually implemented by transforming the input into a left-to-right sequence with special marker tokens for the prefix, suffix, and middle sections.
- We could also train the model to “reconstruct” any given token from the tokens around it; this is the idea of masked language modeling (MLM) used in models like BERT. It turns out that this allows the model to “cheat” a lot, so tweaks are needed to make it work for generation tasks. But it’s reasonably good at learning representations (embeddings) of whole sequences, so it’s sometimes used in vector databases.
- Temperature
- The model’s predictions are a probability distribution over all possible words
- We can control how much randomness is in the distribution by changing the temperature
- Can be used to control the “creativity” or “diversity” of the model’s output
- Higher temperature = more randomness. Extreme: infinite temperature = uniform distribution over all words
- Lower temperature = less randomness. Extreme: 0 temperature = always choose the most likely word
- Computed by dividing the logits by the temperature before passing them through the softmax
- Temperature = 1.0 means no change to the logits before sampling
- In practice, it’s a balance between predictability and interestingness
- Too high temperature = output isn’t dependable. With some probability, the model will output something highly unusual.
- Too low temperature = output is unusually dull. Human communication rarely chooses the single most likely thing (otherwise why would we bother communicating?), so always picking the most likely word yields text that is unusually flat.
- Other ways to control the randomness of the output:
- Nucleus sampling (top-p): instead of sampling from the whole distribution, sample from the smallest set of words that add up to some threshold probability (e.g., 90% of the probability mass)
- Top-k sampling: sample from the top k most likely words
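Here's a minimal sketch (NumPy, with made-up logits over a toy vocabulary) of how temperature, top-k, and top-p change the distribution we sample the next token from; real generation libraries implement the same ideas with more bookkeeping.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["the", "cat", "sat", "on", "mat"]      # toy vocabulary
logits = np.array([2.0, 1.0, 0.5, 0.2, -1.0])   # made-up model outputs

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None):
    probs = softmax(logits / temperature)        # temperature scaling
    if top_k is not None:                        # keep only the k most likely tokens
        cutoff = np.sort(probs)[-top_k]
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:                        # nucleus: smallest set reaching top_p mass
        order = np.argsort(probs)[::-1]
        keep_mask = np.zeros_like(probs, dtype=bool)
        keep_mask[order[np.cumsum(probs[order]) - probs[order] < top_p]] = True
        probs = np.where(keep_mask, probs, 0.0)
    probs = probs / probs.sum()                  # renormalize after filtering
    return vocab[rng.choice(len(vocab), p=probs)]

print([sample(logits, temperature=0.5) for _ in range(5)])  # mostly "the"
print([sample(logits, temperature=2.0) for _ in range(5)])  # much more varied
print([sample(logits, top_p=0.9) for _ in range(5)])        # drops the least likely word
```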
Monday
- Slides: 376 Unit 1: Generative Modeling Introduction
- Topics:
- Intro
- Logistics for 376 vs 375
- Projects
- Intro to Generative Modeling
- Handout: What do you already know about generative modeling?
- Activity: Exploring Language Models
- Resources:
Wednesday
- Slides: 376 Unit 1: Generative Modeling Introduction
- Scripture: Proverbs
- Readings, Moodle participation activity
- Ways of Setting Up Generative Modeling (Autoregressive, Latent Variable, Diffusion)
- Autoregressive Language Models as Classifiers
- Perplexity as cumulative surprise
- Implications of Autoregressive Generation
- Text <-> Numbers
- Handout: Review tokenization and chat docs; next-token prediction
- Activity: Generation Activity
- Next-Token Predictions Activity
Friday
Week 2: Language Modeling
This week we start to take the covers off of NLP models, just as we took the covers off of image models in CS 375. In particular, we’ll get our first taste of the Transformer model, the most important model in machine learning today.
Advising is this week, so we won’t get to a lot of new content.
Key Questions
- What is a token embedding? What is an output (or context) embedding? How do these relate to the input and output of a language model?
- How does a causal language model use embeddings of contexts (e.g., sentence prefixes)?
- How can we use a language model to generate text?
Objectives
This week we start work on these objectives:
- I can identify the shapes of data flowing through a Transformer-style language model. [NC-TransformerDataFlow]
- I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose. [NC-Embeddings]
Notes
- token and output embeddings work almost exactly like the image embeddings in https://cs.calvin.edu/courses/cs/375/cur/notebooks/u07n1-image-embeddings.html
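As a hedged sketch of that distinction (not the official lab code), here's how you could pull both kinds of embeddings, plus the logits, out of a small Hugging Face causal LM; `distilgpt2` is just a small stand-in for whatever model the lab uses.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# distilgpt2 is just a small illustrative checkpoint, not necessarily the course model.
name = "distilgpt2"
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

inputs = tok("Every morning I wake up and", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# Token (input) embeddings: one vector per token id, looked up from a learned table.
token_emb = model.get_input_embeddings()(inputs["input_ids"])
# Output (context) embeddings: the final hidden state at each position,
# summarizing the context up to and including that token.
context_emb = out.hidden_states[-1]

print("token embeddings:", token_emb.shape)      # (1, seq_len, hidden_dim)
print("context embeddings:", context_emb.shape)  # (1, seq_len, hidden_dim)
print("next-token logits:", out.logits.shape)    # (1, seq_len, vocab_size)
```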
Q&A
Why are modern LMs so fast?
Some of the things that have helped make modern language models so fast: quantization, which reduces the memory bandwidth needed; specialized compute units like Google’s TPUs, which do matrix multiplies really fast; and algorithmic improvements like FlashAttention, where researchers carefully thought through what memory access is actually required during inference and wrote an implementation that is highly optimized for the kind of hardware we have.
What do the human fine-tuners actually do?
Human fine-tuners often do the kinds of tasks that you sometimes see ChatGPT asking about: labeling which of two options is better. Some of them also write reference answers that the model should learn to imitate. The role of these labeling and feedback mechanisms will probably change as we see a shift toward learning from computationally generated feedback.
Why do commercial LLMs not actually have much trouble with misspellings?
This has actually been one of the things that has challenged my understanding the most over the past few years. I would have expected modern language models to have more trouble with misspellings and typos than they empirically seem to. I think there are two explanations: first (and probably the main one), at the scale of the Internet, most typos have happened before. Second, model providers may be deliberately introducing errors such as typos or misspellings into the pre-training process as a kind of data augmentation; I don’t have any evidence that they’re actually doing that, though.
Terms
- language modeling
- n-gram
- token embeddings (sometimes called word embeddings)
- output embeddings (sometimes called hidden states or contextual embeddings)
- token logits
- temperature
Monday
Logistics:
- Scripture: Jeremiah 17:7-8
- Reminders:
- Complete “Reflections Week 1”
- Discussion 1
- Highlight-Edits example
Tokenization:
Activity: Lab 376.2: Logits in Causal Language Models
Supplemental material: list comprehensions in Python
Wednesday
- Advising
Friday
- Project Inspirations
- An example related to our topic today: How to make a racist AI without really trying | ConceptNet blog
- An idea: use a pretrained autoregressive model as if it were a diffusion LM by simply instructing it to “fill in the blanks” in a document (and then giving the blanked document as input)
- Lab review
- For reference:
- Notebook: Probe an Image Classifier (name: u07n1-image-embeddings.ipynb; show preview, open in Colab)
- Handout: Token and Context Embeddings
- Resources: the softmax/cross-entropy interactive
Topics:
- Token and Context Embeddings
Week 3: Architectures
Now that we’ve seen the basic capabilities of NLP models, let’s start getting under the hood. How do they work? How do we measure that?
The Transformer architecture (sometimes called a self-attention network) has been the power behind many recent advances not just in NLP but also in vision, audio, and more. That’s because Transformers are currently one of the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words or images. This week we’ll see how they work!
We’ll also look at other architectures that have been popular in the past, such as convolutional networks (CNNs) and recurrent networks (RNNs), and maybe even look at how some new architectures bring in ideas from those older architectures.
Objectives
- [NC-SelfAttention] I can explain the purpose and components of a self-attention layer. (Bonus topics - multi-head attention, positional encodings)
- [NC-Architectures] I can compare and contrast the following neural architectures - CNN, RNN, and Transformer. (Bonus topics - U-Nets, LSTMs, Vision Transformers, state-space models)
- [NC-TransformerDataFlow] I can identify the shapes of data flowing through a Transformer-style language model.
Key Questions
By the end of this week you should be able to answer the following questions:
- What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
- How do embeddings for words (or tokens) represent similarity / difference?
- Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge?
- How does data flow between tokens and between layers in a self-attention network? In what sense does it use conditional logic?
- What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate with the dot product and softmax? (Wait, is this logistic classification yet again?)
Things we didn’t explicitly get to this week:
- How do self-attention networks keep track of position?
- How does the data flow in Transformer networks differ from Convolutional or Recurrent networks?
- What are encoders and decoders? Why does that matter? What impact does that have on what you can do with the model?
Terms
- attention, especially self-attention
- query, key, and value vectors
- attention weights
- multi-head attention
- feed-forward network (MLP)
- residual connection
- layer normalization (bonus topic)
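To preview what these terms look like in code, here's a minimal single-head causal self-attention sketch in PyTorch (my own illustration, not the course's reference implementation); multi-head attention repeats the same computation with several independent query/key/value projections.

```python
import math
import torch
import torch.nn.functional as F

torch.manual_seed(0)
seq_len, d_model = 5, 16
x = torch.randn(1, seq_len, d_model)        # input embeddings for 5 tokens

# Learned projections that turn each token's embedding into a query, key, and value.
W_q = torch.nn.Linear(d_model, d_model, bias=False)
W_k = torch.nn.Linear(d_model, d_model, bias=False)
W_v = torch.nn.Linear(d_model, d_model, bias=False)
q, k, v = W_q(x), W_k(x), W_v(x)            # each: (1, seq_len, d_model)

# Attention scores: dot product of each query with every key, scaled by sqrt(d).
scores = q @ k.transpose(-2, -1) / math.sqrt(d_model)   # (1, seq_len, seq_len)

# Causal mask: each token may attend only to itself and earlier tokens.
mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
scores = scores.masked_fill(mask, float("-inf"))

weights = F.softmax(scores, dim=-1)         # attention weights; each row sums to 1
output = weights @ v                        # weighted average of value vectors

print(weights[0])                           # lower-triangular attention pattern
print(output.shape)                         # (1, seq_len, d_model)
```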
Prep and Reading
This week’s reading includes a brand new result from Anthropic that looks really helpful for getting an accurate intuition about what’s going on inside an LLM. We’re looking at the high-level overview article this week; if people are interested we can dig into the technical report in a future week.
- 3blue1brown articles (you may prefer to watch the linked video at the top)
- LLM Visualization: an interactive article; take your time to walk through it over several sessions. It’s very detailed, so don’t expect to understand everything at this point. The most important parts to pay attention to are:
- What the input looks like (we’ve already studied this)
- The attention mechanism
- The MLP / Feed-Forward part (which should be familiar from CS 375)
- The Output (again, we’ve already studied this, but it has a few more details)
- Tracing the thoughts of a large language model | Anthropic
- Ethics: Understanding Deep Learning book chapter 21, stopping at section 21.2.
News (in Perusall library, not officially assigned)
Supplemental Resources
- A video course on How Transformer LLMs Work - DeepLearning.AI
- Wanna code it? Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
- HandsOnLLM/Hands-On-Large-Language-Models: Official code repo for the O’Reilly Book - “Hands-On Large Language Models”
Q&A
Is there a limit to how far back a transformer can look? And how are they improving it? For example, in a chatbot with more context, you can feel that it is getting dumber.
“how far back a transformer can look” = its “context window”. Things that limit that:
- the architecture. If position embeddings are absolute (not, say, RoPE), then we need to set a limit before we even start training.
- computation. Plain self-attention is quadratic in sequence length, so attending over a long context takes far more compute; this has seen a lot of optimization effort recently (see the quick sketch below).
- Training. Gotta actually give the models examples of documents / conversations / etc. where long-range attention is needed, otherwise it won’t learn it.
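To see why quadratic scaling matters, here's a quick back-of-the-envelope sketch (my own numbers): the size of a single attention matrix of 4-byte floats, per head per layer, for a few context lengths.

```python
# Rough size of one seq_len x seq_len attention matrix in 4-byte floats.
# Real systems add many other costs; this only shows the quadratic scaling.
for seq_len in [1_000, 10_000, 100_000]:
    gigabytes = seq_len * seq_len * 4 / 1e9
    print(f"{seq_len:>7} tokens -> {gigabytes:.3f} GB per attention matrix")
```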
How does increased communication [via self-attention] actually translate to better token generation?
Consider the case of asking an LLM to fix up a paragraph that you wrote. It needs to basically copy what you gave as input, but with some edits / changes at some places. Self-attention lets the network basically keep a running pointer to where you are in the input, grab what you said next, and repeat that or something similar in the output. A recurrent network (like LSTM), in contrast, would somehow have to encode your entire input into a single vector, and then decode that into the output, which is really challenging to learn to do reliably.
How else to improve transformers, besides more training and more heads / layers / dimensions?
There’s so many little tweaks that people do (read the tech report of any new model release). Common things people play with are how to encode position (RoPE is big now), playing with how keys/queries/values mix and match (Grouped Query Attention etc.), and the data, loss functions, etc. (e.g., reinforcement learning from various kinds of rewards).
What does the Anthropic article mean by “Claude wasn’t designed as a calculator—it was trained on text, not equipped with mathematical algorithms”?
The Transformer architecture is hugely inefficient and unreliable if all you want to do is addition or multiplication. But it had to learn to do that anyway, even with unreliable building blocks, because being able to add and multiply makes the Internet a bit less surprising.
Monday
- Slides: Neural Architectures
- Intro
- Review
- Review handout activity from Friday
- How does the Gemma model actually represent these tokens and contexts? (See logits-demo notebook)
- Let’s write the sampling algorithm together.
- Handout: Self-Attention By Hand
Wednesday
- Review
- Go over key questions from past 2 weeks
- Reminder: Quiz 1 Friday
- Exercises posted
- Transformer Explainer
- Slides: Neural Architectures
- Fixed wiring: Feed-forward (MLP)
- Current sample wired to previous sample:
- Recurrent Networks (Elman; LSTM and GRU)
- Current sample wired to surrounding samples: Convolutional Networks (CNN)
- What convolution does to an image: Image Kernels explained visually
- How to use convolutions in a neural network: CS231n Convolutional Neural Networks for Visual Recognition
- What they learn: Feature Visualization
- Wiring computed dynamically based on “self-attention”: Transformer
- Tricks
- Residual Connections
- Dropout
- Review: Self-Attention = conditional information flow
- Software: describe the wiring, then what flows through the wires.
- Hardware: compute queries, keys, and values, then compute the attention matrix, then compute the output.
- For Friday, please start working on:
- Activity: Lab 376.3: Implementing Self-Attention
- Notebook: Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
- Notebook: Translation as Language Modeling (name: u09n2-decoding.ipynb; show preview, open in Colab)
Friday
- Activity: Lab 376.3: Implementing Self-Attention
- Quiz 1: Looking for evidence of learning about:
- [MS-LLM-Tokenization] I can explain the purpose, inputs, and outputs of tokenization.
- [MS-LLM-Generation] I can extract and interpret model outputs (token logits) and use them to generate text.
- [LM-SelfSupervised] I can explain how self-supervised learning can be used to train foundation models on massive datasets without labeled data.
- [NC-Embeddings] I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose.
- [NC-SelfAttention] I can explain the purpose and components of a self-attention layer. (Bonus topics - multi-head attention, positional encodings) (basic intuition only)
Week 4: Generation and Prompting
How can a model trained to mimic become a helpful, capable, mostly-harmless(?), and even semi-autonomous agent? We’ll discuss how prompting techniques can get us partway there, but modern LLMs use extensive post-training from human and automated feedback to get the rest of the way.
Key Questions
- How can each of the following be represented in a “document” (see the chat-template sketch after this list):
- A conversation between a user and a model (assistant)
- An action to take in the world (e.g., calling an API or running code)
- How might a helpful and harmless response differ from a mimicry response?
- How can we use feedback to tune a model’s behavior?
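As a hedged sketch of how a conversation becomes a single “document,” here's the Hugging Face chat-templating API applied to a toy exchange; the checkpoint name is just one example of a chat-tuned model whose tokenizer ships a template, not necessarily the model we'll use.

```python
from transformers import AutoTokenizer

# Any chat-tuned checkpoint that ships a chat template works; this one is just an example.
tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What's the capital of France?"},
]

# The template flattens the structured conversation into one long string,
# with special role markers, ready for ordinary next-token prediction.
doc = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(doc)
```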
Terms
- dialog agents
- prompting
- post-training (e.g., instruction tuning, Reinforcement Learning from Human Feedback (RLHF))
Objectives
Core objectives:
- [MS-LLM-API] I can apply industry-standard APIs to work with pretrained language models (LLMs) and generative AI systems.
- [MS-LLM-Prompting] I can critique and refine prompts to improve the quality of responses from an LLM.
- [MS-LLM-Advanced] I can apply techniques such as Retrieval-Augmented Generation, in-context learning, tool use, and multi-modal input to solve complex tasks with an LLM.
- [MS-LLM-Train] I can describe the overall process of training a state-of-the-art dialogue LLM such as Llama or OLMo.
Review objectives:
- [MS-LLM-Tokenization] I can explain the purpose, inputs, and outputs of tokenization.
- [MS-LLM-TokenizationImpact] I can analyze how tokenization choices affect the performance of an LLM.
- [MS-LLM-Generation] I can extract and interpret model outputs (token logits) and use them to generate text.
Extension objectives:
- [MS-LLM-Compute] I can analyze the computational requirements of training and inference of generative AI systems.
Readings
All readings are posted on Perusall, copied here for reference.
- The Tülu blog post gives a great summary of the current state-of-the-art post-training process.
- Make sure that you can identify the main steps in the overall process and what the main point of each one was.
- Also pay attention to where human input steers the model.
- If you’re curious beyond this, find the optional reading of the OLMo2 article.
- The Hugging Face article on agents provides a summary of how LLMs can become agents and what some of the implications of that are.
- The Google / Gemma API docs (Function Calling with Gemma) provide some examples of how we can actually use some of these functionalities.
- What’s “prompt injection”? A new kind of vulnerability—skim a few of the blog posts about this. Pay attention to how LLM-based agents are uniquely vulnerable to it.
Monday
- Review quiz 1
- Solutions available for those who have completed it
- Grading by objectives
- Revised the MS-LLM-API objective to match what the quiz assessed. (question 4 also addressed it, forgot to mark that)
- Handout: Self-Attention Shapes
- Review lab 3
- Project encouragements
- Be the ones who can measure AI performance
Reference:
- Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
- Translation as Language Modeling (name: u09n2-decoding.ipynb; show preview, open in Colab)
Wednesday
- Review attention via the Transformer Explainer
- Motivational examples:
- Activity: Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- Prompt Engineering
- Instruction Tuning
- Retrieval-Augmented Generation
Friday
- Slides: Generation by Prompting
- Quiz 2: An opportunity to demonstrate your understanding of some of the following objectives:
- [MS-LLM-Generation] I can extract and interpret model outputs (token logits) and use them to generate text.
- [MS-LLM-API] I can apply industry-standard APIs to work with pretrained language models (LLMs) and generative AI systems.
- [LM-SelfSupervised] I can explain how self-supervised learning can be used to train foundation models on massive datasets without labeled data.
- [NC-Embeddings] I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose.
- [NC-SelfAttention] I can explain the purpose and components of a self-attention layer. (Bonus topics - multi-head attention, positional encodings)
- [NC-TransformerDataFlow] I can identify the shapes of data flowing through a Transformer-style language model.
Note: I dropped the intro to Streamlit for time reasons, but I highly recommend you check it out. It’s a great way to make your models accessible to others. The next-token demo that we used in Week 2 was a Streamlit app; click the Files tab on the Hugging Face Space to see the code.
Week 5: Review
Since this is a short week, we’ll slow down to review and reinforce (1) how Transformers work inside and (2) how we can use them to make conversational agents that can interact with the world.
Resources
If you’re feeling fuzzy about any of the concepts we’ve covered so far, I recommend going back to these resources:
- Videos / articles
- Interactive
- Transformer Explainer
- LLM Visualization: an interactive article, take your time to walk through it over several sessions.
- Softmax and Cross-Entropy
- Notebooks
- Notebook: Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
Supplemental resources:
- Tracing the thoughts of a large language model | Anthropic
- OpenAI Tokenizer
- Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
Monday
- Quiz 2 review
- Feedback / checkin activity
- Q&A
Wednesday
- Results of feedback activity:
- Biggest hope (by far): good projects
- Biggest things we want to learn: How to make a (semi-autonomous) agent that improves its behavior from feedback
- Biggest thing to review: how self-attention works
- Project Work Time!
- Deliverable: what’s your project? What’s success look like (sketch an example)? What are two next steps that you can take to make progress?
- Review (see Summary)
- LLMs view the world as a sequence of tokens
- the tokenization approach and vocabulary size are chosen before training
- which tokens to use is determined by some training data
- LLMs learn to mimic sequences of tokens
- by learning to predict the next token
- by learning conditional distributions P(next token | sequence so far)
- by learning to maximize the probability given to the actual next token (minimizing cross-entropy loss / perplexity)
- LLMs compute next-token distributions by asking “what sort of token usually comes next in this context?”
- the model computes a score for each token in the vocabulary
- by computing a dot product between the token embedding and the context embedding
- a table of token embeddings is learned during training to put tokens that occur in similar contexts close together
- context embeddings are computed based on the embeddings of prior tokens
- for each token, we need to compute a context vector for predicting the next token
- we could:
- use the embedding of the current token (but then the model would just repeat itself)
- use a neural network (“feed-forward network”) to transform each token’s embedding (but then we lose the information about the other tokens)
- average the embeddings of all previous tokens (but then we’re overwhelmed by irrelevant information)
- use a weighted average of the embeddings of all previous tokens (but then we need to learn the weights)
- use a neural network to compute the weights for the averaging (but then we can’t change the information that each token carries)
- use another neural network to compute what information each token shares with each other token (and now we get self-attention)
- add more layers (alternating self-attention and feed-forward layers) to make it more expressive
- add lots of tweaks to make it easier to learn (e.g., residual connections, layer normalization, etc.)
Friday
- Good Friday
Week 6: Multimodal Models and Diffusion
What if we want to have AI conversations that include images or audio, both as input and output?
This week we’ll look at models that can process (and sometimes generate) multiple types of data at once, such as images and text. We’ll also look at diffusion modeling, a powerful generative modeling technique.
Objectives
By the end of this week you should be able to:
- Describe how autoregressive generation works
- Describe how generative adversarial networks work
- Describe how diffusion models work
- Compare and contrast the process and results of generating sequences using three different algorithms: greedy generation, sampling, and beam search.
- Explain the concept of a generator network.
- Explain how a Generative Adversarial Network is trained.
Key Questions
- How is noise useful for diffusion models for image generation?
- Why does diffusion require multiple time steps?
Terms
- Multimodal: Combining multiple modes of input, such as text, images, and sound.
- Denoising Diffusion: Sampling from a conditional distribution by iteratively denoising a noisy sample.
- Embedding: A vector representation of an object, such as a caption or an image. (In some contexts, also called latent space or latent representation.)
- Manifold: The high-probability region of a distribution
- e.g., almost all possible images look like random noise; the manifold is the region of images that look like images in the training data
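To make “iteratively denoising a noisy sample” concrete, here's a toy DDPM-style sampling loop (my own sketch). The `predict_noise` function stands in for a trained noise-prediction network; here it's a do-nothing placeholder just so the loop runs.

```python
import torch

T = 50
betas = torch.linspace(1e-4, 0.02, T)        # noise schedule
alphas = 1.0 - betas
alpha_bars = torch.cumprod(alphas, dim=0)

def predict_noise(x, t):
    # Placeholder for a trained network eps_theta(x, t) that predicts the noise in x.
    return torch.zeros_like(x)

x = torch.randn(1, 2)                        # start from pure noise (a toy 2-D "image")
for t in reversed(range(T)):
    eps = predict_noise(x, t)
    # Remove the model's estimate of the noise added at step t...
    x = (x - betas[t] / torch.sqrt(1 - alpha_bars[t]) * eps) / torch.sqrt(alphas[t])
    # ...then inject a smaller amount of fresh noise (except at the final step).
    if t > 0:
        x = x + torch.sqrt(betas[t]) * torch.randn_like(x)

print(x)  # with a real trained model, x would now lie near the data manifold
```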
Readings
- the rest of the ethics chapter from Understanding Deep Learning
- Scheming reasoning evaluations — Apollo Research
- Stanford 2025 AI Index Report
- Artificial intelligence learns to reason | Science
- Turning Employees Into AI Janitors - by Cassie Kozyrkov
- Technical Report: Prompt Engineering is Complicated and Contingent - Wharton AI & Analytics Initiative
- 22365_3_Prompt Engineering_v7
Another nice reading (about training data), but the server seems down: Models All the Way Down until Part 3
Resources
- A minimalist diffusion model (just two tricky concepts, but after that it’s pretty accessible; check out the two tutorials linked at the top)
- A video on the Manifold Hypothesis
- Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song (mathy, but has good animated diagrams)
(some of these are drawn from the replies to this X/twitter post)
Also, many people refer to this blog post by Lilian Weng.
Monday
- Easter Monday
Wednesday
- Logistics
- Perusall readings
- Project Walkthrough
- Final Discussion
- Final Exercise
- Homework 1 Examples
- Managing conversation context
- Generative Models, Diffusion Slides
- Try the SigLIP demo that embeds images and text together. Try computing the dot products between a few texts that you write by hand. Does the dot product reflect the similarity of the texts? Repeat with images. What do you find? (A short code sketch follows after this list.)
- Handout: Conversation documents, multimodal models, and LLM reliability
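If you want to try that dot-product exercise in code, here's a hedged sketch using one possible SigLIP checkpoint on Hugging Face (the in-class demo may use a different one); it compares a few hand-written texts to each other.

```python
import torch
from transformers import AutoModel, AutoProcessor

# One possible SigLIP checkpoint; swap in whichever one the demo uses.
name = "google/siglip-base-patch16-224"
processor = AutoProcessor.from_pretrained(name)
model = AutoModel.from_pretrained(name)

texts = ["a photo of a dog", "a photo of a puppy", "a diagram of a transformer"]
inputs = processor(text=texts, padding="max_length", return_tensors="pt")

with torch.no_grad():
    emb = model.get_text_features(**inputs)      # one embedding per text
emb = emb / emb.norm(dim=-1, keepdim=True)       # normalize so dot product = cosine similarity

print(emb @ emb.T)                               # pairwise similarity matrix
```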
Friday
- Review handout from last time (question 3)
- Activity: Lab 376.6: Stable Diffusion
- Stable Diffusion
- Project Work Time
Week 7
Monday
- Handout: Tokenization and Scaling Review
- Activity: Lab: RL, Transformers, or other topics
- choose-your-own-adventure Lab on reinforcement Learning or neural net architectures
Wednesday
- Interpretability and Explanation (slides)
- Quiz 3
Friday
- Handout: Wrap-Up
- Discussion 3 sharing, comparing our survey to the results of the Pew Research survey
- Fairness and Wrap-Up slides
Final Discussion topics
- Personal Impacts
- How AI has impacted my life in the past few years. For better? For worse?
- How AI has impacted the lives of people unlike me.
- How AI might impact our lives in the next 5 years.
- Development
- Something useful or cool that has recently become possible thanks to AI.
- What are some things that AI systems are already better than humans at?
- What are some things that humans are still much better at than AI systems?
- Broader impacts
- Is AI good for the environment? Bad?
- Is AI good for society? Bad?
- Is AI good for human creativity? is it bad?
- Christian perspective
- Something that Christians should consider as people who consume AI-powered products
- …as people who use AI in their organizations
- …as people who develop AI?
Additional items:
- Tuesday, May 6, 9am: Final Project Presentations during our class’s final exam time slot
- Slides: A Final Commission