See also: CS 375 Schedule
Any content in the future should be considered tentative and subject to change.
Week 1: Intro to Generative Modeling
Some of the most impactful developments in AI recently have come from modeling and generating sequences. How do we model sequences? How do we generate them? This unit will introduce some of the basic concepts and methods for sequence modeling and generation, with a focus on natural language processing (NLP).
Terms
- Generative AI
- Language model
- Tokenization
- Vocabulary
- Autoregressive model
- Conditional distribution
- Latent variable model
- Diffusion model
- Perplexity
Key Questions
- What is one implication of the fact that LMs generate text sequentially (i.e., that most language models are causal)?
- What is a conditional distribution, in the context of language modeling (or another example we looked at in class)?
- How is a chat conversation (even with multiple turns, tool calls, etc.) just a document?
Objectives
This week will address course objectives on OG-SelfSupervised, OG-LLM-Tokenization, and OG-LLM-TokenizationImpact.
- Explain what generative modeling is and its uses
- Describe the high-level idea of three basic approaches to generative models: autoregressive, latent variable, and diffusion
- Describe the inputs and outputs of an autoregressive language model
- tokens -> embeddings
- next-token conditional probability distribution
- Describe how a language model can be used for a chatbot.
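The inputs-and-outputs objective above can be made concrete with a toy example. This is a hand-made sketch, not course code: the vocabulary, context, and logit values are all invented, and a real model would compute the logits from token embeddings rather than have them hard-coded.

```python
import math

# Toy autoregressive LM output (all values made up for illustration):
vocab = ["the", "cat", "sat", "<eos>"]   # tiny vocabulary
context = ["the", "cat"]                 # input: the sequence of tokens so far
logits = [0.5, -1.2, 2.0, 0.1]           # output: one score per vocabulary item

# Softmax turns the logits into a next-token probability distribution.
exps = [math.exp(z) for z in logits]
total = sum(exps)
probs = [e / total for e in exps]

# The distribution sums to 1; argmax (or sampling) picks the next token.
print(max(zip(probs, vocab)))
```

Repeating this step, appending the chosen token to the context each time, is exactly autoregressive generation.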
Prep and Readings
Before starting this unit, you should already know the basics of supervised learning. Specifically, you should be comfortable with training a fully-connected neural network on a classification task.
I recommend the following readings (in Perusall):
- Large Language Models explained briefly (3blue1brown)
- the Hugging Face Transformers course, chapter 1
- Artificial Intelligence Then and Now – Communications of the ACM
If you need some additional background, I recommend Understanding Deep Learning
You may also appreciate the following more technical resources, but these are not required:
Extension Opportunities
- Activity: Optional Extension: Token Efficiency Analysis
- See: SuperBPE: multi-word tokens (!)
Notes
Notes page for reference material on language models, tokenization, sampling, and evaluation.
Monday 3/16
- Scripture: Psalm 23
- Slides: 376 Unit 1: Generative Modeling Introduction
- Topics:
- Intro
- Logistics for 376 vs 375
- Projects
- Intro to Generative Modeling
- Handout: What do you already know about generative modeling?
- Surprise - fire alarm! (moving exploring-lm activity to Wednesday)
Wednesday 3/18
- Handout: Exploring Language Models
- This uses the LM Internals tool.
- Slides: 376 Unit 1: Generative Modeling Introduction
- Scripture: Proverbs
- Readings, Moodle participation activity
- Ways of Setting Up Generative Modeling (Autoregressive, Latent Variable, Diffusion)
- Autoregressive Language Models as Classifiers
- Perplexity as cumulative surprise
- Implications of Autoregressive Generation
- Text <-> Numbers
Friday 3/20
- Slides: 376 Unit 1: Generative Modeling Introduction
- Three approaches to generative modeling (autoregressive, latent variable, diffusion)
- Tokenization
- LLM APIs overview
- Intro Discussion 376.1: Probing LLM Sycophancy
- Intro Exercise 376.1: LM Evaluation
- Activity: CS 376 Lab 1: Language Model Inputs and Outputs
Week 2: Language Modeling
This week we start to take the covers off NLP models, just like we took the covers off image models in CS 375. In particular, we’ll get our first taste of the Transformer, the most important model architecture in machine learning today.
Advising is this week, so we won’t get to a lot of new content.
Key Questions
- Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and the general concept of partial credit and/or surprise in classifiers)
- What is a token embedding? What is an output (or context) embedding? How do these relate to the input and output of a language model?
- How does a causal language model use embeddings of contexts (e.g., sentence prefixes)?
- How can we use a language model to generate text?
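The perplexity question above can be answered in a few lines of code: perplexity is just the exponentiated average surprise. A minimal sketch, with made-up per-token probabilities standing in for what a real model would assign:

```python
import math

# Hypothetical probabilities the model assigned to the actual next tokens
# in some text (values are made up for illustration).
token_probs = [0.5, 0.1, 0.8, 0.25]

# Cross-entropy = average negative log-likelihood ("average surprise").
surprises = [-math.log(p) for p in token_probs]
cross_entropy = sum(surprises) / len(surprises)

# Perplexity exponentiates the average surprise back into an
# effective "branching factor" over the vocabulary.
perplexity = math.exp(cross_entropy)
print(cross_entropy, perplexity)
```

Note that perplexity is the geometric mean of 1/p over the tokens, so one very surprising token (like the 0.1 here) pulls it up a lot.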
Objectives
This week we start work on these objectives:
- I can identify the shapes of data flowing through a Transformer-style language model. [NC-TransformerDataFlow]
- I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose. [NC-Embeddings]
Notes
- token and output embeddings work almost exactly like https://cs.calvin.edu/courses/cs/375/cur/notebooks/u07n1-image-embeddings.html
Q&A
How do models deal with really long conversations?
The system can cache the internal representation (“k-v cache”) so it doesn’t have to recompute the whole thing each time. But that takes RAM.
Does autoregressive generation mean that the model can’t plan ahead?
Not exactly. See Tracing the thoughts of a large language model \ Anthropic: “Claude will plan what it will say many words ahead, and write to get to that destination.”
Does committing to a direction mean that text might be incoherent?
Not necessarily. Better pre-training will mean that it gets examples of plausible continuations of even rare prefixes. And post-training (RLHF and other techniques) can help steer the model to put higher probability on paths that are likely to be coherent.
How does the tokenizer decide where to split words?
The most common technique is called Byte Pair Encoding (BPE). It starts with a vocabulary of all the individual characters, and then iteratively merges the most common pairs of tokens into new tokens until it reaches the desired vocabulary size. For a deep dive, see Byte-Pair Encoding tokenization · Hugging Face or Let’s build the GPT Tokenizer - video by Andrej Karpathy.
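Here is a minimal sketch of that BPE training loop on a tiny made-up corpus of counted words. It is an illustration of the merge idea only, not any production tokenizer (which would also handle bytes, a merge table, and end-of-word markers):

```python
from collections import Counter

# Made-up corpus: each word is a tuple of symbols, with a count.
word_counts = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
               ("n", "e", "w"): 6, ("n", "e", "w", "e", "r"): 3}

def most_common_pair(words):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, count in words.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += count
    return pairs.most_common(1)[0][0]

def merge(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, count in words.items():
        out, i = [], 0
        while i < len(word):
            if word[i:i + 2] == pair:
                out.append(word[i] + word[i + 1])
                i += 2
            else:
                out.append(word[i])
                i += 1
        merged[tuple(out)] = count
    return merged

# Real BPE repeats until the vocabulary reaches a target size; we do 3 merges.
for _ in range(3):
    pair = most_common_pair(word_counts)
    print("merging", pair)
    word_counts = merge(word_counts, pair)
```

After a few merges, frequent words like "new" become single tokens while rarer ones stay split into pieces.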
Why are modern LMs so fast?
Several things have helped make modern language models fast: quantization, which reduces the memory bandwidth needed; specialized compute units like Google’s TPUs, which do matrix multiplies very fast; and algorithmic improvements like FlashAttention, where researchers carefully thought through what memory access is actually required during inference and wrote an implementation highly optimized for the kind of hardware we have.
What do the human fine-tuners actually do?
Human fine-tuners often do the kinds of tasks that you sometimes see ChatGPT asking about: labeling which of two options is better. Some of them also write reference answers that the model should learn to imitate. The role of these labeling and feedback mechanisms will probably change as we see a shift toward learning from computationally generated feedback.
Why do commercial LLMs not actually have much trouble with misspellings?
This has actually been one of the things that challenged my understanding the most over the past few years. I would have expected modern language models to have more trouble with misspellings and typos than they empirically seem to. I think there are two explanations. First, and probably the main one: at the scale of the Internet, most typos have happened before. Second, model providers may be deliberately introducing some errors, such as typos or misspellings, into pre-training as a kind of data augmentation. I don’t have any evidence that they’re actually doing that, though.
Terms
- language modeling
- n-gram
- token embeddings (sometimes called word embeddings)
- output embeddings (sometimes called hidden states or contextual embeddings)
- token logits
- temperature
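Temperature, the last term above, is easy to see in code: divide the logits by the temperature before the softmax. A minimal sketch with made-up logits; a real implementation would use tensors, but the math is the same:

```python
import math
import random

def sample_with_temperature(logits, temperature, rng):
    """Scale logits by 1/temperature, softmax, then sample one token index."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                            # subtract max for stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return rng.choices(range(len(probs)), weights=probs)[0]

rng = random.Random(0)
logits = [2.0, 1.0, 0.1]                       # made-up next-token logits
low = [sample_with_temperature(logits, 0.1, rng) for _ in range(100)]
high = [sample_with_temperature(logits, 5.0, rng) for _ in range(100)]
# Low temperature is nearly greedy (almost always the top token);
# high temperature spreads probability toward uniform.
print(low.count(0), high.count(0))
```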
Monday 3/23
Reminders:
- Complete “Reflections Week 1”
- Discussion 1
Project ideas:
- NanoGPT speedrun – or Slowrun
- do an active Kaggle Competition
Signup sheet: devos, tech updates, leading discussion, pair programming
Handout: GenAI problem setup: approaches, LLM-as-classifier, chat-as-document, next-token distribution
Tokenization:
- Run Google Image searches for: “how many r in strawberry” and “stable diffusion can’t spell”.
Activity: Lab 376.2: Logits in Causal Language Models
Supplemental material: list comprehensions in Python
Wednesday 3/25
- Advising
Friday 3/27
- Tech Update: current Transformers release
- Project Discussions
- Project Inspirations
- An example related to our topic today: How to make a racist AI without really trying | ConceptNet blog
- Review handout from last time
- Lab review
- For reference:
- Notebook: Image Embeddings (u07n1-image-embeddings.ipynb)
- Handout: Token and Context Embeddings, and Sampling Algorithm
- Resources: the softmax/cross-entropy interactive
Topics:
- Token and Context Embeddings
Week 3: Architectures
Now that we’ve seen the basic capabilities of NLP models, let’s start getting under the hood. How do they work? How do we measure that?
The Transformer architecture (sometimes called a self-attention network) has been the power behind many recent advances, not just in NLP but also in vision, audio, and more. That’s because it’s currently one of the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words or over all images. This week we’ll see how it works!
We’ll also look at other architectures that have been popular in the past, such as convolutional networks (CNNs) and recurrent networks (RNNs), and maybe even look at how some new architectures bring in ideas from those older architectures.
Objectives
- [NC-SelfAttention]
- [NC-Architectures]
- [NC-TransformerDataFlow]
Key Questions
By the end of this week you should be able to answer the following questions:
- What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
- How do embeddings for words (or tokens) represent similarity / difference?
- Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge?
- How does data flow between tokens and between layers in a self-attention network? In what sense does it use conditional logic?
- What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate with the dot product and softmax? (Wait, is this logistic classification yet again?)
Things we didn’t explicitly get to this week:
- How do self-attention networks keep track of position?
- How does the data flow in Transformer networks differ from Convolutional or Recurrent networks?
- What are encoders and decoders? Why does that matter? What impact does that have on what you can do with the model?
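The attention-head question above (queries, keys, values, dot product, softmax) can be sketched from scratch in a few lines. Everything here is made up for illustration: 2-dimensional vectors for a 3-token sequence, no learned projection matrices, and no causal mask:

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attention_head(Q, K, V):
    """One head: scores = q·k / sqrt(d), weights = softmax(scores),
    output = weighted average of the value vectors."""
    d = len(K[0])
    out = []
    for q in Q:                                # one output per query position
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in K]
        weights = softmax(scores)              # attention weights sum to 1
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Made-up Q, K, V for a 3-token sequence with 2-dimensional heads.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention_head(Q, K, V)
print(out)  # shape: 3 positions x 2 dimensions
```

Note that each output row is a convex combination of the value vectors, which is why attention is often described as "soft lookup" (and yes, the dot-product-then-softmax step should look a lot like logistic classification).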
Terms
- attention, especially self-attention
- query, key, and value vectors
- attention weights
- multi-head attention
- feed-forward network (MLP)
- residual connection
- layer normalization (bonus topic)
Prep and Reading
This week’s reading includes a result from Anthropic that looks helpful for getting an accurate intuition about what’s going on inside an LLM. We’re looking at the high-level overview article this week; if people are interested we can dig into the technical report in a future week.
- 3blue1brown articles (you may prefer to watch the linked video at the top)
- LLM Visualization: an interactive article; take your time to walk through it over several sessions. It’s very detailed, so don’t expect to understand everything at this point. The most important parts to pay attention to are:
- What’s the input look like (we’ve already studied this)
- The attention mechanism
- The MLP / Feed-Forward part (which should be familiar from CS 375)
- The Output (again, we’ve already studied this, but it has a few more details)
- Tracing the thoughts of a large language model \ Anthropic
- Ethics: Understanding Deep Learning book chapter 21, stopping at section 21.2.
News (in Perusall library, not officially assigned):
Supplemental Resources
- Other neural network architectures (compare with self-attention):
- Recurrent Networks: Elman; LSTM and GRU
- Convolutional Networks:
- What convolution does to an image: Image Kernels explained visually
- How to use convolutions in a neural network: CS231n Convolutional Neural Networks for Visual Recognition
- What they learn: Feature Visualization
- A video course on How Transformer LLMs Work - DeepLearning.AI
- Wanna code it? Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
- HandsOnLLM/Hands-On-Large-Language-Models: Official code repo for the O’Reilly Book - “Hands-On Large Language Models”
Q&A
Is there a limit to how far back a transformer can look? And how are they improving it? For example, in a chatbot with more context, you can feel that it is getting dumber.
“how far back a transformer can look” = its “context window”. Things that limit that:
- The architecture: if position embeddings are absolute (not, say, RoPE), then we need to set a limit before we even start training.
- Computation: plain self-attention is quadratic in sequence length, so long-context attention takes far more computation time. This has seen lots of optimization effort recently.
- Training: we have to actually give the model examples of documents, conversations, etc. where long-range attention is needed; otherwise it won’t learn to use it.
How does increased communication [via self-attention] actually translate to better token generation?
Consider the case of asking an LLM to fix up a paragraph that you wrote. It needs to basically copy what you gave as input, but with some edits / changes at some places. Self-attention lets the network basically keep a running pointer to where you are in the input, grab what you said next, and repeat that or something similar in the output. A recurrent network (like LSTM), in contrast, would somehow have to encode your entire input into a single vector, and then decode that into the output, which is really challenging to learn to do reliably.
How else to improve transformers, besides more training and more heads / layers / dimensions?
There’s so many little tweaks that people do (read the tech report of any new model release). Common things people play with are how to encode position (RoPE is big now), playing with how keys/queries/values mix and match (Grouped Query Attention etc.), and the data, loss functions, etc. (e.g., reinforcement learning from various kinds of rewards).
What does the Anthropic article mean by “Claude wasn’t designed as a calculator—it was trained on text, not equipped with mathematical algorithms”?
The Transformer architecture is hugely inefficient and unreliable if all you want to do is addition or multiplication. But the model had to learn to do arithmetic anyway, even with unreliable building blocks, because being able to add and multiply makes the Internet a bit less surprising.
Monday 3/30
- Slides: Neural Architectures
- Intro
- Review
- Review handout activity from Friday
- Handout: Token and Context Embeddings, and Sampling Algorithm
- Pair up, compare answers (5-10 min), then debrief
- How does the model actually represent these tokens and contexts?
- Let’s write the sampling algorithm together.
- Notebook: Demo of Logits and Embeddings from a Language Model (u09n0-logits-demo.ipynb)
- Vector analogies, logit lens — what does the model “think” at each layer?
- Concretely grounds embeddings before we move to self-attention
- Tease Wednesday: “Now that we know what embeddings are — how does the model decide which information to pay attention to?”
- Hand out the “Self-Attention By Hand” activity for Wednesday
- Handout: Self-Attention By Hand
Wednesday 4/1
- Mr. Chatterbox is a (weak) Victorian-era ethically trained model you can run on your own computer - a project idea?
- Slides: Neural Architectures (self-attention section)
- Birthday analogy exercises → Q/K/V intuition
- Self-Attention: One Attention Head (formal definition)
- Transformer block diagrams
- Handout: Self-Attention By Hand
- Students work in pairs/teams, debrief
- Transformer Explainer (if time; otherwise assign as explore-over-break)
- Review: Self-Attention = conditional information flow
- Software: describe the wiring, then what flows through the wires.
- Hardware: compute queries, keys, and values, then compute the attention matrix, then compute the output.
- Start / preview self-attention lab:
- Activity: Lab 376.3: Implementing Self-Attention
- See “before next class”
Friday 4/3
- Good Friday
Before next class (Wed Apr 8)
- Complete Self-Attention By Hand handout if not finished in class
- u10n1-implement-transformer notebook: work through Setup → Tokenization → MLP → “Trace the Simple Model” sections (stop before Self-Attention section)
- Reading: 3Blue1Brown attention video + Anthropic tracing-thoughts article
- Study for Quiz 1 (Wed Apr 8)
Week 4: Generation and Prompting
How can a model trained to mimic become a helpful, capable, mostly-harmless(?), and even semi-autonomous agent? We’ll discuss how prompting techniques can get us partway there, but modern LLMs use extensive post-training from human and automated feedback to get the rest of the way.
Key Questions
- How can each of the following be represented in a “document”:
- A conversation between a user and a model (assistant)
- An action to take in the world (e.g., calling an API or running code)
- How might a helpful and harmless response differ from a mimicry response?
- How can we use feedback to tune a model’s behavior?
Terms
- dialog agents
- prompting
- post-training (e.g., instruction tuning, Reinforcement Learning from Human Feedback (RLHF))
Objectives
Core objectives:
- [MS-LLM-API]
- [MS-LLM-Prompting]
- [MS-LLM-Advanced]
- [MS-LLM-Train]
Review objectives:
- [MS-LLM-Tokenization]
- [MS-LLM-TokenizationImpact]
- [MS-LLM-Generation]
Extension objectives:
- [MS-LLM-Compute]
Readings
All readings are posted on Perusall, copied here for reference.
- The Tülu blog post gives a great summary of the current state-of-the-art post-training process.
- Make sure that you can identify the main steps in the overall process and what the main point of each one was.
- Also pay attention to where human input steers the model.
- If you’re curious to go beyond this, see the optional reading on the OLMo 2 article.
- The Hugging Face article on agents provides a summary of how LLMs can become agents and what some of the implications of that are.
- The Google / Gemma API docs (Function Calling with Gemma) provide some examples of how we can actually use some of these functionalities.
- What’s “prompt injection”? A new kind of vulnerability—skim a few of the blog posts about this. Pay attention to how LLM-based agents are uniquely vulnerable to it.
Monday 4/6
- Easter Monday
Wednesday 4/8
- Reminder: Discussion 376.2: Training Data as Stewardship posts due today (replies due Fri Apr 11)
- Tech update: Anthropic Glasswing
- Quiz 1: Looking for evidence of learning about:
- [OG-LLM-Tokenization]
- [TM-LLM-Generation]
- [OG-SelfSupervised]
- [TM-LLM-Embeddings]
- [TM-SelfAttention] (basic intuition only)
- If you finish early:
- Self-Attention By Hand (in Code) (u10s1-attention-by-hand.ipynb)
- Work on the lab we’ll be doing next class. See the slides:
- Slides: 376 Lab 3: Implementing Self-Attention
Friday 4/10
- Handout: Self-Attention Shapes
- Class exercise: given model dimensions, what are the shapes of Q, K, V, attention matrix?
- Reference: the Qwen 2.5 tech report and model page
- And do: Self-Attention By Hand (in Code) (u10s1-attention-by-hand.ipynb)
- Lab 3 (Self-Attention in Code)
- Slides: 376 Lab 3: Implementing Self-Attention
- Implementing self-attention (trace through transformer implementation)
- Continue u10n1-implement-transformer: Self-Attention section
- Project encouragements
- Be the ones who can measure AI performance
- Motivational examples:
Reference:
- Demo of Logits and Embeddings from a Language Model (u09n0-logits-demo.ipynb)
Week 5: Agents and Tool Use
How can we turn LLMs into agents that interact with the world? This week we’ll explore tool use, function calling, and context engineering for multi-turn agents.
Resources
If you’re feeling fuzzy about any of the concepts we’ve covered so far, I recommend going back to these resources:
- Videos / articles
- Interactive
- Transformer Explainer
- LLM Visualization: an interactive article, take your time to walk through it over several sessions.
- Softmax and Cross-Entropy
- Notebooks
- Notebook: Demo of Logits and Embeddings from a Language Model (u09n0-logits-demo.ipynb)
Supplemental resources:
- Tracing the thoughts of a large language model \ Anthropic
- OpenAI Tokenizer
- Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
Monday 4/13
- Assign Discussion 376.3: When Agents Go Wrong (posts due Fri Apr 17, replies due Mon Apr 20)
- Review quiz 1 (brief, ~10 min)
- Tool use / function calling — live demo with API
- Example flow: call API with tool definition → model returns tool_use → execute → feed result back
- Reference: Qwen2.5 chat template
- Feedback / check-in activity
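The example flow above can be sketched with a stubbed-out model so it runs without any API key. Everything here is hypothetical: `fake_model`, `get_weather`, and the message format are stand-ins for a real provider's API and chat template, which differ between providers.

```python
import json

def get_weather(city):
    """A toy 'tool'; a real agent would call an actual weather API here."""
    return {"city": city, "temp_f": 55}

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM API call: first it requests a tool, then answers."""
    if messages[-1]["role"] == "user":
        return {"role": "assistant",
                "tool_use": {"name": "get_weather",
                             "arguments": {"city": "Grand Rapids"}}}
    return {"role": "assistant", "content": "It's 55°F in Grand Rapids."}

# The loop: call model -> if it requests a tool, execute it -> feed result back.
messages = [{"role": "user", "content": "What's the weather in Grand Rapids?"}]
reply = fake_model(messages)
while "tool_use" in reply:
    call = reply["tool_use"]
    result = TOOLS[call["name"]](**call["arguments"])   # execute the tool
    messages += [reply, {"role": "tool", "content": json.dumps(result)}]
    reply = fake_model(messages)                        # feed result back
print(reply["content"])
```

The key observation is that the whole exchange, including the tool call and its result, is just more messages appended to one growing document.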
Wednesday 4/15
- Context engineering, multi-turn agents, failure modes
- Motivational examples from last year’s student feedback:
- How to make a (semi-autonomous) agent that improves its behavior from feedback
- Project Work Time!
- Deliverable: what’s your project? What does success look like (sketch an example)? What are two next steps you can take to make progress?
Friday 4/17
- Project scoping time
- Review (see Summary)
Week 6: Training Pipeline and Projects
How are modern LLMs trained? This week covers the training pipeline (pretraining → SFT → RLHF) and Quiz 2.
Objectives
By the end of this week you should be able to:
- Describe how autoregressive generation works
- Describe how generative adversarial networks work
- Describe how diffusion models work
- Compare and contrast the process and results of generating sequences using three different algorithms: greedy generation, sampling, and beam search.
- Explain the concept of a generator network.
- Explain how a Generative Adversarial Network is trained.
Key Questions
- How is noise useful for diffusion models for image generation?
- Why does diffusion require multiple time steps?
Terms
- Multimodal: Combining multiple modes of input, such as text, images, and sound.
- Denoising Diffusion: Sampling from a conditional distribution by iteratively denoising a noisy sample.
- Embedding: A vector representation of an object, such as a caption or an image. (In some contexts, also called latent space or latent representation.)
- Manifold: The high-probability region of a distribution
- e.g., almost all possible images look like random noise; the manifold is the region of images that look like images in the training data
Readings
- the rest of the ethics chapter from Understanding Deep Learning
- Scheming reasoning evaluations — Apollo Research
- Stanford 2025 AI Index Report
- Artificial intelligence learns to reason _ Science
- Turning Employees Into AI Janitors - by Cassie Kozyrkov
- Technical Report: Prompt Engineering is Complicated and Contingent - Wharton AI & Analytics Initiative
Another nice reading (about training data), but the server seems down: Models All the Way Down until Part 3
Resources
- A minimalist diffusion model (just two tricky concepts, but after that it’s pretty accessible; check out the two tutorials linked at the top)
- A video on the Manifold Hypothesis
- Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song (mathy, but has good animated diagrams)
(some of these are drawn from the replies to this X/twitter post)
Also, many people refer to this blog post by Lilian Weng.
Monday 4/20
- Assign Discussion 376.4: Fans and Skeptics (posts due Mon Apr 27, replies due Thu Apr 30; we’ll share in class W7 Fri)
- Training pipeline overview: pre-training → SFT → RLHF
- Tülu blog post reading discussion
- Handout TODO from 2025_04_23 - Conversation documents, multimodal models, and LLM reliability
Wednesday 4/22
- Quiz 2 (proctored — Ken traveling): Looking for evidence of learning about:
- [MS-LLM-API]
- [MS-LLM-Prompting]
- [NC-SelfAttention] (deeper)
- [NC-TransformerDataFlow]
- [MS-LLM-Advanced]
- [MS-LLM-Train]
Friday 4/24
- Project Work Time
Week 7
Monday 4/27
- Diffusion and multimodal models (~20 min conceptual overview)
- Generative Models, Diffusion Slides
- Handout TODO from 2025_04_28 - Tokenization and Scaling Review
Wednesday 4/29
- Interpretability and Explanation (slides)
- Quiz 3
Friday 5/1
- Handout TODO from 2025_05_02 - Wrap-Up
- Discussion 376.4 sharing, comparing our survey to the results of the Pew Research survey
- Fairness and Wrap-Up slides
Final Discussion topics
- Personal Impacts
- How AI has impacted my life in the past few years. For better? For worse?
- How AI has impacted the lives of people unlike me.
- How AI might impact our lives in the next 5 years.
- Development
- Something useful or cool that has recently become possible thanks to AI.
- What are some things that AI systems are already better than humans at?
- What are some things that humans are still much better at than AI systems?
- Broader impacts
- Is AI good for the environment? Bad?
- Is AI good for society? Bad?
- Is AI good for human creativity? is it bad?
- Christian perspective
- Something that Christians should consider as people who consume AI-powered products
- …As people who use AI in their organizations
- …as people who develop AI?
Additional items:
- Tuesday, May 6, 9am: Final Project Presentations during our class’s final exam time slot
- Slides: A Final Commission