See also: CS 375 Schedule
Any content in the future should be considered tentative and subject to change.
Week 1: Intro to Generative Modeling
Some of the most impactful developments in AI recently have come from modeling and generating sequences. How do we model sequences? How do we generate them? This unit will introduce some of the basic concepts and methods for sequence modeling and generation, with a focus on natural language processing (NLP).
Terms
- Generative AI
- Language model
- Tokenization
- Vocabulary
- Autoregressive model
- Conditional distribution
- Latent variable model
- Diffusion model
- Perplexity
Key Questions
- What is one implication of the fact that LMs generate text sequentially (i.e., that most language models are causal)?
- What is a conditional distribution, in the context of language modeling (or another example we looked at in class)?
- How is a chat conversation (even with multiple turns, tool calls, etc.) just a document?
Objectives
This week will address course objectives on OG-SelfSupervised, OG-LLM-Tokenization, and OG-LLM-TokenizationImpact.
- Explain what generative modeling is and its uses
- Describe the high-level idea of three basic approaches to generative models: autoregressive, latent variable, and diffusion
- Describe the inputs and outputs of an autoregressive language model
- tokens -> embeddings
- next-token conditional probability distribution
- Describe how a language model can be used for a chatbot.
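The "inputs and outputs" objective can be made concrete with a toy sketch. The snippet below builds a bigram language model from raw counts — nothing like how a real neural LM works internally, but the contract is the same: a context goes in, a next-token conditional probability distribution comes out. The tiny corpus is made up for illustration.

```python
from collections import Counter, defaultdict

# Made-up corpus; whitespace "tokenization" stands in for real subword tokens.
corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count bigrams: how often does `nxt` follow `prev`?
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_token_distribution(prev):
    """P(next token | previous token): the conditional distribution an
    autoregressive model outputs at each step (here, from raw counts)."""
    c = counts[prev]
    total = sum(c.values())
    return {tok: n / total for tok, n in c.items()}

dist = next_token_distribution("the")
# "the" is followed by cat, mat, dog, and rug once each in this corpus,
# so each gets probability 0.25.
```

A real model replaces the count table with a neural network, but the output type — a probability distribution over the vocabulary — is identical.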
Prep and Readings
Before starting this unit, you should already know the basics of supervised learning. Specifically, you should be comfortable with training a fully-connected neural network on a classification task.
I recommend the following readings (in Perusall):
- Large Language Models explained briefly (3blue1brown)
- the Hugging Face Transformers course, chapter 1
- Artificial Intelligence Then and Now – Communications of the ACM
If you need some additional background, I recommend Understanding Deep Learning.
You may also appreciate the following more technical resources, but these are not required:
Extension Opportunities
- Activity: Optional Extension: Token Efficiency Analysis
- See: SuperBPE: multi-word tokens (!)
Notes
Notes page for reference material on language models, tokenization, sampling, and evaluation.
Monday 3/16
- Scripture: Psalm 23
- Slides: 376 Unit 1: Generative Modeling Introduction
- Topics:
- Intro
- Logistics for 376 vs 375
- Projects
- Intro to Generative Modeling
- Handout: What do you already know about generative modeling?
- Surprise - fire alarm! (moving exploring-lm activity to Wednesday)
Wednesday 3/18
- Handout: Exploring Language Models
- This uses the LM Internals tool.
- Slides: 376 Unit 1: Generative Modeling Introduction
- Scripture: Proverbs
- Readings, Moodle participation activity
- Ways of Setting Up Generative Modeling (Autoregressive, Latent Variable, Diffusion)
- Autoregressive Language Models as Classifiers
- Perplexity as cumulative surprise
- Implications of Autoregressive Generation
- Text <-> Numbers
Friday 3/20
- Slides: 376 Unit 1: Generative Modeling Introduction
- Three approaches to generative modeling (autoregressive, latent variable, diffusion)
- Tokenization
- LLM APIs overview
- Intro Discussion 376.1: Probing LLM Sycophancy
- Intro Exercise 376.1: LM Evaluation
- Activity: CS 376 Lab 1: Language Model Inputs and Outputs
Week 2: Language Modeling
This week we start to take off the covers of NLP models, just like we took off the covers of image models in CS 375. In particular, we’ll get our first taste of the Transformer model, the most important model in machine learning today.
Advising is this week, so we won’t get to a lot of new content.
Key Questions
- Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and the general concept of partial credit and/or surprise in classifiers)
- What is a token embedding? What is an output (or context) embedding? How do these relate to the input and output of a language model?
- How does a causal language model use embeddings of contexts (e.g., sentence prefixes)?
- How can we use a language model to generate text?
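The perplexity question above has a compact numeric answer. Here is a sketch with made-up per-token probabilities showing how cross-entropy (average surprise) and perplexity relate:

```python
import math

# Hypothetical probabilities a language model assigned to the actual next
# tokens of a short text (numbers invented for illustration).
token_probs = [0.2, 0.5, 0.1, 0.4]

# Cross-entropy = average negative log-likelihood per token ("surprise").
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity = exp(cross-entropy), often read as an effective branching factor.
perplexity = math.exp(cross_entropy)
```

Equivalently, perplexity is the inverse geometric mean of the token probabilities, which is why one very surprising token drags the whole score up.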
Objectives
This week we start work on these objectives:
- I can identify the shapes of data flowing through a Transformer-style language model. [NC-TransformerDataFlow]
- I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose. [NC-Embeddings]
Notes
- token and output embeddings work almost exactly like https://cs.calvin.edu/courses/cs/375/cur/notebooks/u07n1-image-embeddings.html
Q&A
Why are modern LMs so fast?
Several things have helped make modern language models fast: quantization, which reduces the memory bandwidth needed; specialized compute units like Google’s TPUs, which do matrix multiplies very fast; and algorithmic improvements like FlashAttention, where researchers carefully worked out what memory access is actually required during inference and built an implementation highly optimized for the hardware we have.
What do the human fine-tuners actually do?
Human fine-tuners often do the kinds of tasks you sometimes see ChatGPT asking about: labeling which of two options is better. Some of them also write reference answers that the model should learn to imitate. The role of these labeling and feedback mechanisms will probably change as we see a shift toward learning from computationally-generated feedback.
Why do commercial LLMs not actually have much trouble with misspellings?
This has actually been one of the things that challenged my understanding the most over the past few years. I would have expected modern language models to have more trouble with misspellings and typos than they empirically seem to. I think there are two explanations: the first, and probably the main one, is that at the scale of the Internet, most typos have happened before. The second is that model providers may be deliberately introducing some errors, such as typos or misspellings, into the pre-training process as a kind of data augmentation. I don’t have any evidence that they’re actually doing that, though.
Terms
- language modeling
- n-gram
- token embeddings (sometimes called word embeddings)
- output embeddings (sometimes called hidden states or contextual embeddings)
- token logits
- temperature
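The last two terms (token logits and temperature) connect directly: sampling temperature rescales the logits before the softmax. A minimal sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Convert token logits to probabilities; lower temperature sharpens
    the distribution, higher temperature flattens it."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # made-up logits for a 3-token vocabulary
p_sharp = softmax_with_temperature(logits, temperature=0.5)
p_flat = softmax_with_temperature(logits, temperature=2.0)
```

At temperature → 0 this approaches greedy (argmax) decoding; at high temperature it approaches uniform sampling.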
Monday 3/23
Logistics:
- Scripture: Jeremiah 17:7-8
- Reminders:
  - Complete “Reflections Week 1”
  - Discussion 1
- Signups sheet
Tokenization:
- Run Google Image searches for: “how many r in strawberry” and “stable diffusion can’t spell”.
Activity: Lab 376.2: Logits in Causal Language Models
Supplemental material: list comprehensions in Python
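As a quick refresher on that supplemental material, here are a few list comprehensions of the kind the labs use on token lists (the token strings are made up):

```python
# Token strings in; lengths, normalized forms, and filtered subsets out.
tokens = ["The", "quick", "brown", "fox"]

lengths = [len(t) for t in tokens]               # one length per token
lowered = [t.lower() for t in tokens]            # normalize case
long_tokens = [t for t in tokens if len(t) > 3]  # `if` clause filters
```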
Wednesday 3/25
- Advising
Friday 3/27
- Project Inspirations
- An example related to our topic today: How to make a racist AI without really trying | ConceptNet blog
- Lab review
- For reference:
- Notebook: Image Embeddings (name: u07n1-image-embeddings.ipynb; show preview, open in Colab)
- Handout TODO from 2025_03_28 - Token and Context Embeddings
- Resources: the softmax/cross-entropy interactive
Topics:
- Token and Context Embeddings
Week 3: Architectures
Now that we’ve seen the basic capabilities of NLP models, let’s start getting under the hood. How do they work? How do we measure that?
The Transformer architecture (sometimes called a self-attention network) has been the power behind many recent advances, not just in NLP but also in vision, audio, and more. That’s because Transformers are currently one of the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words or images. This week we’ll see how they work!
We’ll also look at other architectures that have been popular in the past, such as convolutional networks (CNNs) and recurrent networks (RNNs), and maybe even look at how some new architectures bring in ideas from those older architectures.
Objectives
- [NC-SelfAttention]
- [NC-Architectures]
- [NC-TransformerDataFlow]
Key Questions
By the end of this week you should be able to answer the following questions:
- What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
- How do embeddings for words (or tokens) represent similarity / difference?
- Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge?
- How does data flow between tokens and between layers in a self-attention network? In what sense does it use conditional logic?
- What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate with the dot product and softmax? (Wait, is this logistic classification yet again?)
Things we didn’t explicitly get to this week:
- How do self-attention networks keep track of position?
- How does the data flow in Transformer networks differ from Convolutional or Recurrent networks?
- What are encoders and decoders? Why does that matter? What impact does that have on what you can do with the model?
Terms
- attention, especially self-attention
- query, key, and value vectors
- attention weights
- multi-head attention
- feed-forward network (MLP)
- residual connection
- layer normalization (bonus topic)
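The query/key/value machinery in the key questions above can be written out in a few lines. This is a pure-Python sketch of a single attention head with tiny made-up vectors; real implementations use batched matrix operations and learned linear projections to produce the q/k/v vectors.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(queries, keys, values):
    """One attention head: each token's query scores every key (dot product),
    softmax turns scores into attention weights, and the output is the
    weighted average of the value vectors."""
    d = len(keys[0])
    outputs = []
    for q in queries:
        scores = [dot(q, k) / math.sqrt(d) for k in keys]  # scaled dot-product
        weights = softmax(scores)                          # sums to 1
        out = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
        outputs.append(out)
    return outputs

# Tiny invented example: 3 tokens, 2-dimensional q/k/v vectors.
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = self_attention(q, k, v)  # 3 output vectors, one per token
```

Note the shapes: n query vectors in, n output vectors out, each a convex combination of the value vectors — which is why every output stays inside the range spanned by the values.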
Prep and Reading
This week’s reading includes a brand new result from Anthropic that looks really helpful for getting an accurate intuition about what’s going on inside an LLM. We’re looking at the high-level overview article this week; if people are interested we can dig into the technical report in a future week.
- 3blue1brown articles (you may prefer to watch the linked video at the top)
- LLM Visualization: an interactive article; take your time to walk through it over several sessions. It’s very detailed, so don’t expect to understand everything at this point. The most important parts to pay attention to are:
- What the input looks like (we’ve already studied this)
- The attention mechanism
- The MLP / Feed-Forward part (which should be familiar from CS 375)
- The Output (again, we’ve already studied this, but it has a few more details)
- Tracing the thoughts of a large language model – Anthropic
- Ethics: Understanding Deep Learning book chapter 21, stopping at section 21.2.
News (in Perusall library, not officially assigned)
Supplemental Resources
- Other neural network architectures (compare with self-attention):
- Recurrent Networks: Elman; LSTM and GRU
- Convolutional Networks:
- What convolution does to an image: Image Kernels explained visually
- How to use convolutions in a neural network: CS231n Convolutional Neural Networks for Visual Recognition
- What they learn: Feature Visualization
- A video course on How Transformer LLMs Work - DeepLearning.AI
- Wanna code it? Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
- HandsOnLLM/Hands-On-Large-Language-Models: Official code repo for the O’Reilly Book - “Hands-On Large Language Models”
Q&A
Is there a limit to how far back a transformer can look? And how are they improving it? For example, in a chatbot with more context, you can feel that it is getting dumber.
“how far back a transformer can look” = its “context window”. Things that limit that:
- The architecture. If position embeddings are absolute (not, say, RoPE), then we need to set a limit before we even start training.
- computation. Plain self-attention is quadratic in sequence length. So long attention takes way more computation time. This has seen lots of effort to optimize recently.
- Training. Gotta actually give the models examples of documents / conversations / etc. where long-range attention is needed, otherwise it won’t learn it.
How does increased communication [via self-attention] actually translate to better token generation?
Consider the case of asking an LLM to fix up a paragraph that you wrote. It needs to basically copy what you gave as input, but with some edits / changes at some places. Self-attention lets the network basically keep a running pointer to where you are in the input, grab what you said next, and repeat that or something similar in the output. A recurrent network (like LSTM), in contrast, would somehow have to encode your entire input into a single vector, and then decode that into the output, which is really challenging to learn to do reliably.
How else to improve transformers, besides more training and more heads / layers / dimensions?
There are so many little tweaks that people make (read the tech report of any new model release). Common things people play with are how to encode position (RoPE is big now), how keys/queries/values mix and match (Grouped Query Attention, etc.), and the data, loss functions, etc. (e.g., reinforcement learning from various kinds of rewards).
What does the Anthropic article mean by “Claude wasn’t designed as a calculator—it was trained on text, not equipped with mathematical algorithms”?
The Transformer architecture is hugely inefficient and unreliable if all you want to do is addition or multiplication. But it had to learn to do that anyway, even with unreliable building blocks, because being able to add and multiply makes the Internet a bit less surprising.
Monday 3/30
- Slides: Neural Architectures
- Intro
- Review
- Review handout activity from Friday
- How does the Gemma model actually represent these tokens and contexts? (See logits-demo notebook)
- Let’s write the sampling algorithm together.
- Handout TODO from 2025_03_31 - Self-Attention By Hand
- Transformer Explainer
- Review: Self-Attention = conditional information flow
- Software: describe the wiring, then what flows through the wires.
- Hardware: compute queries, keys, and values, then compute the attention matrix, then compute the output.
- Notebook: Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
Wednesday 4/1
- Quiz 1: Looking for evidence of learning about:
- [MS-LLM-Tokenization]
- [MS-LLM-Generation]
- [OG-SelfSupervised]
- [NC-Embeddings]
- [NC-SelfAttention] (basic intuition only)
Friday 4/3
- Good Friday
Week 4: Generation and Prompting
How can a model trained to mimic become a helpful, capable, mostly-harmless(?), and even semi-autonomous agent? We’ll discuss how prompting techniques can get us partway there, but modern LLMs use extensive post-training from human and automated feedback to get the rest of the way.
Key Questions
- How can each of the following be represented in a “document”:
- A conversation between a user and a model (assistant)
- An action to take in the world (e.g., calling an API or running code)
- How might a helpful and harmless response differ from a mimicry response?
- How can we use feedback to tune a model’s behavior?
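The "conversation as a document" idea can be sketched directly: the function below flattens a multi-turn chat (including a tool call) into one string. The role tags and `tool_call` markup here are invented for illustration; each real model family defines its own chat template.

```python
def render_as_document(messages):
    """Flatten a chat into a single text document. The <|role|> tags are
    made-up stand-ins for a real model's special tokens."""
    parts = []
    for msg in messages:
        parts.append(f"<|{msg['role']}|>\n{msg['content']}\n")
    parts.append("<|assistant|>\n")  # cue the model to continue the document
    return "".join(parts)

conversation = [
    {"role": "user", "content": "What's 17 * 24?"},
    {"role": "assistant",
     "content": '<tool_call>{"name": "calc", "expr": "17*24"}</tool_call>'},
    {"role": "tool", "content": "408"},
]
doc = render_as_document(conversation)
# `doc` is just text: the model's only job is to predict what comes next.
```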
Terms
- dialog agents
- prompting
- post-training (e.g., instruction tuning, Reinforcement Learning from Human Feedback (RLHF))
Objectives
Core objectives:
- [MS-LLM-API]
- [MS-LLM-Prompting]
- [MS-LLM-Advanced]
- [MS-LLM-Train]
Review objectives:
- [MS-LLM-Tokenization]
- [MS-LLM-TokenizationImpact]
- [MS-LLM-Generation]
Extension objectives:
- [MS-LLM-Compute]
Readings
All readings are posted on Perusall, copied here for reference.
- The Tülu blog post gives a great summary of the current state-of-the-art post-training process.
- Make sure that you can identify the main steps in the overall process and what the main point of each one was.
- Also pay attention to where human input steers the model.
- If you’re curious beyond this, see the optional reading of the OLMo 2 article.
- The Hugging Face article on agents provides a summary of how LLMs can become agents and what some of the implications of that are.
- The Google / Gemma API docs (Function Calling with Gemma) provide some examples of how we can actually use some of these functionalities.
- What’s “prompt injection”? A new kind of vulnerability—skim a few of the blog posts about this. Pay attention to how LLM-based agents are uniquely vulnerable to it.
Monday 4/6
- Easter Monday
Wednesday 4/8
- Review quiz 1 (brief)
- Handout TODO from 2025_04_07 - Self-Attention Shapes
- Activity: Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- Slides: 376 Lab 3: Implementing Self-Attention
- Implementing self-attention (trace through transformer implementation)
- Project encouragements
- Be the ones who can measure AI performance
Reference:
- Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
Friday 4/10
- Motivational examples:
- Slides: Generation by Prompting
- Activity: Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- Prompt Engineering
- Instruction Tuning
- Retrieval-Augmented Generation
Week 5: Agents and Tool Use
How can we turn LLMs into agents that interact with the world? This week we’ll explore tool use, function calling, and context engineering for multi-turn agents.
Resources
If you’re feeling fuzzy about any of the concepts we’ve covered so far, I recommend going back to these resources:
- Videos / articles
- Interactive
- Transformer Explainer
- LLM Visualization: an interactive article, take your time to walk through it over several sessions.
- Softmax and Cross-Entropy
- Notebooks
- Notebook: Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)
Supplemental resources:
- Tracing the thoughts of a large language model – Anthropic
- OpenAI Tokenizer
- Zero to Hero part 6: Let’s build GPT: from scratch, in code, spelled out. - YouTube (go back to prior parts if you need to)
Monday 4/13
- Tool use / function calling — live demo with API
- Example flow: call API with tool definition → model returns tool_use → execute → feed result back
- Feedback / check-in activity
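The example flow above can be sketched end-to-end with a stub standing in for a real LLM API; everything here, including `fake_model` and the `calc` tool, is made up for illustration.

```python
# Stub "model": returns a canned tool call, then a final answer once it
# sees a tool result. A real implementation would call an LLM API here.
def fake_model(messages, tools):
    if messages[-1]["role"] == "tool":
        return {"role": "assistant", "content": "17 * 24 = 408."}
    return {"role": "assistant",
            "tool_call": {"name": "calc", "arguments": {"expr": "17*24"}}}

def calc(expr):
    # Toy tool: only handles "a*b" expressions.
    a, b = expr.split("*")
    return str(int(a) * int(b))

tools = {"calc": calc}
messages = [{"role": "user", "content": "What's 17 * 24?"}]

reply = fake_model(messages, tools)                       # 1. model requests a tool
while "tool_call" in reply:
    call = reply["tool_call"]
    result = tools[call["name"]](**call["arguments"])     # 2. we execute the tool
    messages.append({"role": "tool", "content": result})  # 3. feed the result back
    reply = fake_model(messages, tools)                   # 4. model answers
```

The key design point: the model never runs anything itself — our code executes the tool and appends the result to the conversation, and the loop repeats until the model stops requesting tools.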
Wednesday 4/15
- Context engineering, multi-turn agents, failure modes
- Motivational examples from last year’s student feedback:
- How to make a (semi-autonomous) agent that improves its behavior from feedback
- Project Work Time!
- Deliverable: what’s your project? What’s success look like (sketch an example)? What are two next steps that you can take to make progress?
Friday 4/17
- Project scoping time
- Review (see Summary)
Week 6: Training Pipeline and Projects
How are modern LLMs trained? This week covers the training pipeline (pretraining → SFT → RLHF) and Quiz 2.
Objectives
By the end of this week you should be able to:
- Describe how autoregressive generation works
- Describe how generative adversarial networks work
- Describe how diffusion models work
- Compare and contrast the process and results of generating sequences using three different algorithms: greedy generation, sampling, and beam search.
- Explain the concept of a generator network.
- Explain how a Generative Adversarial Network is trained.
Key Questions
- How is noise useful for diffusion models for image generation?
- Why does diffusion require multiple time steps?
Terms
- Multimodal: Combining multiple modes of input, such as text, images, and sound.
- Denoising Diffusion: Sampling from a conditional distribution by iteratively denoising a noisy sample.
- Embedding: A vector representation of an object, such as a caption or an image. (In some contexts, also called latent space or latent representation.)
- Manifold: The high-probability region of a distribution
- e.g., almost all possible images look like random noise; the manifold is the region of images that look like images in the training data
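The denoising-diffusion and manifold terms above can be grounded with a toy 1-D forward process: repeatedly mix a value with Gaussian noise until almost nothing of the original survives, which is why generation must reverse the process over many time steps. The noise schedule below is made up for illustration.

```python
import math
import random

random.seed(0)  # deterministic for the demo

def forward_noising(x0, betas):
    """Toy 1-D diffusion forward process: at each step, shrink the signal by
    sqrt(1 - beta) and add noise with variance beta. A real diffusion model
    learns to reverse these steps, starting from pure noise."""
    xs = [x0]
    x = x0
    for beta in betas:
        noise = random.gauss(0.0, 1.0)
        x = math.sqrt(1.0 - beta) * x + math.sqrt(beta) * noise
        xs.append(x)
    return xs

betas = [0.1] * 50      # small made-up noise schedule
trajectory = forward_noising(5.0, betas)
# After 50 steps, the surviving signal coefficient is 0.9**25 ~ 0.07,
# so the final value is dominated by noise.
```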
Readings
- the rest of the ethics chapter from Understanding Deep Learning
- Scheming reasoning evaluations — Apollo Research
- Stanford 2025 AI Index Report
- Artificial intelligence learns to reason – Science
- Turning Employees Into AI Janitors - by Cassie Kozyrkov
- Technical Report: Prompt Engineering is Complicated and Contingent - Wharton AI & Analytics Initiative
Another nice reading (about training data), though the server seems to be down: Models All the Way Down, up through Part 3
Resources
- A minimalist diffusion model (just two tricky concepts, but after that it’s pretty accessible; check out the two tutorials linked at the top)
- A video on the Manifold Hypothesis
- Generative Modeling by Estimating Gradients of the Data Distribution | Yang Song (mathy, but has good animated diagrams)
(some of these are drawn from the replies to this X/twitter post)
Also, many people refer to this blog post by Lilian Weng.
Monday 4/20
- Training pipeline overview: pre-training → SFT → RLHF
- Tülu blog post reading discussion
- Handout TODO from 2025_04_23 - Conversation documents, multimodal models, and LLM reliability
Wednesday 4/22
- Quiz 2 (proctored — Ken traveling): Looking for evidence of learning about:
- [MS-LLM-API]
- [MS-LLM-Prompting]
- [NC-SelfAttention] (deeper)
- [NC-TransformerDataFlow]
- [MS-LLM-Advanced]
- [MS-LLM-Train]
Friday 4/24
- Project Work Time
Week 7
Monday 4/27
- Diffusion and multimodal models (~20 min conceptual overview)
- Generative Models, Diffusion Slides
- Handout TODO from 2025_04_28 - Tokenization and Scaling Review
Wednesday 4/29
- Interpretability and Explanation (slides)
- Quiz 3
Friday 5/1
- Handout TODO from 2025_05_02 - Wrap-Up
- Discussion 3 sharing, comparing our survey to the results of the Pew Research survey
- Fairness and Wrap-Up slides
Final Discussion topics
- Personal Impacts
- How AI has impacted my life in the past few years. For better? For worse?
- How AI has impacted the lives of people unlike me.
- How AI might impact our lives in the next 5 years.
- Development
- Something useful or cool that has recently become possible thanks to AI.
- What are some things that AI systems are already better than humans at?
- What are some things that humans are still much better at than AI systems?
- Broader impacts
- Is AI good for the environment? Bad?
- Is AI good for society? Bad?
- Is AI good for human creativity? Is it bad?
- Christian perspective
- Something that Christians should consider as people who consume AI-powered products
- …As people who use AI in their organizations
- …as people who develop AI?
Additional items:
- Tuesday, May 6, 9am: Final Project Presentations during our class’s final exam time slot
- Slides: A Final Commission