Schedule - CS376

See also: CS 375 Schedule

Any content in the future should be considered tentative and subject to change.

Week 1: Intro to Generative Modeling

Some of the most impactful developments in AI recently have come from modeling and generating sequences. How do we model sequences? How do we generate them? This unit will introduce some of the basic concepts and methods for sequence modeling and generation, with a focus on natural language processing (NLP).

Terms
  • Generative AI
  • Language model
  • Tokenization
  • Vocabulary
  • Autoregressive model
  • Conditional distribution
  • Latent variable model
  • Diffusion model
  • Perplexity
Key Questions
  • What is one implication of the fact that LMs generate text sequentially (i.e., that most language models are causal)?
  • What is a conditional distribution, in the context of language modeling (or another example we looked at in class)?
  • How is a chat conversation (even with multiple turns, tool calls, etc.) just a document?
Objectives

This week will address course objectives on OG-SelfSupervised, OG-LLM-Tokenization, and OG-LLM-TokenizationImpact.

  • Explain what generative modeling is and its uses
  • Describe the high-level idea of three basic approaches to generative models: autoregressive, latent variable, and diffusion
  • Describe the inputs and outputs of an autoregressive language model
    • tokens -> embeddings
    • next-token conditional probability distribution
  • Describe how a language model can be used for a chatbot.
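To make those inputs and outputs concrete, here’s a toy sketch of the autoregressive LM contract. Everything here (the vocabulary, the embedding dimension, and the “model” itself) is made up for illustration; a real LM would run Transformer layers where the comment indicates.

```python
import math
import random

random.seed(0)

# Toy setup: a 5-token vocabulary and 4-dimensional embeddings.
vocab = ["<bos>", "the", "cat", "sat", "."]
V, d = len(vocab), 4

# Token embedding table: one d-dimensional vector per vocabulary entry.
embed = [[random.gauss(0, 1) for _ in range(d)] for _ in range(V)]

def language_model(token_ids):
    """Stand-in for a real LM: maps a token sequence to next-token logits.

    A real model would run Transformer layers here; we just average the
    input embeddings and score each vocabulary item by dot product."""
    context = [sum(embed[t][j] for t in token_ids) / len(token_ids)
               for j in range(d)]
    return [sum(context[j] * embed[w][j] for j in range(d)) for w in range(V)]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

token_ids = [0, 1, 2]               # "<bos> the cat"
logits = language_model(token_ids)  # one score per vocabulary item: shape (V,)
probs = softmax(logits)             # next-token conditional distribution

print(len(logits), round(sum(probs), 6))  # 5 1.0
```

The key shapes to notice: the input is a sequence of token IDs, and the output is a length-V vector of probabilities that sums to 1, one per possible next token.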
Prep and Readings

Before starting this unit, you should already know the basics of supervised learning. Specifically, you should be comfortable with training a fully-connected neural network on a classification task.

I recommend the following readings (in Perusall):

If you need some additional background, I recommend Understanding Deep Learning

You may also appreciate the following more technical resources, but these are not required:

Extension Opportunities
Notes

Notes page for reference material on language models, tokenization, sampling, and evaluation.

Monday 3/16

Wednesday 3/18

Friday 3/20

Week 2: Language Modeling

This week we start to take off the covers of NLP models, just like we took off the covers of image models in CS 375. In particular, we’ll get our first taste of the Transformer model, the most important model in machine learning today.

Advising is this week, so we won’t get to a lot of new content.

Key Questions
  • Define perplexity, and describe how it relates to log-likelihood and cross-entropy (and the general concept of partial credit and/or surprise in classifiers)
  • What is a token embedding? What is an output (or context) embedding? How do these relate to the input and output of a language model?
  • How does a causal language model use embeddings of contexts (e.g., sentence prefixes)?
  • How can we use a language model to generate text?
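As a preview of the perplexity question above, here’s how perplexity, cross-entropy, and log-likelihood fit together numerically. The token probabilities are made-up numbers for illustration.

```python
import math

# Suppose the model assigned these probabilities to the actual next token
# at each position in a held-out text (made-up numbers).
token_probs = [0.5, 0.1, 0.25, 0.05]

# Cross-entropy: average negative log-probability (in nats) per token.
cross_entropy = -sum(math.log(p) for p in token_probs) / len(token_probs)

# Perplexity: exponentiated cross-entropy. Think of it as an "effective
# branching factor": a model that spread its probability uniformly over
# k choices at every step would have perplexity exactly k.
perplexity = math.exp(cross_entropy)

# For example, uniform guessing over 4 choices gives perplexity 4:
uniform_ppl = math.exp(-math.log(1 / 4))

print(round(perplexity, 3), round(uniform_ppl, 3))
```

Lower perplexity means the held-out text was less surprising to the model; giving high probability to the correct token earns “partial credit” through the log.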
Objectives

This week we start work on these objectives:

  • I can identify the shapes of data flowing through a Transformer-style language model. [NC-TransformerDataFlow]
  • I can identify various types of embeddings (tokens, hidden states, output, key, and query) in a language model and explain their purpose. [NC-Embeddings]
Notes

Q&A

Why are modern LMs so fast?

Several things have helped make modern language models so fast: quantization, which reduces the memory bandwidth needed; specialized compute units like Google’s TPUs, which do matrix multiplies really fast; and algorithmic improvements like Flash Attention, where researchers carefully thought through what memory access is actually required during inference and wrote an implementation that is highly optimized for the kind of hardware that we have.

What do the human fine-tuners actually do?

Human fine-tuners often do the kinds of tasks that you sometimes see ChatGPT asking about: labeling which of two options is better. Some of them also write reference answers that the model should learn to imitate. The role of these sorts of labeling and feedback mechanisms will probably change as we see a shift to learning from computationally-generated feedback.

Why do commercial LLMs not actually have much trouble with misspellings?

This has actually been one of the things that challenged my understanding the most over the past few years. I would have expected modern language models to have more trouble with misspellings and typos than they empirically seem to. I think there are two explanations: first, and probably the main one, is that at the scale of the Internet, most typos have happened before. Second, model providers may be deliberately introducing errors, such as typos or misspellings, into the pre-training process as a kind of data augmentation. I don’t have any evidence that they’re actually doing that, though.

Terms
  • language modeling
  • n-gram
  • token embeddings (sometimes called word embeddings)
  • output embeddings (sometimes called hidden states or contextual embeddings)
  • token logits
  • temperature
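The last term above, temperature, can be made concrete with a short sketch (the logits are made-up numbers): dividing the logits by a temperature before the softmax makes sampling nearly greedy at low temperature and nearly uniform at high temperature.

```python
import math
import random

random.seed(0)

def sample_with_temperature(logits, temperature):
    """Scale logits by 1/temperature, softmax, then sample a token index."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Inverse-CDF sampling from the categorical distribution.
    r = random.random()
    cum = 0.0
    for i, p in enumerate(probs):
        cum += p
        if r < cum:
            return i
    return len(probs) - 1

logits = [2.0, 1.0, 0.1]  # made-up next-token logits

# Low temperature -> nearly always the argmax; high -> nearly uniform.
cold = [sample_with_temperature(logits, 0.1) for _ in range(1000)]
hot = [sample_with_temperature(logits, 10.0) for _ in range(1000)]
print(cold.count(0) / 1000, hot.count(0) / 1000)
```

At temperature 0.1 almost every draw is token 0; at temperature 10 the three tokens come up at close to equal rates.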

Monday 3/23

Logistics:

Tokenization:

  • Run Google Image searches for: “how many r in strawberry” and “stable diffusion can’t spell”.

Activity: Lab 376.2: Logits in Causal Language Models

Supplemental material: list comprehensions in Python

Wednesday 3/25

  • Advising

Friday 3/27

Topics:

  • Token and Context Embeddings

Week 3: Architectures

Now that we’ve seen the basic capabilities of NLP models, let’s start getting under the hood. How do they work? How do we measure that?

The Transformer architecture (sometimes called a self-attention network) has been the power behind many recent advances not just in NLP but also vision, audio, etc. That’s because it’s currently one of the best tools we have for representing high-dimensional joint distributions, such as the distribution over all possible sequences of words or images. This week we’ll see how it works!

We’ll also look at other architectures that have been popular in the past, such as convolutional networks (CNNs) and recurrent networks (RNNs), and maybe even look at how some new architectures bring in ideas from those older architectures.

Objectives
  • [NC-SelfAttention]
  • [NC-Architectures]
  • [NC-TransformerDataFlow]
Key Questions

By the end of this week you should be able to answer the following questions:

  • What is a layer in a self-attention network: what goes in, what comes out, and what are the shapes of all those things?
  • How do embeddings for words (or tokens) represent similarity / difference?
  • Why are variable-length sequences challenging for neural nets? How do self-attention networks handle that challenge?
  • How does data flow between tokens and between layers in a self-attention network? In what sense does it use conditional logic?
  • What does an attention head do? Specifically, what are queries, keys, and values, and what do they do? And how does this relate with the dot product and softmax? (Wait, is this logistic classification yet again?)
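A minimal sketch of one attention head may help with that last question. All embeddings and projection matrices below are made-up toy numbers, and causal masking is omitted: queries are dotted with keys, the scores go through a softmax (logistic-classification-style scoring again), and the resulting weights mix the value vectors.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def softmax(row):
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    s = sum(exps)
    return [e / s for e in exps]

# Three token embeddings, 2-dimensional (made-up numbers).
X = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# One attention head's learned projections (also made up).
Wq = [[1.0, 0.0], [0.0, 1.0]]
Wk = [[1.0, 0.0], [0.0, 1.0]]
Wv = [[0.5, 0.0], [0.0, 0.5]]

Q, K, V = matmul(X, Wq), matmul(X, Wk), matmul(X, Wv)

d_k = len(K[0])
# Scaled dot-product scores: how well each query matches each key.
scores = [[sum(q * k for q, k in zip(Q[i], K[j])) / math.sqrt(d_k)
           for j in range(len(K))] for i in range(len(Q))]

# Softmax over each row turns scores into attention weights.
weights = [softmax(row) for row in scores]

# Output: each token gets a weighted mix of the value vectors.
out = [[sum(w * V[j][dim] for j, w in enumerate(row))
        for dim in range(len(V[0]))] for row in weights]

print([round(w, 3) for w in weights[0]])
```

Note the shapes: for 3 tokens, `weights` is a 3×3 matrix whose rows each sum to 1, and `out` has the same shape as the input embeddings.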

Things we didn’t explicitly get to this week:

  • How do self-attention networks keep track of position?
  • How does the data flow in Transformer networks differ from Convolutional or Recurrent networks?
  • What are encoders and decoders? Why does that matter? What impact does that have on what you can do with the model?
Terms
  • attention, especially self-attention
  • query, key, and value vectors
  • attention weights
  • multi-head attention
  • feed-forward network (MLP)
  • residual connection
  • layer normalization (bonus topic)
Prep and Reading

This week’s reading includes a brand new result from Anthropic that looks really helpful for getting an accurate intuition about what’s going on inside an LLM. We’re looking at the high-level overview article this week; if people are interested we can dig into the technical report in a future week.

News (in Perusall library, not officially assigned)

Supplemental Resources
Q&A

Is there a limit to how far back a transformer can look? And how are they improving it? For example, in a chatbot with more context, you can feel that it is getting dumber.

“how far back a transformer can look” = its “context window”. Things that limit that:

  1. The architecture: if position embeddings are absolute (not relative, like RoPE), then we need to set a limit before we even start training.
  2. Computation: plain self-attention is quadratic in sequence length, so long-context attention takes much more compute time. This has seen lots of optimization effort recently.
  3. Training: we have to actually give the model examples of documents, conversations, etc. where long-range attention is needed; otherwise it won’t learn to use it.

How does increased communication [via self-attention] actually translate to better token generation?

Consider the case of asking an LLM to fix up a paragraph that you wrote. It needs to basically copy what you gave as input, but with some edits / changes at some places. Self-attention lets the network basically keep a running pointer to where you are in the input, grab what you said next, and repeat that or something similar in the output. A recurrent network (like LSTM), in contrast, would somehow have to encode your entire input into a single vector, and then decode that into the output, which is really challenging to learn to do reliably.

How else to improve transformers, besides more training and more heads / layers / dimensions?

There’s so many little tweaks that people do (read the tech report of any new model release). Common things people play with are how to encode position (RoPE is big now), playing with how keys/queries/values mix and match (Grouped Query Attention etc.), and the data, loss functions, etc. (e.g., reinforcement learning from various kinds of rewards).

What does the Anthropic article mean by “Claude wasn’t designed as a calculator—it was trained on text, not equipped with mathematical algorithms”?

The Transformer architecture is hugely inefficient and unreliable if all you want to do is addition or multiplication. But it had to learn to do that anyway, even with unreliable building blocks, because being able to add and multiply makes the Internet a bit less surprising.

Monday 3/30

  • Slides: Neural Architectures
    • Intro
    • Review
  • Review handout activity from Friday
    • How does the Gemma model actually represent these tokens and contexts? (See logits-demo notebook)
    • Let’s write the sampling algorithm together.
  • Handout TODO from 2025_03_31 - Self-Attention By Hand
  • Transformer Explainer
  • Review: Self-Attention = conditional information flow
    • Software: describe the wiring, then what flows through the wires.
    • Hardware: compute queries, keys, and values, then compute the attention matrix, then compute the output.
  • Notebook: Demo of Logits and Embeddings from a Language Model (name: u09n0-logits-demo.ipynb; show preview, open in Colab)

Wednesday 4/1

  • Quiz 1: Looking for evidence of learning about:
    • [MS-LLM-Tokenization]
    • [MS-LLM-Generation]
    • [OG-SelfSupervised]
    • [NC-Embeddings]
    • [NC-SelfAttention] (basic intuition only)

Friday 4/3

  • Good Friday

Week 4: Generation and Prompting

How can a model trained to mimic become a helpful, capable, mostly-harmless(?), and even semi-autonomous agent? We’ll discuss how prompting techniques can get us partway there, but modern LLMs use extensive post-training from human and automated feedback to get the rest of the way.

Key Questions
  • How can each of the following be represented in a “document”:
    • A conversation between a user and a model (assistant)
    • An action to take in the world (e.g., calling an API or running code)
  • How might a helpful and harmless response differ from a mimicry response?
  • How can we use feedback to tune a model’s behavior?
Terms
  • dialog agents
  • prompting
  • post-training (e.g., instruction tuning, Reinforcement Learning from Human Feedback (RLHF))
Objectives

Core objectives:

  • [MS-LLM-API]
  • [MS-LLM-Prompting]
  • [MS-LLM-Advanced]
  • [MS-LLM-Train]

Review objectives:

  • [MS-LLM-Tokenization]
  • [MS-LLM-TokenizationImpact]
  • [MS-LLM-Generation]

Extension objectives:

  • [MS-LLM-Compute]
Readings

All readings are posted on Perusall, copied here for reference.

  • The Tülu blog post gives a great summary of the current state-of-the-art post-training process.
    • Make sure that you can identify the main steps in the overall process and what the main point of each one was.
    • Also pay attention to where human input steers the model.
    • If you’re curious beyond this, see the optional reading of the OLMo 2 article.
  • The Hugging Face article on agents provides a summary of how LLMs can become agents and what some of the implications of that are.
  • The Google / Gemma API docs (Function Calling with Gemma) provide some examples of how we can actually use some of these functionalities.
  • What’s “prompt injection”? A new kind of vulnerability—skim a few of the blog posts about this. Pay attention to how LLM-based agents are uniquely vulnerable to it.

Monday 4/6

  • Easter Monday

Wednesday 4/8

Reference:

Week 5: Agents and Tool Use

How can we turn LLMs into agents that interact with the world? This week we’ll explore tool use, function calling, and context engineering for multi-turn agents.

Resources

If you’re feeling fuzzy about any of the concepts we’ve covered so far, I recommend going back to these resources:

Supplemental resources:

Monday 4/13

  • Tool use / function calling — live demo with API
    • Example flow: call API with tool definition → model returns tool_use → execute → feed result back
  • Feedback / check-in activity
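The example flow from Monday can be sketched as a loop. This is not any real provider’s API: the message format, tool schema, and `fake_model` are all invented stand-ins for this sketch; a real agent would replace `fake_model` with an actual LLM API call.

```python
import json

# A toy tool the agent can call. The name and schema are made up.
def get_weather(city):
    return {"city": city, "temp_f": 54}  # stub: a real tool would hit an API

TOOLS = {"get_weather": get_weather}

def fake_model(messages):
    """Stand-in for an LLM API call. A real model would decide from the
    prompt; here we hard-code the two turns of the loop for illustration."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    if not tool_msgs:
        # First turn: the "model" asks to call a tool.
        return {"type": "tool_use", "name": "get_weather",
                "arguments": {"city": "Grand Rapids"}}
    # Second turn: the "model" reads the tool result and answers in text.
    result = json.loads(tool_msgs[-1]["content"])
    return {"type": "text",
            "content": f"It's {result['temp_f']}F in {result['city']}."}

def run_agent(user_prompt):
    messages = [{"role": "user", "content": user_prompt}]
    while True:
        reply = fake_model(messages)
        if reply["type"] == "text":  # final answer: stop the loop
            return reply["content"]
        # The model asked for a tool: execute it, feed the result back.
        result = TOOLS[reply["name"]](**reply["arguments"])
        messages.append({"role": "tool", "content": json.dumps(result)})

print(run_agent("What's the weather in Grand Rapids?"))
```

The important structure is the loop: call the model, and if it returns a tool request rather than text, execute the tool and append the result to the conversation before calling the model again.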

Wednesday 4/15

  • Context engineering, multi-turn agents, failure modes
  • Motivational examples from last year’s student feedback:
    • How to make a (semi-autonomous) agent that improves its behavior from feedback
  • Project Work Time!
    • Deliverable: what’s your project? What does success look like (sketch an example)? What are two next steps you can take to make progress?

Friday 4/17

  • Project scoping time
  • Review (see Summary)

Week 6: Training Pipeline and Projects

How are modern LLMs trained? This week covers the training pipeline (pretraining → SFT → RLHF) and Quiz 2.

Objectives

By the end of this week you should be able to:

  • Describe how autoregressive generation works

  • Describe how generative adversarial networks work

  • Describe how diffusion models work

  • Compare and contrast the process and results of generating sequences using three different algorithms: greedy generation, sampling, and beam search.

  • Explain the concept of a generator network.

  • Explain how a Generative Adversarial Network is trained.
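The three generation algorithms in the objectives above can be contrasted on a toy bigram model (all probabilities are made up). Notice how greedy decoding picks the locally best token at each step and, in this example, misses the globally most likely sequence, which beam search finds.

```python
import math
import random

random.seed(0)

# Toy conditional distribution: next-token probabilities given only the
# previous token (a bigram model with made-up numbers). "end" terminates.
P = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.4, "dog": 0.35, "end": 0.25},
    "a":   {"dog": 0.9, "end": 0.1},
    "cat": {"end": 1.0},
    "dog": {"end": 1.0},
}

def greedy(start="<s>"):
    seq = [start]
    while seq[-1] != "end":
        dist = P[seq[-1]]
        seq.append(max(dist, key=dist.get))  # always the single best token
    return seq

def sample(start="<s>"):
    seq = [start]
    while seq[-1] != "end":
        tokens, probs = zip(*P[seq[-1]].items())
        seq.append(random.choices(tokens, probs)[0])  # draw from the dist
    return seq

def beam_search(start="<s>", width=2):
    beams = [([start], 0.0)]  # (sequence, log-probability)
    while any(seq[-1] != "end" for seq, _ in beams):
        candidates = []
        for seq, lp in beams:
            if seq[-1] == "end":
                candidates.append((seq, lp))
                continue
            for tok, p in P[seq[-1]].items():
                candidates.append((seq + [tok], lp + math.log(p)))
        # Keep only the `width` highest-scoring partial sequences.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:width]
    return beams[0][0]

print(greedy(), beam_search())
```

Here greedy takes “the” (0.6) and ends up with a sequence of probability 0.6 × 0.4 = 0.24, while beam search keeps the “a” branch alive and finds “a dog” with probability 0.4 × 0.9 = 0.36. Sampling, by contrast, produces a different sequence each run, roughly in proportion to the sequence probabilities.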

Key Questions
  • How is noise useful for diffusion models for image generation?
  • Why does diffusion require multiple time steps?
Terms
  • Multimodal: Combining multiple modes of input, such as text, images, and sound.
  • Denoising Diffusion: Sampling from a conditional distribution by iteratively denoising a noisy sample.
  • Embedding: A vector representation of an object, such as a caption or an image. (In some contexts, also called latent space or latent representation.)
  • Manifold: The high-probability region of a distribution
    • e.g., almost all possible images look like random noise; the manifold is the region of images that look like images in the training data
Readings

Another nice reading (about training data): Models All the Way Down, through Part 3. (The site’s server seems to be down at the moment.)

Resources

(some of these are drawn from the replies to this X/twitter post)

Also, many people refer to this blog post by Lilian Weng.

Monday 4/20

  • Training pipeline overview: pre-training → SFT → RLHF
    • Tülu blog post reading discussion
  • Handout TODO from 2025_04_23 - Conversation documents, multimodal models, and LLM reliability

Wednesday 4/22

  • Quiz 2 (proctored — Ken traveling): Looking for evidence of learning about:
    • [MS-LLM-API]
    • [MS-LLM-Prompting]
    • [NC-SelfAttention] (deeper)
    • [NC-TransformerDataFlow]
    • [MS-LLM-Advanced]
    • [MS-LLM-Train]

Friday 4/24

  • Project Work Time

Week 7

Monday 4/27

  • Diffusion and multimodal models (~20 min conceptual overview)
    • Generative Models, Diffusion Slides
  • Handout TODO from 2025_04_28 - Tokenization and Scaling Review

Wednesday 4/29

  • Interpretability and Explanation (slides)
  • Quiz 3

Friday 5/1

  • Handout TODO from 2025_05_02 - Wrap-Up

  • Discussion 3 sharing, comparing our survey to the results of the Pew Research survey

  • Fairness and Wrap-Up slides

Final Discussion topics

  • Personal Impacts
    • How AI has impacted my life in the past few years. For better? For worse?
    • How AI has impacted the lives of people unlike me.
    • How AI might impact our lives in the next 5 years.
  • Development
    • Something useful or cool that has recently become possible thanks to AI.
    • What are some things that AI systems are already better than humans at?
    • What are some things that humans are still much better at than AI systems?
  • Broader impacts
    • Is AI good for the environment? Bad?
    • Is AI good for society? Bad?
    • Is AI good for human creativity? Is it bad?
  • Christian perspective
    • Something that Christians should consider as people who consume AI-powered products
    • …As people who use AI in their organizations
    • …as people who develop AI?

Additional items:

Set Up Your Feeds
Schedule - CS375