This page lists all course objectives with their assessment criteria.
Detailed Objectives
Tuneable Machines
[TM-LLM-Embeddings] (376)
I can identify the various types of embeddings in a language model (token, hidden-state, output, key, and query) and explain their purpose.
Criteria
- I can distinguish between token embeddings (input lookup table) and context embeddings (output of transformer layers).
- I can explain how context embeddings incorporate information from other tokens via attention.
- I can describe how the final context embedding is used to produce next-token logits (dot product with token embeddings).
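As a minimal illustration of the last criterion, here is a NumPy sketch with illustrative sizes and random numbers standing in for trained weights (it assumes tied input/output embeddings, as in GPT-2):

```python
import numpy as np

# How a final context embedding becomes next-token logits: a dot product
# with the token-embedding lookup table.
vocab_size, d_model = 50_000, 768
rng = np.random.default_rng(0)

E = rng.normal(size=(vocab_size, d_model))  # token-embedding lookup table
token_ids = np.array([464, 3290, 318])      # hypothetical token IDs

x = E[token_ids]                        # (3, d_model): input token embeddings
h = x + 0.1 * rng.normal(size=x.shape)  # stand-in for the transformer layers; real
                                        # context embeddings mix in other tokens via attention
logits = h[-1] @ E.T                    # (vocab_size,): a score for every possible next token
next_id = int(np.argmax(logits))        # greedy choice of the next token
```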
Assessed in
- notebook Logits and Perplexity in Causal Language Models
- activity Lab 376.2: Logits in Causal Language Models
- notebook Implementing self-attention
- activity Lab 376.3: Implementing Self-Attention
- quiz Quiz 1
- quiz Quiz 2
- quiz Quiz 3
[TM-SelfAttention] (376)
I can explain the purpose and components of a self-attention layer (key, query, value; multi-head attention; positional encodings).
Criteria
- I can explain what queries, keys, and values represent and how they interact (dot product → softmax → weighted sum).
- I can compute a simple attention calculation by hand given Q, K, V vectors.
- I can explain why multi-head attention is useful (different heads can attend to different relationship types).
- I can explain why causal masking is needed for autoregressive models.
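These criteria are small enough to sketch directly. A single-head version in NumPy (illustrative sizes and random weights; a real layer adds learned per-head projections and an output projection):

```python
import numpy as np

def causal_self_attention(X, Wq, Wk, Wv):
    """Single head: dot product -> causal mask -> softmax -> weighted sum."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv           # queries, keys, values
    scores = Q @ K.T / np.sqrt(Q.shape[-1])    # (T, T): each query scored against each key
    mask = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores[mask] = -np.inf                     # causal mask: no attending to future tokens
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)         # softmax: weights sum to 1 per query
    return w @ V                               # weighted sum of values

T, d = 4, 8
rng = np.random.default_rng(0)
X = rng.normal(size=(T, d))                    # T token embeddings
out = causal_self_attention(X, *(rng.normal(size=(d, d)) for _ in range(3)))
```

Multi-head attention runs several such heads in parallel on lower-dimensional projections and concatenates the results, letting different heads track different relationship types.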
Assessed in
- notebook Implementing self-attention
- notebook Self-Attention By Hand (in Code)
- activity Lab 376.3: Implementing Self-Attention
- notebook Programming with Self-Attention
- quiz Quiz 1
- quiz Quiz 2
- quiz Quiz 3
[TM-TransformerDataFlow] (376)
I can identify the shapes of data flowing through a Transformer-style language model.
Criteria
- Given model hyperparameters (layers, heads, hidden dim, vocab size, seq length), I can state the shape of key tensors (embeddings, Q/K/V, attention weights, logits).
- I can trace how a single token's representation changes from input embedding through attention and MLP layers to output logits.
- I can explain the role of residual connections in preserving information across transformer layers.
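A sketch of the key tensor shapes for one illustrative, GPT-2-small-like configuration (batch dimension omitted):

```python
# Hypothetical hyperparameters: layers, heads, hidden dim, vocab size, seq length.
n_layers, n_heads, d_model, vocab_size, T = 12, 12, 768, 50_257, 1024
d_head = d_model // n_heads  # 64 dimensions per head

shapes = {
    "token embeddings":  (T, d_model),
    "Q, K, V per head":  (n_heads, T, d_head),
    "attention weights": (n_heads, T, T),   # grows quadratically with T
    "residual stream":   (T, d_model),      # same shape through all n_layers layers,
                                            # so residual connections can add into it
    "logits":            (T, vocab_size),
}
for name, shape in shapes.items():
    print(f"{name:18} {shape}")
```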
Assessed in
- notebook Implementing self-attention
- notebook Self-Attention By Hand (in Code)
- activity Lab 376.3: Implementing Self-Attention
- notebook Programming with Self-Attention
- quiz Quiz 2
- quiz Quiz 3
[TM-LLM-Generation] (376)
I can extract and interpret model outputs (token logits) and use them to generate text.
Criteria
- I can explain the autoregressive generation loop (prompt → logits → sample → append → repeat).
- I can explain how temperature affects the sampling distribution (higher = more random).
- I can write pseudocode for a basic text generation algorithm given a model and tokenizer.
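A sketch of the last criterion, assuming a `model` that maps a list of token IDs to next-token logits and a `tokenizer` with `encode`/`decode` (both hypothetical interfaces):

```python
import numpy as np

def generate(model, tokenizer, prompt, max_new_tokens=50, temperature=1.0):
    """Autoregressive loop: prompt -> logits -> sample -> append -> repeat."""
    ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        logits = model(ids) / temperature   # >1 flattens the distribution (more random),
                                            # <1 sharpens it (more deterministic)
        probs = np.exp(logits - logits.max())
        probs /= probs.sum()                # softmax over the vocabulary
        next_id = np.random.choice(len(probs), p=probs)  # sample one token
        ids.append(int(next_id))            # append; the longer sequence is fed back in
    return tokenizer.decode(ids)
```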
Assessed in
- activity Exploring Language Models
- notebook Logits and Perplexity in Causal Language Models
- notebook Translation as Language Modeling
- activity Lab 376.2: Logits in Causal Language Models
- activity Exercise 376.2: Perplexity
- activity Lab 376.3: Implementing Self-Attention
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- notebook Models for Sequence Data
- quiz Quiz 1
[TM-LLM-Compute] (376)
I can analyze how computational requirements scale with model size and context length, and reason about the feasibility of training and running generative AI systems.
Criteria
- I can estimate memory requirements for a model given its parameter count and numerical precision.
- I can explain how compute scales with parameters and sequence length.
- I can describe at least two optimization techniques (e.g., quantization, KV caching) and their trade-offs.
- I can evaluate whether a given model can run on specific hardware (e.g., 12GB GPU, laptop CPU, cloud API).
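A back-of-the-envelope sketch of the first two criteria (the byte counts and the 6ND rule of thumb are standard approximations; real deployments also need memory for activations and the KV cache):

```python
def weight_memory_gb(n_params, bytes_per_param):
    """Rough weight-only memory estimate: parameter count x bytes per parameter."""
    return n_params * bytes_per_param / 1e9

# A 7B-parameter model at different numerical precisions:
for precision, nbytes in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("4-bit", 0.5)]:
    print(f"7B @ {precision}: ~{weight_memory_gb(7e9, nbytes):.1f} GB")
# fp16 (~14 GB) does not fit a 12 GB GPU; int8 (~7 GB) likely does.

# Rule of thumb for training compute: ~6 * N * D FLOPs
# (N = parameters, D = training tokens).
train_flops = 6 * 7e9 * 2e12   # 7B params on 2T tokens: ~8.4e22 FLOPs
```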
Assessed in
- (not yet assessed)
Optimization Games
[OG-Eval-Experiment] (both)
I can design and execute valid experiments to evaluate model performance.
Criteria
- I partition data appropriately (train/val/test) before any model fitting.
- I can explain why we need held-out data and what goes wrong without it.
- I can select metrics appropriate to the task and stakeholder needs.
- I can interpret learning curves (loss/metric vs epoch) to understand training dynamics.
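A minimal sketch of the first criterion with scikit-learn's `train_test_split` (synthetic data; the split fractions are illustrative):

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X, y = rng.normal(size=(1000, 5)), rng.integers(0, 2, size=1000)

# Partition once, BEFORE any model fitting or hyperparameter tuning.
X_trainval, X_test, y_trainval, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)   # test set: touched once, at the very end
X_train, X_val, y_train, y_val = train_test_split(
    X_trainval, y_trainval, test_size=0.25, random_state=0)  # val set: for model selection
# Result: 60% train / 20% validation / 20% test.
```

Fitting, tuning, or even choosing preprocessing based on the test set leaks information into it, which is what makes the held-out estimate optimistic.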
Assessed in
- activity Exercise 376.1: LM Evaluation
- activity Exercise 376.2: Perplexity
- activity Optional Extension: Architectural Experimentation
[OG-LLM-APIs] optional (both)
I can apply LLM APIs (such as the Chat Completions API) to build AI-powered applications.
Criteria
- I can construct appropriate API calls with system and user messages.
- I can process and use the model's response in an application.
- I can identify tasks where an LLM API is and is not appropriate.
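A minimal sketch using the OpenAI Python SDK's Chat Completions interface (the model name and message contents are illustrative, and an `OPENAI_API_KEY` environment variable is assumed):

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a concise course advisor."},
        {"role": "user", "content": "Suggest one elective for an ML-focused student."},
    ],
)
print(response.choices[0].message.content)  # the reply your application goes on to use
```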
Assessed in
- (not yet assessed)
[OG-LLM-Tokenization] (376)
I can explain how inputs get chunked into tokens, how outputs are generated token by token, and how this affects usage of the model.
Criteria
- I can describe the tokenization pipeline (text → token IDs → embeddings) and its reverse.
- I can explain why subword tokenization is used instead of character-level or word-level approaches.
- I can identify at least one consequence of tokenization choices (e.g., multilingual bias, difficulty counting syllables, spelling quirks).
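A minimal sketch of the pipeline and its reverse, using the GPT-2 tokenizer from Hugging Face `transformers` (the tokenizer choice is illustrative):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
text = "Tokenization has consequences."
ids = tok.encode(text)                  # text -> token IDs
print(ids)
print(tok.convert_ids_to_tokens(ids))   # the subword pieces behind those IDs
                                        # (often fragments, not whole words)
print(tok.decode(ids))                  # token IDs -> text (the reverse direction)
```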
Assessed in
- notebook Language Model Inputs and Outputs
- activity CS 376 Lab 1: Language Model Inputs and Outputs
- notebook Translation as Language Modeling
- activity Lab 376.2: Logits in Causal Language Models
- activity Exercise 376.2: Perplexity
- activity Optional Extension: Architectural Experimentation
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 1
- quiz Quiz 2
[OG-LLM-ConversationAsDocument] (376)
I can explain how a conversation with an LLM can be represented as a carefully structured document, including system messages, tool calls, and multimodal inputs and outputs.
Criteria
- I can describe how system, user, and assistant turns are serialized into a single token sequence.
- I can explain how function/tool calls fit into the conversation document format.
- I can explain how this framing allows a next-token predictor to behave as a dialogue agent.
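A minimal sketch with `apply_chat_template` from Hugging Face `transformers` (the model whose tokenizer is loaded here is an illustrative choice; every chat model ships its own template):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2 + 2?"},
]
doc = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
print(doc)  # one flat document with special role markers; the model simply
            # predicts the tokens that follow the assistant marker
```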
Assessed in
- notebook Language Model Inputs and Outputs
- activity CS 376 Lab 1: Language Model Inputs and Outputs
- activity Exploring Language Models
- notebook Prompt Engineering
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 3
[OG-LLM-Prompting] (376)
I can critique and refine prompts to improve the quality of responses from an LLM.
Criteria
- I can identify why a given prompt produces poor results (ambiguity, missing context, wrong framing).
- I can apply at least two prompting strategies (e.g., role assignment, few-shot examples, chain-of-thought, structured output constraints).
- I can explain the difference between system, user, and assistant messages and when to use each.
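A minimal sketch combining two of these strategies, role assignment and few-shot examples (all message contents are illustrative):

```python
# Few-shot turns are written as if the assistant had already answered correctly,
# so the model continues the established pattern.
messages = [
    {"role": "system", "content": "You classify support tickets as BUG or FEATURE. "
                                  "Reply with exactly one word."},  # role + output constraint
    {"role": "user", "content": "The app crashes when I upload a photo."},
    {"role": "assistant", "content": "BUG"},
    {"role": "user", "content": "Please add a dark mode."},
    {"role": "assistant", "content": "FEATURE"},
    {"role": "user", "content": "Login fails with error 500."},  # the actual query
]
```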
Assessed in
- (not yet assessed)
[OG-LLM-ContextAndTools] (376)
I can construct effective inputs for LLM-powered systems and use tool calling to connect models to external information.
Criteria
- I can trace how a prompt is assembled from components (system message, examples, tool results, conversation history) and explain why each part is there.
- I can build an LLM-powered system that uses structured outputs and at least one tool call.
- I can identify when adding context (examples, retrieved docs) helps vs. when it wastes the context window or distracts the model.
- I can diagnose failures in an LLM-powered system (e.g., hallucinating instead of using retrieved context, irrelevant tool results, prompt injection).
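A minimal sketch of one tool-call round trip with the OpenAI Chat Completions tools interface (the tool name, schema, and model are illustrative; a robust system must also handle responses that contain no tool call):

```python
import json
from openai import OpenAI

client = OpenAI()
tools = [{
    "type": "function",
    "function": {
        "name": "get_course_info",  # hypothetical tool
        "description": "Look up a course by its ID.",
        "parameters": {
            "type": "object",
            "properties": {"course_id": {"type": "string"}},
            "required": ["course_id"],
        },
    },
}]
messages = [{"role": "user", "content": "Who teaches CS 376?"}]
resp = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)

call = resp.choices[0].message.tool_calls[0]     # the model asks to use the tool
args = json.loads(call.function.arguments)
result = {"course_id": args["course_id"], "instructor": "..."}  # run the real lookup here
messages += [
    resp.choices[0].message,                     # the assistant turn containing the call
    {"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)},
]
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages)
print(final.choices[0].message.content)          # answer grounded in the tool result
```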
Assessed in
- activity Exercise 376.1: LM Evaluation
- activity Optional Extension: Architectural Experimentation
- notebook Prompt Engineering
- activity Exercise 376.3: Course Advisor Bot
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 3
[OG-LLM-Eval] (376)
I can apply and critically analyze evaluation strategies for generative models.
Criteria
- I can explain why evaluating generative systems is harder than evaluating classifiers.
- I can describe at least two evaluation approaches (e.g., perplexity, human preference, task-specific metrics, LLM-as-judge).
- I can identify limitations of automatic metrics for open-ended generation.
- I can design a basic evaluation strategy for a specific LLM application.
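A minimal LLM-as-judge sketch, one of the approaches named above (the judge prompt and model are illustrative, and judge scores should themselves be spot-checked against human judgments before being trusted):

```python
from openai import OpenAI

client = OpenAI()

def judge(question: str, answer: str) -> int:
    """Ask a model to grade an answer 1-5; a real judge needs robust output parsing."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Rate the answer to the question from 1 (bad) "
                                          "to 5 (excellent). Reply with only the number."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"},
        ],
    )
    return int(resp.choices[0].message.content.strip())
```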
Assessed in
- activity Discussion 376.1: Probing LLM Sycophancy
- activity Exercise 376.1: LM Evaluation
- activity Exercise 376.2: Perplexity
- activity Exercise 376.3: Course Advisor Bot
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 2
- quiz Quiz 3
[OG-SelfSupervised] (376)
I can explain how self-supervised learning can be used to train foundation models on massive datasets without labeled data.
Criteria
- I can explain next-token prediction as a self-supervised task (the "labels" come from the data itself).
- I can connect cross-entropy loss and perplexity to the idea of prediction quality and "surprise."
- I can explain why self-supervised pretraining enables capabilities that weren't explicitly trained for.
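A minimal PyTorch sketch of the first two criteria, with random logits standing in for a trained model:

```python
import torch
import torch.nn.functional as F

vocab_size, T = 100, 16
tokens = torch.randint(0, vocab_size, (T + 1,))  # a "document": no human labels anywhere
inputs, targets = tokens[:-1], tokens[1:]        # the label for position t is token t+1

logits = torch.randn(T, vocab_size)              # stand-in for model(inputs)
loss = F.cross_entropy(logits, targets)          # average "surprise" at the true next token
perplexity = loss.exp()                          # ~vocab_size for random guessing; lower is better
```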
Assessed in
- activity Exploring Language Models
- notebook Logits and Perplexity in Causal Language Models
- activity Lab 376.2: Logits in Causal Language Models
- activity Lab 376.3: Implementing Self-Attention
- activity Discussion 376.2: Training Data as Stewardship
- quiz Quiz 1
- quiz Quiz 2
[OG-LLM-Train] (376)
I can describe the overall process of training a state-of-the-art dialogue LLM such as Llama or OLMo.
Criteria
- I can identify the three main stages of training (pretraining, supervised fine-tuning, RLHF/RLVR) and what each accomplishes.
- I can explain what data is used at each stage and where human input enters the process.
- I can describe how the model's behavior changes across stages (mimicry → instruction following → aligned responses).
- I can explain the basic insight of scaling laws (more data + more compute → predictably better models) and its practical implications.
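A sketch of the parametric form behind the scaling-law criterion, in the style of the Chinchilla fit (the constants below are illustrative placeholders, not the published values):

```python
def predicted_loss(N, D, E=1.7, A=400.0, B=400.0, alpha=0.34, beta=0.28):
    """Loss falls predictably as parameters N and training tokens D grow."""
    return E + A / N**alpha + B / D**beta

for N, D in [(1e9, 20e9), (7e9, 140e9), (70e9, 1.4e12)]:
    print(f"N={N:.0e}, D={D:.0e} -> predicted loss {predicted_loss(N, D):.2f}")
```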
Assessed in
- notebook Prompt Engineering
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 3
[OG-Theory-Feedback] (376)
I can explain how feedback tuning can improve the performance and reliability of a model or agent.
Criteria
- I can explain the basic RLHF loop (generate samples → collect preferences → train reward model → optimize policy).
- I can explain what a reward signal is and give examples for different tasks (human preference, code correctness, factual accuracy).
- I can articulate why the reward signal is the hard part (reward hacking, specification gaming, proxy objectives).
- I can describe how RLVR (RL with verifiable rewards) simplifies the reward problem for certain tasks.
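A toy, runnable sketch of the loop's shape (the "human", the reward signal, and the policy update are all drastic stand-ins for real systems, and the reward-model stage is collapsed into a direct label):

```python
import random

def generate(p_b):                 # 1. generate a sample from the current policy
    return "B" if random.random() < p_b else "A"

def preference_reward(sample):     # 2-3. stand-in for preferences + reward model
    return 1.0 if sample == "B" else 0.0   # the "human" always prefers response B

p_b = 0.1                          # policy: probability of producing B
for _ in range(200):
    s = generate(p_b)
    r = preference_reward(s)
    # 4. crude policy-gradient step: push toward above-baseline-reward behavior
    p_b += 0.01 * (r - 0.5) * (1 if s == "B" else -1)
    p_b = min(max(p_b, 0.01), 0.99)
print(f"P(produce B) after tuning: {p_b:.2f}")   # drifts toward the preferred response
```

RLVR replaces the preference-collection and reward-model steps with a programmatic check (unit tests, exact-match answers), which removes reward-model error for tasks where correctness can be verified.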
Overall
[Overall-Impact] (both)
I can analyze real-world situations to identify potential negative impacts of AI systems.
Criteria
- Given a scenario, I can identify at least two distinct stakeholder groups who might be affected differently.
- I can articulate how training data distribution might differ from deployment conditions.
- I can identify feedback loops where model outputs might affect future training data or user behavior.
- I can flag concerns that warrant careful analysis before deployment.
Assessed in
- (not yet assessed)
[Overall-Dispositions] optional (both)
I demonstrate growth mindset and integrity in my AI learning and practice.
Criteria
- I can identify a specific instance where I persisted through difficulty in this course.
- I can describe how I use AI tools in ways that support rather than replace my learning.
- I can articulate my own boundaries for AI assistance and why I hold them.
[Overall-PhilNarrative] optional (both)
I can engage with philosophical questions raised by AI systems.
Criteria
- I can articulate at least one philosophical question that AI raises (e.g., consciousness, intelligence, creativity, agency).
- I can distinguish between what AI systems do and what those capabilities might mean.
- I can identify assumptions embedded in how we talk about AI (e.g., "AI thinks", "AI understands").
Assessed in
- (not yet assessed)
[Overall-LLM-Failures] (376)
I can identify common types of failures in LLMs, such as hallucination (confabulation) and bias.
Criteria
- I can explain what hallucination/confabulation is and why LLMs are prone to it.
- I can identify at least two other failure modes (e.g., bias amplification, prompt injection, sycophancy, inconsistency across turns).
- I can describe strategies for mitigating specific failure modes in a given application context.
Assessed in
- activity Discussion 376.1: Probing LLM Sycophancy
- activity Exercise 376.1: LM Evaluation
- activity Exercise 376.3: Course Advisor Bot
- activity Discussion 376.3: When Agents Go Wrong
- activity Lab 376.4: Dialogue Agents, Prompt Engineering, Retrieval-Augmented Generation, and Tool Use
- quiz Quiz 3