CS 376 Review, Part 1

Inside the LLM + How We Know It Works

Opening

Quiz 2 Review

  • Going over Quiz 2
  • Common patterns / what to redo
  • Questions before we move on?

From CS 375 to CS 376

The 375 Frame: Tuneable Machines Playing Optimization Games

Tuneable Machines (TM)

A computer made of tweakable math:

  • Arrays of numbers flow through layers
  • Each layer: multiply, add, squish
  • Billions of knobs

Optimization Games (OG)

A game defining what “better” means:

  • World where the agent acts and gets feedback
  • Score function
  • Strategy for improving

What 376 Did to the Frame

We zoomed in on one tuneable machine and two games stacked on top of each other.

The machine: the Transformer LLM

  • Tokens in, tokens out
  • Self-attention does the heavy lifting
  • Generates one token at a time, autoregressively

The games:

  1. Next-token prediction at internet scale (self-supervised)
  2. Generate approved things (SFT → RLHF/RLVR)

Then deploy: tools, conversations, evaluations, and the things that go wrong.

Part 1: Inside the LLM

OG-LLM-Tokenization: Text → Numbers

Before any math: text becomes a sequence of token IDs.

"unhappiness"  →  ["un", "happiness"]  →  [346, 12489]
  • Subword tokenization (BPE, WordPiece): a middle ground between character-level and word-level
  • The vocabulary is fixed at training time
  • Tokenizer choices have downstream consequences:
    • Multilingual bias (English gets fewer tokens per word)
    • Counting / spelling quirks (“how many r’s in strawberry?”)
    • Cost: APIs charge per token
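
A minimal sketch using the tiktoken library (a real BPE tokenizer, assuming it's installed; the exact split and IDs depend on the vocabulary you load, so don't expect the example numbers above):

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")   # one real BPE vocabulary
ids = enc.encode("unhappiness")              # text -> token IDs
pieces = [enc.decode([i]) for i in ids]      # IDs -> subword strings
print(ids, pieces)                           # split varies by vocabulary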

Hands-on: u08n1 (tokenization)

TM-LLM-Embeddings: Two Kinds of Vectors

Every token lives as a vector — twice.

  • Token embeddings (input lookup): a fixed table, one row per vocab entry. Same vector for “bank” whether it’s a riverbank or a financial institution.
  • Context embeddings (after transformer layers): each token’s vector now incorporates information from surrounding tokens. Now “bank” knows which sense it is.

The final context embedding gets dotted with token embeddings to produce next-token logits.
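
A numpy sketch of that unembed step, with illustrative sizes (the same embedding table is reused here, as the data-flow diagram in the next section shows):

import numpy as np

V, D = 50000, 768                      # vocab size, hidden dim (illustrative)
W_E = np.random.randn(V, D)            # token-embedding table, one row per token
context = np.random.randn(D)           # final context embedding at one position

logits = W_E @ context                 # dot with every token embedding -> (V,)
probs = np.exp(logits - logits.max())  # softmax -> next-token probabilities
probs /= probs.sum()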

Hands-on: u09n1 (lm-logits), u10n1 (implement-transformer)

TM-SelfAttention: How Tokens Talk to Each Other

Each token produces three vectors per attention head:

  • Query (Q): “what am I looking for?”
  • Key (K): “what do I offer?”
  • Value (V): “what do I contribute if attended to?”

attention(Q, K, V) = softmax(Q @ K.T / sqrt(d)) @ V

  • Dot products of Q with all K’s → attention weights (softmax → probabilities)
  • Weighted sum of V’s → updated token representation
  • Multi-head: different heads attend to different relationships
  • Causal mask: a token can’t attend to future tokens (autoregressive)
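
A direct numpy transcription of the formula above, for a single head (illustrative; real implementations add learned Q/K/V projections, batching, and multiple heads):

import numpy as np

def attention(Q, K, V, causal=True):
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # query-key dot products, (T, T)
    if causal:
        scores[np.triu_indices(T, k=1)] = -np.inf   # block attention to future tokens
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax -> attention weights
    return weights @ V                              # weighted sum of values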

Hands-on: u10s1 (attention by hand), u13n3 (self-attention)

TM-TransformerDataFlow: Shapes End to End

tokens          (B, T)
  ↓ token embedding
embeddings      (B, T, D)
  ↓ × N transformer blocks: [attention + MLP + residual]
context_emb     (B, T, D)
  ↓ unembed (× token embedding matrix)
logits          (B, T, V)
  ↓ softmax over V
next-token probs

  • B = batch, T = sequence length, D = hidden dim, V = vocab size, N = layer count
  • Residual connections preserve information across layers
  • Every layer keeps the (B, T, D) shape — tokens travel through, getting refined
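
A toy shape-check in numpy (the "block" here is a stand-in, not a real transformer layer; the point is that (B, T, D) is preserved until the unembed):

import numpy as np

B, T, D, V, N = 2, 16, 64, 1000, 4              # tiny illustrative sizes
W_E = np.random.randn(V, D)                     # token-embedding table

tokens = np.random.randint(0, V, (B, T))        # (B, T) token IDs
x = W_E[tokens]                                 # (B, T, D) after embedding lookup
for _ in range(N):
    x = x + np.tanh(x @ np.random.randn(D, D))  # stand-in block + residual, still (B, T, D)
logits = x @ W_E.T                              # (B, T, V) after unembed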

Hands-on: u10n1 (implement-transformer), u13n2 (seq-models)

TM-LLM-Generation: Sampling One Token at a Time

tokens = tokenize(prompt)
while not done:
    logits = model(tokens)[-1]            # last position
    probs = softmax(logits / temperature)
    next_token = sample(probs)            # or argmax
    tokens.append(next_token)

  • Temperature: higher = flatter distribution = more random
  • Top-k / top-p: restrict sampling to high-probability tokens
  • Each step attends over the full prior context; KV caching reuses past keys and values instead of recomputing them at every step
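
A sketch of temperature plus top-k sampling for one step (top-p is similar but thresholds cumulative probability instead of rank):

import numpy as np

def sample_next(logits, temperature=1.0, top_k=50):
    logits = logits / temperature                        # higher temperature = flatter distribution
    cutoff = np.sort(logits)[-top_k]                     # k-th largest logit (assumes top_k <= vocab size)
    logits = np.where(logits < cutoff, -np.inf, logits)  # keep only the top k
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return np.random.choice(len(probs), p=probs)         # sample one token ID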

Hands-on: u09n2 (decoding), u11n1 (prompt-engineering)

Part 2: Training the LLM

OG-LLM-Train: The Three-Stage Pipeline

State-of-the-art dialogue models like Qwen or OLMo are built in stages:

  1. Pretraining (self-supervised): predict next token on internet-scale text
    • Learns spelling, syntax, facts, reasoning patterns, biases
    • Scaling laws: more data + more compute → predictably better models
  2. Supervised fine-tuning (SFT): train on curated (prompt, response) pairs
    • Teaches instruction-following format
  3. Preference alignment (RLHF or RLVR): optimize for human or verifier preferences
    • Teaches which of several plausible responses to prefer

Each stage uses different data and accomplishes something different.
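
The stage-1 objective, concretely, is cross-entropy on the next token. A minimal numpy sketch (targets are just the inputs shifted by one position, which is what makes it self-supervised):

import numpy as np

def next_token_loss(logits, targets):
    # logits: (T, V) model outputs; targets: (T,) the actual next tokens
    shifted = logits - logits.max(axis=-1, keepdims=True)             # numerical stability
    logp = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -logp[np.arange(len(targets)), targets].mean()            # average negative log-likelihood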

See: w13.qmd Training Pipeline (last Friday’s deck)

OG-SelfSupervised + OG-Theory-Feedback

Two ideas that let the pipeline work:

Self-supervised learning: the labels come from the data itself. No human annotation per example. This is how pretraining can use trillions of tokens.

Reward feedback: once the model can produce plausible text, we steer it with a reward signal:

  • Human rankings → reward model → RL (RLHF)
  • Verifiable rewards (math, code tests) → direct RL (RLVR)
  • The reward signal is the hard part — reward hacking and specification gaming

Part 3: Using the LLM

OG-LLM-ContextAndTools: Building Real Systems

A deployed LLM rarely sees just a user message. The prompt is assembled:

[system message] + [few-shot examples] + [retrieved docs / RAG] +
  [tool definitions] + [conversation history] + [user message]

  • Tool calling: model emits a structured request → system runs the tool → result goes back into the conversation
  • RAG: retrieve relevant documents, paste into context — model grounds answers in the docs rather than its training memory
  • Failure modes: hallucinating instead of using retrieved context, irrelevant tool results, prompt injection, blown context window
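
A minimal sketch of the tool-calling loop described above (llm() is a hypothetical stand-in for whatever chat-completion API you use; the message format is illustrative):

SYSTEM_PROMPT = "You are a helpful assistant. Use tools when needed."

def run_agent(user_msg, tools, llm):
    # tools maps names to Python callables; llm(messages, tools) returns a dict
    messages = [{"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": user_msg}]
    while True:
        reply = llm(messages, tools)                  # model answers or emits a structured tool request
        if "tool_call" not in reply:
            return reply["content"]                   # plain answer: done
        call = reply["tool_call"]
        result = tools[call["name"]](**call["args"])  # the system runs the tool
        messages.append({"role": "tool", "content": str(result)})  # result re-enters the context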

Hands-on: u11n1 (prompt-engineering), HW agentic-RAG project

Part 4: Trusting the LLM

OG-LLM-Eval: Why Evaluation Is Hard

Evaluating a generative system is much harder than evaluating a classifier.

  • No single ground-truth output for “write me an email apologizing for a missed deadline”
  • Approaches we’ve seen:
    • Perplexity: how well does the model predict held-out text? (training-time metric, not user metric)
    • Task-specific metrics: BLEU, ROUGE, exact-match — narrow, gameable
    • Human preference: gold standard, but expensive and inconsistent
    • LLM-as-judge: scalable, but inherits the judge’s biases
    • Behavioral / red-team evals: probe for specific failure modes
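
Perplexity, concretely: exponentiate the average negative log-likelihood the model assigns to held-out tokens. A one-line sketch:

import numpy as np

def perplexity(logprobs):
    # logprobs: log-probability the model assigned to each held-out token
    return np.exp(-np.mean(logprobs))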

Discussion Break: When Evaluations Lie

The Layers Problem

Headline claim: “Our model gets 92% on benchmark X.”

What does that actually tell you?

Layer                 What we measure                  What we can’t measure
Benchmark numbers     Performance on a fixed test set  Whether the test set matches deployment
Real user experience  Aggregate satisfaction           Failures concentrated in minority cases
Trust / reliability   Mean-time-between-failure        Failure modes you didn’t anticipate
Second-order impact   Direct effects                   What the system does to orgs, jobs, society

Each layer hides things from the layer above it.

Goodhart’s Law

“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law

Examples we’ve seen / will see:

  • Optimizing clicks → clickbait
  • Optimizing engagement → addiction-by-design
  • Optimizing benchmark accuracy → benchmark gaming, train/test leakage
  • Optimizing “passes the test suite” (RLVR) → delete the tests, hardcode the answers
  • Optimizing human preference (RLHF) → sycophancy

The better your optimizer, the harder it games whatever you measured.

Horror Stories as Worked Examples

Each one is a worked example of a layer breaking down:

  • Replit agent deletes prod database — passed code-correctness evals; failed on “should you do this at all”
  • Customer-service bots going off the rails (e.g., Air Canada chatbot inventing a refund policy the airline was held to honor) — passed scripted evals; failed on adversarial users
  • Recidivism scoring (COMPAS) — passed accuracy evals; failed on disparate error rates across groups
  • Recommendation systems that radicalize — passed engagement evals; failed on second-order effects on users and society

Pick one. Which layer of the table broke? What evaluation would have caught it?

Overall-LLM-Failures

Common failure modes worth naming:

  • Hallucination / confabulation: fluent, confident, wrong
  • Bias amplification: training data skews → model output skews more
  • Sycophancy: telling users what they want to hear
  • Prompt injection: untrusted input hijacks the system prompt
  • Inconsistency across turns: same question, different answer
  • Reward hacking: optimizing the proxy, not the goal

For any deployed system: which of these can hurt your users, and how would you detect them?

Friday

Course Evaluations

Please fill out your course evaluations if you haven’t already.

I’ll be waiting in the hall. I’ll come back in 10 minutes, or send someone to get me sooner if everyone finishes early.

Where We Are

Monday: how the machine works and how to critique its evaluations.

Today: what to do with that understanding.

  • Practical takeaways for the people who build and deploy these systems
  • Who gets affected — stakeholders, feedback loops, recourse
  • Sharing from Discussion 376.4 (Fans and Skeptics)
  • A Christian framing on intelligence, data, and our calling
  • A closing commission

Part 5: Practical Takeaways

Discern Demo from Reliable System

Demos optimize for the golden path. Reliable systems survive the long tail.

Questions to ask before you trust a system (or let someone else trust it):

  • What’s the actual input distribution it’ll see? How does that compare to the demo?
  • What happens at the edges — unusual users, adversarial inputs, rare cases?
  • When it fails, who’s holding the bag? The user? A third party? You?
  • Was the headline number (“92% on benchmark X”) measured on something that matches deployment?

You will be the people in the room who can tell the difference. That’s a stewardship.

Data ≠ Reality ≠ Ideal

Three things students often collapse into one:

Data: what we collected. Sampling biases, labeling shortcuts, historical baggage, what was cheap to gather.

Reality: the world the model deploys into. Broader, messier, shifting over time, full of people the data didn’t see.

Ideal: the world we want. Just, flourishing, dignified — not what the data shows, not even what reality currently is.

Two gaps that matter:

  • Data → Reality: representativeness, distribution shift. ML 101.
  • Reality → Ideal: a model that perfectly fits reality can still faithfully replicate injustice.

Evaluate Quant + Qual — Then Critique the Eval

Three moves, in order:

  1. Quantitative: benchmarks, accuracy, perplexity, win-rates. Reproducible, scalable, narrow.
  2. Qualitative: read the outputs. Look at the failures. Talk to actual users. The number won’t tell you how it’s wrong.
  3. Critique the eval itself: every metric has a Goodhart shadow. Ask “what would gaming this metric look like?” before you trust the score.

A 92% number with no error analysis is a vibe, not a result.

Be Careful What You Hand Over

The more autonomy + the more irreversibility = the more careful you have to be.

A reversibility ladder:

Rung                      Example                            When it’s appropriate
Read-only / suggest       Autocomplete, search, draft        Almost always
Do-with-confirmation      “I’ll email this — approve?”       Most user-facing actions
Autonomous, reversible    File edits in a sandbox            Bounded, observable scope
Autonomous, irreversible  rm -rf, prod DB writes, payments   Almost never

Recent reminders abound.

Please. Be careful.

Agent Security: “Rule of Two”

https://ai.meta.com/blog/practical-ai-agent-security/

Part 6: Who Gets Affected

Stakeholders Beyond the User

When we evaluate “did the system work?” we usually mean for the user. But:

  • The user: who you designed for
  • Third parties affected by the output: people in a screening pipeline, people the user is talking about, people downstream of an automated decision
  • People in the training data: their words, faces, code, art — did they consent? do they benefit?
  • Bystanders: the rest of the internet, downstream models trained on this output, the information ecosystem
  • Future users: people who’ll inherit the precedent we set
  • The org / society: jobs reshaped, norms shifted, dependencies created

A system can serve the user perfectly and harm everyone else.

Feedback Loops & Recourse

Two failure modes that don’t show up in a single-shot eval:

Distribution mismatch over time

  • Train on yesterday, deploy tomorrow, the world keeps moving
  • The system’s own outputs change the world it operates in (recsys radicalization, predictive-policing reinforcement, LLM training data full of LLM outputs)

Recourse

  • When the system gets it wrong about a person, can that person appeal? Correct? Even know?
  • COMPAS-style risk scores, automated content moderation, AI-screened résumés: the answer is usually “no, and they often don’t know they were judged”

A trustworthy system tells you when it’s uncertain, why it decided what it did, and how to push back.

Part 7: Discussion 376.4 — Fans and Skeptics

What Surfaced in the Forum

Sharing from your posts and replies.

  • Where did fans and skeptics agree?
  • Where was the sharpest disagreement?
  • Did anyone change their mind?

Compared to the Public

Pew Research: How the U.S. public and AI experts view AI

The two groups disagree on a lot. After a semester of 376:

  • Where do you line up with the experts?
  • Where do you line up with the public?
  • Where do you disagree with both?

Final-Discussion Prompts

Personal

  • AI’s impact on me, the past few years — better? worse?
  • AI’s impact on people unlike me
  • The next 5 years

Development

  • Something cool that became possible recently
  • What AI already beats humans at
  • What humans still beat AI at

Broader

  • Environment: net good? bad?
  • Society: net good? bad?
  • Human creativity: helped? harmed?

Part 8: A Christian Framing

God Made a Data-Rich World

The world is a rich environment God gave us to explore and learn from.

  • Romans 1: God’s nature is seen in what was made
  • Psalm 19: the heavens declare — data, signal, pattern
  • Romans 10: faith comes by hearing — receiving from outside ourselves

Our senses, our pattern-recognition, our ability to learn — all of these should ultimately lead us to worship.

God Expects Us to Use Our Intelligence

Part of bearing God’s image.

We are commanded — over and over — to:

  • see what’s actually there
  • hear what’s actually said
  • remember what God has done
  • use our minds (“love the Lord your God with all your … mind”)

Building tools to extend our seeing, hearing, and remembering is legitimate work. It can be done well or done badly. It cannot be done indifferently.

But We Have Misused Our Intelligence

Patterns the semester has surfaced:

  • Selfish accumulation — of data, of compute, of power, of attention
  • Engagement over love — designing for clicks rather than thoughtfulness
  • Surveillance replacing relationship — quantifying people instead of knowing them
  • Over-quantification — flattening dignity into metrics
  • Exploitation — of labelers, of artists, of the people in the training data, of the environment

How do we treat those who can’t repay us?

That question is a fairness test for any system we build.

Jesus Redeems Our Technology

If the misuse of intelligence is a real problem, the response isn’t to abandon the work — it’s to redeem it.

What that looks like in our context:

  • Serve others — especially those who can’t audit the system themselves
  • Hold organizations accountable — including the ones we work for
  • Protect what’s vulnerable — environment, minorities, future generations
  • Care across cultures — beyond who looks like us

How else?

When the Machine Imitates Us

We have built systems that imitate human language, judgment, and creativity.

That raises questions earlier technologies didn’t:

  • Imago Dei: only humans bear God’s image — what does it mean to mass-produce convincing imitations of human voice?
  • Stewardship: we’re responsible for what these systems do and for what they shape us into
  • Shalom: not just absence of harm, but flourishing in right relationship — does this system pull toward that, or away?

Sit with these. They don’t have one-line answers.