Inside the LLM + How We Know It Works
Tuneable Machines (TM)
A computer made of tweakable math:
Optimization Games (OG)
A game defining what “better” means:
We zoomed in on one tuneable machine and two games stacked on top of each other.
The machine: the Transformer LLM
The games:
Then deploy: tools, conversations, evaluations, and the things that go wrong.
Before any math: text becomes a sequence of token IDs.
"unhappiness" → ["un", "happiness"] → [346, 12489]
Hands-on: u08n1 (tokenization)
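A minimal sketch of the text → pieces → IDs step, assuming the Hugging Face `transformers` library and the GPT-2 tokenizer (the notebooks may use a different tokenizer, so the exact pieces and IDs will differ):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

pieces = tok.tokenize("unhappiness")   # subword pieces; the exact split depends on the tokenizer
ids = tok.encode("unhappiness")        # the corresponding integer token IDs
print(pieces, ids)
print(tok.decode(ids))                 # decoding the IDs recovers the original string
```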
Every token lives as a vector, twice: once in the input embedding and once in the unembedding that scores it at the output.
The final context embedding gets dotted with token embeddings to produce next-token logits.
Hands-on: u09n1 (lm-logits), u10n1 (implement-transformer)
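A minimal NumPy sketch of that dot product; the shapes and names (`W_E`, `context_emb`, `D`, `V`) are illustrative, not the notebooks' exact code:

```python
import numpy as np

D, V = 16, 100                     # embedding dim, vocabulary size
W_E = np.random.randn(V, D)        # token embedding matrix: one row per vocabulary token
context_emb = np.random.randn(D)   # final context embedding at one position

logits = W_E @ context_emb         # (V,): one score per possible next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # softmax turns logits into next-token probabilities
```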
Each token produces three vectors per attention head:
attention(Q, K, V) = softmax(Q @ K.T / sqrt(d)) @ V
Hands-on: u10s1 (attention by hand), u13n3 (self-attention)
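A single-head sketch of that formula in NumPy (causal masking and the output projection are omitted; `T` and `d` are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (T, T): how much each query position attends to each key
    weights = softmax(scores, axis=-1)
    return weights @ V               # (T, d): weighted mix of value vectors

T, d = 5, 8
Q, K, V = (np.random.randn(T, d) for _ in range(3))
out = attention(Q, K, V)             # (5, 8), same shape as V
```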
tokens (B, T)
↓ token embedding
embeddings (B, T, D)
↓ × N transformer blocks: [attention + MLP + residual]
context_emb (B, T, D)
↓ unembed (× token embedding matrix)
logits (B, T, V)
↓ softmax over V
next-token probs
Hands-on: u10n1 (implement-transformer), u13n2 (seq-models)
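A heavily simplified PyTorch skeleton of that diagram. It is a sketch, not the u10n1 implementation: one head per block, no LayerNorm, no positional embeddings, no causal mask, and the unembedding reuses the embedding matrix.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, D):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, x):                      # x: (B, T, D)
        a, _ = self.attn(x, x, x)              # self-attention
        x = x + a                              # residual connection
        return x + self.mlp(x)                 # MLP + residual

class TinyLM(nn.Module):
    def __init__(self, V, D, N):
        super().__init__()
        self.embed = nn.Embedding(V, D)        # token embedding
        self.blocks = nn.Sequential(*[TinyBlock(D) for _ in range(N)])

    def forward(self, tokens):                 # tokens: (B, T)
        x = self.blocks(self.embed(tokens))    # context embeddings (B, T, D)
        return x @ self.embed.weight.T         # unembed with the same matrix -> logits (B, T, V)

logits = TinyLM(V=100, D=32, N=2)(torch.randint(0, 100, (1, 7)))   # (1, 7, 100)
```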
Hands-on: u09n2 (decoding), u11n1 (prompt-engineering)
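A sketch of two decoding strategies, given logits for the final position; names are illustrative, not the exact u09n2 code:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=100)            # stand-in for next-token logits over a 100-token vocab

# Greedy decoding: always take the most probable token.
greedy_id = int(np.argmax(logits))

# Temperature sampling: flatten (temperature > 1) or sharpen (temperature < 1)
# the distribution, then draw a token at random.
temperature = 0.8
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()
sampled_id = int(rng.choice(len(probs), p=probs))
```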
State-of-the-art dialogue models like Qwen or OLMo are built in stages:
Each stage uses different data and accomplishes something different.
See: w13.qmd Training Pipeline (last Friday’s deck)
Two ideas that let the pipeline work:
Self-supervised learning: the labels come from the data itself. No human annotation per example. This is how pretraining can use trillions of tokens (see the sketch after this list).
Reward feedback: once the model can produce plausible text, we steer it with a reward signal:
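A minimal sketch of how self-supervision makes its own labels for next-token prediction; the variable names and the random stand-in logits are illustrative:

```python
import torch
import torch.nn.functional as F

tokens = torch.tensor([5, 17, 42, 8, 99])   # one training sequence of token IDs
inputs = tokens[:-1]                        # the model sees:    [5, 17, 42, 8]
targets = tokens[1:]                        # and must predict:  [17, 42, 8, 99]

# Given model logits of shape (T, V), the pretraining loss is just
# cross-entropy between the predictions and the shifted targets.
V = 100
logits = torch.randn(len(inputs), V)        # stand-in for the model's output
loss = F.cross_entropy(logits, targets)
```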
A deployed LLM rarely sees just a user message. The prompt is assembled:
[system message] + [few-shot examples] + [retrieved docs / RAG] +
[tool definitions] + [conversation history] + [user message]
Hands-on: u11n1 (prompt-engineering), HW agentic-RAG project
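A sketch of that assembly as a list of chat messages, in the style many chat APIs use; the roles, field names, and ordering vary by model and provider:

```python
system = {"role": "system", "content": "You are a careful assistant."}
few_shot = [
    {"role": "user", "content": "Example question ..."},
    {"role": "assistant", "content": "Example answer ..."},
]
retrieved = {"role": "system", "content": "Relevant documents:\n- doc 1\n- doc 2"}
history = [
    {"role": "user", "content": "Earlier message ..."},
    {"role": "assistant", "content": "Earlier reply ..."},
]
user = {"role": "user", "content": "The actual question."}

messages = [system, *few_shot, retrieved, *history, user]
# Tool definitions are usually passed alongside `messages`, not inside them.
```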
Evaluating a generative system is much harder than evaluating a classifier.
Headline claim: “Our model gets 92% on benchmark X.”
What does that actually tell you?
| Layer | What we measure | What we can’t measure |
|---|---|---|
| Benchmark numbers | Performance on a fixed test set | Whether the test set matches deployment |
| Real user experience | Aggregate satisfaction | Failures concentrated in minority cases |
| Trust / reliability | Mean-time-between-failure | Failure modes you didn’t anticipate |
| Second-order impact | Direct effects | What the system does to orgs, jobs, society |
Each layer hides things from the layer above it.
“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law
Examples we’ve seen / will see:
The better your optimizer, the harder it games whatever you measured.
Each one is a worked example of a layer breaking down:
Pick one. Which layer of the table broke? What evaluation would have caught it?
Common failure modes worth naming:
For any deployed system: which of these can hurt your users, and what would you do to detect them?
Please fill out your course evaluations if you haven’t already.
I’ll wait in the hall and come back in 10 minutes; if everyone finishes sooner, send someone to come get me.
Monday: how the machine works and how to critique its evaluations.
Today: what to do with that understanding.
Demos optimize for the golden path. Reliable systems survive the long tail.
Questions to ask before you trust a system (or let someone else trust it):
You will be the people in the room who can tell the difference. That’s a stewardship.
Three things students often collapse into one:
Data
What we collected.
Sampling biases, labeling shortcuts, historical baggage, what was cheap to gather.
Reality
The world the model deploys into.
Broader, messier, shifting over time, full of people the data didn’t see.
Ideal
The world we want.
Just, flourishing, dignified — not what the data shows, not even what reality currently is.
Two gaps that matter:
Three moves, in order:
A 92% number with no error analysis is a vibe, not a result.
The more autonomy and the more irreversibility, the more careful you have to be.
A reversibility ladder:
| Rung | Example | When it’s appropriate |
|---|---|---|
| Read-only / suggest | Autocomplete, search, draft | Almost always |
| Do-with-confirmation | “I’ll email this — approve?” | Most user-facing actions |
| Autonomous, reversible | File edits in a sandbox | Bounded, observable scope |
| Autonomous, irreversible | rm -rf, prod DB writes, payments | Almost never |
Recent reminders:
Please. Be careful.
https://ai.meta.com/blog/practical-ai-agent-security/
When we evaluate “did the system work?” we usually mean for the user. But:
A system can serve the user perfectly and harm everyone else.
Two failure modes that don’t show up in a single-shot eval:
Distribution mismatch over time
Recourse
A trustworthy system tells you when it’s uncertain, why it decided what it did, and how to push back.
Sharing from your posts and replies.
Pew Research: How the U.S. public and AI experts view AI
Two groups disagree on a lot. After a semester of 376:
Personal
Development
Broader
The world is a rich environment God gave us to explore and learn from.
Our senses, our pattern-recognition, our ability to learn — all of these should ultimately lead us to worship.
Part of bearing God’s image.
We are commanded — over and over — to:
Building tools to extend our seeing, hearing, and remembering is legitimate work. It can be done well or done badly. It cannot be done indifferently.
Patterns the semester has surfaced:
How do we treat those who can’t repay us?
That question is a fairness test for any system we build.
If the misuse of intelligence is a real problem, the response isn’t to abandon the work — it’s to redeem it.
What that looks like in our context:
How else?
We have built systems that imitate human language, judgment, and creativity.
That raises questions earlier technologies didn’t:
Sit with these. They don’t have one-line answers.