Inside the LLM + How We Know It Works
Tuneable Machines (TM)
A computer made of tweakable math:
Optimization Games (OG)
A game defining what “better” means:
We zoomed in on one tuneable machine and two games stacked on top of each other.
The machine: the Transformer LLM
The games:
Then deploy: tools, conversations, evaluations, and the things that go wrong.
Before any math: text becomes a sequence of token IDs.
"unhappiness" → ["un", "happiness"] → [346, 12489]
Hands-on: u08n1 (tokenization)
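A minimal sketch of the text → pieces → IDs step, assuming the Hugging Face `transformers` library and the GPT-2 tokenizer (the notebooks may use a different tokenizer, so the exact pieces and IDs will differ):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")

pieces = tok.tokenize("unhappiness")   # subword pieces; the exact split depends on the tokenizer
ids = tok.encode("unhappiness")        # the corresponding integer token IDs
print(pieces, ids)
print(tok.decode(ids))                 # decoding the IDs recovers the original string
```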
Every token lives as a vector, twice: once in the input embedding and once in the unembedding that scores it at the output.
The final context embedding gets dotted with token embeddings to produce next-token logits.
Hands-on: u09n1 (lm-logits), u10n1 (implement-transformer)
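A minimal NumPy sketch of that dot product; the shapes and names (`W_E`, `context_emb`, `D`, `V`) are illustrative, not the notebooks' exact code:

```python
import numpy as np

D, V = 16, 100                     # embedding dim, vocabulary size
W_E = np.random.randn(V, D)        # token embedding matrix: one row per vocabulary token
context_emb = np.random.randn(D)   # final context embedding at one position

logits = W_E @ context_emb         # (V,): one score per possible next token
probs = np.exp(logits - logits.max())
probs /= probs.sum()               # softmax turns logits into next-token probabilities
```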
Each token produces three vectors per attention head:
attention(Q, K, V) = softmax(Q @ K.T / sqrt(d)) @ V
Hands-on: u10s1 (attention by hand), u13n3 (self-attention)
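A single-head sketch of that formula in NumPy (causal masking and the output projection are omitted; `T` and `d` are illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)    # (T, T): how much each query position attends to each key
    weights = softmax(scores, axis=-1)
    return weights @ V               # (T, d): weighted mix of value vectors

T, d = 5, 8
Q, K, V = (np.random.randn(T, d) for _ in range(3))
out = attention(Q, K, V)             # (5, 8), same shape as V
```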
tokens (B, T)
↓ token embedding
embeddings (B, T, D)
↓ × N transformer blocks: [attention + MLP + residual]
context_emb (B, T, D)
↓ unembed (× token embedding matrix)
logits (B, T, V)
↓ softmax over V
next-token probs
Hands-on: u10n1 (implement-transformer), u13n2 (seq-models)
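A heavily simplified PyTorch skeleton of that diagram. It is a sketch, not the u10n1 implementation: one head per block, no LayerNorm, no positional embeddings, no causal mask, and the unembedding reuses the embedding matrix.

```python
import torch
import torch.nn as nn

class TinyBlock(nn.Module):
    def __init__(self, D):
        super().__init__()
        self.attn = nn.MultiheadAttention(D, num_heads=1, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(D, 4 * D), nn.GELU(), nn.Linear(4 * D, D))

    def forward(self, x):                      # x: (B, T, D)
        a, _ = self.attn(x, x, x)              # self-attention
        x = x + a                              # residual connection
        return x + self.mlp(x)                 # MLP + residual

class TinyLM(nn.Module):
    def __init__(self, V, D, N):
        super().__init__()
        self.embed = nn.Embedding(V, D)        # token embedding
        self.blocks = nn.Sequential(*[TinyBlock(D) for _ in range(N)])

    def forward(self, tokens):                 # tokens: (B, T)
        x = self.blocks(self.embed(tokens))    # context embeddings (B, T, D)
        return x @ self.embed.weight.T         # unembed with the same matrix -> logits (B, T, V)

logits = TinyLM(V=100, D=32, N=2)(torch.randint(0, 100, (1, 7)))   # (1, 7, 100)
```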
Hands-on: u09n2 (decoding), u11n1 (prompt-engineering)
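A sketch of two decoding strategies, given logits for the final position; names are illustrative, not the exact u09n2 code:

```python
import numpy as np

rng = np.random.default_rng(0)
logits = rng.normal(size=100)            # stand-in for next-token logits over a 100-token vocab

# Greedy decoding: always take the most probable token.
greedy_id = int(np.argmax(logits))

# Temperature sampling: flatten (temperature > 1) or sharpen (temperature < 1)
# the distribution, then draw a token at random.
temperature = 0.8
scaled = logits / temperature
probs = np.exp(scaled - scaled.max())
probs /= probs.sum()
sampled_id = int(rng.choice(len(probs), p=probs))
```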
State-of-the-art dialogue models like Qwen or OLMo are built in stages:
Each stage uses different data and accomplishes something different.
See: w13.qmd Training Pipeline (last Friday’s deck)
Two ideas that let the pipeline work:
Self-supervised learning: the labels come from the data itself. No human annotation per example. This is how pretraining can use trillions of tokens (see the sketch after this list).
Reward feedback: once the model can produce plausible text, we steer it with a reward signal:
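A minimal sketch of how self-supervision makes its own labels for next-token prediction; the variable names and the random stand-in logits are illustrative:

```python
import torch
import torch.nn.functional as F

tokens = torch.tensor([5, 17, 42, 8, 99])   # one training sequence of token IDs
inputs = tokens[:-1]                        # the model sees:    [5, 17, 42, 8]
targets = tokens[1:]                        # and must predict:  [17, 42, 8, 99]

# Given model logits of shape (T, V), the pretraining loss is just
# cross-entropy between the predictions and the shifted targets.
V = 100
logits = torch.randn(len(inputs), V)        # stand-in for the model's output
loss = F.cross_entropy(logits, targets)
```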
A deployed LLM rarely sees just a user message. The prompt is assembled:
[system message] + [few-shot examples] + [retrieved docs / RAG] +
[tool definitions] + [conversation history] + [user message]
Hands-on: u11n1 (prompt-engineering), HW agentic-RAG project
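A sketch of that assembly as a list of chat messages, in the style many chat APIs use; the roles, field names, and ordering vary by model and provider:

```python
system = {"role": "system", "content": "You are a careful assistant."}
few_shot = [
    {"role": "user", "content": "Example question ..."},
    {"role": "assistant", "content": "Example answer ..."},
]
retrieved = {"role": "system", "content": "Relevant documents:\n- doc 1\n- doc 2"}
history = [
    {"role": "user", "content": "Earlier message ..."},
    {"role": "assistant", "content": "Earlier reply ..."},
]
user = {"role": "user", "content": "The actual question."}

messages = [system, *few_shot, retrieved, *history, user]
# Tool definitions are usually passed alongside `messages`, not inside them.
```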
Evaluating a generative system is much harder than evaluating a classifier.
Headline claim: “Our model gets 92% on benchmark X.”
What does that actually tell you?
| Layer | What we measure | What we can’t measure |
|---|---|---|
| Benchmark numbers | Performance on a fixed test set | Whether the test set matches deployment |
| Real user experience | Aggregate satisfaction | Failures concentrated in minority cases |
| Trust / reliability | Mean-time-between-failure | Failure modes you didn’t anticipate |
| Second-order impact | Direct effects | What the system does to orgs, jobs, society |
Each layer hides things from the layer above it.
“When a measure becomes a target, it ceases to be a good measure.” — Goodhart’s Law
Examples we’ve seen / will see:
The better your optimizer, the harder it games whatever you measured.
Each one is a worked example of a layer breaking down:
Pick one. Which layer of the table broke? What evaluation would have caught it?
Common failure modes worth naming:
For any deployed system: which of these can hurt your users, and what would you do to detect them?
Please fill out your course evaluations if you haven’t already.
I’ll wait in the hall and come back in 10 minutes; if everyone finishes sooner, send someone to come get me.
Monday: how the machine works and how to critique its evaluations.
Today: what to do with that understanding.
Demos optimize for the golden path. Reliable systems survive the long tail.
Questions to ask before you trust a system (or let someone else trust it):
You will be the people in the room who can tell the difference. That’s a stewardship.
Three things students often collapse into one:
Data
What we collected.
Sampling biases, labeling shortcuts, historical baggage, what was cheap to gather.
Reality
The world the model deploys into.
Broader, messier, shifting over time, full of people the data didn’t see.
Ideal
The world we want.
Just, flourishing, dignified — not what the data shows, not even what reality currently is.
Two gaps that matter:
Three moves, in order:
A 92% number with no error analysis is a vibe, not a result.
The more autonomy and the more irreversibility, the more careful you have to be.
A reversibility ladder:
| Rung | Example | When it’s appropriate |
|---|---|---|
| Read-only / suggest | Autocomplete, search, draft | Almost always |
| Do-with-confirmation | “I’ll email this — approve?” | Most user-facing actions |
| Autonomous, reversible | File edits in a sandbox | Bounded, observable scope |
| Autonomous, irreversible | rm -rf, prod DB writes, payments | Almost never |
Recent reminders:
Please. Be careful.
https://ai.meta.com/blog/practical-ai-agent-security/
When we evaluate “did the system work?” we usually mean for the user. But:
A system can serve the user perfectly and harm everyone else.
Two failure modes that don’t show up in a single-shot eval:
Distribution mismatch over time
Recourse
A trustworthy system tells you when it’s uncertain, why it decided what it did, and how to push back.
Sharing from your posts and replies.
Pew Research: How the U.S. public and AI experts view AI
Two groups disagree on a lot. After a semester of 376:
Personal
Development
Broader
The world is a rich environment God gave us to explore and learn from.
Our senses, our pattern-recognition, our ability to learn — all of these should ultimately lead us to worship.
Part of bearing God’s image.
We are commanded — over and over — to:
Building tools to extend our seeing, hearing, and remembering is legitimate work. It can be done well or done badly. It cannot be done indifferently.
Patterns the semester has surfaced:
How do we treat those who can’t repay us?
That question is a fairness test for any system we build.
If the misuse of intelligence is a real problem, the response isn’t to abandon the work — it’s to redeem it.
What that looks like in our context:
How else?
We have built systems that imitate human language, judgment, and creativity.
That raises questions earlier technologies didn’t:
Sit with these. They don’t have one-line answers.