Tuneable Machines Playing Optimization Games
Go and tell this people:
“‘Be ever hearing, but never understanding;
be ever seeing, but never perceiving.’
Make the heart of this people calloused;
make their ears dull
and close their eyes.
Otherwise they might see with their eyes,
hear with their ears,
understand with their hearts,
and turn and be healed.”
(Isaiah 6:9-10, NIV)
We studied AI by building one from scratch: a neural network classifier.
Tuneable Machines (TM)
A computer made of tweakable math:
Key deep-dive: the MLP classifier
Extension: the CNN we built for image classification = MLP but with convolutional layers in the feature extractor body.
Optimization Games (OG)
A game that defines what “better” means:
Key deep-dive: supervised learning: mimicking a given set of (input, correct answer) pairs
The simplest “deep” network — and the building block of everything else.
Input → [Linear] → [ReLU] → [Linear] → [Softmax] → Probabilities → Loss
(784)    784→100             100→10
Every modern neural network is a variation on this theme: linear transformations, nonlinearities, and a loss function.
A linear layer computes y = x @ W + b:
- Weights (`@`): each output is a weighted combination of all inputs
- Bias (`+ b`): shifts the output
- Shape rule: (batch, n_in) @ (n_in, n_out) → (batch, n_out)
Hands-on: u03n1 (manual linear regression), u04n3 (PyTorch logistic regression)
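The shape rule can be checked directly. A minimal NumPy sketch (the random weights are stand-ins for illustration, not trained values):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))   # a batch of 32 flattened images
W = rng.normal(size=(784, 100))  # weights: one column per output feature
b = np.zeros(100)                # biases: one per output feature

y = x @ W + b                    # (32, 784) @ (784, 100) -> (32, 100)
print(y.shape)                   # (32, 100)
```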
ReLU: y = max(0, x) — chop off the negative part.
Without it, stacking linear layers just gives another linear layer. With it, the network can learn conditional behavior: “activate this feature only when the input has this property.”


In practice, variants like GELU and SiLU/Swish avoid ReLU’s hard zero, giving smoother gradients. We use ReLU for simplicity; the intuition transfers.
Hands-on: u05n00 (ReLU interactive)
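ReLU is one line of NumPy (the input values here are made up for illustration):

```python
import numpy as np

z = np.array([-2.0, -0.5, 0.0, 1.5, 3.0])
a = np.maximum(0, z)   # ReLU: negatives become 0, positives pass through
print(a)               # zeros for the negatives, unchanged positives
```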
Three steps: exponentiate → sum → divide.
logits: [2.0, 1.0, 0.1]
exp: [7.39, 2.72, 1.11] (make positive)
softmax: [0.66, 0.24, 0.10] (normalize to sum to 1)
Hands-on: u04n2 (softmax), u05s2 (softmax deep-dive)
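The three steps in code, reproducing the numbers above:

```python
import numpy as np

logits = np.array([2.0, 1.0, 0.1])
exps = np.exp(logits)        # 1. exponentiate (make positive)
total = exps.sum()           # 2. sum
probs = exps / total         # 3. divide (normalize to sum to 1)
print(probs.round(2))        # [0.66 0.24 0.1 ]
```

(In practice you subtract `logits.max()` before exponentiating to avoid overflow; it doesn't change the result.)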
Given an image of a digit (flattened to a vector), compute the network’s prediction and loss:
1. z1 = x @ W1 + b1 → shape (100,) — these are pre-activations
2. a1 = max(0, z1) → shape (100,) — these are activations
3. z2 = a1 @ W2 + b2 → shape (10,) — these are logits
4. probs = softmax(z2) → shape (10,) — these are probabilities
5. loss = -log(probs[correct_class]) — this is cross-entropy

We show one hidden layer; real networks stack many (repeating steps 2–3). The pattern is always: linear → nonlinearity → linear → nonlinearity → … → output.
Key vocabulary: weights, biases, activations, logits, probabilities, loss
Hands-on: u06n1 (trace MNIST classifier step by step)
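The whole trace as a NumPy sketch. The random weights and the choice of class 3 as the correct answer are illustrative assumptions, not course code:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=784)                  # one flattened 28x28 digit
W1, b1 = rng.normal(size=(784, 100)) * 0.01, np.zeros(100)
W2, b2 = rng.normal(size=(100, 10)) * 0.01, np.zeros(10)

z1 = x @ W1 + b1                          # pre-activations, shape (100,)
a1 = np.maximum(0, z1)                    # activations, shape (100,)
z2 = a1 @ W2 + b2                         # logits, shape (10,)
z2 = z2 - z2.max()                        # stability trick before softmax
probs = np.exp(z2) / np.exp(z2).sum()     # probabilities, shape (10,)
loss = -np.log(probs[3])                  # cross-entropy, assuming class 3 is correct
print(probs.shape, loss > 0)
```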
The dot product of two vectors: multiply pairs, then sum.
a = [1, 0, 3]
b = [2, 5, 1]
a · b = 1×2 + 0×5 + 3×1 = 5
Three ways to think about it:
Caveat: dot product conflates direction and magnitude — a huge vector gets a high dot product with everything. Cosine similarity fixes this by normalizing to unit vectors first, isolating direction.
Hands-on: u04n1 (manual multi-output linear regression uses @ throughout)
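Both the dot product and the cosine fix from the caveat, using the vectors above:

```python
import numpy as np

a = np.array([1, 0, 3])
b = np.array([2, 5, 1])

dot = a @ b                                          # 1*2 + 0*5 + 3*1 = 5
cos = dot / (np.linalg.norm(a) * np.linalg.norm(b))  # normalize: direction only
print(dot, round(cos, 2))
```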
Shapes tell you what the data means:
- (batch_size, n_features) — a batch of feature vectors
- (n_features, n_classes) — a weight matrix
- (batch_size, n_classes) — predictions for a batch

The matmul shape rule: (m, n) @ (n, p) → (m, p) — inner dims must match.
Most bugs in neural network code are shape mismatches. Trace shapes through the network to debug.
Hands-on: u02n1 (PyTorch basics), u04n1 (multi-output regression)
             Parameters               Activations
             ──────────               ───────────
Input ──→    W1:(784,100)  ──→    z1:(B,100)
(B,784)      b1:(100,)                │
                                      ▼ ReLU
                                  a1:(B,100)
                                      │
             W2:(100,10)   ──→    z2:(B,10)      ← logits
             b2:(10,)                 │
                                      ▼ Softmax
                                  probs:(B,10)   ← probabilities
                                      │
y_true:(B,) ─────────────────→    loss: scalar   ← cross-entropy
Hands-on: u06n1 (trace MNIST), Quiz 2
TM-Autograd: You write the forward pass; PyTorch computes the backward pass.
# Forward pass (you write this)
y_pred = model(x)
loss = F.cross_entropy(y_pred, y_true)
# Backward pass (PyTorch does this)
loss.backward() # computes gradients of loss w.r.t. all parameters
# The gradient of each parameter tells you:
# "which direction would INCREASE the loss?"
# So we go the opposite direction.
optimizer.step() # updates parameters to DECREASE loss

- requires_grad=True on parameters tells PyTorch to track operations
- loss.backward() walks the computation graph in reverse (chain rule)
- Gradients accumulate, so optimizer.zero_grad() clears them each step

Hands-on: u06n2 (compute gradients in PyTorch)
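A minimal autograd demo, using a made-up one-parameter "loss" (not course code) so the gradient can be checked by hand:

```python
import torch

w = torch.tensor(3.0, requires_grad=True)  # track operations on w
loss = (w - 1.0) ** 2                      # loss = (w - 1)^2

loss.backward()                            # chain rule, walked in reverse
print(w.grad)                              # d(loss)/dw = 2*(w - 1) = 4.0
```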
for epoch in range(n_epochs):
    for x_batch, y_batch in dataloader:
        y_pred = model(x_batch)          # 1. Forward pass
        loss = loss_fn(y_pred, y_batch)  # 2. Compute loss
        optimizer.zero_grad()            # 3. Backward pass
        loss.backward()                  #
        optimizer.step()                 # 4. Update parameters

    # 5. Evaluate on validation set
    val_loss = 0.
    for x_val, y_val in val_loader:
        with torch.no_grad():
            y_val_pred = model(x_val)
            val_loss += loss_fn(y_val_pred, y_val).item()

This is the same loop for every neural network — from MNIST to GPT.
Hands-on: u06n1 (MNIST in PyTorch), u06s2 (MNIST with augmentation)
(Bridges both pillars: this is what connects the machine to the game.)
Gradient = direction of steepest increase of the loss.
Gradient descent = move parameters in the opposite direction.
Parameters ──[subtract lr × gradient]──→ Updated Parameters
Beyond vanilla SGD:
Most practitioners default to Adam.
Hands-on: u03n1 (manual gradient descent for linear regression)
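A sketch of manual gradient descent on a toy linear-regression problem, in the spirit of u03n1 but not its actual code (the data, learning rate, and step count are assumptions):

```python
import numpy as np

# Toy data: y = 2x + 1 plus a little noise
rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, size=100)
y = 2 * x + 1 + rng.normal(scale=0.1, size=100)

w, b, lr = 0.0, 0.0, 0.1
for step in range(200):
    y_pred = w * x + b
    # Gradients of MSE loss w.r.t. w and b (derived by hand)
    grad_w = 2 * ((y_pred - y) * x).mean()
    grad_b = 2 * (y_pred - y).mean()
    # Move in the OPPOSITE direction of the gradient
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 1), round(b, 1))  # ≈ 2.0 and 1.0
```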
TM-Embeddings: Neural networks represent data as vectors where geometric relationships encode meaning.
Examples: word embeddings (king - man + woman ≈ queen), image embeddings (CNNs), user/movie embeddings (recommender systems)
Hands-on: u07n1 (image embeddings), u09n1 (word embeddings - in 376)
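The famous analogy in code, with made-up 3-dimensional vectors (real embeddings have hundreds of dimensions and are learned, not hand-picked):

```python
import numpy as np

# Toy "embeddings" constructed so the analogy works — for illustration only
vecs = {
    "king":  np.array([0.9, 0.8, 0.1]),
    "man":   np.array([0.1, 0.9, 0.1]),
    "woman": np.array([0.1, 0.1, 0.9]),
    "queen": np.array([0.9, 0.0, 0.9]),
}

def cosine(u, v):
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

target = vecs["king"] - vecs["man"] + vecs["woman"]      # king - man + woman
best = max(vecs, key=lambda word: cosine(vecs[word], target))
print(best)                                              # queen
```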
TM-RepresentationLearning: Hidden layers transform the data to make the task easier.
This is why transfer learning works: the features learned for one task (ImageNet classification) are useful for many other tasks.
Hands-on: u05n1 (image classifier with pretrained feature extractor)
Convolution: Specialized layers for spatial data (images).
Two key improvements over fully-connected layers:
A CNN = convolution layers + pooling + fully-connected layers at the end.
CNN Explainer interactive visualization: poloclub.github.io/cnn-explainer/
The game has simple rules:
The machine learns by mimicry — imitating the training data.
To frame a supervised learning problem:
Example: “Detect pneumonia from chest X-rays” → inputs = image, target = {pneumonia, healthy}, loss = cross-entropy, metric = sensitivity (TPR) — because missing a case is worse than a false alarm. Note: the loss (cross-entropy) is what the model optimizes; the metric (sensitivity) is what stakeholders care about. They’re often different!
Hands-on: u02n2 (sklearn regression), u03n2 (sklearn classification)
MSE (regression): average of squared errors. Penalizes big mistakes heavily.
\[\text{MSE} = \frac{1}{n}\sum_i (y_i - \hat{y}_i)^2\]
Cross-entropy (classification): average surprise at the correct answer.
\[\text{CE} = -\frac{1}{n}\sum_i \log p(\text{correct class}_i)\]
Why not accuracy as a loss? Its gradient is almost always zero — the model can’t learn from it.
Hands-on: u03n1 (MSE by hand), u04n2 (cross-entropy), u04n3 (PyTorch classification)
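Both losses in NumPy, on made-up numbers:

```python
import numpy as np

# MSE for regression: average of squared errors
y_true = np.array([3.0, 5.0, 2.0])
y_pred = np.array([2.5, 5.0, 4.0])
mse = ((y_true - y_pred) ** 2).mean()      # (0.25 + 0 + 4) / 3

# Cross-entropy for classification: average surprise at the correct answer
probs = np.array([[0.7, 0.2, 0.1],         # predicted distributions
                  [0.1, 0.8, 0.1]])
correct = np.array([0, 1])                 # correct class per example
ce = -np.log(probs[np.arange(2), correct]).mean()
print(round(mse, 3), round(ce, 3))
```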
OG-Eval-Experiment: Why held-out data matters.
Learning curves (loss vs. epoch) tell the story of training:
Hands-on: every lab from u04 onward includes train/val curves
Split before you fit. Always.
# 1. Split data FIRST
X_train, X_val, y_train, y_val = train_test_split(X, y)
# 2. Train on training set only
model.fit(X_train, y_train)
# 3. Evaluate on validation set
val_score = model.score(X_val, y_val)
# 4. Spot-check predictions on specific examples
model.predict(X_val[:5]) # Do these make sense?

Validation performance is a better estimate of real-world performance than training performance.
Overfitting: model memorizes training data but doesn’t generalize.
Underfitting: model can’t even fit the training data.
Bias-variance tradeoff:
Interactive: Bias-Variance · Double Descent (MLU Explain)
Hands-on: u06s1 (bias-variance), u06s2 (augmentation and regularization)
The model can only learn what’s in the training data.
A model that’s 99% accurate on ImageNet may fail on photos from a different camera, lighting, or culture.
Hands-on: u06s2 (augmentation), HW1 (real-world image classification)
OG-ProblemFraming-Paradigms: Not all learning is mimicry.
| Paradigm | Learning signal | Example | What it can learn |
|---|---|---|---|
| Supervised | Labeled examples | Image classification | To imitate the labels |
| Self-supervised | Predict hidden parts of data | GPT (predict next token) | Patterns in data |
| Reinforcement | Rewards from interaction | Game playing, robotics | Strategies beyond imitation |
OG-Pretrained: Standing on the shoulders of giants.
The body + head pattern:
[Pretrained feature extractor] → [Your task-specific classifier]
     (frozen or fine-tuned)          (trained from scratch)
Hands-on: u05n1 (pretrained CNN for image classification), HW1
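The pattern as a PyTorch sketch. The two-layer body here is a toy stand-in for a real pretrained network (actual code would load, e.g., a torchvision model); the freezing pattern is what matters:

```python
import torch
from torch import nn

body = nn.Sequential(nn.Linear(784, 128), nn.ReLU())  # stand-in "pretrained" body
head = nn.Linear(128, 10)                             # your task-specific classifier

for p in body.parameters():        # freeze the body...
    p.requires_grad = False

model = nn.Sequential(body, head)  # ...and train only the head
trainable = [p for p in model.parameters() if p.requires_grad]
print(len(trainable))              # just the head's weight and bias
```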
Key abstraction: the conversation — a structured “document” with system instructions, user messages, assistant responses, “tool” calls/responses, reasoning traces
API: you ask for the next message given the conversation so far (no training)
Stateless: Each conversation is independent — the model itself doesn’t remember past conversations (but system can prepend them to future conversations)
Agent extension: the model can output requests to run code (e.g., search, calculate, edit file); the system runs the code and includes the output in the conversation
When appropriate: text tasks, prototyping, when training data is scarce
When not: latency-critical, cost-sensitive, tasks requiring precise numeric output
We’ve studied the mechanics. Now: what does it mean?
When deploying an AI system, ask:
(based on Shneiderman, Human-Centered AI; see also Gender Shades, COMPAS, and ACM FAccT)
Christian concepts that inform how we think about AI:
“We shape our tools, and thereafter our tools shape us.” — Churchill/McLuhan
Building state-of-the-art AI systems out of the building blocks we’ve studied:
Remind you of Calvin’s mission?
We are not machines. That’s good. Don’t forget it.
What concept from this course has changed how you think about AI?
What questions do you still have?
Whatever you do, work at it with all your heart, as working for the Lord, not for human masters, since you know that you will receive an inheritance from the Lord as a reward. It is the Lord Christ you are serving. — Colossians 3:23-24
Treat people as God’s image-bearers, not like machines.
Work with AI systems, not as magic but as computational tools.
Keep learning. Keep thinking. Use the natural intelligence God gave you, together with all the artificial intelligence that God enabled, to love God and neighbor.