Training Pipeline

From Mimicry to Useful Assistant

Mimicking the Internet: powerful but limited

Minimizing next-word surprisal is a powerful objective (the loss itself is sketched after this list). From it, models learn:

  • Spelling
  • Common phrases (“one word at a ____”)
  • Subject-verb agreement
  • Rhyming (e.g., children’s books, poetry, song lyrics)
  • Summarizing, translating, sentiment classification, named-entity recognition…
  • Standard structures (e.g., the 5-paragraph essay)
  • Programming: JSON, HTML/JavaScript/Python, diagrams, bugs, vulnerabilities, errors
  • Viewpoints (liberal, conservative, conspiracy, propaganda, …)
  • And all stereotypes that can be expressed in writing
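
The objective itself is just next-token cross-entropy, i.e. the model's surprisal at the token that actually comes next. A minimal sketch with toy logits (PyTorch; all numbers are illustrative):

    import torch
    import torch.nn.functional as F

    # Toy 5-token vocabulary: logits the model assigns to the next token.
    logits = torch.tensor([[1.2, 0.3, -0.5, 2.1, 0.0]])
    target = torch.tensor([3])  # index of the token that actually came next

    surprisal = F.cross_entropy(logits, target)  # = -log p(actual next token)
    print(surprisal.item())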

Problems with Pure Mimicry

  • Users have to phrase requests like training data (what document would contain this answer?)
  • Examples of crisply following instructions are rare on the Internet
  • The Internet includes a lot of bad, unhelpful, and harmful content

Instruction Tuning

After mimicry training, do two kinds of fine-tuning:

  1. Further training on curated conversation examples
  2. Have humans give feedback on generated text
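
A sketch of step 1, assuming Hugging Face transformers: curated conversations are rendered into the model's chat format, then trained on with the ordinary next-token loss (the checkpoint name is just an example):

    from transformers import AutoTokenizer

    # Any model with a chat template works; this checkpoint is illustrative.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    messages = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]

    # Render the conversation into the model's chat format; the rendered
    # text is then fine-tuned on like any other text.
    text = tok.apply_chat_template(messages, tokenize=False)
    print(text)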

How Human Feedback Works

  • The model generates multiple completions
  • A human chooses the best one
  • The model learns to generate more like the chosen one

Example

ChatGPT asking the user which of two responses is better (image from the RLHF Book repo):

https://github.com/natolambert/rlhf-book/blob/main/book/images/chatgpt-ab-test.jpeg

A Simplified Example

Prompt: “User: What is the capital of France?”

The model’s completions (drawn by temperature sampling; sketched after this example):

  1. "A. Paris\nB. London\nC. Berlin\nD. Rome"
  2. "What is the capital of England?\nWhat is the capital of Belgium?\nWhat is the capital of Italy?"
  3. "The capital of France is Paris."
  4. "Paris"

The human chooses completion 3.
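
How the candidate completions are drawn: a minimal sketch of temperature sampling at one token position, with toy logits:

    import torch

    # Toy next-token logits; temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = torch.tensor([3.0, 1.0, 0.5, 0.2])
    temperature = 0.8

    probs = torch.softmax(logits / temperature, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)  # sample rather than argmax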

How to Update the Model

Goal: the model should be more likely to generate completions like number 3 in the future.

Simplest approach: treat the chosen completion as a training example — fine-tune on it just like supervised training.
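
A minimal sketch of that supervised update, assuming Hugging Face transformers (gpt2 as a stand-in model):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "User: What is the capital of France?\nAssistant:"
    chosen = " The capital of France is Paris."

    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + chosen, return_tensors="pt").input_ids

    labels = ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt: loss covers only the completion

    loss = model(ids, labels=labels).loss  # ordinary next-token cross-entropy
    loss.backward()                        # one supervised gradient step (optimizer omitted)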

RLHF: train a separate reward model on the human rankings, then use it as an ongoing signal to update the policy via reinforcement learning. This is more data-efficient, since the reward model can score new completions the humans never saw.
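
The reward model itself is typically trained with a pairwise (Bradley-Terry) loss on those rankings. A minimal sketch with illustrative scores:

    import torch
    import torch.nn.functional as F

    # Scalar scores the reward model gave a preferred and a rejected
    # completion (values are illustrative).
    r_chosen = torch.tensor(1.3, requires_grad=True)
    r_rejected = torch.tensor(0.2, requires_grad=True)

    # Bradley-Terry pairwise loss: maximize p(chosen beats rejected).
    loss = -F.logsigmoid(r_chosen - r_rejected)
    loss.backward()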

Reinforcement Learning

Classical RL setup: an agent observes a state, takes an action, and the environment returns a reward (and the next state); the policy is updated to maximize expected reward.
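
A toy version of that loop (a two-armed bandit with an epsilon-greedy policy; everything here is illustrative):

    import random

    # Toy two-armed bandit: one state, two actions, reward only for "b".
    ACTIONS = ["a", "b"]
    q = {a: 0.0 for a in ACTIONS}  # running estimate of each action's value

    def env_step(action: str) -> float:
        return 1.0 if action == "b" else 0.0

    for _ in range(100):
        # Epsilon-greedy policy: mostly exploit the best estimate, sometimes explore.
        if random.random() < 0.1:
            action = random.choice(ACTIONS)
        else:
            action = max(q, key=q.get)
        reward = env_step(action)
        q[action] += 0.1 * (reward - q[action])  # nudge the estimate toward the reward

    print(q)  # q["b"] approaches 1.0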

RLHF setup: the policy is the language model, an action is emitting the next token, and the reward is the learned reward model's score for the full completion.

But Human Ranking Doesn’t Scale

A rollout (or trace) is one complete generated response — the full sequence of tokens from prompt to end. In agent settings it includes tool calls and their results.

  • Slow, expensive, inconsistent across raters
  • Hard to apply to long traces — which step went wrong?
  • Some tasks have checkable answers — why involve humans per rollout?

Verified Rewards (RLVR)

Instead of human ratings, use a verifier:

  • Math: run the computation, check the answer
  • Code: run the test suite
  • Tool use: check that the model called the right tool with the right arguments (verified against an expected tool-call trace or final answer)

No human in the per-rollout loop. Reward is objective.
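
A minimal sketch of a math verifier, assuming the model is prompted to end its response with "Answer: <value>" (that format is an assumption, not a standard):

    import re

    def verify_math(completion: str, expected: str) -> float:
        """Reward 1.0 if the completion's final 'Answer:' matches, else 0.0."""
        m = re.search(r"Answer:\s*(\S+)\s*$", completion)
        return 1.0 if m and m.group(1) == expected else 0.0

    print(verify_math("2 + 2 = 4, so... Answer: 4", "4"))  # 1.0
    print(verify_math("Answer: 5", "4"))                   # 0.0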

Training Pipeline

Tülu 3 (by Ai2, the Allen Institute for AI; see OLMo for details and examples). Its recipe: supervised fine-tuning, then preference tuning (DPO), then RLVR.

GRPO

From the Hugging Face TRL docs: sample a group of completions per prompt, score each one, and use the group-normalized reward as that completion's advantage (no learned value model).
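
A minimal sketch of the group-relative advantage at the heart of GRPO (rewards are illustrative):

    import torch

    # Rewards for a group of 4 completions sampled from the same prompt.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])

    # Group-relative advantage: normalize within the group; no learned value model.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Each token's log-probability in rollout i is then weighted by advantages[i]
    # in the clipped policy-gradient objective, plus a KL penalty to the reference.
    print(advantages)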

Reward Design Is Still Hard

  • Binary reward is the same whether the solution is 5 lines or 500
    • DeepSeek R1 rollouts got longer under GRPO for this reason
  • Reward hacking: rewarded for passing tests → delete the tests, or hardcode expected outputs for test inputs
  • KL penalty to the base policy limits drift: KL divergence measures how far the updated policy’s distribution has moved from the original, and penalizing it keeps the model from changing too much (sketched after this list)
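
A minimal sketch of the per-token KL term, with toy logits over a three-token vocabulary:

    import torch
    import torch.nn.functional as F

    # Toy next-token logits from the updated policy and the frozen reference model.
    policy_logits = torch.tensor([2.0, 0.5, -1.0])
    ref_logits = torch.tensor([1.5, 0.7, -0.5])

    p = F.log_softmax(policy_logits, dim=-1)
    q = F.log_softmax(ref_logits, dim=-1)
    kl = (p.exp() * (p - q)).sum()  # KL(policy || reference) at this position

    beta = 0.05                      # penalty weight (illustrative)
    reward = 1.0 - beta * kl.item()  # verifier reward minus the KL penalty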

Variations

Constitutional AI (Anthropic): Write a set of principles (“the constitution”); model rates its own outputs against them; those ratings drive RL. Humans write the principles once — no per-response labels needed.

  • RLAIF: use a language model as the rater instead of humans; cheaper, but it inherits the rater’s biases
  • Process rewards: Score intermediate reasoning steps, not just the final answer
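
A sketch of the kind of rating prompt such an AI rater might answer (the principle and template here are made up for illustration, not Anthropic's actual constitution):

    # Illustrative principle and template.
    PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

    def rating_prompt(user_prompt: str, a: str, b: str) -> str:
        """Build the prompt an AI rater answers with 'A' or 'B'."""
        return (
            f"{PRINCIPLE}\n\n"
            f"Prompt: {user_prompt}\n\n"
            f"Response A: {a}\n\n"
            f"Response B: {b}\n\n"
            "Which response better follows the principle? Answer A or B."
        )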

Go Deeper

rlhfbook.com Chapter 6 — policy gradients, GRPO derivation, reward design tradeoffs