Training Pipeline

From Mimicry to Useful Assistant

Mimicking the Internet: powerful but limited

Minimizing next-word surprisal is a powerful objective (the loss itself is sketched after this list). From it, models learn:

  • Spelling
  • Common phrases (“one word at a ____”)
  • Subject-verb agreement
  • Rhyming (e.g., children’s books, poetry, song lyrics)
  • Summarizing, translating, sentiment classification, named-entity recognition…
  • Standard structures (e.g., the 5-paragraph essay)
  • Programming: JSON, HTML/JavaScript/Python, diagrams, bugs, vulnerabilities, errors
  • Viewpoints (liberal, conservative, conspiracy, propaganda, …)
  • And all stereotypes that can be expressed in writing
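
The objective itself is just next-token cross-entropy, i.e. the model's surprisal at the token that actually comes next. A minimal sketch with toy logits (PyTorch; all numbers are illustrative):

    import torch
    import torch.nn.functional as F

    # Toy 5-token vocabulary: logits the model assigns to the next token.
    logits = torch.tensor([[1.2, 0.3, -0.5, 2.1, 0.0]])
    target = torch.tensor([3])  # index of the token that actually came next

    surprisal = F.cross_entropy(logits, target)  # = -log p(actual next token)
    print(surprisal.item())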

Problems with Pure Mimicry

  • Users have to phrase requests like training data (what document would contain this answer?)
  • Examples of crisply following instructions are rare on the Internet
  • The Internet includes a lot of bad, unhelpful, and harmful content

Instruction Tuning

After mimicry training, do two kinds of fine-tuning:

  1. Further training on curated conversation examples
  2. Have humans give feedback on generated text
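
A sketch of step 1, assuming Hugging Face transformers: curated conversations are rendered into the model's chat format, then trained on with the ordinary next-token loss (the checkpoint name is just an example):

    from transformers import AutoTokenizer

    # Any model with a chat template works; this checkpoint is illustrative.
    tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")

    messages = [
        {"role": "user", "content": "What is the capital of France?"},
        {"role": "assistant", "content": "The capital of France is Paris."},
    ]

    # Render the conversation into the model's chat format; the rendered
    # text is then fine-tuned on like any other text.
    text = tok.apply_chat_template(messages, tokenize=False)
    print(text)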

How Human Feedback Works

  • The model generates multiple completions
  • A human chooses the best one
  • The model learns to generate more like the chosen one

Example

ChatGPT asking the user which of two responses is better (image from the RLHF Book repo):

https://github.com/natolambert/rlhf-book/blob/main/book/images/chatgpt-ab-test.jpeg

A Simplified Example

Prompt: “User: What is the capital of France?”

The model’s completions (drawn by temperature sampling; sketched after this example):

  1. "A. Paris\nB. London\nC. Berlin\nD. Rome"
  2. "What is the capital of England?\nWhat is the capital of Belgium?\nWhat is the capital of Italy?"
  3. "The capital of France is Paris."
  4. "Paris"

The human chooses completion 3.
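
How the candidate completions are drawn: a minimal sketch of temperature sampling at one token position, with toy logits:

    import torch

    # Toy next-token logits; temperature < 1 sharpens the distribution, > 1 flattens it.
    logits = torch.tensor([3.0, 1.0, 0.5, 0.2])
    temperature = 0.8

    probs = torch.softmax(logits / temperature, dim=-1)
    next_token = torch.multinomial(probs, num_samples=1)  # sample rather than argmax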

How to Update the Model

Goal: the model should be more likely to generate completions like number 3 in the future.

Simplest approach: treat the chosen completion as a training example — fine-tune on it just like supervised training.
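
A minimal sketch of that supervised update, assuming Hugging Face transformers (gpt2 as a stand-in model):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tok = AutoTokenizer.from_pretrained("gpt2")          # stand-in model
    model = AutoModelForCausalLM.from_pretrained("gpt2")

    prompt = "User: What is the capital of France?\nAssistant:"
    chosen = " The capital of France is Paris."

    prompt_len = tok(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tok(prompt + chosen, return_tensors="pt").input_ids

    labels = ids.clone()
    labels[:, :prompt_len] = -100  # mask the prompt: loss covers only the completion

    loss = model(ids, labels=labels).loss  # ordinary next-token cross-entropy
    loss.backward()                        # one supervised gradient step (optimizer omitted)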

RLHF: train a separate reward model on the human rankings, then use it as an ongoing signal to update the policy via reinforcement learning. This is more data-efficient, since the reward model can score new completions the humans never saw.
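
The reward model itself is typically trained with a pairwise (Bradley-Terry) loss on those rankings. A minimal sketch with illustrative scores:

    import torch
    import torch.nn.functional as F

    # Scalar scores the reward model gave a preferred and a rejected
    # completion (values are illustrative).
    r_chosen = torch.tensor(1.3, requires_grad=True)
    r_rejected = torch.tensor(0.2, requires_grad=True)

    # Bradley-Terry pairwise loss: maximize p(chosen beats rejected).
    loss = -F.logsigmoid(r_chosen - r_rejected)
    loss.backward()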

Reinforcement Learning

Classical RL setup: an agent observes a state, takes an action, and the environment returns a reward (and the next state); the policy is updated to maximize expected reward.
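
A toy version of that loop (a two-armed bandit with an epsilon-greedy policy; everything here is illustrative):

    import random

    # Toy two-armed bandit: one state, two actions, reward only for "b".
    ACTIONS = ["a", "b"]
    q = {a: 0.0 for a in ACTIONS}  # running estimate of each action's value

    def env_step(action: str) -> float:
        return 1.0 if action == "b" else 0.0

    for _ in range(100):
        # Epsilon-greedy policy: mostly exploit the best estimate, sometimes explore.
        if random.random() < 0.1:
            action = random.choice(ACTIONS)
        else:
            action = max(q, key=q.get)
        reward = env_step(action)
        q[action] += 0.1 * (reward - q[action])  # nudge the estimate toward the reward

    print(q)  # q["b"] approaches 1.0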

RLHF setup: the policy is the language model, an action is emitting the next token, and the reward is the learned reward model's score for the full completion.

But Human Ranking Doesn’t Scale

A rollout (or trace) is one complete generated response — the full sequence of tokens from prompt to end. In agent settings it includes tool calls and their results.

  • Slow, expensive, inconsistent across raters
  • Hard to apply to long traces — which step went wrong?
  • Some tasks have checkable answers — why involve humans per rollout?

Verified Rewards (RLVR)

Instead of human ratings, use a verifier:

  • Math: run the computation, check the answer
  • Code: run the test suite
  • Tool use: check that the model called the right tool with the right arguments (verified against an expected tool-call trace or final answer)

No human in the per-rollout loop. Reward is objective.
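
A minimal sketch of a math verifier, assuming the model is prompted to end its response with "Answer: <value>" (that format is an assumption, not a standard):

    import re

    def verify_math(completion: str, expected: str) -> float:
        """Reward 1.0 if the completion's final 'Answer:' matches, else 0.0."""
        m = re.search(r"Answer:\s*(\S+)\s*$", completion)
        return 1.0 if m and m.group(1) == expected else 0.0

    print(verify_math("2 + 2 = 4, so... Answer: 4", "4"))  # 1.0
    print(verify_math("Answer: 5", "4"))                   # 0.0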

Training Pipeline

Tülu 3 (by Ai2, the Allen Institute for AI; see OLMo for details and examples). Its recipe: supervised fine-tuning, then preference tuning (DPO), then RLVR.

GRPO

From the Hugging Face TRL docs: sample a group of completions per prompt, score each one, and use the group-normalized reward as that completion's advantage (no learned value model).
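
A minimal sketch of the group-relative advantage at the heart of GRPO (rewards are illustrative):

    import torch

    # Rewards for a group of 4 completions sampled from the same prompt.
    rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])

    # Group-relative advantage: normalize within the group; no learned value model.
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    # Each token's log-probability in rollout i is then weighted by advantages[i]
    # in the clipped policy-gradient objective, plus a KL penalty to the reference.
    print(advantages)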

Reward Design Is Still Hard

  • Binary reward is the same whether the solution is 5 lines or 500
    • DeepSeek R1 rollouts got longer under GRPO for this reason
  • Reward hacking: rewarded for passing tests → delete the tests, or hardcode expected outputs for test inputs
  • KL penalty to the base policy limits drift: KL divergence measures how far the updated policy’s distribution has moved from the original, and penalizing it keeps the model from changing too much (sketched after this list)
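
A minimal sketch of the per-token KL term, with toy logits over a three-token vocabulary:

    import torch
    import torch.nn.functional as F

    # Toy next-token logits from the updated policy and the frozen reference model.
    policy_logits = torch.tensor([2.0, 0.5, -1.0])
    ref_logits = torch.tensor([1.5, 0.7, -0.5])

    p = F.log_softmax(policy_logits, dim=-1)
    q = F.log_softmax(ref_logits, dim=-1)
    kl = (p.exp() * (p - q)).sum()  # KL(policy || reference) at this position

    beta = 0.05                      # penalty weight (illustrative)
    reward = 1.0 - beta * kl.item()  # verifier reward minus the KL penalty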

Variations

Constitutional AI (Anthropic): Write a set of principles (“the constitution”); model rates its own outputs against them; those ratings drive RL. Humans write the principles once — no per-response labels needed.

  • RLAIF: use a language model as the rater instead of humans; cheaper, but it inherits the rater’s biases
  • Process rewards: Score intermediate reasoning steps, not just the final answer
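
A sketch of the kind of rating prompt such an AI rater might answer (the principle and template here are made up for illustration, not Anthropic's actual constitution):

    # Illustrative principle and template.
    PRINCIPLE = "Choose the response that is more helpful, honest, and harmless."

    def rating_prompt(user_prompt: str, a: str, b: str) -> str:
        """Build the prompt an AI rater answers with 'A' or 'B'."""
        return (
            f"{PRINCIPLE}\n\n"
            f"Prompt: {user_prompt}\n\n"
            f"Response A: {a}\n\n"
            f"Response B: {b}\n\n"
            "Which response better follows the principle? Answer A or B."
        )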

Go Deeper

rlhfbook.com Chapter 6 — policy gradients, GRPO derivation, reward design tradeoffs