1. "What is the capital of England? What is the capital of Belgium? What is the capital of Italy?"
2. "The capital of France is Paris."
3. "Paris"
Human chooses 3
How to update the model
Goal: model should be more likely to generate completions like 3 in the future.
Simplest approach: treat the chosen completion as a training example — fine-tune on it just like supervised training.
RLHF: train a separate reward model on the human rankings, then use it as an ongoing signal to update the policy via reinforcement learning. More data-efficient; can score new completions the humans never saw.
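A minimal sketch of the reward-model training step, assuming the common pairwise (Bradley-Terry) formulation; the function and variable names are illustrative, not taken from any particular codebase.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor,
                      reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry) loss: push the chosen completion's scalar
    reward above the rejected completion's reward."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Hypothetical usage: `reward_model(prompt, completion)` returns one scalar
# per example; the human ranking tells us which completion was preferred.
# loss = reward_model_loss(reward_model(prompt, chosen),
#                          reward_model(prompt, rejected))
```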
Reinforcement Learning
Classical RL setup: an agent takes actions in an environment, receives rewards, and updates its policy to maximize expected reward.
RLHF setup: the policy is the language model, the state is the prompt plus the tokens generated so far, an action is emitting the next token, and the reward model scores the finished completion.
But Human Ranking Doesn’t Scale
A rollout (or trace) is one complete generated response — the full sequence of tokens from prompt to end. In agent settings it includes tool calls and their results.
Slow, expensive, inconsistent across raters
Hard to apply to long traces — which step went wrong?
Some tasks have checkable answers — why involve humans per rollout?
Verifiable Rewards (RLVR)
Instead of human ratings, use a verifier:
Math: run the computation, check the answer
Code: run the test suite
Tool use: check that the model called the right tool with the right arguments (verified against an expected tool-call trace or final answer)
No human in the per-rollout loop. Reward is objective.
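As a toy illustration of what such a verifier can look like; the function name and the exact-match rule are assumptions for the sketch, and real math verifiers normalize answers far more carefully.

```python
def verify_math(completion: str, expected_answer: str) -> float:
    """Toy verifier: treat the last non-empty line of the completion as the
    final answer and compare it to the known ground truth. Binary reward."""
    lines = [ln.strip() for ln in completion.splitlines() if ln.strip()]
    if not lines:
        return 0.0
    final = lines[-1].rstrip(".")
    # Real verifiers parse \boxed{...}, strip units, compare numerically, etc.
    return 1.0 if final == expected_answer.strip() else 0.0

# verify_math("2 + 2 means the answer is:\n4", "4")  -> 1.0
```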
Training Pipeline
Tülu 3 (by Ai2 / AllenAI; see OLMo for details and examples)
Binary reward is the same whether the solution is 5 lines or 500
DeepSeek-R1 rollouts got longer under GRPO (Group Relative Policy Optimization) for this reason
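A minimal sketch of the group-relative advantage GRPO uses, assuming per-prompt groups of rollouts and mean/std normalization (details vary across implementations); it makes the length-blindness of a binary reward concrete.

```python
from statistics import mean, pstdev

def grpo_advantages(group_rewards: list[float]) -> list[float]:
    """Group-relative advantages for one prompt's rollouts: reward minus the
    group mean, scaled by the group standard deviation. A binary verifier
    reward gives a 5-line and a 500-line passing solution the same advantage."""
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards) or 1.0  # guard against a zero-spread group
    return [(r - mu) / sigma for r in group_rewards]

# e.g. 8 rollouts sampled for one prompt, 3 of which passed the verifier:
# grpo_advantages([1, 0, 0, 1, 0, 1, 0, 0])
```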
Reward hacking: rewarded for passing tests → delete the tests, or hardcode expected outputs for test inputs
KL penalty to base policy limits drift — KL divergence measures how different the updated policy’s distribution is from the original; penalizing it keeps the model from changing too much
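One common way to apply the penalty is to fold a per-token log-probability gap into the reward; a rough sketch, where beta, the shapes, and the function name are assumptions.

```python
import torch

def kl_penalized_rewards(verifier_reward: float,
                         policy_logprobs: torch.Tensor,
                         ref_logprobs: torch.Tensor,
                         beta: float = 0.05) -> torch.Tensor:
    """Fold a per-token KL penalty against the frozen base (reference) policy
    into the reward. policy_logprobs / ref_logprobs are the log-probs of the
    sampled tokens under the current policy and the base policy, shape (seq_len,).
    The scalar verifier reward is added at the final token."""
    per_token_kl = policy_logprobs - ref_logprobs  # crude per-token KL estimate
    rewards = -beta * per_token_kl
    rewards[-1] = rewards[-1] + verifier_reward
    return rewards
```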
Variations
Constitutional AI (Anthropic): Write a set of principles (“the constitution”); model rates its own outputs against them; those ratings drive RL. Humans write the principles once — no per-response labels needed.
RLAIF: Use a language model as rater instead of humans — cheaper, but inherits rater biases
Process rewards: Score intermediate reasoning steps, not just the final answer