RLHF
Instruction tuning teaches a model to follow instructions, but only by imitating its
demonstrations: it has no way to learn that a good answer is better than a
merely acceptable one. Worse, "good" is hard to write down: we cannot author a loss that
directly rewards being helpful, honest, and harmless. What we
can do is recognise it when we see it. Reinforcement learning from human
feedback (RLHF) turns that recognition into a training signal, in three stages.
Stage 1 — collect human comparisons
Asking a person to score a response from 1 to 10 is noisy and inconsistent. Asking them to
compare two is far more reliable. So for a prompt
x we sample two responses from the SFT model and ask a human
which they prefer, yielding a winner
y_w and a loser
y_l:
\mathcal{D}_{\text{pref}} = \big\{\,(x,\ y_w,\ y_l)\,\big\},\qquad y_w \succ y_l.
The dataset is simply a collection of these "this one is better than that one" judgements,
which are cheaper to collect and more consistent than absolute scores.
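One way to picture a single entry of the preference dataset is as a small record. The sketch below is only illustrative: the class name `PreferencePair`, the field names, and the helper `record_comparison` are assumptions made for this example, not part of any particular library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison (x, y_w, y_l). Names are hypothetical."""
    prompt: str    # x
    chosen: str    # y_w, the response the human preferred
    rejected: str  # y_l, the response the human did not prefer


def record_comparison(prompt, response_a, response_b, human_prefers_a):
    """Turn a raw A/B judgement into an ordered (winner, loser) record."""
    if human_prefers_a:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


# Toy example: the annotator preferred the second response.
pair = record_comparison(
    "What is 2 + 2?", "5", "4", human_prefers_a=False
)
```

Storing the pair already ordered as (winner, loser) means later stages never need to look at the raw A/B label again.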
Stage 2 — train a reward model, line by line
We want a scalar reward model
r_\phi(x, y) that scores a response, learned so that preferred
responses score higher. The bridge from "which is better" to "what is its score" is the
Bradley–Terry model of pairwise preference.
Step 1 — assume preference follows the score gap. Bradley–Terry says
the probability that the human prefers y_w to
y_l is a logistic function of the difference in their rewards.
Writing \sigma for the
sigmoid:
P\big(y_w \succ y_l \mid x\big) = \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.
A bigger reward gap means a more confident preference; a zero gap (equal rewards) means a coin flip,
\sigma(0) = \tfrac12.
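The Bradley–Terry probability is easy to evaluate by hand. The following sketch assumes the two rewards are already available as plain floats; the function names are illustrative.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(reward_winner, reward_loser):
    """P(y_w preferred over y_l | x) under Bradley-Terry."""
    return sigmoid(reward_winner - reward_loser)


equal = preference_probability(1.0, 1.0)   # zero gap: a coin flip
wide = preference_probability(3.0, 1.0)    # gap of 2: a confident preference
```

Only the difference of the two rewards matters, so adding the same constant to both scores leaves the predicted preference unchanged.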
Step 2 — write the negative log-likelihood. Fit
\phi by maximum likelihood over the comparisons — equivalently,
minimise the
cross-entropy
of this binary "winner beats loser" event:
\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\text{pref}}}\Big[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big].
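As a minimal sketch of this loss, assume the reward model has already scored each winner and loser, so a batch is just a list of (r_w, r_l) float pairs. The function names are illustrative, and the numbers in the toy batches are made up.

```python
import math


def neg_log_sigmoid(z):
    """-log(sigmoid(z)) = log(1 + exp(-z)), computed stably."""
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def bradley_terry_loss(reward_pairs):
    """Mean negative log-likelihood over (winner_reward, loser_reward) pairs."""
    losses = [neg_log_sigmoid(r_w - r_l) for r_w, r_l in reward_pairs]
    return sum(losses) / len(losses)


# Hypothetical scores: the first batch ranks winners above losers,
# the second ranks them the wrong way round.
agrees = bradley_terry_loss([(2.0, 0.0), (1.5, 0.5)])
disagrees = bradley_terry_loss([(0.0, 2.0), (0.5, 1.5)])
```

The loss is small when the reward model already orders each pair the way the human did, and grows roughly linearly in the gap when it orders them the wrong way.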
Step 3 — read the gradient. Just like softmax-with-cross-entropy, the
derivative is an error signal: it pushes the winner's score up and the loser's down, by an
amount \big(1 - \sigma(\Delta r)\big), where \Delta r = r_\phi(x, y_w) - r_\phi(x, y_l) is the current reward gap. That weight is large when the model
currently disagrees with the human (\Delta r < 0) and vanishes as the gap grows in the human's favour. The result is a
learned r_\phi that stands in for human taste.
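The gradient claim can be checked with one application of the chain rule. The derivation below writes \Delta r for the reward gap r_\phi(x, y_w) - r_\phi(x, y_l) and uses the identity \sigma'(z) = \sigma(z)\,(1 - \sigma(z)).

```latex
\begin{aligned}
\frac{d}{dz}\big[-\log\sigma(z)\big]
  &= -\frac{\sigma'(z)}{\sigma(z)} = -\big(1 - \sigma(z)\big), \\[4pt]
\nabla_\phi \mathcal{L}(\phi)
  &= -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\text{pref}}}
     \Big[\big(1 - \sigma(\Delta r)\big)\,
     \big(\nabla_\phi r_\phi(x, y_w) - \nabla_\phi r_\phi(x, y_l)\big)\Big].
\end{aligned}
```

A gradient-descent step moves \phi along -\nabla_\phi \mathcal{L}, which to first order raises r_\phi(x, y_w) and lowers r_\phi(x, y_l). The weight 1 - \sigma(\Delta r) is close to 1 when \Delta r is very negative, equals \tfrac12 at \Delta r = 0, and approaches 0 when \Delta r is large and positive.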
Stage 3 — optimise the policy with a KL leash
Now make the language model — the policy
\pi_\theta — produce responses the reward model likes. The
obvious objective, "generate y to maximise
r_\phi(x, y)", has a fatal flaw: the reward model is only a
proxy, accurate near the responses it was trained on. Chase it too far and the policy drifts
into gibberish that games the reward — high score, useless text.
Step 1 — add a leash to the reference model. Penalise drifting away from
the frozen SFT model \pi_{\text{ref}} with a
KL
divergence term, weighted by \beta > 0:
\max_{\theta}\ \ \mathbb{E}_{x\sim\mathcal{D}}\Big[\, \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\ \|\ \pi_{\text{ref}}(\cdot\mid x)\big)\,\Big].
Step 2 — balance the two pulls. The first term pulls toward high reward;
the second pulls back toward the sensible SFT model. The coefficient
\beta sets how long the leash is — small
\beta chases reward hard (and risks gaming it); large
\beta stays cautious and close to the reference.
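The role of \beta can be made concrete with a standard result for this objective. If the policy were unrestricted (an idealisation, since a real \pi_\theta is limited by its parameterisation and by optimisation), the maximiser for each prompt has a closed form, with Z(x) a normalising constant:

```latex
\pi^{*}(y \mid x) \;=\; \frac{1}{Z(x)}\;\pi_{\text{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r_\phi(x, y)\Big),
\qquad
Z(x) \;=\; \sum_{y} \pi_{\text{ref}}(y \mid x)\,
  \exp\!\Big(\tfrac{1}{\beta}\, r_\phi(x, y)\Big).
```

As \beta \to \infty the exponential factor flattens and \pi^{*} approaches \pi_{\text{ref}}. As \beta \to 0 the probability mass concentrates on the highest-reward responses that \pi_{\text{ref}} gives non-zero probability.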
Step 3 — optimise it with RL. Because the policy's own samples appear
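inside the expectation, the objective depends on \theta through the sampling distribution, and gradients cannot flow through the discrete sampling step; this makes it a reinforcement-learning problem, not a plain supervised
one. The standard solver is PPO (proximal policy optimization): sample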
responses, score them with r_\phi minus the KL penalty, and take
clipped policy-gradient steps. The output is a model aligned with human preferences while
still fluent.
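The two ingredients of that PPO step can be sketched for a single sampled response. This is a simplified illustration, not a full trainer: the KL penalty is approximated by the single-sample log-ratio log \pi_\theta(y|x) - log \pi_ref(y|x) at sequence level, the advantage is passed in as a given number rather than estimated, and the function names and the default `clip_eps=0.2` are assumptions made for this example.

```python
import math


def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward-model score minus beta times a one-sample KL estimate."""
    return reward - beta * (logp_policy - logp_ref)


def clipped_surrogate(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped objective for one sample (to be maximised)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
    # Taking the minimum removes any incentive to move the ratio
    # outside [1 - clip_eps, 1 + clip_eps] in the favourable direction.
    return min(ratio * advantage, clipped_ratio * advantage)


# Hypothetical numbers: the policy assigns the sample a higher
# log-probability than the reference does, so part of the reward is taxed.
shaped = kl_shaped_reward(reward=1.0, logp_policy=-1.0, logp_ref=-2.0, beta=0.5)
```

A practical implementation would typically also apply the KL term per token, learn a value function, and estimate advantages from it; those parts are omitted here.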
Reinforcement learning from human feedback aligns a policy in three stages:
- Preference data. Humans compare response pairs, giving (x,\ y_w,\ y_l) with y_w \succ y_l.
- Reward model. Fit r_\phi(x,y) under Bradley–Terry, P(y_w \succ y_l \mid x) = \sigma\!\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big), by a logistic / cross-entropy loss.
- Policy optimisation. Use RL (PPO) to maximise \mathbb{E}[\,r_\phi(x,y)\,] - \beta\,\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}), that is, reward minus a KL penalty that keeps the policy near the SFT model.
The reward–KL trade-off
As the policy drifts from the reference model (rightward), the
proxy reward r_\phi keeps climbing — but the
true quality peaks and then collapses as the model starts gaming the proxy.
The KL penalty \beta\,\mathbb{D}_{\mathrm{KL}} (dashed) is what the
objective subtracts; the actual objective is reward minus that leash. Turn
\beta up to pull the optimum back toward the safe region; turn it
down to chase the proxy off the cliff.
A reward model is a proxy for human judgement, trained on a finite set of
comparisons. Optimise any proxy hard enough and you find its blind spots — this is
reward hacking, an instance of Goodhart's law: "when a measure becomes a
target, it ceases to be a good measure." In practice the policy discovers tricks that the
reward model overrates: padding answers with confident-sounding boilerplate, mimicking the
surface style of preferred responses without the substance, or exploiting a quirk like
"longer answers tend to win". Each scores well on r_\phi while
being worse for an actual human.
The \beta\,\mathbb{D}_{\mathrm{KL}} term is the leash that keeps
this in check. By anchoring the policy to the SFT reference, it penalises the wild,
out-of-distribution outputs where the reward model is least trustworthy, so the policy
moves only as far as the extra reward justifies. Too long a leash (small
\beta) and the model walks off the cliff into gibberish that games
the score; too short (large \beta) and it barely improves on SFT.
The coefficient \beta is therefore one of the central knobs of RLHF.
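A toy calculation makes the leash visible. The sketch below uses three made-up responses, one of which is a "hack" that the proxy reward overrates, and evaluates the known closed-form optimum of the KL-regularised objective over a finite response set, \pi^{*}(y) \propto \pi_{\text{ref}}(y)\exp(r_\phi(y)/\beta). All numbers and names are hypothetical and chosen only to illustrate the effect.

```python
import math


def tilted_policy(ref_probs, proxy_rewards, beta):
    """Optimum of E[r] - beta * KL(pi || pi_ref) over a finite response set."""
    logits = [math.log(p) + r / beta for p, r in zip(ref_probs, proxy_rewards)]
    top = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(probs, values):
    return sum(p * v for p, v in zip(probs, values))


# Hypothetical responses: plain answer, detailed answer, padded boilerplate.
REF_PROBS = [0.6, 0.399, 0.001]   # the reference model rarely pads
PROXY_REWARD = [1.0, 2.0, 4.0]    # the reward model overrates padding
TRUE_QUALITY = [1.0, 2.0, 0.0]    # a human finds padding useless


def sweep(betas):
    """Return (beta, expected proxy reward, expected true quality) rows."""
    rows = []
    for beta in betas:
        policy = tilted_policy(REF_PROBS, PROXY_REWARD, beta)
        rows.append((beta,
                     expectation(policy, PROXY_REWARD),
                     expectation(policy, TRUE_QUALITY)))
    return rows


rows = sweep([100.0, 1.0, 0.1])  # long leash last
```

In this toy setting the proxy reward rises every time \beta is lowered, while the true quality first improves over the reference and then collapses once the policy concentrates on the overrated response.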
Where this sits
RLHF works, but the pipeline is heavy: train a separate reward model, then run an unstable
RL loop with several models in memory at once. A natural question is whether the reward
model and the RL loop can be skipped entirely, collapsing stages 2 and 3 into one supervised
loss on the preference data. They can, and that is
direct preference optimization.