RLHF

Instruction tuning teaches a model to follow instructions, but only by imitating its demonstrations — it has no way to learn that one good answer is better than another merely-acceptable one. Worse, "good" is hard to write down: we cannot author a loss that directly rewards being helpful, honest, and harmless. What we can do is recognise it when we see it. Reinforcement learning from human feedback (RLHF) turns that recognition into a training signal, in three stages.

Stage 1 — collect human comparisons

Asking a person to score a response from 1 to 10 is noisy and inconsistent. Asking them to compare two is far more reliable. So for a prompt x we sample two responses from the SFT model and ask a human which they prefer, yielding a winner y_w and a loser y_l:

\mathcal{D}_{\text{pref}} = \big\{\,(x,\ y_w,\ y_l)\,\big\},\qquad y_w \succ y_l.

The dataset is simply a collection of these "this one is better than that one" judgements, which are cheaper to collect and more consistent than absolute scores.
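As a concrete picture of what one record in \mathcal{D}_{\text{pref}} might look like, here is a minimal sketch. The names PreferencePair and make_pair, and the example strings, are hypothetical and chosen only for illustration; real pipelines usually store extra metadata such as annotator IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreferencePair:
    """One human comparison (x, y_w, y_l). Hypothetical container for illustration."""
    prompt: str   # x
    winner: str   # y_w, the response the human preferred
    loser: str    # y_l, the response the human rejected


def make_pair(prompt, response_a, response_b, human_prefers_a):
    """Turn one A-versus-B judgement into an ordered (x, y_w, y_l) record."""
    if human_prefers_a:
        return PreferencePair(prompt, response_a, response_b)
    return PreferencePair(prompt, response_b, response_a)


pair = make_pair(
    "Explain KL divergence in one sentence.",
    "It measures how much one probability distribution differs from another.",
    "KL is a thing in maths.",
    human_prefers_a=True,
)
print(pair.winner)
```

Note that the record keeps only the ordering, not a score: the annotator never has to say how much better the winner is.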

Stage 2 — train a reward model, line by line

We want a scalar reward model r_\phi(x, y) that scores a response, learned so that preferred responses score higher. The bridge from "which is better" to "what is its score" is the Bradley–Terry model of pairwise preference.

Step 1 — assume preference follows the score gap. Bradley–Terry says the probability that the human prefers y_w to y_l is a logistic function of the difference in their rewards. Writing \sigma for the sigmoid:

P\big(y_w \succ y_l \mid x\big) = \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big),\qquad \sigma(z) = \frac{1}{1 + e^{-z}}.

A bigger reward gap means a more confident preference; a zero gap (equal rewards) means a coin flip, \sigma(0) = \tfrac12.
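The Bradley–Terry probability is easy to evaluate by hand. The sketch below assumes the two rewards are already available as plain floats; the function names are illustrative, not taken from any library.

```python
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def preference_probability(reward_winner, reward_loser):
    """P(y_w beats y_l | x) under Bradley-Terry: sigma of the reward gap."""
    return sigmoid(reward_winner - reward_loser)


p_tie = preference_probability(1.3, 1.3)      # equal rewards: a coin flip
p_gap = preference_probability(2.0, 0.0)      # gap of 2
p_shifted = preference_probability(12.0, 10.0)  # same gap, shifted scores
print(p_tie, round(p_gap, 4), round(p_shifted, 4))
```

Only the difference of the two rewards matters, so adding a constant to every score leaves all preference probabilities unchanged. The reward model's absolute scale is therefore not pinned down by the data.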

Step 2 — write the negative log-likelihood. Fit \phi by maximum likelihood over the comparisons — equivalently, minimise the cross-entropy of this binary "winner beats loser" event:

\mathcal{L}(\phi) = -\,\mathbb{E}_{(x,\,y_w,\,y_l)\sim\mathcal{D}_{\text{pref}}}\Big[\log \sigma\!\big(r_\phi(x, y_w) - r_\phi(x, y_l)\big)\Big].
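To see the loss in numbers, the following sketch averages -\log\sigma(r_w - r_l) over a small batch. It assumes the reward model has already produced a (winner score, loser score) pair for each comparison; the scores themselves are made up for illustration.

```python
import math


def log_sigmoid(z):
    """log sigma(z), computed stably for both signs of z."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))


def reward_model_loss(score_pairs):
    """Mean negative log-likelihood over (r_w, r_l) score pairs."""
    total = sum(log_sigmoid(r_w - r_l) for r_w, r_l in score_pairs)
    return -total / len(score_pairs)


agrees = reward_model_loss([(2.0, 0.0), (1.0, -1.0)])      # winners score higher
undecided = reward_model_loss([(0.5, 0.5)])                # no gap: loss is log 2
disagrees = reward_model_loss([(0.0, 2.0), (-1.0, 1.0)])   # winners score lower
print(round(agrees, 4), round(undecided, 4), round(disagrees, 4))
```

A model that ranks every pair the way the human did gets a loss below \log 2; one that ranks them the wrong way round gets a loss above it.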

Step 3 — read the gradient. Write \Delta r = r_\phi(x, y_w) - r_\phi(x, y_l) for the reward gap. Just as with softmax plus cross-entropy, the derivative is an error signal: a gradient step pushes the winner's score up and the loser's down, by an amount \big(1 - \sigma(\Delta r)\big) that is large when the model currently disagrees with the human and shrinks toward zero as the model agrees more confidently. The result is a learned r_\phi that stands in for human taste.
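The gradient claim can be checked numerically. This sketch treats the two scores as free variables (in a real reward model the chain rule would continue into the parameters \phi) and compares the analytic gradient with a finite-difference estimate. All names are illustrative.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def pair_loss(r_w, r_l):
    """-log sigma(r_w - r_l) for a single comparison."""
    return -math.log(sigmoid(r_w - r_l))


def pair_loss_grads(r_w, r_l):
    """Analytic gradients (dL/dr_w, dL/dr_l) of pair_loss."""
    error = 1.0 - sigmoid(r_w - r_l)
    # Gradient descent subtracts these, so r_w goes up and r_l goes down.
    return -error, error


def numeric_grad_wrt_winner(r_w, r_l, h=1e-6):
    """Central finite difference of the loss with respect to r_w."""
    return (pair_loss(r_w + h, r_l) - pair_loss(r_w - h, r_l)) / (2 * h)


wrong = pair_loss_grads(0.0, 2.0)   # model prefers the loser: large error
right = pair_loss_grads(4.0, 0.0)   # model already agrees: small error
print(wrong, right)
```

The two gradients are equal and opposite, which is another way of seeing that only the reward gap is being trained.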

Stage 3 — optimise the policy with a KL leash

Now make the language model — the policy \pi_\theta — produce responses the reward model likes. The obvious objective, "generate y to maximise r_\phi(x, y)", has a fatal flaw: the reward model is only a proxy, accurate near the responses it was trained on. Chase it too far and the policy drifts into gibberish that games the reward — high score, useless text.

Step 1 — add a leash to the reference model. Penalise drifting away from the frozen SFT model \pi_{\text{ref}} with a KL divergence term, weighted by \beta > 0:

\max_{\theta}\ \ \mathbb{E}_{x\sim\mathcal{D}}\Big[\, \mathbb{E}_{y\sim\pi_\theta(\cdot\mid x)}\big[\, r_\phi(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi_\theta(\cdot\mid x)\ \|\ \pi_{\text{ref}}(\cdot\mid x)\big)\Big].

Step 2 — balance the two pulls. The first term pulls toward high reward; the second pulls back toward the sensible SFT model. The coefficient \beta sets how long the leash is — small \beta chases reward hard (and risks gaming it); large \beta stays cautious and close to the reference.
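The two pulls are easy to see on a toy problem. The sketch below assumes a single prompt with only three possible responses, so a policy is just a list of three probabilities. The reference distribution, the rewards, and the two candidate policies are invented for illustration.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions over the same responses."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def kl_regularised_objective(policy, reference, rewards, beta):
    """Expected reward under the policy minus beta times KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)


reference = [0.5, 0.3, 0.2]      # frozen SFT model (toy numbers)
rewards = [0.0, 1.0, 3.0]        # reward model scores for the three responses
cautious = [0.4, 0.3, 0.3]       # small step toward the high-reward response
greedy = [0.02, 0.03, 0.95]      # almost all mass on the high-reward response

for beta in (0.1, 5.0):
    j_cautious = kl_regularised_objective(cautious, reference, rewards, beta)
    j_greedy = kl_regularised_objective(greedy, reference, rewards, beta)
    print(f"beta={beta}: cautious={j_cautious:.3f}, greedy={j_greedy:.3f}")
```

With the small \beta the greedy policy scores higher, because its extra reward outweighs a lightly priced KL. With the large \beta the ordering flips and the cautious policy wins.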

Step 3 — optimise it with RL. Because the policy's own samples appear inside the expectation, this is a reinforcement-learning problem, not a plain supervised one. The standard solver is PPO (proximal policy optimization): sample responses, score them with r_\phi minus the KL penalty, and take clipped policy-gradient steps. The output is a model aligned with human preferences while still fluent.
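Two small pieces of that loop can be written down directly: the KL-shaped reward for one sampled response, and the clipped surrogate term that PPO maximises. This is a sketch under simplifying assumptions, not a full PPO implementation: it works at the level of whole responses, takes the advantage as given (no value model or advantage estimation), and uses clip_eps=0.2 only as a commonly quoted default. The function names are hypothetical.

```python
import math


def kl_shaped_reward(reward, logp_policy, logp_ref, beta):
    """Reward for one sampled response with the KL penalty folded in.

    logp_policy - logp_ref is a single-sample estimate of
    KL(pi_theta || pi_ref) when the response was sampled from pi_theta.
    """
    return reward - beta * (logp_policy - logp_ref)


def ppo_clipped_term(logp_new, logp_old, advantage, clip_eps=0.2):
    """PPO clipped surrogate for one sample (a quantity to maximise)."""
    ratio = math.exp(logp_new - logp_old)  # pi_new(y|x) / pi_old(y|x)
    clipped_ratio = max(1.0 - clip_eps, min(1.0 + clip_eps, ratio))
    return min(ratio * advantage, clipped_ratio * advantage)


shaped = kl_shaped_reward(reward=1.5, logp_policy=-10.0, logp_ref=-12.0, beta=0.1)
gain_capped = ppo_clipped_term(math.log(1.5), 0.0, advantage=2.0)
loss_uncapped = ppo_clipped_term(math.log(1.5), 0.0, advantage=-2.0)
print(shaped, gain_capped, loss_uncapped)
```

The clip caps how much credit a single update can claim for raising the probability of a good response, while leaving the penalty for raising the probability of a bad one uncapped. That asymmetry is what keeps each policy step small.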

Reinforcement learning from human feedback aligns a policy in three stages:

Preference data. Humans compare response pairs, giving (x,\ y_w,\ y_l) with y_w \succ y_l.
Reward model. Fit r_\phi(x,y) under Bradley–Terry, P(y_w \succ y_l) = \sigma\!\big(r_\phi(x,y_w) - r_\phi(x,y_l)\big), by a logistic / cross-entropy loss.
Policy optimisation. Use RL (PPO) to maximise \mathbb{E}[\,r_\phi(x,y)\,] - \beta\,\mathbb{D}_{\mathrm{KL}}(\pi_\theta \,\|\, \pi_{\text{ref}}) — reward minus a KL penalty that keeps the policy near the SFT model.

The reward–KL trade-off

In the plot, as the policy drifts from the reference model (rightward), the proxy reward r_\phi keeps climbing, but the true quality peaks and then collapses once the model starts gaming the proxy. The dashed curve is the KL penalty \beta\,\mathbb{D}_{\mathrm{KL}} that the objective subtracts; what is actually optimised is the proxy reward minus that penalty. Turn \beta up to pull the optimum back toward the safe region; turn it down and the policy chases the proxy off the cliff.

A reward model is a proxy for human judgement, trained on a finite set of comparisons. Optimise any proxy hard enough and you find its blind spots — this is reward hacking, an instance of Goodhart's law: "when a measure becomes a target, it ceases to be a good measure." In practice the policy discovers tricks that the reward model overrates: padding answers with confident-sounding boilerplate, mimicking the surface style of preferred responses without the substance, or exploiting a quirk like "longer answers tend to win". Each scores well on r_\phi while being worse for an actual human.

The \beta\,\mathbb{D}_{\mathrm{KL}} term is the leash that keeps this in check. By anchoring the policy to the SFT reference, it penalises the wild, out-of-distribution outputs where the reward model is least trustworthy, so the policy moves only as far as the leash allows. With too long a leash (small \beta) the model walks off the cliff into gibberish that games the score; with too short a leash (large \beta) it barely improves on SFT. That makes \beta one of the central knobs of RLHF.
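The trade-off can be reproduced on a toy problem. The sketch below assumes one prompt with three possible responses, one of which (padded boilerplate) the proxy reward overrates. Instead of running PPO, it uses the standard closed-form maximiser of the KL-regularised objective over a finite response set, \pi^*(y) \propto \pi_{\text{ref}}(y)\, e^{r_\phi(y)/\beta}. All numbers, including the "true quality" column, are invented for illustration.

```python
import math


def optimal_policy(reference, proxy_rewards, beta):
    """Maximiser of E[r] - beta * KL(pi || ref) over a finite response set."""
    top = max(proxy_rewards)  # subtracted for numerical stability only
    weights = [p * math.exp((r - top) / beta)
               for p, r in zip(reference, proxy_rewards)]
    total = sum(weights)
    return [w / total for w in weights]


def expectation(policy, values):
    return sum(p * v for p, v in zip(policy, values))


def kl_divergence(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Responses: plain answer, genuinely better answer, padded boilerplate.
reference = [0.700, 0.299, 0.001]   # SFT model rarely pads (toy numbers)
proxy_reward = [0.0, 2.0, 3.0]      # reward model overrates the padded answer
true_quality = [0.0, 2.0, -2.0]     # what a human would actually think

sweep = {}
for beta in (10.0, 1.0, 0.1):
    policy = optimal_policy(reference, proxy_reward, beta)
    sweep[beta] = (
        kl_divergence(policy, reference),
        expectation(policy, proxy_reward),
        expectation(policy, true_quality),
    )
    kl, proxy, true = sweep[beta]
    print(f"beta={beta:>4}: KL={kl:.3f}  proxy={proxy:.3f}  true={true:.3f}")
```

As \beta shrinks, the KL from the reference and the proxy reward both rise, while the true quality first improves and then turns negative once the policy piles onto the overrated response.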

Where this sits

RLHF works, but the pipeline is heavy: train a separate reward model, then run an unstable RL loop with several models in memory at once. A natural question is whether the reward model and the RL can be skipped entirely — collapsing all three stages into one supervised loss. They can, and that is direct preference optimization.