The derivation, line by line
Step 1 — recall the RLHF objective. RLHF maximises reward minus a KL leash
to the reference model:
\max_{\pi}\ \ \mathbb{E}_{x,\,y\sim\pi}\big[\, r(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\ \|\ \pi_{\text{ref}}(\cdot\mid x)\big).
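To make the trade-off concrete, here is a minimal sketch that evaluates this objective for a single prompt with three candidate responses. The distributions, rewards, and function names are illustrative assumptions, not part of any library.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(p_i * math.log(p_i / q_i) for p_i, q_i in zip(p, q) if p_i > 0.0)

def rlhf_objective(policy, reference, rewards, beta):
    """Expected reward under `policy` minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    return expected_reward - beta * kl_divergence(policy, reference)

# Hypothetical toy prompt with three candidate responses.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Staying at the reference costs no KL, so the objective is just the expected reward.
stay_put = rlhf_objective(reference, reference, rewards, beta)

# Piling almost all mass on the best response earns more reward but pays a KL penalty.
greedy = rlhf_objective([0.01, 0.01, 0.98], reference, rewards, beta)

print(stay_put, greedy)
```

The second policy collects nearly the maximum reward, yet its objective value is lower than its raw expected reward because the leash charges for the distance from the reference.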
Step 2 — it has a closed-form optimum. This particular objective, an expected
reward (linear in \pi) plus a KL penalty, can be solved exactly, not just numerically. The optimal
policy tilts the reference distribution by the exponentiated reward:
\pi^\star(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),
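where Z(x) = \sum_{y}\pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta)
is the normaliser, a sum over every possible response y and therefore intractable
for a language model. (High-reward responses get up-weighted; 1/\beta controls how
aggressively, so a smaller \beta tilts harder.)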
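One way to see why this tilted distribution is the optimum is to rewrite the per-prompt objective as a single KL divergence. The symbol J(\pi) for the objective at a fixed prompt x is notation introduced here for the sketch; everything else uses the definitions above.

```latex
\begin{aligned}
J(\pi) &= \sum_y \pi(y\mid x)\, r(x,y) \;-\; \beta \sum_y \pi(y\mid x)\,\log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)} \\
&= -\beta \sum_y \pi(y\mid x)\,\log\frac{\pi(y\mid x)}{\pi_{\text{ref}}(y\mid x)\,\exp\!\big(r(x,y)/\beta\big)} \\
&= -\beta \sum_y \pi(y\mid x)\,\log\frac{\pi(y\mid x)}{\pi^\star(y\mid x)} \;+\; \beta \log Z(x) \\
&= -\beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\,\|\,\pi^\star(\cdot\mid x)\big) \;+\; \beta \log Z(x).
\end{aligned}
```

The last term does not depend on \pi, and a KL divergence is zero only when its two arguments agree, so J is maximised exactly at \pi = \pi^\star, where it equals \beta \log Z(x).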
Step 3 — invert it to read the reward off the policy. Take logs of
Step 2 and solve for r. The reward is recovered from the optimal
policy:
r(x, y) = \beta\,\log\frac{\pi^\star(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta\,\log Z(x).
This is the key re-expression: the reward is an implicit quantity, the
log-ratio of the policy to the reference (times \beta), plus a
term that depends only on x.
Step 4 — substitute into Bradley–Terry; the normaliser cancels. The
preference likelihood depends only on the difference of rewards,
r(x, y_w) - r(x, y_l). Both share the same
\beta\log Z(x), so it cancels — the intractable
normaliser disappears:
r(x, y_w) - r(x, y_l) = \beta\,\log\frac{\pi^\star(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta\,\log\frac{\pi^\star(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}.
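Steps 2 to 4 can be checked numerically on a toy problem. The sketch below assumes a single prompt with three responses and made-up rewards; the helper names are hypothetical.

```python
import math

def optimal_policy(reference, rewards, beta):
    """Tilt the reference by exp(r / beta) and normalise (Step 2)."""
    weights = [p * math.exp(r / beta) for p, r in zip(reference, rewards)]
    z = sum(weights)  # the normaliser Z(x)
    return [w / z for w in weights], z

def implicit_reward(policy, reference, beta):
    """beta * log(pi / pi_ref): the reward up to the x-only term (Step 3)."""
    return [beta * math.log(p / q) for p, q in zip(policy, reference)]

# Hypothetical toy prompt with three candidate responses.
reference = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = optimal_policy(reference, rewards, beta)
log_ratio_reward = implicit_reward(pi_star, reference, beta)

# Step 3: adding beta * log Z recovers the original reward.
recovered = [h + beta * math.log(z) for h in log_ratio_reward]

# Step 4: the reward difference needs no Z at all (winner = index 2, loser = index 0).
w, l = 2, 0
gap_from_policy = log_ratio_reward[w] - log_ratio_reward[l]
gap_from_reward = rewards[w] - rewards[l]

print(recovered, gap_from_policy, gap_from_reward)
```

In this toy case Z is a three-term sum, so it is cheap to compute; the point of Step 4 is that the comparison never needs it, which matters when y ranges over all possible text.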
Step 5 — drop it into the preference loss. Bradley–Terry's negative
log-likelihood, -\log\sigma\big(r(x,y_w) - r(x,y_l)\big), now
contains only the optimal policy and the frozen reference, with no explicit reward. Replacing the
unknown optimum \pi^\star with our trainable
\pi_\theta gives the DPO loss:
\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta\,\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right].
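The loss translates almost directly into code. The following is a minimal sketch in plain Python, assuming the sequence log-probabilities \log\pi(y \mid x) (the sum of token log-probs of each response) have already been computed for both models. The function names, the default \beta = 0.1, and the numbers are illustrative assumptions.

```python
import math

def log_sigmoid(z):
    """Numerically stable log(sigmoid(z))."""
    if z >= 0:
        return -math.log1p(math.exp(-z))
    return z - math.log1p(math.exp(z))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch of preference pairs.

    Each argument is a list of sequence log-probabilities log pi(y | x):
    `_w` for the preferred response, `_l` for the rejected one.
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        margin = beta * (pw - rw) - beta * (pl - rl)
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)

# Hypothetical log-probs for two preference pairs.
ref_w, ref_l = [-12.0, -20.0], [-11.0, -19.0]

# Policy identical to the reference: every margin is 0, so the loss is log 2.
at_start = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# Policy that raised the winners and lowered the losers relative to the reference.
improved = dpo_loss([-10.0, -18.0], [-13.0, -21.0], ref_w, ref_l)

print(at_start, improved)
```

In a real training setup the same arithmetic would run on tensors inside an autodiff framework, with gradients flowing only through the policy log-probs; the reference log-probs are constants.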
Step 6 — read what it does. No reward model, no RL, no sampling — just a
log-sigmoid loss evaluated on preference pairs, exactly like a classifier. It raises the
likelihood of the preferred response y_w and lowers that of the
rejected y_l, each relative to the reference model.
The reference appears in both ratios, so DPO rewards moving the right way relative to where
you started — and the same \beta from RLHF is still the leash.
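The "raise the winner, lower the loser" behaviour can be watched on a toy policy. The sketch below assumes a softmax policy over three responses to one prompt, initialised at the reference, and runs plain gradient descent on the DPO loss for a single preference pair. All values and names are hypothetical.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical toy setup: one prompt, three responses, winner 0 beats loser 1.
reference = [0.2, 0.5, 0.3]
logits = [math.log(p) for p in reference]  # the policy starts at the reference
winner, loser = 0, 1
beta, learning_rate = 0.5, 1.0

def margin(current_logits):
    policy = softmax(current_logits)
    return beta * (math.log(policy[winner] / reference[winner])
                   - math.log(policy[loser] / reference[loser]))

initial_loss = -math.log(sigmoid(margin(logits)))
for _ in range(50):
    # d(loss)/dz = -sigmoid(-z). The softmax normaliser cancels in
    # log pi(y_w) - log pi(y_l), so dz/dlogit is +beta for the winner,
    # -beta for the loser, and 0 for every other response.
    push = learning_rate * beta * sigmoid(-margin(logits))
    logits[winner] += push
    logits[loser] -= push

policy = softmax(logits)
final_loss = -math.log(sigmoid(margin(logits)))
print(policy, initial_loss, final_loss)
```

The size of each update shrinks as the margin grows, so the policy drifts away from the reference more and more slowly rather than jumping to a one-hot distribution.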
DPO trains directly on preference pairs with one supervised loss:
- Implicit reward. The policy defines its own reward, the log-ratio
  r(x,y) = \beta\,\log\!\big(\pi_\theta(y\mid x)/\pi_{\text{ref}}(y\mid x)\big)
  (up to an x-only term that cancels in any comparison).
- The loss. Substituting into Bradley–Terry gives a single
  log-sigmoid loss,
  -\log\sigma\!\big(\beta\log\tfrac{\pi_\theta(y_w\mid x)}{\pi_{\text{ref}}(y_w\mid x)} - \beta\log\tfrac{\pi_\theta(y_l\mid x)}{\pi_{\text{ref}}(y_l\mid x)}\big).
- Equivalence. Under the Bradley–Terry preference model its optimum is the same as RLHF's,
  since both target the same KL-leashed objective, but it needs no reward model and no RL,
  so it is far simpler and typically more stable to train.
The loss is just −log σ of the margin
Strip the loss to its shape. Call the bracketed quantity the preference margin
z = \beta\,\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)},
and the loss is simply -\log\sigma(z). When the policy already
prefers the winner (margin z > 0) the loss is small; when it
prefers the loser (z < 0) the loss is large. Its derivative,
-\sigma(-z), is largest in magnitude exactly where the model is most wrong
and shrinks towards zero once the winner is comfortably ahead, so the gradient
pushes the margin up: likelihood of y_w up, of y_l down. Because z is
proportional to \beta, \beta scales how hard a given log-ratio
gap is pushed.
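Both claims, the form of the derivative and the role of \beta, are easy to check numerically. The sketch below compares the analytic derivative with a finite difference and then varies \beta for a fixed log-ratio gap; the gap of -2 nats and the \beta values are arbitrary illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(z):
    """-log sigma(z), written as log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))

def loss_grad(z):
    """Analytic derivative of -log sigma(z) with respect to z."""
    return -sigmoid(-z)

# Finite-difference check of the derivative at a few margins.
eps = 1e-6
for z in (-3.0, 0.0, 3.0):
    numeric = (loss(z + eps) - loss(z - eps)) / (2 * eps)
    print(f"z={z:+.1f}  loss={loss(z):.4f}  grad={loss_grad(z):+.4f}  numeric={numeric:+.4f}")

# Effect of beta on a fixed log-ratio gap (hypothetical: the policy favours the loser).
log_ratio_gap = -2.0
for beta in (0.1, 0.5, 1.0):
    z = beta * log_ratio_gap
    # Chain rule: d(loss)/d(gap) = beta * d(loss)/dz
    print(f"beta={beta}  margin={z:+.2f}  loss={loss(z):.4f}  grad wrt gap={beta * loss_grad(z):+.4f}")
```

The gradient with respect to the raw log-ratio gap carries an explicit factor of \beta, which is why the same mistake is corrected more forcefully at larger \beta.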
On paper DPO and PPO-RLHF optimise the same KL-leashed objective; in practice DPO won most of
the field for engineering reasons. PPO needs four models live at once — policy,
reference, reward model, and a value/critic network — plus on-policy sampling inside the
training loop, which makes it memory-hungry, slow, and notoriously sensitive to
hyperparameters (a slightly-off learning rate or KL coefficient and the run diverges). DPO
needs only the policy and a frozen reference, no reward model, no sampling, no critic: it is
an ordinary supervised loss you can run with the same tooling as SFT, often as a few extra lines on top of an adapter
fine-tune. Cheaper, more stable, and far easier to reproduce, DPO and its descendants became the
default way to align open models. PPO-RLHF retains an edge in some frontier settings, where its
online sampling can find reward that a static preference set never reveals, but for most
teams, "one log-sigmoid loss" beat "stand up an RL stack".