Direct Preference Optimization

RLHF works, but its machinery is heavy: train a separate reward model, then run an unstable reinforcement-learning loop with several models resident in memory. Direct preference optimization (DPO) reaches the same destination — a policy aligned to human preferences — with neither a reward model nor RL. The whole pipeline collapses into a single supervised loss on preference pairs. The magic is entirely algebra, and it is worth seeing line by line.

The derivation, line by line

Step 1 — recall the RLHF objective. RLHF maximises reward minus a KL leash to the reference model:

\max_{\pi}\ \ \mathbb{E}_{x,\,y\sim\pi}\big[\, r(x, y)\,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\!\big(\pi(\cdot\mid x)\ \|\ \pi_{\text{ref}}(\cdot\mid x)\big).

Step 2 — it has a closed-form optimum. This particular objective — a linear reward with a KL penalty — is solved exactly, not just numerically. The optimal policy tilts the reference distribution by the exponentiated reward:

\pi^\star(y \mid x) = \frac{1}{Z(x)}\,\pi_{\text{ref}}(y \mid x)\,\exp\!\Big(\tfrac{1}{\beta}\, r(x, y)\Big),

where Z(x) = \sum_{y}\pi_{\text{ref}}(y\mid x)\exp(r(x,y)/\beta) is the normaliser. (High-reward responses get up-weighted; the 1/\beta controls how aggressively.)
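The closed form can be checked numerically on a toy problem. The sketch below assumes a single prompt with only three candidate responses, so the sum in Z(x) is trivial. The probabilities, rewards, and \beta are made-up illustration values, and the helper names tilted_policy and kl_objective are hypothetical.

```python
import math

def tilted_policy(ref_probs, rewards, beta):
    """pi*(y) = pi_ref(y) * exp(r(y) / beta) / Z over a finite response set."""
    weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
    z = sum(weights)  # the normaliser Z(x)
    return [w / z for w in weights], z

def kl_objective(policy, ref_probs, rewards, beta):
    """Expected reward minus beta * KL(policy || reference)."""
    expected_reward = sum(p * r for p, r in zip(policy, rewards))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, ref_probs) if p > 0)
    return expected_reward - beta * kl

# Toy values, assumed purely for illustration.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

pi_star, z = tilted_policy(ref_probs, rewards, beta)
best = kl_objective(pi_star, ref_probs, rewards, beta)

# A few alternative policies: stay at the reference, go uniform, or go greedy.
candidates = [ref_probs, [1 / 3, 1 / 3, 1 / 3], [0.0, 0.0, 1.0]]
for policy in candidates:
    print(policy, kl_objective(policy, ref_probs, rewards, beta))
print("tilted policy:", pi_star, "objective:", best)
```

Each alternative scores below the tilted policy, and the optimal value itself equals \beta\log Z(x), which follows from substituting \pi^\star back into the objective.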

Step 3 — invert it to read the reward off the policy. Take logs of Step 2 and solve for r. The reward is recovered from the optimal policy:

r(x, y) = \beta\,\log\frac{\pi^\star(y \mid x)}{\pi_{\text{ref}}(y \mid x)} \;+\; \beta\,\log Z(x).

This is the key re-expression: the reward is an implicit quantity, the log-ratio of the policy to the reference (times \beta), plus a term that depends only on x.
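As a sanity check on the inversion, the sketch below builds \pi^\star from known rewards and then reads the rewards back off the policy. The three-response setup and its numbers are illustrative assumptions, and implicit_reward is a hypothetical helper name.

```python
import math

def implicit_reward(policy_prob, ref_prob, beta):
    """beta * log(pi(y|x) / pi_ref(y|x)): the reward up to a term that depends only on x."""
    return beta * math.log(policy_prob / ref_prob)

# Toy values, assumed purely for illustration.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

# Forward direction (Step 2): tilt the reference by the exponentiated reward.
weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
z = sum(weights)
pi_star = [w / z for w in weights]

# Inverse direction (Step 3): log-ratio times beta, plus beta * log Z.
log_ratio_part = [implicit_reward(p, q, beta) for p, q in zip(pi_star, ref_probs)]
recovered = [lr + beta * math.log(z) for lr in log_ratio_part]

print("original rewards :", rewards)
print("recovered rewards:", recovered)
```

Without the \beta\log Z(x) term, the log-ratio part alone differs from the true reward by the same constant for every response to this prompt.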

Step 4 — substitute into Bradley–Terry; the normaliser cancels. The preference likelihood depends only on the difference of rewards, r(x, y_w) - r(x, y_l). Both share the same \beta\log Z(x), so it cancels — the intractable normaliser disappears:

r(x, y_w) - r(x, y_l) = \beta\,\log\frac{\pi^\star(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} \;-\; \beta\,\log\frac{\pi^\star(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}.
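The cancellation can be seen numerically: the Bradley–Terry preference probability computed from the true rewards matches the one computed from policy and reference probabilities alone, with Z(x) never appearing in the second calculation. This is a minimal sketch on assumed toy numbers; the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def bt_prob_from_rewards(r_w, r_l):
    """Bradley-Terry probability that the winner beats the loser, from rewards."""
    return sigmoid(r_w - r_l)

def bt_prob_from_policy(pi_w, pi_l, ref_w, ref_l, beta):
    """Same probability from log-ratios only; no normaliser is needed."""
    margin = beta * math.log(pi_w / ref_w) - beta * math.log(pi_l / ref_l)
    return sigmoid(margin)

# Toy values, assumed purely for illustration.
ref_probs = [0.5, 0.3, 0.2]
rewards = [0.0, 1.0, 2.0]
beta = 0.5

weights = [q * math.exp(r / beta) for q, r in zip(ref_probs, rewards)]
total = sum(weights)
pi_star = [w / total for w in weights]

w, l = 2, 0  # indices of the preferred and rejected responses
p_rewards = bt_prob_from_rewards(rewards[w], rewards[l])
p_policy = bt_prob_from_policy(pi_star[w], pi_star[l], ref_probs[w], ref_probs[l], beta)

# Adding any prompt-only offset to the reward leaves the preference unchanged.
p_shifted = bt_prob_from_rewards(rewards[w] + 7.0, rewards[l] + 7.0)
print(p_rewards, p_policy, p_shifted)
```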

Step 5 — drop it into the preference loss. Bradley–Terry's negative log-likelihood, -\log\sigma\big(r(x,y_w) - r(x,y_l)\big), now contains only the policy we are training and the frozen reference. Replacing the unknown optimum \pi^\star with our trainable \pi_\theta gives the DPO loss:

\mathcal{L}_{\text{DPO}} = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\!\left(\beta\,\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right].
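In code, the loss needs only four numbers per preference pair: the log-probability of each response under the policy and under the reference. The following is a minimal pure-Python sketch, not a reference implementation. It assumes the log-probabilities have already been summed over response tokens, the function names are hypothetical, and the default beta=0.1 and the sample values are illustrative assumptions.

```python
import math

def log_sigmoid(t):
    """Numerically stable log(sigmoid(t))."""
    if t >= 0:
        return -math.log1p(math.exp(-t))
    return t - math.log1p(math.exp(t))

def dpo_loss(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Mean DPO loss over a batch.

    Each argument is a list of sequence log-probabilities log pi(y | x),
    one entry per preference pair (w = preferred, l = rejected).
    """
    losses = []
    for pw, pl, rw, rl in zip(policy_logp_w, policy_logp_l, ref_logp_w, ref_logp_l):
        # beta * [log-ratio of the winner - log-ratio of the loser]
        margin = beta * ((pw - rw) - (pl - rl))
        losses.append(-log_sigmoid(margin))
    return sum(losses) / len(losses)

# Made-up log-probabilities for two preference pairs.
ref_w, ref_l = [-12.0, -8.5], [-11.0, -9.0]

# At initialisation the policy equals the reference, so every margin is zero.
loss_init = dpo_loss(ref_w, ref_l, ref_w, ref_l)

# A policy that has shifted probability towards the preferred responses.
pol_w, pol_l = [-10.0, -8.0], [-13.0, -9.5]
loss_later = dpo_loss(pol_w, pol_l, ref_w, ref_l)

print(loss_init, loss_later)
```

With the policy initialised from the reference, every margin starts at zero and the loss starts at log 2. In a real training loop the reference log-probabilities would be computed without gradients, and only the policy terms would be differentiated.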

Step 6 — read what it does. No reward model, no RL, no sampling — just a log-sigmoid loss evaluated on preference pairs, exactly like a classifier. It raises the likelihood of the preferred response y_w and lowers that of the rejected y_l, each relative to the reference model. The reference appears in both ratios, so DPO rewards moving the right way relative to where you started — and the same \beta from RLHF is still the leash.

[Figure: DPO trains directly on preference pairs with one supervised loss.]

The loss is just −log σ of the margin

Strip the loss to its shape. Call the bracketed quantity the preference margin

z = \beta\,\log\frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta\,\log\frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)},

and the loss is simply -\log\sigma(z). When the policy already prefers the winner (margin z > 0) the loss is small; when it prefers the loser (z < 0) the loss is large. Its derivative with respect to z, -\sigma(-z), is largest in magnitude exactly where the model is most wrong and fades towards zero as the margin grows. The gradient therefore pushes the margin up hardest on the pairs the policy currently gets wrong: likelihood of y_w up, of y_l down, each relative to the reference, until the winner is comfortably ahead. The coefficient \beta scales how hard a given log-ratio gap is pushed.
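The shape of the loss and its derivative can be tabulated directly. This sketch compares the analytic derivative -\sigma(-z) with a finite-difference estimate, then holds one log-ratio gap fixed while varying \beta. The margins, the gap, and the \beta values are arbitrary illustration choices, and the function names are hypothetical.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def margin_loss(z):
    """-log(sigmoid(z)), written as log(1 + exp(-z))."""
    return math.log1p(math.exp(-z))

def margin_grad(z):
    """Analytic derivative of margin_loss with respect to z."""
    return -sigmoid(-z)

def numeric_grad(f, z, h=1e-6):
    """Central finite difference, used only to check margin_grad."""
    return (f(z + h) - f(z - h)) / (2 * h)

margins = [-4.0, -2.0, 0.0, 2.0, 4.0]
for z in margins:
    print(f"z={z:+.1f}  loss={margin_loss(z):.4f}  "
          f"analytic={margin_grad(z):+.4f}  numeric={numeric_grad(margin_loss, z):+.4f}")

# Effect of beta: the margin is z = beta * gap, so by the chain rule
# d(loss)/d(gap) = beta * margin_grad(beta * gap).
gap = -2.0  # assumed log-ratio gap; negative means the policy favours the loser
pulls = []
for beta in (0.05, 0.1, 0.5):
    pull = -beta * margin_grad(beta * gap)  # size of the push on the gap
    pulls.append(pull)
    print(f"beta={beta}  push on gap={pull:.4f}")
```

For this particular gap, a larger \beta yields a larger push, and the gradient magnitude shrinks steadily as the margin moves from negative to positive.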

On paper DPO and PPO-RLHF optimise the same KL-leashed objective; in practice DPO was adopted far more widely, for engineering reasons. PPO needs four models live at once (policy, reference, reward model, and a value/critic network) plus on-policy sampling inside the training loop, which makes it memory-hungry, slow, and notoriously sensitive to hyperparameters: a slightly-off learning rate or KL coefficient can make the run diverge. DPO needs only the policy and a frozen reference, with no reward model, no sampling, and no critic. It is an ordinary supervised loss you can run with the same tooling as SFT, often as a few extra lines on top of an adapter fine-tune. Cheaper, more stable, and far easier to reproduce, DPO and its descendants became the default way to align open models. PPO-RLHF retains an edge in some frontier settings, where its online sampling can squeeze out reward the static preference set never reveals, but for most teams "one log-sigmoid loss" beat "stand up an RL stack".