Speculative Decoding

Generating text one token at a time is slow because each step needs a full forward pass of a huge model, and each pass mostly just reads the weights — the GPU's arithmetic units sit nearly idle. Speculative decoding exploits that idle compute to go 2–3× faster while producing exactly the same output distribution as plain decoding. Nothing about the model's answers changes; only the wall-clock time does.

The idea: let a small, cheap draft model guess several tokens ahead, then have the big target model check all the guesses at once — and one big forward pass can check many tokens for almost the price of generating one.

Deriving the speedup

Write the target (big) model's next-token distribution as p(\cdot \mid \text{prefix}) and the draft (small) model's as q(\cdot \mid \text{prefix}). Plain decoding samples one token from p per forward pass.

Step 1 — draft k tokens cheaply. Run the small model autoregressively to propose a short continuation x_1, x_2, \dots, x_k, each sampled from q. This costs k small forward passes — cheap, because the draft model is a fraction of the target's size.

Step 2 — verify all k in one parallel pass. Feed the prefix plus the k proposed tokens into the target model as a single sequence. Because attention over a known prefix is computed in parallel across positions, one forward pass yields the target's distribution p at every proposed position at once:

\text{one target pass} \;\Rightarrow\; p(\cdot \mid x_{<1}),\ p(\cdot \mid x_{<2}),\ \dots,\ p(\cdot \mid x_{

For a memory-bound decode this costs about the same as generating a single token: the weights are read once either way; we merely do a little more (nearly free) arithmetic over the k positions.

Step 3 — accept along the matching prefix. Walk the proposed tokens left to right. Token x_i, drawn from q, is accepted with probability

\min\!\left(1,\ \frac{p(x_i \mid \text{prefix})}{q(x_i \mid \text{prefix})}\right).

If the target likes the token at least as much as the draft did (p \ge q) it is accepted outright; otherwise it survives only in proportion p/q. Stop at the first rejection.

Step 4 — correct the first rejected token. When x_i is rejected, resample that one position from the residual distribution — the target's leftover probability after removing what the draft already covered:

p_{\text{res}}(\cdot) \;\propto\; \max\!\big(0,\ p(\cdot \mid \text{prefix}) - q(\cdot \mid \text{prefix})\big).

Step 5 — why the distribution is exactly preserved. This accept/reject rule is a form of rejection sampling. The probability of finally emitting any token y is "drafted and accepted" plus "rejected, then resampled to y":

q(y)\cdot\min\!\Big(1, \tfrac{p(y)}{q(y)}\Big) \;+\; (\text{reject prob})\cdot p_{\text{res}}(y) \;=\; p(y).

The two pieces sum to exactly p(y) — the target's own distribution. So the output is statistically indistinguishable from running the big model alone; the draft only changes speed.

Step 6 — net speedup. If on average n of the k drafted tokens are accepted, one expensive target pass yields n + 1 tokens (the accepted run, plus the corrected one) instead of 1. When the draft agrees often — which it does for the easy, predictable tokens that dominate text — that is the observed 2\text{–}3\times speedup.

Accelerate generation with a draft–verify loop that leaves the output law unchanged:

Draft then verify. A small model proposes k tokens; the target model scores all k in one parallel forward pass (≈ the cost of one token, since decode is memory-bound).
Accept / reject. Accept each proposal with probability \min(1, p/q), stop at the first rejection, and resample it from the residual \propto \max(0, p - q).
Exactness + speed. The emitted distribution equals the target's p exactly; with n tokens accepted per pass you get n+1 tokens per expensive pass — 2\text{–}3\times in practice.

The whole scheme rests on decode being memory-bandwidth-bound, not compute-bound. Autoregressive decoding produces one token per pass, and each pass must stream all the model's weights through the chip. The actual floating-point work for a single position is a tiny fraction of what the hardware could do in that time, so the arithmetic units idle while the weights load.

Verifying k proposed tokens reads those same weights exactly once and spends the otherwise-wasted FLOPs scoring k positions in parallel. The extra compute is nearly free, so a single expensive weight-load can confirm a whole run of tokens — turning idle arithmetic into a speedup.

One round of draft-then-verify

The top row is the small draft model proposing k tokens. The bottom row is the single target pass that scores them all at once. The matching prefix is accepted (kept), and the first mismatch is replaced by the target's corrected token; everything after the mismatch is thrown away and re-drafted next round. Use Refresh to roll a new draft and see how many tokens land.