Generating text one token at a time is slow because each step needs a full forward pass of a huge model, and each pass mostly just reads the weights — the GPU's arithmetic units sit nearly idle. Speculative decoding exploits that idle compute to go 2–3× faster while producing exactly the same output distribution as plain decoding. Nothing about the model's answers changes; only the wall-clock time does.
The idea: let a small, cheap draft model guess several tokens ahead, then have the big target model check all the guesses at once — and one big forward pass can check many tokens for almost the price of generating one.
Write the target (big) model's next-token distribution as $p(x)$ and the draft (small) model's as $q(x)$, both conditioned on the tokens generated so far.
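Step 1 — draft. The draft model generates a short run of $K$ candidate tokens $x_1, \dots, x_K$ autoregressively, sampling each one from its own next-token distribution $q$.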
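Step 2 — verify all drafted tokens at once. The target model runs a single forward pass over the context plus the drafted tokens, which yields its own next-token distribution $p$ at every drafted position.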
For a memory-bound decode this costs about the same as generating a single token: the
weights are read once either way; we merely do a little more (nearly free) arithmetic over the extra drafted positions.
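Step 3 — accept along the matching prefix. Walk the proposed tokens left to
right. Token $x_i$ is accepted with probability $\min\left(1, p(x_i)/q(x_i)\right)$, where $p$ is the target's distribution and $q$ the draft's at that position.
If the target likes the token at least as much as the draft did
($p(x_i) \ge q(x_i)$), it is always accepted; otherwise it is accepted with probability $p(x_i)/q(x_i)$. The first rejection ends the walk, and every draft token after it is discarded.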
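The acceptance walk in Step 3 can be sketched in a few lines of Python. This is a minimal illustration rather than any library's implementation: the function name `accept_prefix` and its argument layout are assumptions made for this example, and it takes the per-token probabilities as already computed.

```python
import random


def accept_prefix(draft_tokens, p_probs, q_probs, rng=random):
    """Return how many leading draft tokens are accepted.

    p_probs[i] and q_probs[i] are the target and draft probabilities of
    draft_tokens[i] at position i. q_probs[i] is assumed to be > 0, which
    holds whenever the token was actually sampled from the draft model.
    (Hypothetical helper, named for this example only.)
    """
    n_accepted = 0
    for _token, p, q in zip(draft_tokens, p_probs, q_probs):
        accept_prob = min(1.0, p / q)
        if rng.random() < accept_prob:
            n_accepted += 1
        else:
            break  # first rejection ends the walk; later drafts are dropped
    return n_accepted
```

When the target's probability is at least the draft's, `accept_prob` is 1 and the token always passes; a token the target gives zero probability is always rejected.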
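Step 4 — correct the first rejected token. When a draft token is rejected, replace it with a token sampled from the residual distribution $\max(0, p(x) - q(x))$, renormalized to sum to 1, where $p$ and $q$ are the target and draft distributions at that position. This puts back exactly the probability mass that the draft under-proposed.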
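As a rough sketch of the correction in Step 4, the residual distribution can be built and sampled as below. The names `residual_distribution` and `sample_correction` are hypothetical, and distributions are represented as plain dictionaries from token to probability purely for illustration.

```python
import random


def residual_distribution(p, q):
    """Normalize max(0, p[x] - q[x]) over the vocabulary of p."""
    raw = {x: max(0.0, p[x] - q.get(x, 0.0)) for x in p}
    total = sum(raw.values())
    if total == 0.0:
        # p and q are identical, so a rejection cannot occur; fall back to p.
        return dict(p)
    return {x: w / total for x, w in raw.items()}


def sample_correction(p, q, rng=random):
    """Draw the replacement for a rejected draft token."""
    residual = residual_distribution(p, q)
    tokens = list(residual)
    weights = [residual[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


# Toy example: the draft over-proposes "b" and under-proposes "a".
p = {"a": 0.5, "b": 0.3, "c": 0.2}
q = {"a": 0.2, "b": 0.6, "c": 0.2}
residual = residual_distribution(p, q)
```

In this toy case only "a" has more target mass than draft mass, so all of the residual probability lands on "a".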
Step 5 — why the distribution is exactly preserved. This accept/reject rule is
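a form of rejection sampling. The probability of finally emitting any token $x$ at a position, with target distribution $p$ and draft distribution $q$, has two pieces. Either $x$ is drafted and accepted, with probability $q(x)\min\left(1, p(x)/q(x)\right) = \min(p(x), q(x))$, or the draft is rejected and $x$ is drawn as the correction. The second piece is the total rejection probability $1 - \sum_{x'} \min(p(x'), q(x'))$ times the renormalized residual probability of $x$, which multiplies out to $\max(0, p(x) - q(x))$.
The two pieces sum to exactly $p(x)$, the target's own probability.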
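The identity in Step 5 can be checked exactly on a toy vocabulary using rational arithmetic. The sketch below is only an illustration: `emit_probability` is a hypothetical helper, and it assumes both distributions are dictionaries over the same tokens.

```python
from fractions import Fraction as F


def emit_probability(p, q):
    """Exact per-token output probability of the accept-or-correct rule."""
    # Piece 1: token x is drafted (prob q[x]) and then accepted.
    accept = {
        x: q[x] * min(F(1), p[x] / q[x]) if q[x] > 0 else F(0)
        for x in p
    }
    # Piece 2: some draft is rejected, then x is drawn from the residual.
    reject_mass = 1 - sum(accept.values())
    residual = {x: max(F(0), p[x] - q[x]) for x in p}
    z = sum(residual.values())
    return {
        x: accept[x] + (reject_mass * residual[x] / z if z > 0 else F(0))
        for x in p
    }


p = {"a": F(1, 2), "b": F(3, 10), "c": F(1, 5)}
q = {"a": F(1, 5), "b": F(3, 5), "c": F(1, 5)}
result = emit_probability(p, q)
```

Here `result` equals `p` exactly, even though the draft distribution `q` is quite different from it.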
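Step 6 — net speedup. If on average $\bar{n}$ tokens are emitted per round, and a round costs one target pass plus $K$ draft passes at a fraction $c$ of a target pass each, the speedup over plain decoding is about $\bar{n} / (1 + cK)$.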
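To get a feel for the numbers in Step 6, here is a back-of-the-envelope estimate. It rests on simplifying assumptions that are not taken from the text above: each draft token is accepted independently with the same probability `alpha`, every round emits one extra token from the target (the correction after a rejection, or a bonus token when all `k` drafts pass), and one draft pass costs a fraction `c` of a target pass. All names are illustrative.

```python
def expected_tokens_per_round(alpha, k):
    """Expected tokens emitted per round: 1 + alpha + ... + alpha**k."""
    if alpha == 1.0:
        return k + 1.0
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


def estimated_speedup(alpha, k, c):
    """Tokens per round divided by round cost, in units of one target pass."""
    return expected_tokens_per_round(alpha, k) / (1.0 + c * k)


# Illustrative values only: 80% acceptance, 4 drafted tokens, draft 20x cheaper.
speedup = estimated_speedup(alpha=0.8, k=4, c=0.05)
```

With a draft that is never accepted (`alpha = 0`), each round still emits one token but pays for the wasted draft passes, so the estimate drops below 1. That is why the draft model has to be both cheap and reasonably accurate.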
The whole scheme rests on decode being memory-bandwidth-bound, not
compute-bound.
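Verifying several drafted tokens in one pass is nearly free only while the GPU has spare arithmetic capacity. At large batch sizes decode becomes compute-bound, the extra positions are no longer free, and the speedup shrinks.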
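The memory-bound argument can be made concrete with a crude roofline estimate. The hardware and model figures below are round illustrative numbers, not measurements of any real device. The sketch also assumes roughly 2 floating-point operations per parameter per token, and it ignores attention and KV-cache traffic.

```python
def forward_pass_time(n_tokens, n_params=7e9, bytes_per_param=2,
                      mem_bandwidth=1e12, peak_flops=1e14):
    """Rough time in seconds for one forward pass over n_tokens positions."""
    memory_time = n_params * bytes_per_param / mem_bandwidth  # read weights once
    compute_time = 2 * n_params * n_tokens / peak_flops       # matmul arithmetic
    return max(memory_time, compute_time)


def break_even_tokens(bytes_per_param=2, mem_bandwidth=1e12, peak_flops=1e14):
    """Positions per pass at which compute time catches up with memory time."""
    return bytes_per_param * peak_flops / (2 * mem_bandwidth)


one_token = forward_pass_time(1)
five_tokens = forward_pass_time(5)
many_tokens = forward_pass_time(200)
```

Under these assumed numbers a pass over 5 positions takes the same estimated time as a pass over 1, because both are limited by reading the weights. Only once the number of positions per pass exceeds `break_even_tokens()` does arithmetic become the bottleneck.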
[Figure: the top row is the small draft model proposing tokens.]