Autoregressive Decoding

We have trained a language model to score the next token given the tokens so far. That gives a probability distribution p(x_{t+1} \mid x_1, \dots, x_t) over the vocabulary. But a distribution is not yet text. To make the model actually write, we run it in a loop — predict a token, glue it on, predict again. That loop is autoregressive decoding, and it is how every chatbot you have used produces words one at a time.

The generation loop, line by line

Start with a prompt — a sequence of tokens x_1, \dots, x_t — and generate a continuation.

Step 1 — run the model to get a next-token distribution. Feed the current sequence through the network. Thanks to causal masking, the final position's output is a distribution over the whole vocabulary V for the token that comes next:

p(\cdot \mid x_1, \dots, x_t) = \operatorname{softmax}\big(\text{logits}(x_1, \dots, x_t)\big) \in \mathbb{R}^{|V|}.

Step 2 — pick a token. Turn that distribution into one concrete token x_{t+1}. The simplest rule is to take the most likely token, but there is a whole family of choices here — that is the subject of the next page, sampling strategies:

x_{t+1} \sim p(\cdot \mid x_1, \dots, x_t).

Step 3 — append it. Stick the new token onto the end of the sequence. The context is now one token longer:

(x_1, \dots, x_t) \;\longrightarrow\; (x_1, \dots, x_t, x_{t+1}).

Step 4 — feed the longer sequence back in. This is the "auto-regressive" part: the model's own output becomes part of its next input. Increment t \leftarrow t + 1 and return to Step 1.

Step 5 — stop. Repeat until the model emits a designated end-of-sequence token \langle \text{eos} \rangle, or until a length cap is hit:

\text{halt when } x_{t+1} = \langle \text{eos} \rangle \ \text{ or } \ t = t_{\max}.

Why this is inherently sequential

Look at Step 4. To produce token x_{t+1} you must already have token x_t — it is part of the input. So the tokens come out strictly in order, one forward pass per token, and you cannot parallelise the output: the (t+1)-th token genuinely depends on the t-th. This is the sharp contrast with training, where the entire target sequence is known ahead of time and causal masking lets all n next-token predictions be computed in a single parallel forward pass. Training is parallel over positions; generation is a serial chain.

Two phases: prefill, then decode

The loop has an asymmetry worth naming, because it dominates how fast generation feels.

Prefill. The prompt x_1, \dots, x_t is already known in full, so — exactly as in training — it goes through the model in one big forward pass that processes all t prompt tokens in parallel. One pass, the whole prompt.
Decode. After that, each new token needs its own forward pass, because each depends on the one before. Generating m new tokens costs m sequential forward passes — the slow part.

So generating m tokens from a t-token prompt is one prefill pass plus m decode passes (the last decode pass is the one that emits the final token; counting whether prefill itself emits a first token is a bookkeeping detail). The decode passes run back-to-back, which is why a long answer takes longer to start finishing than a short one.

A trained language model generates text by iterating a fixed loop:

Sample → append → repeat. Run the model on the current sequence to get p(\cdot \mid x_{\le t}), pick a token x_{t+1} from it, append it, and feed the longer sequence back in — until \langle \text{eos} \rangle or a length cap.
Inherently sequential. Token x_{t+1} requires token x_t as input, so the output cannot be parallelised — one forward pass per generated token, in strict order. (Contrast training, which scores all positions at once.)
Prefill vs decode. The known prompt is processed in a single parallel prefill pass; each subsequent token needs its own serial decode pass. Generating m tokens costs 1 prefill +\,m decode forward passes.

Training a model is a throughput problem: you have a mountain of text and want to push as many tokens through the GPUs per second as possible, and parallelism over positions and examples keeps the hardware saturated. Generation is the opposite — a latency problem. Each decode step is one forward pass that must finish before the next can begin, so the user waits out a chain of

m \times (\text{time for one forward pass})

to read an m-token reply. Worse, a single decode pass moves the entire weight matrix of the model from memory just to produce one token, so the GPU spends most of its time waiting on memory rather than doing arithmetic — generation is memory-bandwidth bound, not compute-bound. This is why so much inference engineering — batching many users together, the KV cache, speculative decoding — is aimed squarely at making each of those serial steps cheaper. The arithmetic is cheap; the waiting is everything.

Watch the loop run

The first cells are the prompt, processed in one prefill pass. Then each step adds exactly one generated token: the model reads everything to its left (the highlighted context) and emits the next cell — a single decode forward pass per token. Step forward to watch the sequence grow one box at a time, right up to the \langle \text{eos} \rangle stop. Refresh for a fresh random run with a different prompt length and stopping point.