We have trained a
language model
to score the next token given the tokens so far. That gives a probability distribution
p(x_{t+1} \mid x_1, \dots, x_t) over the vocabulary. But a
distribution is not yet text. To make the model actually write, we
run it in a loop — predict a token, glue it on, predict again. That loop is
autoregressive decoding, and it is how every chatbot you have used produces
words one at a time.
The generation loop, line by line
Start with a prompt — a sequence of tokens x_1, \dots, x_t — and
generate a continuation.
Step 1 — run the model to get a next-token distribution. Feed the current
sequence through the network. Thanks to
causal masking,
the final position's output is a distribution over the whole vocabulary
V for the token that comes next:
p(\cdot \mid x_1, \dots, x_t) = \operatorname{softmax}\big(\text{logits}(x_1, \dots, x_t)\big) \in \mathbb{R}^{|V|}.
Step 2 — pick a token. Turn that distribution into one concrete token
x_{t+1}. The simplest rule is to take the most likely token, but
there is a whole family of choices here — that is the subject of the next page,
sampling strategies:
x_{t+1} \sim p(\cdot \mid x_1, \dots, x_t).
Step 3 — append it. Stick the new token onto the end of the sequence. The
context is now one token longer:
(x_1, \dots, x_t) \;\longrightarrow\; (x_1, \dots, x_t, x_{t+1}).
Step 4 — feed the longer sequence back in. This is the "auto-regressive"
part: the model's own output becomes part of its next input. Increment
t \leftarrow t + 1 and return to Step 1.
Step 5 — stop. Repeat until the model emits a designated
end-of-sequence token \langle \text{eos} \rangle, or until a length
cap is hit:
\text{halt when } x_{t+1} = \langle \text{eos} \rangle \ \text{ or } \ t = t_{\max}.
Why this is inherently sequential
Look at Step 4. To produce token x_{t+1} you must already have
token x_t — it is part of the input. So the tokens come out strictly
in order, one forward pass per token, and you cannot parallelise the output:
the (t+1)-th token genuinely depends on the t-th.
This is the sharp contrast with training, where the entire target sequence is known
ahead of time and causal masking lets all n next-token predictions
be computed in a single parallel forward pass. Training is parallel over positions; generation
is a serial chain.
Two phases: prefill, then decode
The loop has an asymmetry worth naming, because it dominates how fast generation feels.
-
Prefill. The prompt x_1, \dots, x_t is already
known in full, so — exactly as in training — it goes through the model in one big
forward pass that processes all t prompt tokens in parallel.
One pass, the whole prompt.
-
Decode. After that, each new token needs its own forward pass, because each
depends on the one before. Generating m new tokens costs
m sequential forward passes — the slow part.
So generating m tokens from a t-token
prompt is one prefill pass plus m decode passes
(the last decode pass is the one that emits the final token; counting whether prefill itself
emits a first token is a bookkeeping detail). The decode passes run back-to-back, which is why
a long answer takes longer to start finishing than a short one.
A trained language model generates text by iterating a fixed loop:
-
Sample → append → repeat. Run the model on the current sequence to get
p(\cdot \mid x_{\le t}), pick a token
x_{t+1} from it, append it, and feed the longer sequence back
in — until \langle \text{eos} \rangle or a length cap.
-
Inherently sequential. Token x_{t+1} requires
token x_t as input, so the output cannot be parallelised — one
forward pass per generated token, in strict order. (Contrast training, which scores all
positions at once.)
-
Prefill vs decode. The known prompt is processed in a single parallel
prefill pass; each subsequent token needs its own serial
decode pass. Generating m tokens costs
1 prefill +\,m decode forward passes.
Training a model is a throughput problem: you have a mountain of text and want to
push as many tokens through the GPUs per second as possible, and parallelism over positions
and examples keeps the hardware saturated. Generation is the opposite — a latency
problem. Each decode step is one forward pass that must finish before the next can begin, so
the user waits out a chain of
m \times (\text{time for one forward pass})
to read an m-token reply. Worse, a single decode pass moves the
entire weight matrix of the model from memory just to produce one token, so the GPU
spends most of its time waiting on memory rather than doing arithmetic — generation is
memory-bandwidth bound, not compute-bound. This is why so much inference
engineering — batching many users together, the
KV cache,
speculative decoding — is aimed squarely at making each of those serial steps cheaper. The
arithmetic is cheap; the waiting is everything.