From distribution to choice, line by line
Write the model's pre-softmax outputs (the logits) as z_1, \dots, z_{|V|}, one per vocabulary token, and let
p_i = \operatorname{softmax}(z)_i be the model's probabilities.
Step 1 — greedy: take the argmax. The bluntest rule: always emit the single
most probable token.
x_{t+1} = \arg\max_i \, p_i.
It is deterministic — same prompt, same output, every time. That sounds
reassuring but is a trap for open-ended text: greedy decoding gleefully falls into
loops ("I think that I think that I think that…"), because the highest-probability
continuation of a repetitive phrase is usually more of the same.
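As a minimal sketch of the greedy rule, the snippet below uses only the Python standard library. The four-word vocabulary and the logit values are made up for illustration, and `softmax` and `greedy` are hypothetical helper names, not part of any library.

```python
import math

def softmax(logits):
    """Turn a list of logits into probabilities (numerically stable)."""
    m = max(logits)                              # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy(logits):
    """Return the index of the single most probable token."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

# Hypothetical 4-token vocabulary with made-up logits.
vocab = ["the", "cat", "sat", "mat"]
logits = [2.0, 1.0, 0.5, -1.0]

print(vocab[greedy(logits)])  # prints "the"
```

Calling `greedy(logits)` again with the same logits returns the same index, which is the determinism described above.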
Step 2 — temperature: rescale the logits before softmax. Divide every logit
by a temperature T > 0:
p_i^{(T)} = \operatorname{softmax}\!\Big(\frac{z}{T}\Big)_i = \frac{e^{\,z_i / T}}{\sum_{j} e^{\,z_j / T}}.
Temperature reshapes how peaked the distribution is, without changing the ranking of
the tokens:
- T < 1 sharpens: the gaps between logits get amplified, mass piles onto the top tokens, and as T \to 0 it collapses to greedy.
- T > 1 flattens: the logits get squashed together, mass spreads toward the tail, and as T \to \infty it approaches a uniform random pick.
- T = 1 leaves the model's own distribution untouched.
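The three regimes can be checked numerically. The sketch below assumes the same made-up logits as a toy example; `softmax_with_temperature` is a hypothetical helper name, and the code simply implements softmax(z / T) as written in the formula.

```python
import math

def softmax_with_temperature(logits, T=1.0):
    """Compute softmax(z / T) for a temperature T > 0."""
    if T <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / T for z in logits]
    m = max(scaled)                              # stabilise the exponentials
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]

for T in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, T)
    # The top token's share shrinks as T grows; the ordering never changes.
    print(T, [round(p, 3) for p in probs])
```

Dividing by T is a monotone transformation of the logits, which is why the ranking of the tokens survives every setting.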
Step 3 — top-k: keep the k best,
renormalise, sample. Throw away all but the k
highest-probability tokens, then renormalise what survives into a fresh distribution and draw
from it. Let \mathcal{T}_k be the indices of the top
k:
p_i^{(k)} = \frac{p_i \,\mathbf{1}[\,i \in \mathcal{T}_k\,]}{\sum_{j \in \mathcal{T}_k} p_j}, \qquad x_{t+1} \sim p^{(k)}.
The long tail of barely-plausible tokens — the ones that produce non-sequiturs — is simply cut
off. The catch: a fixed k is blunt. When the model is very sure,
k = 40 still admits 39 junk tokens; when it is genuinely uncertain,
k = 40 may chop off good ones.
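A minimal top-k sketch, again using only the standard library, might look like this. `top_k_sample` is a hypothetical name and the logits are made up. Taking a softmax over just the k surviving logits gives the same distribution as renormalising p over the top k, because the shared denominator cancels.

```python
import math
import random

def top_k_sample(logits, k, rng=random):
    """Sample a token index from the k highest-probability tokens."""
    if not 1 <= k <= len(logits):
        raise ValueError("k must be between 1 and the vocabulary size")
    # Ranking by logit is the same as ranking by probability.
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    # Unnormalised weights; random.choices renormalises them internally.
    weights = [math.exp(logits[i] - m) for i in keep]
    return rng.choices(keep, weights=weights, k=1)[0]

# Made-up logits for a 4-token vocabulary.
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)                           # seeded for repeatability
draws = [top_k_sample(logits, k=2, rng=rng) for _ in range(10)]
print(draws)                                     # only indices 0 and 1 can appear
```

With k = 1 the function reduces to greedy decoding, since only the argmax survives the cut.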
Step 4 — top-p (nucleus): keep the smallest set summing to
\ge p. Instead of a fixed count, fix a
probability mass. Sort tokens by probability and keep the smallest group whose
cumulative probability first reaches the threshold p (the
"nucleus"), then renormalise and sample:
\mathcal{N}_p = \text{smallest set with} \sum_{i \in \mathcal{N}_p} p_i \ge p, \qquad x_{t+1} \sim \frac{p_i\,\mathbf{1}[\,i \in \mathcal{N}_p\,]}{\sum_{j \in \mathcal{N}_p} p_j}.
This adapts: where the model is confident the nucleus is tiny (maybe one or
two tokens), and where it is unsure the nucleus widens to admit more options — exactly the
behaviour a fixed k can't give you. A typical setting is
p = 0.9.
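The adaptive behaviour is easy to see on two toy distributions. In this sketch `nucleus` and `top_p_sample` are hypothetical helper names, and both probability vectors are invented to represent a confident and an unsure model.

```python
import random

def nucleus(probs, p):
    """Indices of the smallest set of top tokens whose total mass reaches p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= p:                            # stop as soon as the threshold is reached
            break
    return kept

def top_p_sample(probs, p, rng=random):
    """Sample from the nucleus; random.choices renormalises the weights."""
    kept = nucleus(probs, p)
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]

# Invented distributions over a 4-token vocabulary.
confident = [0.95, 0.03, 0.01, 0.01]
unsure = [0.30, 0.28, 0.22, 0.20]

print(len(nucleus(confident, 0.9)))  # prints 1
print(len(nucleus(unsure, 0.9)))     # prints 4
```

The same threshold of 0.9 keeps one token in the first case and all four in the second, which a fixed k cannot do.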
The trade-off underneath all four
Every knob here is the same tension dressed differently:
quality versus diversity. Concentrate on the top tokens (greedy, low
T, small k, small p)
and you get safe, coherent, but bland and loop-prone text. Spread the mass out (high
T, large k, large
p) and you get variety and surprise at the risk of incoherence. The
art of decoding is picking where on that line your task wants to sit.
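To see the two ends of that line side by side, the sketch below chains the knobs into one hypothetical function, `sample_next`. The order used here (temperature, then top-k, then top-p) is an assumption of this sketch, not something the text prescribes, and the nucleus mass is measured on the temperature-scaled probabilities before any top-k renormalisation.

```python
import math
import random

def sample_next(logits, T=1.0, k=None, p=None, rng=random):
    """Apply temperature, optional top-k, optional top-p, then sample an index."""
    scaled = [z / T for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Candidate indices, most probable first.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    if k is not None:
        order = order[:k]                        # fixed-count truncation
    if p is not None:
        kept, mass = [], 0.0
        for i in order:                          # fixed-mass truncation
            kept.append(i)
            mass += probs[i]
            if mass >= p:
                break
        order = kept
    return rng.choices(order, weights=[probs[i] for i in order], k=1)[0]

# Made-up logits; a seeded generator keeps the run repeatable.
logits = [2.0, 1.0, 0.5, -1.0]
rng = random.Random(0)
tight = {sample_next(logits, T=0.1, k=2, p=0.5, rng=rng) for _ in range(500)}
loose = {sample_next(logits, T=5.0, rng=rng) for _ in range(500)}
print(len(tight), len(loose))  # tight always contains exactly one token
```

The concentrated settings only ever emit the top token, while the high-temperature, untruncated settings reach further into the vocabulary.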
Four ways to turn the next-token distribution
p = \operatorname{softmax}(z) into one token:
- Greedy. x_{t+1} = \arg\max_i p_i: deterministic; coherent but repetitive and prone to loops.
- Temperature T. Sample from \operatorname{softmax}(z / T): T < 1 sharpens toward greedy, T > 1 flattens toward uniform, T = 1 is the raw model.
- Top-k. Keep the k highest-probability tokens, renormalise, sample: a fixed-count truncation of the tail.
- Top-p (nucleus). Keep the smallest set with cumulative probability \ge p, renormalise, sample: a fixed-mass truncation that adapts to the model's confidence.
All four trade quality (concentrate the mass) against
diversity (spread it out).
Why does the safest decoder, greedy, produce the most embarrassing failure, the broken
record? Because language models are trained to imitate text, and real text contains
repetition (refrains, lists, names). Once a phrase has appeared in the context, the model can
attend to it, and repeating it often becomes the single highest-probability move, so greedy
commits to it; the repeat makes the next repeat even more likely, and the loop reinforces
itself. Sampling helps because a die roll can break the cycle, but practitioners often add an
explicit repetition penalty: before the softmax, down-weight the logits of tokens
that have already appeared,
z_i \;\leftarrow\; z_i - \lambda \cdot \mathbf{1}[\,i \text{ already generated}\,],
(or rescale by a factor \theta > 1, dividing positive logits and multiplying negative ones so that the logit always moves down), so reusing a token is taxed.
It is a crude hack — penalise a legitimately recurring word and you garble the text — but it
is a cheap, effective patch on greedy's most visible flaw, and a reminder that the raw
distribution is only ever a starting point.
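The subtractive form of the penalty can be sketched in a few lines. `penalise_repeats` and `greedy_from_logits` are hypothetical helper names, the logits and the generated history are invented, and `lam` stands in for \lambda. As in the formula, the penalty is applied once per token that has appeared, however many times it appeared.

```python
def penalise_repeats(logits, generated, lam=1.0):
    """Subtract lam from the logit of every token id already generated."""
    seen = set(generated)
    return [z - lam if i in seen else z for i, z in enumerate(logits)]

def greedy_from_logits(logits):
    """Argmax over logits, which matches argmax over softmax probabilities."""
    return max(range(len(logits)), key=logits.__getitem__)

# Invented logits and history: tokens 0 and 2 have already been emitted.
logits = [2.0, 1.8, 0.5, -1.0]
generated = [0, 0, 2]

print(greedy_from_logits(logits))                                        # prints 0
print(greedy_from_logits(penalise_repeats(logits, generated, lam=1.0)))  # prints 1
```

Without the penalty greedy would emit token 0 a third time; with it, the close runner-up wins. A larger `lam` breaks loops more aggressively at the cost of suppressing words that legitimately recur.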