Sampling Strategies

Autoregressive decoding leaves one step deliberately vague: given the next-token distribution p(\cdot \mid x_{\le t}), which token do we actually pick? That single choice — the sampling strategy — is the dial that sets a model's personality, from a deterministic, repetitive drone to a wild, incoherent improviser. Here are the four moves every practitioner reaches for, each a small transformation of the same distribution before we draw from it.

From distribution to choice, line by line

Write the model's pre-softmax outputs (the logits) as z_1, \dots, z_{|V|}, one per vocabulary token, and let p_i = \operatorname{softmax}(z)_i be the model's probabilities.

Step 1 — greedy: take the argmax. The bluntest rule: always emit the single most probable token.

x_{t+1} = \arg\max_i \, p_i.

It is deterministic — same prompt, same output, every time. That sounds reassuring but is a trap for open-ended text: greedy decoding gleefully falls into loops ("I think that I think that I think that…"), because the highest-probability continuation of a repetitive phrase is usually more of the same.
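As a minimal sketch of the rule above, assuming a hypothetical four-token vocabulary whose logits are made up for illustration, greedy decoding is a softmax followed by an argmax. The helper names `softmax` and `greedy_pick` are illustrative, not taken from any library.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)                        # subtract the max to avoid overflow
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def greedy_pick(logits):
    """Index of the most probable token; ties go to the lowest index."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)

toy_logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical logits, not real model output
choice = greedy_pick(toy_logits)           # same input always yields the same index
```

Because softmax preserves the ordering of the logits, taking the argmax of the logits directly would give the same answer; the softmax is kept here only to mirror the formula.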

Step 2 — temperature: rescale the logits before softmax. Divide every logit by a temperature T > 0:

p_i^{(T)} = \operatorname{softmax}\!\Big(\frac{z}{T}\Big)_i = \frac{e^{\,z_i / T}}{\sum_{j} e^{\,z_j / T}}.

Temperature reshapes how peaked the distribution is, without changing the ranking of the tokens:

T < 1 sharpens — the gaps between logits get amplified, mass piles onto the top tokens, and as T \to 0 it collapses to greedy.
T > 1 flattens — the logits get squashed together, mass spreads toward the tail, and as T \to \infty it approaches a uniform random pick.
T = 1 leaves the model's own distribution untouched.
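The three regimes can be checked numerically. The sketch below assumes the same kind of made-up logits and a hypothetical helper `softmax_with_temperature`; it divides by T before a numerically stable softmax.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """softmax(z / T) for T > 0, computed stably."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max to avoid overflow
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

toy_logits = [2.0, 1.0, 0.5, -1.0]         # hypothetical logits
sharp = softmax_with_temperature(toy_logits, 0.5)   # T < 1: more mass on the top token
raw = softmax_with_temperature(toy_logits, 1.0)     # T = 1: the model's own distribution
flat = softmax_with_temperature(toy_logits, 2.0)    # T > 1: closer to uniform

for name, probs in [("T=0.5", sharp), ("T=1.0", raw), ("T=2.0", flat)]:
    print(name, [round(p, 3) for p in probs])
```

In all three outputs the tokens keep the same order from most to least probable; only the spacing between them changes.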

Step 3 — top-k: keep the k best, renormalise, sample. Throw away all but the k highest-probability tokens, then renormalise what survives into a fresh distribution and draw from it. Let \mathcal{T}_k be the indices of the top k:

p_i^{(k)} = \frac{p_i \,\mathbf{1}[\,i \in \mathcal{T}_k\,]}{\sum_{j \in \mathcal{T}_k} p_j}, \qquad x_{t+1} \sim p^{(k)}.

The long tail of barely-plausible tokens — the ones that produce non-sequiturs — is simply cut off. The catch: a fixed k is blunt. When the model is very sure, k = 40 still admits 39 junk tokens; when it is genuinely uncertain, k = 40 may chop off good ones.
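A minimal sketch of the truncate, renormalise, sample sequence, assuming a made-up four-token distribution. `top_k_filter` and `sample` are hypothetical helper names; the draw itself uses the standard library's `random.choices`.

```python
import random

def top_k_filter(probs, k):
    """Zero out all but the k largest probabilities, then renormalise."""
    if not 1 <= k <= len(probs):
        raise ValueError("k must be between 1 and the vocabulary size")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept = set(order[:k])
    mass = sum(probs[i] for i in kept)     # the denominator in the formula above
    return [p / mass if i in kept else 0.0 for i, p in enumerate(probs)]

def sample(probs, rng=random):
    """Draw one token index according to the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

toy_probs = [0.5, 0.3, 0.15, 0.05]         # hypothetical next-token probabilities
filtered = top_k_filter(toy_probs, k=2)    # roughly [0.625, 0.375, 0.0, 0.0]
token = sample(filtered, random.Random(0)) # seeded only to make the sketch repeatable
```

Tokens outside the top k end up with probability exactly zero, so they can never be drawn no matter how many samples are taken.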

Step 4 — top-p (nucleus): keep the smallest set summing to \ge p. Instead of a fixed count, fix a probability mass. Sort tokens by probability and keep the smallest group whose cumulative probability first reaches the threshold p (the "nucleus"), then renormalise and sample:

\mathcal{N}_p = \text{smallest set with} \sum_{i \in \mathcal{N}_p} p_i \ge p, \qquad x_{t+1} \sim \frac{p_i\,\mathbf{1}[\,i \in \mathcal{N}_p\,]}{\sum_{j \in \mathcal{N}_p} p_j}.

This adapts: where the model is confident the nucleus is tiny (maybe one or two tokens), and where it is unsure the nucleus widens to admit more options, which is exactly the behaviour a fixed k can't give you. A typical setting is p = 0.9. (Note that the bare p here is the mass threshold, not the token probabilities p_i.)
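The adaptive behaviour is easy to see on two made-up distributions, one confident and one unsure. In this sketch `top_p_filter` is a hypothetical helper, and the threshold is named `p_threshold` to keep it apart from the token probabilities.

```python
def top_p_filter(probs, p_threshold):
    """Keep the smallest high-probability set whose mass reaches p_threshold."""
    if not 0.0 < p_threshold <= 1.0:
        raise ValueError("p_threshold must be in (0, 1]")
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    nucleus, cumulative = [], 0.0
    for i in order:                        # add tokens from most to least probable
        nucleus.append(i)
        cumulative += probs[i]
        if cumulative >= p_threshold:      # stop as soon as the mass is reached
            break
    kept = set(nucleus)
    return [p / cumulative if i in kept else 0.0 for i, p in enumerate(probs)]

confident = [0.97, 0.01, 0.01, 0.01]       # hypothetical: model is very sure
unsure = [0.25, 0.25, 0.25, 0.25]          # hypothetical: model has no preference

small_nucleus = top_p_filter(confident, 0.9)   # only the top token survives
wide_nucleus = top_p_filter(unsure, 0.9)       # all four tokens survive
```

With the same threshold the nucleus holds one token in the first case and four in the second, which a single fixed k could not do. Thresholds that land exactly on a cumulative sum are sensitive to floating-point rounding, so a real implementation may want a small tolerance.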

The trade-off underneath all four

Every knob here is the same tension dressed differently: quality versus diversity. Concentrate on the top tokens (greedy, low T, small k, small p) and you get safe, coherent, but bland and loop-prone text. Spread the mass out (high T, large k, large p) and you get variety and surprise at the risk of incoherence. The art of decoding is picking where on that line your task wants to sit.

Four ways to turn the next-token distribution p = \operatorname{softmax}(z) into one token:

Greedy. x_{t+1} = \arg\max_i p_i — deterministic; coherent but repetitive and prone to loops.
Temperature T. Sample from \operatorname{softmax}(z / T): T < 1 sharpens toward greedy, T > 1 flattens toward uniform, T = 1 is the raw model.
Top-k. Keep the k highest-probability tokens, renormalise, sample — a fixed-count truncation of the tail.
Top-p (nucleus). Keep the smallest set with cumulative probability \ge p, renormalise, sample — a fixed-mass truncation that adapts to the model's confidence.

All four trade quality (concentrate the mass) against diversity (spread it out).

Why does the safest decoder, greedy, produce the most embarrassing failure, the broken record? Because language models are trained to imitate text, and real text contains repetition (refrains, lists, names). Once a phrase appears in the context, attending to it can make repeating it the single highest-probability move, so greedy commits to it; the repeat makes the next repeat even more likely, and the loop self-reinforces. Sampling helps because a die roll can break the cycle, but practitioners often add an explicit repetition penalty: before the softmax, down-weight the logits of tokens that have already appeared,

z_i \;\leftarrow\; z_i - \lambda \cdot \mathbf{1}[\,i \text{ already generated}\,],

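(a common multiplicative variant instead divides positive logits by a factor > 1 and multiplies negative ones by it, since dividing a negative logit would raise it), so any token that has already appeared is taxed. It is a crude hack: penalise a legitimately recurring word and you garble the text. Still, it is a cheap, often effective patch on greedy's most visible flaw, and a reminder that the raw distribution is only ever a starting point.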
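Both forms of the penalty can be sketched in a few lines. The function names, the penalty values, and the toy logits below are all illustrative assumptions. The multiplicative version is written to be sign-aware, because dividing a negative logit by a factor greater than 1 would move it toward zero and make the token more likely rather than less.

```python
def subtractive_penalty(logits, generated_ids, lam):
    """Subtract lam once from every logit whose token has already appeared."""
    seen = set(generated_ids)
    return [z - lam if i in seen else z for i, z in enumerate(logits)]

def multiplicative_penalty(logits, generated_ids, factor):
    """Sign-aware scaling so that a penalised logit always moves downward."""
    if factor < 1.0:
        raise ValueError("factor should be >= 1 to act as a penalty")
    seen = set(generated_ids)
    out = []
    for i, z in enumerate(logits):
        if i in seen:
            # divide positive logits, multiply negative ones
            z = z / factor if z > 0 else z * factor
        out.append(z)
    return out

toy_logits = [2.0, 1.0, -1.0]              # hypothetical logits
history = [0, 2, 0]                        # token 0 appeared twice, token 2 once

print(subtractive_penalty(toy_logits, history, lam=0.5))
print(multiplicative_penalty(toy_logits, history, factor=2.0))
```

Note that the indicator form charges a token once no matter how often it has appeared; a penalty that grows with the number of repeats would need the count rather than a set.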

Reshape the distribution

Picture a fixed set of next-token logits drawn as one bar per token. The faint bars are the model's raw probabilities (T = 1). The bold bars are what you actually sample from after applying temperature and a top-k cutoff, then renormalising. Lower T below 1 and the mass piles onto the leader; raise it above 1 and the bars even out. Lower k and the tail bars vanish, their mass redistributed across the survivors so the kept bars still sum to 1.
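The reshaping described above can be reproduced as a text-only bar chart. This is a rough sketch under stated assumptions: the five logits are invented, `reshape` is a hypothetical helper, and temperature is applied before the top-k cutoff (the kept set is the same either way, since temperature does not change the ranking).

```python
import math

def reshape(logits, temperature=1.0, k=None):
    """Apply temperature, optionally keep only the top k tokens, renormalise."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                        # subtract the max to avoid overflow
    weights = [math.exp(s - m) for s in scaled]
    if k is not None:
        order = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
        kept = set(order[:k])
        weights = [w if i in kept else 0.0 for i, w in enumerate(weights)]
    total = sum(weights)
    return [w / total for w in weights]    # surviving bars sum to 1

toy_logits = [2.0, 1.0, 0.5, 0.0, -1.0]    # hypothetical logits

for T, k in [(1.0, None), (0.5, None), (2.0, None), (1.0, 3)]:
    print(f"T={T}, k={k}")
    for i, p in enumerate(reshape(toy_logits, T, k)):
        print(f"  token {i}: {'#' * round(p * 40)} {p:.3f}")
```

Comparing the first printed chart with the others shows the same effects as the bars in the description: a taller leader at T = 0.5, flatter bars at T = 2, and empty tail rows at k = 3.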