Beam Search

Greedy decoding commits to the single best token at every step — and a locally best token can lead into a globally mediocre sentence, with no way to take it back. Beam search hedges: instead of one running hypothesis it keeps B of them in flight at once, lets them all grow, and prunes back to the best B after each step. It is a breadth-limited search for the highest-probability whole sequence, not just the next token.

Keeping B hypotheses alive, line by line

We score sequences by their total log-probability. For a sequence y_1, \dots, y_L the model factorises, so the score is a sum of log-probabilities — sums are numerically kinder than products of many small numbers:

\operatorname{score}(y_{1:L}) = \log p(y_{1:L}) = \sum_{t=1}^{L} \log p(y_t \mid y_{

Each term \log p \le 0, so the score is negative and a less negative score is better. The beam width B is how many partial sequences ("beams") we carry.

Step 1 — initialise. Start with the single empty hypothesis (just the prompt), score 0. This is our one beam so far.

Step 2 — expand every beam by every token. Run the model on each of the (up to) B current beams. Each produces a distribution over the whole vocabulary V, so extending each beam by each possible next token gives

B \times |V| \quad\text{candidate continuations.}

Step 3 — score every candidate. A candidate is a parent beam plus one new token; its score is the parent's running score plus the new token's log-probability — the running sum just grows by one term:

\operatorname{score}(\text{beam} + y) = \operatorname{score}(\text{beam}) + \log p(y \mid \text{beam}).

Step 4 — prune to the top B. Out of all B \times |V| candidates, keep only the B with the highest scores; discard the rest. These become the beams for the next step. (Crucially, the surviving B may come from any mix of parents — a strong parent can spawn several survivors while a weak one is wiped out entirely.)

Step 5 — repeat, retiring finished beams. Loop Steps 2–4. When a beam emits \langle \text{eos} \rangle it is complete and set aside (still competing on score). Stop when all B beams are complete, or a length cap is hit.

Step 6 — return the best complete sequence. Of all finished hypotheses, output the one with the highest total log-probability:

\hat{y} = \arg\max_{y \text{ complete}} \ \sum_{t} \log p(y_t \mid y_{

Note that B = 1 recovers greedy decoding exactly — one beam, always extended by its single best token. Larger B searches more of the tree and, generally, finds higher-probability sequences (though never with a guarantee of the true global optimum — it is a heuristic, not an exhaustive search).

What it is good — and bad — at

Beam search shines on closed-ended tasks, where there is essentially one right answer and you want the model's highest-probability rendering of it: machine translation, summarisation, speech transcription. There, chasing maximum likelihood is chasing correctness. But on open-ended generation — storytelling, chat — the highest-probability sequence is paradoxically the blandest: it favours short, safe, generic, repetitive text, because boring continuations are individually likely. For creative generation, practitioners reach for the sampling strategies of the previous page instead. High probability is not the same as good writing.

A breadth-limited search for the highest-log-probability sequence, with beam width B:
  • Keep B beams by \sum \log p. Maintain the B best partial sequences, each ranked by its cumulative log-probability \sum_t \log p(y_t \mid y_{.
  • Expand and prune. Each step extends every beam by every token — B \times |V| candidates — scores each as parent score + \log p, and keeps the top B; output the best complete sequence at the end.
  • Best for closed-ended tasks. Ideal where one high-probability answer is wanted (translation, summarisation); on open-ended generation it tends to produce bland, generic text. B = 1 is exactly greedy.

Each step adds a negative term \log p \le 0 to the score, so longer sequences are systematically penalised: a 5-token sequence has five negative terms, a 50-token one has fifty. Left unchecked, beam search will happily prefer a curt "Yes." over a fuller, more useful answer — not because it is better, but because it stopped sooner and accumulated less penalty. The standard fix is length normalisation: divide the total score by the length (often raised to a power \alpha),

\operatorname{score}_{\text{norm}}(y_{1:L}) = \frac{1}{L^{\alpha}} \sum_{t=1}^{L} \log p(y_t \mid y_{

so beams of different lengths compete on a per-token basis. With \alpha = 1 this is the mean log-probability; \alpha \approx 0.6\text{–}0.7 is a common compromise. It is a small correction with an outsized effect: without it, beam search's "best" answer is usually just its shortest.

Watch the beams branch and prune

A beam search with width B = 2. At each step every surviving beam fans out into candidate children, each labelled with its cumulative log-probability \sum \log p; the two highest-scoring candidates are kept (bold) and the rest are pruned (faded). Step forward to grow the tree level by level and see how the search keeps exactly B hypotheses alive while exploring more than greedy ever would. Refresh for a fresh random set of scores.