Each term \log p \le 0, so the score is negative and a
less negative score is better. The beam width
B is how many partial sequences ("beams") we carry.
Step 1 — initialise. Start with the single empty hypothesis (just the
prompt), score 0. This is our one beam so far.
Step 2 — expand every beam by every token. Run the model on each of the (up
to) B current beams. Each produces a distribution over the whole
vocabulary V, so extending each beam by each possible next token
gives
B \times |V| \quad\text{candidate continuations.}
Step 3 — score every candidate. A candidate is a parent beam plus one new
token; its score is the parent's running score plus the new token's log-probability —
the running sum just grows by one term:
\operatorname{score}(\text{beam} + y) = \operatorname{score}(\text{beam}) + \log p(y \mid \text{beam}).
Step 4 — prune to the top B. Out of all
B \times |V| candidates, keep only the
B with the highest scores; discard the rest. These become the
beams for the next step. (Crucially, the surviving B may come from
any mix of parents — a strong parent can spawn several survivors while a weak one is wiped
out entirely.)
Step 5 — repeat, retiring finished beams. Loop Steps 2–4. When a beam emits
\langle \text{eos} \rangle it is complete and set aside
(still competing on score). Stop when all B beams are complete, or
a length cap is hit.
Step 6 — return the best complete sequence. Of all finished hypotheses,
output the one with the highest total log-probability:
\hat{y} = \arg\max_{y \text{ complete}} \ \sum_{t} \log p(y_t \mid y_{
Note that B = 1 recovers greedy decoding exactly — one beam,
always extended by its single best token. Larger B searches more
of the tree and, generally, finds higher-probability sequences (though never with a
guarantee of the true global optimum — it is a heuristic, not an exhaustive search).
What it is good — and bad — at
Beam search shines on closed-ended tasks, where there is essentially one
right answer and you want the model's highest-probability rendering of it: machine
translation, summarisation, speech transcription. There, chasing maximum likelihood is chasing
correctness. But on open-ended generation — storytelling, chat — the highest-probability
sequence is paradoxically the blandest: it favours short, safe, generic, repetitive
text, because boring continuations are individually likely. For creative generation,
practitioners reach for the sampling strategies of the previous page instead. High
probability is not the same as good writing.
A breadth-limited search for the highest-log-probability sequence, with beam width
B:
-
Keep B beams by \sum \log p.
Maintain the B best partial sequences, each ranked by its
cumulative log-probability \sum_t \log p(y_t \mid y_{.
-
Expand and prune. Each step extends every beam by every token —
B \times |V| candidates — scores each as parent score
+ \log p, and keeps the top B;
output the best complete sequence at the end.
-
Best for closed-ended tasks. Ideal where one high-probability answer is
wanted (translation, summarisation); on open-ended generation it tends to produce
bland, generic text. B = 1 is exactly greedy.
Each step adds a negative term \log p \le 0 to the
score, so longer sequences are systematically penalised: a 5-token
sequence has five negative terms, a 50-token one has fifty. Left unchecked, beam search will
happily prefer a curt "Yes." over a fuller, more useful answer — not because it is
better, but because it stopped sooner and accumulated less penalty. The standard fix is
length normalisation: divide the total score by the length (often raised to
a power \alpha),
\operatorname{score}_{\text{norm}}(y_{1:L}) = \frac{1}{L^{\alpha}} \sum_{t=1}^{L} \log p(y_t \mid y_{
so beams of different lengths compete on a per-token basis. With \alpha = 1
this is the mean log-probability; \alpha \approx 0.6\text{–}0.7 is
a common compromise. It is a small correction with an outsized effect: without it, beam
search's "best" answer is usually just its shortest.