Perplexity

We trained a language model by minimising the average next-token cross-entropy. Now we need a single number that says how good the result is — comparable across models, datasets, and papers. That number is perplexity, and it is just the cross-entropy in nicer clothes:

\mathrm{PPL} = \exp\!\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(x_t \mid x_{<t})\right).

Lower is better. Its great virtue is that it has a meaning you can feel: the average number of equally-likely next tokens the model is choosing among — its effective branching factor. Let us derive that reading.

From cross-entropy to a branching factor, line by line

Step 1 — start from the average cross-entropy. This is exactly the language-modeling loss \mathcal{L}: the mean negative log-likelihood per token over N tokens of held-out text:

H = -\frac{1}{N} \sum_{t=1}^{N} \log P(x_t \mid x_{<t}).

It is the model's average surprise, in nats (using \ln) per token.

Step 2 — exponentiate to undo the log. Perplexity is defined as the exponential of that average cross-entropy. Whatever base the log uses, the exponential matches it (\exp = e^{\,\cdot} for natural log):

\mathrm{PPL} = e^{\,H} = \exp\!\left(-\frac{1}{N} \sum_{t=1}^{N} \log P(x_t \mid x_{<t})\right).

Exponentiating is what converts “average surprise” into a plain count, undoing the logarithm so the units become tokens, not nats.

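Step 3 — rewrite it as a geometric mean of probabilities. Move the exponential inside the sum: the exponential of a sum of logs is a product, and the factor 1/N turns that product into an N-th root. The exponential of the negative average log-probability is therefore the reciprocal of the geometric mean, so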

\mathrm{PPL} = \left(\prod_{t=1}^{N} P(x_t \mid x_{<t})\right)^{-1/N} = \frac{1}{\sqrt[N]{\prod_{t=1}^{N} P(x_t \mid x_{<t})}}.

So perplexity is one over the geometric-mean probability the model assigned to the true tokens. Assign high probability to what actually came next and the denominator is large, so perplexity is small — exactly the “lower is better” we wanted.
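As a quick numerical check of Steps 1 to 3, here is a minimal Python sketch. The function names and the four probabilities are made up for illustration; they do not come from any particular library or model.

```python
import math

def perplexity(token_probs):
    """Perplexity as exp of the mean negative log-probability (in nats)."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

def reciprocal_geometric_mean(token_probs):
    """Perplexity as (prod_t p_t)^(-1/N), the Step 3 form."""
    n = len(token_probs)
    return math.prod(token_probs) ** (-1.0 / n)

# Hypothetical probabilities the model assigned to the true next tokens.
probs = [0.5, 0.25, 0.125, 0.25]

ppl_exp = perplexity(probs)                 # exp(mean NLL)
ppl_geo = reciprocal_geometric_mean(probs)  # 1 / geometric mean

print(ppl_exp, ppl_geo)  # both are 4 up to floating-point rounding
```

The two forms agree because the product here is 2^{-8} and its fourth root is 1/4. For long texts the product form underflows quickly, which is one reason the log-space form is the one normally computed.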

Step 4 — read off the branching factor. Suppose at every step the model were uniformly unsure among b equally-likely tokens, i.e. P(x_t \mid x_{<t}) = 1/b. Then the geometric mean is 1/b and

\mathrm{PPL} = \left(\tfrac{1}{b}\right)^{-1} = b.

A perplexity of b means the model is, on average, as confused as if it were picking uniformly among b choices: that is its effective branching factor. A model with perplexity 20 is “as lost as a fair 20-sided die” per token; one with perplexity 2 is down to a fair coin flip between two candidates per token.
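The branching-factor reading can be checked numerically. The sketch below is illustrative only: the helper name and the probabilities are assumptions. It shows that a uniform 1/b model scores exactly b, and that a non-uniform model is mapped to the uniform model with the same geometric-mean probability.

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability (nats per token)."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)

# Uniformly unsure among b = 20 tokens at every one of 1000 steps.
b = 20
uniform_ppl = perplexity([1.0 / b] * 1000)

# Hypothetical non-uniform case: the true token alternately gets
# probability 1/4 and 1/16. The geometric mean is 1/8, so the model is
# "effectively" choosing among 8 equally likely tokens.
mixed_ppl = perplexity([1.0 / 4, 1.0 / 16] * 500)

print(uniform_ppl, mixed_ppl)
```

Note that the effective branching factor in the second case is 8, not the arithmetic average of 4 and 16: perplexity averages in log space.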

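Step 5 — the uniform baseline. This is the know-nothing reference point (not a strict upper bound: a confidently wrong model can score even higher). A model that learned nothing and spreads its mass uniformly over the whole vocabulary \mathcal{V} sets b = |\mathcal{V}|, so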

\mathrm{PPL}_{\text{uniform}} = |\mathcal{V}|.

With a 50,000-token vocabulary that is a perplexity of 50,000. Every nat of cross-entropy a real model shaves off pulls that branching factor down geometrically — which is why a drop from perplexity 30 to 20 is a genuinely big deal, not a rounding error.
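To put rough numbers on this, here is a worked example in nats per token, with values rounded to two decimals:

H_{\text{uniform}} = \ln 50{,}000 \approx 10.82

\ln 30 \approx 3.40, \qquad \ln 20 \approx 3.00, \qquad \ln 30 - \ln 20 = \ln 1.5 \approx 0.41

So moving from perplexity 30 to 20 removes about 0.41 nats of average surprise from every single token, and a model at perplexity 20 sits roughly 7.8 nats per token below the 50,000-token uniform baseline.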

For held-out text of N tokens with model probabilities P(x_t \mid x_{<t}) and average cross-entropy H = -\frac{1}{N}\sum_t \log P(x_t \mid x_{<t}):
  • Exponential of mean cross-entropy. \mathrm{PPL} = e^{\,H} = \big(\prod_t P(x_t \mid x_{<t})\big)^{-1/N}, the reciprocal geometric-mean probability; lower is better.
  • Branching-factor reading. If the model is uniformly unsure among b tokens per step, then \mathrm{PPL} = b — the effective number of equally-likely next tokens it chooses among.
  • Uniform baseline. A model that spreads mass uniformly over the whole vocabulary has \mathrm{PPL} = |\mathcal{V}|, the branching factor of a model that has learned nothing.

Switch the logarithm from natural (\ln, nats) to base 2 and the cross-entropy becomes bits per token, H_2 = H / \ln 2; normalise by characters instead of tokens and you get bits per character (BPC), the unit beloved of compression people. Perplexity and bits are two sides of the same coin:

\mathrm{PPL} = 2^{\,H_2}, \qquad H_2 = \text{bits per token}.
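A small sketch of the unit conversion, assuming a model whose cross-entropy corresponds to perplexity 20 (the value and the helper name are illustrative). Whichever base is used for the log, exponentiating in the same base gives back the same perplexity.

```python
import math

def nats_to_bits(h_nats):
    """Convert a cross-entropy from nats per token to bits per token."""
    return h_nats / math.log(2)

h_nats = math.log(20)          # cross-entropy in nats for perplexity 20
h_bits = nats_to_bits(h_nats)  # the same quantity in bits per token

ppl_from_nats = math.exp(h_nats)  # e ** H
ppl_from_bits = 2.0 ** h_bits     # 2 ** H_2

print(h_nats, h_bits, ppl_from_nats, ppl_from_bits)
```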

This is no accident. Cross-entropy is the expected code length, per token, when you compress the true text using the model's probabilities, and the Shannon entropy of the text source is the floor no model can beat. So a better language model is, quite literally, a better compressor of text, and perplexity is its per-token code length in disguise. Prediction and compression turn out to be the same problem wearing two hats.
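The code-length view can be made concrete with a short sketch. It assumes an idealised entropy coder that spends exactly -log2 p bits on a token the model gave probability p (real coders add a small overhead), and the probabilities are made up for illustration.

```python
import math

def ideal_code_length_bits(token_probs):
    """Total bits an idealised entropy coder needs: sum of -log2 p_t."""
    return -sum(math.log2(p) for p in token_probs)

# Hypothetical probabilities the model assigned to the true tokens.
probs = [0.5, 0.25, 0.125, 0.25]

total_bits = ideal_code_length_bits(probs)  # 1 + 2 + 3 + 2 bits
bits_per_token = total_bits / len(probs)    # average code length H_2
ppl = 2.0 ** bits_per_token                 # perplexity from bits

print(total_bits, bits_per_token, ppl)
```

Raising the probability of the true tokens shortens the code and lowers the perplexity at the same time, which is the sense in which the two problems coincide.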

The catch worth flagging: perplexity is only comparable when the tokenisation and dataset match. Two models scored on different vocabularies are not on the same ruler — a number alone is meaningless without saying “perplexity on what.”
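A hedged illustration of why the tokenisation matters. All numbers below are hypothetical: two models are assumed to assign the same total negative log-likelihood to the same 1,000-character text, but their tokenisers split it into different numbers of tokens.

```python
import math

total_nll_nats = 600.0  # assumed total -log P(text), identical for both models
n_chars = 1000          # length of the text in characters

tokens_a = 200          # hypothetical tokeniser A: longer tokens
tokens_b = 300          # hypothetical tokeniser B: shorter tokens

# Per-token perplexity depends on how many tokens the total is spread over.
ppl_a = math.exp(total_nll_nats / tokens_a)
ppl_b = math.exp(total_nll_nats / tokens_b)

# Bits per character normalises by something both models share.
bpc = total_nll_nats / math.log(2) / n_chars

print(ppl_a, ppl_b, bpc)
```

Under these assumptions the two models are equally good at predicting the text, yet their per-token perplexities differ by more than a factor of two. Normalising by characters removes the dependence on the tokeniser.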

How perplexity rides on cross-entropy

Plotting \mathrm{PPL} = \exp(H) against the cross-entropy H shows why the exponential matters. The curve is steep: shaving one nat off H divides the perplexity by e \approx 2.72. The dashed marker sits at H = \ln 2 \approx 0.69, where the branching factor is exactly 2 — a model down to a coin-flip of uncertainty per token.
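The same relationship can be tabulated rather than plotted. This is a minimal sketch; the chosen values of H are arbitrary sample points.

```python
import math

# Sample cross-entropies in nats per token, starting at the coin-flip point.
cross_entropies = [math.log(2), 1.0, 2.0, 3.0, 4.0]

table = [(h, math.exp(h)) for h in cross_entropies]
for h, ppl in table:
    print(f"H = {h:.2f} nats  ->  PPL = {ppl:.2f}")

# Shaving one nat off H divides the perplexity by e.
ratio = math.exp(3.0) / math.exp(2.0)
print(f"PPL(H=3) / PPL(H=2) = {ratio:.4f}")
```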