We trained a
Lower is better. Its great virtue is that it has a meaning you can feel: the average number of equally-likely next tokens the model is choosing among — its effective branching factor. Let us derive that reading.
Step 1 — start from the average cross-entropy. This is exactly the
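language-modeling loss
$$H = -\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i}),$$
where $N$ is the number of tokens and $p(x_i \mid x_{<i})$ is the probability the model assigned to the true token $x_i$ given the tokens before it. It is the model's average surprise, in nats per token.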
Step 2 — exponentiate to undo the log. Perplexity is defined as the
exponential of that average cross-entropy. Whatever base the log uses, the exponential
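matches it ($e$ for the natural log, 2 for $\log_2$). With natural logs,
$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N} \ln p(x_i \mid x_{<i})\right).$$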
Exponentiating is what converts “average surprise” into a plain count, undoing the logarithm so the units become tokens, not nats.
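As a minimal sketch of Steps 1 and 2, the snippet below averages the negative log-probabilities of the true tokens and exponentiates the result. The function name `perplexity` and the four probabilities are illustrative assumptions, not values from any real model.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative natural-log probability."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    # Step 1: average cross-entropy, in nats per token.
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    # Step 2: exponentiate with base e to match the natural log.
    return math.exp(avg_nll)


# Hypothetical probabilities assigned to the four tokens that actually occurred.
probs = [0.5, 0.25, 0.125, 0.125]
ppl = perplexity(probs)
print(f"perplexity = {ppl:.4f}")
```

These probabilities are powers of two, so the average surprise is 2.25 bits and the perplexity is $2^{2.25}$, a little under 5 equally likely choices per step.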
Step 3 — rewrite it as a geometric mean of probabilities. Pull the sum inside the exponential. The exponential of an average of logs is the geometric mean, so
$$\mathrm{PPL} = \exp\!\left(\frac{1}{N}\sum_{i=1}^{N} \ln \frac{1}{p(x_i \mid x_{<i})}\right) = \frac{1}{\left(\prod_{i=1}^{N} p(x_i \mid x_{<i})\right)^{1/N}}.$$
So perplexity is one over the geometric-mean probability the model assigned to the true tokens. Assign high probability to what actually came next and the denominator is large, so perplexity is small — exactly the “lower is better” we wanted.
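The equivalence in Step 3 can be checked numerically. This is a small sketch with made-up probabilities; the variable names are assumptions chosen for illustration.

```python
import math

# Hypothetical probabilities the model assigned to the true tokens.
probs = [0.5, 0.25, 0.125, 0.125]
n = len(probs)

# Route 1: exponential of the average negative log-probability.
ppl_from_loss = math.exp(-sum(math.log(p) for p in probs) / n)

# Route 2: one over the geometric mean of the probabilities.
geo_mean = math.prod(probs) ** (1.0 / n)
ppl_from_geo_mean = 1.0 / geo_mean

print(ppl_from_loss, ppl_from_geo_mean)
```

Both routes give the same number up to floating-point error. On long sequences the raw product underflows to zero, which is one practical reason to compute in log space as in the first route.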
Step 4 — read off the branching factor. Suppose at every step the model
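were uniformly unsure among $k$ tokens, giving each one probability $1/k$. Then every factor in the geometric mean is $1/k$, so $\mathrm{PPL} = k$.
A perplexity of $k$ therefore means the model is, on average, as uncertain as if it were choosing uniformly among $k$ equally likely next tokens.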
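To make the branching-factor reading concrete, here is a hedged sketch with two hypothetical models: one uniformly unsure among 20 tokens at every step, and one that is certain half the time and unsure among 16 tokens the other half. The helper `perplexity` and all numbers are assumptions for illustration.

```python
import math


def perplexity(token_probs):
    """Exponential of the average negative natural-log probability."""
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Uniformly unsure among k = 20 tokens at each of 100 steps.
k = 20
ppl_uniform = perplexity([1.0 / k] * 100)

# Certain on half the steps (p = 1), unsure among 16 tokens on the rest.
ppl_mixed = perplexity([1.0, 1.0 / 16] * 50)

print(ppl_uniform, ppl_mixed)
```

The uniform model lands at perplexity 20. The mixed model lands at 4, the geometric mean of 1 and 16, so the "effective" branching factor is an average in log space rather than a plain arithmetic one.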
Step 5 — the worst-case baseline. A model that learned nothing and
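spreads its mass uniformly over the whole vocabulary of $V$ tokens assigns $p = 1/V$ to every token, so its cross-entropy is $\ln V$ and its perplexity is $\mathrm{PPL} = V$.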
With a 50,000-token vocabulary that is a perplexity of 50,000. Every nat of cross-entropy a real model shaves off divides that branching factor by $e \approx 2.72$, so perplexity falls multiplicatively rather than additively. That is why a drop from perplexity 30 to 20, a saving of $\ln(30/20) \approx 0.41$ nats per token, is a genuinely big deal, not a rounding error.
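The baseline arithmetic can be sketched in a few lines. The vocabulary size comes from the text; the variable names are assumptions.

```python
import math

vocab_size = 50_000

# A uniform model assigns 1/V to every token, so its loss is ln(V) nats.
baseline_loss = math.log(vocab_size)
baseline_ppl = math.exp(baseline_loss)

# Shaving one nat off the loss divides perplexity by e.
ppl_after_one_nat = math.exp(baseline_loss - 1.0)

# Loss improvement implied by moving from perplexity 30 to 20.
nats_saved = math.log(30) - math.log(20)

print(f"baseline loss  = {baseline_loss:.2f} nats")
print(f"baseline ppl   = {baseline_ppl:.0f}")
print(f"after one nat  = {ppl_after_one_nat:.0f}")
print(f"30 -> 20 saves = {nats_saved:.3f} nats per token")
```

The uniform baseline sits at roughly 10.82 nats, and a single nat of improvement already cuts the branching factor from 50,000 to about 18,394.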
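Switch the logarithm from natural ($\ln$) to base 2 ($\log_2$) and the cross-entropy is measured in bits per token rather than nats; perplexity is then 2 raised to that number, $\mathrm{PPL} = 2^{H_{\text{bits}}}$, and its value is unchanged.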
This is no accident. Cross-entropy is the expected code length when you compress the true
text using the model's probabilities — Shannon's
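A short sketch of the change of base, assuming the same made-up probabilities throughout: the cross-entropy in bits is the cross-entropy in nats divided by $\ln 2$, and the perplexity is identical either way.

```python
import math

# Hypothetical probabilities the model assigned to the true tokens.
probs = [0.5, 0.25, 0.125, 0.125]
n = len(probs)

h_nats = -sum(math.log(p) for p in probs) / n   # nats per token
h_bits = -sum(math.log2(p) for p in probs) / n  # bits per token

ppl_from_nats = math.exp(h_nats)  # base e matches the natural log
ppl_from_bits = 2.0 ** h_bits     # base 2 matches log2

print(h_nats, h_bits, ppl_from_nats, ppl_from_bits)
```

Here `h_bits` is 2.25, which under the code-length reading means an ideal code built from the model's probabilities would spend 2.25 bits per token on this text, on average.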
The catch worth flagging: perplexity is only comparable when the tokenisation and dataset match. Two models scored on different vocabularies are not on the same ruler — a number alone is meaningless without saying “perplexity on what.”
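The tokenisation caveat can be illustrated with a deliberately artificial sketch. It assumes, purely for illustration, that two models assign the same total likelihood to one text but split it into different numbers of tokens; the total and both token counts are hypothetical.

```python
import math

# Hypothetical total negative log-likelihood of the same text, in nats.
total_nll_nats = 120.0

# Hypothetical token counts produced by two different tokenisers.
tokens_a = 40  # coarser vocabulary, fewer tokens
tokens_b = 60  # finer vocabulary, more tokens

# Per-token perplexity divides the same total by a different count.
ppl_a = math.exp(total_nll_nats / tokens_a)
ppl_b = math.exp(total_nll_nats / tokens_b)

print(f"model A: {ppl_a:.2f}   model B: {ppl_b:.2f}")
```

Model A reports about 20.09 and model B about 7.39, even though both explain the text equally well in total. The gap comes entirely from the per-token normalisation.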
Plotting