Scaling Laws

The GPT family showed that bigger is better. Scaling laws say something far stronger and far more useful: bigger is better predictably. Plot a model's test loss against its size and you don't get a noisy scatter; you get a clean power law, a straight line on a log-log plot, stable across many orders of magnitude. That regularity is what lets you commit millions of dollars to a training run and know, before you start, roughly what loss you'll get.

The power law, line by line

Step 1 — measure loss against one knob, holding the others abundant. Vary the parameter count N (with data and compute not the bottleneck) and record the converged test loss L. Empirically it follows

L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},

where N_c and the exponent \alpha_N are fitted constants. The same shape holds for dataset size D and compute C, each with its own exponent:

L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}.
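To make the formula concrete, here is a minimal Python sketch that evaluates the single-variable law. The function name power_law_loss and the constant N_C = 8.8e13 are illustrative assumptions, not values given in this section; ALPHA_N = 0.076 is the typical exponent quoted in Step 3.

```python
def power_law_loss(x, x_c, alpha):
    """Loss predicted by a pure power law L(x) = (x_c / x) ** alpha."""
    return (x_c / x) ** alpha


ALPHA_N = 0.076  # typical exponent quoted in the text
N_C = 8.8e13     # illustrative fitted constant (assumption)

# Each row is one hypothetical model size; loss falls slowly but steadily.
for n in (1e6, 1e7, 1e8, 1e9):
    print(f"N = {n:.0e}  ->  L = {power_law_loss(n, N_C, ALPHA_N):.3f}")
```

At N = N_c the predicted loss is exactly 1, which is one way to read the fitted constant: it sets the scale, while the exponent sets the slope.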

Step 2 — take logs, and the curve becomes a line. A power law is a straight line in log–log coordinates. Take \log of both sides of L(N):

\log L = \alpha_N \log N_c - \alpha_N \log N.

Read it as y = b + m x with x = \log N, y = \log L: a straight line of slope m = -\alpha_N and intercept b = \alpha_N \log N_c. That is the signature of a scaling law, and why the interactive below is drawn on log–log axes, where the data falls on a ruler-straight line.
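The straight-line claim can be checked numerically. The sketch below takes points from a pure power law, moves to log10 coordinates, and measures the slope between each neighbouring pair. N_C = 8.8e13 and the list of sizes are illustrative assumptions.

```python
import math

ALPHA_N = 0.076  # typical exponent quoted in the text
N_C = 8.8e13     # illustrative fitted constant (assumption)

sizes = [1e6, 1e7, 1e8, 1e9, 1e10]  # hypothetical model sizes
log_n = [math.log10(n) for n in sizes]
# log L = alpha * log N_c - alpha * log N
log_l = [ALPHA_N * (math.log10(N_C) - x) for x in log_n]

# Slope between each pair of neighbouring points in log-log coordinates.
slopes = [
    (log_l[i + 1] - log_l[i]) / (log_n[i + 1] - log_n[i])
    for i in range(len(sizes) - 1)
]
# Intercept b of the line y = b + m x, recovered from the first point.
intercept = log_l[0] - slopes[0] * log_n[0]

print(slopes)
print(intercept, ALPHA_N * math.log10(N_C))
```

Every pairwise slope comes out equal to -ALPHA_N up to rounding, which is what "a straight line" means here, and the intercept matches \alpha_N \log N_c.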

Step 3 — read off the payoff per 10×. Because the relationship is a power law, every factor-of-ten increase in N multiplies the loss by the same factor. Scaling N \to 10N:

\frac{L(10N)}{L(N)} = \left(\frac{N}{10N}\right)^{\alpha_N} = 10^{-\alpha_N}.

With a typical \alpha_N \approx 0.076, that is 10^{-0.076} \approx 0.84: each 10\times in parameters shaves the loss to about 84\% of its value. Diminishing, but utterly reliable.
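The same arithmetic extends to any scale factor. A small sketch, where loss_ratio and decades_to_halve are hypothetical helper names introduced for illustration:

```python
import math


def loss_ratio(scale_factor, alpha):
    """Factor by which the loss is multiplied when N grows by scale_factor."""
    return scale_factor ** (-alpha)


def decades_to_halve(alpha):
    """Number of factor-of-ten increases in N needed to cut the loss in half."""
    return math.log10(2) / alpha


ALPHA_N = 0.076  # typical exponent quoted in the text

per_10x = loss_ratio(10, ALPHA_N)
per_1000x = loss_ratio(1000, ALPHA_N)

print(f"10x   -> loss multiplied by {per_10x:.3f}")
print(f"1000x -> loss multiplied by {per_1000x:.3f}")
print(f"decades of N to halve the loss: {decades_to_halve(ALPHA_N):.2f}")
```

Three successive 10\times steps compound multiplicatively, so the 1000\times ratio is the 10\times ratio cubed. Under this exponent, halving the loss takes roughly four orders of magnitude in N.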

Step 4 — the Chinchilla correction: grow N and D together. Given a fixed compute budget C \approx 6 N D, you can spend it on a bigger model or on more data. Minimising L(N, D) subject to that budget gives the compute-optimal split — and the answer is that N and D should grow at roughly the same rate:

N_{\text{opt}} \propto C^{\,a}, \quad D_{\text{opt}} \propto C^{\,b}, \qquad a \approx b \approx 0.5 \ \Rightarrow\ N_{\text{opt}} \propto D_{\text{opt}}.

In practice this lands near 20 training tokens per parameter. The headline finding was that the giant models of the day were badly over-parameterised and under-trained: for the same compute, a smaller model fed far more data wins.
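As a rough planning aid, the two facts above (C \approx 6 N D and about 20 tokens per parameter) pin down both N and D once C is chosen. The sketch below assumes exactly those two relations; compute_optimal_split is a hypothetical helper and the budget of 1.2e22 FLOPs is an arbitrary example.

```python
import math


def compute_optimal_split(compute_flops, tokens_per_param=20.0, flops_per_param_token=6.0):
    """Split a compute budget into (parameters, tokens).

    Assumes C = 6 * N * D and D = tokens_per_param * N,
    so N = sqrt(C / (6 * tokens_per_param)).
    """
    n_opt = math.sqrt(compute_flops / (flops_per_param_token * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt


budget = 1.2e22  # hypothetical budget in FLOPs
n_opt, d_opt = compute_optimal_split(budget)
print(f"N ~ {n_opt:.2e} parameters, D ~ {d_opt:.2e} tokens")

# 100x more compute: both N and D grow by about 10x, i.e. each scales as C ** 0.5.
n_big, d_big = compute_optimal_split(100 * budget)
print(f"N grows {n_big / n_opt:.1f}x, D grows {d_big / d_opt:.1f}x")
```

The fixed ratio of 20 is a rule of thumb rather than a constant of nature, which is why it is exposed as a parameter here.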

Across orders of magnitude, the test loss of a language model is governed by simple power laws.

The reason scaling laws are a planning tool, not just an observation, is that a straight line extrapolates. Fit the loss at a handful of small, cheap models, draw the line on log–log axes, and extend it: you get a quantitative forecast of the loss at a model 1000\times larger, before committing the compute. Whole training runs are budgeted this way — pick the compute C you can afford, use Chinchilla to split it into N and D, and predict the result.
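The fit-then-extend workflow can be sketched in a few lines. The "measurements" below are synthetic: they are generated from a known power law (exponent 0.076, with N_c = 8.8e13 as an illustrative assumption) and nudged by small fixed factors to mimic noise. Nothing here is real training data, and fit_power_law and predict_loss are hypothetical helper names.

```python
import math


def fit_power_law(sizes, losses):
    """Least-squares line through (log10 N, log10 L); returns (slope, intercept)."""
    xs = [math.log10(n) for n in sizes]
    ys = [math.log10(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_loss(size, slope, intercept):
    """Extend the fitted line to a new model size and map back from log space."""
    return 10 ** (intercept + slope * math.log10(size))


# Synthetic small-model runs (assumption: not real measurements).
TRUE_ALPHA, TRUE_NC = 0.076, 8.8e13
small_sizes = [1e6, 3e6, 1e7, 3e7, 1e8]
wiggle = [1.01, 0.99, 1.005, 0.995, 1.0]  # fixed stand-in for measurement noise
small_losses = [(TRUE_NC / n) ** TRUE_ALPHA * w for n, w in zip(small_sizes, wiggle)]

slope, intercept = fit_power_law(small_sizes, small_losses)
target = 1e11  # 1000x larger than the biggest fitted model
forecast = predict_loss(target, slope, intercept)
truth = (TRUE_NC / target) ** TRUE_ALPHA

print(f"fitted exponent: {-slope:.4f}")
print(f"forecast at N = {target:.0e}: {forecast:.3f} (generating law gives {truth:.3f})")
```

Note how a small error in the fitted slope is amplified by the length of the extrapolation: three decades beyond the data, a slope error of about 0.001 already shifts the forecast by roughly one percent.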

But a power law in test loss is not a power law in usefulness. The smooth loss curve can hide discontinuities in behaviour: some abilities stay flat and then jump (emergence). The line itself also eventually bends: you hit the irreducible loss (the entropy of language itself, a constant additive term L_\infty that the curve flattens toward), or you run out of unique data, or out of compute. Extrapolate with care; the ruler is straight only inside the regime you fit it on.

A straight line you can read off

Both axes are logarithmic: the horizontal is \log_{10} C (compute), the vertical is \log_{10} L (loss). A pure power law L = (C_c/C)^{\alpha} is then a perfectly straight line of slope -\alpha. Drag the exponent slider to tilt it: a steeper line means each 10\times of compute buys a bigger drop in loss. Add an irreducible loss L_\infty and watch the curve peel away from the straight line and flatten toward \log_{10} L_\infty at the right: real curves don't fall forever.
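The bending can also be seen without a plot by measuring the local slope of \log L against \log C. The sketch below assumes the form L = (C_c/C)^{\alpha} + L_\infty; the values ALPHA = 0.3, C_C = 1.0 and L_INF = 0.1 are arbitrary slider-style settings chosen to make the effect visible, not fitted constants.

```python
import math


def loss_with_floor(c, c_c, alpha, l_inf):
    """Power law plus a constant irreducible term: L = (c_c / c) ** alpha + l_inf."""
    return (c_c / c) ** alpha + l_inf


def local_slope(c, c_c, alpha, l_inf, step=1.01):
    """Finite-difference estimate of d(log L) / d(log C) around c."""
    lo, hi = c / step, c * step
    rise = (math.log10(loss_with_floor(hi, c_c, alpha, l_inf))
            - math.log10(loss_with_floor(lo, c_c, alpha, l_inf)))
    run = math.log10(hi) - math.log10(lo)
    return rise / run


ALPHA, C_C, L_INF = 0.3, 1.0, 0.1  # hypothetical settings (assumption)
compute_grid = [1e0, 1e3, 1e6, 1e9, 1e12]

pure_slopes = [local_slope(c, C_C, ALPHA, 0.0) for c in compute_grid]
bent_slopes = [local_slope(c, C_C, ALPHA, L_INF) for c in compute_grid]

for c, p, b in zip(compute_grid, pure_slopes, bent_slopes):
    print(f"C = {c:.0e}: pure slope {p:+.3f}, with floor {b:+.3f}")
```

Without the floor the slope stays at -ALPHA everywhere. With it, the slope drifts toward zero as compute grows, because the power-law term shrinks while L_\infty does not.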