The power law, line by line
Step 1 — measure loss against one knob, holding the others abundant. Vary the
parameter count N (with data and compute not the bottleneck) and
record the converged test loss L. Empirically it follows
L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N},
where N_c and the exponent \alpha_N are
fitted constants. The same shape holds for dataset size
D and compute C, each with its own
exponent:
L(D) \approx \left(\frac{D_c}{D}\right)^{\alpha_D}, \qquad L(C) \approx \left(\frac{C_c}{C}\right)^{\alpha_C}.
Step 2 — take logs, and the curve becomes a line. A power law is a straight
line in log–log coordinates. Take \log of both sides of
L(N):
\log L = \alpha_N \log N_c - \alpha_N \log N.
Read it as y = b + m x with x = \log N,
y = \log L: a straight line of slope
m = -\alpha_N. That is the signature of a scaling law, and it is
why the interactive below is drawn on log–log axes, where the data falls along a straight
line.
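The log–log reading in Step 2 can be sketched in a few lines. This is a minimal illustration, not a reference implementation: the data is synthetic, generated from an assumed power law, and `fit_power_law`, `ALPHA_TRUE`, and `N_C_TRUE` are hypothetical names and placeholder values introduced here. It fits a straight line to (log N, log L) and reads \alpha_N off the slope and N_c off the intercept.

```python
import numpy as np

# Placeholder constants used only to generate toy data (assumptions, not measurements).
ALPHA_TRUE = 0.076
N_C_TRUE = 8.8e13


def fit_power_law(n_values, losses):
    """Fit L = (N_c / N)**alpha by least squares on log10(L) versus log10(N)."""
    x = np.log10(np.asarray(n_values, dtype=float))
    y = np.log10(np.asarray(losses, dtype=float))
    slope, intercept = np.polyfit(x, y, 1)  # y = intercept + slope * x
    alpha = -slope                          # slope m = -alpha
    n_c = 10 ** (intercept / alpha)         # intercept b = alpha * log10(N_c)
    return alpha, n_c


# Seven model sizes spread evenly on a log scale from 1e6 to 1e9 parameters.
n_values = np.logspace(6, 9, 7)
losses = (N_C_TRUE / n_values) ** ALPHA_TRUE

alpha_hat, n_c_hat = fit_power_law(n_values, losses)
print(f"alpha = {alpha_hat:.4f}, N_c = {n_c_hat:.3e}")
```

Because the toy data is noise-free, the fit recovers the generating constants almost exactly. With real measurements the residuals around the line show how well the power law actually holds.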
Step 3 — read off the payoff per 10×. Because the relationship is a power law,
every factor-of-ten increase in N multiplies the loss by the same
factor. Scaling N \to 10N:
\frac{L(10N)}{L(N)} = \left(\frac{N}{10N}\right)^{\alpha_N} = 10^{-\alpha_N}.
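With a typical \alpha_N \approx 0.076, that is
10^{-0.076} \approx 0.84: each 10\times in
parameters cuts the loss to about 84\% of its previous value, a reduction of
roughly 16\%. The absolute gain shrinks as the loss falls, but the ratio is the same at
every scale where the power law holds.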
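The arithmetic of Step 3 is easy to check numerically. The sketch below assumes a pure power law with the exponent quoted above. `loss_ratio` is a hypothetical helper name introduced here.

```python
ALPHA_N = 0.076  # exponent quoted in the text


def loss_ratio(scale_factor, alpha):
    """Return L(k * N) / L(N) for a pure power law L = (N_c / N)**alpha."""
    return scale_factor ** (-alpha)


per_doubling = loss_ratio(2, ALPHA_N)           # about 0.95
per_decade = loss_ratio(10, ALPHA_N)            # about 0.84
per_three_decades = loss_ratio(1000, ALPHA_N)   # about 0.59

print(f"2x: {per_doubling:.3f}  10x: {per_decade:.3f}  1000x: {per_three_decades:.3f}")
```

The ratios compound multiplicatively: three successive 10\times steps give the same factor as a single 1000\times step, which is the same constant-ratio property that makes the curve a straight line on log–log axes.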
Step 4 — the Chinchilla correction: grow N and D together. Given a
fixed compute budget C \approx 6 N D (training FLOPs, with D counted in tokens), you can spend it on a
bigger model or on more data. Minimising L(N, D) subject to that
budget gives the compute-optimal split — and the answer is that
N and D should grow at roughly the
same rate:
N_{\text{opt}} \propto C^{\,a}, \quad D_{\text{opt}} \propto C^{\,b}, \qquad a \approx b \approx 0.5 \ \Rightarrow\ N_{\text{opt}} \propto D_{\text{opt}}.
In practice this lands near 20 training tokens per parameter. The headline
finding was that the giant models of the day were badly over-parameterised and
under-trained: for the same compute, a smaller model fed far more data wins.
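As a rough sketch of how the split in Step 4 turns a budget into numbers, the code below combines C \approx 6ND with a fixed ratio of 20 tokens per parameter. Treating the ratio as exactly constant is a simplifying assumption that corresponds to a = b = 0.5, and `compute_optimal_split` is a hypothetical helper introduced here, not a function from any library.

```python
import math

FLOPS_PER_PARAM_TOKEN = 6   # from C ≈ 6 N D
TOKENS_PER_PARAM = 20       # rule of thumb from the text (assumed exactly constant)


def compute_optimal_split(compute_flops, tokens_per_param=TOKENS_PER_PARAM):
    """Split a FLOP budget into (parameters, tokens) under C = 6 N D and D = r N."""
    # Substituting D = r N into C = 6 N D gives N = sqrt(C / (6 r)).
    n_opt = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    d_opt = tokens_per_param * n_opt
    return n_opt, d_opt


budget = 1e21  # FLOPs, an arbitrary illustrative budget
n_opt, d_opt = compute_optimal_split(budget)
print(f"N ≈ {n_opt:.2e} parameters, D ≈ {d_opt:.2e} tokens")

# 100x more compute scales both N and D by 10x, since each grows as C**0.5.
n_big, d_big = compute_optimal_split(100 * budget)
print(f"N grows by {n_big / n_opt:.1f}x, D grows by {d_big / d_opt:.1f}x")
```

For the 1e21 FLOP budget this gives a model of roughly 3 billion parameters trained on roughly 60 billion tokens. The fitted exponents are only approximately 0.5, so the output is a starting point for planning rather than a precise optimum.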
Across orders of magnitude, the test loss of a language model is governed by simple power laws:
- Power law in N, D, C. L(N) \approx (N_c/N)^{\alpha_N}, and likewise L(D) \approx (D_c/D)^{\alpha_D}, L(C) \approx (C_c/C)^{\alpha_C}.
- Straight on log–log. Taking logs gives \log L = \text{const} - \alpha \log N, a line of slope -\alpha; each 10\times in scale multiplies the loss by 10^{-\alpha}.
- Chinchilla compute-optimal. For a fixed budget C \approx 6ND, the optimum has N_{\text{opt}} \propto D_{\text{opt}}: grow parameters and data together (≈ 20 tokens per parameter), not just the model.
The reason scaling laws are a planning tool, not just an observation, is that a
straight line extrapolates. Fit the loss at a handful of small, cheap models, draw the line on
log–log axes, and extend it: you get a quantitative forecast of the loss at a model
1000\times larger, before committing the compute. Whole training
runs are budgeted this way — pick the compute C you can afford, use
Chinchilla to split it into N and D, and
predict the result.
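The forecasting workflow described above can be sketched as follows. Everything here is synthetic: the "measurements" are generated from an assumed power law with small hand-picked perturbations, and the constants and variable names are placeholders introduced for illustration. The code fits a line to five small models and extends it to a model 1000\times larger than the biggest one fitted.

```python
import numpy as np

# Hypothetical generating law, used only to produce toy "measurements".
ALPHA_TRUE = 0.076
N_C_TRUE = 8.8e13


def true_loss(n):
    """Loss under the assumed pure power law."""
    return (N_C_TRUE / n) ** ALPHA_TRUE


# Five small, cheap models, with small fixed perturbations standing in for noise.
small_n = np.array([1e6, 10**6.5, 1e7, 10**7.5, 1e8])
log_noise = np.array([0.002, -0.001, 0.0015, -0.002, 0.001])
measured = true_loss(small_n) * 10 ** log_noise

# Fit the straight line in log-log coordinates.
slope, intercept = np.polyfit(np.log10(small_n), np.log10(measured), 1)

# Extend the line to a model 1000x larger than the largest one fitted.
target_n = 1000 * small_n.max()
forecast = 10 ** (intercept + slope * np.log10(target_n))
relative_error = forecast / true_loss(target_n) - 1

print(f"forecast at N = {target_n:.0e}: {forecast:.4f}")
print(f"relative error against the generating law: {relative_error:+.2%}")
```

In this toy setup the forecast stays close to the generating law, but the error at the target is larger than the perturbation on any single fitted point: a small error in the fitted slope is multiplied by the distance extrapolated in log N.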
But a power law in test loss is not a power law in usefulness. The smooth loss curve
hides discontinuities in behaviour: some abilities stay flat then jump (emergence), and the
line eventually bends: you hit the
irreducible loss (the entropy of language itself, a constant additive term
L_\infty that the curve flattens toward), or you run out of unique data,
or out of compute. Extrapolate with care; the ruler is straight only inside the regime you
fit it on.
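To see the bend caused by an irreducible term, the sketch below adds a constant floor to the power law, L(N) = (N_c/N)^{\alpha_N} + L_\infty, and measures the local log–log slope at several scales. The values of `N_C` and `L_INF` are hypothetical placeholders, and `local_slope` is an illustrative helper, not part of any library.

```python
import numpy as np

ALPHA = 0.076
N_C = 8.8e13   # hypothetical scale constant
L_INF = 1.7    # hypothetical irreducible loss


def loss_with_floor(n):
    """Power law plus a constant irreducible term."""
    return (N_C / n) ** ALPHA + L_INF


def local_slope(n, rel_step=1e-3):
    """Numerical d(log L) / d(log N) at n, by central difference."""
    lo, hi = n * (1 - rel_step), n * (1 + rel_step)
    d_log_loss = np.log(loss_with_floor(hi)) - np.log(loss_with_floor(lo))
    return d_log_loss / (np.log(hi) - np.log(lo))


scales = (1e6, 1e9, 1e12)
slopes = [local_slope(n) for n in scales]
for n, s in zip(scales, slopes):
    print(f"N = {n:.0e}: local slope = {s:.4f}")
```

With a floor in place the slope is always shallower than -\alpha_N and it flattens further as N grows, so a line fitted at small scale overestimates the gains available at large scale.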