Start with a token's feature vector $x = (x_1, \dots, x_d)$, one value per feature.
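Step 1 — recall LayerNorm. It subtracts the per-example mean $\mu = \frac{1}{d}\sum_{i=1}^{d} x_i$ from the features $x_1, \dots, x_d$, divides by the standard deviation $\sigma = \sqrt{\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2}$, then applies a learnable gain $g_i$ and bias $b_i$: $y_i = g_i \, \frac{x_i - \mu}{\sigma} + b_i$.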
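For reference, the LayerNorm recalled above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: the function name `layer_norm`, the `eps` stabiliser and the example vector are assumptions made here for the sketch.

```python
import math

def layer_norm(x, gain, bias, eps=1e-5):
    """Centre one feature vector, rescale it, then apply gain and bias."""
    d = len(x)
    mean = sum(x) / d                              # per-example mean
    var = sum((v - mean) ** 2 for v in x) / d      # per-example (biased) variance
    inv_std = 1.0 / math.sqrt(var + eps)           # eps is an assumed stabiliser
    return [g * (v - mean) * inv_std + b for v, g, b in zip(x, gain, bias)]

x = [1.0, 2.0, 3.0, 6.0]                           # made-up token features
y = layer_norm(x, gain=[1.0] * 4, bias=[0.0] * 4)

mean_y = sum(y) / len(y)                           # close to 0 after centring
var_y = sum((v - mean_y) ** 2 for v in y) / len(y) # close to 1 after rescaling
```

With unit gain and zero bias, the output has mean near zero and variance near one, which is the "centre, then rescale" behaviour the following steps take apart.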
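Step 2 — drop the mean. Set $\mu = 0$ in LayerNorm's variance $\frac{1}{d}\sum_{i=1}^{d}(x_i - \mu)^2$, and what remains is the mean square $\frac{1}{d}\sum_{i=1}^{d} x_i^2$.
Its square root is the root-mean-square, the overall magnitude of the vector: $\mathrm{RMS}(x) = \sqrt{\frac{1}{d}\sum_{i=1}^{d} x_i^2}$.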
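A small numeric sketch may help pin down the root-mean-square. The helper name `rms` and the example vector are illustrative assumptions, not part of the original text.

```python
import math

def rms(x):
    """Root-mean-square of a feature vector: sqrt of the mean of squares."""
    mean_square = sum(v * v for v in x) / len(x)
    return math.sqrt(mean_square)

x = [1.0, 2.0, 3.0, 6.0]                        # made-up token features
mean = sum(x) / len(x)
var = sum((v - mean) ** 2 for v in x) / len(x)

# Identity worth noticing: mean square = variance + mean squared,
# so the RMS equals the standard deviation only when the mean is zero.
gap = rms(x) ** 2 - (var + mean ** 2)
```

For this vector the mean is 3 and the variance is 3.5, so the mean square is 3.5 + 9 = 12.5. The RMS folds the offset of the vector into its notion of magnitude rather than removing it first.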
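Step 3 — divide by the RMS. Re-scale the raw vector $x$ by its own magnitude, with no subtraction first: $\hat{x}_i = x_i / \mathrm{RMS}(x)$.
Now $\hat{x}$ has a root-mean-square of exactly 1, whatever the scale of $x$ was.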
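The division step can be checked numerically. In the sketch below, `rms_normalize`, the small `eps` guard and the example vector are assumptions introduced for illustration.

```python
import math

def rms_normalize(x, eps=1e-8):
    """Divide every feature by the vector's RMS; nothing is subtracted."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)  # eps: assumed guard
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 6.0]                    # made-up token features
x_hat = rms_normalize(x)

rms_after = math.sqrt(sum(v * v for v in x_hat) / len(x_hat))
mean_after = sum(x_hat) / len(x_hat)        # not re-centred, so not zero
```

Here `rms_after` comes out at essentially 1, while `mean_after` stays well away from zero: the magnitude is fixed, but the offset of the vector is left alone.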
Step 4 — scale, but drop the bias. Apply a learnable per-feature gain $g_i$ and no additive bias: $y_i = g_i \, x_i / \mathrm{RMS}(x)$.
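Put together, the whole layer is small enough to sketch as a class. This is an illustrative toy over plain Python lists, not any library's implementation: the class name, the all-ones initial gain and the `eps` value are assumptions.

```python
import math

class RMSNorm:
    """Toy RMSNorm layer over plain lists (hypothetical helper)."""

    def __init__(self, dim, eps=1e-8):
        self.gain = [1.0] * dim   # learnable in a real model; ones is an assumed init
        self.eps = eps            # assumed stabiliser for an all-zero input

    def __call__(self, x):
        mean_square = sum(v * v for v in x) / len(x)
        inv_rms = 1.0 / math.sqrt(mean_square + self.eps)
        # Gain only: there is no "+ bias" term anywhere in this layer.
        return [g * v * inv_rms for g, v in zip(self.gain, x)]

norm = RMSNorm(dim=4)
norm.gain = [1.0, 1.0, 2.0, 0.5]  # pretend training moved the gains
y = norm([1.0, 2.0, 3.0, 6.0])
```

The layer carries one parameter per feature (the gain), where LayerNorm carries two (gain and bias).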
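Step 5 — count what we saved. LayerNorm needs two passes' worth of work over the features: one to find the mean, and one to find the variance around it. RMSNorm needs one, the sum of squares, and it also skips the mean subtraction and the bias add on every feature.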
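One way to make the count concrete is to instrument the sums over the feature axis. The sketch below assumes the textbook two-pass form of LayerNorm's statistics; `ReductionCounter` and the two `*_stats` functions are hypothetical helpers written for this illustration.

```python
class ReductionCounter:
    """Counts how many full sums over the feature axis a routine performs."""

    def __init__(self):
        self.count = 0

    def total(self, values):
        self.count += 1
        return sum(values)

def layer_norm_stats(x, counter):
    d = len(x)
    mean = counter.total(x) / d                             # reduction 1: mean
    var = counter.total((v - mean) ** 2 for v in x) / d     # reduction 2: variance
    return mean, var

def rms_norm_stats(x, counter):
    return counter.total(v * v for v in x) / len(x)         # the only reduction

x = [1.0, 2.0, 3.0, 6.0]                                    # made-up token features
ln_counter, rms_counter = ReductionCounter(), ReductionCounter()
layer_norm_stats(x, ln_counter)
rms_norm_stats(x, rms_counter)
```

After running, `ln_counter.count` is 2 and `rms_counter.count` is 1. A fused kernel can gather LayerNorm's mean and variance in a single sweep with two accumulators, so on real hardware the saving is better read as less arithmetic per feature than as a literal halving of memory passes.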
Step 6 — and the quality? Empirically, dead even. Dropping the re-centring barely moves the loss, which is the surprising part: the load-bearing operation in LayerNorm was the re-scaling all along. Keep the part that matters, pay less for it.
LayerNorm was originally justified as controlling both the mean and the variance of each activation. RMSNorm's ablation tells a cleaner story: the stabilising effect comes almost entirely from fixing the scale of the vector, which is what keeps gradients from exploding or vanishing as signals pass through a deep stack. Re-centring contributes little because the downstream linear layer already has a bias and can absorb any constant shift — so subtracting the mean was, in effect, redundant work the rest of the network was happy to do itself.
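There is a tidy invariance, too. RMSNorm is scale-invariant: multiply the whole input vector by any positive constant $\alpha$ and the output does not change, because $\mathrm{RMS}(\alpha x) = \alpha \, \mathrm{RMS}(x)$ and the factor cancels in the ratio $x_i / \mathrm{RMS}(x)$.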
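The invariance is easy to verify numerically. The sketch below leaves out any stabilising epsilon so the cancellation is exact up to floating-point rounding; `rms_normalize`, the vector and the scale factor are illustrative assumptions, and the input is assumed to be non-zero.

```python
import math

def rms_normalize(x):
    """Divide by the RMS; assumes x is not the all-zero vector."""
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return [v / rms for v in x]

x = [1.0, 2.0, 3.0, 6.0]            # made-up token features
alpha = 250.0                       # any positive constant
scaled = [alpha * v for v in x]

base = rms_normalize(x)
after = rms_normalize(scaled)
max_diff = max(abs(a - b) for a, b in zip(base, after))

# A negative constant is not invisible: it flips the sign of the output.
flipped = rms_normalize([-v for v in x])
```

`max_diff` is zero up to rounding error, while `flipped` is the negation of `base`. An implementation that adds an epsilon under the square root is only approximately scale-invariant, and the deviation shows up when the input is very small.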
The bars are one token's raw features. Flip the control between the two normalisers and watch what each one does. LayerNorm first slides every bar so the dashed mean line drops to zero, then rescales — the features end up centred. RMSNorm skips the slide entirely: it only rescales by the root-mean-square, so the mean line stays wherever it was. Same spread control, one fewer operation.