Batch Normalization

Good weight initialization sets the variance up at the start — but as soon as training begins, the weights move, and the distribution of each layer's pre-activations drifts again. Wouldn't it be nice to re-centre and re-scale that distribution continuously, at every step, so each layer always sees well-behaved inputs?

That is exactly what batch normalization does. For each pre-activation, it normalises the values across the mini-batch to mean 0 and variance 1, then applies a learnable scale and shift. In practice, networks with batch norm tend to train faster, tolerate higher learning rates, and be far less sensitive to initialization.

The four steps, line by line

Fix one pre-activation feature and look at its values across a mini-batch of B examples, x_1, \dots, x_B. Batch norm transforms them in four moves.

Step 1 — the batch mean. Average the feature over the batch; this is the centre we will subtract off:

\mu_B = \frac{1}{B}\sum_{i=1}^{B} x_i.

Step 2 — the batch variance. The mean squared deviation from that centre, measuring the spread we will divide out:

\sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu_B)^2.

Step 3 — normalise. Shift by the mean and divide by the standard deviation. The small constant \varepsilon (e.g. 10^{-5}) guards against dividing by zero when a feature happens to be constant across the batch:

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}.

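By construction the \hat{x}_i now have mean exactly 0 and variance \sigma_B^2 / (\sigma_B^2 + \varepsilon), which is just below 1 whenever \varepsilon is small relative to \sigma_B^2: a clean, standardised distribution, every step, for every feature.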
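Steps 1 to 3 can be sketched in a few lines of NumPy. This is a minimal illustration rather than a reference implementation: the function name `batch_norm_normalise`, the batch shape `(64, 5)` and the synthetic data are all assumptions made for the example.

```python
import numpy as np

def batch_norm_normalise(x, eps=1e-5):
    """Steps 1-3 for a batch x of shape (B, num_features)."""
    mu = x.mean(axis=0)                    # Step 1: per-feature batch mean
    var = ((x - mu) ** 2).mean(axis=0)     # Step 2: batch variance (divide by B)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: shift and scale
    return x_hat, mu, var

# Synthetic "messy" pre-activations: off-centre (mean 3) and too wide (std 4).
rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=4.0, size=(64, 5))

x_hat, mu, var = batch_norm_normalise(x)

print(x_hat.mean(axis=0))  # every entry is 0 up to floating-point error
print(x_hat.var(axis=0))   # every entry is var / (var + eps), just below 1
```

Each column of `x` is one feature, so the statistics are taken along `axis=0`, that is, across the batch.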

Step 4 — scale and shift, learnably. Forcing mean 0, variance 1 is sometimes too rigid — the layer might genuinely want a different mean or spread (for instance to use the non-linear part of a sigmoid). So batch norm hands two learnable parameters, \gamma and \beta, back to gradient descent:

y_i = \gamma\,\hat{x}_i + \beta.

Crucially, the network can undo the normalisation if that is what minimises the loss: for a given batch, setting \gamma = \sqrt{\sigma_B^2 + \varepsilon} and \beta = \mu_B recovers the original x_i exactly. So batch norm costs the model essentially no expressive power; it gives the optimiser an easier, better-conditioned coordinate system to search in.
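The "undo" claim is easy to check numerically. The sketch below assumes a single feature and a synthetic batch of 32 values; in a real network \gamma and \beta are learned constants, so this exact recovery holds for the statistics of one particular batch.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(loc=-2.0, scale=0.5, size=32)  # one feature, B = 32

mu = x.mean()
var = x.var()                           # np.var divides by B by default
x_hat = (x - mu) / np.sqrt(var + eps)   # Step 3

# Step 4 with the particular gamma and beta that invert Step 3.
gamma = np.sqrt(var + eps)
beta = mu
y = gamma * x_hat + beta

print(np.allclose(y, x))  # True
```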

Train time versus inference time

One subtlety: at training time \mu_B and \sigma_B^2 are computed from the current mini-batch. At inference time we often have a single example — there is no batch to average over, and we don't want the prediction for one input to depend on whatever other inputs happen to share its batch. So during training a running average of the mean and variance is accumulated, e.g.

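\mu_{\text{run}} \leftarrow (1-\alpha)\,\mu_{\text{run}} + \alpha\,\mu_B, \qquad \sigma_{\text{run}}^2 \leftarrow (1-\alpha)\,\sigma_{\text{run}}^2 + \alpha\,\sigma_B^2,

where \alpha is a small update rate (0.1 is a common choice). At inference those fixed running statistics replace the batch statistics, and the transform becomes a plain, deterministic affine map with no batch required.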
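The two modes can be sketched as a small forward-only layer. The class name `BatchNorm1D`, the update rate `alpha=0.1` and the initial running statistics (mean 0, variance 1) are assumptions made for this illustration, not something the text prescribes, and the backward pass is omitted.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical minimal batch norm layer (forward pass only)."""

    def __init__(self, num_features, alpha=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)    # learnable scale
        self.beta = np.zeros(num_features)    # learnable shift
        self.run_mean = np.zeros(num_features)
        self.run_var = np.ones(num_features)
        self.alpha = alpha
        self.eps = eps

    def forward(self, x, training):
        if training:
            # Live batch statistics, also folded into the running averages.
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            self.run_mean = (1 - self.alpha) * self.run_mean + self.alpha * mu
            self.run_var = (1 - self.alpha) * self.run_var + self.alpha * var
        else:
            # Fixed running statistics: no dependence on the current batch.
            mu, var = self.run_mean, self.run_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(2)
bn = BatchNorm1D(num_features=3)
for _ in range(200):  # simulate training batches with mean 5 and std 2
    bn.forward(rng.normal(loc=5.0, scale=2.0, size=(64, 3)), training=True)

# At inference, one example gets the same output alone or inside any batch.
single = np.array([[5.0, 7.0, 3.0]])
out_alone = bn.forward(single, training=False)
mixed = np.vstack([single, rng.normal(size=(7, 3))])
out_in_batch = bn.forward(mixed, training=False)[0]
print(np.allclose(out_alone[0], out_in_batch))  # True

# The inference transform collapses to a single affine map y = a * x + b.
a = bn.gamma / np.sqrt(bn.run_var + bn.eps)
b = bn.beta - a * bn.run_mean
print(np.allclose(out_alone, a * single + b))  # True
```

Because the inference transform is just `a * x + b` per feature, it can in principle be folded into an adjacent linear layer.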

For a pre-activation feature with mini-batch values x_1, \dots, x_B, batch normalization applies:

Batch mean: \mu_B = \tfrac{1}{B}\sum_i x_i, and batch variance: \sigma_B^2 = \tfrac{1}{B}\sum_i (x_i - \mu_B)^2.
Normalise: \hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}} — mean 0, variance 1.
Rescale: y_i = \gamma\,\hat{x}_i + \beta, with \gamma, \beta learnable (so the layer can recover any mean/variance it wants, including the original).
Train vs inference: training uses the live batch statistics; inference uses fixed running averages of the mean and variance, so a single example needs no batch.

The original 2015 paper (Ioffe & Szegedy) explained batch norm through internal covariate shift: as earlier layers update, the distribution of inputs to later layers keeps shifting, so each layer is forever chasing a moving target. Pinning every layer's inputs to a fixed mean and variance, the story went, stops the shift and lets layers learn in peace.

It is a tidy intuition, but later work (Santurkar et al., 2018) cast doubt on it: deliberately injecting distribution shift after the batch norm layers leaves most of the benefit intact. The modern reinterpretation is geometric: batch norm smooths the loss landscape, improving the Lipschitz behaviour of both the loss and its gradients, so the gradients become more predictable. A smoother loss surface permits larger, safer steps and faster, more stable descent. Whichever framing you prefer, the empirical verdict is clear: it works.

See a distribution get normalised

The faint bell is a messy raw feature — off-centre and too wide. Batch norm first standardises it to the clean unit bell (mean 0, variance 1), then the learnable \gamma and \beta reshape that into the bold output. Slide \gamma to widen or narrow it and \beta to slide it sideways — exactly the two knobs gradient descent gets to tune.
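The effect of the two knobs can also be checked numerically. The sketch below assumes a synthetic raw feature (mean 4, standard deviation 3) and a few hand-picked \gamma, \beta pairs; in a real network these values would be learned by gradient descent.

```python
import numpy as np

eps = 1e-5
rng = np.random.default_rng(3)
raw = rng.normal(loc=4.0, scale=3.0, size=10_000)  # off-centre and too wide

# Standardise to (approximately) the unit bell.
x_hat = (raw - raw.mean()) / np.sqrt(raw.var() + eps)

# gamma sets the spread of the output, beta sets its centre.
for gamma, beta in [(1.0, 0.0), (2.0, 0.0), (0.5, -1.0)]:
    y = gamma * x_hat + beta
    print(f"gamma={gamma}, beta={beta}: mean={y.mean():.3f}, std={y.std():.3f}")
```

In every row the printed mean matches \beta and the printed standard deviation matches |\gamma| to within rounding, whatever the raw feature looked like.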