Good weight initialization sets the variance up at the start, but as soon as
training begins, the weights move, and the distribution of each layer's pre-activations
drifts again. Wouldn't it be nice to re-centre and re-scale that distribution
continuously, at every step, so each layer always sees well-behaved inputs?
That is exactly what batch normalization does. For each pre-activation, it
normalises the values across the mini-batch to mean 0 and
variance 1, then applies a learnable rescale. The result trains
faster, tolerates higher learning rates, and is far less fussy about initialization.
The four steps, line by line
Fix one pre-activation feature and look at its values across a mini-batch of
B examples, x_1, \dots, x_B. Batch norm
transforms them in four moves.
Step 1 — the batch mean. Average the feature over the batch; this is the
centre we will subtract off:
\mu_B = \frac{1}{B}\sum_{i=1}^{B} x_i.
Step 2 — the batch variance. The mean squared deviation from that centre,
measuring the spread we will divide out:
\sigma_B^2 = \frac{1}{B}\sum_{i=1}^{B} (x_i - \mu_B)^2.
Step 3 — normalise. Shift by the mean and divide by the standard deviation.
The small constant \varepsilon (e.g.
10^{-5}) guards against dividing by zero when a feature happens to
be constant across the batch:
\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}.
By construction the \hat{x}_i now have (almost exactly) mean
0 and variance 1 — a clean, standardised
distribution, every step, for every feature.
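Steps 1 to 3 can be sketched in NumPy. This is a minimal illustration rather than a reference implementation: the function name `normalize_batch`, the input shape `(B, D)` (batch by features) and the default `eps` are assumptions made for the example.

```python
import numpy as np

def normalize_batch(x, eps=1e-5):
    """Standardise each feature (column) of x over the batch axis.

    x has shape (B, D): B examples, D pre-activation features (assumed layout).
    Returns the normalised values plus the batch mean and variance.
    """
    mu = x.mean(axis=0)                    # Step 1: batch mean, shape (D,)
    var = ((x - mu) ** 2).mean(axis=0)     # Step 2: batch variance (divide by B)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalise
    return x_hat, mu, var

rng = np.random.default_rng(0)
x = rng.normal(loc=3.0, scale=2.0, size=(64, 4))  # off-centre, spread-out inputs
x_hat, mu, var = normalize_batch(x)

print(x_hat.mean(axis=0))  # numerically zero for every feature
print(x_hat.var(axis=0))   # just below 1: eps shrinks it to var / (var + eps)
```

The "almost exactly" in the text comes from \varepsilon: the mean is zero up to rounding, while the variance is \sigma_B^2 / (\sigma_B^2 + \varepsilon), which is marginally less than 1.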
Step 4 — scale and shift, learnably. Forcing mean
0, variance 1 is sometimes too rigid —
the layer might genuinely want a different mean or spread (for instance to use the
non-linear part of a sigmoid). So batch norm hands two learnable parameters,
\gamma and \beta, back to gradient
descent:
y_i = \gamma\,\hat{x}_i + \beta.
Crucially the network can undo the normalisation if that is what minimises the loss:
for a given batch, setting \gamma = \sqrt{\sigma_B^2 + \varepsilon} and
\beta = \mu_B recovers the original
x_i exactly. So batch norm does not reduce what the layer can represent; it
only gives the optimiser an easier, better-conditioned coordinate system to search in.
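The "undo" property is easy to check numerically. The sketch below assumes a hypothetical helper `batch_norm_train` operating on a `(B, D)` array. It is not taken from any library, and \gamma and \beta are set by hand here, whereas in practice they are learned by gradient descent.

```python
import numpy as np

def batch_norm_train(x, gamma, beta, eps=1e-5):
    """All four steps for a (B, D) batch, using the batch's own statistics."""
    mu = x.mean(axis=0)                    # Step 1: batch mean
    var = x.var(axis=0)                    # Step 2: batch variance (np.var divides by B)
    x_hat = (x - mu) / np.sqrt(var + eps)  # Step 3: normalise
    y = gamma * x_hat + beta               # Step 4: scale and shift
    return y, mu, var

eps = 1e-5
rng = np.random.default_rng(1)
x = rng.normal(loc=-2.0, scale=5.0, size=(32, 3))

# gamma = 1, beta = 0 leaves the output standardised.
y, mu, var = batch_norm_train(x, np.ones(3), np.zeros(3), eps)

# gamma = sqrt(var + eps), beta = mu undoes the normalisation for this batch.
y_identity, _, _ = batch_norm_train(x, np.sqrt(var + eps), mu, eps)
print(np.allclose(y_identity, x))  # True
```

The identity holds for the statistics of this particular batch; with fixed \gamma and \beta a different batch would be recovered only approximately.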
Train time versus inference time
One subtlety: at training time \mu_B and
\sigma_B^2 are computed from the current mini-batch. At
inference time we often have a single example — there is no batch to average
over, and we don't want the prediction for one input to depend on whatever other inputs
happen to share its batch. So during training a running average of the mean
and variance is accumulated, e.g.
\mu_{\text{run}} \leftarrow (1-\alpha)\,\mu_{\text{run}} + \alpha\,\mu_B,
and at inference those fixed running statistics replace the batch statistics. The transform
then becomes a plain, deterministic affine map — no batch required.
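A minimal sketch of the two modes follows. The class `BatchNorm1D`, its `training` flag, the value \alpha = 0.1 and the initial running statistics (mean 0, variance 1) are all assumptions made for illustration. Applying the same update rule to the variance as to the mean is also an assumption, since only the mean update is written out above, and libraries may differ in such details.

```python
import numpy as np

class BatchNorm1D:
    """Hypothetical minimal batch-norm layer for inputs of shape (B, D)."""

    def __init__(self, num_features, alpha=0.1, eps=1e-5):
        self.gamma = np.ones(num_features)         # learnable scale (held fixed here)
        self.beta = np.zeros(num_features)         # learnable shift (held fixed here)
        self.alpha = alpha                         # running-average rate (assumed value)
        self.eps = eps
        self.running_mean = np.zeros(num_features)
        self.running_var = np.ones(num_features)

    def forward(self, x, training):
        if training:
            mu = x.mean(axis=0)
            var = x.var(axis=0)
            # mu_run <- (1 - alpha) * mu_run + alpha * mu_B, and likewise for the variance
            self.running_mean = (1 - self.alpha) * self.running_mean + self.alpha * mu
            self.running_var = (1 - self.alpha) * self.running_var + self.alpha * var
        else:
            # Inference: fixed statistics, so each row is transformed independently.
            mu, var = self.running_mean, self.running_var
        x_hat = (x - mu) / np.sqrt(var + self.eps)
        return self.gamma * x_hat + self.beta

rng = np.random.default_rng(2)
bn = BatchNorm1D(num_features=3)
for _ in range(200):  # simulated training batches
    bn.forward(rng.normal(loc=4.0, scale=2.0, size=(32, 3)), training=True)

single = np.array([[4.0, 6.0, 2.0]])
out_alone = bn.forward(single, training=False)
others = rng.normal(size=(7, 3))
out_in_batch = bn.forward(np.vstack([single, others]), training=False)[:1]
print(np.allclose(out_alone, out_in_batch))  # True
```

The final check illustrates the point of the running statistics: at inference the output for an example is the same whether it arrives alone or alongside other inputs.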
For a pre-activation feature with mini-batch values
x_1, \dots, x_B, batch normalization applies:
- Batch mean: \mu_B = \tfrac{1}{B}\sum_i x_i, and batch variance: \sigma_B^2 = \tfrac{1}{B}\sum_i (x_i - \mu_B)^2.
- Normalise: \hat{x}_i = \dfrac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, giving mean 0, variance 1.
- Rescale: y_i = \gamma\,\hat{x}_i + \beta, with \gamma, \beta learnable (so the layer can recover any mean/variance it wants, including the original).
- Train vs inference: training uses the live batch statistics; inference uses fixed running averages of the mean and variance, so a single example needs no batch.
The original 2015 paper (Ioffe & Szegedy) explained batch norm through
internal covariate shift: as earlier layers update, the distribution of
inputs to later layers keeps shifting, so each layer is forever chasing a moving target.
Pinning every layer's inputs to a fixed mean and variance, the story went, stops the shift
and lets layers learn in peace.
It is a tidy intuition — but later work (Santurkar et al., 2018) cast doubt on it: you can
inject covariate shift back in and batch norm still helps. The modern
reinterpretation is geometric: batch norm smooths the loss landscape,
making both the loss and its gradients change less abruptly (smaller Lipschitz
constants), so the gradients are more predictable. A smoother loss
surface means larger, safer steps and faster, more stable descent. Whichever
framing you prefer, the empirical verdict is unambiguous: it works.