Before a network learns a single thing, you have to fill its weight matrices with
some numbers. It is tempting to treat this as a throwaway detail — scatter small
random values and start training. But the scale of those starting weights decides whether a
signal can even reach the far end of a deep network. Pick them wrong and, layer after layer,
the activations either shrink toward zero (the signal vanishes) or
blow up toward infinity (it explodes).
The whole game is to keep the variance of the activations — their typical spread — roughly constant from one layer to the next. A signal that neither fades nor detonates can be trained. Let's work out exactly what variance the weights need.
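Take a single neuron in a layer with $n_{\text{in}}$ inputs $x_1, \dots, x_{n_{\text{in}}}$ and weights $w_1, \dots, w_{n_{\text{in}}}$. Ignoring the bias, its pre-activation is the weighted sum $z = \sum_{i=1}^{n_{\text{in}}} w_i x_i$.
Make the standard starting assumptions: the weights $w_i$ are drawn independently from one zero-mean distribution with variance $\mathrm{Var}(w)$, the inputs $x_i$ are likewise independent with zero mean and common variance $\mathrm{Var}(x)$, and the weights are independent of the inputs.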
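Step 1: variance of a sum of independent terms. Independence means the variances simply add, with no covariance cross-terms:
$$\mathrm{Var}(z) = \mathrm{Var}\Big(\sum_{i=1}^{n_{\text{in}}} w_i x_i\Big) = \sum_{i=1}^{n_{\text{in}}} \mathrm{Var}(w_i x_i).$$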
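Step 2: variance of a single product. For two independent zero-mean
variables, the variance of the product is just the product of the variances (the general
identity $\mathrm{Var}(wx) = \mathrm{Var}(w)\,\mathrm{Var}(x) + \mathrm{Var}(w)\,\mathbb{E}[x]^2 + \mathrm{Var}(x)\,\mathbb{E}[w]^2$ loses its last two terms when both means are zero):
$$\mathrm{Var}(w_i x_i) = \mathrm{Var}(w)\,\mathrm{Var}(x).$$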
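The product rule in Step 2 can be checked numerically. The sketch below is an illustration, not part of the derivation: the Gaussian distributions, the sample count and the two standard deviations are arbitrary assumptions, and `variance` is a small helper defined here.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


N = 200_000
sigma_w, sigma_x = 0.5, 2.0  # arbitrary illustrative standard deviations

# Independent, zero-mean samples standing in for one weight and one input.
w = [random.gauss(0.0, sigma_w) for _ in range(N)]
x = [random.gauss(0.0, sigma_x) for _ in range(N)]
products = [wi * xi for wi, xi in zip(w, x)]

var_product = variance(products)
var_predicted = sigma_w ** 2 * sigma_x ** 2  # Var(w) * Var(x) = 0.25 * 4.0

print(f"empirical Var(w*x): {var_product:.3f}")
print(f"Var(w) * Var(x):    {var_predicted:.3f}")
```

The two printed numbers should agree up to sampling noise. Giving either variable a nonzero mean would bring the extra terms of the general identity back.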
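Step 3: add up the identical terms. There are $n_{\text{in}}$ of them, one per input, and each contributes $\mathrm{Var}(w)\,\mathrm{Var}(x)$:
$$\mathrm{Var}(z) = n_{\text{in}}\,\mathrm{Var}(w)\,\mathrm{Var}(x).$$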
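Step 4: demand the variance be preserved. For the spread to leave the
layer the same size it came in, $\mathrm{Var}(z) = \mathrm{Var}(x)$, the factor $n_{\text{in}}\,\mathrm{Var}(w)$ must equal 1:
$$\mathrm{Var}(w) = \frac{1}{n_{\text{in}}}.$$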
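Steps 1 to 4 say that a neuron with fan-in n multiplies the input variance by n times the weight variance. A minimal Monte Carlo sketch of that claim follows. It assumes Gaussian weights and inputs, and the function name `preactivation_variance`, the fan-in of 64 and the trial count are illustrative choices, not anything the text prescribes.

```python
import random

random.seed(1)  # fixed seed so the run is repeatable


def variance(values):
    """Population variance of a list of floats."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


def preactivation_variance(n_in, var_w, var_x=1.0, trials=4000):
    """Estimate Var(z) for z = sum_i w_i * x_i over fresh random draws."""
    sd_w, sd_x = var_w ** 0.5, var_x ** 0.5
    zs = []
    for _ in range(trials):
        z = sum(
            random.gauss(0.0, sd_w) * random.gauss(0.0, sd_x)
            for _ in range(n_in)
        )
        zs.append(z)
    return variance(zs)


n_in = 64
var_unit = preactivation_variance(n_in, var_w=1.0)           # expect about n_in
var_scaled = preactivation_variance(n_in, var_w=1.0 / n_in)  # expect about 1

print(f"Var(w) = 1      -> Var(z) ~ {var_unit:.1f}")
print(f"Var(w) = 1/n_in -> Var(z) ~ {var_scaled:.2f}")
```

With unit-variance weights the output variance lands near the fan-in. Dividing the weight variance by the fan-in brings it back to roughly the input variance.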
That is Xavier (Glorot) initialization: scale the weight variance by the
reciprocal of the fan-in. Each layer then hands the next a signal of the same size, and a
hundred layers later the variance is still what it was at the input.
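To see the effect over depth, here is a rough sketch that pushes one random input through a stack of purely linear layers. It assumes Gaussian weights redrawn for every layer and no activation function. The helper `forward_linear`, the width of 128 and the depth of 20 are hypothetical choices that keep plain Python fast.

```python
import random

random.seed(2)  # fixed seed so the run is repeatable


def mean_square(values):
    """Mean of squares; equals the variance for zero-mean activations."""
    return sum(v * v for v in values) / len(values)


def forward_linear(width, depth, var_w):
    """Push one random input through `depth` linear layers.

    Returns the mean square of the signal at the input and after each layer.
    """
    sd_w = var_w ** 0.5
    x = [random.gauss(0.0, 1.0) for _ in range(width)]
    history = [mean_square(x)]
    for _ in range(depth):
        # Fresh weights for every layer, one row of `width` weights per unit.
        x = [sum(random.gauss(0.0, sd_w) * xi for xi in x) for _ in range(width)]
        history.append(mean_square(x))
    return history


width, depth = 128, 20
scaled = forward_linear(width, depth, var_w=1.0 / width)  # fan-in scaling
unscaled = forward_linear(width, depth, var_w=1.0)        # no scaling

print(f"fan-in scaled, final mean square: {scaled[-1]:.3g}")
print(f"unscaled,      final mean square: {unscaled[-1]:.3g}")
```

With a single input and a finite width the scaled run wanders around 1 rather than sitting exactly on it, while the unscaled run grows by a factor of about the width at every layer.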
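That derivation assumed the activation passes the signal through untouched. A ReLU does not: it sets every negative pre-activation to zero. For a pre-activation distributed symmetrically around zero, that discards half of the signal's mean square.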
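Step 5: account for the halving. After a ReLU the propagated variance is only half of what the linear analysis gives. Writing $z_{\text{prev}}$ for the previous layer's pre-activation, so that this layer's input is $x = \mathrm{ReLU}(z_{\text{prev}})$ with $\mathbb{E}[x^2] = \tfrac{1}{2}\,\mathrm{Var}(z_{\text{prev}})$:
$$\mathrm{Var}(z) = \tfrac{1}{2}\,n_{\text{in}}\,\mathrm{Var}(w)\,\mathrm{Var}(z_{\text{prev}}).$$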
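Step 6: re-impose preservation. Set the prefactor to 1, that is $\tfrac{1}{2}\,n_{\text{in}}\,\mathrm{Var}(w) = 1$, which gives
$$\mathrm{Var}(w) = \frac{2}{n_{\text{in}}}.$$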
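That extra factor of 2 is He (Kaiming) initialization: it doubles the weight variance to pay back exactly the half that each ReLU discards.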
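The effect of the factor of 2 can be sketched by running the same random ReLU stack under two weight variances, 1/n (fan-in scaling) and 2/n (the He value). This is an illustrative simulation under assumed Gaussian weights. `forward_relu`, the width and the depth are hypothetical choices, and the tracked quantity is the mean square of each layer's pre-activations.

```python
import random

random.seed(3)  # fixed seed so the run is repeatable


def mean_square(values):
    """Mean of squares; equals the variance for zero-mean values."""
    return sum(v * v for v in values) / len(values)


def forward_relu(width, depth, var_w):
    """Return the mean square of the pre-activations of each ReLU layer."""
    sd_w = var_w ** 0.5
    a = [random.gauss(0.0, 1.0) for _ in range(width)]  # network input
    history = []
    for _ in range(depth):
        # Fresh Gaussian weights per layer, then the ReLU nonlinearity.
        z = [sum(random.gauss(0.0, sd_w) * ai for ai in a) for _ in range(width)]
        history.append(mean_square(z))
        a = [max(0.0, zi) for zi in z]
    return history


width, depth = 128, 20
fan_in_run = forward_relu(width, depth, var_w=1.0 / width)  # halves per layer
he_run = forward_relu(width, depth, var_w=2.0 / width)      # roughly level

print(f"Var(w) = 1/n: final pre-activation mean square {fan_in_run[-1]:.3g}")
print(f"Var(w) = 2/n: final pre-activation mean square {he_run[-1]:.3g}")
```

The 1/n run loses about a factor of two per layer and ends several orders of magnitude below where it started. The 2/n run stays within sampling noise of its starting level.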
The two failure modes bracket the sweet spot from both sides.
Scale too small (toward 0). If you initialise every weight to exactly the
same value (the limiting case, zero variance), every neuron in a layer computes the
identical thing and receives the identical gradient — they update in lockstep and stay
clones forever. This is the symmetry trap: with no variation to break it,
a wide layer has the expressive power of a single neuron. Push the scale merely very small
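and you get the vanishing cousin: each layer multiplies the variance by
something well below 1, so the signal shrinks geometrically with depth and is effectively zero long before it reaches the output.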
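The symmetry trap is easy to reproduce on a toy network. The sketch below is a hypothetical example, not the setup used in the text: one hidden tanh layer of four units, a linear output, squared-error loss, hand-written gradients, and every weight initialised to the same constant 0.3. The function `train_step` and all the numbers are assumptions made for illustration.

```python
import math


def train_step(W1, v, x, target, lr=0.05):
    """One gradient step on loss = 0.5 * (y - target)**2 for a tiny network.

    Hidden units: h_j = tanh(sum_i W1[j][i] * x_i); output: y = sum_j v_j * h_j.
    """
    h = [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W1]
    y = sum(vj * hj for vj, hj in zip(v, h))
    err = y - target
    # dL/dv_j = err * h_j
    new_v = [vj - lr * err * hj for vj, hj in zip(v, h)]
    # dL/dW1[j][i] = err * v_j * (1 - h_j**2) * x_i
    new_W1 = [
        [w - lr * err * vj * (1.0 - hj * hj) * xi for w, xi in zip(row, x)]
        for row, vj, hj in zip(W1, v, h)
    ]
    return new_W1, new_v


x = [0.5, -1.0, 2.0]
target = 1.0
W1 = [[0.3] * 3 for _ in range(4)]  # every hidden unit starts identical
v = [0.3] * 4

for _ in range(50):
    W1, v = train_step(W1, v, x, target)

rows_identical = all(row == W1[0] for row in W1)
outputs_identical = all(vj == v[0] for vj in v)
print("hidden units still clones:", rows_identical and outputs_identical)
print("weights did move:", W1[0] != [0.3, 0.3, 0.3])
```

The weights change during training, but every hidden unit changes in exactly the same way, so the four units never become distinguishable.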
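Scale too large. Now each layer multiplies the variance by something above 1, so the activations grow geometrically with depth until they overflow and the loss turns into
NaN. Only the
knife-edge value $\mathrm{Var}(w) = 2/n_{\text{in}}$ keeps the variance level from layer to layer.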
Each curve tracks the typical activation variance as the signal passes through layer after
layer (depth along the bottom, variance up the side, log-scaled so vanishing and exploding
both fit). The slider sets the weight scale as a multiple of the He value $\mathrm{Var}(w) = 2/n_{\text{in}}$.
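Curves of this kind can be approximated without any simulation. Under the idealised ReLU analysis, a weight variance set to some multiple of the He value scales the activation variance by that same multiple at every layer. The sketch below tabulates that recursion. `variance_by_depth` and the chosen multiples are illustrative assumptions, not the code behind the figure.

```python
import math


def variance_by_depth(multiple, depth, var0=1.0):
    """Variance after each layer when every layer scales it by `multiple`."""
    variances = [var0]
    for _ in range(depth):
        variances.append(variances[-1] * multiple)
    return variances


depth = 50
for multiple in (0.5, 1.0, 2.0):
    final = variance_by_depth(multiple, depth)[-1]
    print(f"multiple {multiple}: log10(variance) after {depth} layers = "
          f"{math.log10(final):+.1f}")

# Far enough past the knife-edge, the value leaves the float64 range:
# repeated multiplication saturates at inf, and inf - inf is nan.
overflowed = variance_by_depth(2.0, 1100)[-1]
print(overflowed, overflowed - overflowed)
```

On a log scale each multiple gives a straight line: sloping down below 1, flat at exactly 1, sloping up above 1.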