Weight Initialization

Before a network learns a single thing, you have to fill its weight matrices with some numbers. It is tempting to treat this as a throwaway detail — scatter small random values and start training. But the scale of those starting weights decides whether a signal can even reach the far end of a deep network. Pick them wrong and, layer after layer, the activations either shrink toward zero (the signal vanishes) or blow up toward infinity (it explodes) — and backpropagation suffers the identical fate on the way back.

The whole game is to keep the variance of the activations — their typical spread — roughly constant from one layer to the next. A signal that neither fades nor detonates can be trained. Let's work out exactly what variance the weights need.

The variance of one linear layer, line by line

Take a single neuron in a layer with n_{\text{in}} inputs. Before its activation it computes the weighted sum

y = \sum_{i=1}^{n_{\text{in}}} w_i x_i.

Make the standard starting assumptions: the weights w_i are independent, zero-mean and share one variance \operatorname{Var}(w); the inputs x_i are independent of the weights, also zero-mean, with common variance \operatorname{Var}(x). Watch how the variance of y falls out.

Step 1 — variance of a sum of independent terms. Independence means the variances simply add, with no covariance cross-terms:

\operatorname{Var}(y) = \operatorname{Var}\!\left(\sum_{i=1}^{n_{\text{in}}} w_i x_i\right) = \sum_{i=1}^{n_{\text{in}}} \operatorname{Var}(w_i x_i).

Step 2 — variance of a single product. For two independent zero-mean variables, the variance of the product is just the product of the variances (the general identity \operatorname{Var}(AB) = \operatorname{Var}(A)\operatorname{Var}(B) + \operatorname{Var}(A)\,\mathbb{E}[B]^2 + \mathbb{E}[A]^2\,\operatorname{Var}(B) loses its last two terms because both means are 0):

\operatorname{Var}(w_i x_i) = \operatorname{Var}(w_i)\,\operatorname{Var}(x_i) = \operatorname{Var}(w)\,\operatorname{Var}(x).

Step 3 — add up the identical terms. There are n_{\text{in}} of them, each equal, so the sum is just n_{\text{in}} copies:

\operatorname{Var}(y) = n_{\text{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x).
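The formula in Step 3 is easy to sanity-check by simulation. The sketch below is only an illustration: the Gaussian distributions, the sample count, and the helper name `empirical_preactivation_variance` are assumptions made here (the derivation itself only needs zero means and independence).

```python
import numpy as np

def empirical_preactivation_variance(n_in, var_w, var_x, n_samples=10_000, seed=0):
    """Draw y = sum_i w_i x_i many times and return the sample variance of y."""
    rng = np.random.default_rng(seed)
    # Assumption: Gaussian weights and inputs. Any zero-mean, independent
    # distributions with these variances should behave the same way.
    w = rng.normal(0.0, np.sqrt(var_w), size=(n_samples, n_in))
    x = rng.normal(0.0, np.sqrt(var_x), size=(n_samples, n_in))
    y = np.sum(w * x, axis=1)  # one weighted sum per row
    return y.var()

n_in, var_w, var_x = 128, 0.01, 1.5
predicted = n_in * var_w * var_x  # Step 3: n_in * Var(w) * Var(x)
measured = empirical_preactivation_variance(n_in, var_w, var_x)
print(f"predicted {predicted:.3f}, measured {measured:.3f}")
```

The two numbers should agree to within sampling noise, and changing `n_in` or either variance should move both together.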

Step 4 — demand the variance be preserved. For the spread to leave the layer the same size it came in — \operatorname{Var}(y) = \operatorname{Var}(x) — the factor multiplying \operatorname{Var}(x) must be exactly 1:

n_{\text{in}}\,\operatorname{Var}(w) = 1 \quad\Longrightarrow\quad \operatorname{Var}(w) = \frac{1}{n_{\text{in}}}.

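That is Xavier (Glorot) initialization in its fan-in form: set the weight variance to the reciprocal of the fan-in. (Glorot and Bengio's original rule also balances the backward pass by using \operatorname{Var}(w) = 2/(n_{\text{in}} + n_{\text{out}}), where n_{\text{out}} is the number of outputs; it reduces to 1/n_{\text{in}} when the layer has as many outputs as inputs.) Each layer then hands the next a signal of the same size, and a hundred layers later the variance is still \operatorname{Var}(x) rather than 10^{-30} or 10^{30}.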
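To see the rule at work, here is a minimal sketch that pushes a batch through a stack of purely linear layers whose weights have variance 1/n_{\text{in}}. The helper name `fan_in_init`, the width, the depth, and the Gaussian draws are all assumptions chosen for illustration. The last loop simply evaluates the per-layer multiplier raised to the 100th power, which is where figures like 10^{-30} and 10^{30} come from when the multiplier is 0.5 or 2.

```python
import numpy as np

def fan_in_init(n_in, n_out, rng):
    """Zero-mean Gaussian weights with variance 1/n_in (hypothetical helper)."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_in, n_out))

rng = np.random.default_rng(0)
width, depth = 512, 30
h = rng.normal(size=(512, width))  # batch of roughly unit-variance inputs
input_var = h.var()
for _ in range(depth):
    h = h @ fan_in_init(width, width, rng)  # linear layers only, no activation
output_var = h.var()
print(f"variance in: {input_var:.3f}, after {depth} layers: {output_var:.3f}")

# Closed form for other scales: each layer multiplies the variance by n_in * Var(w).
for multiplier in (0.5, 1.0, 2.0):
    print(f"n_in * Var(w) = {multiplier}: factor after 100 layers = {multiplier**100:.1e}")
```

At finite width the simulated variance wanders a little from layer to layer, so expect it to stay near its starting value rather than match it exactly.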

The ReLU correction: He initialization

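That derivation assumed the activation passes the signal through untouched. A ReLU does not: it zeroes every negative input, deleting (on a symmetric, zero-mean input) about half the distribution. The outputs are then no longer zero-mean, so what the next layer actually inherits is their mean square \mathbb{E}[x^2], and zeroing the negative half of a symmetric distribution cuts that to exactly half the variance the signal had before the ReLU. In the bookkeeping below, \operatorname{Var}(x) stands for that pre-ReLU variance.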

Step 5 — account for the halving. After a ReLU the propagated variance is only

\operatorname{Var}(y) = \tfrac{1}{2}\, n_{\text{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x).

Step 6 — re-impose preservation. Set the prefactor to 1 again, now carrying the stray \tfrac12:

\tfrac{1}{2}\, n_{\text{in}}\,\operatorname{Var}(w) = 1 \quad\Longrightarrow\quad \operatorname{Var}(w) = \frac{2}{n_{\text{in}}}.

That extra factor of 2 is He initialization, and it is the default for the ReLU-family networks that dominate modern deep learning. The moral: the right initial variance is not a universal constant — it tracks how much variance your activation throws away.
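The factor of 2 can be checked on a single layer. In this sketch (Gaussian pre-activations, the sizes, and the variable names are assumptions for illustration), `z` plays the role of the previous layer's pre-activation with variance 1. Strictly, it is the mean square of the ReLU output that comes out at one half; the centered variance is smaller still because the output mean is no longer zero. The weighted sum only cares about the mean square, so a weight variance of 2/n_{\text{in}} should bring the next pre-activation back to variance 1.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_samples = 256, 5_000

z = rng.normal(0.0, 1.0, size=(n_samples, n_in))  # pre-activations, variance 1
x = np.maximum(z, 0.0)                            # ReLU output

second_moment = np.mean(x**2)  # theory: 1/2
variance = x.var()             # theory for a Gaussian: 1/2 - 1/(2*pi), about 0.34

# Fresh He-scaled weights for every sample, matching the independence assumptions.
w = rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_samples, n_in))
y = np.sum(w * x, axis=1)      # next layer's pre-activation

print(f"E[x^2] = {second_moment:.3f}, Var(x) = {variance:.3f}, Var(y) = {y.var():.3f}")
```

Swapping `2.0 / n_in` for `1.0 / n_in` in the weight scale should roughly halve the last number, which is the per-layer loss He initialization is designed to cancel.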

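For a linear layer y = \sum_{i=1}^{n_{\text{in}}} w_i x_i with independent, zero-mean weights and inputs, the results so far are: \operatorname{Var}(y) = n_{\text{in}}\,\operatorname{Var}(w)\,\operatorname{Var}(x) in general; \operatorname{Var}(w) = 1/n_{\text{in}} to preserve the variance through the linear layer alone; and \operatorname{Var}(w) = 2/n_{\text{in}} when a ReLU sits between layers.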

The two failure modes bracket the sweet spot from both sides.

Scale too small (toward 0). If you initialize every weight to exactly the same value (the limiting case, zero variance), every neuron in a layer computes the identical thing and receives the identical gradient, so they update in lockstep and stay clones forever. This is the symmetry trap: with no variation to break it, a wide layer has the expressive power of a single neuron.

Push the scale merely very small and you get the vanishing cousin: each layer multiplies the variance by something well below 1, so the signal decays geometrically and the deep layers see essentially nothing. The gradients flowing back shrink just as fast, so those layers barely learn.
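The symmetry trap can be shown on a toy network. This is a minimal sketch under assumptions made here: one tanh hidden layer, a linear output, mean squared error, random data, and a hand-written backward pass in a hypothetical helper `first_layer_gradient`. Each column of the returned matrix is the gradient for one hidden unit.

```python
import numpy as np

def first_layer_gradient(W1, W2, x, t):
    """Gradient of the mean squared error with respect to W1 in a tiny two-layer net."""
    h = np.tanh(x @ W1)                 # hidden activations, one column per unit
    y = h @ W2                          # linear output
    dy = 2.0 * (y - t) / len(x)         # d(loss)/dy for loss = mean((y - t)**2)
    dh = dy @ W2.T                      # backpropagate into the hidden layer
    return x.T @ (dh * (1.0 - h**2))    # tanh'(a) = 1 - tanh(a)**2

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 4))            # 32 examples, 4 features
t = rng.normal(size=(32, 1))            # regression targets
n_hidden = 5

# Constant initialization: every hidden unit starts identical.
g_const = first_layer_gradient(np.full((4, n_hidden), 0.3), np.full((n_hidden, 1), 0.3), x, t)
clones = np.allclose(g_const, g_const[:, :1])

# Random initialization: the units start different and get different gradients.
g_rand = first_layer_gradient(rng.normal(0.0, 0.3, size=(4, n_hidden)),
                              rng.normal(0.0, 0.3, size=(n_hidden, 1)), x, t)
distinct = not np.allclose(g_rand, g_rand[:, :1])

print("constant init, identical gradients per unit:", clones)
print("random init, gradients differ across units:", distinct)
```

Because identical units receive identical updates, a gradient step leaves them identical, so the same check would pass again after any number of steps.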

Scale too large. Now each layer multiplies the variance by something above 1; the activations explode geometrically. Saturating activations like sigmoid/tanh slam into their flat tails where the gradient is near zero (so, perversely, large weights also stall learning), and unbounded activations overflow to \pm\infty / NaN. Only the knife-edge \operatorname{Var}(w) \approx 1/n_{\text{in}} (or 2/n_{\text{in}} for ReLU) keeps the multiplier at 1 and the signal alive end to end.

Watch the variance live

Each curve tracks the typical activation variance as the signal passes through layer after layer (depth along the bottom, variance up the side, log-scaled so vanishing and exploding both fit). The slider sets the weight scale as a multiple of the He value 2/n_{\text{in}}. At 1.0 the curve stays flat — the signal survives. Dial it down and the variance decays to nothing; dial it up and it rockets off the top.
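The curves can be approximated offline with a few lines of NumPy. The sketch below is an assumption-laden stand-in for the interactive plot, not its actual implementation: it uses a square ReLU stack, Gaussian weights, and a hypothetical function `variance_by_depth` whose `scale` argument plays the role of the slider, multiplying the He variance 2/n_{\text{in}}.

```python
import numpy as np

def variance_by_depth(scale, width=512, depth=20, batch=256, seed=0):
    """Pre-activation variance at each layer of a ReLU stack.

    `scale` multiplies the He variance 2/width (the slider value).
    """
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))           # unit-variance input batch
    variances = []
    for _ in range(depth):
        W = rng.normal(0.0, np.sqrt(scale * 2.0 / width), size=(width, width))
        y = x @ W                                 # pre-activation
        variances.append(y.var())
        x = np.maximum(y, 0.0)                    # ReLU feeds the next layer
    return np.array(variances)

curves = {scale: variance_by_depth(scale) for scale in (0.5, 1.0, 2.0)}
for scale, v in curves.items():
    # Report log10 of the variance at layers 1, 10 and 20, as on a log-scaled axis.
    print(f"scale {scale}: log10 variance =", np.round(np.log10(v[[0, 9, 19]]), 1))
```

At scale 1.0 the log-variance should hover around a constant, while 0.5 and 2.0 should lose or gain about 0.3 per layer (log10 of 2), which is a straight line on the log-scaled plot.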