Vanishing & Exploding Gradients

Backpropagation hands an early layer its gradient by running the chain rule backwards through every layer that comes after it. In a deep network that means a product of many factors — and a long product of numbers is a treacherous thing. If the typical factor is a little below 1 the product collapses to zero; a little above 1 and it blows up. Early layers then learn at a crawl (vanishing) or diverge (exploding).

The gradient of an early layer is a product

Write the network as a chain of layers, each one feeding the next:

z_L = f_L\big(f_{L-1}(\cdots f_1(x)\cdots)\big).

Let \theta_1 be a weight in the first layer and \mathcal{L} the loss (the plain L is reserved for the depth). We want \partial \mathcal{L}/\partial \theta_1.

Step 1 — apply the chain rule along the whole chain. The loss depends on \theta_1 only through every layer in between, so the chain rule threads the derivative through each one:

\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial z_L}\cdot\frac{\partial z_L}{\partial z_{L-1}}\cdot\frac{\partial z_{L-1}}{\partial z_{L-2}}\cdots\frac{\partial z_2}{\partial z_1}\cdot\frac{\partial z_1}{\partial \theta_1}.

Step 2 — name the per-layer factors. Each \partial z_{k}/\partial z_{k-1} is the Jacobian of layer k — its local slope. Collect them into a product:

\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial z_L}\left(\prod_{k=2}^{L} J_k\right)\frac{\partial z_1}{\partial \theta_1}, \qquad J_k = \frac{\partial z_k}{\partial z_{k-1}}.

Step 3 — read off the danger. The gradient that reaches layer 1 is a product of L-1 Jacobians. Whatever the typical factor's size is, it gets raised to roughly the L-th power. Long products are unstable.
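The product law can be checked numerically on a small chain. The sketch below is only an illustration under stated assumptions: the layers are scalar, each later layer is z_k = tanh(w_k z_{k-1}), the first layer is z_1 = \theta_1 x, and the loss is half the squared output. None of these choices come from the text above; they are just a convenient concrete instance. The gradient assembled from the three pieces of the product law should agree with a finite-difference estimate.

```python
import math

def forward(theta1, x, weights):
    """Activations z_1..z_L of a scalar chain (hypothetical toy network).

    z_1 = theta1 * x, then z_k = tanh(w_k * z_{k-1}) for each later layer.
    """
    zs = [theta1 * x]
    for w in weights:
        zs.append(math.tanh(w * zs[-1]))
    return zs

def loss(theta1, x, weights):
    # Assumed loss for illustration: half the squared final activation.
    return 0.5 * forward(theta1, x, weights)[-1] ** 2

def grad_by_product(theta1, x, weights):
    """dLoss/dtheta1 assembled as (dLoss/dz_L) * (prod of J_k) * (dz_1/dtheta1)."""
    zs = forward(theta1, x, weights)
    dloss_dzL = zs[-1]
    jac_product = 1.0
    for w, z_k in zip(weights, zs[1:]):
        # J_k = d tanh(w * z_{k-1}) / d z_{k-1} = w * (1 - z_k^2)
        jac_product *= w * (1.0 - z_k * z_k)
    dz1_dtheta1 = x
    return dloss_dzL * jac_product * dz1_dtheta1

def grad_by_finite_difference(theta1, x, weights, eps=1e-6):
    # Central difference, used only as an independent check.
    up = loss(theta1 + eps, x, weights)
    down = loss(theta1 - eps, x, weights)
    return (up - down) / (2.0 * eps)

theta1, x = 0.7, 1.3
weights = [0.9, 1.2, 0.8, 1.1]  # layers 2..5, so L = 5 and there are L-1 = 4 Jacobians

analytic = grad_by_product(theta1, x, weights)
numeric = grad_by_finite_difference(theta1, x, weights)
print(analytic, numeric)
```

The two printed numbers should match to several digits, which is all the product law claims: the gradient at layer 1 is the output-side derivative times L-1 local slopes times the first layer's own derivative.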

A scalar toy makes it unmistakable

Strip the network down to one number per layer, with the same weight w repeated and a linear activation. Then every Jacobian is just w.

Step 1 — the chain becomes a power. With J_k = w for all k, the product of L identical factors is

\frac{\partial \mathcal{L}}{\partial \theta_1} \;\propto\; \underbrace{w \cdot w \cdots w}_{L\ \text{times}} \;=\; w^{L}.

Step 2 — let the depth grow. Send L \to \infty and the fate of the gradient's magnitude is decided entirely by whether |w| sits below, at, or above 1:

\lim_{L\to\infty} |w|^{L} = \begin{cases} 0 & |w| < 1 \quad(\text{vanishing}),\\[2pt] 1 & |w| = 1 \quad(\text{the knife edge}),\\[2pt] \infty & |w| > 1 \quad(\text{exploding}).\end{cases}

Step 3 — appreciate how sharp this is. At w = 0.9 and L = 50 the gradient is 0.9^{50}\approx 0.005 — already nearly gone. At w = 1.1 it is 1.1^{50}\approx 117. A ten percent miscalibration of the typical factor is the difference between a dead network and a diverging one. That is the whole problem.
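The arithmetic is easy to reproduce. The short sketch below evaluates the scalar toy w^{L} at depth 50 for the two weights quoted above, plus the knife-edge value w = 1; the function name `toy_gradient` is just a label chosen for this example.

```python
def toy_gradient(w, depth):
    """Gradient scale of the scalar toy: the same factor w repeated `depth` times."""
    return w ** depth

for w in (0.9, 1.0, 1.1):
    print(f"w = {w}: w^50 = {toy_gradient(w, 50):.4g}")
```

The three results span more than four orders of magnitude, even though the weights differ by only 0.1 on either side of 1.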

For a depth-L network the gradient of an early parameter is a product of per-layer Jacobians:

The product law. \dfrac{\partial \mathcal{L}}{\partial \theta_1} = \dfrac{\partial \mathcal{L}}{\partial z_L}\Big(\prod_{k=2}^{L} J_k\Big)\dfrac{\partial z_1}{\partial \theta_1}, with J_k = \partial z_k/\partial z_{k-1}.
Two failure modes. If the typical factor has magnitude < 1 the product \to 0 (vanishing); if > 1 it \to \infty (exploding). The scalar toy makes it bare: the gradient \propto w^{L}.
The fixes. Careful weight initialization (keep factors near 1), normalization layers, non-saturating activations like \mathrm{ReLU}, gradient clipping for the exploding case, and residual connections, which add an identity term (a +1) to each layer's Jacobian and so give the gradient a direct path back to the early layers.

Watch it on a log axis

Plot the gradient magnitude |w|^{L} against depth L, with the vertical axis on a log scale (so a pure exponential is a straight line, here with slope \log|w|). As the weight scale crosses |w| = 1 the picture flips: below it the line slopes down to the floor (vanishing), above it the line rockets up (exploding), and exactly at 1 it is flat, the only stable setting.
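The straight-line behaviour can be checked without drawing anything. Since \log_{10}|w|^{L} = L \log_{10}|w|, the log of the gradient magnitude is linear in depth with slope \log_{10}|w|. The sketch below tabulates it at a few depths; the helper name `log10_gradient` and the sample depths are choices made for this example only.

```python
import math

def log10_gradient(w, depth):
    """log10 of |w|^depth, computed as depth * log10(|w|) to avoid under/overflow."""
    return depth * math.log10(abs(w))

depths = list(range(0, 101, 25))  # 0, 25, 50, 75, 100
for w in (0.9, 1.0, 1.1):
    row = [round(log10_gradient(w, d), 2) for d in depths]
    print(f"w = {w}: log10|w|^L at L = {depths} -> {row}")
```

Each row moves by a constant amount per step in depth: negative for w = 0.9, zero for w = 1, positive for w = 1.1. That constant step is exactly the slope of the line on the log axis.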

Exploding gradients have a brutally simple patch: if the gradient's norm exceeds a threshold \tau, rescale the whole vector down to that norm, keeping its direction:

g \;\leftarrow\; g \cdot \frac{\tau}{\max(\lVert g\rVert,\ \tau)}.

When \lVert g\rVert \le \tau nothing happens; when it blows past \tau the step is capped, so one freak batch can't hurl the weights across the landscape. It does nothing for vanishing gradients (you cannot rescale zero into something), which is why those need the structural fixes — most of all the residual connection of the next page.
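A minimal sketch of the clipping rule above, written in plain Python on a list of floats. The function name `clip_by_norm` is hypothetical, and the sketch assumes \tau > 0 and uses the Euclidean norm; it is not any particular library's implementation.

```python
import math

def clip_by_norm(grad, tau):
    """Rescale grad so its Euclidean norm is at most tau, keeping its direction.

    Implements g <- g * tau / max(||g||, tau). Assumes tau > 0.
    """
    norm = math.sqrt(sum(g * g for g in grad))
    scale = tau / max(norm, tau)  # equals 1 when norm <= tau, tau / norm otherwise
    return [g * scale for g in grad]

small = clip_by_norm([0.3, 0.4], tau=1.0)    # norm 0.5, below the threshold
large = clip_by_norm([30.0, 40.0], tau=1.0)  # norm 50, above the threshold
print(small, large)
```

The small gradient passes through untouched, while the large one comes back with norm 1 and the same 3:4 direction. Writing the scale as \tau / \max(\lVert g\rVert, \tau) avoids a branch and never divides by zero as long as \tau is positive.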

Where this is going

Every modern trick for training deep networks is, at bottom, a way to keep that product of Jacobians near 1. The cleanest of them, the residual connection, adds an explicit +1 (an identity term) to every layer's Jacobian, so the product always contains a direct path along which the gradient travels unshrunk, however deep the network goes. That is the subject of the next page.