Vanishing & Exploding Gradients
Backpropagation hands an early layer its gradient by running the
chain rule
backwards through every layer that comes after it. In a deep network that means a
product of many factors — and a long product of numbers is a treacherous
thing. If the typical factor is a little below 1 the product
collapses to zero; a little above 1 and it blows up. Early layers
then learn at a crawl (vanishing) or diverge (exploding).
The gradient of an early layer is a product
Write the network as a chain of layers, each one feeding the next:
z_L = f_L\big(f_{L-1}(\cdots f_1(x)\cdots)\big).
Let \theta_1 be a weight in the first layer and
L the loss. We want \partial \mathcal{L}/\partial \theta_1.
Step 1 — apply the chain rule along the whole chain. The loss depends on
\theta_1 only through every layer in between, so the chain rule
threads the derivative through each one:
\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial z_L}\cdot\frac{\partial z_L}{\partial z_{L-1}}\cdot\frac{\partial z_{L-1}}{\partial z_{L-2}}\cdots\frac{\partial z_2}{\partial z_1}\cdot\frac{\partial z_1}{\partial \theta_1}.
Step 2 — name the per-layer factors. Each
\partial z_{k}/\partial z_{k-1} is the Jacobian
of layer k — its local slope. Collect them into a product:
\frac{\partial \mathcal{L}}{\partial \theta_1} = \frac{\partial \mathcal{L}}{\partial z_L}\left(\prod_{k=2}^{L} J_k\right)\frac{\partial z_1}{\partial \theta_1}, \qquad J_k = \frac{\partial z_k}{\partial z_{k-1}}.
Step 3 — read off the danger. The gradient that reaches layer
1 is a product of L-1 Jacobians.
Whatever the typical factor's size is, it gets raised to roughly the
L-th power. Long products are unstable.
A scalar toy makes it unmistakable
Strip the network down to one number per layer, with the same weight
w repeated and a linear activation. Then every Jacobian is just
w.
Step 1 — the chain becomes a power. With
J_k = w for all k, the product of
L identical factors is
\frac{\partial \mathcal{L}}{\partial \theta_1} \;\propto\; \underbrace{w \cdot w \cdots w}_{L\ \text{times}} \;=\; w^{L}.
Step 2 — let the depth grow. Send L \to \infty
and the fate of the gradient is decided entirely by whether
w sits below, at, or above 1:
\lim_{L\to\infty} w^{L} = \begin{cases} 0 & |w| < 1 \quad(\text{vanishing}),\\[2pt] 1 & w = 1 \quad(\text{the knife edge}),\\[2pt] \infty & |w| > 1 \quad(\text{exploding}).\end{cases}
Step 3 — appreciate how sharp this is. At
w = 0.9 and L = 50 the gradient is
0.9^{50}\approx 0.005 — already nearly gone. At
w = 1.1 it is 1.1^{50}\approx 117. A ten
percent miscalibration of the typical factor is the difference between a dead network and a
diverging one. That is the whole problem.
For a depth-L network the gradient of an early parameter is a
product of per-layer Jacobians:
-
The product law.
\dfrac{\partial \mathcal{L}}{\partial \theta_1} = \dfrac{\partial \mathcal{L}}{\partial z_L}\Big(\prod_{k=2}^{L} J_k\Big)\dfrac{\partial z_1}{\partial \theta_1},
with J_k = \partial z_k/\partial z_{k-1}.
-
Two failure modes. If the typical factor has magnitude
< 1 the product \to 0
(vanishing); if > 1 it
\to \infty (exploding). The scalar toy makes it bare:
the gradient \propto w^{L}.
-
The fixes. Careful weight initialization (keep factors near
1), normalization layers, non-saturating activations like
\mathrm{ReLU}, gradient clipping for the exploding case, and
residual connections that add a gradient highway of 1.
Watch it on a log axis
Plot the gradient magnitude |w|^{L} against depth
L, with the vertical axis on a log scale (so a
pure exponential is a straight line). Drag the weight scale across
w = 1: below it the line slopes down to the floor (vanishing),
above it the line rockets up (exploding), and exactly at 1 it is
flat — the only stable setting.
Exploding gradients have a brutally simple patch: if the gradient's norm exceeds a threshold
\tau, rescale the whole vector down to that norm, keeping
its direction:
g \;\leftarrow\; g \cdot \frac{\tau}{\max(\lVert g\rVert,\ \tau)}.
When \lVert g\rVert \le \tau nothing happens; when it blows past
\tau the step is capped, so one freak batch can't hurl the weights
across the landscape. It does nothing for vanishing gradients (you cannot rescale
zero into something), which is why those need the structural fixes — most of all the
residual
connection of the next page.
Where this is going
Every modern trick for training deep networks is, at bottom, a way to keep that product of
Jacobians near 1. The cleanest of them — the
residual
connection — adds an explicit
+1 to every layer's Jacobian, so the product can never vanish no
matter how deep the network goes. That is the subject of the next page.