Propagating the gradient back through time
Recall the recurrence h_t = \tanh(W h_{t-1} + U x_t + b). A loss
L at the end depends on h_T, which depends on
h_{T-1}, which depends on h_{T-2}, … all the
way back. We want how the loss changes with an early hidden state,
\partial L / \partial h_k.
Step 1 — one step of the chain rule. The loss sees
h_{t-1} only through h_t, so:
\frac{\partial L}{\partial h_{t-1}} = \frac{\partial L}{\partial h_t}\,\frac{\partial h_t}{\partial h_{t-1}}.
Step 2 — name the recurrent Jacobian. Differentiating the recurrence, the local
step factor is
\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\big(\tanh'(z_t)\big)\, W \;=:\; J_t, \qquad z_t = W h_{t-1} + U x_t + b.
Step 3 — chain it from step T back to step
k. Apply Step 1 repeatedly; the single-step Jacobians
multiply:
\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_T}\,\prod_{t=k+1}^{T} J_t = \frac{\partial L}{\partial h_T}\, J_T\, J_{T-1}\cdots J_{k+1}.
Step 4 — strip it to a scalar to see the law. Take one number per step, a fixed
weight w, and pretend \tanh' \approx 1. Then
every J_t = w, and the product over T - k
steps collapses to a power:
\frac{\partial L}{\partial h_k} \;\propto\; \underbrace{w \cdot w \cdots w}_{T - k \ \text{times}} \;=\; w^{\,T-k}.
Step 5 — read off the fate of long-range gradients. The gradient reaching a step
T - k moments in the past is scaled by w^{T-k}
— roughly W^{T-k} in the matrix case (it picks up powers of
W's largest singular value). So as the gap grows:
\big\|W\big\| < 1 \;\Rightarrow\; \text{gradient} \to 0 \ (\text{vanishing}), \qquad \big\|W\big\| > 1 \;\Rightarrow\; \text{gradient} \to \infty \ (\text{exploding}).
This is the identical product law as in
deep
feed-forward nets — only here the long product runs along time, and
the repeated factor is the same recurrent matrix W at every
step (weight sharing), which makes the powers-of-W blow-up especially
sharp. Vanishing gradients are why a plain RNN forgets the far past: the signal from step 1 simply
cannot reach step 100 with any strength.
Training an RNN unrolls it and backpropagates through every step.
-
Unroll, then backprop. The unrolled net is a depth-T
chain with tied weights; the chain rule sends
\partial L / \partial h_t backwards step by step.
-
A product of Jacobians.
\dfrac{\partial L}{\partial h_k} = \dfrac{\partial L}{\partial h_T}\prod_{t=k+1}^{T} J_t
with J_t = \operatorname{diag}(\tanh'(z_t))\,W — roughly
W^{T-k}.
-
Vanishing & exploding along time.
\|W\| < 1 sends long-range gradients to 0
(the RNN forgets the far past); \|W\| > 1 sends them to
\infty. Same product law as deep nets, now over time.
Unrolling a 10,000-step sequence means storing 10,000 hidden states to backpropagate through —
ruinous in memory, and (thanks to the product above) the distant gradients are near-useless
anyway. Truncated BPTT is the practical fix: process the sequence in chunks of,
say, k = 35 steps, and only backpropagate the gradient
within each chunk. The hidden state still flows forward across chunk
boundaries (so the model keeps its running memory), but the backward pass is cut off after
k steps.
It is a deliberate bias–memory trade: you lose credit assignment beyond
k steps, but training becomes constant-memory and fast. The deeper
cure for the vanishing half — letting gradients travel far without decaying — is a gated
architecture such as the LSTM, whose constant-error path is the recurrent cousin of the
residual connection.