Backpropagation Through Time

An RNN unrolled across T steps is just a deep feed-forward chain with tied weights. So we train it the only way we know how: unroll it, then run the chain rule backwards through every step. This is backpropagation through time (BPTT) — ordinary backprop, with the "layers" being moments in time.

Propagating the gradient back through time

Recall the recurrence h_t = \tanh(W h_{t-1} + U x_t + b). A loss L at the end depends on h_T, which depends on h_{T-1}, which depends on h_{T-2}, … all the way back. We want how the loss changes with an early hidden state, \partial L / \partial h_k.

Step 1 — one step of the chain rule. The loss sees h_{t-1} only through h_t, so:

\frac{\partial L}{\partial h_{t-1}} = \frac{\partial L}{\partial h_t}\,\frac{\partial h_t}{\partial h_{t-1}}.

Step 2 — name the recurrent Jacobian. Differentiating the recurrence, the local step factor is

\frac{\partial h_t}{\partial h_{t-1}} = \operatorname{diag}\!\big(\tanh'(z_t)\big)\, W \;=:\; J_t, \qquad z_t = W h_{t-1} + U x_t + b.

Step 3 — chain it from step T back to step k. Apply Step 1 repeatedly; the single-step Jacobians multiply:

\frac{\partial L}{\partial h_k} = \frac{\partial L}{\partial h_T}\,\prod_{t=k+1}^{T} J_t = \frac{\partial L}{\partial h_T}\, J_T\, J_{T-1}\cdots J_{k+1}.

Step 4 — strip it to a scalar to see the law. Take one number per step, a fixed weight w, and pretend \tanh' \approx 1. Then every J_t = w, and the product over T - k steps collapses to a power:

\frac{\partial L}{\partial h_k} \;\propto\; \underbrace{w \cdot w \cdots w}_{T - k \ \text{times}} \;=\; w^{\,T-k}.

Step 5 — read off the fate of long-range gradients. The gradient reaching a step T - k moments in the past is scaled by w^{T-k} — roughly W^{T-k} in the matrix case (it picks up powers of W's largest singular value). So as the gap grows:

\big\|W\big\| < 1 \;\Rightarrow\; \text{gradient} \to 0 \ (\text{vanishing}), \qquad \big\|W\big\| > 1 \;\Rightarrow\; \text{gradient} \to \infty \ (\text{exploding}).

This is the identical product law as in deep feed-forward nets — only here the long product runs along time, and the repeated factor is the same recurrent matrix W at every step (weight sharing), which makes the powers-of-W blow-up especially sharp. Vanishing gradients are why a plain RNN forgets the far past: the signal from step 1 simply cannot reach step 100 with any strength.

Training an RNN unrolls it and backpropagates through every step.

Unrolling a 10,000-step sequence means storing 10,000 hidden states to backpropagate through — ruinous in memory, and (thanks to the product above) the distant gradients are near-useless anyway. Truncated BPTT is the practical fix: process the sequence in chunks of, say, k = 35 steps, and only backpropagate the gradient within each chunk. The hidden state still flows forward across chunk boundaries (so the model keeps its running memory), but the backward pass is cut off after k steps.

It is a deliberate bias–memory trade: you lose credit assignment beyond k steps, but training becomes constant-memory and fast. The deeper cure for the vanishing half — letting gradients travel far without decaying — is a gated architecture such as the LSTM, whose constant-error path is the recurrent cousin of the residual connection.

Watch the gradient decay (or blow up) with distance

Plot the gradient magnitude \|W\|^{k} against how many steps back k it has travelled, on a log vertical axis — so a pure exponential is a straight line. Slide \|W\| across 1: below it the line slopes down to the floor (the distant past vanishes), above it the line rockets up (exploding), and exactly at 1 it is flat — the knife edge.