The Long-Range Problem

A recurrent network reads a sequence one token at a time, carrying everything it has seen so far in a single hidden vector h_t. That sounds enough — until the sequence gets long. Consider the sentence

\text{“The } \underline{\text{cat}} \text{ that the dog chased all afternoon } \underline{\text{was}} \text{ tired.”}

To choose was over were, the network must remember that the subject cat was singular — a fact deposited many tokens earlier. A plain RNN struggles to connect them, and there are two distinct reasons, which we now make precise: the signal decays going forward, and the gradient decays going backward. Both decay exponentially in the distance, and both trace back to the same repeated matrix.

One recurrence, applied again and again

Strip the RNN to its skeleton. At each step it mixes the previous state with the new input through a fixed weight matrix W and a squashing nonlinearity \sigma (a \tanh, say):

h_t = \sigma\!\left(W h_{t-1} + U x_t\right).

The same W is reused at every step. That single fact is the whole story: whatever W does to the state, it does over and over, and repeated multiplication is the engine of exponential growth or collapse.

Why influence decays: the W^{k} product

We measure how strongly a state k steps in the past, h_{t-k}, still influences the present state h_t — the Jacobian \partial h_t / \partial h_{t-k}. (This same quantity is the factor backprop through time multiplies when it carries a gradient back across those k steps, so one calculation settles both the forward signal and the backward gradient.)

Step 1 — one step of the chain rule. Differentiate h_t = \sigma(W h_{t-1} + U x_t) with respect to h_{t-1}. The nonlinearity contributes its diagonal slope matrix D_t = \operatorname{diag}\big(\sigma'(\cdot)\big), and the linear part contributes W:

\frac{\partial h_t}{\partial h_{t-1}} = D_t\, W.

Step 2 — chain across k steps. To reach h_{t-k} we multiply one such factor per step:

\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} D_j\, W.

Step 3 — take the size. Bound the norm of a product by the product of the norms. The \tanh slope satisfies \sigma' \le 1, so each \|D_j\| \le 1 and it can only help the decay; the surviving driver is one factor of \|W\| per step:

\left\lVert \frac{\partial h_t}{\partial h_{t-k}} \right\rVert \;\le\; \prod_{j} \lVert D_j\rVert\,\lVert W\rVert \;\le\; \lVert W\rVert^{\,k}.

Step 4 — read the exponential. Write r = \lVert W\rVert (more precisely its largest singular value, or spectral radius). The influence of a token k steps back scales like

\left\lVert \frac{\partial h_t}{\partial h_{t-k}} \right\rVert \sim r^{\,k}.

Step 5 — the two bad cases. If r < 1, then r^k \to 0 exponentially fast: distant tokens have essentially zero influence — the vanishing regime, and overwhelmingly the common one. If r > 1, then r^k \to \infty: the signal and gradient explode. Only the razor's edge r = 1 preserves magnitude, and a single matrix cannot sit there for every direction at once. This is exactly the vanishing-and-exploding gradient problem, but stretched along time instead of depth: every recurrent step is, in effect, another layer sharing one weight matrix.

So at r = 0.8 the influence at distance k = 30 is 0.8^{30} \approx 0.001 — a thousandth of full strength. The subject's number is long gone by the time the verb needs it.

The other wall: a fixed-width bottleneck

Even if the gradient behaved, there is a second, structural limit. The hidden state h_t \in \mathbb{R}^d has a fixed dimension d, chosen once, before the network ever sees a sentence. Yet it must summarise the entire past x_1, \dots, x_t, however long.

Step 1 — count what must fit. As t grows the amount of relevant history grows without bound, but the container holding it stays d numbers wide.

Step 2 — conclude a forced loss. A finite-dimensional vector cannot losslessly store arbitrarily much information, so as the sequence lengthens, detail must be overwritten. The state is a leaky fixed-size buffer: new tokens crowd out old ones. Crucially this limit is independent of the decay above — it would bite even with a perfectly behaved W.

A plain RNN cannot reliably link distant tokens, for two independent reasons:

Watch the influence drain away

The curves plot the influence retained at distance k — that is r^{\,k} — for a few fixed decay rates, with a slider for a fourth. Two faint reference curves sit at r = 0.7 and r = 0.9; the bold curve is yours to move. Push r below 1 and the curve plunges toward the axis within a handful of tokens — that collapse is the long-range problem. Nudge it to r = 1 and the line goes flat; cross above and it runs off the top (the exploding regime).

Language is built from dependencies that leap across many words. Subject–verb agreement can span a whole relative clause — “the keys to the cabinet are on the table”, where a nearby singular noun (cabinet) is a deliberate trap. Coreference is worse: a pronoun like it or they may refer to an entity introduced sentences ago, and resolving it correctly means the network still holds that entity. Discourse — “on the one hand … on the other hand” — stretches the span further still.

These aren't rare flourishes; they are the ordinary fabric of text. A model whose memory of a token fades like r^{\,k} will systematically get them wrong, no matter how much data it sees. So the long-range problem isn't an academic edge case — it is the central obstacle that gated cells, encoder–decoder models, and finally attention were each invented to overcome.