The Long-Range Problem
A recurrent
network reads a sequence one token at a time, carrying everything it has seen so far
in a single hidden vector h_t. That sounds enough — until the sequence
gets long. Consider the sentence
\text{“The } \underline{\text{cat}} \text{ that the dog chased all afternoon } \underline{\text{was}} \text{ tired.”}
To choose was over were, the network must remember that the subject
cat was singular — a fact deposited many tokens earlier. A plain RNN struggles to
connect them, and there are two distinct reasons, which we now make precise:
the signal decays going forward, and the gradient decays going
backward. Both decay exponentially in the distance, and both trace back to the same
repeated matrix.
One recurrence, applied again and again
Strip the RNN to its skeleton. At each step it mixes the previous state with the new input
through a fixed weight matrix W and a squashing nonlinearity
\sigma (a \tanh, say):
h_t = \sigma\!\left(W h_{t-1} + U x_t\right).
The same W is reused at every step. That single fact is the
whole story: whatever W does to the state, it does over and over, and
repeated multiplication is the engine of exponential growth or collapse.
Why influence decays: the W^{k} product
We measure how strongly a state k steps in the past,
h_{t-k}, still influences the present state
h_t — the Jacobian
\partial h_t / \partial h_{t-k}. (This same quantity is the factor
backprop
through time multiplies when it carries a gradient back across those
k steps, so one calculation settles both the forward signal and the
backward gradient.)
Step 1 — one step of the chain rule. Differentiate
h_t = \sigma(W h_{t-1} + U x_t) with respect to
h_{t-1}. The nonlinearity contributes its diagonal slope matrix
D_t = \operatorname{diag}\big(\sigma'(\cdot)\big), and the linear part
contributes W:
\frac{\partial h_t}{\partial h_{t-1}} = D_t\, W.
Step 2 — chain across k steps. To reach
h_{t-k} we multiply one such factor per step:
\frac{\partial h_t}{\partial h_{t-k}} = \prod_{j=t-k+1}^{t} \frac{\partial h_j}{\partial h_{j-1}} = \prod_{j=t-k+1}^{t} D_j\, W.
Step 3 — take the size. Bound the norm of a product by the product of the
norms. The \tanh slope satisfies
\sigma' \le 1, so each \|D_j\| \le 1 and it
can only help the decay; the surviving driver is one factor of
\|W\| per step:
\left\lVert \frac{\partial h_t}{\partial h_{t-k}} \right\rVert \;\le\; \prod_{j} \lVert D_j\rVert\,\lVert W\rVert \;\le\; \lVert W\rVert^{\,k}.
Step 4 — read the exponential. Write r = \lVert W\rVert
(more precisely its largest singular value, or spectral radius). The influence of a token
k steps back scales like
\left\lVert \frac{\partial h_t}{\partial h_{t-k}} \right\rVert \sim r^{\,k}.
Step 5 — the two bad cases. If r < 1, then
r^k \to 0 exponentially fast: distant tokens have essentially
zero influence — the vanishing regime, and overwhelmingly the common
one. If r > 1, then r^k \to \infty: the
signal and gradient explode. Only the razor's edge r = 1
preserves magnitude, and a single matrix cannot sit there for every direction at once. This is
exactly the
vanishing-and-exploding
gradient problem, but stretched along time instead of depth:
every recurrent step is, in effect, another layer sharing one weight matrix.
So at r = 0.8 the influence at distance k = 30
is 0.8^{30} \approx 0.001 — a thousandth of full strength. The
subject's number is long gone by the time the verb needs it.
The other wall: a fixed-width bottleneck
Even if the gradient behaved, there is a second, structural limit. The hidden state
h_t \in \mathbb{R}^d has a fixed dimension
d, chosen once, before the network ever sees a sentence. Yet it must
summarise the entire past x_1, \dots, x_t, however long.
Step 1 — count what must fit. As t grows the amount
of relevant history grows without bound, but the container holding it stays
d numbers wide.
Step 2 — conclude a forced loss. A finite-dimensional vector cannot losslessly
store arbitrarily much information, so as the sequence lengthens, detail must be
overwritten. The state is a leaky fixed-size buffer: new tokens crowd
out old ones. Crucially this limit is independent of the decay above — it would bite even with a
perfectly behaved W.
A plain RNN cannot reliably link distant tokens, for two independent reasons:
-
Exponential decay of influence. Across k steps
the Jacobian is a product of k copies of (a slope times)
W, so
\lVert \partial h_t/\partial h_{t-k}\rVert \lesssim \lVert W\rVert^{\,k} \sim r^{\,k}.
With r < 1 both the forward signal and the backward gradient
vanish; with r > 1 they explode.
-
Fixed-state bottleneck. The whole history is squeezed into one
d-dimensional vector, so for long sequences information is
necessarily overwritten — a limit independent of the decay.
-
Consequence. Plain RNNs handle only short-range dependencies; long-distance
agreement, coreference, and discourse structure are out of reach. The rest of the field is a
sequence of answers to this one problem.
Watch the influence drain away
The curves plot the influence retained at distance k — that is
r^{\,k} — for a few fixed decay rates, with a slider for a fourth.
Two faint reference curves sit at r = 0.7 and
r = 0.9; the bold curve is yours to move. Push r
below 1 and the curve plunges toward the axis within a handful of
tokens — that collapse is the long-range problem. Nudge it to
r = 1 and the line goes flat; cross above and it runs off the top
(the exploding regime).
Language is built from dependencies that leap across many words. Subject–verb
agreement can span a whole relative clause — “the keys to the cabinet
are on the table”, where a nearby singular noun (cabinet) is a deliberate trap.
Coreference is worse: a pronoun like it or they may refer to
an entity introduced sentences ago, and resolving it correctly means the network still
holds that entity. Discourse — “on the one hand … on the other hand” — stretches the
span further still.
These aren't rare flourishes; they are the ordinary fabric of text. A model whose memory of a
token fades like r^{\,k} will systematically get them wrong, no matter
how much data it sees. So the long-range problem isn't an academic edge case — it is the central
obstacle that gated
cells, encoder–decoder
models, and finally
attention
were each invented to overcome.