Recurrent Neural Networks
We have a sequence of embedded vectors x_1, x_2, \dots, x_T, of arbitrary length T, and we want a
network that reads them in order. The idea behind a recurrent neural network (RNN) is to
walk the sequence one step at a time, carrying a running summary called the hidden
state h_t: a fixed-size vector that compresses what the network has seen
so far. At each step it reads the next input, updates the summary, and (optionally) emits an
output.
One small memory, updated by the same rule every step — that single trick handles any length and
respects order automatically.
The recurrence, line by line
Step 1 — combine the past summary with the new input. At step
t we have last step's hidden state
h_{t-1} and the new input x_t. Mix them with
two weight matrices and a bias; W and U each multiply a vector, and b is added on:
z_t = W\, h_{t-1} + U\, x_t + b.
Step 2 — squash to get the new hidden state. Pass
z_t through a nonlinear activation (classically \tanh, which keeps every component of the
state in (-1, 1)):
h_t = \tanh\!\big(W\, h_{t-1} + U\, x_t + b\big).
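Steps 1 and 2 can be sketched in NumPy as follows. This is a minimal illustration rather than a reference implementation: the function name `rnn_step`, the sizes (hidden size 3, input size 2), and the random weights are assumptions chosen for the example.

```python
import numpy as np

def rnn_step(h_prev, x_t, W, U, b):
    """One recurrence step: z_t = W h_{t-1} + U x_t + b, then h_t = tanh(z_t)."""
    z_t = W @ h_prev + U @ x_t + b  # Step 1: mix the past summary with the new input
    return np.tanh(z_t)             # Step 2: squash each component into (-1, 1)

# Illustrative sizes (assumptions): hidden size 3, input size 2.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 3))  # hidden-to-hidden weights
U = rng.normal(size=(3, 2))  # input-to-hidden weights
b = np.zeros(3)              # bias

h_prev = np.zeros(3)
x_t = np.array([1.0, -0.5])
h_t = rnn_step(h_prev, x_t, W, U, b)
print(h_t.shape)  # (3,)
```

Note the shapes: W is square because it maps a hidden state to a hidden state, while U maps an input vector into the hidden space.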
Step 3 — read out an output. If this step needs a prediction, map the hidden
state through one more matrix V (usually followed by a softmax when the outputs are word
probabilities):
y_t = V\, h_t.
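The readout can be sketched as below. The helper names `softmax` and `readout`, the 4-word vocabulary, and the numbers in V and h_t are all made up for illustration; the softmax is the usual numerically stable form that subtracts the maximum before exponentiating.

```python
import numpy as np

def softmax(logits):
    """Turn a vector of scores into probabilities that sum to 1."""
    shifted = logits - np.max(logits)  # subtracting the max avoids overflow in exp
    exp = np.exp(shifted)
    return exp / exp.sum()

def readout(h_t, V):
    """Step 3: y_t = V h_t, plus an optional softmax over the scores."""
    y_t = V @ h_t
    return y_t, softmax(y_t)

# Assumed sizes: hidden size 3, a toy vocabulary of 4 words.
V = np.array([[ 1.0, 0.0, -1.0],
              [ 0.5, 0.5,  0.5],
              [ 0.0, 2.0,  0.0],
              [-1.0, 1.0,  0.0]])
h_t = np.array([0.2, -0.4, 0.9])

y_t, probs = readout(h_t, V)
print(y_t.shape, probs.sum())
```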
Step 4 — start it off. There is no h_0 from the data,
so we seed the memory with zeros (a common choice):
h_0 = \mathbf{0}.
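Putting the seed and the update together gives a plain loop over the sequence. The sketch below is one possible way to write it; `rnn_forward` is a hypothetical name, the sizes are arbitrary, and the readout matrix V is left out to keep the focus on the hidden state.

```python
import numpy as np

def rnn_forward(xs, W, U, b, h0=None):
    """Run the recurrence over a sequence and return every hidden state.

    xs has shape (T, input_size); the result has shape (T, hidden_size).
    """
    h = np.zeros(W.shape[0]) if h0 is None else h0  # Step 4: h_0 = 0 by default
    hs = []
    for x_t in xs:                          # one pass through the loop per time step
        h = np.tanh(W @ h + U @ x_t + b)    # same W, U, b on every iteration
        hs.append(h)
    return np.stack(hs)

# Assumed sizes for illustration: hidden size 4, input size 2.
rng = np.random.default_rng(0)
W = rng.normal(scale=0.5, size=(4, 4))
U = rng.normal(scale=0.5, size=(4, 2))
b = np.zeros(4)

short = rnn_forward(rng.normal(size=(3, 2)), W, U, b)   # T = 3
long = rnn_forward(rng.normal(size=(50, 2)), W, U, b)   # T = 50, same weights
print(short.shape, long.shape)  # (3, 4) (50, 4)
```

The same function handles both lengths without any change to the weights, which is the "any length" property described earlier.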
Step 5 — notice the weights never change. The same three matrices
W, U, V and bias b are reused at every time step; this is weight sharing. ("Never change" here
means across time steps: training still updates the weights.) The network has the same number of
parameters whether the sequence is 4 steps or 4000, and what it learns about "a verb usually
follows a subject" applies at every position, rather than being relearned per slot as it would be
in a flattened MLP.
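A quick count makes the contrast concrete. The two helper functions and the sizes (hidden 8, input 5, output 3) are assumptions for illustration; the "flattened MLP" here is taken to mean a first dense layer that reads all T inputs concatenated into one vector.

```python
def rnn_param_count(hidden, inp, out):
    """Parameters in W (hidden x hidden), U (hidden x inp), b (hidden), V (out x hidden)."""
    return hidden * hidden + hidden * inp + hidden + out * hidden

def flattened_mlp_first_layer_count(hidden, inp, T):
    """First dense layer over T concatenated inputs: hidden x (T * inp) weights plus a bias."""
    return hidden * (T * inp) + hidden

hidden, inp, out = 8, 5, 3

# The RNN count has no T in it at all.
print(rnn_param_count(hidden, inp, out))                   # 136

# The flattened layer grows linearly with the sequence length.
print(flattened_mlp_first_layer_count(hidden, inp, 4))     # 168
print(flattened_mlp_first_layer_count(hidden, inp, 4000))  # 160008
```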
Unroll the loop across time
The recurrence looks like a tiny loop, but if you write out each step it becomes a deep
feed-forward chain — one layer per time step, all sharing weights:
h_1 = \tanh(W h_0 + U x_1 + b), \quad h_2 = \tanh(W h_1 + U x_2 + b), \quad \dots, \quad h_T = \tanh(W h_{T-1} + U x_T + b).
This is called unrolling. Substituting each h into the
next shows that h_T depends, through a long composition, on every input
back to x_1:
h_T = \tanh\!\Big(W\,\tanh\big(W\,\tanh(\cdots) + U x_{T-1} + b\big) + U x_T + b\Big).
So an RNN of "depth 1" is effectively a network of depth T, as deep as the
sequence is long. That depth is its power (it can carry information across many steps) and, as the
next page on backpropagation through time shows, its peril.
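The equivalence between the loop and the unrolled composition can be checked numerically for a short sequence. The sketch below uses T = 3 with arbitrary random weights (all sizes and names are assumptions) and also confirms that changing x_1 alone changes h_T.

```python
import numpy as np

rng = np.random.default_rng(1)
H, D, T = 3, 2, 3                      # assumed hidden size, input size, length
W = rng.normal(scale=0.5, size=(H, H))
U = rng.normal(scale=0.5, size=(H, D))
b = rng.normal(scale=0.1, size=H)
xs = rng.normal(size=(T, D))
h0 = np.zeros(H)

def final_state(xs):
    """Loop form: apply the same update T times and return h_T."""
    h = h0
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t + b)
    return h

# Unrolled form: the same computation written as one nested expression.
h_unrolled = np.tanh(
    W @ np.tanh(
        W @ np.tanh(W @ h0 + U @ xs[0] + b) + U @ xs[1] + b
    ) + U @ xs[2] + b
)

h_loop = final_state(xs)
print(np.allclose(h_loop, h_unrolled))  # True

# Perturb only the first input; the final state still changes.
xs_perturbed = xs.copy()
xs_perturbed[0] += 1.0
h_perturbed = final_state(xs_perturbed)
```

Each extra time step adds one more nested tanh, which is why the effective depth equals the sequence length.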
A recurrent network processes a sequence x_1, \dots, x_T by carrying a
hidden state.
- Hidden-state update. h_t = \tanh(W\, h_{t-1} + U\, x_t + b), with output
  y_t = V\, h_t and a seed h_0 = \mathbf{0}. The state h_t summarises every input up to step t.
- Weight sharing. The same W, U, V, b are reused at every step, so the parameter count is
  independent of the sequence length T.
- Unrolling. Writing out the steps turns the loop into a depth-T feed-forward chain with tied
  weights, where h_T is a composition reaching back to x_1.
Because inputs and outputs are read off step by step, the same recurrence covers a whole zoo of
tasks just by choosing which steps you feed and read:
- many-to-one: read a whole sequence, emit one answer at the end (sentiment of a review, taking
  only y_T).
- one-to-many: one input, generate a sequence (image → caption: feed once, then keep emitting).
- many-to-many: a label per step (part-of-speech tagging: read x_t, emit y_t), or read all then
  generate all (translation, an encoder–decoder).
One equation, four pictures (the two many-to-many variants count separately): this flexibility
made RNNs the workhorse of sequence modelling before attention arrived.
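The task patterns above differ only in which inputs are fed and which outputs are kept, which a small sketch can show. Everything here is illustrative: `run_rnn` is a hypothetical helper, the sizes are arbitrary, and for one-to-many the sketch assumes zero vectors are fed after the first step (feeding the previous output back in is another common choice, not shown).

```python
import numpy as np

def run_rnn(xs, W, U, V, b):
    """Run the recurrence and return the output y_t = V h_t at every step."""
    h = np.zeros(W.shape[0])
    ys = []
    for x_t in xs:
        h = np.tanh(W @ h + U @ x_t + b)
        ys.append(V @ h)
    return np.stack(ys)

rng = np.random.default_rng(2)
H, D, K, T = 4, 3, 2, 5                # assumed hidden, input, output sizes and length
W = rng.normal(scale=0.5, size=(H, H))
U = rng.normal(scale=0.5, size=(H, D))
V = rng.normal(scale=0.5, size=(K, H))
b = np.zeros(H)
xs = rng.normal(size=(T, D))

ys = run_rnn(xs, W, U, V, b)

many_to_one = ys[-1]   # keep only y_T (e.g. one sentiment score vector per review)
many_to_many = ys      # keep every y_t (e.g. one tag per token)

# one-to-many: a real input at the first step only, zeros afterwards (an assumption).
xs_once = np.zeros((T, D))
xs_once[0] = xs[0]
one_to_many = run_rnn(xs_once, W, U, V, b)

print(many_to_one.shape, many_to_many.shape, one_to_many.shape)  # (2,) (5, 2) (5, 2)
```

The encoder–decoder variant chains two such runs, with the final hidden state of the first seeding the second; it is omitted here to keep the sketch short.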
Step through the unrolled chain
Each column is one time step: input x_t at the bottom, hidden state
h_t in the middle, output y_t on top. The
horizontal arrows are the recurrent connection passing h_{t-1} forward,
and they all carry the same matrix W. Stepping through the chain one position at a time
shows the memory flowing left to right.