Recurrent Neural Networks

We have a sequence of embedded vectors x_1, x_2, \dots, x_T, of whatever length, and we want a network that reads them in order. The idea behind a recurrent neural network (RNN) is to walk the sequence one step at a time, carrying a running summary called the hidden state h_t: a fixed-size vector that compresses what has been seen so far. At each step the network reads the next input, updates the summary, and (optionally) emits an output.

One small memory, updated by the same rule every step — that single trick handles any length and respects order automatically.

The recurrence, line by line

Step 1 — combine the past summary with the new input. At step t we have last step's hidden state h_{t-1} and the new input x_t. Multiply each by its own weight matrix (a matrix times a vector in both cases), then add a bias vector:

z_t = W\, h_{t-1} + U\, x_t + b.
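To make the matrix-times-vector bookkeeping concrete, here is one consistent set of shapes. The dimension names d_h (hidden size) and d_x (input size) are assumptions introduced for illustration; the text above does not name them.

```latex
h_{t-1} \in \mathbb{R}^{d_h}, \quad x_t \in \mathbb{R}^{d_x}, \quad
W \in \mathbb{R}^{d_h \times d_h}, \quad U \in \mathbb{R}^{d_h \times d_x}, \quad
b \in \mathbb{R}^{d_h}
\;\;\Longrightarrow\;\; z_t = W\, h_{t-1} + U\, x_t + b \in \mathbb{R}^{d_h}.
```

W has to be square because it maps a hidden state to a hidden state, while U only has to convert an input into hidden-state coordinates.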

Step 2 — squash to get the new hidden state. Pass z_t through a nonlinear activation (classically \tanh, which keeps the state in (-1, 1)):

h_t = \tanh\!\big(W\, h_{t-1} + U\, x_t + b\big).
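Steps 1 and 2 together can be sketched as a single function. This is a minimal illustration in plain Python, not a reference implementation: the helper names `matvec` and `rnn_step`, the toy sizes (hidden size 2, input size 3), and all weight values are made up for the example.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v (a list of floats)."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in M]

def rnn_step(h_prev, x, W, U, b):
    """One recurrence step: h_t = tanh(W h_{t-1} + U x_t + b)."""
    Wh = matvec(W, h_prev)
    Ux = matvec(U, x)
    z = [wh + ux + bi for wh, ux, bi in zip(Wh, Ux, b)]  # pre-activation z_t
    return [math.tanh(zi) for zi in z]                   # squash into (-1, 1)

# Toy, hand-picked values (assumptions for illustration only).
W = [[0.5, -0.1], [0.2, 0.3]]             # hidden-to-hidden, 2 x 2
U = [[0.1, 0.0, -0.2], [0.4, 0.1, 0.0]]   # input-to-hidden, 2 x 3
b = [0.0, 0.1]                            # bias, length 2
h0 = [0.0, 0.0]                           # zero initial state
x1 = [1.0, 2.0, 3.0]                      # first input, length 3

h1 = rnn_step(h0, x1, W, U, b)
print(h1)
```

Because h0 is all zeros, the W term vanishes at the first step and h1 is just tanh(U x_1 + b); W only starts to matter from the second step on.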

Step 3 — read out an output. If this step needs a prediction, map the hidden state through one more matrix to get a vector of raw scores (usually followed by a softmax when the scores should become word probabilities):

y_t = V\, h_t.
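The readout and the optional softmax might look like the sketch below. The hidden state, the matrix V, and the three-word vocabulary size are hypothetical values chosen for the example, and the `softmax` helper is a standard numerically stable formulation rather than anything specified in the text.

```python
import math

def matvec(M, v):
    """Multiply matrix M (a list of rows) by vector v (a list of floats)."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def softmax(scores):
    """Turn raw scores into probabilities; subtracting the max avoids overflow."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical hidden state (size 2) and readout matrix for a 3-word vocabulary.
h_t = [0.2, -0.4]
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

y_t = matvec(V, h_t)   # raw scores, one per vocabulary entry
p_t = softmax(y_t)     # non-negative, sums to 1
print(y_t, p_t)
```

Keeping the softmax separate from y_t = V h_t mirrors the text: the linear readout is part of the recurrence's definition, and the normalisation is added only when probabilities are needed.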

Step 4 — start it off. There is no h_0 from the data, so we seed the memory with zeros (a common choice):

h_0 = \mathbf{0}.

Step 5 — notice the weights never change from step to step. The same three matrices W, U, V and bias b are reused at every time step. This reuse is called weight sharing. The network has the same number of parameters whether the sequence is 4 steps or 4000, and what it learns about "a verb usually follows a subject" applies at every position instead of being relearned per slot, as it would be in an MLP fed the flattened sequence.
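A quick count makes the point about sequence length concrete. The sketch below assumes hidden size d_h, input size d_x, and output size d_y (names not used in the text), and it models the "flattened MLP" as a single dense layer over all T inputs concatenated, which is one plausible reading of that comparison rather than a definition from the text.

```python
def rnn_param_count(d_h, d_x, d_y):
    """Parameters in W (d_h x d_h), U (d_h x d_x), b (d_h), and V (d_y x d_h)."""
    return d_h * d_h + d_h * d_x + d_h + d_y * d_h

def flattened_mlp_first_layer_count(T, d_h, d_x):
    """Assumed baseline: one dense layer over all T inputs concatenated,
    i.e. a (d_h x T*d_x) weight matrix plus a bias of length d_h."""
    return d_h * T * d_x + d_h

d_h, d_x, d_y = 3, 2, 4  # toy sizes (assumed)

rnn_count = rnn_param_count(d_h, d_x, d_y)                   # T does not appear
mlp_short = flattened_mlp_first_layer_count(4, d_h, d_x)     # T = 4
mlp_long = flattened_mlp_first_layer_count(4000, d_h, d_x)   # T = 4000

print(rnn_count, mlp_short, mlp_long)  # prints: 30 27 24003
```

The RNN count has no T in it at all, while the flattened layer grows linearly with T and also fixes the sequence length in advance.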

Unroll the loop across time

The recurrence looks like a tiny loop, but if you write out each step it becomes a deep feed-forward chain — one layer per time step, all sharing weights:

h_1 = \tanh(W h_0 + U x_1 + b), \quad h_2 = \tanh(W h_1 + U x_2 + b), \quad \dots, \quad h_T = \tanh(W h_{T-1} + U x_T + b).

This is called unrolling. Substituting each h into the next shows that h_T depends, through a long composition, on every input back to x_1:

h_T = \tanh\!\Big(W\,\tanh\big(W\,\tanh(\cdots) + U x_{T-1} + b\big) + U x_T + b\Big).

So an RNN of "depth 1" is secretly a network of depth T — as deep as the sequence is long. That depth is its power (it can carry information across many steps) and, as the next page on backpropagation through time shows, its peril.
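The equivalence between the loop and the unrolled chain can be checked numerically. To keep the nesting readable, this sketch uses a scalar hidden state and scalar weights w, u, b in place of the matrices W, U and the vector b; that simplification, the function names, and the toy values are all assumptions for illustration.

```python
import math

def step(h, x, w, u, b):
    """Scalar version of the recurrence: h_t = tanh(w*h_{t-1} + u*x_t + b)."""
    return math.tanh(w * h + u * x + b)

def run_rnn(xs, w, u, b, h0=0.0):
    """Loop form: apply the same step to each input, collecting h_1..h_T."""
    hs, h = [], h0
    for x in xs:
        h = step(h, x, w, u, b)
        hs.append(h)
    return hs

w, u, b = 0.5, 1.0, 0.1   # toy scalar weights (assumed)
xs = [1.0, -2.0, 0.5]     # a sequence of length T = 3

hs = run_rnn(xs, w, u, b)

# Unrolled form: the same computation written as one nested expression.
h3_unrolled = math.tanh(
    w * math.tanh(
        w * math.tanh(w * 0.0 + u * xs[0] + b) + u * xs[1] + b
    ) + u * xs[2] + b
)

# Changing only the first input still changes the final state.
hs_changed = run_rnn([0.0, -2.0, 0.5], w, u, b)
print(hs[-1], h3_unrolled, hs_changed[-1])
```

The loop and the nested expression give the same h_3, and perturbing x_1 alone shifts h_3, which is the dependence "back to x_1" described above.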

A recurrent network processes a sequence x_1, \dots, x_T by carrying a hidden state.

Hidden-state update. h_t = \tanh(W\, h_{t-1} + U\, x_t + b), with output y_t = V\, h_t and a seed h_0 = \mathbf{0}. The state h_t summarises every input up to step t.
Weight sharing. The same W, U, V, b are reused at every step, so the parameter count is independent of the sequence length T.
Unrolling. Writing out the steps turns the loop into a depth-T feed-forward chain with tied weights, where h_T is a composition reaching back to x_1.

Because inputs and outputs are read off step by step, the same recurrence covers a whole zoo of tasks just by choosing which steps you feed and read:

many-to-one — read a whole sequence, emit one answer at the end (sentiment of a review: take only y_T).
one-to-many — one input, generate a sequence (image → caption: feed once, then keep emitting).
many-to-many — a label per step (part-of-speech tagging: read x_t, emit y_t), or read all then generate all (translation, an encoder–decoder).

One equation, four pictures (many-to-many comes in two variants): this flexibility made RNNs the workhorse of sequence modelling before attention arrived.
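The task patterns differ only in which inputs are fed and which outputs are kept, as the sketch below tries to show with a scalar RNN. The scalar weights, the function names, and in particular the choice to feed zeros after the first step in the one-to-many case are assumptions made for the example; real captioning models often feed the previous output back in instead.

```python
import math

def run_rnn(xs, w, u, v, b, h0=0.0):
    """Scalar RNN: returns the output y_t = v*h_t at every step."""
    ys, h = [], h0
    for x in xs:
        h = math.tanh(w * h + u * x + b)
        ys.append(v * h)
    return ys

def one_to_many(x, n_steps, w, u, v, b):
    """Feed x once, then zeros (an assumed convention), emitting at every step."""
    xs = [x] + [0.0] * (n_steps - 1)
    return run_rnn(xs, w, u, v, b)

w, u, v, b = 0.5, 1.0, 2.0, 0.0   # toy scalar weights (assumed)
xs = [0.3, -0.1, 0.8, 0.2]        # a sequence of length T = 4

ys = run_rnn(xs, w, u, v, b)

many_to_one = ys[-1]                          # keep only y_T
many_to_many = ys                             # keep one output per step
generated = one_to_many(0.3, 5, w, u, v, b)   # one input, five outputs

print(many_to_one, many_to_many, generated)
```

The encoder-decoder variant of many-to-many is not shown; it would chain two such loops, with the first loop's final hidden state seeding the second.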

Step through the unrolled chain

Each column is one time step: input x_t at the bottom, hidden state h_t in the middle, output y_t on top. The horizontal arrows are the recurrent connection passing h_{t-1} forward, and they all carry the same matrix W. Step the control to light up the sequence one position at a time and watch the memory flow left to right.