LSTM and GRU

The long-range problem told us why a plain RNN forgets: chaining one matrix W across many steps makes influence decay like r^{\,k}. The gated RNN answers it with a beautifully targeted fix. It keeps a separate memory — a cell state c_t — and updates it almost entirely by addition, with small learned valves (gates) deciding what to keep, what to write, and what to read. The most famous design is the LSTM (Long Short-Term Memory).

A gate is a sigmoid valve in [0, 1]

Every gate is a vector of numbers between 0 and 1, produced by a sigmoid \sigma(z) = 1/(1 + e^{-z}) (the two-class sibling of softmax). Fed the previous state and the new input, a gate computes

g = \sigma\!\left(W_g\,[\,h_{t-1},\, x_t\,] + b_g\right) \in (0, 1)^d.

Read each entry as a dial: 0 closes the valve (block this coordinate), 1 opens it fully (let it through), anything between is a partial pass. The LSTM learns three such gates — forget f_t, input i_t, and output o_t — and they act elementwise on the memory, multiplying it coordinate by coordinate (the Hadamard product \odot).

The cell update, line by line

Here is the whole LSTM, derived in the order the signal actually flows. Write [\,h_{t-1}, x_t\,] for the concatenated previous state and current input.

Step 1 — propose a candidate memory. A \tanh layer suggests what could be written into the cell this step — new content in (-1, 1):

\tilde{c}_t = \tanh\!\left(W_c\,[\,h_{t-1},\, x_t\,] + b_c\right).

Step 2 — compute the three gates. Each is its own sigmoid valve over the same input:

f_t = \sigma(W_f[\cdot]),\qquad i_t = \sigma(W_i[\cdot]),\qquad o_t = \sigma(W_o[\cdot]).

Step 3 — the additive cell update (the whole point). Keep a gated fraction of the old memory and add a gated fraction of the candidate:

c_t = f_t \odot c_{t-1} \;+\; i_t \odot \tilde{c}_t.

Stare at that +. The old memory c_{t-1} is not passed through a matrix W and a squashing nonlinearity — it is carried forward by a plain elementwise multiply-and-add. If the forget gate opens fully, f_t = 1, then c_t = c_{t-1} + i_t \odot \tilde{c}_t — the memory is the previous memory plus a correction. This is structurally the residual connection's y = x + F(x), the same +1 highway — but laid along time instead of depth.

Step 4 — read the gradient highway. Differentiate the cell update across one step (with the forget gate held as a value):

\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t).

Chaining across k steps multiplies the forget gates, not powers of a matrix:

\frac{\partial c_t}{\partial c_{t-k}} = \operatorname{diag}\!\left(\textstyle\prod_{j} f_j\right).

When the cell decides to remember a coordinate it holds f_j \approx 1 there, and the product stays near 1 for as long as it likes — no exponential decay. The memory, and the gradient through it, survive across hundreds of steps. The forget gate makes the decay rate learnable per coordinate instead of fixed by one shared \lVert W\rVert.

Step 5 — read out the hidden state. The cell is the long-term store; the hidden state h_t is a gated view of it, what the rest of the network actually sees this step:

h_t = o_t \odot \tanh(c_t).

Memory in c, exposed through o — the cell can hold a fact silently for many steps and only surface it when the output gate opens.

The GRU: the same idea, two gates

The GRU (Gated Recurrent Unit) is a streamlined cousin that drops the separate cell state and uses just two gates. An update gate z_t interpolates between keeping the old state and taking the new candidate, while a reset gate r_t controls how much past leaks into the candidate:

h_t = (1 - z_t)\odot h_{t-1} \;+\; z_t \odot \tilde{h}_t.

It is the same convex, additive blend — z_t = 0 copies the past unchanged (the memory highway), z_t = 1 overwrites it. Fewer gates means fewer parameters and a touch faster, often with comparable accuracy; the LSTM's extra output gate buys finer control. Both win the same way: an additive update guarded by sigmoids.

Gated RNNs defeat exponential forgetting with a guarded additive memory:

Gates in [0,1]. Each gate is a sigmoid valve, g = \sigma(W_g[h_{t-1}, x_t] + b_g), acting elementwise (\odot) — 0 blocks, 1 passes.
Additive cell highway. c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, so \partial c_t/\partial c_{t-1} = \operatorname{diag}(f_t). With f_t \approx 1 the gradient neither vanishes nor explodes — the same x + F(x) trick as a residual connection, along time. The readout is h_t = o_t \odot \tanh(c_t).
LSTM vs GRU. The LSTM has three gates and a separate cell state; the GRU merges them into two gates and one state, h_t = (1-z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t — leaner, often just as good.

See the memory highway

Step through the cell. The old memory c_{t-1} arrives along the top and is scaled by the forget gate; the input gate scales a fresh candidate \tilde{c}_t; the two are summed — that plain + is the highway — to give the new memory c_t. Finally the output gate decides how much of \tanh(c_t) becomes the visible hidden state h_t.

The forget gate is the knob the long-range problem was begging for. Suppose a sentence opens with a singular subject. The cell writes “subject = singular” into one coordinate and the network learns to hold f_t \approx 1 on that coordinate through the entire relative clause — that the dog chased all afternoon — so c_t \approx c_{t-1} there, step after step. The fact simply persists, intact, until the verb needs it.

Then comes a full stop, a new subject, and the model wants a clean slate. Now it drives f_t \approx 0 on that coordinate: c_t = 0 \cdot c_{t-1} + \dots wipes the old value and writes the new. The same valve that grants near-perfect long-term memory also grants deliberate forgetting — and because it is learned per coordinate and per step, the cell can keep one fact while erasing another in the very same update. (Tellingly, LSTMs train best when the forget gate's bias starts positive, so it begins life biased toward remembering.)

Where this is going

A gated cell can carry information across a long sequence. The next question is how to use that to transform one whole sequence into another — translate a sentence, summarise a paragraph — with a reader and a writer: the encoder–decoder model.