LSTM and GRU
The
long-range
problem told us why a plain RNN forgets: chaining one matrix
W across many steps makes influence decay like
r^{\,k}. The gated RNN answers it with a beautifully
targeted fix. It keeps a separate memory — a cell state
c_t — and updates it almost entirely by addition, with small
learned valves (gates) deciding what to keep, what to write, and what to read.
The most famous design is the LSTM (Long Short-Term Memory).
A gate is a sigmoid valve in [0, 1]
Every gate is a vector of numbers between 0 and
1, produced by a
sigmoid
\sigma(z) = 1/(1 + e^{-z}) (the two-class sibling of softmax). Fed the
previous state and the new input, a gate computes
g = \sigma\!\left(W_g\,[\,h_{t-1},\, x_t\,] + b_g\right) \in (0, 1)^d.
Read each entry as a dial: 0 closes the valve (block this coordinate),
1 opens it fully (let it through), anything between is a partial pass.
The LSTM learns three such gates — forget f_t,
input i_t, and output
o_t — and they act elementwise on the memory, multiplying it
coordinate by coordinate (the Hadamard product \odot).
The cell update, line by line
Here is the whole LSTM, derived in the order the signal actually flows. Write
[\,h_{t-1}, x_t\,] for the concatenated previous state and current
input.
Step 1 — propose a candidate memory. A \tanh layer
suggests what could be written into the cell this step — new content in
(-1, 1):
\tilde{c}_t = \tanh\!\left(W_c\,[\,h_{t-1},\, x_t\,] + b_c\right).
Step 2 — compute the three gates. Each is its own sigmoid valve over the same
input:
f_t = \sigma(W_f[\cdot]),\qquad i_t = \sigma(W_i[\cdot]),\qquad o_t = \sigma(W_o[\cdot]).
Step 3 — the additive cell update (the whole point). Keep a gated fraction of
the old memory and add a gated fraction of the candidate:
c_t = f_t \odot c_{t-1} \;+\; i_t \odot \tilde{c}_t.
Stare at that +. The old memory c_{t-1} is
not passed through a matrix W and a squashing nonlinearity — it is
carried forward by a plain elementwise multiply-and-add. If the forget gate
opens fully, f_t = 1, then c_t = c_{t-1} + i_t \odot \tilde{c}_t
— the memory is the previous memory plus a correction. This is structurally the
residual
connection's y = x + F(x), the same
+1 highway — but laid along time instead of depth.
Step 4 — read the gradient highway. Differentiate the cell update across one
step (with the forget gate held as a value):
\frac{\partial c_t}{\partial c_{t-1}} = \operatorname{diag}(f_t).
Chaining across k steps multiplies the forget gates, not powers of a
matrix:
\frac{\partial c_t}{\partial c_{t-k}} = \operatorname{diag}\!\left(\textstyle\prod_{j} f_j\right).
When the cell decides to remember a coordinate it holds
f_j \approx 1 there, and the product stays near
1 for as long as it likes — no exponential decay. The
memory, and the gradient through it, survive across hundreds of steps. The forget gate makes the
decay rate learnable per coordinate instead of fixed by one shared
\lVert W\rVert.
Step 5 — read out the hidden state. The cell is the long-term store; the hidden
state h_t is a gated view of it, what the rest of the network
actually sees this step:
h_t = o_t \odot \tanh(c_t).
Memory in c, exposed through o — the cell can
hold a fact silently for many steps and only surface it when the output gate opens.
The GRU: the same idea, two gates
The GRU (Gated Recurrent Unit) is a streamlined cousin that drops the separate
cell state and uses just two gates. An update gate
z_t interpolates between keeping the old state and taking the new
candidate, while a reset gate r_t controls how much
past leaks into the candidate:
h_t = (1 - z_t)\odot h_{t-1} \;+\; z_t \odot \tilde{h}_t.
It is the same convex, additive blend — z_t = 0 copies the past
unchanged (the memory highway), z_t = 1 overwrites it. Fewer gates
means fewer parameters and a touch faster, often with comparable accuracy; the LSTM's extra
output gate buys finer control. Both win the same way: an additive update guarded by sigmoids.
Gated RNNs defeat exponential forgetting with a guarded additive memory:
-
Gates in [0,1]. Each gate is a sigmoid valve,
g = \sigma(W_g[h_{t-1}, x_t] + b_g), acting elementwise
(\odot) — 0 blocks, 1 passes.
-
Additive cell highway.
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, so
\partial c_t/\partial c_{t-1} = \operatorname{diag}(f_t). With
f_t \approx 1 the gradient neither vanishes nor explodes — the
same x + F(x) trick as a residual connection, along time. The
readout is h_t = o_t \odot \tanh(c_t).
-
LSTM vs GRU. The LSTM has three gates and a separate cell state; the GRU
merges them into two gates and one state,
h_t = (1-z_t)\odot h_{t-1} + z_t \odot \tilde{h}_t — leaner, often
just as good.
See the memory highway
Step through the cell. The old memory c_{t-1} arrives along the top
and is scaled by the forget gate; the input gate scales a fresh
candidate \tilde{c}_t; the two are summed — that
plain + is the highway — to give the new memory
c_t. Finally the output gate decides how much of
\tanh(c_t) becomes the visible hidden state
h_t.
The forget gate is the knob the long-range problem was begging for. Suppose a sentence opens with
a singular subject. The cell writes “subject = singular” into one coordinate and the network
learns to hold f_t \approx 1 on that coordinate through the entire
relative clause — that the dog chased all afternoon — so
c_t \approx c_{t-1} there, step after step. The fact simply persists,
intact, until the verb needs it.
Then comes a full stop, a new subject, and the model wants a clean slate. Now it drives
f_t \approx 0 on that coordinate: c_t = 0 \cdot c_{t-1} + \dots
wipes the old value and writes the new. The same valve that grants near-perfect long-term memory
also grants deliberate forgetting — and because it is learned per coordinate and per
step, the cell can keep one fact while erasing another in the very same update. (Tellingly, LSTMs
train best when the forget gate's bias starts positive, so it begins life biased toward
remembering.)
Where this is going
A gated cell can carry information across a long sequence. The next question is how to
use that to transform one whole sequence into another — translate a sentence, summarise a
paragraph — with a reader and a writer:
the
encoder–decoder model.