The Attention Mechanism
The encoder–decoder model had one fatal pinch point: the whole input was crushed into a single frozen
context vector c = h_n. Attention removes it with one
idea so good it reorganised the entire field. Instead of one summary, keep all the
encoder states h_1, \dots, h_n around, and let the decoder, at
each output step, build a fresh context by taking a weighted
average of them — putting most of the weight on the input positions that matter right
now. The decoder learns to look back, and where it looks is interpretable.
Attention, line by line
Fix a decoder step t with current decoder state
s_t. We build its context c_t in three
moves (score, normalise, average), then hand it to the decoder in a fourth.
Step 1 — score each encoder state. Ask how relevant input position
i is to what we're writing now, with an alignment score (the additive,
Bahdanau form — a tiny neural net that compares s_t with each
h_i):
e_{t i} = v^{\top}\tanh\!\left(W_s\, s_t + W_h\, h_i\right).
A larger e_{t i} means “position i is more
relevant to output t”. These raw scores are unnormalised real numbers and may be negative.
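As a rough illustration of Step 1, here is a minimal sketch of the additive score in plain Python. The parameter values `W_s`, `W_h` and `v` are hand-picked toy numbers (in a trained model they would be learned), and the helper `matvec` is a hypothetical name introduced only for this example.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]

def additive_score(s_t, h_i, W_s, W_h, v):
    """Additive (Bahdanau-style) score: e_ti = v^T tanh(W_s s_t + W_h h_i)."""
    a = matvec(W_s, s_t)
    b = matvec(W_h, h_i)
    hidden = [math.tanh(x + y) for x, y in zip(a, b)]
    return sum(vk * hk for vk, hk in zip(v, hidden))

# Toy parameters, chosen by hand for illustration only.
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]

s_t = [0.5, -0.5]                                   # current decoder state
encoder_states = [[0.5, -0.5], [-0.5, 0.5], [0.0, 0.0]]

# One score per encoder state.
scores = [additive_score(s_t, h, W_s, W_h, v) for h in encoder_states]
print(scores)
```

With these toy numbers the first encoder state, which matches `s_t`, gets the highest score and the second, which cancels it, scores zero. Because tanh is bounded, each score in this form can never exceed the sum of the absolute values of `v` in magnitude.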
Step 2 — normalise into weights with softmax. Turn the
n scores into a probability distribution over input positions using the
softmax:
\alpha_{t i} = \frac{\exp(e_{t i})}{\sum_{j=1}^{n} \exp(e_{t j})},\qquad \alpha_{t i} \ge 0,\quad \sum_{i=1}^{n}\alpha_{t i} = 1.
The weights \alpha_{t i} are non-negative and sum to
1 — a clean “how much of my attention goes to each input token”
budget, spent across the input.
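Step 2 can be sketched in a few lines. This version subtracts the largest score before exponentiating, a common stability trick that is an addition here rather than something the text prescribes; it leaves the weights unchanged because the common factor cancels in the ratio.

```python
import math

def softmax(scores):
    """Turn real-valued scores into non-negative weights that sum to 1."""
    m = max(scores)                          # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

weights = softmax([2.0, 1.0, 0.0])
print([round(w, 3) for w in weights])        # roughly [0.665, 0.245, 0.09]

# Adding the same constant to every score gives the same weights,
# and the shift keeps exp() from overflowing on large inputs.
shifted = softmax([1002.0, 1001.0, 1000.0])
```

Only differences between scores matter: the largest score takes the biggest share of the budget, but every position keeps a strictly positive weight.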
Step 3 — average the encoder states. The context for this step is the
weighted sum of the encoder states under those weights:
c_t = \sum_{i=1}^{n}\alpha_{t i}\, h_i.
This is a content-based weighted average: where the weight concentrates, that
encoder state dominates c_t. Because \alpha_{t i} \ge 0 and
\sum_i \alpha_{t i} = 1, the context is a genuine convex combination of
the encoder states: it lies in their convex hull, so its norm never exceeds that of the largest state it averages.
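A small sketch of Step 3, using illustrative weights and encoder states made up for this example. It computes the weighted sum coordinate by coordinate and checks the norm bound that follows from the weights being non-negative and summing to 1.

```python
import math

def context_vector(weights, encoder_states):
    """c_t = sum_i alpha_ti * h_i, computed one coordinate at a time."""
    dim = len(encoder_states[0])
    return [sum(a * h[k] for a, h in zip(weights, encoder_states))
            for k in range(dim)]

def norm(x):
    """Euclidean length of a vector."""
    return math.sqrt(sum(xi * xi for xi in x))

encoder_states = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
weights = [0.7, 0.2, 0.1]          # illustrative values that sum to 1

c_t = context_vector(weights, encoder_states)
print(c_t)                          # approximately [0.6, 0.2]

# A convex combination cannot be longer than the longest vector it averages.
largest = max(norm(h) for h in encoder_states)
print(norm(c_t) <= largest)
```

Most of the weight sits on the first state, so `c_t` points mostly in its direction, pulled slightly towards the other two.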
Step 4 — decode with this fresh context. Feed c_t
(alongside s_t) into the decoder's output, then advance to step
t+1 — where a new s_{t+1} produces
new scores, new weights, and a new context. Each output token gets its own custom-built view of
the input.
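To make the step-by-step loop concrete, here is a toy sketch under loud assumptions: the score is a plain dot product rather than the additive form above (chosen only for brevity), and `toy_decoder_update` is a hypothetical stand-in for a real decoder cell, not an actual RNN. The point is only that each step recomputes scores, weights and context from the new state.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def attend(s_t, encoder_states):
    """Score, normalise, average. Dot-product score used as a simple stand-in."""
    scores = [sum(a * b for a, b in zip(s_t, h)) for h in encoder_states]
    weights = softmax(scores)
    dim = len(encoder_states[0])
    context = [sum(w * h[k] for w, h in zip(weights, encoder_states))
               for k in range(dim)]
    return weights, context

def toy_decoder_update(s_t, c_t):
    """Hypothetical state update: mixes state and context, then swaps coordinates."""
    return [math.tanh(s_t[1] + c_t[1]), math.tanh(s_t[0] + c_t[0])]

encoder_states = [[2.0, 0.0], [0.0, 2.0], [-2.0, 0.0]]
s_t = [1.0, 0.0]                   # toy initial decoder state

history = []                       # attention weights at each output step
for t in range(3):
    weights, c_t = attend(s_t, encoder_states)
    history.append(weights)
    s_t = toy_decoder_update(s_t, c_t)   # advance to step t+1

for t, w in enumerate(history):
    print(t, [round(x, 2) for x in w])
```

In this toy run the first step puts most of its weight on the first encoder state and the second step shifts to the second one, because the updated decoder state now points in a different direction.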
Two consequences fall out for free
The bottleneck is gone. The decoder no longer depends on a single
c; at every step it has direct access to all
n encoder states, and the path from any encoder state to the current output
is a single step. A distant input position is no harder to reach than a nearby one, because nothing has
to be squeezed through one vector.
The weights are an alignment. The matrix
\alpha_{t i} says, for each output token, which input tokens it drew
from. Plotted as a heatmap it reveals a soft word-to-word alignment — for translation it lights up
roughly along the diagonal, bending where word order differs between languages. Attention isn't
just accurate; it tells you where the model looked.
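The alignment reading can be illustrated with a hand-built toy. The vectors below are invented so that each target word matches exactly one source word, and the score is a plain dot product; none of this comes from a trained model. The example phrase swaps word order between English and French, so the strongest weights leave the diagonal.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

source = ["the", "red", "car"]
target = ["la", "voiture", "rouge"]

# Toy states: one axis per source word (illustrative, not learned).
encoder_states = [[3.0, 0.0, 0.0], [0.0, 3.0, 0.0], [0.0, 0.0, 3.0]]
decoder_states = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0], [0.0, 1.0, 0.0]]

# Row t of alpha holds the attention weights for output token t.
alpha = [softmax([dot(s, h) for h in encoder_states]) for s in decoder_states]

# Hard alignment: the most attended source position for each target word.
alignment = [row.index(max(row)) for row in alpha]

print(" " * 9 + " ".join(f"{w:>5}" for w in source))
for word, row in zip(target, alpha):
    print(f"{word:>8} " + " ".join(f"{a:5.2f}" for a in row))
print([(t, source[i]) for t, i in zip(target, alignment)])
```

The printed grid is a text version of the heatmap: "voiture" draws mainly from "car" and "rouge" from "red", which is the off-diagonal bend described above.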
At each decoder step t, attention builds a custom context from all the
encoder states:
- Scores → softmax weights → \sum_i \alpha_i h_i.
  Score each h_i against s_t to get
  e_{t i}, normalise
  \alpha_t = \operatorname{softmax}(e_t) (so
  \alpha_{t i} \ge 0 and \sum_i \alpha_{t i} = 1),
  then take the weighted average c_t = \sum_i \alpha_{t i} h_i.
- No bottleneck. Every output step sees all n
  encoder states directly, with no single fixed vector in the way, so long inputs degrade far less.
- Interpretable alignment. The weights \alpha_{t i}
  form a soft alignment between output and input tokens, readable as a heatmap.
Watch the attention move
Picture the attention weights \alpha_{t i} spread across six
input tokens for one output step, drawn as a curve. As the output step
t advances, the peak of attention slides along the input and, because the
weights are a softmax, they always re-normalise to sum to 1, however
the peak moves. That sliding bump is the model choosing which input tokens to read as it writes
each output token.
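The sliding bump can be reproduced with a few lines of Python. The scoring rule here, a negative squared distance from position t, is an assumption made purely to produce a peak that moves; a real model computes its scores from the decoder and encoder states.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def toy_weights(t, n=6, sharpness=1.0):
    """Hypothetical scores peaking at input position t: e_i = -sharpness * (i - t)^2."""
    return softmax([-sharpness * (i - t) ** 2 for i in range(n)])

# Print one row of weights per output step, with its total.
for t in range(6):
    w = toy_weights(t)
    row = " ".join(f"{x:.2f}" for x in w)
    print(f"t={t}: {row}  sum={sum(w):.2f}")
```

Each row peaks at a different input position, yet every row totals 1: moving the peak only redistributes a fixed budget.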
Look again at Steps 1–3: score the decoder state against every encoder state, softmax the
scores, take the weighted sum. Rename the pieces and you have the Transformer's
query–key–value attention. The thing doing the looking,
s_t, is a query q; each
encoder state serves as both a key k_i (what the query is scored
against) and a value v_i (what gets averaged). The
score becomes a dot product scaled by \sqrt{d}, with d the key dimension, and the softmax runs over the positions i:
\alpha_i = \operatorname{softmax}\!\left(\frac{q\cdot k_i}{\sqrt{d}}\right),\qquad c = \sum_i \alpha_i\, v_i.
Same three moves — score, softmax, weighted sum. The leap is
self-attention: let the sequence attend to itself (queries, keys and
values all from the same tokens), drop the recurrence entirely, and you can process every position
in parallel. That is the self-attention at the heart of the Transformer, the architecture this humble
weighted average grew into.
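A minimal sketch of the query, key and value form, with one deliberate simplification: the token vectors are used directly as queries, keys and values. A real Transformer layer first applies learned projections and runs several heads, all of which are omitted here. The function names are introduced for this example only.

```python
import math

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [x / total for x in exps]

def scaled_dot_product_attention(q, keys, values):
    """Score q against each key, softmax over positions, average the values."""
    d = len(q)                                   # key dimension
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim)]
    return out, weights

def self_attention(tokens):
    """Every token queries the whole sequence; no learned projections (simplification)."""
    return [scaled_dot_product_attention(x, tokens, tokens)[0] for x in tokens]

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs = self_attention(tokens)
for x, y in zip(tokens, outputs):
    print(x, "->", [round(v, 3) for v in y])
```

Each output position is computed independently of the others, with no state carried from one position to the next, which is what makes the computation easy to parallelise.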