The Attention Mechanism

The encoder–decoder model had one fatal pinch point: the whole input was crushed into a single frozen context vector c = h_n. Attention removes it with one idea so good it reorganised the entire field. Instead of one summary, keep all the encoder states h_1, \dots, h_n around, and let the decoder, at each output step, build a fresh context by taking a weighted average of them — putting most of the weight on the input positions that matter right now. The decoder learns to look back, and where it looks is interpretable.

Attention, line by line

Fix a decoder step t with current decoder state s_t. We build its context c_t in three moves: score, normalise, average.

Step 1 — score each encoder state. Ask how relevant input position i is to what we're writing now, with an alignment score (the additive, Bahdanau form — a tiny neural net that compares s_t with each h_i):

e_{t i} = v^{\top}\tanh\!\left(W_s\, s_t + W_h\, h_i\right).

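A larger e_{t i} means “position i is more relevant to output t”. Here v, W_s and W_h are learned parameters, trained jointly with the rest of the model. The raw scores are unnormalised real numbers and may be negative.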
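The scoring step can be sketched in plain Python. This is a minimal illustration, not a trained model: the parameter values for `W_s`, `W_h` and `v`, the toy states, and the helper name `additive_score` are all assumptions chosen by hand so the arithmetic is easy to follow.

```python
import math

def additive_score(s_t, h_i, W_s, W_h, v):
    """Additive (Bahdanau-style) score: e_ti = v^T tanh(W_s s_t + W_h h_i)."""
    def matvec(W, x):
        return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]
    pre = [a + b for a, b in zip(matvec(W_s, s_t), matvec(W_h, h_i))]
    return sum(v_k * math.tanh(p) for v_k, p in zip(v, pre))

# Hand-picked toy parameters (hypothetical; in a real model they are learned).
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]

s_t = [0.5, -0.5]                                        # current decoder state
encoder_states = [[0.5, -0.5], [-0.5, 0.5], [2.0, 2.0]]  # h_1, h_2, h_3

# One raw score per input position; nothing is normalised yet.
scores = [additive_score(s_t, h, W_s, W_h, v) for h in encoder_states]
print(scores)
```

With these particular values the first encoder state receives the highest score and the second the lowest; different parameters would rank the positions differently.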

Step 2 — normalise into weights with softmax. Turn the n scores into a probability distribution over input positions using the softmax:

\alpha_{t i} = \frac{\exp(e_{t i})}{\sum_{j=1}^{n} \exp(e_{t j})},\qquad \alpha_{t i} \ge 0,\quad \sum_{i=1}^{n}\alpha_{t i} = 1.

The weights \alpha_{t i} are non-negative and sum to 1 — a clean “how much of my attention goes to each input token” budget, spent across the input.
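As a small sketch of the normalisation step, here is a softmax over a list of scores. Subtracting the largest score before exponentiating is a common stability trick that the text does not mention; it leaves the weights unchanged because the shift cancels in the ratio. The example scores are made up.

```python
import math

def softmax(scores):
    """Map real-valued scores to non-negative weights that sum to 1."""
    m = max(scores)                           # shift so the largest exponent is 0
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

weights = softmax([2.0, 0.0, -1.0])
# Same gaps between scores, large offset: the shift prevents overflow
# and the resulting weights are the same.
shifted = softmax([1002.0, 1000.0, 999.0])
print(weights, sum(weights))
```

Only the differences between scores matter to softmax, which is why adding a constant to every score does not move the attention budget.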

Step 3 — average the encoder states. The context for this step is the weighted sum of the encoder states under those weights:

c_t = \sum_{i=1}^{n}\alpha_{t i}\, h_i.

This is a content-based weighted average: where the weight concentrates, that encoder state dominates c_t. Because the weights are non-negative and \sum_i \alpha_{t i} = 1, the context is a genuine convex combination of the encoder states: it lies in their convex hull, so its norm can never exceed that of the largest state it averages.
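The weighted sum and its convex-combination property can be checked numerically. The weights and encoder states below are hypothetical values picked for illustration, and `weighted_average` is a name introduced here.

```python
import math

def weighted_average(weights, states):
    """c_t = sum_i alpha_ti * h_i, computed coordinate by coordinate."""
    dim = len(states[0])
    return [sum(a * h[k] for a, h in zip(weights, states)) for k in range(dim)]

def norm(x):
    return math.sqrt(sum(x_k * x_k for x_k in x))

alpha_t = [0.7, 0.2, 0.1]                      # hypothetical weights, sum to 1
h = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]     # hypothetical encoder states

c_t = weighted_average(alpha_t, h)
largest = max(norm(h_i) for h_i in h)
print(c_t, norm(c_t), largest)
```

Most of the weight sits on the first state, so the context ends up close to it, and its norm stays below that of the largest encoder state.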

Step 4 — decode with this fresh context. Feed c_t (alongside s_t) into the decoder's output, then advance to step t+1 — where a new s_{t+1} produces new scores, new weights, and a new context. Each output token gets its own custom-built view of the input.

Two consequences fall out for free

The bottleneck is gone. The decoder no longer depends on a single c; at every step it has direct access to all n encoder states, and the path from any input token to any output token is now length one. A long sentence is no harder to reach into than a short one — there is nothing to compress.

The weights are an alignment. The matrix \alpha_{t i} says, for each output token, which input tokens it drew from. Plotted as a heatmap it reveals a soft word-to-word alignment — for translation it lights up roughly along the diagonal, bending where word order differs between languages. Attention isn't just accurate; it tells you where the model looked.
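To make the alignment reading concrete, the sketch below takes a weight matrix and reports, for each output token, the input token with the most weight. The matrix is invented by hand for a three-word English-to-French example; it does not come from a trained model, and `hard_alignment` is a hypothetical helper name.

```python
src = ["the", "red", "car"]          # input tokens (columns)
tgt = ["la", "voiture", "rouge"]     # output tokens (rows)

# Hypothetical attention weights alpha[t][i]; each row sums to 1.
alpha = [
    [0.90, 0.05, 0.05],
    [0.05, 0.15, 0.80],
    [0.05, 0.85, 0.10],
]

def hard_alignment(alpha, src, tgt):
    """Pair each output token with its highest-weighted input token."""
    pairs = []
    for t, row in enumerate(alpha):
        i = max(range(len(row)), key=row.__getitem__)
        pairs.append((tgt[t], src[i]))
    return pairs

alignment = hard_alignment(alpha, src, tgt)

# A text-mode "heatmap": one row of weights per output token.
for word, row in zip(tgt, alpha):
    print(f"{word:>8} " + " ".join(f"{a:.2f}" for a in row))
print(alignment)
```

The last two rows peak off the diagonal because the adjective and noun swap places between the two languages, which is the bending described above.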

At each decoder step t, attention builds a custom context from all the encoder states:

Scores → softmax weights → \sum_i \alpha_{t i} h_i. Score each h_i against s_t to get e_{t i}, normalise \alpha_t = \operatorname{softmax}(e_t) (so \alpha_{t i} \ge 0 and \sum_i \alpha_{t i} = 1), then take the weighted average c_t = \sum_i \alpha_{t i} h_i.
No bottleneck. Every output step sees all n encoder states directly rather than a single fixed vector, so output quality degrades far less on long inputs.
Interpretable alignment. The weights \alpha_{t i} form a soft alignment between output and input tokens, readable as a heatmap.
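Putting the three moves together, one decoder step might look like the sketch below. It assumes the additive score from Step 1; the parameters, states and the function name `attend` are illustrative choices, not values from any trained model.

```python
import math

def matvec(W, x):
    return [sum(w * x_k for w, x_k in zip(row, x)) for row in W]

def attend(s_t, H, W_s, W_h, v):
    """Return (weights alpha_t, context c_t) for one decoder state s_t."""
    # Score: e_ti = v^T tanh(W_s s_t + W_h h_i) for every encoder state h_i.
    proj_s = matvec(W_s, s_t)
    scores = []
    for h_i in H:
        proj_h = matvec(W_h, h_i)
        scores.append(sum(v_k * math.tanh(a + b)
                          for v_k, a, b in zip(v, proj_s, proj_h)))
    # Normalise: softmax over input positions (max subtracted for stability).
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    alpha = [x / total for x in exps]
    # Average: c_t = sum_i alpha_ti * h_i.
    c_t = [sum(a * h_i[k] for a, h_i in zip(alpha, H)) for k in range(len(H[0]))]
    return alpha, c_t

# Hypothetical toy values; real parameters are learned.
W_s = [[1.0, 0.0], [0.0, 1.0]]
W_h = [[1.0, 0.0], [0.0, 1.0]]
v = [1.0, -1.0]
H = [[0.5, -0.5], [-0.5, 0.5], [2.0, 2.0]]

alpha_1, c_1 = attend([0.5, -0.5], H, W_s, W_h, v)   # one decoder step
alpha_2, c_2 = attend([-1.0, 1.0], H, W_s, W_h, v)   # a later step, new state
print(alpha_1, c_1)
print(alpha_2, c_2)
```

The encoder states never change between the two calls; only the decoder state does, and that alone is enough to move the peak of the weights and produce a different context.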

Watch the attention move

The curve shows the attention weights \alpha_{t i} spread across six input tokens for one output step. Drag the slider to change the output step t: the peak of attention slides along the input, and — because the weights are a softmax — they always re-normalise to sum to 1, however the peak moves. That sliding bump is the model choosing which input tokens to read as it writes each output token.
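A static stand-in for the sliding bump can be produced with a few lines of code. The scoring rule here (a negative squared distance from position t) is purely an assumption made so that the peak follows t; it is not how the figure or a trained model computes scores, and `attention_bump` is a made-up name.

```python
import math

def attention_bump(t, n_inputs=6, sharpness=1.0):
    """Hypothetical scores peaked at input position t, pushed through softmax."""
    scores = [-sharpness * (i - t) ** 2 for i in range(n_inputs)]
    m = max(scores)
    exps = [math.exp(e - m) for e in scores]
    total = sum(exps)
    return [x / total for x in exps]

for t in range(6):
    alpha = attention_bump(t)
    peak = max(range(len(alpha)), key=alpha.__getitem__)
    bars = " ".join(f"{a:.2f}" for a in alpha)
    print(f"t={t}  peak at i={peak}  sum={sum(alpha):.2f}  [{bars}]")
```

Each printed row is one slider position: the peak moves one token to the right per step while the row total stays at 1.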

Look again at Steps 1–3: score the decoder state against every encoder state, softmax the scores, take the weighted sum. Rename the pieces and you have the Transformer's query–key–value attention. The thing doing the looking, s_t, is a query q; each encoder state plays both a key k_i (what it's scored against) and a value v_i (what gets averaged). The score becomes a scaled dot product, where d is the dimension of the query and key vectors:

\alpha_i = \operatorname{softmax}\!\left(\frac{q\cdot k_i}{\sqrt{d}}\right),\qquad c = \sum_i \alpha_i\, v_i.
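For a single query, the formula above can be sketched as follows. The query, keys and values are toy numbers chosen for illustration, and the function name is an assumption rather than any library's API.

```python
import math

def scaled_dot_product_attention(q, keys, values):
    """Return (alpha, c): softmax(q.k_i / sqrt(d)) and the weighted sum of values."""
    d = len(q)
    scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
    m = max(scores)                              # stabilise the softmax
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    alpha = [e / total for e in exps]
    c = [sum(a * val[j] for a, val in zip(alpha, values))
         for j in range(len(values[0]))]
    return alpha, c

q = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
values = [[10.0, 0.0], [0.0, 10.0], [-10.0, 0.0]]

alpha, c = scaled_dot_product_attention(q, keys, values)
print(alpha, c)
```

The key pointing the same way as the query gets the most weight, the orthogonal key gets less, and the opposing key gets the least, so the context leans toward the first value.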

Same three moves — score, softmax, weighted sum. The leap is self-attention: let the sequence attend to itself (queries, keys and values all from the same tokens), drop the recurrence entirely, and you can process every position in parallel. That is the self-attention at the heart of the Transformer — the architecture this humble weighted average grew into.
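A bare-bones sketch of self-attention follows. To stay short it uses each token vector directly as its own query, key and value; actual Transformer layers first apply separate learned projections to obtain the three, which this sketch omits. The input `X` is a toy sequence.

```python
import math

def self_attention(X):
    """Each row of X acts as query, key and value (no learned projections)."""
    d = len(X[0])
    out = []
    # The loop is sequential here, but no iteration depends on another,
    # so every position could be computed in parallel.
    for q in X:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in X]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        alpha = [e / total for e in exps]
        out.append([sum(a * val[j] for a, val in zip(alpha, X)) for j in range(d)])
    return out

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # three toy token vectors
Y = self_attention(X)
Y_reversed = self_attention(X[::-1])       # same tokens, opposite order
print(Y)
```

Reversing the input simply reverses the output rows, which shows that nothing in the computation depends on processing the tokens in order, unlike a recurrent encoder.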