Self-Attention

The attention mechanism let a decoder look back at an encoder. Self-attention takes the same idea and turns it inward: let every token in a sequence gather information from every other token — including itself — in a single step. No recurrence, no marching left to right; each position looks at the whole sequence at once. To do it, each token plays three roles at the same time.

Query, key, value: three roles per token

Think of every token as both a library patron and a book on the shelf. As a patron it has a query: "what am I looking for?" As a book it has a key, "what do I match?", and a value, "what do I hand over if you pick me?" From each token's (positionally encoded) vector x_i we produce all three by learned linear projections:

q_i = W_Q\, x_i,\qquad k_i = W_K\, x_i,\qquad v_i = W_V\, x_i.
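As a rough illustration of the three projections, here is a minimal dependency-free sketch. The sizes (3 tokens, input width 4, projection width 2), the token vectors, and the weight values are all made up for the example; in a real model W_Q, W_K, W_V are learned. The helper name `matvec` is hypothetical.

```python
def matvec(W, x):
    """Multiply a matrix W (given as a list of rows) by a vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

# Made-up token vectors x_i, standing in for positionally encoded embeddings.
X = [
    [1.0, 0.0, 1.0, 0.0],
    [0.0, 2.0, 0.0, 2.0],
    [1.0, 1.0, 1.0, 1.0],
]

# Illustrative weights only; a trained model would have learned these.
W_Q = [[1.0, 0.0, 1.0, 0.0],
       [0.0, 1.0, 0.0, 1.0]]
W_K = [[0.0, 1.0, 0.0, 1.0],
       [1.0, 0.0, 1.0, 0.0]]
W_V = [[1.0, 1.0, 0.0, 0.0],
       [0.0, 0.0, 1.0, 1.0]]

# The same three matrices are applied to every token.
Q = [matvec(W_Q, x) for x in X]  # q_i = W_Q x_i
K = [matvec(W_K, x) for x in X]  # k_i = W_K x_i
V = [matvec(W_V, x) for x in X]  # v_i = W_V x_i
```

Each token ends up with its own query, key, and value, but all tokens share the same three matrices.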

Three weight matrices W_Q, W_K, W_V — the only learned parameters of the mechanism — slice each token into the thing that searches, the thing that gets searched, and the thing that gets passed on.

Self-attention, line by line

Fix a token i; we build its new representation z_i by letting its query consult every key.

Step 1 — project into queries, keys, values. Already done above: q_i = W_Q x_i, and for every token j, k_j = W_K x_j and v_j = W_V x_j.

Step 2 — score query i against every key j. How well does what token i wants match what token j offers? The natural similarity of two vectors is their dot product:

s_{ij} = q_i \cdot k_j.

A big s_{ij} means "token j is highly relevant to token i". These raw scores are unbounded reals.
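The scoring step can be sketched in a few lines. The queries and keys below are made-up numbers (one pair per token), and `dot` is a hypothetical helper, not part of any library.

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Made-up queries and keys for a 3-token sequence.
Q = [[2.0, 0.0], [0.0, 4.0], [2.0, 2.0]]
K = [[0.0, 2.0], [4.0, 0.0], [2.0, 2.0]]

# S[i][j] = q_i . k_j : row i holds token i's score against every key.
S = [[dot(q, k) for k in K] for q in Q]
```

With these numbers, token 0's query scores highest against token 1's key (8, versus 0 and 4 for the other two), so token 1 is the most relevant to token 0 before any normalisation.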

Step 3 — normalise the scores into weights. Turn token i's row of scores into a probability distribution over the sequence with the softmax:

\alpha_{ij} = \operatorname{softmax}_j(s_{ij}) = \frac{e^{\,s_{ij}}}{\sum_{\ell} e^{\,s_{i\ell}}},\qquad \alpha_{ij} \ge 0,\quad \sum_{j} \alpha_{ij} = 1.

Now \alpha_{ij} is "the fraction of token i's attention spent on token j" — a clean budget summing to 1.
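A minimal sketch of the softmax over one row of scores follows. Subtracting the row maximum before exponentiating is an addition not mentioned in the text: it leaves the result unchanged (the common factor cancels in the ratio) and avoids overflow for large scores. The example row is made up.

```python
import math

def softmax(scores):
    """Turn a list of real scores into non-negative weights that sum to 1."""
    m = max(scores)                       # shift for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

row = [0.0, 1.0, 2.0]      # token i's raw scores s_ij (made up)
alpha = softmax(row)       # token i's attention weights alpha_ij
```

Larger scores get larger weights, but every token keeps a strictly positive share of the budget.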

Step 4 — mix the values. Token i's new representation is the weighted average of all the values under those weights:

z_i = \sum_{j} \alpha_{ij}\, v_j.

Where the weight concentrates, that token's value dominates z_i. A verb can pull in its subject, a pronoun can pull in its antecedent — each token rewrites itself as a blend of the tokens it found relevant.
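The mixing step is just a weighted average, which the sketch below spells out for a single token. The weights and values are made-up numbers chosen so that most of the weight sits on the second token.

```python
# Token i's attention weights over a 3-token sequence (made up, sums to 1).
alpha_i = [0.1, 0.7, 0.2]

# One value vector v_j per token (made up).
V = [[1.0, 0.0],
     [0.0, 1.0],
     [1.0, 1.0]]

# z_i = sum_j alpha_ij * v_j, computed one output dimension at a time.
dim = len(V[0])
z_i = [sum(a * v[d] for a, v in zip(alpha_i, V)) for d in range(dim)]
```

Because 70% of the weight is on the second token, z_i lands close to that token's value: roughly [0.3, 0.9], up to floating-point rounding.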

This is precisely the score → softmax → weighted-sum of the original attention mechanism. The one new twist that earns the name is that the queries, keys, and values all come from the same sequence: it is a sequence attending to itself.

Each token x_i is projected into a query, key, and value, then rewritten as a weighted average of values:

Three roles. q_i = W_Q x_i (what I seek), k_j = W_K x_j (what I offer), v_j = W_V x_j (what I pass on).
Score = query · key. The relevance of token j to token i is the dot product s_{ij} = q_i \cdot k_j.
Output = softmax · value. \alpha_{ij} = \operatorname{softmax}_j(s_{ij}) (non-negative, row-sums to 1) and z_i = \sum_j \alpha_{ij}\, v_j — a content-based weighted average of the whole sequence's values.
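Putting the three lines of the summary together, a minimal end-to-end sketch might look like the following. It follows the unscaled formulas of this section exactly; the function and helper names, the toy inputs, and the weight values are all assumptions made for illustration.

```python
import math

def matvec(W, x):
    """Matrix (list of rows) times vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    m = max(scores)  # stability shift; does not change the result
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(X, W_Q, W_K, W_V):
    """Return (Z, A): the new vectors z_i and the weights alpha_ij."""
    Q = [matvec(W_Q, x) for x in X]                   # three roles
    K = [matvec(W_K, x) for x in X]
    V = [matvec(W_V, x) for x in X]
    A = [softmax([dot(q, k) for k in K]) for q in Q]  # score, then softmax
    dim = len(V[0])
    Z = [[sum(a * v[d] for a, v in zip(row, V)) for d in range(dim)]
         for row in A]                                # weighted average
    return Z, A

# Made-up inputs and illustrative (not learned) weights.
X = [[1.0, 0.0, 1.0, 0.0],
     [0.0, 2.0, 0.0, 2.0],
     [1.0, 1.0, 1.0, 1.0]]
W_Q = [[1.0, 0.0, 1.0, 0.0], [0.0, 1.0, 0.0, 1.0]]
W_K = [[0.0, 1.0, 0.0, 1.0], [1.0, 0.0, 1.0, 0.0]]
W_V = [[1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]

Z, A = self_attention(X, W_Q, W_K, W_V)
```

Every token's output is computed from the whole sequence in one pass, with no loop that carries state from one position to the next.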

The projections W_Q and W_K are not hand-designed; they are learned, and what they learn is often interpretable. A commonly reported pattern: a verb's query learns to fire on the key of its subject, so "the cat that chased the dog ran" lets ran reach back past the distractor (dog) and attend to cat. A pronoun's query learns to find its antecedent's key; a closing bracket attends to its opener. Different attention heads (separate copies of the mechanism, each with its own projection matrices) specialise in different relations, syntactic dependency here, coreference there, so the same machinery, with different W_Q, W_K, discovers different threads of structure in the sentence.
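To see how the choice of W_Q and W_K alone changes who attends to whom, here is a toy sketch with two hand-picked "heads" applied to the same tokens. Nothing here is learned: the token vectors and the two key matrices (an identity and a coordinate swap) are assumptions chosen to make the contrast obvious, and `attention_weights` is a hypothetical helper.

```python
import math

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_weights(X, W_Q, W_K):
    """Rows of alpha_ij for one head, i.e. one (W_Q, W_K) pair."""
    Q = [matvec(W_Q, x) for x in X]
    K = [matvec(W_K, x) for x in X]
    return [softmax([dot(q, k) for k in K]) for q in Q]

# Three made-up tokens with 2-dimensional features.
X = [[3.0, 0.0], [0.0, 3.0], [0.0, 0.0]]

IDENTITY = [[1.0, 0.0], [0.0, 1.0]]
SWAP = [[0.0, 1.0], [1.0, 0.0]]   # exchanges the two coordinates

head_a = attention_weights(X, IDENTITY, IDENTITY)  # keys match like features
head_b = attention_weights(X, IDENTITY, SWAP)      # keys match the other feature
```

In head A, token 0 puts nearly all of its weight on itself; in head B, the same token puts nearly all of its weight on token 1. The inputs are identical, and only the key projection differs.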

Read the attention as a grid

[Figure: attention grid for a tiny sentence.] The weights for a sentence of n tokens form an n \times n grid: row i is the query token, column j the key token, and the darkness of a cell is the weight \alpha_{ij}, that is, how much token i attends to token j. Each row's cells are a softmax, so they always sum to 1. In the example, the verb's row puts most of its weight on its subject.
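A grid like this can be approximated in plain text. The sketch below maps each weight to a character, darker for larger weights. The tokens and weights are hand-picked for illustration and are not the output of any model; `SHADES`, `shade`, and `render_grid` are hypothetical names.

```python
SHADES = " .:*#"  # lightest to darkest

def shade(w):
    """Map a weight in [0, 1] to a character; darker means more attention."""
    idx = min(int(w * len(SHADES)), len(SHADES) - 1)
    return SHADES[idx]

def render_grid(tokens, weights):
    """One text line per query token (row), one cell per key token (column)."""
    width = max(len(t) for t in tokens)
    lines = []
    for tok, row in zip(tokens, weights):
        cells = " ".join(shade(w) for w in row)
        lines.append(f"{tok.rjust(width)} | {cells}")
    return "\n".join(lines)

tokens = ["the", "cat", "ran"]
# Hand-picked weights; each row sums to 1, as a softmax row would.
weights = [
    [0.65, 0.25, 0.10],
    [0.30, 0.55, 0.15],
    [0.05, 0.85, 0.10],
]

print(render_grid(tokens, weights))
```

Reading across the last row, the darkest cell sits in the "cat" column: in this made-up example the verb "ran" spends most of its attention on its subject.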