Query, key, value: three roles per token
Think of every token as both a librarian and a patron. As a patron it has a
query — "what am I looking for?" As a book on the shelf it has a
key — "what do I match?" — and a value — "what do I hand over if
you pick me?" From each token's (positionally encoded) vector
x_i we produce all three by learned linear projections:
q_i = W_Q\, x_i,\qquad k_i = W_K\, x_i,\qquad v_i = W_V\, x_i.
Three weight matrices W_Q, W_K, W_V — the only learned parameters of
the mechanism — slice each token into the thing that searches, the thing that gets searched, and
the thing that gets passed on.
Self-attention, line by line
Fix a token i; we build its new representation
z_i by letting its query consult every key.
Step 1 — project into queries, keys, values. Already done above:
q_i = W_Q x_i, and for every token
j, k_j = W_K x_j and
v_j = W_V x_j.
Step 2 — score query i against every key
j. How well does what token
i wants match what token j offers? The
natural similarity of two vectors is their
dot product:
s_{ij} = q_i \cdot k_j.
A big s_{ij} means "token j is highly
relevant to token i". These raw scores are unbounded reals.
Step 3 — normalise the scores into weights. Turn token
i's row of scores into a probability distribution over the sequence
with the
softmax:
\alpha_{ij} = \operatorname{softmax}_j(s_{ij}) = \frac{e^{\,s_{ij}}}{\sum_{\ell} e^{\,s_{i\ell}}},\qquad \alpha_{ij} \ge 0,\quad \sum_{j} \alpha_{ij} = 1.
Now \alpha_{ij} is "the fraction of token
i's attention spent on token j" — a clean
budget summing to 1.
Step 4 — mix the values. Token i's new
representation is the weighted average of all the values under those weights:
z_i = \sum_{j} \alpha_{ij}\, v_j.
Where the weight concentrates, that token's value dominates z_i. A verb
can pull in its subject, a pronoun can pull in its antecedent — each token rewrites itself as a
blend of the tokens it found relevant.
This is precisely the score → softmax → weighted-sum of the original attention mechanism. The one
new twist that earns the name is that the queries, keys, and values all come from the
same sequence: it is a sequence attending to itself.
Each token x_i is projected into a query, key, and value, then
rewritten as a weighted average of values:
-
Three roles. q_i = W_Q x_i (what I seek),
k_j = W_K x_j (what I offer),
v_j = W_V x_j (what I pass on).
-
Score = query · key. The relevance of token
j to token i is the dot product
s_{ij} = q_i \cdot k_j.
-
Output = softmax · value.
\alpha_{ij} = \operatorname{softmax}_j(s_{ij}) (non-negative,
row-sums to 1) and
z_i = \sum_j \alpha_{ij}\, v_j — a content-based weighted average of
the whole sequence's values.
The projections W_Q and W_K are not
hand-designed — they are learned, and what they learn is delightfully interpretable. A common
finding: a verb's query learns to fire on the key of its subject, so
"the cat that chased the dog ran" lets ran reach back past
the distractor and attend to cat. A pronoun's query learns to find its antecedent's
key; a closing bracket attends to its opener. Different attention heads specialise in different
relations — syntactic dependency here, coreference there — so the same machinery, with different
W_Q, W_K, discovers different threads of structure in the sentence.