Self vs. cross: where Q, K, V come from, line by line
Both kinds of attention run the identical scaled
dot-product formula. The only difference — and it is the whole difference — is
which sequence supplies the queries, and which supplies the keys and values.
Step 1 — recall the one formula. Given queries
Q, keys K, and values V, attention is always
\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.
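As a minimal sketch of the formula above, the following NumPy code computes scaled dot-product attention for small random matrices. The function names `softmax` and `attention`, the sizes, and the random inputs are illustrative assumptions, not part of the original text.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the maximum before exponentiating for numerical stability.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys,    d_k = 4
V = rng.normal(size=(5, 6))   # 5 values,  d_v = 6

out, weights = attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 6) (3, 5)
```

Note that the number of queries (3) and the number of keys and values (5) need not match; only the key and value counts must agree, which is what makes the split in cross-attention possible.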
Step 2 — self-attention: one source for all three. In self-attention the
sequence X attends to itself, so queries, keys, and values are all
linear projections of the same X:
Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.
Every position mixes information drawn from its own sequence.
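A short sketch of Step 2, assuming small illustrative sizes and random projection matrices (none of these values come from the text): all three projections read from the same X, so the weight matrix is square, n by n.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 5, 8, 4            # illustrative sizes (assumed)

X = rng.normal(size=(n, d_model))    # the single source sequence
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

# Self-attention: queries, keys, and values are all projections of X.
Q, K, V = X @ W_Q, X @ W_K, X @ W_V

out, weights = attention(Q, K, V)
print(weights.shape)  # (5, 5): every position scores every position of X
print(out.shape)      # (5, 4)
```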
Step 3 — cross-attention: split the sources. Now let the decoder's
representations Y (length m) attend to
the encoder's output Z (length n). The
queries come from the decoder; the keys and values come from the
encoder:
Q = Y W_Q \quad(\text{decoder}), \qquad K = Z W_K, \quad V = Z W_V \quad(\text{encoder}).
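The split in Step 3 can be sketched the same way. The sizes, the random inputs, and the helper name `attention` below are assumptions for illustration; the only point is that Q is built from Y while K and V are built from Z, so the weight matrix becomes rectangular, m by n.

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention; returns (output, weights)."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4      # illustrative sizes (assumed)

Y = rng.normal(size=(m, d_model))    # decoder representations, length m
Z = rng.normal(size=(n, d_model))    # encoder output, length n
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

Q = Y @ W_Q                          # queries from the decoder
K, V = Z @ W_K, Z @ W_V              # keys and values from the encoder

out, weights = attention(Q, K, V)
print(weights.shape)  # (3, 5): one row per decoder position, one column per source position
print(out.shape)      # (3, 4): one output vector per decoder position
```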
Step 4 — score across the bridge. Decoder query
q_t = y_t W_Q is dotted with every encoder key k_j = z_j W_K; the scores are scaled by 1/\sqrt{d_k} and softmaxed
into a distribution over the n source positions:
\alpha_{t j} = \operatorname{softmax}_j\!\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right), \qquad \sum_{j=1}^{n} \alpha_{t j} = 1.
Step 5 — pull source information into the decoder. The output for decoder
position t is the weighted sum of the encoder's values:
c_t = \sum_{j=1}^{n} \alpha_{t j}\, v_j.
Read that again: a vector living in the decoder
(q_t) chooses how much of each source token's value to
absorb. Each decoder position pulls in a soft, weighted mix of the source, concentrated on the tokens most relevant to it right now. This is
the same score, softmax, weighted-sum pattern as seq2seq attention, promoted from a bolt-on
to a standard sublayer of the architecture.
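Steps 4 and 5 can be traced for a single decoder position. The sketch below assumes small random inputs and picks t = 1 arbitrarily; it computes \alpha_{t j} and c_t with explicit loops over j, then checks that the result matches the matrix form.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4      # illustrative sizes (assumed)

Y = rng.normal(size=(m, d_model))    # decoder representations
Z = rng.normal(size=(n, d_model))    # encoder output
W_Q = rng.normal(size=(d_model, d_k))
W_K = rng.normal(size=(d_model, d_k))
W_V = rng.normal(size=(d_model, d_k))

K = Z @ W_K                          # one key per source position
V = Z @ W_V                          # one value per source position

t = 1                                # an arbitrary decoder position
q_t = Y[t] @ W_Q                     # the query for that position

# Step 4: scaled dot product against every encoder key, then softmax over j.
scores = np.array([q_t @ K[j] for j in range(n)]) / np.sqrt(d_k)
exp_scores = np.exp(scores - scores.max())
alpha_t = exp_scores / exp_scores.sum()

# Step 5: weighted sum of the encoder's values.
c_t = sum(alpha_t[j] * V[j] for j in range(n))

print(np.isclose(alpha_t.sum(), 1.0))   # True
print(np.allclose(c_t, alpha_t @ V))    # True: the loop equals the matrix form
```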
Cross-attention runs scaled dot-product attention with its sources split across two sequences:
- Q from the decoder, K and V from the encoder. Queries are
  Q = Y W_Q from the decoder representations, while keys and
  values are K = Z W_K and V = Z W_V from the encoder output.
  In self-attention, by contrast, all three come from one sequence.
- Same scaled dot-product formula. The mechanism is unchanged:
  \operatorname{softmax}(Q K^{\top}/\sqrt{d_k})\,V, giving weights
  \alpha_{tj} over the n source positions and output c_t = \sum_j \alpha_{tj} v_j.
- It bridges encoder → decoder. Cross-attention is the channel through
  which the encoded source flows into the decoder, so each output token can be conditioned
  on the most relevant parts of the input. It is the seq2seq attention mechanism turned into
  a built-in sublayer.
The full encoder–decoder transformer uses attention in three distinct places, and
telling them apart is half of understanding the architecture:
- Encoder self-attention. Q, K, V all from the source. No causal mask: the
  encoder is allowed to look both ways across the input it is reading.
- Decoder masked self-attention. Q, K, V all from the (partial) output,
  but with a causal mask so each output position attends only to itself and earlier
  output positions, preserving left-to-right generation.
- Decoder → encoder cross-attention. Q from the decoder, K and V from the
  encoder. No causal mask is needed here: the entire source is already known, so a
  decoder position is free to attend to every source token. This is the layer that lets the
  translation actually depend on the sentence being translated.
Same arithmetic everywhere; only the wiring of Q, K, V (and whether a mask is present)
changes. That economy — one operation, three roles — is a big part of why the transformer
is so uniform and so easy to scale.
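The three roles can be sketched with a single function that takes an optional mask. This is a simplified illustration under stated assumptions: single-head attention, random projection matrices from a hypothetical helper `proj`, and the same Z standing in for both the encoder's input and its output. A real transformer would add multiple heads, residual connections, and layer stacking.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked pairs get zero weight
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d_model, d_k = 6, 4, 8, 4          # illustrative sizes (assumed)
Z = rng.normal(size=(n, d_model))        # source-side representations
Y = rng.normal(size=(m, d_model))        # target-side representations

def proj():
    # Hypothetical helper: a fresh random projection matrix per call.
    return rng.normal(size=(d_model, d_k))

# 1. Encoder self-attention: Q, K, V from the source, no causal mask.
_, w_enc = attention(Z @ proj(), Z @ proj(), Z @ proj())

# 2. Decoder masked self-attention: Q, K, V from the output, causal mask.
causal = np.tril(np.ones((m, m), dtype=bool))  # position t sees positions <= t
_, w_dec = attention(Y @ proj(), Y @ proj(), Y @ proj(), mask=causal)

# 3. Cross-attention: Q from the decoder, K and V from the encoder, no causal mask.
_, w_cross = attention(Y @ proj(), Z @ proj(), Z @ proj())

print(w_enc.shape, w_dec.shape, w_cross.shape)  # (6, 6) (4, 4) (4, 6)
print(np.allclose(np.triu(w_dec, k=1), 0.0))    # True: no weight on later positions
```

Only the arguments change between the three calls; the function body is the same each time.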