Cross-Attention

Translation, summarisation, and question answering all share a shape: a decoder writing an output while constantly consulting a source it has already read. The attention mechanism was born to do exactly this — let a decoder look back at an encoded input. In the transformer it returns as a first-class sublayer with a name that says what it does: cross-attention. Where self-attention has a sequence attend to itself, cross-attention has one sequence (the decoder) attend to another (the encoder's output).

Self vs. cross: where Q, K, V come from, line by line

Both kinds of attention run the identical scaled dot-product formula. The only difference — and it is the whole difference — is which sequence supplies the queries, and which supplies the keys and values.

Step 1 — recall the one formula. Given queries Q, keys K, and values V, attention is always

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.
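A minimal NumPy sketch of this formula, for readers who like to see it run. The function names (`softmax`, `scaled_dot_product_attention`) and the toy shapes are illustrative assumptions, not something the text above prescribes.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Return (output, weights) for softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # shape (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # each row is a distribution over the keys
    return weights @ V, weights

# Toy inputs (assumed sizes): 3 queries, 5 keys/values, d_k = 4, value width 6.
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))
K = rng.normal(size=(5, 4))
V = rng.normal(size=(5, 6))

out, weights = scaled_dot_product_attention(Q, K, V)
print(out.shape, weights.shape)  # (3, 6) (3, 5)
```

Note that nothing in the function cares where Q, K, and V came from; that is exactly the freedom the next steps exploit.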

Step 2 — self-attention: one source for all three. In self-attention the sequence X attends to itself, so queries, keys, and values are all linear projections of the same X:

Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.

Every position mixes information drawn from its own sequence.
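As a rough sketch of Step 2, assuming random toy weights and a hypothetical helper named `self_attention`, the single-source wiring looks like this in NumPy:

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def self_attention(X, W_Q, W_K, W_V):
    """Queries, keys, and values are all projections of the same sequence X."""
    Q, K, V = X @ W_Q, X @ W_K, X @ W_V
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)
    return weights @ V, weights

# Toy sizes (assumptions): a sequence of 4 positions, model width 8, d_k = 8.
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 8))
W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))

out, weights = self_attention(X, W_Q, W_K, W_V)
print(weights.shape)  # (4, 4): square, because the sequence attends to itself
```

The weight matrix is square here because the queries and the keys index the same positions.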

Step 3 — cross-attention: split the sources. Now let the decoder's representations Y (length m) attend to the encoder's output Z (length n). The queries come from the decoder; the keys and values come from the encoder:

Q = Y W_Q \quad(\text{decoder}), \qquad K = Z W_K, \quad V = Z W_V \quad(\text{encoder}).
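Writing out the shapes makes the split explicit. Assume, purely for illustration, that Y and Z share a model width d and that the projections map to width d_k for queries and keys and d_v for values (d and d_v are symbols introduced here, not defined in the text above):

Y \in \mathbb{R}^{m \times d}, \quad Z \in \mathbb{R}^{n \times d} \;\Longrightarrow\; Q \in \mathbb{R}^{m \times d_k}, \quad K \in \mathbb{R}^{n \times d_k}, \quad V \in \mathbb{R}^{n \times d_v}, \quad Q K^{\top} \in \mathbb{R}^{m \times n}.

The score matrix has one row per decoder position and one column per source position, so m and n need not be equal.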

Step 4 — score across the bridge. The decoder query q_t = y_t W_Q is dotted with every encoder key k_j = z_j W_K, each score is scaled by \sqrt{d_k}, and a softmax over j turns the scores into a distribution over the n source positions:

\alpha_{t j} = \operatorname{softmax}_j\!\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right), \qquad \sum_{j=1}^{n} \alpha_{t j} = 1.
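A tiny worked instance of Step 4 may help. The numbers below are made up for illustration: one decoder query, n = 3 encoder keys, and d_k = 4, chosen so the scores come out to round values.

```python
import numpy as np

# Hypothetical toy values: one decoder query and three encoder keys, d_k = 4.
q_t = np.array([2.0, 0.0, 0.0, 0.0])
K = np.array([
    [2.0, 0.0, 0.0, 0.0],   # k_1 points the same way as q_t
    [0.0, 2.0, 0.0, 0.0],   # k_2 is orthogonal to q_t
    [0.0, 0.0, 2.0, 0.0],   # k_3 is orthogonal to q_t
])
d_k = K.shape[1]

scores = K @ q_t / np.sqrt(d_k)       # (q_t . k_j) / sqrt(d_k) gives [2, 0, 0]
exp = np.exp(scores - scores.max())   # subtract the max for numerical stability
alpha_t = exp / exp.sum()             # softmax over the 3 source positions

print(np.round(alpha_t, 3))           # approximately [0.787 0.107 0.107]
print(alpha_t.sum())                  # 1.0 up to floating-point error
```

The key aligned with the query takes most of the weight, while the two orthogonal keys split the remainder evenly.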

Step 5 — pull source information into the decoder. The output for decoder position t is the weighted sum of the encoder's values v_j = z_j W_V:

c_t = \sum_{j=1}^{n} \alpha_{t j}\, v_j.
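Steps 3 to 5 can be put together in a short NumPy sketch. The helper name `cross_attention`, the random weights, and the sizes (m = 3, n = 5, model width 8, d_k = 4) are assumptions made for the example. The last lines recompute c_t as the explicit sum over j to confirm it matches the matrix form.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def cross_attention(Y, Z, W_Q, W_K, W_V):
    """Decoder states Y supply the queries; encoder output Z supplies keys and values."""
    Q = Y @ W_Q                                  # (m, d_k), from the decoder
    K = Z @ W_K                                  # (n, d_k), from the encoder
    V = Z @ W_V                                  # (n, d_k) here, from the encoder
    alpha = softmax(Q @ K.T / np.sqrt(K.shape[-1]), axis=-1)   # (m, n)
    return alpha @ V, alpha

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4
Y = rng.normal(size=(m, d_model))   # stand-in decoder representations
Z = rng.normal(size=(n, d_model))   # stand-in encoder output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

C, alpha = cross_attention(Y, Z, W_Q, W_K, W_V)

# Step 5 for one decoder position t, written as the explicit weighted sum.
t = 1
c_t = sum(alpha[t, j] * (Z[j] @ W_V) for j in range(n))
print(alpha.shape, C.shape)     # (3, 5) (3, 4)
print(np.allclose(c_t, C[t]))   # True
```

The output has one row per decoder position, even though every value vector it is built from came from the encoder.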

Read that again: a vector living in the decoder (q_t) chooses how much of each source token's value to absorb, so each decoder position pulls in the parts of the source most relevant to the token it is about to produce. This is the same pattern as seq2seq attention (score, softmax, weighted sum), promoted from a bolt-on to a standard sublayer of the architecture.

Cross-attention runs scaled dot-product attention with its sources split across two sequences:

Q from the decoder, K and V from the encoder. Queries are Q = Y W_Q from the decoder representations, while keys and values are K = Z W_K and V = Z W_V from the encoder output — unlike self-attention, where all three come from one sequence.
Same scaled dot-product formula. The mechanism is unchanged: \operatorname{softmax}(Q K^{\top}/\sqrt{d_k})\,V, giving weights \alpha_{tj} over the n source positions and output c_t = \sum_j \alpha_{tj} v_j.
It bridges encoder → decoder. Cross-attention is the channel through which the encoded source flows into the decoder, so each output token can be conditioned on the most relevant parts of the input — the seq2seq attention mechanism as a built-in sublayer.

The full encoder–decoder transformer uses attention in three distinct places, and telling them apart is half of understanding the architecture:

Encoder self-attention. Q, K, V all from the source. No causal mask: the encoder is allowed to look both ways across the input it is reading.
Decoder masked self-attention. Q, K, V all from the (partial) output, but with a causal mask so each output position attends only to earlier output positions — preserving left-to-right generation.
Decoder → encoder cross-attention. Q from the decoder, K and V from the encoder. No causal mask is needed here: the entire source is already known, so a decoder position is free to attend to every source token. This is the layer that lets the translation actually depend on the sentence being translated.
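The three roles can be sketched with one function and different wiring. This is a simplified illustration under stated assumptions: the projection matrices are omitted (inputs are treated as already projected), the decoder states are used directly as cross-attention queries, and the names `attention`, `source`, and `target` are hypothetical.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def attention(Q, K, V, mask=None):
    """Scaled dot-product attention; mask is True where attending is allowed."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked cells get weight 0
    weights = softmax(scores, axis=-1)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d = 5, 3, 4
source = rng.normal(size=(n, d))   # stand-in encoder-side representations
target = rng.normal(size=(m, d))   # stand-in decoder-side representations

# 1. Encoder self-attention: Q, K, V from the source, no causal mask.
_, enc_w = attention(source, source, source)

# 2. Decoder masked self-attention: position t sees only positions <= t.
causal = np.tril(np.ones((m, m), dtype=bool))
_, dec_w = attention(target, target, target, mask=causal)

# 3. Cross-attention: Q from the decoder, K and V from the encoder, no causal mask.
_, cross_w = attention(target, source, source)

print(enc_w.shape, dec_w.shape, cross_w.shape)   # (5, 5) (3, 3) (3, 5)
print(np.allclose(np.triu(dec_w, k=1), 0.0))     # True: future positions are zeroed
```

Only the second call passes a mask, and only its weight matrix has a zeroed upper triangle; the cross-attention weights are strictly positive over every source position.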

Same arithmetic everywhere; only the wiring of Q, K, V (and whether a mask is present) changes. That economy — one operation, three roles — is a big part of why the transformer is so uniform and so easy to scale.

Watch one decoder step read the source

In a cross-attention heatmap, each row is a decoder position t and each column is an encoder (source) position j. A cell's brightness is the cross-attention weight \alpha_{t j}: how much of source token j decoder step t absorbs. Follow the row for any one decoder step: the weights form a peak over the source tokens it is reading and, being a softmax output, they always sum to 1 across the whole source. Notice there is no triangular blackout here: cross-attention needs no causal mask, because the full source is fair game.
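The grid described above can be approximated in plain text. The sketch below uses random stand-in queries and keys rather than a trained model, and `render_row` with its character ramp is a hypothetical helper invented for this example; denser characters stand for larger weights.

```python
import numpy as np

def softmax(scores, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def render_row(weights, shades=" .:-=+*#%@"):
    """Map each weight in [0, 1] to one character; later characters mean larger weights."""
    last = len(shades) - 1
    return "".join(shades[min(int(w * len(shades)), last)] for w in weights)

rng = np.random.default_rng(0)
m, n, d_k = 4, 6, 8
Q = rng.normal(size=(m, d_k))   # stand-in decoder queries
K = rng.normal(size=(n, d_k))   # stand-in encoder keys
alpha = softmax(Q @ K.T / np.sqrt(d_k), axis=-1)   # (m, n) cross-attention weights

# One printed line per decoder step t, one character per source position j.
for t, row in enumerate(alpha):
    print(f"t={t} |{render_row(row)}| sum={row.sum():.3f}")
```

Every printed row reports a sum of 1.000, and no row has a forced run of blanks at the end, which is the text-mode counterpart of the missing triangular blackout.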