Cross-Attention

Translation, summarisation, and question answering all share a shape: a decoder writing an output while constantly consulting a source it has already read. The attention mechanism was born to do exactly this — let a decoder look back at an encoded input. In the transformer it returns as a first-class sublayer with a name that says what it does: cross-attention. Where self-attention has a sequence attend to itself, cross-attention has one sequence (the decoder) attend to another (the encoder's output).

Self vs. cross: where Q, K, V come from, line by line

Both kinds of attention run the identical scaled dot-product formula. The only difference — and it is the whole difference — is which sequence supplies the queries, and which supplies the keys and values.

Step 1 — recall the one formula. Given queries Q, keys K, and values V, attention is always

\operatorname{Attention}(Q, K, V) = \operatorname{softmax}\!\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V.
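As a minimal sketch, the formula can be written in a few lines of NumPy. The function names `softmax` and `attention`, the toy shapes, and the random inputs are assumptions made for illustration; nothing here is tied to a particular library's attention API.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (num_queries, num_keys)
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights

# Toy inputs (illustrative values only).
rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 4))   # 3 queries, d_k = 4
K = rng.normal(size=(5, 4))   # 5 keys,    d_k = 4
V = rng.normal(size=(5, 2))   # 5 values,  d_v = 2
out, weights = attention(Q, K, V)
print(out.shape, weights.shape)   # (3, 2) (3, 5)
```

The number of queries is free to differ from the number of keys and values; only the key and value counts must match each other.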

Step 2 — self-attention: one source for all three. In self-attention the sequence X attends to itself, so queries, keys, and values are all linear projections of the same X:

Q = X W_Q, \qquad K = X W_K, \qquad V = X W_V.

Every position mixes information drawn from its own sequence.
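A small sketch of self-attention under the same formula, assuming random matrices as stand-ins for the learned projections W_Q, W_K, W_V and a toy sequence X of four positions:

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # stabilise before exp
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))
    return weights @ V, weights

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))            # one sequence of n positions

# Random stand-ins for the learned projection matrices.
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

# All three inputs are projections of the same X.
out, weights = attention(X @ W_Q, X @ W_K, X @ W_V)
print(weights.shape)   # (4, 4): every position scores every position of X
```

Because queries and keys come from the same sequence, the weight matrix is square.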

Step 3 — cross-attention: split the sources. Now let the decoder's representations Y (length m) attend to the encoder's output Z (length n). The queries come from the decoder; the keys and values come from the encoder:

Q = Y W_Q \quad(\text{decoder}), \qquad K = Z W_K, \quad V = Z W_V \quad(\text{encoder}).
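The split can be sketched as follows. The helper name `cross_attention` and the random projection matrices are illustrative assumptions; the point is only that Y feeds the queries while Z feeds both keys and values.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)   # stabilise before exp
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def cross_attention(Y, Z, W_Q, W_K, W_V):
    Q = Y @ W_Q                 # queries from the decoder, shape (m, d_k)
    K = Z @ W_K                 # keys from the encoder,    shape (n, d_k)
    V = Z @ W_V                 # values from the encoder,  shape (n, d_v)
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))   # (m, n)
    return weights @ V, weights

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4
Y = rng.normal(size=(m, d_model))            # decoder representations
Z = rng.normal(size=(n, d_model))            # encoder output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

out, weights = cross_attention(Y, Z, W_Q, W_K, W_V)
print(out.shape, weights.shape)   # (3, 4) (3, 5)
```

The output has one row per decoder position (m), while each row of weights spans the n source positions.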

Step 4 — score across the bridge. Decoder query q_t = y_t W_Q dots against every encoder key, scaled, then softmaxed into a distribution over the n source positions:

\alpha_{t j} = \operatorname{softmax}_j\!\left(\frac{q_t \cdot k_j}{\sqrt{d_k}}\right), \qquad \sum_{j=1}^{n} \alpha_{t j} = 1.

Step 5 — pull source information into the decoder. The output for decoder position t is the weighted sum of the encoder's values:

c_t = \sum_{j=1}^{n} \alpha_{t j}\, v_j.
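Steps 4 and 5 can be traced for a single decoder position. This sketch takes k_j = z_j W_K and v_j = z_j W_V as the rows of K and V, uses random stand-ins for the learned matrices, and picks t = 1 arbitrarily; the explicit loops mirror the sums in the equations rather than aiming for efficiency.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_model, d_k = 3, 5, 8, 4
Y = rng.normal(size=(m, d_model))            # decoder representations
Z = rng.normal(size=(n, d_model))            # encoder output
W_Q, W_K, W_V = (rng.normal(size=(d_model, d_k)) for _ in range(3))

t = 1                                        # decoder position to inspect
q_t = Y[t] @ W_Q                             # the query for position t
K = Z @ W_K                                  # row j is the key k_j
V = Z @ W_V                                  # row j is the value v_j

# Step 4: one scaled score per source position, then a softmax over j.
scores = np.array([q_t @ K[j] for j in range(n)]) / np.sqrt(d_k)
alpha = np.exp(scores - scores.max())
alpha /= alpha.sum()

# Step 5: the context vector is the alpha-weighted sum of the values.
c_t = sum(alpha[j] * V[j] for j in range(n))

print(alpha.round(3), alpha.sum())
```

The loop form and the matrix form agree: `c_t` equals `alpha @ V`, which is row t of the full cross-attention output.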

Read that again: a vector living in the decoder (q_t) chooses how much of each source token's value to absorb. Each decoder position pulls its own weighted mix of the source, shaped by what it needs right now. This is the same recipe as seq2seq attention (score, softmax, weighted sum), promoted from a bolt-on to a standard sublayer of the architecture.

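Cross-attention runs scaled dot-product attention with its sources split across two sequences:

\operatorname{Attention}(Y W_Q,\; Z W_K,\; Z W_V) = \operatorname{softmax}\!\left(\frac{(Y W_Q)(Z W_K)^{\top}}{\sqrt{d_k}}\right) Z W_V.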

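The full encoder–decoder transformer uses attention in three distinct places, and telling them apart is half of understanding the architecture: encoder self-attention (Q, K, V all from the source sequence, no causal mask), masked decoder self-attention (Q, K, V all from the target sequence, with a causal mask hiding future positions), and cross-attention (Q from the decoder, K and V from the encoder's output, no causal mask).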

Same arithmetic everywhere; only the wiring of Q, K, V (and whether a mask is present) changes. That economy — one operation, three roles — is a big part of why the transformer is so uniform and so easy to scale.
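One way to see the "same arithmetic, different wiring" point is to call a single function three times. In this sketch the `mask` argument, the `proj` helper (a random stand-in for a learned W_Q, W_K or W_V), and the toy sizes are assumptions. It also simplifies the real decoder, where the cross-attention queries come from the output of the masked self-attention sublayer rather than from the raw target sequence.

```python
import numpy as np

def attention(Q, K, V, mask=None):
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        # Blocked positions get -inf, so their softmax weight is exactly 0.
        scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights

rng = np.random.default_rng(0)
n, m, d = 5, 3, 4
src = rng.normal(size=(n, d))    # encoder-side sequence
tgt = rng.normal(size=(m, d))    # decoder-side sequence

def proj(S):
    # Random stand-in for a learned projection matrix.
    return S @ rng.normal(size=(d, d))

# Lower-triangular mask: position t may see positions 0..t only.
causal = np.tril(np.ones((m, m), dtype=bool))

_, w_enc = attention(proj(src), proj(src), proj(src))                # encoder self-attention
_, w_dec = attention(proj(tgt), proj(tgt), proj(tgt), mask=causal)   # masked decoder self-attention
_, w_cross = attention(proj(tgt), proj(src), proj(src))              # cross-attention

print(w_enc.shape, w_dec.shape, w_cross.shape)   # (5, 5) (3, 3) (3, 5)
```

Only the masked call produces zeros above the diagonal; the other two leave every position visible.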

Watch one decoder step read the source

Each row is a decoder position t; each column is an encoder (source) position j. A cell's brightness is the cross-attention weight \alpha_{t j}: how much of source token j decoder step t absorbs. Pick a decoder step and follow its row: the weights typically form a peak over the source tokens that step is reading and, being a softmax, they always sum to 1 across the whole source. Notice there is no triangular blackout here: cross-attention has no causal mask, because the full source is fair game.
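A rough text rendering of such a heatmap, assuming random stand-ins for the decoder queries and encoder keys (so the peaks carry no linguistic meaning) and an arbitrary character ramp for brightness:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, d_k = 3, 6, 4
Q = rng.normal(size=(m, d_k))    # stand-ins for decoder queries
K = rng.normal(size=(n, d_k))    # stand-ins for encoder keys

scores = Q @ K.T / np.sqrt(d_k)
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)    # softmax over each row

shades = " .:-=+*#%@"            # denser character = larger weight
for t, row in enumerate(weights):
    cells = "".join(shades[int(w * (len(shades) - 1))] for w in row)
    print(f"t={t} |{cells}| peak at j={row.argmax()}  sum={row.sum():.2f}")
```

Every row is a full distribution over the source, and no cell is forced to zero, which is what the absence of a causal mask looks like in the weights.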