Multi-Head Attention

One run of scaled dot-product attention gives the model a single point of view: every token looks at every other token through one set of query, key and value projections, and blends accordingly. That is powerful, but it is one relationship. A sentence has many at once — a verb agreeing with its subject, a pronoun bound to its antecedent, a word leaning on the one beside it. Forcing all of that through a single attention pattern is like reading with one eye.

Multi-head attention runs several attention operations in parallel, each with its own small projections, so the model can attend to different kinds of relationship simultaneously — and then it stitches the views back together. The surprise is that all of this costs about the same as a single full-width attention.

From one head to h heads, line by line

Start with a sequence of n token vectors stacked into a matrix X \in \mathbb{R}^{n \times d}, where d is the model width. A single attention head would project X to queries, keys and values and run one attention. We are going to do h of them at once.

Step 1 — pick the per-head width. Split the model dimension evenly across the h heads, so each head works in a smaller space of dimension

d_k = \frac{d}{h}.

With d = 512 and h = 8, each head lives in d_k = 64 dimensions. This single choice is what keeps the cost flat — hold onto it.

Step 2 — give each head its own small projections. Head i gets three learned matrices that map the full d-wide tokens down into its own d_k-dimensional view:

Q_i = X\,W_Q^{(i)}, \quad K_i = X\,W_K^{(i)}, \quad V_i = X\,W_V^{(i)}, \qquad W_Q^{(i)}, W_K^{(i)}, W_V^{(i)} \in \mathbb{R}^{d \times d_k}.

Because each projection lands in d_k = d/h dimensions rather than the full d, each head is a narrow attention, not a full one. Different W^{(i)} means each head can specialise in a different relationship.

Step 3 — each head attends independently. Run ordinary scaled dot-product attention inside head i, using its own Q_i, K_i, V_i:

\text{head}_i = \operatorname{Attention}(Q_i, K_i, V_i) = \operatorname{softmax}\!\left(\frac{Q_i K_i^{\top}}{\sqrt{d_k}}\right) V_i \;\in\; \mathbb{R}^{n \times d_k}.

The h heads share no weights and do not talk to each other here — they run side by side, each producing its own n \times d_k result.

Step 4 — concatenate the heads back to full width. Lay the h head outputs side by side. Each contributes d_k columns, and h \cdot d_k = h \cdot (d/h) = d, so the concatenation is exactly d wide again:

\text{Concat}(\text{head}_1, \dots, \text{head}_h) = \big[\,\text{head}_1 \;\oplus\; \text{head}_2 \;\oplus\; \cdots \;\oplus\; \text{head}_h\,\big] \;\in\; \mathbb{R}^{n \times d}.

Here \oplus is concatenation along the feature axis — the views are stacked, not yet combined.

Step 5 — mix the heads with an output projection. The concatenation is just the heads parked next to each other; a final learned matrix W_O \in \mathbb{R}^{d \times d} lets the model blend what the heads found into a single representation:

\operatorname{MultiHead}(X) = \text{Concat}(\text{head}_1, \dots, \text{head}_h)\,W_O \;\in\; \mathbb{R}^{n \times d}.

The output is n \times d — same shape as the input — so multi-head attention is a drop-in block you can stack.

Step 6 — count the cost. All h heads together project into h \cdot d_k = d dimensions for each of Q, K, V — the very same total width a single full attention would use. So the number of parameters and the amount of compute are ≈ the same as one full-width head; splitting into heads buys several points of view essentially for free.

Multi-head attention runs several narrow attentions in parallel and mixes their outputs:

The hope that different heads specialise is not just wishful — probing trained models shows it happening. Some heads turn out to be positional, attending almost entirely to the previous (or next) token, acting like a local smoothing window. Others track syntax, linking a verb to its subject or a determiner to its noun. Others handle coreference, pointing a pronoun back at the entity it refers to.

The most celebrated are induction heads: a head (often a pair working together) that spots a repeated pattern [A][B]\dots[A] and predicts [B] next — “the last time I saw A, it was followed by B, so do that again.” This simple copy-the-pattern circuit is widely believed to underpin a model's ability to learn from examples given in its prompt. None of this is hand-designed; it emerges because h independent heads are free to divide the labour.

See several heads at once

Each grid is one head's attention pattern: row = the token doing the looking (the query), column = the token being looked at (the key), and a brighter cell means more attention. Flip between heads and watch the pattern change character — one hugs the diagonal (always glancing one token back), another lights up a specific off-diagonal link, another spreads its attention wide. Same input, genuinely different points of view.

Where this goes

Multi-head attention is the “gather context” half of a transformer. The other half is a small per-token network that does the actual thinking, and both halves get wrapped in residual connections and normalization before being stacked into a transformer block.