The Full Transformer

Every part is now on the table: scaled dot-product attention, multi-head attention, layer norm, residual connections, and — from the last two pages — causal masking and cross-attention. Time to bolt them together into the original 2017 architecture, the encoder–decoder transformer of "Attention Is All You Need". It is two towers — an encoder that reads the source and a decoder that writes the output — connected by a cross-attention bridge, and capped by a head that turns the decoder's last state into next-token probabilities.

Tracing a translation, line by line

Follow a sentence from input embeddings to output probabilities. Write x = (x_1, \dots, x_n) for the source tokens and y = (y_1, \dots, y_m) for the (so-far generated) target tokens.

Step 1 — embed and add position. Tokens become vectors via an embedding table, and since attention is order-blind we inject order with a positional encoding P added to the embeddings:

E^{\text{enc}} = \operatorname{Embed}(x) + P, \qquad E^{\text{dec}} = \operatorname{Embed}(y) + P.
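
To make Step 1 concrete, here is a minimal NumPy sketch. It assumes the fixed sinusoidal encoding from the 2017 paper for P (the text above only says "a positional encoding", and learned position embeddings are a common alternative). The names `sinusoidal_positions` and `embed_with_position`, the random embedding table, and the toy sizes are illustrative choices, and the paper's scaling of embeddings by the square root of the model width is left out for brevity.

```python
import numpy as np

def sinusoidal_positions(length: int, d_model: int) -> np.ndarray:
    """Fixed sinusoidal encoding P, shape (length, d_model); d_model must be even."""
    pos = np.arange(length)[:, None]                # (length, 1)
    even_dims = np.arange(0, d_model, 2)[None, :]   # (1, d_model / 2): 0, 2, 4, ...
    angles = pos / np.power(10000.0, even_dims / d_model)
    P = np.zeros((length, d_model))
    P[:, 0::2] = np.sin(angles)   # even dimensions get the sine
    P[:, 1::2] = np.cos(angles)   # odd dimensions get the cosine
    return P

def embed_with_position(token_ids, table: np.ndarray) -> np.ndarray:
    """E = Embed(tokens) + P, one row per token."""
    token_ids = np.asarray(token_ids)
    d_model = table.shape[1]
    return table[token_ids] + sinusoidal_positions(len(token_ids), d_model)

# Toy setup (assumed sizes, random table standing in for learned embeddings).
rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8
table = rng.normal(size=(vocab_size, d_model))

x = [3, 17, 17, 42]                      # token 17 appears twice
E_enc = embed_with_position(x, table)    # shape (4, 8)
```

The repeated token 17 gets the same embedding row both times, but after adding P its two rows in `E_enc` differ. That difference is the only order signal the attention layers receive.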

Step 2 — the encoder stack. Run E^{\text{enc}} through N identical blocks. Each block is self-attention then a position-wise feed-forward network, each wrapped in a residual Add & Norm:

\text{Encoder block:}\quad \operatorname{AddNorm}\big(\text{SelfAttn}\big) \;\to\; \operatorname{AddNorm}\big(\text{FFN}\big).

Stacking N of these yields Z \in \mathbb{R}^{n \times d}, a context-rich representation of the source with one d-dimensional vector per input position (d is the model width). The encoder's self-attention is unmasked: reading benefits from looking both ways.
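
The encoder block can be sketched in a few lines of NumPy. This is a simplified illustration, not the paper's exact layer: it assumes a single attention head, no output projection after attention, a ReLU feed-forward network, and a layer norm without learned gain and bias. All function names, the parameter dictionary, and the sizes are hypothetical.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)   # subtract the max for stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    # Simplified: normalises each row, no learned gain or bias.
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single-head, unmasked scaled dot-product attention.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])   # (n, n)
    return softmax(scores) @ v

def ffn(x, W1, b1, W2, b2):
    # Position-wise: the same two-layer MLP is applied to every row.
    return np.maximum(0.0, x @ W1 + b1) @ W2 + b2

def encoder_block(x, p):
    x = layer_norm(x + self_attention(x, p["Wq"], p["Wk"], p["Wv"]))  # AddNorm(SelfAttn)
    x = layer_norm(x + ffn(x, p["W1"], p["b1"], p["W2"], p["b2"]))    # AddNorm(FFN)
    return x

def init_encoder_block(rng, d, d_ff):
    s = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(scale=s, size=(d, d)),
        "Wk": rng.normal(scale=s, size=(d, d)),
        "Wv": rng.normal(scale=s, size=(d, d)),
        "W1": rng.normal(scale=s, size=(d, d_ff)),
        "b1": np.zeros(d_ff),
        "W2": rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d)),
        "b2": np.zeros(d),
    }

def encode(E, blocks):
    Z = E
    for p in blocks:          # N blocks, each with its own weights
        Z = encoder_block(Z, p)
    return Z

rng = np.random.default_rng(0)
n, d, d_ff, N = 5, 8, 16, 2                 # toy sizes
E_enc = rng.normal(size=(n, d))             # stand-in for Embed(x) + P
blocks = [init_encoder_block(rng, d, d_ff) for _ in range(N)]
Z = encode(E_enc, blocks)                   # shape (n, d)
```

Because the attention here is unmasked, changing the last source token also changes the encoder output at the first position: every row of Z depends on the whole input.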

Step 3 — the decoder stack, with the extra sublayer. Run E^{\text{dec}} through N decoder blocks. Each block has three sublayers, every one wrapped in Add & Norm:

\text{Decoder block:}\quad \operatorname{AddNorm}\big(\underbrace{\text{MaskedSelfAttn}}_{\text{causal}}\big) \;\to\; \operatorname{AddNorm}\big(\underbrace{\text{CrossAttn}(\cdot,\, Z)}_{\text{reads }Z}\big) \;\to\; \operatorname{AddNorm}\big(\text{FFN}\big).

The first sublayer is causally masked self-attention: position t may attend only to output positions up to and including t, never to later ones. The second is cross-attention, whose queries come from the decoder and whose keys and values come from the encoder's Z. This is the bridge that lets the output depend on the source.
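
A decoder block can be sketched the same way. The example below is an illustration under simplifying assumptions: one attention head, no output projection, a bias-free ReLU feed-forward network, and a layer norm without learned parameters. The causal mask is implemented by setting disallowed scores to a large negative number before the softmax. All names and sizes are hypothetical.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def attention(q_in, kv_in, Wq, Wk, Wv, mask=None):
    # Queries come from q_in; keys and values come from kv_in.
    q, k, v = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -1e9)   # blocked pairs get ~zero weight
    return softmax(scores) @ v

def decoder_block(y, Z, p):
    m = y.shape[0]
    causal = np.tril(np.ones((m, m), dtype=bool))   # row t may see columns 0..t
    y = layer_norm(y + attention(y, y, *p["self"], mask=causal))  # masked self-attn
    y = layer_norm(y + attention(y, Z, *p["cross"]))              # cross-attn reads Z
    y = layer_norm(y + np.maximum(0.0, y @ p["W1"]) @ p["W2"])    # FFN
    return y

def init_decoder_block(rng, d, d_ff):
    s = 1.0 / np.sqrt(d)
    mats = lambda: tuple(rng.normal(scale=s, size=(d, d)) for _ in range(3))
    return {
        "self": mats(),    # (Wq, Wk, Wv) for masked self-attention
        "cross": mats(),   # (Wq, Wk, Wv) for cross-attention
        "W1": rng.normal(scale=s, size=(d, d_ff)),
        "W2": rng.normal(scale=1.0 / np.sqrt(d_ff), size=(d_ff, d)),
    }

rng = np.random.default_rng(0)
n, m, d, d_ff = 6, 4, 8, 16          # source length n, target length m (toy sizes)
Z = rng.normal(size=(n, d))          # stand-in for the encoder output
Y = rng.normal(size=(m, d))          # stand-in for Embed(y) + P
p = init_decoder_block(rng, d, d_ff)
out = decoder_block(Y, Z, p)         # shape (m, d)
```

Two properties are worth checking by hand: editing the last row of `Y` leaves every earlier row of `out` unchanged (the causal mask), while editing `Z` changes all rows (the cross-attention bridge). Note that the source and target lengths need not match, since the cross-attention scores have shape (m, n).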

Step 4 — project to the vocabulary and softmax. Take the decoder's final representation h_t at each position, project it to a score per vocabulary word with a linear map, and softmax into a next-token distribution:

p(y_{t+1} \mid y_{\le t},\, x) = \operatorname{softmax}\!\big(W_{\text{out}}\, h_t + b\big).

Sample or argmax a token, append it, and repeat: the decoder writes the translation one token at a time, each token conditioned on the whole source (through cross-attention) and on every token written so far (through masked self-attention). That is the entire model.
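
The output head and the generate-append-repeat loop can be sketched as follows. The head is the formula above, softmax(W_out h_t + b). Everything else is an assumption made so the example runs: `toy_decoder` is a placeholder for the real decoder stack, and the `bos_id` / `eos_id` start and stop tokens and the `max_new_tokens` cap are hypothetical conventions not specified in the text.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def next_token_probs(h_t, W_out, b):
    """p(y_{t+1} | y_{<=t}, x) = softmax(W_out h_t + b)."""
    return softmax(W_out @ h_t + b)

def greedy_decode(Z, decoder, W_out, b, bos_id, eos_id, max_new_tokens):
    y = [bos_id]                                   # start from a begin-of-sequence token
    for _ in range(max_new_tokens):
        H = decoder(y, Z)                          # (len(y), d) decoder states
        probs = next_token_probs(H[-1], W_out, b)  # only the last position is needed
        next_id = int(np.argmax(probs))            # greedy; sampling would also work
        y.append(next_id)
        if next_id == eos_id:                      # stop at end-of-sequence
            break
    return y

# Toy stand-ins so the loop can run end to end (not a real decoder).
rng = np.random.default_rng(0)
vocab_size, d = 12, 8
table = rng.normal(size=(vocab_size, d))

def toy_decoder(y_ids, Z):
    # Placeholder: target embeddings plus a summary of the source.
    return table[np.asarray(y_ids)] + Z.mean(axis=0)

Z = rng.normal(size=(5, d))                 # stand-in for the encoder output
W_out = rng.normal(size=(vocab_size, d))    # one row of weights per vocabulary entry
b = np.zeros(vocab_size)

tokens = greedy_decode(Z, toy_decoder, W_out, b,
                       bos_id=0, eos_id=1, max_new_tokens=6)
```

The loop recomputes all decoder states at every step for clarity. Practical implementations usually cache keys and values from earlier steps, but the control flow is the same.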

The 2017 transformer maps a source sequence to a target sequence with two stacks and an output head:

Encoder stack. Embeddings + positional encoding, then N blocks of [unmasked self-attention → Add & Norm → FFN → Add & Norm], producing a context-rich source representation Z.
Decoder stack. Embeddings + positional encoding, then N blocks of [masked self-attention → Add & Norm → cross-attention to Z → Add & Norm → FFN → Add & Norm] — three sublayers per block, one more than the encoder.
Embeddings + positional encoding. Both stacks add a positional encoding to the token embeddings, since attention by itself has no notion of token order.
Final linear + softmax. A linear projection W_{\text{out}} (with bias b) to vocabulary size, followed by a softmax, turns the decoder's output into next-token probabilities p(y_{t+1} \mid y_{\le t}, x).

The radical move of Vaswani et al. was a subtraction, not an addition: they removed recurrence entirely. Earlier recurrent sequence models processed tokens one after another, so the computation for token t had to wait for token t-1. That made them inherently sequential, hard to parallelise, and prone to forgetting across long gaps. Attention has no such dependency: every position attends to every other through a few matrix multiplies, so during training an entire sequence is processed at once. The title was a slight overstatement (you still need embeddings, an FFN, positional encodings, and norm) but the thesis held: with attention doing the heavy lifting, recurrence was optional. Removing it made training parallel across sequence positions (generation still proceeds one token at a time), which let models grow far larger and train on far more data than recurrent ones practically could. This was the architectural unlock behind the era of large language models that followed.
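
The contrast between the two computation patterns is easy to see in code. The sketch below assumes a plain tanh recurrent cell as the stand-in for "earlier sequence models" and a single unmasked attention head. Weights and sizes are arbitrary toy values.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 4
X = rng.normal(size=(n, d))              # one row per token

# Recurrent: h_t needs h_{t-1}, so the steps must run in order.
Wx = rng.normal(scale=0.5, size=(d, d))
Wh = rng.normal(scale=0.5, size=(d, d))
h = np.zeros(d)
states = []
for t in range(n):
    h = np.tanh(X[t] @ Wx + h @ Wh)      # cannot start before step t-1 finishes
    states.append(h)
rnn_states = np.stack(states)            # (n, d)

# Self-attention: no loop over positions, only matrix products and a row-wise softmax.
Wq, Wk, Wv = (rng.normal(scale=0.5, size=(d, d)) for _ in range(3))
Q, K, V = X @ Wq, X @ Wk, X @ Wv
scores = Q @ K.T / np.sqrt(d)                        # all n*n pairs at once
scores = scores - scores.max(axis=-1, keepdims=True)
weights = np.exp(scores)
weights = weights / weights.sum(axis=-1, keepdims=True)
attn_out = weights @ V                               # (n, d)

# Any single row can be computed on its own, without the rows before it.
row = Q[3] @ K.T / np.sqrt(d)
row = np.exp(row - row.max())
row_out = (row / row.sum()) @ V
```

`row_out` equals `attn_out[3]`: each position's output depends on the inputs but not on any other position's output, which is what allows all positions to be computed in parallel.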

The whole architecture, drawn

The encoder column (left) reads the source: embed + position, then self-attention and an FFN, each with Add & Norm, stacked N times. Its output Z flows along the bridge into the decoder column (right). Each decoder block has the extra middle sublayer, cross-attention, that drinks from Z; below it, the masked self-attention; above, the FFN. The decoder's top feeds the output head: a linear projection then a softmax over the vocabulary.