Every part is now on the table: scaled dot-product attention, multi-head attention, layer
norm, residual connections, and — from the last two pages — causal masking and
cross-attention.
Time to bolt them together into the original 2017 architecture, the encoder–decoder
transformer of "Attention Is All You Need". It is two towers — an encoder that reads
the source and a decoder that writes the output — connected by a cross-attention bridge, and
capped by a head that turns the decoder's last state into next-token probabilities.
Tracing a translation, line by line
Follow a sentence from input embeddings to output probabilities. Write
x = (x_1, \dots, x_n) for the source tokens and
y = (y_1, \dots, y_m) for the (so-far generated) target tokens.
Step 1 — embed and add position. Tokens become vectors via an embedding
table, and since attention is order-blind we inject order with a positional encoding
P added to the embeddings:
E^{\text{enc}} = \operatorname{Embed}(x) + P, \qquad E^{\text{dec}} = \operatorname{Embed}(y) + P.
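Step 1 can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: it assumes the sinusoidal encoding from the 2017 paper for P, an even model width, a single embedding table shared by source and target, and made-up token ids and sizes. The paper also scales embeddings by the square root of the model width, which is omitted here.

```python
import numpy as np

def sinusoidal_positions(length, d_model):
    """Sinusoidal positional encoding P, shape (length, d_model). Assumes d_model is even."""
    pos = np.arange(length)[:, None]            # (length, 1)
    two_i = np.arange(0, d_model, 2)[None, :]   # even dimension indices 0, 2, 4, ...
    angles = pos / (10000.0 ** (two_i / d_model))
    P = np.zeros((length, d_model))
    P[:, 0::2] = np.sin(angles)                 # even dimensions use sine
    P[:, 1::2] = np.cos(angles)                 # odd dimensions use cosine
    return P

rng = np.random.default_rng(0)
vocab_size, d_model = 50, 8                     # toy sizes, chosen for illustration
embed_table = rng.normal(size=(vocab_size, d_model))

x = np.array([5, 17, 3, 42])                    # hypothetical source token ids, n = 4
y = np.array([1, 9])                            # hypothetical target ids so far, m = 2

# Embed(x) + P and Embed(y) + P: one vector per token, with order information added.
E_enc = embed_table[x] + sinusoidal_positions(len(x), d_model)   # (n, d_model)
E_dec = embed_table[y] + sinusoidal_positions(len(y), d_model)   # (m, d_model)
```

Because P depends only on position, the same function serves both stacks even though the source and target lengths differ.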
Step 2 — the encoder stack. Run E^{\text{enc}}
through N identical blocks. Each block is
self-attention followed by a position-wise feed-forward network, each wrapped in a
residual Add & Norm, meaning the sublayer's output is added to its input and the sum is
layer-normalised:
\text{Encoder block:}\quad \operatorname{AddNorm}\big(\text{SelfAttn}\big) \;\to\; \operatorname{AddNorm}\big(\text{FFN}\big).
Stacking N of these yields
Z \in \mathbb{R}^{n \times d} — a context-rich representation of
the source, one vector per input position. The encoder's self-attention is
unmasked: reading benefits from looking both ways.
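A minimal NumPy sketch of the encoder stack, under simplifying assumptions: a single attention head instead of multi-head attention, layer norm without a learned gain and bias, a ReLU feed-forward network, and random weights. The names `add_norm`, `encoder_block` and `init_block` are illustrative, not taken from any library.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)     # subtract the max for numerical stability
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(h, eps=1e-5):
    # Normalise each position's vector; learned gain and bias are omitted.
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def self_attention(h, Wq, Wk, Wv):
    # Unmasked scaled dot-product attention, single head.
    Q, K, V = h @ Wq, h @ Wk, h @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

def ffn(h, W1, b1, W2, b2):
    # Position-wise feed-forward network with a ReLU in the middle.
    return np.maximum(0.0, h @ W1 + b1) @ W2 + b2

def add_norm(h, sublayer):
    # Residual connection followed by layer norm.
    return layer_norm(h + sublayer(h))

def encoder_block(h, p):
    h = add_norm(h, lambda u: self_attention(u, p["Wq"], p["Wk"], p["Wv"]))
    h = add_norm(h, lambda u: ffn(u, p["W1"], p["b1"], p["W2"], p["b2"]))
    return h

def init_block(rng, d, d_ff):
    # Random toy parameters; a trained model would learn these.
    s = 1.0 / np.sqrt(d)
    return {
        "Wq": rng.normal(size=(d, d)) * s,
        "Wk": rng.normal(size=(d, d)) * s,
        "Wv": rng.normal(size=(d, d)) * s,
        "W1": rng.normal(size=(d, d_ff)) * s, "b1": np.zeros(d_ff),
        "W2": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff), "b2": np.zeros(d),
    }

rng = np.random.default_rng(0)
n, d, d_ff, N = 5, 8, 32, 2                     # toy sizes, chosen for illustration
blocks = [init_block(rng, d, d_ff) for _ in range(N)]

Z = rng.normal(size=(n, d))                     # stands in for the embedded source
for p in blocks:
    Z = encoder_block(Z, p)                     # shape stays (n, d) through every block
```

Each block maps an (n, d) array to another (n, d) array, which is what makes stacking N of them straightforward.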
Step 3 — the decoder stack, with the extra sublayer. Run
E^{\text{dec}} through N decoder blocks.
Each block has three sublayers, every one wrapped in Add & Norm:
\text{Decoder block:}\quad \operatorname{AddNorm}\big(\underbrace{\text{MaskedSelfAttn}}_{\text{causal}}\big) \;\to\; \operatorname{AddNorm}\big(\underbrace{\text{CrossAttn}(\cdot,\, Z)}_{\text{reads }Z}\big) \;\to\; \operatorname{AddNorm}\big(\text{FFN}\big).
The first sublayer is causally masked self-attention: each output position may attend
only to itself and to earlier positions, never to later ones.
The second is cross-attention whose queries come from the decoder and whose keys and values
come from the encoder's Z — the bridge that lets the output depend
on the source.
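The three sublayers can be sketched as follows. This is a simplified illustration with a single attention head, layer norm without learned parameters, a bias-free ReLU feed-forward network, and random weights; the parameter layout in the dictionary `p` is an assumption made for this example. The final lines check the causal property: perturbing the last target position leaves every earlier output unchanged.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(h, eps=1e-5):
    mu = h.mean(axis=-1, keepdims=True)
    var = h.var(axis=-1, keepdims=True)
    return (h - mu) / np.sqrt(var + eps)

def attention(q_in, kv_in, Wq, Wk, Wv, mask=None):
    # Queries come from q_in; keys and values come from kv_in.
    Q, K, V = q_in @ Wq, kv_in @ Wk, kv_in @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    if mask is not None:
        scores = np.where(mask, scores, -np.inf)   # blocked pairs get zero weight
    return softmax(scores) @ V

def decoder_block(h, Z, p):
    m = h.shape[0]
    causal = np.tril(np.ones((m, m), dtype=bool))  # position t sees positions <= t
    h = layer_norm(h + attention(h, h, *p["self"], mask=causal))  # masked self-attention
    h = layer_norm(h + attention(h, Z, *p["cross"]))              # cross-attention reads Z
    h = layer_norm(h + np.maximum(0.0, h @ p["W1"]) @ p["W2"])    # FFN (biases omitted)
    return h

rng = np.random.default_rng(0)
n, m, d, d_ff = 5, 4, 8, 16                        # toy sizes, chosen for illustration
p = {
    "self": tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)),
    "cross": tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3)),
    "W1": rng.normal(size=(d, d_ff)) / np.sqrt(d),
    "W2": rng.normal(size=(d_ff, d)) / np.sqrt(d_ff),
}
Z = rng.normal(size=(n, d))                        # stands in for the encoder output
h = rng.normal(size=(m, d))                        # stands in for the embedded target
out = decoder_block(h, Z, p)                       # (m, d)

# Causality check: change only the last target position.
h2 = h.copy()
h2[-1] += rng.normal(size=d)
out2 = decoder_block(h2, Z, p)
assert np.allclose(out[:-1], out2[:-1])            # earlier positions are unaffected
```

Note that the cross-attention call passes no mask: every decoder position may read every source position.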
Step 4 — project to the vocabulary and softmax. Take the decoder's final
representation h_t at each position, project it to one score (a logit)
per vocabulary token with a linear map, and apply a softmax to obtain a
next-token distribution:
p(y_{t+1} \mid y_{\le t},\, x) = \operatorname{softmax}\!\big(W_{\text{out}}\, h_t + b\big).
Sample or argmax a token, append it, and repeat: the decoder writes the translation one token
at a time, each token conditioned on the whole source (through cross-attention) and on every
token written so far (through masked self-attention). That is the entire model.
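The generation loop can be sketched with greedy (argmax) decoding. To keep the example short, `toy_decoder` is a hypothetical placeholder that stands in for the real decoder stack; only the output head and the append-and-repeat loop are meant to be illustrative. The token ids for start and end of sequence, the sizes, and the random weights are all assumptions.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

def next_token_probs(h_t, W_out, b):
    # softmax(W_out h_t + b): one probability per vocabulary token.
    return softmax(W_out @ h_t + b)

def toy_decoder(y_ids, Z, embed):
    # Placeholder, NOT a real decoder stack: it only mixes target embeddings
    # with the mean source vector so the loop has something to run on.
    return embed[y_ids] + Z.mean(axis=0)           # (m, d)

def greedy_decode(Z, embed, W_out, b, bos_id, eos_id, max_len=10):
    y = [bos_id]                                   # start from a begin-of-sequence token
    while len(y) < max_len:
        H = toy_decoder(np.array(y), Z, embed)     # re-run the decoder on the prefix
        probs = next_token_probs(H[-1], W_out, b)  # distribution from the last position
        nxt = int(np.argmax(probs))                # greedy choice; sampling also works
        y.append(nxt)
        if nxt == eos_id:                          # stop once end-of-sequence is emitted
            break
    return y

rng = np.random.default_rng(0)
vocab_size, d, n = 12, 8, 5                        # toy sizes, chosen for illustration
embed = rng.normal(size=(vocab_size, d))
W_out = rng.normal(size=(vocab_size, d))
b = np.zeros(vocab_size)
Z = rng.normal(size=(n, d))                        # stands in for the encoder output

tokens = greedy_decode(Z, embed, W_out, b, bos_id=0, eos_id=1, max_len=10)
```

The encoder output Z is computed once and reused at every step; only the target prefix grows.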
The 2017 transformer maps a source sequence to a target sequence with two stacks and an output
head:
- Encoder stack. Embeddings + positional encoding, then N blocks of [unmasked
  self-attention → Add & Norm → FFN → Add & Norm], producing a context-rich source
  representation Z.
- Decoder stack. Embeddings + positional encoding, then N blocks of [masked
  self-attention → Add & Norm → cross-attention to Z → Add & Norm → FFN → Add & Norm]:
  three sublayers per block, one more than the encoder.
- Embeddings + positional encoding. Both stacks add a positional encoding to token
  embeddings, since attention by itself has no notion of token order.
- Final linear + softmax. A linear projection W_{\text{out}} to vocabulary size followed
  by a softmax turns the decoder's output into next-token probabilities
  p(y_{t+1} \mid y_{\le t}, x).
The radical move of Vaswani et al. was a subtraction, not an addition: they removed
recurrence entirely. Earlier recurrent sequence models processed tokens one after another, so
the computation for token t had to wait for token
t-1: inherently sequential, hard to parallelise, and prone to
forgetting across long gaps. Attention has no such dependency: every position attends to
every other through a few matrix multiplications, so during training an entire sequence is
processed at once. The
title was a slight overstatement (you still need embeddings, an FFN, positional encodings,
and norm) but the thesis held: with attention doing the heavy lifting, recurrence was
optional. Removing it made training parallel across sequence positions (generation still
proceeds one token at a time), which let models grow far larger and train on far more data
than recurrent ones ever could: the architectural unlock behind the entire era of large
language models that followed.
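The parallelism claim can be checked on a toy example. The sketch below, which assumes random queries, keys and values and a single head, computes causally masked attention for all positions in one shot and then again one position at a time over each prefix, and confirms that the two agree.

```python
import numpy as np

def softmax(s, axis=-1):
    s = s - s.max(axis=axis, keepdims=True)
    e = np.exp(s)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
m, d = 6, 4                                        # toy sizes, chosen for illustration
Q, K, V = (rng.normal(size=(m, d)) for _ in range(3))

# Parallel: all positions at once, with a causal mask hiding later positions.
scores = Q @ K.T / np.sqrt(d)
mask = np.tril(np.ones((m, m), dtype=bool))
parallel = softmax(np.where(mask, scores, -np.inf)) @ V

# Sequential: position t attends over its own prefix 0..t, one step at a time.
sequential = np.stack([
    softmax(Q[t] @ K[: t + 1].T / np.sqrt(d)) @ V[: t + 1]
    for t in range(m)
])

assert np.allclose(parallel, sequential)           # same result, computed in one pass
```

A recurrent model has no equivalent of the first version, because its state at step t is defined in terms of its state at step t-1.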