The Transformer Block

We now have every part. Multi-head attention lets each token gather context; a feed-forward block lets each token think; and the Add & Norm wrapper makes both safe to stack. Bolt them together in the right order and you get the transformer block: the single unit that, repeated many times, makes up the body of the model.

It is a strikingly small recipe for something so capable: gather context, think, repeat.

Assembling one block, line by line

Let the input be a whole sequence, X \in \mathbb{R}^{n \times d} — n tokens, each a d-wide vector. We build the block as two wrapped sublayers, using the pre-LN wrapper that modern models default to.

Step 1 — the attention sublayer, wrapped. Run multi-head self-attention on a normalised copy of the stream, and add the result back:

Z = X + \operatorname{MultiHead}\big(\operatorname{LayerNorm}(X)\big).

After this line, every token has mixed in information from every other token. The shape is unchanged: Z \in \mathbb{R}^{n \times d}.
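As a rough illustration of Step 1, here is a minimal NumPy sketch of the wrapped attention sublayer. The function names (layer_norm, multi_head, attention_sublayer), the random weights, and the sizes n = 5, d = 8 with two heads are all assumptions made for the example; the LayerNorm here also omits its learned gain and bias to stay short.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token (row) to zero mean and unit variance.
    # Learned gain and bias are omitted to keep the sketch short.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    # Numerically stable softmax over the last axis.
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def multi_head(x, Wq, Wk, Wv, Wo, num_heads):
    n, d = x.shape
    d_head = d // num_heads

    def split(t):
        # Split the d columns into heads: (num_heads, n, d_head).
        return t.reshape(n, num_heads, d_head).transpose(1, 0, 2)

    q, k, v = split(x @ Wq), split(x @ Wk), split(x @ Wv)
    # Scaled dot-product attention inside each head: weights are (h, n, n).
    weights = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d_head))
    heads = weights @ v  # (h, n, d_head)
    # Concatenate the heads back to (n, d), then apply the output projection.
    concat = heads.transpose(1, 0, 2).reshape(n, d)
    return concat @ Wo

def attention_sublayer(x, params, num_heads):
    # Pre-LN wrapper: Z = X + MultiHead(LayerNorm(X)).
    return x + multi_head(layer_norm(x), *params, num_heads=num_heads)

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
n, d, num_heads = 5, 8, 2
X = rng.normal(size=(n, d))
params = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(4)]

Z = attention_sublayer(X, params, num_heads)
print(Z.shape)  # (5, 8)
```

The residual add is the only place the raw X appears; the attention itself only ever sees the normalised copy.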

Step 2 — the feed-forward sublayer, wrapped. Feed Z through the same kind of wrapper, this time around the position-wise FFN:

Y = Z + \operatorname{FFN}\big(\operatorname{LayerNorm}(Z)\big).

After this line, every token has been transformed in place. Again the shape is unchanged: Y \in \mathbb{R}^{n \times d}.
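Step 2 can be sketched the same way. The snippet below is a minimal NumPy version under a few assumptions that are not in the text: a ReLU non-linearity, a hidden width of 4d, zero biases, and a LayerNorm without learned gain and bias. It also checks the "in place" claim: perturbing one token should leave every other token's output untouched.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalise each token (row); learned gain and bias omitted for brevity.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def ffn(x, W1, b1, W2, b2):
    # Expand each token to the hidden width, apply ReLU, project back to d.
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

def ffn_sublayer(z, params):
    # Pre-LN wrapper: Y = Z + FFN(LayerNorm(Z)).
    return z + ffn(layer_norm(z), *params)

# Illustrative sizes and random weights (assumptions, not trained values).
rng = np.random.default_rng(0)
n, d = 5, 8
d_hidden = 4 * d  # assumed expansion factor
Z = rng.normal(size=(n, d))
params = (
    rng.normal(scale=d ** -0.5, size=(d, d_hidden)), np.zeros(d_hidden),
    rng.normal(scale=d_hidden ** -0.5, size=(d_hidden, d)), np.zeros(d),
)

Y = ffn_sublayer(Z, params)

# Position-wise check: change token 0 only, then compare the other tokens.
Z_perturbed = Z.copy()
Z_perturbed[0, 0] += 1.0
Y_perturbed = ffn_sublayer(Z_perturbed, params)
print(Y.shape, np.allclose(Y[1:], Y_perturbed[1:]))  # (5, 8) True
```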

Step 3 — name the block. The two wrapped sublayers, in this order, are one transformer block:

\operatorname{Block}(X) = \underbrace{\big(\text{FFN sublayer}\big)}_{\text{think}} \circ \underbrace{\big(\text{attention sublayer}\big)}_{\text{gather}}(X).

Attention first (gather context), feed-forward second (process it) — each wrapped in Add & Norm.

Step 4 — track the shape. The input was n \times d; after attention it is n \times d; after the FFN it is n \times d. The block is a map

\operatorname{Block}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}.

Input shape equals output shape. That is the whole reason blocks compose.

Step 5 — stack N identical blocks. Because each block maps n \times d \to n \times d, you can feed one block's output straight into the next. Chain N of them (each with its own weights):

H_0 = X, \qquad H_{\ell} = \operatorname{Block}_{\ell}(H_{\ell-1}), \qquad \ell = 1, \dots, N.

The sequence keeps its n \times d shape the entire way up the tower; only the contents of the vectors get richer with depth. A “12-layer” or “96-layer” transformer is exactly this: N blocks of identical structure, each with its own weights, stacked one on top of the next.
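The recurrence in Step 5 is just a loop. The sketch below makes that concrete with a deliberately compact block (single-head attention, ReLU FFN of width 4d, no biases, no learned LayerNorm parameters). The helper names init_block, block and transformer_stack, and the sizes n = 5, d = 8, N = 12, are hypothetical choices for illustration.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(scores):
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

def init_block(rng, d):
    # Hypothetical parameter layout: one attention head and a 4d-wide FFN.
    def w(rows, cols):
        return rng.normal(scale=rows ** -0.5, size=(rows, cols))
    return {"Wq": w(d, d), "Wk": w(d, d), "Wv": w(d, d), "Wo": w(d, d),
            "W1": w(d, 4 * d), "W2": w(4 * d, d)}

def block(x, p):
    d = x.shape[-1]
    # Attention sublayer: Z = X + Attention(LayerNorm(X)).
    h = layer_norm(x)
    weights = softmax((h @ p["Wq"]) @ (h @ p["Wk"]).T / np.sqrt(d))
    z = x + weights @ (h @ p["Wv"]) @ p["Wo"]
    # Feed-forward sublayer: Y = Z + FFN(LayerNorm(Z)).
    return z + np.maximum(0.0, layer_norm(z) @ p["W1"]) @ p["W2"]

def transformer_stack(x, blocks):
    h = x                      # H_0 = X
    for p in blocks:           # H_l = Block_l(H_{l-1})
        h = block(h, p)
        assert h.shape == x.shape  # shape is preserved at every level
    return h

rng = np.random.default_rng(0)
n, d, N = 5, 8, 12
X = rng.normal(size=(n, d))
blocks = [init_block(rng, d) for _ in range(N)]  # each block has its own weights
H = transformer_stack(X, blocks)
print(H.shape)  # (5, 8)
```

Because the shape check passes inside the loop for every level, N could be changed to any positive integer without touching the rest of the code.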

Step 6 — read the rhythm. Every block gives each token one more chance to look around (attention) and then reconsider (FFN), reading from and writing to the same residual stream. Depth is how many rounds of gather-then-think the model gets; width d is how much each token can hold. Scaling a transformer is scaling these two numbers.

A transformer block is two Add & Norm-wrapped sublayers, and a transformer is a stack of them:

Attention sublayer. Z = X + \operatorname{MultiHead}(\operatorname{LayerNorm}(X)) — every token gathers context.
Feed-forward sublayer. Y = Z + \operatorname{FFN}(\operatorname{LayerNorm}(Z)) — every token is transformed in place.
Shape preserved. The block maps \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}; the sequence stays n \times d throughout.
Stack N. Chaining N identical-in-shape blocks (each with its own weights), H_{\ell} = \operatorname{Block}_{\ell}(H_{\ell-1}), gives a deep transformer.

The cleanest way to see a deep transformer is as one long residual stream: a single n \times d tensor running from the bottom of the tower to the top, serving as the model's working memory and as a shared bus between sublayers. Each sublayer reads a normalised view of the stream, computes something, and adds its contribution back. Attention writes gathered context; the FFN writes its per-token conclusions. No sublayer replaces the stream; each one only adds to it. Information laid down by an early block can still be read by a late one, because the bus carries it the whole way.
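One way to check the residual-stream picture numerically: if every sublayer only adds to the stream, the final stream must equal the input plus the sum of everything that was written. The sketch below uses toy stand-in sublayers (a random linear map followed by tanh) rather than real attention or FFN layers; make_sublayer and the sizes are assumptions made for the example.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_sublayer(rng, d):
    # Toy stand-in for attention or the FFN: any function of the normalised view.
    W = rng.normal(scale=d ** -0.5, size=(d, d))
    return lambda view: np.tanh(view @ W)

rng = np.random.default_rng(0)
n, d, num_sublayers = 5, 8, 6
X = rng.normal(size=(n, d))
sublayers = [make_sublayer(rng, d) for _ in range(num_sublayers)]

stream = X.copy()
writes = []
for f in sublayers:
    update = f(layer_norm(stream))  # read a normalised view of the stream
    writes.append(update)
    stream = stream + update        # add to the stream, never replace it

# The final stream is the input plus the sum of all sublayer writes.
print(np.allclose(stream, X + sum(writes)))  # True
```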

This view makes the two scaling knobs concrete. Depth (N blocks) is how many read-think-write rounds the stream undergoes; width (d) is the bandwidth of the bus, that is, how much each token can carry. Bigger models typically grow both, and the parameter count grows roughly as N \cdot (\,\text{attention} + \text{FFN}\,) \approx N \cdot 12 d^2: about 4d^2 per block for attention's four d \times d projections, plus 8d^2 per block for the feed-forward block. This estimate assumes the common FFN hidden width of 4d and ignores biases, LayerNorm parameters, and the embedding tables. The famous large language models are, structurally, little more than a tall stack of this one humble block.
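The 12 d^2 estimate is easy to turn into a small calculator. The sketch below assumes an FFN hidden width of ffn_mult times d (4 by default) and, like the estimate, ignores biases, LayerNorm parameters, and embeddings. The function names and the example sizes d = 768, N = 12 are picked purely for illustration.

```python
def block_params(d, ffn_mult=4):
    # Attention: W_Q, W_K, W_V, W_O, each d x d.
    attention = 4 * d * d
    # FFN: one d x (ffn_mult * d) matrix up, one (ffn_mult * d) x d matrix down.
    ffn = 2 * d * (ffn_mult * d)
    return attention + ffn

def stack_params(num_blocks, d, ffn_mult=4):
    # Every block has its own weights, so the count scales linearly with depth.
    return num_blocks * block_params(d, ffn_mult)

print(block_params(768))      # 7077888
print(stack_params(12, 768))  # 84934656
```

Doubling d multiplies the count by four, while doubling N only doubles it, which is why width is the more expensive knob under this estimate.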

Build the tower

Figure: on the left, one block laid out in full, with attention wrapped in Add & Norm, then the feed-forward block wrapped in Add & Norm, and the residual stream running straight up the side. On the right, the same block repeated N times into a tower. Every rung is the same recipe, and the sequence keeps its n \times d shape from bottom to top.

What you've built

That is a transformer. Embed the tokens, add positional information, run them up a stack of these blocks, and read off the top. Encoders, decoders, and large language models all share this backbone. Everything beyond here is variation on the theme: how the attention is masked, how the blocks are arranged, and how high the tower goes.