Add & Norm

Each sublayer of a transformer — the attention, the feed-forward block — is not used bare. It is always wrapped in two things: a residual connection and a layer normalization. This little wrapper — universally called Add & Norm — is what makes it possible to stack the sublayers dozens deep without training falling apart.

The residual is the gradient highway; the normalization keeps the activations from drifting to wild scales. Two jobs, one wrapper, repeated everywhere.

The wrapper, line by line

Let x be a token vector entering a sublayer, and let \operatorname{Sublayer}(\cdot) be whichever sublayer we are wrapping (attention or FFN). We build the wrapped output one piece at a time.

Step 1 — run the sublayer. Compute the sublayer's transformation of the input:

F(x) = \operatorname{Sublayer}(x).

Step 2 — add the input back (the residual). Instead of replacing x with F(x), add the change onto the original:

x + F(x).

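Exactly as for a plain residual block, the bare wire carrying x across gives the gradient a path with a built-in +1: the Jacobian of x + F(x) with respect to x is I + \partial F / \partial x, so the identity term carries the gradient straight back to earlier layers however deep the stack is.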
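The built-in +1 can be checked numerically. The sketch below is only an illustration: the toy sublayer tanh(Wx), the weights W, and the helper jacobian are hypothetical stand-ins, not part of any real model. It compares a finite-difference Jacobian of x + F(x) with the identity plus the Jacobian of F alone.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
W = rng.normal(scale=0.5, size=(d, d))  # hypothetical toy weights

def sublayer(x):
    # Stand-in for attention or the FFN: any map from R^d to R^d.
    return np.tanh(W @ x)

def jacobian(f, x, eps=1e-6):
    # Central finite differences, one input coordinate at a time.
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)  # column i holds d f / d x_i

x = rng.normal(size=d)
J_sub = jacobian(sublayer, x)
J_res = jacobian(lambda v: v + sublayer(v), x)

# The skip connection contributes the identity matrix on top of J_sub.
identity_gap = np.max(np.abs(J_res - (np.eye(d) + J_sub)))
print(identity_gap)
```

The printed gap should be at the level of floating-point noise, whatever the sublayer does: the identity term comes from the wire, not from F.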

Step 3 — normalize (the original recipe). The transformer as first proposed put the normalization after the addition. This is post-LN:

\operatorname{out} = \operatorname{LayerNorm}\big(x + \operatorname{Sublayer}(x)\big).

Add, then norm: hence “Add & Norm.” Every sublayer output is re-standardized (zero mean and unit variance across its d features, before LayerNorm's learned gain and bias) before the next sublayer sees it.
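A minimal sketch of the post-LN wrapper, under stated assumptions: layer_norm here omits the learnable gain and bias, and toy_sublayer with its weights W is a hypothetical stand-in for attention or the FFN.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize over the feature axis; learnable gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln(x, sublayer):
    # Add, then norm: out = LayerNorm(x + Sublayer(x)).
    return layer_norm(x + sublayer(x))

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))              # hypothetical toy weights
toy_sublayer = lambda x: np.tanh(x @ W)  # stand-in for attention or the FFN

x = 10.0 * rng.normal(size=(3, d))       # three token vectors at a large scale
out = post_ln(x, toy_sublayer)

print(out.mean(axis=-1))  # per-token mean after the wrapper
print(out.std(axis=-1))   # per-token standard deviation after the wrapper
```

Whatever scale x arrives at, each token leaves the wrapper standardized, which is the sense in which post-LN re-scales the stream at every sublayer.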

Step 4 — move the norm inside (the modern default). Practitioners found post-LN fragile in very deep stacks. The fix is to normalize the input to the sublayer instead, and leave the addition un-normalised. This is pre-LN:

\operatorname{out} = x + \operatorname{Sublayer}\big(\operatorname{LayerNorm}(x)\big).

Norm first, then run the sublayer, then add. The two differ only in where the LayerNorm sits relative to the addition — but that placement decides how stably the network trains.
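The pre-LN wiring can be sketched the same way. As before, this is an illustration under assumptions: layer_norm has no learnable parameters, and toy_sublayer with its weights W is a hypothetical stand-in.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize over the feature axis; learnable gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_ln(x, sublayer):
    # Norm, then sublayer, then add: out = x + Sublayer(LayerNorm(x)).
    return x + sublayer(layer_norm(x))

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))              # hypothetical toy weights
toy_sublayer = lambda x: np.tanh(x @ W)  # stand-in for attention or the FFN

x = 10.0 * rng.normal(size=(3, d))
out = pre_ln(x, toy_sublayer)

# What the sublayer added to the stream; x itself passes through unchanged.
write = out - x
```

Here the sublayer only ever sees a normalized copy of x, while the output keeps the original x plus the sublayer's write. Nothing re-scales the sum.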

Step 5 — why pre-LN trains more stably. Follow the residual path. In post-LN, the addition x + F(x) is immediately fed through LayerNorm, so the residual stream is re-scaled at every sublayer. The clean +1 gradient path now has a LayerNorm sitting on top of it at each step: the backward pass must multiply by every LayerNorm's Jacobian, and the product of those Jacobians across many layers can blow up or shrink, which is the very instability the residual was supposed to cure. In pre-LN, the LayerNorm is moved off the highway and onto the sublayer's input. The addition itself is untouched, so the residual stream runs un-normalized from the first layer to the last:

x_{\ell+1} = x_{\ell} + \operatorname{Sublayer}_{\ell}\big(\operatorname{LayerNorm}(x_{\ell})\big).

Unrolling this over L sublayers gives x_L = x_0 + \sum_{\ell=0}^{L-1} \operatorname{Sublayer}_{\ell}\big(\operatorname{LayerNorm}(x_{\ell})\big): a single additive trunk with clean +1 paths all the way down and no normalizer sitting on the gradient's direct path between layers. That is why pre-LN can be trained at great depth, often without the careful learning-rate warm-up that post-LN demands.
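The unrolled form can be verified on a toy stack. In this sketch the sublayers are hypothetical tanh maps with random weights, and layer_norm has no learnable parameters. The point is only that the final stream equals the input plus the sum of every sublayer's write.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize over the feature axis; learnable gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d, num_layers = 8, 6
# One hypothetical weight matrix per toy sublayer.
weights = [rng.normal(scale=0.5, size=(d, d)) for _ in range(num_layers)]

x0 = rng.normal(size=d)
x = x0.copy()
writes = []
for W in weights:
    write = np.tanh(layer_norm(x) @ W)  # Sublayer_l(LayerNorm(x_l))
    writes.append(write)
    x = x + write                       # the stream itself is never normalized

# x_L should equal x_0 plus the sum of all sublayer writes.
trunk_gap = np.max(np.abs(x - (x0 + np.sum(writes, axis=0))))
print(trunk_gap)
```

Because x_0 appears in x_L with coefficient exactly 1, the gradient of any loss on x_L reaches x_0 through an identity path in addition to the paths through the sublayers.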

Every transformer sublayer is wrapped in a residual connection and a layer normalization:

The wrapper. A sublayer F is never used bare; it is combined with an identity skip and a LayerNorm so the block maps \mathbb{R}^d \to \mathbb{R}^d and can be stacked.
Post-LN (original). \operatorname{out} = \operatorname{LayerNorm}(x + \operatorname{Sublayer}(x)) — add, then normalize.
Pre-LN (modern default). \operatorname{out} = x + \operatorname{Sublayer}(\operatorname{LayerNorm}(x)) — normalize the input, then add.
Why pre-LN is more stable. It keeps the residual stream un-normalized end to end, so the +1 gradient highway is never clamped between layers, which lets very deep stacks train cleanly.

Pre-LN invites a powerful mental model: the residual stream. Picture a single d-wide vector flowing straight through the whole network, a bus running top to bottom. Each sublayer does not replace the bus; it reads a (normalized) copy of it, computes something, and adds its result back on. Attention writes “here is context I gathered”; the feed-forward block writes “here is what I made of it.”

In this picture the residual stream is the model's working memory, and every sublayer is a small device wired onto the same bus, communicating only by what it adds. It explains why the stream width d is the architecture's most important number, and why removing one sublayer rarely breaks the model outright — the bus still carries everything the others wrote. We will lean on this view when we assemble the full block.

Move the LayerNorm and watch the stream

Compare the two placements of the LayerNorm. In post-LN the normalizer sits on the residual stream, right after the addition, so the stream is re-scaled at every sublayer. In pre-LN the normalizer moves off the stream onto the sublayer's input branch, leaving the addition (and the whole highway) untouched. Same components, different wiring, very different training behaviour.
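The effect of the wiring on the stream's scale can be traced in a few lines. This is a toy sketch, not a training experiment: the sublayers are hypothetical tanh maps with random weights, layer_norm has no learnable parameters, and stream_scales is an illustrative helper name.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Standardize over the feature axis; learnable gain and bias are omitted.
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

rng = np.random.default_rng(0)
d, num_layers = 16, 8
# One hypothetical weight matrix per toy sublayer, shared by both wirings.
weights = [rng.normal(size=(d, d)) for _ in range(num_layers)]
x0 = rng.normal(size=d)

def stream_scales(wiring):
    # Return the standard deviation of the stream after each wrapped sublayer.
    x, scales = x0.copy(), []
    for W in weights:
        if wiring == "post":
            x = layer_norm(x + np.tanh(x @ W))  # norm sits on the stream
        else:
            x = x + np.tanh(layer_norm(x) @ W)  # norm sits on the input branch
        scales.append(float(x.std()))
    return scales

post_scales = stream_scales("post")
pre_scales = stream_scales("pre")
print("post-LN:", np.round(post_scales, 3))
print("pre-LN: ", np.round(pre_scales, 3))
```

In the post-LN run the stream's standard deviation cannot exceed 1, because every step ends in a LayerNorm. In the pre-LN run nothing resets the stream, so its scale is free to drift as writes accumulate.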

Where this goes

Add & Norm is the last missing piece. With attention, a feed-forward block, and this wrapper in hand, we can assemble the complete transformer block and stack it into a deep model.