Each sublayer of a transformer — the
The residual is the gradient highway; the normalization keeps the activations from drifting to wild scales. Two jobs, one wrapper, repeated everywhere.
Let
Step 1 — run the sublayer. Compute the sublayer's transformation of the input:
Step 2 — add the input back (the residual). Instead of replacing
Exactly as for a plain residual block, the bare wire carrying
Step 3 — normalize (the original recipe). The transformer as first proposed put the normalization after the addition. This is post-LN:
Add, then norm — hence “Add & Norm.” Every sublayer output is re-standardised before the next sublayer sees it.
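Steps 1 to 3 can be written out as a short chain of equations. The notation here is an assumption, not necessarily the symbols used elsewhere in this text: $x$ is the sublayer input, $\mathrm{Sublayer}$ stands for either attention or the feed-forward block, and $s$, $r$, $y$ are illustrative names for the intermediate and final values.

```latex
\begin{aligned}
s &= \mathrm{Sublayer}(x) && \text{(Step 1: run the sublayer)} \\
r &= x + s && \text{(Step 2: add the input back)} \\
y &= \mathrm{LayerNorm}(r)
   = \mathrm{LayerNorm}\bigl(x + \mathrm{Sublayer}(x)\bigr) && \text{(Step 3: normalize, post-LN)}
\end{aligned}
```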
Step 4 — move the norm inside (the modern default). Practitioners found post-LN fragile in very deep stacks. The fix is to normalize the input to the sublayer instead, and leave the addition un-normalised. This is pre-LN:
Norm first, then run the sublayer, then add. The two differ only in where the LayerNorm sits relative to the addition — but that placement decides how stably the network trains.
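The two placements can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not a reference implementation: `layer_norm` omits the learned gain and bias, vectors are plain lists, and `toy_sublayer` is a hypothetical stand-in for attention or a feed-forward block.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def add(a, b):
    """Element-wise sum of two equal-length vectors."""
    return [p + q for p, q in zip(a, b)]

def post_ln(x, sublayer):
    # Original recipe: run the sublayer, add the input back, then normalize the sum.
    return layer_norm(add(x, sublayer(x)))

def pre_ln(x, sublayer):
    # Modern default: normalize only the sublayer's input; the addition is left alone.
    return add(x, sublayer(layer_norm(x)))

def toy_sublayer(v):
    # Hypothetical stand-in for attention or a feed-forward block.
    return [0.5 * t + 1.0 for t in v]

x = [10.0, 20.0, 30.0, 40.0]
y_post = post_ln(x, toy_sublayer)  # re-standardized: the scale of x is gone
y_pre = pre_ln(x, toy_sublayer)    # x plus a small update: the scale of x survives
```

In this sketch `y_post` always comes out with roughly zero mean and unit variance, whatever the scale of `x`, while `y_pre` is the original `x` with the sublayer's output added on top.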
Step 5 — why pre-LN trains more stably. Follow the residual path. In
post-LN, the addition
Unrolling this,
Pre-LN invites a powerful mental model: the residual stream. Picture a single
In this picture the residual stream is the model's working memory, and every sublayer is a small
device wired onto the same bus, communicating only by what it adds. It explains why the stream
width
In post-LN the normalizer sits on the residual stream, right after the addition, so the stream is re-scaled at every sublayer. In pre-LN the normalizer moves off the stream onto the sublayer's input branch, leaving the addition (and the whole highway) untouched. Same components, different wiring, very different training behaviour.
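The difference in wiring shows up when several sublayers are stacked. The sketch below is a toy experiment, not a claim about any particular model: `toy_sublayer` is a hypothetical function whose parameter `k` only serves to make the layers differ, and `layer_norm` has no learned parameters. It checks that in the pre-LN stack the final stream is exactly the input plus the sum of everything the sublayers added, whereas the post-LN stack re-standardizes the stream at every step.

```python
import math

def layer_norm(x, eps=1e-5):
    """Standardize a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def toy_sublayer(v, k):
    # Hypothetical stand-in for a sublayer; k makes each layer slightly different.
    return [math.tanh(t + 0.1 * k) for t in v]

def run_pre_ln(x0, depth):
    """Pre-LN stack: the norm sits on the branch, the stream is only ever added to."""
    stream, writes = list(x0), []
    for k in range(depth):
        update = toy_sublayer(layer_norm(stream), k)
        writes.append(update)
        stream = [s + u for s, u in zip(stream, update)]
    return stream, writes

def run_post_ln(x0, depth):
    """Post-LN stack: the norm sits on the stream, right after each addition."""
    stream = list(x0)
    for k in range(depth):
        update = toy_sublayer(stream, k)
        stream = layer_norm([s + u for s, u in zip(stream, update)])
    return stream

x0 = [3.0, -1.0, 4.0, 1.0]
pre_out, writes = run_pre_ln(x0, depth=6)
post_out = run_post_ln(x0, depth=6)

# Pre-LN: the input reaches the output through additions alone.
reconstructed = [x + sum(w[i] for w in writes) for i, x in enumerate(x0)]
```

Here `reconstructed` matches `pre_out` up to floating-point rounding, which is the untouched highway in numeric form. No such decomposition holds for `post_out`, because each LayerNorm on the stream rescales the input's contribution along with everything added so far.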
Add & Norm is the last missing piece. With attention, a feed-forward block, and this wrapper in
hand, we can assemble the complete