Assembling one block, line by line
Let the input be a whole sequence, X \in \mathbb{R}^{n \times d} —
n tokens, each a d-wide vector. We build the
block as two wrapped sublayers, using the pre-LN wrapper that modern models default to.
Step 1 — the attention sublayer, wrapped. Run multi-head self-attention on a
normalised copy of the stream, and add the result back:
Z = X + \operatorname{MultiHead}\big(\operatorname{LayerNorm}(X)\big).
After this line, every token has mixed in information from every other token. The shape is unchanged:
Z \in \mathbb{R}^{n \times d}.
Step 2 — the feed-forward sublayer, wrapped. Feed Z
through the same kind of wrapper, this time around the position-wise FFN:
Y = Z + \operatorname{FFN}\big(\operatorname{LayerNorm}(Z)\big).
After this line, every token has been transformed in place. Again the shape is unchanged:
Y \in \mathbb{R}^{n \times d}.
Step 3 — name the block. The two wrapped sublayers, in this order, are one
transformer block:
\operatorname{Block}(X) = \underbrace{\big(\text{FFN sublayer}\big)}_{\text{think}} \circ \underbrace{\big(\text{attention sublayer}\big)}_{\text{gather}}(X).
Attention first (gather context), feed-forward second (process it) — each wrapped in Add & Norm.
Step 4 — track the shape. The input was
n \times d; after attention it is
n \times d; after the FFN it is
n \times d. The block is a map
\operatorname{Block}: \mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}.
Input shape equals output shape. That is the whole reason blocks compose.
Step 5 — stack N identical blocks. Because each block
maps n \times d \to n \times d, you can feed one block's output straight
into the next. Chain N of them (each with its own weights):
H_0 = X, \qquad H_{\ell} = \operatorname{Block}_{\ell}(H_{\ell-1}), \qquad \ell = 1, \dots, N.
The sequence keeps its n \times d shape the entire way up the tower; only
the contents of the vectors get richer with depth. A “12-layer” or
“96-layer” transformer is exactly this — N copies of one block.
Step 6 — read the rhythm. Every block gives each token one more chance to look
around (attention) and then reconsider (FFN), reading from and writing to the same residual stream.
Depth is how many rounds of gather-then-think the model gets; width d is
how much each token can hold. Scaling a transformer is scaling these two numbers.
A transformer block is two Add & Norm-wrapped sublayers, and a transformer is a stack of them:
-
Attention sublayer.
Z = X + \operatorname{MultiHead}(\operatorname{LayerNorm}(X)) — every
token gathers context.
-
Feed-forward sublayer.
Y = Z + \operatorname{FFN}(\operatorname{LayerNorm}(Z)) — every token
is transformed in place.
-
Shape preserved. The block maps
\mathbb{R}^{n \times d} \to \mathbb{R}^{n \times d}; the sequence stays
n \times d throughout.
-
Stack N. Chaining
N identical-in-shape blocks (each with its own weights),
H_{\ell} = \operatorname{Block}_{\ell}(H_{\ell-1}), gives a deep
transformer.
The cleanest way to see a deep transformer is as one long
residual stream — a single n \times d tensor running
from the bottom of the tower to the top, the model's working memory. Each sublayer reads a
normalised view of the stream, computes something, and adds its contribution back. Attention writes
gathered context; the FFN writes its per-token conclusions. Nothing ever overwrites the stream; the
model only ever appends to it. Information laid down by an early block can still be read by
a late one, because the bus carried it the whole way.
This view makes the two scaling knobs concrete. Depth (N
blocks) is how many read-think-write rounds the stream undergoes; width
(d) is the bandwidth of the bus — how much each token can carry. Bigger
models grow both, and the parameter count grows roughly as
N \cdot (\,\text{attention} + \text{FFN}\,) \approx N \cdot 12 d^2
(about 4d^2 for attention's projections plus
8d^2 for the feed-forward block, per block). The famous large language
models are, structurally, nothing but a tall stack of this one humble block.