The block, line by line
Take a single token's vector x \in \mathbb{R}^d coming out of the
attention sublayer. The feed-forward network is a two-layer MLP with one nonlinearity in the middle.
Step 1 — expand into a wider hidden layer. Multiply by a matrix that maps from
d up to a larger hidden width
d_{\text{ff}}, and add a bias:
h = W_1 x + b_1, \qquad W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}, \; b_1 \in \mathbb{R}^{d_{\text{ff}}}.
The standard choice is d_{\text{ff}} = 4d — the hidden layer is
four times wider than the model. With d = 512 that is a
hidden layer of 2048.
Step 2 — apply the nonlinearity. Pass the hidden vector through an
activation,
classically \operatorname{ReLU}(z) = \max(0, z), applied elementwise:
a = \operatorname{ReLU}(h) = \max(0,\, W_1 x + b_1).
Without this nonlinearity the two matrices would collapse into one — the activation is what gives the
block its expressive power.
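To see the collapse concretely, the sketch below drops the ReLU and checks that the two layers equal a single affine map with matrix W_2 W_1 and bias W_2 b_1 + b_2. The small weights and the helper names (`matvec`, `matmul`, `add`) are made up for illustration; they are not from any particular model.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

# Illustrative toy weights: d = 2, d_ff = 3.
W1 = [[1.0, 2.0], [0.0, -1.0], [3.0, 1.0]]
b1 = [0.5, -0.5, 1.0]
W2 = [[1.0, 0.0, 2.0], [-1.0, 1.0, 0.0]]
b2 = [0.0, 1.0]
x = [2.0, -3.0]

# Two layers with the activation removed.
h = add(matvec(W1, x), b1)
no_relu = add(matvec(W2, h), b2)

# One merged affine map: (W2 W1) x + (W2 b1 + b2).
W_merged = matmul(W2, W1)
b_merged = add(matvec(W2, b1), b2)
merged = add(matvec(W_merged, x), b_merged)

# The real block, with ReLU in the middle, is a different function.
with_relu = add(matvec(W2, [max(0.0, v) for v in h]), b2)

print(no_relu, merged, with_relu)
```

On this input the first two results agree and the third differs, because the ReLU zeroes the negative hidden unit before the second matrix sees it.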
Step 3 — project back down to the model width. A second matrix brings the wide
hidden vector back to d dimensions, so the output matches the input shape:
\operatorname{FFN}(x) = W_2\, a + b_2, \qquad W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}, \; b_2 \in \mathbb{R}^{d}.
Step 4 — read it as one expression. Composing the three steps:
\operatorname{FFN}(x) = W_2\,\operatorname{ReLU}(W_1 x + b_1) + b_2.
Up to 4d, squash, back down to d. That is the
whole block.
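The composed expression can be sketched in a few lines of plain Python. This is a minimal illustration for a single token, assuming small random weights and a toy width of d = 8; the helper names (`matvec`, `random_matrix`) are hypothetical, and a real implementation would use a tensor library.

```python
import random

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(x, W1, b1, W2, b2):
    """FFN(x) = W2 ReLU(W1 x + b1) + b2 for one token vector x."""
    h = [z + b for z, b in zip(matvec(W1, x), b1)]     # Step 1: d -> d_ff
    a = [max(0.0, h_i) for h_i in h]                   # Step 2: elementwise ReLU
    return [z + b for z, b in zip(matvec(W2, a), b2)]  # Step 3: d_ff -> d

def random_matrix(rows, cols, rng, scale=0.1):
    """Illustrative initialisation only; not a recommended scheme."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

d, d_ff = 8, 32                      # toy sizes with d_ff = 4d
rng = random.Random(0)
W1, b1 = random_matrix(d_ff, d, rng), [0.0] * d_ff
W2, b2 = random_matrix(d, d_ff, rng), [0.0] * d

x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = ffn(x, W1, b1, W2, b2)
print(len(x), len(y))                # input and output widths match
```

Because the output has the same width as the input, the block can be added back onto the residual stream without any reshaping.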
Step 5 — the same network at every position. Crucially, the very same
W_1, b_1, W_2, b_2 are applied independently and
identically to each of the n tokens. The block has no idea where
a token sits in the sequence and never looks at its neighbours — it is “position-wise.”
Running it on the whole sequence X \in \mathbb{R}^{n \times d} is just the
same operation broadcast across the n rows.
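A small experiment makes the position-wise property tangible. The sketch below, with made-up random weights and toy sizes, applies one set of weights to every row of X, then checks two consequences: reordering the tokens reorders the outputs the same way, and changing one token leaves the other outputs untouched.

```python
import random

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def ffn(x, W1, b1, W2, b2):
    h = [z + b for z, b in zip(matvec(W1, x), b1)]
    a = [max(0.0, h_i) for h_i in h]
    return [z + b for z, b in zip(matvec(W2, a), b2)]

def ffn_sequence(X, W1, b1, W2, b2):
    """Apply the same FFN independently to each of the n rows of X."""
    return [ffn(x, W1, b1, W2, b2) for x in X]

# Illustrative toy setup: n = 3 tokens, d = 4, d_ff = 16.
rng = random.Random(1)
n, d, d_ff = 3, 4, 16
W1 = [[rng.gauss(0.0, 0.5) for _ in range(d)] for _ in range(d_ff)]
b1 = [0.0] * d_ff
W2 = [[rng.gauss(0.0, 0.5) for _ in range(d_ff)] for _ in range(d)]
b2 = [0.0] * d
X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

Y = ffn_sequence(X, W1, b1, W2, b2)

# Reversing the token order just reverses the outputs.
Y_reversed = ffn_sequence(X[::-1], W1, b1, W2, b2)
print(Y_reversed == Y[::-1])

# Replacing token 1 cannot affect the outputs at positions 0 and 2.
X_edit = [row[:] for row in X]
X_edit[1] = [1.0] * d
Y_edit = ffn_sequence(X_edit, W1, b1, W2, b2)
print(Y_edit[0] == Y[0], Y_edit[2] == Y[2])
```

An attention layer would fail the second check, since every output there depends on the other tokens.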
Step 6 — count the parameters. The two weight matrices dominate:
W_1 has d \cdot d_{\text{ff}} entries and
W_2 has d_{\text{ff}} \cdot d entries. With
d_{\text{ff}} = 4d that is
d \cdot 4d \;+\; 4d \cdot d \;=\; 2 \cdot d \cdot 4d \;=\; 8d^2 \quad\text{parameters (ignoring biases).}
A useful attention-vs-FFN comparison falls out of this. Attention's four projections
(query, key, value, output) total roughly 4d^2 parameters; the FFN is roughly 8d^2. So in
a typical transformer block the feed-forward network holds about two-thirds of the
parameters.
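The arithmetic is easy to check for d = 512. The sketch below counts weights under the assumptions stated in the text: d_ff = 4d, and attention modelled as four d-by-d projection matrices with no biases. The function names are illustrative.

```python
def ffn_params(d, d_ff, include_biases=False):
    """Weights in W1 (d_ff x d) and W2 (d x d_ff), optionally plus biases."""
    total = d * d_ff + d_ff * d
    if include_biases:
        total += d_ff + d          # b1 and b2
    return total

def attention_params(d):
    """Assumes four d x d projections (Q, K, V, output) and no biases."""
    return 4 * d * d

d = 512
ffn_weights = ffn_params(d, 4 * d)
attn_weights = attention_params(d)
ffn_share = ffn_weights / (ffn_weights + attn_weights)

print(ffn_weights)    # 2097152, i.e. 8 * 512**2
print(attn_weights)   # 1048576, i.e. 4 * 512**2
print(ffn_params(d, 4 * d, include_biases=True) - ffn_weights)  # 2560 bias terms
print(round(ffn_share, 3))  # 0.667
```

The biases add only 2560 parameters here, which is why they can be ignored in the estimate.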
The division of labour is clean: attention mixes information across tokens; the
feed-forward block transforms each token in place. Gather, then think.
Between attention sublayers sits a small per-token MLP:
- Two layers, one nonlinearity. \operatorname{FFN}(x) = W_2\,\operatorname{ReLU}(W_1 x + b_1) + b_2, mapping \mathbb{R}^d \to \mathbb{R}^d.
- Position-wise. The same weights are applied independently and identically to every token; the block never mixes across positions (that is attention's job).
- Expansion. The hidden width is wider than the model, typically d_{\text{ff}} = 4d (so d \to 4d \to d).
- Most of the parameters. Its two matrices total \approx 2 \cdot d \cdot 4d = 8d^2 weights, the bulk of a block's parameter count.
A productive way to read the feed-forward block is as a key–value memory.
The first matrix W_1 acts like a set of keys: each row, dotted
with the token, fires (after the ReLU) when the token matches some learned pattern — a topic, a
piece of syntax, even a specific fact. The columns of the second matrix W_2 supply
the values those firings write back into the residual stream: each hidden unit that fires adds
its own column of W_2, scaled by its activation, to the output. Techniques that edit a model's
stored facts increasingly do so by editing entries of these matrices. The wide hidden layer is,
in this view, the model's lookup table, which is why it is so large.
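The key-value reading is just a rearrangement of the same arithmetic, which the sketch below makes explicit. It treats row i of W_1 as a key and column i of W_2 as the matching value, and sums the values of the units that fire. The toy weights and the function name `ffn_as_memory` are invented for illustration.

```python
def ffn_as_memory(x, W1, b1, W2, b2):
    """Compute FFN(x) as b2 plus a weighted sum of value vectors.

    Returns the output and the indices of the hidden units that fired.
    """
    out = list(b2)
    fired = []
    for i, key in enumerate(W1):                 # row i of W1 is key i
        score = sum(k * v for k, v in zip(key, x)) + b1[i]
        strength = max(0.0, score)               # ReLU: fire only on a match
        if strength > 0.0:
            fired.append(i)
            for r in range(len(out)):
                out[r] += strength * W2[r][i]    # column i of W2 is value i
    return out, fired

# Toy memory with d = 2 and four keys: +e1, +e2, -e1, -e2.
W1 = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]
b1 = [0.0, 0.0, 0.0, 0.0]
# Values chosen so each key writes back the direction it detected.
W2 = [[1.0, 0.0, -1.0, 0.0],
      [0.0, 1.0, 0.0, -1.0]]
b2 = [0.0, 0.0]

out, fired = ffn_as_memory([3.0, -2.0], W1, b1, W2, b2)
print(fired)  # [0, 3]: the "+e1" and "-e2" keys match this token
print(out)    # [3.0, -2.0]
```

Only two of the four slots contribute for this token; the rest of the table stays silent, which is the sense in which the hidden layer behaves like a sparse lookup.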
Modern models often replace the plain MLP with a gated variant — GeGLU or SwiGLU.
The idea is to compute two projections of the token and let one gate the other before the
down-projection, e.g.
\operatorname{SwiGLU}(x) = W_2\,\big(\operatorname{Swish}(W_1 x) \odot (V x)\big),
where \odot is the elementwise product and biases are omitted. The multiplicative gate gives the
block sharper control over what passes through, and tends to work a little better for the same
parameter budget. We will meet these properly when we discuss modern architecture refinements.
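As a preview, here is a minimal sketch of a SwiGLU block for one token. It assumes the column-vector convention used for the plain FFN in this section (so the down-projection is applied as W_2 times the gated vector), takes Swish(z) = z · sigmoid(z), and omits biases. The helper names and toy weights are illustrative only.

```python
import math

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def swish(z):
    """Swish(z) = z * sigmoid(z), written to avoid overflow for large |z|."""
    if z >= 0.0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def swiglu_ffn(x, W1, V, W2):
    """W2 (Swish(W1 x) * (V x)), with * the elementwise product."""
    gate = [swish(z) for z in matvec(W1, x)]     # gating projection
    value = matvec(V, x)                         # second projection
    gated = [g * v for g, v in zip(gate, value)]
    return matvec(W2, gated)                     # back down to width d

# Toy weights: d = 2, hidden width 3 (both W1 and V map d -> hidden).
W1 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[2.0, 0.0], [0.0, 2.0], [1.0, -1.0]]
W2 = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]

y = swiglu_ffn([1.0, 1.0], W1, V, W2)
print(len(y))
```

Note that the gated block has three weight matrices rather than two, so matching the parameter count of a plain FFN requires a narrower hidden width.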