The Feed-Forward Block

Attention lets every token gather information from every other token — it mixes across the sequence. But after a token has collected the relevant context, something has to actually process it: combine those gathered features, reshape them, decide what they mean. That is the job of the feed-forward block — a small multilayer perceptron applied to each token on its own.

It is the quiet workhorse of the transformer: while attention gets the headlines, the feed-forward block holds most of the parameters and does most of the per-token “thinking.”

The block, line by line

Take a single token's vector x \in \mathbb{R}^d coming out of the attention sublayer. The feed-forward network is a two-layer MLP with one nonlinearity in the middle.

Step 1 — expand into a wider hidden layer. Multiply by a matrix that maps from d up to a larger hidden width d_{\text{ff}}, and add a bias:

h = W_1 x + b_1, \qquad W_1 \in \mathbb{R}^{d_{\text{ff}} \times d}, \; b_1 \in \mathbb{R}^{d_{\text{ff}}}.

The standard choice is d_{\text{ff}} = 4d — the hidden layer is four times wider than the model. With d = 512 that is a hidden layer of 2048.

Step 2 — apply the nonlinearity. Pass the hidden vector through an activation, classically \operatorname{ReLU}(z) = \max(0, z), applied elementwise:

a = \operatorname{ReLU}(h) = \max(0,\, W_1 x + b_1).

Without this nonlinearity the two matrices would collapse into one — the activation is what gives the block its expressive power.
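The collapse can be checked numerically. The snippet below is a minimal NumPy sketch with arbitrary toy sizes and random weights, all chosen for illustration; W2 and b2 stand for the second layer's matrix and bias that Step 3 introduces. It removes the activation and confirms that the two affine layers equal a single affine layer with merged weights W2 W1 and merged bias W2 b1 + b2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 4, 16  # toy sizes, chosen only for illustration

W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)
x = rng.normal(size=d)

# Two affine layers with no activation in between...
two_layer = W2 @ (W1 @ x + b1) + b2

# ...equal one affine layer with merged weights and bias.
W_merged = W2 @ W1        # shape (d, d)
b_merged = W2 @ b1 + b2   # shape (d,)
one_layer = W_merged @ x + b_merged

linear_collapses = np.allclose(two_layer, one_layer)
print(linear_collapses)
```

With the ReLU in place no such merged matrix exists in general, because which hidden units are zeroed depends on the input x.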

Step 3 — project back down to the model width. A second matrix brings the wide hidden vector back to d dimensions, so the output matches the input shape:

\operatorname{FFN}(x) = W_2\, a + b_2, \qquad W_2 \in \mathbb{R}^{d \times d_{\text{ff}}}, \; b_2 \in \mathbb{R}^{d}.

Step 4 — read it as one expression. Composing the three steps:

\operatorname{FFN}(x) = W_2\,\operatorname{ReLU}(W_1 x + b_1) + b_2.

Up to 4d, squash, back down to d. That is the whole block.
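As a rough illustration, the steps above can be written as a few lines of NumPy. The function name ffn, the toy sizes, and the random initialization are assumptions made for this sketch and are not part of the definition.

```python
import numpy as np

def ffn(x, W1, b1, W2, b2):
    """Feed-forward block for a single token vector x of shape (d,)."""
    h = W1 @ x + b1          # Step 1: expand to the hidden width, shape (d_ff,)
    a = np.maximum(0.0, h)   # Step 2: elementwise ReLU
    return W2 @ a + b2       # Step 3: project back down, shape (d,)

# Toy sizes with d_ff = 4d; the scaling of the random weights is arbitrary.
d, d_ff = 8, 32
rng = np.random.default_rng(0)
W1 = rng.normal(size=(d_ff, d)) / np.sqrt(d)
b1 = np.zeros(d_ff)
W2 = rng.normal(size=(d, d_ff)) / np.sqrt(d_ff)
b2 = np.zeros(d)

x = rng.normal(size=d)
y = ffn(x, W1, b1, W2, b2)
print(y.shape)  # (8,)
```

The output has the same shape as the input, which is what lets the block sit inside a residual connection.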

Step 5 — the same network at every position. Crucially, the very same W_1, b_1, W_2, b_2 are applied independently and identically to each of the n tokens. The block has no idea where a token sits in the sequence and never looks at its neighbours — it is “position-wise.” Running it on the whole sequence X \in \mathbb{R}^{n \times d} is just the same operation broadcast across the n rows.
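The position-wise property can be sketched in NumPy as follows. The helper name ffn_seq and the toy sizes are assumptions for illustration. The batched call on X is compared against a row-by-row loop, and the tokens are then shuffled to show that the outputs shuffle the same way.

```python
import numpy as np

def ffn_seq(X, W1, b1, W2, b2):
    """Apply the same FFN to every row of X: shape (n, d) -> (n, d)."""
    H = X @ W1.T + b1                  # (n, d_ff); b1 broadcasts over rows
    return np.maximum(0.0, H) @ W2.T + b2

rng = np.random.default_rng(0)
n, d, d_ff = 5, 8, 32                  # toy sizes, chosen for illustration
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)
X = rng.normal(size=(n, d))

Y = ffn_seq(X, W1, b1, W2, b2)

# Processing each token separately gives the same rows.
Y_rows = np.stack([W2 @ np.maximum(0.0, W1 @ x + b1) + b2 for x in X])

# Reordering the tokens reorders the outputs identically:
# the block uses no information about position or neighbours.
perm = rng.permutation(n)
Y_perm = ffn_seq(X[perm], W1, b1, W2, b2)

print(np.allclose(Y, Y_rows), np.allclose(Y_perm, Y[perm]))
```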

Step 6 — count the parameters. The two weight matrices dominate: W_1 has d \cdot d_{\text{ff}} entries and W_2 has d_{\text{ff}} \cdot d entries. With d_{\text{ff}} = 4d that is

d \cdot 4d \;+\; 4d \cdot d \;=\; 8d^2 \quad\text{parameters (ignoring biases).}

An attention-versus-FFN comparison falls out of this. Attention's four projection matrices (query, key, value and output, each d \times d) hold roughly 4d^2 parameters; the FFN holds roughly 8d^2. So in a typical transformer block with d_{\text{ff}} = 4d, the feed-forward network accounts for about two-thirds of the weights.
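The arithmetic is easy to reproduce. The sketch below uses two hypothetical helper functions, ffn_params and attn_params, and assumes attention consists of four d × d projections (query, key, value, output) with no biases counted by default.

```python
def ffn_params(d, d_ff=None, bias=True):
    """Parameter count of the two-layer FFN; d_ff defaults to 4d."""
    d_ff = 4 * d if d_ff is None else d_ff
    weights = d * d_ff + d_ff * d          # W1 and W2
    biases = (d_ff + d) if bias else 0     # b1 and b2
    return weights + biases

def attn_params(d, bias=False):
    """Assumes four d x d projections: query, key, value, output."""
    weights = 4 * d * d
    biases = 4 * d if bias else 0
    return weights + biases

d = 512
ffn_w = ffn_params(d, bias=False)    # 8 * 512**2 = 2_097_152
attn_w = attn_params(d)              # 4 * 512**2 = 1_048_576
share = ffn_w / (ffn_w + attn_w)     # 2/3 of the block's weight matrices
print(ffn_w, attn_w, round(share, 3))
```

Including the biases adds only d_ff + d = 2560 parameters to the FFN at d = 512, which is why they are usually ignored in this kind of estimate.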

The division of labour is clean: attention mixes information across tokens; the feed-forward block transforms each token in place. Gather, then think.

Between attention sublayers sits a small per-token MLP:

Two layers, one nonlinearity. \operatorname{FFN}(x) = W_2\,\operatorname{ReLU}(W_1 x + b_1) + b_2, mapping \mathbb{R}^d \to \mathbb{R}^d.
Position-wise. The same weights are applied independently and identically to every token; the block never mixes across positions (that is attention's job).
Expansion. The hidden width is wider than the model, typically d_{\text{ff}} = 4d (so d \to 4d \to d).
Most of the parameters. Its two matrices total 2 \cdot d \cdot 4d = 8d^2 weights (ignoring biases), the bulk of a block's parameter count.

A productive way to read the feed-forward block is as a key–value memory. The first matrix W_1 acts like a set of keys: each row, dotted with the token, fires (after the ReLU) when the token matches some learned pattern, such as a topic, a piece of syntax, or even a specific fact. The second matrix W_2 supplies the values: each firing writes the corresponding column of W_2, scaled by its activation, back into the residual stream. Methods that edit a model's stored facts often work by modifying entries of these matrices. In this view the wide hidden layer is the model's lookup table, which is why it is so large.
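The key–value reading is the same computation written as a sum over hidden units. The sketch below, with toy sizes and random weights chosen for illustration, treats row i of W1 as key i and column i of W2 as value i, and checks that accumulating the fired values reproduces the matrix form.

```python
import numpy as np

rng = np.random.default_rng(0)
d, d_ff = 4, 16  # toy sizes, chosen only for illustration
W1, b1 = rng.normal(size=(d_ff, d)), rng.normal(size=d_ff)
W2, b2 = rng.normal(size=(d, d_ff)), rng.normal(size=d)
x = rng.normal(size=d)

# Matrix form of the block.
y_matrix = W2 @ np.maximum(0.0, W1 @ x + b1) + b2

# Memory form: key i is row i of W1, value i is column i of W2.
y_memory = b2.copy()
for i in range(d_ff):
    score = max(0.0, W1[i] @ x + b1[i])   # how strongly key i fires on x
    y_memory += score * W2[:, i]          # write value i, scaled by the score

print(np.allclose(y_matrix, y_memory))
```

Keys whose score is zero contribute nothing, so for any given token only the values of the firing keys are written.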

Modern models often replace the plain MLP with a gated variant such as GeGLU or SwiGLU. The idea is to compute two projections of the token and let one gate the other before the down-projection, e.g. \operatorname{SwiGLU}(x) = W_2\,\big(\operatorname{Swish}(W_1 x) \odot (V x)\big), where \odot is the elementwise product and V is a second up-projection with the same shape as W_1. (The down-projection W_2 is written on the left to match the column-vector convention used for \operatorname{FFN}(x) above.) The multiplicative gate gives the block sharper control over what passes through, and tends to work a little better for the same budget. We will meet these properly when we discuss modern architecture refinements.
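A minimal NumPy sketch of the gated variant follows, written with x as a column vector to match the FFN formula earlier, so the down-projection multiplies from the left. The function names, toy sizes, and the choice of Swish with beta = 1 (also known as SiLU) are assumptions for illustration; biases are omitted, as in the formula.

```python
import numpy as np

def swish(z):
    """Swish with beta = 1 (SiLU): z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu_ffn(x, W1, V, W2):
    """Gated feed-forward block for one token vector x of shape (d,)."""
    gate = swish(W1 @ x)        # gating branch, shape (d_ff,)
    value = V @ x               # linear branch, shape (d_ff,)
    return W2 @ (gate * value)  # elementwise gate, then project back to (d,)

rng = np.random.default_rng(0)
d, d_ff = 8, 32                 # toy sizes, chosen only for illustration
W1 = rng.normal(size=(d_ff, d))
V = rng.normal(size=(d_ff, d))
W2 = rng.normal(size=(d, d_ff))

x = rng.normal(size=d)
y = swiglu_ffn(x, W1, V, W2)
print(y.shape)  # (8,)
```

Note that this variant has three weight matrices rather than two, so at equal hidden width it carries more parameters than the plain MLP.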

See the expand-then-contract

Each token's vector is pushed up into a wider hidden layer, squashed by the ReLU, and brought back down to the original width. Notice the hidden column is the fat one — that 4\times expansion is where the parameters live. And the very same block is reused at every token position; the rows do not interact.

Where this goes

Attention and the feed-forward block are the two sublayers of a transformer. Before they can be safely stacked dozens deep, each one is wrapped in a residual connection and a normalization step — the Add & Norm wrapper that comes next.