Gating the feed-forward block, line by line
Take a token's vector x \in \mathbb{R}^d. We build the gated block in
four moves, then check its parameter budget and its results.
Step 1 — a smooth activation: Swish/SiLU. Replace the kinked ReLU with a smooth
cousin. Swish (also called SiLU) multiplies the input by its own
sigmoid
\sigma(z) = 1/(1 + e^{-z}):
\operatorname{Swish}(z) = z\,\sigma(z) = \frac{z}{1 + e^{-z}}.
For large positive z the sigmoid is near 1 and
\operatorname{Swish}(z) \approx z — just like ReLU. For large negative
z the sigmoid is near 0 and the output decays to
0. But around the origin it is smooth, even dipping
slightly below zero, instead of ReLU's hard corner — a softer, differentiable gate.
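The behaviour described in Step 1 can be checked numerically. The following is a minimal sketch in plain Python, not a library implementation; the helper names sigmoid, swish and relu are chosen here for illustration, and the sigmoid is written in a numerically stable form that the text does not require.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic sigmoid, split by sign so exp() never overflows."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def swish(z: float) -> float:
    """Swish/SiLU: the input scaled by its own sigmoid."""
    return z * sigmoid(z)

def relu(z: float) -> float:
    """ReLU, for comparison."""
    return max(0.0, z)

# Compare the two activations far from and close to the origin.
for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
    print(f"z={z:6.1f}  relu={relu(z):8.4f}  swish={swish(z):8.4f}")
```

At z = 10 the two functions nearly agree, at z = -10 Swish is close to zero, and at z = -1 Swish is negative where ReLU is exactly zero, which is the small dip mentioned above.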
Step 2 — make two projections, not one. Project the token up to the hidden width
twice, with two separate matrices W and
V:
a = xW \quad(\text{the activation branch}), \qquad b = xV \quad(\text{the gate branch}).
Step 3 — gate one branch with the other. Pass the activation branch through Swish,
then multiply it elementwise (\odot) by the raw gate
branch:
g = \operatorname{Swish}(xW) \odot (xV).
Read it as a learned valve: xV decides, coordinate by coordinate, how much
of the activation \operatorname{Swish}(xW) is allowed through. Where the
gate is near zero the channel is shut; where it is large the channel is amplified. That
multiplicative interaction is something the plain MLP, which only sums activated units, has no direct way to express.
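To make the valve reading of Step 3 concrete, here is a small sketch in plain Python. The vectors a and b are hypothetical values standing in for xW and xV, picked so that the three channels show a shut gate, a pass-through gate and an amplifying gate.

```python
import math

def swish(z: float) -> float:
    """Swish/SiLU; fine for moderate z (exp overflows only for z below about -709)."""
    return z / (1.0 + math.exp(-z))

def gate(a, b):
    """Step 3: Swish on the activation branch, times the raw gate branch."""
    return [swish(ai) * bi for ai, bi in zip(a, b)]

a = [2.0, 2.0, 2.0]  # activation branch xW (hypothetical values)
b = [0.0, 1.0, 3.0]  # gate branch xV: shut, pass-through, amplify
g = gate(a, b)
print(g)
```

All three channels carry the same activation, so any difference in g comes from the gate alone: the first entry is zero, the second equals Swish(2), and the third is three times that.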
Step 4 — project back down. A third matrix W_2 brings the
gated hidden vector back to the model width:
\operatorname{SwiGLU}(x) = \big(\operatorname{Swish}(xW) \odot (xV)\big)\,W_2.
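Steps 2 to 4 together amount to a short forward pass. The sketch below uses plain Python lists rather than a tensor library so that it stays self-contained; the helper names (matvec, random_matrix), the toy sizes and the random initialisation are assumptions made for illustration, and biases are omitted because the formula above has none.

```python
import math
import random

def swish(z: float) -> float:
    """Swish/SiLU; adequate for the small values used here."""
    return z / (1.0 + math.exp(-z))

def matvec(x, M):
    """Row vector x (length n) times matrix M (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, M)) for j in range(len(M[0]))]

def swiglu(x, W, V, W2):
    a = matvec(x, W)                              # Step 2: activation branch xW
    b = matvec(x, V)                              # Step 2: gate branch xV
    g = [swish(ai) * bi for ai, bi in zip(a, b)]  # Step 3: elementwise gating
    return matvec(g, W2)                          # Step 4: back to model width

def random_matrix(rows, cols, rng):
    """Hypothetical initialiser: small Gaussian entries."""
    return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
d, d_ff = 4, 6  # toy sizes, far smaller than any real model
W = random_matrix(d, d_ff, rng)
V = random_matrix(d, d_ff, rng)
W2 = random_matrix(d_ff, d, rng)
x = [rng.gauss(0.0, 1.0) for _ in range(d)]
y = swiglu(x, W, V, W2)
print(len(y))  # same length as x
```

Setting V to all zeros makes the output vanish whatever W contains, which is a quick way to confirm that the gate branch really controls what passes through.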
Step 5 — keep the parameter budget honest. We now carry three matrices
(W, V, W_2) where the old block had two (W_1, W_2),
so a naïve swap would inflate the parameter count by 50%. To compare fairly, shrink the hidden width
by a factor of \tfrac{2}{3} to cancel that increase: use
d_{\text{ff}} = \tfrac{2}{3}\cdot 4d = \tfrac{8}{3}d
instead of the usual 4d. Three matrices at width
\tfrac23 \cdot 4d total the same parameters as two matrices at
4d — so the gain below is a genuine, budget-matched win, not just a bigger
block.
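The budget arithmetic is easy to verify by counting weights. This sketch assumes a hypothetical model width d = 768, chosen only because 8d/3 is then an integer, and ignores biases; ffn_params is an illustrative helper, not part of any library.

```python
def ffn_params(d_model: int, d_ff: int, n_matrices: int) -> int:
    """Weight count for n_matrices matrices of shape d_model x d_ff (biases ignored)."""
    return n_matrices * d_model * d_ff

d = 768  # hypothetical model width, chosen so that 8d/3 is an integer

baseline = ffn_params(d, 4 * d, 2)        # ReLU block: W_1, W_2 at width 4d
naive = ffn_params(d, 4 * d, 3)           # gated block left at width 4d
matched = ffn_params(d, (8 * d) // 3, 3)  # gated block at width (2/3) * 4d

print(baseline, naive, matched)  # 4718592 7077888 4718592
```

When 8d/3 is not an integer the hidden width has to be rounded, so the match is then approximate rather than exact.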
Step 6 — and it wins. At matched parameters and compute, SwiGLU consistently lowers
the loss versus the ReLU MLP. The extra arithmetic is cheap; the multiplicative gate is what pays off.
SwiGLU is a gated feed-forward block built from a smooth activation:
- Gated FFN.
  \operatorname{SwiGLU}(x) = \big(\operatorname{Swish}(xW) \odot (xV)\big)\,W_2,
  where \odot is the elementwise product and the gate branch
  xV modulates the activation branch.
- Swish/SiLU activation.
  \operatorname{Swish}(z) = z\,\sigma(z), a smooth, differentiable
  stand-in for ReLU that behaves like z for large positive
  z and decays to 0 for large negative z.
- Budget-matched width. Because a third matrix is added, the hidden width is set to
  \tfrac23 \cdot 4d so the parameter count matches the plain
  4d ReLU block, and SwiGLU still wins on loss.
SwiGLU is one member of the gated linear unit (GLU) family, all sharing the shape
(\phi(xW) \odot xV)\,W_2 and differing only in the activation
\phi: GEGLU uses GELU, ReGLU uses ReLU,
the bare GLU uses a sigmoid, and SwiGLU uses Swish. Noam Shazeer's 2020 paper, GLU Variants Improve Transformer,
benchmarked the lot at matched parameter counts and found the gated variants beat the plain MLP
across the board, with GEGLU and SwiGLU at the top.
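Because the family members differ only in \phi, one function with a swappable activation covers all four. The sketch below is illustrative only: the table ACTIVATIONS and the helper glu_hidden are hypothetical names, the inputs a and b stand for the precomputed branches xW and xV, and GELU is written in its exact erf form rather than the tanh approximation some libraries use.

```python
import math

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

def relu(z: float) -> float:
    return max(0.0, z)

def swish(z: float) -> float:
    return z * sigmoid(z)

def gelu(z: float) -> float:
    """Exact GELU: z times the standard normal CDF at z."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

# One entry per family member; only the activation phi changes.
ACTIVATIONS = {"GLU": sigmoid, "ReGLU": relu, "GEGLU": gelu, "SwiGLU": swish}

def glu_hidden(a, b, variant: str):
    """phi(a) * b elementwise, where a = xW and b = xV are precomputed."""
    phi = ACTIVATIONS[variant]
    return [phi(ai) * bi for ai, bi in zip(a, b)]

a, b = [1.0, -1.0], [2.0, 2.0]  # hypothetical branch values
for name in ACTIVATIONS:
    print(name, glu_hidden(a, b, name))
```

The down projection W_2 is left out here because it is identical across the variants.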
The author famously closed by attributing the improvement to “divine benevolence” — a
wry admission that there is no clean theoretical reason one gate should beat another; the ranking is
settled empirically. What is clear is why gating helps at all: a multiplicative
interaction lets the block represent functions an additive two-layer MLP cannot, for almost no extra
cost. That practical edge, reproduced again and again, is why SwiGLU is now the default FFN in
many large language models, PaLM and LLaMA among them.