SwiGLU

The classic feed-forward block is a sandwich: project up, apply a ReLU, project back down, \operatorname{ReLU}(xW_1)\,W_2. Modern models swap the filling for a gate. SwiGLU computes two projections of the token and lets one multiplicatively modulate the other before the down-projection. That single multiplicative interaction buys more expressive power per parameter, which is why Llama, PaLM, and many other recent models use SwiGLU in place of the plain ReLU MLP.

Gating the feed-forward block, line by line

Take a token's vector x \in \mathbb{R}^d. We build the gated block in four moves, then check its parameter budget and its results.

Step 1 — a smooth activation: Swish/SiLU. Replace the kinked ReLU with a smooth cousin. Swish (also called SiLU) multiplies the input by its own sigmoid \sigma(z) = 1/(1 + e^{-z}):

\operatorname{Swish}(z) = z\,\sigma(z) = \frac{z}{1 + e^{-z}}.

For large positive z the sigmoid is near 1 and \operatorname{Swish}(z) \approx z — just like ReLU. For large negative z the sigmoid is near 0 and the output decays to 0. But around the origin it is smooth, even dipping slightly below zero, instead of ReLU's hard corner — a softer, differentiable gate.
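The limiting behaviour described above can be checked numerically. The following is a minimal sketch in plain Python; the function names `sigmoid`, `swish`, and `relu` are chosen here for illustration and do not come from any particular library.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid, branched so math.exp never overflows for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def swish(z: float) -> float:
    """Swish / SiLU: the input scaled by its own sigmoid."""
    return z * sigmoid(z)


def relu(z: float) -> float:
    """Plain ReLU, for comparison."""
    return max(0.0, z)


if __name__ == "__main__":
    # Compare the two activations far from the origin and near it.
    for z in (-10.0, -1.0, 0.0, 1.0, 10.0):
        print(f"z={z:6.1f}  relu={relu(z):8.4f}  swish={swish(z):8.4f}")
```

At z = -1 Swish is roughly -0.27, the small negative dip that ReLU does not have, while at z = 10 and z = -10 the two functions agree to within about 0.0005.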

Step 2 — make two projections, not one. Project the token up to the hidden width twice, with two separate matrices W and V:

a = xW \quad(\text{the activation branch}), \qquad b = xV \quad(\text{the gate branch}).

Step 3 — gate one branch with the other. Pass the activation branch through Swish, then multiply it elementwise (\odot) by the raw gate branch:

g = \operatorname{Swish}(xW) \odot (xV).

Read it as a learned valve: xV decides, coordinate by coordinate, how much of the activation \operatorname{Swish}(xW) is allowed through. Where the gate is near zero the channel is shut; where it is large the channel is amplified. That multiplicative interaction is something the plain additive MLP simply cannot express.

Step 4 — project back down. A third matrix W_2 brings the gated hidden vector back to the model width:

\operatorname{SwiGLU}(x) = \big(\operatorname{Swish}(xW) \odot (xV)\big)\,W_2.
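Steps 2 to 4 can be sketched for a single token in dependency-free Python. This is an illustration under stated assumptions, not a production implementation: the helper names (`vec_mat`, `gated_hidden`, `swiglu`) and the toy weights are made up for the example, biases are omitted, and a real model would use a tensor library and batched inputs.

```python
import math


def swish(z: float) -> float:
    # z * sigmoid(z); adequate for the small magnitudes in this toy example.
    return z / (1.0 + math.exp(-z))


def vec_mat(x, m):
    """Row vector times matrix: (x m)[j] = sum_i x[i] * m[i][j]."""
    return [sum(x[i] * m[i][j] for i in range(len(x))) for j in range(len(m[0]))]


def gated_hidden(x, w, v):
    """Steps 2 and 3: g = Swish(xW) * (xV), elementwise."""
    a = vec_mat(x, w)  # activation branch, length d_ff
    b = vec_mat(x, v)  # gate branch, length d_ff
    return [swish(ai) * bi for ai, bi in zip(a, b)]


def swiglu(x, w, v, w2):
    """Step 4: project the gated hidden vector back to the model width."""
    return vec_mat(gated_hidden(x, w, v), w2)


# Toy sizes (assumed for illustration): d = 2, d_ff = 3.
x = [1.0, 2.0]
W = [[0.5, -1.0, 0.0],
     [0.25, 1.0, 1.0]]   # xW = [1, 1, 2]
V = [[1.0, 2.0, 2.0],
     [0.0, 0.0, -1.0]]   # xV = [1, 2, 0]: the third gate is exactly zero
W2 = [[1.0, 0.0],
      [0.0, 1.0],
      [1.0, 1.0]]

g = gated_hidden(x, W, V)
y = swiglu(x, W, V, W2)

if __name__ == "__main__":
    print("gated hidden:", g)
    print("output:", y)
```

In this toy setup the third hidden channel has a nonzero activation, Swish(2), but its gate is zero, so it contributes nothing to the output: the valve is shut. The second channel's gate of 2 doubles its activation instead.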

Step 5 — keep the parameter budget honest. We now carry three matrices (W, V, W_2) where the old block had two (W_1, W_2), so a naïve swap would inflate the parameter count by 50%. To compare fairly, shrink the hidden width by the same factor: use

d_{\text{ff}} = \tfrac{2}{3}\cdot 4d = \tfrac{8}{3}d \approx 2.67d

instead of the usual 4d. Three matrices at width \tfrac{8}{3}d hold 3 \cdot d \cdot \tfrac{8}{3}d = 8d^2 weights, exactly the 2 \cdot d \cdot 4d = 8d^2 of two matrices at width 4d (biases aside), so the gain below is a genuine, budget-matched win, not just a bigger block.
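The budget arithmetic is easy to verify by counting weights. The sketch below ignores biases, and the model width d = 768 is an illustrative assumption, chosen because it is divisible by 3; the function names are hypothetical.

```python
def relu_mlp_params(d: int, d_ff: int) -> int:
    """Two weight matrices: d x d_ff (up) and d_ff x d (down). Biases ignored."""
    return 2 * d * d_ff


def swiglu_params(d: int, d_ff: int) -> int:
    """Three weight matrices: W and V (d x d_ff each) and W_2 (d_ff x d)."""
    return 3 * d * d_ff


d = 768                                    # illustrative model width (assumption)
baseline = relu_mlp_params(d, 4 * d)       # classic block at width 4d
naive = swiglu_params(d, 4 * d)            # gated block, hidden width unchanged
matched = swiglu_params(d, (8 * d) // 3)   # gated block at width (8/3) d

if __name__ == "__main__":
    print("ReLU MLP, d_ff = 4d:     ", baseline)
    print("SwiGLU,   d_ff = 4d:     ", naive)
    print("SwiGLU,   d_ff = (8/3) d:", matched)
```

With these numbers the naive swap carries exactly 1.5 times the baseline weights, and the shrunken width restores an exact match. When d is not divisible by 3, \tfrac{8}{3}d is not an integer, so the hidden width has to be rounded and the match is only approximate.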

Step 6 — and it wins. At matched parameters and compute, SwiGLU consistently lowers the loss versus the ReLU MLP. The extra arithmetic is cheap; the multiplicative gate is what pays off.

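SwiGLU is a gated feed-forward block built from a smooth activation:

\operatorname{SwiGLU}(x) = \big(\operatorname{Swish}(xW) \odot (xV)\big)\,W_2, \qquad \operatorname{Swish}(z) = z\,\sigma(z).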

SwiGLU is one member of the gated linear unit (GLU) family, all sharing the shape (\phi(xW) \odot xV)\,W_2 and differing only in the activation \phi: GEGLU uses GELU, ReGLU uses ReLU, the bare GLU uses a sigmoid, and SwiGLU uses Swish. A widely cited 2020 study benchmarked the lot at matched parameter counts and found the gated variants beat the plain MLP across the board, with GEGLU and SwiGLU at the top.
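Because the variants differ only in \phi, the whole family can be sketched as one gating function with a pluggable activation. The example below works on a single hidden coordinate, with a standing in for (xW)_i and b for (xV)_i, and leaves out the down-projection W_2. The dictionary `GLU_ACTIVATIONS` and the function `gated_unit` are hypothetical names for this illustration, and GELU is written in its exact erf form.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def relu(z: float) -> float:
    return max(0.0, z)


def gelu(z: float) -> float:
    # Exact GELU: z times the standard normal CDF evaluated at z.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


def swish(z: float) -> float:
    return z * sigmoid(z)


# One activation per family member; only phi changes between variants.
GLU_ACTIVATIONS = {
    "GLU": sigmoid,
    "ReGLU": relu,
    "GEGLU": gelu,
    "SwiGLU": swish,
}


def gated_unit(a: float, b: float, phi) -> float:
    """phi(a) * b for one hidden coordinate: a = (xW)_i, b = (xV)_i."""
    return phi(a) * b


if __name__ == "__main__":
    a, b = 1.0, 3.0
    for name, phi in GLU_ACTIVATIONS.items():
        print(f"{name:7s} phi({a}) * {b} = {gated_unit(a, b, phi):.4f}")
```

Swapping variants is then a one-line change of `phi`, which is what makes a like-for-like benchmark of the family straightforward to set up.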

The author famously closed by attributing the improvement to “divine benevolence” — a wry admission that there is no clean theoretical reason one gate should beat another; the ranking is settled empirically. What is clear is why gating helps at all: a multiplicative interaction lets the block represent functions an additive two-layer MLP cannot, for almost no extra cost. That practical edge, reproduced again and again, is why SwiGLU is now the default FFN of frontier models.

Smooth versus kinked

Plotted together: the kinked \operatorname{ReLU}(z) = \max(0, z) with its hard corner at the origin, the smooth \operatorname{Swish}(z) = z\,\sigma(z) that hugs ReLU for large |z| but rounds the corner and dips slightly negative around zero, and the sigmoid \sigma(z) that does the gating — rising smoothly from 0 to 1. Swish is just z times that gate.