A feed-forward block holds most of a transformer's parameters and does most of its per-token thinking.
The obvious way to make a model smarter is to make that block bigger — but a bigger FFN costs more
compute on every token, and that bill is paid trillions of times during training and serving.
Mixture of experts (MoE) loosens that trade-off. Instead of one fat FFN, give the block
N separate expert FFNs and a tiny router
that, for each token, picks just a few experts to run. The model gains the capacity of
N FFNs while each token pays the compute for only a handful. It is not quite a
free lunch, because every expert still has to sit in memory, but it is about as close as deep learning gets to one.
The sparse block, line by line
Take a single token's vector x \in \mathbb{R}^d arriving at the
feed-forward position. In a dense block it would go through one
\operatorname{FFN}. In an MoE block it meets a router and a roster of
N experts E_1, \dots, E_N, each one its own
ordinary FFN.
Step 1 — score the experts. A small router matrix
W_r \in \mathbb{R}^{N \times d} turns the token into one logit per expert:
\ell = W_r\, x \in \mathbb{R}^{N}, \qquad \ell_i = w_i \cdot x \;\text{ is expert } i\text{'s score.}
This router is cheap — a single N \times d matrix, negligible next to the
experts it dispatches to.
Step 2 — turn scores into gates. Pass the logits through a
softmax
so they become non-negative weights that sum to one — the gating distribution over
experts:
g = \operatorname{softmax}(W_r\, x) \in \mathbb{R}^{N}, \qquad g_i \ge 0, \quad \sum_{i=1}^{N} g_i = 1.
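Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production router: the function name `router_gates` and the toy values of `W_r` and `x` are assumptions made for the example, and the softmax subtracts the largest logit first, a common numerical-stability trick that does not change the result.

```python
import math

def router_gates(W_r, x):
    """Return (logits, gates) for one token.

    W_r: list of N rows, each a list of d floats (one row per expert).
    x:   list of d floats (the token vector).
    """
    # Step 1: one logit per expert, l_i = w_i . x
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    # Step 2: softmax, shifted by the max logit for numerical stability
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    gates = [e / total for e in exps]
    return logits, gates

# Hypothetical toy values: N = 4 experts, d = 3
W_r = [[1.0, 0.0, 0.0],
       [0.0, 1.0, 0.0],
       [0.0, 0.0, 1.0],
       [1.0, 1.0, 1.0]]
x = [0.5, -1.0, 2.0]

logits, gates = router_gates(W_r, x)
print(logits)      # one score per expert
print(sum(gates))  # the gates form a probability distribution
```

With these toy values the logits are 0.5, -1.0, 2.0 and 1.5, so the third expert receives the largest gate.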
Step 3 — keep only the top k. Here is the whole trick.
Rather than run all N experts, select the
k with the largest gates (for example
k = 2 out of N = 8, the configuration Mixtral uses) and discard the rest:
\mathcal{T} = \operatorname{top\text{-}}k\big(g\big) \subseteq \{1, \dots, N\}, \qquad |\mathcal{T}| = k \ll N.
Every other expert is simply not evaluated for this token. This is what makes the block
sparse: of N available FFNs, only
k ever touch the token.
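The selection in Step 3 amounts to sorting expert indices by gate value and keeping the first k. The sketch below assumes a hypothetical gating distribution over N = 5 experts and uses 0-based indices, whereas the text numbers experts from 1.

```python
def top_k_indices(gates, k):
    """Indices of the k largest gates, highest gate first."""
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return order[:k]

# Hypothetical gating distribution over N = 5 experts
gates = [0.05, 0.40, 0.10, 0.30, 0.15]
chosen = top_k_indices(gates, k=2)
print(chosen)  # [1, 3]
```

Real implementations usually do this for a whole batch at once with a vectorised top-k primitive; the per-token loop here is only for readability.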
Step 4 — combine the chosen experts. Run the token through just the chosen experts and
take a gate-weighted sum of their outputs. (The gates over \mathcal{T} are
usually renormalised so the weights on the survivors sum to one.)
y = \sum_{i \in \mathcal{T}} \frac{g_i}{\sum_{j \in \mathcal{T}} g_j}\; E_i(x).
The output has the same shape \mathbb{R}^d as a dense FFN's, so the MoE
block drops straight into a transformer where the feed-forward block used to be — nothing downstream
notices.
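Putting Steps 1 to 4 together, a single-token forward pass might look like the following sketch. Everything named here is an assumption for illustration: `moe_forward` and `make_scaling_expert` are hypothetical helpers, and the "experts" simply scale their input by a constant so the result is easy to check by hand, where a real expert would be a full FFN.

```python
import math

def softmax(logits):
    m = max(logits)  # shift by the max for numerical stability
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, W_r, experts, k):
    """Sparse MoE output for one token x (a list of d floats)."""
    # Steps 1-2: router logits, then gates
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    gates = softmax(logits)
    # Step 3: keep the k experts with the largest gates
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Step 4: renormalise the surviving gates and mix the expert outputs
    norm = sum(gates[i] for i in chosen)
    y = [0.0] * len(x)
    for i in chosen:
        weight = gates[i] / norm
        out = experts[i](x)  # only the chosen experts are ever evaluated
        y = [yj + weight * oj for yj, oj in zip(y, out)]
    return y, chosen

def make_scaling_expert(scale):
    # Stand-in for an FFN: multiplies every component by a constant
    return lambda v: [scale * vj for vj in v]

experts = [make_scaling_expert(s) for s in (1.0, 2.0, 3.0, 4.0)]  # N = 4
W_r = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]          # d = 2
x = [2.0, 1.0]

y, chosen = moe_forward(x, W_r, experts, k=2)
print(chosen, y)
```

Here the router logits are 2, 1, -2 and -1, so experts 0 and 1 are chosen, and the output is a vector of the same length as `x`, as the text requires.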
Step 5 — count parameters two ways. Let one expert FFN hold
P parameters. The block stores all
N experts but runs only k:
\underbrace{P_{\text{total}} \approx N \cdot P}_{\text{capacity (in memory)}} \qquad\text{vs}\qquad \underbrace{P_{\text{active}} \approx k \cdot P}_{\text{compute per token}}.
With N = 8, k = 2 the model carries 8\times the
FFN parameters but activates only 2\times per token — and the router has
decoupled the two numbers. You scale capacity by adding experts (raise
N) without raising the per-token compute (hold k
fixed). That gap between total and active parameters is the entire
reason MoE exists.
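To make the two counts concrete, here is a small arithmetic sketch. It assumes each expert is a plain two-matrix FFN (d to h and back to d, no biases or gated variants), and the sizes `d_model = 4096` and `d_hidden = 16384` are hypothetical values chosen for illustration, not figures from any particular model.

```python
def moe_ffn_params(d_model, d_hidden, n_experts, k):
    """Parameter counts for one MoE block under the stated assumptions."""
    per_expert = 2 * d_model * d_hidden          # up- and down-projection
    router = n_experts * d_model                 # the N x d matrix W_r
    total = n_experts * per_expert + router      # held in memory
    active = k * per_expert + router             # touched per token
    return per_expert, router, total, active

per_expert, router, total, active = moe_ffn_params(
    d_model=4096, d_hidden=16384, n_experts=8, k=2
)
print(f"per expert: {per_expert:,}")
print(f"router:     {router:,}")
print(f"total:      {total:,}")
print(f"active:     {active:,}")
```

Under these assumptions the router adds about 33 thousand parameters next to roughly 134 million per expert, which is why the approximations N P and k P ignore it.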
Replace a block's single feed-forward network with a sparse mixture:
- Experts + router. N independent expert FFNs E_1, \dots, E_N and a small router g = \operatorname{softmax}(W_r\, x) that scores them per token.
- Top-k sparse routing. Each token is sent to only its k highest-gated experts (k \ll N), and the output is the gate-weighted sum y = \sum_{i \in \mathcal{T}} \tfrac{g_i}{\sum_{j \in \mathcal{T}} g_j} E_i(x).
- Total \gg active parameters. The block stores P_{\text{total}} \approx N P but computes only P_{\text{active}} \approx k P per token, so capacity scales with N while compute scales with k.
A router left to its own devices is lazy: it discovers a few good experts early and keeps routing
everything to them, while the others starve, never receive gradient, and stay useless — a
collapse that wastes most of the capacity you paid for. The fix is an auxiliary
load-balancing loss added to the training objective. Let f_i be
the fraction of tokens in a batch routed to expert i and
p_i the router's gate g_i for that expert, averaged over all tokens in the batch; a term proportional to
\mathcal{L}_{\text{aux}} = N \sum_{i=1}^{N} f_i\, p_i
is minimised when the load is spread evenly across experts, gently pressuring the router to
use all of them. Some designs instead cap each expert's capacity and drop overflow tokens,
or add noise to the router logits to encourage exploration.
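The auxiliary term can be computed directly from a batch of gating distributions. The sketch below is one reading of the formula, under two assumptions that the text does not fix: f_i is normalised by the number of routing slots (tokens times k) so that it sums to one, and p_i averages the gate over every token in the batch. The function name and the two toy batches are hypothetical.

```python
def load_balancing_loss(gates_per_token, k=1):
    """N * sum_i f_i * p_i for a batch of per-token gating distributions."""
    n_tokens = len(gates_per_token)
    n_experts = len(gates_per_token[0])
    counts = [0] * n_experts       # how often each expert is selected
    gate_sums = [0.0] * n_experts  # running sum of each expert's gate
    for gates in gates_per_token:
        ranked = sorted(range(n_experts), key=lambda i: gates[i], reverse=True)
        for i in ranked[:k]:
            counts[i] += 1
        for i, g in enumerate(gates):
            gate_sums[i] += g
    f = [c / (n_tokens * k) for c in counts]  # assumption: f sums to 1
    p = [s / n_tokens for s in gate_sums]     # mean gate per expert
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Hypothetical batches of 4 tokens over N = 4 experts
balanced = [[0.7, 0.1, 0.1, 0.1],
            [0.1, 0.7, 0.1, 0.1],
            [0.1, 0.1, 0.7, 0.1],
            [0.1, 0.1, 0.1, 0.7]]
collapsed = [[0.7, 0.1, 0.1, 0.1]] * 4  # every token prefers expert 0

loss_balanced = load_balancing_loss(balanced)
loss_collapsed = load_balancing_loss(collapsed)
print(loss_balanced, loss_collapsed)
```

With top-1 routing the balanced batch gives a value of about 1.0 and the collapsed batch about 2.8, so the penalty grows as routing concentrates on one expert.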
This is not a toy. Sparse-MoE blocks power several frontier models — Mixtral
(8 experts, top-2) is the best-known open example, and a number of the largest proprietary models
are widely believed to be MoEs. They are how a model can advertise a trillion total parameters
while only activating a few tens of billions per token: most of the headline count is capacity
sitting in memory, summoned a couple of experts at a time.