Mixture of Experts

A feed-forward block holds most of a transformer's parameters and does most of its per-token thinking. The obvious way to make a model smarter is to make that block bigger — but a bigger FFN costs more compute on every token, and that bill is paid trillions of times during training and serving.

Mixture of experts (MoE) escapes the bargain. Instead of one fat FFN, give the block N separate expert FFNs and a tiny router that, for each token, picks just a few experts to run. The model gains the capacity of N FFNs while each token pays the compute of only a handful. It is not quite a free lunch, since all N experts must still be held in memory, but it is as close as deep learning gets to one.

The sparse block, line by line

Take a single token's vector x \in \mathbb{R}^d arriving at the feed-forward position. In a dense block it would go through one \operatorname{FFN}. In an MoE block it meets a router and a roster of N experts E_1, \dots, E_N, each one its own ordinary FFN.

Step 1 — score the experts. A small router matrix W_r \in \mathbb{R}^{N \times d} turns the token into one logit per expert:

\ell = W_r\, x \in \mathbb{R}^{N}, \qquad \ell_i = w_i \cdot x \;\text{ is expert } i\text{'s score.}

This router is cheap — a single N \times d matrix, negligible next to the experts it dispatches to.
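As a minimal sketch of Step 1, the router is just a matrix-vector product. The function name router_logits and the toy values below are illustrative assumptions, not taken from any particular library.

```python
def router_logits(W_r, x):
    """Return one logit per expert: the dot product of each router row with x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W_r]

# Hypothetical toy values: N = 3 experts, d = 2 features.
W_r = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]
x = [2.0, -1.0]

logits = router_logits(W_r, x)
print(logits)  # [2.0, -1.0, 1.0]
```

The cost is N times d multiply-adds per token, which is why the router barely registers next to the experts.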

Step 2 — turn scores into gates. Pass the logits through a softmax so they become non-negative weights that sum to one — the gating distribution over experts:

g = \operatorname{softmax}(W_r\, x) \in \mathbb{R}^{N}, \qquad g_i \ge 0, \quad \sum_{i=1}^{N} g_i = 1.
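A small sketch of Step 2 in plain Python. Subtracting the largest logit before exponentiating is a common numerical-stability habit and is an addition here, not something the formula above requires; the result is mathematically the same softmax.

```python
import math

def softmax(logits):
    """Map logits to non-negative gates that sum to one."""
    m = max(logits)                           # shift so the largest exponent is 0
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for N = 3 experts.
gates = softmax([2.0, -1.0, 1.0])
print(gates)       # three non-negative gates, largest for expert 0
print(sum(gates))  # sums to one, up to floating-point rounding
```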

Step 3 — keep only the top k. Here is the whole trick. Rather than run all N experts, select the k with the largest gates (for example k = 2 out of N = 8, the configuration Mixtral uses) and discard the rest:

\mathcal{T} = \operatorname{top\text{-}}k\big(g\big) \subseteq \{1, \dots, N\}, \qquad |\mathcal{T}| = k \ll N.

Every other expert is simply not evaluated for this token. This is what makes the block sparse: of N available FFNs, only k ever touch the token.
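Selecting the top k can be sketched as a sort over expert indices. The helper name top_k_indices and the gate values are assumptions for illustration; real implementations typically use a batched top-k primitive rather than a full sort.

```python
def top_k_indices(gates, k):
    """Indices of the k largest gates, highest first (ties go to the lower index)."""
    order = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)
    return order[:k]

# Hypothetical gating distribution over N = 5 experts.
gates = [0.05, 0.40, 0.10, 0.30, 0.15]
print(top_k_indices(gates, 2))  # [1, 3]
```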

Step 4 — combine the chosen experts. Run the token through just the chosen experts and take a gate-weighted sum of their outputs. (The gates over \mathcal{T} are usually renormalised so the weights on the survivors sum to one.)

y = \sum_{i \in \mathcal{T}} \frac{g_i}{\sum_{j \in \mathcal{T}} g_j}\; E_i(x).

The output has the same shape \mathbb{R}^d as a dense FFN's, so the MoE block drops straight into a transformer where the feed-forward block used to be — nothing downstream notices.
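Steps 1 to 4 can be put together in a few lines. This is a single-token sketch under stated assumptions: the experts are stand-in callables that merely scale their input (a real expert would be a full FFN), and the name moe_forward and all toy values are hypothetical.

```python
import math

def moe_forward(x, W_r, experts, k):
    """Route one token through its top-k experts; return (output, chosen indices)."""
    # Step 1: one logit per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in W_r]
    # Step 2: softmax gates (shifted by the max logit for numerical stability).
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    total = sum(exps)
    gates = [e / total for e in exps]
    # Step 3: keep the k experts with the largest gates.
    chosen = sorted(range(len(gates)), key=lambda i: gates[i], reverse=True)[:k]
    # Step 4: renormalise over the survivors and take the weighted sum.
    norm = sum(gates[i] for i in chosen)
    y = [0.0] * len(x)
    for i in chosen:
        weight = gates[i] / norm
        out = experts[i](x)                   # only the chosen experts are called
        y = [yj + weight * oj for yj, oj in zip(y, out)]
    return y, chosen

# Stand-in experts: expert i multiplies the token by a fixed scale.
experts = [lambda v, s=s: [s * vj for vj in v] for s in (1.0, 2.0, 3.0, 4.0)]
W_r = [[0.0, 0.0], [1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
x = [1.0, 0.0]

print(moe_forward(x, W_r, experts, k=2))  # ([2.5, 0.0], [1, 2])
```

Experts 1 and 2 tie for the highest gate, so each gets weight 0.5 after renormalisation and the output is 0.5 * 2x + 0.5 * 3x = 2.5x. The output list has the same length as x, matching the shape claim above.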

Step 5 — count parameters two ways. Let one expert FFN hold P parameters. The block stores all N experts but runs only k:

\underbrace{P_{\text{total}} \approx N \cdot P}_{\text{capacity (in memory)}} \qquad\text{vs}\qquad \underbrace{P_{\text{active}} \approx k \cdot P}_{\text{compute per token}}.

With N = 8, k = 2 the block stores 8\times the parameters of a single FFN but runs only 2\times that amount per token. The router has decoupled the two numbers: you scale capacity by adding experts (raise N) without raising the per-token compute (hold k fixed). That gap between total and active parameters is the entire reason MoE exists.
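The two counts are easy to check with arithmetic. The sketch below assumes a bias-free, two-matrix expert FFN with P = 2 * d_model * d_ff parameters, and it adds the router's N * d_model weights to both totals; the sizes d_model = 4096 and d_ff = 16384 are hypothetical, chosen only to make the numbers concrete.

```python
def moe_param_counts(n_experts, k, params_per_expert, d_model):
    """Return (total, active) parameter counts for one MoE block, router included."""
    router = n_experts * d_model              # the N x d routing matrix
    total = n_experts * params_per_expert + router
    active = k * params_per_expert + router   # the router runs for every token
    return total, active

d_model, d_ff = 4096, 16384                   # hypothetical sizes
P = 2 * d_model * d_ff                        # assumed two-matrix FFN, no biases

total, active = moe_param_counts(n_experts=8, k=2, params_per_expert=P, d_model=d_model)
print(total, active)   # 1073774592 268468224
print(total / active)  # close to N / k = 4
```

Doubling n_experts in this sketch roughly doubles total while active moves only by the extra router rows.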

In short, an MoE block replaces a block's single feed-forward network with a sparse mixture of experts.

A router left to its own devices is lazy: it discovers a few good experts early and keeps routing everything to them, while the others starve, never receive gradient, and stay useless — a collapse that wastes most of the capacity you paid for. The fix is an auxiliary load-balancing loss added to the training objective. Let f_i be the fraction of tokens in a batch routed to expert i and p_i the average gate it received; a term proportional to

\mathcal{L}_{\text{aux}} = N \sum_{i=1}^{N} f_i\, p_i

is minimised when the load is spread evenly across experts, gently pressuring the router to use all of them. Some designs instead cap each expert's capacity and drop overflow tokens, or add noise to the router logits to encourage exploration.
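A sketch of the auxiliary term for one batch, using the definitions of f_i and p_i given above. One assumption is made explicit here: with top-k routing, f_i is computed as the share of all routing slots (tokens times k) that went to expert i. The function name and the toy batches are hypothetical.

```python
def load_balance_loss(gates_per_token, assignments, n_experts):
    """Compute N * sum_i f_i * p_i for one batch.

    gates_per_token: one length-N gate vector per token.
    assignments: for each token, the list of expert indices it was routed to.
    """
    n_tokens = len(gates_per_token)
    n_slots = sum(len(chosen) for chosen in assignments)
    # f_i: fraction of routing slots that landed on expert i.
    f = [0.0] * n_experts
    for chosen in assignments:
        for i in chosen:
            f[i] += 1.0 / n_slots
    # p_i: average gate the router gave expert i across the batch.
    p = [sum(g[i] for g in gates_per_token) / n_tokens for i in range(n_experts)]
    return n_experts * sum(fi * pi for fi, pi in zip(f, p))

# Balanced routing: two tokens, two experts, one token each.
balanced = load_balance_loss([[0.5, 0.5], [0.5, 0.5]], [[0], [1]], 2)
# Collapsed routing: both tokens sent to expert 0 with high confidence.
collapsed = load_balance_loss([[0.9, 0.1], [0.9, 0.1]], [[0], [0]], 2)
print(balanced, collapsed)  # 1.0 for the balanced batch, about 1.8 for the collapsed one
```

The collapsed batch scores higher, so adding this term to the training loss penalises exactly the behaviour described above.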

This is not a toy. Sparse-MoE blocks power several frontier models — Mixtral (8 experts, top-2) is the best-known open example, and a number of the largest proprietary models are widely believed to be MoEs. They are how a model can advertise a trillion total parameters while only activating a few tens of billions per token: most of the headline count is capacity sitting in memory, summoned a couple of experts at a time.

Route a token to its experts

One token enters the router, which scores all N experts and lights up only the top k. The lit experts run and their outputs are gate-weighted; the dim ones do no work at all. As N and k vary, total parameters (all experts, in memory) climb with N, while active parameters (per token) track only k.

Where this goes

Mixture of experts buys parameters cheaply, but it does nothing for the other thing a large model strains against: the length of the sequence it can attend over. That is the next frontier — pushing the context window from thousands of tokens to millions.