The Modern LLM Stack

This is the summit. Behind us lies the whole climb — embeddings, attention, the transformer block, the training recipe, the inference tricks, and a clutch of modern refinements. Every one of them was a small idea, taught on its own page. Now we bolt them together into a single object: a 2020s frontier language model, the kind that powers the assistants people actually use.

The remarkable thing is how little the skeleton has changed since 2017. A modern model is still a stack of transformer blocks reading and writing a residual stream. What changed is that nearly every component inside the block has been quietly swapped for a better one — and your job here is simply to recognise each upgrade as it slots into place.

Walking one modern block, line by line

A modern model (Llama-style) is a decoder-only transformer: one tower of causal blocks, no separate encoder. Take a single block and follow a token's vector x \in \mathbb{R}^d up through it. At each line, note both what the component does and which 2017 part it replaced.

Step 1 — normalise first, the modern way. The original block normalised after each sublayer (post-norm LayerNorm); modern blocks normalise before (pre-norm) and use the cheaper RMSNorm, which rescales by the root-mean-square without subtracting a mean:

\hat{x} = \operatorname{RMSNorm}(x) = \frac{x}{\sqrt{\tfrac{1}{d}\sum_i x_i^2 + \varepsilon}} \odot \gamma.

Pre-norm keeps the residual path clean, so very deep stacks train stably, while RMSNorm saves the mean computation. Together, those two changes turned 2017's "post-norm LayerNorm" into today's "pre-norm RMSNorm".
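As a minimal sketch of the formula above (NumPy, with the function name `rms_norm` and the default `eps` chosen here for illustration), the whole operation is one division and one elementwise gain:

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    """Rescale x by its root-mean-square over the last axis, then apply the gain gamma."""
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gamma

x = np.array([[3.0, -4.0, 0.0, 0.0]])   # one token vector, d = 4
gamma = np.ones(4)                       # gain left at 1 for the example
y = rms_norm(x, gamma)                   # mean square is 25/4, so the RMS is 2.5
```

Note that the output is rescaled but not centred: its mean is not forced to zero, which is exactly the step LayerNorm performs and RMSNorm skips.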

Step 2 — attend, with rotary positions and shared KV. Run causal self-attention on \hat{x}. Two upgrades ride along. Positions are injected inside attention by rotating queries and keys — RoPE, replacing the sinusoids added to the embeddings in 2017. And the heads share keys and values in groups — grouped-query attention, replacing full multi-head attention to shrink the KV cache:

x \;\leftarrow\; x + \operatorname{GQA}_{\text{RoPE, causal}}\big(\operatorname{RMSNorm}(x)\big).

Same residual-add as ever — the upgrades are inside the attention, invisible to the stream around it.
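The attention sublayer can be sketched in a few lines of NumPy. This is an illustration under assumptions, not a reference implementation: the function names, the array layout `(seq, heads, head_dim)`, the interleaved-pair rotation, and the base of 10000 are choices made here, and the learned query/key/value and output projections are left out.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x (seq, heads, head_dim) by position-dependent angles."""
    seq, _, hd = x.shape
    freqs = base ** (-np.arange(0, hd, 2) / hd)            # one frequency per pair
    angles = np.arange(seq)[:, None] * freqs[None, :]      # (seq, hd/2)
    cos, sin = np.cos(angles)[:, None, :], np.sin(angles)[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def gqa(q, k, v):
    """Causal grouped-query attention.
    q: (seq, n_q_heads, hd); k, v: (seq, n_kv_heads, hd), with n_q_heads divisible by n_kv_heads."""
    seq, n_q, hd = q.shape
    group = n_q // k.shape[1]
    q, k = rope(q), rope(k)                  # positions enter here, not at the embeddings
    k = np.repeat(k, group, axis=1)          # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(hd)
    future = np.triu(np.ones((seq, seq), dtype=bool), 1)
    scores = np.where(future, -np.inf, scores)             # causal mask
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8, 16))   # 8 query heads
k = rng.normal(size=(5, 2, 16))   # only 2 key heads need caching
v = rng.normal(size=(5, 2, 16))   # only 2 value heads need caching
out = gqa(q, k, v)                # shape (5, 8, 16)
```

Because queries and keys are rotated by their own positions, their dot product depends only on the distance between the two positions, which is the property that makes RoPE a relative position scheme.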

Step 3 — think, through a gated FFN. Normalise again, then run the feed-forward block — but the modern FFN is the gated SwiGLU, replacing 2017's plain \operatorname{ReLU} MLP:

x \;\leftarrow\; x + \operatorname{SwiGLU}\big(\operatorname{RMSNorm}(x)\big), \qquad \operatorname{SwiGLU}(u) = W_2\,\big(\operatorname{Swish}(W_1 u) \odot (W_3 u)\big).
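A minimal NumPy sketch of the gated FFN, assuming a column-vector convention (so the output projection `W2` is applied last, on the left) and omitting biases; the function names and shapes are illustrative choices:

```python
import numpy as np

def swish(z):
    """Swish / SiLU: z * sigmoid(z)."""
    return z / (1.0 + np.exp(-z))

def swiglu(u, W1, W3, W2):
    """Gated FFN for a vector u of size d.
    W1, W3: (d_ff, d) gate and value projections; W2: (d, d_ff) output projection."""
    return W2 @ (swish(W1 @ u) * (W3 @ u))

rng = np.random.default_rng(0)
d, d_ff = 4, 8
W1, W3 = rng.normal(size=(d_ff, d)), rng.normal(size=(d_ff, d))
W2 = rng.normal(size=(d, d_ff))
out = swiglu(rng.normal(size=d), W1, W3, W2)   # shape (d,)
```

The gate branch `swish(W1 @ u)` decides how much of each value channel `W3 @ u` passes through, which is what distinguishes this from a plain two-layer MLP.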

Step 4 — (optional) make the FFN sparse. The largest models go one step further and replace that single FFN with a mixture of experts: E expert FFNs and a router that activates only the top k per token. Total parameters grow with E, while per-token compute grows only with k (plus a small routing cost). Optional, but increasingly standard at the frontier.
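Top-k routing for a single token can be sketched as follows. Several details are assumptions made for the example: the router is a single linear map, the gate weights are a softmax over the selected experts only (one common convention among several), and the experts are plain matrices standing in for full SwiGLU FFNs.

```python
import numpy as np

def moe_ffn(u, router_W, experts, k=2):
    """Route one token vector u to its top-k experts and mix their outputs.
    router_W: (n_experts, d); experts: list of callables mapping (d,) -> (d,)."""
    logits = router_W @ u
    top = np.argsort(logits)[-k:]                     # indices of the k largest logits
    gate = np.exp(logits[top] - logits[top].max())
    gate /= gate.sum()                                # softmax over the selected experts only
    return sum(g * experts[i](u) for g, i in zip(gate, top)), top

rng = np.random.default_rng(0)
d, n_experts = 4, 8
mats = [rng.normal(size=(d, d)) for _ in range(n_experts)]
experts = [lambda u, M=M: M @ u for M in mats]        # stand-ins for expert FFNs
router_W = rng.normal(size=(n_experts, d))
y, chosen = moe_ffn(rng.normal(size=d), router_W, experts, k=2)
```

Only the two chosen experts are ever called, so the other six contribute parameters to the model but no compute for this token.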

Step 5 — read the whole block. Two residual-wrapped sublayers, wired as in 2017, but each now normalised on the way in and built from the upgraded parts:

\operatorname{Block}(x) = \big[\,\text{SwiGLU sublayer}\,\big] \circ \big[\,\text{GQA+RoPE sublayer}\,\big](x), \quad\text{both pre-RMSNorm.}

Stack N of these, embed the tokens at the bottom, apply a final RMSNorm and an output projection to vocabulary logits at the top, and the architecture is complete.
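The wiring of the full stack can be sketched independently of what sits inside each sublayer. In the sketch below the two sublayers are deliberately replaced by placeholders, which is an assumption made to keep the example short: a causal running mean stands in for GQA with RoPE, and `np.tanh` stands in for SwiGLU. All function and parameter names are hypothetical.

```python
import numpy as np

def rms_norm(x, gamma, eps=1e-6):
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gamma

def block(x, p, attn, ffn):
    """One pre-norm block: two residual-wrapped sublayers."""
    x = x + attn(rms_norm(x, p["g_attn"]))
    x = x + ffn(rms_norm(x, p["g_ffn"]))
    return x

def forward(token_ids, embed, blocks, g_final, W_out, attn, ffn):
    """Embed -> stacked blocks -> final RMSNorm -> vocabulary logits."""
    x = embed[token_ids]                      # (seq, d)
    for p in blocks:
        x = block(x, p, attn, ffn)
    return rms_norm(x, g_final) @ W_out       # (seq, vocab)

# Placeholder sublayers (NOT real attention or SwiGLU), shared by every block for brevity.
causal_mean = lambda h: np.cumsum(h, axis=0) / np.arange(1, len(h) + 1)[:, None]

rng = np.random.default_rng(0)
vocab, d, n_blocks = 11, 8, 3
embed = rng.normal(size=(vocab, d))
W_out = rng.normal(size=(d, vocab))
blocks = [{"g_attn": np.ones(d), "g_ffn": np.ones(d)} for _ in range(n_blocks)]
logits = forward(np.array([1, 4, 2]), embed, blocks, np.ones(d), W_out, causal_mean, np.tanh)
```

Swapping the placeholders for real attention and a real gated FFN changes nothing in `block` or `forward`, which is the sense in which the skeleton has stayed fixed while the parts changed.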

Step 6 — train it at scale. The architecture is inert until trained. Pretrain on trillions of tokens with the training recipe: AdamW, a warmup-then-cosine learning rate, mixed-precision arithmetic, and 3-D parallelism (data × tensor × pipeline) to spread the model over thousands of GPUs — at a size chosen by scaling laws to match the compute budget.
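Of the recipe's ingredients, the learning-rate schedule is the easiest to write down. A minimal sketch, assuming linear warmup followed by cosine decay to a floor; every numeric default here is a hypothetical placeholder, not a recommended setting:

```python
import math

def lr_at(step, max_lr=3e-4, min_lr=3e-5, warmup_steps=2000, total_steps=100_000):
    """Linear warmup to max_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (max_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

schedule = [lr_at(s) for s in (0, 2000, 51_000, 100_000)]
```

The rate climbs from near zero to its peak during warmup, sits halfway between peak and floor at the midpoint of the decay, and ends at the floor.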

Step 7 — align it. A freshly pretrained model predicts text but does not yet follow instructions. Two more stages fix that: supervised fine-tuning (SFT) on demonstrations, then preference optimisation — DPO or RLHF — to make it helpful and steerable. Pretraining grows the brain; alignment teaches it manners.
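For the preference stage, the DPO objective for a single preference pair is compact enough to sketch directly. The inputs are assumed to be summed log-probabilities of each full response under the model being trained and under a frozen reference model; the argument names and the value of `beta` are illustrative.

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair, from summed sequence log-probabilities."""
    margin = (logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected)
    return math.log1p(math.exp(-beta * margin))   # equals -log(sigmoid(beta * margin))

neutral = dpo_loss(-10.0, -12.0, -10.0, -12.0)    # policy identical to the reference
better = dpo_loss(-8.0, -14.0, -10.0, -12.0)      # policy has shifted toward the chosen reply
```

The loss falls as the model raises the chosen response's likelihood relative to the reference faster than it raises the rejected one's, so no separate reward model or sampling loop is needed.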

Step 8 — serve it. Finally the model meets users, and a separate stack of inference tricks earns its keep: the KV cache (reuse past keys/values), quantization (run the weights in 8 or 4 bits), and paged attention (manage the cache like virtual memory). Together they turn a research artifact into a service that responds at interactive speed.
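Two of these tricks reduce to simple arithmetic, sketched below. The model dimensions are hypothetical round numbers rather than any particular model's configuration, and the quantizer is the simplest symmetric per-tensor scheme (it assumes a tensor that is not all zeros); production systems typically use finer-grained scales.

```python
import numpy as np

def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """Bytes to cache keys and values for one sequence (the factor 2 is K plus V)."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value

def quantize_int8(w):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

# Hypothetical 32-layer model, 128-dim heads, 4096-token context, 16-bit cache entries.
full_mha = kv_cache_bytes(32, 32, 128, 4096)   # 32 KV heads: 2 GiB per sequence
grouped = kv_cache_bytes(32, 8, 128, 4096)     # 8 KV heads:  0.5 GiB per sequence

w = np.random.default_rng(0).normal(size=(64, 64))
q, scale = quantize_int8(w)
max_err = np.max(np.abs(w - scale * q))        # bounded by half a quantization step
```

The cache shrinks in direct proportion to the number of KV heads, which is why grouped-query attention, an architectural choice, pays off mainly at serving time.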

A 2020s frontier language model is a decoder-only transformer with every component upgraded, then trained, aligned, and served:

Architecture. Decoder-only blocks with RoPE (vs added sinusoids), pre-norm RMSNorm (vs post-norm LayerNorm), a SwiGLU gated FFN (vs ReLU MLP), and grouped-query attention (vs full multi-head) — optionally with mixture-of-experts FFNs.
Trained by the scaled recipe: AdamW + warmup–cosine schedule + mixed precision + 3-D parallelism, sized by scaling laws.
Aligned by SFT then DPO/RLHF to follow instructions.
Served with the KV cache, quantization, and paged attention.

Treat this page as a snapshot, not a monument. The skeleton (a stack of attention + FFN blocks on a residual stream) has held for years, but the parts keep turning over, and several frontiers are open right now:

Beyond quadratic attention. State-space models (Mamba) and linear-attention hybrids chase O(n) sequence mixing; many recent models interleave them with ordinary attention blocks.
Sparser and bigger. Finer-grained mixture-of-experts (more experts, each one smaller) keeps pushing total parameters up while holding active parameters down.
Test-time compute. Reasoning models spend extra computation at inference — generating long chains of thought — trading serving cost for accuracy, a knob the 2017 design never had.
Multimodality and longer memory. Images, audio, and tools fold into the same token stream, and context windows keep stretching toward the million-token mark.

The lesson of this branch is not any single architecture but the method: each gain came from one isolated, well-understood improvement, dropped into a backbone clean enough to receive it. That is how the frontier moves — one digestible idea at a time.

See the upgrades slot in

One transformer block, drawn twice over. Flip the switch between 2017 and today and watch each component swap: added sinusoids become RoPE, multi-head becomes grouped-query, post-norm LayerNorm becomes pre-norm RMSNorm, the ReLU MLP becomes a SwiGLU gate (with optional experts). The wiring — two residual-wrapped sublayers — never moves; only the parts inside do.

The whole branch, in one breath

You started with a single neuron and a sigmoid. You learned how to make networks deep without their gradients dying, how attention let a model look anywhere in a sequence, how the transformer block packages gather-then-think, how to train such a thing across a datacentre, and how a parade of modern refinements — RoPE, RMSNorm, SwiGLU, grouped-query attention, mixture-of-experts, long-context scaling — sharpened every edge. Assembled, trained, aligned, and served, that is a frontier large language model. There is no further secret ingredient: it is these ideas, stacked. The frontier will keep moving, but you now know the shape of the thing it is moving.