This is the summit. Behind us lies the whole climb — embeddings, attention, the
The remarkable thing is how little the skeleton has changed since 2017. A modern model is still a stack of transformer blocks reading and writing a residual stream. What changed is that nearly every component inside the block has been quietly swapped for a better one — and your job here is simply to recognise each upgrade as it slots into place.
A modern model (Llama-style) is a decoder-only transformer: one tower of causal
blocks, no separate encoder. Take a single block and follow a token's vector through it.
Step 1 — normalise first, the modern way. The original block normalised after
each sublayer (post-norm LayerNorm); modern blocks normalise before (pre-norm) and use the
cheaper RMSNorm.
Pre-norm keeps the residual path clean, so very deep stacks train stably, and RMSNorm skips LayerNorm's mean-centring, which makes it cheaper. Together, those two swaps are how 2017's "post-norm LayerNorm" became "pre-norm RMSNorm".
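As a minimal sketch of this step, assuming NumPy and a stream of shape `(seq, d_model)`, the normalisation and the pre-norm wrapper might look like the following. The names `rms_norm`, `pre_norm_residual`, `gain`, and `eps` are illustrative choices, not taken from any particular codebase.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """Rescale x so its root-mean-square over the last axis is about 1.

    Unlike LayerNorm there is no mean subtraction; `gain` is the learned
    per-feature scale. `eps` is an assumed small constant for stability.
    """
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return x / rms * gain

def pre_norm_residual(x, sublayer, gain):
    """Pre-norm: only the sublayer's input is normalised.

    The residual path adds the untouched x, so the stream itself is
    never rescaled on its way up the stack.
    """
    return x + sublayer(rms_norm(x, gain))

x = np.array([[3.0, 4.0]])
normed = rms_norm(x, gain=np.ones(2))
```

Here `sublayer` stands in for either attention or the FFN; the wrapper is the same for both.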
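Step 2 — attend, with rotary positions and shared KV. Run causal self-attention on the normalised vector, with RoPE supplying positions and grouped-query attention sharing key/value heads across query heads, then add the result back to the stream.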
Same residual-add as ever — the upgrades are inside the attention, invisible to the stream around it.
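The two upgrades inside the attention can be sketched as below. This is an illustration under stated assumptions, not a reference implementation: it uses NumPy, omits the learned query/key/value/output projections, rotates interleaved feature pairs (one of several RoPE layout conventions), and assumes a frequency base of 10000. The function names are hypothetical.

```python
import numpy as np

def rope(x, base=10000.0):
    """Rotate consecutive feature pairs of x by position-dependent angles.

    x has shape (seq, heads, head_dim) with an even head_dim.
    `base` and the interleaved pairing are assumptions of this sketch.
    """
    seq, _, dim = x.shape
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    angles = np.arange(seq)[:, None] * inv_freq[None, :]   # (seq, dim/2)
    cos = np.cos(angles)[:, None, :]
    sin = np.sin(angles)[:, None, :]
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def gqa_causal_attention(q, k, v):
    """Causal attention where several query heads share one KV head.

    q: (seq, n_q_heads, d); k, v: (seq, n_kv_heads, d),
    with n_q_heads a multiple of n_kv_heads.
    """
    seq, n_q, d = q.shape
    group = n_q // k.shape[1]
    q, k = rope(q), rope(k)              # positions enter via rotation only
    k = np.repeat(k, group, axis=1)      # each KV head serves `group` query heads
    v = np.repeat(v, group, axis=1)
    scores = np.einsum("qhd,khd->hqk", q, k) / np.sqrt(d)
    future = np.triu(np.ones((seq, seq), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # no looking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hqk,khd->qhd", weights, v)

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 4, 8))   # 4 query heads
k = rng.normal(size=(5, 2, 8))   # 2 shared KV heads
v = rng.normal(size=(5, 2, 8))
out = gqa_causal_attention(q, k, v)
```

Only `k` and `v` would need caching at inference time, so halving the number of KV heads, as in this toy example, halves that cache.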
Step 3 — think, through a gated FFN. Normalise again, then run the feed-forward
block, but the modern FFN is the gated SwiGLU variant, replacing the 2017 ReLU MLP.
Step 4 — (optional) make the FFN sparse. The largest models go one step further and
replace that single FFN with a mixture-of-experts (MoE) layer.
Step 5 — read the whole block. Two pre-norm, residual-wrapped sublayers, exactly as in 2017 — only every part inside is the upgraded one:
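One way to read the block as code is the sketch below, assuming NumPy, a stream of shape `(seq, d_model)`, and a dense (non-MoE) FFN. The `attention` argument is a placeholder callable for the attention sublayer, and the parameter names in the dictionary `p` are hypothetical.

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm over the last axis with a learned per-feature gain."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps) * gain

def silu(x):
    """SiLU (swish) activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated FFN: a SiLU-activated gate multiplies a linear 'up' projection."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def block(x, attention, p):
    """One pre-norm block: two residual-wrapped sublayers on the stream x."""
    x = x + attention(rms_norm(x, p["attn_norm"]))               # gather
    x = x + swiglu_ffn(rms_norm(x, p["ffn_norm"]),
                       p["w_gate"], p["w_up"], p["w_down"])      # think
    return x

# Toy usage with a stand-in attention that contributes nothing.
rng = np.random.default_rng(0)
d_model, d_hidden = 8, 16
p = {
    "attn_norm": np.ones(d_model),
    "ffn_norm": np.ones(d_model),
    "w_gate": rng.normal(size=(d_model, d_hidden)),
    "w_up": rng.normal(size=(d_model, d_hidden)),
    "w_down": rng.normal(size=(d_hidden, d_model)),
}
x = rng.normal(size=(3, d_model))
y = block(x, attention=lambda h: np.zeros_like(h), p=p)
```

The two lines inside `block` are the whole wiring; everything else is a part that can be swapped without touching it.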
Stack
Step 6 — train it at scale. The architecture is inert until trained. Pretrain on
trillions of tokens with the
Step 7 — align it. A freshly pretrained model predicts text but does not yet follow instructions. Two more stages fix that: supervised fine-tuning (SFT) on demonstrations, then preference optimisation — DPO or RLHF — to make it helpful and steerable. Pretraining grows the brain; alignment teaches it manners.
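To make the preference-optimisation stage concrete, here is a hedged sketch of the DPO loss for a single preference pair, using only the standard library. It assumes you already have the summed log-probability of each response under the model being trained and under a frozen reference copy; the function and argument names are illustrative, and `beta=0.1` is an assumed default rather than a recommended setting.

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of responses.

    Inputs are summed sequence log-probabilities under the policy
    (logp_*) and under the frozen reference model (ref_logp_*).
    """
    # How much more the policy favours each response than the reference does.
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    z = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))

# Policy identical to the reference: the loss is log 2.
baseline = dpo_loss(-3.0, -3.0, -3.0, -3.0)
# Policy has shifted probability towards the chosen response: lower loss.
improved = dpo_loss(-1.0, -5.0, -3.0, -3.0)
```

The loss falls as the policy raises the chosen response and lowers the rejected one relative to the reference, which is the sense in which it learns from preferences without a separate reward model.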
Step 8 — serve it. Finally the model meets users, and a separate stack of inference
tricks earns its keep: the
Treat this page as a snapshot, not a monument. The skeleton (a stack of attention + FFN blocks on a residual stream) has held for years, but the parts keep turning over, and several frontiers are open right now:
The lesson of this branch is not any single architecture but the method: each gain came from one isolated, well-understood improvement, dropped into a backbone clean enough to receive it. That is how the frontier moves — one digestible idea at a time.
One transformer block, drawn twice over. Flip the switch between 2017 and today and watch each component swap: added sinusoids become RoPE, multi-head becomes grouped-query, post-norm LayerNorm becomes pre-norm RMSNorm, the ReLU MLP becomes a SwiGLU gate (with optional experts). The wiring — two residual-wrapped sublayers — never moves; only the parts inside do.
You started with a single neuron and a sigmoid. You learned how to make networks deep without their gradients dying, how attention let a model look anywhere in a sequence, how the transformer block packages gather-then-think, how to train such a thing across a datacentre, and how a parade of modern refinements — RoPE, RMSNorm, SwiGLU, grouped-query attention, mixture-of-experts, long-context scaling — sharpened every edge. Assembled, trained, aligned, and served, that is a frontier large language model. There is no further secret ingredient: it is these ideas, stacked. The frontier will keep moving, but you now know the shape of the thing it is moving.