Modern Transformer Improvements

The original 2017 transformer is still recognisable inside today's large language models, but a series of refinements has made it train more stably, run more efficiently, and reach far longer contexts. None of them changes the big picture (attention and feed-forward blocks, stacked deep); each is a targeted upgrade to one component. Knowing them is knowing the difference between the textbook transformer and the ones actually running in production.

The upgrades that stuck

Pre-norm instead of post-norm. Move the layer normalisation to before each sub-block rather than after. This keeps the residual path clean and gradients well-behaved, letting very deep stacks train without the warm-up gymnastics the original needed.
RMSNorm. A cheaper normalisation that rescales by the root-mean-square of the activations and drops the mean-subtraction and bias — nearly the same effect for less compute.
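Rotary position embeddings. Swap the positional vectors added to the input for RoPE, which rotates each query and key vector by an angle proportional to its position, so that attention scores depend only on the relative offset between tokens. This makes longer contexts easier to reach, although going well beyond the training length usually still requires rescaling the rotation frequencies.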
SwiGLU feed-forward. Replace the plain ReLU feed-forward block with a gated variant (a GLU using the SiLU/Swish activation), which consistently improves quality for the same parameter budget.
Grouped-query attention (GQA). Let several query heads share one set of key/value heads. This shrinks the memory the model must cache during generation, speeding up inference with almost no loss in quality.
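Fewer biases, and FlashAttention. Modern models often drop bias terms from linear layers (a tiny simplification) and compute attention with a FlashAttention kernel, which produces the same result without materialising the full attention matrix in memory, saving both time and memory at long context.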
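The first two upgrades in the list are easiest to see side by side in code. The following is a minimal sketch in plain Python, not any particular model's implementation: the function names, the `eps` value, and the zero-output `silent` sub-block are assumptions chosen for illustration, and the post-norm variant reuses RMSNorm for brevity even though the 2017 design used full layer normalisation.

```python
import math

def rms_norm(x, gain, eps=1e-6):
    """Rescale x by its root-mean-square: no mean subtraction, no bias."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [g * v / rms for g, v in zip(gain, x)]

def pre_norm_block(x, sublayer, gain):
    """Pre-norm: normalise the sub-block's input, then add the untouched residual."""
    y = sublayer(rms_norm(x, gain))
    return [a + b for a, b in zip(x, y)]

def post_norm_block(x, sublayer, gain):
    """Post-norm layout: add the residual first, then normalise the sum.
    (Assumption: RMSNorm stands in for the original LayerNorm here.)"""
    y = sublayer(x)
    return rms_norm([a + b for a, b in zip(x, y)], gain)

x = [3.0, -4.0, 0.0, 0.0]
gain = [1.0] * len(x)
silent = lambda h: [0.0] * len(h)  # hypothetical sub-block that outputs zeros

pre_out = pre_norm_block(x, silent, gain)    # residual passes through unchanged
post_out = post_norm_block(x, silent, gain)  # residual is rescaled by the norm
```

With a sub-block that contributes nothing, the pre-norm block returns its input exactly, while the post-norm block still rescales it. That untouched identity path is what "keeps the residual path clean" refers to.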

Stability: pre-norm + RMSNorm let deep models train reliably.
Position: RoPE encodes relative position and makes longer contexts easier to reach.
Quality & efficiency: SwiGLU feed-forwards, grouped-query attention, and FlashAttention kernels.
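As a rough illustration of the gated feed-forward named above, here is a sketch of a SwiGLU block using plain Python lists. The weight names (`w_gate`, `w_up`, `w_down`), the helper `ffn_params`, and the example widths are assumptions for illustration; real implementations use tensor libraries and fused kernels.

```python
import math

def silu(x):
    """SiLU / Swish activation: x * sigmoid(x)."""
    return x / (1.0 + math.exp(-x))

def matvec(w, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]

def swiglu_ffn(x, w_gate, w_up, w_down):
    """Gated feed-forward: SiLU(w_gate x) multiplied elementwise by (w_up x),
    then projected back to the model width by w_down. No bias terms."""
    gate = [silu(v) for v in matvec(w_gate, x)]
    up = matvec(w_up, x)
    hidden = [g * u for g, u in zip(gate, up)]
    return matvec(w_down, hidden)

def ffn_params(d_model, d_hidden, gated):
    """Weight count: two matrices for a plain FFN, three for a gated one."""
    return (3 if gated else 2) * d_model * d_hidden

# Hypothetical widths: the gated block uses 2/3 of the plain hidden width
# so that both variants have the same number of weights.
plain = ffn_params(768, 4 * 768, gated=False)
gated = ffn_params(768, (8 * 768) // 3, gated=True)
```

The gated block has three weight matrices instead of two, so comparisons "for the same parameter budget" typically shrink its hidden width to about two thirds of the plain one, as in the last two lines.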

Line up the architecture of a modern open model against the 2017 paper and the bones are identical: embed, encode position, alternate attention and feed-forward blocks with residual connections, and predict the next token. Every difference is a swapped-in part: a different norm, a gated feed-forward, a rotary position scheme, a shared-KV attention. It's a striking case of an architecture that was right enough to keep, and that got faster and stronger through a series of small, sharp improvements rather than one revolution.
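Of the swapped-in parts, the rotary position scheme is probably the hardest to picture from a description alone. The sketch below rotates consecutive pairs of a vector by position-dependent angles and checks that the query-key score depends only on the offset between positions. The function names, the base of 10000, the interleaved pairing, and the example vectors are assumptions for illustration; libraries differ in how they pair dimensions.

```python
import math

def rope(vec, position, base=10000.0):
    """Rotate each consecutive pair of `vec` by an angle proportional to `position`.
    Earlier pairs rotate quickly, later pairs slowly."""
    d = len(vec)  # assumed even
    out = []
    for i in range(0, d, 2):
        theta = position * base ** (-i / d)
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[i], vec[i + 1]
        out += [x0 * c - x1 * s, x0 * s + x1 * c]
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.5, -1.0, 2.0, 0.25]   # hypothetical query vector
k = [1.5, 0.5, -0.75, 1.0]   # hypothetical key vector

# Same offset of 3 tokens at two very different absolute positions.
score_near = dot(rope(q, 5), rope(k, 2))
score_far = dot(rope(q, 105), rope(k, 102))
# A different offset (1 token) for comparison.
score_other = dot(rope(q, 5), rope(k, 4))
```

Up to floating-point error, `score_near` and `score_far` agree, while `score_other` differs: the rotation applied to the query and the rotation applied to the key cancel except for their difference.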

These are refinements, not a new architecture. The core idea — attention plus feed-forward blocks, stacked with residuals — is unchanged; each improvement upgrades a single component.
Grouped-query attention trades a little modelling capacity for a big cut in the key/value cache during generation — it's mainly an inference-efficiency win, not an accuracy one.
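The size of that cut is simple arithmetic. The sketch below estimates key/value cache size for a hypothetical model shape (32 layers, 32 query heads, head dimension 128, 8192 tokens, 2 bytes per value); none of these numbers come from a specific model, and the contiguous head grouping in `kv_head_for` is one common convention rather than the only one.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, batch=1, bytes_per_value=2):
    """Bytes needed to cache keys and values (hence the factor 2)
    for every layer, KV head, and position."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_value

def kv_head_for(query_head, n_query_heads, n_kv_heads):
    """Index of the shared KV head used by a query head,
    assuming contiguous groups of equal size."""
    group_size = n_query_heads // n_kv_heads
    return query_head // group_size

# Hypothetical model shape, for illustration only.
n_layers, n_query_heads, head_dim, seq_len = 32, 32, 128, 8192

mha = kv_cache_bytes(n_layers, n_query_heads, head_dim, seq_len)  # one KV head per query head
gqa = kv_cache_bytes(n_layers, 8, head_dim, seq_len)              # 8 KV heads shared by 32 query heads

print(mha / 2**30, "GiB vs", gqa / 2**30, "GiB")  # 4.0 GiB vs 1.0 GiB
```

Under these assumptions, sharing each key/value head among four query heads cuts the cache per sequence from 4 GiB to 1 GiB, and the saving scales linearly with batch size and context length.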