Modern Transformer Improvements

The 2017 original transformer is still recognisable inside today's large language models — but a series of refinements have made it train more stably, run more efficiently, and reach far longer contexts. None of them changes the big picture (attention + feed-forward blocks, stacked deep); each is a targeted upgrade to one component. Knowing them is knowing the difference between the textbook transformer and the ones actually running in production.

The upgrades that stuck

Line up the architecture of a modern open model against the 2017 paper and the bones are identical: embed, add position, alternate attention and feed-forward blocks with residual connections, and predict the next token. Every difference is a swapped-in part — a different norm, a gated feed-forward, a rotary position scheme, a shared-KV attention. It's a striking case of an architecture that was right enough to keep, and got faster and stronger by a hundred small, sharp improvements rather than one revolution.