The 2017
Line up the architecture of a modern open model against the 2017 paper, "Attention Is All You Need", and the bones are identical: embed, add position, alternate attention and feed-forward blocks with residual connections, and predict the next token. Every difference is a swapped-in part: a different norm (often RMSNorm in place of LayerNorm), a gated feed-forward (such as SwiGLU), a rotary position scheme (RoPE), a shared-KV attention (multi-query or grouped-query attention). It's a striking case of an architecture that was right enough to keep, one that got faster and stronger through a hundred small, sharp improvements rather than one revolution.
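To make the "same bones, swapped parts" point concrete, here is a minimal NumPy sketch, not any particular model's implementation. The skeleton (`block`, `next_token_logits`) stays fixed while the norm and the feed-forward are passed in as interchangeable functions. Everything here is an illustrative assumption: the function and parameter names are hypothetical, attention is single-head, learned scales and biases are omitted, the norm is applied before each sub-layer (the 2017 paper applied it after the residual add), position is a plain additive table rather than a rotary scheme, and the output projection reuses the embedding matrix.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # 2017-style norm: centre and scale each token vector (no learned gain/bias here).
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def rms_norm(x, eps=1e-5):
    # A common modern swap: rescale by the root-mean-square only, no centring.
    return x / np.sqrt((x * x).mean(-1, keepdims=True) + eps)

def relu_ffn(x, p):
    # Plain two-layer feed-forward with ReLU.
    return np.maximum(x @ p["w_in"], 0.0) @ p["w_out"]

def gated_ffn(x, p):
    # SwiGLU-style gated feed-forward: SiLU(gate) * up-projection, then project down.
    g = x @ p["w_gate"]
    silu = g / (1.0 + np.exp(-g))
    return (silu * (x @ p["w_in"])) @ p["w_out"]

def causal_attention(x, p):
    # Single-head self-attention; each position attends only to itself and earlier ones.
    q, k, v = x @ p["wq"], x @ p["wk"], x @ p["wv"]
    scores = q @ k.T / np.sqrt(q.shape[-1])
    future = np.triu(np.ones_like(scores, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(-1, keepdims=True)
    weights = np.exp(scores)
    weights = weights / weights.sum(-1, keepdims=True)
    return (weights @ v) @ p["wo"]

def block(x, p, norm, ffn):
    # The unchanged skeleton: attention and feed-forward, each wrapped in a residual.
    x = x + causal_attention(norm(x), p)
    x = x + ffn(norm(x), p)
    return x

def next_token_logits(tokens, params, norm, ffn):
    # Embed, add position, run the stack, project back to the vocabulary.
    x = params["embed"][tokens] + params["pos"][: len(tokens)]
    for p in params["blocks"]:
        x = block(x, p, norm, ffn)
    return norm(x) @ params["embed"].T  # tied output weights (an assumption)

def init_params(vocab=50, d=16, hidden=32, layers=2, max_len=8, seed=0):
    # Small random weights, just enough to run the sketch.
    rng = np.random.default_rng(seed)

    def w(*shape):
        return rng.normal(0.0, 0.02, shape)

    blocks = [
        {"wq": w(d, d), "wk": w(d, d), "wv": w(d, d), "wo": w(d, d),
         "w_in": w(d, hidden), "w_gate": w(d, hidden), "w_out": w(hidden, d)}
        for _ in range(layers)
    ]
    return {"embed": w(vocab, d), "pos": w(max_len, d), "blocks": blocks}

params = init_params()
tokens = np.array([3, 14, 15, 9])
old_style = next_token_logits(tokens, params, layer_norm, relu_ffn)
new_style = next_token_logits(tokens, params, rms_norm, gated_ffn)
print(old_style.shape, new_style.shape)  # (4, 50) (4, 50)
```

Both calls run through the same `block` and `next_token_logits`; only the arguments change. A rotary position scheme or shared-KV attention would be a similar swap inside `causal_attention`, which this sketch leaves out to stay short.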