The
Step 1 — encoder-only (BERT): bidirectional, for understanding. Stack only encoder blocks, with unmasked self-attention, so every position attends both left and right:
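As a minimal sketch of what unmasked self-attention means in practice, the Python below applies a row-wise softmax under a mask that hides nothing. The function name `attention_weights`, the sequence length, and the all-zero placeholder scores are illustrative assumptions, not taken from any particular model.

```python
import math

def attention_weights(scores, mask):
    """Row-wise softmax over the positions the mask allows.

    scores[i][j] is the raw score of query i against key j;
    mask[i][j] is True where query i may attend to key j.
    Assumes every row allows at least one position.
    """
    weights = []
    for score_row, mask_row in zip(scores, mask):
        allowed = [s for s, m in zip(score_row, mask_row) if m]
        peak = max(allowed)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

n = 4
full_mask = [[True] * n for _ in range(n)]  # unmasked: nothing is hidden
scores = [[0.0] * n for _ in range(n)]      # placeholder scores, all equal

weights = attention_weights(scores, full_mask)
# With equal scores and no mask, each position spreads its attention
# evenly over all n positions, to its left and to its right alike.
print(weights[1])
```

In a real encoder the scores come from query-key dot products; the point here is only that no entry of the mask is ever False.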
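Two-sided context makes these models excellent at understanding a fixed
input: classification, retrieval, tagging. They are pretrained by
masked language modelling: some input tokens are hidden, and the model predicts them from the context on both sides.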
Step 2 — decoder-only (GPT): causal, for generation. Stack only decoder
blocks with causally-masked self-attention, so each position attends only to itself and the positions before it.
This makes "predict the next token" a genuine task at every position, so the model is great at
generation. It is pretrained by plain next-token prediction.
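The causal setup can be sketched in a few lines of Python. This is an illustration under stated assumptions: `causal_mask` and `next_token_examples` are hypothetical helper names, and the four-word sentence stands in for real tokenised text.

```python
def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def next_token_examples(tokens):
    """Pair each visible prefix with the token that follows it."""
    return [(tokens[: i + 1], tokens[i + 1]) for i in range(len(tokens) - 1)]

tokens = ["the", "cat", "sat", "down"]  # toy, pre-tokenised text
mask = causal_mask(len(tokens))
examples = next_token_examples(tokens)

# Every position except the last yields its own training example,
# because the mask keeps the answer out of view.
for context, target in examples:
    print(context, "->", target)
```

One sequence of length n therefore gives n - 1 prediction problems in a single pass, which is part of why this objective uses raw text so efficiently.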
Step 3 — encoder–decoder (T5, the original): for sequence-to-sequence. Keep both stacks: an unmasked encoder reads the source, a causally-masked decoder writes the output, and cross-attention bridges them:
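To make the two information paths concrete, here is a small Python sketch of what a single decoder position can read. The helper `visible_to_decoder`, the toy French/English pair, and the `<s>` start token are assumptions made for illustration only.

```python
def visible_to_decoder(t, source, target):
    """What decoder position t can read when predicting the next target token.

    Cross-attention is unmasked, so the whole source is visible.
    Decoder self-attention is causal, so only target[0..t] is visible.
    """
    via_cross_attention = list(source)
    via_self_attention = list(target[: t + 1])
    return via_cross_attention, via_self_attention

source = ["le", "chat", "dort"]           # toy source sentence
target = ["<s>", "the", "cat", "sleeps"]  # toy target with a start token

for t in range(len(target) - 1):
    src_seen, tgt_seen = visible_to_decoder(t, source, target)
    print(f"step {t}: source={src_seen} prefix={tgt_seen} -> {target[t + 1]}")
```

At every step the source side stays fully visible while the target prefix grows by one token.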
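This is the natural fit for true sequence-to-sequence tasks such as translation, where a source sequence is mapped to a different target sequence.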
For a while the field split work across all three. Then decoder-only models won the scale race, for reasons that compound:
One simple objective: next-token prediction on any text whatsoever. No paired source/target is needed, so the entire internet is training data.
It scales beautifully: a single uniform stack, one loss, and no encoder/decoder split to balance, which makes bigger-is-better engineering straightforward.
In-context learning emerges: at sufficient scale these models follow instructions and learn from examples in the prompt, so a single decoder-only model handles translation, summarisation, and classification (the very tasks the other families specialised in) without changing architecture.
The encoder–decoder's clean division of labour turned out to matter less than the decoder-only's clean, scalable objective.
Encoder–decoder models have an elegant story: a reader and a writer, each specialised. So why did the plainer decoder-only design come to dominate? Because elegance lost to scalability. Next-token prediction needs no aligned source–target pairs, so its training set is effectively all text ever written — orders of magnitude more than the parallel corpora seq2seq models feed on. The architecture is one homogeneous stack, which is far easier to scale to hundreds of billions of parameters than a two-tower model whose halves must be sized and balanced. And once large enough, decoder-only models exhibit in-context learning — absorbing a task from examples in the prompt — so the one model does translation, Q&A, and summarisation that previously demanded task-specific encoder–decoder setups. A simpler objective, scaled further, beat a smarter architecture scaled less. That lesson — favour what scales — is the through-line of the modern era.
Switch between the three families and watch the attention grid change. Encoder-only is a full grid — every position attends everywhere (bidirectional). Decoder-only is a lower-triangular grid — each position attends only to itself and the past (causal). Encoder–decoder shows the cross pattern: decoder rows attending over encoder columns, the bridge between the two stacks. Same cells, different masks — and that difference is what each family is for.
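For readers without the interactive view, the three grids can be approximated in text. The sketch below is one possible rendering; the function name `attention_grid`, the family labels, and the default sizes are illustrative choices.

```python
def attention_grid(family, n_src=4, n_tgt=3):
    """Return the mask as rows of '#' (may attend) and '.' (blocked)."""
    if family == "encoder-only":
        rows, cols, causal = n_src, n_src, False  # full grid
    elif family == "decoder-only":
        rows, cols, causal = n_src, n_src, True   # lower triangle
    elif family == "encoder-decoder":
        rows, cols, causal = n_tgt, n_src, False  # decoder rows, encoder columns
    else:
        raise ValueError(f"unknown family: {family}")
    return ["".join("#" if (not causal or j <= i) else "." for j in range(cols))
            for i in range(rows)]

for family in ("encoder-only", "decoder-only", "encoder-decoder"):
    print(family)
    print("\n".join(attention_grid(family)))
```

For the encoder-decoder case only the cross-attention block is drawn; the encoder and decoder self-attention grids match the first two families.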