Encoder vs. Decoder Models

The full encoder–decoder transformer has two stacks, but you do not always need both. Keep just the encoder, just the decoder, or both, and you get three families of model — BERT, GPT, and T5 — built from the same parts but pointed at different jobs. The deciding difference is almost entirely one thing: which way the attention is allowed to look.

Three families, line by line

Step 1 — encoder-only (BERT): bidirectional, for understanding. Stack only encoder blocks, with unmasked self-attention, so every position attends both left and right:

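A_{ij} \ne 0 \ \text{ allowed for all } i, j \quad (\text{no mask}), \qquad \text{where } A_{ij} \text{ is the attention weight position } i \text{ places on position } j.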

Two-sided context makes these models excellent at understanding a fixed input — classification, retrieval, tagging. They are pretrained by masked-language-modeling: blank out some tokens and predict them from both sides. What they cannot naturally do is generate left to right — every position has already seen the future.
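The masked-language-modeling setup described above can be sketched in a few lines. This is a simplified illustration, not BERT's exact recipe: the helper name `mask_tokens`, the `[MASK]` string, and whitespace tokenisation are assumptions made for the example, and the real procedure also sometimes swaps in a random token or leaves the original in place.

```python
import random

MASK = "[MASK]"  # placeholder symbol; the exact token string is an assumption


def mask_tokens(tokens, mask_fraction=0.15, seed=0):
    """Return (corrupted, labels) for a simplified masked-language-modeling example.

    labels[i] holds the original token at masked positions and None elsewhere,
    so a loss would be computed only where a token was blanked out.
    """
    rng = random.Random(seed)
    n_mask = max(1, round(len(tokens) * mask_fraction))
    positions = set(rng.sample(range(len(tokens)), n_mask))
    corrupted = [MASK if i in positions else tok for i, tok in enumerate(tokens)]
    labels = [tok if i in positions else None for i, tok in enumerate(tokens)]
    return corrupted, labels


tokens = "the cat sat on the mat".split()
corrupted, labels = mask_tokens(tokens, mask_fraction=0.3)
print(corrupted)  # the input the model sees, with some tokens blanked out
print(labels)     # what it must recover, using context on both sides
```

Because the encoder is unmasked, the prediction for a blanked position can draw on tokens to its left and to its right, which is exactly what a causal model is not allowed to do.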

Step 2 — decoder-only (GPT): causal, for generation. Stack only decoder blocks with causally masked self-attention, so each position sees only itself and the past:

A_{ij} = 0 \ \text{ for } j > i \quad (\text{lower-triangular}).

This makes "predict the next token" a genuine task at every position, so the model is great at generation. It is pretrained by plain next-token prediction on raw text — and this is the design behind modern large language models. (There is no separate encoder and so no cross-attention: a decoder-only model just attends over its own running context.)
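A minimal sketch of the causal mask in action, in plain Python. The score values are made up for illustration, and `causal_mask` and `masked_softmax` are hypothetical helper names; real implementations work on batched tensors and usually add a large negative number to masked scores rather than skipping them.

```python
import math


def causal_mask(n):
    """mask[i][j] is True where position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores, mask):
    """Row-wise softmax; disallowed entries get weight exactly 0."""
    weights = []
    for row, allowed in zip(scores, mask):
        kept = [s for s, ok in zip(row, allowed) if ok]
        peak = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if ok else 0.0 for s, ok in zip(row, allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


# Made-up raw attention scores for a 4-token sequence (illustrative values only).
scores = [
    [0.5, 1.0, -0.3, 0.2],
    [0.1, 0.4, 0.9, -1.0],
    [1.2, -0.5, 0.3, 0.8],
    [0.0, 0.7, -0.2, 0.6],
]
A = masked_softmax(scores, causal_mask(4))
for row in A:
    print(" ".join(f"{w:.2f}" for w in row))
```

The printed matrix is lower-triangular: the first row puts all its weight on position 0, and only the last row spreads weight over all four positions.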

Step 3 — encoder–decoder (T5, the original): for sequence-to-sequence. Keep both stacks: an unmasked encoder reads the source, a causally-masked decoder writes the output, and cross-attention bridges them:

\text{source} \xrightarrow{\text{encoder (unmasked)}} Z \xrightarrow{\text{cross-attn}} \text{decoder (masked)} \to \text{output}.

This is the natural fit for true sequence-to-sequence tasks like translation, where a distinct input is transformed into a distinct output.
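The cross-attention bridge can be sketched as follows. This is a toy version under stated assumptions: the learned query, key, and value projections are omitted, the vectors are made-up values, and `cross_attention` is a hypothetical helper name. It shows only the shape of the interaction, in which each decoder position attends over every encoder output in Z.

```python
import math


def softmax(row):
    peak = max(row)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(decoder_states, Z):
    """Each decoder position (query) attends over every encoder output in Z.

    Learned projections are omitted: the raw vectors serve as queries,
    keys, and values in this sketch.
    """
    d = len(Z[0])
    outputs, weights = [], []
    for q in decoder_states:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in Z]
        w = softmax(scores)  # no mask: the whole source is visible
        weights.append(w)
        outputs.append([sum(wj * z[c] for wj, z in zip(w, Z)) for c in range(d)])
    return outputs, weights


Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]    # 3 source positions (made-up values)
decoder_states = [[0.5, -0.5], [2.0, 0.0]]  # 2 target positions (made-up values)
outputs, weights = cross_attention(decoder_states, Z)
print(len(weights), "x", len(weights[0]))    # decoder rows x encoder columns
```

The weight matrix is rectangular, with one row per target position and one column per source position, and it has no zeros, because the decoder may look at the entire encoded source at every step.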

Why decoder-only became the dominant LLM architecture

For a while the field split work across all three. Then decoder-only models won the scale race, for reasons that compound:

One simple objective. Next-token prediction works on any text whatsoever; no paired source/target is needed, so the entire internet is training data.
It scales cleanly. A single uniform stack, one loss, and no encoder/decoder split to balance make bigger-is-better engineering straightforward.
In-context learning emerges. At sufficient scale these models follow instructions and learn from examples in the prompt, so a single decoder-only model handles translation, summarisation, and classification (the very tasks the other families specialised in) without changing architecture.

The encoder–decoder's clean division of labour turned out to matter less than the decoder-only's clean, scalable objective.
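The claim that next-token prediction needs no paired data can be made concrete. In this sketch, `next_token_examples` is a hypothetical helper and whitespace splitting stands in for a real tokenizer; the point is only that one raw sequence supplies its own targets.

```python
def next_token_examples(tokens):
    """Turn one raw token sequence into (context, target) training pairs."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = "the cat sat on the mat".split()
for context, target in next_token_examples(tokens):
    print(" ".join(context), "->", target)
```

A sequence of n tokens yields n - 1 training pairs without any labelling step. In practice all of them are computed in one forward pass, since the causal mask keeps each position from seeing its own target.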

Built from the same blocks, the three families differ by attention pattern, pretraining objective, and use:

Encoder-only (BERT). Bidirectional (unmasked) self-attention; pretrained by masked-language-modeling; for understanding — classification, retrieval, tagging.
Decoder-only (GPT). Causal (lower-triangular) self-attention; pretrained by next-token prediction; for generation — the architecture behind modern LLMs.
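Encoder–decoder (T5, the original). Unmasked encoder + masked decoder joined by cross-attention; pretrained by span-corruption denoising in T5's case (the original Transformer was trained directly on translation pairs); for sequence-to-sequence tasks such as translation.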

The pattern (bidirectional vs. causal) and the objective decide what each family is good for.

Encoder–decoder models have an elegant story: a reader and a writer, each specialised. So why did the plainer decoder-only design come to dominate? Because elegance lost to scalability. Next-token prediction needs no aligned source–target pairs, so its training set is effectively all available text, orders of magnitude more than the parallel corpora that supervised seq2seq training relies on. The architecture is one homogeneous stack, which is easier to scale to hundreds of billions of parameters than a two-stack model whose halves must be sized and balanced. And once large enough, decoder-only models exhibit in-context learning, absorbing a task from examples in the prompt, so one model handles the translation, Q&A, and summarisation that previously demanded task-specific encoder–decoder setups. A simpler objective, scaled further, beat a smarter architecture scaled less. That lesson, favour what scales, is the through-line of the modern era.

The three patterns, side by side

Compare the attention grids of the three families. Encoder-only is a full grid: every position attends everywhere (bidirectional). Decoder-only is a lower-triangular grid: each position attends only to itself and the past (causal). Encoder–decoder adds the cross pattern: decoder rows attending over encoder columns, the bridge between the two stacks. Same cells, different masks, and that difference is what each family is for.
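The three grids can be drawn as text with a small sketch. The function names and the `#` / `.` rendering are choices made for this illustration; rows are the positions doing the attending and columns are the positions being attended to.

```python
def attention_pattern(family, n_src=4, n_tgt=4):
    """Return a 0/1 grid: rows are querying positions, columns are attended positions."""
    if family == "encoder-only":
        # Full grid: every source position sees every other source position.
        return [[1] * n_src for _ in range(n_src)]
    if family == "decoder-only":
        # Lower-triangular grid: position i sees positions 0..i only.
        return [[1 if j <= i else 0 for j in range(n_tgt)] for i in range(n_tgt)]
    if family == "cross":
        # Cross pattern: each decoder row sees every encoder column.
        return [[1] * n_src for _ in range(n_tgt)]
    raise ValueError(f"unknown family: {family}")


def render(grid):
    return "\n".join(" ".join("#" if cell else "." for cell in row) for row in grid)


for family in ("encoder-only", "decoder-only", "cross"):
    print(family)
    print(render(attention_pattern(family, n_src=4, n_tgt=3)))
    print()
```

An encoder–decoder model uses all three grids at once: the full grid inside its encoder, the triangular grid inside its decoder, and the rectangular cross grid between them.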