The GPT Family

GPT stands for Generative Pre-trained Transformer, and the name is the whole recipe in three words. Take a decoder-only transformer, train it by causal language modeling (predict the next token, left to right), and then make it big. GPT-1, GPT-2, GPT-3, GPT-4: the same basic design, run with more parameters and more data each time. Nothing exotic was added from GPT-1 through GPT-3, and GPT-4's architecture details have not been published. The lever that was pulled is scale.

The recipe, line by line

Step 1 — start with a decoder-only transformer. Drop the encoder stack entirely; keep one homogeneous stack of decoder blocks with causally masked self-attention, so each position attends only to itself and the past:

A_{ij} = 0 \quad \text{for } j > i \qquad (\text{lower-triangular}).

With the future hidden at every position, "predict the next token" is a genuine task at every position — which is exactly what we want to train on.
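The mask above can be sketched in a few lines of NumPy. This is a minimal illustration, not any particular library's implementation: the function name `causal_attention_weights` and the random 4×4 score matrix are assumptions made up for the example.

```python
import numpy as np

def causal_attention_weights(scores: np.ndarray) -> np.ndarray:
    """Row-wise softmax over attention scores with future positions masked out."""
    T = scores.shape[-1]
    # True where j > i, i.e. strictly above the diagonal (the "future").
    future = np.triu(np.ones((T, T), dtype=bool), k=1)
    # Setting a score to -inf makes its softmax weight exactly zero.
    masked = np.where(future, -np.inf, scores)
    # Numerically stable softmax along each row (the diagonal is always finite).
    masked = masked - masked.max(axis=-1, keepdims=True)
    weights = np.exp(masked)
    return weights / weights.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
A = causal_attention_weights(rng.normal(size=(4, 4)))  # hypothetical scores
print(np.round(A, 2))
```

The printed matrix is lower-triangular and each row sums to one: position i spreads all of its attention over positions 1 through i and none over the future.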

Step 2 — train on the next-token objective. Run the model over raw text and, at each position t, maximise the probability it assigns to the token that actually came next. Equivalently, minimise the average negative log-likelihood:

\mathcal{L}(\theta) = -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right).

No labels, no paired source and target: just text predicting itself. So any text, in principle the whole public internet, is training data, and the loss is a single, uniform number to push down. (This is the self-supervised pretraining objective; it is usually reported as perplexity, the exponential of the average loss.)
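As a rough sketch of what this objective computes, the snippet below scores a toy sequence with NumPy. The function name `next_token_loss`, the five-token vocabulary, and the all-zero "model" are assumptions for illustration. The first token has no prefix here, so the sketch averages over the remaining T−1 positions rather than all T.

```python
import numpy as np

def next_token_loss(logits: np.ndarray, tokens: np.ndarray) -> float:
    """Average negative log-likelihood of each token given its prefix.

    logits[t] holds the scores for the token that follows tokens[: t + 1],
    so logits has shape (T, vocab_size) and only logits[:-1] is scored.
    """
    preds, targets = logits[:-1], tokens[1:]
    # Log-softmax over the vocabulary, computed stably.
    shifted = preds - preds.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    # Pick out the log-probability assigned to the token that actually came next.
    return float(-log_probs[np.arange(len(targets)), targets].mean())

vocab_size = 5                       # hypothetical toy vocabulary
tokens = np.array([0, 3, 1, 4, 2])   # hypothetical token ids
uniform_logits = np.zeros((len(tokens), vocab_size))  # a model that knows nothing
loss = next_token_loss(uniform_logits, tokens)
perplexity = float(np.exp(loss))
print(loss, perplexity)
```

A model that spreads its probability evenly over the vocabulary gets a loss of log(vocab_size) and a perplexity equal to the vocabulary size; training pushes both numbers down from there.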

Step 3 — pull the one lever: scale. Hold the architecture and objective fixed, and grow three quantities together — the parameter count N, the training data D, and the compute C. The GPT line is a study in doing exactly this:

\underbrace{0.12\text{B}}_{\text{GPT-1}} \ \to\ \underbrace{1.5\text{B}}_{\text{GPT-2}} \ \to\ \underbrace{175\text{B}}_{\text{GPT-3}} \ \to\ \underbrace{\text{(undisclosed, larger)}}_{\text{GPT-4}}.
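To get a feel for where these totals come from, a common back-of-the-envelope estimate counts about 12·d² weights per transformer block (4·d² for the attention projections, 8·d² for an MLP four times as wide as the model), ignoring embeddings, biases, and layer norms. The layer counts and widths in the sketch are taken from the published model descriptions and should be treated as assumptions here, since the text above gives only the totals.

```python
def approx_block_params(n_layers: int, d_model: int) -> int:
    """Rough weight count for the transformer blocks only (no embeddings)."""
    # Per block: ~4*d^2 (Q, K, V, output projections) + ~8*d^2 (4x-wide MLP).
    return 12 * n_layers * d_model ** 2

# (n_layers, d_model): assumed configurations of the largest model at each step.
configs = {
    "GPT-1": (12, 768),
    "GPT-2": (48, 1600),
    "GPT-3": (96, 12288),
}

estimates = {name: approx_block_params(*cfg) for name, cfg in configs.items()}
for name, n in estimates.items():
    print(f"{name}: ~{n / 1e9:.2f}B parameters in the transformer blocks")
```

The estimates for GPT-2 and GPT-3 land close to the quoted totals. The GPT-1 estimate comes out noticeably below 0.12B because token and position embeddings, which this formula leaves out, are a large share of a small model.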

Step 4 — watch the behaviour change qualitatively, not just quantitatively. Small GPTs complete text: give them a prefix and they continue it plausibly. Large GPTs do something that looks different in kind — given a task description and a few worked examples in the prompt, they perform the new task with no weight update at all. That is in-context learning, and it is the reward for scale. The same next-token machine, made large enough, stops being a text-completer and starts being a few-shot learner.

Every model in the GPT family is the same construction at a different size:
  • Decoder-only. One uniform stack of causally masked transformer blocks — no encoder, no cross-attention, just attention over its own running context.
  • Next-token objective. Trained by causal language modeling, \mathcal{L} = -\frac{1}{T}\sum_t \log p_\theta(x_t \mid x_{<t}), on raw text; the objective is self-supervised, so all text is data.
  • Scale is the lever. GPT-1 → 2 → 3 → 4 keep the recipe fixed and grow N, D, C together; new capabilities such as few-shot learning emerge from size alone (instruction-following, in practice, also relies on additional fine-tuning).

GPT-2 already wrote fluent paragraphs, but to use it for a task — sentiment, translation, question answering — you still fine-tuned it: collect a labelled dataset, run gradient descent, ship a task-specific copy of the weights. GPT-3, at 175 billion parameters (over 100\times GPT-2), broke that habit. Its 2020 paper was titled "Language Models are Few-Shot Learners", and the claim was startling: you no longer train the model for your task at all. You describe the task and show a handful of examples in the prompt, and the frozen model does it. One set of weights, every task — steered by words rather than by gradients.
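To make "describe the task and show a handful of examples" concrete, here is a minimal sketch of assembling a few-shot prompt as plain text. The helper `build_few_shot_prompt`, the `Input:`/`Output:` layout, and the sentiment examples are all invented for illustration; they are not a format prescribed by the GPT-3 paper or by any API.

```python
def build_few_shot_prompt(
    task: str, examples: list[tuple[str, str]], query: str
) -> str:
    """Lay out a task description, worked examples, and an open query as text."""
    lines = [task, ""]
    for source, target in examples:
        lines.append(f"Input: {source}")
        lines.append(f"Output: {target}")
        lines.append("")
    # The prompt ends where the answer should begin, so that continuing
    # the text is the same thing as performing the task.
    lines.append(f"Input: {query}")
    lines.append("Output:")
    return "\n".join(lines)

prompt = build_few_shot_prompt(
    "Classify the sentiment of each review as positive or negative.",
    [("Loved every minute.", "positive"), ("A waste of two hours.", "negative")],
    "The plot dragged but the ending was superb.",
)
print(prompt)
```

Nothing in this sketch touches the model's weights. The whole "training set" lives in the string, and the frozen model is simply asked to predict what comes after the final `Output:`.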

This is why the era's interface is the prompt. The shift from "fine-tune a copy per task" to "write a good prompt for the one model" is the practical legacy of scaling the GPT recipe — and it sets up the next three pages: how loss falls with scale, why in-context learning works, and how to phrase the prompt.

Bigger is (predictably) better

Slide the model size from a small GPT to a GPT-3-scale giant. The horizontal axis is the parameter count N on a logarithmic scale — each tick is 10\times larger — because size spans many orders of magnitude. The falling curve is the training loss; the rising curve is a rough capability score. Both move smoothly with size: there is no magic threshold, just a steady payoff for scale (the precise law is the next page).
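The following toy sketch shows what "smooth on a logarithmic axis" means numerically. It assumes, purely for illustration, that loss falls as a power of N; the constants `ALPHA` and `SCALE` are made up and are not the fitted values of any real scaling law, which this page defers to the next one.

```python
import numpy as np

ALPHA = 0.08   # made-up exponent, for illustration only
SCALE = 10.0   # made-up prefactor, for illustration only

# Four model sizes, one per tick on the log axis: 1e8, 1e9, 1e10, 1e11.
sizes = np.logspace(8, 11, num=4)
toy_loss = SCALE * sizes ** (-ALPHA)

# Ratio of loss between neighbouring ticks (each tick is 10x more parameters).
drops = toy_loss[:-1] / toy_loss[1:]
for n, loss in zip(sizes, toy_loss):
    print(f"N = {n:.0e}  toy loss = {loss:.3f}")
print("loss ratio per 10x step:", np.round(drops, 3))
```

Under this assumption every 10× step in size shrinks the loss by the same factor, which is why the curve looks like a straight, threshold-free slope once both axes are logarithmic.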