GPT stands for Generative Pre-trained Transformer, and the name is the whole
recipe in three words. Take a
Step 1 — start with a decoder-only transformer. Drop the encoder stack
entirely; keep one homogeneous stack of decoder blocks with
With the future hidden at every position, "predict the next token" is a genuine task at every position — which is exactly what we want to train on.
Step 2 — train on the next-token objective. Run the model over raw text and,
at each position
No labels, no paired source and target — just text predicting itself. So the entire internet is
training data, and the loss is a single, uniform number to push down. (This is the
Step 3 — pull the one lever: scale. Hold the architecture and objective fixed,
and grow three quantities together — the parameter count
Step 4 — watch the behaviour change qualitatively, not just quantitatively.
Small GPTs complete text: give them a prefix and they continue it plausibly. Large GPTs
do something that looks different in kind — given a task description and a few worked examples in
the prompt, they perform the new task with no weight update at all. That is
GPT-2 already wrote fluent paragraphs, but to use it for a task — sentiment,
translation, question answering — you still fine-tuned it: collect a labelled dataset, run
gradient descent, ship a task-specific copy of the weights. GPT-3, at
This is why the era's interface is the prompt. The shift from "fine-tune a copy per task" to "write a good prompt for the one model" is the practical legacy of scaling the GPT recipe — and it sets up the next three pages: how loss falls with scale, why in-context learning works, and how to phrase the prompt.
Slide the model size from a small GPT to a GPT-3-scale giant. The horizontal axis is the
parameter count