Here is the trick that turned language models from a curiosity into a revolution. To train a model you need labelled examples, and self-supervision gets those labels from the raw text itself.
Step 1 — supervised learning needs costly labels. Classical supervised
learning fits a model to pairs of an input and its correct label.
Every label is a person reading an example and annotating it — slow, expensive, and the hard ceiling on how much data you can ever have. Labels, not compute, were the bottleneck.
Step 2 — self-supervision manufactures the label from the data. For a
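sequence of tokens, each prefix serves as the input and the token that follows it serves as the label.
The target is simply the next token, which is already sitting in the raw text.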
Step 3 — the loss is the same; only the labels changed. Plug those free
pairs into the ordinary cross-entropy objective. It is the identical loss a supervised classifier minimises; the only difference is where the labels came from.
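The point of this step can be sketched in a few lines of Python. This is a minimal illustration, not a training loop: the function `cross_entropy`, the toy sentiment task, and every probability below are made up for the example. What it shows is that one loss function serves both cases, and only the origin of the labels differs.

```python
import math

def cross_entropy(predicted_probs, labels):
    """Mean negative log-probability the model assigned to each correct label."""
    total = 0.0
    for probs, label in zip(predicted_probs, labels):
        total -= math.log(probs[label])
    return total / len(labels)

# Supervised case: a person wrote each label (hypothetical sentiment task).
human_labels = ["positive", "negative"]
sentiment_probs = [
    {"positive": 0.5, "negative": 0.5},
    {"positive": 0.25, "negative": 0.75},
]

# Self-supervised case: each label is just the next word of "the cat sat".
next_word_labels = ["cat", "sat"]
next_word_probs = [
    {"cat": 0.5, "dog": 0.5},    # made-up model output after "the"
    {"sat": 0.25, "ran": 0.75},  # made-up model output after "the cat"
]

# The same function scores both; nothing in it knows who produced the labels.
supervised_loss = cross_entropy(sentiment_probs, human_labels)
self_supervised_loss = cross_entropy(next_word_probs, next_word_labels)
```

In a real system the probabilities would come from a neural network's softmax output rather than hand-written dictionaries, but the arithmetic of the loss is the same.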
Step 4 — free labels scale to web-sized data. Because no human is in the
loop, the supply of training pairs is bounded only by how much raw text exists. From a corpus of N tokens you get roughly N training pairs, one for every position that has a next token.
Point this at the whole internet and you have trillions of labelled examples that cost nothing to annotate. Supervised learning could never assemble that — this is the unlock.
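Steps 2 and 4 can be made concrete with a short sketch. The helper names `next_token_pairs` and `count_pairs` are hypothetical, and splitting on whitespace is an assumption that stands in for a real tokeniser. The code slices one sentence into (context, target) pairs and then counts how many pairs a small corpus yields with no annotation at all.

```python
from typing import Iterable, Iterator, List, Tuple

def next_token_pairs(tokens: List[str]) -> Iterator[Tuple[List[str], str]]:
    """Yield (context, target) pairs: every prefix paired with the token after it."""
    for t in range(1, len(tokens)):
        yield tokens[:t], tokens[t]

def count_pairs(corpus: Iterable[str]) -> int:
    """Count training pairs in a corpus, slicing one document at a time."""
    total = 0
    for document in corpus:
        tokens = document.split()  # whitespace split stands in for a real tokeniser
        total += sum(1 for _ in next_token_pairs(tokens))
    return total

# One sentence becomes three labelled examples.
sentence = "the cat sat down"
pairs = list(next_token_pairs(sentence.split()))

# A two-document toy corpus: 3 pairs from the first, 1 from the second.
toy_corpus = ["the cat sat down", "dogs bark"]
total_pairs = count_pairs(toy_corpus)
```

A document of N tokens yields N - 1 pairs under this scheme, so the number of examples grows in step with the amount of raw text, which is the scaling argument made above.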
Step 5 — pretrain once, then finetune cheaply. Run the self-supervised objective over that ocean of text and the model learns general language representations — grammar, facts, style, a sketch of reasoning. That expensive run happens once. To specialise it for a downstream task you finetune: continue training on a small, often human-labelled, task dataset, reusing everything already learned:
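One common way to write the two stages in symbols is sketched below. The notation is an assumption introduced for illustration: $\theta$ stands for the model's parameters, $\mathcal{D}_{\text{text}}$ for the raw-text corpus, and $\mathcal{D}_{\text{task}}$ for the small task dataset. The first line is the self-supervised next-token objective; the second is the same kind of cross-entropy on task pairs, with optimisation starting from the pretrained parameters rather than from scratch.

```latex
\begin{aligned}
\theta_{\text{pre}} &= \arg\min_{\theta} \; -\sum_{x \in \mathcal{D}_{\text{text}}} \sum_{t} \log p_{\theta}(x_t \mid x_{<t}) \\
\theta_{\text{task}} &= \arg\min_{\theta} \; -\sum_{(x, y) \in \mathcal{D}_{\text{task}}} \log p_{\theta}(y \mid x),
\qquad \text{optimisation started from } \theta = \theta_{\text{pre}}
\end{aligned}
```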
The general knowledge is paid for once, in the pretraining; each new task rents it cheaply. That two-stage split — pretrain on everything, finetune on a little — is the paradigm under every modern large language model.
For decades, language AI advanced by hand-engineering — grammars, parse trees, curated features, rules lovingly written by experts. Self-supervised pretraining made most of that obsolete almost overnight, and the reason is what Rich Sutton dubbed the bitter lesson: general methods that ride more computation and more data reliably overtake systems built on human-encoded knowledge, given enough of both.
Self-supervision is the bitter lesson in its purest form. It removes the one thing that
could not scale — the human labeller — and replaces clever feature engineering with
“predict the next token over the whole internet.” The model discovers grammar,
facts, and structure on its own, as side effects of getting the next token right.
Watch a single sentence get sliced into supervised examples automatically. At each step the
words shaded in the first colour are the context (the input) and the word that follows is the target (the label).