Self-Supervised Pretraining

Here is the trick that turned language models from a curiosity into a revolution. To train a language model you need examples of “input → correct answer”. The breakthrough realisation: for next-token prediction, the text is its own answer. The label for the context “the cat sat on the” is simply the word that actually follows it — mat — already sitting right there in the corpus. No human ever had to write it down. This is self-supervised pretraining, and it is why a model can learn from essentially the entire internet.

Free labels, line by line

Step 1 — supervised learning needs costly labels. Classical supervised learning fits f_\theta(x) \approx y from pairs (x, y) where a human supplied each target y:

\mathcal{D}_{\text{sup}} = \{(x^{(i)}, y^{(i)})\}_{i=1}^{n}, \qquad y^{(i)} \text{ hand-labelled}.

Every label is a person reading an example and annotating it — slow, expensive, and the hard ceiling on how much data you can ever have. Labels, not compute, were the bottleneck.

Step 2 — self-supervision manufactures the label from the data. For a sequence x_1, \dots, x_T, define the input/target pair at each position by splitting the text against itself: the context so far is the input, the very next token is the label.

x^{(t)} = (x_1, \dots, x_{t-1}) = x_{<t}, \qquad y^{(t)} = x_t, \qquad t = 2, \dots, T.

The target y^{(t)} = x_t was already in the corpus — no annotator required. A single document of length T hands you T-1 training pairs for free, and causal masking lets the model learn from all of them in one parallel pass.
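A minimal sketch of the split in Step 2, assuming the text is already a list of tokens. The whitespace split stands in for a real tokeniser, and `make_shifted` is a hypothetical helper name, not something from a library. The inputs and targets are the same sequence offset by one position:

```python
def make_shifted(tokens):
    """Split one token sequence into aligned inputs and next-token targets."""
    inputs = tokens[:-1]    # x_1 .. x_{T-1}
    targets = tokens[1:]    # x_2 .. x_T, i.e. the same text shifted left by one
    return inputs, targets


# Whitespace splitting is only a stand-in for a real tokeniser.
tokens = "the cat sat on the mat".split()
inputs, targets = make_shifted(tokens)

# Position i is trained to predict targets[i] from the context inputs[: i + 1].
for i, target in enumerate(targets):
    print(f"position {i}: context ends with {inputs[i]!r}, target is {target!r}")

print(len(tokens), "tokens ->", len(targets), "training pairs")
```

Because the two lists are aligned, a causally masked model can score every position in one pass rather than looping over contexts one at a time.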

Step 3 — the loss is the same; only the labels changed. Plug those free pairs into the ordinary cross-entropy objective. It is the identical cross-entropy you would use with human labels — the targets just happen to come from the text:

\mathcal{L}(\theta) = -\frac{1}{T-1} \sum_{t=2}^{T} \log P_\theta(\underbrace{x_t}_{y^{(t)}} \mid \underbrace{x_{<t}}_{x^{(t)}}).
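To make the objective in Step 3 concrete, here is a small sketch that averages the negative log-probability of each actual next token. The function `next_token_loss` and the `uniform` model are illustrative assumptions. A real model would supply learned probabilities in place of the uniform guess:

```python
import math


def next_token_loss(tokens, prob):
    """Average negative log-likelihood of each token given the tokens before it."""
    T = len(tokens)
    total = 0.0
    for t in range(1, T):                 # 0-based index of the target token
        context, target = tokens[:t], tokens[t]
        total += -math.log(prob(context, target))
    return total / (T - 1)                # T - 1 pairs per sequence


tokens = "the cat sat on the mat".split()
vocab = sorted(set(tokens))               # 5 distinct word types


def uniform(context, token):
    """Placeholder model: ignores the context and guesses uniformly."""
    return 1.0 / len(vocab)


loss = next_token_loss(tokens, uniform)
print(loss, math.exp(loss))               # exp(loss) is the perplexity
```

With the uniform placeholder the loss is log 5 and the perplexity equals the vocabulary size, which is the baseline any trained model has to beat.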

Step 4 — free labels scale to web-sized data. Because no human is in the loop, the supply of training pairs is bounded only by how much raw text exists. From a corpus \mathcal{C} of documents, the number of training pairs is

\#\text{pairs} = \sum_{x \in \mathcal{C}} (T_x - 1) \approx \text{total tokens in } \mathcal{C}.

Point this at the whole internet and you have trillions of labelled examples that cost nothing to annotate. Supervised learning could never assemble that — this is the unlock.
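The count in Step 4 can be checked on a toy corpus. This is a sketch with made-up documents, and `count_pairs` is a hypothetical helper. It shows that the number of pairs falls short of the token count by exactly one per non-empty document, which is negligible when documents are long:

```python
def count_pairs(corpus):
    """Sum of (T_x - 1) over all tokenised documents in the corpus."""
    return sum(max(len(doc) - 1, 0) for doc in corpus)


# Three tiny illustrative documents, tokenised by whitespace.
corpus = [
    "the cat sat on the mat".split(),        # 6 tokens -> 5 pairs
    "no human wrote these labels".split(),   # 5 tokens -> 4 pairs
    "hello".split(),                         # 1 token  -> 0 pairs
]

total_tokens = sum(len(doc) for doc in corpus)
pairs = count_pairs(corpus)
print(total_tokens, pairs)
```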

Step 5 — pretrain once, then finetune cheaply. Run the self-supervised objective over that ocean of text and the model learns general language representations — grammar, facts, style, a sketch of reasoning. That expensive run happens once. To specialise it for a downstream task you finetune: continue training on a small, often human-labelled, task dataset, reusing everything already learned:

\theta_0 \;\xrightarrow[\text{huge, free, self-supervised}]{\text{pretrain}}\; \theta_{\text{pre}} \;\xrightarrow[\text{small, labelled, supervised}]{\text{finetune}}\; \theta_{\text{task}}.

The general knowledge is paid for once, in the pretraining; each new task rents it cheaply. That two-stage split — pretrain on everything, finetune on a little — is the paradigm under every modern large language model.
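The two-stage flow can be sketched with a deliberately crude stand-in: a bigram count table plays the role of the model, and adding counts plays the role of gradient updates. Everything here (`BigramModel`, `pairs_from_text`, the tiny corpora, the sentiment-style labels) is a hypothetical illustration. It shows only that both stages update the same parameters with the same rule, not the representation transfer that makes finetuning a real network cheap:

```python
from collections import Counter, defaultdict


class BigramModel:
    """Toy stand-in for a language model: counts which token follows which."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def update(self, pairs):
        """One 'training' pass over (input token, target token) pairs."""
        for prev, target in pairs:
            self.counts[prev][target] += 1

    def predict(self, prev):
        """Most frequent target seen after `prev`, or None if never seen."""
        following = self.counts.get(prev)
        return following.most_common(1)[0][0] if following else None


def pairs_from_text(text):
    """Self-supervised pairs: each token is the target for the one before it."""
    tokens = text.split()
    return list(zip(tokens[:-1], tokens[1:]))


model = BigramModel()

# Stage 1, pretrain: raw text only, labels come from the text itself.
raw_corpus = ["the cat sat on the mat", "the dog sat on the rug"]
for doc in raw_corpus:
    model.update(pairs_from_text(doc))

# Stage 2, finetune: a small hand-labelled set, applied to the same parameters.
hand_labelled = [("great", "positive"), ("awful", "negative")]  # hypothetical task data
model.update(hand_labelled)

print(model.predict("sat"))    # learned during pretraining
print(model.predict("great"))  # learned during finetuning
```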

Next-token prediction needs no human labels:
  • The text is the label. For each position, input x^{(t)} = x_{<t} and target y^{(t)} = x_t come straight from the corpus, so a sequence of length T yields T-1 free pairs.
  • It scales to web-sized data. With no annotation cost, the number of training pairs is just the total token count, so the model can train on essentially the entire internet.
  • Pretrain → finetune. One huge self-supervised pretraining run learns general representations \theta_{\text{pre}}; cheap supervised finetuning then adapts them to each downstream task \theta_{\text{task}}.

For decades, language AI advanced by hand-engineering — grammars, parse trees, curated features, rules lovingly written by experts. Self-supervised pretraining made most of that obsolete almost overnight, and the reason is what Rich Sutton dubbed the bitter lesson: general methods that ride more computation and more data reliably overtake systems built on human-encoded knowledge, given enough of both.

Self-supervision is the bitter lesson in its purest form. It removes the one thing that could not scale — the human labeller — and replaces clever feature engineering with “predict the next token over the whole internet.” The model discovers grammar, facts, and structure on its own, as side effects of getting the perplexity down. Painful for anyone who spent years crafting features by hand, liberating for everyone who would rather add data and compute. The whole modern LLM era is built on taking that lesson seriously.

One sentence, many free pairs

Watch a single sentence get sliced into supervised examples automatically. At each step the words shaded in the first colour are the context (the input x_{<t}), and the next word, in the second colour, is the label (x_t) the model must predict. Step through: a sentence of n words yields n-1 pairs, and nobody labelled a thing.
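The same slicing can be written out in a few lines. This is a sketch that treats whitespace-separated words as tokens, and `slice_sentence` is a hypothetical helper name:

```python
def slice_sentence(sentence):
    """Return every (context words, next word) pair in a sentence."""
    words = sentence.split()
    return [(words[:t], words[t]) for t in range(1, len(words))]


pairs = slice_sentence("the cat sat on the mat")
for context, label in pairs:
    print(" ".join(context), "->", label)

# Prints:
# the -> cat
# the cat -> sat
# the cat sat -> on
# the cat sat on -> the
# the cat sat on the -> mat
```

Six words give five pairs, matching the n-1 count.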