Here is the trick that made scaling worth the money. Take a large GPT,
write a few input → output examples into the prompt, then a fresh input, and
it continues the pattern, performing a task it was never trained on. No fine-tuning, no gradient
step, no change to a single weight. The "learning" happens entirely in the
forward pass. This is in-context learning, and it is what
turned a text-completer into a general-purpose tool.
How few-shot learning works, line by line
Step 1 — lay out k demonstrations, then the query. Build a prompt that is just
k worked examples of the task followed by a new input with its answer
left blank:
\underbrace{(x_1 \to y_1),\ (x_2 \to y_2),\ \dots,\ (x_k \to y_k)}_{k \text{ demonstrations}},\ \ x_{\text{query}} \to \;?
Nothing here is special syntax — it is plain text. The arrows and line breaks are a convention
the model reads like any other tokens.
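As a minimal sketch of Step 1, the prompt can be assembled with ordinary string formatting. The helper name `build_few_shot_prompt`, the ` -> ` separator, and the English-to-French pairs are illustrative assumptions, not part of any library or of the original text.

```python
def build_few_shot_prompt(demonstrations, query, arrow=" -> "):
    """Lay out k (input, output) pairs, then the query with its answer left blank."""
    lines = [f"{x}{arrow}{y}" for x, y in demonstrations]
    # The query line stops right after the arrow, so the natural continuation is the answer.
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)


# Hypothetical demonstrations, chosen only to make the pattern obvious.
demos = [("cat", "chat"), ("dog", "chien"), ("house", "maison")]
prompt = build_few_shot_prompt(demos, "bread")
print(prompt)
```

The result is a single plain string; whatever model receives it sees only tokens, with no flag anywhere saying "this is a translation task".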
Step 2 — it's still just next-token prediction. The model does the one thing it
always does: extend the sequence by predicting the most likely continuation. But the
continuation that best fits a context full of x_i \to y_i pairs is the
matching output for x_{\text{query}}:
\hat{y} = \arg\max_{y}\ p_\theta\!\left(y \;\middle|\; (x_1{\to}y_1),\dots,(x_k{\to}y_k),\, x_{\text{query}}{\to}\right).
The task is specified by the examples, which live in the context that
p_\theta is conditioned on, not in \theta
itself.
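The argmax in Step 2 can be made concrete with a toy stand-in for p_\theta. This is a sketch under heavy assumptions: the "model" below is just a fixed menu of string transformations (`HYPOTHESES`, playing the role of frozen \theta), and `score` is an unnormalized toy score, not a real language-model probability. All names are hypothetical.

```python
import math

# Stand-in for the frozen parameters theta: a fixed menu of string transformations.
HYPOTHESES = {
    "upper": str.upper,
    "reverse": lambda s: s[::-1],
    "identity": lambda s: s,
}


def score(y, demos, x_query, eps=1e-3):
    """Toy log-score for y given the demonstrations and the query.

    Each hypothesis is weighted by how well it reproduces the demonstrations,
    then votes on whether y is the right output for x_query.
    """
    total, norm = 0.0, 0.0
    for f in HYPOTHESES.values():
        weight = math.prod(1.0 if f(x) == y_i else eps for x, y_i in demos)
        norm += weight
        total += weight * (1.0 if f(x_query) == y else eps)
    return math.log(total / norm)


def predict(demos, x_query, candidates):
    """y_hat = argmax over candidate outputs of the toy conditional score."""
    return max(candidates, key=lambda y: score(y, demos, x_query))


candidates = ["bird", "BIRD", "drib"]
upper_demos = [("cat", "CAT"), ("dog", "DOG")]
reverse_demos = [("cat", "tac"), ("dog", "god")]

print(predict(upper_demos, "bird", candidates))    # BIRD
print(predict(reverse_demos, "bird", candidates))  # drib
```

`HYPOTHESES` is identical in both calls; only the demonstrations differ, and the chosen output changes with them. That is the sense in which the task lives in the conditioning rather than in the parameters.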
Step 3 — no weights move. Contrast with ordinary learning, where you'd nudge
the parameters by gradient descent on a loss,
\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L} \qquad(\text{fine-tuning}).
In-context learning does none of this. The weights \theta
are frozen; the only thing that changed between "doesn't know the task" and "does the task" is
the text you placed in the context window. The computation that adapts the model is the
attention pass reading those demonstrations: adaptation at inference time, not at training time.
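The contrast in Step 3 can be sketched numerically. The toy below is an assumption-laden illustration, not what a real transformer computes: `fine_tune_step` applies the update rule above to a one-parameter model y = \theta x with squared-error loss, while `in_context_predict` is a softmax-weighted average over demonstrations (an attention-like read-out) that only reads its parameter.

```python
import math


def fine_tune_step(theta, x, y, lr=0.1):
    """One gradient-descent step on (theta * x - y)^2: the parameter itself changes."""
    grad = 2.0 * (theta * x - y) * x
    return theta - lr * grad  # theta <- theta - eta * grad


def in_context_predict(theta, demos, x_query):
    """Attention-style read-out: softmax over similarity to each demo input,
    then a weighted average of the demo outputs. theta is read, never written."""
    scores = [-theta * (x_query - x) ** 2 for x, _ in demos]
    m = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - m) for s in scores]
    z = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, demos)) / z


# Fine-tuning: the parameter moves.
theta_before = 1.0
theta_after = fine_tune_step(theta_before, x=2.0, y=6.0)
print(theta_before, "->", theta_after)

# In-context: the same frozen theta handles two different "tasks".
theta_frozen = 5.0
demos_a = [(1.0, 3.0), (2.0, 6.0), (3.0, 9.0)]     # y = 3x
demos_b = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]  # y = -x
print(in_context_predict(theta_frozen, demos_a, 2.0))
print(in_context_predict(theta_frozen, demos_b, 2.0))
```

The fine-tuning branch returns a new parameter value; the in-context branch returns different predictions for different demonstrations while `theta_frozen` is never reassigned.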
Step 4 — it emerges with scale. This is the punchline that ties back to the
previous pages. A small model shown the same k demonstrations
cannot do this — it just completes text. Past a certain size, the same prompts start
working, and few-shot accuracy climbs with model scale. In-context learning is an
emergent capability: you don't add it, you grow into it.
A sufficiently large language model can perform a new task from examples placed in its prompt:
- Few-shot demonstrations. The prompt holds
  k input→output pairs followed by a query; the model continues the
  pattern via ordinary next-token prediction conditioned on that context.
- No weight update. The parameters \theta stay
  frozen, unlike fine-tuning's
  \theta \leftarrow \theta - \eta\nabla_\theta\mathcal{L}.
  Adaptation happens in the forward pass (attention), at inference time.
- Emergent with scale. Small models can't do it; large ones can, and
  few-shot accuracy improves with size. The ability appears as a function of scale.
The number of demonstrations names the setting. Zero-shot: just a task
description, no examples (k = 0). One-shot: a
single example (k = 1) — often a big jump over zero, because it
pins down the exact output format. Few-shot: a handful, typically
k from a few up to a few dozen, with accuracy usually rising as
k grows until the context window fills.
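The three settings differ only in how many demonstrations sit between the task description and the query. A small sketch, with hypothetical helper names (`shot_setting`, `make_prompt`) and made-up translation pairs:

```python
def shot_setting(k):
    """Name the prompting regime for k demonstrations."""
    if k < 0:
        raise ValueError("k must be non-negative")
    return {0: "zero-shot", 1: "one-shot"}.get(k, "few-shot")


def make_prompt(description, demos, query, k):
    """Task description, then the first k demonstrations, then the blank query."""
    if k > len(demos):
        raise ValueError("not enough demonstrations for this k")
    lines = [description]
    lines += [f"{x} -> {y}" for x, y in demos[:k]]
    lines.append(f"{query} ->")
    return "\n".join(lines)


demos = [("cat", "chat"), ("dog", "chien"), ("house", "maison")]
for k in (0, 1, 3):
    print(f"--- {shot_setting(k)} (k = {k}) ---")
    print(make_prompt("Translate English to French.", demos, "bread", k))
```

With k = 0 the model has only the description to go on; from k = 1 onward it can also see the output format it is expected to imitate.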
Why can attention do this at all? One identified mechanism is the induction
head: part of a two-head circuit, spread across two layers, that implements "find where this
token appeared before, and copy what followed it." Given \dots A\,B \dots A it
predicts B, a literal pattern-completer built out of attention.
These circuits form during training; with them in place, the
x_i \to y_i demonstrations become a pattern the model can match and
extend. Pattern-copying machinery, discovered by training, repurposed at inference to "learn"
your task.
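The "find the earlier occurrence, copy what followed" rule is simple enough to write out directly. This is a hand-written sketch of the behavior only, operating on a Python list of tokens; real induction heads realize it with learned attention weights, and the function name `induction_predict` is a hypothetical one.

```python
def induction_predict(tokens):
    """Find the most recent earlier occurrence of the last token and
    return the token that followed it (None if there is no earlier match)."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan backwards over the earlier positions, excluding the last one.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]
    return None


print(induction_predict(["A", "B", "C", "A"]))  # B
print(induction_predict(["A", "B", "C", "D"]))  # None
```

Exact-match copying like this only completes sequences it has literally seen; applying it to fresh inputs, as few-shot prompts require, needs the softer, learned matching that attention provides.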