In-Context Learning

Here is the trick that made scaling worth the money. Take a large GPT, write a few input → output examples into the prompt, then a fresh input — and it continues the pattern, performing a task it was never trained on. No fine-tuning, no gradient step, no change to a single weight. The "learning" happens entirely in the forward pass. This is in-context learning, and it is what turned a text-completer into a general-purpose tool.

How few-shot learning works, line by line

Step 1 — lay out k demonstrations, then the query. Build a prompt that is just k worked examples of the task followed by a new input with its answer left blank:

\underbrace{(x_1 \to y_1),\ (x_2 \to y_2),\ \dots,\ (x_k \to y_k)}_{k \text{ demonstrations}},\ \ x_{\text{query}} \to \;?

Nothing here is special syntax — it is plain text. The arrows and line breaks are a convention the model reads like any other tokens.
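To make the layout concrete, here is a minimal sketch of a prompt builder. The function name `build_prompt`, the ` -> ` separator, and the uppercase task are illustrative assumptions, not a required format; any consistent plain-text convention would serve.

```python
from typing import Sequence, Tuple


def build_prompt(demos: Sequence[Tuple[str, str]], query: str, arrow: str = " -> ") -> str:
    """Format k demonstrations plus a query as plain text, one pair per line."""
    lines = [f"{x}{arrow}{y}" for x, y in demos]
    # The query line stops right after the arrow: the answer is left blank.
    lines.append(f"{query}{arrow}".rstrip())
    return "\n".join(lines)


# Hypothetical task for illustration: uppercase the word.
demos = [("cat", "CAT"), ("tree", "TREE"), ("river", "RIVER")]
prompt = build_prompt(demos, "stone")
print(prompt)
```

With k = 3 the prompt has four lines, and the last one is `stone ->` with nothing after the arrow. Passing an empty list of demonstrations gives the zero-shot case: just the query line.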

Step 2 — it's still just next-token prediction. The model does the one thing it always does: extend the sequence by predicting the most likely continuation. But the continuation that best fits a context full of x_i \to y_i pairs is the matching output for x_{\text{query}}:

\hat{y} = \arg\max_{y}\ p_\theta\!\left(y \;\middle|\; (x_1{\to}y_1),\dots,(x_k{\to}y_k),\, x_{\text{query}}{\to}\right).

The task is specified by the examples, which live in the context that p_\theta is conditioned on, not in \theta itself.
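When the answer spans several tokens, y = (y_1, \dots, y_T), the conditional probability above factorizes by the standard autoregressive chain rule, which is what connects the argmax to next-token prediction. Here c is a shorthand introduced for this note, standing for the whole prompt:

p_\theta(y \mid c) = \prod_{t=1}^{T} p_\theta\!\left(y_t \;\middle|\; c,\ y_{<t}\right), \qquad c = (x_1{\to}y_1),\dots,(x_k{\to}y_k),\, x_{\text{query}}{\to}

Greedy decoding picks the most likely token at each step t, which in general only approximates the argmax over whole sequences.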

Step 3 — no weights move. Contrast with ordinary learning, where you'd nudge the parameters by gradient descent on a loss,

\theta \leftarrow \theta - \eta\,\nabla_\theta \mathcal{L} \qquad(\text{fine-tuning}).

In-context learning does none of this. The weights \theta are frozen; the only thing that changed between "doesn't know the task" and "does the task" is the text you placed in the context window. The computation that adapts the model is the attention pass reading those demonstrations — adaptation at inference time, not at training time.
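The contrast can be sketched with a deliberately tiny toy. This is not a transformer and not a claim about how a GPT computes its answer: `attention_readout` is a hypothetical softmax-weighted average over the demonstrations (a kernel smoother), chosen only to show output changing with the context while the parameters stay fixed. `fine_tune_step` is one gradient step on a one-parameter linear model with squared error, for comparison. All names and numbers are assumptions.

```python
import math
from typing import Dict, Sequence, Tuple

# Frozen "weights" of the toy: a single sharpness parameter (hypothetical).
THETA: Dict[str, float] = {"beta": 4.0}


def attention_readout(demos: Sequence[Tuple[float, float]], query: float,
                      theta: Dict[str, float]) -> float:
    """Softmax attention of the query over demo inputs; returns the weighted mean of demo outputs."""
    scores = [-theta["beta"] * (query - x) ** 2 for x, _ in demos]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return sum(w * y for w, (_, y) in zip(weights, demos)) / total


def fine_tune_step(w: float, x: float, y: float, eta: float = 0.1) -> float:
    """One gradient-descent step on the loss (w*x - y)^2 for a linear model."""
    grad = 2.0 * (w * x - y) * x
    return w - eta * grad


doubling = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
negating = [(1.0, -1.0), (2.0, -2.0), (3.0, -3.0)]

theta_before = dict(THETA)
out_double = attention_readout(doubling, 2.0, THETA)  # same theta, doubling context
out_negate = attention_readout(negating, 2.0, THETA)  # same theta, negating context
theta_unchanged = THETA == theta_before

w_after = fine_tune_step(w=1.0, x=1.0, y=2.0)  # the parameter itself moves

print(round(out_double, 6), round(out_negate, 6), theta_unchanged, round(w_after, 6))
```

The same query gives different outputs under the two contexts while `THETA` is untouched, whereas the fine-tuning step returns a different parameter value. The toy only interpolates between the demonstrations it is shown, so it should be read as an illustration of where the adaptation happens, nothing more.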

Step 4 — it emerges with scale. This is the punchline that ties back to the previous pages. A small model shown the same k demonstrations cannot do this — it just completes text. Past a certain size, the same prompts start working, and few-shot accuracy climbs with model scale. In-context learning is an emergent capability: you don't add it, you grow into it.

A sufficiently large language model can perform a new task from examples placed in its prompt.

The number of demonstrations names the setting. Zero-shot: just a task description, no examples (k = 0). One-shot: a single example (k = 1), often a big jump over zero-shot because it pins down the exact output format. Few-shot: a handful, typically k from a few up to a few dozen, with accuracy usually rising as k grows, up to the limit of what fits in the context window.
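The naming and the context-window limit reduce to a few lines of arithmetic. The sketch below is illustrative only: `shot_setting` and `max_demonstrations` are hypothetical helpers, and the token counts in the example are made-up round numbers, not measurements of any particular model or tokenizer.

```python
def shot_setting(k: int) -> str:
    """Name the prompting regime for k demonstrations."""
    if k < 0:
        raise ValueError("k must be non-negative")
    if k == 0:
        return "zero-shot"
    if k == 1:
        return "one-shot"
    return "few-shot"


def max_demonstrations(context_tokens: int, tokens_per_demo: int, reserved_tokens: int) -> int:
    """Upper bound on k: how many demos fit after reserving room for the query and answer."""
    if tokens_per_demo <= 0:
        raise ValueError("tokens_per_demo must be positive")
    budget = context_tokens - reserved_tokens
    return max(0, budget // tokens_per_demo)


# Assumed numbers: a 2048-token window, 40 tokens per demonstration,
# and 48 tokens held back for the query and the generated answer.
k_max = max_demonstrations(context_tokens=2048, tokens_per_demo=40, reserved_tokens=48)
print(k_max, shot_setting(k_max))
```

Under these assumed numbers the window holds at most 50 demonstrations, so k cannot keep growing regardless of how much more accuracy additional examples might buy.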

Why can attention do this at all? One identified mechanism is the induction head: a pair of attention heads in different layers that together implement "find where this token appeared before, and copy what followed it." Given \dots A\,B \dots A it predicts B, a literal pattern-completer built out of attention. These circuits form during training; with them in place, the x_i \to y_i demonstrations become a pattern the model can match and extend. Pattern-copying machinery, discovered by training, repurposed at inference to "learn" your task.
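The rule an induction head implements can be written as an ordinary lookup. This is a hard-coded sketch of the behavior, not of the mechanism: a real head uses soft attention weights over learned representations, and the function name `induction_predict` is an assumption for illustration.

```python
from typing import Optional, Sequence


def induction_predict(tokens: Sequence[str]) -> Optional[str]:
    """Find the most recent earlier occurrence of the last token and return what followed it."""
    if not tokens:
        return None
    current = tokens[-1]
    # Scan earlier positions from most recent to oldest.
    for i in range(len(tokens) - 2, -1, -1):
        if tokens[i] == current:
            return tokens[i + 1]  # copy the token that came next last time
    return None  # the token has not appeared before


prediction = induction_predict(["A", "B", "C", "A"])
print(prediction)
```

For the sequence A B C A the lookup finds the earlier A and returns B. If the final token has never appeared before, there is nothing to copy and the sketch returns `None`.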

Build a k-shot prompt

Start at k = 0 (zero-shot, just the query) and increase k: demonstrations stack into the context one row at a time. Each row is one x_i \to y_i pair; the last row is the query x_{\text{query}} \to \,?, and the model's job is to fill the blank by continuing the established pattern. The task here is "double the number," but the model only ever sees the examples, never the rule itself.
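The stacking can be reproduced in a few lines. In this sketch the names `hidden_rule` and `k_shot_rows`, the choice of 1..k as demonstration inputs, and the query value 7 are all assumptions for illustration. The rule is used only to generate the demonstration outputs and never appears in the rows themselves.

```python
from typing import List


def hidden_rule(n: int) -> int:
    """The task, known only to whoever writes the demonstrations."""
    return 2 * n


def k_shot_rows(k: int, query: int) -> List[str]:
    """Return k demonstration rows followed by the query row with a blank answer."""
    rows = [f"{n} -> {hidden_rule(n)}" for n in range(1, k + 1)]
    rows.append(f"{query} -> ?")
    return rows


# Sweep k from zero-shot upward and print the growing context.
for k in range(4):
    print(f"k = {k}")
    for row in k_shot_rows(k, query=7):
        print("  " + row)
```

At k = 0 the context is only the row `7 -> ?`; at k = 3 it is `1 -> 2`, `2 -> 4`, `3 -> 6`, then the query. Nothing in the text states that the answer should be double the input.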