Long Context

Early transformers read a few hundred tokens at a time; today's frontier models read whole books, codebases, and hour-long transcripts — context windows of hundreds of thousands, even millions, of tokens. Getting there meant beating two separate enemies at once. One is about cost: attention scales badly with length. The other is about position: a model trained at one length forgets how to count past it. We take them in turn.

Obstacle (a): the quadratic cost

Step 1 — name the scaling. Scaled dot-product attention forms the full n \times n score matrix S = QK^{\top}, so its time and memory grow as

\text{attention cost} = O(n^2), \qquad n = \text{context length}.

Step 2 — feel the doubling. Because the cost is quadratic, doubling the context does not double the work — it quadruples it:

\frac{(2n)^2}{n^2} = 4.

Go from 2{,}000 to 1{,}000{,}000 tokens — a 500\times longer context — and the naive attention matrix grows by 500^2 = 250{,}000\times. Left unchecked, this single n^2 makes long context impossible.

Step 3 — stop storing the matrix. FlashAttention computes the exact same attention while never writing the n \times n scores to memory — it tiles the work and keeps a running softmax, dropping the memory from O(n^2) to O(n). The compute is still quadratic, but the crushing memory wall is gone.

Step 4 — stop comparing every pair. To attack the compute itself, restrict which tokens each token may attend to. Sliding-window (local) attention lets each token see only the nearest w neighbours; more general sparse patterns add a few global or strided links. Either way each token attends to O(w) others instead of all n, so the cost falls from quadratic to linear:

O(n^2) \;\longrightarrow\; O(n \cdot w), \qquad w \ll n.

Step 5 — mind the real limiter: the KV cache. At inference time a decoder caches the keys and values of every past token so it need not recompute them — the KV cache. Its size grows linearly with context,

\text{KV-cache memory} = O(n \cdot d \cdot L) \quad(L \text{ layers}),

and at a million tokens this cache — not the attention compute — is what fills the GPU. Shrinking it is exactly why grouped-query attention shares keys and values across heads. Long context is, in practice, a memory problem.

Obstacle (b): running out of positions

Step 1 — the trained range. A model trained at length L only ever sees positions 0, 1, \dots, L-1. Its sense of "how far apart" two tokens are is calibrated to that range and no further.

Step 2 — extrapolation fails. Ask it about position 3L at test time and the position signal lands somewhere it has never encountered. With rotary position embeddings (RoPE) each dimension rotates a query/key by an angle proportional to position,

\theta_m(p) = p \cdot \omega_m,

and beyond L the fastest-rotating dimensions have wound through angles the model never trained on — quality collapses.

Step 3 — interpolate the positions back in range. The fix is RoPE scaling: squeeze the new, longer position axis back into the trained one. Position interpolation simply rescales every position by L / L' when extending to a new length L' > L,

p \;\longmapsto\; p \cdot \frac{L}{L'}, \qquad\text{so } p \in [0, L') \text{ maps into } [0, L).

Now even position L'-1 presents an angle inside the trained band, so the model recognises it.

Step 4 — interpolate smartly. Plain interpolation crushes the high-frequency dimensions that encode fine local order. NTK-aware scaling and YaRN interpolate the low-frequency (long-range) dimensions while leaving the high-frequency (short-range) ones nearly untouched — extending reach without blurring adjacency. A short fine-tune at the new length locks it in. This is how a model trained at 4\text{k} is stretched to 128\text{k} and beyond.

Two obstacles stand between a transformer and a very long context:

Quadratic attention. Cost is O(n^2), so doubling context quadruples it. Tamed by FlashAttention (exact, O(n) memory) and by sliding-window / sparse attention (each token sees O(w) others, so O(n^2) \to O(nw)).
The KV cache is the memory limiter. At inference the cache grows O(n d L) with context, so at long lengths it — not the attention compute — fills the GPU.
Positions run out. A model trained to length L degrades past it; RoPE scaling (position interpolation / NTK / YaRN) maps positions beyond L back into the trained range, p \mapsto p\, L/L', extending reach.

A model can accept a million tokens and still not use them. The standard probe is needle-in-a-haystack: hide a single fact (the needle) at a random depth in a long filler document (the haystack), then ask for it. Sweep the needle's position and the haystack's length and you get a grid of pass/fail — a map of where recall actually holds.

The results are humbling. Models routinely show a U-shaped profile — sharp at the very start and very end, hazy in the middle (the "lost in the middle" effect) — and recall that decays well before the advertised limit. Hence the distinction between advertised context (the number on the box, how many tokens it will accept) and effective context (the length over which it reliably retrieves and reasons). The second is the one that matters, and it is usually the smaller.

Watch the curves separate

Two costs plotted against context length n (in thousands of tokens). The steep curve is full attention compute, growing like n^2 — it pulls away from everything else as the context grows. The straight line is KV-cache memory, growing like n; and the gentler line is sliding-window attention, \propto n \cdot w for a fixed window — linear, like the cache. Past a certain length the quadratic curve simply dominates, which is the whole story of why long context is hard.

Where this goes

We now have every modern upgrade in hand — sparse capacity, and the tricks that stretch context to the horizon. Time to assemble them. The next page builds a 2020s frontier model from these parts and shows how each one slots into the original 2017 design.