Early transformers read a few hundred tokens at a time; today's frontier models read whole books, codebases, and hour-long transcripts — context windows of hundreds of thousands, even millions, of tokens. Getting there meant beating two separate enemies at once. One is about cost: attention scales badly with length. The other is about position: a model trained at one length forgets how to count past it. We take them in turn.
Step 1 — name the scaling.
Step 2 — feel the doubling. Because the cost is quadratic, doubling the context does not double the work — it quadruples it:
Go from $n$ tokens to $2n$ and the number of pairwise scores goes from $n^2$ to $(2n)^2 = 4n^2$.
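The quadrupling can be checked with a few lines of arithmetic. This is a minimal sketch that counts the query-key scores for one attention head in one layer; the function names and the 2-bytes-per-score figure (16-bit floats) are illustrative assumptions, not values from the text.

```python
def attention_score_count(n_tokens: int) -> int:
    """Query-key scores in full self-attention: every token against every token."""
    return n_tokens * n_tokens


def score_matrix_bytes(n_tokens: int, bytes_per_score: int = 2) -> int:
    """Memory needed to store the full score matrix (assumes 16-bit scores)."""
    return attention_score_count(n_tokens) * bytes_per_score


for n in (1_000, 2_000, 4_000):
    # Each doubling of n multiplies both numbers by four.
    print(n, attention_score_count(n), score_matrix_bytes(n))
```

Each doubling of the context multiplies both the score count and the bytes needed to hold the matrix by four, which is the whole content of "quadratic".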
Step 3 — stop storing the matrix.
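One way to avoid storing the full score matrix is to walk over the keys and values in blocks while keeping a running softmax. The sketch below shows that idea in NumPy for a single head without a causal mask. It is the same running-softmax idea used by fused kernels such as FlashAttention, but that this step refers to that specific method is an assumption, and the function names and `block_size` parameter are illustrative.

```python
import numpy as np


def naive_attention(q, k, v):
    """Reference version: materializes the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


def blockwise_attention(q, k, v, block_size=64):
    """Same result, but only an (n, block_size) slice of scores exists at a time."""
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    running_max = np.full((n, 1), -np.inf)  # largest score seen so far, per query
    running_sum = np.zeros((n, 1))          # softmax denominator so far, per query
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(running_max, scores.max(axis=-1, keepdims=True))
        # Rescale what was accumulated under the old maximum.
        correction = np.exp(running_max - new_max)
        p = np.exp(scores - new_max)
        running_sum = running_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        running_max = new_max
    return out / running_sum
```

The arithmetic is still quadratic in length, since every query meets every key; only the peak memory changes.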
Step 4 — stop comparing every pair. To attack the compute itself, restrict
which tokens each token may attend to. Sliding-window (local) attention lets
each token see only its nearest neighbors inside a fixed window, so the cost grows linearly with length instead of quadratically.
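A sliding-window pattern is easiest to see as a boolean mask. The sketch below builds one in NumPy, assuming a causal decoder in which each token sees itself and the tokens just before it; the name `sliding_window_mask` and the `window` parameter are illustrative.

```python
import numpy as np


def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """mask[i, j] is True when token i may attend to token j.

    Assumes a causal window: token i sees itself and the window - 1 tokens before it.
    """
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)


mask = sliding_window_mask(8, 3)
print(mask.astype(int))
print("allowed pairs:", int(mask.sum()), "of", mask.size)
```

No row ever holds more than `window` allowed entries, so the number of scored pairs is at most `n_tokens * window` rather than `n_tokens ** 2`.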
Step 5 — mind the real limiter: the KV cache. At inference time a decoder
caches the keys and values of every past token so it need not recompute them. This cache grows linearly with context length,
and at a million tokens it, not the attention compute, is what fills the GPU. Shrinking it
is exactly why
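The size of the KV cache follows from simple arithmetic: two tensors (keys and values) per layer, each holding one vector per KV head per token. The sketch below uses a hypothetical model configuration chosen only for illustration (32 layers, head dimension 128, 16-bit values); none of these numbers come from the text.

```python
def kv_cache_bytes(n_tokens: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for one sequence."""
    keys_and_values = 2
    return keys_and_values * n_layers * n_kv_heads * head_dim * n_tokens * bytes_per_value


# Hypothetical configuration, for illustration only.
for n_kv_heads in (32, 8, 1):
    size = kv_cache_bytes(n_tokens=1_000_000, n_layers=32,
                          n_kv_heads=n_kv_heads, head_dim=128)
    print(f"{n_kv_heads:>2} KV heads: {size / 1e9:.1f} GB")
```

Under these assumptions the cache scales linearly with both the token count and the number of KV heads, so reducing the number of heads that store keys and values is one direct lever on its size.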
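Step 1 — the trained range. A model trained at length $L_{\text{train}}$ has only ever seen positions $0$ through $L_{\text{train}} - 1$.
Step 2 — extrapolation fails. Ask it about position $L_{\text{train}}$ and beyond and it meets position encodings it never saw in training, so quality degrades sharply.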
Step 3 — interpolate the positions back in range. The fix is RoPE
scaling: squeeze the new, longer position axis back into the trained one. Position
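interpolation simply rescales every position by the ratio of the trained length to the new length, $L_{\text{train}} / L_{\text{new}}$.
Now even position $L_{\text{new}} - 1$ maps to a value below $L_{\text{train}}$, inside the range the model was trained on.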
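Position interpolation can be sketched directly on the RoPE rotation angles. The code below assumes the common RoPE parameterization with base 10000 and uses hypothetical lengths (trained at 2048, extended to 8192); the function names are illustrative.

```python
import numpy as np


def rope_angles(positions, head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Rotation angle for each (position, frequency pair) under standard RoPE."""
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    return np.outer(np.asarray(positions, dtype=float), inv_freq)


def interpolate_positions(positions, trained_len: int, target_len: int) -> np.ndarray:
    """Squeeze positions from [0, target_len) back into [0, trained_len)."""
    return np.asarray(positions, dtype=float) * (trained_len / target_len)


trained_len, target_len = 2048, 8192  # hypothetical lengths
positions = np.arange(target_len)
squeezed = interpolate_positions(positions, trained_len, target_len)
print(squeezed.max())  # stays below trained_len
angles = rope_angles(squeezed, head_dim=64)
```

The price is resolution: neighboring tokens are now a quarter of a position apart in this example, so the angles that distinguish adjacent tokens are four times smaller than the model saw in training.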
Step 4 — interpolate smartly. Plain interpolation crushes the high-frequency
dimensions that encode fine local order. NTK-aware scaling and YaRN
interpolate the low-frequency (long-range) dimensions while leaving the
high-frequency (short-range) ones nearly untouched — extending reach without blurring
adjacency. A short fine-tune at the new length locks it in. This is how a model trained at
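The frequency-dependent behavior can be illustrated with one commonly used formulation of NTK-aware scaling, which enlarges the RoPE base instead of shrinking the positions. The sketch below assumes that formulation (base multiplied by `scale ** (head_dim / (head_dim - 2))`) and a hypothetical 4x extension; YaRN adds further refinements that are not shown here.

```python
import numpy as np


def rope_inv_freq(head_dim: int, base: float = 10000.0) -> np.ndarray:
    """Per-pair rotation frequencies under standard RoPE, highest frequency first."""
    return base ** (-np.arange(0, head_dim, 2) / head_dim)


def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """Enlarged base under one common NTK-aware formulation (an assumption here)."""
    return base * scale ** (head_dim / (head_dim - 2))


head_dim, scale = 128, 4.0  # hypothetical head size and extension factor
original = rope_inv_freq(head_dim)
scaled = rope_inv_freq(head_dim, ntk_scaled_base(10000.0, scale, head_dim))
ratio = scaled / original

print("highest-frequency pair:", ratio[0])   # left untouched
print("lowest-frequency pair: ", ratio[-1])  # slowed by the full factor 1 / scale
```

Plain position interpolation would apply the same factor `1 / scale` to every pair; here the factor slides smoothly from none at the high-frequency end to the full amount at the low-frequency end.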
A model can accept a million tokens and still not use them. The standard probe is needle-in-a-haystack: hide a single fact (the needle) at a random depth in a long filler document (the haystack), then ask for it. Sweep the needle's position and the haystack's length and you get a grid of pass/fail — a map of where recall actually holds.
The results are humbling. Models routinely show a U-shaped profile — sharp at the very start and very end, hazy in the middle (the "lost in the middle" effect) — and recall that decays well before the advertised limit. Hence the distinction between advertised context (the number on the box, how many tokens it will accept) and effective context (the length over which it reliably retrieves and reasons). The second is the one that matters, and it is usually the smaller.
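A needle-in-a-haystack sweep is a small amount of code. The sketch below builds the prompts and the pass/fail grid; everything in it is illustrative, including the filler text, the needle, and `toy_model`, a stand-in that by construction only "sees" the first and last few hundred characters so the grid has something to show. A real run would replace `toy_model` with a call to the model under test.

```python
from typing import Callable, List


def build_haystack(filler: str, needle: str, n_sentences: int, depth: float) -> str:
    """Place the needle at a relative depth (0.0 = start, 1.0 = end) in the filler."""
    sentences = [filler] * n_sentences
    sentences.insert(round(depth * n_sentences), needle)
    return " ".join(sentences)


def recall_grid(model: Callable[[str], str], lengths: List[int], depths: List[float],
                needle: str, question: str, answer: str,
                filler: str = "The sky was grey that day.") -> List[List[bool]]:
    """One row per haystack length, one column per needle depth; True means recalled."""
    grid = []
    for n_sentences in lengths:
        row = []
        for depth in depths:
            prompt = build_haystack(filler, needle, n_sentences, depth) + " " + question
            row.append(answer in model(prompt))
        grid.append(row)
    return grid


def toy_model(prompt: str, edge_chars: int = 400) -> str:
    """Hypothetical stand-in, not a real model: echoes only the edges of the prompt."""
    return prompt[:edge_chars] + prompt[-edge_chars:]


grid = recall_grid(toy_model, lengths=[10, 100, 1000], depths=[0.0, 0.5, 1.0],
                   needle="The secret code is 7391.",
                   question="What is the secret code?", answer="7391")
for row in grid:
    print(row)
```

With a real model, the longest length at which every column of a row still passes is a rough estimate of effective context for that task.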
Two costs plotted against context length
We now have every modern upgrade in hand — sparse capacity, and the tricks that stretch context to the horizon. Time to assemble them. The next page builds a 2020s frontier model from these parts and shows how each one slots into the original 2017 design.