Recall what attention does at decode step
Step 1 — spot the recomputation. A naïve implementation, at every step
key/value computations — quadratic in the length, and most of it redundant.
Step 2 — the key fact: past keys and values are frozen. Here is the crux.
Because of
Step 3 — so cache them. Keep a growing store — the KV cache —
holding every
where
Step 4 — the new cost: linear per step. Each step now computes exactly
one new key and value (plus one query), regardless of how long the sequence already
is. The per-step work drops from
Concretely: without the cache, generating a token after
Step 5 — count the memory cost. Nothing is free. The cache stores a key and a value vector for every past token, in every layer, for every attention head. So its size grows as
We have traded recomputation for storage — the classic deal. And as the next vignette shows, for long contexts and big batches that storage becomes the binding constraint.
The KV cache solves the time problem and hands you a memory problem. Because its size grows linearly with sequence length — and is multiplied by every layer and every head — a long-context model holding tens of thousands of tokens can spend more memory on the cache than on its own weights. For a batch of users, multiply again by the batch size. Suddenly the cache, not the parameters, is what won't fit on the GPU, and it is what caps how many requests you can serve at once or how long a context you can offer.
This single pressure motivates a whole subfield. Grouped-query attention (GQA) shrinks the cache by sharing one set of keys/values across several query heads — fewer heads in the size formula. Paged attention (the idea behind vLLM) stops the cache from wastefully reserving a contiguous block for the maximum length, paging it like virtual memory so fragmentation doesn't strand free space. Sliding-window attention caps the sequence-length term by forgetting the distant past. Every one of these is an answer to the same question the KV cache poses: now that we must store the past, how do we afford it?
Each column is one token's stored key/value pair. Step forward to take one decode step at a time: the model computes exactly one new column (highlighted — freshly computed), and every earlier column is simply reused from the cache (faded — no recomputation). Count the bright cells across all steps and you have counted the entire amount of key/value work done; count the faded ones and you have counted everything the cache saved you from recomputing. Refresh for a different prompt length and number of generated tokens.