The KV Cache

Autoregressive decoding looks ruinously wasteful at first glance. At each decode step the model re-runs attention over the entire sequence so far — but almost all of that sequence was already there last step. Are we recomputing the same thing again and again? Yes — and the KV cache is the one observation that fixes it, turning the cost of generating each token from quadratic into linear. It is the single most important optimisation in LLM inference.

Why the past never changes, line by line

Recall what attention does at decode step t. The new token forms a query q_t, and every position j \le t contributes a key k_j and a value v_j. The output mixes the values by the softmax of the query·key scores:

z_t = \sum_{j \le t} \operatorname{softmax}_j\!\Big(\tfrac{q_t \cdot k_j}{\sqrt{d_k}}\Big)\, v_j.

Step 1 — spot the recomputation. A naïve implementation, at every step t, recomputes k_j and v_j for all j \le t from scratch. Step t touches t positions, so producing n tokens this way costs

\sum_{t=1}^{n} t = \frac{n(n+1)}{2} = O(n^2)

key/value computations — quadratic in the length, and most of it redundant.

Step 2 — the key fact: past keys and values are frozen. Here is the crux. Because of causal masking, a token at position j only ever attends to positions \le j — it cannot see anything to its right. So k_j and v_j are computed from token j and its left context only, neither of which changes when we later append token j + 1, j + 2, \dots. Once computed, a past key and value are fixed forever:

k_j,\ v_j \ \text{ depend only on } x_{\le j}, \quad\text{so they are unchanged at every step } t > j.

Step 3 — so cache them. Keep a growing store — the KV cache — holding every k_j, v_j computed so far. At decode step t the new token computes only its own trio q_t, k_t, v_t, appends k_t, v_t to the cache, and attends over the cached keys/values:

\text{cache} \mathrel{+}= (k_t, v_t), \qquad z_t = \operatorname{Attention}\big(q_t,\ K_{\le t},\ V_{\le t}\big),

where K_{\le t}, V_{\le t} are read straight from the cache — no recomputation of the past.

Step 4 — the new cost: linear per step. Each step now computes exactly one new key and value (plus one query), regardless of how long the sequence already is. The per-step work drops from O(t) recomputation to O(1) new key/values, and the attention itself reads t cached entries — so the dominant per-step cost falls from O(t) repeated work on every past token to O(t) cheap reads:

\underbrace{O(t^2)}_{\text{recompute everything}} \;\longrightarrow\; \underbrace{O(t)}_{\text{just the new token}} \quad\text{per step (} O(n^2)\text{→}O(n^2)\text{ total, but with the constant slashed and zero redundant key/value work).}

Concretely: without the cache, generating a token after t existing ones re-derives t keys and values; with it, you derive exactly one. That is the whole speedup — and it is enormous in practice.

Step 5 — count the memory cost. Nothing is free. The cache stores a key and a value vector for every past token, in every layer, for every attention head. So its size grows as

\text{cache size} \;\propto\; \underbrace{2}_{k \text{ and } v} \times\, (\text{sequence length}) \times (\text{layers}) \times (\text{heads}) \times (\text{head dim}).

We have traded recomputation for storage — the classic deal. And as the next vignette shows, for long contexts and big batches that storage becomes the binding constraint.

During autoregressive decoding:

Past keys/values are fixed under causal masking. Because token j attends only to positions \le j, k_j and v_j depend only on x_{\le j} and never change when later tokens are appended.
So cache them. Store every k_j, v_j; each new step computes only its own q_t, k_t, v_t, appends k_t, v_t, and attends over the cached K_{\le t}, V_{\le t}.
O(t^2) \to O(t) per step. Per-step work falls from recomputing all t past keys/values to computing just 1 — the single biggest inference speedup.
Memory \propto seq \times layers \times heads. The cache holds a key and a value per token, per layer, per head, so its size is 2 \times \text{seq} \times \text{layers} \times \text{heads} \times d_{\text{head}}.

The KV cache solves the time problem and hands you a memory problem. Because its size grows linearly with sequence length — and is multiplied by every layer and every head — a long-context model holding tens of thousands of tokens can spend more memory on the cache than on its own weights. For a batch of users, multiply again by the batch size. Suddenly the cache, not the parameters, is what won't fit on the GPU, and it is what caps how many requests you can serve at once or how long a context you can offer.

This single pressure motivates a whole subfield. Grouped-query attention (GQA) shrinks the cache by sharing one set of keys/values across several query heads — fewer heads in the size formula. Paged attention (the idea behind vLLM) stops the cache from wastefully reserving a contiguous block for the maximum length, paging it like virtual memory so fragmentation doesn't strand free space. Sliding-window attention caps the sequence-length term by forgetting the distant past. Every one of these is an answer to the same question the KV cache poses: now that we must store the past, how do we afford it?

Watch the cache grow

Each column is one token's stored key/value pair. Step forward to take one decode step at a time: the model computes exactly one new column (highlighted — freshly computed), and every earlier column is simply reused from the cache (faded — no recomputation). Count the bright cells across all steps and you have counted the entire amount of key/value work done; count the faded ones and you have counted everything the cache saved you from recomputing. Refresh for a different prompt length and number of generated tokens.