Paged Attention & Batching

Serving one user is easy; serving thousands at once is the real engineering problem. Two ideas, both borrowed from operating-systems thinking, let a single GPU keep far more sequences in flight: continuous batching (never let the GPU wait) and paged attention (never waste KV-cache memory). Together they are why modern servers like vLLM reach many times the throughput of a naive loop.

Deriving the two fixes

Step 1 — the naive static batch wastes the GPU. A simple server gathers a fixed batch of B requests, decodes them in lockstep, and only returns the batch when the slowest sequence finishes. Short replies sit idle waiting for the longest one, and no new request can join until the whole batch is done. The GPU spends much of its time decoding finished or empty slots.

Step 2 — continuous (in-flight) batching. Instead of a fixed cohort, manage the batch per step: every decoding step, evict any sequence that just emitted its end token and admit a waiting request into the freed slot. The batch is reassembled continuously, so the GPU is always working on a full set of live sequences — no waiting for the slowest, no idle slots.

Step 3 — but the KV cache fragments. Each sequence keeps a KV cache that grows by one entry per generated token. A naive server pre-allocates one big contiguous block per sequence, sized for the maximum possible length. A request that stops early leaves most of its block unused, and the leftover gaps are too small and scattered to reuse — classic fragmentation. Measurements on early systems found most KV memory wasted this way, capping how many sequences fit.

Step 4 — page the KV cache. Borrow virtual memory's trick: chop each sequence's KV cache into fixed-size pages (say 16 tokens each) and store them in non-contiguous physical blocks, with a per-sequence block table mapping logical position → physical page. A sequence now grabs one more page only when it needs it, and frees pages exactly when it ends. There is no giant pre-allocation and no scattered remainder:

\text{wasted memory} \le (\text{one partly-filled page}) \text{ per sequence}.

Internal waste drops from "most of a max-length block" to "less than one page" — a few percent.

Step 5 — more sequences fit ⇒ more throughput. KV memory is the binding constraint on how many sequences run concurrently. Recover the wasted fraction and you raise the concurrent count B by the same factor. Since aggregate throughput scales with how many live sequences share each weight-load, more concurrency is directly more tokens per second:

B_{\text{paged}} \approx \frac{B_{\text{contiguous}}}{\text{utilisation}} \quad\Rightarrow\quad \text{throughput} \uparrow. Serve many requests on one GPU with two OS-style tricks:

People assume the model weights are what fill GPU memory during serving, but the weights are loaded once and shared by every concurrent request. What grows per sequence — and therefore limits how many users you can serve at once — is the KV cache: for a sequence of length L it holds L key/value vectors in every layer.

Its size scales as 2 \times L \times n_{\text{layers}} \times d_{\text{model}} \times (\text{bytes/elem}), which for a long context and many sequences quickly dwarfs everything else. That is why the whole game is fitting more KV cache into the same memory: paging recovers the wasted space, and lower-precision KV (an inference-time quantization of the cache) shrinks each entry further.

Fragmented blocks vs neat pages

Both columns hold the same amount of GPU memory. On the left, each sequence gets one big contiguous block sized for the worst case — the filled part is used KV, the rest is wasted, and the scattered gaps can't be reused. On the right, the cache is cut into fixed-size pages packed back to back; with the waste recovered, more sequences fit in the same space.