Serving one user is easy; serving thousands at once is the real engineering problem. Two ideas, both borrowed from operating-systems thinking, let a single GPU keep far more sequences in flight: continuous batching (never let the GPU wait) and paged attention (never waste KV-cache memory). Together they are why modern servers like vLLM reach many times the throughput of a naive loop.
Step 1 — the naive static batch wastes the GPU. A simple server gathers a
fixed batch of
Step 2 — continuous (in-flight) batching. Instead of a fixed cohort, manage the batch per step: every decoding step, evict any sequence that just emitted its end token and admit a waiting request into the freed slot. The batch is reassembled continuously, so the GPU is always working on a full set of live sequences — no waiting for the slowest, no idle slots.
Step 3 — but the KV cache fragments. Each sequence keeps a KV cache that grows by one entry per generated token. A naive server pre-allocates one big contiguous block per sequence, sized for the maximum possible length. A request that stops early leaves most of its block unused, and the leftover gaps are too small and scattered to reuse — classic fragmentation. Measurements on early systems found most KV memory wasted this way, capping how many sequences fit.
Step 4 — page the KV cache. Borrow virtual memory's trick: chop each
sequence's KV cache into fixed-size pages (say
Internal waste drops from "most of a max-length block" to "less than one page" — a few percent.
Step 5 — more sequences fit ⇒ more throughput. KV memory is the binding
constraint on how many sequences run concurrently. Recover the wasted fraction and you raise the
concurrent count
People assume the model weights are what fill GPU memory during serving, but the weights are
loaded once and shared by every concurrent request. What grows per sequence — and
therefore limits how many users you can serve at once — is the
Its size scales as
Both columns hold the same amount of GPU memory. On the left, each sequence gets one big contiguous block sized for the worst case — the filled part is used KV, the rest is wasted, and the scattered gaps can't be reused. On the right, the cache is cut into fixed-size pages packed back to back; with the waste recovered, more sequences fit in the same space.