Throughput vs Latency

Every serving decision comes down to one trade-off. Latency is what a single user feels — the time to get their next token. Throughput is what the bill depends on — the total tokens per second across all the users sharing the GPU. The knob that trades one for the other is the batch size, and which way you turn it depends on whether you are optimising for a person or for a budget.

Deriving the trade-off

Step 1 — define the two quantities. For a batch of B concurrent requests decoding in lockstep, let t(B) be the time for one decoding step. Then

\text{latency} = t(B) \ \text{(per token, per request)}, \qquad \text{throughput} = \frac{B}{t(B)} \ \text{(tokens/s, all requests)}.

A single user cares about t(B); the operator cares about B / t(B).

Step 2 — decode is memory-bandwidth-bound. Each decoding step must read every model weight from memory to produce its tokens. With weights of size W bytes and memory bandwidth \beta bytes/s, the step time is at best the time to stream the weights once — and crucially that read is shared by all B requests in the batch:

t(B) \approx \frac{W}{\beta} + B\cdot c,

where c is the extra compute time each request adds to a step, so B\cdot c is the total compute time for the batch. For small B the fixed weight-load time W/\beta dominates, so adding requests is nearly free.
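As a rough numeric illustration of the step-time model above, the sketch below evaluates t(B) for a few batch sizes. The function name `step_time` and all three numbers (a 14 GB weight footprint, 2 TB/s of bandwidth, 60 µs of compute per request) are assumptions chosen for illustration, not measurements of any particular GPU or model.

```python
def step_time(batch_size: int, weight_bytes: float, bandwidth: float, per_request_s: float) -> float:
    """Modelled time for one decode step, in seconds: t(B) = W/beta + B*c."""
    return weight_bytes / bandwidth + batch_size * per_request_s


# Illustrative values only (assumptions, not measurements).
W = 14e9     # weight bytes, e.g. 7B parameters at 2 bytes each
BETA = 2e12  # memory bandwidth in bytes/s
C = 60e-6    # extra compute per request per step, in seconds

weight_load = W / BETA  # fixed cost per step, shared by the whole batch
for B in (1, 8, 32):
    t = step_time(B, W, BETA, C)
    print(f"B={B:>2}  t(B)={t * 1e3:.2f} ms  weight-load share={weight_load / t:.0%}")
```

With these assumed values the weight load is 7 ms of every step, so even at B = 32 it still accounts for most of the step time.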

Step 3 — throughput rises, then saturates. Substitute Step 2 into the throughput formula:

\text{throughput}(B) = \frac{B}{\,W/\beta + B\,c\,}.

At small B this grows almost linearly (you amortise the one weight load over more requests); as B grows it flattens toward the ceiling 1/c, where the GPU becomes compute-bound. This is the roofline: cheap throughput gains until you hit the wall.
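One way to see where the curve bends is to factor the ceiling out of the throughput formula. The symbol B_{1/2} is introduced here only as a label for this sketch; it is not part of the derivation above.

\text{throughput}(B) = \frac{B}{\,W/\beta + B\,c\,} = \frac{1}{c}\cdot\frac{B}{B + B_{1/2}}, \qquad B_{1/2} = \frac{W}{\beta\,c}.

At B = B_{1/2} the compute time B c equals the weight-load time W/\beta, and throughput is exactly half the ceiling 1/c. For B \ll B_{1/2} the formula reduces to \text{throughput} \approx B\,\beta/W, which is the near-linear regime; for B \gg B_{1/2} it approaches 1/c.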

Step 4 — latency climbs with the batch. Under the Step 2 model, t(B) = W/\beta + Bc increases linearly in B. So the very batching that buys throughput makes each individual user wait longer per token:

B \uparrow \;\Rightarrow\; \text{throughput} \uparrow \text{ (then flat)}, \quad \text{latency} \uparrow.

Step 5 — read off the operating point. There is no free lunch: pick a small B for a snappy interactive chat (low latency, costlier per token), or a large B for cheap bulk generation (high throughput, laggier per request). The right batch size is the largest B that still meets your latency budget, and no larger.
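Under the linear model from Step 2, the operating point has a closed form: the largest B satisfying W/\beta + Bc \le \text{budget} is \lfloor (\text{budget} - W/\beta)/c \rfloor. The sketch below is one way to compute it; the helper name `max_batch_for_latency` and the numbers are hypothetical, and a real system would measure t(B) rather than trust a linear fit.

```python
import math


def max_batch_for_latency(budget_s: float, weight_bytes: float,
                          bandwidth: float, per_request_s: float) -> int:
    """Largest B with W/beta + B*c <= budget_s, or 0 if even B=1 misses the budget."""
    slack = budget_s - weight_bytes / bandwidth  # time left after the shared weight load
    if slack < per_request_s:
        return 0
    return math.floor(slack / per_request_s)


# Illustrative assumptions: 14 GB of weights, 2 TB/s bandwidth, 60 us per request.
W, BETA, C = 14e9, 2e12, 60e-6

best = max_batch_for_latency(0.020, W, BETA, C)  # 20 ms per-token budget
print(best)  # 216
print(f"{best / (W / BETA + best * C):.0f} tok/s at that batch size")
```

A tighter budget shrinks the slack left after the 7 ms weight load, and a budget below that fixed cost cannot be met at any batch size under this model.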

Serving balances two metrics against the batch size B:

Two metrics. Latency = t(B) is per-request (one user's wait); throughput = B/t(B) is aggregate (tokens/s, and so cost).
Batching trades them. Larger B raises throughput (amortising the shared weight-load) but raises per-request latency; the two pull opposite ways.
Memory-bandwidth-bound. Decode must read all weights each step, so throughput is gated by useful work per byte of weights moved; batching helps until throughput saturates near the compute ceiling 1/c.

Because a GPU costs a roughly fixed number of dollars per hour, the cost per token is (GPU cost per second) ÷ (throughput in tokens per second). Throughput rises with batch size up to the ceiling, so cost per token falls with batching until it flattens. That is why bulk/offline jobs run at the largest batch their memory allows, while interactive endpoints stay smaller to protect latency.
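To make the cost relationship concrete, the sketch below converts an hourly GPU price into dollars per million tokens at several batch sizes. The price of $2 per hour, the model parameters, and both function names are illustrative assumptions, not quotes or measurements.

```python
def throughput(batch_size: int, weight_bytes: float, bandwidth: float, per_request_s: float) -> float:
    """Aggregate tokens/s under the model t(B) = W/beta + B*c."""
    return batch_size / (weight_bytes / bandwidth + batch_size * per_request_s)


def cost_per_million_tokens(gpu_dollars_per_hour: float, tokens_per_second: float) -> float:
    """Hourly price converted to dollars per second, divided by throughput, scaled to 1M tokens."""
    return gpu_dollars_per_hour / 3600.0 / tokens_per_second * 1e6


# Illustrative assumptions only.
W, BETA, C = 14e9, 2e12, 60e-6
PRICE = 2.0  # hypothetical GPU price in dollars per hour

for B in (1, 8, 64, 512):
    tps = throughput(B, W, BETA, C)
    print(f"B={B:>3}  {tps:>6.0f} tok/s  ${cost_per_million_tokens(PRICE, tps):.3f} per 1M tokens")
```

Under this model the cost can never drop below the value reached at the throughput ceiling 1/c, which is why the curve flattens rather than falling to zero.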

The roofline model makes the wall precise. Plot achievable performance against arithmetic intensity (FLOPs per byte moved): below a threshold you are on the sloped memory-bandwidth roof (more intensity = more performance — exactly what batching buys), and above it you hit the flat compute roof. Single-request decode lives far down the bandwidth slope, which is precisely why there is so much free throughput to gain by batching — and why paged attention & batching matter so much for serving cost.
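The two roofs can be written as a single `min`. The sketch below uses hypothetical hardware numbers (250 TFLOP/s peak compute, 2 TB/s bandwidth) and a hypothetical function name. As a rough rule of thumb, and an assumption here, decode with 16-bit weights does about 2 FLOPs per weight per request while moving 2 bytes per weight, so arithmetic intensity is on the order of B FLOPs per byte when KV-cache and activation traffic are ignored.

```python
def attainable_flops(intensity: float, peak_flops: float, bandwidth: float) -> float:
    """Roofline bound: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, intensity * bandwidth)


# Hypothetical hardware, for illustration only.
PEAK = 250e12  # peak compute in FLOP/s
BETA = 2e12    # memory bandwidth in bytes/s

ridge = PEAK / BETA  # intensity (FLOPs/byte) at which the two roofs meet

for intensity in (1, 16, 125, 1000):
    perf = attainable_flops(intensity, PEAK, BETA)
    roof = "bandwidth" if intensity < ridge else "compute"
    print(f"intensity={intensity:>4} FLOPs/byte  {perf / 1e12:>5.0f} TFLOP/s  ({roof} roof)")
```

With these numbers an intensity of 1 FLOP per byte reaches under 1% of peak, which is the sense in which single-request decode sits far down the bandwidth slope.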

Turn the batch-size knob

Both curves are drawn against the batch size B. The bold rising-then-flattening curve is throughput B/t(B): big early gains, then a ceiling. The other climbing curve is per-request latency t(B), which only ever goes up. Reading off an operating point means answering one question: where do you stop trading a user's wait for the operator's tokens-per-second?