Every serving decision comes down to one trade-off. Latency is what a single user feels — the time to get their next token. Throughput is what the bill depends on — the total tokens per second across all the users sharing the GPU. The knob that trades one for the other is the batch size, and which way you turn it depends on whether you are optimising for a person or for a budget.
Step 1 — define the two quantities. For a batch of
A single user cares about
Step 2 — decode is memory-bandwidth-bound. Each decoding step must read
every model weight from memory to produce its tokens. With weights of size
where
Step 3 — throughput rises, then saturates. Substitute Step 2 into the throughput formula:
At small
Step 4 — latency climbs with the batch. From Step 2,
Step 5 — read off the operating point. There is no free lunch: pick a small
Because a GPU costs a roughly fixed number of dollars per hour, the cost per token is just (GPU cost per hour) ÷ (tokens generated per hour), that is, hourly cost divided by throughput. Throughput rises with batch size up to the ceiling, so cost per token falls with batching until it flattens at that ceiling. That is why bulk/offline jobs run at the largest batch their memory allows, while interactive endpoints stay smaller to protect latency.
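The cost arithmetic above can be sketched in a few lines of Python. This is only an illustration under a stated assumption: the step-time model used here (a fixed weight-read time plus a small per-sequence increment) is a guess at the shape described in the steps, not a formula taken from the text. Every number (weight size, bandwidth, per-sequence cost, GPU price) is hypothetical.

```python
# Hypothetical hardware and model numbers, chosen only for illustration.
HW = {
    "weight_bytes": 14e9,           # e.g. a 7B-parameter model in 16-bit weights
    "bandwidth_bytes_per_s": 2e12,  # assumed memory bandwidth
    "per_seq_s": 1e-4,              # assumed extra step time per sequence in the batch
}
GPU_DOLLARS_PER_HOUR = 2.0          # assumed rental price


def step_time_s(batch, weight_bytes, bandwidth_bytes_per_s, per_seq_s):
    """Assumed model: time to stream all weights once, plus a per-sequence cost."""
    return weight_bytes / bandwidth_bytes_per_s + batch * per_seq_s


def throughput_tok_per_s(batch, weight_bytes, bandwidth_bytes_per_s, per_seq_s):
    """Each step yields one token per sequence, so throughput = batch / step time."""
    return batch / step_time_s(batch, weight_bytes, bandwidth_bytes_per_s, per_seq_s)


def cost_per_million_tokens(batch, gpu_dollars_per_hour, **hw):
    """(GPU cost per hour) / (tokens per hour), scaled to one million tokens."""
    tokens_per_hour = throughput_tok_per_s(batch, **hw) * 3600
    return gpu_dollars_per_hour / tokens_per_hour * 1e6


for b in (1, 8, 64, 512):
    latency_ms = step_time_s(b, **HW) * 1e3
    tput = throughput_tok_per_s(b, **HW)
    cost = cost_per_million_tokens(b, GPU_DOLLARS_PER_HOUR, **HW)
    print(f"batch={b:4d}  step={latency_ms:6.2f} ms  "
          f"throughput={tput:8.0f} tok/s  cost=${cost:.3f}/M tok")
```

Under this assumed model, throughput can never exceed `1 / per_seq_s` tokens per second, so the cost per token flattens as the batch grows, while the per-step latency keeps climbing.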
The roofline model makes the wall precise. Plot achievable performance against arithmetic intensity (FLOPs per byte moved): below a threshold you are on the sloped memory-bandwidth roof, where performance grows in proportion to intensity (exactly what batching buys), and above it you hit the flat compute roof. Single-request decode lives far down the bandwidth slope, which is precisely why there is so much free throughput to gain by batching, and why
Both curves are drawn against the batch size
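The roofline description can be made concrete with a small sketch. The roofline bound itself, attainable performance = min(peak compute, bandwidth × intensity), is the standard form of the model. The peak and bandwidth figures below are hypothetical. The helper `decode_intensity` is a rough assumption, not something stated in the text: it counts about 2 FLOPs per parameter per token and counts only weight traffic, ignoring the KV cache and activations.

```python
PEAK_FLOPS = 1e15            # hypothetical peak compute, FLOP/s
BANDWIDTH_BYTES_PER_S = 2e12 # hypothetical memory bandwidth, bytes/s


def attainable_flops(intensity, peak_flops=PEAK_FLOPS,
                     bandwidth=BANDWIDTH_BYTES_PER_S):
    """Roofline bound: the lower of the bandwidth slope and the compute ceiling."""
    return min(peak_flops, bandwidth * intensity)


def ridge_intensity(peak_flops=PEAK_FLOPS, bandwidth=BANDWIDTH_BYTES_PER_S):
    """Arithmetic intensity (FLOPs/byte) at which the two roofs meet."""
    return peak_flops / bandwidth


def decode_intensity(batch, bytes_per_param=2):
    """Assumed approximation: ~2 FLOPs per parameter per token, and the
    weights are read once per step regardless of batch size."""
    return 2 * batch / bytes_per_param


for b in (1, 16, 256, 1024):
    i = decode_intensity(b)
    roof = "bandwidth" if i < ridge_intensity() else "compute"
    print(f"batch={b:5d}  intensity={i:7.1f} FLOP/B  "
          f"attainable={attainable_flops(i):.2e} FLOP/s  ({roof} roof)")
```

With these assumed numbers, a single request sits at roughly 1 FLOP per byte, far below the ridge, and each doubling of the batch doubles attainable performance until the compute roof is reached.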