Quantization for Inference

A served model's weights are read from memory on every token. The cheapest way to make inference smaller and faster is therefore not to change the architecture at all, but to store each weight in fewer bits. Mixed-precision training already moved us from fp32 to 16-bit; quantization for inference goes further, mapping the fp16 weights down to INT8 or INT4 integers.

The trick is a tiny linear map: pick a scale s and a zero-point z, round each weight to the nearest integer, and remember how to undo it. The whole craft is choosing s well — because a handful of large-magnitude outlier weights can wreck a naive choice.

Deriving the quantizer

Step 1 — fix the integer grid. With b bits an unsigned integer takes one of 2^b values, q \in \{0, 1, \dots, 2^b - 1\}. That is the entire budget: INT8 gives 256 levels, INT4 just 16. Our job is to spread those few levels across the range of weights we actually have.

Step 2 — map a real weight onto the grid. Given a weight w, the scale s sets the spacing between levels and the zero-point z says which integer represents w = 0. Quantizing is "divide by the spacing, shift, and round":

q = \operatorname{round}\!\left(\frac{w}{s} + z\right).

Step 3 — recover an approximate weight. To use the integer in a matmul we invert the map. Undo the shift, then undo the scaling — this dequantizes back to a near-copy \hat w of the original:

\hat w = s\,(q - z).

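It is only near: rounding in Step 2 throws away everything finer than one level, so for any weight inside the range the grid covers, \hat w differs from w by at most half a step, |\hat w - w| \le \tfrac{s}{2}. (A weight outside that range has to be clamped to the nearest end of the grid and can be off by more.) For a fixed range, fewer bits ⇒ larger s ⇒ coarser rounding.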
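
Steps 1 to 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not a production quantizer: the helper names (`choose_scale_zero_point`, `quantize`, `dequantize`) and the min/max rule for picking s and z are assumptions made for the example, and Python floats stand in for fp16.

```python
def choose_scale_zero_point(weights, bits):
    """Assumed min/max rule: spread the 2**bits levels over [min(w), max(w)]."""
    top = 2 ** bits - 1                      # highest integer on the unsigned grid
    lo = min(min(weights), 0.0)              # widen the range so that w = 0
    hi = max(max(weights), 0.0)              # always has an exact integer code
    scale = (hi - lo) / top if hi > lo else 1.0
    zero_point = round(-lo / scale)          # the integer that represents w = 0
    return scale, zero_point

def quantize(w, scale, zero_point, bits):
    q = round(w / scale + zero_point)        # divide by the spacing, shift, round
    return max(0, min(2 ** bits - 1, q))     # clamp onto the grid

def dequantize(q, scale, zero_point):
    return scale * (q - zero_point)          # undo the shift, then the scaling

# Hypothetical weights, chosen only for illustration.
weights = [-0.42, -0.07, 0.0, 0.013, 0.2, 0.31]
s, z = choose_scale_zero_point(weights, bits=8)
codes = [quantize(w, s, z, 8) for w in weights]
restored = [dequantize(q, s, z) for q in codes]
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(f"s = {s:.5f}, z = {z}, max error = {max_err:.5f}, s/2 = {s / 2:.5f}")
```

Every code lands in 0..255, the weight 0.0 survives the round trip exactly, and the largest error stays within s/2.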

Step 4 — count the memory. A weight stored in fp16 costs 16 bits; in INTb it costs b. So the memory ratio versus fp16 is simply b/16:

\text{INT8: } \tfrac{8}{16} = \tfrac{1}{2}, \qquad \text{INT4: } \tfrac{4}{16} = \tfrac{1}{4}.

INT8 halves the model; INT4 quarters it, apart from a small overhead for storing s and z alongside each channel or group of weights. Because decode is dominated by reading weights, fewer bytes per weight means fewer bytes to move: a direct speedup, not just a space saving.
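
The memory count is simple enough to check directly. The sketch below assumes a hypothetical 7-billion-weight model, and, for the overhead case, a hypothetical layout with one 16-bit scale and one 16-bit zero-point per group of 128 weights; none of these numbers come from a specific model or library.

```python
def weight_bytes(num_weights, bits, group_size=None, overhead_bits=32):
    """Bytes needed to store `num_weights` weights at `bits` bits each.

    If `group_size` is given, every group also stores `overhead_bits`
    of metadata (assumed here: a 16-bit scale plus a 16-bit zero-point).
    """
    total_bits = num_weights * bits
    if group_size is not None:
        groups = -(-num_weights // group_size)   # ceiling division
        total_bits += groups * overhead_bits
    return total_bits / 8

n = 7_000_000_000                                # hypothetical model size
fp16 = weight_bytes(n, 16)
int8 = weight_bytes(n, 8)
int4 = weight_bytes(n, 4)
int4_grouped = weight_bytes(n, 4, group_size=128)

for name, size in [("fp16", fp16), ("INT8", int8), ("INT4", int4),
                   ("INT4 + group overhead", int4_grouped)]:
    print(f"{name:>22}: {size / 1e9:6.2f} GB  ({size / fp16:.3f} of fp16)")
```

Under these assumptions the overhead adds 32/128 = 0.25 bits per weight, so grouped INT4 costs 4.25/16 of fp16 rather than exactly 1/4.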

Step 5 — the outlier catch. A naive scale stretches the grid to cover the single largest-magnitude weight. If most weights sit near 10^{-2} but a few outliers reach magnitude w_{\max} \approx 5, a grid covering [-w_{\max}, w_{\max}] has s = 2 w_{\max} / (2^b - 1) \approx w_{\max} / 2^{b-1}, which is set entirely by the outliers. For INT4 that gives s \approx 5/8 \approx 0.6, roughly sixty times a typical weight, so with only 16 levels nearly all the ordinary weights collapse onto the same one or two integers next to z, destroying accuracy.
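
The collapse is easy to reproduce. The sketch below quantizes a small cluster of weights to INT4 twice, once on its own and once with two outliers of magnitude 5 added; the weights and the min/max scale rule (`minmax_quantize`, a name made up for this example) are illustrative assumptions.

```python
def minmax_quantize(weights, bits):
    """Quantize with one global scale chosen from the min/max of the weights."""
    top = 2 ** bits - 1
    lo = min(min(weights), 0.0)
    hi = max(max(weights), 0.0)
    scale = (hi - lo) / top
    zero_point = round(-lo / scale)
    codes = [max(0, min(top, round(w / scale + zero_point))) for w in weights]
    return codes, scale, zero_point

# 21 ordinary weights in [-0.02, 0.02], plus two hypothetical outliers.
ordinary = [0.002 * k for k in range(-10, 11)]
with_outliers = ordinary + [-5.0, 5.0]

codes_clean, s_clean, _ = minmax_quantize(ordinary, bits=4)
codes_out, s_out, _ = minmax_quantize(with_outliers, bits=4)

distinct_clean = len(set(codes_clean))
distinct_out = len(set(codes_out[:len(ordinary)]))   # ordinary weights only

print(f"no outliers:   s = {s_clean:.4f}, distinct levels used = {distinct_clean}")
print(f"with outliers: s = {s_out:.4f}, distinct levels used = {distinct_out}")
```

Without the outliers the cluster spreads over almost the whole 16-level grid; with them, every ordinary weight lands on the single integer that represents zero.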

Step 6 — quantize carefully. The fix is not one global s but many: a separate scale per channel (per weight column), so an outlier-heavy column gets its own coarse grid while the rest stay fine. Methods such as GPTQ and AWQ go further, using a little calibration data to choose scales (and which weights to protect) so that the output of each layer — not just the raw weights — is preserved.
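
A per-channel scale can be sketched as follows. The toy matrix, the helper names, and the min/max rule are assumptions for illustration; this shows only the "one scale per column" idea and is not an implementation of GPTQ or AWQ, which additionally use calibration data.

```python
def minmax_params(values, bits):
    """Scale and zero-point from the min/max of `values` (assumed rule)."""
    top = 2 ** bits - 1
    lo = min(min(values), 0.0)
    hi = max(max(values), 0.0)
    scale = (hi - lo) / top if hi > lo else 1.0
    return scale, round(-lo / scale)

def roundtrip(w, scale, zero_point, bits):
    """Quantize then dequantize one weight."""
    q = max(0, min(2 ** bits - 1, round(w / scale + zero_point)))
    return scale * (q - zero_point)

def column_error(matrix, j, scale, zero_point, bits):
    """Mean absolute round-trip error over column j."""
    col = [row[j] for row in matrix]
    return sum(abs(w - roundtrip(w, scale, zero_point, bits)) for w in col) / len(col)

# Hypothetical 4x3 weight matrix; column 2 is the outlier-heavy channel.
W = [
    [ 0.010, -0.020,  5.0],
    [-0.015,  0.005, -4.0],
    [ 0.020,  0.012,  3.5],
    [-0.008, -0.018, -5.0],
]
bits = 4
n_cols = len(W[0])

global_params = minmax_params([w for row in W for w in row], bits)
column_params = [minmax_params([row[j] for row in W], bits) for j in range(n_cols)]

global_err = [column_error(W, j, *global_params, bits) for j in range(n_cols)]
per_channel_err = [column_error(W, j, *column_params[j], bits) for j in range(n_cols)]

for j in range(n_cols):
    print(f"column {j}: global-scale error {global_err[j]:.4f}, "
          f"per-channel error {per_channel_err[j]:.4f}")
```

The outlier column is no better off, since it still needs a coarse grid, but the two ordinary columns recover a fine grid of their own.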

Store weights as low-bit integers via an affine map:

Quantize / dequantize. q = \operatorname{round}(w/s + z) and \hat w = s\,(q - z), where s is the scale and z the zero-point; rounding error is at most s/2.
Memory. INTb uses 2^b levels and b/16 of fp16's bytes — INT8 is \tfrac12 the memory, INT4 is \tfrac14.
Outliers. A few large-magnitude weights blow up a global scale, so quantize carefully — per-channel scales plus calibration (GPTQ, AWQ) — to keep accuracy.

The quantizer above is weight-only: the stored weights are integers, but they are dequantized to 16-bit on the fly and the matmul runs in 16-bit. That alone wins because token-by-token decoding is typically memory-bandwidth-bound: each decoded token must read every weight once, so halving the bytes nearly halves the time spent moving them, even though the arithmetic is unchanged.
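
As a rough sketch of the weight-only pattern, the matrix-vector product below keeps the weights as integer codes and dequantizes each one just before it is multiplied. It assumes one scale and zero-point per row of a toy matrix, uses Python floats in place of 16-bit values, and the function names are made up for the example; real kernels fuse this into the GPU matmul.

```python
def quantize_row(row, bits):
    """Store one row as integer codes plus its own scale and zero-point."""
    top = 2 ** bits - 1
    lo = min(min(row), 0.0)
    hi = max(max(row), 0.0)
    scale = (hi - lo) / top if hi > lo else 1.0
    zero_point = round(-lo / scale)
    codes = [max(0, min(top, round(w / scale + zero_point))) for w in row]
    return codes, scale, zero_point

def weight_only_matvec(packed_rows, x):
    """y = W_hat @ x, dequantizing each stored code right before use."""
    y = []
    for codes, scale, zero_point in packed_rows:
        acc = 0.0
        for q, xi in zip(codes, x):
            acc += scale * (q - zero_point) * xi   # dequantize, then float multiply-add
        y.append(acc)
    return y

# Hypothetical weights and activations.
W = [[0.12, -0.30, 0.05],
     [-0.22, 0.08, 0.40]]
x = [1.0, 2.0, -1.0]

packed = [quantize_row(row, bits=8) for row in W]    # what would sit in memory
y_quant = weight_only_matvec(packed, x)
y_exact = [sum(w * xi for w, xi in zip(row, x)) for row in W]
print("quantized:", y_quant)
print("exact:    ", y_exact)
```

Only the integer codes and a few scales would need to be read from memory; the multiply-adds are the same floating-point operations as before.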

Activation quantization goes further and stores the activations in low precision too, so the matmul itself runs in integer arithmetic (e.g. INT8 × INT8). That can use faster integer tensor cores, but activations carry their own, larger outliers, so it is harder to do without losing accuracy. The common sweet spot for serving is therefore weight-only INT8/INT4: most of the bandwidth win, little of the risk.
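
To make the contrast concrete, here is a minimal sketch of an INT8 × INT8 dot product. For brevity it assumes symmetric quantization (zero-point fixed at 0 on a signed grid from -127 to 127), which is a special case of the affine map above; the function names and numbers are illustrative only.

```python
def quantize_symmetric(values, bits=8):
    """Signed codes in [-qmax, qmax] with zero-point 0 (assumed simplification)."""
    qmax = 2 ** (bits - 1) - 1                       # 127 for INT8
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return codes, scale

def int_dot(w_codes, s_w, x_codes, s_x):
    """Accumulate in integers, then rescale once at the end."""
    acc = sum(qw * qx for qw, qx in zip(w_codes, x_codes))   # integer-only arithmetic
    return s_w * s_x * acc                                    # single float rescale

# Hypothetical weights and activations.
w = [0.12, -0.30, 0.05, 0.21]
x = [1.5, -0.7, 2.0, 0.3]

w_q, s_w = quantize_symmetric(w)
x_q, s_x = quantize_symmetric(x)
approx = int_dot(w_q, s_w, x_q, s_x)
exact = sum(a * b for a, b in zip(w, x))
print(f"integer path: {approx:.4f}, float path: {exact:.4f}")
```

Because the activation scale s_x is also set by the largest activation, a single large activation outlier coarsens the grid for every other activation, which is the accuracy risk described above.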

Watch the grid coarsen

The dots are a spread of continuous fp16 weights; the ticks below are the 2^b integer levels they snap to, and each weight's vertical drop shows the rounding to its nearest level. Slide the bit-width from 8 down to 3: the levels thin out and the snapping gets coarser. Notice the lone outlier on the right — it stretches the grid, starving the cluster of weights near zero.