A served model's weights are read from memory on every token. The cheapest way to
make inference smaller and faster is therefore not to change the architecture at all, but to
store each weight in fewer bits. Mixed-precision
training already moved us from fp32 to 16-bit; quantization for
inference goes further, mapping the fp16 weights down to INT8 or
INT4 integers.
The trick is a tiny linear map: pick a scale s and
a zero-point z, round each weight to the nearest
integer, and remember how to undo it. The whole craft is choosing s
well — because a handful of large-magnitude outlier weights can wreck a naive
choice.
Deriving the quantizer
Step 1 — fix the integer grid. With b bits an
unsigned integer takes one of 2^b values,
q \in \{0, 1, \dots, 2^b - 1\}. That is the entire budget: INT8
gives 256 levels, INT4 just 16. Our job
is to spread those few levels across the range of weights we actually have.
Step 2 — map a real weight onto the grid. Given a weight
w, the scale s sets the spacing between
levels and the zero-point z says which integer represents
w = 0. Quantizing is "divide by the spacing, shift, and round":
q = \operatorname{round}\!\left(\frac{w}{s} + z\right).
Step 3 — recover an approximate weight. To use the integer in a matmul we
invert the map. Undo the shift, then undo the scaling — this dequantizes back
to a near-copy \hat w of the original:
\hat w = s\,(q - z).
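It is only near: rounding in Step 2 throws away everything finer than one level, so
for any weight inside the range the grid covers, \hat w differs from w by at most half a
step, |\hat w - w| \le \tfrac{s}{2}. (A weight outside that range is clipped to the nearest
end of the grid and can be off by more.) Fewer bits ⇒ larger
s ⇒ coarser rounding.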
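Steps 2 and 3 can be sketched in a few lines of NumPy. This is a minimal illustration, not a production quantizer: the helper names (choose_params, quantize, dequantize) are hypothetical, and the min/max rule used to pick s and z is one simple assumption among several possible choices. The clip to {0, ..., 2^b - 1} is added so the result always fits the grid.

```python
import numpy as np

def choose_params(w, bits):
    """Min/max choice of (s, z) for an unsigned `bits`-bit grid (an assumed, simple rule)."""
    qmax = 2**bits - 1
    w_min = min(float(w.min()), 0.0)   # include 0 so that w = 0 lands exactly on a level
    w_max = max(float(w.max()), 0.0)
    scale = (w_max - w_min) / qmax or 1.0   # fall back to 1.0 for an all-zero tensor
    zero_point = int(round(-w_min / scale))
    return scale, zero_point

def quantize(w, scale, zero_point, bits):
    """q = round(w / s + z), clipped to the grid {0, ..., 2^b - 1} (bits <= 8 here)."""
    q = np.round(w / scale + zero_point)
    return np.clip(q, 0, 2**bits - 1).astype(np.uint8)

def dequantize(q, scale, zero_point):
    """w_hat = s * (q - z); cast first so the subtraction is not done in uint8."""
    return scale * (q.astype(np.float64) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1000)      # hypothetical weights
s, z = choose_params(w, bits=8)
q = quantize(w, s, z, bits=8)
w_hat = dequantize(q, s, z)
max_err = float(np.abs(w_hat - w).max())  # expected to stay within s / 2
print(f"s = {s:.6f}, z = {z}, max error = {max_err:.6f}, s/2 = {s / 2:.6f}")
```

Because the scale here is chosen from the tensor's own minimum and maximum, every weight falls inside the grid and the half-step bound applies to all of them.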
Step 4 — count the memory. A weight stored in fp16 costs
16 bits; in INTb it costs
b. So the memory ratio versus fp16 is simply
b/16:
\text{INT8: } \tfrac{8}{16} = \tfrac{1}{2}, \qquad \text{INT4: } \tfrac{4}{16} = \tfrac{1}{4}.
INT8 halves the model; INT4 quarters it, apart from a small overhead for storing
s and z per group. Because decode is dominated by reading weights, fewer bytes per
weight means fewer bytes to move: a direct speedup, not just a space saving.
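The b/16 ratio, and the size of the per-group overhead, can be checked with a little arithmetic. The sketch below is illustrative only: the function name weight_bytes, the 7-billion parameter count, the group size of 128, and the choice to store s and z in 16 bits each are all assumptions, not values from the text.

```python
def weight_bytes(n_params, bits, group_size=None, param_bits=16):
    """Bytes for n_params weights at `bits` each, plus optional per-group (s, z) storage."""
    total_bits = n_params * bits
    if group_size is not None:
        n_groups = -(-n_params // group_size)      # ceiling division
        total_bits += n_groups * 2 * param_bits    # one scale and one zero-point per group
    return total_bits / 8

n = 7_000_000_000                                  # hypothetical model size
fp16 = weight_bytes(n, 16)
for bits in (8, 4):
    plain = weight_bytes(n, bits)
    grouped = weight_bytes(n, bits, group_size=128)
    print(f"INT{bits}: ratio {plain / fp16:.3f}, with per-group overhead {grouped / fp16:.3f}")
```

Under these assumptions the overhead adds 32 bits per 128 weights, or 0.25 bits per weight, which is why it is usually treated as a small correction to b/16.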
Step 5 — the outlier catch. A naive scale stretches the grid to cover the
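single largest-magnitude weight. If most weights sit near
10^{-2} but a few outliers hit
|w| \approx w_{\max} \approx 5, the grid must span roughly [-w_{\max}, w_{\max}], so
s \approx 2 w_{\max} / 2^{b} = w_{\max} / 2^{b-1} is set by the outliers. For INT4 that is
s \approx 5/8 \approx 0.6: the ordinary weights are far less than half a step from zero, so
nearly all of them collapse onto the same one or two integers, destroying accuracy.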
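The collapse is easy to reproduce numerically. The sketch below builds a hypothetical weight vector (1000 values drawn near 10^{-2} plus two outliers at ±5) and applies a naive global INT4 scale chosen from the minimum and maximum. It uses the exact denominator 2^b - 1, so s comes out near 0.67 rather than the rounded estimate w_max / 2^{b-1}.

```python
import numpy as np

rng = np.random.default_rng(0)
ordinary = rng.normal(0.0, 1e-2, size=1000)    # typical weights, magnitude around 1e-2
outliers = np.array([5.0, -5.0])               # hypothetical outliers with |w| = 5
w = np.concatenate([ordinary, outliers])

bits = 4
qmax = 2**bits - 1
scale = (w.max() - w.min()) / qmax             # naive global scale, set by the outliers
zero_point = int(np.round(-w.min() / scale))
q = np.clip(np.round(w / scale + zero_point), 0, qmax).astype(np.int64)

levels_used = np.unique(q[:ordinary.size])     # integers hit by the ordinary weights
w_hat = scale * (q - zero_point)
rel_err = np.abs(w_hat[:ordinary.size] - ordinary).sum() / np.abs(ordinary).sum()
print(f"scale = {scale:.3f}, levels used by ordinary weights = {levels_used}")
print(f"relative error on ordinary weights = {rel_err:.2f}")
```

With a step this coarse, the ordinary weights all dequantize to the same value, so essentially all of the information they carried is lost.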
Step 6 — quantize carefully. The fix is not one global
s but many: a separate scale per channel (per
weight column), so an outlier-heavy column gets its own coarse grid while the rest stay fine.
Methods such as GPTQ and AWQ go further, using a little
calibration data to choose scales (and which weights to protect) so that the
output of each layer — not just the raw weights — is preserved.
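The benefit of per-channel scales can be seen by comparing them with one global scale on a small matrix. This sketch covers only the per-channel idea. It does not implement GPTQ or AWQ, which additionally use calibration data. The helper quant_dequant, the matrix shape, and the factor used to make one column outlier-heavy are all assumptions for illustration.

```python
import numpy as np

def quant_dequant(w, bits, axis=None):
    """Round-trip w through a min/max affine quantizer.

    axis=None uses one (s, z) for the whole matrix; axis=0 uses one per column.
    """
    qmax = 2**bits - 1
    w_min = np.minimum(w.min(axis=axis, keepdims=True), 0.0)
    w_max = np.maximum(w.max(axis=axis, keepdims=True), 0.0)
    scale = (w_max - w_min) / qmax
    scale = np.where(scale == 0.0, 1.0, scale)   # guard against all-zero columns
    zero_point = np.round(-w_min / scale)
    q = np.clip(np.round(w / scale + zero_point), 0, qmax)
    return scale * (q - zero_point)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 1e-2, size=(256, 8))
W[:, 3] *= 500.0                                 # hypothetical outlier-heavy column

ordinary_cols = np.arange(W.shape[1]) != 3
err_global = np.abs(quant_dequant(W, 4) - W)[:, ordinary_cols].mean()
err_channel = np.abs(quant_dequant(W, 4, axis=0) - W)[:, ordinary_cols].mean()
print(f"mean error on ordinary columns: global {err_global:.5f}, per-channel {err_channel:.5f}")
```

With a global scale, the outlier column dictates the step for everyone. With per-column scales, only that column pays for its own range.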
Store weights as low-bit integers via an affine map:
- Quantize / dequantize. q = \operatorname{round}(w/s + z) and \hat w = s\,(q - z), where s is the scale and z the zero-point; rounding error is at most s/2.
- Memory. INTb uses 2^b levels and b/16 of fp16's bytes: INT8 is \tfrac12 the memory, INT4 is \tfrac14.
- Outliers. A few large-magnitude weights blow up a global scale, so quantize carefully, with per-channel scales plus calibration (GPTQ, AWQ), to keep accuracy.
The quantizer above is weight-only: the stored weights are integers, but
they are dequantized to 16-bit on the fly and the matmul runs in 16-bit. That alone wins
because decoding, especially at small batch sizes, is typically memory-bandwidth-bound: each
decoded token must read every weight once, so halving the bytes nearly halves the time spent
moving them, even though the arithmetic is unchanged.
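A weight-only forward pass can be sketched as follows: the layer keeps only uint8 integers plus per-column s and z, and rebuilds 16-bit weights just before the matmul. The shapes, the random data, and the per-column min/max scale rule are assumptions for illustration. A real kernel would fuse the dequantization into the matmul rather than materialize W_hat.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.02, size=(64, 32)).astype(np.float16)   # hypothetical fp16 layer weights
x = rng.normal(0.0, 1.0, size=(1, 64)).astype(np.float16)     # one token's activations

# Offline: quantize each column to INT8 with its own (s, z).
qmax = 255
Wf = W.astype(np.float32)
w_min = np.minimum(Wf.min(axis=0, keepdims=True), 0.0)
w_max = np.maximum(Wf.max(axis=0, keepdims=True), 0.0)
scale = (w_max - w_min) / qmax
zero_point = np.round(-w_min / scale)
W_q = np.clip(np.round(Wf / scale + zero_point), 0, qmax).astype(np.uint8)  # what is stored

# At inference: dequantize to 16-bit on the fly, then run the matmul in 16-bit.
W_hat = (scale * (W_q.astype(np.float32) - zero_point)).astype(np.float16)
y_q = x @ W_hat
y_ref = x @ W

max_diff = float(np.abs(y_q.astype(np.float32) - y_ref.astype(np.float32)).max())
print(f"stored bytes: {W_q.nbytes} vs fp16 {W.nbytes}, max output difference: {max_diff:.4f}")
```

The arithmetic is still a 16-bit matmul. What changes is that half as many weight bytes have to be read from memory.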
Activation quantization goes further and stores the activations in low
precision too, so the matmul itself runs in integer arithmetic (e.g. INT8 × INT8). That can
use faster integer tensor cores, but activations carry their own, larger outliers, so it is
harder to do without losing accuracy. The common sweet spot for serving is therefore
weight-only INT8/INT4: most of the bandwidth win, little of the risk.
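For contrast, here is a minimal sketch of an INT8 × INT8 matmul in which both activations and weights are quantized. To keep the integer product simple it swaps in symmetric signed quantization (zero-point fixed at 0, one scale per tensor) instead of the unsigned affine map used above. That choice, the helper name symmetric_int8, and the data are all assumptions. NumPy has no integer tensor cores, so this only illustrates the arithmetic, not the speed.

```python
import numpy as np

def symmetric_int8(t):
    """Symmetric signed INT8 quantization (z = 0): returns int8 values and the scale."""
    scale = float(np.abs(t).max()) / 127.0 or 1.0   # fall back to 1.0 for an all-zero tensor
    q = np.clip(np.round(t / scale), -127, 127).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=(4, 64)).astype(np.float32)    # hypothetical activations
W = rng.normal(0.0, 0.02, size=(64, 32)).astype(np.float32)  # hypothetical weights

x_q, s_x = symmetric_int8(x)
W_q, s_w = symmetric_int8(W)

# Integer matmul with int32 accumulation, then a single rescale back to real units.
acc = x_q.astype(np.int32) @ W_q.astype(np.int32)
y_int8 = (s_x * s_w) * acc
y_ref = x @ W

max_diff = float(np.abs(y_int8 - y_ref).max())
print(f"max difference from the float matmul: {max_diff:.4f}")
```

A single activation outlier would inflate s_x for the whole tensor here, which is the accuracy risk the paragraph above describes.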