Let the base have
Step 1 — quantize the frozen base to 4 bits. Store each base weight in a
4-bit format (NF4) instead of 16-bit. Bytes per parameter fall from
The base is frozen, so its weights are read-only during fine-tuning — we never need their gradients, which is exactly what makes such aggressive quantization safe here.
Step 2 — put LoRA adapters on top, in higher precision. Attach the
low-rank factors
Step 3 — gradients flow only into the adapters. Backprop dequantizes a
base weight on the fly to compute its contribution, but no gradient is stored for it; the
optimizer state (momentum, variance) exists only for the small
Step 4 — do the memory math. Compare the three regimes for a base of
For
Two refinements make the 4-bit base work without hurting accuracy. NF4 (4-bit NormalFloat) is a quantization grid whose 16 levels are placed to be information-theoretically optimal for normally distributed weights — and neural-network weights are close to Gaussian. Instead of evenly spaced bins, NF4 packs its levels where the weights actually cluster, so the quantization error is far smaller than a naive 4-bit integer grid.
Double quantization goes one step further: quantization itself needs a
small scale (a “quantization constant”) per block of weights, and those constants add up.
QLoRA quantizes the constants too, shaving roughly another half-bit per parameter
off the footprint. The savings are modest individually but meaningful at 65B scale. The
same precision-versus-memory trade-off reappears at deployment time as
Three regimes for the same 65B model, in gigabytes. Full fine-tuning (weights, gradients, and optimizer state in fp16) is off the top of any single card. LoRA trims the trainable cost but must still hold the fp16 base. QLoRA's 4-bit base drops the bar under the dashed 48 GB line — the line a single GPU can hold.