QLoRA: Fine-Tuning on One GPU

LoRA made the trainable parameters tiny. But the frozen base still has to sit in GPU memory while you fine-tune, and a 65-billion-parameter model in 16-bit precision is about 130 GB before you have stored a single gradient. That alone overflows a single 48 GB or even 80 GB GPU. QLoRA removes this last barrier by shrinking the frozen base itself: quantize it to 4-bit, then train LoRA adapters on top. The result is the headline that made it famous: fine-tuning a 65B model on a single 48 GB GPU.

Shrink the base, adapt on top, line by line

Let the base have N parameters. The memory to hold them is N times the bytes-per-parameter, so precision is a direct lever on memory.

Step 1 — quantize the frozen base to 4 bits. Store each base weight in a 4-bit format (NF4) instead of 16-bit. Bytes per parameter fall from 2 to 0.5 — a factor of four:

\text{mem}_{\text{base}} = N \times \tfrac{4}{8}\,\text{B} = \tfrac{N}{2}\,\text{B}, \qquad \frac{0.5}{2} = \frac{1}{4}.

The base is frozen, so its weights are read-only during fine-tuning — we never need their gradients, which is exactly what makes such aggressive quantization safe here.

Step 2 — put LoRA adapters on top, in higher precision. Attach the low-rank factors A, B from LoRA and keep them in 16-bit precision (fp16 or bf16). They are the only trainable tensors:

h = W_{\text{4bit}}\,x + B(A x), \qquad \text{train only } A, B.
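The forward pass above can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not how a real library implements it: the base is quantized row by row with a simple signed-integer absmax grid standing in for NF4, the sizes d = 8 and r = 2 are arbitrary, and B starts at zero (the usual LoRA initialization, which the text does not spell out). The helper names are hypothetical.

```python
import random

def quantize_block(weights, levels=7):
    """Absmax-quantize one block of floats to integer codes in [-levels, levels].

    A plain integer grid is used here as a stand-in for NF4 (an assumption
    made to keep the sketch short). Returns the codes and the block scale.
    """
    scale = max(abs(w) for w in weights) or 1.0
    codes = [round(w / scale * levels) for w in weights]
    return codes, scale

def dequantize_block(codes, scale, levels=7):
    """Recover approximate float weights from integer codes and the block scale."""
    return [c * scale / levels for c in codes]

def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def qlora_forward(q_rows, A, B, x):
    """Compute h = W_4bit x + B (A x), dequantizing each base row on the fly."""
    base = [sum(w * v for w, v in zip(dequantize_block(codes, scale), x))
            for codes, scale in q_rows]
    low_rank = matvec(B, matvec(A, x))
    return [b + l for b, l in zip(base, low_rank)]

random.seed(0)
d, r = 8, 2                                   # toy sizes, chosen for illustration
W = [[random.gauss(0.0, 0.02) for _ in range(d)] for _ in range(d)]
q_rows = [quantize_block(row) for row in W]   # frozen base, stored as 4-bit codes
A = [[random.gauss(0.0, 0.01) for _ in range(d)] for _ in range(r)]  # r x d, trainable
B = [[0.0] * r for _ in range(d)]             # d x r, trainable, zero at start
x = [random.gauss(0.0, 1.0) for _ in range(d)]

h = qlora_forward(q_rows, A, B, x)
print(h)
```

Because B is zero at the start, the adapter term vanishes and h equals the output of the quantized base alone; training then moves only A and B.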

Step 3 — gradients flow only into the adapters. The forward and backward passes dequantize base weights on the fly to compute their contribution, but no gradient is stored for them; the optimizer state (momentum, variance) exists only for the small A, B. So the expensive gradients and optimizer states scale with the adapter size, which is 2dr parameters for each adapted d × d weight matrix at rank r, not with N.
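To make the 2dr scaling concrete, here is a small arithmetic sketch. It assumes a single square d × d weight matrix adapted at rank r, with A of shape r × d and B of shape d × r, and it reuses the per-parameter costs from the memory math in this section (2 bytes of gradient and 8 bytes of optimizer state per trainable parameter). The sizes d = 8192 and r = 16 are illustrative choices, not values from the text.

```python
def adapter_params(d, r):
    """Parameters in one LoRA pair: A is r x d and B is d x r."""
    return 2 * d * r

def training_state_bytes(num_trainable, grad_bytes=2, optim_bytes=8):
    """Bytes for gradients plus optimizer state (momentum and variance)."""
    return num_trainable * (grad_bytes + optim_bytes)

d, r = 8192, 16          # illustrative sizes, not taken from the text
full = d * d             # trainable parameters if the whole matrix were trained
lora = adapter_params(d, r)

print(f"full matrix: {full:,} params, {training_state_bytes(full) / 1e6:.1f} MB of training state")
print(f"adapters:    {lora:,} params, {training_state_bytes(lora) / 1e6:.1f} MB of training state")
print(f"trainable fraction: {lora / full:.2%}")
```

The trainable fraction works out to 2r/d, so it shrinks as the layer gets wider for a fixed rank.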

Step 4 — do the memory math. Compare the three regimes for a base of N parameters (ignoring activations):

\begin{aligned} \text{Full FT (fp16)} &\approx 2N \ \text{(weights)} + 2N\ \text{(grads)} + 8N\ \text{(optim)} \approx 12N\,\text{B},\\ \text{LoRA (fp16 base)} &\approx 2N \ \text{(frozen weights)} + \text{tiny adapters},\\ \text{QLoRA (4-bit base)} &\approx \tfrac{N}{2} \ \text{(frozen weights)} + \text{tiny adapters}. \end{aligned}

For N = 65B: full fine-tuning wants on the order of 780 GB; LoRA still needs about 130 GB just to hold the fp16 base; QLoRA needs only about 33 GB for the 4-bit base plus a sliver for adapters, which fits a single 48 GB card with room for activations.
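The same arithmetic can be checked with a short script. This is a rough sketch that uses only the byte counts given above (12, 2, and 0.5 bytes per parameter), ignores activations and the adapters themselves, and uses decimal gigabytes; the function name and the 48 GB budget variable are illustrative.

```python
GB = 1e9  # decimal gigabytes, matching the round numbers in the text

def memory_gb(n_params):
    """Approximate memory per regime, ignoring activations and adapters."""
    return {
        "full_ft": 12 * n_params / GB,   # 2N weights + 2N grads + 8N optimizer
        "lora": 2 * n_params / GB,       # frozen fp16 base
        "qlora": 0.5 * n_params / GB,    # frozen 4-bit base
    }

budget_gb = 48
for name, gb in memory_gb(65_000_000_000).items():
    verdict = "fits" if gb < budget_gb else "does not fit"
    print(f"{name:8s} {gb:7.1f} GB  {verdict} in {budget_gb} GB")
```

Changing the argument to `memory_gb` gives the same comparison for other model sizes.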

In short, QLoRA fine-tunes a large model on one GPU by quantizing the frozen base to 4 bits and training only small, higher-precision LoRA adapters on top.

Two refinements make the 4-bit base work without hurting accuracy. NF4 (4-bit NormalFloat) is a quantization grid whose 16 levels are placed to be information-theoretically optimal for normally distributed weights, and pretrained neural-network weights are close to a zero-centered Gaussian. Instead of evenly spaced bins, NF4 packs its levels where the weights actually cluster, so the quantization error is lower than with a naive, evenly spaced 4-bit integer grid.
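The idea of placing levels where the weights cluster can be tried out with a simplified stand-in for NF4. The sketch below is an assumption-laden illustration, not the actual NF4 table: it puts 16 levels at evenly spaced quantiles of a standard normal and rescales them to [-1, 1], whereas the real NF4 grid is built slightly differently so that it contains an exact zero. The block size of 64, the weight standard deviation of 0.02, and all function names are illustrative choices.

```python
import random
from bisect import bisect_left
from statistics import NormalDist

def quantile_grid(num_levels=16):
    """Levels at evenly spaced quantiles of N(0, 1), rescaled to [-1, 1]."""
    normal = NormalDist()
    raw = [normal.inv_cdf((i + 0.5) / num_levels) for i in range(num_levels)]
    top = max(abs(v) for v in raw)
    return [v / top for v in raw]

def uniform_grid(num_levels=16):
    """Evenly spaced levels on [-1, 1], like a naive 4-bit grid."""
    return [-1.0 + 2.0 * i / (num_levels - 1) for i in range(num_levels)]

def nearest(levels, value):
    """Return the level closest to value (levels must be sorted ascending)."""
    i = bisect_left(levels, value)
    candidates = levels[max(i - 1, 0):i + 1]
    return min(candidates, key=lambda lv: abs(lv - value))

def mean_abs_error(levels, weights, block_size=64):
    """Absmax-normalize each block to [-1, 1], snap to the grid, average |error|."""
    total = 0.0
    for start in range(0, len(weights), block_size):
        block = weights[start:start + block_size]
        scale = max(abs(w) for w in block)
        for w in block:
            total += abs(nearest(levels, w / scale) * scale - w)
    return total / len(weights)

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(64 * 200)]  # synthetic Gaussian weights
nf_err = mean_abs_error(quantile_grid(), weights)
uniform_err = mean_abs_error(uniform_grid(), weights)
print(f"quantile grid: {nf_err:.6f}   uniform grid: {uniform_err:.6f}")
```

On synthetic Gaussian weights the quantile-based grid spends more of its 16 levels near zero, where most normalized weights fall, which is the intuition behind NF4.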

Double quantization goes one step further: quantization itself needs a small scale (a “quantization constant”) per block of weights, and those constants add up. QLoRA quantizes the constants too, shaving roughly 0.37 bits per parameter off the footprint (the constants' overhead falls from about 0.5 to about 0.127 bits per parameter). That is modest per parameter but comes to about 3 GB at 65B scale. The same precision-versus-memory trade-off reappears at deployment time as quantization for inference.
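The bookkeeping behind double quantization is simple enough to work through. The sketch below assumes the block sizes reported for QLoRA, which the text above does not state: one 32-bit constant per block of 64 weights, then those constants stored in 8 bits with one 32-bit second-level constant per group of 256 of them. The function names are illustrative.

```python
def constant_overhead_bits(block=64, const_bits=32):
    """Extra bits per weight from one full-precision constant per block."""
    return const_bits / block

def double_quant_overhead_bits(block=64, q_const_bits=8, outer_block=256, outer_bits=32):
    """Extra bits per weight when the constants are themselves quantized."""
    return q_const_bits / block + outer_bits / (block * outer_block)

plain = constant_overhead_bits()
double = double_quant_overhead_bits()
saved = plain - double                       # bits saved per parameter
saved_gb = saved * 65_000_000_000 / 8 / 1e9  # total saving for a 65B model

print(f"overhead without double quantization: {plain:.3f} bits/param")
print(f"overhead with double quantization:    {double:.3f} bits/param")
print(f"saved: {saved:.3f} bits/param, about {saved_gb:.1f} GB at 65B")
```

Under these assumptions the saving is a little under half a bit per parameter, which only becomes noticeable when multiplied by tens of billions of parameters.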

The memory, in bars

Three regimes for the same 65B model, in gigabytes. Full fine-tuning (fp16 weights and gradients plus optimizer state, about 12 bytes per parameter in total) is off the top of any single card. LoRA trims the trainable cost but must still hold the fp16 base. QLoRA's 4-bit base drops the bar under the dashed 48 GB line, the line a single GPU can hold.