LoRA: Low-Rank Adaptation

PEFT says: freeze the base, train a tiny module \phi. But what should \phi actually be? LoRA — Low-Rank Adaptation — gives the dominant answer, and it is beautifully direct. The thing a weight matrix needs during fine-tuning is an update, \Delta W. LoRA's claim is that this update is intrinsically low-rank, so we can store it as a skinny product instead of a full matrix.

The low-rank update, line by line

Take one frozen weight matrix of the pretrained model, W \in \mathbb{R}^{d \times d}. Fine-tuning would replace it with W + \Delta W. The whole idea is to never form \Delta W in full.

Step 1 — freeze W, seek only the update. Hold W fixed and learn the correction \Delta W beside it:

W \ \text{frozen}, \qquad W_{\text{eff}} = W + \Delta W.

Step 2 — factor the update as low rank. Instead of a full d \times d matrix, write \Delta W as a product of two thin matrices through a bottleneck of width r:

\Delta W = B\,A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times d}, \qquad r \ll d.

Any product BA has rank at most r: every column of BA is a linear combination of the r columns of B, so its column space has dimension \le r. The bottleneck is the rank constraint.
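The rank bound can be checked numerically. The sketch below is only an illustration: the sizes d = 64 and r = 4, the random Gaussian factors, and the fixed seed are assumptions chosen for the demo, not values from the text. It forms BA in full purely to inspect its rank.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, chosen only for reproducibility
d, r = 64, 4                    # illustrative sizes (assumed, not from the text)

B = rng.standard_normal((d, r))  # tall factor, d x r
A = rng.standard_normal((r, d))  # wide factor, r x d
delta_W = B @ A                  # formed in full here only to inspect its rank

rank = np.linalg.matrix_rank(delta_W)
print(delta_W.shape, rank)       # a d x d matrix whose rank cannot exceed r
```

However large d is made, the reported rank stays capped by the bottleneck width r.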

Step 3 — initialise so training starts as the pretrained model. Set B = 0 (and A small random), so that at the first step \Delta W = BA = 0 and the model is exactly the pretrained one — fine-tuning departs smoothly from a known-good point:

B^{(0)} = 0 \quad\Longrightarrow\quad \Delta W^{(0)} = 0 \quad\Longrightarrow\quad W_{\text{eff}}^{(0)} = W.

Step 4 — the layer's forward pass. For an input x, never materialise BA; apply the two factors in sequence and add to the frozen path:

h = W x + \Delta W\,x = W x + B(A x).

Only A and B carry gradients; W sits frozen.
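Steps 3 and 4 can be sketched together in NumPy. This is a minimal illustration under stated assumptions: the random stand-in for the pretrained W, the 0.01 scale on A, the sizes, and the helper name `lora_forward` are all hypothetical. NumPy has no autograd, so the sketch shows only the forward pass; in a real training setup A and B would be the only parameters handed to the optimiser.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 16, 2                             # illustrative sizes (assumed)

W = rng.standard_normal((d, d))          # frozen weight (random stand-in for a pretrained matrix)
A = 0.01 * rng.standard_normal((r, d))   # small random init; the 0.01 scale is an assumption
B = np.zeros((d, r))                     # zero init, so BA = 0 before any training


def lora_forward(x, W, A, B):
    """Return Wx + B(Ax) without ever forming the d x d product BA."""
    return W @ x + B @ (A @ x)


x = rng.standard_normal(d)
h = lora_forward(x, W, A, B)

B_moved = rng.standard_normal((d, r))    # stand-in for B after some training
h_moved = lora_forward(x, W, A, B_moved)

# With B = 0 the layer reproduces the frozen layer.
print(np.allclose(h, W @ x))
# With B nonzero the result agrees with using W + BA (materialised here only to check).
print(np.allclose(h_moved, (W + B_moved @ A) @ x))
```

Applying A first and then B costs two thin matrix-vector products, roughly 2dr multiply-adds, rather than the d^2 a dense \Delta W would need.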

Step 5 — count the trainable parameters. A full update has d^2 entries. The factors together have

|A| + |B| = r\,d + d\,r = 2dr \quad\text{trainable parameters}.

With d = 4096 and r = 8: 2dr = 2\cdot 4096 \cdot 8 = 65{,}536 against d^2 = 16{,}777{,}216, about 0.4\%. The fraction is 2dr/d^2 = 2r/d, which shrinks as the model widens.
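The count is easy to reproduce. The helper below, `lora_param_counts`, is a hypothetical name introduced for illustration; it covers only the square d \times d case used in this section.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full-update entries, LoRA entries, LoRA / full) for a d x d weight."""
    full = d * d
    lora = r * d + d * r  # |A| + |B| = 2dr
    return full, lora, lora / full


full, lora, fraction = lora_param_counts(4096, 8)
print(full, lora, f"{fraction:.2%}")  # 16777216 65536 0.39%
```

Doubling d at fixed r halves the fraction, which is the sense in which the saving improves as the model widens.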

Step 6 — merge at inference, for zero extra latency. Once trained, you can fold the update back into the weights once and for all:

W' = W + BA, \qquad h = W' x.

The deployed layer is an ordinary matrix again, with the same shape and the same cost as the original. Once merged, the LoRA update adds nothing to inference cost, unlike adapters that add layers to the forward pass.
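A quick numerical check of the merge, as a sketch: the random matrices stand in for a pretrained W and for trained factors, and the sizes are assumptions. It confirms that the merged layer gives the same output as the two-path forward pass while keeping the original shape.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4                      # illustrative sizes (assumed)

W = rng.standard_normal((d, d))   # stand-in for the frozen pretrained weight
B = rng.standard_normal((d, r))   # stand-ins for the trained factors
A = rng.standard_normal((r, d))
x = rng.standard_normal(d)

h_adapter = W @ x + B @ (A @ x)   # training-time forward: frozen path plus low-rank path
W_merged = W + B @ A              # fold the update into the weight once
h_merged = W_merged @ x           # inference-time forward: a single ordinary matmul

print(W_merged.shape == W.shape, np.allclose(h_adapter, h_merged))
```

The two outputs agree up to floating-point rounding, since merging only reorders the same arithmetic.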

LoRA fine-tunes a frozen weight matrix W \in \mathbb{R}^{d\times d} through a low-rank update:

Low-rank update. \Delta W = BA with B \in \mathbb{R}^{d\times r}, A \in \mathbb{R}^{r\times d}, and rank r \ll d; the layer computes Wx + B(Ax).
Parameter count. Trainable parameters drop from d^2 to 2dr, a fraction 2r/d (e.g. \sim 0.4\% at d=4096, r=8).
Zero-init. B = 0 so \Delta W = 0 at step zero — training begins as the exact pretrained model.
Mergeable. At inference W' = W + BA folds in, leaving an ordinary matrix — no added latency.

The rank r is a capacity dial. A larger r lets \Delta W express a richer update (more trainable parameters, 2dr growing linearly in r); a smaller r is cheaper but more constrained. In practice, ranks as small as r = 4 to 16 are often reported to match or come close to full fine-tuning on many tasks, a striking amount of mileage from a thin bottleneck.

Why should so few directions suffice? Recall from the singular value decomposition that any matrix is a sum of rank-one pieces ordered by singular value, and that a matrix with a few dominant singular values is well approximated by its top-r truncation. Empirically the fine-tuning update \Delta W behaves this way: it lives on a low-dimensional subspace — only a handful of directions in weight space actually need to move to adapt a pretrained model. LoRA simply builds that low-rank assumption into the parameterisation from the start.
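The SVD argument can be illustrated on synthetic data. The matrix below is not a real fine-tuning update: it is an assumed toy example, built as an exactly rank-r matrix plus small dense noise, so that it has a few dominant singular values by construction. The sketch truncates it to its top r singular directions and reads the truncation back as a pair of thin factors.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4                                  # illustrative sizes (assumed)

# Toy "update": rank-r signal plus small noise (an assumption for the demo).
M = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
M += 0.01 * rng.standard_normal((d, d))

U, s, Vt = np.linalg.svd(M, full_matrices=False)

B = U[:, :r] * s[:r]                          # d x r: top-r left vectors scaled by singular values
A = Vt[:r, :]                                 # r x d: top-r right vectors
M_r = B @ A                                   # top-r truncation, written as a thin product

rel_err = np.linalg.norm(M - M_r) / np.linalg.norm(M)   # relative Frobenius error
tail = np.sqrt(np.sum(s[r:] ** 2))                      # energy in the discarded singular values
print(f"relative error: {rel_err:.4f}")
print(np.isclose(np.linalg.norm(M - M_r), tail))
```

The truncation error equals the energy left in the discarded singular values, so when those are small a d \times r times r \times d product captures nearly all of the matrix. LoRA learns such a pair of factors directly rather than computing them from a full update.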

A fat update, factored thin

A full update is a dense d \times d block, with all d^2 entries trainable. LoRA replaces it with a tall d \times r matrix times a wide r \times d matrix, meeting at a bottleneck of width r. In the accompanying plot (d = 1024, parameters in thousands), the LoRA cost 2dr rises linearly from near zero as the rank grows, while the full-update cost d^2 stays flat far above it.
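The same comparison can be tabulated without a plot. The sketch below uses d = 1024 and reports costs in thousands of parameters, as above; the particular ranks swept are an assumption chosen for illustration.

```python
d = 1024
full_k = d * d / 1000                      # full-update cost, in thousands of parameters

ranks = (1, 2, 4, 8, 16, 32, 64)           # illustrative sweep (assumed)
rows = [(r, 2 * d * r / 1000) for r in ranks]

for r, lora_k in rows:
    print(f"r={r:>2}  LoRA={lora_k:8.1f}k  full={full_k:8.1f}k  fraction={lora_k / full_k:.4f}")
```

The LoRA column doubles each time r doubles while the full-update column never moves; the two costs would only meet at r = d/2.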