The low-rank update, line by line
Take one frozen weight matrix of the pretrained model,
W \in \mathbb{R}^{d \times d}. Fine-tuning would replace it with
W + \Delta W. The whole idea is to never form
\Delta W in full.
Step 1 — freeze W, seek only the update. Hold
W fixed and learn the correction
\Delta W beside it:
W \ \text{frozen}, \qquad W_{\text{eff}} = W + \Delta W.
Step 2 — factor the update as low rank. Instead of a full
d \times d matrix, write
\Delta W as a product of two thin matrices through a bottleneck
of width r:
\Delta W = B\,A, \qquad B \in \mathbb{R}^{d \times r}, \quad A \in \mathbb{R}^{r \times d}, \qquad r \ll d.
Any product BA has rank at most r: multiplying a d\times r matrix by an
r\times d matrix can only produce a matrix of rank \le r.
The bottleneck is the rank constraint.
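The rank bound is easy to check numerically. The sketch below uses NumPy with small illustrative sizes (d = 64, r = 4, chosen here only so the check runs instantly) and random factors; it is a sanity check under those assumptions, not a proof.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # illustrative sizes, not taken from any real model

B = rng.standard_normal((d, r))  # tall, thin factor: d x r
A = rng.standard_normal((r, d))  # short, wide factor: r x d
delta_W = B @ A                  # a full d x d matrix built from only 2*d*r numbers

# Numerical rank of the product; it cannot exceed the bottleneck width r.
rank = int(np.linalg.matrix_rank(delta_W))
print(delta_W.shape, rank)
```

Although delta_W has d^2 entries, its numerical rank never exceeds r, whatever values B and A hold.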
Step 3 — initialise so training starts as the pretrained model. Set
B = 0 and initialise A with small random values, so that at
step zero \Delta W = BA = 0 and the model is exactly the
pretrained one; fine-tuning then departs smoothly from a known-good point:
B^{(0)} = 0 \quad\Longrightarrow\quad \Delta W^{(0)} = 0 \quad\Longrightarrow\quad W_{\text{eff}}^{(0)} = W.
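A small NumPy sketch can illustrate why only one factor is zeroed. It assumes a hypothetical squared-error loss L = 0.5 * ||h - y||^2 with a random target y, and an init scale of 0.01 for A; both are illustrative choices, not prescribed by the text. The gradients are written out by hand for the layer h = Wx + B(Ax).

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 32, 4  # illustrative sizes

W = rng.standard_normal((d, d))         # random stand-in for the frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d))  # small random init (the 0.01 scale is an assumption)
B = np.zeros((d, r))                    # zero init, so B @ A == 0 at step zero

x = rng.standard_normal(d)
y = rng.standard_normal(d)              # hypothetical regression target

h = W @ x + B @ (A @ x)                 # forward pass at step zero
g = h - y                               # dL/dh for L = 0.5 * ||h - y||^2

grad_B = np.outer(g, A @ x)             # dL/dB = g (A x)^T
grad_A = np.outer(B.T @ g, x)           # dL/dA = (B^T g) x^T

starts_at_pretrained = np.array_equal(h, W @ x)
print(starts_at_pretrained, bool(grad_B.any()), bool(grad_A.any()))
```

With B = 0 the output equals Wx, yet B still receives a non-zero gradient because A is non-zero. The gradient for A vanishes at step zero and becomes non-zero once B has moved. If both factors started at zero, both gradients would vanish and training could not leave the starting point.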
Step 4 — the layer's forward pass. For an input
x, never materialise BA; apply the two
factors in sequence and add to the frozen path:
h = W x + \Delta W\,x = W x + B(A x).
Only A and B carry gradients;
W sits frozen.
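As a minimal sketch of this forward pass, the NumPy snippet below applies the two factors in sequence and compares the result with a reference that does materialise BA. The function name lora_forward and the sizes are hypothetical, chosen for illustration.

```python
import numpy as np

def lora_forward(W, A, B, x):
    """Return W x + B (A x) without ever forming the d x d product B A."""
    return W @ x + B @ (A @ x)

rng = np.random.default_rng(0)
d, r = 256, 8  # illustrative sizes
W = rng.standard_normal((d, d))
A = rng.standard_normal((r, d))
B = rng.standard_normal((d, r))
x = rng.standard_normal(d)

h_factored = lora_forward(W, A, B, x)
h_dense = (W + B @ A) @ x  # reference only: this line does materialise B A

# Extra multiplications on the adapter path for one input vector.
extra_mults_factored = 2 * d * r  # A x costs r*d, then B (A x) costs d*r
extra_mults_dense = d * d         # (B A) x, with B A already formed

print(np.allclose(h_factored, h_dense), extra_mults_factored, extra_mults_dense)
```

Both routes give the same h up to floating-point rounding; the factored route simply avoids storing or multiplying by a d x d update.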
Step 5 — count the trainable parameters. A full update has
d^2 entries. The factors together have
|A| + |B| = r\,d + d\,r = 2dr \quad\text{trainable parameters}.
With d = 4096 and r = 8:
2dr = 2\cdot 4096 \cdot 8 = 65{,}536 against
d^2 = 16{,}777{,}216, about
0.4\% (exactly 1/256). The fraction is
2dr/d^2 = 2r/d, which shrinks as the model widens.
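The count can be reproduced in a few lines of plain Python. The helper name lora_param_counts is hypothetical, introduced only for this illustration.

```python
def lora_param_counts(d: int, r: int) -> tuple[int, int, float]:
    """Return (full-update parameters, LoRA parameters, LoRA / full ratio)."""
    full = d * d      # entries in a dense d x d update
    lora = 2 * d * r  # entries in B (d x r) plus A (r x d)
    return full, lora, lora / full

full, lora, fraction = lora_param_counts(4096, 8)
print(full, lora, f"{fraction:.4%}")  # prints: 16777216 65536 0.3906%
```

The ratio equals 2r/d, so at fixed r it halves every time d doubles.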
Step 6 — merge at inference, for zero extra latency. Once trained, you can
fold the update back into the weights once and for all:
W' = W + BA, \qquad h = W' x.
The deployed layer is an ordinary matrix again — same shape, same cost as the original.
LoRA's adapter is free at inference, unlike adapters that add layers to the forward pass.
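The merge can be sketched in NumPy as follows, using random stand-ins for the trained factors and illustrative sizes. It checks that the merged matrix has the original shape and reproduces the adapter output.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 128, 8  # illustrative sizes
W = rng.standard_normal((d, d))  # frozen weight (random stand-in)
A = rng.standard_normal((r, d))  # stand-in for the trained A
B = rng.standard_normal((d, r))  # stand-in for the trained B

W_merged = W + B @ A             # one-off d x d product, computed once after training

x = rng.standard_normal(d)
h_adapter = W @ x + B @ (A @ x)  # two-path forward pass used during training
h_merged = W_merged @ x          # single matrix-vector product at inference

print(W_merged.shape == W.shape, np.allclose(h_adapter, h_merged))
```

After merging, A and B no longer need to be kept for serving; keeping them is only necessary if the update might later be subtracted back out or swapped for another one.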
LoRA fine-tunes a frozen weight matrix W \in \mathbb{R}^{d\times d}
through a low-rank update:
- Low-rank update. \Delta W = BA with B \in \mathbb{R}^{d\times r}, A \in \mathbb{R}^{r\times d}, and rank r \ll d; the layer computes Wx + B(Ax).
- Parameter count. Trainable parameters drop from d^2 to 2dr, a fraction 2r/d (e.g. \sim 0.4\% at d=4096, r=8).
- Zero-init. B = 0 so \Delta W = 0 at step zero; training begins as the exact pretrained model.
- Mergeable. At inference W' = W + BA folds in, leaving an ordinary matrix with no added latency.
The rank r is a capacity dial. A larger
r lets \Delta W express a richer
update (more trainable parameters, 2dr growing linearly in
r); a smaller r is cheaper but more
constrained. In practice, ranks as small as
r = 4 to 16 are often reported to match or come close to full fine-tuning
on many tasks, a striking amount of mileage from a thin bottleneck.
Why should so few directions suffice? Recall from the singular value
decomposition that any matrix is a sum of rank-one pieces ordered by
singular value, and that a matrix with a few dominant singular values is well approximated
by its top-r truncation. Empirically the fine-tuning update
\Delta W tends to behave this way: it lies close to a low-dimensional
subspace, so only a handful of directions in weight space actually need to move to adapt a
pretrained model. LoRA simply builds that low-rank assumption into the parameterisation
from the start.
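The truncation argument can be illustrated with NumPy. The sketch below builds a synthetic "update" as a rank-r signal plus small dense noise (an assumption made purely for illustration, not real fine-tuning data), then keeps the top-r singular directions and splits them into factors shaped like B and A. LoRA itself learns B and A by gradient descent rather than by an SVD; this only shows how little is lost when the singular values are dominated by a few directions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, r = 64, 4  # illustrative sizes

# Synthetic update: rank-r signal plus small dense noise (illustrative assumption).
signal = rng.standard_normal((d, r)) @ rng.standard_normal((r, d))
delta_W = signal + 0.01 * rng.standard_normal((d, d))

U, s, Vt = np.linalg.svd(delta_W)  # singular values in s are sorted descending

B = U[:, :r] * s[:r]  # d x r, top-r singular values folded into the columns
A = Vt[:r, :]         # r x d
approx = B @ A        # best rank-r approximation in Frobenius norm

rel_err = np.linalg.norm(delta_W - approx) / np.linalg.norm(delta_W)
tail = np.sqrt(np.sum(s[r:] ** 2)) / np.sqrt(np.sum(s ** 2))
print(rel_err, tail)
```

The relative error of the truncation equals the share of the Frobenius norm carried by the discarded singular values, so when those are small, 2dr numbers capture almost all of a d x d matrix.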