Full
fine-tuning works, but it is profligate. Each new task produces a complete new
copy of every weight — all of \theta. For a model with
billions of parameters, that is billions of numbers stored per task. Ten tasks means ten
full models on disk; serving them means swapping gigabytes in and out of memory. Something
is clearly wrong: the transfer-learning discussion just argued that the update
\Delta\theta is small, yet full fine-tuning stores the
whole new \theta anyway.
Parameter-efficient fine-tuning (PEFT) takes the small-update observation
seriously. Freeze the entire pretrained model and train only a tiny new set of
parameters — typically a fraction of a percent of the whole. You keep one frozen
base and a featherweight task module beside it.
Freeze the base, train a sliver, line by line
Write the pretrained weights as \theta_{\text{pre}} and the
task loss as \mathcal{L}_{\text{task}}, exactly as before. PEFT
rewrites the model as a frozen part plus a small trainable part.
Step 1 — freeze the base. Hold every pretrained weight fixed; it will
never receive a gradient:
\theta_{\text{pre}} \ \text{frozen} \quad\Longrightarrow\quad \nabla_{\theta_{\text{pre}}}\mathcal{L}_{\text{task}} \ \text{is never applied}.
Step 2 — add a small trainable module. Introduce a new set of parameters
\phi — an adapter, a prompt, or a low-rank update — that modifies
the frozen model's computation. The model becomes a function of both:
f_{\theta_{\text{pre}},\,\phi}(x), \qquad |\phi| \ll |\theta_{\text{pre}}|.
The defining inequality is |\phi| \ll |\theta_{\text{pre}}|: the new
parameters number a small fraction of the base, very often
0.1\% to 1\%.
Step 3 — train only \phi. Gradient descent
touches the new parameters alone; the base is a fixed backdrop:
\phi^{(k+1)} = \phi^{(k)} - \eta\,\nabla_\phi \mathcal{L}_{\text{task}}\big(\theta_{\text{pre}}, \phi^{(k)}\big).
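Steps 1 to 3 can be sketched in a few lines of plain Python. This is a toy illustration under assumptions that are not in the text: the "base" is a single frozen scalar weight `theta_pre`, the trainable module `phi` is an additive correction to it, the loss is mean squared error, and the dataset is made up.

```python
# Toy sketch of Steps 1-3: frozen base, small trainable phi, gradient
# descent on phi only.
# Assumptions (not from the text): scalar model f(x) = (theta_pre + phi) * x,
# mean-squared-error loss, and a made-up dataset generated by y = 2.5 * x.

theta_pre = 2.0   # pretrained weight: frozen, never reassigned below
phi = 0.0         # trainable module, initialised to "no change"
eta = 0.05        # learning rate
data = [(1.0, 2.5), (2.0, 5.0), (3.0, 7.5)]


def task_loss(theta, phi):
    """Mean squared error of the combined model on the toy dataset."""
    return sum(((theta + phi) * x - y) ** 2 for x, y in data) / len(data)


def grad_phi(theta, phi):
    """Gradient of task_loss with respect to phi; theta is a constant."""
    return sum(2 * ((theta + phi) * x - y) * x for x, y in data) / len(data)


initial_loss = task_loss(theta_pre, phi)
for k in range(200):
    # phi^(k+1) = phi^(k) - eta * grad_phi L_task(theta_pre, phi^(k))
    phi = phi - eta * grad_phi(theta_pre, phi)
final_loss = task_loss(theta_pre, phi)

print("theta_pre (unchanged):", theta_pre)
print("learned phi:", phi)
print("loss before / after:", initial_loss, final_loss)
```

The loop never assigns to `theta_pre`, which is the whole point: the task-specific knowledge ends up in `phi` alone, so `phi` is the only thing that needs saving.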
Step 4 — one base, many tasks. Because
\theta_{\text{pre}} is shared and frozen, each task needs to
store only its own little \phi_t:
\underbrace{\theta_{\text{pre}}}_{\text{shared, frozen}} \ + \ \underbrace{\{\phi_1, \phi_2, \dots, \phi_T\}}_{\text{tiny per-task}}.
Storing T tasks costs one base plus T
slivers, not T full models. Swapping tasks at serve time means
swapping a few megabytes, not gigabytes — you can even keep many
\phi_t resident at once.
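To put rough numbers on Step 4, here is a back-of-the-envelope calculation. The figures are illustrative assumptions, not values from the text: a 7-billion-parameter base, 2 bytes per parameter (16-bit weights), a task module at 0.1% of the base, and ten tasks.

```python
# Back-of-the-envelope storage comparison for T tasks.
# Assumptions (illustrative, not from the text): 7e9-parameter base,
# 2 bytes per parameter, phi at 0.1% of the base, T = 10 tasks.

BYTES_PER_PARAM = 2


def full_finetune_bytes(n_base, n_tasks):
    """Full fine-tuning: one complete copy of the weights per task."""
    return n_tasks * n_base * BYTES_PER_PARAM


def peft_bytes(n_base, n_phi, n_tasks):
    """PEFT: one shared frozen base plus one small module per task."""
    return (n_base + n_tasks * n_phi) * BYTES_PER_PARAM


n_base = 7_000_000_000
n_phi = n_base // 1000   # 0.1% of the base
n_tasks = 10

full_gb = full_finetune_bytes(n_base, n_tasks) / 1e9
peft_gb = peft_bytes(n_base, n_phi, n_tasks) / 1e9
per_task_mb = n_phi * BYTES_PER_PARAM / 1e6

print(f"full fine-tuning: {full_gb:.1f} GB")
print(f"PEFT:             {peft_gb:.2f} GB")
print(f"per-task module:  {per_task_mb:.0f} MB")
```

Under these assumptions, ten fully fine-tuned models take 140 GB, while one base plus ten modules takes about 14.14 GB, and each swap moves roughly 14 MB.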
PEFT adapts a pretrained model while storing almost nothing new per task:
- Freeze the base. The pretrained weights \theta_{\text{pre}} are held fixed and receive no gradient.
- Train a sliver. Only a small new module \phi is trained, with |\phi| \ll |\theta_{\text{pre}}|, typically well under 1\% of the parameters.
- One base, many tasks. A single shared frozen base plus a small per-task \phi_t means tasks are cheap to store and fast to swap.
PEFT is a family, distinguished by what the trainable
\phi is and where it plugs in.
- Adapters insert small bottleneck layers (down-project to a low dimension, nonlinearity, up-project back) inside each transformer block. Only those bottlenecks train; the surrounding block is frozen.
- Prefix-tuning and prompt-tuning prepend a handful of trainable “virtual token” vectors, either to the input (prompt-tuning) or to the keys and values at each layer (prefix-tuning). The model's weights never move; you are learning a soft prompt that steers the frozen model.
- Low-rank updates add a trainable low-rank correction to the weight matrices themselves. This approach became dominant and is the subject of the next page.
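The adapter bottleneck described in the list above can be sketched in plain Python. Several details here are assumptions rather than anything the text specifies: ReLU as the nonlinearity, a residual connection around the adapter, a zero-initialised up-projection so the adapter starts out as the identity, and made-up sizes `d` and `m`.

```python
# Sketch of one bottleneck adapter: down-project, nonlinearity, up-project.
# Assumptions (not from the text): ReLU nonlinearity, a residual connection
# around the adapter, zero-initialised up-projection, toy sizes d=8 and m=2.
import random


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def adapter(h, W_down, W_up):
    """Return h plus the bottleneck correction computed from h."""
    z = [max(0.0, v) for v in matvec(W_down, h)]   # down-project, then ReLU
    delta = matvec(W_up, z)                        # up-project back to size d
    return [hi + di for hi, di in zip(h, delta)]   # residual connection


random.seed(0)
d, m = 8, 2   # hidden size and bottleneck width (hypothetical values)
W_down = [[random.gauss(0, 0.1) for _ in range(d)] for _ in range(m)]  # m x d
W_up = [[0.0] * m for _ in range(d)]                                   # d x m

h = [float(i) for i in range(d)]   # stand-in for a frozen block's output
out = adapter(h, W_down, W_up)

n_adapter_params = m * d + d * m   # only these weights would be trained
print("output equals input at initialisation:", out == h)
print("adapter parameters:", n_adapter_params)
```

With the up-projection at zero, the adapter initially leaves the frozen model's behaviour untouched, and training then moves only `W_down` and `W_up`.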
All three obey the same contract — frozen base, tiny \phi —
and so all three inherit the one-base-many-tasks economics above.
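To see how small \phi is for each family, the sketch below counts trainable parameters under illustrative assumptions that are not from the text: hidden size `d = 4096`, 32 transformer blocks, a rough estimate of 12 d^2 base parameters per block (embeddings ignored), one adapter per block with bottleneck width 64, a soft prompt of 20 virtual tokens at the input only, and a rank-8 update on two square d-by-d matrices per block.

```python
# Rough trainable-parameter counts for the three PEFT families.
# All sizes are illustrative assumptions, not values from the text.


def adapter_params(d, m, n_layers):
    """One adapter per block: down-projection (d*m + m) plus up-projection (m*d + d)."""
    return n_layers * (d * m + m + m * d + d)


def prompt_params(d, p):
    """p trainable virtual-token vectors of dimension d, at the input only."""
    return p * d


def low_rank_params(d, r, n_mats, n_layers):
    """Each d x d update is factored as (d x r) times (r x d)."""
    return n_layers * n_mats * (d * r + r * d)


d, n_layers = 4096, 32
base = 12 * d * d * n_layers   # rough estimate of the base's block parameters

counts = [
    ("adapters", adapter_params(d, m=64, n_layers=n_layers)),
    ("soft prompt", prompt_params(d, p=20)),
    ("low-rank", low_rank_params(d, r=8, n_mats=2, n_layers=n_layers)),
]
for name, n in counts:
    print(f"{name:>12}: {n:>12,} trainable ({100 * n / base:.4f}% of base)")
```

The exact percentages depend entirely on the chosen sizes, but under these assumptions every variant stays well below 1% of the base.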