Parameter-Efficient Fine-Tuning

Full fine-tuning works, but it is profligate. Each new task produces a complete new copy of every weight — all of \theta. For a model with billions of parameters, that is billions of numbers stored per task. Ten tasks means ten full models on disk; serving them means swapping gigabytes in and out of memory. Something is clearly wrong: we just argued in transfer learning that the update \Delta\theta is small, yet full fine-tuning stores the whole new \theta anyway.

Parameter-efficient fine-tuning (PEFT) takes the small-update observation seriously. Freeze the entire pretrained model and train only a tiny new set of parameters — typically a fraction of a percent of the whole. You keep one frozen base and a featherweight task module beside it.

Freeze the base, train a sliver, line by line

Write the pretrained weights as \theta_{\text{pre}} and the task loss as \mathcal{L}_{\text{task}}, exactly as before. PEFT rewrites the model as a frozen part plus a small trainable part.

Step 1 — freeze the base. Hold every pretrained weight fixed; it will never receive a gradient:

\theta_{\text{pre}} \ \text{frozen} \quad\Longrightarrow\quad \nabla_{\theta_{\text{pre}}}\mathcal{L}_{\text{task}} \ \text{is never applied}.

Step 2 — add a small trainable module. Introduce a new set of parameters \phi — an adapter, a prompt, or a low-rank update — that modifies the frozen model's computation. The model becomes a function of both:

f_{\theta_{\text{pre}},\,\phi}(x), \qquad |\phi| \ll |\theta_{\text{pre}}|.

The defining inequality is |\phi| \ll |\theta_{\text{pre}}|: the new parameters number a small fraction of the base, very often 0.1\% to 1\%.

Step 3 — train only \phi. Gradient descent touches the new parameters alone; the base is a fixed backdrop:

\phi^{(k+1)} = \phi^{(k)} - \eta\,\nabla_\phi \mathcal{L}_{\text{task}}\big(\theta_{\text{pre}}, \phi^{(k)}\big).
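Steps 1 to 3 can be sketched in a few lines of plain Python. This is a toy rather than a real PEFT method: the "base" is a frozen linear map with hypothetical weights `THETA_PRE`, the trainable module `phi` is a single additive offset, and the loss is mean squared error on made-up data. All names and numbers are assumptions chosen for illustration.

```python
# Toy illustration of Steps 1-3: frozen base, trainable sliver phi.
# All names and values here are hypothetical.

THETA_PRE = (2.0, -1.0)  # frozen base weights (a tuple, so they cannot be mutated)

def model(x, phi):
    """f_{theta_pre, phi}(x): frozen linear map plus a trainable offset phi."""
    return THETA_PRE[0] * x[0] + THETA_PRE[1] * x[1] + phi

def loss(data, phi):
    """Mean squared error of the model on (x, y) pairs."""
    return sum((model(x, phi) - y) ** 2 for x, y in data) / len(data)

def grad_phi(data, phi):
    """Gradient of the loss with respect to phi only."""
    return sum(2.0 * (model(x, phi) - y) for x, y in data) / len(data)

def train_phi(data, phi=0.0, eta=0.1, steps=200):
    """Gradient descent on phi; THETA_PRE never receives an update."""
    for _ in range(steps):
        phi = phi - eta * grad_phi(data, phi)
    return phi

# Made-up task data: the frozen base's output shifted by a constant.
data = [((1.0, 0.0), 5.0), ((0.0, 1.0), 2.0), ((1.0, 1.0), 4.0)]
phi = train_phi(data)
```

No gradient with respect to `THETA_PRE` is ever computed, which is the code-level meaning of "frozen": the only quantity the loop updates is `phi`.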

Step 4 — one base, many tasks. Because \theta_{\text{pre}} is shared and frozen, each task needs to store only its own little \phi_t:

\underbrace{\theta_{\text{pre}}}_{\text{shared, frozen}} \ + \ \underbrace{\{\phi_1, \phi_2, \dots, \phi_T\}}_{\text{tiny, per-task}}.

Storing T tasks costs one base plus T slivers, not T full models. Swapping tasks at serve time means swapping a few megabytes, not gigabytes — you can even keep many \phi_t resident at once.
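A minimal sketch of what swapping tasks at serve time might look like, assuming the same kind of toy model: one frozen base held in memory and a dictionary of per-task modules. The weights, the task names, and the `predict` helper are all hypothetical.

```python
# One frozen base shared by every task, plus a small per-task phi_t.
# The weights and task names are hypothetical placeholders.

THETA_PRE = (2.0, -1.0)  # shared, frozen base

PHI = {  # tiny per-task modules, all resident at once
    "task_a": 3.0,
    "task_b": -0.5,
}

def predict(x, task):
    """Run the shared base, then apply the selected task's phi_t."""
    base_out = THETA_PRE[0] * x[0] + THETA_PRE[1] * x[1]
    return base_out + PHI[task]

x = (1.0, 1.0)
out_a = predict(x, "task_a")  # same base, task_a's sliver
out_b = predict(x, "task_b")  # same base, task_b's sliver
```

Switching tasks here is a dictionary lookup; the base is never reloaded or copied.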

PEFT adapts a pretrained model while storing almost nothing new per task:

Freeze the base. The pretrained weights \theta_{\text{pre}} are held fixed and receive no gradient.
Train a sliver. Only a small new module \phi is trained, with |\phi| \ll |\theta_{\text{pre}}| — typically well under 1\% of the parameters.
One base, many tasks. A single shared frozen base plus a small per-task \phi_t means tasks are cheap to store and fast to swap.

PEFT is a family, distinguished by what the trainable \phi is and where it plugs in.

Adapters insert small bottleneck layers (down-project to a low dimension, nonlinearity, up-project back) inside each transformer block. Only those bottlenecks train; the surrounding block is frozen.
Prefix- / prompt-tuning prepends a handful of trainable “virtual token” vectors to the input (or to the keys and values at each layer). The model's weights never move — you are learning a soft prompt that steers the frozen model.
Low-rank updates add a trainable low-rank correction to the weight matrices themselves — the approach that became dominant, and the subject of the next page.

All three obey the same contract — frozen base, tiny \phi — and so all three inherit the one-base-many-tasks economics above.
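To see how small such a module can be, here is a rough parameter count for a bottleneck adapter. The sizes are assumptions, not figures from the text: hidden size `d`, bottleneck size `r`, one adapter per block, and a standard transformer block of roughly 12·d² weights (4·d² for the attention projections and 8·d² for a 4x-wide MLP, ignoring biases and layer norms).

```python
# Rough parameter count for one bottleneck adapter versus one transformer block.
# Assumptions (not from the text): hidden size d, bottleneck size r, one adapter
# per block, and a block of about 12 * d^2 weights, ignoring biases and layer norms.

def adapter_params(d, r):
    """Down-projection (d*r weights + r biases) plus up-projection (r*d + d)."""
    return (d * r + r) + (r * d + d)

def block_params(d):
    """Approximate weight count of one standard transformer block."""
    return 12 * d * d

d, r = 4096, 64  # hypothetical sizes
phi_size = adapter_params(d, r)
theta_size = block_params(d)
fraction = phi_size / theta_size
print(f"adapter: {phi_size:,} params, block: {theta_size:,} params, "
      f"fraction: {fraction:.2%}")
```

Under these assumptions the adapter comes to about a quarter of a percent of the block it sits in, inside the 0.1\% to 1\% range quoted earlier. A larger bottleneck or several adapters per block would push the fraction up proportionally.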

What it costs to serve many tasks

Move the slider to set how big each task's trainable module \phi is, as a percentage of the base. The bold curve is the total storage for PEFT (one shared base plus T copies of that sliver) as the number of tasks T grows along the horizontal axis. The faint curve is full fine-tuning, which stores a whole fresh model per task, so it climbs by 100\% of a base per task no matter where the slider sits. Storage is in units of one base model.
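The two curves can be reproduced with a couple of one-line functions. This is a sketch in the same units as the plot (one base model = 1); `phi_fraction` plays the role of the slider, and the value 0.01 is a hypothetical setting.

```python
def peft_storage(num_tasks, phi_fraction):
    """Total storage in units of one base model: one shared base plus T slivers."""
    return 1.0 + num_tasks * phi_fraction

def full_ft_storage(num_tasks):
    """Full fine-tuning stores one complete model per task."""
    return float(num_tasks)

phi_fraction = 0.01  # hypothetical slider setting: phi is 1% of the base
for T in (1, 10, 100):
    print(T, peft_storage(T, phi_fraction), full_ft_storage(T))
```

With a single task, PEFT stores marginally more than full fine-tuning (the base plus one sliver). The saving appears from the second task onward and widens linearly: at 100 tasks and a 1\% module, PEFT needs about 2 base-model units against 100.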