Training a large network from scratch is expensive: millions of labelled examples, thousands of GPU-hours, and a long, fragile climb down the loss surface. Yet most of what such a network learns — edges and textures in vision, syntax and word meaning in language — is general, not specific to any one task. So why relearn it every time?
Transfer learning is the answer: take a model that has already been pretrained on a large, general dataset and adapt it to the new task instead of starting over.
Step 1 — start from the pretrained weights, not from noise. Pretraining
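produced a parameter vector that already encodes those general features; use it, rather
than a random draw, as the starting point for training on the new task.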
Step 2 — continue training on the new task. Run the same gradient descent you always would, now on the task loss, updating all the weights. This is full fine-tuning.
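In symbols, the full fine-tuning update can be sketched as follows. The notation is an assumption introduced here for illustration: $\theta_0$ for the pretrained parameter vector, $\eta$ for the learning rate, and $\mathcal{L}_{\text{task}}$ for the loss on the new task.

```latex
\theta^{(0)} = \theta_0, \qquad
\theta^{(t+1)} = \theta^{(t)} - \eta \, \nabla_{\theta} \mathcal{L}_{\text{task}}\bigl(\theta^{(t)}\bigr)
```

The only difference from training from scratch is the initial condition: $\theta^{(0)}$ is the pretrained vector rather than a random draw.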
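Step 3 — exploit the head start. Because training begins from weights that already encode general features, the model needs far less labelled data and far fewer updates to reach good task performance than one trained from scratch.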
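A toy illustration of the head start, under loud assumptions: the "model" is a two-parameter line fitted by plain gradient descent, the task data are made up and noise-free, and `pretrained_init` is a hypothetical starting point standing in for weights learned on a related task. It only shows the mechanism (a closer starting point needs fewer steps), not the behaviour of a real network.

```python
import random

def mse_and_grad(w, data):
    """Mean squared error of the line y = w[0]*x + w[1], and its gradient."""
    n = len(data)
    loss, g0, g1 = 0.0, 0.0, 0.0
    for x, y in data:
        err = w[0] * x + w[1] - y
        loss += err * err / n
        g0 += 2 * err * x / n
        g1 += 2 * err / n
    return loss, (g0, g1)

def steps_to_fit(w_init, data, lr=0.1, tol=1e-4, max_steps=10_000):
    """Run gradient descent from w_init; return the step at which loss drops below tol."""
    w = list(w_init)
    for step in range(max_steps):
        loss, (g0, g1) = mse_and_grad(w, data)
        if loss < tol:
            return step
        w[0] -= lr * g0
        w[1] -= lr * g1
    return max_steps

# Hypothetical new task: y = 2.1*x + 0.9 on x in [-1, 1].
task_data = [(k / 10, 2.1 * (k / 10) + 0.9) for k in range(-10, 11)]

pretrained_init = (2.0, 1.0)                       # assumed: learned on a related task
rng = random.Random(0)
scratch_init = (rng.gauss(0, 1), rng.gauss(0, 1))  # "noise" initialisation

steps_pretrained = steps_to_fit(pretrained_init, task_data)
steps_scratch = steps_to_fit(scratch_init, task_data)
print(steps_pretrained, steps_scratch)
```

Both runs use the same data, loss, and learning rate; only the starting point differs, and the run that starts near the solution reaches the tolerance in fewer steps.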
Step 4 — (optional) freeze the body, retrain only the head. When the new task shares almost all of its structure with pretraining, you can hold the lower layers fixed and replace just the final classification layer (the head).
The pretrained body acts as a fixed feature extractor; only the thin head learns the new labels. This is the cheapest form of transfer — and the conceptual seed of the parameter-efficient methods that follow.
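The freeze-the-body recipe can be sketched in a few lines of plain Python. Everything concrete here is an assumption for illustration: `BODY_W` is a made-up stand-in for pretrained weights, the body is a single tanh layer, the head is a logistic regression, and the four-point dataset is invented. The point is only that gradients are computed and applied for the head parameters and nothing else.

```python
import math

# Hypothetical "pretrained" body: a fixed 2 -> 3 feature map (weights are made up).
BODY_W = [[1.0, -1.0], [0.5, 0.5], [-1.0, 2.0]]

def body(x):
    """Frozen feature extractor: linear map followed by tanh. Never updated."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in BODY_W]

def head(features, head_w, head_b):
    """Trainable head: logistic regression on top of the body features."""
    z = sum(w * f for w, f in zip(head_w, features)) + head_b
    return 1.0 / (1.0 + math.exp(-z))

def train_head(data, lr=0.5, epochs=200):
    """Gradient descent on binary cross-entropy, updating head parameters only."""
    head_w, head_b = [0.0, 0.0, 0.0], 0.0     # the new head starts from scratch
    feats = [(body(x), y) for x, y in data]   # body is fixed, so features can be cached
    for _ in range(epochs):
        grad_w, grad_b = [0.0, 0.0, 0.0], 0.0
        for f, y in feats:
            err = head(f, head_w, head_b) - y  # d(loss)/d(logit) for cross-entropy
            grad_w = [g + err * fi / len(feats) for g, fi in zip(grad_w, f)]
            grad_b += err / len(feats)
        head_w = [w - lr * g for w, g in zip(head_w, grad_w)]
        head_b -= lr * grad_b
    return head_w, head_b

# Tiny invented task: the label is 1 when x[0] > x[1].
task_data = [((1.0, 0.0), 1), ((0.0, 1.0), 0), ((2.0, 1.0), 1), ((-1.0, 0.5), 0)]
body_before = [row[:] for row in BODY_W]
head_w, head_b = train_head(task_data)
predictions = [round(head(body(x), head_w, head_b)) for x, _ in task_data]
```

Because the body never changes, its features are computed once and reused every epoch, which is where most of the cost saving comes from in this setup.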
The head start is also a hazard. If you fine-tune with a large learning rate, the early,
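noisy gradients of the small new dataset can shove the weights far from their pretrained
values and erase the general features that made the head start valuable, a failure often
called catastrophic forgetting.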
The standard guard is to fine-tune gently: use a learning rate one or two orders
of magnitude smaller than in pretraining, often with a short warm-up and only a few
epochs. The goal is to nudge the weights, keeping them close to their pretrained values
while they adapt to the new task.
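One way to express "fine-tune gently" is a learning-rate schedule with a small peak and a short linear warm-up. The sketch below is a minimal version under stated assumptions: the function name `finetune_lr`, the pretraining rate of 1e-3, the shrink factor of 0.01, and the 10% warm-up fraction are all illustrative choices, not recommended values.

```python
def finetune_lr(step, total_steps, pretrain_lr=1e-3, shrink=0.01, warmup_frac=0.1):
    """Learning rate at a given step: linear warm-up to a small peak, then constant.

    All defaults are illustrative assumptions, not tuned values.
    """
    peak = pretrain_lr * shrink                   # two orders of magnitude below pretraining
    warmup_steps = max(1, int(total_steps * warmup_frac))
    if step < warmup_steps:
        return peak * (step + 1) / warmup_steps   # ramp up linearly to the peak
    return peak

# Schedule for a short, hypothetical 100-step fine-tuning run.
schedule = [finetune_lr(s, total_steps=100) for s in range(100)]
```

The warm-up keeps the very first updates, which are the noisiest on a small dataset, smaller still, so the weights are not pushed far from their pretrained values before the gradients settle.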
Below is a stylised learning-curve comparison: task accuracy as a function of how much labelled task data you have. The pretrained-then-fine-tuned model (bold) starts high even with a handful of examples and saturates quickly; the from-scratch model (faint) needs a great deal of data just to catch up. The gap is widest on the left — the low-data regime where transfer learning pays off most.