Transfer Learning

Training a large network from scratch is expensive: millions of labelled examples, thousands of GPU-hours, and a long, fragile climb down the loss surface. Yet most of what such a network learns — edges and textures in vision, syntax and word meaning in language — is general, not specific to any one task. So why relearn it every time?

Transfer learning is the answer: take a model that has already been pretrained on a large generic corpus, and adapt it to your new task with a comparatively small amount of labelled data. You start from a good place instead of a random one, and the head start is usually substantial.

From pretrained to fine-tuned, line by line

Write \theta for the model's weights and \mathcal{L}_{\text{task}}(\theta) for the loss on the new task's (small) labelled dataset. Ordinary training would initialise \theta at random and run gradient descent. Transfer learning changes exactly one thing — where the descent begins.

Step 1 — start from the pretrained weights, not from noise. Pretraining produced a parameter vector \theta_{\text{pre}} that already encodes general features. Initialise there:

\theta^{(0)} = \theta_{\text{pre}} \qquad(\text{not } \theta^{(0)} \sim \mathcal{N}(0,\sigma^2)).

Step 2 — continue training on the new task. Run the same gradient descent you always would, now on the task loss, updating all the weights — this is full fine-tuning:

\theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla_\theta \mathcal{L}_{\text{task}}\big(\theta^{(k)}\big).
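As a minimal sketch of Steps 1 and 2, the loop below runs plain gradient descent on a toy linear-regression task, starting from a stand-in for \theta_{\text{pre}}. The linear model, the synthetic data, and the names predict, task_loss, task_grad and fine_tune are illustrative assumptions, not part of any library.

```python
import random

def predict(theta, x):
    """Linear model: theta . x"""
    return sum(t * xi for t, xi in zip(theta, x))

def task_loss(theta, data):
    """L_task: half the mean squared error on the task data."""
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / (2 * len(data))

def task_grad(theta, data):
    """Gradient of task_loss with respect to theta."""
    grad = [0.0] * len(theta)
    for x, y in data:
        err = predict(theta, x) - y
        for j, xj in enumerate(x):
            grad[j] += err * xj / len(data)
    return grad

def fine_tune(theta_init, data, lr=0.1, steps=200):
    """Full fine-tuning: every weight is updated at every step."""
    theta = list(theta_init)                 # theta^(0) = theta_init
    for _ in range(steps):
        grad = task_grad(theta, data)
        theta = [t - lr * g for t, g in zip(theta, grad)]   # theta^(k+1)
    return theta

rng = random.Random(0)
theta_task = [2.0, -1.0, 0.5]    # hypothetical weights that generate the task labels
theta_pre = [1.8, -0.9, 0.6]     # hypothetical "pretrained" weights, assumed to lie nearby

data = []
for _ in range(20):
    x = [rng.uniform(-1.0, 1.0) for _ in range(3)]
    data.append((x, predict(theta_task, x)))

theta_ft = fine_tune(theta_pre, data)        # Step 1: start at theta_pre; Step 2: descend
print("loss at theta_pre:", task_loss(theta_pre, data))
print("loss after fine-tuning:", task_loss(theta_ft, data))
```

The only difference from training from scratch is the first argument to fine_tune: passing a random vector instead of theta_pre gives ordinary training with the same loop.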

Step 3 — exploit the head start. Because \theta_{\text{pre}} already sits in a good region of the loss surface, the descent needs to move only a little: the solution \theta^\star that fine-tuning converges to is typically close to \theta_{\text{pre}}, so the total update \Delta\theta = \theta^\star - \theta_{\text{pre}} is small. The result is fewer steps, less data and faster convergence, and often a higher final accuracy, especially when task data is scarce.
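The effect of a small \Delta\theta on the number of steps can be illustrated on the simplest possible loss. The sketch below assumes a toy quadratic \mathcal{L}(\theta) = \tfrac12\|\theta - \theta^\star\|^2, a hypothetical \theta_{\text{pre}} placed near \theta^\star, and a small random initialisation; real loss surfaces are not this simple, so treat it as intuition only.

```python
import math
import random

def steps_to_converge(theta_init, theta_star, lr=0.1, tol=1e-3, max_steps=10_000):
    """Count gradient-descent steps on L(theta) = 0.5 * ||theta - theta_star||^2
    until theta is within tol of theta_star."""
    theta = list(theta_init)
    for step in range(max_steps):
        dist = math.sqrt(sum((t - s) ** 2 for t, s in zip(theta, theta_star)))
        if dist < tol:
            return step
        # The gradient of this toy loss is simply (theta - theta_star).
        theta = [t - lr * (t - s) for t, s in zip(theta, theta_star)]
    return max_steps

rng = random.Random(0)
theta_star = [3.0, -2.0, 1.0, 4.0]                      # hypothetical task optimum
theta_pre = [s + 0.05 for s in theta_star]              # assumed: pretrained weights lie nearby
theta_rand = [rng.gauss(0.0, 0.1) for _ in theta_star]  # small random initialisation

steps_pre = steps_to_converge(theta_pre, theta_star)
steps_rand = steps_to_converge(theta_rand, theta_star)
print("steps from theta_pre:", steps_pre)
print("steps from random init:", steps_rand)
```

On this loss each step shrinks the distance to \theta^\star by the constant factor 1 - \eta, so the step count grows with the logarithm of the initial distance: a smaller \Delta\theta means fewer steps.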

Step 4 — (optional) freeze the body, retrain only the head. When the new task shares almost all of its structure with pretraining, you can hold the lower layers fixed and replace just the final classification layer (the head):

\theta = (\underbrace{\theta_{\text{body}}}_{\text{frozen}},\ \underbrace{\theta_{\text{head}}}_{\text{trained}}), \qquad \text{train only } \theta_{\text{head}}.

The pretrained body acts as a fixed feature extractor; only the thin head learns the new labels. This is the cheapest form of transfer — and the conceptual seed of the parameter-efficient methods that follow.
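A minimal sketch of Step 4, under toy assumptions: the "body" is a small fixed feature map standing in for \theta_{\text{body}}, and the "head" is a linear layer trained by gradient descent. The names W_body, body, head and head_loss are hypothetical, and the targets are constructed so that a linear head can fit them.

```python
import math
import random

rng = random.Random(0)

# Hypothetical "pretrained" body: a fixed map from 2 inputs to 3 features.
W_body = [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(3)]

def body(x):
    """Frozen feature extractor (theta_body)."""
    return [math.tanh(sum(w * xi for w, xi in zip(row, x))) for row in W_body]

def head(w_head, feats):
    """Trainable linear head (theta_head)."""
    return sum(w * f for w, f in zip(w_head, feats))

# Toy task: targets are a linear function of the body's features.
w_target = [1.0, -2.0, 0.5]
inputs = [[rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)] for _ in range(30)]
# Because the body is frozen, its features can be computed once and cached.
feats_cache = [body(x) for x in inputs]
targets = [head(w_target, f) for f in feats_cache]

def head_loss(w_head):
    n = len(targets)
    return sum((head(w_head, f) - y) ** 2 for f, y in zip(feats_cache, targets)) / (2 * n)

W_body_before = [row[:] for row in W_body]
w_head = [0.0, 0.0, 0.0]          # freshly initialised head
loss_before = head_loss(w_head)

lr = 0.5
for _ in range(300):
    grad = [0.0, 0.0, 0.0]
    for f, y in zip(feats_cache, targets):
        err = head(w_head, f) - y
        for j in range(3):
            grad[j] += err * f[j] / len(targets)
    # Only the head is updated; W_body never appears in an update.
    w_head = [w - lr * g for w, g in zip(w_head, grad)]

loss_after = head_loss(w_head)
print("head loss before:", loss_before, "after:", loss_after)
print("body unchanged:", W_body == W_body_before)
```

Caching the features is what makes this form cheap: the body is evaluated once per example, and every training step touches only the head's three weights.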

Adapting a pretrained model to a new task rests on three facts:

Start from \theta_{\text{pre}}. Initialise training at the pretrained weights, \theta^{(0)} = \theta_{\text{pre}}, rather than at random.
Full fine-tuning. Continue gradient descent on the task loss, updating every weight: \theta^{(k+1)} = \theta^{(k)} - \eta\,\nabla_\theta \mathcal{L}_{\text{task}}(\theta^{(k)}). Alternatively, freeze the body and train only the head.
Data-efficient versus scratch. Because the general features are already learned, fine-tuning needs far less task data and converges far faster than training from a random start — and wins most decisively in the low-data regime.
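The third point can be checked on a toy problem. The sketch below assumes a 20-dimensional linear-regression task with fewer training examples than parameters, a hypothetical \theta_{\text{pre}} that is already close to the weights generating the labels, and noise-free labels. It is a constructed illustration, not a benchmark.

```python
import random

DIM = 20
rng = random.Random(0)

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def mse(theta, data):
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def train(theta_init, data, lr=0.05, steps=300):
    """Plain gradient descent on half the mean squared error."""
    theta = list(theta_init)
    for _ in range(steps):
        grad = [0.0] * DIM
        for x, y in data:
            err = predict(theta, x) - y
            for j in range(DIM):
                grad[j] += err * x[j] / len(data)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def sample(theta, n):
    """Draw n noise-free labelled examples from the linear task."""
    xs = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in range(n)]
    return [(x, predict(theta, x)) for x in xs]

theta_task = [rng.choice([-2.0, 2.0]) for _ in range(DIM)]   # hypothetical task weights
theta_pre = [t + rng.gauss(0.0, 0.1) for t in theta_task]    # assumed: pretraining got close
theta_rand = [rng.gauss(0.0, 0.1) for _ in range(DIM)]       # from-scratch initialisation

test_set = sample(theta_task, 200)
results = {}
for n in (2, 5, 10):                     # low-data regime: fewer examples than weights
    train_set = sample(theta_task, n)
    fine_tuned = mse(train(theta_pre, train_set), test_set)
    from_scratch = mse(train(theta_rand, train_set), test_set)
    results[n] = (fine_tuned, from_scratch)
    print(f"n={n:2d}  fine-tuned test MSE={fine_tuned:.4f}  from-scratch test MSE={from_scratch:.4f}")
```

With fewer examples than weights, the data pins down only some directions of \theta. Gradient descent leaves the remaining directions at their initial values, so the from-scratch model keeps its random guess there while the fine-tuned model keeps what pretraining supplied.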

The head start is also a hazard. If you fine-tune with a large learning rate, the early, noisy gradients of the small new dataset can shove \theta far from \theta_{\text{pre}} — overwriting the very features that made the model useful. The network learns the new task but forgets the old general competence. This is catastrophic forgetting.
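A toy illustration of this trade-off, assuming a linear model, a noise-free "old" dataset on which \theta_{\text{pre}} is optimal, and a small noisy "new" dataset. All names and numbers are hypothetical; the point is only the comparison between a small and a large learning rate over the same number of steps.

```python
import math
import random

rng = random.Random(0)
DIM = 3
theta_pre = [1.0, -2.0, 0.5]   # hypothetical pretrained weights
theta_new = [1.2, -1.8, 0.4]   # hypothetical new-task weights, close to theta_pre

def predict(theta, x):
    return sum(t * xi for t, xi in zip(theta, x))

def mse(theta, data):
    return sum((predict(theta, x) - y) ** 2 for x, y in data) / len(data)

def sample(theta, n, noise):
    data = []
    for _ in range(n):
        x = [rng.uniform(-1.0, 1.0) for _ in range(DIM)]
        data.append((x, predict(theta, x) + rng.gauss(0.0, noise)))
    return data

old_data = sample(theta_pre, 200, noise=0.0)   # stands in for the pretraining task
new_data = sample(theta_new, 5, noise=0.5)     # small, noisy fine-tuning set

def fine_tune(theta_init, data, lr, steps=20):
    theta = list(theta_init)
    for _ in range(steps):
        grad = [0.0] * DIM
        for x, y in data:
            err = predict(theta, x) - y
            for j in range(DIM):
                grad[j] += err * x[j] / len(data)
        theta = [t - lr * g for t, g in zip(theta, grad)]
    return theta

def drift(theta):
    """Distance moved away from the pretrained weights, ||theta - theta_pre||."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(theta, theta_pre)))

gentle = fine_tune(theta_pre, new_data, lr=0.005)
aggressive = fine_tune(theta_pre, new_data, lr=0.5)
for name, theta in (("gentle", gentle), ("aggressive", aggressive)):
    print(name, "drift:", drift(theta),
          "new-task loss:", mse(theta, new_data),
          "old-task loss:", mse(theta, old_data))
```

The aggressive run fits the five noisy examples more closely, but it does so by moving further from \theta_{\text{pre}}, and its loss on the old data rises accordingly.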

The standard guard is to fine-tune gently: use a learning rate one or two orders of magnitude smaller than in pretraining, often with a short warm-up and only a few epochs. The goal is to nudge the weights, keeping \Delta\theta small, not to retrain the model. Freezing the body (Step 4) is the extreme of this caution: those weights cannot move at all, so the features they encode cannot be overwritten.
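One way to express such a schedule is sketched below: a linear warm-up to a base learning rate that is then held constant. The function name fine_tune_lr, the pretraining rate of 1e-3 and the warm-up length are assumptions chosen for illustration; other warm-up shapes and decay schedules are also common.

```python
def fine_tune_lr(step, base_lr, warmup_steps):
    """Linear warm-up to base_lr over warmup_steps updates, then constant."""
    if warmup_steps <= 0:
        return base_lr
    return base_lr * min(1.0, (step + 1) / warmup_steps)

pretrain_lr = 1e-3            # hypothetical learning rate used in pretraining
base_lr = pretrain_lr / 10    # one order of magnitude smaller for fine-tuning

schedule = [fine_tune_lr(step, base_lr, warmup_steps=5) for step in range(8)]
for step, lr in enumerate(schedule):
    print(step, lr)
```

The warm-up keeps the very first updates, computed on a handful of examples, from moving \theta far before the gradient estimates have settled.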

The head start, plotted

Below is a stylised learning-curve comparison: task accuracy as a function of how much labelled task data you have. The pretrained-then-fine-tuned model (bold) starts high even with a handful of examples and saturates quickly; the from-scratch model (faint) needs a great deal of data just to catch up. The gap is widest on the left — the low-data regime where transfer learning pays off most.