Even with Adam, a single fixed learning rate \eta is rarely the best choice. Too big at the
start and training is unstable — the loss explodes before it settles. Too big at the
end and the optimizer keeps bouncing around the minimum, never landing. A great
early rate is a terrible late rate.
So we let \eta change over training — a schedule.
The modern recipe is a short linear warmup ramping up to a peak, followed by
a smooth cosine decay gliding down to a small floor. Up fast, then a long,
gentle landing.
Deriving the warmup + cosine schedule
Fix a peak rate \eta_{\max}, a floor \eta_{\min}, a warmup length T_{\text{warm}}, and a
total budget of T steps, with 0 < T_{\text{warm}} < T so that neither phase is empty.
The schedule has two phases, joined at t = T_{\text{warm}}.
Step 1 — warmup (linear ramp). For
t < T_{\text{warm}}, rise linearly from
0 to the peak:
\eta_t = \eta_{\max}\,\frac{t}{T_{\text{warm}}}.
At t = 0 the rate is 0; at
t = T_{\text{warm}} it has climbed exactly to
\eta_{\max} — the peak, where the two phases meet.
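The ramp in Step 1 can be sketched in a few lines. The function name `warmup_lr` and the sample values (a peak of 1e-3 reached after 1000 steps) are illustrative assumptions, not taken from any library, and the sketch assumes T_{\text{warm}} > 0.

```python
def warmup_lr(t: int, eta_max: float, t_warm: int) -> float:
    """Linear warmup: eta_max * t / t_warm, defined for 0 <= t <= t_warm."""
    if not 0 <= t <= t_warm:
        raise ValueError("warmup_lr is only defined for 0 <= t <= t_warm")
    return eta_max * t / t_warm


# Illustrative values (assumed): peak 1e-3 reached after 1000 warmup steps.
for t in (0, 250, 500, 1000):
    print(t, warmup_lr(t, eta_max=1e-3, t_warm=1000))
```

The printed rates climb in equal increments from 0 at step 0 to the peak at step 1000.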
Step 2 — cosine decay. For
t \ge T_{\text{warm}}, ride a half-cosine from the peak down to the
floor. Let the progress through the decay phase be
p = \frac{t - T_{\text{warm}}}{T - T_{\text{warm}}} \in [0, 1]; then
\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\,\big(1 + \cos(\pi\, p)\big).
Step 3 — check the endpoints. At the start of decay,
p = 0, so \cos(0) = 1 and
\eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) = \eta_{\max} —
it leaves the peak smoothly. At the end, p = 1, so
\cos(\pi) = -1 and
\eta_t = \eta_{\min} — it lands exactly on the floor.
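The decay formula and its endpoint check can be reproduced numerically. This is a minimal sketch: `cosine_decay_lr` is a hypothetical helper, and the peak, floor, and step counts are assumed values chosen only for illustration.

```python
import math


def cosine_decay_lr(t, eta_max, eta_min, t_warm, total):
    """Half-cosine from eta_max (at t = t_warm) down to eta_min (at t = total)."""
    p = (t - t_warm) / (total - t_warm)  # progress through the decay phase, in [0, 1]
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))


# Illustrative values (assumed): peak 1e-3, floor 1e-5, 1000 warmup steps, 10000 total.
args = dict(eta_max=1e-3, eta_min=1e-5, t_warm=1000, total=10_000)
start = cosine_decay_lr(1000, **args)    # p = 0   -> the peak
middle = cosine_decay_lr(5500, **args)   # p = 0.5 -> halfway between peak and floor
end = cosine_decay_lr(10_000, **args)    # p = 1   -> the floor
print(start, middle, end)
```

Up to floating-point rounding, the three values are the peak, the midpoint (\eta_{\max} + \eta_{\min})/2, and the floor.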
Step 4 — why cosine, not a line. The cosine's derivative is zero at both
ends (p = 0 and p = 1): it leaves the
peak gently and flattens as it approaches the floor, spending its final steps barely
moving. That long, soft landing lets the optimizer settle precisely into the minimum instead
of being yanked down a straight ramp.
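The zero-slope claim is easy to check with a finite difference. The sketch below compares the normalized cosine shape against a straight line between the same endpoints; the helper names and the step size `h` are assumptions made for this illustration.

```python
import math


def cosine_shape(p):
    """Normalized cosine decay: 1 at p = 0, 0 at p = 1."""
    return 0.5 * (1.0 + math.cos(math.pi * p))


def linear_shape(p):
    """Normalized straight-line decay between the same endpoints."""
    return 1.0 - p


def slope(f, p, h=1e-6):
    """Central finite-difference estimate of df/dp."""
    return (f(p + h) - f(p - h)) / (2 * h)


for p in (0.0, 0.5, 1.0):
    print(f"p={p}: cosine slope {slope(cosine_shape, p):+.4f}, "
          f"linear slope {slope(linear_shape, p):+.4f}")
```

The line has slope -1 everywhere, while the cosine is flat at both ends and steeper than the line in the middle (about -\pi/2 at p = 0.5): it spends its speed mid-run and keeps the start and finish gentle.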
With peak \eta_{\max}, floor \eta_{\min}, warmup T_{\text{warm}} and total T:
- Warmup. For t < T_{\text{warm}}, \eta_t = \eta_{\max}\, t / T_{\text{warm}} rises linearly to the peak.
- Cosine decay. For t \ge T_{\text{warm}}, \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max}-\eta_{\min})\big(1+\cos(\pi p)\big) with p = (t - T_{\text{warm}})/(T - T_{\text{warm}}).
- Endpoints. The two phases meet at the peak \eta_{\max} at t = T_{\text{warm}}, and the decay lands smoothly (zero slope) on \eta_{\min} at t = T.
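Put together, the two phases fit in one small function. This is a sketch under stated assumptions rather than any library's implementation: the name `lr_schedule`, the configuration values, and the clamp that holds the rate on the floor for t > T are all choices made here for illustration.

```python
import math


def lr_schedule(t, eta_max, eta_min, t_warm, total):
    """Linear warmup to eta_max, then cosine decay to eta_min at step `total`."""
    if t < t_warm:
        return eta_max * t / t_warm
    # Assumption: clamp progress at 1 so steps past `total` stay on the floor.
    p = min(1.0, (t - t_warm) / (total - t_warm))
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))


# Illustrative configuration (assumed values).
cfg = dict(eta_max=3e-4, eta_min=3e-5, t_warm=500, total=5000)
for t in (0, 250, 500, 2750, 5000):
    print(f"step {t:>5}: lr = {lr_schedule(t, **cfg):.6f}")
```

In a training loop, such a function would typically be called once per step and its result written into the optimizer's learning rate before the update.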
Warmup is not cosmetic; it is a fix for a real failure. At step 0, Adam's second-moment
estimate \hat{v}_t is built from almost no data, so it is noisy and unreliable. Dividing by
a bad \sqrt{\hat{v}_t} at full learning rate can produce a giant, destabilizing first step.
Ramping \eta up slowly buys the second-moment estimate time to settle before the steps get
large.
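One way to see the problem concretely: in the standard bias-corrected Adam update, the first step uses a \hat{v}_t built from a single gradient, so \hat{m}_t / \sqrt{\hat{v}_t} is \pm 1 and the parameter moves by roughly the full learning rate no matter how small that gradient was. The sketch below assumes a single scalar parameter and the commonly used default hyperparameters; `adam_first_step` is a hypothetical helper, not a library call.

```python
import math


def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's very first update for one scalar parameter."""
    m = (1 - beta1) * grad        # first-moment estimate after one gradient
    v = (1 - beta2) * grad ** 2   # second-moment estimate after one gradient
    m_hat = m / (1 - beta1)       # bias correction: recovers grad
    v_hat = v / (1 - beta2)       # bias correction: recovers grad**2
    return lr * m_hat / (math.sqrt(v_hat) + eps)


# Gradients spanning six orders of magnitude give nearly the same step size.
for g in (1e-4, 1.0, 100.0):
    print(g, adam_first_step(g, lr=1e-3))
```

Since the first step is about \eta in magnitude for every parameter, starting with \eta near 0 keeps those early, poorly calibrated steps small.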
The older alternative is step decay: hold
\eta constant, then drop it by a factor (say
10\times) at a few milestones — a staircase. It works, but each
cliff jolts training, and you must guess the milestones. Cosine decay needs no milestones,
changes \eta smoothly every step, and has become the default for
training large models.
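The contrast between the staircase and the cosine is visible in a few sampled steps. This is a minimal sketch with assumed values (a starting rate of 0.1, milestones at steps 3000 and 6000, a 9000-step run); warmup is omitted from the cosine to keep the comparison simple, and both function names are hypothetical.

```python
import math


def step_decay_lr(t, eta0, milestones, factor=0.1):
    """Staircase: multiply eta0 by `factor` once for each milestone already reached."""
    drops = sum(1 for m in milestones if t >= m)
    return eta0 * factor ** drops


def cosine_lr(t, eta_max, eta_min, total):
    """Cosine decay over the whole run (no warmup here)."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * t / total))


for t in (0, 2999, 3000, 6000, 9000):
    print(t,
          step_decay_lr(t, eta0=0.1, milestones=(3000, 6000)),
          cosine_lr(t, eta_max=0.1, eta_min=0.001, total=9000))
```

Between steps 2999 and 3000 the staircase rate falls by a factor of ten in a single step, while the cosine rate changes only slightly over the same interval.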
The last knob in the recipe
Optimizer plus schedule is the full modern training recipe: pick AdamW, then warm up and
cosine-decay its learning rate. With mini-batches feeding noisy gradients, momentum
smoothing them, Adam adapting per parameter, and the schedule shaping the step size over time,
you have every moving part of how a deep network is actually trained.