Learning-Rate Schedules

Even with Adam, a single fixed learning rate \eta is rarely the best choice. Too big at the start and training is unstable — the loss explodes before it settles. Too big at the end and the optimizer keeps bouncing around the minimum, never landing. A great early rate is a terrible late rate.

So we let \eta change over training — a schedule. The modern recipe is a short linear warmup ramping up to a peak, followed by a smooth cosine decay gliding down to a small floor. Up fast, then a long, gentle landing.

Deriving the warmup + cosine schedule

Fix a peak rate \eta_{\max}, a floor \eta_{\min}, a warmup length T_{\text{warm}}, and a total budget of T steps, with T_{\text{warm}} < T. The schedule has two phases, joined at t = T_{\text{warm}}.

Step 1 — warmup (linear ramp). For t < T_{\text{warm}}, rise linearly from 0 to the peak:

\eta_t = \eta_{\max}\,\frac{t}{T_{\text{warm}}}.

At t = 0 the rate is 0, and the ramp reaches \eta_{\max} exactly at t = T_{\text{warm}}, the step at which the decay phase takes over. The two phases therefore meet at the peak.

Step 2 — cosine decay. For t \ge T_{\text{warm}}, ride a half-cosine from the peak down to the floor. Let the progress through the decay phase be p = \frac{t - T_{\text{warm}}}{T - T_{\text{warm}}} \in [0, 1]; then

\eta_t = \eta_{\min} + \tfrac{1}{2}\,(\eta_{\max} - \eta_{\min})\,\big(1 + \cos(\pi\, p)\big).

Step 3 — check the endpoints. At the start of decay, p = 0, so \cos(0) = 1 and \eta_t = \eta_{\min} + (\eta_{\max} - \eta_{\min}) = \eta_{\max}: the decay begins exactly at the peak, so the schedule is continuous where the two phases join. At the end, p = 1, so \cos(\pi) = -1 and \eta_t = \eta_{\min}: it lands exactly on the floor.
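The endpoint check can also be done numerically. The sketch below is a minimal illustration, not a reference implementation: the function name cosine_decay and the values chosen for the peak and floor are assumptions made for the example.

```python
import math

def cosine_decay(p, eta_max, eta_min):
    """Learning rate at decay progress p in [0, 1]: a half-cosine from peak to floor."""
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))

# Illustrative values, not taken from the text.
ETA_MAX, ETA_MIN = 1e-3, 1e-5

start = cosine_decay(0.0, ETA_MAX, ETA_MIN)   # p = 0: should match the peak
middle = cosine_decay(0.5, ETA_MAX, ETA_MIN)  # p = 1/2: halfway between peak and floor
end = cosine_decay(1.0, ETA_MAX, ETA_MIN)     # p = 1: should match the floor

print(f"start = {start:.3e}, middle = {middle:.3e}, end = {end:.3e}")
```

At p = 1/2 the cosine term is zero, so the rate sits at the midpoint (\eta_{\max} + \eta_{\min})/2.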

Step 4 — why cosine, not a line. The cosine's derivative is zero at both ends (p = 0 and p = 1): it leaves the peak gently and flattens as it approaches the floor, spending its final steps barely moving. That long, soft landing lets the optimizer settle precisely into the minimum instead of being yanked down a straight ramp.
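To make the zero-slope claim concrete, differentiate the decay formula from Step 2 with respect to the progress p:

\frac{d\eta_t}{dp} = -\tfrac{\pi}{2}\,(\eta_{\max} - \eta_{\min})\,\sin(\pi\, p).

Since \sin(0) = \sin(\pi) = 0, the slope vanishes at p = 0 and p = 1. It is steepest at p = \tfrac{1}{2}, where it equals -\tfrac{\pi}{2}(\eta_{\max} - \eta_{\min}). For comparison, a straight line between the same endpoints would have the constant slope -(\eta_{\max} - \eta_{\min}) all the way to the floor.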

With peak \eta_{\max}, floor \eta_{\min}, warmup T_{\text{warm}} and total T:

Warmup. For t < T_{\text{warm}}, \eta_t = \eta_{\max}\, t / T_{\text{warm}} rises linearly to the peak.
Cosine decay. For t \ge T_{\text{warm}}, \eta_t = \eta_{\min} + \tfrac{1}{2}(\eta_{\max}-\eta_{\min})\big(1+\cos(\pi p)\big) with p = (t - T_{\text{warm}})/(T - T_{\text{warm}}).
Endpoints. The two phases meet at the peak \eta_{\max} at t = T_{\text{warm}}, and the decay lands smoothly (zero slope) on \eta_{\min} at t = T.
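
Putting the two phases together, the whole schedule fits in one short function. This is a sketch under stated assumptions: the name lr_at and the hyperparameter values are illustrative only, and it assumes 0 < t_warm < t_total.

```python
import math

def lr_at(t, eta_max, eta_min, t_warm, t_total):
    """Warmup + cosine learning rate at step t, for 0 <= t <= t_total."""
    if t < t_warm:
        # Phase 1: linear ramp from 0 up toward the peak.
        return eta_max * t / t_warm
    # Phase 2: half-cosine from the peak down to the floor.
    p = (t - t_warm) / (t_total - t_warm)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))

# Illustrative settings, not taken from the text.
ETA_MAX, ETA_MIN, T_WARM, T_TOTAL = 3e-4, 3e-5, 100, 1000

schedule = [lr_at(t, ETA_MAX, ETA_MIN, T_WARM, T_TOTAL) for t in range(T_TOTAL + 1)]

# Spot-check the start, mid-warmup, the peak, mid-decay, and the final step.
for t in (0, 50, 100, 550, 1000):
    print(f"step {t:4d}: lr = {schedule[t]:.2e}")
```

In a training loop, the value returned for the current step would be written into the optimizer's learning rate before each update.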

Warmup is not cosmetic: it fixes a real failure. In the first few steps, Adam's second-moment estimate \hat{v}_t is built from only a handful of gradients, so it is noisy and unreliable. Dividing by a bad \sqrt{\hat{v}_t} at full learning rate can produce a giant, destabilizing first step. Ramping \eta up slowly buys the second-moment estimate time to settle before the steps get large.
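
One way to see the problem is to compute Adam's very first update by hand. The sketch below assumes the standard bias-corrected Adam update on a single scalar parameter with commonly used default hyperparameters; the helper name adam_first_step is made up for this example. After one gradient g, the corrected moments are \hat{m} = g and \hat{v} = g^2, so the step has size close to \eta no matter how small or large g is.

```python
import math

def adam_first_step(grad, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Size of Adam's first update for one scalar parameter (moments start at 0)."""
    m = (1 - beta1) * grad        # first moment after a single gradient
    v = (1 - beta2) * grad ** 2   # second moment after a single gradient
    m_hat = m / (1 - beta1)       # bias correction at step 1 recovers grad
    v_hat = v / (1 - beta2)       # bias correction at step 1 recovers grad**2
    return lr * m_hat / (math.sqrt(v_hat) + eps)

# Gradients spanning eight orders of magnitude give nearly the same step size.
for g in (1e-4, 1.0, 1e4):
    print(f"grad = {g:.0e}  first step = {adam_first_step(g, lr=1e-3):.3e}")
```

Because that first step is set by a single noisy gradient, taking it at the peak rate is a full-size move on very little evidence. Scaling \eta down during warmup shrinks the move proportionally.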

The older alternative is step decay: hold \eta constant, then drop it by a factor (say 10\times) at a few milestones, giving a staircase. It works, but each cliff jolts training, and you must guess the milestones. Cosine decay needs no milestones, changes \eta smoothly every step, and has become a common default for training large models.
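
For comparison, a step-decay schedule can be sketched as follows. The function name step_decay, the milestone steps, and the starting rate are hypothetical choices for illustration.

```python
def step_decay(t, eta0, milestones, factor=0.1):
    """Hold eta0, multiplying by `factor` once for every milestone already reached."""
    drops = sum(1 for m in milestones if t >= m)
    return eta0 * factor ** drops

# Hypothetical milestones; in practice these have to be guessed or tuned.
MILESTONES = (300, 600, 900)

# The rate is flat between milestones and falls by 10x at each one.
for t in (0, 299, 300, 600, 900):
    print(f"step {t:3d}: lr = {step_decay(t, 1e-3, MILESTONES):.1e}")
```

The jump between steps 299 and 300 is the cliff described above: the rate changes by a factor of 10 in a single step, where the cosine schedule changes it by a tiny amount every step.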

Shape the schedule

Plotted against the step t, the learning rate \eta_t is a straight warmup ramp on the left followed by the half-cosine descent on the right. A longer warmup moves the peak later, a larger total step count stretches the landing, and a higher peak rate lifts the curve everywhere between its fixed endpoints at 0 and \eta_{\min}.

The last knob in the recipe

Optimizer plus schedule is the full modern training recipe: pick AdamW, then warm up and cosine-decay its learning rate. With mini-batches feeding noisy gradients, momentum smoothing them, Adam adapting per parameter, and the schedule shaping the step size over time, you have every moving part of how a deep network is actually trained.