The Training Recipe

You now own every ingredient separately: Adam (in its decoupled-decay form, AdamW), a warmup-then-cosine learning-rate schedule, and the noisy gradient of mini-batch SGD. This page assembles them into the single loop that trains essentially every modern large language model — the recipe.

Two pieces are still missing, and both serve stability. Weight decay keeps the parameters from drifting large, and gradient clipping rescales any oversized gradient, such as the one a single bad mini-batch can throw, down to a fixed maximum norm before it can land a destabilizing step.

Deriving the training step

One step processes one mini-batch \mathcal{B} of examples and nudges the parameters \theta once. We build it line by line; the whole loop is just this step repeated for T steps.

Step 1 — forward pass. Run the network on the batch and average the per-example loss over its B members:

L(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} L_i(\theta).

Step 2 — backward pass. Backpropagate to get the gradient of that loss with respect to every parameter (this is what automatic differentiation computes for you):

g = \nabla_\theta L(\theta).

Step 3 — clip the gradient. Measure the gradient's global norm \lVert g \rVert (stack every parameter's gradient into one vector and take its length). If it exceeds a threshold c, rescale the whole vector down so its norm is exactly c; otherwise leave it alone:

\hat{g} = g \cdot \min\!\left(1,\ \frac{c}{\lVert g \rVert}\right).

The direction is untouched — only the length is capped. A normal step passes through unchanged; a freak spike of norm 100c is shrunk back to c, so it can no longer wreck the run. (This is the same gradient clipping that tames exploding gradients.)
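As a minimal sketch of Step 3 in plain Python (standard library only): the helper name `clip_by_global_norm`, the nested-list layout for per-parameter gradients, and the threshold c = 1.0 are illustrative assumptions, not part of the recipe itself.

```python
import math

def clip_by_global_norm(grads, c):
    """Rescale per-parameter gradients so their global norm is at most c.

    grads is a list of lists (one inner list per parameter tensor, flattened).
    Returns (clipped_grads, original_norm).
    """
    # Global norm: treat every gradient entry as part of one long vector.
    norm = math.sqrt(sum(x * x for g in grads for x in g))
    # scale = min(1, c / norm); the branch also avoids dividing by zero.
    scale = 1.0 if norm <= c else c / norm
    return [[x * scale for x in g] for g in grads], norm

# An ordinary gradient (norm 0.5) and a spike (norm 50), both with c = 1.0.
normal = [[0.3, 0.0], [0.4]]
spike = [[30.0, 0.0], [40.0]]
clipped_normal, normal_norm = clip_by_global_norm(normal, 1.0)
clipped_spike, spike_norm = clip_by_global_norm(spike, 1.0)
```

Here the ordinary gradient passes through unchanged, while the spike keeps its direction and comes out with norm c.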

Step 4 — schedule the learning rate. Look up this step's rate from the warmup-then-cosine schedule — a short linear ramp to a peak \eta_{\max}, then a cosine glide to a floor \eta_{\min}:

\eta_t = \begin{cases} \eta_{\max}\, \dfrac{t}{T_{\text{warm}}}, & t < T_{\text{warm}}, \\[2ex] \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi p)\big), & t \ge T_{\text{warm}}, \end{cases}

with decay progress p = (t - T_{\text{warm}})/(T - T_{\text{warm}}).
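The schedule can be sketched as one small function. The name `lr_at` and the numeric settings in the usage lines are hypothetical, chosen only for illustration; the sketch assumes T > T_warm so the decay progress is well defined.

```python
import math

def lr_at(t, eta_max, eta_min, t_warm, t_total):
    """Warmup-then-cosine learning rate at step t (assumes t_total > t_warm)."""
    if t < t_warm:
        # Linear ramp from 0 toward the peak rate.
        return eta_max * t / t_warm
    # Decay progress p runs from 0 (end of warmup) to 1 (final step).
    p = (t - t_warm) / (t_total - t_warm)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))

# Illustrative settings: peak 3e-4, floor 3e-5, 100 warmup steps, 1000 steps total.
rate_start = lr_at(0, 3e-4, 3e-5, 100, 1000)
rate_peak = lr_at(100, 3e-4, 3e-5, 100, 1000)
rate_mid = lr_at(550, 3e-4, 3e-5, 100, 1000)
rate_end = lr_at(1000, 3e-4, 3e-5, 100, 1000)
```

The two branches agree at t = T_warm, where p = 0 and the cosine branch returns the peak rate, so the schedule has no jump at the handover.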

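Step 5: the AdamW update. Feed the clipped gradient \hat{g} into Adam's two moments (the running averages m_t of \hat{g} and v_t of \hat{g}^2), bias-correct them to \hat{m}_t and \hat{v}_t, take the adaptive step, and, decoupled from that step, shrink each weight by the fraction \eta_t\,\lambda (the weight decay, with coefficient \lambda):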

\theta_t = \theta_{t-1} - \eta_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} - \eta_t\,\lambda\,\theta_{t-1}.
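A sketch of this update on flat Python lists follows. The moment and bias-correction formulas are the standard Adam ones; the function name `adamw_step` and the default values for beta1, beta2, eps, and lam are common choices assumed here, not values this page prescribes.

```python
import math

def adamw_step(theta, g_hat, m, v, t, lr,
               beta1=0.9, beta2=0.999, eps=1e-8, lam=0.01):
    """One AdamW update; t is the 1-based step count. Returns new (theta, m, v)."""
    new_theta, new_m, new_v = [], [], []
    for th, g, mi, vi in zip(theta, g_hat, m, v):
        mi = beta1 * mi + (1 - beta1) * g        # first moment: average of g
        vi = beta2 * vi + (1 - beta2) * g * g    # second moment: average of g^2
        m_hat = mi / (1 - beta1 ** t)            # bias correction
        v_hat = vi / (1 - beta2 ** t)
        # Adaptive step plus decoupled weight decay, both using the old weight.
        th = th - lr * m_hat / (math.sqrt(v_hat) + eps) - lr * lam * th
        new_theta.append(th)
        new_m.append(mi)
        new_v.append(vi)
    return new_theta, new_m, new_v

# First step from zero moments: one weight at 1.0, clipped gradient 2.0, rate 0.1.
theta_no_decay, _, _ = adamw_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.1, lam=0.0)
theta_decay, m1, v1 = adamw_step([1.0], [2.0], [0.0], [0.0], t=1, lr=0.1, lam=0.1)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the adaptive part moves the weight by roughly the learning rate; the decay term then removes a further lr * lam * theta on top of that.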

That is the entire recipe. Repeat steps 1–5 over mini-batches for T steps and you have trained a deep network the way the field actually does it: forward → loss → backward → clip → AdamW step → advance the schedule.
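To see the five steps run together, here is a toy sketch in plain Python that fits a line y = 3x - 1 with squared-error loss. Everything specific to it is an assumption made for illustration: the function name `train`, the synthetic data, the hand-derived gradient standing in for backpropagation, and every hyperparameter value. A real run would use an automatic-differentiation framework and a neural network.

```python
import math
import random

def train(T=300, T_warm=30, eta_max=0.05, eta_min=0.005, c=1.0, lam=0.01,
          beta1=0.9, beta2=0.999, eps=1e-8, batch_size=8, seed=0):
    rng = random.Random(seed)
    # Toy dataset: noise-free points on the line y = 3x - 1.
    data = [(x, 3.0 * x - 1.0) for x in (rng.uniform(-1, 1) for _ in range(256))]
    theta = [0.0, 0.0]            # [slope, intercept]
    m, v = [0.0, 0.0], [0.0, 0.0]
    losses = []
    for t in range(1, T + 1):
        batch = rng.sample(data, batch_size)
        # Steps 1-2: forward pass (mean loss) and backward pass (its gradient).
        loss, g = 0.0, [0.0, 0.0]
        for x, y in batch:
            err = theta[0] * x + theta[1] - y
            loss += err * err / batch_size
            g[0] += 2 * err * x / batch_size
            g[1] += 2 * err / batch_size
        # Step 3: clip the gradient by its global norm.
        norm = math.sqrt(sum(gi * gi for gi in g))
        scale = min(1.0, c / norm) if norm > 0 else 1.0
        g_hat = [gi * scale for gi in g]
        # Step 4: warmup-then-cosine learning rate for this step.
        if t < T_warm:
            eta = eta_max * t / T_warm
        else:
            p = (t - T_warm) / (T - T_warm)
            eta = eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * p))
        # Step 5: AdamW update with decoupled weight decay.
        for i in range(2):
            m[i] = beta1 * m[i] + (1 - beta1) * g_hat[i]
            v[i] = beta2 * v[i] + (1 - beta2) * g_hat[i] ** 2
            m_hat = m[i] / (1 - beta1 ** t)
            v_hat = v[i] / (1 - beta2 ** t)
            theta[i] -= eta * m_hat / (math.sqrt(v_hat) + eps) + eta * lam * theta[i]
        losses.append(loss)
    return theta, losses

theta, losses = train()
```

With these settings the fitted slope and intercept should end up near 3 and -1, pulled slightly toward zero by the weight decay.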

In short, the standard loop repeats one step per mini-batch: clip the gradient to \hat{g}, then apply AdamW under the scheduled rate \eta_t.

Watch a large training run and the loss curve is not perfectly smooth — every so often it jumps upward in a spike. The usual culprit is one pathological mini-batch whose gradient is enormous. Without protection, AdamW takes a huge step along it, the parameters lurch into a bad region, and the loss leaps — sometimes never to recover (a divergence).

Clipping is the seatbelt. A spike of norm 50c is rescaled to c, so the step it causes is no larger than an ordinary one — the model wobbles but does not blow up, and the next good batch pulls it back. The schedule helps too: warmup keeps early steps small while Adam's second moment is still noisy, and the cosine tail shrinks \eta_t late in training, so the same raw spike does ever less damage as the run proceeds. Clipping caps the gradient; the schedule caps the rate — together they keep a long run on the rails.

The loss and the schedule, together

In the accompanying plot, the bold curve is a typical training loss falling over steps; the faint curve is the learning rate \eta_t on the same time axis: the warmup ramp on the left, then the long cosine decay. A longer warmup moves the rate's peak later, and a higher peak rate raises it; a longer, gentler schedule gives a smoother loss. Notice the loss falls fastest while the rate is high and flattens as the rate glides to its floor.

From recipe to model

This loop, run at scale over a vast text corpus, is what produces a pretrained language model. Everything that follows in this track — mixed precision, gradient accumulation, checkpointing, and data parallelism — does not change this recipe. It changes how to fit and speed up the very same step on real, finite hardware.