You now own every ingredient separately: Adam (in its decoupled-decay form, AdamW), a
warmup-then-cosine learning-rate schedule, and the noisy gradient of mini-batch SGD. This
page assembles them into the single loop that trains essentially every modern large
language model: the recipe.
Two pieces are still missing, and both guard against instability. Weight decay keeps the
parameters from drifting large, and gradient clipping rescales any oversized gradient
(typically the product of a single bad mini-batch) down to a fixed maximum norm before it
can land a destabilizing step.
Deriving the training step
One step processes one mini-batch \mathcal{B} of examples and
nudges the parameters \theta once. We build it line by line; the
whole loop is just this step repeated for T steps.
Step 1 — forward pass. Run the network on the batch and average the
per-example loss over its B members:
L(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} L_i(\theta).
Step 2 — backward pass. Backpropagate to get the gradient of that loss with
respect to every parameter (this is what automatic differentiation computes for you):
g = \nabla_\theta L(\theta).
Step 3 — clip the gradient. Measure the gradient's global norm
\lVert g \rVert (stack every parameter's gradient into one vector
and take its length). If it exceeds a threshold c, rescale the
whole vector down so its norm is exactly c; otherwise leave it
alone:
\hat{g} = g \cdot \min\!\left(1,\ \frac{c}{\lVert g \rVert}\right).
The direction is untouched — only the length is capped. A normal step passes through
unchanged; a freak spike of norm 100c is shrunk back to
c, so it can no longer wreck the run. (This is the same gradient clipping that tames
exploding gradients.)
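The clipping rule in Step 3 can be sketched in a few lines of plain Python. This is a minimal illustration, not a library implementation: the function name `clip_by_global_norm`, the list-of-lists gradient layout, and the zero-gradient guard are assumptions made for the example.

```python
import math

def clip_by_global_norm(grads, c):
    """Rescale gradients so their global norm is at most c.

    grads: list of flat lists of floats, one inner list per parameter
           (a hypothetical layout chosen for this sketch).
    Returns (clipped_grads, original_global_norm).
    """
    # Global norm: treat every entry of every gradient as one long vector.
    norm = math.sqrt(sum(x * x for g in grads for x in g))
    # Scale factor min(1, c / norm); a zero gradient is left alone
    # so we never divide by zero.
    scale = 1.0 if norm == 0.0 else min(1.0, c / norm)
    clipped = [[x * scale for x in g] for g in grads]
    return clipped, norm

# Two "parameters" whose stacked gradient is (3, 4, 12).
grads = [[3.0, 4.0], [12.0]]
clipped, norm = clip_by_global_norm(grads, c=1.0)
clipped_norm = math.sqrt(sum(x * x for g in clipped for x in g))
print(norm, clipped_norm)
```

Because every entry is multiplied by the same factor, the ratios between components (the direction) are preserved; only the overall length changes.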
Step 4 — schedule the learning rate. Look up this step's rate from the
warmup-then-cosine schedule — a short linear ramp to a peak
\eta_{\max}, then a cosine glide to a floor
\eta_{\min}:
\eta_t = \begin{cases} \eta_{\max}\, \dfrac{t}{T_{\text{warm}}}, & t < T_{\text{warm}}, \\[2ex] \eta_{\min} + \tfrac{1}{2}(\eta_{\max} - \eta_{\min})\big(1 + \cos(\pi p)\big), & t \ge T_{\text{warm}}, \end{cases}
with decay progress p = (t - T_{\text{warm}})/(T - T_{\text{warm}}).
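As a sketch, the schedule formula above translates directly into a small function. The name `lr_at` and the numeric values in the usage lines are illustrative assumptions, not recommended settings; the function also assumes T > T_warm so the decay progress is well defined.

```python
import math

def lr_at(t, T, T_warm, eta_max, eta_min):
    """Warmup-then-cosine learning rate at step t (assumes T > T_warm)."""
    if t < T_warm:
        # Linear ramp from 0 toward eta_max.
        return eta_max * t / T_warm
    # Decay progress p runs from 0 (end of warmup) to 1 (final step).
    p = (t - T_warm) / (T - T_warm)
    return eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))

# Hypothetical settings, chosen only to exercise the function.
T, T_warm, eta_max, eta_min = 1000, 100, 3e-4, 3e-5
for t in (0, 50, 100, 550, 1000):
    print(t, lr_at(t, T, T_warm, eta_max, eta_min))
```

The two branches meet at t = T_warm: there p = 0, the cosine term equals 1, and the second branch evaluates to \eta_{\max}, so the schedule has no jump at the handover.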
Step 5 — the AdamW update. Feed the clipped gradient
\hat{g} into Adam's two moments, bias-correct them, take the
adaptive step, and — decoupled — shrink the weights by
\eta_t\,\lambda (the weight decay):
\theta_t = \theta_{t-1} - \eta_t\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \varepsilon} - \eta_t\,\lambda\,\theta_{t-1}.
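To make the moments and bias correction concrete, here is a minimal sketch of one AdamW step on flat lists of floats. The function name `adamw_step` is hypothetical, and the defaults for `beta1`, `beta2`, and `eps` are commonly used Adam values assumed for the example; the text above does not fix them.

```python
import math

def adamw_step(theta, g_hat, m, v, t, eta_t, lam,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """One AdamW update; t is the 1-based step count.

    theta, g_hat, m, v are equal-length lists of floats.
    Returns the new (theta, m, v) without modifying the inputs.
    """
    new_theta, new_m, new_v = [], [], []
    for th, g, m_i, v_i in zip(theta, g_hat, m, v):
        # Exponential moving averages of the gradient and its square.
        m_i = beta1 * m_i + (1.0 - beta1) * g
        v_i = beta2 * v_i + (1.0 - beta2) * g * g
        # Bias correction for the zero initialization of m and v.
        m_hat = m_i / (1.0 - beta1 ** t)
        v_hat = v_i / (1.0 - beta2 ** t)
        # Adaptive step plus decoupled decay, both using the old weight.
        th = th - eta_t * m_hat / (math.sqrt(v_hat) + eps) - eta_t * lam * th
        new_theta.append(th)
        new_m.append(m_i)
        new_v.append(v_i)
    return new_theta, new_m, new_v

theta, m, v = adamw_step([1.0], [0.5], [0.0], [0.0], t=1, eta_t=0.1, lam=0.01)
print(theta, m, v)
```

Note that the decay term multiplies the weight itself rather than being added to the gradient, so it bypasses the division by \sqrt{\hat{v}_t}; that is what "decoupled" refers to.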
That is the entire recipe. Repeat steps 1–5 over mini-batches for
T steps and you have trained a deep network the way the field
actually does it: forward → loss → backward → clip → AdamW step → advance the
schedule.
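The five steps can be assembled into one runnable loop. The sketch below is a toy under stated assumptions: it fits a two-parameter linear model to synthetic data in plain Python, with an analytic gradient standing in for backpropagation. The function name `train` and every hyperparameter value are illustrative choices, not settings taken from real training runs.

```python
import math
import random

def train(T=500, T_warm=50, B=8, eta_max=0.05, eta_min=0.005,
          lam=0.01, c=1.0, beta1=0.9, beta2=0.999, eps=1e-8, seed=0):
    rng = random.Random(seed)
    # Synthetic data (an assumption of this sketch): y = 2x + 1 plus small noise.
    xs = [rng.uniform(-1.0, 1.0) for _ in range(256)]
    data = [(x, 2.0 * x + 1.0 + rng.gauss(0.0, 0.01)) for x in xs]

    theta = [0.0, 0.0]            # parameters [w, b]
    m = [0.0, 0.0]                # Adam first moment
    v = [0.0, 0.0]                # Adam second moment
    losses = []
    for t in range(1, T + 1):
        batch = rng.sample(data, B)
        # Step 1: forward pass, mean squared error over the batch.
        errs = [theta[0] * x + theta[1] - y for x, y in batch]
        loss = sum(e * e for e in errs) / B
        # Step 2: backward pass (analytic gradient for this toy model).
        g = [sum(2.0 * e * x for e, (x, _) in zip(errs, batch)) / B,
             sum(2.0 * e for e in errs) / B]
        # Step 3: clip by global norm.
        norm = math.sqrt(sum(gi * gi for gi in g))
        scale = 1.0 if norm == 0.0 else min(1.0, c / norm)
        g_hat = [gi * scale for gi in g]
        # Step 4: warmup-then-cosine learning rate.
        if t < T_warm:
            eta = eta_max * t / T_warm
        else:
            p = (t - T_warm) / (T - T_warm)
            eta = eta_min + 0.5 * (eta_max - eta_min) * (1.0 + math.cos(math.pi * p))
        # Step 5: AdamW update with decoupled weight decay.
        for i in range(len(theta)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g_hat[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g_hat[i] ** 2
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            theta[i] -= eta * (m_hat / (math.sqrt(v_hat) + eps) + lam * theta[i])
        losses.append(loss)
    return theta, losses

theta, losses = train()
print("final parameters:", theta)
print("first loss:", losses[0], "mean of last 20:", sum(losses[-20:]) / 20)
```

A real run would replace the analytic gradient with automatic differentiation and the toy model with a network, but the order of operations inside the loop would stay the same.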
The standard loop repeats, per mini-batch, the step
\hat{g} \to \text{AdamW} under a scheduled rate
\eta_t. Its four signature choices:
- AdamW optimizer. Adam's per-parameter adaptive step with decoupled weight decay, the
  default for transformers.
- Warmup + cosine schedule. A linear ramp to \eta_{\max}, then a cosine decay to
  \eta_{\min}.
- Weight decay. A coefficient \lambda shrinking \theta each step, regularizing the model.
- Gradient clipping. Rescale g to norm at most c via
  \hat{g} = g\,\min(1, c/\lVert g\rVert), taming spikes.
Watch a large training run and the loss curve is not perfectly smooth — every so often it
jumps upward in a spike. The usual culprit is one pathological
mini-batch whose gradient is enormous. Without protection, AdamW takes a huge step along
it, the parameters lurch into a bad region, and the loss leaps — sometimes never to
recover (a divergence).
Clipping is the seatbelt. A spike of norm 50c is rescaled to
c, so the step it causes is no larger than an ordinary one — the
model wobbles but does not blow up, and the next good batch pulls it back. The schedule
helps too: warmup keeps early steps small while Adam's second moment is still noisy, and
the cosine tail shrinks \eta_t late in training, so the same raw
spike does ever less damage as the run proceeds. Clipping caps the gradient; the schedule
caps the rate — together they keep a long run on the rails.
From recipe to model
This loop, run at scale over a vast text corpus, is what produces a pretrained language
model. Everything that follows in this track (mixed precision, gradient accumulation,
checkpointing, and data parallelism) does not change this recipe. It changes how to fit
and speed up the very same step on real, finite hardware.