Training a
Three flavours sit on a spectrum. Full-batch uses every example: an exact
gradient, but each step is slow. Stochastic (SGD) uses a single example: very
fast per step, but noisy. Mini-batch uses
The loss is an average of per-example losses
Step 1 — the loss is an average. Write the total training loss as the mean of the per-example losses:
Step 2 — the gradient is the same average. The
Step 3 — sample a mini-batch. Draw a random subset
Step 4 — the estimate is unbiased. Take the
So
Step 5 — the update. Step the parameters along the negative mini-batch
gradient, with learning rate
One pass through all
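The unbiasedness claim in Steps 1 to 4 can be checked numerically. The sketch below is only an illustration under stated assumptions: the toy linear-regression problem, the squared-error loss, and the names `N`, `D`, `B`, and `grad` are hypothetical choices, not notation taken from the text. It averages many independent mini-batch gradients and compares the result with the exact full-batch gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy problem: linear regression with squared-error loss.
N, D, B = 1000, 3, 32          # dataset size, feature count, batch size (illustrative)
X = rng.normal(size=(N, D))
y = X @ np.array([2.0, -1.0, 0.5]) + 0.1 * rng.normal(size=N)
w = np.zeros(D)                # current parameters


def grad(w, X, y):
    """Gradient of the mean of per-example losses 0.5 * (x_i @ w - y_i)**2."""
    return X.T @ (X @ w - y) / len(y)


full_grad = grad(w, X, y)      # exact gradient: average over all N examples

# Average many independent mini-batch gradients; each batch is a uniform
# random subset of B examples drawn without replacement.
trials = 5000
mean_estimate = np.zeros(D)
for _ in range(trials):
    idx = rng.choice(N, size=B, replace=False)
    mean_estimate += grad(w, X[idx], y[idx])
mean_estimate /= trials

max_gap = np.max(np.abs(mean_estimate - full_grad))
print("full-batch gradient:     ", full_grad)
print("mean mini-batch gradient:", mean_estimate)
print("largest component gap:   ", max_gap)
```

Any single mini-batch gradient can be far from the full gradient, but the gap between the averaged estimate and the exact gradient should shrink toward zero as `trials` grows, which is what "unbiased" means in practice.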
Full-batch descent follows the exact gradient — and can roll straight into the nearest narrow, brittle minimum, then get stuck. The noise in a mini-batch gradient acts like a gentle jiggle: it lets the optimizer rattle out of sharp, low-quality minima and settle instead in broad, flat basins.
That matters because flat minima tend to generalize better. In a wide basin the loss barely changes when the parameters are nudged a little, or when a shift between training and test data nudges the loss surface, so the model is robust; a sharp minimum is a knife-edge fit to the training set. The "noise" everyone tries to remove from the gradient turns out to act as a quiet, free regularizer, which is one more reason small batches often generalize better than huge ones, even when huge ones converge faster.
Both paths start in the same spot and head for the bottom of the loss bowl. The smooth path follows the exact full-batch gradient. The jittery path follows mini-batch estimates — same average direction, but with a random wobble each step. Drag the batch size slider: a small batch is very noisy, a large batch tightens the path toward the smooth one. Hit Refresh to roll a fresh set of random batches.
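The effect of the batch size slider can be approximated offline. The following is a rough sketch, assuming a hypothetical toy regression problem (the data, the `grad` helper, and the `gradient_noise` function are all illustrative, not part of the demo). It measures how far mini-batch gradients stray from the full-batch gradient at a fixed parameter value, for a few batch sizes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical toy data: two features, squared-error loss.
N, D = 2000, 2
X = rng.normal(size=(N, D))
y = X @ np.array([1.0, -2.0]) + 0.1 * rng.normal(size=N)
w = np.zeros(D)                # parameters at which the noise is measured


def grad(w, X, y):
    """Gradient of the mean squared-error loss over the given examples."""
    return X.T @ (X @ w - y) / len(y)


full_grad = grad(w, X, y)


def gradient_noise(batch_size, trials=2000):
    """Root-mean-square distance between mini-batch and full-batch gradients."""
    total_sq = 0.0
    for _ in range(trials):
        idx = rng.choice(N, size=batch_size, replace=False)
        total_sq += np.sum((grad(w, X[idx], y[idx]) - full_grad) ** 2)
    return np.sqrt(total_sq / trials)


noise = {b: gradient_noise(b) for b in (1, 16, 256)}
for b, value in noise.items():
    print(f"batch size {b:>3}: RMS deviation {value:.3f}")
```

In this setup the deviation should fall roughly like one over the square root of the batch size, and it vanishes when the batch is the whole dataset, which matches the jittery path tightening toward the smooth one.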
Mini-batch SGD is the beating heart of the training loop you will meet everywhere:
shuffle the data, slice it into batches, and for each batch compute the gradient and step.
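The loop just described can be sketched in a few lines of NumPy. This is a minimal illustration, not a reference implementation: the function name `minibatch_sgd`, the linear model with squared-error loss, and the default values for `lr`, `batch_size`, and `epochs` are all assumptions chosen to keep the example small.

```python
import numpy as np


def minibatch_sgd(X, y, lr=0.1, batch_size=32, epochs=20, seed=0):
    """Fit linear-regression weights with plain mini-batch SGD (illustrative)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(epochs):
        order = rng.permutation(n)               # shuffle the data
        for start in range(0, n, batch_size):    # slice it into batches
            idx = order[start:start + batch_size]
            residual = X[idx] @ w - y[idx]
            g = X[idx].T @ residual / len(idx)   # mini-batch gradient
            w -= lr * g                          # step along the negative gradient
    return w


# Hypothetical synthetic data with known weights, used only to exercise the loop.
rng = np.random.default_rng(42)
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(500, 3))
y = X @ true_w + 0.05 * rng.normal(size=500)

w_hat = minibatch_sgd(X, y)
print("recovered weights:", w_hat)
```

Reshuffling at the start of every epoch means each pass sees the examples in a different order, so the batches differ from one epoch to the next. The last batch of an epoch may be smaller than `batch_size`, which is why the gradient is divided by `len(idx)` rather than by a fixed constant.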
Everything that follows —