Mini-Batch SGD

Training a deep network means running gradient descent on a loss averaged over your data. The catch: there can be millions of examples, and one honest gradient touches all of them. So we cheat — carefully. We estimate the gradient from a small random handful of examples each step. Done right, the estimate points the right way on average, and we take far more steps per second.

Three flavours sit on a spectrum. Full-batch uses every example: an exact gradient, but one slow step. Stochastic (SGD) uses a single example: blazing but noisy. Mini-batch uses B examples — the practical middle ground that runs essentially all of deep learning.

Deriving the mini-batch gradient

The loss is an average of per-example losses L_i(\theta) over the N training examples. We will show, line by line, that the gradient over a small random batch is an honest stand-in for the full gradient.

Step 1 — the loss is an average. Write the total training loss as the mean of the per-example losses:

L(\theta) = \frac{1}{N} \sum_{i=1}^{N} L_i(\theta).

Step 2 — the gradient is the same average. The gradient \nabla is linear, so it passes through the sum and the constant 1/N:

\nabla L(\theta) = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta).

Step 3 — sample a mini-batch. Draw a random subset \mathcal{B} of B indices (uniformly, without replacement) and average the gradient over just those:

g_{\mathcal{B}}(\theta) = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\theta).

Step 4 — the estimate is unbiased. Take the expectation over the random choice of batch. Each index is equally likely to appear, so the expected per-term gradient is the average of all N of them:

\mathbb{E}_{\mathcal{B}}\!\left[\, g_{\mathcal{B}}(\theta) \,\right] = \frac{1}{B} \sum_{i \in \mathcal{B}} \mathbb{E}\big[\nabla L_i\big] = \frac{1}{N} \sum_{i=1}^{N} \nabla L_i(\theta) = \nabla L(\theta).

So g_{\mathcal{B}} is an unbiased estimate of the true gradient: on average it points exactly downhill. It is noisy, but it is not biased — the wobble cancels out, the direction does not.

Step 5 — the update. Step the parameters along the negative mini-batch gradient, with learning rate η:

\theta \;\leftarrow\; \theta - \eta\, g_{\mathcal{B}}(\theta) \;=\; \theta - \eta\,\frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i(\theta).

One pass through all N examples — that is one epoch — now takes N/B of these cheap steps instead of a single expensive one. Smaller B means more steps, more noise; larger B means fewer steps, a smoother but slower descent.

Let L(\theta) = \tfrac{1}{N}\sum_{i=1}^{N} L_i(\theta) and let \mathcal{B} be a uniformly random batch of size B. Then:

Unbiased estimate. The mini-batch gradient g_{\mathcal{B}} = \tfrac{1}{B}\sum_{i\in\mathcal{B}} \nabla L_i satisfies \mathbb{E}_{\mathcal{B}}[g_{\mathcal{B}}] = \nabla L(\theta).
Update rule. Each step is \theta \leftarrow \theta - \eta\, g_{\mathcal{B}}(\theta).
Epoch. One epoch is one full sweep of the data, comprising N/B mini-batch updates.

Full-batch descent follows the exact gradient — and can roll straight into the nearest narrow, brittle minimum, then get stuck. The noise in a mini-batch gradient acts like a gentle jiggle: it lets the optimizer rattle out of sharp, low-quality minima and settle instead in broad, flat basins.

That matters because flat minima generalize better. A wide basin means the loss barely changes if the test distribution shifts the parameters a little, so the model is robust; a sharp spike is a knife-edge fit to the training set. The "noise" everyone tries to remove from the gradient turns out to be a quiet, free regularizer — one more reason small batches often generalize better than huge ones, even when huge ones converge faster.

Smooth vs jittery descent

Both paths start in the same spot and head for the bottom of the loss bowl. The smooth path follows the exact full-batch gradient. The jittery path follows mini-batch estimates — same average direction, but with a random wobble each step. Drag the batch size slider: a small batch is very noisy, a large batch tightens the path toward the smooth one. Hit Refresh to roll a fresh set of random batches.

The default training loop

Mini-batch SGD is the beating heart of the training loop you will meet everywhere: shuffle the data, slice it into batches, and for each batch compute the gradient and step. Everything that follows — momentum, Adam, and learning-rate schedules — is a refinement of how to use this noisy gradient, not a replacement for it.