Large models often train best with a huge batch — thousands of examples per step —
because the
mini-batch
gradient is then a low-noise estimate of the true gradient. But a huge batch's
activations may not fit in GPU memory, even after
mixed
precision halves them. You are stuck with a batch a fraction of the size you want.
Gradient accumulation escapes the trap. Instead of one batch of size
K\cdot B (which won't fit), run K small
micro-batches of size B one after another, add up
their gradients, and take a single optimizer step. You pay in wall-clock time, but
you get the gradient of a batch K times larger than memory allows —
exactly, not approximately.
Deriving the accumulated gradient
Suppose the big batch you wish you could run is
\mathcal{B} of size K\cdot B. Split it
into K disjoint micro-batches
\mathcal{B}_1, \dots, \mathcal{B}_K, each of size
B. We show the accumulated gradient equals the big-batch gradient.
Step 1 — the big-batch gradient. It is the average of the per-example
gradients over all K B examples:
g_{\mathcal{B}} = \frac{1}{KB} \sum_{i \in \mathcal{B}} \nabla L_i(\theta).
Step 2 — split the sum over the micro-batches. The examples partition into
the K chunks, so the one big sum is the sum of
K chunk-sums (no approximation — just regrouping the terms):
\sum_{i \in \mathcal{B}} \nabla L_i = \sum_{k=1}^{K} \ \sum_{i \in \mathcal{B}_k} \nabla L_i.
Step 3 — name each micro-batch's gradient. Let
g_k be the average gradient over micro-batch
k — exactly what one ordinary forward/backward on that chunk
computes:
g_k = \frac{1}{B} \sum_{i \in \mathcal{B}_k} \nabla L_i, \qquad\text{so}\qquad \sum_{i \in \mathcal{B}_k} \nabla L_i = B\, g_k.
Step 4 — substitute and simplify. Put Step 3 into Steps 1–2; the
B cancels and the big-batch gradient is just the
mean of the micro-batch gradients:
g_{\mathcal{B}} = \frac{1}{KB} \sum_{k=1}^{K} B\, g_k = \frac{1}{K} \sum_{k=1}^{K} g_k.
Step 5 — accumulate, then step once. So run the
K micro-batches, sum their gradients into one buffer, and
divide by K (or sum each g_k/K) before a
single optimizer update:
\theta \leftarrow \theta - \eta\, \underbrace{\frac{1}{K}\sum_{k=1}^{K} g_k}_{=\,g_{\mathcal{B}}}.
The result is bit-for-bit the step the giant batch would have taken — the
effective batch size is K\cdot B, while peak
memory only ever holds one micro-batch of B. You traded
K\times the time for K\times the batch,
at constant memory.
Splitting a batch of size K B into K
micro-batches of size B:
-
Accumulate then step. Sum the K micro-batch
gradients and take one optimizer step (not K steps).
-
Effective batch. The optimizer sees an effective batch size of
K \cdot B, at the memory cost of a single micro-batch
B.
-
Exactly the large-batch gradient. By linearity,
\tfrac{1}{K}\sum_k g_k = g_{\mathcal{B}} — identical to the
gradient of the full KB-batch, not an approximation.
Accumulation is a pure time-for-memory trade: K micro-batches run
sequentially, so a step takes about K\times as long, but
peak memory never exceeds one micro-batch. When memory is the wall and you have only one
device, that is exactly the trade you want.
The picture sharpens next to its sibling,
data
parallelism, which runs the chunks in parallel across many GPUs and
all-reduces their gradients — same accumulated mean
\tfrac1K\sum_k g_k, but in wall-clock time of one micro-batch
instead of K. In practice the two compose: each GPU
accumulates a few micro-batches locally, and the cluster
all-reduces across GPUs, so the effective batch is
(\text{GPUs}) \times K \times B. Accumulation buys batch when you
are out of memory; parallelism buys batch when you have spare GPUs.