Gradient Accumulation

Large models often train best with a huge batch — thousands of examples per step — because the mini-batch gradient is then a low-noise estimate of the true gradient. But a huge batch's activations may not fit in GPU memory, even after mixed precision halves them. You are stuck with a batch a fraction of the size you want.

Gradient accumulation escapes the trap. Instead of one batch of size K\cdot B (which won't fit), run K small micro-batches of size B one after another, add up their gradients, and take a single optimizer step. You pay in wall-clock time, but you get the gradient of a batch K times larger than memory allows — exactly, not approximately.

Deriving the accumulated gradient

Suppose the big batch you wish you could run is \mathcal{B} of size K\cdot B. Split it into K disjoint micro-batches \mathcal{B}_1, \dots, \mathcal{B}_K, each of size B. We show the accumulated gradient equals the big-batch gradient.

Step 1 — the big-batch gradient. It is the average of the per-example gradients over all K B examples:

g_{\mathcal{B}} = \frac{1}{KB} \sum_{i \in \mathcal{B}} \nabla L_i(\theta).

Step 2 — split the sum over the micro-batches. The examples partition into the K chunks, so the one big sum is the sum of K chunk-sums (no approximation — just regrouping the terms):

\sum_{i \in \mathcal{B}} \nabla L_i = \sum_{k=1}^{K} \ \sum_{i \in \mathcal{B}_k} \nabla L_i.

Step 3 — name each micro-batch's gradient. Let g_k be the average gradient over micro-batch k — exactly what one ordinary forward/backward on that chunk computes:

g_k = \frac{1}{B} \sum_{i \in \mathcal{B}_k} \nabla L_i, \qquad\text{so}\qquad \sum_{i \in \mathcal{B}_k} \nabla L_i = B\, g_k.

Step 4 — substitute and simplify. Put Step 3 into Steps 1–2; the B cancels and the big-batch gradient is just the mean of the micro-batch gradients:

g_{\mathcal{B}} = \frac{1}{KB} \sum_{k=1}^{K} B\, g_k = \frac{1}{K} \sum_{k=1}^{K} g_k.

Step 5 — accumulate, then step once. So run the K micro-batches, sum their gradients into one buffer, and divide by K (or sum each g_k/K) before a single optimizer update:

\theta \leftarrow \theta - \eta\, \underbrace{\frac{1}{K}\sum_{k=1}^{K} g_k}_{=\,g_{\mathcal{B}}}.

The result is bit-for-bit the step the giant batch would have taken — the effective batch size is K\cdot B, while peak memory only ever holds one micro-batch of B. You traded K\times the time for K\times the batch, at constant memory.

Splitting a batch of size K B into K micro-batches of size B:

Accumulation is a pure time-for-memory trade: K micro-batches run sequentially, so a step takes about K\times as long, but peak memory never exceeds one micro-batch. When memory is the wall and you have only one device, that is exactly the trade you want.

The picture sharpens next to its sibling, data parallelism, which runs the chunks in parallel across many GPUs and all-reduces their gradients — same accumulated mean \tfrac1K\sum_k g_k, but in wall-clock time of one micro-batch instead of K. In practice the two compose: each GPU accumulates a few micro-batches locally, and the cluster all-reduces across GPUs, so the effective batch is (\text{GPUs}) \times K \times B. Accumulation buys batch when you are out of memory; parallelism buys batch when you have spare GPUs.

Many small gradients, one big step

Each thin arrow is one micro-batch gradient g_k; together they sum (tip-to-tail) into the single bold step the optimizer actually takes — the accumulated gradient. Drag K to add micro-batches: more arrows accumulate, and the effective batch size K\cdot B grows with them while each individual arrow (and the memory it costs) stays the same size.

Batch size, decoupled from memory

With accumulation, the batch the optimizer sees is no longer chained to what memory holds — you dial in any effective batch you like by choosing K. The remaining memory hog is the stack of activations a single micro-batch's backward pass must keep around; the next page attacks exactly that.