Every trick so far —
mixed
precision,
accumulation,
checkpointing
— squeezed a bigger job onto one GPU. To go genuinely faster you need
many. Data parallelism is the first and simplest way to use a whole
cluster: put a full copy of the model on every GPU and have each one chew through a different
slice of the batch.
The one subtlety is keeping the copies identical. If each replica just stepped on its
own slice's gradient, the copies would drift apart into N different
models. The fix is a single collective operation — an all-reduce — that
averages the gradients across all GPUs before anyone takes a step, so every replica applies the
same update and stays in lockstep.
Deriving the all-reduce step
Run on N GPUs. The global batch
\mathcal{B} of size B is split into
N disjoint shards
\mathcal{B}_1, \dots, \mathcal{B}_N, one per GPU, each of size
B/N. We show that averaging the per-GPU gradients reproduces the
full-batch gradient — so all replicas can step together.
Step 1 — replicate the model. Every GPU j starts
the step with the same parameters \theta. (This is the
invariant we must preserve.)
Step 2 — each GPU computes a local gradient. GPU
j does a forward/backward on its shard alone, getting the average
gradient over its B/N examples:
g_j = \frac{N}{B} \sum_{i \in \mathcal{B}_j} \nabla L_i(\theta).
Step 3 — average across GPUs (the all-reduce). The
all-reduce collective sums a value held on every GPU and hands the
same total back to all of them. Average the N local
gradients:
\bar g = \frac{1}{N} \sum_{j=1}^{N} g_j.
Step 4 — it equals the full-batch gradient. Substitute Step 2 and recombine
the shards into the whole batch — the N cancels and the shards
reunite:
\bar g = \frac{1}{N} \sum_{j=1}^{N} \frac{N}{B} \sum_{i \in \mathcal{B}_j} \nabla L_i = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i = g_{\mathcal{B}}.
Step 5 — every replica applies the same update. Because each GPU now holds
the identical averaged gradient \bar g, they all take the identical
step — so they remain bit-for-bit copies, ready for the next batch:
\theta \leftarrow \theta - \eta\, \bar g \qquad \text{(same on every GPU)}.
Each GPU did only 1/N of the work, so with
N of them a step takes roughly 1/N the
time: throughput scales ~linearly in N — as long
as the all-reduce communication does not become the bottleneck.
Across N GPUs, with global batch
B:
-
Replicate the model. Every GPU holds a full copy of the parameters
\theta.
-
Split the batch. Each GPU processes a disjoint shard of size
B/N and computes its own local gradient.
-
All-reduce-average the gradients.
\bar g = \tfrac{1}{N}\sum_j g_j = g_{\mathcal{B}} — the exact
full-batch gradient, made available on every GPU.
-
Synchronized update. Every replica applies the same
\theta \leftarrow \theta - \eta\,\bar g, staying identical;
throughput scales ~linearly in N.
Linear scaling has a catch: the all-reduce must move the entire gradient — one
number per parameter, so gigabytes for a large model — across the network every
step. If the GPUs compute faster than the interconnect can shuffle gradients, they sit idle
waiting for the all-reduce, and adding more GPUs stops helping. This is the communication
wall of data parallelism.
Hardware and algorithms fight back. Ring all-reduce arranges the GPUs in a
ring and passes partial sums around it, so each GPU only ever talks to its two neighbours and
the total bytes moved are independent of N — bandwidth-optimal.
Fast interconnects (NVLink, InfiniBand) and overlapping communication with the backward pass
(all-reducing early layers' gradients while later layers are still computing) hide most of
the cost. When even an optimal all-reduce is too slow — or the model itself no longer fits on
one GPU — you reach for model/tensor and pipeline parallelism, where the model, not
just the batch, is split across devices.
From recipe to cluster
You have now followed the modern
training
recipe from a single step to a humming cluster: AdamW with a scheduled rate,
gradient clipping for stability, 16-bit math for speed, accumulation and checkpointing to fit
the batch, and data parallelism to spread it across GPUs. This is, in its essentials, how every
large
language
model you use was trained.