Data Parallelism

Every trick so far — mixed precision, accumulation, checkpointing — squeezed a bigger job onto one GPU. To go genuinely faster you need many. Data parallelism is the first and simplest way to use a whole cluster: put a full copy of the model on every GPU and have each one chew through a different slice of the batch.

The one subtlety is keeping the copies identical. If each replica just stepped on its own slice's gradient, the copies would drift apart into N different models. The fix is a single collective operation — an all-reduce — that averages the gradients across all GPUs before anyone takes a step, so every replica applies the same update and stays in lockstep.

Deriving the all-reduce step

Run on N GPUs. The global batch \mathcal{B} of size B is split into N disjoint shards \mathcal{B}_1, \dots, \mathcal{B}_N, one per GPU, each of size B/N. We show that averaging the per-GPU gradients reproduces the full-batch gradient — so all replicas can step together.

Step 1 — replicate the model. Every GPU j starts the step with the same parameters \theta. (This is the invariant we must preserve.)

Step 2 — each GPU computes a local gradient. GPU j does a forward/backward on its shard alone, getting the average gradient over its B/N examples:

g_j = \frac{N}{B} \sum_{i \in \mathcal{B}_j} \nabla L_i(\theta).

Step 3 — average across GPUs (the all-reduce). The all-reduce collective sums a value held on every GPU and hands the same total back to all of them. Average the N local gradients:

\bar g = \frac{1}{N} \sum_{j=1}^{N} g_j.

Step 4 — it equals the full-batch gradient. Substitute Step 2 and recombine the shards into the whole batch — the N cancels and the shards reunite:

\bar g = \frac{1}{N} \sum_{j=1}^{N} \frac{N}{B} \sum_{i \in \mathcal{B}_j} \nabla L_i = \frac{1}{B} \sum_{i \in \mathcal{B}} \nabla L_i = g_{\mathcal{B}}.

Step 5 — every replica applies the same update. Because each GPU now holds the identical averaged gradient \bar g, they all take the identical step — so they remain bit-for-bit copies, ready for the next batch:

\theta \leftarrow \theta - \eta\, \bar g \qquad \text{(same on every GPU)}.

Each GPU did only 1/N of the work, so with N of them a step takes roughly 1/N the time: throughput scales ~linearly in N — as long as the all-reduce communication does not become the bottleneck.

Across N GPUs, with global batch B:

Replicate the model. Every GPU holds a full copy of the parameters \theta.
Split the batch. Each GPU processes a disjoint shard of size B/N and computes its own local gradient.
All-reduce-average the gradients. \bar g = \tfrac{1}{N}\sum_j g_j = g_{\mathcal{B}} — the exact full-batch gradient, made available on every GPU.
Synchronized update. Every replica applies the same \theta \leftarrow \theta - \eta\,\bar g, staying identical; throughput scales ~linearly in N.

Linear scaling has a catch: the all-reduce must move the entire gradient — one number per parameter, so gigabytes for a large model — across the network every step. If the GPUs compute faster than the interconnect can shuffle gradients, they sit idle waiting for the all-reduce, and adding more GPUs stops helping. This is the communication wall of data parallelism.

Hardware and algorithms fight back. Ring all-reduce arranges the GPUs in a ring and passes partial sums around it, so each GPU only ever talks to its two neighbours and the total bytes moved are independent of N — bandwidth-optimal. Fast interconnects (NVLink, InfiniBand) and overlapping communication with the backward pass (all-reducing early layers' gradients while later layers are still computing) hide most of the cost. When even an optimal all-reduce is too slow — or the model itself no longer fits on one GPU — you reach for model/tensor and pipeline parallelism, where the model, not just the batch, is split across devices.

Replicas in a ring

Each node is a GPU holding a full model replica on its own batch shard B/N; the arrows around the ring are the all-reduce that averages every replica's gradient and hands the same \bar g back to all of them. Drag N to add GPUs: the ring grows, each shard B/N shrinks, and (communication permitting) the step gets faster.

From recipe to cluster

You have now followed the modern training recipe from a single step to a humming cluster: AdamW with a scheduled rate, gradient clipping for stability, 16-bit math for speed, accumulation and checkpointing to fit the batch, and data parallelism to spread it across GPUs. This is, in its essentials, how every large language model you use was trained.