We now have four tools for training a model too big or too slow for one GPU:
The three parallelism axes are orthogonal: each splits a different dimension of the problem, so their degrees multiply. Stack all three and you get 3-D parallelism, the standard recipe for training most of today's largest models.
Picture a fleet of GPUs and assign each a coordinate
Step 1 — tensor parallelism within a node. Tensor parallelism all-reduces every
layer, on every forward and backward pass, so it demands the fastest interconnect you have. Put a
tensor-parallel group of size
Step 2 — pipeline parallelism across nodes. Pipeline parallelism talks rarely
(only activations at the stage cuts), so it tolerates the slower network between servers.
Chain
Step 3 — data parallelism on top. A single
Step 4 — shard and accelerate within. Layer ZeRO/FSDP across the data-parallel
replicas so no GPU stores a redundant full copy of the optimizer state, and run FlashAttention
inside every attention layer so the
Step 5 — count the GPUs. Because the axes are independent, the total is simply their product:
total GPUs = (tensor-parallel size) × (pipeline-parallel size) × (data-parallel size)
With, say,
Notice that every design choice above is really about communication. The placement rule — tensor-parallel within a node, pipeline across nodes — exists because the bandwidth inside a server (NVLink, hundreds of GB/s) is an order of magnitude higher than the bandwidth between servers (InfiniBand/Ethernet). Match the chattiest parallelism to the fastest link.
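The placement rule can be made concrete with a small rank-mapping sketch. It assumes global ranks are numbered node by node, and that the tensor-parallel index varies fastest, then the pipeline stage, then the data-parallel replica. The names `Coord`, `rank_to_coord`, `tp_group`, and `tp_groups_stay_in_node` are hypothetical helpers for illustration, not part of any framework, and the sizes in the demo are arbitrary.

```python
from typing import List, NamedTuple


class Coord(NamedTuple):
    dp: int  # data-parallel replica index
    pp: int  # pipeline stage index
    tp: int  # tensor-parallel rank inside the stage


def total_gpus(dp_size: int, pp_size: int, tp_size: int) -> int:
    """The axes are independent, so the GPU count is their product."""
    return dp_size * pp_size * tp_size


def rank_to_coord(rank: int, pp_size: int, tp_size: int) -> Coord:
    """Map a global rank to (dp, pp, tp), with tp varying fastest."""
    tp = rank % tp_size
    pp = (rank // tp_size) % pp_size
    dp = rank // (tp_size * pp_size)
    return Coord(dp, pp, tp)


def tp_group(dp: int, pp: int, pp_size: int, tp_size: int) -> List[int]:
    """Global ranks that form one tensor-parallel group (contiguous by design)."""
    base = (dp * pp_size + pp) * tp_size
    return list(range(base, base + tp_size))


def node_of(rank: int, gpus_per_node: int) -> int:
    """Assumes ranks are assigned to servers in contiguous blocks."""
    return rank // gpus_per_node


def tp_groups_stay_in_node(dp_size: int, pp_size: int, tp_size: int,
                           gpus_per_node: int) -> bool:
    """True if every tensor-parallel group lives inside a single server."""
    for dp in range(dp_size):
        for pp in range(pp_size):
            nodes = {node_of(r, gpus_per_node)
                     for r in tp_group(dp, pp, pp_size, tp_size)}
            if len(nodes) != 1:
                return False
    return True


# Illustrative sizes only: 2 replicas x 4 stages x 8-way tensor parallel.
DP, PP, TP, GPUS_PER_NODE = 2, 4, 8, 8
print(total_gpus(DP, PP, TP))                           # 2 * 4 * 8 GPUs in total
print(rank_to_coord(9, PP, TP))                         # second stage, TP rank 1
print(tp_groups_stay_in_node(DP, PP, TP, GPUS_PER_NODE))
```

With this ordering, a tensor-parallel group whose size divides the number of GPUs per server never crosses a server boundary, so its all-reduces stay on the fast intra-node link. Only pipeline and data-parallel traffic uses the slower network.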
At thousands of GPUs the arithmetic is rarely the limit; moving gradients, activations, and shards between devices is. This is why frontier clusters are built around their network as much as their accelerators, why ZeRO and FlashAttention — both fundamentally about moving less data — matter so much, and why a careless parallelism layout can leave most of an expensive GPU fleet stalled waiting on the wire. Scaling laws assume you can keep the GPUs fed; the interconnect is what decides whether you can.
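To get a feel for how much the link matters, here is a back-of-envelope sketch of the bandwidth term of a ring all-reduce, in which each GPU sends and receives roughly 2(n-1)/n times the message size. It ignores latency, overlap with compute, and topology effects. The function name `ring_allreduce_seconds` is hypothetical, and the bandwidth figures are assumed round numbers chosen only to reflect the rough order-of-magnitude gap described above, not measurements of any particular system.

```python
def ring_allreduce_seconds(message_bytes: float, group_size: int,
                           bandwidth_bytes_per_s: float) -> float:
    """Bandwidth-only estimate for a ring all-reduce (latency ignored).

    Each participant moves about 2 * (n - 1) / n of the message over its link.
    """
    if group_size < 2:
        return 0.0  # a single GPU has nothing to exchange
    traffic = 2 * (group_size - 1) / group_size * message_bytes
    return traffic / bandwidth_bytes_per_s


# Assumed, illustrative numbers (not measurements):
MESSAGE_BYTES = 1e9       # 1 GB of gradients or activations
GROUP_SIZE = 8            # GPUs taking part in the all-reduce
INTRA_NODE_BW = 300e9     # bytes/s, "hundreds of GB/s" inside a server
INTER_NODE_BW = 25e9      # bytes/s, roughly an order of magnitude lower

fast = ring_allreduce_seconds(MESSAGE_BYTES, GROUP_SIZE, INTRA_NODE_BW)
slow = ring_allreduce_seconds(MESSAGE_BYTES, GROUP_SIZE, INTER_NODE_BW)
print(f"intra-node: {fast * 1e3:.2f} ms, inter-node: {slow * 1e3:.2f} ms")
print(f"slowdown from crossing servers: {slow / fast:.1f}x")
```

Under these assumptions the same collective takes about an order of magnitude longer once it leaves the server, which is why a collective that runs in every layer is the one to keep on the fast link.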
Each cell below is one GPU, labelled by its