Tensor parallelism splits a layer across GPUs, but it chatters constantly over the
interconnect, so it only works well between GPUs joined by a very fast link. To span many
machines we need a coarser cut, one that talks far less often.
Pipeline parallelism (also called inter-layer parallelism) makes that
cut along the depth of the network. Instead of slicing each layer, it hands whole groups
of consecutive layers to different GPUs, and streams activations down the chain like parts moving
along a factory line.
Building the pipeline, line by line
Take a network of L layers and P GPUs
(stages), with k = L/P layers per stage (assume for simplicity that P divides L evenly).
Step 1 — assign a contiguous block of layers to each GPU. Layers
1\dots k live on GPU 1, layers
k+1\dots 2k on GPU 2, and so on:
\text{GPU } p \;\text{holds layers}\; \{(p-1)k + 1,\, \dots,\, pk\}.
No single GPU stores the whole model, so a network far too deep for one card now fits across the
P of them.
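The assignment in Step 1 can be sketched in a few lines of Python. This is only an illustration of the index formula above: the function name stage_layers is hypothetical, layers are numbered from 1 as in the text, and the sketch assumes P divides L evenly.

```python
def stage_layers(num_layers: int, num_stages: int) -> list:
    """Return, for each stage (GPU), the 1-indexed range of layers it holds."""
    if num_layers % num_stages != 0:
        # Assumption of this sketch: P divides L evenly, so k = L/P is an integer.
        raise ValueError("this sketch assumes P divides L evenly")
    k = num_layers // num_stages
    # Stage p (1-indexed) holds layers (p-1)k + 1 through pk.
    return [range((p - 1) * k + 1, p * k + 1) for p in range(1, num_stages + 1)]


blocks = stage_layers(num_layers=12, num_stages=4)
for p, block in enumerate(blocks, start=1):
    print(f"GPU {p}: layers {block.start}..{block.stop - 1}")
# GPU 1: layers 1..3
# GPU 2: layers 4..6
# GPU 3: layers 7..9
# GPU 4: layers 10..12
```

Real frameworks often balance stages by compute or memory cost rather than by layer count; the even split here simply mirrors the k = L/P setup used in this section.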
Step 2 — flow the forward pass down the stages. An input enters GPU 1, which
runs its layers and ships the resulting activations to GPU 2, which runs its layers and ships to
GPU 3, and so on to the output:
x \;\longrightarrow\; \text{GPU 1} \;\longrightarrow\; \text{GPU 2} \;\longrightarrow\; \cdots \;\longrightarrow\; \text{GPU } P \;\longrightarrow\; \hat{y}.
Step 3 — flow the backward pass back up. Gradients travel the same chain in
reverse, GPU P \to P-1 \to \dots \to 1, each stage doing backprop
through its own layers. The cross-GPU traffic is small: only the activations at the cut points
cross the wire on the way forward, and only the gradients with respect to those activations
cross on the way back. Layer weights never leave their GPU.
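Steps 2 and 3 can be traced on a toy model. The sketch below runs in a single process and assumes every layer is the scalar map y = w * x, so a stage's backward pass just multiplies the incoming gradient by its weights in reverse order. The names stage_forward, stage_backward, sent_forward and sent_backward are made up for illustration, and weight gradients are left out. The point is only to show which values would cross a stage boundary.

```python
# Toy model: every layer is y = w * x, so dy/dx = w.
layer_weights = [0.5, 2.0, 3.0, 1.5, 4.0, 0.25]  # L = 6 layers
P = 3                                            # number of stages
k = len(layer_weights) // P                      # layers per stage
stages = [layer_weights[p * k:(p + 1) * k] for p in range(P)]  # contiguous blocks


def stage_forward(weights, x):
    """Run one stage's layers in order and return the activation it ships on."""
    for w in weights:
        x = w * x
    return x


def stage_backward(weights, grad_out):
    """Backprop through one stage: return the gradient w.r.t. the stage's input."""
    for w in reversed(weights):
        grad_out = w * grad_out
    return grad_out


# Forward: GPU 1 -> GPU 2 -> GPU 3. One activation crosses each cut.
x = 1.0
sent_forward = []
for weights in stages:
    x = stage_forward(weights, x)
    sent_forward.append(x)
y_hat = x

# Backward: GPU 3 -> GPU 2 -> GPU 1. One gradient crosses each cut.
grad = 1.0  # d(y_hat)/d(y_hat)
sent_backward = []
for weights in reversed(stages):
    grad = stage_backward(weights, grad)
    sent_backward.append(grad)

print(y_hat, grad)    # 4.5 4.5
print(sent_forward)   # [1.0, 4.5, 4.5]
print(sent_backward)  # [1.0, 4.5, 4.5]
```

In a real multi-GPU run each hand-off in the two loops would be a send and receive between devices; the weights stay where they are.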
Step 4 — see the problem: the bubble. Naively, GPU 2 cannot start until GPU 1
finishes; GPU 3 waits on GPU 2; and so on. While the input is being processed by stage 1, every
other stage sits idle. This idle time is the pipeline bubble — at any instant,
most of your expensive GPUs are doing nothing:
\text{at start: only GPU 1 works} \;\Rightarrow\; P-1 \text{ GPUs idle.}
Step 5 — fill the bubble with micro-batches. The fix is to stop feeding one fat
batch and instead chop it into M small micro-batches
and stream them. As soon as micro-batch 1 leaves GPU 1 for GPU 2, micro-batch 2 enters GPU 1.
After a brief fill, every stage is working on a different micro-batch at once:
\text{steady state: all } P \text{ GPUs busy, each on a different micro-batch.}
The bubble doesn't vanish — there is still a fill at the start and a drain at the end — but it
shrinks: more micro-batches means a longer busy middle and a relatively smaller bubble. Pipeline
parallelism lets one model span many machines, with the micro-batch knob trading a little extra
scheduling for far higher GPU utilization.
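The fill, steady state and drain can be made visible with a small schedule simulation. The sketch below is a simplification under stated assumptions: it covers the forward pass only, every stage takes exactly one time slot per micro-batch, and transfer time is ignored. The function names forward_schedule and show are hypothetical.

```python
def forward_schedule(P: int, M: int) -> list:
    """grid[p][t] is the micro-batch (0-indexed) stage p handles in slot t, or None if idle."""
    slots = M + P - 1
    grid = [[None] * slots for _ in range(P)]
    for p in range(P):
        for m in range(M):
            # Micro-batch m enters stage 0 in slot m and reaches stage p after p more slots.
            grid[p][p + m] = m
    return grid


def show(grid) -> None:
    """Print one row per GPU: micro-batch numbers (1-indexed), '.' for an idle slot."""
    for p, row in enumerate(grid, start=1):
        cells = " ".join("." if m is None else str(m + 1) for m in row)
        print(f"GPU {p}: {cells}")


show(forward_schedule(P=4, M=1))
# GPU 1: 1 . . .
# GPU 2: . 1 . .
# GPU 3: . . 1 .
# GPU 4: . . . 1
print()
show(forward_schedule(P=4, M=6))
```

With M = 1 only the diagonal is busy, which is the naive bubble of Step 4. With M = 6 each row has six busy slots in a row and still exactly three idle ones, so the idle share of the nine-slot run is much smaller.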
A deep network can be split by depth across P GPUs:
- Layers split across GPUs. Each GPU (stage) holds a contiguous block of
  L/P layers; activations flow stage-to-stage forward, gradients
  stage-to-stage backward, with little cross-GPU traffic.
- The bubble. Run naively and stages wait on one another, leaving
  P-1 GPUs idle while one works: wasted compute.
- Micro-batching fills it. Split the batch into M
  micro-batches and stream them, so all P stages stay busy for most of the run. The
  idle fraction is about \dfrac{P-1}{M + P - 1}, which shrinks toward zero as
  M grows.
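Count the time slots, taking one slot as the time a stage spends on one micro-batch. With
P stages and M micro-batches, the last stage waits P-1
slots for the pipeline to fill, and the first stage waits P-1 slots for it to drain; a stage
in between splits the same P-1 idle slots between fill and drain. Each stage is busy for
M of the M + P - 1 slots the whole job spans, so exactly
P-1 are bubble for any given stage, and the bubble
fraction is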
\text{bubble fraction} \;\approx\; \frac{P-1}{M + P - 1}.
With P = 4 stages and only M = 1 batch
the bubble is 3/4: each GPU sits idle for three quarters of the run. Push to
M = 32 micro-batches and it falls to
3/35 \approx 8.6\%. The lesson: keep
M \gg P, and the bubble all but disappears.
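The formula is easy to check numerically, and it can be inverted to ask how many micro-batches a target bubble needs. The sketch below uses exact fractions to avoid rounding surprises. The helper names bubble_fraction and min_microbatches are hypothetical, and the result inherits the idealized equal-slot assumption behind the formula.

```python
import math
from fractions import Fraction


def bubble_fraction(P: int, M: int) -> Fraction:
    """Idle share of the run for each stage: (P - 1) / (M + P - 1)."""
    return Fraction(P - 1, M + P - 1)


def min_microbatches(P: int, target: Fraction) -> int:
    """Smallest M with bubble_fraction(P, M) <= target, for 0 < target < 1."""
    # (P-1)/(M+P-1) <= target  rearranges to  M >= (P-1)(1-target)/target.
    return max(1, math.ceil((P - 1) * (1 - target) / target))


print(bubble_fraction(4, 1))                             # 3/4
print(bubble_fraction(4, 32))                            # 3/35
print(round(float(bubble_fraction(4, 32)) * 100, 1))     # 8.6
print(min_microbatches(4, Fraction(1, 10)))              # 27
```

So with 4 stages, getting the bubble down to 10% takes at least 27 micro-batches under this model. In practice the micro-batch count is also limited by the batch size and by how small a micro-batch can get before each GPU runs inefficiently.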