Pipeline Parallelism

Tensor parallelism splits a layer across GPUs, but it chatters constantly over the interconnect, so it only works well between GPUs joined by a very fast link. To span many machines we need a coarser cut, one that talks far less often.

Pipeline parallelism (also called inter-layer parallelism) makes that cut along the depth of the network. Instead of slicing each layer, it hands whole groups of consecutive layers to different GPUs, and streams activations down the chain like parts moving along a factory line.

Building the pipeline, line by line

Take a network of L layers and P GPUs (stages), with k = L/P layers per stage (assume for simplicity that P divides L).

Step 1 — assign a contiguous block of layers to each GPU. Layers 1\dots k live on GPU 1, layers k+1\dots 2k on GPU 2, and so on:

\text{GPU } p \;\text{holds layers}\; \{(p-1)k + 1,\, \dots,\, pk\}.

No single GPU stores the whole model, so a network far too deep for one card now fits across the P of them.
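The assignment rule in Step 1 can be sketched in a few lines of Python. This is only an illustration: the function name `layers_for_stage` is hypothetical, and the sketch assumes P divides L evenly, as the formula above does.

```python
from typing import Dict, List


def layers_for_stage(p: int, num_layers: int, num_stages: int) -> range:
    """Return the 1-indexed layers held by stage p (1 <= p <= num_stages).

    Assumption: num_stages divides num_layers, so every stage gets k layers.
    """
    if num_layers % num_stages != 0:
        raise ValueError("this sketch assumes P divides L evenly")
    if not 1 <= p <= num_stages:
        raise ValueError("stage index p must lie in 1..P")
    k = num_layers // num_stages
    # Layers (p-1)k + 1 .. pk, inclusive; range() excludes its upper bound.
    return range((p - 1) * k + 1, p * k + 1)


# Example: L = 12 layers over P = 4 stages gives k = 3 layers per stage.
assignment: Dict[int, List[int]] = {
    p: list(layers_for_stage(p, num_layers=12, num_stages=4)) for p in range(1, 5)
}
```

Here stage 1 holds layers 1 to 3 and stage 4 holds layers 10 to 12; every layer appears on exactly one stage.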

Step 2 — flow the forward pass down the stages. An input enters GPU 1, which runs its layers and ships the resulting activations to GPU 2, which runs its layers and ships to GPU 3, and so on to the output:

x \;\longrightarrow\; \text{GPU 1} \;\longrightarrow\; \text{GPU 2} \;\longrightarrow\; \cdots \;\longrightarrow\; \text{GPU } P \;\longrightarrow\; \hat{y}.
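To make the hand-off concrete, here is a minimal single-process sketch of the forward pass. Everything in it is an assumption made for illustration: the layers are toy scalar functions, the helper names are hypothetical, and the plain Python hand-off between stages stands in for a real GPU-to-GPU transfer.

```python
from typing import Callable, List

Layer = Callable[[float], float]


def make_stages(layers: List[Layer], num_stages: int) -> List[List[Layer]]:
    """Cut the layer list into num_stages contiguous blocks of equal size."""
    if len(layers) % num_stages != 0:
        raise ValueError("this sketch assumes P divides L evenly")
    k = len(layers) // num_stages
    return [layers[i * k:(i + 1) * k] for i in range(num_stages)]


def run_stage(stage: List[Layer], x: float) -> float:
    """Run one stage's layers in order on its incoming activation."""
    for layer in stage:
        x = layer(x)
    return x


def pipeline_forward(stages: List[List[Layer]], x: float) -> float:
    """Pass the activation from stage to stage until the output emerges."""
    for stage in stages:
        # In a real system this hand-off would be a send/receive between devices.
        x = run_stage(stage, x)
    return x


# Four toy layers of the form a*x + 1, split over two stages.
layers: List[Layer] = [lambda x, a=a: a * x + 1.0 for a in (1.0, 2.0, 3.0, 4.0)]
stages = make_stages(layers, num_stages=2)
y_pipeline = pipeline_forward(stages, 0.5)
y_reference = run_stage(layers, 0.5)  # same layers run without any split
```

Splitting by depth does not change the result: `y_pipeline` equals `y_reference`, because the stages apply the same layers in the same order.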

Step 3 — flow the backward pass back up. Gradients travel the same chain in reverse, GPU P \to P-1 \to \dots \to 1, each stage doing backprop through its own layers. The cross-GPU traffic is small: at each cut point, only the activations cross the wire on the way forward and only their gradients on the way back. No layer's weights ever leave the GPU that holds them.
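The reverse flow can be sketched the same way. This is a toy under stated assumptions, not how a training framework does it: each layer is a hypothetical scalar map y = a*x + 1, so its local derivative is simply a, and weight gradients are left out. The sketch also counts how many times a gradient crosses a stage boundary.

```python
from typing import List, Tuple


def stage_backward(slopes: List[float], grad_out: float) -> float:
    """Backprop through one stage of toy layers y = a*x + 1 (dy/dx = a)."""
    for a in reversed(slopes):
        grad_out *= a  # chain rule through this layer
    return grad_out


def pipeline_backward(
    stage_slopes: List[List[float]], grad_output: float = 1.0
) -> Tuple[float, int]:
    """Send the gradient from the last stage back to the first.

    Returns the gradient with respect to the network input and the number
    of cross-stage gradient transfers that took place.
    """
    grad = grad_output
    sends = 0
    for idx in range(len(stage_slopes) - 1, -1, -1):
        grad = stage_backward(stage_slopes[idx], grad)
        if idx > 0:
            sends += 1  # gradient at the cut point goes to the previous stage
    return grad, sends


# Two stages with two toy layers each.
stage_slopes = [[1.0, 2.0], [3.0, 4.0]]
grad_input, backward_sends = pipeline_backward(stage_slopes)
```

With P stages there are P - 1 cut points, so the backward pass makes P - 1 transfers, mirroring the P - 1 activation transfers of the forward pass.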

Step 4 — see the problem: the bubble. Naively, GPU 2 cannot start until GPU 1 finishes; GPU 3 waits on GPU 2; and so on. While one stage is processing the batch, every other stage sits idle. This idle time is the pipeline bubble. At any instant, all but one of your expensive GPUs are doing nothing:

\text{at start: only GPU 1 works} \;\Rightarrow\; P-1 \text{ GPUs idle.}
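A quick count shows how costly the naive schedule is. The sketch below assumes, for illustration only, that each stage takes exactly one time slot to process the batch and that only the forward pass is counted; the function names are hypothetical.

```python
from typing import List


def naive_idle_per_slot(num_stages: int) -> List[int]:
    """Idle-stage count in each slot when a single batch moves through alone."""
    # The batch occupies exactly one stage per slot, so P - 1 stages wait.
    return [num_stages - 1 for _ in range(num_stages)]


def naive_utilization(num_stages: int) -> float:
    """Fraction of GPU-slots that do useful work with one batch in flight."""
    total_gpu_slots = num_stages * num_stages  # P stages over P slots
    busy_gpu_slots = num_stages                # each stage works in one slot
    return busy_gpu_slots / total_gpu_slots


idle_counts = naive_idle_per_slot(4)
utilization = naive_utilization(4)
```

Under these assumptions utilization is 1/P, so adding stages to the naive pipeline makes each GPU proportionally less busy.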

Step 5 — fill the bubble with micro-batches. The fix is to stop feeding one fat batch and instead chop it into M small micro-batches and stream them. As soon as micro-batch 1 leaves GPU 1 for GPU 2, micro-batch 2 enters GPU 1. After a brief fill, every stage is working on a different micro-batch at once:

\text{steady state: all } P \text{ GPUs busy, each on a different micro-batch.}

The bubble doesn't vanish — there is still a fill at the start and a drain at the end — but it shrinks: more micro-batches means a longer busy middle and a relatively smaller bubble. Pipeline parallelism lets one model span many machines, with the micro-batch knob trading a little extra scheduling for far higher GPU utilization.
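The fill, steady state, and drain can be read off a simple indexing rule. In this sketch (an illustration with hypothetical names, assuming every stage takes one slot per micro-batch and counting the forward pass only), stage s works on micro-batch t - s during slot t, with everything 0-indexed.

```python
from typing import List, Optional


def microbatch_at(stage: int, slot: int, num_microbatches: int) -> Optional[int]:
    """Micro-batch that `stage` processes during `slot`, or None if it is idle."""
    m = slot - stage  # micro-batch m reaches stage s exactly s slots after entering
    return m if 0 <= m < num_microbatches else None


def busy_stages(slot: int, num_stages: int, num_microbatches: int) -> int:
    """How many stages are doing work during the given slot."""
    return sum(
        microbatch_at(s, slot, num_microbatches) is not None
        for s in range(num_stages)
    )


P, M = 4, 8
busy_per_slot: List[int] = [busy_stages(t, P, M) for t in range(M + P - 1)]
```

For P = 4 and M = 8 the busy count ramps up over the first three slots, stays at four while every stage holds a different micro-batch, then ramps back down as the pipeline drains.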

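A deep network can be split by depth across P GPUs, and the cost of that split can be read off by counting time slots. Assume every stage takes one slot per micro-batch and look at the forward pass only. With P stages and M micro-batches, the last stage starts P-1 slots after the first, so the whole job spans M + P - 1 slots. Stage p waits p-1 slots for its first micro-batch and sits idle for P-p slots after its last one, so every stage is busy for M slots and idle for exactly P-1. The bubble fraction is therefore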

\text{bubble fraction} \;\approx\; \frac{P-1}{M + P - 1}.

With P = 4 stages and only M = 1 batch the bubble is 3/(1 + 3) = 3/4: each GPU sits idle for three quarters of the run. Push to M = 32 micro-batches and it falls to 3/35 \approx 8.6\%. The lesson: keep M \gg P, and the bubble all but disappears.
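The formula is easy to evaluate directly. The sketch below uses exact fractions to avoid rounding surprises; the function names are hypothetical, and `min_microbatches` simply inverts the formula above to ask how many micro-batches keep the bubble under a chosen target.

```python
import math
from fractions import Fraction


def bubble_fraction(num_stages: int, num_microbatches: int) -> Fraction:
    """Idle share of the schedule: (P - 1) / (M + P - 1)."""
    return Fraction(num_stages - 1, num_microbatches + num_stages - 1)


def min_microbatches(num_stages: int, target: Fraction) -> int:
    """Smallest M with bubble_fraction(P, M) <= target, for 0 < target < 1.

    Solving (P - 1) / (M + P - 1) <= target for M gives
    M >= (P - 1) * (1 - target) / target.
    """
    if not 0 < target < 1:
        raise ValueError("target must lie strictly between 0 and 1")
    bound = (num_stages - 1) * (1 - target) / target
    return max(1, math.ceil(bound))


single_batch = bubble_fraction(4, 1)     # one batch, no micro-batching
many_batches = bubble_fraction(4, 32)    # 32 micro-batches
needed = min_microbatches(4, Fraction(1, 10))
```

With P = 4, a single batch gives 3/4 and 32 micro-batches give 3/35, matching the numbers in the text. Keeping the bubble at or below 10% takes at least 27 micro-batches.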

Watch the bubble shrink

A Gantt-style schedule: rows are the P=4 GPU stages, columns are time slots. Each coloured cell is a micro-batch's forward pass on that stage; the diagonal shows the wavefront flowing down the pipeline. The faint shaded cells are the bubble — idle GPUs. Slide the micro-batch count up and watch the busy diagonal lengthen and the bubble fraction fall.
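A rough text version of such a schedule can be produced with a short script. This is a sketch with a hypothetical function name and the same simplifications as the slot count above (one slot per stage per micro-batch, forward pass only): numbers mark which micro-batch a stage is processing, and dots mark bubble cells.

```python
def render_schedule(num_stages: int, num_microbatches: int) -> str:
    """Draw a forward-only pipeline schedule: rows are stages, columns are slots."""
    total_slots = num_microbatches + num_stages - 1
    rows = []
    for stage in range(num_stages):
        cells = []
        for slot in range(total_slots):
            m = slot - stage  # 0-indexed micro-batch on this stage in this slot
            if 0 <= m < num_microbatches:
                cells.append(f"{m + 1:2d}")  # busy: show 1-indexed micro-batch
            else:
                cells.append(" .")           # idle: part of the bubble
        rows.append(f"GPU {stage + 1} | " + " ".join(cells))
    return "\n".join(rows)


chart = render_schedule(num_stages=4, num_microbatches=3)
print(chart)
```

For four stages and three micro-batches, each row shows three busy cells and three dots, with the busy run shifted one slot to the right on each successive stage. Raising `num_microbatches` lengthens the busy run while the number of dots per row stays at P - 1.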