The backward pass has a hidden appetite. To compute a layer's gradient,
Gradient checkpointing (also called activation checkpointing) makes
a different bargain: store only a few activations — the checkpoints — and
recompute the rest on the fly during the backward pass. It spends extra
compute (one extra forward pass) to slash activation memory from
Take a network of
Step 1 — the naive cost. Store all
Step 2 — keep only m checkpoints. Save activations at
Step 3 — recompute a segment at a time. The
Step 4 — minimize over m. Balance the two terms. Their sum is smallest when
they are equal,
Step 5 — the compute it costs. Backprop already does one forward and one
backward. Recomputing each segment once during the backward pass adds, in total, exactly
one more forward pass over the network — a fixed
So
Transformer activation memory scales with depth × sequence length × width, and the attention activations grow with the square of the sequence length. Double the context window and the activations balloon; they, not the weights, are what pin a long-context model to a too-small batch (or off the GPU entirely). This is the memory wall.
Checkpointing is the standard tool for climbing it: by recomputing instead of storing, it
lets you train deeper networks and longer sequences on the same hardware, or fit a larger
micro-batch (which then pairs with
The curve is total activation memory
Mixed precision, accumulation, and checkpointing are all single-device tricks: each squeezes a
bigger model or batch onto the GPU you already have. When even that is not enough — or you
simply want to go faster — the next move is to spread the work across many GPUs, which
is