Residual Connections

The vanishing gradient problem says a deep stack of layers strangles its own gradient. The residual connection fixes it with a single, almost insolent idea: instead of asking a layer to transform its input, ask it to learn the change to the input, and add that change back on:

y = x + F(x).

Here F is the layer's usual transformation; the bare wire that carries x straight to the output is the skip connection. It looks like nothing. It changes everything.
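As a minimal sketch of the formula above, the two functions below contrast a plain layer with a residual block. The transformation F used here (a scaled tanh with a hypothetical weight w) is an assumption chosen only for illustration; any differentiable function could stand in for it.

```python
import math

def F(x, w=0.5):
    # Hypothetical layer transformation: a scaled tanh (illustrative only).
    return w * math.tanh(x)

def plain_layer(x, w=0.5):
    # A plain layer outputs the transformation itself.
    return F(x, w)

def residual_block(x, w=0.5):
    # The skip carries x across unchanged; F supplies only the change to x.
    return x + F(x, w)

x = 2.0
print("plain layer:   ", plain_layer(x))
print("residual block:", residual_block(x))
```

The only structural difference is the `x +` in front of `F(x, w)`: that single term is the skip connection.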

The backward pass through a skip

The whole payoff is in one derivative. Differentiate the block output with respect to its input.

Step 1 — differentiate the sum. The output is y = x + F(x), a sum of two terms, so its derivative is the sum of their derivatives:

\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\big[\,x + F(x)\,\big] = \frac{\partial x}{\partial x} + \frac{\partial F}{\partial x}.

Step 2 — the first term is exactly 1. The derivative of x with respect to itself is the identity:

\frac{\partial y}{\partial x} = 1 + \frac{\partial F}{\partial x}.

Step 3 — read the highway. When backprop multiplies this layer's Jacobian into the running product, it multiplies by 1 + \partial F/\partial x, not by \partial F/\partial x alone. Even if F's slope shrinks toward zero — the very thing that causes vanishing — the +1 survives. The gradient has a highway straight back to the early layers that no amount of depth can close.

Step 4 — stack the blocks. Chain n residual blocks, with x_k = x_{k-1} + F_k(x_{k-1}) and y_n = x_n, and the chain rule gives a product of (1 + \partial F_k/\partial x_{k-1}) terms. Expand it and there is always a clean 1\cdot 1\cdots 1 = 1 path threading through every block:

\frac{\partial y_n}{\partial x_0} = \prod_{k=1}^{n}\left(1 + \frac{\partial F_k}{\partial x_{k-1}}\right) = 1 + (\text{terms that may shrink}).

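For two blocks, writing a_k = \partial F_k/\partial x_{k-1}, the product is (1 + a_1)(1 + a_2) = 1 + a_1 + a_2 + a_1 a_2. When every a_k shrinks toward zero, the regime that kills a plain stack, the gradient reaching layer 0 tends to 1 rather than to 0. The extra terms can be negative, so this is not a guarantee that the gradient never shrinks, but depth alone no longer forces it to vanish.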
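The product of per-block factors can be checked numerically. The sketch below assumes scalar blocks that all share one hypothetical transformation, F(x) = w * tanh(x) with w = 0.5, and accumulates the chain-rule product for a plain stack (factor \partial F/\partial x) and a residual stack (factor 1 + \partial F/\partial x). In this toy the slope of F is never negative, so every residual factor is at least 1; that is a property of the example, not a general guarantee.

```python
import math

def dF_dx(x, w):
    # Analytic slope of the hypothetical F(x) = w * tanh(x).
    return w * (1.0 - math.tanh(x) ** 2)

def input_gradient(x0, w, depth, residual):
    """Chain-rule product d(x_depth)/d(x_0) for a stack of scalar blocks."""
    x, grad = x0, 1.0
    for _ in range(depth):
        slope = dF_dx(x, w)
        if residual:
            grad *= 1.0 + slope            # factor from y = x + F(x)
            x = x + w * math.tanh(x)       # forward pass through the block
        else:
            grad *= slope                  # factor from y = F(x)
            x = w * math.tanh(x)           # forward pass through the layer
    return grad

for depth in (1, 10, 50):
    plain = input_gradient(1.0, 0.5, depth, residual=False)
    skip = input_gradient(1.0, 0.5, depth, residual=True)
    print(f"depth={depth:3d}  plain={plain:.3e}  residual={skip:.3e}")
```

Each plain factor here is at most 0.5, so the plain product shrinks at least geometrically with depth, while the residual product stays at or above 1.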

And adding depth can't hurt

There is a second gift. Set F = 0 and the block computes y = x — the identity. So the easy default for a residual block is to do nothing and pass its input through unchanged.

A plain layer must work to reproduce its input (it has to learn the identity map from scratch, which is surprisingly hard); a residual block gets the identity for free by simply driving the weights of F toward zero. So if extra depth isn't useful, the network can switch those blocks off and lose nothing. In principle, a deeper residual network can always represent whatever its shallower counterpart computes, so added depth need not make it worse.
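The "identity for free" argument can be made concrete with a small sketch. It assumes F is a bare linear map W x written with plain lists (no bias, no nonlinearity), which is a simplification chosen for illustration.

```python
def linear(x, W):
    # Matrix-vector product W x using plain Python lists.
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in W]

def plain_layer(x, W):
    return linear(x, W)

def residual_block(x, W):
    # Skip path plus the linear transformation.
    return [x_i + f_i for x_i, f_i in zip(x, linear(x, W))]

x = [1.5, -2.0, 0.25]
zeros = [[0.0] * 3 for _ in range(3)]
identity = [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]

print(residual_block(x, zeros))   # zero weights: the input passes through
print(plain_layer(x, zeros))      # zero weights: the input is wiped out
print(plain_layer(x, identity))   # a plain layer needs W = I to pass x through
```

With all-zero weights the residual block already reproduces its input, whereas the plain layer has to find the specific matrix I to do the same.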

Figure: A residual block wraps a transformation F around an identity skip.

See the block, and the reach

Step through the block: the input x splits, one copy runs through the transformation F, the skip carries the original straight across, and the two are summed at the end. Below the block, two columns of layers compare how far a gradient reaches: in a plain stack it fades to nothing toward the bottom; with skips it stays bright all the way down.

Residual connections are a large part of why networks got deep. The ResNet architecture stacked these blocks to train networks of more than 100, then over 1,000 layers, depths that plain stacks could not reach in practice. The earlier, plainer networks weren't failing from overfitting: their training error itself rose as layers were added, so the failure was in optimization, where the gradient signal faded on the way back.

The idea then quietly took over everything. Every sublayer of a transformer (each attention block, each feed-forward block) is wrapped in a residual connection of essentially this form, x + \text{sublayer}(x), usually with a normalization step placed before the sublayer or after the sum. The deepest models in use today are, structurally, towers of skip connections. The humble +1 in \partial y/\partial x is load-bearing for the entire field.
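The wrapping pattern x + sublayer(x) can be sketched as a small higher-order function. Everything below is hypothetical: `toy_attention` and `toy_feed_forward` are stand-ins that only mimic the "list in, list of the same length out" contract, not real attention or feed-forward layers, and normalization is left out to keep the skip structure visible.

```python
def residual(sublayer):
    """Wrap a function so that it computes x + sublayer(x) elementwise."""
    def wrapped(x):
        return [x_i + s_i for x_i, s_i in zip(x, sublayer(x))]
    return wrapped

def toy_attention(x):
    # Stand-in: nudges each entry toward the mean of the input.
    mean = sum(x) / len(x)
    return [0.1 * (mean - x_i) for x_i in x]

def toy_feed_forward(x):
    # Stand-in: a scaled ReLU applied to each entry.
    return [0.1 * max(0.0, x_i) for x_i in x]

def toy_layer(x):
    # One layer: two sublayers, each wrapped in its own skip connection.
    x = residual(toy_attention)(x)
    x = residual(toy_feed_forward)(x)
    return x

x = [1.0, -1.0, 2.0]
for _ in range(4):            # stack four layers
    x = toy_layer(x)
print(x)
```

Because every sublayer only adds to what the skip carries, the input always has a direct additive route to the output, however many layers are stacked.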

Where this is going

We keep invoking "the gradient of F" as if a framework simply hands it to us. It does — and the next page shows how: automatic differentiation builds a graph of elementary operations and walks it backwards, computing every derivative (skip connections and all) for you.