Residual Connections
The vanishing gradient problem says a deep stack of layers strangles its own gradient. The
residual connection fixes it with a single, almost insolent idea: instead of
asking a layer to transform its input, ask it to learn the change to the
input, and add that change back on:
y = x + F(x).
Here F is the layer's usual transformation; the bare wire that
carries x straight to the output is the skip
connection. It looks like nothing. It changes everything.
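As a minimal sketch of the formula above, the block can be written as a function that takes any transformation and adds its output back onto the input. The names `residual_block` and `small_change` are hypothetical, chosen only for illustration, and scalars stand in for the vectors a real layer would use.

```python
import math


def residual_block(x, f):
    """Return x + f(x): the skip path carries x, f supplies only the change."""
    return x + f(x)


def small_change(x):
    # Hypothetical F, picked only so the example has something to add.
    return 0.1 * math.tanh(x)


y = residual_block(2.0, small_change)
print(y)  # the input 2.0 plus a small correction
```

The transformation is passed in as an argument so the skip is visibly separate from whatever F happens to be.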
The backward pass through a skip
The whole payoff is in one derivative. Differentiate the block output with respect to its
input.
Step 1 — differentiate the sum. The output is
y = x + F(x), a sum of two terms, so its derivative is the sum of
their derivatives:
\frac{\partial y}{\partial x} = \frac{\partial}{\partial x}\big[\,x + F(x)\,\big] = \frac{\partial x}{\partial x} + \frac{\partial F}{\partial x}.
Step 2 — the first term is exactly 1. The
derivative of x with respect to itself is 1 (for a vector x, the identity matrix):
\frac{\partial y}{\partial x} = 1 + \frac{\partial F}{\partial x}.
Step 3 — read the highway. When backprop multiplies this layer's Jacobian
into the running product, it multiplies by 1 + \partial F/\partial x,
not by \partial F/\partial x alone. Even if
F's slope shrinks toward zero — the very thing that causes
vanishing — the +1 survives. The gradient has a
highway straight back to the early layers that no amount of depth can close.
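Steps 1 to 3 can be checked numerically. The sketch below assumes a hypothetical scalar F with a deliberately tiny slope and estimates both derivatives with central finite differences. It is an illustration of the identity, not how a framework computes gradients.

```python
import math


def F(x):
    # Hypothetical transformation whose slope is at most 0.01.
    return 0.01 * math.tanh(x)


def block(x):
    # Residual block: y = x + F(x)
    return x + F(x)


def central_diff(fn, x, h=1e-5):
    """Estimate fn'(x) with a symmetric finite difference."""
    return (fn(x + h) - fn(x - h)) / (2 * h)


x = 0.5
dF = central_diff(F, x)      # tiny: this is the slope that would vanish
dy = central_diff(block, x)  # close to 1 + dF
print(dF, dy)
```

The slope of F alone is close to zero, while the slope of the whole block stays just above 1.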
Step 4 — stack the blocks. Chain n residual
blocks and the chain rule gives a product of (1 + \partial F_k/\partial x)
terms. Expand it and there is always a clean 1\cdot 1\cdots 1 = 1
path threading through every block:
\frac{\partial y_n}{\partial x_0} = \prod_{k=1}^{n}\left(1 + \frac{\partial F_k}{\partial x_{k-1}}\right) = 1 + (\text{terms that may shrink}).
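For n = 2 the expansion reads 1 + a_1 + a_2 + a_1 a_2, with a_k = \partial F_k/\partial x_{k-1}.
In the vanishing regime, where every a_k is close to zero, the gradient reaching layer 0 is therefore close to 1 rather than close to zero: small slopes no longer multiply into nothing.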
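To see the stacked product in action, the sketch below compares a plain stack with a residual stack of scalar layers. The layer F_k(x) = w * tanh(x) and the helper names are assumptions made for illustration. With w > 0 every slope is non-negative, so every residual factor is at least 1, which is a property of this toy setup and not of networks in general.

```python
import math


def layer_slope(x, w):
    # d/dx [w * tanh(x)] = w * (1 - tanh(x)**2)
    return w * (1.0 - math.tanh(x) ** 2)


def input_gradient(x0, n_layers, w, residual):
    """Chain-rule product d(output)/d(x0) through n_layers scalar layers."""
    x, grad = x0, 1.0
    for _ in range(n_layers):
        slope = layer_slope(x, w)
        if residual:
            grad *= 1.0 + slope       # factor (1 + dF_k/dx_{k-1})
            x = x + w * math.tanh(x)  # forward step of a residual block
        else:
            grad *= slope             # factor dF_k/dx_{k-1} alone
            x = w * math.tanh(x)      # forward step of a plain layer
    return grad


plain = input_gradient(0.5, 50, 0.5, residual=False)
skip = input_gradient(0.5, 50, 0.5, residual=True)
print(plain, skip)
```

In the plain stack each factor is at most 0.5, so fifty of them multiply to a value below 1e-12. In the residual stack no factor drops below 1.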
And adding depth can't hurt
There is a second gift. Set F = 0 and the block computes
y = x — the identity. So the easy default
for a residual block is to do nothing and pass its input through unchanged.
A plain layer must work to reproduce its input (it has to learn the identity map from
scratch, which is surprisingly hard); a residual block gets the identity for free by simply
driving its weights toward zero. So if extra depth isn't useful, the network can switch those
blocks off and lose nothing: in principle, a deeper network can always match its shallower counterpart.
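A small sketch of the "switched-off" block, assuming a hypothetical linear F(x) = weight * x so that a zero weight makes F vanish. Appending ten such blocks to a short stack leaves the output unchanged.

```python
def make_block(weight):
    """Residual block with the hypothetical transformation F(x) = weight * x."""
    def block(x):
        return x + weight * x
    return block


def run_stack(x, blocks):
    # Feed x through each block in order.
    for block in blocks:
        x = block(x)
    return x


useful = [make_block(0.5), make_block(-0.2)]
extra = [make_block(0.0) for _ in range(10)]  # F = 0, so each block is the identity

shallow = run_stack(3.0, useful)
deep = run_stack(3.0, useful + extra)
print(shallow, deep)  # the two values are identical
```

This shows only that the identity solution exists and is cheap to represent. Whether training actually finds it is a separate question.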
A residual block wraps a transformation F around an identity skip:
- The block. y = x + F(x): learn the change to x, not a from-scratch transformation of it.
- The gradient highway. \dfrac{\partial y}{\partial x} = 1 + \dfrac{\partial F}{\partial x}. The +1 is a path the gradient always has, so a shrinking slope in F cannot by itself make the gradient vanish through the block, however deep the stack.
- Identity is the easy default. F = 0 gives y = x for free, so an added residual block can always fall back to passing its input straight through.
See the block, and the reach
Step through the block: the input x splits, one copy runs through
the transformation F, the skip carries the original straight across,
and the two are summed at the end. Below the block, two columns of layers compare how far a
gradient reaches: in a plain stack it fades to nothing toward the bottom; with skips it stays
bright all the way down.
Residual connections are the reason networks got deep. The ResNet
architecture stacked these blocks
to train networks of 100, then over 1000
layers — depths that were simply untrainable before, because the gradient never survived the
trip. The earlier, plainer networks weren't failing from overfitting (their training error got worse with depth too); they were failing
because the gradient signal died on the way back.
The idea then quietly took over everything. Every sublayer of a transformer —
each attention block, each feed-forward block — is wrapped in a residual connection of exactly
this form, x + \text{sublayer}(x). The deepest models in use today
are, structurally, towers of skip connections. The humble +1 in
\partial y/\partial x is load-bearing for the entire field.
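The wrapping pattern x + sublayer(x) can be sketched as a higher-order function. Everything here is a simplified stand-in: `mix_tokens` and `feed_forward` are hypothetical placeholders for attention and the feed-forward block, plain lists replace tensors, and the layer normalization that real transformers also apply is left out.

```python
def with_residual(sublayer):
    """Wrap a sublayer so it computes x + sublayer(x) elementwise."""
    def wrapped(x):
        return [xi + si for xi, si in zip(x, sublayer(x))]
    return wrapped


def mix_tokens(x):
    # Placeholder for attention: every position receives the mean of all positions.
    mean = sum(x) / len(x)
    return [mean for _ in x]


def feed_forward(x):
    # Placeholder for the feed-forward block: a scaled ReLU per position.
    return [0.5 * max(v, 0.0) for v in x]


def layer(x):
    # One simplified layer: two sublayers, each inside its own skip connection.
    x = with_residual(mix_tokens)(x)
    x = with_residual(feed_forward)(x)
    return x


out = layer([1.0, 2.0, 3.0])
print(out)
```

Because each sublayer is wrapped separately, the input reaches the output through a chain of additions, whatever the sublayers compute.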
Where this is going
We keep invoking "the gradient of F" as if a framework simply hands
it to us. It does — and the next page shows how:
automatic differentiation builds a graph of elementary operations and walks it backwards,
computing every derivative (skip connections and all) for you.