Automatic Differentiation
We have leaned on phrases like "the gradient of F" as if a
framework just produces it. It does, and the mechanism has a name:
automatic differentiation (autodiff). You write the forward computation;
the framework records it as a computational graph of elementary operations,
then walks that graph backwards, multiplying each node's local derivative by the
gradient flowing into it. That backward walk is the
chain rule,
automated — backprop generalised from neural networks to any computation.
The forward pass: build the graph, store the values
Take a tiny example and follow it node by node:
\mathcal{L} = (a\cdot b + c)^2.
Break it into elementary operations, each a node in the graph. Pick concrete inputs
a = 3,\ b = 2,\ c = 1 to make it vivid.
Step 1 — multiply. u = a\cdot b. Forward value:
u = 3\cdot 2 = 6.
Step 2 — add. v = u + c. Forward value:
v = 6 + 1 = 7.
Step 3 — square. \mathcal{L} = v^2. Forward value:
\mathcal{L} = 7^2 = 49.
The forward pass stores every intermediate value (u, v, \mathcal{L})
— they are exactly the numbers the backward pass will need.
The backward pass: local derivative × upstream gradient
Now walk the graph in reverse. Each node already knows its own local
derivative (how its output moves when its input moves); multiply that by the
upstream gradient handed down from above, and pass the result to its inputs.
Seed the very top with \partial \mathcal{L}/\partial \mathcal{L} = 1.
Step 1 — through the square. \mathcal{L} = v^2
has local derivative \partial \mathcal{L}/\partial v = 2v. Multiply
by the upstream 1:
\frac{\partial \mathcal{L}}{\partial v} = 2v \cdot 1 = 2\cdot 7 = 14.
Step 2 — through the add. v = u + c has local
derivatives \partial v/\partial u = 1 and
\partial v/\partial c = 1. Each gets the upstream
14:
\frac{\partial \mathcal{L}}{\partial u} = 1\cdot 14 = 14, \qquad \frac{\partial \mathcal{L}}{\partial c} = 1\cdot 14 = 14.
Step 3 — through the multiply. u = a\cdot b has
local derivatives \partial u/\partial a = b and
\partial u/\partial b = a. Multiply each by the upstream
14 (using the stored forward values
a=3,\ b=2):
\frac{\partial \mathcal{L}}{\partial a} = b\cdot 14 = 2\cdot 14 = 28, \qquad \frac{\partial \mathcal{L}}{\partial b} = a\cdot 14 = 3\cdot 14 = 42.
Step 4 — collect. One forward sweep and one backward sweep produced every
gradient at once:
\frac{\partial \mathcal{L}}{\partial a} = 28, \quad \frac{\partial \mathcal{L}}{\partial b} = 42, \quad \frac{\partial \mathcal{L}}{\partial c} = 14.
Check one by hand if you like: \mathcal{L} = (ab + c)^2 gives
\partial \mathcal{L}/\partial a = 2(ab+c)\cdot b = 2\cdot 7\cdot 2 = 28.
It matches — and the graph never needed the closed form. It only ever multiplied a local
derivative by an upstream number.
Reverse-mode automatic differentiation computes the gradient of one output with respect to all
inputs in a single backward sweep:
-
A graph of operations. The computation is recorded as a DAG whose nodes
are elementary ops (add, multiply, square, …).
-
Forward stores values. One forward pass evaluates and caches every node's
output.
-
Reverse multiplies derivatives. Walking back from
\partial \mathcal{L}/\partial \mathcal{L} = 1, each node returns
(\text{local derivative}) \times (\text{upstream gradient}) to
its inputs.
-
It is backprop, generalised. Neural-network backpropagation is exactly
this rule applied to the network's graph — autodiff just does it for any
computation.
Step through the graph
Below is the graph for \mathcal{L} = (a\cdot b + c)^2. The forward
values sit on the nodes; stepping forward fills them in. Then the backward pass lights up the
gradients flowing down the edges — each one the local derivative times the gradient above it.
Watch 14 arrive at the add, split unchanged to
u and c, then fan out to
28 and 42 at the multiply.
There are two ways to automate the chain rule. Forward-mode pushes
derivatives along with the values, computing
\partial(\text{everything})/\partial a in one sweep — one sweep
per input. Reverse-mode (above) computes
\partial \mathcal{L}/\partial(\text{everything}) in one sweep — one
sweep per output.
Deep learning has a loss that is one scalar and parameters that number in the
millions or billions: many inputs, one output. Reverse-mode nails that shape — every gradient
for the price of a single backward pass, independent of how many parameters there are.
Forward-mode would need one pass per parameter, which is hopeless. That asymmetry is the entire
reason frameworks are built around reverse-mode autodiff (and why it is backprop).
Where this is going
Autodiff will happily differentiate any graph you build — including the one at the very end of
a classifier, the
softmax.
Its gradient turns out to be one of the most elegant in all of deep learning, and it is the
subject of the next page.