Automatic Differentiation

We have leaned on phrases like "the gradient of F" as if a framework just produces it. It does, and the mechanism has a name: automatic differentiation (autodiff). You write the forward computation; the framework records it as a computational graph of elementary operations, then walks that graph backwards, multiplying each node's local derivative by the gradient flowing into it. That backward walk is the chain rule, automated — backprop generalised from neural networks to any computation.

The forward pass: build the graph, store the values

Take a tiny example and follow it node by node:

\mathcal{L} = (a\cdot b + c)^2.

Break it into elementary operations, each a node in the graph. Pick concrete inputs a = 3,\ b = 2,\ c = 1 to make it vivid.

Step 1 — multiply. u = a\cdot b. Forward value: u = 3\cdot 2 = 6.

Step 2 — add. v = u + c. Forward value: v = 6 + 1 = 7.

Step 3 — square. \mathcal{L} = v^2. Forward value: \mathcal{L} = 7^2 = 49.

The forward pass stores every intermediate value (u, v, \mathcal{L}) — they are exactly the numbers the backward pass will need.

The backward pass: local derivative × upstream gradient

Now walk the graph in reverse. Each node already knows its own local derivative (how its output moves when its input moves); multiply that by the upstream gradient handed down from above, and pass the result to its inputs. Seed the very top with \partial \mathcal{L}/\partial \mathcal{L} = 1.

Step 1 — through the square. \mathcal{L} = v^2 has local derivative \partial \mathcal{L}/\partial v = 2v. Multiply by the upstream 1:

\frac{\partial \mathcal{L}}{\partial v} = 2v \cdot 1 = 2\cdot 7 = 14.

Step 2 — through the add. v = u + c has local derivatives \partial v/\partial u = 1 and \partial v/\partial c = 1. Each gets the upstream 14:

\frac{\partial \mathcal{L}}{\partial u} = 1\cdot 14 = 14, \qquad \frac{\partial \mathcal{L}}{\partial c} = 1\cdot 14 = 14.

Step 3 — through the multiply. u = a\cdot b has local derivatives \partial u/\partial a = b and \partial u/\partial b = a. Multiply each by the upstream 14 (using the stored forward values a=3,\ b=2):

\frac{\partial \mathcal{L}}{\partial a} = b\cdot 14 = 2\cdot 14 = 28, \qquad \frac{\partial \mathcal{L}}{\partial b} = a\cdot 14 = 3\cdot 14 = 42.

Step 4 — collect. One forward sweep and one backward sweep produced every gradient at once:

\frac{\partial \mathcal{L}}{\partial a} = 28, \quad \frac{\partial \mathcal{L}}{\partial b} = 42, \quad \frac{\partial \mathcal{L}}{\partial c} = 14.

Check one by hand if you like: \mathcal{L} = (ab + c)^2 gives \partial \mathcal{L}/\partial a = 2(ab+c)\cdot b = 2\cdot 7\cdot 2 = 28. It matches — and the graph never needed the closed form. It only ever multiplied a local derivative by an upstream number.

Reverse-mode automatic differentiation computes the gradient of one output with respect to all inputs in a single backward sweep:

A graph of operations. The computation is recorded as a DAG whose nodes are elementary ops (add, multiply, square, …).
Forward stores values. One forward pass evaluates and caches every node's output.
Reverse multiplies derivatives. Walking back from \partial \mathcal{L}/\partial \mathcal{L} = 1, each node returns (\text{local derivative}) \times (\text{upstream gradient}) to its inputs.
It is backprop, generalised. Neural-network backpropagation is exactly this rule applied to the network's graph — autodiff just does it for any computation.

Step through the graph

Below is the graph for \mathcal{L} = (a\cdot b + c)^2. The forward values sit on the nodes; stepping forward fills them in. Then the backward pass lights up the gradients flowing down the edges — each one the local derivative times the gradient above it. Watch 14 arrive at the add, split unchanged to u and c, then fan out to 28 and 42 at the multiply.

There are two ways to automate the chain rule. Forward-mode pushes derivatives along with the values, computing \partial(\text{everything})/\partial a in one sweep — one sweep per input. Reverse-mode (above) computes \partial \mathcal{L}/\partial(\text{everything}) in one sweep — one sweep per output.

Deep learning has a loss that is one scalar and parameters that number in the millions or billions: many inputs, one output. Reverse-mode nails that shape — every gradient for the price of a single backward pass, independent of how many parameters there are. Forward-mode would need one pass per parameter, which is hopeless. That asymmetry is the entire reason frameworks are built around reverse-mode autodiff (and why it is backprop).

Where this is going

Autodiff will happily differentiate any graph you build — including the one at the very end of a classifier, the softmax. Its gradient turns out to be one of the most elegant in all of deep learning, and it is the subject of the next page.