Backpropagation

To train a network we need the gradient of the loss with respect to every weight — often millions of them. Computing each one separately would be hopeless. Backpropagation gets them all in a single backward sweep, using one idea you already know: the chain rule.

The network is a chain of functions, so the chain rule says the gradient at an early layer is the product of the local slopes of all the layers after it. Backprop computes the error at the output, then propagates it backward layer by layer — each layer multiplying the incoming gradient by its own local slope and passing it on.

Forward, then backward

Step through a full training pass. First the signal flows forward to a prediction (blue). Then the error flows backward (orange), the chain rule handing each layer the gradient it needs to adjust its weights. Two sweeps — one to predict, one to learn.

The algorithm that lit the fuse

Backpropagation is what makes training deep networks feasible: it computes every gradient in roughly the same time as a single forward pass, no matter how many weights. Paired with gradient descent, it's the engine behind essentially all of deep learning. Modern frameworks do it automatically (autodiff), but it is, at heart, just the chain rule applied with ruthless efficiency — calculus and linear algebra, working together.