To train a network we need the gradient of the loss with respect to every weight — often
millions of them. Computing each one separately would be hopeless.
Backpropagation gets them all in a single backward sweep, using one idea you
already know: the
The network is a chain of functions, so the chain rule says the gradient at an early layer is the product of the local slopes of all the layers after it. Backprop computes the error at the output, then propagates it backward layer by layer — each layer multiplying the incoming gradient by its own local slope and passing it on.
Step through a full training pass. First the signal flows forward to a prediction (blue). Then the error flows backward (orange), the chain rule handing each layer the gradient it needs to adjust its weights. Two sweeps — one to predict, one to learn.
Backpropagation is what makes training deep networks feasible: it computes every gradient
in roughly the same time as a single forward pass, no matter how many weights. Paired with