The Loss Landscape

For linear regression the cost was a perfect bowl — one minimum, easy to find. Neural networks are not so kind. Because their layers are stacked and non-linear, the loss as a function of the millions of weights is a wild, bumpy landscape: hills, valleys, ridges, and many local minima rather than a single global one.

A surface like this is called non-convex. Gradient descent still works — follow the slope downhill — but it can settle into a local valley that isn't the deepest one. Where it ends up depends on where it starts.
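To make the dependence on the starting point concrete, here is a minimal sketch. It assumes a made-up one-dimensional "landscape", loss(x) = x^4 - 3x^2 + x, standing in for a real network's loss; the function names, learning rate, and step count are all illustrative choices, not anything from a particular library.

```python
def loss(x):
    # Toy non-convex landscape with two valleys (an assumption for illustration).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent: repeatedly take a small step against the slope.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


left = descend(-2.0)   # settles in the deeper valley, near x = -1.30
right = descend(2.0)   # settles in the shallower valley, near x = 1.13
print(left, loss(left))
print(right, loss(right))
```

Same rule, same surface, two different answers: the only thing that changed was the starting position.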

A bumpy descent

Picture a ball released at some starting position and rolling downhill into the nearest valley. Some starts reach the deep global minimum; others get caught in a shallower local one. This is one reason neural networks are sometimes trained more than once, from different random starts: any single run can end up in a poorer valley.
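The "train more than once" idea can be sketched as a loop over random starts that keeps the best result. This is an illustration on a hypothetical toy loss, loss(x) = x^4 - 3x^2 + x, not a recipe for real networks; the helper best_of_restarts, the seed, and the range of starting points are assumptions made for the example.

```python
import random


def loss(x):
    # Toy non-convex landscape with two valleys (an assumption for illustration).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1


def descend(x, lr=0.01, steps=2000):
    # Plain gradient descent from the given starting point.
    for _ in range(steps):
        x -= lr * grad(x)
    return x


def best_of_restarts(starts):
    # Run one descent per starting point and keep the end point with the lowest loss.
    finals = [descend(x0) for x0 in starts]
    return min(finals, key=loss)


rng = random.Random(0)  # fixed seed so the run is repeatable
starts = [rng.uniform(-2.0, 2.0) for _ in range(10)]
best = best_of_restarts(starts)
print(best, loss(best))
```

Each extra start is another chance to land in the deeper valley, at the cost of one more full training run.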

Why it works anyway

Here's the happy surprise: in the very high-dimensional spaces of real networks, most local minima turn out to be almost as good as the global one, and truly bad traps are rare. Add a dash of randomness, such as computing each step from a small batch of data (stochastic gradient descent), plus a little momentum, and training usually finds very good solutions in practice despite the chaos. To actually compute the downhill direction through all those layers, we need backpropagation.
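As a rough sketch of those two ingredients, the snippet below adds a momentum term and optional random noise to plain gradient descent on a hypothetical toy loss, loss(x) = x^4 - 3x^2 + x. The Gaussian noise is only a stand-in for the variation between mini-batches, and the function name sgd_momentum and all hyperparameters are assumptions chosen for illustration. On this particular function, with these particular settings, momentum happens to carry the ball over the hump that stops plain gradient descent; that is not a general guarantee.

```python
import random


def loss(x):
    # Toy non-convex landscape with two valleys (an assumption for illustration).
    return x**4 - 3 * x**2 + x


def grad(x):
    # Derivative of the toy loss above.
    return 4 * x**3 - 6 * x + 1


def sgd_momentum(x, lr=0.01, momentum=0.9, noise_scale=0.0, steps=2000, seed=0):
    # Heavy-ball update: the velocity remembers a fraction of previous steps.
    # Gaussian noise on the gradient is a crude stand-in for mini-batch noise.
    rng = random.Random(seed)
    velocity = 0.0
    for _ in range(steps):
        noisy_grad = grad(x) + rng.gauss(0.0, noise_scale)
        velocity = momentum * velocity - lr * noisy_grad
        x += velocity
    return x


plain = sgd_momentum(2.0, momentum=0.0)   # no momentum: stops near x = 1.13
heavy = sgd_momentum(2.0, momentum=0.9)   # with momentum: ends near x = -1.30
print(plain, loss(plain))
print(heavy, loss(heavy))
```

Setting noise_scale above zero makes each run jitter around a valley rather than settle exactly at its bottom, which is closer to how mini-batch training behaves.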