For linear regression the cost was a perfect
A surface like this is called non-convex.
Drag the starting position and watch the ball roll downhill into the nearest valley. Some starts find the deep global minimum; others get caught in a shallower local one. This is why neural networks are trained more than once, from different random starts — the landscape is full of traps.
Here's the happy surprise: in the very high-dimensional spaces of real networks, most local minima
turn out to be almost as good as the global one, and truly bad traps are rare. Add a dash
of randomness — small batches of data each step (stochastic gradient descent), a little
momentum — and training reliably finds excellent solutions despite the chaos. To actually compute
the downhill direction through all those layers, we need