The Normal Equation

Gradient descent crawls to the best fit. For linear regression there's a shortcut that leaps straight to it in a single calculation: the normal equation. Stack the feature vectors as the rows of a matrix X (adding a column of ones if the line has an intercept) and the labels into a vector \vec{y}; the optimal weights are

\vec{w} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} \vec{y}.

No steps, no learning rate, no iterations: just one formula built from a matrix inverse. It pinpoints the very bottom of the cost bowl, the one spot where the slope is exactly zero in every direction, and it finds that spot analytically.
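
As a rough illustration, the formula can be evaluated in a few lines of NumPy. This is a minimal sketch, not a reference implementation: the function name normal_equation and the toy data are made up for this example, and the column of ones is an assumption that gives the line an intercept. It calls np.linalg.solve on (X^{\mathsf T} X) \vec{w} = X^{\mathsf T} \vec{y} instead of forming the inverse explicitly, which gives the same weights and is generally the more numerically stable route.

```python
import numpy as np

def normal_equation(X, y):
    """Return the weights w that satisfy (X^T X) w = X^T y.

    Solving the linear system is equivalent to applying
    (X^T X)^{-1} X^T y, without building the inverse itself.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)

# Hypothetical toy data: four points lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Assumed layout: a column of ones so the first weight is the intercept.
X = np.column_stack([np.ones_like(x), x])

w = normal_equation(X, y)
print(w)  # intercept and slope, approximately [1. 2.]
```

Because the toy points sit exactly on a line, the recovered weights match the intercept and slope used to generate them, up to floating-point rounding.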

One shot to the optimum

Try to beat the formula. Adjust your line and compare your cost with the optimal cost the normal equation achieves (the faint line is its answer). No matter how carefully you tune by hand, you can only ever match it — never beat it. That faint line is the provably best straight-line fit.
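
The same experiment can be sketched in code. The snippet below is only an illustration under stated assumptions: it takes the cost to be mean squared error, generates made-up noisy data, and stands in for hand tuning by nudging the optimal weights at random a thousand times. The names cost, w_best, and trial_costs are invented for this example.

```python
import numpy as np

def cost(X, y, w):
    """Mean squared error of the predictions X @ w (assumed cost)."""
    residual = X @ w - y
    return float(residual @ residual) / len(y)

rng = np.random.default_rng(0)

# Hypothetical data: a noisy line with intercept 3 and slope 0.5.
x = rng.uniform(0.0, 10.0, size=50)
y = 3.0 + 0.5 * x + rng.normal(0.0, 1.0, size=50)
X = np.column_stack([np.ones_like(x), x])

# The normal equation's answer and the cost it achieves.
w_best = np.linalg.solve(X.T @ X, X.T @ y)
best_cost = cost(X, y, w_best)

# 1000 "hand-tuned" lines: the optimum plus a random nudge.
trial_costs = [
    cost(X, y, w_best + rng.normal(0.0, 0.5, size=2))
    for _ in range(1000)
]

print("normal equation cost:", best_cost)
print("best hand-tuned cost:", min(trial_costs))
print("any trial beat it?   ", any(c < best_cost for c in trial_costs))
```

Every nudged line lands higher up the bowl, so none of the trial costs drops below the one the normal equation achieves.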

So why ever use gradient descent?

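Because that (X^{\mathsf T} X)^{-1} gets brutally expensive as features grow. Inverting an n-by-n matrix takes roughly n^3 operations, so doubling the number of features makes the inversion about eight times slower. The inverse does not exist at all if X^{\mathsf T} X is singular, which happens, for example, when one feature is an exact linear combination of others or when there are fewer examples than features. With many thousands of features or millions of examples, gradient descent usually wins, and it's the only option for models like neural networks that have no tidy formula at all. The normal equation is the elegant special case; gradient descent is the universal workhorse.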
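
To see that the two methods agree where both apply, here is a small sketch comparing them on made-up data. It assumes a mean squared error cost, plain batch gradient descent, and a hand-picked learning rate and iteration count; none of these come from the text above, and other settings would converge faster or slower.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical data: a noisy line with intercept 2 and slope -3.
x = rng.uniform(-1.0, 1.0, size=200)
y = 2.0 - 3.0 * x + rng.normal(0.0, 0.1, size=200)
X = np.column_stack([np.ones_like(x), x])

# One shot: the normal equation, solved as a linear system.
w_exact = np.linalg.solve(X.T @ X, X.T @ y)

# Many small steps: batch gradient descent on mean squared error.
w = np.zeros(2)
learning_rate = 0.1  # assumed value, chosen by hand for this data
for _ in range(5000):
    gradient = (2.0 / len(y)) * X.T @ (X @ w - y)
    w -= learning_rate * gradient

print("normal equation: ", w_exact)
print("gradient descent:", w)
```

On a problem this small the closed form is the obvious choice. The loop earns its keep only when X^{\mathsf T} X is too large to solve directly, or when the model has no closed form at all.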