The Normal Equation
Gradient descent crawls to the best fit. For linear regression there's a shortcut that
leaps straight to it in a single calculation — the normal equation. Stack the
feature vectors as the rows of a matrix X and the labels into a vector
\vec{y}; the optimal weights are
\vec{w} = (X^{\mathsf T} X)^{-1} X^{\mathsf T} \vec{y}.
No steps, no learning rate, no iterations — just one formula built from a
matrix inverse.
It's the exact bottom of the cost bowl, the one point where the slope is zero in every direction, found analytically.
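As a rough illustration, the formula can be typed almost symbol for symbol in NumPy. This is a minimal sketch, not a production routine: the function name `normal_equation`, the toy data, and the column of ones added for the intercept are all assumptions made for the example.

```python
import numpy as np

def normal_equation(X, y):
    """Compute w = (X^T X)^{-1} X^T y literally, as the formula is written."""
    return np.linalg.inv(X.T @ X) @ X.T @ y

# Hypothetical toy data: four points lying exactly on y = 1 + 2x.
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# One row per example: a constant 1 (for the intercept) and the feature x.
X = np.column_stack([np.ones_like(x), x])

w = normal_equation(X, y)
print(w)  # close to [1. 2.]: intercept 1, slope 2
```

In practice it is usually safer to call `np.linalg.solve(X.T @ X, X.T @ y)`, which solves the same system without forming the inverse explicitly; the literal version is kept here only because it mirrors the formula.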
One shot to the optimum
Try to beat the formula. Adjust your line and compare your cost with the
optimal cost the normal equation achieves (the faint line is its answer). No
matter how carefully you tune by hand, you can only ever match it — never beat it. That faint
line is the provably best straight-line fit.
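The same experiment can be run without the sliders. The sketch below assumes the cost is mean squared error (the text does not name it) and uses made-up noisy data; it tries a thousand random lines and checks that none of them scores below the normal-equation fit.

```python
import numpy as np

def mse_cost(X, y, w):
    """Mean squared error of the line with weights w (assumed cost function)."""
    residuals = X @ w - y
    return float(np.mean(residuals ** 2))

# Hypothetical noisy data scattered around y = 0.5 + 1.5x.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 5.0, 30)
y = 0.5 + 1.5 * x + rng.normal(scale=0.4, size=x.size)
X = np.column_stack([np.ones_like(x), x])

# The normal-equation answer and its cost.
w_best = np.linalg.solve(X.T @ X, X.T @ y)
best_cost = mse_cost(X, y, w_best)

# A thousand "hand-tuned" attempts: random intercepts and slopes near the optimum.
guesses = w_best + rng.normal(scale=0.5, size=(1000, 2))
guess_costs = [mse_cost(X, y, w) for w in guesses]

print(best_cost, min(guess_costs))
```

The second number printed should never come out smaller than the first: every guess lands somewhere on the bowl, and the normal equation sits at its bottom.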
So why ever use gradient descent?
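Because that (X^{\mathsf T} X)^{-1} gets brutally expensive as features
grow: with d features it's a d-by-d matrix, and inverting it takes on the order
of d^3 operations. Worse, the inverse doesn't exist at all if X^{\mathsf T} X is
singular, which happens when features are redundant or there are fewer examples than features.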
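To see the singular case concretely, here is a small sketch with a deliberately redundant feature (the data and the duplicated column are invented for the example). It checks the rank of X^T X and then falls back on `np.linalg.lstsq`, which is one common workaround and not something the normal equation itself provides.

```python
import numpy as np

# Hypothetical data: the second feature is an exact copy of the first.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = 3.0 + 2.0 * x
X = np.column_stack([np.ones_like(x), x, x])

gram = X.T @ X                       # the 3x3 matrix the formula wants to invert
rank = np.linalg.matrix_rank(gram)
print(rank, gram.shape[0])           # rank is below 3, so no inverse exists

# Least squares still has solutions (infinitely many here); lstsq picks
# the one with the smallest norm instead of relying on an inverse.
w, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(X @ w, y))         # the fit still reproduces the labels
```

Because the two copies of the feature are interchangeable, the data cannot say how to split the slope between them, which is exactly why a unique inverse-based answer is unavailable.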
With thousands of features or millions of examples, gradient descent wins
easily, and it's the only option for models like neural
networks that have no tidy formula at all. The normal equation is the elegant special case;
gradient descent is the universal workhorse.
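For linear regression the two approaches end up in the same place, which a short sketch can confirm. Everything specific here is an assumption for illustration: the synthetic data, a mean-squared-error cost, a learning rate of 0.1, and 2000 steps.

```python
import numpy as np

# Hypothetical noisy data scattered around y = 2 - 3x.
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=50)
y = 2.0 - 3.0 * x + rng.normal(scale=0.1, size=x.size)
X = np.column_stack([np.ones_like(x), x])

# One shot: solve the normal equation's linear system.
w_closed = np.linalg.solve(X.T @ X, X.T @ y)

# Many small steps: plain batch gradient descent on mean squared error.
w_gd = np.zeros(2)
learning_rate = 0.1                  # assumed value; needs tuning in general
for _ in range(2000):
    gradient = (2.0 / len(y)) * X.T @ (X @ w_gd - y)
    w_gd -= learning_rate * gradient

print(w_closed)
print(w_gd)                          # should agree with w_closed to many decimals
```

The loop never builds or inverts X^T X; each step costs only a couple of matrix-vector products, which is why the iterative route keeps working when the closed form becomes impractical.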