Newton and Quasi-Newton Methods

Gradient descent feels the slope and takes a cautious step downhill — but it is blind to curvature. It cannot tell a gently rounded valley (where it should stride confidently to the bottom) from a sharply curved one (where it should tiptoe). Newton's method opens a second eye: it uses the second derivative — the curvature — to jump straight toward the minimum in one informed leap instead of many timid ones.

The idea is elegant. Near any point, approximate f by its second-order Taylor polynomial — a parabola that matches the value, slope, and curvature at x_k:

f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \tfrac12 f''(x_k)(x - x_k)^2.

A parabola has an obvious minimum — its vertex — so instead of inching, jump straight to the vertex. Minimizing the right-hand side gives the Newton update:

x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.

In several variables the second derivative becomes the Hessian matrix \nabla^2 f, and dividing becomes multiplying by its inverse:

\mathbf{x}_{k+1} = \mathbf{x}_k - \bigl[\nabla^2 f(\mathbf{x}_k)\bigr]^{-1} \nabla f(\mathbf{x}_k).

Fit a parabola, jump to its vertex

Newton's method is best seen. Below is a convex curve f(x) = 0.1x^4 + 0.5x^2. At each step it fits the matching parabola (the second-order Taylor approximation) at the current point, then leaps to that parabola's lowest point. Step through the iterations: the leaps home in on the minimum with startling speed — each step roughly squares the error, so the number of correct digits doubles every iteration.

One step on a quadratic — exactly

The clearest demonstration of Newton's power: on a quadratic, it lands on the exact minimum in a single step, from anywhere. Take f(x) = (x - 3)^2, so f'(x) = 2(x - 3) and f''(x) = 2. Start at x_0 = 10:

x_1 = 10 - \frac{2(10 - 3)}{2} = 10 - 7 = 3.

Done — the minimum, in one leap, because the parabola Newton fits is the function: its vertex is the true minimum. Gradient descent, by contrast, would have crept toward x = 3 over many steps. This is the deep reason Newton is fast near an optimum: any smooth function looks quadratic there, so the local parabola is nearly exact and each step is nearly perfect.

Close enough to a minimum where f'' > 0 (or the Hessian is positive definite), Newton's method converges quadratically: the error e_{k+1} is proportional to e_k^{\,2}, so the number of accurate digits roughly doubles each step. On a quadratic it converges in one step.

Quasi-Newton: the curvature you can afford

Newton's speed has a price: every step needs the full Hessian — n^2 second derivatives — and the solution of a linear system to invert it. For a model with a million parameters that is a million-by-million matrix per step: utterly impractical. Quasi-Newton methods are the ingenious compromise. They never compute the Hessian; instead they build up an approximation to it (or directly to its inverse) from the gradients they were going to compute anyway, watching how the gradient changes from step to step to infer the curvature.

The celebrated BFGS method (and its low-memory cousin L-BFGS, the default workhorse for large smooth problems) does exactly this. It keeps most of Newton's fast convergence — a superlinear rate, between gradient descent's linear crawl and Newton's quadratic sprint — at a fraction of the cost, using only first derivatives. It is the sweet spot: curvature-aware, but Hessian-free.

BFGS is named for Broyden, Fletcher, Goldfarb and Shanno — four mathematicians who, astonishingly, discovered essentially the same update rule independently and published it all in 1970, in four separate papers. Rather than fight over priority, the community simply stapled all four surnames together. It is one of mathematics' friendlier acronyms and a nice case of an idea whose time had so clearly come that four people reached for it at once. Its memory-thrifty descendant, L-BFGS, still quietly powers a huge share of the world's numerical optimization today.