Newton and Quasi-Newton Methods
Gradient descent feels the
slope and takes a cautious step downhill — but it is blind to curvature. It cannot tell a
gently rounded valley (where it should stride confidently to the bottom) from a sharply curved one
(where it should tiptoe). Newton's method opens a second eye: it uses the second
derivative — the curvature — to jump straight toward the minimum in one informed leap instead of many
timid ones.
The idea is elegant. Near any point, approximate f by its
second-order
Taylor polynomial — a parabola that matches the value, slope, and curvature
at x_k:
f(x) \approx f(x_k) + f'(x_k)(x - x_k) + \tfrac12 f''(x_k)(x - x_k)^2.
A parabola has an obvious minimum — its vertex — so instead of inching, jump straight to the
vertex. Minimizing the right-hand side gives the Newton update:
x_{k+1} = x_k - \frac{f'(x_k)}{f''(x_k)}.
In several variables the second derivative becomes the Hessian matrix
\nabla^2 f, and dividing becomes multiplying by its inverse:
\mathbf{x}_{k+1} = \mathbf{x}_k - \bigl[\nabla^2 f(\mathbf{x}_k)\bigr]^{-1} \nabla f(\mathbf{x}_k).
Fit a parabola, jump to its vertex
Newton's method is best seen. Below is a convex curve
f(x) = 0.1x^4 + 0.5x^2. At each step it fits the matching parabola (the
second-order Taylor approximation) at the current point, then leaps to that parabola's lowest point.
Step through the iterations: the leaps home in on the minimum with startling speed — each step roughly
squares the error, so the number of correct digits doubles every iteration.
One step on a quadratic — exactly
The clearest demonstration of Newton's power: on a quadratic, it lands on the exact minimum in
a single step, from anywhere. Take f(x) = (x - 3)^2, so
f'(x) = 2(x - 3) and f''(x) = 2. Start at
x_0 = 10:
x_1 = 10 - \frac{2(10 - 3)}{2} = 10 - 7 = 3.
Done — the minimum, in one leap, because the parabola Newton fits is the function: its vertex
is the true minimum. Gradient descent, by contrast, would have crept toward
x = 3 over many steps. This is the deep reason Newton is fast near an
optimum: any smooth function looks quadratic there, so the local parabola is nearly exact and each step
is nearly perfect.
Close enough to a minimum where f'' > 0 (or the Hessian is positive
definite), Newton's method converges quadratically: the error
e_{k+1} is proportional to e_k^{\,2}, so the
number of accurate digits roughly doubles each step. On a quadratic it converges in one
step.
Quasi-Newton: the curvature you can afford
Newton's speed has a price: every step needs the full Hessian — n^2 second
derivatives — and the solution of a linear system to invert it. For a model with a million parameters
that is a million-by-million matrix per step: utterly impractical. Quasi-Newton
methods are the ingenious compromise. They never compute the Hessian; instead they build up an
approximation to it (or directly to its inverse) from the gradients they were going to compute
anyway, watching how the gradient changes from step to step to infer the curvature.
The celebrated BFGS method (and its low-memory cousin L-BFGS, the default workhorse
for large smooth problems) does exactly this. It keeps most of Newton's fast convergence — a
superlinear rate, between gradient descent's linear crawl and Newton's quadratic sprint —
at a fraction of the cost, using only first derivatives. It is the sweet spot: curvature-aware, but
Hessian-free.
BFGS is named for Broyden, Fletcher, Goldfarb and Shanno — four mathematicians who,
astonishingly, discovered essentially the same update rule independently and published it all
in 1970, in four separate papers. Rather than fight over priority, the community simply
stapled all four surnames together. It is one of mathematics' friendlier acronyms and a nice case of an
idea whose time had so clearly come that four people reached for it at once. Its memory-thrifty
descendant, L-BFGS, still quietly powers a huge share of the world's numerical optimization today.
-
Newton solves f' = 0 — any stationary point. The update
is happy to converge to a maximum or a saddle, not just a minimum. It only reliably
seeks a minimum when the curvature is right: f'' > 0 in 1-D, or a
positive-definite Hessian in several variables.
-
The Hessian must be invertible. If f''(x_k) = 0 (a flat
inflection) the step divides by zero and blows up; a singular Hessian does the same. Practical solvers
add a safeguard (a "trust region" or a damping term) to stay well-behaved.
-
Newton can diverge far from the optimum. The quadratic-convergence guarantee is
local. Started too far away, the fitted parabola can be a poor model and the step can
overshoot wildly. Real implementations combine Newton's direction with a line search (a controlled
step length) to stay safe globally.