Multivariable Taylor and the Hessian
A straight line is blind to curvature. The best linear approximation
tells you which way a landscape tilts, but stand at a spot where the ground is dead level — a
mountain peak, the bottom of a valley, or a mountain pass — and the tilt is zero at all
three. The linear term throws up its hands: peak, pit, and pass look identical to first order. To
tell them apart you must look at how the surface bends, and bending is a
second-order quantity.
The bookkeeping device for all the second derivatives at once is the Hessian
matrix, and the statement that packages first and second order together is the
second-order Taylor expansion. Together they answer the question every optimiser
asks — "is this critical point a minimum, a maximum, or a saddle?" — and they are the reason
multivariable optimization
has a systematic test at all. The same matrix decides the stability of an equilibrium in physics
(a ball rests in a bowl, not on a dome) and drives second-order methods such as Newton's method in machine
learning.
Second order in one variable, then in many
You already know the one-variable Taylor expansion to second order:
f(a + h) = f(a) + f'(a)\,h + \tfrac12 f''(a)\,h^2 + o(h^2).
Three ingredients: the value, the slope times the step, and — the new part — half the
curvature times the step squared. That quadratic term is what bends the
approximation off the tangent line to hug the graph. Now promote every piece to vectors. Let
f : \mathbb{R}^n \to \mathbb{R} be twice continuously differentiable
(C^2). Then
f(a + \mathbf{h}) = f(a) + \nabla f(a)\cdot \mathbf{h} + \tfrac12\, \mathbf{h}^{\top} H(a)\, \mathbf{h} + o\big(\lVert \mathbf{h}\rVert^2\big).
The slope f'(a) became the gradient
\nabla f(a), and the curvature f''(a) became the
Hessian H(a) — the n \times n
matrix of all second
partial derivatives,
H(a) = \begin{pmatrix} \dfrac{\partial^2 f}{\partial x_1^2} & \cdots & \dfrac{\partial^2 f}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 f}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 f}{\partial x_n^2} \end{pmatrix}, \qquad H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}.
The middle term \nabla f(a)\cdot\mathbf{h} is the linear map from before;
the new term \tfrac12\,\mathbf{h}^\top H\,\mathbf{h} is a
quadratic form — the multivariable "\tfrac12 f'' h^2."
-
If f is C^2, mixed partials commute
(Clairaut / Schwarz):
\dfrac{\partial^2 f}{\partial x_i \partial x_j} = \dfrac{\partial^2 f}{\partial x_j \partial x_i}.
-
Hence H(a) is a symmetric matrix, so it has real
eigenvalues and orthogonal eigenvectors — the fact that makes the second-derivative test work.
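The expansion can be checked numerically. The sketch below is one way to do it, under stated assumptions: the test function f, the base point, the direction of the step, and the finite-difference step sizes are all illustrative choices, not part of the text above. It estimates the gradient and Hessian by central differences, builds the quadratic model, and records how the error shrinks as the step is halved.

```python
import math

def f(x, y):
    # Hypothetical smooth test function, chosen only for illustration.
    return math.exp(x) * math.sin(y) + x * y**2

def gradient(func, x, y, eps=1e-5):
    # Central differences for the two first partials.
    fx = (func(x + eps, y) - func(x - eps, y)) / (2 * eps)
    fy = (func(x, y + eps) - func(x, y - eps)) / (2 * eps)
    return fx, fy

def hessian(func, x, y, eps=1e-4):
    # Central differences for the second partials. The mixed partial is
    # computed once and used for both off-diagonal entries, which assumes
    # func is C^2 so that the Hessian is symmetric.
    fxx = (func(x + eps, y) - 2 * func(x, y) + func(x - eps, y)) / eps**2
    fyy = (func(x, y + eps) - 2 * func(x, y) + func(x, y - eps)) / eps**2
    fxy = (func(x + eps, y + eps) - func(x + eps, y - eps)
           - func(x - eps, y + eps) + func(x - eps, y - eps)) / (4 * eps**2)
    return fxx, fxy, fyy

def taylor2(func, a, h):
    # f(a) + grad f(a) . h + (1/2) h^T H(a) h, written out for n = 2.
    ax, ay = a
    hx, hy = h
    gx, gy = gradient(func, ax, ay)
    fxx, fxy, fyy = hessian(func, ax, ay)
    linear = gx * hx + gy * hy
    quadratic = 0.5 * (fxx * hx**2 + 2 * fxy * hx * hy + fyy * hy**2)
    return func(ax, ay) + linear + quadratic

a = (0.3, 0.7)
errors = []
for t in (0.1, 0.05, 0.025):
    h = (t, -t)
    exact = f(a[0] + h[0], a[1] + h[1])
    errors.append(abs(exact - taylor2(f, a, h)))
    print(f"step {t}: error {errors[-1]:.3e}")
```

Because the remainder is o(\lVert \mathbf{h}\rVert^2), halving the step should cut the error by clearly more than a factor of four; for a function this smooth the factor is close to eight.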
See the curvature term earn its place
Take f(x) = \sin x and approximate it at a base point
a in two ways: the tangent line
f(a) + f'(a)(x - a) (first order) and the Taylor parabola
f(a) + f'(a)(x - a) + \tfrac12 f''(a)(x - a)^2 (second order). Vary
a and compare.
The line always shoots off tangentially, hugging the curve for only a sliver before drifting away.
The parabola, carrying the curvature term \tfrac12 f''(a)(x - a)^2, bends the
right way and clings to the curve over a much wider stretch. The parabola flips between
opening upward and opening downward as a crosses an inflection point, where
f'' changes sign: the sign of the second-order term is exactly the
information the linear term cannot carry. In many variables that single sign becomes the
definiteness of the Hessian.
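The comparison between the two approximations can be reproduced with a few lines of code. This is a minimal sketch: the base point a = 1.0 and the offsets are arbitrary illustrative choices, and the helper names are hypothetical.

```python
import math

def tangent_line(a, x):
    # First-order approximation of sin at base point a.
    return math.sin(a) + math.cos(a) * (x - a)

def taylor_parabola(a, x):
    # Second-order approximation: adds (1/2) f''(a) (x - a)^2, with f'' = -sin.
    return tangent_line(a, x) - 0.5 * math.sin(a) * (x - a) ** 2

def opens_upward(a):
    # The parabola opens upward exactly when f''(a) = -sin(a) is positive.
    return -math.sin(a) > 0

a = 1.0
comparison = []
for dx in (0.1, 0.3, 0.5):
    x = a + dx
    err_line = abs(math.sin(x) - tangent_line(a, x))
    err_parabola = abs(math.sin(x) - taylor_parabola(a, x))
    comparison.append((err_line, err_parabola))
    print(f"dx={dx}: line error {err_line:.2e}, parabola error {err_parabola:.2e}")

# The inflection point of sin at x = pi separates the two opening directions.
print(opens_upward(math.pi - 0.5), opens_upward(math.pi + 0.5))
```

At each offset the parabola's error is smaller than the line's, and `opens_upward` changes value on either side of x = \pi, where f'' changes sign.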
The second-derivative test: reading the quadratic form
At a critical point a the gradient vanishes,
\nabla f(a) = 0, so the linear term drops out and the Taylor expansion
reduces to
f(a + \mathbf{h}) \approx f(a) + \tfrac12\, \mathbf{h}^{\top} H(a)\, \mathbf{h}.
The behaviour near a is now dictated entirely by the quadratic form — by
whether \mathbf{h}^\top H\, \mathbf{h} is positive, negative, or mixed. Its
sign is governed by the eigenvalues of the symmetric matrix
H(a):
At a critical point a of a C^2 function:
-
H(a) positive definite (all eigenvalues
> 0) \Rightarrow strict local
minimum — the bowl curves up in every direction.
-
H(a) negative definite (all eigenvalues
< 0) \Rightarrow strict local
maximum.
-
H(a) indefinite (eigenvalues of both signs)
\Rightarrow saddle point — up one way, down another.
-
H(a) singular and not indefinite (some eigenvalue
= 0, the others all of one sign) \Rightarrow the test is
inconclusive.
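The link between the eigenvalues and the sign of the quadratic form can be spelled out in one line. The notation here is introduced only for this step and is not used elsewhere in the text: Q is an orthogonal matrix whose columns are eigenvectors of H(a), \Lambda is the diagonal matrix of eigenvalues \lambda_1, \dots, \lambda_n, and \mathbf{c} holds the coordinates of \mathbf{h} in the eigenvector basis.

```latex
H = Q \Lambda Q^{\top}, \qquad \mathbf{h} = Q\mathbf{c}
\quad\Longrightarrow\quad
\mathbf{h}^{\top} H\, \mathbf{h}
  = \mathbf{c}^{\top} \Lambda\, \mathbf{c}
  = \sum_{i=1}^{n} \lambda_i\, c_i^{2}.
```

Each c_i^2 is non-negative, so the sum is positive for every nonzero \mathbf{h} exactly when every \lambda_i > 0, negative for every nonzero \mathbf{h} exactly when every \lambda_i < 0, and takes both signs when the eigenvalues do. A zero eigenvalue leaves a direction in which the quadratic form vanishes.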
In two variables you need not find eigenvalues — the discriminant does it. With
H = \begin{pmatrix} f_{xx} & f_{xy} \\ f_{xy} & f_{yy} \end{pmatrix}, let
D = \det H = f_{xx}f_{yy} - f_{xy}^2. Then
D > 0,\ f_{xx} > 0 gives a minimum;
D > 0,\ f_{xx} < 0 a maximum; D < 0 a saddle;
and D = 0 is inconclusive. (The determinant is the product of the two
eigenvalues, so D > 0 means they share a sign, which f_{xx} reveals, and D < 0 means their signs differ.)
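The two-variable test is short enough to write as a function. The sketch below is illustrative only: the function name, the tolerance `tol` used to decide when D counts as zero in floating point, and the four example Hessians are all assumptions, not taken from the text.

```python
def classify_critical_point(fxx, fxy, fyy, tol=1e-12):
    """Classify a critical point of a C^2 function of two variables
    from its second partials, using D = fxx * fyy - fxy**2."""
    D = fxx * fyy - fxy ** 2
    if D > tol:
        # D > 0 forces fxx != 0, so its sign settles the case.
        return "local minimum" if fxx > 0 else "local maximum"
    if D < -tol:
        return "saddle"
    return "inconclusive"

# Hypothetical examples, each with a critical point at the origin:
examples = {
    "x^2 + xy + y^2":   (2, 1, 2),    # D = 3
    "-x^2 + xy - y^2":  (-2, 1, -2),  # D = 3
    "x^2 + 3xy + y^2":  (2, 3, 2),    # D = -5
    "(x + y)^2":        (2, 2, 2),    # D = 0
}
for name, (fxx, fxy, fyy) in examples.items():
    print(name, "->", classify_critical_point(fxx, fxy, fyy))
```

The last example shows why D = 0 has to be reported as inconclusive rather than guessed at: (x + y)^2 is zero along the whole line y = -x, so the origin is a minimum but not a strict one, and the Hessian alone cannot tell.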
Three worked classifications
Each of these has its only critical point at the origin, where
\nabla f = (0, 0).
Bowl — f = x^2 + y^2.
H = \begin{pmatrix} 2 & 0 \\ 0 & 2 \end{pmatrix}, eigenvalues
2, 2 both positive; D = 4 > 0 and
f_{xx} = 2 > 0. Local minimum — the surface curves up
every way you leave the origin.
Saddle — f = x^2 - y^2.
H = \begin{pmatrix} 2 & 0 \\ 0 & -2 \end{pmatrix}, eigenvalues
2 and -2 of opposite sign;
D = -4 < 0. Saddle — up along the
x-axis, down along the y-axis. This is the
Pringle-crisp shape, and no first-order information could ever have flagged it.
Inconclusive — f = x^2 + y^4 vs
g = x^2 - y^4. Both have the same Hessian at the origin,
\begin{pmatrix} 2 & 0 \\ 0 & 0 \end{pmatrix}, with a zero eigenvalue —
D = 0. The test is silent, and rightly so: f has
a minimum there while g has a saddle. Two functions, identical to second
order, different fates — the tie is broken only by the fourth-order term.
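The different fates of the inconclusive pair can be seen by sampling each function on a small circle around the origin. This is a rough numerical illustration rather than a proof; the radius, the number of sample points, and the helper name `ring_values` are arbitrary choices.

```python
import math

def f(x, y):
    return x**2 + y**4

def g(x, y):
    return x**2 - y**4

def ring_values(func, r, n=360):
    # Evaluate func at n equally spaced points on the circle of radius r.
    return [func(r * math.cos(2 * math.pi * k / n),
                 r * math.sin(2 * math.pi * k / n)) for k in range(n)]

r = 0.1
f_ring = ring_values(f, r)
g_ring = ring_values(g, r)

# f stays above f(0, 0) = 0 all the way round: consistent with a minimum.
print("f on ring: min", min(f_ring), "max", max(f_ring))
# g rises along the x-axis and dips along the y-axis: a saddle.
print("g on ring: min", min(g_ring), "max", max(g_ring))
```

The dip in g comes from the -y^4 term, which is far smaller than the x^2 term at this radius. That is the numerical face of a decision made at fourth order.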
Two traps sit around the Hessian test.
-
Inconclusive is a real verdict, not a minimum. When
D = 0 (or any eigenvalue is 0) the quadratic
form is flat in some direction, and the outcome is decided by higher-order terms the
Hessian cannot see. The pair x^2 \pm y^4 above — same Hessian, one a
minimum and one a saddle — is the standard warning. Do not read
"D = 0" as "boundary case, probably a minimum"; it means "look harder."
-
The Hessian is only guaranteed symmetric for C^2 functions.
Clairaut's theorem — that f_{xy} = f_{yx} — needs the mixed partials to
be continuous. There is a famous rogue,
f(x,y) = \dfrac{xy(x^2 - y^2)}{x^2 + y^2} (with
f(0,0)=0), whose mixed partials at the origin disagree
(f_{xy}(0,0) = -1 but f_{yx}(0,0) = +1). Its
Hessian at the origin is not symmetric, and the eigenvalue reasoning breaks. Almost every function you
meet is C^2, but the hypothesis is doing quiet work.
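The disagreement can be observed numerically with nested central differences. The sketch below rests on assumptions that should be read as such: the inner step `eps` must be much smaller than the outer step `k` for the nested differences to approximate the iterated limits, and the specific values are illustrative. Here f_{xy} is read as "differentiate in x, then in y", which matches the values quoted above.

```python
def f(x, y):
    # The counterexample, with f(0, 0) defined to be 0.
    if x == 0.0 and y == 0.0:
        return 0.0
    return x * y * (x**2 - y**2) / (x**2 + y**2)

def fx(x, y, eps=1e-8):
    # First partial in x by central difference.
    return (f(x + eps, y) - f(x - eps, y)) / (2 * eps)

def fy(x, y, eps=1e-8):
    # First partial in y by central difference.
    return (f(x, y + eps) - f(x, y - eps)) / (2 * eps)

k = 1e-3  # outer step, deliberately much larger than eps

# Differentiate f_x with respect to y at the origin.
fxy_origin = (fx(0.0, k) - fx(0.0, -k)) / (2 * k)
# Differentiate f_y with respect to x at the origin.
fyx_origin = (fy(k, 0.0) - fy(-k, 0.0)) / (2 * k)

print("f_xy(0,0) is approximately", fxy_origin)
print("f_yx(0,0) is approximately", fyx_origin)
```

The two estimates land near -1 and +1 respectively, so no choice of step size reconciles them: the asymmetry is a property of the function, not of the approximation.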