The directional derivative gave us a vector worth naming. Collecting the two
partials of f(x, y) into a single vector defines the gradient:
\nabla f = \left( f_x,\; f_y \right).
It is more than bookkeeping. The gradient is a genuine arrow living in the
xy-plane, and that arrow has three superpowers: it points the
steepest way uphill, its length is exactly how steep that climb is, and
it always stands perpendicular to the level curves. All three fall out of
one identity.
Deriving the three properties
Everything starts from the directional-derivative formula and a single fact about dot
products.
Step 1 — write the rate as a dot product. For a unit direction
\mathbf{u},
D_{\mathbf{u}} f = \nabla f \cdot \mathbf{u}.
Step 2 — turn the dot product into a cosine. For any two vectors,
\mathbf{a} \cdot \mathbf{b} = \|\mathbf{a}\|\,\|\mathbf{b}\|\cos\theta.
Here \|\mathbf{u}\| = 1, so with
\theta the angle between \nabla f and
\mathbf{u},
D_{\mathbf{u}} f = \|\nabla f\|\,\|\mathbf{u}\|\cos\theta = \|\nabla f\|\cos\theta.
Step 3 — maximise over directions. As
\mathbf{u} swings around, the only thing that changes is
\cos\theta, which is largest when
\theta = 0 — that is, when \mathbf{u}
points along \nabla f. There
\cos\theta = 1 and
\max_{\mathbf{u}} D_{\mathbf{u}} f = \|\nabla f\|.
So \nabla f points in the direction of steepest
ascent, and the steepest slope is its magnitude
\|\nabla f\|. (At \theta = 180^\circ,
\cos\theta = -1: the opposite direction
-\nabla f is steepest descent.)
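Steps 1 to 3 can be checked numerically. The sketch below is only an illustration under stated assumptions: the function f(x, y) = x^2 + 3xy, the point (1, 2), and the grid of 3600 sample directions are all hypothetical choices, not part of the derivation. It sweeps unit directions, keeps the one with the largest directional derivative, and compares that rate with \|\nabla f\|.

```python
import math

def f(x, y):
    # Hypothetical example function, chosen only for illustration.
    return x**2 + 3 * x * y

def grad_f(x, y):
    # Partials worked out by hand: f_x = 2x + 3y, f_y = 3x.
    return (2 * x + 3 * y, 3 * x)

def directional_derivative(x, y, theta):
    # D_u f = grad f . u for the unit vector u = (cos theta, sin theta).
    gx, gy = grad_f(x, y)
    return gx * math.cos(theta) + gy * math.sin(theta)

x0, y0 = 1.0, 2.0
gx, gy = grad_f(x0, y0)
grad_norm = math.hypot(gx, gy)

# Sample 3600 directions around the circle and keep the steepest one.
angles = [2 * math.pi * k / 3600 for k in range(3600)]
best_angle = max(angles, key=lambda t: directional_derivative(x0, y0, t))
best_rate = directional_derivative(x0, y0, best_angle)

# Cross-check the formula with a forward difference along the best direction.
h = 1e-6
ux, uy = math.cos(best_angle), math.sin(best_angle)
fd_rate = (f(x0 + h * ux, y0 + h * uy) - f(x0, y0)) / h

print("gradient norm:       ", grad_norm)
print("best sampled rate:   ", best_rate)
print("finite-difference:   ", fd_rate)
print("best direction:      ", (ux, uy))
print("gradient direction:  ", (gx / grad_norm, gy / grad_norm))
```

The best sampled rate should sit just below the gradient norm, and the best direction should agree with \nabla f / \|\nabla f\| up to the resolution of the angle grid.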
Step 4 — set up the level curve. A level curve is the set where
f stays constant. Walk along it with a path
\big(x(t), y(t)\big), so that
f\big(x(t), y(t)\big) = c \quad\text{(constant) for all } t.
Step 5 — differentiate the constant. The right side is constant, so its
derivative is zero; the left side opens up by the chain rule:
\frac{d}{dt}\, f\big(x(t), y(t)\big) = f_x\, x'(t) + f_y\, y'(t) = 0.
Step 6 — recognise the dot product. That sum is the gradient dotted with
the path's velocity \mathbf{T} = (x', y'), which is the
tangent to the level curve:
\nabla f \cdot \mathbf{T} = 0.
A zero dot product means perpendicular. Wherever it is nonzero, the gradient is at
right angles to the level curve through that point: it always points straight
"across the contours", never along them.
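Steps 4 to 6 can be tried on a concrete curve. The following is a minimal sketch, assuming the hypothetical function f(x, y) = x^2 + 4y^2 and its level curve f = 4, parametrised as (2\cos t, \sin t). Neither comes from the text above. It evaluates \nabla f \cdot \mathbf{T} at eight points on the curve.

```python
import math

def grad_f(x, y):
    # f(x, y) = x^2 + 4y^2 (hypothetical example), so grad f = (2x, 8y).
    return (2 * x, 8 * y)

def path(t):
    # A path along the level curve f = 4: (2 cos t)^2 + 4 (sin t)^2 = 4.
    return (2 * math.cos(t), math.sin(t))

def velocity(t):
    # T = (x'(t), y'(t)), the tangent to the level curve.
    return (-2 * math.sin(t), math.cos(t))

dots = []
for k in range(8):
    t = 2 * math.pi * k / 8
    x, y = path(t)
    gx, gy = grad_f(x, y)
    tx, ty = velocity(t)
    # Gradient dotted with the tangent; Step 6 says this is zero.
    dots.append(gx * tx + gy * ty)

print("largest |grad f . T| on the curve:", max(abs(d) for d in dots))
```

Every dot product should come out as zero up to floating-point rounding, matching \nabla f \cdot \mathbf{T} = 0.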
Let f be differentiable at a point with
\nabla f \neq \mathbf{0} there. Then:
- Steepest ascent. \nabla f points in the direction in which f increases fastest.
- Maximum rate. That fastest rate of increase equals the gradient's magnitude, \|\nabla f\| (and the direction of steepest descent is -\nabla f, with rate -\|\nabla f\|).
- Perpendicular to level sets. \nabla f is orthogonal to the level curve f = c through the point, because f does not change along that curve, so \nabla f \cdot \mathbf{T} = 0 for the tangent \mathbf{T}.
A worked example
Let f(x, y) = x^2 + y^2 at the point
(3, 4).
Step 1 — the gradient. f_x = 2x,
f_y = 2y, so
\nabla f(3, 4) = (6, 8).
Step 2 — direction of steepest ascent. Straight along
(6, 8) — radially outward from the origin, which makes sense:
f is a bowl and the fastest way up is directly away from the
bottom.
Step 3 — maximum rate.
\|\nabla f\| = \sqrt{6^2 + 8^2} = \sqrt{100} = 10.
Step 4 — perpendicularity check. The level curves of
x^2 + y^2 are circles centred at the origin; their tangents are
perpendicular to the radius, and (6, 8) is exactly the radial
direction. The gradient crosses the contour at a right angle, as promised.
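The numbers in this example can be reproduced without doing any calculus by hand. The sketch below approximates the partials with central differences; the helper name numerical_gradient and the step h = 1e-6 are illustrative assumptions. It then checks the gradient against the tangent (-4, 3) to the circle x^2 + y^2 = 25 at (3, 4).

```python
import math

def f(x, y):
    return x**2 + y**2

def numerical_gradient(func, x, y, h=1e-6):
    # Central differences approximate the partials f_x and f_y.
    fx = (func(x + h, y) - func(x - h, y)) / (2 * h)
    fy = (func(x, y + h) - func(x, y - h)) / (2 * h)
    return (fx, fy)

gx, gy = numerical_gradient(f, 3.0, 4.0)
rate = math.hypot(gx, gy)

# Tangent to the circle at (3, 4): the radius (3, 4) rotated by 90 degrees.
tangent = (-4.0, 3.0)
dot = gx * tangent[0] + gy * tangent[1]

print("gradient approx:", (gx, gy))   # close to (6, 8)
print("maximum rate approx:", rate)   # close to 10
print("grad . tangent approx:", dot)  # close to 0
```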
If \nabla f is the steepest way up, then
-\nabla f is the steepest way down — and rolling
downhill is how almost every machine-learning model is trained. Gradient
descent repeatedly nudges a point against its gradient,
\mathbf{x}_{n+1} = \mathbf{x}_n - \eta\, \nabla f(\mathbf{x}_n),
where the small step size \eta is the "learning rate". Each
step lowers f (for small enough \eta, as long as \nabla f \neq \mathbf{0}),
and the update vanishes exactly where
\nabla f = \mathbf{0}: at a
critical point,
the subject of the next page. Training a neural network with a billion parameters is, at
heart, this one line run a great many times on a function nobody could ever picture. The
gradient is what makes the impossible-to-visualise navigable.
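To make the update rule concrete, here is a minimal sketch of gradient descent on the bowl f(x, y) = x^2 + y^2 from the worked example, starting at (3, 4). The function name gradient_descent, the learning rate \eta = 0.1, and the 50 steps are assumptions picked for illustration, not a recommendation for real training.

```python
def f(x, y):
    return x**2 + y**2

def grad_f(x, y):
    return (2 * x, 2 * y)

def gradient_descent(start, eta=0.1, steps=50):
    # Repeats x_{n+1} = x_n - eta * grad f(x_n), recording f after each step.
    x, y = start
    history = [f(x, y)]
    for _ in range(steps):
        gx, gy = grad_f(x, y)
        x, y = x - eta * gx, y - eta * gy
        history.append(f(x, y))
    return (x, y), history

final_point, history = gradient_descent((3.0, 4.0))

print("start value of f:", history[0])
print("final value of f:", history[-1])
print("final point:", final_point)
```

With this \eta, each step scales the point by 1 - 2\eta = 0.8, so f shrinks on every iteration and the point drifts toward the origin, where \nabla f = \mathbf{0}. A much larger \eta (above 1 for this bowl) would overshoot and make f grow instead, which is why the "small enough" condition matters.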