Gradient Descent
You are lost on a foggy hillside and want to reach the valley floor. You can't see far, but you
can feel which way the ground slopes under your feet. So you take a step straight downhill,
feel the slope again, step again, and repeat. That simple, patient idea — always step in the
direction of steepest descent — is gradient descent, the single most widely
used optimization algorithm on Earth. It trains neural networks with billions of parameters, fits
statistical models, and tunes controllers, all with the same two-line update.
The gradient
\nabla f(\mathbf{x}) points in the direction of steepest increase.
To go downhill we walk the opposite way, taking a step of size
\alpha (the step size or learning rate):
\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\, \nabla f(\mathbf{x}_k).
Start anywhere, iterate, and — for a well-behaved function with a sensible step size — the sequence
slides down to a minimum. When the gradient reaches zero, the update stops moving: you've arrived at a
stationary point.
The one-variable story, exactly
Strip everything to the simplest bowl, f(x) = x^2, whose derivative is
f'(x) = 2x. The update becomes
x_{k+1} = x_k - \alpha\, (2x_k) = (1 - 2\alpha)\, x_k.
Each step just multiplies x by the fixed factor
(1 - 2\alpha), so x_k = (1 - 2\alpha)^k\, x_0 — a
geometric sequence. Now the whole behaviour of gradient descent is visible in one number:
- 0 < \alpha < \tfrac12: the factor is between 0 and 1 — steady, smooth
convergence to 0.
- \alpha = \tfrac12: the factor is 0 — you land on the minimum in a
single step (the ideal step for this bowl).
- \tfrac12 < \alpha < 1: the factor is negative but small — you
overshoot the minimum and zig-zag across it, but still converge.
- \alpha \ge 1: the factor has size \ge 1 — each
step lands further from the minimum. Gradient descent diverges.
The lesson generalises: convergence hinges on the step size versus the function's curvature. For a
function whose gradient doesn't change faster than a rate L (an
L-smooth function), any step size \alpha \le 1/L
is guaranteed to make progress. Too big and you bounce out; too small and you crawl.
Watch it descend — and diverge
Here is a two-dimensional bowl f(x, y) = x^2 + 3y^2, drawn as faint
elliptical contours (it is steeper in y than in x).
The path starts at the same corner each time; drag the step-size slider and watch what happens:
- a small step gives a smooth, if slow, curve into the centre;
- a medium step zig-zags — overshooting the steep y
direction while inching along the shallow x one;
- a large step makes the path bounce outward and diverge.
That zig-zag is the famous curse of ill-conditioning: when one direction is far
steeper than another, a single step size can't suit both.
A worked descent, by hand
Minimize f(x) = x^2 from x_0 = 8 with step size
\alpha = 0.25. The factor is
1 - 2(0.25) = 0.5, so each step halves x:
8 \;\to\; 4 \;\to\; 2 \;\to\; 1 \;\to\; 0.5 \;\to\; \cdots \;\to\; 0.
The objective f = x^2 shrinks even faster (it squares the factor, ×0.25 per
step): 64 \to 16 \to 4 \to 1 \to \cdots. This geometric decay — halve the
distance, quarter the loss, every step — is exactly the linear convergence that
gradient descent delivers on a strongly convex function: a fixed fraction of progress per iteration.
Modern machine learning is gradient descent at colossal scale. A model's loss measures how
wrong its predictions are; training means minimising that loss over millions of parameters. Computing
the exact gradient over the whole dataset each step is far too slow, so practitioners use
stochastic gradient descent: estimate the gradient from a small random batch
of examples and step on that noisy estimate. The noise even helps — it can jiggle the path out of
shallow bad valleys. Nearly every headline AI system you've heard of was, at bottom, walked downhill by
this 170-year-old idea (Cauchy proposed it in 1847), one small step at a time.
-
Too big a step diverges; too small crawls. The step size is the single most
important knob. Overshoot the curvature and the iterates fly apart; undershoot and you may need
thousands of steps. There is no single right value — it depends on the function's steepness.
-
Gradient descent only finds a local minimum. On a non-convex landscape it
settles into whatever basin it started in and can stall near a saddle point, flat in every direction
it first checks. Only for a
convex function is
the point it reaches guaranteed global.
-
Ill-conditioning causes zig-zagging. When directions differ wildly in steepness, a
scalar step size fits none of them and the path bounces across the valley. Rescaling the variables
(or using curvature, as
Newton's method
does) straightens the path.