Gradient Descent

You are lost on a foggy hillside and want to reach the valley floor. You can't see far, but you can feel which way the ground slopes under your feet. So you take a step straight downhill, feel the slope again, step again, and repeat. That simple, patient idea — always step in the direction of steepest descent — is gradient descent, the single most widely used optimization algorithm on Earth. It trains neural networks with billions of parameters, fits statistical models, and tunes controllers, all with the same two-line update.

The gradient \nabla f(\mathbf{x}) points in the direction of steepest increase. To go downhill we walk the opposite way, taking a step of size \alpha (the step size or learning rate):

\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\, \nabla f(\mathbf{x}_k).

Start anywhere, iterate, and — for a well-behaved function with a sensible step size — the sequence slides down to a minimum. When the gradient reaches zero, the update stops moving: you've arrived at a stationary point.

The one-variable story, exactly

Strip everything to the simplest bowl, f(x) = x^2, whose derivative is f'(x) = 2x. The update becomes

x_{k+1} = x_k - \alpha\, (2x_k) = (1 - 2\alpha)\, x_k.

Each step just multiplies x by the fixed factor (1 - 2\alpha), so x_k = (1 - 2\alpha)^k\, x_0 — a geometric sequence. Now the whole behaviour of gradient descent is visible in one number:

The lesson generalises: convergence hinges on the step size versus the function's curvature. For a function whose gradient doesn't change faster than a rate L (an L-smooth function), any step size \alpha \le 1/L is guaranteed to make progress. Too big and you bounce out; too small and you crawl.

Watch it descend — and diverge

Here is a two-dimensional bowl f(x, y) = x^2 + 3y^2, drawn as faint elliptical contours (it is steeper in y than in x). The path starts at the same corner each time; drag the step-size slider and watch what happens:

That zig-zag is the famous curse of ill-conditioning: when one direction is far steeper than another, a single step size can't suit both.

A worked descent, by hand

Minimize f(x) = x^2 from x_0 = 8 with step size \alpha = 0.25. The factor is 1 - 2(0.25) = 0.5, so each step halves x:

8 \;\to\; 4 \;\to\; 2 \;\to\; 1 \;\to\; 0.5 \;\to\; \cdots \;\to\; 0.

The objective f = x^2 shrinks even faster (it squares the factor, ×0.25 per step): 64 \to 16 \to 4 \to 1 \to \cdots. This geometric decay — halve the distance, quarter the loss, every step — is exactly the linear convergence that gradient descent delivers on a strongly convex function: a fixed fraction of progress per iteration.

Modern machine learning is gradient descent at colossal scale. A model's loss measures how wrong its predictions are; training means minimising that loss over millions of parameters. Computing the exact gradient over the whole dataset each step is far too slow, so practitioners use stochastic gradient descent: estimate the gradient from a small random batch of examples and step on that noisy estimate. The noise even helps — it can jiggle the path out of shallow bad valleys. Nearly every headline AI system you've heard of was, at bottom, walked downhill by this 170-year-old idea (Cauchy proposed it in 1847), one small step at a time.