Gradient Descent

You are lost on a foggy hillside and want to reach the valley floor. You can't see far, but you can feel which way the ground slopes under your feet. So you take a step straight downhill, feel the slope again, step again, and repeat. That simple, patient idea — always step in the direction of steepest descent — is gradient descent, the single most widely used optimization algorithm on Earth. It trains neural networks with billions of parameters, fits statistical models, and tunes controllers, all with the same two-line update.

The gradient \nabla f(\mathbf{x}) points in the direction of steepest increase. To go downhill we walk the opposite way, taking a step of size \alpha (the step size or learning rate):

\mathbf{x}_{k+1} = \mathbf{x}_k - \alpha\, \nabla f(\mathbf{x}_k).

Start anywhere, iterate, and — for a well-behaved function with a sensible step size — the sequence slides down to a minimum. When the gradient reaches zero, the update stops moving: you've arrived at a stationary point.

The one-variable story, exactly

Strip everything to the simplest bowl, f(x) = x^2, whose derivative is f'(x) = 2x. The update becomes

x_{k+1} = x_k - \alpha\, (2x_k) = (1 - 2\alpha)\, x_k.

Each step just multiplies x by the fixed factor (1 - 2\alpha), so x_k = (1 - 2\alpha)^k\, x_0 — a geometric sequence. Now the whole behaviour of gradient descent is visible in one number:

0 < \alpha < \tfrac12: the factor is between 0 and 1 — steady, smooth convergence to 0.
\alpha = \tfrac12: the factor is 0 — you land on the minimum in a single step (the ideal step for this bowl).
\tfrac12 < \alpha < 1: the factor is negative but small — you overshoot the minimum and zig-zag across it, but still converge.
\alpha \ge 1: the factor has size \ge 1 — each step lands further from the minimum. Gradient descent diverges.

The lesson generalises: convergence hinges on the step size versus the function's curvature. For a function whose gradient doesn't change faster than a rate L (an L-smooth function), any step size \alpha \le 1/L is guaranteed to make progress. Too big and you bounce out; too small and you crawl.

Watch it descend — and diverge

Here is a two-dimensional bowl f(x, y) = x^2 + 3y^2, drawn as faint elliptical contours (it is steeper in y than in x). The path starts at the same corner each time; drag the step-size slider and watch what happens:

a small step gives a smooth, if slow, curve into the centre;
a medium step zig-zags — overshooting the steep y direction while inching along the shallow x one;
a large step makes the path bounce outward and diverge.

That zig-zag is the famous curse of ill-conditioning: when one direction is far steeper than another, a single step size can't suit both.

A worked descent, by hand

Minimize f(x) = x^2 from x_0 = 8 with step size \alpha = 0.25. The factor is 1 - 2(0.25) = 0.5, so each step halves x:

8 \;\to\; 4 \;\to\; 2 \;\to\; 1 \;\to\; 0.5 \;\to\; \cdots \;\to\; 0.

The objective f = x^2 shrinks even faster (it squares the factor, ×0.25 per step): 64 \to 16 \to 4 \to 1 \to \cdots. This geometric decay — halve the distance, quarter the loss, every step — is exactly the linear convergence that gradient descent delivers on a strongly convex function: a fixed fraction of progress per iteration.

Modern machine learning is gradient descent at colossal scale. A model's loss measures how wrong its predictions are; training means minimising that loss over millions of parameters. Computing the exact gradient over the whole dataset each step is far too slow, so practitioners use stochastic gradient descent: estimate the gradient from a small random batch of examples and step on that noisy estimate. The noise even helps — it can jiggle the path out of shallow bad valleys. Nearly every headline AI system you've heard of was, at bottom, walked downhill by this 170-year-old idea (Cauchy proposed it in 1847), one small step at a time.

Too big a step diverges; too small crawls. The step size is the single most important knob. Overshoot the curvature and the iterates fly apart; undershoot and you may need thousands of steps. There is no single right value — it depends on the function's steepness.
Gradient descent only finds a local minimum. On a non-convex landscape it settles into whatever basin it started in and can stall near a saddle point, flat in every direction it first checks. Only for a convex function is the point it reaches guaranteed global.
Ill-conditioning causes zig-zagging. When directions differ wildly in steepness, a scalar step size fits none of them and the path bounces across the valley. Rescaling the variables (or using curvature, as Newton's method does) straightens the path.