Momentum

Plain SGD has a frustrating habit. In a long, narrow ravine — steep across, shallow along — it zig-zags wildly across the steep walls while barely creeping along the gentle floor toward the minimum. Every step over-corrects sideways and makes almost no forward progress.

Momentum fixes this with a single idea borrowed from physics: give the optimizer inertia. Instead of stepping along the raw gradient, accumulate a running velocity that remembers where you have been going. Consistent directions build up speed; oscillating directions cancel against themselves.

Deriving the momentum update

Let g_t = \nabla L(\theta_{t-1}) be the gradient at the parameters we hold going into step t, and \beta \in [0,1) the momentum coefficient (typically 0.9). We track a velocity v_t, starting from v_0 = 0.

Step 1 — accumulate velocity. Each step, decay the old velocity by \beta and add the fresh gradient:

v_t = \beta\, v_{t-1} + g_t.

Step 2 — step along the velocity. Move the parameters along v_t (not the raw gradient), scaled by the learning rate \eta:

\theta_t = \theta_{t-1} - \eta\, v_t.
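Steps 1 and 2 can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `momentum_step`, the toy loss L(\theta) = \tfrac{1}{2}\lVert\theta\rVert^2, and the default values of `lr` and `beta` are all assumptions made for the example.

```python
def momentum_step(theta, velocity, grad, lr=0.1, beta=0.9):
    """One classical-momentum update on lists of floats (illustrative helper)."""
    # Step 1: decay the old velocity by beta and add the fresh gradient.
    new_velocity = [beta * v + g for v, g in zip(velocity, grad)]
    # Step 2: move the parameters along the velocity, not the raw gradient.
    new_theta = [p - lr * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


def grad_loss(theta):
    # Gradient of the assumed toy loss L(theta) = 0.5 * sum(theta_i ** 2).
    return list(theta)


theta, velocity = [1.0, -2.0], [0.0, 0.0]  # v_0 = 0, as in the text
for step in range(3):
    theta, velocity = momentum_step(theta, velocity, grad_loss(theta))
    print(step + 1, theta, velocity)
```

Returning new lists rather than mutating in place keeps the two steps easy to read side by side; a real optimizer would usually update its buffers in place.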

Step 3 — unroll the recursion. Expand v_t by substituting its own definition, again and again:

v_t = g_t + \beta\, g_{t-1} + \beta^2\, g_{t-2} + \beta^3\, g_{t-3} + \cdots = \sum_{k=0}^{t-1} \beta^{k}\, g_{t-k}.
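A quick numerical check of the unrolling, assuming scalar gradients for simplicity. The function names and the sample gradient values are made up for the illustration; the list `grads` holds g_1, ..., g_t in order.

```python
def velocity_recursive(grads, beta):
    # Apply v_t = beta * v_{t-1} + g_t, starting from v_0 = 0.
    v = 0.0
    for g in grads:
        v = beta * v + g
    return v


def velocity_unrolled(grads, beta):
    # Evaluate sum_{k=0}^{t-1} beta^k * g_{t-k} directly.
    t = len(grads)
    # grads[t - 1 - k] is g_{t-k}, because grads[0] is g_1.
    return sum(beta ** k * grads[t - 1 - k] for k in range(t))


sample_grads = [1.0, -0.5, 2.0, 0.25]  # arbitrary illustrative values
print(velocity_recursive(sample_grads, 0.9))
print(velocity_unrolled(sample_grads, 0.9))
```

The two printed values should agree up to floating-point rounding, which is all the unrolled formula claims.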

Step 4 — read it as a weighted average. The velocity is an exponentially weighted sum of past gradients, an average in all but normalization: the newest gradient counts fully, and one that is k steps old is discounted by \beta^{k}. The geometric weights sum to a finite total:

\sum_{k=0}^{\infty} \beta^{k} = \frac{1}{1-\beta}.

Step 5 — the speed-up. Suppose the gradient is steadily g in some consistent direction. The velocity converges to

v_\infty = g \sum_{k=0}^{\infty} \beta^{k} = \frac{g}{1-\beta},

so the effective step along a consistent direction is amplified by a factor of \tfrac{1}{1-\beta} — with \beta = 0.9, a 10\times boost. Meanwhile, components that flip sign every step — the zig-zag across the ravine — keep cancelling in the sum, so they are damped. Momentum accelerates the useful direction and quiets the useless one.
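The damping can be made as concrete as the speed-up. As a sketch, assume the idealized case where a gradient component keeps the same magnitude but flips sign every step, so that g_{t-k} = (-1)^{k} g_t (in a real run the gradient depends on where the parameters land, so this is only an approximation):

```latex
v_t \;\to\; g_t \sum_{k=0}^{\infty} (-\beta)^{k} = \frac{g_t}{1+\beta},
\qquad
\frac{\text{steady gain}}{\text{alternating gain}}
  = \frac{1/(1-\beta)}{1/(1+\beta)}
  = \frac{1+\beta}{1-\beta}.
```

With \beta = 0.9 the oscillating component is scaled by about 0.53 while the steady one is scaled by 10, a 19-fold difference in favor of the consistent direction.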

The mechanical picture is apt: \theta_t = \theta_{t-1} - \eta v_t with v_t = \beta v_{t-1} + g_t is a discretized heavy ball rolling on the loss surface. Here 1 - \beta plays the role of friction, so \beta close to 1 means low friction and more inertia. A heavy ball does not stop dead at every pebble; it carries through, ignores the high-frequency bumps, and rolls steadily downhill.

Nesterov momentum sharpens this with a look-ahead: evaluate the gradient not at the current point but where the velocity is about to carry you, \theta_{t-1} - \eta\beta v_{t-1}. By peeking ahead, it can brake before overshooting rather than after — a ball that sees the upcoming wall and slows in time. In practice it often converges a touch faster than classical momentum for the same \beta.
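The look-ahead can be sketched in the same notation as the classical update. This assumes one common formulation, in which only the point where the gradient is evaluated changes; the name `nesterov_step` and the callable `grad_fn` are hypothetical, and libraries often implement an algebraically rearranged variant.

```python
def nesterov_step(theta, velocity, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov-style update on lists of floats (illustrative helper)."""
    # Look-ahead point: where the decayed velocity alone would carry us.
    lookahead = [p - lr * beta * v for p, v in zip(theta, velocity)]
    # Evaluate the gradient there instead of at the current parameters.
    grad = grad_fn(lookahead)
    # From here on the update matches classical momentum.
    new_velocity = [beta * v + g for v, g in zip(velocity, grad)]
    new_theta = [p - lr * v for p, v in zip(theta, new_velocity)]
    return new_theta, new_velocity


def grad_loss(theta):
    # Gradient of the assumed toy loss L(theta) = 0.5 * sum(theta_i ** 2).
    return list(theta)


theta, velocity = [1.0], [1.0]
theta, velocity = nesterov_step(theta, velocity, grad_loss)
print(theta, velocity)
```

On this toy loss the look-ahead gradient is smaller than the gradient at the current point whenever the velocity is already heading toward the minimum, which is the braking effect described above.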

Watch the zig-zag vanish

This bowl is stretched — steep in one direction, shallow in the other. Plain SGD snaps down the steep axis but then crawls for ages along the shallow one, barely inching toward the minimum. The momentum path builds velocity along that consistent shallow direction and sails the rest of the way home. Drag β upward and watch the momentum path accelerate (push it too high and it starts to overshoot the minimum).
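The same comparison can be reproduced offline with a small simulation. This is a sketch under stated assumptions: the bowl is taken to be the quadratic L(x, y) = \tfrac{1}{2}(0.1\,x^2 + 10\,y^2), and the curvatures, learning rate, step count, and starting point are illustrative choices, not values taken from the demo.

```python
import math

CURVATURE = (0.1, 10.0)  # assumed: shallow axis first, steep axis second


def grad(theta):
    # Gradient of L = 0.5 * (0.1 * x**2 + 10 * y**2).
    return [c * p for c, p in zip(CURVATURE, theta)]


def distance_after(beta, lr=0.05, steps=200, start=(1.0, 1.0)):
    """Run the momentum update and return the final distance to the minimum."""
    theta = list(start)
    velocity = [0.0, 0.0]
    for _ in range(steps):
        g = grad(theta)
        velocity = [beta * v + gi for v, gi in zip(velocity, g)]
        theta = [p - lr * v for p, v in zip(theta, velocity)]
    return math.hypot(theta[0], theta[1])


sgd_distance = distance_after(beta=0.0)       # beta = 0 reduces to plain gradient descent
momentum_distance = distance_after(beta=0.9)
print("plain:", sgd_distance, "momentum:", momentum_distance)
```

Under these settings the plain run should still be a sizeable fraction of its starting distance from the minimum along the shallow axis, while the momentum run ends far closer. Raising `beta` toward 1 at the same learning rate makes the path oscillate around the minimum for longer before settling.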

The cheapest upgrade in optimization

Momentum costs one extra buffer (the velocity) and turns a stumbling descent into a purposeful glide. It is the conceptual bridge to Adam, which keeps a normalized running average of gradients, essentially this same velocity rescaled by 1 - \beta, as its "first moment" and then adds a per-parameter adaptive step on top. Master the running average here and Adam is half-built already.
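The link to Adam's first moment can be checked numerically. Adam's first moment is usually written as m_t = \beta m_{t-1} + (1-\beta) g_t; the sketch below, with a hypothetical helper name and arbitrary scalar gradients, compares it with the velocity from this section when both use the same \beta and start from zero. Adam's bias correction and second moment are deliberately left out.

```python
def velocity_and_first_moment(grads, beta=0.9):
    v = 0.0  # momentum velocity: v = beta * v + g
    m = 0.0  # Adam-style first moment: m = beta * m + (1 - beta) * g
    for g in grads:
        v = beta * v + g
        m = beta * m + (1.0 - beta) * g
    return v, m


v, m = velocity_and_first_moment([1.0, -0.5, 2.0])
print(v, m, (1.0 - 0.9) * v)
```

The last two printed values should match up to rounding: the two buffers differ only by the constant factor 1 - \beta.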