Plain
Momentum fixes this with a single idea borrowed from physics: give the optimizer inertia. Instead of stepping along the raw gradient, accumulate a running velocity that remembers where you have been going. Consistent directions build up speed; oscillating directions cancel against themselves.
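Let $\theta_t$ be the parameters at step $t$, $g_t = \nabla L(\theta_t)$ the gradient of the loss there, $\eta$ the learning rate, $\beta \in [0, 1)$ the momentum coefficient, and $v_t$ the velocity, with $v_0 = 0$.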
Step 1 — accumulate velocity. Each step, decay the old velocity by $\beta$ and add the current gradient: $v_t = \beta v_{t-1} + g_t$.
Step 2 — step along the velocity. Move the parameters along $-v_t$, scaled by the learning rate: $\theta_{t+1} = \theta_t - \eta v_t$.
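The two steps above can be sketched in a few lines of Python. This is a minimal illustration, assuming the common form of the update (velocity `v = beta * v + grad`, then parameters `theta = theta - lr * v`). The helper name `momentum_step` and the default values `lr=0.1` and `beta=0.9` are hypothetical choices for the example, not taken from the text.

```python
def momentum_step(theta, v, grad, lr=0.1, beta=0.9):
    """One momentum update on plain lists of floats (illustrative helper)."""
    # Step 1: decay the old velocity by beta and add the new gradient.
    v_new = [beta * vi + gi for vi, gi in zip(v, grad)]
    # Step 2: move the parameters against the velocity, scaled by lr.
    theta_new = [ti - lr * vi for ti, vi in zip(theta, v_new)]
    return theta_new, v_new


# Usage: a single step from rest with an assumed gradient.
theta, v = [1.0, 1.0], [0.0, 0.0]
grad = [2.0, -4.0]
theta, v = momentum_step(theta, v, grad)
```

Starting from zero velocity, the first step is identical to a plain gradient step; the difference only appears once the velocity has something to remember.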
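Step 3 — unroll the recursion. Expand $v_t = \beta v_{t-1} + g_t$ by substituting the rule into itself, starting from $v_0 = 0$: $v_t = g_t + \beta g_{t-1} + \beta^2 g_{t-2} + \dots = \sum_{k=0}^{t-1} \beta^k g_{t-k}$.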
Step 4 — read it as a weighted average. The velocity is an
exponentially weighted average of all past gradients: recent ones count
fully, older ones are discounted by powers of $\beta$.
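A quick numerical check can make the unrolled form concrete. The sketch below assumes the recursion `v = beta * v + g` starting from zero velocity, and compares it against the explicit sum in which the gradient from `k` steps ago is weighted by `beta**k`. The gradient values are random placeholders, not real training data.

```python
import random

random.seed(0)
beta = 0.9
# Placeholder gradient history (oldest first); values are arbitrary.
grads = [random.uniform(-1.0, 1.0) for _ in range(20)]

# Recursive form: decay the velocity, then add the newest gradient.
v = 0.0
for g in grads:
    v = beta * v + g

# Unrolled form: the gradient from k steps ago carries weight beta**k.
unrolled = sum(beta**k * g for k, g in enumerate(reversed(grads)))
```

Both `v` and `unrolled` should agree up to floating-point rounding.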
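Step 5 — the speed-up. Suppose the gradient is steadily $g$. Then the velocity converges to $v_\infty = g\,(1 + \beta + \beta^2 + \dots) = g / (1 - \beta)$,
so the effective step along a consistent direction is amplified by a factor of $1/(1-\beta)$, which is 10 for $\beta = 0.9$.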
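The amplification can be checked by feeding a constant gradient into the velocity recursion. This sketch assumes the form `v = beta * v + g` with `beta = 0.9` and a unit gradient, both chosen only for illustration.

```python
beta = 0.9
g = 1.0  # a steady gradient (assumed value)

v = 0.0
for _ in range(200):
    v = beta * v + g  # the velocity builds up toward a fixed point

limit = g / (1.0 - beta)  # geometric-series limit g / (1 - beta)
amplification = v / g
```

After a few hundred steps `v` sits very close to `limit`, so the step taken along this direction is about `1 / (1 - beta)` times the plain gradient step.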
The mechanical picture is exact:
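Nesterov momentum sharpens this with a look-ahead: evaluate the gradient
not at the current point but where the velocity is about to carry you, $v_t = \beta v_{t-1} + \nabla L(\theta_t - \eta \beta v_{t-1})$, followed by the same step $\theta_{t+1} = \theta_t - \eta v_t$.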
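The look-ahead can be sketched as follows. This is one common way to write Nesterov momentum, assuming the velocity form `v = beta * v + grad` and the step `theta = theta - lr * v`; other equivalent formulations exist. The helper `nesterov_step`, the toy loss $L(\theta) = \tfrac{1}{2}\theta^2$, and the hyperparameters are assumptions made for the example.

```python
def nesterov_step(theta, v, grad_fn, lr=0.1, beta=0.9):
    """One Nesterov update on a scalar parameter (illustrative helper)."""
    # Look ahead to where the decayed velocity alone would carry theta.
    lookahead = theta - lr * beta * v
    # Evaluate the gradient at the look-ahead point, not at theta.
    v_new = beta * v + grad_fn(lookahead)
    return theta - lr * v_new, v_new


def grad_fn(theta):
    # Gradient of the toy loss L(theta) = 0.5 * theta**2.
    return theta


# A single step from rest, then a longer run toward the minimum at 0.
t1, v1 = nesterov_step(1.0, 0.0, grad_fn)

theta, v = 1.0, 0.0
for _ in range(200):
    theta, v = nesterov_step(theta, v, grad_fn)
```

With zero initial velocity the look-ahead point coincides with the current point, so the first step matches plain momentum; the correction only matters once the velocity is non-zero.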
This bowl is stretched — steep in one direction, shallow in the other. Plain SGD snaps down the steep axis but then crawls for ages along the shallow one, barely inching toward the minimum. The momentum path builds velocity along that consistent shallow direction and sails the rest of the way home. Drag β upward and watch the momentum path accelerate (push it too high and it starts to overshoot the minimum).
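The same comparison can be reproduced without the interactive plot. The sketch below assumes a quadratic bowl $L(x, y) = \tfrac{1}{2}(a x^2 + b y^2)$ with a steep curvature `a = 1.0` and a shallow curvature `b = 0.01`, a learning rate of `1.0`, and `beta = 0.9`. All of these numbers are illustrative assumptions, not the settings of the original figure.

```python
import math

CURVATURES = (1.0, 0.01)  # (steep axis, shallow axis), assumed values
LR = 1.0
STEPS = 100


def grad(p):
    # Gradient of L(x, y) = 0.5 * (a * x**2 + b * y**2).
    return [c * pi for c, pi in zip(CURVATURES, p)]


def run(beta):
    """Run momentum descent from (1, 1); beta = 0 is plain gradient descent."""
    p, v = [1.0, 1.0], [0.0, 0.0]
    for _ in range(STEPS):
        g = grad(p)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        p = [pi - LR * vi for pi, vi in zip(p, v)]
    return math.hypot(p[0], p[1])  # distance from the minimum at the origin


dist_plain = run(beta=0.0)
dist_momentum = run(beta=0.9)
```

Under these assumptions the plain run drops straight down the steep axis but is still far out along the shallow one after 100 steps, while the momentum run ends much closer to the minimum. Raising `beta` further eventually makes the path overshoot and oscillate.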
Momentum costs one extra buffer (the velocity) and turns a stumbling descent into a
purposeful glide. It is the conceptual bridge to