The Learning Rate
The learning rate \alpha sets how big a step gradient descent takes each round.
It's the single most important dial to get right, and it's a Goldilocks problem:
- Too small — the steps are tiny, so training crawls and may take forever to
reach the bottom.
- Too large — the steps overshoot the minimum, bouncing from wall to wall, and
can even diverge, flying off to ever-worse costs.
- Just right — brisk steps that settle smoothly into the minimum.
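To make the three cases concrete, here is a minimal sketch rather than a definitive implementation. It assumes a one-dimensional bowl, f(w) = w**2, whose minimum sits at w = 0. The function name `descend`, the starting point, the step count, and the three example rates are all illustrative choices, not values from the text.

```python
def descend(alpha, start=1.0, steps=20):
    """Run gradient descent on the assumed cost f(w) = w**2.

    Returns the list of every w visited, starting point included.
    """
    w = start
    path = [w]
    for _ in range(steps):
        grad = 2 * w          # derivative of w**2
        w = w - alpha * grad  # step against the gradient, scaled by alpha
        path.append(w)
    return path


# Illustrative rates only: one per regime for this particular bowl.
for label, alpha in [("too small", 0.001), ("just right", 0.3), ("too large", 1.1)]:
    final_w = descend(alpha)[-1]
    print(f"{label:>10} (alpha={alpha}): w after 20 steps = {final_w:.4f}")
```

On this bowl the smallest rate leaves w close to where it started, the middle rate brings it essentially to zero, and the largest rate pushes it further from the minimum with every step.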
Too slow, too wild, just right
Switch between three learning rates and watch the path down the bowl. The tiny rate barely
moves; the huge rate ricochets off the sides and climbs away from the answer; the
middle one glides to the bottom. Same algorithm, same data — only the step size differs.
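Why the same algorithm behaves so differently can be worked out by hand in a simple case. The derivation below assumes the bowl is the one-dimensional cost J(w) = w^2, which is a stand-in chosen for easy arithmetic and not necessarily the surface pictured here.

```latex
\begin{align*}
w_{t+1} &= w_t - \alpha \, J'(w_t) = w_t - 2\alpha\, w_t = (1 - 2\alpha)\, w_t \\
w_t     &= (1 - 2\alpha)^{t}\, w_0
\end{align*}
```

Every step multiplies w by the same factor, 1 - 2\alpha. When \alpha is tiny the factor sits just below 1, so w shrinks very slowly. For \alpha between 0.5 and 1 the factor is negative but smaller than 1 in size, so w jumps from one side of the bowl to the other while still closing in. Once \alpha exceeds 1 the factor is larger than 1 in size and each step lands further from the minimum than the last. The thresholds 0.5 and 1 belong to this particular bowl; a steeper or shallower cost shifts them.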
Finding a good one
There's no universal best value — it depends on the problem — so in practice you try a few
(often powers of ten: 0.001, 0.01, 0.1) and watch the cost. A cost
that falls steadily means a good rate; one that explodes means turn it down; one that barely
moves means turn it up. Adaptive optimizers such as Adam even adjust the effective step
size as they go, but the intuition you just built never changes.
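The try-a-few-and-watch routine might look like the sketch below. It again assumes the toy cost f(w) = w**2. The helpers `cost_history` and `diagnose`, the 0.5 "barely moved" threshold, and the extra rate 1.5 (added only to trigger the exploding case) are hypothetical choices for illustration. A real problem would need its own cost function and thresholds.

```python
import math


def cost_history(alpha, start=1.0, steps=50):
    """Record the assumed cost f(w) = w**2 before and after each step."""
    w = start
    costs = [w ** 2]
    for _ in range(steps):
        w -= alpha * 2 * w   # gradient of w**2 is 2w
        costs.append(w ** 2)
    return costs


def diagnose(costs, barely=0.5):
    """Rough rule of thumb comparing the final cost with the initial one.

    The `barely` threshold is arbitrary: a run that keeps more than that
    fraction of its starting cost is treated as barely moving.
    """
    first, last = costs[0], costs[-1]
    if not math.isfinite(last) or last > first:
        return "explodes: turn it down"
    if last > barely * first:
        return "barely moves: turn it up"
    return "falls steadily: good rate"


# Powers of ten from the text, plus 1.5 as an assumed too-large example.
for alpha in [0.001, 0.01, 0.1, 1.5]:
    print(f"alpha={alpha}: {diagnose(cost_history(alpha))}")
```

Comparing only the first and last cost keeps the sketch short. In practice it is more informative to plot the whole curve, since a cost can oscillate before settling.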