Feature Scaling

Suppose one feature is "house area" (hundreds) and another is "number of bedrooms" (single digits). Their wildly different ranges warp the cost surface into a long, narrow valley — and gradient descent zig-zags agonisingly down it instead of heading straight for the bottom.

The fix is feature scaling: rescale every feature to a similar range. A common choice is standardization: subtract each feature's mean and divide by its standard deviation, so every feature ends up with mean 0 and standard deviation 1. The valley becomes a much rounder bowl, and descent heads almost straight in.
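As a minimal sketch of that rescaling, the snippet below standardizes two feature columns using only Python's standard library. The house areas and bedroom counts are made-up illustration data, and the helper name `standardize` is hypothetical rather than taken from any particular library.

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale one feature column to mean 0 and standard deviation 1."""
    mu = mean(column)
    sigma = pstdev(column)  # population standard deviation of the column
    if sigma == 0:
        # A constant feature has no spread to divide by; map it to zeros.
        return [0.0 for _ in column]
    return [(x - mu) / sigma for x in column]


# Hypothetical data: areas in the hundreds, bedroom counts in single digits.
areas = [110.0, 150.0, 200.0, 260.0, 330.0]
bedrooms = [2.0, 3.0, 3.0, 4.0, 5.0]

areas_scaled = standardize(areas)
bedrooms_scaled = standardize(bedrooms)

# After scaling, both columns live on the same scale.
print(areas_scaled)
print(bedrooms_scaled)
```

In practice the mean and standard deviation would normally be computed once on the training data and then reused to scale any new inputs, so that every example is transformed the same way.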

Stretched valley vs round bowl

Picture a contour plot of the cost: each ring joins points of equal cost, like a topographic map. Unscaled, the rings are stretched ellipses and the descent path skitters side to side. Scaled, they're near-circles and the path drives straight to the centre. Same problem, far fewer steps.
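The "far fewer steps" claim can be checked with a rough experiment. The sketch below runs plain batch gradient descent for linear regression twice, once on raw features and once on standardized ones, and counts the steps until the cost falls below a tolerance. Everything here is an assumption made for illustration: the data, the noise-free price formula, the tolerance, the step cap, and the conservative learning-rate rule (one over the mean squared row length, which keeps the descent stable but is not a tuned choice).

```python
from statistics import mean, pstdev


def standardize(column):
    """Rescale a column to mean 0 and standard deviation 1."""
    mu, sigma = mean(column), pstdev(column)
    return [(x - mu) / sigma for x in column]


def steps_to_converge(features, targets, tol=1e-8, max_steps=10_000):
    """Batch gradient descent on the mean-squared-error cost.

    Returns the number of steps taken before the cost drops below `tol`,
    or `max_steps` if that never happens within the cap.
    """
    m = len(targets)
    rows = [[1.0] + list(r) for r in features]  # leading 1.0 is the bias input
    n = len(rows[0])
    # Assumed rule: 1 / (mean squared row length). This never exceeds
    # 1 / (largest curvature of the cost), so the descent cannot diverge.
    lr = 1.0 / mean(sum(v * v for v in row) for row in rows)
    w = [0.0] * n
    for step in range(max_steps):
        residuals = [
            sum(wj * xj for wj, xj in zip(w, row)) - y
            for row, y in zip(rows, targets)
        ]
        cost = sum(r * r for r in residuals) / (2 * m)
        if cost < tol:
            return step
        grad = [sum(r * row[j] for r, row in zip(residuals, rows)) / m
                for j in range(n)]
        w = [wj - lr * gj for wj, gj in zip(w, grad)]
    return max_steps


# Hypothetical data; prices are an exact linear function of the features,
# so the minimum cost is zero and the tolerance means the same in both runs.
areas = [100.0, 200.0, 300.0, 400.0, 500.0]
bedrooms = [3.0, 1.0, 4.0, 2.0, 5.0]
prices = [50 + 0.3 * a + 10 * b for a, b in zip(areas, bedrooms)]

raw = list(zip(areas, bedrooms))
scaled = list(zip(standardize(areas), standardize(bedrooms)))

raw_steps = steps_to_converge(raw, prices)
scaled_steps = steps_to_converge(scaled, prices)
print("raw features:   ", raw_steps, "steps")
print("scaled features:", scaled_steps, "steps")
```

With these made-up numbers the raw run should exhaust the step cap without reaching the tolerance, while the standardized run should finish well inside it. The exact counts depend on the data and the learning-rate rule, so treat them as indicative only.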

A cheap habit that usually pays

Scaling costs almost nothing and often cuts the number of gradient descent steps substantially, so it's standard practice before running gradient descent. It also keeps features comparable for distance-based methods like k-nearest neighbours and for regularization, where one large-scaled feature would otherwise drown out the rest. When in doubt, scale your features.
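To illustrate the distance-based case, here is a small sketch of a nearest-neighbour lookup before and after standardization. The houses, the query, and the helper names `nearest` and `scale` are all hypothetical, chosen so that area differences swamp bedroom differences in the raw data.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical houses as (area, bedrooms) pairs, plus a query house.
candidates = [(210.0, 5.0), (230.0, 2.0), (400.0, 3.0), (120.0, 1.0)]
query = (200.0, 2.0)


def nearest(point, points):
    """Return the member of `points` with the smallest Euclidean distance."""
    return min(points, key=lambda p: dist(point, p))


# Per-feature mean and standard deviation, computed from the candidates.
columns = list(zip(*candidates))
mus = [mean(c) for c in columns]
sigmas = [pstdev(c) for c in columns]


def scale(point):
    """Standardize a point using the candidates' statistics."""
    return tuple((v - mu) / s for v, mu, s in zip(point, mus, sigmas))


# Raw distances: a 10-unit area gap looks small next to a 30-unit one,
# so a 3-bedroom difference is effectively ignored.
raw_nearest = nearest(query, candidates)

# Scaled distances: both features contribute on comparable terms.
scaled_points = [scale(p) for p in candidates]
best_scaled = nearest(scale(query), scaled_points)
scaled_nearest = candidates[scaled_points.index(best_scaled)]

print("nearest on raw features:   ", raw_nearest)     # (210.0, 5.0)
print("nearest on scaled features:", scaled_nearest)  # (230.0, 2.0)
```

On the raw features the five-bedroom house wins purely because its area is closest. After scaling, the house with the matching bedroom count and a slightly larger area becomes the nearest neighbour.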