Suppose one feature is "house area" (hundreds) and another is "number of bedrooms" (single
digits). Their wildly different ranges warp the
The fix is feature scaling: rescale every feature to a similar range (often by subtracting the mean and dividing by the spread, so each has mean 0 and a standard size). The valley becomes a round bowl, and descent walks straight in.
The rings are contours of the cost — points of equal cost, like a topographic map. Unscaled, they're stretched ellipses and the descent path skitters side to side. Scaled, they're near-circles and the path drives straight to the centre. Same problem, far fewer steps.
Scaling costs almost nothing and speeds up training dramatically, so it's standard practice
before running gradient descent. It also keeps features comparable for distance-based methods
like