The Cost Function

To improve a model we first need to score it: a single number saying how wrong it currently is. For regression the standard score is the mean squared error — for each example, take the gap between prediction and truth, square it, and average:

J(w, b) = \frac{1}{n}\sum_{i=1}^{n}\big(h(x_i) - y_i\big)^2.

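Here h(x_i) is the model's prediction for example i (for a line, h(x) = wx + b), y_i is the true value, and n is the number of examples. This is the cost function (or loss). Squaring does two jobs: it makes every error non-negative, so overshoots and undershoots both count instead of cancelling each other out, and it punishes big misses far more than small ones (an error of 10 costs 100, while an error of 1 costs only 1). Low cost means a good fit; the goal of training is to find the w and b that make J as small as possible.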
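To make the formula concrete, here is a minimal sketch of the mean squared error in plain Python. It assumes the model is a straight line, h(x) = w*x + b; the function names `predict` and `mse_cost` and the toy data are hypothetical, chosen only for illustration.

```python
def predict(x, w, b):
    """Assumed straight-line model: h(x) = w * x + b."""
    return w * x + b


def mse_cost(xs, ys, w, b):
    """Mean squared error J(w, b) over paired examples (xs[i], ys[i])."""
    n = len(xs)
    squared_errors = [(predict(x, w, b) - y) ** 2 for x, y in zip(xs, ys)]
    return sum(squared_errors) / n


# Toy data (made up for this example) that lies exactly on y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]

perfect = mse_cost(xs, ys, w=2.0, b=0.0)  # every error is 0, so the cost is 0
worse = mse_cost(xs, ys, w=1.0, b=0.0)    # errors -1, -2, -3 give (1 + 4 + 9) / 3

print(perfect, worse)
```

The line that passes through every point scores zero, and the flatter line scores 14/3 (about 4.67), with most of that coming from the largest miss.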

Errors are literally squares

Each red square has a side equal to one example's error, so its area is that error squared. The cost is the average of those areas: the total shaded area divided by the number of examples. Tilt and shift the line: as it fits better the squares shrink, and the cost drops toward its minimum.
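The shrinking squares can also be checked numerically. The sketch below lists each example's square area for two candidate lines, again assuming a straight-line model h(x) = w*x + b; the helper `square_areas` and the data points are hypothetical.

```python
def square_areas(xs, ys, w, b):
    """Area of each example's square: (prediction - truth) squared."""
    return [((w * x + b) - y) ** 2 for x, y in zip(xs, ys)]


# Made-up points that lie exactly on y = 2x + 1.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

poor_fit = square_areas(xs, ys, w=1.0, b=0.0)    # errors -1, -2, -3, -4
better_fit = square_areas(xs, ys, w=2.0, b=0.5)  # every error is -0.5

# The cost is the average area of the squares.
poor_cost = sum(poor_fit) / len(poor_fit)
better_cost = sum(better_fit) / len(better_fit)

print(poor_fit, poor_cost)
print(better_fit, better_cost)
```

With the poorer line the areas are 1, 4, 9 and 16, for an average of 7.5. Tilting and shifting toward the data shrinks every square to 0.25, and the cost falls to 0.25 with it.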

A score you can optimize

Turning "how good is this line?" into a single number that changes smoothly with the parameters is the move that makes learning possible. Because J(w,b) is a differentiable function of w and b, we can ask which way to nudge each one to make the cost smaller, and then keep walking downhill. First, let's see what that cost surface actually looks like.
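As a rough sketch of what "which way to nudge" means, the code below takes a single downhill step. It assumes the straight-line model h(x) = w*x + b, for which differentiating J gives dJ/dw = (2/n) Σ (h(x_i) - y_i) x_i and dJ/db = (2/n) Σ (h(x_i) - y_i). The names `cost`, `gradient`, and `step_size`, the starting point, and the data are all hypothetical choices for illustration, not values taken from the text.

```python
def cost(xs, ys, w, b):
    """Mean squared error J(w, b) for the assumed line h(x) = w * x + b."""
    n = len(xs)
    return sum((w * x + b - y) ** 2 for x, y in zip(xs, ys)) / n


def gradient(xs, ys, w, b):
    """Partial derivatives (dJ/dw, dJ/db) of the mean squared error."""
    n = len(xs)
    errors = [w * x + b - y for x, y in zip(xs, ys)]
    grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
    grad_b = 2.0 / n * sum(errors)
    return grad_w, grad_b


# Made-up data on y = 2x, and an arbitrary starting guess.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
w, b = 0.0, 0.0
step_size = 0.05  # hypothetical value, small enough for this toy problem

before = cost(xs, ys, w, b)
grad_w, grad_b = gradient(xs, ys, w, b)

# Move against the gradient: that is the downhill direction.
w_new = w - step_size * grad_w
b_new = b - step_size * grad_b
after = cost(xs, ys, w_new, b_new)

print(before, after)
```

Both partial derivatives are negative at the starting point, so the step increases w and b, and the cost after the step is lower than before. Repeating the step is the "keep walking downhill" idea.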