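To improve a model we first need to score it: a single number saying how wrong it currently is. For regression the standard score is the mean squared error: for each example, take the gap between prediction and truth, square it, and average. With $\hat{y}_i$ the prediction and $y_i$ the true value for example $i$, over $n$ examples:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2$$
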
This is the cost function (or loss). Squaring does two jobs: it makes every
error non-negative (so overshoots and undershoots both count and cannot cancel), and it
punishes big misses far more than small ones: doubling an error quadruples its penalty.
Low cost means a good fit; the goal of training is to make the cost as small as possible.
Each red square has a side equal to one example's error, so its area is that error squared. The cost is the average area of the squares. Tilt and shift the line: as it fits better the squares shrink, and the cost drops toward its minimum.
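To make the arithmetic concrete, here is a minimal Python sketch that scores three candidate lines on the same data. The function names (`mean_squared_error`, `line_cost`) and the four data points are made up for illustration; they are assumptions, not part of the original text.

```python
def mean_squared_error(predictions, targets):
    """Average of the squared gaps between predictions and true values."""
    if len(predictions) != len(targets):
        raise ValueError("predictions and targets must have the same length")
    squared_errors = [(p - t) ** 2 for p, t in zip(predictions, targets)]
    return sum(squared_errors) / len(squared_errors)


def line_cost(slope, intercept, xs, ys):
    """Cost of the line y = slope * x + intercept on the data (xs, ys)."""
    predictions = [slope * x + intercept for x in xs]
    return mean_squared_error(predictions, ys)


# Hypothetical data lying exactly on y = 2x + 1 (illustrative only).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

poor_fit = line_cost(1.0, 0.0, xs, ys)   # the line y = x
close_fit = line_cost(2.0, 0.5, xs, ys)  # right slope, intercept slightly off
exact_fit = line_cost(2.0, 1.0, xs, ys)  # the line the data was built from

print(poor_fit, close_fit, exact_fit)  # prints: 7.5 0.25 0.0
```

Each step toward the right slope and intercept shrinks every squared error, so the single cost number falls from 7.5 to 0.25 to 0.0.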
Turning "how good is this line?" into a single differentiable number is the move that makes
learning possible. Because