The Regression Line

Once a scatter looks roughly linear, we summarise it with one line of best fit — the regression line. Written as a prediction for y from x, it is

\hat y = a + b x,

where the hat on \hat y marks it as a predicted value, b is the slope, and a is the intercept. But of all the lines we could draw through the cloud, which one is "best"?

Least squares

For each observation the line makes a prediction \hat y_i = a + b x_i, and the vertical gap to the actual point is the residual

e_i = y_i - \hat y_i.
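As a minimal sketch of the residual calculation, the snippet below evaluates e_i for one candidate line. The five data points, the candidate values a = 2.0 and b = 0.5, and the helper name `residuals` are all made up for illustration; nothing here comes from a real dataset.

```python
# Hypothetical data, chosen only for illustration.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def residuals(xs, ys, a, b):
    """Return e_i = y_i - (a + b * x_i) for each observation."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]

# A candidate line, not necessarily the best one.
res = residuals(xs, ys, a=2.0, b=0.5)

for x, y, e in zip(xs, ys, res):
    side = "above" if e > 0 else "below" if e < 0 else "on"
    print(f"x={x}, y={y}, residual={e:+.2f} ({side} the line)")
```

The sign of each residual records which side of the candidate line the point falls on.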

A residual is positive when the point sits above the line and negative when it sits below. The regression line is the one that makes these gaps small overall. Specifically, it is the line chosen by least squares, the one that minimises the sum of squared residuals

\text{SSE} = \sum_{i=1}^{n} \left(y_i - \hat y_i\right)^2 = \sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2.

We square the gaps so positives and negatives cannot cancel, and so that a few large misses are penalised heavily. The slope and intercept that drive \text{SSE} to its smallest possible value define the regression line.
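The sketch below computes \text{SSE} for any candidate (a, b) and then finds the minimising pair. It assumes the standard closed-form least-squares solution, b = S_{xy} / S_{xx} and a = \bar y - b \bar x, which the text above does not derive. The data and the function names `sse` and `least_squares` are hypothetical.

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def sse(xs, ys, a, b):
    """Sum of squared residuals for the line y_hat = a + b * x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

def least_squares(xs, ys):
    """Closed-form least-squares intercept and slope (standard result)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    b = s_xy / s_xx
    a = y_bar - b * x_bar
    return a, b

a_hat, b_hat = least_squares(xs, ys)
print(f"fitted line: y_hat = {a_hat:.2f} + {b_hat:.2f} x")
print(f"SSE at the fitted line: {sse(xs, ys, a_hat, b_hat):.3f}")
print(f"SSE at a nearby line:   {sse(xs, ys, a_hat + 0.3, b_hat - 0.1):.3f}")
```

Any other choice of slope and intercept, such as the nearby line in the last print, gives a larger \text{SSE} than the fitted pair.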

Hunt for the line

Drag the slope and intercept to move the line through the fixed scatter. The coloured stalks are the residuals — the vertical gaps from each point to the line — and the live \text{SSE} readout adds up their squares. Try to make it as small as you can; the line that wins is the least-squares regression line.
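The same hunt can be mimicked in code with a brute-force search: try many (a, b) pairs and keep the one with the smallest \text{SSE}. This is only a sketch of the trial-and-error idea, not how regression is fitted in practice. The data, the search ranges, and the 0.1 step are arbitrary assumptions.

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

def sse(a, b):
    """Sum of squared residuals for the line y_hat = a + b * x."""
    return sum((y - a - b * x) ** 2 for x, y in zip(xs, ys))

# Candidate intercepts 0.0..4.0 and slopes 0.0..1.5, both in steps of 0.1.
# The ranges and the step are arbitrary choices for this sketch.
candidates = [(i / 10, j / 10) for i in range(41) for j in range(16)]

# Keep the pair whose line has the smallest SSE.
best_a, best_b = min(candidates, key=lambda ab: sse(*ab))
print(f"best grid line: y_hat = {best_a} + {best_b} x, SSE = {sse(best_a, best_b):.3f}")
```

A grid can only get as close as its step allows. Here the true minimiser happens to lie on the grid, so the search recovers it exactly.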

It pivots through the means

The least-squares line is not free to sit anywhere: it always passes through the point of means (\bar x, \bar y). The reason is that minimising \text{SSE} over the intercept forces a = \bar y - b \bar x, so plugging the average x into the line returns the average y. In that sense the fit is "centred" on the data's centre of mass.
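A quick numerical check of the point-of-means property, on made-up data. To avoid building the property in by construction, the sketch gets a and b from the raw-sum form of the normal equations (a standard result, assumed here rather than taken from the text), and only then evaluates the line at \bar x.

```python
# Hypothetical data, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(xs)
sum_x = sum(xs)
sum_y = sum(ys)
sum_xy = sum(x * y for x, y in zip(xs, ys))
sum_xx = sum(x * x for x in xs)

# Raw-sum solution of the two normal equations for least squares.
denom = n * sum_xx - sum_x ** 2
b = (n * sum_xy - sum_x * sum_y) / denom
a = (sum_y * sum_xx - sum_x * sum_xy) / denom

x_bar = sum_x / n
y_bar = sum_y / n
prediction_at_mean = a + b * x_bar

print(f"line: y_hat = {a:.2f} + {b:.2f} x")
print(f"prediction at x_bar = {x_bar}: {prediction_at_mean:.6f}")
print(f"y_bar: {y_bar:.6f}")
```

The last two printed values agree up to floating-point rounding, which is the point-of-means property in action.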