The Regression Line
Once a scatter looks roughly linear, we summarise it with one line of best fit —
the regression line. Written as a prediction for y from
x, it is
\hat y = a + b x,
where the hat on \hat y marks it as a predicted value,
b is the slope, and a is the intercept. But of
all the lines we could draw through the cloud, which one is "best"?
Least squares
For each observation the line makes a prediction \hat y_i = a + b x_i,
and the vertical gap to the actual point is the residual
e_i = y_i - \hat y_i.
A residual is positive when the point sits above the line and negative when it sits below. The regression line is
the one that makes these gaps small overall — specifically, the line chosen by
least squares: it minimises the sum of squared residuals
\text{SSE} = \sum_{i=1}^{n} \left(y_i - \hat y_i\right)^2 = \sum_{i=1}^{n}\left(y_i - a - b x_i\right)^2.
We square the gaps so that positives and negatives cannot cancel, and so that large misses are
penalised more heavily than small ones. The slope and intercept that drive \text{SSE} to its
smallest possible value define the regression line.
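As a minimal sketch of these definitions, the Python snippet below computes \text{SSE} for a line and compares the least-squares line with a slightly different one. It assumes the standard closed-form solution b = \sum (x_i - \bar x)(y_i - \bar y) / \sum (x_i - \bar x)^2 and a = \bar y - b \bar x, which this section does not state. The data values and the helper names `sse` and `least_squares` are invented for illustration.

```python
# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sse(a, b, xs, ys):
    """Sum of squared residuals e_i = y_i - (a + b * x_i)."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


def least_squares(xs, ys):
    """Closed-form intercept a and slope b that minimise SSE."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a_hat, b_hat = least_squares(xs, ys)
best = sse(a_hat, b_hat, xs, ys)

# Any other line, for example one with a slightly steeper slope, does worse.
nudged = sse(a_hat, b_hat + 0.1, xs, ys)
print(a_hat, b_hat, best, nudged)
```

Changing the slope or intercept in either direction from the fitted values can only increase \text{SSE}, which is what "least squares" means.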
Hunt for the line
Drag the slope and intercept to move the line through the fixed
scatter. The coloured stalks are the residuals — the vertical gaps from each point to the line —
and the live \text{SSE} readout adds up their squares. Try to make it
as small as you can; the line that wins is the least-squares regression line.
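Without the interactive plot, the same hunt can be imitated in code. The sketch below, on invented data, tries every intercept and slope on a coarse grid and keeps the pair with the smallest \text{SSE}. The grid ranges and the step of 0.1 are arbitrary choices, and a grid search is only a stand-in for dragging the line by hand, not the usual way to compute a least-squares fit.

```python
# Hypothetical data, invented for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.0, 4.0, 5.0, 4.0, 5.0]


def sse(a, b):
    """Sum of squared vertical gaps between the points and the line a + b*x."""
    return sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))


# Candidate intercepts 0.0..5.0 and slopes -2.0..2.0 in steps of 0.1
# (arbitrary ranges chosen to bracket this particular data set).
intercepts = [i / 10 for i in range(0, 51)]
slopes = [j / 10 for j in range(-20, 21)]

# Evaluate every (intercept, slope) pair and keep the one with the lowest SSE.
best_sse, best_a, best_b = min(
    (sse(a, b), a, b) for a in intercepts for b in slopes
)
print(best_a, best_b, best_sse)
```

On this data the grid happens to contain the exact least-squares answer; in general a grid search only gets as close as its step size allows.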
It pivots through the means
The least-squares line is not free to sit anywhere: it always passes through the point of
means (\bar x, \bar y). Plugging
the average x into the line therefore returns the average
y: the fit is "centred" on the data's centre of mass.
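One way to see why, sketched here with a calculus step that this section does not itself use: at the minimum, the derivative of \text{SSE} with respect to the intercept a must be zero, so
\frac{\partial\,\text{SSE}}{\partial a} = -2\sum_{i=1}^{n}\left(y_i - a - b x_i\right) = 0
\quad\Longrightarrow\quad \sum_{i=1}^{n} y_i = n a + b \sum_{i=1}^{n} x_i
\quad\Longrightarrow\quad \bar y = a + b\,\bar x.
The last equation says the prediction at x = \bar x is exactly \bar y, so the line contains (\bar x, \bar y). The same condition also shows that the residuals of the least-squares line sum to zero.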
- The line of best fit is \hat y = a + b x; a residual is the vertical gap e_i = y_i - \hat y_i.
- Least squares picks a, b to minimise \text{SSE} = \sum (y_i - \hat y_i)^2.
- The fitted line always passes through the means (\bar x, \bar y).