Correlation
A scatter plot shows a relationship; the correlation coefficient
r puts a single number on it. It lives in
-1 \le r \le 1,
and measures the strength and direction of the
linear relationship between two variables.
- The sign is the direction: r > 0 for a rising cloud, r < 0 for a falling one.
- The magnitude is the tightness: |r| near 1 means the dots hug a line; near 0 means a loose, shapeless scatter.
So r = 1 and r = -1 are perfect straight
lines (up and down); r = 0 is no linear trend at all.
Where the number comes from
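Standardise each variable into z-scores (subtract the mean and divide by the
standard deviation) so both axes are measured on the
same unitless scale. Here s_x and s_y are the standard deviations computed with
divisor n; if you use the n - 1 sample version instead, divide the sum by n - 1
rather than n. Then r is simply the average product of
the paired z-scores: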
r = \frac{1}{n}\sum_{i=1}^{n} z_{x_i}\, z_{y_i}, \qquad z_{x_i} = \frac{x_i - \bar x}{s_x},\quad z_{y_i} = \frac{y_i - \bar y}{s_y}.
Read the sign off the products: a point that is above average in both
x and y contributes
(+)(+) > 0; below average in both gives
(-)(-) > 0. Points that match this way push
r up; points that disagree (high
x, low y) pull it down. When agreements
and disagreements cancel, r \approx 0.
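The recipe above can be sketched in a few lines of Python. This is a minimal illustration, assuming NumPy is available; the function name pearson_r and the five data points are made up for the example. It uses the divisor-n standard deviation, so the mean of the z-score products is exactly r.

```python
import numpy as np

def pearson_r(x, y):
    """Average product of paired z-scores (hypothetical helper name)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # np.std defaults to ddof=0, i.e. the divisor-n standard deviation
    zx = (x - x.mean()) / x.std()
    zy = (y - y.mean()) / y.std()
    return float(np.mean(zx * zy))

# Made-up illustrative data
x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]

r = pearson_r(x, y)
r_numpy = float(np.corrcoef(x, y)[0, 1])  # NumPy's built-in, for comparison
print(r, r_numpy)
```

For these points the deviation products sum to 8 and each sum of squared deviations is 10, so r = 8/10 = 0.8, and the built-in agrees up to rounding.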
Loosen the cloud
Each dot starts on the perfect line y = x and is then nudged off it by
its own fixed offset multiplied by the noise dial. With no noise the points are collinear and
r = 1; as you add noise the cloud fattens and r
slides toward 0. The live readout recomputes
r from the points on screen.
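The same experiment can be approximated offline. The sketch below is not the page's actual widget code; it assumes each dot gets one fixed standard-normal offset that is scaled by a noise value, and the names r_at_noise, n and seed are hypothetical.

```python
import numpy as np

def r_at_noise(noise, n=200, seed=0):
    """Correlation of points nudged off y = x by noise * (fixed offset)."""
    rng = np.random.default_rng(seed)
    x = np.linspace(-1.0, 1.0, n)
    offsets = rng.standard_normal(n)  # same offsets for every noise value (same seed)
    y = x + noise * offsets           # noise = 0 leaves every dot on the line
    return float(np.corrcoef(x, y)[0, 1])

for noise in (0.0, 0.25, 0.5, 1.0, 2.0):
    print(f"noise={noise:<4}  r={r_at_noise(noise):.3f}")
```

Because the seed is fixed, only the dial changes between runs, so any drop in r comes from the scaling alone.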
Two warnings
First, r only sees straight-line structure. A relationship can
be strong yet curved (a perfect parabola sampled symmetrically about its vertex, say) and still give
r \approx 0, because the rising and falling halves cancel. So
r = 0 means "no linear link", not "no link".
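A quick numerical check of the parabola case, as a sketch assuming NumPy; the ranges and point counts are arbitrary choices. It also shows that the cancellation depends on sampling both halves equally: on one half alone the same curve is strongly correlated with x.

```python
import numpy as np

# Exact parabola, sampled symmetrically about its vertex at x = 0
x = np.linspace(-3.0, 3.0, 61)
y = x ** 2
r_symmetric = float(np.corrcoef(x, y)[0, 1])

# Same curve, rising half only
x_half = np.linspace(0.0, 3.0, 61)
r_half = float(np.corrcoef(x_half, x_half ** 2)[0, 1])

print(f"symmetric: r = {r_symmetric:.3f}")
print(f"right half only: r = {r_half:.3f}")
```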
Second, like any association, a high |r| does not by itself establish
causation. A tight correlation can be driven entirely by a lurking variable, or be pure
coincidence.
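The lurking-variable case can be simulated. In this sketch (assuming NumPy; the variable z, the 0.3 noise scale and the helper residual are invented for illustration) x and y never influence each other, yet both follow a third variable z. Their correlation is high, and it largely disappears once the straight-line effect of z is subtracted from each.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

z = rng.standard_normal(n)             # hypothetical lurking variable
x = z + 0.3 * rng.standard_normal(n)   # x depends on z only
y = z + 0.3 * rng.standard_normal(n)   # y depends on z only

r_xy = float(np.corrcoef(x, y)[0, 1])

def residual(v, z):
    """What is left of v after removing its least-squares line in z."""
    slope, intercept = np.polyfit(z, v, 1)
    return v - (slope * z + intercept)

r_given_z = float(np.corrcoef(residual(x, z), residual(y, z))[0, 1])

print(f"r(x, y) = {r_xy:.3f}")
print(f"r after removing z = {r_given_z:.3f}")
```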
- r \in [-1, 1] measures the strength and direction of a linear relationship.
- It is the average product of z-scores: r = \frac{1}{n}\sum z_{x_i} z_{y_i}.
- Sign = direction; |r| near 1 = tight, near 0 = scattered.
- r = 0 rules out a line, not a curve — and correlation is never proof of cause.