Statistical Models and Estimators
A physicist reports a particle's mass as 938.3 \pm 0.5 MeV. A pollster
reports 43% support. In both cases the raw material is a fistful of noisy numbers, and the job is to
say something disciplined about the process that produced them. Mathematical statistics is
the theory of exactly that step — turning data into statements about an unknown mechanism — and it
rests on two objects: a model for how the data could have arisen, and an
estimator that reads a parameter of that model off the data.
This page pins down both, together with the three numbers we use to judge an estimator:
bias, variance, and the mean squared error that
binds them. It is the grammar the rest of the module is written in.
A statistical model is a family of distributions
We treat the data X_1,\dots,X_n as a random sample: one draw from some
probability distribution we cannot see. A parametric statistical
model is a whole family of candidate distributions, indexed by a parameter
\theta that ranges over a parameter space
\Theta:
\mathcal{P} = \{\, P_\theta : \theta \in \Theta \,\}.
The modelling assumption is that some true \theta_0 generated the
data — we just don't know which. Examples: heights are
N(\mu,\sigma^2) with \theta=(\mu,\sigma^2) and
\Theta=\mathbb{R}\times(0,\infty); a stream of yes/no answers is
\text{Bernoulli}(p) with \Theta=[0,1]; hourly
arrival counts are \text{Poisson}(\lambda) with
\Theta=(0,\infty).
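One way to make "a family indexed by \theta" concrete is to write each model as a sampler that takes \theta as an argument. The sketch below is illustrative only: the function names and the numerical parameter values are invented for the example, and it covers just the normal and Bernoulli families.

```python
import random

def sample_normal(theta, n, rng):
    """Draw n values from N(mu, sigma2), where theta = (mu, sigma2)."""
    mu, sigma2 = theta
    return [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]

def sample_bernoulli(p, n, rng):
    """Draw n values from Bernoulli(p), where theta = p lies in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

rng = random.Random(0)

# The model is the whole family; fixing theta picks out one member of it.
# These theta values are made up for illustration.
heights = sample_normal((170.0, 49.0), 5, rng)
answers = sample_bernoulli(0.43, 5, rng)

print("heights:", [round(h, 1) for h in heights])
print("answers:", answers)
```

In practice the roles are reversed: the data are given and \theta is the unknown.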
Statistic vs. parameter — the fault line
Everything downstream depends on keeping two kinds of quantity apart.
- A parameter \theta is a fixed (usually unknown) feature of the distribution, such as \mu, \sigma^2, or p. It is a property of the world, not of your sample.
- A statistic T = T(X_1,\dots,X_n) is any quantity computed from the data alone. Because it is a function of random data, it is itself a random variable with its own distribution: the sampling distribution.
An estimator is simply a statistic chosen to guess a parameter; we write
\hat\theta = T(X_1,\dots,X_n). A specific number it produces on one dataset
is an estimate. The crucial and easily-missed point: an estimator is random.
Draw a fresh sample and it jumps. So we do not ask "is this one estimate right?" — we ask about the
behaviour of the whole sampling distribution.
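The randomness of an estimator can be seen by simulation. The sketch below assumes normally distributed data with made-up true values (mu_true, sigma_true are known only to the simulation, never to the estimator) and redraws the sample 2000 times. All names are illustrative.

```python
import random

def sample_mean(xs):
    """An estimator: a function of the data alone."""
    return sum(xs) / len(xs)

rng = random.Random(1)
mu_true, sigma_true, n = 5.0, 2.0, 25   # hypothetical truth, hidden from the estimator
repeats = 2000

# Each fresh sample yields a different estimate of the same fixed parameter.
estimates = [
    sample_mean([rng.gauss(mu_true, sigma_true) for _ in range(n)])
    for _ in range(repeats)
]

centre = sum(estimates) / repeats
spread = (sum((e - centre) ** 2 for e in estimates) / (repeats - 1)) ** 0.5

print("first five estimates:", [round(e, 3) for e in estimates[:5]])
print("centre of the simulated sampling distribution:", round(centre, 3))
print("spread of the simulated sampling distribution:", round(spread, 3))
```

The list of estimates is an empirical picture of the sampling distribution. Its centre and spread are what bias and variance describe.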
Bias, variance, and mean squared error
Three numbers summarise how good an estimator \hat\theta is for a true
value \theta. The bias measures systematic lean:
\operatorname{Bias}(\hat\theta) = \mathbb{E}_\theta[\hat\theta] - \theta.
An estimator with zero bias for every \theta is called
unbiased: on average, over all the samples we might have drawn, it lands exactly on
target. The variance \operatorname{Var}_\theta(\hat\theta)
measures how much it scatters from sample to sample. The single figure that captures overall accuracy
is the mean squared error, and it splits cleanly into the two:
\operatorname{MSE}(\hat\theta) = \mathbb{E}_\theta\!\big[(\hat\theta-\theta)^2\big] = \operatorname{Var}_\theta(\hat\theta) + \operatorname{Bias}(\hat\theta)^2.
This is the bias–variance decomposition. It says accuracy has two enemies —
scatter and lean — and a good estimator keeps both small. Sometimes we even accept a little
bias to buy a large cut in variance, because MSE is what actually costs us.
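The split can be derived by adding and subtracting the estimator's mean. Writing m = \mathbb{E}_\theta[\hat\theta] (a shorthand introduced only for this step), expanding the square gives
\mathbb{E}_\theta\!\big[(\hat\theta-\theta)^2\big] = \mathbb{E}_\theta\!\big[(\hat\theta-m)^2\big] + 2\,(m-\theta)\,\mathbb{E}_\theta[\hat\theta-m] + (m-\theta)^2.
The middle term vanishes because \mathbb{E}_\theta[\hat\theta-m]=0. The first term is \operatorname{Var}_\theta(\hat\theta), and m-\theta is \operatorname{Bias}(\hat\theta), so the last term is the squared bias.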
- \operatorname{MSE}(\hat\theta)=\operatorname{Var}(\hat\theta)+\operatorname{Bias}(\hat\theta)^2: error is scatter plus squared lean.
- An unbiased estimator has \operatorname{Bias}=0, so its MSE is exactly its variance.
- Minimising MSE, not bias alone, is the honest goal — a biased estimator can beat an unbiased one.
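These three points can be checked numerically. The sketch below is a Monte Carlo illustration under assumed settings (normal data, a true value theta = 1, and a hypothetical shrunken estimator 0.8 * X-bar chosen only for the example). It compares the sample mean with the shrunken version and confirms that the simulated MSE equals variance plus squared bias.

```python
import random

rng = random.Random(2)
theta, sigma, n, trials = 1.0, 2.0, 10, 20000   # hypothetical simulation settings

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def summarise(estimates, theta):
    """Return (bias, variance, mse) computed from simulated estimates."""
    centre = mean(estimates)
    bias = centre - theta
    variance = mean((e - centre) ** 2 for e in estimates)
    mse = mean((e - theta) ** 2 for e in estimates)
    return bias, variance, mse

plain, shrunk = [], []
for _ in range(trials):
    xbar = mean(rng.gauss(theta, sigma) for _ in range(n))
    plain.append(xbar)           # the sample mean, unbiased
    shrunk.append(0.8 * xbar)    # deliberately biased: pulled toward zero

results = {
    "sample mean": summarise(plain, theta),
    "0.8 * sample mean": summarise(shrunk, theta),
}
for name, (bias, variance, mse) in results.items():
    print(f"{name}: bias={bias:.3f} var={variance:.3f} "
          f"var+bias^2={variance + bias ** 2:.3f} mse={mse:.3f}")
```

With these settings the shrunken estimator has the smaller MSE, but the ranking depends on the true value. For a theta far from zero (theta = 3, say, with the other settings unchanged) the squared bias outweighs the variance saving and the sample mean wins.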
Worked example 1 — the sample mean is unbiased
Let X_1,\dots,X_n be i.i.d. with mean \mu and
variance \sigma^2. Take the estimator
\hat\mu = \bar X = \tfrac1n\sum_i X_i. By linearity of expectation,
\mathbb{E}[\bar X] = \frac1n\sum_i \mathbb{E}[X_i] = \frac1n\,(n\mu) = \mu,
so \bar X is unbiased for \mu. Because the X_i are independent, their variances
add, so the variance of \bar X shrinks with the sample size:
\operatorname{Var}(\bar X) = \frac{1}{n^2}\sum_i \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.
Because it is unbiased, its MSE equals that variance,
\sigma^2/n — pure scatter, no lean, and it vanishes as
n\to\infty.
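The \sigma^2/n rate can be checked by simulation. The sketch below assumes normal data with made-up values mu = 0 and sigma = 3 and estimates the variance of the sample mean for several sample sizes. The function name and settings are illustrative.

```python
import random

rng = random.Random(3)
mu, sigma = 0.0, 3.0      # hypothetical true mean and standard deviation
trials = 5000

def variance_of_sample_mean(n):
    """Monte Carlo estimate of Var(X-bar) for samples of size n."""
    means = []
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        means.append(sum(xs) / n)
    centre = sum(means) / trials
    return sum((m - centre) ** 2 for m in means) / trials

for n in (4, 16, 64):
    print(f"n={n:3d}  simulated={variance_of_sample_mean(n):.4f}  "
          f"sigma^2/n={sigma ** 2 / n:.4f}")
```

Quadrupling n should cut the simulated variance by roughly a factor of four, up to Monte Carlo noise.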
Worked example 2 — why we divide by n-1
The "obvious" variance estimator \hat\sigma^2_n = \tfrac1n\sum_i (X_i-\bar X)^2
is biased: because the deviations are measured from the sample mean
\bar X rather than the true \mu, it
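systematically underestimates. Expanding around \mu gives
\sum_i (X_i-\bar X)^2 = \sum_i (X_i-\mu)^2 - n(\bar X-\mu)^2,
whose two terms have expectations n\sigma^2 and n\operatorname{Var}(\bar X)=\sigma^2. Hence
\mathbb{E}\!\left[\tfrac1n\textstyle\sum_i (X_i-\bar X)^2\right] = \frac{n-1}{n}\,\sigma^2 < \sigma^2.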
Dividing by n-1 instead of n exactly corrects
the lean, giving the unbiased sample variance
s^2 = \tfrac{1}{n-1}\sum_i (X_i-\bar X)^2 with
\mathbb{E}[s^2]=\sigma^2. That "minus one" is Bessel's
correction, and it is a bias fix, nothing more mysterious.
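The size of the lean, and its removal, can be seen in a small simulation. The sketch below assumes normal data with made-up values (sigma^2 = 4, n = 5) and averages both variance estimators over many samples. With these settings the divide-by-n version should average close to (n-1)/n * sigma^2 = 3.2 and the divide-by-(n-1) version close to 4, up to Monte Carlo noise.

```python
import random

rng = random.Random(4)
mu, sigma2, n, trials = 10.0, 4.0, 5, 40000   # hypothetical simulation settings

total_div_n = 0.0
total_div_n_minus_1 = 0.0
for _ in range(trials):
    xs = [rng.gauss(mu, sigma2 ** 0.5) for _ in range(n)]
    xbar = sum(xs) / n
    ss = sum((x - xbar) ** 2 for x in xs)   # squared deviations from the sample mean
    total_div_n += ss / n
    total_div_n_minus_1 += ss / (n - 1)

mean_div_n = total_div_n / trials
mean_div_n_minus_1 = total_div_n_minus_1 / trials

print(f"true variance:            {sigma2:.3f}")
print(f"average of divide-by-n:   {mean_div_n:.3f}")
print(f"average of divide-by-n-1: {mean_div_n_minus_1:.3f}")
```

A small n is used on purpose: the factor (n-1)/n is furthest from 1 when samples are small, so the bias is easiest to see there.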
See the decomposition move
The vertical line marks the true value \theta. The bell is the estimator's
sampling distribution. Slide bias to shift its centre off the true value,
and spread to widen it. Watch the readout: the mean squared error is always the
variance plus the squared bias — push either up and MSE climbs.
A statistic is by definition a function of the data only. So
\hat\theta = \bar X is a legitimate estimator, but
\hat\theta = \tfrac12(\bar X + \mu) is not — it contains
the very quantity \mu you are trying to estimate, which you do not have.
This sounds obvious written down, yet it sneaks in constantly: "standardise by the true
\sigma", "shrink toward the population mean". If a formula needs an unknown
parameter to be computed, it is a lovely piece of theory but it is not something you can
actually evaluate from a sample. Anything you compute must be buildable from
X_1,\dots,X_n and known constants alone.
Can a biased estimator really beat an unbiased one? Yes, routinely. Suppose one estimator is unbiased with variance
10, and a rival is biased by 1 but has variance
only 2. Their mean squared errors are
10 versus 2 + 1^2 = 3, so the biased estimator's MSE is less than a
third of its rival's. This is the logic behind shrinkage and regularisation:
deliberately nudging estimates toward zero (or toward each other) introduces a whisper of bias while
slashing variance. Stein showed in 1956, and James and Stein made explicit in 1961, that when estimating three
or more normal means at once under total squared error, the plain sample mean is inadmissible: a shrunken, biased
estimator has lower total MSE for every value of the true means. Unbiasedness is a virtue, not a commandment.
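The shrinkage effect can be reproduced in a few lines. The sketch below is a simulation under simplifying assumptions that are not in the text above: each of p = 10 means is observed once with independent unit-variance normal noise (so the "sample mean" is the observation itself), the true means are all set to 1 for illustration, and the shrinkage target is the origin. The function names are hypothetical.

```python
import random

rng = random.Random(5)
p = 10                    # number of means estimated at once (needs p >= 3)
theta = [1.0] * p         # hypothetical true means
trials = 20000

def james_stein(x):
    """Shrink x toward the origin, assuming unit-variance normal noise."""
    norm_sq = sum(v * v for v in x)
    factor = 1.0 - (len(x) - 2) / norm_sq
    return [factor * v for v in x]

def total_sq_error(estimate, truth):
    return sum((e - t) ** 2 for e, t in zip(estimate, truth))

loss_plain = 0.0
loss_js = 0.0
for _ in range(trials):
    x = [rng.gauss(t, 1.0) for t in theta]   # one noisy observation per mean
    loss_plain += total_sq_error(x, theta)
    loss_js += total_sq_error(james_stein(x), theta)

risk_plain = loss_plain / trials
risk_js = loss_js / trials
print(f"average total squared error, plain estimate: {risk_plain:.3f}")
print(f"average total squared error, James-Stein:    {risk_js:.3f}")
```

The gain is in total error across all coordinates. Any single coordinate can be estimated worse by the shrunken rule, and the advantage shrinks as the true means move far from the shrinkage target.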