Statistical Models and Estimators

A physicist reports a particle's mass as 938.3 \pm 0.5 MeV. A pollster reports 43% support. In both cases the raw material is a fistful of noisy numbers, and the job is to say something disciplined about the process that produced them. Mathematical statistics is the theory of exactly that step — turning data into statements about an unknown mechanism — and it rests on two objects: a model for how the data could have arisen, and an estimator that reads a parameter of that model off the data.

This page pins down both, together with the three numbers we use to judge an estimator: bias, variance, and the mean squared error that binds them. It is the grammar the rest of the module is written in.

A statistical model is a family of distributions

We treat the data X_1,\dots,X_n as a random sample — one draw from some probability distribution we cannot see. A parametric statistical model is a whole family of candidate distributions, indexed by a parameter \theta that ranges over a parameter space \Theta:

\mathcal{P} = \{\, P_\theta : \theta \in \Theta \,\}.

The modelling assumption is that some true \theta_0 generated the data — we just don't know which. Examples: heights are N(\mu,\sigma^2) with \theta=(\mu,\sigma^2) and \Theta=\mathbb{R}\times(0,\infty); a stream of yes/no answers is \text{Bernoulli}(p) with \Theta=[0,1]; hourly arrival counts are \text{Poisson}(\lambda) with \Theta=(0,\infty).
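To make "a family indexed by \theta" concrete, here is a minimal sketch that draws samples from the three example models using only the Python standard library. The function names (draw_normal, draw_bernoulli, draw_poisson) and the parameter values are illustrative assumptions, not part of the text above; the Poisson sampler uses Knuth's multiplication method, which is only sensible for small \lambda.

```python
import math
import random


def draw_normal(rng, mu, sigma2, n):
    """n draws from N(mu, sigma2); theta = (mu, sigma2)."""
    # random.gauss expects the standard deviation, so convert from the variance.
    sd = math.sqrt(sigma2)
    return [rng.gauss(mu, sd) for _ in range(n)]


def draw_bernoulli(rng, p, n):
    """n draws from Bernoulli(p); theta = p in [0, 1]."""
    return [1 if rng.random() < p else 0 for _ in range(n)]


def draw_poisson(rng, lam, n):
    """n draws from Poisson(lam) via Knuth's method (fine for small lam)."""
    threshold = math.exp(-lam)
    draws = []
    for _ in range(n):
        count, prod = 0, rng.random()
        # Keep multiplying uniforms until the running product drops below e^{-lam}.
        while prod > threshold:
            count += 1
            prod *= rng.random()
        draws.append(count)
    return draws


rng = random.Random(42)  # fixed seed so the sketch is reproducible
heights = draw_normal(rng, mu=170.0, sigma2=49.0, n=5)
answers = draw_bernoulli(rng, p=0.43, n=10)
arrivals = draw_poisson(rng, lam=3.0, n=10)
print(heights, answers, arrivals)
```

Each function is one member P_\theta of its family once \theta is fixed. In the simulation we choose \theta ourselves; in practice it is the unknown we are trying to recover.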

Statistic vs. parameter — the fault line

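Everything downstream depends on keeping two kinds of quantity apart. A parameter is a fixed but unknown feature of the distribution, such as \mu or p. A statistic is any function T(X_1,\dots,X_n) of the data alone, such as \bar X: it can be computed from the sample, and it changes when the sample changes.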

An estimator is simply a statistic chosen to guess a parameter; we write \hat\theta = T(X_1,\dots,X_n). A specific number it produces on one dataset is an estimate. The crucial and easily-missed point: an estimator is random. Draw a fresh sample and it jumps. So we do not ask "is this one estimate right?" — we ask about the behaviour of the whole sampling distribution.
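The point that an estimator "jumps" from sample to sample is easy to see in a small simulation. The sketch below is an illustration under assumed values (a normal model with \mu = 170 and \sigma = 10, samples of size 25); the function name sample_mean_estimates is hypothetical.

```python
import random
import statistics


def sample_mean_estimates(mu, sigma, n, repeats, seed=0):
    """Return `repeats` estimates, each the mean of a fresh N(mu, sigma^2) sample of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(repeats):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        estimates.append(statistics.fmean(sample))  # one estimate per dataset
    return estimates


# Same estimator, same true mu, five different datasets.
estimates = sample_mean_estimates(mu=170.0, sigma=10.0, n=25, repeats=5)
for k, est in enumerate(estimates, start=1):
    print(f"sample {k}: estimate = {est:.2f}")
```

The five printed numbers differ even though the rule and the true \mu never change. That spread is the sampling distribution showing itself.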

Bias, variance, and mean squared error

Three numbers summarise how good an estimator \hat\theta is for a true value \theta. The bias measures systematic lean:

\operatorname{Bias}(\hat\theta) = \mathbb{E}_\theta[\hat\theta] - \theta.

An estimator with zero bias for every \theta is called unbiased: on average, over all the samples we might have drawn, it lands exactly on target. The variance \operatorname{Var}_\theta(\hat\theta) measures how much it scatters from sample to sample. The single figure that captures overall accuracy is the mean squared error, and it splits cleanly into the two:

\operatorname{MSE}(\hat\theta) = \mathbb{E}_\theta\!\big[(\hat\theta-\theta)^2\big] = \operatorname{Var}_\theta(\hat\theta) + \operatorname{Bias}(\hat\theta)^2.

This is the bias–variance decomposition. It says accuracy has two enemies — scatter and lean — and a good estimator keeps both small. Sometimes we even accept a little bias to buy a large cut in variance, because MSE is what actually costs us.
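The decomposition can be checked numerically. The following sketch simulates a deliberately biased estimator, 0.9\,\bar X, and compares its empirical MSE with variance plus squared bias. The estimator, the parameter values (\theta = 10, \sigma = 4, n = 20) and the helper name empirical_bias_variance_mse are assumptions chosen for illustration.

```python
import random
import statistics


def empirical_bias_variance_mse(estimates, theta):
    """Empirical bias, variance and MSE of a list of estimates of theta."""
    mean_est = statistics.fmean(estimates)
    bias = mean_est - theta
    # Population-style variance (divide by the number of estimates), so the
    # identity MSE = variance + bias^2 holds exactly for these empirical moments.
    variance = statistics.pvariance(estimates, mu=mean_est)
    mse = statistics.fmean([(e - theta) ** 2 for e in estimates])
    return bias, variance, mse


rng = random.Random(0)
theta, sigma, n = 10.0, 4.0, 20
estimates = []
for _ in range(2000):
    sample = [rng.gauss(theta, sigma) for _ in range(n)]
    estimates.append(0.9 * statistics.fmean(sample))  # shrunken, hence biased

bias, variance, mse = empirical_bias_variance_mse(estimates, theta)
print(f"bias={bias:.3f}  variance={variance:.3f}  mse={mse:.3f}")
print(f"variance + bias^2 = {variance + bias ** 2:.3f}")
```

For this estimator the theoretical values are bias -1 and variance 0.81 \cdot 16 / 20 = 0.648, so the simulated figures should land near those, and the two printed MSE figures should agree up to rounding.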

Worked example 1 — the sample mean is unbiased

Let X_1,\dots,X_n be i.i.d. with mean \mu and variance \sigma^2. Take the estimator \hat\mu = \bar X = \tfrac1n\sum_i X_i. By linearity of expectation,

\mathbb{E}[\bar X] = \frac1n\sum_i \mathbb{E}[X_i] = \frac1n\,(n\mu) = \mu,

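so \bar X is unbiased for \mu. Because the X_i are independent, the variance of their sum is the sum of their variances, so the variance of \bar X shrinks with the sample size: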

\operatorname{Var}(\bar X) = \frac{1}{n^2}\sum_i \operatorname{Var}(X_i) = \frac{\sigma^2}{n}.

Because it is unbiased, its MSE equals that variance, \sigma^2/n — pure scatter, no lean, and it vanishes as n\to\infty.
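A quick simulation makes the \sigma^2/n rate visible. This is a sketch under assumed settings (normal data with \sigma = 2, so \sigma^2 = 4, and 4000 repeated samples per n); the function name variance_of_sample_mean is hypothetical.

```python
import random
import statistics


def variance_of_sample_mean(sigma, n, repeats, seed=0):
    """Estimate Var(X-bar) by simulating `repeats` samples of size n from N(0, sigma^2)."""
    rng = random.Random(seed)
    means = [
        statistics.fmean([rng.gauss(0.0, sigma) for _ in range(n)])
        for _ in range(repeats)
    ]
    return statistics.pvariance(means)


sigma = 2.0
results = {n: variance_of_sample_mean(sigma, n, repeats=4000) for n in (4, 16, 64)}
for n, simulated in results.items():
    print(f"n={n:3d}  simulated={simulated:.4f}  theory={sigma ** 2 / n:.4f}")
```

Quadrupling n should cut the simulated variance by roughly a factor of four each time, matching the formula.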

Worked example 2 — why we divide by n-1

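The "obvious" variance estimator \hat\sigma^2_n = \tfrac1n\sum_i (X_i-\bar X)^2 is biased: because the deviations are measured from the sample mean \bar X rather than the true \mu, it systematically underestimates \sigma^2. Using the identity \sum_i (X_i-\bar X)^2 = \sum_i (X_i-\mu)^2 - n(\bar X-\mu)^2 together with \mathbb{E}[(\bar X-\mu)^2] = \sigma^2/n, the expected sum of squares is n\sigma^2 - \sigma^2 = (n-1)\sigma^2, so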

\mathbb{E}\!\left[\tfrac1n\textstyle\sum_i (X_i-\bar X)^2\right] = \frac{n-1}{n}\,\sigma^2 < \sigma^2.

Dividing by n-1 instead of n exactly corrects the lean, giving the unbiased sample variance s^2 = \tfrac{1}{n-1}\sum_i (X_i-\bar X)^2 with \mathbb{E}[s^2]=\sigma^2. That "minus one" is Bessel's correction, and it is a bias fix, nothing more mysterious.
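For a small discrete distribution the expectation can be computed exactly by enumerating every possible sample, with no simulation noise. The sketch below does this for three rolls of a fair die using exact fractions; the choice of distribution, the sample size and the function name expected_variance_estimates are assumptions for illustration.

```python
from fractions import Fraction
from itertools import product


def expected_variance_estimates(values, n):
    """Exact E[divide-by-n] and E[divide-by-(n-1)] estimators for n i.i.d.
    draws taken uniformly from `values`."""
    samples = list(product(values, repeat=n))  # all equally likely samples
    total_biased = Fraction(0)
    total_unbiased = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        ss = sum((Fraction(x) - xbar) ** 2 for x in sample)  # sum of squared deviations
        total_biased += ss / n
        total_unbiased += ss / (n - 1)
    return total_biased / len(samples), total_unbiased / len(samples)


values = [1, 2, 3, 4, 5, 6]  # a fair die
n = 3
mu = Fraction(sum(values), len(values))
sigma2 = sum((Fraction(v) - mu) ** 2 for v in values) / len(values)

biased, unbiased = expected_variance_estimates(values, n)
print(sigma2)    # 35/12
print(biased)    # 35/18, which is (n-1)/n times sigma2
print(unbiased)  # 35/12
```

The divide-by-n estimator averages to exactly (n-1)/n of the true variance, and dividing by n-1 restores it to \sigma^2.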

See the decomposition move

[Interactive figure: the estimator's sampling distribution drawn as a bell curve, with a vertical line marking the true value \theta. A bias control shifts the curve's centre off the true value and a spread control widens it. The readout shows that the mean squared error always equals the variance plus the squared bias, so pushing either one up makes the MSE climb.]

A statistic is by definition a function of the data only. So \hat\theta = \bar X is a legitimate estimator, but \hat\theta = \tfrac12(\bar X + \mu) is not — it contains the very quantity \mu you are trying to estimate, which you do not have. This sounds obvious written down, yet it sneaks in constantly: "standardise by the true \sigma", "shrink toward the population mean". If a formula needs an unknown parameter to be computed, it is a lovely piece of theory but it is not something you can actually evaluate from a sample. Anything you compute must be buildable from X_1,\dots,X_n and known constants alone.

Can a biased estimator beat an unbiased one? Yes, routinely. Suppose one estimator is unbiased with variance 10, and a rival is biased by 1 but has variance only 2. Their mean squared errors are 10 versus 2 + 1^2 = 3, so the biased estimator's MSE is less than a third of the unbiased one's. This is the logic behind shrinkage and regularisation: deliberately nudging estimates toward zero (or toward each other) introduces a whisper of bias while slashing variance. Stein showed in 1956 that for estimating three or more normal means at once under total squared error, the plain sample mean is inadmissible, and in 1961 James and Stein exhibited a shrunken, biased estimator whose total MSE is lower for every value of the true means. Unbiasedness is a virtue, not a commandment.
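The arithmetic above, and the simplest shrinkage trade-off, can be sketched in a few lines. The second part considers the estimator c\,\bar X with a fixed factor c, whose MSE is c^2\sigma^2/n + (c-1)^2\mu^2. This single-mean example is an assumption introduced for illustration and is not the James–Stein estimator; the function names and parameter values are hypothetical.

```python
def mse(variance, bias):
    """Mean squared error from the bias-variance decomposition."""
    return variance + bias ** 2


# The two estimators from the text.
unbiased_mse = mse(variance=10.0, bias=0.0)  # 10.0
biased_mse = mse(variance=2.0, bias=1.0)     # 3.0
print(unbiased_mse, biased_mse)


def shrunk_mean_mse(c, mu, sigma2, n):
    """MSE of the shrunken estimator c * X-bar for i.i.d. data with mean mu, variance sigma2."""
    variance = c ** 2 * sigma2 / n  # scaling by c scales the variance by c^2
    bias = (c - 1.0) * mu           # E[c * X-bar] - mu
    return mse(variance, bias)


sigma2, n = 4.0, 4
for mu in (1.0, 10.0):
    plain = shrunk_mean_mse(1.0, mu, sigma2, n)   # c = 1 is the ordinary sample mean
    shrunk = shrunk_mean_mse(0.8, mu, sigma2, n)  # c = 0.8 shrinks toward zero
    print(f"mu={mu}: plain MSE={plain:.2f}, shrunk MSE={shrunk:.2f}")
```

With a fixed c the gain is not uniform: shrinking toward zero helps when \mu is small relative to the noise and hurts when \mu is large. That is what makes the James–Stein result surprising, since its data-dependent shrinkage wins for every value of the true means.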