The Cramér–Rao Bound and Efficiency
Every unbiased estimator scatters from sample to sample. A natural, almost greedy question is: how
small can that scatter possibly be? Is there an estimator so precise it pins the parameter
down to arbitrary tightness at fixed sample size — or is there a hard floor no honest estimator can
break through?
There is a floor. The Cramér–Rao bound gives an exact lower limit on the variance of
any unbiased estimator, set by a single quantity — the Fisher information the data
carry about the parameter. An estimator that reaches the floor is called efficient:
it wrings out every last drop of information the sample contains. This is the theoretical ceiling on
precision for unbiased estimation, and it explains why maximum likelihood is so often the method of choice.
The score and Fisher information
Start from the log-likelihood \ell(\theta)=\log p(X;\theta). Its
derivative is the score,
S(\theta) = \frac{\partial}{\partial\theta}\log p(X;\theta),
a random variable (it depends on the data X). Under mild regularity
conditions (enough to differentiate \int p(x;\theta)\,dx = 1 under the integral sign) the score has mean zero,
\mathbb{E}_\theta[S(\theta)]=0. Its variance is the
Fisher information:
I(\theta) = \operatorname{Var}_\theta\!\big[S(\theta)\big] = \mathbb{E}_\theta\!\left[\Big(\tfrac{\partial}{\partial\theta}\log p(X;\theta)\Big)^{\!2}\right] = -\,\mathbb{E}_\theta\!\left[\tfrac{\partial^2}{\partial\theta^2}\log p(X;\theta)\right].
The second form — minus the expected curvature of the log-likelihood — is the one you usually
compute, and it is deeply intuitive: a sharply peaked log-likelihood (large negative
curvature) means the data strongly distinguish nearby parameter values, so the data are
informative. Information adds up over independent observations: for an
i.i.d. sample of size n,
I_n(\theta) = n\,I_1(\theta).
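As a quick numerical sanity check of these identities, the sketch below enumerates a Bernoulli(\theta) model exactly. The choice of model, the value \theta = 0.3, the sample size n = 12 and the helper names are illustrative assumptions, not part of the text above. It checks that the score has mean zero, that its variance matches minus the expected curvature, and that information adds up over independent draws.

```python
from math import comb

theta = 0.3  # illustrative parameter value (assumption)

def score(x, theta):
    """Derivative of log p(x; theta) for one Bernoulli observation."""
    return x / theta - (1 - x) / (1 - theta)

def curvature(x, theta):
    """Second derivative of log p(x; theta) for one Bernoulli observation."""
    return -x / theta**2 - (1 - x) / (1 - theta) ** 2

# Exact expectations over the two outcomes x = 0 and x = 1.
probs = {0: 1 - theta, 1: theta}
mean_score = sum(p * score(x, theta) for x, p in probs.items())
var_score = sum(p * score(x, theta) ** 2 for x, p in probs.items()) - mean_score**2
neg_exp_curv = -sum(p * curvature(x, theta) for x, p in probs.items())

# For n i.i.d. draws the total score depends on the data only through
# k = number of ones. Its mean is zero, so E[score^2] is its variance.
n = 12
info_n = sum(
    comb(n, k) * theta**k * (1 - theta) ** (n - k)
    * (k / theta - (n - k) / (1 - theta)) ** 2
    for k in range(n + 1)
)

print("E[S]      =", mean_score)
print("Var[S]    =", var_score)
print("-E[S']    =", neg_exp_curv)
print("I_n / n   =", info_n / n)
```

All three information values should agree with 1/(\theta(1-\theta)) up to floating-point rounding.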
The bound
- For any unbiased estimator \hat\theta of \theta (under regularity conditions),
\operatorname{Var}_\theta(\hat\theta) \ \ge\ \frac{1}{I_n(\theta)} = \frac{1}{n\,I_1(\theta)}.
- The floor shrinks like 1/n: more data, more information, less irreducible scatter.
- An unbiased estimator that attains the bound is efficient.
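For readers who want to see where the inequality comes from, here is a sketch of the standard Cauchy–Schwarz argument. The symbol S_n, for the score of the whole sample, is notation introduced here for convenience, and every step assumes the regularity conditions hold so that differentiation and integration can be swapped.

```latex
% Unbiasedness: E_theta[hat theta] = theta for every theta.
% Differentiate both sides in theta, using dp/dtheta = p * S_n.
\int \hat\theta(x)\, p(x;\theta)\,dx = \theta
\;\Longrightarrow\;
\int \hat\theta(x)\, \frac{\partial}{\partial\theta} p(x;\theta)\,dx = 1
\;\Longrightarrow\;
\mathbb{E}_\theta\big[\hat\theta\, S_n(\theta)\big] = 1 .

% The score has mean zero, so this expectation is a covariance.
\operatorname{Cov}_\theta\big(\hat\theta,\ S_n(\theta)\big) = 1 .

% Cauchy--Schwarz, with Var(S_n) = I_n(theta):
1 = \operatorname{Cov}_\theta\big(\hat\theta, S_n\big)^2
\ \le\ \operatorname{Var}_\theta(\hat\theta)\,\operatorname{Var}_\theta(S_n)
= \operatorname{Var}_\theta(\hat\theta)\, I_n(\theta).
```

Dividing through by I_n(\theta) gives the bound. Equality in Cauchy–Schwarz requires \hat\theta to be an affine function of the score, which is why only special models admit an exactly efficient estimator.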
The efficiency of an unbiased estimator is the ratio of the bound to its actual
variance, e(\hat\theta) = \dfrac{1/I_n(\theta)}{\operatorname{Var}(\hat\theta)}\in(0,1];
efficiency 1 means it sits exactly on the floor. And the punchline for maximum likelihood: the MLE is
asymptotically efficient,
\sqrt{n}\,(\hat\theta_{\text{MLE}} - \theta) \ \xrightarrow{d}\ N\!\Big(0,\ \tfrac{1}{I_1(\theta)}\Big),
so for large samples its variance is approximately 1/(n\,I_1(\theta)), the Cramér–Rao floor itself
(under the same regularity conditions). That is a large part of why it dominates practice.
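To see asymptotic efficiency at work, the following simulation is a minimal sketch under assumptions of its own. It uses an Exponential model with rate \lambda (not discussed above), for which I_1(\lambda) = 1/\lambda^2 and the MLE is 1/\bar X. The sample size, replication count and seed are arbitrary illustrative choices. If the limit theorem is doing its job, n times the variance of the MLE should sit close to 1/I_1(\lambda) = \lambda^2.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed, arbitrary choice
lam = 2.0      # true rate (illustrative)
n = 400        # sample size per replication
reps = 5000    # number of simulated samples

# Each row is one i.i.d. sample of size n from Exponential(rate=lam).
samples = rng.exponential(scale=1.0 / lam, size=(reps, n))

# The MLE of the rate is the reciprocal of the sample mean.
mle = 1.0 / samples.mean(axis=1)

scaled_var = n * mle.var(ddof=1)   # Monte Carlo estimate of n * Var(MLE)
asymptotic_var = lam**2            # 1 / I_1(lam)

print(f"n * Var(MLE) ~ {scaled_var:.3f}   1/I_1 = {asymptotic_var:.3f}")
```

The match is only approximate: at finite n this MLE is slightly biased and its variance sits a little above the limit, which is exactly what "asymptotically" is warning you about.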
Worked example 1 — Bernoulli
For one \text{Bernoulli}(\theta) observation,
\log p = x\log\theta + (1-x)\log(1-\theta), so
\frac{\partial^2}{\partial\theta^2}\log p = -\frac{x}{\theta^2}-\frac{1-x}{(1-\theta)^2}.
Taking -\mathbb{E}[\cdot] with \mathbb{E}[X]=\theta,
I_1(\theta) = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta}+\frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.
The Cramér–Rao floor for an unbiased estimator of \theta from
n trials is therefore
\dfrac{1}{n\,I_1(\theta)} = \dfrac{\theta(1-\theta)}{n}. And the sample
proportion \hat\theta=\bar X, which is unbiased, has variance exactly
\theta(1-\theta)/n, so it sits on the floor. The sample proportion is a
fully efficient estimator.
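A short Monte Carlo check makes the claim tangible. This is a sketch, not a proof: the values \theta = 0.3 and n = 50, the replication count and the seed are all arbitrary choices made for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)  # fixed seed, arbitrary choice
theta, n, reps = 0.3, 50, 200_000

# Number of successes in n trials, repeated reps times;
# dividing by n gives the sample proportion for each replication.
p_hat = rng.binomial(n, theta, size=reps) / n

empirical_var = p_hat.var(ddof=1)
cr_floor = theta * (1 - theta) / n

print(f"mean of p_hat = {p_hat.mean():.4f}  (theta = {theta})")
print(f"empirical Var = {empirical_var:.6f}")
print(f"C-R floor     = {cr_floor:.6f}")
```

The empirical variance should agree with the floor up to simulation noise, and the average of the simulated proportions should sit near \theta, consistent with unbiasedness.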
Worked example 2 — Poisson and Normal
For \text{Poisson}(\lambda),
\log p = x\log\lambda - \lambda - \log x! gives
-\mathbb{E}[\partial^2_\lambda \log p] = \mathbb{E}[X]/\lambda^2 = 1/\lambda,
so I_1(\lambda)=1/\lambda and the bound is
\lambda/n. The sample mean has variance
\lambda/n — efficient again.
For N(\mu,\sigma^2) with known \sigma^2,
I_1(\mu)=1/\sigma^2, so the floor is \sigma^2/n —
met exactly by \bar X. A pattern emerges: for these workhorse models the
sample mean is not just reasonable, it is optimal among unbiased estimators.
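To see what "optimal" buys you, the sketch below compares the sample mean with a rival unbiased estimator of \mu, the sample median, using the efficiency ratio defined earlier. The rival, the settings (n = 101, \sigma = 1), the replication count and the seed are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)  # fixed seed, arbitrary choice
mu, sigma, n, reps = 0.0, 1.0, 101, 20_000

# Each row is one sample of size n from N(mu, sigma^2).
samples = rng.normal(loc=mu, scale=sigma, size=(reps, n))
cr_floor = sigma**2 / n

var_mean = samples.mean(axis=1).var(ddof=1)
var_median = np.median(samples, axis=1).var(ddof=1)

# Efficiency = bound / actual variance, as in the definition above.
eff_mean = cr_floor / var_mean
eff_median = cr_floor / var_median

print(f"efficiency of sample mean   ~ {eff_mean:.3f}")
print(f"efficiency of sample median ~ {eff_median:.3f}")
```

The mean should come out with efficiency near 1, while the median lands well below it (its large-sample efficiency under normality is 2/\pi, roughly 0.64). Put differently, the median needs around half as many observations again to match the mean's precision.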
The floor, and how it moves
Consider the Cramér–Rao floor \theta(1-\theta)/n for estimating a
Bernoulli bias, plotted as a curve in \theta. Two things to notice as the sample size n varies. The
whole curve drops like 1/n: quadruple the data and the floor quarters.
And its shape peaks at \theta=\tfrac12: a near-fair coin is the
hardest to estimate (Fisher information
1/(\theta(1-\theta)) is smallest there), while a strongly biased coin is
easy.
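Both observations can be read off a few numbers. The sketch below tabulates the floor on a grid of \theta values for two sample sizes; the grid, the sample sizes 100 and 400, and the function name cr_floor are illustrative choices.

```python
import numpy as np

def cr_floor(theta, n):
    """Cramér–Rao floor theta * (1 - theta) / n for a Bernoulli bias."""
    return theta * (1 - theta) / n

thetas = np.linspace(0.05, 0.95, 19)   # grid of biases, step 0.05
floor_100 = cr_floor(thetas, 100)
floor_400 = cr_floor(thetas, 400)

peak_theta = thetas[np.argmax(floor_100)]  # where the floor is highest
ratio = floor_400 / floor_100              # effect of quadrupling n

for t, f1, f4 in zip(thetas[::3], floor_100[::3], floor_400[::3]):
    print(f"theta={t:.2f}   n=100: {f1:.5f}   n=400: {f4:.5f}")
print("floor is highest at theta =", round(float(peak_theta), 2))
```

Every entry of ratio equals one quarter, and the largest floor on the grid sits at \theta = 0.5.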
The Cramér–Rao bound limits the variance of unbiased estimators. A biased
estimator can — and often does — have variance below 1/I_n(\theta),
and even a smaller mean squared error than any unbiased estimator, exactly the
bias–variance trade we met earlier. So "beating the Cramér–Rao bound" is not a paradox and not a
Nobel prize; it just means your estimator isn't unbiased. When you compare estimators, be clear about
which fence you're inside: the bound is a statement about unbiased estimators specifically, not a law
of nature about all estimators. For the honest comparison, put everything on the common footing of
MSE.
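A concrete case helps here. Take the Bernoulli model and the smoothed estimator (\sum X_i + 1)/(n + 2), an "add-one" rule chosen purely as an illustration and not one this section has discussed. Its variance, bias and MSE have closed forms, so the sketch below simply evaluates them next to the floor \theta(1-\theta)/n; the sample size n = 20 and the \theta values are arbitrary.

```python
def cr_floor(theta, n):
    """Cramér–Rao floor for unbiased estimators of a Bernoulli bias."""
    return theta * (1 - theta) / n

def smoothed_stats(theta, n):
    """Exact variance, bias and MSE of the estimator (sum(X) + 1) / (n + 2)."""
    var = n * theta * (1 - theta) / (n + 2) ** 2
    bias = (1 - 2 * theta) / (n + 2)
    return var, bias, var + bias**2

n = 20
results = {}
for theta in (0.05, 0.3, 0.5):
    var, bias, mse = smoothed_stats(theta, n)
    floor = cr_floor(theta, n)
    results[theta] = (var, mse, floor)
    print(f"theta={theta:.2f}  var={var:.5f}  mse={mse:.5f}  floor={floor:.5f}")
```

The variance is below the floor at every \theta, which is allowed because the estimator is biased. On MSE the picture is mixed: the smoothed rule wins near \theta = 0.5 and loses near the edges, which is why MSE, evaluated across the parameter range, is the fair basis for comparison.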
The bound comes with fine print: regularity conditions, chiefly that the support of
the distribution does not depend on the parameter (so you may swap differentiation and integration).
Break that and the bound simply does not apply. The
U(0,\theta) model is the famous rebel: its support
[0,\theta] moves with \theta, and the MLE is the sample maximum. Rescaled by
(n+1)/n to remove its bias, it has variance \theta^2/(n(n+2)), of order
1/n^2, which vanishes far faster than the 1/n
the bound would suggest. It isn't magic and it isn't a loophole in a theorem; it is a model that was
never inside the theorem's hypotheses to begin with. Always check the regularity conditions before
quoting a bound.
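The faster rate is easy to see in simulation. The sketch below, with illustrative settings (\theta = 2, n = 50, an arbitrary seed), compares two unbiased estimators of \theta in the U(0,\theta) model: the sample maximum multiplied by (n+1)/n, and twice the sample mean. The second estimator is brought in only as a yardstick with the ordinary 1/n behaviour.

```python
import numpy as np

rng = np.random.default_rng(3)  # fixed seed, arbitrary choice
theta, n, reps = 2.0, 50, 20_000

# Each row is one sample of size n from U(0, theta).
samples = rng.uniform(0.0, theta, size=(reps, n))

# The sample maximum is the MLE; the factor (n + 1) / n removes its bias.
rescaled_max = (n + 1) / n * samples.max(axis=1)
# Twice the sample mean: another unbiased estimator, for comparison.
twice_mean = 2.0 * samples.mean(axis=1)

var_max = rescaled_max.var(ddof=1)
var_mean = twice_mean.var(ddof=1)
exact_var_max = theta**2 / (n * (n + 2))   # order 1/n^2
exact_var_mean = theta**2 / (3 * n)        # order 1/n

print(f"rescaled max: simulated {var_max:.5f}, exact {exact_var_max:.5f}")
print(f"twice mean:   simulated {var_mean:.5f}, exact {exact_var_mean:.5f}")
```

With these settings the rescaled maximum should be more than ten times as precise as twice the mean, and the gap widens as n grows.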