The Cramér–Rao Bound and Efficiency

Every unbiased estimator scatters from sample to sample. A natural, almost greedy question is: how small can that scatter possibly be? Is there an estimator so precise it pins the parameter down to arbitrary tightness at fixed sample size — or is there a hard floor no honest estimator can break through?

There is a floor. The Cramér–Rao bound gives an explicit lower limit on the variance of any unbiased estimator, set by a single quantity: the Fisher information the data carry about the parameter. An estimator that reaches the floor is called efficient: it wrings out every last drop of information the sample contains. A floor on variance is a ceiling on the precision of unbiased estimation, and it explains why maximum likelihood is so often the method of choice.

The score and Fisher information

Start from the log-likelihood \ell(\theta)=\log p(X;\theta). Its derivative is the score,

S(\theta) = \frac{\partial}{\partial\theta}\log p(X;\theta),

a random variable (it depends on the data X). Under mild regularity conditions the score has mean zero, \mathbb{E}_\theta[S(\theta)]=0. Its variance is the Fisher information:

I(\theta) = \operatorname{Var}_\theta\!\big[S(\theta)\big] = \mathbb{E}_\theta\!\left[\Big(\tfrac{\partial}{\partial\theta}\log p(X;\theta)\Big)^{\!2}\right] = -\,\mathbb{E}_\theta\!\left[\tfrac{\partial^2}{\partial\theta^2}\log p(X;\theta)\right].

The second form — minus the expected curvature of the log-likelihood — is the one you usually compute, and it is deeply intuitive: a sharply peaked log-likelihood (large negative curvature) means the data strongly distinguish nearby parameter values, so the data are informative. Information adds up over independent observations: for an i.i.d. sample of size n, I_n(\theta) = n\,I_1(\theta).
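The identities above can be checked numerically. The sketch below is only an illustration under assumptions of my own: it takes a Binomial(m, \theta) count with m = 5 and \theta = 0.3 as the model (equivalent in information to m independent Bernoulli trials), and it approximates the derivatives by central finite differences with a hypothetical step h. It computes the mean of the score, the variance form, and the curvature form of the Fisher information, and compares them with m/(\theta(1-\theta)), which is m times the one-trial information.

```python
import math

def log_pmf(x, theta, m=5):
    """Log-probability of x successes in m trials, Binomial(m, theta)."""
    return (math.log(math.comb(m, x))
            + x * math.log(theta) + (m - x) * math.log(1.0 - theta))

def score_summary(theta, m=5, h=1e-4):
    """Return (E[S], E[S^2], -E[d2 log p]) by exact summation over the support.

    Derivatives in theta are approximated with central finite differences.
    """
    mean_score = var_form = curv_form = 0.0
    for x in range(m + 1):
        p = math.exp(log_pmf(x, theta, m))
        lo, mid, hi = (log_pmf(x, t, m) for t in (theta - h, theta, theta + h))
        score = (hi - lo) / (2 * h)             # first derivative of log p
        curvature = (hi - 2 * mid + lo) / h**2  # second derivative of log p
        mean_score += p * score
        var_form += p * score**2
        curv_form -= p * curvature
    return mean_score, var_form, curv_form

theta, m = 0.3, 5
mean_score, var_form, curv_form = score_summary(theta, m)
closed_form = m / (theta * (1.0 - theta))  # m times the one-trial information
print(mean_score, var_form, curv_form, closed_form)
```

The mean of the score should come out at zero up to finite-difference error, and the two forms of the information should agree with each other and with the closed form.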

The bound

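For any unbiased estimator \hat\theta of \theta built from an i.i.d. sample of size n, and under the regularity conditions mentioned above, the Cramér–Rao bound states

\operatorname{Var}_\theta(\hat\theta) \ \ge\ \frac{1}{I_n(\theta)} = \frac{1}{n\,I_1(\theta)}.

The efficiency of an unbiased estimator is the ratio of the bound to its actual variance, e(\hat\theta) = \dfrac{1/I_n(\theta)}{\operatorname{Var}(\hat\theta)}\in(0,1]; efficiency 1 means it sits exactly on the floor. And the punchline for maximum likelihood: the MLE is asymptotically efficient,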

\sqrt{n}\,(\hat\theta_{\text{MLE}} - \theta) \ \xrightarrow{d}\ N\!\Big(0,\ \tfrac{1}{I_1(\theta)}\Big),

so for large samples the variance of the MLE is approximately 1/(n\,I_1(\theta)), which is the Cramér–Rao floor itself. No unbiased estimator can do better, and that is the real reason the MLE dominates practice.
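A small simulation can make the asymptotic statement concrete. The sketch below assumes a model that does not appear in the text, an exponential distribution with rate \lambda, for which I_1(\lambda) = 1/\lambda^2 and the MLE is the reciprocal of the sample mean. The values of lam, n, reps and the seed are arbitrary choices for illustration. If the limit above holds, n times the variance of the MLE should be close to 1/I_1(\lambda) = \lambda^2.

```python
import random
import statistics

random.seed(0)                      # fixed seed so the run is repeatable
lam, n, reps = 2.0, 200, 3000       # illustrative values, not from the text

def mle_rate(sample):
    """MLE of the exponential rate: reciprocal of the sample mean."""
    return 1.0 / statistics.fmean(sample)

# Draw `reps` independent samples of size n and record the MLE of each.
estimates = [mle_rate([random.expovariate(lam) for _ in range(n)])
             for _ in range(reps)]

scaled_var = n * statistics.pvariance(estimates)  # n * Var(MLE)
asymptotic_var = lam**2                           # 1 / I_1(lam)
print(scaled_var, asymptotic_var)
```

At finite n this MLE is slightly biased and its scaled variance sits a little above \lambda^2; the gap should shrink as n grows.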

Worked example 1 — Bernoulli

For one \text{Bernoulli}(\theta) observation, \log p = x\log\theta + (1-x)\log(1-\theta), so \frac{\partial^2}{\partial\theta^2}\log p = -\frac{x}{\theta^2}-\frac{1-x}{(1-\theta)^2}. Taking -\mathbb{E}[\cdot] with \mathbb{E}[X]=\theta,

I_1(\theta) = \frac{\theta}{\theta^2} + \frac{1-\theta}{(1-\theta)^2} = \frac{1}{\theta}+\frac{1}{1-\theta} = \frac{1}{\theta(1-\theta)}.

The Cramér–Rao floor for an unbiased estimator of \theta from n trials is therefore \dfrac{1}{n\,I_1(\theta)} = \dfrac{\theta(1-\theta)}{n}. The sample proportion \hat\theta=\bar X is unbiased and has variance exactly \theta(1-\theta)/n, so it sits on the floor. The sample proportion is a fully efficient estimator.
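The claim that the sample proportion sits on the floor can be checked by simulation. This is a minimal sketch with arbitrary illustrative values for theta, n, reps and the seed: it draws many Bernoulli samples, computes the sample proportion of each, and compares the empirical variance with \theta(1-\theta)/n.

```python
import random
import statistics

random.seed(1)                       # fixed seed so the run is repeatable
theta, n, reps = 0.3, 50, 10000      # illustrative values, not from the text

def sample_proportion(theta, n):
    """Proportion of successes in n independent Bernoulli(theta) trials."""
    return sum(random.random() < theta for _ in range(n)) / n

props = [sample_proportion(theta, n) for _ in range(reps)]

empirical_var = statistics.pvariance(props)
cr_floor = theta * (1 - theta) / n   # 1 / (n * I_1(theta))
efficiency = cr_floor / empirical_var
print(empirical_var, cr_floor, efficiency)
```

The estimated efficiency should be close to 1, with the remaining discrepancy due to Monte Carlo noise.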

Worked example 2 — Poisson and Normal

For \text{Poisson}(\lambda), \log p = x\log\lambda - \lambda - \log x! gives -\mathbb{E}[\partial^2_\lambda \log p] = \mathbb{E}[X]/\lambda^2 = 1/\lambda, so I_1(\lambda)=1/\lambda and the bound is \lambda/n. The sample mean has variance \lambda/n — efficient again.

For N(\mu,\sigma^2) with known \sigma^2, I_1(\mu)=1/\sigma^2, so the floor is \sigma^2/n — met exactly by \bar X. A pattern emerges: for these workhorse models the sample mean is not just reasonable, it is optimal among unbiased estimators.
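To see what falling short of the floor looks like, one can compare the sample mean with another unbiased estimator of \mu. The sketch below uses the sample median as the comparison, which is my choice and not something the text discusses; for Normal data its efficiency is known to be about 2/\pi \approx 0.64 in large samples. The values of mu, sigma, n, reps and the seed are arbitrary.

```python
import random
import statistics

random.seed(2)                            # fixed seed so the run is repeatable
mu, sigma, n, reps = 0.0, 1.0, 101, 5000  # illustrative values

means, medians = [], []
for _ in range(reps):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    means.append(statistics.fmean(sample))
    medians.append(statistics.median(sample))

cr_floor = sigma**2 / n                   # 1 / (n * I_1(mu)) with known sigma
eff_mean = cr_floor / statistics.pvariance(means)
eff_median = cr_floor / statistics.pvariance(medians)
print(eff_mean, eff_median)
```

The mean should show efficiency near 1, while the median, although also unbiased by symmetry, should land well below it.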

The floor, and how it moves

Plot the Cramér–Rao floor \theta(1-\theta)/n for estimating a Bernoulli bias as a function of \theta, and two things stand out as the sample size n varies. The whole curve drops like 1/n: quadruple the data and the floor quarters. And its shape peaks at \theta=\tfrac12: a near-fair coin is the hardest to estimate (the Fisher information 1/(\theta(1-\theta)) is smallest there), while a strongly biased coin is easy.
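Both features of the Bernoulli floor can be read off a small table. The sketch below simply evaluates \theta(1-\theta)/n on a grid; the grid of \theta values and the sample sizes are arbitrary choices for illustration.

```python
def cr_floor(theta, n):
    """Cramér–Rao floor for an unbiased estimator of a Bernoulli bias."""
    return theta * (1.0 - theta) / n

thetas = [0.1, 0.3, 0.5, 0.7, 0.9]   # illustrative grid
sample_sizes = (25, 100, 400)        # each is four times the previous one

print("theta: " + "  ".join(f"{t:7.1f}" for t in thetas))
for n in sample_sizes:
    row = "  ".join(f"{cr_floor(t, n):7.5f}" for t in thetas)
    print(f"n={n:>3}: {row}")
```

Reading down a column shows the floor dividing by four each time n is multiplied by four; reading across a row shows the maximum at \theta = 0.5 and the symmetry about it.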

The Cramér–Rao bound limits the variance of unbiased estimators. A biased estimator can — and often does — have variance below 1/I_n(\theta), and even a smaller mean squared error than any unbiased estimator, exactly the bias–variance trade we met earlier. So "beating the Cramér–Rao bound" is not a paradox and not a Nobel prize; it just means your estimator isn't unbiased. When you compare estimators, be clear about which fence you're inside: the bound is a statement about unbiased estimators specifically, not a law of nature about all estimators. For the honest comparison, put everything on the common footing of MSE.
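A concrete case may help. The sketch below compares the sample proportion with a shrunken Bernoulli estimator, (\sum X_i + 1)/(n + 2), which is an example I am choosing for illustration and is not named in the text. Its variance, bias and MSE have simple closed forms, so no simulation is needed. The sample size n = 20 and the \theta values are arbitrary.

```python
def unbiased_mse(theta, n):
    """MSE of the sample proportion: its variance, which equals the floor."""
    return theta * (1 - theta) / n

def shrunk_var_bias_mse(theta, n):
    """Exact variance, bias and MSE of the estimator (sum(X) + 1) / (n + 2)."""
    var = n * theta * (1 - theta) / (n + 2) ** 2
    bias = (1 - 2 * theta) / (n + 2)   # pulls the estimate toward 1/2
    return var, bias, var + bias**2

n = 20
for theta in (0.5, 0.3, 0.05):
    var, bias, mse = shrunk_var_bias_mse(theta, n)
    floor = unbiased_mse(theta, n)
    print(f"theta={theta}: floor={floor:.5f} var={var:.5f} "
          f"bias={bias:+.4f} mse={mse:.5f}")
```

The variance of the biased estimator is below the floor at every \theta, and near \theta = 0.5 its MSE is smaller as well; for \theta close to 0 the squared bias takes over and the sample proportion has the smaller MSE. Which estimator wins depends on where the truth lies.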

The bound comes with fine print: regularity conditions, chiefly that the support of the distribution does not depend on the parameter (so that you may swap differentiation and integration). Break that and the bound simply does not apply. The uniform distribution U(0,\theta) is the famous rebel: its support [0,\theta] moves with \theta. Its MLE is the sample maximum X_{(n)}, and the unbiased rescaling \tfrac{n+1}{n}X_{(n)} has variance \theta^2/(n(n+2)), of order 1/n^2, vanishing far faster than the 1/n rate the bound would suggest. It isn't magic and it isn't a loophole in a theorem; it is a model that was never inside the theorem's hypotheses to begin with. Always check the regularity conditions before quoting a bound.
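The 1/n^2 rate for the uniform model is easy to see in simulation. The sketch below uses the estimator (n+1)/n times the sample maximum, which is unbiased for \theta under U(0,\theta) and has variance \theta^2/(n(n+2)). The values of theta, reps, the sample sizes and the seed are arbitrary choices for illustration.

```python
import random
import statistics

random.seed(3)                 # fixed seed so the run is repeatable
theta, reps = 1.0, 4000        # illustrative values, not from the text

def unbiased_max(n):
    """(n+1)/n times the maximum of n draws from U(0, theta)."""
    return (n + 1) / n * max(random.uniform(0.0, theta) for _ in range(n))

results = {}
for n in (25, 50, 100):
    estimates = [unbiased_max(n) for _ in range(reps)]
    results[n] = statistics.pvariance(estimates)
    exact = theta**2 / (n * (n + 2))
    # n^2 * variance should stay roughly constant if the rate is 1/n^2
    print(n, results[n], exact, n**2 * results[n])
```

Going from n = 25 to n = 100 should cut the variance by a factor of roughly 15, where a 1/n rate would give only a factor of 4.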