Sufficiency and the Factorization Theorem

You flip a coin 1000 times to estimate its bias. Do you need to remember the exact sequence — HTHHT… all thousand symbols — or is the plain count of heads enough? Your intuition says the count is enough, and it is exactly right. Once you know 632 heads came up, the particular order in which they fell tells you nothing more about the bias \theta.

That intuition has a precise name: the count is a sufficient statistic. Sufficiency is the theory of lossless compression of data with respect to a parameter — how to throw away mountains of raw data while keeping every scrap of information about \theta. And there is a beautiful mechanical test for it, the factorization theorem, that reads sufficiency straight off the likelihood.

What "sufficient" means

A statistic T = T(X_1,\dots,X_n) is sufficient for \theta if the conditional distribution of the data given T does not depend on \theta:

P\big((X_1,\dots,X_n) = x \mid T = t;\ \theta\big) \ \text{is free of } \theta.

Read it as a two-stage story. To simulate the data you could first draw the summary T from a distribution that depends on \theta, and then fill in the rest of the detail by a mechanism that ignores \theta entirely. All the information about the parameter has been squeezed into T; the leftover randomness is parameter-free noise. Knowing the full dataset then buys you nothing over knowing T alone.
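
The two-stage story can be sketched in code for the coin-flip setup from the opening. This is only an illustration, assuming independent flips with a common bias; the function names `two_stage_sample` and `direct_sample` are made up for this example.

```python
import random


def two_stage_sample(n, theta, rng):
    """Draw n coin flips in two stages: summary first, detail second."""
    # Stage 1 (depends on theta): draw the head count T ~ Binomial(n, theta).
    # Summing n Bernoulli draws is just one simple way to get a binomial draw.
    t = sum(rng.random() < theta for _ in range(n))
    # Stage 2 (ignores theta): pick which t of the n positions are heads,
    # uniformly at random among all arrangements.
    heads = set(rng.sample(range(n), t))
    return [1 if i in heads else 0 for i in range(n)]


def direct_sample(n, theta, rng):
    """Draw n coin flips the ordinary way, one flip at a time."""
    return [1 if rng.random() < theta else 0 for _ in range(n)]


if __name__ == "__main__":
    rng = random.Random(0)
    print(two_stage_sample(10, 0.3, rng))
    print(direct_sample(10, 0.3, rng))
```

Both functions produce sequences with the same distribution. The point of the sketch is that \theta appears only in stage 1; stage 2 never looks at it.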

The Fisher–Neyman factorization theorem

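Checking a conditional distribution by hand is painful. The factorization theorem replaces it with a one-line pattern-match on the joint density (or mass function) p(x;\theta): T is sufficient for \theta if and only if there are nonnegative functions g and h such that

p(x;\theta) = g\big(T(x),\theta\big)\,h(x) \quad \text{for all } x \text{ and } \theta.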

The recipe: write down the likelihood, and try to peel it into a piece that touches \theta only via some summary T(x), times a leftover piece h(x) that \theta never sees. If you can, that T is sufficient. Because the likelihood then depends on the data only through T, every likelihood-based method — the MLE included — can be computed from T alone.

Worked example 1 — Bernoulli: the count of successes

For n independent \text{Bernoulli}(\theta) trials with outcomes x_i\in\{0,1\}, the joint mass function is

p(x;\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}\,(1-\theta)^{\,n-\sum_i x_i}.

Everything on the right depends on the data only through T(x)=\sum_i x_i, the number of successes. Take g(T,\theta)=\theta^{T}(1-\theta)^{n-T} and h(x)=1: factorization holds, so the count of successes is sufficient for \theta. The order of the flips is irrelevant — just as intuition promised.
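
For a small n the definition of sufficiency can be checked by brute force. The sketch below enumerates every binary sequence with a given count and computes its conditional probability. The helper names `joint_pmf` and `conditional_given_count` are illustrative, and \theta is assumed to lie strictly between 0 and 1 so the conditioning event has positive probability.

```python
from itertools import product
from math import comb, isclose


def joint_pmf(x, theta):
    """Joint mass function of independent Bernoulli(theta) outcomes x."""
    k = sum(x)
    return theta**k * (1 - theta) ** (len(x) - k)


def conditional_given_count(n, t, theta):
    """Return P(X = x | T = t) for every binary x of length n with sum t."""
    seqs = [x for x in product((0, 1), repeat=n) if sum(x) == t]
    prob_t = sum(joint_pmf(x, theta) for x in seqs)  # P(T = t)
    return {x: joint_pmf(x, theta) / prob_t for x in seqs}


if __name__ == "__main__":
    n, t = 5, 2
    for theta in (0.2, 0.5, 0.9):
        cond = conditional_given_count(n, t, theta)
        # Every arrangement should get probability 1 / C(n, t), whatever theta is.
        uniform = all(isclose(p, 1 / comb(n, t)) for p in cond.values())
        print(theta, uniform)
```

Each arrangement of t heads among n flips receives conditional probability 1/\binom{n}{t}, with no \theta left in the answer.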

Worked example 2 — Normal with known variance

Let X_i \sim N(\mu,\sigma^2) with \sigma^2 known. The joint density is

p(x;\mu) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2\right) = \exp\!\left(-\frac{1}{2\sigma^2}\Big(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\Big)\right).

The only place \mu meets the data is through \sum_i x_i (the \sum_i x_i^2 term carries no \mu and gets absorbed into h(x)). So T=\sum_i x_i — equivalently the sample mean \bar X — is sufficient for \mu. If instead both \mu and \sigma^2 are unknown, the pair \big(\sum_i x_i,\ \sum_i x_i^2\big) is jointly sufficient.
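
A quick numerical check of the known-variance case: two samples with the same sum should have log-likelihoods that differ by a constant that does not involve \mu. The two datasets and the helper names below are invented for illustration, with \sigma^2 = 1 assumed.

```python
from math import log, pi


def normal_loglik(xs, mu, sigma2):
    """Log-likelihood of i.i.d. N(mu, sigma2) observations xs."""
    n = len(xs)
    sq = sum((x - mu) ** 2 for x in xs)
    return -0.5 * n * log(2 * pi * sigma2) - sq / (2 * sigma2)


# Two hypothetical samples with the same size and the same sum (9.0),
# but different sums of squares (41.0 versus 27.0).
sample_a = [1.0, 2.0, 6.0]
sample_b = [3.0, 3.0, 3.0]


def loglik_gap(mu, sigma2=1.0):
    """Difference of the two log-likelihoods at a given mu."""
    return normal_loglik(sample_a, mu, sigma2) - normal_loglik(sample_b, mu, sigma2)


if __name__ == "__main__":
    for mu in (-2.0, 0.0, 3.0, 10.0):
        print(mu, loglik_gap(mu))
```

The gap stays at -(41 - 27)/2 = -7 up to floating-point rounding for every \mu. It comes entirely from the \sum_i x_i^2 term, which is the part absorbed into h(x), so the two likelihoods have the same shape in \mu and the same maximizer.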

Sufficiency, seen on the likelihood

Sufficiency shows up directly in the shape of the likelihood. For n=10 Bernoulli trials, the whole likelihood curve L(\theta)=\theta^{k}(1-\theta)^{10-k} is fixed once you know the count k=\sum_i x_i: two datasets with the same k give the identical curve, whatever order their heads fell in. As k varies, the peak moves with it and always sits exactly at the MLE k/n.
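
The same observation can be checked numerically. This sketch evaluates the likelihood on a grid for two made-up sequences that share k = 3 heads out of n = 10, and locates the grid point where the curve peaks. The names `bernoulli_likelihood`, `grid_argmax`, `x1`, and `x2` are illustrative only.

```python
from math import isclose


def bernoulli_likelihood(x, theta):
    """Likelihood computed from the raw sequence, one factor per trial."""
    value = 1.0
    for xi in x:
        value *= theta if xi == 1 else 1 - theta
    return value


def grid_argmax(x, grid):
    """Grid point at which the likelihood of sequence x is largest."""
    return max(grid, key=lambda theta: bernoulli_likelihood(x, theta))


# Two hypothetical datasets with the same count (3 heads in 10 flips)
# but the heads in different positions.
x1 = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
x2 = [0, 0, 1, 0, 0, 1, 0, 0, 1, 0]
grid = [i / 100 for i in range(101)]

if __name__ == "__main__":
    same_curve = all(
        isclose(bernoulli_likelihood(x1, t), bernoulli_likelihood(x2, t))
        for t in grid
    )
    print(same_curve, grid_argmax(x1, grid), grid_argmax(x2, grid))
```

The grid is coarse on purpose. It contains k/n = 0.3 exactly, so the grid maximizer coincides with the MLE here; for a count whose ratio is not on the grid, the maximizer would only be the nearest grid point.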

The full data X_1,\dots,X_n is trivially sufficient — it obviously loses no information about \theta. So sufficiency on its own is not the prize; the prize is compression: a sufficient statistic that is as small as possible. A minimal sufficient statistic is one that is a function of every other sufficient statistic — the coarsest lossless summary. For the Bernoulli, the count \sum_i x_i (a single number) is minimal sufficient, whereas the full sequence (a thousand numbers) is sufficient but wastefully so. Don't stop at "is it sufficient?" — ask "how much can I throw away and still be sufficient?"

Sufficiency matters because of the Rao–Blackwell theorem: if you take any unbiased estimator and condition it on a sufficient statistic, you get a new estimator that is still unbiased and has variance no larger, and strictly smaller unless the original estimator was already a function of the sufficient statistic. In slogan form: an estimator can only be improved by ignoring the parts of the data that a sufficient statistic has already thrown away. Sufficiency is thus not a curiosity but the engine of optimal estimation: it tells you the exact coordinates in which to do your work, and it guarantees that averaging out the irrelevant noise can only help. Much of classical estimation theory is the working-out of that single idea.
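
A small exact calculation illustrates the theorem in the Bernoulli model. It starts from a deliberately crude unbiased estimator, the first observation X_1, and conditions it on the count T by enumerating all sequences. The function name `rao_blackwell_demo` is made up for this sketch, and \theta is assumed to lie strictly between 0 and 1.

```python
from itertools import product


def pmf(x, theta):
    """Joint mass function of independent Bernoulli(theta) outcomes x."""
    k = sum(x)
    return theta**k * (1 - theta) ** (len(x) - k)


def rao_blackwell_demo(n, theta):
    """Exact mean and variance of X_1 and of E[X_1 | T] by enumeration."""
    seqs = list(product((0, 1), repeat=n))

    # Conditional expectation E[X_1 | T = t] for each possible count t.
    cond = {}
    for t in range(n + 1):
        group = [x for x in seqs if sum(x) == t]
        prob_t = sum(pmf(x, theta) for x in group)
        cond[t] = sum(x[0] * pmf(x, theta) for x in group) / prob_t

    def moments(estimator):
        mean = sum(estimator(x) * pmf(x, theta) for x in seqs)
        var = sum((estimator(x) - mean) ** 2 * pmf(x, theta) for x in seqs)
        return mean, var

    crude = moments(lambda x: x[0])             # uses only the first flip
    improved = moments(lambda x: cond[sum(x)])  # conditioned on the count
    return cond, crude, improved


if __name__ == "__main__":
    cond, crude, improved = rao_blackwell_demo(n=5, theta=0.3)
    print(cond)
    print(crude, improved)
```

In this model the conditioned estimator works out to t/n, the sample proportion. Both estimators have mean \theta, while the variance drops from \theta(1-\theta) to \theta(1-\theta)/n.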