Sufficiency and the Factorization Theorem
You flip a coin 1000 times to estimate its bias. Do you need to remember the exact
sequence — HTHHT… all thousand symbols — or is the plain count of heads enough? Your
intuition says the count is enough, and it is exactly right. Once you know 632 heads came up, the
particular order in which they fell tells you nothing more about the bias
\theta.
That intuition has a precise name: the count is a sufficient statistic. Sufficiency
is the theory of lossless compression of data with respect to a parameter — how to throw
away mountains of raw data while keeping every scrap of information about
\theta. And there is a beautiful mechanical test for it, the
factorization theorem, that reads sufficiency straight off the likelihood.
What "sufficient" means
A statistic T = T(X_1,\dots,X_n) is sufficient for
\theta if the conditional distribution of the data given
T does not depend on \theta:
P\big((X_1,\dots,X_n) = x \mid T = t;\ \theta\big) \ \text{is free of } \theta.
Read it as a two-stage story. To simulate the data you could first draw the summary
T from a distribution that depends on \theta,
and then fill in the rest of the detail by a mechanism that ignores
\theta entirely. All the information about the parameter has been squeezed
into T; the leftover randomness is parameter-free noise. Knowing the full
dataset then buys you nothing over knowing T alone.
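The definition can be checked by brute force on a small case. The sketch below is illustrative only: the helper name `conditional_given_count`, the choice n = 4, and the two values of \theta are assumptions made for the example. It enumerates every 0/1 sequence of Bernoulli trials, conditions on the count, and compares the resulting conditional distributions under two different biases, using exact fractions so the comparison is not blurred by rounding.

```python
from fractions import Fraction
from itertools import product

def conditional_given_count(n, theta):
    """Return {x: P(X = x | T = sum(x))} for n Bernoulli(theta) trials."""
    # Joint probability of every 0/1 sequence of length n.
    joint = {
        x: theta ** sum(x) * (1 - theta) ** (n - sum(x))
        for x in product((0, 1), repeat=n)
    }
    # Marginal probability of each value of the count T.
    p_count = {}
    for x, p in joint.items():
        p_count[sum(x)] = p_count.get(sum(x), 0) + p
    # Conditional probability of each sequence given its own count.
    return {x: p / p_count[sum(x)] for x, p in joint.items()}

cond_a = conditional_given_count(4, Fraction(1, 5))
cond_b = conditional_given_count(4, Fraction(7, 10))
print(cond_a == cond_b)        # True: the conditional law ignores theta
print(cond_a[(1, 0, 1, 0)])    # 1/6: one of the 6 sequences with two heads
```

Given the count, every sequence with that count is equally likely, whatever the bias. That uniform "fill in the order" step is the parameter-free second stage of the story above.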
The Fisher–Neyman factorization theorem
Checking a conditional distribution by hand is painful. The factorization theorem replaces it with a
one-line pattern-match on the joint density (or mass function)
p(x;\theta).
T(X) is sufficient for \theta if and only if the joint density factors as
p(x;\theta) = g\big(T(x),\,\theta\big)\;h(x),
where g depends on the data only through T(x) (and on \theta), and h(x) does not involve \theta at all.
The recipe: write down the likelihood, and try to peel it into a piece that touches
\theta only via some summary T(x), times a
leftover piece h(x) that \theta never sees. If
you can, that T is sufficient. Because the likelihood then
depends on the data only through T (up to the factor h(x), which is constant
in \theta), every likelihood-based method, the MLE included, can be computed
from T alone.
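For discrete data, the "if" direction of the theorem can be verified in one line. This is a sketch rather than a full proof: it assumes T(x) = t and g(t,\theta) > 0, and the dummy variable y, ranging over all datasets with the same summary, is notation introduced here.

```latex
P\big(X = x \mid T = t;\ \theta\big)
  = \frac{p(x;\theta)}{\sum_{y:\,T(y)=t} p(y;\theta)}
  = \frac{g(t,\theta)\,h(x)}{g(t,\theta)\sum_{y:\,T(y)=t} h(y)}
  = \frac{h(x)}{\sum_{y:\,T(y)=t} h(y)}
```

The factor g(t,\theta) cancels, so the conditional distribution is built from h alone and is free of \theta, which is exactly the definition of sufficiency.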
Worked example 1 — Bernoulli: the count of successes
For n independent \text{Bernoulli}(\theta)
trials with outcomes x_i\in\{0,1\}, the joint mass function is
p(x;\theta) = \prod_{i=1}^n \theta^{x_i}(1-\theta)^{1-x_i} = \theta^{\sum_i x_i}\,(1-\theta)^{\,n-\sum_i x_i}.
Everything on the right depends on the data only through
T(x)=\sum_i x_i, the number of successes. Take
g(T,\theta)=\theta^{T}(1-\theta)^{n-T} and
h(x)=1: factorization holds, so the count of successes is
sufficient for \theta. The order of the flips is irrelevant —
just as intuition promised.
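The algebra can be sanity-checked numerically. The following is a minimal sketch with made-up data; the function names `joint_pmf` and `g` and the particular values of \theta are assumptions chosen for illustration. It multiplies out the individual Bernoulli mass functions and compares the result with the factored form.

```python
import math

def joint_pmf(x, theta):
    """Product of the individual Bernoulli(theta) mass functions."""
    return math.prod(theta ** xi * (1 - theta) ** (1 - xi) for xi in x)

def g(t, n, theta):
    """The factor that sees the data only through the count t."""
    return theta ** t * (1 - theta) ** (n - t)

x = [1, 0, 0, 1, 1, 0, 1, 1]     # hypothetical data: 5 successes in 8 trials
thetas = [0.1, 0.5, 0.632, 0.9]

# h(x) = 1, so the joint mass function should equal g(T(x), theta) on its own.
matches = [math.isclose(joint_pmf(x, th), g(sum(x), len(x), th)) for th in thetas]
print(matches)   # [True, True, True, True]
```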
Worked example 2 — Normal with known variance
Let X_i \sim N(\mu,\sigma^2) with \sigma^2
known. The joint density is
p(x;\mu) \propto \exp\!\left(-\frac{1}{2\sigma^2}\sum_i (x_i-\mu)^2\right) = \exp\!\left(-\frac{1}{2\sigma^2}\Big(\sum_i x_i^2 - 2\mu\sum_i x_i + n\mu^2\Big)\right).
The only place \mu meets the data is through
\sum_i x_i (the \sum_i x_i^2 term carries no
\mu and gets absorbed into h(x)). So
T=\sum_i x_i — equivalently the sample mean
\bar X — is sufficient for \mu. If instead
both \mu and \sigma^2 are unknown, the
pair \big(\sum_i x_i,\ \sum_i x_i^2\big) is jointly sufficient.
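To make the two-unknowns case concrete, here is a hedged sketch showing that the normal log-likelihood can be evaluated from n, \sum_i x_i and \sum_i x_i^2 alone. The data values, the parameter pairs and the function names `loglik_full` and `loglik_from_summary` are illustrative assumptions.

```python
import math

def loglik_full(xs, mu, sigma2):
    """Normal log-likelihood summed term by term over the raw data."""
    return sum(
        -0.5 * math.log(2 * math.pi * sigma2) - (x - mu) ** 2 / (2 * sigma2)
        for x in xs
    )

def loglik_from_summary(n, s1, s2, mu, sigma2):
    """Same log-likelihood using only n, s1 = sum x_i and s2 = sum x_i^2."""
    quad = s2 - 2 * mu * s1 + n * mu ** 2   # expansion of sum (x_i - mu)^2
    return -0.5 * n * math.log(2 * math.pi * sigma2) - quad / (2 * sigma2)

xs = [2.1, -0.4, 1.3, 0.7, 3.2, 1.9]        # made-up observations
n, s1, s2 = len(xs), sum(xs), sum(x * x for x in xs)

params = [(0.0, 1.0), (1.5, 0.5), (-2.0, 4.0)]   # (mu, sigma^2) pairs to try
agree = [
    math.isclose(loglik_full(xs, mu, s), loglik_from_summary(n, s1, s2, mu, s))
    for mu, s in params
]
print(agree)
```

Once the two sums are stored, the raw observations are never needed again to evaluate the likelihood at any (\mu, \sigma^2).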
Sufficiency, seen on the likelihood
Sufficiency is visible directly on the likelihood curve. For n=10 Bernoulli trials, the whole
likelihood curve L(\theta)=\theta^{k}(1-\theta)^{10-k} is fixed once you
know the count k=\sum_i x_i: two datasets with the same
k give the identical curve, whatever order their heads fell in.
For every k, the peak of the curve sits exactly at the MLE
k/n.
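A small grid computation reproduces this picture without a plot. The sketch below is illustrative: the two datasets, the grid of 101 values of \theta, and the helper name `likelihood_from_raw` are assumptions. It deliberately computes each curve from the raw sequence rather than from k, so that the agreement between the curves is a result and not built in.

```python
import math
from fractions import Fraction

def likelihood_from_raw(x, theta):
    """Likelihood computed flip by flip from the raw sequence."""
    return math.prod(theta if xi == 1 else 1 - theta for xi in x)

grid = [Fraction(i, 100) for i in range(101)]   # theta = 0.00, 0.01, ..., 1.00
data_a = [1, 1, 1, 0, 0, 0, 0, 1, 0, 0]         # k = 4 heads
data_b = [0, 0, 1, 0, 1, 0, 1, 0, 0, 1]         # k = 4 heads, different order

curve_a = [likelihood_from_raw(data_a, th) for th in grid]
curve_b = [likelihood_from_raw(data_b, th) for th in grid]

# Location of the highest point of the curve on the grid.
peak = grid[curve_a.index(max(curve_a))]

print(curve_a == curve_b)   # True: same count, same curve
print(peak)                 # 2/5, i.e. k/n = 4/10
```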
The full data X_1,\dots,X_n is trivially sufficient — it obviously loses no
information about \theta. So sufficiency on its own is not the prize; the
prize is compression: a sufficient statistic that is as small as possible. A
minimal sufficient statistic is one that is a function of every other sufficient
statistic — the coarsest lossless summary. For the Bernoulli, the count
\sum_i x_i (a single number) is minimal sufficient, whereas the full
sequence (a thousand numbers) is sufficient but wastefully so. Don't stop at "is it sufficient?" —
ask "how much can I throw away and still be sufficient?"
Sufficiency matters because of the Rao–Blackwell theorem: if you have any unbiased estimator and you
condition it on a sufficient statistic, you get a new estimator that is still unbiased but
has variance no larger, and usually strictly smaller. In slogan form: any unbiased estimator
can be improved, or at least never made worse, by averaging over the parts of the data that a sufficient statistic has already thrown
away. Sufficiency is thus not a curiosity but the engine of optimal estimation — it tells you the
exact coordinates in which to do your work, and it guarantees that averaging out the irrelevant noise
can only help. Much of classical estimation theory is the working-out of that single idea.
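A tiny exact calculation shows Rao–Blackwell at work. This is a sketch under stated assumptions: n = 5 Bernoulli trials, \theta = 3/10, and the deliberately crude unbiased estimator "use the first flip only"; the function names are hypothetical. Conditioning that estimator on the count T is done by enumerating all 32 sequences.

```python
from fractions import Fraction
from itertools import product

n, theta = 5, Fraction(3, 10)
seqs = list(product((0, 1), repeat=n))
prob = {x: theta ** sum(x) * (1 - theta) ** (n - sum(x)) for x in seqs}

def cond_mean_first_flip(t):
    """E[X_1 | T = t], computed by brute-force enumeration."""
    group = [x for x in seqs if sum(x) == t]
    total = sum(prob[x] for x in group)
    return sum(prob[x] * x[0] for x in group) / total

def mean_and_variance(estimator):
    """Exact mean and variance of an estimator under the joint distribution."""
    mean = sum(prob[x] * estimator(x) for x in seqs)
    var = sum(prob[x] * (estimator(x) - mean) ** 2 for x in seqs)
    return mean, var

crude = mean_and_variance(lambda x: x[0])
improved = mean_and_variance(lambda x: cond_mean_first_flip(sum(x)))
print(crude)      # (Fraction(3, 10), Fraction(21, 100))
print(improved)   # (Fraction(3, 10), Fraction(21, 500))
```

Both estimators are unbiased, but the conditioned one, which works out to the sample proportion T/n, has one fifth of the variance in this example.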