Exponential Families
The Bernoulli, the Poisson, the normal, the exponential, the gamma, the beta, the binomial: a working statistician's whole zoo of
distributions looks like a pile of unrelated formulas. It isn't. Almost all of them are secretly the
same shape, wearing different costumes. Peel back the costume and you find one template, the
exponential family.
This is one of the great unifications in statistics. Recognising a model as an exponential family
hands you, for free: an obvious sufficient statistic, moments computed by
differentiation, a clean MLE, a matching conjugate prior, and, as we'll see two pages on, the entire theory of
generalized linear models. Learn the template once and you have learned a dozen distributions at once.
The canonical form
A one-parameter exponential family is any model whose density or mass function can be
written in the natural (canonical) form
p(x;\eta) = h(x)\,\exp\!\big(\eta\,T(x) - A(\eta)\big).
Four ingredients, each with a job:
- \eta — the natural parameter (a possibly re-parametrised version of the usual one);
- T(x) — the natural sufficient statistic;
- A(\eta) — the log-partition (cumulant) function, the term that makes the density integrate to 1;
- h(x) — the base measure, carrying no \eta.
Compare this with the factorization theorem: the family is built in factorised form, with
g(T,\eta)=\exp(\eta T - A(\eta)) and h(x) the
leftover. So T(x) is sufficient by construction, and for a sample the sum
\sum_i T(x_i) is sufficient for \eta.
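As a minimal sketch of the template in code, the four ingredients can be passed around as plain functions. The helper names `log_density` and `log_likelihood` are illustrative, not from any library, and the Bernoulli ingredients used here are the ones worked out in Worked example 1 below. The sketch assumes an i.i.d. sample and shows that two samples with the same value of \sum_i T(x_i) have the same likelihood at a given \eta.

```python
import math

def log_density(x, eta, T, A, log_h):
    """Log of h(x) * exp(eta * T(x) - A(eta)) for a single observation."""
    return log_h(x) + eta * T(x) - A(eta)

def log_likelihood(xs, eta, T, A, log_h):
    """Log-likelihood of an i.i.d. sample: sum of per-observation log densities."""
    return sum(log_density(x, eta, T, A, log_h) for x in xs)

# Bernoulli ingredients in natural form: T(x) = x, A(eta) = log(1 + e^eta), h(x) = 1.
bern_T = lambda x: x
bern_A = lambda eta: math.log1p(math.exp(eta))
bern_log_h = lambda x: 0.0

sample_a = [1, 0, 1, 1, 0]
sample_b = [0, 1, 1, 0, 1]  # different arrangement, same sum of T(x_i)

eta = 0.7
ll_a = log_likelihood(sample_a, eta, bern_T, bern_A, bern_log_h)
ll_b = log_likelihood(sample_b, eta, bern_T, bern_A, bern_log_h)
print(ll_a, ll_b)  # the two values agree up to floating-point rounding
```

The likelihood depends on the data only through the sum of T(x_i) and the sample size, which is the factorization theorem at work.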
The log-partition generates the moments
The function A(\eta) is not just bookkeeping — its derivatives are
the cumulants of the sufficient statistic. Differentiate once and twice:
A'(\eta) = \mathbb{E}[T(X)], \qquad A''(\eta) = \operatorname{Var}(T(X)).
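A short derivation shows where these identities come from. It assumes differentiation under the integral sign is allowed, which holds in the interior of the set of \eta where A(\eta) is finite; for a mass function, replace the integrals by sums. Normalisation fixes A:
A(\eta) = \log \int h(x)\,e^{\eta T(x)}\,dx.
Differentiating once,
A'(\eta) = \frac{\int T(x)\,h(x)\,e^{\eta T(x)}\,dx}{\int h(x)\,e^{\eta T(x)}\,dx} = \int T(x)\,p(x;\eta)\,dx = \mathbb{E}[T(X)].
Differentiating \int T(x)\,p(x;\eta)\,dx again, and using \partial_\eta\, p(x;\eta) = \big(T(x) - A'(\eta)\big)\,p(x;\eta),
A''(\eta) = \int T(x)\,\big(T(x) - A'(\eta)\big)\,p(x;\eta)\,dx = \mathbb{E}[T(X)^2] - \mathbb{E}[T(X)]^2 = \operatorname{Var}(T(X)).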
Two consequences drop out immediately. First, A''(\eta)=\operatorname{Var}(T)\ge 0,
so A is convex, which is exactly why the log-likelihood \eta\sum_i T(x_i) - nA(\eta) is
concave in \eta: the MLE, when it exists, is a global maximum, and it is unique whenever \operatorname{Var}(T)>0. Second, you can read the mean and
variance of T(X) straight off A by calculus, with no integrals at all. The
map \eta \mapsto \mathbb{E}[T] = A'(\eta) is called the
mean function, and it will reappear as the inverse of a GLM link.
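The moment identities can be checked numerically. The sketch below uses a toy family chosen purely for illustration (h(x)=1 and T(x)=x on the finite support {0, 1, 2, 3}, which is an assumption, not a named distribution from the text). It compares central finite differences of A with the mean and variance computed directly from the mass function.

```python
import math

SUPPORT = [0, 1, 2, 3]  # toy finite support, chosen only for illustration

def A(eta):
    """Log-partition for h(x) = 1, T(x) = x on SUPPORT: log of the normalising sum."""
    return math.log(sum(math.exp(eta * x) for x in SUPPORT))

def pmf(x, eta):
    """Mass function in canonical form."""
    return math.exp(eta * x - A(eta))

def moments(eta):
    """Mean and variance of T(X) = X computed directly from the pmf."""
    mean = sum(x * pmf(x, eta) for x in SUPPORT)
    var = sum((x - mean) ** 2 * pmf(x, eta) for x in SUPPORT)
    return mean, var

def derivatives(eta, step=1e-4):
    """Central finite-difference approximations to A'(eta) and A''(eta)."""
    d1 = (A(eta + step) - A(eta - step)) / (2 * step)
    d2 = (A(eta + step) - 2 * A(eta) + A(eta - step)) / step ** 2
    return d1, d2

eta = 0.3
mean, var = moments(eta)
d1, d2 = derivatives(eta)
print(mean, d1)  # A'(eta) against E[T]
print(var, d2)   # A''(eta) against Var(T)
```

The agreement is only approximate because finite differences carry truncation and rounding error; an exact check would differentiate A symbolically.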
Worked example 1 — Bernoulli, and the birth of the logit
The Bernoulli mass function p(x;\theta)=\theta^x(1-\theta)^{1-x} hides an
exponential family. Take the log and re-exponentiate:
p(x;\theta) = \exp\!\Big(x\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Big).
Match the template: the natural parameter is the log-odds
\eta = \log\frac{\theta}{1-\theta} (the logit),
T(x)=x, h(x)=1, and the log-partition is
A(\eta)=-\log(1-\theta)=\log(1+e^{\eta}). Check the moment identity:
A'(\eta) = \frac{e^{\eta}}{1+e^{\eta}} = \theta = \mathbb{E}[X].\ \checkmark
That A'(\eta) is the logistic sigmoid — the same S-curve
that turns a linear predictor into a probability in logistic regression. It was here in the Bernoulli
all along.
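A quick way to convince yourself of the matching is to evaluate both forms side by side. In this sketch the function names (`logit`, `sigmoid`, `bernoulli_pmf`, `bernoulli_canonical`) are illustrative choices, and \theta=0.3 is an arbitrary test value.

```python
import math

def logit(theta):
    """Natural parameter: the log-odds of theta."""
    return math.log(theta / (1.0 - theta))

def sigmoid(eta):
    """A'(eta) for the Bernoulli, mapping log-odds back to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))

def bernoulli_pmf(x, theta):
    """Usual parametrisation: theta^x (1 - theta)^(1 - x)."""
    return theta ** x * (1.0 - theta) ** (1 - x)

def bernoulli_canonical(x, eta):
    """Canonical form with h(x) = 1, T(x) = x, A(eta) = log(1 + e^eta)."""
    return math.exp(eta * x - math.log1p(math.exp(eta)))

theta = 0.3
eta = logit(theta)
for x in (0, 1):
    print(x, bernoulli_pmf(x, theta), bernoulli_canonical(x, eta))
print(sigmoid(eta))  # recovers theta up to rounding
```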
Worked example 2 — Poisson
For p(x;\lambda)=\dfrac{\lambda^x e^{-\lambda}}{x!}, write it as
p(x;\lambda) = \frac{1}{x!}\,\exp\!\big(x\log\lambda - \lambda\big).
So \eta=\log\lambda, T(x)=x,
A(\eta)=e^{\eta}=\lambda, and h(x)=1/x!. The
moment check is instant: A'(\eta)=e^{\eta}=\lambda=\mathbb{E}[X] and
A''(\eta)=e^{\eta}=\lambda=\operatorname{Var}(X) — recovering the famous
Poisson fact that mean equals variance with a single derivative.
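The same side-by-side check works for the Poisson. This is a sketch under stated assumptions: the function names are illustrative, \lambda=2.5 is an arbitrary test value, and the infinite sums for the mean and variance are truncated at 60 terms, where the remaining tail mass is negligible for this \lambda.

```python
import math

def poisson_pmf(x, lam):
    """Usual parametrisation: lam^x e^(-lam) / x!."""
    return lam ** x * math.exp(-lam) / math.factorial(x)

def poisson_canonical(x, eta):
    """Canonical form with h(x) = 1/x!, T(x) = x, A(eta) = e^eta."""
    return math.exp(eta * x - math.exp(eta)) / math.factorial(x)

lam = 2.5
eta = math.log(lam)

# Mean and variance by direct summation over a truncated range.
xs = range(60)
mean = sum(x * poisson_pmf(x, lam) for x in xs)
var = sum((x - mean) ** 2 * poisson_pmf(x, lam) for x in xs)

# A'(eta) and A''(eta) are both e^eta, so both moments should match lam.
print(mean, var, math.exp(eta))
```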
From natural parameter to mean
For the Bernoulli, the mean function A'(\eta)=e^{\eta}/(1+e^{\eta}) is the
logistic sigmoid. Vary the natural parameter \eta (the log-odds) and read
off the mean \theta=\mathbb{E}[X] it maps to: as
\eta\to+\infty the mean saturates at 1, as
\eta\to-\infty it falls to 0, and \eta=0 gives a
fair \theta=\tfrac12. This one curve is the inverse of the logit link: it carries a GLM's linear
predictor to a probability.
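To trace the curve without a plot, one can tabulate the mean function on a few values of \eta. The sketch below is one possible implementation; the name `mean_function` is illustrative, and the two-branch form is a common way to avoid floating-point overflow for large |\eta|, not something the derivation requires.

```python
import math

def mean_function(eta):
    """A'(eta) for the Bernoulli: the logistic sigmoid, written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)  # safe: eta < 0, so z lies in [0, 1)
    return z / (1.0 + z)

for eta in (-6.0, -2.0, 0.0, 2.0, 6.0):
    print(f"eta = {eta:+.1f}  ->  mean = {mean_function(eta):.4f}")
```

The printed means increase with \eta, pass through one half at \eta=0, and approach 0 and 1 at the two ends of the grid.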
- Canonical form p(x;\eta)=h(x)\exp(\eta T(x)-A(\eta)); T(x) is the natural sufficient statistic.
- The log-partition generates moments: A'(\eta)=\mathbb{E}[T], A''(\eta)=\operatorname{Var}(T), so A is convex.
- Bernoulli, Poisson, normal, gamma, beta and more all fit — a single template covering the everyday zoo.
The template h(x)\exp(\eta T(x)-A(\eta)) requires the support
— the set of x where the density is positive — to be the same for every
parameter value. The moment you let the range of the data depend on the parameter, the model
falls out of the exponential family. The classic offender is the uniform
U(0,\theta): its density is 1/\theta on
[0,\theta] and 0 beyond, so the support grows with
\theta. No amount of algebra rewrites that indicator into the canonical
form, and indeed U(0,\theta) behaves quite differently (its sufficient
statistic is the maximum \max_i x_i, and the MLE is biased). Exponential
family means fixed support — always check it first.
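The moving support and the biased MLE can both be seen in a few lines. This is an illustrative sketch: the function names, the seed, the sample size n=5 and the number of replications are arbitrary choices, and the simulation only approximates the expectation. The exact mean of the maximum of n independent U(0,\theta) draws is n\theta/(n+1), which is used as the reference value.

```python
import random

def uniform_density(x, theta):
    """Density of U(0, theta): 1/theta on [0, theta], 0 elsewhere."""
    return 1.0 / theta if 0.0 <= x <= theta else 0.0

# The same point is possible under one parameter value and impossible under another.
x = 1.5
inside = uniform_density(x, theta=2.0)   # positive: x is in [0, 2]
outside = uniform_density(x, theta=1.0)  # zero: x lies beyond [0, 1]

def average_mle(theta, n, reps, seed=0):
    """Average of the MLE max(x_i) over simulated samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        total += max(rng.uniform(0.0, theta) for _ in range(n))
    return total / reps

theta, n = 2.0, 5
avg = average_mle(theta, n, reps=20000)
expected = n * theta / (n + 1)  # exact mean of the maximum of n uniforms
print(inside, outside)
print(avg, expected, theta)  # the average MLE sits below theta
```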
In Bayesian statistics you multiply a prior by the likelihood to get a posterior, and life is easy
when the posterior stays in the same shape as the prior — that is a conjugate
prior. Exponential families guarantee one: because the likelihood, as a function of \eta, is proportional to
\exp(\eta\sum_i T(x_i) - nA(\eta)), a prior of the same exponential shape in
\eta, \pi(\eta)\propto\exp(\eta\tau - \nu A(\eta)), updates simply by adding the data's sufficient statistic to the
prior's parameters: \tau\to\tau+\sum_i T(x_i) and \nu\to\nu+n. That is why the Beta is the natural prior for a Bernoulli's
\theta (Beta → Beta), and the Gamma for a Poisson's
\lambda. Conjugacy isn't a lucky coincidence for a few textbook pairs — it
is a structural gift of the exponential family, and it turns Bayesian updating into arithmetic.
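For the two pairs named above, the update really is just addition. The sketch below assumes the common parametrisations Beta(a, b) and Gamma(shape, rate); the function names, prior values and data are illustrative.

```python
def beta_bernoulli_update(a, b, xs):
    """Beta(a, b) prior on theta plus Bernoulli data gives Beta posterior parameters."""
    successes = sum(xs)
    return a + successes, b + len(xs) - successes

def gamma_poisson_update(shape, rate, xs):
    """Gamma(shape, rate) prior on lambda plus Poisson counts gives Gamma posterior parameters."""
    return shape + sum(xs), rate + len(xs)

coin_flips = [1, 0, 1, 1, 0, 1]
a_post, b_post = beta_bernoulli_update(2.0, 2.0, coin_flips)

counts = [3, 0, 2, 4]
shape_post, rate_post = gamma_poisson_update(1.0, 1.0, counts)

print(a_post, b_post)
print(shape_post, rate_post)
```

In both cases the data enter only through the sufficient statistic \sum_i x_i and the sample size, matching the structure of the exponential-family likelihood.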