Exponential Families

The Bernoulli, the Poisson, the normal, the exponential, the gamma, the beta, the binomial — a working statistician's whole zoo of distributions looks like a pile of unrelated formulas. It isn't. Almost all of them are secretly the same shape, wearing different costumes. Peel back the costume and you find one template, the exponential family.

This is one of the great unifications in statistics. Recognising a model as an exponential family hands you, for free: an obvious sufficient statistic, moments computed by differentiation, a clean MLE, a matching conjugate prior, and — as we'll see two pages on — the entire theory of generalized linear models. Learn the template once and you have learned a dozen distributions at once.

The canonical form

A one-parameter exponential family is any model whose density or mass function can be written in the natural (canonical) form

p(x;\eta) = h(x)\,\exp\!\big(\eta\,T(x) - A(\eta)\big).

Four ingredients, each with a job:

\eta — the natural parameter (a possibly re-parametrised version of the usual one);
T(x) — the natural sufficient statistic;
A(\eta) — the log-partition (cumulant) function, the term that makes the density integrate to 1;
h(x) — the base measure, carrying no \eta.

Compare this with the factorization theorem: the density is already in factorised form, with g(T,\eta)=\exp(\eta T - A(\eta)) and h(x) the leftover factor. So T(x) is sufficient by construction. For an i.i.d. sample x_1,\dots,x_n the joint density is \prod_i h(x_i)\,\exp(\eta\sum_i T(x_i) - nA(\eta)), so the sum \sum_i T(x_i) is sufficient for \eta.
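A minimal sketch of what sufficiency means in practice, borrowing the Poisson ingredients h(x)=1/x!, T(x)=x, A(\eta)=e^{\eta} that are worked out later in this section. The function and variable names are illustrative choices, not part of any library. Two samples with the same value of \sum_i T(x_i) should have a likelihood ratio that does not depend on \eta:

```python
import math

def log_likelihood(sample, eta, h, T, A):
    """Log of prod_i h(x_i) * exp(eta * T(x_i) - A(eta)) for an i.i.d. sample."""
    return sum(math.log(h(x)) + eta * T(x) - A(eta) for x in sample)

# Poisson ingredients in canonical form (derived in worked example 2).
def poisson_h(x):
    return 1.0 / math.factorial(x)

def poisson_T(x):
    return x

def poisson_A(eta):
    return math.exp(eta)

# Two different samples of the same size with the same sufficient statistic (sum = 6).
sample_a = [1, 2, 3]
sample_b = [0, 0, 6]

ratios = []
for eta in (-1.0, 0.0, 2.0):
    ll_a = log_likelihood(sample_a, eta, poisson_h, poisson_T, poisson_A)
    ll_b = log_likelihood(sample_b, eta, poisson_h, poisson_T, poisson_A)
    ratios.append(math.exp(ll_a - ll_b))
    print(f"eta={eta:+.1f}  likelihood ratio={ratios[-1]:.6f}")
```

The ratio comes out the same at every \eta because the \eta-dependent factor is identical for both samples; only the h(x_i) terms differ, and they carry no \eta.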

The log-partition generates the moments

The function A(\eta) is not just bookkeeping — its derivatives are the cumulants of the sufficient statistic. Differentiate once and twice:

A'(\eta) = \mathbb{E}[T(X)], \qquad A''(\eta) = \operatorname{Var}(T(X)).

Two consequences drop out immediately. First, A''(\eta)=\operatorname{Var}(T)\ge 0, so A is convex. The log-likelihood of a sample is \eta\sum_i T(x_i) - nA(\eta) plus a term free of \eta, so it is concave and any stationary point is a global maximum; when \operatorname{Var}(T)>0 the concavity is strict, and the MLE, when it exists, is unique. Second, you can read a distribution's mean and variance straight off A by calculus, with no integrals at all. The map \eta \mapsto \mathbb{E}[T] = A'(\eta) is called the mean function, and it will reappear as the inverse of the canonical GLM link.
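A sketch of why the two derivative identities hold, assuming \eta lies in the interior of the set where the integral below is finite, so that differentiating under the integral sign is allowed (for a mass function, read each integral as a sum). Requiring the density to integrate to 1 forces

A(\eta) = \log \int h(x)\,e^{\eta T(x)}\,dx.

Differentiating once,

A'(\eta) = \frac{\int T(x)\,h(x)\,e^{\eta T(x)}\,dx}{\int h(x)\,e^{\eta T(x)}\,dx} = \int T(x)\,p(x;\eta)\,dx = \mathbb{E}[T(X)],

and differentiating the same quotient again,

A''(\eta) = \int T(x)^2\,p(x;\eta)\,dx - \Big(\int T(x)\,p(x;\eta)\,dx\Big)^{2} = \mathbb{E}[T(X)^2] - \mathbb{E}[T(X)]^2 = \operatorname{Var}(T(X)).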

Worked example 1 — Bernoulli, and the birth of the logit

The Bernoulli mass function p(x;\theta)=\theta^x(1-\theta)^{1-x} hides an exponential family. Take the log and re-exponentiate:

p(x;\theta) = \exp\!\Big(x\log\tfrac{\theta}{1-\theta} + \log(1-\theta)\Big).

Match the template: the natural parameter is the log-odds \eta = \log\frac{\theta}{1-\theta} (the logit), T(x)=x, and h(x)=1. The remaining term gives -A(\eta)=\log(1-\theta); since 1-\theta = 1/(1+e^{\eta}), the log-partition is A(\eta)=\log(1+e^{\eta}). Check the moment identity:

A'(\eta) = \frac{e^{\eta}}{1+e^{\eta}} = \theta = \mathbb{E}[X].\ \checkmark

That A'(\eta) is the logistic sigmoid — the same S-curve that turns a linear predictor into a probability in logistic regression. It was here in the Bernoulli all along.
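The moment identities for the Bernoulli can also be checked numerically. The sketch below differentiates A(\eta)=\log(1+e^{\eta}) by finite differences and compares the results with \theta and with the standard Bernoulli variance \theta(1-\theta). The helper names, the step size and the choice \theta=0.3 are illustrative assumptions:

```python
import math

def A(eta):
    """Bernoulli log-partition: log(1 + e^eta)."""
    return math.log1p(math.exp(eta))

def first_derivative(f, x, step=1e-4):
    """Central finite-difference approximation to f'(x)."""
    return (f(x + step) - f(x - step)) / (2.0 * step)

def second_derivative(f, x, step=1e-4):
    """Central finite-difference approximation to f''(x)."""
    return (f(x + step) - 2.0 * f(x) + f(x - step)) / step**2

theta = 0.3                              # illustrative success probability
eta = math.log(theta / (1.0 - theta))    # natural parameter: the log-odds

mean_from_A = first_derivative(A, eta)
var_from_A = second_derivative(A, eta)

print(f"A'(eta)  = {mean_from_A:.6f}   theta           = {theta:.6f}")
print(f"A''(eta) = {var_from_A:.6f}   theta*(1-theta) = {theta * (1 - theta):.6f}")
```

Finite differences are only approximate, so the two columns agree to several decimal places rather than exactly.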

Worked example 2 — Poisson

For p(x;\lambda)=\dfrac{\lambda^x e^{-\lambda}}{x!}, write it as

p(x;\lambda) = \frac{1}{x!}\,\exp\!\big(x\log\lambda - \lambda\big).

So \eta=\log\lambda, T(x)=x, A(\eta)=e^{\eta}=\lambda, and h(x)=1/x!. The moment check is instant: A'(\eta)=e^{\eta}=\lambda=\mathbb{E}[X] and A''(\eta)=e^{\eta}=\lambda=\operatorname{Var}(X) — recovering the famous Poisson fact that mean equals variance with a single derivative.
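As a cross-check that needs no calculus, the sketch below evaluates the Poisson mass function directly in canonical form and sums it to get the mean and variance, which should both match A'(\eta)=A''(\eta)=e^{\eta}. The truncation point of 200 terms and the value \lambda=4 are illustrative assumptions; the tail beyond the cut-off is negligible for this \lambda.

```python
import math

def poisson_pmf_canonical(x, eta):
    """h(x) * exp(eta*T(x) - A(eta)) with h(x)=1/x!, T(x)=x, A(eta)=e^eta.

    log h(x) = -lgamma(x + 1) is used so large x does not overflow.
    """
    return math.exp(eta * x - math.exp(eta) - math.lgamma(x + 1))

lam = 4.0                 # illustrative rate
eta = math.log(lam)       # natural parameter

xs = range(200)           # truncated support; the remaining tail mass is negligible here
probs = [poisson_pmf_canonical(x, eta) for x in xs]

total = sum(probs)
mean = sum(x * p for x, p in zip(xs, probs))
variance = sum((x - mean) ** 2 * p for x, p in zip(xs, probs))

print(f"total mass = {total:.10f}")
print(f"mean       = {mean:.10f}   e^eta = {math.exp(eta):.10f}")
print(f"variance   = {variance:.10f}   e^eta = {math.exp(eta):.10f}")
```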

From natural parameter to mean

For the Bernoulli, the mean function A'(\eta)=e^{\eta}/(1+e^{\eta}) is the logistic sigmoid. As the natural parameter \eta (the log-odds) varies, the mean \theta=\mathbb{E}[X] it maps to moves along an S-curve: as \eta\to+\infty the mean saturates at 1, as \eta\to-\infty it falls to 0, and \eta=0 gives a fair \theta=\tfrac12. This one curve is what turns a GLM's linear predictor into a probability: it is the inverse of the logit link.
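The mapping between log-odds and mean can be tabulated in a few lines. This is a small sketch with hypothetical helper names `sigmoid` and `logit`; the branch on the sign of \eta is a common trick to avoid overflow in the exponential for large |\eta| and is an implementation choice, not something the text above prescribes.

```python
import math

def sigmoid(eta):
    """Mean function A'(eta) = e^eta / (1 + e^eta), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def logit(theta):
    """Inverse map: log-odds of a probability strictly between 0 and 1."""
    return math.log(theta / (1.0 - theta))

for eta in (-6.0, -2.0, 0.0, 2.0, 6.0):
    theta = sigmoid(eta)
    print(f"eta={eta:+.1f}  ->  theta={theta:.4f}  ->  logit(theta)={logit(theta):+.4f}")
```

Applying `logit` after `sigmoid` recovers the original \eta up to rounding, which is what it means for the two maps to be inverses.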

Canonical form p(x;\eta)=h(x)\exp(\eta T(x)-A(\eta)); T(x) is the natural sufficient statistic.
The log-partition generates moments: A'(\eta)=\mathbb{E}[T], A''(\eta)=\operatorname{Var}(T), so A is convex.
Bernoulli, Poisson, normal, gamma, beta and more all fit — a single template covering the everyday zoo.

The template h(x)\exp(\eta T(x)-A(\eta)) requires the support (the set of x where the density is positive) to be the same for every parameter value, because h(x) carries no \eta and the exponential factor is never zero. The moment you let the range of the data depend on the parameter, the model falls out of the exponential family. The classic offender is the uniform U(0,\theta): its density is 1/\theta on [0,\theta] and 0 beyond, so the support grows with \theta. No amount of algebra rewrites that indicator into the canonical form, and indeed U(0,\theta) behaves quite differently (its sufficient statistic is the maximum \max_i x_i, and the MLE \hat\theta=\max_i x_i is biased downward). Every exponential family has fixed support, so always check that first.
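The bias of the uniform MLE is easy to see in a simulation. The sketch below is an illustration under stated assumptions: \theta=2, samples of size n=5, 20000 repetitions and a fixed seed are all arbitrary choices. It compares the average of \max_i x_i with the known expectation n\theta/(n+1).

```python
import random

def uniform_mle(sample):
    """MLE of theta for U(0, theta): the sample maximum."""
    return max(sample)

rng = random.Random(0)    # fixed seed so the run is repeatable
theta = 2.0               # illustrative true parameter
n = 5                     # illustrative sample size
trials = 20000            # illustrative number of repetitions

estimates = [
    uniform_mle([rng.uniform(0.0, theta) for _ in range(n)])
    for _ in range(trials)
]

average_mle = sum(estimates) / trials
expected_mle = n * theta / (n + 1)    # E[max of n draws from U(0, theta)]

print(f"true theta        = {theta}")
print(f"average MLE       = {average_mle:.4f}")
print(f"n*theta/(n+1)     = {expected_mle:.4f}")
```

The estimate can never exceed \theta, because no observation can, so its average necessarily sits below the true value.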

In Bayesian statistics you multiply a prior by the likelihood to get a posterior, and life is easy when the posterior stays in the same shape as the prior: that is a conjugate prior. Exponential families guarantee one. As a function of \eta, the likelihood is proportional to \exp(\eta\sum_i T(x_i) - nA(\eta)), so a prior of the same exponential shape in \eta updates simply by adding the data's sufficient statistic, and the sample size n, to the prior's parameters. That is why the Beta is the natural prior for a Bernoulli's \theta (Beta → Beta), and the Gamma for a Poisson's \lambda. Conjugacy isn't a lucky coincidence for a few textbook pairs. It is a structural gift of the exponential family, and it turns Bayesian updating into arithmetic.
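The two conjugate pairs named above reduce to a couple of additions. The sketch below uses the standard update rules: a Beta(a, b) prior with Bernoulli data becomes Beta(a + \sum x_i, b + n - \sum x_i), and a Gamma prior with Poisson data becomes Gamma(shape + \sum x_i, rate + n). The function names are hypothetical, and the Gamma is assumed to be in the shape and rate parametrisation.

```python
def beta_bernoulli_update(a, b, data):
    """Posterior Beta parameters after observing 0/1 data under a Beta(a, b) prior."""
    successes = sum(data)              # the sufficient statistic sum_i T(x_i)
    failures = len(data) - successes
    return a + successes, b + failures

def gamma_poisson_update(shape, rate, data):
    """Posterior Gamma (shape, rate) after observing counts under a Gamma prior."""
    return shape + sum(data), rate + len(data)

# Illustrative priors and data.
beta_posterior = beta_bernoulli_update(2, 2, [1, 0, 1, 1])
gamma_posterior = gamma_poisson_update(3, 1, [2, 4, 0])

print("Beta posterior (a, b):          ", beta_posterior)
print("Gamma posterior (shape, rate):  ", gamma_posterior)
```

In both updates the data enter only through the sufficient statistic and the sample size, which is the structural point of the paragraph above.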