Characteristic Functions and the Central Limit Theorem
Add up enough independent random nudges — measurement errors in a lab, the heights of a crowd, the daily
ups and downs of a stock, the votes in a poll — and no matter what the individual nudges look like, their
(properly scaled) total draws the same bell-shaped curve. This uncanny universality is
the Central Limit Theorem, the reason the normal distribution turns up everywhere in
science. It is easy to state and, with the right tool, astonishingly clean to prove. That tool is the
characteristic function.
For a real random variable X with law
\mu_X, the characteristic function is the expectation of a rotating unit phasor:
\varphi_X(t) \;=\; \mathbb{E}\!\left[e^{itX}\right] \;=\; \int_{\mathbb{R}} e^{itx}\, d\mu_X(x), \qquad t \in \mathbb{R}.
Read the middle expression as an expectation, the right one as a Lebesgue integral against the
distribution of X. In fact \varphi_X is nothing but the Fourier transform of the measure \mu_X
(with the probabilist's sign and normalisation
convention). Every fact you know about Fourier transforms — that they turn convolution into
multiplication, that they are invertible, that they encode smoothness and decay — is about to earn its
keep in probability.
Unlike the moment generating function \mathbb{E}[e^{tX}], which can blow up
(a heavy tail makes e^{tX} non-integrable), the characteristic function is
defined for every distribution and every t. The reason is a
one-line bound: the integrand has constant modulus,
\bigl|e^{itX}\bigr| = 1 \quad\Longrightarrow\quad \bigl|\varphi_X(t)\bigr| \le \int 1 \, d\mu_X = 1 < \infty.
A bounded function is always integrable against a probability measure. So
\varphi_X exists unconditionally — no tail assumptions, no radius of
convergence. That robustness is exactly why it, and not the MGF, is the right engine for limit theorems.
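The bound can be watched at work on a heavy-tailed law. The sketch below estimates the characteristic function of a standard Cauchy variable, whose moment generating function is infinite for every t \neq 0, by averaging unit phasors over simulated draws. The helper names, seed, and sample size are illustrative assumptions; the comparison value e^{-|t|} is the Cauchy characteristic function quoted later in this article.

```python
import cmath
import math
import random


def empirical_cf(samples, t):
    """Sample average of the unit phasors e^{itx}: an estimate of E[e^{itX}]."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)


def sample_cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


rng = random.Random(0)  # fixed seed, chosen arbitrarily for reproducibility
cauchy_samples = [sample_cauchy(rng) for _ in range(20000)]

for t in (0.5, 1.0, 2.0):
    estimate = empirical_cf(cauchy_samples, t)
    # Every phasor has modulus 1, so the average can never leave the unit disc.
    print(t, abs(estimate), math.exp(-abs(t)))
```

Even though individual draws can be enormous, each contributes a point on the unit circle, so the average stays bounded and settles near e^{-|t|}.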
The properties that make it useful
A handful of properties follow straight from the definition and the linearity of the integral. Each is
small; together they are everything we need.
-
Normalisation: \varphi_X(0) = \mathbb{E}[e^{0}] = \mathbb{E}[1] = 1.
-
Bounded: |\varphi_X(t)| \le 1 for all
t, with equality at t = 0.
-
Hermitian: \varphi_X(-t) = \overline{\varphi_X(t)}; if
X is symmetric about 0 then
\varphi_X is real-valued.
-
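Uniformly continuous: \varphi_X is uniformly continuous on
\mathbb{R}, however rough the underlying distribution:
|\varphi_X(t+h) - \varphi_X(t)| \le \mathbb{E}\,\bigl|e^{ihX} - 1\bigr|, which does not depend on
t and tends to 0 as h \to 0 by dominated convergence (constant dominator 2).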
-
Affine maps: for constants a, b,
\varphi_{aX + b}(t) = \mathbb{E}\!\left[e^{it(aX + b)}\right] = e^{itb}\,\varphi_X(at).
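The properties listed above can be checked numerically on any small example. The following is a minimal sketch for a hypothetical three-point distribution; the support, weights, and the constants a, b, t are arbitrary choices made for illustration.

```python
import cmath

# Hypothetical three-point law, chosen only for illustration: P(X = x_k) = p_k.
support = [-1.0, 0.5, 2.0]
weights = [0.2, 0.5, 0.3]


def cf(t, points=support, probs=weights):
    """Characteristic function of a finite discrete law: sum of p_k e^{i t x_k}."""
    return sum(p * cmath.exp(1j * t * x) for x, p in zip(points, probs))


a, b, t = 3.0, -1.5, 0.7

normalised = abs(cf(0.0) - 1.0) < 1e-12
bounded = all(abs(cf(0.1 * k)) <= 1.0 + 1e-12 for k in range(-100, 101))
hermitian = abs(cf(-t) - cf(t).conjugate()) < 1e-12

# Affine rule: the law of aX + b sits on the points a*x + b with the same weights.
shifted_support = [a * x + b for x in support]
lhs = cf(t, points=shifted_support)
rhs = cmath.exp(1j * t * b) * cf(a * t)
affine = abs(lhs - rhs) < 1e-12

print(normalised, bounded, hermitian, affine)
```

The checks use a small tolerance rather than exact equality because the phasors are computed in floating point.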
And now the property that does all the heavy lifting — the reason we changed coordinates into Fourier
space at all:
If X and Y are independent, then the phasors e^{itX} and e^{itY} are independent too, so the
expectation of their product factors:
\varphi_{X+Y}(t) = \mathbb{E}\!\left[e^{it(X+Y)}\right] = \mathbb{E}\!\left[e^{itX}\right]\,\mathbb{E}\!\left[e^{itY}\right] = \varphi_X(t)\,\varphi_Y(t).
The convolution of two laws (the messy integral that describes the distribution of a
sum) becomes an ordinary product of two functions. This is the Fourier miracle,
transplanted into probability: adding independent random variables is multiplication in
characteristic-function space.
By induction, for independent X_1, \dots, X_n we get
\varphi_{X_1 + \cdots + X_n}(t) = \prod_{k=1}^{n} \varphi_{X_k}(t), and if they
are identically distributed with common characteristic function
\varphi, this collapses to a single power:
\varphi_{S_n}(t) = \bigl[\varphi(t)\bigr]^{n}, \qquad S_n = X_1 + \cdots + X_n.
Take X \sim \mathrm{Bernoulli}(p), so
X = 1 with probability p and
X = 0 with probability 1 - p. Straight from the
definition,
\varphi_X(t) = (1-p)\,e^{it\cdot 0} + p\,e^{it\cdot 1} = (1-p) + p\,e^{it}.
Check the sanity conditions: \varphi_X(0) = (1-p) + p = 1 ✓, and
|\varphi_X(t)| \le (1-p) + p = 1 by the triangle inequality ✓. Now sum
n independent copies to get a
\mathrm{Binomial}(n, p) count of successes. The product rule hands you its
characteristic function with no convolution at all:
\varphi_{S_n}(t) = \bigl[(1-p) + p\,e^{it}\bigr]^{n}.
Try doing that by summing over the binomial coefficients directly and you will appreciate the change of
coordinates. The whole distribution of a sum, packaged in one power of one small function.
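For readers who want to see the two routes agree, here is a small sketch that computes the binomial characteristic function both ways: as a power of the Bernoulli one, and by summing e^{itk} against the binomial probabilities. The values of n and p and the function names are illustrative assumptions.

```python
import cmath
import math


def bernoulli_cf(t, p):
    """(1 - p) + p e^{it}, the Bernoulli(p) characteristic function."""
    return (1 - p) + p * cmath.exp(1j * t)


def binomial_cf_direct(t, n, p):
    """Sum e^{itk} against the Binomial(n, p) probabilities, term by term."""
    return sum(
        math.comb(n, k) * p**k * (1 - p) ** (n - k) * cmath.exp(1j * t * k)
        for k in range(n + 1)
    )


n, p = 12, 0.3  # illustrative values, not taken from the text
for t in (0.0, 0.4, 1.3, 2.9):
    via_product_rule = bernoulli_cf(t, p) ** n
    via_pmf = binomial_cf_direct(t, n, p)
    print(t, abs(via_product_rule - via_pmf))
```

The printed differences should sit at the level of floating-point rounding, which is the binomial theorem in disguise.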
Uniqueness, inversion, and moments
A change of coordinates is only useful if you can change back and if the new coordinates expose
something worth reading. Two facts deliver exactly that. First, the
uniqueness / inversion theorem: the characteristic function determines the law
completely — if \varphi_X = \varphi_Y as functions, then
X and Y have the same distribution. There is even an
explicit inversion formula recovering the distribution function from
\varphi (a Fourier inversion in disguise). So passing to
\varphi loses nothing: it is a faithful re-encoding of the whole distribution.
Second, moments live in the derivatives at zero. Differentiating k times under the integral sign
(legal whenever \mathbb{E}|X|^{k} < \infty) pulls down a factor of iX each
time:
\varphi_X^{(k)}(0) = i^{k}\,\mathbb{E}[X^{k}].
In particular \varphi_X'(0) = i\,\mathbb{E}[X] and
\varphi_X''(0) = i^{2}\,\mathbb{E}[X^{2}] = -\mathbb{E}[X^{2}]. Feeding these
into a Taylor expansion about t = 0 gives the local shape we will need. If
X has mean \mu and variance
\sigma^{2} (so \mathbb{E}[X^{2}] = \sigma^{2} + \mu^{2}),
then as t \to 0,
\varphi_X(t) = 1 + it\mu - \tfrac{1}{2}\bigl(\sigma^{2} + \mu^{2}\bigr)t^{2} + o(t^{2}).
Read the coefficients: the constant term is 1 (normalisation), the linear term
carries the mean, the quadratic term carries the second moment. To second order, the behaviour near
the origin is governed by just the first two moments, which is precisely why, in the limit theorem
to come, only the mean and variance survive.
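The moment formula can be tested numerically by approximating the derivatives at zero with central finite differences. This sketch reuses the Bernoulli characteristic function from earlier; the value of p and the step size h are arbitrary illustrative choices, and finite differences are only an approximation to the true derivatives.

```python
import cmath


def bernoulli_cf(t, p):
    """(1 - p) + p e^{it}, the Bernoulli(p) characteristic function."""
    return (1 - p) + p * cmath.exp(1j * t)


p = 0.3    # illustrative value
h = 1e-4   # finite-difference step, an arbitrary small number


def phi(t):
    return bernoulli_cf(t, p)


# Central differences approximate the first two derivatives at t = 0.
first = (phi(h) - phi(-h)) / (2 * h)
second = (phi(h) - 2 * phi(0.0) + phi(-h)) / h**2

mean = (first / 1j).real         # phi'(0) = i E[X]
second_moment = (-second).real   # phi''(0) = -E[X^2]
variance = second_moment - mean**2

print(mean, second_moment, variance)  # close to p, p, p(1 - p)
```

For a Bernoulli variable X^2 = X, so both the mean and the second moment equal p and the variance is p(1 - p).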
Lévy's continuity theorem — the engine
Everything so far lets us compute with characteristic functions. The last ingredient lets us take
limits, and it is the crank that turns algebra into a theorem about distributions.
Let X_n be random variables with characteristic functions
\varphi_{X_n}.
-
If X_n \Rightarrow X (convergence in distribution), then
\varphi_{X_n}(t) \to \varphi_X(t) for every t.
-
Conversely, if \varphi_{X_n}(t) \to \psi(t) pointwise for some function
\psi that is continuous at t = 0,
then \psi is the characteristic function of some random variable
X, and X_n \Rightarrow X.
This is extraordinary leverage. Convergence in distribution is a statement about
cumulative distribution functions converging at every continuity point of the limit, which is
awkward to check directly.
Lévy says you may instead check pointwise convergence of a single function,
\varphi_{X_n} \to \psi, plus a mild continuity condition at the origin (which
rules out probability mass escaping to infinity). Prove that one function converges, and the whole
distribution converges. That is the engine; here is what it drives.
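To see why continuity at the origin cannot be dropped, consider a standard counterexample, sketched here under the assumption X_n \sim N(0, n^{2}) (a choice made for illustration). Its characteristic function follows from the affine rule applied to the standard normal, whose characteristic function is e^{-t^{2}/2}.

```latex
\varphi_{X_n}(t) = e^{-n^{2}t^{2}/2}
\;\xrightarrow[n\to\infty]{}\;
\psi(t) = \begin{cases} 1, & t = 0, \\ 0, & t \neq 0, \end{cases}
\qquad\text{while}\qquad
\mathbb{P}(X_n \le x) = \Phi\!\left(\tfrac{x}{n}\right) \to \Phi(0) = \tfrac{1}{2}
\quad\text{for every } x \in \mathbb{R}.
```

The pointwise limit \psi exists but jumps at t = 0, and the limiting "distribution function" is the constant 1/2, which is not a distribution function at all: half the mass has escaped to -\infty and half to +\infty. Here \Phi denotes the standard normal distribution function.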
The Central Limit Theorem, proved in four lines
Let X_1, X_2, \dots be independent and identically distributed, with finite
mean \mu = \mathbb{E}[X_i] and finite variance
0 < \sigma^{2} < \infty. Form the sum
S_n = X_1 + \cdots + X_n and standardise it — subtract its
mean n\mu and divide by its standard deviation
\sigma\sqrt{n}:
Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.
Under those hypotheses, Z_n converges in distribution to a standard normal:
Z_n \;\Longrightarrow\; N(0,1), \qquad \text{i.e.}\qquad \mathbb{P}(Z_n \le z) \to \int_{-\infty}^{z} \tfrac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}\, du.
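Before the proof, the statement itself can be sanity-checked by simulation. The sketch below assumes Exponential(1) summands (mean 1, variance 1), a fixed arbitrary seed, and illustrative values of n and the number of repetitions; it compares the empirical distribution function of Z_n with the standard normal one at a few points.

```python
import math
import random
from statistics import NormalDist

MU, SIGMA = 1.0, 1.0  # mean and standard deviation of an Exponential(1) draw


def standardised_sum(n, rng):
    """Z_n = (S_n - n*mu) / (sigma*sqrt(n)) for n i.i.d. Exponential(1) draws."""
    s_n = sum(rng.expovariate(1.0) for _ in range(n))
    return (s_n - n * MU) / (SIGMA * math.sqrt(n))


def empirical_cdf_at(z, draws):
    """Fraction of simulated Z_n values that are <= z."""
    return sum(1 for d in draws if d <= z) / len(draws)


rng = random.Random(1)  # arbitrary fixed seed
n, reps = 50, 10000     # illustrative sizes
draws = [standardised_sum(n, rng) for _ in range(reps)]

for z in (-1.0, 0.0, 1.0, 2.0):
    print(z, empirical_cdf_at(z, draws), NormalDist().cdf(z))
```

The two columns should agree to roughly two decimal places; the remaining gap reflects both sampling noise and the skewness of the exponential, which fades as n grows.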
Now the proof. Recentre by writing Y_i = X_i - \mu, so
\mathbb{E}[Y_i] = 0 and \mathbb{E}[Y_i^{2}] = \sigma^{2},
with common characteristic function \varphi_Y. Since
Z_n = \sum_i Y_i / (\sigma\sqrt{n}) is a scaled sum of independent copies, the
affine rule and the product rule combine into a single power:
\varphi_{Z_n}(t) = \left[\varphi_Y\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^{n}.
Expand \varphi_Y near 0 using the moment expansion
above. Because Y has mean 0 and variance
\sigma^{2}, the linear term vanishes and
\varphi_Y(s) = 1 - \tfrac{1}{2}\sigma^{2}s^{2} + o(s^{2}). Substitute
s = t/(\sigma\sqrt{n}), so s^{2} = t^{2}/(\sigma^{2} n)
and the \sigma^{2} cancels beautifully:
\varphi_{Z_n}(t) = \left[\,1 - \frac{t^{2}}{2n} + o\!\left(\tfrac{1}{n}\right)\right]^{n} \;\xrightarrow[n\to\infty]{}\; e^{-t^{2}/2}.
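That last step is the classic limit (1 + c/n)^{n} \to e^{c} with
c = -t^{2}/2; the o(1/n) remainder cannot change it, because its total effect on the n-th power is
at most n \cdot o(1/n) = o(1). And e^{-t^{2}/2} is exactly the
characteristic function of the standard normal N(0,1) (up to normalisation, a Gaussian is its own
Fourier transform). The limit function is continuous at 0, so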
Lévy's continuity theorem converts this pointwise convergence into convergence in
distribution:
\varphi_{Z_n}(t) \to e^{-t^{2}/2} \quad\Longrightarrow\quad Z_n \Rightarrow N(0,1). \qquad \blacksquare
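The one step in the proof that hides some work is passing to the limit with a complex base and an o(1/n) remainder. One common way to justify it, offered here as a sketch rather than the only route, uses the elementary inequality |a^{n} - b^{n}| \le n\,|a - b| for complex numbers with |a|, |b| \le 1. The symbols a_n and b_n are shorthand introduced for this note.

```latex
a_n = \varphi_Y\!\left(\frac{t}{\sigma\sqrt{n}}\right), \qquad
b_n = 1 - \frac{t^{2}}{2n}, \qquad
\left|\varphi_{Z_n}(t) - b_n^{\,n}\right|
  = \left|a_n^{\,n} - b_n^{\,n}\right|
  \le n\,\left|a_n - b_n\right|
  = n \cdot o\!\left(\tfrac{1}{n}\right) \longrightarrow 0 .
```

Here |a_n| \le 1 because every characteristic function is bounded by 1, and |b_n| \le 1 as soon as n \ge t^{2}/4. Since b_n is real, b_n^{\,n} \to e^{-t^{2}/2} is the ordinary calculus limit, and the two statements together give \varphi_{Z_n}(t) \to e^{-t^{2}/2}.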
Notice what the argument used and what it ignored. It used only the mean and variance (the first two
Taylor coefficients) and it never touched the shape of the individual
X_i. Bernoulli, uniform, exponential, dice, anything with a finite variance:
whatever lies beyond the second moment is swept into the o(1/n) and forgotten, and third or higher
moments need not even exist. That
forgetting is the universality of the bell curve. The accompanying chart is this proof, drawn: it plots
\varphi_{Z_n}(t) for a sum of uniforms and watches it flatten onto
e^{-t^{2}/2} as you raise n.
At n = 1 the solid curve is a single uniform's characteristic function — a
\operatorname{sinc} that dips below the axis and oscillates, visibly not a bell.
Push n up and each raise-to-the-power irons the oscillations flat, pinning the
curve onto the dashed Gaussian. By the time the two are indistinguishable, so are the distributions.
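The flattening described above can be reproduced numerically. This sketch assumes the summands are Uniform(-\sqrt{3}, \sqrt{3}), which have mean 0 and variance 1 (the specific uniform behind the chart is not stated, so this is an illustrative choice), and measures the largest gap between \varphi_{Z_n} and e^{-t^{2}/2} on a grid of t values.

```python
import math


def uniform_cf(s):
    """CF of Uniform(-sqrt(3), sqrt(3)): sin(sqrt(3)s) / (sqrt(3)s), equal to 1 at s = 0."""
    x = math.sqrt(3.0) * s
    return 1.0 if x == 0.0 else math.sin(x) / x


def cf_standardised_sum(t, n):
    """phi_{Z_n}(t) = [phi(t / sqrt(n))]^n for summands with mean 0 and variance 1."""
    return uniform_cf(t / math.sqrt(n)) ** n


def max_gap(n, grid):
    """Largest distance between phi_{Z_n} and the Gaussian limit over the grid."""
    return max(abs(cf_standardised_sum(t, n) - math.exp(-t * t / 2)) for t in grid)


grid = [0.05 * k for k in range(-160, 161)]  # t from -8 to 8
for n in (1, 2, 5, 20, 100):
    print(n, max_gap(n, grid))
```

The printed gap should shrink steadily as n increases, which is the numerical counterpart of the solid curve settling onto the dashed one.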
Three traps, each of which sinks a lot of first attempts at the CLT:
-
The data does not become normal — the standardised SUM does. The CLT says nothing
about your raw sample X_1, \dots, X_n turning bell-shaped; a histogram of
1000 die rolls stays flat forever. The object whose law approaches
N(0,1) is the single number
Z_n = (S_n - n\mu)/(\sigma\sqrt{n}) — the sum, centred and rescaled. Drop the
\sqrt{n} scaling and the whole statement collapses.
-
Finite variance is not optional. The proof spent its one crucial move on the term
-\tfrac{1}{2}\sigma^{2}s^{2} — if \sigma^{2} = \infty
that term is meaningless and the theorem is simply false. The
standard \mathrm{Cauchy} distribution has characteristic function
\varphi(t) = e^{-|t|}, so the plain average S_n/n of
n i.i.d. Cauchy variables has characteristic function
[\varphi(t/n)]^{n} = [e^{-|t|/n}]^{n} = e^{-|t|}: it never budges from Cauchy, never approaches a
bell. Suitably rescaled sums of heavy-tailed variables converge, when they converge at all, to other
stable laws, not to the normal.
-
Convergence is in distribution only. Z_n \Rightarrow N(0,1)
means the cumulative distribution functions converge to the normal one. It does not say that
Z_n converges as a sequence of random variables, outcome by outcome (it doesn't, not even in
probability), nor that any density even
exists for finite n (for coin flips Z_n is discrete
at every stage). It is a statement about the shape of the law, nothing more and nothing less.
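The finite-variance trap is easy to observe in simulation. The sketch below draws averages of n i.i.d. standard Cauchy variables and reports their interquartile range; if averaging helped, the spread would shrink like 1/\sqrt{n}, but for Cauchy it should stay near 2 for every n. The seed, repetition count, and helper names are illustrative assumptions, and the quartile estimator is deliberately crude.

```python
import math
import random


def sample_cauchy(rng):
    """Standard Cauchy draw via the inverse CDF: tan(pi * (U - 1/2))."""
    return math.tan(math.pi * (rng.random() - 0.5))


def iqr(values):
    """Interquartile range from sorted order statistics (a rough estimator)."""
    ordered = sorted(values)
    m = len(ordered)
    return ordered[(3 * m) // 4] - ordered[m // 4]


def iqr_of_sample_means(n, reps, rng):
    """Spread of the average of n i.i.d. standard Cauchy draws, over reps repetitions."""
    means = [sum(sample_cauchy(rng) for _ in range(n)) / n for _ in range(reps)]
    return iqr(means)


rng = random.Random(2)  # arbitrary fixed seed
for n in (1, 10, 100):
    # The standard Cauchy quartiles are -1 and +1, so its interquartile range is 2.
    print(n, iqr_of_sample_means(n, 4000, rng))
```

The interquartile range is used instead of the sample standard deviation because the latter is dominated by a handful of extreme draws and does not stabilise.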
The oldest thread runs back to Abraham de Moivre, who around 1733 found that binomial
probabilities for a fair coin could be approximated by what we now call a Gaussian curve — the special
case X_i \sim \mathrm{Bernoulli}(\tfrac12), later polished by
Laplace into the de Moivre–Laplace theorem. For a century the general result was
believed but not rigorously nailed down; the missing pieces were sharp hypotheses.
Lyapunov (1901) and Lindeberg (1922) supplied conditions under which
even non-identically-distributed summands still add up to a bell, and Lévy's continuity theorem
gave the clean Fourier-analytic proof machine we used above.
The grand name came late. In 1920 George Pólya wrote of the
"zentraler Grenzwertsatz" — the central limit theorem — where "central" modifies
theorem: this is the theorem at the centre of probability theory, not a theorem
about a centre. The name stuck, and the pun-that-isn't has confused students ever since.