Characteristic Functions and the Central Limit Theorem

Add up enough independent random nudges — measurement errors in a lab, the heights of a crowd, the daily ups and downs of a stock, the votes in a poll — and no matter what the individual nudges look like, their (properly scaled) total draws the same bell-shaped curve. This uncanny universality is the Central Limit Theorem, the reason the normal distribution turns up everywhere in science. It is easy to state and, with the right tool, astonishingly clean to prove. That tool is the characteristic function.

For a real random variable X with law \mu_X, the characteristic function is the expectation of a rotating unit phasor:

\varphi_X(t) \;=\; \mathbb{E}\!\left[e^{itX}\right] \;=\; \int_{\mathbb{R}} e^{itx}\, d\mu_X(x), \qquad t \in \mathbb{R}.

Read the middle expression as an expectation, the right one as a Lebesgue integral against the distribution of X. In fact \varphi_X is nothing but the Fourier transform of the measure \mu_X (with the probabilist's sign and normalisation convention). Every fact you know about Fourier transforms — that they turn convolution into multiplication, that they are invertible, that they encode smoothness and decay — is about to earn its keep in probability.

Unlike the moment generating function \mathbb{E}[e^{tX}], which can blow up (a heavy tail makes e^{tX} non-integrable), the characteristic function is defined for every distribution and every t. The reason is a one-line bound: the integrand has constant modulus,

\bigl|e^{itX}\bigr| = 1 \quad\Longrightarrow\quad \bigl|\varphi_X(t)\bigr| \le \int 1 \, d\mu_X = 1 < \infty.

A bounded function is always integrable against a probability measure. So \varphi_X exists unconditionally — no tail assumptions, no radius of convergence. That robustness is exactly why it, and not the MGF, is the right engine for limit theorems.
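The bound can be seen on simulated data. The sketch below is only an illustration under stated assumptions: it draws from a Pareto law with shape 1 (an arbitrary heavy-tailed choice with infinite mean, not taken from the text), and the helper name empirical_cf is hypothetical. It estimates \varphi_X(t) by a sample average of e^{itx} and confirms that the modulus never exceeds 1.

```python
import cmath
import random

def empirical_cf(samples, t):
    """Sample average of e^{itx}, an estimate of E[e^{itX}]."""
    return sum(cmath.exp(1j * t * x) for x in samples) / len(samples)

rng = random.Random(0)  # fixed seed, chosen only for reproducibility
# Assumed example law: Pareto with shape 1, whose mean is infinite.
heavy = [rng.paretovariate(1.0) for _ in range(20_000)]

ts = [0.0, 0.5, 1.0, 2.0]
cf_values = [empirical_cf(heavy, t) for t in ts]
for t, v in zip(ts, cf_values):
    print(f"t={t:3.1f}  |phi(t)|={abs(v):.4f}")
```

Each term in the average has modulus exactly 1, so the average cannot leave the unit disc however wild the samples are.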

The properties that make it useful

A handful of properties follow straight from the definition and the linearity of the integral. Each is small; together they are everything we need.

Normalisation: \varphi_X(0) = \mathbb{E}[e^{0}] = \mathbb{E}[1] = 1.
Bounded: |\varphi_X(t)| \le 1 for all t, with equality at t = 0.
Hermitian: \varphi_X(-t) = \overline{\varphi_X(t)}; if X is symmetric about 0 then \varphi_X is real-valued.
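Uniformly continuous: \varphi_X is uniformly continuous on \mathbb{R}, no matter how rough the underlying distribution. The reason is the bound |\varphi_X(t+h) - \varphi_X(t)| \le \mathbb{E}\bigl|e^{ihX} - 1\bigr|, which does not depend on t and tends to 0 as h \to 0 by dominated convergence (the integrand is at most 2).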
Affine maps: for constants a, b, \varphi_{aX + b}(t) = \mathbb{E}\!\left[e^{it(aX + b)}\right] = e^{itb}\,\varphi_X(at).
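These properties are easy to spot-check numerically. The following is a minimal sketch, assuming a small discrete law whose values and probabilities are made up purely for illustration; it evaluates the characteristic function from the definition and tests normalisation, boundedness, the Hermitian symmetry and the affine rule at a few points.

```python
import cmath

# Hypothetical discrete law, chosen only for illustration.
values = [-1.0, 0.5, 2.0]
probs = [0.2, 0.5, 0.3]

def cf(t, a=1.0, b=0.0):
    """Characteristic function of a*X + b, computed straight from the definition."""
    return sum(p * cmath.exp(1j * t * (a * x + b)) for x, p in zip(values, probs))

t, a, b = 0.7, 3.0, -1.5

normalised = abs(cf(0.0) - 1) < 1e-12
bounded = all(abs(cf(k / 10)) <= 1 + 1e-12 for k in range(-100, 101))
hermitian = abs(cf(-t) - cf(t).conjugate()) < 1e-12
affine = abs(cf(t, a, b) - cmath.exp(1j * t * b) * cf(a * t)) < 1e-12

print(normalised, bounded, hermitian, affine)
```

A numerical check at finitely many points proves nothing, but it is a quick guard against sign and scaling slips when applying the affine rule by hand.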

And now the property that does all the heavy lifting — the reason we changed coordinates into Fourier space at all:

If X and Y are independent, then the phasors e^{itX} and e^{itY} are independent too, so the expectation of their product factors:

\varphi_{X+Y}(t) = \mathbb{E}\!\left[e^{it(X+Y)}\right] = \mathbb{E}\!\left[e^{itX}\right]\,\mathbb{E}\!\left[e^{itY}\right] = \varphi_X(t)\,\varphi_Y(t).

The convolution of two laws (the messy integral that describes the distribution of a sum) becomes an ordinary product of two functions. This is the Fourier miracle, transplanted into probability: adding independent random variables is multiplication in characteristic-function space.

By induction, for independent X_1, \dots, X_n we get \varphi_{X_1 + \cdots + X_n}(t) = \prod_{k=1}^{n} \varphi_{X_k}(t), and if they are identically distributed with common characteristic function \varphi, this collapses to a single power:

\varphi_{S_n}(t) = \bigl[\varphi(t)\bigr]^{n}, \qquad S_n = X_1 + \cdots + X_n.
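The product rule can be checked against an explicit convolution. This is a sketch under assumptions: the two probability mass functions are invented for illustration, both variables live on 0, 1, 2, and the helper names convolve and cf are hypothetical. It computes the law of X + Y the slow way, takes its characteristic function, and compares with the product of the two individual characteristic functions.

```python
import cmath

def convolve(p, q):
    """pmf of X + Y for independent X, Y supported on 0, 1, 2, ... (index = value)."""
    out = [0.0] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            out[i + j] += p_i * q_j
    return out

def cf(pmf, t):
    """Characteristic function of a law on 0, 1, 2, ... given as a pmf list."""
    return sum(p_k * cmath.exp(1j * t * k) for k, p_k in enumerate(pmf))

# Two hypothetical pmfs, chosen only for illustration.
p = [0.1, 0.6, 0.3]      # X takes values 0, 1, 2
q = [0.5, 0.25, 0.25]    # Y takes values 0, 1, 2

t = 1.3
lhs = cf(convolve(p, q), t)   # cf of the law of X + Y
rhs = cf(p, t) * cf(q, t)     # product of the individual cfs
gap = abs(lhs - rhs)
print(gap)
```

The double loop in convolve is the "messy integral" in discrete form; on the Fourier side it is replaced by a single complex multiplication.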

Take X \sim \mathrm{Bernoulli}(p), so X = 1 with probability p and X = 0 with probability 1 - p. Straight from the definition,

\varphi_X(t) = (1-p)\,e^{it\cdot 0} + p\,e^{it\cdot 1} = (1-p) + p\,e^{it}.

Check the sanity conditions: \varphi_X(0) = (1-p) + p = 1, and |\varphi_X(t)| \le (1-p) + p = 1 by the triangle inequality. ✓ Now sum n independent copies to get a \mathrm{Binomial}(n, p) count of successes. The product rule hands you its characteristic function with no convolution at all:

\varphi_{S_n}(t) = \bigl[(1-p) + p\,e^{it}\bigr]^{n}.

Try doing that by summing over the binomial coefficients directly and you will appreciate the change of coordinates. The whole distribution of a sum, packaged in one power of one small function.
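For a concrete comparison, the sketch below computes the binomial characteristic function both ways: by summing e^{itk} against the binomial probabilities, and as the n-th power of the Bernoulli one. The parameter values n = 12, p = 0.3, t = 0.9 are arbitrary choices for illustration, and the function names are hypothetical.

```python
import cmath
from math import comb

def binomial_cf_direct(n, p, t):
    """Sum e^{itk} * P(S_n = k) over the Binomial(n, p) pmf."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) * cmath.exp(1j * t * k)
               for k in range(n + 1))

def binomial_cf_power(n, p, t):
    """n-th power of the Bernoulli(p) characteristic function."""
    return ((1 - p) + p * cmath.exp(1j * t)) ** n

n, p, t = 12, 0.3, 0.9
gap = abs(binomial_cf_direct(n, p, t) - binomial_cf_power(n, p, t))
print(gap)
```

The agreement is the binomial theorem in disguise; the point is that the product rule delivers it without ever writing down a binomial coefficient.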

Uniqueness, inversion, and moments

A change of coordinates is only useful if you can change back, and if you can read off what you care about in the new coordinates. Two facts guarantee this. First, the uniqueness / inversion theorem: the characteristic function determines the law completely. If \varphi_X = \varphi_Y as functions, then X and Y have the same distribution. There is even an explicit inversion formula recovering the distribution function from \varphi (a Fourier inversion in disguise). So passing to \varphi loses nothing: it is a faithful re-encoding of the whole distribution.
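Inversion is easiest to see in a special case. The sketch below does not implement the general inversion formula mentioned above; it assumes X is supported on the integers 0, ..., N-1, where a finite discrete Fourier inversion, P(X = k) = (1/N) \sum_{m=0}^{N-1} \varphi_X(2\pi m/N)\, e^{-2\pi i m k/N}, recovers the probabilities exactly. The helper name invert_lattice_cf and the parameters n = 8, p = 0.35 are illustrative choices.

```python
import cmath
from math import comb, pi

def binomial_cf(n, p, t):
    """Binomial(n, p) characteristic function, as a power of the Bernoulli one."""
    return ((1 - p) + p * cmath.exp(1j * t)) ** n

def invert_lattice_cf(cf, size):
    """Recover P(X = k), k = 0..size-1, for X supported on {0, ..., size-1}."""
    pmf = []
    for k in range(size):
        total = sum(cf(2 * pi * m / size) * cmath.exp(-1j * 2 * pi * m * k / size)
                    for m in range(size))
        pmf.append((total / size).real)
    return pmf

n, p = 8, 0.35
recovered = invert_lattice_cf(lambda t: binomial_cf(n, p, t), n + 1)
exact = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
max_err = max(abs(r - e) for r, e in zip(recovered, exact))
print(max_err)
```

Nothing but values of the characteristic function goes in, and the full list of binomial probabilities comes out, which is the "faithful re-encoding" claim made concrete.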

Second, moments live in the derivatives at zero. Differentiating under the integral sign (legal for the k-th derivative whenever \mathbb{E}|X|^{k} < \infty) pulls down a factor of iX each time:

\varphi_X^{(k)}(0) = i^{k}\,\mathbb{E}[X^{k}].

In particular \varphi_X'(0) = i\,\mathbb{E}[X] and \varphi_X''(0) = i^{2}\,\mathbb{E}[X^{2}] = -\mathbb{E}[X^{2}]. Feeding these into a Taylor expansion about t = 0 gives the local shape we will need. If X has mean \mu and variance \sigma^{2} (so \mathbb{E}[X^{2}] = \sigma^{2} + \mu^{2}), then as t \to 0,

\varphi_X(t) = 1 + it\mu - \tfrac{1}{2}\bigl(\sigma^{2} + \mu^{2}\bigr)t^{2} + o(t^{2}).

Read the coefficients: the constant term is 1 (normalisation), the linear term carries the mean, the quadratic term carries the second moment. The entire behaviour near the origin is governed by just the first two moments — which is precisely why, in the limit theorem to come, only the mean and variance survive.
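The moment formulas can be tried out numerically. The sketch below assumes X \sim \mathrm{Bernoulli}(0.3), an arbitrary choice, and approximates \varphi_X'(0) and \varphi_X''(0) by central finite differences with step h = 10^{-3}; the helper name moments_from_cf is hypothetical. Dividing by i and by i^{2} = -1 should return values close to \mathbb{E}[X] = p and \mathbb{E}[X^{2}] = p.

```python
import cmath

def bernoulli_cf(p, t):
    """Characteristic function of Bernoulli(p): (1 - p) + p e^{it}."""
    return (1 - p) + p * cmath.exp(1j * t)

def moments_from_cf(cf, h=1e-3):
    """Estimate E[X] and E[X^2] from central finite differences of cf at 0."""
    d1 = (cf(h) - cf(-h)) / (2 * h)               # approximates cf'(0) = i E[X]
    d2 = (cf(h) - 2 * cf(0.0) + cf(-h)) / h**2    # approximates cf''(0) = -E[X^2]
    mean = (d1 / 1j).real
    second_moment = (-d2).real
    return mean, second_moment

p = 0.3
mean, second_moment = moments_from_cf(lambda t: bernoulli_cf(p, t))
variance = second_moment - mean**2
print(mean, second_moment, variance)
```

Finite differences are used here only to keep the example self-contained; they carry a small truncation error of order h^{2}.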

Lévy's continuity theorem — the engine

Everything so far lets us compute with characteristic functions. The last ingredient lets us take limits, and it is the crank that turns algebra into a theorem about distributions.

Let X_n be random variables with characteristic functions \varphi_{X_n}.

If X_n \Rightarrow X (convergence in distribution), then \varphi_{X_n}(t) \to \varphi_X(t) for every t.
Conversely, if \varphi_{X_n}(t) \to \psi(t) pointwise for some function \psi that is continuous at t = 0, then \psi is the characteristic function of some random variable X, and X_n \Rightarrow X.

This is extraordinary leverage. Convergence in distribution is a statement about cumulative distribution functions matching at every continuity point — awkward to check directly. Lévy says you may instead check pointwise convergence of a single function, \varphi_{X_n} \to \psi, plus a mild continuity condition at the origin (which rules out probability mass escaping to infinity). Prove that one function converges, and the whole distribution converges. That is the engine; here is what it drives.
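To see why continuity at the origin cannot be dropped, consider X_n = nZ with Z \sim N(0,1), an example chosen here for illustration. By the affine rule its characteristic function is \varphi_Z(nt) = e^{-n^{2}t^{2}/2}. The sketch below evaluates it for growing n: the values stay at 1 for t = 0 and collapse to 0 everywhere else, so the pointwise limit exists but jumps at the origin, and the laws spread out without converging in distribution.

```python
import math

def cf_scaled_normal(n, t):
    """cf of n*Z with Z ~ N(0,1): by the affine rule, exp(-(n*t)^2 / 2)."""
    return math.exp(-((n * t) ** 2) / 2)

ts = [0.0, 0.1, 0.5, 1.0]
for n in (1, 10, 100):
    print(n, [round(cf_scaled_normal(n, t), 6) for t in ts])

# For large n the function is numerically 1 at t = 0 and 0 elsewhere.
limit_at_zero = cf_scaled_normal(1000, 0.0)
limit_away = [cf_scaled_normal(1000, t) for t in ts[1:]]
```

The discontinuous limit is the Fourier-side signature of probability mass escaping to infinity.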

The Central Limit Theorem, proved in four lines

Let X_1, X_2, \dots be independent and identically distributed, with finite mean \mu = \mathbb{E}[X_i] and finite variance 0 < \sigma^{2} < \infty. Form the sum S_n = X_1 + \cdots + X_n and standardise it — subtract its mean n\mu and divide by its standard deviation \sigma\sqrt{n}:

Z_n = \frac{S_n - n\mu}{\sigma\sqrt{n}}.

Under those hypotheses, Z_n converges in distribution to a standard normal:

Z_n \;\Longrightarrow\; N(0,1), \qquad \text{i.e.}\qquad \mathbb{P}(Z_n \le z) \to \int_{-\infty}^{z} \tfrac{1}{\sqrt{2\pi}}\, e^{-u^{2}/2}\, du.

Now the proof. Recentre by writing Y_i = X_i - \mu, so \mathbb{E}[Y_i] = 0 and \mathbb{E}[Y_i^{2}] = \sigma^{2}, with common characteristic function \varphi_Y. Since Z_n = \sum_i Y_i / (\sigma\sqrt{n}) is a scaled sum of independent copies, the affine rule and the product rule combine into a single power:

\varphi_{Z_n}(t) = \left[\varphi_Y\!\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^{n}.

Expand \varphi_Y near 0 using the moment expansion above. Because Y has mean 0 and variance \sigma^{2}, the linear term vanishes and \varphi_Y(s) = 1 - \tfrac{1}{2}\sigma^{2}s^{2} + o(s^{2}). Substitute s = t/(\sigma\sqrt{n}), so s^{2} = t^{2}/(\sigma^{2} n) and the \sigma^{2} cancels beautifully; for fixed t, s \to 0 as n \to \infty, so the remainder o(s^{2}) becomes o(1/n):

\varphi_{Z_n}(t) = \left[\,1 - \frac{t^{2}}{2n} + o\!\left(\tfrac{1}{n}\right)\right]^{n} \;\xrightarrow[n\to\infty]{}\; e^{-t^{2}/2}.

That last step is the classic limit (1 + c_n/n)^{n} \to e^{c}, valid for any real or complex sequence c_n \to c, applied for each fixed t with c_n = -t^{2}/2 + n\,o(1/n) \to -t^{2}/2. And e^{-t^{2}/2} is exactly the characteristic function of the standard normal N(0,1) (a Gaussian is its own Fourier transform). The limit function is continuous at 0, so Lévy's continuity theorem converts this pointwise convergence into convergence in distribution:

\varphi_{Z_n}(t) \to e^{-t^{2}/2} \quad\Longrightarrow\quad Z_n \Rightarrow N(0,1). \qquad \blacksquare

Notice what the argument used and what it ignored. It used only the mean and variance — the first two Taylor coefficients — and it never touched the shape of the individual X_i. Bernoulli, uniform, exponential, dice, anything with a finite variance: the third-and-higher moments are swept into the o(1/n) and forgotten. That forgetting is the universality of the bell curve. The chart below is this proof, drawn: it plots \varphi_{Z_n}(t) for a sum of uniforms and watches it flatten onto e^{-t^{2}/2} as you raise n.

At n = 1 the solid curve is a single uniform's characteristic function — a \operatorname{sinc} that dips below the axis and oscillates, visibly not a bell. Push n up and each raise-to-the-power irons the oscillations flat, pinning the curve onto the dashed Gaussian. By the time the two are indistinguishable, so are the distributions.
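The same convergence can be checked with a few lines of code instead of a plot. The sketch below assumes the summands are Uniform(-1, 1), whose characteristic function is \sin(s)/s and whose variance is 1/3; the text does not say which uniform is meant, so this choice is an assumption. It evaluates \varphi_{Z_n}(t) = [\varphi(t/(\sigma\sqrt{n}))]^{n} on a grid and records the largest gap to e^{-t^{2}/2}.

```python
import math

SIGMA = 1 / math.sqrt(3)   # standard deviation of Uniform(-1, 1), an assumed choice

def cf_uniform(s):
    """cf of Uniform(-1, 1): sin(s)/s, with the value 1 at s = 0."""
    return 1.0 if s == 0 else math.sin(s) / s

def cf_standardised_sum(n, t):
    """cf of Z_n for n i.i.d. Uniform(-1, 1) summands."""
    return cf_uniform(t / (SIGMA * math.sqrt(n))) ** n

def gaussian_cf(t):
    """cf of N(0, 1)."""
    return math.exp(-t * t / 2)

grid = [k / 10 for k in range(-60, 61)]   # t in [-6, 6]
max_gap = {}
for n in (1, 2, 5, 20, 100):
    max_gap[n] = max(abs(cf_standardised_sum(n, t) - gaussian_cf(t)) for t in grid)
    print(f"n={n:3d}  max gap on grid = {max_gap[n]:.4f}")
```

The gap shrinks steadily as n grows, which is the ironing-flat of the oscillations expressed as a single number per n.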

Three traps, each of which sinks a lot of first attempts at the CLT:

The data does not become normal — the standardised SUM does. The CLT says nothing about your raw sample X_1, \dots, X_n turning bell-shaped; a histogram of 1000 die rolls stays flat forever. The object whose law approaches N(0,1) is the single number Z_n = (S_n - n\mu)/(\sigma\sqrt{n}) — the sum, centred and rescaled. Drop the \sqrt{n} scaling and the whole statement collapses.
Finite variance is not optional. The proof spent its one crucial move on the term -\tfrac{1}{2}\sigma^{2}s^{2}. If \sigma^{2} = \infty that term is meaningless, the standardisation by \sigma\sqrt{n} is undefined, and the theorem as stated fails. The \mathrm{Cauchy} distribution has characteristic function \varphi(t) = e^{-|t|}, so the average S_n/n of n i.i.d. Cauchy variables has characteristic function [e^{-|t|/n}]^{n} = e^{-|t|}: it never budges from Cauchy, never approaches a bell. Suitably rescaled sums of such heavy-tailed variables converge, when they converge at all, to other stable laws, not to the normal.
Convergence is in distribution only. Z_n \Rightarrow N(0,1) means the cumulative distribution functions match in the limit — it does not say Z_n converges as a sequence of numbers (it doesn't), nor that any density even exists for finite n (for coin flips Z_n is discrete at every stage). It is a statement about the shape of the law, nothing more and nothing less.
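The Cauchy trap is worth seeing in numbers. The sketch below is an illustration under stated assumptions: it samples standard Cauchy values by the inverse-CDF formula \tan(\pi(U - \tfrac12)), uses a fixed seed and 4000 replications (both arbitrary), and measures spread by a crude interquartile range, since the variance does not exist. It also confirms the identity [e^{-|t|/n}]^{n} = e^{-|t|} at a few points. The helper names are hypothetical.

```python
import math
import random

def cauchy_mean_cf(n, t):
    """cf of the average of n i.i.d. standard Cauchy variables: [exp(-|t|/n)]^n."""
    return math.exp(-abs(t) / n) ** n

def iqr(xs):
    """Crude interquartile range from sorted order statistics."""
    s = sorted(xs)
    return s[(3 * len(s)) // 4] - s[len(s) // 4]

rng = random.Random(1)  # fixed seed, chosen only for reproducibility

def cauchy_draw():
    # Inverse-CDF sampling: tan(pi * (U - 1/2)) is standard Cauchy.
    return math.tan(math.pi * (rng.random() - 0.5))

reps = 4000
spread = {}
for n in (1, 50):
    means = [sum(cauchy_draw() for _ in range(n)) / n for _ in range(reps)]
    spread[n] = iqr(means)

cf_gap = max(abs(cauchy_mean_cf(n, t) - math.exp(-abs(t)))
             for n in (1, 10, 1000) for t in (0.0, 0.5, 2.0))
print(spread, cf_gap)
```

A standard Cauchy has quartiles at -1 and 1, so both spreads should sit near 2: averaging 50 draws buys no concentration at all, whereas with finite variance the spread of the mean would shrink like 1/\sqrt{n}.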

The oldest thread runs back to Abraham de Moivre, who around 1733 found that binomial probabilities for a fair coin could be approximated by what we now call a Gaussian curve — the special case X_i \sim \mathrm{Bernoulli}(\tfrac12), later polished by Laplace into the de Moivre–Laplace theorem. For a century the general result was believed but not rigorously nailed down; the missing pieces were sharp hypotheses. Lyapunov (1901) and Lindeberg (1922) supplied conditions under which even non-identically-distributed summands still add up to a bell, and Lévy's continuity theorem gave the clean Fourier-analytic proof machine we used above.

The grand name came late. In 1920 George Pólya wrote of the "zentraler Grenzwertsatz" — the central limit theorem — where "central" modifies theorem: this is the theorem at the centre of probability theory, not a theorem about a centre. The name stuck, and the pun-that-isn't has confused students ever since.