The Laws of Large Numbers

A casino does not know whether you will win tonight. It does not care. On any single spin of the roulette wheel the house might lose badly, and often does. What the casino knows — what it stakes its entire existence on — is that across the millions of spins it will host this year, the average outcome will settle, with crushing reliability, onto the number that the odds dictate. Its edge is a couple of cents on the dollar. Over a handful of bets that edge is invisible noise; over a million bets it is a fortune as certain as sunrise.

That certainty is not folklore. It is a theorem — in fact a pair of them — and they are the mathematical bedrock beneath every insurer setting a premium, every pollster quoting a margin, and every physicist averaging a noisy measurement. They are the Laws of Large Numbers, and this page is about what they really say (which is subtler and more interesting than the folk version the casino would give you).

Here is the setup for the whole page. Let X_1, X_2, X_3, \dots be a sequence of independent, identically distributed (i.i.d.) random variables — think of the same random experiment repeated forever under identical, non-interfering conditions. Independence and the product structure that makes an infinite i.i.d. sequence a legitimate object live in independence and product measures. Write their common mean as

\mu = \mathbb{E}[X_1] = \int_\Omega X_1 \, d\mathbb{P},

and form the sample mean (running average) of the first n of them:

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.

The Laws of Large Numbers answer one question: as n \to \infty, does \bar{X}_n converge to \mu? The answer is yes — but "converge" is a loaded word for random objects, and the whole richness of the topic is that there are two honest answers, in two different modes of convergence.

Two laws, because there are two modes of convergence

A sequence of random variables can approach a limit in several inequivalent senses — the subject of modes of convergence. The two that matter here are convergence in probability and almost-sure convergence, and each gives its own law.

If X_1, X_2, \dots are i.i.d. with \mathbb{E}|X_1| < \infty and mean \mu, then \bar{X}_n \to \mu in probability:

\forall\, \varepsilon > 0: \quad \mathbb{P}\bigl(\,|\bar{X}_n - \mu| \ge \varepsilon\,\bigr) \xrightarrow[n\to\infty]{} 0.

Under the same hypothesis \mathbb{E}|X_1| < \infty (Kolmogorov), \bar{X}_n \to \mu almost surely:

\mathbb{P}\Bigl(\,\lim_{n\to\infty} \bar{X}_n = \mu\,\Bigr) = 1.

Read the two boxes side by side, because the difference between them is the entire drama of this page.

Almost-sure convergence is strictly stronger: it implies convergence in probability, so the SLLN implies the WLLN. The names are earned. And notice the SLLN delivers more (a statement about entire sample paths) under the same integrability hypothesis — which is exactly why it is the deep theorem and the WLLN is the easy one.

The weak law, proved in four lines (finite variance)

When the variables have a finite variance \sigma^2 = \operatorname{Var}(X_1) < \infty, the WLLN falls out of one inequality. First, the variance of the sample mean. Because the X_i are independent, variances add, and pulling the 1/n out squares it:

\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\!\Bigl(\tfrac1n \textstyle\sum_{i=1}^n X_i\Bigr) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}.

There is the engine of the whole subject: averaging n independent copies divides the variance by n. The spread of \bar{X}_n shrinks to zero. Now feed that into Chebyshev's inequality, which bounds the probability of a deviation by the variance:

\mathbb{P}\bigl(|\bar{X}_n - \mu| \ge \varepsilon\bigr) \;\le\; \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} \;=\; \frac{\sigma^2}{n\,\varepsilon^2} \;\xrightarrow[n\to\infty]{}\; 0.

Done. For every fixed \varepsilon > 0 the right-hand side goes to zero like 1/n, which is precisely \bar{X}_n \to \mu in probability. The bound even tells you how fast: to be sure of landing within \varepsilon with confidence 1-\delta, it suffices to take n \ge \sigma^2/(\delta\varepsilon^2).

The finite-variance assumption is a convenience, not a necessity. Khinchin's WLLN keeps the conclusion \bar{X}_n \to \mu in probability assuming only a finite mean, \mathbb{E}|X_1| < \infty — no second moment required. The proof trades Chebyshev for a truncation-and-characteristic-function argument, but the payoff is the same clean statement.

Roll a fair six-sided die n times and average the results. Each roll has

\mu = \frac{1+2+3+4+5+6}{6} = 3.5, \qquad \sigma^2 = \frac{1}{6}\sum_{k=1}^{6}(k-3.5)^2 = \frac{35}{12} \approx 2.917.

So \operatorname{Var}(\bar{X}_n) = \tfrac{35}{12n}. After n = 1000 rolls the standard deviation of the average is \sqrt{35/12000} \approx 0.054 — the running average is almost surely hugging 3.5. Chebyshev even guarantees \mathbb{P}(|\bar{X}_{1000} - 3.5| \ge 0.2) \le \tfrac{35/12}{1000\cdot 0.04} \approx 0.073, and the truth is far tighter still. Watch this happen live in the diagram below.

The strong law — the deep one

The strong law is a genuinely harder theorem, and it is worth pausing on why. The weak law controls one n at a time. The strong law must control all sufficiently large n simultaneously: it asserts that, for almost every run, there is some threshold N(\omega) beyond which \bar{X}_n(\omega) never again strays more than \varepsilon from \mu. Ruling out the possibility of infinitely many future excursions — however rare each one individually is — is the real work.

Let X_1, X_2, \dots be i.i.d. Then \bar{X}_n converges almost surely to a finite constant if and only if \mathbb{E}|X_1| < \infty, in which case the limit is \mu = \mathbb{E}[X_1].

This is a razor-sharp characterisation: a finite first moment is exactly the right condition — necessary as well as sufficient. Nothing about variance appears; the theorem reaches heavy-tailed distributions that Chebyshev cannot touch, as long as the mean itself exists.

The idea, in one breath: the natural quantity to control is the tail event \{|\bar{X}_n - \mu| \ge \varepsilon\}. If one can show these events happen only finitely often — that \sum_n \mathbb{P}(|\bar{X}_n - \mu| \ge \varepsilon) < \infty — then the first Borel–Cantelli lemma says that, with probability one, only finitely many of them occur, so the average is eventually trapped inside the window. The catch is that the WLLN bound \sigma^2/(n\varepsilon^2) sums to +\infty (the harmonic series), so it is not enough by itself. Kolmogorov's genius was to sharpen the probability estimates (via his maximal inequality) until the sum converges.

If you are willing to assume a finite fourth moment, \mathbb{E}[X_1^4] < \infty, the strong law becomes a homework exercise. Centre the variables so \mu = 0. Expanding \mathbb{E}[(\sum_{i=1}^n X_i)^4] and using independence, the only surviving terms are the n pure fourth powers and the 3n(n-1) cross terms \mathbb{E}[X_i^2 X_j^2], so it is O(n^2). Hence

\mathbb{E}[\bar{X}_n^4] = \frac{1}{n^4}\,\mathbb{E}\Bigl[\bigl(\textstyle\sum_i X_i\bigr)^4\Bigr] = O\!\left(\frac{1}{n^2}\right), \qquad \sum_{n=1}^{\infty} \mathbb{E}[\bar{X}_n^4] < \infty.

A finite sum of non-negative terms means the summand \bar{X}_n^4 \to 0 almost surely (Markov + Borel–Cantelli, or just monotone convergence on the sum), and therefore \bar{X}_n \to 0 a.s. The whole apparatus of the strong law, in miniature — the price is the extra moment, which Kolmogorov's full theorem does away with.

Watch it converge

Below is a single sample run of a random experiment. The jagged curve is the running average \bar{X}_n as a function of the number of trials n; the dashed horizontal line is the true mean \mu; and the pale band is the \pm\,\sigma/\sqrt{n} envelope — the one-standard-deviation spread of the average, which pinches inward like 1/\sqrt{n}.

Reseed the run and the early wiggles are different every time, but the destination never is — the tail of the curve always creeps onto \mu. Switch to the coin and the average is the relative frequency of heads, converging on 0.5 (the frequentist foundation, just below). Then switch to Cauchy and watch the promise break: the curve lurches and never settles, because a Cauchy variable has no mean at all — the hypothesis of every law on this page fails, so the conclusion is allowed to fail too.

Notice the two shrinking things do different jobs. The envelope narrowing tells you the average is concentrating (the LLN). But that same envelope is \pm\sigma/\sqrt{n} wide, not \pm\sigma/n — the residual wobble decays only like 1/\sqrt{n}, and its shape, once magnified, is a bell curve. Quantifying that leftover fluctuation is the job of the Central Limit Theorem, the natural sequel: the LLN says the average lands on \mu; the CLT tells you the precise Gaussian shape of how it lands.

Why it matters: frequencies, estimators, and Monte Carlo

The frequentist foundation of probability. Take X_i = \mathbf{1}_{A}, the indicator of some event A on trial i — a Bernoulli variable equal to 1 when A happens and 0 otherwise. Then \mu = \mathbb{E}[\mathbf{1}_A] = \mathbb{P}(A), and the sample mean \bar{X}_n is literally the relative frequency of A in n trials. The strong law then says

\frac{\#\{i \le n : A \text{ occurred}\}}{n} \xrightarrow{\text{a.s.}} \mathbb{P}(A).

This is the theorem (Borel's, historically) that makes the intuitive slogan "probability is long-run frequency" into mathematics. It is why a fair coin's heads-fraction tends to 1/2 and an insurer's realised claim rate tends to the modelled probability.

Consistency of estimators. A statistician who estimates an unknown mean by the sample average is relying on exactly the LLN: as data accumulates, the estimator \bar{X}_n converges to the truth \mu. In the language of statistics, the sample mean is a consistent estimator. More sample means less error — the whole enterprise of learning from data leans on this.

Monte Carlo integration. To compute a hard integral I = \int_{[0,1]^d} g(x)\,dx = \mathbb{E}[g(U)] with U uniform, draw i.i.d. samples U_1, \dots, U_n and average: \bar{g}_n = \tfrac1n\sum_i g(U_i). The LLN guarantees \bar{g}_n \to I, and the error shrinks like 1/\sqrt{n} regardless of the dimension d — the reason Monte Carlo beats grid methods in high dimensions, from physics to finance.

Both laws demand a finite mean. Take i.i.d. standard Cauchy variables, with density \tfrac{1}{\pi(1+x^2)}. The tails are so heavy that \mathbb{E}|X_1| = \infty — there is no mean. A remarkable fact: for Cauchy variables, \bar{X}_n has the same Cauchy distribution as a single X_1, for every n. Averaging buys you nothing; \bar{X}_n does not converge to any constant, in any mode. No mean, no law.

The single most abused theorem in all of probability. Three traps:

Write a real number x \in [0,1] in binary, x = 0.b_1 b_2 b_3 \dots. Call x normal (in base 2) if the digit 1 occurs with limiting frequency exactly \tfrac12: \frac1n\sum_{k=1}^n b_k \to \tfrac12. Borel's normal number theorem (1909) says that almost every number — all but a set of Lebesgue measure zero — is normal.

This is secretly a strong law of large numbers! Choose x uniformly at random from [0,1]; then its binary digits b_1, b_2, \dots are exactly i.i.d. fair coin flips (mean \tfrac12). The SLLN says the digit average converges to \tfrac12 with probability one — and "probability one" for the uniform distribution is "Lebesgue measure one." So the abstract almost-sure statement of the SLLN and the concrete measure-theoretic statement that almost every real is normal are the very same fact wearing two costumes. (And yet exhibiting one specific normal number is famously hard — we know almost all of them exist without easily naming many. Measure theory sees the forest; the individual trees stay elusive.)