The Laws of Large Numbers
A casino does not know whether you will win tonight. It does not care. On any single spin of
the roulette wheel the house might lose badly, and often does. What the casino knows — what it stakes
its entire existence on — is that across the millions of spins it will host this year, the
average outcome will settle, with crushing reliability, onto the number that the odds
dictate. Its edge is a couple of cents on the dollar. Over a handful of bets that edge is invisible
noise; over a million bets it is a fortune as certain as sunrise.
That certainty is not folklore. It is a theorem — in fact a pair of them — and they
are the mathematical bedrock beneath every insurer setting a premium, every pollster quoting a margin,
and every physicist averaging a noisy measurement. They are the Laws of Large Numbers,
and this page is about what they really say (which is subtler and more interesting than the folk version
the casino would give you).
Here is the setup for the whole page. Let X_1, X_2, X_3, \dots be a sequence
of independent, identically distributed (i.i.d.) random variables — think of the same
random experiment repeated forever under identical, non-interfering conditions. Independence and the
product structure that makes an infinite i.i.d. sequence a legitimate object live in
independence and product measures.
Write their common mean as
\mu = \mathbb{E}[X_1] = \int_\Omega X_1 \, d\mathbb{P},
and form the sample mean (running average) of the first n
of them:
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.
The Laws of Large Numbers answer one question: as n \to \infty, does
\bar{X}_n converge to \mu? The answer is
yes — but "converge" is a loaded word for random objects, and the whole richness of the topic is that
there are two honest answers, in two different modes of convergence.
Two laws, because there are two modes of convergence
A sequence of random variables can approach a limit in several inequivalent senses — the subject of
modes of convergence.
The two that matter here are convergence in probability and almost-sure
convergence, and each gives its own law.
If X_1, X_2, \dots are i.i.d. with \mathbb{E}|X_1| < \infty and mean \mu, then \bar{X}_n \to \mu in probability:
\forall\, \varepsilon > 0: \quad \mathbb{P}\bigl(\,|\bar{X}_n - \mu| \ge \varepsilon\,\bigr) \xrightarrow[n\to\infty]{} 0.
Under the same hypothesis \mathbb{E}|X_1| < \infty (Kolmogorov), \bar{X}_n \to \mu almost surely:
\mathbb{P}\Bigl(\,\lim_{n\to\infty} \bar{X}_n = \mu\,\Bigr) = 1.
Read the two boxes side by side, because the difference between them is the entire drama of this page.
-
Weak law — "at each fixed large n, it is very likely
that \bar{X}_n is close to \mu." It quantifies
one snapshot at a time. It says nothing about whether the sequence settles down;
for all the weak law knows, \bar{X}_n could be close for a while, drift
away, come back, drift away again — as long as the probability of being far shrinks to zero for each
individual n.
-
Strong law — "with probability one, the whole trajectory
n \mapsto \bar{X}_n converges to \mu." Pick an
outcome \omega (a single infinite run of the experiment); for almost every
such \omega, the ordinary numerical sequence
\bar{X}_n(\omega) has limit \mu. It eventually
enters every \varepsilon-window and stays there forever.
Almost-sure convergence is strictly stronger: it implies convergence in probability,
so the SLLN implies the WLLN. The names are earned. And notice the SLLN delivers more
(a statement about entire sample paths) under the same integrability hypothesis — which is
exactly why it is the deep theorem and the WLLN is the easy one.
The weak law, proved in four lines (finite variance)
When the variables have a finite variance \sigma^2 = \operatorname{Var}(X_1) < \infty,
the WLLN falls out of one inequality. First, the variance of the sample mean. Because the
X_i are independent, variances add, and pulling the
1/n out squares it:
\operatorname{Var}(\bar{X}_n) = \operatorname{Var}\!\Bigl(\tfrac1n \textstyle\sum_{i=1}^n X_i\Bigr) = \frac{1}{n^2}\sum_{i=1}^n \operatorname{Var}(X_i) = \frac{1}{n^2}\cdot n\sigma^2 = \frac{\sigma^2}{n}.
There is the engine of the whole subject: averaging n independent
copies divides the variance by n. The spread of
\bar{X}_n shrinks to zero. Now feed that into Chebyshev's
inequality, which bounds the probability of a deviation by the variance:
\mathbb{P}\bigl(|\bar{X}_n - \mu| \ge \varepsilon\bigr) \;\le\; \frac{\operatorname{Var}(\bar{X}_n)}{\varepsilon^2} \;=\; \frac{\sigma^2}{n\,\varepsilon^2} \;\xrightarrow[n\to\infty]{}\; 0.
Done. For every fixed \varepsilon > 0 the right-hand side goes to zero like
1/n, which is precisely \bar{X}_n \to \mu in
probability. The bound even tells you how fast: to be sure of landing within
\varepsilon with confidence 1-\delta, it suffices
to take n \ge \sigma^2/(\delta\varepsilon^2).
The finite-variance assumption is a convenience, not a necessity. Khinchin's WLLN
keeps the conclusion \bar{X}_n \to \mu in probability assuming only a finite
mean, \mathbb{E}|X_1| < \infty — no second moment required. The proof trades
Chebyshev for a truncation-and-characteristic-function argument, but the payoff is the same clean
statement.
Roll a fair six-sided die n times and average the results. Each roll has
\mu = \frac{1+2+3+4+5+6}{6} = 3.5, \qquad \sigma^2 = \frac{1}{6}\sum_{k=1}^{6}(k-3.5)^2 = \frac{35}{12} \approx 2.917.
So \operatorname{Var}(\bar{X}_n) = \tfrac{35}{12n}. After
n = 1000 rolls the standard deviation of the average is
\sqrt{35/12000} \approx 0.054 — the running average is almost surely
hugging 3.5. Chebyshev even guarantees
\mathbb{P}(|\bar{X}_{1000} - 3.5| \ge 0.2) \le \tfrac{35/12}{1000\cdot 0.04} \approx 0.073,
and the truth is far tighter still. Watch this happen live in the diagram below.
The strong law — the deep one
The strong law is a genuinely harder theorem, and it is worth pausing on why. The weak law
controls one n at a time. The strong law must control all
sufficiently large n simultaneously: it asserts that, for almost
every run, there is some threshold N(\omega) beyond which
\bar{X}_n(\omega) never again strays more than
\varepsilon from \mu. Ruling out the possibility of
infinitely many future excursions — however rare each one individually is — is the real work.
Let X_1, X_2, \dots be i.i.d. Then \bar{X}_n
converges almost surely to a finite constant if and only if
\mathbb{E}|X_1| < \infty, in which case the limit is
\mu = \mathbb{E}[X_1].
This is a razor-sharp characterisation: a finite first moment is exactly the right condition —
necessary as well as sufficient. Nothing about variance appears; the theorem reaches heavy-tailed
distributions that Chebyshev cannot touch, as long as the mean itself exists.
The idea, in one breath: the natural quantity to control is the tail event
\{|\bar{X}_n - \mu| \ge \varepsilon\}. If one can show these events happen
only finitely often — that
\sum_n \mathbb{P}(|\bar{X}_n - \mu| \ge \varepsilon) < \infty — then the
first Borel–Cantelli lemma says that, with probability one, only finitely many of them
occur, so the average is eventually trapped inside the window. The catch is that the WLLN bound
\sigma^2/(n\varepsilon^2) sums to +\infty (the
harmonic series), so it is not enough by itself. Kolmogorov's genius was to sharpen the
probability estimates (via his maximal inequality) until the sum converges.
If you are willing to assume a finite fourth moment,
\mathbb{E}[X_1^4] < \infty, the strong law becomes a homework exercise.
Centre the variables so \mu = 0. Expanding
\mathbb{E}[(\sum_{i=1}^n X_i)^4] and using independence, the only surviving
terms are the n pure fourth powers and the
3n(n-1) cross terms \mathbb{E}[X_i^2 X_j^2], so
it is O(n^2). Hence
\mathbb{E}[\bar{X}_n^4] = \frac{1}{n^4}\,\mathbb{E}\Bigl[\bigl(\textstyle\sum_i X_i\bigr)^4\Bigr] = O\!\left(\frac{1}{n^2}\right), \qquad \sum_{n=1}^{\infty} \mathbb{E}[\bar{X}_n^4] < \infty.
A finite sum of non-negative terms means the summand \bar{X}_n^4 \to 0
almost surely (Markov + Borel–Cantelli, or just monotone convergence on the sum), and therefore
\bar{X}_n \to 0 a.s. The whole apparatus of the strong law, in miniature —
the price is the extra moment, which Kolmogorov's full theorem does away with.
Watch it converge
Below is a single sample run of a random experiment. The jagged curve is the running average
\bar{X}_n as a function of the number of trials n;
the dashed horizontal line is the true mean \mu; and the pale band is the
\pm\,\sigma/\sqrt{n} envelope — the one-standard-deviation spread of the
average, which pinches inward like 1/\sqrt{n}.
Reseed the run and the early wiggles are different every time, but the destination never is — the tail
of the curve always creeps onto \mu. Switch to the coin and
the average is the relative frequency of heads, converging on
0.5 (the frequentist foundation, just below). Then switch to
Cauchy and watch the promise break: the curve lurches and never settles, because a
Cauchy variable has no mean at all — the hypothesis of every law on this page fails, so the
conclusion is allowed to fail too.
Notice the two shrinking things do different jobs. The envelope narrowing tells you the
average is concentrating (the LLN). But that same envelope is
\pm\sigma/\sqrt{n} wide, not \pm\sigma/n — the
residual wobble decays only like 1/\sqrt{n}, and its shape, once
magnified, is a bell curve. Quantifying that leftover fluctuation is the job of the
Central Limit Theorem,
the natural sequel: the LLN says the average lands on \mu; the CLT tells you
the precise Gaussian shape of how it lands.
Why it matters: frequencies, estimators, and Monte Carlo
The frequentist foundation of probability. Take
X_i = \mathbf{1}_{A}, the indicator of some event
A on trial i — a Bernoulli variable equal to
1 when A happens and 0
otherwise. Then \mu = \mathbb{E}[\mathbf{1}_A] = \mathbb{P}(A), and the
sample mean \bar{X}_n is literally the relative frequency
of A in n trials. The strong law then says
\frac{\#\{i \le n : A \text{ occurred}\}}{n} \xrightarrow{\text{a.s.}} \mathbb{P}(A).
This is the theorem (Borel's, historically) that makes the intuitive slogan "probability is
long-run frequency" into mathematics. It is why a fair coin's heads-fraction tends to
1/2 and an insurer's realised claim rate tends to the modelled probability.
Consistency of estimators. A statistician who estimates an unknown mean by the sample
average is relying on exactly the LLN: as data accumulates, the estimator
\bar{X}_n converges to the truth \mu. In the
language of statistics, the sample mean is a consistent estimator. More sample means
less error — the whole enterprise of learning from data leans on this.
Monte Carlo integration. To compute a hard integral
I = \int_{[0,1]^d} g(x)\,dx = \mathbb{E}[g(U)] with
U uniform, draw i.i.d. samples U_1, \dots, U_n and
average: \bar{g}_n = \tfrac1n\sum_i g(U_i). The LLN guarantees
\bar{g}_n \to I, and the error shrinks like
1/\sqrt{n} regardless of the dimension d
— the reason Monte Carlo beats grid methods in high dimensions, from physics to finance.
Both laws demand a finite mean. Take i.i.d. standard Cauchy variables, with density
\tfrac{1}{\pi(1+x^2)}. The tails are so heavy that
\mathbb{E}|X_1| = \infty — there is no mean. A remarkable fact:
for Cauchy variables, \bar{X}_n has the same Cauchy
distribution as a single X_1, for every n.
Averaging buys you nothing; \bar{X}_n does not converge to
any constant, in any mode. No mean, no law.
The single most abused theorem in all of probability. Three traps:
-
The gambler's fallacy. After ten reds at roulette, the LLN does not make
black "due." Future spins are independent of the past; the wheel has no memory and no obligation to
balance the books. The law is a statement about the limit of the average, achieved by
swamping early imbalance with an ocean of fresh trials — never by correcting it.
Nothing reaches back to cancel a past streak.
-
The sum's deviations GROW; only the average shrinks. This is the subtle heart of it.
Let S_n = \sum_{i=1}^n X_i and suppose \mu = 0.
The typical size of the deviation of the sum is
\sqrt{\operatorname{Var}(S_n)} = \sigma\sqrt{n} — it grows without bound as
n \to \infty. The average is small only because
\bar{X}_n = S_n/n divides that growing
\sigma\sqrt{n} by the even-faster-growing n,
leaving \sigma/\sqrt{n} \to 0. So the "law of averages" absolutely does
not promise that heads and tails end up equal in count — that gap typically
widens like \sqrt{n}. It only promises their proportions
even out.
-
"The law of averages" is not a theorem. There is no principle guaranteeing a short
run must balance, or that a "cold" slot machine is primed to pay out. The LLN speaks only about the
limit, and says nothing about any finite stretch. Small samples are volatile — that is what
Chebyshev's \sigma^2/(n\varepsilon^2) is quietly telling you.
Write a real number x \in [0,1] in binary,
x = 0.b_1 b_2 b_3 \dots. Call x
normal (in base 2) if the digit 1 occurs with limiting
frequency exactly \tfrac12:
\frac1n\sum_{k=1}^n b_k \to \tfrac12. Borel's normal number
theorem (1909) says that almost every number — all but a set of Lebesgue measure
zero — is normal.
This is secretly a strong law of large numbers! Choose x uniformly at random
from [0,1]; then its binary digits b_1, b_2, \dots
are exactly i.i.d. fair coin flips (mean \tfrac12). The SLLN says the digit
average converges to \tfrac12 with probability one — and "probability one"
for the uniform distribution is "Lebesgue measure one." So the abstract almost-sure statement of
the SLLN and the concrete measure-theoretic statement that almost every real is normal are the very same
fact wearing two costumes. (And yet exhibiting one specific normal number is famously hard — we know
almost all of them exist without easily naming many. Measure theory sees the forest; the individual
trees stay elusive.)