Modes of Convergence
A sequence of numbers can approach a limit in only one way: x_n \to x means the
gap |x_n - x| eventually stays tiny. But a
random variable is a whole function on a probability space, and there turn out to be
several genuinely different senses in which a sequence
X_1, X_2, X_3, \dots can "settle down" to a limit X.
This page is about those senses — the four modes of convergence — and, crucially,
which one implies which.
The stakes are not academic. When a statistician builds an estimator
\hat\theta_n from n data points — a sample mean, a
maximum-likelihood fit, a regression coefficient — the whole enterprise rests on a convergence claim:
as I gather more data, my estimate homes in on the truth. The word for that is
consistency, and it is precisely
\hat\theta_n \xrightarrow{\ \mathbb{P}\ } \theta: convergence
in probability. The two great limit theorems of probability each speak in one of these modes —
the Law of Large Numbers promises the sample mean converges to the population mean
(in probability, or even almost surely), and the Central Limit Theorem promises the
rescaled error converges to a Gaussian (in distribution). To read either theorem correctly you must
know exactly what its arrow means.
Throughout, X_n and X live on a common probability
space (\Omega, \mathcal{F}, \mathbb{P}), and
\mathbb{E}[\,\cdot\,] is
expectation, the integral against the probability measure.
The four modes
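Here they are, running from the most literal to the most abstract. The order is not a strict ranking
by strength: almost-sure and L^p convergence are both stronger than convergence in probability, and
neither implies the other. Read each as a
precise mathematical statement, then read the plain-English gloss beside it.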
-
Almost surely (X_n \xrightarrow{\text{a.s.}} X):
\ \mathbb{P}\bigl(\{\omega : X_n(\omega) \to X(\omega)\}\bigr) = 1.
For all but a null set of outcomes, the numerical sequence genuinely converges, pointwise.
-
In probability (X_n \xrightarrow{\ \mathbb{P}\ } X):
for every \varepsilon > 0,
\ \mathbb{P}\bigl(|X_n - X| > \varepsilon\bigr) \to 0.
The chance of a gap bigger than \varepsilon shrinks to zero — but
which outcomes misbehave may change with n.
-
In L^p (mean-p;
X_n \xrightarrow{L^p} X), for
p \ge 1:
\ \mathbb{E}\bigl[\,|X_n - X|^p\,\bigr] \to 0.
The case p = 2 is mean-square convergence.
The average size of the error, measured in the p-th power, vanishes.
-
In distribution (weakly;
X_n \xrightarrow{\ d\ } X):
\ F_{X_n}(x) \to F_X(x) at every continuity point
x of F_X; equivalently
\mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)] for every bounded continuous
g. Only the laws converge — the shapes of the
histograms line up — and the variables need not even live on the same space.
Notice how the object being controlled drifts from the concrete to the abstract as you go down the
list. Almost-sure convergence is a statement about the actual sample paths
\omega \mapsto X_n(\omega). Convergence in probability forgets the paths and
watches only the measure of the bad set. L^p convergence weighs the
error by its magnitude and averages. And convergence in distribution has thrown away the random
variables altogether, keeping only their distribution functions. Weaker modes ask for less — which is
exactly why more sequences satisfy them.
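The "at every continuity point" clause in the definition of convergence in distribution is easy to skim past, so here is a minimal sketch of why it is there. It assumes, purely for illustration, the sequence X_n \sim \mathcal{N}(0, 1/n) with constant limit 0; the function names are hypothetical helpers, not part of the text above.

```python
import math

def normal_cdf(x, sigma):
    """CDF of N(0, sigma^2) evaluated at x."""
    return 0.5 * (1.0 + math.erf(x / (sigma * math.sqrt(2.0))))

def cdf_of_xn(x, n):
    """CDF of X_n ~ N(0, 1/n), the illustrative sequence assumed here."""
    return normal_cdf(x, 1.0 / math.sqrt(n))

def cdf_of_limit(x):
    """CDF of the constant 0: a unit step at x = 0."""
    return 1.0 if x >= 0 else 0.0

# Compare F_{X_n}(x) with the limiting step function at three points.
for n in (1, 10, 100, 10_000):
    row = [round(cdf_of_xn(x, n), 4) for x in (-0.1, 0.0, 0.1)]
    print(n, row)
print("limit", [cdf_of_limit(x) for x in (-0.1, 0.0, 0.1)])
```

At x = -0.1 and x = 0.1 the values approach 0 and 1, matching the step. At x = 0 the value is 0.5 for every n while the limit CDF equals 1 there. Since 0 is the one discontinuity of the limit CDF, the definition does not require agreement at that point, and the sequence still converges in distribution.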
The implication hierarchy
The whole subject organises itself around one picture. Three implications hold always, and
every arrow that is not drawn genuinely fails — there is a standard counterexample for
each missing arrow.
-
a.s. \Rightarrow in probability. If the paths converge
off a null set, the bad-set probabilities must decay.
-
L^p \Rightarrow in probability, by
Markov / Chebyshev:
\mathbb{P}\bigl(|X_n - X| > \varepsilon\bigr) = \mathbb{P}\bigl(|X_n - X|^p > \varepsilon^p\bigr) \le \frac{\mathbb{E}\bigl[|X_n - X|^p\bigr]}{\varepsilon^p} \to 0.
-
in probability \Rightarrow in distribution. If the
variables get close in probability, their laws must line up.
So almost sure and L^p are the two
"strong" modes; neither implies the other (an a.s.-convergent sequence can carry error mass that never shrinks, and an
L^p-convergent sequence can fail to converge at every single point). Both funnel down into
convergence in probability, the central hub, which in turn feeds the weakest mode,
convergence in distribution. Reading the diagram top-to-bottom is reading from "the
variables themselves are close" to "only their statistics are close."
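The Markov step behind "L^p implies in probability" can be checked numerically. The sketch below assumes a hypothetical family on \Omega = [0,1] with Lebesgue measure, X_n = \sqrt{n}\,\mathbf{1}_{(0,1/n)} with limit 0, chosen only because every quantity has a closed form; the helper names are illustrative.

```python
import math

def lp_error(n, p):
    """E|X_n - 0|^p for X_n = sqrt(n) * 1_(0, 1/n): height^p times width."""
    return math.sqrt(n) ** p * (1.0 / n)

def exact_tail(n, eps):
    """P(|X_n| > eps): the spike has width 1/n and counts only if its height exceeds eps."""
    return 1.0 / n if math.sqrt(n) > eps else 0.0

def markov_bound(n, eps, p):
    """Right-hand side of Markov's inequality, E|X_n|^p / eps^p."""
    return lp_error(n, p) / eps ** p

eps = 0.5
for n in (1, 4, 100, 10_000):
    print(n, exact_tail(n, eps), markov_bound(n, eps, 1), markov_bound(n, eps, 2))
```

For this family the p = 1 bound shrinks like n^{-1/2}, which forces the tail probability to zero. The p = 2 bound stays at 1/\varepsilon^2, because the sequence converges in L^1 but not in L^2; the inequality still holds, it just stops being informative.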
Why the converses fail — three famous counterexamples
The arrows point one way for a reason. Each of the following sequences satisfies a weaker mode while
violating a stronger one; memorise them and the whole hierarchy becomes unforgettable. Take
\Omega = [0,1] with Lebesgue measure as the probability.
1. In probability but NOT almost surely — the "typewriter" (sliding bump).
March a window of shrinking width across [0,1], wrapping around like a
typewriter carriage: intervals of length
1, \tfrac12, \tfrac12, \tfrac13, \tfrac13, \tfrac13, \dots that sweep the
unit interval again and again. Let X_n = \mathbf{1}_{I_n} be the indicator of
the n-th window. Since \mathbb{P}(X_n = 1) = |I_n| \to 0,
we have X_n \xrightarrow{\ \mathbb{P}\ } 0. But every point
\omega is hit by infinitely many windows and missed by infinitely many, so
X_n(\omega) flickers between 0 and
1 forever — it converges for no \omega.
Convergence in probability, but almost-sure convergence fails at every single point.
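To see the flicker concretely, here is a small sketch that builds the windows and follows one point. It assumes the windows are the half-open intervals [j/k, (j+1)/k) for k = 1, 2, 3, \dots, one concrete choice consistent with the lengths listed above; the function names are hypothetical.

```python
from fractions import Fraction
from itertools import islice

def typewriter_windows():
    """Yield (n, left, right) for windows I_n = [j/k, (j+1)/k), k = 1, 2, 3, ..."""
    n = 0
    k = 1
    while True:
        for j in range(k):
            n += 1
            yield n, Fraction(j, k), Fraction(j + 1, k)
        k += 1

def path(omega, n_max):
    """The 0/1 sequence X_1(omega), ..., X_{n_max}(omega)."""
    return [int(left <= omega < right)
            for _, left, right in islice(typewriter_windows(), n_max)]

omega = Fraction(1, 3)   # any point of [0, 1) behaves the same way
xs = path(omega, 55)     # 55 = 1 + 2 + ... + 10 covers ten full sweeps
print(xs)
print("hits:", sum(xs), "misses:", len(xs) - sum(xs))
```

With half-open windows each point of [0, 1) is hit exactly once per sweep, so ten sweeps give ten hits and forty-five misses, and both counts keep growing. The endpoint \omega = 1 is never hit under this particular convention; it is a single point of measure zero, and closing the intervals would cover it too.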
2. a.s. and in probability but NOT in L^1 — the tall skinny spike.
Let X_n = n \cdot \mathbf{1}_{(0,\,1/n)}. For any fixed
\omega > 0, once n > 1/\omega the point is
outside the spike and X_n(\omega) = 0, so
X_n \to 0 pointwise (hence a.s. and in probability). Yet the mass under each
spike is constant:
\mathbb{E}[X_n] = n \cdot \tfrac{1}{n} = 1 \not\to 0 = \mathbb{E}[0].
The spike grows tall exactly as fast as it grows thin, so its area never shrinks:
\mathbb{E}|X_n - 0| = 1 for every n, and convergence in
L^1 fails. The mass escapes "to infinity" even as the function collapses to
zero everywhere.
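Both halves of the spike example can be verified with exact arithmetic. A minimal sketch, with hypothetical helper names:

```python
from fractions import Fraction

def spike(n, omega):
    """X_n(omega) = n * 1_(0, 1/n)(omega)."""
    return n if 0 < omega < Fraction(1, n) else 0

def spike_mean(n):
    """E[X_n] = height * width of the spike, computed exactly."""
    return n * Fraction(1, n)

omega = Fraction(1, 7)
print([spike(n, omega) for n in range(1, 12)])      # the path at one fixed point
print([spike_mean(n) for n in (1, 10, 1000)])       # the expectations
```

At \omega = 1/7 the path reads 1, 2, 3, 4, 5, 6 and then 0 from n = 7 onward, while the expectation is exactly 1 at every n.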
3. In distribution but NOT in probability — a genuinely random limit.
Let X \sim \mathcal{N}(0,1) be standard normal and set
X_n = (-1)^n X. Because the normal law is symmetric,
-X has the same distribution as X, so
every X_n is standard normal and trivially
X_n \xrightarrow{\ d\ } X. But
|X_n - X| equals 0 for even
n and 2|X| for odd n, so
\mathbb{P}(|X_n - X| > 1) does not go to zero. The laws
agree perfectly while the variables stay far apart — the sharpest reminder that convergence in
distribution is a statement about distributions, not about the variables being close.
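The probability in this example has a closed form, so it can be evaluated rather than simulated. A short sketch, using the identity \mathbb{P}(|X| > a) = \operatorname{erfc}(a/\sqrt{2}) for a standard normal X; the function name is hypothetical.

```python
import math

def gap_probability(n, eps=1.0):
    """P(|X_n - X| > eps) for X_n = (-1)^n X with X ~ N(0, 1)."""
    if n % 2 == 0:
        return 0.0   # X_n = X, so the gap is identically 0
    # X_n = -X, so the gap is 2|X| and P(2|X| > eps) = P(|X| > eps / 2).
    return math.erfc(eps / (2.0 * math.sqrt(2.0)))

print([round(gap_probability(n), 4) for n in range(1, 7)])
```

The values alternate between roughly 0.617 for odd n and 0 for even n, so the sequence of probabilities has no limit, let alone the limit 0 that convergence in probability would require.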
The partial converses — where broken arrows can be repaired
The one-way street has two celebrated exits. Neither restores the full arrow, but each buys something
back under a mild extra hypothesis — and both are workhorses in proofs.
-
In probability \Rightarrow a.s. along a subsequence.
If X_n \xrightarrow{\ \mathbb{P}\ } X, then there is a subsequence
X_{n_k} \xrightarrow{\text{a.s.}} X. You cannot make the whole sequence
converge pointwise, but you can always thin it out until it does. (This is how many a.s.
statements are bootstrapped from probability statements.)
-
In distribution to a CONSTANT \Rightarrow in probability.
If X_n \xrightarrow{\ d\ } c for a constant c,
then in fact X_n \xrightarrow{\ \mathbb{P}\ } c. When the limiting law is
a point mass, "the histograms agree" and "the values are close" coincide — there is no room left for
the variable to wander. This is exactly why weak-convergence proofs of consistency (limit =
the true parameter, a constant) actually deliver convergence in probability.
The second point is the quiet hero of statistics. Weak-convergence arguments, the Central Limit
Theorem among them, hand you convergence in distribution; whenever the target is a fixed number
rather than a spread-out law, that
upgrades for free to convergence in probability, i.e. consistency.
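The subsequence repair can be made explicit on the typewriter sequence. The sketch below assumes the windows are [j/k, (j+1)/k) for k = 1, 2, 3, \dots and keeps only the first window of each sweep, [0, 1/k); this is one convenient thinning, not the only one, and the names are hypothetical.

```python
from fractions import Fraction

def first_window_index(k):
    """Index n_k of the first window of sweep k: n_k = 1 + (1 + 2 + ... + (k - 1))."""
    return 1 + k * (k - 1) // 2

def thinned(omega, k):
    """X_{n_k}(omega), where window n_k is [0, 1/k)."""
    return int(0 <= omega < Fraction(1, k))

omega = Fraction(1, 3)
print([first_window_index(k) for k in range(1, 7)])
print([thinned(omega, k) for k in range(1, 9)])
```

The kept indices are 1, 2, 4, 7, 11, 16, and at \omega = 1/3 the thinned path is 1, 1, 0, 0, 0, and zero thereafter. The same happens at every \omega > 0 once k \ge 1/\omega, so the subsequence converges to 0 except at the single point \omega = 0, which has probability zero.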
Watch it concentrate
Nothing makes the difference between the modes as vivid as watching a distribution collapse. Below is
the density of the sample mean
\bar X_n = \tfrac1n\sum_{i=1}^n Z_i of
n independent standard-normal draws
Z_i \sim \mathcal{N}(0,1). Its law is
\mathcal{N}\!\left(0, \tfrac1n\right) — a bell centred at
0 with standard deviation 1/\sqrt{n}.
As n grows, the bell squeezes inward and shoots up, dumping ever more of its
probability into a tiny neighbourhood of 0. Fix any window
(-\varepsilon, \varepsilon): the area outside it —
\mathbb{P}(|\bar X_n| > \varepsilon) — visibly drains to zero. That is
convergence in probability to 0 (the Weak Law of Large
Numbers in miniature). Because the limit is a constant, it is simultaneously
convergence in distribution to the point mass at 0, and by
the partial converse the two coincide here. It is also
L^2 convergence, since
\mathbb{E}[\bar X_n^2] = \operatorname{Var}(\bar X_n) = 1/n \to 0.
The spike counterexample above is what this picture would look like if the tail mass
refused to leave — a case where the eye sees collapse but L^1 does not.
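The draining tail can also be watched in a simulation. The sketch below is a rough Monte Carlo check under assumed settings (window \varepsilon = 0.25, 2000 trials, a fixed seed), compared against the closed form \mathbb{P}(|\bar X_n| > \varepsilon) = \operatorname{erfc}(\varepsilon\sqrt{n/2}); the function names are illustrative.

```python
import math
import random

def tail_frequency(n, eps, trials, rng):
    """Fraction of simulated sample means of n N(0,1) draws with |mean| > eps."""
    count = 0
    for _ in range(trials):
        mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        if abs(mean) > eps:
            count += 1
    return count / trials

def exact_tail(n, eps):
    """P(|N(0, 1/n)| > eps) = erfc(eps * sqrt(n / 2))."""
    return math.erfc(eps * math.sqrt(n / 2.0))

rng = random.Random(0)   # fixed seed so the run is repeatable
for n in (1, 10, 100):
    print(n, tail_frequency(n, 0.25, 2000, rng), round(exact_tail(n, 0.25), 4))
```

The simulated frequencies should track the exact values to within Monte Carlo noise, and both columns fall toward zero as n increases.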
Crossing back to expectations: the convergence theorems
A recurring need is to turn a convergence of variables into a convergence of their
expectations — to pass a limit through the integral,
\mathbb{E}[X_n] \to \mathbb{E}[X]. The spike example warns that this can
fail even under a.s. convergence: the mass can run away. The remedy is the
Monotone and Dominated Convergence Theorems,
which supply exactly the missing control.
Dominated Convergence is the bridge: if
X_n \to X almost surely (or merely in probability, via a subsequence argument)
and there is a single integrable envelope |X_n| \le Y with
\mathbb{E}[Y] < \infty, then
\mathbb{E}[X_n] \to \mathbb{E}[X] \qquad\text{and indeed}\qquad X_n \xrightarrow{L^1} X.
The dominating Y is precisely what pins the escaping mass in place — it is
the hypothesis the spike n\,\mathbf{1}_{(0,1/n)} cannot satisfy (any envelope
would need Y \ge n near 0, so
Y \notin L^1). More generally, the passage from convergence in probability up
to L^p convergence requires a uniform integrability
condition; domination is the most common way to guarantee it.
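The claim that the spike admits no integrable envelope can be worked out in two lines. In this sketch Y_* denotes the smallest possible envelope, the pointwise supremum of the spikes; the symbol is introduced here only for the calculation.

```latex
Y_*(\omega) \;=\; \sup_{n \ge 1} \, n\,\mathbf{1}_{(0,1/n)}(\omega)
            \;=\; \max\{\, n : n < 1/\omega \,\}
            \;=\; \lceil 1/\omega \rceil - 1
            \;\ge\; \frac{1}{\omega} - 1,
\qquad 0 < \omega < 1,
\\[4pt]
\mathbb{E}[Y_*] \;\ge\; \int_0^1 \Bigl( \frac{1}{\omega} - 1 \Bigr)\, d\omega \;=\; \infty .
```

Any dominating Y must satisfy Y \ge Y_*, so \mathbb{E}[Y] = \infty as well, and the hypothesis of Dominated Convergence cannot be met.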
Does convergence in distribution mean the values of X_n get close to X? No, and this is the single most common misreading of "converges in distribution." Convergence in
distribution is a statement about the laws F_{X_n} \to F_X,
not about the numbers X_n(\omega) and X(\omega)
being close. Our third counterexample, X_n = (-1)^n X with
X standard normal, makes it stark: every X_n has
exactly the same bell-shaped law as X, so
X_n \xrightarrow{\ d\ } X is immediate. Yet for odd
n the variable is the mirror image -X, at distance
2|X| from X. The sequence never settles near X; only the
distributions agree.
Two safety rails. First, "X_n \xrightarrow{\ d\ } X" is really shorthand for
"X_n \xrightarrow{\ d\ } \operatorname{Law}(X)" — you may freely replace the
limit with any other variable sharing its law. Second, the one case where distributional
closeness does force the variables together is when the limit is a constant: then, and
only then, convergence in distribution upgrades to convergence in probability.
The Laws of Large Numbers come in two grades, and the grades are exactly two of our modes. The
Weak Law asserts \bar X_n \xrightarrow{\ \mathbb{P}\ } \mu:
for each large n, the mean is probably near \mu.
The Strong Law asserts \bar X_n \xrightarrow{\text{a.s.}} \mu:
with probability one, the entire trajectory of running averages settles onto
\mu and stays there.
Because a.s. convergence implies convergence in probability, the Strong Law contains the Weak Law — but
not vice versa, and the gap is real, not pedantry. The typewriter sequence shows a process that
converges in probability while its paths never settle; the Weak Law alone cannot rule that out. The
Strong Law promises the gambler that this particular run of averages will converge, not merely
that convergence is probable at each frozen instant. That is the difference between "the bad set is
small at every time" and "the bad set of paths is null."
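The path-wise reading of the Strong Law can be illustrated, though not proved, by following a single simulated trajectory. The sketch below assumes fair coin flips (so \mu = 0.5), a fixed seed, and a finite horizon of 20,000 steps; the function names are hypothetical.

```python
import random

def running_means(n_steps, rng):
    """Running averages of fair coin flips (mu = 0.5) along one sample path."""
    total = 0
    means = []
    for i in range(1, n_steps + 1):
        total += int(rng.random() < 0.5)
        means.append(total / i)
    return means

def worst_deviation_after(means, start, mu=0.5):
    """max over m >= start of |mean_m - mu|, within the simulated horizon."""
    return max(abs(m - mu) for m in means[start - 1:])

rng = random.Random(0)   # fixed seed so the run is repeatable
means = running_means(20_000, rng)
for start in (10, 100, 1_000, 10_000):
    print(start, round(worst_deviation_after(means, start), 4))
```

The quantity printed is the worst excursion of this one path after time `start`. The Weak Law controls only the deviation at each fixed time; the Strong Law says that, with probability one, this worst-case-from-now-on excursion itself tends to zero. A finite simulation can only suggest that behaviour.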
Four traps in the modes of convergence:
-
Convergence in distribution says nothing about the variables being close. It
constrains only the laws F_{X_n} \to F_X. Do not conclude
|X_n - X| is small: as
X_n = (-1)^n X shows, the gap can equal 2|X| for every odd n.
-
Almost sure \ne in probability. "In probability" lets the
bad outcomes reshuffle with n (the typewriter); "almost surely" demands the
same paths converge. In probability is strictly weaker, recoverable only along a subsequence.
-
Pointwise convergence does not give you L^p for free.
The spike n\,\mathbf{1}_{(0,1/n)} \to 0 everywhere, yet
\mathbb{E}[X_n] = 1. To cross from in-probability / a.s. back to
L^p you need a uniform integrability bridge — most often a
dominating integrable envelope (Dominated Convergence).
-
Chebyshev bounds a probability, not a value.
\mathbb{P}(|X - \mu| \ge k\sigma) \le 1/k^2 is an inequality; it never
says the deviation equals anything, only that large deviations are rare. Quoting it as an
equality is a classic slip.
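A quick comparison shows how far the Chebyshev bound can sit from the actual probability. The sketch assumes a normal X purely as an example, since its exact tail is \operatorname{erfc}(k/\sqrt{2}); the bound itself needs only finite variance. The function names are illustrative.

```python
import math

def chebyshev_bound(k):
    """Upper bound on P(|X - mu| >= k * sigma), valid for any finite-variance X."""
    return 1.0 / k ** 2

def normal_tail(k):
    """Exact P(|X - mu| >= k * sigma) when X is normal (an assumed example)."""
    return math.erfc(k / math.sqrt(2.0))

for k in (1, 2, 3):
    print(k, chebyshev_bound(k), round(normal_tail(k), 4))
```

At k = 2 the bound is 0.25 while the normal tail is about 0.0455. The inequality is correct and the gap is large, which is exactly why reading it as an equality misleads.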