Modes of Convergence

A single number can approach a limit in only one way: x_n \to x means the gap |x_n - x| eventually stays tiny. But a random variable is a whole function on a probability space, and there turn out to be several genuinely different senses in which a sequence X_1, X_2, X_3, \dots can "settle down" to a limit X. This page is about those senses — the four modes of convergence — and, crucially, which one implies which.

The stakes are not academic. When a statistician builds an estimator \hat\theta_n from n data points — a sample mean, a maximum-likelihood fit, a regression coefficient — the whole enterprise rests on a convergence claim: as I gather more data, my estimate homes in on the truth. The word for that is consistency, and it is precisely \hat\theta_n \xrightarrow{\ \mathbb{P}\ } \theta: convergence in probability. The two great limit theorems of probability each speak in one of these modes — the Law of Large Numbers promises the sample mean converges to the population mean (in probability, or even almost surely), and the Central Limit Theorem promises the rescaled error converges to a Gaussian (in distribution). To read either theorem correctly you must know exactly what its arrow means.

Throughout, X_n and X live on a common probability space (\Omega, \mathcal{F}, \mathbb{P}), and \mathbb{E}[\,\cdot\,] is expectation, the integral against the probability measure.

The four modes

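Here they are, running roughly from the most literal to the most abstract. (The order is not a strict ranking by strength: almost-sure and L^p convergence are not comparable, as the hierarchy section explains.) Read each as a precise mathematical statement, then read the plain-English gloss that follows it.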

Almost surely (X_n \xrightarrow{\text{a.s.}} X): \ \mathbb{P}\bigl(\{\omega : X_n(\omega) \to X(\omega)\}\bigr) = 1. For all but a null set of outcomes, the numerical sequence genuinely converges, pointwise.
In probability (X_n \xrightarrow{\ \mathbb{P}\ } X): for every \varepsilon > 0, \ \mathbb{P}\bigl(|X_n - X| > \varepsilon\bigr) \to 0. The chance of a gap bigger than \varepsilon shrinks to zero — but which outcomes misbehave may change with n.
In L^p (mean-p; X_n \xrightarrow{L^p} X), for p \ge 1: \ \mathbb{E}\bigl[\,|X_n - X|^p\,\bigr] \to 0. The case p = 2 is mean-square convergence. The average size of the error, measured in the p-th power, vanishes.
In distribution (weakly; X_n \xrightarrow{\ d\ } X): \ F_{X_n}(x) \to F_X(x) at every continuity point x of F_X; equivalently \mathbb{E}[g(X_n)] \to \mathbb{E}[g(X)] for every bounded continuous g. Only the laws converge — the shapes of the histograms line up — and the variables need not even live on the same space.

Notice how the object being controlled drifts from the concrete to the abstract as you go down the list. Almost-sure convergence is a statement about the actual sample paths \omega \mapsto X_n(\omega). Convergence in probability forgets the paths and watches only the measure of the bad set. L^p convergence weighs the error by its magnitude and averages. And convergence in distribution has thrown away the random variables altogether, keeping only their distribution functions. Weaker modes ask for less — which is exactly why more sequences satisfy them.
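The restriction to continuity points in the definition of convergence in distribution is easy to overlook, so here is a minimal sketch of why it is there. It uses an illustrative example that is not from the text above: the constant variables X_n = 1/n, which certainly ought to converge to X = 0. The helper name cdf_point_mass is made up for this sketch.

```python
def cdf_point_mass(c):
    """Return the CDF of the constant random variable equal to c."""
    return lambda x: 1.0 if x >= c else 0.0

F_limit = cdf_point_mass(0.0)  # law of the limit X = 0

for x in (-0.5, 0.0, 0.05, 0.5):
    # F_{X_n}(x) for X_n = 1/n at a few values of n
    values = [cdf_point_mass(1.0 / n)(x) for n in (1, 10, 100, 1000)]
    print(f"x = {x:+.2f}   F_n(x) for n = 1, 10, 100, 1000: {values}   F_X(x) = {F_limit(x)}")
```

At x = 0, the single discontinuity of F_X, the values F_{X_n}(0) stay at 0 while F_X(0) = 1. At every other x the values eventually match. Demanding convergence only at continuity points is what lets this sequence count as convergent in distribution.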

The implication hierarchy

The whole subject organises itself around one picture. Three implications hold always, and every arrow that is not drawn genuinely fails — there is a standard counterexample for each missing arrow.

a.s. \Rightarrow in probability. If the paths converge off a null set, the bad-set probabilities must decay.
L^p \Rightarrow in probability, by Markov / Chebyshev: \mathbb{P}\bigl(|X_n - X| > \varepsilon\bigr) = \mathbb{P}\bigl(|X_n - X|^p > \varepsilon^p\bigr) \le \frac{\mathbb{E}\bigl[|X_n - X|^p\bigr]}{\varepsilon^p} \to 0.
in probability \Rightarrow in distribution. If the variables get close in probability, their laws must line up.
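The Markov step in the second implication can be checked numerically on a toy case. The sketch below assumes an illustrative error |X_n - X| = U/n with U uniform on (0,1); this choice is not from the text and is made only because both sides of the inequality then have closed forms.

```python
def exact_tail(n, eps):
    """P(U/n > eps) for U ~ Uniform(0,1), i.e. P(U > n*eps)."""
    return max(0.0, 1.0 - n * eps)

def markov_bound(n, eps, p):
    """E[(U/n)^p] / eps^p, using E[U^p] = 1/(p+1)."""
    return (1.0 / ((p + 1) * n**p)) / eps**p

eps, p = 0.05, 2
for n in (1, 5, 10, 20, 40):
    tail, bound = exact_tail(n, eps), markov_bound(n, eps, p)
    assert tail <= bound  # Markov's inequality
    print(f"n = {n:3d}   P(|X_n - X| > eps) = {tail:.3f}   bound = {bound:.3f}")
```

The bound is crude for small n (it can exceed 1), but it tends to zero, and that alone forces the tail probability to zero.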

So almost sure and L^p are the two "strong" modes; neither implies the other (an a.s.-convergent sequence can keep a non-vanishing, even infinite, L^p error, and an L^p-convergent sequence can fail to converge at any single point). Both funnel down into convergence in probability, the central hub, which in turn feeds the weakest mode, convergence in distribution. Reading the diagram top-to-bottom is reading from "the variables themselves are close" to "only their statistics are close."

Why the converses fail — three famous counterexamples

The arrows point one way for a reason. Each of the following sequences satisfies a weaker mode while violating a stronger one; memorise them and the whole hierarchy becomes unforgettable. Take \Omega = [0,1] with Lebesgue measure as the probability.

1. In probability but NOT almost surely — the "typewriter" (sliding bump). March a window of shrinking width across [0,1], wrapping around like a typewriter carriage: intervals of length 1, \tfrac12, \tfrac12, \tfrac13, \tfrac13, \tfrac13, \dots that sweep the unit interval again and again. Let X_n = \mathbf{1}_{I_n} be the indicator of the n-th window. Since \mathbb{P}(X_n = 1) = |I_n| \to 0, we have X_n \xrightarrow{\ \mathbb{P}\ } 0. But every point \omega is hit by infinitely many windows and missed by infinitely many, so X_n(\omega) flickers between 0 and 1 forever — it converges for no \omega. Convergence in probability, but almost-sure convergence fails at every single point.
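The typewriter can be built explicitly. The following is a minimal sketch under two assumptions not fixed by the text: the windows in block k are the closed intervals [j/k, (j+1)/k], and the outcome omega = 1/3 is an arbitrary choice. Exact fractions avoid floating-point edge cases at the endpoints.

```python
from fractions import Fraction
from itertools import count, islice

def typewriter_windows():
    """Yield I_1, I_2, ... as (left, right): block k holds k windows of length 1/k."""
    for k in count(1):
        for j in range(k):
            yield Fraction(j, k), Fraction(j + 1, k)

omega = Fraction(1, 3)                               # one fixed outcome (arbitrary)
windows = list(islice(typewriter_windows(), 5050))   # blocks 1..100, since 5050 = 100*101/2
lengths = [right - left for left, right in windows]  # P(X_n = 1) = |I_n|
path = [1 if left <= omega <= right else 0 for left, right in windows]

hits = [n for n, x in enumerate(path, start=1) if x == 1]
print("P(X_n = 1) at n = 1, 10, 100, 5050:",
      [float(lengths[n - 1]) for n in (1, 10, 100, 5050)])
print("number of hits so far:", len(hits), "  latest hit at n =", hits[-1])
```

The window lengths shrink to zero, yet the path of omega is still being hit in the very last block: every sweep across [0,1] passes over it once more.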

2. a.s. and in probability but NOT in L^1 — the tall skinny spike. Let X_n = n \cdot \mathbf{1}_{(0,\,1/n)}. For any fixed \omega > 0, once n > 1/\omega the point is outside the spike and X_n(\omega) = 0, so X_n \to 0 pointwise (hence a.s. and in probability). Yet the mass under each spike is constant: \mathbb{E}[X_n] = n \cdot \tfrac{1}{n} = 1 \not\to 0 = \mathbb{E}[0]. The spike grows tall exactly as fast as it grows thin, so its area never leaves — convergence in L^1 fails. The energy escapes "to infinity" even as the function collapses to zero everywhere.
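The two halves of the spike argument can be checked exactly. This sketch assumes an arbitrary fixed outcome omega = 1/50; the function names are made up for illustration.

```python
from fractions import Fraction

def spike(n, omega):
    """X_n(omega) = n on the open interval (0, 1/n), and 0 elsewhere."""
    return n if 0 < omega < Fraction(1, n) else 0

def spike_mean(n):
    """E[X_n] = height * width of the spike, computed exactly."""
    return n * Fraction(1, n)

omega = Fraction(1, 50)
print([spike(n, omega) for n in (1, 10, 49, 50, 51, 1000)])  # [1, 10, 49, 0, 0, 0]
print([int(spike_mean(n)) for n in (1, 10, 1000)])           # [1, 1, 1]
```

The path at omega climbs while the spike still covers it, then drops to 0 for good, while the mean never moves off 1.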

3. In distribution but NOT in probability — a genuinely random limit. Let X \sim \mathcal{N}(0,1) be standard normal and set X_n = (-1)^n X. Because the normal law is symmetric, -X has the same distribution as X, so every X_n is standard normal and trivially X_n \xrightarrow{\ d\ } X. But |X_n - X| equals 0 for even n and 2|X| for odd n, so \mathbb{P}(|X_n - X| > 1) does not go to zero. The laws agree perfectly while the variables stay far apart — the sharpest reminder that convergence in distribution is a statement about distributions, not about the variables being close.
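The non-vanishing probability in the third counterexample has a closed form, which the sketch below evaluates with the standard library's NormalDist. The function name is illustrative.

```python
from statistics import NormalDist

std_normal = NormalDist(0.0, 1.0)

def prob_gap_exceeds(n, eps):
    """P(|X_n - X| > eps) for X_n = (-1)^n X with X standard normal."""
    if n % 2 == 0:
        return 0.0  # X_n = X, so the gap is identically zero
    # Odd n: |X_n - X| = 2|X|, and P(2|X| > eps) = 2 * (1 - Phi(eps / 2))
    return 2.0 * (1.0 - std_normal.cdf(eps / 2.0))

for n in (1, 2, 3, 4, 101):
    print(n, round(prob_gap_exceeds(n, 1.0), 4))
```

For every odd n the probability is the same constant, about 0.62, so the sequence of probabilities oscillates and cannot tend to zero.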

The partial converses — where broken arrows can be repaired

The one-way street has two celebrated exits. Neither restores the full arrow, but each buys something back under a mild extra hypothesis — and both are workhorses in proofs.

In probability \Rightarrow a.s. along a subsequence. If X_n \xrightarrow{\ \mathbb{P}\ } X, then there is a subsequence X_{n_k} \xrightarrow{\text{a.s.}} X. You cannot make the whole sequence converge pointwise, but you can always thin it out until it does. (This is how many a.s. statements are bootstrapped from probability statements.)
In distribution to a CONSTANT \Rightarrow in probability. If X_n \xrightarrow{\ d\ } c for a constant c, then in fact X_n \xrightarrow{\ \mathbb{P}\ } c. When the limiting law is a point mass, "the histograms agree" and "the values are close" coincide — there is no room left for the variable to wander. This is exactly why weak-convergence proofs of consistency (limit = the true parameter, a constant) actually deliver convergence in probability.
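The subsequence repair can be seen on the typewriter sequence itself. The sketch below assumes closed windows [j/k, (j+1)/k] in block k and picks, as one possible subsequence, the first window of each block; the outcome omega = 1/7 is arbitrary and the helper names are made up for illustration.

```python
from fractions import Fraction

def window(n):
    """Return the n-th typewriter window (1-indexed) as (left, right)."""
    k = 1
    while n > k:   # skip whole blocks: block k contains k windows
        n -= k
        k += 1
    return Fraction(n - 1, k), Fraction(n, k)

def x(n, omega):
    """X_n(omega): indicator that omega lies in the n-th window."""
    left, right = window(n)
    return 1 if left <= omega <= right else 0

def first_index_of_block(k):
    """Index n_k of the first window of block k, which is [0, 1/k]."""
    return k * (k - 1) // 2 + 1

omega = Fraction(1, 7)
sub = [x(first_index_of_block(k), omega) for k in range(1, 21)]
print(sub)  # seven 1s followed by thirteen 0s
```

Along n_k the windows [0, 1/k] shrink toward the single point 0, so for every omega > 0 the thinned path is eventually 0, even though the full path keeps flickering.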

The second point is the quiet hero of statistics. Many limit arguments (characteristic functions and the other weak-convergence machinery behind the Central Limit Theorem) deliver convergence in distribution; whenever the target is a fixed number rather than a spread-out law, that upgrades for free to convergence in probability, i.e. consistency.

Watch it concentrate

Nothing makes the difference between the modes as vivid as watching a distribution collapse. Consider the density of the sample mean \bar X_n = \tfrac1n\sum_{i=1}^n Z_i of n independent standard-normal draws Z_i \sim \mathcal{N}(0,1). Its law is \mathcal{N}\!\left(0, \tfrac1n\right): a bell centred at 0 with standard deviation 1/\sqrt{n}.

As n grows, the bell squeezes inward and shoots up, dumping ever more of its probability into a tiny neighbourhood of 0. Fix any window (-\varepsilon, \varepsilon): the area outside it, \mathbb{P}(|\bar X_n| > \varepsilon), drains to zero. That is convergence in probability to 0 (the Weak Law of Large Numbers in miniature). Because the limit is a constant, it is simultaneously convergence in distribution to the point mass at 0, and by the partial converse the two coincide here. It is also L^2 convergence, since \mathbb{E}[\bar X_n^2] = \operatorname{Var}(\bar X_n) = 1/n \to 0. The spike counterexample above is what this picture would look like if the tail mass refused to leave: a case where the eye sees collapse but L^1 does not.
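The draining tail area can be tabulated exactly, since \bar X_n has law N(0, 1/n). This is a minimal sketch; the window half-width eps = 0.1 is an arbitrary choice and the function name is illustrative.

```python
from statistics import NormalDist

def tail_prob(n, eps):
    """P(|mean of n standard normals| > eps), using that the mean is N(0, 1/n)."""
    law = NormalDist(mu=0.0, sigma=1.0 / n**0.5)
    return 2.0 * (1.0 - law.cdf(eps))

eps = 0.1
probs = [tail_prob(n, eps) for n in (1, 10, 100, 1000, 10000)]
for n, prob in zip((1, 10, 100, 1000, 10000), probs):
    # Second column: the L^2 error E[mean^2] = Var(mean) = 1/n
    print(f"n = {n:6d}   P(|mean| > {eps}) = {prob:.4g}   E[mean^2] = {1.0 / n:.4g}")
```

Both columns fall to zero: the first is convergence in probability, the second is convergence in L^2.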

Crossing back to expectations: the convergence theorems

A recurring need is to turn a convergence of variables into a convergence of their expectations: to pass a limit through the integral, \mathbb{E}[X_n] \to \mathbb{E}[X]. The spike example warns that this can fail even under a.s. convergence: the mass can run away. The remedies are the Monotone and Dominated Convergence Theorems, which supply exactly the missing control.

Dominated Convergence is the bridge: if X_n \to X almost surely (or merely in probability, by a subsequence argument) and there is a single integrable envelope |X_n| \le Y for all n with \mathbb{E}[Y] < \infty, then \mathbb{E}[X_n] \to \mathbb{E}[X] \qquad\text{and indeed}\qquad X_n \xrightarrow{L^1} X. The dominating Y is precisely what pins the escaping mass in place. It is the hypothesis the spike n\,\mathbf{1}_{(0,1/n)} cannot satisfy: any envelope would need Y \ge n on the interval (1/(n+1), 1/n) for every n, so \mathbb{E}[Y] \ge \sum_n n\bigl(\tfrac1n - \tfrac1{n+1}\bigr) = \sum_n \tfrac{1}{n+1} = \infty and Y \notin L^1. More generally, the passage from convergence in probability up to L^p convergence requires a uniform integrability condition; domination is the most common way to guarantee it.
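The envelope obstruction can be made concrete by integrating the smallest possible envelope, sup_n X_n, which equals the height at index k on the interval (1/(k+1), 1/k). The sketch below compares the spike with a hypothetical tempered variant, sqrt(n) times the same indicator, which is not in the text and is introduced only for contrast.

```python
import math

def envelope_mass(height, K):
    """Integral over (1/(K+1), 1) of sup_n height(n) * 1_{(0,1/n)}, for increasing height.

    On (1/(k+1), 1/k) the supremum equals height(k), so the integral is a sum.
    """
    return sum(height(k) * (1.0 / k - 1.0 / (k + 1)) for k in range(1, K + 1))

for K in (10, 1000, 100000):
    spike_env = envelope_mass(lambda n: n, K)   # heights n: partial harmonic sums
    tempered_env = envelope_mass(math.sqrt, K)  # heights sqrt(n): a convergent series
    print(f"K = {K:6d}   spike envelope = {spike_env:7.3f}   tempered envelope = {tempered_env:.3f}")
```

The spike's envelope mass grows like log K without bound, so no integrable Y exists. The tempered envelope stays bounded, Dominated Convergence applies, and indeed E[sqrt(n) 1_{(0,1/n)}] = 1/sqrt(n) tends to 0.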

Does convergence in distribution mean that X_n itself gets close to X? No, and this is the single most common misreading of "converges in distribution." Convergence in distribution is a statement about the laws F_{X_n} \to F_X, not about the numbers X_n(\omega) and X(\omega) being close. Our third counterexample, X_n = (-1)^n X with X standard normal, makes it stark: every X_n has exactly the same bell-shaped law as X, so X_n \xrightarrow{\ d\ } X is immediate. Yet for odd n the variable is the mirror image -X, at distance 2|X| from X. The variables never get close; only their distributions do.

Two safety rails. First, "X_n \xrightarrow{\ d\ } X" is really shorthand for "X_n \xrightarrow{\ d\ } \operatorname{Law}(X)" — you may freely replace the limit with any other variable sharing its law. Second, the one case where distributional closeness does force the variables together is when the limit is a constant: then, and only then, convergence in distribution upgrades to convergence in probability.

The Laws of Large Numbers come in two grades, and the grades are exactly two of our modes. The Weak Law asserts \bar X_n \xrightarrow{\ \mathbb{P}\ } \mu: for each large n, the mean is probably near \mu. The Strong Law asserts \bar X_n \xrightarrow{\text{a.s.}} \mu: with probability one, the entire trajectory of running averages settles onto \mu and stays there.

Because a.s. convergence implies convergence in probability, the Strong Law contains the Weak Law — but not vice versa, and the gap is real, not pedantry. The typewriter sequence shows a process that converges in probability while its paths never settle; the Weak Law alone cannot rule that out. The Strong Law promises the gambler that this particular run of averages will converge, not merely that convergence is probable at each frozen instant. That is the difference between "the bad set is small at every time" and "the bad set of paths is null."
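The path-wise reading of the Strong Law can be illustrated, though not proved, by following a single simulated trajectory of running averages and recording the worst deviation over a whole tail of indices rather than at one frozen n. The sketch assumes standard-normal draws (so \mu = 0), a fixed seed, and arbitrary sample sizes.

```python
import random

rng = random.Random(0)  # fixed seed so the run is reproducible
n_total, tail_start = 20000, 5000

running_sum, worst_tail_gap = 0.0, 0.0
for n in range(1, n_total + 1):
    running_sum += rng.gauss(0.0, 1.0)
    if n >= tail_start:
        # Track sup over the tail of |running mean - mu| along this one path
        worst_tail_gap = max(worst_tail_gap, abs(running_sum / n))

print(f"largest |running mean| for {tail_start} <= n <= {n_total}: {worst_tail_gap:.4f}")
```

The Weak Law controls the deviation at each single n; the Strong Law is what says this supremum over the tail shrinks to zero along almost every path. A finite simulation can only suggest that behaviour.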

Four traps in the modes of convergence:

Convergence in distribution says nothing about the variables being close. It constrains only the laws F_{X_n} \to F_X. Do not conclude |X_n - X| is small — as X_n = (-1)^n X shows, they can be maximally far apart.
Almost sure \ne in probability. "In probability" lets the bad outcomes reshuffle with n (the typewriter); "almost surely" demands the same paths converge. In probability is strictly weaker, recoverable only along a subsequence.
Pointwise convergence does not give you L^p for free. The spike n\,\mathbf{1}_{(0,1/n)} \to 0 everywhere, yet \mathbb{E}[X_n] = 1. To cross from in-probability / a.s. back to L^p you need a uniform integrability bridge — most often a dominating integrable envelope (Dominated Convergence).
Chebyshev bounds a probability, not a value. \mathbb{P}(|X - \mu| \ge k\sigma) \le 1/k^2 is an inequality; it never says the deviation equals anything, only that large deviations are rare. Quoting it as an equality is a classic slip.
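The last trap is easy to see in numbers. As a minimal sketch, take X standard normal (an illustrative choice) and compare the exact two-sided tail with the Chebyshev bound 1/k^2.

```python
from statistics import NormalDist

std_normal = NormalDist()  # mean 0, standard deviation 1

def normal_tail(k):
    """Exact P(|X - mu| >= k * sigma) for a normal X."""
    return 2.0 * (1.0 - std_normal.cdf(k))

for k in (1, 2, 3, 4):
    exact, bound = normal_tail(k), 1.0 / k**2
    assert exact <= bound  # Chebyshev holds, with plenty of slack
    print(f"k = {k}   exact tail = {exact:.5f}   Chebyshev bound = {bound:.5f}")
```

The exact tail sits far below the bound at every k, which is why treating the inequality as an equality badly overstates how often large deviations occur.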