Probability Spaces

Ask an insurer the chance a house burns down this year, a physicist the chance a given atom decays in the next second, a forecaster the chance of rain tomorrow. Each answers with a single number in [0,1]. For those numbers to be more than gut feeling — for them to add up, never contradict one another, and survive being combined across unions, complements and limits — they must live inside a rigid mathematical frame. That frame is a probability space.

The whole of the last module was quietly building it. A (X, \mathcal{F}, \mu) is a set, a σ-algebra of measurable pieces, and a measure weighing them. A probability space is nothing more than a measure space whose total weight is exactly one:

(\Omega, \mathcal{F}, P), \qquad P(\Omega) = 1.

That single normalisation — total measure 1 instead of \infty — turns "how big is this set?" into "how likely is this event?". Every theorem of measure theory instantly becomes a theorem of probability. This page is the translation dictionary: it takes the abstract machinery and reads it in the language of chance, and it explains why the school-level rules you met in are exactly the ones this structure forces on us — no more, no fewer.

The three ingredients

A probability space bundles three objects, and it pays to meet them one at a time.

The sample space \Omega — the set of every possible outcome of the experiment, one and only one of which actually happens. For a single die roll, \Omega = \{1,2,3,4,5,6\}. For "how many emails arrive today", \Omega = \{0,1,2,\dots\}. For "where on the dartboard does the tip land", \Omega is a disc — uncountably infinite. An element \omega \in \Omega is a single, fully specified outcome.
The σ-algebra of events \mathcal{F} — a σ-algebra of subsets of \Omega. An event is a set of outcomes — "the die shows an even number" is the event \{2,4,6\} — and \mathcal{F} is the collection of events we are allowed to assign a probability to. It is closed under complement (A^{c}: "A does not happen"), countable union (\bigcup_n A_n: "at least one A_n happens") and hence countable intersection ("all of them happen"), and contains \Omega itself.
The probability measure P — a measure P : \mathcal{F} \to [0,1] with P(\Omega) = 1. It reports, for each event, its likelihood.

When \Omega is finite or countable we usually take \mathcal{F} = 2^{\Omega}, the power set — every subset is an event, and the σ-algebra fades into the background. So why carry \mathcal{F} around at all?

Because for uncountable \Omega the power set is too greedy. Try to spread probability uniformly over [0,1] (pick a "random real") and demand that every subset get a consistent, translation-invariant probability, and you can construct a Vitali set — a subset with no coherent measure at all, a paradoxical thing whose "probability" would have to be simultaneously 0 and positive. The escape, discovered by the measure-theorists, is to not insist: we only ever probabilise the sets in \mathcal{F} (for [0,1], the Lebesgue-measurable ones), and quietly leave the monsters out. The σ-algebra is precisely the guest list of sets well-behaved enough to be assigned a chance.

Kolmogorov's axioms

Strip probability down to its skeleton and only three demands survive. They are due to Andrey Kolmogorov (1933), and they say: P is a measure of total mass one.

A probability measure P on (\Omega,\mathcal{F}) satisfies:

Non-negativity: P(A) \ge 0 for every event A \in \mathcal{F}.
Normalisation: P(\Omega) = 1 — something certainly happens.
Countable additivity (σ-additivity): for pairwise disjoint events A_1, A_2, \dots (no two can occur together), \displaystyle P\!\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n).

That is the entire foundation. Notice how little it is — the same two measure axioms (non-negativity + σ-additivity) you already met, plus the single word P(\Omega)=1. Everything else that feels like a "rule of probability" is a theorem derived from these three lines, not an extra assumption. The next card cashes that out.

Every school rule, derived

Watch the familiar formulae fall out one after another, each a two-line consequence of the axioms.

The impossible event has probability zero. Take A_1 = \Omega and A_2 = A_3 = \dots = \varnothing — pairwise disjoint. Countable additivity gives 1 = P(\Omega) = P(\Omega) + \sum_{n\ge 2} P(\varnothing), and a sum of non-negative terms can only vanish if each is 0. So P(\varnothing) = 0. (This also upgrades σ-additivity to plain finite additivity — pad a finite disjoint union with copies of \varnothing.)
Complement rule. A and A^{c} are disjoint and union to \Omega, so P(A) + P(A^{c}) = 1, i.e. P(A^{c}) = 1 - P(A). In particular P(A) \le 1 always.
Monotonicity. If A \subseteq B, split B = A \cup (B \setminus A) into disjoint parts: P(B) = P(A) + P(B\setminus A) \ge P(A). A bigger event is at least as likely.
Inclusion–exclusion. For any two events, P(A \cup B) = P(A) + P(B) - P(A \cap B), because adding P(A) and P(B) double-counts the overlap A \cap B exactly once.
Subadditivity (the union bound). Dropping the (non-negative) intersection terms, P\bigl(\bigcup_n A_n\bigr) \le \sum_n P(A_n) for any events, disjoint or not. Cheap, crude, and one of the most-used inequalities in all of probability.
Continuity. If events increase A_1 \subseteq A_2 \subseteq \cdots with union A, then P(A_n) \uparrow P(A) (continuity from below); dually, if they decrease B_1 \supseteq B_2 \supseteq \cdots with intersection B, then P(B_n) \downarrow P(B) (from above). Probability commutes with monotone limits of events — this is exactly σ-additivity worn as a limit, and it is the property finite additivity alone cannot give you.

Two concrete spaces

The fair die (finite). Let \Omega = \{1,\dots,6\}, \mathcal{F} = 2^{\Omega} (all 64 subsets), and P(A) = |A|/6. Then P(\Omega) = 6/6 = 1, and every axiom holds because counting is additive over disjoint sets. The event "even", \{2,4,6\}, has probability 3/6 = \tfrac12; its complement "odd" has 1 - \tfrac12 = \tfrac12. This is the equally-likely model, now seen as the smallest possible probability space.

Infinite coin tossing (the canonical infinite space). Toss a fair coin forever. An outcome is an infinite binary string, so \Omega = \{0,1\}^{\mathbb{N}} — uncountable. Here \mathcal{F} = 2^{\Omega} is hopeless; instead \mathcal{F} is the σ-algebra generated by the cylinder sets (events that pin down finitely many tosses), and P is the unique measure with

P(\text{tosses } 1..k \text{ equal } b_1\dots b_k) = 2^{-k}.

This space is where finite additivity visibly fails to be enough. Consider the event H = \{"a head eventually appears"\}. Its complement is "all tails forever", the single string 000\cdots. Let T_n be "the first n tosses are all tails", so P(T_n) = 2^{-n} and T_1 \supseteq T_2 \supseteq \cdots with intersection exactly the all-tails string. Continuity from above forces P(\text{all tails}) = \lim_n 2^{-n} = 0, hence P(H) = 1. No finite collection of axioms could reach that limit; you genuinely need the countable one. This is the point of measure-theoretic probability — it is built to survive infinity.

What does the number P(A) actually mean?

The axioms tell you how probabilities combine, but not what they are. The operational reading — the one that ties the theory back to the real world — is long-run relative frequency: repeat the experiment independently again and again, and the fraction of trials on which event A occurs should settle down to P(A). Below, a fair die is rolled and we track the running frequency of the event A = \{6\}, whose probability is \tfrac16 \approx 0.167.

For small n the curve lurches — a couple of early sixes and the frequency spikes to 0.5; a dry patch and it slumps toward 0. But as the trials pile up the wobble shrinks and the frequency is pinned ever closer to the dashed line. That this must happen — that the frequency converges to P(A) with certainty — is the Law of Large Numbers, and it is a theorem proved inside a probability space, not an assumption bolted on. The axioms are exactly what make the promise keepable.

Null sets and "almost surely"

A null event is one with P(N) = 0. Its opposite, an event with P(A) = 1, is said to happen almost surely (abbreviated a.s.), the probabilist's cousin of measure theory's "almost everywhere". Because the theory is blind to null sets, almost-sure statements are the natural currency: the Law of Large Numbers above holds "almost surely", meaning the set of unlucky infinite sequences on which the frequency fails to converge is a null event — real, but of probability zero.

Pick a uniform random point in [0,1]. The event "the point is exactly \tfrac13" is a single outcome, a set of Lebesgue measure 0, so P(\{\tfrac13\}) = 0 — yet some point does get chosen, so a probability-zero outcome occurs every single time. Here is where careful language earns its keep.

Two of the most common — and most costly — confusions in all of probability:

"Probability 0" is NOT the same as "impossible". On an uncountable \Omega, every individual outcome typically has probability 0, yet one of them happens. "All tails forever" is a genuinely possible infinite sequence with P = 0. The impossible event is \varnothing (which has P = 0), but the converse fails: P(A) = 0 does not force A = \varnothing. Symmetrically, P(A) = 1 means "almost surely", not "certainly" — the complement can be a non-empty null set.
Not every subset of \Omega is an event. On an uncountable space you cannot ask "what is P of this arbitrary set?" — only sets in \mathcal{F} have probabilities. Write P(A) only after checking A \in \mathcal{F}. Skipping that check is how the non-measurable paradoxes sneak in.

For nearly three centuries after Pascal and Fermat traded letters over dice in the 1650s, probability was a brilliant but foundationally shaky art — powerful formulae with no agreed definition of the object they described. Paradoxes multiplied; even great mathematicians disagreed about what "probability" meant. The problem was on Hilbert's 1900 list of the century's challenges: put probability on a rigorous axiomatic footing.

The answer arrived in 1933 in a slim German monograph, Grundbegriffe der Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"), by the 30-year-old Andrey Kolmogorov. His move was audacious in its simplicity: probability is measure theory, with the extra rule P(\Omega) = 1. Overnight, decades of hard-won results in measure and integration became free tools for probability, and every classical formula found a home. It is one of the cleanest examples in mathematics of a whole subject snapping into focus the moment the right definition is written down.