Probability Spaces
Ask an insurer the chance a house burns down this year, a physicist the chance a given atom decays in
the next second, a forecaster the chance of rain tomorrow. Each answers with a single number in
[0,1]. For those numbers to be more than gut feeling — for them to
add up, never contradict one another, and survive being combined across unions, complements
and limits — they must live inside a rigid mathematical frame. That frame is a
probability space.
The whole of the last module was quietly building it. A measure space
(X, \mathcal{F}, \mu) is a set, a σ-algebra of measurable pieces, and a
measure weighing them. A probability space is nothing more than a measure space whose total
weight is exactly one:
(\Omega, \mathcal{F}, P), \qquad P(\Omega) = 1.
That single normalisation, total measure 1 rather than some arbitrary and possibly
infinite value, turns "how big is this set?" into "how likely is this event?". Every
theorem of measure theory instantly becomes a theorem of probability. This page is the translation
dictionary: it takes the abstract machinery and reads it in the language of chance, and it explains why
the school-level rules you met earlier
are exactly the ones this structure forces on us: no more, no fewer.
The three ingredients
A probability space bundles three objects, and it pays to meet them one at a time.
-
The sample space \Omega — the set of every possible
outcome of the experiment, one and only one of which actually happens. For a single die
roll, \Omega = \{1,2,3,4,5,6\}. For "how many emails arrive today",
\Omega = \{0,1,2,\dots\}. For "where on the dartboard does the tip land",
\Omega is a disc — uncountably infinite. An element
\omega \in \Omega is a single, fully specified outcome.
-
The σ-algebra of events \mathcal{F} — a
σ-algebra of subsets of
\Omega. An event is a set of outcomes — "the die
shows an even number" is the event \{2,4,6\} — and
\mathcal{F} is the collection of events we are allowed to assign a
probability to. It is closed under complement (A^{c}: "A
does not happen"), countable union (\bigcup_n A_n: "at least one
A_n happens") and hence countable intersection ("all of them happen"), and
contains \Omega itself.
-
The probability measure P — a
measure
P : \mathcal{F} \to [0,1] with P(\Omega) = 1. It
reports, for each event, its likelihood.
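On a finite sample space the closure conditions on \mathcal{F} can be checked by brute force. The sketch below is only an illustration: the helper names `power_set` and `is_sigma_algebra` are invented here, and the check relies on the fact that on a finite space countable unions reduce to pairwise unions.

```python
from itertools import chain, combinations


def power_set(omega):
    """All subsets of a finite set, as frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)
    )
    return {frozenset(s) for s in subsets}


def is_sigma_algebra(omega, family):
    """Check the sigma-algebra conditions on a finite sample space.

    On a finite space countable unions reduce to pairwise unions,
    so closure under complement and pairwise union is enough.
    """
    omega = frozenset(omega)
    family = {frozenset(a) for a in family}
    if omega not in family:
        return False
    if any(omega - a not in family for a in family):
        return False
    return all(a | b in family for a in family for b in family)


omega = {1, 2, 3, 4, 5, 6}
everything = power_set(omega)                      # every subset is an event
parity_only = [set(), {2, 4, 6}, {1, 3, 5}, omega]  # only "even"/"odd" are events
broken = [set(), {1}, omega]                        # missing the complement of {1}

print(len(everything))
print(is_sigma_algebra(omega, everything))
print(is_sigma_algebra(omega, parity_only))
print(is_sigma_algebra(omega, broken))
```

The `parity_only` family shows that a σ-algebra can be much smaller than the power set: it can only answer questions about parity, but it is still closed under the operations listed above.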
When \Omega is finite or countable we usually take
\mathcal{F} = 2^{\Omega}, the power set — every subset is an
event, and the σ-algebra fades into the background. So why carry
\mathcal{F} around at all?
Because for uncountable \Omega the power set is too greedy. Try to
spread probability uniformly over [0,1] (pick a "random real") and demand
that every subset get a countably additive, translation-invariant probability, and the axiom of choice hands you
a Vitali set: a subset whose countably many disjoint shifted copies (shifts taken modulo 1) tile
[0,1], so that its "probability" can be neither 0 (the copies would total 0) nor positive (they would total \infty). The escape,
discovered by the measure-theorists, is to not insist: we only ever probabilise the sets in
\mathcal{F} (for [0,1], the
Lebesgue-measurable
ones), and quietly leave the monsters out. The σ-algebra is precisely the guest list of sets
well-behaved enough to be assigned a chance.
Kolmogorov's axioms
Strip probability down to its skeleton and only three demands survive. They are due to
Andrey Kolmogorov (1933), and they say: P is a measure of
total mass one.
A probability measure P on (\Omega,\mathcal{F}) satisfies:
-
Non-negativity: P(A) \ge 0 for every event
A \in \mathcal{F}.
-
Normalisation: P(\Omega) = 1 — something
certainly happens.
-
Countable additivity (σ-additivity): for pairwise disjoint events
A_1, A_2, \dots (no two can occur together),
\displaystyle P\!\Bigl(\bigcup_{n=1}^{\infty} A_n\Bigr) = \sum_{n=1}^{\infty} P(A_n).
That is the entire foundation. Notice how little it is: the same two measure axioms
(non-negativity and σ-additivity) you already met, plus the single equation
P(\Omega)=1. Everything else that feels like a "rule of probability" is a
theorem derived from these three lines, not an extra assumption. The next section cashes that out.
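To see that additivity is a real constraint and not a formality, the three axioms can be tested mechanically on a finite space. This is a minimal sketch under stated assumptions: `all_events`, `satisfies_axioms`, `uniform` and `squared` are names made up for the illustration, every subset is treated as an event, and countable additivity is replaced by additivity over disjoint pairs, which is equivalent when \Omega is finite.

```python
from fractions import Fraction
from itertools import combinations


def all_events(omega):
    """Every subset of a finite sample space, as frozensets."""
    items = sorted(omega)
    return [
        frozenset(c)
        for r in range(len(items) + 1)
        for c in combinations(items, r)
    ]


def satisfies_axioms(omega, P):
    """Check the three axioms for a set function P on the power set of omega.

    On a finite space, countable additivity reduces to additivity
    over pairs of disjoint events.
    """
    events = all_events(omega)
    non_negative = all(P(a) >= 0 for a in events)
    normalised = P(frozenset(omega)) == 1
    additive = all(
        P(a | b) == P(a) + P(b)
        for a in events
        for b in events
        if not (a & b)
    )
    return non_negative and normalised and additive


omega = {1, 2, 3, 4, 5, 6}


def uniform(a):
    # The equally-likely assignment P(A) = |A| / 6.
    return Fraction(len(a), 6)


def squared(a):
    # Non-negative and normalised, but not additive.
    return Fraction(len(a), 6) ** 2


print(satisfies_axioms(omega, uniform))
print(satisfies_axioms(omega, squared))
```

The `squared` function passes the first two axioms and fails only the third: it gives \{1\} and \{2\} weight 1/36 each but gives \{1,2\} weight 4/36.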
Every school rule, derived
Watch the familiar formulae fall out one after another, each a two-line consequence of the axioms.
-
The impossible event has probability zero. Take
A_1 = \Omega and A_2 = A_3 = \dots = \varnothing,
which are pairwise disjoint. Countable additivity gives
1 = P(\Omega) = P(\Omega) + \sum_{n\ge 2} P(\varnothing), so the infinite sum
\sum_{n\ge 2} P(\varnothing) equals 0, and a sum of non-negative terms can only vanish if each is 0. So
P(\varnothing) = 0. (This also upgrades σ-additivity to plain
finite additivity: pad a finite disjoint union with copies of
\varnothing.)
-
Complement rule. A and A^{c}
are disjoint and union to \Omega, so
P(A) + P(A^{c}) = 1, i.e. P(A^{c}) = 1 - P(A). In
particular P(A) \le 1 always.
-
Monotonicity. If A \subseteq B, split
B = A \cup (B \setminus A) into disjoint parts:
P(B) = P(A) + P(B\setminus A) \ge P(A). A bigger event is at least as
likely.
-
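Inclusion–exclusion. For any two events,
P(A \cup B) = P(A) + P(B) - P(A \cap B).
Split A \cup B = A \cup (B \setminus A) and B = (A \cap B) \cup (B \setminus A) into disjoint parts: the first gives
P(A \cup B) = P(A) + P(B \setminus A), the second gives P(B \setminus A) = P(B) - P(A \cap B). Informally, adding P(A) and P(B) counts the
overlap A \cap B twice, so it is subtracted once.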
-
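Subadditivity (the union bound). For two events, drop the non-negative term P(A \cap B) from inclusion–exclusion.
In general, replace each A_n by B_n = A_n \setminus (A_1 \cup \dots \cup A_{n-1}): the B_n are pairwise disjoint, have the
same union, and satisfy P(B_n) \le P(A_n) by monotonicity, so
P\bigl(\bigcup_n A_n\bigr) \le \sum_n P(A_n) for any events,
disjoint or not. Cheap, crude, and one of the most-used inequalities in all of probability.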
-
Continuity. If events increase A_1 \subseteq A_2 \subseteq \cdots
with union A, then P(A_n) \uparrow P(A)
(continuity from below); dually, if they decrease
B_1 \supseteq B_2 \supseteq \cdots with intersection
B, then P(B_n) \downarrow P(B) (from
above). Probability commutes with monotone limits of events — this is exactly σ-additivity worn
as a limit, and it is the property finite additivity alone cannot give you.
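The finite rules in this list can be confirmed exhaustively on a small space. The sketch below assumes the fair-die measure P(A) = |A|/6 on all subsets of \{1,\dots,6\}; the names `OMEGA`, `EVENTS`, `P` and `check_rules` are illustrative only. Continuity is not covered, since it only has content on an infinite space.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))
EVENTS = [
    frozenset(c)
    for r in range(len(OMEGA) + 1)
    for c in combinations(sorted(OMEGA), r)
]


def P(event):
    """Equally-likely measure on the die: P(A) = |A| / 6."""
    return Fraction(len(event), len(OMEGA))


def check_rules():
    """Assert each derived rule for every event or pair of events."""
    assert P(frozenset()) == 0                            # impossible event
    for a in EVENTS:
        assert P(OMEGA - a) == 1 - P(a)                   # complement rule
        for b in EVENTS:
            if a <= b:
                assert P(a) <= P(b)                       # monotonicity
            assert P(a | b) == P(a) + P(b) - P(a & b)     # inclusion-exclusion
            assert P(a | b) <= P(a) + P(b)                # union bound
    return True


print(check_rules())
```

Exact `Fraction` arithmetic is used so that the equalities are checked without floating-point rounding.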
Two concrete spaces
The fair die (finite). Let \Omega = \{1,\dots,6\},
\mathcal{F} = 2^{\Omega} (all 64 subsets), and
P(A) = |A|/6. Then P(\Omega) = 6/6 = 1, and every
axiom holds because counting is additive over disjoint sets. The event "even",
\{2,4,6\}, has probability 3/6 = \tfrac12; its
complement "odd" has 1 - \tfrac12 = \tfrac12. This is the
equally-likely
model, now seen as one of the simplest probability spaces there is.
Infinite coin tossing (the canonical infinite space). Toss a fair coin forever. An
outcome is an infinite binary string, so
\Omega = \{0,1\}^{\mathbb{N}} — uncountable. Here
\mathcal{F} = 2^{\Omega} is hopeless; instead
\mathcal{F} is the σ-algebra generated by the cylinder sets
(events that pin down finitely many tosses), and P is the unique measure
with
P(\text{tosses } 1..k \text{ equal } b_1\dots b_k) = 2^{-k}.
This space is where countable operations become unavoidable. Consider the event
H = \{\text{a head eventually appears}\}. Its
complement is "all tails forever", the single string
000\cdots. Let T_n be "the first
n tosses are all tails", so P(T_n) = 2^{-n} and
T_1 \supseteq T_2 \supseteq \cdots with intersection exactly the all-tails
string. Continuity from above forces
P(\text{all tails}) = \lim_n 2^{-n} = 0, hence
P(H) = 1. Neither H nor its complement is decided by finitely many tosses: they are
events only because \mathcal{F} is closed under countable intersections and unions, and their
probabilities are read off as limits. This is the point of measure-theoretic probability: it is
built to survive infinity.
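The limit in the coin-tossing argument can be tabulated exactly. This sketch assumes the cylinder-set rule stated above (a fixed prefix of length k has probability 2^{-k}) and encodes a tail as 0; the function names are hypothetical.

```python
from fractions import Fraction


def p_cylinder(prefix):
    """Probability that the first len(prefix) tosses equal the given bits."""
    return Fraction(1, 2 ** len(prefix))


def p_all_tails_up_to(n):
    """P(T_n): the first n tosses are all tails (0 encodes a tail)."""
    return p_cylinder([0] * n)


def p_head_within(n):
    """P(at least one head in the first n tosses) = 1 - P(T_n)."""
    return 1 - p_all_tails_up_to(n)


for n in (1, 2, 5, 10, 20):
    print(n, p_all_tails_up_to(n), float(p_head_within(n)))
```

The events "a head within the first n tosses" increase to H, so the last column is continuity from below in action: it climbs toward P(H) = 1 without reaching it at any finite n.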
What does the number P(A) actually mean?
The axioms tell you how probabilities combine, but not what they are. The
operational reading — the one that ties the theory back to the real world — is long-run
relative frequency: repeat the experiment independently again and again, and the fraction of
trials on which event A occurs should settle down to
P(A). Picture a fair die rolled repeatedly while we track the running frequency of the
event A = \{6\}, whose probability is \tfrac16 \approx 0.167, plotted against a dashed reference line at that value.
For small n the frequency curve lurches: a couple of early sixes and the frequency
spikes to 0.5; a dry patch and it slumps toward 0.
But as the trials pile up the wobble shrinks and the frequency is pinned ever closer to the dashed
line. That this must happen, that the frequency converges to P(A)
with probability one, is the (strong) Law of Large Numbers, and it is a theorem proved inside
a probability space, not an assumption bolted on. The axioms are exactly what make the promise
keepable.
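The running-frequency experiment is easy to reproduce. The following is a rough simulation sketch, not a proof of anything: it uses Python's pseudo-random generator with a fixed seed (an arbitrary choice made here for repeatability), and `running_frequency` is an illustrative name.

```python
import random


def running_frequency(num_rolls, face=6, seed=0):
    """Roll a simulated fair die and return the running frequency of `face`.

    Entry n - 1 of the result is (number of hits in the first n rolls) / n.
    """
    rng = random.Random(seed)
    hits = 0
    freqs = []
    for n in range(1, num_rolls + 1):
        if rng.randint(1, 6) == face:
            hits += 1
        freqs.append(hits / n)
    return freqs


freqs = running_frequency(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, round(freqs[n - 1], 4))
```

The printed values depend on the seed, but for large n they should sit close to 1/6 ≈ 0.1667, with typical deviations shrinking roughly like 1/\sqrt{n}.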
Null sets and "almost surely"
A null event is one with P(N) = 0. Its opposite, an event
with P(A) = 1, is said to happen almost surely
(abbreviated a.s.), the probabilist's cousin of measure theory's "almost everywhere". Because
the theory is blind to null sets, almost-sure statements are the natural currency: the Law of Large
Numbers above holds "almost surely", meaning the set of unlucky infinite sequences on which the
frequency fails to converge is a null event — real, but of probability zero.
Pick a uniform random point in [0,1]. The event "the point is exactly
\tfrac13" is a single outcome, a set of Lebesgue measure
0, so P(\{\tfrac13\}) = 0 — yet some point
does get chosen, so a probability-zero outcome occurs every single time. Here is where careful
language earns its keep.
Two of the most common — and most costly — confusions in all of probability:
-
"Probability 0" is NOT the same as "impossible". On an
uncountable \Omega, every individual outcome typically has probability
0, yet one of them happens. "All tails forever" is a genuinely possible
infinite sequence with P = 0. The impossible event is
\varnothing (which has P = 0), but the
converse fails: P(A) = 0 does not force
A = \varnothing. Symmetrically, P(A) = 1 means
"almost surely", not "certainly" — the complement can be a non-empty null set.
-
Not every subset of \Omega is an event. On an
uncountable space you cannot ask "what is P of this arbitrary
set?" — only sets in \mathcal{F} have probabilities. Write
P(A) only after checking A \in \mathcal{F}.
Skipping that check is how the non-measurable paradoxes sneak in.
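The claim that the single point \tfrac13 has probability zero under the uniform measure on [0,1] can be read as continuity from above: the point is the intersection of shrinking intervals whose probabilities are their lengths. A small sketch of that limit, with `p_interval` as an assumed helper name:

```python
from fractions import Fraction


def p_interval(lo, hi):
    """Uniform probability of [lo, hi] intersected with [0, 1]: its length."""
    lo, hi = max(lo, Fraction(0)), min(hi, Fraction(1))
    return max(hi - lo, Fraction(0))


point = Fraction(1, 3)
for k in (1, 2, 4, 8, 16):
    eps = Fraction(1, 10 ** k)
    # Probability of landing within eps of 1/3 is 2 * eps.
    print(k, p_interval(point - eps, point + eps))

# The degenerate interval [1/3, 1/3] is the single point itself.
print(p_interval(point, point))
```

Every interval around \tfrac13 has positive probability, yet the point itself has probability 0: a non-empty null set.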
For nearly three centuries after Pascal and Fermat traded letters over dice in the 1650s, probability
was a brilliant but foundationally shaky art — powerful formulae with no agreed definition of the
object they described. Paradoxes multiplied; even great mathematicians disagreed about what
"probability" meant. The problem was on Hilbert's 1900 list of the century's challenges: put
probability on a rigorous axiomatic footing.
The answer arrived in 1933 in a slim German monograph, Grundbegriffe der
Wahrscheinlichkeitsrechnung ("Foundations of the Theory of Probability"), by the
30-year-old Andrey Kolmogorov. His move was audacious in its simplicity:
probability is measure theory, with the extra rule P(\Omega) = 1.
Overnight, decades of hard-won results in measure and integration became free tools for probability,
and every classical formula found a home. It is one of the cleanest examples in mathematics of a
whole subject snapping into focus the moment the right definition is written down.