Random Variables as Measurable Functions
You roll a die, toss a handful of coins, wait for a bus, measure the height of the next person through
the door. In every case a chance experiment happens somewhere in the background — in a space
of raw outcomes — and yet the thing you actually care about is a number: the total
showing, the count of heads, the minutes waited, the centimetres. A random variable is
the bridge. It is the rule that reads a raw outcome and reports the number you were watching for.
Measure theory lets us say precisely what that "rule" is, and — crucially — which rules are
allowed. The one-sentence answer, the whole of this page in a nutshell:
Let (\Omega, \mathcal{F}, P) be a probability space. A random variable is a measurable function
X : (\Omega, \mathcal{F}) \longrightarrow (\mathbb{R}, \mathcal{B}),
meaning that the preimage of every Borel set is an event:
X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F} \qquad \text{for every } B \in \mathcal{B}.
That is the entire definition. No new machinery — a random variable is just a measurable function whose
domain happens to carry a probability measure. Everything else on this page (its law, its distribution
function, discrete versus continuous, the \sigma-algebra it generates) is a
consequence of that single line.
Why measurability is exactly the right condition
Here is the point of the definition — not a technicality, but the very thing that makes probability
work. You want to ask questions like
P(X \le x), \qquad P(a < X \le b), \qquad P(X \in B).
But P only knows how to assign probabilities to events —
the members of \mathcal{F}. The expression P(X \le x)
is really shorthand for the probability of the set of outcomes on which X
comes out at most x:
P(X \le x) \;:=\; P\bigl(\{\omega : X(\omega) \le x\}\bigr) \;=\; P\bigl(X^{-1}((-\infty, x])\bigr).
For that right-hand side to mean anything, the set X^{-1}((-\infty,x])
has to be something P can weigh — it must live in
\mathcal{F}. Measurability is precisely the promise that it always does.
Measurability is not red tape; it is the exact condition that makes the probabilities you want
to compute well-defined. An arbitrary function X:\Omega\to\mathbb{R}
might send a perfectly innocent interval back to a non-event, a set with no assigned probability, and
then "P(X \le x)" would be gibberish.
You never have to verify X^{-1}(B) \in \mathcal{F} for all
uncountably many Borel sets. It is enough to check it on the half-infinite rays:
\{X \le x\} \in \mathcal{F} \ \text{ for every } x \in \mathbb{R} \quad\Longleftrightarrow\quad X \text{ is a random variable.}
This works for two reasons:
- The rays (-\infty, x] generate the Borel
\sigma-algebra \mathcal{B}.
- Preimage X^{-1} commutes with complements, countable unions and
countable intersections, so the sets B with X^{-1}(B) \in \mathcal{F} form a
\sigma-algebra. It contains every ray, hence all of \mathcal{B}.
This is why the cumulative distribution function below is built from
\{X \le x\}: those rays are the seeds from which the entire distribution
grows.
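The ray criterion can be tried out on a finite space. The sketch below is illustrative only: the helper names `sigma_algebra_from_partition` and `is_random_variable`, and the coarse \sigma-algebra that records only the first of two tosses, are assumptions made for this example. It checks \{X \le x\} \in \mathcal{F} at each value X takes, which suffices on a finite space because the set \{X \le x\} only changes at those values.

```python
from itertools import combinations

def sigma_algebra_from_partition(blocks):
    """All unions of the given disjoint blocks (including the empty union)."""
    blocks = [frozenset(b) for b in blocks]
    events = set()
    for r in range(len(blocks) + 1):
        for chosen in combinations(blocks, r):
            events.add(frozenset().union(*chosen))
    return events

def is_random_variable(X, omega, events):
    """Check that {X <= x} is an event for every value x that X takes."""
    for x in {X(w) for w in omega}:
        ray_preimage = frozenset(w for w in omega if X(w) <= x)
        if ray_preimage not in events:
            return False
    return True

omega = ["HH", "HT", "TH", "TT"]

# Hypothetical coarse sigma-algebra: it can only see the first toss.
F_first = sigma_algebra_from_partition([{"HH", "HT"}, {"TH", "TT"}])

def first_is_head(w):
    return 1 if w[0] == "H" else 0

def num_heads(w):
    return w.count("H")

print(is_random_variable(first_is_head, omega, F_first))
print(is_random_variable(num_heads, omega, F_first))
```

With respect to this coarse \sigma-algebra the first-toss indicator passes the check, while the head count fails, because \{X \le 0\} = \{\text{TT}\} is not one of the four available events.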
Worked example: X = number of heads in three tosses
Nothing makes the abstraction concrete like building the whole object by hand. Toss a fair coin three
times. The sample space is the eight equally likely sequences:
\Omega = \{\,\text{TTT},\ \text{TTH},\ \text{THT},\ \text{HTT},\ \text{THH},\ \text{HTH},\ \text{HHT},\ \text{HHH}\,\}.
Take \mathcal{F} = 2^{\Omega}, the full power set (every subset is an event),
and P the uniform measure, P(\{\omega\}) = \tfrac18.
Define the random variable
X(\omega) = \#\{\text{heads in } \omega\}, \qquad\text{so}\qquad X:\Omega \to \{0,1,2,3\}.
Is it measurable? On a power-set \sigma-algebra every subset is an
event, so X^{-1}(B)\in\mathcal{F} automatically: every real-valued function on a space
equipped with its full power set is a random variable. Now read off the preimages of the single values:
X^{-1}(\{0\}) = \{\text{TTT}\},\quad X^{-1}(\{1\}) = \{\text{TTH,THT,HTT}\},\quad X^{-1}(\{2\}) = \{\text{THH,HTH,HHT}\},\quad X^{-1}(\{3\}) = \{\text{HHH}\}.
Weigh each preimage with P and you have pushed the probability from
\Omega out onto the number line. That collection of weights is the
law of X:
P(X=0)=\tfrac18,\qquad P(X=1)=\tfrac38,\qquad P(X=2)=\tfrac38,\qquad P(X=3)=\tfrac18.
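The same bookkeeping can be sketched in a few lines of Python. This is only an illustration of the construction above; the variable names are chosen for this example, and exact fractions are used so the weights match the hand computation.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product

# Sample space: the eight sequences of three tosses.
omega = ["".join(t) for t in product("TH", repeat=3)]

# Uniform measure on single outcomes.
P = {w: Fraction(1, 8) for w in omega}

def X(w):
    """Number of heads in the outcome w."""
    return w.count("H")

# Push the probability forward: add P(w) to the value X(w).
law = defaultdict(Fraction)
for w in omega:
    law[X(w)] += P[w]

for value in sorted(law):
    print(value, law[value])
```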
Now slide a threshold x along the real line. The map
X carries each raw outcome to its head-count, and the preimage
X^{-1}\bigl((-\infty,x]\bigr) = \{X \le x\} collects, as an event
inside \Omega, every outcome whose count is at most x. Its probability is the
cumulative distribution function F_X(x).
As x moves, that value jumps. Between the atoms the count of caught outcomes is constant, so
F_X is flat; each time x crosses an integer
k the preimage swallows a fresh clump of outcomes and
F_X leaps by exactly P(X=k). That staircase —
flat, jump, flat, jump — is the distribution function of a discrete random variable.
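A small sketch of the sliding threshold, assuming the same three-toss setup. The function names `ray_preimage` and `F` are illustrative. For each threshold it lists the outcomes caught by \{X \le x\} and their total probability.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("TH", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}

def X(w):
    return w.count("H")

def ray_preimage(x):
    """The event {X <= x}, as a set of raw outcomes."""
    return {w for w in omega if X(w) <= x}

def F(x):
    """CDF value: the probability of the preimage of (-inf, x]."""
    return sum((P[w] for w in ray_preimage(x)), Fraction(0))

for x in [-0.5, 0, 0.5, 1, 2.5, 3]:
    event = sorted(ray_preimage(x))
    print(f"x = {x:>4}: F(x) = {F(x)}  {{X <= x}} = {event}")
```

Thresholds that fall between two consecutive integers catch the same outcomes, which is the flat stretch of the staircase.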
The law: a probability measure that lives on \mathbb{R}
The example did something profound in passing. We started with a probability measure
P on the abstract space \Omega and ended with a
collection of weights on \mathbb{R}. That new object is a bona fide
probability measure in its own right — the pushforward of
P along X, written
\mu_X = P \circ X^{-1} and called the law or distribution of X:
\mu_X(B) \;=\; P\bigl(X^{-1}(B)\bigr) \;=\; P(X \in B), \qquad B \in \mathcal{B}.
It is a genuine probability measure on (\mathbb{R}, \mathcal{B}): it is
non-negative, it gives the whole line total mass
\mu_X(\mathbb{R}) = P(X^{-1}(\mathbb{R})) = P(\Omega) = 1, and it is countably
additive because X^{-1} turns disjoint Borel sets into disjoint events and
P is countably additive. Measurability of X is
exactly what makes \mu_X well-defined on all of
\mathcal{B} — the preimage never leaves \mathcal{F}.
This is the great simplification of the subject. Once you have \mu_X, you can
forget \Omega entirely. Every probabilistic question about
X — its mean, its variance, the chance it lands in any set — is answered by
\mu_X alone, on the familiar real line, with no memory of the underlying coin
tosses or bus stops that produced it.
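As a hedged illustration of "forgetting \Omega", the sketch below computes the mean of the three-coin head count twice: once by summing over raw outcomes, once using only the law on the real line. The two agree, and the variance is then computed from the law alone. All names are chosen for this example.

```python
from fractions import Fraction
from itertools import product

omega = ["".join(t) for t in product("TH", repeat=3)]
P = {w: Fraction(1, 8) for w in omega}

def X(w):
    return w.count("H")

# The law of X: total probability attached to each value.
law = {}
for w in omega:
    law[X(w)] = law.get(X(w), Fraction(0)) + P[w]

# Mean computed on the sample space, outcome by outcome.
mean_on_omega = sum(X(w) * P[w] for w in omega)

# Mean computed on the real line, from the law only.
mean_on_line = sum(x * p for x, p in law.items())

# Variance from the law only: E[X^2] - (E[X])^2.
second_moment = sum(x**2 * p for x, p in law.items())
variance = second_moment - mean_on_line**2

print(mean_on_omega, mean_on_line, variance)
```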
The law forgets the domain, so wildly different experiments can produce identical distributions.
Let \Omega_1 be two tosses of a fair coin with X_1 = \#\text{heads},
and let \Omega_2 = \{1,\dots,4\} be one throw of a fair four-sided die with
X_2(\omega) = \#\{\text{ones in the two-bit binary expansion of } \omega-1\}. Different
sample spaces, different functions, different physical apparatus — yet run the arithmetic and both
give the same law
\mu(0)=\tfrac14,\quad \mu(1)=\tfrac12,\quad \mu(2)=\tfrac14.
To a statistician holding only the distribution, X_1 and
X_2 are indistinguishable. We say they are equal in
distribution, written X_1 \stackrel{d}{=} X_2 — even though as
functions they share no domain and are not "equal" in any pointwise sense. The random variable and its
law are two different things, and this is the cleanest proof of it.
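The arithmetic can be run directly. The sketch below assumes both experiments are fair (uniform on their outcomes) and reads X_2 as the number of 1s in the two-bit binary expansion of \omega - 1. The helper `law_of` is a name invented for this example.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

def law_of(X, omega):
    """Law of X under the uniform measure on a finite sample space."""
    counts = Counter(X(w) for w in omega)
    return {value: Fraction(n, len(omega)) for value, n in counts.items()}

# Experiment 1: two coin tosses, X1 counts heads.
omega1 = ["".join(t) for t in product("HT", repeat=2)]

def X1(w):
    return w.count("H")

# Experiment 2: one four-sided die, X2 counts 1-bits of (face - 1).
omega2 = [1, 2, 3, 4]

def X2(w):
    return format(w - 1, "02b").count("1")

law1 = law_of(X1, omega1)
law2 = law_of(X2, omega2)

print(law1 == law2)
```

The two functions cannot even be compared pointwise, since `X1` takes strings and `X2` takes integers, yet their laws coincide as dictionaries.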
The cumulative distribution function F_X
The law \mu_X assigns a number to every Borel set — a lot of information. It
can be repackaged into a single function of one real variable, the cumulative distribution
function (CDF), built from exactly the rays we said were enough:
F_X(x) \;=\; P(X \le x) \;=\; \mu_X\bigl((-\infty, x]\bigr).
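Remarkably, F_X carries the whole distribution. The rays generate
\mathcal{B} and are closed under finite intersections, so two probability measures that
agree on every ray agree on all of \mathcal{B}: knowing F_X pins down
\mu_X on every Borel set. And F_X is not any old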
function; the measure-theoretic definition forces three properties, and these three properties in turn
characterise CDFs completely.
Every F_X satisfies the following three properties, and conversely every function satisfying all three is the CDF of some random variable:
- Non-decreasing: x \le y \Rightarrow F_X(x) \le F_X(y)
— because \{X\le x\}\subseteq\{X\le y\} and
P is monotone.
- Right-continuous:
\lim_{h\downarrow 0} F_X(x+h) = F_X(x) — from continuity of
P along the shrinking sets
\{X\le x+h\}\downarrow\{X\le x\}.
- Limits:
\displaystyle\lim_{x\to-\infty}F_X(x)=0 and
\displaystyle\lim_{x\to+\infty}F_X(x)=1.
For the three-coin example the CDF is the staircase you watched being built:
F_X(x) = \begin{cases} 0, & x < 0,\\[2pt] \tfrac18, & 0 \le x < 1,\\[2pt] \tfrac48, & 1 \le x < 2,\\[2pt] \tfrac78, & 2 \le x < 3,\\[2pt] 1, & x \ge 3. \end{cases}
Note the closed left endpoints: at x=1 the value is already
\tfrac48, not \tfrac18. That is right-continuity in
the flesh — F_X includes the jump at the point where it happens. The height
of each jump is the point mass P(X=k) = F_X(k) - F_X(k^-).
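These properties can be checked mechanically for the three-coin staircase. The sketch below starts from the law computed earlier and uses illustrative helper names `F` and `F_left`; the left limit F_X(x^-) is computed as P(X < x).

```python
from fractions import Fraction

# Law of the three-coin head count, as derived in the worked example.
law = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}

def F(x):
    """CDF: total mass of the atoms at or below x."""
    return sum((p for k, p in law.items() if k <= x), Fraction(0))

def F_left(x):
    """Left limit F(x^-), which equals P(X < x)."""
    return sum((p for k, p in law.items() if k < x), Fraction(0))

# Jump height at each atom equals the point mass there.
jumps_match = all(F(k) - F_left(k) == law[k] for k in law)

# Right-continuity at each atom: a small step to the right changes nothing.
steps = [Fraction(1, 10), Fraction(1, 100), Fraction(1, 1000)]
right_continuous = all(F(k + h) == F(k) for k in law for h in steps)

# Non-decreasing on a grid of quarter steps from -1 to 4.
grid = [Fraction(i, 4) for i in range(-4, 17)]
non_decreasing = all(F(a) <= F(b) for a, b in zip(grid, grid[1:]))

print(jumps_match, right_continuous, non_decreasing, F(1), F_left(1))
```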
Discrete, continuous, and the density in between
The shape of F_X sorts random variables into families, according to
how the law \mu_X is spread out over the line.
- Discrete. \mu_X is concentrated on a countable set of
atoms \{x_i\} with masses
p_i = P(X=x_i) summing to 1. The CDF is a pure
staircase (our coin example, dice, counts).
- Continuous. F_X has no jumps, so no single point carries positive mass,
and (in the absolutely continuous case) there is a density
f_X \ge 0 with
F_X(x) = \int_{-\infty}^{x} f_X(t)\,dt and
P(X\in B)=\int_B f_X\,d\lambda.
Where does a density even come from, in the measure-theoretic picture? It is a Radon–Nikodym
derivative. A density exists exactly when \mu_X is absolutely
continuous with respect to a reference measure — Lebesgue measure
\lambda in the continuous case, counting measure in the discrete case — and
then
f_X = \frac{d\mu_X}{d\lambda},\qquad\text{so}\qquad \mu_X(B) = \int_B \frac{d\mu_X}{d\lambda}\,d\lambda.
(The Radon–Nikodym
theorem guarantees this derivative exists whenever
\mu_X \ll \lambda; that is the whole story of "why densities integrate to
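give probabilities." Some laws are neither discrete nor absolutely continuous: mixtures of the two, or the
Cantor distribution, whose CDF is continuous yet admits no density. By the Lebesgue decomposition,
every law splits into a discrete part, an absolutely continuous part and a singular continuous part.)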
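To see a density "integrate to give probabilities" numerically, here is a rough sketch. The exponential density f(t) = e^{-t} on t \ge 0 is an assumption picked purely for illustration, as are the interval and the midpoint-rule integrator. The integral of the density over (a, b] is compared with F_X(b) - F_X(a) from the closed-form CDF.

```python
import math

def density(t):
    """Assumed example density: exponential with rate 1."""
    return math.exp(-t) if t >= 0 else 0.0

def cdf_exact(x):
    """Closed-form CDF of the same exponential law."""
    return 1.0 - math.exp(-x) if x >= 0 else 0.0

def integrate(f, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))

a, b = 0.5, 2.0
approx = integrate(density, a, b)   # integral of the density over (a, b]
exact = cdf_exact(b) - cdf_exact(a) # P(a < X <= b) from the CDF

print(approx, exact)
```

The two numbers agree up to the discretisation error of the midpoint rule.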
What X knows: the \sigma-algebra it generates
There is one more object riding along with every random variable, and it becomes indispensable the
moment you meet conditional expectation. Among all \sigma-algebras on
\Omega that make X measurable, there is a smallest
one, gathered from the preimages of the Borel sets:
\sigma(X) \;=\; \bigl\{\, X^{-1}(B) : B \in \mathcal{B} \,\bigr\} \;\subseteq\; \mathcal{F}.
It is a \sigma-algebra (preimage commutes with the set operations),
and by construction it is contained in \mathcal{F} precisely because
X is a random variable. Intuitively \sigma(X) is
the information carried by X: the events you can decide
knowing only the value of X, and nothing more about which
\omega occurred.
For the three-coin count, \sigma(X) is generated by the four preimages
\{X=0\},\{X=1\},\{X=2\},\{X=3\}. Knowing X tells
you how many heads fell — so you can decide the event "at least two heads" — but it can never
distinguish TTH from THT from HTT, since X reports the same
1 for all three. That blind spot, made precise, is what
\sigma(X) encodes, and it is the launch pad for conditioning.
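On a finite space \sigma(X) can be listed explicitly: it consists of all unions of the level sets \{X = k\}. The sketch below builds it for the three-coin count and tests a few events for membership. The variable names are illustrative.

```python
from itertools import combinations, product

omega = ["".join(t) for t in product("TH", repeat=3)]

def X(w):
    return w.count("H")

# Level sets {X = k}: the atoms of sigma(X).
levels = {}
for w in omega:
    levels.setdefault(X(w), set()).add(w)
atoms = [frozenset(s) for s in levels.values()]

# sigma(X): every union of atoms, including the empty union.
sigma_X = {
    frozenset().union(*chosen)
    for r in range(len(atoms) + 1)
    for chosen in combinations(atoms, r)
}

at_least_two_heads = frozenset(w for w in omega if X(w) >= 2)
only_TTH = frozenset({"TTH"})
first_toss_is_head = frozenset(w for w in omega if w[0] == "H")

print(len(sigma_X))
print(at_least_two_heads in sigma_X)
print(only_TTH in sigma_X)
print(first_toss_is_head in sigma_X)
```

Of the 2^8 = 256 subsets of \Omega, only 2^4 = 16 belong to \sigma(X). "At least two heads" is among them, while \{\text{TTH}\} and "first toss is a head" are not, since deciding them needs more than the head count.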
Three traps, all rooted in the word "variable":
- A random variable is neither random nor a variable. It is an ordinary,
deterministic function X:\Omega\to\mathbb{R}. Feed it a fixed
outcome \omega and it returns a fixed number
X(\omega), every time, no dice. All the "randomness" lives in
P, the measure on the domain; X itself is as
rigid as x\mapsto x^2. The name is a historical accident,
not a description.
- Do not confuse X with its distribution. The function
X lives on \Omega; the law
\mu_X lives on \mathbb{R}. Different
X's can share one law
(X_1\stackrel{d}{=}X_2), and knowing the law tells you nothing about
which outcomes produced which values. "Equal in distribution" is far weaker than "equal".
- Measurable does NOT mean continuous. Measurability is a vastly weaker condition.
Take \Omega = \mathbb{R} with the Borel \sigma-algebra: the indicator
\mathbf{1}_{\mathbb{Q}} is discontinuous at every point yet
is a perfectly good random variable, because \mathbb{Q} is countable and therefore a Borel set.
Wanting P(X\le x) to make sense
forces measurability, not continuity. That is the whole reason measure theory, and not ordinary
calculus, is the right home for probability.
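The first trap can be made tangible with a short simulation. This is a sketch under stated assumptions: the seed, the sample size and the use of `random.Random.choice` to draw \omega uniformly are all choices made for this example. The function X never changes; only the draw of \omega is random.

```python
import random
from itertools import product

omega = ["".join(t) for t in product("TH", repeat=3)]

def X(w):
    """Deterministic: the same outcome always yields the same number."""
    return w.count("H")

# Feeding X a fixed outcome gives a fixed answer, however often we ask.
always_two = all(X("HTH") == 2 for _ in range(1000))

# The randomness enters only through how omega is drawn (here: uniformly).
rng = random.Random(0)  # arbitrary seed, for reproducibility
n = 20_000
samples = [X(rng.choice(omega)) for _ in range(n)]
freq = {k: samples.count(k) / n for k in range(4)}

print(always_two)
print(freq)
```

The empirical frequencies should land near the law 1/8, 3/8, 3/8, 1/8, even though X itself involves no randomness at all.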