Random Variables as Measurable Functions

You roll a die, toss a handful of coins, wait for a bus, measure the height of the next person through the door. In every case a chance experiment happens somewhere in the background — in a space of raw outcomes — and yet the thing you actually care about is a number: the total showing, the count of heads, the minutes waited, the centimetres. A random variable is the bridge. It is the rule that reads a raw outcome and reports the number you were watching for.

Measure theory lets us say precisely what that "rule" is, and — crucially — which rules are allowed. The one-sentence answer, the whole of this page in a nutshell:

Let (\Omega, \mathcal{F}, P) be a probability space. A random variable is a measurable function

X : (\Omega, \mathcal{F}) \longrightarrow (\mathbb{R}, \mathcal{B}),

meaning that the preimage of every Borel set is an event:

X^{-1}(B) = \{\omega \in \Omega : X(\omega) \in B\} \in \mathcal{F} \qquad \text{for every } B \in \mathcal{B}.

That is the entire definition. No new machinery — a random variable is just a measurable function whose domain happens to carry a probability measure. Everything else on this page (its law, its distribution function, discrete versus continuous, the \sigma-algebra it generates) is a consequence of that single line.

Why measurability is exactly the right condition

Here is the point of the definition — not a technicality, but the very thing that makes probability work. You want to ask questions like

P(X \le x), \qquad P(a < X \le b), \qquad P(X \in B).

But P only knows how to assign probabilities to events — the members of \mathcal{F}. The expression P(X \le x) is really shorthand for the probability of the set of outcomes on which X comes out at most x:

P(X \le x) \;:=\; P\bigl(\{\omega : X(\omega) \le x\}\bigr) \;=\; P\bigl(X^{-1}((-\infty, x])\bigr).

For that right-hand side to mean anything, the set X^{-1}((-\infty,x]) has to be something P can weigh — it must live in \mathcal{F}. Measurability is precisely the promise that it always does. Measurability is not red tape; it is the exact condition that makes the probabilities you want to compute well-defined. An arbitrary function X:\Omega\to\mathbb{R} might send a perfectly innocent interval back to a non-event, a set with no assigned probability, and then "P(X \le x)" would be gibberish.

You never have to verify X^{-1}(B) \in \mathcal{F} for all uncountably many Borel sets. It is enough to check it on the half-infinite rays:

\{X \le x\} \in \mathcal{F} \ \text{ for every } x \in \mathbb{R} \quad\Longleftrightarrow\quad X \text{ is a random variable.}

This is why the cumulative distribution function below is built from \{X \le x\}: those rays are the seeds from which the entire distribution grows.

Worked example: X = number of heads in three tosses

Nothing makes the abstraction concrete like building the whole object by hand. Toss a fair coin three times. The sample space is the eight equally likely sequences:

\Omega = \{\,\text{TTT},\ \text{TTH},\ \text{THT},\ \text{HTT},\ \text{THH},\ \text{HTH},\ \text{HHT},\ \text{HHH}\,\}.

Take \mathcal{F} = 2^{\Omega}, the full power set (every subset is an event), and P the uniform measure, P(\{\omega\}) = \tfrac18. Define the random variable

X(\omega) = \#\{\text{heads in } \omega\}, \qquad\text{so}\qquad X:\Omega \to \{0,1,2,3\}.

Is it measurable? On a power-set \sigma-algebra every subset is an event, so X^{-1}(B)\in\mathcal{F} automatically — a function on a discrete probability space is always a random variable. Now read off the preimages of the single values:

X^{-1}(0) = \{\text{TTT}\},\quad X^{-1}(1) = \{\text{TTH,THT,HTT}\},\quad X^{-1}(2) = \{\text{THH,HTH,HHT}\},\quad X^{-1}(3) = \{\text{HHH}\}.

Weigh each preimage with P and you have pushed the probability from \Omega out onto the number line. That collection of weights is the law of X:

P(X=0)=\tfrac18,\qquad P(X=1)=\tfrac38,\qquad P(X=2)=\tfrac38,\qquad P(X=3)=\tfrac18.

Drag the threshold in the figure below. It shows the raw outcomes up top, the map X carrying each to its head-count on the real line, and the sliding preimage X^{-1}\bigl((-\infty,x]\bigr) = \{X \le x\} lighting up as an event inside \Omega. Its probability is the cumulative distribution function F_X(x), read out live.

Watch the readout jump. Between the atoms the count of caught outcomes is constant, so F_X is flat; each time x crosses an integer k the preimage swallows a fresh clump of outcomes and F_X leaps by exactly P(X=k). That staircase — flat, jump, flat, jump — is the distribution function of a discrete random variable.

The law: a probability measure that lives on \mathbb{R}

The example did something profound in passing. We started with a probability measure P on the abstract space \Omega and ended with a collection of weights on \mathbb{R}. That new object is a bona fide probability measure in its own right — the pushforward of P along X, written \mu_X = P \circ X^{-1} and called the law or distribution of X:

\mu_X(B) \;=\; P\bigl(X^{-1}(B)\bigr) \;=\; P(X \in B), \qquad B \in \mathcal{B}.

It is a genuine probability measure on (\mathbb{R}, \mathcal{B}): it is non-negative, it gives the whole line total mass \mu_X(\mathbb{R}) = P(X^{-1}(\mathbb{R})) = P(\Omega) = 1, and it is countably additive because X^{-1} turns disjoint Borel sets into disjoint events and P is countably additive. Measurability of X is exactly what makes \mu_X well-defined on all of \mathcal{B} — the preimage never leaves \mathcal{F}.

This is the great simplification of the subject. Once you have \mu_X, you can forget \Omega entirely. Every probabilistic question about X — its mean, its variance, the chance it lands in any set — is answered by \mu_X alone, on the familiar real line, with no memory of the underlying coin tosses or bus stops that produced it.

The law forgets the domain, so wildly different experiments can produce identical distributions. Let \Omega_1 be two coin tosses with X_1 = \#\text{heads}, and let \Omega_2 = \{1,\dots,4\} be one throw of a four-sided die with X_2(\omega) = \#\{\text{heads in the binary word } \omega-1\}. Different sample spaces, different functions, different physical apparatus — yet run the arithmetic and both give the same law

\mu(0)=\tfrac14,\quad \mu(1)=\tfrac12,\quad \mu(2)=\tfrac14.

To a statistician holding only the distribution, X_1 and X_2 are indistinguishable. We say they are equal in distribution, written X_1 \stackrel{d}{=} X_2 — even though as functions they share no domain and are not "equal" in any pointwise sense. The random variable and its law are two different things, and this is the cleanest proof of it.

The cumulative distribution function F_X

The law \mu_X assigns a number to every Borel set — a lot of information. It can be repackaged into a single function of one real variable, the cumulative distribution function (CDF), built from exactly the rays we said were enough:

F_X(x) \;=\; P(X \le x) \;=\; \mu_X\bigl((-\infty, x]\bigr).

Remarkably, F_X carries the whole distribution — because the rays generate \mathcal{B}, knowing F_X pins down \mu_X on every Borel set. And F_X is not any old function; the measure-theoretic definition forces three properties, and these three properties in turn characterise CDFs completely.

Every F_X satisfies, and every function satisfying these three is the CDF of some random variable:

For the three-coin example the CDF is the staircase you watched being built:

F_X(x) = \begin{cases} 0, & x < 0,\\[2pt] \tfrac18, & 0 \le x < 1,\\[2pt] \tfrac48, & 1 \le x < 2,\\[2pt] \tfrac78, & 2 \le x < 3,\\[2pt] 1, & x \ge 3. \end{cases}

Note the closed left endpoints: at x=1 the value is already \tfrac48, not \tfrac18. That is right-continuity in the flesh — F_X includes the jump at the point where it happens. The height of each jump is the point mass P(X=k) = F_X(k) - F_X(k^-).

Discrete, continuous, and the density in between

The shape of F_X sorts random variables into families, according to how the law \mu_X is spread out over the line.

Where does a density even come from, in the measure-theoretic picture? It is a Radon–Nikodym derivative. A density exists exactly when \mu_X is absolutely continuous with respect to a reference measure — Lebesgue measure \lambda in the continuous case, counting measure in the discrete case — and then

f_X = \frac{d\mu_X}{d\lambda},\qquad\text{so}\qquad \mu_X(B) = \int_B \frac{d\mu_X}{d\lambda}\,d\lambda.

(The Radon–Nikodym theorem guarantees this derivative exists whenever \mu_X \ll \lambda; that is the whole story of "why densities integrate to give probabilities." Some laws are neither discrete nor continuous — mixtures, or the exotic Cantor distribution — but every law splits into these pieces.)

What X knows: the \sigma-algebra it generates

There is one more object riding along with every random variable, and it becomes indispensable the moment you meet conditional expectation. Among all \sigma-algebras on \Omega that make X measurable, there is a smallest one, gathered from the preimages of the Borel sets:

\sigma(X) \;=\; \bigl\{\, X^{-1}(B) : B \in \mathcal{B} \,\bigr\} \;\subseteq\; \mathcal{F}.

It is a \sigma-algebra (preimage commutes with the set operations), and by construction it is contained in \mathcal{F} precisely because X is a random variable. Intuitively \sigma(X) is the information carried by X: the events you can decide knowing only the value of X, and nothing more about which \omega occurred.

For the three-coin count, \sigma(X) is generated by the four preimages \{X=0\},\{X=1\},\{X=2\},\{X=3\}. Knowing X tells you how many heads fell — so you can decide the event "at least two heads" — but it can never distinguish TTH from THT from HTT, since X reports the same 1 for all three. That blind spot, made precise, is what \sigma(X) encodes, and it is the launch pad for conditioning.

Three traps, all rooted in the word "variable":