The Radon–Nikodym Theorem
Every measure we have met so far was built by hand. Lebesgue measure was assembled from
outer measure and Carathéodory's criterion. A Dirac mass \delta_0 was declared to
put weight 1 on any set containing the origin and 0
elsewhere. A probability distribution was written down as a rule that assigns numbers to events. Each
time, a nagging question follows: could this measure, call it \nu, have been written as an integral against some reference measure \mu instead?
Is there a single non-negative function f, a density, so that
weighing a set with \nu is the same as
integrating
f over it against \mu?
\nu(A) \;=\; \int_A f \, d\mu \qquad\text{for every measurable } A.
The Radon–Nikodym theorem answers this completely for \sigma-finite measures. It says: yes, exactly when
\nu is absolutely continuous with respect to
\mu, meaning \nu vanishes wherever
\mu does. When that single condition holds, the density
f exists, is essentially unique, and earns the name
Radon–Nikodym derivative, written f = \dfrac{d\nu}{d\mu}.
This one function is the bridge between measures and functions: it is what turns a
probability law into a probability density function, it is the change-of-measure
factor behind likelihood ratios in statistics, and it is the object that makes
conditional expectation definable at all. One theorem, and three subjects fall into
place.
Two ways measures can relate: continuity and singularity
Before the theorem we need the vocabulary for how two measures on the same
σ-algebra can sit relative to
each other. There are two opposite extremes.
Absolute continuity. We say \nu is absolutely
continuous with respect to \mu, and write
\nu \ll \mu, when
\mu(A) = 0 \;\Longrightarrow\; \nu(A) = 0.
In words: \nu cannot see anything that \mu calls
negligible. Wherever \mu places no mass, \nu must
place none either. The condition is clearly necessary for
\nu to be reconstructed by integrating a density against
\mu, because an integral \int_A f\,d\mu is
automatically zero on any \mu-null set; the theorem below shows that it is also sufficient.
Mutual singularity. At the other pole, \nu and
\mu are mutually singular, written
\nu \perp \mu, when they live on disjoint sets: there is a
measurable E carrying all of one and none of the other,
\mu(E) = 0 \qquad\text{and}\qquad \nu(E^{c}) = 0.
The two measures are concentrated on complementary pieces of the space; each is invisible to the
other. The cleanest example is on the line: Lebesgue measure and the Dirac mass
\delta_0 are mutually singular. Take E = \{0\}.
Then Lebesgue measure gives \mu(\{0\}) = 0, while
\delta_0 puts all of its mass there, so
\delta_0(\mathbb{R}\setminus\{0\}) = 0. A single point holds the whole of
\delta_0 and none of Lebesgue measure — the definition of
\perp. And note \delta_0 \not\ll \mu: the set
\{0\} is \mu-null yet
\delta_0(\{0\}) = 1 \ne 0. A Dirac mass has no density with respect
to length.
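On a finite set where every subset is measurable, both relations reduce to statements about individual points, so they can be checked mechanically. The following is a minimal sketch under that assumption; the three-point space and the names `is_abs_continuous`, `is_mutually_singular`, `mu`, `nu` and `delta_c` are hypothetical, chosen only for illustration. Here `delta_c` plays the role of \delta_0 and the `mu`-null point `c` plays the role of the origin.

```python
from fractions import Fraction

def is_abs_continuous(nu, mu):
    """nu << mu on a finite space: every mu-null point is also nu-null."""
    return all(nu.get(x, 0) == 0 for x in set(mu) | set(nu) if mu.get(x, 0) == 0)

def is_mutually_singular(nu, mu):
    """nu and mu are mutually singular: no point carries mass from both."""
    return all(nu.get(x, 0) == 0 or mu.get(x, 0) == 0 for x in set(mu) | set(nu))

# Hypothetical three-point space {a, b, c}; point c is mu-null.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}
nu = {"a": Fraction(1, 3), "b": Fraction(2, 3), "c": Fraction(0)}
delta_c = {"a": Fraction(0), "b": Fraction(0), "c": Fraction(1)}

print("nu << mu:", is_abs_continuous(nu, mu))
print("delta_c << mu:", is_abs_continuous(delta_c, mu))
print("delta_c singular to mu:", is_mutually_singular(delta_c, mu))
```

The point-by-point test is valid only because the space is finite: a set is null exactly when each of its points is null. On the real line no such shortcut exists, which is why the definitions are phrased in terms of sets.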
The theorem: a measure that is absolutely continuous is an integral
With the vocabulary in place, the statement is short and its conclusion is exactly the density we
wanted at the top of the page.
Let \mu, \nu be \sigma-finite measures on a
measurable space (X, \mathcal{M}). Then:
-
Absolute continuity, defined.
\nu \ll \mu means \mu(A) = 0 \Rightarrow \nu(A) = 0.
-
Existence of a density. If \nu \ll \mu, there is a
non-negative measurable f — the Radon–Nikodym derivative
f = \dfrac{d\nu}{d\mu} — with
\displaystyle \nu(A) = \int_A f \, d\mu for every
A \in \mathcal{M}.
-
Uniqueness. The density f is unique
\mu-almost everywhere: any two such densities agree
except on a \mu-null set.
-
Integration against \nu. For every non-negative (or
\nu-integrable) g,
\displaystyle \int g \, d\nu = \int g\, \frac{d\nu}{d\mu} \, d\mu.
-
Lebesgue decomposition. Even without \nu \ll \mu, any
\sigma-finite \nu splits uniquely as
\nu = \nu_{\mathrm{ac}} + \nu_{\mathrm{sing}} with
\nu_{\mathrm{ac}} \ll \mu and
\nu_{\mathrm{sing}} \perp \mu.
The middle bullets are the heart. Absolute continuity is not just necessary but sufficient:
the moment \nu respects \mu's null sets, a density
materialises. The last bullet, the Lebesgue decomposition, tells you what happens in
general — every measure peels apart into a piece with a density (the absolutely continuous part) and a
piece concentrated where \mu is blind (the singular part). Radon–Nikodym
captures exactly the absolutely continuous part; the singular part is what it must leave behind.
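The existence and decomposition statements can be made concrete on a finite space, where the density is just a ratio of point masses. This is a sketch under that simplifying assumption, not a general algorithm; the measures and the helper names `lebesgue_decomposition`, `density` and `integrate` are hypothetical.

```python
from fractions import Fraction

def lebesgue_decomposition(nu, mu):
    """Split nu into (nu_ac, nu_sing) relative to mu, point by point."""
    nu_ac = {x: (m if mu.get(x, 0) > 0 else Fraction(0)) for x, m in nu.items()}
    nu_sing = {x: (m if mu.get(x, 0) == 0 else Fraction(0)) for x, m in nu.items()}
    return nu_ac, nu_sing

def density(nu_ac, mu):
    """Ratio of point masses where mu > 0; the value 0 on mu-null points is an arbitrary choice."""
    return {x: (nu_ac.get(x, Fraction(0)) / m if m > 0 else Fraction(0))
            for x, m in mu.items()}

def integrate(f, mu, A):
    """Integral of f over A against mu on a finite space."""
    return sum(f[x] * mu[x] for x in A)

# Hypothetical data: point c is mu-null but carries nu-mass.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(0)}
nu = {"a": Fraction(1, 4), "b": Fraction(1, 2), "c": Fraction(1, 3)}

nu_ac, nu_sing = lebesgue_decomposition(nu, mu)
f = density(nu_ac, mu)

print("density:", f)
print("singular part:", nu_sing)
print("integral of f over {a, b}:", integrate(f, mu, ["a", "b"]))
```

The mass at `c` cannot be recovered by any density against `mu`, so it ends up in the singular part, while integrating `f` reproduces the absolutely continuous part exactly.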
Why \dfrac{d\nu}{d\mu} behaves like a derivative
The notation is a promise, and the derivative keeps it. The Radon–Nikodym derivative obeys the same
algebra you expect from ordinary differentiation, with measures playing the role of the quantities
being differentiated.
-
Linearity. If \nu_1, \nu_2 \ll \mu, then
\nu_1 + \nu_2 \ll \mu and
\dfrac{d(\nu_1 + \nu_2)}{d\mu} = \dfrac{d\nu_1}{d\mu} + \dfrac{d\nu_2}{d\mu}
(\mu-a.e.).
-
Chain rule. If \nu \ll \lambda \ll \mu, the derivatives
multiply, exactly as a chain rule should:
\dfrac{d\nu}{d\mu} = \dfrac{d\nu}{d\lambda}\,\dfrac{d\lambda}{d\mu}
(\mu-a.e.).
-
Reciprocal. If \nu \ll \mu and
\mu \ll \nu (they share the same null sets — call them
equivalent), then the derivative is invertible:
\dfrac{d\nu}{d\mu} = \left(\dfrac{d\mu}{d\nu}\right)^{-1}
(\mu-a.e.).
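The chain rule and the reciprocal rule both follow from the theorem's integration formula together with uniqueness. A short derivation, written as a sketch and assuming all three measures are \sigma-finite:

```latex
% Chain rule: assume \nu \ll \lambda \ll \mu.
\begin{align*}
\nu(A)
  &= \int_A \frac{d\nu}{d\lambda}\, d\lambda
     && \text{(definition of } d\nu/d\lambda\text{)} \\
  &= \int_A \frac{d\nu}{d\lambda}\,\frac{d\lambda}{d\mu}\, d\mu
     && \text{(integration rule for } \lambda \ll \mu \text{ with } g = \mathbf{1}_A\, d\nu/d\lambda\text{)}.
\end{align*}
% Uniqueness of the density of \nu with respect to \mu then gives
% d\nu/d\mu = (d\nu/d\lambda)(d\lambda/d\mu), \mu-a.e.
%
% Reciprocal: apply the chain rule to \mu \ll \nu \ll \mu and use d\mu/d\mu = 1:
\[
1 \;=\; \frac{d\mu}{d\nu}\,\frac{d\nu}{d\mu} \qquad (\mu\text{-a.e.}).
\]
```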
Where the proof comes from. The slick modern argument is Hilbert-space flavoured, and it is the reason
the theory of L^p
spaces is a prerequisite here. Von Neumann's trick: assuming both measures finite, work
inside L^2(\mu + \nu) and consider the bounded linear functional
g \mapsto \int g \, d\nu. By the Riesz representation theorem
it is given by inner product against some h \in L^2(\mu+\nu); a short
argument shows 0 \le h \le 1 almost everywhere, with h < 1 outside a set that is null for \mu and hence, by \nu \ll \mu, for \nu. Unwinding the identity
\int g\,d\nu = \int g h \, d(\mu+\nu) produces the density
f = \dfrac{h}{1-h}. The classical alternative builds
f directly through the Hahn decomposition of the signed
measures \nu - t\mu, which simultaneously delivers the Lebesgue
decomposition. The \sigma-finite case follows by exhausting the space with
finite-measure pieces.
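The unwinding step can be written out in three lines. This is a sketch of the standard argument, assuming \mu and \nu are finite, \nu \ll \mu, and h is the representing function from the text with 0 \le h < 1 almost everywhere; the test functions g are bounded and measurable.

```latex
\begin{align*}
\int g \, d\nu &= \int g h \, d\mu + \int g h \, d\nu
  \;\Longrightarrow\;
  \int g\,(1 - h) \, d\nu = \int g h \, d\mu. \\
\intertext{Substitute $g = \mathbf{1}_A\,(1 + h + \dots + h^{n})$ and use the geometric sum on the left:}
\int_A \bigl(1 - h^{n+1}\bigr) \, d\nu &= \int_A h\,(1 + h + \dots + h^{n}) \, d\mu. \\
\intertext{Let $n \to \infty$; both integrands increase, so monotone convergence applies:}
\nu(A) &= \int_A \frac{h}{1 - h} \, d\mu .
\end{align*}
```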
See it: the measure is the area under the density
The whole idea is captured in one picture. Fix a density f \ge 0 on an
interval. The measure \nu it induces assigns to a set
A the area under f over
A. Take A = [0, t] and slide
t: the shaded region grows, and its area is
\nu\bigl([0,t]\bigr) = \int_0^t f \, d\mu. The taller
f is, the faster the measure accumulates — f
records how concentrated \nu is, point by point, relative to the
background ruler \mu.
This is exactly the relationship between a probability density and its cumulative distribution: the
running area under the pdf f is the CDF
F(t) = \nu\bigl((-\infty, t]\bigr) = \int_{-\infty}^{t} f \, d\mu, and
differentiating gives F' = f = \dfrac{d\nu}{d\mu} back at almost every point (here \mu is Lebesgue measure). The Radon–Nikodym
derivative is the pdf; the measure is the area it sweeps out.
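The picture can be reproduced numerically. The sketch below assumes a hypothetical density f(x) = 2x on [0, 1] with Lebesgue measure as the background ruler; the function names and the midpoint-rule integrator are illustrative choices, not a prescribed method.

```python
def density(x):
    """Hypothetical density f(x) = 2x on [0, 1]."""
    return 2.0 * x

def nu_interval(a, b, n=10_000):
    """Approximate nu([a, b]), the integral of the density, by the midpoint rule."""
    h = (b - a) / n
    return sum(density(a + (i + 0.5) * h) for i in range(n)) * h

def cdf(t):
    """Running area under the density from 0 to t."""
    return nu_interval(0.0, t)

# The accumulated mass grows like t^2 for this density.
for t in (0.25, 0.5, 1.0):
    print(f"nu([0, {t}]) is approximately {cdf(t):.6f}")

# A centred finite difference of the CDF recovers the density at t = 0.5.
eps = 1e-4
slope = (cdf(0.5 + eps) - cdf(0.5 - eps)) / (2 * eps)
print(f"slope of F at 0.5 is approximately {slope:.6f}, density there is {density(0.5)}")
```

For this density the exact answer is \nu([0,t]) = t^2, so the printed values can be checked by hand.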
Where the density shows up
Probability: the pdf is a Radon–Nikodym derivative
Let X be a real random variable with law
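\nu(A) = \mathbb{P}(X \in A). The variable is called absolutely continuous (in most courses, simply continuous)
precisely when \nu \ll \lambda (Lebesgue measure). Having no atoms is necessary for this but not sufficient: the Cantor distribution
gives no single point positive probability, yet it is singular with respect to \lambda. When \nu \ll \lambda does hold, Radon–Nikodym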
then hands you a function
f = \frac{d\nu}{d\lambda}, \qquad \mathbb{P}(X \in A) = \int_A f \, d\lambda,
and this f is exactly what every probability course calls the
probability density function. A discrete random variable is the opposite
case: its law is a sum of atoms \sum_n p_n \delta_{x_n}, mutually singular
with Lebesgue measure, so it has no density with respect to \lambda, only a probability mass function. A
mixed variable (a bit of both) is described by a Lebesgue decomposition
\nu = \nu_{\mathrm{ac}} + \nu_{\mathrm{sing}}: a part with a pdf
plus a singular part, typically a discrete sum of atoms, though in general it may also contain an atomless singular component.
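A mixed law is easy to compute with once the two parts are kept separate. The sketch below assumes a hypothetical law that puts mass 1/4 on the point 0 and spreads the remaining 3/4 uniformly over [0, 1]; the constants and function names are illustrative only.

```python
from fractions import Fraction

ATOM_AT = Fraction(0)        # location of the single atom (assumed)
ATOM_MASS = Fraction(1, 4)   # mass of the singular part
AC_WEIGHT = Fraction(3, 4)   # total mass of the part with a pdf

def ac_part(a, b):
    """nu_ac([a, b]): integral of the pdf (3/4) * 1_[0,1] over [a, b]."""
    lo, hi = max(a, Fraction(0)), min(b, Fraction(1))
    return AC_WEIGHT * max(hi - lo, Fraction(0))

def sing_part(a, b):
    """nu_sing([a, b]): the atom's mass if the interval contains the atom."""
    return ATOM_MASS if a <= ATOM_AT <= b else Fraction(0)

def law(a, b):
    """nu([a, b]) = nu_ac([a, b]) + nu_sing([a, b])."""
    return ac_part(a, b) + sing_part(a, b)

print("P(0 <= X <= 1/2):", law(Fraction(0), Fraction(1, 2)))
print("P(X = 0):", law(Fraction(0), Fraction(0)))
print("total mass:", law(Fraction(-1), Fraction(2)))
```

The single point 0 is Lebesgue-null yet has positive probability, which is exactly why this law has no pdf on its own and needs the two-part description.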
A weighted counting measure has weights as its density
Densities are not only for the continuous world. Let \mu be
counting measure on a countable set (so \mu(A) is the
number of points in A), and define a weighted measure
\nu(A) = \sum_{n \in A} w_n, \qquad w_n \ge 0.
Then \nu \ll \mu (if \mu(A)=0 then
A = \varnothing, so \nu(A)=0), and the
Radon–Nikodym derivative is simply the weight sequence itself:
\dfrac{d\nu}{d\mu}(n) = w_n. Integrating a density against counting measure
is summing; the density is the term you sum. Discrete probability is this special case with
w_n = p_n.
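The claim that integrating against counting measure is summing can be verified exhaustively on a small set. A minimal sketch, assuming a hypothetical three-point index set with weights 1/2, 1/3 and 1/6; all names are illustrative.

```python
from fractions import Fraction
from itertools import combinations

weights = {1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)}  # hypothetical w_n

def nu(A):
    """Weighted measure: the sum of the weights of the points in A."""
    return sum(weights[n] for n in A)

def integrate_counting(f, A):
    """Integral of f over A against counting measure: a plain sum over the points of A."""
    return sum(f(n) for n in A)

def density(n):
    """Candidate Radon-Nikodym derivative d(nu)/d(mu) at n: the weight itself."""
    return weights[n]

# Check nu(A) = integral of the density over A for every subset A.
subsets = [c for k in range(len(weights) + 1) for c in combinations(weights, k)]
all_match = all(nu(A) == integrate_counting(density, A) for A in subsets)
print("density reproduces nu on all", len(subsets), "subsets:", all_match)

# Integration against nu: the sum of g(n) * w_n, here with g(n) = n.
integral_g = integrate_counting(lambda n: n * density(n), weights)
print("integral of g(n) = n against nu:", integral_g)
```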
Change of measure and conditional expectation
Two more places the derivative is indispensable. In statistics, comparing two models with laws
\mathbb{P}, \mathbb{Q} (with \mathbb{Q} \ll \mathbb{P})
goes through the likelihood ratio \dfrac{d\mathbb{Q}}{d\mathbb{P}}
— the Radon–Nikodym derivative is the density ratio, the engine of hypothesis testing and of
the Girsanov change of measure in stochastic calculus. And the conditional expectation
\mathbb{E}[X \mid \mathcal{G}] of an integrable X \ge 0 is defined as the Radon–Nikodym
derivative of the measure A \mapsto \int_A X \, d\mathbb{P} (restricted to
the sub-σ-algebra \mathcal{G}) with respect to
\mathbb{P}|_{\mathcal{G}}; a general integrable X is handled through its positive and negative parts. Without Radon–Nikodym, one of the central objects
of modern probability would lack its standard general definition.
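Both uses can be checked by hand on a finite sample space. The sketch below is illustrative only: the four-point space, the laws `P` and `Q`, the partition generating the sub-σ-algebra, and the helper names are all assumptions made for the example, and every block of the partition is assumed to have positive probability.

```python
from fractions import Fraction

omega = [1, 2, 3, 4]
P = {w: Fraction(1, 4) for w in omega}
Q = {1: Fraction(1, 2), 2: Fraction(1, 4), 3: Fraction(1, 8), 4: Fraction(1, 8)}

# Likelihood ratio dQ/dP: a ratio of point masses, since P charges every point.
ratio = {w: Q[w] / P[w] for w in omega}

def expect(g, measure):
    """Expectation of g under a law on the finite space omega."""
    return sum(g(w) * measure[w] for w in omega)

def g(w):
    return Fraction(w * w)

lhs = expect(g, Q)                              # E_Q[g]
rhs = expect(lambda w: g(w) * ratio[w], P)      # E_P[g * dQ/dP]
print("E_Q[g] =", lhs, " E_P[g dQ/dP] =", rhs)

def cond_exp(X, partition, law):
    """E[X | G] for G generated by a finite partition: block averages of X."""
    out = {}
    for block in partition:
        mass = sum(law[w] for w in block)            # P(block), assumed > 0
        signed = sum(X(w) * law[w] for w in block)   # integral of X over the block
        for w in block:
            out[w] = signed / mass
    return out

partition = [(1, 2), (3, 4)]
cond = cond_exp(lambda w: Fraction(w), partition, P)
print("E[X | G] =", cond)
```

The block average is the density of A \mapsto \int_A X \, d\mathbb{P} with respect to \mathbb{P} on the coarse σ-algebra: it is constant on each block, and integrating it over a block returns the integral of X there.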
It is a genuine derivative, the exact measure-theoretic generalisation of the one from calculus. On
the line, take \mu to be Lebesgue measure. Then the Lebesgue differentiation theorem, the
Lebesgue-integral form of the fundamental theorem of calculus, says that for an absolutely continuous
\nu with locally integrable density f, the ratio of masses over a
shrinking interval converges to the density:
\frac{d\nu}{d\mu}(x) \;=\; \lim_{r \to 0} \frac{\nu\bigl(B(x, r)\bigr)}{\mu\bigl(B(x, r)\bigr)} \qquad\text{for }\mu\text{-almost every } x.
Read the right-hand side aloud: "how much \nu-mass sits near
x, per unit of \mu-mass". That is
concentration — a density in the physical sense (mass per length), and the Lebesgue
differentiation theorem certifies it matches the abstract \dfrac{d\nu}{d\mu}
almost everywhere. So the same symbol that is a pdf in probability, a likelihood ratio in statistics,
and a density in physics is, at bottom, the derivative of one measure with respect to another. The
notation is not a metaphor; it is a theorem.
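The limit of mass ratios can be watched converging. The sketch below assumes a hypothetical density f(t) = 3t^2 against Lebesgue measure, for which the mass of an interval has the closed form b^3 - a^3; the function names are illustrative.

```python
def nu_ball(x, r):
    """nu(B(x, r)) for the assumed density f(t) = 3 t^2: integral of f over (x - r, x + r)."""
    return (x + r) ** 3 - (x - r) ** 3

def mu_ball(x, r):
    """Lebesgue measure of the interval (x - r, x + r)."""
    return 2.0 * r

def mass_ratio(x, r):
    """nu-mass per unit of mu-mass near x."""
    return nu_ball(x, r) / mu_ball(x, r)

x = 0.5
for r in (0.1, 0.01, 0.001):
    print(f"r = {r}: ratio = {mass_ratio(x, r):.8f}")
print("density at x:", 3 * x ** 2)
```

For this density the ratio equals 3x^2 + r^2 exactly, so the gap to the density shrinks like r^2 as the interval closes in on x.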
Five snares around absolute continuity and the derivative:
-
Absolute continuity of a measure is not the same as absolute continuity of a
function. The two are related: a nondecreasing function
F is absolutely continuous on [a,b] exactly
when the Lebesgue–Stieltjes measure dF it induces is absolutely continuous with respect to
Lebesgue measure, and then F' = \frac{d(dF)}{d\lambda} almost everywhere. But the phrase
means different things in the two settings, and conflating them causes confusion. Here we always
mean \mu(A)=0 \Rightarrow \nu(A)=0.
-
\sigma-finiteness is required. Drop it and the density
can fail to exist. On [0,1], let \mu be
counting measure (not \sigma-finite — the interval is
uncountable) and \nu = \lambda Lebesgue measure. Then
\lambda \ll \mu holds vacuously (only \varnothing
is \mu-null), yet no density
f can satisfy \lambda(A) = \int_A f\, d\mu: the
right side is a sum over points, so taking A = \{x\} gives 0 = \lambda(\{x\}) = f(x) for every x, forcing f \equiv 0 and hence
\lambda \equiv 0, false. Absolute continuity alone is not enough.
-
The derivative is only defined up to \mu-a.e. equality.
You may freely change f on any \mu-null set;
"\dfrac{d\nu}{d\mu}" names an equivalence class of functions, not a single
one. Asking for its value at a single \mu-null point is meaningless.
-
\ll is not symmetric.
\nu \ll \mu does not give \mu \ll \nu. Example:
on [0,1] take \nu(A) = \int_A \mathbf{1}_{[0,1/2]}\, d\lambda (so
\nu \ll \lambda, with density \mathbf{1}_{[0,1/2]}). Then
\nu\bigl((1/2,1]\bigr) = 0 while \lambda\bigl((1/2,1]\bigr) = 1/2, so \lambda \not\ll \nu. In general, any
density that vanishes on a set of positive \mu-measure breaks the reverse direction. Only when the null
sets coincide are the measures equivalent and the derivative invertible.
-
A nonzero singular measure has no density. The Dirac mass
\delta_0 (or any atom) is not absolutely continuous with respect
to Lebesgue measure — \{0\} is Lebesgue-null but carries mass
1 — so there is no f with
\delta_0(A) = \int_A f\, d\lambda. Radon–Nikodym does not apply; the
Lebesgue decomposition is what handles such a measure, quarantining it as the singular part.
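Two of the snares above, the almost-everywhere ambiguity of the density and the asymmetry of \ll, can be seen side by side on a finite space. This is a sketch under that simplifying assumption; the three-point space, the two candidate densities and the helper names are hypothetical.

```python
from fractions import Fraction

# Hypothetical reference measure; point c is mu-null.
mu = {"a": Fraction(1, 2), "b": Fraction(1, 2), "c": Fraction(0)}

def measure_from_density(f, mu):
    """nu({x}) = f(x) * mu({x}) on a finite space."""
    return {x: f[x] * mu[x] for x in mu}

def is_abs_continuous(nu, mu):
    """nu << mu on a finite space: every mu-null point is also nu-null."""
    return all(nu[x] == 0 for x in mu if mu[x] == 0)

# Two densities that differ only on the mu-null point c.
f1 = {"a": Fraction(2), "b": Fraction(0), "c": Fraction(0)}
f2 = {"a": Fraction(2), "b": Fraction(0), "c": Fraction(7)}

nu1 = measure_from_density(f1, mu)
nu2 = measure_from_density(f2, mu)

print("same measure from both densities:", nu1 == nu2)
print("nu << mu:", is_abs_continuous(nu1, mu))
print("mu << nu:", is_abs_continuous(mu, nu1))
```

The value of the density at `c` never affects the measure, and because the density vanishes at `b`, where `mu` has positive mass, the reverse relation fails.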