The Radon–Nikodym Theorem

Every measure we have met so far was built by hand. Lebesgue measure was assembled from outer measure and Carathéodory. A Dirac mass \delta_0 was declared to put weight 1 on any set containing the origin and 0 elsewhere. A probability distribution was written down as a rule that assigns numbers to events. Each time, a nagging question follows: could I have written this measure as an integral instead? Is there a single non-negative function f — a density — so that weighing a set with \nu is the same as integrating f over it?

\nu(A) \;=\; \int_A f \, d\mu \qquad\text{for every measurable } A.

The Radon–Nikodym theorem answers this completely. For \sigma-finite measures the answer is yes exactly when \nu is absolutely continuous with respect to \mu, meaning \nu vanishes wherever \mu does. When that condition holds, the density f exists, is unique up to \mu-null sets, and earns the name Radon–Nikodym derivative, written f = \dfrac{d\nu}{d\mu}. This one function is the bridge between measures and functions: it is what turns a probability law into a probability density function, it is the change-of-measure factor behind likelihood ratios in statistics, and it supplies the standard existence proof for conditional expectation. One theorem, and three subjects fall into place.

Two ways measures can relate: continuity and singularity

Before the theorem we need the vocabulary for how two measures on the same σ-algebra can sit relative to each other. There are two opposite extremes.

Absolute continuity. We say \nu is absolutely continuous with respect to \mu, and write \nu \ll \mu, when

\mu(A) = 0 \;\Longrightarrow\; \nu(A) = 0.

In words: \nu cannot see anything that \mu calls negligible. Wherever \mu places no mass, \nu must place none either. The name is apt — this is precisely the condition under which \nu can be reconstructed by integrating a density against \mu, because an integral \int_A f\,d\mu is automatically zero on any \mu-null set.

Mutual singularity. At the other pole, \nu and \mu are mutually singular, written \nu \perp \mu, when they live on disjoint sets: there is a measurable E carrying all of one and none of the other,

\mu(E) = 0 \qquad\text{and}\qquad \nu(E^{c}) = 0.

The two measures are concentrated on complementary pieces of the space; each is invisible to the other. The cleanest example is on the line: Lebesgue measure and the Dirac mass \delta_0 are mutually singular. Take E = \{0\}. Then Lebesgue measure gives \mu(\{0\}) = 0, while \delta_0 puts all of its mass there, so \delta_0(\mathbb{R}\setminus\{0\}) = 0. A single point holds the whole of \delta_0 and none of Lebesgue measure — the definition of \perp. And note \delta_0 \not\ll \mu: the set \{0\} is \mu-null yet \delta_0(\{0\}) = 1 \ne 0. A Dirac mass has no density with respect to length.
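On a finite set, where every subset is measurable, both relations reduce to a check on individual points, which makes them easy to experiment with. The sketch below is illustrative only: representing a measure as a dictionary of point masses, and the function names `is_abs_continuous` and `is_mutually_singular`, are assumptions made for this example. The point "c" plays the role of the origin: \mu gives it no mass, while the point mass sits entirely on it.

```python
def is_abs_continuous(nu, mu):
    """nu << mu on a finite set: nu vanishes at every point where mu does."""
    return all(nu[x] == 0 for x in mu if mu[x] == 0)


def is_mutually_singular(nu, mu):
    """nu and mu are mutually singular: no point carries mass from both."""
    return all(nu[x] == 0 or mu[x] == 0 for x in mu)


# Measures are dicts mapping each point to its non-negative mass
# (a hypothetical representation chosen for this sketch).
mu = {"a": 1, "b": 2, "c": 0}        # background measure, blind to "c"
nu_ac = {"a": 3, "b": 0, "c": 0}     # lives only where mu has mass
delta_c = {"a": 0, "b": 0, "c": 1}   # point mass at "c"

print(is_abs_continuous(nu_ac, mu))       # True
print(is_abs_continuous(delta_c, mu))     # False
print(is_mutually_singular(delta_c, mu))  # True
```

The pointwise check is enough here only because the space is finite; on the real line the definitions quantify over all measurable sets.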

The theorem: a measure that is absolutely continuous is an integral

With the vocabulary in place, the statement is short and its conclusion is exactly the density we wanted at the top of the page.

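Let \mu, \nu be \sigma-finite measures on a measurable space (X, \mathcal{M}). Then:

• Necessity. If \nu(A) = \int_A f \, d\mu for some measurable f \ge 0 and every A \in \mathcal{M}, then \nu \ll \mu.
• Existence. If \nu \ll \mu, there is a measurable f \colon X \to [0, \infty) with \nu(A) = \int_A f \, d\mu for every A \in \mathcal{M}.
• Uniqueness. Any two such densities agree \mu-almost everywhere; the common class is written \dfrac{d\nu}{d\mu}.
• Lebesgue decomposition. With no assumption of absolute continuity, \nu splits uniquely as \nu = \nu_{\mathrm{ac}} + \nu_{\mathrm{sing}} with \nu_{\mathrm{ac}} \ll \mu and \nu_{\mathrm{sing}} \perp \mu.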

The middle bullets are the heart. Absolute continuity is not just necessary but sufficient: the moment \nu respects \mu's null sets, a density materialises. The last bullet, the Lebesgue decomposition, tells you what happens in general — every measure peels apart into a piece with a density (the absolutely continuous part) and a piece concentrated where \mu is blind (the singular part). Radon–Nikodym captures exactly the absolutely continuous part; the singular part is what it must leave behind.
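The peeling-apart can be made concrete on a finite set. The following is a minimal sketch under that assumption, with measures stored as dictionaries of integer point masses; the function name `lebesgue_decomposition` and the choice to set the density to 0 on \mu-null points are conventions of this example, not part of the theorem.

```python
from fractions import Fraction


def lebesgue_decomposition(nu, mu):
    """Split nu into (absolutely continuous part, singular part, density).

    Assumes a finite set with integer point masses. The density is only
    determined where mu has mass; elsewhere it is set to 0 by convention.
    """
    ac = {x: (nu[x] if mu[x] > 0 else 0) for x in mu}
    sing = {x: (nu[x] if mu[x] == 0 else 0) for x in mu}
    density = {
        x: (Fraction(nu[x], mu[x]) if mu[x] > 0 else Fraction(0)) for x in mu
    }
    return ac, sing, density


mu = {"a": 2, "b": 4, "c": 0}
nu = {"a": 1, "b": 6, "c": 5}

ac, sing, density = lebesgue_decomposition(nu, mu)
print(ac)       # {'a': 1, 'b': 6, 'c': 0}
print(sing)     # {'a': 0, 'b': 0, 'c': 5}
print(density)  # a -> 1/2, b -> 3/2, c -> 0
```

The singular part sits entirely on "c", the one point where \mu is blind, and the absolutely continuous part is recovered as density times \mu at every point.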

Why \dfrac{d\nu}{d\mu} behaves like a derivative

The notation is a promise, and the derivative keeps it. The Radon–Nikodym derivative obeys the same algebra you expect from ordinary differentiation, with measures playing the role of the quantities being differentiated.
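Two of the rules usually meant by this are the chain rule \dfrac{d\nu}{d\lambda} = \dfrac{d\nu}{d\mu}\cdot\dfrac{d\mu}{d\lambda} (for \nu \ll \mu \ll \lambda) and the reciprocal rule \dfrac{d\mu}{d\nu} = 1 \big/ \dfrac{d\nu}{d\mu} (when also \mu \ll \nu). As a sanity check rather than a proof, the sketch below verifies both on a three-point set where every measure charges every point; the helper `rn_derivative` and the particular masses are assumptions of this example.

```python
from fractions import Fraction


def rn_derivative(nu, mu):
    """Pointwise d(nu)/d(mu) on a finite set, defined where mu has mass."""
    return {x: Fraction(nu[x], mu[x]) for x in mu if mu[x] > 0}


# Three measures on {a, b, c}, all strictly positive, so each is
# absolutely continuous with respect to the others.
lam = {"a": 1, "b": 2, "c": 4}
mu = {"a": 2, "b": 2, "c": 1}
nu = {"a": 3, "b": 1, "c": 5}

dnu_dmu = rn_derivative(nu, mu)
dmu_dlam = rn_derivative(mu, lam)

# Chain rule: multiply the two intermediate derivatives pointwise.
chain = {x: dnu_dmu[x] * dmu_dlam[x] for x in lam}
print(chain == rn_derivative(nu, lam))  # True

# Reciprocal rule: swapping the roles of the measures inverts the density.
reciprocal = {x: 1 / dnu_dmu[x] for x in mu}
print(reciprocal == rn_derivative(mu, nu))  # True
```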

Where the proof comes from. The slick modern argument is Hilbert-space flavoured, and it is the reason the theory of L^p spaces is a prerequisite here. Von Neumann's trick: assuming both measures finite, work inside L^2(\mu + \nu) and consider the bounded linear functional g \mapsto \int g \, d\nu. By the Riesz representation theorem it is given by inner product against some h \in L^2(\mu+\nu), so \int g\,d\nu = \int g h \, d(\mu+\nu). Testing this identity on indicator functions shows that 0 \le h \le 1 almost everywhere and that the set \{h = 1\} is \mu-null, hence \nu-null because \nu \ll \mu. Rearranging to \int g(1-h)\,d\nu = \int g h\,d\mu then produces the density f = \dfrac{h}{1-h}. The classical alternative builds f directly through the Hahn decomposition of the signed measures \nu - t\mu, which simultaneously delivers the Lebesgue decomposition. The \sigma-finite case follows by exhausting the space with finite-measure pieces.
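On a finite set the Riesz representer can be written down by hand, which makes the formula f = h/(1-h) easy to check. In the sketch below, the dictionary representation and the example masses are assumptions; matching the functional \sum g\,\nu against the inner product \sum g\,h\,(\mu+\nu) forces h = \nu/(\mu+\nu) at each point.

```python
from fractions import Fraction

# Finite measures on {a, b}; mu charges every point, so nu << mu.
mu = {"a": 2, "b": 4}
nu = {"a": 1, "b": 6}

# Representer of g -> sum(g * nu) in L^2(mu + nu):
# sum(g * nu) = sum(g * h * (mu + nu)) for all g gives h = nu / (mu + nu).
h = {x: Fraction(nu[x], mu[x] + nu[x]) for x in mu}

# Von Neumann's formula for the density.
f = {x: h[x] / (1 - h[x]) for x in mu}

print(h)  # a -> 1/3, b -> 3/5
print(f)  # a -> 1/2, b -> 3/2

# The density reproduces nu when integrated (here: summed) against mu.
print(all(f[x] * mu[x] == nu[x] for x in mu))  # True
```

Since \mu charges every point in this example, h never reaches 1 and the division is safe; in the general proof that is exactly what the null-set argument secures.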

See it: the measure is the area under the density

The whole idea is captured in one picture. Fix a density f \ge 0 on an interval. The measure \nu it induces assigns to a set A the area under f over A. Take A = [0, t] and slide t: the shaded region grows, and its area is \nu\bigl([0,t]\bigr) = \int_0^t f \, d\mu. The taller f is, the faster the measure accumulates — f records how concentrated \nu is, point by point, relative to the background ruler \mu.

This is exactly the relationship between a probability density and its cumulative distribution: with \mu Lebesgue measure, the running area under the pdf f is the CDF F(t) = \nu\bigl((-\infty, t]\bigr) = \int_{-\infty}^{t} f \, d\mu, and differentiating recovers F'(t) = f(t) = \dfrac{d\nu}{d\mu}(t) at almost every t, in particular wherever f is continuous. The Radon–Nikodym derivative is the pdf; the measure is the area it sweeps out.
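The picture can be reproduced numerically. The sketch below assumes the specific density f(x) = 2x on [0, 1] (chosen only because its running area t^2 is known in closed form) and approximates the integral with a midpoint Riemann sum; the function names are hypothetical.

```python
def density(x):
    """Assumed example density f(x) = 2x on [0, 1]."""
    return 2.0 * x


def nu_interval(t, steps=1000):
    """Approximate nu([0, t]) as the area under the density (midpoint rule)."""
    dx = t / steps
    return sum(density((i + 0.5) * dx) for i in range(steps)) * dx


def cdf_slope(t, h=1e-4):
    """Central-difference slope of t -> nu([0, t]), which should recover f(t)."""
    return (nu_interval(t + h) - nu_interval(t - h)) / (2 * h)


t = 0.5
print(nu_interval(t))  # close to t**2 = 0.25
print(cdf_slope(t))    # close to density(t) = 1.0
```

Sliding t to the right increases the first number faster where the density is taller, which is the content of the picture.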

Where the density shows up

Probability: the pdf is a Radon–Nikodym derivative

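Let X be a real random variable with law \nu(A) = \mathbb{P}(X \in A). The variable is called absolutely continuous (in many courses simply "continuous") precisely when \nu \ll \lambda (Lebesgue measure). In particular no single point carries positive probability, although that alone is not enough: the Cantor distribution has no atoms and still fails to be absolutely continuous. When \nu \ll \lambda does hold, Radon–Nikodym hands you a function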

f = \frac{d\nu}{d\lambda}, \qquad \mathbb{P}(X \in A) = \int_A f \, d\lambda,

and this f is exactly what every probability course calls the probability density function. A discrete random variable is the opposite case: its law is a sum of atoms \sum_n p_n \delta_{x_n}, mutually singular with Lebesgue measure, so it has no density with respect to \lambda, only a probability mass function. A mixed variable (a bit of both) is an instance of the Lebesgue decomposition \nu = \nu_{\mathrm{ac}} + \nu_{\mathrm{sing}}: a continuous part with a pdf plus a singular part. In most applications the singular part is a discrete sum of atoms, though in general it may also contain an atomless singular piece, as in the Cantor distribution.

A weighted counting measure has weights as its density

Densities are not only for the continuous world. Let \mu be counting measure on a countable set (so \mu(A) is the number of points in A), and define a weighted measure

\nu(A) = \sum_{n \in A} w_n, \qquad w_n \ge 0.

Then \nu \ll \mu (if \mu(A)=0 then A = \varnothing, so \nu(A)=0), and the Radon–Nikodym derivative is simply the weight sequence itself: \dfrac{d\nu}{d\mu}(n) = w_n. Integrating a density against counting measure is summing; the density is the term you sum. Discrete probability is this special case with w_n = p_n.
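A small sketch makes "integrating against counting measure is summing" literal. It assumes the geometric weights w_n = 2^{-(n+1)} on the non-negative integers purely as an example; the helper names `weight`, `integrate_counting`, and `nu` are hypothetical.

```python
from fractions import Fraction


def weight(n):
    """Assumed example weights w_n = 2^-(n+1), n = 0, 1, 2, ..."""
    return Fraction(1, 2 ** (n + 1))


def integrate_counting(f, A):
    """Integral of f over a finite set A against counting measure: a sum."""
    return sum(f(n) for n in A)


def nu(A):
    """The weighted measure nu(A) = sum of w_n over n in A."""
    return sum(weight(n) for n in A)


A = {0, 1, 2}
print(integrate_counting(lambda n: 1, A))  # 3, the counting measure of A
print(nu(A))                               # 7/8
print(integrate_counting(weight, A) == nu(A))  # True: the weights are the density
```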

Change of measure and conditional expectation

Two more places the derivative is indispensable. In statistics, comparing two models with laws \mathbb{P}, \mathbb{Q} (with \mathbb{Q} \ll \mathbb{P}) goes through the likelihood ratio \dfrac{d\mathbb{Q}}{d\mathbb{P}}: the Radon–Nikodym derivative is the density ratio, the engine of hypothesis testing and of the Girsanov change of measure in stochastic calculus. And for an integrable X, the conditional expectation \mathbb{E}[X \mid \mathcal{G}] is defined as the Radon–Nikodym derivative of the measure A \mapsto \int_A X \, d\mathbb{P} (restricted to the sub-σ-algebra \mathcal{G}, and signed when X changes sign) with respect to \mathbb{P}|_{\mathcal{G}}. Radon–Nikodym is what guarantees that this central object of modern probability exists for every integrable X.
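Both uses can be checked by hand on a die. The sketch below is an illustration under assumed data: a fair die \mathbb{P}, a hypothetical loaded die \mathbb{Q}, and the sub-σ-algebra generated by the partition into odd and even faces. It verifies the change-of-measure identity \mathbb{E}_{\mathbb{Q}}[X] = \mathbb{E}_{\mathbb{P}}\bigl[X \tfrac{d\mathbb{Q}}{d\mathbb{P}}\bigr] and computes \mathbb{E}[X \mid \mathcal{G}] as a ratio of two measures on each block.

```python
from fractions import Fraction as F

faces = [1, 2, 3, 4, 5, 6]
P = {x: F(1, 6) for x in faces}               # fair die
Q = {1: F(1, 12), 2: F(1, 12), 3: F(1, 6),
     4: F(1, 6), 5: F(1, 4), 6: F(1, 4)}      # assumed loaded die, Q << P

# Likelihood ratio dQ/dP, point by point.
L = {x: Q[x] / P[x] for x in faces}

# Change of measure: E_Q[X] equals E_P[X * dQ/dP].
e_q = sum(x * Q[x] for x in faces)
e_p_weighted = sum(x * L[x] * P[x] for x in faces)
print(e_q, e_p_weighted)  # 25/6 25/6

# Conditional expectation of X given the partition {odd, even}:
# on each block B it is nu(B) / P(B), where nu(B) sums X dP over B.
blocks = {"odd": [1, 3, 5], "even": [2, 4, 6]}
cond_exp = {
    name: sum(x * P[x] for x in B) / sum(P[x] for x in B)
    for name, B in blocks.items()
}
print(cond_exp)  # odd -> 3, even -> 4
```

On a finite partition the Radon–Nikodym derivative of one measure by another is just the block-by-block quotient, which is why the conditional expectation reduces to the familiar average over each block.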

The Radon–Nikodym derivative is a genuine derivative: the exact measure-theoretic generalisation of the one from calculus. On the line, take \mu to be Lebesgue measure. Then the fundamental theorem of calculus for Lebesgue integration says that for an absolutely continuous \nu with locally integrable density f, the ratio of masses over a shrinking interval B(x, r) = (x - r, x + r) converges to the density:

\frac{d\nu}{d\mu}(x) \;=\; \lim_{r \to 0} \frac{\nu\bigl(B(x, r)\bigr)}{\mu\bigl(B(x, r)\bigr)} \qquad\text{for }\mu\text{-almost every } x.

Read the right-hand side aloud: "how much \nu-mass sits near x, per unit of \mu-mass". That is concentration — a density in the physical sense (mass per length), and the Lebesgue differentiation theorem certifies it matches the abstract \dfrac{d\nu}{d\mu} almost everywhere. So the same symbol that is a pdf in probability, a likelihood ratio in statistics, and a density in physics is, at bottom, the derivative of one measure with respect to another. The notation is not a metaphor; it is a theorem.
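The limit can be watched converging. The sketch below assumes the example density f(t) = 3t^2 with respect to Lebesgue measure, so that \nu of an interval (a, b) has the closed form b^3 - a^3 and the mass ratio works out to 3x^2 + r^2; the function names are hypothetical.

```python
def nu_ball(x, r):
    """nu(B(x, r)) for the assumed density 3 t^2: closed form b^3 - a^3."""
    return (x + r) ** 3 - (x - r) ** 3


def mu_ball(x, r):
    """Lebesgue measure (length) of the interval B(x, r) = (x - r, x + r)."""
    return 2.0 * r


def mass_ratio(x, r):
    """nu-mass per unit of mu-mass near x."""
    return nu_ball(x, r) / mu_ball(x, r)


x = 0.5
target = 3 * x ** 2  # the density at x, here 0.75
errors = [abs(mass_ratio(x, r) - target) for r in (0.1, 0.01, 0.001)]
print(errors)  # shrinks roughly like r**2
```

For this density the error is exactly r^2 up to rounding, so each tenfold shrink of the radius cuts it by about a factor of one hundred.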

Five snares around absolute continuity and the derivative: