The Lebesgue Integral
The Lebesgue integral adds up a function by slicing its range (the output values) rather than
its domain. A Riemann integral chops the input axis into thin vertical strips; the Lebesgue
integral instead chops the value axis into bands and asks, for each band, how much of the space
lands that high. The integral of f over \Omega with respect to a measure \mu is written
\int_{\Omega} f \, d\mu.
Slicing by value is exactly what lets the integral handle badly-behaved measurable functions
and, crucially, pass cleanly through limits via the convergence theorems the rest of the
theory leans on. It is built in four deliberate stages, each extending the
last: first the indicator of a single set, then a finite combination of indicators (a
simple function), then any non-negative measurable function, and finally a general
signed one. We take them in order and skip no step.
Stage 0 — the indicator of a set
The atom of the whole construction is the indicator function of a
measurable set A \in \mathcal{F}: the function that is 1 on A and 0 off it,
\mathbf{1}_A(\omega) \;=\; \begin{cases} 1 & \omega \in A,\\[2pt] 0 & \omega \notin A. \end{cases}
Its integral is defined to be the measure of the set it marks out — there is nothing
to compute, this is the seed everything else grows from:
\int_{\Omega} \mathbf{1}_A \, d\mu \;=\; \mu(A).
Read it as: "weight 1 over the region A,
weight 0 everywhere else, and total it up" — and the total weight
is just the size \mu(A) of A. With
\mu = \mathbb{P} a probability measure this already says
\int \mathbf{1}_A \, d\mathbb{P} = \mathbb{P}(A), the fact that
makes probability a special case of integration.
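A minimal sketch of Stage 0 on a finite space, where everything can be checked by hand. The fair six-sided die, the names `mu`, `measure`, `indicator` and `integrate`, and the point-by-point sum used as the integral are all illustrative assumptions, not part of the construction above.

```python
from fractions import Fraction

# Hypothetical measure space: a fair six-sided die, each face with mass 1/6.
mu = {face: Fraction(1, 6) for face in range(1, 7)}

def measure(A):
    """mu(A) for a subset A of the outcome space."""
    return sum((mu[w] for w in A), Fraction(0))

def indicator(A):
    """Return the function that is 1 on A and 0 off it."""
    return lambda w: 1 if w in A else 0

def integrate(f):
    """Integral on a finite space: sum of f(w) * mu({w}) over all points."""
    return sum((f(w) * mu[w] for w in mu), Fraction(0))

evens = {2, 4, 6}
integral_of_indicator = integrate(indicator(evens))
print(integral_of_indicator, measure(evens))  # prints: 1/2 1/2
```

With a probability measure, the integral of the indicator of "the roll is even" is just the probability of that event.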
Stage 1 — simple functions
A simple function takes finitely many values
a_1, \dots, a_n, each on a measurable set
A_i, where the A_i partition
\Omega (they are disjoint and cover everything). It is just a
finite stack of indicators, one per value:
S \;=\; \sum_{i=1}^{n} a_i \, \mathbf{1}_{A_i}.
Its integral is forced on us by Stage 0 and the demand that integration be linear: integrate
the stack term by term, and each term a_i \mathbf{1}_{A_i}
contributes a_i \mu(A_i). So the integral is the value-weighted
total of the measures of those sets,
\int_{\Omega} S \, d\mu \;=\; \sum_{i=1}^{n} a_i \, \mu(A_i).
In particular, taking n = 1 with
S = \mathbf{1}_A recovers Stage 0,
\int_{\Omega} \mathbf{1}_A \, d\mu = \mu(A).
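The Stage 1 formula can be sketched directly as a weighted sum. The example below assumes a fair die as the measure space and a made-up payoff; the representation of a simple function as a list of `(a_i, A_i)` pairs and the helper names are hypothetical choices for illustration.

```python
from fractions import Fraction

# Hypothetical measure space: a fair six-sided die.
mu = {face: Fraction(1, 6) for face in range(1, 7)}

def measure(A):
    """mu(A) for a subset A of the outcome space."""
    return sum((mu[w] for w in A), Fraction(0))

def integrate_simple(terms):
    """Integral of sum_i a_i * 1_{A_i}, given as a list of (a_i, A_i) pairs."""
    cells = [A for _, A in terms]
    # The A_i must partition the space: disjoint and covering everything.
    assert sum(len(A) for A in cells) == len(mu)
    assert set().union(*cells) == set(mu)
    return sum((a * measure(A) for a, A in terms), Fraction(0))

# Made-up payoff: 0 on {1,2,3}, 10 on {4,5}, 60 on {6}.
S = [(0, {1, 2, 3}), (10, {4, 5}), (60, {6})]
integral_S = integrate_simple(S)  # 0*(1/2) + 10*(1/3) + 60*(1/6) = 40/3

# The indicator of A is the simple function 1 on A and 0 on its complement.
A = {2, 4, 6}
integral_indicator = integrate_simple([(1, A), (0, set(mu) - A)])
print(integral_S, integral_indicator)  # prints: 40/3 1/2
```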
Linearity on simple functions, line by line
The one fact the rest of the theory rests on is that
S \mapsto \int S \, d\mu is linear. It is worth
seeing exactly why, because the only subtlety is that two simple functions are written over
different partitions, and we must put them on a common one before we can add.
Take two simple functions over their own partitions of \Omega,
S \;=\; \sum_{i=1}^{m} a_i \, \mathbf{1}_{A_i}, \qquad T \;=\; \sum_{j=1}^{n} b_j \, \mathbf{1}_{B_j}.
Form the common refinement: the sets
C_{ij} = A_i \cap B_j. Because
\{A_i\} and \{B_j\} each partition
\Omega, the C_{ij} are disjoint and also
partition \Omega; and crucially, for a fixed
i the set A_i is sliced exactly into its
pieces C_{ij}, so by finite additivity of the measure
\mu,
\mu(A_i) \;=\; \sum_{j=1}^{n} \mu(C_{ij}), \qquad \mu(B_j) \;=\; \sum_{i=1}^{m} \mu(C_{ij}).
On each tiny cell C_{ij} both functions are constant —
S = a_i and T = b_j there — so for any
scalars \alpha, \beta the combination
\alpha S + \beta T is itself simple, equal to
\alpha a_i + \beta b_j on C_{ij}:
\alpha S + \beta T \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} (\alpha a_i + \beta b_j)\, \mathbf{1}_{C_{ij}}.
Now apply the Stage 1 definition to this simple function and unpack it step by step:
\int (\alpha S + \beta T)\, d\mu \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} (\alpha a_i + \beta b_j)\, \mu(C_{ij})
by the definition of the integral of a simple function. Split the sum across the plus sign:
=\; \alpha \sum_{i=1}^{m}\sum_{j=1}^{n} a_i\, \mu(C_{ij}) \;+\; \beta \sum_{i=1}^{m}\sum_{j=1}^{n} b_j\, \mu(C_{ij})
by regrouping the finite double sum. In the first double sum a_i
does not depend on j, so sum over j first;
in the second b_j does not depend on i, so
sum over i first:
=\; \alpha \sum_{i=1}^{m} a_i \Big(\sum_{j=1}^{n}\mu(C_{ij})\Big) \;+\; \beta \sum_{j=1}^{n} b_j \Big(\sum_{i=1}^{m}\mu(C_{ij})\Big)
and the two bracketed sums are exactly the additivity identities above, so they collapse:
=\; \alpha \sum_{i=1}^{m} a_i\, \mu(A_i) \;+\; \beta \sum_{j=1}^{n} b_j\, \mu(B_j) \;=\; \alpha \int S \, d\mu \;+\; \beta \int T \, d\mu.
That is linearity, with no gaps: the common refinement is the whole trick, and additivity of
\mu does the rest. Monotonicity is even quicker: if
S \le T, then a_i \le b_j on every non-empty cell C_{ij}, while an empty cell has
\mu(C_{ij}) = 0 and contributes nothing. Since each \mu(C_{ij}) \ge 0,
the term-by-term inequality survives summation over the common refinement, giving
\int S \, d\mu \le \int T \, d\mu.
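The common-refinement argument can be checked numerically on a small example. This is a sketch under assumed data: the die-like measure space, the two simple functions `S` and `T`, and the scalars are invented, and `combine` is a hypothetical helper that writes \alpha S + \beta T over the cells C_{ij} = A_i \cap B_j.

```python
from fractions import Fraction
from itertools import product

# Hypothetical measure space: six points, each with mass 1/6.
mu = {w: Fraction(1, 6) for w in range(1, 7)}

def measure(A):
    """mu(A) for a subset A of the space (empty cells get measure 0)."""
    return sum((mu[w] for w in A), Fraction(0))

def integrate_simple(terms):
    """Integral of a simple function given as (value, set) pairs."""
    return sum((a * measure(A) for a, A in terms), Fraction(0))

def combine(alpha, S, beta, T):
    """alpha*S + beta*T over the common refinement C_ij = A_i & B_j."""
    return [(alpha * a + beta * b, A & B) for (a, A), (b, B) in product(S, T)]

# Two simple functions over different partitions of the space.
S = [(2, {1, 2, 3}), (5, {4, 5, 6})]
T = [(1, {1, 2}), (4, {3, 4}), (-3, {5, 6})]
alpha, beta = Fraction(3), Fraction(-2)

lhs = integrate_simple(combine(alpha, S, beta, T))
rhs = alpha * integrate_simple(S) + beta * integrate_simple(T)
print(lhs, rhs)  # prints: 55/6 55/6
```

Exact rational arithmetic is used so that the two sides agree exactly rather than up to rounding.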
Stage 2 — non-negative functions
For a general measurable f \ge 0 we cannot write down a finite
formula, so instead we approximate it from below by simple functions and take the
best such approximation — the supremum of their integrals:
\int_{\Omega} f \, d\mu \;=\; \sup\Big\{\, \textstyle\int_{\Omega} S \, d\mu \;:\; S \text{ simple},\; 0 \le S \le f \,\Big\}.
Two things make this definition sound. First, the set of candidates is non-empty (the simple
function S \equiv 0 always qualifies), so the supremum is over a
non-empty subset of [0, \infty] and always exists, possibly equal to +\infty. Second, by
monotonicity on simple functions (just proved), candidates satisfy
\int S \, d\mu \le \int S' \, d\mu whenever
S \le S', so raising the staircase can only raise the estimate, and the
supremum genuinely captures the "best from below". The same monotonicity makes Stage 2 agree
with Stage 1: if f is itself simple, it is the largest candidate, so the supremum is attained
at S = f.
The trick is to partition the range, not the domain: slice the
y-axis into finer and finer levels and, on each level, ask "how
much of the space reaches this high?". Concretely, with levels of
height h = 2^{-n}, capped at height n, one can take the simple function
S_n \;=\; \min\big(n,\; 2^{-n} \lfloor 2^{n} f \rfloor\big),
which floors f down to the nearest level below it. Halving the level height makes each
staircase refine the previous one, so S_n \le S_{n+1}, and the cap keeps the number of values
finite even when f is unbounded. As n
grows the staircase climbs up to f and the simple-function integrals
increase to \int f \, d\mu. The interactive figure below shows
exactly this staircase tightening against the curve.
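To see a staircase converge in numbers, here is a sketch for the assumed example f(x) = x^2 on [0, 1] with Lebesgue measure, whose integral is 1/3. It uses dyadic level heights 2^{-n} capped at n (a choice made here so that each staircase refines the previous one), and it relies on the closed form 1 - \sqrt{t} for the measure of \{x : x^2 \ge t\}, which is specific to this f. The function names are hypothetical.

```python
import math

def level_set_measure(t):
    """Lebesgue measure of {x in [0, 1] : x**2 >= t} (closed form for this f)."""
    if t <= 0:
        return 1.0
    if t > 1:
        return 0.0
    return 1.0 - math.sqrt(t)

def staircase_integral(n):
    """Integral of S_n = min(n, 2**-n * floor(2**n * f)).

    S_n is the sum over k = 1 .. n * 2**n of h * 1{f >= k*h} with h = 2**-n,
    so its integral is the sum of h * measure{f >= k*h}.
    """
    h = 2.0 ** -n
    return sum(h * level_set_measure(k * h) for k in range(1, n * 2**n + 1))

values = [staircase_integral(n) for n in range(1, 9)]
for n, v in enumerate(values, start=1):
    print(n, round(v, 6))
```

The printed values increase with n and stay below 1/3, with the gap bounded by the level height 2^{-n}, since the staircase sits within one level of f everywhere.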
Monotone Convergence. Let 0 \le f_1 \le f_2 \le \cdots be a non-decreasing sequence of
non-negative measurable functions with pointwise limit
f = \lim_{n\to\infty} f_n. Then the limit and the integral may be
exchanged (both sides are allowed to equal +\infty):
\int_{\Omega} \Big(\lim_{n\to\infty} f_n\Big)\, d\mu \;=\; \lim_{n\to\infty} \int_{\Omega} f_n \, d\mu.
This is precisely what makes the Stage 2 supremum computable: the flooring staircases
S_n \uparrow f are such a non-decreasing sequence, so their integrals
rise to \int f \, d\mu with no loss in the limit.
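A small exact illustration of the exchange of limit and integral, under assumptions chosen for convenience: the space is the positive integers with \mu(\{k\}) = 2^{-k}, the function is f(k) = k, and f_n is f cut off to zero beyond k = n. The integrals of the f_n are then partial sums of \sum_k k\,2^{-k}, which increase to the integral of f, namely 2. All names are hypothetical.

```python
from fractions import Fraction

def mass(k):
    """Assumed measure on the positive integers: mu({k}) = 2**-k."""
    return Fraction(1, 2**k)

def f(k):
    """Assumed non-negative function f(k) = k."""
    return Fraction(k)

def integral_f_n(n):
    """Integral of f_n = f * 1{k <= n}, which takes finitely many values."""
    return sum((f(k) * mass(k) for k in range(1, n + 1)), Fraction(0))

integrals = [integral_f_n(n) for n in range(1, 21)]
print([str(v) for v in integrals[:4]])  # prints: ['1/2', '1', '11/8', '13/8']
print(float(integrals[-1]))             # close to 2
```

The sequence f_n is non-decreasing in n and converges pointwise to f, so the theorem applies, and the partial sums match the closed form 2 - (n + 2) / 2^n.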
Stage 3 — general functions
A general f takes both signs, but Stage 2 only knows how to
integrate non-negative functions. So split f into its
positive and negative parts,
f^{+} \;=\; \max(f, 0), \qquad f^{-} \;=\; \max(-f, 0),
each of which is non-negative and measurable. They reconstruct
f and its absolute value cleanly:
f \;=\; f^{+} - f^{-}, \qquad |f| \;=\; f^{+} + f^{-},
and at every point \omega at most one of
f^{+}(\omega), f^{-}(\omega) is non-zero (both vanish where f(\omega) = 0), so the positive
and negative parts are disjointly supported. Each part is non-negative, so each has a Stage 2
integral; define
\int_{\Omega} f \, d\mu \;=\; \int_{\Omega} f^{+} \, d\mu \;-\; \int_{\Omega} f^{-} \, d\mu,
provided the two pieces are not both +\infty (otherwise the
difference is the meaningless \infty - \infty). We call
f integrable precisely when both pieces are finite,
which by |f| = f^{+} + f^{-} and additivity of the Stage 2 integral (Stage 1 linearity
carried to the limit by Monotone Convergence) is the single
condition
\int_{\Omega} |f| \, d\mu \;=\; \int_{\Omega} f^{+} \, d\mu + \int_{\Omega} f^{-} \, d\mu \;<\; \infty.
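The split into positive and negative parts is easy to trace on a finite space. The four-point space with equal masses and the values of `f` below are invented for illustration, and the point-by-point sum stands in for the Stage 2 integral of each non-negative part.

```python
from fractions import Fraction

# Hypothetical measure space: four points, each with mass 1/4.
mu = {w: Fraction(1, 4) for w in "abcd"}
# Made-up signed function.
f = {"a": Fraction(3), "b": Fraction(-1), "c": Fraction(0), "d": Fraction(-5)}

f_plus = {w: max(v, 0) for w, v in f.items()}    # f+ = max(f, 0)
f_minus = {w: max(-v, 0) for w, v in f.items()}  # f- = max(-f, 0)

def integrate(g):
    """Integral of a non-negative function on the finite space."""
    return sum((g[w] * mu[w] for w in mu), Fraction(0))

integral_f = integrate(f_plus) - integrate(f_minus)    # 3/4 - 3/2
integral_abs = integrate(f_plus) + integrate(f_minus)  # 3/4 + 3/2
print(integral_f, integral_abs)  # prints: -3/4 9/4
```

Because both pieces are finite here, f is integrable and the subtraction is well defined.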
Two properties now extend from the simple-function case to all integrable
f, g — by approximating each part from below and passing to the limit
with Monotone Convergence — and they are used on essentially every page that follows:
- Linearity: \int (a f + b g) \, d\mu = a\!\int f \, d\mu + b\!\int g \, d\mu.
- Monotonicity: if f \le g then \int f \, d\mu \le \int g \, d\mu.
The picture that separates the two integrals is which axis you cut. The
Riemann integral chops the domain into thin vertical strips and sums
f(x_i)\,\Delta x_i, so it asks "where is
x?". The Lebesgue integral chops the range into
horizontal bands and, for each band height y, weighs the set
\{\,\omega : f(\omega) \ge y\,\} by \mu, so
it asks "how much of the space reaches this high?". Slicing the range is what frees the
integral from needing a tidy domain. On [0, 1] with \mu Lebesgue measure, the indicator of
the rationals, \mathbf{1}_{\mathbb{Q}}, has no Riemann integral: every strip contains both
rationals and irrationals, so the upper sums stay at 1 and the lower sums at 0. Yet its
Lebesgue integral is plainly \mu(\mathbb{Q} \cap [0, 1]) = 0, because a countable set has
Lebesgue measure zero.
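The two ways of cutting can be compared on a finite example, where both are exact sums. This sketch assumes a five-point space with equal masses and a made-up non-negative `f`; `by_domain` sums point by point, while `by_range` stacks horizontal bands and weighs each by the measure of the set that reaches the top of the band.

```python
from fractions import Fraction

# Hypothetical measure space: five points, each with mass 1/5.
mu = {w: Fraction(1, 5) for w in range(5)}
# Made-up non-negative function.
f = {0: Fraction(2), 1: Fraction(0), 2: Fraction(7), 3: Fraction(2), 4: Fraction(4)}

def by_domain(f):
    """Cut the domain: sum f(w) * mu({w}) point by point."""
    return sum((f[w] * mu[w] for w in mu), Fraction(0))

def by_range(f):
    """Cut the range: for each band (lo, hi], add (hi - lo) * mu{f >= hi}."""
    levels = sorted(set(f.values()) | {Fraction(0)})
    total = Fraction(0)
    for lo, hi in zip(levels, levels[1:]):
        reach = sum((mu[w] for w in mu if f[w] >= hi), Fraction(0))
        total += (hi - lo) * reach
    return total

print(by_domain(f), by_range(f))  # prints: 3 3
```

The band version never looks at where a point sits in the domain, only at how much of the space reaches each height.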
Slicing the range is also why limits pass through so cleanly — the single most
important advantage for probability, where we constantly take limits of approximating random
variables and of conditional estimates. Monotone Convergence handles increasing sequences;
its dominated cousin handles the general case:
Dominated Convergence. If f_n \to f pointwise and
all the f_n are bounded by one fixed integrable
g (so |f_n| \le g with
\int g \, d\mu < \infty), then
\lim_{n\to\infty} \int_{\Omega} f_n \, d\mu \;=\; \int_{\Omega} \lim_{n\to\infty} f_n \, d\mu \;=\; \int_{\Omega} f \, d\mu.
The dominating g is the safety net that stops mass from leaking off
to infinity. This is the theorem that lets us swap a limit and an expectation — exactly the
move behind continuity of expectation, the tower property of conditioning, and the
interchange of differentiation and integration that pricing formulas rely on.
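A standard cautionary example shows what the dominating g prevents. On [0, 1] with Lebesgue measure, take the spike f_n = n on (0, 1/n) and 0 elsewhere: it tends to 0 at every point, yet every integral equals 1, and no single integrable g bounds all the spikes. Capping the height at 1 restores domination and the integrals then tend to 0. The sketch below assumes this setup and uses hypothetical function names; the integrals are the exact Stage 1 values of these simple functions.

```python
from fractions import Fraction

def spike(n, x):
    """f_n(x) = n on (0, 1/n) and 0 elsewhere: a tall, thin spike."""
    return Fraction(n) if 0 < x < Fraction(1, n) else Fraction(0)

def spike_integral(n):
    """Simple function with value n on a set of Lebesgue measure 1/n."""
    return Fraction(n) * Fraction(1, n)

def capped_integral(n):
    """g_n = 1 on (0, 1/n): dominated by the integrable constant 1 on [0, 1]."""
    return Fraction(1) * Fraction(1, n)

x = Fraction(1, 7)  # any fixed point in (0, 1]
spike_values_at_x = [spike(n, x) for n in (1, 5, 7, 8, 100)]
spike_integrals = [spike_integral(n) for n in (1, 10, 100, 1000)]
capped_integrals = [capped_integral(n) for n in (1, 10, 100, 1000)]

print([str(v) for v in spike_values_at_x])  # prints: ['1', '5', '0', '0', '0']
print([str(v) for v in spike_integrals])    # prints: ['1', '1', '1', '1']
print([str(v) for v in capped_integrals])   # prints: ['1', '1/10', '1/100', '1/1000']
```

In the undominated case the limit of the integrals is 1 while the integral of the pointwise limit is 0; in the dominated case both are 0, as the theorem guarantees.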