The Lebesgue Integral

The Lebesgue integral adds up a function by slicing its range — the output values — rather than its domain. A Riemann integral chops the input axis into thin vertical strips; the Lebesgue integral instead chops the value axis into bands and asks, for each band, how much of the space lands that high. Measured against a measure \mu, we write it

\int_{\Omega} f \, d\mu.

Slicing by value is exactly what lets the integral handle badly-behaved measurable functions and — crucially — pass cleanly through limits, the convergence theorems the rest of the theory leans on. It is built in four deliberate stages, each extending the last: first the indicator of a single set, then a finite combination of indicators (a simple function), then any non-negative measurable function, and finally a general signed one. We take them in order and skip no step.

Stage 0 — the indicator of a set

The atom of the whole construction is the indicator function of a measurable set A \in \mathcal{F}: the function that is 1 on A and 0 off it,

\mathbf{1}_A(\omega) \;=\; \begin{cases} 1 & \omega \in A,\\[2pt] 0 & \omega \notin A. \end{cases}

Its integral is defined to be the measure of the set it marks out — there is nothing to compute, this is the seed everything else grows from:

\int_{\Omega} \mathbf{1}_A \, d\mu \;=\; \mu(A).

Read it as: "weight 1 over the region A, weight 0 everywhere else, and total it up" — and the total weight is just the size \mu(A) of A. With \mu = \mathbb{P} a probability measure this already says \int \mathbf{1}_A \, d\mathbb{P} = \mathbb{P}(A), the fact that makes probability a special case of integration.

Stage 1 — simple functions

A simple function takes finitely many values a_1, \dots, a_n, each on a measurable set A_i, where the A_i partition \Omega (they are disjoint and cover everything). It is just a finite stack of indicators, one per value:

S \;=\; \sum_{i=1}^{n} a_i \, \mathbf{1}_{A_i}.

Its integral is forced on us by Stage 0 and the demand that integration be linear: integrate the stack term by term, and each term a_i \mathbf{1}_{A_i} contributes a_i \mu(A_i). So the integral is the value-weighted total of the measures of those sets,

\int_{\Omega} S \, d\mu \;=\; \sum_{i=1}^{n} a_i \, \mu(A_i).

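In particular, taking n = 2 with a_1 = 1 on A_1 = A and a_2 = 0 on A_2 = A^{c} gives S = \mathbf{1}_A and recovers Stage 0: \int_{\Omega} \mathbf{1}_A \, d\mu = 1 \cdot \mu(A) + 0 \cdot \mu(A^{c}) = \mu(A), with the usual convention 0 \cdot \infty = 0 in case \mu(A^{c}) is infinite.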
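Stages 0 and 1 can be checked by hand on a finite space. The following is a minimal Python sketch, not a general implementation: the six-point space, the point masses in `mu_weights`, and the helper names `measure` and `integrate_simple` are all illustrative assumptions, and the measure of a set is taken to be the sum of the masses of its points.

```python
from fractions import Fraction

# Hypothetical finite measure space: six points with equal mass (a fair die).
mu_weights = {w: Fraction(1, 6) for w in range(1, 7)}

def measure(A):
    """mu(A): total mass of the points in A."""
    return sum((mu_weights[w] for w in A), Fraction(0))

def integrate_simple(terms):
    """Integral of sum_i a_i * 1_{A_i}, given as a list of (a_i, A_i) pairs."""
    return sum((a * measure(A) for a, A in terms), Fraction(0))

evens = {2, 4, 6}
odds = {1, 3, 5}

# Stage 0: the indicator of the evens, written over the partition {evens, odds}.
indicator_integral = integrate_simple([(1, evens), (0, odds)])

# Stage 1: the simple function worth 10 on the evens and -4 on the odds.
simple_integral = integrate_simple([(10, evens), (-4, odds)])

print(indicator_integral, simple_integral)
```

Exact rational arithmetic is used so that the identities hold with equality rather than up to rounding.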

Linearity on simple functions, line by line

The one fact the rest of the theory rests on is that S \mapsto \int S \, d\mu is linear. It is worth seeing exactly why, because the only subtlety is that two simple functions are written over different partitions, and we must put them on a common one before we can add.

Take two simple functions over their own partitions of \Omega,

S \;=\; \sum_{i=1}^{m} a_i \, \mathbf{1}_{A_i}, \qquad T \;=\; \sum_{j=1}^{n} b_j \, \mathbf{1}_{B_j}.

Form the common refinement: the sets C_{ij} = A_i \cap B_j. Because \{A_i\} and \{B_j\} each partition \Omega, the C_{ij} are disjoint and also partition \Omega; and crucially, for a fixed i the set A_i is sliced exactly into its pieces C_{ij}, so by finite additivity of the measure \mu,

\mu(A_i) \;=\; \sum_{j=1}^{n} \mu(C_{ij}), \qquad \mu(B_j) \;=\; \sum_{i=1}^{m} \mu(C_{ij}).

On each tiny cell C_{ij} both functions are constant — S = a_i and T = b_j there — so for any scalars \alpha, \beta the combination \alpha S + \beta T is itself simple, equal to \alpha a_i + \beta b_j on C_{ij}:

\alpha S + \beta T \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} (\alpha a_i + \beta b_j)\, \mathbf{1}_{C_{ij}}.

Now apply the Stage 1 definition to this simple function and unpack it step by step:

\int (\alpha S + \beta T)\, d\mu \;=\; \sum_{i=1}^{m}\sum_{j=1}^{n} (\alpha a_i + \beta b_j)\, \mu(C_{ij})

by the definition of the integral of a simple function. Split the sum across the plus sign:

=\; \alpha \sum_{i=1}^{m}\sum_{j=1}^{n} a_i\, \mu(C_{ij}) \;+\; \beta \sum_{i=1}^{m}\sum_{j=1}^{n} b_j\, \mu(C_{ij})

by regrouping the finite double sum. In the first double sum a_i does not depend on j, so sum over j first; in the second b_j does not depend on i, so sum over i first:

=\; \alpha \sum_{i=1}^{m} a_i \Big(\sum_{j=1}^{n}\mu(C_{ij})\Big) \;+\; \beta \sum_{j=1}^{n} b_j \Big(\sum_{i=1}^{m}\mu(C_{ij})\Big)

and the two bracketed sums are exactly the additivity identities above, so they collapse:

=\; \alpha \sum_{i=1}^{m} a_i\, \mu(A_i) \;+\; \beta \sum_{j=1}^{n} b_j\, \mu(B_j) \;=\; \alpha \int S \, d\mu \;+\; \beta \int T \, d\mu.

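That is linearity, with no gaps: the common refinement is the whole trick, and additivity of \mu does the rest. Monotonicity is even quicker. Write both integrals over the common refinement, \int S \, d\mu = \sum_{i,j} a_i\, \mu(C_{ij}) and \int T \, d\mu = \sum_{i,j} b_j\, \mu(C_{ij}). If S \le T, then a_i \le b_j on every non-empty cell C_{ij} (pick any point \omega in the cell and compare S(\omega) = a_i with T(\omega) = b_j), while an empty cell has \mu(C_{ij}) = 0 and contributes nothing to either sum. Since each \mu(C_{ij}) \ge 0, the term-by-term inequality survives summation, giving \int S \, d\mu \le \int T \, d\mu.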
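The common-refinement argument can be replayed numerically. This is a sketch under stated assumptions: the four-point space, its point masses, the two partitions, and the helper `combine` are hypothetical choices made only for illustration. The code builds the cells C_{ij} = A_i \cap B_j, integrates \alpha S + \beta T over them, and compares the result with \alpha \int S \, d\mu + \beta \int T \, d\mu.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite measure space with unequal point masses.
mu_weights = {
    "a": Fraction(1, 2),
    "b": Fraction(1, 4),
    "c": Fraction(1, 8),
    "d": Fraction(1, 8),
}

def measure(A):
    """mu(A): total mass of the points in A (0 for the empty set)."""
    return sum((mu_weights[w] for w in A), Fraction(0))

def integrate_simple(terms):
    """Integral of a simple function given as (value, set) pairs."""
    return sum((v * measure(A) for v, A in terms), Fraction(0))

# S and T are written over two different partitions of {a, b, c, d}.
S = [(3, {"a", "b"}), (5, {"c", "d"})]
T = [(1, {"a"}), (2, {"b", "c"}), (7, {"d"})]

def combine(alpha, S, beta, T):
    """alpha*S + beta*T over the common refinement C_ij = A_i & B_j."""
    return [(alpha * a + beta * b, A & B) for (a, A), (b, B) in product(S, T)]

alpha, beta = Fraction(2), Fraction(-3)
lhs = integrate_simple(combine(alpha, S, beta, T))
rhs = alpha * integrate_simple(S) + beta * integrate_simple(T)
print(lhs, rhs)
```

Some of the cells are empty here (for example the cell where S = 3 meets T = 7); they have measure zero and drop out of the sum, which is the same reason empty cells are harmless in the monotonicity argument.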

Stage 2 — non-negative functions

For a general measurable f \ge 0 we cannot write down a finite formula, so instead we approximate it from below by simple functions and take the best such approximation — the supremum of their integrals:

\int_{\Omega} f \, d\mu \;=\; \sup\Big\{\, \textstyle\int_{\Omega} S \, d\mu \;:\; S \text{ simple},\; 0 \le S \le f \,\Big\}.

Two things make this definition sound. First, the set of candidates is non-empty, since the simple function S \equiv 0 always qualifies, so the supremum is taken over a non-empty subset of [0, \infty] and always exists there, possibly as +\infty. Second, by monotonicity on simple functions (just proved), \int S \, d\mu \le \int S' \, d\mu whenever S \le S', so raising the staircase can only raise the estimate. In particular, if f is itself simple, then f is the largest candidate and the supremum is attained at S = f, so Stage 2 agrees with Stage 1 wherever both apply.

The trick is to partition the range, not the domain: slice the y-axis into finer and finer levels and, on each level, ask "how much of the space reaches this high?". Concretely, at stage n take levels of height h = 2^{-n} and let S_n be the simple function that floors f down to the nearest level below it, capped at height n so that it takes only finitely many values: S_n = \min\big(2^{-n}\lfloor 2^{n} f \rfloor,\; n\big). Halving the level height at each stage keeps the levels nested, so S_n \le S_{n+1}; as n grows the staircase climbs up to f and the simple-function integrals increase to \int f \, d\mu. The interactive figure below shows exactly this staircase tightening against the curve.
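To see a staircase integral rise, here is a small sketch for one specific case chosen for illustration: f(x) = x^2 on [0, 1] with Lebesgue measure. It assumes dyadic levels k / 2^n (so that successive staircases are nested) and uses the fact that, for this particular f, each band \{k/2^n \le f < (k+1)/2^n\} is an interval whose measure is its length. The function name `staircase_integral` is hypothetical.

```python
import math

def staircase_integral(n):
    """Integral over [0, 1] of the level-n staircase under f(x) = x**2.

    Assumes dyadic levels k / 2**n. The band {k/2**n <= f < (k+1)/2**n}
    is the interval [sqrt(k/2**n), sqrt((k+1)/2**n)), so its Lebesgue
    measure is the difference of the two square roots.
    """
    levels = 2 ** n
    total = 0.0
    for k in range(levels):
        low, high = k / levels, (k + 1) / levels
        band_measure = math.sqrt(high) - math.sqrt(low)
        total += low * band_measure  # staircase value times measure of its band
    return total

approximations = [staircase_integral(n) for n in range(1, 11)]
for n, value in enumerate(approximations, start=1):
    print(n, value)
```

The printed values increase with n and stay below 1/3, the value of \int_0^1 x^2 \, dx; since the staircase sits within 2^{-n} of f, the gap at stage n is at most 2^{-n}.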

Monotone Convergence. Let 0 \le f_1 \le f_2 \le \cdots be a non-decreasing sequence of non-negative measurable functions with pointwise limit f = \lim_{n\to\infty} f_n. Then the limit and the integral may be exchanged:

\int_{\Omega} \Big(\lim_{n\to\infty} f_n\Big)\, d\mu \;=\; \lim_{n\to\infty} \int_{\Omega} f_n \, d\mu.

This is what ties the staircase picture to the Stage 2 definition: whenever the flooring staircases are built on nested levels, so that S_n \uparrow f, they form such a non-decreasing sequence, and their integrals rise to \int f \, d\mu with no loss in the limit.
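A familiar special case may help fix the statement. With counting measure on the positive integers, integrating is summing, and the exchange of limit and integral says that the partial sums of a non-negative series converge to its full sum. The sketch below assumes the illustrative choice f(k) = 2^{-k}, truncated to f_n = f \cdot \mathbf{1}_{\{k \le n\}}; the function names are hypothetical.

```python
from fractions import Fraction

def f(k):
    """f(k) = 2**-k on the positive integers."""
    return Fraction(1, 2 ** k)

def integral_f_n(n):
    """Integral of f_n = f * 1_{k <= n} against counting measure: a finite sum."""
    return sum((f(k) for k in range(1, n + 1)), Fraction(0))

integrals = [integral_f_n(n) for n in range(1, 21)]

# The integral of f itself is the geometric series sum_{k>=1} 2**-k = 1.
gaps = [1 - value for value in integrals]
print(integrals[:4], gaps[-1])
```

The f_n increase pointwise to f, their integrals increase, and the gap to the limit value 1 halves at every step.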

Stage 3 — general functions

A general f takes both signs, but Stage 2 only knows how to integrate non-negative functions. So split f into its positive and negative parts,

f^{+} \;=\; \max(f, 0), \qquad f^{-} \;=\; \max(-f, 0),

each of which is non-negative and measurable. They reconstruct f and its absolute value cleanly:

f \;=\; f^{+} - f^{-}, \qquad |f| \;=\; f^{+} + f^{-},

and at every point \omega at most one of f^{+}(\omega), f^{-}(\omega) is non-zero (both vanish where f(\omega) = 0), so the positive and negative parts are disjointly supported. Each part is non-negative, so each has a Stage 2 integral; define

\int_{\Omega} f \, d\mu \;=\; \int_{\Omega} f^{+} \, d\mu \;-\; \int_{\Omega} f^{-} \, d\mu,

provided the two pieces are not both +\infty (otherwise the difference is the meaningless \infty - \infty). We call f integrable precisely when both pieces are finite, which by |f| = f^{+} + f^{-} and additivity of the Stage 2 integral on non-negative functions (Stage 1 linearity carried to the limit by Monotone Convergence) is the single condition

\int_{\Omega} |f| \, d\mu \;=\; \int_{\Omega} f^{+} \, d\mu + \int_{\Omega} f^{-} \, d\mu \;<\; \infty.
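The split into positive and negative parts is easy to trace on a finite space. In this sketch the three-point space, its point masses, the values of f, and the helper `integrate_nonneg` are hypothetical; on a finite space the integral of a non-negative function reduces to a weighted sum.

```python
from fractions import Fraction

# Hypothetical finite measure space and a signed function on it.
mu_weights = {"a": Fraction(1, 2), "b": Fraction(1, 4), "c": Fraction(1, 4)}
f = {"a": Fraction(3), "b": Fraction(-5), "c": Fraction(0)}

def integrate_nonneg(g):
    """Integral of a non-negative function on the finite space: a weighted sum."""
    assert all(value >= 0 for value in g.values())
    return sum((g[w] * mu_weights[w] for w in g), Fraction(0))

# Positive and negative parts; at the point c, where f is zero, both are zero.
f_plus = {w: max(v, Fraction(0)) for w, v in f.items()}
f_minus = {w: max(-v, Fraction(0)) for w, v in f.items()}
abs_f = {w: abs(v) for w, v in f.items()}

integral_f = integrate_nonneg(f_plus) - integrate_nonneg(f_minus)
integral_abs_f = integrate_nonneg(abs_f)
print(integral_f, integral_abs_f)
```

Here the two pieces are 3/2 and 5/4, so the integral of f is their difference and the integral of |f| is their sum.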

Two properties now extend from the simple-function case to all integrable f, g — by approximating each part from below and passing to the limit with Monotone Convergence — and they are used on essentially every page that follows:

Linearity: \int (a f + b g) \, d\mu = a\!\int f \, d\mu + b\!\int g \, d\mu.
Monotonicity: if f \le g then \int f \, d\mu \le \int g \, d\mu.

The picture that separates the two integrals is which axis you cut. The Riemann integral chops the domain into thin vertical strips and sums f(x_i)\,\Delta x_i: it asks "where is x?". The Lebesgue integral chops the range into horizontal bands and, for each band height y, weighs the set \{\,\omega : f(\omega) \ge y\,\} by \mu: it asks "how much of the space reaches this high?". Slicing the range is what frees the integral from needing a tidy domain. Take \mu = \lambda, Lebesgue measure on [0, 1]: the indicator of the rationals, \mathbf{1}_{\mathbb{Q}}, has no Riemann integral (every strip contains both rationals and irrationals, so the lower sums stay at 0 and the upper sums at 1), yet its Lebesgue integral is plainly \lambda(\mathbb{Q} \cap [0, 1]) = 0, because the rationals are countable.
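On a finite space both ways of cutting can be carried out exactly and compared. The sketch below assumes a hypothetical five-point space with a non-negative f. The domain-first total walks the points; the range-first total walks the distinct values of f from the bottom up and, for each band between consecutive values, adds the band's thickness times the measure of \{f \ge y\}.

```python
from fractions import Fraction

# Hypothetical finite space: point -> (value of f, point mass). f is non-negative.
points = {
    "a": (2, Fraction(1, 4)),
    "b": (5, Fraction(1, 8)),
    "c": (2, Fraction(1, 8)),
    "d": (5, Fraction(1, 4)),
    "e": (0, Fraction(1, 4)),
}

# Domain-first: add f(w) * mu({w}) point by point.
by_domain = sum((value * mass for value, mass in points.values()), Fraction(0))

# Range-first: for each level y, weigh the set {f >= y} and multiply by the
# thickness of the band between y and the previous level.
levels = sorted({value for value, _ in points.values()})
by_range = Fraction(0)
previous = 0
for y in levels:
    reach = sum((mass for value, mass in points.values() if value >= y), Fraction(0))
    by_range += (y - previous) * reach
    previous = y

print(by_domain, by_range)
```

The two totals agree, as they must; the range-first version never needs to know where in the domain the points sit, only how much mass reaches each height.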

Slicing the range is also why limits pass through so cleanly, the single most important advantage for probability, where we constantly take limits of approximating random variables and of conditional estimates. Monotone Convergence handles increasing sequences; its dominated cousin handles sequences that need not be monotone:

Dominated Convergence. If the f_n are measurable, f_n \to f pointwise, and all the f_n are bounded by one fixed integrable g (so |f_n| \le g for every n, with \int g \, d\mu < \infty), then f is integrable and

\lim_{n\to\infty} \int_{\Omega} f_n \, d\mu \;=\; \int_{\Omega} \lim_{n\to\infty} f_n \, d\mu \;=\; \int_{\Omega} f \, d\mu.

The dominating g is the safety net that stops mass from leaking off to infinity. This is the theorem that lets us swap a limit and an expectation — exactly the move behind continuity of expectation, the tower property of conditioning, and the interchange of differentiation and integration that pricing formulas rely on.
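A standard counterexample, not taken from the text above, shows what goes wrong without the dominating g. On [0, 1] with Lebesgue measure, let f_n = n \cdot \mathbf{1}_{(0, 1/n)}: a spike of height n on an interval of length 1/n. The sketch below (function names hypothetical) checks that f_n(x) is eventually 0 at a few fixed sample points while every integral equals 1, so the limit of the integrals is 1 but the integral of the limit is 0.

```python
from fractions import Fraction

def f_n(n, x):
    """The spike n * 1_{(0, 1/n)}(x) on [0, 1]."""
    return n if 0 < x < Fraction(1, n) else 0

def integral_f_n(n):
    """Height n times the Lebesgue measure 1/n of the interval (0, 1/n)."""
    return n * Fraction(1, n)

sample_points = [Fraction(1, 3), Fraction(1, 50), Fraction(1, 1000)]

# At a fixed x the spike has moved past x once n >= 1/x, so f_n(x) is 0 from then on.
tail_values = [f_n(n, x) for x in sample_points for n in range(1000, 1010)]

integrals = [integral_f_n(n) for n in range(1, 50)]
print(set(tail_values), set(integrals))
```

No integrable g can dominate this sequence: near 0 the pointwise supremum of the f_n grows like 1/x, which is not integrable on (0, 1], so the hypothesis of the theorem fails exactly where the conclusion does.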