Expectation as an Integral

You already know two formulas for an average. For a die you write E[X] = \sum_i x_i\, p(x_i); for a continuous quantity you write E[X] = \int x\, f(x)\, dx. They look like different animals — one a sum, one an integral — and a first course keeps them in separate boxes marked "discrete" and "continuous". But a probability is just a measure (one that happens to assign the whole space total weight 1), and a random variable is just a measurable function. So there is really only one notion of average, and it is a single object we have already built:

E[X] \;=\; \int_{\Omega} X \, dP.

This is the Lebesgue integral of the function X : \Omega \to \mathbb{R} against the probability measure P, and that is the whole idea of this page. The sum and the density integral are not two definitions of expectation; they are two computations of the same integral, read off in two special cases. Once you see expectation as \int X\,dP, everything a probabilist wants from it (linearity, monotonicity, the variance shortcut, Jensen's inequality) is just the toolbox of Lebesgue integration, inherited for free.

Why it matters. An insurer setting a premium, a casino sizing a bet, a physicist predicting the mean energy of a gas, a machine-learning model minimising an expected loss — all of them are computing \int X\,dP. The measure-theoretic view is the one that survives contact with mixtures of discrete and continuous parts, with random variables that have no density, and with infinite-dimensional spaces of paths. It is the language every serious use of probability is written in.

The definition, in three stages

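Fix a probability space (\Omega, \mathcal{F}, P) with P(\Omega) = 1. A random variable X is exactly a measurable function \Omega \to \mathbb{R}. Its expectation is built by the same three-stage Lebesgue construction as any other integral. Nothing new is needed; we just rename \int f\,d\mu to E[X].

Stage 1 (simple). If X = \sum_{k=1}^{n} a_k \mathbf{1}_{A_k} with a_k \ge 0 and A_k \in \mathcal{F}, set E[X] = \sum_{k=1}^{n} a_k\, P(A_k).

Stage 2 (non-negative). If X \ge 0, set E[X] = \sup\{E[S] : S \text{ simple},\ 0 \le S \le X\}, a value in [0, \infty].

Stage 3 (general). Write X = X^+ - X^- with X^+ = \max(X, 0) and X^- = \max(-X, 0), and set E[X] = E[X^+] - E[X^-] whenever at least one of the two terms is finite.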

We say X is integrable, and the expectation is a genuine finite number, exactly when

E[|X|] \;=\; \int_{\Omega} |X| \, dP \;<\; \infty.

The absoluteness is worth flagging: expectation is defined through E[|X|], so like every Lebesgue integral it refuses conditionally convergent cancellation. A random variable whose positive and negative parts both have infinite expectation has no expectation at all — a fact the naive \sum x_i p_i view can hide, and one we will meet head-on with the Cauchy distribution below.
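To make the warning about cancellation concrete, here is a small sketch built on a hypothetical random variable that is not taken from the text: X takes the value (-1)^k 2^k / k with probability 2^{-k}. The naive signed sum \sum x_k p_k settles near -\ln 2, while the sum of absolute values is the harmonic series and grows without bound, so E[|X|] = \infty and X is not integrable.

```python
from fractions import Fraction
import math

def term(k):
    """x_k * P(X = x_k) for the illustrative (assumed) distribution."""
    value = Fraction((-1) ** k * 2 ** k, k)   # x_k = (-1)^k 2^k / k
    mass = Fraction(1, 2 ** k)                # P(X = x_k) = 2^-k, masses sum to 1
    return value * mass                       # simplifies to (-1)^k / k

n = 2000
signed_partial = sum(term(k) for k in range(1, n + 1))
absolute_partial = sum(abs(term(k)) for k in range(1, n + 1))

# The signed partial sum hovers near -ln 2 (conditional convergence) ...
print(float(signed_partial), -math.log(2))
# ... but the absolute partial sum is a harmonic number and keeps growing.
print(float(absolute_partial), math.log(n))
```

The signed limit depends on the order of summation, which is exactly why the Lebesgue definition declines to assign this X an expectation.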

From \int X\,dP back to \sum x\,p(x) and \int x\,f(x)\,dx

The abstract integral lives on \Omega, a space we usually never look at directly — who cares about the underlying sample space of "a die roll"? What we see are the values X takes. The bridge is the pushforward (or distribution) measure \mu_X on \mathbb{R}, defined by \mu_X(B) = P(X \in B). Change of variables moves the integral off \Omega and onto the real line:

E[g(X)] \;=\; \int_{\Omega} g(X)\, dP \;=\; \int_{\mathbb{R}} g(x)\, d\mu_X(x).

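This is the law of the unconscious statistician (LOTUS): to average g(X) you do not need the distribution of g(X); you integrate g against the distribution of X. And now the two familiar formulas are just what \mu_X happens to be:

Discrete case. If X takes values x_1, x_2, \dots with probabilities p(x_i), then \mu_X = \sum_i p(x_i)\,\delta_{x_i} and

E[g(X)] = \sum_i g(x_i)\, p(x_i).

Continuous case. If \mu_X has a density f with respect to Lebesgue measure, so that d\mu_X(x) = f(x)\,dx, then

E[g(X)] = \int_{\mathbb{R}} g(x)\, f(x)\, dx.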

Both are the single object \int g\,d\mu_X. The elementary expected value you already knew was this integral in disguise all along.

Let X be the score of a fair six-sided die, so \mu_X = \tfrac16\sum_{i=1}^{6}\delta_i. With g(x) = x,

E[X] = \sum_{i=1}^{6} i\cdot \tfrac16 = \tfrac{1+2+3+4+5+6}{6} = \tfrac{21}{6} = 3.5.

Note the expected value 3.5 is not a face the die can ever show — a first hint that "expected" does not mean "typical" or "attainable".
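The die computation can be sketched in a few lines of Python. The helper name `expectation` and the dictionary representation of \mu_X are illustrative choices, not part of the text; exact fractions are used so that no rounding hides the arithmetic.

```python
from fractions import Fraction

def expectation(g, pmf):
    """E[g(X)] as the integral of g against a discrete distribution.

    `pmf` maps each value x to its point mass mu_X({x}).
    """
    return sum(g(x) * p for x, p in pmf.items())

# Distribution of a fair die: mass 1/6 on each face 1..6.
die = {face: Fraction(1, 6) for face in range(1, 7)}

mean = expectation(lambda x: x, die)               # E[X]
second_moment = expectation(lambda x: x * x, die)  # E[X^2], via LOTUS

print(mean)           # 7/2
print(second_moment)  # 91/6
```

The second call never constructs the distribution of X^2; it integrates g(x) = x^2 against the distribution of X.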

Let X have density f(x) = 2x on [0,1] (and 0 outside). Check it is a density: \int_0^1 2x\,dx = 1. Then

E[X] = \int_0^1 x\cdot 2x\,dx = \int_0^1 2x^2\,dx = \tfrac{2}{3}.

The mean sits at \tfrac23, pulled to the right because the density puts more weight near 1. That is exactly the "balance point" picture of the mean — the density is a strip of mass and E[X] is its centre of gravity.
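As a rough numerical check of the density example, the sketch below approximates both integrals with a midpoint rule. The function name `midpoint_integral` and the grid size are arbitrary illustrative choices; any standard quadrature routine would serve equally well.

```python
def midpoint_integral(h, a, b, n=10_000):
    """Approximate the integral of h over [a, b] with the midpoint rule."""
    width = (b - a) / n
    return width * sum(h(a + (i + 0.5) * width) for i in range(n))

def density(x):
    """The density f(x) = 2x on [0, 1] from the example."""
    return 2.0 * x

total_mass = midpoint_integral(density, 0.0, 1.0)              # should be close to 1
mean = midpoint_integral(lambda x: x * density(x), 0.0, 1.0)   # should be close to 2/3

print(total_mass, mean)
```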

The properties you actually use

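Because expectation is the Lebesgue integral, its rules are the integral's rules, transcribed into probability notation. For integrable X, Y and constants a, b:

Linearity. E[aX + bY] = a\,E[X] + b\,E[Y].

Monotonicity. If X \le Y almost surely, then E[X] \le E[Y].

Absolute bound. |E[X]| \le E[|X|].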

Linearity is the quiet headline. Read it again and notice what is missing: E[X + Y] = E[X] + E[Y] holds whether or not X and Y are independent. There is no product, no covariance term, no assumption at all beyond integrability. This is because linearity is a theorem about the integral of a sum of functions, and functions add pointwise on \Omega with no reference to how they depend on each other. It is the single most useful fact in all of elementary probability — it is why you can compute the expected number of fixed points of a random permutation, or the expected length of the longest run, by chopping the count into indicators X = \sum_k \mathbf{1}_{A_k} and summing E[X] = \sum_k P(A_k), never once worrying whether the events overlap.
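The fixed-point claim can be checked by brute force for small n. The sketch below enumerates every permutation, computes the expected number of fixed points directly, and compares it with the sum of the indicator probabilities P(A_k), where A_k is the event that position k is fixed. The function names are illustrative; note that the events A_k are not independent, yet the two numbers agree.

```python
from fractions import Fraction
from itertools import permutations

def expected_fixed_points(n):
    """Exact E[number of fixed points] over all n! equally likely permutations."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(perm) if i == v) for perm in perms)
    return Fraction(total, len(perms))

def sum_of_indicator_probabilities(n):
    """Sum over k of P(A_k), each probability found by direct enumeration."""
    perms = list(permutations(range(n)))
    return sum(
        Fraction(sum(1 for perm in perms if perm[k] == k), len(perms))
        for k in range(n)
    )

for n in range(1, 7):
    direct = expected_fixed_points(n)
    via_linearity = sum_of_indicator_probabilities(n)
    print(n, direct, via_linearity)   # both columns equal 1 for every n
```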

Monotonicity says a bigger random variable has a bigger average, and the absolute bound says averaging can only shrink size — both immediate from -|X| \le X \le |X| and the integral's own monotonicity.

Variance, and a word from Jensen

Once the mean \mu = E[X] exists (and X \in L^2), the spread of X is another expectation — the average squared distance from the mean:

\operatorname{Var}(X) = E\bigl[(X - \mu)^2\bigr].

Expand the square and use linearity — which is where the algebra pays off — to get the computational shortcut every statistician reaches for:

\operatorname{Var}(X) = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2 = E[X^2] - \bigl(E[X]\bigr)^2.

"The mean of the square minus the square of the mean." The standard deviation is \sigma = \sqrt{\operatorname{Var}(X)}, back in the units of X. Because \operatorname{Var}(X) = E[(X-\mu)^2] \ge 0 (monotonicity of a non-negative integrand), the shortcut also proves E[X^2] \ge (E[X])^2 — a baby case of a much bigger law.

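That bigger law is Jensen's inequality: if \phi is convex on an interval containing the values of X, and both X and \phi(X) are integrable, then

\phi\bigl(E[X]\bigr) \;\le\; E\bigl[\phi(X)\bigr].

Taking \phi(t) = t^2 recovers (E[X])^2 \le E[X^2]; taking \phi(t) = 1/t on the positive reals gives 1/E[X] \le E[1/X] for X > 0. The slogan is: a convex function felt through a random input is, on average, at least as big as it would be at the average input. Convexity plus averaging always pushes upward.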
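Both the variance shortcut and the two Jensen examples can be verified exactly on the fair die. This is only a sketch for one distribution, with an illustrative helper `expect`; it checks instances of the identities and proves nothing in general.

```python
from fractions import Fraction

# Fair die: mass 1/6 on each face.
die = {face: Fraction(1, 6) for face in range(1, 7)}

def expect(g):
    """E[g(X)] for the die, computed as a weighted sum."""
    return sum(g(x) * p for x, p in die.items())

mu = expect(lambda x: x)

# Variance two ways: the definition and the shortcut.
var_definition = expect(lambda x: (x - mu) ** 2)
var_shortcut = expect(lambda x: x * x) - mu ** 2
print(var_definition, var_shortcut)          # 35/12 35/12

# Jensen with phi(t) = t^2: (E[X])^2 <= E[X^2].
print(mu ** 2, expect(lambda x: x * x))      # 49/4 91/6

# Jensen with phi(t) = 1/t on positive values: 1/E[X] <= E[1/X].
mean_of_inverse = expect(lambda x: Fraction(1, x))
print(1 / mu, mean_of_inverse)               # 2/7 49/120
```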

Watch it happen: the running average finds E[X]

Expectation is the number the world's averages settle on. Draw independent copies X_1, X_2, \dots and form the running (sample) mean

\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.

The Law of Large Numbers says that, for independent and identically distributed integrable X_i, \bar{X}_n \to E[X] almost surely as n \to \infty. Take each X_i to be a \{0,1\}-valued Bernoulli draw with mean p, so E[X] = p exactly. Plot \bar{X}_n against n as a jagged curve, with a flat line at p itself. For small n the average lurches around; as n grows the wobble is squeezed out and the curve locks onto the integral. Change p and the target line, and the curve chasing it, both move.

This convergence is the operational meaning of "expected value": not a value you expect on any single trial, but the long-run average the trials conspire to produce. The integral \int X\,dP is precisely that limit.
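The running average is easy to simulate. The sketch below uses Python's standard `random` module with a fixed seed; the function name `running_means`, the value p = 0.3, and the sample size are arbitrary choices made for illustration.

```python
import random

def running_means(p, n, seed=0):
    """Running sample means of n independent Bernoulli(p) draws."""
    rng = random.Random(seed)   # seeded so the run is reproducible
    total = 0
    means = []
    for i in range(1, n + 1):
        total += 1 if rng.random() < p else 0
        means.append(total / i)
    return means

p = 0.3
means = running_means(p, 100_000)

# Early averages wander; late ones sit close to p.
for n in (10, 100, 1_000, 10_000, 100_000):
    print(n, means[n - 1], abs(means[n - 1] - p))
```

For Bernoulli draws the standard deviation of \bar{X}_n is \sqrt{p(1-p)/n}, so the typical gap shrinks roughly like 1/\sqrt{n}.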

In 1713 Nicolaus Bernoulli posed a puzzle that his cousin Daniel wrestled with in St. Petersburg. Flip a fair coin until the first head. If the first head is on toss k, you win 2^{k} ducats. What is a fair price to enter?

The probability of first head on toss k is 2^{-k}, and the payout is 2^{k}, so each term of the expectation contributes 2^{-k}\cdot 2^{k} = 1:

E[X] = \sum_{k=1}^{\infty} 2^{-k}\cdot 2^{k} = \sum_{k=1}^{\infty} 1 = \infty.

The expected winnings are infinite — the non-negative integral genuinely diverges to +\infty. So a coldly rational player should pay any finite entry fee, a million ducats, anything. Yet almost nobody will stake more than a handful of coins. The paradox launched two deep ideas: that utility is concave (a Jensen story — the marginal value of a ducat falls, so E[u(X)] is finite even when E[X] is not), and that a heavy enough tail can make an "average" a badly behaved summary. It is the friendliest possible warning that an expectation can be infinite.
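A short sketch makes both halves of the paradox visible. It truncates the game at a hypothetical cap of K tosses (paying nothing if no head has appeared by then), so the truncated expectation is exactly K and grows without bound. It then evaluates expected utility under an assumed logarithmic utility u(x) = \ln x, which is one standard concave choice and is used here purely as an illustration; that sum converges to 2\ln 2.

```python
from fractions import Fraction
import math

def truncated_expected_winnings(max_tosses):
    """Sum over k <= max_tosses of P(first head on toss k) * payout 2^k."""
    return sum(Fraction(1, 2 ** k) * 2 ** k for k in range(1, max_tosses + 1))

def expected_log_utility(max_tosses):
    """Partial sum of 2^-k * ln(2^k), assuming utility u(x) = ln x."""
    return sum(2.0 ** -k * math.log(2.0 ** k) for k in range(1, max_tosses + 1))

# Every term contributes exactly 1, so the partial sums are 10, 100, 1000, ...
for cap in (10, 100, 1000):
    print(cap, truncated_expected_winnings(cap))

# The utility sum settles quickly at 2 ln 2.
print(expected_log_utility(200), 2 * math.log(2))
```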

Three ways expectation can bite: