Expectation as an Integral
You already know two formulas for an average. For a die you write
E[X] = \sum_i x_i\, p(x_i); for a continuous quantity you write
E[X] = \int x\, f(x)\, dx. They look like different animals — one a sum, one
an integral — and a first course keeps them in separate boxes marked "discrete" and "continuous". But a
probability is just a
measure
(one that happens to assign the whole space total weight 1), and a random
variable is just a
measurable function.
So there is really only one notion of average, and it is a single object we have already
built:
E[X] \;=\; \int_{\Omega} X \, dP.
This is the Lebesgue integral of the function X : \Omega \to \mathbb{R} against the probability measure
P. That is the whole idea of this page. The sum and the density integral are
not two definitions of expectation — they are two computations of the same integral, read off in
two special cases. Once you see expectation as \int X\,dP, everything a
probabilist wants from it — linearity, monotonicity, the variance shortcut, Jensen's inequality — is just
the toolbox of Lebesgue integration, inherited for free.
Why it matters. An insurer setting a premium, a casino sizing a bet, a physicist
predicting the mean energy of a gas, a machine-learning model minimising an expected loss — all of them
are computing \int X\,dP. The measure-theoretic view is the one that survives
contact with mixtures of discrete and continuous parts, with random variables that have no density, and
with infinite-dimensional spaces of paths. It is the language every serious use of probability is written
in.
The definition, in three stages
Fix a probability space (\Omega, \mathcal{F}, P) with
P(\Omega) = 1. A random variable X
is exactly a measurable function \Omega \to \mathbb{R}. Its expectation is
built by the same three-stage Lebesgue construction as any other integral — nothing new is
needed, we just rename \int f\,d\mu to E[X].
-
Simple random variables. If
X = \sum_{k=1}^{n} c_k\, \mathbf{1}_{A_k} takes finitely many values on
disjoint events A_k \in \mathcal{F}, then
E[X] = \sum_k c_k\, P(A_k): value times probability, summed. This is
already the familiar \sum_i x_i\, p_i.
-
Non-negative random variables. If X \ge 0, define
E[X] = \sup\bigl\{ E[\varphi] : \varphi \text{ simple},\ 0 \le \varphi \le X \bigr\} \in [0, \infty],
the supremum of the expectations of all simple random variables sitting underneath
X. Always defined, possibly +\infty.
-
General random variables. Split into positive and negative parts
X = X^{+} - X^{-} with
X^{+} = \max(X, 0), X^{-} = \max(-X, 0), and set
E[X] = E[X^{+}] - E[X^{-}],
provided at least one part is finite (so no \infty - \infty).
We say X is integrable, and the expectation is a genuine finite number, exactly when
- E[|X|] = E[X^{+}] + E[X^{-}] < \infty — both parts finite;
- equivalently, X \in L^1(\Omega, \mathcal{F}, P), the space of absolutely-integrable random variables.
The absoluteness is worth flagging: integrability is defined through
E[|X|], so like every Lebesgue integral, expectation refuses conditionally convergent
cancellation. A random variable whose positive and negative parts both have infinite expectation has
no expectation at all. The naive \sum x_i p_i view can
hide this fact, and we will meet it head-on with the Cauchy distribution below.
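To make the three stages concrete, here is a minimal sketch on a finite probability space, where every random variable is simple and the integral is a finite sum. The three outcomes, their weights, and the values of X are hypothetical, chosen only for illustration.

```python
from fractions import Fraction

# Hypothetical finite probability space: three outcomes whose weights sum to 1.
P = {"a": Fraction(1, 2), "b": Fraction(1, 3), "c": Fraction(1, 6)}

# A random variable is just a function on the outcomes (values are illustrative).
X = {"a": Fraction(4), "b": Fraction(-3), "c": Fraction(6)}


def expectation(X, P):
    """Integral of a simple random variable: value times probability, summed."""
    return sum(X[w] * P[w] for w in P)


# Positive and negative parts, so that X = X_plus - X_minus.
X_plus = {w: max(x, 0) for w, x in X.items()}
X_minus = {w: max(-x, 0) for w, x in X.items()}

E_plus = expectation(X_plus, P)    # E[X^+]
E_minus = expectation(X_minus, P)  # E[X^-]
E_X = E_plus - E_minus             # stage-three definition of E[X]
E_abs = E_plus + E_minus           # E[|X|]; finite here, so X is integrable

print(E_plus, E_minus, E_X, E_abs)
```

On a finite space both parts are automatically finite; the interesting failures of integrability only appear once the space is infinite.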
From \int X\,dP back to \sum x\,p(x) and \int x\,f(x)\,dx
The abstract integral lives on \Omega, a space we usually never look at
directly — who cares about the underlying sample space of "a die roll"? What we see are the
values X takes. The bridge is the pushforward (or
distribution) measure \mu_X on \mathbb{R},
defined by \mu_X(B) = P(X \in B). Change of variables moves the integral off
\Omega and onto the real line:
E[g(X)] \;=\; \int_{\Omega} g(X)\, dP \;=\; \int_{\mathbb{R}} g(x)\, d\mu_X(x).
This is the law of the unconscious statistician (LOTUS): to average
g(X) you do not need the distribution of
g(X) — you integrate g against the distribution of
X. And now the two familiar formulas are just what
\mu_X happens to be:
-
Discrete X. If \mu_X = \sum_i p_i\,\delta_{x_i}
is a sum of point masses with p_i = P(X = x_i) = p(x_i), the integral collapses to a sum:
E[g(X)] = \sum_i g(x_i)\, p(x_i).
-
Continuous X. If \mu_X has a
density f with respect to Lebesgue measure (its
Radon–Nikodym derivative),
then E[g(X)] = \int_{\mathbb{R}} g(x)\, f(x)\, dx.
Both are the single object \int g\,d\mu_X. The
elementary expected value
you already knew was this integral in disguise all along.
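The change-of-variables claim can be checked by hand on a small discrete case. The sketch below averages g(X) two ways for a fair die: once by integrating g against the distribution of X, and once by first building the distribution of g(X). The function g(x) = (x - 3)^2 is an arbitrary choice made for this illustration.

```python
from collections import defaultdict
from fractions import Fraction

# Distribution of a fair die: the point masses of mu_X.
mu_X = {x: Fraction(1, 6) for x in range(1, 7)}


def g(x):
    # Illustrative choice of g; any function of the score would do.
    return (x - 3) ** 2


# Route 1 (LOTUS): integrate g against the distribution of X.
lotus = sum(g(x) * p for x, p in mu_X.items())

# Route 2: build the distribution of Y = g(X) first, then average Y.
mu_Y = defaultdict(Fraction)  # Fraction() is 0, so masses accumulate exactly
for x, p in mu_X.items():
    mu_Y[g(x)] += p
direct = sum(y * q for y, q in mu_Y.items())

print(lotus, direct)
```

Both routes give the same number; the first never needs to know which values of g(X) coincide.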
Let X be the score of a fair six-sided die, so
\mu_X = \tfrac16\sum_{i=1}^{6}\delta_i. With
g(x) = x,
E[X] = \sum_{i=1}^{6} i\cdot \tfrac16 = \tfrac{1+2+3+4+5+6}{6} = \tfrac{21}{6} = 3.5.
Note the expected value 3.5 is not a face the die can ever show — a first
hint that "expected" does not mean "typical" or "attainable".
Let X have density f(x) = 2x on
[0,1] (and 0 outside). Check it is a density:
\int_0^1 2x\,dx = 1. Then
E[X] = \int_0^1 x\cdot 2x\,dx = \int_0^1 2x^2\,dx = \tfrac{2}{3}.
The mean sits at \tfrac23, pulled to the right because the density puts more
weight near 1. That is exactly the "balance point" picture of the mean — the
density is a strip of mass and E[X] is its centre of gravity.
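The density example can also be checked numerically. The following sketch uses a plain midpoint rule (a generic quadrature choice, not something the text prescribes) to approximate the total mass and the mean of the density f(x) = 2x on [0, 1].

```python
def midpoint_integral(h, a, b, n=10_000):
    """Midpoint-rule approximation of the integral of h over [a, b]."""
    width = (b - a) / n
    return width * sum(h(a + (i + 0.5) * width) for i in range(n))


def f(x):
    """The density 2x on [0, 1]."""
    return 2 * x


total_mass = midpoint_integral(f, 0.0, 1.0)             # should be close to 1
mean = midpoint_integral(lambda x: x * f(x), 0.0, 1.0)  # should be close to 2/3

print(total_mass, mean)
```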
The properties you actually use
Because expectation is the Lebesgue integral, its rules are the integral's rules — transcribed
into probability notation. For integrable X, Y and constants
a, b:
- Linearity: E[aX + bY] = a\,E[X] + b\,E[Y].
- Monotonicity: X \le Y \text{ a.s.} \implies E[X] \le E[Y]; in particular X \ge 0 \implies E[X] \ge 0.
- Absolute bound (triangle inequality): \bigl|E[X]\bigr| \le E[|X|].
Linearity is the quiet headline. Read it again and notice what is missing:
E[X + Y] = E[X] + E[Y] holds whether or not
X and Y are independent. There is no product, no
covariance term, no assumption at all beyond integrability. This is because linearity is a theorem about
the integral of a sum of functions, and functions add pointwise on
\Omega with no reference to how they depend on each other. It is the single
most useful fact in all of elementary probability. It is why you can compute the expected number of
fixed points of a random permutation, or the expected number of runs in a sequence of coin tosses, by chopping the count
into indicators X = \sum_k \mathbf{1}_{A_k} and summing
E[X] = \sum_k P(A_k), never once worrying whether the events overlap or depend on one another.
Monotonicity says a bigger random variable has a bigger average, and the absolute bound says averaging
can only shrink size. The bound follows at once from -|X| \le X \le |X|, monotonicity, and linearity:
-E[|X|] \le E[X] \le E[|X|].
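The fixed-point count is a good place to see linearity do its work. The sketch below computes the expected number of fixed points of a uniform random permutation in two ways: by brute-force enumeration, and by summing the indicator probabilities P(position k is fixed) = 1/n. The function names are hypothetical, and enumeration is only feasible for small n.

```python
from fractions import Fraction
from itertools import permutations


def expected_fixed_points(n):
    """Exact E[number of fixed points] by enumerating all n! permutations."""
    perms = list(permutations(range(n)))
    total = sum(sum(1 for i, v in enumerate(p) if i == v) for p in perms)
    return Fraction(total, len(perms))


def indicator_sum(n):
    """Linearity route: sum over k of P(position k is fixed), each equal to 1/n."""
    return sum(Fraction(1, n) for _ in range(n))


for n in range(1, 7):
    print(n, expected_fixed_points(n), indicator_sum(n))
```

The indicator events are far from independent (if n - 1 positions are fixed, the last one must be too), yet the sum of their probabilities still gives the exact answer.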
Variance, and a word from Jensen
Once the mean \mu = E[X] exists (and X \in L^2),
the spread of X is another expectation — the average squared distance from the
mean:
\operatorname{Var}(X) = E\bigl[(X - \mu)^2\bigr].
Expand the square and use linearity — which is where the algebra pays off — to get the
computational shortcut every statistician reaches for:
\operatorname{Var}(X) = E[X^2] - 2\mu E[X] + \mu^2 = E[X^2] - \mu^2 = E[X^2] - \bigl(E[X]\bigr)^2.
"The mean of the square minus the square of the mean." The standard deviation is
\sigma = \sqrt{\operatorname{Var}(X)}, back in the units of
X. Because \operatorname{Var}(X) = E[(X-\mu)^2] \ge 0
(monotonicity of a non-negative integrand), the shortcut also proves
E[X^2] \ge (E[X])^2 — a baby case of a much bigger law.
- Jensen's inequality: if \phi is convex and X integrable, then \phi\bigl(E[X]\bigr) \le E\bigl[\phi(X)\bigr].
- For a concave \phi the inequality flips: E[\phi(X)] \le \phi(E[X]).
Taking \phi(t) = t^2 recovers (E[X])^2 \le E[X^2];
taking \phi(t) = 1/t on the positive reals gives
1/E[X] \le E[1/X] for X > 0. The slogan is: a convex function felt through a random
input is, on average, at least as big as it would be at the average input. Convexity plus averaging
always pushes upward.
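As a sanity check, the variance shortcut and both Jensen instances can be verified exactly for the fair die used earlier. This is a small sketch with exact rational arithmetic; the helper E is a hypothetical name for "integrate g against the die's distribution".

```python
from fractions import Fraction

# Distribution of a fair six-sided die.
mu_X = {x: Fraction(1, 6) for x in range(1, 7)}


def E(g):
    """E[g(X)] for the die, computed as a sum against its point masses."""
    return sum(g(x) * p for x, p in mu_X.items())


mean = E(lambda x: x)

# Variance from the definition and from the shortcut E[X^2] - (E[X])^2.
var_definition = E(lambda x: (x - mean) ** 2)
var_shortcut = E(lambda x: x ** 2) - mean ** 2

# Jensen gaps E[phi(X)] - phi(E[X]) for two convex functions on the positive reals.
jensen_square_gap = E(lambda x: x ** 2) - mean ** 2
jensen_recip_gap = E(lambda x: Fraction(1, x)) - 1 / mean

print(mean, var_definition, var_shortcut, jensen_square_gap, jensen_recip_gap)
```

The Jensen gap for t^2 is exactly the variance, which is one way to remember why the inequality is strict whenever X is not constant.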
Watch it happen: the running average finds E[X]
Expectation is the number the world's averages settle on. Draw independent copies
X_1, X_2, \dots and form the running (sample) mean
\bar{X}_n = \frac{1}{n}\sum_{i=1}^{n} X_i.
For integrable X, the Strong Law of Large Numbers says \bar{X}_n \to E[X] almost surely as
n \to \infty. Below, each X_i is a
\{0,1\}-valued Bernoulli draw with mean p, so
E[X] = p exactly. The jagged curve is \bar{X}_n
plotted against n; the flat line is p itself. For
small n the average lurches around; as n grows the
wobble is squeezed out and the curve locks onto the integral. Slide p and the
target line — and the curve chasing it — both move.
This convergence is the operational meaning of "expected value": not a value you expect on any single
trial, but the long-run average the trials conspire to produce. The integral
\int X\,dP is precisely that limit.
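The same experiment is easy to reproduce offline. The sketch below simulates Bernoulli draws with the standard library's pseudo-random generator and records the running mean; the value p = 0.3, the sample size, and the seed are arbitrary choices for illustration.

```python
import random


def running_means(p, n, seed=0):
    """Running sample means of n simulated Bernoulli(p) draws."""
    rng = random.Random(seed)  # seeded so the run is reproducible
    total = 0
    means = []
    for i in range(1, n + 1):
        total += 1 if rng.random() < p else 0  # one {0,1}-valued draw
        means.append(total / i)                # the running mean after i draws
    return means


means = running_means(p=0.3, n=50_000)

# Early averages wobble; late ones sit close to p.
for i in (10, 100, 1_000, 10_000, 50_000):
    print(i, means[i - 1])
```

A single simulated path illustrates the convergence but does not prove it; the theorem is a statement about almost every path.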
In 1713 Nicolaus Bernoulli posed a puzzle that his cousin Daniel wrestled with in St. Petersburg. Flip a
fair coin until the first head. If the first head is on toss k, you win
2^{k} ducats. What is a fair price to enter?
The probability of first head on toss k is
2^{-k}, and the payout is 2^{k}, so each term of the
expectation contributes 2^{-k}\cdot 2^{k} = 1:
E[X] = \sum_{k=1}^{\infty} 2^{-k}\cdot 2^{k} = \sum_{k=1}^{\infty} 1 = \infty.
The expected winnings are infinite — the non-negative integral genuinely diverges to
+\infty. So a coldly rational player should pay any finite entry fee, a million
ducats, anything. Yet almost nobody will stake more than a handful of coins. The paradox launched two
deep ideas: that utility is concave (a Jensen story: the marginal value of a ducat falls, so
E[u(X)] can be finite, as it is for u = \log, even when E[X] is not), and that a
heavy enough tail can make an "average" a badly behaved summary. It is the friendliest possible warning
that an expectation can be infinite.
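The divergence is easy to watch numerically. In the sketch below the game is cut off after K tosses, with the hypothetical rule that no head within K tosses pays nothing; the truncated expectation is then exactly K. For contrast, the expected logarithmic utility converges; log utility is assumed here only as a standard example of a concave u.

```python
import math
from fractions import Fraction


def truncated_mean(K):
    """Expected payout when the game stops after K tosses (no head by then pays 0)."""
    return sum(Fraction(1, 2 ** k) * 2 ** k for k in range(1, K + 1))


def truncated_log_utility(K):
    """Partial sum of E[log X]: each term is 2^-k * log(2^k) = k * log(2) / 2^k."""
    return sum(2.0 ** -k * k * math.log(2) for k in range(1, K + 1))


for K in (10, 100, 1000):
    print(K, truncated_mean(K), truncated_log_utility(K))

# The log-utility sums settle at 2 * log(2), since sum_k k / 2^k = 2.
print(2 * math.log(2))
```

Every extra allowed toss adds exactly one ducat to the mean, however unlikely that toss is to be reached, which is the heavy tail at work.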
Three ways expectation can bite:
-
The expectation need not exist. For the standard Cauchy distribution,
f(x) = \dfrac{1}{\pi(1+x^2)}, the tails decay only like
1/x^2, so the integrand |x|\,f(x) decays like 1/(\pi|x|), which is not integrable at infinity:
E[|X|] = \int_{-\infty}^{\infty} \dfrac{|x|}{\pi(1+x^2)}\,dx = \infty. Both
E[X^{+}] and E[X^{-}] are infinite, and
E[X] is undefined — not zero, not anything. Writing
"E[X] = 0 by symmetry" is the classic error; the symmetric cancellation
\infty - \infty is meaningless. Absolute integrability is not optional.
-
The expected value need not be attainable. A fair die averages
3.5, a face it can never show; a Bernoulli with 0 < p < 1 averages
p, a value it never takes. "Expected" is a long-run average, not a
prediction of any single outcome.
-
E[1/X] \ne 1/E[X] in general. Expectation does not commute
with nonlinear functions. If X is 1 or
2 with equal chance, E[X] = 1.5 so
1/E[X] = 2/3 \approx 0.667, but
E[1/X] = \tfrac12(1) + \tfrac12(\tfrac12) = 0.75. (Jensen even tells you
which way the gap leans for positive X: E[1/X] \ge 1/E[X], since
1/t is convex on the positive reals.)
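The Cauchy warning can be made quantitative with closed-form truncated integrals, using the antiderivative \ln(1+x^2)/(2\pi) of x f(x). The sketch below shows that the truncated E[|X|] grows without bound, and that the truncated "mean" depends on how the window is cut: a symmetric window gives 0, while a window from -a to 2a tends to \ln 2/\pi. The function names and window shapes are illustrative choices.

```python
import math


def truncated_abs_mean(a):
    """Integral of |x| f(x) over [-a, a] for the standard Cauchy density."""
    return math.log1p(a * a) / math.pi


def truncated_mean(lo, hi):
    """Integral of x f(x) over [-lo, hi], via the antiderivative ln(1 + x^2) / (2 pi)."""
    return (math.log1p(hi * hi) - math.log1p(lo * lo)) / (2 * math.pi)


for a in (1e1, 1e3, 1e6):
    print(
        a,
        truncated_abs_mean(a),     # keeps growing like (2 / pi) * ln(a)
        truncated_mean(a, a),      # symmetric window: exactly 0
        truncated_mean(a, 2 * a),  # lopsided window: approaches ln(2) / pi
    )

print(math.log(2) / math.pi)
```

Because different windows give different limits, no single number deserves to be called the mean, which is the content of "undefined" here.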