Decision Theory and Risk
So far we've judged estimators one property at a time — unbiased here, low-variance there. But choosing
an estimator is really making a decision under uncertainty, and there is a single
framework that judges any such decision: statistical decision theory. It
asks two questions. First, what does it cost to be wrong? The answer is a loss function.
Second, what is the expected cost of a given rule? The answer is its risk.
With loss and risk in hand, everything becomes comparable: point estimators, classifiers, forecasts,
even whole experiments. And two grand strategies emerge for picking a rule when no single one wins
everywhere — the Bayes rule (best on average against a prior) and the
minimax rule (best against the worst case). This is the page where estimation grows
up into optimisation.
Loss and risk
A decision rule (here, an estimator) \delta(X) maps data to
an action. A loss function L(\theta, a) says how much it
hurts to take action a when the truth is
\theta. Three classics:
- Squared-error loss L(\theta,a)=(\theta-a)^2: punishes big misses hard, because the penalty grows quadratically;
- Absolute-error loss L(\theta,a)=|\theta-a|: penalises a miss in proportion to its size, so it is less sensitive to outliers;
- 0–1 loss L(\theta,a)=\mathbf{1}\{a\neq\theta\}: for classification, right or wrong, nothing in between.
The risk is the loss averaged over the randomness of the data — the
expected loss:
R(\theta,\delta) = \mathbb{E}_\theta\big[L(\theta,\delta(X))\big].
Crucially the risk is a function of \theta: a rule can be
great for some true values and poor for others.
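The definition can be checked numerically. The sketch below approximates the risk by simulation, assuming (purely for illustration) that the data are n i.i.d. Normal(\theta, 1) draws; the function name monte_carlo_risk and the choice of rules are hypothetical, not part of the text above.

```python
import random
import statistics


def monte_carlo_risk(rule, loss, theta, n=10, reps=5000, seed=0):
    """Approximate R(theta, rule) = E_theta[loss(theta, rule(X))] by simulation.

    Assumption for illustration: X = (X_1, ..., X_n) i.i.d. Normal(theta, 1).
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        total += loss(theta, rule(sample))
    return total / reps


def squared_error(theta, a):
    return (theta - a) ** 2


def absolute_error(theta, a):
    return abs(theta - a)


# Two candidate rules (estimators) mapping a sample to an action.
sample_mean = statistics.fmean
sample_median = statistics.median

for theta in (0.0, 2.0):
    for name, rule in (("mean", sample_mean), ("median", sample_median)):
        r_sq = monte_carlo_risk(rule, squared_error, theta)
        r_abs = monte_carlo_risk(rule, absolute_error, theta)
        print(f"theta={theta:.1f} rule={name:6s} "
              f"squared-error risk={r_sq:.4f} absolute-error risk={r_abs:.4f}")
```

Because the seed is fixed, both rules are scored on the same simulated datasets, which makes the comparison between them less noisy. The same rule receives a different risk under each loss, so a risk figure is only meaningful once the loss is stated.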
Squared-error risk is exactly MSE
Under squared-error loss the risk of an estimator is
R(\theta,\delta)=\mathbb{E}_\theta[(\theta-\delta(X))^2], which is nothing
but the mean squared error we already know, so
R(\theta,\delta) = \operatorname{Var}_\theta(\delta) + \operatorname{Bias}_\theta(\delta)^2.
Decision theory doesn't replace what we learned; it subsumes it. The bias–variance
decomposition is just the squared-error risk of an estimator, now sitting inside a much larger theory
that also handles absolute loss, 0–1 loss, and any custom cost you care to write down.
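As a sanity check on the decomposition, the sketch below simulates a deliberately biased estimator and compares its squared-error risk with variance plus squared bias. The estimator c\bar X with c = 0.7, the Normal(\theta, 1) data model, and the name risk_decomposition are all assumptions made for illustration.

```python
import random
import statistics


def risk_decomposition(theta, c=0.7, n=10, reps=5000, seed=1):
    """Simulate the estimator c * mean(X) and return (mse, variance, bias).

    Assumption for illustration: X_1, ..., X_n i.i.d. Normal(theta, 1).
    """
    rng = random.Random(seed)
    estimates = []
    for _ in range(reps):
        sample = [rng.gauss(theta, 1.0) for _ in range(n)]
        estimates.append(c * statistics.fmean(sample))

    mean_estimate = statistics.fmean(estimates)
    # Empirical squared-error risk: average of (theta - estimate)^2.
    mse = statistics.fmean([(theta - d) ** 2 for d in estimates])
    # Empirical variance (dividing by reps) and bias of the estimator.
    variance = statistics.fmean([(d - mean_estimate) ** 2 for d in estimates])
    bias = mean_estimate - theta
    return mse, variance, bias


mse, variance, bias = risk_decomposition(theta=2.0)
print(f"risk (MSE)        = {mse:.5f}")
print(f"variance + bias^2 = {variance + bias ** 2:.5f}")
```

The two printed numbers agree up to floating-point rounding, because the identity holds exactly for the empirical averages and not only in expectation.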
No rule wins everywhere — so how do we choose?
Here is the central difficulty. Compare two rules by their risk functions and you'll usually
find they cross: rule A beats rule B for some \theta, B
beats A for others. There is generally no uniformly best rule. (When one rule's risk
is never worse and sometimes better than another's, the loser is called
inadmissible: genuinely dominated. But many sensible rules are admissible and simply
trade regions.) To pick, we must collapse the whole risk function to a single number. Two principled
ways to do it:
- Bayes risk: average the risk over a prior \pi(\theta) on the parameter, r(\pi,\delta)=\int R(\theta,\delta)\,\pi(\theta)\,d\theta. The rule that minimises it is the Bayes rule.
- Minimax risk: score each rule by its worst-case risk, \sup_\theta R(\theta,\delta), and pick the rule that minimises that worst case. Cautious, adversarial, prior-free.
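To see the two criteria pick different rules, here is a minimal sketch under several stated assumptions: the candidate rules are the shrinkage family c\bar X with risk c^2 + (1-c)^2\theta^2 (units where \sigma^2/n = 1), the parameter is restricted to a grid on [-3, 3] so the worst case is finite, and the prior is a discretised Normal(0, 1). None of these choices comes from the definitions above; they only make the two numbers computable.

```python
import math


def shrinkage_risk(theta, c):
    """Squared-error risk of c * Xbar when Var(Xbar) = 1 (assumed units)."""
    return c ** 2 + (1.0 - c) ** 2 * theta ** 2


# Assumed parameter grid: theta in [-3, 3] in steps of 0.1.
thetas = [i / 10 for i in range(-30, 31)]

# Assumed prior: Normal(0, 1) weights on the grid, normalised to sum to 1.
weights = [math.exp(-0.5 * t ** 2) for t in thetas]
total = sum(weights)
prior = [w / total for w in weights]


def bayes_risk(c):
    """Prior-weighted average of the risk function (discrete integral)."""
    return sum(p * shrinkage_risk(t, c) for t, p in zip(thetas, prior))


def worst_case_risk(c):
    """Largest risk over the parameter grid."""
    return max(shrinkage_risk(t, c) for t in thetas)


# Search over candidate shrinkage factors c = 0.00, 0.01, ..., 1.00.
candidates = [i / 100 for i in range(0, 101)]
bayes_c = min(candidates, key=bayes_risk)
minimax_c = min(candidates, key=worst_case_risk)

print(f"Bayes choice:   c = {bayes_c:.2f}, Bayes risk = {bayes_risk(bayes_c):.3f}")
print(f"Minimax choice: c = {minimax_c:.2f}, worst-case risk = {worst_case_risk(minimax_c):.3f}")
```

Under these assumptions the Bayes criterion shrinks much harder than the minimax criterion, because the prior concentrates near 0 while the worst case sits at the edge of the grid. Both answers are optimal only within this one-parameter family, not among all rules.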
Bayes rules read straight off the posterior
The beautiful payoff: minimising Bayes risk is equivalent to minimising posterior expected loss
for each dataset, and for the three classic losses the Bayes rule is a familiar summary of the
posterior distribution.
- Squared-error loss → the posterior mean \mathbb{E}[\theta\mid X].
- Absolute-error loss → the posterior median.
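- 0–1 loss → the posterior mode, i.e. MAP estimation. This is exact when \theta is discrete; for a continuous \theta the mode arises as the limit of a loss that forgives misses smaller than a shrinking tolerance.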
So MAP estimation is not an ad-hoc recipe — it is the Bayes-optimal decision under 0–1 loss, just as
the posterior mean is optimal under squared-error loss. Your choice of loss silently picks
which posterior summary you should report.
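The correspondence between losses and posterior summaries can be verified by brute force. The sketch below assumes a hypothetical discrete posterior on a grid of \theta values, with weights proportional to \theta^2(1-\theta)^8 (a Beta(3, 9) shape chosen only because it is skewed, so mean, median and mode differ). It minimises posterior expected loss over the grid and compares the result with the summaries computed directly.

```python
# Assumed discrete posterior: grid 0.01, ..., 0.99 with Beta(3, 9)-shaped weights.
thetas = [i / 100 for i in range(1, 100)]
unnormalised = [t ** 2 * (1 - t) ** 8 for t in thetas]
z = sum(unnormalised)
posterior = [u / z for u in unnormalised]


def posterior_expected_loss(action, loss):
    """Average loss of one action over the posterior."""
    return sum(p * loss(t, action) for t, p in zip(thetas, posterior))


def bayes_action(loss):
    """Grid action minimising posterior expected loss."""
    return min(thetas, key=lambda a: posterior_expected_loss(a, loss))


def squared(t, a):
    return (t - a) ** 2


def absolute(t, a):
    return abs(t - a)


def zero_one(t, a):
    return 0.0 if a == t else 1.0


# Posterior summaries computed directly, without any optimisation.
post_mean = sum(t * p for t, p in zip(thetas, posterior))

cumulative = 0.0
post_median = thetas[-1]
for t, p in zip(thetas, posterior):
    cumulative += p
    if cumulative >= 0.5:  # first grid point with at least half the mass
        post_median = t
        break

post_mode = thetas[posterior.index(max(posterior))]

print(f"squared-error action  {bayes_action(squared):.2f}  vs mean   {post_mean:.4f}")
print(f"absolute-error action {bayes_action(absolute):.2f}  vs median {post_median:.2f}")
print(f"0-1 loss action       {bayes_action(zero_one):.2f}  vs mode   {post_mode:.2f}")
```

Because actions are restricted to the grid, the squared-error action is the grid point nearest the posterior mean instead of the mean itself.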
Watch two risk functions cross
Consider estimating a mean \theta under squared-error loss, in units rescaled so that \sigma^2/n = 1. The flat line is
the unbiased rule \bar X, with constant risk 1. The parabola is a
shrinkage rule c\bar X, with 0 \le c < 1, that pulls the estimate toward 0;
its risk is c^2 + (1-c)^2\theta^2 (variance plus squared bias). As the shrinkage factor
c varies, the pattern stays the same: for \theta near 0 the shrinkage rule wins
(lower risk), but for large |\theta| it loses. The curves cross,
which shows that neither rule dominates the other, and is the reason we need Bayes or minimax to choose.
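The crossing point can be located exactly. Setting c^2 + (1-c)^2\theta^2 = 1 and solving gives \theta^2 = (1-c^2)/(1-c)^2 = (1+c)/(1-c). The short sketch below evaluates this for a few shrinkage factors; the function names are illustrative, and it assumes 0 \le c < 1.

```python
import math


def risk_unbiased(theta):
    """Risk of Xbar in units where sigma^2 / n = 1: constant in theta."""
    return 1.0


def risk_shrinkage(theta, c):
    """Risk of c * Xbar: variance c^2 plus squared bias ((1 - c) * theta)^2."""
    return c ** 2 + (1.0 - c) ** 2 * theta ** 2


def crossing_point(c):
    """|theta| at which the two risk curves meet, for 0 <= c < 1."""
    if not 0.0 <= c < 1.0:
        raise ValueError("c must satisfy 0 <= c < 1")
    return math.sqrt((1.0 + c) / (1.0 - c))


for c in (0.25, 0.5, 0.9):
    t_star = crossing_point(c)
    print(f"c={c:.2f}: shrinkage wins for |theta| < {t_star:.3f}, "
          f"risk at crossing = {risk_shrinkage(t_star, c):.3f}")
```

As c approaches 1 the interval on which shrinkage wins widens, but the gain at \theta = 0, which is 1 - c^2, shrinks toward nothing.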
Risk is an expectation over the data you might have seen. It is a property of the rule and the
true \theta, computed before you look at your one dataset —
not the loss you incur on the sample in your hand. On any particular dataset a high-risk rule might
happen to nail the answer, and a low-risk rule might badly miss; risk only promises good behaviour
on average across repetitions. And because true risk depends on the unknown
\theta, you can never compute your rule's actual risk from data alone — you
estimate it, or you commit to a prior (Bayes) or to the worst case (minimax). Confusing "low risk"
with "I got it right this time" is a category error.
Minimax feels pessimistic — you optimise against an adversary who gets to choose the true
\theta to hurt you most. But it earns its keep when being badly wrong is
catastrophic and you have no trustworthy prior: think safety engineering, robust control, or
an opponent in a genuine game who really is trying to beat you. There is also a deep bridge between the
two philosophies. A minimax rule is often a Bayes rule for the least-favourable prior
— the prior an adversary would choose — so minimax is, in a sense, Bayesian analysis against your
worst enemy's beliefs. Bayes optimises for the world you expect; minimax insures against the world you
fear. Mature analysis keeps both lenses handy.
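One textbook instance of this bridge, offered here as an illustration outside the mean-estimation setting used above: estimating a binomial proportion p from X ~ Binomial(n, p) under squared-error loss. The Bayes rule for a Beta(\sqrt{n}/2, \sqrt{n}/2) prior has risk 1/(4(1+\sqrt{n})^2), which does not depend on p. The sketch computes exact risks by summing over the binomial distribution; the function names are hypothetical.

```python
import math


def exact_risk(estimator, n, p):
    """Exact squared-error risk: sum over all outcomes x of Binomial(n, p)."""
    return sum(
        math.comb(n, x) * p ** x * (1 - p) ** (n - x) * (estimator(x, n) - p) ** 2
        for x in range(n + 1)
    )


def mle(x, n):
    """Sample proportion, the unbiased rule."""
    return x / n


def bayes_least_favourable(x, n):
    """Posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior."""
    a = math.sqrt(n) / 2
    return (x + a) / (n + 2 * a)


n = 25
ps = [i / 20 for i in range(1, 20)]  # 0.05, 0.10, ..., 0.95

mle_risks = [exact_risk(mle, n, p) for p in ps]
bayes_risks = [exact_risk(bayes_least_favourable, n, p) for p in ps]

print(f"sample proportion: worst-case risk on grid = {max(mle_risks):.5f}")
print(f"Bayes rule:        worst-case risk on grid = {max(bayes_risks):.5f}")
print(f"Bayes rule: spread of risk across p        = {max(bayes_risks) - min(bayes_risks):.2e}")
```

The Bayes rule's risk is flat in p, so the adversary gains nothing by choosing p, and its worst case sits below that of the sample proportion. Away from p = 1/2 the sample proportion has the lower risk, which is the same trading of regions described earlier.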