Decision Theory and Risk

So far we've judged estimators one property at a time: unbiased here, low-variance there. But choosing an estimator is really making a decision under uncertainty, and there is a single framework that judges any such decision: statistical decision theory. It asks two questions. First, what does it cost to be wrong? The answer is a loss function. Second, what is the expected cost of a given rule? The answer is its risk.

With loss and risk in hand, everything becomes comparable: point estimators, classifiers, forecasts, even whole experiments. And two grand strategies emerge for picking a rule when no single one wins everywhere — the Bayes rule (best on average against a prior) and the minimax rule (best against the worst case). This is the page where estimation grows up into optimisation.

Loss and risk

A decision rule (here, an estimator) \delta(X) maps data to an action. A loss function L(\theta, a) says how much it hurts to take action a when the truth is \theta. Three classics:

Squared-error loss L(\theta,a)=(\theta-a)^2 — punishes big misses hard;
Absolute-error loss L(\theta,a)=|\theta-a| — proportional, robust to outliers;
0–1 loss L(\theta,a)=\mathbf{1}\{a\neq\theta\} — for classification: right or wrong, nothing in between.

The risk is the loss averaged over the randomness of the data — the expected loss:

R(\theta,\delta) = \mathbb{E}_\theta\big[L(\theta,\delta(X))\big].

Crucially, the risk is a function of \theta: a rule can be great for some true values and poor for others.
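To make the definition concrete, here is a minimal sketch that approximates the risk by simulation. The normal model, the sample size n = 25, and the names mc_risk, sample_mean and half_mean are assumptions chosen for illustration, not part of the text above.

```python
import numpy as np

def squared_loss(theta, a):
    return (theta - a) ** 2

def absolute_loss(theta, a):
    return np.abs(theta - a)

def zero_one_loss(theta, a):
    # Only meaningful when actions and truths are discrete labels.
    return np.asarray(a != theta, dtype=float)

def sample_mean(X):
    # The unbiased rule: one estimate per simulated dataset (row).
    return X.mean(axis=1)

def half_mean(X):
    # An illustrative shrinkage rule: pull the sample mean halfway to 0.
    return 0.5 * X.mean(axis=1)

def mc_risk(loss, rule, theta, n=25, sigma=1.0, reps=20_000, seed=0):
    """Monte Carlo approximation of R(theta, rule) = E_theta[loss(theta, rule(X))].

    Assumes X_1..X_n are i.i.d. N(theta, sigma^2); this model is an assumption.
    """
    rng = np.random.default_rng(seed)
    X = rng.normal(theta, sigma, size=(reps, n))  # reps independent datasets
    return loss(theta, rule(X)).mean()            # average loss over datasets

for theta in (0.0, 0.2, 1.0):
    print(theta,
          mc_risk(squared_loss, sample_mean, theta),
          mc_risk(squared_loss, half_mean, theta))
```

Under these assumptions the sample mean's squared-error risk stays near \sigma^2/n = 0.04 for every \theta, while the halved mean does better near \theta = 0 and much worse at \theta = 1, which is the sense in which risk is a function of \theta.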

Squared-error risk is exactly MSE

Under squared-error loss the risk of an estimator is R(\theta,\delta)=\mathbb{E}_\theta[(\theta-\delta(X))^2] — which is nothing but the mean squared error we already know, so

R(\theta,\delta) = \operatorname{Var}_\theta(\delta) + \operatorname{Bias}_\theta(\delta)^2.

Decision theory doesn't replace what we learned; it subsumes it. The bias–variance decomposition is just the squared-error risk of an estimator, now sitting inside a much larger theory that also handles absolute loss, 0–1 loss, and any custom cost you care to write down.
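For readers who want the step behind the decomposition, here is a short derivation. The shorthand m = \mathbb{E}_\theta[\delta(X)] is introduced only for this sketch.

```latex
\begin{align*}
R(\theta,\delta)
  &= \mathbb{E}_\theta\big[(\delta(X)-\theta)^2\big] \\
  &= \mathbb{E}_\theta\big[(\delta(X)-m + m-\theta)^2\big] \\
  &= \mathbb{E}_\theta\big[(\delta(X)-m)^2\big]
     + 2(m-\theta)\,\mathbb{E}_\theta\big[\delta(X)-m\big]
     + (m-\theta)^2 \\
  &= \operatorname{Var}_\theta(\delta) + \operatorname{Bias}_\theta(\delta)^2 ,
\end{align*}
```

The cross term vanishes because \mathbb{E}_\theta[\delta(X)-m]=0 by the definition of m, and m-\theta is the bias.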

No rule wins everywhere — so how do we choose?

Here is the central difficulty. Compare two rules by their risk functions and you'll usually find they cross: rule A beats rule B for some \theta, B beats A for others. There is generally no uniformly best rule. (When one rule's risk is never worse and sometimes better than another's, the loser is called inadmissible: it is genuinely dominated. But many sensible rules are admissible and simply trade regions.) To pick, we must collapse the whole risk function to a single number. Two principled ways to do it:

Bayes risk — average the risk over a prior \pi(\theta) on the parameter: r(\pi,\delta)=\int R(\theta,\delta)\,\pi(\theta)\,d\theta. The rule that minimises it is the Bayes rule.
Minimax risk — score each rule by its worst-case risk, \sup_\theta R(\theta,\delta), and pick the rule that minimises that worst case. Cautious, adversarial, prior-free.
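As a rough illustration of the two criteria, the sketch below scores every rule in the one-parameter family c\bar X, whose squared-error risk is c^2 + (1-c)^2\theta^2 when \operatorname{Var}(\bar X)=1 (variance plus squared bias). The prior N(0, \tau^2) with \tau^2 = 4 and the bound |\theta| \le 3 are assumptions made for the example. The search is restricted to this linear family, so the winners are the best linear rules under each criterion, not necessarily the overall Bayes or minimax rules.

```python
import numpy as np

def risk(c, theta):
    """Squared-error risk of the rule c * Xbar when Var(Xbar) = 1."""
    return c**2 + (1 - c)**2 * theta**2

cs = np.linspace(0.0, 1.0, 1001)  # candidate shrinkage factors

# Bayes criterion: average the risk over an assumed prior theta ~ N(0, tau2).
# Since E[theta^2] = tau2, the integral has the closed form below.
tau2 = 4.0
bayes_risk = cs**2 + (1 - cs)**2 * tau2
c_bayes = cs[np.argmin(bayes_risk)]

# Minimax criterion: worst-case risk over an assumed bounded range |theta| <= M.
# (Over the whole real line the worst case is infinite for every c < 1.)
M = 3.0
thetas = np.linspace(-M, M, 601)
worst_case = risk(cs[:, None], thetas[None, :]).max(axis=1)
c_minimax = cs[np.argmin(worst_case)]

print("best c on average against the prior:", c_bayes)
print("best c against the worst case:      ", c_minimax)
```

Minimising c^2 + (1-c)^2 s over c gives c = s/(1+s), so the grid search should land close to \tau^2/(1+\tau^2) for the Bayes criterion and M^2/(1+M^2) for the worst-case criterion.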

Bayes rules read straight off the posterior

The beautiful payoff: minimising Bayes risk is equivalent to minimising posterior expected loss for each dataset, and for the three classic losses the Bayes rule is a familiar summary of the posterior distribution.

Squared-error loss → the posterior mean \mathbb{E}[\theta\mid X].
Absolute-error loss → the posterior median.
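0–1 loss → the posterior mode, i.e. MAP estimation (exactly so for a discrete parameter; for a continuous one, as the limit of a 0–1 loss with a shrinking tolerance band around \theta).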

So MAP estimation is not an ad-hoc recipe — it is the Bayes-optimal decision under 0–1 loss, just as the posterior mean is optimal under squared-error loss. Your choice of loss silently picks which posterior summary you should report.
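The correspondence between losses and posterior summaries can be checked numerically. The sketch below uses hypothetical data (7 successes in 20 Bernoulli trials) and a uniform prior on a discretised grid, both assumptions made for illustration. It finds the action with the smallest posterior expected loss by brute force and compares it with the posterior mean, median and mode.

```python
import numpy as np

# Hypothetical data and a uniform prior on a grid of theta values (assumptions).
grid = np.linspace(0.0, 1.0, 1001)
k, n = 7, 20
post = grid**k * (1 - grid)**(n - k)   # unnormalised posterior on the grid
post /= post.sum()                     # normalise to a probability vector

# A[i, j] is the i-th candidate action, T[i, j] is the j-th theta value.
A, T = np.meshgrid(grid, grid, indexing="ij")

def bayes_action(loss_matrix):
    """Return the grid action minimising posterior expected loss."""
    expected_loss = loss_matrix @ post  # one expected loss per action
    return grid[np.argmin(expected_loss)]

a_squared = bayes_action((T - A) ** 2)
a_absolute = bayes_action(np.abs(T - A))
a_zero_one = bayes_action((A != T).astype(float))

post_mean = (grid * post).sum()
post_median = grid[np.searchsorted(np.cumsum(post), 0.5)]
post_mode = grid[np.argmax(post)]

print("squared error :", a_squared, "vs mean  ", post_mean)
print("absolute error:", a_absolute, "vs median", post_median)
print("0-1 loss      :", a_zero_one, "vs mode  ", post_mode)
```

Because actions are restricted to the grid, the squared-error answer matches the posterior mean only up to the grid spacing. On a discrete grid the 0–1 expected loss of action a is one minus its posterior probability, so its minimiser is the most probable grid point.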

Watch two risk functions cross

Consider estimating a mean in units rescaled so that \sigma^2/n = 1. The flat line is the unbiased rule \bar X, with constant risk 1. The parabola is a shrinkage rule c\bar X with 0 \le c < 1, which pulls the estimate toward 0; its risk is c^2 + (1-c)^2\theta^2 (variance c^2 plus squared bias (1-c)^2\theta^2). Slide the shrinkage factor c: for \theta near 0 the shrinkage rule wins (lower risk), but for large |\theta| it loses. The two curves cross, so neither rule dominates the other, and that is why we need Bayes or minimax to choose.
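The crossing point can be worked out directly: setting c^2 + (1-c)^2\theta^2 = 1 gives \theta^2 = (1+c)/(1-c). A small sketch, with function names chosen for illustration:

```python
import math

def risk_unbiased(theta):
    """Risk of Xbar in rescaled units: constant, equal to 1."""
    return 1.0

def risk_shrink(c, theta):
    """Risk of c * Xbar: variance c^2 plus squared bias (1 - c)^2 * theta^2."""
    return c**2 + (1 - c)**2 * theta**2

def crossing_point(c):
    """|theta| at which the shrinkage rule's risk equals 1, for 0 <= c < 1."""
    if not 0 <= c < 1:
        raise ValueError("expected a shrinkage factor with 0 <= c < 1")
    return math.sqrt((1 + c) / (1 - c))

for c in (0.2, 0.5, 0.8):
    t = crossing_point(c)
    print(f"c={c}: shrinkage wins for |theta| < {t:.3f}, loses beyond it")
```

Heavier shrinkage (smaller c) buys a lower risk at \theta = 0 but narrows the interval of \theta on which the shrinkage rule wins.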

Risk is an expectation over the data you might have seen. It is a property of the rule and the true \theta, computed before you look at your one dataset — not the loss you incur on the sample in your hand. On any particular dataset a high-risk rule might happen to nail the answer, and a low-risk rule might badly miss; risk only promises good behaviour on average across repetitions. And because true risk depends on the unknown \theta, you can never compute your rule's actual risk from data alone — you estimate it, or you commit to a prior (Bayes) or to the worst case (minimax). Confusing "low risk" with "I got it right this time" is a category error.
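The gap between risk and realised loss is easy to see in simulation. In the sketch below the true value \theta = 0.5, the shrinkage factor c = 0.5 and the number of repetitions are assumptions picked for illustration. At this \theta the shrinkage rule has the lower risk (0.3125 against 1), yet it does not win on every dataset.

```python
import numpy as np

rng = np.random.default_rng(1)
theta, c, reps = 0.5, 0.5, 100_000

# One draw of Xbar per simulated dataset, in rescaled units: Xbar ~ N(theta, 1).
xbar = rng.normal(theta, 1.0, size=reps)

loss_unbiased = (xbar - theta) ** 2      # realised loss of Xbar on each dataset
loss_shrink = (c * xbar - theta) ** 2    # realised loss of c * Xbar on each dataset

print("average loss, unbiased rule :", loss_unbiased.mean())
print("average loss, shrinkage rule:", loss_shrink.mean())

# Fraction of datasets on which the higher-risk rule nevertheless did better.
frac_unbiased_wins = np.mean(loss_unbiased < loss_shrink)
print("share of datasets where the unbiased rule had the smaller loss:",
      frac_unbiased_wins)
```

The averages approximate the two risks, while the last figure shows that on roughly a quarter of the simulated datasets the higher-risk rule still lands closer to the truth.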

Minimax feels pessimistic — you optimise against an adversary who gets to choose the true \theta to hurt you most. But it earns its keep when being badly wrong is catastrophic and you have no trustworthy prior: think safety engineering, robust control, or an opponent in a genuine game who really is trying to beat you. There is also a deep bridge between the two philosophies. A minimax rule is often a Bayes rule for the least-favourable prior — the prior an adversary would choose — so minimax is, in a sense, Bayesian analysis against your worst enemy's beliefs. Bayes optimises for the world you expect; minimax insures against the world you fear. Mature analysis keeps both lenses handy.
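One standard textbook illustration of that bridge, not taken from the text above, is estimating a binomial proportion p under squared-error loss. The rule (X + \sqrt{n}/2)/(n + \sqrt{n}) is the posterior mean under a Beta(\sqrt{n}/2, \sqrt{n}/2) prior, and its risk is the same for every p. The sketch below computes exact risks by summing over the binomial distribution; the choice n = 16 and the function names are assumptions for the example.

```python
import math

def exact_risk(estimator, n, p):
    """Exact squared-error risk: sum over all outcomes x of Binomial(n, p)."""
    return sum(
        math.comb(n, x) * p**x * (1 - p) ** (n - x) * (estimator(x, n) - p) ** 2
        for x in range(n + 1)
    )

def mle(x, n):
    """The usual sample proportion."""
    return x / n

def constant_risk_rule(x, n):
    """Posterior mean under a Beta(sqrt(n)/2, sqrt(n)/2) prior."""
    r = math.sqrt(n)
    return (x + r / 2) / (n + r)

n = 16
ps = [i / 20 for i in range(21)]  # grid of true proportions 0.00, 0.05, ..., 1.00
risks_mle = [exact_risk(mle, n, p) for p in ps]
risks_const = [exact_risk(constant_risk_rule, n, p) for p in ps]

print("worst-case risk of the sample proportion:", max(risks_mle))
print("worst-case risk of the Beta-prior rule:  ", max(risks_const))
print("spread of the Beta-prior rule's risk:    ", max(risks_const) - min(risks_const))
```

The flat risk curve is what makes this Bayes rule a candidate for minimax: a Bayes rule whose risk does not depend on the parameter cannot be beaten in the worst case. The sample proportion does better near p = 0 or p = 1 and worse near p = 1/2, so the two risk curves cross here too.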