MAP Estimation
Maximum likelihood listens only to the data. Maximum a posteriori (MAP) estimation
also brings in a prior: it picks the parameter that maximises the posterior,
\hat\theta_{\text{MAP}} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, \underbrace{p(D\mid\theta)}_{\text{likelihood}}\,\underbrace{p(\theta)}_{\text{prior}}.
The evidence p(D) in Bayes' rule does not depend on \theta, so it drops out of the
argmax. Taking logs turns the product into a sum: MAP maximises
\log p(D\mid\theta) + \log p(\theta). The first term is the data fit
(the MLE objective); the second is a penalty that pulls the answer toward what
the prior considers plausible. MAP = MLE + a prior penalty.
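To make the "data fit plus prior penalty" reading concrete, here is a minimal sketch that maximises both objectives by brute force on a grid. It assumes a Gaussian likelihood and a Gaussian prior; the observations, the values of sigma, mu0 and tau, and the helper names are all illustrative choices, not part of the text above.

```python
def log_likelihood(theta, data, sigma):
    # Gaussian log-likelihood, dropping terms that do not depend on theta.
    return -sum((x - theta) ** 2 for x in data) / (2 * sigma ** 2)

def log_prior(theta, mu0, tau):
    # Gaussian log-prior, again up to an additive constant.
    return -((theta - mu0) ** 2) / (2 * tau ** 2)

def argmax_on_grid(objective, lo, hi, steps=20001):
    # Brute-force search over evenly spaced candidates in [lo, hi].
    best_theta, best_value = lo, objective(lo)
    for i in range(1, steps):
        theta = lo + (hi - lo) * i / (steps - 1)
        value = objective(theta)
        if value > best_value:
            best_theta, best_value = theta, value
    return best_theta

# Made-up observations and hyperparameters (assumptions for illustration).
data = [2.1, 1.9, 2.4, 2.0]
sigma, mu0, tau = 1.0, 0.0, 1.0

# MLE: maximise the data-fit term alone.
theta_mle = argmax_on_grid(lambda t: log_likelihood(t, data, sigma), -5.0, 5.0)

# MAP: maximise data fit plus the log-prior penalty.
theta_map = argmax_on_grid(
    lambda t: log_likelihood(t, data, sigma) + log_prior(t, mu0, tau), -5.0, 5.0
)

print(theta_mle, theta_map)
```

Under these assumptions the MLE lands on the sample mean, while the MAP estimate is pulled from there toward the prior mean mu0. A grid search is used only because it keeps the two objectives visible; in practice a closed form or a numerical optimiser would replace it.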
Gaussian prior, Gaussian likelihood
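With a Gaussian prior N(\mu_0, \tau^2) on \theta and a data summary \bar x whose
Gaussian spread around \theta is \sigma_d (for n independent observations with noise
\sigma, \sigma_d^2 = \sigma^2/n), the posterior is again Gaussian, and its peak, the
MAP estimate, is a precision-weighted average of prior and data: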
\hat\theta_{\text{MAP}} = \frac{\mu_0/\tau^2 + \bar x/\sigma_d^2}{1/\tau^2 + 1/\sigma_d^2}.
A confident prior (small \tau) pulls the estimate toward \mu_0;
precise data (small \sigma_d) pull it toward \bar x.
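The precision-weighted average can be sketched in a few lines. The function name gaussian_map and the numbers passed to it are hypothetical; the formula is the one stated above, with each precision written as one over a squared spread.

```python
def gaussian_map(mu0, tau, xbar, sigma_d):
    # Precision = 1 / variance; the more precise source gets the larger weight.
    prior_precision = 1.0 / tau ** 2
    data_precision = 1.0 / sigma_d ** 2
    return (prior_precision * mu0 + data_precision * xbar) / (
        prior_precision + data_precision
    )

# Illustrative values: prior centred at 0, data centred at 2.
equal_trust = gaussian_map(mu0=0.0, tau=1.0, xbar=2.0, sigma_d=1.0)      # midpoint of 0 and 2
confident_prior = gaussian_map(mu0=0.0, tau=0.1, xbar=2.0, sigma_d=1.0)  # close to mu0
precise_data = gaussian_map(mu0=0.0, tau=1.0, xbar=2.0, sigma_d=0.1)     # close to xbar

print(equal_trust, confident_prior, precise_data)
```

With equal spreads the estimate sits halfway between the two centres; shrinking either spread by a factor of ten moves it almost all the way to the corresponding centre.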
Two facts matter. A flat (uninformative) prior, the limit \tau \to \infty, recovers
the MLE. And in the Gaussian case the log-prior is, up to an additive constant,
-\|\theta-\mu_0\|^2/(2\tau^2): a squared penalty on the distance from \mu_0.
That is the seed of ridge / Tikhonov regularization:
a Gaussian prior is an L2 penalty.
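As a rough illustration of the ridge connection, consider a one-parameter regression y = w x + noise, with Gaussian noise of spread sigma and a zero-mean Gaussian prior of spread tau on the slope w. These modelling choices, the function names, and the data are assumptions made for the sketch. Multiplying the negative log-posterior by 2 sigma^2 gives the ridge objective sum (y - w x)^2 + lam * w^2 with lam = sigma^2 / tau^2.

```python
def ridge_slope(xs, ys, lam):
    # Minimiser of sum((y - w*x)^2) + lam * w^2 for a single slope w.
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def map_slope(xs, ys, sigma, tau):
    # MAP slope under a N(0, tau^2) prior and N(0, sigma^2) noise:
    # the same formula as ridge with lam = sigma^2 / tau^2.
    return ridge_slope(xs, ys, lam=sigma ** 2 / tau ** 2)

# Made-up data lying roughly on the line y = 2x.
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.1, 5.9]

w_ols = ridge_slope(xs, ys, lam=0.0)            # no penalty: least squares / MLE
w_map = map_slope(xs, ys, sigma=1.0, tau=0.5)   # tighter prior means stronger shrinkage
w_flat = map_slope(xs, ys, sigma=1.0, tau=1e6)  # nearly flat prior, nearly the MLE

print(w_ols, w_map, w_flat)
```

The penalty strength is set by the ratio of noise variance to prior variance, so a very wide prior (large tau) drives lam toward zero and the MAP slope back to the least-squares one.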
Prior meets data
The three bumps are the prior (what we believed), the likelihood
(what the data says), and their product the posterior — each drawn at unit peak
so the shapes are easy to compare. The posterior sits between prior and data and is
narrower than either: for Gaussians the precisions add, so combining the two always
sharpens the result. Tighten the prior or the
data and watch the posterior slide toward whichever you trust more.
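The narrowing can be checked numerically. In the Gaussian case the posterior precision is the sum of the prior and data precisions, so its spread is smaller than both inputs. The helper name posterior_std and the example spreads are illustrative assumptions.

```python
import math

def posterior_std(tau, sigma_d):
    # Precisions add: 1/s_post^2 = 1/tau^2 + 1/sigma_d^2.
    return 1.0 / math.sqrt(1.0 / tau ** 2 + 1.0 / sigma_d ** 2)

# Illustrative spreads for the prior and the data.
tau, sigma_d = 1.0, 0.5
s_post = posterior_std(tau, sigma_d)

print(s_post, s_post < min(tau, sigma_d))
```

Because each precision is positive, the sum exceeds either term, which is why the posterior bump is narrower than both the prior and the likelihood for any finite choice of tau and sigma_d.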
- MAP maximises \log p(D\mid\theta) + \log p(\theta) — data fit plus a prior penalty.
- Gaussian prior + Gaussian likelihood ⇒ Gaussian posterior; the MAP is a precision-weighted average.
- A flat prior recovers the MLE; a Gaussian prior is an L2 (ridge / Tikhonov) penalty.