MAP Estimation

Maximum likelihood listens only to the data. Maximum a posteriori (MAP) estimation also brings in a prior: it picks the parameter that maximises the posterior,

\hat\theta_{\text{MAP}} = \arg\max_\theta\, p(\theta \mid D) = \arg\max_\theta\, \underbrace{p(D\mid\theta)}_{\text{likelihood}}\,\underbrace{p(\theta)}_{\text{prior}}.

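The second equality holds because the evidence p(D) in Bayes' rule, p(\theta \mid D) = p(D\mid\theta)\,p(\theta)/p(D), does not depend on \theta, so dropping it does not move the maximum. Taking logs turns the product into a sum: MAP maximises \log p(D\mid\theta) + \log p(\theta). The first term is the data fit (the MLE objective); the second is a penalty that pulls the answer toward what the prior considers plausible. MAP = MLE + a prior penalty.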
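As a rough illustration of "data fit plus log-prior", the sketch below maximises both objectives by brute force on a grid. The data values, the noise level sigma, the prior parameters and the helper names are all hypothetical choices made for this example; additive constants that do not depend on theta are dropped because they do not affect the argmax.

```python
def log_likelihood(theta, data, sigma):
    # Gaussian log-likelihood, up to a constant that does not depend on theta
    return -sum((x - theta) ** 2 for x in data) / (2 * sigma ** 2)

def log_prior(theta, mu0, tau):
    # Gaussian log-prior N(mu0, tau^2), up to a constant
    return -((theta - mu0) ** 2) / (2 * tau ** 2)

def argmax_on_grid(objective, lo, hi, steps=10001):
    # Brute-force search: evaluate the objective on an even grid, keep the best point
    grid = [lo + (hi - lo) * i / (steps - 1) for i in range(steps)]
    return max(grid, key=objective)

# Hypothetical data and settings, chosen only for illustration
data = [2.1, 1.9, 2.4, 2.0]
sigma = 1.0
mu0, tau = 0.0, 1.0

theta_mle = argmax_on_grid(lambda t: log_likelihood(t, data, sigma), -5.0, 5.0)
theta_map = argmax_on_grid(
    lambda t: log_likelihood(t, data, sigma) + log_prior(t, mu0, tau), -5.0, 5.0
)

print("MLE:", theta_mle)
print("MAP:", theta_map)
```

With these numbers the MAP estimate lands between the prior mean and the MLE, which is the pull the log-prior term is meant to produce.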

Gaussian prior, Gaussian likelihood

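With a Gaussian prior N(\mu_0, \tau^2) on \theta and a Gaussian likelihood centred on the sample mean \bar x with spread \sigma_d (for n observations with noise \sigma, \sigma_d = \sigma/\sqrt{n}), the posterior is again Gaussian, and its peak, the MAP estimate, is a precision-weighted average of prior and data: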

\hat\theta_{\text{MAP}} = \frac{\mu_0/\tau^2 + \bar x/\sigma_d^2}{1/\tau^2 + 1/\sigma_d^2}.
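The precision-weighted average can be sketched in a few lines. The function name map_gaussian and the numbers in the usage lines are assumptions made for illustration; the arithmetic follows the formula above term by term.

```python
def map_gaussian(mu0, tau, xbar, sigma_d):
    """MAP estimate for a Gaussian prior N(mu0, tau^2) and a Gaussian
    likelihood centred on xbar with spread sigma_d."""
    prior_precision = 1.0 / tau ** 2
    data_precision = 1.0 / sigma_d ** 2
    return (mu0 * prior_precision + xbar * data_precision) / (
        prior_precision + data_precision
    )

# Equal trust in prior and data: the estimate sits halfway between them
balanced = map_gaussian(mu0=0.0, tau=1.0, xbar=2.0, sigma_d=1.0)

# Confident prior (small tau): the estimate stays near mu0
prior_heavy = map_gaussian(mu0=0.0, tau=0.1, xbar=2.0, sigma_d=1.0)

# Precise data (small sigma_d): the estimate moves near xbar
data_heavy = map_gaussian(mu0=0.0, tau=1.0, xbar=2.0, sigma_d=0.1)

print(balanced, prior_heavy, data_heavy)
```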

A confident prior (small \tau) pulls the estimate toward \mu_0; precise data (small \sigma_d) pulls it toward \bar x. Two facts matter. First, a flat (uninformative) prior, \tau \to \infty, recovers the MLE \bar x. Second, in the Gaussian case the negative log-prior is, up to an additive constant, \|\theta-\mu_0\|^2/(2\tau^2): a squared penalty. That is the seed of ridge / Tikhonov regularization: a Gaussian prior is an L2 penalty.
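To make the L2 connection concrete: the negative log-posterior here is (\theta-\bar x)^2/(2\sigma_d^2) + (\theta-\mu_0)^2/(2\tau^2) plus a constant, and multiplying through by 2\sigma_d^2 gives a least-squares term plus a squared penalty with weight \lambda = \sigma_d^2/\tau^2. The sketch below minimises that penalised loss by plain gradient descent and compares the result with the precision-weighted average. The function names, the step-size rule and the numbers are illustrative assumptions, not part of the original text.

```python
def penalised_grad(theta, xbar, mu0, lam):
    # Gradient of (theta - xbar)^2 + lam * (theta - mu0)^2
    return 2.0 * (theta - xbar) + 2.0 * lam * (theta - mu0)

def ridge_estimate(xbar, mu0, lam, steps=200):
    # Gradient descent; the step size is scaled by (1 + lam) so that the
    # error halves on every iteration regardless of the penalty weight
    lr = 0.25 / (1.0 + lam)
    theta = 0.0
    for _ in range(steps):
        theta -= lr * penalised_grad(theta, xbar, mu0, lam)
    return theta

def precision_weighted(mu0, tau, xbar, sigma_d):
    # Closed-form MAP estimate for the Gaussian prior / Gaussian likelihood case
    return (mu0 / tau ** 2 + xbar / sigma_d ** 2) / (1 / tau ** 2 + 1 / sigma_d ** 2)

# Hypothetical settings for illustration
mu0, tau = 0.0, 1.0
xbar, sigma_d = 2.0, 0.5

lam = sigma_d ** 2 / tau ** 2          # penalty weight implied by the prior
theta_ridge = ridge_estimate(xbar, mu0, lam)
theta_closed = precision_weighted(mu0, tau, xbar, sigma_d)

print(lam, theta_ridge, theta_closed)
```

A tighter prior (smaller \tau) means a larger \lambda, i.e. stronger shrinkage toward \mu_0; a flat prior sends \lambda to zero and leaves the unpenalised least-squares answer.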

Prior meets data

The three bumps are the prior (what we believed), the likelihood (what the data says), and their product, the posterior, each drawn at unit peak so the shapes are easy to compare. The posterior sits between prior and data and is narrower than either: for Gaussians the precisions add, 1/\sigma_{\text{post}}^2 = 1/\tau^2 + 1/\sigma_d^2, so combining the two always sharpens. Tighten the prior or the data and watch the posterior slide toward whichever you trust more.
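The picture can be reproduced numerically. The following sketch evaluates unit-peak prior and likelihood bumps on a grid, multiplies them, rescales the product to unit peak, and reports where the posterior peaks and how wide it is. The grid range, the parameter values and the helper name bump are assumptions chosen for illustration.

```python
import math

def bump(theta, centre, width):
    # Unnormalised Gaussian with peak value 1 at theta == centre
    return math.exp(-((theta - centre) ** 2) / (2 * width ** 2))

# Hypothetical settings for illustration
mu0, tau = 0.0, 1.0        # prior centre and spread
xbar, sigma_d = 2.0, 0.5   # likelihood centre and spread

grid = [-4.0 + 8.0 * i / 8000 for i in range(8001)]
prior = [bump(t, mu0, tau) for t in grid]
likelihood = [bump(t, xbar, sigma_d) for t in grid]

# Posterior is proportional to prior * likelihood; rescale to unit peak
product = [p * l for p, l in zip(prior, likelihood)]
peak_value = max(product)
posterior = [v / peak_value for v in product]

peak_theta = grid[product.index(peak_value)]

# For Gaussians the precisions add, which fixes the posterior width
post_width = 1.0 / math.sqrt(1.0 / tau ** 2 + 1.0 / sigma_d ** 2)

print("posterior peak:", peak_theta)
print("posterior width:", post_width, "vs prior", tau, "and data", sigma_d)
```

Changing tau or sigma_d in the sketch moves peak_theta toward whichever source has the smaller spread, mirroring the behaviour described above.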