Priors Are Regularization
Here is the punchline that unites the two halves of this course. Take the Bayesian MAP objective with a zero-mean Gaussian prior m \sim N(0, C_M) and independent, equal-variance Gaussian noise, C_D = \sigma^2 I. The negative log-posterior is, up to an additive constant,
-\log p(m\mid d) \;=\; \frac{1}{2}\left[\frac{1}{\sigma^2}\|d - Gm\|^2 \;+\; m^{\mathsf T} C_M^{-1} m\right].
That is exactly a general Tikhonov functional: an overall positive factor does not move the minimiser. The data-fit term is the negative log-likelihood; the penalty term is the negative log-prior. Maximising the
posterior is regularized least squares.
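To see where the Tikhonov normal equations come from, here is a short worked derivation. It is a sketch under the assumptions already stated (linear forward operator G, symmetric C_M); the symbols \Phi and m_{\text{MAP}} are notation introduced here for convenience, not taken from the course text.

```latex
\begin{aligned}
\Phi(m) &= \frac{1}{\sigma^2}\|d - Gm\|^2 + m^{\mathsf T} C_M^{-1} m, \\
\nabla_m \Phi &= -\frac{2}{\sigma^2}\, G^{\mathsf T}(d - Gm) + 2\, C_M^{-1} m = 0, \\
\left(G^{\mathsf T} G + \sigma^2 C_M^{-1}\right) m_{\text{MAP}} &= G^{\mathsf T} d .
\end{aligned}
```

With C_M = \tau^2 I the last line becomes (G^{\mathsf T} G + (\sigma^2/\tau^2) I)\, m_{\text{MAP}} = G^{\mathsf T} d, which is the standard Tikhonov system with \alpha^2 = \sigma^2/\tau^2.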
The dictionary
Term by term, the Bayesian and deterministic pictures are the same object in two languages:
- A prior N(0, \tau^2 I) ⇒ the penalty \|m\|^2/\tau^2 — standard Tikhonov with \alpha^2 = \sigma^2/\tau^2.
- A smoothness prior (covariance favouring smooth fields) ⇒ a derivative penalty \|Lm\|^2 — general Tikhonov, with L^{\mathsf T}L = C_M^{-1}.
- A confident prior (small \tau) ⇒ a large \alpha — heavy regularization.
So the regularization parameter was never arbitrary: \alpha = \sigma/\tau is the
ratio of how noisy you think the data is to how large you expect the model to be. And the penalty
was never a mere mathematical trick; it is a precise statement of prior belief. The
negative log-prior is the penalty, and for a single model component the parabola
m^2/(2\tau^2) is the L2 penalty drawn out.
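The dictionary can be checked numerically. What follows is a minimal sketch in plain Python, assuming a hypothetical two-parameter straight-line problem: the matrix G, the data d, and the values of sigma and tau are made up for illustration, and the helper names are not from the course. It solves the MAP normal equations and the standard Tikhonov normal equations with \alpha^2 = \sigma^2/\tau^2, then compares the two.

```python
def normal_equations(G, d):
    """Return G^T G (2x2) and G^T d (length 2) for an n-by-2 matrix G."""
    gtg = [[sum(row[i] * row[j] for row in G) for j in range(2)]
           for i in range(2)]
    gtd = [sum(row[i] * di for row, di in zip(G, d)) for i in range(2)]
    return gtg, gtd


def solve_2x2(a, b):
    """Solve the 2x2 linear system a x = b by Cramer's rule."""
    det = a[0][0] * a[1][1] - a[0][1] * a[1][0]
    x0 = (b[0] * a[1][1] - a[0][1] * b[1]) / det
    x1 = (a[0][0] * b[1] - a[1][0] * b[0]) / det
    return [x0, x1]


def map_estimate(G, d, sigma, tau):
    """Minimise ||d - G m||^2 / sigma^2 + ||m||^2 / tau^2 (prior N(0, tau^2 I))."""
    gtg, gtd = normal_equations(G, d)
    a = [[gtg[i][j] / sigma**2 + (1.0 / tau**2 if i == j else 0.0)
          for j in range(2)] for i in range(2)]
    b = [v / sigma**2 for v in gtd]
    return solve_2x2(a, b)


def tikhonov(G, d, alpha2):
    """Minimise ||d - G m||^2 + alpha2 * ||m||^2 (standard Tikhonov)."""
    gtg, gtd = normal_equations(G, d)
    a = [[gtg[i][j] + (alpha2 if i == j else 0.0) for j in range(2)]
         for i in range(2)]
    return solve_2x2(a, gtd)


def norm(m):
    """Euclidean length of a model vector."""
    return sum(v * v for v in m) ** 0.5


# Hypothetical straight-line problem: d_i = m0 + m1 * x_i + noise.
G = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
d = [1.1, 1.9, 3.2]
sigma, tau = 0.5, 2.0

m_map = map_estimate(G, d, sigma, tau)
m_tik = tikhonov(G, d, alpha2=sigma**2 / tau**2)
print(m_map, m_tik)  # the two estimates agree to rounding error

# A more confident prior (smaller tau) means a larger alpha and a smaller model.
m_confident = map_estimate(G, d, sigma, tau=0.2)
print(norm(m_confident) < norm(m_map))
```

Cramer's rule keeps the sketch dependency-free; for anything larger than a toy problem a linear-algebra library would be the natural choice.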
The penalty drawn from the prior
The bell curve is a zero-mean Gaussian prior on a model component; the upward parabola is its
negative logarithm — the penalty m^2/(2\tau^2) that regularization
adds. Narrow the prior (smaller \tau, more confident) and the parabola
steepens — a stronger pull toward zero, i.e. a larger \alpha.
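The relationship between the bell curve and the parabola can also be verified directly. The sketch below, for a single model component, evaluates the negative logarithm of the Gaussian density, shifts it so that it is zero at m = 0, and compares it with m^2/(2\tau^2). The function names and the sample values of m and \tau are illustrative assumptions.

```python
import math


def gaussian_prior(m, tau):
    """Density of the zero-mean Gaussian prior N(0, tau^2) at m."""
    return math.exp(-m**2 / (2.0 * tau**2)) / (tau * math.sqrt(2.0 * math.pi))


def l2_penalty(m, tau):
    """The parabola m^2 / (2 tau^2)."""
    return m**2 / (2.0 * tau**2)


def neg_log_prior_shifted(m, tau):
    """-log prior(m), shifted by a constant so that its value at m = 0 is zero."""
    return -math.log(gaussian_prior(m, tau)) + math.log(gaussian_prior(0.0, tau))


# For each prior width, print (shifted negative log-prior, parabola) pairs.
for tau in (2.0, 1.0, 0.5):
    pairs = [(round(neg_log_prior_shifted(m, tau), 6), round(l2_penalty(m, tau), 6))
             for m in (0.0, 0.5, 1.0, 2.0)]
    print(tau, pairs)
```

The two numbers in each pair match, and at any fixed m the penalty grows as \tau shrinks, which is the steepening described above.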
- Negative log-posterior = data misfit + (negative log-prior) = a Tikhonov functional.
- Gaussian prior N(0,\tau^2 I) ⇒ L2 penalty with \alpha^2 = \sigma^2/\tau^2; smoothness prior ⇒ derivative penalty.
- \alpha is the noise-to-prior ratio: regularization strength is a statement of belief.