Maximum Likelihood = Least Squares

With the data now random, we can ask which model makes the observations most probable: this is the maximum-likelihood estimate. For data d = Gm + e with Gaussian noise e \sim N(0, C_D), the likelihood of the data given a model is

p(d \mid m) \propto \exp\!\Big(-\tfrac12 (d - Gm)^{\mathsf T} C_D^{-1} (d - Gm)\Big).

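The exponential is monotonic and the exponent carries a minus sign, so maximising the likelihood is the same as minimising the quadratic form inside it. Maximum likelihood is therefore weighted least squares: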

\hat m_{\text{MLE}} = \arg\min_m\, (d - Gm)^{\mathsf T} C_D^{-1} (d - Gm).
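The equivalence can be checked numerically. The following is a minimal sketch, assuming NumPy and a made-up toy problem (the arrays G, d, C_D and the two candidate models m_a, m_b are illustrative, not taken from the text). It shows that the Gaussian log-likelihood and the weighted misfit differ only by a factor of -1/2 and a constant that does not depend on m, so they rank models in exactly opposite order.

```python
import numpy as np


def misfit(m, G, d, C_D):
    """Weighted least-squares misfit (d - Gm)^T C_D^{-1} (d - Gm)."""
    r = d - G @ m
    # Solve C_D x = r instead of forming the inverse explicitly.
    return float(r @ np.linalg.solve(C_D, r))


def log_likelihood(m, G, d, C_D):
    """Gaussian log-likelihood log p(d | m), normalising constant included."""
    n = d.size
    _, logdet = np.linalg.slogdet(C_D)
    const = -0.5 * (n * np.log(2.0 * np.pi) + logdet)  # independent of m
    return -0.5 * misfit(m, G, d, C_D) + const


# Hypothetical toy problem: 4 data points, 2 model parameters.
G = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]])
d = np.array([0.1, 0.9, 2.2, 2.9])
C_D = np.diag([0.01, 0.04, 0.01, 0.09])

m_a = np.array([0.0, 1.0])  # candidate model close to the data
m_b = np.array([0.5, 0.5])  # candidate model far from the data

# The constant cancels in differences, leaving -1/2 times the misfit difference.
delta_loglik = log_likelihood(m_a, G, d, C_D) - log_likelihood(m_b, G, d, C_D)
delta_misfit = misfit(m_a, G, d, C_D) - misfit(m_b, G, d, C_D)

print("log-likelihood difference:", delta_loglik)
print("-0.5 * misfit difference: ", -0.5 * delta_misfit)
```

The two printed numbers agree, which is the statement above in numerical form: the model with the smaller weighted misfit is the one with the larger likelihood.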

The covariance does the weighting

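The inverse covariance C_D^{-1} is the natural weighting matrix. A measurement with small variance (trusted) gets a large weight; a noisy one gets little say. When the noise is uncorrelated with equal variance, C_D = \sigma^2 I, the factor 1/\sigma^2 scales the objective without moving its minimiser, and the problem collapses to ordinary least squares, the special case we started from. In general, setting the gradient of the objective to zero gives the normal equations G^{\mathsf T} C_D^{-1} G\, m = G^{\mathsf T} C_D^{-1} d, so, provided G^{\mathsf T} C_D^{-1} G is invertible, the estimate is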

\hat m_{\text{MLE}} = (G^{\mathsf T} C_D^{-1} G)^{-1} G^{\mathsf T} C_D^{-1} d.
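As an illustration, the closed-form estimate might be computed as follows. This is a sketch under stated assumptions rather than a reference implementation: it uses NumPy, the function name weighted_least_squares is hypothetical, and the straight-line problem with two noise levels is synthetic. Linear solves replace the explicit inverses in the formula, which is the usual numerical practice.

```python
import numpy as np


def weighted_least_squares(G, d, C_D):
    """Return (G^T C_D^{-1} G)^{-1} G^T C_D^{-1} d using linear solves."""
    CinvG = np.linalg.solve(C_D, G)  # C_D^{-1} G
    Cinvd = np.linalg.solve(C_D, d)  # C_D^{-1} d
    # Normal equations: (G^T C_D^{-1} G) m = G^T C_D^{-1} d
    return np.linalg.solve(G.T @ CinvG, G.T @ Cinvd)


# Synthetic example: fit a straight line to 6 points, where the first
# three measurements are ten times more precise than the last three.
rng = np.random.default_rng(0)
G = np.column_stack([np.ones(6), np.arange(6.0)])
m_true = np.array([1.0, 2.0])
sigma = np.array([0.1, 0.1, 0.1, 1.0, 1.0, 1.0])
C_D = np.diag(sigma**2)
d = G @ m_true + sigma * rng.standard_normal(6)

m_wls = weighted_least_squares(G, d, C_D)

# With C_D = sigma^2 I the weights cancel and we recover ordinary least squares.
m_uniform = weighted_least_squares(G, d, 0.25 * np.eye(6))
m_ols, *_ = np.linalg.lstsq(G, d, rcond=None)

print("weighted estimate:        ", m_wls)
print("uniform-covariance result:", m_uniform)
print("ordinary least squares:   ", m_ols)
```

The last two results coincide for any value of \sigma^2, while the weighted estimate differs because it leans on the three precise measurements. The solve fails, or returns unreliable values, when G^{\mathsf T} C_D^{-1} G is singular or badly conditioned.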

This is the bridge between the deterministic and statistical stories: the least-squares method we used for stability was, all along, the maximum-likelihood estimate under Gaussian noise. It still offers no help with ill-posedness, though: when G^{\mathsf T} C_D^{-1} G is singular or nearly so, the estimate is undefined or unstable. For that we must add a prior, which is the next step.

Gaussian noise ⇒ likelihood \propto \exp(-\tfrac12 (d-Gm)^{\mathsf T}C_D^{-1}(d-Gm)).
MLE = weighted least squares with weight C_D^{-1}; uniform noise C_D = \sigma^2 I gives ordinary least squares.
Estimate: \hat m = (G^{\mathsf T}C_D^{-1}G)^{-1}G^{\mathsf T}C_D^{-1}d.