Likelihood and MLE
Flip the likelihood around. Fix the data and let the parameter vary: the
likelihood function L(\theta) = P(\text{data} \mid \theta)
scores how well each candidate \theta explains what we saw. The
maximum-likelihood estimate (MLE) is the parameter that scores highest:
\hat\theta_{\text{MLE}} = \arg\max_{\theta} \, L(\theta) = \arg\max_{\theta}\,\prod_i P(x_i \mid \theta).
Because products of many small probabilities are awkward, we almost always maximise the
log-likelihood \ell(\theta) = \sum_i \log P(x_i\mid\theta)
instead — the logarithm turns the product into a sum and does not move the maximum.
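A minimal sketch of why the sum form is preferred in practice, assuming Gaussian measurements with \sigma = 1. The data set, the candidate values and the helper names (gaussian_pdf, likelihood, log_likelihood, argmax) are made up for illustration. On a small sample the product and the sum pick the same \theta; on a large sample the raw product underflows to zero in floating point, while the log-likelihood still ranks the candidates.

```python
import math

def gaussian_pdf(x, theta, sigma=1.0):
    """Density of N(theta, sigma^2) evaluated at x."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, theta, sigma=1.0):
    """Product form: L(theta) = prod_i P(x_i | theta)."""
    out = 1.0
    for x in data:
        out *= gaussian_pdf(x, theta, sigma)
    return out

def log_likelihood(data, theta, sigma=1.0):
    """Sum form: l(theta) = sum_i log P(x_i | theta)."""
    return sum(math.log(gaussian_pdf(x, theta, sigma)) for x in data)

def argmax(candidates, score):
    """Return the candidate with the highest score."""
    return max(candidates, key=score)

# Hypothetical data: 2000 points alternating between 3.5 and 2.5 (average 3).
data = [3.5 if i % 2 == 0 else 2.5 for i in range(2000)]
candidates = [2.0, 2.5, 3.0, 3.5, 4.0]

# On a small sample, both forms agree on the best candidate.
small = data[:10]
best_small_product = argmax(candidates, lambda t: likelihood(small, t))
best_small_log = argmax(candidates, lambda t: log_likelihood(small, t))

# On the full sample, every product underflows to 0.0,
# so the product form can no longer tell the candidates apart.
products = [likelihood(data, t) for t in candidates]
best_by_log = argmax(candidates, lambda t: log_likelihood(data, t))

print("small sample:", best_small_product, best_small_log)
print("products on full sample:", products)
print("best theta by log-likelihood:", best_by_log)
```

The logarithm is strictly increasing, so it preserves the ordering of the candidates; the only thing that changes is that the computation stays within floating-point range.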
Gaussian noise makes MLE into least squares
Suppose each measurement is the truth plus independent Gaussian noise,
x_i = \theta + \varepsilon_i with
\varepsilon_i \sim N(0, \sigma^2). The log-likelihood is
\ell(\theta) = -\frac{1}{2\sigma^2}\sum_i (x_i - \theta)^2 + \text{const}.
Maximising \ell is the same as minimising the sum of squared
residuals \sum_i(x_i - \theta)^2. That is the deep reason
least squares is everywhere: least squares is maximum likelihood under Gaussian
noise. For estimating a single mean, setting the derivative d\ell/d\theta = \frac{1}{\sigma^2}\sum_i (x_i - \theta) to zero shows that the MLE is just the sample average
\hat\theta = \bar x.
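The equivalence can be checked numerically. The sketch below assumes four made-up measurements, a known \sigma = 1 and a simple grid of candidate values for \theta; none of these come from the text. It evaluates the Gaussian log-likelihood and the sum of squared residuals on the same grid and compares the maximiser of one with the minimiser of the other.

```python
import math

def sum_squared_residuals(data, theta):
    """Least-squares objective: sum_i (x_i - theta)^2."""
    return sum((x - theta) ** 2 for x in data)

def gaussian_log_likelihood(data, theta, sigma=1.0):
    """l(theta) = -SSR / (2 sigma^2) + const, with the constant written out."""
    n = len(data)
    const = -n * math.log(sigma * math.sqrt(2.0 * math.pi))
    return -sum_squared_residuals(data, theta) / (2.0 * sigma ** 2) + const

# Hypothetical measurements; their average is 3.
data = [2.1, 2.9, 3.4, 3.6]
sample_mean = sum(data) / len(data)

# Candidate values of theta: 0.00, 0.01, ..., 6.00.
thetas = [i / 100 for i in range(601)]

theta_mle = max(thetas, key=lambda t: gaussian_log_likelihood(data, t))
theta_lsq = min(thetas, key=lambda t: sum_squared_residuals(data, t))

print("maximum likelihood:", theta_mle)
print("least squares:     ", theta_lsq)
print("sample average:    ", sample_mean)
```

A grid search is used only to keep the example transparent. Because the objective is a quadratic in \theta, the closed form \bar x is what one would use in practice.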
The likelihood peaks at the best fit
Take three measurements with sample average \bar x = 3. The curve is the
likelihood of the mean \theta; as a function of \theta it is proportional to a Gaussian centred on
\bar x, the MLE, with standard deviation \sigma/\sqrt{3}. Shrink the noise \sigma and
the peak sharpens: less noise means the data pins the estimate down more tightly.
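The sharpening can be made concrete with a small sketch. It assumes three made-up measurements (2, 3 and 4, so \bar x = 3) and rescales the likelihood so that its peak equals 1. The function names and the grid step are illustrative choices. The code measures how far \theta must move from the peak before the likelihood falls to half its maximum, for two noise levels.

```python
import math

def relative_likelihood(data, theta, sigma):
    """L(theta) / L(x_bar): the likelihood rescaled so its peak equals 1."""
    x_bar = sum(data) / len(data)

    def log_lik(t):
        # Gaussian log-likelihood up to an additive constant.
        return -sum((x - t) ** 2 for x in data) / (2.0 * sigma ** 2)

    return math.exp(log_lik(theta) - log_lik(x_bar))

def half_width(data, sigma, step=0.001):
    """Distance from the peak at which the likelihood first drops to 1/2 or below."""
    x_bar = sum(data) / len(data)
    k = 0
    while relative_likelihood(data, x_bar + k * step, sigma) > 0.5:
        k += 1
    return k * step

# Hypothetical measurements with sample average 3.
data = [2.0, 3.0, 4.0]

width_noisy = half_width(data, sigma=1.0)
width_quiet = half_width(data, sigma=0.5)

print("half-width at sigma = 1.0:", width_noisy)
print("half-width at sigma = 0.5:", width_quiet)
print("ratio:", width_noisy / width_quiet)
```

Halving \sigma halves the width of the peak, which matches the \sigma/\sqrt{n} scaling of a Gaussian-shaped likelihood for the mean of n measurements.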
- The likelihood L(\theta) = \prod_i P(x_i\mid\theta) scores parameters by how well they explain the data; the MLE maximises it.
- Maximise the log-likelihood (a sum) for convenience — same maximiser.
- Under independent Gaussian noise, MLE = least squares; for a single mean, the MLE is the sample average.