Likelihood and MLE

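Flip the conditional probability around. Instead of fixing the parameter and asking how probable each dataset is, fix the data and let the parameter vary: the likelihood function L(\theta) = P(\text{data} \mid \theta) scores how well each candidate \theta explains what we saw. The maximum-likelihood estimate (MLE) is the parameter that scores highest. For independent observations x_i the likelihood factorises into a product: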

\hat\theta_{\text{MLE}} = \arg\max_{\theta} \, L(\theta) = \arg\max_{\theta}\,\prod_i P(x_i \mid \theta).

Products of many small probabilities are awkward to differentiate and underflow in floating-point arithmetic, so we almost always maximise the log-likelihood \ell(\theta) = \sum_i \log P(x_i\mid\theta) instead. The logarithm turns the product into a sum and, because it is strictly increasing, does not move the maximum.
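One concrete way products of many small probabilities become awkward is floating-point underflow. The sketch below is only an illustration under assumptions not in the text: the data are made up, the model is a Gaussian with known \sigma = 1, and the function names are hypothetical.

```python
import math

def gaussian_pdf(x, theta, sigma):
    """Density of N(theta, sigma^2) evaluated at x."""
    z = (x - theta) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def likelihood(data, theta, sigma):
    """Product form: prod_i P(x_i | theta)."""
    result = 1.0
    for x in data:
        result *= gaussian_pdf(x, theta, sigma)
    return result

def log_likelihood(data, theta, sigma):
    """Sum form: sum_i log P(x_i | theta), without ever forming the product."""
    log_norm = -math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(log_norm - 0.5 * ((x - theta) / sigma) ** 2 for x in data)

# Made-up data: 2000 values scattered deterministically around 3.
data = [3.0 + 0.5 * math.sin(i) for i in range(2000)]

product_form = likelihood(data, theta=3.0, sigma=1.0)   # underflows to 0.0
sum_form = log_likelihood(data, theta=3.0, sigma=1.0)   # stays finite
print(product_form, sum_form)
```

With 2000 points every factor is below 0.4, so the product falls under the smallest representable double and collapses to zero, which would make every \theta look equally bad. The sum of logs stays a perfectly ordinary negative number and can still be compared across candidate values of \theta.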

Gaussian noise makes MLE into least squares

Suppose each measurement is the truth plus independent Gaussian noise, x_i = \theta + \varepsilon_i with \varepsilon_i \sim N(0, \sigma^2). The log-likelihood is

\ell(\theta) = -\frac{1}{2\sigma^2}\sum_i (x_i - \theta)^2 + \text{const}.

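The constant and the positive factor 1/(2\sigma^2) do not depend on \theta, so maximising \ell is the same as minimising the sum of squared residuals \sum_i(x_i - \theta)^2. That is the deep reason least squares is everywhere: least squares is maximum likelihood under Gaussian noise. For estimating a single mean, setting d\ell/d\theta = \frac{1}{\sigma^2}\sum_i (x_i - \theta) = 0 gives \sum_i x_i = n\theta, so the MLE is just the sample average \hat\theta = \bar x.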
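The equivalence can be checked numerically with a brute-force search. This is a rough sketch, not an efficient estimator: the three data values, the grid, and the function names are all assumptions chosen for illustration, and the log-likelihood drops the additive constant because it does not affect the maximiser.

```python
def sum_squared_residuals(data, theta):
    """Least-squares objective: sum_i (x_i - theta)^2."""
    return sum((x - theta) ** 2 for x in data)

def log_likelihood(data, theta, sigma):
    """Gaussian log-likelihood up to an additive constant."""
    return -sum_squared_residuals(data, theta) / (2.0 * sigma ** 2)

data = [2.1, 2.9, 4.0]   # hypothetical measurements
sigma = 1.0              # assumed known noise level

# Candidate values of theta from 0.000 to 6.000 in steps of 0.001.
grid = [i / 1000 for i in range(6001)]

theta_mle = max(grid, key=lambda t: log_likelihood(data, t, sigma))
theta_ls = min(grid, key=lambda t: sum_squared_residuals(data, t))
sample_mean = sum(data) / len(data)

print(theta_mle, theta_ls, sample_mean)
```

All three numbers agree to within the grid spacing: the \theta that maximises the log-likelihood is the \theta that minimises the squared residuals, and both coincide with the sample average.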

The likelihood peaks at the best fit

Three measurements with sample average \bar x = 3. The curve is the likelihood of the mean \theta; as a function of \theta it is proportional to a Gaussian centred on \bar x, the MLE, with standard deviation \sigma/\sqrt{n} (here n = 3). Shrink the noise \sigma and the peak sharpens: less noise means the data pins the estimate down more tightly.
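The sharpening can be made concrete by asking how far the likelihood has dropped at a fixed distance from the peak. The sketch below assumes three made-up measurements averaging 3 and compares two noise levels; the helper `relative_likelihood` is a hypothetical name, not something from the text.

```python
import math

def log_likelihood(data, theta, sigma):
    """Full Gaussian log-likelihood, including the normalising constant."""
    n = len(data)
    log_norm = -n * math.log(sigma * math.sqrt(2.0 * math.pi))
    return log_norm - sum((x - theta) ** 2 for x in data) / (2.0 * sigma ** 2)

def relative_likelihood(data, theta, sigma):
    """L(theta) / L(x_bar): equals 1 at the peak and falls off away from it."""
    x_bar = sum(data) / len(data)
    return math.exp(log_likelihood(data, theta, sigma)
                    - log_likelihood(data, x_bar, sigma))

data = [2.1, 2.9, 4.0]   # hypothetical measurements with average 3

wide = relative_likelihood(data, theta=3.5, sigma=1.0)
narrow = relative_likelihood(data, theta=3.5, sigma=0.5)
print(wide, narrow)
```

Half a unit away from the peak, the likelihood is still roughly 0.69 of its maximum when \sigma = 1 but only about 0.22 when \sigma = 0.5. In general the ratio is \exp(-n\,\delta^2 / (2\sigma^2)) at distance \delta from \bar x, so halving \sigma quadruples the exponent.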

The likelihood L(\theta) = \prod_i P(x_i\mid\theta) scores parameters by how well they explain the data; the MLE maximises it.
Maximise the log-likelihood (a sum) for convenience; it has the same maximiser.
Under independent Gaussian noise, MLE = least squares; for a single mean, the MLE is the sample average.