The Least-Squares Solution
When a problem is over-determined (more data than unknowns), d = Gm generally has no
exact solution: noisy measurements simply do not all agree. The goal is then the model
that comes closest: minimise the residual r = d - Gm in
the least-squares sense,
\hat m = \arg\min_m \|d - Gm\|^2.
Geometrically, Gm ranges over the column space of
G; the closest point to d is its
orthogonal projection
onto that subspace. The residual at the optimum is perpendicular to every column of
G, and that perpendicularity condition is what pins down the solution.
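The projection picture is easiest to check in the smallest possible case. The sketch below assumes a hypothetical G with a single column g, so that Gm = m g and the column space is a line; the numbers in g and d are made up for illustration and do not come from the text.

```python
# Hypothetical one-parameter problem: G has a single column g, so G m = m * g.
g = [1.0, 2.0, 2.0]   # the only column of G (made-up values)
d = [3.0, 1.0, 4.0]   # observed data (made-up values)


def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def misfit(m):
    """Squared residual norm ||d - m*g||^2 for a trial model m."""
    return sum((di - m * gi) ** 2 for gi, di in zip(g, d))


# Orthogonal projection of d onto the line spanned by g.
m_hat = dot(g, d) / dot(g, g)
projection = [m_hat * gi for gi in g]
residual = [di - pi for di, pi in zip(d, projection)]

print("m_hat        =", m_hat)
print("g . residual =", dot(g, residual))   # zero up to rounding error
print("misfit at m_hat and at m_hat + 0.1:", misfit(m_hat), misfit(m_hat + 0.1))
```

With one column the formula m_hat = (g . d) / (g . g) is just the scalar case of the orthogonality condition: it is the value of m for which g is perpendicular to d - m g.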
The normal equations
"Residual ⟂ columns of G" is written
G^{\mathsf T}(d - G\hat m) = 0 at the optimum \hat m, which rearranges to the
normal equations:
G^{\mathsf T}G\,\hat m = G^{\mathsf T}d \quad\Longrightarrow\quad \hat m = (G^{\mathsf T}G)^{-1}G^{\mathsf T}d.
This is the same closed form as the
normal equation in regression:
fitting a line is a small over-determined inverse problem. It works cleanly when
G has full column rank, so that G^{\mathsf T}G is invertible; when it does not (or
when G is badly conditioned), we will need the generalized inverse and regularization.
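As a minimal sketch of the normal equations in action, the code below fits a line d = m0 + m1 x to five data points. The data values and all variable names are assumptions made for illustration. It uses plain Python so that the 2x2 system G^T G m = G^T d stays visible; for real problems a library least-squares routine is usually a better choice than forming G^T G explicitly.

```python
# Hypothetical data: five (x, d) pairs for the line d = m0 + m1 * x.
x = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [1.1, 1.9, 3.2, 3.8, 5.1]

# Each row of G is [1, x_i], so (G m)_i = m0 + m1 * x_i.
G = [[1.0, xi] for xi in x]

# Form the symmetric 2x2 matrix A = G^T G and the vector b = G^T d.
a00 = sum(row[0] * row[0] for row in G)
a01 = sum(row[0] * row[1] for row in G)
a11 = sum(row[1] * row[1] for row in G)
b0 = sum(row[0] * di for row, di in zip(G, d))
b1 = sum(row[1] * di for row, di in zip(G, d))

# Solve A m = b by Cramer's rule. A nonzero determinant is exactly the
# full-column-rank condition for this two-column G.
det = a00 * a11 - a01 * a01
if abs(det) < 1e-12:
    raise ValueError("G^T G is singular: G does not have full column rank")
m0 = (b0 * a11 - a01 * b1) / det
m1 = (a00 * b1 - a01 * b0) / det

# Check the orthogonality condition G^T (d - G m_hat) = 0.
residual = [di - (m0 + m1 * xi) for xi, di in zip(x, d)]
gt_r0 = sum(row[0] * ri for row, ri in zip(G, residual))
gt_r1 = sum(row[1] * ri for row, ri in zip(G, residual))

print("intercept m0 =", m0)
print("slope     m1 =", m1)
print("G^T r        =", (gt_r0, gt_r1))   # both entries zero up to rounding
```

If all x values were identical, the two columns of G would be parallel, the determinant would vanish, and the sketch would raise instead of returning a meaningless model.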
Minimise the squared residuals
Five data points and a line you control (slope and intercept). The vertical segments are the
residuals; the readout is their sum of squares \|r\|^2. Hunt for the
smallest value — the faint line is the true least-squares optimum
(\|r\|^2 = 3.6), and you will find you cannot beat it.
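The same hunt can be sketched in code. The five data points below are hypothetical, not the ones in the exercise, so the minimum will not be 3.6. The sketch computes the least-squares line with the standard mean-centred formulas for a straight-line fit, then tries a grid of nearby slopes and intercepts to see whether any of them gives a smaller sum of squares.

```python
# Hypothetical data: five (x, d) pairs (not the data from the exercise).
x = [0.0, 1.0, 2.0, 3.0, 4.0]
d = [1.1, 1.9, 3.2, 3.8, 5.1]


def misfit(intercept, slope):
    """Sum of squared residuals ||r||^2 for the line intercept + slope * x."""
    return sum((di - (intercept + slope * xi)) ** 2 for xi, di in zip(x, d))


# Least-squares line via the mean-centred form of the normal equations.
n = len(x)
x_mean = sum(x) / n
d_mean = sum(d) / n
s_xy = sum((xi - x_mean) * (di - d_mean) for xi, di in zip(x, d))
s_xx = sum((xi - x_mean) ** 2 for xi in x)
slope_hat = s_xy / s_xx
intercept_hat = d_mean - slope_hat * x_mean
best = misfit(intercept_hat, slope_hat)

# Hunt on a grid of perturbations around the optimum (step 0.05, range +/- 1).
offsets = [0.05 * k for k in range(-20, 21)]
trials = [
    misfit(intercept_hat + di_off, slope_hat + ds_off)
    for di_off in offsets
    for ds_off in offsets
]
beaten = any(t < best - 1e-12 for t in trials)

print("least-squares misfit:", best)
print("smallest grid misfit:", min(trials))
print("optimum beaten?      ", beaten)
```

Because the misfit is a convex quadratic in the slope and intercept, every perturbed line on the grid does at least as badly as the least-squares line.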
- Minimise \|d - Gm\|^2: the best fit is the projection of d onto the column space of G.
- The residual at the optimum is orthogonal to the columns: G^{\mathsf T}(d - G\hat m) = 0.
- Normal equations: G^{\mathsf T}G\,\hat m = G^{\mathsf T}d, so \hat m = (G^{\mathsf T}G)^{-1}G^{\mathsf T}d when G has full column rank.