The Least-Squares Solution

When a problem is over-determined (more data than unknowns), d = Gm generally has no exact solution: noisy measurements do not all agree. The goal is then the model that comes closest, the one that minimises the residual r = d - Gm in the least-squares sense,

\hat m = \arg\min_m \|d - Gm\|^2.

Geometrically, Gm ranges over the column space of G; the closest point to d is its orthogonal projection onto that subspace. The residual at the optimum is therefore perpendicular to every column of G, and that perpendicularity condition is what determines the solution.
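
The perpendicularity claim can also be cross-checked with calculus. As a brief sketch (the symbol g_j for the j-th column of G is introduced here only for illustration), expand the objective and set its gradient with respect to m to zero:

\|d - Gm\|^2 = d^{\mathsf T}d - 2\,m^{\mathsf T}G^{\mathsf T}d + m^{\mathsf T}G^{\mathsf T}G\,m,

\nabla_m \|d - Gm\|^2 = -2\,G^{\mathsf T}d + 2\,G^{\mathsf T}G\,m = -2\,G^{\mathsf T}(d - Gm) = 0.

Row j of this condition reads g_j^{\mathsf T}(d - Gm) = 0, so the residual is perpendicular to each column of G. The Hessian 2\,G^{\mathsf T}G is positive semidefinite, and positive definite when G has full column rank, so in that case the stationary point is the unique minimiser.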

The normal equations

"Residual ⟂ columns of G" is written G^{\mathsf T}(d - Gm) = 0, which rearranges to the normal equations:

G^{\mathsf T}G\,\hat m = G^{\mathsf T}d \quad\Longrightarrow\quad \hat m = (G^{\mathsf T}G)^{-1}G^{\mathsf T}d.

This is the same closed form as the normal equation in regression: fitting a line is a small over-determined inverse problem. The formula requires G to have full column rank, so that G^{\mathsf T}G is invertible; when G does not (or is badly conditioned), we will need the generalized inverse and regularization.
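
As a minimal sketch of the normal equations in practice, the snippet below fits a line to five points with NumPy. The data values, the variable names, and the intercept-first column ordering of G are assumptions made for illustration; they are not taken from the text. It solves G^T G m = G^T d directly and compares the result with NumPy's own least-squares routine.

```python
import numpy as np

# Hypothetical data: five (x, d) pairs chosen for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([2.0, 4.0, 5.0, 4.0, 5.0])

# Forward operator for a line d = m[0] + m[1] * x:
# a column of ones (intercept) and a column of x (slope).
G = np.column_stack([np.ones_like(x), x])

# Normal equations: solve (G^T G) m = G^T d rather than forming the inverse.
m_normal = np.linalg.solve(G.T @ G, G.T @ d)

# Reference solution from NumPy's least-squares solver.
m_lstsq, *_ = np.linalg.lstsq(G, d, rcond=None)

# Residual at the optimum; G^T r should vanish up to rounding error.
r = d - G @ m_normal

print("m_hat (normal equations):", m_normal)
print("m_hat (lstsq):           ", m_lstsq)
print("G^T r:", G.T @ r)
print("cond(G^T G):", np.linalg.cond(G.T @ G))
```

Using `np.linalg.solve` avoids computing (G^T G)^{-1} explicitly. The printed condition number gives a rough indication of when this route becomes unreliable: forming G^T G squares the condition number of G, which is one reason a badly conditioned G calls for other tools.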

Minimise the squared residuals

Five data points and a line you control (slope and intercept). The vertical segments are the residuals; the readout is their sum of squares \|r\|^2. Hunt for the smallest value — the faint line is the true least-squares optimum (\|r\|^2 = 3.6), and you will find you cannot beat it.
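
The same experiment can be sketched numerically. The example below uses hypothetical data (not the five points of the interactive figure, so its optimum will not equal the value quoted above) and a hypothetical helper `sum_sq`. It tries many randomly perturbed lines and checks that none has a smaller sum of squared residuals than the least-squares fit.

```python
import numpy as np

# Hypothetical data, chosen for illustration only.
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
d = np.array([2.0, 4.0, 5.0, 4.0, 5.0])
G = np.column_stack([np.ones_like(x), x])  # columns: intercept, slope


def sum_sq(m):
    """Sum of squared residuals ||d - G m||^2 for a model m = (intercept, slope)."""
    r = d - G @ m
    return float(r @ r)


# Least-squares optimum and its misfit.
m_hat, *_ = np.linalg.lstsq(G, d, rcond=None)
best = sum_sq(m_hat)

# Try 1000 random lines near the optimum (fixed seed for repeatability).
rng = np.random.default_rng(0)
trials = m_hat + rng.normal(scale=0.5, size=(1000, 2))
smallest_gap = min(sum_sq(m) - best for m in trials)

print("optimal ||r||^2:", best)
print("smallest excess misfit over all trials:", smallest_gap)
```

The excess misfit is never negative because, for any m, \|d - Gm\|^2 = \|d - G\hat m\|^2 + \|G(m - \hat m)\|^2: the cross term vanishes by the orthogonality of the optimal residual to the columns of G.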

Minimise \|d - Gm\|^2: the best fit is the projection of d onto the column space of G.
The residual is orthogonal to the columns: G^{\mathsf T}(d - Gm) = 0.
Normal equations: \hat m = (G^{\mathsf T}G)^{-1}G^{\mathsf T}d (full column rank).