The Camera and View Matrix

Here is the secret every renderer keeps: the camera never moves. The GPU bolts the viewpoint to the origin, staring fixedly down one axis, and to look around it shoves the entire universe past that frozen eye. Walking forward isn't you moving — it's the world sliding backwards through you. The matrix that performs this grand relocation is the view matrix V, and it takes a vertex from world space into camera (eye) space:

\vec{v}_{\text{cam}} = V\,\vec{v}_{\text{world}}.

The camera has a world transform of its own — call it M_{\text{cam}}, which says where the camera is and which way it faces. The view matrix is just that transform run backwards: to express the world in the camera's frame, undo the camera's placement.

The view matrix is the camera's inverse

A camera is a rigid body — it rotates and translates, never bends or scales. From inverse transforms we already know how to undo a rigid move cheaply. Let us re-derive the view matrix from scratch.

Step 1 — write the camera's world transform. The camera sits at world position (the eye) \vec{e} with orientation R. A point \vec{p}_{\text{cam}} given in the camera's own frame lands in the world at

\vec{p}_{\text{world}} = R\,\vec{p}_{\text{cam}} + \vec{e}.

Step 2 — we want the opposite direction. Rendering needs every world vertex re-expressed in the camera's frame, so solve for \vec{p}_{\text{cam}}. Subtract the eye:

\vec{p}_{\text{world}} - \vec{e} = R\,\vec{p}_{\text{cam}}.

Step 3 — undo the rotation with a transpose. A camera's orientation R is orthogonal, so R^{-1} = R^{\top} — no general inverse needed:

\vec{p}_{\text{cam}} = R^{\top}\big(\vec{p}_{\text{world}} - \vec{e}\big) = R^{\top}\vec{p}_{\text{world}} - R^{\top}\vec{e}.

Step 4 — read off the view matrix. Comparing with the rigid form R'\vec{p} + \vec{t}', the view matrix rotates by R^{\top} and translates by -R^{\top}\vec{e} — it is exactly the inverse of the camera's world transform:

V = M_{\text{cam}}^{-1} = \big(\,R^{\top},\; -R^{\top}\vec{e}\,\big).

That single fact is why nudging the camera right shunts the whole world left.
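
Steps 1 to 4 can be checked numerically. The following is a minimal sketch in plain Python, assuming column vectors and 4x4 homogeneous matrices stored as nested lists; the helper names (`rigid_matrix`, `rigid_inverse`, `matmul4`) and the sample yaw angle and eye position are illustrative choices, not part of any particular engine. It builds M_{\text{cam}} from R and \vec{e}, builds V from R^{\top} and -R^{\top}\vec{e}, and multiplies the two.

```python
import math

def rigid_matrix(R, t):
    """4x4 homogeneous matrix for the rigid map p -> R p + t."""
    return [R[i] + [t[i]] for i in range(3)] + [[0.0, 0.0, 0.0, 1.0]]

def rigid_inverse(R, e):
    """Inverse of rigid_matrix(R, e): rotate by R^T, translate by -R^T e."""
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    t = [-sum(Rt[i][k] * e[k] for k in range(3)) for i in range(3)]
    return rigid_matrix(Rt, t)

def matmul4(A, B):
    """Product of two 4x4 matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

# Hypothetical camera: yawed 40 degrees about the y-axis, eye at (2, 0, 3).
angle = math.radians(40)
c, s = math.cos(angle), math.sin(angle)
R = [[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]]
e = [2.0, 0.0, 3.0]

M_cam = rigid_matrix(R, e)   # camera space -> world space
V = rigid_inverse(R, e)      # world space -> camera space
product = matmul4(V, M_cam)  # should be the identity, up to rounding

# The eye itself, pushed through V, should land on the origin.
eye_in_cam = [sum(V[i][k] * e[k] for k in range(3)) + V[i][3] for i in range(3)]
```

No general matrix inversion appears anywhere: a transpose and three dot products are enough, which is the practical payoff of the camera being a rigid body.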

Look-at: building the camera's basis

We never type R in by hand. Instead we hand the engine three intentions — where the eye is (\vec{e}), what to look at (the target \vec{c}), and which way is up (a rough up-hint \vec{u}_{\text{hint}}, usually world-up) — and the look-at construction manufactures an orthonormal basis from them.

Step 5 — the forward axis. The camera looks from the eye toward the target, so the viewing direction is the normalised eye-to-target vector. Following the distance and direction recipe:

\vec{f} = \frac{\vec{c} - \vec{e}}{\lVert \vec{c} - \vec{e} \rVert}.

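Step 6 — the right axis. "Right" must be perpendicular to both forward and up. The cross product of two vectors is orthogonal to both of them, which is exactly what is needed, provided \vec{u}_{\text{hint}} is not parallel to \vec{f} (otherwise the product is the zero vector and cannot be normalised):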

\vec{r} = \frac{\vec{f} \times \vec{u}_{\text{hint}}}{\lVert \vec{f} \times \vec{u}_{\text{hint}} \rVert}.

Step 7 — the true up axis. The up-hint was only a rough guide — it need not be exactly perpendicular to \vec{f}. Cross right with forward to recover a clean up that is perpendicular to both. Since \vec{r} and \vec{f} are already unit and orthogonal, the result is already unit:

\vec{u} = \vec{r} \times \vec{f}.

Now \{\vec{r}, \vec{u}, \vec{f}\} is an orthonormal basis: the camera's right, up, and forward, all mutually perpendicular and of length one.
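
Steps 5 to 7 can be sketched as a small function. This is an illustrative version in plain Python, assuming 3-vectors stored as lists; `look_at_basis` and its helpers are hypothetical names, and the choice to raise an error when a vector cannot be normalised (for instance when the up-hint is parallel to the viewing direction) is an assumption of this sketch rather than something the construction prescribes.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length < 1e-12:
        raise ValueError("cannot normalise a zero-length vector")
    return [x / length for x in v]

def look_at_basis(eye, target, up_hint):
    """Return (right, up, forward) following steps 5 to 7."""
    f = normalize(sub(target, eye))   # step 5: unit eye-to-target direction
    r = normalize(cross(f, up_hint))  # step 6: fails if up_hint is parallel to f
    u = cross(r, f)                   # step 7: unit already, no normalise needed
    return r, u, f

# Example values chosen only for illustration.
r, u, f = look_at_basis([3.0, 2.0, 5.0], [0.0, 1.0, 0.0], [0.0, 1.0, 0.0])
```

Checking `dot(r, u)`, `dot(r, f)`, and `dot(u, f)` against zero, and each vector's squared length against one, is a quick way to confirm the basis is orthonormal.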

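Step 8 — pack the basis into rows and translate by the eye. With the camera looking down its negative z-axis, its three local axes in world coordinates are \vec{r}, \vec{u}, and -\vec{f}, so the orientation R has those vectors as its columns and its inverse R^{\top} has them as its rows. The rotation part of V is therefore "the basis, as rows" (with the forward row negated), and the translation is -R^{\top}\vec{e}, whose entries are the negated dot products of each row with the eye: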

V = \begin{bmatrix} r_x & r_y & r_z & -\vec{r}\cdot\vec{e} \\ u_x & u_y & u_z & -\vec{u}\cdot\vec{e} \\ -f_x & -f_y & -f_z & \;\;\vec{f}\cdot\vec{e} \\ 0 & 0 & 0 & 1 \end{bmatrix}.

(The forward row carries a minus sign by the usual convention that the camera looks down its negative z-axis, so things in front get negative depth.) Read the matrix as a recipe: each of the first three rows projects a point's offset from the eye onto one camera axis. The right row measures how far right the point lies, the up row how far up, and the negated forward row how far in front, reported as a negative z. The last column is what subtracts the eye: it holds -R^{\top}\vec{e}, so the eye itself lands on the origin.
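
The matrix above can be assembled and sanity-checked in a few lines. The sketch below is plain Python under the same assumptions as the formula (column vectors, camera looking down its negative z-axis); `look_at` and `transform_point` are hypothetical helper names, and the eye and target values are arbitrary.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def look_at(eye, target, up_hint):
    """4x4 view matrix with rows r, u, -f and last column -R^T e."""
    f = normalize(sub(target, eye))
    r = normalize(cross(f, up_hint))
    u = cross(r, f)
    return [
        [ r[0],  r[1],  r[2], -dot(r, eye)],
        [ u[0],  u[1],  u[2], -dot(u, eye)],
        [-f[0], -f[1], -f[2],  dot(f, eye)],
        [0.0, 0.0, 0.0, 1.0],
    ]

def transform_point(V, p):
    """Apply V to the point p, treated as (x, y, z, 1)."""
    h = [p[0], p[1], p[2], 1.0]
    return [sum(V[i][k] * h[k] for k in range(4)) for i in range(3)]

eye = [3.0, 2.0, 5.0]
target = [0.0, 0.0, 0.0]
V = look_at(eye, target, [0.0, 1.0, 0.0])

eye_cam = transform_point(V, eye)        # the eye lands on the origin
target_cam = transform_point(V, target)  # the target lands on the negative z-axis
distance = math.sqrt(dot(sub(target, eye), sub(target, eye)))
```

Two checks fall straight out of the construction: `eye_cam` is the origin, and `target_cam` is `(0, 0, -distance)`, directly in front of the camera at negative depth.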

The view matrix V moves the world into the camera's frame.

It feels backwards because it is backwards. There is no privileged "camera object" flying around the scene at render time; there is only the origin, and a fixed direction of gaze. Every sense of motion you feel in a game is the world being transported into that fixed frame by V. Strafe right, and -R^{\top}\vec{e} shifts so the world streams left. Spin to face a new target, and R^{\top} counter-rotates the world the other way.
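
The strafing claim can be made concrete with a small numerical experiment. This sketch, in plain Python, assumes a camera yawed 30 degrees about the y-axis whose local axes (right, up, back) form the rows of R^{\top}; the function name `world_to_camera`, the prop position, and the step size are all made up for illustration. It moves the eye along the camera's own right axis and compares a fixed prop's camera-space coordinates before and after.

```python
import math

def world_to_camera(axes, eye, p):
    """p_cam = R^T (p - eye), where axes holds the rows of R^T."""
    offset = [p[i] - eye[i] for i in range(3)]
    return [sum(axis[i] * offset[i] for i in range(3)) for axis in axes]

# Camera axes in world coordinates (the camera looks along -back).
yaw = math.radians(30)
right = [math.cos(yaw), 0.0, -math.sin(yaw)]
up = [0.0, 1.0, 0.0]
back = [math.sin(yaw), 0.0, math.cos(yaw)]
axes = [right, up, back]

eye = [2.0, 1.0, 4.0]
prop = [0.0, 0.5, -3.0]  # a fixed object in the world

before = world_to_camera(axes, eye, prop)

# Strafe: slide the eye 1.5 units along the camera's right axis.
step = 1.5
eye_after = [eye[i] + step * right[i] for i in range(3)]
after = world_to_camera(axes, eye_after, prop)

# Change in the prop's camera-space coordinates caused by the strafe.
shift = [after[i] - before[i] for i in range(3)]
```

Here `shift` comes out as `(-1.5, 0, 0)` up to rounding: the prop has not moved in the world, yet in camera space it has slid left by exactly the distance the camera strafed right.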

This is also why V is the camera's inverse rather than its transform. The camera transform M_{\text{cam}} would take something sitting at the origin and place it where the camera is; rendering wants the reverse trip, taking the world and re-coordinatising it relative to the camera. Undo the placement, and the scene falls into the viewpoint's own grid.

Watch the world fall into camera space

Top-down view of a flat scene of props (the coloured dots). Drag the eye sliders to move the camera and the target slider to swing where it looks. The faint grid is world space; the bold axes are the camera's right (\vec{r}) and forward (\vec{f}) basis, built by look-at. The hollow ghosts show each prop's original world position; the solid dots show the same props re-expressed in camera coordinates by V — the eye snapped to the origin, the world rotated to line up with the gaze.