Here is the secret every renderer keeps: the camera never moves. The GPU
bolts the viewpoint to the origin, staring fixedly down one axis, and to look around it shoves
the entire universe past that frozen eye. Walking forward isn't you moving — it's the world
sliding backwards through you. The matrix that performs this grand relocation is the
view matrix V, and it takes a vertex from
world space into camera (eye) space:
\vec{v}_{\text{cam}} = V\,\vec{v}_{\text{world}}.
The camera has a world transform of its own — call it M_{\text{cam}},
which says where the camera is and which way it faces. The view matrix is just that
transform run backwards: to express the world in the camera's frame, undo the camera's placement.
The view matrix is the camera's inverse
A camera is a rigid body: it rotates and translates, but never bends or scales. From
inverse transforms
we already know how to undo a rigid move cheaply, so let us re-derive the view matrix from scratch.
Step 1 — write the camera's world transform. The camera sits at world
position (the eye) \vec{e} with orientation
R. A point \vec{p}_{\text{cam}} given in
the camera's own frame lands in the world at
\vec{p}_{\text{world}} = R\,\vec{p}_{\text{cam}} + \vec{e}.
Step 2 — we want the opposite direction. Rendering needs every world vertex
re-expressed in the camera's frame, so solve for \vec{p}_{\text{cam}}.
Subtract the eye:
\vec{p}_{\text{world}} - \vec{e} = R\,\vec{p}_{\text{cam}}.
Step 3 — undo the rotation with a transpose. A camera's orientation
R is orthogonal, so R^{-1} = R^{\top} — no
general inverse needed:
\vec{p}_{\text{cam}} = R^{\top}\big(\vec{p}_{\text{world}} - \vec{e}\big) = R^{\top}\vec{p}_{\text{world}} - R^{\top}\vec{e}.
Step 4 — read off the view matrix. Comparing with the rigid form
R'\vec{p} + \vec{t}', the view matrix rotates by
R^{\top} and translates by -R^{\top}\vec{e}
— it is exactly the inverse of the camera's world transform:
V = M_{\text{cam}}^{-1} = \big(\,R^{\top},\; -R^{\top}\vec{e}\,\big).
That single fact is why nudging the camera right shunts the whole world left.
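Steps 1 to 4 can be checked numerically. The sketch below is illustrative only: the 90-degree yaw, the eye position, and the helper names `mat_vec`, `transpose`, `camera_to_world` and `world_to_camera` are assumptions made for the example, not part of any particular engine. It applies the camera's world transform and then the view transform, and confirms the round trip returns the original point.

```python
import math

def mat_vec(m, v):
    """Multiply a 3x3 matrix (list of rows) by a 3-vector."""
    return [sum(m[i][j] * v[j] for j in range(3)) for i in range(3)]

def transpose(m):
    """Transpose a 3x3 matrix; for an orthogonal R this is its inverse."""
    return [[m[j][i] for j in range(3)] for i in range(3)]

# Assumed example camera: yawed 90 degrees about the y-axis, sitting at e.
theta = math.pi / 2
R = [[math.cos(theta), 0.0, math.sin(theta)],
     [0.0, 1.0, 0.0],
     [-math.sin(theta), 0.0, math.cos(theta)]]
e = [2.0, 1.0, -3.0]

def camera_to_world(p_cam):
    """Step 1: p_world = R p_cam + e."""
    rotated = mat_vec(R, p_cam)
    return [rotated[i] + e[i] for i in range(3)]

def world_to_camera(p_world):
    """Steps 2 to 4: p_cam = R^T (p_world - e)."""
    shifted = [p_world[i] - e[i] for i in range(3)]
    return mat_vec(transpose(R), shifted)

p_cam = [0.5, -1.0, -4.0]
round_trip = world_to_camera(camera_to_world(p_cam))  # should match p_cam
eye_in_camera = world_to_camera(e)                    # the eye lands on the origin
```

Up to floating-point rounding, `round_trip` equals `p_cam` and `eye_in_camera` is the zero vector, which is what "the view matrix is the camera's inverse" means in practice.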
Look-at: building the camera's basis
We never type R in by hand. Instead we hand the engine three
intentions — where the eye is (\vec{e}), what to look at
(the target \vec{c}), and which way is up (a rough up-hint
\vec{u}_{\text{hint}}, usually world-up) — and the
look-at construction manufactures an orthonormal basis from them.
Step 5 — the forward axis. The camera looks from the eye toward the target,
so the viewing direction is the normalised eye-to-target vector. Following the distance and direction recipe:
\vec{f} = \frac{\vec{c} - \vec{e}}{\lVert \vec{c} - \vec{e} \rVert}.
Step 6 — the right axis. "Right" must be perpendicular to both forward and up.
The cross product delivers exactly such a vector, orthogonal to both of its inputs:
\vec{r} = \frac{\vec{f} \times \vec{u}_{\text{hint}}}{\lVert \vec{f} \times \vec{u}_{\text{hint}} \rVert}.
Step 7 — the true up axis. The up-hint was only a rough guide — it need not be
exactly perpendicular to \vec{f}. Cross right with forward to recover a
clean up that is perpendicular to both. Since \vec{r} and
\vec{f} are already unit and orthogonal, the result is already unit:
\vec{u} = \vec{r} \times \vec{f}.
Now \{\vec{r}, \vec{u}, \vec{f}\} is an orthonormal basis: the
camera's right, up, and forward, all mutually perpendicular and of length one.
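As a minimal sketch of Steps 5 to 7, the function below builds the three axes from an eye, a target and an up-hint. The name `look_at_basis`, the small vector helpers and the `1e-12` tolerance are assumptions for illustration. One caveat is made explicit: if the viewing direction is parallel to the up-hint, the cross product in Step 6 is the zero vector and cannot be normalised, so the sketch raises an error in that case.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    if length < 1e-12:  # assumed tolerance for "effectively zero"
        raise ValueError("cannot normalise a zero-length vector")
    return [x / length for x in v]

def look_at_basis(eye, target, up_hint):
    """Return (right, up, forward) as unit, mutually perpendicular vectors."""
    f = normalize(sub(target, eye))    # Step 5: forward
    r = normalize(cross(f, up_hint))   # Step 6: right (fails if f is parallel to up_hint)
    u = cross(r, f)                    # Step 7: true up, already unit length
    return r, u, f

# Example: camera above and behind the origin, looking at the origin.
r, u, f = look_at_basis([0.0, 2.0, 5.0], [0.0, 0.0, 0.0], [0.0, 1.0, 0.0])
```

For this example the right axis comes out along world +x, and the true up tilts away from world-up so that it stays perpendicular to the downward-pitched forward axis.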
Step 8 — pack the basis into rows and translate by the eye. The orientation
R has the camera's axes as its columns; its inverse
R^{\top} therefore has them as its rows. So the rotation part
of V is "the basis, as rows", and the translation is
-R^{\top}\vec{e}, whose entries are the negated dot products of each row with the eye:
V = \begin{bmatrix} r_x & r_y & r_z & -\vec{r}\cdot\vec{e} \\ u_x & u_y & u_z & -\vec{u}\cdot\vec{e} \\ -f_x & -f_y & -f_z & \;\;\vec{f}\cdot\vec{e} \\ 0 & 0 & 0 & 1 \end{bmatrix}.
(The forward row carries a minus sign by the usual convention that the camera looks down its
negative z-axis, so things in front get negative depth.) Read
the matrix as a recipe: each row projects the world onto one camera axis. The right row measures
how far right of the eye a point lies, the up row how far up, and the forward row how far in front
(negated, so that depth in front comes out negative), while the last column supplies the offset that places the eye at the origin.
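To make Step 8 concrete, here is a hedged sketch that assembles the 4x4 matrix exactly as written above and applies it to two points. The function names `look_at` and `transform_point` and the example eye and target are assumptions for illustration; the layout is row-major with column vectors, matching the matrix in the text, and it does not guard against a forward direction parallel to the up-hint.

```python
import math

def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

def cross(a, b):
    return [a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0]]

def normalize(v):
    length = math.sqrt(dot(v, v))
    return [x / length for x in v]

def look_at(eye, target, up_hint):
    """Build the view matrix V as a list of four rows."""
    f = normalize(sub(target, eye))
    r = normalize(cross(f, up_hint))
    u = cross(r, f)
    return [
        [ r[0],  r[1],  r[2], -dot(r, eye)],
        [ u[0],  u[1],  u[2], -dot(u, eye)],
        [-f[0], -f[1], -f[2],  dot(f, eye)],  # camera looks down its -z axis
        [ 0.0,   0.0,   0.0,   1.0],
    ]

def transform_point(m, p):
    """Apply a 4x4 matrix to a 3D point using homogeneous w = 1."""
    v = [p[0], p[1], p[2], 1.0]
    out = [sum(m[i][j] * v[j] for j in range(4)) for i in range(4)]
    return out[:3]

eye = [0.0, 2.0, 5.0]
target = [0.0, 0.0, 0.0]
V = look_at(eye, target, [0.0, 1.0, 0.0])

eye_cam = transform_point(V, eye)        # the eye maps to the origin
target_cam = transform_point(V, target)  # the target sits straight ahead on -z
```

Two sanity checks fall out of the construction: the eye itself lands on the camera-space origin, and the target lands on the negative z-axis at a depth equal to the eye-to-target distance, here the square root of 29.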
The view matrix V moves the world into the camera's frame.
- It maps world space to camera space: \vec{v}_{\text{cam}} = V\,\vec{v}_{\text{world}}.
- It is the inverse of the camera's world transform: V = M_{\text{cam}}^{-1} = (\,R^{\top},\, -R^{\top}\vec{e}\,).
- Look-at builds the camera basis from eye, target and an up-hint: \vec{f} = \widehat{\vec{c} - \vec{e}}, \vec{r} = \widehat{\vec{f} \times \vec{u}_{\text{hint}}}, \vec{u} = \vec{r} \times \vec{f}.
- Its rows are the basis vectors \vec{r}, \vec{u}, -\vec{f}, and it translates by -R^{\top}\vec{e}: rotate the world by the basis, then shift the eye to the origin.
It feels backwards because it is backwards. There is no privileged "camera object"
flying around the scene at render time; there is only the origin, and a fixed direction of gaze.
Every sense of motion you feel in a game is the world being transported into that fixed frame by
V. Strafe right, and -R^{\top}\vec{e} shifts
so the world streams left. Spin to face a new target, and R^{\top}
counter-rotates the world the other way.
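The strafing claim is easy to see with numbers. In this small sketch the camera keeps a fixed orientation (right along +x, up along +y, forward along -z, all assumed for the example) and only the eye moves; `world_to_camera` is a hypothetical helper that computes R^{\top}(\vec{p} - \vec{e}) row by row.

```python
def sub(a, b):
    return [a[i] - b[i] for i in range(3)]

def dot(a, b):
    return sum(a[i] * b[i] for i in range(3))

# Assumed fixed orientation: looking down world -z with world +y as up.
r = [1.0, 0.0, 0.0]
u = [0.0, 1.0, 0.0]
f = [0.0, 0.0, -1.0]

def world_to_camera(p, eye):
    """Camera-space coordinates of p: rows r, u, -f applied to (p - eye)."""
    d = sub(p, eye)
    return [dot(r, d), dot(u, d), -dot(f, d)]

point = [0.0, 0.0, -10.0]                            # ten units straight ahead
before = world_to_camera(point, [0.0, 0.0, 0.0])     # eye at the world origin
after = world_to_camera(point, [1.0, 0.0, 0.0])      # eye strafed one unit right
```

The point never moves in the world, yet its camera-space x coordinate drops from 0 to -1 while its depth stays at -10: the eye stepped right, so the scene slid left.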
This is also why V is the camera's inverse rather than its
transform. The camera transform M_{\text{cam}} would take something
sitting at the origin and place it where the camera is; rendering wants the reverse trip, taking
the world and re-coordinatising it relative to the camera. Undo the placement, and the scene falls
into the viewpoint's own grid.