3-D Transforms & Homogeneous Coordinates
A 3-D game or CAD model is a cloud of points in space — the corners of every wall, character and
tree. To animate the world we scale, rotate and slide those points, and we would love to do it the
way we did in 2-D: one matrix per move, multiplied together into a single matrix, then fired at
every vertex. There is one stubborn problem, and its solution is one of the most elegant tricks in
all of computer graphics.
The problem is translation. Scaling and rotating a 3-D point are honest linear
operations — a 3\times3 matrix handles them perfectly. But sliding a
point by (t_x, t_y, t_z) is not linear, because it moves the
origin: M\cdot\mathbf{0}=\mathbf{0} for every matrix
M, so no 3\times3 matrix can ever budge the
point (0,0,0). Translation refuses to be a matrix multiply. We hit
exactly this wall in 2-D too — and we beat it the same way.
The trick: append a 1
Write each 3-D point with a fourth coordinate equal to 1:
(x, y, z) \;\longrightarrow\; \begin{bmatrix}x\\y\\z\\1\end{bmatrix}.
The resulting four-vector holds the point's homogeneous coordinates, and the extra entry is
called w; for ordinary points w=1. Now use a 4\times4 matrix
whose last column holds the translation amounts, so that the constant 1
multiplies them straight into the answer:
T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}.
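Multiplying T by a general homogeneous point makes the mechanism visible. The line below is simply the product written out row by row, using the same symbols as above:

```latex
T\begin{bmatrix}x\\y\\z\\1\end{bmatrix}
= \begin{bmatrix} 1\cdot x + t_x\cdot 1 \\ 1\cdot y + t_y\cdot 1 \\ 1\cdot z + t_z\cdot 1 \\ 1 \end{bmatrix}
= \begin{bmatrix} x + t_x \\ y + t_y \\ z + t_z \\ 1 \end{bmatrix}.
```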
The forbidden operation — translation — has become an ordinary matrix multiply. And because scale
and rotation also fit inside a 4\times4 (their
3\times3 block sits in the top-left corner, with the last row and column
left as the identity), every 3-D transform now speaks the same language and can be chained
by multiplication. A scale by (s_x, s_y, s_z), for example, becomes
S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.
Worked example: multiply it out
Take the point (2, 5, 3), written homogeneously as
(2,5,3,1), and translate it by
(t_x,t_y,t_z) = (10, -2, 4). The matrix multiply is just four dot
products, one per row:
\begin{bmatrix} 1 & 0 & 0 & 10 \\ 0 & 1 & 0 & -2 \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}2\\5\\3\\1\end{bmatrix} = \begin{bmatrix} 1\cdot 2 + 10\cdot 1\\ 1\cdot 5 + (-2)\cdot 1\\ 1\cdot 3 + 4\cdot 1\\ 1 \end{bmatrix} = \begin{bmatrix}12\\3\\7\\1\end{bmatrix}.
The point lands at (12, 3, 7) — exactly
(2+10,\;5-2,\;3+4), the slide we asked for. Notice the bottom coordinate
stays 1, ready for the next matrix in the pipeline. That is the whole
magic: the constant 1 reaches up into the last column and drags the
translation into the sum.
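The same arithmetic can be checked in a few lines of code. This is a minimal sketch in plain Python, with matrices stored as lists of rows; the helper names translation and mat_vec are made up for illustration and do not come from any library.

```python
# Minimal sketch: apply a 4x4 translation to a homogeneous point.
# translation() and mat_vec() are illustrative helper names, not a library API.

def translation(tx, ty, tz):
    """Return the 4x4 translation matrix T as a list of rows."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector: one dot product per row."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

point = [2, 5, 3, 1]                              # (2, 5, 3) with w = 1 appended
moved = mat_vec(translation(10, -2, 4), point)
print(moved)                                      # [12, 3, 7, 1]
```

Applying the same matrix to the origin (0, 0, 0, 1) returns (10, -2, 4, 1), which is exactly the move a 3\times3 matrix could never make.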
One matrix for the whole move
Because scale, rotation and translation are now all 4\times4 matrices,
you can multiply them into a single combined matrix once — often written
M = T\,R\,S — and then transform an entire mesh of thousands of vertices
by that one matrix. Read it right to left: each vertex is scaled first, then
rotated, then translated into the world. This combined matrix is
the model matrix; an engine typically rebuilds it for each object whenever that object moves,
often every frame.
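A short sketch can confirm that the combined matrix really does the three steps in one go. It assumes column vectors (so the rightmost matrix acts first) and uses a counter-clockwise rotation about the z-axis as the example R; the helper names and the sample numbers are illustrative choices, not part of the lesson.

```python
import math

def mat_mul(a, b):
    """Multiply two 4x4 matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def translation(tx, ty, tz):
    return [[1, 0, 0, tx], [0, 1, 0, ty], [0, 0, 1, tz], [0, 0, 0, 1]]

def scale(sx, sy, sz):
    return [[sx, 0, 0, 0], [0, sy, 0, 0], [0, 0, sz, 0], [0, 0, 0, 1]]

def rotation_z(theta):
    """Counter-clockwise rotation about the z-axis (assumed example rotation)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

T = translation(10, -2, 4)
R = rotation_z(math.pi / 2)
S = scale(2, 2, 2)

M = mat_mul(T, mat_mul(R, S))        # model matrix T R S, built once

v = [1, 0, 0, 1]
step_by_step = mat_vec(T, mat_vec(R, mat_vec(S, v)))   # scale, rotate, translate
combined = mat_vec(M, v)                               # same result in one multiply

# Reversing the order gives a different transform: translate, rotate, then scale.
wrong_order = mat_vec(mat_mul(S, mat_mul(R, T)), v)

print(combined)      # approximately [10, 0, 4, 1]
print(wrong_order)   # approximately [4, 22, 8, 1]
```

The two printed results differ because matrix multiplication is not commutative, which is why the right-to-left reading order matters.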
Rotation in 3-D is richer than in 2-D: there is one 4\times4 rotation
matrix for spinning about the x-axis, one for
y, and one for z, and you combine them to spin
about any axis you like. But each is still just a 3\times3 rotation
tucked into the top-left corner of a 4\times4, with that faithful
1 in the bottom-right.
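As a rough illustration, the three axis rotations can be written as 3\times3 blocks and dropped into a 4\times4 identity. The sketch below assumes the common right-handed convention with counter-clockwise angles and column vectors; embed and the rotation_* names are hypothetical helpers chosen for this example.

```python
import math

def embed(r3):
    """Place a 3x3 block in the top-left corner of a 4x4 identity."""
    m = [[1 if i == j else 0 for j in range(4)] for i in range(4)]
    for i in range(3):
        for j in range(3):
            m[i][j] = r3[i][j]
    return m

def rotation_x(theta):
    c, s = math.cos(theta), math.sin(theta)
    return embed([[1, 0, 0], [0, c, -s], [0, s, c]])

def rotation_y(theta):
    c, s = math.cos(theta), math.sin(theta)
    return embed([[c, 0, s], [0, 1, 0], [-s, 0, c]])

def rotation_z(theta):
    c, s = math.cos(theta), math.sin(theta)
    return embed([[c, -s, 0], [s, c, 0], [0, 0, 1]])

def mat_mul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

quarter = math.pi / 2
x_axis = [1, 0, 0, 1]

# A quarter turn about z carries the x-axis onto the y-axis.
about_z = mat_vec(rotation_z(quarter), x_axis)          # approximately (0, 1, 0, 1)

# Chaining: rotate about z first, then about x (rightmost matrix acts first).
both = mat_vec(mat_mul(rotation_x(quarter), rotation_z(quarter)), x_axis)
# approximately (0, 0, 1, 1)
```

Every matrix built this way keeps its last row as (0, 0, 0, 1), so the w coordinate passes through unchanged.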
Move the cube
Figure: the faint wireframe is the original unit cube; the solid one is the cube after we scale
it by s and translate it by (t_x,t_y,t_z), that is, the
combined matrix T\,S applied to each of its eight corners.
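The cube's move can be sketched numerically as well. The code below assumes the unit cube has its corners at 0 and 1 along each axis (the lesson does not say where the cube sits) and writes the product T\,S directly for a uniform scale s; model_matrix is a hypothetical helper name.

```python
from itertools import product

def model_matrix(s, tx, ty, tz):
    """The product T*S for a uniform scale s followed by a translation."""
    return [
        [s, 0, 0, tx],
        [0, s, 0, ty],
        [0, 0, s, tz],
        [0, 0, 0, 1],
    ]

def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

# Eight corners of the unit cube, each with w = 1 appended.
corners = [[x, y, z, 1] for x, y, z in product((0, 1), repeat=3)]

M = model_matrix(2, 1, 0, -1)            # scale by 2, then translate by (1, 0, -1)
moved = [mat_vec(M, c) for c in corners]

print(moved[0])    # corner (0, 0, 0) -> [1, 0, -1, 1]
print(moved[-1])   # corner (1, 1, 1) -> [3, 2, 1, 1]
```

One matrix is built once and reused for all eight corners; a full mesh would be handled the same way, just with more vertices.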
A pure translate, scale or rotate keeps the last coordinate at w=1, so
it is tempting to think w is always 1 and can
be ignored. That is a trap. The whole point of carrying w is
that a later matrix, the perspective projection matrix, deliberately sets
w to something other than 1 (typically a value
proportional to the point's depth). You then recover the projected coordinates by dividing
x, y, z by
w, a step called the "perspective divide."
So the rule is: (x, y, z, w) means the actual point
\left(\tfrac{x}{w}, \tfrac{y}{w}, \tfrac{z}{w}\right). As long as
w=1 nothing changes, but assuming it can never be anything else will
break the very projection step you are building toward. Keep the whole four-vector; never throw
w away.
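A small sketch shows the divide in action. The projection matrix here is a deliberately simplified toy that just copies z into w; real projection matrices depend on the graphics API's conventions (field of view, near and far planes, axis directions), so treat it as an assumption made only to produce a w different from 1.

```python
def mat_vec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def perspective_divide(p):
    """Convert homogeneous (x, y, z, w) to the actual point (x/w, y/w, z/w)."""
    x, y, z, w = p
    if w == 0:
        raise ValueError("w = 0 does not correspond to a finite point")
    return (x / w, y / w, z / w)

# Toy projection (assumed for illustration): last row copies z into w.
toy_projection = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

clip = mat_vec(toy_projection, [2, 4, 2, 1])
print(clip)                                    # [2, 4, 2, 2]  (w is no longer 1)
print(perspective_divide(clip))                # (1.0, 2.0, 1.0)
print(perspective_divide([12, 3, 7, 1]))       # (12.0, 3.0, 7.0), unchanged when w = 1
```

Points twice as deep end up with twice the w and therefore half the x and y after the divide, which is the foreshortening that perspective is meant to produce.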
The payoff of all this uniformity is speed. Once translation, rotation, scale and perspective all
live in the same 4\times4 shape, graphics hardware can be organised around
one core operation: multiplying a 4\times4 matrix by a 4-vector, over and
over, massively in parallel. Millions of vertices per frame stream through identical
4\times4 multiplies. If translation had stayed a stubborn "add" that
didn't fit the matrix mould, every vertex would need a special case, and the clean pipeline of
multiply-multiply-multiply would fall apart. The humble appended 1 is
what lets the whole vertex pipeline be built around a single operation.