3-D Transforms & Homogeneous Coordinates

A 3-D game or CAD model is a cloud of points in space — the corners of every wall, character and tree. To animate the world we scale, rotate and slide those points, and we would love to do it the way we did in 2-D: one matrix per move, multiplied together into a single matrix, then fired at every vertex. There is one stubborn problem, and its solution is one of the most elegant tricks in all of computer graphics.

The problem is translation. Scaling and rotating a 3-D point are honest linear operations — a 3\times3 matrix handles them perfectly. But sliding a point by (t_x, t_y, t_z) is not linear, because it moves the origin: M\cdot\mathbf{0}=\mathbf{0} for every matrix M, so no 3\times3 matrix can ever budge the point (0,0,0). Translation refuses to be a matrix multiply. We hit exactly this wall in 2-D too — and we beat it the same way.
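To see the obstacle concretely, here is a minimal Python sketch. The helper name `mat3_vec` and the three sample matrices are illustrative choices, not part of any library. Whichever 3\times3 matrix we try, the origin comes back unchanged:

```python
def mat3_vec(m, v):
    """Multiply a 3x3 matrix (given as a list of rows) by a 3-vector."""
    return [sum(m[r][c] * v[c] for c in range(3)) for r in range(3)]

origin = [0, 0, 0]

# Three arbitrary 3x3 matrices: a scale, a quarter turn about z, and a shear.
samples = [
    [[2, 0, 0], [0, 3, 0], [0, 0, 4]],
    [[0, -1, 0], [1, 0, 0], [0, 0, 1]],
    [[1, 5, -7], [0, 1, 9], [0, 0, 1]],
]

for m in samples:
    # Every term of every dot product is (entry * 0), so the result is all zeros.
    print(mat3_vec(m, origin))  # [0, 0, 0] each time
```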

The trick: append a 1

Write each 3-D point with a fourth coordinate equal to 1:

(x, y, z) \;\longrightarrow\; \begin{bmatrix}x\\y\\z\\1\end{bmatrix}.

The extra coordinate is conventionally called w, and the four-component form (x, y, z, w) is the point's homogeneous coordinates; here w=1. Now use a 4\times4 matrix. Watch what the extra row and column let us do: the translation amounts sit in the last column, and the constant 1 multiplies them straight into the answer:

T = \begin{bmatrix} 1 & 0 & 0 & t_x \\ 0 & 1 & 0 & t_y \\ 0 & 0 & 1 & t_z \\ 0 & 0 & 0 & 1 \end{bmatrix}.

The forbidden operation, translation, has become an ordinary matrix multiply. And because scale and rotation also fit inside a 4\times4 (their 3\times3 block sits in the top-left corner, with the last row and column left as the identity), every 3-D transform now speaks the same language and can be chained by multiplication. A scale by (s_x, s_y, s_z), for example, becomes:

S = \begin{bmatrix} s_x & 0 & 0 & 0 \\ 0 & s_y & 0 & 0 \\ 0 & 0 & s_z & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}.
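The two matrices T and S translate almost literally into code. The sketch below uses plain nested lists in row-major order; the function names `translation`, `scaling` and `to_homogeneous` are assumptions made for illustration, and a real project would more likely reach for a maths library.

```python
def translation(tx, ty, tz):
    """4x4 homogeneous translation: identity with the offsets in the last column."""
    return [
        [1, 0, 0, tx],
        [0, 1, 0, ty],
        [0, 0, 1, tz],
        [0, 0, 0, 1],
    ]

def scaling(sx, sy, sz):
    """4x4 homogeneous scale: the 3x3 diagonal block, padded with the identity."""
    return [
        [sx, 0, 0, 0],
        [0, sy, 0, 0],
        [0, 0, sz, 0],
        [0, 0, 0, 1],
    ]

def to_homogeneous(x, y, z):
    """Append w = 1 to an ordinary 3-D point."""
    return [x, y, z, 1]
```

Both builders leave the bottom row as (0, 0, 0, 1), which is what keeps w equal to 1 after the multiply.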

Worked example: multiply it out

Take the point (2, 5, 3), written homogeneously as (2,5,3,1), and translate it by (t_x,t_y,t_z) = (10, -2, 4). The matrix multiply is just four dot products, one per row:

\begin{bmatrix} 1 & 0 & 0 & 10 \\ 0 & 1 & 0 & -2 \\ 0 & 0 & 1 & 4 \\ 0 & 0 & 0 & 1 \end{bmatrix}\begin{bmatrix}2\\5\\3\\1\end{bmatrix} = \begin{bmatrix} 1\cdot 2 + 10\cdot 1\\ 1\cdot 5 + (-2)\cdot 1\\ 1\cdot 3 + 4\cdot 1\\ 1 \end{bmatrix} = \begin{bmatrix}12\\3\\7\\1\end{bmatrix}.

The point lands at (12, 3, 7) — exactly (2+10,\;5-2,\;3+4), the slide we asked for. Notice the bottom coordinate stays 1, ready for the next matrix in the pipeline. That is the whole magic: the constant 1 reaches up into the last column and drags the translation into the sum.
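The same arithmetic can be checked in a few lines of Python. This is only a sketch: `mat_vec` is a hypothetical helper written for this example, doing one dot product per row exactly as in the hand calculation.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector: one dot product per row."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Translation by (10, -2, 4), as in the worked example.
T = [
    [1, 0, 0, 10],
    [0, 1, 0, -2],
    [0, 0, 1, 4],
    [0, 0, 0, 1],
]

p = [2, 5, 3, 1]   # the point (2, 5, 3) with w = 1 appended
moved = mat_vec(T, p)
print(moved)       # [12, 3, 7, 1]
```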

One matrix for the whole move

Because scale, rotation and translation are now all 4\times4 matrices, you can multiply them into a single combined matrix once, often written M = T\,R\,S, and then transform an entire mesh of thousands of vertices by that one matrix. Read it right to left: each vertex is scaled first, then rotated, then translated into the world. The order matters, because matrix multiplication is not commutative: S\,R\,T would translate first, then rotate and scale that offset along with the shape. This combined matrix is commonly called the model matrix, and engines typically rebuild it whenever an object moves, often once per object per frame.
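A small sketch of the composition, assuming a uniform scale by 2, a quarter turn about the z-axis and a slide of 10 along x (all three chosen only so the numbers stay whole). The helpers `mat_mul` and `mat_vec` are hypothetical names for this example.

```python
def mat_mul(a, b):
    """Product of two 4x4 matrices given as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

S = [[2, 0, 0, 0], [0, 2, 0, 0], [0, 0, 2, 0], [0, 0, 0, 1]]    # scale by 2
R = [[0, -1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # 90 degrees about z
T = [[1, 0, 0, 10], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]   # slide +10 in x

M = mat_mul(T, mat_mul(R, S))   # built once, reused for every vertex
v = [1, 0, 0, 1]

one_shot = mat_vec(M, v)
step_by_step = mat_vec(T, mat_vec(R, mat_vec(S, v)))
print(one_shot)       # [10, 2, 0, 1]
print(step_by_step)   # [10, 2, 0, 1]

# Reversing the order gives a different result: multiplication is not commutative.
wrong_order = mat_vec(mat_mul(S, mat_mul(R, T)), v)
print(wrong_order)    # [0, 22, 0, 1]
```

The combined matrix and the three separate multiplies agree, which is what makes it safe to precompute M once per object rather than once per vertex.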

Rotation in 3-D is richer than in 2-D: there is one 4\times4 rotation matrix for spinning about the x-axis, one for y, and one for z, and you combine them to spin about any axis you like. But each is still just a 3\times3 rotation tucked into the top-left corner of a 4\times4, with that faithful 1 in the bottom-right.
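As a sketch, the three axis rotations might be written as follows. The sign pattern assumes a right-handed coordinate system with counter-clockwise positive angles, which is a convention chosen for this example and not something the text fixes; the function names are likewise illustrative.

```python
import math

def rotation_x(theta):
    """4x4 rotation by theta radians about the x-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[1, 0, 0, 0], [0, c, -s, 0], [0, s, c, 0], [0, 0, 0, 1]]

def rotation_y(theta):
    """4x4 rotation by theta radians about the y-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, 0, s, 0], [0, 1, 0, 0], [-s, 0, c, 0], [0, 0, 0, 1]]

def rotation_z(theta):
    """4x4 rotation by theta radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0, 0], [s, c, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1]]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

quarter_turn = math.pi / 2
p = mat_vec(rotation_z(quarter_turn), [1.0, 0.0, 0.0, 1.0])
# The x-axis direction swings onto the y-axis: close to (0, 1, 0, 1),
# up to floating-point rounding in cos(pi/2).
print(p)
```

In each function the 3\times3 rotation occupies the top-left block, and the last row and column are untouched.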

Move the cube

In the interactive figure, the faint wireframe is the original unit cube; the solid one is the cube after we scale it by s and translate it by (t_x,t_y,t_z), that is, the combined matrix T\,S applied to each of its eight corners.
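The cube transform can be sketched without any graphics at all. The code below assumes the unit cube has one corner at the origin and the opposite corner at (1, 1, 1), and it picks s = 2 and a translation of (1, 0, -1) purely as sample values; `translate_scale` and `mat_vec` are hypothetical helper names.

```python
from itertools import product

def translate_scale(s, tx, ty, tz):
    """The product T*S for a uniform scale s followed by a translation."""
    return [
        [s, 0, 0, tx],
        [0, s, 0, ty],
        [0, 0, s, tz],
        [0, 0, 0, 1],
    ]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Eight corners of the unit cube, each with w = 1 appended.
corners = [[x, y, z, 1] for x, y, z in product((0, 1), repeat=3)]

M = translate_scale(2, 1, 0, -1)
moved = [mat_vec(M, corner) for corner in corners]

for before, after in zip(corners, moved):
    print(before[:3], "->", after[:3])
```

Each corner (x, y, z) lands at (s x + t_x, s y + t_y, s z + t_z), so the corner at the origin simply moves to the translation offset.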

A pure translate, scale or rotate keeps the last coordinate at w=1, so it is tempting to think w is always 1 and can be ignored. That is a trap. The whole point of carrying w is that a later matrix, the perspective projection matrix, deliberately sets w to something other than 1 (typically a value proportional to the point's depth). You then recover the on-screen position by dividing x, y and z by w, a step called the "perspective divide."

So the rule is: (x, y, z, w) means the actual point \left(\tfrac{x}{w}, \tfrac{y}{w}, \tfrac{z}{w}\right). As long as w=1 nothing changes, but assuming it can never be anything else will break the very projection step you are building toward. Keep the whole four-vector; never throw w away.
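To make the divide concrete, here is a deliberately simplified sketch. The matrix P below is a toy projection invented for this example: it is the identity except that its last row copies z into w. Real projection matrices (with field of view, near and far planes) are more involved, but they rely on the same divide.

```python
def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

# Toy projection (an assumption for illustration): the last row sets w = z.
P = [
    [1, 0, 0, 0],
    [0, 1, 0, 0],
    [0, 0, 1, 0],
    [0, 0, 1, 0],
]

def perspective_divide(v):
    """Turn (x, y, z, w) into the actual point (x/w, y/w, z/w)."""
    x, y, z, w = v
    if w == 0:
        raise ValueError("w is 0: there is no finite point to recover")
    return [x / w, y / w, z / w]

near = mat_vec(P, [4, 2, 2, 1])   # w becomes 2
far = mat_vec(P, [4, 2, 4, 1])    # same x and y, twice the depth: w becomes 4

print(perspective_divide(near))   # [2.0, 1.0, 1.0]
print(perspective_divide(far))    # [1.0, 0.5, 1.0]
```

The two points share the same x and y before projection, yet the deeper one ends up closer to the centre after the divide. That shrinking with distance is the perspective effect, and it only appears if w is carried through and divided out.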

Why go to the trouble of 4\times4 matrices? Because uniformity is speed. Once translation, rotation, scale and perspective all live in the same 4\times4 shape, a GPU can run one operation, a 4\times4 matrix times a 4-vector, over and over, in massive parallel. Millions of vertices per frame stream through identical 4\times4 multiplies. If translation had stayed a stubborn "add" that didn't fit the matrix mould, every vertex would need a special case, and the beautiful pipeline of multiply-multiply-multiply would fall apart. The humble appended 1 is what lets graphics hardware be organised around a single operation.