From the matrix to the pixel, line by line
Step 1 — leave the matrix in clip space. After the projection matrix
P, a camera-space vertex
(x, y, z, 1) becomes a 4-vector in what we call clip space:
\begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} = P \begin{pmatrix} x \\ y \\ z \\ 1 \end{pmatrix}.
Step 2 — notice what w_c is. The clever bottom row of a
perspective P, namely (0, 0, -1, 0), copies the negated view-space z into the fourth
coordinate. With a camera looking down its own -z axis, points in front have
z < 0, so
w_c = -z \;>\; 0.
The further away the vertex, the larger w_c. Hold that thought — it is the
whole trick.
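To make the bottom-row trick concrete, here is a minimal sketch that builds one common form of the perspective matrix (the OpenGL-style convention, which matches the -w_c to w_c clip range used in this walkthrough) and prints w_c for a few depths. The helper names `perspective` and `mat_vec`, and the field of view, aspect ratio, near and far values, are illustrative assumptions, not something the text prescribes.

```python
import math

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective matrix (row-major), camera looking down -z.
    Assumed convention; other APIs use different depth ranges."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],  # bottom row: w_c = -z
    ]

def mat_vec(m, v):
    """Multiply a 4x4 matrix by a 4-vector."""
    return [sum(m[r][c] * v[c] for c in range(4)) for r in range(4)]

P = perspective(90.0, 1.0, 0.1, 100.0)  # illustrative parameters
for z in (-1.0, -5.0, -50.0):
    x_c, y_c, z_c, w_c = mat_vec(P, [1.0, 1.0, z, 1.0])
    print(f"z = {z:6.1f}  ->  w_c = {w_c:5.1f}")
```

Whatever the other three rows contain, the last row alone decides w_c, so the printed w_c is simply the distance in front of the camera.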
Step 3 — clip the triangle against the cube, now. Before any division, the GPU
throws away (or trims) geometry that falls outside the canonical view volume. In clip space the test is
a cheap pair of inequalities on each coordinate,
-w_c \le x_c \le w_c, \qquad -w_c \le y_c \le w_c, \qquad -w_c \le z_c \le w_c.
No square roots, no division — just comparisons against \pm w_c. That is
exactly why this stage is named clip space, and why clipping happens here.
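The containment test can be sketched in a few lines. The function name `inside_clip_volume` is hypothetical, and real hardware trims triangles against these planes rather than merely classifying single points, but the comparisons are the same ones written above.

```python
def inside_clip_volume(x_c, y_c, z_c, w_c):
    """True if a clip-space point satisfies all six inequalities.
    Only comparisons against +/- w_c, no division."""
    return (-w_c <= x_c <= w_c
            and -w_c <= y_c <= w_c
            and -w_c <= z_c <= w_c)

print(inside_clip_volume(0.5, -0.2, 1.0, 2.0))   # inside: every |coord| <= w_c
print(inside_clip_volume(3.0, 0.0, 1.0, 2.0))    # outside: x_c > w_c
print(inside_clip_volume(0.0, 0.0, 0.5, -1.0))   # outside: negative w_c fails the test
```

Note that a negative w_c can never pass, because -w_c <= z_c <= w_c has no solution when -w_c is greater than w_c.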
Step 4 — perform the perspective divide. Divide every component by
w_c. This lands the vertex in normalised device coordinates
(NDC):
\begin{pmatrix} x_n \\ y_n \\ z_n \end{pmatrix} = \begin{pmatrix} x_c / w_c \\ y_c / w_c \\ z_c / w_c \end{pmatrix}.
Step 5 — read off the consequence. Because everything in view satisfied the Step 3
inequalities, dividing by w_c squeezes the whole visible world into the tidy
cube
-1 \le x_n \le 1, \qquad -1 \le y_n \le 1, \qquad -1 \le z_n \le 1.
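As a small illustration of Steps 4 and 5, the sketch below divides a clip-space point by w_c and checks that a point which passed the clip inequalities lands inside the cube. The function name `perspective_divide` and the sample point are assumptions made for the example; the guard on w_c stands in for the clipping that would normally have happened already.

```python
def perspective_divide(clip):
    """Map a clip-space 4-tuple to NDC by dividing by w_c.
    Assumes the point already survived clipping, so w_c > 0."""
    x_c, y_c, z_c, w_c = clip
    if w_c <= 0.0:
        raise ValueError("point should have been clipped before the divide")
    return (x_c / w_c, y_c / w_c, z_c / w_c)

clip_point = (1.0, -2.0, 3.0, 4.0)      # satisfies -w_c <= x_c, y_c, z_c <= w_c
ndc = perspective_divide(clip_point)
print(ndc)                               # (0.25, -0.5, 0.75)
print(all(-1.0 <= c <= 1.0 for c in ndc))
```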
Step 6 — see why distant things shrink. Two vertices at the same clip-space offset
x_c but different depths get divided by different
w_c = -z. The far one (big w_c) is divided by more,
so its x_n is pulled closer to the centre:
x_n = \frac{x_c}{w_c} = \frac{x_c}{-z}.
Double the distance, halve the on-screen size. That single division by w_c = -z,
not the matrix, is what makes parallel rails appear to meet at the horizon. Perspective is a fraction.
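The halving can be checked numerically. This is a minimal sketch that assumes a unit horizontal scale in the projection (so x_c equals the view-space x); the function name `ndc_x` and the rail positions are invented for the example.

```python
def ndc_x(x_view, z_view):
    """x_n = x_c / w_c, assuming x_c = x_view (unit scale) and w_c = -z_view."""
    x_c = x_view
    w_c = -z_view
    return x_c / w_c

# Two parallel rails at x = -1 and x = +1, sampled at increasing distance.
for z in (-2.0, -4.0, -8.0, -16.0):
    left, right = ndc_x(-1.0, z), ndc_x(1.0, z)
    print(f"distance {-z:5.1f}: rails at x_n = {left:+.4f} and {right:+.4f}")
```

Each doubling of distance halves the gap between the rails, so the two lines head toward x_n = 0 without ever being anything but parallel in view space.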
After the projection matrix, every vertex passes through two fixed-function steps:
- Clip space is pre-divide and four-dimensional. The output of P is (x_c, y_c, z_c, w_c) with w_c = -z, the view-space depth.
- Clipping happens here, against the cube. Triangles are clipped to -w_c \le x_c, y_c, z_c \le w_c: cheap comparisons, before any division.
- The perspective divide lands you in NDC. Dividing by w_c maps the visible volume into the cube [-1, 1]^3.
- The division by w is the perspective. Since x_n = x_c / w_c with w_c = -z, far things (large w_c) shrink toward the centre.
It is tempting to imagine dividing first and clipping the neat cube afterward. That order is a disaster.
A vertex behind the camera has z > 0, so
w_c = -z < 0; one exactly on the camera plane gives
w_c = 0. Dividing by a negative w_c flips a point
to the opposite side of the screen, and dividing by 0 is undefined — a
triangle straddling the camera would tear into nonsense.
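Both hazards are easy to reproduce. The sketch below deliberately divides without clipping, again assuming a unit horizontal scale so that x_c equals the view-space x; `naive_ndc_x` is a hypothetical name for the wrong-order approach, not something a real pipeline does.

```python
def naive_ndc_x(x_view, z_view):
    """Divide immediately, with no clipping (the order warned against above).
    Assumes x_c = x_view and w_c = -z_view."""
    x_c = x_view
    w_c = -z_view
    return x_c / w_c

in_front = naive_ndc_x(1.0, -2.0)   # point to the right, in front of the camera
behind = naive_ndc_x(1.0, 2.0)      # same side, but behind the camera
print(in_front, behind)             # 0.5 -0.5: the second point flipped to the left

try:
    naive_ndc_x(1.0, 0.0)           # exactly on the camera plane
except ZeroDivisionError:
    print("w_c = 0: the divide is undefined")
```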
Clipping in clip space sidesteps both hazards. The inequality -w_c \le z_c \le w_c
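implicitly requires w_c \ge 0, and for the standard perspective matrix its lower bound z_c \ge -w_c holds only
where w_c is at least the near distance, so anything with w_c \le 0 is trimmed away before the divide ever sees it. The near
plane of the view frustum is precisely the guard rail that keeps w_c safely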
positive. That is the deep reason the pipeline carries a fourth coordinate all the way to the very last
moment instead of dividing the instant the matrix is done.
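To see the guard rail at work, here is a sketch of clipping a single segment against the near plane while still in clip space. The function names are hypothetical, and the two endpoints are assumed to come from an OpenGL-style matrix with near = 1 and far = 3 applied to view-space points at z = -2 (in front) and z = +1 (behind). Real clippers handle whole triangles and all six planes; this shows one plane only.

```python
def near_side(p):
    """z_c + w_c: non-negative exactly when the point satisfies z_c >= -w_c."""
    x_c, y_c, z_c, w_c = p
    return z_c + w_c

def clip_segment_to_near(a, b):
    """Clip segment a-b (clip-space 4-tuples) against z_c >= -w_c.
    Returns the surviving segment, or None if nothing survives."""
    da, db = near_side(a), near_side(b)
    if da < 0 and db < 0:
        return None                      # entirely behind the near plane
    if da >= 0 and db >= 0:
        return a, b                      # entirely in front, nothing to do
    t = da / (da - db)                   # where the segment crosses the plane
    hit = tuple(pa + t * (pb - pa) for pa, pb in zip(a, b))
    return (a, hit) if da >= 0 else (hit, b)

front = (0.0, 0.0, 1.0, 2.0)     # w_c = 2 > 0
behind = (0.0, 0.0, -5.0, -1.0)  # w_c = -1 < 0, unsafe to divide
kept = clip_segment_to_near(front, behind)
print(kept)                       # new endpoint has w_c close to 1, the near distance
```

The interpolation is done on all four coordinates before any division, so the replacement endpoint sits on the near plane with a positive w_c and the later divide is safe.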