Raycasting & Picking

A player clicks the screen. They mean “that goblin” — but all the engine receives is a 2-D pixel, (x_{\text{px}}, y_{\text{px}}), with no depth at all. Picking is the art of turning that flat click into a world-space selection. The trick is to run the rendering MVP pipeline backwards: un-project the pixel into a 3-D ray, then ask the world what that ray hits.

Forward, the pipeline flattens 3-D world points down to 2-D pixels and throws depth away. We can't un-flatten a single point — a pixel corresponds to a whole line of world points stacked behind it, every one of which renders to the same spot. That line, from the camera out through the pixel, is the answer: the pick ray.

From pixel to ray, line by line

Step 1 — lift the pixel to normalised device coordinates. The MVP pipeline's output lives in the [-1,1]^3 NDC cube before the viewport transform spreads it across the screen. Invert that viewport map to put the pixel back into NDC:

x_{\text{ndc}} = \frac{2x_{\text{px}}}{W} - 1, \qquad y_{\text{ndc}} = 1 - \frac{2y_{\text{px}}}{H}.

(The y flips because pixel rows count down from the top while NDC y points up.)
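As a minimal sketch of Step 1, the function below applies the two formulas directly. The name pixel_to_ndc and its parameter names are illustrative rather than taken from any particular engine, and the pixel origin is assumed to be the top-left corner of the window.

```python
def pixel_to_ndc(x_px, y_px, width, height):
    """Map a pixel position to NDC x, y in [-1, 1], with NDC y pointing up.

    Assumes pixel (0, 0) is the top-left corner of a width x height window.
    """
    x_ndc = 2.0 * x_px / width - 1.0
    y_ndc = 1.0 - 2.0 * y_px / height  # flip: pixel rows count downward
    return x_ndc, y_ndc


# Centre of an 800x600 window lands on the NDC origin.
print(pixel_to_ndc(400, 300, 800, 600))  # (0.0, 0.0)
# The top-left pixel corner maps to the top-left of the NDC square.
print(pixel_to_ndc(0, 0, 800, 600))      # (-1.0, 1.0)
```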

Step 2 — choose two depths to pin the line. A pixel is one (x_{\text{ndc}}, y_{\text{ndc}}) but a ray needs two points. Take the same pixel at the near plane and the far plane — NDC depths z=-1 and z=+1 (the OpenGL convention):

\vec p_{\text{near}} = (x_{\text{ndc}}, y_{\text{ndc}}, -1), \qquad \vec p_{\text{far}} = (x_{\text{ndc}}, y_{\text{ndc}}, +1).

These two NDC points sit on the same screen pixel but at opposite ends of the view frustum — so the world line through them is exactly the line the camera sees through that pixel.

Step 3 — un-project both with the inverse pipeline. Apply the inverse of the projection×view matrix (then divide out the homogeneous w, undoing the perspective divide) to carry each point back into world space:

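\vec P_{\text{near}} = \frac{(PV)^{-1}\,\tilde p_{\text{near}}}{w_{\text{near}}}, \qquad \vec P_{\text{far}} = \frac{(PV)^{-1}\,\tilde p_{\text{far}}}{w_{\text{far}}}.

Here \tilde p = (x_{\text{ndc}}, y_{\text{ndc}}, z, 1) is the NDC point in homogeneous form, PV is the projection×view product (column-vector convention), and w is the fourth component of the un-projected vector. Equivalently, apply P^{-1} then V^{-1}: un-projection first, then the inverse view (camera-to-world) transform. Either way you now hold two world-space points on the click's sight-line.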
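The un-projection can be sketched in plain Python as follows. Everything here is an assumption made for illustration: the helper names (mat_mul, invert4, perspective, unproject), the OpenGL-style perspective matrix with a 90° vertical field of view, and a camera placed at (0, 0, 5) looking down the negative z axis. A real engine would normally use its own matrix library rather than a hand-written inverse.

```python
import math

def mat_mul(a, b):
    """4x4 matrix product a @ b (row-major nested lists)."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]

def mat_vec(m, v):
    """4x4 matrix times a 4-component column vector."""
    return [sum(m[i][k] * v[k] for k in range(4)) for i in range(4)]

def invert4(m):
    """Gauss-Jordan inverse with partial pivoting; raises if singular."""
    aug = [list(m[i]) + [1.0 if i == j else 0.0 for j in range(4)]
           for i in range(4)]
    for col in range(4):
        pivot = max(range(col, 4), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [x / scale for x in aug[col]]
        for r in range(4):
            if r != col:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[4:] for row in aug]

def perspective(fov_y_deg, aspect, near, far):
    """OpenGL-style perspective matrix (NDC z in [-1, 1], column vectors)."""
    f = 1.0 / math.tan(math.radians(fov_y_deg) / 2.0)
    return [
        [f / aspect, 0.0, 0.0, 0.0],
        [0.0, f, 0.0, 0.0],
        [0.0, 0.0, (far + near) / (near - far), 2.0 * far * near / (near - far)],
        [0.0, 0.0, -1.0, 0.0],
    ]

def unproject(ndc, inv_pv):
    """Carry an NDC point back to world space, dividing out homogeneous w."""
    x, y, z, w = mat_vec(inv_pv, [ndc[0], ndc[1], ndc[2], 1.0])
    return (x / w, y / w, z / w)

# Assumed camera: at world (0, 0, 5), looking down -z, so the view matrix
# is a translation by (0, 0, -5).
view = [[1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, -5.0],
        [0.0, 0.0, 0.0, 1.0]]
proj = perspective(90.0, 1.0, 0.1, 100.0)
inv_pv = invert4(mat_mul(proj, view))  # (PV)^-1

p_near = unproject((0.0, 0.0, -1.0), inv_pv)  # approximately (0, 0, 4.9)
p_far = unproject((0.0, 0.0, 1.0), inv_pv)    # approximately (0, 0, -95)
print(p_near, p_far)
```

With the near plane at 0.1 and the far plane at 100, the centre pixel un-projects to points 0.1 and 100 units in front of the camera, which is what the two approximate values reflect.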

Step 4 — assemble the pick ray. The ray starts at the near point, which lies on the near plane just in front of the camera, and aims toward the far point:

\vec r(t) = \vec P_{\text{near}} + t\,\hat d, \qquad \hat d = \frac{\vec P_{\text{far}} - \vec P_{\text{near}}}{\lVert \vec P_{\text{far}} - \vec P_{\text{near}}\rVert}.

This is the camera-through-pixel ray: the unique world line that the chosen pixel looks down.
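A small sketch of the ray assembly, assuming the two world-space points are already available as 3-tuples. The names make_ray and point_at are hypothetical helpers, not part of any specific engine API.

```python
import math

def make_ray(p_near, p_far):
    """Return (origin, unit direction) for the ray from p_near toward p_far."""
    delta = [f - n for n, f in zip(p_near, p_far)]
    length = math.sqrt(sum(c * c for c in delta))
    if length == 0.0:
        raise ValueError("near and far points coincide; no direction")
    direction = tuple(c / length for c in delta)
    return tuple(p_near), direction

def point_at(origin, direction, t):
    """Evaluate r(t) = origin + t * direction."""
    return tuple(o + t * d for o, d in zip(origin, direction))

# Example values: a near point at z = 4.9 and a far point at z = -95.
origin, direction = make_ray((0.0, 0.0, 4.9), (0.0, 0.0, -95.0))
print(direction)  # (0.0, 0.0, -1.0)
```

Because the direction is normalised, the parameter t measures world-space distance from the near point, which makes hit distances directly comparable between objects.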

Step 5 — intersect the scene and take the nearest hit. Now hand the ray to the intersection tests you already have — ray–sphere for bounding spheres, ray–triangle for meshes — collecting every intersection parameter t_i > 0. The object the player meant is the one in front:

t^\star = \min\{\,t_i : t_i > 0\,\}, \qquad \text{picked} = \text{object}(t^\star).

Smallest positive t wins because nearer objects occlude farther ones — exactly what the eye expects. That single rule is the whole of mouse selection and hitscan shooting.
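The nearest-hit rule can be sketched with bounding spheres as below. The scene, the object names, and the helper names ray_sphere and pick are made up for illustration, and the ray direction is assumed to be unit length.

```python
import math

def ray_sphere(origin, direction, centre, radius):
    """Smallest t > 0 at which the ray meets the sphere, or None on a miss.

    Assumes direction is unit length, so the quadratic's leading term is 1.
    """
    oc = [o - c for o, c in zip(origin, centre)]
    b = sum(d * e for d, e in zip(direction, oc))
    c = sum(e * e for e in oc) - radius * radius
    disc = b * b - c
    if disc < 0.0:
        return None  # the line misses the sphere entirely
    root = math.sqrt(disc)
    for t in (-b - root, -b + root):  # entry point first, then exit point
        if t > 0.0:
            return t
    return None  # both intersections lie behind the ray origin

def pick(origin, direction, spheres):
    """Return (name, t) for the nearest sphere hit, or None if nothing is hit."""
    best = None
    for name, centre, radius in spheres:
        t = ray_sphere(origin, direction, centre, radius)
        if t is not None and (best is None or t < best[1]):
            best = (name, t)
    return best

# Hypothetical scene: (name, centre, radius).
scene = [
    ("troll", (0.0, 0.0, -10.0), 2.0),   # in front, farther away
    ("goblin", (0.0, 0.0, 0.0), 1.0),    # in front, nearer
    ("bat", (0.0, 0.0, 10.0), 1.0),      # behind the ray origin
]
print(pick((0.0, 0.0, 5.0), (0.0, 0.0, -1.0), scene))  # ('goblin', 4.0)
```

The bat is never returned even though the infinite line passes through it: both of its intersection parameters are negative, so the t > 0 filter discards it.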

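To turn a clicked pixel into a world-space selection: map the pixel to NDC, take that NDC position at the near and far depths, un-project both points into world space, build the ray through them, and pick the object with the smallest positive t.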

Once you can cast a ray from a pixel, a whole family of features falls out of the one mechanism. A hitscan weapon — a rifle, a laser — fires not a projectile but a ray straight down the crosshair (the centre pixel), takes the nearest hit, and applies damage there the same frame: no travel time, no drop. Mouse selection in an RTS or an inventory casts from wherever the cursor sits and picks the unit under it.

Level editors lean on the inverse direction hardest of all: “drop this prop where I clicked on the floor” un-projects the cursor to a ray and intersects it with the ground plane, converting a 2-D click into a 3-D placement. Drag-to-move, the terrain brush, the gizmo you grab to rotate an object — all of it is cursor-to-world un-projection. The pick ray is the bridge every time the player or the designer reaches into the screen.
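For the prop-placement case, a rough sketch of the ray and ground-plane intersection might look like this. It assumes a y-up world with a horizontal floor at a fixed height. The function name ray_ground_hit and the ground_y parameter are illustrative only.

```python
def ray_ground_hit(origin, direction, ground_y=0.0):
    """World point where the ray meets the plane y = ground_y, or None.

    Assumes y is the up axis. Returns None when the ray is parallel to the
    floor or when the intersection would lie behind the ray origin.
    """
    dy = direction[1]
    if abs(dy) < 1e-9:
        return None  # ray runs parallel to the ground
    t = (ground_y - origin[1]) / dy
    if t <= 0.0:
        return None  # the floor is behind the ray
    return tuple(o + t * d for o, d in zip(origin, direction))

# A ray starting 10 units above the floor and angled downward.
hit = ray_ground_hit((0.0, 10.0, 0.0), (0.6, -0.8, 0.0))
print(hit)  # lands on the floor roughly 7.5 units along x

# A ray pointing upward never reaches the floor.
print(ray_ground_hit((0.0, 10.0, 0.0), (0.0, 1.0, 0.0)))  # None
```

The returned point is where the editor would place the prop. Terrain that is not flat would need a ray against heightfield or mesh test instead of this single plane.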