Convolutional Networks

A neural network that treats an image as one long list of pixels is in trouble. A modest 200 \times 200 colour image is 120{,}000 numbers; a fully-connected first layer of a thousand neurons would need 120 million weights — and it would have thrown away the single most important fact about an image: that nearby pixels belong together, and a feature like an edge looks the same wherever it appears.

A convolutional neural network (CNN) is built around that fact. Instead of connecting every pixel to every neuron, it slides a tiny filter across the image, looking for one small pattern everywhere at once.

Convolution: one small filter, slid everywhere

A filter (or kernel) is a little grid of weights — often just 3 \times 3. You slide it over every position of the image, and at each spot compute a dot product of the filter with the patch beneath it. The grid of results is a feature map that lights up wherever the filter's pattern occurs. A filter like \begin{bmatrix}-1&0&1\\-1&0&1\\-1&0&1\end{bmatrix} responds strongly at a vertical edge — dark on the left, bright on the right — and stays quiet on flat regions. Run it on a little image with an edge down the middle:

type Grid = number[][]; // A 5×5 "image": dark (0) on the left, bright (1) on the right — a vertical edge. const image: Grid = [ [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], ]; // A vertical-edge filter: (right column) minus (left column). const filter: Grid = [ [-1, 0, 1], [-1, 0, 1], [-1, 0, 1], ]; function convolve(img: Grid, k: Grid): Grid { const out: Grid = []; for (let i = 0; i + 3 <= img.length; i++) { const row: number[] = []; for (let j = 0; j + 3 <= img[0].length; j++) { let sum = 0; for (let di = 0; di < 3; di++) for (let dj = 0; dj < 3; dj++) sum += img[i + di][j + dj] * k[di][dj]; row.push(sum); } out.push(row); } return out; } // The feature map lights up (big numbers) exactly where the edge is, and is 0 on flat areas. for (const row of convolve(image, filter)) console.log(row.join(" "));

The output is large at the edge and zero over the uniform bright region — the filter has found the edge, and it would find one anywhere in the image using the exact same nine weights.

Why this is such a good idea

In 2012 a deep CNN called AlexNet won the ImageNet image-recognition contest by a staggering margin, roughly halving the previous error rate. It was the spark that lit the modern deep-learning era: within a few years CNNs were beating humans at narrow recognition tasks, and the same convolutional idea now underpins medical imaging, self-driving perception, and the vision front-ends of many larger models.