Convolutional Networks

A neural network that treats an image as one long list of pixels is in trouble. A modest 200 \times 200 colour image is 120{,}000 numbers; a fully-connected first layer of a thousand neurons would need 120 million weights — and it would have thrown away the single most important fact about an image: that nearby pixels belong together, and a feature like an edge looks the same wherever it appears.

A convolutional neural network (CNN) is built around that fact. Instead of connecting every pixel to every neuron, it slides a tiny filter across the image, looking for one small pattern everywhere at once.

Convolution: one small filter, slid everywhere

A filter (or kernel) is a little grid of weights — often just 3 \times 3. You slide it over every position of the image, and at each spot compute a dot product of the filter with the patch beneath it. The grid of results is a feature map that lights up wherever the filter's pattern occurs. A filter like \begin{bmatrix}-1&0&1\\-1&0&1\\-1&0&1\end{bmatrix} responds strongly at a vertical edge — dark on the left, bright on the right — and stays quiet on flat regions. Run it on a little image with an edge down the middle:

type Grid = number[][]; // A 5×5 "image": dark (0) on the left, bright (1) on the right — a vertical edge. const image: Grid = [ [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], [0, 0, 1, 1, 1], ]; // A vertical-edge filter: (right column) minus (left column). const filter: Grid = [ [-1, 0, 1], [-1, 0, 1], [-1, 0, 1], ]; function convolve(img: Grid, k: Grid): Grid { const out: Grid = []; for (let i = 0; i + 3 <= img.length; i++) { const row: number[] = []; for (let j = 0; j + 3 <= img[0].length; j++) { let sum = 0; for (let di = 0; di < 3; di++) for (let dj = 0; dj < 3; dj++) sum += img[i + di][j + dj] * k[di][dj]; row.push(sum); } out.push(row); } return out; } // The feature map lights up (big numbers) exactly where the edge is, and is 0 on flat areas. for (const row of convolve(image, filter)) console.log(row.join(" "));

The output is large at the edge and zero over the uniform bright region — the filter has found the edge, and it would find one anywhere in the image using the exact same nine weights.

Why this is such a good idea

Weight sharing. The same little filter is reused at every position, so a convolutional layer has a handful of weights instead of millions — and it automatically detects its feature anywhere (translation invariance).
Local connections. Each output looks at a small neighbourhood, matching how visual features are local.
Pooling. Between convolutions, a pooling step (e.g. taking the maximum of each little block) shrinks the map, keeping the strongest responses and adding a bit more position-invariance.
Hierarchy. Stack these layers and the features compound: early layers find edges, the next combine edges into corners and textures, later ones into shapes, and the deepest into whole objects — a face, a cat, a car.

Slides small filters over the image to build feature maps.
Shared weights mean far fewer parameters and detection of a feature anywhere.
Stacked conv + pooling layers learn a hierarchy: edges → shapes → objects.

In 2012 a deep CNN called AlexNet won the ImageNet image-recognition contest by a staggering margin, roughly halving the previous error rate. It was the spark that lit the modern deep-learning era: within a few years CNNs were beating humans at narrow recognition tasks, and the same convolutional idea now underpins medical imaging, self-driving perception, and the vision front-ends of many larger models.

A single filter detects the same feature everywhere — that's the strength (efficiency, translation invariance), but it's why a CNN uses many filters per layer, one for each pattern it needs to spot.
Pooling throws away precise location. That's usually good (you care that there's an edge, not exactly which pixel), but it means a CNN is not naturally sensitive to exact positions.