The Curse of Dimensionality

Every extra feature you hand a model is another number in the feature vector — another axis of the space your examples live in. So more features ought to be pure good news: more to look at, more signal, better predictions. And for a while it is. Then, past a certain point, adding features quietly makes everything worse — the model needs astronomically more data, neighbours stop being neighbourly, and accuracy sags.

This isn't a bug in any particular algorithm. It's geometry. High-dimensional space is a genuinely strange place, and our flat, three-dimensional intuitions lie to us there. The name for the whole bundle of weirdness is the curse of dimensionality, and this page is about three concrete, countable ways it bites: space empties out, volume flees to the walls, and distances stop meaning anything.

Effect 1: data goes sparse — the 10^d explosion

Chop each feature's range into 10 bins and lay down a grid. In one dimension that's just 10 little cells along a line. In two dimensions it's a 10 \times 10 grid — 100 cells. In three, 10 \times 10 \times 10 = 1000. In general, a grid of b bins per axis across d features has

b^d \quad\text{cells.}

With b = 10 the count is 10^d, and that exponent is merciless. Ten features already carve space into 10^{10} = 10{,}000{,}000{,}000 — ten billion — cells. Twenty features give 10^{20}, more cells than there are grains of sand on Earth.

Here's why that hurts. Learning a pattern means having a few examples in each region of space to learn from. Suppose you'd like, on average, just 100 examples per cell:

d = 1: 10 cells → about 1{,}000 examples. Easy.
d = 3: 1{,}000 cells → about 100{,}000 examples. Doable.
d = 10: 10^{10} cells → about 10^{12} examples. A trillion rows. Hopeless.

Real datasets don't grow with an exponent — but the space they live in does. So as you add features, your fixed pile of data spreads thinner and thinner across an exploding volume, until almost every cell is empty and the model is guessing between islands of data with vast unexplored gulfs in between. The data hasn't shrunk; the space has ballooned.

The phrase was coined by the American mathematician Richard Bellman in the late 1950s — not in machine learning at all, but in his work on dynamic programming, a method for solving multi-stage optimisation and control problems. Bellman noticed that as he added state variables to a problem, the number of grid points he had to evaluate blew up exponentially: each new dimension multiplied the work. He dubbed this the "curse of dimensionality," and the name stuck so hard it escaped his field entirely and became one of the most quoted phrases in all of data science. It's a rare thing — a piece of mathematics with a genuinely spooky brand name.

Effect 2: all the volume runs to the shell

Picture a cube with sides of length 1. Paint a thin skin of thickness \varepsilon on the inside of every face. What fraction of the cube's volume is inside that skin — near the surface — rather than in the cosy interior?

The interior (everything more than \varepsilon from every wall) is itself a smaller cube, with side 1 - 2\varepsilon, so its volume is (1 - 2\varepsilon)^d. Everything else is the shell:

\text{shell fraction} = 1 - (1 - 2\varepsilon)^d.

Take a genuinely thin skin, \varepsilon = 0.05 — just 5% of the way in from each face — and watch what dimension does to it:

d = 1: 1 - 0.9^{1} = 0.10 — 10% near the edges. Sensible.
d = 10: 1 - 0.9^{10} \approx 0.65 — already almost two-thirds.
d = 100: 1 - 0.9^{100} \approx 0.99997 — essentially all of it.

Since 1 - 2\varepsilon < 1, the interior term (1 - 2\varepsilon)^d \to 0 as d grows, so the shell fraction \to 1. In a high-dimensional cube there is essentially no "middle" left: almost every point is jammed up against the boundary. An orange in 1000 dimensions is all peel and no flesh. That wrecks the mental picture of a tidy blob of typical data with a well-defined centre — high-dimensional data lives on a thin, spiky shell.

Drag the shell thickness \varepsilon. Even for a razor-thin skin the curve rushes up to 1 within a few dozen dimensions — and the thicker you make the skin, the faster it gets there.

Effect 3: distances concentrate — "nearest" loses its meaning

This is the one that hurts machine learning most directly. Scatter a bunch of points at random in a d-dimensional space and, for some query point, measure the distance to its nearest neighbour and to its farthest one. In low dimensions those are wildly different — the nearest point is snug, the farthest is way over there. As d climbs, though, a startling thing happens: those two distances converge. The ratio

\frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} \longrightarrow 0 \quad\text{as } d \to \infty.

In plain terms: in high dimensions everything is roughly the same distance from everything else. The nearest neighbour is barely nearer than the farthest one. A hundred randomly placed points can all sit at almost identical distances from you — the very notion of "close" goes soft and stops carrying information.

Any algorithm whose whole idea is "look at what's nearby" now stands on quicksand: k-nearest-neighbours can't trust its neighbours, clustering can't tell one clump from another, and similarity search returns near-ties. If closeness is meaningless, algorithms built on closeness are, too.

As the number of dimensions d grows, three things happen together:

Sparsity. A grid of b bins per axis has b^d cells, so the data needed to keep any fixed density grows exponentially in d.
Volume on the shell. The fraction of a unit cube within \varepsilon of its surface is 1 - (1 - 2\varepsilon)^d \to 1 — almost all points hug the boundary.
Distance concentration. Nearest- and farthest-neighbour distances converge, so "closeness" stops distinguishing points — breaking k-NN, clustering, and similarity search.

The seductive mistake is "when in doubt, add another feature — it can't hurt." It absolutely can. An irrelevant feature adds a whole new dimension of pure noise: it stretches the space, pushes points apart along a meaningless axis, and dilutes the real signal the useful features carry. Now a model has to work out that this new axis is junk — and to do that reliably it needs exponentially more data, exactly the sparsity blow-up from Effect 1.

So a small set of features that genuinely relate to the label almost always beats a huge pile that merely happen to be available. The right response to the curse isn't "collect more columns" — it's the opposite: feature selection (keep only the features that carry signal) and dimensionality reduction (squash many correlated features down into a few informative ones, e.g. with principal component analysis). Both are ways of paying down dimensions to lift the curse. Fewer, better axes; more room to breathe.

So is high-dimensional data hopeless?

No — and it's worth saying why, because models with millions of features clearly do work. The escape hatch is that real data almost never fills its space. The pixels of photos of faces, say, technically live in a space of thousands of dimensions, but actual faces occupy a thin, curved sheet inside it — a low-dimensional manifold. The curse describes what happens when data spreads uniformly through all of a high-dimensional space; the reason machine learning survives is that it usually doesn't. The practical craft is to find and exploit that hidden low-dimensional structure — which is exactly what feature selection and dimensionality reduction are for.