The Curse of Dimensionality
Every extra feature you hand a model is another number in the
feature
vector — another axis of the space your examples live in. So more features ought to be
pure good news: more to look at, more signal, better predictions. And for a while it is. Then, past
a certain point, adding features quietly makes everything worse — the model needs
astronomically more data, neighbours stop being neighbourly, and accuracy sags.
This isn't a bug in any particular algorithm. It's geometry. High-dimensional space
is a genuinely strange place, and our flat, three-dimensional intuitions lie to us there. The name
for the whole bundle of weirdness is the curse of dimensionality, and this page is
about three concrete, countable ways it bites: space empties out, volume flees to the walls, and
distances stop meaning anything.
Effect 1: data goes sparse — the 10^d explosion
Chop each feature's range into 10 bins and lay down a grid. In one dimension that's just 10 little
cells along a line. In two dimensions it's a 10 \times 10 grid — 100
cells. In three, 10 \times 10 \times 10 = 1000. In general, a grid of
b bins per axis across d features has
b^d \quad\text{cells.}
With b = 10 the count is 10^d, and that
exponent is merciless. Ten features already carve space into
10^{10} = 10{,}000{,}000{,}000 — ten billion — cells. Twenty
features give 10^{20}, more cells than there are grains of sand on Earth.
Here's why that hurts. Learning a pattern means having a few examples in each region of space to
learn from. Suppose you'd like, on average, just 100 examples per cell:
- d = 1: 10 cells → about
1{,}000 examples. Easy.
- d = 3: 1{,}000 cells → about
100{,}000 examples. Doable.
- d = 10: 10^{10} cells → about
10^{12} examples. A trillion rows. Hopeless.
Real datasets don't grow with an exponent — but the space they live in does. So as you add features,
your fixed pile of data spreads thinner and thinner across an exploding volume, until almost every
cell is empty and the model is guessing between islands of data with vast unexplored gulfs in
between. The data hasn't shrunk; the space has ballooned.
The phrase was coined by the American mathematician
Richard Bellman in the late 1950s — not in machine
learning at all, but in his work on dynamic programming, a method for solving
multi-stage optimisation and control problems. Bellman noticed that as he added state variables to a
problem, the number of grid points he had to evaluate blew up exponentially: each new dimension
multiplied the work. He dubbed this the "curse of dimensionality," and the name stuck so hard it
escaped his field entirely and became one of the most quoted phrases in all of data science. It's a
rare thing — a piece of mathematics with a genuinely spooky brand name.
Effect 2: all the volume runs to the shell
Picture a cube with sides of length 1. Paint a thin skin of thickness
\varepsilon on the inside of every face. What fraction of the cube's
volume is inside that skin — near the surface — rather than in the cosy interior?
The interior (everything more than \varepsilon from every wall)
is itself a smaller cube, with side 1 - 2\varepsilon, so its volume is
(1 - 2\varepsilon)^d. Everything else is the shell:
\text{shell fraction} = 1 - (1 - 2\varepsilon)^d.
Take a genuinely thin skin, \varepsilon = 0.05 — just 5% of the way in
from each face — and watch what dimension does to it:
- d = 1: 1 - 0.9^{1} = 0.10 — 10% near the
edges. Sensible.
- d = 10: 1 - 0.9^{10} \approx 0.65 — already
almost two-thirds.
- d = 100: 1 - 0.9^{100} \approx 0.99997 —
essentially all of it.
Since 1 - 2\varepsilon < 1, the interior term
(1 - 2\varepsilon)^d \to 0 as d grows, so the
shell fraction \to 1. In a high-dimensional cube there is essentially no
"middle" left: almost every point is jammed up against the boundary. An orange in
1000 dimensions is all peel and no flesh. That wrecks the mental picture of a tidy blob of typical
data with a well-defined centre — high-dimensional data lives on a thin, spiky shell.
Drag the shell thickness \varepsilon. Even for a razor-thin skin the
curve rushes up to 1 within a few dozen dimensions — and the thicker you make the skin, the faster
it gets there.
Effect 3: distances concentrate — "nearest" loses its meaning
This is the one that hurts machine learning most directly. Scatter a bunch of points at random in a
d-dimensional space and, for some query point, measure the distance to its
nearest neighbour and to its farthest one. In low dimensions
those are wildly different — the nearest point is snug, the farthest is way over there. As
d climbs, though, a startling thing happens: those two distances
converge. The ratio
\frac{\text{dist}_{\max} - \text{dist}_{\min}}{\text{dist}_{\min}} \longrightarrow 0
\quad\text{as } d \to \infty.
In plain terms: in high dimensions everything is roughly the same distance from everything
else. The nearest neighbour is barely nearer than the farthest one. A hundred randomly
placed points can all sit at almost identical distances from you — the very notion of "close" goes
soft and stops carrying information.
Any algorithm whose whole idea is "look at what's nearby" now stands on quicksand:
k-nearest-neighbours
can't trust its neighbours,
clustering
can't tell one clump from another, and similarity search
returns near-ties. If closeness is meaningless, algorithms built on closeness are, too.
As the number of dimensions d grows, three things happen together:
- Sparsity. A grid of b bins per axis has
b^d cells, so the data needed to keep any fixed density grows
exponentially in d.
- Volume on the shell. The fraction of a unit cube within
\varepsilon of its surface is
1 - (1 - 2\varepsilon)^d \to 1 — almost all points hug the boundary.
- Distance concentration. Nearest- and farthest-neighbour distances converge, so
"closeness" stops distinguishing points — breaking k-NN, clustering, and similarity search.
The seductive mistake is "when in doubt, add another feature — it can't hurt." It absolutely can. An
irrelevant feature adds a whole new dimension of pure noise: it stretches the space, pushes
points apart along a meaningless axis, and dilutes the real signal the useful features carry. Now a
model has to work out that this new axis is junk — and to do that reliably it needs
exponentially more data, exactly the sparsity blow-up from Effect 1.
So a small set of features that genuinely relate to the label almost always beats a huge pile that
merely happen to be available. The right response to the curse isn't "collect more columns" — it's
the opposite: feature selection (keep only the features that carry signal) and
dimensionality
reduction (squash many correlated features down into a few informative ones, e.g. with
principal
component analysis). Both are ways of paying down dimensions to
lift the curse. Fewer, better axes; more room to breathe.
So is high-dimensional data hopeless?
No — and it's worth saying why, because models with millions of features clearly do work. The escape
hatch is that real data almost never fills its space. The pixels of photos of faces, say, technically
live in a space of thousands of dimensions, but actual faces occupy a thin, curved sheet inside it —
a low-dimensional manifold. The curse describes what happens when data spreads uniformly
through all of a high-dimensional space; the reason machine learning survives is that it usually
doesn't. The practical craft is to find and exploit that hidden low-dimensional structure — which is
exactly what feature selection and dimensionality reduction are for.