Floating-Point and Error

Type 0.1 + 0.2 into almost any calculator built from a real computer and you get back not 0.3 but 0.30000000000000004. The machine is not broken and it is not being lazy — it simply cannot store the number 0.1. A computer holds real numbers in a fixed number of binary digits, and 0.1 in binary is the endlessly repeating 0.0001100110011\ldots_2 — just as \tfrac13 = 0.333\ldots never terminates in decimal. So the computer keeps the closest value it can store, and a tiny error rides along.

This is the founding fact of numerical analysis: every real number a computer touches is, in general, slightly wrong. Usually that is harmless — the error is smaller than a speck of dust on a bridge. But wire enough arithmetic together carelessly and the specks pile into an avalanche: a weather model that diverges, a rocket that misses, a bank ledger that won't balance. This page teaches you to measure the error a computer makes, to name the one number that governs its size — machine epsilon — and to spot the single most dangerous operation in all of numerical computing: catastrophic cancellation.

How a computer stores a real number

A floating-point number is just scientific notation in binary. Where you would write 6.022 \times 10^{23}, the machine writes

x = \pm\, (1.b_1 b_2 \ldots b_{52})_2 \times 2^{e}.

The part in brackets — the significand (or mantissa) — carries the digits; the exponent e slides the binary point left or right. The universal standard, IEEE 754 double precision, spends 64 bits: one for the sign, 11 for the exponent, and 52 for the fraction. Those 52 bits (plus the implicit leading 1) buy about 15–16 significant decimal digits. Everything past digit 16 or so is simply not there.

Because only finitely many bit patterns exist, the storable numbers form a discrete grid, not the continuous number line. Ask to store a value between two grid points and the machine rounds to the nearest one — exactly the idea from rounding, but the "places" are binary. Cruically the grid is not evenly spaced: near 1 the points are about 2^{-52} apart, but near 1{,}000{,}000 they are a million times coarser. The grid stretches with the magnitude of the number, which is exactly why the error is best measured in relative terms.

Two ways to measure being wrong

Suppose the true value is x and the computer holds an approximation \hat{x}. There are two honest ways to say how far off it is.

Absolute error — the raw gap, in the same units as x: E_{\text{abs}} = |\,\hat{x} - x\,|.
Relative error — the gap as a fraction of the true size (for x \ne 0): E_{\text{rel}} = \frac{|\,\hat{x} - x\,|}{|x|}.

Why keep both? An absolute error of 1\,\text{mm} is a triumph when measuring a football pitch and a catastrophe when machining a wristwatch — the relative error tells them apart. Relative error is the natural currency of floating point precisely because the grid stretches with magnitude: everywhere on the number line the relative rounding error is roughly the same tiny constant. That constant has a name.

Machine epsilon: the size of a rounding error

Machine epsilon, written \varepsilon_{\text{mach}}, is the gap between 1 and the next floating-point number above it. It is the coarseness of the grid at 1, and it caps the relative error of a single rounding: round any real number to the nearest double and you are off by at most \tfrac12\varepsilon_{\text{mach}} in relative terms. For IEEE double precision,

\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}.

You can find it without looking it up. Start at 1 and keep halving a step; the moment 1 + \text{step} is indistinguishable from 1, the step has fallen through the grid. Run it:

// Discover machine epsilon by halving until 1 + eps rounds back to 1. let eps: number = 1; let steps: number = 0; while (1 + eps / 2 > 1) { eps = eps / 2; // still resolvable — keep shrinking steps++; } console.log(`machine epsilon = ${eps}`); console.log(` 2^-52 = ${Math.pow(2, -52)}`); console.log(`halvings from 1 = ${steps}`); // expect 52 // Proof that it is the *threshold*: half of it vanishes when added to 1. console.log(`1 + eps = ${1 + eps} (distinct from 1)`); console.log(`1 + eps/2 = ${1 + eps / 2} (rounds back to 1)`);

Fifty-two halvings, landing on 2^{-52} — the number of fraction bits, laid bare. Anything smaller than \varepsilon_{\text{mach}} added to 1 is swallowed whole. That single number governs the accuracy of every calculation your machine will ever do.

Yes — and this is the whole point of "floating" point. The gap to the next representable number scales with the magnitude. Near 1 it is 2^{-52}; near 2^{53} \approx 9\times10^{15} the gap grows to a full 1, so integers beyond 2^{53} can no longer all be stored exactly — 2^{53} + 1 rounds back to 2^{53}! The relative spacing stays fixed at about \varepsilon_{\text{mach}} everywhere, which is exactly why relative error is the right yardstick and why doubling a number does not cost you any significant digits.

The chart: how the grid stretches

Here is the gap to the next double — the local grid spacing \text{ulp}(x), an "unit in the last place" — plotted against the magnitude of x, both on log axes. Every time x crosses a power of two the available precision jumps down another notch: a perfect staircase of doublings, the geometry of floating point laid out in full.

Catastrophic cancellation: where digits go to die

Rounding errors are usually harmless because they are relatively tiny. The danger comes when an operation takes two numbers that are nearly equal and subtracts them. The leading digits — the accurate ones — cancel, promoting the small trailing rounding errors to centre stage. This is catastrophic cancellation, and it is the villain of numerical analysis.

Picture two numbers known to 16 digits that agree in their first 15:

\underbrace{1.2345678901234\,\mathbf{5}}_{\hat a} - \underbrace{1.2345678901234\,\mathbf{1}}_{\hat b} = 0.0000000000000\,\mathbf{4}.

The answer keeps a single trustworthy digit — the fifteen that cancelled took all their precision with them. The classic trap is the quadratic formula. To solve x^2 + bx + c = 0 when b > 0 is huge and c is modest, the root x = \dfrac{-b + \sqrt{b^2 - 4c}}{2} subtracts \sqrt{b^2-4c} (just under b) from b — a near-perfect cancellation. Watch the naive formula bleed digits, then a clever rearrangement rescue them all:

// Solve x^2 + b x + c = 0 for the root near zero, with b huge and c small. const b: number = 1e8; const c: number = 1; const disc: number = Math.sqrt(b * b - 4 * c); // NAIVE: -b + sqrt(...) subtracts two nearly-equal giants → cancellation. const naive: number = (-b + disc) / 2; // STABLE: multiply top and bottom by (-b - sqrt), turning the subtraction // into an addition of same-sign numbers. Algebraically identical; numerically safe. const stable: number = (2 * c) / (-b - disc); console.log(`naive root = ${naive}`); console.log(`stable root = ${stable}`); // The true root satisfies x ≈ -c/b for this regime: console.log(`exact-ish = ${-c / b}`); console.log(`residual naive = ${(naive * naive + b * naive + c).toExponential(3)}`); console.log(`residual stable = ${(stable * stable + b * stable + c).toExponential(3)}`);

Both formulas are exactly the same in pure algebra, yet the naive one returns a root whose residual is enormous while the rearranged one nails it. The lesson is profound and practical: the way you write a formula changes how accurately a computer can evaluate it. A large part of numerical analysis is finding the arrangement that dodges the subtraction of near-equals.

Never test two floating-point results with ==. Because 0.1 + 0.2 stores as 0.30000000000000004, the check 0.1 + 0.2 === 0.3 returns false — a bug that has cost real money and real spacecraft. Two other landmines share the same root cause:

Addition is not associative. (a + b) + c can differ from a + (b + c), because each + rounds. Summing a big number then a tiny one can lose the tiny one entirely — so to add a long list accurately, sort small-to-large first.
Compare with a tolerance, not for equality. Ask whether |\hat a - \hat b| < \tau for a small \tau chosen to suit the problem's scale — usually a small multiple of \varepsilon_{\text{mach}} times the size of the numbers involved.

console.log(`0.1 + 0.2 = ${0.1 + 0.2}`); console.log(`equal to 0.3 with === ? ${0.1 + 0.2 === 0.3}`); const tol = 1e-9; console.log(`equal within tolerance? ${Math.abs(0.1 + 0.2 - 0.3) < tol}`);

Errors that compound: conditioning, briefly

Even with perfect arrangements, some problems amplify whatever tiny input error they are handed. A problem is ill-conditioned when a small relative change in the input forces a large relative change in the output — the calculation is a lever that magnifies uncertainty. The most famous example is \tan x near x = \tfrac{\pi}{2}, where a whisker of input change sends the output racing to infinity. No cleverness in coding can save an ill-conditioned problem; the sensitivity is baked into the mathematics, not the machine. We measure it precisely with the condition number, which returns in full when we study iterative methods and conditioning for linear systems. For now, hold the intuition: total error = (how ill-conditioned the problem is) × (how stable your algorithm is). Numerical analysis is the discipline of keeping both factors small.