Floating-Point and Error
Evaluate 0.1 + 0.2 in almost any programming language on a real computer and you get back
not 0.3 but 0.30000000000000004. The machine is
not broken and it is not being lazy — it simply cannot store the number
0.1. A computer holds real numbers in a fixed number of binary digits, and
0.1 in binary is the endlessly repeating
0.0001100110011\ldots_2 — just as \tfrac13 = 0.333\ldots
never terminates in decimal. So the computer keeps the closest value it can store, and a tiny
error rides along.
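To see the stored values rather than the rounded display, ask for more digits than the default formatter prints. The sketch below uses `toFixed(20)` purely as a display aid; the variable names are illustrative and not part of the original page.

```typescript
// Print 20 decimal places to expose what the machine actually holds.
const tenthShown: string = (0.1).toFixed(20);
const sumShown: string = (0.1 + 0.2).toFixed(20);
const threeTenthsShown: string = (0.3).toFixed(20);

console.log(`0.1       is stored as ${tenthShown}`);       // 0.10000000000000000555
console.log(`0.1 + 0.2 is stored as ${sumShown}`);         // 0.30000000000000004441
console.log(`0.3       is stored as ${threeTenthsShown}`); // 0.29999999999999998890
```

None of the three is the decimal you typed: 0.1 is stored slightly high, 0.3 slightly low, and the sum lands on a different grid point from 0.3.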
This is the founding fact of numerical analysis: every real number a computer touches
is, in general, slightly wrong. Usually that is harmless — the error is smaller than a speck of dust on
a bridge. But wire enough arithmetic together carelessly and the specks pile into an avalanche: a
weather model that diverges, a rocket that misses, a bank ledger that won't balance. This page teaches
you to measure the error a computer makes, to name the one number that governs its size —
machine epsilon — and to spot the single most dangerous operation in all of numerical
computing: catastrophic cancellation.
How a computer stores a real number
A floating-point number is just scientific notation in binary. Where you would write
6.022 \times 10^{23}, the machine writes
x = \pm\, (1.b_1 b_2 \ldots b_{52})_2 \times 2^{e}.
The part in brackets — the significand (or mantissa) — carries the digits; the exponent
e slides the binary point left or right. The universal standard,
IEEE 754 double precision, spends 64 bits: one for the sign, 11 for the exponent,
and 52 for the fraction. Those 52 bits (plus the implicit leading
1) buy about 15–16
significant decimal digits. Everything past digit 16 or so is simply not there.
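The 1 + 11 + 52 layout can be inspected directly. The following is a minimal sketch, assuming the standard IEEE 754 exponent bias of 1023 (a detail this page does not state); `decodeDouble` and `DoubleFields` are hypothetical names chosen for illustration.

```typescript
interface DoubleFields {
  sign: number;           // 0 for positive, 1 for negative
  biasedExponent: number; // the raw 11-bit field, 0..2047
  exponent: number;       // biasedExponent - 1023 (valid for normal numbers)
  fraction: number;       // the 52 stored fraction bits, as an integer
}

function decodeDouble(x: number): DoubleFields {
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, x);        // DataView is big-endian by default
  const hi = view.getUint32(0); // sign, exponent, top 20 fraction bits
  const lo = view.getUint32(4); // low 32 fraction bits
  const sign = hi >>> 31;
  const biasedExponent = (hi >>> 20) & 0x7ff;
  // Below 2^52, so the integer is held exactly in a double.
  const fraction = (hi & 0xfffff) * 4294967296 + lo;
  return { sign, biasedExponent, exponent: biasedExponent - 1023, fraction };
}

const fieldsOfTenth = decodeDouble(0.1);
console.log(fieldsOfTenth.exponent);              // -4
console.log(fieldsOfTenth.fraction.toString(16)); // 999999999999a
```

The repeating `9999…` in the fraction is the binary pattern 1001 1001 … cut off after 52 bits and rounded up in the last place.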
Because only finitely many bit patterns exist, the storable numbers form a discrete grid,
not the continuous number line. Ask to store a value between two grid points and the machine
rounds to the nearest one — exactly the idea from
rounding, but the
"places" are binary. Crucially, the grid is not evenly spaced: near
1 the points are 2^{-52} apart, but near
1{,}000{,}000 they are 2^{-33} apart, about half a million times coarser. The grid stretches with the
magnitude of the number, which is exactly why the error is best measured in relative terms.
Two ways to measure being wrong
Suppose the true value is x and the computer holds an approximation
\hat{x}. There are two honest ways to say how far off it is.
- Absolute error: the raw gap, in the same units as x.
E_{\text{abs}} = |\,\hat{x} - x\,|.
- Relative error: the gap as a fraction of the true size (for x \ne 0).
E_{\text{rel}} = \frac{|\,\hat{x} - x\,|}{|x|}.
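The two definitions translate directly into code. This is a small sketch; the function names and the sample measurements (in millimetres) are made up for illustration.

```typescript
// Absolute error: the raw gap |approx - exact|.
function absoluteError(approx: number, exact: number): number {
  return Math.abs(approx - exact);
}

// Relative error: the gap as a fraction of the true size (undefined at exact = 0).
function relativeError(approx: number, exact: number): number {
  if (exact === 0) {
    throw new RangeError("relative error is undefined when the exact value is 0");
  }
  return Math.abs(approx - exact) / Math.abs(exact);
}

// Same 1 mm absolute error, very different relative error.
console.log(absoluteError(100001, 100000)); // 1
console.log(relativeError(100001, 100000)); // 0.00001
console.log(absoluteError(3, 2));           // 1
console.log(relativeError(3, 2));           // 0.5
```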
Why keep both? An absolute error of 1\,\text{mm} is a triumph when measuring
a football pitch and a catastrophe when machining a wristwatch — the relative error tells them
apart. Relative error is the natural currency of floating point precisely because the grid stretches
with magnitude: everywhere on the number line the relative rounding error is roughly the same tiny
constant. That constant has a name.
Machine epsilon: the size of a rounding error
Machine epsilon, written \varepsilon_{\text{mach}}, is the
gap between 1 and the next floating-point number above it. It is the
coarseness of the grid at 1, and it caps the relative error of a single
rounding: round any real number to the nearest double and you are off by at most
\tfrac12\varepsilon_{\text{mach}} in relative terms. For IEEE double precision,
\varepsilon_{\text{mach}} = 2^{-52} \approx 2.22 \times 10^{-16}.
You can find it without looking it up. Start at 1 and keep halving a
step; the moment 1 + \text{step} is indistinguishable from
1, the step has fallen through the grid. Run it:
// Discover machine epsilon by halving until 1 + eps rounds back to 1.
let eps: number = 1;
let steps: number = 0;
while (1 + eps / 2 > 1) {
  eps = eps / 2; // still resolvable, so keep shrinking
  steps++;
}
console.log(`machine epsilon = ${eps}`);
console.log(` 2^-52 = ${Math.pow(2, -52)}`);
console.log(`halvings from 1 = ${steps}`); // expect 52
// Proof that it is the *threshold*: half of it vanishes when added to 1.
console.log(`1 + eps = ${1 + eps} (distinct from 1)`);
console.log(`1 + eps/2 = ${1 + eps / 2} (rounds back to 1)`);
Fifty-two halvings, landing on 2^{-52} — the number of fraction bits, laid
bare. Anything smaller than \varepsilon_{\text{mach}} added to
1 is swallowed whole. That single number governs the accuracy of every
calculation your machine will ever do.
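JavaScript and TypeScript also expose this value as the built-in constant `Number.EPSILON`, so the halving loop is not needed in practice. A quick check, assuming an ES2015 or later runtime (the variable names are illustrative):

```typescript
const builtInEps: number = Number.EPSILON;
const definitionEps: number = Math.pow(2, -52);

console.log(builtInEps === definitionEps); // true: same value as the halving loop finds
console.log(1 + builtInEps > 1);           // true: still resolvable next to 1
console.log(1 + builtInEps / 2 === 1);     // true: half of it is swallowed
```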
The spacing is not the same everywhere, and this is the whole point of "floating" point. The gap to the next representable number scales
with the magnitude. Near 1 it is
2^{-52}; just below 2^{53} \approx 9\times10^{15} the
gap has grown to a full 1, and just above it the gap is 2, so integers beyond
2^{53} can no longer all be stored exactly:
2^{53} + 1 rounds back to 2^{53}. The
relative spacing stays between \tfrac12\varepsilon_{\text{mach}} and
\varepsilon_{\text{mach}} everywhere, which is exactly why relative error is
the right yardstick and why doubling a number does not cost you any significant digits.
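The integer limit is easy to confirm. The sketch below relies on the standard `Number.MAX_SAFE_INTEGER` and `Number.isSafeInteger` built-ins; `integerLimit` is an illustrative name.

```typescript
const integerLimit: number = Math.pow(2, 53); // 9007199254740992

console.log(integerLimit + 1 === integerLimit);                 // true: the +1 is rounded away
console.log(integerLimit + 2);                                  // 9007199254740994: the next grid point
console.log(Number.MAX_SAFE_INTEGER === integerLimit - 1);      // true
console.log(Number.isSafeInteger(integerLimit));                // false
```

This is why identifiers or counters that may exceed 2^53 are usually kept as strings or `bigint` values rather than plain numbers.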
The chart: how the grid stretches
Here is the gap to the next double, the local grid spacing
\text{ulp}(x) or "unit in the last place", plotted against the
magnitude of x, both on log axes. Every time
x crosses a power of two the spacing doubles: a
perfect staircase of doublings, the geometry of floating point laid out in full.
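The staircase can also be tabulated numerically. What follows is a minimal sketch of an `ulp` helper (a hypothetical name, not a built-in) that reads the exponent field of a double and returns the gap to the next double above |x|; it assumes the standard bias of 1023 and treats zero and subnormals as sharing the smallest spacing.

```typescript
// Gap from |x| to the next representable double above it.
function ulp(x: number): number {
  if (!Number.isFinite(x)) return NaN;
  const view = new DataView(new ArrayBuffer(8));
  view.setFloat64(0, Math.abs(x));
  const biased = (view.getUint32(0) >>> 20) & 0x7ff;
  // Zero and subnormals (biased = 0) use the fixed exponent -1022.
  const exponent = biased === 0 ? -1022 : biased - 1023;
  return Math.pow(2, exponent - 52);
}

// One row per factor of 2^10 in magnitude: the spacing grows by 2^10 each time.
for (let k = 0; k <= 60; k += 10) {
  const x = Math.pow(2, k);
  console.log(`x = 2^${k}   ulp(x) = ${ulp(x).toExponential(3)}`);
}
```

Within one binade, for example everywhere in [1, 2), the helper returns the same value, which is the flat tread of each stair.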
Catastrophic cancellation: where digits go to die
Rounding errors are usually harmless because they are relatively tiny. The danger comes when an
operation takes two numbers that are nearly equal and subtracts them.
The leading digits — the accurate ones — cancel, promoting the small trailing rounding errors to centre
stage. This is catastrophic cancellation, and it is the villain of numerical analysis.
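A two-line experiment shows the effect in double precision. The sketch adds a tiny number to 1 and then subtracts the 1 again; in exact arithmetic the tiny number would come back unchanged. The variable names are illustrative.

```typescript
const tinyValue: number = 1e-15;
// 1 + tinyValue must be rounded to the grid near 1, whose spacing is about 2.2e-16.
const recovered: number = (1 + tinyValue) - 1;
const recoveredRelError: number = Math.abs(recovered - tinyValue) / tinyValue;

console.log(recovered);         // 1.1102230246251565e-15, not 1e-15
console.log(recoveredRelError); // about 0.11, i.e. an 11% error
```

The addition itself was accurate to about one part in 10^16. The subtraction is exact, but it strips away the leading digits and leaves that rounding error as 11% of what remains.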
Picture two numbers known to 15 digits that agree in their first 14:
\underbrace{1.2345678901234\,\mathbf{5}}_{\hat a} - \underbrace{1.2345678901234\,\mathbf{1}}_{\hat b} = 0.0000000000000\,\mathbf{4}.
The answer keeps a single trustworthy digit; the fourteen that cancelled took all their precision with
them. The classic trap is the quadratic formula. To solve
x^2 + bx + c = 0 when b > 0 is huge and
c is modest, the root
x = \dfrac{-b + \sqrt{b^2 - 4c}}{2} adds
\sqrt{b^2-4c} (just under b) to
-b, a near-perfect cancellation. Watch the naive formula bleed digits, then a
clever rearrangement rescue them all:
// Solve x^2 + b x + c = 0 for the root near zero, with b huge and c small.
const b: number = 1e8;
const c: number = 1;
const disc: number = Math.sqrt(b * b - 4 * c);
// NAIVE: -b + sqrt(...) subtracts two nearly-equal giants → cancellation.
const naive: number = (-b + disc) / 2;
// STABLE: multiply top and bottom by (-b - sqrt), turning the subtraction
// into an addition of same-sign numbers. Algebraically identical; numerically safe.
const stable: number = (2 * c) / (-b - disc);
console.log(`naive root = ${naive}`);
console.log(`stable root = ${stable}`);
// The true root satisfies x ≈ -c/b for this regime:
console.log(`exact-ish = ${-c / b}`);
console.log(`residual naive = ${(naive * naive + b * naive + c).toExponential(3)}`);
console.log(`residual stable = ${(stable * stable + b * stable + c).toExponential(3)}`);
Both formulas are exactly the same in pure algebra, yet in double precision the naive one returns a root that is
off by roughly 25%, leaving a residual near 0.25, while the rearranged one leaves a residual near zero. The lesson is profound and practical:
the way you write a formula changes how accurately a computer can evaluate it. A large
part of numerical analysis is finding the arrangement that dodges the subtraction of near-equals.
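The same rearrangement generalises to both roots and either sign of b. The following is a sketch of one widely used formulation, not necessarily the one this page's author has in mind: compute q = -(b + sign(b)·sqrt(b² - 4ac))/2, which only ever adds same-sign quantities, then take the roots as q/a and c/q. The function name `solveQuadratic` is hypothetical, and the sketch assumes real roots and does not guard against overflow in b·b.

```typescript
// Roots of a*x^2 + b*x + c = 0 without subtracting near-equal numbers.
// Returns [larger-magnitude root, smaller-magnitude root].
function solveQuadratic(a: number, b: number, c: number): [number, number] {
  const disc = b * b - 4 * a * c;
  if (a === 0 || disc < 0) {
    throw new RangeError("this sketch expects a != 0 and real roots");
  }
  const sign = b >= 0 ? 1 : -1;
  // b and sign * sqrt(disc) share a sign, so this sum cannot cancel.
  const q = -(b + sign * Math.sqrt(disc)) / 2;
  if (q === 0) return [0, 0]; // only happens when b = 0 and c = 0
  return [q / a, c / q];
}

const [farRoot, nearRoot] = solveQuadratic(1, 1e8, 1);
console.log(`root far from zero = ${farRoot}`);
console.log(`root near zero     = ${nearRoot}`);
```

The second root uses the product-of-roots identity x1·x2 = c/a, so the small root inherits the accuracy of the large one rather than being computed by a cancelling subtraction.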
Never test two floating-point results for exact equality. Because
0.1 + 0.2 stores as 0.30000000000000004, the
check 0.1 + 0.2 === 0.3 returns false, the kind of rounding surprise behind bugs that have cost real money.
Two other landmines share the same root cause:
- Addition is not associative.
(a + b) + c can differ from
a + (b + c), because each
+ rounds. Adding a tiny number to a big one
can lose the tiny one entirely, so to add a long list of same-sign numbers more accurately, sum
from smallest magnitude to largest.
- Compare with a tolerance, not for equality. Ask whether
|\hat a - \hat b| < \tau for a small
\tau chosen to suit the problem's scale, usually a small multiple of
\varepsilon_{\text{mach}} times the size of the numbers involved.
console.log(`0.1 + 0.2 = ${0.1 + 0.2}`);
console.log(`equal to 0.3 with === ? ${0.1 + 0.2 === 0.3}`);
const tol = 1e-9;
console.log(`equal within tolerance? ${Math.abs(0.1 + 0.2 - 0.3) < tol}`);
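Both landmines can be reproduced in a few lines. The sketch below first shows grouping and ordering effects, then a comparison helper whose tolerance scales with the operands. `sumInOrder` and `nearlyEqual` are hypothetical helpers, and the default of 8·ε is an arbitrary illustrative choice, not a recommendation from this page.

```typescript
// 1. Grouping changes the rounding.
const leftGrouped: number = (0.1 + 0.2) + 0.3;  // 0.6000000000000001
const rightGrouped: number = 0.1 + (0.2 + 0.3); // 0.6

// 2. Order changes what survives when magnitudes differ.
function sumInOrder(values: number[]): number {
  let total = 0;
  for (const v of values) total += v;
  return total;
}
const tenOnes: number[] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1];
const bigFirst: number = sumInOrder([1e16, ...tenOnes]);   // every 1 is rounded away
const smallFirst: number = sumInOrder([...tenOnes, 1e16]); // the 1s combine to 10 first

// 3. Tolerance that scales with the operands, with an optional absolute
//    floor for values that are expected to be near zero.
function nearlyEqual(
  a: number,
  b: number,
  relTol: number = 8 * Number.EPSILON,
  absTol: number = 0,
): boolean {
  const scale = Math.max(Math.abs(a), Math.abs(b));
  return Math.abs(a - b) <= Math.max(absTol, relTol * scale);
}

console.log(leftGrouped, rightGrouped);
console.log(bigFirst, smallFirst);
console.log(nearlyEqual(0.1 + 0.2, 0.3)); // true
```

A purely relative test never accepts a tiny nonzero value as equal to 0, which is why the absolute floor is offered as a separate parameter.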
Errors that compound: conditioning, briefly
Even with perfect arrangements, some problems amplify whatever tiny input error they are handed.
A problem is ill-conditioned when a small relative change in the input forces a large
relative change in the output — the calculation is a lever that magnifies uncertainty. The most famous
example is \tan x near x = \tfrac{\pi}{2}, where a
whisker of input change sends the output racing to infinity. No cleverness in coding can save an
ill-conditioned problem; the sensitivity is baked into the mathematics, not the machine. We measure it
precisely with the condition number, which returns in full when we study
iterative
methods and conditioning for linear systems. For now, hold the intuition: total
error ≈ (how ill-conditioned the problem is) × (how unstable your algorithm is). Numerical
analysis is the discipline of keeping both factors small.
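The tangent example can be checked numerically. The sketch assumes the usual definition of the relative condition number of a differentiable function, κ(x) = |x f′(x) / f(x)|, which this page has not yet introduced; for f = tan it simplifies to |2x / sin 2x|. Both function names are hypothetical.

```typescript
// Relative condition number of tan at x, using kappa = |x f'(x) / f(x)| = |2x / sin(2x)|.
function tanConditionNumber(x: number): number {
  return Math.abs((2 * x) / Math.sin(2 * x));
}

// Empirical amplification: relative output change divided by relative input change.
function measuredAmplification(
  f: (x: number) => number,
  x: number,
  relChange: number,
): number {
  const before = f(x);
  const after = f(x * (1 + relChange));
  return Math.abs((after - before) / before) / relChange;
}

for (const x of [0.5, 1.5707]) {
  const predicted = tanConditionNumber(x);
  const observed = measuredAmplification(Math.tan, x, 1e-8);
  console.log(`x = ${x}: predicted ${predicted.toFixed(3)}, observed ${observed.toFixed(3)}`);
}
```

At x = 0.5 the amplification is close to 1, so input errors pass through almost unchanged; at x = 1.5707, just short of π/2, the same relative nudge is magnified more than ten-thousand-fold, whichever algorithm evaluates the tangent.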