Floating-Point Representation
With whole numbers, binary is comfortable: a pattern of bits like
01101 means the fixed value 13. But real
computing needs far more than whole numbers. A physics engine wants
9.81, a bank wants 3.75, an astronomer wants
1.989 \times 10^{30} (the mass of the Sun in kilograms) and a chemist
wants 0.000000000053 (roughly the radius of a hydrogen atom in metres).
We need to store numbers that have a fractional part and that can be
absolutely tiny or astronomically huge — all inside a fixed, small box of bits.
Floating-point representation is how computers pull this off. The trick is the same
one you already use in scientific notation, moved into binary. This page teaches that one
idea: split the bits into a mantissa (the significant digits) and an
exponent (where to put the binary point), so a single fixed-size word can float its
point across a colossal range of values.
The idea you already know: scientific notation
Scientists never write 1{,}989{,}000{,}000{,}000{,}000{,}000{,}000{,}000{,}000{,}000.
They write 1.989 \times 10^{30}. That expression has exactly two moving
parts:
- the significant digits — 1.989 — which fix the
precision, the actual digits of the number;
- the power of ten — 30 — which just says how far, and
which way, to slide the decimal point.
The same number can be written many ways —
19.89 \times 10^{29} or 0.1989 \times 10^{31}
— but the digits and the power together always rebuild the value. Floating-point does precisely
this, with one change: the digits are binary and the power is a power of two, not
ten. A number is stored as
\text{value} = \text{mantissa} \times 2^{\text{exponent}}.
The mantissa carries the digits; the exponent slides the binary point left or right.
Because the point can float to wherever the exponent puts it, we call it floating-point.
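The formula can be tried directly in code. This is a minimal sketch: the function name `floatValue` is illustrative rather than part of any library, and the three sample pairs are chosen here only to show that different mantissa and exponent pairs can rebuild the same value.

```typescript
// Rebuild a value from its two parts: value = mantissa * 2^exponent.
function floatValue(mantissa: number, exponent: number): number {
  return mantissa * Math.pow(2, exponent);
}

// Three different (mantissa, exponent) pairs that all describe 5.5.
console.log(floatValue(5.5, 0));    // 5.5 * 1
console.log(floatValue(1.375, 2));  // 1.375 * 4
console.log(floatValue(0.6875, 3)); // 0.6875 * 8
```

Each call prints 5.5, because raising the exponent by one exactly cancels halving the mantissa.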
Binary fractions first
Before the point can float, we need to read a binary number that has a point. The columns to the
left of the point simply carry on rightward past it, each worth half the one before it:
4, 2, 1, then after the point \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}, and so on.
So 101.101_2 is
4 + 1 + \tfrac{1}{2} + \tfrac{1}{8} = 5.625. The columns after the point
are the halves, quarters, eighths, and so on. Every value a float can hold
is built from these negative powers of two — and, as we'll see in "Watch out!", that is exactly why
some friendly decimals cannot be stored exactly.
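Reading a binary fraction column by column can be sketched as a short function. The name `binaryToDecimal` and the choice to pass the bits as a string are assumptions made for illustration; the input is assumed to contain only 0, 1 and at most one point.

```typescript
// Convert a binary string such as "101.101" to its decimal value
// by adding up the place value of every 1 bit.
function binaryToDecimal(bits: string): number {
  const parts = bits.split(".");
  const whole = parts[0] || "";
  const fraction = parts[1] || "";
  let value = 0;
  // Columns left of the point: 1, 2, 4, 8, ... (read right to left).
  for (let i = 0; i < whole.length; i++) {
    if (whole.charAt(whole.length - 1 - i) === "1") value += Math.pow(2, i);
  }
  // Columns right of the point: 1/2, 1/4, 1/8, ...
  for (let i = 0; i < fraction.length; i++) {
    if (fraction.charAt(i) === "1") value += Math.pow(2, -(i + 1));
  }
  return value;
}

console.log(binaryToDecimal("101.101")); // 5.625
console.log(binaryToDecimal("0.01"));    // 0.25
```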
Splitting the word: sign, exponent, mantissa
A floating-point number is stored by chopping a fixed word of bits into three fields. Here is a small
teaching example — a 1-bit sign, a
4-bit exponent, and a 7-bit
mantissa — laid out across the word:
- Sign — one bit: 0 for positive,
1 for negative.
- Exponent — a whole number (often stored in
two's complement in A-level formats, so it can be negative; IEEE 754 stores a biased
exponent instead) that says how far to slide the binary point.
- Mantissa — the significant bits, the actual digits of the number.
Real formats use the same three fields, just wider. The IEEE 754 single-precision format, which
most languages call a float, is 32 bits (1 sign, 8 exponent, 23 mantissa), and a double is
64 bits (1 sign, 11 exponent, 52 mantissa).
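To see the three fields of a real 32-bit float, the bits can be pulled apart with the standard `DataView` API. This is a sketch under stated assumptions: `float32Fields` and the `Float32Fields` interface are names invented here, the value 5.5 is just a sample input, and the exponent is shown as the raw stored bits (IEEE 754 stores it with an offset of 127, a detail this page does not otherwise cover).

```typescript
interface Float32Fields {
  sign: number;         // 1 bit
  exponentBits: number; // 8 bits, as stored (offset by 127 in IEEE 754)
  mantissaBits: number; // 23 bits, as stored (the leading 1 is not stored)
}

// Split a number, stored as a 32-bit IEEE 754 float, into its three fields.
function float32Fields(x: number): Float32Fields {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, x);          // write x as a 32-bit float
  const bits = view.getUint32(0); // read the same 32 bits back as an integer
  return {
    sign: bits >>> 31,                  // top bit
    exponentBits: (bits >>> 23) & 0xff, // next 8 bits
    mantissaBits: bits & 0x7fffff,      // low 23 bits
  };
}

const sampleFields = float32Fields(5.5);
console.log("sign:", sampleFields.sign);
console.log("exponent field:", sampleFields.exponentBits.toString(2));
console.log("mantissa field:", sampleFields.mantissaBits.toString(2));
```

For 5.5 the stored exponent field is 129, which is 127 + 2, so the underlying exponent is 2.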
Normalisation: one tidy form for every number
We saw that 1.989 \times 10^{30} and
0.1989 \times 10^{31} are the same value. That freedom is a problem for a
computer: which one do we store? We fix it by agreeing on a single normalised form,
so every number has exactly one representation and the mantissa's bits are never wasted on leading
zeros.
There are two common conventions, and A-level exam boards use both, so know them:
- The mantissa is a fraction beginning 0.1\ldots — the first bit after
the point is always a 1 (for a positive number). Value lies in
[0.5,\, 1).
- Or the mantissa begins 1.\ldots — a single
1 before the point. This is what IEEE 754 uses, and the leading
1 is so predictable it isn't even stored (the "hidden bit").
Either way the rule is the same: shift the binary point until the leading digit is a
1, and adjust the exponent to compensate. Slide the point left by
one place and the exponent goes up by one; slide it right and the exponent goes down. The value never
changes — only its bookkeeping.
A non-zero binary value is normalised by writing it as
m \times 2^{e} where the mantissa m has its
leading significant bit in the agreed position:
- in the 0.1\ldots convention, 0.5 \le |m| < 1;
- in the 1.\ldots (IEEE) convention, 1 \le |m| < 2.
This makes the representation unique and spends every mantissa bit on real
precision instead of leading zeros.
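The shift-and-compensate rule can be sketched as a loop. This is an illustrative version for the 1.… convention only; the names `normalise` and `Normalised` are assumptions, and real hardware works on the bits directly rather than by repeated halving and doubling.

```typescript
interface Normalised {
  mantissa: number;
  exponent: number;
}

// Normalise a non-zero value into the 1.… convention: 1 <= |mantissa| < 2.
function normalise(value: number): Normalised {
  if (value === 0 || !isFinite(value)) {
    throw new Error("only finite, non-zero values can be normalised");
  }
  let mantissa = value;
  let exponent = 0;
  // Too big: slide the point left (halve the mantissa), exponent goes up.
  while (Math.abs(mantissa) >= 2) {
    mantissa /= 2;
    exponent += 1;
  }
  // Too small: slide the point right (double the mantissa), exponent goes down.
  while (Math.abs(mantissa) < 1) {
    mantissa *= 2;
    exponent -= 1;
  }
  return { mantissa, exponent };
}

console.log(normalise(5.5));     // mantissa 1.375, exponent 2
console.log(normalise(0.15625)); // mantissa 1.25, exponent -3
```

To get the 0.1… convention from this result, halve the mantissa once more and add one to the exponent.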
Worked example — store 5.5
Small and clean, step by step, using the 1.\ldots convention.
- Write it in binary: 5.5 = 101.1_2
(that's 4 + 1 + \tfrac{1}{2}).
- Normalise: slide the point two places left so a single
1 leads. 101.1_2 = 1.011_2 \times 2^{2}.
- Read off the fields: sign = 0 (positive);
exponent = 2; mantissa digits are
1.011.
And to go backwards you simply undo it: take the mantissa
1.011_2, shift the point right by the exponent
(2 places) to get 101.1_2, and read that as
5.5. The same steps work for a negative number:
-6 = -110_2 = -1.10_2 \times 2^{2}, so sign
= 1, exponent = 2, mantissa
1.10.
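The backwards direction of the worked example can be checked in code. A minimal sketch: `decodeFields` is a hypothetical helper that takes the mantissa as a binary string in the 1.… convention, exactly as it is written in the example, rather than as packed bits.

```typescript
// Rebuild a value from the three fields read off in the worked example.
// mantissa is a binary string in the 1.… convention, e.g. "1.011".
function decodeFields(sign: 0 | 1, exponent: number, mantissa: string): number {
  const pointIndex = mantissa.indexOf(".");
  const fractionLength = pointIndex === -1 ? 0 : mantissa.length - pointIndex - 1;
  const digits = mantissa.replace(".", ""); // "1.011" -> "1011"
  // Read the digits as a whole number, then put the point back.
  const m = parseInt(digits, 2) / Math.pow(2, fractionLength);
  const magnitude = m * Math.pow(2, exponent);
  return sign === 1 ? -magnitude : magnitude;
}

console.log(decodeFields(0, 2, "1.011")); // 5.5
console.log(decodeFields(1, 2, "1.10"));  // -6
```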
The great trade-off: precision vs range
Here is the heart of floating-point, and the favourite exam question. The total number of bits in a
word is fixed. Every bit you give to the exponent is a bit you take away from the
mantissa, and vice versa. So the two fields pull against each other:
- More mantissa bits → more significant digits → more
precision (values are stored more accurately, with smaller rounding error).
- More exponent bits → the point can slide further → more
range (you can reach far bigger and far smaller magnitudes).
Picture a fixed 12-bit word in which one bit is always the sign. The remaining 11 bits form two
"budgets" that trade off: every bit moved from the mantissa to the exponent buys more range and
costs precision, and every bit moved the other way does the opposite.
There is no free lunch: a designer picks the split to suit the job. When a 32-bit float offers
too little of either, the double spends its extra 32 bits on both (3 more exponent bits and
29 more mantissa bits), giving far more range and far more precision than a
float.
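The trade-off on a 12-bit word can be tabulated. This sketch rests on two simplifying assumptions that are not fixed by the text above: the exponent is held in two's complement, so an e-bit exponent peaks at 2^(e-1) - 1, and neighbouring mantissa values are taken to differ by roughly 2^-m for an m-bit mantissa. The names `Split` and `splits` are illustrative.

```typescript
interface Split {
  exponentBits: number;
  mantissaBits: number;
  largestExponent: number; // bigger means more range
  relativeStep: number;    // smaller means more precision
}

// List every way to share a word between exponent and mantissa.
function splits(wordBits: number): Split[] {
  const rows: Split[] = [];
  const free = wordBits - 1; // one bit is always the sign
  for (let exponentBits = 1; exponentBits < free; exponentBits++) {
    const mantissaBits = free - exponentBits;
    rows.push({
      exponentBits,
      mantissaBits,
      // Assumption: two's complement exponent, largest value 2^(e-1) - 1.
      largestExponent: Math.pow(2, exponentBits - 1) - 1,
      // Assumption: neighbouring mantissas differ by about 2^-mantissaBits.
      relativeStep: Math.pow(2, -mantissaBits),
    });
  }
  return rows;
}

splits(12).forEach((row) => {
  console.log(
    "exponent bits:", row.exponentBits,
    "mantissa bits:", row.mantissaBits,
    "largest exponent:", row.largestExponent,
    "relative step:", row.relativeStep
  );
});
```

Reading down the rows, the largest exponent grows while the relative step gets coarser, which is the trade-off in numbers.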
See the rounding error for yourself
Because a float has only finitely many mantissa bits, most real numbers get rounded to the nearest
value it can represent. The famous case: 0.1 and
0.2 each round slightly, and the tiny errors add up. Run this — the answer
is not what school arithmetic promises:
// Every JavaScript/TypeScript number is a 64-bit float.
const a: number = 0.1;
const b: number = 0.2;
const sum: number = a + b;
console.log("0.1 + 0.2 =", sum); // NOT exactly 0.3
console.log("Is it 0.3?", sum === 0.3); // false!
console.log("The gap:", sum - 0.3); // a tiny leftover error
// 0.1 in binary is a recurring fraction, so it can't be stored exactly:
console.log("0.1 to 20 dp:", (0.1).toFixed(20));
The lesson every programmer learns: never test floating-point numbers for exact
equality. Instead, check that they are close enough:
const sum: number = 0.1 + 0.2;
// Compare within a tiny tolerance instead of using ===
const closeEnough: boolean = Math.abs(sum - 0.3) < 1e-9;
console.log("0.1 + 0.2 within 1e-9 of 0.3?", closeEnough); // true
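A fixed tolerance such as 1e-9 suits values near 1, but it can be too strict for very large numbers. One common variation, sketched here as a hypothetical helper `nearlyEqual`, scales the tolerance by the size of the values being compared; the default of 1e-9 and the floor of 1 are illustrative choices, not a standard.

```typescript
// Hypothetical helper: true when a and b differ by at most relTol,
// measured relative to the larger magnitude (with a floor of 1 near zero).
function nearlyEqual(a: number, b: number, relTol: number = 1e-9): boolean {
  const scale = Math.max(Math.abs(a), Math.abs(b), 1);
  return Math.abs(a - b) <= relTol * scale;
}

console.log(nearlyEqual(0.1 + 0.2, 0.3));      // true
console.log(nearlyEqual(1e12 + 0.0001, 1e12)); // true: tiny relative to 1e12
console.log(nearlyEqual(1, 1.001));            // false: a real difference
```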
Watch out!
A floating-point number is almost always an approximation, not the exact value.
Just as \tfrac{1}{3} = 0.333\ldots never terminates in decimal, the
friendly decimal 0.1 is a recurring fraction in binary
(0.0001100110011\ldots_2) and simply cannot be written exactly in any
finite mantissa. So the computer stores the nearest representable value, and that rounding error is
real: it accumulates across a long calculation, which is why you must
never use == to compare floats — compare within a small tolerance
instead.
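The accumulation is easy to observe by adding 0.1 over and over. A small sketch, with `repeatedTenths` as an illustrative name; the exact size of the leftover error depends on the order of the additions, so only its presence is shown here.

```typescript
// Add 0.1 to a running total `count` times.
function repeatedTenths(count: number): number {
  let total = 0;
  for (let i = 0; i < count; i++) {
    total += 0.1;
  }
  return total;
}

const tenTenths = repeatedTenths(10);
console.log("ten tenths === 1?", tenTenths === 1);                 // false
console.log("error after 10 additions:", Math.abs(tenTenths - 1)); // tiny but non-zero
console.log(
  "error after 1,000,000 additions:",
  Math.abs(repeatedTenths(1000000) - 100000)                       // noticeably larger
);
```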
The same finiteness drives the precision-vs-range trade-off: with a fixed word
size you cannot have both maximum precision and maximum range at once. Spend bits on the
mantissa and you buy accuracy but shrink your reach; spend them on the exponent and you buy reach
but coarsen every value. Choosing that split wisely — or reaching for a double — is
part of the job.
The older, simpler scheme was fixed-point: the binary point sits at one agreed
column and never moves, so every value has the same number of fractional bits. That's easy but
wasteful — you can't represent both 0.0001 and
10000.0 well with the same fixed split. Floating-point lets the point
float: the exponent field records where the point currently sits, so it can move to wherever
the number needs it. Same total bits, dramatically more useful range.
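The weakness of fixed-point shows up quickly in a sketch. Here the point is assumed to sit 8 bits from the right, so every value is stored as a whole number of 256ths; `FRACTION_BITS`, `toFixedPoint` and `fromFixedPoint` are names chosen for illustration.

```typescript
const FRACTION_BITS = 8;                  // assumed: 8 bits after the point
const SCALE = Math.pow(2, FRACTION_BITS); // 256

// Store a value as a whole number of 1/256 steps.
function toFixedPoint(x: number): number {
  return Math.round(x * SCALE);
}

// Recover the value from its stored integer.
function fromFixedPoint(stored: number): number {
  return stored / SCALE;
}

console.log(fromFixedPoint(toFixedPoint(5.625)));  // 5.625, stored exactly
console.log(fromFixedPoint(toFixedPoint(0.0001))); // 0: too small for 8 fractional bits
```

Moving the fixed split to give more fractional bits would rescue 0.0001, but only by taking bits away from the whole-number part.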