Floating-Point Representation

With whole numbers, binary is comfortable: a pattern of bits like 01101 means the fixed value 13. But real computing needs far more than whole numbers. A physics engine wants 9.81, a bank wants 3.75, an astronomer wants 1.989 \times 10^{30} (the mass of the Sun in kilograms) and a chemist wants 0.000000000053 (roughly the radius of a hydrogen atom in metres). We need to store numbers that have a fractional part and that can be absolutely tiny or astronomically huge — all inside a fixed, small box of bits.

Floating-point representation is how computers pull this off. The trick is the same one you already use in scientific notation, moved into binary. This page teaches that one idea: split the bits into a mantissa (the significant digits) and an exponent (where to put the binary point), so a single fixed-size word can float its point across a colossal range of values.

The idea you already know: scientific notation

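Scientists never write 1{,}989{,}000{,}000{,}000{,}000{,}000{,}000{,}000{,}000{,}000. They write 1.989 \times 10^{30}. That expression has exactly two moving parts: the digits (1.989), which say what the number looks like, and the power of ten (30), which says where the decimal point belongs.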

The same number can be written many ways — 19.89 \times 10^{29} or 0.1989 \times 10^{31} — but the digits and the power together always rebuild the value. Floating-point does precisely this, with one change: the digits are binary and the power is a power of two, not ten. A number is stored as

\text{value} = \text{mantissa} \times 2^{\text{exponent}}.

The mantissa carries the digits; the exponent slides the binary point left or right. Because the point can float to wherever the exponent puts it, we call it floating-point.
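As a quick check of the formula, the sketch below evaluates mantissa times 2 to the exponent for three different pairs. The helper name `floatValue` is hypothetical, chosen only for this illustration, and the pairs are picked so that each one is exact in binary.

```typescript
// Hypothetical helper: rebuild a value from its two parts.
function floatValue(mantissa: number, exponent: number): number {
  return mantissa * 2 ** exponent;
}

// Three different mantissa/exponent pairs that rebuild the same value.
const pairA: number = floatValue(1.375, 2);  // 1.375  x 4
const pairB: number = floatValue(0.6875, 3); // 0.6875 x 8
const pairC: number = floatValue(2.75, 1);   // 2.75   x 2

console.log(pairA, pairB, pairC); // 5.5 5.5 5.5
```

Moving the point one place in the mantissa is always balanced by a change of one in the exponent, which is why all three pairs agree.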

Binary fractions first

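Before the point can float, we need to read a binary number that has a point. The columns to the left of the point work as usual, and the pattern continues to the right: each column is worth half the one before it. For a number with three digits on each side of the point, the columns are worth 4, 2, 1, then \tfrac{1}{2}, \tfrac{1}{4}, \tfrac{1}{8}.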

So 101.101_2 is 4 + 1 + \tfrac{1}{2} + \tfrac{1}{8} = 5.625. The columns after the point are the halves, quarters, eighths, and so on. Every value a float can hold is built from these negative powers of two — and, as we'll see in "Watch out!", that is exactly why some friendly decimals cannot be stored exactly.
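The column-by-column reading can be sketched in code. The function name `binaryToDecimal` is hypothetical, and the sketch assumes its input contains only the characters 0, 1, and at most one point.

```typescript
// Hypothetical helper: read a binary string such as "101.101" column by column.
function binaryToDecimal(bits: string): number {
  const pieces: string[] = bits.split(".");
  const whole: string = pieces[0] ?? "";
  const fraction: string = pieces[1] ?? "";

  let total = 0;
  // Left of the point: each new bit doubles the running total, then adds itself.
  for (let i = 0; i < whole.length; i++) {
    total = total * 2 + (whole.charAt(i) === "1" ? 1 : 0);
  }
  // Right of the point: the columns are worth 1/2, 1/4, 1/8, ...
  let columnWorth = 0.5;
  for (let i = 0; i < fraction.length; i++) {
    if (fraction.charAt(i) === "1") {
      total += columnWorth;
    }
    columnWorth /= 2;
  }
  return total;
}

console.log(binaryToDecimal("101.101")); // 5.625
console.log(binaryToDecimal("101.1"));   // 5.5
```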

Splitting the word: sign, exponent, mantissa

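A floating-point number is stored by chopping a fixed word of bits into three fields: a sign bit (positive or negative), an exponent (where the binary point sits), and a mantissa (the significant digits). A small teaching example uses a 1-bit sign, a 4-bit exponent, and a 7-bit mantissa, 12 bits in all.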

Real formats use the same three fields, just wider. IEEE 754 single precision, the type most languages call a float, is 32 bits: 1 sign, 8 exponent, 23 mantissa. A double is 64 bits: 1 sign, 11 exponent, 52 mantissa.
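To see the three fields in a real format, the sketch below writes a number into 32 bits and slices the word apart. It relies on two IEEE 754 details that go beyond this page: the exponent is stored with 127 added to it (a bias), and the leading 1 of the mantissa is implied rather than stored. The function name `float32Fields` is hypothetical.

```typescript
// Split a number, stored as an IEEE 754 single-precision float, into its three fields.
function float32Fields(value: number): { sign: number; exponentBits: number; mantissaBits: number } {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, value);              // write the value as a 32-bit float
  const word: number = view.getUint32(0); // read the same 32 bits back as an integer

  return {
    sign: word >>> 31,                  // top 1 bit
    exponentBits: (word >>> 23) & 0xff, // next 8 bits, stored with a bias of 127
    mantissaBits: word & 0x7fffff,      // low 23 bits (leading 1 is implied, not stored)
  };
}

const fieldsOf55 = float32Fields(5.5);
console.log(fieldsOf55.sign);                      // 0
console.log(fieldsOf55.exponentBits - 127);        // 2
console.log(fieldsOf55.mantissaBits.toString(16)); // 300000
```

The hexadecimal 300000 is the bit pattern 011 followed by twenty zeros: the fractional digits of 1.011, which is 5.5 written with a single leading 1.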

Normalisation: one tidy form for every number

We saw that 1.989 \times 10^{30} and 0.1989 \times 10^{31} are the same value. That freedom is a problem for a computer: which one do we store? We fix it by agreeing on a single normalised form, so every number has exactly one representation and the mantissa's bits are never wasted on leading zeros.

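There are two common conventions, and A-level exam boards use both, so know them: the 1.\ldots form, where the binary point sits just after the leading 1, and the 0.1\ldots form, where the point sits just before it.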

Either way the rule is the same: shift the binary point until the leading digit is a 1, and adjust the exponent to compensate. Slide the point left by one place and the exponent goes up by one; slide it right and the exponent goes down. The value never changes — only its bookkeeping.

A non-zero binary value is normalised by writing it as m \times 2^{e}, where the mantissa m has its leading significant bit in the agreed position. This makes the representation unique and spends every mantissa bit on real precision instead of leading zeros.
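The shift-and-compensate rule can be sketched as a pair of loops. This is a minimal illustration for the 1.… convention only; the name `normalise` and its return shape are hypothetical, and the sketch deliberately rejects zero, negative, and non-finite inputs.

```typescript
// Hypothetical helper: normalise a positive number to the 1.xxx convention,
// returning a mantissa in [1, 2) and the matching power of two.
function normalise(value: number): { mantissa: number; exponent: number } {
  if (!(value > 0) || !isFinite(value)) {
    throw new Error("this sketch handles positive, finite values only");
  }
  let mantissa = value;
  let exponent = 0;
  // Too big: slide the point left (halve the mantissa), so the exponent goes up.
  while (mantissa >= 2) {
    mantissa /= 2;
    exponent += 1;
  }
  // Too small: slide the point right (double the mantissa), so the exponent goes down.
  while (mantissa < 1) {
    mantissa *= 2;
    exponent -= 1;
  }
  return { mantissa, exponent };
}

console.log(normalise(5.5));   // { mantissa: 1.375, exponent: 2 }
console.log(normalise(0.375)); // { mantissa: 1.5, exponent: -2 }
```

Here 1.375 is 1.011 in binary and 1.5 is 1.1, so both results have a single leading 1 before the point.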

Worked example — store 5.5

Small and clean, step by step, using the 1.\ldots convention.

  1. Write it in binary: 5.5 = 101.1_2 (that's 4 + 1 + \tfrac{1}{2}).
  2. Normalise: slide the point two places left so a single 1 leads. 101.1_2 = 1.011_2 \times 2^{2}.
  3. Read off the fields: sign = 0 (positive); exponent = 2; mantissa digits are 1.011.

To go backwards, simply undo it: take the mantissa 1.011_2, shift the point right by the exponent (2 places) to get 101.1_2, and read that as 5.5. The same steps work for a negative number, with the sign bit set: -6 = -110_2 = -1.10_2 \times 2^{2}, so sign = 1, exponent = 2, mantissa 1.10.
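Both worked examples can be checked by decoding the three fields back into a number. The decoder below is a hypothetical sketch: it takes the mantissa as a bit string in the 1.… convention and assumes exactly one digit before the point.

```typescript
// Hypothetical decoder for the worked examples: sign bit, exponent, and the
// mantissa written as a bit string in the 1.xxx convention (e.g. "1.011").
function decodeFields(sign: 0 | 1, exponent: number, mantissa: string): number {
  const digits: string = mantissa.replace(".", ""); // "1.011" -> "1011"
  let magnitude = 0;
  for (let i = 0; i < digits.length; i++) {
    if (digits.charAt(i) === "1") {
      // The first digit sits in the 2^exponent column; each later digit is worth half as much.
      magnitude += 2 ** (exponent - i);
    }
  }
  return sign === 1 ? -magnitude : magnitude;
}

console.log(decodeFields(0, 2, "1.011")); // 5.5
console.log(decodeFields(1, 2, "1.10"));  // -6
```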

The great trade-off: precision vs range

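Here is the heart of floating-point, and the favourite exam question. The total number of bits in a word is fixed. Every bit you give to the exponent is a bit you take away from the mantissa, and vice versa. So the two fields pull against each other: more mantissa bits mean finer precision but a smaller range, and more exponent bits mean a wider range but coarser precision.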

Take a fixed 12-bit word in which one bit is always the sign. The other 11 bits form two budgets, exponent and mantissa, that trade off against each other: each split buys a different amount of precision and range.

There is no free lunch: a designer picks the split to suit the job. Some workloads need more precision and others more range, which is exactly why the double exists. It spends its extra 32 bits on both fields (3 more exponent bits and 29 more mantissa bits), giving far more of each than a float.
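The trade-off on a 12-bit word with one sign bit can be tabulated with a short sketch. One assumption is made here that the text does not state: the exponent is treated as a two's-complement integer, so k exponent bits give a largest exponent of 2^(k-1) - 1. The names `Split` and `tradeOff` are hypothetical.

```typescript
// Assumption for this sketch: the exponent is a two's-complement integer,
// so k exponent bits give a largest exponent of 2^(k-1) - 1.
const WORD_BITS = 12;
const SIGN_BITS = 1;

interface Split {
  exponentBits: number;
  mantissaBits: number;
  largestExponent: number; // range: how far the point can float
  finestStep: number;      // precision: value of the last mantissa bit
}

function tradeOff(exponentBits: number): Split {
  const mantissaBits = WORD_BITS - SIGN_BITS - exponentBits;
  return {
    exponentBits,
    mantissaBits,
    largestExponent: 2 ** (exponentBits - 1) - 1,
    finestStep: 2 ** -mantissaBits,
  };
}

// Print one row per split of the 11 non-sign bits.
for (let expBits = 2; expBits <= 9; expBits++) {
  const row = tradeOff(expBits);
  console.log(
    `exponent ${row.exponentBits} / mantissa ${row.mantissaBits}: ` +
    `largest exponent ${row.largestExponent}, finest step ${row.finestStep}`
  );
}
```

With 4 exponent bits the split matches the teaching example: 7 mantissa bits, a largest exponent of 7, and a finest step of 0.0078125. Each extra exponent bit roughly doubles the largest exponent while doubling the size of the finest step.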

See the rounding error for yourself

Because a float has only finitely many mantissa bits, most real numbers get rounded to the nearest value it can represent. The famous case: 0.1 and 0.2 each round slightly, and the tiny errors add up. Run this — the answer is not what school arithmetic promises:

```typescript
// Every JavaScript/TypeScript number is a 64-bit float.
const a: number = 0.1;
const b: number = 0.2;
const sum: number = a + b;

console.log("0.1 + 0.2 =", sum);        // NOT exactly 0.3
console.log("Is it 0.3?", sum === 0.3); // false!
console.log("The gap:", sum - 0.3);     // a tiny leftover error

// 0.1 in binary is a recurring fraction, so it can't be stored exactly:
console.log("0.1 to 20 dp:", (0.1).toFixed(20));
```

The lesson every programmer learns: never test floating-point numbers for exact equality. Instead, check that they are close enough:

```typescript
const sum: number = 0.1 + 0.2;

// Compare within a tiny tolerance instead of using ===
const closeEnough: boolean = Math.abs(sum - 0.3) < 1e-9;
console.log("0.1 + 0.2 within 1e-9 of 0.3?", closeEnough); // true
```

Watch out!

A floating-point number is almost always an approximation, not the exact value. Just as \tfrac{1}{3} = 0.333\ldots never terminates in decimal, the friendly decimal 0.1 is a recurring fraction in binary (0.0001100110011\ldots_2) and cannot be written exactly in any finite mantissa. So the computer stores the nearest representable value, and that rounding error is real: it can accumulate across a long calculation. This is why you should not compare floats with == or ===; compare within a small tolerance instead.
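Accumulation is easy to see by repeating one inexact addition. The sketch below adds 0.1 ten times, then compares the result with a tolerance scaled to the size of the inputs. The helper `nearlyEqual` and its default tolerance of 1e-9 are hypothetical choices for illustration, not a standard API.

```typescript
// Add 0.1 ten times; each addition rounds, so the total drifts away from 1.
let runningTotal = 0;
for (let i = 0; i < 10; i++) {
  runningTotal += 0.1;
}
console.log(runningTotal === 1); // false
console.log(runningTotal);       // 0.9999999999999999

// Hypothetical helper: compare with a tolerance scaled to the size of the inputs.
function nearlyEqual(x: number, y: number, relTol: number = 1e-9): boolean {
  const scale = Math.max(Math.abs(x), Math.abs(y), 1);
  return Math.abs(x - y) <= relTol * scale;
}

console.log(nearlyEqual(runningTotal, 1)); // true
```

Scaling the tolerance matters for large values, where the gap between neighbouring floats is itself much bigger than a fixed 1e-9.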

The same finiteness drives the precision-vs-range trade-off: with a fixed word size you cannot have both maximum precision and maximum range at once. Spend bits on the mantissa and you buy accuracy but shrink your reach; spend them on the exponent and you buy reach but coarsen every value. Choosing that split wisely — or reaching for a double — is part of the job.

The older, simpler scheme was fixed-point: the binary point sits at one agreed column and never moves, so every value has the same number of fractional bits. That's easy but wasteful — you can't represent both 0.0001 and 10000.0 well with the same fixed split. Floating-point lets the point float: the exponent field records where the point currently sits, so it can move to wherever the number needs it. Same total bits, dramatically more useful range.
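The weakness of a fixed split shows up quickly in a small experiment. The sketch below assumes an unsigned 16-bit fixed-point word, a size chosen only for illustration; `toFixedPoint` is a hypothetical helper that returns the value actually stored, or null when the number does not fit.

```typescript
// Assumption for this sketch: an unsigned 16-bit fixed-point word whose
// binary point sits a fixed number of places from the right.
function toFixedPoint(value: number, fractionBits: number): number | null {
  const scaled = Math.round(value * 2 ** fractionBits); // the stored integer
  if (scaled < 0 || scaled > 0xffff) {
    return null; // does not fit in 16 bits
  }
  return scaled / 2 ** fractionBits; // the value actually represented
}

console.log(toFixedPoint(10000.0, 2));  // 10000
console.log(toFixedPoint(0.0001, 2));   // 0 (too small for 2 fraction bits)
console.log(toFixedPoint(0.0001, 15));  // 0.000091552734375 (nearest step)
console.log(toFixedPoint(10000.0, 15)); // null (too large with 15 fraction bits)
```

No single choice of `fractionBits` handles both numbers well, which is the gap the exponent field fills.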