Fault Tolerance

It is the busiest shopping hour of the year, and a hundred thousand people are trying to check out at once. Somewhere in a data centre a hard drive dies. A few racks over, a power supply hiccups. A network switch drops offline for eight seconds. On any single machine, each of these is a small catastrophe — and yet the checkout page never so much as flickers. Orders keep flowing, money keeps moving, and nobody outside the building ever knows anything broke.

That is fault tolerance: the art of building a reliable whole out of unreliable parts. Every physical component — disk, wire, fan, chip — will eventually fail; that is not a possibility to be engineered away but a certainty to be planned around. A fault-tolerant system is one that keeps giving correct answers and staying available despite the failure of some of its pieces. This page is about the one big idea that makes that possible — redundancy — and the machinery that turns it into working systems: replication and voting, checkpoints and recovery, and the surprisingly deep problem of even noticing that something has died. It builds directly on the system and failure models and on how nodes reach agreement under failure.

Fault, error, failure — three words, not one

Careless talk lumps everything under "it broke." Engineers who build reliable systems keep three stages carefully apart, because you can intervene at each one:

Fault — the underlying defect or flaw: a cosmic ray flips a bit, a solder joint cracks, a line of code has an off-by-one. A fault is a cause, and it may sit dormant for years.
Error — the fault becomes active and puts the system into a wrong internal state: the flipped bit is now read as data, the cracked joint drops a signal. An error is a wrong state that has not yet been noticed from outside.
Failure — the error escapes and the system's visible behaviour deviates from what it promised: the wrong answer reaches the user, the service goes down.

The chain runs fault → error → failure, and the whole discipline of fault tolerance is about breaking the chain before the last arrow — catching and masking an error so it never becomes a failure the outside world can see. A fault that never activates, or an error that is caught and corrected, causes no failure at all. That is the win.

Reliability — the probability the system runs correctly across a whole interval (often summarised as mean time to failure, MTTF).
Availability — the fraction of time the system is up and usable right now, \frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}, where MTTR is the mean time to repair. This is where the famous "number of nines" (99.999% ≈ five minutes of downtime a year) comes from.
Fault tolerance — the property of continuing to meet those goals even while some components are faulty.

Redundancy: the master tool

There is really only one fundamental weapon against component failure, and everything else is a variation on it: redundancy — deliberately keeping more than you strictly need, so that a survivor can carry on when a copy dies. It comes in three flavours, distinguished by what you duplicate:

Space redundancy — replicate the components. Run three servers where one would do; keep the same file on five disks. If a copy fails, another copy already holds the answer.
Time redundancy — replicate the execution. Retry the request, or re-run the computation. This is cheap (no extra hardware) and beats transient faults — a glitch that is gone by the second attempt — but does nothing for a permanently dead component.
Information redundancy — replicate the bits. Add extra, carefully chosen bits so the data can detect and even repair its own corruption: parity, checksums, error-correcting codes (ECC memory), RAID. A few redundant bits let you notice — and often fix — a flipped one.

For replicated components there are two organising styles, and the difference is what the backups do while all is well:

Active replication — every replica processes every request in lock-step, all the time. Failover is instant (the survivors were already computing the answer) but you pay full cost on every node continuously.
Passive replication (primary–backup) — one primary does the work and ships its state to idle backups. Cheaper, but when the primary dies there is a recovery gap while a backup is promoted and catches up.

N-modular redundancy and majority voting

Replication alone raises a sharp question: if you have three copies and they disagree, which one do you believe? The classic answer is N-modular redundancy (NMR): run N identical units on the same input and put a voter in front of the output that returns the majority answer. The famous case is N = 3 — Triple Modular Redundancy (TMR):

Because two of the three modules still agree on the right answer, the voter's majority rule masks the single fault entirely — the wrong value never leaves the box. In general an NMR system with N = 2f + 1 modules survives up to f simultaneous faulty modules: TMR (N=3) tolerates f = 1, five-way redundancy tolerates f = 2, and so on. This is exactly how flight computers, spacecraft, and safety-critical controllers keep flying with a chip on fire.

Sharp question — and it exposes the catch in every redundancy scheme: you have not removed the single point of failure, you have moved it to the voter. If the one voter dies, the whole gloriously-redundant triplet is useless. Real safety-critical systems answer this by replicating the voter too (and sometimes voting on the voters), pushing the single point of failure down to a tiny, exhaustively-verified, ultra-simple circuit whose failure probability is orders of magnitude below everything it guards. You can never reach zero — you can only keep chasing the weakest remaining link until it is negligibly small.

Checkpointing and recovery

Voting masks a fault at the instant it happens, but many systems face a different problem: a node crashes and restarts, losing everything in volatile memory. Re-running a computation that has been chugging for six hours from scratch is agony. The cure is checkpointing: periodically save a snapshot of the system's state to stable storage — durable storage (disk, replicated storage) that survives a crash. On restart, the node does not begin from zero; it rolls back to its most recent checkpoint and resumes from there.

A checkpoint alone loses everything since the last snapshot, so it is paired with a log. In a write-ahead log (WAL), every change is appended to a durable log before it is applied; in a message log, incoming messages are recorded. After a crash the recovery procedure is: load the last checkpoint, then replay the log forward to reconstruct everything that happened afterwards. Checkpoint plus log = you never lose committed work, and you never have to redo the hours before the checkpoint.

The design knob is how often to checkpoint, and it is a genuine trade-off:

Frequent checkpoints → little log to replay after a crash, so fast recovery — but each snapshot costs time and I/O, so high steady-state overhead when nothing is even failing.
Rare checkpoints → cheap during normal running, but a crash means replaying a long log, so slow recovery.

The sweet spot depends on how often you expect to crash and how much a slow recovery costs you — you pay a small tax constantly to buy a shorter outage occasionally. Databases live and die by exactly this WAL-plus-checkpoint machinery.

Detecting failure: heartbeats and the fundamental limit

None of this helps if the system cannot tell that a component has died. The workhorse mechanism is the heartbeat: each node periodically sends "I'm alive" to a monitor (or to its peers). If no heartbeat arrives within a timeout, the monitor declares the node dead and triggers failover — promoting a backup, rerouting traffic, kicking off recovery.

And here we hit one of the deepest truths in distributed systems, straight from the failure models: in an asynchronous network you cannot distinguish a slow node from a dead one. A missing heartbeat might mean the node crashed — or it might mean the node is perfectly healthy but its message is stuck behind a garbage-collection pause, a congested switch, or a briefly-partitioned network. The timeout is a guess, and you are forced to trade two kinds of mistake:

A short timeout detects real crashes quickly but falsely accuses healthy-but-slow nodes — triggering needless failovers, and in the worst case two nodes both acting as primary ("split brain").
A long timeout avoids false alarms but leaves the system limping on a dead node for longer before it reacts.

It is tempting to treat "the heartbeat timed out" as a fact: the node is gone, promote the backup, move on. It is not a fact — it is an inference under uncertainty, and it is sometimes wrong. The classic disaster is split brain: a network partition, not a crash, separates a still-running primary from its monitor; the monitor times out, declares it dead, and promotes a backup to primary. Now two primaries are both accepting writes on opposite sides of the partition, silently corrupting each other's data. This is exactly why serious systems don't rely on a lone timeout to make the call — they route the "who is alive" decision through consensus and fencing (majority quorums, leases) so that at most one node can win, even when the network is lying to everyone. A timeout tells you a node is unreachable, never that it is dead.

The reliability maths: why replicas win so fast

Here is the quantitative heart of why redundancy is so powerful. Suppose one node fails independently with probability p. If you keep k replicas and the system fails only when all of them are down at once, then — because independent probabilities multiply —

P(\text{all } k \text{ fail}) = p^{\,k}, \qquad \text{availability} = 1 - p^{\,k}.

The exponent is the magic. Because p < 1, raising it to a power drives it toward zero fast. Work a concrete example with p = 0.1 (each node is down 10% of the time — a fairly flaky node):

k = 1: fail prob 0.1 → availability 90%.
k = 2: fail prob 0.1^2 = 0.01 → availability 99%.
k = 3: fail prob 0.1^3 = 0.001 → availability 99.9%.
k = 4: fail prob 0.0001 → availability 99.99%.

Each replica you add tacks another nine onto the availability. Three flaky 90%-nodes combine into a 99.9% service — better than any one of them could ever be. Drag the slider below to feel how the curve climbs, and how a smaller per-node failure probability lifts the whole thing:

The whole p^k argument rests on one quietly enormous assumption: the replicas fail independently. In the real world they often don't. Put all three replicas in the same rack and one power supply takes out all three at once. Run the same buggy code on every replica and one poisoned input crashes all of them simultaneously. Depend on the same certificate and they all go dark the minute it expires. These are correlated failures (a shared root cause), and they shred the independence assumption: if the replicas really fail together with probability c, your true failure probability is roughly c, not the beautiful p^k — no matter how many copies you add. This is why real fault tolerance means diversity, not just quantity: spread replicas across racks, power domains, availability zones, and sometimes even different implementations, so there is no single fault that can fell them all together.

Compute it yourself

The program below does two things at once. First it prints the availability table for k = 1 \dots 5 given a per-node failure probability, so you can watch the nines pile up. Then it implements a tiny majority voter over three replica outputs — the beating heart of TMR — and shows it masking a single wrong answer. Press Run ▶, then try changing p or corrupting a second replica and see what the voter does:

// ---- 1) Availability = 1 - p^k for k = 1..5 ------------------------- const p = 0.1; // per-node failure probability (each node down 10% of the time) console.log("Per-node failure probability p = " + p); console.log("k P(all k fail) = p^k availability = 1 - p^k"); for (let k = 1; k <= 5; k++) { const allFail = Math.pow(p, k); const avail = 1 - allFail; console.log( String(k).padEnd(4) + allFail.toExponential(3).padEnd(24) + (avail * 100).toFixed(4) + "%", ); } // ---- 2) Majority voter over 3 replica outputs (TMR) ---------------- // Returns the value that appears most often; masks a single wrong one. function majority(outputs: number[]): number { const counts = new Map<number, number>(); for (const v of outputs) counts.set(v, (counts.get(v) ?? 0) + 1); let best = outputs[0]; let bestCount = 0; for (const [value, count] of counts) { if (count > bestCount) { best = value; bestCount = count; } } return best; } console.log(""); console.log("TMR voter (three replicas, correct answer is 4):"); console.log(" all agree [4, 4, 4] -> " + majority([4, 4, 4])); console.log(" one faulty [4, 7, 4] -> " + majority([4, 7, 4]) + " (7 outvoted, masked)"); console.log(" two faulty [7, 7, 4] -> " + majority([7, 7, 4]) + " (majority wrong -> TMR fails!)");

The last line is the honest limit: TMR masks one fault, but if two of the three go wrong the majority itself is wrong, and the voter faithfully outputs garbage. Tolerating more faults costs more modules — there is no free lunch, only a purchase you can choose to make.