Fault Tolerance
It is the busiest shopping hour of the year, and a hundred thousand people are trying to check out at
once. Somewhere in a data centre a hard drive dies. A few racks over, a power supply hiccups. A
network switch drops offline for eight seconds. On any single machine, each of these is a small
catastrophe — and yet the checkout page never so much as flickers. Orders keep flowing, money keeps
moving, and nobody outside the building ever knows anything broke.
That is fault tolerance: the art of building a reliable whole out of
unreliable parts. Every physical component — disk, wire, fan, chip — will eventually fail; that
is not a possibility to be engineered away but a certainty to be planned around. A fault-tolerant
system is one that keeps giving correct answers and staying available despite the failure of
some of its pieces. This page is about the one big idea that makes that possible — redundancy
— and the machinery that turns it into working systems: replication and voting, checkpoints and
recovery, and the surprisingly deep problem of even noticing that something has died. It
builds directly on the
system and failure models
and on how nodes reach
agreement under failure.
Fault, error, failure — three words, not one
Careless talk lumps everything under "it broke." Engineers who build reliable systems keep three
stages carefully apart, because you can intervene at each one:
-
Fault — the underlying defect or flaw: a cosmic ray flips a bit, a solder
joint cracks, a line of code has an off-by-one. A fault is a cause, and it may sit dormant for
years.
-
Error — the fault becomes active and puts the system into a wrong internal
state: the flipped bit is now read as data, the cracked joint drops a signal. An error is a wrong
state that has not yet been noticed from outside.
-
Failure — the error escapes and the system's visible behaviour deviates
from what it promised: the wrong answer reaches the user, the service goes down.
The chain runs fault → error → failure, and the whole discipline of fault tolerance
is about breaking the chain before the last arrow — catching and masking an error so it never
becomes a failure the outside world can see. A fault that never activates, or an error that is caught
and corrected, causes no failure at all. That is the win.
-
Reliability — the probability the system runs correctly across a whole interval
(often summarised as mean time to failure, MTTF).
-
Availability — the fraction of time the system is up and usable right now,
\frac{\text{MTTF}}{\text{MTTF} + \text{MTTR}}, where MTTR is the mean
time to repair. This is where the famous "number of nines" (99.999% ≈ five minutes of
downtime a year) comes from.
-
Fault tolerance — the property of continuing to meet those goals even while some
components are faulty.
Redundancy: the master tool
There is really only one fundamental weapon against component failure, and everything else is a
variation on it: redundancy — deliberately keeping more than you strictly
need, so that a survivor can carry on when a copy dies. It comes in three flavours, distinguished
by what you duplicate:
-
Space redundancy — replicate the components. Run three servers where one
would do; keep the same file on five disks. If a copy fails, another copy already holds the answer.
-
Time redundancy — replicate the execution. Retry the request, or
re-run the computation. This is cheap (no extra hardware) and beats transient faults — a
glitch that is gone by the second attempt — but does nothing for a permanently dead component.
-
Information redundancy — replicate the bits. Add extra, carefully chosen
bits so the data can detect and even repair its own corruption: parity, checksums, error-correcting
codes (ECC memory), RAID. A few redundant bits let you notice — and often fix — a flipped one.
For replicated components there are two organising styles, and the difference is what the
backups do while all is well:
-
Active replication — every replica processes every request in lock-step,
all the time. Failover is instant (the survivors were already computing the answer) but you pay full
cost on every node continuously.
-
Passive replication (primary–backup) — one primary does the work and
ships its state to idle backups. Cheaper, but when the primary dies there is a recovery gap
while a backup is promoted and catches up.
N-modular redundancy and majority voting
Replication alone raises a sharp question: if you have three copies and they disagree, which
one do you believe? The classic answer is N-modular redundancy (NMR): run
N identical units on the same input and put a voter in
front of the output that returns the majority answer. The famous case is
N = 3 — Triple Modular Redundancy (TMR):
Because two of the three modules still agree on the right answer, the voter's majority rule
masks the single fault entirely — the wrong value never leaves the box. In general an
NMR system with N = 2f + 1 modules survives up to f
simultaneous faulty modules: TMR (N=3) tolerates f = 1,
five-way redundancy tolerates f = 2, and so on. This is exactly how flight
computers, spacecraft, and safety-critical controllers keep flying with a chip on fire.
Sharp question — and it exposes the catch in every redundancy scheme: you have not removed
the single point of failure, you have moved it to the voter. If the one voter dies, the
whole gloriously-redundant triplet is useless. Real safety-critical systems answer this by
replicating the voter too (and sometimes voting on the voters), pushing the single point of failure
down to a tiny, exhaustively-verified, ultra-simple circuit whose failure probability is orders of
magnitude below everything it guards. You can never reach zero — you can only keep chasing
the weakest remaining link until it is negligibly small.
Checkpointing and recovery
Voting masks a fault at the instant it happens, but many systems face a different problem: a node
crashes and restarts, losing everything in volatile memory. Re-running a computation that has
been chugging for six hours from scratch is agony. The cure is checkpointing:
periodically save a snapshot of the system's state to stable storage — durable
storage (disk, replicated storage) that survives a crash. On restart, the node does not begin from
zero; it rolls back to its most recent checkpoint and resumes from there.
A checkpoint alone loses everything since the last snapshot, so it is paired with a log.
In a write-ahead log (WAL), every change is appended to a durable log before
it is applied; in a message log, incoming messages are recorded. After a crash the
recovery procedure is: load the last checkpoint, then replay the log forward to
reconstruct everything that happened afterwards. Checkpoint plus log = you never lose committed work,
and you never have to redo the hours before the checkpoint.
The design knob is how often to checkpoint, and it is a genuine trade-off:
-
Frequent checkpoints → little log to replay after a crash, so fast recovery
— but each snapshot costs time and I/O, so high steady-state overhead when nothing is even
failing.
-
Rare checkpoints → cheap during normal running, but a crash means replaying a long
log, so slow recovery.
The sweet spot depends on how often you expect to crash and how much a slow recovery costs you — you
pay a small tax constantly to buy a shorter outage occasionally. Databases live and die by exactly
this WAL-plus-checkpoint machinery.
Detecting failure: heartbeats and the fundamental limit
None of this helps if the system cannot tell that a component has died. The workhorse
mechanism is the heartbeat: each node periodically sends "I'm alive" to a monitor (or
to its peers). If no heartbeat arrives within a timeout, the monitor declares the
node dead and triggers failover — promoting a backup, rerouting traffic, kicking off
recovery.
And here we hit one of the deepest truths in distributed systems, straight from the
failure models:
in an asynchronous network you cannot distinguish a slow node from a dead one.
A missing heartbeat might mean the node crashed — or it might mean the node is perfectly healthy but
its message is stuck behind a garbage-collection pause, a congested switch, or a briefly-partitioned
network. The timeout is a guess, and you are forced to trade two kinds of mistake:
-
A short timeout detects real crashes quickly but falsely accuses healthy-but-slow
nodes — triggering needless failovers, and in the worst case two nodes both acting as primary
("split brain").
-
A long timeout avoids false alarms but leaves the system limping on a dead node for
longer before it reacts.
It is tempting to treat "the heartbeat timed out" as a fact: the node is gone, promote the backup,
move on. It is not a fact — it is an inference under uncertainty, and it is sometimes
wrong. The classic disaster is split brain: a network partition, not a crash,
separates a still-running primary from its monitor; the monitor times out, declares it dead, and
promotes a backup to primary. Now two primaries are both accepting writes on opposite sides
of the partition, silently corrupting each other's data. This is exactly why serious systems don't
rely on a lone timeout to make the call — they route the "who is alive" decision through
consensus
and fencing (majority quorums, leases) so that at most one node can win, even when the network is
lying to everyone. A timeout tells you a node is unreachable, never that it is
dead.
The reliability maths: why replicas win so fast
Here is the quantitative heart of why redundancy is so powerful. Suppose one node fails
independently with probability p. If you keep
k replicas and the system fails only when all of them are
down at once, then — because independent probabilities multiply —
P(\text{all } k \text{ fail}) = p^{\,k}, \qquad \text{availability} = 1 - p^{\,k}.
The exponent is the magic. Because p < 1, raising it to a power drives it
toward zero fast. Work a concrete example with p = 0.1 (each node
is down 10% of the time — a fairly flaky node):
- k = 1: fail prob 0.1 → availability 90%.
- k = 2: fail prob 0.1^2 = 0.01 → availability 99%.
- k = 3: fail prob 0.1^3 = 0.001 → availability 99.9%.
- k = 4: fail prob 0.0001 → availability 99.99%.
Each replica you add tacks another nine onto the availability. Three flaky 90%-nodes combine
into a 99.9% service — better than any one of them could ever be. Drag the slider below to feel how the
curve climbs, and how a smaller per-node failure probability lifts the whole thing:
The whole p^k argument rests on one quietly enormous assumption: the
replicas fail independently. In the real world they often don't. Put all three
replicas in the same rack and one power supply takes out all three at once. Run the
same buggy code on every replica and one poisoned input crashes all of them
simultaneously. Depend on the same certificate and they all go dark the minute it expires.
These are correlated failures (a shared root cause), and they shred the
independence assumption: if the replicas really fail together with probability
c, your true failure probability is roughly
c, not the beautiful p^k — no matter how many
copies you add. This is why real fault tolerance means diversity, not just quantity: spread replicas
across racks, power domains, availability zones, and sometimes even different implementations,
so there is no single fault that can fell them all together.
Compute it yourself
The program below does two things at once. First it prints the availability table for
k = 1 \dots 5 given a per-node failure probability, so you can watch the
nines pile up. Then it implements a tiny majority voter over three replica outputs —
the beating heart of TMR — and shows it masking a single wrong answer. Press Run ▶,
then try changing p or corrupting a second replica and see what the voter does:
// ---- 1) Availability = 1 - p^k for k = 1..5 -------------------------
const p = 0.1; // per-node failure probability (each node down 10% of the time)
console.log("Per-node failure probability p = " + p);
console.log("k P(all k fail) = p^k availability = 1 - p^k");
for (let k = 1; k <= 5; k++) {
const allFail = Math.pow(p, k);
const avail = 1 - allFail;
console.log(
String(k).padEnd(4) +
allFail.toExponential(3).padEnd(24) +
(avail * 100).toFixed(4) + "%",
);
}
// ---- 2) Majority voter over 3 replica outputs (TMR) ----------------
// Returns the value that appears most often; masks a single wrong one.
function majority(outputs: number[]): number {
const counts = new Map<number, number>();
for (const v of outputs) counts.set(v, (counts.get(v) ?? 0) + 1);
let best = outputs[0];
let bestCount = 0;
for (const [value, count] of counts) {
if (count > bestCount) { best = value; bestCount = count; }
}
return best;
}
console.log("");
console.log("TMR voter (three replicas, correct answer is 4):");
console.log(" all agree [4, 4, 4] -> " + majority([4, 4, 4]));
console.log(" one faulty [4, 7, 4] -> " + majority([4, 7, 4]) + " (7 outvoted, masked)");
console.log(" two faulty [7, 7, 4] -> " + majority([7, 7, 4]) + " (majority wrong -> TMR fails!)");
The last line is the honest limit: TMR masks one fault, but if two of the three go
wrong the majority itself is wrong, and the voter faithfully outputs garbage. Tolerating more faults
costs more modules — there is no free lunch, only a purchase you can choose to make.