Models and Failure

Two allied generals camp on opposite hills, an enemy city in the valley between them. If they attack together at dawn they win; if only one attacks, that army is destroyed. Their only way to coordinate is to send a runner down through the valley — but the valley is enemy territory, and any runner might be captured. General A sends a messenger: "Attack at dawn." But did the runner get through? A cannot commit until she knows B received it. So B, on getting the message, sends back "Acknowledged — attacking at dawn." But now B cannot be sure A received the acknowledgement. So A must acknowledge the acknowledgement… and so on, forever. No finite number of messengers over an unreliable channel can ever make both generals certain they agree.

This is the Two Generals' Problem, and it is not a riddle with a clever answer — it is a genuine impossibility. It is also the whole subject of this page in miniature. The moment your computation stops living inside one machine and spreads across many machines joined by an unreliable network, guarantees you never even noticed you were relying on — a shared clock, memory that doesn't lie, the comforting fact that an action "either happened or it didn't" — quietly stop holding. To reason about such systems at all, we first have to write down honestly what we are allowed to assume. That is what a system model and a failure model are.

Why one machine spoils us

On a single computer, three luxuries are so reliable you forget they are assumptions:

A shared clock. Every part of the program reads the same system time, so "event X happened before event Y" is a plain, answerable question.
Trustworthy memory. When you write a value and read it back, you get what you wrote. Memory does not silently invent numbers.
All-or-nothing steps. An operation either completes or it doesn't, and when it returns you know which. There is no "maybe it ran."

Now cut the machine in two and connect the halves with a network — the same TCP/IP stack that carries the rest of the internet. Every one of those luxuries evaporates. There is no single clock: each machine has its own, and they drift apart. A message you send may be delayed, reordered, duplicated, or dropped entirely. And the killer: when you send a request and hear nothing back, you cannot tell whether the other machine crashed, or is merely slow, or replied and the reply is still in flight, or the reply was lost. "It either happened or it didn't" becomes "it might have happened, and I may never find out."

A distributed system, then, is not just "a program with more computers." It is a program that has given up its most basic certainties. Engineering it well starts with stating precisely which certainties we are giving up — the model.

The system model: how timing behaves

A system model pins down what we may assume about timing — how long messages take and how far clocks drift. There are two clean extremes and one realistic middle.

In a synchronous model there are known, finite bounds on everything that can make you wait: a message always arrives within some maximum time \Delta, each processing step finishes within a known limit, and clocks drift apart no faster than a known rate. This model is a joy to reason in — because if you send a request and no reply comes within \Delta plus a round trip, you may soundly conclude the other side is down. Timeouts become reliable failure detectors. The catch: real networks rarely honour these bounds. A garbage-collection pause, a congested switch, or a swapped-out process can blow past any \Delta you picked.

In an asynchronous model there are no bounds at all. A message will eventually arrive (nothing is lost forever), but it may take arbitrarily long; steps may take arbitrarily long; clocks are useless for ordering. This is the pessimist's model, and it is a much more honest fit for the open internet. Its sting is severe: with no time bound, a slow machine and a dead machine look exactly the same. No timeout, however generous, can ever be certain — a reply might arrive one microsecond after you gave up.

Synchronous — known finite bounds on message delay, step time, and clock drift. Timeouts detect failure soundly. Easy to reason about, unrealistic to guarantee.
Asynchronous — no timing bounds whatsoever; messages arrive eventually but after unbounded delay. Realistic and unforgiving: "slow" is indistinguishable from "dead."

Because the world does not read your specification. You can hope a message arrives within 50 ms, and 99.99% of the time it does — but the whole point of a failure model is to survive the 0.01%. A single stop-the-world pause, a leader that got paged out by the OS scheduler, or a transatlantic link that briefly saturates, and your "bound" is violated. The famous result here is FLP impossibility (Fischer, Lynch, Paterson, 1985): in a fully asynchronous system where even one process may crash, there is no algorithm that guarantees all correct processes reach agreement in bounded time — because you can never rule out that the silent process is just slow. Reality lives between the extremes, which is why we need a third model.

Partial synchrony: the useful middle

Neither extreme fits practice. Pure synchrony is a fantasy; pure asynchrony is so bleak that FLP says you can't even agree on a single bit. Real systems live in partial synchrony: the network behaves asynchronously — unbounded delays, no guarantees — for a while, but eventually it settles into a synchronous period where bounds hold long enough to get work done. You don't know when the good times will come, only that they will.

This is exactly enough. Practical consensus algorithms (Paxos, Raft) are designed to stay safe always and make progress during the good periods: they never produce a wrong answer even in the chaotic stretches, and they finish their agreement once the network calms down. Partial synchrony is the honest model most production systems are actually built against.

The failure model: how components break

Timing is only half the picture. The failure model says in what ways a component is allowed to misbehave. This is a spectrum, from the polite failures to the diabolical, and the further along it you go, the harder — and more expensive — your algorithm must work to tolerate it.

Crash-stop (fail-stop). The kindest failure. A node runs correctly, then at some point simply halts — and stays halted forever. It never sends a wrong message; it just goes silent. Like a process that is killed and never restarted.
Crash-recovery. A node may crash and then come back later, having lost whatever it held only in volatile memory (but keeping what it wrote to durable storage). Now you must worry about a "dead" node re-joining and acting on stale state — trickier than crash-stop.
Omission. The node itself is fine, but it drops messages — some sends or receives silently vanish (a full buffer, a flaky link). It behaves correctly on the messages that do get through, but you can't rely on any single one arriving.
Byzantine (arbitrary). The worst case, named for the Byzantine Generals' Problem. A faulty node may do anything at all: send contradictory information to different peers, lie about its state, forge messages, or actively collude to break the system. This covers not just bugs and cosmic-ray bit-flips but malicious, adversarial behaviour.

The jump from crash faults to Byzantine faults is enormous, not incremental. To tolerate f crash failures a system typically needs 2f + 1 nodes (a simple majority survives); to tolerate f Byzantine failures it needs 3f + 1 — because you must out-vote not only the silent nodes but the actively lying ones, who may tell different stories to different listeners. This is why Byzantine fault tolerance is reserved for settings where the stakes justify it: aircraft flight-control systems, and, famously, public blockchains where anonymous participants have a financial incentive to cheat.

The heart of it: you cannot tell "slow" from "dead"

Fold the two models together and you arrive at the single most humbling fact in distributed computing. In an asynchronous (or partially synchronous) network, when you send a message and hear nothing, the silence is fundamentally ambiguous. It could mean:

the receiver crashed before reading your message;
the receiver got it, did the work, but crashed before replying;
the receiver got it, did the work, replied — and the reply is still in transit;
the receiver got it, replied, and the reply was dropped by the network;
the receiver is simply slow and will answer in another second.

From the sender's side these are indistinguishable. You see exactly one thing — nothing — and it has (at least) five possible causes, some meaning "the work happened" and some meaning "it didn't." A timeout does not resolve the ambiguity; it only forces you to guess after waiting a while. Set the timeout short and you'll declare healthy-but-slow nodes dead (a false positive); set it long and you'll take forever to notice real crashes. There is no setting that is both fast and certain, because certainty was never on the table.

This is why "just retry the request" is such a loaded suggestion. If the first attempt actually succeeded but its reply was lost, your retry runs the operation a second time — charge the card twice, ship the order twice. Distinguishing "do it again" from "do it once, but tell me again whether it worked" is the entire reason idempotency and exactly-once semantics are hard-won engineering achievements rather than defaults.

See the ambiguity for yourself

The sandbox below runs on one thread, so there is no real network — which is exactly why it is instructive. We script what really happened on the receiver's side (did it crash? was the network slow? was the ack dropped?), then show what the sender observes: only whether an ack arrived before the timeout. Watch how several genuinely different truths all collapse into the same observation. Press Run ▶:

// An asynchronous model: messages arrive "eventually" (or not at all), // with NO bound on delay. The sender waits for an ack up to `timeout`, // then must guess. We know the hidden truth; the sender does not. interface Scenario { label: string; didReceiverRunIt: boolean; // did the work actually happen? ackDelay: number | null; // ms until ack arrives, or null = never } const TIMEOUT = 100; // the sender's patience, in ms function senderObserves(s: Scenario): string { const acked = s.ackDelay !== null && s.ackDelay <= TIMEOUT; // The sender ONLY sees this boolean. Everything else is hidden. return acked ? "ACK received -> guess: 'it worked'" : "timed out -> guess: 'it failed?'"; } const scenarios: Scenario[] = [ { label: "receiver crashed before reading", didReceiverRunIt: false, ackDelay: null }, { label: "ran it, then crashed before ack", didReceiverRunIt: true, ackDelay: null }, { label: "ran it, ack dropped by network", didReceiverRunIt: true, ackDelay: null }, { label: "ran it, ack merely slow (250ms)", didReceiverRunIt: true, ackDelay: 250 }, { label: "ran it, ack arrived in time", didReceiverRunIt: true, ackDelay: 40 }, ]; console.log("timeout = " + TIMEOUT + "ms\n"); console.log("HIDDEN TRUTH".padEnd(34) + "| DID IT RUN? | WHAT THE SENDER SEES"); console.log("-".repeat(78)); for (const s of scenarios) { console.log( s.label.padEnd(34) + "| " + (s.didReceiverRunIt ? "yes " : "no ") + "| " + senderObserves(s) ); } console.log("\nLook at the four 'timed out' rows: the work happened in THREE of"); console.log("them and not the fourth -- yet the sender's observation is identical."); console.log("A timeout cannot tell 'slow' or 'ack lost' from 'dead'.");

The last two rows are the punchline. A reply that takes 250 ms and a reply that takes 40 ms are the same event — an ack that arrives — but the sender, having chosen a 100 ms timeout, sees one as success and the other as (apparent) failure. Nothing about the receiver changed; only the timing did. Correctness cannot rest on a guess about timing, which is the whole reason distributed algorithms are so much subtler than their single-machine cousins.

Putting the two models to work

Whenever you read or design a distributed protocol, the first two questions are always: which system model? and which failure model? Together they define the rules of the game the algorithm must win. A protocol proven correct under a crash-stop, synchronous model may fall apart the instant a node lies (Byzantine) or a link stalls past its assumed bound (asynchronous). Naming the model is not academic box-ticking — it is the contract that tells you exactly when your guarantees hold and, just as importantly, when they don't.

Notice the direction of difficulty. Moving from synchronous toward asynchronous timing, or from crash-stop toward Byzantine faults, each makes the world strictly harder and forces stronger (more redundant, more communication-heavy) algorithms. The engineering art is choosing the weakest model that still describes your real environment — assume too little and you'll pay for tolerance you never needed; assume too much and reality will find the gap.

The single most common beginner mistake is to treat silence as proof of death: "I didn't get an ack, so the other node must have crashed and never did the work — I'll safely retry." Every clause of that sentence can be false. In an asynchronous model, no reply means only "no reply (yet)." The receiver may have executed the request perfectly and then crashed, or executed it and had its ack dropped, or be about to answer a moment after you gave up. Assuming "no reply ⇒ crashed ⇒ nothing happened" is how you double-charge a customer or delete data twice. A missing message tells you about the channel, not reliably about the node. And a second trap in the same family: treating a crash fault and a Byzantine fault as "just a failed node." A crashed node is honestly silent; a Byzantine node may cheerfully send you a plausible, wrong answer — and different wrong answers to your neighbours. They sit at opposite ends of the difficulty spectrum, and an algorithm built for the first is helpless against the second.