Consensus: Paxos and Raft

Picture five computers in five different data centres, wired together over a flaky network. Their one job sounds almost insultingly simple: agree on a single value — say, "the next command to run is withdraw \$50." Not a hard sum, not a clever algorithm. Just agree. And yet this is one of the deepest problems in all of computing, because the machines are unreliable (any of them can crash at the worst possible moment) and the network is unreliable (messages arrive late, out of order, duplicated, or never at all). No node can even tell the difference between "that server crashed" and "that server is fine but its reply is stuck in traffic."

This is the consensus problem, and it is the beating heart of every serious distributed system — the thing that lets a cluster of ordinary, failure-prone machines behave, from the outside, like one flawless machine that never forgets and never contradicts itself. Databases that stay consistent across continents, systems that elect a coordinator, locks that work across a whole cluster — all of them run a consensus protocol underneath. This page is about what consensus demands, why a simple majority is the trick that makes it possible, and how the two famous protocols — the venerable, brain-bending Paxos and its deliberately friendlier cousin Raft — actually reach agreement.

What "agreement" has to mean

Before we can solve consensus we have to pin down exactly what a correct solution owes us. A group of nodes, each starting with some proposed value, must all decide on one value, and the protocol must guarantee three properties:

Agreement (safety) — no two nodes ever decide on different values. This is the promise you can never break: two replicas that disagree about what happened are worse than useless.
Validity (integrity) — the value decided upon must be one that some node actually proposed. The system can't invent a command out of thin air.
Termination (liveness) — every non-crashed node eventually decides. It isn't enough to never be wrong; a protocol that stays silent forever is no use either.

The tension between the first and the last of these is the whole story. Agreement says "never, ever be inconsistent," and termination says "always make progress." In a world where machines crash and messages vanish, keeping both at once — and keeping them while nodes are dropping — is exactly what makes consensus a genuinely hard problem, not a homework exercise.

The trick: majority quorums

Here is the single most important idea on this page. Whenever the cluster needs to commit to something, it does not wait for everyone to agree — that would grind to a halt the instant one machine crashed. Instead a decision is made by a majority: strictly more than half of the nodes. Such a group is called a quorum.

Why a majority, of all things? Because of one small, beautiful fact of arithmetic:

Any two majorities of the same set must overlap — they share at least one node.

The proof is a one-liner. Take N nodes. A majority is more than N/2 of them. If you had two majorities that shared no node, their sizes would add up to more than N/2 + N/2 = N distinct nodes — but there are only N nodes in total. Contradiction. So every pair of majorities has a node in common.

That overlapping node is the hero of the whole story. Because any decision requires a majority, and any two majorities share a member, that shared member remembers the earlier decision and refuses to help ratify a conflicting one. No two majorities can ever independently commit to contradictory values — which is precisely the Agreement property, handed to us for free by counting. Every consensus protocol, Paxos and Raft included, is at its core an elaborate way of making sure each committed step passes through a majority.

A happy consequence: with N nodes, a majority still forms as long as fewer than half have failed, so the cluster keeps working after f = \left\lfloor \tfrac{N-1}{2} \right\rfloor failures. Five nodes tolerate two crashes; seven tolerate three. Notice the numbers: going from 3 nodes to 4 does not improve fault tolerance (both survive one failure) — it's why real clusters use an odd count. This is the same kind of shared-state coordination that parallel programming models wrestle with inside a single machine, now stretched across a whole network.

Astonishingly, yes — and it's a theorem. The FLP impossibility result (Fischer, Lynch and Paterson, 1985) proves that in a fully asynchronous system — one with no bound at all on how long a message can take — no deterministic protocol can guarantee it will always reach consensus if even a single node may crash. The reason is that shared-timeline argument again: with no clock you can trust, a node genuinely cannot tell a crashed peer from a merely slow one, and an adversarial scheduler can keep any protocol dithering forever at exactly the wrong moment.

So how do real systems exist at all? They quietly weaken the assumption. Paxos and Raft never sacrifice safety (they will never decide two different values, ever), but they only guarantee progress when the network behaves reasonably for a while — using timeouts and a leader to break symmetry and keep moving. FLP doesn't say consensus is hopeless; it says you can't get it for free, and the price is a dash of timing you have to assume goes right often enough.

Paxos: correct, and famously baffling

The first protocol to solve this properly — and prove it — was Paxos, described by Leslie Lamport in 1998. Paxos is correct, elegant, and notoriously hard to understand; Lamport's own follow-up paper was cheekily titled "Paxos Made Simple," and it still leaves most readers dizzy. The mechanism, stripped to its bones, hands out three roles (one node often plays all three): proposers suggest values, acceptors vote on them, and learners find out the result.

A value gets chosen in two phases, and every proposal carries a unique, ever-increasing proposal number that acts like a ticket for who goes next:

Phase 1 — Prepare / Promise. A proposer picks a proposal number n and asks a majority of acceptors: "will you promise not to consider anything numbered below n?" Each acceptor, if n is the highest it has seen, promises — and, crucially, reports back any value it has already accepted, so a new proposer can never trample a decision already in flight.
Phase 2 — Accept / Accepted. If a majority promised, the proposer sends them an accept request for a value (its own, unless an acceptor's report forced it to reuse an older, possibly-chosen value). Once a majority accepts, the value is chosen — and the quorum overlap from before guarantees no competing proposal can choose anything different.

It works, and the majority-overlap argument is why. But the two-phase dance, the rule about adopting the highest-numbered previously-accepted value, and the ways proposals can duel and stall are subtle enough that engineers kept getting real implementations wrong. That pain is exactly what motivated the next protocol.

Raft: consensus you can actually explain

In 2014, Diego Ongaro and John Ousterhout published Raft with an unusual explicit goal baked into its design: understandability. Their thesis was that Paxos was so hard to grasp that it was hard to build correctly, so they engineered a protocol that solves the same problem but that a student can actually hold in their head. The core simplification: instead of many proposers duelling, Raft elects one leader, and all agreement flows through that single leader for as long as it lives.

Raft slices time into numbered terms, and a node is always in one of three states:

Follower — the passive default; it just answers the leader and votes when asked.
Candidate — a follower that suspected the leader died and is now running for office.
Leader — the single elected coordinator that handles all client requests and drives replication.

Leader election

Every follower runs an election timeout — a countdown that resets each time it hears from the leader (a periodic heartbeat). If a follower's timer ever fires without a peep from the leader, it assumes the leader is gone and springs into action: it becomes a candidate, bumps the term number, votes for itself, and sends RequestVote messages to everyone else. Each node grants at most one vote per term, so if the candidate collects votes from a majority, it wins and becomes leader — then immediately starts sending heartbeats to shut down everyone else's timers. Step through the whole sequence below:

Two details make elections actually terminate. First, each election timeout is chosen randomly from a range, so it's unlikely that two followers time out at the exact same instant and split the vote; whoever fires first usually wins uncontested. Second, if a vote does split (nobody gets a majority), that term simply ends with no leader, everyone waits out another random timeout, and a fresh election runs — and thanks to the randomness, it almost always resolves quickly.

Log replication

Once elected, the leader is the sole point of contact for client commands. Each command is appended to the leader's log and shipped to the followers via AppendEntries messages. Here is where the majority returns, one last time: the leader considers an entry committed — safe, permanent, applied to the state machine — only once a majority of nodes have stored it. Overlapping quorums guarantee that any future leader must already have every committed entry, so committed history can never be lost or rewritten. A crashed leader just triggers a new election; the term number rises, and the cluster carries on.

Two tempting misconceptions, both wrong, both about the same core idea.

"A single leader is a single point of failure." It feels like one, but it isn't. The leader holds no unique, unrecoverable state — every committed entry lives on a majority of followers. If the leader crashes, the followers' timers fire, an election runs, and a new leader is in charge within a fraction of a second, carrying the full committed log with it. The leadership role is a single point of coordination, but it is replaceable, which is the whole point.

"Consensus needs all the nodes to agree." No — and requiring unanimity would be a disaster. If every node had to vote yes, a single crashed or unreachable machine would freeze the entire cluster, destroying availability. Consensus deliberately needs only a majority, and that is not a compromise — it is the essential feature. The majority is what lets the system keep deciding while nodes are down, and quorum overlap is what keeps it safe anyway. "Everyone agrees" is not stronger than "a majority agrees"; it is strictly more fragile.

Count the votes yourself

A leader election comes down to one bit of arithmetic: did the candidate collect strictly more than half the votes? The little simulation below tallies votes across a cluster and declares the candidate elected only when its count clears the majority line \lfloor N/2 \rfloor + 1. Watch the split case, where nobody reaches a majority and the term produces no leader at all. Press Run ▶:

// A tiny model of a Raft election tally. // A candidate is ELECTED iff it wins strictly MORE THAN half the nodes. function majorityNeeded(n: number): number { return Math.floor(n / 2) + 1; // strictly more than half } function report(nodes: number, votesForCandidate: number): void { const need = majorityNeeded(nodes); const elected = votesForCandidate > nodes / 2; // equivalently: votes >= need const verdict = elected ? "ELECTED ✓" : "no majority — split vote, retry"; console.log( "N=" + nodes + " votes=" + votesForCandidate + " (need " + need + ")".padEnd(6) + " => " + verdict, ); } console.log("Cluster of 5 (tolerates " + Math.floor((5 - 1) / 2) + " failures):"); report(5, 5); // unanimous -> elected report(5, 3); // exactly a majority -> elected report(5, 2); // minority -> split, no leader console.log(""); console.log("Cluster of 3 (tolerates " + Math.floor((3 - 1) / 2) + " failures):"); report(3, 2); // majority -> elected report(3, 1); // split down the middle-> no leader console.log(""); console.log("An election only succeeds when one candidate clears the majority line.");

Notice how a tie or a minority yields no leader — the term ends leaderless and Raft simply runs another election after a fresh random timeout. Safety is never in danger: at worst the cluster pauses. It never decides two different things.