Fault tolerance and the threshold theorem

A quantum error-correcting code protects stored information by spreading one logical qubit across many physical ones and repeatedly measuring the errors out. But step back and a worrying question appears: the machinery that does the correcting — the gates, the measurements, the wiring — is built from the same shaky physical parts we are trying to protect against. The gates are noisy. The syndrome-extraction circuit is noisy. Even the step that applies a correction can misfire. If curing an error takes a hundred noisy operations, do we fix more than we break, or do we just churn out new errors faster than we mop them up?

This is the deepest question in quantum computing, and it has a stunningly precise answer. The threshold theorem says: get the error rate of your individual parts below one fixed constant — a threshold p_{\text{th}} — and you can build a machine of arbitrary reliability out of them. Below the line, scaling up wins; above it, scaling up makes things worse. This page is about that line and the discipline — fault tolerance — that lets you stand on the right side of it.

The catch: errors spread

Correction can keep up only if faults stay small. The threat is that a single fault, run through an ordinary circuit, can grow into many errors — and no code can correct more errors than it was designed for. The chief culprit is the two-qubit gate. Take the CNOT: it is the workhorse of every quantum circuit, and it is also an error-spreading machine.

The rule is short. A CNOT copies a bit-flip (X) error forward, from control to target, and copies a phase-flip (Z) error backward, from target to control:

\text{CNOT}\,\big(X_c\big)\,\text{CNOT} = X_c X_t, \qquad \text{CNOT}\,\big(Z_t\big)\,\text{CNOT} = Z_c Z_t.

Read either equation the honest way: one error went in, two came out. A stray flip on the control before the gate is, after the gate, a flip on both qubits. Chain a few gates and a single slip can smear across a whole register — far more damage than any modest code can undo.

Worked example 1: one CNOT fault becomes two errors

Suppose qubit c suffers a bit-flip X_c just before a CNOT with target t. Push the error through the gate using the rule above. The gate's job is to apply X_t to the target whenever the control is |1\rangle; a flipped control therefore drags the target along, and the fault emerges as

X_c \;\xrightarrow{\ \text{CNOT}\ }\; X_c X_t.

Now imagine a naive encoding that just repeats a qubit three times and runs a bare (transversal by qubit but carelessly ordered) circuit where one physical qubit fans out into several targets through successive CNOTs. That single starting flip becomes a flip on qubit 1, then also 2, then also 3 — three errors from one fault. A distance-3 code corrects only one error, so it is now overwhelmed: the "correction" confidently applies the wrong fix and creates a logical error. Bare repetition fails not because the code is weak but because the circuit let one fault multiply past what the code can handle.

The fix is a design rule, not a better code. A circuit is fault-tolerant when it is arranged so that any single component fault produces at most one error per code block — errors are contained, never allowed to cascade within a block. Concretely you never let one physical qubit talk to two qubits of the same block through a shared ancilla; you use fresh, verified ancillas and transversal gates (qubit i of one block interacts only with qubit i of another). Then a lone fault stays lonely, and the code can mop it up on the next round of syndrome extraction.

The threshold theorem

Fault tolerance keeps a single fault from exploding. The threshold theorem then turns that local guarantee into a global promise. Encode your logical qubits in a good code, run only fault-tolerant gadgets, and correct after every step. Do the accounting for one round of a code that corrects one error: a logical error needs two physical faults to slip through in the same round, so the logical failure probability scales like

p_L \;\approx\; c\,p^{2} \;=\; p_{\text{th}}\left(\frac{p}{p_{\text{th}}}\right)^{\!2}, \qquad p_{\text{th}} = \frac{1}{c},

where p is the physical error rate per operation and c counts the pairs of fault locations that could conspire. Stare at that formula and the magic falls out. If p < p_{\text{th}} then p/p_{\text{th}} < 1, so p_L < p: the encoded qubit is more reliable than its parts. Repeat the encoding — a code inside a code, called concatenation (or, equivalently, growing the code distance) — and the reliability compounds, driving the logical error rate as low as you like.

The number itself depends on the code and noise model, but for the practical champion — the surface code — the threshold is a comfortable ~1%. That single fact is why building a large quantum computer is an engineering problem rather than an impossibility: the target is "make your gates about 99% reliable," and everything else is scale.

Worked example 2: concatenation squares the error, level by level

Concatenation is recursion. One level of encoding maps a physical error rate p to p_1 = p_{\text{th}}(p/p_{\text{th}})^{2}. But the level-1 qubits are themselves just qubits, so encoding them applies the same map again:

p_2 = p_{\text{th}}\!\left(\frac{p_1}{p_{\text{th}}}\right)^{\!2} = p_{\text{th}}\!\left(\frac{p}{p_{\text{th}}}\right)^{\!4}, \qquad\text{and after } L \text{ levels}\qquad p_L = p_{\text{th}}\!\left(\frac{p}{p_{\text{th}}}\right)^{\!2^{L}}.

Each level squares the ratio p/p_{\text{th}}. When that ratio is below 1, squaring it repeatedly sends it to zero super-exponentially fast. Put in numbers: take a threshold p_{\text{th}} = 1\% and physical error p = 0.1\%, so p/p_{\text{th}} = \tfrac{1}{10}:

p_1 = 10^{-2}\cdot 10^{-2} = 10^{-4}, \quad p_2 = 10^{-2}\cdot 10^{-4} = 10^{-6}, \quad p_3 = 10^{-2}\cdot 10^{-8} = 10^{-10}, \quad \dots

Three levels of nesting and the logical error is one in ten billion — from parts that fail one time in a thousand. The exponent doubles each level (2, 4, 8, 16, \dots), so the error does not merely shrink, it plummets. And because you only need L \sim \log\log(1/\varepsilon) levels to hit a target \varepsilon, the qubit and gate overhead stays polylogarithmic — the crown jewel of the theorem.

Now feed the same formula a physical error above threshold, p = 2\% so p/p_{\text{th}} = 2: the ratio squares upward4, 16, 256, \dots — and every extra level makes the logical qubit worse. Same recursion, opposite fate; the only thing that changed is which side of p_{\text{th}} you started on.

Seeing the threshold

Plot the logical error rate against the physical error rate for codes of growing strength. Every curve passes through the same crossing point — the threshold p_{\text{th}} (here 1%). To the left of it (p < p_{\text{th}}) a stronger code sits lower: encoding helps, and the more distance the better. To the right the order flips — a stronger code sits higher: encoding hurts. The faint diagonal is break-even, where a corrected qubit is exactly as good as a bare one; useful codes are the region below it.

The crossing point is the whole story of the theorem in one picture: it is the pivot on which "adding more error correction" flips from helping to hurting.

The other bill: magic-state distillation

There is a second reason fault tolerance is expensive, hiding in which gates are cheap. In most codes the Clifford gates (like H, S, and CNOT) are transversal — you apply them qubit-by-qubit across a code block and they are automatically fault-tolerant, essentially free. The trouble is that the Clifford gates are not universal: by the Gottesman–Knill theorem a circuit of only Clifford gates can be simulated efficiently on a laptop, so it cannot be doing anything genuinely quantum-hard.

Universality needs one non-Clifford gate, conventionally the T gate (a 45^\circ phase). And T is exactly the gate that is not transversal in the surface code — you cannot just apply it and stay protected. The workaround is magic-state distillation: prepare many noisy copies of a special "magic" state, then run a fault-tolerant circuit that consumes many low-quality copies to purify a few high-quality ones. A clean magic state can then be "injected" to enact a T gate. This works, but it is the dominant cost of a fault-tolerant machine — distillation factories can occupy the majority of the physical qubits and runtime. In the fault-tolerant world, Clifford gates are nearly free and T gates are the currency you spend.

Before the threshold theorem (Shor, Aharonov–Ben-Or, Kitaev, and others in the mid-1990s), it was a genuinely open question whether a large quantum computer could ever work — perhaps noise always won, and the whole enterprise was a mirage. The theorem changed the character of the problem completely. It says there is a finite bar: get your physical operations below roughly a percent error and there is no fundamental obstacle left, only engineering. Every qubit count you read in the news, every experiment chasing "below threshold" operation, is aimed at that one line. It is the reason a fault-tolerant quantum computer is considered a when, not an if — the physics gives permission, provided you can just get below ~1%.

Two traps. First, the threshold is a two-way door. Below p_{\text{th}}, piling on error correction (more distance, more concatenation) drives the logical error down. But above p_{\text{th}} the exact same effort makes things worse — each extra layer squares a number bigger than one, amplifying errors instead of suppressing them. Error correction is not a knob that "always helps a bit"; on the wrong side of the line it actively hurts.

Second, do not imagine a logical qubit is cheap. Fault tolerance carries a huge overhead: a single high-quality logical qubit can demand thousands of physical qubits, and the T gates that make the computer universal are paid for with sprawling magic-state distillation factories that dominate the hardware budget. "We have error correction" does not mean the qubits are free — it means each precious logical qubit is an expensive, hard-won composite of many noisy real ones.