Fault tolerance and the threshold theorem
A quantum
error-correcting code protects stored information by spreading one logical qubit
across many physical ones and repeatedly measuring the errors out. But step back and a
worrying question appears: the machinery that does the correcting — the gates, the
measurements, the wiring — is built from the same shaky physical parts we are trying to
protect against. The gates are noisy. The syndrome-extraction circuit is noisy. Even the step that
applies a correction can misfire. If curing an error takes a hundred noisy operations, do we fix more
than we break, or do we just churn out new errors faster than we mop them up?
This is the deepest question in quantum computing, and it has a stunningly precise answer. The
threshold theorem says: get the error rate of your individual parts below one fixed
constant — a threshold p_{\text{th}} — and you can build a machine of
arbitrary reliability out of them. Below the line, scaling up wins; above it, scaling up makes
things worse. This page is about that line and the discipline — fault tolerance — that
lets you stand on the right side of it.
The catch: errors spread
Correction can keep up only if faults stay small. The threat is that a single fault, run through
an ordinary circuit, can grow into many errors — and no code can correct more errors than
it was designed for. The chief culprit is the two-qubit gate. Take the
CNOT: it is the
workhorse of every quantum circuit, and it is also an error-spreading machine.
The rule is short. A CNOT copies a bit-flip (X) error forward, from
control to target, and copies a phase-flip (Z) error backward, from
target to control:
\text{CNOT}\,\big(X_c\big)\,\text{CNOT} = X_c X_t, \qquad \text{CNOT}\,\big(Z_t\big)\,\text{CNOT} = Z_c Z_t.
Read either equation the honest way: one error went in, two came out. A stray flip on
the control before the gate is, after the gate, a flip on both qubits. Chain a few gates and a
single slip can smear across a whole register — far more damage than any modest code can undo.
Worked example 1: one CNOT fault becomes two errors
Suppose qubit c suffers a bit-flip X_c just before a
CNOT with target t. Push the error through the gate using the rule above. The
gate's job is to apply X_t to the target whenever the control is
|1\rangle; a flipped control therefore drags the target along, and the fault
emerges as
X_c \;\xrightarrow{\ \text{CNOT}\ }\; X_c X_t.
Now imagine a naive encoding that just repeats a qubit three times and runs a bare (transversal
by qubit but carelessly ordered) circuit where one physical qubit fans out into several targets through
successive CNOTs. That single starting flip becomes a flip on qubit 1, then also 2, then also 3 — three
errors from one fault. A distance-3 code corrects only one error, so it is now
overwhelmed: the "correction" confidently applies the wrong fix and creates a logical error.
Bare repetition fails not because the code is weak but because the circuit let one fault
multiply past what the code can handle.
The fix is a design rule, not a better code. A circuit is fault-tolerant when it is
arranged so that any single component fault produces at most one error per code block — errors
are contained, never allowed to cascade within a block. Concretely you never let one physical qubit talk
to two qubits of the same block through a shared ancilla; you use fresh, verified ancillas and
transversal gates (qubit i of one block interacts only with qubit
i of another). Then a lone fault stays lonely, and the code can mop it up on
the next round of syndrome extraction.
The threshold theorem
Fault tolerance keeps a single fault from exploding. The threshold theorem then turns
that local guarantee into a global promise. Encode your logical qubits in a good code, run only
fault-tolerant gadgets, and correct after every step. Do the accounting for one round of a code that
corrects one error: a logical error needs two physical faults to slip through in the same round,
so the logical failure probability scales like
p_L \;\approx\; c\,p^{2} \;=\; p_{\text{th}}\left(\frac{p}{p_{\text{th}}}\right)^{\!2}, \qquad p_{\text{th}} = \frac{1}{c},
where p is the physical error rate per operation and
c counts the pairs of fault locations that could conspire. Stare at that
formula and the magic falls out. If p < p_{\text{th}} then
p/p_{\text{th}} < 1, so p_L < p: the encoded
qubit is more reliable than its parts. Repeat the encoding — a code inside a code, called
concatenation (or, equivalently, growing the code distance) — and the reliability
compounds, driving the logical error rate as low as you like.
- There exists a constant threshold p_{\text{th}} > 0
such that, if the physical error rate per operation satisfies
p < p_{\text{th}}, then any quantum computation can be performed with
logical error rate as low as desired.
- The cost is only polylogarithmic: to reach a target logical error
\varepsilon over a computation of size N, the
overhead in extra qubits and gates grows like
\operatorname{poly}\!\big(\log(N/\varepsilon)\big) per logical operation.
- Below threshold, scaling wins (more encoding ⇒ less error);
above threshold, scaling loses (more encoding ⇒ more error).
The number itself depends on the code and noise model, but for the practical champion — the surface code —
the threshold is a comfortable ~1%. That single fact is why building a large quantum
computer is an engineering problem rather than an impossibility: the target is "make your gates about 99%
reliable," and everything else is scale.
Worked example 2: concatenation squares the error, level by level
Concatenation is recursion. One level of encoding maps a physical error rate
p to p_1 = p_{\text{th}}(p/p_{\text{th}})^{2}.
But the level-1 qubits are themselves just qubits, so encoding them applies the same map again:
p_2 = p_{\text{th}}\!\left(\frac{p_1}{p_{\text{th}}}\right)^{\!2} = p_{\text{th}}\!\left(\frac{p}{p_{\text{th}}}\right)^{\!4}, \qquad\text{and after } L \text{ levels}\qquad p_L = p_{\text{th}}\!\left(\frac{p}{p_{\text{th}}}\right)^{\!2^{L}}.
Each level squares the ratio p/p_{\text{th}}. When that ratio
is below 1, squaring it repeatedly sends it to zero super-exponentially fast. Put in
numbers: take a threshold p_{\text{th}} = 1\% and physical error
p = 0.1\%, so p/p_{\text{th}} = \tfrac{1}{10}:
p_1 = 10^{-2}\cdot 10^{-2} = 10^{-4}, \quad p_2 = 10^{-2}\cdot 10^{-4} = 10^{-6}, \quad p_3 = 10^{-2}\cdot 10^{-8} = 10^{-10}, \quad \dots
Three levels of nesting and the logical error is one in ten billion — from parts that fail one time in a
thousand. The exponent doubles each level (2, 4, 8, 16, \dots), so the error
does not merely shrink, it plummets. And because you only need
L \sim \log\log(1/\varepsilon) levels to hit a target
\varepsilon, the qubit and gate overhead stays polylogarithmic — the crown
jewel of the theorem.
Now feed the same formula a physical error above threshold,
p = 2\% so p/p_{\text{th}} = 2: the ratio squares
upward — 4, 16, 256, \dots — and every extra level makes the logical
qubit worse. Same recursion, opposite fate; the only thing that changed is which side of
p_{\text{th}} you started on.
Seeing the threshold
Plot the logical error rate against the physical error rate for codes of growing strength. Every curve
passes through the same crossing point — the threshold p_{\text{th}}
(here 1%). To the left of it (p < p_{\text{th}}) a stronger
code sits lower: encoding helps, and the more distance the better. To the right
the order flips — a stronger code sits higher: encoding hurts. The faint diagonal is break-even,
where a corrected qubit is exactly as good as a bare one; useful codes are the region below it.
The crossing point is the whole story of the theorem in one picture: it is the pivot on which "adding more
error correction" flips from helping to hurting.
The other bill: magic-state distillation
There is a second reason fault tolerance is expensive, hiding in which gates are cheap. In most
codes the Clifford
gates (like H, S, and CNOT) are
transversal — you apply them qubit-by-qubit across a code block and they are automatically
fault-tolerant, essentially free. The trouble is that the Clifford gates are not universal:
by the Gottesman–Knill theorem a circuit of only Clifford gates can be simulated efficiently on a laptop,
so it cannot be doing anything genuinely quantum-hard.
Universality needs one non-Clifford gate, conventionally the T gate (a
45^\circ phase). And T is exactly the gate that is
not transversal in the surface code — you cannot just apply it and stay protected. The workaround
is magic-state distillation: prepare many noisy copies of a special "magic" state, then
run a fault-tolerant circuit that consumes many low-quality copies to purify a few high-quality ones. A
clean magic state can then be "injected" to enact a T gate. This works, but it
is the dominant cost of a fault-tolerant machine — distillation factories can occupy the
majority of the physical qubits and runtime. In the fault-tolerant world, Clifford gates are nearly free
and T gates are the currency you spend.
Before the threshold theorem (Shor, Aharonov–Ben-Or, Kitaev, and others in the mid-1990s), it was a
genuinely open question whether a large quantum computer could ever work — perhaps noise always won, and
the whole enterprise was a mirage. The theorem changed the character of the problem completely. It says
there is a finite bar: get your physical operations below roughly a percent error and
there is no fundamental obstacle left, only engineering. Every qubit count you read in the news, every
experiment chasing "below threshold" operation, is aimed at that one line. It is the reason a
fault-tolerant quantum computer is considered a when, not an if — the physics gives
permission, provided you can just get below ~1%.
Two traps. First, the threshold is a two-way door. Below p_{\text{th}},
piling on error correction (more distance, more concatenation) drives the logical error down. But
above p_{\text{th}} the exact same effort makes things
worse — each extra layer squares a number bigger than one, amplifying errors instead of
suppressing them. Error correction is not a knob that "always helps a bit"; on the wrong side of the line
it actively hurts.
Second, do not imagine a logical qubit is cheap. Fault tolerance carries a huge overhead:
a single high-quality logical qubit can demand thousands of physical qubits, and the
T gates that make the computer universal are paid for with sprawling
magic-state distillation factories that dominate the hardware budget. "We have error correction" does not
mean the qubits are free — it means each precious logical qubit is an expensive, hard-won composite of
many noisy real ones.