The Pumping Lemma
We already know that a finite-state
machine has no real memory — only its current state. So it feels obvious that a
language like "the same number of 0s then 1s"
cannot be regular:
the machine would have to count, and it has nowhere to keep the count. But "it feels obvious" is not a
proof. How do you prove, rigorously, that no finite-state machine — however
cleverly built, however many states — can recognise a language?
The pumping lemma is the tool for exactly this. It is the single most useful theorem
for showing a language is not regular, and it works by turning the machine's finiteness
against it with a beautifully simple counting argument: the pigeonhole principle.
The idea: a long walk must repeat a state
Suppose a finite-state machine has p states, and we feed it a string that is
at least p symbols long. As it reads the string, it passes through a sequence
of states — one before each symbol, so at least p + 1 states in all. But there
are only p different states available. By the pigeonhole principle,
some state must be visited twice.
The chunk of input read between those two visits forms a loop in the machine: it starts
and ends in the very same state. And here is the crucial consequence — if a loop takes you from a state
back to itself, you can go around it as many times as you like (or skip it entirely) and
the machine cannot tell the difference. Every one of those altered strings ends in the same final state,
so the machine accepts them all, or rejects them all.
Break the string into three pieces x\,y\,z: the walk up to the first
visit (x), the loop (y), and the
rest (z). Because y is a loop, every string
x\,y^i\,z — y repeated i
times, for any i \ge 0 — is treated identically. "Pumping"
y is repeating it. Step through the diagram to watch the loop grow.
The lemma, stated precisely
The counting argument packages up into a clean guarantee about every regular language:
-
If L is regular, there is a pumping length
p \ge 1 such that every string
s \in L with |s| \ge p can be split as
s = x\,y\,z, with
- |y| \ge 1 (the loop is non-empty),
-
|xy| \le p (the loop happens within the first
p symbols), and
-
x\,y^{i}\,z \in L for every
i \ge 0.
Read it as a promise the language must keep. If a language is regular, it must let you pump
every long string. So to prove a language is not regular, you show it
breaks that promise — you find one long string that cannot possibly be pumped.
Worked example: \{\,0^{n}1^{n}\,\} is not regular
Let L = \{\,0^{n}1^{n} : n \ge 0\,\} — equal numbers of
0s then 1s. This is exactly the balanced-brackets
language a finite-state machine choked on. We prove it is not regular by
contradiction, using
the lemma as our lever.
-
Assume L is regular. Then it has some pumping
length p.
-
Pick a stubborn string. Choose s = 0^{p}1^{p}. It is in
L, and |s| = 2p \ge p, so the lemma applies — it
must split as s = xyz with the three conditions.
-
Corner the loop. Since |xy| \le p and the first
p symbols of s are all
0s, the loop y lies entirely inside the block of
0s. And |y| \ge 1, so
y is one or more 0s.
-
Pump and contradict. Pump up to i = 2:
xy^{2}z has extra 0s but the
same number of 1s. So it has more 0s
than 1s and is not in L — yet
the lemma guaranteed it would be. Contradiction.
The assumption was the only thing that could give. Therefore L is
not regular. No finite-state machine, of any size, recognises equal-count strings —
exactly the intuition, now nailed down.
You have a free choice of i — pick whichever value breaks the language most
easily. For 0^{p}1^{p}, i = 0 works just as well:
deleting y removes some 0s and again unbalances the
counts. Pumping down to zero copies is often the slickest contradiction — one well-chosen
i is all you need.
It is a game against an adversary
The trickiest thing about the lemma is the tangle of "for all" and "there exists". The cleanest way to
keep them straight is to picture a two-player game — you are trying to prove the
language is not regular, and a stubborn Adversary is trying to stop you:
- The Adversary picks the pumping length p. You do not get to see a number — your argument must work for any p.
- You pick the string s \in L (with |s| \ge p). Choose the most awkward string you can — this is your one big move.
- The Adversary splits it into x y z (respecting |y| \ge 1 and |xy| \le p). You must beat every legal split.
- You pick the exponent i and show x y^{i} z \notin L. Land this and you win — the language is not regular.
Notice how the quantifiers alternate: for all p,
there exists a string, for all splits, there exists an
i. Winning the game is the proof.
The pumping lemma is a one-way tool. It says "regular \Rightarrow
pumpable", so failing to pump proves not regular. But it does not say
the reverse: some non-regular languages happen to be pumpable anyway. So you can never
use the pumping lemma to prove a language is regular — passing the test proves nothing.
To show a language is regular, build a finite-state machine or a regular expression for it.
A second classic slip: letting yourself choose the split xyz. You
don't — the Adversary does, so your argument has to defeat every valid split at once.
That is exactly why the |xy| \le p condition is such a gift: it pins the loop
into the opening block of the string, killing off most of the Adversary's options for you.