Regular Expressions and Regular Languages

Imagine you are writing the sign-up page for a website and you need to answer one deceptively simple question: is this a valid email address? Or you are searching a huge log file for every line that looks like a date. Or a phone number. Or a UK postcode. In every case you are not looking for one fixed string — you are looking for anything that fits a certain shape. A regular expression (almost always shortened to regex) is the tool for describing exactly that shape: it is a tiny, compact pattern that stands for a whole set of strings at once.

A plain string like "cat" describes just one word. A regex like c.t describes many — cat, cot, cut, c9t and so on — all the three-letter strings that start with c and end with t. That leap, from naming one string to describing a possibly infinite family of them in a few characters, is the whole idea. This page builds that idea up operator by operator, shows you the exact strings each pattern matches and rejects, and then reveals a beautiful fact: regexes are secretly the very same thing as the finite state machines you have already met.

The simplest pattern: concatenation

The most basic thing a regex can do is spell out characters one after another. Writing abc as a pattern means "an a, then a b, then a c" — the characters are glued in order. This is called concatenation, and it is the same "join end to end" idea you meet with string concatenation, only now we are joining patterns rather than fixed text.

On its own, abc matches exactly one string: "abc". Not very exciting — it is no better than the plain word. Concatenation only becomes powerful once we can drop choices and repetition into the sequence, so that each position stands for more than one possibility. That is what the next three operators give us.

Alternation: `|` means "or"

The alternation operator, written | and read aloud as "or", lets a pattern match either of two things. The pattern cat|dog matches the string "cat" or the string "dog" — and nothing else. Round brackets group a choice so it applies to just part of a pattern: gr(a|e)y matches both the British spelling "gray" and "grey", because the (a|e) stands for "an a or an e" sitting between gr and y.

0|1 — matches a single "0" or a single "1" (one binary digit).
(Mr|Ms|Dr) — matches any one of the three titles.
(cat|dog|fish)s? — combined with the ? you will meet shortly, matches those animals singular or plural.

Alternation is what turns a regex from "one exact string" into "any of these" — the first taste of a pattern standing for a set.

The Kleene star: `*` means "zero or more"

The star * — named the Kleene star after the logician Stephen Kleene, who invented regular expressions in the 1950s — is the operator that makes patterns truly powerful, because it introduces repetition. Placed after something, it means "zero or more copies of that thing". Read that carefully: zero copies is allowed, so a starred item can vanish entirely.

Take the pattern ab*. The * attaches only to the b right before it, so this reads "an a, followed by zero or more bs". It matches:

"a" — yes! (the b appears zero times);
"ab", "abb", "abbbbbb" — yes (one or more bs);
"b" — no (there is no leading a);
"ba", "aab" — no (the shape is wrong).

Because * can repeat without limit, a single short pattern like ab* already describes an infinite set of strings — \{a,\ ab,\ abb,\ abbb,\ \dots\}. That is the magic: finite pattern, infinite language.

The single most common mistake with regexes is misreading *. It means zero or more — not "at least one", and not "exactly one". People instinctively expect a starred thing to be present, but:

ab* happily matches "a" with no b at all. If you truly want "an a then at least one b", that is ab+ (the + operator, below).
a* even matches the empty string "" — zero as is a perfectly valid "zero or more".

A useful mantra: * can always match nothing. Whenever a pattern seems to match strings you didn't expect, check whether a * is quietly matching zero copies.

Two handy shortcuts: `+` and `?`

Two more operators round out the everyday toolkit. Both are really just conveniences — anything they do could be written with the three operators above — but they are so common they earn their own symbols.

+ — "one or more". The plus is the star's demanding cousin: it insists on at least one copy. So ab+ matches "ab", "abb", "abbb" … but not "a". Note that x+ is just shorthand for xx^* — one copy, then zero or more.
? — "zero or one" (optional). The question mark makes the thing before it optional: it may appear once or not at all. So colou?r matches both "color" and "colour", and -?[0-9]+ matches a whole number with an optional minus sign in front.

A neat way to keep the three repetition operators straight is by how many copies each allows:

\underbrace{x?}_{0\text{ or }1}\qquad \underbrace{x^{*}}_{0\text{ or more}}\qquad \underbrace{x^{+}}_{1\text{ or more}}

Reading whole patterns

With concatenation, |, *, + and ? in hand you can already read serious patterns. The trick is to break a pattern into its pieces and describe each in words. Take this classic:

\texttt{(0|1)}^{*}\texttt{01}

Read left to right: (0|1)* is "zero or more binary digits" (any run of 0s and 1s, including none at all), and then 01 pins a literal 0 followed by a 1 on the end. Put together, it matches exactly the binary strings that end in 01:

String	`(0\|1)*01`?	Why
`"01"`	match	the `(0\|1)*` matches nothing, then `01`
`"11101"`	match	`111` is soaked up by the star, then `01`
`"0"`	reject	too short — there is no `01` to finish on
`"010"`	reject	it ends in `0`, not `01`
`"abc"`	reject	contains characters that aren't `0` or `1`

Notice how the pattern quietly describes an infinite set — every string ending in 01, of any length — in seven characters. That compactness is why regexes are everywhere in real software.

See it run

In TypeScript (and JavaScript) a regex is written between slashes, like /ab*/, and its .test(s) method returns true if the string s matches. The ^ and $ anchors mean "start of string" and "end of string", pinning the pattern so it must describe the whole string, not just a piece of it. Press Run and watch each pattern sort strings into match and reject:

// ab* : an "a", then ZERO or more "b"s const abStar = /^ab*$/; console.log(abStar.test("a")); // true — zero b's is allowed! console.log(abStar.test("ab")); // true console.log(abStar.test("abbbb")); // true console.log(abStar.test("b")); // false — must start with a console.log(abStar.test("aab")); // false — wrong shape // (0|1)*01 : any binary string that ENDS in "01" const endsIn01 = /^(0|1)*01$/; console.log(endsIn01.test("01")); // true console.log(endsIn01.test("11101")); // true console.log(endsIn01.test("010")); // false — ends in 0, not 01 console.log(endsIn01.test("0")); // false — too short

The +/* difference from the "Watch out!" box is easy to see for yourself. The two patterns below look almost identical, but one accepts a lone "a" and the other does not:

const zeroOrMore = /^ab*$/; // a, then ZERO or more b's const oneOrMore = /^ab+$/; // a, then ONE or more b's console.log(zeroOrMore.test("a")); // true — * allows no b at all console.log(oneOrMore.test("a")); // false — + demands at least one b console.log(oneOrMore.test("ab")); // true // A real use: a (very simplified) email check. const looksLikeEmail = /^[a-z]+@[a-z]+\.[a-z]+$/; console.log(looksLikeEmail.test("ada@example.com")); // true console.log(looksLikeEmail.test("not-an-email")); // false

Regular languages — and why regexes = finite state machines

Here is the payoff. In computer science a language just means a set of strings. A regular language is defined as any language that some regular expression can describe — the exact set of strings that regex matches. So "strings ending in 01" is a regular language, because (0|1)*01 picks it out; "an a then some bs" is a regular language, described by ab*.

Now the beautiful part. You have already met finite state machines (FSMs) — little diagrams of states and labelled arrows that read a string and either accept or reject it. It turns out that regexes and FSMs have exactly the same power:

Every language a regular expression can describe can also be recognised by some finite state machine.
Every language a finite state machine can recognise can also be described by some regular expression.

The two are just different notations for the very same class of languages — the regular languages. A regex is a compact written formula; an FSM is the same idea drawn as a diagram.

You can see the correspondence directly. The machine below is a finite state machine that accepts precisely the strings matched by ab* — a single a to move into the accepting state, then a self-loop that lets you go round on b as many times as you like (including not at all). Step through it:

The self-loop labelled b is the diagram's way of drawing the *: "stay here, reading another b, for as long as you want — or move on straight away." Every regex operator has a matching FSM gadget like this, which is why the two systems describe exactly the same languages.

Where regexes earn their keep

Regular expressions are not just theory — they are one of the most widely used tools in all of programming. Two jobs dominate:

Validation. Checking that user input has the right shape before you trust it: is this a well-formed email address, phone number, UK postcode, date, or password? A single regex does the work of pages of hand-written character checks.
Search and replace. Finding every occurrence of a pattern in a body of text — every hashtag in a tweet, every URL in a document, every line of a log file that mentions an error — and optionally rewriting it. The "find" box in most editors and the command-line tool grep are built on regexes.

Regular expressions are powerful, but they have a hard limit that comes straight from the FSM connection: a finite state machine has only a finite number of states, so it cannot count without bound. This means a genuine regular expression cannot match strings that require balancing or counting arbitrarily many things.

The classic example is matching balanced brackets — strings like ((())) where the number of opening brackets must equal the number of closing ones, for any depth. To do that you would have to remember how many ( you have seen so far, and that count can grow without limit — more than a fixed set of states can hold. The same goes for a^{n}b^{n} (some as followed by exactly the same number of bs). These languages are not regular; recognising them needs a more powerful machine (one with a memory stack — a pushdown automaton). So if a problem needs you to count and compare, a plain regex is the wrong tool.