Imagine you are writing the sign-up page for a website and you need to answer one deceptively simple question: is this a valid email address? Or you are searching a huge log file for every line that looks like a date. Or a phone number. Or a UK postcode. In every case you are not looking for one fixed string — you are looking for anything that fits a certain shape. A regular expression (almost always shortened to regex) is the tool for describing exactly that shape: it is a tiny, compact pattern that stands for a whole set of strings at once.
A plain string like "cat" describes just one word. A regex like c.t
describes many — cat, cot, cut,
c9t and so on — all the three-letter strings that start with c and end
with t. That leap, from naming one string to describing a possibly
infinite family of them in a few characters, is the whole idea. This page builds
that idea up operator by operator, shows you the exact strings each pattern matches and rejects,
and then reveals a beautiful fact: regexes are secretly the very same thing as finite state machines.
The most basic thing a regex can do is spell out characters one after another. Writing
abc as a pattern means "an a, then a b,
then a c" — the characters are glued in order. This is called
concatenation, and it is the same "join end to end" idea as joining strings in a program.
On its own, abc matches exactly one string: "abc". Not very exciting —
it is no better than the plain word. Concatenation only becomes powerful once we can drop
choices and repetition into the sequence, so that each position
stands for more than one possibility. That is what the next three operators give us.
| means "or"
The alternation operator, written | and read aloud as
"or", lets a pattern match either of two things. The pattern
cat|dog matches the string "cat" or the string "dog" —
and nothing else. Round brackets group a choice so it applies to just part of a pattern:
gr(a|e)y matches both the American spelling "gray" and the British spelling
"grey", because the (a|e) stands for "an a or an
e" sitting between gr and y.
0|1 matches a single "0" or a single "1" (one binary digit).
(Mr|Ms|Dr) matches any one of the three titles.
(cat|dog|fish)s? combined with the ? you will meet shortly, matches those animals singular or plural.

Alternation is what turns a regex from "one exact string" into "any of these": the first taste of a pattern standing for a set.
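The alternation examples above can be tried out with a minimal TypeScript sketch. It assumes the slash syntax and the ^ and $ anchors ("start of string" and "end of string"), which pin each pattern to the whole string. Note the extra brackets around cat|dog: without them, ^cat|dog$ would read as "starts with cat, or ends with dog". The variable names and sample strings are illustrative only.

```typescript
// Anchored so that each pattern must describe the whole string.
const catOrDog: RegExp = /^(cat|dog)$/;
const grayOrGrey: RegExp = /^gr(a|e)y$/;

console.log(catOrDog.test("cat"));    // true
console.log(catOrDog.test("dog"));    // true
console.log(catOrDog.test("cow"));    // false: neither alternative fits

console.log(grayOrGrey.test("gray")); // true
console.log(grayOrGrey.test("grey")); // true
console.log(grayOrGrey.test("groy")); // false: o is not one of the choices
```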
* means "zero or more"
The star * — named the Kleene star after the logician Stephen
Kleene, who invented regular expressions in the 1950s — is the operator that makes patterns truly
powerful, because it introduces repetition. Placed after something, it means
"zero or more copies of that thing". Read that carefully: zero copies is
allowed, so a starred item can vanish entirely.
Take the pattern ab*. The * attaches only to the b right
before it, so this reads "an a, followed by zero or more bs". It
matches:
"a": yes! (the b appears zero times);
"ab", "abb", "abbbbbb": yes (one or more bs);
"b": no (there is no leading a);
"ba", "aab": no (the shape is wrong).
Because * can repeat without limit, a single short pattern like ab*
already describes an infinite set of strings: "a", "ab", "abb", "abbb" and so on.
Watch out!
The single most common mistake with regexes is misreading *. It means
zero or more, not "at least one" and not "exactly one".
People instinctively expect a starred thing to be present, but:
ab* happily matches "a" with no b at all. If you truly want "an a then at least one b", that is ab+ (the + operator, below).
a* even matches the empty string "": zero copies of a is a perfectly valid "zero or more".
A useful mantra: * can always match nothing. Whenever a pattern
seems to match strings you didn't expect, check whether a * is quietly matching
zero copies.
+ and ?
Two more operators round out the everyday toolkit. Both are really just conveniences: anything they do could be written with the three operators above, but they are so common they earn their own symbols.
+ — "one or more". The plus is the star's demanding cousin: it
insists on at least one copy. So ab+ matches "ab",
"abb", "abbb" … but not "a". Note that ab+ is just shorthand for abb*.
? — "zero or one" (optional). The question mark makes the thing
before it optional: it may appear once or not at all. So colou?r matches
both "color" and "colour", and -?[0-9]+ matches a whole
number with an optional minus sign in front.
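As a quick check of the ? operator, here is a small TypeScript sketch. It assumes the ^ and $ anchors so that each pattern must cover the whole string; the sample inputs are made up for illustration.

```typescript
const colour: RegExp = /^colou?r$/;
const wholeNumber: RegExp = /^-?[0-9]+$/;

console.log(colour.test("color"));    // true: the u appears zero times
console.log(colour.test("colour"));   // true: the u appears once
console.log(colour.test("colouur"));  // false: ? allows at most one u

console.log(wholeNumber.test("42"));  // true: no minus sign
console.log(wholeNumber.test("-7"));  // true: one optional minus sign
console.log(wholeNumber.test("--7")); // false: only one minus is allowed
console.log(wholeNumber.test("-"));   // false: + demands at least one digit
```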
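A neat way to keep the three repetition operators straight is by how many copies each allows:
| Operator | Copies allowed |
|---|---|
| * | zero or more |
| + | one or more |
| ? | zero or one |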
With concatenation, |, *, + and ? in hand you
can already read serious patterns. The trick is to break a pattern into its pieces and describe
each in words. Take this classic: (0|1)*01
Read left to right: (0|1)* is "zero or more binary digits" (any run of
0s and 1s, including none at all), and then 01 pins a
literal 0 followed by a 1 on the end. Put together, it matches
exactly the binary strings that end in 01:
| String | Result | Why |
|---|---|---|
| "01" | match | the starred group matches nothing, then 01 |
| "11101" | match | 111 is soaked up by the star, then 01 |
| "0" | reject | too short: there is no 01 to finish on |
| "010" | reject | it ends in 10, not 01 |
| "abc" | reject | contains characters that aren't 0 or 1 |
Notice how the pattern quietly describes an infinite set (every binary string ending in 01,
of any length) in eight characters. That compactness is why regexes are everywhere in real
software.
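The rows of the table can be reproduced with a few lines of TypeScript. This is a minimal sketch that assumes the ^ and $ anchors to force a whole-string match; the names endsIn01, samples and results are illustrative.

```typescript
// (0|1)*01 anchored to the whole string: binary strings that end in 01.
const endsIn01: RegExp = /^(0|1)*01$/;

const samples: string[] = ["01", "11101", "0", "010", "abc"];

// One boolean per sample, in the same order as the table rows.
const results: boolean[] = samples.map((s) => endsIn01.test(s));

samples.forEach((s, i) => {
  console.log(`${JSON.stringify(s)} -> ${results[i] ? "match" : "reject"}`);
});
```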
In TypeScript (and JavaScript) a regex is written between slashes, like /ab*/, and
its .test(s) method returns true if the pattern matches somewhere in the string s.
The ^ and $ anchors mean "start of string" and "end of string", pinning
the pattern so it must describe the whole string, not just a piece of it.
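A minimal sketch of .test() with and without anchors follows. The sample strings are chosen for illustration; the point is that an unanchored pattern only has to match a piece of the string, while the anchored version must account for all of it.

```typescript
const unanchored: RegExp = /ab*/;   // matches if "a then zero or more b" occurs anywhere
const anchored: RegExp = /^ab*$/;   // the whole string must be "a then zero or more b"

console.log(unanchored.test("xxabbxx")); // true: "abb" is found inside the string
console.log(anchored.test("xxabbxx"));   // false: the x characters do not fit

console.log(unanchored.test("ba"));      // true: the lone "a" at the end matches
console.log(anchored.test("ba"));        // false: the string does not start with a

console.log(anchored.test("abb"));       // true: the whole string fits the shape
```

When a pattern is meant to validate a complete input, anchoring it is usually the safer choice.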
The +/* difference from the "Watch out!" box is easy to see for
yourself. The patterns ab* and ab+ look almost identical, but ab* accepts a lone "a"
and ab+ does not.
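Here is one way to compare the two side by side in TypeScript. The sketch assumes anchored versions of ab* and ab+; the names starPattern, plusPattern and emptyOk are illustrative.

```typescript
const starPattern: RegExp = /^ab*$/; // a, then zero or more b
const plusPattern: RegExp = /^ab+$/; // a, then one or more b

for (const s of ["a", "ab", "abbb", "b"]) {
  console.log(`${JSON.stringify(s)}: star=${starPattern.test(s)} plus=${plusPattern.test(s)}`);
}

// A starred item can vanish entirely, so a* accepts the empty string.
const emptyOk: boolean = /^a*$/.test("");
console.log(emptyOk); // true
```

Only the lone "a" separates the two patterns: the star accepts it, the plus rejects it.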
Here is the payoff. In computer science a language just means a
set of strings. A regular language is defined as any language that
some regular expression can describe — the exact set of strings that regex matches. So
"strings ending in 01" is a regular language, because (0|1)*01 picks it
out; "an a then some bs" is a regular language, described by
ab*.
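Now the beautiful part. You have already met finite state machines (FSMs). A language can be described by a regular expression exactly when some finite state machine accepts it.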
The two are just different notations for the very same class of languages — the regular languages. A regex is a compact written formula; an FSM is the same idea drawn as a diagram.
You can see the correspondence directly. Picture a finite state machine that accepts
precisely the strings matched by ab*: a single a moves it into
the accepting state, and a self-loop on that state lets it go round on b as many times as
you like (including not at all).
The self-loop labelled b is the diagram's way of drawing the *: "stay
here, reading another b, for as long as you want — or move on straight away." Every
regex operator has a matching FSM gadget like this, which is why the two systems describe exactly
the same languages.
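The machine for ab* is small enough to simulate by hand. The sketch below is one possible encoding, not the only one: the state names "start", "accept" and "dead" and the function names step and acceptsAbStar are assumptions made for illustration. The "dead" state stands for "no valid transition was available".

```typescript
type State = "start" | "accept" | "dead";

// One transition of the machine: read a single character from a given state.
function step(state: State, ch: string): State {
  if (state === "start" && ch === "a") return "accept";  // the single leading a
  if (state === "accept" && ch === "b") return "accept"; // the self-loop: this is the *
  return "dead";                                         // anything else can never be accepted
}

// Run the machine over the whole input and report whether it ends in the accepting state.
function acceptsAbStar(input: string): boolean {
  let state: State = "start";
  for (let i = 0; i < input.length; i++) {
    state = step(state, input.charAt(i));
  }
  return state === "accept";
}

// The machine and the anchored regex should agree on every string.
const abStar: RegExp = /^ab*$/;
for (const s of ["a", "ab", "abbb", "", "b", "ba", "aab"]) {
  console.log(`${JSON.stringify(s)}: fsm=${acceptsAbStar(s)} regex=${abStar.test(s)}`);
}
```

Note that the machine needs only three states no matter how long the input is, which is exactly what "finite state" means.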
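Regular expressions are not just theory: they are one of the most widely used tools in all of programming. Two jobs dominate:
Validating input: checking that what a user typed has the right shape, such as an email address, a phone number or a postcode.
Searching text: finding every line that fits a pattern, such as the dates in a log file. Search tools such as grep are built on regexes.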
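Both uses, validating and searching, can be sketched in a few lines of TypeScript. Everything here is hypothetical: the log lines are invented, and the date pattern checks only the shape YYYY-MM-DD (four digits, two digits, two digits), not whether the date exists on a calendar. The pattern is built from [0-9] by plain concatenation so that it uses only operators introduced on this page.

```typescript
// Shape of a date like 2024-03-01, spelled out digit by digit.
const D = "[0-9]";
const dateShape = `${D}${D}${D}${D}-${D}${D}-${D}${D}`;

// Validation: the whole string must be a date shape.
const wholeStringIsDate: RegExp = new RegExp(`^${dateShape}$`);
// Searching: a date shape may appear anywhere in the line.
const containsDate: RegExp = new RegExp(dateShape);

console.log(wholeStringIsDate.test("2024-03-01")); // true
console.log(wholeStringIsDate.test("2024-3-1"));   // false: wrong number of digits

// Hypothetical log lines, for illustration only.
const logLines: string[] = [
  "2024-03-01 server started",
  "no timestamp on this line",
  "backup finished at 2024-03-02",
];

// Keep only the lines that contain something shaped like a date.
const datedLines: string[] = logLines.filter((line) => containsDate.test(line));
console.log(datedLines.length); // 2
```

A shape check like this would still accept "9999-99-99", so real validation usually pairs the regex with further checks.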
Regular expressions are powerful, but they have a hard limit that comes straight from the FSM connection: a finite state machine has only a finite number of states, so it cannot count without bound. This means a genuine regular expression cannot match strings that require balancing or counting arbitrarily many things.
The classic example is matching balanced brackets — strings like
((())) where the number of opening brackets must equal the number of closing ones,
for any depth. To do that you would have to remember how many ( you have
seen so far, and that count can grow without limit — more than a fixed set of states can hold.
The same goes for the language of matched runs (some number of copies of a followed by
exactly the same number of copies of b). These languages are not
regular; recognising them needs a more powerful machine (one with a memory stack — a