Before a compiler can understand your program, it has to read it — and to a computer, your
source file is nothing but a flat river of characters: l, e, t,
space, x, 1, 2, space, =, and so on. The very
first pass of the
Think of it exactly like reading a sentence. You don't perceive
"T·h·e·c·a·t" as fourteen letters — your eye groups
them into the words The and cat, and quietly throws away the spaces. The lexer does
the same job for code. Given the line
the lexer hands the next stage a tidy list of tokens — no whitespace, no ambiguity about where one symbol ends and the next begins:
KEYWORD let IDENT x12 EQUALS =
NUMBER 3.5 PLUS + IDENT count
SEMI ;
That flat list is all the parser (the next pass) ever sees. By classifying characters into tokens up front, the lexer lets everything downstream reason about words instead of letters — a huge simplification.
A token is a small record with (usually) three parts:
NUMBER,
IDENT, PLUS, KEYWORD, STRING, and so on. This
is what the parser matches on."x12", or "3.5", or "+"."3.5" → the number 3.5); a
position (line and column) lets the compiler point at the exact spot in later error
messages.
The distinction between kind and lexeme matters. The parser cares
that x12 and count are both identifiers — same kind, different
lexeme. It does not care which identifier while it works out the grammatical structure; the
name is just cargo the token carries along.
Tokens fall into a handful of token classes:
x12, count,
totalPrice.let, if,
while, return.42, 3.5, 0xFF,
6.02e23."hello", delimited by quotes.+ - * / = == < >= &&.( ) { } , ;.And two things get skipped entirely — the lexer consumes them but emits no token: whitespace (spaces, tabs, newlines) and comments. They exist to help the human reader; the parser never needs them, so the scanner silently swallows them between tokens.
Let us hand-scan let total = price * 42; the way the lexer does — left to right, one
token at a time, skipping the spaces. Watch how each token is a kind paired with the exact
lexeme it consumed:
| Lexeme | Kind | Why |
|---|---|---|
let | KEYWORD | letters, and a reserved word |
total | IDENT | letters, not reserved |
= | EQUALS | single operator char |
price | IDENT | letters, not reserved |
* | STAR | operator |
42 | NUMBER | a run of digits |
; | SEMI | punctuation |
Seven tokens, and the four spaces have vanished. Notice how the lexer groups: t,
o, t, a, l aren't five tokens — the scanner keeps
eating letters as long as they keep coming, then emits one IDENT. That "keep eating"
instinct has a name, and we will meet it shortly.
How does the scanner know that 3.5 is a number but x12 is a name? Each token
class is described by a pattern, and the natural language for those patterns is
This is not just documentation — it is the algorithm. A classic result of automata theory is that every regular expression can be converted into a finite automaton: first an NFA (nondeterministic finite automaton, via Thompson's construction), then a deterministic DFA (via the subset construction). A DFA is the simplest machine there is — a bag of states and a transition table saying "in state S, on character c, go to state T." Feed it characters; if it ends in an accepting state, you have matched a token.
That pipeline — regex → NFA → DFA — is exactly what lexer generators like
lex and flex do: you write the regular expressions, and the tool bakes out a
blazing-fast DFA-driven scanner. The whole reason a lexer is a separate, fast pass is that its job is
regular — it needs no memory beyond "which state am I in?"
A finite automaton has, by definition, a finite number of states — it cannot count
arbitrarily high. To check that ((())) has matching brackets you must remember
how many opens are still unclosed, and that count is unbounded, so no fixed set of states
can track it. This is a consequence of the pumping lemma: the language of balanced
brackets is not regular. That is precisely why lexing and parsing are split: the lexer
handles the regular part (recognising each ( and ) as a token), and the
parser handles the nesting with a stack (a context-free machine,
a push-down automaton). Counting is the parser's job, not the lexer's.
When several patterns could match at the current position, which does the lexer pick? The universal rule is maximal munch (also called longest match): the scanner consumes the longest run of characters that still forms a valid token.
This is why, in a language with both > and >=, the input
>= is read as one GTE token, not a >
followed by an =. The scanner reads the >, sees it could keep
going, tries the next character =, finds >= is a longer valid token, and
takes it. Likewise the identifier scan eats t-o-t-a-l and only stops at the space,
because total is a longer identifier than tota. Maximal munch is what makes
multi-character operators and multi-digit numbers work at all.
{ kind, lexeme } token, then repeat until the input is exhausted.
Here is a complete tokenizer for a tiny expression language, written by hand in TypeScript — no
generator, just the scanner loop above. It skips whitespace, reads multi-digit and decimal numbers,
reads identifiers, distinguishes keywords from identifiers with a reserved-word set, and recognises
the operators and punctuation + - * / ( ) =. Press Run to see the token
stream it produces:
Every branch of that loop is a token class; the two while loops are maximal
munch in action — they keep consuming as long as the character still fits the pattern. And notice the
keyword decision: we read the entire word first, then ask the reserved set whether
it is a keyword. That ordering is the whole trick to keyword recognition.
A tempting design is to give each keyword its own regular expression and scan for them directly. Real
lexers almost never do this. Instead they use the pattern you saw above: scan an identifier,
then check a lookup table. let, if, and count all
match the same identifier pattern KEYWORD (or a specific IF/LET kind); if no, it is an ordinary
IDENT.
This is simpler and faster (one identifier rule plus an O(1) set lookup beats dozens of
competing keyword patterns), and it explains why you cannot name a variable if: the
keyword is reserved, so the table intercepts it before it can ever become an identifier.
> and
>= both in the language, x>=1 lexes as x,
>=, 1 — the >= is one token, never
> then =. This can even bite you: in old C, a+++b is
read as a ++ + b because the scanner grabs the longest operator first.
123abc is not one token. The number rule munches 123
and stops at a (a digit run cannot contain letters); the identifier rule then reads
abc. You get two adjacent tokens NUMBER 123, IDENT abc —
it is the parser, later, that may reject that combination as a syntax error. The lexer
itself is happy.
iffy or lettuce.
Read the entire identifier, then consult the reserved set once.
letx (one identifier) differs from let x
(keyword + identifier). The lexer emits no whitespace token, but the boundary it marks is real.