Lexical Analysis

Before a compiler can understand your program, it has to read it — and to a computer, your source file is nothing but a flat river of characters: l, e, t, space, x, 1, 2, space, =, and so on. The very first pass of the compiler pipeline — the lexical analyser, or lexer (also called the scanner or tokenizer) — takes that raw stream and chops it into the meaningful "words" of the language, called tokens.

Think of it exactly like reading a sentence. You don't perceive "T·h·e·c·a·t" as fourteen letters — your eye groups them into the words The and cat, and quietly throws away the spaces. The lexer does the same job for code. Given the line

\texttt{let x12 = 3.5 + count;}

the lexer hands the next stage a tidy list of tokens — no whitespace, no ambiguity about where one symbol ends and the next begins:

KEYWORD let   IDENT x12   EQUALS =   NUMBER 3.5   PLUS +   IDENT count   SEMI ;

That flat list is all the parser (the next pass) ever sees. By classifying characters into tokens up front, the lexer lets everything downstream reason about words instead of letters — a huge simplification.

What exactly is a token?

A token is a small record with (usually) three parts:

The distinction between kind and lexeme matters. The parser cares that x12 and count are both identifiers — same kind, different lexeme. It does not care which identifier while it works out the grammatical structure; the name is just cargo the token carries along.

Tokens fall into a handful of token classes:

And two things get skipped entirely — the lexer consumes them but emits no token: whitespace (spaces, tabs, newlines) and comments. They exist to help the human reader; the parser never needs them, so the scanner silently swallows them between tokens.

Worked example: one line, tokenised

Let us hand-scan let total = price * 42; the way the lexer does — left to right, one token at a time, skipping the spaces. Watch how each token is a kind paired with the exact lexeme it consumed:

Lexeme Kind Why
letKEYWORDletters, and a reserved word
totalIDENTletters, not reserved
=EQUALSsingle operator char
priceIDENTletters, not reserved
*STARoperator
42NUMBERa run of digits
;SEMIpunctuation

Seven tokens, and the four spaces have vanished. Notice how the lexer groups: t, o, t, a, l aren't five tokens — the scanner keeps eating letters as long as they keep coming, then emits one IDENT. That "keep eating" instinct has a name, and we will meet it shortly.

How the lexer knows the shapes: regular expressions

How does the scanner know that 3.5 is a number but x12 is a name? Each token class is described by a pattern, and the natural language for those patterns is regular expressions. Every token class is a regular language:

This is not just documentation — it is the algorithm. A classic result of automata theory is that every regular expression can be converted into a finite automaton: first an NFA (nondeterministic finite automaton, via Thompson's construction), then a deterministic DFA (via the subset construction). A DFA is the simplest machine there is — a bag of states and a transition table saying "in state S, on character c, go to state T." Feed it characters; if it ends in an accepting state, you have matched a token.

That pipeline — regex → NFA → DFA — is exactly what lexer generators like lex and flex do: you write the regular expressions, and the tool bakes out a blazing-fast DFA-driven scanner. The whole reason a lexer is a separate, fast pass is that its job is regular — it needs no memory beyond "which state am I in?"

A finite automaton has, by definition, a finite number of states — it cannot count arbitrarily high. To check that ((())) has matching brackets you must remember how many opens are still unclosed, and that count is unbounded, so no fixed set of states can track it. This is a consequence of the pumping lemma: the language of balanced brackets is not regular. That is precisely why lexing and parsing are split: the lexer handles the regular part (recognising each ( and ) as a token), and the parser handles the nesting with a stack (a context-free machine, a push-down automaton). Counting is the parser's job, not the lexer's.

Maximal munch: the longest match wins

When several patterns could match at the current position, which does the lexer pick? The universal rule is maximal munch (also called longest match): the scanner consumes the longest run of characters that still forms a valid token.

This is why, in a language with both > and >=, the input >= is read as one GTE token, not a > followed by an =. The scanner reads the >, sees it could keep going, tries the next character =, finds >= is a longer valid token, and takes it. Likewise the identifier scan eats t-o-t-a-l and only stops at the space, because total is a longer identifier than tota. Maximal munch is what makes multi-character operators and multi-digit numbers work at all.

A real hand-written scanner

Here is a complete tokenizer for a tiny expression language, written by hand in TypeScript — no generator, just the scanner loop above. It skips whitespace, reads multi-digit and decimal numbers, reads identifiers, distinguishes keywords from identifiers with a reserved-word set, and recognises the operators and punctuation + - * / ( ) =. Press Run to see the token stream it produces:

// A token is a kind plus the exact text (lexeme) that formed it. interface Token { kind: string; lexeme: string; } // Reserved words: these LOOK like identifiers but the language owns them. const KEYWORDS = new Set(["let", "if", "else", "while"]); // Single-character symbols mapped to their token kind. const SYMBOLS: Record<string, string> = { "+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH", "(": "LPAREN", ")": "RPAREN", "=": "EQUALS", }; const isDigit = (c: string) => c >= "0" && c <= "9"; const isAlpha = (c: string) => (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_"; function tokenize(src: string): Token[] { const tokens: Token[] = []; let i = 0; while (i < src.length) { const c = src[i]; // 1) Skip whitespace — emit nothing. if (c === " " || c === "\t" || c === "\n") { i++; continue; } // 2) Numbers: one or more digits, optionally a dot then more digits (maximal munch). if (isDigit(c)) { let start = i; while (i < src.length && isDigit(src[i])) i++; if (src[i] === "." && isDigit(src[i + 1])) { i++; // consume the "." while (i < src.length && isDigit(src[i])) i++; } tokens.push({ kind: "NUMBER", lexeme: src.slice(start, i) }); continue; } // 3) Identifiers / keywords: a letter, then letters or digits (longest match). if (isAlpha(c)) { let start = i; while (i < src.length && (isAlpha(src[i]) || isDigit(src[i]))) i++; const word = src.slice(start, i); // A keyword is just a reserved identifier — decide AFTER reading the whole word. const kind = KEYWORDS.has(word) ? "KEYWORD" : "IDENT"; tokens.push({ kind, lexeme: word }); continue; } // 4) Operators and punctuation: single characters from the table. if (c in SYMBOLS) { tokens.push({ kind: SYMBOLS[c], lexeme: c }); i++; continue; } // 5) Anything else is a lexical error. throw new Error("Unexpected character: " + c); } return tokens; } // Tokenise a sample line and print the token stream. const source = "let x12 = 3.5 + count * (10 - y)"; for (const t of tokenize(source)) { console.log(t.kind.padEnd(8), t.lexeme); }

Every branch of that loop is a token class; the two while loops are maximal munch in action — they keep consuming as long as the character still fits the pattern. And notice the keyword decision: we read the entire word first, then ask the reserved set whether it is a keyword. That ordering is the whole trick to keyword recognition.

Keywords are just reserved identifiers

A tempting design is to give each keyword its own regular expression and scan for them directly. Real lexers almost never do this. Instead they use the pattern you saw above: scan an identifier, then check a lookup table. let, if, and count all match the same identifier pattern [A\text{-}Za\text{-}z][A\text{-}Za\text{-}z0\text{-}9]^{*} — so you read the whole word with one rule, then ask a hash set "is this reserved?" If yes, its kind is KEYWORD (or a specific IF/LET kind); if no, it is an ordinary IDENT.

This is simpler and faster (one identifier rule plus an O(1) set lookup beats dozens of competing keyword patterns), and it explains why you cannot name a variable if: the keyword is reserved, so the table intercepts it before it can ever become an identifier.