Lexical Analysis

Before a compiler can understand your program, it has to read it — and to a computer, your source file is nothing but a flat river of characters: l, e, t, space, x, 1, 2, space, =, and so on. The very first pass of the compiler pipeline — the lexical analyser, or lexer (also called the scanner or tokenizer) — takes that raw stream and chops it into the meaningful "words" of the language, called tokens.

Think of it exactly like reading a sentence. You don't perceive "T·h·e·c·a·t" as fourteen letters — your eye groups them into the words The and cat, and quietly throws away the spaces. The lexer does the same job for code. Given the line

\texttt{let x12 = 3.5 + count;}

the lexer hands the next stage a tidy list of tokens — no whitespace, no ambiguity about where one symbol ends and the next begins:

KEYWORD let IDENT x12 EQUALS = NUMBER 3.5 PLUS + IDENT count SEMI ;

That flat list is all the parser (the next pass) ever sees. By classifying characters into tokens up front, the lexer lets everything downstream reason about words instead of letters — a huge simplification.

What exactly is a token?

A token is a small record with (usually) three parts:

Kind (or type/class) — the category: NUMBER, IDENT, PLUS, KEYWORD, STRING, and so on. This is what the parser matches on.
Lexeme — the exact run of characters from the source that formed the token: the text "x12", or "3.5", or "+".
Value and position (optional) — a value is the lexeme decoded into something useful (the string "3.5" → the number 3.5); a position (line and column) lets the compiler point at the exact spot in later error messages.

The distinction between kind and lexeme matters. The parser cares that x12 and count are both identifiers — same kind, different lexeme. It does not care which identifier while it works out the grammatical structure; the name is just cargo the token carries along.

Tokens fall into a handful of token classes:

Identifiers — names you invent: x12, count, totalPrice.
Keywords — reserved words the language owns: let, if, while, return.
Number literals — 42, 3.5, 0xFF, 6.02e23.
String literals — "hello", delimited by quotes.
Operators — + - * / = == < >= &&.
Punctuation / delimiters — ( ) { } , ;.

And two things get skipped entirely — the lexer consumes them but emits no token: whitespace (spaces, tabs, newlines) and comments. They exist to help the human reader; the parser never needs them, so the scanner silently swallows them between tokens.

Worked example: one line, tokenised

Let us hand-scan let total = price * 42; the way the lexer does — left to right, one token at a time, skipping the spaces. Watch how each token is a kind paired with the exact lexeme it consumed:

Lexeme	Kind	Why
`let`	KEYWORD	letters, and a reserved word
`total`	IDENT	letters, not reserved
`=`	EQUALS	single operator char
`price`	IDENT	letters, not reserved
`*`	STAR	operator
`42`	NUMBER	a run of digits
`;`	SEMI	punctuation

Seven tokens, and the four spaces have vanished. Notice how the lexer groups: t, o, t, a, l aren't five tokens — the scanner keeps eating letters as long as they keep coming, then emits one IDENT. That "keep eating" instinct has a name, and we will meet it shortly.

How the lexer knows the shapes: regular expressions

How does the scanner know that 3.5 is a number but x12 is a name? Each token class is described by a pattern, and the natural language for those patterns is regular expressions. Every token class is a regular language:

Identifier: a letter, then any number of letters or digits — [A\text{-}Za\text{-}z][A\text{-}Za\text{-}z0\text{-}9]^{*}
Integer: one or more digits — [0\text{-}9]^{+}
Decimal: digits, a dot, digits — [0\text{-}9]^{+}\,.\,[0\text{-}9]^{+}
Whitespace (to skip): [\ \backslash t\backslash n]^{+}

This is not just documentation — it is the algorithm. A classic result of automata theory is that every regular expression can be converted into a finite automaton: first an NFA (nondeterministic finite automaton, via Thompson's construction), then a deterministic DFA (via the subset construction). A DFA is the simplest machine there is — a bag of states and a transition table saying "in state S, on character c, go to state T." Feed it characters; if it ends in an accepting state, you have matched a token.

That pipeline — regex → NFA → DFA — is exactly what lexer generators like lex and flex do: you write the regular expressions, and the tool bakes out a blazing-fast DFA-driven scanner. The whole reason a lexer is a separate, fast pass is that its job is regular — it needs no memory beyond "which state am I in?"

A finite automaton has, by definition, a finite number of states — it cannot count arbitrarily high. To check that ((())) has matching brackets you must remember how many opens are still unclosed, and that count is unbounded, so no fixed set of states can track it. This is a consequence of the pumping lemma: the language of balanced brackets is not regular. That is precisely why lexing and parsing are split: the lexer handles the regular part (recognising each ( and ) as a token), and the parser handles the nesting with a stack (a context-free machine, a push-down automaton). Counting is the parser's job, not the lexer's.

Maximal munch: the longest match wins

When several patterns could match at the current position, which does the lexer pick? The universal rule is maximal munch (also called longest match): the scanner consumes the longest run of characters that still forms a valid token.

This is why, in a language with both > and >=, the input >= is read as one GTE token, not a > followed by an =. The scanner reads the >, sees it could keep going, tries the next character =, finds >= is a longer valid token, and takes it. Likewise the identifier scan eats t-o-t-a-l and only stops at the space, because total is a longer identifier than tota. Maximal munch is what makes multi-character operators and multi-digit numbers work at all.

Skip any leading whitespace and comments.
Look at the current character to decide the token class (digit → number, letter → identifier/keyword, symbol → operator/punctuation).
Consume the longest run of characters that forms a valid token of that class (maximal munch).
Emit one { kind, lexeme } token, then repeat until the input is exhausted.

A real hand-written scanner

Here is a complete tokenizer for a tiny expression language, written by hand in TypeScript — no generator, just the scanner loop above. It skips whitespace, reads multi-digit and decimal numbers, reads identifiers, distinguishes keywords from identifiers with a reserved-word set, and recognises the operators and punctuation + - * / ( ) =. Press Run to see the token stream it produces:

// A token is a kind plus the exact text (lexeme) that formed it. interface Token { kind: string; lexeme: string; } // Reserved words: these LOOK like identifiers but the language owns them. const KEYWORDS = new Set(["let", "if", "else", "while"]); // Single-character symbols mapped to their token kind. const SYMBOLS: Record<string, string> = { "+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH", "(": "LPAREN", ")": "RPAREN", "=": "EQUALS", }; const isDigit = (c: string) => c >= "0" && c <= "9"; const isAlpha = (c: string) => (c >= "a" && c <= "z") || (c >= "A" && c <= "Z") || c === "_"; function tokenize(src: string): Token[] { const tokens: Token[] = []; let i = 0; while (i < src.length) { const c = src[i]; // 1) Skip whitespace — emit nothing. if (c === " " || c === "\t" || c === "\n") { i++; continue; } // 2) Numbers: one or more digits, optionally a dot then more digits (maximal munch). if (isDigit(c)) { let start = i; while (i < src.length && isDigit(src[i])) i++; if (src[i] === "." && isDigit(src[i + 1])) { i++; // consume the "." while (i < src.length && isDigit(src[i])) i++; } tokens.push({ kind: "NUMBER", lexeme: src.slice(start, i) }); continue; } // 3) Identifiers / keywords: a letter, then letters or digits (longest match). if (isAlpha(c)) { let start = i; while (i < src.length && (isAlpha(src[i]) || isDigit(src[i]))) i++; const word = src.slice(start, i); // A keyword is just a reserved identifier — decide AFTER reading the whole word. const kind = KEYWORDS.has(word) ? "KEYWORD" : "IDENT"; tokens.push({ kind, lexeme: word }); continue; } // 4) Operators and punctuation: single characters from the table. if (c in SYMBOLS) { tokens.push({ kind: SYMBOLS[c], lexeme: c }); i++; continue; } // 5) Anything else is a lexical error. throw new Error("Unexpected character: " + c); } return tokens; } // Tokenise a sample line and print the token stream. const source = "let x12 = 3.5 + count * (10 - y)"; for (const t of tokenize(source)) { console.log(t.kind.padEnd(8), t.lexeme); }

Every branch of that loop is a token class; the two while loops are maximal munch in action — they keep consuming as long as the character still fits the pattern. And notice the keyword decision: we read the entire word first, then ask the reserved set whether it is a keyword. That ordering is the whole trick to keyword recognition.

Keywords are just reserved identifiers

A tempting design is to give each keyword its own regular expression and scan for them directly. Real lexers almost never do this. Instead they use the pattern you saw above: scan an identifier, then check a lookup table. let, if, and count all match the same identifier pattern [A\text{-}Za\text{-}z][A\text{-}Za\text{-}z0\text{-}9]^{*} — so you read the whole word with one rule, then ask a hash set "is this reserved?" If yes, its kind is KEYWORD (or a specific IF/LET kind); if no, it is an ordinary IDENT.

This is simpler and faster (one identifier rule plus an O(1) set lookup beats dozens of competing keyword patterns), and it explains why you cannot name a variable if: the keyword is reserved, so the table intercepts it before it can ever become an identifier.

Maximal munch is greedy and unconditional. With > and >= both in the language, x>=1 lexes as x, >=, 1 — the >= is one token, never > then =. This can even bite you: in old C, a+++b is read as a ++ + b because the scanner grabs the longest operator first.
123abc is not one token. The number rule munches 123 and stops at a (a digit run cannot contain letters); the identifier rule then reads abc. You get two adjacent tokens NUMBER 123, IDENT abc — it is the parser, later, that may reject that combination as a syntax error. The lexer itself is happy.
Keywords are decided after the whole word, not during. If you check "is this a keyword?" character-by-character you will wrongly flag iffy or lettuce. Read the entire identifier, then consult the reserved set once.
Whitespace is skipped, not deleted from meaning. Spaces separate tokens — they are how the lexer knows letx (one identifier) differs from let x (keyword + identifier). The lexer emits no whitespace token, but the boundary it marks is real.

The lexer is the compiler's first pass: it turns a stream of characters into a stream of tokens, discarding whitespace and comments.
A token is a kind (its class) plus a lexeme (the matched text), and optionally a decoded value and source position.
Token classes are described by regular expressions and recognised by finite automata (regex → NFA → DFA).
Maximal munch resolves ambiguity: always take the longest valid token.
Keywords are reserved identifiers: scan an identifier, then look it up in a reserved-word set.