Context-Free Grammars and Parsing

Every programming language has a rulebook for what counts as a well-formed program. x = a + b * 2 is fine; x = + a b) 2 = is gibberish. What is that rulebook, written down precisely enough that a machine can enforce it? It is a context-free grammar (CFG) — a compact set of rules that generates exactly the strings the language allows. And the program that reads your source and checks it against those rules — deciding "yes, this fits the grammar, here is its structure" or "no, syntax error on line 12" — is the parser.

The parser is the compiler's second pass. The tokens from the lexer stream in — IDENT, PLUS, NUMBER, … — and the parser arranges that flat stream into a tree that captures how the pieces nest. This page is about the two halves of that story: the grammar that defines a language's syntax, and parsing, the act of recovering a program's structure against that grammar.

The four ingredients of a grammar

A context-free grammar is built from just four things:

This notation — nonterminals in brackets, ::= for "is defined as", | for alternatives — is BNF (Backus–Naur Form), invented to describe ALGOL 60 and still the lingua franca of language specs. EBNF (Extended BNF) adds convenient shorthand: ? for optional, * for "zero or more", + for "one or more". Here is a complete grammar for simple arithmetic:

<expr> ::= <expr> "+" <term> | <expr> "-" <term> | <term> <term> ::= <term> "*" <factor> | <term> "/" <factor> | <factor> <factor> ::= NUMBER | "(" <expr> ")"

Three nonterminals (<expr>, <term>, <factor>), a handful of terminals (+ - * / ( ) and NUMBER), six productions, and the start symbol <expr>. That tiny rulebook generates every well-formed arithmetic expression and rejects every malformed one. Notice the rules are recursive<expr> mentions itself, and <factor> loops back to <expr> through the brackets — which is how a finite set of rules describes an infinite set of programs.

Deriving a string, step by step

To show a string belongs to the language, you derive it: start at the start symbol and repeatedly replace one nonterminal using a production, until only terminals remain. When you always expand the leftmost nonterminal, it is a leftmost derivation. Here is 1 + 2 * 3 derived from our grammar (the rule used is on the right):

<expr> => <expr> "+" <term> (expr ::= expr + term) => <term> "+" <term> (expr ::= term) => <factor> "+" <term> (term ::= factor) => 1 "+" <term> (factor ::= NUMBER, i.e. 1) => 1 "+" <term> "*" <factor> (term ::= term * factor) => 1 "+" <factor> "*" <factor> (term ::= factor) => 1 "+" 2 "*" <factor> (factor ::= NUMBER, i.e. 2) => 1 "+" 2 "*" 3 (factor ::= NUMBER, i.e. 3)

Eight steps, and we have manufactured the exact token sequence 1 + 2 * 3 from the rules alone. Each => is one rewrite. The order we expanded in does not change the final string, but it does record a path through the grammar — and that path has a shape.

The parse tree: structure made visible

A derivation, drawn as a tree, is a parse tree (or derivation tree). The root is the start symbol; every internal node is a nonterminal, and its children are exactly the symbols on the right-hand side of the production used to expand it; the leaves, read left to right, spell out the input. The parse tree below is the one derivation above, for 1 + 2 * 3. Play through it:

The shape is the point. The + is near the root; the * lives deeper, tucked entirely inside the right-hand <term>. Because a subtree is a self-contained group, the tree says 1 + (2 * 3), not (1 + 2) * 3 — the multiplication is grouped, and evaluated, first. The grammar's structure has quietly encoded precedence, and we will see exactly how in a moment.

Why "context-free" — and why you need a stack

The free in context-free means each production replaces a single nonterminal no matter what surrounds it — the left-hand side is always exactly one nonterminal, with no conditions on its neighbours. That freedom is what makes grammars easy to reason about, yet still powerful enough for real syntax.

And they are strictly more powerful than the regular expressions the lexer uses. A finite automaton has a fixed, finite number of states, so it cannot count without bound — it cannot verify that ( ( ( … ) ) ) has as many closing brackets as opening ones, because the nesting depth is unlimited. A CFG can, because the machine that recognises context-free languages is a pushdown automaton: a finite-state machine with one extra gadget bolted on — a stack. Every ( pushes; every ) pops; balanced exactly when the stack empties. That stack is unbounded memory, and it is precisely what lets a parser track arbitrarily deep nesting. This is the deep reason lexing and parsing are separate passes: recognising tokens is a regular job (no memory needed), but recognising nested structure is a context-free job (a stack needed).

Try to write one for "any number of matched brackets": (), (()), ((())), … You would need the pattern to remember how many opens are still unclosed, and that count grows without limit. The pumping lemma proves no finite automaton — and hence no regular expression — can do it: the language of balanced brackets is not regular. A context-free grammar handles it in two lines, <s> ::= "(" <s> ")" <s> | ε, because its recognising machine, the pushdown automaton, gets a stack to count with. Nesting is exactly the frontier between the two.

Ambiguity, precedence, and associativity

Here is where grammar design earns its keep. Suppose we had written the simpler, sloppier grammar instead:

<expr> ::= <expr> "+" <expr> | <expr> "*" <expr> | NUMBER

It generates the same strings — but now 1 + 2 * 3 has two different parse trees: one with + at the root (giving 1 + (2 * 3) = 7) and one with * at the root (giving (1 + 2) * 3 = 9). A grammar that allows two distinct parse trees for one string is ambiguous, and ambiguity is poison: the parser has no principled way to choose, so the meaning of the program is undefined.

The cure is to stratify the grammar into levels — exactly what our original expr → term → factor ladder does. Each tier captures one precedence level: + and - live at the loose <expr> tier, * and / at the tighter <term> tier, and atoms/brackets at <factor>. Because a <term> can never expand back up to an <expr> except through brackets, multiplication is forced to nest below addition — precedence is baked into the shape, and only one parse tree survives. The same trick fixes associativity: writing the rule as <expr> ::= <expr> "-" <term> (recursion on the left) makes 1 - 2 - 3 parse as (1 - 2) - 3, the correct left-associative answer, rather than 1 - (2 - 3).

The most famous ambiguity in real languages is the dangling else: in if a then if b then s1 else s2, does the else attach to the first if or the second? The grammar, if written naively, permits both parse trees. Languages resolve it by fiat — C, Java and friends decree "else binds to the nearest unmatched if" — and encode that choice by rewriting the grammar into matched and unmatched statement tiers, the very same stratifying trick.

So what is parsing?

Put it all together. Parsing is the process of taking a flat stream of tokens and either recovering a parse tree for them against the grammar — proving the input is a valid program and revealing its structure — or reporting a syntax error when no derivation exists. It is, quite literally, running a derivation backwards: the derivation goes from the start symbol down to the tokens; the parser goes from the tokens back up to the start symbol.

There are two grand strategies, and both are just disciplined ways of driving a pushdown automaton. Top-down parsers (recursive descent, LL) start at the start symbol and try to expand rules to match the incoming tokens — one recursive function per nonterminal, the call stack playing the role of the PDA's stack. Bottom-up parsers (LR, the kind tools like yacc and bison generate) start from the tokens and shift them onto a stack, reducing groups back into nonterminals as complete right-hand sides appear. Either way, success means a parse tree; failure means a precise, locatable syntax error.