Context-Free Grammars and Parsing

Every programming language has a rulebook for what counts as a well-formed program. x = a + b * 2 is fine; x = + a b) 2 = is gibberish. What is that rulebook, written down precisely enough that a machine can enforce it? It is a context-free grammar (CFG) — a compact set of rules that generates exactly the strings the language allows. And the program that reads your source and checks it against those rules — deciding "yes, this fits the grammar, here is its structure" or "no, syntax error on line 12" — is the parser.

The parser is the compiler's second pass. The tokens from the lexer stream in — IDENT, PLUS, NUMBER, … — and the parser arranges that flat stream into a tree that captures how the pieces nest. This page is about the two halves of that story: the grammar that defines a language's syntax, and parsing, the act of recovering a program's structure against that grammar.

The four ingredients of a grammar

A context-free grammar is built from just four things:

Terminals — the atoms, the actual symbols that appear in the final string. For a programming language these are the tokens: +, *, (, ), a NUMBER, an identifier. They are called terminal because a derivation stops when it reaches them — there is nothing left to expand.
Nonterminals — the named categories of phrase, the placeholders that stand for "an expression", "a statement", "a term". They are written in angle brackets in BNF: <expr>, <term>, <factor>. Every nonterminal eventually expands into terminals.
Productions (rules) — each says "this nonterminal may be replaced by that sequence". <expr> ::= <expr> "+" <term> reads: an expression can be an expression, a plus sign, then a term. The | bar lists alternatives.
The start symbol — one distinguished nonterminal (here <expr>) where every derivation begins. The language of the grammar is the set of all terminal strings you can reach from the start symbol.

This notation — nonterminals in brackets, ::= for "is defined as", | for alternatives — is BNF (Backus–Naur Form), invented to describe ALGOL 60 and still the lingua franca of language specs. EBNF (Extended BNF) adds convenient shorthand: ? for optional, * for "zero or more", + for "one or more". Here is a complete grammar for simple arithmetic:

Three nonterminals (<expr>, <term>, <factor>), a handful of terminals (+ - * / ( ) and NUMBER), six productions, and the start symbol <expr>. That tiny rulebook generates every well-formed arithmetic expression and rejects every malformed one. Notice the rules are recursive — <expr> mentions itself, and <factor> loops back to <expr> through the brackets — which is how a finite set of rules describes an infinite set of programs.

Deriving a string, step by step

To show a string belongs to the language, you derive it: start at the start symbol and repeatedly replace one nonterminal using a production, until only terminals remain. When you always expand the leftmost nonterminal, it is a leftmost derivation. Here is 1 + 2 * 3 derived from our grammar (the rule used is on the right):

<expr> => <expr> "+" <term> (expr ::= expr + term) => <term> "+" <term> (expr ::= term) => <factor> "+" <term> (term ::= factor) => 1 "+" <term> (factor ::= NUMBER, i.e. 1) => 1 "+" <term> "*" <factor> (term ::= term * factor) => 1 "+" <factor> "*" <factor> (term ::= factor) => 1 "+" 2 "*" <factor> (factor ::= NUMBER, i.e. 2) => 1 "+" 2 "*" 3 (factor ::= NUMBER, i.e. 3)

Eight steps, and we have manufactured the exact token sequence 1 + 2 * 3 from the rules alone. Each => is one rewrite. The order we expanded in does not change the final string, but it does record a path through the grammar — and that path has a shape.

The parse tree: structure made visible

A derivation, drawn as a tree, is a parse tree (or derivation tree). The root is the start symbol; every internal node is a nonterminal, and its children are exactly the symbols on the right-hand side of the production used to expand it; the leaves, read left to right, spell out the input. The parse tree below is the one derivation above, for 1 + 2 * 3. Play through it:

The shape is the point. The + is near the root; the * lives deeper, tucked entirely inside the right-hand <term>. Because a subtree is a self-contained group, the tree says 1 + (2 * 3), not (1 + 2) * 3 — the multiplication is grouped, and evaluated, first. The grammar's structure has quietly encoded precedence, and we will see exactly how in a moment.

Why "context-free" — and why you need a stack

The free in context-free means each production replaces a single nonterminal no matter what surrounds it — the left-hand side is always exactly one nonterminal, with no conditions on its neighbours. That freedom is what makes grammars easy to reason about, yet still powerful enough for real syntax.

And they are strictly more powerful than the regular expressions the lexer uses. A finite automaton has a fixed, finite number of states, so it cannot count without bound — it cannot verify that ( ( ( … ) ) ) has as many closing brackets as opening ones, because the nesting depth is unlimited. A CFG can, because the machine that recognises context-free languages is a pushdown automaton: a finite-state machine with one extra gadget bolted on — a stack. Every ( pushes; every ) pops; balanced exactly when the stack empties. That stack is unbounded memory, and it is precisely what lets a parser track arbitrarily deep nesting. This is the deep reason lexing and parsing are separate passes: recognising tokens is a regular job (no memory needed), but recognising nested structure is a context-free job (a stack needed).

Try to write one for "any number of matched brackets": (), (()), ((())), … You would need the pattern to remember how many opens are still unclosed, and that count grows without limit. The pumping lemma proves no finite automaton — and hence no regular expression — can do it: the language of balanced brackets is not regular. A context-free grammar handles it in two lines, <s> ::= "(" <s> ")" <s> | ε, because its recognising machine, the pushdown automaton, gets a stack to count with. Nesting is exactly the frontier between the two.

Ambiguity, precedence, and associativity

Here is where grammar design earns its keep. Suppose we had written the simpler, sloppier grammar instead:

<expr> ::= <expr> "+" <expr> | <expr> "*" <expr> | NUMBER

It generates the same strings — but now 1 + 2 * 3 has two different parse trees: one with + at the root (giving 1 + (2 * 3) = 7) and one with * at the root (giving (1 + 2) * 3 = 9). A grammar that allows two distinct parse trees for one string is ambiguous, and ambiguity is poison: the parser has no principled way to choose, so the meaning of the program is undefined.

The cure is to stratify the grammar into levels — exactly what our original expr → term → factor ladder does. Each tier captures one precedence level: + and - live at the loose <expr> tier, * and / at the tighter <term> tier, and atoms/brackets at <factor>. Because a <term> can never expand back up to an <expr> except through brackets, multiplication is forced to nest below addition — precedence is baked into the shape, and only one parse tree survives. The same trick fixes associativity: writing the rule as <expr> ::= <expr> "-" <term> (recursion on the left) makes 1 - 2 - 3 parse as (1 - 2) - 3, the correct left-associative answer, rather than 1 - (2 - 3).

A grammar is ambiguous if some string has two or more distinct parse trees.
Precedence is encoded by stratifying the grammar into tiers (expr → term → factor); the tighter-binding operator sits in the lower, more deeply nested tier.
Associativity is encoded by the side of the recursion: left-recursive rules give left associativity, right-recursive rules give right associativity.

The most famous ambiguity in real languages is the dangling else: in if a then if b then s1 else s2, does the else attach to the first if or the second? The grammar, if written naively, permits both parse trees. Languages resolve it by fiat — C, Java and friends decree "else binds to the nearest unmatched if" — and encode that choice by rewriting the grammar into matched and unmatched statement tiers, the very same stratifying trick.

Ambiguity is a property of the GRAMMAR, not the language. The same set of strings can be described by an ambiguous grammar and by an unambiguous one. You do not fix ambiguity by changing the language — you fix it by rewriting the grammar (stratifying it, adding tiers) so each string has a unique parse tree.
A parse tree is not an AST. The parse tree (this page) contains every nonterminal and every bracket the grammar visited — including single-child chains like expr → term → factor → 1. The abstract syntax tree is the parse tree with all that scaffolding stripped away, keeping only the meaningful operators and operands. Parsing usually builds the AST directly; the parse tree is the concept, the AST is the artefact.
Precedence is not a separate table bolted onto the grammar. In a well-designed CFG, precedence and associativity are the grammar's structure. If two operators end up in the same tier, or you recurse on the wrong side, the tree comes out wrong — the fix is in the productions, not in a post-processing step.
Different derivations are fine; different parse trees are not. A single string has many derivations (leftmost, rightmost, and everything between) that all yield the same parse tree — that is harmless. Ambiguity means two genuinely different trees, which is the problem.

So what is parsing?

Put it all together. Parsing is the process of taking a flat stream of tokens and either recovering a parse tree for them against the grammar — proving the input is a valid program and revealing its structure — or reporting a syntax error when no derivation exists. It is, quite literally, running a derivation backwards: the derivation goes from the start symbol down to the tokens; the parser goes from the tokens back up to the start symbol.

There are two grand strategies, and both are just disciplined ways of driving a pushdown automaton. Top-down parsers (recursive descent, LL) start at the start symbol and try to expand rules to match the incoming tokens — one recursive function per nonterminal, the call stack playing the role of the PDA's stack. Bottom-up parsers (LR, the kind tools like yacc and bison generate) start from the tokens and shift them onto a stack, reducing groups back into nonterminals as complete right-hand sides appear. Either way, success means a parse tree; failure means a precise, locatable syntax error.

A context-free grammar is a start symbol, a set of nonterminals and terminals, and productions; it generates exactly the strings of a language.
A derivation rewrites the start symbol to a string of terminals; drawn as a tree it is the parse tree, whose shape encodes the program's structure.
CFGs are recognised by pushdown automata — a finite machine plus a stack — which is why they can handle unbounded nesting that regular expressions cannot.
Ambiguity (two parse trees for one string) is fixed by rewriting the grammar; precedence and associativity are baked in by stratifying it.
Parsing recovers a parse tree for a token stream, or reports a syntax error.