Every programming language has a rulebook for what counts as a well-formed program.
x = a + b * 2 is fine; x = + a b) 2 = is gibberish. What is that
rulebook, written down precisely enough that a machine can enforce it? It is a
context-free grammar (CFG) — a compact set of rules that generates exactly the strings
the language allows. And the program that reads your source and checks it against those rules — deciding
"yes, this fits the grammar, here is its structure" or "no, syntax error on line 12" — is the
parser.
The parser is the compiler's second pass. The
IDENT, PLUS, NUMBER, … — and the parser arranges that flat
stream into a tree that captures how the pieces nest. This page is about the two halves
of that story: the grammar that defines a language's syntax, and
parsing, the act of recovering a program's structure against that grammar.
A context-free grammar is built from just four things:
+, *, (,
), a NUMBER, an identifier. They are called terminal because a
derivation stops when it reaches them — there is nothing left to expand.<expr>, <term>, <factor>. Every nonterminal
eventually expands into terminals.<expr> ::= <expr> "+" <term> reads: an expression can be
an expression, a plus sign, then a term. The | bar lists alternatives.<expr>)
where every derivation begins. The language of the grammar is the set of all
terminal strings you can reach from the start symbol.
This notation — nonterminals in brackets, ::= for "is defined as", | for
alternatives — is BNF (Backus–Naur Form), invented to describe ALGOL 60 and still the
lingua franca of language specs. EBNF (Extended BNF) adds convenient shorthand:
? for optional, * for "zero or more", + for "one or more". Here
is a complete grammar for simple arithmetic:
Three nonterminals (<expr>, <term>, <factor>),
a handful of terminals (+ - * / ( ) and NUMBER), six productions, and the
start symbol <expr>. That tiny rulebook generates every well-formed
arithmetic expression and rejects every malformed one. Notice the rules are recursive
— <expr> mentions itself, and <factor> loops back to
<expr> through the brackets — which is how a finite set of rules describes an
infinite set of programs.
To show a string belongs to the language, you derive it: start at the start symbol and
repeatedly replace one nonterminal using a production, until only terminals remain. When you always
expand the leftmost nonterminal, it is a leftmost derivation. Here is
1 + 2 * 3 derived from our grammar (the rule used is on the right):
Eight steps, and we have manufactured the exact token sequence 1 + 2 * 3 from the rules
alone. Each => is one rewrite. The order we expanded in does not change the
final string, but it does record a path through the grammar — and that path has a shape.
A derivation, drawn as a tree, is a parse tree (or derivation tree). The root
is the start symbol; every internal node is a nonterminal, and its children are exactly the symbols on
the right-hand side of the production used to expand it; the leaves, read left to right, spell out the
input. The parse tree below is the one derivation above, for 1 + 2 * 3. Play through it:
The shape is the point. The + is near the root; the * lives
deeper, tucked entirely inside the right-hand <term>. Because a subtree is a
self-contained group, the tree says 1 + (2 * 3), not (1 + 2) * 3 — the
multiplication is grouped, and evaluated, first. The grammar's structure has quietly encoded
precedence, and we will see exactly how in a moment.
The free in context-free means each production replaces a single nonterminal no matter what surrounds it — the left-hand side is always exactly one nonterminal, with no conditions on its neighbours. That freedom is what makes grammars easy to reason about, yet still powerful enough for real syntax.
And they are strictly more powerful than the
( ( ( … ) ) ) has as many
closing brackets as opening ones, because the nesting depth is unlimited. A CFG can, because the machine
that recognises context-free languages is a
(
pushes; every ) pops; balanced exactly when the stack empties. That stack is unbounded
memory, and it is precisely what lets a parser track arbitrarily deep nesting. This is the deep reason
lexing and parsing are separate passes: recognising tokens is a regular job (no memory needed),
but recognising nested structure is a context-free job (a stack needed).
Try to write one for "any number of matched brackets": (), (()),
((())), … You would need the pattern to remember how many opens are still unclosed, and
that count grows without limit. The <s> ::= "(" <s> ")" <s> | ε, because its recognising machine, the
pushdown automaton, gets a stack to count with. Nesting is exactly the frontier between the two.
Here is where grammar design earns its keep. Suppose we had written the simpler, sloppier grammar instead:
It generates the same strings — but now 1 + 2 * 3 has two different parse
trees: one with + at the root (giving 1 + (2 * 3) = 7) and one with
* at the root (giving (1 + 2) * 3 = 9). A grammar that allows two distinct
parse trees for one string is ambiguous, and ambiguity is poison: the parser has no
principled way to choose, so the meaning of the program is undefined.
The cure is to stratify the grammar into levels — exactly what our original
expr → term → factor ladder does. Each tier captures one precedence level: +
and - live at the loose <expr> tier, * and / at
the tighter <term> tier, and atoms/brackets at <factor>. Because a
<term> can never expand back up to an <expr> except through
brackets, multiplication is forced to nest below addition — precedence is baked into the shape,
and only one parse tree survives. The same trick fixes associativity: writing the rule
as <expr> ::= <expr> "-" <term> (recursion on the left) makes
1 - 2 - 3 parse as (1 - 2) - 3, the correct left-associative answer, rather
than 1 - (2 - 3).
expr → term → factor); the tighter-binding operator sits in the lower, more deeply
nested tier.
The most famous ambiguity in real languages is the dangling else: in
if a then if b then s1 else s2, does the else attach to the first
if or the second? The grammar, if written naively, permits both parse trees. Languages
resolve it by fiat — C, Java and friends decree "else binds to the nearest unmatched
if" — and encode that choice by rewriting the grammar into matched and unmatched statement
tiers, the very same stratifying trick.
expr → term → factor → 1. The abstract syntax tree is the parse tree
with all that scaffolding stripped away, keeping only the meaningful operators and operands. Parsing
usually builds the AST directly; the parse tree is the concept, the AST is the artefact.
Put it all together. Parsing is the process of taking a flat stream of
There are two grand strategies, and both are just disciplined ways of driving a
yacc
and bison generate) start from the tokens and shift them onto a stack, reducing groups back
into nonterminals as complete right-hand sides appear. Either way, success means a parse tree; failure
means a precise, locatable syntax error.