Your 2 * (3 + 4) —
and worked out its structure. But now the parser goes quiet, and the rest of the compiler
takes over: it checks types, allocates registers, optimises, and finally emits machine code. What do
all those later stages actually operate on?
Not the source text — that is long gone. They operate on one shared data structure: the
Abstract Syntax Tree, or AST. It is the program's backbone. Every
phase in the back half of a compiler is, at heart, just a
A grammar describes a language with rules like Expr → Term ( + Term )*,
Term → Factor ( * Factor )*, Factor → num | ( Expr ). When the parser
recognises a string, it implicitly builds a parse tree (also called a
concrete syntax tree, or CST): a tree that records every detail of how the
grammar matched. Every nonterminal that fired becomes a node. Every literal parenthesis and comma is
a leaf. Every single-child "chain" rule — Term → Factor → num — leaves a trail of
redundant nodes behind.
Here is the parse tree for 2 * (3 + 4). Notice how bushy it is: nine internal
nonterminal nodes and two literal parentheses, most of them carrying no meaning at all.
Now the same expression as an abstract syntax tree. The AST keeps only what is semantically meaningful and throws the rest away:
* is the root; its children are 2 and the + subtree.Expr → Term → Factor
ladder — a number is just a number leaf, sitting directly under the operator that uses it.+ is nested inside the
*, the addition necessarily happens first. Precedence is structure, not punctuation.The result: five nodes instead of thirteen. Same meaning, far less to walk.
+, a function call,
an if); leaves are the atoms — literals and identifiers.
No — and that surprises people. In the source, (3 + 4) and a bare 3 + 4
sitting in the same position produce the identical AST: a single + node over
3 and 4. The brackets did their one job during parsing (they told the
parser how to shape the tree) and then they were done. There is no "parenthesis node." If you ever
need to print the program back out, the pretty-printer re-inserts parentheses only where
the tree shape would otherwise be ambiguous. The tree is the source of truth; the punctuation was
just scaffolding.
Because an AST node is always exactly one of a fixed set of shapes — a literal, or a binary
operation, or an if, or a function call — it is naturally a tagged union
(also called a discriminated union or sum type). Each variant carries a kind tag plus
exactly the fields that variant needs. A number node needs a value; a binary-op node needs an operator
and two child expressions:
Notice what is not here: no comma, no parenthesis, no semicolon. A call node
stores an array of argument subtrees — the commas that separated them in the source are
implicit in "this is a list." The tags matter because every traversal is written as a
switch on kind: one branch per node type, and the compiler's typechecker can
even confirm you handled them all.
Here is why the AST earns the title "backbone." Once it exists, the whole back half of the compiler is
a sequence of
The standard way to write such a walk is the visitor pattern: one function (or one
object with a method per node kind) that pattern-matches on the node's tag, recurses into its
children, and combines the results. It is a post-order traversal whenever a node needs its
children's results before it can produce its own. Below, two different phases walk the
same AST for 2 * (3 + 4): one pretty-prints it, one evaluates it. Press
Run:
Two phases, one tree, and neither of them ever touches a source character. That separation — text is the parser's problem, the tree is everyone else's — is exactly what makes a compiler's back end tractable.
";" node in your tree, you have built (part of) a parse tree, not an AST.
2 + 3 * 4 and
2 + (3 * 4) are the same AST. The tree already put * below
+; re-reading the flat source to re-decide precedence would undo the parser's work.