Abstract Syntax Trees

Your parser has just done something remarkable: it read a flat river of characters — 2 * (3 + 4) — and worked out its structure. But now the parser goes quiet, and the rest of the compiler takes over: it checks types, allocates registers, optimises, and finally emits machine code. What do all those later stages actually operate on?

Not the source text — that is long gone. They operate on one shared data structure: the Abstract Syntax Tree, or AST. It is the program's backbone. Every phase in the back half of a compiler is, at heart, just a tree traversal over the AST. Get the AST right and everything downstream has a clean object to walk. This page is about what that tree is — and, crucially, how it differs from the bushier parse tree the parser conceptually builds first.

Two trees for one expression

A grammar describes a language with rules like Expr → Term ( + Term )*, Term → Factor ( * Factor )*, Factor → num | ( Expr ). When the parser recognises a string, it implicitly builds a parse tree (also called a concrete syntax tree, or CST): a tree that records every detail of how the grammar matched. Every nonterminal that fired becomes a node. Every literal parenthesis and comma is a leaf. Every single-child "chain" rule — Term → Factor → num — leaves a trail of redundant nodes behind.

Here is the parse tree for 2 * (3 + 4). Notice how bushy it is: nine internal nonterminal nodes and two literal parentheses, most of them carrying no meaning at all.

The AST: the same meaning, none of the noise

Now the same expression as an abstract syntax tree. The AST keeps only what is semantically meaningful and throws the rest away:

The result: five nodes instead of thirteen. Same meaning, far less to walk.

No — and that surprises people. In the source, (3 + 4) and a bare 3 + 4 sitting in the same position produce the identical AST: a single + node over 3 and 4. The brackets did their one job during parsing (they told the parser how to shape the tree) and then they were done. There is no "parenthesis node." If you ever need to print the program back out, the pretty-printer re-inserts parentheses only where the tree shape would otherwise be ambiguous. The tree is the source of truth; the punctuation was just scaffolding.

Node types as a tagged union

Because an AST node is always exactly one of a fixed set of shapes — a literal, or a binary operation, or an if, or a function call — it is naturally a tagged union (also called a discriminated union or sum type). Each variant carries a kind tag plus exactly the fields that variant needs. A number node needs a value; a binary-op node needs an operator and two child expressions:

// Every AST node is exactly ONE of these shapes — a tagged union. // The "kind" field is the tag that says which shape this node is. type Node = | { kind: "num"; value: number } // a leaf: a numeric literal | { kind: "var"; name: string } // a leaf: an identifier | { kind: "bin"; op: "+" | "-" | "*" | "/"; left: Node; right: Node } // a binary operation | { kind: "call"; callee: string; args: Node[] }; // f(a, b, ...)

Notice what is not here: no comma, no parenthesis, no semicolon. A call node stores an array of argument subtrees — the commas that separated them in the source are implicit in "this is a list." The tags matter because every traversal is written as a switch on kind: one branch per node type, and the compiler's typechecker can even confirm you handled them all.

Every later phase is a tree walk

Here is why the AST earns the title "backbone." Once it exists, the whole back half of the compiler is a sequence of tree traversals over that one structure:

The standard way to write such a walk is the visitor pattern: one function (or one object with a method per node kind) that pattern-matches on the node's tag, recurses into its children, and combines the results. It is a post-order traversal whenever a node needs its children's results before it can produce its own. Below, two different phases walk the same AST for 2 * (3 + 4): one pretty-prints it, one evaluates it. Press Run:

// One AST, many walks. This is the core idea of the compiler back end. type Node = | { kind: "num"; value: number } | { kind: "bin"; op: "+" | "-" | "*" | "/"; left: Node; right: Node }; // The AST for 2 * (3 + 4). Note: NO parenthesis node — the shape encodes the grouping. const ast: Node = { kind: "bin", op: "*", left: { kind: "num", value: 2 }, right: { kind: "bin", op: "+", left: { kind: "num", value: 3 }, right: { kind: "num", value: 4 } }, }; // WALK 1 — a pretty-printer: an indented, pre-order traversal (visit node, then children). function print(n: Node, depth = 0): void { const pad = " ".repeat(depth); if (n.kind === "num") { console.log(pad + "num " + n.value); } else { console.log(pad + "op " + n.op); print(n.left, depth + 1); // recurse into the children print(n.right, depth + 1); } } // WALK 2 — an evaluator: a post-order traversal (children FIRST, then combine). function evaluate(n: Node): number { if (n.kind === "num") return n.value; // a leaf hands back its value const l = evaluate(n.left); // walk the children first... const r = evaluate(n.right); if (n.op === "+") return l + r; // ...then combine (the node's meaning) if (n.op === "-") return l - r; if (n.op === "*") return l * r; return l / r; } console.log("--- pretty-printed tree ---"); print(ast); console.log("--- evaluated value ---"); console.log(evaluate(ast)); // 14

Two phases, one tree, and neither of them ever touches a source character. That separation — text is the parser's problem, the tree is everyone else's — is exactly what makes a compiler's back end tractable.