Abstract Syntax Trees

Your parser has just done something remarkable: it read a flat river of characters — 2 * (3 + 4) — and worked out its structure. But now the parser goes quiet, and the rest of the compiler takes over: it checks types, allocates registers, optimises, and finally emits machine code. What do all those later stages actually operate on?

Not the source text — that is long gone. They operate on one shared data structure: the Abstract Syntax Tree, or AST. It is the program's backbone. Every phase in the back half of a compiler is, at heart, just a tree traversal over the AST. Get the AST right and everything downstream has a clean object to walk. This page is about what that tree is — and, crucially, how it differs from the bushier parse tree the parser conceptually builds first.

Two trees for one expression

A grammar describes a language with rules like Expr → Term ( + Term )*, Term → Factor ( * Factor )*, Factor → num | ( Expr ). When the parser recognises a string, it implicitly builds a parse tree (also called a concrete syntax tree, or CST): a tree that records every detail of how the grammar matched. Every nonterminal that fired becomes a node. Every literal parenthesis and comma is a leaf. Every single-child "chain" rule — Term → Factor → num — leaves a trail of redundant nodes behind.

Here is the parse tree for 2 * (3 + 4). Notice how bushy it is: nine internal nonterminal nodes and two literal parentheses, most of them carrying no meaning at all.

The AST: the same meaning, none of the noise

Now the same expression as an abstract syntax tree. The AST keeps only what is semantically meaningful and throws the rest away:

Operators become internal nodes, and their operands become the children. The * is the root; its children are 2 and the + subtree.
The chain nonterminals vanish. There is no Expr → Term → Factor ladder — a number is just a number leaf, sitting directly under the operator that uses it.
The parentheses vanish. They were only ever grouping instructions. Once the tree exists, the shape itself encodes the grouping: because + is nested inside the *, the addition necessarily happens first. Precedence is structure, not punctuation.

The result: five nodes instead of thirteen. Same meaning, far less to walk.

An AST is a tree representation of a program that records its semantic structure while discarding syntactic noise (chain nonterminals, parentheses, most punctuation).
Internal nodes are operations or constructs (a +, a function call, an if); leaves are the atoms — literals and identifiers.
Grouping and precedence are encoded by the tree's shape, so the same subtree means the same thing whether or not the source had parentheses around it.
It is the shared data structure that every later phase — semantic analysis, IR generation, code generation — reads by traversing it.

No — and that surprises people. In the source, (3 + 4) and a bare 3 + 4 sitting in the same position produce the identical AST: a single + node over 3 and 4. The brackets did their one job during parsing (they told the parser how to shape the tree) and then they were done. There is no "parenthesis node." If you ever need to print the program back out, the pretty-printer re-inserts parentheses only where the tree shape would otherwise be ambiguous. The tree is the source of truth; the punctuation was just scaffolding.

Node types as a tagged union

Because an AST node is always exactly one of a fixed set of shapes — a literal, or a binary operation, or an if, or a function call — it is naturally a tagged union (also called a discriminated union or sum type). Each variant carries a kind tag plus exactly the fields that variant needs. A number node needs a value; a binary-op node needs an operator and two child expressions:

// Every AST node is exactly ONE of these shapes — a tagged union. // The "kind" field is the tag that says which shape this node is. type Node = | { kind: "num"; value: number } // a leaf: a numeric literal | { kind: "var"; name: string } // a leaf: an identifier | { kind: "bin"; op: "+" | "-" | "*" | "/"; left: Node; right: Node } // a binary operation | { kind: "call"; callee: string; args: Node[] }; // f(a, b, ...)

Notice what is not here: no comma, no parenthesis, no semicolon. A call node stores an array of argument subtrees — the commas that separated them in the source are implicit in "this is a list." The tags matter because every traversal is written as a switch on kind: one branch per node type, and the compiler's typechecker can even confirm you handled them all.

Every later phase is a tree walk

Here is why the AST earns the title "backbone." Once it exists, the whole back half of the compiler is a sequence of tree traversals over that one structure:

Semantic analysis / type checking — walk the tree, look up each identifier, and check the operator's operands have compatible types.
IR generation — walk the tree and emit intermediate code for each node.
Code generation — walk the (lowered) tree and emit machine instructions.
Pretty-printing, linting, optimisation — all more walks.

The standard way to write such a walk is the visitor pattern: one function (or one object with a method per node kind) that pattern-matches on the node's tag, recurses into its children, and combines the results. It is a post-order traversal whenever a node needs its children's results before it can produce its own. Below, two different phases walk the same AST for 2 * (3 + 4): one pretty-prints it, one evaluates it. Press Run:

// One AST, many walks. This is the core idea of the compiler back end. type Node = | { kind: "num"; value: number } | { kind: "bin"; op: "+" | "-" | "*" | "/"; left: Node; right: Node }; // The AST for 2 * (3 + 4). Note: NO parenthesis node — the shape encodes the grouping. const ast: Node = { kind: "bin", op: "*", left: { kind: "num", value: 2 }, right: { kind: "bin", op: "+", left: { kind: "num", value: 3 }, right: { kind: "num", value: 4 } }, }; // WALK 1 — a pretty-printer: an indented, pre-order traversal (visit node, then children). function print(n: Node, depth = 0): void { const pad = " ".repeat(depth); if (n.kind === "num") { console.log(pad + "num " + n.value); } else { console.log(pad + "op " + n.op); print(n.left, depth + 1); // recurse into the children print(n.right, depth + 1); } } // WALK 2 — an evaluator: a post-order traversal (children FIRST, then combine). function evaluate(n: Node): number { if (n.kind === "num") return n.value; // a leaf hands back its value const l = evaluate(n.left); // walk the children first... const r = evaluate(n.right); if (n.op === "+") return l + r; // ...then combine (the node's meaning) if (n.op === "-") return l - r; if (n.op === "*") return l * r; return l / r; } console.log("--- pretty-printed tree ---"); print(ast); console.log("--- evaluated value ---"); console.log(evaluate(ast)); // 14

Two phases, one tree, and neither of them ever touches a source character. That separation — text is the parser's problem, the tree is everyone else's — is exactly what makes a compiler's back end tractable.

A parse tree and an AST are NOT the same thing. The parse tree (concrete syntax tree) is faithful to the grammar and full of chain nodes and punctuation; the AST is the distilled, meaning-only version. Compilers usually build the AST directly and never materialise the full parse tree at all — but conflating the two is a classic exam slip.
The AST does not contain the syntax. No parentheses, no commas, no semicolons, no keyword tokens sitting as leaves. Those were consumed during parsing. If you find yourself putting a ";" node in your tree, you have built (part of) a parse tree, not an AST.
Precedence lives in the shape, not in the source order. 2 + 3 * 4 and 2 + (3 * 4) are the same AST. The tree already put * below +; re-reading the flat source to re-decide precedence would undo the parser's work.