Semantic Analysis and Symbol Tables

A parser is a strict grammar teacher: it checks that your sentence is well-formed. Feed it x + 1 and it happily builds a tree — a plus node with a variable on the left and a literal on the right. Grammatically perfect. But there is a question the parser never asks: was x ever declared? Does x + 1 actually mean anything?

That is the job of the next stage, semantic analysis. Parsing checks form; semantic analysis checks meaning. It walks the abstract syntax tree the parser produced and asks the questions grammar can't:

None of these are grammar mistakes — every one is a perfectly legal shape. Catching them needs two tools that this page is about: a symbol table to remember what every name means, and a type checker that walks the tree working out the type of each node.

The symbol table: names to attributes

The moment a program says let count = 0, the compiler has learned a fact it must remember: the name count now exists, it is a variable (not a function or a type), it holds a number, and it lives at some storage location. All of that bundled knowledge is a name's attributes, and the structure that stores them is the symbol table: a dictionary mapping each name to its attributes.

Every time the analyser meets a use of a name — count + 1 — it does a lookup: fetch count's attributes so it can check the use makes sense. A single program does millions of these lookups, so the operation must be fast. That is exactly why a symbol table is almost always backed by a hash table: it turns the name into an index and reaches the attributes in O(1) average time, no matter how many thousands of names the program declares. A balanced tree would give O(log n); a hash table gives us constant time, and lookups are the inner loop of the whole pass.

The kind attribute — and it matters more than it looks. In many languages the same identifier can be a type and a value in different contexts, and the analyser leans on kind to tell them apart. When the parser sees foo(x) it can't know whether foo is a function being called or (in some languages) a type doing a cast; only the symbol table, filled in by an earlier declaration, resolves the ambiguity. This is one reason semantic analysis is a separate pass after parsing: it needs the whole picture of what every name is.

Scope: a stack of symbol tables

A real program doesn't have one flat namespace. The same name i can be a loop counter in one function and something entirely different in another. What keeps them apart is scope — the region of program text where a declaration is visible. And the elegant way to track scope is a stack of symbol tables:

The stack mirrors the nesting of blocks precisely. Here is a small program and the stack of tables as the analyser reaches the innermost line:

let g = 10; // scope 0 (global) function area(r) { // r -> scope 1 let two = 2; // two -> scope 1 { // enter a block let pi = 3; // pi -> scope 2 return two * pi * r; // <-- resolving names HERE } // leave block: scope 2 popped }
Stack (top = innermost)Names in this table
scope 2 (the inner block)pi
scope 1 (function area)r, two
scope 0 (global)g, area

Now the line two * pi * r needs to resolve three names. Name resolution searches the stack from the top down — innermost scope first, then outward until a match is found:

If a name is not found in any table on the stack, it was never declared — that is the "undeclared variable" error. Because the search always begins at the top, an inner declaration can shadow an outer one of the same name: the inner is found first, so it wins. This inside-out search is called lexical (or static) scoping — "lexical" because you can work out which declaration a name refers to purely from where it sits in the text, before the program ever runs.

Under lexical (static) scope, a name refers to whatever declaration encloses it in the source text — you resolve it by reading the program. Under dynamic scope, a name refers to the most recent binding on the call stack at run time: which declaration wins depends on who called whom, and can differ from one run to the next.

Dynamic scope makes code fiendishly hard to reason about — you can't tell what a variable means just by looking at the function that uses it. That is why almost every modern language (type systems rely on it too) chose lexical scope: it lets both the compiler and the human resolve every name locally, from the text alone. A handful of languages (older Lisps, Bash, Emacs Lisp) kept dynamic scope for a few special cases.

Type checking: walking the AST bottom-up

With names resolved, the analyser can check types. The idea is beautifully recursive and it walks the abstract syntax tree bottom-up: to know the type of a node, first work out the types of its children, then check they fit together, and report this node's type upward.

The function below is a real, complete mini type checker. It models a tiny AST, keeps a scope as a stack of Maps, and check(node, scope) returns the node's type or throws the moment it meets an undeclared name or a type it can't reconcile. Then it checks two sample programs — one good, one broken. Press Run:

// ---- The tiny AST: five node shapes ---- type Node = | { kind: "num"; value: number } // 42 -> number | { kind: "str"; value: string } // "hi" -> string | { kind: "var"; name: string } // x (look it up) | { kind: "bin"; op: "+" | "*"; left: Node; right: Node } | { kind: "let"; name: string; value: Node; body: Node }; type Type = "number" | "string"; // A scope is a STACK of symbol tables. Each table maps a name -> its type. type Scope = Map<string, Type>[]; // Name resolution: search the stack from the TOP (innermost) outward. function resolve(scope: Scope, name: string): Type { for (let i = scope.length - 1; i >= 0; i--) { const t = scope[i].get(name); if (t !== undefined) return t; // found in the nearest enclosing scope } throw new Error(`Semantic error: undeclared variable '${name}'`); } // check: return this node's TYPE, or throw on a semantic error. Walks bottom-up. function check(node: Node, scope: Scope): Type { switch (node.kind) { case "num": return "number"; // leaves know their own type case "str": return "string"; case "var": return resolve(scope, node.name); case "bin": { // children FIRST, then combine const lt = check(node.left, scope); const rt = check(node.right, scope); if (lt !== "number" || rt !== "number") { throw new Error( `Type error: cannot apply '${node.op}' to ${lt} and ${rt}`); } return "number"; // number op number -> number } case "let": { // introduce a name in a NEW inner scope const vt = check(node.value, scope); // type of the bound value scope.push(new Map([[node.name, vt]])); // enter block: push a table const bt = check(node.body, scope); // check the body with x visible scope.pop(); // leave block: pop the table return bt; } } } function typeOf(program: Node): string { try { return check(program, [new Map()]); } // start with one (global) table catch (err) { return (err as Error).message; } } // GOOD: let x = 6 in x * x -> a number const good: Node = { kind: "let", name: "x", value: { kind: "num", value: 6 }, body: { kind: "bin", op: "*", left: { kind: "var", name: "x" }, right: { kind: "var", name: "x" } }, }; console.log("let x = 6 in x * x :", typeOf(good)); // BAD 1: "a" * 3 -> type error (string times number) const bad1: Node = { kind: "bin", op: "*", left: { kind: "str", value: "a" }, right: { kind: "num", value: 3 }, }; console.log('"a" * 3 :', typeOf(bad1)); // BAD 2: y + 1 with y never declared -> undeclared variable const bad2: Node = { kind: "bin", op: "+", left: { kind: "var", name: "y" }, right: { kind: "num", value: 1 }, }; console.log("y + 1 (y undeclared):", typeOf(bad2));

Look at what came back: the good program is reported as number, while the two broken ones produce diagnostics, not crashes. That is the whole spirit of semantic analysis — it doesn't run your program, it reasons about it, and when the meaning doesn't add up it hands you a clear error with a location. The output of a clean pass is an annotated AST: the same tree, now with a type stamped on every node and every name linked to its symbol-table entry, ready for the code generator that comes next.