Code Generation

Every compiler is a translator with one final, unglamorous job: at the very end of the pipeline it must produce the actual instructions a CPU fetches, decodes, and executes. All the earlier phases — lexing, parsing, type-checking, and optimising the intermediate representation — build up an abstract, machine-independent description of what the program means. Code generation is where that abstraction finally hits silicon: it lowers the IR into real assembly language, the instructions your processor was built to run.

This is the back end of the compiler. Where the front end asked "is this program valid, and what does it say?", the back end asks a blunter question: "given a chip with a handful of registers and a big slab of memory, exactly which LOADs, ADDs and STOREs compute this?" Two problems dominate the answer — instruction selection (which instructions?) and register allocation (which values live in which registers?) — and this page is about both.

The machine model: registers, memory, instructions

To generate code you first need a picture of the target machine. A modern CPU offers two very different places to keep numbers:

Registers — a tiny, fixed set of ultra-fast slots right inside the processor, named R0, R1, R2, … A real chip might have 16 or 32 of them. The arithmetic units can read and write registers in a single clock cycle.
Memory — a vast array of slots (gigabytes) reached by address. Roomy, but far slower: a trip to memory can cost tens or hundreds of cycles.

Most instruction sets are load-store: arithmetic happens only between registers, so to add two numbers sitting in memory you must first LOAD each into a register, ADD the registers, and STORE the result back. That single fact — compute in registers, shuttle to and from memory with loads and stores — shapes everything the code generator does. Registers are the fast lane; you want your hot values living there.

You could — and the code would be correct — but it would crawl. Reading a value from a register is roughly a single cycle; reading it from main memory can be 100× slower. A tight loop that touches memory on every operation spends nearly all its time waiting. Keeping the values a computation is actively using in registers is one of the biggest wins a back end delivers, which is exactly why register allocation (below) gets so much attention.

Instruction selection: one IR op → real instructions

The IR that reaches the back end is usually three-address code (TAC): a flat list of simple statements, each doing one operation on at most two operands and naming the result. A temporary like t1 is just a name for an intermediate value — and crucially there can be unboundedly many of them, because the front end invents a fresh temporary for every subexpression.

Consider the expression (a + b) * c. After the earlier phases it might arrive as this two-line TAC:

t1 = a + b t2 = t1 * c

Instruction selection walks each TAC statement and picks a sequence of real machine instructions that computes it. On a load-store machine, t1 = a + b can't be a single instruction — the operands a and b live in memory — so it lowers to a load-load-add, and the result is stored back:

; t1 = a + b LOAD R0, a ; R0 <- value of a from memory LOAD R1, b ; R1 <- value of b from memory ADD R0, R0, R1 ; R0 <- R0 + R1 STORE t1, R0 ; write the result back to t1's slot ; t2 = t1 * c LOAD R0, t1 ; reload t1 LOAD R1, c MUL R0, R0, R1 ; R0 <- R0 * R1 STORE t2, R0

Every abstract + became a LOAD/LOAD/ADD/STORE quartet, and * became the same pattern around a MUL. That mechanical mapping — each IR op to a template of instructions — is the heart of instruction selection.

There is no single "the assembly". Instruction selection has many valid answers with different costs. The naive template above stores t1 to memory and immediately reloads it — a smarter selector notices t1 is already sitting in R0 and skips the round trip entirely. Correct code is easy; good code is the whole game.
Temporaries are not registers. t1, t2, … are abstract names the front end created freely; R0, R1, … are the few physical slots the chip actually has. Turning the first into the second is a separate problem — register allocation — not something instruction selection solves on its own.

A tiny code generator you can run

The naive strategy above is simple enough to write in a dozen lines. Below is a real code generator: it takes a list of three-address instructions — each an object { dst, op, a, b } — and emits assembly strings. For every instruction it loads both operands into R0 and R1, emits the matching arithmetic op, and stores the result. Press Run:

// A three-address instruction: dst = a op b interface Tac { dst: string; op: "+" | "-" | "*" | "/"; a: string; b: string; } // Map each IR operator to its machine mnemonic. const MNEMONIC: Record<string, string> = { "+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV", }; // Lower ONE TAC statement to a burst of load-store machine instructions. function lower(instr: Tac): string[] { const mnemonic = MNEMONIC[instr.op]; return [ `LOAD R0, ${instr.a}`, // R0 <- first operand `LOAD R1, ${instr.b}`, // R1 <- second operand `${mnemonic.padEnd(5)} R0, R0, R1`, // R0 <- R0 op R1 `STORE ${instr.dst}, R0`, // write the result back ]; } // The code generator: lower every statement, concatenating the assembly. function generate(program: Tac[]): string[] { const asm: string[] = []; for (const instr of program) { asm.push(`; ${instr.dst} = ${instr.a} ${instr.op} ${instr.b}`); for (const line of lower(instr)) asm.push(line); } return asm; } // Sample TAC for (a + b) * c : const program: Tac[] = [ { dst: "t1", op: "+", a: "a", b: "b" }, { dst: "t2", op: "*", a: "t1", b: "c" }, ]; for (const line of generate(program)) console.log(line);

That is a working (if unoptimised) back end for straight-line arithmetic. Notice it happily reuses R0 and R1 for every statement — it never runs out of registers, because it immediately spills every result to memory with a STORE. Real allocators try to keep values in registers across statements, and that is where the interesting problem begins.

Register allocation: many temporaries, few registers

Here is the tension at the core of the back end. The IR names an unbounded number of temporaries — t1, t2, … t500 — but the CPU has only k physical registers, maybe 16. Register allocation is the job of assigning each temporary to one of those k registers so that the program still computes the right answer.

The saving grace is that not every temporary is needed at once. A value is live from the moment it is computed until its last use; two temporaries whose lifetimes don't overlap can safely share a register. So the real question is: how many values are live at the same time? If at most k values are ever simultaneously live, everything fits in registers and no memory traffic is needed. If more than k are live at some point, something has to give.

The IR uses arbitrarily many temporaries; the machine has a fixed k physical registers.
Allocation maps each temporary to a register such that temporaries live at the same time never share one.
When more than k values are live at once, one is spilled to memory (the stack) and reloaded when next needed.
The goal is a valid assignment with as few spills as possible, because every spill is slow memory traffic.

This maps onto a classic piece of theory. Build an interference graph: one node per temporary, with an edge between any two temporaries that are live simultaneously (they "interfere", so they may not share a register). Assigning k registers so that no two adjacent nodes get the same register is exactly graph colouring with k colours. If the graph is k-colourable, every temporary fits; if not, you must spill.

Because graph colouring is. Deciding whether an arbitrary graph can be coloured with k colours is a classic NP-complete problem, and register allocation inherits that difficulty through the interference graph. So compilers don't search for a provably optimal answer — they use fast heuristics. Chaitin's algorithm repeatedly removes low-degree nodes (which are always colourable) and pushes them on a stack, colouring them on the way back; when it gets stuck it picks a victim to spill. Linear-scan allocation is even simpler and faster — it sweeps the live ranges in order and is popular in just-in-time compilers where compile time itself matters. Neither is optimal, but both are fast and good enough that the spills they introduce rarely hurt.

Spilling — and the stack

What happens when a function genuinely needs more simultaneously-live values than there are registers? It does not fail to compile. The allocator picks a value to spill: it stores that value to a slot in memory (on the stack) and reloads it just before its next use, freeing the register in the meantime. The program is still correct — it is simply a little slower, paying for extra LOAD/STORE traffic. Spilling is the pressure-release valve that lets a machine with 16 registers run functions that mention thousands of temporaries.

The stack also underpins the calling convention: an agreed protocol for how functions pass arguments, return values, and preserve registers across calls — typically by pushing arguments and saved registers onto the stack so a callee can reuse those registers and restore them before returning. That is a whole topic of its own; for now, just know the stack is where spilled values and call bookkeeping both live.

No — and this trips people up. Having more live temporaries than registers is the normal case for any non-trivial function, not an error condition. The allocator simply spills the excess to the stack and carries on. The only thing you lose is a bit of speed; correctness is never in question. So "16 registers" is never a hard ceiling on how complex your code can be — it's just a budget the allocator manages, borrowing from memory whenever it overspends.