Every compiler is a translator with one final, unglamorous job: at the very end of the pipeline it
must produce the actual instructions a CPU fetches, decodes, and executes. All the
earlier phases — lexing, parsing, type-checking, and optimising the
This is the back end of the compiler. Where the front end asked "is this program valid, and
what does it say?", the back end asks a blunter question: "given a chip with a handful of registers
and a big slab of memory, exactly which LOADs, ADDs and STOREs
compute this?" Two problems dominate the answer — instruction selection (which
instructions?) and register allocation (which values live in which registers?) — and
this page is about both.
To generate code you first need a picture of the target machine. A modern CPU offers two very different places to keep numbers:
R0, R1, R2, … A real chip might have 16 or 32 of them.
The arithmetic units can read and write registers in a single clock cycle.
Most instruction sets are load-store: arithmetic happens only between
registers, so to add two numbers sitting in memory you must first LOAD each into a
register, ADD the registers, and STORE the result back. That single fact —
compute in registers, shuttle to and from memory with loads and stores — shapes everything
the code generator does. Registers are the fast lane; you want your hot values living there.
You could — and the code would be correct — but it would crawl. Reading a value from a register is roughly a single cycle; reading it from main memory can be 100× slower. A tight loop that touches memory on every operation spends nearly all its time waiting. Keeping the values a computation is actively using in registers is one of the biggest wins a back end delivers, which is exactly why register allocation (below) gets so much attention.
The IR that reaches the back end is usually three-address code (TAC): a flat list of
simple statements, each doing one operation on at most two operands and naming the result. A temporary
like t1 is just a name for an intermediate value — and crucially there can be
unboundedly many of them, because the front end invents a fresh temporary for every
subexpression.
Consider the expression (a + b) * c. After the earlier phases it might arrive as this
two-line TAC:
Instruction selection walks each TAC statement and picks a sequence of real machine
instructions that computes it. On a load-store machine, t1 = a + b can't be a single
instruction — the operands a and b live in memory — so it lowers to a
load-load-add, and the result is stored back:
Every abstract + became a LOAD/LOAD/ADD/STORE quartet, and *
became the same pattern around a MUL. That mechanical mapping — each IR op to a
template of instructions — is the heart of instruction selection.
t1
to memory and immediately reloads it — a smarter selector notices t1 is already
sitting in R0 and skips the round trip entirely. Correct code is easy; good
code is the whole game.
t1, t2, … are abstract
names the front end created freely; R0, R1, … are the few physical slots
the chip actually has. Turning the first into the second is a separate problem — register
allocation — not something instruction selection solves on its own.
The naive strategy above is simple enough to write in a dozen lines. Below is a real code generator:
it takes a list of three-address instructions — each an object { dst, op, a, b } — and
emits assembly strings. For every instruction it loads both operands into R0 and
R1, emits the matching arithmetic op, and stores the result. Press Run:
That is a working (if unoptimised) back end for straight-line arithmetic. Notice it happily reuses
R0 and R1 for every statement — it never runs out of registers, because it
immediately spills every result to memory with a STORE. Real allocators try to keep
values in registers across statements, and that is where the interesting problem begins.
Here is the tension at the core of the back end. The IR names an unbounded number of
temporaries — t1, t2, … t500 — but the CPU has only
k physical registers, maybe 16. Register allocation is the job of
assigning each temporary to one of those k registers so that the program still computes the right
answer.
The saving grace is that not every temporary is needed at once. A value is live from the moment it is computed until its last use; two temporaries whose lifetimes don't overlap can safely share a register. So the real question is: how many values are live at the same time? If at most k values are ever simultaneously live, everything fits in registers and no memory traffic is needed. If more than k are live at some point, something has to give.
This maps onto a classic piece of theory. Build an interference graph: one node per temporary, with an edge between any two temporaries that are live simultaneously (they "interfere", so they may not share a register). Assigning k registers so that no two adjacent nodes get the same register is exactly graph colouring with k colours. If the graph is k-colourable, every temporary fits; if not, you must spill.
Because graph colouring is. Deciding whether an arbitrary graph can be coloured with k colours is a classic NP-complete problem, and register allocation inherits that difficulty through the interference graph. So compilers don't search for a provably optimal answer — they use fast heuristics. Chaitin's algorithm repeatedly removes low-degree nodes (which are always colourable) and pushes them on a stack, colouring them on the way back; when it gets stuck it picks a victim to spill. Linear-scan allocation is even simpler and faster — it sweeps the live ranges in order and is popular in just-in-time compilers where compile time itself matters. Neither is optimal, but both are fast and good enough that the spills they introduce rarely hurt.
What happens when a function genuinely needs more simultaneously-live values than there are registers?
It does not fail to compile. The allocator picks a value to spill:
it stores that value to a slot in memory (on the stack) and reloads it just before
its next use, freeing the register in the meantime. The program is still correct — it is simply a
little slower, paying for extra LOAD/STORE traffic. Spilling is the
pressure-release valve that lets a machine with 16 registers run functions that mention thousands of
temporaries.
The stack also underpins the calling convention: an agreed protocol for how functions pass arguments, return values, and preserve registers across calls — typically by pushing arguments and saved registers onto the stack so a callee can reuse those registers and restore them before returning. That is a whole topic of its own; for now, just know the stack is where spilled values and call bookkeeping both live.
No — and this trips people up. Having more live temporaries than registers is the normal case for any non-trivial function, not an error condition. The allocator simply spills the excess to the stack and carries on. The only thing you lose is a bit of speed; correctness is never in question. So "16 registers" is never a hard ceiling on how complex your code can be — it's just a budget the allocator manages, borrowing from memory whenever it overspends.