Compiler Phases: Front, Middle, and Back End

You write x = a + b * 2; and, moments later, a processor is toggling billions of transistors to make it happen. Between those two worlds sits an enormous gap: your source is a string of characters pitched at humans, while the machine wants a stream of numbered opcodes pitched at silicon. A compiler crosses that gap — and, crucially, it does not do it in one heroic leap. It works like a factory assembly line, handing the program down a sequence of phases, each transforming one well-defined representation into the next.

This page is about that pipeline as a whole: the shape of the journey, not the internals of any one station. Traditionally the phases are grouped into three parts: the front end (which understands the source language), the middle end (which improves a neutral intermediate form), and the back end (which emits code for a particular machine). Getting this map in your head first makes every later topic (lexing, parsing, optimisation, register allocation) click into its proper slot.

One long conveyor belt of representations

The single most useful idea here is that each phase consumes one representation of the program and produces another. The program is never destroyed; it is repeatedly re-encoded into a form that makes the next job easy. Follow the goods down the belt:

Lexer: characters → tokens
Parser: tokens → parse tree / AST
Semantic analysis: AST → annotated AST
IR generation: annotated AST → IR
Optimiser: IR → optimised IR
Code generator: optimised IR → target code

Read the outputs in that list top to bottom and you have the whole story: characters → tokens → parse tree / AST → annotated AST → IR → optimised IR → target code. Six transformations, each simple because the previous one already did its part.
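The first transformation on the belt, characters to tokens, can be sketched in a few lines. This is a minimal illustration, not a production lexer: the token names are the ones this page uses for its running example, while the regular-expression table and the helper names `lex` and `show` are assumptions made for the sketch.

```python
import re

# Token kinds follow the names used on this page; the patterns themselves are
# an illustrative assumption and cover only the running example's syntax.
TOKEN_SPEC = [
    ("NUM",    r"\d+"),
    ("ID",     r"[A-Za-z_]\w*"),
    ("ASSIGN", r"="),
    ("PLUS",   r"\+"),
    ("STAR",   r"\*"),
    ("SEMI",   r";"),
    ("SKIP",   r"\s+"),
]
MASTER = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def lex(source):
    """Characters in, list of (kind, text) tokens out."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace separates tokens but is not one
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

def show(tokens):
    """Render tokens as ID(x) ASSIGN ... for easy reading."""
    return " ".join(
        f"{kind}({text})" if kind in ("ID", "NUM") else kind
        for kind, text in tokens
    )

print(show(lex("x = a + b * 2;")))
```

The parser never sees raw characters after this point; it works only on the token list, which is what makes its own job simple.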

The three ends

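Group those six phases and the classic three-part split appears:

Front end: lexer, parser, semantic analysis and IR generation. It understands the source language and lowers the program to the IR.
Middle end: the optimiser. It improves the IR without caring which language produced it or which machine will run it.
Back end: code generation. It emits code for a particular machine, choosing real instructions and registers.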

Why split the compiler this way? Because it turns a multiplication into an addition. Suppose you want to support m source languages on n target machines. Build a separate, monolithic compiler for each pairing and you need m × n of them: 5 languages × 4 chips is 20 whole compilers to write and maintain. Route everything through one shared IR instead, and you need only m front ends (one per language, each lowering to the IR) plus n back ends (one per machine, each starting from the IR): just m + n = 9 components. Add a brand-new language and you write one front end and instantly target every existing machine; add a new chip and every existing language can target it for free.

This is exactly why real toolchains are built this way. GCC has front ends for C, C++, Fortran, Go and more, all meeting at its GIMPLE/RTL internal forms before fanning out to dozens of back ends. LLVM makes the IR the star of the show: Clang (C/C++), Rust, Swift and Julia all emit LLVM IR, and a single set of back ends compiles that IR to x86, ARM, RISC-V, WebAssembly and beyond. The IR is the pinch point that makes the whole ecosystem retargetable.
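The counting argument is easy to check mechanically. The sketch below uses five language names and four target names purely as examples (any lists of those sizes give the same numbers), and the function names are hypothetical.

```python
from itertools import product

# Example names only; any 5 languages and 4 targets give the same counts.
languages = ["C", "C++", "Fortran", "Go", "Rust"]
machines = ["x86-64", "ARM", "RISC-V", "WebAssembly"]

def monolithic_compilers(m, n):
    """One hand-written compiler per (language, machine) pairing."""
    return m * n

def shared_ir_components(m, n):
    """m front ends that lower to the IR, plus n back ends that start from it."""
    return m + n

m, n = len(languages), len(machines)

# Every pairing is still reachable: pick any front end, then any back end.
pipelines = [
    f"{lang} -> IR -> {machine}"
    for lang, machine in product(languages, machines)
]

print(monolithic_compilers(m, n))   # 20
print(shared_ir_components(m, n))   # 9
print(len(pipelines))               # 20
```

Nine components still cover all twenty pairings, because each pairing is just a choice of one front end and one back end joined at the IR.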

Why bother with the middle at all?

You could imagine a compiler that skips the IR and parses straight to machine code. Some tiny compilers do. But the IR earns its keep three times over. First, it is where optimisation lives: rewriting three-address code is far easier than rewriting either a syntax tree or raw assembly. Second, it is the decoupling layer that gives you the m + n win above. Third, it makes reasoning portable: analyses like liveness and constant propagation are written once against the IR and apply to every language and machine.
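To see why three-address code is pleasant to rewrite, here is a small sketch of one optimisation: removing a needless copy such as t2 = a + t1 followed by x = t2. The tuple encoding of instructions, the "tN" naming rule for temporaries, and the function names are all assumptions made for this example; a real optimiser would rely on proper data-flow analysis rather than this adjacent-instruction check.

```python
# Each instruction is (dest, op, left, right); a plain copy has op = None.
# This encoding is an assumption of the sketch, not a standard IR format.

def is_temp(name):
    """Treat names like t1, t2, ... as compiler-generated temporaries."""
    return isinstance(name, str) and name.startswith("t") and name[1:].isdigit()

def remove_copies(code):
    """Fold `tN = a op b` followed directly by `x = tN` into `x = a op b`.

    Only applied when tN is read exactly once in the whole program, so
    dropping the temporary cannot change any other instruction.
    """
    uses = {}
    for _dest, _op, left, right in code:
        for arg in (left, right):
            if arg is not None:
                uses[arg] = uses.get(arg, 0) + 1

    out = []
    for instr in code:
        dest, op, left, _right = instr
        if (op is None and out and is_temp(left)
                and out[-1][0] == left and uses[left] == 1):
            _, prev_op, prev_left, prev_right = out.pop()
            out.append((dest, prev_op, prev_left, prev_right))
        else:
            out.append(instr)
    return out

def fmt(instr):
    dest, op, left, right = instr
    return f"{dest} = {left}" if op is None else f"{dest} = {left} {op} {right}"

ir = [
    ("t1", "*", "b", "2"),
    ("t2", "+", "a", "t1"),
    ("x", None, "t2", None),
]
for line in map(fmt, remove_copies(ir)):
    print(line)
```

Nothing in the pass mentions the source language or the target machine, which is the sense in which work done on the IR is written once and reused everywhere.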

A worked pass down the belt

Watch a single statement descend through the phases. Each arrow is one transformation; notice how the representation gets steadily lower-level while the meaning is preserved end to end.

Source (characters):
    x = a + b * 2;

Lexer → tokens:
    ID(x) ASSIGN ID(a) PLUS ID(b) STAR NUM(2) SEMI

Parser → AST:
        (=)
       /   \
      x    (+)
          /   \
         a    (*)
             /   \
            b     2

Semantic → annotated AST:
    every name resolved via the symbol table;
    types checked (a, b, x : int); 2 : int literal

IR (three-address code):
    t1 = b * 2
    t2 = a + t1
    x  = t2

Optimised IR:
    t1 = b * 2        ; nothing constant-foldable here,
    x  = a + t1       ; but t2 was a needless copy → removed

Back end → target (x86-64):
    mov  eax, [b]
    imul eax, 2       ; or: lea eax, [eax*2]
    add  eax, [a]
    mov  [x], eax

Same statement, seven costumes. The front end changed text into structured meaning; the middle end tidied the IR; the back end chose real instructions and registers. No single phase is doing anything heroic — that is the point.
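The step from AST to three-address code in the walkthrough can be sketched as a short recursive function. The nested-tuple AST encoding and the name `lower` are assumptions for this illustration; the sketch handles only a single assignment whose right-hand side is built from binary operators, names and integer literals.

```python
from itertools import count

# AST encoding assumed for this sketch: (operator, left, right) tuples for
# interior nodes, plain strings or ints for leaves.
AST = ("=", "x", ("+", "a", ("*", "b", 2)))

def lower(ast):
    """Lower one assignment AST to a list of three-address-code lines."""
    code = []
    temps = count(1)  # generates t1, t2, ...

    def emit_expr(node):
        # Leaves (names and literals) need no code; they are their own operand.
        if not isinstance(node, tuple):
            return str(node)
        op, left, right = node
        left_operand = emit_expr(left)
        right_operand = emit_expr(right)
        temp = f"t{next(temps)}"
        code.append(f"{temp} = {left_operand} {op} {right_operand}")
        return temp

    op, target, value = ast
    if op != "=":
        raise ValueError("this sketch only lowers assignment statements")
    code.append(f"{target} = {emit_expr(value)}")
    return code

for line in lower(AST):
    print(line)
```

Each operator node becomes exactly one instruction writing a fresh temporary, which is why the unoptimised IR still contains the extra copy through t2.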