Pipelining

You already know how a CPU runs a program: it takes each instruction and pushes it through the fetch–decode–execute cyclefetch it from memory, decode what it means, then execute it. The obvious way to run a whole program is to do this one instruction at a time: finish all three stages of instruction 1 before you even start fetching instruction 2.

That works — but look at what the CPU's hardware is doing while it does it. The three stages are handled by different parts of the processor. So while the execute hardware is busy running instruction 1, the fetch hardware and the decode hardware are just sitting there idle. Two-thirds of the machine is doing nothing at any given moment. That is a waste — and getting rid of that waste is the whole idea of pipelining.

Pipelining means overlapping the stages of consecutive instructions: while instruction 1 is being executed, instruction 2 is being decoded, and instruction 3 is being fetched — all at the same moment. Nobody stands idle. The instructions flow through the CPU in a steady stream, so more of them finish per second.

The laundry that explains everything

Imagine you have three loads of washing to do, and each load goes through three machines in order: the washer, then the dryer, then the folding table. Say each stage takes one hour.

The slow, one-at-a-time way: wash load 1, dry load 1, fold load 1 — then start load 2. Each load takes 3 hours and you never touch a load until the one before it is completely finished. Three loads take 3 \times 3 = 9 hours. But look — while load 1 is in the dryer, the washer is empty. Why not start washing load 2 right then?

That's exactly what a smart person does. The moment load 1 moves from the washer to the dryer, load 2 goes into the washer. An hour later, load 1 is being folded, load 2 is drying, and load 3 is washing — all three machines running at once. Now three loads take only 5 hours, not 9. You didn't make any single machine faster; you just stopped letting them stand idle.

A pipelined CPU is doing exactly this. Fetch = washer, decode = dryer, execute = folding. Each stage has its own hardware, so each one can be working on a different instruction at the same time.

See the overlap

Below, time runs left to right in equal clock cycles, and each row is one instruction. Watch how, once the pipeline fills up, every column has all three stages busy at once — a fetch, a decode and an execute happening together, each on a different instruction. Step through it and compare the pipelined schedule with the slow one-at-a-time schedule underneath.

Notice two things. First, in the pipelined version each instruction still takes the same three cycles to pass all the way through — pipelining did not make one instruction quicker. Second, after the pipeline fills, a brand-new instruction completes every single cycle instead of every three cycles. That is where the speed-up comes from: they finish three times more often.

Counting the speed-up

Let's put numbers on it. Suppose each stage takes one cycle, so the pipeline has k = 3 stages.

And the more instructions you run, the better it gets. The three "fill the pipeline" cycles are a one-off cost paid at the very start; spread over a program of millions of instructions, it becomes a rounding error, and a k-stage pipeline approaches being k times faster.

For a pipeline of k stages, each one cycle long, running n instructions with no stalls:

When the flow breaks: hazards

The lovely steady stream only works if the CPU always knows which instruction comes next so it can fetch it early. Sometimes it can't, and the pipeline hits a hazard — a situation that stops the overlap working cleanly.

The most important one at A-level is the branch (a jump, such as the if-decision or a loop). At a branch the program can carry on in one of two directions, and the CPU won't know which until that branch instruction has actually executed. But by then it has already fetched and started decoding the instructions sitting right after the branch — on the assumption the program just runs straight on. If the branch jumps somewhere else instead, those half-processed instructions are wrong. The CPU must flush them (throw them away) and stall (wait, doing nothing useful) while it re-fills the pipeline from the correct place. Those wasted cycles are called a pipeline bubble.

Real CPUs fight back with clever tricks such as branch prediction — guessing which way a branch will go and speculatively fetching that way — but a wrong guess still costs a flush. Hazards are the reason a 5-stage pipeline never quite gives you a full 5× speed-up in practice.

Pipelining improves throughputhow many instructions finish per secondnot latency, the time for one single instruction to get through. A pipelined instruction still takes just as long (or slightly longer) to travel end-to-end as it would on its own; you've simply arranged for lots of them to be in flight at once. It's like a motorway: adding lanes doesn't make your car faster, but far more cars get past per minute. Never say "pipelining makes each instruction quicker" — say "pipelining makes instructions finish more often."

And remember the catch: hazards, above all a branch (jump), break the flow. When the CPU doesn't yet know which instruction is next, it may fetch the wrong ones and then have to flush the pipeline and stall — losing some of the benefit. The theoretical k× speed-up is a best case, not a promise.

A worked example

A CPU has a 3-stage pipeline (fetch, decode, execute), each stage taking one clock cycle. It runs a straight run of 6 instructions with no branches.

Now suppose instruction 3 is a branch that jumps elsewhere, and the two instructions the CPU had already fetched behind it must be flushed, costing 2 extra "bubble" cycles. The pipelined run now takes 8 + 2 = 10 cycles instead of 8 — still far better than 18, but you can see the hazard eating into the gain.

If a k-stage pipeline can approach a k\times speed-up, why not chop the work into 100 tiny stages? Two reasons. First, every extra stage makes a branch misprediction more expensive — a deeper pipeline has more in-flight instructions to flush, so a wrong guess wastes more cycles. Second, splitting the work adds overhead (each stage needs its own registers to hold the partial result), and eventually the stages get so small that this overhead dominates. Real processors settle on a handful of stages — classically 5, sometimes 14–20 — as the sweet spot between overlap and hazard cost.