CPU Fundamentals
Pipelines, superscalar execution, branch prediction, and the microarchitectural tricks that make modern CPUs fast.
The Fetch-Decode-Execute Cycle
At its core, every processor repeats the same loop: fetch an instruction from memory, decode it into micro-operations (µ-ops), and execute those µ-ops on functional units. What separates a textbook CPU from a modern out-of-order beast is how aggressively we overlap and speculate across these stages.
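The loop can be sketched as a tiny interpreter (a minimal sketch; the three-register machine and its `li`/`add`/`halt` opcodes are invented for illustration):

```python
def run(program):
    """Execute a list of (opcode, *operands) tuples on a toy register machine."""
    regs = {"r0": 0, "r1": 0, "r2": 0}
    pc = 0                            # program counter
    while pc < len(program):
        instr = program[pc]           # fetch: read the instruction at pc
        op, *args = instr             # decode: split opcode from operands
        if op == "li":                # execute: load immediate into a register
            regs[args[0]] = args[1]
        elif op == "add":             # execute: dest = src1 + src2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "halt":
            break
        pc += 1
    return regs

prog = [("li", "r1", 2), ("li", "r2", 3), ("add", "r0", "r1", "r2"), ("halt",)]
```

A real core runs the same three phases, but overlapped in hardware rather than sequentially in software.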
Pipelining
A classic five-stage pipeline (IF → ID → EX → MEM → WB) lets us start a new instruction every cycle even though each instruction takes five cycles end-to-end. Throughput ≈ 1 IPC in the ideal case.
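The ideal-case arithmetic is easy to check: on a k-stage pipeline, n instructions finish in k + (n − 1) cycles instead of k·n, so throughput approaches one instruction per cycle as n grows. A small sketch:

```python
def pipelined_cycles(n, stages=5):
    # First instruction takes `stages` cycles to drain the pipeline;
    # each later one completes a cycle after its predecessor (no stalls).
    return stages + (n - 1)

def unpipelined_cycles(n, stages=5):
    # Without overlap, every instruction pays the full latency.
    return stages * n
```

For n = 1000 the pipelined version needs 1004 cycles versus 5000 unpipelined, a speedup of ~4.98, asymptotically the stage count.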
Pipeline hazards — data, control, and structural — are the enemies of throughput. Forwarding paths, branch predictors, and register renaming exist to mitigate them.
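A rough way to see what forwarding buys is to count bubbles under a simplified model of the classic five-stage pipeline. Assumptions (invented for illustration): register write and read can share a cycle, overlapping hazards are counted independently, and instructions are `(op, dest, srcs)` tuples:

```python
def stall_cycles(instrs, forwarding=True):
    """Count bubbles for a 5-stage pipeline under a simplified model:
    without forwarding, a consumer at distance d < 3 from its producer
    stalls 3 - d cycles (it must wait for the producer's WB stage);
    with full forwarding, only a load followed immediately by a
    consumer of its result needs 1 bubble (the load-use hazard)."""
    stalls = 0
    for i, (op, dest, srcs) in enumerate(instrs):
        for j in range(max(0, i - 2), i):
            p_op, p_dest, _ = instrs[j]
            dist = i - j
            if p_dest is not None and p_dest in srcs:
                if forwarding:
                    if p_op == "load" and dist == 1:
                        stalls += 1
                else:
                    stalls += max(0, 3 - dist)
    return stalls
```

For a load immediately feeding an add, forwarding cuts the penalty from 2 bubbles to 1; for a dependent ALU pair it eliminates the stall entirely.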
Superscalar & Out-of-Order Execution
Modern cores can issue 4–8 µ-ops per cycle. The reorder buffer (ROB) tracks in-flight instructions so results commit in program order even though they execute out of order. Key structures:
| Structure | Purpose |
|---|---|
| ROB | In-order retirement |
| Reservation stations | Hold operands until ready |
| Physical register file | Eliminates WAR/WAW hazards |
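The ROB's role is easiest to see in miniature: entries allocate in program order, complete in any order, and retire strictly from the head. A minimal sketch (the `ReorderBuffer` class and its tag-based interface are invented for illustration; real ROBs also hold results, exception state, and rename information):

```python
from collections import deque

class ReorderBuffer:
    def __init__(self):
        self.entries = deque()  # allocated in program order
        self.index = {}         # tag -> entry, for out-of-order completion

    def allocate(self, tag):
        """Reserve a ROB entry at dispatch, in program order."""
        entry = {"tag": tag, "done": False}
        self.entries.append(entry)
        self.index[tag] = entry

    def complete(self, tag):
        """Mark an in-flight instruction finished — any order is fine."""
        self.index[tag]["done"] = True

    def retire(self):
        """Commit from the head only: an unfinished entry blocks
        everything behind it, preserving program order."""
        retired = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            del self.index[entry["tag"]]
            retired.append(entry["tag"])
        return retired
```

Note how completing the second instruction first retires nothing: the head entry gates the queue, which is exactly what makes precise exceptions and in-order commit possible.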
Branch Prediction
A mispredicted branch flushes the pipeline — tens of cycles wasted. Modern predictors (TAGE, perceptron-based) achieve >96% accuracy on general workloads.
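For contrast with TAGE-class predictors, the classic baseline is a table of 2-bit saturating counters, one per branch: it takes two wrong outcomes in a row to flip the prediction, so a well-behaved loop branch is mispredicted only once per exit. A minimal sketch (class name and table layout invented for illustration):

```python
class TwoBitPredictor:
    def __init__(self):
        # branch PC -> counter in 0..3
        # 0/1 predict not-taken, 2/3 predict taken
        self.counters = {}

    def predict(self, pc):
        # Unseen branches start weakly not-taken (counter = 1).
        return self.counters.get(pc, 1) >= 2

    def update(self, pc, taken):
        # Saturating increment on taken, decrement on not-taken.
        c = self.counters.get(pc, 1)
        self.counters[pc] = min(3, c + 1) if taken else max(0, c - 1)
```

On a loop that runs four iterations then exits (TTTTN repeated), the counter saturates toward "taken" and only the exit — plus the initial cold prediction — mispredicts.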
Key takeaway: Understand the pipeline and you understand why branchless code, data-oriented design, and prefetching matter.
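As a small taste of why branchless code matters: the classic mask-based minimum replaces an unpredictable branch with pure arithmetic, so a mispredict-prone comparison never reaches the branch predictor at all. (Transcribed into Python for readability; a C compiler lowers the comparison itself to a flag-setting instruction rather than a jump.)

```python
def branchless_min(a, b):
    # -(a < b) is -1 (an all-ones mask) when a < b, else 0.
    mask = -(a < b)
    # Mask selects a when a < b, otherwise leaves b unchanged.
    return b ^ ((a ^ b) & mask)
```

On data where the comparison outcome is effectively random — the worst case for any predictor — this kind of select can beat an `if` even though it always does slightly more work.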