CPU Fundamentals

Pipelines, superscalar execution, branch prediction, and the microarchitectural tricks that make modern CPUs fast.

Tags: cpu, pipeline, superscalar, branch-prediction

The Fetch-Decode-Execute Cycle

At its core, every processor repeats the same loop: fetch an instruction from memory, decode it into micro-operations, and execute those ops on functional units. What separates a textbook CPU from a modern out-of-order beast is how aggressively we overlap and speculate across these stages.
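The loop above can be sketched as a toy interpreter. The accumulator-machine instruction set here (LOAD, ADD, STORE, HALT) is invented purely for illustration; real ISAs are decoded into micro-ops by dedicated hardware.

```python
# Minimal sketch of the fetch-decode-execute loop for a hypothetical
# accumulator machine. Each iteration fetches one instruction at the
# program counter, decodes the opcode, and executes it.

def run(program, memory):
    pc, acc = 0, 0
    while True:
        op, arg = program[pc]          # fetch
        pc += 1
        if op == "LOAD":               # decode + execute
            acc = memory[arg]
        elif op == "ADD":
            acc += memory[arg]
        elif op == "STORE":
            memory[arg] = acc
        elif op == "HALT":
            return memory
        else:
            raise ValueError(f"unknown opcode {op}")

mem = {0: 2, 1: 3, 2: 0}
prog = [("LOAD", 0), ("ADD", 1), ("STORE", 2), ("HALT", None)]
print(run(prog, mem)[2])   # 2 + 3 = 5
```

A hardware pipeline runs exactly this loop, but with the fetch of one instruction overlapped with the decode and execute of its predecessors.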

Pipelining

A classic five-stage pipeline (IF → ID → EX → MEM → WB) lets us start a new instruction every cycle even though each instruction takes five cycles end-to-end. Throughput ≈ 1 IPC in the ideal case.
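The ideal-case arithmetic is easy to check: a k-stage pipeline with no hazards finishes N instructions in k + (N − 1) cycles, so throughput approaches one instruction per cycle as N grows. A quick sketch, assuming a hazard-free pipeline:

```python
# Sketch: ideal pipeline timing, assuming no stalls or flushes.
# A k-stage pipeline fills in k cycles, then completes one
# instruction per cycle thereafter.

def pipelined_cycles(n, stages=5):
    return stages + (n - 1)

n = 1000
print(pipelined_cycles(n))            # 1004 cycles for 1000 instructions
print(n * 5 / pipelined_cycles(n))    # speedup vs. unpipelined: ~4.98x
```

The speedup asymptotically approaches the stage count, which is why deeper pipelines looked attractive until hazard and flush costs caught up.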

Figure: 5-stage pipeline showing instruction overlap across cycles

Pipeline hazards — data, control, and structural — are the enemies of throughput. Forwarding paths, branch predictors, and register renaming exist to mitigate them.
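A data (read-after-write) hazard is straightforward to spot if you model instructions as a destination register plus source registers. This is a simplified sketch, not a real dependence analyzer; the `window` parameter is an assumption standing in for how many instructions apart a dependence still needs forwarding or a stall in a classic 5-stage pipeline.

```python
# Sketch: detecting read-after-write (RAW) hazards. Each instruction
# is modeled as (dest_register, tuple_of_source_registers).

def raw_hazards(instrs, window=2):
    """Return (i, j) pairs where instruction j reads a register that
    instruction i writes, within `window` instructions of it."""
    hazards = []
    for i, (dest, _srcs) in enumerate(instrs):
        for j in range(i + 1, min(i + 1 + window, len(instrs))):
            if dest in instrs[j][1]:
                hazards.append((i, j))
    return hazards

# r1 = r2 + r3 ; r4 = r1 + r5  -> RAW on r1 between instr 0 and 1
prog = [("r1", ("r2", "r3")), ("r4", ("r1", "r5"))]
print(raw_hazards(prog))   # [(0, 1)]
```

Forwarding paths resolve most such pairs without stalling by routing the EX-stage result directly to the dependent instruction's inputs.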

Superscalar & Out-of-Order Execution

Modern cores issue 4–8 µ-ops per cycle. The reorder buffer (ROB) tracks in-flight instructions so results commit in program order even though execution is out-of-order. Key structures:

| Structure | Purpose |
| --- | --- |
| ROB | In-order retirement |
| Reservation stations | Hold operands until ready |
| Physical register file | Eliminates WAR/WAW hazards |
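The ROB's role is easiest to see in miniature: results may arrive in any order, but only the head of the buffer can retire. A minimal sketch, assuming instructions are identified by their program-order index:

```python
# Sketch: in-order retirement from a reorder buffer. Execution
# completes out of order, but the ROB retires only its head entry
# once that entry is done, so architectural state updates in
# program order.
from collections import deque

def retire_order(program_order, completion_order):
    done = set()
    rob = deque(program_order)
    retired = []
    for finished in completion_order:
        done.add(finished)
        while rob and rob[0] in done:   # retire only from the head
            retired.append(rob.popleft())
    return retired

# Instruction 2 finishes first but must wait behind 0 and 1.
print(retire_order([0, 1, 2], [2, 0, 1]))   # [0, 1, 2]
```

This in-order commit point is also what makes precise exceptions and misprediction recovery possible: everything past the faulting or mispredicted entry is simply discarded from the ROB.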

Branch Prediction

A mispredicted branch flushes the pipeline, wasting tens of cycles. Modern predictors (TAGE, perceptron-based) achieve over 96% accuracy on general workloads.
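The classic building block behind these predictors is the 2-bit saturating counter: two wrong outcomes in a row are needed to flip the prediction, so a loop branch that is almost always taken is mispredicted only once per loop exit. A minimal sketch (real TAGE and perceptron predictors are far more sophisticated, indexing many such counters by branch history):

```python
# Sketch: a single 2-bit saturating counter predictor.
# States 0-1 predict not-taken; states 2-3 predict taken.

def predict_stream(outcomes, state=2):
    correct = 0
    for taken in outcomes:
        prediction = state >= 2
        correct += (prediction == taken)
        # saturating update toward the actual outcome
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return correct / len(outcomes)

# A loop branch: taken 9 times, then falls through once.
print(predict_stream([True] * 9 + [False]))   # accuracy 0.9
```

The single mispredict here is the loop-exit iteration; hysteresis keeps the one not-taken outcome from polluting the next run of the loop.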

Key takeaway: Understand the pipeline and you understand why branchless code, data-oriented design, and prefetching matter.