Branch Prediction

The Problem

Branch instruction are instructions that can cause the program flow to change. Oftentimes you will find “conditional branches” in the code; branches that require a previous instruction to determine the outcome.

ADD R1, R2, R3 ; R1 = R2 + R3
BRz label1 ; If R1 is zero, jump to label1
ADD R4, R5, R6 ; R4 = R5 + R6
label1:
AND R0, R0 #0 ; R0 = 0

In this example, the BRz instruction is a branch instruction. The condition is if R1 is zero, the program will jump to the label1 and execute the code there. If R1 is not zero, the program will continue executing the next instruction.

Let’s see how this looks in a basic pipeline with no data forwarding and no branch prediction. Condition codes are written at the end of the Writeback stage.

                                              (CC written)
                                              ↓
                        C1   C2   C3   C4   C5   C6   C7   C8   C9   C10  C11
ADD R1, R2, R3           F    D    E    M    W
BRz label1                    F    D    ·    ·    E    M    W
AND R0, R0, #0                                         F    D    E     M    W

The pipeline stalls for 3 cycles after BRz’s Decode. Two of those (C4–C5) are spent waiting for ADD to write the condition codes at the end of its Writeback stage. The third (C6) is spent executing BRz to resolve the branch target. Only in (C7) can the correct next instruction be fetched, since the program counter is set by Execute stage of BRz.

With Data Forwarding

If the pipeline supports data forwarding, the condition codes computed during ADD’s Execute stage can be forwarded directly to BRz — eliminating the 2-cycle CC dependency stall.

                                         (CC forwarded)
                                         ↓
                        C1   C2   C3   C4   C5   C6   C7   C8   C9
ADD R1, R2, R3           F    D    E    M    W
BRz label1                    F    D    E    M    W
AND R0, R0, #0                               F    D    E    M    W

Now only 1 stall cycle remains (C4) — the pipeline still has to wait for BRz to finish executing and resolve the branch target before it knows which instruction to fetch next.

With branch prediction, the pipeline speculatively fetches the predicted target during BRz’s Fetch stage — eliminating the branch resolution stall entirely.

Correct prediction (predicted taken, branch was taken):

                        C1   C2   C3   C4   C5   C6   C7
ADD R1, R2, R3           F    D    E    M    W
BRz label1                    F    D    E    M    W
AND R0, R0, #0                     F    D    E    M    W

Zero stall cycles. The predictor identifies BRz during Fetch and supplies the predicted target address, so AND at label1 is fetched the very next cycle.

Misprediction (predicted not-taken, branch was actually taken):

                        C1   C2   C3   C4   C5   C6   C7   C8   C9
ADD R1, R2, R3           F    D    E    M    W
BRz label1                    F    D    E    M    W
ADD R4, R5, R6                     F    D    X
AND R0, R0, #0                               F    D    E    M    W

The predictor guessed not-taken, so the pipeline fetched ADD R4, R5, R6 (the sequential next instruction). When BRz resolves during Execute (C4), the misprediction is detected — the speculative work is thrown away (X) and the correct target is fetched in C5, resulting in a 2-cycle misprediction penalty.

However it is important to see that even with a misprediction, we are still no worse off than without branch prediction. We still only have a 1 cycle stall after BRz’s Decode.

Branch Prediction Algorithms

There are two main types of branch prediction algorithms:

Static branch prediction: the branch prediction is predetermined by the compiler.
Dynamic branch prediction: the branch prediction is determined at runtime.

Static Branch Prediction

Static branch prediction is the simplest type of branch prediction. It is predetermined by the compiler. The compiler will predict the branch outcome based on the code analysis.

Some examples of static branch prediction algorithms are:

Always not-taken: by always attempting the not-taken path, we can achieve some speedup by avoiding the branch misprediction penalty. This is cheap because it doesn’t require any additional hardware.
BTFNT: Branch Taken Forwards Not Taken. This is a type of static branch prediction algorithm. This algorithm will always take the branch to the smaller address. The intuition is that when performing a loop, the branch will be taken to the smaller address.

Dynamic Branch Prediction

Dynamic branch prediction is the type of branch prediction that is determined at runtime. The runtime will predict the branch outcome based on the code analysis.

LT Last Taken. This is a type of dynamic branch prediction algorithm. This algorithm will predict the branch to the last taken address. The intuition is that the last taken address is likely to be taken again.
2-bit saturating counter: a 2-bit counter that is used to store the branch prediction history. The counter will be incremented on taken and decremented on not taken. The counter will be saturated at 2 and 0. The prediction is made using the most significant bit of the counter.

2-bit saturating counter