Branch predictor


In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch will go before this is known definitively. The purpose of the branch predictor is to improve the flow in the instruction pipeline. Branch predictors play a critical role in achieving high performance in many modern pipelined microprocessor architectures.
Two-way branching is usually implemented with a conditional jump instruction. A conditional jump can either be "taken" and jump to a different place in program memory, or it can be "not taken" and continue execution immediately after the conditional jump. It is not known for certain whether a conditional jump will be taken or not taken until the condition has been calculated and the conditional jump has passed the execution stage in the instruction pipeline.
Without branch prediction, the processor would have to wait until the conditional jump instruction has passed the execute stage before the next instruction can enter the fetch stage in the pipeline. The branch predictor attempts to avoid this waste of time by trying to guess whether the conditional jump is most likely to be taken or not taken. The branch that is guessed to be the most likely is then fetched and speculatively executed. If it is later detected that the guess was wrong, then the speculatively executed or partially executed instructions are discarded and the pipeline starts over with the correct branch, incurring a delay.
The time that is wasted in case of a branch misprediction is equal to the number of stages in the pipeline from the fetch stage to the execute stage. Modern microprocessors tend to have quite long pipelines so that the misprediction delay is between 10 and 20 clock cycles. As a result, making a pipeline longer increases the need for a more advanced branch predictor.
The first time a conditional jump instruction is encountered, there is not much information to base a prediction on. However, the branch predictor keeps records of whether or not branches are taken, so when it encounters a conditional jump that has been seen several times before, it can base the prediction on the recorded history. The branch predictor may, for example, recognize that the conditional jump is taken more often than not, or that it is taken every second time.
Branch prediction is not the same as branch target prediction. Branch prediction attempts to guess whether a conditional jump will be taken or not. Branch target prediction attempts to guess the target of a taken conditional or unconditional jump before it is computed by decoding and executing the instruction itself. Branch prediction and branch target prediction are often combined into the same circuitry.

Implementation

Static branch prediction

Static prediction is the simplest branch prediction technique because it does not rely on information about the dynamic history of code executing. Instead, it predicts the outcome of a branch based solely on the branch instruction. Early implementations of SPARC and MIPS used single-direction static branch prediction: they always predict that a conditional jump will not be taken, so they always fetch the next sequential instruction. Only when the branch or jump is evaluated and found to be taken, does the instruction pointer get set to a non-sequential address.
Both CPUs fetch one instruction per cycle and evaluate branches in the decode stage. As a result, the branch target recurrence is two cycles long, and the machine always fetches the instruction immediately after any taken branch. Both architectures define branch delay slots in order to utilize these fetched instructions.
A more-advanced form of static prediction assumes that backward branches will be taken and forward branches will not. This rule increases prediction accuracy in loops, which can be constructed with a backward branch at the end that is mostly taken, or a forward branch at the beginning that is mostly not taken. With processors that use this prediction method, ordering the instructions can maximize branch-prediction accuracy. The RISC-V ISA recommends that software written for RISC-V harts, or generated to run on RISC-V harts, be optimized with the assumption that backward branches are taken and forward branches are not.
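The backward-taken/forward-not-taken rule can be sketched as a simple comparison of addresses. The function name and the example addresses below are hypothetical, for illustration only:

```python
def predict_static_btfn(branch_pc: int, target_pc: int) -> bool:
    """Backward-taken/forward-not-taken static heuristic.

    A backward branch (target at a lower address, typical of a
    loop-closing branch) is predicted taken; a forward branch
    (typical of a rarely-taken early exit) is predicted not taken.
    """
    return target_pc < branch_pc

# A loop-closing branch jumps backward, so it is predicted taken:
print(predict_static_btfn(0x1040, 0x1000))  # True
# A forward branch guarding an error path is predicted not taken:
print(predict_static_btfn(0x1040, 0x1080))  # False
```

A compiler targeting such a processor would then lay out the common path so that it falls through forward branches and loops back via backward branches.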
Some processors accept branch-prediction hints that override the static prediction. The Intel Pentium 4 and Pentium 4E accept branch-prediction hints as prefixes. In the presence of dynamic prediction, static prediction offers almost no benefit, and overriding the static prediction helps even less. Intel's subsequent Pentium M and Core2 processors ignore branch-prediction hint prefixes. No x86 manufacturer has re-introduced prediction hints.
Dynamic branch predictors naturally fall back on static branch prediction when they have no cached information. Both the Motorola MPC7450 and the Intel Pentium 4 fall back on static prediction.
In static prediction, all decisions are made at compile time, before the execution of the program.

Dynamic branch prediction

Dynamic branch prediction uses information about taken or not-taken branches gathered at run-time to predict the outcome of a branch.

Random branch prediction

Random branch prediction means taking a random guess at whether a branch will be taken, each time it is executed. The cost of a pseudorandom number generator is low compared to other techniques. Random prediction yields an expected 50% correct prediction rate, which cannot be increased or reduced by any means. Static branch prediction predicts each individual branch with an accuracy ranging from 0% to 100%, with the overall success rate somewhere in between. Unoptimized code is very likely to achieve better than 50% prediction success, and even 90% is plausible. If the compiler reorders instructions with static prediction in mind, the correct-prediction rate rises further still. Static branch prediction long ago surpassed random branch prediction, and dynamic branch prediction has in turn surpassed static branch prediction and is commonplace, despite the added complexity.

Next-line prediction

Some superscalar processors fetch each line of instructions with a pointer to the next line. This next-line predictor handles branch target prediction as well as branch direction prediction.
When a next-line predictor points to aligned groups of 2, 4, or 8 instructions, the branch target will usually not be the first instruction fetched, so the initial instructions fetched are wasted. Assuming, for simplicity, a uniform distribution of branch targets, 0.5, 1.5, and 3.5 of the instructions fetched are discarded, respectively.
Since the branch itself will generally not be the last instruction in an aligned group, instructions after the taken branch will be discarded. Once again, assuming a uniform distribution of branch instruction placements, 0.5, 1.5, and 3.5 instructions fetched are discarded.
The discarded instructions at the branch and destination lines add up to nearly a complete fetch cycle, even for a single-cycle next-line predictor.
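The figures above follow from a short expected-value calculation: with a group of w equally likely positions, (w − 1)/2 slots are wasted on average at each end. The helper name below is illustrative:

```python
# Expected instructions discarded per taken branch for a next-line
# predictor fetching aligned groups of `width` instructions, assuming
# a uniform distribution of branch/target positions within a group.
def expected_discarded(width: int) -> float:
    # Positions 0..width-1 are equally likely, so on average
    # (width - 1) / 2 instructions before the target (or after the
    # branch) are fetched and thrown away.
    return (width - 1) / 2

for w in (2, 4, 8):
    print(w, expected_discarded(w))  # prints 0.5, 1.5, 3.5
```

Summing the two ends gives w − 1 wasted slots per taken branch, which is why the loss approaches a full fetch group.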

One-level branch prediction

Saturating counter

A 1-bit saturating counter records the last outcome of the branch. This is the simplest possible dynamic branch predictor, although it is not very accurate.
A 2-bit saturating counter is a state machine with four states:
  • Strongly not taken
  • Weakly not taken
  • Weakly taken
  • Strongly taken
When a branch is evaluated, the corresponding state machine is updated. Branches evaluated as not taken change the state toward strongly not taken, and branches evaluated as taken change the state toward strongly taken. The advantage of the two-bit counter scheme over a one-bit scheme is that a conditional jump has to deviate twice from what it has done most in the past before the prediction changes. For example, a loop-closing conditional jump is mispredicted once rather than twice.
The original, non-MMX Intel Pentium processor uses a saturating counter, though with an imperfect implementation.
On the SPEC'89 benchmarks, very large bimodal predictors saturate at 93.5% correct, once every branch maps to its own unique counter.
The predictor table is indexed with the instruction address bits, so that the processor can fetch a prediction for every instruction before the instruction is decoded.
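Putting the counter and the address-indexed table together, a bimodal predictor might be sketched as follows. The table size and the assumption of 4-byte instructions are illustrative, not from any particular CPU:

```python
# Bimodal predictor sketch: a table of 2-bit counters indexed by
# low-order instruction-address bits.
TABLE_BITS = 10
table = [1] * (1 << TABLE_BITS)  # all counters start weakly not taken

def index(pc: int) -> int:
    # Drop the byte-offset bits of a 4-byte instruction, then keep the
    # low TABLE_BITS bits; distinct branches may alias to one counter.
    return (pc >> 2) & ((1 << TABLE_BITS) - 1)

def predict(pc: int) -> bool:
    return table[index(pc)] >= 2

def update(pc: int, taken: bool) -> None:
    i = index(pc)
    table[i] = min(table[i] + 1, 3) if taken else max(table[i] - 1, 0)

update(0x40, True)
update(0x40, True)
print(predict(0x40))  # True: this branch has trained its counter
print(predict(0x80))  # False: an untrained branch keeps the default
```

Because only address bits are used, the prediction can be read out in the fetch stage, before the instruction is even known to be a branch; aliasing between branches is the price of that speed.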

Two-level predictor

The two-level branch predictor, also referred to as a correlation-based branch predictor, uses a two-dimensional table of counters, also called a "pattern history table". The table entries are two-bit counters.

Two-level adaptive predictor

If an if statement is executed three times, the decision made on the third execution might depend upon whether the previous two were taken or not. In such scenarios, a two-level adaptive predictor works more efficiently than a saturating counter. Conditional jumps that are taken every second time, or that follow some other regularly recurring pattern, are not predicted well by the saturating counter. A two-level adaptive predictor remembers the history of the last n occurrences of the branch and uses one saturating counter for each of the 2^n possible history patterns. This method is illustrated in figure 3.
Consider the example of n = 2. The last two occurrences of the branch are stored in a two-bit shift register. This branch history register can have four different binary values, 00, 01, 10, and 11, where zero means "not taken" and one means "taken". A pattern history table contains four entries per branch, one for each of the 2^2 = 4 possible branch histories, and each entry contains a two-bit saturating counter of the same type as in figure 2. The branch history register is used for choosing which of the four saturating counters to use: if the history is 00, the first counter is used; if the history is 11, the last of the four counters is used.
Assume, for example, that a conditional jump is taken every third time. The branch sequence is 001001001... In this case, entry number 00 in the pattern history table will go to state "strongly taken", indicating that after two zeroes comes a one. Entry number 01 will go to state "strongly not taken", indicating that after 01 comes a zero. The same is the case with entry number 10, while entry number 11 is never used because there are never two consecutive ones.
The general rule for a two-level adaptive predictor with an n-bit history is that it can predict any repetitive sequence with any period if all n-bit sub-sequences are different.
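The taken-every-third-time example above can be sketched for a single branch as follows. The class, the initial counter values, and the history encoding are illustrative assumptions:

```python
class TwoLevelPredictor:
    """Two-level adaptive predictor for one branch: an n-bit history
    register selects one of 2**n two-bit saturating counters."""

    def __init__(self, n: int = 2) -> None:
        self.n = n
        self.history = 0                # last n outcomes, 1 = taken
        self.counters = [1] * (1 << n)  # all start weakly not taken

    def predict(self) -> bool:
        return self.counters[self.history] >= 2

    def update(self, taken: bool) -> None:
        c = self.counters[self.history]
        self.counters[self.history] = min(c + 1, 3) if taken else max(c - 1, 0)
        # Shift the newest outcome into the history register.
        self.history = ((self.history << 1) | int(taken)) & ((1 << self.n) - 1)

# The branch sequence 001001001... from the text: not taken twice,
# then taken, repeated. After a short warm-up, every guess is right.
p = TwoLevelPredictor(n=2)
misses = 0
for outcome in [False, False, True] * 8:
    if p.predict() != outcome:
        misses += 1
    p.update(outcome)
print(misses)  # 2 misses in 24 branches, all during warm-up
```

After training, the counter for history 00 sits at "strongly taken" and the counters for 01 and 10 at "strongly not taken", while the entry for 11 is never touched, matching the walkthrough above.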
The advantage of the two-level adaptive predictor is that it can quickly learn to predict an arbitrary repetitive pattern. This method was invented by T.-Y. Yeh and Yale Patt at the University of Michigan. Since the initial publication in 1991, this method has become very popular. Variants of this prediction method are used in most modern microprocessors.