A Review of Branch Prediction Schemes and a Study of Branch Predictors in Modern Microprocessors

Ketan N Kulkarni
Texas A&M University
[email protected]

Venkata Rajesh Mekala
Texas A&M University
[email protected]

ABSTRACT


Accurate branch prediction has become increasingly important as the trend continues to move towards superscalar and deeply pipelined processors. This has necessitated the implementation of advanced branch handling techniques that offer high prediction rates and low misprediction penalties. In this paper, apart from discussing some of the important branch prediction schemes, namely static, dynamic and hybrid prediction, we examine the branch predictors in modern processors such as the Alpha 21264, Pentium 4, Itanium 2, ARM11, Opteron and Power5. We present a systematic summary of our study, emphasizing design tradeoffs and contrasting the effectiveness of the different schemes.

2. BRANCH PREDICTION TECHNIQUES
Section 2.1 covers static branch predictors, that is, predictors that do not make use of any runtime information about branch behavior. Section 2.2 describes a wide variety of dynamic branch prediction algorithms, that is, predictors that monitor branch behavior while the program is running and make future predictions based on these observations. Section 2.3 describes hybrid branch predictors that combine the features of multiple simpler predictors to form a better overall predictor. The organization of the material is similar to that in [2].

2.1 Static Branch Prediction Techniques
The advantage of static branch prediction techniques is that they are very simple to implement and require few hardware resources. The main types are enumerated below.

General Terms
Branch prediction, static prediction, dynamic prediction, hybrid prediction, branch target buffer, global history, local history, branch history register, pattern history table, saturating counters.

2.1.1 Single Direction Prediction
The simplest branch prediction strategy is to predict that all branches always go in the same direction (taken or not taken). For example, the Intel i486 used an always-not-taken approach [3]. Since branches are more often taken than not, an always-taken policy can instead be employed, but it suffers from the drawback that the branch target address is generally unavailable when the prediction is made. This causes extra latency or stalls. A branch delay slot [4] can be used to do useful work during the branch stall.

1. INTRODUCTION
In computer architecture, a branch predictor is the unit in a processor that determines whether a conditional branch in the instruction flow of a program is likely to be taken or not. Branch target prediction speculates the target of a branch or unconditional jump before it is computed, by parsing the instruction itself. Branch prediction thus increases the number of instructions available for the scheduler to issue and helps in exploiting the available instruction-level parallelism (ILP).

In a highly parallel computer system, conditional branch instructions can break the normal flow of instruction fetching and execution by causing pipeline disruptions, so there has been mounting pressure on microprocessor architects to improve the predictability of conditional branches. Pipeline disruption reduces the effective instruction throughput by introducing extra delays in the pipeline. According to Amdahl's law [1], the clock cycles per instruction (CPI) of a pipelined processor is adversely affected by stall cycles: CPI_pipelined = CPI_ideal + pipeline stall cycles per instruction. The lower the CPI, the higher the throughput per unit time. Since branches constitute a significant fraction of the instruction mix, the efficiency of handling branches is important. Recent work in branch prediction has led to the development of both hardware and software schemes that achieve good prediction accuracy. Branch prediction addresses two basic problems: 1) predicting the direction of conditional branches, and 2) calculating the branch target address.

The rest of the paper is organized as follows. Section 2 covers a number of important static, dynamic and hybrid techniques. Section 3 presents a case study of the branch prediction schemes in commercial microprocessors from various semiconductor companies. A comparison of the branch prediction schemes and their design tradeoffs is presented in section 4. Sections 5 and 6 conclude the paper.
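As a purely illustrative calculation (the numbers here are assumed, not measured): if 20% of instructions are branches, 10% of those branches are mispredicted, and each misprediction costs 10 stall cycles, the stall component is 0.20 x 0.10 x 10 = 0.2 cycles per instruction, so an ideal CPI of 1.0 becomes 1.2, a 20% loss of throughput. Halving the misprediction rate recovers half of that loss, which is why even modest accuracy improvements matter in deep pipelines.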

2.1.2 Backwards Taken Forward Not Taken
A backward taken, forward not taken (BTFNT) strategy can also be used to predict branches. In this scheme the branch is assumed taken (not taken) when the branch target address is less (greater) than the current value of the PC. This works well for loops, which typically iterate many times before exiting. The Intel Pentium 4 processor uses this approach as a backup strategy [5].

2.1.3 Ball/Larus Heuristics
Some ISAs provide a compiler interface through which branch hints can be given. In that case, the compiler can insert the most likely outcome of each branch based on high-level information about the structure of the program. This is called program-based prediction. A set of such heuristics is described in [6].

2.1.4 Profiling
Profile-based static prediction involves executing an instrumented version of a program on sample input data, collecting statistics, and then feeding the collected information back to the compiler. The compiler uses this profile information to make static branch predictions that are inserted into the final program binary as branch hints. In [7], it was found that the branch bias determined from representative data sets provided good prediction accuracy for future runs on a different data set.

The advantage of profile-based prediction and the other static prediction techniques is that they are very simple to implement in hardware. One disadvantage of profile-based prediction is that once the predictions are derived from the sample data set, they are fixed for the lifetime of the program binary. If a different data set used in the actual execution alters the branch behavior, frequent mispredictions will be made.

2.2.1.2 Two-Level Prediction Table
The two-level predictor employs two separate levels of branch history information to make the branch prediction. The global-history two-level predictor uses a history of the most recent branch outcomes, stored in a branch history register (BHR). The BHR is a shift register: the outcome of each branch is shifted into one end, and the oldest outcome is shifted out of the other end and discarded. This branch history is the first level of the global-history two-level predictor. The second level is a table of saturating counters called the pattern history table (PHT). Figure 2 describes the hardware organization; in that example, the outcomes of the four most recent branch instructions and 2 bits from the branch address form an index into a 64-entry PHT. In general, with h bits of branch history and a bits of branch address, the PHT has 2^(h+a) entries. When only a bits of the branch address are used (where a is smaller than the width of the PC), the branch address must be hashed down to a bits, as in the Smith predictor. There is thus a tradeoff between the number of branch address bits used and the length of the BHR, since the sum of their lengths must equal the number of index bits.
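A minimal C sketch of this indexing and update logic is given below. It is illustrative only: the table sizes, the history length and the PC hash are assumptions chosen for readability, not parameters of any particular design.

    #include <stdint.h>

    #define HIST_BITS 4               /* h: bits of global history (assumed) */
    #define ADDR_BITS 2               /* a: bits of branch address (assumed) */
    #define PHT_SIZE  (1 << (HIST_BITS + ADDR_BITS))   /* 2^(h+a) = 64 entries */

    static uint8_t pht[PHT_SIZE];     /* 2-bit saturating counters (0..3), zero-initialized */
    static uint8_t bhr;               /* global branch history register (low h bits used) */

    /* Concatenate history bits and branch address bits to index the PHT. */
    static unsigned index_pht(uint32_t pc) {
        unsigned hist = bhr & ((1u << HIST_BITS) - 1);
        unsigned addr = (pc >> 2) & ((1u << ADDR_BITS) - 1);  /* hash PC down to a bits */
        return (hist << ADDR_BITS) | addr;
    }

    int predict(uint32_t pc) {                 /* 1 = taken, 0 = not taken */
        return pht[index_pht(pc)] >= 2;        /* MSB of the 2-bit counter */
    }

    void update(uint32_t pc, int taken) {
        unsigned i = index_pht(pc);
        if (taken  && pht[i] < 3) pht[i]++;    /* saturate at 3 */
        if (!taken && pht[i] > 0) pht[i]--;    /* saturate at 0 */
        bhr = (uint8_t)((bhr << 1) | (taken & 1));  /* shift the outcome into the BHR */
    }

Because index_pht() simply concatenates history and address bits, the same skeleton also describes the gselect variant discussed later in section 4.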

2.2 Dynamic Branch Prediction Techniques
Dynamic branch prediction algorithms take advantage of the run-time information available in the processor and can react to changing branch patterns. The benefit of dynamic branch prediction is that performance enhancements can be realized without profiling, and existing binaries can be run unchanged, since no recompilation is necessary.

2.2.1 Basic Dynamic Algorithms
Most of the advanced dynamic prediction algorithms used today are derived from one or more of these basic algorithms.

2.2.1.1 Smith's Algorithm
The main idea behind the majority of branch predictors is that each time the processor discovers the true outcome of a branch, it makes a note of it, so that the next time it encounters the same branch it can make a better-informed prediction. Smith's algorithm [8] is one of the earliest proposed dynamic branch prediction algorithms. The predictor keeps a record, for each branch, of whether its previous executions were taken or not. It consists of a table of counters (figure 1), where each counter tracks past branch directions.

Figure 2. Two-level predictor with global history [2]

Another variation of the two-level predictor is the local-history two-level predictor (figure 3). Whereas global history tracks the outcomes of the last several branches encountered, local history tracks the outcomes of the last several encounters of only the current branch.

Figure 1. Architecture of Smith's algorithm [2]

Since the table has only 2^m entries, the branch address (PC) is hashed down to m bits. Each counter in the table is n bits wide. The MSB of the counter is used for the branch direction prediction: if the MSB is 1, the branch is predicted taken; if it is 0, the branch is predicted not taken. After the branch is resolved and its true direction is known, the counter is updated according to the outcome. If the branch was taken, the counter is incremented only if its current value is less than 2^n - 1; if the branch was not taken, the counter is decremented only if its value is greater than 0. By accumulating the history of several recent executions, the saturating counter is not thrown off by a single anomalous outcome; the additional history bits act as inertia. For tracking branch directions, 2-bit counters provide better prediction rates than 1-bit counters. Adding a third bit, however, improves performance only by a small increment, and in most designs this incremental improvement is not worth the 50% increase in the area needed to accommodate the third bit.
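The counter discipline described above can be captured in a few lines of C. This is only a sketch: the table size (m), the counter width (n) and the PC hash below are assumed values for illustration.

    #include <stdint.h>

    #define M 10                        /* m index bits -> 2^m counters (assumed) */
    #define N 2                         /* n-bit counters (assumed) */
    #define MAX_CTR ((1 << N) - 1)      /* 2^n - 1 */

    static uint8_t ctr[1 << M];         /* counter table */

    static unsigned hash_pc(uint32_t pc) { return (pc >> 2) & ((1u << M) - 1); }

    int smith_predict(uint32_t pc) {
        return (ctr[hash_pc(pc)] >> (N - 1)) & 1;   /* MSB: 1 = predict taken */
    }

    void smith_update(uint32_t pc, int taken) {
        unsigned i = hash_pc(pc);
        if (taken  && ctr[i] < MAX_CTR) ctr[i]++;   /* increment only if below 2^n - 1 */
        if (!taken && ctr[i] > 0)       ctr[i]--;   /* decrement only if above 0 */
    }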

Figure 3. Two-level predictor with local history [2]

To implement a local-history branch predictor, the single global BHR is replaced by one BHR per branch. The collection of BHRs forms a branch history table (BHT). The branch address is used to select one of the entries in the BHT, which provides the local history. The contents of the selected BHR are then combined with the PC to index into the PHT. The most significant bit of the selected counter provides the branch prediction, and the counter update is the same as in the Smith predictor.
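The two table lookups can be sketched as follows. For brevity the PHT is indexed by the local history alone (a PAg-style organization); combining the history with PC bits, as described above, is a straightforward extension. All sizes are assumptions for illustration.

    #include <stdint.h>

    #define BHT_BITS  10                      /* 2^10 per-branch history registers (assumed) */
    #define HIST_BITS 10                      /* bits of local history per branch (assumed) */

    static uint16_t bht[1 << BHT_BITS];       /* first level: one BHR per branch */
    static uint8_t  pht[1 << HIST_BITS];      /* second level: 2-bit saturating counters */

    static unsigned bht_index(uint32_t pc) { return (pc >> 2) & ((1u << BHT_BITS) - 1); }

    int local_predict(uint32_t pc) {
        unsigned hist = bht[bht_index(pc)] & ((1u << HIST_BITS) - 1);
        return pht[hist] >= 2;                /* MSB of the selected 2-bit counter */
    }

    void local_update(uint32_t pc, int taken) {
        unsigned b    = bht_index(pc);
        unsigned hist = bht[b] & ((1u << HIST_BITS) - 1);
        if (taken  && pht[hist] < 3) pht[hist]++;
        if (!taken && pht[hist] > 0) pht[hist]--;
        bht[b] = (uint16_t)((bht[b] << 1) | (taken & 1));  /* shift outcome into this branch's BHR */
    }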

2.2.2.1 The Bi-Mode Predictor
The Bi-Mode predictor [14] uses two PHTs, indexed in a gshare fashion, to reduce the effects of aliasing. A separate choice predictor is indexed with the low-order bits of the branch address. The choice predictor is a table of 2-bit counters whose MSB indicates which of the two PHTs is to be used. The rationale behind the Bi-Mode predictor is that most branches are biased towards one direction or the other; the choice predictor effectively remembers the bias of each branch.

The tradeoffs in sizing a local-history two-level predictor are more complex than in the global-history case. In addition to balancing the number of history and address bits in the PHT index, there is a tradeoff between the number of bits dedicated to the BHT and to the PHT, and within the BHT there is a balance between the number of entries and the width of each entry (i.e. the history length). A local-history two-level predictor with a 2^b-entry BHT, an h-bit history, and a bits of the branch address in the PHT index requires a total of h·2^b + 2·2^(h+a) bits of storage. By tracking the behavior of each branch individually, such a predictor can detect patterns that are local to a particular branch, like the alternating pattern of a loop.

A third variant utilizes a BHT that uses an arbitrary hashing function to divide the branches into different sets, where each set shares a single BHR. Instead of using the least significant bits of the branch address to select a BHR from the BHT, example set-partitioning functions use only the higher-order bits of the PC, or divide the BHT into sets based on opcode. This type of history is called per-set branch history, and the table is called a per-set branch history table (SBHT).

2.2.2.2 The gskewed Predictor The gskewed algorithm [12] divides the PHT into three (or more) banks. Each bank is indexed by a different hash of address-history pairs. If two addresses conflict in one PHT, they are guaranteed not to conflict with each other in the other two PHTs. A majority function is used to make the final prediction.

2.2.2.3 The Agree Predictor

The agree predictor reduces destructive aliasing interference by reinterpreting the PHT counters as a direction agreement bit [15]. The agree predictor assumes that most branches are biased towards one direction and stores the most likely predicted direction in a separate biasing bit. This biasing bit may be stored in the branch target buffer line of the corresponding branch. Instead of predicting the branch direction, the agree predictor predicts whether or not the branch direction will agree with the biasing bit.


2.2.1.3 Index-Sharing Predictors
For a fixed PHT size, employing a larger number of history bits reveals more opportunities to correlate with more distant branches, but this comes at the cost of using fewer branch address bits. If the history length is very long, the frequently occurring history patterns map into the PHT in a very sparse distribution. The gshare algorithm [9] attempts to make better use of the index bits by hashing the BHR and the PC together to select an entry from the PHT. The combination of the BHR and the PC tends to contain more information due to the non-uniform distribution of PC values and branch histories. This is called index sharing.
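The only difference from the concatenation-based two-level predictor is the index computation, typically an XOR of the history and address bits. A short C sketch (index width and PC hash are assumptions):

    #include <stdint.h>

    #define IDX_BITS 12                              /* PHT index width (assumed) */
    static uint8_t  gshare_pht[1 << IDX_BITS];       /* 2-bit saturating counters */
    static uint32_t ghr;                             /* global history register */

    static unsigned gshare_index(uint32_t pc) {
        /* XOR history with branch address bits so both contribute to every index bit. */
        return ((pc >> 2) ^ ghr) & ((1u << IDX_BITS) - 1);
    }

    int gshare_predict(uint32_t pc) { return gshare_pht[gshare_index(pc)] >= 2; }

    void gshare_update(uint32_t pc, int taken) {
        unsigned i = gshare_index(pc);
        if (taken  && gshare_pht[i] < 3) gshare_pht[i]++;
        if (!taken && gshare_pht[i] > 0) gshare_pht[i]--;
        ghr = (ghr << 1) | (taken & 1);
    }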

2.2.2.4 The YAGS Predictor The yet another global scheme (YAGS) approach is similar to the Bi-Mode predictor, except that the two PHTs record only the instances that do not agree with the direction bias. More details can be found in [16].

2.2.2.5 Branch Filtering This approach attempts to remove the highly biased branches from the PHT, thus reducing the total number of branches stored in the PHT which helps alleviate the capacity and conflict aliasing. This approach is especially useful in predicting branches corresponding to error-checking code and exceptions.

2.2.2 Interference Reducing Predictors
Aliasing can happen when two or more branch addresses map to the same entry in the PHT. The PHT can thus be viewed as a cache-like structure and, like the three C's model of cache misses [10, 11], an analogous model for PHT aliasing can be formulated [12]. Accordingly, a particular address-history pair can "miss" in the PHT for the following reasons:

1. Compulsory aliasing occurs the first time the address-history pair is ever used to index the PHT. Fortunately, it accounts for less than 1% of aliasing on the IBS benchmarks [13].
2. Capacity aliasing occurs because the size of the current working set of address-history pairs is greater than the capacity of the PHT. It can be mitigated by increasing the PHT size.
3. Conflict aliasing occurs when two different address-history pairs map to the same PHT entry. Increasing the PHT size is not very effective against it; increasing the associativity or selecting a better replacement policy may be more useful.

The following subsections briefly describe algorithms that try to reduce this interference.

2.2.2.6 Selective Inversion
Selective branch inversion tries to reduce interference in the PHT by interference correction [17, 18]. The idea is to estimate the confidence of each branch prediction; if the confidence is lower than some threshold, the predicted direction is reversed.

2.2.2.7 Alloyed History Predictors

The global predictors are able to make predictions based on correlations with the global branch history, while the local predictors use correlations with local, or per-address, branch history. An alloyed branch predictor uses both global and local branch history [19]. A per-address BHT and a global BHR are both maintained, and bits from the branch address, the global branch history and the local branch history are used together to index into the PHT. This allows both global and local correlations to be captured by the same structure.

2.2.2.8 Path History Predictors
Path-based branch correlation tries to predict the outcome of a conditional branch depending on the past trace of the PC. Instead of storing the last branch outcomes, a few bits from each of the last several branch addresses are stored [20, 21]. The concatenation of these bits encodes the path taken through the last branches, also called the path history, thus allowing the predictor to differentiate between two very different branch behaviors. If q bits are kept from each of the last p branch addresses and k bits are taken from the current branch address, a PHT of 2^(pq+k) entries is required, which is a rather expensive memory requirement. To alleviate this problem, a hashing mechanism to compress these bits is proposed in [22].


2.2.2.9 Variable Path Length Predictors

Since some branch behaviors depend on very recently encountered branch addresses, a constant history size would be ineffective. The predictor used in [22] uses a variable history length so that the memory requirement can be brought down.

2.3 Hybrid Branch Predictors

Different branches in a program may be strongly correlated with different types of history. Because of this, some branches may be accurately predicted with global history-based predictors, while others are more strongly correlated with local history. Programs typically contain a mix of such branch types. This section describes hybrid algorithms that employ two or more single scheme branch prediction algorithms and combine these multiple predictions together to make one final prediction.

2.2.2.10 Dynamic History Length Fitting Predictors
The optimal history length may change between applications. Dynamic history length fitting (DHLF) addresses this issue: instead of using a fixed history length, it dynamically searches for the history length that minimizes the total number of mispredictions for the running program [23].

2.3.1 The Tournament Predictor
The tournament predictor [9] consists of two component predictors, P1 and P2, and a meta-predictor M. The component predictors can be any of the single-scheme predictors described in sections 2.2.1-2.2.2. The meta-predictor predicts which of the two component predictions will be correct. After the true outcome is known, P1 and P2 are updated according to their own update rules, and M is updated according to which of P1 and P2 won the 'tournament'. A hybrid predictor similar to this was implemented in the Compaq Alpha 21264 microprocessor [28]. It has been shown that global branch outcome history hashed with the PC provides better overall prediction [29]. By recursively arranging multiple tournament meta-predictors into a tree, any number of predictors may be combined, as in [30].
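The selection and update policy can be sketched as follows. The component predictors are left abstract: the p1_*/p2_* functions are hypothetical placeholders for any two single-scheme predictors, so this fragment will not link until they are supplied. The meta-table size is an assumption, and a real pipeline would latch the component predictions alongside the branch rather than recomputing them at update time.

    #include <stdint.h>

    /* Hypothetical component predictors P1 and P2 (placeholders, not defined here). */
    extern int  p1_predict(uint32_t pc);
    extern void p1_update (uint32_t pc, int taken);
    extern int  p2_predict(uint32_t pc);
    extern void p2_update (uint32_t pc, int taken);

    #define META_BITS 12
    static uint8_t meta[1 << META_BITS];     /* 2-bit counters: >= 2 means "trust P2" */

    static unsigned meta_index(uint32_t pc) { return (pc >> 2) & ((1u << META_BITS) - 1); }

    int tournament_predict(uint32_t pc) {
        return (meta[meta_index(pc)] >= 2) ? p2_predict(pc) : p1_predict(pc);
    }

    void tournament_update(uint32_t pc, int taken) {   /* taken is 0 or 1 */
        int c1 = (p1_predict(pc) == taken);            /* was each component correct? */
        int c2 = (p2_predict(pc) == taken);
        unsigned i = meta_index(pc);
        if (c2 && !c1 && meta[i] < 3) meta[i]++;       /* meta-counter moves only when they disagree */
        if (c1 && !c2 && meta[i] > 0) meta[i]--;
        p1_update(pc, taken);                          /* components train independently */
        p2_update(pc, taken);
    }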

2.2.2.11 Loop Counting Predictors
Many applications contain loops that run for a fixed number of iterations (e.g. matrix computations). A loop predictor records the trip count of a loop the first time it runs; the next time the loop is encountered, the loop-exit prediction is made from this recorded count. The Pentium M processor uses a loop predictor in conjunction with a branch-history-based predictor [24]. While loop predictors are effective inside hybrid predictors, they generally offer poor stand-alone performance, since they cannot predict non-loop branches effectively.
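One simple way such a predictor can be organized is shown below. This is not the Pentium M's actual design; the table size, tagging and confidence handling are assumptions for illustration.

    #include <stdint.h>

    #define LOOP_ENTRIES 64                 /* small tagged table (assumed size) */

    struct loop_entry {
        uint32_t tag;                       /* identifies the loop-closing branch */
        uint16_t trip_count;                /* branch executions observed in one full run of the loop */
        uint16_t current;                   /* executions seen so far in the current run */
        uint8_t  trained;                   /* set once a full trip count has been learned */
    };
    static struct loop_entry loops[LOOP_ENTRIES];

    static struct loop_entry *lookup(uint32_t pc) { return &loops[(pc >> 2) % LOOP_ENTRIES]; }

    /* Predict "taken" (stay in the loop) until the learned trip count is reached.
     * In a hybrid, *confident tells the chooser whether to use this prediction at all. */
    int loop_predict(uint32_t pc, int *confident) {
        struct loop_entry *e = lookup(pc);
        *confident = (e->tag == pc) && e->trained;
        return *confident ? (e->current + 1 < e->trip_count) : 1;
    }

    void loop_update(uint32_t pc, int taken) {
        struct loop_entry *e = lookup(pc);
        if (e->tag != pc) { e->tag = pc; e->trained = 0; e->current = 0; }
        if (taken) {
            e->current++;
        } else {                            /* loop exit: remember how long the loop ran */
            e->trip_count = e->current + 1;
            e->current = 0;
            e->trained = 1;
        }
    }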

2.3.2 Prediction Fusion

Instead of singling out one prediction as the final prediction in a multi-predictor system, the prediction fusion method attempts to combine the predictions from all participating predictors when coming up with the final prediction [31].

2.2.2.12 The Perceptron Predictor

By maintaining larger branch history registers, the additional history stored provides more opportunities for correlating branch predictions. There are two major drawbacks to this approach. The first is that the size of the PHT is exponential in the width of the BHR. The second is that many of the history bits may not actually be relevant and thus act as training noise, so two-level predictors with a large BHR take a long time to train. The Perceptron predictor [25] offers a solution to this problem. Each branch address is mapped to a single entry in a perceptron table, and each entry consists of a single perceptron, the simplest form of neural network [26]. One limitation of the Perceptron predictor is that only linearly separable functions can be learned (a Boolean function is linearly separable if the input combinations for which it is 1 can be separated in hyperspace from those for which it is 0 by a hyperplane). In [25] it is shown that for half of the SPEC2000 benchmarks about 50% of the branches are linearly separable. The Perceptron predictor has four parameters: the number of perceptrons, the number of history bits used, the width of the weights, and the learning threshold. The number of history bits that can be used is still much larger than in gshare predictors.
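A compact C sketch of the perceptron prediction and training loop follows. The table size and history length are assumptions, and the training threshold uses the formula suggested in [25]; a hardware implementation would compute the dot product with an adder tree rather than a software loop.

    #include <stdint.h>

    #define HIST  16                          /* history length (assumed) */
    #define TABLE 256                         /* number of perceptrons (assumed) */
    #define THETA ((int)(1.93 * HIST + 14))   /* training threshold proposed in [25] */

    static int8_t weights[TABLE][HIST + 1];   /* w[0] is the bias weight */
    static int    ghist[HIST];                /* global history as +1 (taken) / -1 (not taken) */

    static void adjust(int8_t *w, int delta) {          /* saturating weight update */
        int v = *w + delta;
        if (v >  127) v =  127;
        if (v < -127) v = -127;
        *w = (int8_t)v;
    }

    int perceptron_predict(uint32_t pc, int *y_out) {
        int8_t *w = weights[(pc >> 2) % TABLE];
        int y = w[0];
        for (int i = 0; i < HIST; i++) y += w[i + 1] * ghist[i];   /* dot product with history */
        *y_out = y;
        return y >= 0;                                   /* sign of the output is the prediction */
    }

    void perceptron_update(uint32_t pc, int taken) {
        int y;
        int pred = perceptron_predict(pc, &y);
        int t = taken ? 1 : -1;
        int8_t *w = weights[(pc >> 2) % TABLE];
        if (pred != taken || (y < THETA && y > -THETA)) {            /* train on mispredict or low confidence */
            adjust(&w[0], t);
            for (int i = 0; i < HIST; i++) adjust(&w[i + 1], t * ghist[i]);
        }
        for (int i = HIST - 1; i > 0; i--) ghist[i] = ghist[i - 1];  /* shift the new outcome in */
        ghist[0] = t;
    }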

2.4 Target Prediction
For conditional branches, predicting whether the branch is taken or not taken is only half of the problem. Once the direction of a branch is predicted, the actual address of the next instruction along the predicted path must also be determined. If a branch is predicted not-taken, the target is simply the next sequential instruction. If the branch is predicted taken, the target depends on the type of branch. The two common types of branch targets are 1) PC-relative and 2) indirect; an indirect target is not known at compile time and is read from a register at runtime. The target of a branch is usually predicted by a branch target buffer (BTB). The BTB is a cache-like structure that stores the last seen target address for a branch instruction. While the branch predictor predicts the branch outcome, the processor, in parallel, indexes into the BTB with the PC to obtain the target address. Figure 4 shows the organization of a BTB. If the branch predictor predicts not-taken, the target is simply the next sequential instruction. If the branch predictor predicts taken and there is a hit in the BTB, the BTB entry (a PC address) is used as the next instruction address. If, on the other hand, the branch predictor predicts taken and there is a miss in the BTB, the processor may either stall fetching or assume the branch is not taken and fetch the next sequential instruction, depending on the strategy employed.

2.2.2.13 The Data Flow Predictor
Two branches may be correlated for two reasons: 1) a branch may guard variables that affect the test condition of a later branch, or 2) the two branches operate on similar data. A branch which affects the outcome of another branch is called an affector branch. The main idea behind the data flow branch predictor is to explicitly track which previous branches are affector branches for the current branch [27]. The affector register file (ARF) stores one bitmask per architected register, where the entries of the bitmask correspond to past branches; the ARF is updated on every branch instruction.


Different strategies may be used for maintaining the information in the BTB: the targets of all branches, or only of taken branches (which is better), can be stored for lookup.
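A direct-mapped BTB that stores only taken branches can be sketched as follows; the entry count and full-PC tag are assumptions, and real BTBs are usually set-associative with partial tags.

    #include <stdint.h>

    #define BTB_ENTRIES 512                 /* direct-mapped, size assumed for illustration */

    struct btb_entry { uint32_t tag; uint32_t target; uint8_t valid; };
    static struct btb_entry btb[BTB_ENTRIES];

    static unsigned btb_index(uint32_t pc) { return (pc >> 2) % BTB_ENTRIES; }

    /* Returns 1 and fills *target on a hit; 0 on a miss (fetch falls through or stalls). */
    int btb_lookup(uint32_t pc, uint32_t *target) {
        struct btb_entry *e = &btb[btb_index(pc)];
        if (e->valid && e->tag == pc) { *target = e->target; return 1; }
        return 0;
    }

    /* Storing only taken branches keeps not-taken branches from occupying entries. */
    void btb_update(uint32_t pc, uint32_t target, int taken) {
        if (!taken) return;
        struct btb_entry *e = &btb[btb_index(pc)];
        e->tag = pc; e->target = target; e->valid = 1;
    }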

The scheme dynamically chooses between two types of branch predictors, one using local history and one using global history, to predict the direction of a given branch. This gives a 90%-100% success rate on most simulated applications and benchmarks. The processor adapts dynamically to choose the best method for each branch.

Figure 5. Branch prediction scheme in the Alpha 21264 [28]

Figure 5 shows the structure of the tournament branch predictor. The local-history prediction path, through a two-level structure, is on the left. The first level holds 10 bits of branch pattern history for up to 1,024 branches; this 10-bit pattern picks one of 1,024 prediction counters. The global predictor is a 4,096-entry table of 2-bit saturating counters indexed by the path, or global, history of the last 12 branches. The choice predictor is also a 4,096-entry table of 2-bit counters indexed by the path history. The processor inserts the true branch direction into the local-history table once branches retire.

Figure 4. Branch target buffer [2]

Typically, the target of a jump into a function is easy to predict, but the return is not, since a subroutine may be called from multiple locations in the code. The return address stack (RAS) is a special branch target predictor that only provides predictions for subroutine returns [32]. When a jump into a function happens, the return address is pushed onto the RAS. During the initial jump, the RAS does not provide a prediction and the target address must be predicted by the regular BTB. Later, when the program returns from the subroutine, the top entry of the RAS is popped and provides the correct target prediction. The stack can store multiple addresses, so returns from nested functions are properly predicted. To avoid a stack overflow from deeply nested functions, the RAS is implemented as a circular buffer, so that an overflow overwrites the oldest return address with the most recent one.
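The circular-buffer behavior is easy to express in C; the stack depth below is an arbitrary assumption.

    #include <stdint.h>

    #define RAS_DEPTH 16                    /* stack depth (assumed) */

    static uint32_t ras[RAS_DEPTH];
    static unsigned ras_top;                /* index of the next free slot, wraps around */

    /* On a call: push the return address. Wrapping overwrites the oldest entry
     * instead of overflowing, as described above. */
    void ras_push(uint32_t return_addr) {
        ras[ras_top] = return_addr;
        ras_top = (ras_top + 1) % RAS_DEPTH;
    }

    /* On a return: pop the most recently pushed address as the predicted target. */
    uint32_t ras_pop(void) {
        ras_top = (ras_top - 1 + RAS_DEPTH) % RAS_DEPTH;
        return ras[ras_top];
    }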

3.2 Intel - Pentium 4 (2000)
The Pentium 4 (P4) processor improves upon the performance of prior implementations of the NetBurst microarchitecture [34] through larger caches, larger internal buffers, improved algorithms, and new features [35]. The processor also implements Hyper-Threading Technology [34], the ability to run multiple threads simultaneously, which allows one physical processor to appear as two independent logical processors. The P4 provides a substantial performance gain for many key application areas where the end user can truly appreciate the difference.

The P4 uses two branch prediction schemes. The front-end dynamic branch predictor has 4K branch target entries and captures most of the branch history information for the program. If a branch is not found in the branch target buffer, the P4 falls back on a simpler static BTFNT branch prediction scheme. The Pentium 4 also makes use of software branch hints for accurate branch prediction; these hints are inserted by the compiler using information about branch behavior available in the higher-level language.

3. BRANCH PREDICTORS IN MODERN PROCESSORS
It is well known that microprocessor complexity has been increasing consistently, adhering to Moore's law [33]. Many modern microprocessors are multithreaded, super-pipelined, and superscalar. In this section we discuss the branch prediction schemes of some of these processors, in chronological order.

3.3 Intel – Itanium 2 (2002)

3.1 Alpha – 21264 (2000)

The Itanium 2 processor extends the processing power of the Itanium processor family with a capable and balanced microarchitecture executing up to 6 instructions at a time. It is a superscalar, deeply pipelined, out-of-order processor providing both performance and binary compatibility for Itanium-based applications and operating systems [36].

The Alpha 21264 [28] is a high-clock-speed, superscalar, out-of-order processor that provides exceptional core computational performance, together with a high-bandwidth memory system that gives robust performance for a wide range of applications. Branch prediction is important to the 21264's efficiency because its instruction engine is faster and its misprediction penalty is higher than in its predecessors. The 21264 can accept 80 in-flight instructions, offering many opportunities for parallelism. It implements a sophisticated tournament branch prediction scheme [28].

The Itanium 2 processor's branch prediction relies on a two-level prediction algorithm and two levels of branch history storage. The first level of branch prediction storage is tightly coupled to the L1 instruction cache (L1I). This coupling allows a branch's taken/not-taken history and a predicted target to be delivered with every L1I demand access in one cycle. The branch prediction logic uses the history to access a pattern history table and determine a branch's final taken/not-taken prediction, or trigger, according to the Yeh-Patt algorithm [37]. The L2 branch cache saves the histories and triggers of branches evicted from the L1 I-cache so that they are available when the branch is revisited, providing the second storage level. A bundle of three branch instructions shares the same hardware (prediction history and target). The advantage is that less hardware is required to predict the outcome of a branch, but the caveat is that the target may not be sufficient to represent the entire span required by the branch, and there may be times when the front end is re-steered to an incorrect address. The branch prediction logic tracks this situation and provides a corrected PC-relative target one cycle later.

The branch target array holds branch target addresses and is backed by a 2-cycle branch target address calculator to accelerate address calculations.

3.4 ARM 11 (2002)
The ARM11 microarchitecture [38] is the first implementation of the ARMv6 ISA and forms the basis of a new family of ARM11 cores. The low-power ARM11 pipeline is scalar (the ARM11 designers determined that the potential performance gain from multiple-instruction issue did not justify the penalty in increased power and area). The ARM11 microarchitecture uses two techniques to predict branches [38]. The dynamic branch predictor uses a history table to determine whether a branch has been seen before and whether it was most frequently taken or most frequently not taken. A 64-entry, 4-state branch target address cache (BTAC) is maintained; this table is sufficient to hold the majority of recent branches. If the branch has been encountered before, a prediction is made based on its previous outcomes. If the dynamic branch predictor cannot find a record of the branch instruction, a static branch prediction procedure takes over, which follows the BTFNT policy. A return stack manages branch prediction for returns from up to three procedure calls. As well as predicting branches, the ARM11 pipeline also folds branches: if a branch is predicted not-taken, the branch instruction is removed (folded) from the pipeline, as there is no point in executing it. This saves a clock cycle for a correctly predicted branch. The net effect of the dynamic and static branch prediction employed in the ARM11 microarchitecture is that around 85% of branches are correctly predicted, typically saving five processor clock cycles for every correctly predicted branch.

Figure 6. Branch prediction scheme in the AMD Opteron [39]

The return-address stack (RAS) optimizes call/return pairs by storing the return address of each call and supplying it for use by the corresponding return instruction. When a line is evicted from the instruction cache, the Opteron saves the branch prediction information and the end bits in the L2 cache error-correction-code field (since parity protection is sufficient for read-only instruction data). Thus, the fetch unit does not have to recreate branch history information when it brings a line back into the instruction cache from the L2 cache. The mispredicted branch penalty is 11 cycles. This unique and simple trick has improved the Opteron's branch prediction accuracy on various benchmarks by 5-10% over the Athlon.

3.6 IBM - Power 5 (2005)

3.5 AMD - Opteron (2003)
The AMD Opteron [39] is AMD's first 64-bit x86 processor; its x86-64 architecture provides backward compatibility with x86, and an on-chip DDR memory controller and HyperTransport links deliver server-class performance. The Opteron core is an out-of-order, superscalar processor supported by large on-chip L1 and L2 caches.

The enhancements in Power5 [40] (Performance Optimization with enhanced RISC) include dynamic resource balancing to efficiently allocate system resources to each thread, softwarecontrolled thread prioritization, and dynamic power management to reduce power consumption without affecting performance. Power 5 is a superscalar, out-of-order execution processor.

There is a significant improvement in AMD's branch prediction scheme in the Opteron compared with AMD's Athlon processor. The Opteron uses a combination of branch prediction schemes: a branch selector array selects between static prediction and the history table. As shown in figure 6, the branch prediction scheme has a global history table of 2-bit saturating up/down counters.

The Power5 scans the fetched instructions for branches. Each branch is entered in the branch information queue (BIQ) at instruction fetch time; the BIQ saves the information necessary to recover from a mispredicted branch, and entries are deallocated in program order when branches are executed. If the fetched instructions contain multiple branches, the branch prediction stage can predict all of them at the same time. The direction of a branch is predicted using three branch history tables. Two of the BHTs use bimodal and path-correlated branch prediction mechanisms to predict branch directions; the third BHT predicts which of these two prediction mechanisms is more likely to predict the correct direction.


In addition to predicting direction, the Power5 also predicts the target of a taken branch in the current cycle's eight-instruction fetch group (it is an 8-wide superscalar processor). Target addresses for absolute and relative branches are computed directly as part of the branch scan function. If there is a taken branch, the program counter is loaded with the target address of the branch; otherwise, it is loaded with the address of the next sequential instruction to be fetched. In simultaneous multithreading (SMT) mode, two separate program counters are used, one for each thread; instruction fetches alternate between the two threads, and branch prediction likewise alternates between threads. In single-thread mode, only one program counter is used, and instructions can be fetched for that thread every cycle.


4. COMPARISON OF DIFFERENT STRATEGIES

The effectiveness of the different schemes can be studied on the basis of their comparisons with each other and their performance on benchmarks. Accurate comparisons, however, are often difficult to make for two reasons: 1) each branch prediction scheme may be designed with a specific type of architecture in mind and may be useful for a particular application; and 2) each scheme was introduced at a different time, so the implementation costs of that era must be taken into account when judging its cost effectiveness. In spite of these difficulties, numerous attempts [8, 12, 14, 22, 31, 37, 41] have been made at comparing the different schemes. In this section we present the advantages and caveats of the different schemes. Section 4.1 lists the results obtained by comparing the schemes with one another. Section 4.2 examines the branch predictors in the processors discussed before.

4.1 Comparison of branch prediction schemes
Although static branch prediction techniques can achieve conditional branch prediction rates of 70-80% [42], if the profiling information is not representative of the actual run-time behavior, prediction accuracy can suffer greatly. Dynamic branch predictors typically achieve higher prediction rates, in the range of 80-95% [9, 41, 43], but require more chip area to implement. The area budget of an embedded processor might not allow for this, but for a large wide-issue superscalar processor, where branch prediction is critical, sophisticated dynamic prediction schemes must be used. It is therefore generally accepted that dynamic schemes outperform static schemes. Hence, instead of elaborating on static vs. dynamic schemes, we proceed to compare the different dynamic schemes.

4.1.1 Comparison of local and global prediction strategies
It has been observed that the higher the taken frequency, the higher the prediction accuracy. Also, branches in floating-point programs are easier to predict than those in integer programs, as supported by the data in table 1, where the floating-point accuracies are consistently higher than the integer ones. Table 1 shows branch prediction accuracies for different schemes, averaged over several benchmarks from the SPEC89 suite. It shows that, with the same table sizes, the selective predictor performs better than the other schemes. Some more interesting comparisons are found in [9, 41, 43].

Table 1. Performance summary of schemes on SPEC89 benchmarks [44]

Scheme      | int   | fp    | all
bimodal     | 89.8% | 94.4% | 92.6%
gshare      | 90.3% | 94.7% | 93.0%
correlation | 90.8% | 94.7% | 93.2%
local       | 91.3% | 95.6% | 93.9%
gselect     | 91.8% | 95.3% | 93.9%
selective   | 92.7% | 95.5% | 94.4%

In McFarling's paper [9], SPEC89 benchmarks are used to compare various dynamic branch prediction schemes. We can infer the following from that paper:

1. The global prediction scheme is significantly less effective than the local prediction scheme for a fixed-size predictor.
2. For small predictors, the bimodal scheme is relatively good. This is because the branch address bits used in the bimodal scheme efficiently distinguish different branches. As the number of counters doubles, roughly half as many branches share the same counter; as more counters are added, eventually each frequent branch maps to a unique counter, so the information content of each additional address bit declines to zero for increasingly large counter tables.
3. For small sizes, global prediction with index selection (gselect) best parallels the performance of bimodal prediction.
4. Once there are enough address bits to identify most branches, additional global history bits give significantly better prediction than the bimodal scheme. The gselect method also significantly outperforms simple global prediction for most predictor sizes, because the branch address bits identify the branch more efficiently.
5. The bimodal scheme and the global-history scheme with index sharing (gshare), combined together, outperform all the other branch predictors in accuracy.

4.1.2 Comparison of different two-level dynamic prediction schemes
A detailed performance evaluation methodology and comparison of the nine two-level branch prediction schemes (GAg, GAp, GAs, PAg, PAp, PAs, SAg, SAp, SAs) is given in [41]. Trace-driven simulations were done with the SPEC89 benchmarks, and the main observations are summarized in table 2. The results in table 2 can be better analyzed by understanding the following points:

1. Global history schemes make effective predictions for if-then-else branches due to their correlation with previous branches.
2. When global history is used, the pattern histories of different branches interfere with each other if they map to the same pattern history table. Therefore, global history schemes require longer branch history and/or many pattern history tables to reduce the interference for effective overall performance.
3. Floating-point programs contain many frequently executed loop-control branches which exhibit periodic branch behavior. This periodic behavior is better retained with a per-address branch history table.
4. When per-address branch history is used, the pattern histories of different branches tend to interfere less with each other; therefore, fewer pattern history tables are needed.
5. Per-set history schemes require higher implementation costs than global history schemes due to the separate pattern history tables of each set.

Table 2. Performance of different 2-level schemes [41]

Evaluation parameter            | Global history                               | Per-set history | Per-address history
Performance (int)               | best                                         | ≈ global        | worst
Performance (fp)                | worst                                        | ≈ per-address   | best
Implementation cost             | more                                         | highest         | less
Size of branch history register | requires large BHR or more PHTs              | optimal         | requires many entries in BHT
Cost effectiveness              | GAs most effective amongst high-cost schemes | -               | PAs most effective amongst low-cost schemes
Remarks                         | predicts if-then-else effectively            | -               | predicts loops effectively

A comparison of the hardware cost of the different schemes appears in table 3 [41], which designers should keep in mind. In the table, k is the history register length, b the number of entries in the per-address BHT, s the number of history sets, and p the number of PHTs per set; each PHT entry is a 2-bit counter.

Table 3. Hardware costs for different 2-level schemes [41]

Scheme name | BHR size | No. of PHTs | Simplified hardware cost
GAg(k)      | k        | 1           | k + 2^k × 2
GAs(k, p)   | k        | p           | k + p × 2^k × 2
GAp(k)      | k        | b           | k + b × 2^k × 2
PAg(k)      | b × k    | 1           | b × k + 2^k × 2
PAs(k, p)   | b × k    | p           | b × k + p × 2^k × 2
PAp(k)      | b × k    | b           | b × k + b × 2^k × 2
SAg(k)      | s × k    | 1           | s × k + 2^k × 2
SAs(k, p)   | s × k    | p           | s × k + p × 2^k × 2
SAp(k)      | s × k    | b           | s × k + b × 2^k × 2

4.2 Comparison of branch predictors in modern processors
In this section we examine the branch prediction strategies used in each of the processors listed in section 3. Again, absolute comparison is difficult because 1) each processor has a different ISA, clock frequency and architecture; 2) each processor is manufactured by a different company at a different time; 3) even for a single processor, several models are manufactured over a number of years, with incremental improvements in process technology or design made to the original; and finally 4) each processor has a different target application.

A table comparing all the processors is included in the appendix. From the data available to us, we make the following observations:

1. All the modern processors discussed support out-of-order execution and use hybrid branch prediction schemes to achieve good accuracy.
2. The branch prediction accuracy of the Alpha 21264 appears to be better than that of the other processors. However, it should be noted that it is a RISC processor with a much simpler instruction set than Intel's Pentium, so an absolute comparison should not be made.
3. The Power5 is an 8-wide superscalar processor. If all the fetched instructions are branches, its branch prediction scheme can predict all of them simultaneously.
4. The Intel Pentium 4 has a very low hardware cost for its branch prediction hardware and thus a low area overhead in the implementation.
5. The Intel Itanium 2 has an average branch penalty of just 6 cycles, which is very good for a server-class processor. However, since clock frequencies vary across different processors (and also across different chips of the same model), it is difficult to say how large the penalty really is in absolute terms.
6. At any given time, the Alpha 21264 can have up to 80 instructions in various stages of the pipeline, surpassing every other contemporary microprocessor. For such a complex design, an implementation cost of about 30 KB (plus some control-logic overhead) is noteworthy.
7. The AMD Opteron implements a return address stack (RAS) which allows deeply nested subroutine calls to return to the correct target address without causing a stack overflow.
8. Since the IBM Power5 is the most recent of the microprocessors presented here, we expect its branch prediction accuracy to be high.

5. ACKNOWLEDGMENTS


We would like to thank Dr. Paul Gratz for suggesting this topic for the literature survey, which was both informative and interesting.


6. CONCLUSION


In this paper we have presented a survey of important branch prediction techniques. We have also listed the salient features of branch predictors in modern microprocessors and compared their functionalities. The design tradeoffs and comparisons have been discussed.


7. REFERENCES
1. Amdahl, G.M., Validity of the single processor approach to achieving large scale computing capabilities, in Readings in Computer Architecture. Morgan Kaufmann, 2000, p. 79-81.
2. Shen, J.P. and M.H. Lipasti, Modern Processor Design: Fundamentals of Superscalar Processors. McGraw-Hill.
3. Intel Corporation, Embedded Intel486 Processor Hardware Reference Manual, 1997.
4. Patterson, D.A. and J.L. Hennessy, Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 1990.
5. Intel Corporation, Intel Architecture Optimization Reference Manual, 2003.
6. Ball, T. and J.R. Larus, Branch prediction for free. SIGPLAN Notices, 1993, 28(6), p. 300-313.
7. Fisher, J.A. and S.M. Freudenberger, Predicting conditional branch directions from previous runs of a program. SIGPLAN Notices, 1992, 27(9), p. 85-95.
8. Smith, J.E., A study of branch prediction strategies, in 25 Years of the International Symposia on Computer Architecture (selected papers). ACM, 1998.
9. McFarling, S., Combining Branch Predictors. DEC Western Research Laboratory, 1993.
10. Hill, M.D., Aspects of Cache Memory and Instruction Buffer Performance. PhD thesis, University of California, Berkeley, 1987.
11. Sugumar, R.A. and S.G. Abraham, Efficient simulation of caches under optimal replacement with applications to miss characterization, in Proceedings of the ACM SIGMETRICS Conference on Measurement and Modeling of Computer Systems, 1993.
12. Michaud, P., A. Seznec, and R. Uhlig, Trading conflict and capacity aliasing in conditional branch predictors, in Proceedings of the 24th Annual International Symposium on Computer Architecture (ISCA), 1997.
13. Uhlig, R., et al., Instruction fetching: coping with code bloat. SIGARCH Computer Architecture News, 1995, 23(2), p. 345-356.
14. Lee, C.-C., I.-C.K. Chen, and T.N. Mudge, The bi-mode branch predictor, in Proceedings of the 30th Annual International Symposium on Microarchitecture (MICRO-30), 1997.
15. Sprangle, E., et al., The agree predictor: a mechanism for reducing negative branch history interference. SIGARCH Computer Architecture News, 1997, 25(2), p. 284-291.
16. Eden, A.N. and T. Mudge, The YAGS branch prediction scheme, in Proceedings of the 31st Annual International Symposium on Microarchitecture (MICRO-31), 1998.
17. Manne, S., A. Klauser, and D. Grunwald, Branch prediction using selective branch inversion, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 1999.
18. Aragón, J.L., et al., Confidence estimation for branch prediction reversal, in Proceedings of the 8th International Conference on High Performance Computing (HiPC), 2001.
19. Skadron, K., M. Martonosi, and D.W. Clark, A taxonomy of branch mispredictions, and alloyed prediction as a robust solution to wrong-history mispredictions, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2000.
20. Nair, R., Dynamic path-based branch correlation, in Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), 1995.
21. Reches, S. and S. Weiss, Implementation and analysis of path history in dynamic branch prediction schemes, in Proceedings of the 11th International Conference on Supercomputing, 1997.
22. Stark, J., M. Evers, and Y.N. Patt, Variable length path branch prediction. SIGPLAN Notices, 1998, 33(11), p. 170-179.
23. Juan, T., S. Sanjeevan, and J.J. Navarro, Dynamic history-length fitting: a third level of adaptivity for branch prediction, in Proceedings of the 25th Annual International Symposium on Computer Architecture (ISCA), 1998.
24. Gochman, S., et al., The Intel Pentium M processor: microarchitecture and performance. Intel Technology Journal, 2003, 7(2).
25. Jiménez, D.A. and C. Lin, Neural methods for dynamic branch prediction. ACM Transactions on Computer Systems, 2002, 20(4), p. 369-397.
26. Rosenblatt, F., Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms. Spartan Books, 1962.
27. Thomas, R., et al., Improving branch prediction by dynamic dataflow-based identification of correlated branches from a large global history, in Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA), 2003.
28. Kessler, R.E., The Alpha 21264 microprocessor. IEEE Micro, 1999, 19(2), p. 24-36.
29. Chang, P.-Y., E. Hao, and Y.N. Patt, Alternative implementations of hybrid branch predictors, in Proceedings of the 28th Annual International Symposium on Microarchitecture (MICRO-28), 1995.
30. Evers, M., Improving Branch Prediction by Understanding Branch Behavior. PhD thesis, University of Michigan, 2000.
31. Loh, G.H. and D.S. Henry, Predicting conditional branches with fusion-based hybrid predictors, in Proceedings of the International Conference on Parallel Architectures and Compilation Techniques (PACT), 2002.
32. Kaeli, D.R. and P.G. Emma, Branch history table prediction of moving target branches due to subroutine returns. SIGARCH Computer Architecture News, 1991, 19(3), p. 34-42.
33. Moore, G.E., Cramming more components onto integrated circuits. Electronics, 1965, 38(8).
34. Koufaty, D. and D.T. Marr, Hyperthreading technology in the NetBurst microarchitecture. IEEE Micro, 2003, 23(2), p. 56-65.
35. Hinton, G., et al. (Desktop Platforms Group, Intel Corp.), The microarchitecture of the Pentium 4 processor. Intel Technology Journal, 2001.

36. McNairy, C. and R. Bhatia, Montecito: a dual-core, dual-thread Itanium processor. IEEE Micro, 2005, 25(2), p. 10-20.
37. Yeh, T.-Y. and Y.N. Patt, Alternative implementations of two-level adaptive branch prediction, in 25 Years of the International Symposia on Computer Architecture (selected papers). ACM, 1998.
38. Cormie, D., The ARM11 Microarchitecture. ARM Ltd., 2002.
39. Keltcher, C.N., et al., The AMD Opteron processor for multiprocessor servers. IEEE Micro, 2003, 23(2), p. 66-76.
40. Kalla, R., B. Sinharoy, and J.M. Tendler, IBM Power5 chip: a dual-core multithreaded processor. IEEE Micro, 2004, 24(2), p. 40-47.

41. Yeh, T.-Y. and Y.N. Patt, A comparison of dynamic branch predictors that use two levels of branch history. SIGARCH Computer Architecture News, 1993, 21(2), p. 257-266.
42. Calder, B., et al., Evidence-based static branch prediction using machine learning. ACM Transactions on Programming Languages and Systems, 1997, 19(1), p. 188-222.
43. Lee, J.K.F. and A.J. Smith, Branch prediction strategies and branch target buffer design. IEEE Computer, 1984, 17(1), p. 6-22.
44. Cheng, C.-C., The Schemes and Performances of Dynamic Branch Predictors.
45. Tendler, J.M., J.S. Dodson, J.S. Fields, Jr., H. Le, and B. Sinharoy, POWER4 system microarchitecture. IBM Journal of Research and Development, 2002, 46(1).


APPENDIX

Alpha 21264 (DEC, 2000)
- ISA: Alpha
- Clock frequency: 1.25 GHz
- Applications: servers, math-intensive and scientific tasks
- Transistors: 15.2 million
- Technology: 0.35 µm
- Address bus: 64-bit virtual
- Microarchitecture: multithreading, on-chip caches
- Instruction issue: superscalar, out-of-order (IPC = 4)
- Branch prediction scheme: tournament: local history table + global history table
- Branch prediction accuracy: 90-100%
- Branch misprediction penalty: minimum 7, average 11 cycles
- Prediction hardware complexity (approx.): ~30 KB

Pentium 4 (Intel, 2000)
- ISA: IA-32, x86-32
- Clock frequency: 1.3-3.8 GHz
- Applications: desktops, laptops
- Transistors: 55 million
- Technology: 0.18 µm to 65 nm
- Address bus: 64-bit virtual
- Microarchitecture: Hyper-Threading, execution trace cache, rapid execution engine
- Instruction issue: superscalar, out-of-order
- Branch prediction scheme: hybrid: dynamic via branch target buffer + static branch prediction
- Branch prediction accuracy: 94%
- Branch misprediction penalty: minimum 19, average 20 cycles
- Prediction hardware complexity (approx.): ~8 KB

Itanium 2 (Intel, 2002)
- ISA: IA-64, x86-64
- Clock frequency: 733 MHz-1.67 GHz
- Applications: enterprise servers and HPC systems
- Transistors: 2 billion
- Technology: 65 nm
- Address bus: 64-bit virtual
- Microarchitecture: on-chip L1, L2 caches
- Instruction issue: superscalar, out-of-order (IPC = 6)
- Branch prediction scheme: two levels of branch history storage; first level coupled with the L1 instruction cache, second level coupled with the L2 cache
- Branch prediction accuracy: NA
- Branch misprediction penalty: average 6 cycles
- Prediction hardware complexity (approx.): ~128 KB

ARM 11 (ARM, 2002)
- ISA: ARM v6
- Clock frequency: 1 GHz
- Applications: embedded systems, phones, etc.
- Transistors: NA
- Technology: 0.13 µm
- Address bus: 64-bit virtual
- Microarchitecture: low-power core
- Instruction issue: scalar, out-of-order
- Branch prediction scheme: hybrid: history table + static branch prediction (BTFNT)
- Branch prediction accuracy: 85%
- Branch misprediction penalty: NA
- Prediction hardware complexity (approx.): NA

Opteron (AMD, 2003)
- ISA: x86-64
- Clock frequency: 1.4-3.2 GHz
- Applications: servers, desktops
- Transistors: 233.2 million
- Technology: 0.13 µm to 45 nm
- Address bus: 64-bit virtual
- Microarchitecture: on-chip DDR memory controller, 3 HyperTransport links, on-chip L1, L2 caches
- Instruction issue: superscalar, out-of-order
- Branch prediction scheme: hybrid: branch target address calculator + return address stack + static branch prediction + history table (2-bit counters)
- Branch prediction accuracy: 80-85%
- Branch misprediction penalty: 11 cycles
- Prediction hardware complexity (approx.): NA

Power 5 (IBM, 2005)
- ISA: PowerPC 2.00
- Clock frequency: 1.65 GHz
- Applications: storage servers, high-end printers
- Transistors: 276 million
- Technology: 30 nm
- Address bus: 64-bit virtual
- Microarchitecture: dual-core, simultaneous multithreading, on-chip L1, L2 caches
- Instruction issue: superscalar, out-of-order
- Branch prediction scheme: 2 bimodal BHTs + 1 path-correlated BHT; capable of predicting 8 branches at a time
- Branch prediction accuracy: NA
- Branch misprediction penalty: minimum 12 cycles
- Prediction hardware complexity (approx.): ~160 KB

Comparison of branch predictors in modern processors: this appendix summarizes the key features of the different microprocessors and the branch prediction techniques used therein. More information on the various processors can be found in [28, 35, 36, 38-40, 45]. Please refer to section 4.2 for a detailed discussion.

