IEEE TRANSACTIONS ON COMPUTERS, VOL. 55, NO. 9, SEPTEMBER 2006

Throttling-Based Resource Management in High Performance Multithreaded Architectures

Seong-Won Lee and Jean-Luc Gaudiot, Fellow, IEEE

Abstract—Until now, the power problems caused by the huge amount of hardware resources present in modern systems have not been a primary concern. More recently, however, power consumption has begun limiting the number of resources which can be safely integrated into a single package, lest the heat dissipation exceed physical limits (short of actual package meltdown). At the same time, new architectural techniques such as Simultaneous MultiThreading (SMT), whose goal is to efficiently use the resources of a superscalar machine without introducing excessive additional control overhead, have appeared on the scene. In this paper, we present a new resource management scheme which enables an efficient low power mode in SMT architectures. The proposed scheme is based on a modified pipeline throttling technique which introduces a throttling point at the last stage of the processor pipeline in order to reduce power consumption. We demonstrate that resource utilization plays an important role in efficient power management and that our strategy can significantly improve performance in the power-saving mode. Since the proposed resource management scheme tests the processor condition cycle by cycle, we evaluate its performance by setting a target IPC as a form of immediate power measure. Our analysis shows that an SMT processor with our dynamic resource management scheme can yield significantly higher overall performance.

Index Terms—Resource management, power management, multithreading, throttling, resource utilization.

1 INTRODUCTION

Emerging new fabrication technologies will make billion-transistor chips feasible in the near future. Research has typically focused on obtaining the maximum performance from this tremendous amount of available hardware resources [1], [2]. Simultaneous MultiThreading (SMT) has consequently emerged as one of the most efficient architectures in terms of resource utilization [3]. However, at the same time, having a billion transistors in a processor chip will cause a critical side effect: extremely high power consumption. For instance, the power consumption of the 42-million-transistor Intel Pentium-4 processor [4] can be as high as 90W at 2.8GHz [5]. Projecting into the future, a billion-transistor microprocessor running at 10GHz, even with a technology consuming one-tenth the power of an equivalent current technology, would need about 800W. Such high power consumption could very well turn out to be the limiting factor in the design of future high performance computers, for a number of related reasons [6]. First, supplying power is becoming more challenging because it is difficult for the power lines in a silicon die to carry the currents which are needed. Second, if the power consumption exceeds the capacity of cooling devices such as heat sinks and fans [5], the excess heat can physically damage the circuitry or the package of the chip.
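As a rough sanity check of the 800W projection (assuming, purely for illustration, that dynamic power scales linearly with both transistor count and clock frequency), scaling the Pentium-4 figures quoted above gives:

```latex
P \approx 90\,\mathrm{W}
    \times \underbrace{\frac{10^{9}}{42\times 10^{6}}}_{\text{transistor ratio}}
    \times \underbrace{\frac{10\,\mathrm{GHz}}{2.8\,\mathrm{GHz}}}_{\text{frequency ratio}}
    \times \underbrace{0.1}_{\text{technology factor}}
  \approx 765\,\mathrm{W} \approx 800\,\mathrm{W}.
```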

. S.-W. Lee is with the Department of Computer Engineering, Kwangwoon University, Seoul, Korea. E-mail: [email protected].
. J.-L. Gaudiot is with the Department of Electrical Engineering and Computer Science, The Henry Samueli School of Engineering, University of California, Irvine, CA 92697-2625. E-mail: [email protected].

Manuscript received 8 Mar. 2005; revised 1 Mar. 2006; accepted 9 Mar. 2006; published online 20 July 2006.
For information on obtaining reprints of this article, please send e-mail to: [email protected], and reference IEEECS Log Number TC-0071-0305.

0018-9340/06/$20.00 © 2006 IEEE

Seng et al. [7] showed how an SMT architecture consumes less energy per instruction (as measured by the ratio of the cumulative power consumption over the number of committed instructions) than a conventional superscalar architecture would. However, a low amount of energy per useful instruction does not necessarily mean that SMT architectures are low power architectures. On the contrary, an SMT architecture often consumes more overall power because it yields a higher Instructions Per Cycle (IPC) ratio than an equivalent conventional superscalar architecture. For example, according to Seng et al. [7], a four-thread SMT processor can deliver an IPC 1.9 times higher while still reducing the Energy Per Instruction (EPI) ratio by as much as 78 percent (as compared to a superscalar processor). Yet, it actually consumes 1.5 times more Energy Per Cycle (EPI × IPC) on average.

A power management scheme normally consists of a low power mode and a control mechanism such as an activation/deactivation policy. Conventional power management schemes which use a suspend mode or a sleep mode [4], [8] are designed to protect against the occasional worst case. This means that they usually incur large response times to switch from the normal operation mode to the low power mode. It also means that extremely low performance in the low power mode is deemed acceptable. On the other hand, since we want to minimize the performance degradation due to the frequent activation of the low power mode in high performance SMT processors, computing performance must remain high even in the low power mode.

One possible low power scheme for SMT architectures is fetch throttling, as described by Sanchez et al. [9]. This scheme reduces power consumption by periodically holding instruction fetches for several clock cycles, which immediately reduces the number of instructions in the processor. Fetch throttling also reduces the


power wasted due to instructions that are speculatively executed but later found to be on the wrong path [10]. Incidentally, it should be noted that throttling techniques are easy to implement in general, and in SMT architectures in particular. However, even though fetch throttling works not only with superscalar architectures but also with SMT architectures [7], its effectiveness is reduced in SMT architectures because the number of speculatively executed instructions is lower with SMT architectures than with superscalar architectures.

In this paper, we develop a new resource management scheme based on throttling techniques. It is meant to further improve the performance of SMT architectures when the power management of the processor is activated. Our new scheme maintains high resource utilization in the low power mode, whereas the conventional throttling scheme reduces the number of in-flight instructions and the amount of speculative execution for an equivalent level of computing performance. We investigate the diminishing contribution of speculative execution to the performance of SMT architectures and analyze the effects of resource utilization and speculative execution in throttling-based power management schemes. Based on this analysis, we also develop a novel mechanism which controls the degree of throttling and the activation of our low power scheme in SMT architectures. Since the proposed mechanism tests the processor condition cycle by cycle, we evaluate its performance by setting a target IPC as a form of immediate power measure. We demonstrate that our proposed resource management scheme for a throttling-based low power mode can significantly improve performance over conventional power management schemes.

Section 2 describes previous research on Simultaneous MultiThreading and low power designs. We then introduce our new throttling scheme in Section 3.
Our methodology and simulation environment are presented in Section 4. Section 5 includes our simulation results and the details of our analysis of throttling-based power management schemes. Finally, we summarize our observations in Section 6.

2 BACKGROUND AND RELATED WORK

Let us now describe previous research on Simultaneous MultiThreading and its power characteristics, power management schemes, and the relationship between fetch throttling and speculative execution.

2.1 Simultaneous Multithreaded Architectures

Simultaneous MultiThreading [11] is a hardware approach to supporting multithreaded operation in superscalar architectures. Indeed, recall that the aim of Simultaneous MultiThreading is to use resources more efficiently by exploiting the parallelism inherent at the application level. The resources are used more efficiently because they are shared by multiple threads or allocated to a single thread as the need arises. Thus, whenever a functional unit becomes available, it can be scheduled to execute instructions from any ready thread. Simultaneous MultiThreading can be easily built “on top of” any modern superscalar architecture [12]. Instructions issued from several independent threads have no data or control dependences across threads, so an SMT processor can safely issue them simultaneously. This means


that the instructions that have no intrathread dependence (Instruction Level Parallelism) and the collection of available instructions over all the threads (Thread Level Parallelism) can be combined and issued to the functional units. This will obviously result in a better utilization of hardware resources since the two forms of parallelism (ILP and TLP) are exploited instead of one only (ILP), as in a conventional superscalar processor. The efficient resource utilization of Simultaneous MultiThreading [11] should lead to better performance per unit of power consumption. Indeed, using area-based power estimation, Seng et al. [7] have shown that an SMT processor spends less energy per useful instruction than a conventional superscalar processor. They evoked the possibility of using SMT architectures as high performance microprocessors because of their efficient management of power consumption.

2.2 Design for Low Power Consumption

The power consumption of a processor can be decomposed into dynamic power consumption and static power consumption. The dynamic power consumption is caused by the various switching activities in the processor. The static power, which is caused by gate oxide leakage currents and subthreshold leakage currents [13], is independent of the switching activity in the processor.

2.2.1 Static Power Consumption

With the advent of submicron fabrication technology, the supply voltage can be reduced. However, with a lower supply voltage, the switching of transistors also slows down. In order to restore the speed of transistors, the threshold voltage (the minimum voltage needed to turn on the transistor) must thus be lowered. On the other hand, lowering the threshold voltage exponentially increases the subthreshold leakage current [14]. Furthermore, the thinner gate oxide of submicron technology also contributes to the increase in static power consumption [13]. Leakage is thus one of the major static current issues in future processors. However, techniques to reduce leakage current are tied to fabrication technology (or device technology), such as Silicon on Insulator (SOI) [15]. There have been many studies on reducing static power consumption, ranging from fabrication technology to circuit technology [16]. Consequently, we consider techniques to reduce static power consumption to be beyond the scope of this paper. Instead, we concentrate on dynamic currents, since architectural techniques are independent of device technology (for the simple reason that architectural techniques can be implemented on top of any low leakage device technology). For a given processor circuitry, the transition to finer design rules reduces the overall power consumption.
However, it should be noted that deep submicron technology obviously does not solve all problems [17]: for one thing, it encourages the addition of more functionality, which correspondingly increases the total power consumption (in other words, smaller design rules decrease the power only so long as the number of transistors remains the same).

2.2.2 Dynamic Power Consumption

In the recent past, when power consumption has been taken into consideration in circuit and architectural techniques, not much regard has been given to the consequent


performance degradation [18], [19], [20]. However, once the design of a processor has been optimized for speed, any attempt at reducing power consumption will somehow degrade its performance. Some techniques, such as clock gating [21], can reduce power consumption and yet result in only a comparatively small performance degradation. However, the power saving effect of clock gating is relatively small (8 percent of the total power saving, as shown in [19]), which means that clock gating should be applied in conjunction with other low power techniques. Burd et al. [22] proposed reducing power consumption while maintaining a high IPC by dynamically reducing the supply voltage and clock frequency of an ARM processor. However, unlike low power processors, which have a relatively small die size and are used in small systems, a billion-transistor microprocessor will have many system-related modules to which continuous voltage/frequency scaling is not applicable. Instead, several steps of voltage and frequency combinations are often used, as in Intel's SpeedStep¹ and AMD's PowerNow!².

Introducing a throttling point in a processor pipeline by controlling the amount of resources can also reduce power consumption. One such low power scheme is fetch throttling (also called instruction cache throttling), which was first introduced in the PowerPC processor [9]. Then, Manne et al. [10] indirectly uncovered the effect of speculative execution on fetch throttling in their fetch mechanism called pipeline gating. They investigated the power wasted due to speculative execution and introduced a fetch mechanism which blocks the fetch stage in order to suppress speculative execution as the probability of misprediction increases. According to their research, speculative execution is one of the main reasons for which power is wasted, because instructions executed on mispredicted branches are eventually discarded, but only after consuming some power. Indeed, while the fetch stage is throttled, instructions fetched before throttling began are executed and soon generate results. The result of a branch instruction can be fed back to the fetch stage before instructions on the speculative path are fetched. Thus, the number of instructions on speculative paths and their subsets (such as mispredicted branches) is reduced.

Besides showing how efficient SMT is in terms of power consumption, Seng et al. [7] also evaluated several schemes to improve the EPI of SMT architectures. These include turning off speculative execution and reducing the fetch bandwidth. However, those variations of fetch throttling focus only on the power wasted due to speculation and show only marginal improvement in the multithreaded environment of SMT [7]. Some methods to reduce EPI actually result in improved utilization. These techniques usually target the register file and the reorder buffer, which are the most power consuming memory units of the processor. The accesses to those memory units are buffered by means of small peripheral register files [23] or postponed until they become inevitable by expanding the register map table into the reorder buffer [24]. These techniques help reduce the number of ports in the register file and reduce power consumption because they wait until enough register file accesses can be batched. In turn, this improves resource utilization.

1. SpeedStep is a registered trademark of Intel Corporation.
2. PowerNow! is a registered trademark of Advanced Micro Devices, Inc.

3 A NEW THROTTLING SCHEME

As we have alluded to earlier, throttling is a well-known technique to reduce the activity in the pipeline and correspondingly reduce power consumption. Fetch throttling has already been studied and is actually used in many power-aware designs [7], [10] as well as in commercial microprocessors [9]. Clock gating reduces the dynamic power consumption of a module by cutting the clock signal off whenever the module is not in operation. If a clock-gated module consists of multiple replicated subblocks of processing elements, with no subblock used for common processing, its power consumption will be exactly proportional to its computing performance (as defined by a measure such as IPC). However, there are many modules in a typical processor to which applying clock gating may be difficult. For instance, the instruction issue window must always be completely scanned, regardless of how many instructions will ultimately be issued. Clock gating is also not applicable to memory modules such as register files. Although some register file architectures, such as multiple-banked register files [25], could be used with clock gating and have a lower total power consumption, the high parallelism available in SMT processors might easily activate all banks in parallel. Since many modules in modern microprocessors are made of memory units, only a limited portion of the total power consumption can thus be saved by clock gating [19]. If a module simultaneously processes a higher average number of instructions (i.e., high utilization), performance can increase while only slightly increasing the total power consumption (i.e., more efficient power consumption). A conventional throttling scheme such as fetch throttling can lower power consumption by reducing the number of instructions executed on mispredicted branches. Since fetch throttling has only a limited effect on resource utilization, we propose a different throttling point.
Let us now consider the three possible pipeline throttling schemes: Fetch Throttling (FT) [8], Issue Throttling (IT), and Commit Throttling (CT). Fetch Throttling is the conventional throttling power management scheme discussed above. We propose here Commit Throttling, a power management scheme applied at the commit stage. We also examine Issue Throttling to investigate the effect of applying control between the fetch stage and the commit stage. The throttling points of these three schemes on a simple processor architecture are shown in Fig. 1. Freezing a whole stage would mean not only that no instructions are fed into the stage, but also that instructions in the stage do not proceed to the next stage. As seen in the figure, however, a throttling scheme does not actually freeze a whole stage, but only prevents the instructions it contains from proceeding to the next stage. FT, IT, and CT correspond, respectively, to blocking fetching, issuing, and committing. These three schemes represent different combinations of speculation and utilization in SMT architectures. In throttling-based power management, the throttling stage is sometimes blocked and sometimes operated. We define a throttling cycle as consisting of several inoperative clock


Fig. 1. Throttling points in the processor architecture.

cycles and a single normal operating clock cycle. The number of inoperative clock cycles is defined as the throttling value. Once the throttling strategy is activated, a throttling cycle is repeated until the throttling value is changed. Fig. 2 illustrates how throttling works in processor pipelines. In the figure, the circled numbers represent individual instructions, while the lines show dependences between instructions. Two different threads are used in the example: the numbers in the white circles are from one thread, while the numbers in the gray circles are from the other. Boxes represent pipeline stages, and the pipeline stages that consume power are highlighted in gray. The throttling values in all the examples of Fig. 2 are set to 4. In Fig. 2a, Fetch Throttling is illustrated. The processor fetches four instructions at cycle N, while the fetch operation is blocked at cycles N+1, N+2, and N+3. That is, the processor fetches instructions every four cycles. It should be noted that the sequence is repeated after cycle N. Issue Throttling is shown in Fig. 2b. In the case of Issue Throttling, the pipeline state repeats every 12 cycles; the figure shows that Issue Throttling can significantly lower the IPC when compared to the other throttling schemes. The low utilization of Issue Throttling comes from another factor, speculation, a point which will be discussed at length in another section. Finally, Fig. 2c illustrates the pipeline activities during Commit Throttling. The processor commits four instructions every four cycles from cycle N on. The other two throttling schemes in the figure have the same throughput of one instruction per cycle. As stated earlier, dynamic power consumption is caused only by the switching activities of the logic gates in the processor. In other words, if the input does not change, the state of the logic remains still; there is thus no switching activity and no power consumption. With Fetch Throttling, more

Fig. 2. Throttling in the processor architecture. (a) Fetch throttling. (b) Issue throttling. (c) Commit throttling.

pipeline stages are concurrently in operation because some instructions rely on the results of other instructions. In Commit Throttling, there are many instructions in the processor pipeline. They are, however, mostly dormant and consume no power because there is nowhere for them to proceed. Gray boxes in the figure represent active pipeline stages that consume power. As shown in the figure, Commit Throttling has fewer gray boxes than Fetch Throttling. This means that, under Fetch Throttling, more pipeline stages consume power, even though, under Commit Throttling, the processor has many more instructions in the pipeline. Implementation of all three throttling schemes is relatively simple. Since SMT is essentially a superscalar

architecture, instructions (or operations) in a pipeline stage can only advance to the next pipeline stage if there are available slots in the next stage. Throttling can be easily implemented by giving a simple “False” signal to inform the previous stage that there is no available slot in the next stage. We can also control the throttling rate by turning on and off the simple “False” signal logic. Thus, the implementation of throttling schemes in SMT does not significantly impact the hardware complexity. In SMT, the consequences of Commit Throttling are more complicated due to the existence of the reorder buffer: Clogging occurs when there are no available slots in the reorder buffer. If the reorder buffer is full, it stalls the processor pipeline from the issue stage down. Since the reorder buffer absorbs the throttling effect of the commit stage, the pipeline is clogged when too many instructions are executed in the processor—that is, when power is close to its peak. Once Commit Throttling reaches the point that restricts the processor’s throughput, the reorder buffer could easily overflow with instructions to be committed, thereby clogging the processor pipeline.
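As a concrete illustration of this stall-signal mechanism, the following toy model runs a three-stage pipeline in which throttling is implemented purely by a stage refusing to forward its instructions, exactly as if the next stage had reported "no free slot." This is a simplified sketch, not the authors' simulator; the three-stage structure and all names are illustrative, and, following the Fig. 2 examples, a throttling value of v is taken to mean one operating cycle in every v.

```python
# Toy model of pipeline throttling via the "no free slot" stall signal
# described above.  Illustrative only; not the authors' SimpleSMT simulator.

class Stage:
    def __init__(self, name, capacity):
        self.name = name
        self.capacity = capacity
        self.insts = []                     # instructions held in this stage

    def has_slot(self):
        return len(self.insts) < self.capacity

def tick(stages, throttle_stage, value, cycle, fetch_queue):
    """Advance one clock cycle (oldest stage first) and return commit count.

    A throttled stage simply does not forward its instructions, as if the
    next stage had asserted "no free slot".  Per the Fig. 2 examples, a
    throttling value v gives one operating cycle in every v.
    """
    inoperative = cycle % value != 0
    committed = 0
    for i in range(len(stages) - 1, -1, -1):
        stage = stages[i]
        throttled = stage.name == throttle_stage and inoperative
        if i == len(stages) - 1:            # last stage: commit (retire)
            if not throttled:
                committed += len(stage.insts)
                stage.insts = []
        elif not throttled:                 # forward while slots remain
            nxt = stages[i + 1]
            while stage.insts and nxt.has_slot():
                nxt.insts.append(stage.insts.pop(0))
    while stages[0].has_slot() and fetch_queue:   # fetch new instructions
        stages[0].insts.append(fetch_queue.pop(0))
    return committed
```

With four-wide stages and Commit Throttling at value 4, this model settles into four commits every four cycles, i.e., the steady-state throughput of one instruction per cycle shown in Fig. 2c.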

4 EVALUATION METHODOLOGY

In this research, a number of experiments with various resource management schemes have been carried out by behavioral-level simulation. Our simulation environment has been developed to simulate multiple threads on SMT architectures, with the capability to estimate power based on the frequency of module activity. The benchmark programs and simulation configurations are also described in this section.

4.1 The Simulation Environment

We simulate the proposed algorithm with SimpleSMT, the performance simulator of our ALPSS framework [26]. SimpleSMT is based on SimpleScalar [27], a widely used simulator toolkit mainly designed for academic and educational purposes. Much research, including the development of architecture-level power simulators [28], [29], has been carried out with SimpleScalar-based simulation environments. Most units in a superscalar processor have to be slightly modified and extended to accommodate the multiple threads of an SMT processor. Units which are not shared by the threads, such as the program counters, are simply replicated. Modifications from the base SimpleScalar architecture include the addition of a separate integer window, a floating-point window, and a reorder buffer. Other detailed features are based on the SMT architecture published by Tullsen et al. [12] and described in [20], [26]. The other portion of ALPSS is a power estimator based on Wattch [28]. Since the power estimation models in ALPSS are analytic models for the corresponding modules, they cannot appropriately measure the effect of resource utilization, which would require a description of the modules at a lower level. Instead, we introduce a new measure, Instructions Per Active Cycle (IPAC), which represents the degree of resource utilization. IPAC indicates the average number of instructions processed concurrently in a module or a pipeline stage, but averaged only over the cycles during which the module operates, i.e., the cycles


during which the module consumes power. For instance, if the issue stage simultaneously issues four instructions at one cycle and does not issue anything during the next four cycles, the IPAC for those five cycles is four. Let us consider for a moment the following two equations:

$$P = \sum_i a_i C_i V^2 f, \qquad (1)$$

where $i$ ranges over the modules in a processor, $a_i$ is the lumped activity factor for module $i$, $C_i$ the lumped capacitance of module $i$, $V$ the voltage, and $f$ the frequency, and

$$P = V^2 f \sum_i \bigl( \alpha_i \gamma_i C_i + \beta_i (1 - \gamma_i) C_i \bigr), \qquad (2)$$

where $\alpha_i$ is the activity factor for the part of module $i$ whose switching is proportional to the operation count, $\beta_i$ is the activity factor of the nonproportional part of module $i$, and $\gamma_i$ is the ratio of the capacitance of the proportional part to the total module capacitance.

The power consumption of a processor can be roughly estimated by (1); the number of operations in module $i$ determines $a_i$. The power estimation methods of both Wattch and ALPSS are also based on (1). We can rewrite a more realistic power model as (2) by dividing each module into a part whose power consumption is proportional to the number of operations processed in the module and a part whose power consumption remains constant regardless of the number of operations. In (2), the activity factor $\alpha_i$ is proportional to the operation count, while the other activity factor $\beta_i$ remains constant. If there is no part commonly used for multiple operations (i.e., $\gamma_i = 1$), the power consumption of the module is proportional to the number of operations in the module. If a part provides a common intermediate value for multiple operations (i.e., $\gamma_i < 1$), processing additional operations in parallel (i.e., higher IPAC or higher utilization) has less effect on the power consumption. Since $\beta_i$ and $\gamma_i$ can only be accurately determined after the physical implementation of the module has been completed, and since their values can greatly differ from one module to the next, we prefer to focus in this paper on implementation-independent factors such as utilization (i.e., IPAC in this paper).
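The IPAC measure and the per-module power model of (2) can be written down directly from their definitions. The sketch below is illustrative only: as noted above, real values of the activity factors and the capacitance split depend on the physical implementation, so any constants passed in would be assumptions.

```python
# Sketch of the IPAC measure and the per-module power model of (2).
# Illustrative only; alpha/beta/gamma are implementation-dependent.

def ipac(insts_per_cycle):
    """Instructions Per Active Cycle: instructions processed, averaged
    only over the cycles in which the module was active (consumed power)."""
    active_cycles = sum(1 for n in insts_per_cycle if n > 0)
    return sum(insts_per_cycle) / active_cycles if active_cycles else 0.0

def module_power(alpha, beta, gamma, C, V, f):
    """Dynamic power of one module per (2): a part proportional to the
    operation count (activity alpha, capacitance fraction gamma) plus a
    part that switches regardless of the operation count (activity beta)."""
    return V**2 * f * (alpha * gamma * C + beta * (1.0 - gamma) * C)
```

For the issue-stage example above, `ipac([4, 0, 0, 0, 0])` evaluates to 4: four instructions over a single active cycle, even though five cycles elapsed.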

4.2 Benchmark Selection

An important part of formulating a reasonable and reliable simulation is the careful selection of benchmark suites. We feel that SPEC CPU2000 is the most appropriate because it consists of applications that significantly exercise hardware resources [30]. Thus, all the benchmarks we use are selected from SPEC CPU2000, and we simulate them with the reference input set (out of the three SPEC input sets: test, train, and reference) because it is designed to evaluate overall performance. The benchmarks used in our simulations are Alpha EV6 binaries compiled on Digital Unix V4.0F. As is customary, we ignore the first 300 million instructions per thread in order to avoid overlapping initialization routines and to wait until the caches are saturated and have reached steady state. We then gather results from the next 300 million instructions per thread. For instance, in


TABLE 1 Benchmark Programs

a simulation of four threads, this would mean the simulated execution of 1.2 billion instructions. Since each thread in an SMT processor is independent of the others, a combination of several single-thread versions of the programs is used for our simulations. We chose five integer and five floating-point benchmark programs from the SPEC CPU2000 benchmark suite based on their simulation times and contents, so as to represent various application environments. These 10 benchmark programs, which have short initialization times [26], were selected in order to keep simulation times within reasonable bounds. In order to account for different program behaviors between different combinations of programs, eight subsets of four threads out of the 10 programs are classified by performance and by the number of floating-point instructions. Four subsets consist of groups of programs with the highest IPC, the lowest IPC, medium IPC, and a mix of the highest and lowest IPC; the four other subsets are formed in the same manner based on the number of floating-point instructions. These combinations are shown in Table 1. Two-thread subsets consist of the first two programs of each group, and four-thread subsets include all four programs.
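The per-thread fast-forwarding discipline described above (skip the first 300 million instructions of each thread, then measure the next 300 million) can be sketched as follows. Here `step` is a hypothetical stand-in for the simulator's commit logic that simply reports which thread committed an instruction; it is not part of SimpleSMT's actual interface.

```python
# Sketch of the per-thread fast-forward / measurement-window discipline
# described above.  `step` is a hypothetical simulator hook returning the
# thread id of the instruction committed at each step.

SKIP = 300_000_000      # instructions to fast-forward per thread
MEASURE = 300_000_000   # instructions to measure per thread

def run(step, num_threads, skip=SKIP, measure=MEASURE):
    committed = [0] * num_threads   # total committed per thread
    measured = [0] * num_threads    # committed inside the measurement window
    while any(m < measure for m in measured):
        tid = step()
        committed[tid] += 1
        if committed[tid] > skip and measured[tid] < measure:
            measured[tid] += 1      # statistics would be gathered here
    return measured
```

With four threads and the default window, this yields the 1.2 billion measured instructions mentioned above, on top of the skipped warm-up.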

4.3 Simulator Configurations

The overall configuration of the SMT processor used in all our simulation runs is shown in Table 2. This configuration is designed to have an amount of resources roughly equivalent to that of a commercially designed SMT processor, the Alpha EV8 [32] (even though it was cancelled, we selected it for its SMT-like features). The architectural parameters in the configuration may differ slightly because of architectural variations between SimpleScalar and Alpha. All functional units are fully pipelined. The fetch policy is ICOUNT [12], which fetches eight instructions from up to two threads. The branch prediction scheme used in our simulator is the combined branch prediction of a 4K bimodal and a (1, 1,024, 10, 0) gshare predictor with a 4K choice predictor [33], [34].


TABLE 2 Default Simulation Configuration

5 EVALUATION OF PERFORMANCE AND POWER CONSUMPTION

Our simulation consists of the following steps: First, single thread and four-thread benchmarks are simulated and compared to investigate the effect of speculative execution in multithreaded environments. Then, pipeline throttling on different pipeline stages is simulated in order to analyze and evaluate the proposed low power mode. Finally, we simulate a new control mechanism for the proposed resource management scheme.

5.1 Speculation in Multithreaded Environments

Table 3 shows the dynamic characteristics of the 10 single-thread benchmarks and the eight four-thread benchmarks simulated on the same architecture configuration as in Table 2. The table includes Instructions Per Cycle (IPC), Branches Per Instruction (BPI), and the MisPrediction Rate (MPR). MPR is introduced so as to show the effect of speculative execution. It is the ratio of the total number of instructions in mispredicted branches to the total number of instructions committed; in other words, MPR measures how many instructions are executed and then discarded due to misprediction in the course of program execution. In single-thread execution, the average number of instructions on the mispredicted path is 27.58 percent of the total number of instructions committed. In the case of a four-thread SMT architecture, the highest is 15.27 percent and the average is 6.78 percent. This shows that Simultaneous MultiThreading does not inherently rely on speculative execution for its performance as much as a superscalar does. Therefore, power saving or power management techniques which target speculative execution


TABLE 3 Characteristics of Benchmark Programs

may not be as effective with SMT architectures as they would be with superscalar architectures. In our experiments on pipeline throttling, we investigate this problem both quantitatively and qualitatively.
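The MPR definition above reduces to a ratio of two simulator counters. A minimal sketch, with hypothetical counter names standing in for the simulator's statistics:

```python
# MPR as defined above: instructions executed on mispredicted paths and
# later discarded, relative to instructions committed. The argument names
# are illustrative placeholders for simulator counters.

def misprediction_rate(discarded_wrongpath_insts, committed_insts):
    """Return MPR as a percentage of committed instructions."""
    if committed_insts == 0:
        return 0.0
    return 100.0 * discarded_wrongpath_insts / committed_insts

# Single-thread average reported above: wrong-path work is ~27.58% of
# committed work; with four threads the average drops to ~6.78%.
single_thread_mpr = misprediction_rate(2758, 10000)   # 27.58
```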

5.2 Speculation and Utilization in Low Power Modes

In this subsection, the dynamic characteristics of the throttling schemes are evaluated on an SMT architecture. To improve the efficiency of power consumption, the Energy Per Instruction (EPI) should be minimized. Two factors related to EPI are speculation and utilization. With speculative execution, instructions beyond branch instructions are executed without a guarantee of being used. If a prediction fails, the speculatively executed instructions are discarded and the energy spent executing them is wasted. Thus, EPI improves if fewer instructions are executed speculatively. The other factor, utilization, is related to the average number of concurrently processed instructions per cycle at each module.

5.2.1 Performance of the Three Throttling Schemes

Fig. 3a shows the IPCs of all benchmarks. In all benchmarks, Commit Throttling has the best IPC, followed by Issue Throttling. Since throttling the fetch stage aggravates the fetch bottleneck [12], the IPC of Fetch Throttling drops very rapidly. While the benefit of multiple threads in SMT architectures diminishes with the throttling level under Fetch Throttling and Commit Throttling, the performance of Issue Throttling is still affected by the number of threads at a high throttling level. When a high throttling level introduces a bottleneck in the processor pipeline, the increased parallelism owing to multiple threads does not improve performance. Thus, the low MPR due to multiple threads can contribute to the performance of Issue Throttling.

5.2.2 Effect of Speculation in the Throttling Schemes

Fig. 3b shows the variance of the MPRs of the three throttling schemes. It demonstrates that Fetch Throttling significantly reduces speculative execution because a predicted branch is resolved soon enough for the processor pipeline not to be polluted by any instruction from a mispredicted branch.
Fig. 3. Performance of each throttling scheme. (a) IPCs. (b) MPRs.

However, the improvement diminishes as the throttling level rises. At the same time, Thread-Level Parallelism in a multithreaded environment attenuates the improvement even further. On the other hand, Issue Throttling actually makes MPR worse: the higher the throttling level, the more instructions are executed and then discarded due to branch misprediction. Issue Throttling delays the resolution of branch predictions because instructions are held in the issue stage without being executed. Therefore, the MPR of Issue Throttling is the highest among the three throttling schemes, as seen in Fig. 3b. Even though multiple threads running in an SMT architecture improve the MPR under Issue Throttling, the power wasted by speculative execution is much higher than with Fetch Throttling or Commit Throttling. In fact, the figure demonstrates that 10 to 30 percent of the total power consumption with Issue Throttling is spent on executing instructions from mispredicted branches.

The MPR of Commit Throttling is lower than that of Issue Throttling, but higher than that of Fetch Throttling. Since Commit Throttling lets the instructions execute, the

predicted branches are resolved and wrong-path execution is squashed while the commit operation is blocked. However, the fetch stage also keeps supplying new instructions along wrong paths, which increases MPR.
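The EPI argument at the start of this subsection can be made concrete with a toy model. We assume, as a deliberate simplification not made by the paper's Wattch/ALPSS models, that every executed instruction costs roughly the same energy; wrong-path work then inflates the energy charged to each committed instruction in proportion to MPR.

```python
# Illustrative only: how speculation inflates energy per instruction (EPI).
# Assumption (ours): each executed instruction costs about the same energy
# e_inst, so energy spent on discarded wrong-path instructions is pure waste.

def epi(e_inst, committed, wrongpath):
    """Energy per committed instruction when wrong-path work is included."""
    total_energy = e_inst * (committed + wrongpath)
    return total_energy / committed

base = epi(1.0, 1000, 0)     # no speculation waste: EPI = e_inst
spec = epi(1.0, 1000, 276)   # ~27.6% MPR inflates EPI by the same fraction
```

Under this model, the single-thread MPR of 27.58 percent reported in Section 5.1 translates directly into roughly 28 percent higher EPI, which is why reducing speculative execution improves energy efficiency.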

5.2.3 Resource Utilization of Each Throttling Scheme

Instructions Per Active Cycle (IPAC) here represents how efficiently the register file and the reorder buffer are accessed. The average number of instructions issued concurrently is shown in Fig. 4a. The graph represents how many operands are read from the register file and/or the reorder buffer and fed into the processor pipeline. Fig. 4b shows the average number of instructions committed at the same time. The commit operation accompanies the transfer of results from the reorder buffer into the register file. These two graphs illustrate how efficiently read/write operations are performed in terms of IPAC.

With Fetch Throttling, instructions are introduced into the processor pipeline only once in a while. Therefore, once instructions are fetched, they are executed faster in the less populated pipeline. In other words, this approach reduces instruction latency by devoting all the resources to a few instructions, which leaves many units idle. Fetch Throttling thus has poor utilization as far as accesses to the register file and the reorder buffer are concerned, and the utilization decreases as the throttling level increases. On the other hand, Issue and Commit Throttling do not resolve branch predictions as early as Fetch Throttling does. Instead, throttling the later stages of the pipeline clogs it: instructions pile up in the stages before the throttled stage because new instructions are still being fetched. Since only data transitions consume dynamic power in digital logic, the clogged instructions in the processor pipeline do not consume power [6]. When the throttling is released, each stage concurrently processes all the piled-up instructions. As shown in Fig. 4a, compared to Fetch Throttling, Issue Throttling significantly increases the number of operations in a single access to the register file and reorder buffer (i.e., utilization) at the issue stage and slightly increases the number at the commit stage. However, the high MPR of Issue Throttling cancels the benefit of better utilization of the issue stage. By moving the throttling point to the end of the pipeline (the commit stage), Commit Throttling increases the utilization of all the pipeline stages and also allows branch instructions to be resolved. According to Fig. 4b, the utilization at the commit stage yielded by Commit Throttling is higher than that given by Fetch Throttling or Issue Throttling. Another advantage of Commit Throttling is that it has less effect on fetch bandwidth than the other throttling schemes, which matters for SMT architectures with their significant fetch bottleneck [12]. Since instructions fill up the pipeline from its end while the commit stage is throttled, the fetch stage is the least affected stage under Commit Throttling.

These advantages of Commit Throttling contribute to its sustaining higher performance than the other schemes, as seen in Fig. 3a. Fig. 5 shows the energy per cycle and the energy per instruction of each throttling scheme. Four-thread benchmarks are used in the simulation. As described above, the power estimation models in Wattch and ALPSS generate only rough estimates from a lumped model; they are thus provided for reference only. However, one can still observe that Commit Throttling consumes the least energy per instruction. The technology parameters used in the simulation are a 600 MHz operating frequency and a 2.2V operating voltage.

5.2.4 Improving Resource Utilization

While Fetch Throttling can reduce power consumption thanks to its low MPR, Commit Throttling can lower power consumption by efficiently utilizing accesses to the register file and the reorder buffer. By combining these two throttling schemes, we obtain a new power management scheme which has a low MPR and good utilization of the register file and the reorder buffer at the same time. In this combined scheme, the same throttling level is applied to the fetch stage and the commit stage simultaneously. Across the throttling configurations tested, Combined Throttling yields MPR, IPC, and issue-stage IPAC very close to those of Fetch Throttling. The differences between Combined Throttling and Fetch Throttling are

Fig. 4. IPACs of throttling schemes. (a) At issue stage. (b) At commit stage.


Fig. 5. Energy per cycle and energy per instruction of throttling schemes.

0.04 percent in MPR, 1.28 percent in IPC, and 1.08 percent in issue-stage IPAC on average. While throttling the commit stage has virtually no effect on MPR and IPC, it improves the utilization of register file and reorder buffer accesses, as shown in Fig. 6.
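The throttling gate implied by these schemes can be sketched as a duty cycle: at throttling level t, a gated stage is stalled for t cycles and then allowed to operate for one cycle, and Combined Throttling applies the same gate to both the fetch and the commit stage. The 1/(t+1) duty cycle below is our reading of the mechanism, not a formula stated in the paper.

```python
# Sketch of a pipeline throttling gate: at level t, the gated stage is
# stalled for t cycles, then active for one cycle (duty cycle 1/(t+1)).
# This duty-cycle model is our interpretation, not the authors' RTL.

def stage_active(cycle, level):
    """True on cycles when a stage throttled at `level` may operate."""
    return level == 0 or cycle % (level + 1) == 0

def effective_bandwidth(width, level):
    """Upper bound on instructions per cycle through a throttled stage."""
    return width / (level + 1)

# An 8-wide commit stage throttled at level 3 commits on 1 cycle in 4,
# capping sustained throughput at 2 instructions per cycle.
cap = effective_bandwidth(8, 3)
```

Combined Throttling simply evaluates `stage_active` with the same level for both the fetch and commit stages, which is why its IPC tracks Fetch Throttling while its commit-stage IPAC tracks Commit Throttling.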

5.3 Throttling Schemes with Dynamic Power Management

In addition to the results of the proposed resource management scheme with a fixed throttling value, the proposed throttling-based schemes are also tested with variable throttling levels. Since conventional power management schemes do not switch the processor between normal and low power modes sufficiently fast, they have long response delays [35], which can allow the temperature to keep rising for a long while after an overheat condition has been detected. Similarly, a long period of operation in low power mode is required to return to a normal temperature. Therefore, the temperature under conventional power management schemes follows a typical "sawtooth" waveform of large amplitude [9]. Because of these long response delays, conventional power management schemes usually have a fixed configuration for the low power mode.

The purpose of our dynamic power management scheme with pipeline throttling is to keep the period of low power, low performance operation as short as possible so that a high target average performance can be maintained. One key factor of this dynamic power management approach is its fast response time: the low power mode can operate for as little as one or two clock cycles. Other considerations are ease of implementation and small size, because the power management unit itself should have a minimal impact on the total power consumption of the processor. The flowchart in Fig. 7a shows how our control mechanism works with a pipeline throttling scheme. IPC_m represents the measured IPC and IPC_t the target IPC. We use IPC as a pseudopower metric in this experiment because an on-die thermometer would be too slow to keep up with the proposed power management mechanism.
A more accurate measure of the instantaneous power consumption per cycle could be obtained with an operation counter for every module in the microprocessor, as in

Fig. 6. IPACs of throttling schemes at the commit stage.

(1) or (2). However, we can approximate the power consumption with the following equation:

P = a_t C_t V^2 f ≈ n_inst C_t V^2 f,   (3)
where a_t represents the lumped activity factor of the processor, C_t the total capacitance, and n_inst the instruction count. Hence, we use IPC, since it fairly reflects the total power consumption of the processor. In the power management scheme, the instructions executed are counted because power consumption is proportional to the number of instructions executed. For performance-critical missions, the number of instructions committed can also be used as a power measure. By keeping the IPC of the processor below a certain level, our dynamic power management can provide just enough performance for the target applications while preventing the processor from temporarily overheating.

When the mechanism is initialized, the throttling value is set to 0 and the state of the mechanism is set to "not in throttling cycle." Since the IPC of the processor is not high at first, the power management mechanism stays in loop 1, which represents the normal operation mode. Once the measured IPC exceeds the target IPC, the throttling value is increased and a throttling cycle starts in loop 2. After loop 2 has executed as many times as the throttling value specifies, the state of the mechanism is set to "end of throttling cycle." The processor is then set to its normal execution mode for one cycle, which ends the throttling cycle. When a throttling cycle ends, the IPC is checked against the target IPC. If it is still too high, a new throttling cycle begins with an increased throttling value so that the IPC decreases even further. Otherwise, the processor returns to its normal state. By gradually increasing the throttling value starting from a single cycle, our power management scheme can efficiently handle power consumption that is only slightly above the threshold.

In Fig. 7b, we present the effect of the proposed throttling-based power management scheme. Fetch Throttling, Commit Throttling, and Fetch & Commit Throttling are used for the low power mode.
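The control loop of Fig. 7a can be sketched behaviorally as follows. `measure_ipc` stands in for the processor's instruction counters, and the function itself is our reading of the flowchart, not the authors' hardware; the throttling value ramps up from a single gated cycle and resets once the measured IPC drops back to the target.

```python
# Behavioral sketch of the control mechanism in Fig. 7a (our reading).
# `measure_ipc(cycle)` is a placeholder for the processor's IPC counter.

def throttling_controller(measure_ipc, target_ipc, cycles):
    level = 0
    throttled = []              # True on cycles where the pipeline is gated
    cycle = 0
    while cycle < cycles:
        if measure_ipc(cycle) <= target_ipc:
            level = 0           # loop 1: normal operation, throttling reset
            throttled.append(False)
            cycle += 1
        else:
            level += 1          # start a throttling cycle (loop 2)
            for _ in range(level):      # gate the pipeline `level` cycles
                if cycle >= cycles:
                    break
                throttled.append(True)
                cycle += 1
            if cycle < cycles:  # one normal cycle ends the throttling cycle
                throttled.append(False)
                cycle += 1
    return throttled

# With a persistently high IPC the gated fraction grows as the level ramps:
# one gated cycle, then two, then three, each followed by one normal cycle.
trace = throttling_controller(lambda c: 4.0, target_ipc=2.0, cycles=12)
```

Because the loop re-checks the IPC after every throttling cycle, the low power mode can last as little as one or two clock cycles, which is the fast response the scheme relies on.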
The target IPC is set to 2. In order to compare power-saving effects, the number of committed instructions, rather than the number of executed instructions, is used


Fig. 7. Proposed control mechanism. (a) Flowchart. (b) Result with variable throttling level.

in calculating IPC_m. Note that all four-thread benchmarks are used in the simulations. It is hard to tell which of Fetch Throttling and Commit Throttling is the better power management scheme because the power consumption of a processor depends in large part on its circuit and architectural structures. As with a fixed throttling value, Fetch Throttling has a lower MPR, while Commit Throttling has better utilization as far as register file accesses are concerned. Fig. 7b shows that Fetch Throttling and Fetch & Commit Throttling cut the number of instructions wasted due to misprediction almost in half (to 55.7 percent and 57.2 percent, respectively). In the case of Fetch & Commit Throttling, IPAC at the commit stage is almost twice (200.8 percent) that of Fetch Throttling. This means that Fetch & Commit Throttling significantly improves on conventional Fetch Throttling.

6 CONCLUSIONS

In this paper, we have examined several throttling-based low power modes for SMT architectures. They are easy to implement, easy to control, and have extremely low response times. The simulation results show that conventional Fetch Throttling is not as effective in SMT architectures as it is in superscalar architectures. This means that Simultaneous MultiThreading is not as sensitive to the number of speculatively executed instructions because it fetches instructions from other threads rather than executing instructions down predicted branches. We have also shown that our proposed low power scheme, Commit Throttling, has better resource utilization than Fetch Throttling. We have further demonstrated that Commit Throttling, in conjunction with Fetch Throttling, reduces the power wasted on speculation and improves resource utilization at the same time.

The fast response time of the throttling-based power management scheme for SMT processors helps keep the power consumption of the microprocessor within a small range because the low power mode starts to operate within a few clock cycles of being needed. Our experiments show that our resource management scheme in an SMT processor works with a dynamic throttling level. Thus, the power management scheme described in this paper helps keep performance high with a conventional cooling solution. It can be advantageously applied to future SMT processors, which will obviously have more transistors, without imposing an excessive burden on the cooling system. Since our experiments do not cover all possible instruction mixes and since speculation and utilization are correlated, there remains a significant opportunity to introduce other power management schemes which would return fair performance with certain instruction mixes. In addition, we may need an adaptive mechanism using a detector thread [36] which applies a different combination of power management schemes in different situations in order to improve the performance of the proposed scheme even further.

ACKNOWLEDGMENTS

This work was partly supported by the Ministry of Information and Communication, Korea, under the ITRC program supervised by the IITA, and partly supported by the US National Science Foundation under Grant No. CCF-0541403. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the US National Science Foundation.

REFERENCES

[1] S. Eggers, J. Emer, H. Levy, J. Lo, R. Stamm, and D. Tullsen, "Simultaneous Multithreading: A Platform for Next-Generation Processors," IEEE Micro, pp. 12-18, Sept./Oct. 1997.
[2] L. Hammond, B. Nayfeh, and K. Olukotun, "A Single-Chip Multiprocessor," Computer, special issue on billion-transistor processors, vol. 30, no. 9, pp. 79-85, Sept. 1997.
[3] J. Lo, S. Eggers, J. Emer, H. Levy, R. Stamm, and D. Tullsen, "Converting Thread-Level Parallelism to Instruction-Level Parallelism via Simultaneous Multithreading," ACM Trans. Computer Systems, pp. 322-354, Aug. 1997.
[4] Intel Pentium 4 Processor Datasheet, Intel Corp., 2002.
[5] Intel Pentium 4 Processor Thermal Design Guidelines, Intel Corp., 2000.

[6] T. Mudge, "Power: A First Class Design Constraint for Future Architectures," Proc. Seventh Int'l Conf. High Performance Computing, pp. 215-224, Dec. 2000.
[7] J. Seng, D. Tullsen, and G. Cai, "Power-Sensitive Multithreaded Architecture," Proc. 2000 Int'l Conf. Computer Design, pp. 119-206, Sept. 2000.
[8] PowerPC: MPC750 RISC Microprocessor Technical Summary, Motorola Inc., 1997.
[9] H. Sanchez, B. Kuttanna, T. Olson, M. Alexander, G. Gerosa, R. Philip, and J. Alvarez, "Thermal Management System for High Performance PowerPC Microprocessors," Proc. 42nd IEEE Int'l Computer Conf., pp. 325-330, Feb. 1997.
[10] S. Manne, A. Klauser, and D. Grunwald, "Pipeline Gating: Speculation Control for Energy Reduction," Proc. 25th Ann. Int'l Symp. Computer Architecture, pp. 132-141, June 1998.
[11] D. Tullsen, S. Eggers, and H. Levy, "Simultaneous Multithreading: Maximizing On-Chip Parallelism," Proc. 22nd Ann. Int'l Symp. Computer Architecture, pp. 392-403, June 1995.
[12] D. Tullsen, S. Eggers, J. Emer, H. Levy, J. Lo, and R. Stamm, "Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor," Proc. 23rd Ann. Int'l Symp. Computer Architecture, pp. 191-202, May 1996.
[13] N. Kim, T. Austin, D. Blaauw, T. Mudge, K. Flautner, J. Hu, M. Irwin, M. Kandemir, and V. Narayanan, "Leakage Current: Moore's Law Meets Static Power," Computer, vol. 36, no. 12, pp. 65-77, Dec. 2003.
[14] R. Gonzalez, B. Gordon, and M. Horowitz, "Supply and Threshold Voltage Scaling for Low Power CMOS," IEEE J. Solid-State Circuits, vol. 32, no. 8, pp. 1210-1216, Aug. 1997.
[15] C. Chuang, P. Lu, and C. Anderson, "SOI for Digital CMOS VLSI: Design Considerations and Advances," Proc. IEEE, vol. 86, no. 4, pp. 689-720, Apr. 1998.
[16] J. Kao and A. Chandrakasan, "Dual-Threshold Voltage Techniques for Low-Power Digital Circuits," IEEE J. Solid-State Circuits, no. 7, pp. 1009-1018, July 2000.
[17] M. Flynn, P. Hung, and K. Rudd, "Deep-Submicron Microprocessor Design Issues," IEEE Micro, vol. 19, no. 4, pp. 11-22, July/Aug. 1999.
[18] R. Gonzalez and M. Horowitz, "Energy Dissipation in General Purpose Microprocessors," IEEE J. Solid-State Circuits, vol. 21, no. 9, pp. 1277-1284, Sept. 1996.
[19] J. Brennan, A. Dean, S. Kenyon, and S. Ventrone, "Low Power Methodology and Design Techniques for Processor Design," Proc. 1998 Int'l Symp. Low-Power Electronics and Design, pp. 268-273, Aug. 1998.
[20] S. Lee and J.-L. Gaudiot, "Clustered Microarchitecture Simultaneous Multithreading," Proc. Ninth Int'l Euro-Par Conf. (Euro-Par 2003), H. Kosch, L. Böszörményi, and H. Hellwagner, eds., pp. 576-585, Aug. 2003.
[21] M. Gowan, L. Biro, and D. Jackson, "Power Considerations in the Design of the Alpha 21264 Microprocessor," Proc. 35th Design Automation Conf., pp. 726-731, June 1998.
[22] T. Burd, T. Pering, A. Stratakos, and R. Brodersen, "A Dynamic Voltage Scaled Microprocessor System," IEEE J. Solid-State Circuits, pp. 1571-1580, Nov. 2000.
[23] N. Kim and T. Mudge, "Reducing Register Ports Using Delayed Write-Back Queues and Operand Pre-Fetch," Proc. 17th Ann. Int'l Conf. Supercomputing, pp. 172-182, June 2003.
[24] G. Savransky, R. Ronen, and A. Gonzalez, "Lazy Retirement: A Power Aware Register Management Mechanism," Proc. Workshop Complexity-Effective Design, 29th Int'l Symp. Computer Architecture, May 2002.
[25] J.-L. Cruz, A. Gonzalez, M. Valero, and N. Topham, "Multiple-Banked Register File Architectures," Proc. 27th Ann. Int'l Symp. Computer Architecture, pp. 316-325, June 2000.
[26] S. Lee and J.-L. Gaudiot, "ALPSS: Architectural Level Power Simulator for Simultaneous Multithreading, Version 1.0," Technical Report CENG-02-04, Univ. of Southern California, Apr. 2002.
[27] T. Austin, "The SimpleScalar Architectural Research Tool Set, Version 2.0," Technical Report 1342, Univ. of Wisconsin-Madison, June 1997.
[28] D. Brooks, V. Tiwari, and M. Martonosi, "Wattch: A Framework for Architectural-Level Power Analysis and Optimizations," Proc. 27th Ann. Int'l Symp. Computer Architecture, pp. 83-94, June 2000.
[29] G. Cai and C.-H. Lim, "Architectural Level Power/Performance Optimization and Dynamic Power Estimation," Cool Chips Tutorial, colocated with MICRO-32, Nov. 1999.


[30] J. Henning, "SPEC CPU2000: Measuring CPU Performance in the New Millennium," Computer, vol. 33, no. 7, pp. 28-35, July 2000.
[31] Alpha 21264 Microprocessor Hardware Reference Manual, Compaq Computer Corp., 1999.
[32] R. Preston, R. Badeau, D. Bailey, S. Bell, L. Biro, W. Bowhill, D. Dever, S. Felix, R. Gammack, V. Germini, M. Gowan, P. Gronowski, D. Jackson, S. Mehta, S. Morton, J. Pickholtz, M. Reilly, and M. Smith, "Design of an 8-Wide Superscalar RISC Microprocessor with Simultaneous Multithreading," Digest of Technical Papers, 2002 IEEE Int'l Solid-State Circuits Conf., pp. 334-335, Feb. 2002.
[33] S. McFarling, "Combining Branch Predictors," Technical Report WRL-TN-36, Digital Western Research Laboratory, June 1993.
[34] R. Kessler, "The Alpha 21264 Microprocessor," IEEE Micro, vol. 19, no. 2, pp. 24-36, Mar./Apr. 1999.
[35] D. Brooks and M. Martonosi, "Dynamic Thermal Management for High-Performance Microprocessors," Proc. Seventh Int'l Symp. High Performance Computer Architecture, pp. 171-182, Jan. 2001.
[36] C. Shin, S. Lee, and J.-L. Gaudiot, "Dynamic Scheduling Issues in SMT Architectures," Proc. 17th Int'l Parallel and Distributed Processing Symp. (IPDPS '03), Apr. 2003.

Seong-Won Lee received the BS and MS degrees in control and instrumentation engineering from Seoul National University, Korea, in 1988 and 1990, respectively. He received the PhD degree in electrical engineering from the University of Southern California in 2003. From 1990 to 2004, he worked on VLSI/SOC design at Samsung Electronics Co., Ltd., Korea. Since March 2005, he has been a professor in the Department of Computer Engineering at Kwangwoon University, Seoul, Korea. His research interests include VLSI/SOC architectures, multithreaded architectures, media processor architectures, power-aware computing, and multimedia signal processing.
Jean-Luc Gaudiot received the Diplôme d'Ingénieur from the École Supérieure d'Ingénieurs en Electrotechnique et Electronique, Paris, France, in 1976 and the MS and PhD degrees in computer science from the University of California, Los Angeles, in 1977 and 1982, respectively. He is currently a professor and chair of the Department of Electrical Engineering and Computer Science at the University of California, Irvine (UCI). Prior to joining UCI in January 2002, he had been a professor of electrical engineering at the University of Southern California since 1982, where he served as director of the Computer Engineering Division for three years. He has also done microprocessor systems design at Teledyne Controls, Santa Monica, California (1979-1980) and research in innovative architectures at the TRW Technology Research Center, El Segundo, California (1980-1982). His research interests include multithreaded architectures, fault-tolerant multiprocessors, and implementation of reconfigurable architectures. He has published more than 170 journal and conference papers. He served as the editor-in-chief of the IEEE Transactions on Computers (1999-2002) and has been the editor-in-chief of IEEE Computer Architecture Letters since January 2006. He is a member of the ACM, of ACM SIGARCH, and a fellow of the IEEE.

For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
