Software-Directed Power-Aware Interconnection Networks

Vassos Soteriou, Noel Eisley and Li-Shiuan Peh
Department of Electrical Engineering, Princeton University, Princeton, NJ 08544
{soteriou, eisley, peh}@princeton.edu

ABSTRACT

Interconnection networks have been deployed as the communication fabric in a wide range of parallel computer systems. With recent technological trends providing growing quantities of chip resources and faster clock rates, increasing power consumption has become a major limiting factor in the design of parallel computer systems, from multiprocessor SoCs to multi-chip embedded systems and parallel servers. To tackle this, power-aware networks must become inherent components of single-chip and multi-chip systems. On the hardware side, while there has been some recent research on interconnection network power reduction, especially targeted towards communication links, the techniques proposed are ad hoc and are not tailored to the application running on the network. We show that with these ad hoc techniques, power savings and the corresponding impact on network latency vary significantly from one application to the next; in many cases network performance can suffer severely. On the software side, extensive research on compile-time optimization has produced parallelizing compilers that can efficiently map an application onto hardware for high performance. However, research into power-aware parallelizing compilers is in its infancy, and none of it has addressed communication power. In this paper, we take the first steps towards tailoring applications' communication needs at run-time for low power. We propose software techniques that extend the flow of a parallelizing compiler in order to direct run-time network power optimization. We target network links, the dominant power consumer in these systems, allowing dynamic voltage scaling (DVS) instructions extracted during static compilation to orchestrate link voltage and frequency transitions for power savings during application runtime. Concurrently, a hardware online mechanism measures network congestion levels and adapts these off-line DVS settings to optimize network performance. Our simulations show that link power consumption can be reduced by up to 76.3%, with a minor increase in network latency in the range of 0.23% to 6.78%, across a number of benchmark suites running on three existing parallel architectures, from very fine-grained single-chip to coarse-grained multi-chip architectures.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES’05, September 24–27, 2005, San Francisco, California, USA. Copyright 2005 ACM 1-59593-149-X/05/0009 ...$5.00.

Categories and Subject Descriptors: B.9.1 [Power Management]: Low-Power Design; B.8.2 [Performance and Reliability]: Performance Analysis and Design Aids; C.2.0 [Computer-Communication Networks]: General
General Terms: Design, Management, Performance
Keywords: Software-directed power reduction, dynamic voltage scaling, interconnection networks, networks-on-a-chip (NoC), communication links, simulation

1. INTRODUCTION

Interconnection networks are becoming the de facto communication fabric in both single-chip multiprocessors [27, 21, 3] and multi-chip systems [17, 7], facilitating program parallelism as a means to reduce execution time and achieve very high, scalable performance. While rapidly improving VLSI technology allows the use of additional chip resources along with higher clock rates, these performance gains do not come for free. As in the case of uniprocessor systems, interconnected systems, both on- and off-chip, suffer from the effects of ever-increasing power consumption, with the interconnection network taking up a sizable portion of the chip power budget. For instance, the on-chip network in the MIT RAW CMP burns 36% of the chip power budget [12], while 20% (23W) of the total allocated power in the Alpha 21364 microprocessor is consumed by the router and links [17]. Indeed, the International Technology Roadmap for Semiconductors [23] highlights system power consumption as the limiting factor in developing systems below the 50nm technology point. Interconnection networks therefore urgently have to be designed to be power-aware.

A widely recognized power reduction technique is Dynamic Voltage Scaling (DVS). In uniprocessors, researchers have proposed several methods to explore compile-time DVS scheduling [22, 34]. These techniques identify periods of program execution slack at various points in a program and appropriately insert DVS instructions into the original code to slow down the processor in order to save power. In interconnected systems, DVS has also been used to reduce the power consumption of on-chip buses [33] and chip-to-chip interconnection networks [24, 26], using hardware prediction mechanisms to tune voltage and frequency levels on each link according to the projected traffic levels. Though these approaches are simple and provide good power savings, they exhibit a number of severe limitations due to their ad hoc settings, lacking fine responsiveness to the specific application's fluctuating link bandwidth needs. As shown in Section 2, network performance can be highly variable and unpredictable from one application to the next, and in some cases severely degraded.

In the software arena, extensive research on compile-time optimization has produced heavily optimized compilers [31, 15] that expose program parallelism to efficiently map an application onto the parallel architecture, showing good potential for performance speedup. However, compiler optimizations that address power issues in parallel architectures remain very limited, with recent work in [9] targeting processor power optimization for array-based applications. In short, communication power has not been addressed at the software level.

In this paper, we take the first steps towards tailoring applications' communication needs at run-time for low power and propose a software-based methodology that extends the parallel compiler flow in order to construct high-performance power-aware interconnection networks by targeting communication links, the dominant power consumer in interconnection networks. Our methodology takes in the statically compiled message flow of an application and analyzes the traffic levels for all links in the network over periods of time. By factoring in architectural characteristics, it matches DVS link transitions to the expected levels of traffic, generating DVS software directives that are injected into the network along with the network-mapped application. These DVS instructions are then executed at runtime, dynamically adapting link power consumption to actual utilization. Concurrently, a hardware online mechanism measures network congestion levels and fine-tunes the execution of these DVS instructions to handle runtime variabilities that are not precisely captured at compile time. Our results show our software-directed approach delivering significantly superior power-performance as compared to prior hardware approaches, reducing link power by up to 76.3%, with a minor increase in network latency in the range of 0.23% to 6.78%, across a number of benchmark suites running on three existing network architectures, from very fine-grained single-chip to coarse-grained multi-chip architectures.

Next, Section 2 discusses prior related research, demonstrating the ad hoc nature of existing hardware-driven approaches and motivating the use of software directives to optimize network power, while Section 3 describes the DVS link model assumed. Section 4 follows with details of our proposed techniques for extracting DVS software directives for network power reduction, and Section 5 describes our online DVS hardware mechanism. Section 6 details our simulation setup and results for a range of benchmark suites running on three different existing network architectures, and finally Section 7 concludes the paper.

2. BACKGROUND AND MOTIVATION

2.1 Related Work

As power-constrained systems become increasingly interconnected, there has been growing recognition of the need to target the power consumption of interconnection networks. Several recent studies modeled and characterized the power profile of network routers and links in a variety of systems ranging from clusters to servers and chip multiprocessors (CMPs) [19, 29, 12, 33], emphasizing and demonstrating the high power consumed by communication links.

Limited research has explored the use of power-aware techniques to reduce link power consumption in interconnection networks. All are hardware-based approaches which can be classified by the type of power-aware mechanism: dynamic voltage scaling (DVS) and on/off links. The first power-aware interconnection networks, proposed by Shang et al. [24], explored the use of DVS links, where hardware counters measure past and current network utilization statistics over fixed sampling windows that are then compared against fixed thresholds to direct (V, f) link pair transitions for power reduction. Chen et al. [2] used similar hardware DVS policies, focusing on proposing circuits for realizing DVS in opto-electronic links. Stine and Carter [26] later demonstrated that in some cases, as long as the network can provide enough bandwidth to meet the application requirements, a static setting of link frequency along with adaptive routing can outperform DVS links for the multi-chip synthetic self-similar traffic traces they studied. In their exploration of the use of on/off links as a means to reduce overall network power consumption, Soteriou and Peh [25] proposed a number of techniques where links switch on/off according to hardware counter metrics measured from the network during runtime. Additionally, to avoid deadlocks, they devised fully adaptive routing protocols, mapped onto the network topology, which were modeled as connectivity graphs. Lastly, networks with hybrid DVS-DLS (Dynamic Link Shutdown) links that further shut down DVS links when traffic drops to very low levels were proposed and investigated by Kim et al. [10].

Additionally, research in the embedded Networks-on-Chip (NoC) and Systems-on-a-Chip (SoC) areas has proposed several software-based techniques that use application profiling to reduce power. Across all of these studies, links are set at one frequency/voltage for the entire application, i.e., link frequencies do not vary at runtime. Luo et al. [16] addressed joint optimization for variable-voltage processors and communication links in heterogeneous embedded systems, while at the same time meeting real-time constraints. Similarly, Hu et al. [6] have proposed a static energy-aware scheduling algorithm to reduce energy in a heterogeneous NoC by scheduling communication transactions and computation tasks in parallel under real-time constraints, where voltages and frequencies are set based on application profiling. Further, Jalabert et al. [8] have presented the ×pipesCompiler, a tool for instantiating an application-specific NoC for heterogeneous multi-processor SoCs based on application profiling, exhibiting good power savings. The software directives proposed in this paper can extend the above synthesis tools towards power-aware DVS networks.

2.2 Motivation

Though recent dynamically-tuned, hardware-based power-aware approaches that rely on runtime statistics have exhibited good interconnection network power savings [24, 25, 26, 10], they also demonstrate a number of serious limitations. These techniques are ad hoc and are not tailored to the spatial and temporal variability of the specific application running on the network. For good power-performance, power-aware policies need to be aware of an application's demands and tuned to its fluctuating network bandwidth requirements. The above techniques depend on statistics obtained during application runtime that are then compared against thresholds to direct power-aware decisions. However, these statistics are short-lived and are measured over a limited number of system cycles or sampling windows, reflecting only short-term temporal traffic variability. These statistics are also obtained locally at each router and therefore do not reflect the spatial variability across the entire network topology. Lastly, the thresholds are fixed and empirically set, and are not based upon traffic behavior indicators.

The pioneering work on interconnection networks with DVS links [24] presents good power savings with synthetic self-similar traffic while sustaining high performance. However, due to its ad hoc nature, performance can suffer severely when faced with a traffic pattern that differs from the one it is tuned for. To demonstrate this, we applied traffic traces from the TRIPS CMP [21] to the exact implementation originally proposed in [24], using the same thresholds and sampling windows as in the original work, and the DVS link model described in Section 3.

Figure 1: Link power savings of ad hoc hardware-directed DVS for three trace benchmarks (adpcm, art, mpeg2encode) running on a TRIPS CMP, across threshold sets and sampling window sizes.

Figure 2: Network latency penalty of ad hoc hardware-directed DVS for three trace benchmarks (adpcm, art, mpeg2encode) running on a TRIPS CMP, across threshold sets and sampling window sizes.

While high link power savings, 74.4% on average, are demonstrated (see Figure 1), Figure 2 shows the severe overall impact on network latency. With short sampling window sizes of 10 cycles, the latency penalty can grow to more than double (100.7%) the original network delay without DVS, because such short sampling windows cannot distinguish short-term from long-term traffic fluctuations. Latency increases are 47.4% at the minimum and 62.1% on average across all configurations and benchmarks. With longer sampling windows, latency penalties tend to decrease slightly, as links do not toggle (V, f) pairs as often, but power savings shrink as well. For instance, threshold Set 1 with a sampling window size of 10k cycles (see Figure 1) presents the smallest link power savings of all experiments carried out, 71.2%, along with the smallest latency increase of 47.4%.

Our proposed software-directed techniques present a number of important advantages compared to the above purely hardware-based approaches:

Global view of traffic: First, our approach has, in advance, a global (collective) view of the network via the estimation of utilization levels carried out for each link individually, covering the entire application running on the network. It is therefore able to "see" the entire network's spatial and temporal variability, which is unique to each application, directing DVS transitions at each link independently from other links for optimal power-performance during runtime.

Threshold customization: Our approach automatically picks customized thresholds, unique to each application, based on profiling of the parallelized application itself.

Architecture-specific customization: Our methodology accounts for network configuration variables, such as network size and buffer capacity, and individual link architectural characteristics, such as maximum assigned link frequency and bandwidth, in deriving DVS software directives; it can therefore be applied to various network types, such as heterogeneous systems (SoC and NoC) with links having different assigned bandwidths. Indeed, our results in Section 6 show high resilience to fluctuating link bandwidth requirements due to high variability in the parallelized applications' spatial and temporal distributions, demonstrating superior power-performance when applied to three existing parallel architectures, spanning fine-grained on-chip to coarse-grained multi-chip implementations.


3. DVS LINK MODEL

Chip-to-chip parallel [30] and serial links [11], where the link automatically and continuously adjusts its frequency at a minimum voltage, have already been demonstrated. The variable-frequency serial link has a supply voltage which varies from 0.55V to 2.5V, dissipating 21mW at 1Gb/s and up to 197mW at 3.5Gb/s, providing up to 90% power reduction. Though this link was designed for off-line frequency settings, and not for both dynamic voltage and frequency settings, the link architecture can be extended to accommodate DVS [24, 2].

In this paper we construct a realistic multi-level DVS model, where the serial link can take only a range of 10 discrete frequency levels and corresponding voltage levels. The maximum frequency of the serial link is 1GHz at 2.5V, which can be scaled down to 0.6GHz at 1.57V. Though previous research [24] has suggested a frequency range of 1GHz down to 125MHz for up to 10X power improvement, that lowest frequency level increases the traversal time of a flit crossing a link by a factor of 8X. Since a link has to step through all intermediate transition levels one by one, requiring a considerable number of cycles, transitioning back to the maximum frequency upon an abrupt increase in link traffic can have a serious negative impact on performance. In our model, though the minimum frequency is restricted to 0.6GHz, considerable link power savings, of up to 76.33%, can still be achieved. Since frequencies and voltages are compacted in the upper (V, f) range, more granular frequency levels can be considered; this allows the discrete link frequencies to be fine-tuned to the expected traffic levels. Table 1 shows the (V, f) voltage-frequency pairs of our DVS link. Dynamic link power is estimated using:

P_{link} = C_{load} \cdot V_{dd}^2 \cdot f_{link}    (1)

where P_{link} is the power consumed by the link, C_{load} is the load capacitance, V_{dd} is the supply voltage and f_{link} is the link frequency. The voltage and link frequency transitions occur separately. When the link ramps (V, f) down, the frequency is reduced first and then the voltage; when the link ramps (V, f) up, the voltage is increased first and then the frequency. During frequency transitions, no network traffic (packets) can cross the link. It takes 20 link clock cycles to transition between any two sequential discrete frequency steps and 100 cycles to transition between any two sequential discrete voltage steps [2]; in other words, a total of 180 and 900 link clock cycles to traverse the entire range of discrete frequency and voltage levels, respectively. The energy consumed during transitioning is [1]:

E_{link-trans} = (1 - \eta) \cdot C_{filter} \cdot |V_a^2 - V_b^2|    (2)

where \eta is the efficiency, typically taken to be 90%, and C_{filter} is the filter capacitance, assumed to be 5pF [11]. In our experiments we considered both the dynamic and the transition-overhead link energies in estimating power savings.

Table 1: Multi-level discrete DVS link model (V, f) pairs.

(V, f)_0 = (2.50V, 1.00GHz)    (V, f)_1 = (2.38V, 0.95GHz)
(V, f)_2 = (2.27V, 0.90GHz)    (V, f)_3 = (2.15V, 0.85GHz)
(V, f)_4 = (2.02V, 0.80GHz)    (V, f)_5 = (1.93V, 0.76GHz)
(V, f)_6 = (1.84V, 0.72GHz)    (V, f)_7 = (1.75V, 0.68GHz)
(V, f)_8 = (1.66V, 0.64GHz)    (V, f)_9 = (1.57V, 0.60GHz)
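To make the link model concrete, the following Python sketch evaluates equations (1) and (2) over the (V, f) pairs of Table 1. It is an illustration only, not part of the authors' toolflow; the load capacitance C_LOAD is an assumed placeholder value, since the paper does not specify one.

```python
# Hypothetical sketch of the DVS link model of Section 3 (not the authors' code).
DVS_LEVELS = [  # (V in volts, f in GHz), index 0 = fastest level
    (2.50, 1.00), (2.38, 0.95), (2.27, 0.90), (2.15, 0.85), (2.02, 0.80),
    (1.93, 0.76), (1.84, 0.72), (1.75, 0.68), (1.66, 0.64), (1.57, 0.60),
]
C_LOAD = 10e-12      # F, assumed for illustration only (not stated in the paper)
C_FILTER = 5e-12     # F, filter capacitance from [11]
ETA = 0.90           # regulator efficiency
DF, DV = 20, 100     # link cycles per single frequency / voltage step [2]

def dynamic_power(idx):
    """Equation (1): P_link = C_load * Vdd^2 * f_link, in watts."""
    v, f_ghz = DVS_LEVELS[idx]
    return C_LOAD * v**2 * (f_ghz * 1e9)

def transition_energy(idx_a, idx_b):
    """Equation (2): E = (1 - eta) * C_filter * |Va^2 - Vb^2|, in joules."""
    va, vb = DVS_LEVELS[idx_a][0], DVS_LEVELS[idx_b][0]
    return (1 - ETA) * C_FILTER * abs(va**2 - vb**2)

def transition_cycles(idx_a, idx_b):
    """Link cycles spent stepping one level at a time between two settings."""
    return abs(idx_a - idx_b) * (DF + DV)

# Example: power at the fastest and slowest levels, and the cost of moving
# across the whole range (9 steps x 120 cycles = 1080 link cycles).
print(dynamic_power(0), dynamic_power(9), transition_cycles(0, 9))
```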

4. SOFTWARE DIRECTIVES GENERATION

Figure 3: Overview of proposed software directives. (The figure shows sequential loop code being partitioned by a parallelizing compiler onto nodes 6, 7 and 8; the per-node message-flow space-time schedule, listing cycle stamp, source and destination routers, flit count and operand/opcode; Phase 1, LUNA estimation of per-link utilizations over sampling windows of Tw = 20k cycles; Phase 2, generation of per-node software directives holding cycle-stamped intermediate and final (V, f) targets against the multi-level discrete DVS link model; Phase 3, threshold extraction from the network buffer utilization histogram; and the router node hardware: directives cache, thresholds cache, buffer utilization counters, crossbar, switch allocation and the (V, f) controller driving each DVS link.)

A parallelizing compiler such as [15, 31] takes sequential code as input, performs temporal and spatial partitioning of this code, dividing it into a number of code segments, and distributes these segments onto computational nodes. For correct code execution, the compiler orchestrates inter-node communication via synchronized send() and receive() message-passing operations. Each node communicates with others through a communication fabric. As the number of nodes scales, networks become the fabric of choice, with each node interfacing with an associated router. Figure 3 shows an example of a code snippet being partitioned into three segments, each mapped onto a computational node. Our power-aware methodology extends this flow, statically generating DVS software directives right after code partitioning and scheduling. DVS directive generation is achieved in three phases, depicted in Figure 3.

In the first phase, our technique uses LUNA [5] (Link Utilization for Network power Analysis), a framework that was originally proposed to analyze network power consumption, as a base. LUNA factors in network architectural parameters, such as network size, and the compiler-generated communication code streams to periodically estimate average link utilization levels across all network links, paced by a sampling window of Tw cycles. For this to work, message flows need to contain network injection time stamps. In RAW [27], the hardware is fully exposed to the compiler by exporting a cost model for communication and computation; the RAWCC compiler [15] explicitly manages all communication through the interconnect statically from compile time, providing cycle-by-cycle message flow scheduling and timing information that can be used by LUNA. Sequencing is exact, but due to dynamic events there are some disturbances in run-time flows, with a 5% probability of occurrence [15]. An advantage of our methodology, as we show in Section 6.6, is that message flow timing information does not have to be exact and can tolerate fairly large disturbances. In case static compilation cannot provide flow time stamping, applications can go through profile runs, previously used in uniprocessor [22, 34] and embedded systems [6], from which timing estimates can be obtained.

In phase 2, LUNA's link utilization estimates are normalized to the link bandwidths, and by considering the multi-level discrete DVS model of Section 3, we generate DVS instructions for each link individually using our proposed DVS software directives algorithm of Section 4.2. Custom directives relevant to each router node are then written into a dedicated cache at every node at the time when the node-mapped parallelized code segment is scheduled to run on the processor attached to that node. Periodically, at each Tw, the directives are read and used to set the frequency and voltage levels of each outgoing link individually. There are two DVS instructions per Tw, one trying to reach an intermediate (V, f) level within Tw and the second for reaching the final target (V, f) level at the end of Tw. Hence, the hardware directives cache can be as small as two (V, f) directives (8 bits, since each directive can be represented by 4 bits given our 10-level DVS link power model), refreshed with the latest directives for each sampling period. Similar caches in network routers have been used in on-chip architectures such as RAW to direct packet switching in a static network [27]; these instructions are similarly created during static compilation.
Note that since the software directives are created serially, the application can start running on the network once enough directives are generated, given that subsequent directives will be generated and pumped into the network ahead of their execution time stamp.
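To illustrate the directive encoding described above, the sketch below packs the two 4-bit (V, f) indices for one sampling window into a single byte and shows how a router's (V, f) controller might unpack them each Tw. It is a hypothetical illustration consistent with the 10-level model of Section 3; the function names are ours, not part of the RAW/TRIPS hardware or the authors' toolflow.

```python
# Hypothetical encoding of one sampling window's DVS directive pair
# (intermediate and final (V, f) indices, 4 bits each -> 8 bits total).
def encode_directive(intermediate_idx: int, final_idx: int) -> int:
    """Pack two 4-bit (V, f) level indices (0-9) into one byte."""
    assert 0 <= intermediate_idx <= 9 and 0 <= final_idx <= 9
    return (intermediate_idx << 4) | final_idx

def decode_directive(byte_val: int) -> tuple[int, int]:
    """Unpack the byte stored in the per-node directives cache."""
    return (byte_val >> 4) & 0xF, byte_val & 0xF

# Example: within this Tw, drop to level 5, then return to level 3 by the
# window's end (cf. Figure 5(b-c)).
word = encode_directive(5, 3)
assert decode_directive(word) == (5, 3)
```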

Figure 5: Example of software directives generation. (Panel (a): estimated average link utilization per sampling window, mapped to the discrete frequency/voltage indices; panel (b): discrete (V, f) pair transitioning within a window, with steady-state, frequency-transitioning and voltage-transitioning periods and the intermediate and final (V, f) targets; panel (c): the resulting Tw-based list of DVS software directives.)
In the final, third phase, we make use of queueing theory principles to translate the already estimated link utilizations into router output buffer utilizations. We then construct a histogram by collecting the individual buffer utilizations of the entire network to extract statistical parameters of network output buffer utilization. We use these statistics to set router thresholds, which are stored in a threshold cache at each router. These thresholds direct our online hardware DVS mechanism, which detects localized short-term link congestion, backing off DVS transitioning and delaying the execution of software directives in order to optimize network performance. Section 5 provides a detailed description of the proposed hardware mechanisms.

4.1 Phase 1: Link Utilization Estimation

To capture spatial and temporal message flow variability in order to explore power savings, we use LUNA to estimate link utilizations across the network [5]. LUNA is a high-level network power analysis tool whose accuracy was shown to be within 5.9% of cycle-level simulators [29] with a run time that is up to 360X faster, making it suitable for compiler-directed network power analysis. LUNA abstracts network power through link utilizations, capturing the effect of contention amongst message flows in its estimation of utilization across time for each link in the network. Based on these estimates, we then create the DVS software directives of Section 4.2 that leverage unused link capacity as power saving opportunities. There are five key steps in LUNA, explained in Figure 4 and sketched in code after the list below; note that we show traffic only across router nodes 0, 1, 2 and 5 for clarity (Figure 4(a)).

Step 1. In the first step, message flows are captured as injection rate functions, with the injection rate of each message expressed as a percentage of the injection port bandwidth over time. Figure 4(b) shows the injection rate functions of message flows A and B over the first 1,000 clock cycles.

Step 2. During this step, routing maps the injection rate functions of Step 1 onto the links of a network topology, translating them into normalized link utilization functions, between 0 (no traffic) and 1.0 (link saturated). We assume deterministic XY routing, in which packets fully traverse the X dimension before traversing the Y dimension towards their destination node; LUNA considers the packet size in terms of flits (a flit, or flow control unit, is a fixed-size segment of a packet) and the source-destination router coordinates of each packet (see phase 1 of Figure 3). Packet types, whether data, control or acknowledgement, and packet contents are not required and are ignored. Part (c) of Figure 4 shows how the message flows traverse links 0 → 1, 1 → 2 and 2 → 5 for the same time duration of 1,000 clock cycles.

Step 3. Next, injection rate functions are superimposed and summed, reflecting the sharing of links amongst multiple message flows. Part (d) of Figure 4 shows that this summation detects traffic contention, or overflow, in link 1 → 2 between cycles 0 to 300 and 600 to 1,000, since the utilization rate of 1 (100% link bandwidth capacity) is exceeded.

Step 4. To account for link contention, LUNA propagates this overflow area forward in time, as depicted in part (e) of Figure 4. Intuitively, the overflow area corresponds to the number of bits that exceed the link capacity and need to be transported later.

Step 5. Finally, the link utilization functions are split back into their constituent message flows, reflecting how individual messages are affected by the contention. Fair arbitration is assumed in splitting the link utilization among the message flows, as shown in part (f) of Figure 4.
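As a rough, self-contained sketch of Steps 2-4 (not the LUNA implementation itself), the following Python fragment sums per-link utilization contributions over fixed-size windows, clips each window at 100% capacity and pushes the excess into later windows. The message routes, window count and traffic values are illustrative assumptions.

```python
# Minimal sketch of LUNA-style link utilization estimation (Steps 2-4).
# Time is discretized into windows; utilization is normalized to [0, 1].
from collections import defaultdict

def estimate_link_utilization(flows, num_windows):
    """flows: list of (route, {window_index: injected_utilization}) tuples,
    where route is a list of directed links, e.g. [(0, 1), (1, 2), (2, 5)]."""
    util = defaultdict(lambda: [0.0] * num_windows)
    # Steps 2-3: map injection functions onto links and sum them per link.
    for route, injection in flows:
        for link in route:
            for w, u in injection.items():
                util[link][w] += u
    # Step 4: propagate any overflow (> 1.0) into subsequent windows.
    for link, series in util.items():
        carry = 0.0
        for w in range(num_windows):
            total = series[w] + carry
            series[w] = min(total, 1.0)
            carry = max(total - 1.0, 0.0)
    return dict(util)

# Two messages sharing link (1, 2), loosely mirroring Figure 4.
flows = [([(0, 1), (1, 2)], {0: 0.8, 1: 0.4}),
         ([(1, 2), (2, 5)], {0: 0.5, 1: 0.5})]
print(estimate_link_utilization(flows, num_windows=3)[(1, 2)])  # [1.0, 1.0, 0.2]
```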

4.2 Phase 2: Software Directives Extraction

Estimated link utilizations from phase 1 are used as inputs in phase 2 to generate DVS software directives for each network link individually. Though phase 2 is carried out independently for each link, the link utilization estimates were derived in phase 1 from global information about all message flows across the entire application, unlike previous ad hoc hardware methods [24, 25] which only had local information available. Figure 5 sketches an overview of phase 2. Intuitively, the process of generating software directives works as follows. Given the average link utilizations generated by LUNA in phase 1 over window intervals Tw, it first maps these utilization levels to the closest discrete link frequency/voltage levels that can support the required bandwidth (Figure 5(a)). Then, at each sampling window, starting from the voltage/frequency setting at the beginning of the window, it tries to lower the voltage/frequency as much as possible, as long as it can return to the voltage/frequency setting required at the start of the next sampling window, taking into account the voltage/frequency transition delays of our DVS link model (Figure 5(b)). We term this lowest voltage/frequency level the intermediate (V, f) target, and the voltage/frequency setting just before the start of the next window the final (V, f) target. Finally, these two directives are entered for each Tw to create a Tw-based list of DVS software directives which are executed at runtime (Figure 5(c)).

Our methodology works by considering remaining resources along two axes: time and remaining link capacity. Horizontally, with frequency down-ramping, a flit's link traversal period is proportionally stretched. Vertically, since our Tw-based approach prohibits traffic from spilling over into the next Tw, the link has to transport the same number of flits within this Tw time span at the lowered link frequency; this translates to an increase in the link's bandwidth utilization and a proportional decrease in the link's remaining flit transport capacity, as flits are squashed into this Tw segment.

Figure 4: A walkthrough example showing the five steps of link utilization calculations using LUNA. (Panel (a): network message flows A and B across router nodes 0, 1, 2 and 5; panel (b), Step 1: injection rate functions for the two messages; panel (c), Step 2: mapping injection functions onto link utilization functions; panel (d), Step 3: summing the utilization functions of each link; panel (e), Step 4: propagating overflow for link 1→2; panel (f), Step 5: constituent message splitting.)

for each link in the network do
  for each Tw do
    {prevIdx, curIdx, nextIdx} ← previous, current and next segment frequency indices
    // calculate the number of flits that must be sent
    numFlitsToSend ← current link utilization × Tw
    // find the lowest acceptable target frequency for the current time window,
    // starting from curIdx and backing off towards higher frequencies if needed
    curCapacity ← −1   // ensure the while loop is executed at least once
    while steadyTime < 0 or curCapacity < numFlitsToSend do
      downTime ← (df + dv) × (curIdx − prevIdx)
      upTime ← (df + dv) × (curIdx − nextIdx)
      steadyTime ← windowTime − (downTime + upTime)
      // find the number of flits the link can transmit during this time
      // window given the frequency and voltage transitions
      curCapacity ← \sum_{i=prevIdx}^{curIdx} dv·(f_i/f_0) + steadyTime·(f_{curIdx}/f_0) + \sum_{i=nextIdx}^{curIdx} dv·(f_i/f_0)
      if curCapacity < numFlitsToSend or steadyTime < 0 then
        curIdx ← curIdx − 1
      end if
    end while
  end for
end for

Figure 6: Software directives generating algorithm.

Our methodology performs a recursion over consecutive (V, f) steps, searching for the intermediate (V, f) level that is just low enough to satisfy the two conditions above (available time and remaining link capacity). This (V, f) target then generates a DVS directive. Note that in the uniprocessor [22, 34] and embedded systems [16, 6, 8] areas, DVS policies only deal with the time axis, with a frequency decrease stretching application execution time to cover subsequent idle processor periods (slack); the only constraint is to prohibit program execution from spilling over into the next code block's running time segment, meeting program execution deadlines so as to maintain performance.

Mathematically, each link utilization profile is modeled as a discrete-time function, U[Tw n], where n = 0, 1, 2, ..., N and Tw is the sampling period, or window size. As Figure 5(a) shows, U[t] is a step function and is continuous such that U[t] = U[Tw n], where Tw n <= t < Tw(n + 1); its amplitude is the average measured link bandwidth requirement, normalized between 0 and 1 to indicate zero and maximum capacity utilization respectively. This is matched to the closest discrete frequency f_k, where k is the frequency index in the range 0 to 9: f_0 denotes the full link frequency (1GHz) and f_9 the smallest available frequency (0.6GHz). To create (V, f) pair directives, we apply the algorithm of Figure 6 to each network link individually to extract the intermediate and final target DVS instructions for every Tw, as shown in Figure 5(b-c). Though individual directives are created for each Tw, the calculations carried out in the algorithm need to consider the link utilization levels at the beginning (end of Tw(i − 1)) and at the end (beginning of Tw(i + 1)) of Tw(i); between t = Tw and t = 2Tw in our example. Directives are therefore created serially for each Tw time segment, with interdependence upon the previous and next Tw segments.

Algorithm details. The algorithm begins by translating the amplitude of each link utilization function Uj[Tw i] for each time segment i into the nearest index k such that f_k/f_0 >= Uj[Tw i]. This k is termed curIdx. Similarly, prevIdx (beginning of Tw) and nextIdx (end of Tw) are determined for U[Tw(i − 1)] and U[Tw(i + 1)] respectively, to meet the utilization targets at the two ends of Tw. The number of flits to be traversed over this particular link j is the product of the current utilization level (flits/cycle) and the window size (cycle count), numFlitsToSend = Tw × Uj[Tw i].

Figure 5 depicts numFlitsToSend as the shaded area between t = Tw and t = 2Tw. All these flits must be able to traverse the link within this same time duration, that is Tw, under our calculated lower-frequency (V, f) targets. To create DVS directives, the algorithm keeps a count of the number of discrete higher-to-lower frequency transitions, or step-downs, relative to the beginning link utilization, and similarly of the lower-to-higher transitions, or step-ups, needed to meet the end utilization target within this Tw. Given prevIdx, curIdx, and nextIdx, the algorithm calculates the number of steps down at the beginning of a Tw segment as max(curIdx − prevIdx, 0), and the number of steps up at the end of a segment as max(curIdx − nextIdx, 0). As an example, consider the period t = Tw to t = 2Tw of Figure 5(b), where the frequency and voltage are reduced by three steps from (V, f)_1 to the intermediate target of (V, f)_4, and then increased by one step to a final target of (V, f)_3, prior to 2Tw. To determine the final number of (V, f) step-ups/downs within the current Tw, the algorithm begins at the discrete frequency level f_curIdx which can accommodate U[Tw], that is f_curIdx/f_0 >= U[Tw], and keeps recursing until both: (1) the horizontal component is satisfied, that is, there is just enough time for the calculated number of (V, f) hops; and (2) the vertical component is satisfied, that is, the utilized link bandwidth fits into the link's maximum capacity of 1.0 and numFlitsToSend can be sent within Tw. The horizontal component is measured via steadyTime and the vertical component via curCapacity. steadyTime is directly affected by dv and df, the voltage and frequency transition delays (during df no flits can traverse the link), expressed in nominal router cycles (relative to f_0 = 1GHz). In turn, these determine the time to step down as downTime = numStepsDown × (dv + df), and the time to step up as upTime = numStepsUp × (dv + df). steadyTime = Tw − (downTime + upTime), and is defined as the time spent at the intermediate target frequency and voltage for the current Tw. The used link capacity curCapacity is determined by the cumulative effects of up and down (V, f) link transitions. With each (V, f) step, the link's throughput is altered (increased if the frequency is increased), and this effect is taken into account in determining whether curCapacity >= numFlitsToSend is satisfied within Tw. In each recursion of the algorithm, curCapacity is calculated under the given (V, f) transitions by multiplying each frequency that the link operates at during the current segment by the number of cycles the link spends at that frequency, and summing these products. Figure 5(b) depicts this currently tested frequency-time product as the horizontally striped shaded area, labeled "steady state". The intermediate target index keeps decrementing by 1, equivalently reducing the number of step-ups/downs for the current segment, until neither steadyTime < 0 nor curCapacity < numFlitsToSend holds. As an example, Figure 5(b) shows that in order to reach the intermediate (V, f) target, there must be three step-down hops from (V, f)_1 to (V, f)_4 and one step-up to meet the final target of (V, f)_3; no more steadyTime is available for further DVS hopping.
Once these two targets are calculated, software directives with the intermediate and final (V, f)_k targets are created as Figure 5(c) shows; the algorithm then recurses over the subsequent Tw segments to similarly create a further two (V, f)_k instructions per segment, for the entire application duration and for all network links.
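The following Python sketch is our own illustrative re-implementation of the Figure 6 recursion under the link model of Section 3; the window size, utilization values and the capacity-accounting granularity are simplifying assumptions. It returns the intermediate and final (V, f) index targets for one window.

```python
# Illustrative sketch of the per-window directive search of Figure 6 (not the
# authors' code). Index 0 is the fastest (V, f) level, index 9 the slowest.
FREQS = [1.00, 0.95, 0.90, 0.85, 0.80, 0.76, 0.72, 0.68, 0.64, 0.60]  # GHz
DF, DV = 20, 100  # cycles per single frequency / voltage step

def level_for(util):
    """Largest index whose normalized frequency still covers the utilization."""
    return max(k for k, f in enumerate(FREQS) if f / FREQS[0] >= util)

def directives_for_window(u_prev, u_cur, u_next, t_w=20_000):
    prev_idx, nxt_idx = level_for(u_prev), level_for(u_next)
    cur_idx = level_for(u_cur)           # most aggressive candidate
    flits_to_send = u_cur * t_w
    while True:
        steps_down = max(cur_idx - prev_idx, 0)
        steps_up = max(cur_idx - nxt_idx, 0)
        steady = t_w - (steps_down + steps_up) * (DF + DV)
        # capacity: flits moved while stepping (counted only during the
        # voltage-transition portion of each step, since df blocks traffic)
        # plus flits moved while sitting at the intermediate level
        cap = sum(DV * FREQS[i] / FREQS[0] for i in range(prev_idx, cur_idx)) \
            + sum(DV * FREQS[i] / FREQS[0] for i in range(nxt_idx, cur_idx)) \
            + steady * FREQS[cur_idx] / FREQS[0]
        if steady >= 0 and cap >= flits_to_send:
            break
        cur_idx -= 1                      # back off towards a higher frequency
    return cur_idx, nxt_idx               # intermediate and final (V, f) targets

# Example: utilization 0.9 at the window boundaries but only 0.65 inside it.
print(directives_for_window(0.9, 0.65, 0.9))  # -> (7, 2)
```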

Figure 7: Wormhole router micro-architecture, depicting the physical parameters of an M/D/1 queueing system (input and output buffers, crossbar, switch arbiter, upstream and downstream routers, arrival rate λ, router frequency f_r and variable link frequency f_l).

4.3 Phase 3: Output Buffer Utilization Estimation

In this phase, we make use of queueing theory principles to translate the estimated link utilizations of phase 1 and the target link operating frequencies of phase 2 into output buffer utilization histograms, the goal being to derive statistics that set application-specific thresholds which guide our online DVS mechanism of Section 5. We estimate the output buffer utilization for each link and for each sampling period Tw, where an output buffer is approximated and modeled as an M/D/1 queue. Under standard queueing notation, this refers to a queue which has a Poisson flit arrival rate of average value λ, a deterministic flit service rate of µ, and a single server (the link). The service rate is considered deterministic since phases 1 and 2 provide information concerning the average link utilizations and operating frequencies over the entire application span. Figure 7 depicts the micro-architecture of a wormhole router with an output link connecting sender (upstream) and receiver (downstream) routers. λ is the rate of traffic on the downstream side of the crossbar. The router operating frequency f_r is constant (1GHz) and that of the link, f_l, is variable with 10 discrete (V, f) levels as described in Section 3, where f_r >= f_l. The following equation determines the average number of occupied buffers (equivalently, customers in the queue) for our M/D/1 queueing system [13]:

\overline{N} = \frac{2\rho - \rho^2}{2(1 - \rho)}    (3)

where \rho = \lambda / \mu. This system assumes a constant service rate; however, with a DVS link the service rate varies, since the software directives can set any of the (V, f_l)_{0-9} level pairs with each Tw. To account for this, we parameterize \mu_{i,j} = f_{i,j} / f_0, where f_{i,j} is the intermediate target frequency value for link j and time window Tw i, and f_0 is again the maximum frequency, using information from phases 1 and 2. Similarly, \rho_{i,j} = \lambda_{i,j} / \mu_{i,j}, further parameterizing equation 3 as:

\overline{N}_{i,j} = \frac{2\rho_{i,j} - \rho_{i,j}^2}{2(1 - \rho_{i,j})}    (4)

In this model we assume that the output and input buffers have enough capacity that no overflows occur, and LUNA's estimation of link utilization (per Tw) is used as the value of \lambda_{i,j}; for wormhole credit-based flow control (see Section 6.1), if a flit has been given permission to traverse the crossbar, then it must be the case that there is room for it at the next input or output buffer. To obtain the average network buffer utilization \overline{BU}, all \overline{N}_{i,j} are normalized by dividing them by the buffer size, summed over all sampling periods and all network links, and then averaged over the product of the total number of network links and LUNA sampling periods:

\overline{BU} = \frac{1}{|L|} \sum_{j \in L} \frac{1}{|T|} \sum_{i \in T} \frac{\overline{N}_{i,j}}{|B|}    (5)

where L is the set of links in the network, T is the set of all sampling periods under consideration, |L| is the number of links in the network, |T| is the number of sampling periods and |B| is the buffer size in terms of flits. Next, Section 5 describes how \overline{BU} is used to set thresholds that direct our online DVS mechanism, which tunes short-term localized (V, f) pair transitions to optimize network performance.
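A compact sketch of this phase, assuming the per-window utilization estimates and intermediate frequency indices are already available from phases 1 and 2 (the input arrays below are illustrative placeholders), might compute the normalized M/D/1 occupancies of equations (3)-(5) as follows.

```python
# Sketch of phase 3: M/D/1 occupancy per link and window, then the network
# average BU (equations (3)-(5)). Inputs are illustrative placeholders.
FREQS = [1.00, 0.95, 0.90, 0.85, 0.80, 0.76, 0.72, 0.68, 0.64, 0.60]  # GHz

def md1_occupancy(rho):
    """Equations (3)/(4): average queue occupancy of an M/D/1 queue."""
    return (2 * rho - rho**2) / (2 * (1 - rho))

def average_buffer_utilization(lam, freq_idx, buf_size):
    """lam[j][i]: LUNA utilization of link j in window i (the arrival rate);
    freq_idx[j][i]: intermediate frequency index chosen in phase 2."""
    total, count = 0.0, 0
    for j, windows in enumerate(lam):
        for i, l in enumerate(windows):
            mu = FREQS[freq_idx[j][i]] / FREQS[0]   # normalized service rate
            rho = min(l / mu, 0.99)                  # keep the queue stable
            total += md1_occupancy(rho) / buf_size   # normalize by |B|
            count += 1
    return total / count                             # average over |L| x |T|

# Two links, three windows each, 10-flit output buffers.
lam = [[0.30, 0.65, 0.10], [0.50, 0.20, 0.40]]
idx = [[7, 7, 9], [4, 8, 6]]
print(average_buffer_utilization(lam, idx, buf_size=10))
```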

5. ONLINE DVS HARDWARE MECHANISM

In this section, we describe our online DVS mechanism and its interaction with the DVS software directives. The online DVS mechanism is used to react to runtime variabilities in the traffic profile, which can arise as a result of averaging effects or inaccuracies in LUNA, as well as inaccuracies in compile-time scheduling/profiling.

5.1 Output Buffer Thresholds

To detect runtime traffic variability, we need to compare LUNA's statically estimated link utilization levels to the network utilization at runtime. If the latter is noticeably greater than the former, we need to take appropriate action to reduce network contention. To direct this online mechanism, we make use of statistics collected via hardware counters. An obvious choice of runtime statistic is to track link utilization directly in hardware. However, with practical flow control methods, link utilization only tracks resource utilization well at low to mid network traffic levels; when the network is congested, or when the link's (V, f) level is currently set below the required bandwidth, traffic tends to get buffered in the input and output buffers and link utilization leans towards zero, making it an unsuitable metric [24, 25]. Per-port output buffer utilization, the number of buffers occupied per unit time coupled to an output link, is a better proxy:

BU_{pout}[n] = \frac{\sum_{t=1}^{M} F[n - t]}{B \times M}, \quad 0 \le BU_{pout} \le 1    (6)

where n is the sampling time at which we measure the output buffer utilization of output port pout, t is a dummy timing index that spans the past M router cycles and F[n − t] is the number of output buffers occupied at time n − t. B is the output buffer size in terms of flit occupancy and M is the sampling moving-window size in terms of router clock cycles. Essentially, BU_pout[n] is the average buffer occupancy over the past M router cycles, measured at sampling time n for each output port of a router. BU_pout[n] is calculated at every router clock cycle. In our experiments we set M = 300 cycles for two reasons: to detect recent network contention levels and to keep the hardware compact. It is critical to keep in mind the hardware overhead involved in gathering statistics; for all the statistics we propose, only simple hardware counters are needed (the power consumed by these counters is ignored, as similar hardware has been shown to consume little CMOS area with negligible power [24, 25]).

We use LUNA's average network buffer utilization estimate \overline{BU} from Section 4.3 to set thresholds; BU_pout is then compared against these thresholds to direct our online hardware DVS mechanism, which detects localized link congestion, backing off DVS (V, f) transitioning and delaying software directive execution in order to optimize network performance. Unlike previous ad hoc methodologies [24, 25], thresholds are not set to some empirical value, but are customized according to the application's characteristics.
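A small sketch of the per-port counter implied by equation (6), assuming a circular buffer of the last M occupancy samples (a plausible realization, not necessarily the authors' hardware), is shown below.

```python
# Sketch of the moving-window output buffer utilization counter of eq. (6).
from collections import deque

class BufferUtilCounter:
    def __init__(self, buf_size_flits, window_cycles=300):
        self.B = buf_size_flits
        self.M = window_cycles
        self.samples = deque([0] * window_cycles, maxlen=window_cycles)
        self.running_sum = 0

    def tick(self, occupied_flits):
        """Call once per router cycle with the current occupancy F[n]."""
        self.running_sum += occupied_flits - self.samples[0]  # evict oldest
        self.samples.append(occupied_flits)
        return self.running_sum / (self.B * self.M)           # BU_pout[n] in [0, 1]

ctr = BufferUtilCounter(buf_size_flits=10)
for occ in [2, 5, 9, 9, 4]:
    bu = ctr.tick(occ)
print(round(bu, 4))  # average occupancy over the last 300 cycles, normalized
```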




Figure 8: Online DVS (V, f) transitioning example.

Using LUNA's statistics from Section 4.3, in particular the average number of occupied output buffers \overline{N}_{i,j}/|B| for every sampling period Tw i across all links, we are able to draw a histogram of the network's output buffer utilization profile, with the X-axis showing the normalized output buffer occupancy and the Y-axis the number of occurrences (whose total equals the product of the total number of sampling periods |T| and the number of links |L|). The derived output buffer utilization profile, though application-dependent, approximates a Gaussian-like, bell-shaped distribution for most applications; using this histogram we can estimate the standard deviation \sigma_{BU}, on which we base our thresholds to capture the outlier cases of high buffer utilization at which network contention is likely to occur. For the remaining applications, whose buffer utilization distribution is skewed to the left (most buffer utilization values are relatively low, indicating little link utilization), \sigma_{BU} can still be used to capture the right-hand-side outlier cases of high buffer utilization at which network congestion is most likely to occur:

Th_{BU_{higher}} = \overline{BU} + \alpha \times \sigma_{BU}    (7)

Th_{BU_{high}} = \overline{BU} + \beta \times \sigma_{BU}    (8)

Th_{BU_{low}} = \gamma \times \overline{BU}    (9)

where 1 < \beta < \alpha, \quad \gamma < 1    (10)

The various thresholds act as follows. When Th_BUhigh < BU_pout < Th_BUhigher, the algorithm postpones software directives for a retry period t_retry; we set t_retry = 240 cycles in our experiments, i.e., 2 × (dv + df). When BU_pout > Th_BUhigher, the algorithm ignores software directives and the link transitions to a higher voltage/frequency. If either of the above cases occurs and t_retry has elapsed, the link transitions to a lower (V, f) pair, trying to reach the intermediate (V, f) target, only if BU_pout < Th_BUlow. Essentially, the DVS software directives act as recommendations for setting (V, f) target pairs for every Tw, while the online mechanism acts as the final decider of these levels according to the conditions just described. The directives present a lower bound on the (V, f) targets for power optimization, while the online mechanism allows (V, f) pairs to float above this lower bound, conservatively delaying or ignoring power-saving opportunities in favor of performance.
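The decision rules above can be summarized in a few lines of code; the sketch below is our own hedged reading of the mechanism (state handling and cycle accounting are simplified), reusing the thresholds of equations (7)-(9).

```python
# Simplified per-link online decision: consult BU_pout against the thresholds
# before honoring the next step of a software directive.
T_RETRY = 240  # cycles, 2 x (dv + df)

def online_step(bu_pout, cur_idx, target_idx, retry_timer,
                th_low, th_high, th_higher):
    """Return (new_idx, new_retry_timer) for one decision point."""
    if bu_pout > th_higher:
        # severe congestion: ignore the directive, ramp up one level
        return max(cur_idx - 1, 0), 0
    if retry_timer > 0:
        return cur_idx, retry_timer - 1        # still backing off
    if th_high < bu_pout <= th_higher:
        return cur_idx, T_RETRY                # postpone the down-ramp
    if bu_pout < th_low and cur_idx < target_idx:
        return cur_idx + 1, 0                  # step down towards the target
    return cur_idx, 0                          # hold the current level

# Example: a lightly loaded port steps down towards its intermediate target.
idx, timer = 3, 0
idx, timer = online_step(0.05, idx, target_idx=5, retry_timer=timer,
                         th_low=0.10, th_high=0.30, th_higher=0.45)
print(idx, timer)  # 4 0
```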

Figure 8 exhibits the behavior of a software-directed network, showing the (V, f) transitions over two consecutive Tw windows. In the upper part, the link starts transitioning from (V, f)_1 towards the intermediate target (V, f)_4. At (V, f)_3, BU_pout > Th_BUhigh, postponing a further down-ramp for a duration of t_retry. When this time has elapsed, the link tests for BU_pout < Th_BUlow, which is not satisfied, postponing the (V, f)_3→4 transition for another t_retry. When this time expires, BU_pout is consulted again and, since the condition is now satisfied, (V, f)_3→4 is performed, taking into account that enough steadyTime remains with respect to t = Tw to perform this transition. At t_ramp-up, the link starts up-ramping, (V, f)_4→3, to meet the final (V, f)_3 target at the end of t = Tw. In the lower part of Figure 8, the link starts from (V, f)_3 towards a final target of (V, f)_5 (which here equals the intermediate target). From the start, BU_pout > Th_BUhigh, so the (V, f)_3→4 software directive is postponed for t_retry. When t_retry has elapsed, BU_pout is consulted, BU_pout > Th_BUhigher, and the link backs off by up-ramping, (V, f)_3→2. A t_retry then elapses during which BU_pout > Th_BUlow, setting another t_retry, at the end of which BU_pout < Th_BUlow, and the link down-ramps (V, f)_2→3, then (V, f)_3→4. At that point there is no steadyTime left (t_remain < dv + df) with respect to t = 2Tw, and the link does not reach the target (V, f)_5 level. The mechanism will then try to reach the intermediate and final (V, f) target levels within the next Tw (recursive behavior). This example shows the responsiveness of the online mechanism to increases in network contention over short intervals of time, tuning the link in order to maintain performance, while at the same time trying to reach the (V, f) target levels to lower power consumption.

6. EXPERIMENTAL SETUP AND RESULTS

6.1 Simulator Setup

To evaluate the power-latency tradeoffs of our approach, we simulated parallelized code running on three existing network architectures with software-directed DVS links: the RAW [27] and TRIPS [21] on-chip CMPs and an Alpha 21364-based multi-chip server [17]. Details of these architectures are provided in the subsequent subsections. Our simulator models an event-driven wormhole-switched network with credit-based flow control at the flit level [4], extending PoPNet [20], a publicly available simulator. The simulator supports k-ary 2-mesh topologies with 1GHz multi-stage pipelined router cores, each with 2 virtual channels. Packets are composed of 32-bit flits, with each flit transported in 1 link cycle over links of 32Gb/s bandwidth. Each router has 8 unidirectional channels (four incoming and four outgoing). Table 2 provides a summary of our simulated network architectures.

Table 2: Configurations of simulated architectures.

Architecture   Network size   Pipeline length   Input/Output buffer size (flits)   Packet size (flit count)
RAW            4×4            5/1*              5/5                                1
TRIPS          5×5            5                 10/10                              3
Alpha 21364    8×8            13                128/128                            16 and 80

* In RAW the static router is a 5-stage pipeline, which decodes switching instructions to set up a path in advance (see Section 6.2). Once the path is established, every flit encounters a unit delay in the sender and receiver router ALUs in ALU-to-ALU communication, plus the link delay.

In all our experiments we set the threshold parameters of equations (7)-(9) to α = 5/4, β = 3/4 and γ = 4/3, i.e., Th_BUhigher = \overline{BU} + (5/4)σ_BU, Th_BUhigh = \overline{BU} + (3/4)σ_BU and Th_BUlow = (4/3)\overline{BU}. t_retry is set to 240 cycles, which is 2 × (dv + df), and Tw = 20k cycles. Each simulation is run for the entire trace length, measuring up to tens of millions of cycles. The metrics considered are latency and power consumption. Latency spans from the injection of the head flit of a packet until its tail flit is ejected from the destination router. Power savings is the ratio of the aggregate power saved across all links in the network with DVS to the power consumption of all links operating at full frequency (no DVS).
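For clarity, the sketch below shows how one could compute the two reported metrics (power savings and the relative latency-power product Lrel × Prel used in Sections 6.2-6.3) from aggregate simulator outputs; the variable names and the interpretation of "savings" as the fractional reduction versus full-frequency operation are our own assumptions, though they reproduce the reported products.

```python
# Hypothetical post-processing of simulator outputs into the reported metrics.
def power_savings(p_dvs_links, p_full_links):
    """Fractional reduction in aggregate link power vs. all links at full f."""
    return 1.0 - p_dvs_links / p_full_links

def rel_latency_power_product(lat_dvs, lat_nodvs, p_dvs, p_full):
    """L_rel x P_rel; equals 1.0 for a network without DVS."""
    l_rel = lat_dvs / lat_nodvs          # relative latency (>= 1 with a penalty)
    p_rel = p_dvs / p_full               # relative power (< 1 with savings)
    return l_rel * p_rel

# Example with made-up aggregates: 49.4% savings and a 2.8% latency penalty
# give a product of about 0.52, as reported for the RAW suite in Section 6.2.
print(round(power_savings(0.506, 1.0), 3))                          # 0.494
print(round(rel_latency_power_product(1.028, 1.0, 0.506, 1.0), 2))  # 0.52
```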

Figure 9: Link power savings for a suite of benchmarks running on a RAW CMP.

Figure 10: Network latency penalty for a suite of benchmarks running on a RAW CMP.

Figure 11: Link power savings for a suite of benchmarks running on a TRIPS CMP.


6.2 RAW Architecture with RAW VersaBench Applications

The RAW CMP [27] comprises 16 identical tiles, each with its own pipelined RISC processor, memory, computational resources and programmable routers, with each tile interconnected to its closest neighbors in a 4 × 4 mesh array. It uses an ISA in which all of the raw hardware resources, including interconnect wire delays, are fully exposed to the software interface, allowing the compiler to optimize program execution by mapping and scheduling parallelized code onto each tile. An interesting feature of this architecture is the static network that allows the implementation of compiler-directed routing among tiles. This network provides ordered, flow-controlled and reliable transfer of single-word operands and data streams between functional units. The static router at each tile has its own instruction memory and is thus programmable by the compiler. This memory holds a corresponding switching instruction for each operand to be sent on the network, with instructions programmed statically in advance at compile time and then cached in the memory. Thus the static routers collectively configure the entire network on a cycle-by-cycle basis.

To evaluate the effectiveness of our software-directed DVS methodology, we ran binaries compiled by the RAW compiler on the RAW cycle-accurate simulator, which accurately matches hardware timing, and extracted communication traces from the static network of a RAW CMP. These traces contain all the information required by our methodology: the router switching time stamp of an operand and the operand's source and destination tiles. Figure 9 shows the link power savings and Figure 10 the corresponding latency of 9 benchmarks running on a RAW CMP. This suite includes a mix of streaming (streams), bit-level (802.11a encoder), SPECINT2000 [28] (164.gzip) and MediaBench [14] (adpcm) benchmarks. We observe high link power savings across all benchmarks, 49.4% on average, with just a 2.8% latency penalty on average. Lower power savings and a correspondingly smaller impact on latency were observed for fir, as higher network resource utilization was observed, showing our methodology efficiently adapting to increased network usage demands. The product of the relative latency increase and the relative power savings, Lrel × Prel, an indication of the effectiveness of a power-saving strategy, is just 0.52 (without DVS, Lrel × Prel = 1), indicating superior power savings with little performance impact.

Figure 12: Network latency penalty for a suite of benchmarks running on a TRIPS CMP.

6.3 TRIPS Architecture with SPEC and MediaBench Benchmarks
To further evaluate the effectiveness of our proposed power-aware methodology, we obtained network traces from the TRIPS CMP [21]. The TRIPS CMP consists of 4 large, coarse-grained element cores, each an instantiation of the Grid Processor Architecture (GPA) containing an ALU execution array and local L1 memory tiles interconnected via a 5 × 5 mesh network. TRIPS network packets carry data (operands for instructions or addresses to memory) along with associated status information. Network traces were obtained from simulations of a suite of sixteen SPEC [28] and MediaBench [14] benchmarks. The traces are in general very bursty, exhibiting high temporal variance along with spatial injection variability among routers: large bursts of packets are injected at some times (see the topmost-right histogram in Figure 15) and zero packets at others, presenting interesting opportunities for power optimization. Figure 11 shows the link power savings and Figure 12 shows the corresponding impact on latency for the 16 benchmarks running on TRIPS. We again observe high link power savings across all benchmarks, 70.2% on average, with just a 1.16% latency penalty on average. The worst-case latency penalty, 5.22%, was observed for the mpeg2encode benchmark. The Lrel × Prel product is just 0.301, indicating excellent power savings with little performance impact.


Figure 13: Link power savings of software-directed DVS for three trace benchmarks (mpeg2encode, art, adpcm) from the TRIPS CMP, across the threshold sets and sampling window sizes.

Figure 14: Network latency penalty of software-directed DVS for three trace benchmarks (mpeg2encode, art, adpcm) from the TRIPS CMP, across the threshold sets and sampling window sizes.

Discussion. The TRIPS traces exhibit higher power savings and a smaller impact on performance compared to the RAW traces for two reasons. First, the RAW static network exhibited a higher utilization ratio than the TRIPS network, providing fewer opportunities for down-ramping (V, f) pairs for further power reduction. Second, the TRIPS traffic exhibited greater spatial and temporal variance than the RAW traffic, with less frequent bursts (greater gaps) of injected traffic, providing greater opportunities for power optimization along with smaller latency increases.
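The burstiness and idle gaps discussed above can be quantified directly from a trace's injection time stamps; the sketch below is our own illustration (the window size and helper are arbitrary choices, not the paper's analysis).

```python
from collections import Counter
from typing import List

def burst_profile(inject_cycles: List[int], window: int = 1000) -> Counter:
    """Count injected packets per fixed window of cycles. A few large counts
    alongside many zero-count windows indicates bursty traffic with idle gaps
    that link DVS can exploit."""
    return Counter(cycle // window for cycle in inject_cycles)

# Example: injection bursts separated by long idle gaps
cycles = [10, 12, 15, 5_000, 5_001, 5_002, 5_003, 40_000]
profile = burst_profile(cycles)
idle_windows = set(range(max(profile) + 1)) - set(profile)
print(dict(profile))      # {0: 3, 5: 4, 40: 1} -> window index to burst size
print(len(idle_windows))  # 38 zero-injection windows (gaps)
```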

6.4 Alpha 21364 Architecture with SPLASH-2 Benchmarks
Finally, we apply the proposed software directives to a chip-to-chip network. We ran three benchmarks from the SPLASH-2 suite [32] on the RSIM [18] shared-memory cache-coherent multiprocessor infrastructure, modeling Alpha 21364 processor nodes and cache coherence, and collected their traffic traces. Note that, unlike the previous two studies, this multiprocessor architecture is not modeled with its original simulator, so the traces will not precisely match those of the Alpha 21364; we mimicked published Alpha 21364 parameters with RSIM parameters as closely as possible.

Table 3: Average link power savings and network latency penalty for SPLASH-2 benchmarks.

Benchmark   Link power savings   Network latency penalty
fft         26.79%               6.78%
lu          25.82%               6.49%
radix       13.44%               3.86%

Table 3 shows the link power savings and the corresponding latency penalties for the 3 SPLASH-2 benchmarks. Network link power savings average 20.75% with a 5.54% latency increase. The Lrel × Prel product is 0.836, indicating good power-performance responsiveness of our methodology.
Discussion. Here we observe smaller power savings than in the two on-chip architectures, for a number of reasons. The SPLASH-2 benchmarks are designed to evaluate off-chip shared-address-space architectures, where packet sizes are considerably larger than in on-chip architectures. Network traffic consists of either 16-flit packets, carrying control information such as cache-line requests and coherence protocol actions, or 80-flit packets carrying replies that can contain cache-line contents. Traffic patterns were also observed to be considerably less bursty and more uniform than on-chip traffic, translating to decreased spatial and temporal variability. In summary, the SPLASH-2 network traffic imposes increased communication demand on the network links, with sparser opportunities for lowering power. Though smaller power savings were observed here, our power-aware policies adapt to a wide spectrum of applications and network utilization demands, lowering power consumption while maintaining high interconnection network performance.

6.5 Discussion: Threshold Perturbation
Here, we demonstrate the relative invariance of our techniques to variations in thresholds and sampling windows when applied to a range of benchmarks, concurrently assessing our approach's resilience across three combinations of thresholds and three LUNA sampling windows (Tw). Referring to Equations 7-10, we use three (α, β, γ) 3-tuples, (1, 1/2, 3/4), (4/5, 3/4, 3/4) and (4/5, 3/4, 1/2), for sets 1 to 3 respectively. For each set, Tw is placed at 5k, 20k and 50k system cycles. Set 1 exhibits more aggressive behavior, with greater power savings expected along with a higher impact on latency. Set 2 exhibits more responsive behavior, with a smaller impact on power and latency expected. Set 3 uses the same α and β as set 2 but a smaller γ, so (V, f) backing-off is prolonged: the traffic has to settle before the link can retry a transition to a lower (V, f) level.
It is clear from Figure 13 that consistently (for each application) high power savings, 73.8% on average, can be achieved for the three TRIPS applications. Moreover, Figure 14 shows a consistently (for each application) low latency impact, just 2.74% on average. It is evident that the latency penalty and power savings are almost invariant to Tw: the sampling period does not affect power-performance. Relatively high consistency in power-performance results is also observed across the three threshold combinations, with very small variance. This is because the thresholds are customized to each application, based upon the expected average of the output buffer utilization, BU, and its standard deviation, σBU, capturing only the outlier cases of high buffer utilization at which network contention is most likely to occur. These results stand in stark contrast to the behavior of the ad hoc hardware approaches shown in Section 2.2 (see Figures 1 and 2).
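The sketch below enumerates the nine (threshold set, Tw) configurations swept here, using the tuples as decoded above, and shows an illustrative mean-plus-deviation threshold rule; the paper's actual thresholds follow Equations 7-10, which this stand-in does not reproduce.

```python
# Illustrative only: the real thresholds follow Equations 7-10 with the
# (alpha, beta, gamma) factors; this mean + k*sigma rule is a stand-in that
# captures the idea of flagging only outlier buffer utilizations.
import statistics
from itertools import product

def congestion_threshold(bu_samples, k=1.0):
    mean_bu = statistics.fmean(bu_samples)    # expected average utilization
    sigma_bu = statistics.pstdev(bu_samples)  # its standard deviation
    return mean_bu + k * sigma_bu

print(round(congestion_threshold([0.10, 0.15, 0.12, 0.55, 0.11]), 3))

# The perturbation study sweeps three (alpha, beta, gamma) sets over three
# LUNA sampling windows T_w (in system cycles):
threshold_sets = {1: (1, 1/2, 3/4), 2: (4/5, 3/4, 3/4), 3: (4/5, 3/4, 1/2)}
windows = (5_000, 20_000, 50_000)
for set_id, t_w in product(threshold_sets, windows):
    print(f"Set {set_id} {threshold_sets[set_id]} with T_w = {t_w} cycles")
```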

6.6 Discussion: Traffic Perturbation
Though parallelizing compilers such as RAWCC [15] statically schedule instructions across the network in both space and time, the estimated message flow timing information is not always exact; due to dynamic events such as data dependencies, dynamic memory references and I/O operations, some message flows may not be routed at their pre-set times. We evaluate the resilience of our proposed technique to inaccuracies in message flow injection/arrival timings by artificially perturbing the entire application flow.


Figure 15: Profile of perturbed mpeg2encode from the TRIPS CMP (panels plot injected burst size, in packets, versus time, in cycles, for σ = 0, 10, 100 and 1000).


Figure 16: Profile of perturbed gzip from the RAW CMP (panels plot injected burst size, in packets, versus time, in cycles, for σ = 0, 10, 100 and 1000).

We consider the compiler-derived injection time of each flow as the mean of a normal distribution and displace this time within ±3σd (standard-deviation flow displacement). We test three cases, where σd takes values of 10, 100 and 1k cycles; σd = 0 indicates no timing perturbation. We set Tw = 5k cycles, allowing minimal steadyTime for our online DVS mechanism to adjust to any expected traffic bursts. Note that these settings represent a highly challenging degree of message flow timing inexactness; in reality, message shuffling of such magnitude is expected to occur infrequently [15]. We apply this scenario to three randomly chosen TRIPS benchmarks and to RAW's 164.gzip. We also show results where the online DVS mechanism is absent, that is, the DVS directives are faithfully executed while runtime hardware statistics are ignored.
Figure 17 shows the TRIPS CMP latency impact with message flow perturbation. Interestingly, the latency penalties are greater for smaller σd. Figure 15 explains this phenomenon: the original flow possesses high temporal variability, with up to 80-packet injection bursts. With σd = 10, message displacements reduce temporal variance, leaving fewer gaps for power optimization, while the bursts remain relatively large. Performance drops as the DVS software directives begin to apply out-of-date (V, f) targets, with the online mechanism backing them off as contention increases; still, the results show much improvement over the previous ad hoc hardware approaches of Section 2. With larger σd the injected flows even out, the burst peaks drop and the latency penalties improve. Also note that with the online DVS mechanism present, latency results are on average 32.4% better than without it, again verifying the mechanism's positive impact on sustaining network performance. Interestingly, Table 4 shows improvement in RAW's results as message displacement increases. As Figure 16 shows, with greater σd the message flow smoothes out and burst peaks drop, filling inactive gaps and leading to better network performance. In conclusion, the impact on latency depends on the profile of the message flows.
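A minimal sketch of this timing perturbation, assuming integer cycle time stamps (our own illustration; the clipping at ±3σd follows the description above):

```python
import numpy as np

def perturb_injection_times(times, sigma_d, seed=0):
    """Displace each compiler-derived injection time by a draw from
    N(0, sigma_d) truncated to +/- 3*sigma_d; sigma_d = 0 leaves the
    schedule untouched."""
    times = np.asarray(times, dtype=np.int64)
    if sigma_d == 0:
        return times
    rng = np.random.default_rng(seed)
    offsets = np.clip(rng.normal(0.0, sigma_d, size=times.shape),
                      -3 * sigma_d, 3 * sigma_d)
    return np.maximum(times + np.rint(offsets).astype(np.int64), 0)

# The three displacement magnitudes used in the study
for sigma in (10, 100, 1_000):
    print(sigma, perturb_injection_times([0, 5_000, 40_000], sigma))
```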

Table 4: Network latency penalty of RAW's perturbed 164.gzip.

Displacement      σd = 0   σd = 10   σd = 100   σd = 1k
Latency penalty   1.50%    0.74%     0.76%      0.61%

Figure 17: Network latency penalty for TRIPS CMP perturbed traffic (adpcm, art and mpeg2encode, each with and without the adaptive online mechanism, versus flit displacement σ).

7. CONCLUSIONS AND FUTURE WORK
This paper proposes software-directed power-aware techniques to address the critical issue of interconnection network power consumption. Our techniques extend the parallelizing compiler flow, statically generating DVS instructions that later direct link (V, f) settings during application runtime. Because our approach offers a number of advantages over related ad hoc work, such as advance consideration of the entire application network flow and the factoring-in of architectural parameters, the DVS directives can tailor the power-performance of a running application, fine-tuning (V, f) transitions to match the network utilization requirements of its message flows. Our results show power reductions of up to 76.3%, with a minor increase in latency, at most 6.78%, for a spectrum of benchmark suites running on three existing network architectures. Future work includes investigating adaptive routing in conjunction with software directives as a means to further improve power-performance by re-directing messages along less congested routes to reduce delays. Further interesting avenues include studying the impact of software directives on future multi-programmed NoCs, where multiple applications run concurrently, sharing network links, buffers, switching and arbitration resources. With parallelizing compilers already offering sophisticated performance enhancements but lagging in power-aware optimizations, we see our technique as taking the critical first steps towards enabling power-aware parallelizing compilers.

Acknowledgments The authors wish to thank the anonymous reviewers for their valuable suggestions and comments. The authors also wish to thank the TRIPS Team at the University of Texas at Austin for supplying the traffic traces of various benchmarks, Michael Taylor of the MIT RAW Team for supplying the RAW simulator and the VersaBench applications, Li Shang of Queen’s University, Canada, for support of his PoPNet network simulator and Amit Kumar of Princeton University for his help on compiling the RAW applications and for running SPLASH-2 benchmarks on RSIM. This work was supported in part by the National Science Foundation (NSF) under contracts CCR-0237540 (CAREER), CCR-0324891 (ITR) and CNS-0305617 as well as the MARCO Gigascale Systems Research Center.

8. REFERENCES
[1] T. D. Burd and R. W. Brodersen. Design issues for dynamic voltage scaling. In Proc. of the 5th International Symposium on Low Power Electronics and Design (ISLPED'00), pp. 9–14, 2000.
[2] X. Chen and L.-S. Peh. Exploring the design space of power-aware opto-electronic network systems. In Proc. of the 11th International Symposium on High-Performance Computer Architecture (HPCA-11), pp. 120–131, Feb. 2005.
[3] W. J. Dally and B. Towles. Route packets, not wires: On-chip interconnection networks. In Proc. of the 41st Design Automation Conference (DAC-41), pp. 684–689, June 2001.
[4] J. Duato. A theory of fault-tolerant routing in wormhole networks. IEEE Transactions on Parallel and Distributed Systems (TPDS), Vol. 8, No. 8, Aug. 1997.
[5] N. Eisley and L.-S. Peh. High-level analysis for on-chip networks. In Proc. of the 7th International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES'04), pp. 104–115, Sept. 2004. (LUNA. [online] http://www.princeton.edu/∼eisley/LUNA.html)
[6] J. Hu and R. Marculescu. Energy-aware communication and task scheduling for network-on-chip architectures under real-time constraints. In Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'04), pp. 10234–10239, Feb. 2004.
[7] InfiniBand Trade Alliance. The InfiniBand architecture. [online] http://www.infinibandta.org.
[8] A. Jalabert et al. ×pipesCompiler: A tool for instantiating application specific networks on chip. In Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'04), pp. 20884–20889, Feb. 2004.
[9] I. Kadayif et al. Exploiting processor workload heterogeneity for reducing energy consumption in chip multiprocessors. In Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'04), pp. 21158–21163, Feb. 2004.
[10] E. J. Kim et al. Energy optimization techniques in cluster interconnects. In Proc. of the International Symposium on Low Power Electronics and Design (ISLPED'03), pp. 459–464, Aug. 2003.
[11] J. Kim and M. Horowitz. Adaptive supply serial links with sub-1V operation and per-pin clock recovery. In Proc. of the International Solid-State Circuits Conference (ISSCC), pp. 1403–1413, Feb. 2002.
[12] J. S. Kim et al. Energy characterization of a tiled architecture processor with on-chip networks. In Proc. of the 8th International Symposium on Low Power Electronics and Design (ISLPED'03), pp. 424–427, Aug. 2003.
[13] L. Kleinrock. Queueing Systems, Vol. 1. John Wiley and Sons, New York, NY, 1975.
[14] C. Lee et al. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proc. of the 30th International Symposium on Microarchitecture (MICRO-30), pp. 330–335, Nov. 1997.
[15] W. Lee et al. Space-time scheduling of instruction-level parallelism on a Raw machine. In Proc. of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS-8), pp. 46–57, Oct. 1998.
[16] J. Luo et al. Simultaneous dynamic voltage scaling of processors and communication links in real-time distributed embedded systems. In Proc. of the Design, Automation and Test in Europe Conference and Exhibition (DATE'03), pp. 11150–11151, 2003.
[17] S. S. Mukherjee et al. The Alpha 21364 network architecture. IEEE Micro, 22(1), 2002.
[18] V. S. Pai et al. RSIM: An execution-driven simulator for ILP-based shared-memory multiprocessors and uniprocessors. IEEE Technical Committee on Computer Architecture Newsletter (TCCA), 35(11), pp. 37–48, Oct. 1997.
[19] C. Patel et al. Power-constrained design of multiprocessor interconnection networks. In Proc. of the 15th International Conference on Computer Design (ICCD'97), pp. 408–416, Oct. 1997.
[20] PoPNet. [online] http://www.princeton.edu/edu/∼lshang/popnet.html
[21] K. Sankaralingam et al. Exploiting ILP, TLP, and DLP with the polymorphous TRIPS architecture. In Proc. of the 30th International Symposium on Computer Architecture (ISCA-30), pp. 422–422, June 2003.
[22] H. Saputra et al. Energy-conscious compilation based on voltage scaling. In Proc. of the Joint Conference on Languages, Compilers and Tools for Embedded Systems: Software and Compilers for Embedded Systems, pp. 2–11, June 2002.
[23] Semiconductor Industry Association. International Technology Roadmap for Semiconductors, 2001. [online] http://public.itrs.net/Files/2001ITRS/Home.htm
[24] L. Shang et al. Dynamic voltage scaling with links for power optimization of interconnection networks. In Proc. of the 9th International Symposium on High-Performance Computer Architecture (HPCA-9), pp. 79–90, Feb. 2003.
[25] V. Soteriou and L.-S. Peh. Design-space exploration of power-aware on/off interconnection networks. In Proc. of the 22nd International Conference on Computer Design (ICCD'04), pp. 510–517, Oct. 2004.
[26] J. M. Stine and N. P. Carter. Comparing adaptive routing and dynamic voltage scaling for link power reduction. Computer Architecture Letters, Vol. 3, June 2004.
[27] M. B. Taylor et al. Evaluation of the Raw microprocessor: An exposed-wire-delay architecture for ILP and streams. In Proc. of the 31st International Symposium on Computer Architecture (ISCA-31), pp. 2–13, June 2004.
[28] The Standard Performance Evaluation Corporation. [online] http://www.spec.org/
[29] H. Wang et al. Orion: A power-performance simulator for interconnection networks. In Proc. of the 35th International Symposium on Microarchitecture (MICRO-35), pp. 294–305, Nov. 2002.
[30] G. Wei et al. A variable-frequency parallel I/O interface with adaptive power-supply regulation. Journal of Solid-State Circuits, 35(11):1600–1610, Nov. 2000.
[31] R. Wilson et al. SUIF: An infrastructure for research on parallelizing and optimizing compilers. ACM SIGPLAN Notices, 29(12), Dec. 1996.
[32] S. C. Woo et al. The SPLASH-2 programs: Characterization and methodological considerations. In Proc. of the 22nd International Symposium on Computer Architecture (ISCA-22), pp. 24–36, June 1995.
[33] F. Worm et al. An adaptive low-power transmission scheme for on-chip networks. In Proc. of the International Symposium on System Synthesis (ISSS), pp. 92–100, 2002.
[34] F. Xie et al. Compile-time dynamic voltage scaling settings: Opportunities and limits. In Proc. of Programming Language Design and Implementation (PLDI), pp. 49–62, June 2003.
