Design Space Exploration for Multicore Architectures: A Power/Performance/Thermal View

Matteo Monchiero¹, Ramon Canal², Antonio González²,³

¹ Dipartimento di Elettronica e Informazione, Politecnico di Milano, Via Ponzio 34/5, 20133 Milano, Italy ([email protected])
² Dept. of Computer Architecture, Universitat Politècnica de Catalunya, Cr. Jordi Girona 1-3, 08034 Barcelona, Spain ([email protected])
³ Intel Barcelona Research Center, Intel Labs-Universitat Politècnica de Catalunya, Cr. Jordi Girona 27-29, 08034 Barcelona, Spain ([email protected])

ABSTRACT

Multicore architectures dominate the current microprocessor design trend. This is due to several reasons: better performance, the thread-level parallelism available in modern applications, diminishing ILP returns, better thermal/power scaling (many small cores dissipate less than a single large and complex one), and ease and reuse of design. This paper presents a thorough evaluation of multicore architectures. The architecture we target is composed of a configurable number of cores, a memory hierarchy consisting of private L1 and L2 caches, and a shared bus interconnect. We consider parallel shared memory applications. We explore the design space spanned by the number of cores, the L2 cache size and the processor complexity, showing the behavior of the different configurations/applications with respect to performance, energy consumption and temperature. Design trade-offs are analyzed, stressing the interdependency of the metrics and design factors. In particular, we evaluate several chip floorplans. Their power/thermal characteristics are analyzed, and they show the importance of considering thermal effects at the architectural level to make the best design choice.

1. INTRODUCTION

Major semiconductor firms have recently proposed microprocessor solutions composed of a few cores integrated on a single chip [1–3]. This approach, named Chip Multiprocessor (CMP), permits efficiently dealing with the power/thermal issues that dominate deep sub-micron technologies and makes it easy to exploit the thread-level parallelism of modern applications. Power has been recognized as a first-class design constraint [4], and many works in the literature target the analysis and optimization of power consumption. Issues related to chip thermal behavior have been addressed only recently, but have emerged as one of the most important factors determining a wide range of architectural decisions [5].



This paper targets a CMP system composed of multiple cores, each with a private L1 and L2 cache hierarchy. This approach has been adopted in several industrial products, e.g., the Intel Montecito [1] and Pentium D [6], and the AMD Athlon X2 [7]. It maximizes design reuse, since this architectural style does not require re-designing the secondary cache, as would be needed for a shared-L2 architecture. Unlike many recent works on design space exploration of CMPs, we consider parallel shared memory applications, which can be considered the natural workload for a small-scale multiprocessor. We target several scientific and multimedia programs. Scientific applications represent the traditional benchmark for multiprocessors, while multimedia programs represent a promising way of exploiting parallelism in everyday computing. Our experimental framework consists of a detailed microarchitectural simulator [8], integrated with the Wattch [9] and CACTI [10] power models. Thermal effects have been modeled at the functional-unit granularity by using HotSpot [5]. This environment makes a fast and accurate exploration of the target design space possible. The main contribution of this work is the analysis of several energy/performance design trade-offs when varying core complexity, L2 cache size and number of cores, for parallel applications. In particular, we discuss the interdependence of energy/thermal efficiency, performance and the architectural-level chip floorplan. Our findings can be summarized as follows:

• We show that the L2 cache is an important factor in determining chip thermal behavior.

• We show that large CMPs of fairly narrow cores are the best solution for energy-delay.

• We show that the geometric characteristics of the floorplan are important for chip temperature.

This paper is organized as follows. Section 2 presents the related work. Metrics are defined in Section 3. The target design space, as well as the description of the considered architecture, is presented in Section 4. The experimental framework used in this paper is introduced in Section 5. Section 6 discusses performance/energy/thermal results for the proposed configurations. Section 7 presents an analysis of the spatial distribution of the temperature for selected chip floorplans. Finally, conclusions are drawn in Section 8.

2. RELATED WORK

Many works have explored the design space of CMPs from the point of view of different metrics and application domains. This paper extends previous work in several respects. In detail:

Huh et al. [11] evaluate the impact of several design factors on performance. The authors discuss the interactions of core complexity, cache hierarchy, and available off-chip bandwidth. The paper focuses on a workload composed of single-threaded applications. It is shown that out-of-order cores are more effective than in-order ones. Furthermore, bandwidth limitations can force large L2 caches to be used, therefore reducing the number of cores. A similar study was conducted by Ekman et al. [12] for scientific parallel programs.

Kaxiras et al. [13] deal with multimedia workloads, especially targeting DSPs for mobile phones. The paper discusses the energy efficiency of SMT and CMP organizations under given performance constraints. They show that both approaches can be more efficient than a single-threaded processor, and claim that SMT has some advantages in terms of power.

Grochowski et al. [14] discuss how to achieve the best performance for scalar and parallel codes in a power-constrained environment. Their idea is to dynamically vary the amount of energy expended to process instructions according to the amount of parallelism available in the software. They evaluate several architectures, showing that a combination of voltage/frequency scaling and asymmetric cores represents the best approach.

In [15], Li and Martinez study the power/performance implications of parallel computing on CMPs. They use a mixed analytical/experimental model to explore parallel efficiency, the number of processors used, and voltage/frequency scaling. They show that parallel computation can bring significant power savings when the number of processors used and the voltage/frequency scaling levels are properly chosen. These works [14–16] are orthogonal to ours. Our experimental framework is similar to the one of [15, 16]. We also consider parallel applications and the same benchmarks as [16]. The power model we use accounts for thermal effects on leakage energy, but, unlike the authors of [15, 16], we also provide a temperature-dependent model for L2 caches. This makes it possible to capture some particular behaviors of temperature and energy consumption, since the L2 caches may occupy more than half of the chip area in some configurations.

In [17], a comparison of SMT and CMP architectures is carried out. The authors consider performance, power and temperature metrics. They show that SMT and CMP exhibit similar thermal behavior, but the sources of heating are different: localized hotspots for SMTs, global heating for CMPs. They find that CMPs are more efficient for computation-intensive applications, while SMTs perform better for memory-bound programs due to the larger L2 cache available. The paper also discusses the suitability of several dynamic thermal management techniques for CMPs and SMTs. Li et al. [18] conduct a thorough exploration of the multi-dimensional design space for CMPs. They show that thermal constraints dominate other physical constraints such as pin bandwidth and power delivery. According to the authors, thermal constraints tend to favor shallow pipelines and narrower cores, and tend to reduce the optimal number of cores and L2 cache size. The focus of [17, 18] is on single-threaded applications. In this paper, we target a different benchmark set, composed of explicitly parallel applications.

In [19], Kumar et al. provide an analysis of several on-chip interconnects.
The authors show the importance of considering the interconnect while designing a CMP and argue that careful co-design of the interconnect and the other architectural components is needed. To the best of the authors' knowledge, the problem of architectural-level floorplanning has not been addressed for multicore architectures.

Several chip floorplans have been proposed as instruments for thermal/power models, but the interactions between floorplanning issues and CMP power/performance characteristics have not been addressed so far. In [20], Sankaranarayanan et al. discuss some issues related to chip floorplanning for single-core processors. In particular, the authors show that chip temperature variation is sensitive to three main parameters: the lateral spreading of heat in the silicon, the power density, and the temperature-dependent leakage power. We also use these parameters to describe the thermal behavior of CMPs. The idea of distributing the microarchitecture to reduce temperature is exploited in [21], where the organization of a distributed front-end for clustered microarchitectures is proposed and evaluated. Ku et al. [22] analyze the inter-dependence of temperature and leakage energy in cache memories. They focus on single-core workloads. The authors analyze several low-power techniques and propose a temperature-aware cache management approach. In this paper, we account for temperature effects on leakage in the memory hierarchy, stressing its importance for making the best design decisions.

This paper's contribution differs from the previous ones, since it combines a power/performance exploration for parallel applications with an analysis of the interactions with the chip floorplan. Our model takes into account thermal effects and the temperature dependence of leakage energy.

3. METRICS

This section describes the metrics that we use in this paper to characterize the performance and power efficiency of the CMP configurations we explore. We measure the performance of the system when running a given parallel application by using the execution time (or delay). This is computed as the time needed to complete the execution of the program. Delay is often preferred to IPC for multiprocessor studies, since it accounts for the different instruction counts which may arise when different system configurations are used. This mostly happens when the number of cores is varied and the dynamic instruction count changes to deal with a different number of parallel threads. Power efficiency is evaluated through the system energy, i.e. the energy needed to run a given application. We also account for thermal effects, and we therefore report the average temperature across the chip. Furthermore, in Section 7, we evaluate several alternative floorplans by means of the spatial distribution of the temperature (the thermal map of the chip). The Energy Delay Product (EDP) and Energy Delay2 Product (ED2P) are typically used to evaluate energy-delay trade-offs. EDP is usually preferred when targeting low-power devices, while ED2P is preferred when focusing on high-performance systems, since it gives higher priority to performance over power. In this paper, we use ED2P. Nevertheless, we also discuss EDnP optimality for the considered design space. We use concepts taken from [23]; in particular, we adopt the notion of hardware intensity to properly characterize the complexity of different configurations.
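To make the metric definitions concrete, the short sketch below computes EDnP from an energy/delay pair. It is a minimal illustration with made-up numbers, not a computation on our measured results.

```python
# Minimal sketch of the energy-delay metrics used in this paper.
# The numbers below are illustrative placeholders, not measured results.

def ed_n_p(energy_j: float, delay_s: float, n: float = 2.0) -> float:
    """Energy * Delay^n product: n=1 gives EDP, n=2 gives ED2P."""
    return energy_j * delay_s ** n

# Hypothetical configuration: 15 J consumed over 0.4 s of execution.
energy, delay = 15.0, 0.4
print("EDP  =", ed_n_p(energy, delay, n=1))   # favors low power
print("ED2P =", ed_n_p(energy, delay, n=2))   # favors high performance
```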

4. DESIGN SPACE

We consider a shared memory multiprocessor, composed of several independent out-of-order cores (each with a private L2 cache), communicating over a shared bus. The detailed microarchitecture is discussed in Section 5. In this section, we focus on the target design space, which is composed of the following architectural parameters:

• Number of cores (2, 4, 8). This is the number of cores present in the system.

• Issue width (2, 4, 6, 8). We modulate the number of instructions which can be issued in parallel to the integer/floating point reservation stations. According to this parameter, many microarchitectural blocks are scaled; see Table 2 for details. The issue width is therefore an index of core complexity.

• L2 cache size (256, 512, 1024, 2048 KB). This is the size of the private L2 cache of each processor.

Table 1: Chip Area [mm2]

#cores  L2 Size [KB]   Issue 2   Issue 4   Issue 6   Issue 8
2       256            45        49        62        74
2       512            63        68        80        93
2       1024           108       113       124       138
2       2048           187       192       202       216
4       256            81        90        112       138
4       512            117       126       148       174
4       1024           199       208       228       255
4       2048           350       359       379       406
8       256            162       180       224       275
8       512            235       253       297       348
8       1024           398       416       456       511
8       2048           700       719       758       812

[Figure 1: Floorplans for 2 cores (a: 256/512 KB L2; b: 1024/2048 KB L2) and construction schemes for 4 and 8 cores (c, d). Each 2-core base unit places the cores (P0, P1) with their L1 instruction/data caches side by side, the private L2 caches (L2_0, L2_1) adjacent to them, and the shared bus along one edge.]

For each configuration in terms of number of cores and L2 cache size we consider a different chip floorplan. As in some related works on thermal analysis [5], the floorplan of the core is modeled on the Alpha EV6 (21264) [24]. The area of several units (register files, issue and Ld/St queues, rename units, and FUs) has been scaled according to size and complexity [25]. Each floorplan has been re-designed to minimize dead space and not to increase the delay of the critical loops. Figure 1 illustrates the methodology used to build the floorplans. In particular, to obtain a reasonable aspect ratio we defined two different layouts: one for small caches, 256KB and 512KB (Figure 1.a), and another one for large caches, 1024KB and 2048KB (Figure 1.b). The floorplans for 4 and 8 cores are built by using the 2-core floorplan as a base unit. The base unit for small caches is shown in Figure 1.a: core 0 (P0) and core 1 (P1) are placed side by side, and the L2 caches are placed in front of each core. In the base unit for large caches (Figure 1.b), the L2 is split into two pieces, one in front of the core and the other beside it. In this way, each processor+cache unit is roughly square. For 4 and 8 cores this floorplan is replicated and possibly mirrored, trying to obtain the aspect ratio closest to 1 (a perfect square); a sketch of this selection rule is given below. Figures 1.c/d show how the 4-core and 8-core floorplans are obtained from the 2-core and 4-core floorplans, respectively. For every cache/core configuration, each floorplan has the shape/organization outlined in Figure 1, but a different size (sizes are omitted for clarity). Table 1 shows the chip area for each design. For each configuration, the shared bus is placed according to the communication requirements; its size is derived from [19].
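The selection rule mentioned above (replicate and mirror the base unit, keeping the arrangement whose aspect ratio is closest to 1) can be summarized as follows. The dimensions and the helper function are illustrative assumptions, not the actual layout flow we used.

```python
# Illustrative sketch of the floorplan composition rule: tile copies of the
# 2-core base unit into a grid and pick the grid whose aspect ratio is
# closest to 1 (a square die). Dimensions are made-up placeholders.

def best_tiling(base_w_mm: float, base_h_mm: float, n_units: int):
    """Return (rows, cols) arranging n_units base units closest to a square."""
    best = None
    for rows in range(1, n_units + 1):
        if n_units % rows:
            continue
        cols = n_units // rows
        w, h = cols * base_w_mm, rows * base_h_mm
        aspect = max(w, h) / min(w, h)          # 1.0 means a perfect square
        if best is None or aspect < best[0]:
            best = (aspect, rows, cols)
    return best[1], best[2]

# Hypothetical 2-core base unit of 9 mm x 5 mm, replicated four times for an
# 8-core chip: the procedure picks a 2x2 arrangement.
print(best_tiling(9.0, 5.0, n_units=4))
```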

5. EXPERIMENTAL SETUP

The simulation infrastructure that we use in this paper is composed of a microarchitecture simulator modeling a configurable CMP, integrated power models, and a temperature modeling tool. The CMP simulator is SESC [8]. It models a multiprocessor composed of a configurable number of cores. Table 2 reports the main parameters of the simulated architectures.


Each core is an out-of-order superscalar processor, with private caches (instruction L1, data L1, and a unified L2 for instructions and data). The 4-issue microarchitecture models the Alpha 21264 [24]. For the other issue widths, all processor structures are scaled accordingly, as shown in Table 2. Inter-processor communication takes place over a high-bandwidth shared bus (57 GB/s), pipelined and clocked at half of the core clock (see Table 2). The coherence protocol acts directly among the L2 caches and is MESI snoopy-based. This protocol generates invalidate/write requests between the L1 and L2 caches to ensure the coherency of shared data. Memory ordering is governed by a weak consistency model.

The power model integrated in the simulator is based on Wattch [9] for processor structures, CACTI [10] for caches, and Orion [26] for buses. The thermal model is based on HotSpot 3.0.2 [5]. HotSpot uses dynamic power traces and the chip floorplan to drive the thermal simulation. As outputs, it provides the transient behavior and the steady-state temperature. Following [5], we chose a standard cooling solution with a thermal resistance of 1.0 K/W and an ambient air temperature of 45 °C. We carefully select the initial temperature values to ensure that the thermal simulation converges to a stationary state. HotSpot has been augmented with a temperature-dependent leakage model, based on [27]. This model accounts for the different transistor densities of functional units and memories. Temperature dependency is modeled by varying the amount of leakage according to the proper exponential distribution as in [27]. At each iteration of the thermal simulation, the leakage contribution is calculated and 'injected' into each HotSpot grid element.

Table 3 lists the benchmarks we selected: three scientific applications from the Splash-2 suite [28] and the MPEG2 encoder/decoder from ALPBench [29]. All benchmarks have been run to completion and statistics have been collected on the whole program run, after skipping initialization.

Table 2: Architecture configuration. Where four values are given, they correspond to the 2-, 4-, 6- and 8-issue cores; the issue width is used as representative of a core configuration throughout the paper.

Processor                 3 GHz @ 70 nm
Vdd                       0.9 V
Core area (w/ L1$) [mm2]  31 / 39 / 55 / 78
Branch penalty            7 cycles
Branch unit               BTB (1K entries, 2-way), Alpha-style hybrid branch predictor (3.7KB), RAS (32 entries)
Fetch/issue/retire width  2/2/4 / 4/4/6 / 6/6/8 / 8/8/10
Issue Queue size          16 / 32 / 48 / 64
INT I-Window size         10 / 20 / 30 / 40
FP I-Window size          5 / 15 / 25 / 35
INT registers             40 / 80 / 120 / 160
FP registers              32 / 72 / 112 / 152
ROB size                  40 / 80 / 120 / 160
LdSt/Int/FP units         1/2/1 / 1/4/2 / 1/6/3 / 1/8/4
Ld/St queue entries       16/16 / 32/32 / 48/48 / 64/64
IL1                       64KB, 2-way, 64B block, 2 ports
IL1 hit/miss lat          2/1 cycles
ITLB entries              64
DL1                       64KB, 2-way, 64B block, write-through
DL1 hit/miss lat          2/1 cycles
DL1 MAF size              8
DTLB entries              128
Unified L2                256 KB / 512 KB / 1024 KB / 2048 KB; 8-way, 64B block, write-back, 2 ports
L2 hit/miss lat           10/4 cycles
L2 MAF size               32
Coherence protocol        MESI CMP
#cores                    2 / 4 / 8
Shared bus                76B, 1.5 GHz, 10 cycles delay
Shared bus bandwidth      57 GB/s
Memory bus bandwidth      6 GB/s
Memory lat                490 cycles

Table 3: Benchmarks

Benchmark   #graduated instructions (M)   Description – Problem size
FMM         4387–5741                     Fast Multipole Method – 16k particles, 10 steps
mpeg2dec    1168                          MPEG-2 decoder – flowg (Stanford), 352×240, 10 frames
mpeg2enc    4275                          MPEG-2 encoder – flowg (Stanford), 352×240, 10 frames
VOLREND     1425–1888                     Volume rendering using ray casting – head, 50 viewpoints
WATER-NS    1780                          Forces and potentials of water molecules – 512 molecules, 50 steps

For the thermal simulations, we used the power trace of the whole benchmark run (dumped every 10,000 cycles). We used the standard data sets for Splash-2, while we limited the MPEG2 runs to 10 frames. Table 3 shows the total number of graduated instructions for each application. This number is fairly constant across all the simulated configurations for all benchmarks, except for FMM and VOLREND; their variation (23-24%) is related to the different thread-spawning and synchronization overheads as the number of cores is scaled.
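As an illustration of how the temperature-dependent leakage is injected into the thermal simulation, the sketch below scales a nominal per-cell leakage power exponentially with the cell temperature. The coefficient and reference temperature are assumptions chosen to match the roughly 4-5% per °C sensitivity reported in [22], not the calibrated model of [27].

```python
import math

# Sketch of a temperature-dependent leakage update (assumed coefficients,
# not the calibrated model of [27]): leakage grows exponentially with
# temperature, and the result is fed back ("injected") into the thermal grid.

T_REF_K = 318.15      # reference temperature (45 C ambient), assumption
BETA = 0.045          # per-kelvin exponential factor (~4-5% per degree C)

def leakage_at(p_leak_ref_w: float, temp_k: float) -> float:
    """Scale reference leakage power to the current cell temperature."""
    return p_leak_ref_w * math.exp(BETA * (temp_k - T_REF_K))

def inject_leakage(grid_temps, grid_leak_ref):
    """Return per-cell power to add to the dynamic power of each grid cell."""
    return [leakage_at(p, t) for p, t in zip(grid_leak_ref, grid_temps)]

# Tiny example: three grid cells, two of them 5 K and 10 K hotter than reference.
print(inject_leakage([318.15, 323.15, 328.15], [0.02, 0.02, 0.02])[0:0] or
      inject_leakage([0.02, 0.02, 0.02], [318.15, 323.15, 328.15]))
```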

6. PERFORMANCE AND POWER EVALUATION

In this section, we evaluate the simulated CMP configurations from the point of view of performance and energy. We consider the interdependence of these metrics as well as chip physical characteristics (temperature and floorplan). We base this analysis on average values across all the simulated benchmarks, apart from Section 6.4, which presents a characterization of each benchmark.

[Figure 2: Execution time (delay) in cycles. Each group of bars (L2 cache varying from 256 KB to 2048 KB) is labeled with <#procs>-<#issue>.]

6.1 Performance

Figure 2 shows the average execution time for all the configurations, in terms of cores, issue width and L2 cache. Results in terms of IPC show the same behavior. It can be observed that when scaling the system from 2 to 4, and from 4 to 8 processors, the delay is reduced by 47% at each step. This trend is homogeneous across the other parameter variations. It means that these applications achieve high parallel efficiency, and that the communication overhead is not appreciable. By varying the issue width, for a given processor count and L2 size, the largest speedup is observed when moving from 2- to 4-issue (21%). If the issue width is increased further, the delay improvement saturates (5.5% from 4- to 6-issue, and 3.3% from 6- to 8-issue), because of diminishing ILP returns. This trend is also seen (though less dramatically) for the 4- and 8-core configurations. The impact of the L2 cache size is at most 4-5% (from 256KB to 2048KB) when the other parameters are fixed; each time the L2 size is doubled, the speedup is around 1.2-2%. This behavior is also orthogonal to the other parameters, since the trend can be seen across all configurations.
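The scaling figures above reduce to the usual speedup and parallel-efficiency calculations. The sketch below shows them on hypothetical delay values consistent with the average 47% reduction per core doubling, not on our raw data.

```python
# Speedup and parallel efficiency from execution delays (cycles).
# The delays below are hypothetical placeholders, not measured results.

def speedup(delay_base: float, delay_scaled: float) -> float:
    return delay_base / delay_scaled

def parallel_efficiency(delay_base, delay_scaled, cores_base, cores_scaled):
    return speedup(delay_base, delay_scaled) / (cores_scaled / cores_base)

# A ~47% delay reduction per core doubling, as observed on average in Figure 2,
# corresponds to a speedup of about 1.9x and an efficiency of about 0.94.
d2, d4 = 1.0e9, 0.53e9
print(speedup(d2, d4), parallel_efficiency(d2, d4, cores_base=2, cores_scaled=4))
```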

6.2 Energy

In this section, we analyze the effect of varying the number of cores, the issue width and the L2 cache size on the system energy.

Number of cores. Figure 3 reports the energy behavior. It can be seen that, when the number of processors is increased, the system energy increases only slightly, although the chip area nearly doubles as the core count doubles. For example, for a 4-issue 1024KB configuration, the energy increase is 5% (both from 2 to 4 cores and from 4 to 8 cores). In fact, the leakage energy increase due to the larger chip is mitigated by the reduction of the execution time, i.e., the time during which the chip is powered on and leaking is shorter. On the other hand, when considering configurations with a given resource budget and similar area (see Table 1), the energy typically decreases. For a total issue width of 16 and 2MB of total cache, the energy decreases by 37% from 2 to 4 cores. From 4 to 8 cores it slightly increases, but the 8-core chip is also 22% larger. In this case, the shorter delay makes the difference, so relatively large CMPs achieve good energy efficiency.

[Figure 3: Energy (J). Each group of bars (L2 cache varying) is labeled with <#procs>-<#issue>.]

[Figure 4: Leakage energy (J). The core/L2 leakage breakdown is represented by a dash superposed on each bar.]

[Figure 5: Average chip temperature (°C). Each group of bars (L2 cache varying) is labeled with <#procs>-<#issue>.]

Issue width. By varying the issue width and, therefore, the core complexity, the energy has a minimum at 4-issue. This occurs for 2, 4 and 8 processors. The 4-issue cores offer the best balance between complexity, power consumption and performance. The 2-issue microarchitecture cannot efficiently extract the available ILP (the leakage due to the longer delay dominates). The 8-issue cores need much more energy to extract little more parallelism than the 4- and 6-issue ones, and are therefore quite inefficient.

L2 size. The L2 cache is the major contributor to chip area, so it affects leakage energy considerably. The static energy component in our simulations ranges from 37% (2P, 8-issue, 256KB L2) to 72% (8P, 2-issue, 2048KB L2). Figure 4 shows the leakage energy and the core/L2 leakage breakdown. Most of the leakage is due to the L2 caches, and this component scales linearly with the L2 cache size. The static power of the core increases as the issue width and the number of cores increase. Nevertheless, this effect is mitigated by larger L2 caches, since they enhance the heat-spreading capability of the chip (the overall temperature decreases, resulting in less leakage energy).

Average chip temperature is shown in Figure 5. Although the absolute variation is at most 1.7 °C, even a small temperature variation can correspond to an appreciable energy change. In [22], SPICE simulation results are reported showing that 1 °C translates into around 4-5% leakage variation. In our simulations, the temperature reduction due to the caches causes the reduction of the core leakage energy shown in Figure 4: a 5% energy reduction (core leakage) corresponds to a 0.5 °C reduction of the average chip temperature. On the contrary, leakage in the cache is much more sensitive to area than to temperature, so the leakage increase due to the larger area cannot be compensated by the temperature decrease.

[Figure 6: Energy-Delay scatter plot for the simulated configurations, with curves of constant EDP, ED1.5P and ED2P.]

6.3 Energy Delay

Figure 6 reports the Energy-Delay scatter plot for the design space, together with several curves of constant EDnP, with n = 1, 1.5, and 2. Furthermore, we connected with a line all the points of the energy-efficient family [23]. In the Energy-Delay plane, these points form the convex hull of all possible configurations of the design space.

[Figure 7: Energy Delay2 Product (Js2). Each group of bars (L2 cache varying) is labeled with <#procs>-<#issue>.]

[Figure 8: IPC for each benchmark, for the 8P 4-issue 256KB-L2 and 4P 8-issue 512KB-L2 configurations.]

[Figure 9: Energy per cycle (nJ/cycle) for each benchmark, for the same two configurations.]

[Figure 10: Average chip temperature (°C) for each benchmark, for the same two configurations.]

The best configuration from the Energy-Delay point of view is the 8-core 4-issue 256KB-L2 one. It is optimal for EDnP with n < 2.5, as can be intuitively observed in Figure 6. If n ≥ 2.5, the optimum moves to 512KB L2, and therefore to the points of the energy-efficient family with higher hardware intensity¹. If energy alone is minimized, the optimal solution is the 2-core 4-issue 256KB-L2 (low hardware intensity). The 2-core 2-issue configurations are penalized by poor performance, while the 4-core 4-issue 256KB-L2 represents an interesting trade-off, featuring only 5.5% worse energy and 4.8% better delay. Nevertheless, one could also consider other metrics, such as area (see Table 1) or average temperature (see Figure 5). In the first case (area), a smaller core count can be preferred: for example, the 4-core 4-issue 256KB-L2 configuration needs only 90 mm2, while the optimal one (8-core 4-issue 256KB-L2) requires 180 mm2 and features 8.7% better delay. Otherwise, if chip temperature is to be minimized, larger caches can be preferred.


For example, the 8-core 4-issue 2048KB-L2 configuration features a 0.7 °C lower average temperature. This comes at the expense of significant leakage power consumption in the L2 caches (see Figure 4).

In Figure 6, we represent as squares the points with 2MB of total L2 cache (equivalent to the optimum). They form three clusters, each one corresponding to a core/cache configuration: 8/256KB, 4/512KB and 2/1024KB. These points show that increasing the number of cores, while decreasing the per-processor L2 size, improves the energy-delay. Furthermore, when considering the same total issue width (8-core 4-issue and 4-core 8-issue), the advantage of reducing the issue width while increasing the core count becomes evident. The parallelism of the applications in the considered benchmark set is exploited much better by multi-core architectures than by complex monolithic cores. It can also be observed that each cluster progressively 'rotates' as it moves towards the optimum, making different issue widths more similar from the point of view of delay.

Figure 7 shows the Energy Delay2 Product for the simulated configurations. ED2P strongly favors high-performance architectures. In fact, 8-core CMPs, regardless of the other parameters, are better than any other architecture: from 2% (w.r.t. large caches) up to 68% (w.r.t. the ED2P optimum). These results show that the rule of thumb that configurations offering the same ILP (#cores × issue width) have similar energy/performance trade-offs is not totally accurate when considering ED2P (it still holds for EDP). For example, the 4-issue 2-core 512KB CMP is equivalent to the 2-issue 4-core 256KB CMP in terms of hardware budget, but not in ED2P; note that both configurations have the same total issue width and total L2 cache capacity. Nevertheless, the area budget favors the 2-core over the 4-core (68 mm2 vs 81 mm2 die area).

Cache size has a relatively small impact on performance (see Figure 2), while it is important for energy. For the considered data sets, a 256KB L2 cache per processor can be considered enough; moreover, it minimizes energy. It should be pointed out that the L2 size affects the core and the L2 leakage differently (see Figure 4). Intelligent cache management, aware of which data are actually used, could considerably reduce the dependence of the leakage on the L2 size, making larger caches more attractive when considering energy/delay trade-offs.

¹ Hardware intensity is defined as the ratio of the relative increase in energy to the corresponding relative gain in performance achievable by an architectural modification. See [23] for a detailed discussion.
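The energy-efficient family plotted in Figure 6 is, in essence, the set of non-dominated points in the energy-delay plane. The sketch below extracts such a set with a plain Pareto filter on illustrative data points; it is not the exact convex-hull/hardware-intensity construction of [23].

```python
# Sketch: extract the energy-efficient (Pareto-optimal) configurations from
# (delay, energy) pairs. The data points are illustrative, not our results.

def pareto_front(points):
    """Keep points not dominated in both delay and energy (smaller is better)."""
    front = []
    for name, d, e in points:
        dominated = any(d2 <= d and e2 <= e and (d2 < d or e2 < e)
                        for _, d2, e2 in points)
        if not dominated:
            front.append((name, d, e))
    return sorted(front, key=lambda p: p[1])

configs = [            # (label, delay [Gcycles], energy [J]) -- hypothetical
    ("2c-4i-256KB", 1.8, 11.0),
    ("4c-4i-256KB", 1.0, 11.6),
    ("8c-4i-256KB", 0.55, 12.5),
    ("8c-8i-2048KB", 0.56, 24.0),   # dominated: worse in both metrics
]
print(pareto_front(configs))
```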

6.4 Per Benchmark Results

Figures 8, 9, and 10 show the IPC, energy per cycle, and average temperature for each benchmark. Results are reported for two configurations: the optimal one (4-issue, 8 procs., 256KB L2) and the one with an equivalent total cache/issue budget, i.e. 8-issue, 4 procs., 512KB L2. The benchmarks represent two different kinds of workload: the MPEG2 decoder and encoder are integer video applications, stressing the integer units and the memory system, while FMM, VOLREND, and WATER-NS are floating point scientific benchmarks, each featuring different characteristics in terms of IPC and energy. The IPC ranges from around 3 (FMM, 4 cores) to 12 (WATER-NS, 8 cores), showing up to a 100% difference across the benchmarks. Note that the IPC of mpeg2enc and FMM is much lower than that of VOLREND and mpeg2dec (3-5 vs 5-8), while WATER-NS stands out with a really high IPC (9-12). On the other hand, the energy per cycle does not vary much across benchmarks (24-32 nJ/cycle) and is also quite similar for the 4-core and the 8-core CMPs. Temperature variability is 1.7 °C across the benchmarks. The temperature difference between the two architectures reported in the graphs is about 0.5 °C for mpeg2enc and FMM, 0.25 °C for VOLREND and mpeg2dec, and negligible for WATER-NS.

7. TEMPERATURE SPATIAL DISTRIBUTION

This section discusses the spatial distribution of the temperature (i.e. the chip thermal map) and its relationship with the floorplan design. We first analyze several different floorplan designs for CMPs. We then discuss the temperature distribution while varying the processor core complexity.

7.1 Evaluation of Floorplans

The main goal of this section is to discuss how floorplan design at the system level can impact the chip thermal behavior. Several floorplan topologies, featuring different core/cache placements, are analyzed. We reduce the scope of this part of the work to 4-core, 4-issue CMPs with L2 cache sizes of 256KB and 1024KB; similar results are obtained for the other configurations. Figure 11 shows the additional floorplans we have considered in addition to those in Figure 1. Floorplans are classified with respect to the processor position in the die:

• Centered (Figure 11.a and Figure 11.c). Cores are placed in the middle of the die.

• Lined up (Figure 11.b and Figure 11.d). Cores are lined up on one side of the die.

• Paired (Figure 1). Cores are paired and placed on alternate sides of the die.

[Figure 11: Additional floorplans taken into account for the evaluation of the spatial distribution of the temperature: (a) 256KB centered, (b) 256KB lined up, (c) 1024KB centered, (d) 1024KB lined up.]

Area and aspect ratio are roughly equivalent for each cache size and topology. We report the thermal map for each floorplan type and cache size. This is the stationary solution as obtained by HotSpot. We use the grid model, making sure that the grid spacing is 100 µm for each floorplan. The grid spacing determines the magnitude of the hotspots, since each grid point roughly takes the average temperature of the 120 µm × 120 µm area surrounding it; our constant grid spacing ensures that the hotspot magnitude is comparable across designs. The power traces are from the WATER-NS benchmark. We selected this application as representative of the behavior of all the benchmarks, and the conclusions drawn in this analysis apply to each simulated application. In each thermal map (take Figure 12 as an example), several common phenomena can be observed: several temperature peaks correspond to each processor, due to the hotspots in the core.

Table 4: Average and maximum chip temperature for different floorplans

             256KB                  1024KB
Floorplan    Max [°C]   Av. [°C]    Max [°C]   Av. [°C]
Paired       74.2       63.8        73.4       62.2
Lined up     74.0       63.5        74.2       62.2
Centered     72.4       63.6        73.0       62.2

The nature of the hotspots will be analyzed in the next section. The caches, in particular the L2 cache, and the shared bus are cooler, apart from the portions which are heated by the proximity of hot units. The paired floorplans for 256KB (Figure 12) and 1024KB (Figure 15) show similar thermal characteristics. The temperature of the two hotspots of each core (the hottest is the FP unit, the other is the Integer Window coupled with the Ld/St queue) ranges from 73.1/71.4 to 74.2/72 °C for 256KB L2. The hottest peaks are the ones closest to the die corners (the FP units on the right side), since they have less surrounding silicon into which to spread heat. For 1024KB L2 the hotspot temperatures are between 72.5/71.5 and 73.4/72 °C; in this case the hottest peaks are between each core pair, because of the thermal coupling between processors. The same phenomena appear for the lined up floorplans (Figure 12 and Figure 15): the rightmost hotspots suffer from the corner effect, while the inner ones suffer from thermal coupling.

Figure 14 shows an alternative design, where the processors are placed in the center of the die (see Figure 11.a). Their orientation is also different, since they are back to back. As can be observed in Table 4, the maximum temperature is lowered significantly (1.6/1.8 °C). This is due to the increased spreading perimeter of the processors: two sides face the surrounding silicon, and another side faces the bus (which goes across the die, see Figure 11.a). The centered floorplans are the only ones with a fairly hot bus; in this case, leakage in the bus can be significant. Furthermore, for this floorplan, the region between the backs of the processors is quite 'hot' (69.4 °C), unlike the other floorplans, which feature relatively 'cold' backs (67.9 °C). For the 1024KB centered floorplan (Figure 17) these characteristics are less evident: the maximum temperature decrease is only 0.4/1.2 °C with respect to the paired and lined up floorplans. In particular, it is small when compared with the paired one. This is reasonable considering that, in the paired floorplan, the 1024KB cache surrounds each pair of cores, therefore providing enough spreading area.

Overall, the choice of the cache size and the relative placement of the processors can be important in determining the chip thermal map.

[Figure 12: Thermal map for the 256KB paired floorplan (Figure 1).]
[Figure 13: Thermal map for the 256KB lined up floorplan (Figure 11.b).]
[Figure 14: Thermal map for the 256KB centered floorplan (Figure 11.a).]
[Figure 15: Thermal map for the 1024KB lined up floorplan (Figure 11.d).]
[Figure 16: Thermal map for the 1024KB paired floorplan (Figure 1).]
[Figure 17: Thermal map for the 1024KB centered floorplan (Figure 11.c).]

Those layouts where the processors are placed in the center of the die and the L2 caches surround the cores typically feature a lower peak temperature. The magnitude of the temperature decrease with respect to the alternative floorplans is limited to approximately 1-2 °C. The impact on the system energy can be considered negligible, since the L2 leakage dominates. Nevertheless, such a reduction of the hotspot temperature can lead to a leakage reduction localized at the hotspot sources.
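The figures reported in Table 4 are simple reductions of the HotSpot thermal grid to its maximum and average temperature. The sketch below shows this reduction on a toy grid; the values and floorplan names are placeholders, not actual simulator output.

```python
# Sketch: reduce a HotSpot-style thermal grid (temperatures in C on a uniform
# 100 um grid) to the peak and average figures reported in Table 4.
# The grids below are tiny made-up examples, not real simulator output.

def summarize(grid):
    cells = [t for row in grid for t in row]
    return max(cells), sum(cells) / len(cells)

floorplans = {
    "paired":   [[74.2, 68.0], [63.0, 61.5]],
    "centered": [[72.4, 69.4], [63.5, 62.0]],
}
for name, grid in floorplans.items():
    t_max, t_avg = summarize(grid)
    print(f"{name:9s} max={t_max:.1f} C  avg={t_avg:.2f} C")
```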

7.2 Processor Complexity

Figure 18 reports the thermal maps for the different issue widths of the processors. We selected the WATER-NS benchmark, a floating point one, since this enables us to discuss the heating in both the FP and the Integer units. The CMP configuration is 2-core 256KB-L2. The hottest units are the FP ALU, the Ld/St queue and the Integer I-Window. Depending on the issue width, the temperature of the Integer Window and the Ld/St queue varies: for 2- and 4-issue, the FP ALU dominates, while as the issue width is scaled up the Integer Window and the Ld/St queue become the hottest units of the die. This is due to the super-linear increase of the power density in these structures.

The hotspot magnitude depends on the issue width, i.e. on the power density of the microarchitectural blocks, and on the 'compactness' of the floorplan. The 4-issue chip has higher temperature peaks than the 6-issue one, because the floorplan of the 6-issue core has many cold units surrounding the hotspots (e.g. see the FP ALU hotspots). Interleaving hot and cold blocks is an effective method to provide spreading silicon to hot areas. Thermal coupling of hotspots exists for various units shown in Figure 18, and it is often caused by inter-processor interaction. For example, in the 2-issue floorplan, the LdSt queue of the left processor is thermally coupled with the FP ALU of the right one; in the 4-issue one, the Integer Execution Unit is warmed by the coupling between the LdSt queue and the FP ALU of the two cores. As for intra-processor coupling, it can be observed in the 4-issue design between the FP ALU and the FP register file, and in the 8-issue design between the Integer Window and the LdSt queue.

[Figure 18: Thermal maps of the 2-core 256KB-L2 CMP for VOLREND, varying the issue width (from left to right: 2-, 4-, 6-, and 8-issue).]

7.3 Discussion

Different factors determine the hotspots in the processors. The power density of each microarchitectural unit provides a proxy for temperature, but it does not suffice to explain the effects due to thermal coupling and spreading area. It is important to care about the geometric characteristics of the floorplan. We summarize those impacting chip heating as follows:

• Proximity of hot units. If two or more hotspots come close, they produce thermal coupling and therefore raise the temperature locally.

• Relative positions of hot and cold units. A floorplan interleaving hot and cold units results in a lower global power density (and therefore a lower temperature).

• Available spreading silicon. Units placed in a position that limits their spreading perimeter, e.g. in a corner of the die, reach higher temperatures.

These principles, all related to the common idea of lowering the global power density, apply to the core microarchitecture as well as to the CMP system architecture. In the first case, they suggest not placing hot units, like register files and instruction queues, too close to each other. For CMPs they can be translated as follows:

• Proximity of the cores. The interaction between two cores placed side by side can generate thermal coupling between some units of the processors.

• Relative position of cores and L2 caches. If the L2 caches are placed so as to surround the cores, this results in better heat spreading and a lower chip temperature.

• Position of the cores in the die. Processors placed in the center of the die offer better thermal behavior than processors in the die corners or beside the die edge.

8. CONCLUSIONS AND FUTURE WORK

In this paper, we have discussed the impact of the choice of several architectural parameters of CMPs on power/performance/thermal metrics. Our conclusions apply to multicore architectures composed of processors with private L2 caches and running parallel applications. Our contributions can be summarized as follows:

• The choice of the L2 cache size is not only a matter of performance/area, but also of temperature.

• According to ED2P, the best CMP configuration consists of a large number of fairly narrow cores. Wide cores can be considered too power hungry to be competitive.

• The geometric characteristics of the floorplan are important for chip temperature. The relative position of processors and caches must be carefully determined to avoid hotspot thermal coupling.

Several other factors may affect design choices, such as area, yield and reliability. At the same time, orthogonal architectural alternatives such as heterogeneous cores, L3 caches, and complex interconnects might be present in future CMPs. Evaluating all these alternatives is left for future research.

9. ACKNOWLEDGMENTS

This work has been supported by the Generalitat de Catalunya under grant SGR-00950, the Spanish Ministry of Education and Science under grants TIN2004-07739-C02-01, TIN2004-03072 and HI2005-0299, and the Ministero dell'Istruzione, dell'Università e della Ricerca under the MAIS-FIRB project.

10. REFERENCES

[1] C. McNairy and R. Bhatia. Montecito: A dual-core, dual-thread Itanium processor. IEEE Micro, pages 10–20, March/April 2005.
[2] P. Kongetira, K. Aingaran, and K. Olukotun. Niagara: A 32-way multithreaded Sparc processor. IEEE Micro, pages 21–29, March/April 2005.
[3] R. Kalla, B. Sinharoy, and J.M. Tendler. IBM Power5 chip: A dual-core multithreaded processor. IEEE Micro, pages 40–47, March/April 2004.
[4] Trevor Mudge. Power: A first class constraint for future architectures. In HPCA '00: Proceedings of the 6th International Symposium on High-Performance Computer Architecture, Washington, DC, USA, 2000. IEEE Computer Society.
[5] Kevin Skadron, Mircea R. Stan, Karthik Sankaranarayanan, Wei Huang, Sivakumar Velusamy, and David Tarjan. Temperature-aware microarchitecture: Modeling and implementation. ACM Trans. Archit. Code Optim., 1(1):94–125, 2004.
[6] Intel. White paper: Superior performance with dual-core. Technical report, Intel, 2005.
[7] Kelly Quinn, Jessica Yang, and Vernon Turner. The next evolution in enterprise computing: The convergence of multicore x86 processing and 64-bit operating systems – white paper. Technical report, Advanced Micro Devices Inc., April 2005.
[8] Jose Renau, Basilio Fraguela, James Tuck, Wei Liu, Milos Prvulovic, Luis Ceze, Smruti Sarangi, Paul Sack, Karin Strauss, and Pablo Montesinos. SESC simulator, January 2005. http://sesc.sourceforge.net.
[9] David Brooks, Vivek Tiwari, and Margaret Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In Proc. of the 27th International Symposium on Computer Architecture, pages 83–94. ACM Press, 2000.
[10] Premkishore Shivakumar and Norman P. Jouppi. CACTI 3.0: An integrated cache timing, power, and area model. Technical Report 2001/2, Western Research Laboratory, Compaq, 2001.
[11] J. Huh, D. Burger, and S. Keckler. Exploring the design space of future CMPs. In PACT '01: Proceedings of the 10th International Conference on Parallel Architectures and Compilation Techniques, pages 199–210, Washington, DC, USA, 2001. IEEE Computer Society.
[12] M. Ekman and P. Stenstrom. Performance and power impact of issue-width in chip-multiprocessor cores. In ICPP '03: Proceedings of the 2003 International Conference on Parallel Processing, pages 359–369, Washington, DC, USA, 2003. IEEE Computer Society.
[13] Stefanos Kaxiras, Girija Narlikar, Alan D. Berenbaum, and Zhigang Hu. Comparing power consumption of an SMT and a CMP DSP for mobile phone workloads. In CASES '01: Proceedings of the 2001 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems, pages 211–220, New York, NY, USA, 2001. ACM Press.
[14] Ed Grochowski, Ronny Ronen, John Shen, and Hong Wang. Best of both latency and throughput. In ICCD '04: Proceedings of the IEEE International Conference on Computer Design, pages 236–243, Washington, DC, USA, 2004. IEEE Computer Society.
[15] J. Li and J.F. Martinez. Power-performance implications of thread-level parallelism on chip multiprocessors. In ISPASS '05: Proceedings of the 2005 International Symposium on Performance Analysis of Systems and Software, pages 124–134, Washington, DC, USA, 2005. IEEE Computer Society.
[16] J. Li and J.F. Martinez. Dynamic power-performance adaptation of parallel computation on chip multiprocessors. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA, 2006. IEEE Computer Society.
[17] Yingmin Li, David Brooks, Zhigang Hu, and Kevin Skadron. Performance, energy, and thermal considerations for SMT and CMP architectures. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 71–82, Washington, DC, USA, 2005. IEEE Computer Society.
[18] Y. Li, B. Lee, D. Brooks, Z. Hu, and K. Skadron. CMP design space exploration subject to physical constraints. In HPCA '06: Proceedings of the 12th International Symposium on High Performance Computer Architecture, Washington, DC, USA, 2006. IEEE Computer Society.
[19] Rakesh Kumar, Victor Zyuban, and Dean M. Tullsen. Interconnections in multi-core architectures: Understanding mechanisms, overheads and scaling. In ISCA '05: Proceedings of the 32nd Annual International Symposium on Computer Architecture, pages 408–419, Washington, DC, USA, 2005. IEEE Computer Society.
[20] K. Sankaranarayanan, S. Velusamy, M. Stan, and K. Skadron. A case for thermal-aware floorplanning at the microarchitectural level. Journal of Instruction Level Parallelism, 2005.
[21] Pedro Chaparro, Grigorios Magklis, Jose Gonzalez, and Antonio Gonzalez. Distributing the frontend for temperature reduction. In HPCA '05: Proceedings of the 11th International Symposium on High-Performance Computer Architecture, pages 61–70, Washington, DC, USA, 2005. IEEE Computer Society.
[22] Ja Chun Ku, Serkan Ozdemir, Gokhan Memik, and Yehea Ismail. Thermal management of on-chip caches through power density minimization. In MICRO 38: Proceedings of the 38th Annual IEEE/ACM International Symposium on Microarchitecture, pages 283–293, Washington, DC, USA, 2005. IEEE Computer Society.
[23] V. Zyuban and P. N. Strenski. Balancing hardware intensity in microprocessor pipelines. IBM Journal of Research & Development, 47(5-6):585–598, 2003.
[24] R. E. Kessler. The Alpha 21264 microprocessor. IEEE Micro, 19(2):24.
[25] Subbarao Palacharla, Norman P. Jouppi, and J. E. Smith. Complexity-effective superscalar processors. In ISCA '97: Proceedings of the 24th Annual International Symposium on Computer Architecture, pages 206–218, New York, NY, USA, 1997. ACM Press.
[26] Hang-Sheng Wang, Xinping Zhu, Li-Shiuan Peh, and Sharad Malik. Orion: A power-performance simulator for interconnection networks. In Proc. of the 35th Annual ACM/IEEE International Symposium on Microarchitecture, pages 294–305, 2002.
[27] Weiping Liao, Lei He, and K.M. Lepak. Temperature and supply voltage aware performance and power modeling at microarchitecture level. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 24(7):1042.
[28] Steven Cameron Woo, Moriyoshi Ohara, Evan Torrie, Jaswinder Pal Singh, and Anoop Gupta. The SPLASH-2 programs: Characterization and methodological considerations. In ISCA '95: Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 24–36, New York, NY, USA, 1995. ACM Press.
[29] Man-Lap Li, Ruchira Sasanka, Sarita V. Adve, Yen-Kuang Chen, and Eric Debes. The ALPBench benchmark suite for complex multimedia applications. In Proceedings of the IEEE International Symposium on Workload Characterization (IISWC-2005), Washington, DC, USA, 2005. IEEE Computer Society.
