Configurable Memory Hierarchies for Energy Efficiency in a Many Core Processor
CVA MEMO 130, Version 1.0
Vishal Parikh, R. Curtis Harting, and William J. Dally
Electrical Engineering, Stanford University, Stanford, CA 94305
E-mail: {vparikh1, charting, dally}@stanford.edu
February 13, 2012

Abstract

In this paper, we propose a configurable memory system for CMPs to increase performance and decrease energy consumption. We evaluate multiple memory organization options in a shared-memory, 256-node multiprocessor executing regular and irregular scientific applications. We explore two classes of memory organizations, flat and hierarchical, and we find that different applications exhibit energy optimality in different configurations, thus motivating a configurable hierarchy. We propose a configurable memory system and API that allows the memory organization to be specified at application-load time. The size of shared caches, the degree of sharing (span), and the number of hierarchy levels below the last level cache can be configured. This results in a dynamic energy savings and performance improvement of over 30% versus a static configuration chosen with the best overall performance over all the benchmarks. Additionally, we explore the sources of energy consumption in a shared-memory, many-core machine and identify key system parameters that lead to energy inefficiency.

1 Introduction

Much work has been done in optimizing chip multiprocessors (CMPs) for throughput, latency, and scalability. However, a growing concern is energy and power efficiency. In modern datacenters, the limiting factor in performance is often power usage, both at the building level [20] and at the board level [35]. Similarly, total die power limits make it impossible to fully utilize modern processors. Inefficient computational mechanisms exacerbate these problems, increasing performance at the expense of energy efficiency. It is therefore paramount to study energy-efficient mechanisms in the design of shared-memory CMPs.


Many systems attempt to optimize for both energy and performance by removing hardware features such as global addressing and coherence [21, 32]. However, while these systems are highly scalable, they are often very difficult and expensive to program. Compounding this problem is the fact that software is becoming increasingly complex and expensive to develop. The constraints of overly exposed hardware platforms increase programming difficulty. Additionally, irregular applications (those whose memory access patterns cannot be determined at compile time) are very difficult to program using software-managed memories [15, 31]. These applications represent a large and important class of scientific applications including N-body problems, mesh computations, ray-tracing, sparse data-structure traversals, etc. In order to efficiently execute irregular applications and minimize programming cost, it is necessary to have hardware support for memory coherence. Our work focuses on scientific, throughput-oriented computing (both regular and irregular), though the analyses presented here may be relevant to other areas such as database workloads, web workloads, and other latency-sensitive applications. It is important to note that the analysis of the memory system is applicable regardless of the structure of the processing elements and low-level core interaction. A chip with a thousand small in-order floating point units (similar to many proposed accelerator architectures [29, 41]), a chip with a hundred superscalar cores (similar to Intel's Larrabee [39]), or a chip with hundreds of small coherent clusters (similar to UIUC's Rigel [26, 27]) will all benefit from the analysis provided here, provided they are backed at some level by a shared-memory system. The system presented below is in the context of a fully hardware-coherent, directory-based memory system. There are many system parameters to be considered in the implementation of such a system, and each is the subject of some study. For this work we studied only the organization and sharing of the directory and cache. In this paper, we propose a framework of cache hierarchies. Within that framework, we propose two types of memory organizations: flat and hierarchical. We propose a methodology (API and hardware) for configuring and selecting an appropriate memory organization at application load time. We show that applications perform significantly better (over 30%) in both energy and computation time with a configurable cache as compared to a single static configuration. We also discuss the sources of energy inefficiency in the memory system and the conflicts between energy optimality and performance. The paper makes the following contributions: we propose a dynamically reconfigurable cache system; we perform a detailed study of cache organization on a 256-core chip with detailed timing and energy models; we propose a novel configuration system; and we provide insight into energy inefficiencies in the communication system, demonstrating that, over a set of common scientific applications, no single configuration is best and significant savings can be had by dynamically configuring the cache hierarchy for each application.


2 Background

Most modern shared-memory CMPs such as Tilera's Tile64 [4] are designed around a non-uniform cache architecture (NUCA) [28]. The caches are implemented as a sea of memories connected by an on-chip interconnection network. These memories, though physically separate, can be logically configured in a variety of different ways, such as a large global cache or several smaller private caches. A global directory typically manages the sharing information to enable scalability. Additional levels of hierarchy may be added, such as a last-level cache or additional levels of intermediate cache. An important consideration in these designs is the tradeoff between the physical locality of processors to their working sets and the usable capacity of the on-chip cache. If the available on-chip memory is divided into several semi-private caches, data may be both write- and read-shared via these caches without any global communication. Data is also more likely to be fewer hops away in a semi-private scheme. Both of these factors may decrease latency and access energy. Alternatively, a large global cache could result in longer average access times, but a higher hit rate.

2.1 Hierarchical Directories

Hierarchical memory systems have been used in many previously designed multicore and multi-processor systems [1, 30, 38]. Many of these have been hierarchical directory systems, but there have also been hybrid schemes that combine directory and broadcast implementations, such as Waypoint [27]. A hierarchical memory system has many advantages in a multi-core system. Each core has the ability to access frequently reused data at a lower latency. When multiple cores share a cache, they can read- and write-share data without global communication. The "span" of a cache (the number of processing elements or lower-level memories that are shared by a particular memory) greatly affects global communication. In a non-uniform cache or memory system (NUCA or NUMA), the access latency of a memory is determined largely by the network latency, rather than the access latency of the actual SRAM. This couples the span, size, and latency of the memory. Figure 1 shows how cores in different cache neighborhoods communicate with one another. Consider an intermediate, semi-private L2 cache as depicted in Figure 1. Not only does the L2 provide lower latency than the L3 and additional capacity over the L1, but it also provides access to shared data at a shorter distance than sharing in the global directory and cache. If data is shared between two nodes, all protocol and data messages need only traverse as high as the lowest common directory.
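To make the span concept concrete, the sketch below computes the lowest hierarchy level through which two tiles can communicate, using the spans of the Figure 1 example (4, 16, and 64). The linear tile numbering, tile indices, and function names are our own simplification for illustration; the actual neighborhoods are 2-D blocks on the mesh.

```c
#include <stdio.h>

/* Spans from the Figure 1 example: an L2 covers 4 tiles, an L3 covers 16,
 * and the global L4 covers all 64.  Illustrative values only. */
static const int span[] = {1, 4, 16, 64};   /* span[level-1] for levels 1..4 */

/* Lowest cache/directory level whose span contains both tiles, i.e. the level
 * through which the two tiles can share data with no higher-level traffic. */
int lowest_common_level(int tile_a, int tile_b)
{
    for (int level = 1; level <= 4; level++) {
        if (tile_a / span[level - 1] == tile_b / span[level - 1])
            return level;
    }
    return 4;  /* the global LLC always covers both */
}

int main(void)
{
    printf("tiles 0 and 1 meet at L%d\n", lowest_common_level(0, 1));   /* L2 */
    printf("tiles 0 and 12 meet at L%d\n", lowest_common_level(0, 12)); /* L3 */
    printf("tiles 0 and 40 meet at L%d\n", lowest_common_level(0, 40)); /* L4 */
    return 0;
}
```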

3 Baseline Architecture

The following results were obtained via simulation of a 256-core CMP. Each core is a simple, 2-issue, out-of-order core. Each core is part of a tile that contains a private L1 cache, a configurable memory, a configurable directory controller, and a network interface (shared between all components of the tile). The tiles are arranged in a 16x16 grid and connected by a mesh network. A tile and system diagram are shown in Figure 2.

Figure 1: Directory/Cache span and hierarchical sharing. The L2 (lower left) has a span of 4; thus, nodes A and B can communicate through the local L2 slice at H2. The L3 has a span of 16, and A and C communicate through H3. The L4 is global with a span of 64. The path for D to reach its L4 home node is shown. Each level needs only 4 bits of sharer information to designate which of its immediate descendants hold a copy of the data.

Figure 2: Tile and System Diagram. Each tile contains a core, an L1 cache, a shared cache slice, a shared directory slice, a network interface, and a mesh router; memory controllers (M Ctl) sit at the edges of the chip.

The top level of the on-chip memory system is a global, last-level cache with a directory. In the flat implementation, this is referred to as the L2. In the hierarchical implementations, it is referred to as the L3 (or L4), and there are intermediate L2s (and L3s) with directories. The L2 directory maintains sharing information for all items in the L1s within its span. The L2 cache holds victims of the L1s and can satisfy requests to the L2 directory. The L3 (if available) holds directory information for all elements in the L2 directories (which include the L1 information) and L2 caches in its span. A schematic diagram of the flat and hierarchical directories is shown in Figure 3. This framework for constructing on-chip memory hierarchies results in an efficient method for read- and write-sharing data. Read-shared data is replicated in the private or semi-private caches. Write-shared data does not need to be communicated globally, but only through the lowest branch in common to both communicating nodes. Additionally, as already mentioned, sharer information can be stored very efficiently in a hierarchy. The L2 (and above) caches are neither inclusive nor exclusive. If a cache line is written back (due to a modified-to-shared transition) or if it is evicted from all lower-level caches (a shared- or modified-to-invalid transition), the line is inserted in the L2. The line is only evicted from the L2 if there is a conflict or invalidation. Additionally, since the L2 cache slice and L2 directory slice are co-located, if a request sent to the directory is present in the L2 cache, it can be satisfied by the L2 cache instead of by an L1 cache, which can save a network traversal. Cache line size is 64 bytes throughout. All L2 and above caches are 8-way set associative. In general, the network must be traversed to reach a shared cache or directory (L2 or higher). However, since the network interface, L1, and shared memory are all on the same tile, it is possible to bypass the network if the required element is on the same tile.


Figure 3: Flat (two-level) and hierarchical (three-level) layouts. The L2 directory maintains sharing information for the L1s in its span. It has a number of entries equal to the sum of the number of lines in the L1s in its span. Similarly, the L3 directory is inclusive of the L2 directory and cache tags as well as the L1 tags. Note that the caches at each level are not inclusive.

The directories use an MSI protocol to maintain coherence and have duplicated tags. The directories are designed to be large enough to hold entries for all cached elements at the level below them. Directory implementation is not the subject of this paper, and it has been the subject of much study [1, 6, 7, 11, 36, 45]. We have elected to implement an optimal directory, leaving detailed implementation for future work. We represent the sharer information in the directory perfectly and have perfect associativity. Many of the directory implementations mentioned previously efficiently approach this behavior. We have also implemented a set-associative directory cache and a limited-pointer sharing scheme and have found our implementation to be insensitive to even modest parameters (8-way set associative and 8 pointers) in both performance and energy. This result has also been shown in [27].
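As a concrete illustration of the per-level sharer encoding (each directory tracks only its immediate descendants, so 4 sharer bits per entry suffice at every level), the sketch below shows one plausible entry layout. The field names and widths are our assumptions, not the paper's implementation; the configurations in Table 1 budget 128 bits per directory entry including the tag.

```c
#include <stdint.h>

/* Hypothetical hierarchical directory entry (illustrative layout only).
 * Each level records which of its (up to 4) immediate children -- L1s or
 * lower-level directories -- hold the line, so 4 sharer bits suffice at
 * every level regardless of the total core count. */
enum line_state { LINE_INVALID, LINE_SHARED, LINE_MODIFIED };   /* MSI */

struct dir_entry {
    uint64_t tag;      /* duplicated tag of the tracked line               */
    uint8_t  state;    /* MSI state of the line below this directory       */
    uint8_t  sharers;  /* bit i set => immediate child i holds the line    */
    uint8_t  owner;    /* child index holding the line in Modified state   */
};

/* On a write, only children whose sharer bit is set receive an invalidation;
 * each such child in turn invalidates only its own sharing descendants. */
static inline int child_needs_invalidate(const struct dir_entry *e, int child)
{
    return (e->sharers >> child) & 1;
}
```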

4 Design Space

There are two classes of memory organizations that we explore in this paper, based on the baseline architecture described above: a flat organization and a hierarchical organization. The flat organization has a 16kB L1 per core and a 128kB L2 slice per tile; these slices combine to form a 30MB L2 that is shared globally across the chip. Accompanying the L2 is a 64k-entry directory (one entry per 64B L1 cache line), with 256 entries per tile. Each tile is structured as shown in Figure 4a. We then have three hierarchical organizations, as described in Table 1. The first two have three levels: the L3 is global, and there are several shared L2s spanning either 4 or 16 cores. The third hierarchical configuration has four levels. It is essentially a combination of the two 3-level hierarchies, with 64 L2s composed of four tiles each (2x2), 16 L3s composed of 16 tiles each (4x4), and a global L4. We selected the four configurations to be representative of the set of all configurations. We capture two extremes and one midpoint, while keeping die area constant. The two extremes are the flat, two-level hierarchy and the four-level hierarchy. The four-level hierarchy was taken as an upper bound because beyond that point, directory overhead begins to dominate. The two 3-level hierarchies are a logical midpoint between these extremes. There are many additional parameters that could be varied, including L1 size, L2 size, and L2/L3 sharing. We picked sizes that were reasonable and consistent with other implementations. A sensitivity analysis for each of these parameters is beyond the scope of this paper. We did find that for the majority of the applications, the most sensitive parameter was LLC size. This is why we elected to use smaller L1s in favor of larger shared memories and LLCs.

4.1 Directory Implementation

Each L2 directory has sufficient entries to account for everything in the L1s in its span, and the L3 directory, if present, has sufficient entries to hold an entry for every item in each L2 and L1 cache. This is larger than is typically necessary, since sharing often reduces the number of entries required, but we leave optimization of the directory for future work. Directory implementation is orthogonal to the topic of this paper. A more optimized directory implementation would have slightly reduced the overhead in the LLC; at the limit, this would increase the size of the LLC in the 4-level hierarchy from 12MB to 17MB. However, this is still significantly smaller than the 32MB LLC in the two-level hierarchy, due to the semi-private caches. While directory optimization may improve overall performance, we believe that it will have a small effect on the comparison that we make here.
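As a sanity check of this sizing rule, the arithmetic below reproduces the Hierarchy A directory sizes listed in Table 1 (1k entries per L2 directory, 192k entries for the LLC directory). This is illustrative arithmetic only, not simulator code.

```c
#include <stdio.h>

/* Directory sizing for Hierarchy A, following the rule above: each directory
 * holds one entry per line cached anywhere below it.  Capacities are taken
 * from Table 1; "kE" entries are 128 bits each, including the tag. */
int main(void)
{
    const int line     = 64;                    /* bytes per cache line     */
    const int l1_lines = 16 * 1024 / line;      /* 256 lines per 16kB L1    */
    const int l2_lines = 128 * 1024 / line;     /* 2048 lines per 128kB L2  */
    const int l2_span = 4, num_l2 = 64, cores = 256;

    int l2_dir_entries  = l2_span * l1_lines;                    /* 1k entries  */
    int llc_dir_entries = cores * l1_lines + num_l2 * l2_lines;  /* 192k entries */

    printf("L2 directory:  %d entries (%d kB)\n",
           l2_dir_entries,  l2_dir_entries * 128 / 8 / 1024);          /* 16 kB */
    printf("LLC directory: %d entries (%d MB)\n",
           llc_dir_entries, llc_dir_entries * 128 / 8 / 1024 / 1024);  /* 3 MB  */
    return 0;
}
```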


4.2 Network

Network implementation was not the subject of this study. We elected to use a mesh network for simplicity and because many related works also use a mesh. Network optimizations would improve both the flat and hierarchical implementations, which may reduce the magnitude of the result but not change the outcome.

Figure 4: Tile Configurations. (a) Flat Tile. (b) Hierarchy A Tile.

5 Implementation

The on-chip implementation of the configurable memory system is straightforward and has minimal overhead in both area and latency. The concept is simple: instead of having a dedicated cache and directory for each level, we implement the data storage for the caches and directories out of a single shared SRAM. The configuration is changed at application load time using a configuration tool, which programs base and length registers for up to four levels of the memory in each slice.

5.1 Physical Implementation

The physical implementation consists of an SRAM and a tag memory (an SRAM with comparators) shared by the slices configured onto the tile. The tile is structured as shown in Figure 5. A configurable controller effects the logical mapping via address translation. There are some compromises that need to be made. For example, the width of the tag storage must be the maximum possible tag width. Also, since directory entries are not the same size as cache lines, they must be packed. However, the overhead of these compromises is small (a 3% increase in SRAM). Note that we still have a dedicated L1 implemented in the conventional manner, as the core is typically sensitive to L1 latency. Also note that what we refer to as the L1 need not be physically implemented as one memory; it can be a hierarchy unto itself of several memories of increasing size and latency (akin to a private L1 and L2 in a single-core processor). Such an optimization may be necessary depending on the physical parameters of the system, but it is independent of the higher-level memory organization, so we assume the L1 is a single physical entity.

Level  Parameter                  Flat          Hier A        Hier B        Hier C
L1     Size                       16kB          16kB          16kB          16kB
       Number                     256           256           256           256
       Storage used               4MB           4MB           4MB           4MB
L2     Cache size                 -             128kB         512kB         128kB
       Dir size                   -             16kB (1kE)    64kB (4kE)    16kB (1kE)
       Span/Number                -             4/64          16/16         4/64
       Storage used (dir+cache)   -             9MB           9MB           9MB
       Route Energy (pJ)          -             34.1          68.2          34.1
L3     Cache size                 -             -             -             512kB
       Dir size                   -             -             -             192kB (12kE)
       Span/Number                -             -             -             16/16
       Storage used (dir+cache)   -             -             -             11MB
       Route Energy (pJ)          -             -             -             68.2
LLC    Cache size                 30MB          20MB          20MB          7MB
       Dir size                   2MB (128kE)   3MB (192kE)   3MB (192kE)   5MB (328kE)
       Storage used (dir+cache)   32MB          23MB          23MB          12MB
       Route Energy (pJ)          273           273           273           273

Table 1: Memory Configurations. Total storage available to shared caches and directories is 32MB (128kB/tile). The sharing of the intermediate levels is determined by the span. The LLC is global (shared by all nodes). "kE" stands for 1024 directory entries at 128 bits each (including tag). Route Energy is the average energy to transfer data to/from the cache.

The configuration takes place in the address translation logic block. (This translates the global address to a local address for the SRAM; it is not to be confused with virtual-to-physical address translation.) The incoming address needs to be translated based on the configuration to index into the SRAM. A relevant portion of the logic block, showing the main hardware and critical path, is shown in Figure 5. The main operation is an integer divide (modulus). The divide is necessary because otherwise each individual portion of the SRAM would need to be a power of two, which would leave a significant portion of the SRAM unused. The width of the divide is determined by the depth of the SRAM, which in our case is 17 bits. This logic should add 1-2 cycles to the L2 (and higher) latency. This should not be significant since, in order to get to the L2, the network must be traversed, which takes several cycles. The logic required for this circuit will not have a significant impact on area. Not shown in the diagram are the logic for accessing the directory in addition to the cache, which is largely a replica of what is shown, and the logic for determining the home node, which is a simple series of bit operations based on the middle address bits.
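A minimal sketch of this translation, assuming per-level start and size registers as in Figure 5, is given below. The register and function names are our own; the actual logic is implemented in hardware.

```c
#include <stdint.h>

/* Hypothetical per-tile address translation sketch (Figure 5).  Base
 * ("start") and length ("size") registers for each configured level are
 * programmed at application load time. */
struct slice_cfg {
    uint32_t start[3];   /* SRAM base index for the L2, L3, L4 slice on this tile */
    uint32_t size[3];    /* number of sets (or packed directory entries) per slice */
};

/* Map a global line address to an index in the tile's shared SRAM.  The
 * modulus lets a slice occupy a non-power-of-two region of the SRAM, at the
 * cost of a narrow (~17-bit) integer divide in the lookup path. */
static inline uint32_t sram_index(const struct slice_cfg *cfg,
                                  int level,            /* 0 = L2, 1 = L3, 2 = L4 */
                                  uint64_t line_addr)   /* byte address >> 6       */
{
    return cfg->start[level] + (uint32_t)(line_addr % cfg->size[level]);
}
```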

5.2 Multiprogrammed Workloads

Multiprogrammed workloads are not the norm in scientific computing, which was the focus of this work. However, this scheme could easily be adapted for multiprogramming by including the memory configuration registers as part of the process context. Each cache would be partitioned using a conventional partitioning scheme, and the lookup process would use the cache size and level from the appropriate context. This would allow each process to have a different memory organization but share physical storage space.

Figure 5: Tile Structure and Address Translation Logic

5.3 Configuration API

In practice, configuration is performed at application load time, as it requires a global clear of all caches and directories and a broadcast to all cores. The configuration is done by the operating system, directed by a header in the application binary. The header specifies the number of levels and their sizes, similar to the configurations described in Table 1. These are translated into offsets and sizes for the mux inputs depicted in Figure 5. A separate application, MemCfg, is used to determine and encode the hierarchy in the binary. There are several ways to discover the optimal hierarchy configuration. The first is to study the application characteristics by analysis. The memory access patterns of many standard applications, such as blocked matrix multiply, are already well studied and can be tuned to a particular cache hierarchy. A specific arrangement of caches can then be configured directly in MemCfg. The second way is to pick a test sample of organizations to use as a heuristic for the application. This is the method employed in this study: MemCfg provides the four test organizations described in this paper, which represent a large fraction of the design space, and the user can choose the best one. The third method of configuration is to use an autotuner to test a wide range of parameters at a fine granularity on a sample input of the application. In addition to varying the system parameters, application parameters may also be varied. This yields an incredibly rich optimization space. However, depending on the run time of the application, this may be an expensive operation.
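The header format is not specified beyond the description above, but it could be encoded along the following lines. The struct layout and field names are hypothetical, with Hierarchy A from Table 1 used as the example payload.

```c
#include <stdint.h>

/* Hypothetical layout of the memory-configuration header that MemCfg embeds
 * in the application binary and the OS reads at load time. */
struct memcfg_header {
    uint8_t num_levels;         /* shared levels beyond the private L1: 1 (flat) to 3 */
    struct {
        uint32_t cache_kb;      /* per-slice cache capacity in kB                 */
        uint32_t dir_entries;   /* per-slice directory entries                    */
        uint16_t span;          /* cores (or lower-level slices) sharing a slice  */
    } level[3];
};

/* Example: Hierarchy A from Table 1 -- 128kB L2s spanning 4 cores each, plus
 * a global 20MB LLC backed by a 192k-entry directory. */
static const struct memcfg_header hier_a = {
    .num_levels = 2,
    .level = {
        { .cache_kb = 128,       .dir_entries = 1024,       .span = 4   },
        { .cache_kb = 20 * 1024, .dir_entries = 192 * 1024, .span = 256 },
    },
};
```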

6 Methodology

We tested our hypotheses using a custom, timing-modeled simulator and several industry-standard benchmarks, run over billions of instructions with a variety of system parameters.

6.1 Microarchitecture Simulator

We developed a detailed, timing-accurate simulator for this work. There are two portions to the simulator: an execution-driven front end and a performance-modeling back end. The two parts are tied together so that performance characteristics (stalls, blocking, contention, etc.) of the simulated hardware influence the front-end execution in a realistic way. The simulator models detailed temporal effects such as network contention, memory hot spots, and load imbalance. The system configuration is shown in Table 2.

Component            Description
Core Implementation  2-issue OOO
Number of Cores      256
L1                   16kB 4-way
L2 etc.              Varies
DRAM                 16GB
Network Topology     16x16 Mesh

Table 2: System Configuration

The front end uses PIN [33] to instrument a native x86 multithreaded binary. The instructions are cracked into RISC-like operations to decouple them from the restrictions of x86 assembly. We also reorder instructions as necessary within a small instruction window. The cores are capable of fetching and retiring multiple instructions per cycle and of multithreading to increase functional unit utilization. The core model is backed by a simple mesh network model that captures channel bandwidth, channel latency, and router latency. Flow control and saturation effects are not modeled. We have tested our system using a very detailed network model [9] on a limited number of benchmarks and observed only a small performance difference. The network connects the cores to a detailed model of the memory system components, including set-associative caches, directories, and a low-level MSI protocol designed to work with an unordered network. The DRAM is modeled with a fixed latency to simulate futuristic DRAM technologies without making assumptions as to implementation details.

6.2 Energy Models

Each architectural model in the system has a corresponding energy model that we use to determine the energy usage of the applications. We assume a future 14nm process based on the ITRS roadmap [23]. The core energy model is split into two parts: a fixed overhead representing datapath energy and instruction cache energy, and models for integer and floating point operations. This is derived from internal models and published work [14]. The caches and SRAMs are modeled using Cacti [37] at 22nm, which was the smallest process available at the time of this study, and scaled to 14nm based on the relative feature sizes and voltages. The routers are modeled based on place-and-route results for a detailed Verilog model developed in [2]. Wires are modeled using the capacitive model described in [24] and are full swing. A full list of the important parameters used in this study is given in Table 3. We assume fine-grained clock gating. We did not include static energy in our models. The purpose of this paper is to compare two types of configurations; static energy would affect both of them roughly equally and would mostly reduce the magnitude of the difference (with the exception of a small static energy difference due to differing run times), but it would not change the outcome, as both flat and hierarchical would be equally affected by leakage. Estimating static energy is difficult, as a future process would employ power gating, multiple-threshold designs, etc. Introducing a crude, fixed static energy model would not paint a fair picture. However, even without these tactics, static energy would not dominate, and would be around 20-30% even in a fast, high-leakage process.

Component          Value   Description
Core [14]          15      pJ per 64-bit operation
Link [24]          0.03    pJ per byte per mesh hop
Router [2]         0.37    pJ per byte routed
L1 Cache [37]      4.8     pJ per 64-byte cache line
Shared Cache [37]  8.8     pJ per access
DRAM [42]          24      pJ per byte read/written

Table 3: Energy of various operations
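As an illustrative use of these parameters, the sketch below estimates the energy to fetch one 64-byte line from a shared cache slice a given number of mesh hops away. The formula is our own combination of the table entries and is not taken from the simulator, whose accounting (e.g. the Route Energy column of Table 1) differs in detail.

```c
#include <stdio.h>

/* Rough energy (pJ) to fetch one 64B line from a shared cache slice h mesh
 * hops away, combining the Table 3 parameters.  Illustrative only. */
static double line_fetch_pj(int hops)
{
    const double bytes     = 64.0;
    const double link_pj   = 0.03;   /* per byte per hop */
    const double router_pj = 0.37;   /* per byte routed  */
    const double cache_pj  = 8.8;    /* per shared-cache access */

    return cache_pj + bytes * (link_pj * hops + router_pj * (hops + 1));
}

int main(void)
{
    /* A nearby slice (a few hops) versus a distant slice on the 16x16 mesh. */
    printf("2 hops:  %.1f pJ\n", line_fetch_pj(2));
    printf("16 hops: %.1f pJ\n", line_fetch_pj(16));
    return 0;
}
```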

6.3 Thread Placement

Scheduling of threads to cores can affect performance. We did not assume optimal placement of threads to particular cores. In our implementation, threads created with temporal proximity will, in general, have spatial proximity. This assumption is not overly favorable for the hierarchy and leaves much room for improvement. We tested both row-major and Z-order (Morton) mappings of threads to cores for a subset of the benchmarks and found them to be insensitive to this variation.
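For reference, the Z-order (Morton) mapping interleaves the bits of the thread index to form x and y tile coordinates, so threads with nearby indices land in nearby 2x2 and 4x4 blocks. The sketch below is our own illustration of the mapping, not the scheduler code.

```c
#include <stdio.h>

/* Map a thread index (0..255) to (x, y) tile coordinates on the 16x16 mesh
 * using Z-order (Morton) ordering: even bits of the index become x, odd bits
 * become y. */
static void morton_to_xy(unsigned idx, unsigned *x, unsigned *y)
{
    *x = *y = 0;
    for (unsigned b = 0; b < 4; b++) {          /* 4 bits each for a 16x16 grid */
        *x |= ((idx >> (2 * b))     & 1u) << b;
        *y |= ((idx >> (2 * b + 1)) & 1u) << b;
    }
}

int main(void)
{
    for (unsigned t = 0; t < 8; t++) {
        unsigned x, y;
        morton_to_xy(t, &x, &y);
        printf("thread %u -> tile (%u, %u)\n", t, x, y);
    }
    return 0;
}
```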

6.4 Benchmarks

We have chosen seven benchmarks to study, as described in Table 4. These benchmarks were selected from industry-standard parallel benchmark suites and were chosen based on several characteristics. We desire benchmarks with both regular and irregular memory access patterns, that access the memory system in a variety of patterns to exercise both memory organizations, and that parallelize to 256 cores with a small enough dataset that they do not take prohibitively long to execute in simulation. The applications represent billions of simulated instructions each. Note that for SPLASH2, we augmented some of the data sets so they would scale to 256 cores. The DMM implementation was a straightforward, blocked implementation with blocks sized to each cache level (a sketch follows Table 4).

Benchmark       Suite         Data Set                   Description
FFT             SPLASH2 [44]  2^16 and 2^18 points       Complex 1-D FFT
Canneal         PARSEC [5]    simsmall                   Simulated annealing of a netlist
Raytrace        SPLASH2       balls4.env                 Raytracing
DMM             none          512 x 512 DP               Dense matrix multiply
Volume Render   SPLASH2       head                       Ray-based volume rendering
Water-Spatial   SPLASH2       2744 particles             N-body simulation of water
Swaptions       PARSEC        256 swaptions; 5k trials   Parallel financial simulation

Table 4: Benchmarks
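A minimal sketch of the blocked DMM structure is shown below; the block size here is a placeholder chosen to fit a 128kB L2 slice, whereas the benchmark tunes it to each cache level.

```c
/* Minimal cache-blocked dense matrix multiply, C += A * B, illustrating the
 * DMM benchmark's structure: the block edge BS is chosen so that three BSxBS
 * double-precision tiles fit in the targeted cache level. */
#define N  512      /* matrix dimension (512 x 512, double precision)          */
#define BS 64       /* block edge: 3 * 64 * 64 * 8 bytes ~= 96kB < 128kB L2    */

void dmm_blocked(const double A[N][N], const double B[N][N], double C[N][N])
{
    for (int ii = 0; ii < N; ii += BS)
        for (int kk = 0; kk < N; kk += BS)
            for (int jj = 0; jj < N; jj += BS)
                /* multiply one block; all three operand tiles stay cache-resident */
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++)
                        for (int j = jj; j < jj + BS; j++)
                            C[i][j] += A[i][k] * B[k][j];
}
```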

7 Evaluation

We evaluated the performance of the four configurations described in Section 4, including a flat configuration and three hierarchical configurations. We further investigated the sources of energy usage in the core and in the network.

7.1 Metrics

We evaluated the system performance based on two metrics: energy efficiency (energy to complete a particular benchmark) and computation time (speedup versus single-thread performance). We did not attempt to artificially combine these metrics into a single metric such as energy-delay product, because the relative importance of energy efficiency and performance can depend on many complex external factors in the context of a real datacenter. We leave it to the reader to weigh this relative importance in the way they see fit. We compare the performance of each benchmark in several configurations. We report all of the benchmarks in both the flat organization and hierarchical organization A as in Table 1. For the applications that benefit from hierarchy, we further explored hierarchical organizations B and C as described in Table 1.

7.2 Flat vs Hierarchical

Figures 6 and 7 show the comparison of two memory organizations (Flat and Hierarchy A) across all benchmarks. We show both the energy breakdown and the speedup as we vary the number of threads. This highlights both the scalability of the benchmarks and system as well as the unpredictability in energy consumption as performance is increased. We found that the benchmarks fell into two categories: those that have hierarchical reuse and sharing, and those that do not. The applications that performed better in a flat memory organization are Canneal, FFT, and Raytrace. Canneal is a simulated annealing application that picks nodes at random across the entire working set. As expected, this has no hierarchical reuse and does not benefit from the hierarchy. As a result, there is additional overhead in the network from unnecessary data movement in staging the data in the L2. Additionally, the reduced effective cache size causes additional cache misses, which increases the energy from DRAM. The flat organization results in a significant performance increase for Canneal due to decreased latency, from fewer network hops, and fewer cache misses.

Figure 6: Energy breakdown and speedup for flat (left bars) and hierarchical (right bars) configurations for FFT, FFT (Large), Raytrace, and Canneal. Each energy bar is broken into Network, Memory, L3 Dir, L2 Dir, FPU/ALU, Core, and L1 Cache components. All applications are shown with 1, 8, 16, 64, 128, and 256 threads; speedup and energy are relative to the single-threaded version of the program running on the flat configuration.

The applications that performed better in a hierarchical memory organization are DMM, Volume Render, Water-Spatial, and Swaptions. Similar to the result by Hughes et al. [22], a central shared cache typically resulted in excess overhead for these applications. DMM had the largest difference (400% in speed and energy) between flat and hierarchical, which is understandable as it has extensive hierarchical reuse and hierarchical sharing.

Figure 7: Energy breakdown and speedup for flat (left bars) and hierarchical (right bars) configurations for Water, Volume Render, Swaptions, and DMM. All applications are shown with 1, 8, 16, 64, 128, and 256 threads; speedup and energy are relative to the single-threaded version of the program running on the flat configuration.

The gains for Water-Spatial were 30% in speed and 25% in energy, as there were increased cache misses due to reduced capacity. The number of cache misses decreased as the number of threads was increased. This is because additional threads were able to take advantage of additional L2 caches, resulting in a larger effective cache. The energy efficiency for Volume Render improved about 20% in the hierarchy, and speed improved 30%. Swaptions also had considerable savings of 50% in energy and 60% in speed, in addition to being more scalable (the flat version fails to speed up past 128 threads).



Figure 8: Normalized energy breakdown and speedup (versus single-thread configuration A) of different hierarchies for 1, 16, 64, and 256 threads across several benchmarks. Benchmarks were selected from those that benefited from the hierarchical over the flat organization.

7.3 Further Hierarchy Comparison

We also explored different hierarchical organizations. For brevity, we only report the applications that fared better with the hierarchy versus flat in Section 7.2. We report results for the three configurations described in Table 1 as A, B, and C; the results are shown in Figure 8. The energy results for DMM show that there was a 20% energy savings for the fastest version of DMM, though the hierarchy made little difference for the most energy-efficient version (16 threads). Also, the 4-level DMM was 30% faster than the 3-level version. In Volume Render, we show a slight energy savings of about 10%. Water appears to be insensitive to hierarchy variation. Swaptions experienced a significant slowdown due to increased access latency with the deeper or wider hierarchies.


8 Analysis

8.1 Hierarchical Reuse

We have found that there is not a continuum of applications where at some point applications begin to perform better with a hierarchy; rather, applications tend to fall into one of two bins: those with hierarchical reuse and those without. The applications that have hierarchical reuse typically see a large benefit from the presence of a hierarchy and can achieve additional benefits from further refinement of the hierarchy. Applications that do not benefit from a hierarchy typically have the same fraction of energy in the network regardless of memory organization; however, they benefit from the much larger effective cache of the flat organization due to less fragmentation and directory overhead. As a result, depending on the working set size, the application may perform significantly better with a flat memory organization (if there is a decrease in cache misses), or the application may perform modestly better (due to reduced overhead from unnecessarily staging the data in the L2), as compared to Hierarchy A, the best static configuration. For example, in Raytrace the flat version ran about 30% faster and exhibited better scaling, largely due to the excess traffic and latency from the hierarchy overhead.

8.2 Working Set Size

We tested two different sizes of FFT. The 64k-point FFT had roughly the same performance in both the flat and the hierarchical memory organization. However, when the working set of FFT was increased to 256k points, there was a clear discrepancy between the flat and hierarchical organizations in both energy and performance due to increased cache misses. For FFT, the energy-optimal point did not coincide with the fastest execution time. For the smaller FFT there is a clear bathtub curve in both the flat and hierarchical energy consumption, while execution time steadily decreases up to 256 threads. This is largely due to increased communication as the number of threads is increased. This behavior also exists in the larger FFT; however, it is eclipsed by the reduced cache misses that come from the larger effective cache when using more cores (due to more L1s and L2s).

8.3 Energy Breakdown Analysis

Throughout the results we noticed several common trends, as evidenced in Figure 9, which shows the energy breakdown for a 32-thread FFT. While the breakdown varies with application and number of threads, this graph is generally representative of the trends throughout. There is a significant amount of energy in the network. For four of the benchmarks (Water, Volume Render, Swaptions, FFT) it is 30%-40% of the total energy, whereas for DMM, Raytrace, and Canneal it dominates. The core (which is "Core" plus "FPU/ALU"), as expected, is a significant fraction of the energy. On-chip memory accesses (L1, L2, etc.) tend to make up a relatively small portion of the access energy; most of the energy for on-chip memory access is from data transfer and is reported through the network energy. Off-chip (DRAM) access varies widely between applications, and can easily dominate in applications that have poor on-chip hit rates.

8.3.1 Network vs Core

The large amount of network versus core energy may be surprising when compared to results for existing microprocessors, which are typically dominated by the core with negligible energy in the network. The reason for this shift is that in order for CMPs to continue to scale, it is important to drastically reduce core overhead through the use of efficient mechanisms such as instruction registers, operand register files, and reduced speculation. These improvements can yield a drastically lower core energy, as shown in [10] for fixed-point, single-issue processors. We have assumed an extension of these principles to floating-point, superscalar processors. The energy assumptions we have made can be found in Table 3. We have also assumed very aggressive clock gating and power gating in order to effectively scale energy usage with activity. This is in contrast to the coarse-grained clock gating we see today, which results in a weak relationship between load and power usage.

Figure 9: FFT Energy Breakdown (400uJ Total): Network 37%, FPU/ALU 21%, Memory 20%, Core 14%, L1 4%, L2 3%, L3 2%.

8.3.2 Network Analysis

There are two important points to note: 1) the majority of the network energy is from global communication, and 2) energy consumption is dominated by sending data, while protocol overhead is insignificant. Figure 10a shows the breakdown of network energy. This data was obtained using configuration A (3 levels with an L2 span of 4). Notice that the global communication (between the L2 and L3, labeled as L2Up and L3Down in the figure) dominates the energy even though the total bytes sent are a small fraction of those sent in the lower levels. In general, there is an order-of-magnitude decrease in the number of accesses for each increasing level. Additionally, if an application has a significant number of cache misses, network energy between the LLC and the memory controllers (we have 8 memory controllers on the edges of the chip) can be significant, as this is also global communication. However, the energy for off-chip communication and DRAM access is typically much larger than the energy to move the data across the chip.


Figure 10: FFT Network Energy Breakdown. (a) Breakdown based on location (L1Up/L2Down is communication between the L1 and L2; MCtl is the on-chip DRAM controller): L3Down 40%, L2Up 37%, MCtlDown 13%, L1Up 4%, L2Down 4%, L3Up 2%. (b) Breakdown based on protocol message: SendData 56%, WriteBack 31%, Ack/Nack 5%, GetS 5%, GetX 3%, Invalidate 0%.

Figure 10b shows the breakdown of network energy based on protocol message type. The energy is dominated by WriteBack and SendData messages. This is not because there is a larger number of these messages (in fact, there are guaranteed to be fewer SendData messages than GetX+GetS), but rather because these two types of messages contain data and are therefore much larger packets. Messages with data are around 76B, whereas messages without data are around 12B. Clearly, data movement, and not protocol overhead, dominates the energy consumption of the network. Another observation from this graph is that Invalidate messages are a very small portion of the energy. When using a limited-pointer scheme, the only messages that are broadcast are invalidate messages issued when the limited pointers have been overflowed. Since the energy in Invalidates is low, and overflows on writes are relatively infrequent, we found that the effect of limited pointers on energy efficiency was small.
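To see why the data-carrying messages dominate, the sketch below combines the quoted packet sizes (roughly 76B with data, 12B without) with the Table 3 per-byte link and router costs for a representative global traversal. This is our own back-of-the-envelope arithmetic, not simulator output.

```c
#include <stdio.h>

/* Per-message network energy for data-carrying vs. protocol-only messages,
 * using the quoted packet sizes and the Table 3 per-byte costs.  Illustrative
 * arithmetic only. */
static double msg_pj(double bytes, int hops)
{
    const double link_pj   = 0.03;   /* pJ per byte per hop */
    const double router_pj = 0.37;   /* pJ per byte routed  */
    return bytes * (link_pj * hops + router_pj * (hops + 1));
}

int main(void)
{
    int hops = 16;  /* a representative global distance on the 16x16 mesh */
    printf("data message (76B):     %.0f pJ\n", msg_pj(76.0, hops));  /* ~515 pJ */
    printf("protocol message (12B): %.0f pJ\n", msg_pj(12.0, hops));  /* ~81 pJ  */
    return 0;
}
```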

9 Related Work

9.1 Directory Structure and Implementation

There has been much study of directory organization in shared-memory machines, both CMPs and multi-processor shared-memory machines. Scott and Goodman showed that hierarchical directories can result in favorable directory scaling in [38]. Acacio et al. [1] also use hierarchical directories to solve the same problem in a different manner, and further demonstrate scalability. Ladan-Mozes and Leiserson utilize a hierarchical cache for sequential consistency in [30]. Kelm et al. demonstrate scalability to over 2000 cores using a hybrid hierarchical system in [27]. Haridi et al. implement a snoopy, bus-based hierarchical memory system in [18], as do Wilson et al. in [43]. Jerger et al. exploit a similar hierarchy concept in Virtual Tree Coherence [12], which utilizes trees built into the network to cache data at the lowest common node of a group of sharers. Guo et al. implemented a static, on-chip hierarchical scheme very similar to the hierarchy we present here [16]. However, they only evaluate a 16-node CMP and do not have the variety of benchmarks that we explore here. They also do not evaluate the energy efficiency of the system. The Cuckoo Directory [13] is another efficient, scalable directory implementation, though the arrangement of memories was not studied. Srikantaiah et al. developed MorphCache [40], a system that dynamically varies the hierarchy, albeit with a different implementation that results in different tradeoffs than we present here. They implement a 16-core CMP with a shared bus rather than an interconnection network, and broadcasts rather than directories. They evaluate configuration only within one level with fixed slice sizes and do not vary the number of hierarchy levels. Also, their configurability is used to maximize cache utilization rather than to minimize the energy and latency of global communication, resulting in different tradeoffs. We did not perform a direct comparison to MorphCache, as their design does not easily scale up to 256 cores and our design would be inefficient at 16 cores.

9.2 Comparison to Flat, Adaptive Migration and Replication Schemes

A large body of work exists covering various methods of optimizing a flat memory, both dynamically and statically. Popular static optimizations include Victim Replication [46], and popular dynamic optimizations include Adaptive Selective Replication [3] and Cooperative Caching [8]. More recently, there have been several increasingly complex schemes that provide additional advantages over the original works, such as Reactive NUCA [17], SP-NUCA [34], and Distributed Cooperative Caching [19, 25]. This work is complementary to the above systems; hence it is not necessary or appropriate for us to make a direct comparison. The logical arrangement of the on-chip memories is a separate performance improvement technique from the implementation of replacement policies within the memories. As we have mentioned, each L2, L3, and L4 in the system could implement one of these schemes. While Victim Replication, Reactive NUCA, and many similar proposals add hierarchical effects and could narrow the gap between arrangements, such a comparison would be incomplete without implementing these optimizations in the hierarchical arrangements (A, B, and C) as well. These implementations have a fair amount of depth, and we believe that they could stand alone in a separate paper. Additionally, while these designs emulate some of the effects of a hierarchical memory system, they do not have a hierarchy. Running applications that share data can lead to thrashing or suboptimal data movement. Write-shared data is also not handled effectively without the presence of a true hierarchy. These studies are limited to low tens of cores, and do not investigate scalability to hundreds of cores. To our knowledge, this is the first paper to describe a configurable cache organization in a many-core CMP using an interconnection network. Flat and hierarchical designs have been proposed in other works. These works assume a particular organization and attempt to solve the problem by modifying the replacement policies within a static organization. We fix the policy, but allow a configurable organization. This is a separate technique from policy optimization, and is analogous to how pipelining and dual issue are two separate and complementary techniques for increasing IPC.


10 Conclusion

Energy-efficient architectures are necessary to continue improving the performance of modern CPUs. In a many-core CMP, data movement is the single largest consumer of energy. The large distributed caches that are prevalent in CMPs can exacerbate the problem of excess data movement. We propose a configurable, hierarchical on-chip memory system. Using a timing-modeled simulator with detailed energy models, we confirm that data movement is a dominant factor in energy consumption in a modern, energy-efficient architecture. We then use this knowledge to tune the hierarchy for several scientific benchmarks to achieve significant gains in both execution time and energy efficiency. We show that there is a clear bi-modality between applications with and without hierarchical sharing and reuse. Some applications greatly benefit from hierarchical sharing and reuse through the use of nearby, semi-private caches, whereas applications without hierarchical sharing and reuse can benefit from the increased effective cache and decreased overhead of a flat organization. Of the applications that benefit from hierarchical reuse, some benefit from further tuning of the hierarchy, such as changing the size of the shared caches and increasing the number of hierarchy levels, providing further gains in execution time and energy efficiency. Of the applications tested, Canneal, Raytrace, and FFT each perform better with a flat configuration; for these, even the best hierarchical cache can result in a 50% energy and execution-time overhead. The remaining four applications perform better in a hierarchical configuration, with energy overhead reductions as high as 75% for DMM and speedup increases of 30% to 400% with 256 cores compared to the flat configuration. We further showed that tailoring the hierarchy can result in an additional 30% performance gain over the baseline hierarchical layout. The reduction in energy and increase in speed from memory configuration strongly suggests that a static configuration is suboptimal.


References [1] M. E. Acacio, J. Gonzalez, J. M. Garcia, and J. Duato. A two-level directory architecture for highly scalable cc-numa multiprocessors. IEEE Transactions on Parallel and Distributed Systems, 16:67–79, 2005. [2] D. U. Becker and W. J. Dally. Allocator implementations for network-on-chip routers. In SC ’09: Proceedings of the 2009 ACM/IEEE Conference on High Performance Computing, Networking, Storage and Analysis, 2009. [3] B. M. Beckmann, M. R. Marty, and D. A. Wood. Asr: Adaptive selective replication for cmp caches. In Proceedings of the 39th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 39, pages 443–454, Washington, DC, USA, 2006. IEEE Computer Society. [4] S. Bell, B. Edwards, J. Amann, R. Conlin, K. Joyce, V. Leung, J. MacKay, M. Reif, L. Bao, J. Brown, M. Mattina, C.-C. Miao, C. Ramey, D. Wentzlaff, W. Anderson, E. Berger, N. Fairbanks, D. Khan, F. Montenegro, J. Stickney, and J. Zook. Tile64 - processor: A 64core soc with mesh interconnect. In Solid-State Circuits Conference, 2008. ISSCC 2008. Digest of Technical Papers. IEEE International, pages 88 –598, feb. 2008. [5] C. Bienia, S. Kumar, J. P. Singh, and K. Li. The parsec benchmark suite: characterization and architectural implications. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT ’08, pages 72–81, New York, NY, USA, 2008. ACM. [6] D. Chaiken, C. Fields, K. Kurihara, and A. Agarwal. Directory-based cache coherence in large-scale multiprocessors. Computer, 23(6):49 –58, jun 1990. [7] D. Chaiken, J. Kubiatowicz, and A. Agarwal. Limitless directories: A scalable cache coherence scheme. SIGPLAN Not., 26:224–234, April 1991. [8] J. Chang and G. S. Sohi. Cooperative caching for chip multiprocessors. Computer Architecture, International Symposium on, 0:264–276, 2006. [9] W. Dally and B. Towles. Principles and Practices of Interconnection Networks. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 2003. [10] W. J. Dally, J. Balfour, D. Black-Shaffer, J. Chen, R. C. Harting, V. Parikh, J. Park, and D. Sheffield. Efficient embedded computing. Computer, 41:27–32, 2008. [11] N. D. Enright Jerger, L.-S. Peh, and M. H. Lipasti. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 35–46, Washington, DC, USA, 2008. IEEE Computer Society. [12] N. D. Enright Jerger, L.-S. Peh, and M. H. Lipasti. Virtual tree coherence: Leveraging regions and in-network multicast trees for scalable cache coherence. In Proceedings of the 41st annual IEEE/ACM International Symposium on Microarchitecture, MICRO 41, pages 35–46, Washington, DC, USA, 2008. IEEE Computer Society. [13] M. Ferdman, P. Lotfi-Kamran, K. Balet, and B. Falsafi. Cuckoo directory: A scalable directory for many-core systems. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 169 –180, feb. 2011. [14] S. Galal and M. Horowitz. Energy-efficient floating point unit design. IEEE Transactions on Computers, 99(PrePrints), 2010. [15] M. Garland, S. Le Grand, J. Nickolls, J. Anderson, J. Hardwick, S. Morton, E. Phillips, Y. Zhang, and V. Volkov. Parallel computing experiences with cuda. Micro, IEEE, 28(4):13 –27, july-aug. 2008. [16] S.-L. Guo, H.-X. Wang, Y.-B. Xue, C.-M. Li, and D.-S. Wang. Hierarchical cache directory for cmp. Journal of Computer Science and Technology, 25:246–256, 2010.


[17] N. Hardavellas, M. Ferdman, B. Falsafi, and A. Ailamaki. Reactive nuca: near-optimal block placement and replication in distributed caches. In Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09, pages 184–195, New York, NY, USA, 2009. ACM. [18] S. Haridi and E. Hagersten. The cache coherence protocol of the data diffusion machine. In E. Odijk, M. Rem, and J.-C. Syre, editors, PARLE ’89 Parallel Architectures and Languages Europe, volume 365 of Lecture Notes in Computer Science, pages 1–18. Springer Berlin / Heidelberg, 1989. [19] E. Herrero, J. Gonz´alez, and R. Canal. Distributed cooperative caching. In Proceedings of the 17th international conference on Parallel architectures and compilation techniques, PACT ’08, pages 134–143, New York, NY, USA, 2008. ACM. [20] U. Hoelzle and L. A. Barroso. The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines. Morgan and Claypool Publishers, 1st edition, 2009. [21] H. Hofstee. Power efficient processor architecture and the cell processor. In HighPerformance Computer Architecture, 2005. HPCA-11. 11th International Symposium on, pages 258 – 262, feb. 2005. [22] C. Hughes, C. Kim, and Y.-K. Chen. Performance and energy implications of many-core caches for throughput computing. Micro, IEEE, 30(6):25 –35, nov.-dec. 2010. [23] International Technology Roadmap for Semiconductors. [24] A. B. Kahng, B. Li, L.-S. Peh, and K. Samadi. Orion 2.0: a fast and accurate noc power and area model for early-stage design space exploration. In Proceedings of the Conference on Design, Automation and Test in Europe, DATE ’09, pages 423–428, 3001 Leuven, Belgium, Belgium, 2009. European Design and Automation Association. [25] M. Kandemir, F. Li, M. Irwin, and S. W. Son. A novel migration-based nuca design for chip multiprocessors. In High Performance Computing, Networking, Storage and Analysis, 2008. SC 2008. International Conference for, pages 1 –12, nov. 2008. [26] J. H. Kelm, D. R. Johnson, M. R. Johnson, N. C. Crago, W. Tuohy, A. Mahesri, S. S. Lumetta, M. I. Frank, and S. J. Patel. Rigel: an architecture and scalable programming interface for a 1000-core accelerator. In Proceedings of the 36th annual international symposium on Computer architecture, ISCA ’09, pages 140–151, New York, NY, USA, 2009. ACM. [27] J. H. Kelm, M. R. Johnson, S. S. Lumettta, and S. J. Patel. Waypoint: scaling coherence to thousand-core architectures. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10, pages 99–110, New York, NY, USA, 2010. ACM. [28] C. Kim, D. Burger, and S. W. Keckler. An adaptive, non-uniform cache structure for wire-delay dominated on-chip caches. In Proceedings of the 10th international conference on Architectural support for programming languages and operating systems, ASPLOS-X, pages 211–222, New York, NY, USA, 2002. ACM. [29] G. Kurian, J. E. Miller, J. Psota, J. Eastep, J. Liu, J. Michel, L. C. Kimerling, and A. Agarwal. Atac: a 1000-core cache-coherent processor with on-chip optical network. In Proceedings of the 19th international conference on Parallel architectures and compilation techniques, PACT ’10, pages 477–488, New York, NY, USA, 2010. ACM. [30] E. Ladan-Mozes and C. E. Leiserson. A consistency architecture for hierarchical shared caches. In Proceedings of the twentieth annual symposium on Parallelism in algorithms and architectures, SPAA ’08, pages 11–22, New York, NY, USA, 2008. ACM. [31] V. W. Lee, C. Kim, J. 
Chhugani, M. Deisher, D. Kim, A. D. Nguyen, N. Satish, M. Smelyanskiy, S. Chennupaty, P. Hammarlund, R. Singhal, and P. Dubey. Debunking the 100x gpu vs. cpu myth: an evaluation of throughput computing on cpu and gpu. In Proceedings of the 37th annual international symposium on Computer architecture, ISCA '10, pages 451–460, New York, NY, USA, 2010. ACM.
[32] E. Lindholm, J. Nickolls, S. Oberman, and J. Montrym. Nvidia tesla: A unified graphics and computing architecture. Micro, IEEE, 28(2):39–55, march-april 2008.
[33] C.-K. Luk, R. Cohn, R. Muth, H. Patil, A. Klauser, G. Lowney, S. Wallace, V. J. Reddi, and K. Hazelwood. Pin: building customized program analysis tools with dynamic instrumentation. SIGPLAN Not., 40:190–200, June 2005.
[34] J. Merino, V. Puente, P. Prieto, and J. A. Gregorio. Sp-nuca: a cost effective dynamic non-uniform cache architecture. SIGARCH Comput. Archit. News, 36:64–71, May 2008.
[35] T. Mudge. Power: a first-class architectural design constraint. Computer, 34(4):52–58, apr 2001.
[36] S. S. Mukherjee and M. D. Hill. An evaluation of directory protocols for medium-scale shared-memory multiprocessors. In Proceedings of the 8th international conference on Supercomputing, ICS '94, pages 64–74, New York, NY, USA, 1994. ACM.
[37] N. Muralimanohar, R. Balasubramonian, and N. Jouppi. Optimizing nuca organizations and wiring alternatives for large caches with cacti 6.0. In Proceedings of the 40th Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 40, pages 3–14, Washington, DC, USA, 2007. IEEE Computer Society.
[38] S. Scott and J. Goodman. Performance of pruning-cache directories for large-scale multiprocessors. Parallel and Distributed Systems, IEEE Transactions on, 4(5):520–534, may 1993.
[39] L. Seiler, D. Carmean, E. Sprangle, T. Forsyth, M. Abrash, P. Dubey, S. Junkins, A. Lake, J. Sugerman, R. Cavin, R. Espasa, E. Grochowski, T. Juan, and P. Hanrahan. Larrabee: a many-core x86 architecture for visual computing. ACM Trans. Graph., 27:18:1–18:15, August 2008.
[40] S. Srikantaiah, E. Kultursay, T. Zhang, M. Kandemir, M. Irwin, and Y. Xie. Morphcache: A reconfigurable adaptive multi-level cache hierarchy. In High Performance Computer Architecture (HPCA), 2011 IEEE 17th International Symposium on, pages 231–242, feb. 2011.
[41] S. Vangal, J. Howard, G. Ruhl, S. Dighe, H. Wilson, J. Tschanz, D. Finan, A. Singh, T. Jacob, S. Jain, V. Erraguntla, C. Roberts, Y. Hoskote, N. Borkar, and S. Borkar. An 80-tile sub-100-w teraflops processor in 65-nm cmos. Solid-State Circuits, IEEE Journal of, 43(1):29–41, jan. 2008.
[42] T. Vogelsang. Understanding the energy consumption of dynamic random access memories. Microarchitecture, IEEE/ACM International Symposium on, 0:363–374, 2010.
[43] A. W. Wilson, Jr. Hierarchical cache/bus architecture for shared memory multiprocessors. In Proceedings of the 14th annual international symposium on Computer architecture, ISCA '87, pages 244–252, New York, NY, USA, 1987. ACM.
[44] S. C. Woo, M. Ohara, E. Torrie, J. P. Singh, and A. Gupta. The splash-2 programs: characterization and methodological considerations. SIGARCH Comput. Archit. News, 23:24–36, May 1995.
[45] J. Zebchuk, V. Srinivasan, M. K. Qureshi, and A. Moshovos. A tagless coherence directory. In Proceedings of the 42nd Annual IEEE/ACM International Symposium on Microarchitecture, MICRO 42, pages 423–434, New York, NY, USA, 2009. ACM.
[46] M. Zhang and K. Asanovic. Victim replication: Maximizing capacity while hiding wire delay in tiled chip multiprocessors. SIGARCH Comput. Archit. News, 33:336–345, May 2005.

