An Adaptive Synchronization Technique for Parallel Simulation of Networked Clusters

Ayose Falcón

Paolo Faraboschi

Daniel Ortega

Hewlett-Packard Laboratories
{ayose.falcon, paolo.faraboschi, daniel.ortega}@hp.com

Abstract

Computer clusters are a very cost-effective approach to High Performance Computing, but simulating a complete cluster is still an open research problem. The obvious approach, parallelizing individual node simulators, is complex and slow. Combining individual parallel simulators implies synchronizing their progress of time. This can be accomplished with a variety of parallel discrete event simulation techniques, but unfortunately any straightforward approach introduces a synchronization overhead causing up to two orders of magnitude of slowdown with respect to the simulation speed of an individual node. In this paper we present a novel adaptive technique that automatically adjusts the synchronization boundaries. By dynamically relaxing accuracy over the least interesting computational phases, we dramatically increase performance with a marginal loss of precision. For example, in the simulation of an 8-node cluster running NAMD (a parallel molecular dynamics application) we show an acceleration factor of 26x over the deterministic "ground truth" simulation, at less than 1% accuracy error.

1. Introduction

A computer cluster is a group of tightly coupled computers that work together as though they were a single computer. Clusters are used to improve performance and availability in a way that, because they are based on industry-standard commercial off-the-shelf (COTS) components, is typically more cost-effective than an ad-hoc solution. Clusters are used extensively in the High Performance Computing (HPC) field. In June 1997, the first clustered computer entered the TOP500 [18] list. Five years later, in June 2002, 16.20% of TOP500 computers were clusters. In the June 2007 edition of the TOP500 list, merely 10 years after that first entry, three out of four (precisely, 74.60%) of all TOP500 computers were clusters.

Building a cluster out of single systems is a relatively easy task when compared to building a parallel computer from

scratch. Nevertheless, building a simulator for a cluster has proven to be of complexity comparable to that of building a simulator for a parallel computer. There are very few simulators for parallel machines, and even fewer for full cluster computers that include both the functional and the timing simulation of the complete system. Simulators of parallel machines are often ad-hoc implementations of a specific parallel machine: they are the result of many engineering-years of development, which makes them very hard to retarget to other systems, including clusters. We believe that constructing simulators for clusters out of individual full-system simulators should be as easy as building clusters out of individual computers. This shift in perspective is one of the objectives of this paper.

Most current full-system simulators model networking by providing a software proxy that channels packets from the simulated network to the external world. Combining several full-system simulators is as easy as providing a software switch that routes packets between the cluster machines and the outside world. Several system simulators, such as Simics Central [12] and AMD's SimNow™ [1], come with such functionality already built in. Similarly, many virtual machines and emulators, such as VMWare [19] or QEMU [2], embed similar "virtual networks" to enable the networking of multiple instances of individual machines, often modeled after the VDE toolkit [6], which provides a generic layer for emulated networks.

The techniques described above provide the functional means to route packets from one simulated node to another. However, they do not provide any mechanism to ensure that the simulated times of the sender and the receiver are consistent with each other. This is the fundamental problem of Parallel Discrete Event Simulation (PDES), and some level of time synchronization is a necessary step to ensure any form of simulation accuracy.
Providing time synchronization is what turns a loosely combined set of parallel full-system simulators into a cluster simulator. The challenges of this task involve controlling the time flow in a cohesive way, while still allowing fast simulation and parallel execution. Here we present a novel technique that enables combining multiple parallel full-system simulators into a "cluster simulator" capable of running standard distributed applications (such as MPI-based ones) on an unmodified OS, with accurate timing and fast simulation speed.

The main contribution of this paper is the "adaptive quantum synchronization" algorithm, which automatically adjusts the global synchronization accuracy based on a dynamic measurement of the networking traffic. As fewer packets flow by, we can lengthen the inter-node synchronization quantum, and vice versa.

The rest of the paper is organized as follows. Section 2 analyzes related work, Section 3 presents our technique, Section 4 describes the simulation and evaluation methodology, Section 5 shows experimental results and, finally, Section 6 concludes the paper.

2. Related Work

Most parallel simulators developed in the past target parallel shared-memory machines. One of the first and most successful parallel simulators is the Wisconsin Wind Tunnel (WWT) [16], which runs a parallel shared-memory program on a parallel computer (a CM-5). The WWT uses execution-driven, distributed, discrete-event simulation to calculate program time: it divides time into lock-step quanta to ensure that all events originating on a remote node that affect a node in the current quantum are known at the quantum's beginning. With this approach, the WWT achieves accurate, reproducible results without approximations. Burger and Wood [4] propose trading accuracy for performance in the context of the WWT simulator. The tradeoff is selected globally, and is accomplished by changing the timing model. Tango [5] is another popular shared-memory simulator of parallel computers, which exploits direct execution on the host machine. Tango spawns an event-generation process for each node on the host machine, but serializes all memory-system simulation in one central simulation process.

The field of Parallel Discrete Event Simulation has abundant literature, starting from the seminal works of Chandy and Misra on "conservative" simulation [13] and of Fujimoto [9] on "optimistic" (checkpoint-and-rollback) simulation. In more recent work, such as [11], PDES is used in a novel way, neither conservatively nor optimistically, but statistically. Each node runs independently and synchronizes at certain points by exchanging statistical information regarding the possible events that should have been communicated. This statistical information is enough to compute time progression and provides a good balance between speed and accuracy.

As we described in the introduction, many researchers have addressed the problem of extending system simulation (such as Simics [12]) into some sort of cluster emulation, as in [7] and [3].
However, they mostly target only network functional simulation and do not really address network timing issues or node synchronization in a parallel and distributed environment.

3. Adaptive Synchronization

Full-system simulation of a complete computing node is a major challenge, but it has already been addressed in many different ways and is outside the scope of this paper. For the purpose of our current research, the building block for the cluster simulator is a full-system simulator which includes models for the CPU, memory, network cards, disks, and other devices. Our full-system simulator employs a decoupled design. One component is responsible for the functional simulation, which emulates the behavior of the target machine (running the OS with the application) and models a large set of common devices. The other component is the timing simulation, which is responsible for assessing the target performance (i.e., speed) by modeling the latency of each of the functions of the emulated devices, such as instructions, the path to memory, or disk and network interface accesses.

Full-system node simulators typically include network card models that live at the boundary of the simulated world. These models act as bridges between the simulated and the real world by providing a proxy functionality that channels network packets back and forth. A network "proxy" greatly enhances the functional features of the simulator by allowing the OS and applications to communicate externally. Nevertheless, this network communication is outside of the simulator's control, and there is no good way of attaching timing models to it. Fortunately, for our purposes, by combining the network functional simulation of all the simulated nodes, we can expand the simulated world to include the communication that happens within the cluster. Instead of bridging simulated packets directly to the external world, we bridge them to a centralized "network controller", responsible for routing packets to and from the simulated nodes (not unlike the functionality offered by the VDE "switch" [6]).
The network controller acts as a functional network simulator, and behaves like a perfect link-layer (MAC-to-MAC) network switch. Within a network controller, adding a timing component is a straightforward task: we can model any kind of network/switch/router topology by making packets take more or less (simulated) time to reach their endpoints. Figure 1 shows the combination of several full-system node simulators together with a network simulator which behaves like the network controller we have just described. With the network controller functionality up and running, we still have one missing piece for our node-combining approach to work, i.e., the synchronization of the simulated nodes. Notice that even without synchronizing the nodes’ simulated time, the functional simulation of the cluster would still behave correctly for most applications. As long as an application does not rely on isochronous nodes, which is common for most distributed programs, the functional behavior is independent of a possible skew in node

Figure 1. Components of a cluster simulator

timing. However, the simulated time would be indeterminate, since each node would be running at its own speed. The internal simulated time of a node depends on many factors, such as the type of application it is running and the complexity of its simulation. The speed of the simulator also depends on external factors, such as the type of host on which it runs and that host's load. To an observer in the real world, the clocks of the simulated machines would not only be skewed with respect to each other, but would also have dynamically changing speeds. Nevertheless, since a bad clock should not change the behavior of a distributed application, nothing prevents the cluster application from proceeding correctly.¹

Figure 2 shows an example of what happens during the round trip of a network communication between unsynchronized nodes. Node 1 sends a network packet to Node 2 at its own time ta, arriving at local time tc for Node 2 (all times are local to their respective nodes). After some time spent processing the packet, Node 2 answers back to Node 1 with a packet sent at time td, arriving at Node 1 at time tb. Since time progresses forward in both nodes in parallel, we can be sure that ta < tb and tc < td. The functional causality of the application is maintained by the data flow, regardless of the skew in clock times. Unfortunately, the timing causality may be broken. Let's assume that the latency of the network for the first packet is tn: the packet that leaves Node 1 at ta should reach Node 2 at ta + tn. If we tag the network packet with its originating timestamp, once the packet arrives at Node 2 we have three possible scenarios. (1) The simulated arrival time at Node 2 (tc) is exactly the expected time (ta + tn).
In this case we have complete accuracy, but it also means we have been particularly lucky, because the probability that two parallel simulations advance at exactly the same simulation speed is tiny. (2) If the time is smaller (tc < ta + tn), Node 2 has not yet reached the simulated time at which the packet should be delivered; the simulator can therefore hold the packet and schedule its arrival perfectly. (3) If the destination node has already gone past the packet delivery time (tc > ta + tn), Node 2 has simulated too fast and has "missed" the delivery of the packet. Because we cannot deliver a packet in the past, the only possibility is to schedule the packet immediately and lose some accuracy, because the packet cannot affect the events that have occurred since tc, as it should have. We call these packets stragglers.

¹ The macroscopic behavior is likely to be correct regardless of the clock skews of the individual nodes, but finer-grain functionality may still be affected. An example of this is a packet retransmission due to a slow machine being late in acknowledging its arrival. We assume this rarely happens.

Figure 2. Communication between time-skewed nodes

Figure 3 shows four of the situations that may happen when simulation speeds differ in a quantum-synchronized system. In each quadrant, real-world time flows from top to bottom, and the horizontal lines represent the beginning and end of a quantum (assumed to be 10 simulated time units). The vertical bars represent simulated time in the two nodes, and they stop when simulation reaches the next quantum. The arrows indicate a packet flowing between the two nodes, and all four scenarios relate to a single packet roundtrip (e.g., what a 'ping' would do). In figure (a), both nodes run at the same simulation speed: this is the ideal situation that rarely happens, and it yields the expected packet roundtrip time (in the example, 6 time units). In figure (c), Node 1 runs slower than Node 2, hence its simulation time advances more slowly and the packet roundtrip appears shorter than in the ideal case (3 vs. 6 time units). In this case, we could delay the delivery of the packet until Node 1 reaches the correct time. Even if we don't do that, the accuracy loss may still be acceptable with a reasonably short quantum.
In figure (b), Node 1 runs much faster than Node 2, and when the packet comes back from Node 2 it has a timestamp in the past, so it becomes a straggler.

Figure 3. Four scenarios in quantum-synchronized systems:
(a) Normal case: nodes simulate at similar speed. Roundtrip = 6.
(b) Node 1 slightly faster: latency appears longer; the packet may be a "straggler" breaking time causality. Roundtrip = 7.
(c) Node 1 slightly slower: latency appears shorter unless we delay the delivery of the packet to the right time. Roundtrip = 3.
(d) Node 1 reaches the quantum before packet arrival: the packet is a "straggler"; the controller queues it to the next quantum, and the latency snaps to the next quantum. Roundtrip > 8.
We can deliver the packet right away, but the latency appears longer than in the ideal case (7 vs. 6 time units), and we potentially break the time causality of the execution. In figure (d), we have a pathological variant of figure (b): in addition to generating a straggler, we have no way of delivering the packet to Node 1, because Node 1 has already reached the end of the quantum. In this case, the only option for the network controller is to queue the packet for delivery at the next quantum, with a resulting increase in the visible roundtrip latency (8 vs. 6 time units). This phenomenon gets worse for longer quanta: with a quantum of 100, the visible latency could be as high as 98!

Depending on the number of stragglers and their total delay time, the simulation accuracy diminishes. If we decided to completely remove any form of synchronization, not only would the timing be inaccurate, but we would also have no way of quantifying the error: all node clocks would be different, and there would be no way of estimating the global time and the delivery error for packets. In the following sections, we describe the mechanisms we use to synchronize the individual simulated clocks and to keep track of the global time. However, it is important to remember that, no matter what we do, we may still have accuracy errors related to stragglers. The key challenge in increasing accuracy without incurring the cost of excessive synchronization is to find a way to quickly detect the stragglers and adjust the simulation for them; this is the key contribution of this paper.

Synchronizing clocks among the different simulators is akin to controlling time advancement in Parallel Discrete Event Simulation (PDES), a field of active research for several decades, which we use as a basis to explain our approach.
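The timestamp check behind the three scenarios above can be sketched in a few lines. This is our own illustrative Python, not the authors' implementation; the function and variable names are invented for clarity:

```python
def classify_packet(send_time, network_latency, receiver_now):
    """Decide how a packet tagged with its originating timestamp is handled.

    send_time       : simulated time ta at which the sender emitted the packet
    network_latency : modeled latency tn of the simulated network
    receiver_now    : current simulated time tc of the destination node
    """
    arrival = send_time + network_latency  # ideal delivery time, ta + tn
    if receiver_now <= arrival:
        # Scenarios (1) and (2): the receiver has not yet passed the delivery
        # time, so the packet can be held and scheduled at exactly ta + tn.
        return ("schedule", arrival)
    # Scenario (3): the receiver already simulated past ta + tn. The packet is
    # a "straggler": deliver immediately and accept the accuracy loss.
    return ("straggler", receiver_now)
```

For example, with ta = 10 and tn = 6, a receiver at tc = 12 gets the packet scheduled for time 16, while a receiver at tc = 20 receives it immediately as a straggler.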
Discrete Event Simulation (DES) models a set of state variables which have discrete transitions in response to events. These events are processed one at a time, each affecting the state variables and potentially scheduling more events. Parallel Discrete Event Simulation (PDES) partitions the state space among the multiple processing units.

Each node processes events and communicates with the rest to schedule events that may affect them. The main difficulty lies in determining the next event, since the first event in a local list may be preceded by events arriving from other nodes. In our scenario, events are network packets, and nodes are the different full-system simulators that process those network packets. The name stragglers comes from the PDES literature.

There are two main implementations of PDES simulators: conservative and optimistic. The optimistic approach assumes that stragglers are rare, and provides a checkpointing (fast) and rollback (slow) mechanism for those occasions when stragglers happen. By rolling back to a previously saved checkpoint, we can recover a coherent state and then reprocess the packet delivery in the correct timing sequence. If recovery happens infrequently, parallel simulation can proceed assuming no straggler will ever appear, and we achieve a substantial performance gain. In the conservative approach, no event gets locally processed until it is safe to do so. A basic implementation of conservative PDES assumes that all nodes operate in lockstep mode [16], by advancing through a set of discrete simulation quanta (Q). Safety is assured if Q <= T, where Q is the time duration of each quantum and T is the minimum latency of a packet traversing the simulated network.

Figure 4 shows an example of how this is accomplished. We can see two nodes (Node 1 and Node 3) sending packets to an intermediate Node 2. Assuming different network latencies for those two packets, and assuming they leave their respective nodes at different times, we can see how Node 2 receives those packets before the next quantum has started, which can be shown to always happen if Q <= T. It then schedules all the packets to appear whenever they should have appeared.
In this example we can see that the packet from Node 3 needs to arrive before the packet from Node 1 even though it functionally arrived later. Let’s assume that one of the nodes simulates faster. Even so, since the quanta are synchronized, this node must wait for the rest of the nodes to finish their quanta before proceeding all together.
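The lockstep scheme just described can be sketched with a barrier: no node starts quantum k+1 until every node has finished quantum k, so a packet sent during quantum k can always be scheduled before the receiver simulates past it (given Q <= T). The sketch below uses Python threads to stand in for node simulators; it illustrates the synchronization structure only, not the paper's implementation:

```python
import threading

NUM_NODES = 3
QUANTUM = 10                      # simulated time units per quantum (Q <= T)
barrier = threading.Barrier(NUM_NODES)
sim_time = [0] * NUM_NODES        # per-node simulated clocks

def run_node(node_id, quanta):
    for _ in range(quanta):
        # Simulate up to the end of the current quantum. A real node would
        # process its local events here, including any packets scheduled
        # for delivery within this quantum.
        sim_time[node_id] += QUANTUM
        # Lockstep: nobody starts the next quantum until everyone is done,
        # so no packet can ever arrive "in the past" of its receiver.
        barrier.wait()

threads = [threading.Thread(target=run_node, args=(i, 5)) for i in range(NUM_NODES)]
for t in threads: t.start()
for t in threads: t.join()
# All clocks agree at every quantum boundary.
assert sim_time == [50, 50, 50]
```

The barrier is exactly the "synchronization overhead" bubble discussed next: a fast node idles at `barrier.wait()` until the slowest node catches up.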

This mechanism prevents the occurrence of any straggler. At this point we should observe that we are basically constraining ourselves to a relaxed form of conservative simulation (in PDES terms). Like other parallel simulation projects, we have also considered the theoretical possibility of checkpointing the simulator periodically, so that we could roll back the nodes to the latest checkpoint before ta + tn and still deliver the event at the appropriate time. However, for full-system simulators, the checkpointing phase is too expensive: we have to save all the machine memory (several GBs for modern machines), as well as the state of the disk "journal" (potentially several more GBs), to be able to roll back to a previous execution state. A single checkpoint-rollback phase for a node can easily last on the order of 30-40 seconds, which is clearly not affordable in this domain. Because the checkpointing functionality of our full-system simulator was conceived for another purpose (saving the machine state for future use, as in VMWare [17]), its implementation is probably not optimized. However, we believe that even with optimization, checkpointing a full-system simulator is still far too time-consuming a step to enable an optimistic approach.

Unfortunately, quantum-based synchronization introduces a significant slowdown in cluster simulation. In Figure 5 we can see two nodes proceeding with and without synchronization. Not only does the synchronization introduce bubbles (tagged as synchronization overhead in the figure) at the end of each quantum, but the different speeds at which the simulators reach the quantum boundary mean that the slowest node basically sets the pace for the whole simulation. Fortunately, the Q <= T relation is a sufficient but not a necessary condition to maintain timing causality.
Since the flow of network packets is not constant, and there are many periods in which no packets are sent or received by some or all nodes, there is potential for bigger quanta during those periods. This concept is very similar to the idea of lookahead in PDES: the time interval during which a local simulation is guaranteed not to send messages to another node. Predicting lookahead and similar techniques have been used extensively to speed up PDES.

Figure 4. Quanta and timing causality

Figure 5. Slowdown due to synchronization

In full-system simulation there is no reliable way of determining that no further packet will appear on the network, so nodes cannot compute their lookahead. Estimating lookahead typically relies on well-defined topologies and propagation algorithms. In a computer cluster network implementing a "star" (all-to-all) topology, we cannot make any assumption about the target node of a packet, and we also have to deal with broadcasts and multicasts. Our proposal overcomes this limitation by allowing a controlled accuracy loss in the timing simulation and adjusting the quantum for maximum speed.

The technique works as described in Algorithm 1. The network controller controls the dynamic quantum duration, which starts at its minimum value. Depending on the density of network packets in the last quantum, the quantum duration can increase or decrease within a constrained range. In general, we have seen that the best configurations are those that grow the quantum in very small increments (such as 2% to 5%) but decrease it very quickly. In the experimental evaluation section, we present detailed results for several values of these parameters. Intuitively, setting dec to a value near 1/√(max_Q) or 1/∛(max_Q) forces a dramatic reduction of the quantum duration in just two or three quanta at most.

We describe this mechanism as "driving over speed bumps". Imagine all simulators are cars running at their own speed. In the absence of obstacles (i.e., network packets), they can increase their speed with a small constant acceleration. Whenever network traffic increases, the cars see it as a "speed bump" and are forced to decelerate abruptly to almost zero for the duration of the bump. After the cars pass the speed bump, acceleration resumes slowly.

Algorithm 1: Dynamic Quantum algorithm

Data: min_Q = Quantum minimum length
Data: max_Q = Quantum maximum length
Data: inc = Increase factor
Data: dec = Decrease factor
Data: Q = Current Quantum length
Data: np = # of network packets in last quantum

Q = min_Q;
repeat
    if (np = 0) then Q *= inc else Q *= dec end
    if (Q < min_Q) then Q = min_Q end
    if (Q > max_Q) then Q = max_Q end
until end of simulation
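Algorithm 1 translates almost directly into code. The sketch below is our own illustration; the parameter values (min 1, max 1000, inc 1.05, dec 0.02) mirror configurations evaluated later in the paper, but the class and method names are invented:

```python
class DynamicQuantum:
    """Minimal sketch of Algorithm 1 (the dynamic quantum controller)."""

    def __init__(self, min_q=1.0, max_q=1000.0, inc=1.05, dec=0.02):
        self.min_q, self.max_q = min_q, max_q
        self.inc, self.dec = inc, dec   # grow slowly, shrink abruptly
        self.q = min_q                  # quantum starts at its minimum

    def next_quantum(self, packets_last_quantum):
        # No traffic: accelerate gently. Any traffic: brake hard
        # (the "speed bump" of the analogy above).
        if packets_last_quantum == 0:
            self.q *= self.inc
        else:
            self.q *= self.dec
        # Clamp to the allowed [min_q, max_q] range.
        self.q = max(self.min_q, min(self.q, self.max_q))
        return self.q
```

With these values, a long idle phase saturates the quantum at max_q, and a single quantum with traffic collapses it back to the minimum in two steps (1000 * 0.02 = 20, then 20 * 0.02 = 0.4, clamped to 1), matching the "two or three quanta at most" intuition.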

Through this simple but effective mechanism we can control the accuracy loss introduced by longer quanta. Because we only lose accuracy in the presence of stragglers, we limit the number of stragglers to what can happen in a single quantum. As we see an increase in network traffic, we quickly lower the quantum to a value smaller than the network latency. By accelerating slowly, we capture the burst behavior of many distributed applications that cycle through a sequence of separate compute and communication phases. As the application terminates the communication phase and networking traffic decreases, we accelerate simulation by increasing the synchronization quantum.

4. Simulation and Evaluation Methodology

In this section we describe the different components of our simulation environment, as well as the benchmarks and parameters used for calculating results.

We use AMD's SimNow simulator [1] as the full-system simulator component of our system, augmented with our own timing simulator [8] to model the CPU and the rest of the relevant devices. The SimNow simulator is a fast full-system emulator using dynamic compilation and caching techniques, which supports booting an unmodified OS and executing complex applications on it. SimNow implements the x86/x86-64 instruction sets, including system devices, and supports unmodified execution of several operating systems, including Windows and Linux. Our timing extensions enable AMD's SimNow to faithfully model the timing of CPUs, disks and network interface cards (NICs).

In order to simulate clusters, we have added a "virtual network" application to AMD SimNow so that several SimNow instances can be interconnected through a functional network². Our NIC timing extensions within each SimNow-simulated node relay packets to the network controller, which is responsible for calling the network timing module and determining the total latency for each packet. The packet is then delivered to its final destination with information regarding its delivery time. The destination NIC uses its timing interface to instruct the internal SimNow event-scheduling system of the arrival of the network packet at the appropriate time. In other words, the network timing is composed of two parts: the timing of the NICs in each node, and the timing of the network switch connecting the nodes.

For the purposes of this paper we have chosen a very aggressive (i.e., fast) network configuration, which represents the worst possible scenario for our synchronization technique. We model a 10GB/s NIC with a minimum latency of 1µs, a perfect switch with infinite bandwidth and zero latency, and jumbo Ethernet packets (9000 bytes). With this configuration we want to stress the network by making sure that large packets have a very low latency. Lower latencies imply more stragglers and the need for better synchronization. Our intention is to ensure that our technique is capable of simulating high-speed networks at the fastest possible speed with high accuracy.

The benchmarks selected for our experiments belong to two groups. The NAS Parallel Benchmarks [14] represent high-performance computing applications with varying demands. This suite of programs provides a comprehensive view of the various performance characteristics of High Performance Computing systems. The suite consists of five kernels that mimic the computational core of numeric methods used by Computational Fluid Dynamics (CFD) applications and three CFD pseudo-applications.
Because not all NAS benchmarks can run on every combination of nodes, we have selected five of the eight applications for our experiments, those that could run on 2, 4 and 8-node clusters:

• Embarrassingly Parallel (EP): accumulates statistics from dynamically generated pseudorandom numbers. Requires little interprocessor communication.

• Integer Sort (IS): performs a sorting operation used frequently in "particle method" codes. Requires moderate data communication and significant synchronization.

• Conjugate Gradient (CG): computes an approximation to the smallest eigenvalue of a large, sparse, symmetric positive-definite matrix. Exhibits irregular long-distance communication.

• Multi-Grid (MG): a simplified multigrid kernel that solves a 3-D Poisson PDE. Exhibits both short- and long-distance communication with highly structured communication patterns.

² Note that the standard AMD SimNow distribution comes with a network "mediator" which implements similar functionality, but without node synchronization and timing, so we had to build our own.

Figure 6. NAS accuracy (left) and speedup (right); bars compare fixed quanta (10, 100, 1k) and the two dynamic configurations (dyn 1k 1.03:0.02, dyn 1k 1.05:0.02) for 2, 4 and 8 processors

Figure 7. NAMD accuracy (left) and speedup (right); bars compare fixed quanta (10, 100, 1k) and the two dynamic configurations (dyn 1k 1.03:0.02, dyn 1k 1.05:0.02) for 2, 4 and 8 processors

• Lower-Upper (LU): a regular-sparse, block (5x5) lower and upper triangular system solution. Exhibits a limited amount of parallelism and is a good indicator of network latency and instruction cache bandwidth.

NAS results are reported in Millions of Operations Per Second (MOPS) and aggregated through a harmonic mean. Following the recommendations of the NAS benchmarks, we have selected workload class A, which is suitable for systems with up to 32 processors. For the message communication library, we use the open-source LAM implementation [10] of the MPI library over TCP/IP.

The second benchmark selected for our experiments is NAMD [15]. NAMD is a parallel, object-oriented molecular dynamics code designed for high-performance simulation of large bio-molecular systems. NAMD scales to hundreds of processors on high-end parallel platforms, and to tens of processors on commodity clusters using gigabit Ethernet. NAMD uses the popular molecular graphics program VMD for simulation setup and trajectory analysis, but is also file-compatible with AMBER, CHARMM, and X-PLOR. The input selected for all our experiments is the one provided by NAMD to be used as a benchmark (apoa1).

The SimNow simulator guest runs a 64-bit Debian Linux. The simulation host is an HP ProLiant DL585 G1 server

with four 2.6GHz AMD Opteron dual-core processors running 64-bit RedHat 4ES Linux. The NAS benchmarks were compiled directly in the simulator with gcc/g77 version 4.0 at the '-O3' optimization level. For the NAMD benchmarks we used an out-of-the-box 64-bit binary downloaded from the NAMD website and optimized for messaging over UDP. The benchmarks run at maximum (guest) OS priority, to minimize the impact of other system processes. In order to evaluate just the execution of each benchmark, we restore a snapshot of the simulation taken when the machine is idle (except for standard OS housekeeping tasks) and directly invoke the execution of the benchmark from the Linux shell.

5. Experimental Results

This section provides our simulation results. Every experiment was run with 2, 4 and 8 processing nodes. Note that we limit the experiments to 8 nodes only because our simulation host has 8 cores, so adding more simulation processes would cause some serialization. Our approach could be distributed over a clustered simulation farm, but the results would then be influenced by the characteristics of the physical cluster network, and this is a perturbation whose effect we wanted to

100

NAMD 1000 NAS 1000

NAMD 100 NAS 100

NAMD dyn 2

Simulation Speedup

leave out of this experiment. For each cluster configuration we performed a series of different runs with varying quantum constant size and different acceleration and deceleration factors of our adaptive synchronization technique. During each run we measured accuracy and simulation speed. The accuracy measurement is derived from the application-specific metric reported by the benchmarks themselves about their performance in the simulated world. For example, NAMD reports wall-clock time and NAS reports MOPS. We then use the application-specific metrics as an estimate for the relative accuracy of each of the experiments, using the experiment with the smallest synchronization quantum (1µs) as the “ground truth”. Speed is simpler to compute, and we report it as relative speed-up of simulation execution of each configuration versus the execution of the “ground truth” model. Notice that a quantum of 1µs should be sufficient to enforce timing causality as explained in the previous sections. For each benchmark and number of processors we run different experiments with fixed quantum of 1µs, 10µs, 100µs and 1000µs. As we discussed, the 1µs model is our baseline and the only deterministically correct execution. The other configurations represent different points of “standard” fixed-quantum synchronizations versus which we compare our adaptive technique. Our two best configurations differ in the acceleration factor of 3% and 5%. Both have the same min (1µs) and max (1000µs) quantum and the√same deceleration factor of 0.02 which is very close to 1/ 1000, whose advantage was discussed in previous sections. Figure 6 shows the accuracy and speed results for the combination of all the NAS benchmarks. The accuracy results (left) represent the harmonic mean of the results of the individual NAS benchmarks. Each of the five bars of each group is relative to the accuracy of the baseline, the execution with Q = 1µs. 
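The adaptive policy above can be sketched as follows. This is a minimal reconstruction from the parameters given in the text (percentage growth per quiet quantum, multiplicative shrink of 0.02 on traffic, 1µs/1000µs bounds); the simulator's actual code is not shown in the paper and the function and constant names here are ours:

```python
# Minimal sketch of the adaptive quantum controller, reconstructed
# from the parameters in the text; names and structure are ours.

MIN_Q = 1.0      # µs: minimum quantum, the "ground truth" granularity
MAX_Q = 1000.0   # µs: maximum quantum
ACCEL = 1.03     # +3% growth per quiet quantum (the "dyn 1" setting)
DECEL = 0.02     # multiplicative shrink on traffic, close to 1/sqrt(1000)

def next_quantum(q, had_traffic):
    """Return the synchronization quantum (µs) for the next interval."""
    if had_traffic:
        q *= DECEL   # snap back toward fine-grain synchronization
    else:
        q *= ACCEL   # slowly relax synchronization during quiet phases
    return min(max(q, MIN_Q), MAX_Q)
```

With these constants, roughly 234 consecutive quiet quanta are needed to climb from 1µs to 1000µs (1.03^234 ≈ 1000), while a single quantum with traffic collapses the quantum by a factor of 50, which is what makes the scheme conservative around communication phases.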
We can observe that longer quanta are progressively more harmful to accuracy as the number of nodes increases. Intuition confirms this: more nodes imply more communication, and hence more stragglers in larger-quantum scenarios. Our technique nonetheless shows very accurate behavior, even for 8-node systems. In the speed chart (right), note that the scale of the improvements is absolute: our technique shows a speedup of up to 26x for the 8-node configuration. This is still far from the nearly 65x speedup of the Q = 1000µs model, but to reach 65x that model pays close to an 85% accuracy error, which is clearly unacceptable. Given the excellent accuracy our approach shows (less than 5% error), we are close to an optimal speed/accuracy tradeoff.

Figure 7 shows the charts for the NAMD benchmark. This benchmark exhibits larger accuracy errors for our technique, but always under 6% even in our worst case, the 5% acceleration mode on the 8-node system. Nevertheless, we are still far from the 20% accuracy error shown by the fastest configurations. The speed figures are as impressive as those for NAS.

Figure 8. Pareto Optimality Curve

Figure 8 summarizes the speed vs. accuracy tradeoffs of the two proposed techniques and how they compare with the experiments run with larger quanta. The results in this figure correspond to the 8-node systems. On the x axis we plot the relative accuracy error, as a percentage, with respect to the baseline Q = 1µs; smaller values represent better accuracy. On the logarithmic y axis we plot the simulation speedup with respect to the same baseline; larger values represent better speedups. Each point represents the accuracy error and speed of one experiment: square points correspond to NAS and round points to NAMD. Our two configurations are tagged dyn 1 (3% acceleration) and dyn 2 (5% acceleration). The dotted line shows the Pareto optimality curve highlighting the "optimal" points of the explored space. A point is considered Pareto optimal if no other point performs at least as well on one criterion (accuracy error or simulation speedup) and strictly better on the other. All adaptive configurations lie on or very near the Pareto curve, and can thus be considered nearly optimal.
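The Pareto-optimality criterion just defined can be sketched as a simple dominance filter; the sample points below are illustrative, not the measured data of Figure 8:

```python
# Sketch of a Pareto-optimality filter for (accuracy error, speedup)
# points: lower error is better, higher speedup is better.

def pareto_front(points):
    """Return the points not dominated by any other point."""
    front = []
    for err, spd in points:
        dominated = any(
            e <= err and s >= spd and (e < err or s > spd)
            for e, s in points
        )
        if not dominated:
            front.append((err, spd))
    return sorted(front)

# Illustrative (error %, speedup) points only:
samples = [(0.5, 20.0), (5.0, 26.0), (85.0, 65.0), (10.0, 15.0)]
print(pareto_front(samples))  # [(0.5, 20.0), (5.0, 26.0), (85.0, 65.0)]
```

The point (10.0, 15.0) is dropped because (5.0, 26.0) has both lower error and higher speedup, i.e., it dominates on both criteria.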

6. Scaling Out

In this section we discuss a case study of some individual benchmarks in large scaled-out scenarios. The target configuration is a cluster of 64 single-processor nodes, simulated on a computing farm of sixteen HP ProLiant BL25p blades with two dual-core AMD Opteron sockets each. With this configuration, each simulated node can have a dedicated CPU. We run the 1µs-quantum baseline as the maximum-accuracy reference, together with our "best" dynamic algorithm and a few fixed-quantum experiments.

The left charts of Figures 9(a), 9(b) and 9(c) represent the packet traffic over time, with nodes on the y axis (one per line), time on the x axis, and a line drawn from the source to the destination node for each exchanged packet. The right charts represent simulation speedup over time, relative to the average speed of the 1µs-quantum simulation, on a logarithmic y scale.

Figure 9. 64-node benchmarks: traffic (left) and speedup vs. 1µs-quantum over time (right): (a) NAS/EP, (b) NAS/IS, (c) NAMD

• NAS-EP (embarrassingly parallel) is a good example where a dynamic quantum performs well in speed and accuracy even for large configurations. Because of its limited amount of communication (Fig. 9(a)), our adaptive technique is able to reduce the synchronization overhead while preserving excellent precision.

  Quantum (µs)      Acceleration vs. 1µs    Accuracy Error vs. 1µs
  100               72.7x                   0.10%
  10                7.9x                    0.01%
  1:100 (adaptive)  12.9x                   0.58%

• NAS-IS (integer sort) is a worst-case benchmark for accuracy because of its fine-grain synchronization (Fig. 9(b)). The use of MPI_Alltoall() causes long chains of packet dependences throughout the benchmark which, when dilated by a longer synchronization quantum, create a dramatic loss of accuracy and a major divergence of estimated simulation time. For example, with a quantum of 100µs, observed MOPS are off by a factor of over 150x, and by a factor of 20x with a quantum of 10µs. With a very conservative adaptation schedule (slow acceleration and fast deceleration) we regain some level of accuracy, but only to around 60%.

  Quantum (µs)      Acceleration vs. 1µs    Simulated Exec. Ratio vs. 1µs
  100               84x                     150x
  10                9.8x                    22x
  1:100 (adaptive)  27x                     1.57x

• NAMD is a worst-case benchmark for speed because of the density of its data packet traffic. As the packet chart (left) of Figure 9(c) shows, there is no visible interval where the application is not exchanging data over the network. This causes the simulation performance to suffer significantly, and it is not possible to reach the two orders of magnitude of speedup we showed for smaller configurations. As the right chart of Figure 9(c) shows, the continuous presence of packets flowing through the simulated switch caps the speedup below 10x. On the other hand, the adaptive quantum algorithm automatically adjusts its speed and accuracy to approximate the "best" quantum (around 10µs). This implies that we do not have to try multiple quanta to find the speed/accuracy sweet spot.

  Quantum (µs)      Acceleration vs. 1µs    Accuracy Error vs. 1µs
  100               77.2x                   104%
  10                9.1x                    1.01%
  2:100 (adaptive)  6.5x                    0.79%

7. Conclusions

In this paper we have introduced a novel approach that combines individual full-system node simulators into a cluster simulator without paying excessive synchronization costs and while minimizing the accuracy loss. We have explained the problems that arise in this combination and studied mechanisms to resolve them. We believe that cluster simulators using this approach will become the norm.

We have presented a novel adaptive synchronization technique that allows such cluster simulators to achieve fast simulation speed while retaining good accuracy. It is representative of a broader class of adaptive techniques that dynamically trade simulation speed for accuracy and vice versa. In full-system simulation, such a feature is of tremendous importance, since it allows faster simulation of uninteresting areas and slower analysis of interesting ones.

What we presented in this paper is a first step toward developing cluster simulators. Our results for small-to-medium clusters show speedups of over 20x versus a perfectly synchronized simulation, with less than a 5% accuracy error in most cases (and less than 1% in some). In some experiments simulating larger clusters (64 nodes), the effectiveness of the algorithm diminishes somewhat, as can be expected from the increase in overall traffic density.

In the future, we plan to extend this work to more complex clusters and more demanding applications. This implies understanding the implications of pathological cases like "IS" and mitigating the effects of excessive synchronization traffic. Finally, we also plan to combine this technique with "sampling" of the individual node simulators to take further advantage of another accuracy/speed tradeoff. We believe that the combination of these techniques will open up a much wider application space for full-system simulation.

References

[1] R. Bedichek. SimNow: Fast platform simulation purely in software. In Hot Chips 16, Aug. 2004.
[2] F. Bellard. QEMU, a fast and portable dynamic translator. In USENIX 2005 Annual Technical Conference, FREENIX Track, p. 41–46, Apr. 2005.
[3] M. Bergqvist, J. Engblom, M. Patel, and L. Lundegård. Some experience from the development of a simulator for a telecom cluster (CPPemu). In Software Engineering and Applications, p. 13–15, Nov. 2006.
[4] D. Burger and D. A. Wood. Accuracy vs. performance in parallel simulation of interconnection networks. In Proc. of the 9th International Symposium on Parallel Processing, p. 22–31, Apr. 1995.
[5] H. Davis, S. R. Goldschmidt, and J. Hennessy. Multiprocessor simulation and tracing using Tango. In Multiprocessor Performance Measurement and Evaluation, p. 141–149, 1995.
[6] R. Davoli. VDE: Virtual Distributed Ethernet. In Proc. of the First International Conference on Testbeds and Research Infrastructures for the Development of Networks and Communities (TRIDENTCOM'05), p. 213–220, 2005.
[7] J. Engblom and D. Ekblom. Simics: A commercially proven full-system simulation framework. In Workshop on Simulation in European Space Programmes, Nov. 2006.
[8] A. Falcón, P. Faraboschi, and D. Ortega. Combining simulation and virtualization through dynamic sampling. In ISPASS, p. 72–83, 2007.
[9] R. M. Fujimoto. Parallel discrete event simulation. Commun. ACM, 33(10):30–53, 1990.
[10] LAM-MPI website. http://www.lam-mpi.org/.
[11] G. Lencse. Efficient parallel simulation with the statistical synchronization method. In Proc. of the 1998 Conference on Communication Networks and Distributed Systems Modeling and Simulation (CNDS'98), p. 3–8, Jan. 1998.
[12] P. S. Magnusson, M. Christensson, J. Eskilson, D. Forsgren, G. Hållberg, J. Högberg, F. Larsson, A. Moestedt, and B. Werner. Simics: A full system simulation platform. Computer, 35(2):50–58, Feb. 2002.
[13] J. Misra. Distributed discrete-event simulation. ACM Computing Surveys, 18(1):39–65, 1986.
[14] NASA Ames Research Center. The NAS parallel benchmarks. http://www.nas.nasa.gov/Software/NPB.
[15] J. C. Phillips, R. Braun, W. Wang, J. Gumbart, E. Tajkhorshid, E. Villa, C. Chipot, R. D. Skeel, L. Kale, and K. Schulten. Scalable molecular dynamics with NAMD. Journal of Computational Chemistry, 26(16):1781–1802, Oct. 2005.
[16] S. Reinhardt, M. Hill, J. Larus, A. Lebeck, J. Lewis, and D. Wood. The Wisconsin Wind Tunnel: Virtual prototyping of parallel computers. In Proc. of the SIGMETRICS Conference on Measurement and Modeling of Computer Systems, p. 48–60, May 1993.
[17] M. Rosenblum. VMware's virtual platform: A virtual machine monitor for commodity PCs. In Hot Chips 11, Aug. 1999.
[18] TOP500 Project. TOP500 Supercomputer Sites. http://www.top500.org.
[19] VMware Inc. website. http://www.vmware.com.
