Efficient Offloading of Parallel Kernels Using MPI_Comm_spawn

Sebastian Rinke, Suraj Prabhakaran, Felix Wolf | HUCAA'13 | October 1, 2013

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 287530.


State of the Art

[Figure: accelerators (ACs) attached directly to compute nodes (CNs), which connect to the interconnect]


State of the Art: Pros and Cons

Pros
- High bandwidth
- Low latency
- User has a simple view

Cons
- Oblivious to varying workloads ⇒ idle/overloaded ACs
- CNs and ACs affect each other's availability

Network-attached Accelerators

[Figure: CNs and ACs attached to the interconnect as independent nodes]

Pros
- AC allocation based on application needs
- Distributed-memory kernel offload to multiple ACs
- MPI within (larger) kernels
- ACs and CNs have their own network interfaces

Network-attached Accelerators: Cons

- Greater penalty for data transfers
- How to offload MPI kernels to ACs? Yet another programming model?
  ⇒ No, use MPI's dynamic process model!

Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion

DEEP Architecture

[Figure: a Cluster of CNs connected via InfiniBand; Booster Interface Cards (BICs) link the Cluster to a Booster of booster nodes (BNs)]

Overview
- Cluster: 128 cluster nodes (2 Intel Xeon E5-2680 each), QDR InfiniBand
- Booster: 512 booster nodes (1 Intel Xeon Phi each), EXTOLL network (8×8×8 3D torus)
- MPI over the complete system


Offloading Approach: Why?

- Main program and kernels are MPI programs
- Start all CN and AC processes at job start
- Distinguish between CN and AC processes at runtime

[Figure: CNs 0-3 and ACs 4-6 together in one MPI_COMM_WORLD]

Offloading Approach: Why? Con

- All processes share one MPI_COMM_WORLD

Workaround
- Split the communicator
- Replace all occurrences of MPI_COMM_WORLD with the new communicator
  ⇒ Major code changes
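For illustration, a minimal sketch of this workaround (not from the slides; the rule that the first NUM_CN ranks are CN processes is an assumption made here):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical rule: the first NUM_CN ranks are CN processes,
           all remaining ranks are AC processes. */
        enum { NUM_CN = 4 };
        int is_ac = (rank >= NUM_CN);

        /* Split MPI_COMM_WORLD into one communicator per role. */
        MPI_Comm role_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_ac, rank, &role_comm);

        /* From here on, every former use of MPI_COMM_WORLD in the
           application has to refer to role_comm instead. */

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }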

Spawn

- Create AC processes at runtime with MPI_Comm_spawn()
- Provides separate communicators, each starting at rank 0
- Collectives for convenient intercommunication

[Figure: CNs 0-3 in COMM_WORLD (A); spawned ACs 0-2 in their own COMM_WORLD (B)]

MPI_Comm_spawn()

    MPI_Comm_spawn(
        char     *command,
        char     *argv[],
        int       maxprocs,
        MPI_Info  info,
        int       root,
        MPI_Comm  comm,
        MPI_Comm *intercomm,
        int       array_of_errcodes[])

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]
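A minimal usage sketch on the CN side (the executable name "ac_kernels" and the process count of 3 are placeholders, not values from the paper):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Spawn 3 AC processes running a separate executable. The parents
           obtain an intercommunicator whose remote group contains the
           spawned children. */
        MPI_Comm intercomm;
        MPI_Comm_spawn("ac_kernels", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        /* The children obtain the matching intercommunicator by calling
           MPI_Comm_get_parent(). */

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }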

Spawn Usage Scenario

[Figure sequence: CNs 0-3 in COMM_WORLD (A) spawn ACs 0-2, which form their own COMM_WORLD (B); the CNs send input data to the ACs, the ACs compute, and the CNs finally receive the results]

Spawn: Con

- One spawn allows for only one kernel execution

Workarounds
1. Terminate and re-spawn the AC processes
2. Spawn once and use a protocol to trigger kernel executions (see the sketch below)
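On the AC side, workaround 2 could look roughly like this (a hypothetical user-implemented protocol; the integer kernel IDs and the negative stop value are invented for this sketch):

    #include <mpi.h>
    #include <stdio.h>

    /* User-implemented kernels (stubs here). */
    static void kernel0(MPI_Comm parents) { (void)parents; puts("kernel0"); }
    static void kernel1(MPI_Comm parents) { (void)parents; puts("kernel1"); }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Intercommunicator to the spawning CN processes. */
        MPI_Comm parents;
        MPI_Comm_get_parent(&parents);

        /* Dispatch loop: receive a kernel ID broadcast by the parents'
           root (rank 0 in the remote group), run the matching kernel,
           repeat until a negative ID requests termination. */
        for (;;) {
            int kernel_id;
            MPI_Bcast(&kernel_id, 1, MPI_INT, 0, parents);
            if (kernel_id < 0) break;
            if (kernel_id == 0) kernel0(parents);
            if (kernel_id == 1) kernel1(parents);
        }

        MPI_Finalize();
        return 0;
    }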

Spawn + Kernel Call

- Create AC processes at runtime with MPI_Comm_spawn()
- Trigger kernel executions with MPIX_Kernel_call()
  ⇒ No need for re-spawning or a user-implemented protocol

[Figure: a CN triggers "run kernel" on ACs 0-2 in COMM_WORLD (B)]

MPIX_Kernel_call()

    MPIX_Kernel_call(
        char     *kernelname,
        int       argcount,
        void     *args[],
        int      *argsizes,
        int       root,
        MPI_Comm  comm,
        MPI_Comm  intercomm)

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]

Kernel Call Usage Scenario

[Figure sequence: CNs 0-3 in COMM_WORLD (A) spawn ACs 0-2 in COMM_WORLD (B); a kernel run is triggered, input data is sent to the ACs, the ACs compute, the results are received back, and the next kernel run is triggered without re-spawning]

Kernel Call Code Example

CN side:

    void main(int argc, char **argv) {
        // Spawn AC processes
        MPI_Comm_spawn(..., comm, &intercomm, ...);

        // Start "kernel0" on ACs
        MPIX_Kernel_call("kernel0", ..., comm, intercomm);

        // Send input data to kernel functions
        MPI_Alltoall(..., intercomm);

        // Do some other calculations
        ...

        // Recv results from kernel functions
        MPI_Alltoall(..., intercomm);
    }

AC side:

    void kernel0(double a, int b, char c) {
        // Get intercommunicator to parents
        MPI_Comm_get_parent(&intercomm);

        // Recv input data from parents
        MPI_Alltoall(..., intercomm);

        // Do calculations and communicate
        ...

        // Send results to parents
        MPI_Alltoall(..., intercomm);
    }

MPIX_Kernel_call_multiple()

    MPIX_Kernel_call_multiple(
        int       count,
        char     *array_of_kernelname[],
        int      *array_of_argcount,
        void    **array_of_args[],
        int      *array_of_argsizes[],
        int       root,
        MPI_Comm  comm,
        MPI_Comm  intercomm)

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]
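A sketch of what a batched call might look like, inferred from the signature above (the argument layout is a guess, not taken from the paper; comm and intercomm are assumed to come from MPI_Comm_spawn() as before):

    /* Submit two kernels in one batch, each taking one double argument. */
    char  *names[]     = { "kernel0", "kernel1" };
    int    argcounts[] = { 1, 1 };

    double a0 = 1.0, a1 = 2.0;
    void  *args0[] = { &a0 };
    void  *args1[] = { &a1 };
    void **args[]  = { args0, args1 };

    int    sizes0[] = { (int)sizeof(double) };
    int    sizes1[] = { (int)sizeof(double) };
    int   *sizes[]  = { sizes0, sizes1 };

    MPIX_Kernel_call_multiple(2, names, argcounts, args, sizes,
                              0, comm, intercomm);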


Spawned Program

- Kernels must be available in the program spawned on the ACs
- Kernel execution requests are handled by the (provided) main function
- The programmer implements only the kernel functions
  ⇒ Union of both parts during linking

[Figure: the provided Main is linked together with the user's Kernels]

MPIX_Kernel_call*()

1. CNs send kernel requests to the ACs
   - Kernel names
   - Kernel arguments
2. ACs start the first kernel
   - Get the address of the kernel function through dlsym()
   - Push the kernel arguments onto the process stack
   - Invoke the kernel using a function pointer
3. ACs run all remaining kernels and wait for new requests

AC processes are terminated by sending an empty kernel name.
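Step 2 can be illustrated with a small self-contained sketch of the dlsym() lookup (simplified: a fixed void(void) kernel signature is assumed here, whereas the actual mechanism pushes marshalled arguments onto the stack):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Example kernel; with -rdynamic its symbol stays visible. */
    void kernel0(void) { puts("kernel0 running"); }

    typedef void (*kernel_fn)(void);

    int main(void) {
        /* Search the symbol table of the running executable itself. */
        void *handle = dlopen(NULL, RTLD_NOW);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        kernel_fn k = (kernel_fn)dlsym(handle, "kernel0");
        if (!k) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        k();               /* invoke through the function pointer */
        dlclose(handle);
        return 0;
    }
    /* Build (GCC/Linux): gcc -rdynamic sketch.c -ldl */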

MPIX_Kernel_call*(): Note

The compiler may change the symbol name of a kernel function, so avoid name mangling for kernel entry functions:
- C: no issues
- C++: declare with extern "C"
- Fortran: define with the BIND(C) attribute
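For example, a shared header can keep a kernel entry's unmangled C symbol name under both C and C++ compilation (a generic interop idiom, not specific to the paper):

    /* kernel0 keeps its unmangled C symbol name even when the
       defining translation unit is compiled as C++. */
    #ifdef __cplusplus
    extern "C"
    #endif
    void kernel0(double a, int b, char c);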


Results

Evaluated: kernel startup overhead for
- Multiple spawns
- Spawn + MPIX_Kernel_call()
- Spawn + MPIX_Kernel_call_multiple()

Benchmark Environment
- Cluster part of DEEP: 120 nodes (2 Intel Xeon E5-2680 each)
- 40 CNs, 80 ACs ⇒ CN : AC = 1 : 2
- QDR InfiniBand
- Open MPI 1.6.4, NFS file system

- 2 processes per CN ⇒ 80 CN processes
- 1 process per AC ⇒ 80 AC processes
- 5 doubles as kernel arguments

Kernel Startup Times

[Figure: kernel startup time (Time [sec], 0-50) vs. number of kernel calls (1 to 256) for "Multiple spawns", "1 Spawn + Kernel_call", and "1 Spawn + Kernel_call_multiple"; the multiple-spawns curve leaves the plotted range, with data labels of 50, 85, 170, and 340 s at the highest call counts]

Kernel_call vs. Kernel_call_multiple

[Figure: time (Time [msec], 0-6) vs. number of kernel calls (1 to 256) for Kernel_call and Kernel_call_multiple]


Conclusion

- Network-attached ACs can help address varying scaling characteristics within the same application
- The numbers of CNs and network-attached ACs are independent of each other
- Distributed-memory kernel functions with MPI communication can be offloaded
- The offloading mechanism is based on MPI_Comm_spawn()
- MPIX_Kernel_call*() complements MPI's spawn ⇒ reduced kernel startup overhead
- Ongoing: working with application developers to integrate the offloading

Thank you.

