Efficient Offloading of Parallel Kernels Using MPI_Comm_spawn

Sebastian Rinke, Suraj Prabhakaran, Felix Wolf | HUCAA'13 | October 1, 2013

The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under Grant Agreement no. 287530.


State of the Art

[Figure: accelerators (ACs) attached directly to compute nodes (CNs), which connect to the interconnect]


State of the Art: Pros and Cons

Pros
- High bandwidth
- Low latency
- User has a simple view

Cons
- Oblivious to varying workloads ⇒ idle/overloaded ACs
- CNs and ACs affect each other's availability

Network-attached Accelerators

[Figure: CNs and ACs attached to the interconnect as independent nodes]

Pros
- AC allocation based on application needs
- Distributed-memory kernel offload to multiple ACs
- MPI within (larger) kernels
- ACs and CNs have their own network interfaces

Network-attached Accelerators: Cons

- Greater penalty for data transfers
- How to offload MPI kernels to ACs? Yet another programming model?
  ⇒ No, use MPI's dynamic process model!

Outline
- DEEP Architecture
- Offloading Approaches
- Implementation
- Results
- Conclusion

DEEP Architecture

[Figure: a Cluster of CNs connected via InfiniBand; Booster Interface Cards (BICs) link the Cluster to a Booster of booster nodes (BNs)]

Overview
- Cluster: 128 cluster nodes (2 Intel Xeon E5-2680 each), QDR InfiniBand
- Booster: 512 booster nodes (1 Intel Xeon Phi each), EXTOLL network (8×8×8 3D torus)
- MPI over the complete system


Offloading Approach: Why?

- Main program and kernels are MPI programs
- Start all CN and AC processes at job start
- Distinguish between CN and AC processes at runtime

[Figure: CNs 0-3 and ACs 4-6 together in one MPI_COMM_WORLD]

Offloading Approach: Why? Con

- All processes share one MPI_COMM_WORLD

Workaround
- Split the communicator
- Replace all occurrences of MPI_COMM_WORLD with the new communicator
  ⇒ Major code changes
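For illustration, a minimal sketch of this workaround (not from the slides; the rule that the first NUM_CN ranks are CN processes is an assumption made here):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Hypothetical rule: the first NUM_CN ranks are CN processes,
           all remaining ranks are AC processes. */
        enum { NUM_CN = 4 };
        int is_ac = (rank >= NUM_CN);

        /* Split MPI_COMM_WORLD into one communicator per role. */
        MPI_Comm role_comm;
        MPI_Comm_split(MPI_COMM_WORLD, is_ac, rank, &role_comm);

        /* From here on, every former use of MPI_COMM_WORLD in the
           application has to refer to role_comm instead. */

        MPI_Comm_free(&role_comm);
        MPI_Finalize();
        return 0;
    }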

Spawn

- Create AC processes at runtime with MPI_Comm_spawn()
- Provides separate communicators, each starting at rank 0
- Collectives for convenient intercommunication

[Figure: CNs 0-3 in COMM_WORLD (A); spawned ACs 0-2 in their own COMM_WORLD (B)]

MPI_Comm_spawn()

    MPI_Comm_spawn(
        char     *command,
        char     *argv[],
        int       maxprocs,
        MPI_Info  info,
        int       root,
        MPI_Comm  comm,
        MPI_Comm *intercomm,
        int       array_of_errcodes[])

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]
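A minimal usage sketch on the CN side (the executable name "ac_kernels" and the process count of 3 are placeholders, not values from the paper):

    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Spawn 3 AC processes running a separate executable. The parents
           obtain an intercommunicator whose remote group contains the
           spawned children. */
        MPI_Comm intercomm;
        MPI_Comm_spawn("ac_kernels", MPI_ARGV_NULL, 3, MPI_INFO_NULL,
                       0, MPI_COMM_WORLD, &intercomm, MPI_ERRCODES_IGNORE);

        /* The children obtain the matching intercommunicator by calling
           MPI_Comm_get_parent(). */

        MPI_Comm_disconnect(&intercomm);
        MPI_Finalize();
        return 0;
    }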

Spawn Usage Scenario

[Figure sequence: CNs 0-3 in COMM_WORLD (A) spawn ACs 0-2, which form their own COMM_WORLD (B); the CNs send input data to the ACs, the ACs compute, and the CNs finally receive the results]

Spawn: Con

- One spawn allows for only one kernel execution

Workarounds
1. Terminate and re-spawn the AC processes
2. Spawn once and use a protocol to trigger kernel executions (see the sketch below)
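On the AC side, workaround 2 could look roughly like this (a hypothetical user-implemented protocol; the integer kernel IDs and the negative stop value are invented for this sketch):

    #include <mpi.h>
    #include <stdio.h>

    /* User-implemented kernels (stubs here). */
    static void kernel0(MPI_Comm parents) { (void)parents; puts("kernel0"); }
    static void kernel1(MPI_Comm parents) { (void)parents; puts("kernel1"); }

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* Intercommunicator to the spawning CN processes. */
        MPI_Comm parents;
        MPI_Comm_get_parent(&parents);

        /* Dispatch loop: receive a kernel ID broadcast by the parents'
           root (rank 0 in the remote group), run the matching kernel,
           repeat until a negative ID requests termination. */
        for (;;) {
            int kernel_id;
            MPI_Bcast(&kernel_id, 1, MPI_INT, 0, parents);
            if (kernel_id < 0) break;
            if (kernel_id == 0) kernel0(parents);
            if (kernel_id == 1) kernel1(parents);
        }

        MPI_Finalize();
        return 0;
    }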

Spawn + Kernel Call

- Create AC processes at runtime with MPI_Comm_spawn()
- Trigger kernel executions with MPIX_Kernel_call()
  ⇒ No need for re-spawning or a user-implemented protocol

[Figure: a CN triggers "run kernel" on ACs 0-2 in COMM_WORLD (B)]

MPIX_Kernel_call()

    MPIX_Kernel_call(
        char     *kernelname,
        int       argcount,
        void     *args[],
        int      *argsizes,
        int       root,
        MPI_Comm  comm,
        MPI_Comm  intercomm)

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]

Kernel Call Usage Scenario

[Figure sequence: CNs 0-3 in COMM_WORLD (A) spawn ACs 0-2 in COMM_WORLD (B); a kernel run is triggered, input data is sent to the ACs, the ACs compute, the results are received back, and the next kernel run is triggered without re-spawning]

Kernel Call Code Example

CN side:

    void main(int argc, char **argv) {
        // Spawn AC processes
        MPI_Comm_spawn(..., comm, &intercomm, ...);

        // Start "kernel0" on ACs
        MPIX_Kernel_call("kernel0", ..., comm, intercomm);

        // Send input data to kernel functions
        MPI_Alltoall(..., intercomm);

        // Do some other calculations
        ...

        // Recv results from kernel functions
        MPI_Alltoall(..., intercomm);
    }

AC side:

    void kernel0(double a, int b, char c) {
        // Get intercommunicator to parents
        MPI_Comm_get_parent(&intercomm);

        // Recv input data from parents
        MPI_Alltoall(..., intercomm);

        // Do calculations and communicate
        ...

        // Send results to parents
        MPI_Alltoall(..., intercomm);
    }

MPIX_Kernel_call_multiple()

    MPIX_Kernel_call_multiple(
        int       count,
        char     *array_of_kernelname[],
        int      *array_of_argcount,
        void    **array_of_args[],
        int      *array_of_argsizes[],
        int       root,
        MPI_Comm  comm,
        MPI_Comm  intercomm)

[Figure: intercommunicator linking a local group (ranks 0-2) with a remote group (ranks 0-5)]
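A sketch of what a batched call might look like, inferred from the signature above (the argument layout is a guess, not taken from the paper; comm and intercomm are assumed to come from MPI_Comm_spawn() as before):

    /* Submit two kernels in one batch, each taking one double argument. */
    char  *names[]     = { "kernel0", "kernel1" };
    int    argcounts[] = { 1, 1 };

    double a0 = 1.0, a1 = 2.0;
    void  *args0[] = { &a0 };
    void  *args1[] = { &a1 };
    void **args[]  = { args0, args1 };

    int    sizes0[] = { (int)sizeof(double) };
    int    sizes1[] = { (int)sizeof(double) };
    int   *sizes[]  = { sizes0, sizes1 };

    MPIX_Kernel_call_multiple(2, names, argcounts, args, sizes,
                              0, comm, intercomm);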


Spawned Program

- Kernels must be available in the program spawned on the ACs
- Kernel execution requests are handled by the (provided) main function
- The programmer implements only the kernel functions
  ⇒ Union of both parts during linking

[Figure: the provided Main is linked together with the user's Kernels]

MPIX_Kernel_call*()

1. CNs send kernel requests to the ACs
   - Kernel names
   - Kernel arguments
2. ACs start the first kernel
   - Get the address of the kernel function through dlsym()
   - Push the kernel arguments onto the process stack
   - Invoke the kernel using a function pointer
3. ACs run all remaining kernels and wait for new requests

AC processes are terminated by sending an empty kernel name.
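Step 2 can be illustrated with a small self-contained sketch of the dlsym() lookup (simplified: a fixed void(void) kernel signature is assumed here, whereas the actual mechanism pushes marshalled arguments onto the stack):

    #include <dlfcn.h>
    #include <stdio.h>

    /* Example kernel; with -rdynamic its symbol stays visible. */
    void kernel0(void) { puts("kernel0 running"); }

    typedef void (*kernel_fn)(void);

    int main(void) {
        /* Search the symbol table of the running executable itself. */
        void *handle = dlopen(NULL, RTLD_NOW);
        if (!handle) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        kernel_fn k = (kernel_fn)dlsym(handle, "kernel0");
        if (!k) { fprintf(stderr, "%s\n", dlerror()); return 1; }

        k();               /* invoke through the function pointer */
        dlclose(handle);
        return 0;
    }
    /* Build (GCC/Linux): gcc -rdynamic sketch.c -ldl */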

MPIX_Kernel_call*(): Note

The compiler may change the symbol name of a kernel function, so avoid name mangling for kernel entry functions:
- C: no issues
- C++: declare with extern "C"
- Fortran: define with the BIND(C) attribute
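For example, a shared header can keep a kernel entry's unmangled C symbol name under both C and C++ compilation (a generic interop idiom, not specific to the paper):

    /* kernel0 keeps its unmangled C symbol name even when the
       defining translation unit is compiled as C++. */
    #ifdef __cplusplus
    extern "C"
    #endif
    void kernel0(double a, int b, char c);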


Results

Evaluated: kernel startup overhead for
- Multiple spawns
- Spawn + MPIX_Kernel_call()
- Spawn + MPIX_Kernel_call_multiple()

Benchmark Environment
- Cluster part of DEEP: 120 nodes (2 Intel Xeon E5-2680 each)
- 40 CNs, 80 ACs ⇒ CN : AC = 1 : 2
- QDR InfiniBand
- Open MPI 1.6.4, NFS file system

- 2 processes per CN ⇒ 80 CN processes
- 1 process per AC ⇒ 80 AC processes
- 5 doubles as kernel arguments

Kernel Startup Times

[Figure: kernel startup time (Time [sec], 0-50) vs. number of kernel calls (1 to 256) for "Multiple spawns", "1 Spawn + Kernel_call", and "1 Spawn + Kernel_call_multiple"; the multiple-spawns curve leaves the plotted range, with data labels of 50, 85, 170, and 340 s at the highest call counts]

Kernel_call vs. Kernel_call_multiple

[Figure: time (Time [msec], 0-6) vs. number of kernel calls (1 to 256) for Kernel_call and Kernel_call_multiple]


Conclusion

- Network-attached ACs can help address varying scaling characteristics within the same application
- The numbers of CNs and network-attached ACs are independent of each other
- Distributed-memory kernel functions with MPI communication can be offloaded
- The offloading mechanism is based on MPI_Comm_spawn()
- MPIX_Kernel_call*() complements MPI's spawn ⇒ reduced kernel startup overhead
- Ongoing: working with application developers to integrate the offloading

Thank you.

