Towards a High Level Approach for the Programming of Heterogeneous Clusters
M. Viñas, B. B. Fraguela, D. Andrade, R. Doallo
Computer Architecture Group, Universidade da Coruña, Spain

Motivation

- Heterogeneous clusters are being increasingly adopted
  - Large performance and power benefits
- They involve more programming effort
  - Distributed memory, both between nodes and between host and devices
  - Accelerators are harder to program than traditional CPUs

Motivation (II)

- Many proposals to tackle the programming of these systems. Often:
  - Still many low-level details exposed
  - SPMD processes
  - Task-parallelism
- Higher-level approaches should be explored

Contribution

- Explore data parallelism at cluster level combined with a tool for heterogeneous programming
- Achieved using:
  - Hierarchically Tiled Array data type: arrays distributed by tiles on the cluster
  - Heterogeneous Programming Library: simple development of heterogeneous applications

Outline

- Hierarchically Tiled Arrays
- Heterogeneous Programming Library
- Integration
- Evaluation
- Conclusions and future work

Hierarchically Tiled Arrays (HTAs)

- Data type in a sequential language
- HTAs are tiled arrays where each tile can be a standard array or a tiled array. They provide:
  - Data distribution in distributed memory
    - Provides a global view
  - Locality
  - Data parallelism in operations on tiles
    - Single thread of control except in the data-parallel operations
- Implementation based on C++ and MPI
  - http://polaris.cs.uiuc.edu/hta/

HTA Indexing

- Can choose ranges of tiles, scalars, or combinations of both

  HTA h = HTA::alloc({3, 3}, {2, 2});  // 2x2 arrangement of 3x3 tiles

  h({0, 0:1})             // tile selection: tiles (0,0) and (0,1)
  h[{0, 0:1}]             // scalar selection: elements (0,0) and (0,1)
  h({1, 0:1})[{0, 0:1}]   // scalars (0,0) and (0,1) within tiles (1,0) and (1,1)

Operations on HTAs

[Figure: operations and assignments between conforming HTAs proceed tile by tile]

Operations on HTAs (II)

- Element-by-element arithmetic operations
- Operations typical of array languages
  - Matrix multiply, transposition, etc.
- Extended operator framework (reduce, mapReduce, scan) including user-defined operations
- Communications happen in assignments and in the provided array operations (see the sketch below)
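As an illustration of the last point, a minimal sketch of how an assignment between tiles expresses communication. It uses the indexing syntax of the previous slide; the ghost-row update itself is our own example, not taken from the deck:

  HTA A = HTA::alloc({100, 100}, {N, 1});  // column of N 100x100 tiles, one per node

  // copy the last row of each tile i-1 into the first row of tile i;
  // since the tiles live on different nodes, this assignment implies
  // inter-node communication, carried out by the HTA runtime
  A({1:N-1, 0})[{0, 0:99}] = A({0:N-2, 0})[{99, 0:99}];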

Example

  struct Example {
    void operator() (HTA A, HTA B) {
      A = A + alpha * B;
    }
  };

  HTA A = HTA::alloc({N, N}, {3, 2});  // 3x2 arrangement of NxN tiles
  HTA B = A.clone();                   // B: same structure and distribution as A
  ...
  A = A + alpha * B;
  hmap(Example(), A, B);  // same effect: Example() is applied tile by tile
  ...
  HTA C = A.transpose();

Heterogeneous Programming Library (HPL)

- Facilitates heterogeneous programming
- Based on two elements:
  - Kernels: functions that are evaluated in parallel by multiple threads on any device
  - Data type to express arrays and scalars that can be used in kernels and in host/serial code
- Implementation based on C++ and OpenCL
  - http://hpl.des.udc.es

HW/SW model

- Serial code runs in the host
- Parallel kernels can be run everywhere
  - Semantics like CUDA and OpenCL
- Processors can only access their own memory

Kernels

- Supports kernels:
  - In standard OpenCL
  - Developed in an embedded language
- The embedded language has:
  - Macros for control structures
  - Predefined variables
  - Functions (sync, arithmetic ops, etc.)
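A minimal sketch of a kernel written in the embedded language, in the style of the HPL papers; the for_ macro, the idx/idy predefined variables, and the Int/Size_t types follow those papers, so treat the exact spellings as assumptions:

  void mxmul_kernel(Array<float, 2> c, Array<float, 2> a,
                    Array<float, 2> b, Int n) {
    Size_t k;                 // embedded-language variable
    c[idx][idy] = 0.f;        // idx, idy: predefined global thread ids
    for_(k = 0, k < n, k++)   // control-structure macro of the embedded language
      c[idx][idy] += a[idx][k] * b[k][idy];
  }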

Array data type

- Array<type, n [, memoryFlag]> defines an n-dimensional array that can be used in host code and in kernels
  - Example: Array<float, 2> mx(100, 100)
  - n = 0 defines a scalar
  - memoryFlag defines the kind of memory (global, local, constant, or private)
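For instance, a scratch buffer in on-device local memory might be declared as below; this is a sketch, and the Local flag spelling is an assumption based on the HPL papers:

  Array<float, 1, Local> scratch(256);  // one copy per work-group, in OpenCL local memory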

Example: SAXPY (Y = a*X + Y)

  #include "HPL.h"
  using namespace HPL;

  float myvector[1000];

  // host-side Arrays: x uses its own host storage,
  // y uses the existing host storage of myvector
  Array<float, 1> x(1000), y(1000, myvector);

  // kernel: idx is the predefined global thread id
  void saxpy(Array<float, 1> y, Array<float, 1> x, Float a) {
    y[idx] = a * x[idx] + y[idx];
  }

  int main() {
    float a;
    // the vectors are filled in with data (not shown)
    eval(saxpy)(y, x, a);  // request kernel execution
  }

HTA + HPL integration

- I. Data type integration
  - Simplify the joint usage of the HTA and Array types
- II. Coherency management
  - Guarantee that HTA and HPL invocations use valid versions of the data

I. Data type integration

- HTAs are global multi-node objects, while HPL Arrays are per-node structures
  - Typical HTA pattern: a single tile per node, identified by the process/node id
- Solution: define Arrays associated to the local HTA tiles
  - Build the HPL Arrays so that their host-side memory is that of the HTA tile

Example

  int N = Traits::Default::nPlaces();     // number of processes/nodes

  // HTA with a column of N tiles of size 100x100
  // (each tile is placed in a different node)
  auto h_arr = HTA::alloc({100, 100}, {N, 1});

  int MYID = Traits::Default::myPlace();  // id of the local node

  // local HPL Array of 100x100 elements whose host-side memory
  // is that of the local tile of the HTA
  Array<float, 2> local_arr(100, 100, h_arr({MYID, 0}).raw());

  /* Rule:
     - Use h_arr for CPU/inter-node operations
     - Use local_arr for accelerator operations */

Coherency management

- HPL manages the coherency of its Arrays
  - Automated transfers to/from/between devices
- It must know whether an Array was read and/or written in each usage
  - Kernel ops: known from kernel analysis
  - Host ops: known from accessors, or manually reported through a method called data()
- HTA operations are host operations
  - They are reported to the HPL Arrays using the data() API
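A sketch of how data() could report the different host access modes. HPL_RD is used in the example that follows; HPL_WR and HPL_RDWR are assumed names for the write and read-write modes, not confirmed by the deck:

  lA.data(HPL_RD);    // host will only read lA: refresh the host copy if it is stale
  lA.data(HPL_WR);    // host will overwrite lA: device copies invalidated, no transfer (assumed)
  lA.data(HPL_RDWR);  // host will both read and write lA (assumed)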

Example

  auto hta_A = HTA::alloc({{(HA/N), WA}, {N, 1}});
  Array<float, 2> lA((HA/N), WA, hta_A({MY_ID}).raw());
  auto hta_B = HTA::alloc({{(HB/N), WB}, {N, 1}});
  Array<float, 2> lB((HB/N), WB, hta_B({MY_ID}).raw());
  auto hta_C = HTA::alloc({{HC, WC}, {N, 1}});
  Array<float, 2> lC(HC, WC, hta_C({MY_ID}).raw());

  hta_A = 0.f;                          // host operation on the HTA
  eval(fillinB)(lB);                    // kernel: writes B on the GPU
  hmap(fillinC, hta_C);                 // host operation: writes C on the host
  eval(mxmul)(lA, lB, lC, HC, alpha);   // kernel: writes A, reads B and C on the GPU
  lA.data(HPL_RD);                      // brings the A data to the host
  double result = hta_A.reduce(plus<float>());  // host reduction on A

Place(s) where each array is updated after each step:

  Step                                  Matrix A   Matrix B   Matrix C
  allocations                           Host       Host       Host
  hta_A = 0.f                           Host       Host       Host
  eval(fillinB)(lB)                     Host       GPU        Host
  hmap(fillinC, hta_C)                  Host       GPU        Host
  eval(mxmul)(lA, lB, lC, HC, alpha)    GPU        GPU        Both
  lA.data(HPL_RD)                       Both       GPU        Both

Evaluation

- Applications:
  - EP: embarrassingly parallel, with a reduction
  - FT: FFT with all-to-all communications
  - Matmul: distributed matrix product
  - ShWa: finite-volume scheme (repetitive stencil)
  - Canny: finds edges in images (stencil)

Programmability versus MPI+OpenCL

[Figure: percentage reduction with respect to MPI+OpenCL in SLOCs, cyclomatic number, and programming effort for EP, FT, Matmul, ShWa, Canny, and their average]

Performance evaluation

- Fermi cluster: 4 nodes, each with a 6-core Intel Xeon X5650, 12 GB of memory, and 2 Nvidia M2050 GPUs with 3 GB each
- K20 cluster: 8 nodes, each with two 8-core Intel Xeon E5-2660 CPUs, 64 GB of memory, and a K20m GPU with 5 GB
- g++ 4.7.2 with optimization level O3

Performance for EP

[Figure: speedup on 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters]

Performance for FT

[Figure: speedup on 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters]

Performance for ShWa

[Figure: speedup on 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters]

Conclusions & future work

- Heterogeneous clusters are notoriously difficult to program
- Most proposals to improve the situation still require noticeable effort
- We explored combining:
  - distributed arrays with global semantics, provided by HTAs
  - simpler heterogeneous programming, provided by HPL

Conclusions & future work (II)

- Average programmability improvements w.r.t. MPI+OpenCL between 19% and 45% (peak of 58%)
- Average overhead of just around 2%
- Future work: integrate both tools into a single one

