Towards a High Level Approach for the Programming of Heterogeneous Clusters

M. Viñas, B. B. Fraguela, D. Andrade, R. Doallo
Computer Architecture Group, Universidade da Coruña, Spain

Motivation

- Heterogeneous clusters are being increasingly adopted
  - Large performance and power benefits
  - They involve more programming effort
    - Distributed memory, both between nodes and between host and devices
    - Accelerators are harder to program than traditional CPUs

HUCAA 2016


Motivation (II)

- Many proposals to tackle the programming of these systems; often they:
  - Still expose many low-level details
  - Rely on SPMD processes
  - Rely on task-parallelism
- Higher-level approaches should be explored

Contribution

- Explore data-parallelism at the cluster level combined with a tool for heterogeneous programming
- Achieved using:
  - The Hierarchically Tiled Array data type: arrays distributed by tiles on the cluster
  - The Heterogeneous Programming Library: simple development of heterogeneous applications

Outline

- Hierarchically Tiled Arrays
- Heterogeneous Programming Library
- Integration
- Evaluation
- Conclusions and future work

Hierarchically Tiled Arrays (HTAs)

- Data type in a sequential language
- HTAs are tiled arrays where each tile can be a standard array or a tiled array. They provide:
  - Data distribution in distributed memory
  - Locality
  - Data parallelism in operations on tiles
- Provide a global view
- Single thread of control, except in the data-parallel operations
- Implementation based on C++ and MPI
  - http://polaris.cs.uiuc.edu/hta/

HTA Indexing

- Can choose ranges of tiles, scalars, or combinations of both
- Example, for h = HTA::alloc({3, 3}, {2, 2}):
  - h({0, 0:1}) selects a range of tiles
  - h[{0, 0:1}] selects a range of scalars
  - h({1, 0:1})[{0, 0:1}] selects scalars within a range of tiles

Operations on HTAs

(Figure: assignments between conforming HTAs apply the operation op tile by tile)

Operations on HTAs (II)

- Element-by-element arithmetic operations
- Operations typical of array languages
  - Matrix multiply, transposition, etc.
- Extended operator framework (reduce, mapReduce, scan), including user-defined operations
- Communications happen in assignments and in the provided array operations

Example

    HTA A = HTA::alloc({N, N}, {3, 2});
    HTA B = A.clone();

    struct Example {
      void operator() (HTA A, HTA B) {
        A = A + alpha * B;
      }
    };

    ...
    A = A + alpha * B;
    hmap(Example(), A, B); // Same effect, applied tile by tile
    ...
    HTA C = A.transpose();

(Figures: the tiled layouts of A, B, and the transposed C)

Heterogeneous Programming Library (HPL)

- Facilitates heterogeneous programming
- Based on two elements:
  - Kernels: functions that are evaluated in parallel by multiple threads on any device
  - A data type to express arrays and scalars that can be used in kernels and in host/serial code
- Implementation based on C++ and OpenCL
  - http://hpl.des.udc.es

HW/SW model

- Serial code runs in the host
- Parallel kernels can be run everywhere
  - Semantics like those of CUDA and OpenCL
- Processors can only access their own memory

Kernels

- Supports kernels:
  - Written in standard OpenCL
  - Developed in an embedded language
- The embedded language has:
  - Macros for control structures
  - Predefined variables
  - Functions (sync, arithmetic ops, etc.)

Array data type

- Array<type, n, memoryFlag> defines an n-dimensional array that can be used in host code and in kernels
  - Example: Array<float, 2> mx(100, 100)
  - n = 0 defines a scalar
  - memoryFlag defines the kind of memory (global, local, constant, or private)

Example: SAXPY (Y = a*X + Y)

    #include "HPL.h"
    using namespace HPL;

    float myvector[1000];

    // Host-side Arrays: x uses its own host storage,
    // while y uses the existing host storage myvector
    Array<float, 1> x(1000), y(1000, myvector);

    // Kernel: idx is the global thread id
    void saxpy(Array<float, 1> y, Array<float, 1> x, Float a) {
      y[idx] = a * x[idx] + y[idx];
    }

    int main() {
      float a;
      // the vectors are filled in with data (not shown)
      eval(saxpy)(y, x, a); // request kernel execution
    }

HTA + HPL integration

- I. Data type integration
  - Simplify the joint usage of the HTA and Array types
- II. Coherency management
  - Guarantee that HTA and HPL invocations use valid versions of the data

I. Data type integration

- HTAs are global multi-node objects, while HPL Arrays are per-node structures
  - Typical HTA pattern: a single tile per node, identified by the process/node id
- Solution: define Arrays associated to the local HTA tiles
  - Build HPL Arrays so that their host-side memory is the one of the HTA tile

Example

    // Get the number of processes/nodes
    int N = Traits::Default::nPlaces();

    // Build an HTA with a column of N tiles of size 100x100
    // (each tile is placed in a different node)
    auto h_arr = HTA::alloc({100, 100}, {N, 1});

    // Get the id of the local node
    int MYID = Traits::Default::myPlace();

    // Build a local HPL Array of 100x100 elements whose host-side
    // memory is that of the local tile of the HTA
    Array<float, 2> local_arr(100, 100, h_arr({MYID, 1}).raw());

    /* Rule:
       - Use h_arr for CPU/inter-node operations
       - Use local_arr for accelerator operations */

Coherency management

- HPL manages the coherency of its Arrays
  - Automated transfers to/from/between devices
- It must know whether an Array was read and/or written in each usage
  - Kernel ops: known from kernel analysis
  - Host ops: known from accessors, or manually reported through a method called data
- HTA operations are host operations
  - Inform HPL Arrays about them using the data API

Example

    auto hta_A = HTA::alloc({{(HA/N), WA}, {N, 1}});
    Array<float, 2> lA((HA/N), WA, hta_A({MY_ID}).raw());
    auto hta_B = HTA::alloc({{(HB/N), WB}, {N, 1}});
    Array<float, 2> lB((HB/N), WB, hta_B({MY_ID}).raw());
    auto hta_C = HTA::alloc({{HC, WC}, {N, 1}});
    Array<float, 2> lC(HC, WC, hta_C({MY_ID}).raw());

    hta_A = 0.f;
    eval(fillinB)(lB);
    hmap(fillinC, hta_C);
    eval(mxmul)(lA, lB, lC, HC, alpha);
    lA.data(HPL_RD); // Brings the A data to the host
    double result = hta_A.reduce(plus());

Place(s) where each array is updated after each step:

    Step                                  Matrix A   Matrix B   Matrix C
    (allocation)                          Host       Host       Host
    hta_A = 0.f                           Host       Host       Host
    eval(fillinB)(lB)                     Host       GPU        Host
    hmap(fillinC, hta_C)                  Host       GPU        Host
    eval(mxmul)(lA, lB, lC, HC, alpha)    GPU        GPU        Both
    lA.data(HPL_RD)                       Both       GPU        Both

Evaluation

- Applications:
  - EP: embarrassingly parallel, with a reduction
  - FT: FFT with all-to-all communications
  - Matmul: distributed matrix product
  - ShWa: finite-volume scheme (repetitive stencil)
  - Canny: finds edges in images (stencil)

Programmability versus MPI+OpenCL

(Figure: percentage reduction in SLOCs, cyclomatic number, and programming effort with respect to MPI+OpenCL for EP, FT, Matmul, ShWa, Canny, and their average)

Performance evaluation

- Fermi cluster: 4 nodes, each with an Intel Xeon X5650 (6 cores), 12 GB of RAM, and two Nvidia M2050 GPUs with 3 GB each
- K20 cluster: 8 nodes, each with two 8-core Intel Xeon E5-2660 CPUs, 64 GB of RAM, and a K20m GPU with 5 GB
- g++ 4.7.2 with optimization level O3

Performance for EP

(Figure: speedup with 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters)

Performance for FT

(Figure: speedup with 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters)

Performance for ShWa

(Figure: speedup with 2, 4, and 8 GPUs for MPI+OCL and HTA+HPL on the Fermi and K20 clusters)

Conclusions & future work

- Heterogeneous clusters are notoriously difficult to program
- Most proposals to improve the situation still require noticeable effort
- We explored combining:
  - distributed arrays with global semantics, provided by HTAs
  - simpler heterogeneous programming, provided by HPL

Conclusions & future work (II)

- Average programmability improvements w.r.t. MPI+OpenCL between 19% and 45% (peak of 58%)
- Average overhead of just around 2%
- Future work: integrate both tools into a single one

