Concord: Homogeneous Programming for Heterogeneous Architectures

Rajkishore Barik, Intel Labs Brian T.Lewis, Intel Labs

Heterogeneous Platforms •

Heterogeneity is ubiquitous: mobile devices, laptops, servers, & supercomputers



Emerging hardware trend: CPU & GPU cores integrated on same die, share physical memory & even last-level cache Intel 4th generation core processors

AMD Trinity

Source: http://www.hardwarezone.com.my/feature-amd-trinity-apu-look-inside-2nd-generation-apu/conclusion-118

How do we program these integrated GPU systems? 3/2/2014

Programming Systems Lab, Intel Labs

2

Motivation: GPU Programming • Existing work: regular data-parallel applications using arraybased data structures map well to the GPUs – OpenCL 1.x, CUDA, OpenACC, C++ AMP, …

• Enable other existing multi-core applications to quickly take advantage of the integrated GPUs – Often use object-oriented design, pointers • Enable pointer-based data structures on the GPU – Irregular applications on GPU: benefits are not well-understood • Data-dependent control flow – Graph-based algorithms such as BFS, SSSP, etc.

Widen the set of applications that target GPUs 3/2/2014

Programming Systems Lab, Intel Labs

3

Contributions • Concord: a seamless C++ heterogeneous programming framework for integrated CPU and GPU processors – Shared Virtual Memory (SVM) in software • share pointer-containing data structures like trees

– Adapts existing data-parallel C++ constructs to heterogeneous computing: TBB, OpenMP – Supports most C++ features including virtual functions – Demonstrates programmability, performance, and energy benefits of SVM

• Available open source as Intel Heterogeneous Research Compiler (iHRC) at https://github.com/IntelLabs/iHRC/

3/2/2014

Programming Systems Lab, Intel Labs

4

Concord Framework Concord C++

Static Concord compiler

Executable: IA binary + OpenCL

CLANG LLVM

IA code gen

OpenCL JIT Compiler

Compute runtime

OpenCL to GPU ISA

OpenCL code gen GPU binary

Object: IA binary + OpenCL Linker 3/2/2014

CPU

Programming Systems Lab, Intel Labs

GPU

5

Concord C++ programming constructs Concord extends TBB APIs:

Existing TBB APIs:

template parallel_for_hetero (int numiters, const Body &B, bool device);

template parallel_for (Index first, Index last, const Body& B)

template parallel_reduce_hetero (int numiters, const Body &B, bool device);

template parallel_reduce (Index first, Index last, const Body& B)

Supported C++ features: • • • • • •

Classes Namespaces Multiple inheritance Templates Operator and function overloading Virtual functions

3/2/2014

Programming Systems Lab, Intel Labs

6

Concord C++ Example: Parallel LinkedList Search

class ListSearch {

class ListSearch {





void operator()(int tid) const{ ... list->key...

void operator()(int tid) const{ ... list->key...

}}; … ListSearch *list_object = new ListSearch(…);

}}; … ListSearch *list_object = new ListSearch(…);

parallel_for(0, num_keys, *list_object);

parallel_for_hetero (num_keys, *list_object, GPU);

TBB Version

Concord Version Concord Version

Run on CPU or GPU

Minimal differences between two versions 3/2/2014

Programming Systems Lab, Intel Labs

7

Example: parallel_for_hetero class Foo { float *A, *B, *C; public: Foo(float *a_, float *b_, float *c_):A(a_),B(b_),C(c_) { } void operator()(int i) const { // execute in parallel A[i] = B[i] + C[i]; } }; …… Foo *f = new Foo(A,B,C); parallel_for_hetero (1024, *f, GPU); // Data parallel operation for GPU

3/2/2014

Programming Systems Lab, Intel Labs

8

Example: parallel_reduce_hetero class Bar { float *A, sum; public: Bar(float *a_): A(a_), sum(0.0f) { } void operator()(int i) { // execute in parallel sum = f(A[i]); // compute local sum } void join(Bar &rhs) { sum += rhs.sum; // perform reduction } }; …… Bar *b = new Bar(A); parallel_reduce_hetero (1024, *b, GPU); // Data parallel reduction on GPU

3/2/2014

Programming Systems Lab, Intel Labs

9

Restrictions • No guarantee that the parallel loop iterations will be executed in parallel • No ordering among different parallel iterations – Floating-point determinism is not guaranteed

• Features not yet supported on the GPU – – – – –

Recursion (except tail recursion which can be converted to loop) Exception Taking address of local variable Memory allocation and de-allocation Function calls via function pointers (virtual functions are handled)

Silently execute on CPU if these features are present in GPU code

3/2/2014

Programming Systems Lab, Intel Labs

10

Key Implementation Challenges • Shared Virtual Memory (SVM) support to enable pointersharing between CPU and GPU •

Compiler optimization to reduce SVM translation overheads

• Virtual functions on GPU • Parallel reduction on GPU • Compiler optimizations to reduce cache line contention

3/2/2014

Programming Systems Lab, Intel Labs

11

SVM Implementation on x86 CPU virtual memory

Shared physical memory

GPU virtual memory

CPU_ptr SVM: Address shared with GPU (pinned)

GPU_ptr

offset

CPU_Base

offset

GPU surface mapped to shared area

GPU_Base

0x0…0

GPU_ptr = GPU_Base + CPU_ptr – CPU_Base 3/2/2014

Programming Systems Lab, Intel Labs

12

SVM Translation in OpenCL code class ListSearch { … void operator()(int tid) const{ ... list->key...

//__global char * svm_const = (GPU_Base – CPU_Base); #define AS_GPU_PTR(T,p) (__global T *) (svm_const + p)

}}; … ListSearch *list_object = new ListSearch(…);

__kernel void opencl_operator ( __global char *svm_const, unsigned long B_ptr) {

parallel_for_hetero (num_keys, *list_object, GPU); AS_GPU_PTR(LinkedList, list)->key… }

Concord C++

Generated OpenCL



svm_const is a runtime constant and is computed once



Every CPU pointer before dereference on the GPU is converted into GPU addressspace using AS_GPU_PTR 3/2/2014

Programming Systems Lab, Intel Labs

13

Compiler Optimization of SVM Translations int **a = data->a; for ( int i=0; i
Eager int **a = AS_GPU_PTR(int *, data->a); for ( int i=0; i
Overhead: 2N + 1



Lazy int **a = data->a; for ( int i=0; i
Best int **a = AS_GPU_PTR(int *, data->a); for ( int i=0; i
Overhead: N

Overhead: 1

Best strategy: – Eagerly convert to GPU addressspace & keep both CPU & GPU representations – If a store is encountered, use CPU representation – Additional optimizations • Dead-code elimination • Optimal code motion to perform redundancy elimination and place the translations 3/2/2014

Programming Systems Lab, Intel Labs

14

Virtual Functions on GPU Original hierarchy: class Shape { virtual void intersect() {…} virtual void compute() {…} }; class Triangle : Shape { virtual void intersect() {…} };

Object layout with vtable:

Shape

Shape::vtable intersect

vtableptr

compute

Triangle

Triangle::vtable intersect

vtableptr

Virtual Function call: void foo(Shape *s) { s->compute(); }

Shape:compute CPU Virtual Function call: void foo(Shape *s) { (s->vtableptr[1])(); }

Original code • •

Copy to shared memory

GPU Virtual Function call: void foo(Shape *s, void *gCtx) { if (s->vtableptr[1] == gCtx-> Shape::compute) Shape::compute(); }

Generated code

Copy necessary metadata into shared memory for GPU access Translate virtual function calls into if-then-else statements

3/2/2014

Programming Systems Lab, Intel Labs

15

Parallel Reduction on GPU parallel_reduce_hetero(16, B, GPU)

Private copies of B & parallel operation

B0 B1

B0

Hierarchical reduction in local memory

B1

B2

B2

B3

B3

B4

B4

B5

B5

B6

B7

B6

B8

B9

B7

class Body { … void operator()(int tid) const { … } void join(Body &rhs) { … } }

B10 B11 B12 B13 B14 B15

B8

B9

B10

join

B11

B12

B13

B14

join

B0

B8

B

3/2/2014

Programming Systems Lab, Intel Labs

16

B15

Compiler Optimization for Cache Contention • Integrated GPUs often use a unified cache among all GPU cores – Contention among GPU cores to access same cache line • number of simultaneous read and write ports to a cache line may not be same as the number of GPU cores

void operator ()(int i) { for (j=0; j
void operator ()(int i) { int start = i / W; /* W: no. of GPU cores */ for (j=0; j
• Key idea: Ensure that the j loop is accessed in a different order for each GPU core 3/2/2014

Programming Systems Lab, Intel Labs

17

Using GPU Memory hierarchy • Stack allocated objects in C++ are promoted to OpenCL private memory • Reductions are performed in OpenCL local shared memory • Automatic generation of local memory code for regular applications (work-in-progress)

3/2/2014

Programming Systems Lab, Intel Labs

18

Compiler Details • HeteroTBB pass:

Concord C++

– identify and lower Concord constructs – Handles virtual functions

Clang++

HeteroTBB Pass

• Hetero pass:

Hetero Pass HeteroGPU pass

LLVM Passes

• HeteroGPU pass:

HeteroCPU pass

Executa ble

– Check restrictions – Generates a list of kernels

Concord Runtime

– Perform compiler optimizations – Generate OpenCL code

• HeteroCPU pass: – Generates x86 executable with embedded OpenCL code

3/2/2014

Programming Systems Lab, Intel Labs

19

Runtime Details • OpenCL host program – Setup shared region and map to an OpenCL buffer

• Extract OpenCL code and JIT to GPU binary – Vendor OpenCL compiler

• Compile all the kernels at once – Cache the binary per function for future invocations – Amortizes the cost

• Allows heterogeneous CPU+GPU execution

3/2/2014

Programming Systems Lab, Intel Labs

20

Case Study: Barnes-Hut • An efficient algorithm for the N-body problem – Approximates far away bodies

• Algorithm: – Build an oct-tree representing positions of bodies – Update the centers of masses for all subtrees – Sort the bodies based on relative positions – Calculate gravitational forces between bodies (offload to GPU) – Update positions and velocities

• Takes advantage of (shared) pointers

3/2/2014

Programming Systems Lab, Intel Labs

21

Barnes-Hut CUDA Kernel void ForceCalculationKernel() { if (0 == threadIdx.x) { tmp = radiusd; dq[0] = tmp * tmp * itolsqd; for (i = 1; i < maxdepthd; i++) { dq[i] = dq[i - 1] * 0.25f; dq[i - 1] += epssqd; } dq[i - 1] += epssqd; if (maxdepthd > MAXDEPTH) { *errd = maxdepthd; }

} __syncthreads();

// iterate over all bodies assigned to thread for (k = threadIdx.x + blockIdx.x * blockDim.x; k < nbodiesd; k += blockDim.x * gridDim.x) { i = sortd[k]; // get permuted/sorted index // cache position info px = posxd[i]; py = posyd[i]; pz = poszd[i]; ax = 0.0f; ay = 0.0f; az = 0.0f; // initialize iteration stack, i.e., push root node onto stack depth = j; if (sbase == threadIdx.x) { node[j] = nnodesd; pos[j] = 0; }

if (maxdepthd <= MAXDEPTH) { base = threadIdx.x / WARPSIZE; sbase = base * WARPSIZE; while (depth >= j) { j = base * MAXDEPTH; // stack is not empty while ((t = pos[depth]) < 8) { diff = threadIdx.x - sbase; // node on top of stack has more children if (diff < MAXDEPTH) { to process dq[diff+j] = dq[diff]; n = childd[node[depth]*8+t]; // load child } pointer __syncthreads(); if (sbase == threadIdx.x) { // I'm the first thread in the warp pos[depth] = t + 1; }

Source: http://www.gpucomputing.net/?q=node/1314

if (n >= 0) { dx = posxd[n] - px; dy = posyd[n] - py; dz = poszd[n] - pz; tmp = dx*dx + (dy*dy + (dz*dz + epssqd)); // compute distance squared (plus softening) if ((n < nbodiesd) || __all(tmp >= dq[depth])) { // check if all threads agree that cell is far enough away (or is a body) tmp = rsqrtf(tmp); // compute distance tmp = massd[n] * tmp * tmp * tmp; ax += dx * tmp; ay += dy * tmp; az += dz * tmp; } else { // push cell onto stack depth++; if (sbase == threadIdx.x) { node[depth] = n; pos[depth] = 0; } } } else { depth = max(j, depth - 1); // early out because all remaining children are also zero } } depth--; // done with this level }

if (stepd > 0) { // update velocity velxd[i] += (ax accxd[i]) * dthfd; velyd[i] += (ay accyd[i]) * dthfd; velzd[i] += (az acczd[i]) * dthfd; } // save computed acceleration accxd[i] = ax; accyd[i] = ay; acczd[i] = az; } } }

~100 Lines of CUDA Code with optimization, hard to read and maintain 3/2/2014

Programming Systems Lab, Intel Labs

22

Barnes-Hut Concord C++ Kernel 1. void update (BH_Tree **stack, Body *body) { 2. while(!stack.empty()) { 3. Octree *tree = stack.top(); 4. stack.pop();

5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. }

• • •

Octree **children = ((OctreeInternal*)tree)->child; for(int i=0;i<8;i++) { Octree *child = children[i]; if (!child) continue; if (child->nodeType == LEAF || body->pos.distance(child->pos) * THETA > child->box.size()) { computeForce(body, child); } else { stack.push(child); } }

distance is 5 lines. computeForce is 9 lines. push is 2 lines and pop is 1line Total 33 lines of code No extra host code for device malloc and data copy 3/2/2014

Programming Systems Lab, Intel Labs

23

Experimental setup • Experimental Platform:

– Intel Core 4th Generation Ultrabook • CPU: 2 cores, hyper-threaded, 1.7GHz • GPU: Intel HD Graphics 5000 with 40 cores, 200MHz-1.1GHz • Power envelope 15W – Intel Core 4th Generation Desktop • CPU: 4 cores, hyper-threaded, 3.4GHz • GPU: Intel HD Graphics 4600 with 20 cores, 350MHz-1.25GHz • Power envelope 84W

• Energy measurements: MSR_PKG_ENERGY_STATUS • Comparison with multi-core CPU:

1. GPU-SPEEDUP: speedup using GPU execution 2. GPU-ENERGY-SAVINGS: energy savings using GPU execution

3/2/2014

Programming Systems Lab, Intel Labs

24

Workloads

*uses virtual function

3/2/2014

Programming Systems Lab, Intel Labs

25

Overheads (on desktop system) Other overhead per kernel invocation (in microseconds)

%age compile-time rel. to total time 3 250

2.5

200

2

150

1.5 1

100

0.5

50

0

0

• Compile-time is 1.03% of total execution time • Other overheads (excluding compile-time) is ~90 microseconds 3/2/2014

Programming Systems Lab, Intel Labs

26

Dynamic estimates of irregularity control

memory

remaining

100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%

• •

BFS, Btree, ConnComp, FaceDetect, SkipList & SSSP exhibit a lot of irregularities (>50%) FaceDetect exhibits maximum percentage of memory irregularities 3/2/2014

Programming Systems Lab, Intel Labs

27

Overhead of SW-based SVM implementation Raytracer

Execution time in seconds

GPU-OPENCL

GPU-CONCORD

7 6 5 4 3 2 1

0

1000x1000

2000x2000

2400x2400

3000x3000

3200x3200

Image size SW-based SVM overhead is negligible for smaller images and is ˜6% for the largest image 3/2/2014

Programming Systems Lab, Intel Labs

28

Ultrabook: Speedup & Energy savings compared to multicore CPU GPU-SPEEDUP

GPU-ENERGY-SAVINGS

10

higher the better

9 8 7 6 5 4 3 2 1 0

Average speedup of 2.5x and energy savings of 2x vs. multicore CPU 3/2/2014

Programming Systems Lab, Intel Labs

29

Desktop: Speedup & Energy savings compared to multicore CPU GPU-SPEEDUP

GPU-ENERGY-SAVINGS

4

higher the better

3.5 3 2.5

2 1.5 1 0.5 0

Average speedup of 1.01x and energy savings of 1.7x vs. multicore CPU 3/2/2014

Programming Systems Lab, Intel Labs

30

Regular Workloads on Quad-core desktop: Speedup compared to multi-core CPU Speedup relative to multi-core CPU

12

10

8

6

4

2

0

BlackScholes

MatrixMult

Nbody

Seismic

CFD

Automatic local memory code generation can further boost performance 3/2/2014

Programming Systems Lab, Intel Labs

31

CPU+GPU Performance on HSW Desktop

Relative to ORACLE(CPU+GPU) Higher is better

CPU

GPU

SHARED-COUNTER

100 90 80 70 60 50 40 30

20 10 0

• CPU-alone and GPU-alone do not give the best performance • Hybrid CPU+GPU is necessary 3/2/2014

Programming Systems Lab, Intel Labs

32

Comparison with Manual code

BTree from Rodinia: Concord takes 2.68s vs. 3.26s for hand-coded OpenCL on the Desktop Haswell system

3/2/2014

Programming Systems Lab, Intel Labs

33

Conclusions & Future work • Runs out-of-the-box C++ applications on GPU – No new language invention

• Demonstrates that SVM is a key enabler in programmer productivity of heterogeneous systems • Implements SVM in software with low-overhead • Implements virtual functions and parallel reductions on GPU • Saves energy of 2.04x on Ultrabook and 1.7x on Desktop compared to multi-core CPU for irregular applications • Hybrid CPU+GPU execution looks promising for both performance and energy • Future work: – Support advanced features on GPU: exceptions, memory allocation, locks, etc. – Improve combined CPU+GPU heterogeneous execution 3/2/2014

Programming Systems Lab, Intel Labs

34

Cloth Physics demo using Concord:

Questions? Please try it out:

https://github.com/IntelLabs/iHRC/

3/2/2014

Programming Systems Lab, Intel Labs

35

Backup

3/2/2014

Programming Systems Lab, Intel Labs

36

Ultrabook: Speedup compared to multicore CPU GPU

GPU+PTROPT

GPU+L3OPT

GPU+ALL

higher the better

10 9 8 7 6 5 4 3 2 1 0

Average speedup of 2.5x vs. multicore CPU 3/2/2014

Programming Systems Lab, Intel Labs

37

Ultrabook: Energy savings compared to multi-core CPU GPU

GPU+PTROPT

GPU+L3OPT

GPU+ALL

higher the better

6 5 4

3 2 1 0

Average energy savings of 2.04x vs. multicore CPU 3/2/2014

Programming Systems Lab, Intel Labs

38

Desktop: Speedup compared to multi-core CPU GPU

GPU+PTROPT

GPU+L3OPT

GPU+ALL

3.5

higher the better

3 2.5 2 1.5 1 0.5 0

3/2/2014

Programming Systems Lab, Intel Labs

39

Desktop: Energy savings compared to multi-core CPU GPU

GPU+PTROPT

GPU+L3OPT

GPU+ALL

higher the better

4 3.5 3 2.5 2 1.5 1 0.5 0

Average energy savings of 1.7x vs. multicore CPU 3/2/2014

Programming Systems Lab, Intel Labs

40

Productivity

GPU Programming is hard

Concord C++ AMP, OpenACC, Cilk Plus.. CUDA, OpenCL, OpenGL, RenderScript?

Media Toolkit

Performance 3/2/2014

Programming Systems Lab, Intel Labs

41

Concord: Homogeneous Programming for Heterogeneous Architectures

Mar 2, 2014 - Irregular applications on GPU: benefits are not well-understood. • Data-dependent .... Best. Overhead: 2N + 1. Overhead: N. Overhead: 1. Lazy.

1MB Sizes 9 Downloads 260 Views

Recommend Documents

Heterogeneous Parallel Programming - GitHub
The course covers data parallel execution models, memory ... PLEASE NOTE: THE ONLINE COURSERA OFFERING OF THIS CLASS DOES NOT ... DOES NOT CONFER AN ILLINOIS DEGREE; AND IT DOES NOT VERIFY THE IDENTITY OF ...

Modal concord
The example is entirely natural, and its concord reading is strongly preferred. ... calls out for an explanation, since it is an obvious challenge to compositional.

Merge: A Programming Model for Heterogeneous Multi-core Systems
Mar 5, 2008 - proaches provide a data parallel API that can be efficiently mapped to a set of ... multi-core system, in contrast to static approaches, or dynamic.

[Heterogeneous Parallel Programming] Certificate [with Distinction].pdf
[Heterogeneous Parallel Programming] Certificate [with Distinction].pdf. [Heterogeneous Parallel Programming] Certificate [with Distinction].pdf. Open. Extract.

Invariance principles for homogeneous sums ...
to which most information about large random systems (such as the “distance .... analytic structure of f interacts with the specific “shape” of the distribution of the.

An Evolutionary Algorithm for Homogeneous ...
fitness and the similarity between heterogeneous formed groups that is called .... the second way that is named as heterogeneous, students with different ...

Invariance principles for homogeneous sums ...
first N elements of X. As in (1.1), and when there is no risk of confusion, we will drop the dependence on N and f in order to simplify the notation. Plainly, E[Qd(X)] = 0 and also, if. E(X2 i. ) = 1 for every i, ... In the specific case where Z is G

2. Generalized Homogeneous Coordinates for ...
ALYN ROCKWOOD. Power Take Off Software, Inc. ... direct computations, as needed for practical applications in computer vision and similar fields. ..... By setting x = 0 in (2.26) we see that e0 is the homogeneous point corre- sponding to the ...

Invariance principles for homogeneous sums of free ...
In [6], the authors were motivated by solving two conjectures, namely the Majority Is Stablest ...... Center successively every random variable X mp1 i′ p1 .... we call contraction of f with respect to γ the function Cγ(f) : {1,...,N}n−p → R

Interactive system for local intervention inside a non-homogeneous ...
Feb 8, 2001 - Gonzalez, “Digital Image Fundamentals,” Digital Image Processing,. Second Edition, 1987 ... Hamadeh et al., “Towards Automatic Registration Between CT and .... and Analysis, Stealth Station Marketing Brochure (2 pages).

heterogeneous catalysis for todays challenges ...
heterogeneous catalysis for todays challenges synthesis characterization and applications rsc green chemistry contains important information and a detailed ...

Speed-Based Mobility Management for Heterogeneous Wireless ...
anticipated that the future wireless mobile Internet ... the integration of IP and wireless technologies. ... heterogeneous wireless network mobility management.

Interactive system for local intervention inside a non-homogeneous ...
Feb 8, 2001 - Tech. Biol. Med., vol. 13, No.4, 1992, pp. 409-424. (Continued) ...... nation Systems, and Support Systems,” Journal of Microsurgery, vol. 1, 1980 ...

Entity identification for heterogeneous database ...
Internet continuously amplifies the need for semantic ..... ing procedure of an application service provider. (ASP) for the .... ю 17:4604 В Home ю 14:9700 В Bus.

Adaptive Bayesian personalized ranking for heterogeneous implicit ...
explicit feedbacks such as 5-star graded ratings, especially in the context of Netflix $1 million prize. ...... from Social Media, DUBMMSM '12, 2012, pp. 19–22.

Incorporating heterogeneous information for ... - ACM Digital Library
Aug 16, 2012 - A social tagging system contains heterogeneous in- formation like users' tagging behaviors, social networks, tag semantics and item profiles.

Homogeneous-Turbulence-Dynamics.pdf
study on well-liked search engines like google together with the keywords and phrases download Pierre Sagaut PDF eBooks. in order for you to only get PDF formatted books to download which are safer and virus-free you will find an array of websites. c

Residue curve map for homogeneous reactive quaternary mixtures
HOUSAM BINOUS. National Institute of Applied Sciences and Technology, BP 676 Centre Urbain Nord, 1080 Tunis, Tunisia ... involve solving a complex system of differential algebraic equations (DAEs). This can be .... ease of programming.

SOFTWARE ARCHITECTURES FOR JPEG2000 David ...
The paper describes non-obvious implementation strategies, state machines, predictive algorithms and internal ..... The first collects statistics from code-blocks.

Optimizing CABAC for VLIW architectures
Figure 5: Renormalization and bit insertion normalizes the values of low and range in the interval. [0, 1] so that they are at least separated by a QUAR-.

Incorporating heterogeneous information for ...
Aug 16, 2012 - [email protected]. Jianyong Wang ... formation like users' tagging behaviors, social networks, tag semantics and item profiles.