SYCL SG14, February 2016 © Copyright Khronos Group, 2014 - Page 1

BOARD OF PROMOTERS

Over 100 members worldwide; any company is welcome to join


SYCL

1. What is SYCL for and what do we want to achieve?
2. Where does SYCL fit in compared with other programming models?
3. Bringing OpenCL v1.2 features to C++ with SYCL
4. What next?

What is SYCL for?

• Modern C++ lets us separate the what from the how:
  - We want to separate what the user wants to do: science, computer vision, AI …
  - And enable the how to be: run fast on an OpenCL device

• Modern C++ supports and encourages this separation

What we want to achieve

• We want to enable a C++ ecosystem for OpenCL:
  - Must run on OpenCL devices: GPUs, CPUs, FPGAs, DSPs etc.
  - C++ template libraries
  - Tools: compilers, debuggers, IDEs, optimizers
  - Training, example programs
  - Long-term support for current and future OpenCL features

Why a new standard?

• There are already very established ways to map C++ to parallel processors
  - So we follow the established approaches

http://imgs.xkcd.com/comics/standards.png

• There are specifics to do with OpenCL we need to map to C++
  - We have worked hard to be an enabler for other C++ parallel standards
• We add no more than we need to

Where does SYCL fit in?

OpenCL / SYCL Stack

[Diagram, top to bottom: User application code; C++ template libraries; SYCL for OpenCL, alongside other technologies; OpenCL devices: CPU, GPU, FPGA, DSP, custom processor]

Philosophy

• With SYCL, we wanted to align with the direction the C++ standard is going
  - And we also need to future-proof for future OpenCL device capabilities
• Key decisions:
  - We will not add any language extensions to C++
  - We will work with existing C++ compilers
  - We will provide the full OpenCL feature-set in C++
  - Everything must compile and run on the host as well as an OpenCL device

Where does SYCL fit in? – Language style

C++ Embedded DSLs
e.g.: Sh/RapidMind, Halide, Boost.Compute
Pros: Works with existing C++ compilers
Cons: compile-time compilation, control flow, composability

    Vector a, b;
    auto expr = a + b;
    Vector r = expr.eval ();

C++ template library uses overloading to build up an expression tree to compile at runtime

C++ Kernel languages
e.g.: GLSL, OpenCL C and C++ kernel languages
Pros: Explicit offload, independent host/device code & compilers, run-time adaptation, popular in graphics
Cons: Hard to compose cross-device

    Kernel myKernel;
    myKernel.load ("myKernel");
    myKernel.compile ();
    myKernel.setArg (0, a);
    float r = myKernel.run ();

    float myKernel (float *arg) { return *arg * 456.7f; }

Host (CPU) code loads and compiles the kernel for a specific device, sets args and runs

C++ single-source
e.g.: SYCL, CUDA, OpenMP, C++ AMP
Pros: Composability, easy to use, offline compilation and validation
Cons: host/device compiler conflict

    Vector a, b, r;
    parallel_for (a.range (), [&](int id) { r [id] = a [id] + b [id]; });

Single source file contains code for host & device
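The embedded-DSL style above ("overloading to build up an expression tree") can be sketched in plain C++. This is only an illustration under stated assumptions: `Vec`, `AddExpr` and `eval` are hypothetical names, and evaluation here simply runs a loop on the host, where a real embedded DSL would generate and compile device code at that point.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical leaf type: owns the data.
struct Vec {
    std::vector<float> data;
};

// Hypothetical expression node: records "lhs + rhs" without computing it.
// It holds references, so both operands must outlive the expression.
struct AddExpr {
    const Vec& lhs;
    const Vec& rhs;

    // Evaluation is deferred until eval() is called.
    Vec eval() const {
        assert(lhs.data.size() == rhs.data.size());
        Vec out;
        out.data.resize(lhs.data.size());
        for (std::size_t i = 0; i < out.data.size(); ++i)
            out.data[i] = lhs.data[i] + rhs.data[i];
        return out;
    }
};

// The overloaded operator builds the tree node instead of adding.
inline AddExpr operator+(const Vec& a, const Vec& b) { return AddExpr{a, b}; }
```

With this sketch, `auto expr = a + b;` performs no arithmetic; the addition only happens in `expr.eval()`.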

Where does SYCL fit in? – Parallelism

Directive-based parallelism
e.g.: OpenMP, OpenACC
Pros: Original source code is annotated not modified; well-understood
Cons: Hard to compose; execution order separate from source code

    Vector a, b, r;
    #pragma parallel_for
    for (int i = 0; i < a.size (); i++) {
        r [i] = a [i] + b [i];
    }

Annotate serial code with #pragmas highlighting where the parallelising compiler should transform code

Thread parallelism
e.g.: TBB, C++11 threads, pthreads
Pros: Well understood; works with variety of algorithms
Cons: Doesn’t map to highly parallel architectures like GPUs & FPGAs

    Vector a, b, r;
    Thread t1 = createThread ([&]() { sumFirstHalf (r, a, b); });
    Thread t2 = createThread ([&]() { sumSecondHalf (r, a, b); });
    t1.wait ();
    t2.wait ();

Create explicit threads to break up task into parallel sections

Explicit parallelism
e.g.: SYCL, Parallel STL, CUDA, C++ AMP
Pros: Composable; works with wide variety of processor architectures
Cons: Requires user to know the parallelism

    Vector a, b;
    parallel_for (a.range (), [&](int id) { a [id] = a [id] + b [id]; });

Parallelism is expressed explicitly in the program
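To make the explicit style concrete, here is a minimal host-only sketch of what a `parallel_for` library call could look like. `parallel_for_host` and `add_vectors` are hypothetical names, and the loop runs serially; the point is only that the kernel is a callable invoked once per index, with no ordering between iterations, which is what lets a real runtime spread them over work-items.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-only stand-in for parallel_for: calls the kernel once per
// index. Iterations must be independent, so a real runtime would be free to
// run them in any order or concurrently.
template <typename Kernel>
void parallel_for_host(std::size_t count, Kernel kernel) {
    for (std::size_t id = 0; id < count; ++id) kernel(id);
}

// Element-wise addition written against that interface.
inline std::vector<float> add_vectors(const std::vector<float>& a,
                                      const std::vector<float>& b) {
    assert(a.size() == b.size());
    std::vector<float> r(a.size());
    parallel_for_host(a.size(), [&](std::size_t id) { r[id] = a[id] + b[id]; });
    return r;
}
```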

Where does SYCL fit in? – Memory model

Cache coherent single-address space
e.g.: Multi-core CPUs, HSA, OpenCL 2 System Sharing
Pros: Very easy to program – just pass around pointers (leave ownership issues to user); low-latency offload; very little impact on programming model
Cons: Bandwidth limited; costs power; needs special operating system support

    float *a = new float [size];
    processCodeOnDevice (a, size);

When parallelizing on a system with a cache-coherent single address space, you only need to pass around pointers. This makes communication and offloading very low-cost and easy. It requires all memory accesses to go through the virtual memory system, and the caches must communicate ownership across all cores.

Non-coherent single-address space
e.g.: HSA Coarse grained, OpenCL 2.x
Pros: Doesn’t require (much) OS support or (much) hardware support
Cons: Not supported on all processor cores; user must manage ownership

    float *a = NewShared (size);
    passOwnershipToDevice (a, size);
    processCodeOnDevice (a, size);

All data is still referred to via shared pointers, but the user must manage the memory ownership between different cores.

Multi-address space
e.g.: SYCL 1.2, C++ AMP, OpenCL 1.x
Pros: High performance and efficiency of memory accesses; wide device support
Cons: Impact on programming model (pointers)

    Shared a (size);
    processCodeOnDevice (a);

Data needs to be encapsulated in new datatypes that are able to manage ownership between host CPU and different devices
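The multi-address-space case can be illustrated with a host-only sketch of such an ownership-managing datatype. Everything here is an assumption made for illustration: the `Shared` class, its `on_host`/`on_device` methods and the copy counter are hypothetical, and a second `std::vector` stands in for device memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: tracks which side holds the up-to-date copy and copies
// only when the other side asks for the data.
class Shared {
public:
    explicit Shared(std::size_t size) : host_(size, 0.0f), device_(size, 0.0f) {}

    // Host access: pull the data back if the device copy is newer.
    std::vector<float>& on_host() {
        if (owner_ == Owner::Device) { host_ = device_; ++copies_; }
        owner_ = Owner::Host;
        return host_;
    }

    // Device access: push the data over if the host copy is newer.
    std::vector<float>& on_device() {
        if (owner_ == Owner::Host) { device_ = host_; ++copies_; }
        owner_ = Owner::Device;
        return device_;
    }

    int copies() const { return copies_; }

private:
    enum class Owner { Host, Device };
    std::vector<float> host_;    // stands in for host memory
    std::vector<float> device_;  // stands in for a separate device address space
    Owner owner_ = Owner::Host;
    int copies_ = 0;
};

// Stand-in for work that runs on the device.
inline void processCodeOnDevice(Shared& a) {
    for (float& x : a.on_device()) x *= 2.0f;
}
```

The user never moves data explicitly; a transfer happens only when ownership actually changes sides, which is the property the encapsulating datatype is meant to provide.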

Bringing OpenCL features to C++ with SYCL

What features of OpenCL do we need?

• We want to make it easy to write high-performance OpenCL code in C++
  - SYCL code in C++ must use memory and execute kernels efficiently
  - We must provide developers with all the optimization options they have in OpenCL
• We want to enable all OpenCL features in C++ with SYCL
  - Support wide range of OpenCL devices: CPUs, GPUs, FPGAs, DSPs…
  - Data on host: Images and buffers; mapping, DMA and copying
  - Data on device: global/constant/local/private memory; multiple pointer sizes
  - Parallelism: ND ranges, work-groups, work-items, barriers, queues, events
  - Multi-device: Platforms, devices, contexts
• We want to enable OpenCL C code to interoperate with C++ SYCL code
  - Sharing of contexts, memory objects etc.

Example SYCL Code

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    void func (float *array_a, float *array_b, float *array_c,
               float *array_r, size_t count)
    {
        buffer<float, 1> buf_a(array_a, range<1>(count));
        buffer<float, 1> buf_b(array_b, range<1>(count));
        buffer<float, 1> buf_c(array_c, range<1>(count));
        buffer<float, 1> buf_r(array_r, range<1>(count));
        queue myQueue (gpu_selector{});

        myQueue.submit([&](handler& cgh) {
            auto a = buf_a.get_access<access::mode::read>(cgh);
            auto b = buf_b.get_access<access::mode::read>(cgh);
            auto c = buf_c.get_access<access::mode::read>(cgh);
            auto r = buf_r.get_access<access::mode::write>(cgh);

            // three_way_add is a placeholder kernel name
            cgh.parallel_for<class three_way_add>(range<1>(count), [=](id<1> i) {
                r[i] = a[i] + b[i] + c[i];
            });
        });
    }

• #include the SYCL header file
• Encapsulate data in SYCL buffers, which can be mapped or copied to or from OpenCL devices
• Create a queue, preferably on a GPU, which can execute kernels
• Submit to the queue all the work described in the handler lambda that follows
• Create accessors which encapsulate the type of access to data in the buffers
• Execute in parallel the work over an ND range (in this case ‘count’)
• This code is executed in parallel on the device

Task Graph Deduction

    const int n_items = 32;
    range<1> r(n_items);

    int array_a[n_items] = { 0 };
    int array_b[n_items] = { 0 };
    buffer<int, 1> buf_a(array_a, range<1>(r));
    buffer<int, 1> buf_b(array_b, range<1>(r));
    queue q;

    // The read_write access mode is assumed in all three command groups
    q.submit([&](handler& cgh) {
        auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
        algorithm_a s(acc_a);
        cgh.parallel_for(r, s);
    });

    q.submit([&](handler& cgh) {
        auto acc_b = buf_b.get_access<access::mode::read_write>(cgh);
        algorithm_b s(acc_b);
        cgh.parallel_for(r, s);
    });

    q.submit([&](handler& cgh) {
        auto acc_a = buf_a.get_access<access::mode::read_write>(cgh);
        algorithm_c s(acc_a);
        cgh.parallel_for(r, s);
    });

[Diagram: the three command groups ("Group") arranged as a task graph, labelled "Efficient Scheduling"]
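One way to picture how a runtime could deduce a task graph from accessors is the host-only sketch below. It is not how any particular SYCL implementation works: `Mode`, `Access`, `conflicts` and `deduce_dependencies` are hypothetical names, and the rule used (two command groups depend on each other when they touch the same buffer and at least one of them writes) is a simplifying assumption that ignores data movement.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

enum class Mode { Read, Write };

// One buffer use declared by a command group (hypothetical representation).
struct Access {
    int buffer_id;
    Mode mode;
};

// Two groups conflict when they touch the same buffer and at least one writes.
inline bool conflicts(const std::vector<Access>& earlier,
                      const std::vector<Access>& later) {
    for (const Access& e : earlier)
        for (const Access& l : later)
            if (e.buffer_id == l.buffer_id &&
                (e.mode == Mode::Write || l.mode == Mode::Write))
                return true;
    return false;
}

// Returns edges (i, j) meaning "group j must wait for group i".
// Groups are listed in submission order.
inline std::vector<std::pair<std::size_t, std::size_t>>
deduce_dependencies(const std::vector<std::vector<Access>>& groups) {
    std::vector<std::pair<std::size_t, std::size_t>> edges;
    for (std::size_t j = 0; j < groups.size(); ++j)
        for (std::size_t i = 0; i < j; ++i)
            if (conflicts(groups[i], groups[j])) edges.emplace_back(i, j);
    return edges;
}
```

Applied to three groups that write buffer A, buffer B and buffer A again, this yields a single edge from the first group to the third, so the first two groups are free to run concurrently.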

Data access with accessors

• Encapsulates the difference between data storage and data access
• Allows creation of a parallel task graph with schedule, synchronization and data movement
• Enables devices to use optimal access to data
  - Including having different pointer sizes on the device to those on the host
  - Allows usage of different address spaces for different data
• Enhanced with call-graph-duplication (for C++ pointers) and explicit pointer classes
  - To enable direct pointer-like access to data on the device
• Portable, because accessors can be implemented as raw pointers
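The split between storage and access can be sketched in standard C++ as follows. The names `host_buffer`, `host_accessor` and `access_mode` are hypothetical and are not the SYCL classes; the sketch only shows the idea that the access mode is part of the accessor's type and that, on the host, an accessor can be little more than a raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <type_traits>
#include <vector>

enum class access_mode { read, write };

// Hypothetical storage class: owns the data but offers no element access.
template <typename T>
class host_buffer {
public:
    explicit host_buffer(std::size_t n) : storage_(n) {}
    std::size_t size() const { return storage_.size(); }

private:
    template <typename U, access_mode M> friend class host_accessor;
    std::vector<T> storage_;
};

// Hypothetical accessor: the access mode is part of the type, so a read
// accessor hands out const references and cannot modify the data.
template <typename T, access_mode M>
class host_accessor {
    using ref =
        typename std::conditional<M == access_mode::read, const T&, T&>::type;

public:
    explicit host_accessor(host_buffer<T>& buf) : data_(buf.storage_.data()) {}
    ref operator[](std::size_t i) const { return data_[i]; }

private:
    T* data_;  // on the host, the accessor is just a raw pointer
};
```

Because the mode is visible in the type, a runtime can inspect which accessors a task requests before running it, and an attempt to assign through a read accessor fails at compile time.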

The levels of parallelism of OpenCL

1. Single-threaded host code
2. Queues of OpenCL work for devices
3. Device work-groups
4. Device sub-groups (OpenCL 2.x only)
5. Device work-items

• To achieve best performance out of OpenCL devices, developers usually have to understand and use this hierarchy of parallelism (e.g. tiled algorithms)
• Although this hierarchy will work on all OpenCL devices, the limits on sizes and the performance impact may vary considerably
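The relationship between work-groups and work-items in a one-dimensional range comes down to simple index arithmetic, sketched below. The helper names (`WorkItemIndex`, `decompose`, `compose`, `num_groups`) are hypothetical; rounding the group count up is a common way to pad a range to a multiple of the work-group size, not something the slides prescribe.

```cpp
#include <cassert>
#include <cstddef>

// Position of one work-item in a 1-D ND range (hypothetical helper type).
struct WorkItemIndex {
    std::size_t group;  // which work-group
    std::size_t local;  // position inside that work-group
};

// Split a global id into (work-group, local id) for a given work-group size.
inline WorkItemIndex decompose(std::size_t global_id, std::size_t local_size) {
    return {global_id / local_size, global_id % local_size};
}

// Inverse mapping: global id = group * local_size + local.
inline std::size_t compose(WorkItemIndex idx, std::size_t local_size) {
    return idx.group * local_size + idx.local;
}

// Number of work-groups needed to cover `count` items (rounds up).
inline std::size_t num_groups(std::size_t count, std::size_t local_size) {
    return (count + local_size - 1) / local_size;
}
```

A tiled algorithm typically uses the group index to pick a tile and the local index to pick an element within it.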

Hierarchical Data Parallelism

[Diagram: a task (nD-range) divided into work-groups, each containing work-items]

    // Element type, access modes and kernel name are placeholders
    buffer<int, 1> my_buffer(data, range<1>(10));

    q.submit([&](handler& cgh) {
        auto in_access = my_buffer.get_access<access::mode::read>(cgh);
        auto out_access = my_buffer.get_access<access::mode::write>(cgh);

        cgh.parallel_for_workgroup<class hierarchical>(
            range<1>(group_size), range<1>(local_size), [=](group<1> grp) {
                parallel_for_workitem(grp, [=](item<1> tile) {
                    out_access[tile] = in_access[tile] * 2;
                });
            });
    });

Advantages:
1. Easy to understand the concept of work-groups
2. Performance-portable between CPU, GPU and other devices
3. No need to think about barriers (automatically deduced)
4. Easier to compose components & algorithms into template algorithms (e.g. parallel_reduce)
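The two-level structure can be emulated on the host with nested loops, which may help in seeing why barriers can be deduced: each inner loop finishes before the code after it runs. This is a sketch with hypothetical names (`HostGroup`, `for_each_workgroup`, `for_each_workitem`, `double_all`), it runs serially, and it is not the SYCL API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for a work-group handle.
struct HostGroup {
    std::size_t id;          // index of this work-group
    std::size_t local_size;  // work-items per work-group
};

// Outer level: run the body once per work-group.
template <typename Body>
void for_each_workgroup(std::size_t num_groups, std::size_t local_size, Body body) {
    for (std::size_t g = 0; g < num_groups; ++g) body(HostGroup{g, local_size});
}

// Inner level: run the body once per work-item, passing its global index.
// Returning from this loop plays the role of the implicit barrier.
template <typename Body>
void for_each_workitem(const HostGroup& grp, Body body) {
    for (std::size_t l = 0; l < grp.local_size; ++l)
        body(grp.id * grp.local_size + l);
}

// Doubles every element, one work-group at a time.
inline std::vector<int> double_all(const std::vector<int>& in, std::size_t local_size) {
    assert(local_size > 0 && in.size() % local_size == 0);
    std::vector<int> out(in.size());
    for_each_workgroup(in.size() / local_size, local_size, [&](HostGroup grp) {
        for_each_workitem(grp, [&](std::size_t i) { out[i] = in[i] * 2; });
    });
    return out;
}
```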

‘Shared source’ approach to single-source

• This is not required for SYCL, but is designed as a possible implementation
• Have a different compiler for host and each device
  - Don’t really need to implement different front-end for each device, but can
• Benefits
  - Many developers are required to use a specific host compiler
  - Allows front-ends to optimize for specific devices: e.g. CPU, FPGA, GPU, DSP
  - Allows the pre-processor to be used by developers for portability and performance portability
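As an illustration of the pre-processor point, the sketch below selects a tuning constant per compiler pass. It assumes the implementation defines a macro such as `__SYCL_DEVICE_ONLY__` during the device compilation pass and leaves it undefined for the host compiler; the tile sizes and the names `tile_size`, `device_pass` and `tiles_needed` are purely illustrative.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: __SYCL_DEVICE_ONLY__ is defined only while the device compiler
// is processing this file; a plain host compiler takes the #else branch.
#ifdef __SYCL_DEVICE_ONLY__
constexpr std::size_t tile_size = 64;  // illustrative device tuning value
constexpr bool device_pass = true;
#else
constexpr std::size_t tile_size = 8;   // illustrative host tuning value
constexpr bool device_pass = false;
#endif

// Shared code sees whichever values the current compiler pass selected.
constexpr std::size_t tiles_needed(std::size_t count) {
    return (count + tile_size - 1) / tile_size;
}

static_assert(tile_size > 0, "tile size must be positive");
```

The same source file is seen by both compilers, and each pass picks its own branch, so device-specific tuning does not require separate source trees.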

Example SYCL Code: Building the program

    #include <CL/sycl.hpp>
    using namespace cl::sycl;

    int main ()
    {
        // array_a, array_b, array_c, array_r and count are assumed to be
        // declared by the surrounding program
        buffer<float, 1> buf_a(array_a, range<1>(count));
        buffer<float, 1> buf_b(array_b, range<1>(count));
        buffer<float, 1> buf_c(array_c, range<1>(count));
        buffer<float, 1> buf_r(array_r, range<1>(count));

        queue myQueue (gpu_selector{});
        myQueue.submit([&](handler& cgh) {
            auto a = buf_a.get_access<access::mode::read>(cgh);
            auto b = buf_b.get_access<access::mode::read>(cgh);
            auto c = buf_c.get_access<access::mode::read>(cgh);
            auto r = buf_r.get_access<access::mode::write>(cgh);

            // three_way_add is a placeholder kernel name
            cgh.parallel_for<class three_way_add>(range<1>(count), [=](id<1> i) {
                r[i] = a[i] + b[i] + c[i];
            });
        });
    }

• Header: #include the SYCL header file. This can be implemented differently for host and device. It can also #include the compiled device kernel binaries.
• Accessors: On host, the accessors can represent the dependencies in the program. On device, they can be implemented as OpenCL pointers (whether global, local or constant).
• Kernel body: This code is extracted by a device compiler and compiled for a device, including any functions or methods called from here. All the code must conform to the OpenCL kernel restrictions (e.g. no recursion). This code can be compiled for different devices from the same source code.
• Kernel name: This is the name of the lambda, which is used to enable the host to load the correct compiled device kernel into OpenCL. C++ reflection may remove this requirement.
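Why a type is used to name a lambda can be sketched as follows: a lambda's own type has no name that two separate compilers can agree on, but a user-supplied class name does. The registry below is a hypothetical host-only illustration (`kernel_registry`, `register_kernel`, `lookup_kernel`, `kernel_tag`, `kernel_a` and `kernel_b` are all invented for this sketch); a real implementation would store device binaries rather than strings.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <typeindex>
#include <typeinfo>

// Wrapper so that typeid also works when KernelName is an incomplete type.
template <typename KernelName>
struct kernel_tag {};

// Hypothetical registry mapping a kernel's name type to its compiled binary.
inline std::map<std::type_index, std::string>& kernel_registry() {
    static std::map<std::type_index, std::string> registry;
    return registry;
}

// What a device compiler's generated header might do: file the binary under
// the name type.
template <typename KernelName>
void register_kernel(const std::string& binary) {
    kernel_registry()[std::type_index(typeid(kernel_tag<KernelName>))] = binary;
}

// What the host runtime might do when a kernel is launched: find the binary
// by the same name type.
template <typename KernelName>
const std::string& lookup_kernel() {
    return kernel_registry().at(std::type_index(typeid(kernel_tag<KernelName>)));
}

// Kernel names are only declared, never defined: they serve purely as labels.
class kernel_a;
class kernel_b;
```

In this sketch, `register_kernel<kernel_a>("...")` followed by `lookup_kernel<kernel_a>()` returns the stored string, while a name that was never registered makes `at` throw `std::out_of_range`.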

What next?

Progress report on the SYCL vision

• Open, royalty-free standard: released
• Conformance testsuite: going into adopters package
• Open-source implementations: in progress (triSYCL and SYCL-GTX)
• Commercial, conformant implementation: in progress by Codeplay
• C++ 17 Parallel STL: open-source in progress https://github.com/KhronosGroup/SyclParallelSTL
• Demonstrated that C++ embedded DSLs can be developed in SYCL and have greater flexibility, performance and customizability than runtime-only approaches
• Outreach: SYCL workshops at PPoPP and IWOCL
• Template libraries for important C++ algorithms: getting going
• Integration into existing parallel C++ libraries: getting going

SYCL next

• SYCL currently supports OpenCL v1.2
  - We are now working towards supporting OpenCL 2.0 and above features in SYCL
• In OpenCL v1.2, SPIR is optional, but in OpenCL 2.1, the newer SPIR-V is required
  - This will enable much wider device support
• Also working on a C++ kernel language for developers who want separate device and host code in C++

Questions
