Language Constructs for Data Locality: Moving Policy Decisions from Language Definition to User Space

Brad Chamberlain, Chapel Team, Cray Inc. PADAL Workshop, Lugano Switzerland April 28th, 2014

C O M P U T E           |           S T O R E           |           A N A L Y Z E

Three Language Concepts for Taming Data Locality

Safe Harbor Statement

This presentation may contain forward-looking statements that are based on our current expectations. Forward looking statements may include statements about our financial guidance and expected operating results, our opportunities and future potential, our product development and new product introduction plans, our ability to expand and penetrate our addressable markets and other statements that are not historical facts. These statements are only predictions and actual results may materially vary from those projected. Please refer to Cray's documents filed with the SEC from time to time concerning factors that could affect the Company and these forward-looking statements.

C O M P U T E           |           S T O R E           |           A N A L Y Z E Copyright 2014 Cray Inc.

Prototypical Next-Gen Processor Technologies

Intel MIC

AMD APU

Nvidia Echelon

Tilera Tile-Gx

Sources: http://download.intel.com/pressroom/images/Aubrey_Isle_die.jpg, http://www.zdnet.com/amds-trinity-processors-take-on-intels-ivy-bridge-3040155225/, http://tilera.com/sites/default/files/productbriefs/Tile-Gx%203036%20SB012-01.pdf, http://insidehpc.com/2010/11/26/nvidia-reveals-details-of-echelon-gpu-designs-for-exascale/

Why do we need data locality control?

Emerging processor designs…
…are increasingly locality-sensitive
…potentially have multiple processor/memory types

Data Locality Control in Current HPC Models

Q: Why are current HPC models lacking w.r.t. data locality?

A: Because they…
…lock key data locality policies into the language
   ●  e.g., array layouts, parallel scheduling
…lack support for users to create new policy abstractions
…expose too much about their target architectures

In Chapel, we’re striving to improve upon this status quo:
“How can we define a language that supports high-level abstractions and enables users to plug in their own implementations?”

What is Chapel?

●  An emerging parallel programming language
●  Design and development led by Cray Inc.
   ●  in collaboration with academia, labs, industry
●  A work-in-progress
●  Goal: Improve productivity of parallel programming

What does “Productivity” mean to you?

Recent Graduate:

“something similar to what I used in school: Python, Matlab, Java, …”

Seasoned HPC Programmer:

“that sugary stuff that I can’t use because I require full control to ensure good performance”

Computational Scientist:

“something that lets me express my parallel computations without having to wrestle with architecture-specific details”

Chapel Team:

“something that lets the computational scientist express what they want, without taking away the control the HPC programmer wants, implemented in a language as attractive as recent graduates want.”

Chapel's Implementation

●  Being developed as open source at SourceForge
●  Licensed as BSD software
●  Portable design and implementation, targeting:
   ●  multicore desktops and laptops
   ●  commodity clusters and the cloud
   ●  HPC systems from Cray and other vendors
   ●  in-progress: exascale-era architectures

Multiresolution Design

Multiresolution Design: Support multiple tiers of features
●  higher levels for programmability, productivity
●  lower levels for greater degrees of control

[Figure: Chapel language concepts: Domain Maps, Data Parallelism, Task Parallelism, Base Language, Locality Control, Target Machine]

●  build the higher-level concepts in terms of the lower
●  permit the user to intermix layers arbitrarily

LULESH in Chapel

1288 lines of source code, plus 266 lines of comments and 487 blank lines (the corresponding C+MPI+OpenMP version is nearly 4x bigger).

This can be found in Chapel v1.9 in examples/benchmarks/lulesh/*.chpl

LULESH in Chapel

This is the only representation-dependent code. It specifies:
•  data structure choices
   •  structured vs. unstructured mesh
   •  local vs. distributed data
   •  sparse vs. dense materials arrays
•  a few supporting iterators

Data Parallelism in LULESH (Structured)

    const Elems = {0..#elemsPerEdge, 0..#elemsPerEdge},
          Nodes = {0..#nodesPerEdge, 0..#nodesPerEdge};

    var determ: [Elems] real;

    forall k in Elems { …determ[k]… }

[Figure: the Elems and Nodes index sets]

Data Parallelism in LULESH (Unstructured)

    const Elems = {0..#numElems},
          Nodes = {0..#numNodes};

    var determ: [Elems] real;
    var elemToNode: [Elems] nodesPerElem*index(Nodes);

    forall k in Elems { …determ[k]… }

[Figure: the Elems and Nodes index sets]

Implementing Domains and Arrays

Q: How are domains and arrays implemented? (distributed or local? distributed how? stored in memory how?)

    const Elems = {0..#numElems},
          Nodes = {0..#numNodes};

    var determ: [Elems] real;

A: Via Feature #1 (domain maps)…

Domain Maps: Concept

Domain maps are “recipes” that instruct the compiler how to map the global view of a computation…

    A = B + alpha * C;

…to the target locales’ memory and processors:

[Figure: the global operation A = B + α•C, shown split into per-locale pieces on Locale 0, Locale 1, and Locale 2]
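
As a rough illustration of what such a recipe has to arrange, the following Python sketch splits the global-view statement A = B + alpha * C into per-locale pieces. It is a model of the idea only, not Chapel code and not how the Chapel compiler works: plain Python lists stand in for locale memories, and the helper names split_into_blocks and scaled_add_local are hypothetical.

```python
# Illustrative model only: plain Python lists stand in for per-locale memories.
# The helper names below are hypothetical and do not come from Chapel.

def split_into_blocks(values, num_locales):
    """Split a global array into contiguous chunks, one per locale."""
    n = len(values)
    bounds = [n * loc // num_locales for loc in range(num_locales + 1)]
    return [values[bounds[loc]:bounds[loc + 1]] for loc in range(num_locales)]


def scaled_add_local(b_chunk, c_chunk, alpha):
    """The work a single locale performs on the chunks it owns."""
    return [b + alpha * c for b, c in zip(b_chunk, c_chunk)]


alpha = 2.0
B = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
C = [10.0, 20.0, 30.0, 40.0, 50.0, 60.0]
num_locales = 3

B_chunks = split_into_blocks(B, num_locales)
C_chunks = split_into_blocks(C, num_locales)

# Each locale computes its own piece; concatenating the pieces gives the global A.
A_chunks = [scaled_add_local(b, c, alpha) for b, c in zip(B_chunks, C_chunks)]
A = [x for chunk in A_chunks for x in chunk]
```

The user-visible statement stays a single line; only the splitting policy (here, contiguous blocks) would change if a different mapping were chosen.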

LULESH Data Structures (local)

    const Elems = {0..#numElems},
          Nodes = {0..#numNodes};

    var determ: [Elems] real;

    forall k in Elems { … }

No domain map specified ⇒ use default layout
•  current locale owns all indices and values
•  computation will execute using local processors only

[Figure: the Elems and Nodes index sets]

LULESH Data Structures (distributed, block)

    const Elems = {0..#numElems} dmapped Block(…),
          Nodes = {0..#numNodes} dmapped Block(…);

    var determ: [Elems] real;

    forall k in Elems { … }

[Figure: the Elems and Nodes index sets]

LULESH Data Structures (distributed, cyclic)

    const Elems = {0..#numElems} dmapped Cyclic(…),
          Nodes = {0..#numNodes} dmapped Cyclic(…);

    var determ: [Elems] real;

    forall k in Elems { … }

[Figure: the Elems and Nodes index sets]
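
To make the difference between the Block and Cyclic variants concrete, here is a small Python sketch of which locale would own each index under the two policies. It is a simplified model with hypothetical function names, assuming a 1D index range 0..n-1; the formulas are common textbook forms and are not claimed to match the exact arithmetic inside Chapel's Block and Cyclic domain maps.

```python
# Simplified ownership model; function names are hypothetical, not Chapel APIs.

def block_owner(i, n, num_locales):
    """Locale owning index i when 0..n-1 is cut into contiguous blocks."""
    return i * num_locales // n


def cyclic_owner(i, num_locales):
    """Locale owning index i when indices are dealt out round-robin."""
    return i % num_locales


n, num_locales = 8, 4
block = [block_owner(i, n, num_locales) for i in range(n)]
cyclic = [cyclic_owner(i, num_locales) for i in range(n)]

print(block)   # [0, 0, 1, 1, 2, 2, 3, 3]
print(cyclic)  # [0, 1, 2, 3, 0, 1, 2, 3]
```

In both cases the declarations of determ and the forall loop are unchanged; only the owner function behind the domain differs.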

Chapel’s Domain Map Philosophy

1.  Chapel provides a library of standard domain maps
    ●  to support common array implementations effortlessly
2.  Expert users can write their own domain maps in Chapel
    ●  to cope with any shortcomings in our standard library
3.  Chapel’s standard domain maps are written using the same end-user framework
    ●  to ensure that the framework works and works well

Domain Map Descriptors

Domain Map
●  Represents: a domain map value
●  Generic w.r.t.: index type
●  State: the domain map’s representation
●  Typical Size: Θ(1)
●  Required Interface:
   •  create new domains

Domain
●  Represents: a domain
●  Generic w.r.t.: index type
●  State: representation of index set
●  Typical Size: Θ(1) → Θ(numIndices)
●  Required Interface:
   •  create new arrays
   •  queries: size, members
   •  iterators: serial, parallel
   •  domain assignment
   •  index set operations

Array
●  Represents: an array
●  Generic w.r.t.: index type, element type
●  State: array elements
●  Typical Size: Θ(numIndices)
●  Required Interface:
   •  (re-)allocation of elements
   •  random access
   •  iterators: serial, parallel
   •  slicing, reindexing, aliases
   •  get/set of sparse “zero” values
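
The three-descriptor structure can be sketched in Python as abstract base classes plus one trivial single-memory implementation. This is an illustration under assumptions, not Chapel's actual domain map framework: all class and method names are hypothetical, and only part of the required interface is modeled (parallel iterators, domain assignment, slicing, and sparse values are left out).

```python
from abc import ABC, abstractmethod


class DomainMap(ABC):
    """Descriptor for a domain map value; its one duty here is creating domains."""

    @abstractmethod
    def new_domain(self, indices): ...


class Domain(ABC):
    """Descriptor for an index set."""

    @abstractmethod
    def new_array(self, fill): ...

    @abstractmethod
    def size(self): ...

    @abstractmethod
    def member(self, i): ...

    @abstractmethod
    def serial_iter(self): ...


class Array(ABC):
    """Descriptor for the elements stored over a domain."""

    @abstractmethod
    def __getitem__(self, i): ...

    @abstractmethod
    def __setitem__(self, i, value): ...


# A trivial "everything is local" implementation (hypothetical names).
class LocalMap(DomainMap):
    def new_domain(self, indices):
        return LocalDomain(self, indices)


class LocalDomain(Domain):
    def __init__(self, dmap, indices):
        self.dmap = dmap
        self.indices = indices  # a Python range: constant-size state for a dense index set

    def new_array(self, fill):
        return LocalArray(self, fill)

    def size(self):
        return len(self.indices)

    def member(self, i):
        return i in self.indices

    def serial_iter(self):
        return iter(self.indices)


class LocalArray(Array):
    def __init__(self, dom, fill):
        self.dom = dom
        self.elems = [fill] * dom.size()  # one slot per index

    def __getitem__(self, i):
        return self.elems[self.dom.indices.index(i)]

    def __setitem__(self, i, value):
        self.elems[self.dom.indices.index(i)] = value


elems = LocalMap().new_domain(range(0, 5))
determ = elems.new_array(0.0)
determ[3] = 1.5
```

The sizes mirror the descriptor summary: the map and the dense domain hold constant-size state, while the array holds one element per index.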

Domain Maps Summary

●  Data locality requires mapping arrays to memory well
   ●  distributions between distinct memories
   ●  layouts within a single memory

●  Most languages define a single data layout & distribution
   ●  where the distribution is often the degenerate “everything’s local”

●  Domain maps…
   …move such policies into user-space
   …exposing them to the end-user through high-level declarations

       const Elems = {0..#numElems} dmapped Block(…)

Implementing Data Parallel Loops

Q: How are parallel loops implemented? (how many tasks? executing where? how are iterations divided up?)

    forall k in Elems { … }

Q2: What about zippered data parallel operations? (how to reconcile potentially conflicting parallel implementations?)

    forall (k,d) in zip(Elems, determ) { … }
    x += xd * dt;

A: Via Feature #2 (leader-follower iterators)…

Leader-Follower Iterators: Definition

●  Chapel defines all forall loops in terms of leader-follower iterators:
   ●  leader iterators: create parallelism, assign iterations to tasks
   ●  follower iterators: serially execute work generated by leader

●  Given…

       forall (a,b,c) in zip(A,B,C) do
         a = b + alpha * c;

   …A is defined to be the leader
   …A, B, and C are all defined to be followers

●  Domain maps support default leader-follower iterators
   ●  specify parallel traversal of a domain’s indices/array’s elements
   ●  typically written to leverage affinity

Writing Leaders and Followers

Leader iterators are defined using task/locality features:

    iter BlockArr.lead() {
      coforall loc in Locales do
        on loc do
          coforall tid in 0..#here.numCores do
            yield computeMyChunk(loc.id, tid);
    }

Follower iterators simply use serial features:

    iter BlockArr.follow(work) {
      for i in work do
        yield accessElement(i);
    }
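
The division of labor between leader and followers can be modeled with Python generators. This is a serial sketch under assumptions, not Chapel's implementation: the names leader, follower, and zippered_forall are hypothetical, the chunks that a real leader would hand to parallel tasks are simply visited one after another, and because Python generators yield values rather than references, the loop body receives the index so it can write into the destination array.

```python
# Serial model of leader-follower iteration; all names are hypothetical.

def leader(n, num_locales, tasks_per_locale):
    """Leader: decide how the n iterations are split into per-task chunks."""
    num_tasks = num_locales * tasks_per_locale
    for t in range(num_tasks):
        lo = n * t // num_tasks
        hi = n * (t + 1) // num_tasks
        yield range(lo, hi)  # in Chapel, each chunk would go to its own task


def follower(array, work):
    """Follower: serially yield the elements named by one chunk of work."""
    for i in work:
        yield array[i]


def zippered_forall(n, arrays, body, num_locales=2, tasks_per_locale=2):
    """Drive every follower with the chunks produced by the single leader."""
    for work in leader(n, num_locales, tasks_per_locale):
        followers = [follower(a, work) for a in arrays]
        for i, elems in zip(work, zip(*followers)):
            body(i, *elems)


alpha = 0.5
A = [0.0] * 8
B = [float(i) for i in range(8)]
C = [2.0 * i for i in range(8)]


def body(i, b, c):
    A[i] = b + alpha * c


zippered_forall(len(A), [B, C], body)
chunks = list(leader(8, 2, 2))
```

Because every follower consumes the same chunk descriptions, the zippered arrays stay aligned no matter how the leader chooses to divide the iteration space.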

Leader-Follower Summary

●  Data locality requires parallel loops to execute intelligently
   ●  appropriate number and placement of tasks
   ●  good data-task affinity

●  Most languages define fixed parallel loop styles
   ●  where “no parallel loops” is a common choice

●  Leader-follower iterators…
   …move such policies into user-space
   …expose them to the end-user through data parallel abstractions

       forall k in Elems { … }
       x += xd * dt;

OK, but what about those future architectures?

Feature #3 (hierarchical locales)
●  extends multiresolution philosophy to architectural modeling

Traditional Locales

Concept:
●  Traditionally, Chapel has supported a 1D array of locales

[Figure: a row of four locales]

●  Supports inter-node locality well, but not intra-node
   ●  (which, of course, is becoming increasingly important)

Recent Work: Hierarchical Locales

Concept:
●  Support locales within locales to describe architectural sub-structures within a node (e.g., memories, processors)

[Figure: four locales, each containing sub-locale A, sub-locale B, and components labeled C, C, D, and E]

●  As with top-level locales, on-clauses and domain maps map tasks and variables to sub-locales
●  Locale models are defined using Chapel code

Defining Hierarchical Locales

1)  Define the processor’s abstract block structure

[Figure: a locale containing sub-locale A, sub-locale B, and components labeled C, C, D, and E]

2)  Define how to run a task on any sublocale
3)  Define how to allocate/access memory on any sublocale
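
The three steps can be mirrored in a small Python model of a locale tree. This is a sketch under assumptions rather than Chapel's locale model interface: the Locale class, its find, run, and alloc methods, and the sub-locale names are all hypothetical, and "running a task" is modeled as an ordinary function call.

```python
# Toy locale model; class and method names are hypothetical, not Chapel's API.

class Locale:
    def __init__(self, name, sublocales=()):
        # Step 1: the abstract block structure is a tree of named (sub-)locales.
        self.name = name
        self.sublocales = list(sublocales)
        self.memory = []  # buffers allocated on this (sub-)locale
        self.log = []     # names of the tasks that ran here

    def find(self, name):
        """Look up a (sub-)locale by name anywhere below this one."""
        if self.name == name:
            return self
        for sub in self.sublocales:
            hit = sub.find(name)
            if hit is not None:
                return hit
        return None

    def run(self, task, *args):
        """Step 2: run a task on this (sub-)locale."""
        self.log.append(task.__name__)
        return task(self, *args)

    def alloc(self, n, fill=0.0):
        """Step 3: allocate memory that belongs to this (sub-)locale."""
        buf = [fill] * n
        self.memory.append(buf)
        return buf


node = Locale("node0", [Locale("A"), Locale("B")])


def init(loc, n):
    # Memory is allocated wherever the task happens to be running.
    return loc.alloc(n, 1.0)


buf = node.find("B").run(init, 4)
```

A different architecture would change only the tree and the bodies of run and alloc; code written against the locale abstraction, such as init, would stay the same.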

Hierarchical Locale Summary

●  Data locality requires flexibility w.r.t. future architectures
   ●  due to uncertainty in processor design
   ●  to support portability between approaches

●  Most programming models assume certain features in the target architecture
   ●  this is why MPI/OpenMP/UPC/CUDA/… have restricted applicability

●  Hierarchical Locales
   …move the definition of new architectural models to user space
   …are exposed to the end-user via Chapel’s traditional locality features

       on loc do
         coforall tid in 0..#here.numCores do

Summary

Chapel’s multiresolution philosophy allows users to write…
…custom array implementations via domain maps
…custom parallel iterators via leader-follower iterators
…custom architectural models via hierarchical locales

The result is a language that moves crucial policies for managing data locality out of the language’s definition and into the expert user’s hands…
…while making them available to end-users through high-level abstractions

Why a new language?

Q: Why develop a new language rather than a library or language extension?

A: Because…
…having custom syntax presents policies to the end-user more cleanly
…it exposes optimization opportunities to the compiler
…helps with rank-independent indexing, arr-of-struct v. struct-of-array, …
…these concepts are more difficult to write in a traditional HPC language (due to lack of support for features like type inference, iterators, generics, …)

For More Information on…

…domain maps
●  User-Defined Distributions and Layouts in Chapel: Philosophy and Framework [slides], Chamberlain, Deitz, Iten, Choi; HotPar’10, June 2010.
●  Authoring User-Defined Domain Maps in Chapel [slides], Chamberlain, Choi, Deitz, Iten, Litvinov; CUG 2011, May 2011.

…leader-follower iterators
●  User-Defined Parallel Zippered Iterators in Chapel [slides], Chamberlain, Choi, Deitz, Navarro; PGAS 2011, October 2011.

…hierarchical locales
●  Hierarchical Locales: Exposing Node-Level Locality in Chapel, Choi; 2nd KIISE-KOCSEA SIG HPC Workshop talk, November 2013.

Status: all of these concepts are in use in every Chapel program today (pointers to code/docs in the release available by request)

The Cray Chapel Team (Summer 2013)

[Photos: Chapel USA, Chapel Seattle]

For More Information: Online Resources

Chapel project page: http://chapel.cray.com
●  overview, papers, presentations, language spec, …

Chapel SourceForge page: https://sourceforge.net/projects/chapel/
●  release downloads, public mailing lists, code repository, …

Mailing Aliases:
●  [email protected]: contact the team at Cray
●  [email protected]: announcement list
●  [email protected]: user-oriented discussion list
●  [email protected]: developer discussion
●  [email protected]: educator discussion
●  [email protected]: public bug forum

For More Information: Suggested Reading

Overview Papers:
●  The State of the Chapel Union [slides], Chamberlain, Choi, Dumler, Hildebrandt, Iten, Litvinov, Titus. CUG 2013, May 2013.
   ●  a high-level overview of the project summarizing the HPCS period
●  A Brief Overview of Chapel, Chamberlain (pre-print of a chapter for A Brief Overview of Parallel Programming Models, edited by Pavan Balaji, to be published by MIT Press in 2014).
   ●  a more detailed overview of Chapel’s history, motivating themes, features

Blog Articles:
●  [Ten] Myths About Scalable Programming Languages, Chamberlain. IEEE Technical Committee on Scalable Computing (TCSC) Blog (https://www.ieeetcsc.org/activities/blog/), April-November 2012.
   ●  a series of technical opinion pieces designed to rebut standard arguments against the development of high-level parallel languages

Chapel: the next five years

●  Harden prototype to production-grade
   ●  add/improve lacking features
   ●  optimize performance

●  Target more complex/modern compute node types
   ●  e.g., Intel MIC, CPU+GPU, AMD APU, …

●  Continue to grow the user and developer communities
   ●  including nontraditional circles: desktop parallelism, “big data”
   ●  transition Chapel from Cray-managed to community-governed

Chapel… …is a collaborative effort — join us!

Legal Disclaimer

Information in this document is provided in connection with Cray Inc. products. No license, express or implied, to any intellectual property rights is granted by this document. Cray Inc. may make changes to specifications and product descriptions at any time, without notice. All products, dates and figures specified are preliminary based on current expectations, and are subject to change without notice. Cray hardware and software products may contain design defects or errors known as errata, which may cause the product to deviate from published specifications. Current characterized errata are available on request. Cray uses codenames internally to identify products that are in development and not yet publicly announced for release. Customers and other third parties are not authorized by Cray Inc. to use codenames in advertising, promotion or marketing and any use of Cray Inc. internal codenames is at the sole risk of the user. Performance tests and ratings are measured using specific systems and/or components and reflect the approximate performance of Cray Inc. products as measured by those tests. Any difference in system hardware or software design or configuration may affect actual performance. The following are trademarks of Cray Inc. and are registered in the United States and other countries: CRAY and design, SONEXION, URIKA, and YARCDATA. The following are trademarks of Cray Inc.: ACE, APPRENTICE2, CHAPEL, CLUSTER CONNECT, CRAYPAT, CRAYPORT, ECOPHLEX, LIBSCI, NODEKARE, THREADSTORM. The following system family marks, and associated model number marks, are trademarks of Cray Inc.: CS, CX, XC, XE, XK, XMT, and XT. The registered trademark LINUX is used pursuant to a sublicense from LMI, the exclusive licensee of Linus Torvalds, owner of the mark on a worldwide basis. Other trademarks used in this document are the property of their respective owners. Copyright 2014 Cray Inc.
http://chapel.cray.com

[email protected] http://sourceforge.net/projects/chapel/
