Improving Performance of Communication Through Loop Scheduling in UPC

Michail Alvanos (a,d,*), Gabriel Tanase (b), Montse Farreras (e), Ettore Tiotto (c), José Nelson Amaral (f), Xavier Martorell (e)

(a) Barcelona Supercomputing Center, C/ Jordi Girona 1-3, Campus Nord UPC, 08034 Barcelona, Spain
(b) IBM Thomas J. Watson Research Center, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, United States
(c) IBM Toronto Laboratory, Toronto, Canada
(d) IBM Canada CAS Research, Markham, Ontario, Canada
(e) Department of Computer Architecture, Universitat Politècnica de Catalunya, C/ Jordi Girona 1-3, 08034 Barcelona, Spain
(f) Department of Computing Science, University of Alberta, Athabasca Hall (ATH) 342, T6G 2E8, Edmonton, Alberta, Canada

Abstract

Partitioned Global Address Space (PGAS) languages appeared to address programmer productivity on large-scale parallel machines. The main goal of a PGAS language is to provide the ease of use of the shared-memory programming model together with the performance of MPI. Unified Parallel C programs containing all-to-all communication can suffer from shared access conflicts on the same node, and manual or compiler code optimization is required to avoid oversubscription of the nodes. The downside of manual code transformations is the increased program complexity, which hinders programmer productivity. This paper explores loop scheduling algorithms and presents a compiler optimization that schedules the loop iterations to provide better network utilization and avoid node oversubscription. The compiler schedules the loop iterations to spread the communication uniformly across nodes. The evaluation shows mixed results: (i) slowdowns of up to 10% in the Sobel benchmark, (ii) performance gains from 3% up to 25% for the NAS FT and bucket-sort benchmarks, and (iii) up to 3.4X speedup for the microbenchmarks.

Keywords: Unified Parallel C, Partitioned Global Address Space, One-Sided Communication, Performance Evaluation, Loop Scheduling

*Corresponding author. Email addresses: [email protected] (M. Alvanos), [email protected] (G. Tanase), [email protected] (M. Farreras), [email protected] (E. Tiotto), [email protected] (J. N. Amaral), [email protected] (X. Martorell).

1. Introduction

Partitioned global address space languages [1, 2, 3, 4, 5, 6, 7] promise simple means for developing applications that can run on parallel systems without sacrificing performance. These languages extend existing languages with constructs to express parallelism and data distribution. They provide a shared-memory-like programming model, where the address space is partitioned and the programmer has control over the data layout. One of the characteristics of partitioned global address space (PGAS) languages is the transparency provided to the programmer when accessing shared memory. Accesses to shared memory lead to the automatic creation of additional runtime calls and possibly to network communication. A number of previous research efforts proposed different ways to improve performance, including prefetching [8], static coalescing [9, 10], and software caches [11]. Furthermore, high-radix network topologies are becoming a common approach [12, 13] to address the latency wall of modern supercomputers. A high-radix network provides low latency through a low hop count, and this network architecture provides good performance on different traffic patterns. To effectively avoid network congestion, the programmer or the runtime spreads non-uniform traffic evenly over the different links. However, an important category of UPC applications, those that require all-to-all communication, may create communication hotspots without the programmer being aware of it.
Although the UPC language provides collective operations, users often avoid them in order to retain the simplicity of the language. This paper explores possible loop scheduling schemes and proposes a loop transformation to improve the performance of the network communication. The paper explores three approaches to manual loop scheduling for loops with fine-grained communication and three approaches for loops with coarse-

grained communication. In the automatic loop transformation, the compiler transforms the loops to spread the communication between different network links to produce a more uniform traffic pattern. The implementation uses the XLUPC compiler platform [14] and the evaluation uses the IBM Power 775 [12]. Our contributions are:

• We propose and evaluate different loop scheduling schemes that increase the performance of applications by decreasing potential contention in the network. The performance improvement depends on the scheduling algorithm and the access type.

• We present a compiler loop transformation that restructures loops to improve the performance of the all-to-all communication pattern. The compiler provides performance comparable to manually optimized loops.

The rest of this paper is organized as follows. Section 2 provides an overview of the Unified Parallel C language and introduces the contention problem. Section 3 presents the loop scheduling optimization. Section 4 presents the methodology used. Section 5 presents the evaluation and Section 6 presents the related work. Section 7 presents the conclusions.

2. Background

Partitioned Global Address Space (PGAS) programming languages provide a uniform programming model for local, shared, and distributed memory hardware. The programmer sees a single coherent shared address space, where variables may be directly read and written by any thread, but each variable is physically associated with a single thread. PGAS languages, such as Unified Parallel C [1], Co-Array Fortran [2], Fortress [3], Chapel [4], X10 [5], Global Arrays [15], and Titanium [6], extend existing languages with constructs to express parallelism and data distribution. The programmer can access variables of different processes by using regular reads and writes. The downside of this approach is that the programmer is not always aware of the locality of data and can use remote accesses that lead to performance degradation.


2.1. Unified Parallel C

The Unified Parallel C (UPC) language [1] is an example of the PGAS programming model. The language is an extension of the C programming language [16] designed for high-performance computing on large-scale parallel machines. UPC uses a Single Program Multiple Data (SPMD) model of computation in which the amount of parallelism is fixed at program startup time. The UPC language can be mapped to distributed-memory machines, shared-memory machines, or hybrids, which are clusters of shared-memory machines.

Listing 1 presents the computation kernel of a parallel vector addition. The benchmark adds the contents of three vectors (A, B, and D) to the vector C. The programmer declares all vectors as shared arrays. Shared arrays can be accessed from all UPC threads. In this example, the programmer does not specify the layout qualifier (blocking factor). Thus, the compiler assumes that the blocking factor is one. The construct upc_forall distributes loop iterations among the UPC threads. The fourth expression in the upc_forall construct is the affinity expression, which specifies that the owner thread of the specified element will execute the ith loop iteration. The UPC compiler transforms the loop into a simple for loop with the proper bound limits and step size (Listing 2).

#define N 16384
shared int A[N+1], B[N+1], C[N], D[N];

upc_forall(i=0; i < N; i++; &C[i]){
  C[i] = A[i+1] + B[i+1] + D[i];
}

Listing 1: Example of a parallel vector addition using upc forall.

#define N 16384
shared int A[N+1], B[N+1], C[N], D[N];

for (i=MYTHREAD; i < N; i+= THREADS){
  C[i] = A[i+1] + B[i+1] + D[i];
}

Listing 2: Example of a transformed parallel upc forall loop.
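With a blocking factor of one, the affinity expression assigns iteration i to thread i % THREADS, which is exactly the index set visited by the transformed loop. The equivalence can be checked with a small plain-C sketch (the helper names are ours, not UPC APIs):

```c
#include <assert.h>

/* Indices executed by thread t under the upc_forall affinity test:
   with blocking factor 1, element i has affinity to thread i % threads. */
static long affinity_count(long n, int t, int threads) {
    long c = 0;
    for (long i = 0; i < n; i++)
        if (i % threads == t) c++;
    return c;
}

/* Indices executed by thread t in the compiler-transformed loop. */
static long transformed_count(long n, int t, int threads) {
    long c = 0;
    for (long i = t; i < n; i += threads) c++;
    return c;
}
```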

2.1.1. Memory Allocation and Data Blocking

In UPC there are two different types of memory allocation: local memory allocations performed using malloc, which are currently outside of


Figure 1: Different blocking allocation schemes: “perfect blocking”, BF = AR_SZ/THDS or [*] (a); default blocking, BF = [1] or empty (b); and single UPC thread affinity blocking, BF = [0] or [] (c).
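The index-to-owner mapping behind these schemes can be sketched in plain C (the helper is hypothetical, not part of the UPC runtime):

```c
#include <assert.h>

/* Owner of element i for a shared array distributed over nthreads UPC
   threads with blocking factor bf: blocks of bf consecutive elements are
   dealt out cyclically.  bf = AR_SZ/THDS gives scheme (a), bf = 1 gives
   scheme (b), and bf = 0 (scheme (c)) keeps the whole array on one thread. */
static int owner_thread(long i, long bf, int nthreads) {
    if (bf == 0) return 0;      /* indefinite blocking: single affinity */
    return (int)((i / bf) % nthreads);
}
```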

the tracking capability of the runtime, and shared memory allocated using UPC-specific constructs such as upc_all_alloc(), upc_alloc(), and upc_global_alloc(), or statically with the keyword shared. Shared arrays allocated using upc_all_alloc() are allocated symmetrically at the same virtual memory address on all locations. The user can declare shared arrays allocated in one thread's address space using upc_alloc(). One of the key characteristics of Partitioned Global Address Space programming models is that they introduce some complexity (partitioning), because it was very difficult to obtain performance from the Global Address Space (GAS) languages of the 1990s, which were usually implemented as software DSMs.

Unified Parallel C supports three different blocking factors. The blocking factor plays a key role in the performance of the application. The programmer declares the shared array by setting the blocking factor at the array declaration or through the dynamic memory allocation. For example, the programmer declares an array in blocked form using the statement shared [*] int D[N]; or shared [N/THREADS] int D[N];. Figure 1 presents the different blocking schemes; Figure 1(a) presents the most commonly used blocking scheme in applications.

2.1.2. Sources of Overhead

The compiler translates the shared accesses into runtime calls that fetch or modify the requested data. Each runtime call may imply communication, creating fine-grained communication that leads to poor performance. Previous work on eliminating fine-grained accesses [10, 17, 18] has greatly improved the performance of this type of code, by orders of magnitude. In the vector addition example, the compiler privatizes the accesses C[i] and D[i] (Listing 3). The compiler does not privatize the A[i+1] and B[i+1] accesses because these elements belong to

other UPC threads.

local_ptr_C = _local_addr(C);
local_ptr_D = _local_addr(D);

for (i=MYTHREAD; i < N; i+= THREADS){
  tmp0 = __xlupc_deref(&A[i+1]);
  tmp1 = __xlupc_deref(&B[i+1]);
  *(local_ptr_C + OFFSET(i)) = tmp0 + tmp1 + *(local_ptr_D + OFFSET(i));
}

Listing 3: Final form of a transformed parallel upc forall loop.

In addition, loops that contain all-to-all or reduction communication can overwhelm the nodes and create network congestion. The impact of hotspot creation is even higher in high-radix networks such as the PERCS interconnect [19]. Listing 4 presents an example of naive reduction code executed by all the UPC threads. In this example, all the UPC threads execute this part of the code, creating network hotspots. The array is allocated in blocked form, thus the first N/THREADS elements belong to the first UPC thread.

#define N 16384
shared [*] int A[N+1], B[N+1], C[N], D[N];

long long int calc(){
  int sum = 0, i;
  for (i=0; i<N; i++)
    sum += A[i];
  return sum;
}

Listing 4: Example of a naive reduction executed by all UPC threads.
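The hotspot can be seen by computing which thread owns the element read at iteration i: with blocked allocation, every UPC thread reads from the same owner on any given iteration. A plain-C sketch (the helper name is ours):

```c
#include <assert.h>

/* Owner of A[i] when A is allocated with the [*] blocking factor
   (block = n / threads elements per thread).  In the naive reduction,
   every UPC thread executes iteration i at roughly the same time, so
   all of them target this single owner simultaneously. */
static int hotspot_owner(long i, long n, int threads) {
    long block = n / threads;
    return (int)(i / block);
}
```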
2.2. XL UPC Framework The experimental prototype for the code transformations described in this paper is built on top of the XLUPC compiler framework [14]. The XLUPC compiler has three main components: (i) the Front End (FE) transforms the UPC source code to an intermediate representation (W-Code); (ii)



Figure 2: The P775 system. A node (octant) consists of four POWER7 chips and a hub chip. A drawer contains up to eight compute nodes, and a supernode contains up to four drawers.

the Toronto Portable Optimizer (TPO) high-level optimizer performs machine-independent optimizations; (iii) a low-level optimizer performs machine-dependent optimizations. The FE translates the UPC source into an extended W-Code to annotate all the shared variables and other constructs with their UPC semantics. TPO contains optimizations for UPC and for other languages including C and C++. TPO performs a number of UPC-specific optimizations such as locality analysis, parallel loop-nest optimizations, and the loop scheduling optimization presented in this paper. The XLUPC compiler uses the IBM PGAS runtime [20]. The runtime supports shared-memory multiprocessors using the Pthreads library and the Parallel Active Messaging Interface (PAMI) [21]. The runtime exposes to the compiler an Application Program Interface for managing shared data and synchronization.

2.3. IBM Power 775 Overview

The P775 [12] system employs a hierarchical design allowing highly scalable deployments of up to 512K processor cores. Figure 2 shows a basic diagram of the system to help visualize the various networks employed. The compute node of the P775 consists of four POWER7 (P7) [22] CPUs and a HUB chip [19], all managed by a single OS instance. The POWER7 processor has 32 KBytes of instruction and 32 KBytes of L1 data cache


per core, a 256 KByte second-level cache per core, and a 32 MByte third-level cache shared per chip. Each core is equipped with four SMT threads and 12 execution units. The HUB provides the network connectivity between the four P7 CPUs participating in the cache coherence protocol. Additionally, the HUB acts as a switch supporting communication with other HUBs in the system. There is no additional communication hardware present in the system (no switches). Each compute node (octant) has a total of 32 cores, 128 threads, and up to 512 GB of memory. The peak performance of a compute node is 0.98 Tflops/s.

A large P775 system is organized, at a higher level, in drawers consisting of 8 octants (256 cores) connected in an all-to-all fashion, for a total of 7.86 Tflops/s. The links between any two nodes of a drawer are referred to as Llocal (LL) links, with a peak bandwidth of 24 Gbytes/s in each direction. A supernode consists of four drawers (1024 cores, 31.4 Tflops/s). Within a supernode each pair of octants is connected with an Lremote (LR) link with 5 Gbytes/s in each direction. A full P775 system may contain up to 512 supernodes (524288 cores) with a peak performance of 16 Pflops/s. Between each pair of supernodes multiple optical D-links are used, each D-link having a peak bandwidth of 10 Gbytes/s in each direction. The machine has a partial all-to-all topology where any two compute nodes are at most three hops away. The large bandwidth of the all-to-all topology enabled various P775 systems to be top ranked in the HPC Challenge benchmarks (e.g., GUPS, FFT) [23] and Graph 500 [24].

3. Loop scheduling

The all-to-all communication pattern and the concurrent access of shared data allocated to one UPC thread often stress the interconnection network. The all-to-all communication pattern is one of the most important metrics for the evaluation of the bandwidth of the system. In this case, each thread communicates with all other UPC threads to exchange data. Moreover, this pattern shows up in a large number of scientific applications, including FFT, Sort, and Graph 500. Thus, by scheduling the loop iterations to improve the network's efficiency, the communication overhead can be significantly decreased. This section examines four different approaches to scheduling loop iterations for either coarse-grained or fine-grained communication. The transformations assume that the programmer allocates the shared arrays in blocked fashion. Furthermore, we assume that

the loop upper bound has the same value as the number of elements of the shared array. This assumption is not always true, but it simplifies the presentation of the algorithms. Finally, this section presents an automatic loop transformation.

3.1. Approaches

The core idea is to schedule the accesses in such a way that the threads do not access the same shared data at the same time. The programmer manually transforms the loop to increase the performance. We categorize the loop scheduling transformations in four categories:

• Skewed: loops are skewed to start the iteration from a different point of the loop iteration space. Listing 5 presents the modified code. This is possible in the UPC language by using the MYTHREAD keyword in the expression that calculates the induction variable. The new induction variable of the loop is calculated by the following equation:

NEW_IV = (IV + MYTHREAD × Block) % UB;

where Block = SIZE_OF_ARRAY / THREADS and UB is the upper bound of the loop.

• Skewed plus: the ‘plus’ is to distribute the communication uniformly among nodes. Each UPC thread starts from a different block of the shared array and also from a different position inside the block. The new induction variable is calculated by:

NEW_IV = (IV + MYTHREAD × Block + MYTHREAD × Block/THREADS) % UB;

Figure 3 shows the differences between the two skewed versions. The ‘plus’ version accesses elements of other UPC threads in diagonal form. This approach is expected to achieve better performance than the baseline because it spreads the communication more uniformly than the ‘simple’ skewed version.

• Strided: each thread starts from a different block. The loop increases the induction variable by a constant number: the number of UPC threads per node plus one. This approach requires that the upper bound of the loop not be divisible by the constant number


(stride) [25, 26, 27]. The new induction variable is calculated by the following equation:

NEW_IV = (IV × STRIDE + MYTHREAD) % UB;

To ensure the non-divisibility of the loop upper bound we can use a simple algorithm:

STRIDE = THREADS + 1;
while (UB % STRIDE == 0) STRIDE++;

For example, on the Power 775 architecture, when running with 32 UPC threads per node and assuming that the upper bound of the loop is 2048, the new induction variable is calculated by NEW_IV = (IV × 33 + MYTHREAD) % UB.

shared [M/THREADS] double K[M];
shared [M/THREADS] double L[M];

double fine_grain_get_skew(){
  uint64_t i=0, block=M/THREADS;
  double res = 0;
  for (i=0; i<M; i++){
    uint64_t idx = (i + MYTHREAD*block) % M;
    res += K[idx] + L[idx];
  }
  return res;
}
Listing 5: Example of loop modifications when code contains fine-grained accesses.
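The three index mappings above can be collected into one self-contained C sketch (assuming, as the text does, that UB is a multiple of THREADS for the skewed variants; the function names are ours):

```c
#include <assert.h>

/* 'Skewed': each thread starts at the beginning of its own block. */
static long skew_iv(long iv, long mythread, long block, long ub) {
    return (iv + mythread * block) % ub;
}

/* 'Skewed plus': additionally offset the start inside the block. */
static long skew_plus_iv(long iv, long mythread, long block,
                         long threads, long ub) {
    return (iv + mythread * block + mythread * (block / threads)) % ub;
}

/* 'Strided': pick a stride that does not divide UB, then stride through. */
static long pick_stride(long ub, long threads) {
    long stride = threads + 1;
    while (ub % stride == 0) stride++;
    return stride;
}
static long strided_iv(long iv, long stride, long mythread, long ub) {
    return (iv * stride + mythread) % ub;
}
```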

• Random shuffled: the loop uses a look-up array that contains the thread numbers randomly shuffled. This approach works only when the upper bound of the loop is equal to the number of threads. The loop creates an all-to-all communication pattern through the network. This optimization is applicable to loops with both fine-grained and coarse-grained communication. There are two downsides when applying this approach to loops with fine-grained communication. First of all,


Figure 3: Different schemes of accessing a shared array. The shared object is allocated in blocked form. Each row represents data residing in one UPC thread and each box is an array element. The different access types are: (a) baseline: all UPC threads access the same data; (b) ‘Skewed’: each UPC thread accesses elements from a different UPC thread; (c) ‘Skewed plus’: each UPC thread accesses elements from a different thread and from a different point inside the block.

it requires SIZE_OF_ARRAY × NUM_THREADS × sizeof(uint64_t) bytes of memory, which cannot be allocated for large arrays. The second drawback is that the runtime (or the compiler) has to shuffle the array each time the upper bound of the loop changes, wasting valuable program execution time.

3.2. Compiler-assisted loop transformation

The idea of the compiler-assisted loop transformation is to conceal this complexity of the network from the programmer. First, the compiler collects normalized loops that contain shared references and have no loop-carried dependencies. Next, the compiler checks whether the upper bound of the loop is greater than or equal to the number of UPC threads, and that the loop is not a upc_forall loop. The compiler categorizes the loops in two groups based on the loop upper bound and the shared access type, and applies the transformation depending on the loop category. Figure 4 presents the compiler algorithm. The compiler makes the decision based on the usage of upc_memget and upc_memput calls and on the value of the loop upper bound. This type of code is likely to contain all-to-all communication, making it an ideal target for the random shuffled solution. The compiler categorizes the loops in:

• Loops that have coarse-grained transfers and whose upper bound is the number of UPC threads. In this case, the runtime returns to the program a look-up table whose size equals the

array size. The contents of the look-up table are the randomly shuffled values for the induction variable. We use this technique only for loops whose upper bound is equal to the number of threads. Thus, the shuffled values range from 0 up to THREADS-1. To improve the performance of this approach, the runtime creates the shuffled array at the initialization phase. Next, the compiler replaces the induction variable inside the body of the loop with the value returned from the look-up table. Listing 6 presents the final form of the transformed loop. The all-to-all communication pattern belongs to this category.

shared [N/THREADS] double X[N];
shared [N/THREADS] double Y[N];

void memget_threads_rand(){
  uint64_t i=0, block = N/THREADS;
  double *lptr = (double *) &X[block*MYTHREAD];
  uint64_t *tshuffle = __random_thread_array();
  for ( i=0; i<THREADS; i++ )
    upc_memget(lptr, &Y[tshuffle[i]*block], block*sizeof(double));
}

Listing 6: Final form of the loop transformed using the shuffled look-up table. The original loop before the transformation, for comparison:

shared [N/THREADS] double X[N];
shared [N/THREADS] double Y[N];

void memget_threads(){
  uint64_t i=0, block = N/THREADS;
  double *lptr = (double *) &X[block*MYTHREAD];
  for ( i=0; i<THREADS; i++ )
    upc_memget(lptr, &Y[i*block], block*sizeof(double));
}
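The shuffled look-up table returned by __random_thread_array() can be pictured as a one-time Fisher-Yates shuffle of the thread IDs; the following is a self-contained C sketch under that assumption (the real table is built inside the PGAS runtime at initialization):

```c
#include <assert.h>
#include <stdlib.h>

/* Build a randomly shuffled permutation of 0..nthreads-1 (Fisher-Yates).
   Sketch of the table __random_thread_array() hands back; building it
   once avoids re-shuffling on every loop execution. */
static unsigned long *shuffled_thread_array(int nthreads) {
    unsigned long *t = malloc((size_t)nthreads * sizeof *t);
    for (int i = 0; i < nthreads; i++) t[i] = (unsigned long)i;
    for (int i = nthreads - 1; i > 0; i--) {
        int j = rand() % (i + 1);            /* pick from t[0..i] */
        unsigned long tmp = t[i]; t[i] = t[j]; t[j] = tmp;
    }
    return t;
}
```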
• Loops that contain fine-grained communication, or coarse-grained transfers where the upper bound of the loop differs from the number of UPC threads. In this case, the compiler


Figure 4: Automatic compiler loop scheduling.
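The decision tree in Figure 4 reduces to a single two-way test; a minimal C sketch (the enum and function names are ours):

```c
#include <assert.h>

typedef enum { SCHED_SHUFFLE, SCHED_SKEW } sched_kind;

/* Mirror of the compiler's choice: coarse-grained upc_mem* loops whose
   upper bound equals THREADS get the shuffled look-up table; every other
   candidate loop has its iterations skewed. */
static sched_kind choose_schedule(int has_upc_mem_calls,
                                  long ub, long threads) {
    return (has_upc_mem_calls && ub == threads) ? SCHED_SHUFFLE : SCHED_SKEW;
}
```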

skews the iterations in such a way that each UPC thread starts executing from a different point in the shared array. The compiler uses the simple form to ‘skew’ the iterations to avoid the creation of additional runtime calls. Finally, the compiler replaces the occurrences of the induction variable inside the loop body. Listing 7 illustrates the final form of a loop containing fine-grained communication.

4. Methodology

The evaluation uses two different Power 775 systems. The first one has 1024 nodes and is used to evaluate the manual code modifications. The second system contains 64 nodes, allowing runs with up to 2048 UPC threads, and is used to evaluate the automatic compiler transformations. We use one process per UPC thread and schedule one UPC thread per POWER7 core. The UPC threads are grouped in blocks of 32 per node and each UPC thread is bound to its own core. The experimental evaluation runs each benchmark five times; the results presented in this evaluation are the average execution time of the five runs. In all experiments the execution time variation is less than 3%. All benchmarks are compiled using the ‘-qarch=pwr7 -qtune=pwr7 -O3 -qprefetch=aggressive’ compiler flags. The evaluation varies the data set size with the number of UPC threads (weak scaling).


4.1. Benchmarks and Datasets

The evaluation uses three microbenchmarks and four applications.

Microbenchmark: The microbenchmark is a loop that accesses a shared array of structures. There are three variations of this microbenchmark, and in all versions each UPC thread executes the same code in the loop. In the upc_memput microbenchmark the loop contains coarse-grained upc_memput calls. The fine-grained get contains shared reads and the fine-grained put contains shared writes. Listing 8 presents the code used in the upc_memput benchmark.

#define N (1llu<<32)
shared [N/THREADS] double X[N];
shared [N/THREADS] double Y[N];

void memget_threads(){
  uint64_t i=0;
  uint64_t block = N/THREADS;
  double *lptr = (double *) &X[block*MYTHREAD];
  for (i=0; i<THREADS; i++)
    upc_memput(&Y[i*block], lptr, block*sizeof(double));
}

Listing 8: Code of the coarse-grained upc memput microbenchmark.
Sobel: The Sobel benchmark computes an approximation of the gradient of the image intensity function, performing a nine-point stencil operation. In the UPC version [28] the image is represented as a one-dimensional shared array of rows and the outer loop is a parallel upc_forall loop. Listing 9 presents the kernel of the Sobel benchmark.

Gravitational fish: The gravitational UPC fish benchmark emulates fish movements based on gravity. The benchmark is an N-body gravity simulation using parallel ordinary differential equations [29]. There are three loops in the benchmark that access shared data: one for the computation of acceleration between fishes, one for data exchange, and one for the new position calculation.

Bucket-sort: The benchmark sorts an array of 16-byte records using the bucket-sort [30] algorithm. Each node generates its share of the records. Each thread uses a 17 × 2 GB buffer to hold records received from other

threads, which are destined to be sorted on this thread. Once this thread has generated all its share of the records, it distributes the remainder of each bucket to the corresponding thread. Once this thread has received all its appropriate data from each of the other threads, it performs a sort on the local data.

FT: The benchmark is part of the NAS NPB suite [31]. The benchmark solves a three-dimensional partial differential equation (PDE) using the fast Fourier transform (FFT). The benchmark creates an all-to-all communication pattern over the network.

typedef struct { uint8_t r[IMGSZ]; } RowOfBytes;
shared RowOfBytes orig[IMGSZ];
shared RowOfBytes edge[IMGSZ];

void Sobel_upc(void){
  int i,j,d1,d2;
  double magn;
  upc_forall(i=1; i<IMGSZ-1; i++; &edge[i].r[0]){
    for (j=1; j<IMGSZ-1; j++){
      d1  = ((int) orig[i-1].r[j+1] - orig[i-1].r[j-1]);
      d1 += ((int) orig[ i ].r[j+1] - orig[ i ].r[j-1])<<1;
      d1 += ((int) orig[i+1].r[j+1] - orig[i+1].r[j-1]);
      d2  = ((int) orig[i-1].r[j-1] - orig[i+1].r[j-1]);
      d2 += ((int) orig[i-1].r[ j ] - orig[i+1].r[ j ])<<1;
      d2 += ((int) orig[i-1].r[j+1] - orig[i+1].r[j+1]);
      magn = sqrt((double)(d1*d1 + d2*d2));
      edge[i].r[j] = (uint8_t) ((magn>255) ? 255 : magn);
    }
  }
}

Listing 9: UPC version of Sobel.

5. Experimental results

The experimental evaluation assesses the effectiveness of the loop transformation by presenting: (1) a study of manual code transformations on microbenchmarks, using a large number of UPC threads, to help understand the maximum speedup that can be


Figure 5: Effect of loop scheduling policies on performance for upc memput.

achieved and the potential performance bottlenecks; (2) the performance of compiler-transformed microbenchmarks and real applications.

5.1. Manual code transformations study

Figure 5 presents the results for the coarse-grained microbenchmark. The drop in performance when going from 512 to 1024 UPC threads is due to the HUB link limits: the microbenchmarks use two supernodes when running with 1024 UPC threads, so a portion of the communication between the UPC threads uses the (remote) D-links. Furthermore, the strided version has lower performance than the random shuffled version. The traffic randomization balances the use of global links, reducing contention. In contrast, the strided version creates less randomized, more predictable traffic by targeting a different node on each loop iteration. Overall, performance for a coarse-grained communication pattern is better when using random allocation. Note that there is no skewed plus version because the upc_memget and upc_memput calls always start from the beginning of the block.

Figure 6 illustrates the performance of the fine-grained microbenchmarks. The skewed and skewed plus versions have similar performance in the fine-grained category. Thus, using a different starting point inside the block has no real impact on the performance. The strided version has worse performance than the skewed version for fine-grained get. On the other hand, the strided version has better performance for fine-grained put. This occurs because the runtime issues in-order remote stores that target the same remote node.



Figure 6: Effect of loop scheduling policies on performance for fine-grained get (a) and fine-grained put (b).

Moreover, the performance of the fine-grained put is an order of magnitude faster than the fine-grained get. This behaviour is especially noticeable in the strided version. The main reason is that the runtime allows overlapping of store/put operations when the destination node is different. On the other hand, for read/get operations the runtime blocks and waits for the transfers to finish. Overall, for loops that contain coarse-grained memget/memput transfers the compiler should use the random shuffle, while for loops with fine-grained communication it should use the skewing transformation.

5.2. Compiler-assisted loop transformation

This section compares the automatic compiler transformation with the manual transformation. The main difference is that the manual approach avoids the additional overhead of inserted runtime calls. Figure 7 compares the compiler-transformed and hand-optimized code. While the performance of the manual and compiler-transformed fine-grained microbenchmarks is similar, the compiler transformation achieves slightly lower performance than the hand-optimized benchmark because of the insertion of runtime calls.

Figure 8 presents the application results. There are three different patterns in the applications. In the first category are applications that show a performance gain compared with the version without the scheduling (baseline), such as the NAS FT benchmark. This benchmark achieves from 3% up to 15% performance gain, due to its all-to-all communication pattern.


Figure 7: Comparison of compiler-transformed and hand-optimized code: upc memput (a), fine-grained get (b), and fine-grained put (c).


Figure 8: Comparison of baseline and compiler-transformed code for fish (a), Sobel (b), and NAS FT (c).

Moreover, the performance of the gravitational fish benchmark is almost identical and the transformation reveals minimal performance gains. On the other hand, the performance of the Sobel benchmark decreases by up to 20% compared with the baseline version because of poor cache locality. Table 1 presents the cache misses for the Sobel benchmark for different cache levels, measured using hardware counters. The main reason for the bad cache locality is that Sobel uses an array of structs and each struct contains an array. The transformation ‘skews’ the iterations of the loop that accesses the inner array, causing bad cache locality.

Figure 9 presents the results for bucket-sort with the local sort enabled and disabled. There are minor differences between the baseline and the transformed version when the local sort is enabled.

Cache level Level 1 Level 2 Level 3

Baseline % 0.14% 0.19% 0.32%

Schedule 0.19% 24.49% 28.84%

Table 1: Cache misses using for the Sobel benchmark using 256 UPC threads. Results are the average from each of 256 cores. 10000

Baseline Compiler

Baseline Compiler M Record/s

300

30

UPC THREADS

2048

1024

512

128

256

32

2048

1024

512

256

128

64

100

32

3

1000

64

M Record/s

3000

UPC THREADS

(a)

(b)

Figure 9: Comparison of baseline and compiler-transformed code for bucket-sort (a) and bucket-sort with only the communication pattern (b).

ever, the transformed version has better results — up to 25% performance gain — than the baseline version when only the communication part is used. In Figure 9(b) the performance for 32 UPC threads is 40% worse than the baseline because of the overhead of the additional runtime calls. The effectiveness of the loop transformation is limited when running less than 32 UPC threads. 5.3. Summary The evaluation indicates that the compiler transformation is an effective technique for increasing the network performance in UPC languages. The results show a performance gain from 5% up to 25% compared to the baseline versions of the applications. Moreover, microbenchmark results show even higher performance gains of up to 3.4X. On the other hand, the loop transformation has negative effect on the cache locality in some benchmarks, such as Sobel.


6. Related Work

Overall, memory optimization is a widely researched topic [32, 33]. Traditional loop transformations focus on increasing cache performance and bandwidth [34, 35]. Researchers also use loop scheduling techniques to improve the performance of Non-Uniform Memory Access (NUMA) machines [36, 37] and heterogeneous machines [38, 39]. The Crystallizing Fortran project compiler [40] employs different algorithms [41] that change the data layout of the application to improve performance in distributed environments. In contrast, our approach modifies the order of the loop iterations, not the data layout of the application. Other runtime implementations [42, 14] use the techniques presented in this paper to schedule communication internally in the runtime; however, the programmer must invoke the runtime calls to use the interconnect efficiently. Our approach does not require the programmer to call the runtime explicitly to improve network performance. Recent efforts on loop transformations focus on reducing memory bank conflicts, especially in GPUs [43] and embedded systems [44, 32]. In a similar way, our approach exploits loop transformations to increase network efficiency by distributing the shared accesses across the UPC threads. In contrast with most of the previous work, our approach employs skewed and random distributions to improve performance.

The Dragonfly interconnect [45] uses randomized routing to reduce network hotspots, but this randomized routing requires additional hardware support. We move the complexity of loop scheduling to software: our compiler transformation can offer comparable performance by randomizing the source and destination pairs involved in communication. Other approaches that produce a uniform random traffic pattern include randomized task placement [46] and adaptive routing [27].
The authors of [27] randomize the routing at runtime to achieve better performance on different traffic patterns. Randomized task placement [46] can increase the amount of randomized traffic and avoid hotspots. However, other researchers have shown [47] that, despite the improved results, the performance of randomized uniform traffic is still far from ideal.

7. Conclusions and Future Work

This paper presents an optimization that increases the performance of different communication patterns. The optimization schedules the loop iterations to increase network performance. The paper evaluates different approaches for scheduling the loop iterations to optimize network traffic and avoid the creation of network hotspots. Moreover, the paper presents a compiler optimization that automatically transforms the loops to improve network performance. The compiler transformation achieves performance similar to that of the manually modified versions. Even though the communication efficiency increases, there is still room for further improvement in the compiler transformations. Nevertheless, this paper makes a significant contribution to improving the performance of programs that contain access patterns that are problematic for high-radix networks, including all-to-all communication.

Acknowledgments

The researchers at Universitat Politècnica de Catalunya and Barcelona Supercomputing Center are supported by the IBM Centers for Advanced Studies Fellowship (CAS2012-069), the Spanish Ministry of Science and Innovation (TIN2007-60625, TIN2012-34557, and CSD2007-00050), the European Commission in the context of the HiPEAC3 Network of Excellence (FP7/ICT 287759), and the Generalitat de Catalunya (2009-SGR-980). IBM researchers are supported by the Defense Advanced Research Projects Agency under its Agreement No. HR0011-07-9-0002. The researchers at the University of Alberta are supported by the NSERC Collaborative Research and Development (CRD) program of Canada. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the funding agencies.

References

[1] UPC Consortium, UPC Specifications, v1.2, Tech. rep., Lawrence Berkeley National Lab Tech Report LBNL-59208.

[2] R. Numrich, J. Reid, Co-Array Fortran for parallel programming, Tech. rep. (1998).

[3] E. Allen, D. Chase, J. Hallett, V. Luchangco, J.-W. Maessen, S. Ryu, G. L. Steele Jr., S. Tobin-Hochstadt, The Fortress Language Specification Version 1.0, http://labs.oracle.com/projects/plrg/Publications/fortress.1.0.pdf (March 2008).

[4] Cray Inc., Chapel Language Specification Version 0.8, http://chapel.cray.com/spec/spec-0.8.pdf (April 2011).

[5] P. Charles, C. Grothoff, V. Saraswat, C. Donawa, A. Kielstra, K. Ebcioglu, C. von Praun, V. Sarkar, X10: An Object-oriented Approach to Non-uniform Cluster Computing, ACM SIGPLAN Notices 40 (10) (2005) 519–538.

[6] K. A. Yelick, L. Semenzato, G. Pike, C. Miyamoto, B. Liblit, A. Krishnamurthy, P. N. Hilfinger, S. L. Graham, D. Gay, P. Colella, A. Aiken, Titanium: A High-performance Java Dialect, Concurrency: Practice and Experience 10 (11-13) (1998) 825–836.

[7] J. Lee, M. Sato, Implementation and Performance Evaluation of XcalableMP: A Parallel Programming Language for Distributed Memory Systems, in: Parallel Processing Workshops (ICPPW), 2010 39th International Conference on, 2010, pp. 413–420. doi:10.1109/ICPPW.2010.62.

[8] M. Alvanos, M. Farreras, E. Tiotto, X. Martorell, Automatic Communication Coalescing for Irregular Computations in UPC Language, in: Conference of the Center for Advanced Studies, CASCON '12.

[9] W.-Y. Chen, C. Iancu, K. Yelick, Communication Optimizations for Fine-Grained UPC Applications, in: 14th International Conference on Parallel Architectures and Compilation Techniques, 2005.

[10] C. Barton, G. Almasi, M. Farreras, J. N. Amaral, A Unified Parallel C compiler that implements automatic communication coalescing, in: 14th Workshop on Compilers for Parallel Computing, 2009.

[11] Z. Zhang, J. Savant, S. Seidel, A UPC Runtime System Based on MPI and POSIX Threads, in: Parallel, Distributed, and Network-Based Processing, Euromicro Conference on.

[12] R. Rajamony, L. Arimilli, K. Gildea, PERCS: The IBM POWER7-IH high-performance computing system, IBM Journal of Research and Development 55 (3) (2011) 3:1.


[13] G. Faanes, A. Bataineh, D. Roweth, T. Court, E. Froese, B. Alverson, T. Johnson, J. Kopnick, M. Higgins, J. Reinhard, Cray Cascade: a Scalable HPC System Based on a Dragonfly Network, in: Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis, SC '12, 2012.

[14] G. Tanase, G. Almási, E. Tiotto, M. Alvanos, A. Ly, B. Dalton, Performance Analysis of the IBM XL UPC on the PERCS Architecture, Tech. rep., RC25360 (2013).

[15] J. Nieplocha, R. J. Harrison, R. J. Littlefield, Global Arrays: A nonuniform memory access programming model for high-performance computers, The Journal of Supercomputing 10 (2) (1996) 169–189.

[16] ISO/IEC JTC1 SC22 WG14, ISO/IEC 9899:TC2 Programming Languages - C, Tech. rep., http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1124.pdf (May 2005).

[17] C. Barton, C. Cascaval, G. Almasi, Y. Zheng, M. Farreras, S. Chatterjee, J. N. Amaral, Shared memory programming for large scale machines, in: Programming Language Design and Implementation (PLDI), 2006, pp. 108–117.

[18] M. Alvanos, M. Farreras, E. Tiotto, J. N. Amaral, X. Martorell, Improving Communication in PGAS Environments: Static and Dynamic Coalescing in UPC, in: Proceedings of the 27th Annual International Conference on Supercomputing, ICS '13.

[19] B. Arimilli, R. Arimilli, V. Chung, S. Clark, W. Denzel, B. Drerup, T. Hoefler, J. Joyner, J. Lewis, J. Li, N. Ni, R. Rajamony, The PERCS High-Performance Interconnect, in: High-Performance Interconnects, Symposium on, 2010, pp. 75–82.

[20] G. I. Tanase, G. Almási, H. Xue, C. Archer, Composable, Non-blocking Collective Operations on Power7 IH, in: Proceedings of the 26th ACM International Conference on Supercomputing, ICS '12, pp. 215–224.

[21] IBM Redbooks, PAMI Programming Guide, 2011, http://publib.boulder.ibm.com/epubs/pdf/a2322730.pdf.


[22] R. Kalla, B. Sinharoy, W. Starke, M. Floyd, Power7: IBM's Next-Generation Server Processor, IEEE Micro 30 (2) (2010) 7–15.

[23] HPCC, HPC Challenge Benchmark Results, http://icl.cs.utk.edu/hpcc/ (March 2013).

[24] The Graph 500 List, http://www.graph500.org/results_june_2012 (June 2012).

[25] Das, Sarkar, Conflict-free data access of arrays and trees in parallel memory systems, in: Proceedings of the 1994 6th IEEE Symposium on Parallel and Distributed Processing, SPDP '94, IEEE Computer Society, Washington, DC, USA, 1994, pp. 377–384.

[26] D. L. Erickson, Conflict-free access to rectangular subarrays in parallel memory modules, Ph.D. thesis, Waterloo, Ont., Canada, doctoral dissertation, UMI Order No. GAXNN-81075 (1993).

[27] M. Garcia, E. Vallejo, R. Beivide, M. Odriozola, C. Camarero, M. Valero, G. Rodriguez, J. Labarta, C. Minkenberg, On-the-fly Adaptive Routing in High-Radix Hierarchical Networks, in: Parallel Processing (ICPP), 2012 41st International Conference on, IEEE, 2012, pp. 279–288.

[28] T. El-Ghazawi, F. Cantonnet, UPC performance and potential: a NPB experimental study, in: Proceedings of the 2002 ACM/IEEE Conference on Supercomputing, Supercomputing '02, pp. 1–26.

[29] S. Aarseth, Gravitational N-Body Simulations: Tools and Algorithms, Cambridge Monographs on Mathematical Physics, Cambridge University Press, Cambridge, U.K.; New York, U.S.A., 2003.

[30] T. H. Cormen, C. E. Leiserson, R. L. Rivest, C. Stein, Introduction to Algorithms, MIT Press, 2001, ISBN 0262032937.

[31] H. Jin, R. Hood, P. Mehrotra, A practical study of UPC using the NAS Parallel Benchmarks, in: Proceedings of the Third Conference on Partitioned Global Address Space Programming Models, PGAS '09, 2009, pp. 8:1–8:7.


[32] P. R. Panda, F. Catthoor, N. D. Dutt, K. Danckaert, E. Brockmeyer, C. Kulkarni, A. Vandercappelle, P. G. Kjeldsberg, Data and Memory Optimization Techniques for Embedded Systems, ACM Transactions on Design Automation of Electronic Systems 6 (2) (2001) 149–206.

[33] P. Marchal, J. I. Gómez, F. Catthoor, Optimizing the Memory Bandwidth with Loop Fusion, in: Proceedings of the 2nd IEEE/ACM/IFIP International Conference on Hardware/Software Codesign and System Synthesis, CODES+ISSS '04, 2004, pp. 188–193.

[34] J. R. Allen, K. Kennedy, Automatic Loop Interchange, in: Proceedings of the 1984 Symposium on Compiler Construction, pp. 233–246.

[35] M. E. Wolf, D. E. Maydan, D.-K. Chen, Combining loop transformations considering caches and scheduling, in: Proceedings of the 29th Annual ACM/IEEE International Symposium on Microarchitecture, MICRO 29, 1996, pp. 274–286.

[36] E. P. Markatos, T. J. LeBlanc, Using processor affinity in loop scheduling on shared-memory multiprocessors, Parallel and Distributed Systems, IEEE Transactions on 5 (4) (1994) 379–400.

[37] H. Li, S. Tandri, M. Stumm, K. C. Sevcik, Locality and loop scheduling on NUMA multiprocessors, in: Parallel Processing, 1993. ICPP 1993. International Conference on, Vol. 2, IEEE, 1993, pp. 140–147.

[38] M. Cierniak, W. Li, M. J. Zaki, Loop scheduling for heterogeneity, in: High Performance Distributed Computing, 1995. Proceedings of the Fourth IEEE International Symposium on, IEEE, 1995, pp. 78–85.

[39] A. T. Chronopoulos, R. Andonie, M. Benche, D. Grosu, A class of loop self-scheduling for heterogeneous clusters, in: Proceedings of the 2001 IEEE International Conference on Cluster Computing, Vol. 291, 2001.

[40] M. Chen, Y.-i. Choo, J. Li, Compiling parallel programs by optimizing performance, The Journal of Supercomputing 2 (2) (1988) 171–207.

[41] J. Li, M. Chen, Compiling Communication-Efficient Programs for Massively Parallel Machines, Parallel and Distributed Systems, IEEE Transactions on (1991) 361–376.


[42] R. Rajamony, M. Stephenson, E. Speight, The Power 775 Architecture at Scale, Tech. rep., RC25366 (2013).

[43] M. M. Baskaran, U. Bondhugula, S. Krishnamoorthy, J. Ramanujam, A. Rountev, P. Sadayappan, A Compiler Framework for Optimization of Affine Loop Nests for GPGPUs, in: Proceedings of the 22nd Annual International Conference on Supercomputing, ICS '08, pp. 225–234.

[44] Q. Zhang, Q. Li, Y. Dai, C.-C. Kuo, Reducing Memory Bank Conflict for Embedded Multimedia Systems, in: Multimedia and Expo (ICME '04), IEEE International Conference on, pp. 471–474.

[45] J. Kim, W. J. Dally, S. Scott, D. Abts, Technology-driven, highly-scalable dragonfly topology, in: Proceedings of the 35th Annual International Symposium on Computer Architecture, ISCA '08, pp. 77–88.

[46] A. Bhatele, N. Jain, W. D. Gropp, L. V. Kale, Avoiding hot-spots on two-level direct networks, in: High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, IEEE, 2011, pp. 1–11.

[47] A. Jokanovic, B. Prisacari, G. Rodriguez, C. Minkenberg, Randomizing task placement does not randomize traffic (enough), in: Proceedings of the 2013 Interconnection Network Architecture: On-Chip, Multi-Chip, IMA-OCMC '13, ACM, New York, NY, USA, 2013, pp. 9–12.

