An Optimizing Compiler for GPGPU Programs with Input-Data Sharing

Yi Yang, Dept. of ECE, North Carolina State University, [email protected]
Ping Xiang, School of EECS, University of Central Florida, [email protected]
Jingfei Kong, School of EECS, University of Central Florida, [email protected]
Huiyang Zhou, Dept. of ECE, North Carolina State University, [email protected]

Abstract
Developing high-performance GPGPU programs is challenging for application developers, since performance depends on how well the code leverages the hardware features of a specific graphics processor. To solve this problem and relieve application developers of low-level, hardware-specific optimizations, we introduce a novel compiler for GPGPU programs. Our compiler takes as input a naive GPU kernel function that is functionally correct but written without any performance optimization. The compiler then analyzes the code, identifies memory access patterns, and generates optimized code. The proposed compiler optimizations target one class of scientific and media-processing algorithms, characterized by input-data sharing when computing neighboring output pixels/elements. Many commonly used algorithms, such as matrix multiplication and convolution, share this characteristic. For these algorithms, novel approaches are proposed to enforce memory coalescing and achieve effective data reuse. Data prefetching and hardware-specific tuning are also performed automatically by our compiler framework. Experimental results on a set of applications show that our compiler achieves very high performance, either superior or very close to that of the highly fine-tuned library, NVIDIA CUBLAS 2.1.

The input of our compiler is a naïve GPU kernel function, which simply identifies the fine-grain work item that can be executed in parallel. A typical candidate is the computation of one data element in the output domain. The code segment shown in Figure 1 is such a naïve kernel for matrix multiplication, which calculates one element of the product matrix. This kernel is functionally correct but does not include any optimization for performance. Our compiler analyzes the naïve kernel, checks the off-chip memory access patterns, and converts the non-coalesced memory accesses into coalesced ones. Then the compiler finds possible data sharing across neighboring threads and thread blocks. Based on the data-sharing pattern, the compiler intelligently merges threads and/or thread blocks to improve memory reuse. Additionally, the compiler schedules the code to enable data prefetching so as to overlap the computation and memory access latencies.

    float sum = 0;
    for (int i = 0; i < w; i++)   // w: matrix width; loop body reconstructed
        sum += a[idy][i] * b[i][idx];   // from the optimized code in Figure 3
    c[idy][idx] = sum;

Figure 1. A naïve kernel for matrix multiplication
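To make the access-pattern analysis concrete, the following is a minimal CPU-side sketch (our illustration, not the paper's implementation) of the address model behind the coalescing check: each array subscript is treated as an affine expression in the thread index idx, and a global-memory access is considered coalesced when its contiguous (column) subscript has an idx coefficient of 1, so that consecutive threads read consecutive words. The struct and function names are hypothetical.

```c
/* Affine address model of one array access, a[row][col], with respect to
 * the fastest-varying thread index idx. */
struct access_model {
    int row_coeff_idx;   /* coefficient of idx in the row subscript    */
    int col_coeff_idx;   /* coefficient of idx in the column subscript */
};

/* An access is coalesced when the row subscript does not vary with idx
 * and the column subscript steps by exactly one word per thread. */
static int is_coalesced(struct access_model m)
{
    return m.row_coeff_idx == 0 && m.col_coeff_idx == 1;
}
```

Under this model, b[i][idx] in the naïve kernel has column coefficient 1 and is coalesced, while a[idy][i] does not vary with idx at all (coefficient 0), so the compiler would instead stage it through shared memory.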

2. Compiler Framework

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors – Compilers, Optimization


General Terms Performance, Experimentation, Languages


Keywords GPGPU; Compiler;

1. Introduction


The high computational power and affordability of state-of-the-art graphics processing units (GPUs) have made them the first widely used parallel computers with teraflops capability. At the same time, developing high-performance GPGPU code is challenging, as performance is highly dependent on how well the GPU hardware resources are utilized. In this paper, we propose an optimizing compiler to relieve application developers of low-level, hardware-specific performance optimizations.


Copyright is held by the author/owner(s). PPoPP’10 January 9–14, 2010, Bangalore, India. ACM 978-1-60558-708-0/10/01.

[Figure 2: a flowchart of the compilation passes — naive kernel functions → checking memory coalescing → converting non-coalesced accesses into coalesced ones → checking data sharing patterns → thread & thread block merge for memory reuse → data prefetching → optimized kernel functions & invocation parameters.]
Figure 2. The framework of the proposed optimizing compiler.

The proposed compiler framework is shown in Figure 2. It consists of the following steps.


a) Checking Memory Coalescing: In order to determine whether off-chip memory accesses can be coalesced, we identify the address model of each memory access in the kernel function.
b) Converting Non-Coalesced Accesses into Coalesced Ones: After analyzing every array access in the kernel code, the compiler converts the non-coalesced accesses into coalesced ones through the shared memory.
c) Checking Data Sharing Patterns: In this step, the compiler detects possible data sharing among threads.
d) Thread/Thread Block Merge to Enhance Memory Reuse: After detecting data sharing among neighboring thread blocks, we merge thread blocks or threads to improve data sharing.
e) Data Prefetching: The compiler analyzes the memory accesses in a loop and uses a temporary variable to prefetch data for the next iteration before the computation in the current iteration.
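The payoff of step (d) can be illustrated with a hypothetical CPU-side sketch (not compiler output): when one merged thread computes M neighboring outputs, each value fetched from b is reused by all M running sums, mirroring the sum_0..sum_15 registers of the optimized kernel in Figure 3. The array layout and function name below are assumptions for the illustration; a load counter makes the reuse checkable.

```c
#define M 16   /* number of merged threads, as in Figure 3 */

/* Computes an M x 1 strip of C = A*B the way a merged thread would:
 * a is M x w row-major, b_col is the column b[.][idx] of length w.
 * Each element of b_col is loaded once and reused by all M sums, so
 * *b_load_count ends at w instead of M*w. */
static void merged_strip(int w, const float *a, const float *b_col,
                         float out[M], long *b_load_count)
{
    float sum[M] = {0};
    for (int k = 0; k < w; k++) {
        float r0 = b_col[k];          /* one load of b ...          */
        (*b_load_count)++;
        for (int m = 0; m < M; m++)
            sum[m] += a[m * w + k] * r0;   /* ... reused by M sums  */
    }
    for (int m = 0; m < M; m++)
        out[m] = sum[m];
}
```

Without merging, M independent threads would each load b_col[k] themselves, for M*w loads; the merge cuts the b traffic by a factor of M.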

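The prefetching rewrite of step (e) follows a standard software-pipelining shape, sketched below on the CPU (a hypothetical illustration, not compiler output): the load for iteration i+1 is issued into a temporary before the computation of iteration i, so that on a GPU the memory latency would overlap with the arithmetic. The transformation must preserve the result, which is what the assertions check.

```c
/* Sums n floats with the next iteration's load hoisted above the
 * current iteration's computation (prologue load + rotated temporary). */
static float prefetch_sum(const float *data, int n)
{
    if (n <= 0)
        return 0.0f;
    float fetched = data[0];          /* prologue: first load          */
    float sum = 0.0f;
    for (int i = 0; i < n; i++) {
        float cur = fetched;
        if (i + 1 < n)
            fetched = data[i + 1];    /* prefetch next iteration's value */
        sum += cur;                   /* compute with the current value  */
    }
    return sum;
}
```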
3. Experimental Results
We implemented the proposed compiler framework in Cetus, a source-to-source compiler infrastructure for C programs [2]. The CUDA language support in Cetus is ported from MCUDA [1]. In our experiments, we used NVIDIA GTX280 GPUs with CUDA SDK 2.1 and a 32-bit CentOS 5.2 operating system. Among the algorithms that we studied, a few, including transpose matrix vector multiplication (tmv), matrix vector multiplication (mv), and matrix multiplication (matmul), are implemented in the NVIDIA CUBLAS library. We compared our optimized kernels with the highly tuned CUBLAS (both v1.1 and v2.1) on the GTX 280. Figure 4 shows the performance comparison of the three algorithms for different input sizes. From Figure 4, we can see that the kernels optimized by our compiler (labeled 'optimized') achieve better performance than CUBLAS 2.1 for transpose matrix vector multiplication and matrix vector multiplication across input sizes (an average of 47% and 15% improvement, respectively). For matrix multiplication, our performance is very close to CUBLAS 2.1 (within 1% difference). For all three algorithms, our optimized code achieves much higher performance than CUBLAS 1.1 for all input sizes (on average, 48%, 26%, and 50% for tmv, mv, and matmul, respectively).

For the naïve kernel function shown in Figure 1, the optimized code generated by our compiler is shown in Figure 3.

    int i = 0; int k;
    float sum_0 = 0;
    …
    float sum_15 = 0;
    for (i = 0; i < w; i = (i + 16)) {   // loop bound w reconstructed; truncated in the original
        if (tidx < 16) { // 16 is the number of the threads to be merged
            shared0_0[(0+tidx)] = a[idy*16+0][((i+tidx)+0)];
            …
            shared0_15[(0+tidx)] = a[idy*16+15][((i+tidx)+0)];
        }
        __syncthreads();
        for (k = 0; k < 16; k = (k+1)) {
            float r0;
            r0 = b[(i+k)][idx];
            sum_0 += shared0_0[(0+k)] * r0;
            …
            sum_15 += shared0_15[(0+k)] * r0;
        }
        __syncthreads();
    }
    c[idy*16+0][idx] = sum_0;
    …
    c[idy*16+15][idx] = sum_15;

Figure 3. The optimized kernel function for matrix multiplication

4. Conclusion
In this paper, we present an optimizing compiler for GPGPU programs that relieves application developers of low-level, GPU-hardware-specific optimizations. We propose novel compiler techniques to generate memory-coalesced code and to merge threads/thread blocks for memory data sharing and reuse. Our experimental results show that the optimized code achieves very high performance, superior or close to that of manually optimized programs.

Acknowledgements
This research is supported by an NSF CAREER award, CCF-0747062.

[Figure 4: three plots of performance (Gflops) versus input matrix size on the GTX280, comparing 'optimized' against cublas1.1, cublas2.1, and (for matrix multiplication) the CUDA SDK version, for input sizes up to 4k x 4k.]
Figure 4. Performance comparison between CUBLAS and our optimized kernels on the GTX280 for (a) transpose matrix vector multiplication (tmv), (b) matrix vector multiplication (mv), and (c) matrix multiplication (matmul).

References
[1] J. Stratton, et al. MCUDA: An efficient implementation of CUDA kernels on multicores. IMPACT Technical Report, UIUC, 2008.
[2] S.-I. Lee, et al. Cetus – an extensible compiler infrastructure for source-to-source transformation. LCPC, 2003.

