An Optimizing Compiler for GPGPU Programs with Input-Data Sharing

Yi Yang (Dept. of ECE, North Carolina State University)
Ping Xiang (School of EECS, University of Central Florida)
Jingfei Kong (School of EECS, University of Central Florida)
Huiyang Zhou (Dept. of ECE, North Carolina State University)
[email protected], [email protected], [email protected], [email protected]
Abstract

Developing high-performance GPGPU programs is challenging for application developers because the performance depends on how well the code leverages the hardware features of the specific graphics processor. To address this problem and relieve application developers of low-level, hardware-specific optimizations, we introduce a novel compiler for optimizing GPGPU programs. Our compiler takes as input a naive GPU kernel function, which is functionally correct but written without any consideration for performance. The compiler then analyzes the code, identifies memory access patterns, and generates optimized code. The proposed compiler optimizations target one category of scientific and media-processing algorithms, characterized by input-data sharing when computing neighboring output pixels/elements. Many commonly used algorithms, such as matrix multiplication and convolution, share this characteristic. For these algorithms, we propose novel approaches to enforce memory coalescing and achieve effective data reuse. Data prefetching and hardware-specific tuning are also performed automatically by our compiler framework. Experimental results on a set of applications show that our compiler achieves very high performance, either superior or very close to that of the highly fine-tuned library NVIDIA CUBLAS 2.1.
The input to our compiler is a naive GPU kernel function, which simply identifies the fine-grain work item that can be executed in parallel. A typical candidate is the computation of one data element in the output domain. The code segment shown in Figure 1 is such a naive kernel for matrix multiplication; it calculates one element of the product matrix c = a x b, where w is the matrix width and idx/idy are the thread's global column/row indices. This kernel is functionally correct but does not include any optimization for performance. Our compiler analyzes the naive kernel, checks the off-chip memory access patterns, and converts the non-coalesced memory accesses into coalesced ones. Then the compiler finds possible data sharing across neighboring threads and thread blocks. Based on the data sharing pattern, the compiler intelligently merges threads and/or thread blocks to improve memory reuse. Additionally, the compiler schedules the code to enable data prefetching so as to overlap computation and memory access latencies.

float sum = 0;
for (int i = 0; i < w; i++)
    sum += a[idy][i] * b[i][idx];
c[idy][idx] = sum;

Figure 1. A naive kernel for matrix multiplication.
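To make the work decomposition concrete, the naive kernel's per-thread work item can be emulated on the CPU. The sketch below is our illustration, not the paper's code; it assumes row-major w x w matrices stored in flat arrays and reuses the idx/idy naming from Figure 1:

```c
/* CPU emulation of the naive kernel's work item: each (idy, idx)
   pair computes one element of c = a * b, mirroring one GPU thread.
   Matrices are row-major, w x w (our assumption for illustration). */
static void naive_work_item(const float *a, const float *b, float *c,
                            int w, int idy, int idx)
{
    float sum = 0.0f;
    for (int i = 0; i < w; i++)
        sum += a[idy * w + i] * b[i * w + idx];
    c[idy * w + idx] = sum;
}

/* Launch emulation: iterate over the whole "grid" of threads. */
static void naive_grid(const float *a, const float *b, float *c, int w)
{
    for (int idy = 0; idy < w; idy++)
        for (int idx = 0; idx < w; idx++)
            naive_work_item(a, b, c, w, idy, idx);
}
```

On the GPU, every iteration of the two outer loops above becomes an independent thread, which is exactly the fine-grain work item the compiler starts from.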
Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors – Compilers, Optimization

General Terms Performance, Experimentation, Languages

Keywords GPGPU; Compiler

Copyright is held by the author/owner(s). PPoPP'10, January 9–14, 2010, Bangalore, India. ACM 978-1-60558-708-0/10/01.

1. Introduction

The high computational power and affordability of state-of-the-art graphics processing units (GPUs) have made them the first widely used parallel computers with teraflops capability. At the same time, developing high-performance GPGPU code is challenging, as the performance is highly dependent on how well the GPU hardware resources are utilized. In this paper, we propose an optimizing compiler to relieve application developers of low-level, hardware-specific performance optimizations.

2. Compiler Framework

[Figure 2 shows the compiler pipeline: naive kernel functions → checking memory coalescing → converting non-coalesced accesses into coalesced ones → checking data sharing patterns → thread & thread-block merge for memory reuse → data prefetching → optimized kernel functions & invocation parameters.]

Figure 2. The framework of the proposed optimizing compiler.

The proposed compiler framework is shown in Figure 2. It consists of the following steps.
a) Checking memory coalescing: To determine whether off-chip memory accesses can be coalesced, we identify the address model of each memory access in the kernel function.

b) Converting non-coalesced accesses into coalesced ones: After analyzing every array access in the kernel code, the compiler converts the non-coalesced accesses into coalesced ones through the shared memory.

c) Checking data sharing patterns: In this step, the compiler detects possible data sharing among threads.

d) Thread/thread-block merge to enhance memory reuse: After detecting data sharing among neighboring thread blocks, the compiler merges thread blocks or threads to improve data sharing.

e) Data prefetching: The compiler analyzes the memory accesses in a loop and uses a temporary variable to prefetch data for the next iteration before the computation in the current iteration.
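Step (a) can be illustrated with a small model. The sketch below is our own simplification, not the paper's analysis: it assumes each access address is affine in the thread index tidx, addr(tidx) = base + stride * tidx (in words), and flags an access as coalescable when neighboring threads touch consecutive words (stride == 1), contrasting the row access a[idy][i+tidx] with the column access a[i+tidx][idy]:

```c
#include <stdbool.h>

/* Our illustrative address model (not the compiler's internal form):
   the word address accessed by thread tidx is base + stride * tidx. */
typedef struct {
    long base;    /* address part independent of tidx */
    long stride;  /* word distance between neighboring threads */
} addr_model;

/* Neighboring threads hit consecutive words iff stride == 1. */
static bool is_coalesced(addr_model m)
{
    return m.stride == 1;
}

/* a[idy][i + tidx] over a row-major w x w array: stride 1, coalesced. */
static addr_model model_row_access(int w, int idy, int i)
{
    addr_model m = { (long)idy * w + i, 1 };
    return m;
}

/* a[i + tidx][idy]: neighboring threads are a full row apart. */
static addr_model model_col_access(int w, int idy, int i)
{
    addr_model m = { (long)i * w + idy, w };
    return m;
}
```

Non-coalesced accesses like the column case are the ones step (b) reroutes through shared memory, so that the global-memory loads themselves become stride-1.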
3. Experimental Results

We implemented the proposed compiler framework in Cetus, a source-to-source compiler infrastructure for C programs [2]. The CUDA language support in Cetus is ported from MCUDA [1]. In our experiments, we used an NVIDIA GTX280 GPU with CUDA SDK 2.1 and a 32-bit CentOS 5.2 operating system. Among the algorithms that we studied, a few, including transpose matrix vector multiplication (tmv), matrix vector multiplication (mv), and matrix multiplication (matmul), are implemented in the NVIDIA CUBLAS library. We compared our optimized kernels with the highly tuned CUBLAS (both v1.1 and v2.1) on the GTX 280. Figure 4 shows the performance comparison of the three algorithms for different input sizes. From Figure 4, we can see that the kernels optimized by our compiler (labeled 'optimized') achieve better performance than CUBLAS 2.1 for transpose matrix vector multiplication and matrix vector multiplication across input sizes (an average of 47% and 15% improvement, respectively). For matrix multiplication, our performance is very close to CUBLAS 2.1 (within 1% difference). For all three algorithms, our optimized code achieves much higher performance than CUBLAS 1.1 for all input sizes (on average, 48%, 26%, and 50% for tmv, mv, and matmul, respectively).
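The combined effect of steps (b) and (d), staging a tile of a in shared memory and merging 16 threads so each loaded value is reused, can be checked against the naive kernel on the CPU. The following is our own emulation sketch (TILE, merged_column, and block_y are our names, not the compiler's output); it assumes 16 merged rows per block, row-major flat arrays, and w a multiple of 16:

```c
#define TILE 16

/* CPU emulation of the thread-merged structure: the 16 per-thread
   registers sum_0..sum_15 become sum[16], and the 16 shared arrays
   become a tile that is refilled once per outer iteration, so each
   element of a is fetched from "global memory" once and reused. */
static void merged_column(const float *a, const float *b, float *c,
                          int w, int block_y, int idx)
{
    float sum[TILE] = {0};
    float tile[TILE][TILE];               /* stands in for shared memory */
    for (int i = 0; i < w; i += TILE) {
        for (int r = 0; r < TILE; r++)    /* cooperative tile load */
            for (int t = 0; t < TILE; t++)
                tile[r][t] = a[(block_y * TILE + r) * w + (i + t)];
        for (int k = 0; k < TILE; k++) {
            float r0 = b[(i + k) * w + idx];   /* one load, 16 uses */
            for (int r = 0; r < TILE; r++)
                sum[r] += tile[r][k] * r0;
        }
    }
    for (int r = 0; r < TILE; r++)
        c[(block_y * TILE + r) * w + idx] = sum[r];
}
```

For each k, one value of b is loaded into r0 and reused by all 16 merged accumulators, mirroring the register reuse in the compiler-generated kernel.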
For the naive kernel function shown in Figure 1, the optimized code generated by our compiler is shown in Figure 3.

int i = 0; int k;
float sum_0 = 0;
…
float sum_15 = 0;
for (i = 0; i < w; i = i + 16) {
    __shared__ float shared0_0[16];
    …
    __shared__ float shared0_15[16];
    if (tidx < 16) { // 16 is the number of the threads to be merged
        shared0_0[(0+tidx)] = a[idy*16+0][((i+tidx)+0)];
        …
        shared0_15[(0+tidx)] = a[idy*16+15][((i+tidx)+0)];
    }
    __syncthreads();
    for (k = 0; k < 16; k = (k+1)) {
        float r0;
        r0 = b[(i+k)][idx];
        sum_0 += shared0_0[(0+k)] * r0;
        …
        sum_15 += shared0_15[(0+k)] * r0;
    }
    __syncthreads();
}
c[idy*16+0][idx] = sum_0;
…
c[idy*16+15][idx] = sum_15;

Figure 3. The optimized kernel function for matrix multiplication.

4. Conclusion

In this paper, we present an optimizing compiler for GPGPU programs that relieves application developers of low-level, GPU hardware-specific optimizations. We propose novel compiler techniques to generate memory-coalesced code and to merge threads/thread blocks for memory data sharing and reuse. Our experimental results show that the optimized code achieves very high performance, superior or close to manually optimized programs.

Acknowledgements

This research is supported by an NSF CAREER award, CCF-0747062.

References
[1] J. Stratton, et al. MCUDA: An efficient implementation of CUDA kernels on multicores. IMPACT Technical Report, UIUC, 2008.
[2] S.-I. Lee, et al. Cetus – an extensible compiler infrastructure for source-to-source transformation. LCPC, 2003.
Figure 4. Performance comparison between CUBLAS and our optimized kernel on GTX280 for (a) transpose matrix vector multiplication (tmv), (b) matrix vector multiplication (mv), and (c) matrix multiplication (matmul). Each panel plots performance (Gflops) against the input matrix size; the matmul panel (1k×1k to 4k×4k) also includes the CUDA SDK kernel.