An Optimizing Compiler for GPGPU Programs with Input-Data Sharing

Yi Yang (Dept. of ECE, North Carolina State University)
Ping Xiang (School of EECS, University of Central Florida)
Jingfei Kong (School of EECS, University of Central Florida)
Huiyang Zhou (Dept. of ECE, North Carolina State University)


Abstract Developing high performance GPGPU programs is challenging for application developers since the performance is dependent upon how well the code leverages the hardware features of specific graphics processors. To solve this problem and relieve application developers of low-level hardware-specific optimizations, we introduce a novel compiler to optimize GPGPU programs. Our compiler takes as input a naïve GPU kernel function, which is functionally correct but written without any consideration for performance. The compiler then analyzes the code, identifies memory access patterns, and generates optimized code. The proposed compiler optimizations target one category of scientific and media processing algorithms, which share the characteristic of input-data sharing when computing neighboring output pixels/elements. Many commonly used algorithms, such as matrix multiplication and convolution, have this characteristic. For these algorithms, novel approaches are proposed to enforce memory coalescing and achieve effective data reuse. Data prefetching and hardware-specific tuning are also performed automatically within our compiler framework. The experimental results based on a set of applications show that our compiler achieves very high performance, either superior or very close to that of the highly fine-tuned library, NVIDIA CUBLAS 2.1.

Categories and Subject Descriptors D.3.4 [Programming Languages]: Processors – Compilers, Optimization

General Terms Performance, Experimentation, Languages

Keywords GPGPU; Compiler

Copyright is held by the author/owner(s). PPoPP'10, January 9–14, 2010, Bangalore, India. ACM 978-1-60558-708-0/10/01.

1. Introduction

The high computational power and affordability of state-of-the-art graphics processor units (GPUs) have made them the first widely used parallel computers with teraflops capability. In the meantime, developing high performance GPGPU code is challenging, as the performance is highly dependent on how well the GPU hardware resources are utilized. In this paper, we propose an optimizing compiler to relieve application developers of the low-level hardware-specific performance optimizations.

The input of our compiler is a naïve GPU kernel function, which simply identifies the fine-grain work item that can be executed in parallel. A typical candidate is the computation of one data element in the output domain. The code segment shown in Figure 1 is such a naïve kernel for matrix multiplication, which calculates one element in the product matrix. This kernel is functionally correct but does not include any optimization for performance. Our compiler analyzes the naïve kernel, checks the off-chip memory access patterns, and converts the non-coalesced memory accesses into coalesced ones. Then the compiler finds possible data sharing across neighboring threads and thread blocks. Based on the data sharing pattern, the compiler intelligently merges threads and/or thread blocks to improve memory reuse. Additionally, the compiler schedules the code to enable data prefetching so as to overlap the computation and memory access latencies.

float sum = 0;
for (int i = 0; i < w; i++)   // w: the width of the input matrices
    sum += a[idy][i] * b[i][idx];
c[idy][idx] = sum;

Figure 1. A naïve kernel for matrix multiplication
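The kernel body in Figure 1 assumes that idx and idy identify the output column and row computed by the thread. A plausible complete form of such a naïve kernel and its invocation is sketched below; the names (matmul_naive, launch_matmul_naive) and the flattened row-major layout (instead of the 2-D array notation of Figure 1) are illustrative assumptions, not code taken from the paper.

__global__ void matmul_naive(const float* a, const float* b, float* c, int w)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // column of the output element
    int idy = blockIdx.y * blockDim.y + threadIdx.y;  // row of the output element
    float sum = 0;
    for (int i = 0; i < w; i++)
        sum += a[idy * w + i] * b[i * w + idx];       // one output element per thread
    c[idy * w + idx] = sum;
}

void launch_matmul_naive(const float* d_a, const float* d_b, float* d_c, int w)
{
    dim3 block(16, 16);                // one thread per output element
    dim3 grid(w / 16, w / 16);         // assumes w is a multiple of 16
    matmul_naive<<<grid, block>>>(d_a, d_b, d_c, w);
}

The thread-block dimensions here are only an example; as described in Section 2, the compiler itself derives the final invocation parameters together with the optimized kernel.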

2. Compiler Framework

Figure 2. The framework of the proposed optimizing compiler. [Flowchart stages: Naive kernel functions → Checking memory coalescing → Converting non-coalesced accesses into coalesced ones → Checking data sharing patterns → Thread & thread block merge for memory reuse → Data prefetching → Optimized kernel functions & invocation parameters.]

The proposed compiler framework is shown in Figure 2. It consists of the following steps.


a) Checking Memory Coalescing: In order to determine whether off-chip memory accesses can be coalesced, we identify the address model of each memory access in the kernel function.

b) Converting Non-Coalesced Accesses into Coalesced Ones: After the compiler analyzes every array access in the kernel code, it converts the non-coalesced accesses into coalesced ones through the shared memory (a simplified sketch of this transformation follows the list).

c) Checking Data Sharing Patterns: In this step, the compiler detects possible data sharing among threads.

d) Thread/Thread Block Merge to Enhance Memory Reuse: After detecting that data sharing exists among neighboring thread blocks, we merge thread blocks or threads to improve data sharing.

e) Data Prefetching: The compiler analyzes the memory accesses in a loop and uses a temporary variable to prefetch data for the next iteration before the computation in the current iteration.
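To make step b) concrete, the following is a minimal hand-written sketch in the spirit of the transformation, not the compiler's actual output, for matrix-vector multiplication y = A*x with a row-major N x N matrix (N a multiple of 16). In the naïve kernel, the threads of a half-warp read addresses that are N floats apart, which cannot be coalesced on GT200-class GPUs; the transformed kernel stages a 16x16 tile of A through the shared memory so that every off-chip load is made by consecutive threads to consecutive addresses. The kernel names and the tile size macro are assumptions for illustration.

#define TILE 16

// Naive kernel: thread 'row' walks over its own row of A, so within a
// half-warp the loads are N floats apart and cannot be coalesced.
__global__ void mv_naive(const float* A, const float* x, float* y, int N)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    float sum = 0;
    for (int j = 0; j < N; j++)
        sum += A[row * N + j] * x[j];
    y[row] = sum;
}

// Transformed kernel (block size = TILE): the 16 threads of a block
// cooperatively load a 16x16 tile of A; in each cooperative load the
// thread index varies fastest in the address, so the access is coalesced.
// The per-row reads then come from shared memory instead of DRAM.
__global__ void mv_coalesced(const float* A, const float* x, float* y, int N)
{
    __shared__ float tile[TILE][TILE + 1];   // +1 padding avoids bank conflicts
    int tid = threadIdx.x;
    int row = blockIdx.x * TILE + tid;
    float sum = 0;
    for (int j = 0; j < N; j += TILE) {
        for (int r = 0; r < TILE; r++)       // row r of the current tile
            tile[r][tid] = A[(blockIdx.x * TILE + r) * N + j + tid];  // coalesced load
        __syncthreads();
        for (int k = 0; k < TILE; k++)
            sum += tile[tid][k] * x[j + k];  // consume the tile from shared memory
        __syncthreads();
    }
    y[row] = sum;
}

The same staging pattern appears, combined with thread merging, in the generated matrix-multiplication kernel of Figure 3.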

3. Experimental Results We implemented the proposed compiler framework in Cetus, a source-to-source compiler infrastructure for C programs [2]. The CUDA language support in Cetus is ported from MCUDA [1]. In our experiments, we used NVIDIA GTX280 GPUs with CUDA SDK 2.1 and a 32-bit CentOS 5.2 operating system. Among the algorithms that we studied, a few, including transpose matrix vector multiplication (tmv), matrix vector multiplication (mv), and matrix multiplication (matmul), are implemented in the CUDA CUBLAS library. We compared our optimized kernels with the highly tuned CUBLAS (both v1.1 and v2.1) on the GTX 280. Figure 4 shows the performance comparison of the three algorithms for different input sizes. From Figure 4, we can see that the kernels optimized by our compiler (labeled 'optimized') achieve better performance than CUBLAS 2.1 for transpose matrix vector multiplication and matrix vector multiplication across different input sizes (an average of 47% and 15% improvement, respectively). For matrix multiplication, our performance is very close to that of CUBLAS 2.1 (within 1% difference). For all three algorithms, our optimized code achieves much higher performance than CUBLAS 1.1 for all the input sizes (on average, 48%, 26%, and 50% for tmv, mv, and matmul, respectively).

For the naïve kernel function shown in Figure 1, the optimized code generated by our compiler is shown in Figure 3.

int i = 0;
int k;
float sum_0 = 0;
…
float sum_15 = 0;
for (i = 0; i < w; i = i + 16) {      // w: the width of the input matrices
    if (tidx < 16) {                  // 16 is the number of the threads to be merged
        shared0_0[(0 + tidx)] = a[idy * 16 + 0][((i + tidx) + 0)];
        …
        shared0_15[(0 + tidx)] = a[idy * 16 + 15][((i + tidx) + 0)];
    }
    __syncthreads();
    for (k = 0; k < 16; k = (k + 1)) {
        float r0;
        r0 = b[(i + k)][idx];
        sum_0 += shared0_0[(0 + k)] * r0;
        …
        sum_15 += shared0_15[(0 + k)] * r0;
    }
    __syncthreads();
}
c[idy * 16 + 0][idx] = sum_0;
…
c[idy * 16 + 15][idx] = sum_15;

Figure 3. The optimized kernel function for matrix multiplication
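The excerpt in Figure 3 shows the coalesced loads into shared memory and the thread merging (the sixteen sum_* accumulators), but not the prefetching of step e). A minimal, hypothetical sketch of that transformation on a simple strided loop is given below; the kernel and variable names (scale_prefetch, in, out, alpha) are placeholders for illustration and the code is not the compiler's actual output.

__global__ void scale_prefetch(const float* in, float* out, float alpha, int n)
{
    int stride = blockDim.x * gridDim.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    float cur = in[i];                       // load for the first iteration
    for (; i < n; i += stride) {
        float next = 0;
        if (i + stride < n)
            next = in[i + stride];           // issue the load for the next iteration early
        out[i] = alpha * cur;                // compute with the value loaded earlier
        cur = next;                          // the prefetched value becomes the current one
    }
}

As described in step e), the compiler uses a temporary variable in the same way to fetch the data needed by the next loop iteration before the computation of the current one, overlapping the off-chip memory latency with computation.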

4. Conclusion In this paper, we present an optimizing compiler for GPGPU programs so as to relieve application developers of low-level GPU hardware-specific optimizations. We propose novel compiler techniques to generate memory-coalesced code and to merge threads/thread blocks for memory data sharing/reuse. Our experimental results show that the optimized code achieves very high performance, superior or close to that of manually optimized programs.

Acknowledgements

This research is supported by an NSF CAREER award, CCF-0747062.

References

[1] J. Stratton, et al. MCUDA: An efficient implementation of CUDA kernels on multicores. IMPACT Technical Report, UIUC, 2008.
[2] S.-I. Lee, et al. Cetus – an extensible compiler infrastructure for source-to-source transformation. LCPC, 2003.

Figure 4. Performance comparison between CUBLAS and our optimized kernel on GTX280 for (a) transpose matrix vector multiplication (tmv), (b) matrix vector multiplication (mv), and (c) matrix multiplication (matmul). [Each panel plots performance (Gflops) against input matrix size, comparing the 'optimized' kernel with 'cublas2.1' and 'cublas1.1' (and 'cuda sdk' for matmul); the matmul panel covers input sizes from 1kx1k to 4kx4k.]


May 1, 2008 - Kapoor, V., L. Cheung, C. Howley, Equity securitization; Risk & Value, Special Report, Structured. Finance, Standard & Poor's, (2003). Laurent, J.-P., H. Pham, Dynamic Programming and mean-variance hedging, Finance and Stochastics, 3, 8