Stack-Based Parallel Recursion on Graphics Processors

Ke Yang (Zhejiang Univ.), [email protected]
Bingsheng He (HKUST), [email protected]
Qiong Luo (HKUST), [email protected]
Pedro V. Sander (HKUST), [email protected]
Jiaoying Shi (Zhejiang Univ.), [email protected]

Abstract

Recent research has shown promising results on using graphics processing units (GPUs) to accelerate general-purpose computation. However, today's GPUs do not support recursive functions. As a result, for inherently recursive algorithms such as tree traversal, GPU programmers need to use explicit stacks to emulate the recursion. Parallelizing such stack-based implementations on the GPU increases the programming difficulty; moreover, it is unclear how to make such parallel implementations efficient. As a first step toward addressing both the ease-of-programming and the efficiency issues, we propose three parallel stack implementation alternatives that differ in the granularity of stack sharing. Taking tree traversal as an example, we study the performance tradeoffs among these alternatives and analyze their behavior in various situations. Our results could be useful to both GPU programmers and GPU compiler writers.

Categories and Subject Descriptors D.1.3 [Programming Techniques]: Concurrent Programming -- Parallel programming

General Terms Algorithms, Languages

Keywords Stack, Parallel Recursion, Graphics Processors

1. Introduction

Recursion is a fundamental programming construct. A recursive function consists of a base case, which can be solved directly, and a recursive case, which calls the function itself and reduces the problem toward the base case. At each recursion level, if the current call is a base case, it is solved directly and its result is returned to the caller; otherwise, the call partitions the problem into sub-cases and descends one recursion level for each sub-case. Such execution forms a recursion tree, and it can be converted into an iterative process using an auxiliary data structure, in particular a stack. In this paper, we study parallel implementation alternatives for stack-based recursion on GPUs.

The GPU can be viewed as massively threaded parallel hardware. Figure 1 shows a typical organization of GPU threads: multiple SIMD (Single-Instruction-Multiple-Data) GPU threads are grouped into a warp, and a batch of warps forms a block. Thread blocks are the synchronization unit for GPU execution, and threads within a block share a small piece of on-chip local memory. Due to the massive threading parallelism and the SIMD nature of the GPU, GPU programs must exploit SIMD coherence, minimize thread communication, and utilize on-chip local memory for efficiency. It is therefore challenging to use GPUs for parallel recursion [4], because (1) recursion is not directly data parallel, since there is communication between each pair of recursion caller and callee, (2) the recursion tree may be irregular, and (3) the data sizes of base cases may vary. Considering these challenges, we propose three GPU-based stack implementation alternatives, namely the per-thread stack, the per-warp stack, and the per-block stack, and study their performance. We have implemented these alternatives in a GPU-based tree traversal application. Our preliminary results show that stack-based recursion can be done efficiently on the GPU and that the relative performance of each alternative depends on the fanout of the recursion tree.

Figure 1. GPU thread organization: threads are grouped into warps, and warps are grouped into blocks.

Copyright is held by the author/owner(s). PPoPP'09, February 14-18, 2009, Raleigh, North Carolina, USA. ACM 978-1-60558-397-6/09/02.

2. GPU-Based Parallel Stacks

We design three kinds of parallel stacks that differ in the granularity of stack sharing. We avoid write conflicts in sharing through software mechanisms [5]. Conflict resolution can also be done in hardware; either option can be expensive, and the cost generally increases with the number of conflicting threads.

2.1 t_stack

A per-thread stack (t_stack) is a local array owned by each thread; threads do not share stacks at all. Each thread independently handles a recursive task using stack operations similar to those on the CPU, and all these tasks can be executed in parallel [6]. However, if there are branches in a recursive case, the execution of individual threads diverges and the SIMD hardware is underutilized. Moreover, since concurrent writes occur among all threads, inter-thread communication is intensive. As a result, the t_stack is suitable for recursions with fine-grained parallelism.
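The recursion-to-iteration conversion that each thread performs with its t_stack can be sketched on the CPU as follows. This is a minimal illustration, not the paper's implementation: the node layout and the fixed stack depth are hypothetical stand-ins for the R-tree structures used in the experiments.

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical node layout: a leaf carries base-case values, an
// internal node carries child indices into the tree array.
struct Node {
    bool leaf = false;
    std::vector<int> values;    // base-case payload (leaf)
    std::vector<int> children;  // sub-cases (internal node)
};

// Assumed-sufficient fixed depth, as the paper prescribes for all
// three stack kinds (overflow handling is discussed in Section 2.4).
constexpr std::size_t kStackDepth = 64;

// What one GPU thread would run: recursion emulated by a local array
// used as a stack, with pushes replacing recursive calls.
std::vector<int> traverse(const std::vector<Node>& tree, int root) {
    std::array<int, kStackDepth> stack;  // the per-thread stack
    std::size_t top = 0;
    std::vector<int> out;

    stack[top++] = root;                     // push the initial task
    while (top > 0) {
        const Node& n = tree[stack[--top]];  // pop the next task
        if (n.leaf) {
            // Base case: solve directly.
            out.insert(out.end(), n.values.begin(), n.values.end());
        } else {
            // Recursive case: push each sub-case instead of calling.
            for (int c : n.children) stack[top++] = c;
        }
    }
    return out;
}
```

On the GPU, each thread would run this loop on its own task; because different threads pop different nodes, the branch divergence within a warp is exactly the cost the t_stack incurs.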

2.2 b_stack

A per-block stack (b_stack) is a local array owned by a thread block. Each block of threads handles a recursive case, and multiple recursive cases are executed simultaneously among blocks. In each recursive case, we first partition the input and generate the related bookkeeping information in parallel. We then perform a block-level synchronization, after which a single thread of each block uses the bookkeeping information to locate all the sub-cases and pushes them onto the stack. Additionally, all base cases are parallelized among blocks.

Thread communication in b_stacks occurs at two levels, namely intra-block and inter-block. Intra-block communication is efficient thanks to the intrinsic barrier mechanism, and block-level bookkeeping information can be cached on chip. Furthermore, because the number of blocks is limited, we can pre-allocate a write buffer for each block, so the overhead of inter-block communication is much smaller than that of inter-thread communication in t_stacks.
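The two-phase b_stack pattern (parallel partitioning, a block-wide barrier, then a single thread pushing sub-cases) can be sketched sequentially as follows. The task type and split factor are hypothetical; comments mark where the GPU barrier and the thread-0 role would fall in a real CUDA kernel.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical recursive case: an input range to be split.
struct Task { std::size_t begin, end; };

constexpr std::size_t kSplit = 2;  // assumed sub-cases per case

// One recursive case handled by one "block": threads partition the
// input in parallel and record bookkeeping; after a barrier, a single
// thread pushes the located sub-cases onto the shared b_stack.
void block_step(std::vector<Task>& b_stack) {
    Task t = b_stack.back();
    b_stack.pop_back();

    // Phase 1 (all threads of the block in parallel on the GPU):
    // compute the bookkeeping info, here just sub-range boundaries.
    std::vector<std::size_t> bounds(kSplit + 1);
    for (std::size_t i = 0; i <= kSplit; ++i)   // simulated thread loop
        bounds[i] = t.begin + (t.end - t.begin) * i / kSplit;

    // ---- block-level barrier: __syncthreads() on the GPU ----

    // Phase 2 (thread 0 only): serialize the pushes so the shared
    // stack sees no write conflicts.
    for (std::size_t i = 0; i < kSplit; ++i)
        b_stack.push_back({bounds[i], bounds[i + 1]});
}
```

The w_stack follows the same push-serialization idea, but the single pushing thread only ever waits for its own warp, which is why its serialization cost is bounded by the warp width.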

2.3 w_stack

A per-warp stack (w_stack) is a local array owned by a SIMD warp; each warp handles a task in parallel. The granularity of sharing a w_stack lies between those of the t_stack and the b_stack. This stack minimizes thread divergence within a warp. Compared with b_stacks, warp-scope stacks and their bookkeeping structures are smaller and more likely to fit in on-chip stores. More importantly, the overhead of serializing stack updates through a single thread is bounded by the warp width. Therefore, the communication overhead is generally lower than that of the b_stack.

2.4 Discussion

Stack depth. All three kinds of stacks can be efficiently implemented in CUDA [1] by allocating a sufficiently large, fixed-size array. In the uncommon case of stack overflow, the thread dumps the stack to GPU memory and resumes the execution in a second kernel.

Hybrid alternatives. Since the granularity of parallelism may vary within a recursion tree, it might be better to switch among the stack models during execution. For example, we may use b_stacks for the small number of sub-tasks at the beginning, and then switch to w_stacks or t_stacks for the large number of sub-tasks approaching the base cases. The challenge in developing an efficient hybrid scheme on the GPU is to reduce the switching overhead, especially the bookkeeping.

Applications versus compilers. Given the strengths and weaknesses of the three kinds of parallel stacks, GPU developers have the flexibility to choose the individual or hybrid alternative that suits their own algorithms, of which they presumably have better knowledge than the compiler does. On the other hand, native compiler support for recursion would significantly ease programming and possibly improve efficiency.
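The overflow handling for a fixed-size stack can be sketched as follows. All names here are hypothetical; on the GPU the spill target would be a pre-allocated global-memory buffer, and the spilled tasks would be picked up by a second kernel launch, which is only indicated by comments.

```cpp
#include <array>
#include <cstddef>
#include <vector>

constexpr std::size_t kStackCap = 8;  // fixed size chosen at launch

// Fixed-size stack that spills to a backing store on overflow.
struct OverflowStack {
    std::array<int, kStackCap> slots;  // the on-chip/local stack
    std::size_t top = 0;
    std::vector<int> spilled;  // stands in for the GPU-memory dump

    void push(int task) {
        if (top == kStackCap) {
            // Overflow (uncommon case): dump the whole stack; the
            // dumped tasks are resumed by a second kernel.
            spilled.insert(spilled.end(), slots.begin(), slots.end());
            top = 0;
        }
        slots[top++] = task;
    }
};
```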

3. Preliminary Results

We have applied our GPU-based parallel stacks to a representative recursive problem, tree traversal [3]. The tree index is a two-dimensional R-tree [2] on 4M records amounting to 64MB, and the workload is 100K two-dimensional range queries. We use CUDA to implement the programs on a GeForce 8800 GTX GPU. For comparison, we have also implemented a CPU-based parallel traversal routine using OpenMP with two threads running on an Athlon dual-core CPU; this routine uses native recursion.

We study query performance at various degrees of parallelism, specifically the number of input partitions of each recursive case, i.e., the fanout of the recursion tree. For tree traversal, this corresponds to the node size in number of entries (denoted N). With N varied under the same workload, we measure the execution time using the three types of stacks on the GPU, in comparison with the CPU time.

Figure 2 shows the time in log scale with the node size varied. The highest speedups of the GPU over the CPU using t_stack, b_stack and w_stack are 5X, 5.9X and 6.3X, respectively. When N is smaller than four, parallelism among nodes is limited, especially near the root of the recursion tree. As a result, most threads in b_stack and w_stack are underutilized, and t_stack is the fastest. As the node size grows, the divergence in t_stack becomes significant; when N is larger than 1024, t_stack becomes even slower than the CPU routine. For N less than the warp size (32), b_stack performs similarly to w_stack. When the node size is larger than 256, b_stack's communication overhead becomes more significant, and w_stack is faster. Therefore, t_stack is the first choice for small nodes (e.g., N < 4), and w_stack is the best alternative for large nodes. Since our preliminary results are limited to tree traversal, we speculate that b_stack might be better suited to more coarse-grained tasks such as sorting, and might outperform w_stack in such cases. Such further comparison is part of our ongoing work.

Figure 2. Traversal performance with node size varied (execution time in ms, log scale; node sizes 2 to 1024; series: CPU, t_stack, b_stack, w_stack).

4. Conclusions

Graphics processors have become an attractive alternative for general-purpose high-performance computing on commodity hardware. In this study, we have designed three stack implementation alternatives for emulating recursion on GPUs. These parallel stacks differ in the granularity of stack sharing and suit different situations. We have implemented all three for tree traversal on the GPU and compared their performance with the node size varied. Our results could be useful to both GPU programmers and GPU compiler writers. As ongoing work, we are applying our techniques to other recursive algorithms on the GPU, such as quicksort, investigating the relative performance of the alternatives there, and exploring a hybrid approach that utilizes multiple kinds of stacks.

Acknowledgments The authors thank the anonymous reviewers for their insightful suggestions. This work was supported by grant 616808 from the Hong Kong Research Grants Council. Ke Yang and Bingsheng He are currently with Microsoft Corp.

References

[1] NVIDIA CUDA. http://developer.nvidia.com/object/cuda.html.
[2] A. Guttman. R-trees: a dynamic index structure for spatial searching. In Proc. ACM SIGMOD, pp. 47-54, 1984.
[3] S. Popov, J. Günther, H.-P. Seidel, et al. Stackless KD-tree traversal for high performance GPU ray tracing. Computer Graphics Forum, 26(3):415-424, 2007.
[4] L. Prechelt and S. U. Hänßgen. Efficient parallel execution of irregular recursive programs. IEEE Transactions on Parallel and Distributed Systems, 13(2):167-178, 2002.
[5] B. He, K. Yang, R. Fang, et al. Relational joins on graphics processors. In Proc. ACM SIGMOD, 2008.
[6] K. Zhou, Q. Hou, R. Wang, and B. Guo. Real-time KD-tree construction on graphics hardware. In Proc. SIGGRAPH Asia, 2008.
