Parallel Sparse Matrix Vector Multiplication using Greedy Extraction of Boxes

Dr. Dhananjay Brahme
Computational Research Laboratories, Pune, India
Email: [email protected]

Binit Ranjan Mishra
BITS Pilani, India
Email: [email protected]

Anup Barve
Computational Research Laboratories, Pune, India
Email: [email protected]

Abstract—Parallel Sparse Matrix Vector Multiplication (PSpMV) is a compute intensive kernel used in iterative solvers like Conjugate Gradient, GMRES and Lanczos. Numerous attempts at optimizing this function have been made, most of which require fine tuning of many hardware and software parameters to achieve optimal performance. We offer a simple framework that involves (i) employing a greedy algorithm to extract variable-sized dense sub-matrices without filling in zeros, (ii) partitioning the sparse matrix in a load-balanced manner and maintaining partial information at each node, and (iii) overlapping communication with computation. Using the above, we reduce memory traffic and hide communication latencies, and expect improved cache and register utilization to follow inherently. This paper reports the performance improvements of PSpMV both on its own and when used in a Preconditioned Conjugate Gradient (PCG) solver.

1. Introduction

Sparse Matrix Vector Multiplication (SpMV) dominates the performance of diverse applications in scientific and engineering computing, economic modeling and information retrieval. It is a major computational component of iterative solvers such as Conjugate Gradient, GMRES and Lanczos used in these domains. Enabling the simulation of large problems in reasonable time requires that these solvers, and the core SpMV function within them, be sped up. The major issue with sparse kernels is that they suffer from high instruction and storage overheads, as well as indirect and irregular memory access patterns. A variety of techniques to address this has been reported in the scientific literature. These include:

1) Choosing optimum representations: The performance of SpMV is limited by memory-CPU bandwidth; representing a matrix in a more compact manner therefore reduces the compute and memory throughput imbalance and improves SpMV performance. Some of the formats currently used are:


a) Block Compressed Row (BCSR) format: Here, a particular block size (r, c) is chosen and the whole matrix is represented as a collection of r × c dense blocks. This requires filling in zeros to create a complete r × c box [3], [15], [20] (a sketch of such a fixed-block kernel is given below, after this survey). Sparsity [9] describes techniques for choosing the optimal block size for a particular matrix and machine. Using this technique, they have achieved significant speed-ups over the base CSR performance.

b) Unaligned Block Compressed Row (UBCSR) format: In the BCSR format, all blocks begin at rows i and columns j such that i mod r = j mod c = 0. However, the non-zero patterns of real matrices do not necessarily obey this constraint, which results in zero fill-ins for these boxes. In the UBCSR format, boxes in each row and column are row shifted by a number between 0 and r − 1 and column shifted by a number between 0 and c − 1, so as to reduce the fill-ins required. Vuduc and Moon [24] have used this technique in their SpMV.

c) Splitting: In the above formats the entire matrix is represented as a collection of blocks of one size (r, c). Many matrices are better represented as a collection of blocks of several sizes (r1, c1), (r2, c2), ..., (rk, ck) [24]. Representing them this way reduces the amount of fill-in required [22]. The matrix A is considered as a sum of matrices A1, A2, ..., Ak, where matrix Ai is represented as a collection of blocks of size (ri, ci) in UBCSR format. In this method there is a cost involved in searching for a set of dominant block sizes.

d) Permutation: Permutation of rows and columns is used to create dense substructures and reduce the fill-ins [23], [7], [8], [10]. It is also used to increase the number of non-zeros in diagonal sub-blocks [19], which reduces communication in multi-node parallel SpMV.

2) Blocking: An optimization that has been widely applied to SpMV is that of blocking in thread, cache,

local store, TLB and register. While such an optimization was initially intended to increase register reuse and the locality of references to the input vector, its main advantage is to reduce the working set of the algorithm. Cache blocking allows reuse of elements of the x vector (from cache) while multiplying subsequent rows, and is effective for very large matrices [25], [21], [22], [16]. Thread blocking helps in load balancing, while register blocking reduces the memory footprint and aims at improving multithreaded performance.

3) In addition to the above, techniques like loop unrolling and programming with SIMD intrinsics are employed to promote effective use of the underlying architecture [14], [12], [13].

However, some of these efforts at optimizing data structures to reduce the memory-processor bandwidth requirement rely on filling zeros into the existing matrix structure or reordering it to obtain larger and denser sub-matrices (of pre-determined sizes in some cases). Filling in zeros, in effect, erodes the advantage of reducing the data storage, and also requires extensive experimentation to tune the parameters for optimum results. Reordering [17], on the other hand, requires heavy pre-processing and data overhead to maintain additional ordering information. Newer sparse matrix representations that store fewer non-zeros than the matrix itself require explicit pattern recognition in the matrix. This, again, is a significant overhead when dealing with very general matrices. For instance, in the UBCSR method [24] the user has to specify the set of block sizes with which to split the matrix. Filling in zeros to complete the chosen block size certainly helps in reducing the data storage for the indices; however, automatically choosing an optimal set of block sizes is an NP-hard problem. In addition, there is a preprocessing cost of transforming the matrix from the CSR format to the UBCSR format and splitting the matrix. Most of these methods have been automated by invoking complex heuristics to determine optimum values for parameters like block size, register blocking, format and index size [25]. Proper application of such heuristics, however, requires a proper understanding of the underlying hardware architecture. Attempts at software pipelining (which is ineffective for out-of-order superscalars) and removal of branches suffer from the same issue and need to differ across multi-core architectures. Applied correctly and efficiently, these will certainly lead to a performance boost, but at the cost of data and computation overhead that may or may not be amortized over multiple iterations. We decided against trading greatly increased code complexity for slightly better performance, in favour of a simple, scalable solution that has few maintenance issues and can be modified and tuned easily for more specific situations.
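To make the zero-fill trade-off of the fixed-block formats concrete, the following sketch multiplies a matrix stored in 2 × 2 BCSR. It is only our illustration (the struct and function names are ours, not those of any cited library): every stored block carries r · c = 4 values even when some of them are explicit zeros, which is precisely the fill-in overhead that the representation proposed in this paper avoids, while the fully unrolled 2 × 2 inner loop is the register-blocking benefit that motivates these formats.

    #include <stddef.h>

    /* Minimal fixed-block (2x2) BCSR storage: every block stores r*c = 4 values,
     * including any zeros filled in to complete the block.                      */
    typedef struct {
        int     nblockrows;   /* number of block rows (matrix rows / 2)      */
        int    *browptr;      /* size nblockrows+1: start of each block row   */
        int    *bcolidx;      /* block-column index of each stored block      */
        double *bvals;        /* 4 values per block, row-major within a block */
    } bcsr2x2;

    /* y = A * x for a BCSR 2x2 matrix; the 2x2 inner product is fully unrolled. */
    static void spmv_bcsr2x2(const bcsr2x2 *A, const double *x, double *y)
    {
        for (int bi = 0; bi < A->nblockrows; ++bi) {
            double y0 = 0.0, y1 = 0.0;
            for (int k = A->browptr[bi]; k < A->browptr[bi + 1]; ++k) {
                const double *b  = &A->bvals[4 * k];
                const double *xp = &x[2 * A->bcolidx[k]];
                y0 += b[0] * xp[0] + b[1] * xp[1];
                y1 += b[2] * xp[0] + b[3] * xp[1];
            }
            y[2 * bi]     = y0;
            y[2 * bi + 1] = y1;
        }
    }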

In this paper, we report our effort at developing an optimal PSpMV. Our design decision to keep the framework simple yet efficient required us to avoid heuristic approaches. The approach we adopted is to add no fill-ins and to greedily extract the boxes or lines from the matrix as they occur naturally. This results in a definite reduction in the storage size of the sparse matrix, and the simplicity of the algorithm ensures a low setup time. The argument usually given against extensively parallelizing SpMV is that memory bandwidth does not scale with the number of cores, and that discovering the matrix structure at run time heavily increases pre-processing time, including the cost and latency of sending extra information over distributed nodes [6]. Our goal is to provide a PSpMV with a much better runtime than CSR while keeping the setup cost of creating the optimized sparse matrix representation low. Good scaling over multiple parallel processes and threads was also a major factor to be kept in mind. We have implemented our PSpMV on a large cluster of HP blades built from Intel Clovertown processors (two quad-core Xeon processors). To present the results of our experiments, we chose a set of 13 matrices from the University of Florida collection [5]; Table 1 lists these matrices. We have removed all zero fill-ins (if any) from these matrices. They come from various domains such as FEM and structural engineering. The following sections discuss the method used for optimizing PSpMV and the results of our approach.

2. Pre-Processing the Sparse Matrix

In order to achieve the optimizations expected of our algorithm, we need specialized data structures that maintain all information about the entire matrix in the form that is most beneficial. In designing these data structures, we keep in mind the following: 1) single-node performance is limited primarily by memory bandwidth, which restricts the rate at which matrix and vector elements are fetched from memory, and 2) multi-node performance is limited primarily by the distribution of information and, hence, communication delays across processor nodes. It must, however, be noted that our claim to optimum performance stems from an effective coupling of both single-node and multi-node optimizations, not from each one individually. This is because we noticed that many optimization techniques that give excellent performance on a single node simply lose the performance benefit they provide over multiple nodes and multiple threads [25], [11], [6]. We shall therefore give a brief overview of what happens on a single node before walking through the operations in a multi-node, multi-threaded environment. Many of the individual techniques may not be novel, but the combination is unique and holds promise.

Table 1. Matrix set from the University of Florida collection (all matrices are real symmetric; all except the last two are positive definite).

    Matrix Name    n        nnz        Domain
    af_2_k101      503625   17550675   Sheet metal forming simulation
    af_shell3      504855   17562051   Sheet metal forming simulation
    BenElechi1     245874   13150496   Unknown
    audikw_1       943695   77651847   Automotive crankshaft model
    bone010        986703   47851783   Bone micro-finite element model
    boneS10        914898   40878708   Bone micro-finite element model
    ct20stif       52329    2600295    Structural engineering stiffness matrix
    inline_1       503712   36816170   Inline skater matrix
    msdoor         415863   19173163   Door test matrix
    nd24k          72000    28715634   3D mesh problem
    ship_001       34920    3896496    Ship structure
    Ga41As41H72    268096   18488476   Density Functional Theory
    Si41Ge41H72    185639   15011265   Density Functional Theory

2.1. Naturally occurring boxes

The primary method we propose to improve SpMV performance on a single node is creating a compact representation of the sparse matrix. If we look at the non-zero patterns of the matrices in Table 1, it is clear that many of them have specific geometric patterns: 1) horizontal, vertical and diagonal lines of varying lengths, and 2) boxes of varying widths and heights. We take advantage of this observation and represent a matrix as a collection of points and variable-sized boxes. Treating the matrix as a collection of variable-sized boxes subsumes the benefit provided by extracting lines alone. The uniqueness of our approach is that we do not fill in any zeros: the number of non-zeros in our representation is the same as in CSR. This means the size of the matrix data is always reduced, which in turn reduces the amount of data that must be fetched from memory as each multiplication is performed. This reduces the compute-memory throughput imbalance and improves the overall performance of the SpMV multiplication. We employ a greedy algorithm to transform a matrix from the Compressed Sparse Row (CSR) format into a collection of boxes and points (non-zeros that could not be grouped into boxes). The next section describes the method used to create variable-sized boxes from the CSR format.
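As an illustration of why this reduces index traffic, the sketch below shows one possible layout for a rectangular box and its contribution to y = Ax. The names and layout are ours (the data structures actually used in this work appear in Sections 2.2 and 2.4): an h × w box needs a single (row, col, height, width) header, whereas CSR needs one stored column index per non-zero.

    /* Illustrative storage for one variable-sized dense box: a height x width
     * dense block anchored at (row, col), stored row-major, with no zero fill. */
    typedef struct {
        int     row, col;      /* anchor (top-left) of the box in the matrix */
        int     height, width; /* box dimensions                             */
        double *vals;          /* height*width non-zero values, row-major    */
    } box;

    typedef struct {           /* leftover non-zeros that form no box */
        int    row, col;
        double val;
    } point;

    /* y += contribution of one box: a single column index (b->col) serves the
     * whole box, instead of one stored column index per non-zero as in CSR.  */
    static void box_multiply(const box *b, const double *x, double *y)
    {
        for (int i = 0; i < b->height; ++i) {
            const double *row = &b->vals[i * b->width];
            double sum = 0.0;
            for (int j = 0; j < b->width; ++j)
                sum += row[j] * x[b->col + j];
            y[b->row + i] += sum;
        }
    }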

2.2. Procedure to extract boxes and points

Consider the CSR representation of a sparse matrix shown below:

    sparse_matrix_csr {
        int    nnz;            /* number of non-zeros            */
        int    nrows;          /* number of rows                 */
        int    cols[nnz];      /* column index of each non-zero  */
        double vals[nnz];      /* value of each non-zero         */
        int    first[nrows+1]; /* start of each row in cols/vals */
    }

In the CSR format, a matrix is represented as a collection of (column, value) pairs (arrays of size nnz). The elements of the i-th row are stored as cols[j] and vals[j] where first[i] <= j < first[i+1] and 0 <= i < nrows; first[i] holds the index of the first element of the i-th row, and first[nrows] holds the total number of elements. The extraction proceeds as follows (a sketch of the run-length scans of steps 1 and 2 is given at the end of this procedure):

1) Hlines: Scan and create horizontal lines (Hlines). These are stored by rows, and Hlines within the same row are sorted by their starting column. Update the hlengths values corresponding to the column indices that constitute each Hline.

2) Vlines: Scan and create vertical lines (Vlines). These are stored by rows, and Vlines within the same row are sorted by their starting column. Update the vlengths

values corresponding to the column indices that constitute each Vline. Here, a small but useful tweak is implemented: the scan for vlengths is done bottom up, starting from the last row, whose values are already known to be 1. If this calculation were done from the top row downward, the vlength values would not be known beforehand and an extra iteration would be needed to compute them; scanning bottom up completes the entire process in one iteration.

3) Hwidth: Let hlen and vlen denote entries of the hlengths and vlengths arrays. Scan the hlengths and vlengths arrays until an entry with both hlen > 1 and vlen > 1 is found. Set vmin to vlen and hwidth to hlen.

4) Box dimensions: Next, scan the following hlen − 1 entries of the vlengths array. Update vmin if vlen < vmin and vlen > 1. If vlen = 1 is found while scanning, stop and record the current hlen as hrem. The box size is then Height = vmin, Width = hwidth − hrem + 1.

5) If a non-zero entry at row i + 1 and column j + 1 exists (where i and j are the matrix index iterators), then

       {
           dlen++; w--; h--; i++; j++;
           /* hlen_i and vlen_j are the hlengths and vlengths entries
              of the non-zero element at (i, j) */
           w = minimum(w, hlen_i);
           h = minimum(h, vlen_j);
           go to 5;
       } else {
           if (w < h) { W = dlen; H = dlen + h; }
           else       { H = dlen; W = dlen + w; }
       }

6) The width and height of the box are W and H respectively.

7) Repeat steps 3 to 6 until no more boxes are found, making sure that elements already belonging to previously created boxes are not revisited.

8) Once all boxes have been extracted exhaustively, the remaining non-zero values are boxes of unit height and unit width; these are referred to as points.
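The following is a minimal sketch, with assumed array names, of how the hlengths and vlengths arrays of steps 1 and 2 can each be filled in a single pass: horizontal run lengths by scanning every CSR row right to left, and vertical run lengths by scanning the rows bottom up, which is exactly the tweak described in step 2. The production code may differ in detail.

    #include <stdlib.h>
    #include <string.h>

    /* For each non-zero k of a CSR matrix, compute
     *   hlen[k]: number of consecutive non-zeros to the right (including k),
     *   vlen[k]: number of consecutive non-zeros below (including k).
     * first[] and cols[] follow the CSR layout described above.             */
    static void run_lengths(int nrows, int ncols, const int *first,
                            const int *cols, int *hlen, int *vlen)
    {
        /* hlen: scan each row right to left, so the run length is already
         * known for the element one column to the right.                    */
        for (int i = 0; i < nrows; ++i)
            for (int k = first[i + 1] - 1; k >= first[i]; --k)
                hlen[k] = (k + 1 < first[i + 1] && cols[k + 1] == cols[k] + 1)
                              ? hlen[k + 1] + 1 : 1;

        /* vlen: scan rows bottom up; below[c] holds the vertical run length
         * of the non-zero at (previous row, c), or 0 if that entry is zero.  */
        int *below = calloc(ncols, sizeof *below);
        int *cur   = calloc(ncols, sizeof *cur);
        for (int i = nrows - 1; i >= 0; --i) {
            memset(cur, 0, ncols * sizeof *cur);
            for (int k = first[i]; k < first[i + 1]; ++k) {
                vlen[k] = below[cols[k]] + 1;
                cur[cols[k]] = vlen[k];
            }
            int *tmp = below; below = cur; cur = tmp; /* row i becomes "below" */
        }
        free(below);
        free(cur);
    }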

2.3. Representation using boxes/partition blocks

We shall henceforth refer to our extracted variable-sized boxes and points as partition blocks.

These Partition Blocks are grouped, according to the number of processes and the partitioning scheme, into Partition Groups, as described in the next section.

2.4. Partitioning

Consider the following 5 × 5 example matrix, used in the rest of this section:

        | 1    0    3    2    0 |
        | 0    0    4    0    0 |
    A = | 0    6    0    5    0 |
        | 9    7   11   10    0 |
        | 0   13    8   12    0 |

Throughout the paper from now on, we shall refer to the boxes as Partition Blocks and to a collection of Partition Blocks as a Partition Group. These terms are relevant to partitioning the matrix on the basis of the number of processes and threads.

On the root node:

1) Read the sparse matrix in CSR format.

2) PBlocks: Chop the matrix horizontally and then vertically, as described above, to create Partition Blocks (boxes). Parameters like the number of threads and processes determine the number of horizontal chops and hence the number of partition blocks that get generated. If there are p processes and t threads, the entire matrix is horizontally chopped into p parts and each part is then horizontally divided into t strips. In the example above, we have 2 processes and 2 threads per process. Hence, the partitioning was performed such that Process 0 got rows 0, 1, 2 and Process 1 got rows 3, 4. Within Process 0, Thread 0 got row 0 and Thread 1 got rows 1, 2. Within Process 1, Thread 0 got row 3 and Thread 1 got row 4. The vertical partitioning is done such that each process has two partition groups (PGs), PG 0 covering column indices 0 to 2 and PG 1 covering column indices 3 to 4. The discussion that follows formalizes this description.

    partition_block {
        int rowid;               /* row id            */
        int colid;               /* col id            */
        int tid;                 /* thread id         */
        int sendNode;            /* result receiver   */
        int recvNode;            /* vector sender     */
        int nnzeros;             /* nnz in the matrix */
        unsigned int *rcols;
        unsigned int *sresults;
        sparse_matrix *smatrix;
    }

3) PGroups: Create a partition group for each process; it contains a collection of partition blocks as per the partitioning method.

    partition_group {
        int gid;                 /* group id */
        comm *sendpres;
        comm *recvpres;
        comm *sendx;
        comm *recvx;
        int tranks;
        int sranks;
        int *firstrows;
        int *firstcols;
        int *group_to_node;
        int block_size;
        int ncthreads;
        int local_nblocks;
    }

4) Group partition blocks into partition groups: Depending on the partitioning scheme, partition blocks are grouped into partition groups. The schemes that have been implemented are row, column and a few network-architecture-aware partitioning schemes. The example uses the row partitioning scheme, and the results provided are also for row partitioning. The code, however, lets the user implement his/her own partitioning scheme or make the code use tools like hMetis very easily.

5) Vector information: Create vector communication information depending on the partitioning scheme. Data structures are created to maintain the vector and result communication information for each partition group. This information is contained in the partition group data structure and is required during the core multiplication operation at each process (rank). This communication is sequenced beforehand by the sequencer. Each partition block within a partition group has its own vector-related information (described under core multiplication) with reference to the vector information available at its partition group.

6) Vector compaction optimization: Due to the nature of the partitioning scheme, only a partial vector is available to each process; the necessary elements of the remaining vector need to be communicated to it. These elements are compacted (redundant elements are removed) into a data structure and the communication is carried out. An associated issue arises out of threading. A partition group has a number of partition blocks within it, assigned to a number of threads. Each partition block requires its own vector indices for accurate multiplication. However, it is possible that different

subsets of these indices are the same for distinct partition blocks. This may require some indices to be sent repeatedly, which means redundant communication. In the given example, suppose the vector that the matrix is to be multiplied with is

    x = [ 4  7  13  28  1 ]^T

In this case, the vector is partitioned row-wise in the same way as the partition groups (because vector element information is always kept with the partition group). So the vector elements at indices 0, 1, 2 are held by Process 0 (values 4, 7, 13) and those at indices 3, 4 are held by Process 1 (values 28, 1). Now P0T0 (Process 0, Thread 0) and P0T1 need the vector element at index 3 (corresponding to matrix values 2 and 5) from a non-local source, Process 1 (the remaining vector values are available at Process 0). Similarly, P1T0 needs vector indices 0, 1, 2 (corresponding to matrix values 9, 7, 11) and P1T1 needs vector indices 1, 2 (corresponding to matrix values 13, 8), all held by Process 0. This would require P0 to send the values at indices 0, 1, 2, 1, 2. To overcome this problem, a merged array of vector elements (that is, without any duplicate elements) is communicated. By merging, we mean that the union of both sets of indices (those required by P1T0 and P1T1) is taken and stored as a merged array. For each partition block, a mask vector is maintained; it records the exact positions in the merged array that the partition block needs. Hence, upon receiving the merged array, the necessary vector elements can be extracted and used (a sketch of this merging and masking is given at the end of this section). This is a fairly common and obvious trick, but considering its implementation complexity and the performance and scaling benefit it provided in conjunction with our other optimizations, it is worth describing in detail.

7) Send the partition group, and the compressed vector extraction information for each thread in charge of each partition block within the partition group, to its corresponding process (rank). In the given example, take the case of Process 1: the merged array sent to it covers vector indices 0, 1, 2. P1T0 needs indices 0, 1, 2 and P1T1 needs indices 1, 2; therefore the partition block for P1T0 carries the mask vector 0, 1, 2 whereas that for P1T1 carries 1, 2, and each extracts exactly those positions from the vector available at its process (partition group).

On a non-root node: simply receive the partition group from the root node. With

this information in hand at each node, the multiplication routine can be conveniently performed.

The assignment of particular partition blocks to specific threads makes this step very convenient.
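A minimal sketch of the vector compaction described in step 6, using our own helper names: the index sets needed by the partition blocks of one group are merged into a sorted, duplicate-free array, and each block keeps a mask of positions into that merged array from which it later extracts its private copy of the vector.

    #include <stdlib.h>

    static int cmp_int(const void *a, const void *b)
    {
        return (*(const int *)a > *(const int *)b) - (*(const int *)a < *(const int *)b);
    }

    /* Merge the vector indices needed by all blocks of a group into one sorted,
     * duplicate-free array.  Returns its length; merged[] must have room for
     * the total number of input indices.                                       */
    static int merge_needed(const int *needed, int total, int *merged)
    {
        int m = 0;
        for (int i = 0; i < total; ++i) merged[i] = needed[i];
        qsort(merged, total, sizeof *merged, cmp_int);
        for (int i = 0; i < total; ++i)
            if (m == 0 || merged[i] != merged[m - 1]) merged[m++] = merged[i];
        return m;
    }

    /* For one partition block: mask[i] is the position in the merged array of
     * the i-th index this block needs (all of them are present by construction). */
    static void build_mask(const int *block_needed, int n,
                           const int *merged, int m, int *mask)
    {
        for (int i = 0; i < n; ++i) {
            int lo = 0, hi = m - 1;
            while (lo < hi) {                  /* binary search in merged[] */
                int mid = (lo + hi) / 2;
                if (merged[mid] < block_needed[i]) lo = mid + 1; else hi = mid;
            }
            mask[i] = lo;
        }
    }

    /* On the receiving side: gather the block's private vector copy. */
    static void extract(const double *merged_x, const int *mask, int n, double *xblk)
    {
        for (int i = 0; i < n; ++i) xblk[i] = merged_x[mask[i]];
    }

In the example, P1T0 needs indices {0, 1, 2} and P1T1 needs {1, 2}; the merged array is {0, 1, 2}, the masks are {0, 1, 2} and {1, 2} respectively, and Process 0 sends each vector element only once.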

3. SpMV Kernel and multi-threading

3.1. Core Multiplication

After the preprocessing step of partitioning the matrix and populating the necessary data structures, the core multiplication needs to be carried out. The partition groups are distributed amongst the processes and the partition blocks within each partition group are assigned to their respective threads, as explained in greater detail below. For each process, two categories of multiplication are carried out:

1) Local multiplication: the multiplication of partition blocks with the partial vector that the process holds locally.

2) Non-local multiplication: carried out when a partition block requires the corresponding elements of the vector to be received from other processes. Here, the process/thread has to wait to receive the required partial vector from the source process, so communication latency may creep in. In the example, the upper-right part (values 2, 5) and the lower-left part (values 9, 7, 11, 13, 8) require non-local multiplication.

3.2. Multi-threading infrastructure

We have devised a simple scheme to provide multi-threading in our application:

1) The core multiplication routine is executed in a main thread.
2) A user-defined number of threads is spawned to form a thread pool.
3) The jobs (multiplication operations per partition block) to be assigned to them are queued up.
4) There is a separate job queue for each thread. When a job is available in its queue, the thread picks it up and starts executing.
5) Each job is treated as a critical section and is safeguarded by a lock (mutex).

Since the core multiplication is assigned to separate threads, it is possible to achieve computation-communication overlap: the main thread carries out the communication for non-local multiplications while the other threads perform the associated computations (local as well as non-local).

Result collection: each process holds a partial result vector at the end of the multiplication operation. This can be gathered on a root process using the MPI gather collective.

3.3. Threaded Multiplication

Each partition block contains a thread id, assigned at pre-processing time and stored in the partition block data structure, which is used to select the thread. Whenever the process is ready with its matrix and vector components, it creates the multiplication job and posts it in the job queue. The main thread is made aware of the completion of all multiplication operations with the help of a semaphore: the semaphore is waited upon when a block is allocated to a thread and posted when that block's multiplication completes. The basic multiplication function that each thread performs is further optimized by loop unrolling. Loop unrolling is done only for boxes with height or width up to 5, since the benefits it provides, such as removal of branches and better pipelining, are effective only for small loops; moreover, boxes within this size range occur more frequently than larger ones. Further optimizations, like the use of register variables where necessary and the distribution of iterations over threads, also contribute to the improvement in code performance.
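As an illustration of the loop unrolling applied to small boxes, the sketch below (our code, not the production kernel) dispatches on the box width and fully unrolls the inner product for widths up to 3, falling back to a generic loop otherwise; the production kernel reportedly unrolls dimensions up to 5 and additionally distributes the boxes over the thread pool described in Section 3.2.

    /* Same illustrative box layout as the sketch in Section 2.1. */
    typedef struct {
        int     row, col;      /* anchor of the box in the full matrix */
        int     height, width; /* box dimensions                       */
        double *vals;          /* height*width values, row-major       */
    } box;

    /* Unrolled multiply for one small box: widths up to 3 are unrolled here
     * for brevity; wider boxes fall through to a generic loop.             */
    static void box_multiply_unrolled(const box *b, const double *x, double *y)
    {
        const double *xc = &x[b->col];
        for (int i = 0; i < b->height; ++i) {
            const double *row = &b->vals[i * b->width];
            double sum;
            switch (b->width) {
            case 1: sum = row[0] * xc[0]; break;
            case 2: sum = row[0] * xc[0] + row[1] * xc[1]; break;
            case 3: sum = row[0] * xc[0] + row[1] * xc[1] + row[2] * xc[2]; break;
            default:                        /* generic loop for wider boxes */
                sum = 0.0;
                for (int j = 0; j < b->width; ++j)
                    sum += row[j] * xc[j];
                break;
            }
            y[b->row + i] += sum;
        }
    }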

4. Results

We ran our implementation over many matrices, including those presented in this paper. Our initial aim was to better the naive CSR multiplication and then the earlier algorithm that extracted only square boxes and lines. Here, code version 2 (v2) searches for square boxes, lines and points while offering multi-thread and multi-node support, and code version 4 (v4) is the implementation of the algorithm described in this paper. Since we want to make a case for variable-sized boxes over fixed-size boxes and general lines, we present the comparison in Table 2. The performance improvement is generally around the 50% mark, except for a couple of matrices where it is much smaller and even negative. This may be attributed to the specific size and structure of those matrices and is being investigated further to, if possible, come up with a better scheme that handles such cases. One of the major concerns in writing a parallel SpMV is its scaling: if an application is suitably parallelized, it should take half as much time on 2x compute nodes as it takes on x compute nodes. By employing our mask-vector solution and other minor optimizations, code v4 benefited tremendously over code v2, as shown in Figure 1.
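For reference, the improvement percentages in Tables 2 and 3 are consistent with a relative time reduction, (t_old − t_new) / t_old × 100. For example, for af_2_k101 on one process, code v4 improves over CSR by (86.46 − 31.59) / 86.46 ≈ 63.5% and over code v2 by (63.93 − 31.59) / 63.93 ≈ 50.6%, matching the tabulated 63.46% and 50.58%.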

Table 2. Performance comparison of code v2 (square boxes and points) and code v4 (variable-sized boxes) for selected matrices; 4 threads per process and one process per processor.

    Matrix Name   %Boxes / %Points   Procs   CSR      Code v2   Code v4   % impr. (v4/v2)   % impr. (v4/CSR)
    af_2_k101     100 / 0            1       86.46    63.93     31.59     50.58%            63.46%
                                     2       43.09    32.80     16.04     51.09%            62.77%
                                     4       21.57    16.44     7.96      51.58%            63.09%
    af_shell3     99.79 / 0.21       1       87.13    65.28     31.93     51.08%            63.35%
                                     2       42.58    33.42     15.98     52.18%            62.47%
                                     4       20.77    16.47     8.04      51.18%            61.29%
    BenElechi1    100 / 0            1       56.55    45.30     21.15     53.31%            62.6%
                                     2       27.99    23.06     10.77     53.29%            61.52%
                                     4       14.28    11.98     5.10      57.42%            64.28%
    ct20stif      96.48 / 3.52       1       10.91    11.01     3.61      67.21%            66.91%
                                     2       5.36     5.46      2.87      47.43%            46.45%
                                     4       2.52     2.59      1.60      38.22%            36.51%
    bone010       76.06 / 23.94      1       217.92   267.84    96.53     63.95%            55.70%
                                     2       109.05   133.85    48.82     63.52%            55.23%
                                     4       52.75    64.11     25.10     60.84%            52.59%
    boneS10       81.24 / 18.76      1       186.24   242.32    81.94     66.18%            56.00%
                                     2       92.78    119.10    47.37     60.22%            48.94%
                                     4       45.00    58.26     23.23     60.12%            48.37%
    inline_1      100 / 0            1       160.29   153.99    66.81     56.61%            64.68%
                                     2       78.66    75.55     53.53     29.14%            31.94%
                                     4       37.57    36.05     31.20     13.45%            16.95%
    nd24k         86.92 / 13.08      1       102.37   152.92    47.68     68.82%            53.42%
                                     2       51.13    74.81     39.37     47.37%            29.87%
                                     4       23.98    33.50     31.40     6.26%             -30.94%
    ship_001      100 / 0            1       17.42    15.59     6.86      55.99%            60.62%
                                     2       7.98     7.17      4.53      36.82%            43.23%
                                     4       3.87     3.51      3.73      -6.26%            3.7%
    Ga41As41H72   64.41 / 35.59      1       76.96    81.97     38.84     52.61%            49.53%
                                     2       31.52    38.37     23.53     38.67%            25.35%
                                     4       22.60    23.10     14.44     37.48%            36.10%

However, scaling in PSpMV eventually hits a limit: once the sparse matrix has been split beyond a certain point, the individual sub-matrices become so small that they fit into cache, making our optimizations ineffective. Splitting even further disintegrates our method into naive CSR multiplication, that is, all we are able to extract are points. Almost all existing PSpMV optimizations suffer from these inevitable degradations.

We also ran the Preconditioned Conjugate Gradient solver using the ParaSails preconditioner, generated with the Hypre library [1]. Table 3 compares the run times of HYPRE and our implementation when the SpMV kernel is called from PCG on multiple nodes. On a single node we perform nearly 50% better than HYPRE, and the advantage increases with the number of nodes up to a certain point. Beyond about 16 nodes, our performance benefit reduces but still remains better than HYPRE's. At about 128 nodes our scaling degrades, whereas HYPRE still scales well up to around 200 nodes. This issue needs to be addressed and may require better partitioning and communication schemes, as suggested by Boman et al. [4] and Falgout et al. [18]. We may also want to run our code on other processors and network architectures to be entirely certain about the scalability issue. We also attempted to benchmark our code against OSKI and PETSc. On a single node, without threads, our code performed at nearly 80% of OSKI's performance, which can be attributed to the heavy auto-tuning of matrices that OSKI performs. Since a parallel version of the OSKI code is not available, we are unable to give a complete comparison. We showed comparable SpMV performance with respect to PETSc, though if the end-to-end run time is considered, PETSc suffers heavily because assembling large sparse matrices takes a very long time [2]. This made it very difficult to exhaustively try all matrices for all cases and present a complete result.

References

[1] https://computation.llnl.gov/casc.
[2] http://www.mcs.anl.gov/petsc/petsc2/documentation/faq.html.
[3] R. Barrett, M. Berry, T. F. Chan, J. Demmel, J. Donato, J. Dongarra, V. Eijkhout, R. Pozo, C. Romine, and H. A. van der Vorst. Templates for the Solution of Linear Systems: Building Blocks for Iterative Methods, 2nd Edition. SIAM, 1994.
[4] E. G. Boman, U. V. Catalyurek, C. Chevalier, K. D, I. Safro, and M. M. Wolf. Advances in parallel partitioning, load balancing and matrix ordering for scientific computing.

Figure 1. Scaling of square-boxes version against variable-sized boxes version for the msdoor matrix

5. Conclusion and Future Work

Our experiments have shown that representing the matrix as a collection of variable-sized dense sub-matrices can give a substantial improvement in performance, which we owe to the greedy nature of the algorithm adopted. However, as the results show, we need to address a few scalability concerns and also aim at better single-node performance. Still, the fact that this novel combination of methods, not investigated before, gives decent performance with a simple and flexible code base opens up many possibilities for improvement that we intend to pursue in the future. One such idea is to represent the matrix as a sum of its L (lower triangular), D (diagonal) and U (upper triangular) components and to apply our algorithm to the L and U parts; the greedy algorithm can then be modified to avoid a large number of iterations and to decrease data storage and transfer. Work is ongoing in this direction, as well as on new partitioning schemes that improve performance in the context of our algorithm (hMetis, which provides most commonly known partitioning schemes, can be easily integrated into our code).
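A brief sketch of the L/D/U idea for the symmetric case (which covers every matrix in Table 1); this is only our illustration of the direction, not existing code: storing just the strictly lower triangle L and the diagonal d, the product y = (L + D + L^T) x is formed in one sweep over L, so box extraction would only have to run on roughly half of the non-zeros. In a distributed setting the L^T updates scatter into rows owned by other processes, so this would add a reduction step to the communication scheme.

    /* Future-work sketch: SpMV for a symmetric matrix stored as its strictly
     * lower triangle L (CSR arrays first/cols/vals) plus the diagonal d,
     * computing y = (L + D + L^T) x.                                        */
    static void spmv_symmetric_ldu(int n, const int *first, const int *cols,
                                   const double *vals, const double *d,
                                   const double *x, double *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = d[i] * x[i];                 /* diagonal part D x */

        for (int i = 0; i < n; ++i) {
            for (int k = first[i]; k < first[i + 1]; ++k) {
                int j = cols[k];                /* j < i: strictly lower */
                y[i] += vals[k] * x[j];         /* L x   */
                y[j] += vals[k] * x[i];         /* L^T x */
            }
        }
    }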

[5] T. Davis. http://www.cise.ufl.edu/research/sparse/matrices/.
[6] M. Hoemmen. Current and coming OSKI features, 2009.
[7] E.-J. Im. Optimizing the Performance of Sparse Matrix-Vector Multiplication. PhD thesis, EECS Department, University of California, Berkeley, 2000.
[8] E.-J. Im and K. Yelick. Optimizing sparse matrix vector multiplication on SMPs. In Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing, March 1999.
[9] E.-J. Im, K. Yelick, and R. Vuduc. Sparsity: Optimization framework for sparse matrix kernels. Int. J. High Perform. Comput. Appl., 18(1):135–158, 2004.
[10] E.-J. Im and K. A. Yelick. Optimizing sparse matrix computations for register reuse in SPARSITY. In Proceedings of the International Conference on Computational Science, volume 2073 of LNCS, pages 127–136, San Francisco, CA, May 2001. Springer.
[11] A. Jain. pOSKI: A library to parallelize OSKI, 2008.
[12] J. B. White III and P. Sadayappan. On improving the performance of sparse matrix-vector multiplication. High-Performance Computing, International Conference on, 0:66, 1997.
[13] H. Kotakemori and H. Hasegawa. Performance evaluation of a parallel iterative method library using OpenMP. High Performance Computing and Grid in Asia Pacific Region, International Conference on, 0:432–436, 2005.

Acknowledgment

The authors would like to thank the management of Computational Research Laboratories Ltd. for making available extensive computational facilities in the form of a large (1700 node) HP/Intel Clovertown cluster to perform numerous experiments. We would also like to give a special mention to Nachiket Sahasrabudhe, Ashish Kabra and Shashank Saraogi for contributing significantly to the development of the code and the testing process.

[14] J. Mellor-Crummey and J. Garvin. Optimizing sparse matrix-vector product computations using unroll and jam. International Journal of High Performance Computing Applications, 18(2):225–236, 2004.
[15] G. V. Paolini and G. Radicati di Brozolo. Data structures to vectorize CG algorithms for general sparsity patterns. BIT, 29(4):703–718, 1989.
[16] A. Pinar and M. T. Heath. Improving performance of sparse matrix-vector multiplication. SC Conference, 0:30, 1999.

[17] A. Pinar and M. T. Heath. Improving performance of sparse matrix-vector multiplication. In Proceedings of the 1999 ACM/IEEE conference on Supercomputing, January 1999.
[18] R. D. Falgout, J. E. Jones, and U. M. Yang. Pursuing scalability for hypre's conceptual interfaces. ACM Transactions on Mathematical Software, September 2005.
[19] Röllin. Towards a fast parallel sparse matrix-vector multiplication. In E. H. D'Hollander, J. R. Joubert, F. J. Peters, and H. Sips, editors, Parallel Computing: Fundamentals & Applications, Proceedings of the International Conference ParCo'99, 17–20 August 1999, Delft, The Netherlands, pages 308–315, 2000.
[20] Y. Saad. Parallel iterative methods for sparse linear systems. In D. Butnariu, Y. Censor, and S. Reich, editors, Inherently Parallel Algorithms in Feasibility and Optimization and their Applications, Studies in Computational Mathematics 8, pages 423–440, Amsterdam, 2001. Elsevier.
[21] O. Temam and W. Jalby. Characterizing the behavior of sparse algorithms on caches. In Supercomputing '92: Proceedings of the 1992 ACM/IEEE conference on Supercomputing, pages 578–587, Los Alamitos, CA, USA, 1992. IEEE Computer Society Press.
[22] S. Toledo. Improving the memory-system performance of sparse-matrix vector multiplication. IBM Journal of Research and Development, 1997.
[23] R. Vuduc, J. W. Demmel, and K. A. Yelick. OSKI: A library of automatically tuned sparse matrix kernels. In Proceedings of SciDAC 2005, Journal of Physics: Conference Series, San Francisco, CA, USA, June 2005. Institute of Physics Publishing. (to appear).
[24] R. W. Vuduc and H.-J. Moon. Fast sparse matrix-vector multiplication by exploiting variable block structure. In First International Conference, HPCC 2005, Sorrento, Italy, September 21–23, 2005, Proceedings, LNCS 3726, pages 807–816.
[25] S. Williams, L. Oliker, R. Vuduc, J. Shalf, K. Yelick, and J. Demmel. Optimization of sparse matrix-vector multiplication on emerging multicore platforms. In SC '07: Proceedings of the 2007 ACM/IEEE conference on Supercomputing, pages 1–12, New York, NY, USA, 2007. ACM.

Table 3. Comparison of PCG times of HYPRE vs. our implementation and the percentage improvement (4 processes per node).

                                           Number of nodes
    Matrix Name    Format           1           2           4           8           16
    af_2_k101      HYPRE            15.798030   9.203318    4.507441    2.305648    1.133993
                   In-house         7.71346     3.63704     1.97673     1.20891     0.739867
                   (% Improvement)  51.1746%    60.4812%    56.1452%    47.567%     34.7556%
    af_shell3      HYPRE            16.415725   9.388054    4.584092    2.175310    1.127944
                   In-house         7.28639     3.67963     2.0957      1.33599     0.97001
                   (% Improvement)  55.6134%    60.8052%    54.2832%    38.5839%    14.002%
    BenElechi      HYPRE            11.064007   5.432400    2.784789    1.768317    0.721970
                   In-house         4.81208     2.48923     1.37504     0.921972    0.554573
                   (% Improvement)  56.5069%    54.1781%    50.6232%    47.8616%    23.1861%
    ct20stif       HYPRE            1.935239    1.202646    0.577334    0.306040    0.134317
                   In-house         0.930455    0.507084    0.373078    0.261144    0.229807
                   (% Improvement)  51.9204%    57.836%     35.379%     21.671%     -71.09%
    nd12k          HYPRE            5.420640    2.995273    1.850556    1.008547    0.517745
                   In-house         2.83254     1.50859     0.85817     0.648346    0.44686
                   (% Improvement)  47.6954%    49.6343%    53.6263%    35.7148%    13.6911%
    ship_001       HYPRE            2.561540    1.710019    0.907057    0.582319    0.220723
                   In-house         1.27894     0.784055    0.52398     0.399908    0.486735
                   (% Improvement)  50.0714%    54.1493%    42.233%     31.3249%    13.6911%
    Si41Ge41H72    HYPRE            8.159445    4.497290    2.786037    1.507203    0.793102
                   In-house         4.18046     2.33762     1.48372     0.931389    0.746315
                   (% Improvement)  48.7654%    48.0216%    46.7444%    38.2041%    5.8992%
