Computational Mechanics 28 (2002) 456–468  Springer-Verlag 2002 DOI 10.1007/s00466-002-0310-6

Iterative mesh partitioning optimization for parallel nonlinear dynamic finite element analysis with direct substructuring Y.-S. Yang, S.-H. Hsieh

Abstract This work presents a novel iterative approach for mesh partitioning optimization to improve the efficiency of parallel nonlinear dynamic finite element analysis with the direct substructure method, which involves static condensation of the substructures' internal degrees of freedom. The proposed approach comprises four major phases: initial partitioning, substructure workload prediction, element weight tuning, and partitioning result adjustment. The final three phases are performed iteratively until the workloads among the substructures are reasonably balanced. A substructure workload predictor that considers the sparsity and ordering of the substructure matrix is used in the proposed approach. Several numerical experiments conducted herein reveal that the proposed iterative mesh partitioning optimization often yields a superior workload balance among substructures and reduces the total elapsed time of the corresponding parallel nonlinear dynamic finite element analysis.

Keywords Mesh partitioning, Parallel computing, Finite element computation

Received 22 August 2001 / Accepted 20 January 2002

Y.-S. Yang
National Center for Research on Earthquake Engineering, Taiwan, R.O.C. Formerly PhD student at the Department of Civil Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C.

S.-H. Hsieh (corresponding author)
Department of Civil Engineering, National Taiwan University, Taipei 10617, Taiwan, R.O.C. e-mail: [email protected]

The authors would like to thank the National Science Council of the Republic of China for financially supporting this research under Contract No. NSC 89-2211-E-002-038.

1 Introduction
The parallel direct substructure method, which involves static condensation of the substructures' internal degrees of freedom, is one of the most popular domain decomposition methods (Smith et al., 1996) for parallel finite element analysis. However, the parallel efficiency of the method remains a concern; for example, computational workload imbalance among substructures inhibits good efficiency. For an efficient parallel finite element analysis, it is essential that the finite element mesh be partitioned such that the computational workloads among processors (or substructures) are well balanced and the inter-process communication is minimized. Although many mesh partitioning algorithms have been proposed (Fox, 1988; Farhat and Simon, 1995; Pothen, 1997; Hsieh et al., 1997), most employ simple and general assumptions in their optimization process for load balancing, without considering the characteristics of a particular parallel solution algorithm. For example, these algorithms generally assume that a balanced number of elements among substructures implies a balance of workloads among processors. Therefore, the mesh partitions produced by these algorithms normally do not result in satisfactory parallel efficiency for a specific parallel solution algorithm, such as the parallel direct substructure method (Vanderstraeten et al., 1996; Yang and Hsieh, 1997). Hsieh et al. (1999) proposed an iterative mesh partitioning approach to improve workload balance among processors in parallel substructure finite element computations. This approach attempts to balance the computational workloads among substructures by iteratively adjusting the mesh partitions through element weight tuning. The element weights within a substructure are tuned in each iterative step according to the substructure workloads predicted by a set of empirical rules derived from numerical experiments on numerous finite element meshes. Based on the basic idea of the iterative mesh partitioning approach of Hsieh et al. (1999), a new iterative mesh partitioning approach with a more accurate workload prediction scheme and a corresponding element weight tuning strategy is proposed herein. In addition, this work focuses on mesh partitioning optimization for parallel substructure finite element analysis with the general sparse matrix technique (George and Liu, 1981).
The substructure method employed herein solves the equilibrium equations through direct methods (rather than iterative ones) because of their robustness and reliability in ill-conditioned nonlinear structural analyses. Moreover, the general sparse matrix technique (rather than the skyline matrix technique) is adopted in this work. Most previous research has employed the skyline matrix technique to solve the equilibrium equations encountered in finite element analysis (e.g., Nikishkov et al., 1996; Papadrakakis and Tsompanakis, 1999; Farhat and Wilson, 1988). Although the programming involved in the skyline matrix technique is easier, the general sparse matrix technique was adopted herein primarily because it has been demonstrated to outperform the skyline matrix technique in both time and storage requirements (Ashcraft et al., 1987; Poole et al., 1992; Yang et al., 1998).

The remainder of this paper is organized as follows. Section 2 introduces the parallel direct substructure method and the mesh partitioning techniques employed herein. The workload properties of the parallel direct substructure method are also discussed. In Sect. 3, the proposed iterative mesh partitioning is described. In Sect. 4, five finite element meshes are used to test the effectiveness of the proposed approach. Finally, conclusions are drawn in Sect. 5.

2 Parallel direct substructure method and mesh partitioning
In this section, the parallel direct substructure method is described briefly. Then, the mesh partitioning algorithms and packages applied in this work are introduced. Finally, load balancing among substructures for the parallel direct substructure method is discussed.

2.1 Parallel direct substructure method
The parallel direct substructure method first partitions a structure into a number of non-overlapping substructures and assigns each substructure to a separate processor. Each processor then forms its substructure equilibrium equations:

\[
[K]_k \{u\}_k =
\begin{bmatrix} K_{II} & K_{IE} \\ K_{EI} & K_{EE} \end{bmatrix}_k
\begin{Bmatrix} u_I \\ u_E \end{Bmatrix}_k =
\begin{Bmatrix} f_I \\ f_E \end{Bmatrix}_k
\tag{1}
\]

in which [K] is the stiffness matrix, {u} is the displacement vector, and {f} is the external force vector. The subscript k denotes the kth substructure, and the subscripts I and E denote the internal and interface degrees of freedom, respectively. Then, static condensation of the substructure internal degrees of freedom is performed independently and concurrently within each processor, without any inter-process communication, as expressed by the following two equations:

\[ [\bar{K}_{EE}]_k = [K_{EE}]_k - [K_{EI}]_k [K_{II}]_k^{-1} [K_{IE}]_k \tag{2} \]

\[ \{\bar{f}_E\}_k = \{f_E\}_k - [K_{EI}]_k [K_{II}]_k^{-1} \{f_I\}_k \tag{3} \]

The condensed system of each substructure is then assembled to form a set of interface equations for the unknowns along the substructure interfaces, and the interface system of equations is solved as follows:

\[ [\bar{K}_{EE}]_g = \sum_{k=1}^{N_P} [B]_k [\bar{K}_{EE}]_k [B]_k^T \tag{4} \]

\[ \{\bar{f}_E\}_g = \sum_{k=1}^{N_P} [B]_k \{\bar{f}_E\}_k \tag{5} \]

\[ \{u_E\}_g = [\bar{K}_{EE}]_g^{-1} \{\bar{f}_E\}_g \tag{6} \]

in which the subscript g denotes the interface system, and the matrix [B]_k is the Boolean matrix mapping the degrees of freedom from substructure k to the interface system. Notably, the matrix inversion in Eq. (6) is only symbolic; a matrix factorization, such as Cholesky decomposition, is typically performed instead. Finally, the unknowns associated with the internal degrees of freedom of each substructure are solved concurrently within each processor, as follows:

\[ \{u_E\}_k = [B]_k^T \{u_E\}_g \tag{7} \]

\[ \{u_I\}_k = [K_{II}]_k^{-1} \left( \{f_I\}_k - [K_{IE}]_k \{u_E\}_k \right) \tag{8} \]

Figure 1 presents the flowchart for the parallel direct substructure method. It should be noted that the condensed set of equations associated with the interface degrees of freedom could be solved either sequentially or in parallel. This work performs the sequential solution of the condensed set of equations on a single processor because it is faster than the parallel one in the present implementation for all of the examples studied herein.

In this work, the parallel substructure method employs the general sparse matrix technique (George and Liu, 1981). Due to the similarity between the solution of simultaneous equations and substructure condensation, the routines in the SPARSPAK library (George and Liu, 1981) are extended slightly to facilitate the sparse matrix implementation of the direct substructure method. In addition, the decomposition algorithm of Han and Abel (1984) is used for the static condensation of the substructure equilibrium equations. More detailed descriptions of the implementation of the present parallel direct substructure method can be found in Hsieh et al. (2002).
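As a concrete illustration of Eqs. (2)–(8), the following dense-matrix sketch performs the static condensation, interface assembly and solution, and internal-DOF recovery. It is only a sketch: the arrays are dense and the Boolean matrices [B]_k are replaced by index maps, whereas the implementation in this work uses sparse factorization via the extended SPARSPAK routines.

```python
import numpy as np

def condense(K, f, n_i):
    """Static condensation of one substructure (Eqs. (2)-(3)).
    K and f are ordered with the n_i internal DOFs first, interface DOFs last."""
    K_II, K_IE = K[:n_i, :n_i], K[:n_i, n_i:]
    K_EI, K_EE = K[n_i:, :n_i], K[n_i:, n_i:]
    f_I, f_E = f[:n_i], f[n_i:]
    K_bar = K_EE - K_EI @ np.linalg.solve(K_II, K_IE)   # Eq. (2)
    f_bar = f_E - K_EI @ np.linalg.solve(K_II, f_I)     # Eq. (3)
    return K_bar, f_bar

def solve_interface(condensed, dof_maps, n_g):
    """Assemble and solve the interface system (Eqs. (4)-(6)).
    dof_maps[k] lists the global interface DOF index of each local
    interface DOF of substructure k (an index form of [B]_k)."""
    K_g = np.zeros((n_g, n_g))
    f_g = np.zeros(n_g)
    for (K_bar, f_bar), m in zip(condensed, dof_maps):
        K_g[np.ix_(m, m)] += K_bar                      # Eq. (4)
        f_g[m] += f_bar                                 # Eq. (5)
    return np.linalg.solve(K_g, f_g)                    # Eq. (6)

def recover_internal(K, f, n_i, u_E):
    """Back-substitute for the internal unknowns of one substructure (Eq. (8))."""
    K_II, K_IE = K[:n_i, :n_i], K[:n_i, n_i:]
    return np.linalg.solve(K_II, f[:n_i] - K_IE @ u_E)
```

With a single substructure and an identity DOF map, the condensed solve reproduces the direct solution of the full system, which is a convenient sanity check for an implementation.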

2.2 Mesh partitioning
Mesh partitioning divides a finite element mesh into several sub-meshes, so that each sub-mesh (or substructure) can be assigned to a single processor for parallel substructure analysis. The mesh partitioning of a finite element mesh is vital to the efficiency of parallel substructure analysis: it governs the load balance and the quantity of communication (or message passing) among processors, as well as the size of the interface system of equations. To date, numerous mesh partitioning algorithms have been proposed; literature reviews can be found in Farhat and Simon (1995), Pothen (1997), Hsieh et al. (1997), and Fox (1988). Three mesh partitioning packages, RST (Hsieh et al., 1995), METIS (Karypis and Kumar, 1998) and JOSTLE (Walshaw, 1999), are employed in this work. They are based on graph partitioning techniques (Pothen, 1997), which have been used extensively in parallel scientific computations. Herein, a finite element mesh is transformed into a Communication Graph (C-Graph), which is similar to the diagonal dual graph used by Vanderstraeten and Keunings (1995), and then partitioned via one of the mesh partitioning packages. Details on the transformation from a finite element mesh to a C-Graph can be found in Hsieh et al. (1995). The Recursive Spectral Two-way (RST) mesh partitioning method (Hsieh et al., 1995) is modified from the Recursive Spectral Bisection (RSB) method (Pothen et al., 1990; Simon, 1991) with a fast multilevel spectral bisection implementation (Barnard and Simon, 1993). Via a two-way recursive approach, the RST method can partition a finite element mesh into an arbitrary number of substructures


Fig. 1. Flowchart for the parallel direct substructure method of this work

(rather than being limited to 2^n substructures as in the RSB method). Hsieh et al. (1997) demonstrated that the RST method normally produces partitionings with fewer edge-cuts (connections between elements residing in different substructures) than the mesh partitioning methods presented by Farhat (1988), Al-Nasra and Nguyen (1991), and Malone (1988). The METIS graph partitioning package (version 4.0) (Karypis and Kumar, 1998) provides high-quality partitions (few edge-cuts) within a limited computing time. METIS adopts a multilevel graph partitioning scheme and allows users to assign weights to vertices and/or edges. It then partitions a weighted graph into several subgraphs, each with the same total vertex weight, while minimizing the total weight of the edge-cuts. Yang (2000) demonstrated that, in minimizing both edge-cuts and mesh partitioning time, METIS is competitive with the RST method. Similar to METIS, JOSTLE (version 1.1.8) (Walshaw, 1999) also adopts a multilevel graph partitioning scheme and allows users to assign weights to vertices and/or edges. In addition, when partitioning a mesh, JOSTLE considers the initial partitions. That is, if an initial partition is provided, JOSTLE performs local migration of elements among partitions rather than a total repartition of the graph. This type of partitioning is very useful for adaptive finite element analysis, in which the weights of vertices and/or edges in the corresponding graph may change during adaptive mesh refinement and where dynamic load balancing is often required.
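The two-way recursion by which RST reaches an arbitrary number of substructures can be illustrated with a toy one-dimensional splitter. This is purely illustrative: a real implementation bisects the graph spectrally rather than cutting a weight-sorted list, and all names here are hypothetical.

```python
def recursive_twoway(weights, ids, n_parts):
    """Partition the element list `ids` into `n_parts` parts by repeated
    two-way splits. The part count is halved (floor/ceil) at each level,
    and the weight target of each half is proportional to its share of
    the parts, which is how a bisection-based method reaches counts
    other than powers of two."""
    if n_parts == 1:
        return {e: 0 for e in ids}
    left_parts = n_parts // 2
    target = sum(weights[e] for e in ids) * left_parts / n_parts
    acc, cut = 0.0, 0
    while cut < len(ids) and acc < target:   # prefix cut at the weight target
        acc += weights[ids[cut]]
        cut += 1
    lhs = recursive_twoway(weights, ids[:cut], left_parts)
    rhs = recursive_twoway(weights, ids[cut:], n_parts - left_parts)
    return {**lhs, **{e: p + left_parts for e, p in rhs.items()}}
```

Six unit-weight elements split into three parts of two elements each, even though three is not a power of two.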

2.3 Load balancing among substructures within the parallel substructure method
Load balance among substructures is vital to achieving good efficiency in parallel substructure finite element analysis. Previously, it has typically been assumed that the workload of each substructure is proportional to its number of elements (for meshes with one type of element) or its total element weight (for meshes with various element types of differing weights). However, researchers have demonstrated that this assumption is not sufficient for balancing workloads in the parallel substructure method (Vanderstraeten et al., 1996; Yang and Hsieh, 1997). In fact, the workload of each substructure does not depend solely on the number and types of elements. The primary workload of each substructure lies in the substructure condensation. Based on the decomposition algorithm (Han and Abel, 1984), the substructure condensation can be considered a total factorization of [K_II] and a partial factorization of [K_EI], [K_II] and [K_IE]. Therefore, the numbers of internal and interface nodes (related to the sub-matrix sizes), together with the node-element connectivity and the matrix ordering (related to the sparsity of the sub-matrices), are vital factors affecting the substructure workload. Furthermore, the ordering of a substructure matrix requires that the internal nodes be ordered before the interface ones. Therefore, the effectiveness of the matrix ordering method employed is also an essential factor. To measure the imbalance among substructures, an imbalance factor F_IB is defined as follows:

\[ F_{IB} = \frac{\text{maximum quantity among all substructures}}{\text{minimum quantity among all substructures}} \tag{9} \]

Fig. 2. The M12BD model

Fig. 3. Mesh partitioning result of the M12BD model using the RST method

Table 1. Mesh partitioning information and load imbalance factors for M12BD using the RST method (bestMMNM substructure matrix ordering)

Substructure | Number of elements | Number of nodes | Number of interface nodes | Condensation time (s)
1            | 1008               | 5922            | 186                       | 42.8
2            | 1008               | 5922            | 186                       | 43.3
3            | 1008               | 5922            | 186                       | 46.4
4            | 1008               | 5922            | 186                       | 42.6
F_IB         | 1.0                | 1.0             | 1.0                       | 1.1

Table 2. Mesh partitioning information and load imbalance factors for HIGH60 using the RST method (bestMMNM substructure matrix ordering)

Substructure | Number of elements | Number of nodes | Number of interface nodes | Condensation time (s)
1            | 13050              | 10395           | 142                       | 27.2
2            | 13050              | 10400           | 291                       | 161.1
3            | 13050              | 10448           | 174                       | 79.3
4            | 13050              | 10430           | 323                       | 189.3
F_IB         | 1.0                | 1.0             | 2.3                       | 7.0

in which "quantity" refers to the number of elements, nodes, or interface nodes, or to the condensation time, in this work. For example, the F_IB of the substructure condensation time denotes the ratio of the maximum to the minimum elapsed time for substructure condensation among all substructures.

In a few cases, load balance can be achieved easily. For example, the RST method partitions the M12BD model (Fig. 2) into four equal parts (Fig. 3). Consequently, load balance is achieved in the parallel substructure finite element analysis (see Table 1). However, it is practically impossible to partition an irregular mesh into several equal parts. For example, the RST method partitions the HIGH60 model (see Fig. 4) into four substructures (Fig. 5), each with the same number of elements. Table 2 shows that this mesh partitioning does not produce an adequate load balance in substructure condensation among the processors (or substructures) in the parallel substructure finite element analysis. The workload of each substructure, especially in substructure condensation, is difficult to predict from its number of elements alone. Therefore, to further improve the efficiency of the parallel direct substructure method, there is still a need for a mesh partitioning method that can produce a well-balanced workload for substructure condensation.

Fig. 4. The HIGH60 model

3 An iterative mesh partitioning approach
The concept of using an iterative method to optimize mesh partitioning is not new. Vanderstraeten et al. (Vanderstraeten and Keunings, 1995; Vanderstraeten et al., 1996) have proposed several two-step mesh partitioning approaches. These approaches first produce a partitioning result via an initial partitioning algorithm. Subsequently, a series of retrofits of the partitioning result is performed to reduce a cost function. Their experiments demonstrate that the two-step approach produces high-quality partitioning results at a reasonable computational cost and performs better (with fewer interface nodes) than the multilevel RSB algorithm does. In addition, Walshaw
et al. (1999) and Diekmann et al. (2000) proposed an iterative mesh partitioning method to optimize substructure shape. However, their cost functions do not agree with the workload properties of the parallel direct substructure method, and the resulting partitions may not be generally suited to it. An iterative mesh partitioning approach (IMP-MMJ) is proposed herein, modified from a previous iterative mesh partitioning approach (IMP-MNM) proposed by Hsieh et al. (1999). Both are composed of four major phases: initial partitioning, workload prediction, element weight tuning, and partitioning result adjustment. The first phase is performed only once at the beginning of each mesh partitioning task, while the others are performed iteratively until the substructure workloads are balanced. The IMP-MNM approach adopts METIS for initial partitioning ('M'), tunes element weights within each substructure based on the number of substructure interface nodes ('N'), and then employs METIS to re-partition the mesh ('M'). The IMP-MMJ approach proposed herein employs METIS ('M') to initially partition the mesh, predicts substructure workloads with the Multi-Stage Minimum Degree (MSMD) (Ashcraft et al., 1999) ordering sub-program ('M') to tune the element weights, and applies JOSTLE ('J') to adjust the mesh partitioning result. Figure 6 illustrates the flowchart of the iterative mesh partitioning approach. The following subsections discuss each phase of the approach in detail.

Fig. 5. Mesh partitioning result of the HIGH60 model using the RST method

3.1 Initial partitioning Initial partitioning determines a starting point for the subsequent iterations. The authors believe that an adequate initial partitioning is essential for the subsequent iterations to achieve reasonably good workload balance among substructures as well as the convergence of iterations. METIS (version 4.0) was selected for the initial partitioning within both IMP-MNM and IMP-MMJ approaches. Herein, a weighted C-Graph of the finite element mesh is formed for the initial partitioning. The weight of each vertex (namely, its element weight) is assigned as the number of nodes of its corresponding element, while the weight of each edge between two vertices (or two elements) is defined as the number of nodes that the two corresponding elements share. Figure 7 illustrates the formation of a weighted C-Graph of a finite element mesh that contains six Q4 elements. The numbers marked on the weighted C-Graph denote the vertex and edge weights.

Fig. 7. Weighted Communication Graph (C-Graph) of a finite element mesh

Fig. 6. Flowchart of the iterative mesh partitioning approach with weight tuning

The weight assignment to vertices and edges in the weighted C-Graph has the following implications. First, computational workload of an element is assumed to be proportional to the number of nodes it has and the workload of a substructure is assumed to be the sum of element weights. Secondly, when two elements that share nodes are separated into two distinct substructures, more interface nodes are generated. Although this weight assignment scheme generally fails to provide accurate information regarding substructure workloads, it is sufficient for the initial partitioning phase. Moreover, the initial partitioning result is eventually improved upon.
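The weight assignment above can be sketched as follows, assuming elements are given as tuples of node ids (the function name and input format are illustrative, not the data structures of Hsieh et al. (1995)):

```python
from itertools import combinations

def build_weighted_cgraph(elements):
    """Weighted C-Graph of a mesh: one vertex per element, weighted by its
    node count; one edge per pair of elements sharing nodes, weighted by
    the number of shared nodes (cf. Fig. 7)."""
    vertex_weights = [len(nodes) for nodes in elements]
    edge_weights = {}
    for i, j in combinations(range(len(elements)), 2):
        shared = len(set(elements[i]) & set(elements[j]))
        if shared:  # elements sharing at least one node are connected
            edge_weights[(i, j)] = shared
    return vertex_weights, edge_weights
```

For two Q4 elements sharing one edge, both vertices get weight 4 and the connecting edge gets weight 2 (the two shared nodes).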

3.2 Workload prediction
Previously, it was normally assumed that the workload of each substructure is proportional to the number of elements or nodes that it contains. Therefore, most previous mesh partitioning algorithms produce partitions (or substructures) with an identical number of elements or nodes. This assumption suits explicit dynamic finite element analysis reasonably well because, if both the mass and damping matrices are lumped, the major computational load lies in the element-by-element matrix-vector multiplication for the internal force vector {f} in each time step:

\[ \{f\} = \sum_i [k]_i \{u\}_i \tag{10} \]

in which [k]_i and {u}_i are the stiffness matrix and the displacement vector of element i, respectively. Furthermore, some mesh partitioning algorithms can balance the total element or node weights among substructures if

various weighting values are assigned to different types of elements or nodes in the mesh. For example, an element with more nodes may be given a larger weight to reflect its larger workload. Previous research on parallel explicit dynamic finite element analysis has achieved workload balance among substructures and, thereby, adequate parallel efficiency (Hsieh and Abel, 1993). Unfortunately, in static or implicit dynamic finite element analysis, the substructure workload is not as easy to predict as it is in explicit dynamic finite element analysis. In the parallel direct substructure method, the substructure workload depends not only on the number of elements and nodes, but also on the number of interface nodes, the node-element connectivity, the substructure matrix sparsity, and the matrix ordering. The total number of interface nodes among substructures should also be as small as possible, to reduce both the interface system (Eqs. (4)-(6)) and the inter-process communication among processors. In addition, other parallel finite element methods may have different mesh partitioning requirements. For example, to achieve adequate convergence when the FETI method (Farhat and Roux, 1991) solves the interface system with an iterative solver, each substructure must have a geometrical aspect ratio near one. As various analysis methods have distinct workload properties, the design of a workload prediction scheme should consider the demands of the targeted analysis method. To overcome the load imbalance of the parallel substructure method, Nikishkov et al. (1996) proposed a mesh partitioning scheme that considers the actual numerical workload of each substructure and achieves an adequate load balance. However, this method is only suitable for two-dimensional (2D) regular finite element meshes, rather than three-dimensional (3D) finite element meshes with arbitrary shapes.
Furthermore, this method assumes the use of the skyline matrix technique instead of the general sparse matrix technique. In the IMP-MNM approach proposed by Hsieh et al. (1999), the workload is evaluated according to the number of interface nodes (assuming that more interface nodes in a substructure imply a heavier workload). An improved load balance for the parallel substructure method has been observed repeatedly in many cases. However, the number of interface nodes alone is not sufficient to predict the substructure workload. Thus, a more accurate scheme is proposed herein to predict the workload of each substructure. The MSMD matrix ordering method, which is an extension of the Constrained Minimum Degree (CMD) method (Liu, 1989), is employed herein to predict the substructure condensation workload. The MSMD sub-program implemented in SPOOLES has two features that render it feasible for workload prediction:

(1) It not only orders the matrix, but also provides statistics on the numerical operation counts and the number of nonzero entries required for matrix factorization.
(2) It enables multi-stage matrix ordering. By assigning the internal nodes to stage zero and the interface nodes to stage one, the numerical operation counts and the nonzero entries required for substructure condensation (at stage zero) can be obtained.

With the numerical operation count obtained from the MSMD sub-program at stage zero, W_N, the workload of each substructure, F_WL, can then be defined as follows:

\[ F_{WL} = W_N \tag{11} \]

Herein, workload balance among substructures is the only requirement. However, the F_WL factor can be defined according to the specific demands of various situations. For example, if balance of the substructure matrix storage is also required, the F_WL factor can be defined as a combination of the numerical operation counts (W_N) and the number of nonzero entries required for substructure condensation. In this work, the workload imbalance allowance is set to 1.1. That is, if the imbalance factor of F_WL (see Eqs. (9) and (11)) is less than 1.1, the iteration procedure ends. However, if the imbalance factor of F_WL is still larger than 1.1 after thirty iterations, the iteration procedure ends and the result produced by the 30th iteration is used.

3.3 Element weight tuning
In this phase, the element weights in a substructure with a heavier predicted workload are multiplied by a factor greater than one; conversely, those in a substructure with a lighter predicted workload are multiplied by a factor less than one. It is assumed that this redistribution of the element weights results in a better load balance after the mesh partitioning result is adjusted. In the IMP-MNM approach, the multiplication factor for each substructure is proportional to its number of interface nodes, based on the assumption that a substructure with more interface nodes should have fewer elements to achieve a better workload balance. In the proposed IMP-MMJ approach, the weight of each element in the kth substructure is multiplied by a factor (F_TUN)_k, defined as follows:

\[ (F_{TUN})_k = \left( \frac{(F_{WL})_k}{\mathrm{Mean}(F_{WL})} \right)^{x} \tag{12} \]

where

\[ \mathrm{Mean}(F_{WL}) = \left( \prod_{j=1}^{N_{SUB}} (F_{WL})_j \right)^{1/N_{SUB}} \tag{13} \]

N_SUB denotes the number of substructures. The relaxation factor x is used to relax the tuning factor F_TUN (if x < 1) and hence avoid violent changes in the mesh re-partitioning of subsequent iterations. In this study, the relaxation factor x is set to 0.5 initially. In addition, if the current iteration does not produce a better load balance, the value of x is multiplied by 0.75 to prevent divergence of the iterations. The tuning factor is normalized, as shown in Eqs. (12) and (13), to prevent the element weights from becoming either very large or very small after numerous iterations.
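Assuming each predicted workload (F_WL)_k is a positive scalar, Eqs. (12) and (13) amount to the following (the function name is ours):

```python
from math import prod

def tuning_factors(f_wl, x):
    """Element-weight multipliers (F_TUN)_k of Eq. (12), with Mean(F_WL)
    the geometric mean of Eq. (13). Normalizing by the geometric mean
    keeps the product of all factors equal to one, so repeated tuning
    drifts the element weights neither up nor down overall."""
    mean = prod(f_wl) ** (1.0 / len(f_wl))   # Eq. (13)
    return [(w / mean) ** x for w in f_wl]   # Eq. (12)
```

With x = 1 and predicted workloads [4.0, 1.0], the factors are [2.0, 0.5]: the overloaded substructure's elements get heavier weights, so the next weight-balancing re-partition assigns it fewer elements; x < 1 damps the correction.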

3.4 Partitioning result adjustment
This phase redistributes elements from substructures with larger total element weight to those with smaller total element weight. That is, the mesh partitioning result is modified such that the total element weights among substructures are balanced. In the IMP-MNM approach, the METIS partitioning method is employed to re-partition the mesh. However, this method re-partitions the finite element mesh in the jth iteration without considering the partitioning result of the (j-1)th iteration. Experiments with this approach in Tsai (1999) demonstrated that the mesh partitioning results may change drastically between mesh partitioning iterations. Figure 8 presents the mesh partitioning results (N_SUB = 3) of the IMP-MNM method using METIS as the re-partitioning kernel (Tsai, 1999). Such drastic changes generally obstruct convergence of the iterations. In the proposed IMP-MMJ approach, the JOSTLE partitioning method (Walshaw, 1999) is employed for iterative mesh partitioning. One of the primary differences between JOSTLE and METIS is that, to re-partition, the former migrates vertices along the interfaces among subgraphs. Therefore, if the element weights are not altered dramatically, the mesh partitioning results of the jth and (j-1)th iterations will be similar. Notably, a smooth change of the mesh partitioning results during the iterations generally assists convergence. Figure 9 depicts the use of the JOSTLE partitioning kernel in iterative mesh partitioning. The mesh partitioning results change smoothly during the iterative process, which assists convergence. Moreover, the final mesh partitioning result (i.e., the one in the 5th iteration) is roughly symmetric, and appears more balanced than the one obtained from the

IMP-MNM method in Fig. 8. In the following section, several numerical experiments demonstrate the improvements in load balance among substructures, in the elapsed time for solving the interface system, and in the total elapsed time of the parallel substructure analyses achieved by the proposed approach. Table 3 summarizes the procedure of the proposed IMP-MMJ mesh partitioning approach.

Fig. 8. Iterative mesh partitioning results using METIS (Tsai, 1999)

Fig. 9. Iterative mesh partitioning results using JOSTLE

Fig. 10. The 15STORY model

Table 3. Procedure of the proposed IMP-MMJ approach
1. Set the maximum number of iterations (30), the relaxation factor x (0.5), the variation factor for the relaxation factor dx (0.75), and the imbalance allowance Fa (1.1)
2. Initial partitioning: partition the finite element mesh into substructures (using METIS); set the iteration index i to zero
3. Workload prediction: predict the workload F_WL of each substructure (the estimated numerical operation count for substructure condensation) and the imbalance factor of F_WL for the ith iteration, denoted (F_IBW)_i. If (F_IBW)_i is less than the imbalance allowance Fa, stop the iteration procedure. If i > 0 and (F_IBW)_i > (F_IBW)_{i-1}, reset x to x * dx
4. Element weight tuning: multiply each element weight in the kth substructure by the factor (F_TUN)_k defined in Eq. (12)
5. Partitioning result adjustment: tune the partitioning result such that the total element weights among substructures are balanced (using JOSTLE)
6. Set the iteration index i to i + 1 and go to step 3
Descriptions and numbers in parentheses denote the settings used in this work
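A minimal driver for the loop of Table 3 might look as follows. This is a sketch, not the authors' code: the initial METIS partitioning (step 2) is assumed to be done by the caller, and `predict`, `tune`, and `adjust` are injected stand-ins for the MSMD-based workload predictor, the Eq. (12) tuning, and JOSTLE's local migration, with signatures of our own choosing.

```python
def imp_mmj(parts, element_weights, predict, tune, adjust,
            max_iter=30, x=0.5, dx=0.75, allowance=1.1):
    """Steps 3-6 of Table 3. `parts[e]` is the substructure of element e.
    predict(parts, weights) -> per-substructure workloads F_WL (step 3),
    tune(workloads, x)      -> per-substructure multipliers (step 4),
    adjust(parts, weights)  -> re-balanced partitioning (step 5)."""
    weights = list(element_weights)
    prev_fib = None
    for _ in range(max_iter):
        f_wl = predict(parts, weights)
        fib = max(f_wl) / min(f_wl)        # imbalance factor F_IB, Eq. (9)
        if fib < allowance:                # balanced enough: stop iterating
            break
        if prev_fib is not None and fib > prev_fib:
            x *= dx                        # relax when an iteration worsens
        prev_fib = fib
        factors = tune(f_wl, x)
        weights = [w * factors[parts[e]] for e, w in enumerate(weights)]
        parts = adjust(parts, weights)
    return parts
```

With a predictor that already reports balance, the partitioning is returned unchanged after a single check; otherwise the loop tunes weights and re-partitions until the allowance or the iteration cap is reached.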

4 Numerical experiments Herein, parallel finite element analyses of five finite element models (Figs. 4 and 10–13) are performed to study the proposed mesh partitioning approach – IMP-MMJ. In addition, the experimental results are compared with those from the IMP-MNM approach. An object-oriented parallel finite element program, FE2000, is employed for parallel

Fig. 11. The H5-12 model


Fig. 12. The O4-12 model

finite element analyses. FE2000 employs the extended SPARSPAK (see Sect. 2.1) for sparse matrix storage and operations, and uses the Message Passing Interface (MPI; Message Passing Interface Forum, 1994) to handle all message-passing tasks among processors. A Pentium II-350 PC running the Linux Red Hat operating system is used for mesh partitioning. A PC cluster consisting of four Pentium II-350 PCs (each with 256 MB of memory and the Linux Red Hat operating system) is used for the parallel finite element analyses; the cluster network has a communication bandwidth of 100 Mbps. Table 4 lists the elapsed time of mesh partitioning and of the linear static analysis of each model using the METIS, IMP-MNM, and IMP-MMJ partitioning methods, respectively. Notably, the analysis time listed in Table 4 does not include the mesh partitioning time, but does include the time for some data processing, such as sparse matrix ordering and symbolic factorization (George and Liu, 1981). Furthermore, a mesh partitioning result that is suitable for a linear static analysis is not necessarily suited to a nonlinear dynamic analysis: linear static analysis entails data processing that is performed only once, and such one-time costs are often negligible in a nonlinear dynamic analysis. For the substructure orderings, each substructure performs the MSMD, MMD (Multiple Minimum Degree) (Liu, 1985), and BestOfNDandMS (Ashcraft et al., 1999) ordering

Fig. 13. The BLADE model (Wawrzynek, 1991)

Table 4. Elapsed time of mesh partitioning and parallel linear static analyses

Models     Mesh partitioning time (s)            Analysis time (s)
           METIS   IMP-MNM    IMP-MMJ            METIS    IMP-MNM   IMP-MMJ
15STORY    0.7     1.8  (2)    2.3  (3)           20.7     20.7      24.5
H5-12      0.8     2.1  (2)    4.9  (6)           19.8     16.5      18.8
O4-12      0.8     2.1  (2)    4.8  (6)           68.3     68.3      69.1
HIGH60     3.0     8.8  (2)   60.8 (17)          481.0    316.9     199.7
BLADE      0.3     0.6  (7)   29.5 (30)           47.1     53.1      41.5

NP (number of processors) = NSUB (number of substructures) = 4. Numbers in parentheses denote the number of mesh partitioning iterations.


Fig. 15. Mesh partitioning results of the HIGH60 model using a IMP-MNM and b IMP-MMJ

Fig. 14. Mesh partitioning results of HIGH60 using METIS

methods and selects the one with the lowest storage requirement for static condensation. This mixed substructure matrix ordering method is abbreviated as 'BestMMNM' herein. Furthermore, to satisfy the substructure matrix ordering constraint that the internal degrees of freedom must be ordered before the interface ones, the interface degrees of freedom are shifted after the internal ones whenever the MMD or BestOfNDandMS method is used. If the MSMD ordering method is used, the constraint is satisfied automatically and no shifting is needed.

Both the METIS and IMP-MNM approaches spend little time on mesh partitioning, which is almost negligible relative to the analysis time. The IMP-MMJ approach, however, requires more partitioning time than the other two, particularly for the HIGH60 and BLADE models, because it predicts the workload iteratively via a more complicated procedure (as described above) than the METIS and IMP-MNM approaches do. Nevertheless, it is worthwhile to employ a more effective partitioning approach with a higher cost in a nonlinear static

or dynamic analysis because the mesh partitioning is performed only once at the beginning of the iterative analysis, whereas the more effective partitioning can save much computing time in the substructure condensations and interface system solutions, which are performed hundreds or even thousands of times.

In Table 4, it should be noted that the IMP-MNM approach does not adjust the mesh partitioning results in the iterative process for the 15STORY and O4-12 models. This is because, in the mesh partitioning results of these models, each substructure has roughly the same number of interface nodes, which leads the IMP-MNM approach to predict that the workloads are well balanced. For the HIGH60 model, the METIS partitioning approach produces partitions with a poorly balanced workload among processors (see Fig. 14), but both the IMP-MNM and IMP-MMJ approaches obtain better mesh partitioning results than METIS does and reduce the analysis time (see Fig. 15 and Table 4). For the BLADE example using the IMP-MMJ approach, the mesh partitioning iteration does not converge before it reaches the maximum number of iterations (30 herein) and stops. The final imbalance factor is 1.19, still higher than the workload imbalance allowance of 1.1. Nevertheless, the IMP-MMJ approach reduces the analysis time of the BLADE model.
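The stopping rule just described, an imbalance factor checked against a workload imbalance allowance with a cap on the number of iterations, can be sketched as follows. This is an illustrative reconstruction in Python, not code from FE2000 or the paper: the imbalance factor is taken here as the maximum substructure workload divided by the average workload (a common definition; the paper's exact FIB definition appears in an earlier section), and the workload predictor and JOSTLE-based adjustment are passed in as stand-in callables.

```python
def imbalance_factor(workloads):
    """Ratio of the heaviest substructure workload to the average workload."""
    return max(workloads) / (sum(workloads) / len(workloads))

def iterative_partitioning(predict_workloads, adjust_partition, partition,
                           allowance=1.1, max_iters=30):
    """Iterate workload prediction and partition adjustment until the
    imbalance factor drops below the allowance, or max_iters is reached
    (as happens for the BLADE example above, which stops at FIB = 1.19)."""
    for _ in range(max_iters):
        loads = predict_workloads(partition)
        if imbalance_factor(loads) <= allowance:
            break
        partition = adjust_partition(partition, loads)
    return partition
```

With a predictor that reports perfectly balanced workloads, the loop terminates on the first check without touching the partition.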

Table 5. Workload balance of condensation time

Models     Max. condensation time (s)           FIB of condensation time
           METIS    IMP-MNM   IMP-MMJ          METIS   IMP-MNM   IMP-MMJ
15STORY    2.4      2.4       1.7              1.2     1.2       1.2
H5-12      10.1     7.6       7.1              2.1     2.0       1.3
O4-12      25.4     25.4      21.3             1.8     1.8       1.3
HIGH60     260.6    128.7     90.2             9.1     1.4       1.2
BLADE      18.2     17.1      12.6             2.2     2.5       1.5

NP (number of processors) = NSUB (number of substructures) = 4

Table 6. Number of interface nodes and time for solving interface system

Models     Total number of interface nodes      Solution time (s)
           METIS    IMP-MNM   IMP-MMJ          METIS   IMP-MNM   IMP-MMJ
15STORY    207      207       221              8.3     8.2       11.1
H5-12      209      200       188              5.0     4.5       4.7
O4-12      312      312       312              31.0    31.0      31.1
HIGH60     532      524       408              140.2   132.5     60.1
BLADE      580      617       519              17.9    18.4      14.3

NSUB (number of substructures) = 4

Condensation is normally the most time-consuming process of a large-scale finite element analysis that uses the substructure method. Table 5 lists the maximum substructure condensation times and the load imbalance factors in the experiments. In most of the examples, both the IMP-MNM and IMP-MMJ approaches reduce the maximum condensation time among substructures, particularly for the HIGH60 model, which is the most poorly balanced one when the METIS partitioning is used. Moreover, the IMP-MMJ approach balances the workload better than the IMP-MNM approach does: it results in lower imbalance factors (at most 1.5) and reduces the maximum condensation time by 16 to 65%. The experimental results reveal that the IMP-MMJ approach effectively balances the condensation workload among substructures, particularly for models with a poor initial balance of the condensation workload.

As stated, the interface system size is an essential factor of parallel efficiency. A mesh partitioning result with fewer interface nodes requires less time for solving the interface system and reduces the analysis time. Table 6 presents the total numbers of interface nodes and the times for solving the interface system in the experiments. In three of the five models studied in this work, the IMP-MMJ approach reduces both the number of interface nodes and the solution time; for the 15STORY model, however, it increases both. The IMP-MMJ approach thus seems to perform slightly better than the IMP-MNM approach in reducing interface nodes, even though neither of the iterative mesh partitioning approaches considers minimization of the number of interface nodes (Yang, 2000). In addition, although the IMP-MMJ approach considers only the workload balance among substructures in its optimization, in several cases it produces partitions with fewer interface nodes than the METIS and IMP-MNM partitioning algorithms.
This is probably because the JOSTLE algorithm employed by the IMP-MMJ approach, in the phase of iterative partitioning result adjustment, considers minimization of edge-cuts while migrating elements locally among substructures.

Nonlinear dynamic finite element analyses are performed in the following experiments. The 3-D beam-column (BC) element and the twenty-noded brick element (B20) are used in this work for geometrically nonlinear analyses of framed structures and solid models, respectively. Geometric nonlinearity is considered through a second-order analysis using the updated Lagrangian formulation (Bathe, 1996) with the geometric element stiffness matrix (Yang and Kuo, 1994; Bathe, 1996). Newmark's method and the Newton-Raphson method are used for the implicit dynamics and the nonlinear iterations, respectively. For each model, a set of constant loads is applied. The time step is 0.01 s, and only 0.2 s of the dynamic response is analyzed. The convergence tolerance for the Newton iterations is 10^{-6}; that is, the iteration stops only if the following condition is satisfied:

$$\frac{\sqrt{\{f^i\}^T\{f^i\}}}{\sqrt{\{f^0\}^T\{f^0\}}} < 10^{-6} \qquad (14)$$

Table 7. Total elapsed time of parallel nonlinear dynamic analyses

Models     Total number      Analysis time (s)+
           of iterations     METIS    IMP-MNM   IMP-MMJ
15STORY    60                868      870       1077
H5-12      57                1021     1013      850
O4-12      40                2511     2516      2378
HIGH60     40                16726    12126     7055
BLADE      60                2508     2553      2116

NP (number of processors) = NSUB (number of substructures) = 4
+ Mesh partitioning time is included

where {f^0} and {f^i} denote the external force vector and the imbalance force vector at the ith iteration, respectively.

Table 7 lists the total number of Newton iterations as well as the total elapsed time of the nonlinear dynamic analysis using the different mesh partitioning approaches. The total elapsed time of each analysis in Table 7 includes the corresponding mesh partitioning time shown in Table 4. Except for the 15STORY example, the IMP-MMJ algorithm outperforms both the METIS and IMP-MNM algorithms in minimizing the elapsed time of the parallel nonlinear dynamic analysis. Savings of up to 58% and 42% of the total elapsed analysis time (the HIGH60 example) are observed when the performance of the IMP-MMJ algorithm is compared with those of the METIS and IMP-MNM algorithms, respectively.

Generally speaking, both the IMP-MNM and IMP-MMJ iterative mesh partitioning approaches optimize adequately for a better load balance of the substructure condensation and decrease the maximum condensation time among substructures. Moreover, the IMP-MMJ approach performs even better than the IMP-MNM approach in improving the efficiency of the parallel direct substructure method. The IMP-MMJ approach is more reliable for mesh partitioning optimization because it predicts the substructure workload more precisely and tunes the partitioning results more smoothly than the IMP-MNM approach does. Its major drawback is that it requires considerable time for the mesh partitioning optimization itself. Nevertheless, in nonlinear static or dynamic finite element analyses that use the parallel direct substructure method, the time spent on obtaining an optimal or near-optimal partitioning result can save much more time in the substructure condensations and interface system solutions.
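For reference, the convergence test of Eq. (14) amounts to comparing the Euclidean norm of the imbalance force vector against that of the external force vector. A minimal Python sketch (the function name is hypothetical, not from FE2000):

```python
import math

def newton_converged(f_i, f_0, tol=1e-6):
    """Eq. (14): sqrt({f^i}^T {f^i}) / sqrt({f^0}^T {f^0}) < tol,
    where f_i is the imbalance (residual) force vector at Newton
    iteration i and f_0 is the external force vector."""
    norm_i = math.sqrt(sum(x * x for x in f_i))
    norm_0 = math.sqrt(sum(x * x for x in f_0))
    # Multiply through by norm_0 to avoid dividing by it.
    return norm_i < tol * norm_0
```

Writing the test as norm_i < tol * norm_0 avoids a division and behaves sensibly when the external force norm is tiny.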
It has been shown through the numerical examples studied herein that the proposed IMP-MMJ approach can effectively improve the parallel efficiency of nonlinear dynamic finite element analysis with direct substructuring.
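The 58% and 42% savings quoted for the HIGH60 example follow directly from the total elapsed times in Table 7; as a quick arithmetic check:

```python
# HIGH60 total elapsed times from Table 7, in seconds
metis, imp_mnm, imp_mmj = 16726.0, 12126.0, 7055.0

saving_vs_metis = 1.0 - imp_mmj / metis
saving_vs_mnm = 1.0 - imp_mmj / imp_mnm
print(f"{saving_vs_metis:.0%} vs METIS, {saving_vs_mnm:.0%} vs IMP-MNM")
# prints "58% vs METIS, 42% vs IMP-MNM"
```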

5 Conclusions In this study, an iterative mesh partitioning approach, the IMP-MMJ method, was proposed for better load balance in parallel nonlinear dynamic finite element analysis employing direct substructuring. The IMP-MMJ method embeds a workload prediction phase, which employs the MSMD matrix ordering sub-program in SPOOLES to estimate the operation counts of substructure condensation and, in turn, predict the substructure workloads. It also employs an iterative strategy to adjust the mesh partitioning result through element weights tuning. For partitioning and adjustment, IMP-MMJ adopts JOSTLE as the partitioning kernel, which can locally tune the partitioning result from the previous iteration. The numerical experiments studied herein indicate that the IMP-MMJ approach achieves a superior load balance of the substructure condensation and renders the parallel substructure method more efficient than the METIS and IMP-MNM methods in most cases. Although the IMP-MMJ method requires considerable partitioning time, it should be worthwhile for nonlinear dynamic finite element analysis, as the mesh partitioning is performed only once at the outset while numerous numerical computations within the substructure condensations and interface system solutions may be saved during the analysis.

References
Al-Nasra M, Nguyen DT (1991) An algorithm for domain decomposition in finite element analysis. Comput. Struct. 39: 277–289
Ashcraft CC, Grimes RG, Peyton BW, Simon HD (1987) Progress in sparse matrix methods for large linear systems on vector supercomputers. Int. J. Supercomput. Appl. 1(4): 10–30
Ashcraft C, Pierce D, Wah DK, Wu J (1999) The Reference Manual for SPOOLES, Release 2.2: An Object Oriented Software Library for Solving Sparse Linear Systems of Equations. Boeing Shared Services Group, USA
Barnard ST, Simon H (1993) A fast multilevel implementation of recursive spectral bisection for partitioning unstructured problems. Proceedings of the 6th SIAM Conference on Parallel Processing for Scientific Computing, pp. 711–718
Bathe KJ (1996) Finite Element Procedures. Prentice-Hall, USA
Diekmann R, Preis R, Schlimbach F, Walshaw C (2000) Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM. Parallel Comput. 26(12): 1555–1581
Farhat C (1988) A simple and efficient automatic FEM domain decomposer. Comput. Struct. 28: 579–602
Farhat C, Simon HD (1995) TOP/DOMDEC: a software tool for mesh partitioning and parallel processing. Comput. Syst. Eng. 6(1): 13–26
Farhat C, Wilson E (1988) A parallel active column equation solver. Comput. Struct. 28: 289–304
Farhat C, Roux FX (1991) A method of finite element tearing and interconnecting and its parallel solution algorithm. Int. J. Numer. Meth. Eng. 32: 1205–1227
Fox GC (1988) A review of automatic load balancing and decomposition methods for the hypercube. In: Schultz M (ed) Numerical Algorithms for Modern Parallel Computer Architectures – The IMA Volumes in Mathematics and Its Applications, Vol. 13, pp. 63–77
George A, Liu JWH (1981) Computer Solution of Large Sparse Positive Definite Systems. Prentice Hall, New Jersey, USA
Han TY, Abel JF (1984) Substructure condensation using modified decomposition. Int. J. Numer. Meth. Eng. 20(11): 1959–1964
Hsieh SH, Paulino GH, Abel JF (1995) Recursive spectral algorithms for automatic domain partitioning in parallel finite element analysis. Comput. Meth. Appl. Mech. Eng. 121: 137–162
Hsieh SH, Paulino GH, Abel JF (1997) Evaluation of automatic domain partitioning algorithms for parallel finite element analysis. Int. J. Numer. Meth. Eng. 40(6): 1025–1051
Hsieh SH, Abel JF (1993) Use of networked workstations for parallel nonlinear structural dynamic simulations of rotating bladed-disk assemblies. Comput. Syst. Eng. 4(4–6): 521–530
Hsieh SH, Yang YS, Hsu PY (2002) Integration of general sparse matrix and parallel computing technologies for large-scale structural analysis. Comput.-Aided Civil Infrastruct. Eng. (accepted)
Hsieh SH, Yang YS, Tsai PL (1999) Improved mesh partitioning for parallel substructure finite element computations. Proceedings of the 7th East Asia-Pacific Conference on Structural Engineering and Construction, pp. 123–128
Karypis G, Kumar V (1998) METIS: a software package for partitioning unstructured graphs, partitioning meshes, and computing fill-reducing orderings of sparse matrices, version 4.0. Technical report, Department of Computer Science/Army HPC Research Center, MN, USA

Liu JWH (1985) Modification of the minimum-degree algorithm by multiple elimination. ACM Trans. Math. Softw. 11(2): 141–153
Liu JWH (1989) On the minimum degree ordering with constraints. SIAM J. Sci. Comput. 10: 1136–1145
Malone JG (1988) Automatic mesh decomposition and concurrent finite element analysis. Comput. Meth. Appl. Mech. Eng. 70: 27–58
Message Passing Interface Forum (1994) MPI: a message-passing interface standard. Int. J. Supercomput. Appl. 8(3/4): 159–416
Nikishkov GP, Makinouchi A, Yagawa G, Yoshimura S (1996) Performance study of the domain decomposition method with direct equation solver for parallel finite element analysis. Comput. Mech. 19: 84–93
Papadrakakis M, Tsompanakis Y (1999) Domain decomposition methods for parallel solution of shape sensitivity analysis problems. Int. J. Numer. Meth. Eng. 44: 281–303
Poole EL, Knight NF Jr, Davis DD Jr (1992) High-performance equation solvers and their impact on finite element analysis. Int. J. Numer. Meth. Eng. 33: 855–868
Pothen A (1997) Graph partitioning algorithms with applications to scientific computing. In: Keyes DE, Sameh AH, Venkatakrishnan V (eds) Parallel Numerical Algorithms. Kluwer Academic Press, Netherlands
Pothen A, Simon H, Liou KP (1990) Partitioning sparse matrices with eigenvectors of graphs. SIAM J. Matrix Anal. Appl. 11(3): 430–452
Simon HD (1991) Partitioning of unstructured problems for parallel processing. Comput. Syst. Eng. 2(2/3): 135–148
Smith BF, Bjorstad PE, Gropp WD (1996) Domain Decomposition: Parallel Multilevel Methods for Elliptic Partial Differential Equations. Cambridge University Press, New York, USA
Tsai PL (1999) Optimized mesh partitioning for parallel substructure finite element analysis. Master thesis, Department of Civil Engineering, National Taiwan University, Taiwan, R.O.C. (in Chinese)
Vanderstraeten D, Keunings R (1995) Optimized partitioning of unstructured finite element meshes. Int. J. Numer. Meth. Eng. 38: 433–450
Vanderstraeten D, Farhat C, Chen PS, Keunings R, Ozone O (1996) A retrofit based methodology for the fast generation and optimization of large-scale mesh partitions: beyond the minimum interface size criterion. Comput. Meth. Appl. Mech. Eng. 133: 25–45
Walshaw C (1999) Serial Jostle Library Interface: Version 1.1.8. Technical report, School of Computing & Mathematical Sciences, University of Greenwich, London, UK
Walshaw C, Cross M, Diekmann R, Schlimbach F (1999) Multilevel mesh partitioning for optimizing domain shape. Int. J. High Perform. Comput. Appl. 13(4): 334–353
Wawrzynek PA (1991) Discrete modeling of crack propagation: theoretical aspects and implementation issues in two and three dimensions. PhD dissertation, Cornell University, USA
Yang YB, Kuo SR (1994) Theory and Analysis of Nonlinear Framed Structures. Prentice Hall, USA
Yang YS, Hsieh SH (1997) Some experiences on parallel finite element computations using IBM/SP2. Proceedings of the 7th KAIST-NTU-KU Tri-lateral Seminar/Workshop on Civil Engineering, Kyoto, Japan, pp. 63–68
Yang YS, Hsieh SH, Chou KW, Tsai IC (1998) Large-scale structural analysis using general sparse matrix technique. Proceedings of the Eighth KKNN Seminar on Civil Engineering, Singapore, pp. 260–265
Yang YS (2000) Parallel computing for nonlinear dynamic finite element structural analysis with general sparse matrix technology. PhD Dissertation, Department of Civil Engineering, National Taiwan University, Taiwan, R.O.C.
