An Efficient Parallel Dynamics Algorithm for Simulation of Large Articulated Robotic Systems Kishor Bhalerao Department of Mechanical Engineering, The University of Melbourne, Australia Email: [email protected] James Critchley, Continental Corporation, Email: [email protected] Kurt Anderson Department of Mechanical, Aerospace and Nuclear Engineering, Rensselaer Polytechnic Institute, USA Email: [email protected]

Abstract

1

This paper presents a new parallel computer algorithm for the simulation of large articulated robotic systems. The method employs the Divide and Conquer Algorithm (DCA) multibody methodology and achieves significant increases in speed by using a variation of the Articulated body Algorithm (ABA) to efficiently construct the DCA subsystems. The method outperforms contemporary parallel dynamics algorithms for serial topologies and degenerates to the ABA in absence of parallel resources. The implementation and comparison with the ABA demonstrates useful speed increases with as few as two parallel processors and speeds ups by a factor of 2.25 in the presence of four processors for serial manipulators. The paper also addresses the practical limitations on obtainable speedups due to the interprocessor communication costs and its consequences on choosing optimal DCA subsystems.

Developing efficient dynamics algorithms has long been the subject of robotics research with the recent efforts being focused on methods to exploit multi-core architectures. There is a lot to consider when choosing a parallel computer implementation to solve a dynamics problem. Ultimately, the end user is interested in speed and resource consumption. However, scalability, adaptability, and portability are also important factors when authoring optimized software. From a resource perspective (CPU cycles), the most efficient algorithms are the sequential linear complexity O(n) propagation methods [Featherstone, 2008, Jain, 1991, 2010] (where n is the number of degrees of freedom). The efficiency and the relative ease of implementation has resulted in the O(n) methods finding applications in a wide range of areas beyond the traditional domain of robotics [Flores et al., 2011, Sherman et al., 2011]. This family of O(n) methods perform recursive propagations from body to body and therefore are highly sequential in the computation of serial topolo1

Introduction

gies. This means they have little native concurrency and are not amenable to general purpose parallel computer implementations. However, the independent branches of articulated systems offers significant concurrency which is trivially exploited in parallel computer methods. Reference [Hwang et al., 1990] describes an early effort deploying this type of basic O(n) parallelism to a real-time vehicle simulation In 1995 Fijany et al. introduced the Constraint Force Algorithm (CFA) [Fijany et al., 1995] to address the lack of efficient concurrency in chain systems. While other researchers where deploying parallel resources to speed-up traditional O(n3 ) solutions, the authors demonstrated that only optimal order resource algorithms would provide scalable benefits and merit general purpose implementation. The CFA was the first such method achieving O(log(n)) time complexity with O(n) parallel processors for serial chains (the most difficult case). The CFA relies heavily on the block diagonal nature of the multibody chain system’s linear algebra problem to achieve logarithmic binary tree type concurrency. The notion of block diagonal systems allowed the CFA to later be generalized to accommodate all system topologies [Featherstone and Fijany, 1999] but the general algorithm could only accommodate short branches and loops before degrading concurrency and losing logarithmic complexity. To address the lack of general system solutions in logarithmic complexity algorithms, Anderson and Duan developed the Hybrid Direct Iterative Algorithm (HDIA) [Anderson and Duan, 1999, 2000]. The method partitions large multibody systems into subsystems which run the efficient O(n) solution with the addition of unknown constraint forces which are solved in exactly the same manner as the common O(n) kinematic loop solution. However, the method requires the solution of a global constraint equation which scales with desired concurrency. Rather than accept an O(n3c ) performance (where nc represents the number of constraints which have been “cut” in order to achieve the desired level of parallelism), a parallel iterative pre-conditioned conjugate gradient method was applied. This demands an iterative solution for larger problems and the method does not achieve strictly logarithmic performance.

Featherstone’s Divide and Conquer Algorithm (DCA) [Featherstone, 1999a,b] was the first algorithm which provides a time optimal O(log(n)) speed per integration time step with a resource optimal O(n) processors for a very broad class of problems. The DCA is approximately 4 times slower than the fastest sequential O(n) algorithm. This implies that for a theoretically optimal parallel implementation, the DCA requires 4 processors to break even with the O(n) approach. The DCA is very popular and has been extended in numerous ways to include the Assembly Disassembly Algorithm (ADA) [Yamane and Nakamura, 2002] and the Orthogonal Complement Based Divide and Conquer Algorithm (ODCA) [Mukherjee and Anderson, 2007]. These two methods perform similar operations as that of the DCA and as such they have comparable computational speed [Yamane and Nakamura, 2009]. In recognition of the disconnect between algorithms of low computational order and those which could provide useful results (for practical systems and common computers), Critchley and Anderson developed two logarithmic complexity methods which outperform the fastest sequential algorithms with just two parallel processors. These are Recursive Coordinate Reduction Parallelism (RCRP) [Critchley and Anderson, 2004] and Efficient DCA (DCAe) [Critchley, 2005]. The RCRP is based on a recursive solution for loop closure constraints and requires a complicated accounting system to avoid formulation singularities and as such might never be used for a general dynamics engine. The DCAe on the other hand uses an O(n) algorithm to more efficiently construct DCA subsystems at the requisite level of concurrency. The DCAe requires one propagation sweep per handle to construct the DCA subsystems and achieves an improvement over the O(n) algorithm by a factor of 1.20 and 1.80 for serial chains with 2 and 4 processors, respectively [Critchley et al., 2009] (and continues to scale in this manner). The method [Bhalerao, 2010] presented in this paper derives the idea of using an O(n) algorithm to construct the DCA subsystems from the DCAe. The method applies the Articulated Body Algorithm (ABA) to calculate the handle equations of a DCA subsystem and as such is referred to as the DCA2

ABA. Unlike DCAe, excess propagation sweeps are avoided. Additionally, any existing general purpose sequential dynamics engines driven by the ABA (or any of the O(n) methods) can be readily modified to use the DCA-ABA. To support the development of this new algorithm involving the combination of the DCA and ABA algorithms, both methods are briefly reviewed in section 2. Section 3 then derives and describes the DCAABA and offers additional insight into applying the method. A discussion of the parallel implementation and the associated numerical and parallel timing results follows in section 4. Finally, the work is summarized with conclusions in section 5.

2

DCA consists of two process assembly and disassembly which must be performed twice. The first assembly-disassembly sweep is for the system kinematics while the second is for the system kinetics. This section discusses the parallel kinetics sweep while the parallel kinematics sweep is discussed in section 3.3.1. The assembly operation between two kinematically connected subsystems involves expressing the dynamics of the assembled subsystem in terms of spatial quantities associated with the mutually exclusive kinematic joints or handles. On the other hand, the spatial quantities and the time-derivatives of the generalized speeds associated with the handles across the connecting joint are computed during the disassembly operation. The assembly-disassembly process is described in figure 1. Bodies k and k + 1 are the leaf nodes and form a part of a larger articulated system. H1k and H2k are the two handles associated with body k. J k denotes the joint connecting bodies k and k + 1. k = [(αk )T (aki )T ]T and Fic = Let Aki k T k T T [(τic ) (fic ) ] denote the spatial acceleration of handle Hik and the spatial constraint force acting on it, respectively. Unless mentioned otherwise, all kinematic quantities are assumed to be relative to the inertial frame. The two-handle equations of motion of body k can be written as

Underlying Dynamics Algorithms

The method presented in this paper relies on the DCA and the ABA which are briefly reviewed in this section.

Divide-and-Conquer-Algorithm F2k Jk

Body k H2k

F1k

F1k+1

F2k+1 H2k+1

H1k

Assembly

Body k + 1

H1k+1

Jk F1k H1k

Disassembly

2.1

k k k k k Ak1 = ζ11 F1c + ζ12 F2c + ζ13 ,

(1a)

Ak2

(1b)

=

k k ζ21 F1c

+

k k ζ22 F2c

+

k ζ23 .

In equation (1), the inverse of the spatial mass matrix k of body k is embedded in the terms ζij . The known external forces acting on body k along with the gyrok k scopic and Coriolis terms are absorbed in ζ13 and ζ23 . A similar expression can be written for each body in the system. The goal in the assembly process is to obtain a two-handle expression for the articulated body k : k + 1 in a form similar to that of equation (1). This is done by using the kinematic relationship between spatial acceleration of handles across joint J k which can be stated as

F2k+1

H2k+1

Assembled subsystem k : k + 1 Figure 1: Divide and Conquer Algorithm

Here, the details of the DCA which are relevant k k k k to the current formulation are reproduced. InterAk+1 − Ak2 = P J u˙ J + P˙ J uJ . (2) 1 ested readers may refer to [Featherstone, 1999a,b, k Mukherjee and Anderson, 2007, Yamane and Naka- In equation (2), P J is a 6 × f k matrix, whose mura, 2002] for a more detailed discussion. The columns correspond to the spatial partial velocity 3

[Kane and Levinson, 1985] of handle H1k+1 with rek spect to the f k generalized speeds (uJ ) associated k with the joint J k . Let DJ be the orthogonal comk plement of P J such that it forms a constraint load subspace of joint J k . Then, using equation (2) and k+1 k+1 k the fact that F2c = −F1c , the expression for F1c can be written as k+1 k+1 k F1c = Wa F1c + Wb F2c + Wc ,

Wa =

k X ζ21 ,

Wb =

k+1 −X ζ12 , k

k+1 k Wc = X (ζ23 − ζ13 + P˙ J u),

X =D

Jk

Xˇ (D

Jk T

the generalized speeds associated with each joint as k k k k u˙ J = (P J )T (Ak+1 − Ak2 − P˙ J uJ ). This process 1 continues until one reaches the leaf nodes.

2.2

The ABA is an O(n) method principally for sequential implementation. Some of the references describing the ABA are [Featherstone, 1983, 2008, Bae and Haug, 1987, Anderson, 1993, Rodriguez, 1987, Jain, 2010]. The common structure in several of these methods has been described in reference [Jain, 1991]. The ABA consists of a two main processes, the inward triangularization sweep and the outward substitution sweep. In the triangularization sweep, the articulated body inertia (ABI) of each body is recursively computed while the time derivatives of the generalized speeds are obtained in the outward substitution sweep. Following the identical handle notation as that of section 2.1, each body k + 1 is connected to its parent k via the inboard handle H1k+1 . Let vector rk locate the handle H2k with respect to H1k . If vector β k locates the handle H1k+1 with respect to handle H1k , then the relationship between spatial acceleration of handles H1k and H1k+1 can be written as

(3a) (3b) (3c)

) ,

(3d)

k k k+1 k Xˇ = {(DJ )T [ζ22 + ζ11 ](DJ )}−1 .

(3e)

Equation (3) can now be used in the equations of motion associated with H1k and H2k+1 to obtain the handle equations for the assembled body k : k + 1 as k:k+1 k k:k+1 k+1 k:k+1 Ak1 = ζ11 F1c + ζ12 F2c + ζ13 ,

Ak+1 2 k:k+1 ζ11 k:k+1 ζ13 k:k+1 ζ22 k:k+1 ζ23

= = = = =

(4a)

k:k+1 k k:k+1 k+1 k:k+1 ζ21 F1c + ζ22 F2c + ζ23 , k:k+1 k k k ζ11 − ζ12 Wa , ζ12 = −ζ12 Wb , k:k+1 k+1 k k ζ13 − ζ12 Wc , ζ21 = ζ21 Wa ,

(4b)

k+1 ζ22 k+1 ζ23

(4e)

+ +

k+1 ζ21 Wb , k+1 ζ21 Wc .

Articulated Body Algorithm

(4c) (4d) (4f)

k

k

Ak+1 = (Sβk )T Ak1 + Akt + P J u˙ J . 1

Equation (4) describes the dynamics of the fictitious rigid body k : k + 1 in terms of the spatial quantities acting on the boundary handles. Note that the form of the equations (1) and (4) is identical. The internal dynamics within these bodies are still present, but are no longer in evidence. The treatment of the assembled subsystem k : k + 1 is now identical to that of the individual bodies. This process is continued until one obtains the boundary handle equations for the entire system. The resulting set of systemwide handle equations can be solved to obtain the unknown spatial quantities (accelerations and constraint loads) associated with the boundary handles of the system. During disassembly these known spatial quantities are back-substituted into the equations of individual subsystems to calculate the unknown spatial quantities at the connecting handles. Then equation (2) is used to obtain the time derivatives of

(5)

In equation (5), Sβk is a 6 × 6 spatial position cross product matrix associated with vector β k . The quantities Akt and Sβk are known from the state of the system. The Newton-Euler equations of motion for body k about the inboard handle H1k can be written as k k I k Ak1 = F k + F1c + Srk F2c . (6) In equation (6), I k is the 6 × 6 spatial inertia matrix which is constant in the frame of the body. F k is known from the state of the system and includes the gyroscopic terms and the external spatial forces acting on the body. Srk is the spatial position cross product matrix associated with vector rk . Let I˜k and F˜ k denote the ABI and the bias force of body k. If nb denotes the total number of bodies in the system, then the equations of motion for the articulated body 4

3.1

k : nb can be written as1

Motivation

k 0 = I˜k Ak1 − F˜ k − F1c , where k k k+1 ˜ k+1 k T ˜ I =I +T I (S ) ,

Disassembly

Assembly

(7a) Consider the implementation of the DCA for an important class of problems with np < nb , where np (7b) β denotes the number of processors. In such a sitF˜ k = F k + T k+1 Fˇ k , (7c) uation, the best case scenario is a DCA assemblyk k T k+1 = Sβk {U − I˜k+1 P J (Iˇk )−1 (P J )T } , (7d) disassembly tree with np leaf nodes. The handle equations of motion for these np nodes must be Fˇ k = F˜ k+1 − I˜k+1 Akt , (7e) formed sequentially by each processor. Consequently, k k Iˇk = (P J )T I˜k+1 P J . (7f) for most systems of interest sequential computation is unavoidable with the DCA as described in [FeatherIn equation (7), U is a 6 × 6 identity matrix. The stone, 1999a,b, Mukherjee and Anderson, 2007, Yaspatial constraint force acting on handle H1k in equa- mane and Nakamura, 2002]. The idea is explained tion (7a) can be eliminated by projecting the equations of motion in the joint motion subspace to obtain k−1 (P J )T {I˜k Ak1 − F˜ k } = 0. Note that the inversion of the f k × f k matrix Iˇk in equation (7d) can be (I) avoided by performing the triangularization operation through the intermediate frames. For simplicity, Split the system into 4 DCA nodes the discussion here is limited to equation (7). During the triangularization operation, the expression for k u˙ J is obtained as   k k u˙ J = −(Iˇk )−1 (P J )T I˜k+1 (Sβk )T Ak1 − Fˇ k . (8) (II) The trivial boundary condition for the start of the triangularization sweep is I˜nb = I nb and F˜ nb = F nb . On completing the inward triangularization sweep 0 and noting that A01 = 0 (0th body is ground), u˙ J can be obtained from equation (8). Using equation (5) and (8), the spatial acceleration of inboard handle of each body and the time derivatives of the generalized speeds associated with all the joints in the system are computed in the outward substitution sweep. Figure 2: Assembly-Disassembly tree for a quad-core This completes a brief overview of the two under- CPU lying algorithms. schematically in figure 2 assuming that np = 4. Consider an articulated system shown in figure 2(I). Since nb > np , the first step is to split the given system into 3 A Hybrid DCA-ABA np subsystems. Each subsystem is sequentially conThis section presents the hybrid parallel dynamics verted to a DCA node by the np processors. Thus, algorithm which is of O(nb ) and O(log(nb )) compu- the DCA assembly-disassembly tree has just four leaf tational complexity for sequential and parallel imple- nodes. Although the DCA is an O(nb ) algorithm for sementation, respectively. quential implementation, it is approximately four 1 In some references (e.g. [Anderson, 1993]), the term Ak t times slower than the ABA. Additionally, DCA is (see equation (7e)) does not appear explicitly. The differences in the computational cost for both the methods however is less accurate than ABA [Featherstone, 1999b]. This motivated the use of the ABA instead of the DCA to negligible. This issue is further discussed in section 3.3.1. 5

J i−1 and J i , respectively. The goal is to convert the subsystem k + 1 : k + ni to a DCA node i using the ABA. For the ease of notation let the node i consist ni 1 of bodies 1 : ni . Additionally, let F1c and F2c denote the spatial constraint forces acting on the boundary handles H11 and H2ni , respectively. Let the body 1 of

calculate the handle equations of DCA subsystems in the assembly-disassembly tree. Before proceeding to the description of the DCAABA, it is worth elaborating on the choice of using the ABA for computing the handle equations. Indeed, one could use any forward dynamics formulation to calculate the handle equations. The ABA is known to be faster than other sequential techniques for number of bodies greater than 8 [Stelzle et al., 1995]. Thus the more traditional O(n3b ) methods may be employed to calculate the handle equations of DCA subsystems with less than 8 bodies. The efficiency of forward dynamics algorithms is usually compared for serial manipulators. However, the traditional O(n3b ) techniques can match the ABA in efficiency for much larger systems when the branch induced sparsity is exploited in the factorization of Joint Space Inertia Matrix [Featherstone, 2005]. Thus an optimal way to compute the handle equations of motion is to use the fastest algorithm for a given subsystem. However, the size of an articulated subsystem for an equal processor load balance is dependent on the dynamics algorithm chosen to generate the DCA node equations. Additionally, the programming efforts required for generating a theoretically optimal implementation are disproportionately large compared to the benefits obtained from such hybrid formulations. In this paper, the systems of interest are large. The ABA outperforms other dynamics algorithms for a large class of problems and thus the method of choice for this discussion of the hybrid formulation.

3.2

J i−1 i−1 k+1

Ji k + ni

i+1

ni

1

ni F2c

1 F1c

Ji

J i−1 i−1

i+1

i

A DCA node ni : Number of bodies in the ith DCA node Figure 3: Convert subsystem k + 1 : k + ni to node i the subsystem i be chosen as the root node for the ABA. On performing the triangularization operation, the equations of motion for body 1 can be written as 1 0 = I˜1 A11 − F˜ 1 − F1c −{

ni Y

ni T k }Srni F2c .

(9)

k=2

In equation (9), the expressions for I˜1 , F˜ 1 and T k are obtained from equation (7). Multiplying equation (9) by (I˜1 )−1 , the expression for A11 can be obtained as 1:ni 1 1:ni ni 1:ni A11 = ζ11 F1c + ζ12 F2c + ζ13 .

Handle Equations of a DCA Node

(10)

A similar expression is required for An2 i which can be obtained in two different ways. The underlying principle in each method is identical. The goal is to ni express An2 i as a linear function of A11 and F2c using kinematics of the system and the intermediate results obtained during the inward triangularization sweep. An1 i and An2 i are trivially linear in each other and the 3.2.1 Unbranched Topology discussion in the rest of this section is limited to obConsider a serial chain as shown in figure 3. Sub- taining an expression for An1 i . Once this expression is system i consists of bodies k + 1 : k + ni and is con- obtained, equation (10) can be used to get the handle nected to adjacent DCA nodes i−1 and i+1 via joints equation of motion associated with H2ni . In order to replace a given subsystem by a single DCA node one needs to express the spatial acceleration of the boundary handles solely in terms of the spatial constraint forces acting on them. This process is next discussed for different topologies.

6

Both of the methods require an identical manipulation to derive the recursive relations. This involves k eliminating u˙ J from equation (5). During the inward triangularization operation, the expression for k u˙ J is obtained as

Modified Inward Triangularization Sweep

Another equivalent set of operations can be performed during the inward triangularization sweep. Assume that An1 i can be written as a linear function of Ak+1 and F2ni as An1 i = Ψk+1 Ak+1 + Ψk+1 F2ni + 1 1 1 2 k+1 k+1 k k Ψ3 . For k = ni − 1, Ψi , i = {1, 2, 3} are trivu˙ J = −(Iˇk )−1 (P J )T [ I˜k+1 (Sβk )T Ak1 − Fˇ k ially obtained. Now the goal is to obtain a similar ni − Z k F2c ], (11a) relationship between An1 i and Ak1 . As before, the reni quired expression for An1 i is obtained using equation Y Zk = ( (11b) (12) as T p )Srni . p=k+2

An1 i = Ψk1 Ak1 + Ψk2 F2ni + Ψk3 ,

Using equation (11), equation (5) may be rewritten as Ak+1 = (T k+1 )T Ak1 + X˜ k Z k F2ni + 1 Akt + X˜ k Fˇ k ,

Ψk1 Ψk2 Ψk3

(12a)

= = =

Ψk+1 (T k+1 )T , 1 Ψk+1 X˜k Z k + Ψk+1 , 1 2 k+1 k k ˇk ˜ Ψ1 [At + X F ] +

(14a) (14b) (14c)

Ψk+1 . 3

(14d)

Thus, using the recursive relationship given in (12b) equations (13) or (14), Ani is obtained as a linear 1 function of A11 and F2ni . From this and using equak k+1 The matrices X˜ and T are obtained as inter- tion (10), the required expression for Ani is obtained 2 mediate results during the inward triangularization as sweep (see equation (7)) and need not be recom1:ni 1 1:ni ni 1:ni puted. Additionally, the quantity (Akt + X˜k Fˇ k ) must An2 i = ζ21 F1c + ζ22 F2c + ζ23 . (15) be computed as a part of the outward substitution sweep. Thus, the only additional computation in Equations (10) and (15) constitute the handle equaobtaining equation (12) involves the matrix product tions of motion for the subsystem which can now be X˜ k Z k . Next, elementary mathematical induction is used in a DCA framework. used to obtain the relationship between An1 i and A11 . k k X˜ k = (P J )(Iˇk )−1 (P J )T .

3.2.2

Branched Topology

Modified Outward Substitution Sweep J2

Assume that Ak1 can be written as Ak1 = Φk1 A11 + ni Φk2 F2c + Φk3 . For k = 1, Φki , i = {1, 2, 3} are trivially obtained. Now, the goal is to obtain Ak+1 in a similar 1 form. Using equation (12), Ak+1 is obtained as 1

J1

I III

ni Ak+1 = Φk+1 A11 + Φk+1 F2c + Φk+1 , 1 1 2 3

(13a)

Φk+1 = (T k+1 )T Φk1 , 1

(13b)

Φk+1 = (T k+1 )T Φk2 + X˜ k Z k , 2 Φk+1 = (T k+1 )T Φk3 + Akt + X˜k Fˇ k .

(13c)

3

II

J3 Figure 4: A branched subsystem

(13d)

The process described in section 3.2.1 can be trivUsing the recursive relationships given in equation ially extended to convert a branched subsystem into (13), An1 i is also obtained in a form similar to that of a DCA node. To obtain the ABI of a body, the recursive relationships given in equation (7) are now equation (13a). 7

This completes the discussion of forming DCA nodes from a given subsystem using the ABA. From this point, the DCA-ABA performs similar manipulations as that of the DCA and the discussion on modeling algebraic constraints given in references [Featherstone, 1999b, Mukherjee and Anderson, 2007] are directly applicable to the DCA-ABA. Thus, the DCAABA can also be used for articulated systems with kinematic loops.

summed up over the immediate children. Further details can be found in references [Featherstone, 2008, Jain, 2010]. Now, consider a subsystem as shown in figure 4. The subsystem consists of three (I, II and III) chains. The dotted lines within each chain represent an arbitrary number of kinematically connected bodies. This subsystem is connected to the other DCA nodes via joints J1 , J2 and J3 . During the inward triangularization sweep, the spatial constraint forces acting on the boundary handles are propagated to the root node which gives one of the three required handle equations of motion. The other two are obtained by using either the modified inward or outward sweeps discussed in section 3.2.1. 3.2.3

3.3

Algorithm Overview

The overall algorithm involves 4 sweeps of the system. The first two sweeps are for the system kinematics while the remaining two deal with system kinetics.

Unbranched Topology: A Special Case 3.3.1 Node i

1 F1c

The kinematics sweep involves updating the rotation matrices, body fixed vectors, linear and angular velocities of all the bodies. These are the quantities one must calculate to generate an arbitrary state dependent force acting on any body. Details of parallel kinematics can be found in references [Critchley et al., 2009, Lathrop, 1985] and the idea is only briefly discussed here. The given system is divided into np subsystems as a first step. Initially, the kinematic quantities are updated on each subsystem in parallel. At this point, the subsystem which connects the system to the inertial frame requires no further calculation and is finalized. This is followed by an assembly-disassembly operation at the end of which the kinematic quantities associated with the parent of the root body of each node are finalized. Using this boundary condition, the kinematic quantities for the np − 1 nodes are obtained. The equations of motion for the body k (see equation (7a)) involve the spatial acceleration of the inboard handle. It is also possible to obtain a similar set of equations in terms of acceleration expressed purely as a function of the unknown time derivatives of the system generalized speeds [Anderson, 1993, Critchley et al., 2009]. This requires the computation of cumulative gyroscopic and Coriolis terms during the kinematics sweep. Details of the parallelization

ni F2c

1 F1c

1 0 0 1

Root

Parallel Kinematics Sweep

ni F2c

Triangularization Substitution Figure 5: Two processors per DCA node The method described in sections 3.2.1 and 3.2.2 results in np DCA nodes. It is also possible to exploit the O(n) parallelism for chain topologies. In the ABA, independent branches can be trivially propagated in parallel. This fact can be exploited to n directly reduce a given system to 2p DCA nodes. The idea is explained in figure 5. The root node for the chain subsystem is chosen to be approximately equidistant from the boundary handles. This results in a simple branched topology. The two branches can now be propagated independently by two pron cessors. Such an approach results in 2p DCA nodes n for np processors and is referred to as (np , 2p ). The np (np , 2 ) approach is faster than the (np , np ) DCAABA described previously. This is because the ren quired transition from np nodes to 2p nodes using DCA is improved by employing the faster ABA. 8

4.1

of this approach can be found in reference [Critchley et al., 2009]. The method presented in this paper uses actual accelerations which results in the term Akt (see equation (7e)) in the equations. This quantity is local to each joint and may be trivially computed in parallel as a part of the kinematics sweep as soon as the spatial velocity of a body is finalized.

3.3.2

Sequential Formation of a DCA Node

The main difference between the proposed DCAABA and the DCA is the method used to compute the handle equations of motion of a given subsystem. Figure 6 gives a time comparison between the two methods for the kinetics sweep. The kinetics sweep for the DCA includes the cost of forming two-handle equations of motion at body level. For a subsystem

Parallel Kinetics Sweep

The kinetics sweep involves computing the NewtonEuler equations of motion for each rigid body representing a node following which the handle equations for np nodes can be obtained using the modified inward triangularization sweep (or inward triangularization followed by a modified outward substitution) of the ABA. This is followed by the DCA assembly-disassembly operation at the end of which the u˙ associated with the connecting joints of the np nodes are obtained. The remaining u˙ are obtained during the outward substitution sweep of the ABA.

0.9

Constrained Unconstrained at Handle 2 Constrained Unconstrained at Handle 2

Time (ms) per Kinetics Sweep

0.8 0.7

DCA

0.6 0.5 0.4 0.3

DCA−ABA 0.2 0.1 0

4

Numerical Examples

0

50

100

150

200

250

300

Number of Bodies

Figure 6: Time comparison for a 3D subsystem

The algorithms are implemented in C++ and compiled using gcc 4.4.3 with optimization. All the recursive equations during the parallel kinematics and kinetics sweep are symbolically optimized. The operating system is Linux with a 2.66 GHz Intel quad core and a 4GB RAM, unless otherwise mentioned. All the timing studies report an average of 105 evaluations (an average of 10 runs each involving 104 evaluations) of the quantity of interest. The chosen system is a serial 3D n-body manipulator with revolute joints. It should be noted that DCA works best with a serial topology while the O(n) parallelism works best with articulated mechanisms. Thus for articulated mechanisms such as humanoids, the O(n) parallelism suffices and is expected to be faster than the DCA. However, the usual benchmarking of dynamics algorithms is performed for serial chains and the discussion in the rest of this section is limited to a serial topology.

constrained at the terminal handles the DCA-ABA is on an average faster than the DCA by a factor of 2.1. For a terminal subsystem with an unconstrained boundary handle, the cost of kinetics sweep is lower for both the algorithms since the calculations associated with the unconstrained handle 2 can now be ignored. For such a subsystem the DCA-ABA effectively performs identical operations as the ABA in addition to the matrix operations required to calculate equation (10) from equation (9). For the unconstrained case, the DCA-ABA is expected to be approximately 4 times faster than the DCA. In the reported simulation times the speedup achieved by the DCA-ABA is 3.75 for a 256 body chain. Thus, for a 3D n-body chain with single degree of freedom joints, the DCA-ABA is always expected to outperform the DCA. This fact is next demonstrated using 9

a parallel implementation.

sequential ABA implementation is used for comparison. Even with 4 processors, the traditional implementation of the DCA cannot outperform a sequen4.2 Parallel Implementation tial ABA. On the other hand, the DCA-ABA always A highly optimized multi-threaded implementation outperforms both the DCA and the ABA with 4 proof the DCA and the DCA-ABA is developed using cessors. The DCA-ABA gives its best performance the POSIX standard. Semaphores are used for syn- when x2 = κx1 , κ ∈ {2.2, 2.3}. For a system with chronization of the threads. 512 bodies, the best runs of the DCA-ABA are 1.4 and 2.25 times faster than the sequential ABA on 2 4.2.1 Load Balancing and 4 processors, respectively. For an optimal implementation, the load balance between different threads has to be equal. The terminal subsystem being unconstrained at one handle has a different computational cost associated with its kinetics sweep (see figure 6). Thus, efficiency of the algorithms is a function of the number of bodies selected for the terminal subsystem. For np processors, the system is divided into np − 1 subsystems consisting of x1 bodies each while the terminal subsystem consists of x2 bodies. The timing studies reported in figure 6 show that an unconstrained DCA-ABA subsystem is approximately twice as fast as a constrained subsystem of a comparable size. This suggests the general rule x2 ≈ 2x1 for an even load balance.

4.2.2

Next, the minimum size of the system for which a parallel implementation is useful is investigated. Figure 8 gives the time comparison of the kinetics sweep between a sequential ABA and a parallel DCA-ABA. It should also be noted that the timing results for 2 processors are included primarily to demonstrate the scaling of the algorithm. O(n) parallelism is expected to exceed the DCA and the DCA-ABA in performance for all configurations with 2 processors. A DCA framework is necessary only in presence of 4 or more processors for a serial manipulator. 0.4

Sequential ABA DCA−ABA (4 processors) DCA−ABA (2 processors)

0.35

Time (ms) per Kinetics Sweep

1.4

DCA−ABA (2 processors) DCA−ABA (4 processors) DCA (4 processors) Reference Sequential ABA

1.2

Time (ms) per Kinetics Sweep

Scaling of the Algorithm

1

0.8

0.6

0.4

0.3

0.25

0.2

0.15

0.1

0.05

0.2

0

0

100

200

300

400

500

600

Number of Bodies 0 10

20

30

40

50

60

70

80

% Bodies in a Terminal Sub−System

Figure 7: 3D system with 512 bodies

90

Figure 8: Time Comparison of a Sequential ABA and a Parallel DCA-ABA

For the timing study, the ratio between x2 and x1 is Figure 7 reports the average time per kinetics maintained constant at 2.1. It is seen that for smaller sweep for each algorithm. The execution time of a systems with less than 32 bodies, the ABA is faster 10

parallel. This is followed by the assembly-disassembly operation which is performed sequentially. The kinetics sweep is then completed by disassembling the 8 subsystems in parallel. During the sequential assembly-disassembly operation, the outermost DCA subsystem is free at the terminal handle. This subsystem is always chosen as a child during the sequential assembly and the calculations corresponding to terminal handle can now be ignored. Such an approach is 4.2.3 Assembly -Disassembly Tree referred to as {4-1} and {8-1} for 4 and 8 processors, For a truly logarithmic complexity algorithm it is respectively. crucial to get as close as possible to an ideal binary assembly-disassembly tree. Indeed, getting close to 1.4 such a tree for an arbitrary articulated mechanism Sequential ABA is a challenging problem and has been addressed in 1.2 {8−4−2−1} Tree some detail in references [Featherstone, 1999b, Ya{8−1} Tree {4−2−1} Tree 1 mane and Nakamura, 2007]. The challenges are two{4−1} Tree fold. Time (ms) per Kinetics Sweep

than a parallel DCA-ABA. This parallel implementation suggests that, in the least, 8 bodies must be assigned per processor to break even with the sequential implementation. As the system size increases the DCA-ABA starts outperforming the ABA and for a system with 512 bodies, a 4 processors DCA-ABA is 2.25 times faster than the ABA.

1. The load balance between the processors while forming the DCA subsystems from an arbitrary articulated system must be equal. 2. Any articulations to the DCA subsystems potentially kills the parallelism during the assemblydisassembly phase. Hence the generated DCA subsystems must be as “flat” or simply connected as possible. However, the results of section 4.2.2 suggests that at lower processor loads, the communication costs between the threads are significant. During the assembly-disassembly operation, each thread performs a single assembly (or disassembly) before requiring a synchronization or communication with another active thread. This communication between threads must be considered as an additional cost incurred by the algorithm and has important consequences on choosing an optimal assembly-disassembly tree for an arbitrary articulated mechanism. Ideal binary trees with 4 and 8 processors are represented as {4-2-1} (see figure 2) and {8-4-2-1}, respectively. One possibility to avoid the frequent communication between threads during the assembly-disassembly phase is to perform this operation sequentially. For example consider the case with 8 processors. The 8 DCA subsystems are first formed in

0.8

0.6

0.4

0.2

0

0

200

400

600

800

1000

1200

Number of Bodies

Figure 9: Comparison of a Sequential and a Parallel Assembly-Disassembly Operation Figure 9 gives a comparison of the different trees (Benchmarked on a 16 core AMD 2.33 GHz system). It is seen that the benefits of parallelism obtained from an ideal binary tree are lost due to the greater communication costs between threads. This suggests that the primary benefits obtained in the parallel algorithms are in efficient formation of the DCA subsystem and not in the actual assembly-disassembly operation. Thus this result further motivates the use of a fast method on subsystems as opposed to the traditional DCA. 4.2.4

Parallel Kinematics

The cost of kinematics is usually much lower than that of the kinetics. However, as the size of the

11

system increases, the cost of kinematics are also expected to be significant. In this section the parallel kinematics sweep described in section 3.3.1 is benchmarked. The assembly-disassembly operation during the kinematics sweep is much cheaper than the kinetics sweep. Hence, based on the results obtained in section 4.2.3, this operation is performed sequentially. In the kinematics sweep, the innermost subsystem must be processed just once while the rest are processed twice. As was the case with kinetics, the system is divided into np − 1 subsystems each with x1 number of bodies. The innermost subsystem has x2 number of bodies. The efficiency of parallel kinematics is benchmarked as a function of x2 for 1024 bodies and shown in figure 10(a). The ratio between x1 and x2 which gives the best performance is kept constant while testing the scaling of the parallel kinematics sweep which is reported in figure 10(b). As expected, for smaller systems with less than 32 bodies, the parallel kinematics sweep is slower than the sequential operation. However, the 4 processor parallel sweep is 2.04 times faster than the sequential kinematics for a system with 512 bodies.

5

Time (ms) per Kinematics Sweep

0.35

0.3

0.25

0.2

0.15

0.1

0

10

20

30

40

50

60

70

80

90

% Bodies in Root Sub−System

(a) Kinematics Sweep for 1024 Bodies

Time (ms) per Kinematic Sweep

0.12

Conclusions

1 Processor 2 Processors 4 Processors

0.1

0.08

0.06

0.04

0.02

0

This paper presented a novel approach to compute the handle equations of a DCA subsystem resulting in a new dynamics algorithm DCA-ABA. The DCAABA is of O(nb ) and O(log(nb )) computational complexity for sequential and parallel implementation, respectively. The new approach involves using a variation of ABA on subsystems. The actual assemblyassembly operation is similar to that of the DCA. The paper also addressed the effect of the communication cost between processors on the efficiency of parallel algorithms. The important conclusions are briefly summarized below. The DCA-ABA outperforms the DCA due to the use of a faster ABA for computing handle equations. For a 3D serial manipulator with 512 revolute joints, the DCA-ABA is 2.25 times faster than the ABA when implemented on 4 processors. For smaller systems the communication costs between the proces-

1 Processor 2 Processors 4 Processors

0

100

200

300

400

500

600

Number of Bodies

(b) Scaling of Parallel Kinematics

Figure 10: Parallel Kinematics

sors outweigh the advantages of parallel computation. The parallel implementation of the DCA-ABA breaks even with the sequential implementation when the number of bodies per processor is 8. On the other hand, the DCA is theoretically expected to break even with the ABA in presence of 4 processors. However, due to the additional cost associated with the synchronization of the threads, the ABA continues to outperform a 4 processor DCA. The synchronization/communication costs between the threads during the binary assemblydisassembly phase are significant compared to the

12

computational cost of assembling (or disassembling) References two DCA nodes. This implies that the primary advantage of the DCA multibody formalism is in the K.S. Anderson. An order-n formulation for the motion simulation of general multi-rigid-body tree efficient formation of DCA subsystems and not in the systems. Computers and Structures, 46(3):547– binary assembly-disassembly phase. 559, 1993. The results in section 4.2.3 suggest that even in presence of sufficient parallel resources (np = nb ), K.S. Anderson and S. Duan. A hybrid parallelizable an ideal nb node DCA binary tree will certainly be low order algorithm for dynamics of multi-rigidslower than a sequential ABA. At most n8b processors body systems: Part I, Chain systems. Mathematimay be used at which point the parallel DCA-ABA cal and Computer Modeling, 30:193–215, 1999. will break even with the sequential ABA. However, any real speed improvements justifying the additional K.S. Anderson and S. Duan. Highly parallelizable low order algorithm for the dynamics of complex multi programming effort can only be achieved when the rigid body systems. Journal of Guidance, Control number of bodies per processor are much greater than and Dynamics, 23(2):355–364, Mar.–Apr. 2000. nb 8 implying a use of fewer than 8 processors. This further motivates the use of the DCA-ABA which D.S. Bae and E.J. Haug. A recursive formulation for efficiently constructs the DCA subsystems using the constrained mechanical system dynamics: Part I, ABA. Open loop systems. Mechanisms, Structures, and Machines, 15(3):359–382, 1987. Another important consequence of the synchronization cost is in the scheduling problem for an articK.D. Bhalerao. On Methods for Efficient and Accuulated mechanism. When attempting to split an arrate Design and Simulation of Multibody Systems. ticulated system into np subsystems, it is always betPhD thesis, Rensselaer Polytechnic Institute, 2010. ter to opt for an equal load balance between the processors even if it comes at the cost of a sub-optimal J.H. Critchley. An efficient multibody divide and assembly-disassembly tree. conquer algorithm. In ASME International Design Engineering Technical Conferences, pages 149–158, From an implementation point of view, a sequen2005. tial dynamics engine driven by an O(n) method can be readily modified to use the DCA-ABA. This is be- J.H. Critchley and K.S. Anderson. Parallel logarithcause the algebraic constraints on a system are usumic order algorithm for general multibody system ally imposed at the acceleration level. This demands dynamics. Multibody Systems Dynamics, 12(1):75– the calculation of u˙ as a function of the unknown 93, 2004. spatial constraint forces. Thus, the O(n) dynamics engines which allow the modeling of kinematic loops J.H. Critchley, K.S. Anderson, and A. Binani. An efficient multibody divide and conquer algorithm already possess the capability to construct DCA suband implementation. Journal of Computational systems using an O(n) method. Consequently, the and Nonlinear Dynamics, 4(2):021004, 2009. DCA-ABA is particularly attractive when considering a general purpose parallel implementation. R. Featherstone. The calculation of robotic dynamics using articulated body inertias. International Journal of Robotics Research, 2(1):13–30, 1983. Acknowledgement

R. Featherstone. A divide-and-conquer articulated body algorithm for parallel O(log(n)) calculation of rigid body dynamics. Part 1: Basic algorithm. This work was supported by ARC Discovery Grant International Journal of Robotics Research, 18(9): number DP1093476. The authors gratefully acknowl867–875, Sep. 1999a. edge this support. 13

R. Featherstone. A divide-and-conquer articulated R. Mukherjee and K. S. Anderson. An orthogonal body algorithm for parallel O(log(n)) calculation complement based divide-and-conquer algorithm of rigid body dynamics. Part 2: Trees, loops, and for constrained multibody systems. Nonlinear Dyaccuracy. International Journal of Robotics Renamics, 48(1-2):199–215, 2007. search, 18(9):876–892, September 1999b. G. Rodriguez. Kalman filtering, smoothing, and recursive robot arm forward and inverse dynamics. R. Featherstone. Efficient factorization of the jointIEEE Journal of Robotics and Automation, 3(6): space inertia matrix for branched kinematic trees. 624–639, 1987. The International Journal of Robotics Research, 24 (6):487, 2005. M.A. Sherman, A. Seth, and S.L. Delp. Simbody: multibody dynamics for biomedical research. ProR. Featherstone. Rigid Body Dynamics Algorithms. cedia IUTAM, International Union of Theoretical Springer Verlag New York Inc, 2008. and Applied Mechanics, Elsevier Science, pages R. Featherstone and A. Fijany. A technique for ana241–261, 2011. lyzing constrained rigid-body systems, and its application to the constraint force algorithm. IEEE W. Stelzle, A. Kecskemethy, and M. Hiller. A comparative study of recursive methods. Archive of Transactions on Robotics and Automation, 15(6): Applied Mechanics, 66(1-2):9–19, 1995. 1140–1144, December 1999. A. Fijany, I. Sharf, and G.M.T. D’Eleuterio. Paral- K. Yamane and Y. Nakamura. Efficient parallel dynamics computation of human figures. In Proceedlel O(log n) algorithms for computation of manipings of ICRA ’02. IEEE International Conference ulator forward dynamics. IEEE Transactions on on Robotics and Automation, volume 1, pages 530– Robotics and Automation, 11(3):389–400, 1995. 537, 2002. S.C. Flores, M. Sherman, C.M. Bruns, P. Eastman, and R.B. Altman. Fast flexible modeling of rna K. Yamane and Y. Nakamura. Automatic scheduling for parallel forward dynamics computation of structure using internal coordinates. IEEE/ACM open kinematic chains. In Proceedings of Robotics: Transactions on Computational Biology and BioinScience and Systems, 2007. formatics, 8(5):1247–1257, 2011. R.S. Hwang, D.S. Bae, J.G. Kuhl, and E.J. Haug. K. Yamane and Y. Nakamura. Comparative study on serial and parallel forward dynamics algorithms Parallel processing for real-time dynamics systems for kinematic chains. International Journal of simulations. Journal of Mechanical Design, 112(4): Robotics Research, 28(5):622–629, 2009. 520–528, 1990. A. Jain. Unified formulation of dynamics for serial rigid multibody systems. Journal of Guidance, Control, and Dynamics, 14(3):531–542, 1991. A. Jain. Robot and Multibody Dynamics: Analysis and Algorithms. Springer Verlag, 2010. T. R. Kane and D. A. Levinson. Dynamics: Theory and Application. Mcgraw-Hill, NY, 1985. R.H. Lathrop. Parallelism in manipulator dynamics. The International Journal of Robotics Research, 4 (2):80–102, 1985. 14

An Efficient Parallel Dynamics Algorithm for Simulation ...

portant factors when authoring optimized software. ... systems which run the efficient O(n) solution with ... cated accounting system to avoid formulation singu-.

210KB Sizes 3 Downloads 385 Views

Recommend Documents

An Efficient Deterministic Parallel Algorithm for Adaptive ... - ODU
Center for Accelerator Science. Old Dominion University. Norfolk, Virginia 23529. Desh Ranjan. Department of Computer Science. Old Dominion University.

Efficient FDTD algorithm for plane-wave simulation for ...
propose an algorithm that uses a finite-difference time-domain ..... velocity is on the free surface; in grid type 2, the vertical component is on the free surface. ..... 50 Hz. The model consists of a 100-m-thick attenuative layer of QP. = 50 and QS

Memory-Efficient Parallel Simulation of Electron Beam ...
Department of Computer Science, Old Dominion University, Norfolk, Virginia 23529. † ... researchers have developed a GPU-accelerated, high-fidelity ...... [Online]. Available: http://docs.nvidia.com/cuda/cuda-samples/#bandwidth-test.

Saving Time in a Space-Efficient Simulation Algorithm
Saving Time in a Space-Efficient Simulation Algorithm. J. Markovski. Abstract—We present an efficient algorithm for computing the simulation preorder and ...

An Efficient Algorithm for Location-Aware Query ... - J-Stage
Jan 1, 2018 - location-aware service, such as Web mapping. In this paper, we ... string descriptions of data objects are indexed in a trie, where objects as well ...

An Efficient Algorithm for Clustering Categorical Data
the Cluster in CS in main memory, we write the Cluster identifier of each tuple back to the file ..... algorithm is used to partition the items such that the sum of weights of ... STIRR, an iterative algorithm based on non-linear dynamical systems, .

CSE 6730 Project Report: Parallel Molecular Dynamics Simulation
Apr 27, 2012 - Molecular dynamics simulations allow researchers to investigate phenomena emergent in systems at the ... cluster-parallelized molecular dynamics simulation (PMDS) for periodic sys- tems on the order of ..... As shown in Figures 8-10, w

VChunkJoin: An Efficient Algorithm for Edit Similarity ...
The current state-of-the-art Ed-Join algorithm im- proves the All-Pairs-Ed algorithm mainly in the follow- .... redundant by another rule v if v is a suffix of u (including the case where v = u). We define a minimal CBD is a .... The basic version of

An Efficient Algorithm for Learning Event-Recording ...
learning algorithm for event-recording automata [2] based on the L∗ algorithm. ..... initialized to {λ} and then the membership queries of λ, a, b, and c are ...

BeeAdHoc: An Energy Efficient Routing Algorithm for ...
Jun 29, 2005 - Mobile Ad Hoc Networks Inspired by Bee Behavior. Horst F. Wedde ..... colleagues are doing a nice job in transporting the data pack- ets. This concept is ..... Computer Networks A. Systems Approach. Morgan Kaufmann ...

An Efficient Algorithm for Location-Aware Query ... - J-Stage
Jan 1, 2018 - †The author is with Graduate School of Informatics, Nagoya. University .... nursing. (1, 19). 0.7 o5 stone. (7, 27). 0.1 o6 studio. (27, 12). 0.1 o7 starbucks. (22, 18). 1.0 o8 starboost. (5, 5). 0.3 o9 station. (19, 9). 0.8 o10 schoo

An Efficient Pseudocodeword Search Algorithm for ...
next step. The iterations converge rapidly to a pseudocodeword neighboring the zero codeword ..... ever our working conjecture is that the right-hand side (RHS).

An Efficient Algorithm for Monitoring Practical TPTL ...
on-line monitoring algorithms to check whether the execution trace of a CPS satisfies/falsifies an MTL formula. In off- ... [10] or sliding windows [8] have been proposed for MTL monitoring of CPS. In this paper, we consider TPTL speci- ...... Window

An Efficient Algorithm for Sparse Representations with l Data Fidelity ...
Paul Rodrıguez is with Digital Signal Processing Group at the Pontificia ... When p < 2, the definition of the weighting matrix W(k) must be modified to avoid the ...

An I/O-Efficient Algorithm for Computing Vertex ...
Jun 8, 2018 - graph into subgraphs possessing certain nice properties. ..... is based on the belief that a 2D grid graph has the property of being sparse under.

An Efficient Algorithm for Learning Event-Recording ...
symbols ai ∈ Σ for i ∈ {1, 2,...,n} that are paired with clock valuations γi such ... li = δ(li−1,ai,gi) is defined for all i ∈ {1, 2,...,n} and ln ∈ Lf . The language.

An exact algorithm for energy-efficient acceleration of ...
tion over the best single processor schedule, and up to 50% improvement over the .... Figure 3: An illustration of the program task de- pendency graph for ... learning techniques to predict the running time of a task has been shown in [5].

An Efficient Algorithm for Similarity Joins With Edit ...
ture typographical errors for text documents, and to capture similarities for Homologous proteins or genes. ..... We propose a more effi- cient Algorithm 3 that performs a binary search within the same range of [τ + 1,q ..... IMPLEMENTATION DETAILS.

An exact algorithm for energy-efficient acceleration of ...
data transfers given that the adjacent nodes are executed on different processors. Input data nodes represent the original input data residing in CPU memory.

Efficient Dynamics
The All-New BMW M2. Technical Specifications. M2 DKG. M2. Engine type. N55B30T0. N55B30T0. Transmission type. DKG manual transmission. Body. Seats.

An Efficient Genetic Algorithm Based Optimal Route Selection ... - IJRIT
Wireless sensor Network (WSN) is getting popular especially for applications where installation of the network infrastructure is not possible, such as.

An Efficient Genetic Algorithm Based Optimal Route Selection ... - IJRIT
infrastructure, but imposes some drawbacks and limitations (mainly on .... Networks”, http://www.monarch.cs.rice.edu/monarch-papers/dsr-chapter00.pdf.