1

Parallel Computing System for efficient computation of Molecular Similarity based on Negative Electrostatic Potential : First results Raul Torres

Abstract—In this document we present the first results obtained in the applicattion of a kernel based distance for molecule comparison. Essentially, it consists in converting the tree representation of the electrostatic potential obtained with TARIS in a more compact canonical string representation, that allow us to apply a kernel method for strings. In its most simple version, the string kernel searches the number of ocurrences of a balanced string (corresponding to a subtree) only in its structure; the clustering applied over the similarity matrix shows a too broad classification of our 45 molecule set. A more complex kernel aditionally calculates the relation between the weights of each subtree, resulting in a more exact classification in the same set. The algorithms were implemented under CUDA, obtaining supreme reduction of execution time in comparison with a serial implementation. Index Terms—CUDA, GPU Computing, kernel methods, molecular similarity, parallel computing, string kernels, machine learning, tree kernels

Next, we apply a pattern analysis algorithm that is general purpose, and robust. The process can be resumed in four steps: 1) Data items are embedded into a vector space called the feature space. 2) Linear relations are sought among the images of the data items in the feature space. 3) The algorithms are implemented in such a way that the coordinates of the embedded points are not needed, only their pairwise inner products. 4) The pairwise inner products can be computed efficiently directly from the original data items using a kernel function. Figure 1. Embedding to the feature space in order to find linear relationships

I. I NTRODUCTION N this document we present the first results obtained in the applicattion of a kernel based distance for molecule comparison. Essentially, it consists in converting the tree representation of the electrostatic potential obtained with TARIS in a more compact canonical string representation, that allow us to apply a kernel method for strings. In its most simple version, the string kernel searches the number of ocurrences of a balanced string (corresponding to a subtree) only in its structure; the clustering applied over the similarity matrix shows a too broad classification of our 45 molecule set. A more complex kernel aditionally calculates the relation between the weights of each subtree, resulting in a more exact classification in the same set. The algorithms were implemented under CUDA, obtaining supreme reduction of execution time in comparison with a serial implementation.

I

II. P REVIOUS W ORK A. Kernel methods The strategy of kernel methods are well documented in the book of Cristianini and Shawe [12]. In general it is to embed the data into a space where the patterns can be discovered as linear relations(Figure 2.1). The initial mapping component is defined implicitly the kernel function that depends on the specific data type and domain knowledge concerning the patterns that are to be expected in the particular data source. Raul Torres is with Grupo de Química Teórica - Universidad Nacional de Colombia

B. Kernels for Structured Data In the paper of Gartner[5] convolution kernels are proposed to be the best known kernel for representation spaces that are not mere attribute-value tuples. Their basic idea is that the semantics of composite objects can often be captured by a relation between the object and its parts; The kernel on the object is composed of kernels defined on different parts. 0 • x, x ∈ X are the objects → 0→ • x ,x ∈ X1 × · · · × XD are the tuples of parts of these objects. • R : (X1 × · · · × XD ) × X is the relation of composition −1 → • R (x) = x : R(x→ , x) is the decomposition relation • The convolution kernel is defined as kconv (x, x0 ) =

X

D Y

x→ ∈R−1 (x),x0→ ∈R−1 (x0 )

d=1

kd (xd , x0d )

The advantage of convolution kernels is that they are very general and can be applied in many different situations. However, because of that generality, they require a significant

PARALLEL COMPUTING SYSTEM FOR EFFICIENT COMPUTATION OF MOLECULAR SIMILARITY BASED ON NEGATIVE ELECTROSTATIC POTENTIAL

amount of work to adapt them to a specific problem, which makes choosing R in real-world applications a non-trivial task. One of those applications is related with string objects, and for them, string kernels have been defined. C. String kernels Strings kernels are defined in a comprehensive way in[4]as it follows: • A is a finite set of characters called alphabet, e.g. A = A, C, G, T • $ is a sentinel character that marks the end of a string but $ ∈ /A • A string x is any combination of the elements of A: x ∈ Ak , k = 0, 1, 2, ... k • The empty string belongs to A , and is represented by the symbol . ∗ • The non-empty strings are symbolized by A . ∗ • s, t, u, v, w, x, y, z ∈ A denote strings • a, b, c ∈ A denote characters • |x| denotes the length of x ∗ • uv ∈ A denotes the concatenation of two strings u and v ∗ • au ∈ A denotes the concatenation of a character and a string • x[i : j],1 ≤ i ≤ j ≤ |x| denote the sub-string of x between locations i and j • If x = uvw for some (possibly empty) u, v, w, – u is called a prefix of x – v is called a sub-string (v v x) – w is called a suffix of x. • numy (x)denotes the number of occurrences of y in x The kernel they define is given by: X X k(x, y) = wsx δsx ,sy = nums (x)nums (y)ws sx vx,sy vy

s∈A∗

Now, these are the possible ways to obtain the value of the weight: • ws = 0 for all |s| > 1: it only counts single characters (bag-of-characters kernel) • ws 6= 0 if and only if s is bounded by a whitespace in one or both sides (bag-of-words kernel) • ws = 0 for all |s| > n: it only counts sub-strings which length are less or equal to a given number • ws = 0 for all |s| = 6 k : it only counts sub-strings of length k (k-spectrum kernel) In the previous article they propose a new algorithm strings and trees in linear time, obviating dynamic programming. Their algorithm is supposed to be extended in various ways to provide linear time prediction cost in the length of the sequence to be classified. Additionally, they offer a way of dealing with different kinds of weights. D. The TARIS method More exactly, in TARIS, the trees are weighted in their nodes, registering the electrostatic potential value of a given surface. The method starts by searching isopotential surfaces

2

at given value (e.g. -0.15) and for each founded independent surface, it creates a weighed node saving the value at that level. Next, the algorithm jumps to the following level according to the established step size (e.g. 0.02) and repeats the previous process. If some of the new surfaces envelop surfaces in the previous level, the corresponding node of that surface connects to the existing nodes as a parent node. The process repeats until the algorithm reaches the cutoff value, which is the maximum value of electrostatic potential minor to 0. A refinement step is used for simplifying the tree structure: if a parent node has only one child node, the parent node collapses with it (in other words, the method only stores the deepest node in branches with just one child). Finally, the method adds a imaginary root node, which purpose is to unite in one single tree all generated trees for a molecule. At this moment, we obtain a structure that reflects the relations between successive isopotential surfaces. This allow us to forget the 3D representation, and concentrate the efforts in a tree comparison problem. One of the most used algorithms to tackle it is the edit distance metric. Briefly, it consists in finding the minimum-cost operations set (insert a node, delete a node, replace a node) that allows us to convert a tree in another tree. The obtained value is the distance between those two trees, and it can be used to create a similarity or dissimilarity matrix when several molecules are compared. Then, using hierarchical clustering, we can obtain the dendogram for a possible classification.

E. Other tree comparison methods There are alternative measures in tree comparison. Subgraph isomorphism (obtain the same graph by only changing node labels) is classified as an NP problem, but there are several optimizations that demonstrate good results in video indexing, computer vision and ligand docking [6], [7], [10], [14]; given two graphs (G1 and G2), two variants of the solution can be used: Maximum common sub-graph looks for the mayor isomorphic sub-graph of G1 with the mayor subgraph of G2. In the same way, Minimum common super-graph tries to find the closest superior graph containing both G1 and G2. This distances showed better results than the edit distances in some applications [1]because they don’t need function costs. We can´t ignore the results in some phylogenetic trees comparison studies [3], [11], based in the Maximum Parsimony method. Another choice could be the Maximum Verisimilitude, which is like the first, but it associates some probability values to each possible solution to the problem. In the parallel computing field, In [3], [11] the authors develop comparison of phylogenetic trees under a messagepassing parallel computing interface; the execution mode is Master-slave, in which there is a central processing node controlling the operations in the slave nodes. The prediction accuracy doesn’t outperforms significantly but the execution time speeds up in a notorious way. Other studies like [6] try to compare Maximum common sub-graph and Minimum common super-graph in their serial and parallel versions.

PARALLEL COMPUTING SYSTEM FOR EFFICIENT COMPUTATION OF MOLECULAR SIMILARITY BASED ON NEGATIVE ELECTROSTATIC POTENTIAL

3

F. GPU Parallel Computing in Molecular chemistry

Classification

Graphics processors (GPUs) provide a vast number of simple, data-parallel, deeply multithreaded cores and high memory bandwidths including support for both single- and double-precision IEEE floating point arithmetic. Over time, GPU architectures are becoming increasingly programmable, offering the potential for dramatic speedups for a variety of general-purpose applications compared to contemporary general-purpose processors (CPUs); before, GPU implementations could only be achieved using existing 3D-rendering APIs: OpenGL or DirectX [2]. NVIDIA’s offers a solution: CUDA language, an extension to C for programming over their cards without the need of mapping the instruction to a graphics API. CUDA represents the coprocessor as a device that can run a large number of threads. The threads are managed by representing parallel tasks as kernels (the sequence of work to be done in each thread) mapped over a domain (the set of threads to be invoked). Kernels are scalar and represent the work to be done at a single point in the domain. The kernel is then invoked as a thread at every point in the domain. The parallel threads share memory and synchronize using barriers. Data is prepared for processing on the GPU by copying it to the graphics board’s memory. Data transfer is performed using DMA and can take place concurrently with kernel processing. Once written, data on the GPU is persistent unless it is deallocated or overwritten, remaining available for subsequent kernels[2]. The abstractions and virtualization of processing resources provided by the CUDA thread block programming model allow programs to be written with GPUs that exist today but to scale to future hardware designs. Future CUDA-compatible GPUs may contain a large multiple of the number of streaming multiprocessors in current generation hardware. Well-written CUDA programs should be able to run unmodified on future hardware, automatically making use of the increased processing resources[13]. The impressive performance of GPUs is due, in part, to their optimized hardware for graphics. The requirements for floating-point calculations for graphics are considerably different from those for scientific computing[8].

The similarity matrix obtained with the GPU computing process was analyzed by means of hierarchical clustering using the average linkage method. General Representation of molecules The proposed strategy converts from tree to a string representation in the next way: • Every node is represented by [] characters • When a node has children, each child is established inside the [] of the parent: [[][]] •

[45, 889[78, 76[][]][987, 5[][]] • •

The kernel we proposed is based in the kernel definition given in [4] and has this structure: X

k(x, y) =

However, this simple kernel only works with the structure of the trees and performs poorly. So, we propose a weighted and normalized kernel that performs more accurately: P

k w (x, y) =

•

• • • •

Data set We used a set of 45 small molecules identified by their functional groups: (A) alcohols, (B) ethers, (C) aldehydes, (D) ketones, (E) carboxylic acids, (F) esters, (G) amides, and (H) amines (Figure 1). The molecules were originally stored in the GML format generated by TARIS according to the experimental set defined in the reference study. From this files we have extract the topology of the tree, the size and the value of the isopotential surface of each node.

nums (x)nums

s∈B

•

In order to compare our results with the ones obtained in the reference work [10], [9], we have set a default configuration in the eight different experimental designs we have tried.

The leaf nodes have no weight associated We propose a canonical representation – The subtrees with more nodes are traslated first – Next the subtrees with more levels are listed – Next the subtrees with more weights are listed

Proposed Kernel

•

III. M ETHODOLOGY

If a weight is associated to a non-leaf node, this value is written after the first [:

• •

s∈B

nums (x)nums (y)ws k(x, y)

ws is 1 if wy and wx are 0 (wy and wx are the respective weights of x and y trees) x ws = w wy if wx ≤ wy w Otherwise, ws = wxy x and y are the strings that represents the molecules B is the set of balanced sub-string [...] The weights can be calculated in 8 different ways (see below) The process is reduced to find the number of balanced sub-strings founded in both molecules;in other words, a sub-tree. The leaf nodes are not counted as sub-trees The whole string is considered a sub-string

Hardware • • •

CPU: AMD Athlon X2 64 Bits RAM: 2 GB GPU: Geforce 8200 M

PARALLEL COMPUTING SYSTEM FOR EFFICIENT COMPUTATION OF MOLECULAR SIMILARITY BASED ON NEGATIVE ELECTROSTATIC POTENTIAL

Figure 2.

Formula and Id for the 45 Small Organic Molecules Used for the Functional Group Classification

Figure 3. Tree representation: The red and green circles denote sub-trees that appears in both trees. In this case, the simple kernel is 2

Figure 4. String representation: The gray area is the same green circle in the previous figure. The red square is related to the red circle too.

Parallel programming considerations •

•

4

The general process is executed over the CPU – Host Code: C++ The string comparison process is made in parallel – Device Code: CUDA for C

Experimental configurations The two variables we manipulate in the experimentation are:

Isopotential surface size: ISS Isosurface value: IV We used factorial design (32 experiments), because the states of the two variables are bounded to three: • (N)Don’t use • (A)Accumulated: the summation of all the values of each node gives the weight for the sub-tree • (S)Simple: the value of the root node of the sub-tree gives the weight for the sub-tree The experiments will be referenced by a short name for understanding purposes: • W0: There are no weights. Only the structure is important. (N)ISS x (N) IV • W1: Accumulated isopotential surface size. (A)ISS x (N) IV • W2: Accumulated isopotential surface size times Accumulated isosurface value. (A)ISS x (A) IV • W3: Simple isopotential surface size. (S)ISS x (N) IV • W4: Accumulated isosurface value. (N)ISS x (A) IV • W5: Simple isosurface value. (N)ISS x (S) IV • W6: Accumulated isopotential surface size times Simple isosurface value. (A)ISS x (S) IV • W7: Simple isopotential surface size times Simple isosurface value. (S)ISS x (S) IV • W8: Simple isopotential surface size times Accumulated isosurface value. (S)ISS x (A) IV • •

IV. R ESULTS The classification results will be listed in order of performance, avoiding those with the worst results. The B6 molecule behaves poorly in almost every experiment. We think about it as an outlier, maybe because the

PARALLEL COMPUTING SYSTEM FOR EFFICIENT COMPUTATION OF MOLECULAR SIMILARITY BASED ON NEGATIVE ELECTROSTATIC POTENTIAL

prior calculation are wrong. In later experiments we expect to classify it better.

5

Figure 7. Dendogram for W7 Experiment: 7 molecules were misclassified (B6,C1,C2,D1,E1,E2,F1)

Experiment W3 This was the most accurate experimental set, were the weights are reduced to the size of the isopotential surface at the root node of the sub-trees. Figure 5. Dendogram for W3 Experiment: 6 molecules were misclassified (B6,C1,D1,E1,E2,F1)

Experiment W2 Another experiment that also reports 7 misclassified molecules. The weights are obtained by the summation of all the isosurface sizes times the summation of all the isopotential values of each node of the sub-trees Figure 8. Dendogram for W2 Experiment: 7 molecules were misclassified (B6,C1,C2,E1,E2,E3,F1)

Experiment W8 This was the following more accurate experimental set, were the weights are reduced to the size of the isopotential surface at the root node times the summation of the isopotential values of each nodes of the sub-trees. Figure 6. Dendogram for W8 Experiment: 7 molecules were misclassified (E2,E3,E1,C1,C2,F1,B6)

Experiment W0 The simple experiment with only the structure misclassifies some molecules but doesn’t differentiates too much between the groups Execution time In general, the execution time is reduced to approximately 2 seconds. Experiment W7 This experiment also reports 7 misclassified molecules. The weights are obtained from the size of the isopotential surface times the isopotential value at the root node of the sub-trees.

V. C ONCLUSIONS We think that the good behavior of the W3 experiment is because of the grow of the isopotential surface size, we don´t need to accumulate the sizes in the child and subsequent nodes.

PARALLEL COMPUTING SYSTEM FOR EFFICIENT COMPUTATION OF MOLECULAR SIMILARITY BASED ON NEGATIVE ELECTROSTATIC POTENTIAL

Figure 9. (B6,E2)

Dendogram for W0 Experiment: 2 molecules were misclassified

[13] John E. Stone, James C. Phillips, Peter L. Freddolino, David J. Hardy, Leonardo G. Trabuco, and Klaus Schulten. Accelerating molecular modeling applications with graphics processors. Journal of Computational Chemistry, 28(16):2618–2640, 2007. [14] Kim Shearer Svetha, Kim Shearer, Svetha Venkatesh, and Horst Bunke. An efficient least common subgraph algorithm for video indexing. In Int. Conf. on Pattern Recognition, pages 1241–1243, 1998.

In general terms, the kernel method used achieves a good prediction. The pre-assumptions that a string kernel can be applied to tree-like structured data was verified by applying it to a 45 data set of molecules. The next efforts of this research will be focused in the application of a kernel that takes in count the co-rooted tree and a more robust representation of strings named suffix tree. The use of CUDA as a programming environment has allow us to perform a several concurrent operations in a fast way than in the serial paradigm, without the need of a cluster of computers, proving that GPU Computing offers a tremendous computational power at low cost. R EFERENCES [1] Horst Bunke. A graph distance metric based on the maximal common subgraph. Pattern Recognition Letters, 19:255–259, 1998. [2] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, and Kevin Skadron. A performance study of general-purpose applications on graphics processors using cuda. J. Parallel Distrib. Comput., 68(10):1370–1380, 2008. [3] Zhihua Dua. Reconstruction of large phylogenetic trees: A parallel approach. Computational Biology and chemistry, 29:273–280, 2005. [4] Thomas Gartner, John W. Lloyd, and Peter A. Flach. Kernels for structured data. SIGKDD Explorations, 5:49–58, 2002. [5] Thomas Gartner, John W. Lloyd, and Peter A. Flach. Kernels and distances for structured data. Mach. Learn., 57(3):205–232, 2004. [6] Gupta. Finding largest subtrees and smallest supertrees. Algorithmica, 21:181–210, 1998. [7] Hidovic. Metrics for attributed graphs based on the maximal similarity common subgraph. International Journal of Pattern Recognition and Artificial Intelligence, 18:299–313, 2004. [8] Jeremy S. Meredith, Gonzalo Alvarez, Thomas A. Maier, Thomas C. Schulthess, and Jeffrey S. Vetter. Accuracy and performance of graphics processors: A quantum monte carlo application case study. Parallel Comput., 35(3):151–163, 2009. [9] Edgar E. Daza Ray M. Marin, Nestor F. Aguirre. Graph theoretical similarity approach to compare molecular electrostatic potentials. Journal of Chemical Information and Modeling, 48:109–118, 2008. [10] John W. Raymond. Rascal: Calculation of graph similarity using maximum common edge subgraphs. The Computer Journal, 45 (6):631– 644, 2002. [11] Schmidt. Tree-puzzle: maximum likelihood phylogenetic analysis using quartets and parallel computing. Bioinformatics application note, 18:502–504, 2002. [12] Cristianini Nello Shawe-Taylor John. Kernel Methods for Pattern Analysis. Cambridge University Press, 2004.

6

Raul Torres Student of Master in Computer Science, Universidad Nacional de Colombia. Researching in GPU parallel computing for machine learning PLACE PHOTO HERE