The VLDB Journal DOI 10.1007/s00778-013-0306-1

REGULAR PAPER

Received: 13 June 2012 / Revised: 12 January 2013 / Accepted: 17 January 2013 © Springer-Verlag Berlin Heidelberg 2013


Abstract Graphs are widely used to model complicated data semantics in many applications in bioinformatics, chemistry, social networks, pattern recognition, etc. A recent trend is to tolerate noise arising from various sources such as erroneous data entries and find similarity matches. In this paper, we study graph similarity queries with edit distance constraints. Inspired by the q-gram idea for string similarity problems, our solution extracts paths from graphs as features for indexing. We establish a lower bound of common features to generate candidates. Efficient algorithms are proposed to handle three types of graph similarity queries by exploiting both matching and mismatching features as well as degree information to improve the filtering and verification on candidates. We demonstrate the proposed algorithms significantly outperform existing approaches with extensive experiments on real and synthetic datasets.

1 Introduction


Keywords

Graph similarity query · Edit distance · q-Gram

Electronic supplementary material The online version of this article (doi:10.1007/s00778-013-0306-1) contains supplementary material, which is available to authorized users. X. Zhao (B) · X. Lin · W. Wang The University of New South Wales, Sydney, Australia e-mail: [email protected] X. Lin e-mail: [email protected] W. Wang e-mail: [email protected]

Graphs have a wide range of applications and have been utilized to model complex data in biological and chemical information systems, multimedia, social networks, etc. There has been considerable interest in many fundamental problems in analyzing graphs. Various algorithms are devised to solve the problems, including graph pattern mining [24,34,43], graph containment search and indexing [5,35,41], etc. Due to the existence of noise and inconsistency in data, a recent trend is to study similarity matches among graphs [22,23,26,27,31,32,36,38]. This body of work solves the problem of searching for graphs in a database that approximately contain or are contained by a query. Among the various graph similarity measures used in these studies, graph edit distance [3,21] has been widely accepted for representing distances between graphs. Compared with alternative distance or similarity measures, graph edit distance has three advantages: (1) It allows changes in both vertices and edges; (2) it reflects the topological information of graphs; and (3) it is a metric that can be applied to any type of graphs. Due to these elegant properties, graph edit distance has been used in the context of classification and clustering tasks in various application domains [20]. However, the expensive computation of graph edit distance poses serious algorithmic challenges. To tackle the NP-hardness of the problem [38], a few algorithms have been proposed to either convert it to binary linear programming and compute the bounds [13], or seek unbounded suboptimal answers with heuristics [9]. In this paper, we investigate graph similarity queries with graph edit distance constraints and focus on three types of queries which cover a wide range of searching and data cleaning tasks in graph database applications:


Author Proof

Xiang Zhao · Chuan Xiao · Xuemin Lin · Wei Wang · Yoshiharu Ishikawa


Efficient processing of graph similarity queries with edit distance constraints

C. Xiao · Y. Ishikawa Nagoya University, Nagoya, Japan e-mail: [email protected] Y. Ishikawa e-mail: [email protected]

123 Journal: 778 MS: 0306


CP Disp.:2013/1/28 Pages: 26 Layout: Large


X. Zhao et al.


Due to the expensive computation of graph edit distance, the state-of-the-art approaches to graph or subgraph similarity search with edit distance constraints are mainly based on a filter-and-verify scheme: first generate a set of promising candidates that satisfy necessary conditions for the edit distance constraint, and then verify them by edit distance computation. The κ-AT algorithm [27] borrows the q-gram idea from the solution to string similarity problems [10] and defines a q-gram as a tree consisting of a vertex along with all those that can be reached in q hops. A count filtering condition on the minimum required number of common q-grams is established to qualify the candidates that satisfy the edit distance constraint. However, it suffers from the looseness of the lower bound due to the impact of edit operations on common q-grams and therefore is only effective against sparse graphs. The choice of q-gram length is also limited to very small values, but short q-grams usually result in poor selectivity and consequently large candidate size. The star structure [38] is exactly the same feature as the 1-gram defined by κ-AT. Unlike κ-AT, it computes the lower and upper bounds of graph (or subgraph) edit distance through bipartite matching between the star representations of two graphs. For graph similarity search, it has to invoke bipartite matching for the query with every data graph. The time complexity is $O(|R| \cdot |V|^3)$, where |R| is the dataset size and |V| is the number of vertices in a graph. Thus, an immediate remedy is to take advantage of indexes. To this end, an indexing and query processing framework, SEGOS [31], was recently proposed. Based on a two-level index structure, SEGOS adopts a novel search strategy adapted from the threshold-based algorithm (TA) and the combined algorithm (CA) [8] to enhance the star structure-based solution. As a result, SEGOS is superior from both perspectives of indexing and searching strategy.
However, implicit parameters hidden in the algorithm need to be tuned in order to achieve good performance. Moreover, graph edit distance computation is not involved in its evaluation, and hence, the overall runtime performance of SEGOS remains unclear. Distinct from existing approaches, we explore a novel perspective of utilizing path-based q-grams. We find that the count filtering condition of path-based q-grams is stricter than that of tree-based q-grams. This enables us to perform similarity queries on denser graphs as well as choose longer q-grams for better selectivity. Another novelty is to exploit


– We devise algorithms for graph and subgraph similarity search queries, which are non-trivial extensions of the GSimJoin algorithm proposed in [42]. For graph similarity join queries, the R-S join scenario was not covered by [42], but is discussed in this paper. We also discuss the adaptation to directed multigraphs.
– We devise a novel q-gram matching condition by exploiting the vertex degree information in q-grams. The proposed technique is orthogonal to the two major filtering techniques proposed in [42] and substantially reduces the size of the candidate set from GSimJoin.
– We evaluate the effect of the new techniques and the performance on the three types of graph similarity queries with more experiments.


the valuable information provided by mismatching q-grams that do not match in a candidate pair. Two filtering conditions are accordingly proposed so that the size of the candidate set can be substantially reduced. In addition, we leverage the vertex degree information to devise a new q-gram matching condition. We also elaborate how to speed up graph edit distance computation by further utilizing the filtering conditions. As a consequence, three algorithms are designed, respectively, to handle the three types of similarity queries. The superior time efficiency against alternative methods is demonstrated by extensive experimental evaluations. A preliminary version of this paper appeared in [42]. In this version, we make substantial improvements:


– Graph similarity search: find data graphs whose edit distances to a query are within a threshold.
– Graph similarity join: find pairs of graphs from two datasets such that the pairs’ edit distances are within a threshold.
– Subgraph similarity search: find data graphs that contain subgraphs to which the edit distances from a query are within a threshold.


The rest of the paper is organized as follows: Sect. 2 presents the problem definition and preliminaries. Section 3 introduces the definition of path-based q-gram and the basic algorithmic framework for graph similarity search. Sections 4 and 5 present two filtering techniques exploiting mismatching q-grams. Section 6 advances another idea to leverage degree differences when matching q-grams. Section 7 elaborates the verification of candidates. Extensions to graph similarity join and subgraph similarity search are covered in Sects. 8 and 9, respectively. Further adaptation to directed multigraphs is discussed in Sect. 10. Section 11 presents experimental results and analyses. Section 12 summarizes related work, followed by the conclusion in Sect. 13.


2 Preliminaries


2.1 Problem definition


For ease of exposition, we focus on simple graphs first and postpone the extension to other graphs to Sect. 10. A simple graph is an undirected graph with neither self-loops nor multiple edges. A labeled graph r can be represented as a quadruple $(V, E, l_V, l_E)$, where V is a set of vertices,


Graph similarity queries with edit distance constraints

The graph edit distance between r and s is ged(r, s) = 3; for example, one can insert an N-labeled vertex into r, insert a single edge between $C_2$ and N, and then replace the double edge with a single edge.
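The edit sequence described in this example can be replayed in code. The following is a minimal sketch, not part of the original paper: the vertex identifiers, the dictionary-based graph encoding, and the function name `apply_edits` are illustrative assumptions, and the structure of the two molecules is inferred from the textual description of Fig. 1. Applying three operations to r yields s, which shows that ged(r, s) is at most 3; showing that no shorter sequence exists would require an actual edit distance computation.

```python
def apply_edits(vertices, edges, ops):
    """Return a copy of a labeled graph after applying edit operations in order.

    vertices: dict vertex id -> label
    edges:    dict frozenset({u, v}) -> label (undirected, simple graph)
    """
    vertices, edges = dict(vertices), dict(edges)
    for op in ops:
        kind = op[0]
        if kind == "insert_vertex":
            _, v, label = op
            assert v not in vertices
            vertices[v] = label
        elif kind == "delete_vertex":
            _, v = op
            assert all(v not in e for e in edges)  # only isolated vertices
            del vertices[v]
        elif kind == "relabel_vertex":
            _, v, label = op
            vertices[v] = label
        elif kind in ("insert_edge", "relabel_edge"):
            _, u, v, label = op
            key = frozenset((u, v))
            # relabeling needs an existing edge, insertion a missing one
            assert (key in edges) == (kind == "relabel_edge")
            edges[key] = label
        elif kind == "delete_edge":
            _, u, v = op
            del edges[frozenset((u, v))]
        else:
            raise ValueError(kind)
    return vertices, edges


# Cyclopropanone (r) and 2-aminocyclopropanol (s), hydrogen atoms omitted.
r_vertices = {"C1": "C", "C2": "C", "C3": "C", "O": "O"}
r_edges = {frozenset(("C1", "O")): "=", frozenset(("C1", "C2")): "-",
           frozenset(("C1", "C3")): "-", frozenset(("C2", "C3")): "-"}
s_vertices = dict(r_vertices, N="N")
s_edges = dict(r_edges)
s_edges[frozenset(("C1", "O"))] = "-"
s_edges[frozenset(("C2", "N"))] = "-"

ops = [("insert_vertex", "N", "N"),
       ("insert_edge", "C2", "N", "-"),
       ("relabel_edge", "C1", "O", "-")]
result = apply_edits(r_vertices, r_edges, ops)
print(result == (s_vertices, s_edges))  # True
```

Because the sketch reuses the same vertex identifiers in both graphs, plain equality stands in for the isomorphism test of Definition 1.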


Definition 1 (graph isomorphism) A graph r is isomorphic to another graph s if there exists a bijection $f : V(r) \to V(s)$ such that (1) $\forall u \in V(r)$, $f(u) \in V(s) \wedge l_V(u) = l_V(f(u))$, and (2) $\forall e(u, v) \in E(r)$, $e(f(u), f(v)) \in E(s) \wedge l_E(e(u, v)) = l_E(e(f(u), f(v)))$.

Definition 2 (subgraph isomorphism) A graph r is subgraph isomorphic to another graph s (denoted $r \sqsubseteq s$) if there exists an injection $f : V(r) \to V(s)$ such that (1) $\forall u \in V(r)$, $f(u) \in V(s) \wedge l_V(u) = l_V(f(u))$, and (2) $\forall e(u, v) \in E(r)$, $e(f(u), f(v)) \in E(s) \wedge l_E(e(u, v)) = l_E(e(f(u), f(v)))$. r is also called a subgraph of s.

A graph edit operation is an edit operation to transform one graph to another [3,21]. It can be one of the following six operations:

– insert an isolated labeled vertex into the graph,
– delete an isolated labeled vertex from the graph,
– change the label of a vertex,
– insert a labeled edge into the graph,
– delete a labeled edge from the graph,
– change the label of an edge.

The graph edit distance between r and s, denoted by ged(r, s), is the minimum number of edit operations that transform r to a graph isomorphic to s. It is easy to show that graph edit distance is a metric. Computing the graph edit distance between two graphs is proved to be NP-hard [38]. For brevity, we use “edit distance” for “graph edit distance” in the rest of the paper when there is no ambiguity.

Example 1 Figure 1 sketches the molecular structures of cyclopropanone (r) and 2-aminocyclopropanol (s), omitting hydrogen atoms. They are used in the investigation of potential antiviral drugs [7]. For ease of illustration, subscripts are added to the carbon atoms, although $C_1$, $C_2$, and $C_3$ correspond to an identical label; single and double lines indicate different chemical bonds, modeled by edge labels in real data.


Running multiple graph search queries in a batch mode results in a graph similarity join query.


Problem 2 (graph similarity join) Given two graph collections R and S, graph similarity join with edit distance threshold τ returns pairs of graphs, one from each collection, such that their edit distance is no larger than τ, that is, $\{ \langle r, s \rangle \mid \mathrm{ged}(r, s) \le \tau, r \in R, s \in S \}$. A self-join associates a collection with itself, that is, given a graph collection R, it returns $\{ \langle r_i, r_j \rangle \mid \mathrm{ged}(r_i, r_j) \le \tau \wedge r_i.id < r_j.id, r_i \in R, r_j \in R \}$.

It is also desirable to discover data graphs which approximately contain given queries, and therefore, we consider the subgraph similarity search query.

Problem 3 (subgraph similarity search) Given a data graph collection R, a query graph s, and edit distance threshold τ, subgraph similarity search is to find all the graphs r from R such that there exists a subgraph r′ of r to which the edit distance from s is no larger than τ, that is, $\{ r \mid \mathrm{ged}(r', s) \le \tau \wedge r' \sqsubseteq r, r \in R \}$.

Next, we first study graph similarity search and defer the extension to graph similarity join and subgraph similarity search to Sects. 8 and 9. Additionally, we focus on in-memory implementation when describing algorithms.

2.2 Tree-based q-gram approach

A problem related to graph similarity queries is string similarity queries with edit distance constraints, which have been extensively studied over the last decade; see [15,17,28,29] for a few recent advances. Among them, several prevalent approaches are based on q-grams [17,33], namely substrings of length q. Since an edit operation only affects a limited number of q-grams, similar strings will have a certain amount of overlap between their q-gram sets.¹ Based on this

¹ q-Grams in strings are accompanied by their starting positions in the string, and thus there is no duplicate.


Problem 1 (graph similarity search) Given a data graph collection R, a query graph s, and edit distance threshold τ, graph similarity search is to find all the graphs r from R such that the edit distance between r and s is no larger than τ, that is, $\{ r \mid \mathrm{ged}(r, s) \le \tau, r \in R \}$.


$E \subseteq V \times V$ is a set of edges, and $l_V$ (resp. $l_E$) is a labeling function that assigns labels to vertices (resp. edges). V(r) (resp. E(r)) denotes the vertex (resp. edge) set of r. |V(r)| and |E(r)| represent the number of vertices and edges in r, respectively. $l_V(u)$ denotes the label of a vertex u, and $l_E(e(u, v))$ denotes the label of an edge between u and v, where u, v ∈ V.


In this paper, we study three types of graph similarity queries based on edit distance, namely graph similarity search, graph similarity join, and subgraph similarity search. The graph similarity search query is formalized as follows.


Fig. 1 Cyclopropanone and 2-aminocyclopropanol


Example 2 Consider graph r in Fig. 1, and q = 1. r embodies four 1-grams, as shown in Fig. 2, with the first 1-gram appearing twice.

3 A path-based q-gram method


The maximum number of tree-based q-grams that can be affected by an edit operation is $D_{tree} = 1 + \gamma \cdot \frac{(\gamma - 1)^q - 1}{\gamma - 2}$, where γ is the maximum vertex degree in the graph. The κ-AT algorithm advises that if graphs r and s are within edit distance τ, they must share at least

$LB_{tree} = \max(|V(r)| - \tau \cdot D_{tree}(r), |V(s)| - \tau \cdot D_{tree}(s))$

common q-grams. A pair of graphs conforming to the lower bound is a candidate pair. Note that it does not necessarily satisfy the edit distance constraint. Hence, edit distance calculation is invoked for every candidate pair. The κ-AT algorithm is observed to have a loose lower bound on common q-grams when (1) there is a vertex with high degree in the graph, (2) the distance threshold is large, or (3) q is large. The lower bound can even become less than or equal to zero. We call this phenomenon underflowing. This issue results in the following dilemma: We have to use very short q-grams, for example, 1-grams, to ensure that pairs of graphs have at least one common q-gram so that the all-pair comparison brought about by underflowing can be avoided; however, using short q-grams suffers from poor performance as they are usually frequent and hence yield a large candidate set. Considering the two graphs in Fig. 1 and τ = 1, the lower bound is only 1 if we use 1-grams and becomes non-positive under larger distance thresholds or with longer q-grams. Another approach for graph similarity search [38] is based on star representations of graphs. A star structure is exactly a 1-gram defined by κ-AT; nevertheless, it does not apply count filtering to approach the problem. With a distinct flavor, it utilizes bipartite matching to derive lower and upper bounds of edit distance for pruning and validation, respectively. As a step further, SEGOS [31] enhances star structures with a two-level index. In the upper level, stars from data graphs


3.1 Definition of path-based q-gram


A path in a graph is a sequence of vertices and edges such that there is an edge between any consecutive vertices. A path is simple if there are no repeated vertices. The length of a path is the number of edges in the path.


Definition 3 (path-based q-gram) A path-based q-gram in a graph r is a simple path of length q.


Given a path, we have two label sequences starting from the two terminal vertices, obtained by sequentially concatenating the labels of its vertices and edges. Nonetheless, we only associate the lexicographically smaller one with a path-based q-gram w as its label sequence, denoted by seq(w). For example, C-N and N-C are two label sequences of a path-based q-gram w, but we only keep C-N as seq(w), as it is lexicographically smaller. Thus, each path-based q-gram w has a label sequence seq(w) of length 2q + 1. w is symmetric if its label sequence is symmetric, for example, C-O-C. Since the length of a path can be zero in the case of a single vertex, a 0-gram is defined to be a single vertex. The number of paths in a graph is in $O(|V| \cdot \gamma^q)$, where γ is the maximum vertex degree. Compared with tree-based q-grams, decomposing a graph into path-based q-grams increases the total number of q-grams. In the rest of the paper, we use “path-based q-gram” and “q-gram” interchangeably when there is no ambiguity.

Example 3 Consider the two graphs in Fig. 1 and q = 1. Assume we take atom symbols as vertex labels and use “−” and “=” as edge labels for single and double bonds, respectively. There are four 1-grams in r: C = O (×1), C − C (×3),


Seeing the drawback of tree-based q-grams, we quest for a new way of defining q-grams on graphs. Akin to q-grams on strings which are essentially sequences, we may choose paths in a graph as its q-grams, which are convertible to sequences. Next, we formally introduce path-based q-grams.


observation, these approaches essentially relax the edit distance constraint to a weaker count constraint on the number of common q-grams, called count filtering. Inspired by the idea of q-gram on string similarity queries, [27] proposes κ-AT algorithm that defines q-grams on graphs based on trees. For each vertex u, a tree-based q-gram is a set of vertices that can be reached from u in q hops, represented in a breadth-first-search tree rooted at u.


Fig. 2 Tree-based q-grams

are used to index the graphs in inverted lists; in the lower-level index, each star is broken into multiple vertices and indexed in inverted lists. The performance of SEGOS is dependent on the parameters that control the access to its two-level index. Edit distance computation is not involved in the experimental evaluation in [31] either, and hence, the overall runtime performance remains unclear. We will compare with SEGOS in the experimental study, equipping it with edit distance computation.
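The underflowing behavior of the tree-based count filtering bound discussed in this section can be checked numerically. The sketch below is an illustration rather than code from κ-AT: the function names are hypothetical, and the bound is written as the geometric sum $1 + \gamma \cdot \sum_{i=0}^{q-1} (\gamma - 1)^i$, which equals the closed form $1 + \gamma \cdot ((\gamma - 1)^q - 1)/(\gamma - 2)$ whenever γ ≠ 2 and avoids the division. The inputs assume the graphs of Fig. 1, with 4 and 5 vertices and maximum degree 3 in both.

```python
def d_tree(gamma, q):
    """Max number of tree-based q-grams one edit operation can affect.

    Geometric-sum form of 1 + gamma * ((gamma-1)**q - 1) / (gamma - 2).
    """
    return 1 + gamma * sum((gamma - 1) ** i for i in range(q))


def lb_tree(nv_r, gamma_r, nv_s, gamma_s, tau, q):
    """Count filtering lower bound on common tree-based q-grams."""
    return max(nv_r - tau * d_tree(gamma_r, q),
               nv_s - tau * d_tree(gamma_s, q))


# Fig. 1: |V(r)| = 4, |V(s)| = 5, maximum degree 3 in both graphs.
print(lb_tree(4, 3, 5, 3, tau=1, q=1))  # 1
print(lb_tree(4, 3, 5, 3, tau=1, q=2))  # -5
print(lb_tree(4, 3, 5, 3, tau=2, q=1))  # -3
```

The last two calls show the bound becoming non-positive as soon as either q or τ grows, which is the underflowing case in which count filtering can no longer prune anything.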


and five 1-grams in s:


C − N (×1)


Condition 1 (label-based match) Consider q-grams $w_r$ and $w_s$, as well as corresponding label sequences $seq(w_r)$ and $seq(w_s)$, respectively. $w_r$ matches $w_s$ if $seq(w_r) = seq(w_s)$.

The time complexity of checking whether two q-grams match, that is, comparing labels one by one, is O(q). We can compare the hash codes of the sequences and hence reduce the time complexity to O(1). This may introduce false positives due to hash collisions but no false negatives, and thus the correctness of any filtering algorithm relying on the matching condition is not affected. In practice, most false positives can be avoided by choosing an appropriate hash function. From now on, we assume there is no hash collision and two q-grams match with respect to Condition 1 if the hash codes of their label sequences coincide.

Given graphs r and s, we extract all the paths in the two graphs to make two q-gram sets $Q_r$ and $Q_s$. Two q-grams $w_r \in Q_r$ and $w_s \in Q_s$ are common if $w_r$ matches $w_s$; equally, we say one q-gram (either $w_r$ or $w_s$) is shared by r and s. We abuse the notation $Q_r \cap Q_s$ to denote the common q-grams between $Q_r$ and $Q_s$; and $Q_r \setminus Q_s$ denotes the q-grams from $Q_r$ that cannot match any q-gram in $Q_s$.

Similar to tree-based q-grams, a count filtering condition for path-based q-grams can be developed to relax the edit distance constraint to a weaker count constraint on the number of common q-grams. Before presenting that, we first study the effects of edit operations on a graph’s q-grams. Let $Q_r^u$ denote the set of q-grams containing vertex u, and $Q_r^{uv}$ denote the set of q-grams containing two consecutive vertices u and v. We say a q-gram is affected by an edit operation if the edit operation changes the q-gram’s label sequence. Theorem 1 reveals at most how many q-grams in $Q_r$ are affected when an edit operation occurs in r.

Theorem 1 An edit operation on graph r affects at most $D_{path}(r) = \max_{u \in V(r)} |Q_r^u|$ q-grams in $Q_r$.

Proof We enumerate the effects of various edit operations:
– Insert an isolated labeled vertex into the graph. No q-grams in $Q_r$ are affected.
– Delete an isolated labeled vertex from the graph. The number of q-grams affected is either 1 when q = 0, or 0 otherwise.
– Change the label of a vertex. Changing vertex u’s label affects $|Q_r^u| \le \max_{u \in V(r)} |Q_r^u|$ q-grams.


Lemma 1 (count filtering) Consider two graphs r and s. If ged(r, s) ≤ τ, they must share at least

$LB_{path} = \max(|Q_r| - \tau \cdot D_{path}(r), |Q_s| - \tau \cdot D_{path}(s))$

common q-grams.


Example 4 Consider Fig. 1, τ = 1, and q = 1. Changing the label of $C_1$ gives the maximum $|Q_r^u| = 3$ for both graphs. Hence, the lower bound of common path-based q-grams between r and s is max(4 − 3, 5 − 3) = 2. If we increase the q-gram length to 2, the lower bound of path-based q-grams is still above zero, as given by max(5 − 5, 7 − 6) = 1, whereas using tree-based q-grams provides a lower bound of −5.
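The numbers in this example can be reproduced with a short sketch that enumerates path-based q-grams, derives their label sequences, and evaluates the count filtering bound. This is an illustrative reading of the definitions, not the authors' implementation: the adjacency-dictionary encoding, the vertex identifiers, and all function names are assumptions, and the two molecules are encoded as inferred from the description of Fig. 1.

```python
from collections import Counter


def make_graph(labels, edges):
    """labels: vertex id -> label; edges: (u, v, edge label) triples."""
    adj = {v: {} for v in labels}
    for u, v, lab in edges:
        adj[u][v] = lab
        adj[v][u] = lab
    return labels, adj


def path_qgrams(labels, adj, q):
    """Return every simple path of length q once, as a tuple of vertex ids."""
    found = []

    def extend(path):
        if len(path) == q + 1:
            # An undirected path is reached from both ends; keep one orientation.
            if q == 0 or path[0] < path[-1]:
                found.append(tuple(path))
            return
        for nbr in adj[path[-1]]:
            if nbr not in path:
                extend(path + [nbr])

    for v in labels:
        extend([v])
    return found


def seq(labels, adj, path):
    """Label sequence of a path: the lexicographically smaller direction."""
    forward = [labels[path[0]]]
    for u, v in zip(path, path[1:]):
        forward += [adj[u][v], labels[v]]
    forward = tuple(forward)
    return min(forward, forward[::-1])


def d_path(paths):
    """max_u |Q^u|: the largest number of q-grams containing one vertex."""
    per_vertex = Counter(v for p in paths for v in p)
    return max(per_vertex.values(), default=0)


def lb_path(graph_r, graph_s, tau, q):
    """Count filtering lower bound on common path-based q-grams."""
    bounds = []
    for labels, adj in (graph_r, graph_s):
        paths = path_qgrams(labels, adj, q)
        bounds.append(len(paths) - tau * d_path(paths))
    return max(bounds)


r = make_graph({"C1": "C", "C2": "C", "C3": "C", "O": "O"},
               [("C1", "O", "="), ("C1", "C2", "-"),
                ("C1", "C3", "-"), ("C2", "C3", "-")])
s = make_graph({"C1": "C", "C2": "C", "C3": "C", "O": "O", "N": "N"},
               [("C1", "O", "-"), ("C1", "C2", "-"), ("C1", "C3", "-"),
                ("C2", "C3", "-"), ("C2", "N", "-")])

print(lb_path(r, s, tau=1, q=1))  # 2
print(lb_path(r, s, tau=1, q=2))  # 1
```

Enumerating paths by depth-first extension is exponential in q, in line with the $O(|V| \cdot \gamma^q)$ bound on the number of paths; the sketch makes no attempt to be efficient.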


A subtle case in counting common q-grams is multiple matching: when a q-gram in $Q_r$ matches two q-grams in $Q_s$, counting both as common q-grams overcounts, since only one of them can indeed match. In general, if m q-grams from $Q_r$ match n q-grams from $Q_s$ such that these m + n q-grams correspond to an identical label sequence, at most min(m, n) q-grams contribute to the common q-grams. Multiple matching is avoided when counting the number of common q-grams and will be further discussed in Sect. 7.1.
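The min(m, n) rule amounts to a multiset intersection over label sequences. A minimal sketch, assuming q-grams are represented by their label sequences as strings (the function name `count_common` is hypothetical):

```python
from collections import Counter


def count_common(seqs_r, seqs_s):
    """Number of common q-grams, counting each label sequence min(m, n) times."""
    count_r, count_s = Counter(seqs_r), Counter(seqs_s)
    return sum(min(m, count_s[label]) for label, m in count_r.items())


# 1-grams of the two graphs in Fig. 1.
q_r = ["C=O", "C-C", "C-C", "C-C"]
q_s = ["C-O", "C-N", "C-C", "C-C", "C-C"]
print(count_common(q_r, q_s))               # 3
print(count_common(["C-O"], ["C-O"] * 2))   # 1, not 2
```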


3.2 Comparison with tree-based q-grams


Now, we compare the influence of edit operations on tree-based and path-based q-grams.


– For q = 1, consider r in Fig. 1. All the tree-based q-grams, which cover the whole graph, can be affected by an edit operation on $C_1$, while the path-based q-gram consisting of $C_2$ and $C_3$ remains unaffected. This example shows that path-based q-grams can preserve more common structural information than tree-based q-grams, excluding the affected part.
– For longer q-grams, the number of vertices covered by a tree-based q-gram increases exponentially with q. One edit operation on any of the vertices makes the q-gram mismatch. On the contrary, the coverage of a path-based q-gram increases linearly with q, and therefore, the probability of being hit by an edit operation is decreased.


According to Theorem 1, the count filtering condition for path-based q-grams can be established in Lemma 1.


Given a q-gram size q, we say path-based q-grams wr and ws match if they correspond to the same label sequence, and they are matching q-grams, otherwise mismatching q-grams. It is formally stated in Condition 1.


C − C (×3).


C − O (×1)

– Insert a labeled edge into the graph. No q-grams in $Q_r$ are affected.
– Delete a labeled edge from the graph. Supposing e(u, v) is deleted, the number of affected q-grams is $|Q_r^{uv}| \le \max_{u \in V(r)} |Q_r^u|$.
– Change the label of an edge. It affects the same number of q-grams as deleting an edge from the graph. □


– The number of path-based q-grams grows exponentially with q, given by $O(|V| \cdot \gamma^q)$, while it is |V| for tree-based q-grams. As a consequence, we would have more path-based q-grams left after applying τ edit operations than tree-based q-grams, due to the larger total number of q-grams and the lower probability of being hit by edit operations.


3.3 Prefix filtering

An efficient way to find the pairs of graphs that satisfy the count filtering condition is to use an inverted index [2]. An inverted index maps each q-gram w to a list of identifiers of the graphs that contain w. With an inverted index built on data graphs, for a query graph, we scan its q-grams and use each of them to probe the inverted list in order to collect data graph identifiers as well as their occurrence numbers. All the graphs whose occurrence numbers meet the count filtering lower bound are candidates of the query.

The main performance bottleneck in accessing an inverted index is that the inverted lists of some q-grams can be fairly long, for example, the carbon chain C − C − C exists in most organic compounds. These long inverted lists incur prohibitive accessing overhead, and a large number of candidates will be produced if they share such q-grams with the query. Existing approaches to the string similarity problem address it by employing prefix filtering [4,17,33] to quickly prune the candidates that are guaranteed to not meet the count filtering condition. The intuition is that if two sets of q-grams meet the lower bound constraint, they must share at least one common q-gram if we look into part of the q-grams.

Figure 3 illustrates the idea of prefix filtering. Suppose q-grams are sorted by the same ordering, and l is the number of q-grams in both sets. The unshaded cells are prefixes, for example, $w_a$ and $w_b$ are $Q_r$’s prefixes. If $Q_r$ and $Q_s$ have no common q-grams in their prefixes, the number of their common q-grams is no more than $LB_{path} - 1$. We formally state the prefix filtering principle for graph similarity queries.

Lemma 2 (prefix filtering) Consider graphs r and s, their q-gram sets $Q_r$ and $Q_s$, sorted by a global ordering O of the q-gram universe, respectively. Let the p-prefix denote the first p elements in a set. If $|Q_r \cap Q_s| \ge \alpha$, the $(|Q_r| - \alpha + 1)$-prefix of $Q_r$ and the $(|Q_s| - \alpha + 1)$-prefix of $Q_s$ must share at least one common q-gram.

² Rare cases are observed in which tree-based q-grams deliver an identical or even tighter count filtering lower bound than path-based q-grams.


3.4 Graph similarity search algorithm


Combining count filtering and prefix filtering, we are ready to present the basic GSimSearch algorithm for similarity search queries (Algorithm 1). It consists of two phases: indexing (Algorithm 2) and query answering (Algorithm 3). In the indexing phase, the algorithm takes as input a graph database R and a distance threshold τ, and constructs an inverted index. It first decomposes each data graph into a set of q-grams and indexes its prefix by incorporating prefix filtering (Lines 3–7 in Algorithm 2). In the query answering phase, the algorithm receives a query graph s and decomposes it into a q-gram set, sorted in the same global order as in the indexing phase. For each q-gram w in its prefix, it probes the inverted index to collect the data graphs containing w in their prefixes. The candidates are sent into Verify and checked by (1) count filtering and then (2) the expensive edit distance computation. According to Lemmas 1 and 2, the prefix length is $\tau \cdot D_{path}(r) + 1$ for each r (Line 4 in Algorithm 2), and $\tau \cdot D_{path}(s) + 1$ for query s (Line 4 in Algorithm 3). In addition, the numbers of vertices and edges in r and s must have a difference within τ. This size filtering is included in Line 8 in Algorithm 3.

The indexing phase can be done offline if we are offered the storage to keep the indexes. Although the edit distance threshold τ may not be given beforehand in some cases, observing that the prefixes under higher distance thresholds always subsume the prefixes under lower distance thresholds, we can build an index for a fixed threshold $\tau_{max}$ and use it for all similarity queries with $\tau \le \tau_{max}$.
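The two phases described above can be sketched as follows. This is a simplified illustration, not the paper's Algorithms 2 and 3: each graph is assumed to be pre-decomposed into a record with hypothetical keys `nv`, `ne`, `qgrams` (label sequences) and `d_path`; the size filter is interpreted as requiring the vertex-count difference plus the edge-count difference to be at most τ; the edit distance test is passed in as a callback `ged_within`; and the count filtering lower bound is assumed to be at least 1 for every pair (no underflowing), since otherwise prefix probing alone could miss answers.

```python
from collections import Counter, defaultdict


def prefix_len(tau, d_path):
    return tau * d_path + 1


def build_index(graphs, tau, order):
    """Indexing phase: map each prefix q-gram to the graphs containing it."""
    index = defaultdict(set)
    for gid, g in graphs.items():
        ordered = sorted(g["qgrams"], key=order)
        for w in ordered[:prefix_len(tau, g["d_path"])]:
            index[w].add(gid)
    return index


def query(s, graphs, index, tau, order, ged_within):
    """Query phase: probe with the query prefix, then filter and verify."""
    ordered = sorted(s["qgrams"], key=order)
    candidates = set()
    for w in ordered[:prefix_len(tau, s["d_path"])]:
        candidates |= index.get(w, set())
    results = []
    for gid in sorted(candidates):
        r = graphs[gid]
        # Size filtering (assumed form: summed size differences within tau).
        if abs(r["nv"] - s["nv"]) + abs(r["ne"] - s["ne"]) > tau:
            continue
        # Count filtering with min(m, n) counting of repeated label sequences.
        lower = max(len(r["qgrams"]) - tau * r["d_path"],
                    len(s["qgrams"]) - tau * s["d_path"])
        common = sum((Counter(r["qgrams"]) & Counter(s["qgrams"])).values())
        if common < lower:
            continue
        # Expensive verification, supplied by the caller.
        if ged_within(r, s, tau):
            results.append(gid)
    return results


graphs = {
    "g1": {"nv": 4, "ne": 4, "d_path": 3,
           "qgrams": ["C=O", "C-C", "C-C", "C-C"]},
    "g2": {"nv": 2, "ne": 1, "d_path": 1, "qgrams": ["N-N"]},
}
# Global order: ascending document frequency, ties broken by label sequence.
df = Counter(w for g in graphs.values() for w in set(g["qgrams"]))
order = lambda w: (df[w], w)

index = build_index(graphs, tau=1, order=order)
always_true = lambda r, s, tau: True  # stand-in for edit distance verification
print(query(graphs["g1"], graphs, index, 1, order, always_true))  # ['g1']
```

Indexing by label sequence rather than by individual q-gram occurrence only enlarges the candidate set, so it cannot introduce false negatives under the stated assumption.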


In order to achieve fewer candidates and faster execution, we sort the q-gram set of each graph in ascending order of document frequency of label sequences, that is, the number of graphs containing the q-gram’s label sequence. In this way, q-grams with rare label sequences reside ahead of those with frequent ones in q-gram sets. Intuitively, label sequences with low (resp. high) document frequencies are possessed by fewer (resp. more) graphs. Sorting q-grams in this order is a good heuristic to speed up similarity queries [4].
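The prefix filtering principle under this ordering can be illustrated on the 1-grams of Fig. 1. The sketch below is an assumption-laden example: the helper names are hypothetical, the document frequencies are computed over a collection containing only r and s, and ties in frequency are broken by the label sequence itself so that the ordering is total.

```python
def prefix(qgrams, alpha, key):
    """The (|Q| - alpha + 1)-prefix of a q-gram set under the global order."""
    ordered = sorted(qgrams, key=key)
    return ordered[:max(0, len(ordered) - alpha + 1)]


def passes_prefix_filter(q_r, q_s, alpha, key):
    """False means the two sets cannot share alpha or more q-grams."""
    return bool(set(prefix(q_r, alpha, key)) & set(prefix(q_s, alpha, key)))


q_r = ["C=O", "C-C", "C-C", "C-C"]
q_s = ["C-O", "C-N", "C-C", "C-C", "C-C"]
# Document frequency over the collection {r, s}: rare label sequences first.
df = {"C-C": 2, "C=O": 1, "C-O": 1, "C-N": 1}
key = lambda w: (df[w], w)

print(prefix(q_r, 2, key))                     # ['C=O', 'C-C', 'C-C']
print(passes_prefix_filter(q_r, q_s, 2, key))  # True
print(passes_prefix_filter(q_r, q_s, 4, key))  # False
```

With α = 2, the count filtering bound of Example 4, the prefixes overlap on C-C, so the pair survives. With a hypothetical α = 4 the prefixes are disjoint, which agrees with the two graphs sharing only three 1-grams.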


Fig. 3 Illustration of prefix filtering


Experimental results have suggested that path-based q-grams have the advantage of presenting tighter count filtering lower bounds over tree-based q-grams.² This potentially allows the use of longer q-grams in pursuit of better selectivity and runtime performance. We remark that using path-based q-grams cannot get rid of the underflowing issue in extreme cases; however, it reduces the chance of underflowing, compared with tree-based q-grams.


Algorithm 1: GSimSearch (R, s, τ)
Input : R is a collection of graphs; s is a query graph; τ is an edit distance threshold.
Output : A set of query results T.
1  I ← GSimIndex(R, τ); /* build index */
2  T ← GSimQuery(s, I, τ); /* find results */

Algorithm 2: GSimIndex (R, τ)
Input : R is a collection of graphs; τ is an edit distance threshold.
Output : An inverted index I built on R.
1  I_i ← ∅ (1 ≤ i ≤ |U|); /* inverted index */
2  for each r ∈ R do
3      Q_r ← r's q-grams sorted in O;
4      p_r ← τ · D_path(r) + 1;
5      for i = 1 to p_r do
6          w ← Q_r[i];
7          I_w ← I_w ∪ { r }; /* index for q-gram w */
8  return I

Algorithm 3: GSimQuery (s, I, τ)
Input : s is a query graph; I is R's inverted index; τ is an edit distance threshold.
Output : T = { r | ged(r, s) ≤ τ, r ∈ R }.
1  T ← ∅;
2  A ← empty map from id to boolean;
3  Q_s ← s's q-grams sorted in O;
4  p_s ← τ · D_path(s) + 1;
5  for i = 1 to p_s do
6      w ← Q_s[i];
7      for each r ∈ I_w such that A[r] has not been initialized do
8          if abs(|V(r)| − |V(s)|) + abs(|E(r)| − |E(s)|) ≤ τ then
9              A[r] ← true; /* find a candidate */
10 T ← T ∪ Verify(s, A);
11 return T

This completes the basic algorithmic framework for graph similarity search. In the following sections, we first study how to exploit the information provided by mismatching q-grams to gain efficiency. Although a similar property also holds for strings [33], the scenario on graphs is much more challenging: (1) q-grams on strings have starting positions and hence are easy to locate, while q-grams on graphs have no such attribute; and (2) the minimum edit operation problem on strings has polynomial time complexity, while we will show that it is NP-hard on graphs. We propose non-trivial techniques for graphs to reduce both index and candidate sizes. Moreover, we present a stricter matching condition for path-based q-grams to further reduce candidates by integrating more structural information, followed by an optimized verification algorithm.

4 Minimum edit filtering

We first show an illustrative example.

Example 5 Consider Fig. 1, τ = 1, and q = 1. The count filtering lower bound is 2; the two graphs share 3 q-grams (see Examples 3 and 4), and thus survive the filter. However, the two mismatching q-grams in s, C-O and C-N, are disjoint (bounded regions in s). At least two edit operations are required to affect both of them. We infer an edit distance lower bound between r and s to be 2, and hence prune the pair. This motivates us to find the minimum number of edit operations that cause the observed mismatching q-grams.

The above example shows the implication of edit operations occurring on mismatching q-grams and the existence of redundancy within prefixes. Both the index size and the number of candidates passing prefix filtering can be reduced if we are able to shorten prefix lengths.

4.1 Minimum graph edit operations

Example 5 illustrates the case of disjoint mismatching q-grams. To handle the general case of overlapping q-grams, we formulate the minimum graph edit operation problem.

Problem 4 (minimum graph edit operation) Given a set of q-grams Q, find the minimum number of graph edit operations that can affect all the q-grams in Q.

Theorem 2 The minimum graph edit operation problem is NP-hard.

Proof First, we prove that only the operation of changing a vertex label needs to be considered. For any vertex edit operation, the affected q-grams are a subset of those affected by changing the vertex's label. For any edge edit operation, the affected q-grams are also subsumed by those affected by changing the vertex label of either end of the edge. Second, we show a polynomial reduction from the minimum set cover problem. Consider a universe U of elements and n sets whose union constitutes the universe. Each element is treated as a q-gram, and for each set containing multiple elements, we let these q-grams overlap on a vertex. Therefore, changing the label of this vertex affects these q-grams, that is, covers these elements. Then, the minimum number of edit operations that affect the q-grams in Q is exactly the minimum number of sets whose union covers all elements in U. By reduction from the set cover problem, the minimum graph edit operation problem is NP-hard. □

Despite its NP-hardness, the problem is solvable with an exact algorithm that enumerates the positions of τ edit operations, since we are only concerned with whether the answer is within τ.
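The bounded enumeration can be sketched in Python under a simplification taken from the proof of Theorem 2: only vertex relabelings are considered, so each q-gram is represented by the set of vertex identifiers on its path, and the question becomes whether some set of at most τ vertices touches every q-gram. The function name and the input encoding are hypothetical, chosen for illustration only.

```python
from itertools import combinations


def min_edit_within(qgrams, tau):
    """Return min-edit(Q) if it is at most tau, and tau + 1 otherwise.

    qgrams: iterable of vertex-id collections, one per q-gram.
    """
    qsets = [frozenset(q) for q in qgrams]
    if not qsets:
        return 0  # nothing to affect
    vertices = sorted(set().union(*qsets))
    for k in range(1, tau + 1):
        # Try every placement of k relabelings on the q-gram vertices.
        for chosen in combinations(vertices, k):
            picked = set(chosen)
            if all(picked & q for q in qsets):
                return k  # every q-gram contains a relabeled vertex
    return tau + 1
```

For two disjoint mismatching q-grams, as in Example 5, the function reports that more than one operation is needed when τ = 1. The enumeration visits on the order of |V_Q|^τ vertex subsets in the worst case, which is why it is only practical for small τ.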


The worst-case time complexity of the exact enumeration algorithm for the minimum graph edit operation problem is O(|V_Q|^τ + |Q|), where V_Q is the set of vertices contained by the q-grams in Q. One may notice that the reduction in the proof of Theorem 2 is a direct problem mapping. It is straightforward to show that a minimum edit operation problem can be reduced to a minimum set cover problem via the reverse mapping. Thus, we conclude that the minimum set cover problem and the minimum edit operation problem are equivalent. As a consequence, to alleviate the problem of a large |V_Q|, we may compute an approximate answer using the greedy algorithm for the minimum set cover problem, with an approximation ratio of ln |Q| − ln ln |Q| + 0.78 [25]. Algorithm 4 encapsulates the approximate algorithm and is guaranteed to return a lower bound of the exact answer in O(τ(|V_Q| + |Q|) log |Q|) time.

Algorithm 4: MinEditLowerBound (Q)
Input : Q is a set of q-grams.
Output : A lower bound of the minimum edit operations that affect all the q-grams in Q.
1  edit ← compute min-edit(Q) with the greedy algorithm;
2  return ⌈ edit / (ln |Q| − ln ln |Q| + 0.78) ⌉

Algorithm 5: MinEdit (Q)
Input : Q is a set of q-grams.
Output : The exact minimum edit operations that affect all the q-grams in Q.
1  edit ← compute exact min-edit(Q);
2  return edit

Example 6 Figure 4 shows the structure of phenol (r) and toluidine (s) molecules. Supposing q = 2, there are three mismatching q-grams in s: C-C-C, C-C-N, and C=C-N, as bounded by dashed lines. At least two edit operations are needed to affect them, for example, changing the vertex labels of C1 and C2.

Fig. 4 Example of minimum edit operation

It is worth mentioning the following two properties, which are essential to the filtering techniques we are going to present. Let min-edit(Q) denote the minimum number of graph edit operations on a set of q-grams Q.

Proposition 1 (Monotonicity) min-edit(Q) ≤ min-edit(Q′) ≤ ged(r, s), ∀ Q ⊆ Q′ ⊆ Q_r \ Q_s.

Proposition 2 (Disconnectivity) min-edit(Q_1 ∪ Q_2) = min-edit(Q_1) + min-edit(Q_2), provided that, ∀ w_i ∈ Q_1, w_j ∈ Q_2, w_i and w_j have no common vertices.

4.2 Minimum prefix length

Recall Example 5: although the lower bound of common q-grams is LB_path, it is likely that the minimum edit operations that occur on mismatching q-grams already exceed τ, and thus the candidate should be discarded. Based on this observation, we seek a minimum prefix such that at least τ + 1 edit operations are needed to affect all the prefix q-grams. In this case, r and s are guaranteed not to meet the edit distance constraint if all their prefix q-grams mismatch.

The monotonicity (Proposition 1) enables us to find the minimum prefix length for a set of q-grams Q_r with a binary search within the range [τ + 1, τ · D_path(r) + 1] (Algorithm 6). It performs two rounds of binary search, the first seeking an upper bound of the minimum prefix length and the second the exact answer. In the first round, to check whether the q-grams within a prefix length need τ + 1 edit operations, the greedy algorithm (Algorithm 4) is called to find the lower bound of the answer to the minimum graph edit operation problem. The result from the first round is used as the upper bound of the second round, in which the exact algorithm (Algorithm 5) is applied iteratively.

Algorithm 6: MinPrefixLen (Q_r)
Input : Q_r is a sorted set of q-grams of graph r.
Output : The minimum prefix length of Q_r.
1  left ← τ + 1; right ← τ · D_path(r) + 1;
2  while left < right do
3      mid ← ⌊(left + right)/2⌋;
4      edit ← MinEditLowerBound(Q_r[1..mid]);
5      if edit ≤ τ then left ← mid + 1;
6      else right ← mid;
7  right ← left; left ← τ + 1;
8  while left < right do
9      mid ← ⌊(left + right)/2⌋;
10     edit ← MinEdit(Q_r[1..mid]);
11     if edit ≤ τ then left ← mid + 1;
12     else right ← mid;
13 return left

Proof (Correctness of Algorithm 6) The minimum prefix length is at least τ + 1, as an edit operation on a vertex affects at least one q-gram. The minimum prefix length is at most τ · D_path(r) + 1, according to Lemmas 1 and 2. Hence, left and right in the first round of binary search bound the minimum prefix length. Since it ends when left equals right, and right is only modified when at least τ + 1 edit operations are needed to affect the q-grams in Q_r[1..right], the first round of binary search always returns an upper bound of the minimum prefix length. With the upper bound gained in the first round, the second round of binary search finds the minimum prefix length, according to Proposition 1. □

Lemma 3 (minimum edit filtering) For the q-grams of graphs r and s, denote the minimum prefix lengths by p_r and p_s, respectively. If ged(r, s) ≤ τ, Q_r's p_r-prefix and Q_s's p_s-prefix must share at least one common q-gram.

To apply the minimum edit filtering condition in the basic graph similarity search algorithm, we replace Line 4 in Algorithm 2 with "p_r ← MinPrefixLen(Q_r)," and Line 4 in Algorithm 3 with "p_s ← MinPrefixLen(Q_s)."

Example 7 Consider graph s in Fig. 1 and its five 1-grams sorted according to the order as they are listed in Example 3. When τ is 1, the minimum prefix length is 2, while the prefix length before using minimum edit filtering is 4.

5 Label filtering

In this section, we introduce another approach to exploit the label differences in mismatching q-grams. Minimum edit filtering estimates an edit distance lower bound, but works in a pessimistic way, assuming edit operations are scattered. However, edit operations can be clustered within several mismatching q-grams.

Example 8 Consider Fig. 1 and q = 1. The two mismatching q-grams are bounded in dashed lines. If we compare the labels in the mismatching q-gram in the right bounding box with those in r, they already incur at least one edit operation, because there is no nitrogen atom (N) in r.

Motivated by this idea, we are able to establish a lower bound of edit distance from the labels in mismatching q-grams. Let L_V(r) denote the multiset of the vertex labels in r and L_E(r) the multiset of the edge labels in r.

Lemma 4 (local label filtering) Consider graphs r and s. If ged(r, s) ≤ τ, then ∀ r′ ⊑ r, |L_V(r′) \ L_V(s)| + |L_E(r′) \ L_E(s)| ≤ τ.

Applying local label filtering on whole graphs immediately yields global label filtering.

Lemma 5 (global label filtering) Consider graphs r and s. If ged(r, s) ≤ τ, then Γ(L_V(r), L_V(s)) + Γ(L_E(r), L_E(s)) ≤ τ, where Γ(A, B) = max(|A|, |B|) − |A ∩ B|.

According to Lemmas 4 and 5, we prune a graph pair (r, s) if |L_V(r′) \ L_V(s)| + |L_E(r′) \ L_E(s)| > τ for any subgraph r′ of r, or Γ(L_V(r), L_V(s)) + Γ(L_E(r), L_E(s)) > τ. Although local label filtering can be applied on any subgraph, we choose as a heuristic to use it on the subgraphs containing at least one mismatching q-gram, since mismatching q-grams may imply differences in vertex and edge labels. In addition, we employ minimum edit filtering to enhance the power of local label filtering. Recall the disconnectivity of minimum graph edit operations (Proposition 2): the observed mismatching q-grams can be joined to form a set of connected components. We may derive the lower bound of edit distance on the whole graph by computing that in each component and summing them up.

Algorithm 7 explains the implementation of the enhanced local label filtering after including minimum edit filtering. It first computes the connected components of the input q-grams (Line 1). This is implemented with a disjoint set data structure by one scan of the input q-grams. For each connected component consisting of mismatching q-grams, we compute the minimum edit operations within the component using (1) minimum edit filtering (Line 4) and (2) local label filtering (Line 5). The larger one is then chosen as the edit distance lower bound within this component and added to the total edit distance lower bound. The time complexity of the algorithm is O(|V_Q|^τ + q|Q| + |E_Q|), where V_Q (resp. E_Q) is the set of vertices (resp. edges) contained by the q-grams in Q. In case of a large |V_Q|, we may calculate an approximate answer to the minimum edit operation problem with the greedy algorithm, and the time complexity is O(τ(|V_Q| + |Q|) log |Q| + q|Q| + |E_Q|).

Algorithm 7: LocalLabelFilter (Q, s)
Input : Q is a set of mismatching q-grams from r to s.
Output : A lower bound of ged(r, s).
1  C ← the connected components formed by Q;
2  ε ← 0;
3  for each c_i ∈ C do
4      ε_m ← MinEdit(c_i);
5      ε_l ← |L_V(c_i) \ L_V(s)| + |L_E(c_i) \ L_E(s)|;
6      ε ← ε + max(ε_m, ε_l);
7  return ε

Example 9 Consider graphs r and s in Fig. 5, τ = 2, and q = 2. Global label filtering yields a lower bound of 2; count filtering needs at least two common q-grams, and they do share C-C-C and C-C-C; minimum edit filtering gives two edit operations. Hence, the pair passes the three filters. The bounded regions indicate two connected components formed by joining the mismatching q-grams in r. The number of edit operations on the left is 1 (via minimum edit filtering) and 2 on the right (via local label filtering). Thus, the pair can be pruned.

Fig. 5 Example of local label filtering
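The two label-based bounds of Lemmas 4 and 5 can be sketched in Python with `collections.Counter` standing in for multisets. The function names and the representation of labels as plain strings are assumptions made for illustration; the set difference in Lemma 4 is read here as a multiset difference, which is consistent with L_V and L_E being defined as multisets.

```python
from collections import Counter


def gamma(a, b):
    """Gamma(A, B) = max(|A|, |B|) - |A intersect B| over multisets."""
    a, b = Counter(a), Counter(b)
    common = sum((a & b).values())  # multiset intersection size
    return max(sum(a.values()), sum(b.values())) - common


def global_label_bound(lv_r, le_r, lv_s, le_s):
    """Lower bound of Lemma 5 from the vertex and edge labels of r and s."""
    return gamma(lv_r, lv_s) + gamma(le_r, le_s)


def local_label_bound(lv_sub, le_sub, lv_s, le_s):
    """Lower bound of Lemma 4 for a subgraph r' of r against the graph s."""
    dv = Counter(lv_sub) - Counter(lv_s)  # labels of r' that s cannot supply
    de = Counter(le_sub) - Counter(le_s)
    return sum(dv.values()) + sum(de.values())
```

A pair (r, s) can be pruned as soon as either bound exceeds τ. In the spirit of Example 8, a subgraph containing an N-labeled vertex yields a local bound of at least 1 against a graph with no N label.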


6 Vertex degree filtering

Recall Condition 1 on matching q-grams. One may notice that by extracting q-grams out of graphs, we neglect certain structural information and compare only the linear structure. This section introduces another filtering technique based on the degree information attached to vertices. We provide an edit distance lower bound by comparing two degree sequences and propose a novel q-gram matching condition.

6.1 Exploiting difference in vertex degrees

Consider an instance of two matching q-grams under Condition 1. Since it yields a bijection on their vertices in the original graphs, the degrees must be adjusted to the same values using no more than τ edit operations if the graphs meet the edit distance constraint. Inspired by this, we devise a degree-based filtering scheme that tests the degree differences between each pair of vertices in two q-grams sequentially and counts how many edit operations are needed to level the degrees. As any two q-grams satisfying Condition 1 have the same label sequence, we are confined to applying the operations that change degrees, namely edge insertions and deletions. The idea of degree filtering is illustrated in the following example.

Example 10 Consider the two q-grams with vertex degrees shown in the parentheses, τ = 1 and q = 3.

w_r : C1(3) − C2(2) − N1(3) − C3(1)

w_s : C1(4) − C2(3) − N1(3) − C3(2)

The two q-grams possess the same label sequence and thus satisfy Condition 1. Comparing the degrees from left to right, it takes at least two edit operations to make the two degree sequences identical, for example, inserting in w_r an edge between C1 and C2, and then inserting an edge to C3. It is obvious that they cannot match under this threshold.

To handle the general case, we first look at asymmetric q-grams; the symmetric case will be discussed at the end of this section. Let deg(w_r) denote the degree sequence of w_r, comprising the degrees deg(w_r[i]) of the vertices in w_r in sequence, i ∈ [1, q + 1]. To level each deg(w_r[i]) and deg(w_s[i]) of the degree sequences, the following four edge edit operations are available:

– Op. 1: insert a labeled edge e(u, v), u, v ∈ V(w_r);
– Op. 2: delete a labeled edge e(u, v), u, v ∈ V(w_r);
– Op. 3: insert a labeled edge e(u, v), u ∈ V(w_r), v ∈ V(r) \ V(w_r);
– Op. 4: delete a labeled edge e(u, v), u ∈ V(w_r), v ∈ V(r) \ V(w_r).

Op. 1 and 2 represent the operations on two vertices in this q-gram, while Op. 3 and 4 are the operations involving one vertex in the q-gram and one outside (assuming this vertex can always be found in r). To check whether two q-grams match with respect to the edit distance constraint, we formulate the minimum edge edit operation problem.

Problem 5 (minimum edge edit operation) Given two q-grams w_r and w_s, find the minimum number of edge edit operations to convert deg(w_r) to deg(w_s).

The above problem can be converted to a minimum cardinality perfect b-matching problem [16] and solved in O(q^4 log q) time. Next, we consider an additional constraint on this problem, and hence compute it more efficiently as well as obtain a filtering strategy with greater pruning power.

6.2 Leveraging existing edges on q-gram vertices

The minimum edge edit operation problem seeks the fewest edit operations to adjust the degrees of a q-gram, but ignores the existing edges incident on the vertices of the q-gram. We refer to the edges that are not included in the q-gram but both of whose ends belong to the q-gram as existing edges. Consider the following example.

Example 11 Figure 6 depicts the two q-grams in Example 10, with existing edges between the vertices in r and s, shown in dashed lines. Looking into the edit operations that make r isomorphic to s, where w_r[i] maps to w_s[i] in the resulting bijection, we observe that any solution must contain deleting edge e(C1, N) (Op. 2) and inserting edge e(C2, C3) (Op. 1). Afterward, the degrees in w_r become 2, 3, 2, and 2, from left to right. To change them to 4, 3, 3, and 2, respectively, three operations in Op. 3 and 4 are required. In all, five edge edit operations are necessary.

Fig. 6 Idea of exploiting existing edges

The above example shows that existing edges may incur edit operations. Apart from edge insertion (Op. 1) and deletion (Op. 2), changing edge labels may also happen, since these edges are outside the q-grams and may differ in labels. To this end, we introduce another edge edit operation.

– Op. 5: change the label of an edge e(u, v), u, v ∈ V(w_r).

After comparing the two q-grams and the existing edges, it is easy to obtain the necessary operations in Op. 1, 2, and 5. Note that these are also the only operations involving a pair of vertices in the q-gram. The degrees in the q-grams change thereafter, and then the new differences in the degrees are eliminated using Op. 3 and 4. The pseudo-code of the algorithm is presented in Algorithm 8, and the degree-based matching condition is summarized in Condition 2.

Algorithm 8: CheckDegree (w_r, w_s, r, s, τ)
Input : w_r and w_s are two q-grams satisfying Condition 1; r and s are their original graphs; τ is an edit distance threshold.
Output : A boolean indicating whether w_r and w_s satisfy the degree-based matching condition.
1  ε ← 0;
2  for each i ∈ [1, q + 1] do
3      Δ_i ← deg(w_r[i]) − deg(w_s[i]);
4  for each i ∈ [1, q] do
5      for each j ∈ [i + 1, q + 1] do
6          if e(w_r[i], w_r[j]) ∈ E(r) and e(w_s[i], w_s[j]) ∈ E(s) and l_E(e(w_r[i], w_r[j])) ≠ l_E(e(w_s[i], w_s[j])) then
7              ε ← ε + 1; /* Op. 5 */
8          if e(w_r[i], w_r[j]) ∉ E(r) and e(w_s[i], w_s[j]) ∈ E(s) then
9              ε ← ε + 1; /* Op. 1 */
10             Δ_i ← Δ_i + 1; Δ_j ← Δ_j + 1;
11         if e(w_r[i], w_r[j]) ∈ E(r) and e(w_s[i], w_s[j]) ∉ E(s) then
12             ε ← ε + 1; /* Op. 2 */
13             Δ_i ← Δ_i − 1; Δ_j ← Δ_j − 1;
14 for each i ∈ [1, q + 1] do
15     ε ← ε + |Δ_i|; /* Op. 3 and 4 */
16 if ε ≤ τ then return true
17 else return false

Condition 2 (degree-based match) Consider two q-grams w_r and w_s satisfying Condition 1, and threshold τ. w_r matches w_s if no more than τ edge edit operations are necessary to transform (1) deg(w_r) to deg(w_s), and (2) the existing edges incident on V(w_r) to those incident on V(w_s).

Algorithm 8 checks whether two q-grams match under Condition 2 in O(q^2) time. It is more efficient than the solution to the minimum edge edit operation problem using b-matching, and the candidates it prunes subsume all those that can be pruned by the latter. To integrate it into the algorithm for graph similarity search queries, we change Line 9 in Algorithm 3 to "A[r] ← CheckDegree(w_r, w_s, r, s, τ)."

Symmetric q-grams may have asymmetric degree sequences. To deal with this case, the pair of q-grams needs to be checked from both sides, that is, Algorithm 8 is run on w_r against w_s as well as against the inversion of w_s. Satisfying Condition 2 in either direction makes the pair of q-grams match.

In relation to the space cost imposed by utilizing vertex degrees, recall that we check Condition 1 with the hash codes of q-grams' label sequences. Supposing a label sequence is hashed into a 4-byte integer, an additional q + 1 bytes are needed here to record the q-gram's vertex identifiers, which are used to retrieve degrees and existing edges from the data graphs. In all, our algorithm needs a total of q + 5 bytes to store a q-gram. In particular, we keep only hash codes in the inverted index; only when the hash codes match do we check Condition 2 using the q-gram sets and the data graphs.

We remark that considering the degree information associated with a q-gram restricts the possible matches of a q-gram, which in turn reduces the candidates. One may compare q-grams under the matching condition exploiting degree information to positional q-grams [10] in string similarity queries, which only match if their positions in the strings differ by at most τ. Vertex degree filtering is orthogonal to count filtering and prefix filtering, and thus, conducting it after the two existing filters does not affect the correctness. Moreover, it utilizes more structural information to improve the selectivity of q-grams, while retaining the advantage of tighter lower bounds of path-based q-grams over tree-based q-grams regarding the count filtering condition.

7 Verification algorithm

The verification comprises (1) multiple filters that prune unpromising candidates and (2) edit distance computation.

7.1 Integrating multiple filters

Algorithm 9 presents the Verify algorithm. The candidates are examined by three filters in succession: global label filtering (Lines 3–4), count filtering (Lines 5–6), and local label filtering (Lines 7–9). We put global label filtering first because it prunes graphs that disagree on labels with a small cost, O(|V| + |E|) for each check, whereas the worst-case complexities of the latter two are O(|Q|^2 q^2) and O(|V|^τ + q|Q| + |E|), respectively. Count filtering is invoked before local label filtering, since the latter takes as input the set of mismatching q-grams, which is a by-product of count filtering. After the three filters, those still surviving are verified by the expensive edit distance computation.

It is worth mentioning the CompareQGrams algorithm in Line 5. Using Conditions 1 and 2, it extracts the sets of mismatching q-grams in both r and s, returned in Q′_r and Q′_s, respectively. In addition, it computes the numbers of mismatching q-grams in r and s, returned in ε_2 and ε_3, respectively. We note that the multiple matching of common q-grams is allowed in computing the sets but disallowed in computing the numbers. For instance, if w_r and w′_r both match w_s, one of them, either w_r or w′_r, has to mismatch. As a result, this mismatching q-gram contributes to the number of mismatching q-grams in ε_2 to be tested by count filtering. However, it is not included in Q′_r, because we are unsure whether the mismatching one is w_r or w′_r.
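The cheapest-first ordering of filters used in verification can be illustrated with a small Python sketch. It is a simplification and not the paper's Verify routine: each filter is modeled as a hypothetical callable returning a lower bound on the edit distance, whereas count filtering in the paper compares numbers of mismatching q-grams against τ · D_path; the function and parameter names are likewise assumptions.

```python
def verify_candidates(candidates, query, tau, filters, exact_distance):
    """Keep the candidates whose exact distance to the query is within tau.

    filters: callables (r, s) -> lower bound of the edit distance,
             listed from cheapest to most expensive.
    exact_distance: callable (r, s) -> exact edit distance (expensive).
    """
    results = []
    for r in candidates:
        # any() stops at the first filter whose bound already exceeds tau,
        # so the costlier filters and the exact computation are skipped.
        if any(f(r, query) > tau for f in filters):
            continue
        if exact_distance(r, query) <= tau:
            results.append(r)
    return results
```

Because every filter returns a lower bound, pruning on `bound > tau` never discards a true result; the order of the filters affects only the running time.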

Algorithm 9: Verify (s, A)
Input : s is a query graph; A is a map indicating s's candidates.
Output : T = { r | ged(r, s) ≤ τ }.
1  T ← ∅;
2  for each r such that A[r] = true do
3      ε_1 ← Γ(L_V(r), L_V(s)) + Γ(L_E(r), L_E(s)); /* global label filtering */
4      if ε_1 ≤ τ then
5          (Q′_r, Q′_s, ε_2, ε_3) ← CompareQGrams(Q_r, Q_s); /* count filtering */
6          if ε_2 ≤ τ · D_path(r) and ε_3 ≤ τ · D_path(s) then
7              ε_4 ← LocalLabelFilter(Q′_r, s); /* local label filtering */
8              ε_5 ← LocalLabelFilter(Q′_s, r); /* local label filtering */
9              if ε_4 ≤ τ and ε_5 ≤ τ then
10                 edit ← GraphEditDistance(r, s);
11                 if edit ≤ τ then T ← T ∪ { r };
12 return T

7.2 Graph edit distance computation

Most widely used exact approaches for computing graph edit distance are based on the A* algorithm [11]. We first review a state-of-the-art approach for graph edit distance [19] and then see how our techniques can be employed for speedup.

A* explores the whole possible vertex mapping space between two graphs in a best-first fashion. It maintains a priority queue of states such that each state represents a partial vertex mapping, associated with a "priority" via function f(x). f(x) is the sum of two functions: (1) the edit operations observed from the initial state to the current state (denoted g(x)); and (2) a heuristic estimate of the edit operations that will occur from the current state to the goal, that is, a state with all the vertices mapped (denoted h(x)). A* guarantees to find the optimal vertex mapping whenever the goal is popped from the queue, provided h(x) is admissible, that is, h(x) does not overestimate the distance from the current state to the goal.

With no vertex mapped initially, we form a new state by mapping a vertex of r to either a vertex of s, or none to imply vertex deletion. g(x) is the number of edit operations between the partial graphs regarding the current mapping. For h(x) in weighted graph edit distance, [19] gives an estimation of the edit distance between the remaining parts via bipartite matching. For our unweighted case, h(x) becomes exactly the result of "global" label filtering:

h(x) = Γ(L_V(r_q), L_V(s_q)) + Γ(L_E(r_q), L_E(s_q)),

where r_q (resp. s_q) is constituted of the current unmapped vertices of r (resp. s) and their incident edges.

Algorithm 10 details the A* algorithm to compute graph edit distance. At first, an order of vertex mapping is determined (Line 1, to be discussed shortly). Starting from an initial state with no vertex mapped, the vertices of r are mapped one by one in the aforementioned order. A new state is formed (Lines 9–12) by including a vertex of r and its counterpart: a dummy vertex to indicate vertex deletion, or an unmapped vertex of s within τ in terms of degree difference. Then, g(x) and h(x) are computed, and the state is inserted into the priority queue thereafter (Line 15). The algorithm terminates when all the vertices of r have been mapped (Line 7), or no vertex mapping within threshold τ is achieved (Line 16).

Algorithm 10: GraphEditDistance (r, s)
Input : r is a data graph; s is a query graph.
Output : ged(r, s), if ged(r, s) ≤ τ; or τ + 1, otherwise.
1  M ← DetermineVertexOrder(r);
2  initial.V_r ← ∅, initial.V_s ← ∅, initial.n ← 0;
3  Q.push(initial); /* a priority queue */
4  while Q ≠ ∅ do
5      current ← Q.pop();
6      if current.n = |V(r)| then
7          return current.g(x)
8      v ← M[current.n + 1];
9      for each v′ ∈ V(s) or a dummy vertex, such that v′ ∉ current.V_s and |deg(v) − deg(v′)| ≤ τ do
10         next.V_r ← current.V_r ∪ { v };
11         next.V_s ← current.V_s ∪ { v′ };
12         next.n ← current.n + 1;
13         next.g(x) ← ExistingDistance(next);
14         next.h(x) ← EstimateDistance(next);
15         if next.g(x) + next.h(x) ≤ τ then Q.push(next);
16 return τ + 1

7.2.1 Optimizing search order

We observe that the basic A* algorithm does not discuss the impact of search order on the efficiency of the algorithm. Due to the removal of unpromising candidates with multiple filters, the pairs to be verified are very likely to resemble each other, though they may not satisfy the edit distance constraint. For two graphs whose edit distance is not within the threshold, the isomorphic part of the graphs does not incur any edit operations; therefore, if we start with this part and proceed in a threshold-based manner, the algorithm does not terminate until a very late stage of the search process. In contrast, the process ends more quickly if we start with the parts that need edit operations.

Recall the mismatching q-grams identified by the CompareQGrams algorithm. The mismatching q-grams indeed contribute edit operations and hence should be favored. Algorithm 11 exploits this idea and determines the order of vertices to be processed by the A* algorithm. The vertices contained by at least one mismatching q-gram are put before the others. In the interest of connectivity, we break ties by mapping vertices in the order of a spanning tree, so as to expedite the discovery of edge edit operations. Moreover, we pick as the tree root the vertex with the most infrequent label in the graph, and ties are broken arbitrarily. Using such an order leverages the connectivity of a graph and can quickly find edge edit operations. For instance, assume that in r, u and v are adjacent vertices in the spanning tree and they are mapped to u′ and v′ in s, respectively. An edge edit operation occurs if there is no edge between u′ and v′ in s.

Algorithm 11: DetermineVertexOrder (r, Q′_r)
Input : r is a graph; Q′_r is a set of mismatching q-grams from r to s.
Output : An array of vertices for which the A* algorithm will find mappings in order.
1  M ← [];
2  C ← the connected components formed by Q′_r;
3  for each c_i ∈ C do
4      Insert vertices in c_i into M in the order of spanning tree;
5  Insert the vertices not contained by any mismatching q-gram into M in the order of spanning tree;
6  return M

Example 12 Consider the graphs in Fig. 7, q = 1 and τ = 4. Mismatching q-grams are C1-O, C3-F1, and C3-F2, yielding two connected components as bounded by dashed lines. Thus, we put the vertices in these two components ahead of the others and start with C1-O. They are ordered as O ≺ C1, since O is less frequent than C in r. Then, we order the component consisting of C3, F1, and F2. As F is less frequent than C, F1 (for example) is picked as the root, and we have F1 ≺ C3 ≺ F2 by the order of spanning tree. Finally, we append the remaining vertex C2 and obtain the search order O ≺ C1 ≺ F1 ≺ C3 ≺ F2 ≺ C2.

7.2.2 Optimizing heuristic estimation

Any lower bound of edit distance can serve as the heuristic estimate h(x) while keeping the A* algorithm admissible. We consider not only global label filtering but also local label filtering in h(x). The mismatching q-grams in the remaining graphs composed of unmapped vertices are first extracted and then sent into local label filtering to get lower bounds of edit distance between the two remaining graphs. Algorithm 12 provides the pseudocode of the algorithm. Note that we compute mismatching q-grams from both r_q to s_q and s_q to r_q, and hence have two lower bounds from local label filtering. The lower bound from global label filtering is also considered. The maximum of the three is returned as the result of the heuristic estimate.

Algorithm 12: EstimateDistance (r_q, s_q)
Input : r_q and s_q are two graphs consisting of unmapped vertices.
Output : A lower bound of ged(r_q, s_q).
1  ε_1 ← Γ(L_V(r_q), L_V(s_q)) + Γ(L_E(r_q), L_E(s_q));
2  (Q′_r, Q′_s) ← CompareQGrams(r_q, s_q);
3  ε_2 ← LocalLabelFilter(Q′_r, s_q), ε_3 ← LocalLabelFilter(Q′_s, r_q);
4  h ← max(ε_1, ε_2, ε_3);
5  return h

8 Graph similarity join

Algorithm 13: GSimJoin (R, τ)
Input : R is a collection of graphs; τ is an edit distance threshold.
Output : T = { ⟨r, s⟩ | ged(r, s) ≤ τ }.
1  T ← ∅;
2  I_i ← ∅ (1 ≤ i ≤ |U|); /* inverted index */
3  for each r ∈ R do
4      A ← empty map from id to boolean;
5      Q_r ← r's q-grams sorted in O;
6      p_r ← MinPrefixLen(Q_r);
7      for i = 1 to p_r do
8          w ← Q_r[i];
9          for each s ∈ I_w such that A[s] has not been initialized do
10             if abs(|V(r)| − |V(s)|) + abs(|E(r)| − |E(s)|) ≤ τ then
11                 A[s] ← true; /* find a candidate */
12         I_w ← I_w ∪ { r }; /* index for q-gram w */
13     T ← T ∪ Verify(r, A);
14 return T

Fig. 7 Example of ordering vertices

960

As a batch version of graph similarity searches, the proposed techniques are ready to be extended to graph similarity join queries. In this section, we introduce the algorithms for selfjoin first and then R-S join.

964

8.1 Algorithm for self-join
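The index nested loops style of GSimJoin (Algorithm 13) can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dict-based graph representation (keys "nv" and "ne" for the vertex and edge counts) and the three callables `sorted_qgrams`, `min_prefix_len`, and `verify` are hypothetical stand-ins for q-gram extraction in the global order, MinPrefixLen, and the edit distance verification.

```python
from collections import defaultdict


def gsim_self_join(graphs, tau, sorted_qgrams, min_prefix_len, verify):
    """Sketch of an index nested loops self-join over a list of graphs.

    Each graph is a dict with keys "nv" and "ne" (vertex and edge counts).
    The three callables are assumed helpers supplied by the caller.
    """
    index = defaultdict(list)  # q-gram -> ids of graphs indexed so far
    results = []
    for rid, r in enumerate(graphs):
        qr = sorted_qgrams(r)
        prefix = qr[:min_prefix_len(qr, tau)]
        candidates = set()
        # dict.fromkeys drops repeated q-grams while keeping their order
        for w in dict.fromkeys(prefix):
            for sid in index[w]:
                s = graphs[sid]
                # size filtering on vertex and edge counts
                if abs(r["nv"] - s["nv"]) + abs(r["ne"] - s["ne"]) <= tau:
                    candidates.add(sid)
            index[w].append(rid)  # make r visible to later graphs
        for sid in sorted(candidates):
            if verify(graphs[sid], r, tau):
                results.append((sid, rid))
    return results
```

Because a graph is inserted into the index only after it has probed it, each unordered pair is examined at most once, which is the reason for building the index on the fly in a self-join.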


Combining count filtering, prefix filtering, minimum edit filtering, and local label filtering, we present a graph similarity join algorithm GSimJoin (Algorithm 13). It takes as input a collection of data graphs and follows an index nested loops join style, maintaining an inverted index on the fly. It iterates through each graph r ∈ R. According to Lemmas 2 and 3, the minimum prefix length is calculated by the MinPrefixLen algorithm for each graph r (Line 6). In addition, the numbers of vertices and edges in r and s ∈ R must differ by at most τ (Line 10). For each q-gram w in Q_r's prefix, it probes the inverted index to find other graphs s that contain w in their prefixes, satisfying Conditions 1 and 2. The candidates are sent into the Verify algorithm. Afterward, r is inserted into w's posting list for future use (Line 12). The algorithm eventually returns all pairs of graphs ⟨r, s⟩ such that ged(r, s) is no more than the given threshold.

8.2 Algorithm for R-S join

The algorithm for joining two different graph databases (Algorithm 14) is designed in index nested loops join style. It consists of two phases: (1) an indexing phase that builds an inverted index on the inner relation, and (2) a joining phase that scans the outer relation and finds join results with the index. The former duplicates the indexing phase of graph similarity search, and the latter is equivalent to invoking the similarity query answering phase multiple times.

Algorithm 14: GSimJoin(R, S, τ)
Input: R and S are two collections of graphs; τ is an edit distance threshold.
Output: T = { ⟨r, s⟩ | ged(r, s) ≤ τ, r ∈ R, s ∈ S }.
I ← GSimIndex(R, τ); /* build index */
for each s ∈ S do
    T ← T ∪ GSimQuery(s, I, τ); /* find results */
return T

As the two relations may differ in size, the efficiency of R-S join is influenced by the choice of inner and outer relations. Assuming that both relations and the inverted index fit into main memory and that R and S follow the same data distribution, we analyze the join costs with R and S being the inner relation, respectively. Consider R as the inner relation. The indexing phase computes the prefix length for each r ∈ R and inserts prefixes into the inverted index. The cost of the indexing phase is

$$C_{index} = |R| \cdot c_p + |R| \cdot l \cdot c_i,$$

where c_p is the cost of generating the q-grams of a graph and computing its minimum prefix length, c_i the cost of a posting list insertion, and l the average prefix length. The cost in the joining phase is divided into three parts: (1) generating q-grams and computing minimum prefixes for graphs in S, (2) probing the inverted index to generate candidates, and (3) verifying candidates. The cost of the first part is |S| · c_p. The second part is proportional to the total number of inverted index accesses and thus can be modeled using the frequencies of the q-grams in both R's and S's prefixes. The third part depends on the number of candidates. Let p_R denote the set of q-grams comprising the prefixes of graphs in R, and p_S the corresponding set for S; the cost in the joining phase is

$$C_{join} = |S| \cdot c_p + \sum_{w \in p_R \wedge w \in p_S} c_a + n_c \cdot c_v,$$

where c_a is the cost of an inverted index access, c_v is the cost of one candidate verification, and n_c is the number of candidates to be verified.

Summing up C_index and C_join, we have the total cost C_R for the case where R is the inner relation. Swapping R and S in the above equations yields the cost C_S for the join with S as the inner relation. The candidate sizes are the same in both cases. Thus, the difference between the two costs is

$$C_R - C_S = (|R| - |S|) \cdot l \cdot c_i.$$

The above equation shows that, for in-memory R-S joins, it is more efficient to choose the smaller relation as the inner relation on which the index is built, and the gap becomes more substantial under larger thresholds due to longer prefixes. Besides index nested loops join, another way to leverage the index is to first build inverted indexes on R and S, respectively, and then join the two indexes to find candidates. In this case, the cost of indexing the two relations is

$$C'_{index} = |R| \cdot c_p + |R| \cdot l \cdot c_i + |S| \cdot c_p + |S| \cdot l \cdot c_i.$$

In the joining phase, for each q-gram w that appears in both p_R and p_S, the two inverted lists of w are accessed and merged to derive candidates. Thus, the join cost is

$$C'_{join} = \sum_{w \in p_R \wedge w \in p_S} 2 \cdot c_a + n_c \cdot c_v.$$

Summing up C′_index and C′_join yields the total cost C_B of the strategy that joins two indexes to derive candidates. Given the identical number of candidates n_c for verification, it is clear that C_B is larger than both C_R and C_S.

9 Subgraph similarity search

This section extends the solution to the problem of subgraph similarity search with edit distance constraints. For ease of exposition, we define the subgraph edit distance from s to r, denoted by sub_ged(s, r), as the minimum number of edit operations that transform s into a graph r′ such that r′ is subgraph isomorphic to r. The example below illustrates the subgraph edit distance from one graph to another.

Example 13 Figure 8 shows two molecules after omitting hydrogen atoms. Atoms are modeled by vertex labels. Single and double bonds are modeled by edge labels. The subgraph edit distance from s to r is sub_ged(s, r) = 3.

Fig. 8 Example of subgraph edit distance

Note that subgraph edit distance is not symmetric. Given two graphs r and s, sub_ged(r, s) may not be equal to sub_ged(s, r). In this paper, the subgraph edit distance constraint is defined from a query graph to a data graph. Next, we first present the algorithmic framework for subgraph similarity search queries, then adapt the multiple filters, and finally describe the verification procedure.

9.1 Algorithmic framework

The subgraph similarity search algorithm is composed of two phases. In the indexing phase, it takes as input a collection of graphs and builds an inverted index on the q-grams of the data graphs. In the query answering phase, it first decomposes the given query into q-grams and computes its prefix. For each q-gram w in the query's prefix, it probes the index to find candidates that contain w in their q-gram sets. The candidate graphs satisfying size filtering are sent to VerifySub to determine whether they are query results.

For subgraph similarity search, the query may appear (approximately) anywhere in a data graph. In order not to miss any results, we have to track all parts of the data graph. This gives rise to the major difference between subgraph similarity search and graph similarity search: the whole q-gram set of each data graph is recorded in the inverted index. Nonetheless, prefix filtering still applies to the query graph. The query's prefix length is the same as that for graph similarity queries, and only the q-grams within the query graph's prefix are used to generate candidates. Further differences and the necessary modifications are discussed next.

9.2 Adapting multiple filters

We study how to adapt the filters to subgraph similarity queries, including size filtering, count filtering, minimum edit filtering, local label filtering, and vertex degree filtering. In the subgraph similarity setting, size filtering removes a data graph if it is smaller than the query by more than τ vertices and edges together, as stated in the following lemma.

Lemma 6 (size filtering for subgraph) Consider data graph r and query s. If sub_ged(s, r) ≤ τ, then Λ(|V(s)|, |V(r)|) + Λ(|E(s)|, |E(r)|) ≤ τ, where Λ(A, B) equals A − B if A > B, and 0 otherwise.

On subgraph similarity queries, count filtering only seeks matches for q-grams in Q_s, but not vice versa.

Lemma 7 (count filtering for subgraph) Consider query graph s and data graph r. If sub_ged(s, r) ≤ τ, then s and r share at least Sub_LB_path = |Q_s| − τ · D_path(s) common q-grams.

Minimum edit filtering looks into the mismatching q-grams Q′_s from Q_s to Q_r, a by-product of the counting process above. Applying MinEdit on Q′_s gives a lower bound.

Lemma 8 (minimum edit filtering for subgraph) Consider query graph s, data graph r, and Q′_s as the mismatching q-grams from Q_s to Q_r. If sub_ged(s, r) ≤ τ, then min-edit(Q′_s) ≤ τ.

Lemma 9 (label filtering for subgraph) Consider query graph s and data graph r. If sub_ged(s, r) ≤ τ, then

– global: |L_V(s)\L_V(r)| + |L_E(s)\L_E(r)| ≤ τ; and
– local: |L_V(s′)\L_V(r)| + |L_E(s′)\L_E(r)| ≤ τ, ∀s′ ⊑ s.

We apply the global label filtering on whole graphs, and the local label filtering on the connected components of mismatching q-grams. With the mismatching q-grams from Q_s to Q_r, minimum edit filtering and local label filtering are applied to each connected component to derive two distance lower bounds. The larger one is chosen as the edit distance lower bound within the component, and the bounds of all components are summed up.

Vertex degree filtering finds the fewest edit operations needed to convert w_s's degree sequence to that of w′_s such that the degree sequence of w_r is inclusive of that of w′_s, that is, deg(w_r[i]) ≥ deg(w′_s[i]), i ∈ [1, q + 1]. We also enforce the existing edges on w_r's vertices to be inclusive of those on the vertices of w′_s, that is, ∀u, v ∈ w′_s, if e(u, v) ∈ E(s), then e(f(u), f(v)) ∈ E(r).

Condition 3 (degree-based match for subgraph) Consider q-grams w_r and w_s satisfying Condition 1. Given a threshold τ, w_r matches w_s if no more than τ edge edit operations are needed to make (1) deg(w_r) inclusive of deg(w_s) and (2) the existing edges in w_r inclusive of those in w_s.

Fig. 9 Example of degree-based match for subgraph

Example 14 Consider the q-grams in Fig. 9 and τ = 2. Dashed lines represent the existing edges of the q-grams. By deleting in w_s the existing edge between C1 and N, and deleting one edge at C3, we can change deg(w_s) to 1, 2, 2, 1, which is included by deg(w_r): 1, 3, 2, 1. It takes only two edge edit operations, and thus w_r matches w_s.

To test whether w_r and w_s match under Condition 3, we use Algorithm 8 with the following modifications: (1) We first consider the existing edges in w_s and apply the necessary Op. 1, 2, and 5, resulting in a new degree sequence for w′_s. (2) We sum up the differences at the positions where deg(w′_s[i]) is greater than deg(w_r[i]), meaning that only Op. 4 is applied. The results are summed up and then compared with τ.

We summarize the verification for subgraph similarity search in Algorithm 15. The algorithm accepts as input a query graph and iterates to verify one candidate at a time.
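The size filter of Lemma 6 and the global label filter of Lemma 9 are simple enough to sketch in a few lines of Python. The sketch below is illustrative only: it assumes that L_V and L_E are label multisets, and the dict-based graph representation with "vlabels" and "elabels" lists (one label per vertex or edge) is a hypothetical choice made for this example.

```python
from collections import Counter


def lam(a, b):
    """Lambda(A, B) from Lemma 6: A - B if A > B, otherwise 0."""
    return a - b if a > b else 0


def multiset_diff_size(xs, ys):
    """Size of the multiset difference of xs minus ys."""
    # Counter subtraction keeps only labels with a positive remaining count
    return sum((Counter(xs) - Counter(ys)).values())


def passes_subgraph_filters(query, data, tau):
    """Return False if size or global label filtering prunes `data`.

    Both graphs are dicts with "vlabels" and "elabels" lists, an assumed
    representation with one label per vertex and per edge.
    """
    size_lb = (lam(len(query["vlabels"]), len(data["vlabels"]))
               + lam(len(query["elabels"]), len(data["elabels"])))
    if size_lb > tau:
        return False
    label_lb = (multiset_diff_size(query["vlabels"], data["vlabels"])
                + multiset_diff_size(query["elabels"], data["elabels"]))
    return label_lb <= tau
```

Both checks are one-sided: only what the query has in excess of the data graph counts, which reflects the asymmetry of sub_ged(s, r).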

Algorithm 15: VerifySub(s, A)
Input: s is a query graph; A is a map indicating s's candidates.
Output: T = { r | sub_ged(s, r) ≤ τ }.
1   T ← ∅;
2   for each r such that A[r] = true do
3       ε1 ← |L_V(s)\L_V(r)| + |L_E(s)\L_E(r)|;
4       if ε1 ≤ τ then
5           (Q′_s, ε2) ← CompareQGrams(Q_s, Q_r);
6           if ε2 ≤ τ · D_path(s) then
7               ε3 ← LocalLabelFilter(Q′_s, r);
8               if ε3 ≤ τ then
9                   edit ← SubgraphEditDistance(s, r);
10                  if edit ≤ τ then T ← T ∪ { r };
11  return T

Global label filtering (Lines 3–4), count filtering (Lines 5–6), and local label filtering (Lines 7–8) are employed successively. The CompareQGrams algorithm in Line 5 extracts the mismatching q-grams from Q_s to Q_r, returned in Q′_s, as well as their number in ε2. Surviving candidates are verified through the final subgraph edit distance computation. We use the A*-based algorithm to handle the final verification with a few minor modifications. The mismatching q-grams from s to r are collected to determine the order of vertex mapping, and the adapted label filtering for subgraph is used to deliver an improved estimate h(x).

10 Extensions

This section discusses the extension to directed multigraphs. A directed multigraph, or multidigraph, r = (V, E, l_V, l_E) is a labeled graph such that (1) the edges are directed; (2) multiple edges may exist between vertices; and (3) self-loops may exist on vertices. Apart from the edit operations defined in Sect. 2.1, for multidigraphs we allow another operation: changing the direction of an edge. The aforementioned techniques can be directly applied except for the q-gram extraction of a multidigraph. The basic idea is to convert it to a simple graph, generate q-grams, and then recover multiple directed edges and self-loops.

Example 15 Figure 10 sketches a multidigraph r. Edge labels are omitted, and subscripts are added to the carbon atoms for ease of exposition. First, we construct an underlying simple graph r̄ of r by

– replacing multiple directed labeled edges between vertices with one single undirected unlabeled edge, and
– discarding self-loops at vertices.

For instance, the three directed edges between C1 and C3 in r are converted to one undirected edge in r̄, the directed edges between C2 and C3 are changed to one undirected edge, and the self-loop at C1 is discarded. Then, we extract path-based q-grams from r̄. Two q-grams are obtained, as shown in Q_r̄. Finally, for each q-gram, we recover the directions, multiple edges, and self-loops that exist in the multidigraph; for example, in C2-C3, multiple directed edges replace the single edge, C1-C3 is oriented, and a self-loop is assigned to C1, as shown in Q_r.

Fig. 10 Example of extracting multidi q-grams

To distinguish the two types of q-grams, the q-grams extracted from the underlying simple graph are called "simple q-grams," and the q-grams with directions, multiple edges, and self-loops recovered are called "multidi q-grams." Next, we present the matching condition of multidi q-grams. As we employ simple q-grams as intermediates to obtain multidi q-grams, the vertex sequence of a multidi q-gram is the same as that of its simple q-gram. Hence, composing the multidi q-gram's label sequence is straightforward. Denote by deg−(w_r) (resp. deg+(w_r)) the in-degree (resp. out-degree) sequence of w_r, comprising the in-degrees deg−(w_r[i]) (resp. out-degrees deg+(w_r[i])) of the vertices in w_r, i ∈ [1, q + 1].

Condition 4 (multidi q-gram matching condition) Given a threshold τ, two multidi q-grams w_r and w_s match if

– w_r and w_s are isomorphic; and
– no more than τ edit operations are needed to convert (1) deg−(w_r) to deg−(w_s), (2) deg+(w_r) to deg+(w_s), and (3) the existing edges incident on w_r's vertices to those incident on w_s's vertices.

Thanks to the label sequences of the multidi q-grams, the isomorphism test can be done in O(|E_r| + |E_s|) time, where E_r and E_s are the edges in r and s, respectively. We first compare the label sequences of the two q-grams (hash codes can be used here for an O(1) check). If the label sequences are identical, we check whether the directions, multiple edges, and self-loops contained in both q-grams are the same through a sequential scan on both q-grams.³ After the isomorphism test, we check whether they are degree-based matching using the technique presented in Sect. 6. Note that in-degrees and out-degrees are treated separately, and the returned operation numbers are summed up to be compared with τ.

³ Two sequential scans on w_s if the sequences are symmetric.

Count filtering is based on the minimum number of common q-grams after applying τ edit operations. We observe that the edit effects of operations on multidi q-grams are equivalent to those on simple q-grams. Let Q_r denote the set of r's multidi q-grams, Q_r^u the multidi q-grams containing vertex u, and Q_r^{uv} the multidi q-grams containing consecutive vertices u and v. In particular,

– inserting a vertex or an edge affects no multidi q-grams;
– deleting a vertex v, changing its label, inserting or deleting a self-loop incident on the vertex, or changing the label or direction of the vertex's self-loop affects |Q_r^v| multidi q-grams; and
– deleting an edge e(u, v), or changing its label or direction, affects |Q_r^{uv}| multidi q-grams.

As |Q_r^{uv}| ≤ |Q_r^u| ≤ max_{u∈V(r)} |Q_r^u|, the maximum number of multidi q-grams that can be affected by one edit operation is D_multidi(r) = max_{u∈V(r)} |Q_r^u|. The lower bound of common multidi q-grams for count filtering is LB_multidi = max(|Q_r| − τ · D_multidi(r), |Q_s| − τ · D_multidi(s)).

Since a multidi q-gram is constructed from its corresponding simple q-gram, they contain exactly the same sequence of vertices. Thus, the lower bound LB_multidi equals the lower bound LB_path derived on the underlying simple graph. To apply minimum edit filtering, we need to solve the minimum edit operation problem on mismatching multidi q-grams. We may apply vertex label substitutions on the mismatching multidi q-grams and obtain the minimum number of edit operations required to affect all of them. We also apply the local label filtering on the mismatching multidi q-grams, and the filtering rationale remains the same as for simple graphs. As with count filtering, the results of these filtering techniques are equivalent to those obtained by invoking them on the underlying simple graph. All the candidates passing the multiple filters are verified by Algorithm 10, except that differences in directions, multiple edges, and self-loops are added to g(x) and estimated in h(x) when a vertex is mapped.

11 Experiments

In this section, we report experimental results and analyses.

11.1 Experiment setup

Two publicly available real-life datasets were used:

– AIDS is an antivirus screen compound dataset from the Developmental Therapeutics Program in NCI/NIH.⁴ It contains 42,687 chemical compounds. We randomly sampled 4,000 graphs to make up the set of data graphs.
– PROTEIN is a protein database from the Protein Data Bank,⁵ consisting of 600 protein structures. Vertices represent secondary structure elements and are labeled with their types: helix, sheet, and loop. Edges are labeled with lengths in amino acids.

Statistics of the datasets are listed in Table 1. The graph density d, defined as 2|E| / (|V|(|V| − 1)), influences the number of path-based q-grams: the greater the graph density, the more path-based q-grams in a graph. The maximum degree implies the edit effect of a single edit operation. Besides real-life datasets, synthetic datasets were generated. The synthetic graph generator⁶ measures graph size by the number of edges. The graph density is 0.3 by default, and the cardinalities of the vertex and edge label domains are set to 2 and 1, respectively. We applied these default settings unless otherwise specified. We randomly sampled 100 graphs from the data graphs and added a random number of edit operations within [0, τ] to make up the corresponding sets of query graphs.

All the experiments were carried out on a machine with a Quad-Core AMD Opteron Processor 8378@800 MHz and 96 GB RAM, running Ubuntu 10.04.1 LTS. All the algorithms were implemented in C++ and ran in main memory. We measured (1) average prefix length; (2) index size; (3) index construction time, including q-gram extraction, prefix length computation, and inverted list construction; (4) the number of candidates identified by the inverted index and surviving size filtering (denoted Cand-1); (5) the number of candidates that need ged computation (denoted Cand-2); and (6) query response time, including candidate generation time (query's q-gram extraction included) and ged computation time. Cand-1, Cand-2, and query response time were logged and reported on the basis of 100 search queries unless otherwise specified.

Table 1 Dataset statistics

Dataset    |R|     Avg |V|   |l_V|/|l_E|   d      Min/max degree
AIDS       4,000   25.76     44/3          0.08   0/12
PROTEIN    600     32.63     3/5           0.12   0/9

⁴ http://dtp.nci.nih.gov/docs/aids/aids_data.html
⁵ http://www.iam.unibe.ch/fki/databases/iam-graph-database/download-the-iam-graph-database
⁶ http://www.cse.ust.hk/graphgen/
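For reference, the graph density d = 2|E| / (|V|(|V| − 1)) used in the dataset statistics can be computed as in the short Python sketch below. The function name and the convention of returning 0 for graphs with fewer than two vertices are choices made for this illustration, not something specified in the paper.

```python
def graph_density(num_vertices, num_edges):
    """Density d = 2|E| / (|V| (|V| - 1)) of a simple undirected graph."""
    if num_vertices < 2:
        # No vertex pairs exist; returning 0.0 is a convention assumed here.
        return 0.0
    return 2.0 * num_edges / (num_vertices * (num_vertices - 1))
```

For example, a graph with 5 vertices and 3 edges has density 2 · 3 / (5 · 4) = 0.3, which is the default density of the synthetic datasets.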

[Figure 11: eight panels on AIDS, each plotted against the ged threshold τ from 1 to 5. Series in the filtering panels: Basic GSimSearch, + MinEdit, + Local Label, + Degree, and Real Result. Series in the ged computation panel: A*, + Improved Order, + Improved h(x). The response time panels split each bar into candidate generation and ged computation.]

Fig. 11 Effect of filters and ged computation. a AIDS, prefix length. b AIDS, index size. c AIDS, indexing time. d AIDS, Cand-1. e AIDS, Cand-2. f AIDS, query response time. g AIDS, distance computation time. h AIDS, query response time

1296 1297 1298 1299 1300 1301 1302 1303 1304 1305 1306 1307 1308 1309 1310 1311 1312 1313 1314 1315 1316 1317 1318 1319

In order to evaluate the effectiveness of our filtering techniques, we use “Basic GSimSearch” for the GSimSearch algorithm without minimum edit, local label, or vertex degree filtering. “+ MinEdit” denotes applying minimum edit filtering to compute prefix length. “+ Local Label” denotes further applying local label filtering. “+ Degree” denotes further applying degree-based q-gram match condition, that is, the complete filtering of GSimSearch algorithm. We first study the effect of minimum edit filtering. Figure 11(a) shows the average prefix length of Basic GSimSearch and + MinEdit on AIDS dataset with q = 4 and varying edit distance threshold. + Local Label and + Degree have the same prefix length as + MinEdit and thus are omitted in this figure. The prefix lengths of both algorithms grow steadily when the threshold increases. Basic GSimSearch’s average prefix length approaches the average number of qgrams in a graph when τ > 4, as most graphs become underflowing. After applying minimum edit filtering, the prefix length is substantially shortened, up to 75 %. As index size is influenced by prefix length, we plot the memory consumption for index storage in Fig. 11b. Both algorithms need small amount of memory and exhibit a similar trend as on prefix length. The memory consumed by + MinEdit is only 586.9 kB when τ is as large as 5. Figure 11c gives indexing time. We observe a nearly constant indexing

time for Basic GSimSearch, and a growing trend for + MinEdit, taking 6.0s at τ = 5. This is expectable, since + MinEdit solves the NP-hard minimum graph edit operation problem for minimum prefix length. We will see shortly this cost is rewarding in the query processing phase. The number of Cand-1 is mainly influenced by prefix length, as plotted in Fig. 11d. The Cand-1 size can be reduced by as much as 85 % when τ = 1. As for local label and vertex degree filtering, Fig. 11e compares the number of Cand-2 produced by + MinEdit, + Local Label, and + Degree on AIDS. The number of real results is also shown in the figure. Local label filtering results in remarkable reduction on Cand-2, up to 51 %. Utilizing degree information leads to an additional 51 % reduction. To reflect the effect of filters on running time, we appended the A* algorithm [19], labeled as “A*,” to verify the candidates. The overall query response time is plotted in Fig. 11f, where embraces the following combinations:

orre

1295

11.2 Evaluating filtering methods

unc

1294

– – – –

BA: Basic GSimSearch / A*; MA: + MinEdit / A*; LA: + Local Label / A*; DA: + Degree / A*.

The overall runtime decreases when we apply more filters, as fewer candidates are sent to verification, with slight increase in candidate generation time though. The maximum speedup

123 Journal: 778 MS: 0306

TYPESET

DISK

LE

CP Disp.:2013/1/28 Pages: 26 Layout: Large

1320 1321 1322 1323 1324 1325 1326 1327 1328 1329 1330 1331 1332 1333 1334 1335 1336 1337

1338 1339 1340 1341

1342 1343 1344

Graph similarity queries with edit distance constraints

1350

11.3 Evaluating graph edit distance computation

1347 1348

1351 1352 1353 1354

1356 1357 1358 1359 1360 1361 1362 1363 1364

1365 1366 1367 1368

– – – –

BA: Basic GSimSearch / A*; MO: + MinEdit / + Improved Order; LH: + Local Label / + Improved h(x); DH: + Degree / + Improved h(x).

1378

11.4 Evaluating q-gram length

1370 1371 1372 1373 1374 1375 1376

1379 1380 1381 1382 1383 1384 1385 1386 1387 1388 1389 1390

1401

11.5 Graph similarity searches

1402

This subsection evaluates graph similarity search queries by comparing our algorithms with three alternatives.

Figures 12a–e present the results of GSimSearch on AIDS with varying q-gram length. As expected, the larger q is, the more q-grams are affected by one edit operation, and hence the longer prefix is. Accordingly, more time is needed to construct the index. Regarding Cand-1 and Cand-2, the general trend is the candidate sizes first drop with increasing q-gram length, reach the bottom at q of 3 and 4, and then rebound. There are several factors contributing to this: (1) Small q indicates a small q-gram domain, and hence the posting list of a q-gram can be fairly long. This leads to a large candidate size, for example, when q = 2. (2) Large q indicates a long prefix length. We have to probe more posting

TYPESET

DISK

LE

1393 1394 1395 1396 1397 1398 1399 1400

1403 1404

1405

We compare GSimSearch with κ-AT-Search and SEGOS on both real datasets.

1407

– GSimSearch is our proposed algorithm that utilizes path-based q-grams for graph similarity search queries. – κ-AT-Search is a state-of-the-art algorithm based on tree-based q-grams, known as κ-AT’s [27]. We reengineered this algorithm and further applied prefix filtering, size filtering, and global label filtering to find the candidates, and basic A* algorithm was used to verify the candidates. we choose q as 1 because it yields the best runtime performance in this set of experiments. – SEGOS is another state-of-the-art algorithm based on star structure [31]. We received the source code from the authors and implemented the basic A* algorithm to verify candidates. Edge labels were not supported and thus discarded where SEGOS was involved. SEGOS is parameterized by (1) k, which defines the top-k star search in TA stage and (2) h, which instructs to perform prune test for every h accessed entries in CA stage. We tuned and chose k = 100 and h = 1,000 for best performance.

First, we compare GSimSearch with κ-AT-Search. In Fig. 13a, b are the average prefix lengths of GSimSearch and κ-AT-Search on AIDS and PROTEIN, respectively. In spite of longer prefix on AIDS, GSimSearch has more average number of q-grams in a graph. For example, κ-ATSearch’s prefix length is 8.2 when τ is 1 and the average number of q-grams in a graph is 25.6, while GSimSearch’s prefix length is 8.9 and the average number of q-grams is 71.5. This means κ-AT-Search requires two graphs to have an average of 25.6 − 8.2 + 1 = 18.4 common q-grams to become a candidate, while GSimSearch needs an average of 71.5 − 8.9 + 1 = 63.6 common q-grams. In this sense,


11.5.1 Comparison with κ-AT-Search and SEGOS


BA and MO have small candidate generation time, but become uncompetitive for large τ on total response time, due to large Cand-2 size and inefficient ged computation. LH can be up to 2.2x faster than MO and 23.7x faster than BA. DH further reduces LH’s running time by up to 59 %. As a controlled experiment, we argue that the performance boost from MO to LH comes from the more effective filtering and efficient verification, and the tighter q-gram matching condition results in the enhancement from LH to DH.


To evaluate ged computation, we verify the candidates of + Degree under q = 4, τ = 4 with three algorithms. Starting from “A*,” we improve the search order by leveraging mismatching q-grams; the resulting algorithm is labeled “+ Improved Order.” Local label filtering is further applied for estimating h(x); the resulting algorithm is labeled “+ Improved h(x).” Figure 11g reports the overall ged computation time to verify the same set of candidate pairs. The optimizations improve the time efficiency of ged computation, and the margin becomes more significant under larger τ. Combining the filtering and ged computation algorithms according to the techniques employed, we show the overall query response time decomposed into two phases in Fig. 11h. The notations denote the following combinations:

lists, which also increases the candidate size. The second factor explains why the candidate sizes rebound for larger q. The trend of candidate size reflects the query response time for varying q. The figure shows q = 4 achieves the best runtime performance for τ ∈ [2, 5]. The only exception is, when τ = 1, q = 2 is the most efficient. This is because the candidate generation of 4-grams is more costly than that of 2-grams, while candidate sizes are very close at τ = 1. After performing similar tests on PROTEIN, we chose, as default parameter settings in the remaining experiments, q = 4 on AIDS and q = 3 on PROTEIN for GSimSearch.


of DA is 2.3x over BA, 2.1x over MA, and 1.8x over LA. We also observe DA has even smaller candidate generation time than LA when τ > 2. This is because degree-based matching condition reduces the candidates to be checked by local label filter, which is shown to be relatively more costly.


Fig. 12 Effect of q-gram length (series: 2-gram to 6-gram; x-axis: ged threshold τ). a AIDS, prefix length. b AIDS, indexing time. c AIDS, Cand-1. d AIDS, Cand-2. e AIDS, query response time

Fig. 13 Comparison with κ-AT-Search (x-axis: ged threshold τ). a AIDS, prefix length. b PROTEIN, prefix length. c AIDS, index size. d PROTEIN, index size. e AIDS, indexing time. f PROTEIN, indexing time. g AIDS, Cand-1. h PROTEIN, Cand-1. i AIDS, Cand-2. j PROTEIN, Cand-2. k AIDS, query response time. l PROTEIN, query response time


11.5.2 Comparison with M-tree


In summary, we analyze the trade-off between runtime speedup and space overhead. The results for τ = 3 are listed in Table 2. We compare the inverted index and q-gram set sizes of all algorithms (the “q-gram set size” of SEGOS is the size of the star representations of the data graphs). On AIDS, GSimSearch uses the most space: 1.5x that of SEGOS and slightly more than κ-AT-Search. On the other hand, GSimSearch is 2.7x faster than κ-AT-Search and 2.4x faster than SEGOS under this setting. On PROTEIN, the space consumption of GSimSearch is 3.5x that of κ-AT-Search and 5.0x that of SEGOS, whereas its speedups are 1316.5x and 109.3x, respectively. We suggest investing moderately more memory for runtime speedup if space is not critical.
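The space ratios quoted above can be rechecked directly from the totals in Table 2. The short sketch below does only that arithmetic; the dictionary layout and the helper name `space_ratio` are illustrative.

```python
# Total data structure sizes (kB) from Table 2 at τ = 3.
totals = {
    "AIDS": {"GSimSearch": 2716.4, "κ-AT-Search": 2648.8, "SEGOS": 1774.2},
    "PROTEIN": {"GSimSearch": 2655.9, "κ-AT-Search": 769.3, "SEGOS": 526.1},
}

def space_ratio(dataset, baseline):
    """Space used by GSimSearch relative to a baseline on one dataset."""
    return totals[dataset]["GSimSearch"] / totals[dataset][baseline]

for dataset in totals:
    for baseline in ("κ-AT-Search", "SEGOS"):
        ratio = space_ratio(dataset, baseline)
        print(f"{dataset}: {ratio:.1f}x the space of {baseline}")
```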

We compare GSimSearch with M-tree, a general indexing technique for metric space similarity search [6], on synthetic datasets. Implementation details of M-tree are supplied in the supplementary material. We provide the results on synthetic datasets of |R| = 100 in Fig. 15. The query response time is measured on the basis of 10 queries, and the queries were sampled from data graphs with random numbers of edit operations added therein.

First, we compare them on the dataset of graph size 10 with varying τ; q = 2 was chosen for GSimSearch. We compare the indexing performance in Fig. 15a, b. M-tree takes much longer, up to four orders of magnitude longer than GSimSearch, to build the index because of its ged evaluation between pairs of data graphs. Both algorithms build small indexes of less than 8 kB. GSimSearch has an even smaller index size than M-tree when τ ≤ 2. The online performance comparison is shown in Fig. 15c, d. GSimSearch always has fewer Cand-2 than M-tree. As a consequence, the query response time of GSimSearch is consistently smaller than that of M-tree, with the largest gap being four orders of magnitude. The large gap in running time is attributed to the loosely bounded distance evaluations invoked by M-tree, while GSimSearch runs all verifications in a threshold-based manner.

We also test the indexing scalability on four datasets with graph size ranging in { 5, 10, 15, 20 } and τ fixed to 2. We chose q equal to 1, 2, 3, 4 for GSimSearch for the four sizes, respectively, and show the results in Fig. 15e, f. The indexing time of M-tree, which starts from a much larger value, grows faster than that of GSimSearch. This tendency implies that, due to the huge cost of edit distance evaluations, M-tree becomes impractical when graphs are large. The index sizes of both algorithms grow with graph size. GSimSearch has smaller indexes for small graphs but a faster growth rate, and hence larger index sizes for large graphs.
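The pruning principle that metric indexes such as M-tree rely on is the triangle inequality. The sketch below is a deliberately simplified single-pivot filter, not an M-tree and not the implementation used in these experiments; `pivot_filter` and the toy integer metric standing in for ged are hypothetical. It also shows why index construction is costly when the metric is graph edit distance: one exact distance per indexed object is needed up front.

```python
def pivot_filter(query, objects, pivot, dist, tau, pivot_dists=None):
    """Return the objects that the triangle inequality cannot prune.

    For a metric `dist`, |d(q, p) - d(p, o)| <= d(q, o). If the left-hand
    side exceeds tau, o cannot be within tau of q, so d(q, o) is never
    evaluated for it.
    """
    if pivot_dists is None:
        # Index time: one exact distance per object (the expensive step
        # when dist is graph edit distance).
        pivot_dists = [dist(pivot, o) for o in objects]
    d_qp = dist(query, pivot)
    return [o for o, d_po in zip(objects, pivot_dists)
            if abs(d_qp - d_po) <= tau]

# Toy metric on integers stands in for ged.
candidates = pivot_filter(10, [1, 8, 11, 30], pivot=0,
                          dist=lambda a, b: abs(a - b), tau=2)
print(candidates)
```

Surviving objects still need an exact distance evaluation, which is the step a threshold-based verification can cut short.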


GSimSearch has a tighter count filtering lower bound. Note the q-gram length is 1 for κ-AT-Search, that is, the count filtering lower bound is the tightest among all its q settings. Both algorithms are competitive in index size, as shown in Fig. 13c, d. GSimSearch consumes more construction time, as in Fig. 13e, f. We also note that although the worst-case complexity of extracting paths is O(|V|γ^q) for GSimSearch, the time for extracting q-grams is 0.25 s on AIDS and 0.19 s on PROTEIN. Compared with the total indexing time, for example, 2.6 s on AIDS and 4.8 s on PROTEIN when τ = 4, the overhead of extracting q-grams is small.

Figure 13g–j gives the Cand-1 and Cand-2 sizes of the two algorithms. GSimSearch performs better than κ-AT-Search on both Cand-1 and Cand-2 sizes. There are three major factors: (1) 4-grams based on paths are more selective than 1-grams based on trees. This results in fewer Cand-1 for GSimSearch. (2) GSimSearch’s count filtering constraint is stricter than κ-AT-Search’s. (3) GSimSearch employs local label filtering and the degree-based matching condition to further prune candidates. The last two factors contribute to GSimSearch’s advantage on Cand-2.

The running time of both algorithms is shown in Fig. 13k, l (“KA” and “GS” are short for κ-AT-Search and GSimSearch, respectively). The query response time grows rapidly when more edit operations are allowed. κ-AT-Search exhibits better candidate generation time, as GSimSearch applies extra filters to prune candidates. However, GSimSearch is always better than κ-AT-Search in terms of overall query response time, and the gap is more substantial under large τ. The speedup against κ-AT-Search is up to 3.5x on AIDS and 6672.4x on PROTEIN. The latter showcases the time advantage of GSimSearch on denser graphs. Section 11.8.2 provides more comparison on graph density.

Next, we compare GSimSearch with SEGOS. The comparisons with SEGOS on index size, indexing time, Cand-2, and query response time are provided in Fig. 14a–h (“SE” and “GS” are short for SEGOS and GSimSearch, respectively). As shown in Fig. 14a, b, both algorithms build space-efficient indexes. GSimSearch consumes less memory but more time to build the index. The two algorithms show a similar increasing trend on Cand-2. On AIDS when τ = 1, SEGOS has a smaller number of Cand-2 than real results, because it derives an edit distance upper bound to confirm certain results without verification. Nevertheless, GSimSearch has a smaller growth rate than SEGOS when τ gets larger. GSimSearch is always faster than SEGOS, with speedup up to 11.9x on AIDS and 1243.8x on PROTEIN. The overall performance superiority boils down to two facts: (1) The filtering techniques in GSimSearch return fewer candidates for most parameter settings. (2) The improved verification reduces running time, and this effect is more pronounced on denser graphs.
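The count filtering figures quoted in the comparison with κ-AT-Search (τ = 1 on AIDS) follow from a single relation: a graph with a given number of q-grams and a given prefix length must share at least "number of q-grams − prefix length + 1" q-grams with a candidate. The sketch below only restates that arithmetic; the function names are illustrative, and applying the bound to averages mirrors the back-of-the-envelope reasoning in the text rather than a per-pair filter.

```python
def count_filter_lower_bound(num_qgrams, prefix_length):
    """Minimum number of common q-grams required to remain a candidate."""
    return num_qgrams - prefix_length + 1

def passes_count_filter(common, num_qgrams, prefix_length):
    """True if a pair with `common` shared q-grams survives the filter."""
    return common >= count_filter_lower_bound(num_qgrams, prefix_length)

# Average values reported for AIDS at τ = 1.
kat_bound = count_filter_lower_bound(25.6, 8.2)   # κ-AT-Search
gs_bound = count_filter_lower_bound(71.5, 8.9)    # GSimSearch
print(f"κ-AT-Search: {kat_bound:.1f}, GSimSearch: {gs_bound:.1f}")
```

A larger bound relative to the number of q-grams means fewer pairs survive, which is the sense in which the path-based bound is described as tighter.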


Fig. 14 Comparison with SEGOS (x-axis: ged threshold τ). a AIDS, index size. b PROTEIN, index size. c AIDS, indexing time. d PROTEIN, indexing time. e AIDS, Cand-2. f PROTEIN, Cand-2. g AIDS, query response time. h PROTEIN, query response time

Table 2 Data structure sizes (kB)

(a) AIDS, τ = 3

| Algorithm | Index | q-Gram multiset | Total |
|---|---|---|---|
| GSimSearch (q = 4) | 202.3 | 2,514.0 | 2,716.4 |
| κ-AT-Search (q = 1) | 130.3 | 2,518.3 | 2,648.8 |
| SEGOS | 515.1 | 1,259.1 | 1,774.2 |

(b) PROTEIN, τ = 3

| Algorithm | Index | q-Gram multiset | Total |
|---|---|---|---|
| GSimSearch (q = 3) | 51.1 | 2,604.8 | 2,655.9 |
| κ-AT-Search (q = 1) | 33.8 | 735.5 | 769.3 |
| SEGOS | 158.3 | 367.8 | 526.1 |

11.6 Graph similarity joins

11.6.1 Self-joins

We compare the following algorithms on real datasets without edge labels for similarity joins.

– GSimJoin is our proposed algorithm that utilizes path-based q-grams for graph similarity join queries.
– κ-AT-Join is an algorithm adapted from κ-AT-Search using tree-based q-grams [27], with q = 1.
– SEGOS-Join is adapted from SEGOS [31] using star structure. To make SEGOS support self-joins, we ran SEGOS in an index-nested loops join mode. It iterates through the dataset and selects each graph as a query, while the corresponding database contains all the graphs with smaller identifiers than that of the query.

As a batch version of similarity search queries, similarity join exhibits the same indexing behavior as similarity search. Therefore, we omit the indexing performance comparisons, but include the indexing time in the total running time. Figure 16a–d reports the number of Cand-2 and the running time of the three algorithms on the two datasets. Trends similar to those in the graph similarity search experiments are observed. GSimJoin outperforms SEGOS-Join in terms of Cand-2 under all the threshold settings except for τ = 1 on PROTEIN. The exception is due to the upper bound validation in SEGOS-Join. Both algorithms generate far fewer candidates than κ-AT-Join. κ-AT-Join is also the slowest under most of the threshold settings, as expected from its largest candidate size. GSimJoin always exhibits less total running time than the others, despite greater indexing time. In particular, GSimJoin is faster than the runner-up SEGOS-Join by up to 58.7x on AIDS and 28.3x on PROTEIN.
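The index-nested loops mode described for SEGOS-Join, in which each graph is issued as a query against the graphs with smaller identifiers, can be sketched generically. This is an illustration under stated assumptions: `self_join` and `toy_search` are hypothetical names, and integers with absolute difference stand in for graphs with ged.

```python
def self_join(graphs, search, tau):
    """Index-nested loops self-join.

    graphs: list of objects ordered by identifier
    search: function (query, database, tau) -> matches found in database
    Each graph queries only the graphs with smaller identifiers, so every
    unordered pair is examined exactly once.
    """
    results = []
    for i, query in enumerate(graphs):
        database = graphs[:i]  # smaller identifiers only
        for match in search(query, database, tau):
            results.append((match, query))
    return results

# Toy stand-in for a similarity search routine.
def toy_search(query, database, tau):
    return [x for x in database if abs(x - query) <= tau]

pairs = self_join([1, 2, 5, 6, 9], toy_search, tau=1)
print(pairs)
```

Restricting the database to smaller identifiers avoids reporting a pair twice and avoids matching a graph with itself.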

11.6.2 R-S joins

We made two relations from the AIDS corpus: the sample of size 4 k used in previous experiments was taken as relation R, and we randomly sampled graphs from the remaining corpus to constitute S. We first fixed the size of S as 20 k and show the running time of GSimJoin under varying τ in Fig. 16e (“IR” and “IS” represent using R and S as inner relation,


Fig. 15 Comparison with M-tree (a–d: x-axis ged threshold τ; e, f: x-axis graph size |E|). a Synthetic, indexing time. b Synthetic, index size. c Synthetic, Cand-2. d Synthetic, query response time. e Synthetic, indexing time. f Synthetic, index size

Fig. 16 Graph similarity join (“KA,” “SE,” and “GS” are short for κ-AT-Join, SEGOS-Join, and GSimJoin; “IR” and “IS” denote using R and S as inner relation). a AIDS, Cand-2. b PROTEIN, Cand-2. c AIDS, total running time. d PROTEIN, total running time. e AIDS, filtering time. f PROTEIN, filtering time


11.7 Subgraph similarity searches


respectively). Since both generate the same candidates and spend the same amount of time on verification, we remove verification time from the figure to make the differences visible. It can be seen that IR saves overall running time by up to 10.5 %, and the gap increases with τ. The result corroborates the claim in Sect. 8.2 that indexing the smaller relation yields better runtime performance. We then fixed τ as 4 and plot in Fig. 16f the running time with |S| ranging in {4k, 8k, 12k, 16k, 20k}. IR always beats IS by a small margin, and the gap grows steadily with larger |S|, as expected from the analysis in Sect. 8.2.


We compare the following algorithms for subgraph similarity search queries:

– SGSimSearch is our proposed algorithm that utilizes path-based q-grams for subgraph similarity search queries. We set q = 2 on both AIDS and PROTEIN.
– AppSub is an algorithm that computes a lower bound of constrained subgraph edit distance based on star structure [38]. Note the edit operation of changing a vertex label is disabled in AppSub. Its results are consequently a subset of those of SGSimSearch. We take its filtering time as a lower bound of overall query response time, since the binary code from the authors reports only candidate sizes and filtering time.

We randomly sampled as queries 100 graphs with |V| ≤ 20 from AIDS and PROTEIN, respectively. In the indexing phase, SGSimSearch takes a small amount of time and memory; for example, it takes 0.612 s and 389.3 kB of memory for the index


Fig. 17 Subgraph similarity search (x-axis: sub_ged threshold τ; “AP” and “GS” are short for AppSub and SGSimSearch). a AIDS, Cand-2. b AIDS, query response time. c PROTEIN, Cand-2. d PROTEIN, query response time


11.8.2 Varying data graph density


gradually. More specifically, GSimSearch, which starts from a small value, overtakes SEGOS when graphs are larger than 400. We argue that GSimSearch and SEGOS are, in general, more sensitive to graph size than κ-AT-Search on indexing and filtering, but outperform κ-AT-Search on overall response time because they produce fewer candidates (e.g., 1.8 and 5.5 k against 20.1 k when |E| = 400). Considering that verification plays the major part in similarity queries, we nevertheless suggest spending reasonably more time on indexing and filtering so that, as a benefit, the total response time can be reduced.
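The synthetic graph sizes used in the graph-size experiment can be sanity-checked against the stated density. The sketch below assumes the usual definition of density for a simple undirected graph, 2|E| / (|V|(|V| − 1)); this definition is an assumption here, though it is consistent with the reported figure of about 100 vertices at |E| = 500 and density 0.1. The helper name `vertices_for` is illustrative.

```python
import math

def vertices_for(num_edges, density):
    """Solve density = 2|E| / (|V| (|V| - 1)) for |V| (assumed definition)."""
    return (1 + math.sqrt(1 + 8 * num_edges / density)) / 2

for num_edges in (100, 200, 300, 400, 500):
    print(num_edges, round(vertices_for(num_edges, 0.1), 1))
```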

11.8 Scalability evaluation

We evaluate the scalability of the algorithms against data graph size, data graph density, and dataset cardinality.

11.8.1 Varying data graph size

We evaluate the scalability against graph density on synthetic datasets. We set the number of graphs to 4 k, the average graph size to 60, and varied the average density in { 0.2, 0.4, 0.6, 0.8 }. We fix τ to 2 and plot the number of Cand-2 and the response time in Fig. 18d, e, respectively. We observe that all algorithms take longer to respond as density grows. As for candidate generation time, κ-AT-Search’s becomes smaller as graphs get denser. This is because denser graphs are more prone to be underflowing. In this case, less time is spent on candidate generation, with underflowing graphs immediately becoming candidates. Note we used the smallest q-gram size q = 1 for κ-AT-Search. GSimSearch spends more time on candidate generation as density grows, since there are more path-based q-grams to process. This pays off during the online query processing phase, as demonstrated by the smallest candidate size. SEGOS does not display a notable change in candidate generation time when graphs become denser. Since ged computation time grows remarkably on denser graphs, GSimSearch is the most time-efficient overall because it has the smallest candidate size.


on AIDS. Since AppSub does not involve an index, we compare Cand-2 and query response time in Fig. 17a–d. As to Cand-2, the candidate size of AppSub stays nearly constant under smaller thresholds [38]. SGSimSearch delivers fewer Cand-2 than AppSub under small τ but shows a rising trend. Hence, the number of Cand-2 from SGSimSearch overtakes that of AppSub on PROTEIN when τ is as large as 5. Although SGSimSearch’s total response time is more than AppSub’s filtering time, AppSub’s candidates still need verification. Given its larger number of candidates and the lack of optimization on verification, it is unlikely that AppSub would outperform SGSimSearch in terms of total response time under small τ settings.

We evaluate the scalability of GSimSearch, κ-AT-Search, and SEGOS regarding data graph size on synthetic datasets. Five datasets with density 0.1 were generated. Each dataset has 4 k graphs, and the average graph size of the datasets ranges in {100, 200, 300, 400, 500}. Thus, the average number of vertices reaches 100 when |E| = 500. q = 3 was chosen for GSimSearch. Default parameter settings for the other algorithms were applied. We show the results under τ = 2. Figure 18a shows that all algorithms have comparable indexing performance when graphs are small and take longer on larger graphs. κ-AT-Search scales with the smallest growth rate and is the most time-efficient when graphs are large. When |E| = 500, the index construction time of GSimSearch is 72.0x that of κ-AT-Search, due to the larger number of q-grams in the graphs. Regarding index size in Fig. 18b, all algorithms need more space to store the indexes for larger graphs, and q-gram-based approaches have smaller indexes than SEGOS under the given settings. The response time for 100 queries is visualized in Fig. 18c. With respect to candidate generation, κ-AT-Search scales the best, without notable change as graph size increases, while the other two consume longer time


11.8.3 Varying dataset cardinality

We evaluate the scalability against dataset cardinality under τ = 2. We sampled 20–100 % graphs from AIDS without edge labels. Figure 18f compares the three join algorithms. From the square root of total running time, we perceive the quadratic growth of all the three algorithms, given that the real join result has a quadratic growth in dataset cardinality.
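The square-root argument can be illustrated with the real join result counts reported for the five scales (7, 24, 44, 80, and 129): if a quantity grows quadratically in the scale factor, its square root grows roughly linearly, so equal steps in scale should yield roughly equal increments. The sketch below is only that check on the reported counts, not a reproduction of the timing measurements.

```python
import math

scales = [0.2, 0.4, 0.6, 0.8, 1.0]
join_results = [7, 24, 44, 80, 129]  # reported real join result counts

roots = [math.sqrt(r) for r in join_results]
# Increment of sqrt(result) per 0.2 step in scale; roughly equal
# increments indicate approximately quadratic growth.
steps = [b - a for a, b in zip(roots, roots[1:])]

print([round(r, 2) for r in roots])
print([round(s, 2) for s in steps])
```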


Fig. 18 Scalability evaluation (“KA,” “SE,” and “GS” are short for κ-AT-Search, SEGOS, and GSimSearch). a Synthetic, indexing time. b Synthetic, index size. c Synthetic, query response time. d Synthetic, Cand-2. e Synthetic, query response time. f AIDS, total running time


problem of similarity all-matching, that is, to find all embeddings missing a given number of edges from the query in a large graph. TreeSpan [44] is the most up-to-date solution leveraging the query’s spanning trees on demand. Neighborhood-based similarity [14] is also considered.

Graph edit distance computation. Another line of related research focuses on graph edit distance computation. So far, the fastest exact solution is credited to an A*-based algorithm incorporating a bipartite heuristic [19]. To render it less computationally demanding, approximate methods have been proposed to find suboptimal answers, for example, [9,18].

String and tree similarity query. String similarity queries are well studied, and the q-gram technique is widely applied, especially with edit distance constraints [10,33]. Others include (1) chunk-based approaches, utilizing non-overlapping substrings [15,17,28,29]; (2) enumeration-based approaches, enumerating resulting strings after editing [30]; and (3) trie-based approaches, indexing strings in a tree structure [28,40]. q-gram-like structures, such as q-level binary branches [37] and pq-grams [1], are also defined on tree-structured data. Parent–child and sibling relations are encoded in these substructures, but such relations are not explicitly available in graphs. In addition, the lower bounds established through tree-based q-grams are usually loose for graphs due to the exponential coverage.


The numbers of real join results are 7, 24, 44, 80, and 129 for the five scales, respectively. The GSimJoin algorithm demonstrates an advantage over the others, as its growth rate is the smallest, with an overall speedup of as large as 22.1x against κ-AT-Join and 5.5x against SEGOS-Join.

12 Related work

Similarity queries are important in various applications.

Graph similarity search. Structure similarity search has received considerable attention lately. Closure-Tree is put forward to identify the k graphs that are most nearly isomorphic to the query [12]. To formalize a general definition of structural similarity, graph edit distance is employed to measure the difference [38]. A recent advance is to employ κ-AT [27] as q-grams for edit distance-based similarity search. It builds an inverted index by decomposing graphs into κ-ATs and performs filtering by comparing a count filtering-based distance lower bound with the threshold. The latest effort, SEGOS [31], proposes an indexing and query processing framework for the same problem. GSimSearch belongs to this category.

Graph similarity containment search. Subgraph similarity search is to retrieve graphs that approximately contain the query. Grafil [36] develops a feature-based pruning technique for subgraph similarity search, and similarity is defined as the number of missing edges with respect to the maximum common subgraph. GrafD-index [22] exploits effective pruning and validation rules to tackle the problem of connected subgraph similarity search. As the counterpart, supergraph similarity search is also investigated to retrieve graphs that are approximately contained by the query [23].

Graph similarity matching. Similarity queries on large graphs have also been studied. SAPPER [39] solves the


13 Conclusion


In this paper, we study three types of graph similarity queries with edit distance constraints. Unlike previous methods using trees or star structures, we propose a method exploiting the number of common fixed-length paths between pairs of graphs. Two filtering techniques are developed to handle both


References 1. Augsten, N., Böhlen, M.H., Gamper, J.: The pq-gram distance between ordered labeled trees. ACM Trans. Database Syst. 35(1) (2010) 2. Baeza-Yates, R., Ribeiro-Neto, B.: Modern Information Retrieval, 1st edn. Addison Wesley, Reading (1999) 3. Bunke, H., Allermann, G.: Inexact graph matching for structural pattern recognition. Pattern Recogn. Lett. 1(4), 245–253 (1983) 4. Chaudhuri, S., Ganti, V., Kaushik, R.: A primitive operator for similarity joins in data cleaning. In: ICDE, p. 5 (2006) 5. Chen, C., Yan, X., Yu, P.S., Han, J., Zhang, D.-Q., Gu, X.: Towards graph containment search and indexing. In: VLDB, pp. 926–937 (2007) 6. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: VLDB, pp. 426– 435 (1997) 7. Cottell, J.J., Link, J.O., Schroeder, S.D., Taylor, J., Tse, W.C., Vivian, R.W., Yang, Z.-Y.: Antiviral Compounds, patent WO2009005677 (2009) 8. Fagin, R., Lotem, A., Naor, M.: Optimal aggregation algorithms for middleware. In: PODS, pp. 102–113 (2001) 9. Fankhauser, S., Riesen, K., Bunke, H.: Speeding up graph edit distance computation through fast bipartite matching. In: GbRPR, pp. 102–111 (2011) 10. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: VLDB, pp. 491–500 (2001) 11. Hart, P., Nilsson, N., Raphael, B.: A formal basis for the heuristic determination of minimum cost paths. IEEE Trans. Syst. Sci. Cybern. 4(2), 100–107 (1968) 12. He, H., Singh, A.K.: Closure-tree: an index structure for graph queries. In: ICDE, p. 38 (2006) 13. Justice, D., Hero, A.O.: A binary linear programming formulation of the graph edit distance. IEEE Trans. Pattern Anal. Mach. Intell. 28(8), 1200–1214 (2006) 14. Khan, A., Li, N., Yan, X., Guan, Z., Chakraborty, S., Tao, S.: Neighborhood based fast graph search in large networks. In: SIGMOD Conference, pp. 901–912 (2011) 15. 
Li, G., Deng, D., Wang, J., Feng, J.: Pass-join: a partition-based method for similarity joins. PVLDB 5(1), 253–264 (2012) 16. Pulleyblank, W.R.: Handbook of Combinatorics Chapter Matchings and Extensions, vol. 1. MIT Press, Cambridge (1995) 17. Qin, J., Wang, W., Lu, Y., Xiao, C., Lin, X.: Efficient exact edit similarity query processing with the asymmetric signature scheme. In: SIGMOD Conference, pp. 1033–1044 (2011) 18. Raveaux, R., Burie, J.-C., Ogier, J.-M.: A graph matching method and a graph matching distance based on subgraph assignments. Pattern Recogn. Lett. 31(5), 394–406 (2010) 19. Riesen, K., Fankhauser, S., Bunke, H.: Speeding up graph edit distance computation with a bipartite heuristic. In: MLG (2007) 20. Robles-Kelly, A., Hancock, E.R.: Graph edit distance from spectral seriation. IEEE Trans. Pattern Anal. Mach. Intell. 27(3), 365–378 (2005)


21. Sanfeliu, A., Fu, K.-S.: A distance measure between attributed relational graphs for pattern recognition. IEEE Trans. Syst. Man Cybern. 13(3), 353–362 (1983)
22. Shang, H., Lin, X., Zhang, Y., Yu, J.X., Wang, W.: Connected substructure similarity search. In: SIGMOD Conference, pp. 903–914 (2010)
23. Shang, H., Zhu, K., Lin, X., Zhang, Y., Ichise, R.: Similarity search on supergraph containment. In: ICDE, pp. 637–648 (2010)
24. Silva, A., Jr, W.M., Zaki, M.J.: Mining attribute-structure correlated patterns in large attributed graphs. PVLDB 5(5), 466–477 (2012)
25. Slavík, P.: A tight analysis of the greedy algorithm for set cover. In: STOC, pp. 435–441 (1996)
26. Tian, Y., Patel, J.M.: TALE: a tool for approximate large graph matching. In: ICDE, pp. 963–972 (2008)
27. Wang, G., Wang, B., Yang, X., Yu, G.: Efficiently indexing large sparse graphs for similarity search. IEEE Trans. Knowl. Data Eng. 24(3), 440–451 (2012)
28. Wang, J., Li, G., Feng, J.: Trie-join: efficient trie-based string similarity joins with edit-distance constraints. PVLDB 3(1), 1219–1230 (2010)
29. Wang, W., Qin, J., Chuan, X., Lin, X., Shen, H.T.: Vchunkjoin: an efficient algorithm for edit similarity joins. IEEE Trans. Knowl. Data Eng. 99 (preprints) (2012)
30. Wang, W., Xiao, C., Lin, X., Zhang, C.: Efficient approximate entity extraction with edit distance constraints. In: SIGMOD Conference, pp. 759–770 (2009)
31. Wang, X., Ding, X., Tung, A.K.H., Ying, S., Jin, H.: An efficient graph indexing method. In: ICDE, pp. 210–221 (2012)
32. Williams, D.W., Huan, J., Wang, W.: Graph database indexing using structured graph decomposition. In: ICDE, pp. 976–985 (2007)
33. Xiao, C., Wang, W., Lin, X.: Ed-Join: an efficient algorithm for similarity joins with edit distance constraints. PVLDB 1(1), 933–944 (2008)
34. Yan, X., Han, J.: gSpan: Graph-based substructure pattern mining. In: ICDM, pp. 721–724 (2002)
35. Yan, X., Yu, P.S., Han, J.: Graph indexing: a frequent structure-based approach. In: SIGMOD Conference, pp. 335–346 (2004)
36. Yan, X., Yu, P.S., Han, J.: Substructure similarity search in graph databases. In: SIGMOD Conference, pp. 766–777 (2005)
37. Yang, R., Kalnis, P., Tung, A.K.H.: Similarity evaluation on tree-structured data. In: SIGMOD Conference, pp. 754–765 (2005)
38. Zeng, Z., Tung, A.K.H., Wang, J., Feng, J., Zhou, L.: Comparing stars: on approximating graph edit distance. PVLDB 2(1), 25–36 (2009)
39. Zhang, S., Yang, J., Jin, W.: SAPPER: subgraph indexing and approximate matching in large graphs. PVLDB 3(1), 1185–1194 (2010)
40. Zhang, Z., Hadjieleftheriou, M., Ooi, B.C., Srivastava, D.: Bed-tree: an all-purpose index structure for string similarity search based on edit distance. In: SIGMOD Conference, pp. 915–926 (2010)
41. Zhao, P., Yu, J.X., Yu, P.S.: Graph indexing: Tree + delta >= graph. In: VLDB, pp. 938–949 (2007)
42. Zhao, X., Xiao, C., Lin, X., Wang, W.: Efficient graph similarity joins with edit distance constraints. In: ICDE, pp. 834–845 (2012)
43. Zhu, F., Qu, Q., Lo, D., Yan, X., Han, J., Yu, P.S.: Mining top-k large structural patterns in a massive network. PVLDB 4(11), 807–818 (2011)
44. Zhu, G., Lin, X., Zhu, K., Zhang, W., Yu, J.X.: TreeSpan: efficiently computing similarity all-matching. In: SIGMOD Conference, pp. 529–540 (2012)


scattered and clustered edit operations as well as facilitate the graph edit distance computation. Degree-associated structural information is also exploited to reduce candidate size and enhance runtime performance. Comprehensive experiments conducted on real and synthetic datasets demonstrate that the new algorithms outperform the existing methods based on either tree-based q-grams or star structures.


SUPPLEMENTARY MATERIAL

The following materials are provided in this supplementary document:
• An example of graph similarity self-join using the GSimJoin algorithm; and
• Implementation details and parameter tuning of the algorithms evaluated in the experiments.

1. Example of Graph Similarity Self-Join

In the interest of space, the paper omits the example illustrating the algorithm for self-join (Algorithm 13).

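[Figure: graphs g1 (containing O), g2 and g3 (containing N), each shown with its q-gram multiset, and the inverted indexes I, I′, I′′]
Q1: C3-O, C1=C2, C2-C3
Q2: C3-N, C1-C2, C1-C3, C2-C3
Q3: C3-N, C1=C2, C2-C3
I: C-O → g1; C=C → g1
I′: C-O → g1; C-N → g2; C=C → g1; C-C → g2
I′′: C-O → g1; C-N → g2, g3; C=C → g1, g3; C-C → g2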

Figure 1. Example of Self-Join

Figure 1 shows the first three graphs in R, with q = 1 and τ = 1. Subscripts are added to the carbon atoms for ease of exposition. The q-gram multiset of each graph is presented to its right in increasing document frequency order. We calculate the minimum prefix length for each graph and indicate by grayed cells the q-grams beyond the prefix. Algorithm 13 processes the graphs one by one. Specifically, g1 has three q-grams, and the first two are within the prefix. Hence, they are indexed in the inverted index I. Then, we look into the prefix of g2. It has no common q-grams with I and thus produces no candidates. Then, we append the q-grams in g2's prefix, deriving I′. We look into the prefix of g3, which matches C-N and C=C in I′ under Condition 2. Therefore, we obtain two candidate pairs, namely, (g2, g3) and (g1, g3), to be further tested by multiple filters. Eventually, (g2, g3) is pruned by global label filtering, and (g1, g3) is verified to be an answer. After that, the prefix q-grams of g3 are indexed in the inverted index, yielding I′′ for the join with the remaining graphs in R.

2. Algorithm Setup for Experiments

Owing to the space limit, the paper omits the implementation details and the parameter tuning of the following algorithms involved in the experiments.

2.1. κ-AT-Search. κ-AT-Search [27] is a state-of-the-art algorithm based on tree-based q-grams. We reengineered this algorithm and further applied prefix filtering, size filtering, and global label filtering to find the candidates. The basic A* algorithm was used to verify the candidates. We ran the κ-AT-Search algorithm with different q-gram lengths and found that q = 1 yields the smallest candidate size and the best runtime performance under all threshold settings. Therefore, we chose q = 1 as the default parameter setting for κ-AT-Search.

2.2. SEGOS. SEGOS [31] is a state-of-the-art algorithm based on star structures. We received the source code from the authors and used the basic A* algorithm to verify candidates. Edge labels in the datasets were discarded whenever a comparison involved SEGOS, as the current implementation does not handle edge labels. Given a query graph, SEGOS follows a cascade framework: in the lower level, the top-k similar stars to each star of the query are returned; in the upper level, graph pruning is done based on the top-k results from the lower level. The former stage follows the TA fashion, while the latter employs the CA strategy [8]. Hence, k defines the number of top-k stars to seek in the TA stage, while h specifies that pruning is performed after every h accessed entries in the CA stage. That is, given a sorted list for each star of the query, SEGOS evaluates the edit distance lower bound for unseen data graphs after every h accessed entries, and it stops when the distance lower bound is larger than τ. In this way, it prevents SEGOS from accessing graphs of high dissimilarity. We tuned the parameters and chose k = 100 and h = 1000 for best performance in the experiments.

2.3. M-tree. M-tree [6] is a general indexing technique for metric space similarity queries. Due to the metric property of graph edit distance, we can organize data graphs in an M-tree, where graph proximity is defined by edit distance. It is of interest to see how structure-oriented indexes perform against a metric space index. In particular, we built an M-tree index offline with the tree node capacity set to 5 (experiments showed that it achieved the best performance). When overflow occurred in an internal node, we employed the following cost-effective split policy: a random strategy to promote routing objects, and a generalized hyperplane strategy to distribute the entries thereafter [6]. In the query answering phase, the triangle inequality was applied for filtering. Before a distance had to be computed on demand, we applied the global label filtering based distance lower bound to avoid expensive ged computation if possible.

M-tree involves a large number of ged computations when building the index. To speed up the computations, we
• allocated a distance matrix of size O(|R|²) to memorize the computed distances for reuse, where |R| is the dataset cardinality;
• applied the global label filtering based distance lower bound before a distance has to be computed on demand; and
• incorporated the upper bound techniques borrowed from star structures [38] to enable the verification algorithm to run in a threshold-based fashion.
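The first two speed-ups in the list above (reusing computed distances and testing a cheap lower bound before any ged computation) can be illustrated with a minimal Python sketch. It is not the authors' implementation. The dictionary-based graph representation, the names `label_lower_bound` and `CachedDistance`, and the `exact_ged` callback are hypothetical. The lower bound is assumed to be the sum, over vertex labels and edge labels, of max(|A|, |B|) minus the size of the multiset intersection. A dictionary keyed by graph pairs stands in for the O(|R|²) matrix.

```python
from collections import Counter


def label_lower_bound(g, h):
    """Assumed label-multiset lower bound on the edit distance of g and h."""
    def gamma(a, b):
        common = sum((Counter(a) & Counter(b)).values())  # multiset intersection
        return max(len(a), len(b)) - common
    return gamma(g["vlabels"], h["vlabels"]) + gamma(g["elabels"], h["elabels"])


class CachedDistance:
    """Threshold test for ged with memoized exact distances."""

    def __init__(self, graphs, exact_ged):
        self.graphs = graphs
        self.exact_ged = exact_ged  # hypothetical expensive verifier, e.g. A*
        self.cache = {}             # (i, j) with i < j -> exact distance
        self.exact_calls = 0

    def within(self, i, j, tau):
        """Return True iff ged(graphs[i], graphs[j]) <= tau."""
        if i == j:
            return True
        key = (min(i, j), max(i, j))
        if key in self.cache:                      # reuse a computed distance
            return self.cache[key] <= tau
        if label_lower_bound(self.graphs[i], self.graphs[j]) > tau:
            return False                           # pruned without computing ged
        self.exact_calls += 1
        self.cache[key] = self.exact_ged(self.graphs[i], self.graphs[j])
        return self.cache[key] <= tau


# Label multisets read off the q-grams of g1, g2, g3 in Figure 1
# ("-" is a single bond, "=" a double bond).
g1 = {"vlabels": ["C", "C", "C", "O"], "elabels": ["-", "=", "-"]}
g2 = {"vlabels": ["C", "C", "C", "N"], "elabels": ["-", "-", "-", "-"]}
g3 = {"vlabels": ["C", "C", "C", "N"], "elabels": ["-", "=", "-"]}

print(label_lower_bound(g2, g3))  # 2, larger than tau = 1, so the pair is pruned
print(label_lower_bound(g1, g3))  # 1, so the pair must still be verified
```

Under these assumptions, the bound reproduces the outcome of the self-join example: (g2, g3) is discarded by the label test alone, while (g1, g3) passes it and goes on to verification.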

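The index-probing loop of the self-join example in Section 1 of this supplementary document can be sketched in Python as follows. This is an illustrative sketch only, not Algorithm 13 itself: the function name `self_join_candidates` is hypothetical, q-grams are reduced to label strings without subscripts, the prefix lengths are supplied as inputs rather than computed (the value 2 for g1 is stated in the example, while the values for g2 and g3 are read off the inverted indexes of Figure 1), and the matching condition and subsequent filters are omitted.

```python
def self_join_candidates(qgrams, prefix_len):
    """Probe and grow an inverted index over prefix q-grams, graph by graph.

    qgrams: dict graph id -> q-gram list, already sorted by increasing
            document frequency; graphs are processed in insertion order.
    prefix_len: dict graph id -> assumed prefix length.
    Returns the candidate pairs and the final inverted index.
    """
    index = {}        # q-gram -> ids of graphs whose prefix contains it
    candidates = []
    for gid, grams in qgrams.items():
        prefix = list(dict.fromkeys(grams[:prefix_len[gid]]))  # distinct, ordered
        seen = set()
        for w in prefix:                       # probe with the prefix q-grams
            for other in index.get(w, []):
                if other not in seen:
                    seen.add(other)
                    candidates.append((other, gid))
        for w in prefix:                       # then index the prefix q-grams
            index.setdefault(w, []).append(gid)
    return candidates, index


# Q1, Q2, Q3 of Figure 1 with subscripts dropped.
qgrams = {
    "g1": ["C-O", "C=C", "C-C"],
    "g2": ["C-N", "C-C", "C-C", "C-C"],
    "g3": ["C-N", "C=C", "C-C"],
}
prefix_len = {"g1": 2, "g2": 2, "g3": 2}

candidates, index = self_join_candidates(qgrams, prefix_len)
print(candidates)  # [('g2', 'g3'), ('g1', 'g3')]
```

With these inputs the sketch yields the two candidate pairs named in the example, and the returned index corresponds to I′′: C-O lists g1, C=C lists g1 and g3, C-N lists g2 and g3, and C-C lists g2.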
