IJRIT International Journal of Research in Information Technology, Volume 2, Issue 11, November 2014, Pg. 253-256

International Journal of Research in Information Technology (IJRIT) www.ijrit.com

ISSN 2001-5569

A Survey on Efficiently Indexing Graphs for Similarity Search

Divyashree Bhoye 1, Prof. Mandar Kshirsagar 2

1 Student, Computer Engineering Department, V.A.C.O.E.A Ahmednagar, Maharashtra, India [email protected]

2 Assistant Prof, Computer Engineering Department, V.A.C.O.E.A Ahmednagar, Maharashtra, India [email protected]

Abstract

Graphs are widely used to model complex entities in many applications, including bioinformatics, chemical compounds, road networks, social networks, and pattern recognition. Managing the large amount of graph data in these domains is a challenging problem. A fundamental query primitive is to efficiently index and perform similarity search on a large collection of graphs. Similarity search over complex structures is an important operation in graph-related applications, since exact matching is often too restrictive. Many techniques based on the graph edit distance have been proposed to support similarity search. In this paper, we study the problem of graph similarity search and the techniques that retrieve graphs similar to a given query graph under a graph edit distance constraint.

Keywords: Similarity Search, Indexing Graphs, Graph Edit Distance.

1. Introduction

Recently there has been rapid growth in the use of graphs as data models, for example for protein-protein interactions in bioinformatics, chemical compounds, road networks, social networks, and pattern recognition. In a road network, for instance, the cities can be considered the vertices of a graph and the roads connecting them the edges between the corresponding vertices. Similarity search over a large data set of graphs is a fundamental issue in graph-based applications.

In similarity search, we look for graphs in a database that are similar to a query graph. One way to judge whether two graphs are similar is by the size of their maximum common subgraph. Searching a large graph data set sequentially incurs a huge computational cost. Because sequential search is so inefficient, a filter-and-verification method is used instead: an index built on the graph data set filters out graphs that cannot match, which reduces the number of candidates that must be verified.

The rest of this paper is organized as follows. Section 2 reviews related work. Section 3 presents the k-adjacent tree index and the filtering principle based on the concept of the k-adjacent tree. Section 4 concludes the paper.


2. Related Work

A sequential scan is extremely costly, because one has to access the whole database one graph at a time. In addition, subgraph isomorphism is an NP-complete problem. The filter-and-verification method is therefore employed to speed up similarity search over a graph data set: first, one generates a set of candidates that satisfy necessary conditions of the edit distance constraint, and then verifies them by computing the edit distance. Since filtering is the main phase for improving search efficiency, many indexing techniques have been proposed recently to speed it up. Most of this work falls into two groups: frequent-subgraph-based indexing and graph-decomposition-based indexing.

Yan et al. introduce gIndex, an indexing technique that uses frequent subgraph structures as the basic indexing features. Shang et al. propose QuickSI, a technique that speeds up the verification phase by testing subgraph isomorphism efficiently. Cheng et al. propose a nested inverted index called FG-index, which avoids candidate verification by exploiting frequent subgraphs and edges as indexing features. Frequent-subgraph-based indexing methods have two main shortcomings: their effectiveness and efficiency depend on the quality of the selected features, and the index is difficult to construct and maintain because frequent subgraph mining usually takes a very long time.

Williams et al. have developed three graph decomposition schemes, Clique Decomposition, Modular Decomposition, and Node Label Decomposition (NLD), to decompose a graph data set. Directed Acyclic Graphs (DAGs) are constructed to describe the results of a decomposition, and a Graph Decomposition Index (GDI) is proposed to support graph similarity search. Tian and Patel have proposed an indexing method that incorporates graph structural information in a hybrid index structure called NH-index. Graph decomposition indexing methods suffer from two main drawbacks: they have to enumerate all connected subgraphs, so the complexity of decomposition is exponential in the size of the graph being decomposed, and the frequency information present in the decomposition results is not used to improve the efficiency of similarity search.

In this paper, a novel graph decomposition method called k-Adjacent Tree (k-AT) is introduced. This method is inspired by the idea of the q-gram from string matching. A graph is decomposed into a set of k-adjacent trees, and the decomposed results are indexed by a k-AT index. A lower bound of the edit distance between graphs is derived and used for filtering, which guarantees the absence of false negatives. The method combines ideas from both graph decomposition methods and frequent subgraph methods.
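The filter-and-verification scheme described in this section can be sketched in a few lines of Python. This is only an illustrative skeleton, not the implementation of any of the cited systems: the graph representation (a set of labelled edges), the function names, and the two distance functions are all assumptions made for the example. The toy "exact" distance is the size of the symmetric difference of the edge sets, standing in for a real graph edit distance, and the filter uses the difference in edge counts, which can never exceed that toy distance and therefore never discards a true answer.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple

Edge = Tuple[str, str]
Graph = FrozenSet[Edge]  # toy representation (assumption): a graph is its set of labelled edges


def size_lower_bound(a: Graph, b: Graph) -> int:
    """Cheap filter: the edge-count difference never exceeds the toy distance."""
    return abs(len(a) - len(b))


def toy_distance(a: Graph, b: Graph) -> int:
    """Stand-in for an expensive exact distance: edges present in only one graph."""
    return len(a ^ b)


def filter_and_verify(
    database: Iterable[Graph],
    query: Graph,
    threshold: int,
    lower_bound: Callable[[Graph, Graph], int],
    exact_distance: Callable[[Graph, Graph], int],
) -> Tuple[List[Graph], List[Graph]]:
    """Return (answers, candidates) for a similarity query."""
    # Filtering phase: a graph whose lower bound exceeds the threshold cannot be an answer.
    candidates = [g for g in database if lower_bound(query, g) <= threshold]
    # Verification phase: the costly exact computation runs only on the survivors.
    answers = [g for g in candidates if exact_distance(query, g) <= threshold]
    return answers, candidates


query = frozenset({("a", "b"), ("b", "c")})
database = [
    frozenset({("a", "b"), ("b", "c")}),
    frozenset({("a", "b")}),
    frozenset({("a", "b"), ("b", "c"), ("c", "d"), ("d", "e"), ("e", "f")}),
]
answers, candidates = filter_and_verify(database, query, 1, size_lower_bound, toy_distance)
print(len(candidates), len(answers))  # 2 2
```

Because the filter is a true lower bound of the distance used in verification, the third graph is pruned without ever being verified, and no valid answer is lost. This is the same no-false-negatives property that the k-AT lower bound is meant to provide.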

3. k-Adjacent Tree Index

Having reviewed the different indexing techniques, we now discuss how to use k-adjacent tree pattern decomposition for lower bound estimation. This indexing method borrows the idea of the q-gram index used in string matching. To avoid the structural complexity of graph patterns, adjacent tree patterns are used for index construction. An adjacent tree preserves the structural information of the graph well, and similarity search on adjacent trees is much faster than on graph grams. The following figure shows the representation of a given graph as a k-adjacent tree.

Fig. 1(a) Given graph

Fig. 1(b) Representation as k-adjacent tree

3.1 k-AT Index Implementation

First, we generate all the k-ATs of each graph in the graph data set and store them in a table. For a query graph Q, we also generate its k-ATs, and for each graph G in the data set we count the number of k-ATs that Q and G have in common. We then use inequality (1) below to check whether graph G belongs to the candidate set of the query.

$$|Cand_k| = \left|\left\{\, G \mid G \in D,\ |k\text{-}ATS(Q) \cap k\text{-}ATS(G)| \ge |V(G)| - \epsilon \cdot \delta(G)^{k} \,\right\}\right| \qquad (1)$$

Here D is the graph data set, k-ATS(G) denotes the k-ATs of G, and V(G) is the vertex set of G.
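The procedure of Section 3.1 can be sketched as follows for the simplest case k = 1. Several things here are assumptions made for illustration and are not stated in the text: the k-AT of a vertex is taken to be its breadth-first tree truncated at depth k (so a 1-AT is a vertex label together with the sorted labels of its neighbours), the factor multiplying the degree term in inequality (1) is treated as an edit distance threshold, the degree term is taken to be the maximum vertex degree of G raised to the power k, and the common k-ATs are counted as a multiset intersection. All function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Tuple

# Assumed representation: (vertex -> label, list of undirected edges)
Graph = Tuple[Dict[Hashable, str], List[Tuple[Hashable, Hashable]]]

K = 1  # this sketch only handles 1-adjacent trees


def one_adjacent_trees(graph: Graph) -> Counter:
    """Multiset of 1-ATs: one (root label, sorted neighbour labels) entry per vertex."""
    labels, edges = graph
    neighbours = {v: [] for v in labels}
    for u, v in edges:
        neighbours[u].append(v)
        neighbours[v].append(u)
    return Counter(
        (labels[v], tuple(sorted(labels[n] for n in neighbours[v]))) for v in labels
    )


def max_degree(graph: Graph) -> int:
    """Largest vertex degree in the graph (0 if there are no edges)."""
    _, edges = graph
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return max(degree.values(), default=0)


def candidate_set(database: Dict[str, Graph], query: Graph, threshold: int) -> List[str]:
    """Names of graphs that pass the counting test of inequality (1)."""
    # Table of k-ATs, built once for the whole data set.
    table = {name: one_adjacent_trees(g) for name, g in database.items()}
    query_ats = one_adjacent_trees(query)
    result = []
    for name, g in database.items():
        common = sum((query_ats & table[name]).values())  # multiset intersection size
        bound = len(g[0]) - threshold * max_degree(g) ** K
        if common >= bound:
            result.append(name)
    return result


query = ({1: "A", 2: "B", 3: "C"}, [(1, 2), (2, 3)])
database = {
    "relabelled": ({1: "A", 2: "B", 3: "D"}, [(1, 2), (2, 3)]),
    "triangle": ({1: "X", 2: "Y", 3: "Z"}, [(1, 2), (2, 3), (1, 3)]),
    "same": query,
}
print(candidate_set(database, query, 1))  # ['relabelled', 'same']
print(candidate_set(database, query, 0))  # ['same']
```

In a real index the table would be built once offline and reused across queries; it is rebuilt inside `candidate_set` here only to keep the example short.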

The following figure represents a basic block diagram for the system.

Fig. 2 Block diagram of the system.


3.2 Performance Evaluation

In this section we compare the query processing and filtering performance of the k-AT method with frequent-subgraph-based indexing and graph-decomposition-based indexing. FG-index is a frequent-subgraph-based index, and the DAG index is a graph-decomposition-based index. Because the k-AT index combines ideas from both families, these two techniques were chosen for comparison with it.

The observations were as follows. For filtering, DAG performs worst, while k-AT has almost the same filtering capacity as FG-index and is slightly better. For query processing, k-AT is far superior to both FG-index and DAG. This is because FG-index and DAG invoke many subgraph isomorphism tests, whereas k-AT invokes subtree isomorphism tests, and a subtree isomorphism test is much cheaper than a subgraph isomorphism test.
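To give a feel for why comparisons on trees are cheap, the sketch below computes a canonical string for a rooted tree with labelled vertices. Two such trees are isomorphic exactly when their canonical strings are equal, so the test reduces to a string comparison after recursively sorting the child encodings. This is whole-tree equality, a simpler relative of the subtree isomorphism test mentioned above, and it is not taken from the k-AT paper. The tree representation, the function name, and the requirement that labels contain no parentheses are assumptions of the example.

```python
from typing import List, Tuple

# Assumed representation: a rooted tree is (label, list of child trees).
Tree = Tuple[str, List["Tree"]]


def canonical_form(tree: Tree) -> str:
    """Encode a rooted labelled tree so that child order does not matter.

    Labels are assumed to contain no parentheses, which keeps the encoding unambiguous.
    """
    label, children = tree
    # Sorting the child encodings makes the result independent of child order.
    encoded_children = sorted(canonical_form(child) for child in children)
    return "(" + label + "".join(encoded_children) + ")"


t1 = ("A", [("B", []), ("C", [("D", [])])])
t2 = ("A", [("C", [("D", [])]), ("B", [])])  # same tree, children listed in another order
t3 = ("A", [("B", [("D", [])]), ("C", [])])  # D hangs under B instead of C

print(canonical_form(t1))                        # (A(B)(C(D)))
print(canonical_form(t1) == canonical_form(t2))  # True
print(canonical_form(t1) == canonical_form(t3))  # False
```

No comparably simple canonical form is known for general graphs, which is consistent with the observation that index structures built on trees spend far less time on structural tests than those built on subgraphs.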

4. Conclusions

The global similarity between graphs is evaluated by decomposing them into smaller pieces (k-ATs) and pairing up these pieces. A k-AT records more structural information than a normal graph-decomposition-based indexing method while keeping the simple structure of a tree. This yields a method for indexing and candidate filtering for similarity search in a graph data set. The experimental results also indicate that, on a large graph data set, filtering with the k-AT index can be both fast and accurate.

Acknowledgments

The authors thank all who supported them in this work. They would also like to express their sincere gratitude and appreciation to all the staff members for their patience, guidance, and help, and for being their greatest source of information.

References

[1] T.H. Cormen, “NP Completeness,” Introduction to Algorithms, W. Yu, ed., second ed., vol. 7, pp. 620-630. China Machine Press, 2007.
[2] G. Wang, B. Wang, X. Yang, and G. Yu, “Efficiently Indexing Large Sparse Graphs for Similarity Search,” IEEE Trans. Knowledge and Data Eng., vol. 24, no. 3, Mar. 2012.
[3] X. Yan, P.S. Yu, and J. Han, “Graph Indexing: A Frequent Structure-Based Approach,” Proc. ACM SIGMOD, pp. 335-345, 2004.
[4] X. Yan, P.S. Yu, and J. Han, “Graph Indexing Based on Discriminative Frequent Structure Analysis,” ACM Trans. Database Systems, vol. 30, no. 4, pp. 960-993, 2005.
[5] H. Shang, Y. Zhang, X. Lin, and J.X. Yu, “Taming Verification Hardness: An Efficient Algorithm for Testing Subgraph Isomorphism,” Proc. 34th Int’l Conf. Very Large Data Bases, pp. 364-375, 2008.
[6] J. Cheng, Y. Ke, W. Ng, and A. Lu, “FG-Index: Towards Verification-Free Query Processing on Graph Databases,” Proc. ACM SIGMOD, pp. 857-872, 2007.
[7] J. Cheng, Y. Ke, and W. Ng, “Efficient Query Processing on Graph Databases,” ACM Trans. Database Systems, vol. 34, no. 1, pp. 1-44, 2009.
[8] O. Johansson, “Graph Decomposition Using Node Labels,” doctoral dissertation, Royal Inst. of Technology, 2001.
[9] Y. Tian and J.M. Patel, “TALE: A Tool for Approximate Large Graph Matching,” Proc. 24th Int’l Conf. Data Eng., pp. 963-972, 2008.
