Finding the k-Closest Pairs in Metric Spaces

Hisashi Kurasawa
The University of Tokyo, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

Atsuhiro Takasu
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]

Jun Adachi
National Institute of Informatics, 2-1-2 Hitotsubashi, Chiyoda-ku, Tokyo, Japan
[email protected]
ABSTRACT
We investigated the problem of reducing the cost of searching for the k closest pairs in metric spaces. In general, a k-closest pair search method initializes the upper bound distance between the k closest pairs as infinity and repeatedly updates the upper bound distance whenever it finds pairs of objects whose distances are shorter than that distance. Furthermore, it prunes dissimilar pairs whose distances are estimated to be longer than the upper bound distance, based on the distances from the pivot to the objects and the triangle inequality. The cost of a k-closest pair query is smaller for a shorter upper bound distance and a sparser distribution of distances between the pivot and the objects. We propose a new divide-and-conquer-based k-closest pair search method for metric spaces, called Adaptive Multi-Partitioning (AMP). AMP repeatedly divides and conquers objects starting from the space with the sparser distance distribution, which speeds up the convergence of the upper bound distance before the denser space is partitioned. As a result, AMP can prune many more dissimilar pairs than the ordinary divide-and-conquer-based method. We compare our method with other partitioning methods and show that AMP reduces the number of distance computations.
Categories and Subject Descriptors
H.2.4 [Database Management]: Systems—Multimedia databases; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Indexing methods
1. INTRODUCTION
Enumerating similar pairs of objects from a data set is an important problem with applications in record linkage, data mining, multimedia databases, and geographical information systems. There are two similar-object-pair enumeration problems: similarity join and k-closest pair finding. The former finds pairs of objects whose distances are shorter than a specified upper bound, and the latter finds the top-k closest object
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NTSS 2011, March 25, 2011, Uppsala, Sweden. Copyright 2011 ACM 978-1-4503-0612-6/11/03 ...$10.00.
pairs from a data set for a given number k. The similarity join query may return no pairs or very many pairs when it is difficult to define an appropriate upper bound distance. Even in such a case, the k-closest pair query answers a fixed number of closest pairs. We propose a fast method for the k-closest pair query in metric spaces.
The k-closest pair query can be solved using a nested-loop method. However, this naive method requires O(N^2) distance computations, and such a high search cost leads to poor scalability. A divide-and-conquer method reduces the cost to O(N (log N)^(d-1)) for d-dimensional Euclidean spaces. Such a method consists of the following steps.
• Divide: It partitions a region into two subregions using a hyperplane that perpendicularly crosses an axis at a mean value.
• Conquer: It recursively searches for closest pairs in each subregion and updates the upper bound distance, i.e., the distance of the k-th closest pair found so far.
• Combine: It finds closest pairs with one object in each subregion, both lying within the upper bound distance of the partitioning boundary.
Unfortunately, the divide-and-conquer method for metric spaces is more complicated than that for Euclidean spaces. Coordinates are not defined in metric spaces and cannot be used for partitioning a region. Instead, ball partitioning is generally used [8, 9]. Ball partitioning selects one object as a 'pivot' and divides a region based on the distance from the pivot. It requires N − 1 distance computations, where N is the number of objects. Thus, we need to reduce the computational cost in the divide step as well as in the conquer step. Furthermore, the distribution of distances between the pivot and the objects is often skewed: many objects tend to reside near the partitioning boundary when ball partitioning divides a region into two subregions at the median distance. We should therefore also take the distance distribution into account. However, existing methods using ball partitioning do not address these issues. We developed a new partitioning method that can prune more objects with fewer pivots than existing methods by using the distance distribution. We propose a new divide-and-conquer-based k-closest pair search method for metric spaces, called Adaptive Multi-Partitioning (AMP). AMP is based on the following observations:
• The divide-and-conquer method works well for multi-partitioning in which the intervals exceed the upper bound distance between the k-th closest pair.
• It is more difficult to prune dissimilar objects in regions that are dense in distances from the pivot than in sparse regions.
• The upper bound distance u is updated and decreases whenever the search finds pairs of objects whose distances are shorter than u.
• A shorter upper bound distance can prune more dissimilar objects.
AMP iteratively and recursively divides a region using a sparse-region-first strategy. For a sparse region, it calculates the k closest pairs. It then sets the interval of the next region to the maximum distance among the currently obtained k closest pairs. As a result, a space is partitioned multiple times for each pivot, and the upper bound distance is expected to decrease before dense regions are searched. The contributions of this paper are as follows.
• We propose a novel k-closest pair search method that reduces the number of distance computations.
• The method uses only the triangle inequality, so it can handle any distance function that satisfies the metric-space postulates.
• We conducted experiments on several real datasets and demonstrated that our method outperforms other methods.
The rest of this paper is organized as follows. Section 2 formally defines the problem we focus on in this paper. Section 3 overviews related work. Section 4 describes AMP in detail. Section 5 shows the experimental results, and Section 6 concludes the paper.

2. PROBLEM STATEMENT
Our k-closest pair search method deals with metric spaces, which are defined as follows.

Definition 1 (Metric space). Let M = (D, d) be a metric space defined by a domain of objects D and a distance function d : D × D → R. M satisfies the following postulates [17]:

∀x, y ∈ D, d(x, y) ≥ 0  (non-negativity),
∀x, y ∈ D, d(x, y) = d(y, x)  (symmetry),
∀x, y ∈ D, x = y ⇔ d(x, y) = 0  (identity),
∀x, y, z ∈ D, d(x, z) ≤ d(x, y) + d(y, z)  (triangle inequality).  (1)

Examples of the distance function d are the Minkowski distance, Jaccard's coefficient, the Hausdorff distance, and the edit distance. A k-closest pair query is defined as follows.

Definition 2 (k-closest pair query). Given a set of objects S in D, a k-closest pair query with threshold k returns the object pair set A and the upper bound distance u, where A and u satisfy

A ⊆ S × S,
|A| = k,
∀(x, y) ∈ A, ∀(a, b) ∈ (S × S − A), d(x, y) ≤ d(a, b),
u = max({d(x, y) | (x, y) ∈ A}).  (2)

That is, A consists of the k most similar object pairs from S. We focus on the self k-closest pair query case. The k-closest pair query is sometimes called a 'top-k similarity join query' or a 'k-CPQ'. Table 1 summarizes the symbols used in this article.

Table 1: Notation
  M              metric space
  D              domain of objects
  d              distance function
  k              threshold of the k-closest pair query
  u              upper bound distance between the k-th closest pair
  q              query object
  o, a, b, x, y  object
  p              pivot
  S, X, Y        object set
  A              (temporary) result set
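As a concrete reference point, a query satisfying Definition 2 can be answered naively with nested loops and a bounded max-heap. The sketch below is ours, not the paper's (the function name, `dist` callback, and tuple-based objects are illustrative assumptions); it is the O(N^2) baseline that the rest of the paper improves on.

```python
import heapq
from itertools import combinations

def k_closest_pairs_naive(S, dist, k):
    """Answer a k-closest pair query per Definition 2 with nested loops:
    return (A, u), where A holds the k closest pairs of S and u is the
    distance of the k-th closest pair.  Costs N(N-1)/2 distance computations."""
    heap = []  # max-heap of the current k best pairs, via negated distances
    for x, y in combinations(S, 2):
        d = dist(x, y)
        if len(heap) < k:
            heapq.heappush(heap, (-d, (x, y)))
        elif d < -heap[0][0]:            # closer than the current k-th pair
            heapq.heapreplace(heap, (-d, (x, y)))
    A = [pair for _, pair in heap]
    u = -heap[0][0]                      # distance of the k-th closest pair
    return A, u
```

Note that u here plays exactly the role of the upper bound distance in Eq. (2): once the heap holds k pairs, any pair at distance ≥ u can never enter the result.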
3. RELATED WORK
We start with pruning techniques in metric spaces. Because coordinates are not explicitly defined in metric spaces, all pruning techniques use the distance from a pivot to each object and the triangle inequality. A simple partitioning technique is ball partitioning, which uses only one pivot and divides a region into two subregions based on the distance from each object to the pivot [15]. It is used to construct similarity search indexes for range searches and k-nearest neighbor searches [7, 5]. Quickjoin [9] modifies ball partitioning for similarity join searches as follows.

Definition 3 (Modified ball partitioning). For a metric space M = (D, d), suppose a pivot p divides a set X of objects in D into two regions:

X1 = {o ∈ X | d(o, p) ≤ r_p},  (3)
X2 = {o ∈ X | d(o, p) > r_p}.  (4)

Furthermore, suppose the following subsets:

X1' = {o ∈ X1 | d(o, p) > r_p − u},  (5)
X2' = {o ∈ X2 | d(o, p) ≤ r_p + u},  (6)

where u is an upper bound distance. When searching for object pairs in X whose distance is shorter than u, it is sufficient to check object pairs in X1, X2, and X1' × X2'. Similarly, suppose p also divides an object set Y in D:

Y1 = {o ∈ Y | d(o, p) ≤ r_p},  (7)
Y2 = {o ∈ Y | d(o, p) > r_p},  (8)
Y1' = {o ∈ Y1 | d(o, p) > r_p − u},  (9)
Y2' = {o ∈ Y2 | d(o, p) ≤ r_p + u}.  (10)
When searching for object pairs in X × Y whose distance is shorter than u, it is sufficient to check object pairs in X1 × Y1, X2 × Y2, X1' × Y2', and X2' × Y1'.
Modified ball partitioning can prune more dissimilar object pairs given a smaller upper bound distance and fewer objects near the partitioning boundary.
Now let us briefly survey k-closest pair search methods. Because the similarity join is very similar to the k-closest pair query, we overview studies on both problems. Previously proposed similarity join methods in metric spaces can be categorized into two approaches. The first is the index-based approach. It constructs an index, inserts all the objects in the dataset into the index, and, for each object, searches for other objects whose distances are within the query bound by the same procedure as a range search on the index. All existing index-based studies focus on improving the index. While general metric indexes are designed to deal with arbitrary distances in range queries [10, 18, 11], the indexes for similarity join queries assume fixed-distance range queries. The eD-index [8] is an extension of the D-index [7]. The eD-index differs from the D-index in that it replicates objects whose distances from the partitioning boundary are within the query bound while dividing a region into subregions, as in modified ball partitioning. By using this replication technique, the eD-index can execute the similarity join query independently within each separated region. The list of twin clusters (LTC) [13] is another extension of similarity search indexes. LTC is based on the list of clusters (LC) [5]. It is designed to resolve range queries, similarity joins, and k-closest pair queries between two datasets. It builds two lists of overlapping clusters and a distance matrix between the cluster centers of each dataset, and it uses the matrix to prune objects based on the triangle inequality.
For searching the k closest pairs, LTC creates a queue heap to store the temporary k-closest object pairs and their distances. It also creates a variable upper bound distance and initializes it to ∞. When the size of the heap exceeds k, it removes the longest-distance pair from the heap and sets the upper bound distance to the distance of the k-th closest pair. It issues the upper bound distance as a similarity join query and finds the closest object pairs by updating the heap and decreasing the upper bound distance. Unlike the eD-index, it does not tune the query. LTC is a general-purpose index for similarity searches between two datasets.
The second approach is the divide-and-conquer-based method. Quickjoin [9] recursively divides and conquers an object set into subsets based on the distances from a pivot by using modified ball partitioning. It prunes all the object pairs across two subsets if the distance between the sets exceeds the query bound. The partitioning boundary of modified ball partitioning is the distance between the pivot and a randomly selected object. As a result, many objects tend to reside near the partitioning boundary, which makes it difficult to prune dissimilar object pairs.
Most k-closest pair methods are designed for a specific data structure and mainly focus on pruning and search ordering. The top-k set similarity join [16] handles sets; it uses token-based pruning and avoids repeated verification. Approximate k-closest pairs with SPace-filling curves (ASP) [3] is an approximate k-closest pair method for high-dimensional vectors; it uses Hilbert space-filling curves. An index-based method [6] for spatial databases has been proposed that prunes object pairs by using the R-tree. These methods are tied to their specific spaces and cannot be extended to metric spaces. Only a few papers discuss k-closest pair queries in metric spaces. Furthermore, the authors of those papers did not exploit the distribution of distances between the pivot and the objects for determining the partitioning distance. By using the distance distribution in partitioning techniques, it may be possible to reduce the number of pivots and improve the pruning effect, which reduces the cost of distance computations between the pivots and the objects as well as among the objects. Therefore, we developed a new partitioning technique that uses the distance distribution. Compared with the index-based approach, the divide-and-conquer-based approach is better at dealing with self-join queries because there is no reason to build an index for a single query, as the authors of LTC mentioned. Thus, we adopted the divide-and-conquer-based approach. To the best of our knowledge, LTC is the only previously proposed method that supports the k-closest pair query in metric spaces.
4. ADAPTIVE MULTI-PARTITIONING
AMP is a divide-and-conquer-based k-closest pair search method for metric spaces. It uses multi-ball partitioning to reduce the computational cost of the divide step. Moreover, it exploits the convergence of the upper bound distance and the distance distribution in the partitioning procedure, especially for determining the interval size of the partitioning.
4.1 Multi-ball Partitioning
The idea of multi-partitioning is used in various studies for specific spaces, such as the ε-kdB tree [14]. Multi-ball partitioning is defined as follows.

Definition 4 (Multi-ball partitioning). For a metric space M = (D, d), suppose a pivot p divides an object set X in D into regions:

X0 = {o ∈ X | 0 ≤ d(o, p) < t_0},
X1 = {o ∈ X | t_0 ≤ d(o, p) < t_1},
···
Xi = {o ∈ X | t_{i−1} ≤ d(o, p) < t_i},
···
Xn = {o ∈ X | t_{n−1} ≤ d(o, p) < t_n},  (11)

where t_i (0 ≤ i ≤ n) are the partitioning distances of p. For an upper bound distance u, suppose the inequality

t_{i+1} − t_i ≥ u  (12)

holds when searching for object pairs in X whose distance is shorter than u. Then, it is sufficient to check object pairs {(a, b) | a ∈ Xi, b ∈ Xj, |i − j| ≤ 1}. Similarly, suppose p divides an object set Y in D into regions:

Y0 = {o ∈ Y | 0 ≤ d(o, p) < t_0},
Y1 = {o ∈ Y | t_0 ≤ d(o, p) < t_1},
···
Yi = {o ∈ Y | t_{i−1} ≤ d(o, p) < t_i},
···
Yn = {o ∈ Y | t_{n−1} ≤ d(o, p) < t_n}.  (13)
Figure 1: Adaptive Multi-Partitioning
When searching for object pairs in X × Y whose distance is shorter than u, it is sufficient to check object pairs {(a, b) | a ∈ Xi, b ∈ Yj, |i − j| ≤ 1}.
Multi-ball partitioning can be adapted to a k-closest pair search method if the upper bound distance u exceeds the distance between the k-th closest pair. Thus, we should consider the intervals of the partitioning distances. In general, a k-closest pair search method initializes the upper bound distance u as infinity and repeatedly updates u whenever it finds k pairs of objects whose distances are shorter than u. That is, the upper bound distance u converges to the distance of the k-th closest pair during the search. Our idea is to adjust the partitioning distances t_i by taking this convergence into consideration.
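The bucketing of Definition 4 can be sketched as follows, assuming ascending partitioning distances t_0 < · · · < t_n; the function name, `dist` callback, and object representation are our illustrative assumptions, not from the paper.

```python
from bisect import bisect_right

def multi_ball_partition(X, dist, p, t):
    """Bucket X into rings X_0 .. X_n by pivot distance, following Eq. (11):
    X_0 = [0, t_0) and X_i = [t_{i-1}, t_i) for ascending partitioning
    distances t = [t_0, ..., t_n].  When every interval is at least u wide
    (Eq. (12)), only pairs inside one ring or in adjacent rings (|i - j| <= 1)
    can be closer than u."""
    rings = [[] for _ in t]              # one ring per interval
    for o in X:
        i = bisect_right(t, dist(o, p))  # index of the interval holding o
        if i < len(rings):               # ignore d(o, p) >= t_n, if any
            rings[i].append(o)
    return rings
```

The N − 1 distance computations are the single pass over X; the pruning payoff is that pairs in non-adjacent rings are never compared.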
4.2 Partitioning Procedure
Because pivots prune dissimilar object pairs based on the triangle inequality, the distance density with respect to a pivot and the upper bound distance affect the pruning performance. We can prune objects in a region that is sparse w.r.t. the distance from the pivot more effectively. Moreover, a shorter upper bound distance can prune more dissimilar objects. We therefore focus on the convergence of the upper bound distance and the distance distribution. We believe that dissimilar object pairs in a dense distance distribution should be pruned by using an upper bound distance that has already converged. Thus, AMP searches for closest pairs in the sparse part of the distance distribution before the dense part.
For detecting the shape of the distance distribution, AMP calculates the skewness s of the distance density. Skewness is a measure of the asymmetry of a distribution. It is defined as

s = E[((χ − µ) / σ)^3],  (14)

where χ is a random variable, µ is the mean, σ is the standard deviation, and E is the expectation operator. A negative skew of the distance density indicates that objects near the pivot are sparse; AMP then applies divide-and-conquer operations from the near side to the far side of the pivot. On the other hand, a positive skew indicates that objects near the pivot are dense, so AMP divides and conquers in the opposite direction. Figure 1 shows the concept of AMP.
AMP searches for the k closest pairs of two given object sets X and Y (|X| ≤ |Y|) by performing the following steps. For a metric space M = (D, d), AMP first creates a queue heap A to store the temporary k-closest object pairs and their distances. It also creates a variable upper bound distance u and initializes it to ∞. When the size of the heap exceeds k, it removes the longest-distance pair from the heap and sets the upper bound distance to the distance of the k-th closest pair. AMP manages the reference distance Ref_{o,S} of each object o in an object set S; the initial value of Ref_{o,S} is nil. Let t_i denote the i-th partitioning distance. AMP then recursively divides and conquers the given object sets X and Y, which we call AMP(X, Y). In the following procedure, expr1 ? expr2 : expr3 means to evaluate the expression expr2 if expr1 is true and otherwise to evaluate expr3, as in the C language.
1. Remove from X and Y the objects o that hold Ref_{o,S} ≠ nil ∧ Ref_{o,S} > u (S = X, Y).
2. If min{|X|, |Y|} ≤ 3 holds, search for closest pairs by nested loops and return.
3. Randomly choose a pivot object p from X, and remove p from X.
4. Calculate the distance from p to each object in X ∪ Y.
5. Set d_min, d_max, d_mean, and s to the minimum, maximum, mean, and skewness of the distances, respectively.
6. Update A and u whenever an object pair (o_i, p) is found where o_i is in Y and d(o_i, p) < u holds.
7. Set t_0 to (s < 0 ? d_min : d_max).
8. Set t_1 to (s < 0 ? min{t_0 + u, d_mean} : max{t_0 − u, d_mean}).
9. Set S_1 to (s < 0 ? {o ∈ S | t_0 ≤ d(o, p) < t_1} : {o ∈ S | t_0 ≥ d(o, p) > t_1}) (S = X, Y).
10. Call AMP(X_1, Y_1).
11. While (s < 0 ? t_i < d_max : t_i > d_min) holds:
(a) Set t_{i+1} to (s < 0 ? t_i + u : t_i − u).
(b) Set S_{i+1} to (s < 0 ? {o ∈ S | t_i ≤ d(o, p) < t_{i+1}} : {o ∈ S | t_i ≥ d(o, p) > t_{i+1}}) (S = X, Y).
(c) Call AMP(X_{i+1}, Y_{i+1}).
(d) Set the object sets S_i' to {o | o ∈ S_i ∧ |t_i − d(p, o)| < u} (S = X, Y).
(e) For all o ∈ S_i', set Ref_{o,S} = min{Ref_{o,S}, |t_i − d(p, o)|} (S = X, Y).
(f) Set the object sets S_{i+1}' to {o | o ∈ S_{i+1} ∧ |t_i − d(p, o)| < u} (S = X, Y).
(g) For all o ∈ S_{i+1}', set Ref_{o,S} = min{Ref_{o,S}, |t_i − d(p, o)|} (S = X, Y).
(h) Call AMP(X_i', Y_{i+1}') and AMP(X_{i+1}', Y_i').
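The direction choice in steps 5 and 7 hinges on the skewness statistic of Eq. (14). A minimal sketch of that statistic and the resulting sweep start (helper names are ours; the paper does not prescribe an implementation, and this sample estimator assumes at least two distinct distances):

```python
def skewness(values):
    """Sample skewness s = E[((x - mu) / sigma)^3] of Eq. (14)."""
    n = len(values)
    mu = sum(values) / n
    sigma = (sum((v - mu) ** 2 for v in values) / n) ** 0.5
    return sum(((v - mu) / sigma) ** 3 for v in values) / n

def sweep_start(distances):
    """Step 7: a negative skew means objects near the pivot are sparse, so the
    sweep starts at d_min and moves outward; otherwise it starts at d_max and
    moves inward."""
    return min(distances) if skewness(distances) < 0 else max(distances)
```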
We can solve the k-closest pair problem in the case of X = Y with a minor modification. For brevity, we omit the modification.
4.3 Search Cost
The computational cost is the number of distance computations between objects. Let X and Y be given object sets. The computational cost of AMP(X, Y) is

|AMP(X, Y)| = |X| + |Y| − 1                              (divide step)
            + Σ_i |AMP(X_i, Y_i)|                        (conquer step)
            + Σ_{(i,j) : |i−j| = 1} |AMP(X_i, Y_j)|,     (combine step)  (15)

where |AMP(·, ·)| denotes the cost of the k-closest pair query. When X equals Y, the cost |AMP(X)| is

|AMP(X)| = |X| − 1 + Σ_i |AMP(X_i)| + Σ_{(i,j) : |i−j| = 1} |AMP(X_i, X_j)|.  (16)
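As a rough, illustrative instance of recurrence (16) — ours, not the paper's analysis — the following sketch charges |X| − 1 computations to one divide step and resolves the recursive calls by nested loops, using whole adjacent rings as a pessimistic stand-in for the boundary sets X_i':

```python
def naive_cost(n):
    """Nested loops over n objects: all pairwise distance computations."""
    return n * (n - 1) // 2

def one_level_cost(ring_sizes):
    """One level of Eq. (16): n - 1 computations for the divide step, nested
    loops inside each ring (conquer), and nested loops across each pair of
    adjacent rings (combine).  Crossing whole adjacent rings is pessimistic;
    AMP only crosses the boundary sets within u of each boundary."""
    n = sum(ring_sizes)
    divide = n - 1
    conquer = sum(naive_cost(m) for m in ring_sizes)
    combine = sum(a * b for a, b in zip(ring_sizes, ring_sizes[1:]))
    return divide + conquer + combine
```

Even with this pessimistic combine step, a single partitioning level already cuts the pairwise work roughly in proportion to the number of rings.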
5. EVALUATION
We evaluated the computational cost of finding the k closest pairs on real datasets.
5.1 Outline of Experiments
We used the following methods in the experiments. AMP is our k-closest pair search method; it uses a sparse-region-first strategy. AMP in reverse order is a comparative method that uses a dense-region-first strategy. Binary partitioning is a modified-ball-partitioning-based k-closest pair search method. Nested loops is a naive k-closest pair method, which does not prune any objects and computes the distances of all pairs of objects in the dataset.
We used the following real datasets, all of which are available on the Web. These datasets have been used in many recent related studies [12, 9, 4]. We removed duplicate objects from each dataset. NASA [1] is a set of feature vectors made by NASA. It consists of 40,150 vectors in a 20-dimensional feature space. Corel image features [2] consists of color-histogram vectors generated from the Corel image collection. It consists of 68,040 vectors in a 32-dimensional space. Color histogram [1] consists of the color histograms of 112,544 images, represented by vectors in a 112-dimensional space. The vectors in all three datasets were compared using the Euclidean distance. Figure 3 shows the distance density of each dataset, and Table 2 lists the properties of the distances between the objects in each dataset. Note that NASA is in the lowest-dimensional feature space and has a wide range of distances and the lowest skewness, while the skewness of Color histogram is the highest.

Table 2: Real Datasets
              NASA     Corel image features   Color histogram
  Distance    Euclid   Euclid                 Euclid
  Dimension   20       32                     112
  Average     1.48     0.564                  0.415
  Variance    0.211    0.0332                 0.0310
  Skewness    0.0447   0.444                  0.828
  Kurtosis    2.39     3.08                   3.57

We implemented AMP and the comparative methods on the Metric Space Library [1], which is written in C. We conducted the experiments on a Linux PC equipped with an Intel(R) Quad-Core Xeon(TM) X5492 3.40 GHz CPU and 64 GB of memory. The library and our code were compiled with GCC 4.4. All datasets were processed in memory for all examined methods.
5.2 Computational Cost
We evaluated how much AMP reduces the search cost with respect to the query size. We measured the number of distance computations for k-closest pair queries with k ranging from 1 to 100,000. Each result is the average over 500 queries on its dataset. Figure 2 shows the computational cost. The vertical axis represents the number of distance computations for a k-closest pair query divided by the value for Nested loops, i.e., N · (N − 1)/2, where N is the number of objects in the dataset. The horizontal axis is k. None of the methods require an index, so these results show the total computational cost of the search. A lower percentage indicates a lower computational cost.
We can see that AMP reduces the computational cost. These results show that multi-partitioning prunes more dissimilar object pairs than binary partitioning. Furthermore, AMP works better than AMP in reverse order in all results, which indicates that the sparse-region-first strategy is better than the dense-region-first strategy. In particular, AMP requires far fewer distance computations than the other methods for Color histogram. Since the skewness of Color histogram is the largest, this suggests that the sparse-region-first strategy works especially well for skewed datasets.
6. CONCLUSION
We investigated the problem of the k-closest pair query in metric spaces. We proposed an efficient k-closest pair search method that prunes dissimilar object pairs based on the triangle inequality. The method repeatedly divides and conquers the objects starting from the sparser space, which speeds up the convergence of the upper bound distance before the denser space is partitioned. We are currently conducting experiments using synthetic datasets and theoretically analyzing the performance of our method in more detail.
7. REFERENCES
[1] Metric Spaces Library, http://www.sisap.org/metric_space_library.html.
[2] UCI KDD Archive, http://kdd.ics.uci.edu/.
[3] F. Angiulli and C. Pizzuti. Approximate k-closest-pairs in large high-dimensional data sets.
Figure 2: Computational cost for (a) NASA, (b) Corel image features, and (c) Color histogram
Figure 3: Distance density for (a) NASA, (b) Corel image features, and (c) Color histogram
Journal of Mathematical Modelling and Algorithms, 4(2):149–179, 2005.
[4] B. Bustos and G. Navarro. Improving the space cost of k-NN search in metric spaces by using distance estimators. Multimedia Tools and Applications, 41(2):215–233, 2009.
[5] E. Chavez and G. Navarro. A compact space decomposition for effective metric indexing. Pattern Recognition Letters, 26(9):1363–1376, 2005.
[6] A. Corral, Y. Manolopoulos, Y. Theodoridis, and M. Vassilakopoulos. Algorithms for processing k-closest-pair queries in spatial databases. Data & Knowledge Engineering, 49(1):67–104, 2004.
[7] V. Dohnal, C. Gennaro, P. Savino, and P. Zezula. D-Index: Distance searching index for metric data sets. Multimedia Tools and Applications, 21(1):9–33, 2003.
[8] V. Dohnal, C. Gennaro, and P. Zezula. Similarity join in metric spaces using eD-Index. In DEXA, 2003.
[9] E. H. Jacox and H. Samet. Metric space similarity joins. ACM Transactions on Database Systems, 33(2):1–38, 2008.
[10] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30(2):364–397, 2005.
[11] H. Kurasawa, D. Fukagawa, A. Takasu, and J. Adachi. Maximal metric margin partitioning for similarity search indexes. In CIKM, 2009.
[12] G. Navarro and N. Reyes. Dynamic spatial approximation trees. Journal of Experimental Algorithmics, 12:1–68, 2008.
[13] R. Paredes and N. Reyes. Solving similarity joins and range queries in metric spaces with the list of twin clusters. Journal of Discrete Algorithms, 7(1):18–35, 2009.
[14] K. Shim, R. Srikant, and R. Agrawal. High-dimensional similarity joins. In ICDE, 1997.
[15] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40(4):175–179, 1991.
[16] C. Xiao, W. Wang, X. Lin, and H. Shang. Top-k set similarity joins. In ICDE, 2009.
[17] P. Zezula, G. Amato, V. Dohnal, and M. Batko. Similarity Search: The Metric Space Approach (Advances in Database Systems). Springer-Verlag, 2005.
[18] Y. Zhuang, Y. Zhuang, Q. Li, L. Chen, and Y. Yu. Indexing high-dimensional data in dual distance spaces: a symmetrical encoding approach. In EDBT, 2008.