Efficient Top-k Hyperplane Query Processing for Multimedia Information Retrieval

Navneet Panda

Edward Y. Chang

Dept. of Computer Science University of California Santa Barbara

Dept. of Electrical and Computer Engg. University of California Santa Barbara

[email protected]

[email protected]

ABSTRACT

A query can be answered by a binary classifier, which separates the instances that are relevant to the query from the ones that are not. When kernel methods are employed to train such a classifier, the class boundary is represented as a hyperplane in a projected space. Data instances that are farthest from the hyperplane are deemed to be most relevant to the query, and those that are nearest to the hyperplane to be most uncertain to the query. In this paper, we address the twin problems of efficient retrieval of the approximate set of instances (a) farthest from and (b) nearest to a query hyperplane. Retrieval of instances for this hyperplane-based query scenario is mapped to the range-query problem, allowing for the reuse of existing index structures. Empirical evaluation on large image datasets confirms the effectiveness of our approach.

Categories and Subject Descriptors

H.4 [Data Mining, Information Retrieval, Multimedia]: Efficient Retrieval

Keywords

Support vector machines, kernel-based methods, retrieval

1. INTRODUCTION

Query-concept learning and efficient retrieval of relevant instances lie at the heart of multimedia retrieval [26]. Support Vector Machines (SVMs), or more generally kernel methods, have made the hyperplane query a significant new paradigm for both query-concept learning and instance retrieval. The key advantage of a hyperplane query over a traditional point query [9, 18] is that the hyperplane query can be richly defined with positive and negative data instances, as compared to a single positive instance in the point query. A hyperplane query represents the query concept as a binary classifier separating the positive and negative instances.

In the query-concept learning phase, active learning is frequently employed to refine the user concept iteratively. In order to learn the target concept, the user is queried for the labels of a few instances at each iteration. The instances to be labeled by the user are chosen on the basis of the degree of uncertainty in their current classification so that the concept learning can achieve fast convergence. The query of interest in this phase is formulated as the set of top-k instances given the lowest absolute scores by the SVM classifier; these are the instances closest to the hyperplane¹. Having converged on the concept of interest, the retrieval phase of query processing returns the most relevant instances to the learned concept to the user. The top-k instances of interest with respect to the query concept are the k instances with the highest positive scores, i.e., the k farthest data instances from the hyperplane in the relevant (positive) half-space.

In this paper, we propose an efficient approach to process top-k hyperplane queries, for both most-uncertain and most-relevant queries. Without an efficient pruning algorithm, processing top-k queries of both kinds requires sequentially scanning the entire dataset to find matching instances. When the number of instances is on the order of thousands or more, such naive sequential processing is simply not scalable. To be more specific about the problem at hand, we assume that Support Vector Machines (SVMs) [4, 5, 28] are the learning algorithm employed to obtain the query hyperplane. SVMs find a maximum-margin hyperplane separating the positive training instances from the negative ones. SVMs can employ implicit projection to find a separating hyperplane (a linear boundary) in a projected space H. The linear boundary in H translates to a complex (often nonlinear) boundary in the input space, making SVMs a powerful tool for pattern recognition. The hyperplane is then used to obtain instance scores (explained in Section 3). The score of an instance is its distance from the hyperplane in the projected space H.

The SVM algorithm determines the top-k instances (nearest to or farthest from the hyperplane) in the corpus by computing the scores of all the instances. The top-k instances with the lowest absolute scores are returned when the most ambiguous samples are desired, and the top-k instances with the highest positive scores are selected when the most relevant instances are queried. Given the usually large dataset sizes, evaluation of the scores of all instances (in either the concept-learning or the instance-retrieval phase) involves multiple disk accesses and can quickly become extremely costly.

¹Other criteria like diversity also play a role in the choice of instances, but the ambiguity in the labeling of an instance is captured by its nearness to the hyperplane. In this paper we focus on obtaining instances close to the hyperplane; their diversity can then be computed as a post-processing step.

Even when the dataset is small enough to fit into main memory, performing large numbers of computations can be very expensive, since the hyperplane is represented by a linear combination of a subset of training instances (called support vectors) and evaluating an unlabeled instance involves inner-product computations with each of the support vectors.

Our proposed pruning approach aims at speeding up retrieval of the top-k ambiguous and the top-k positive instances identified by SVMs. Our approach encloses the regions containing relevant instances in bounding boxes. Only the scores of instances within the bounding boxes are evaluated in our quest for the top-k instances. We detail bounding-box formation and pruning in the remainder of the paper. Apart from the speedup due to pruning, our approach enjoys manifold advantages:

• Retrieving both the nearest and the farthest. The approximate set of both the most ambiguous and the most relevant instances can be retrieved efficiently by the approach.

• Translating hyperplane queries to range queries. Since bounding boxes are essentially range queries, reuse of existing indexing techniques for range queries is feasible. For large datasets, this method can reduce the number of unnecessary disk accesses.

• Adaptive to kernels and kernel parameters. Our approach can adapt well to changes in both kernels and kernel parameters (explained in Section 3) without necessitating a change in index structures or retrieval strategy.

The rest of this paper is organized as follows. Section 3 presents a brief overview of SVM-based learning. Details of the proposed approach are presented in Section 4. Empirical validation of the proposed approach is presented in Section 5. We present our conclusions and possible future work in Section 6.

2. RELATED WORK

SVMs have had success in many real-world learning tasks over the past decade. Active learning using SVMs for text retrieval was studied in [27]. Its application in the multimedia domain has been studied for image retrieval [14, 26] and video retrieval [7, 22]. SVM indexing for approximate retrieval of the top-k farthest instances was first proposed in [23]. That approach addresses only top-k relevant queries, not top-k uncertain queries.

Efficient top-k querying has been the subject of intense research [1, 8, 9, 15, 16, 18, 20] for over a decade. (The references are by no means complete; detailed discussion may be found in [15, 19].) A traditional top-k query assumes that the query concept can be represented as a point in a metric space, and the top-k results are the k nearest instances to the query point in that space. We are interested in instances nearest to as well as farthest from a hyperplane representing a query concept. The hyperplane query is fundamentally different from the traditional point-query paradigm.

Range queries are often formulated as boxes in input space, with all instances within the specified ranges being of interest. A large number of indexing structures catering to such

queries exist in current literature [2, 13, 21, 25]. Section 4.3 demonstrates how the problem under consideration can benefit from these existing index structures.

3. SVM OVERVIEW

In this section we present a brief introduction to SVMs, presenting only their salient aspects. Interested readers may refer to [5, 28] for details. We present the SVM formulation in the binary-class setting².

Learning the Hyperplane: Given a set of training instances x_i ∈ ℜ^d with labels y_i ∈ {−1, +1}, i = 1, ..., n_tr, SVMs find a maximum-margin hyperplane separating the positive training instances from the negative ones. The hyperplane is specified in terms of its normal, w, and its scalar displacement, b, from the origin. Learning the optimal hyperplane is formulated as the following quadratic programming problem (QPP):

$$\min_{w,b} \; \frac{1}{2}\,\|w\|^2 \quad \text{subject to} \quad y_i\,(w \cdot \phi(x_i) + b) \ge 1. \tag{1}$$

The function φ is the implicit projection function. Transforming the primal problem (Eq. 1) into its dual using Lagrangian multipliers α_i ∈ ℜ and simplifying, we obtain:

$$\max_{\alpha} \; \sum_{i}^{n_{tr}} \alpha_i \;-\; \frac{1}{2}\sum_{i}^{n_{tr}}\sum_{j}^{n_{tr}} \alpha_i \alpha_j y_i y_j\, \phi(x_i)\cdot\phi(x_j) \quad \text{subject to} \quad \alpha_i \ge 0, \;\; \sum_{i}^{n_{tr}} \alpha_i y_i = 0. \tag{2}$$

²A possible generalization to the multi-class setting trains a classifier for each class separately (one-per-class).
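In practice the dual (Eq. 2) is solved by an off-the-shelf package; the paper relies on SVM-light/LIBSVM-style solvers [6, 10, 17]. Purely as an illustrative sketch (our own, not the authors' code), scikit-learn exposes exactly the quantities used in the rest of this paper: the signed weights α_i y_i, the support vectors, and the offset b. The toy data and the large C (approximating the hard-margin problem) are likewise our assumptions.

import numpy as np
from sklearn.svm import SVC

# Toy stand-ins for the labeled training instances x_i with labels y_i.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(+1.0, 0.5, size=(40, 144)),
               rng.normal(-1.0, 0.5, size=(40, 144))])
y = np.hstack([np.ones(40), -np.ones(40)])

# Gaussian (RBF) kernel, as in the paper's experiments; a large C approximates
# the hard-margin formulation of Eq. (1)/(2).
clf = SVC(kernel="rbf", gamma=0.01, C=1e6).fit(X, y)

support_vectors = clf.support_vectors_   # the x_i with nonzero alpha_i
signed_alphas = clf.dual_coef_.ravel()   # alpha_i * y_i, one per support vector
b = clf.intercept_[0]                    # scalar displacement b
print(len(support_vectors), "support vectors, b =", b)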

The normal, w, to the hyperplane is represented as a linear weighted sum of the training vectors, w = Σ_i^{n_tr} α_i y_i φ(x_i). Here, the α_i are the scalar weights to be determined by solving the QPP (Eq. 2). Only a subset of the training instances, those with nonzero α_i (called support vectors), represent the normal w.

Implicit Projection: It is important to note that only the inner products between instances (φ(x_i) · φ(x_j)) in the projected space are of importance. If it were possible to determine these inner products using the input-space feature vectors x_i and x_j, an explicit projection of instances to H (which may be extremely high-dimensional) would be unnecessary. Kernels satisfying Mercer's condition [28] achieve the above and can be conveniently used to replace the inner product φ(x_i) · φ(x_j) by k(x_i, x_j). Some popular kernels are the Gaussian kernel exp(−γ ‖x_i − x_j‖₂²), the Laplacian kernel exp(−γ ‖x_i − x_j‖₁), and the polynomial kernel (1 + x_i · x_j)^p. The parameters (γ, p) are tunable, allowing us to choose from among several different projections. Again, with each of these kernels, we can compute the inner products φ(x_i) · φ(x_j) between instances in a projected space without actually projecting instances into that space.

Decision Function: Having learned the normal to the hyperplane by solving the QPP formulated above, we can now classify unseen instances. Given an unseen instance z_r ∈ ℜ^d,

we determine its class as follows:

$$\mathrm{sign}(w \cdot \phi(z_r) + b) \;=\; \mathrm{sign}\Big(\sum_{i}^{n_{tr}} \alpha_i y_i\, \phi(x_i)\cdot\phi(z_r) + b\Big) \;=\; \mathrm{sign}\Big(\sum_{i}^{n_{tr}} \alpha_i y_i\, k(x_i, z_r) + b\Big).$$

The score of instance z_r, (w · φ(z_r) + b), indicates its distance from the hyperplane. A larger positive score represents a longer distance from the hyperplane, which indicates higher relevance to the positive class.

Definition 1. The top-k relevant instances are the k instances with the maximum positive³ scores. We are interested only in the instances with positive scores, since SVMs tag instances with negative scores as belonging to the negative class.

Definition 2. The top-k ambiguous instances are the k instances with the minimum absolute scores.

³If the number of SVM-identified positives (c+) is less than k, only the c+ positive instances form part of the top-k set.
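Both query types of Definitions 1 and 2 therefore reduce to ranking unseen instances by the score Σ_i α_i y_i k(x_i, z_r) + b. The following sketch (our own illustration, with a Gaussian kernel and hypothetical arrays standing in for the support vectors and weights) shows the brute-force scoring and top-k selection that Section 4 sets out to avoid.

import numpy as np

def gaussian_kernel(A, B, gamma):
    # k(a, b) = exp(-gamma * ||a - b||_2^2) for every row of A against every row of B.
    sq = ((A[:, None, :] - B[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-gamma * sq)

def scores(Z, support_vectors, signed_alphas, b, gamma):
    # Score of each unseen z_r: sum_i alpha_i y_i k(x_i, z_r) + b.
    return gaussian_kernel(Z, support_vectors, gamma) @ signed_alphas + b

def top_k_relevant(s, k):
    # Definition 1: k largest positive scores (fewer if fewer positives exist).
    pos = np.flatnonzero(s > 0)
    return pos[np.argsort(-s[pos])][:k]

def top_k_ambiguous(s, k):
    # Definition 2: k smallest absolute scores.
    return np.argsort(np.abs(s))[:k]

# Hypothetical shapes: 1000 unseen instances, 30 support vectors, 144 features.
rng = np.random.default_rng(1)
Z, sv, a = rng.normal(size=(1000, 144)), rng.normal(size=(30, 144)), rng.normal(size=30)
s = scores(Z, sv, a, b=0.1, gamma=0.01)
print(top_k_relevant(s, 5), top_k_ambiguous(s, 5))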

4. RETRIEVAL STRATEGY

Having presented the rudiments of SVMs in the previous section, we now present the details of our approach to retrieve the approximate set of top-k instances nearest to and farthest from the query hyperplane. We assume that the optimal maximum-margin hyperplane (separating the positively labeled training instances from the negatively labeled ones) is learned using the SVM learning algorithm. Standard packages [6, 10, 17] based on the sequential minimal optimization algorithm [24] are available for this purpose. The learning procedure determines the scalar weights, α_i, associated with the training instances. Before presenting the details of our approach, we revisit the scenario of interest.

Scenario: The SVM learning algorithm uses data samples x_i ∈ ℜ^d with associated labels y_i ∈ {−1, +1} as input to learn the classifier. The unlabeled data instances, z_r ∈ ℜ^d, need to be processed to determine the top-k nearest instances and the top-k farthest positive instances.

Our approach develops a pruning technique aimed at evaluating only a subset of the unseen vectors. The main idea is to determine regions in ℜ^d that may contain instances of interest, and then bound these regions using efficient structures. Unseen instances are processed in two stages. In the first stage, we check to see if they are contained within any of the bounding structures. Only instances enclosed within the bounding structures are then evaluated in the second stage to ascertain their scores. We want to focus our resources on those instances with a high probability of being relevant to the given query. Our approach takes the following steps:

• Group chosen training instances to determine relevant regions.

• Determine bounding boxes for the regions, thereby easing the verification of instance membership.

• Form structures for efficient querying of bounding boxes.

• Retrieve instances for evaluation, and determine the approximate set of top-k instances.

We discuss each of these steps in detail in the remainder of this section.

Figure 1: Necessity of using neg. support vectors

Figure 2: Necessity of using pos. vectors

4.1 Grouping Instances

Necessity: This stage helps us determine the regions in the input space, ℜ^d, which may contain instances of interest. Our grouping technique shares similarities with the support vector clustering (SVC) technique outlined in [3]. In general, clustering of instances is a costly exercise. Our algorithm aims to substantially lower the cost of grouping instances by finding approximate groups. Throughout the rest of this section, we refer to the support vector clustering algorithm as SVC and to our approximate clustering algorithm as ASVC.

Salient Aspects: We wish to determine the regions in input space that may contain instances of interest. In order to do so, the input data is the union of the set of negative support vectors (determined at the classifier-learning stage) and the set of positive training vectors. In addition, the same kernel and parameter used by the classifier-learning algorithm are employed in order to accomplish the implicit projection of instances to the projected space. There are three important aspects that need to be considered, namely, the inclusion of the negative support vectors, the inclusion of all the positive instances, and the projected space in use. We discuss these aspects as follows.

Negative Support Vectors: We use Figure 1 to illustrate the need for using negative support vectors. In the figure, the filled circles represent positive instances and the empty circles represent negative instances. The solid (red) curve delineates the boundary between the positive and negative classes; the dashed (blue) box encloses the instances of interest. As shown in Figure 1, it is possible for us to miss a large number of instances of interest by considering just the positive vectors in the grouping step. Further, the ambiguous instances (instances closest to the hyperplane) may be found on either side of the hyperplane. (In Section 4.2.1 we outline measures to further lower the probability of missing ambiguous instances.) Since the negative support vectors help delineate the boundary between the two classes, including them helps avoid missing instances of interest, thereby improving the accuracy of the retrieval algorithm.

Positive Instances: A logical question arises as to why we use only the support vectors in the negative class but the entire set of positive instances. Some regions of the space occupied by positive instances may not in fact be delineated by support vectors. For example, in Figure 2, the support vectors are all located close to the separating boundary and are shown with thicker boundaries. Limiting ourselves to the support vectors gives an incorrect idea of the region occupied by the positive instances. ASVC uses all positive vectors as input to mitigate this error.

Projected Space: The last aspect we discuss is the equivalence of the projected space in the classifier-learning and grouping steps.

Lemma 1. The projected space in kernel-based methods is decided by (a) the choice of the kernel function and (b) the choice of the parameter associated with the kernel function.

Proof. In Appendix.

Lemma 2. The projected spaces under consideration in the classifier-learning and grouping steps are equivalent.

Proof. The proof follows from the fact that we retain the same kernel function and parameter in both the classifier-learning and grouping steps.

4.1.1 Algorithm SVC

Before presenting the details of our grouping strategy, we present a brief overview of the SVC technique, which enjoys some important benefits [3] as compared to traditional clustering algorithms. Notably,

• The SVC algorithm does not require the specification of the number of clusters a priori, as in algorithms like k-means, and

• The lack of an explicit bias on the shape of the clusters endows SVC with the unique ability of learning arbitrarily-shaped clusters.

The SVC algorithm proceeds in two stages.

• Stage 1: Determine the radius, R, of the minimum bounding hyper-sphere enclosing the instances in the projected space.

• Stage 2: Determine the clusters by examining every pair of instances for possible co-membership in the same cluster. Given a pair of instances, if there exists a path connecting them such that the path lies completely within the bounding hyper-sphere, then the instances belong to the same cluster. In practice, intermediate points (numbering 20 in [3]) on the line joining the two instances are checked to see if they lie within the hyper-sphere.

An important drawback of the support vector clustering technique stems from the costly pairwise comparisons necessary to determine the clusters. The cost associated with this step is the sum of the cost of solving the quadratic optimization problem obtaining R, and the cost of the actual clustering step. The total cost [3, 24] is given by O(n^{2.3} d) + O(n^2 n'_{sv} d), where n is the number of input instances and n'_{sv} is the number of support vectors obtained from the solution of the optimization problem determining R.
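The Stage-2 co-membership test only needs the distance of a point from the hypersphere center c in the projected space, and that distance is computable through the kernel alone: ‖φ(x) − c‖² = k(x, x) − 2 Σ_j β_j k(s_j, x) + Σ_{j,l} β_j β_l k(s_j, s_l) when c = Σ_j β_j φ(s_j). The following sketch is ours and rests on assumptions: the one-class coefficients β over the points S are taken as given from the Stage-1 optimization, and the Gaussian kernel and its γ are illustrative.

import numpy as np

def svc_same_cluster(xa, xb, S, beta, R, kernel, n_checkpoints=20):
    # SVC pairwise check: xa and xb are co-members of a cluster if every sampled
    # point on the segment joining them lies inside the minimum bounding
    # hypersphere of radius R, whose center is c = sum_j beta_j * phi(S_j).
    const = sum(beta[j] * beta[l] * kernel(S[j], S[l])
                for j in range(len(S)) for l in range(len(S)))

    def dist_sq_from_center(x):
        # ||phi(x) - c||^2 computed through the kernel only.
        cross = sum(beta[j] * kernel(S[j], x) for j in range(len(S)))
        return kernel(x, x) - 2.0 * cross + const

    for t in np.linspace(0.0, 1.0, n_checkpoints):
        point = (1.0 - t) * xa + t * xb
        if dist_sq_from_center(point) > R ** 2:
            return False
    return True

# Illustrative Gaussian kernel; gamma is a placeholder, not a value from the paper.
gauss = lambda a, b: float(np.exp(-0.01 * np.sum((a - b) ** 2)))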

4.1.2 Algorithm ASVC

Having presented the details of SVC, we now present our grouping algorithm ASVC and discuss its advantages over SVC. ASVC proceeds in two stages.

• Stage 1: As in SVC, determine the radius R of the minimum bounding hypersphere in projected space. To do so efficiently, we leverage information available from the classification stage. Specifically, we use the alpha-seeding technique outlined in [11].

• Stage 2: Aggregate instances into groups. Details of the techniques applied (Membership Detection and Partitioning) are discussed below.

Alpha Seeding: The learning algorithm to determine R typically chooses a subset of the training vectors as the "active set" [17] and iteratively refines it, adding and discarding vectors till the optimality criteria are satisfied. The speed of convergence of the learning algorithm depends on the choice of the initial active set. A good choice of the active set leads to much faster convergence, often displaying super-linear convergence behavior. At the end of the classifier-learning step, we already have the set of positive and negative support vectors. These vectors lie on the boundary of the space containing the instances of interest. The one-class approach tries to find the minimum bounding sphere around a set of given training vectors; hence, choosing the vectors at (or close to) the boundary (the support vectors from the classification phase) as the active set is expected to yield faster convergence.

Membership Detection: To speed up the aggregation of instances into groups, we allow the relaxation of the constraints governing the group membership of instances. If two instances belong to the same cluster in SVC, with high probability they should belong to the same group in ASVC. However, if two instances do not belong to the same cluster under SVC, they may or may not belong to different groups under ASVC. Effectively, the above relaxation allows for larger, fewer groups. Each of the groups under ASVC may enclose one or more of the clusters under SVC.

Consider the distances between instance pairs belonging to the same cluster under SVC. Let dmax be the maximum of all such distances. Let us assume for the moment that it is possible to estimate dmax. In ASVC, starting with an arbitrary instance, x_i, all instances within a radius of dmax of x_i are labeled to belong to the same group as x_i. If new instances have been added to the group, these are processed recursively, with instances within dmax of each new instance being added to the group. If ungrouped instances remain, one of these is chosen randomly and the process outlined above is continued till all instances have been grouped.

Theorem 1. A pair of instances belonging to the same cluster under SVC belong to the same group under ASVC.

Proof. In Appendix.

ASVC assumes the availability of dmax. Exact evaluation of dmax is avoided by estimating it. Let the estimate of dmax be denoted by dˆmax. To estimate dmax, we pick a random instance, x_i, and compute its distances from all other instances. A binary search is then performed (starting at the farthest instance) on the sorted distances for the instance x_j such that x_i and x_j belong to the same cluster (determined using the pairwise check used in SVC). The process is repeated m times (elaborated below), each time updating the currently recorded maximum pairwise distance (d_n) if necessary. Finally, we set dˆmax = 2 d_n. The choice of the number of iterations, m, depends on the acceptable error in the estimation of dˆmax. Let d_t denote the distance between a pair of instances such that they are evaluated by the pairwise method as belonging to the same cluster. We only assume that d_t follows a symmetric distribution, and use the extreme value distribution to obtain error bounds for the estimate dˆmax.
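A compact sketch (ours, for illustration) of the two ingredients just described: estimating dˆmax by repeated binary search over sorted distances, and growing groups by flood fill with radius dˆmax. The co-membership predicate is the SVC check of Section 4.1.1; the per-slice restriction introduced in the Partitioning paragraph below is omitted here.

import numpy as np
from collections import deque

def estimate_dmax(S, m, same_cluster, rng=None):
    # For m random anchors, binary-search the distance-sorted list for the
    # farthest instance still co-clustered with the anchor; return 2 * d_n.
    rng = rng or np.random.default_rng(0)
    d_n = 0.0
    for _ in range(m):
        i = int(rng.integers(len(S)))
        dists = np.linalg.norm(S - S[i], axis=1)
        order = np.argsort(dists)               # nearest ... farthest
        lo, hi, best = 1, len(order) - 1, 0.0
        while lo <= hi:
            mid = (lo + hi) // 2
            if same_cluster(S[i], S[order[mid]]):
                best, lo = dists[order[mid]], mid + 1
            else:
                hi = mid - 1
        d_n = max(d_n, best)
    return 2.0 * d_n

def detect_members(S, dmax_hat):
    # Flood fill: everything within dmax_hat of an already-grouped instance
    # joins that group; remaining ungrouped instances seed new groups.
    labels = -np.ones(len(S), dtype=int)
    label = 0
    for start in range(len(S)):
        if labels[start] != -1:
            continue
        labels[start] = label
        queue = deque([start])
        while queue:
            u = queue.popleft()
            for v in np.flatnonzero(np.linalg.norm(S - S[u], axis=1) < dmax_hat):
                if labels[v] == -1:
                    labels[v] = label
                    queue.append(v)
        label += 1
    return labels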

Notation:
x+  = Positive training instances
sv− = Negative support vectors
R   = Radius of min. bounding hypersphere
c   = Center of hypersphere in projected space H
S   = x+ ∪ sv−
ns  = Number of instances per slice
BB  = Bounding Box
d   = Dimensionality of each vector

(a) procedure Obtain estimate dˆmax
Input: S, m
Output: dˆmax
    Train one-class SVM on S to obtain R and c
    dn = 0
    for i = 1 to m
        u = Random instance in S
        Compute sorted list of instances in S based on increasing distances from u
        Perform binary search for instance Sp s.t. u and Sp belong to the same cluster
        dt = Distance between u and Sp
        dn = (dn > dt) ? dn : dt
    dˆmax = 2 dn

(b) procedure Detect members
Input: S, ns and dˆmax
Output: Labels of instances
    f = Random feature
    S′ = Sorted arrangement of instances along feature f
    label = 0
    for i = 1 to |S′|/ns
        for j = 1 to ns
            if Unlabeled S′_{(i−1)·ns+j}
                Label S′_{(i−1)·ns+j} with label
                Push S′_{(i−1)·ns+j} onto stack
                while stack not empty
                    u = Pop stack
                    for k = 1 to ns
                        if Unlabeled S′_{(i−1)·ns+k}
                            if distance(u, S′_{(i−1)·ns+k}) < dˆmax
                                Label S′_{(i−1)·ns+k} with label
                                Push S′_{(i−1)·ns+k} onto stack
                label = label + 1

(c) procedure Form Bounding Boxes
Input: S
Output: Bounding Box(es) BB
    for i = 1 to |S|
        label = Si.label
        for j = 1 to d
            BB_label^{lower,j} = Si^j  if BB_label^{lower,j} > Si^j
            BB_label^{upper,j} = Si^j  if BB_label^{upper,j} < Si^j

Figure 3: Algorithms for (a) Estimating dˆmax, (b) Grouping Instances, (c) Forming Bounding Boxes

Figure 4: Error induced by partitioning

Figure 5: Query boxes and index partitions

Lemma 3. (Extreme Value Distribution) Let ξ1, ..., ξm be identically distributed independent random variables with the common cumulative distribution function F(ξi). The cumulative distribution function of the maximum (ξmax = max_{i∈[m]} ξi) of the random variables follows the distribution (F(ξmax))^m.

Since we have assumed a symmetric distribution, the probability that ξmax (dn in our case) is greater than 50% of the maximum of all estimates is 1 − (0.5)^m.

The number of samples, m, used for computing dˆmax may be chosen based on the acceptable probability of error. For instance, picking just 10 samples, P(dˆmax ≥ dmax) = 0.999 (less than a 0.1% chance of failure). In our experiments we set m = 10.

Partitioning: We use a partitioning strategy to speed up the grouping of instances. The speedup is attained at a small loss of accuracy (discussed in Section 4.2). The partitioning strategy proceeds by selecting one feature randomly and partitioning the data instances (sizes discussed in Section 4.2) based on that feature. This "slices" the space into (hyper)rectangular partitions as shown in Figure 4, where the y-axis has been chosen to partition the data space. ASVC is then applied to each of the slices separately. Pairwise comparisons and computations are restricted to instances belonging to the same slice. The effect is that of speeding up aggregation, with groups being determined on a per-slice basis.

Cost: The cost associated with the technique outlined above is the sum of the cost of the optimization step to determine R, O((n_{tr+} + n_{sv−})^{2.3} d), and the cost of the grouping step, O(n_s (n_{tr+} + n_{sv−}) d). If SVC had been applied to the slices for clustering, it would have incurred a cost of O(n'_{sv} n_s (n_{tr+} + n_{sv−}) d).
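The partitioning step is just a one-dimensional sort followed by chopping into blocks of n_s instances, with the grouping routine run independently per slice. A sketch of this driver (ours; detect_members from the earlier sketch stands in for ASVC applied to a slice):

import numpy as np

def group_in_slices(S, ns, group_slice, rng=None):
    # Slice the data along one randomly chosen feature into blocks of ns
    # instances and group each slice independently; labels from different
    # slices are kept disjoint.
    rng = rng or np.random.default_rng(0)
    f = int(rng.integers(S.shape[1]))        # random feature to slice on
    order = np.argsort(S[:, f])              # sorted arrangement along feature f
    labels = np.empty(len(S), dtype=int)
    next_label = 0
    for start in range(0, len(S), ns):
        idx = order[start:start + ns]        # one (hyper)rectangular slice
        slice_labels = group_slice(S[idx])   # e.g. lambda X: detect_members(X, dmax_hat)
        labels[idx] = slice_labels + next_label
        next_label += int(slice_labels.max()) + 1
    return labels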

4.2 Bounding Boxes

Necessity: This step obtains bounding boxes around the groups. These bounding boxes are then used to filter unseen instances, speeding up the determination of instances of interest.

Salient Aspects: Having determined the groups, we construct the bounding boxes by determining the maximum and minimum coordinates of the respective groups along each feature. Groups, though tighter, are not as computationally tractable and efficient as bounding boxes, since each bounding box needs just 2d values for its specification.

As demonstrated in Figure 4 (the dotted box represents the bounding box for the top slice, the region with positives being enclosed by the solid curve), the partitioning strategy may introduce error, with some regions within the slice containing positives being excluded from the bounding box. For the sake of analysis we choose a slice arbitrarily. Let the number of instances per slice be greater than ns. For simplicity, we assume that all the instances in the slice belong to the same group. Results for multiple groups per slice follow from similar reasoning.

Focusing on a particular feature, say r, we find the r-th feature distances⁴ between instances belonging to the same group.

⁴Note that we are only looking at the r-th feature values when computing these distances.

Let ̺r be the maximum r-th feature distance between a pair of training instances belonging to the chosen group. We would like to estimate how close ̺r is to the maximum (̺rmax) for the region within the slice enclosing the group⁵.

Lemma 4. The distribution of the maximum r-th feature distance ̺r in the chosen slice is (F(̺r))^{ns}.

Proof. With ns instances per slice, the total number of pairwise distances (for every feature and otherwise) is ns(ns − 1)/2, but only ns of these are independent. The proof follows from Lemma 3 if the independent pairwise distances are denoted by random variables.

For example, assuming a uniform distribution⁶ and ns = 45, the probability of ̺r being within 90% of the actual maximum⁷ (̺rmax) is ≈ 0.99. We report results with slice sizes ranging from 25 to over 100 in Section 5. Since the feature r in the analysis was arbitrarily chosen, the above analysis applies to every feature. As the feature ranges obtained are, with high probability, close to the maximal feature ranges, almost all positive instances identified by the SVM are captured if all instances with feature values within the range(s) are evaluated. The instances with the highest scores are but a subset of the positives; hence, with high probability most of these instances will be enclosed within the feature ranges. Lemma 4 ensures this holds with high probability, since the width of the bounding box along every dimension is close to the best possible.

Cost: Only a linear pass is required to determine the bounding boxes, with the cost being O((n_{tr+} + n_{sv−}) d).

⁵Using the maximum ensures that all instances in the region are captured by the bounding box constructed.

⁶Though the uniform distribution assumption may seem to be a strong assumption, empirical performance in Section 5 suggests that it is not a poor choice for analysis.

⁷The analysis applies with ns replaced by the number of instances in a group when all instances in a slice do not belong to the same group.
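The linear pass of Figure 3(c) amounts to a per-group coordinate-wise minimum and maximum; a sketch (ours):

import numpy as np

def bounding_boxes(S, labels):
    # For each group, the coordinate-wise min/max over its members: the 2d
    # numbers that specify one bounding box.
    return {int(g): (S[labels == g].min(axis=0), S[labels == g].max(axis=0))
            for g in np.unique(labels)}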

4.2.1 Expanding Bounding Boxes

Instances close to the hyperplane in the projected space H by definition translate to instances close to the boundary in the input space. The error induced by the partitioning strategy outlined in Section 4.1 affects the retrieval of boundary instances. To enhance the accuracy of retrieval of the instances closest to the boundary, our approach expands the bounding boxes so that a majority of these boundary instances are evaluated.

Let l = BB_t^{lower,j} and u = BB_t^{upper,j} be the lower and upper limits along the j-th feature for the t-th bounding box. Expansion of the bounding box can be achieved by setting l and u to εl and u/ε respectively (0 ≤ ε ≤ 1). The new width of the bounding box along the j-th feature is u/ε − εl. The ratio with respect to the original width is given by

$$\frac{\frac{1}{\epsilon}u - \epsilon l}{u - l} \;=\; \frac{1}{\epsilon}\cdot\frac{u - l\epsilon^2}{u - l} \;\ge\; \frac{1}{\epsilon}.$$

Given ns instances per slice, the largest inter-instance distance along a chosen feature is probabilistically close to the maximum possible (Lemma 4). For example, in the case of the uniform distribution with ns instances per slice, the probability of the maximum inter-instance distance (̺r) along any feature r being within ε of the actual maximum (̺rmax)

for the region is 1 − ε^{ns}. Thus, P(̺r/ε ≥ ̺rmax) = 1 − ε^{ns}. For example, ε = 0.95 translates to P(̺r/ε ≥ ̺rmax) > 0.9 using the uniform distribution with 45 instances per slice. ε can be chosen appropriately depending upon the acceptable error and the number of instances per slice. A smaller ε leads to larger bounding boxes, requiring the evaluation of a larger number of instances, while a large ε may lead to instances close to the boundary not being evaluated.
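A sketch of the expansion rule l → εl, u → u/ε (our illustration; as written it presumes non-negative feature values, e.g. features normalized to [0, 1] — for signed features one would instead widen symmetrically about the box center):

def expand_box(lower, upper, eps=0.95):
    # Near-boundary (ambiguous-instance) queries use eps < 1; eps = 1 leaves the
    # box unchanged, as in the farthest-instance experiments of Section 5.
    # Each width grows by a factor of at least 1/eps (assumes lower, upper >= 0).
    return eps * lower, upper / eps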

4.3 Index Structure

Necessity: In large datasets, the cost of retrieving instances from the disk often surpasses the computational cost. Therefore, it is important to minimize the number of disk accesses, using index structures to retrieve only relevant partitions. We present the mapping of the problem under consideration to the well-studied area of range queries. We also outline the necessity and the nature of other structures for efficient retrieval of relevant instances.

Salient Aspects: In the previous section we outlined the development of bounding boxes around the regions of interest. Treating the bounding boxes as range queries allows us to use multiple existing indexing structures like the R-tree [13], the R*-tree [2], and the TV-tree [21]. These are representative of the vast amount of literature on index structures for range queries, but by no means a complete listing of available resources. What we wish to highlight is that a database with an existing index structure can be used to retrieve partitions containing the computed bounding boxes. The instances contained in these partitions, however, represent a superset of the instances within the bounding boxes. Instance addition and deletion can be handled by index-specific techniques.

An important benefit stems from the independence of the index structure from the kernel and the parameter in use. The index structure can be reused for multiple kernels and parameters without requiring any changes to data organization or storage of additional structures. The hyperplane query can therefore be addressed by an additional computational structure above the original index, completely avoiding costly construction of new index structures and possible reorganization of data.

Cost: The cost associated with this step is the cost of constructing the chosen index structure. Unlike the other steps, this step has only a one-time cost.
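Because a bounding box is an axis-aligned range, deciding which index partitions must be fetched is the standard rectangle-intersection test used by R-tree-style structures. A library-free sketch of that pruning decision (ours, standing in for whichever range-query index already exists over the data):

import numpy as np

def boxes_intersect(lo_a, hi_a, lo_b, hi_b):
    # Axis-aligned boxes intersect iff their ranges overlap along every dimension.
    return bool(np.all(lo_a <= hi_b) and np.all(lo_b <= hi_a))

def partitions_to_fetch(partitions, query_boxes):
    # partitions: {partition_id: (lower, upper)} minimum bounding rectangles.
    # query_boxes: list of (lower, upper) bounding boxes for the current concept.
    # Only the returned partitions need to be read from disk.
    hits = set()
    for pid, (p_lo, p_hi) in partitions.items():
        if any(boxes_intersect(p_lo, p_hi, q_lo, q_hi) for q_lo, q_hi in query_boxes):
            hits.add(pid)
    return hits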

4.3.1 Structure for Efficient Query

Necessity: As noted above, it is possible for the index structure to return a superset of the instances belonging to the bounding boxes. For example, in Figure 5, the solid lines indicate the partitions of the data instances in the index structure, while the dashed lines indicate the bounding boxes for a hypothetical concept. The query for the bounding boxes would return partitions 3 and 4. However, not all the instances in these partitions are of interest. The construct Str described in this section helps prune instances outside the bounding boxes efficiently. The number of bounding boxes may vary with the concept under consideration. Efficient query processing requires the rapid elimination of possibilities (unseen instances) when possible.

Salient Aspects: Given the bounding box coordinates for a query, Str records the bounding boxes at each such coordinate, so that the processing of an unseen instance can rapidly eliminate bounding boxes not enclosing the instance.

Notation:
z   = Test instances
Str = Structure containing bounding boxes
d   = Dimensionality of each vector

In essence, the range associated with each feature is subdivided into smaller zones, each either belonging to (one or more) bounding boxes or empty. We consider a particular feature, r, for clarity of exposition. Str_r contains the sorted coordinate values (2·n_bb in number) of the bounding boxes along this feature. The sorted array of coordinates is processed to record the list of functional⁸ bounding box(es) (Str_r^j.functional_bb) associated with each coordinate (Str_r^j). This sorted array provides us with a means of deciding the bounding boxes populating each zone within the range spanned by the feature. Since a bounding box can be either present or absent, a simple bit-string can be used to represent the bounding boxes in each zone. Thus, at each of the 2·n_bb coordinates for each feature, we have a bit string of length n_bb indicating the bounding boxes functional in that zone.

Cost: The cost associated with the above step is bounded by the cost of the sorting step and is O(n_bb d log(n_bb)).
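A sketch of Str under our reading of the construction above: per feature, sort the 2·n_bb box endpoints, and record for the zone starting at each endpoint a bit string marking the boxes functional there (Python integers serve as arbitrary-length bit strings):

def build_str(boxes):
    # boxes: list of (lower, upper) coordinate arrays of length d.
    # Returns, per feature r, (sorted endpoint coordinates, one bitmask per zone);
    # bit t of masks[j] is set iff box t covers the zone starting at coords[j].
    d = len(boxes[0][0])
    structure = []
    for r in range(d):
        coords = sorted({float(lo[r]) for lo, up in boxes} |
                        {float(up[r]) for lo, up in boxes})
        masks = [0] * len(coords)
        for t, (lo, up) in enumerate(boxes):
            for j, c in enumerate(coords):
                if lo[r] <= c < up[r]:
                    masks[j] |= (1 << t)
        structure.append((coords, masks))
    return structure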

4.4 Query Processing

Retrieving the top-k instances of interest involves retrieving instances within bounding boxes and evaluating them to ascertain their scores. When using a range-query index structure, relevant partitions are retrieved from the disk. Given an unseen instance, z_t, deciding on its class membership proceeds as follows. A feature, r, is picked⁹ and a binary search is performed for the feature value, z_t^r, on the structure Str_r. Let the binary search return coordinate j. The binary search yields either an empty set or a set of possible bounding boxes (Str_r^j.functional_bb) enclosing the unseen instance. If the result is a set of bounding boxes, another feature is picked and the set of bounding boxes enclosing the unseen instance is revised. The process is continued till we either end up with the empty set or exhaust all features. If we still have bounding boxes for the instance after all features have been explored, it is possible that the unseen instance is of interest. The unseen instance is then evaluated with respect to the classifying hyperplane in order to decide upon class membership and obtain its score. In the iterative (concept-learning) stages, the top-k instances with the lowest absolute scores are returned. When the concept has been identified, the top-k positive instances with the highest scores are returned as responses to the hyperplane query.

If n_o and n_i indicate the number of instances outside and inside the bounding boxes respectively, the overall cost is bounded above by O((n_o + n_i) d log(n_bb) + n_i n_sv d). The first component of the above cost caters to the cost of determining the bounding box(es) enclosing the instances, and the second component quantifies the cost of score computation. The cost savings stem from the non-evaluation of the n_o instances, which can be stated as O(n_o n_sv d). Savings due to non-retrieval of instances using an index structure have not been taken into account here, since these are index-specific. The speedup is given by

$$\frac{(n_o + n_i)\, n_{sv}\, d}{(n_o + n_i)\, d \log(n_{bb}) + n_i\, n_{sv}\, d}.$$

⁸Bounding boxes with feature ranges enclosing the coordinate are "functional" at the coordinate, since an instance with that coordinate value could possibly be enclosed by one or more of the bounding boxes.

⁹There are various methods of choosing the feature for pruning the bounding boxes. Possibilities include random selection, features with narrow ranges first, history-based selection, etc. In our experiments we scanned the features sequentially.

procedure Process instance
Input: z, Str
Output: Top-k instances with highest scores
    for i = 1 to |z|
        functional_bb = All the bounding boxes
        for j = 1 to d
            Perform binary search for z_i^j on Str_j
            Update functional_bb
            if functional_bb is empty
                break
        if functional_bb not empty
            Evaluate score of instance z_i and update current top-k

Figure 6: Query Processing Algorithm

However, this is a lower bound on the speedup. A higher speedup would almost always be achieved, since O((n_o + n_i) d log(n_bb)) denotes the worst-case time complexity for pruning the n_o instances. After simplification, the speedup translates to

$$\frac{n_{sv}}{\log(n_{bb}) + \frac{n_i}{n_o + n_i}\, n_{sv}}.$$

Lemma 5. If log(n_bb) ≪ n_sv, the speedup obtained using bounding boxes instead of brute-force evaluation of all instances in the dataset is given by (n_o + n_i)/n_i.

With the assumption that the dataset is large and the bounding boxes tight, n_o is expected to be much larger than n_i, leading to large speedup values. This is borne out by our experiments (Section 5). The above does not take into account either the cost of training the one-class SVM or the cost of constructing the necessary structures (Section 4.1). Speedup results reported in Section 5 include these costs.
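A sketch of the per-instance loop of Figure 6 (ours, built on the build_str sketch above): binary-search each feature's coordinate list, intersect the zone bit strings, and compute a score only when the candidate set never empties.

import heapq
from bisect import bisect_right

def candidate_boxes(z, structure):
    # Returns a bitmask of boxes that may contain z; 0 means z is pruned.
    mask = -1                                 # Python int acting as "all bits set"
    for r, (coords, masks) in enumerate(structure):
        j = bisect_right(coords, float(z[r])) - 1
        if j < 0:                             # z lies left of every box along feature r
            return 0
        mask &= masks[j]
        if mask == 0:
            return 0
    return mask

def top_k_query(Z, structure, score, k, mode="relevant"):
    # Score only instances surviving the bounding-box pruning; keep a running
    # top-k of the largest positive scores ("relevant") or smallest absolute
    # scores ("ambiguous").
    heap = []                                 # min-heap on the ranking key
    for i, z in enumerate(Z):
        if candidate_boxes(z, structure) == 0:
            continue                          # pruned: score never computed
        s = score(z)
        if mode == "relevant" and s <= 0:
            continue
        key = s if mode == "relevant" else -abs(s)
        heapq.heappush(heap, (key, i))
        if len(heap) > k:
            heapq.heappop(heap)
    return [i for _, i in sorted(heap, reverse=True)]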

5. EXPERIMENTS

To evaluate the effectiveness of the proposed techniques, we carried out empirical evaluations on multiple pre-defined concepts over two datasets. Our experiments were targeted at evaluating the accuracy and efficiency of the presented techniques.

Quality Measures: Since our approach retrieves an approximate set of top-k instances, we used two different measures to evaluate the quality of the retrieved instances. The first measure used by us was the recall achieved. To compute it, the actual set of top-k instances was first obtained by exact score computation of all the instances in the dataset. The instances retrieved by our approach were then compared with the actual set of top-k instances to determine the number of matches. Apart from recall, another measure advocated in [1] measures the average effective error for 1-nearest-neighbor search as:

$$E = \frac{1}{|Q|}\sum_{\text{query } x_q \in Q}\left(\frac{d_{index}}{d^*} - 1\right),$$

where d_index denotes the distance of the retrieved instance from the query point x_q, d^* is the distance of the nearest instance to the query point, and the sum is taken over all queries. This avoids the 0/1 loss pattern in recall. For the k-nearest-neighbor problem, as in [12], we sum and then average the

ratios of the distance of the closest point retrieved to the distance of the nearest neighbor, the distance of the second closest point retrieved to the distance of the second nearest neighbor, and so on (till the k-th nearest neighbor). We use similar metrics for the average effective error using hyperplane queries. For the 1-farthest-instance query the average effective error is given by:

$$E' = \frac{1}{|Q'|}\sum_{\text{query } H_q \in Q'}\left(1 - \frac{sc_i}{sc^*}\right),$$

where sc_i is the distance of the retrieved instance from the hyperplane, sc^* is the distance of the farthest instance from the hyperplane, and H_q represents the query hyperplane. The reversal of sign is because here we are interested in the farthest instances (the denominator is always greater than or equal to the numerator sc_i in sc_i/sc^*). Generalization to the k farthest instances from the hyperplane is achieved in a similar manner to [12], by summing and averaging the respective ratios of errors for the k retrieved instances.

However, the above measure is not very informative in the nearest-instance-to-the-hyperplane query scenario, where the score (|sc^{**}|) of the instance closest to the hyperplane is comparable to the difference between scores (|sc_i| − |sc^{**}|). Using the mean score sc_avg, we modify the computation of the effective error as

$$E'' = \frac{1}{|Q'|}\sum_{\text{query } H_q \in Q'}\left(1 - \frac{|sc_{avg}| - |sc_i|}{|sc_{avg}| - |sc^{**}|}\right),$$

with generalization to k instances as described before.

Efficiency Measure: To measure the efficiency of our approach, we considered (a) the number of instances evaluated and (b) the time taken by our approach as compared to a brute-force evaluation of all instances in the dataset. No indexing structures were used, to give maximum benefit to sequential evaluation, since all instances were assumed to be loaded in memory.

Datasets: We used two datasets for our experimental evaluation. The first dataset is an image collection from Corbis (http://pro.corbis.com) containing nearly 315,000 samples, each with 144 continuous attributes describing color, texture, and shape characteristics [26]. Of the over 1,100 categories, the 400 with the largest number of instances were chosen. For each selected category, 50% of the dataset was used as the training set for the SVM learning algorithm. SVM learning was carried out using SVM-light [17]. The set of top-k farthest instances was then retrieved from the entire dataset. Our second dataset was obtained from the Corel image collection and contains more than 51,000 images. Feature vectors of length 144, detailing various aspects of the images like hue, saturation, brightness, and contrast, were extracted. There are more than 500 categories in the dataset. We used the top 400 categories in our experiments.

Setup: For all our experiments we used the popular Gaussian kernel. First we selected relevant kernel parameters for each of the datasets using cross-validation (γ = 0.01 for the Corbis dataset and γ = 0.001 for Corel). The experiments were carried out on a Linux workstation with a 1GHz Pentium processor and 1.5GB of RAM. Our experiments focused on evaluating the effect of changes in the value of k and the number of instances per slice.

Results: We present results for the retrieval of the nearest and the farthest instances to the hyperplane separately. The farthest-instance queries were performed independently of the nearest-instance queries, using the same concepts of interest. For the nearest-instance queries we set ε = 0.95, while ε = 1 was used for the farthest-instance queries (since the farthest instances are expected to be distant from the boundary, and adjustments to the boundary are not as important).

Figure 7 presents the results of our approach on the Corbis dataset, averaged over all 400 different hyperplane queries. Figures 7(a) and 7(b) present results for the nearest-instance queries, while Figures 7(c) and 7(d) present results for the farthest-instance queries. The x-axis represents the various k values employed (top-{1, 5, 10, 15, 20}) and the y-axis represents the quality measures (recall and effective error respectively). Each curve is obtained with a different number of instances per slice (25, 35, 45, 50, 75 and 100). There is little variation in recall rates over the different values of k, with recall rates for higher slice sizes being consistently above 80%. Further, the effective error rates over the same slice sizes for the top-k nearest and farthest queries are less than 0.001 and 0.015 respectively, indicating the very high quality of retrieved samples even when the exact set of top-k instances could not be retrieved.

We also present performance measures displaying the cumulative densities of the quality measures and speedup over the 400 top-20 queries with a slice size of 45 in Figures 8 and 9. Figure 8 presents the results for the nearest-instance queries, while Figure 9 presents performance characteristics for the farthest-instance queries. In both figures, the first two graphs show the plots of the cumulative densities of the recall and effective error, while the last two graphs present the results for speedup and the percentage of instances not evaluated. Over 75% of the queries in both settings have effective error rates of less than 0.01, indicating the high quality of retrieved instances. Speedups over the 400 nearest-instance queries range from a minimum of 7 to a maximum of 93 times, with a mean speedup of 42.5. Speedups for the farthest-instance queries range from a minimum of 16 to a maximum of 109 times, with the average speedup being about 58.

Speedup was computed by comparing the sequential processing time with the total time taken by the various components comprising our approach (the cost of the one-class algorithm, the cost of constructing the pruning structure, the cost of checking whether an instance is within a bounding box, and the cost of evaluating the scores of all instances within the bounding boxes). To give maximum benefit to the sequential-scan strategy, all instances in the dataset were checked for membership in the bounding boxes. Using an index structure would further enhance the speedup by retrieving only relevant partitions. However, the number and type of partitions retrieved would be index-structure dependent. Another measure of the efficiency of our approach is the fraction of instances whose scores did not need to be computed. Figures 8(d) and 9(d) show the percentage of instances not evaluated (scores not computed) in both query settings. The uniformly high rates demonstrate the effectiveness of our pruning approach, with 98.7% not evaluated on average.
Experiments on the Corel dataset were carried out with a fixed k value of 20. We present a summary of the performance results obtained on this relatively small dataset.

Figure 7: Average retrieval quality over varying slice sizes and k values for both nearest and farthest queries

Figure 8: Nearest instance retrieval results over 400 queries on Corbis dataset (45 instances per slice, top-20)

Figure 9: Farthest instance retrieval results over 400 queries on Corbis dataset (45 instances per slice, top-20)

On average, 77.3% of the instances in the dataset were pruned out by the indexing structure, with a speedup of about 3.9 times, in the quest for the most ambiguous instances. The average error in the choice of the ambiguous instances was less than 0.005, with an average recall of 77%. In the retrieval of the farthest instances from the hyperplane, on average 85.3% of the dataset instances were pruned, leading to a speedup of 6.2 times. The average error for the farthest-instance queries was found to be 0.09, with an average recall of 79.2%.

Remark: The performance of the proposed technique varies with the distribution of the concept of interest. A compact concept of interest occupying a small region of the entire data space would yield good performance. Similarly, a concept of interest distributed in space, yet with component regions that are compact and distant from each other, would yield good results. Further, the larger dataset demonstrates better performance both in terms of speedup and accuracy, which bodes well for the scalability of the proposed technique. One scenario where the proposed technique would not be effective is when the instances belonging to the concept of interest are scattered in the input space. However, most concepts are expected to cluster in input space, and the absence of any

structure in the distribution of the data instances indicates that the extracted features may be inadequate to model the concept. Another aspect which would affect the performance of the proposed technique is the dimensionality of the input data space. Our experiments indicate that the approach is effective in moderately high dimensions, when the features number in the hundreds. There do exist domains like text retrieval where the feature descriptors are very sparse but the number of features is close to 50,000. The performance of bounding-box techniques, even with approximations, will in general be poor in such extremely high-dimensional spaces.

6. CONCLUSION

In this paper we presented techniques for efficient retrieval of top-k instances identified by the SVM algorithm. The retrieval problem was mapped (after suitable preprocessing) to the well-known database range-query problem thus enabling the deployment of SVM-based techniques on large datasets without a need for the construction of new index structures. Our experiments validated the effectiveness of the proposed approach, with both datasets displaying high recall and efficiency over a total of 800 different hyperplane queries. Further, the large percentage of instances not needing evaluation

bodes well for the use of range-query index structures, since only a few partitions are expected to be retrieved on average.

7. REFERENCES

[1] S. Arya, D. Mount, N. Netanyahu, R. Silverman, and A. Wu. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. In Proceedings of the 5th SODA, pages 573–82, 1994.
[2] N. Beckmann, H.-P. Kriegel, R. Schneider, and B. Seeger. The R*-tree: An efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data, Atlantic City, NJ, May 23-25, 1990, pages 322–331. ACM Press, 1990.
[3] A. Ben-Hur, D. Horn, H. Siegelmann, and V. Vapnik. Support vector clustering. Journal of Machine Learning Research, 2:125–137, 2001.
[4] B. E. Boser, I. Guyon, and V. Vapnik. A training algorithm for optimal margin classifiers. In Computational Learning Theory, pages 144–152, 1992.
[5] C. J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.
[6] C.-C. Chang and C.-J. Lin. LIBSVM: a library for support vector machines, 2001. Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[7] M.-Y. Chen, M. Christel, A. Hauptmann, and H. Wactlar. Putting active learning into multimedia applications: dynamic definition and refinement of concept classifiers. In ACM Multimedia, 2005.
[8] P. Ciaccia and M. Patella. PAC nearest neighbor queries: Approximate and controlled search in high-dimensional and metric spaces. In Proceedings of the International Conference on Data Engineering, pages 244–255, 2000.
[9] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In VLDB, pages 426–435, 1997.
[10] R. Collobert and S. Bengio. SVMTorch: Support vector machines for large-scale regression problems. Journal of Machine Learning Research, 1:143–160, 2001.
[11] D. DeCoste and K. Wagstaff. Alpha seeding for support vector machines. In Proceedings of the International Conference on Knowledge Discovery and Data Mining, pages 345–359, 2000.
[12] A. Gionis, P. Indyk, and R. Motwani. Similarity search in high dimensions via hashing. In The VLDB Journal, pages 518–529, 1999.
[13] A. Guttman. R-trees: A dynamic index structure for spatial searching. In B. Yormark, editor, SIGMOD'84, Proceedings of Annual Meeting, Boston, Massachusetts, June 18-21, 1984, pages 47–57. ACM Press, 1984.
[14] X. He, W.-Y. Ma, O. King, M. Li, and H. Zhang. Learning and inferring a semantic space from user's relevance feedback for image retrieval. In ACM Multimedia, pages 343–346, 2002.
[15] M. E. Houle and J. Sakuma. Fast approximate similarity search in extremely high-dimensional data sets. In ICDE, 2004.
[16] P. Indyk and R. Motwani. Approximate nearest neighbors: towards removing the curse of dimensionality. In Proc. of 30th STOC, pages 604–613, 1998.
[17] T. Joachims. Making large-scale SVM learning practical. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning. MIT Press, 1999.
[18] N. Katayama and S. Satoh. The SR-tree: an index structure for high-dimensional nearest neighbor queries. In ACM SIGMOD Int. Conf. on Management of Data, pages 369–380, 1997.
[19] D. A. Keim. Tutorial on high-dimensional index structures: Database support for next decade's applications. In Proceedings of the ICDE, 2000.

[20] K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal: Very Large Data Bases, 3(4):517–542, 1994.
[21] K.-I. Lin, H. V. Jagadish, and C. Faloutsos. The TV-tree: An index structure for high-dimensional data. VLDB Journal, 3(4):517–542, 1994.
[22] A. P. Natsev, M. R. Naphade, and J. Tešić. Learning the semantics of multimedia queries and concepts from a small number of examples. In ACM Multimedia, pages 598–607, 2005.
[23] N. Panda and E. Y. Chang. Exploiting geometry for support vector machine indexing. In SIAM International Data Mining Conference, 2005.
[24] J. Platt. Fast training of support vector machines using sequential minimal optimization. In B. Schölkopf, C. Burges, and A. Smola, editors, Advances in Kernel Methods - Support Vector Learning, 1998.
[25] H. Samet. The quadtree and related hierarchical data structures. ACM Computing Surveys, 16(2):187–260, 1984.
[26] S. Tong and E. Y. Chang. Support vector machine active learning for image retrieval. In ACM International Conference on Multimedia (MM), pages 107–118, 2001.
[27] S. Tong and D. Koller. Support vector machine active learning with applications to text classification. In P. Langley, editor, Proceedings of ICML, pages 999–1006, Stanford, US, 2000. Morgan Kaufmann Publishers, San Francisco, US.
[28] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, 1998.

APPENDIX

Lemma 1. The projected space in kernel-based methods is decided by (a) the choice of the kernel function and (b) the choice of the parameter associated with the kernel function.

Proof. The inner product between any two instances, x_i, x_j, in the projected space is given by K(x_i, x_j) = ⟨φ(x_i), φ(x_j)⟩ = f_p(x_i, x_j), where φ is the implicit projection function, f denotes the kernel used, and p denotes the parameter for the chosen kernel. Importantly, kernels project instances into a Hilbert space where the norm is defined in terms of the inner product. Therefore, the choice of the kernel function f and the parameter p is enough to decide the projected space.

Theorem 1. A pair of instances belonging to the same cluster under SVC belong to the same group under ASVC.

Proof. We prove by contradiction. Let x_i and x_j be a pair of instances belonging to the same cluster under SVC and to different groups under ASVC. Then the distance between x_i and x_j has to exceed dmax, since all instances within dmax of each other belong to the same group under ASVC. Further, there does not exist any sequence of instances, {x_i, x_{a1}, ..., x_{ar}, x_j}, such that the distance between every adjacent pair of instances is less than dmax. However, since x_i and x_j belong to the same cluster under SVC, there exists at least one sequence of instances, {x_i, x_{b1}, ..., x_{bt}, x_j}, such that all the instances in the sequence belong to the same cluster and the distance between every adjacent pair is not greater than dmax (note that the sequence might consist of only x_i and x_j in some cases). The above implies that, by following the same sequence of instances, ASVC would assign x_i and x_j to the same group. Thus, the assumption that x_i and x_j belong to different groups under ASVC is invalid.
