Empirical Evaluation of Excluded Middle Vantage Point Forest on Biological Sequences Workload

Weijia Xu
Texas Advanced Computing Center
The University of Texas at Austin

Lee Parnell Thompson
Department of Computer Sciences
The University of Texas at Austin

Daniel P Miranker
Department of Computer Sciences
The University of Texas at Austin

ABSTRACT

We develop and evaluate a version of the excluded middle vantage point forest in support of range searches and load balancing for parallel queries. The algorithm is evaluated using a benchmark suite that includes real-world biological sequence workloads. Favorable results are demonstrated when comparing to the Multiple Vantage Point Tree and Spatial Approximation Tree algorithms with respect to sequential measures. We also demonstrate that the performance of this approach scales linearly up to at least 128 cores and outperforms a naïve distributed multiple vantage point forest approach when run in parallel.

Categories and Subject Descriptors H.3.1 [Content Analysis and Indexing]: Indexing methods

General Terms Algorithms

Keywords Metric space index, Exclusion

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. NTSS 2011, Mar 25, 2011, Uppsala, Sweden. Copyright 2011 ACM 978-1-4503-0612-6/11/03 ...$10.00.

1. INTRODUCTION

Partition-based indexing schemes are descended from binary search trees, which solve set membership queries in O(log n). This complexity result depends on the property that, when a search descends the tree, the decision to search the right or left child of a node is mutually exclusive. Starting at least with Guttman's R-tree, partition-based multidimensional indexing methods include decision procedures that do not guarantee an exclusive decision[1]. A search may descend through multiple children of a node when covering predicates overlap or, for range searches, when the distance separating the partitions is less than twice the search radius. Subsequently, many indexing schemes have been developed whose performance is evaluated strictly empirically and depends on the workload[2].

We investigate the use of exclusion as a method to introduce parallelism and to improve the performance of partition-based indexing schemes for range queries in metric space indexing. The basic idea starts with building a conventional partition-based index tree on a data set. The data in the middle partition are removed such that the covering predicates are reduced in size, which eliminates overlap and even introduces gaps between them. The process is repeated recursively on the removed data until further exclusion no longer provides an advantage. Search of the resulting forest of index trees can be done in parallel.

Herein we report on an empirical assessment of the use of exclusion in conjunction with distance-based indexing and range queries. Distance-based indexing assumes only that there is a set of data S and a metric distance function d. A range query with query point x and radius r, q(x, S) <= r, returns all data points y ∈ S such that d(x, y) <= r. Distance-based indexing does not require domain-specific information. The available information about the data is derived solely from an oracle that computes the distance between pairs of objects. The only requirement is that the distance function be a metric. The data set can be clustered and indexed off-line. During an on-line search, the triangle inequality enables the elimination of clusters of data as possible solutions.

This so-called "black-box" model is advantageous for any application where the data cannot be effectively mapped to feature vectors. In biology, many data formats, such as biological sequences, molecular structures, and function annotations, lack precise mathematical models. They often have heuristic distance functions for comparison among data objects. As the volume of these data grows due to the availability of new high-throughput sequencing technologies, metric space indexing can be applied to speed up similarity search among these datasets. Here, we focus on biological sequence workloads, since identifying similar biological sequences is a basic component in many bioinformatics studies, including genome comparison, phylogenetic footprints, and genome-wide motif-driven discoveries[3,4,5].

Our approach builds on work developed by Yianilos[6].
In that work, conditioned on knowing a sufficiently small maximum range, τ, for any query, Yianilos defined an exclusion-based parallel algorithm, the "Excluded Middle Vantage Point Forest". The algorithm yields sub-linear asymptotic performance guarantees. Although the empirical performance results presented in that manuscript are promising, they suggest that the parameter ranges that manifest the asymptotic guarantees are not viable on real workloads and lead to unbalanced index forests. Where Yianilos parameterized the exclusion based on distance, in our version, the emVP index forest, the exclusion is parameterized on a proportion of excluded points. The emVP index does not guarantee the separation distance between partitions, and thus does not achieve Yianilos' asymptotic algorithmic results. However, the algorithm does achieve favorable practical results. We show that our approach can outperform the multiple vantage point tree and the spatial approximation tree on real biological workloads. We also demonstrate linear scalability when searching in parallel and compare the performance with a naïve distributed indexing approach. Although the results presented here are focused on biological data, the algorithm can be used with any data set with a defined metric distance function.
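The range query defined in the introduction can be illustrated with a minimal, index-free sketch. This is only an illustration under stated assumptions: the function names `hamming` and `range_query` and the toy fragments are hypothetical, and plain Hamming distance stands in for the distance functions used in the evaluation. A linear scan like this is the baseline that a metric index tries to beat by reducing the number of distance calculations.

```python
from typing import Callable, List, Sequence


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings (a metric)."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(ca != cb for ca, cb in zip(a, b))


def range_query(x: str, data: Sequence[str], r: float,
                d: Callable[[str, str], float]) -> List[str]:
    """Return every y in data with d(x, y) <= r, using one distance call per point."""
    return [y for y in data if d(x, y) <= r]


# Hypothetical toy data, not taken from the benchmark.
fragments = ["ACGT", "ACGA", "TCGA", "TTTT"]
print(range_query("ACGT", fragments, 1, hamming))
```

The only thing the scan needs from the data is the distance oracle `d`, which is the "black-box" property described above.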

2. ALGORITHM AND IMPLEMENTATION

A metric space is defined as a space, X, where the distance between any two points x, y ∈ X, d(x, y), satisfies the following: 1) Positivity: d(x, y) ≥ 0, with d(x, y) = 0 if and only if x = y; 2) Symmetry: d(x, y) = d(y, x); 3) Triangle inequality: d(x, y) + d(y, z) ≥ d(x, z). Utilizing these properties, a pivot-based index exploits the intrinsic clustering of a dataset by selecting one or more points from X and calculating the distance from these points to all other points in the space. These points are called pivots, P = {p1, p2, ..., pk}. The points within X are then partitioned into subsets, or nodes within a tree in an implementation, based on their distances to the pivots. The process is recursively applied until a complete tree is formed.
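How a pivot uses the triangle inequality can be sketched as follows. This is a simplified, flat (non-recursive) illustration rather than the tree structure described in this paper; the names `build_pivot_table` and `pivot_range_query` and the toy strings are assumptions made for the example. Distances from the pivot p to every data point are computed once, off-line. At query time, a point y can be discarded without computing d(q, y) whenever |d(q, p) - d(p, y)| > r, because the triangle inequality gives d(q, y) ≥ |d(q, p) - d(p, y)|.

```python
def hamming(a, b):
    """Mismatch count between equal-length strings; used here as the metric."""
    return sum(x != y for x, y in zip(a, b))


def build_pivot_table(data, pivot, d):
    """Off-line step: store d(pivot, y) next to every data point y."""
    return [(d(pivot, y), y) for y in data]


def pivot_range_query(q, r, pivot, table, d):
    """Range query that skips points ruled out by the triangle inequality.

    Returns (results, number of distance calculations made at query time).
    """
    dq = d(q, pivot)
    evaluated = 1  # the call d(q, pivot) above
    results = []
    for dpy, y in table:
        if abs(dq - dpy) > r:
            continue  # lower bound on d(q, y) already exceeds r
        evaluated += 1
        if d(q, y) <= r:
            results.append(y)
    return results, evaluated


data = ["AAAA", "AAAT", "AATT", "ATTT", "TTTT"]
table = build_pivot_table(data, "AAAA", hamming)
print(pivot_range_query("TTTT", 1, "AAAA", table, hamming))
```

In this toy case three of the five candidates are eliminated from the stored pivot distances alone.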

2.1 Vantage Point Forest with Exclusion

Algorithm 1 Building index forest
build(0, |D|-1, D)
  while |D| > BSIZE
    pivot ← selectPivot(D)
    mid ← sort(D, pivot)
    leftIndex ← indexof[mid-τ]
    rightIndex ← indexof[mid+τ]
    if (rightIndex - leftIndex) > m|D|
      leftIndex ← mid - m|D|/2
      rightIndex ← mid + m|D|/2
    end if
    E.add(D, leftIndex, rightIndex)
    this.left ← build(0, leftIndex, D)
    this.right ← build(rightIndex, |D|-1, D)
    left_min ← distance(D[0], pivot)
    left_max ← distance(D[leftIndex], pivot)
    right_min ← distance(D[rightIndex], pivot)
    right_max ← distance(D[|D|-1], pivot)
  end while
  this.data ← D

Figure 2 Pseudo code for building index with exclusion.

Figure 1 Illustration of three partitions with one pivot. The middle partition is the shaded area.

Figure 1 shows an example of pivot-based indexing on one pivot, p, with three partitions defined by distance values r1 and r2. Given a query point q and a radius r, if d(q, p)+r<r1 or d(q, p)-r>r2, then only one partition needs to be searched. Otherwise, both partitions need to be searched. But if the middle partition, the shaded area in Figure 1, can be excluded from the current tree, then all range searches with r<(r2-r1)/2 search at most one partition. This exclusion idea is used in constructing both the D-index and the emVP forest in [6]. Both algorithms require a fixed separation distance, or exclusion width of the middle partition, of 2τ, where τ is the maximum search radius.

In addition to τ, we also prioritize sizing the middle partition to be close to a proportion, m, of the data (Figure 2). There are several reasons. First, if the middle partition is defined by a fixed distance, the amount of data in the two separated partitions may not be equal, which causes an unbalanced tree structure. Secondly, in practice, the proportion of excluded data varies when τ is at a fixed value. If a very high percentage of data is excluded at each node, the resulting index forest will contain a large number of index trees. For each query, every index tree must be accessed at least once, so a high number of index trees increases search overhead significantly. Lastly, when the data set has a diameter less than 2τ, it becomes impossible to partition the data based on the fixed distance, which causes large variations in the number of data points indexed in each tree. Since there is no algorithmic benefit to creating exclusions with width greater than 2τ, the percentage of points inside the exclusion set can be smaller than m if the exclusion width is 2τ.
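The pruning argument around Figure 1 can be checked with a small sketch. The function name `partitions_to_search` and the boundary convention (inner partition holds points with distance <= r1 from the pivot, outer partition holds points with distance >= r2, and the middle has been excluded) are assumptions made for this illustration, not details taken from the paper.

```python
def partitions_to_search(dqp, r, r1, r2):
    """Partitions a query ball can intersect once the middle partition is excluded.

    dqp is d(q, p), the distance from the query to the pivot; r is the radius.
    """
    hits = []
    if dqp - r <= r1:   # ball reaches into the inner partition
        hits.append("inner")
    if dqp + r >= r2:   # ball reaches into the outer partition
        hits.append("outer")
    return hits


r1, r2 = 2.0, 6.0
# r < (r2 - r1) / 2 = 2.0, so the ball can never touch both partitions.
print(partitions_to_search(3.0, 1.5, r1, r2))
print(partitions_to_search(4.0, 1.5, r1, r2))
# A larger radius can straddle the excluded gap and hit both.
print(partitions_to_search(4.0, 2.5, r1, r2))
```

If both tests succeed, then r2 - r1 <= 2r, which is why a radius below (r2 - r1)/2 rules that case out. A query that falls entirely inside the excluded gap touches neither partition of this tree; its answers, if any, lie in later trees of the forest.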

Figure 2 shows pseudo code for building an index tree with exclusion. D denotes the data set to be indexed. When the number of data points in D exceeds some node size threshold, BSIZE, the data set is split into three partitions in the following manner. A pivot point is chosen from D and the data points are sorted based on their distance to the pivot point. The middle partition is then selected based on the values m and τ, and the points inside the partition are moved into the exclusion set E. The other two partitions are used to form the left and right children of the current node. The minimum and maximum distances from the pivot point to the two children are stored in left_min, left_max, right_min and right_max correspondingly. The process keeps the left and right children well separated: one child is clustered near the vantage point and the other away from the vantage point. To ensure the two children are separated, at least one of the following will hold: 1) right_min - left_max > 2τ; 2) rightIndex - leftIndex <= m|D|. The algorithm proceeds recursively until the data to be indexed can be stored within one node. Once the tree is built, the excluded portion, E, is used to construct the next index tree. The tree construction process ends when the excluded portion can be stored in one node.

Algorithm 2 Range query
Query(q, r)
  Set results empty
  if isLeaf then
    for each data d in this node
      if distance(q, d) <= r then
        results.add(d)
      end if
    end for
  else
    dist ← distance(q, pivot)
    if (dist <= left_max + r and dist >= left_min - r)
      results.add(left.query(q, r))
    end if
    if (dist <= right_max + r and dist >= right_min - r)
      results.add(right.query(q, r))
    end if
  end if
  return results

Figure 3 Pseudo code for the range query.
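The build and query procedures of Figures 2 and 3 can be sketched in runnable form as below. This is a minimal sketch under stated assumptions, not the authors' Java implementation: the pivot is chosen at random (the paper does not fix a pivot selection method here), the exclusion window is centered on the median when the cap m|D| applies, and all names (`Node`, `build_tree`, `build_forest`, `query_tree`, `query_forest`) are hypothetical. The query keeps a child whenever the query ball can intersect that child's stored distance interval, so results stay exact even when r exceeds τ.

```python
import random
from bisect import bisect_left, bisect_right


def hamming(a, b):
    """Mismatch count between equal-length strings; used here as the metric."""
    return sum(x != y for x, y in zip(a, b))


class Node:
    def __init__(self):
        self.pivot = None
        self.children = []  # (child, min distance to pivot, max distance to pivot)
        self.data = None    # set only on leaf nodes


def build_tree(points, d, tau, m, bsize, excluded, rng):
    """Build one tree; points in the excluded middle are appended to `excluded`."""
    node = Node()
    if len(points) <= bsize:
        node.data = list(points)
        return node
    node.pivot = rng.choice(points)
    ranked = sorted(((d(node.pivot, y), y) for y in points), key=lambda t: t[0])
    dists = [t[0] for t in ranked]
    n = len(ranked)
    median = dists[n // 2]
    lo = bisect_left(dists, median - tau)    # first index inside the middle band
    hi = bisect_right(dists, median + tau)   # first index past the middle band
    cap = max(1, int(m * n))                 # at most a proportion m is excluded
    if hi - lo > cap:
        lo = max(0, n // 2 - cap // 2)
        hi = min(n, lo + cap)
    excluded.extend(y for _, y in ranked[lo:hi])
    for part in (ranked[:lo], ranked[hi:]):
        if part:
            child = build_tree([y for _, y in part], d, tau, m, bsize, excluded, rng)
            node.children.append((child, part[0][0], part[-1][0]))
    return node


def build_forest(points, d, tau, m, bsize, seed=0):
    """Repeat the construction on the excluded points until none remain."""
    rng = random.Random(seed)
    forest, remaining = [], list(points)
    while remaining:
        excluded = []
        forest.append(build_tree(remaining, d, tau, m, bsize, excluded, rng))
        remaining = excluded
    return forest


def query_tree(node, q, r, d):
    if node.data is not None:
        return [y for y in node.data if d(q, y) <= r]
    dist = d(q, node.pivot)
    results = []
    for child, lo, hi in node.children:
        if lo - r <= dist <= hi + r:  # query ball may intersect this child
            results.extend(query_tree(child, q, r, d))
    return results


def query_forest(forest, q, r, d):
    """Trees are independent, so this loop could equally run in parallel."""
    return [y for tree in forest for y in query_tree(tree, q, r, d)]
```

A forest built this way stores every data point in exactly one leaf of exactly one tree, so the combined per-tree results need no de-duplication.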

The query algorithm of each index follows the standard query process of vantage point trees. Each index tree can be searched separately as shown in Figure 3. The query results from each index tree are then combined to form the final result set. Since there is no dependency between different indexes during the search, all indexes can be searched sequentially or in parallel. When assuming m is a constant, the theoretical analysis in [6] holds as follows: for a data set of N points, the algorithm would yield $O(N^{1-\rho})$ trees, each of maximum depth $O(\log N)$, resulting in an $O(N^{1-\rho} \log N)$ query time on a conventional computer. Here, ρ is a function of m such that $\rho = \frac{1}{1 - \log_2(1 - m)}$, where m is the proportion of data located in the excluded middle. Although our algorithm does not guarantee a fixed m value, we keep m bounded and expect to have similar complexity. The space requirement of the emVP algorithm is linear in the size of the data set.
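To give a feel for the bound, the exponent ρ = 1 / (1 - log2(1 - m)) can be evaluated for a few values of m. The helper names `rho` and `tree_count_bound` are hypothetical, and the second function simply evaluates N^(1-ρ) with the constant factor hidden by the O-notation set to 1, so it indicates a growth rate rather than a predicted tree count.

```python
import math


def rho(m: float) -> float:
    """Exponent from the analysis: rho = 1 / (1 - log2(1 - m)), for 0 <= m < 1."""
    if not 0.0 <= m < 1.0:
        raise ValueError("m must satisfy 0 <= m < 1")
    return 1.0 / (1.0 - math.log2(1.0 - m))


def tree_count_bound(n: int, m: float) -> float:
    """Growth term N^(1 - rho); constants from the O-notation are ignored."""
    return n ** (1.0 - rho(m))


for m in (0.0, 0.1, 0.25, 0.5):
    print(m, round(rho(m), 3), round(tree_count_bound(1_000_000, m), 1))
```

With m = 0 nothing is excluded, ρ = 1, and the term collapses to a single tree; as m grows, ρ shrinks and the number of trees each query must visit grows, which matches the motivation given above for keeping m bounded.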

2.2 Index Distribution for Parallel Search

A key aspect of efficient parallel processing is to have the workload evenly distributed among all available computing processes. For each index tree, the query cost is expected to be linear in the depth of the tree plus the cost of searching the leaf nodes. Since the maximum size of a leaf node is a predefined constant and only one leaf node is expected to be searched per index, we estimate the workload for each index as linear in the depth of the tree. The depth of each index varies in the emVP forest. Therefore, the index distribution algorithm, shown in Figure 4, assigns each index to the computing process with the smallest sum of depths of the indexes already assigned to it. Once all indexes are distributed, each computing process can perform its searches independently. As there are possibly $N^{1-\rho}$ trees, the total search time is $O(N^{1-\rho} \log N)$[6]. With P parallel computing processes, the search time is $O(N^{1-\rho} (\log N)/P)$.

Index Tree Distribution
T: list of index trees sorted by decreasing depth
P: list of lists of trees to be processed by each process
Q: priority queue keyed on the sum of all tree depths in each process
for i from 0 to |P|
  Q.add(0, P[i])
end for
for i from 0 to |T|
  p ← Q.pop()
  p.add(T[i])
  Q.add(p.depth + T[i].depth, p)
end for

Figure 4 Algorithm to distribute vantage point forests to multiple processors.
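The distribution step of Figure 4 amounts to a greedy assignment driven by a priority queue, which can be sketched with Python's `heapq` module. The function name `distribute` and the use of plain integer depths in place of tree objects are assumptions made to keep the example self-contained.

```python
import heapq


def distribute(depths, num_procs):
    """Assign trees (given by their depths) to processes, deepest tree first.

    Each tree goes to the process whose assigned depths currently sum to the
    least. Returns one list of tree indices per process.
    """
    order = sorted(range(len(depths)), key=lambda i: depths[i], reverse=True)
    heap = [(0, p) for p in range(num_procs)]  # (sum of depths, process id)
    heapq.heapify(heap)
    assignment = [[] for _ in range(num_procs)]
    for i in order:
        load, p = heapq.heappop(heap)          # least-loaded process
        assignment[p].append(i)
        heapq.heappush(heap, (load + depths[i], p))
    return assignment


depths = [5, 4, 3, 3, 2, 1]
for proc, trees in enumerate(distribute(depths, 2)):
    print(proc, trees, sum(depths[i] for i in trees))
```

Handing out the deepest trees first leaves the shallow ones to even out the totals at the end, which is why the input list is sorted by decreasing depth.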

3. RELATED WORK

This work is similar to other methods that seek to improve the separation of the covering predicates, such as the extensions to the R-tree, the R+ trees and R* trees. Intuitively, we are excluding the data points that, in the R* tree construction of Beckmann et al., are removed and reinserted[7,1]. Similarly, the data replicated in Sellis et al.'s R+ tree algorithm roughly corresponds to the data excluded in our algorithm[8]. Both replication and exclusion enable a greater likelihood that the search procedure will investigate a single child of an index node. In R+ trees, since the replication increases the size of the problem, an asymptotic complexity result is not obtained directly[9]. Similarly, if the exclusion construction is applied to an R-tree, an increase in performance is only assured by parallel implementation.

With respect to the indexing algorithm, the emVP is an extension to the Vantage Point Tree and as such is similar to other partitioning algorithms which bisect a space, such as the MVP trees, kd-trees, and bisector trees[10,11,12]. The VP Tree algorithm operates as follows: given a metric space and a set, S, of points within that space, a point, p, is selected by some criterion. This point is labeled as a vantage point, and a radius from that point is drawn such that S is equally partitioned into two sets, Si and So, where Si includes all the points that are inside the radius and So all the points outside. This idea of partitioning the space into equal elements can also be seen in the classic kd-tree algorithm[12], which uses rectilinear cuts, as opposed to spherical cuts, through the space.

Our approach is built on the exclusion algorithm proposed in [6]. The algorithm partitions the original data into subsets that are separated from each other by a distance of at least 2τ. As the tree is formed, data points whose distances to the vantage point are within τ of the median are excluded from consideration, forming the "excluded middle". When τ = 0, the emVP index degenerates to a VP Tree. Once the initial vantage point tree is completed, the construction is recursively applied to the points from the "excluded middle," forming a forest of vantage point trees. The algorithm guarantees that, at each level of each tree, a search predicate chooses only one partition for any nearest neighbor search in which the distance between the query and its nearest neighbor is less than τ. Consequently, a search of any vantage point tree in the forest takes O(log n) distance calculations, where n is the number of data points. The reader is referred to [6] for details. The core of our approach is to depart from the strict requirements of τ to attain asymptotic results empirically.
The concept of exclusion is also presented in the D-index and eD-index algorithms. The eD-index and its predecessor, the D-index, use the idea of exclusion to ensure a separation distance of at least ρ between sets of points[13][14]. The separation is ensured by using a ρ-split function that maps a set of points, S, into three sets based on their distance from a point, x, and the exclusion width of ρ. Two of the resulting sets are called separable buckets, since all points with a distance less than the median distance minus ρ go into one separable bucket and all points with a distance greater than the median distance plus ρ go into the other separable bucket. The third set, known as the exclusion bucket, contains all other points. Several split functions can be combined to create more separable buckets and a large exclusion bucket. The split function can then be applied to the large exclusion bucket to create an additional level. The algorithm requires a fixed separation distance ρ and does not utilize parallelism, as the excluded points become extra nodes within the tree. The resulting index structure is a tree with various numbers of nodes at each level and various depths for each branch.

This research explores constructing the emVP index based on a proportion of points to be excluded. In Yianilos' implementation, the excluded portion of the middle partition, m, is assumed to be a function of τ and is uniform throughout the data structure. However, this simplifying assumption is only valid for synthetic data sets. In practice, using a fixed τ, or ρ as in the D-index, creates imbalanced tree structures. In our approach, we keep the size of the excluded partition under a given threshold to produce balanced trees and limit the total number of degenerate trees at the end of the recursion.

Creating a balanced index structure makes our approach favor parallelism. Given the restructuring of Moore's law as a statement concerning the regular doubling of the number of processor cores on a chip, interest in parallel methods for similarity search is increasing[15,16,17,18,19]. These works include the List of Clusters (LoC) algorithm[20], which partitions a set of points S by choosing a "center" point, c, and a radius, r. All points within S that are within a distance of r from c are partitioned into Si and all others into So. The algorithm is then recursively repeated on So until no points remain. It was shown that the LoC build process parallelizes well by distributing the points within S to multiple processors while maintaining the same list of clusters. The parallel list of clusters generated by the LoC does not make use of exclusion to ensure separable points as our approach does. Other attempts have also been made to parallelize current metric space algorithms, such as the M-tree[18,15] and the kd-tree[12]. In our approach, using exclusion to create separable sets can be conceptualized as an intelligent data-partitioning scheme. The partitioning process in the emVP algorithm increases performance beyond the capability of partitioning methods that simply distribute 1/p of the problem to each of p processors.

The similarity between two biological sequences is defined by the most similar substrings from the two sequences and requires a computational cost that is quadratic in the lengths of the sequences[21]. To effectively support sequence searches at a large scale, a candidate set comprising the similar q-grams (substrings of length q) between two sequences is created, on which a more detailed and costly comparison can be run later. Metric space indexing techniques have been applied to speed up the similarity search over q-grams[22,23,24]. To ensure sensitivity of the final results, all q-grams within a relaxed threshold need to be retrieved instead of just the top k results.
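The q-gram decomposition mentioned above can be illustrated with a short sketch. The function name `qgrams` and the sample string are hypothetical; the sketch only shows how overlapping substrings of length q are produced and does not reproduce the benchmark preprocessing.

```python
def qgrams(seq: str, q: int):
    """All overlapping substrings of length q, in order of position."""
    if q <= 0:
        raise ValueError("q must be positive")
    return [seq[i:i + q] for i in range(len(seq) - q + 1)]


print(qgrams("MKVLAT", 3))
```

A sequence of length n yields n - q + 1 q-grams, so the number of indexed objects grows roughly linearly with the total sequence length.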

4. RESULT

In this section, we show the effectiveness of our approach in the following three aspects:
• Performance of sequential query processing
• Performance of parallel query processing
• Effectiveness of partitioning data by exclusion

4.1 Evaluation and Testing Data

Our evaluation uses publicly available datasets that commonly appear in the literature, including protein sequence data and gene sequence data[23,24]. All benchmarks were downloaded from their respective web sites in May 2010.

DNA Sequence Data
This dataset contains 1,000,009 DNA sequence fragments, each containing 11 nucleotides. The distance between two sequence fragments is measured by an edit distance.

Protein Sequence Data
The protein sequence data set contains overlapping substrings of length 6 from an established benchmark for remote homologous protein sequence detection furnished by the National Center for Biotechnology Information[25]. The benchmark consists of 2,892,155 substrings from 6,433 sequences. The distance between two q-grams is measured by a weighted Hamming distance[23]. The indexes were built with 90% of the data, and the remaining 10% was used as queries. To investigate the effect of exclusion as a way of partitioning data for parallel searching, we compared the emVP index forest against a set of conventional vantage point trees built on data that had been equally distributed among p processors at random.

The SISAP metric space library was used for running the Multiple Vantage Point Tree (MVP) and Spatial Approximation Tree (SAT)[26,27]. The MVP index was run with a bucket size of three and six pivots. The defaults were used for the SAT algorithm, as no parameters are needed. The protein sequence data was too large for the current algorithm implementations from SISAP and therefore was only used with the emVP implementation. The emVP algorithm is implemented in Java 1.6. A separate program is implemented in C with the Message Passing Interface to carry out parallel query processing with the Java code. The implementation supports an arbitrary number of processing cores. When only one core is specified, the execution is equivalent to a sequential search process.

4.2 Sequential Query Processing with emVP

We first compare the sequential performance of the emVP forest with a standard VP tree, and thereby assess the benefit of the excluded middle method. Each VP tree node has a single vantage point and three children, i.e., the data is split into three partitions. We evaluate a set of emVP forests, each forest built for a different value of τ. Other parameters used in index construction, such as the size of leaf nodes, are kept the same for both algorithms.

The cost of range queries of various radii is measured as the total number of distance calculations. For each range query, q, Ci(q) denotes the cost of query q against index i. We further define the base query cost of each query, Cb(q), as the query cost against the single vantage point tree. The speed-up Si of using index i is then defined as $S_i = \frac{C_b(q)}{C_i(q)}$. Multiple queries are used and averaged prior to plotting.

Figure 5 Speedups of the range search with protein sequences.

Figure 6 Speedups of the range search with DNA sequences.

Our results show much better performance for emVP with both biological sequence data sets. Figure 5 shows the results of range search with protein sequence data. The results indicate speed-ups of up to 2.5 times for range queries on all tested radii, ranging from 1 to 5, for excluded middle forests built with τ of 1, 2 and 3. A similar result is observed for the nucleotide sequence data in Figure 6, in which speed-ups reach up to 3 times the performance of the conventional vantage point tree.
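The speed-up measure used in this subsection, the base cost Cb(q) divided by the cost Ci(q) against index i, can be computed as in the sketch below. The paper states only that multiple queries are averaged before plotting; taking the mean of the per-query ratios is an assumption of this sketch, and the function name `mean_speedup` and the sample costs are hypothetical.

```python
def mean_speedup(base_costs, index_costs):
    """Mean over queries of Cb(q) / Ci(q), with costs in distance calculations."""
    if len(base_costs) != len(index_costs) or not base_costs:
        raise ValueError("need one base cost and one index cost per query")
    ratios = [cb / ci for cb, ci in zip(base_costs, index_costs)]
    return sum(ratios) / len(ratios)


# Hypothetical per-query costs, not measurements from the paper.
print(mean_speedup([100, 200], [50, 100]))
```

A value above 1 means index i needs fewer distance calculations than the single vantage point tree on those queries.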

Figure 7 Performance comparisons of emVP vs. other indices (SAT, MVP) on the DNA dataset. The vertical axis shows the query cost (thousands) in terms of the number of distance calculations; the horizontal axis shows the percentage of results returned.

The comparison of excluded middle forests against the comparison indices is shown in Figure 7. On the DNA dataset, where data is highly clustered, the emVP algorithm performs well across all radii.

4.3 Parallel search performance

An advantage of using the excluded middle forest index is that the parallelization of the search process is straightforward and scales well with the number of processors. In this subsection, we demonstrate this property using the protein sequence data set.

Workload balancing is one of the most important factors in a parallel implementation. When applied to high dimensional data, the number of trees created in an excluded middle forest can easily be much higher than the number of processors commonly available in a medium-size computing cluster. For the protein sequence data set, the optimal excluded middle forest is, empirically, built where τ=3. This particular emVP forest contains 1,457 trees with a total of 21,602 nodes. Although the number of nodes per index ranges from 3 to 37, workload balancing can still be easily achieved at the index level. Figure 8 shows the maximum, minimum and average numbers of nodes searched per core when using various numbers of processors, ranging from 4 to 128. Figure 9 shows the scalability of distributed query costs up to 128 processors. The query cost for a parallel search is measured as the maximum number of distance calculations in one processor.

Figure 8 Scalability of workload distribution. The vertical axis shows the number of index nodes per processing core (log scale; max, average, min); the horizontal axis shows the number of processing cores.

Figure 9 Scalability of parallel excluded middle search. Plot shows maximum query costs in log scale, measured as number of distance calculations (thousands), per processor for a range query of radius 3 when using various numbers of processors.

4.4 Effect of Partitioning Data by Exclusion

To study the effect of distributing data by exclusion, we compared the result to a vantage point forest created from randomly distributed data. The entire protein data set is first evenly and randomly divided into 512 groups, which are used to build 512 indexes. The results shown in Figure 10 are measured using 32 processors.

Figure 10 Query cost comparison between the emVP forest and a vantage point forest built by random data distribution. The vertical axis shows the query cost; the horizontal axis shows the search radius.

Despite the fact that the random vantage point forest has a uniform data distribution per index, the excluded middle forest method shows query performance that is 2 times better. We think the data partitioning method affects the distance distribution within each subset. We expect subsets from the random data distribution to have distance distributions similar to that of the original data, and the subsets from the exclusion to have larger variance, which creates a better index tree.

5. DISCUSSION

We presented a modified excluded middle vantage point forest algorithm to create a more balanced index structure. Empirical evaluations with biological sequence data demonstrated favorable results for both sequential and parallel search.

The exclusion approach increases the separation distance among sibling nodes to improve the prune rate at each level. If a single index tree can provide enough separation distance (for example, when the data is already well separated or the search radius is very small), the exclusion approach is not useful when searched sequentially. Our results indicate exclusion can be a viable strategy for data partitioning. When data is partitioned into subsets randomly, each subset generally has the same distance distribution as the original data. However, when partitioning data based on exclusion, we suspect each subset may have a different distance distribution from the original data. This leaves an opportunity to create some subsets that are easier to index. We are interested in extending the use of exclusion to other indexing algorithms, such as the R-tree, the M-tree and the List of Clusters. Theoretically, we would like to extend Yianilos' algorithmic analysis of the prune rate for nearest neighbor search to range search, for which the prune rate is also a function of search radius and query prediction. Empirically, we are seeking to optimize the modified emVP forest by investigating pivot selection methods and introducing redundancy.

6. ACKNOWLEDGEMENT

This research was supported in part by the National Science Foundation, DBI-0640923, and the National Institutes of Health, GM085337.

7. REFERENCES

[1] A. Guttman, "R-Trees - A Dynamic Index Structure for Spatial Searching," in the 1984 ACM SIGMOD international conference on Management of data, New York, NY, USA, 1984, pp. 47-57.
[2] H. Samet, Foundations of Multidimensional and Metric Data Structures. USA: Morgan Kaufmann Series in Computer Graphics, 2006.
[3] G. Bejerano et al., "Ultraconserved elements in the human genome," Science, vol. 304, no. 5675, pp. 132-5, 2004.
[4] E. C. Rouchka, W. Gish, and D. J. States, "Comparison of whole genome assemblies of the human genome," Nucleic Acids Res., vol. 30, no. 22, pp. 5004-5014, 2002.
[5] Z. Zhang and M. Gerstein, "Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements," Journal of Biology, vol. 2, no. 11, 2003.
[6] P. Yianilos, "Excluded middle vantage point forests for nearest neighbor search," in DIMACS Implementation Challenge, ALENEX'99, Baltimore, MD, 1999.
[7] N. Beckmann, H. Kriegel, R. Schneider, and B. Seeger, "The R*-Tree: An Efficient and Robust Access Method for Points and Rectangles," in ACM SIGMOD, 1990, pp. 322-331.
[8] T. K. Sellis, N. Roussopoulos, and C. Faloutsos, "The R+-Tree: A Dynamic Index for Multi-Dimensional Objects," in VLDB, 1987, pp. 507-518.
[9] L. Arge, V. Samoladas, and J. S. Vitter, "On Two-Dimensional Indexability and Optimal Range Search Indexing," in PODS, 1999, pp. 346-357.
[10] T. Bozkaya and M. Ozsoyoglu, "Distance-based indexing for high-dimensional metric spaces," in the 1997 ACM SIGMOD, Tucson, Arizona, United States, 1997.
[11] H. Noltemeier, K. Verbarg, and C. Zirkelbach, "Monotonous Bisector-Asterisk Trees - a Tool for Efficient Partitioning of Complex Scenes of Geometric Objects," Lecture Notes in Computer Science, vol. 594, pp. 186-203, 1992.
[12] J. L. Bentley, "Multidimensional binary search trees used for associative searching," Commun. ACM, vol. 18, no. 9, pp. 509-517, 1975.
[13] V. Dohnal, C. Gennaro, and P. Savino, "D-Index: Distance Searching Index for Metric Data Sets," Multimedia Tools Appl., vol. 21, no. 1, pp. 9-33, 2003.
[14] V. Dohnal, C. Gennaro, and P. Zezula, "Similarity Join in Metric Spaces Using eD-Index," in Database and Expert Systems Applications, 2003, pp. 484-493.
[15] A. Alpkocak, T. Danisman, and T. Ulker, "A parallel similarity search in high dimensional metric space using M-tree," Advanced Environments, Tools, and Applications for Cluster Computing, vol. 2326, pp. 166-171, 2002.
[16] V. Dohnal, J. Sedmidubsky, P. Zezula, and D. Novak, "Similarity Searching: Towards Bulk-Loading Peer-to-Peer Networks," in the First International Workshop on Similarity Search and Applications (SISAP 2008), Cancun, Mexico, 2008, pp. 87-94.
[17] V. Gil-Costa, M. Marin, and N. Reyes, "An Empirical Evaluation of a Distributed Clustering-Based Index for Metric Space Databases," in SISAP 2008, Cancun, Mexico, 2008, pp. 95-102.
[18] P. Zezula, P. Savino, and F. Rabitti, "Processing M-trees with Parallel Resources," in the Workshop on Research Issues in Database Engineering, 1998.
[19] M. Shevtsov, A. Soupikov, and A. Kapustin, "Highly parallel fast kd-tree construction for interactive ray tracing of dynamic scenes," in Eurographics 2007, 2007, pp. 395-404.
[20] E. Chávez and G. Navarro, "A compact space decomposition for effective metric indexing," Pattern Recogn. Lett., no. 26, pp. 1363-1376, 2005.
[21] D. Gusfield, Algorithms on Strings, Trees and Sequences. Press Syndicate of the University of Cambridge, 1997.
[22] E. W. Myers, "A sublinear algorithm for approximate keyword searching," Algorithmica, vol. 12, no. 4, pp. 345-374, 1994.
[23] W. Xu, R. Mao, S. Wang et al., "On integrating peptide sequence analysis and relational distance-based indexing," in IEEE 6th Symposium on Bioinformatics and Bioengineering (BIBE06), Arlington, VA, USA, 2006.
[24] A. Sacan and I. H. Toroslu, "Approximate Similarity Search in Genomic Sequence Databases Using Landmark-Guided Embedding," in First International Workshop on Similarity Search and Applications (SISAP 2008), 2008.
[25] A. A. Schaffer, L. Aravind, T. L. Madden et al., "Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements," Nucl. Acids Res., vol. 29, no. 14, pp. 2994-3005, July 2001.
[26] G. Navarro, "Searching in metric spaces by spatial approximation," VLDB, vol. 11, no. 1, pp. 28-46, 2002.
[27] (2010, May) SISAP, Metric spaces library. [Online]. http://sisap.org/Metric_Space_Library.html
