University of Alberta Library Release Form

Name of Author: Jianjun Zhou
Title of Thesis: Efficiently Searching and Mining Biological Sequence and Structure Data
Degree: Doctor of Philosophy
Year this Degree Granted: 2009

Permission is hereby granted to the University of Alberta Library to reproduce single copies of this thesis and to lend or sell such copies for private, scholarly or scientific research purposes only. The author reserves all other publication and other rights in association with the copyright in the thesis, and except as herein before provided, neither the thesis nor any substantial portion thereof may be printed or otherwise reproduced in any material form whatever without the author's prior written permission.

Jianjun Zhou
Date:

University of Alberta

Efficiently Searching and Mining Biological Sequence and Structure Data

by

Jianjun Zhou

A thesis submitted to the Faculty of Graduate Studies and Research in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Department of Computing Science
Edmonton, Alberta
Spring 2009

University of Alberta
Faculty of Graduate Studies and Research

The undersigned certify that they have read, and recommend to the Faculty of Graduate Studies and Research for acceptance, a thesis entitled Efficiently Searching and Mining Biological Sequence and Structure Data submitted by Jianjun Zhou in partial fulfillment of the requirements for the degree of Doctor of Philosophy.

Jörg Sander, Supervisor
Guohui Lin, Co-Supervisor
Mario Nascimento
Davood Rafiei
David Wishart
Caetano Traina Junior, External Examiner

Date:

To my Parents. Thanks for your decade-long support.

Abstract

The rapid growth of biological sequence and structure data imposes significant challenges on searching and mining them.
While handling growing data-sets has long been a topic of interest in the database and data mining communities, the unique characteristics of biological data make it difficult or even impossible to directly apply traditional database searching and mining methods. In many biological databases, the data objects and the dissimilarity measurement (i.e., distance function) between them form a so-called metric space, in which the notion of dimensionality from traditional vector spaces is no longer valid, and the dissimilarity measurement can be computationally much more expensive than traditional measurements such as the Euclidean distance in a low-dimensional space. In this thesis, we study the problems of performing efficient clustering and similarity searches on biological sequence and structure data using an expensive distance function. Efficient solutions to these problems rely on the ability of the searching and mining algorithms to avoid expensive distance computations. For this central challenge, we propose several novel techniques, including directional extent in non-vector Data Bubbles, pairwise ranking, virtual pivots, and partial pivots. In-depth theoretical studies and extensive experimental results on several real-life data-sets confirm the superiority of our methods over previous approaches.

Acknowledgements

I am grateful to my supervisors, Dr. Guohui Lin and Dr. Jörg Sander, who have generously supported me in my PhD study. From them, I have learned a lot. I got to know Dr. Sander before I entered the PhD program. Throughout all these years, he has been patiently guiding me in my study, showing me how to develop an idea and, more importantly, how to face success and failure. While being critical, he always respects my opinion, treating me not just as his student, but as his friend as well. Dr. Lin is one of the most hard-working teachers I have ever had.
He gives his students a good example of how diligence will lead to a successful career, and how perseverance will eventually remove the blockades on the challenging road of research. He motivates his students to learn by putting not just a high requirement on them, but an even higher requirement on himself.

Table of Contents

1 Introduction . . . 1
  1.1 Database Support for Efficient Hierarchical Clustering . . . 2
  1.2 Database Support for Efficient Similarity Search in Metric Spaces . . . 4

I Speed-up Clustering . . . 8

2 Speed-up Clustering with Data Summarization . . . 9
  2.1 Related Work . . . 10
  2.2 Speed-up Clustering with Data Summarization . . . 12
    2.2.1 Data Bubbles for Euclidean Vector Data . . . 12
    2.2.2 Data Bubbles for Non-Vector Spaces . . . 16
  2.3 Performance Evaluation . . . 29
    2.3.1 Data-sets and Experimental Setup . . . 29
    2.3.2 Comparison with Original Data Bubbles . . . 31
    2.3.3 Comparison with Random Sampling . . . 32
    2.3.4 An Application to a Real-life Data-set . . . 36
  2.4 Summary . . . 37

3 Speed-up Clustering with Pairwise Ranking . . . 38
  3.1 Preliminaries . . . 38
    3.1.1 Three Major Clustering Approaches . . . 38
    3.1.2 Triangle Inequalities in Metric Spaces . . . 39
  3.2 Motivation . . . 40
  3.3 k-Close Neighbor Ranking . . . 41
    3.3.1 Ranking Using Triangle Inequalities . . . 41
    3.3.2 An Intuition of When Ranking Works . . . 43
  3.4 Pairwise Hierarchical Ranking . . . 44
    3.4.1 Partitioning . . . 47
    3.4.2 Ranking . . . 47
    3.4.3 Reducing the Overhead in Ranking . . . 49
  3.5 Experimental Evaluation . . . 54
    3.5.1 Synthetic Data . . . 54
    3.5.2 Real-life Data . . . 56
  3.6 Summary . . . 64

II Speed-up Similarity Search . . . 67

4 The Concept of Virtual Pivots for Similarity Search . . . 68
  4.1 Related Work . . . 69
    4.1.1 Hierarchical Approaches . . . 69
    4.1.2 Non-hierarchical Approaches . . . 70
  4.2 The Pruning Ability of Pivots . . . 72
    4.2.1 One Pivot versus Several Pivots . . . 73
    4.2.2 Random Pivots versus Close Pivots . . . 74
  4.3 Virtual Pivots . . . 78
    4.3.1 The Pruning Ability of Virtual Pivots . . . 82
  4.4 Boosting Virtual Pivots . . . 85
    4.4.1 Why Boosting Works . . . 86
  4.5 Advantages of Using Virtual Pivots . . . 87
  4.6 Algorithm . . . 88
  4.7 Summary . . . 89

5 Efficient Similarity Search with Virtual Pivots and Partial Pivots . . . 90
  5.1 Methods . . . 91
    5.1.1 Partial Pivots . . . 91
    5.1.2 The Complete Algorithm with Virtual and Partial Pivots . . . 93
  5.2 Results . . . 95
    5.2.1 Results on the HIV-1 Dataset . . . 95
    5.2.2 Results on the HA Gene Dataset . . . 100
  5.3 Discussion . . . 103
    5.3.1 Query Performance Dependence on the Effectiveness of Pivots . . . 103
    5.3.2 Query Performance w.r.t. Ranking . . . 104
    5.3.3 Query Performance w.r.t. t . . . 106
    5.3.4 Time and Space Complexity . . . 107
    5.3.5 Distance Distributions . . . 108
    5.3.6 HIV-1 Genotyping Accuracy . . . 109
  5.4 Summary . . . 110

6 Conclusions and Future Work . . . 111

Bibliography . . . 112

List of Tables

5.1 The detailed numbers of distance computations and the runtime (in seconds) by all seven k-nn search methods in the smaller preprocessing, and their resultant query performance in 1-nn search in terms of the average number of distance computations and the average runtime (in seconds) per query. . . . 102
5.2 The distance intervals associated with the five bins and the numbers of query objects therein. . . . 103
5.3 The HIV-1 computational genotyping accuracies by k-nn search and majority vote, k = 1, 2, 3, 4, 5. . . . 110
List of Figures

2.1 Example reachability plot. . . . 14
2.2 Illustration of the distance between original Data Bubbles for vector data (Fig. adapted from [9]). . . . 15
2.3 No object is close to the center of the set. . . . 19
2.4 "Directional extent" of a Data Bubble. . . . 22
2.5 Illustration of direction and reverse direction. . . . 23
2.6 Illustration of border distance. . . . 24
2.7 "Normal" Data Bubble separation. . . . 25
2.8 A "gap" in Data Bubble B. Since the cluster in Bubble A spreads across the center line between representatives rA and rB, Bubble B contains points from two clusters. In B, the border distance in direction A is larger than in the reverse direction of A. . . . 25
2.9 Examples for a "gap" in a Data Bubble B. . . . 26
2.10 Reachability plots for the whole synthetic data-sets used for the evaluation. . . . 30
2.11 New versus old method on DS-Vector using N-score. . . . 32
2.12 New versus old method on DS-Vector using F-score. . . . 32
2.13 New versus old method on DS-UG using F-score. . . . 33
2.14 Non-vector Data Bubbles vs. random sampling on the DS-Tuple data-set. . . . 34
2.15 Scale-up speed-up w.r.t. number of objects on DS-Tuple. . . . 35
2.16 Results for DS-Real. . . . 36
3.1 An OPTICS walk. The arrows represent the ordering in the walk. Although the corresponding reachability-distances are different from the distances between the pairs of objects, the lengths of the edges indicate the level of the reachability-distance values. It shows that most plotted reachability-distances are small in value. . . . 40
3.2 Ranking with triangle inequalities. Although D(q, o) cannot be estimated by using p′, chances are that D(q, o) can be estimated by using another pivot p; while p′ cannot be used to estimate D(q, o), p′ can be used to estimate D(q′, o). . . . 42
3.3 Pr[E_P(q, o) ≤ D(q, c)] with respect to µ, σ, δ, and |P|. . . . 45
3.4 An example for hierarchical ranking. . . . 48
3.5 Hierarchical ranking algorithm. . . . 48
3.6 Linking the occurrences of each object. . . . 50
3.7 k-cn ranking algorithm with best-first frontier search. . . . 51
3.8 The reachability plots from OPTICS (a and b) and OPTICS-Rank (c and d) are almost identical (due to the property of OPTICS clustering, there are switches of cluster positions). . . . 54
3.9 Clustering accuracy and performance on DS-Vector. . . . 56
3.10 Clustering accuracy and performance on DS-Tuple. . . . 57
3.11 OPTICS output for DS-Protein with three cut-lines (y = 0.1, 0.2, 0.3). . . . 58
3.12 Clustering accuracy on DS-Protein. . . . 59
3.13 Performance of OPTICS-Rank and OPTICS-Bubble on DS-Protein. . . . 60
3.14 The distribution of ε values for rankings in OPTICS-Rank. . . . 60
3.15 Effect of changing parameter k. . . . 61
3.16 Effect of changing branching factor b. . . . 63
3.17 Effect of changing step limit s. . . . 64
3.18 Scalability with respect to the size of the data-set. . . . 65
3.19 Clustering results on DS-Jaspar. . . . 66
3.20 Distribution of F-scores. The overall F-score is 0.87. . . . 66
4.1 Methods to enhance pruning ability. . . . 74
4.2 The lower bound of m (δ ≤ 0.3). . . . 77
4.3 Portion of neighbors within a δ radius. . . . 78
4.4 Illustration of virtual pivots. Dashed lines represent distances to be estimated. . . . 80
4.5 Pruning schemes. . . . 81
4.6 The factor f(s, δ). . . . 84
4.7 Probability of having an LBE smaller than δ. . . . 86
5.1 Partial pivots can help when there is no close pivot to the query object. Let q be a query object, p be a partial pivot far away from q, and oi (i = 1, 2, 3) be neighbors of p. |D(q, p) − D(p, oi)| is close to D(q, p) and can be larger than the current k-nn distance upper bound, so that oi will be pruned away. . . . 92
5.2 The average numbers of distance computations per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10. . . . 97
5.3 The average runtime per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10. . . . 98
5.4 Performance measures per query by the six k-nn search methods, with different amounts of preprocessing effort, on the HIV-1 dataset using global alignment distance. . . . 100
5.5 The average numbers of distance computations per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10. . . . 101
5.6 The average runtime per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10. . . . 102
5.7 The average numbers of distance computations per query by all six methods in 1-nn search for five bins of query objects with different nearest neighbor distance ranges, on three datasets. . . . 105
5.8 Performance on HIV-1-GA when the number of fixed pivots used to perform ranking, |S|, increases from 1 to 80. . . . 106
5.9 Performance on HIV-CCV when |S| increases from 1 to 70. . . . 107
5.10 Performance when the number of predicted close neighbors for each data object, t, increases from 1 to 50. . . . 108
5.11 The distributions of all pairwise global alignment distances and all pairwise CCV-based Euclidean distances on the complete HIV-1 dataset of 1,198 viral strains. . . . 109

Chapter 1

Introduction

In the last three decades, computer-assisted biological technologies have played an increasingly significant role in our society. For instance, hierarchical clustering of DNA sequences now enables scientists to trace the origins of contagious diseases that have the potential to evolve into global pandemics; DNA forensic analysis is widely used in criminal investigations; and recent advances in protein structure prediction help the drug industry develop new drugs more efficiently. Given the potential of DNA and protein research, significant resources have been devoted to it, generating a huge volume of sequence and structure data.
For instance, the Human Genome Project has created a sequence database containing 3 billion chemical base pairs, and each year millions of dollars are spent on protein structure determination via X-ray and NMR methods, leading to a steady growth of protein structure databases. How to search and mine these sequence and structure data to make the best use of them is a significant challenge facing researchers today. While handling growing data-sets has been a long-standing topic of interest in the database and data mining communities, the unique characteristics of biological data make it difficult or even impossible to directly apply the traditional database searching and mining methods. In many biological databases, the (dis-)similarity measurement typically involves time-consuming operations such as sequence or structure alignments, which are computationally much more expensive than traditional dissimilarity measurements such as the Euclidean distance in a low-dimensional space. On the other hand, many traditional searching and mining algorithms do not scale up well: they require a large number of expensive distance computations, which leads to poor runtime performance. This dissertation addresses the challenges in two searching and mining problems: (1) clustering, and (2) similarity searches. The rest of this chapter outlines our contributions to each of these two areas.

1.1 Database Support for Efficient Hierarchical Clustering

Data clustering is an important task for knowledge discovery in databases (KDD). The basic goal of a clustering algorithm is to partition a set of data objects into groups so that similar objects belong to the same group and dissimilar objects belong to different groups. There are different types of clustering algorithms for different types of applications. A common distinction is between partitioning and hierarchical clustering algorithms (see e.g. [28]).
Partitioning algorithms, for instance the k-means [32] and the k-medoids algorithms [28], decompose a database into a set of k clusters, whereas hierarchical algorithms only compute a representation of the data-set that reflects its hierarchical clustering structure, without explicitly determining clusters. Examples of hierarchical clustering algorithms are the Single-Link method [42] and OPTICS [3]. Clustering algorithms in general, and hierarchical algorithms in particular, do not scale well with the size of the data-set. On the other hand, very fast methods are most desirable for exploratory data analysis, which is what clustering is mostly used for. To speed up cluster analysis on large data-sets, a number of data summarization methods have been proposed recently. Those methods are based on a general strategy that can be used to scale up whole classes of clustering algorithms (rather than inventing a new clustering algorithm); their basic idea is as follows.

1. Use a data summarization method that produces "sufficient statistics" for subsets of the data-set (using either sampling plus a classification of objects to the closest sample point, or some other technique such as BIRCH [54]). The data summarizations are sometimes also called "micro-clusters" (e.g. in [27]).
2. Apply (an adapted version of) the clustering algorithm to the data summaries only.
3. Using the clustering result for the data summaries, estimate a clustering result for the whole data-set.

Different data summarization methods have different advantages and disadvantages. In [9] it was shown that hierarchical clustering algorithms such as the Single-Link [42] method or OPTICS [3] require special information in order to produce high-quality results for small numbers of data summaries. The proposed data summarizations that meet all the requirements for hierarchical clustering were called "Data Bubbles".
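The three-step strategy can be sketched in a few lines of Python. This is only an illustrative caricature: the random-sample summarizer, the `cluster_algo` callback, and the toy threshold labels are hypothetical stand-ins, not the summarization or clustering methods discussed in this thesis.

```python
import random

def summarize(points, dist, m, seed=0):
    """Step 1: choose m representatives and assign each point to its
    closest one (a crude stand-in for richer "sufficient statistics")."""
    rng = random.Random(seed)
    rep_idx = rng.sample(range(len(points)), m)
    assign = [min(rep_idx, key=lambda r: dist(points[i], points[r]))
              for i in range(len(points))]
    return rep_idx, assign

def cluster_via_summaries(points, dist, m, cluster_algo, seed=0):
    rep_idx, assign = summarize(points, dist, m, seed)
    # Step 2: run the (expensive) clustering on the m representatives only.
    rep_labels = cluster_algo([points[r] for r in rep_idx], dist)
    label_of_rep = dict(zip(rep_idx, rep_labels))
    # Step 3: every point inherits the cluster label of its representative.
    return [label_of_rep[a] for a in assign]
```

The payoff is that `cluster_algo` sees only m objects instead of n, so an algorithm with quadratic distance cost performs m² rather than n² expensive distance computations; the price is that any cluster structure finer than the representatives is lost, which is exactly the accuracy problem discussed below.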
Most techniques to compute data summaries, including Data Bubbles, are based on the assumption that the data is from a vector space. Typically, they compute statistics such as the mean of a set of objects, which requires that vector space operations (addition of objects, multiplication with scalar values) can be applied. In many important applications, however, the data is from a non-vector metric space. A metric space is given by a set of data objects B and a distance function D(., .) that satisfies symmetry, reflexivity, and the triangle inequality (for any x, y, z ∈ B, D(x, y) + D(y, z) ≥ D(x, z)) [12]. Metric spaces include some vector spaces such as the Euclidean space, but in a general metric space the notions of volume and dimensionality from vector spaces are not valid, rendering most mining and searching methods developed for vector data inapplicable. For summarization in non-vector spaces, the only information that can be utilized is a similarity or a dissimilarity distance function. In this chapter we will assume a distance function measuring dissimilarities, i.e., we only have information about distances between objects. This makes it difficult, or at least very expensive, to compute the usual sufficient statistics used to summarize vector data. However, a data summarization method that allows a very fast (even if only approximate) clustering of non-vector data is highly desirable, since the distance functions for some typical and important applications can be extremely computationally expensive (e.g., a sequence alignment for a set of DNA or amino acid sequences). In Chapter 2, we propose a novel data summarization method that can be applied to non-vector data to produce high-quality "micro-clusters" to efficiently and effectively support hierarchical clustering.
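A small concrete example of such a non-vector metric space is the set of strings under the classic Levenshtein edit distance: there is no meaningful mean of two strings, yet the distance is symmetric, reflexive, and satisfies the triangle inequality. The dynamic-programming sketch below is a toy stand-in for the far more expensive alignment distances used in this thesis; the checks at the end verify the metric properties on a few sample sequences.

```python
from itertools import permutations

def edit_distance(a, b):
    """Levenshtein distance between two strings, via the standard
    two-row dynamic program (O(len(a) * len(b)) time)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

seqs = ["ACGT", "ACGA", "TTGT", "ACG"]
for x in seqs:                                   # reflexivity
    assert edit_distance(x, x) == 0
for x, y in permutations(seqs, 2):               # symmetry
    assert edit_distance(x, y) == edit_distance(y, x)
for x, y, z in permutations(seqs, 3):            # triangle inequality
    assert edit_distance(x, y) + edit_distance(y, z) >= edit_distance(x, z)
```

Only calls to this distance function are available to a distance-based summarization method; no coordinates, means, or per-dimension statistics exist.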
The information produced for each data summary is related to, and improves upon, the information computed for the Data Bubbles proposed in [9], in the sense that accurate estimations of the information needed by hierarchical clustering algorithms are generated (in our experiments, we found that our new version of Data Bubbles even outperforms the original Data Bubble method on some vector data). While being a significant improvement over naïve sampling, the approach of using data summarization to derive approximate clustering results still suffers from inadequate accuracy for some important real-life applications. For these methods, clusters smaller than the number of points in the smallest abstract region, represented by a set of sufficient statistics, will typically be lost in the final clustering result. Even clusters that have a large number of points but are close to other clusters can be obscured by bigger clusters in the output, since gaps between clusters often cannot be recovered correctly by BIRCH or Data Bubbles [56]. In Chapter 3, we propose a novel approach to perform approximate clustering with high accuracy. The method is based on the observation that in some clustering algorithms, such as OPTICS and Single-Link, the final clustering result depends largely on the nearest neighbor distances of data objects, which comprise only a very small portion of the quadratic number of pairwise distances between data objects. We introduce a novel pairwise hierarchical ranking to efficiently determine close neighbors for every data object. The clustering is then performed on the original data objects instead of on sample points or sufficient statistics as in previous methods.
Since a naïve pairwise hierarchical ranking may introduce a large computational overhead, we also propose two techniques to significantly reduce this overhead: 1) a frontier search, rather than the sequential scan of the naïve ranking, to reduce the search space; 2) an approximate frontier search for pairwise ranking that further reduces the runtime. Empirical results on synthetic and real-life data show a speed-up of up to two orders of magnitude over previous approaches.

1.2 Database Support for Efficient Similarity Search in Metric Spaces

The rapid growth of biological sequence and structure databases imposes a challenging problem: how to access them efficiently. For a classic instance, when scientists are about to study a new biological sequence or structure, they may want to see whether any similar sequence or structure exists in the databases, so that they can use the knowledge about the existing objects to infer the properties of the new object. In the case of primary sequences only, fast homology search tools such as BLAST [2] and FASTA [36] may be used for those queries which have very similar sequences already deposited in the database. Nevertheless, neither of them is guaranteed to identify the most similar sequences for an arbitrary query. The main cause is that these homology search tools are largely heuristic, and they typically fail on queries which do not have very similar sequences (measured solely by a single distance function) deposited in the database [31]. Finding the most similar sequences (or structures) in an existing database can be easily modeled as a similarity search problem such as the k-nearest neighbor (k-nn) search studied in the database community [25]. In k-nn search, the basic task is to efficiently retrieve from a large set of database objects the k objects most similar to a given query object, measured by a distance (or similarity) function: the smaller the distance between two objects, the more similar they are.
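The baseline implied by this problem statement is a linear scan: compute the distance from the query to every database object and keep the k smallest. A minimal sketch (illustrative only, not a method from this thesis):

```python
import heapq

def knn_naive(query, objects, dist, k):
    """Linear-scan k-nn: exactly one distance computation per database
    object. With an expensive dist (e.g., sequence alignment), this
    per-object cost is what pivot-based pruning tries to avoid."""
    return heapq.nsmallest(k, objects, key=lambda o: dist(query, o))

# Toy usage on 1-D integers with absolute difference as the distance:
neighbors = knn_naive(27, [3, 10, 25, 26, 90], lambda a, b: abs(a - b), 2)
```

For n database objects this always performs n distance computations, which is what makes the naïve approach unacceptable when a single distance evaluation takes seconds and the database holds thousands or millions of objects.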
Both the large number of database objects and the time-consuming distance computation between two objects can make k-nn search slow, or even prohibitive, in real-life applications. How k-nn search can be done efficiently has been extensively studied in the database and theory communities, and numerous methods have been proposed. However, while the problem has well-known solutions in low-dimensional vector spaces, for example through R-trees [22], an efficient solution is still an elusive goal for k-nn search in many important real-life applications. In these applications, the data objects usually come from very high-dimensional vector spaces or from general metric spaces in which the data objects behave, from an indexing point of view, like very high-dimensional data. Such databases often have the following characteristic: each data object typically has only a few close neighbors (objects that are similar in a meaningful way), while the other objects are very far away from it at a similar distance (i.e., approximately the same degree of large dissimilarity). For example, according to our computational experience, when indexing all protein sequences using a global sequence alignment dissimilarity measure, sequences within the same protein family typically form a very tight cluster and have very small distances to each other, while sequences from different families have very large distances. This characteristic of very high-dimensional databases renders most indexing structures ineffective. Currently, there is no exact k-nn search method that can handle such databases efficiently [12, 25, 6, 4]. We call such databases "hard-to-index" databases. In these "hard-to-index" databases, the distance function is typically very expensive in terms of CPU time. For instance, in biological databases the distance (or similarity) functions usually involve optimization operations such as sequence or structure alignments.
When using some sequence alignment algorithms [52], computing the distance (or similarity) between two genomic sequences of length around 10,000 nucleotides may take seconds on a state-of-the-art PC. This property makes the naïve solution to k-nn search, which computes the distances between the query and all database objects, unacceptable for interactive applications that require quick response times or data mining applications that require very large numbers of k-nn searches. Note in particular that a modern database, such as GenBank or the Protein Data Bank, may contain thousands or even millions of objects. Consequently, the primary goal of an efficient k-nn search in these "hard-to-index" databases is to reduce the number of distance computations. For that purpose, many applications have regulated their expensive distance function so that the distance function and the database together form a metric space. In such a space, the triangle inequality (for any x, y, z ∈ B, D(x, y) + D(y, z) ≥ D(x, z)) is the fundamental mechanism to estimate the distance between two data objects, which can often be done in time negligible compared to a distance computation. In many existing search methods, the distances between (all or a portion of) the data objects and a set of pre-selected reference data objects are pre-computed. At query time, the distance D(q, p) between the query object q and a reference object p is computed first. Then, the triangle inequality on ⟨o, p, q⟩ can be used to derive a lower bound and an upper bound for the distance D(q, o) between the query object and an arbitrary non-reference object o:

|D(q, p) − D(o, p)| ≤ D(q, o) ≤ D(o, p) + D(q, p).    (1.1)

The pre-selected reference data objects, such as the object p above, are called pivots.
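Even a single pivot already supports pruning via Formula (1.1): an object whose lower bound exceeds the query radius can be discarded without ever computing its true distance to the query. The sketch below illustrates this under simplifying assumptions (1-D integers with absolute difference as the metric; in a real index the D(o, p) table is built once, offline); it is not an algorithm from this thesis.

```python
def pivot_bound(d_qp, d_op):
    """Formula (1.1): lower and upper bound on D(q, o),
    given d_qp = D(q, p) and d_op = D(o, p) for a pivot p."""
    return abs(d_qp - d_op), d_qp + d_op

def range_query_one_pivot(query, objects, pivot, dist, radius):
    """Return all objects within `radius` of the query; the true distance
    D(q, o) is computed only when the bounds cannot decide membership."""
    d_qp = dist(query, pivot)
    d_op = {o: dist(o, pivot) for o in objects}  # precomputed offline in practice
    result = []
    for o in objects:
        lo, hi = pivot_bound(d_qp, d_op[o])
        if lo > radius:
            continue                 # pruned: no distance computation needed
        if hi <= radius or dist(query, o) <= radius:
            result.append(o)
    return result
```

The only distance computed at query time for a pruned object is the single D(q, p); everything else is cheap arithmetic on precomputed values, which is the whole point of pivot-based schemes.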
When there is more than one pivot, combining the bounds estimated via Formula (1.1) over all pivots in a set P yields the largest lower bound

l_D(o,q) = max_{p ∈ P} |D(q, p) − D(o, p)|    (1.2)

and the smallest upper bound

u_D(o,q) = min_{p ∈ P} (D(q, p) + D(o, p))    (1.3)

for D(q, o). These bounds can be used to prune away objects that are not within a certain distance threshold of the query object q, thus avoiding the computation of their distances to q. The general goal during query processing in metric data is to use the triangle inequality, with pre-computed distances and distances computed during the query stage, to exclude or prune objects from the set of possible answers. Essential for the effectiveness of such indexing schemes is the ability to produce only a small set of candidates during the pruning stage of query processing, without computing too many distances. Objects that cannot be excluded from the result of a query using triangle inequalities have to be loaded from disk, and the distances between the query and these candidate objects have to be computed. We will see that for hard-to-index metric data-sets, traditional pruning techniques using a relatively small set of fixed pivots cannot be effective, even when the data satisfies the condition that each object has a small number of close neighbors while those groups of close objects are far away from each other, as in the given application domains. Our contributions to this k-nn search problem are the following. We propose a new method for k-nn search in hard-to-index metric data that significantly outperforms previous approaches. The method is based on the novel concepts of virtual pivots and partial pivots. In Chapter 4, we first analyze the pruning ability of pivots and the resulting limitations of existing approaches. We then introduce the novel concept of a virtual pivot, which allows us to select virtually any data object as a pivot.
We show formally that a single virtual pivot can have the same pruning power as a large set of fixed pivots, and we propose a query-dependent, dynamic virtual pivot selection method using ranking. Dynamic virtual pivot selection effectively tightens both the lower and the upper bound estimations of unknown distances during query processing, whereas most traditional schemes of triangle inequality pruning focus only on tightening the lower bound estimations. The space and pre-computation overhead of our method is small and similar to that of approaches using only a relatively small number of fixed pivots. In Chapter 5, we introduce the concept of partial pivots, further extending the concept of virtual pivots to use every data object as a pivot, but without suffering from a quadratic number of distance computations. Our partial pivots method is based on the pairwise ranking technique developed in Chapter 3, and is shown to outperform virtual pivots in our testing. An extensive experimental evaluation on real-life data-sets confirms that our new method outperforms the next best method, using no more than 40% of the distance computations per query on a database of 10,000 gene sequences, compared to several of the best known k-nn search methods including M-Tree [13], OMNI [20], SA-Tree [34], and LAESA [37].

Part I

Speed-up Clustering

Chapter 2

Speed-up Clustering with Data Summarization

To speed up clustering algorithms, data summarization methods have been proposed, which first summarize the data-set by computing suitable representative objects. A clustering algorithm is then applied to these representatives only, and a clustering structure for the whole data-set is derived, based on the result for the representatives. Most previous summarization methods are, however, limited in their application domain. They are, in general, based on sufficient statistics such as the linear sum of a set of points, which assumes that the data is from a vector space.
On the other hand, in many important applications the data is from a non-vector metric space, and only distances between objects can be exploited to construct effective data summarizations. In this chapter,¹ we develop a new data summarization method based only on distance information that can be applied directly to non-vector metric data. An extensive performance evaluation shows that our method is very effective in finding the hierarchical clustering structure of non-vector metric data using only a very small number of data summarizations, thus resulting in a large reduction of runtime while trading only very little clustering quality.

The rest of this chapter is organized as follows. We briefly review related work in Section 2. We present the necessary background regarding the original Data Bubbles technique for vector data and the OPTICS clustering algorithm in Section 3. Section 4 discusses the problems that arise when trying to generate summary information for sets of non-vector metric data and introduces our new method. The experimental evaluation in Section 5 shows that our method not only allows very effective and efficient hierarchical clustering of non-vector metric data, but even outperforms the original Data Bubbles method when applied to vector data. Section 6 summarizes the chapter.

¹ Some of the material in this chapter has been published in [56].

2.1 Related Work

The most basic method to speed up expensive data mining algorithms such as hierarchical clustering is random sampling: only a subset of the database is randomly chosen, and the data mining algorithm is applied to this subset instead of to the whole database. Typically, if the sample size is large enough, the result of the data mining method on the sample will be similar enough to the result on the original database. More specialized data summarization methods have been developed to support clustering algorithms in particular.
For k-means-type clustering algorithms, summary statistics called “clustering features”, originally introduced for the BIRCH method [54], have been used by different approaches. BIRCH incrementally computes compact descriptions of subclusters, called Clustering Features, which are defined as CF = (n, LS, ss), where LS is the linear sum (Σ_{i=1..n} x⃗_i) and ss the square sum (Σ_{i=1..n} x⃗_i²) of the n points in the subcluster represented by the clustering feature CF. The CF-values are sufficient to compute information such as the centroid, radius, and diameter of a set of points. They also satisfy an additivity condition that allows the incremental computation of CF-values when inserting points into a set: if CF1 = (n1, LS1, ss1) and CF2 = (n2, LS2, ss2) are the CFs for sets of points S1 and S2, respectively, then CF1 + CF2 = (n1 + n2, LS1 + LS2, ss1 + ss2) is the clustering feature for the union of the points in S1 and S2, S1 ∪ S2. The CFs are organized in a balanced tree with branching factor b and a threshold t, where a non-leaf node represents all objects in the sub-tree rooted at this node. A leaf node has to contain at most l entries, and the diameter of each leaf node has to be less than t. The generation of a CF-tree is similar to the construction of B+-trees: a point p is inserted into the tree by first finding the leaf in the current CF-tree that is closest to p. If an entry in the leaf can absorb p without violating the threshold condition, p is inserted into this entry and the corresponding CF-value is updated. If p cannot be inserted into an existing entry, a new entry is created in the leaf node. This may lead to an overflow of the leaf node, causing it (and possibly its ancestors) to be split in a similar fashion as in B+-trees. A clustering algorithm is applied to the entries in the leaf nodes of the CF-tree.

In [8], a very specialized compression technique for scaling up k-means and EM clustering algorithms [24] is proposed.
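The clustering features and their additivity condition described above can be sketched as follows; this is an illustrative sketch only, and the helper names are not from BIRCH itself.

```python
import numpy as np

# Sketch of BIRCH-style clustering features CF = (n, LS, ss) and their
# additivity condition, as described in the text.  Names are illustrative.

def make_cf(points):
    pts = np.asarray(points, dtype=float)
    # n, linear sum over the points, and scalar square sum
    return (len(pts), pts.sum(axis=0), float((pts ** 2).sum()))

def merge_cf(cf1, cf2):
    # Additivity: CF1 + CF2 describes the union of the two point sets.
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return (n1 + n2, ls1 + ls2, ss1 + ss2)

def centroid(cf):
    n, ls, _ = cf
    return ls / n
```

The additivity condition is what makes incremental insertion into a CF-tree cheap: absorbing a point is just a merge with a single-point CF.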
This method uses basically the same type of sufficient statistics as BIRCH, i.e., triples of the form (n, LS, ss). The major difference is only that different sets of data items (points or summaries) are treated and summarized independently: points that are unlikely to change cluster membership in the iterations of the clustering algorithm, data summaries that represent tight subclusters of data points, and a set of regular data points containing all points that cannot be assigned to other data summarizations. In [16], a general framework for “squashing” data is proposed, which is intended to scale up a large collection of data mining methods. This method is based on partitioning the dimensions of the data space and grouping the points into the resulting regions. For each region, a number of moments are calculated, such as mean, minimum, maximum, second-order moments such as Xi² or XiXj, and higher-order moments depending on the desired degree of approximation. Squashed data items are then created for each region in such a way that the moments of the squashed items approximate those of the original data falling into the region. This information can also be used to compute clustering features as above for each constructed region, in order to speed up k-means-type clustering algorithms.

In [9] it was also proposed to compute sufficient statistics of the form (n, LS, ss) based on a random sample, by partitioning the data-set using a k-nearest neighbor classification. This method has several advantages over, for instance, the CF-tree: the number of representative objects for a data-set can be determined exactly, and no other heuristic parameters such as a maximum diameter or a bin size have to be used in order to restrict the number of partitions that are represented by triples (n, LS, ss). The method was proposed as follows:

• Draw a random sample of size s from the database to initialize s sufficient statistics.
• In one pass over the database, classify each object o to the sampled object p to which it is closest, and incrementally add o to the sufficient statistics initialized by p, using the additivity condition given above.

Results in [9] show that the quality of the sufficient statistics obtained by random sampling is much better than that of the CF-values produced by BIRCH, when used to generate the additional information that is needed to get satisfactory results with hierarchical clustering algorithms. The runtime to generate those CF-values using a CF-tree is also significantly larger, which makes it almost impossible to beat even a naïve sampling approach to speed-up clustering, given the same resources. If it takes too much time to generate data summarizations, naïve sampling may simply use a larger sample and obtain superior results with a much less complex implementation.

The only other data summarization method for non-vector metric data that we are aware of is presented in [21], and is based on BIRCH. The authors suggest a generalization of a BIRCH tree that has two variants, BUBBLE and BUBBLE-FM, for metric data. Both variants keep a number of representatives for each leaf node entry in order to approximate the most centrally located object in a CF-tree leaf. In non-leaf level entries, both methods keep a certain number of sample objects from the sub-tree rooted at that entry in order to guide the search process when building the tree. The basic difference between BUBBLE and BUBBLE-FM is that for BUBBLE-FM the sample points in the non-leaf node entries are mapped to a d-dimensional Euclidean vector space using Fastmap [19]. The image space is then used to determine distances between new objects and the CFs, thus replacing possibly expensive distance calculations in the original space by Euclidean distance computations.
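The sampling-based construction of sufficient statistics from [9], described above, could be sketched roughly as follows; the function name and parameters are illustrative, not from the cited work.

```python
import random
import numpy as np

# Illustrative sketch of the sampling-based summarization from [9] as
# described above: draw s sample points, then in one pass add each object
# to the sufficient statistics (n, LS, ss) of its closest sample.

def summarize(data, s, seed=0):
    data = np.asarray(data, dtype=float)
    rng = random.Random(seed)
    reps = [data[i] for i in rng.sample(range(len(data)), s)]
    stats = [[0, np.zeros(data.shape[1]), 0.0] for _ in reps]
    for x in data:                      # single pass over the database
        j = min(range(s), key=lambda i: np.linalg.norm(x - reps[i]))
        stats[j][0] += 1                # n
        stats[j][1] += x                # linear sum LS
        stats[j][2] += float(x @ x)     # square sum ss
    return reps, stats
```

Note that the additivity of (n, LS, ss) is exactly what allows the per-object update inside the single pass.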
We argue that this approach has drawbacks similar to those of the vector version, and we will therefore base our current work for non-vector metric data on a sampling-based approach to produce data summarizations.

2.2 Speed-up Clustering with Data Summarization

In this section, we propose a novel data summarization method that can be applied to non-vector metric data to produce high-quality “micro-clusters” that efficiently and effectively support hierarchical clustering. The information produced for each data summary is related to and improves upon the information computed for the Data Bubbles proposed in [9], in the sense that accurate estimations of the information needed by hierarchical clustering algorithms are generated.

2.2.1 Data Bubbles for Euclidean Vector Data

In this subsection, we briefly review the notion of Data Bubbles for Euclidean vector spaces as proposed in [9]. We discuss the special requirements that hierarchical clustering algorithms such as the Single-Link method and OPTICS pose on data summarization methods, and we illustrate the advantages of Data Bubbles. While simple statistics such as the clustering features produced by BIRCH are effective for k-means-type clustering algorithms, they are typically not sufficient to produce good results with a hierarchical clustering algorithm. The main reason is that hierarchical clustering algorithms are based on the distances between sets of data points, which are not represented well by the distances between only the representative objects, especially as the compression rate increases. This type of error typically results in a very distorted clustering structure based on data summaries. The Data Bubbles in [9] have been proposed to solve those problems, showing that a data summarization method, in order to support hierarchical clustering, has to take into account the extension and the point density of the data subset being represented.

Basic Definitions

A Data Bubble was defined in [9] as follows:

DEFINITION 1.
A Data Bubble for a set of points X = {Xi}, 1 ≤ i ≤ n, is a tuple BX = (rep, n, extent, nnDist), where

• rep is a representative object for X, which is assumed to be close to the center (e.g., the centroid) of the set X;

• n is the number of objects in X;

• extent is the radius of BX around rep that encloses “most” of the objects in X;

• nnDist(k, BX) is a function that estimates the average k-nearest neighbor distances in BX.

For d-dimensional points from a Euclidean vector space, the representative rep, the radius extent of the Data Bubble, and the k-nearest neighbor distances nnDist(k, B) can easily be estimated using simple sufficient statistics, which can be incrementally computed during the initial construction of the Data Bubbles. The representative rep is simply computed as the mean of the set of objects in X, i.e., rep = (Σ_{i=1..n} Xi)/n. The radius of BX around rep can be estimated by the average pairwise distance within BX, i.e.,

extent = sqrt( Σ_{i=1..n} Σ_{j=1..n} (Xi − Xj)² / (n(n−1)) ).

This expression can in turn be computed from the simpler statistics linear sum LS and square sum ss of all objects in X. LS and ss can be incrementally maintained when constructing a Data Bubble (as in the construction of clustering features CF in the BIRCH algorithm). Using these two values, the extent can be calculated as

extent = sqrt( (2·n·ss − 2·LS²) / (n(n−1)) ).

The average k-nearest neighbor distances can be estimated by a simple arithmetic expression, assuming a uniform distribution of the objects within a Data Bubble [9]:

nnDist(k, BX) = (k/n)^(1/d) · extent.

Application to Hierarchical Clustering

Hierarchical clustering algorithms compute a hierarchical representation of the data-set, which reflects its possibly nested clustering structure. The hierarchical clustering algorithm OPTICS is based on the notions of core distance and reachability distance for objects with respect to parameters eps and minPts.
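Before detailing OPTICS, the vector-space Data Bubble statistics defined above can be sketched in code; this is an illustrative sketch computing extent from the sufficient statistics (n, LS, ss) and the uniform-distribution k-nn estimate, with hypothetical function names.

```python
import math

# Sketch of the vector Data Bubble statistics above: extent from the
# incrementally maintained linear sum LS and square sum ss, and the
# uniform-distribution estimate of the average k-nn distance.

def bubble_extent(n, LS, ss):
    # extent = sqrt((2*n*ss - 2*|LS|^2) / (n*(n-1)))
    ls_sq = sum(v * v for v in LS)
    return math.sqrt((2 * n * ss - 2 * ls_sq) / (n * (n - 1)))

def nn_dist(k, n, d, extent):
    # nnDist(k, B) = (k/n)^(1/d) * extent, assuming uniform density
    return (k / n) ** (1.0 / d) * extent
```

For the three collinear points (0,0), (1,0), (2,0) we have LS = (3, 0) and ss = 5, and the formula reproduces the average pairwise distance sqrt(2) exactly.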
For any point p, its core distance is equal to its minPts-nearest neighbor distance if this distance is no greater than eps, and infinity otherwise. The reachability distance of p with respect to another point o is the greater of the core distance of p and the distance between p and o. The parameter minPts allows the core distance and reachability distance of a point p to capture the point density around that point. Using these distances, OPTICS computes a “walk” through the data-set and assigns to each object p its core distance and the smallest reachability distance reachDist with respect to an object considered before p in the walk. The algorithm starts with an arbitrary object, assigning it a reachability distance equal to ∞. The next object o in the output is then always the object that has the shortest reachability distance d to any of the objects that were “visited” previously by the algorithm. This reachability value d is assigned to this object o. The output of the algorithm is a reachability plot, which is a bar plot of the reachability values assigned to the objects in the order they were visited. An example reachability plot for a 2-dimensional data-set is depicted in Figure 2.1. Such a plot is interpreted as follows: “valleys” in the plot represent clusters, and the deeper the “valley”, the denser the cluster. The tallest bar between two “valleys” is a lower bound on the distance between the two clusters. High bars in the plot that are not at the border of a cluster represent noise, and “nested valleys” represent hierarchically nested clusters (reachability plots can also be converted into dendrograms [39]).

Figure 2.1: Example reachability plot. (a) Data-set; (b) Reachability plot.

Clusters in a hierarchical clustering representation are in general obtained manually (e.g., by cutting through the representation).
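The OPTICS walk described above can be sketched roughly as follows. This is a simplified, quadratic-time sketch for a small in-memory distance matrix, not the original implementation; all names are illustrative.

```python
import math

# Rough sketch of the OPTICS "walk" described above, for a small data-set
# given as a distance matrix dist[i][j].  Simplified and quadratic-time.

def optics_walk(dist, minpts, eps=math.inf):
    n = len(dist)

    def core_dist(i):
        # minPts-nearest neighbor distance, or infinity if it exceeds eps
        d = sorted(dist[i][j] for j in range(n) if j != i)[minpts - 1]
        return d if d <= eps else math.inf

    order, reach = [], {0: math.inf}
    unvisited, current = set(range(n)), 0
    while unvisited:
        unvisited.discard(current)
        order.append((current, reach[current]))
        # Reachability of i w.r.t. current: max(core distance, distance)
        for i in unvisited:
            r = max(core_dist(current), dist[current][i])
            reach[i] = min(reach.get(i, math.inf), r)
        if unvisited:
            # next object: shortest reachability to any visited object
            current = min(unvisited, key=lambda i: reach[i])
    return order  # (object, reachability) pairs for the reachability plot
```

On four points on a line at 0, 1, 2 and 10, the walk visits the tight group first and the bar for the far point is tall, which is the “valley” pattern described above.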
This process is typically guided by a visual inspection of the diagram, which is why a correct representation of the clustering structure is very important, especially when applying the algorithm to data summarizations instead of the whole data-set.

Figure 2.2: Illustration of the distance between original Data Bubbles for vector data (figure adapted from [9]).

The most important issue when applying hierarchical clustering to Data Bubbles is the distance function that is used to measure the dissimilarity between two Data Bubbles. In [9] it has been shown that, using the distance between representatives and the extent of Data Bubbles for vector data, a distance between Data Bubbles can be computed that dramatically improves the result of hierarchical clustering compared to using only the distance between representatives. This notion of distance, which is aware of the extent of the Data Bubbles, is depicted in Figure 2.2. If the Data Bubbles do not overlap, it is basically the distance between the “borders” of the Data Bubbles (the distance between the representative objects of the Data Bubbles, minus the extents of the Data Bubbles, plus the 1-nearest neighbor distances of the Data Bubbles); otherwise, if they overlap, it is the estimated nearest neighbor distance of the Data Bubble that has the larger nn-distance. The second important issue in hierarchical clustering of data summarizations is the adaptation of the graphical result. The reason is that the Data Bubbles typically represent sets of objects that contain significantly different numbers of objects, and that can have largely differing point densities. Including only the representative of a Data Bubble in the hierarchical output representation will most often lead to a very distorted picture of the true clustering structure of a data-set. Therefore, for OPTICS, the bar for each Data Bubble in the reachability plot is expanded using the so-called “virtual reachability”.
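The extent-aware distance between vector Data Bubbles described above can be sketched as follows; the function name and parameter names are illustrative.

```python
def bubble_distance(d_reps, extent_a, extent_b, nn1_a, nn1_b):
    """Sketch of the extent-aware distance between two vector Data Bubbles
    described above.  d_reps is the distance between the representatives;
    nn1_a and nn1_b are the bubbles' 1-nearest neighbor distance estimates."""
    if d_reps >= extent_a + extent_b:
        # non-overlapping bubbles: distance between the "borders",
        # corrected by the 1-nn distances
        return d_reps - extent_a - extent_b + nn1_a + nn1_b
    # overlapping bubbles: the larger estimated nearest neighbor distance
    return max(nn1_a, nn1_b)
```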
More precisely, for each Data Bubble representing n points, n bars are added to the reachability plot. The height of each bar is calculated as the virtual reachability of the Data Bubble, which corresponds to the estimated average reachability distance for points within the Data Bubble (basically the estimated minPts-nearest neighbor distance). Other hierarchical algorithms such as the Single-Link method can be similarly adapted to work with Data Bubbles.

2.2.2 Data Bubbles for Non-Vector Spaces

The only information from a non-vector space that can be utilized is the distance function, i.e., the distances between objects. Therefore, it is difficult in such spaces to get an accurate and at the same time computationally inexpensive estimation of the important components defined for the original Data Bubbles. We cannot compute new “artificial” objects such as a centroid, which would be guaranteed to be in the center of the respective set and hence would be the best representative for the objects in a Data Bubble. We also cannot compute statistics like the linear sum or the square sum of the objects that would allow us to incrementally maintain a good estimation of the radius of a Data Bubble around the representative. Similarly, there is no inexpensive or incremental way to compute an estimation of the average k-nearest neighbor distances in a Data Bubble. For these reasons, the original definition of Data Bubbles has to be significantly changed and adapted to the particular problems of non-vector metric spaces. The main purpose of Data Bubbles is to support effective and highly efficient hierarchical clustering based on the summary information provided by the Data Bubbles. The representative, the extent, and the average k-nearest neighbor distances of a Data Bubble serve only the purpose of defining effective distance notions for hierarchical clustering.
For the algorithm OPTICS, which we will use to evaluate our method, these notions are:

• The notion of a distance between Data Bubbles, which has to “be aware” of the extent of the Data Bubbles. This is the most important notion for effective hierarchical clustering, because the distances between Data Bubbles will determine the shape of the clustering result.

• The core-distance of a Data Bubble, which is also used to define the “virtual reachability” for the objects represented by the Bubble.

• The reachability-distance of a Data Bubble relative to another Data Bubble, which is needed during the execution of OPTICS. The appropriateness of the reachability-distance depends on the previous two notions, since it is defined using only the core-distance and the distance between Data Bubbles.

Errors in estimating a representative, the extent, or the average k-nearest neighbor distances will lead to errors when computing the above distances, which in turn will produce errors in the clustering result using Data Bubbles. To make things worse, errors for different components in a Data Bubble may depend on and amplify each other; e.g., an error in the estimation of the representative will obviously lead to an increased error in the extent around the representative, if we keep the original definition of the extent as a radius around the representative that contains most of the objects of the Data Bubble. In the following subsections we will analyze these problems and propose a new and more suitable version of Data Bubbles that solves them. In order to discuss the problems, we will assume the following minimal procedure to generate k Data Bubbles for non-vector metric data (the complete method will be given later in this section):

1. Sample k objects from the database randomly.

2. Assign each object in the database to the closest sample object in the set of objects obtained in step 1.
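The minimal two-step procedure above could be sketched as follows for generic metric data; only a black-box distance function is assumed, and the names are illustrative.

```python
import random

# Sketch of the minimal procedure above for metric data: sample k objects,
# then assign every object to its closest sample, keeping the distances to
# the samples, since they are computed anyway.

def build_bubbles(objects, k, dist, seed=0):
    rng = random.Random(seed)
    reps = rng.sample(objects, k)             # step 1: random sample
    bubbles = {i: [] for i in range(k)}
    for o in objects:                         # step 2: assign to closest rep
        d = [dist(o, r) for r in reps]        # distances to all samples
        j = min(range(k), key=lambda i: d[i])
        bubbles[j].append((o, d[j]))          # member plus distance to rep
    return reps, bubbles
```

The stored member-to-representative distances are exactly the raw material that the estimations discussed below are built from.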
This means that, using this procedure, the only information we can utilize in our computation of data summarizations is the k objects drawn from the database in step 1 (they may be used, for instance, as candidates for representative objects) and the distances of all objects to all of the sample objects from step 1. These distances have to be computed anyway to determine the closest representative for each object.

Representative Objects

In a non-vector metric space, the representative object for a Data Bubble has to be an object from the Data Bubble itself, since we cannot compute a centroid for the set of objects. Theoretically, the best representative for a set of objects in a non-vector metric space is a medoid, i.e., an object that is located most centrally in the set of objects, in the sense that its overall distance to all other objects in the Data Bubble is minimal. More formally:

DEFINITION 2. A medoid for a set of objects X is an object m ∈ X such that for all p ∈ X:

Σ_{o∈X} D(m, o) ≤ Σ_{o∈X} D(p, o).

A medoid, although it seems to be the best choice of representative, has a severe drawback: determining a medoid for a set of n objects is computationally expensive (O(n²)), since all pairwise distances have to be computed. Because we want to use very high compression rates in practice (i.e., only a very small number of Data Bubbles, and hence a very large number of objects represented by one Data Bubble on average), it is not feasible to determine a medoid for a Data Bubble with this exhaustive search method; the same number of computations could be better used to cluster a larger subset of objects directly, without generating Data Bubbles. Using our minimal procedure to construct data summarizations, there are basically three alternatives to determine representative objects for a Data Bubble more efficiently but less optimally, all with advantages and disadvantages:

1.
“Initial sample object”: keep the initial sample object that was used to generate a Data Bubble as the representative of the Data Bubble.

2. “Relocation using a sub-sample”: after the generation of the Data Bubble, take a small sub-sample from the Data Bubble (including the initial sample object), and determine an exact medoid only in this subset.

3. “Maintaining several candidates”: while generating the Data Bubble, keep a number of objects as potential representatives in main memory (e.g., the first m objects assigned to the Data Bubble). When assigning objects, compute and sum up the distances not only to the initial sample object but also to the additional candidates in the Data Bubble. After the generation of the Data Bubble, select the candidate with the smallest sum of distances.

The first alternative, keeping the initial sample object, is the least expensive, since no additional computation is necessary. But it is also the alternative with the largest error. The quality of the representatives found by the second alternative obviously depends on the size of the sub-sample drawn from the Data Bubble. Our experiments show, however, that a 5% sub-sample of a Data Bubble will result in representatives that are only slightly better approximations of a true medoid than those of the first approach, and the effect on the quality of the clustering result is not significant. Taking larger sub-samples, however, is also too expensive, in the same sense as the exhaustive method: instead of taking a sub-sample to relocate the representative, we could use a larger sample without bubble generation to improve the clustering result. Alternative 3, i.e., maintaining several candidates during the generation of the bubbles, has basically the same properties as alternative 2.
Note that, even in the best case, i.e., even if we could get an exact medoid for the whole set of objects in a Data Bubble, we may produce noticeable errors in the clustering result, because there are limits to the accuracy of a medoid as being in the center of the set of objects that it represents. This is in fact a drawback for any representative object that has to be an element of the set itself (as opposed to a computed mean in the case of vector data). Figure 2.3 depicts a case where the data-set does not contain an object close to the center of the set.

Figure 2.3: No object is close to the center of the set.

Due to this drawback, even the best selection of a representative for a Data Bubble may result in an error when estimating the extent of a Data Bubble, and consequently in the distance between Data Bubbles, to a degree that would not occur for vector Data Bubbles. Using any of the three alternatives, and keeping the original definition of a Data Bubble, we cannot guarantee that our representative will be close enough to the “center” of the data-set to lead to small errors. On the other hand, having a representative close to the center of a Data Bubble is not an objective in its own right for hierarchical clustering; only the distance notions listed above are important. As we will see in the next subsection, we can in fact compensate for a less centered representative by applying a new and much more sophisticated distance function for Data Bubbles. Representatives that are not close to the center of a data-set will only lead to an error in the clustering result when using the original idea of the extent of a Data Bubble around a representative and the original definition of distance that is based on this extent. Therefore, we choose alternative 1 and keep the initial sample object as the representative of a non-vector metric Data Bubble, which has no computational overhead.
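For reference, the exhaustive medoid of Definition 2 and the cheaper sub-sample relocation of alternative 2 can be sketched as follows; this illustrates the O(n²) versus O(s²) trade-off, with illustrative names.

```python
import random

# Sketch of the exhaustive medoid from Definition 2 (O(n^2) distance
# computations) and the "relocation using a sub-sample" variant of
# alternative 2 (O(s^2) on a sub-sample of size s).

def medoid(objects, dist):
    # object minimizing the sum of distances to all objects in the set
    return min(objects, key=lambda m: sum(dist(m, o) for o in objects))

def subsample_medoid(objects, dist, frac=0.05, seed=0):
    s = max(1, int(frac * len(objects)))
    sub = random.Random(seed).sample(objects, s)
    return medoid(sub, dist)
```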
Average k-nn Distances, Core-Distance, and Virtual Reachability Distance

The estimation of the average k-nearest neighbor distances nnDist(k, B) for a Data Bubble B is closely related to the core-distance and the virtual reachability distance of B. The nearest neighbor distance is also used in the original definition of the distance between Data Bubbles. Because there is no notion of volume and dimensionality in a non-vector metric space, we cannot apply a simple function to calculate the average k-nearest neighbor distances as in a vector space. When constructing Data Bubbles for non-vector metric data, we have alternatives for estimating the average k-nearest neighbor distances similar to those for the selection of a representative object. Using a sub-sample of the objects in a Data Bubble and computing the k-nearest neighbor distances only in this sub-sample is, however, not an option: they would very likely be highly overestimated, because the point density of the samples is usually much smaller than the density of the original data points, as the samples are only a small portion of the original data-set.

1. “k-nn distances w.r.t. the initial sample object”: when assigning objects to Data Bubbles, maintain a list of the k smallest distances relative to each initial sample object. For each Data Bubble, simply use the k-nn distances to its representative as the estimation of the average k-nn distances in the whole Data Bubble.

2. “k-nn distances w.r.t. several reference objects”: keep a number of objects from a Data Bubble in main memory (e.g., the first m objects assigned to the Data Bubble) and compute distances to these objects for all objects that are assigned to the Data Bubble. For each of the reference objects, maintain a list of the k smallest distances. After the generation of the Data Bubble, compute an estimation of the average k-nearest neighbor distances by combining those values.
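Alternative 1 above, incrementally maintaining the k smallest distances to the initial sample object, could be implemented with a bounded max-heap, as in this illustrative sketch (class and method names are hypothetical).

```python
import heapq

# Sketch of alternative 1: while assigning objects to a bubble, keep the
# k smallest distances to the initial sample object, using a max-heap of
# size k (stored as negated values on Python's min-heap).

class KSmallest:
    def __init__(self, k):
        self.k, self.heap = k, []

    def add(self, d):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, -d)
        elif d < -self.heap[0]:               # smaller than current k-th
            heapq.heapreplace(self.heap, -d)

    def nn_dist(self, j):
        """Estimated j-nn distance: the j-th smallest recorded distance."""
        return sorted(-x for x in self.heap)[j - 1]
```

Each assignment costs O(log k) extra time on top of the distance computation that is performed anyway.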
As for the selection of the representative objects, the first alternative has no significant computational overhead, since the distances to the initial sample objects have to be computed anyway. The computational cost and the improvement in the estimation of the k-nn distances for the second alternative depend on the number of reference objects that are kept in main memory. As before, if we keep too many reference objects, the gain in accuracy will not outweigh the increased number of distance computations; for the same number of additional distance computations, we may be able to get a better result by simply taking a larger sample size to begin with. The important question regarding the average k-nn distances in a Data Bubble is: for which values of k do we need the estimations, and how accurate do they have to be? The most important use of the k-nn distances is for estimating the core-distance of a Data Bubble. The core-distance also defines the virtual reachability value for a Data Bubble, which is used when “expanding” a Data Bubble in the clustering output. Typically, we do not want to use values for minPts that are too small, in order to avoid the Single-Link effect and to reduce the effect of noise (see [3] for details). In practice, we mostly use values significantly larger than 5 for larger data-sets, and we may consider minPts values in the range of 5 only for relatively small data-sets. To estimate the core-distance and virtual reachability distance of a Data Bubble, we therefore only need k-nn distances for the larger values k = minPts that we want to use for clustering. Fortunately, the larger values of the average k-nn distance in a Data Bubble can be estimated with small errors using only the distances to the initial sample object. In fact, the larger k is, the more accurate the estimation using only the initial sample object (or any other reference object, or the average over several reference objects) will be.
Only for very small values of k, especially for k = 1, is the actual value nnDist(1, B) for most of the objects in B quite different from the average nearest neighbor distance. The nearest neighbor distance in a Data Bubble B, nnDist(1, B), is only used in the original definition of the distance between Data Bubbles, which we will not use for non-vector metric data for other reasons. Therefore, we do not need the more error-prone estimation of nnDist(k, B) for very small values of k. And, since the use of only a few reference objects does not significantly improve the result for the larger values of k, we again choose the more efficient alternative 1 to estimate the k-nn distances (up to the maximum value of minPts), i.e., we use only the initial sample objects and the distances that we have to compute during the construction of the Data Bubbles anyway. Using the estimation of the k-nearest neighbor distances and the core-distance of a Data Bubble, the virtual reachability distance of the Data Bubble (the distance needed for the expansion of the reachability plot after clustering) is then defined similarly as in [9]:

DEFINITION 3. Let B be a Data Bubble. The virtual reachability and core-distance of B are defined using the estimated k-nn distances, nnDist(k, B), as follows:

virtualReachability(B) = coreDist(B) = nnDist(minPts, B).

The Distance Between Data Bubbles

The distance between Data Bubbles in [9] is based on the extent of the Data Bubbles, as illustrated above in Figure 2.2. The purpose of the extent of a Data Bubble is to make it possible to define the distance between Data Bubbles as the distance between their borders, which are approximated by the extents. However, the extent as the average pairwise distance in a Data Bubble is expensive to estimate, since there are no supporting statistics that could be collected incrementally while constructing a Data Bubble.
The only option to get an estimation of the average pairwise distance would be to draw a sub-sample of objects from a Data Bubble and compute all pairwise distances in this sub-sample. The accuracy of this approach depends on the size of the sub-sample. The resulting value could be used as a radius around the representative within which most of the objects of the Data Bubble are supposed to be located. Since this is the intended interpretation of the extent, we could alternatively use only the distances to the representative and maintain incrementally a radius around the representative so that "most" objects of the Data Bubble fall inside this radius (similar to maintaining the k-nn distances). The second alternative for estimating a global extent is much more efficient but also much more error-prone, since it is very sensitive to outliers. To work properly, both approaches have to assume (in addition to having a small error) that the representative is close to the center, which is difficult to guarantee in a non-vector metric space. In fact, errors in choosing representatives and errors in the estimation of the extent amplify each other, resulting in large errors in the clustering result, because the distances between Data Bubbles will be heavily distorted. As a solution, we propose a new definition of the distance between Data Bubbles, which is based on simple statistics using only the distances between objects and sample objects (which are computed when constructing a Data Bubble anyway). All the needed notions can be maintained incrementally and without significant computational overhead. Conceptually, in order to compensate for an error in the selection of the representatives, we want to distinguish the extent of a Data Bubble around its representative in different directions - "direction" being defined using only distances between the representative and other objects.
For instance, if a representative is not centered well in a Data Bubble, the distances to the "border" of the Data Bubble may be very different in different "directions". Figure 2.4 illustrates this concept using a 2-dimensional Data Bubble B where the extent e1 from the representative rB in the direction of object o1 is much smaller than the extent e2 in the direction of object o2. The notions required to formalize these intuitions are introduced in the following. Please note that all concepts will be defined without any reference to vector space properties or operations, and that although we will use 2-dimensional point data to illustrate the concepts, the notions are solely based on distances between objects.

Figure 2.4: "Directional extent" of a Data Bubble.

In order to define more accurate distances between Data Bubbles, the goal is to find a more accurate representation of the "border" of a Data Bubble. However, we only need to know the distance between the representative and the border, i.e., the extent of a Data Bubble, in the directions of the (representatives of) other Data Bubbles. Intuitively, given any two Data Bubbles, A and B, and their representatives, rA and rB, we can divide the Data Bubble B into two parts with respect to Data Bubble A: one part contains the objects in B that are "in the direction of A" in the sense that the distance between them and the representative of A, rA, is smaller than the distance between the two representatives rA and rB; the second part of B contains the other objects, which are said to be "in the reverse direction of A". Formally:

DEFINITION 4. Let A and B be two sets of objects, represented by rA and rB, respectively.

• Bubble(B).InDirection(A) = {o ∈ B | D(o, rA) ≤ D(rA, rB)}. For each object o ∈ Bubble(B).InDirection(A) we say that o lies in the direction of A.
• Bubble(B).InRevDirection(A) = {o ∈ B | D(o, rA) > D(rA, rB)}. For each object o ∈ Bubble(B).InRevDirection(A) we say that o lies in the reverse direction of A.

Figure 2.5 illustrates these notions: all objects o ∈ B that lie inside the circle having rA as center and the distance between rA and rB as radius are in the direction of A, i.e., in Bubble(B).InDirection(A); objects o′ ∈ B that are outside the circle lie in the "reverse" direction of A, i.e., in Bubble(B).InRevDirection(A).

Figure 2.5: Illustration of direction and reverse direction.

Following a similar intuition, the next notion we define is the border distance of a Data Bubble in the direction of another Bubble:

DEFINITION 5. Let A and B be two sets of objects, represented by rA and rB, respectively. The border distance of B in the direction of A is defined as Bubble(B).borderDistInDirection(A) = D(rA, rB) − min_{o∈B} D(o, rA).

The border distance of B in the direction of A is thus the distance between the two representatives minus the distance between the representative of A, rA, and the object o in B that is closest to rA. Figure 2.6 illustrates our estimation of the border distance of a Data Bubble B in the direction of another Bubble A.

Figure 2.6: Illustration of border distance.

A consequence of our definition of border distance is that - in contrast to what can happen in the original Data Bubbles - the extents of two Data Bubbles can never overlap, i.e., the distance from a representative rB of a Data Bubble B to its "border", in the direction of a Data Bubble A with representative rA, can never be larger than half the distance between the two representatives:

LEMMA 1. Given two Data Bubbles A and B with representatives rA and rB, respectively, let Bubble(B).borderDistInDirection(A) be the border distance of B in the direction of A. If the distance function is a metric, i.e., satisfies the triangle inequality, then Bubble(B).borderDistInDirection(A) ≤ D(rA, rB)/2.

Proof.
Suppose the border distance is greater than D(rA, rB)/2. It follows that D(o*, rA) < D(rA, rB)/2, where o* = argmin_{o∈B} D(o, rA). And because o* ∈ B, by construction of B it must hold that D(o*, rB) ≤ D(o*, rA). But then it follows that D(o*, rA) + D(o*, rB) ≤ 2 · D(o*, rA) < D(rA, rB). This inequality violates the assumption that the triangle inequality holds. Hence Bubble(B).borderDistInDirection(A) ≤ D(rA, rB)/2.

Our definition serves well in a "normal" situation of well-separated bubbles as depicted in Figure 2.7, where the distance between the "borders" of the bubbles gives a good estimate of the true distance between the point sets.

Figure 2.7: "Normal" Data Bubble separation.

In practice, some situations can occur where a Data Bubble may contain a "gap" in a particular direction. This can occur if the Data Bubbles represent points from different clusters but their representatives happen to be close enough so that one Data Bubble contains points from both clusters. Figure 2.8 illustrates such a case. These situations can lead to errors in the clustering result because the difference between the borders, and hence the distance between Data Bubbles, may be underestimated. As a consequence, cluster separations may be lost. For vector Data Bubbles, this problem does not occur as frequently as for non-vector Bubbles, since the extent is estimated by the average pairwise distance, which in the case depicted in Figure 2.8 would still be close to the true extent of B (which is much smaller than the depicted directional border distance of B).

Figure 2.8: A "gap" in Data Bubble B. Since the cluster in Bubble A spreads across the center line between the representatives rA and rB, Bubble B contains points from two clusters. In B, the border distance in the direction of A is larger than in the reverse direction of A.

In order to detect these and similar situations, we maintain certain statistics with respect to the distances of objects in a Data Bubble B to its representative rB.
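A minimal sketch of Definitions 4 and 5 (with hypothetical helper names and a plain distance function `D`, not code from the thesis) also makes the bound of Lemma 1 easy to check numerically:

```python
def split_by_direction(B, rA, rB, D):
    """Definition 4: objects of B in direction of A vs. reverse direction."""
    d_reps = D(rA, rB)
    toward = [o for o in B if D(o, rA) <= d_reps]
    away = [o for o in B if D(o, rA) > d_reps]
    return toward, away

def border_dist_in_direction(B, rA, rB, D):
    """Definition 5: D(rA, rB) minus the distance from rA to the
    object of B that is closest to rA."""
    return D(rA, rB) - min(D(o, rA) for o in B)

# 1-dimensional example with the metric D(x, y) = |x - y|; every object
# of B is closer to rB than to rA, as guaranteed by the assignment step.
D = lambda x, y: abs(x - y)
rA, rB = 0.0, 10.0
B = [6.0, 7.0, 9.0, 12.0]
```

Here the border distance of B in the direction of A is 10 − 6 = 4 ≤ 5 = D(rA, rB)/2, as Lemma 1 guarantees.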
The values we compute when constructing a Data Bubble are: 1) the average distance of the objects in B in the direction of each other Data Bubble (and the reverse directions), and 2) the standard deviation of the distances in all different directions.

DEFINITION 6. Let A and B be two sets of objects, represented by rA and rB, respectively. Let BA = Bubble(B).InDirection(A) denote the set of objects in B that lie in the direction of A and let BrevA = Bubble(B).InRevDirection(A) denote the set of objects in B that lie in the reverse direction of A.

• Bubble(B).aveDistInDirection(A) = (Σ_{o∈BA} D(o, rB)) / |BA|.
Bubble(B).aveDistInDirection(A) is the average distance between the representative of B and the objects in B that lie in the direction of A.

• Bubble(B).aveDistInRevDirection(A) = (Σ_{o∈BrevA} D(o, rB)) / |BrevA|.
Bubble(B).aveDistInRevDirection(A) is the average distance between the representative of B and the objects in B that lie in the reverse direction of A.

DEFINITION 7. Let A and B be two sets of objects, represented by rA and rB, respectively. Let again BA = Bubble(B).InDirection(A) and BrevA = Bubble(B).InRevDirection(A). Let furthermore distBA = Bubble(B).aveDistInDirection(A) and distBrevA = Bubble(B).aveDistInRevDirection(A).

• Bubble(B).stdevInDirection(A) = sqrt((Σ_{o∈BA} (D(o, rB) − distBA)²) / |BA|).
Bubble(B).stdevInDirection(A) is the standard deviation of the distances between the representative of B and the objects in B that lie in the direction of A.

• Bubble(B).stdevInRevDirection(A) = sqrt((Σ_{o∈BrevA} (D(o, rB) − distBrevA)²) / |BrevA|).
Bubble(B).stdevInRevDirection(A) is the standard deviation of the distances between the representative of B and all objects in B lying in the reverse direction of A.

Figure 2.9: Examples for a "gap" in a Data Bubble B. (a) The average distance of B in the direction of A is larger than in the reverse direction. (b) The standard deviation of B in the direction of A is much larger than in the reverse direction.
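The directional averages and standard deviations of Definitions 6 and 7 reduce to simple per-direction statistics over the distances to rB. A sketch (hypothetical helper name; the population standard deviation matches the definitions, which divide by the set size):

```python
from statistics import mean, pstdev

def directional_stats(B, rA, rB, D):
    """Definitions 6 and 7: the average distance to rB and its
    (population) standard deviation, separately for the objects of B
    in direction of A and in the reverse direction of A."""
    d_reps = D(rA, rB)
    toward = [D(o, rB) for o in B if D(o, rA) <= d_reps]
    away = [D(o, rB) for o in B if D(o, rA) > d_reps]
    # Omit a direction entirely if it contains no objects.
    return {name: (mean(ds), pstdev(ds))
            for name, ds in (("inDirection", toward),
                             ("inRevDirection", away)) if ds}
```

In the actual construction these sums of distances and squared distances would be accumulated incrementally during the assignment step; the sketch recomputes them for clarity.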
The "directional" versions of the border distance, the average distance and the standard deviation of the distances help us to detect "gaps" in a Data Bubble in many cases. The situation depicted in Figure 2.8, e.g., is indicated by the fact that the border distance of Bubble B in the direction of A is not only much larger than in the reverse direction but also much larger than the average distance in the direction of A. Two other examples of a "gap" in a bubble are given in Figure 2.9. In order to avoid overestimating the extent of a Data Bubble (and consequently underestimating the distance between Data Bubbles) in the presence of "gaps", we introduce a refined notion of the "directional" border distance, which we call the "directional" extent of the Data Bubble.

DEFINITION 8. Let A and B be two sets of objects, represented by rA and rB, respectively. Let Ave and Stdv be the average and the standard deviation, respectively, of the distances in B in the direction of A or the reverse direction of A - whichever is smaller. The extent of B in the direction of A, Bubble(B).extentInDirection(A), is then defined as
Bubble(B).extentInDirection(A) = min{Bubble(B).borderDistInDirection(A), Ave + 2 · Stdv}.

Basically, the (directional) extent of a Data Bubble is either the (directional) border distance, or the (directional) average distance plus two times the (directional) standard deviation - whichever is smaller. Taking the average distance plus two times the standard deviation is a way to estimate a ("directional") border around the representative that will include most of the points within that limit. Having a good estimation of the extent of a Data Bubble in a certain direction, we can define the distance between two Data Bubbles simply as the distance between their representatives minus their directional extents.

DEFINITION 9. Let A and B be two sets of objects, represented by rA and rB, respectively.
The distance between A and B is defined as
D(A, B) = D(rA, rB) − Bubble(A).extentInDirection(B) − Bubble(B).extentInDirection(A).

In summary, our new method for constructing a collection of k Data Bubbles for non-vector metric data is as follows:

1. Draw randomly a sample of k objects from the database. Each sample object will be the representative object rBi for one of the k Data Bubbles Bi (i = 1, . . . , k).

2. Classify, i.e., assign each object in the database to the closest representative object rB in the set of objects obtained in step 1, and maintain incrementally the following information about each Data Bubble B:
• The distances to the k-nearest neighbors of the representative object rB, up to a value k = minPts. These k-nearest neighbor distances, nnDist(k, B), are used to define the core-distance and virtual reachability as in [9], i.e., coreDist(B) = virtualReachability(B) = nnDist(minPts, B).
• Relative to each other Data Bubble A:
(a) Bubble(B).borderDistInDirection(A);
(b) the average distance and standard deviation in the direction of A and in the reverse direction of A.

3. Compute the extent of each Data Bubble B in the direction of every other Data Bubble A:
Ave = min{Bubble(B).aveDistInDirection(A), Bubble(B).aveDistInRevDirection(A)};
Dev = min{Bubble(B).stdevInDirection(A), Bubble(B).stdevInRevDirection(A)};
Bubble(B).extentInDirection(A) = min{Bubble(B).borderDistInDirection(A), Ave + 2 · Dev}.

After the construction of the Data Bubbles and the computation of the directional extent values, hierarchical algorithms such as the Single-Link method can be applied to the non-vector Bubbles by using the distance between Data Bubbles defined in Definition 9. The clustering algorithm OPTICS is based on the notion of reachability distance. For point objects, the reachability distance of an object o1 relative to an object o2 was defined as the maximum of D(o1, o2) and coreDist(o2) (see [3] for details).
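Step 3 and Definitions 8 and 9 can be combined into a compact sketch (hypothetical helper names; for brevity the distances are recomputed here rather than maintained incrementally as in the actual construction, and an empty direction is simply skipped, which is an assumption of this sketch):

```python
from statistics import mean, pstdev

def extent_in_direction(B, other_rep, own_rep, D):
    """Definition 8 / step 3: the smaller of the directional border
    distance and Ave + 2 * Dev, where Ave and Dev are the minima over
    the direction of the other bubble and its reverse direction."""
    d_reps = D(other_rep, own_rep)
    border = d_reps - min(D(o, other_rep) for o in B)
    toward = [D(o, own_rep) for o in B if D(o, other_rep) <= d_reps]
    away = [D(o, own_rep) for o in B if D(o, other_rep) > d_reps]
    sides = [s for s in (toward, away) if s]  # skip an empty direction
    ave = min(mean(s) for s in sides)
    dev = min(pstdev(s) for s in sides)
    return min(border, ave + 2 * dev)

def bubble_distance(A, rA, B, rB, D):
    """Definition 9: distance between the representatives minus both
    directional extents."""
    return (D(rA, rB)
            - extent_in_direction(A, rB, rA, D)   # extent of A towards B
            - extent_in_direction(B, rA, rB, D))  # extent of B towards A
```

Because each extent is bounded by the directional border distance, Lemma 1 guarantees that the resulting distance is never negative.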
For Data Bubbles, the notion of the reachability distance of a Data Bubble B1 relative to a Data Bubble B2 can be defined analogously:

DEFINITION 10. Let A and B be two Data Bubbles. The reachability distance of B relative to A is defined as reachDist(B, A) = max{D(A, B), coreDist(A), coreDist(B)}.

In the following, we will show that this definition is an improved version of the definition used in [9], which estimates the reachability distance of a hypothetical object o in B1 in the direction of B2, relative to an object in B2 in the direction of B1. Analogously to the definition of the reachability distance for points, if the two bubbles are far apart, the reachability distance will be equal to the distance between the bubbles. If the bubbles are very close to each other, which is indicated by at least one of the core-distances being larger than the distance between the bubbles, the hypothetical object o can be considered as being located at the border of both Data Bubbles, and we estimate its core-distance by the larger of the two core-distances, which in turn is used as the estimation of the reachability distance. The definition given in [9] only considers the core-distance of the Bubble A when estimating the reachability distance of Bubble B relative to A. The old definition significantly underestimates the reachability value for points at the border of B if A is relatively dense in its center and close to a less dense Bubble B (resulting in errors in the clustering structure). Furthermore, our definition allows a much easier integration of Data Bubbles into OPTICS than the original method.

2.3 Performance Evaluation

In this section, we perform an extensive experimental evaluation of our Data Bubbles.
The results show that (1) our new method is highly scalable; (2) it produces reliable results even for very small numbers of Data Bubbles; (3) it is significantly better than random sampling; and (4) it even has an improved accuracy compared to the original Data Bubbles when applied to some vector data.

2.3.1 Data-sets and Experimental Setup

To evaluate the performance of our new method for non-vector Data Bubbles, in particular the ability to discern clusters close to each other, we compared our method with the original Data Bubbles method on the following data-sets. First, a synthetic 2-dimensional point data-set (Figure 2.10(a)) with Euclidean distance (L2), called DS-Vector, is used to show that even for Euclidean vector spaces the new version of Data Bubbles (which only uses the distance information and none of the vector space properties) outperforms the original Data Bubbles (for vector spaces) proposed in [9]. The reachability plot obtained when clustering the whole data-set using OPTICS is depicted in Figure 2.10(c). The data-set contains 50,000 points distributed over 8 clusters and 4% background noise. The eight clusters have similar sizes and most of them are located very close to each other, as can be seen in Figure 2.10(a). Therefore, this data-set is a good test bed for evaluating the new directional definition of extent and the heuristics to handle gaps in bubbles. The second data-set, called DS-UG, is another synthetic 2-dimensional vector data-set (Figure 2.10(b)) that contains 100,000 points in both uniformly distributed and Gaussian-distributed clusters. Figure 2.10(d) shows the reachability plot of the OPTICS clustering result on DS-UG. We used this test data to study the performance of the new and old Data Bubbles on different types of clusters. The third data-set, called DS-Tuple, which we use to evaluate the relative performance of our non-vector Data Bubbles, is a synthetic set of 50,000 binary strings.
Each object of DS-Tuple is a 100-bit 0/1 sequence and represents a set (given 100 items, if a position in the sequence has value one, then the item corresponding to that position is in the set). The similarity between two such sequences s1 and s2 is measured using the Jaccard coefficient on the corresponding sets, i.e., |s1 ∩ s2|/|s1 ∪ s2|. 80% of the objects form 10 clusters and the remaining 20% are noise. Two of the clusters are very small (123 and 218 objects), making the problem of finding them very challenging for data summarization techniques. The reachability plot obtained when clustering the whole data-set using OPTICS is depicted in Figure 2.10(e) (the two tiny clusters are indicated by arrows). The fourth data-set, used to illustrate the practical relevance of our method, is a real data-set containing RNA sequences. The application and the data-set, called DS-Real, are explained in more detail in Section 2.3.4.

Figure 2.10: Reachability plots for the whole synthetic data-sets used for the evaluation: (a) DS-Vector, (b) DS-UG, (c) reachability plot for DS-Vector, (d) reachability plot for DS-UG, (e) reachability plot for DS-Tuple.

The values reported in the following sections are average values over 100 repetitions of each experiment. In order to measure the quality of the hierarchical clustering results based on data summarizations, we used the following two measurements. We designed the first scoring scheme and denote it by N-score. This score emphasizes the ability of the tested clustering algorithm to fully recover all the clusters (big and small) in the data-set. For each reachability plot obtained for data summarizations, we apply the following heuristic to select the best cut-line through the diagram.
Specifically, we evaluate 40 different cut-lines through a reachability plot in equidistant intervals, and select the cut-line that corresponds most closely to the clustering obtained for the whole data-set. This cut-line is assigned a score based on the number of clusters that are present with respect to this cut through the reachability plot. If n clusters are found (0 ≤ n ≤ n_max, the maximum number of clusters in the original data-set), then the cut-line gets a score of n/n_max. If n is greater than n_max, it gets a score of 0. Hence missing clusters, finding spurious clusters, and splitting clusters all incur penalties. The intention of this score is to especially penalize those results where the algorithm produces structures that do not exist in the original data-set and which may lead to misinterpretations of a data-set. This scoring scheme favors the Data Bubbles for vector data over our new Data Bubbles for non-vector data, since our new method is designed to recover gaps, the cluster structures most likely missing in the result of the old Data Bubble method. The second measurement we applied to the clustering results is the F-score measurement [30]. Similar to the first measurement, we used a cut-line to determine the borders of clusters. Denote by X the set of objects in a cluster of the approximate clustering result, by X = {X} the set of all clusters X, and by T the set of objects in a cluster of the OPTICS result. Let N1 = |T ∩ X|, N2 = |X|, N3 = |T ∪ X|, p = N1/N2 and r = N1/N3. Then the F-score for T is defined as F(T) = max_{X∈X} {2pr/(p + r)}. The overall F-score for an approximate clustering result is the weighted average of the F-scores for each T in the OPTICS result. We used multiple cut-lines to cut through the reachability plot of a clustering result, and report the maximal overall F-score as the accuracy. The higher the F-score, the better the result.
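The overall F-score can be computed directly from the two clusterings, following the definitions above verbatim (in particular, r here uses the union |T ∪ X|, as defined in this chapter). A sketch with hypothetical names:

```python
def overall_f_score(optics_clusters, approx_clusters):
    """Weighted average over the OPTICS clusters T of the best
    2pr/(p+r) over approximate clusters X, with p = |T∩X|/|X| and
    r = |T∩X|/|T∪X| as defined above; weights are the cluster sizes."""
    total = sum(len(T) for T in optics_clusters)
    score = 0.0
    for T in optics_clusters:
        Tset, best = set(T), 0.0
        for X in approx_clusters:
            Xset = set(X)
            n1 = len(Tset & Xset)
            if n1 == 0:
                continue  # disjoint clusters contribute F = 0
            p = n1 / len(Xset)
            r = n1 / len(Tset | Xset)
            best = max(best, 2 * p * r / (p + r))
        score += len(T) * best / total
    return score
```

A perfectly matching clustering yields 1.0; merging two reference clusters of equal size into one approximate cluster, for example, halves the score.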
All experiments were performed on an AMD Athlon-XP 1800+ workstation with 512 MB of memory.

2.3.2 Comparison with Original Data Bubbles

First, we compare our new method with the original Data Bubble method using the vector data-set DS-Vector. The N-scores for both methods when increasing the number of Data Bubbles are depicted in Figure 2.11. The results clearly show that our new directional definition of extent and the heuristics to handle gaps in Data Bubbles lead to a better quality of the results, even when not using any of the vector space properties.

Figure 2.11: New versus old method on DS-Vector using N-score.

In Figure 2.12, when we applied the F-score to the clustering results of DS-Vector, we saw a similar trend as in Figure 2.11: our new Data Bubbles designed for non-vector data outperform the old Data Bubbles for vector data on DS-Vector.

Figure 2.12: New versus old method on DS-Vector using F-score.

We also tested the new and old Data Bubble methods on the data-set DS-UG, which contains more complicated cluster structures. On this vector data-set, we acknowledge that the original Data Bubble method for vector data has better F-scores than our new method for non-vector data, as shown in Figure 2.13. Compared with the result on DS-Vector, which consists purely of uniformly distributed clusters and random noise, we conjecture that the original method may have a better ability to handle clusters with a Gaussian distribution. How to improve the performance of Data Bubbles for non-vector data on Gaussian clusters is a direction for future study.

2.3.3 Comparison with Random Sampling

In this section we apply our method to the non-vector metric data-set DS-Tuple.
We evaluate both the quality and the runtime benefits of non-vector Data Bubbles. The first set of experiments shows the relative performance of our method compared to a random sampling approach, which just clusters a random sample of the data-set and then assigns every object in the database to the closest sample object.

Figure 2.13: New versus old method on DS-UG using F-score.

This approach represents a baseline method, which is in fact difficult to beat since it is very efficient. If a method is computationally relatively expensive in the construction of its data summarizations (such as the BIRCH-based methods BUBBLE and BUBBLE-FM), random sampling can be more effective since it can use the same amount of resources to simply cluster a larger sample. Figure 2.14 shows the result of the comparison on DS-Tuple. The quality of the plots created by our method is consistently better than that of random sampling. For example, when using 50 bubbles/samples, our method is almost always able to recover all the significant clusters and finds one of the 2 small ones regularly, while random sampling recovers only 7 of the 8 big clusters quite consistently. Two example plots are depicted in Figure 2.14(a). Figure 2.14(b) shows the average N-score for both methods when varying the number of bubbles/samples. We obtain an up to 40% better quality when using very high compression rates (low numbers of bubbles). In general, we consistently obtain a score that can be matched by random sampling only when the number of sample points is at least twice the number of Data Bubbles (this rate increases with the number of bubbles/samples). Figure 2.14(c) shows the average overall F-score for both methods when changing the number of bubbles/samples.
Similar to Figure 2.14(b), it shows that our Data Bubble method outperforms random sampling consistently by a large margin, especially when the number of bubbles/samples is small. Figure 2.14(d) shows the scale-up of both methods with respect to the number of bubbles/samples. Both methods scale linearly.

Figure 2.14: Non-vector Data Bubbles versus random sampling on the DS-Tuple data-set: (a) reachability plots for bubbles and sampling, (b) quality w.r.t. N-score, (c) quality w.r.t. F-score, (d) runtime.

They both have the same number of distance computations and hence their runtimes are very close to each other, especially when the sampling rate is low. Random sampling is slightly faster when using a large sample rate and a relatively cheap distance function (the Jaccard coefficient in this case). In real applications, however, the distance function is typically much more expensive (e.g., a sequence alignment score as in our real-life data-set DS-Real), and the runtime of both methods will be dominated heavily by the distance computations alone (e.g., 638 seconds for Data Bubbles versus 635 seconds for sampling on the DS-Real data-set).

Figure 2.15: Scale-up and speed-up w.r.t. number of objects on DS-Tuple: (a) scale-up, (b) speed-up.
Figure 2.15 shows the absolute runtime and the speed-up factors (compared to OPTICS on the whole database) when varying the database size. The databases used are subsets of DS-Tuple. Our algorithm, using 100 Data Bubbles, scales approximately linearly with the size of the database, and we achieve, as expected, large speed-up values over OPTICS: between 77 and 400. Note that this speed-up is also dependent on the distance function, and for more expensive distance functions the expected speed-up will be even much larger.

2.3.4 An Application to a Real-life Data-set

The RNase P Database [47] is a compilation of ribonuclease P (RNase P) sequences and other information. In the last few years, the number and diversity of the RNase P RNA sequences available have increased significantly, and analyzing this data-set has become an important issue. Cluster analysis can help to detect functional subgroups in this data-set and to understand the evolutionary relationships between the sequences. In this application, we used global sequence alignment under the standard affine gap penalty scoring scheme (used in BLAST) to cluster the database of 481 sequences. The OPTICS result for the whole data-set (DS-Real) is shown in Figure 2.16(a). Figure 2.16(b) shows a typical result using 50 Data Bubbles.

Figure 2.16: Results for DS-Real: (a) result of OPTICS, runtime = 6578 seconds; (b) result of OPTICS-Bubbles, runtime = 638 seconds.

It is easy to verify that the results are very similar. The clustering structure corresponds mostly to the already known evolutionary relationships and matches well with the annotations in the database. An exception is the Bacteria.Gamma family, which is partitioned into two sub-groups that are both mixed with respect to the existing annotations of the sequences. This is an interesting finding that is currently being investigated in more detail.

2.4 Summary

In this chapter, we presented a new data summarization method for non-vector metric data.
The method uses only distance information and introduces the novel concept of a directional extent of a set of objects. We showed that the distance between bubbles based on this notion of extent even improves upon Data Bubbles when applied to vector data. An extensive performance evaluation also showed that our method is more effective than a random sampling approach, using only a very small number of data summarizations, and thus resulting in a large reduction of runtime (up to 400 times faster than OPTICS) while trading away only very little clustering quality. The method allows us to obtain results even for data-sets where clustering the whole data-set is infeasible because of the prohibitive cost of the distance function.

Chapter 3
Speed-up Clustering with Pairwise Ranking

Many clustering algorithms, in particular hierarchical clustering algorithms, do not scale up well for large data-sets, especially when using an expensive distance function. In this chapter, we propose a novel approach to perform approximate clustering with high accuracy. We introduce the concept of a pairwise hierarchical ranking to efficiently determine close neighbors for every data object. We also propose two techniques to significantly reduce the overhead of ranking: 1) a frontier search rather than a sequential scan in the naïve ranking to reduce the search space; 2) based on this exact search, an approximate frontier search for pairwise ranking that further reduces the runtime. Empirical results on synthetic and real-life data show a speed-up of up to two orders of magnitude over OPTICS while maintaining a high accuracy, and up to one order of magnitude over the previously proposed Data Bubbles method, which also tries to speed up OPTICS by trading accuracy for speed. The remainder of this chapter is organized as follows.
In Section 2 we introduce background knowledge, including the OPTICS clustering algorithm; in Section 3, we state the motivation for the new method; Section 4 discusses the idea of ranking; in Section 5, we introduce our new ranking method; in Section 6, we compare our method empirically with previous methods; finally, we summarize this chapter in Section 7. (Some of the material in this chapter has been published in [57].)

3.1 Preliminaries

3.1.1 Three Major Clustering Approaches

Clustering algorithms can be categorized based on how they cluster the data objects. In this subsection we briefly introduce three algorithms representing the major categories: partitioning, hierarchical and density-based approaches. For a complete description of all categories, see [24]. The partitioning approach is represented by the k-means algorithm. This approach selects a set of centers and partitions the data-set by assigning data objects to their nearest center. The centers are then adjusted according to the objects in each group, and the assignment process is repeated to refine the result. Each group of objects assigned to a center is considered a cluster. The hierarchical approach is represented by the Single-Link algorithm. Starting from groups of individual data objects (one data object per group), the method agglomerates the two nearest groups into a new group. The final result is a hierarchical ordering of all data objects that shows the process of the agglomeration. The density-based approach is represented by the DBSCAN algorithm [18]. The method estimates the density of the region around each data object by counting the number of neighboring objects within a given radius. It then connects dense regions to grow them into clusters. Our algorithm can be slightly modified to apply to other clustering methods such as Single-Link and DBSCAN.
In this chapter we focus on OPTICS, which is a hierarchical clustering algorithm that uses density-based concepts to measure the dissimilarity between points, and is described in more detail below.

3.1.2 Triangle Inequalities in Metric Spaces

The triangle inequality can be used in a technique called pruning to avoid distance computations in data retrieval operations and data-mining applications that require distance computations. To apply the pruning technique, typically the distances between a selected small set of objects P and all other objects o in a data-set are pre-computed in a preprocessing step. The objects p ∈ P are called "pivots" or "reference points" in the literature [12, 25]. In a range query, for example, a query object q is given and the task is to find objects within a given query radius r from q. For any data object o and pivot p, by a derived form of the triangle inequality, it holds that D(q, o) ≥ |D(q, p) − D(p, o)|. Therefore, at query time, the distance D(q, p) is also computed in order to determine whether |D(q, p) − D(p, o)| > r for each object o. If this condition is true for o, then it follows that D(q, o) > r, and o can safely be excluded (pruned away) without actually computing the distance D(q, o). The triangle inequality has been incorporated in several indexing methods for metric data, for instance the M-Tree [13]. It can lead to a substantial saving of distance computations in low dimensional spaces and in metric spaces that can be mapped to a low dimensional space. In high dimensional spaces and general metric spaces, however, its effectiveness deteriorates dramatically [25, 12]. Compared with sampling and with methods such as BIRCH and Data Bubbles, the advantage of using triangle inequalities is that they can provide additional speed-up for virtually any method on metric data (including our method), and that pruning is an exact method that loses no accuracy.
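To make the pruning idea concrete, here is a minimal Python sketch (ours, not from the thesis; the function name is hypothetical, and a cheap Euclidean distance stands in for an expensive metric only for illustration). It precomputes pivot-object distances and, at query time, skips every object whose triangle-inequality lower bound already exceeds the query radius:

```python
def range_query_with_pruning(query, radius, data, pivots, dist):
    """Range query using pivot-based pruning (illustrative sketch).

    Since D(q, o) >= |D(q, p) - D(p, o)| for every pivot p, an object
    whose lower bound exceeds the radius is excluded without ever
    computing the (expensive) distance D(q, o).
    """
    # Preprocessing step: distances between every pivot and every object.
    pivot_dists = {p: {o: dist(p, o) for o in data} for p in pivots}
    # Query time: only |P| extra distance computations for the query itself.
    q_to_pivot = {p: dist(query, p) for p in pivots}
    result = []
    for o in data:
        lower_bound = max(abs(q_to_pivot[p] - pivot_dists[p][o])
                          for p in pivots)
        if lower_bound > radius:
            continue  # pruned: D(query, o) > radius is guaranteed
        if dist(query, o) <= radius:
            result.append(o)
    return result
```

In a real system the pivot distances would of course be computed once ahead of time, not per query.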
3.2 Motivation

Although the OPTICS algorithm without index support requires the computation of O(n^2) distances, its final result depends largely on the minPts-nearest neighbor distances only. Some large distances (larger than typical minPts-nearest neighbor distances) between clusters also count, but OPTICS only needs a few of them, e.g., one per pair of clusters as depicted in Figure 3.1, while most of the reachability-distances plotted in the output are short distances within clusters. The exact values of these large distances can even be replaced by approximate values without significantly changing the cluster structure in the output plot, since as long as an approximate value is large enough, it can fulfill its function of separating a cluster from the remainder of the data-set.

Figure 3.1: An OPTICS walk. The arrows represent the ordering in the walk. Although the corresponding reachability-distances are different from the distances between the pairs of objects, the lengths of the edges indicate the level of the reachability-distance values. It shows that most plotted reachability-distances are small in value.

It is also not necessary to determine the exact minPts-nearest neighbor for each object to compute its core-distance (approximately). Since OPTICS only uses the values of the minPts-nearest neighbor distances, an object with a distance very similar to the minPts-nearest neighbor distance can serve the same purpose. Overall, in order to preserve the quality of the final output, we are only required to provide OPTICS with values that are similar to the minPts-nearest neighbor distance for each object, plus a few large distances between clusters. To achieve this without computing O(n^2) distances, we will introduce the method of pairwise hierarchical ranking in Section 3.4.
3.3 k-Close Neighbor Ranking

The problem of ranking a list of objects has been well studied in database research and the social sciences, with the typical applications of ranking top-selling merchandise or political candidates. Different from these ranking problems, in this section we will study a special kind of ranking for similarity queries.

DEFINITION 11 (k-cn ranking). Given a list B of data objects, a query q, and a distance function D(., .), the problem is how to rank the objects in B so that the top k-th object o_k in the ranking has a distance D(q, o_k) similar to the true k-th nearest neighbor distance d_k, i.e., D(q, o_k) = (1 + ε)d_k for a small ε (the smaller the ε, the better). This problem is called the k-close neighbor (k-cn) ranking problem.

For our application, we do not need to find the true top k nearest neighbors; as long as the D(q, o_k) value returned by the ranking reflects the density around the query object (the core-distance in OPTICS [3]), we can use the ranking to replace the range search in OPTICS and reduce distance computations. In the coming sections, we will propose a ranking method based on the triangle inequality to perform k-cn ranking. It does not provide a theoretical bound on ε, but it achieves high accuracy on synthetic and real-life data, and is efficient to compute.

3.3.1 Ranking Using Triangle Inequalities

It has long been observed empirically [48] that the triangle inequality in a metric space (D(x, y) + D(y, z) ≥ D(x, z) for data objects x, y, z) can be used to detect close neighbors for a given query object. While the triangle inequality has been used to speed up data-mining applications via the pruning technique [17] as discussed in Section 3.1.2, the use of triangle inequalities to perform ranking has only in recent years begun to gain the attention of researchers [4].

DEFINITION 12 (Lower Bound Estimation (LBE)).
Given a distance function D(., .), a query object q and data objects o and p, E_p(q, o) = |D(q, p) − D(p, o)| is a lower bound estimation of D(q, o) using p as a pivot. The estimations of individual pivots in a set of pivots P can be naturally combined in order to obtain the best estimation as the largest lower bound: E_P(q, o) = max_{p_i ∈ P} |D(q, p_i) − D(p_i, o)|.

Figure 3.2: Ranking with triangle inequalities. Although D(q, o) cannot be estimated by using p′, chances are that D(q, o) can be estimated by using another pivot p; while p′ cannot be used to estimate D(q, o), p′ can be used to estimate D(q′, o).

As shown in the 2-d example of Figure 3.2, when q and o are not on the same circle centered at p, the absolute difference |D(q, p) − D(p, o)| will be larger than zero and can indicate the actual D(q, o) value. The closer q, o, and p lie to a common straight line, the better the estimation. Using several pivots will typically improve the estimation, since if one pivot fails to estimate D(q, o) well, chances are that it can be estimated better using another pivot.

DEFINITION 13 (Ranking with LBE). Given a query object q, a set of pivots P, and a distance function D(., .) on a data-set B, in the preprocessing stage the distances between all data objects in B and all pivots in P are computed; then, in the application, all data objects o in B are ranked in non-decreasing order of E_P(q, o).

The merit of this ranking method lies in its ability to save distance computations. All required distances except those between q and the pivots in P have been computed in the preprocessing stage, so that in the application, only |P| distance computations and otherwise only computationally cheap operations are performed.
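Definition 13 can be sketched in a few lines of Python (our illustration; the helper names are hypothetical, and a one-dimensional absolute-difference metric is used only in the example):

```python
def lbe(q_to_pivot, pivot_dists, o):
    """E_P(q, o) = max over pivots p of |D(q, p) - D(p, o)|."""
    return max(abs(q_to_pivot[p] - pivot_dists[p][o]) for p in pivot_dists)

def rank_with_lbe(query, data, pivots, dist):
    """Rank all objects in non-decreasing order of E_P(query, o)."""
    # Preprocessing stage (done once): all object-pivot distances.
    pivot_dists = {p: {o: dist(p, o) for o in data} for p in pivots}
    # Application stage: only |P| expensive distance computations.
    q_to_pivot = {p: dist(query, p) for p in pivots}
    return sorted((o for o in data if o != query),
                  key=lambda o: lbe(q_to_pivot, pivot_dists, o))
```

Objects at the top of the returned list have the smallest lower-bound estimates and are therefore the candidate close neighbors.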
When the number of pivots is set to be small and the distance function is computationally expensive, the total number of computations is much smaller than in the brute-force approach of computing the distances between q and all data objects to find the close neighbors. In most scenarios, the runtime of an application is much more important than that of a possible preprocessing, since the preprocessing is usually performed in advance and only once for several applications. But even when the runtime of the preprocessing stage is counted in the total runtime of an application, the ranking method can still significantly speed up our intended applications such as hierarchical and density-based clustering, where the runtime is dominated by the runtime of typically very expensive distance computations, and where the closest neighbors have to be determined for each object in the data-set. In these applications, the total number of computed distances is O(n|P|), which is much smaller than O(n^2) (since |P| ≪ n).

3.3.2 An Intuition of When Ranking Works

In this subsection, we give a theoretical analysis to show why E_P(q, o) can be used to estimate D(q, o) in general metric spaces. Given a pivot p, a query q and a close neighbor c of the query, E_p(q, c) = |D(q, p) − D(p, c)| is bounded by D(q, c) since, by the triangle inequality, |D(q, p) − D(p, c)| ≤ D(q, c). This result extends directly to the case of using a set of pivots P, with E_P(q, c) ≤ D(q, c). Therefore, if c is very close to q, then D(q, c) is small, and consequently E_P(q, c) must be small. This means that when ranking objects according to their estimated distance to q, c can be expected to be ranked high, provided that not many objects that are farther away from q have estimated distances lower than E_P(q, c).
The important question is therefore: "How large will E_P(q, o) be on average for a randomly chosen object o?" If E_P(q, o) has a high probability of being larger than E_P(q, c), then close neighbors will mostly be ranked higher than random objects. Lemma 2 below gives the probability of a random object o getting an E_P(q, o) value no greater than a given value.

LEMMA 2. Given a data-set B with metric distance function D(., .), let query q, data object o and pivot set P be selected randomly from B. Let Z = {|D(q, p_i) − D(p_i, o)| : p_i ∈ P}, and let P_Z(x) be the probability for z ∈ Z that z ≤ x. Then Pr[E_P(q, o) ≤ x] = (P_Z(x))^|P|.

Proof. Let S = {v | v = D(q, p_i) or v = D(o, p_i), p_i ∈ P}. Since q, o and the pivots in P are selected randomly from B, the elements in S are independent of each other. Thus the z_i = |D(q, p_i) − D(p_i, o)| are also independent of each other. Therefore

Pr[E_P(q, o) ≤ x] = Pr[max_{p_i ∈ P} |D(q, p_i) − D(p_i, o)| ≤ x]
                  = ∏_{p_i ∈ P} Pr[|D(q, p_i) − D(p_i, o)| ≤ x]
                  = (P_Z(x))^|P|.

Lemma 2 provides us with a clue of when the ranking will be effective. Let x = D(q, c) be a distance between a query q and an object c. By Lemma 2, Pr[E_P(q, o) ≤ D(q, c)] = (P_Z(D(q, c)))^|P|. Although the distribution of Z is unknown, (P_Z(D(q, c)))^|P| is always a monotonic function of D(q, c). The smaller D(q, c), the smaller P_Z(D(q, c)), and consequently the smaller (P_Z(D(q, c)))^|P|. It also holds that the larger the number of pivots |P|, the smaller (P_Z(D(q, c)))^|P|. Therefore, the closer a neighbor c is to a query q, and the more pivots we use, the higher the probability that a random object is ranked lower than c. For instance, if Z follows a normal distribution with mean µ and standard deviation σ, let D(q, c) = δ·µ; then, by a simple calculation using the Gauss error function erf(.), we can derive

Pr[E_P(q, o) ≤ D(q, c)] = ((erf(δµ/(σ√2)) − erf(−δµ/(σ√2)))/2)^|P|.   (3.1)

Let |P| = 1 and the close neighbor distance D(q, c) = 0.1µ, i.e., δ = 0.1. The value of Pr[E_P(q, o) ≤ D(q, c)] in Formula (3.1) is plotted in Figure 3.3(a). It shows that as the standard deviation σ of Z approaches zero, the probability of a random object being ranked higher than a close neighbor goes up quickly towards one, leading to a less and less effective ranking. This result is consistent with the theoretical analysis of Beyer et al. in [7], which shows that as the dimensionality of a Euclidean space goes up, the standard deviation of the distances converges to zero, and the effectiveness of all known indexing methods deteriorates as well. Compared with the standard deviation, Figure 3.3(a) also shows that the mean µ has little effect on Pr[E_P(q, o) ≤ D(q, c)]. Figure 3.3(b) and Figure 3.3(c) show how Pr[E_P(q, o) ≤ D(q, c)] is correlated with δ and |P|. For µ = 0.5 and |P| = 10, Figure 3.3(b) shows that when the close neighbor is very close to the query object, i.e., δ is small (< 0.1), then for most σ values (σ > 0.1), Pr[E_P(q, o) ≤ D(q, c)] in Formula (3.1) is close to zero, indicating an effective ranking. In Figure 3.3(c), µ is set to 0.5 and σ to 0.1. It shows that when δ approaches zero, Pr[E_P(q, o) ≤ D(q, c)] decreases dramatically, and the larger the number of pivots |P|, the larger the range of δ values that will have low Pr[E_P(q, o) ≤ D(q, c)] values, allowing more close neighbors to be ranked higher than random objects (objects with a distance to the query object close to µ).

3.4 Pairwise Hierarchical Ranking

In this section, we propose a new method using ranking to reduce distance computations in OPTICS. The method performs a "pairwise" ranking to detect close neighbors for each object.
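Formula (3.1) is straightforward to evaluate numerically. The sketch below (ours, using the standard library's math.erf) reproduces the qualitative behavior discussed above: under the normality assumption for Z, the probability shrinks as δ shrinks and as |P| grows.

```python
import math

def prob_random_ranked_higher(delta, mu, sigma, num_pivots):
    """Evaluate Formula (3.1): Pr[E_P(q, o) <= D(q, c)] under the
    assumption Z ~ N(mu, sigma), with D(q, c) = delta * mu."""
    x = delta * mu / (sigma * math.sqrt(2))
    return ((math.erf(x) - math.erf(-x)) / 2) ** num_pivots
```

For example, with δ = 0.1, µ = 0.5 and σ = 0.1, going from one pivot to ten pivots drops the probability by several orders of magnitude.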
Figure 3.3: Pr[E_P(q, o) ≤ D(q, c)] with respect to µ, σ, δ, and |P|. (Panels: (a) |P| = 1; (b) |P| = 10, µ = 0.5; (c) σ = 0.1, µ = 0.5.)

DEFINITION 14 (Pairwise Ranking). In a pairwise ranking of a set of m objects, every object will in turn be the query, so that the ranking contains m sub-rankings of the m objects.

After the pairwise ranking, OPTICS uses only the distances between objects and their detected close neighbors, plus a few additional distances between objects that are far away from each other. As indicated by Lemma 2, to rank close neighbors of a query object high, we should use as many pivots as we can in the ranking, since the larger the number of pivots, the larger the probability that a random object is ranked lower than close neighbors of the query. However, more pivots also mean more distance computations between pivots and data objects, as well as more ranking overhead (i.e., runtime spent outside distance computations). Selecting a suitable number of pivots to balance these two factors is not easy. In order to increase the number of pivots without significantly increasing the number of distance computations, we propose to perform the ranking hierarchically. The idea is to organize the pivots in a way similar to metric trees such as the M-Tree [13], but we will use the pivots in a way different from the previous methods. Our method can be roughly described as follows. First, the data-set of n objects is partitioned in a hierarchical way, so that for each group of objects on the leaf level, their distances to O(log n) ancestor objects are computed.
Using these O(log n) ancestors as pivots, our method then performs a pairwise ranking for each group of objects to find close neighbors within the group. To find close neighbors across different groups, the method also performs rankings across several groups at a time. Since different groups of objects have different sets of ancestors, the pivots our method uses will be their common ancestors. In other words, the rankings will be performed layer by layer, along the generated hierarchical partitioning. Our hierarchical ranking method can save distance computations because not every pivot is associated with the same number of distance computations. The top-level pivots have known distances to all objects, but on the next level, for each pivot, since its set of descendants is only a fraction of the whole data-set, the number of distances associated with it is reduced to the same fraction. In this way, the pivots are organized similarly to the pyramidal hierarchy of a government: some pivots are global pivots, responsible for every member of the data-set, while some are local pivots that are responsible only for members within their territories. The layer-by-layer ranking in our method can also improve the accuracy of finding close neighbors. As the ranking goes down towards the leaves, more and more random objects are excluded from the ranking during the partitioning of the data-set, thus reducing the total number of random objects being ranked higher than close neighbors. The processes of partitioning and ranking will be explained in more detail in the following subsections.

3.4.1 Partitioning

Our method first partitions the data in a hierarchical way, creating a tree, which we call a "pivot tree". Initially, the root node of the hierarchical tree contains all objects in the data-set.
During the construction, if a node v is "split", b representative objects are randomly selected from the set of objects in v, and b associated child nodes (one per representative) are created under v. The set of objects in v is then distributed among its children by assigning each object in v to the child node with the closest associated representative. Each representative and the objects assigned to it form a new node. The construction proceeds recursively, always splitting the leaf node that contains the largest number of objects, until the tree has a user-specified number of leaf nodes. At the end, all data objects are contained in the leaf nodes of the tree, and all nodes except the root contain a representative. For any data object o and a node v, o is said to be under v if (1) v is a leaf that contains o; or (2) v is an internal node and o is contained in a leaf that is a descendant of v.

3.4.2 Ranking

Multiple rankings are performed using subsets of the representatives in the constructed hierarchical tree as pivots. We will show an example before giving the formal description of the algorithm. In Figure 3.4, I_i (i = 1, 2, 3) are internal nodes and L_i (i = 1, ..., 5) are leaf nodes in different layers. o_1, o_2, ..., o_9 are data objects under them. Let I_i.rep and L_i.rep be the representatives of internal node I_i and leaf node L_i respectively. For o_1, o_2, o_3, o_4, and o_5, since the distances between them and the representatives of leaf nodes L_1 and L_2 are computed when the algorithm partitions the internal node I_2 into L_1 and L_2, we can use L_1.rep and L_2.rep as pivots to rank the data objects o_1, o_2, o_3, o_4, and o_5. Since the distances between these data objects and I_2.rep, I_3.rep, L_5.rep are also computed in earlier partitions, I_2.rep, I_3.rep, L_5.rep can also be used as pivots to rank them.
Therefore, {L_1.rep, L_2.rep, I_2.rep, I_3.rep, L_5.rep} is the set of pivots used to perform the ranking for the objects of L_1 and L_2. In the upper-layer ranking of the objects under I_2, I_3 and L_5, we can only use {I_2.rep, I_3.rep, L_5.rep} as pivots to rank the whole set {o_1, o_2, o_3, o_4, o_5, o_6, o_7, o_8, o_9}, since the distances between the data objects o_6, o_7, o_8, o_9 and the representatives L_1.rep, L_2.rep may not have been computed (they are computed only when L_1.rep = I_2.rep or L_2.rep = I_2.rep).

Figure 3.4: An example for hierarchical ranking.

Figure 3.5: Hierarchical ranking algorithm.

The pseudo code for the ranking algorithm is shown in Figure 3.5. Starting from the root as the input node v, function rankNode performs a pairwise k-cn ranking of the objects under node v, using the child representatives of v and the higher-level pivots with known distances to the objects under v as the current set of pivots. rankNode is then recursively applied to all child nodes of v. Therefore, any object o under v takes part in multiple rankings: the ranking in v as well as the rankings in all descendant nodes of v that o is also under. The lower the level of the node, the more pivots are used in its pairwise ranking and the fewer objects are involved in the pairwise ranking. The method maintains a close neighbor set for each data object o. In any k-cn ranking with o as the query, the top k objects with the smallest estimation values are retrieved and stored in this close neighbor set of o. The distances between the objects in this set and o are computed at the end and will be used in the clustering. It is easy to prove that the number of distances computed in the partitioning and ranking is O(bn log_b n + kn log_b n), where b is the branching factor in the tree, and n is the size of the data-set.
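The pivot-tree partitioning of Section 3.4.1 can be sketched as follows (a simplified Python illustration with hypothetical names; the real method also records the object-representative distances computed here, so they can later be reused as pivot distances in the rankings):

```python
import random

def split_node(objects, b, dist):
    """Pick b random representatives and assign every object to its
    closest one; each group becomes a child node of the split node."""
    reps = random.sample(objects, b)
    children = {rep: [] for rep in reps}
    for o in objects:
        closest = min(reps, key=lambda rep: dist(rep, o))
        children[closest].append(o)
    return list(children.values())

def build_pivot_tree_leaves(objects, b, num_leaves, dist):
    """Repeatedly split the largest leaf until num_leaves leaves exist."""
    leaves = [list(objects)]
    while len(leaves) < num_leaves:
        largest = max(leaves, key=len)
        if len(largest) <= b:
            break  # cannot split a node with <= b objects
        leaves.remove(largest)
        leaves.extend(split_node(largest, b, dist))
    return leaves
```

Each split replaces one leaf by b children, so reaching the requested number of leaves takes (num_leaves − 1)/(b − 1) splits, and every object ends up in exactly one leaf.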
However, the overhead of this ranking using naïveKcnRank can have a quadratic time complexity (although it uses only computationally cheap operations), since the function naïveKcnRank essentially scans all objects and all pivots to compute and sort the distance estimation values E_P(q, o) of each object o. In the next subsection, we will propose two new techniques to reduce this overhead.

3.4.3 Reducing the Overhead in Ranking

One issue that arises in the ranking algorithm shown in Figure 3.5 is that for the ranking in each node, the worst-case time complexity is O(m^2 log n), where m is the number of objects to rank (m decreases as the algorithm proceeds from the root to the bottom layer). This is due to the fact that the algorithm needs to perform a k-cn ranking for each data object, and each ranking has a time complexity of O(m log n). (Note, however, that this time complexity is on computationally cheap operations, i.e., simple comparisons of pre-computed distance values, rather than expensive distance computations.) To reduce this overhead, we propose two new techniques: 1) a best-first frontier search (rather than the sequential scan in the naïve ranking) to significantly reduce the search space; 2) based on this frontier search (which produces the same pairwise ranking result as naïveKcnRank), an approximate pairwise ranking that further reduces the runtime without sacrificing too much accuracy in the application to hierarchical clustering.

Best-First Frontier Search

While the naïve k-cn ranking performs a sequential scan of distances between pivots and all data objects to be ranked, we propose to use instead a best-first frontier search, based on a new data structure that organizes the distances associated with pivots in the following way.
Given a set of objects R under a particular node of the pivot tree and the corresponding set P of pivots for the k-cn ranking of the objects in R, for each pivot p_i ∈ P we store the distances between p_i and the objects o ∈ R in a list of pairs (o, D(o, p_i)), and sort the list by the distance value D(p_i, o). Using |P| pivots, we have |P| sorted lists, and each object o ∈ R has exactly one occurrence in each of these lists. Between the lists of different pivots, we link the occurrences of the same object together in order to efficiently access all occurrences of a particular object in all lists. The data structure is illustrated in Figure 3.6.

Figure 3.6: Linking the occurrences of each object.

When computing a pairwise k-cn ranking, each object q will be used in turn as a query object, and all other objects o ∈ R will be ranked according to their estimated distances to q. Instead of solving this problem with a sequential scan, our new k-cn ranking algorithm first retrieves all occurrences of the current query q from the given data structure. These occurrences virtually form a starting line. Then, our method searches from the starting line, advancing upward and downward along the |P| sorted lists, to search for the top k objects with the smallest E_P(q, o) distance estimation values. The rationale is as follows. For a query q, let object o be one of the top k objects that is returned by the naïve ranking, i.e., its distance estimation value E_P(q, o) (= max_{p_i ∈ P} |D(q, p_i) − D(p_i, o)|) is one of the k smallest among all objects to be ranked. That also means that for object o, the values |D(q, p_i) − D(p_i, o)| for each pivot p_i are all small (since |D(q, p_i) − D(p_i, o)| ≤ max_{p_i ∈ P} |D(q, p_i) − D(p_i, o)| = E_P(q, o)).
Consequently, the occurrences of (a top k object) o in all the lists will in general be close to the occurrences of the query q, because the lists are sorted by the distances of the objects to the pivot, D(p_i, o), and for a difference |D(q, p_i) − D(p_i, o)| to be small, D(q, p_i) and D(p_i, o) have to be similar values and will hence appear close to each other when sorted. Therefore, we can start from the occurrences of q and look in the nearby positions in the |P| sorted lists for the top k objects by a frontier search. At the end, the number of occurrences we visit will typically be only a fraction of the total number of occurrences in the pivots' lists, leading to a speed-up over the sequential scan.

Figure 3.7: k-cn ranking algorithm with best-first frontier search.

The pseudo-code of the new k-cn ranking algorithm is given in Figure 3.7. Function kcnRank maintains a priority queue as the frontier, such that its top element is a pair (o, D(o, p_i)) with the smallest |D(q, p_i) − D(p_i, o)| in the queue. After all occurrences of q in the lists that belong to the pivots are retrieved, the frontier is initialized with the occurrences immediately adjacent to those occurrences of q, above and below. Then the function performs a frontier search in all the sorted lists, always advancing in the list of the pivot p_i where the current top element of the queue lies. For objects already encountered when the frontier advances, the function maintains a count of the number of their occurrences. If this number is equal to the number of pivots used in the ranking, then the object is one of the top k objects returned in the final ranking. This process continues until all top k objects are found. In the remainder of this subsection, we prove the correctness of algorithm kcnRank.

LEMMA 3. In algorithm kcnRank, let occurrence (a, D(a, p_i)) be popped out of the priority queue before another occurrence (b, D(b, p_j)); then |D(q, p_i) − D(p_i, a)| ≤ |D(q, p_j) − D(p_j, b)|.

Proof.
When (a, D(a, p_i)) is popped out of the priority queue, (b, D(b, p_j)) can only be either in the frontier queue or outside the frontier (i.e., the occurrence has not yet been visited by the frontier). If (b, D(b, p_j)) is in the queue, then by the property of the priority queue, |D(q, p_i) − D(p_i, a)| ≤ |D(q, p_j) − D(p_j, b)|. If (b, D(b, p_j)) is outside the frontier, then, since all the lists are sorted, there must be a third occurrence (c, D(c, p_j)) in the frontier queue for the list of pivot p_j with an absolute difference value |D(q, p_j) − D(p_j, c)| ≤ |D(q, p_j) − D(p_j, b)|. Since |D(q, p_i) − D(p_i, a)| ≤ |D(q, p_j) − D(p_j, c)|, it follows that |D(q, p_i) − D(p_i, a)| ≤ |D(q, p_j) − D(p_j, b)|.

THEOREM 1. The algorithm kcnRank and the naïve k-cn ranking algorithm naïveKcnRank return the same result.

Proof. Let the last occurrence popped out of the priority queue belong to data object t, and denote this occurrence by (t, D(t, p_i)). Any object o other than the returned k objects in topK must have an occurrence (o, D(o, p_j)) still in the queue or not yet explored (it would only be popped out of the priority queue after (t, D(t, p_i))). By Lemma 3, |D(q, p_i) − D(p_i, t)| ≤ |D(q, p_j) − D(p_j, o)|. Thus E_P(q, t) ≤ E_P(q, o). Also by Lemma 3, (t, D(t, p_i)) has an absolute difference value |D(q, p_i) − D(p_i, t)| no less than those of the occurrences popped out of the queue before it. Since for the other top k objects returned by kcnRank all of their occurrences are popped out before (t, D(t, p_i)), their distance estimation values are all no greater than E_P(q, t). So the elements in topK have the smallest distance estimation values among all objects to rank. Therefore, they will also be returned by the naïveKcnRank algorithm.

Approximate Pairwise k-cn Ranking

As indicated by Lemma 2 in Section 3.3.2, the larger the number of pivots, the greater the ranking accuracy.
Given a fixed set of pivots, if the number of pivots is too small to effectively perform k-cn ranking, e.g., k = 5 but only 3 of the top 5 objects returned by the ranking are actually close neighbors, then some of the occurrences of the top k objects to be returned by the ranking may be located far away from the starting point of the frontier search. Thus the frontier search in algorithm kcnRank in Figure 3.7 may have to visit many occurrences of not-so-close neighbors of q before finding all the occurrences of all top k objects, and may still incur a large overhead. Our solution to this problem is to limit the number of steps that the frontier can advance from the starting position. The returned result is then no longer exactly the same as that of the naïve k-cn ranking, i.e., the new algorithm performs an approximate pairwise k-close neighbor ranking. When the search stops, if only k′ of the top k objects (k′ < k) have all their occurrences within the frontier, then the remaining k − k′ objects are selected from those objects (besides the k′ objects already selected) that have the largest numbers of occurrences within the frontier. The rationale behind this idea is that objects with occurrences located far away from the corresponding occurrences of the query object are more likely to be random neighbors that cannot contribute short distances to be used by OPTICS, even if the frontier search goes all the way to find their occurrences. Thus setting a step limit for the frontier search will not hurt the final clustering accuracy, even if some of the top but not-so-close objects are not returned by the search. Let the step limit be s. The approximate pairwise k-cn ranking algorithm then has a worst-case time complexity of O(sn log n), where n is the size of the data-set. Empirical results in Section 3.5 show that s can be as small as 2k while still generating clustering results with high and robust accuracy.
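The machinery of this subsection can be sketched compactly in Python (our simplification, not the thesis code): per-pivot sorted lists as in Figure 3.6, and a best-first frontier search over them, with an optional step limit for the approximate variant. Objects here must be orderable and pivots comparable (true for the toy data in the example).

```python
import heapq

def build_sorted_lists(objects, pivots, dist):
    """One list of (distance, object) per pivot, sorted by distance,
    plus each object's position in every list (Figure 3.6's links)."""
    lists = {p: sorted((dist(p, o), o) for o in objects) for p in pivots}
    positions = {o: {} for o in objects}
    for p in pivots:
        for idx, (_, o) in enumerate(lists[p]):
            positions[o][p] = idx
    return lists, positions

def kcn_rank(q, lists, positions, pivots, k, step_limit=None):
    """Best-first frontier search for k objects with small E_P(q, o);
    with step_limit set, it becomes the approximate variant."""
    qdist = {p: lists[p][positions[q][p]][0] for p in pivots}
    heap = []  # entries: (|D(q,p) - D(p,o)|, pivot, index, direction)
    for p in pivots:
        i = positions[q][p]
        if i - 1 >= 0:
            heap.append((abs(qdist[p] - lists[p][i - 1][0]), p, i - 1, -1))
        if i + 1 < len(lists[p]):
            heap.append((abs(qdist[p] - lists[p][i + 1][0]), p, i + 1, +1))
    heapq.heapify(heap)
    counts, top, steps = {}, [], 0
    while heap and len(top) < k and (step_limit is None or steps < step_limit):
        steps += 1
        _, p, i, d = heapq.heappop(heap)
        o = lists[p][i][1]
        counts[o] = counts.get(o, 0) + 1
        if counts[o] == len(pivots):  # all occurrences inside the frontier
            top.append(o)
        ni = i + d  # advance the frontier in this pivot's list
        if 0 <= ni < len(lists[p]):
            heapq.heappush(heap, (abs(qdist[p] - lists[p][ni][0]), p, ni, d))
    # Approximate variant: fill remaining slots with the objects having
    # the most occurrences inside the frontier so far.
    if len(top) < k:
        rest = sorted((o for o in counts if o not in top),
                      key=lambda o: -counts[o])
        top.extend(rest[:k - len(top)])
    return top
```

The frontier starts at the occurrences adjacent to q in each list and always advances in the list holding the current smallest estimate, which is exactly the best-first behavior that Lemma 3 relies on.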
Integration with OPTICS

After the close neighbors of all objects have been detected by the pairwise ranking based on distance estimations, our method computes the actual distances between each object and these close neighbors. Another set of distances we will use are the distances computed in the partitioning stage when creating the pivot tree, i.e., the distances between the representatives of nodes and the objects under them. These are the only distances that OPTICS will use in the clustering. All other distances are assumed to be "infinitely" large. The value of k should not be significantly smaller than the minPts parameter of OPTICS; otherwise the cluster result can be distorted because there are not enough computed distances associated with each object to estimate the core-distances. In the pairwise hierarchical ranking, each object can take part in several sub-rankings, i.e., rankings at different layers, so that the number of distances associated with each object is usually a little larger than k. Since in practice the minPts parameter only needs to be relatively small to provide good results, k can also be set to a small value (typically ≤ 20).

3.5 Experimental Evaluation

In this section, we compare our method and the Data Bubbles method on synthetic as well as real-life data-sets. Both methods were used to speed up the OPTICS clustering algorithm. We denote our method using approximate pairwise hierarchical ranking by OPTICS-Rank, and the Data Bubbles method by OPTICS-Bubble. All experiments were performed on a Linux workstation with dual AMD Opteron 2.2GHz CPUs and 5GB of RAM, using one CPU only.

3.5.1 Synthetic Data

We used the two synthetic data-sets studied in the last chapter, DS-Vector and DS-Tuple, to show that our new method has better accuracy in detecting subtle cluster structures.
Figure 3.8: The reachability plots from OPTICS (a and b) and OPTICS-Rank (c and d) are almost identical (due to the properties of OPTICS clustering, there are switches of cluster positions). Panels: (a) OPTICS output for DS-Vector; (b) OPTICS output for DS-Tuple; (c) OPTICS-Rank output for DS-Vector; (d) OPTICS-Rank output for DS-Tuple.

The outputs of OPTICS-Rank for DS-Vector and DS-Tuple are shown in Figures 3.8(c) and 3.8(d) respectively. For the parameters of OPTICS-Rank, the branching factor and the step limit s for the best-first frontier search are set to 10, the number of leaf nodes for the pivot tree to 5,000, and the number of top objects to return in each ranking, k, to 5, the same as minPts. The plots generated by OPTICS-Rank are almost identical to those generated by OPTICS; only some clusters have switched position, which is a normal phenomenon when clustering with OPTICS and which does not affect the clustering accuracy. OPTICS-Rank uses only a fraction of the total number of distances used by OPTICS. The number of distances computed by OPTICS is 2.5 × 10^9 for both data-sets, while OPTICS-Rank uses 2.4 × 10^6 and 2.7 × 10^6 distances for DS-Vector and DS-Tuple respectively. We used the two measurements N-score and F-score described in the last chapter to compare the clustering accuracy of OPTICS-Rank and OPTICS-Bubble on DS-Vector and DS-Tuple. The clustering accuracy by N-score on DS-Vector is shown in Figure 3.9(a). OPTICS-Rank uses a fixed setting as mentioned above, while the number of bubbles used by OPTICS-Bubble varies from 100 to 250. The experiment was repeated 10 times, and OPTICS-Rank always succeeds in finding all the clusters, so that it has a score of 1. This accuracy is consistently better than that of OPTICS-Bubble. The F-scores on DS-Vector are shown in Figure 3.9(b). As the number of bubbles increases, the accuracy of OPTICS-Bubble increases rapidly and stays on a relatively high level (> 0.98).
This accuracy, however, consistently lags behind that of OPTICS-Rank (> 0.999). The corresponding numbers of computed distances of the two algorithms are shown in Figure 3.9(c). As the number of bubbles increases, the number of distances computed by OPTICS-Bubble increases linearly. It uses as many as 12.5 × 10^6 / 2.4 × 10^6 ≈ 5.2 times the number of distances of OPTICS-Rank. The clustering accuracy on DS-Tuple using N-score and F-score is shown in Figures 3.10(a) and 3.10(b) respectively. OPTICS-Rank uses the same setting as in the previous experiment, and the number of bubbles used by OPTICS-Bubble varies from 25 to 250. Similar to the previous experiment, OPTICS-Rank outperforms OPTICS-Bubble and is only matched by the latter when the number of bubbles reaches 200 for N-score and 250 for F-score. The numbers of computed distances for both methods are shown in Figure 3.10(c). It shows that when we use a larger number of bubbles (≥ 250) for OPTICS-Bubble to match the accuracy of OPTICS-Rank, OPTICS-Bubble needs to perform four times the number of distance computations.

Figure 3.9: Clustering accuracy and performance on DS-Vector. (a) N-score on DS-Vector; (b) F-score on DS-Vector; (c) number of distance computations for clustering DS-Vector.

3.5.2 Real-life Data

The first real-life data-set we used, denoted by DS-Protein, consists of 12,010 refined structure models of the NMR structure of Escherichia coli ribosomal protein L25 [43], generated by the protein structure modeling program GaFolder [51]. The distance function we used is the dRMS distance, which is a metric when applied to structure models sharing the same protein sequence (see [29] for a description of dRMS).
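As a rough illustration of the kind of distance involved, a dRMS-style measure compares the internal pairwise distances of two models over the same sequence. The sketch below uses one common normalization, which may differ from the exact definition in [29]:

```python
# Hedged sketch of a dRMS-style distance between two structure models
# sharing the same protein sequence: the root-mean-square difference of
# all intra-structure pairwise distances.  The normalization here is one
# common convention, not necessarily the one used in [29].
from math import dist, sqrt

def drms(coords_a, coords_b):
    n = len(coords_a)
    assert n == len(coords_b)  # same sequence -> same number of positions
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    sq = sum((dist(coords_a[i], coords_a[j]) - dist(coords_b[i], coords_b[j])) ** 2
             for i, j in pairs)
    return sqrt(sq / len(pairs))

# Identical models have distance 0; a translated copy also has distance 0,
# since only internal distances are compared.
model = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
shifted = [(x + 5.0, y, z) for x, y, z in model]
```

Because only internal distances enter the measure, it is invariant under rigid motions of a whole model, which is what makes it suitable for comparing structure models.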
Figure 3.10: Clustering accuracy and performance on DS-Tuple. (a) N-score on DS-Tuple; (b) F-score on DS-Tuple; (c) number of distance computations for clustering DS-Tuple.

Another real-life data-set we used, denoted by DS-Jaspar, consists of 73,253 DNA sequence motifs extracted from the human chromosome 1 using the transcription factor binding patterns in the JASPAR database [44]. The distance function we used is an adapted edit (Levenshtein) distance over all sequences in two motifs [38]. Since the runtime of OPTICS on these two data-sets is too long for a single PC (each experiment takes a few months), we had to pre-compute the pairwise distances using massively parallel processing, and report the runtime calculated as follows: runtime = a · d + h, where a is the average runtime of computing one distance, d is the number of distance computations, and h is the algorithm overhead. This estimates the runtime of the algorithms as if they did not have access to the pre-computed resource of all pairwise distances.

Protein Structure Data

The data-set DS-Protein is considered more difficult to cluster than DS-Vector and DS-Tuple; as we can see in Figure 3.11, DS-Protein contains many more clusters. To evaluate the clustering accuracy using F-score, we defined three cut-lines on the OPTICS reachability plot (y = 0.1, 0.2, 0.3), and for each of these cut-lines, we determined the best corresponding (according to F-score) cut-lines on the OPTICS-Rank and OPTICS-Bubble results from a set of candidate cut-lines.

Figure 3.11: OPTICS output for DS-Protein with three cut-lines (y = 0.1, 0.2, 0.3).
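The reported-runtime formula above, runtime = a · d + h, is a simple linear model; a minimal sketch with illustrative numbers (not the thesis' measured values):

```python
# Sketch of the reported-runtime model: runtime = a * d + h, where a is
# the average time per distance computation, d is the number of computed
# distances, and h is the algorithm overhead.  Numbers are illustrative.

def estimated_runtime(a, d, h):
    return a * d + h

# e.g., 0.085 s per distance, 1,000,000 distances, 20 s overhead:
seconds = estimated_runtime(0.085, 1_000_000, 20)
hours = seconds / 3600
```

For an expensive distance function, the a · d term dominates, which is why the experiments report the number of distance computations as the primary cost measure.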
Figures 3.12(a), 3.12(b) and 3.12(c) show the accuracy for the cut-lines of y = 0.1, 0.2, 0.3 on the OPTICS output, as the number of leaves for OPTICS-Rank and the number of bubbles for OPTICS-Bubble ranges from 100 to 3,000. The other parameters of OPTICS-Rank are set as follows: the branching factor b and the step limit s are set to 10, and the parameters k and minPts are set to 20. The results show that the accuracy of OPTICS-Bubble increases dramatically as the number of bubbles grows from 100 to 1,000, but the curve starts to level off afterwards. The curve of OPTICS-Rank is relatively stable regardless of the number of leaves. The results also show that OPTICS-Rank has better accuracy than OPTICS-Bubble when the number of leaves/bubbles is small (≤ 200). For larger numbers of leaves/bubbles, the two methods have similar accuracy, with OPTICS-Bubble outperforming OPTICS-Rank on cut-line y = 0.1 and OPTICS-Rank outperforming OPTICS-Bubble on cut-lines y = 0.2 and y = 0.3. Nonetheless, both methods achieve high F-scores greater than 0.96.

Figure 3.12: Clustering accuracy on DS-Protein. (a) Accuracy on cut-line y = 0.1; (b) accuracy on cut-line y = 0.2; (c) accuracy on cut-line y = 0.3.

The bigger difference between OPTICS-Rank and OPTICS-Bubble lies in their runtime performance. Figure 3.13 shows the performance of the two approximate clustering methods with respect to the number of distance computations and runtime.
For instance, when using 1,000 leaves/bubbles, OPTICS-Rank would run an estimated 15 hours, while OPTICS-Bubble would run 271 hours (> 11 days) and OPTICS would run 1,702 hours (> 70 days), without access to the pre-computed distances.

Figure 3.13: Performance of OPTICS-Rank and OPTICS-Bubble on DS-Protein. (a) Number of distance computations; (b) runtime.

Figure 3.14 shows how accurately the rankings in OPTICS-Rank perform. For each data object o, let Dt(o) be the true 20th nearest neighbor distance of o, let Dr(o) be the estimated 20th nearest neighbor distance in OPTICS-Rank, and let Dr(o) = (1 + ε)Dt(o). The distribution of ε in terms of the percentage of data objects is shown in Figure 3.14. As we can see, most objects have a small ε value, which indicates good ranking accuracy. For instance, 75.8% of the objects have an ε value smaller than 0.3, i.e., most objects have an estimated Dr(o) deviating no more than 30% from the true Dt(o) value. This difference can be considered small in a data-set where the standard deviation of Dt(o) is equal to 1.9 times the mean of Dt(o).

Figure 3.14: The distribution of ε values for rankings in OPTICS-Rank.

Figure 3.15: Effect of changing parameter k. (a) Clustering accuracy; (b) number of distance computations; (c) overhead.
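The accuracy measure ε defined above is straightforward to compute once the true and estimated neighbor distances are available; a sketch with toy distances (not the DS-Protein values):

```python
# Sketch of the ranking-accuracy measure: for each object o, the
# estimated k-th nearest neighbor distance D_r(o) relates to the true
# one D_t(o) via D_r(o) = (1 + eps) * D_t(o).  The distances below are
# toy values for illustration.

def epsilons(d_true, d_est):
    return [e / t - 1.0 for t, e in zip(d_true, d_est)]

d_true = [1.0, 2.0, 4.0, 5.0]
d_est  = [1.2, 2.2, 5.0, 9.0]
eps = epsilons(d_true, d_est)

# Fraction of objects whose estimate deviates less than 30% from the truth.
fraction_below_03 = sum(e < 0.3 for e in eps) / len(eps)
```

Note that ε ≥ 0 whenever the ranking can only overestimate the k-th neighbor distance (a missed true neighbor is replaced by a farther one).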
We also studied the relation between the settings of the OPTICS-Rank parameters and the clustering result. Since the dRMS distance function on DS-Protein is expensive (it takes 0.085 seconds on average to compute one distance), the runtime of OPTICS-Rank is dominated by the distance computation time and thus correlates closely with the number of distance computations, as shown in Figure 3.13. In the remainder of this subsection, we will report the number of distance computations and the algorithm overhead only (overhead = total runtime − distance computation runtime). Unless explicitly specified otherwise, in the remainder of this subsection we always set the number of leaves to 1,000 and used the cut-line y = 0.3 on the OPTICS output for the accuracy measurement. The parameter k denotes the number of top neighbors to return in the ranking for each data object. Figure 3.15(a) shows how changing k from 1 to 100 affects the clustering accuracy, with both the branching factor b and the step limit s set to 10. When k is small (< 5), the accuracy is relatively less stable. As k increases, the accuracy also increases, until it levels off at k = 50. Using a larger k can improve the accuracy, but at the price of increasing the number of distance computations as well as the algorithm overhead, as shown in Figures 3.15(b) and 3.15(c) respectively. Figure 3.15(b) shows that the number of distance computations grows almost linearly in k, while Figure 3.15(c) shows that the overhead increases rapidly when k is small but grows more slowly after reaching k = 10. Note that the overhead of OPTICS-Rank is small compared with the total runtime of the algorithm. The overhead is on the order of tens of seconds, while the total runtime of OPTICS-Rank is tens of hours because of the expensive distance computations. Overall, Figure 3.15 indicates that a k value between 5 and 50 provides the best balance between accuracy and runtime.
The results of changing the branching factor b from 2 to 100 with s = 10 are shown in Figure 3.16. When b increases, Figure 3.16(a) shows that the clustering accuracy stays around an F-score of 0.98, but the number of distance computations and the overhead tend to go up, with the optimal point around b = 5. Figure 3.17 shows the results of changing the step limit s from 1 to 5,000 with b = 10. By the design of the algorithm, increasing the step limit does not affect the number of distance computations, so the corresponding plot is omitted. Figure 3.17(a) shows that the step limit affects the accuracy only slightly, but increasing it increases the overhead significantly, as shown in Figure 3.17(b). Our results also show that increasing s does not always improve the accuracy; as we can see in Figure 3.17(a), the run with s = 5,000 actually has slightly lower accuracy than the run with s = 500. One explanation is that some pivots might behave significantly differently from the others, and their scores bring down the ranking of some close neighbors. In that case, pushing the search limit further to incorporate the scores of these pivots might not improve the accuracy. In our last experiment on DS-Protein, we reduced the size of the data-set to study the scalability of the OPTICS-Rank algorithm. We used the first portion of DS-Protein, changing the size from 2,000 to 12,000.

Figure 3.16: Effect of changing branching factor b. (a) Clustering accuracy; (b) number of distance computations; (c) overhead.
Figure 3.18 shows the good scalability of OPTICS-Rank: the number of distance computations, the runtime and the overhead all grow linearly with respect to the size of the data-set.

Figure 3.17: Effect of changing step limit s. (a) Clustering accuracy; (b) overhead.

JASPAR Data

The clustering results on DS-Jaspar are depicted in Figure 3.19. While using 1,000 times fewer distances, OPTICS-Rank generates a plot that captures the same cluster structure as the output of the original OPTICS (with some switching of cluster positions due to the property of the OPTICS clustering algorithm). We applied the F-score to measure the accuracy numerically. As we can see in Figure 3.19, the diverse levels of reachability distances between clusters make it difficult to extract the clusters using cut-lines, and we had to manually extract them from both output plots (98 clusters for OPTICS and 101 clusters for OPTICS-Rank). Each cluster in the OPTICS output is then matched to the cluster in the OPTICS-Rank output that has the highest F-score. The F-score distribution of the matched clusters is shown in Figure 3.20. It shows that the majority of the clusters in the OPTICS output can be matched with a cluster in the OPTICS-Rank output with a high F-score of more than 0.95. The overall F-score is 0.87.

3.6 Summary

In this chapter, we proposed a novel approach to perform approximate clustering with high accuracy. We introduced a novel pairwise hierarchical ranking method to efficiently determine close neighbors for every data object.
We also proposed a frontier search rather than a sequential scan in the naïve ranking to reduce the search space, and a heuristic that approximates the frontier search but further reduces the runtime. Empirical results on synthetic and real-life data show the high efficiency and accuracy of our method in combination with OPTICS, obtaining a speed-up of up to two orders of magnitude over OPTICS while maintaining a very high accuracy, and of up to one order of magnitude over Data Bubbles combined with OPTICS while obtaining a more accurate result.

Figure 3.18: Scalability with respect to the size of the data-set. (a) Number of distance computations; (b) runtime; (c) overhead.

Figure 3.19: Clustering results on DS-Jaspar. (a) OPTICS output, using 5.4 billion distances; (b) OPTICS-Rank output, using 4.9 million distances.

Figure 3.20: Distribution of F-scores. The overall F-score is 0.87.

Part II: Speed-up Similarity Search

Chapter 4: The Concept of Virtual Pivots for Similarity Search

Performing efficient k-nearest neighbor (k-nn) search in many real-world metric data-sets is a very challenging problem: e.g., given a new protein sequence or structure, efficiently finding the most similar protein(s) in a protein database using computationally expensive dissimilarity measures such as sequence or structure alignments. In this chapter, we propose a new algorithm to efficiently perform k-nn search in metric data based on triangle inequality pruning using pivot objects. Our method is based on a formal theoretical analysis of the pruning ability of pivots.
We introduce the novel concept of virtual pivots and show that a single virtual pivot can have the same pruning power as a large set of fixed pivots. Any database object can act as a virtual pivot, and virtual pivots can be dynamically selected, without imposing a quadratic space complexity as in previously proposed dynamic pivot selection schemes. We also propose and analyze a method for boosting the performance of virtual pivots by selecting objects that are close to the query object with high probability (if they exist). In this chapter, we will focus on the theoretical analysis of virtual pivots and leave the testing on real-life data to the next chapter. The remainder of the chapter is organized as follows. In the next section, we review related work. In Section 4.2, we analyze the pruning ability of pivots. In Section 4.3 we introduce virtual pivots. We describe the boosting of virtual pivots in Section 4.4. In Section 4.5, we discuss the advantages of using virtual pivots. Section 4.6 describes our virtual pivot algorithm, and Section 4.7 summarizes this chapter.

4.1 Related Work

Similarity search in general metric spaces, where only a metric distance function is available, uses so-called pivots (also called reference objects or foci in some papers) and the triangle inequality of the distance function to speed up similarity search. We can distinguish between two basic approaches: (1) methods that organize the pivots in a hierarchical structure and (2) methods that use a non-hierarchical approach to organize and use the pivots.

4.1.1 Hierarchical Approaches

The first proposal for a hierarchical (tree-based) pivot structure is probably the Burkhard-Keller Tree (BKT) [10] for indexing discrete-valued distances. The root node of a BKT is an arbitrary data object. The data space is partitioned by grouping data objects with the same distance to this root node object into a child node, and each child node is recursively partitioned in the same way.
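The recursive partitioning of the BKT can be sketched for a discrete-valued metric as follows (a toy illustration; not code from [10]):

```python
# Sketch of Burkhard-Keller Tree construction for a discrete-valued
# metric: pick an arbitrary root object, group the remaining objects by
# their distance to the root, and recurse on each group.

def build_bkt(objects, dist):
    if not objects:
        return None
    root, rest = objects[0], objects[1:]
    groups = {}
    for o in rest:
        groups.setdefault(dist(root, o), []).append(o)
    return (root, {d: build_bkt(group, dist) for d, group in groups.items()})

# Toy example: integers under the absolute difference, a discrete metric.
tree = build_bkt([5, 3, 7, 2, 8], lambda a, b: abs(a - b))
root, children = tree  # root 5; children keyed by distance 2 and 3
```

During a range query with radius δ, only child branches whose distance key lies within δ of the query-to-root distance need to be visited, by the triangle inequality.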
The objects in the nodes of the tree are called pivots. Later, the BKT was improved upon by other methods. For instance, the Fixed Queries Tree (FQT) [5] reduces the number of distance computations during search by choosing the same pivot for all nodes on the same level of the tree; the Vantage-Point Tree (VPT) [46, 53] extends the approach to handle continuous distance functions; the M-tree [13] introduces a disk-resident version of the VPT; the SA-tree [34] uses spatial approximation inspired by the Voronoi diagram to reduce the number of backtracks in the tree. For a comprehensive survey on tree-based k-nn search methods, see Chávez et al. [12] and Hjaltason and Samet [25]. Two of the best-known tree pivot structures are the M-tree [13] and the SA-tree [34], which are further described below. The M-tree [13] splits the data-set into spheres represented by nodes. Each node can hold a limited number of objects and (except the root node) is led by a pivot (called a routing object in [13]), which is stored in the parent node. Every pivot maintains a covering radius, that is, the largest distance between itself and all the data objects in its covering tree, which is the subtree rooted at the node led by this pivot. During the M-tree construction stage, when a new data object arrives, its distances to the pivots stored in the root node are computed, and it is inserted into the most suitable node (i.e., led by the closest pivot from the root node) or the node that minimizes the increase of the covering radius, and subsequently descended down to a leaf node. If the size of a node reaches the prespecified limit, this node is split into two new nodes, with their pivots elected accordingly and inserted into their parent node. In the query stage, the M-tree uses the covering radius to perform triangle inequality pruning. For a node led by pivot p with covering radius r(p), the distance between any data object o in this node and the query q, D(q, o), is lower bounded by D(q, p) − r(p) [13].
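The covering-radius rule can be stated compactly (a minimal sketch; the function and parameter names are ours, not from [13]):

```python
# Sketch of the M-tree covering-radius bound: for a node led by pivot p
# with covering radius r(p), D(q, p) - r(p) lower-bounds D(q, o) for any
# object o in the subtree, so the whole subtree can be skipped when this
# bound exceeds the current search radius delta.

def subtree_lower_bound(d_q_p, covering_radius):
    return d_q_p - covering_radius

def can_prune_subtree(d_q_p, covering_radius, delta):
    return subtree_lower_bound(d_q_p, covering_radius) > delta
```

The appeal of this rule is that one distance computation, D(q, p), can rule out an entire subtree at once.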
The SA-tree [34] uses spatial approximation inspired by the Voronoi diagram to reduce the number of visited subtrees during the query stage. In the SA-tree, every node corresponds to a database object (unlike in the M-tree). In the SA-tree construction stage, first a random data object is chosen as the root, and a set of data objects from the whole database are selected as its neighbors. Any object in this neighbor set must be closer to the root than to any other object in this neighbor set. Each neighbor object then starts a subtree of the root, and the process continues recursively. In the query stage, besides using the covering radius to perform triangle inequality pruning as in the M-tree, the SA-tree also uses a property derived from the way the tree is constructed. That is, if p1 and p2 are a pair of parent-child nodes (serving as pivots) visited by the query process, then for any data object o in the neighbor set of node p2, D(o, p1) > D(o, p2), and thus (D(q, p2) − D(q, p1))/2 is a lower bound of D(q, o) [34]. A common problem of tree-based k-nn search methods is that close objects always have a certain probability of being stored in different branches, and this probability goes up quickly as the difficulty of indexing the database increases (e.g., when the dimensionality increases in vector data). These methods then have to search many subtrees to retrieve all close neighbors of the query object, which involves a large overhead and a large number of distance computations. Another disadvantage of tree-based methods is that there is no easy way to adjust the pruning ability, when allowed, through varying the database preprocessing effort. Results by Weber et al. [50] and Jagadish et al. [26] have shown that tree-based k-nn search methods typically do not work well in very high dimensional metric data.

4.1.2 Non-hierarchical Approaches

Pivots can also be selected and organized into a non-hierarchical structure.
Non-hierarchical k-nn search methods [40, 33, 20, 11, 37, 15] select a set of pivots in the pre-computing stage and compute the distance between each pivot and every database object; during query processing, all the pivots are used collectively to prune objects from the k-nn candidate list using the triangle inequality. That is, any object whose estimated lower bound is larger than the k-th smallest estimated upper bound cannot be among the k nearest neighbors of the query, and thus can be pruned away. The naïve way to select pivots is to pick them randomly. Some researchers preferred using pivots that are far away from all the other database objects, or pivots that are far away from each other [40, 33]. One can also choose not to select pivots in batch mode (i.e., all pivots at the same time), but incrementally: Filho et al. [20, 45] proposed to select pivots so that they are far away from each other and have the most similar pairwise distances to each other; Bustos et al. [11] proposed to select pivots that maximize the mean of the pairwise distances between pivots; Rico-Juan and Micó [37] proposed a similar method, LAESA, further described below. One recent work by Digout et al. [15] pre-selects a large set of pivots but uses only some of them to do the actual pruning in the query stage. However, this method relies on the distances between the query object and all the pivots, which are computed directly, to determine which subset of the pivots to use for pruning. Therefore, when the distance function is very expensive, this method cannot afford to use a large number of pre-selected pivots. LAESA [37] is a nearest neighbor search algorithm, which can easily be modified to perform k-nn search (as we did in this work). In the preprocessing stage, a set of fixed pivots is selected, and the distance between each pivot and every data object is computed.
To select the pivots, LAESA uses a method called maximum minimum distances [37] to avoid selecting pivots that are close to each other. The first pivot is selected randomly. The other pivots are selected iteratively, with the next pivot always being the object whose minimum distance to the set of already selected pivots is maximal. In the query stage, LAESA computes the distance between the query object and the fixed pivot with the lowest lower bound estimation, updates the upper bound δ of the nearest neighbor distance, and prunes away objects with a lower bound estimation greater than δ. If there are no more fixed pivots to use, it computes the distances to data objects that have not been pruned yet, one at a time, until the nearest neighbor is found. During this process, each computed distance is used to update δ and to further prune objects. All the methods mentioned so far spend minimal effort in the pre-computing stage to build an index structure. Their numbers of pre-computed distances are typically on the level of O(n log n), where n is the size of the data-set. In contrast, another branch of methods spends more effort on pre-computing distances (up to all pairwise distances), in order to obtain a higher pruning ability in the query stage. This approach dates back to the 1980s, when Vidal proposed the method AESA [48, 49]. This method pre-computes the distances between all objects in the data-set. When performing k-nn search, an object is selected randomly as the first pivot, and the distance between the query object and the pivot is computed. Using the triangle inequality, objects are then pruned, if possible. If the number of remaining objects is still large, another object with the smallest lower bound distance estimation computed by the triangle inequality is selected as pivot, and the process is repeated. Shasha and Wang [41] proposed a method that is similar to AESA in that it tries to exploit all pairwise distances for pruning.
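The maximum-minimum-distances selection described above can be sketched as follows (a toy illustration; the data and helper names are ours, not from the LAESA paper):

```python
# Sketch of LAESA-style "maximum minimum distances" pivot selection: the
# first pivot is arbitrary, and each subsequent pivot is the object whose
# minimum distance to the already selected pivots is maximal.

def select_pivots(objects, dist, num_pivots):
    pivots = [objects[0]]                      # arbitrary first pivot
    while len(pivots) < num_pivots:
        candidates = [o for o in objects if o not in pivots]
        best = max(candidates, key=lambda o: min(dist(o, p) for p in pivots))
        pivots.append(best)
    return pivots

# 1-d toy example with Euclidean distance: starting from 0.0, the point
# 10.0 is farthest, and 5.0 then maximizes the minimum distance to {0, 10}.
pivots = select_pivots([0.0, 1.0, 5.0, 9.0, 10.0], lambda a, b: abs(a - b), 3)
```

The effect is a spread-out pivot set, which avoids the redundancy of two pivots that would produce nearly identical bounds.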
In contrast to AESA, however, they assumed that only a subset of all the distances is computed in a first step. Then, a Θ(n³) time algorithm using two matrices of size n by n is applied to compute the tightest lower and tightest upper bound estimations for the unknown distances, given the known distances. These bounds are calculated using a generalized form of the triangle inequality considering more than three objects at a time. This pre-computation step is itself computationally very expensive, and replacing actual distances with approximate bounds is only worthwhile if the cost of computing the pairwise distances would be even higher, i.e., for this method we have to assume that the distance function is computationally extremely expensive. The reason why approaches that try to exploit all pairwise distances have been much less popular than approaches using fixed pivots is that they all use quadratic space in the pre-computing and the search stage, since the matrices computed in the pre-computing stage, which store the distances or the estimated bounds, are all needed. “This is unacceptably high for all but very small databases” [12].

4.2 The Pruning Ability of Pivots

For the following analysis of similarity search in metric spaces, we assume a data-set B, a metric distance function D(·, ·), a query object q ∉ B, and a fixed set of pivots P = {p | p ∈ B}, for which all distances between database objects and pivots have been pre-computed. Furthermore, we will assume a metric vector space, to be able to do a mathematical analysis involving volumes of regions in the data space. Although the notion of volume is not available in general metric data, the conclusions that we derive from the analysis apply approximately to general metric data as well, since it is possible to map a general metric space into a vector space so that the distances between all objects are approximately preserved.
Only the properties of those distances and their distribution are relevant for the effectiveness of triangle inequality pruning using pivots. The basic approach to similarity search in metric spaces using pivots is the following. First, compute the distance between the query object and some pivot(s). Some approaches use the pivots in P in some specific order, while others may use a subset or the whole set of pivots “simultaneously”. These differences in algorithms only affect the runtime of the pruning but not the total number of distance computations, if eventually all pivots in P are used. Utilizing the pre-computed distances between the pivots in a currently selected subset P′ ⊆ P and objects o ∈ B, we can estimate (w.r.t. P′) a largest lower bound

l_D(o,q) = max_{p ∈ P′} {|D(q, p) − D(o, p)|},

and a smallest upper bound

u_D(o,q) = min_{p ∈ P′} {D(q, p) + D(o, p)}

of D(q, o) between the query object q and every database object o, i.e., l_D(o,q) ≤ D(o, q) ≤ u_D(o,q). These bounds are used in both range search and k nearest neighbor search to prune objects from the list of candidate query answers. For a range search, i.e., for finding all objects within a distance δ from the query q, the lower bound l_D(o,q) can be used to exclude objects o for which the lower bound of the distance to q is already larger than the query range, i.e., l_D(o,q) > δ. For a k-nn search, i.e., for finding the k closest objects to the query q, typically an iterative scheme is applied in which at every step pruning is done as for a range query, but with increasingly smaller tentative ranges δ around q. For instance, using the current estimated upper bounds u_D(o,q), a tentative range δ can be found by selecting the k-th smallest upper bound distance estimation. Any object whose lower bound estimation is already larger than the k-th smallest upper bound estimation cannot be a true k nearest neighbor of query object q.
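The bound computation and the k-nn pruning rule just described can be sketched as follows (a minimal sketch with toy data; the dictionary-based helper names are ours):

```python
# Sketch of pivot-based candidate pruning with pre-computed distances:
# l = max_p |D(q,p) - D(o,p)| and u = min_p (D(q,p) + D(o,p)); any object
# whose lower bound exceeds the k-th smallest upper bound cannot be a
# k nearest neighbor of q.  The toy data below is for illustration.

def knn_candidates(objects, pivots, d_q, d_obj, k):
    """d_q[p]: query-to-pivot distance; d_obj[o][p]: object-to-pivot."""
    lower = {o: max(abs(d_q[p] - d_obj[o][p]) for p in pivots) for o in objects}
    upper = {o: min(d_q[p] + d_obj[o][p] for p in pivots) for o in objects}
    delta = sorted(upper.values())[k - 1]      # k-th smallest upper bound
    return [o for o in objects if lower[o] <= delta]

# 1-d toy example: one pivot p0 at 0, objects a=1, b=2, c=9, query at 1.5.
objs = ["a", "b", "c"]
d_q = {"p0": 1.5}
d_obj = {"a": {"p0": 1.0}, "b": {"p0": 2.0}, "c": {"p0": 9.0}}
candidates = knn_candidates(objs, ["p0"], d_q, d_obj, k=1)  # "c" is pruned
```

Only the surviving candidates need an actual (expensive) distance computation to q, which is the whole point of the scheme.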
Such pruning for similarity queries does not work well when both the query q and the database objects o are approximately equally far away from all the pivots. The lower bounds will be almost zero and the upper bounds will be very large, so that l_D(o,q) of almost every object would be smaller than most query ranges, in particular the pruning ranges for k-nn search derived from upper bound estimations. In those cases, the resulting k-nn candidate list will contain most of the database objects. In the following subsections, we will analyze the pruning ability of pivots for range and k-nn queries, by studying the ability of pivots to exclude objects from a given query radius (fixed for a range query, derived for k-nn queries).

4.2.1 One Pivot versus Several Pivots

To improve pruning ability, one straightforward approach is to increase the number of pivots. The intuition behind this approach is illustrated in Figure 4.1(a) for a 2-d data example using two pivots for a query object q with query radius δ. For each pivot p, by the triangle inequality, any object o that can potentially be located within the radius δ around q must have a distance satisfying D(q, p) − δ ≤ D(o, p) ≤ D(q, p) + δ, which means that these data objects fall into a region bounded by a (hyper-)ring centered at p.

Figure 4.1: Methods to enhance pruning ability. (a) More pivots, s = D(q, p2); (b) a closer pivot, s = D(q, p).

When combining more pivots, the region of possible locations of objects is the intersection of the corresponding (hyper-)rings (indicated by darker color in Figure 4.1(a)). Intuitively, the more pivots are used, the smaller the volume of this intersection.

4.2.2 Random Pivots versus Close Pivots

To improve pruning ability, instead of increasing the number of pivots, we can alternatively try to select pivots more carefully. Ideally, we would like to select pivots close to the query object.
The closer a pivot is to the query object, the better its pruning ability. This fact is illustrated in Figure 4.1(b) for a 2-d example: for pivot p, the hyper-ring associated with the closer query object q′ has a much smaller volume than the hyper-ring for the query object q, which is further away from p. The following lemma shows that this increase in volume when increasing D(q, p) is by a factor more than linear if the dimensionality d > 2, which means that the pruning ability of a pivot deteriorates rapidly with increasing distance to the query object in higher dimensional spaces.

LEMMA 4. In a d-dimensional metric space, d > 2, let q be a query object with query radius δ. Without considering boundary effects, the volume of the region that cannot be pruned via the triangle inequality using an object p as a pivot grows super-linearly as s = D(q, p) increases.

Proof. If s > δ, the region that cannot be pruned away is a hyper-ring defined by the inner radius s − δ and the outer radius s + δ. If s ≤ δ, the region is a hypersphere of radius s + δ. Since the volume of a d-dimensional hypersphere with radius R can be computed as V_hypersphere(R, d) = π^(d/2) R^d / g(d), where g(d) = Γ(1 + d/2) and Γ(·) is the gamma function, the volume of the region that cannot be pruned away is

V_r = V_hypersphere(s + δ, d) − V_hypersphere(s − δ, d) = π^(d/2) ((s + δ)^d − (s − δ)^d) / g(d)   if s > δ,
V_r = V_hypersphere(s + δ, d) = π^(d/2) (s + δ)^d / g(d)   if s ≤ δ.

It is easy to see that V_r grows super-linearly as s increases, since the second derivative V_r″ w.r.t. s is greater than zero in both cases, s > δ and s ≤ δ, if d > 2.

In the following, we present a systematic comparison between using a set of random pivots and using closer pivots in terms of pruning ability. We leave the discussion of how to actually find or select close pivots to the next section.

DEFINITION 15 (d-space).
A data space where the objects are uniformly distributed in a d-dimensional hypersphere of radius a is denoted by d-space(a). If a = 1 (unit radius), the space is denoted by d-space.

To measure the pruning ability of pivots, we define the following measure called Random Pivot Power (RPP).

DEFINITION 16 (RPP). When using the triangle inequality to perform pruning, if the probability of a random object o not being pruned away using a particular pivot p is equal to the probability of o not being pruned away using m random pivots, we say that pivot p has a pruning ability of m RPP (Random Pivot Power).

LEMMA 5. In a d-space(a), when performing a range query of radius δ, the asymptotic probability P(m) of a random object not being pruned away by m random pivots is

    P(m) = (erf((δ/a) √(d/2)))^m,

where erf is the Gauss error function.

Proof. Denote the set of random pivots by P and let x and y be two variables following the distance distribution within a d-space. Since the pruning events for different random pivots are independent,

    P(m) = Π_{p_i ∈ P} Pr(|D(q, p_i) − D(o, p_i)| < δ) = (Pr(|x − y| < δ))^m.

When d → ∞, both x and y follow a normal distribution N(µ, σ²), where µ = a√2 and σ² = a²/(2d) [23]. Since x and y are independent of each other, z = x − y also follows a normal distribution, with µ = 0 and σ² = a²/d. Thus,

    P(m) = (∫_{−δ}^{δ} √(d/(2πa²)) e^{−d·t²/(2a²)} dt)^m = (erf((δ/a) √(d/2)))^m.

Using Lemma 5, we can estimate the pruning ability of a particular pivot, depending on its distance to the query object, as stated in the following theorem.

THEOREM 2. For a δ radius query in a d-space, let the asymptotic pruning ability of a pivot be m RPP; then

    m ≥ ln((s + δ)^d − (s − δ)^d) / ln(erf(δ √(d/2)))   if s > δ
    m ≥ ln((s + δ)^d) / ln(erf(δ √(d/2)))               if s ≤ δ.

Proof.
Ignoring boundary effects, the probability of a random object not being pruned away by a single pivot is

    P′ = (V_hypersphere(s + δ, d) − V_hypersphere(s − δ, d)) / V_hypersphere(1, d)   if s > δ
    P′ = V_hypersphere(s + δ, d) / V_hypersphere(1, d)                               if s ≤ δ

which simplifies to

    P′ = (s + δ)^d − (s − δ)^d   if s > δ
    P′ = (s + δ)^d               if s ≤ δ

where V_hypersphere(R, d) is the volume of a d-dimensional hypersphere with radius R and can be computed as in the proof of Lemma 4. If we consider boundary effects, P′ will be even smaller, since some of the volume of the hyper-ring will fall outside the data space. To estimate the pruning ability of one close pivot, let P′ = P(m), where P(m) is the probability computed in Lemma 5 for a = 1. By simple calculation, we obtain the lower bound of m as stated in the theorem.

The lower bound of m is plotted as a function of s and δ in Figure 4.2.

Figure 4.2: The lower bound of m (δ ≤ 0.3). (a) d = 50. (b) d = 100.

In the 2-d illustration of Figure 4.1, it is quite obvious that in most cases using two pivots, even if they are randomly selected, will give better pruning ability than using a single close pivot in a 2-d space. However, as the dimensionality increases to 50, we need as many as 1,000 random pivots in order to get the same pruning ability as a single close pivot. When the dimensionality reaches 100, the number of needed pivots can be as high as 40,000. This analysis shows that when a pivot is close to the query, it may have a tremendous pruning ability, equivalent to using thousands of random pivots. The crucial question is, however, how to select close pivots for all possible queries when the number of pivots is limited?
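The lower bound of Theorem 2 is easy to evaluate numerically; the sketch below (function name is ours) illustrates the gap between a close and a distant pivot in d = 50:

```python
import math

def rpp_lower_bound(s, delta, d):
    """Theorem 2: lower bound on the pruning ability (in RPP) of a pivot
    at distance s = D(q, p) from the query, in the unit d-space."""
    denom = math.log(math.erf(delta * math.sqrt(d / 2)))
    if s > delta:
        num = math.log((s + delta) ** d - (s - delta) ** d)
    else:
        num = math.log((s + delta) ** d)
    return num / denom

# A pivot right next to the query is worth many random pivots;
# a distant pivot is worth almost none.
m_close = rpp_lower_bound(s=0.1, delta=0.1, d=50)
m_far = rpp_lower_bound(s=0.9, delta=0.1, d=50)
```

Both logarithms are negative (probabilities below 1), so the ratio is positive; it collapses toward zero as s grows, matching the shape of Figure 4.2.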
Assuming that query objects follow the same distribution as the objects in the database, each database object that could act as a pivot is typically close to only a very small number of query objects. In a d-space, the asymptotic distribution of the pairwise distances is a normal distribution N(√2, 1/(2d)). For a random object, the portion of neighbors within a radius of δ can be computed as

    ∫_0^δ √(d/π) e^{−d(t−√2)²} dt.

The values for three different dimensionalities are plotted in Figure 4.3.

Figure 4.3: Portion of neighbors within a δ radius, for d = 50, 100, 200.

It shows that most neighbors are uniformly far away. Therefore, any object that is selected as a pivot is only close to a few neighbors, and hence to only proportionally many queries. For the vast majority of objects, it behaves in the same way as a random pivot.

In the traditional schemes of using a fixed number of pivots, this number is usually very small compared to the number of objects in the database. If there is no prior knowledge about the possible locations of query objects, it is very unlikely that a pre-selected pivot will be close to a query and can perform efficient pruning. To overcome this problem, it has been proposed to pre-compute or estimate all pairwise distances between data objects, and to use this complete knowledge of the relations between objects in the data-set to find close pivots. Methods using this approach include AESA [48, 49] and Shasha and Wang's method [41]. However, this approach suffers from the serious drawback of quadratic space as well as quadratic or even cubic time complexity. Therefore, although these methods have been shown to work very well on small data-sets, they are not very practical, and very few methods are based on this approach.
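The integral above can be evaluated in closed form with the Gauss error function; a minimal sketch (helper name is ours; the negligible mass below 0 is ignored):

```python
import math

def portion_within(delta, d):
    """Fraction of objects within distance delta of a random object in the
    unit d-space, using the asymptotic N(sqrt(2), 1/(2d)) distance
    distribution; the lower tail below 0 is negligible."""
    sigma = 1.0 / math.sqrt(2 * d)
    z = (delta - math.sqrt(2)) / (sigma * math.sqrt(2))
    return 0.5 * (1.0 + math.erf(z))

# In d = 100, essentially no neighbors fall within a small radius,
# while most objects lie just past the mean distance sqrt(2).
near = portion_within(0.5, d=100)
far = portion_within(1.5, d=100)
```

This makes the concentration effect quantitative: the distance distribution is so tightly peaked around √2 that a pre-selected pivot is almost never close to a query.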
4.3 Virtual Pivots

In this section, we propose a new class of pivots, called virtual pivots, that exhibit the pruning power of close pivots without suffering quadratic space and time complexity. We first compute the distances between a relatively small number of fixed pivots and all the other data objects, as in the traditional approaches using a fixed number of pivots. Then, we will use these pre-computed distances to approximate the pruning of close pivots using the concept of a virtual pivot.

All data objects except the randomly pre-selected fixed pivots are potential virtual pivots, which are selected at query time from the data-set, based on the given query. In Section 4.4, we will discuss a method for selecting objects from the database as virtual pivots so that they are close to the query object with high probability (if close objects exist). In this section, we describe how virtual pivots can be used to prune objects, which is different from the traditional pruning scheme and requires the following simple lemmata.

LEMMA 6 (Shortest Path Property of a Metric). For a set of n objects with a metric distance function, if we map the objects to the nodes of a complete graph G of size n and set the length of every edge to be the distance between the corresponding two objects, then for any two nodes u and v in G, the shortest path between u and v is the edge uv.

Proof. By induction and the triangle inequality.

If some of the pairwise distances are unknown, then the corresponding edges will be missing. In this case, we can use the lemma below to estimate the upper and lower bounds of those unknown distances.

LEMMA 7 (Upper and Lower Bounds). For a set of n objects with an incomplete distance matrix M, assume that the objects are mapped to the nodes of an undirected graph G of size n and two nodes (x, y) are connected with an edge of length D(x, y) if the distance D(x, y) is in M. Let |.| be the length function.
For any two nodes u and v, if D(u, v) ∉ M, then for any path between u and v in G (if it exists): |path(u, v)| is an upper bound for D(u, v); and 2|st| − |path(u, v)| is a lower bound for D(u, v), where st is any edge on path(u, v).

Proof. The upper bound follows directly from Lemma 6. Also by Lemma 6, |st| = D(s, t) ≤ |path(u, v)| − |st| + D(u, v). Therefore, D(u, v) ≥ |st| − (|path(u, v)| − |st|) = 2|st| − |path(u, v)|.

To understand the pruning using virtual pivots, consider, for example, in nearest neighbor search a query object q that is close to another data object v that has been selected as virtual pivot (and is also a candidate for the nearest neighbor, since it is close to q). Any data object o that is farther away from v than q could easily be pruned away if v were a pivot for which the distance D(v, o) had been pre-computed.

Figure 4.4: Illustration of virtual pivots. Dashed lines represent distances to be estimated. (a) Using fixed pivots that are almost in a line with virtual pivot v and data object o to help pruning; either p1 or p2 can be used. (b) Using virtual pivots: D(q, v) is computed to rule out o. (c) Traditional schemes compute D(q, p1), but not D(q, v).

If the distance D(v, o) is unknown, then we can no longer use the triangle inequality directly to do the pruning. However, we observe that if one of the pre-selected fixed pivots is in a suitable position, then we can use the above lemma to try to exclude object o. An example that illustrates pruning in this nearest neighbor search is depicted in the 2-d example of Figure 4.4(a). There are several fixed pivots, p1, p2 and p3, and there is an object v close to q. p1 and p2 are quite far away from v and o, and they are almost in a line with v and o. Another observation is that the estimations |D(o, p1) − D(p1, v)| and |D(o, p2) − D(p2, v)| are quite close to the actual distance D(o, v).
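Lemma 7 is easy to exercise numerically; a minimal sketch (helper name is ours) computes both bounds from the edge lengths of a known path:

```python
def path_bounds(edge_lengths):
    """Lemma 7: bounds on D(u, v) from any known u-v path in the distance
    graph. Upper bound: the path length |path(u, v)|. Lower bound:
    2|st| - |path(u, v)| for any edge st on the path (take the best)."""
    total = sum(edge_lengths)
    lower = max(2 * e - total for e in edge_lengths)
    return max(lower, 0.0), total

# Path u - s - v with |us| = 1 and |sv| = 4: D(u, v) must lie in [3, 5].
lo, hi = path_bounds([1.0, 4.0])
```

For a two-edge path this reduces to the familiar triangle bounds |D(s, v) − D(u, s)| ≤ D(u, v) ≤ D(u, s) + D(s, v).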
Fixed pivot p3, on the other hand, lies in a direction almost orthogonal to the line between v and o, and the difference between D(o, p3) and D(v, p3) is very small. We observe that the fixed pivots p1 and p2, but not p3, can be used to help prune o. Suppose we are using pivot p1, as illustrated in Figure 4.4(b). By Lemma 7, we can estimate D(q, o) using

    |D(o, p1) − D(p1, v)| − D(v, q) ≤ D(q, o).    (4.1)

Since o is much farther away from v than q, and |D(o, p1) − D(p1, v)| is close to the actual D(o, v), the lower bound of D(q, o) given in the above formula will be larger than D(v, q). Therefore, we can exclude o as a candidate for the nearest neighbor using this virtual pivot.

Our pruning scheme is different from the traditional scheme of using fixed pivots, as illustrated in Figure 4.5 for a query q and a random object o.

Figure 4.5: Pruning schemes.

The traditional scheme always computes the distances between the query object and the fixed pivots, all of which are likely to be very large and have a low probability of excluding o. In our scheme, only the distances between the query and some virtual pivots will be computed. If we are able to select virtual pivots close to the query, these distances will be short and generate a tight upper bound for the nearest neighbor distance, since the virtual pivots are part of the database and hence are also candidates for the nearest neighbors. Although the above discussion used a nearest neighbor search example, the arguments apply to k-nn search as well: we can simply replace the nearest neighbor distance with a k-nearest neighbor distance upper bound, and the pruning is otherwise done in the same way. In our scheme, we will assume that the set of fixed pivots that supports virtual pivots is selected so that the fixed pivots are in general far away from each other (which can be achieved in high-dimensional space by simply selecting them randomly).
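The pruning step of Formula (4.1) can be sketched as follows (helper names are ours); an object o is excluded as soon as its lower bound exceeds the current nearest neighbor distance upper bound:

```python
def vp_lower_bound(d_o_p, d_p_v, d_v_q):
    """Formula (4.1): |D(o, p) - D(p, v)| - D(v, q) <= D(q, o)."""
    return abs(d_o_p - d_p_v) - d_v_q

def prune_with_virtual_pivot(d_o_p, d_p_v, d_v_q, nn_upper_bound):
    """Exclude o from the nearest neighbor candidates if its lower bound
    already exceeds the current upper bound (e.g. D(v, q) itself)."""
    return vp_lower_bound(d_o_p, d_p_v, d_v_q) > nn_upper_bound

# v is close to q (D(v, q) = 0.5), o is far from v, and the fixed pivot p
# is roughly in line with v and o, so |D(o, p) - D(p, v)| is close to
# D(o, v); the lower bound 6.5 exceeds the upper bound and o is pruned.
pruned = prune_with_virtual_pivot(d_o_p=10.0, d_p_v=3.0, d_v_q=0.5,
                                  nn_upper_bound=0.5)
```

Note that only D(v, q) is computed at query time; D(o, p) and D(p, v) come from the pre-computed fixed-pivot distances.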
Given a set of fixed pivots and a virtual pivot v, for effective pruning with v, the question is now how to choose one of the fixed pivots (like p1 in the example) that can be used in combination with v to prune objects. A straightforward solution is suggested by Formula (4.1): ideally, the fixed pivot we want to find should give the tightest lower bound for D(o, v) using |D(o, p1) − D(p1, v)|. All distances needed to solve this problem are pre-computed, and in a naïve way we can simply scan the set of distances to fixed pivots to find the tightest bound for a query object q. Similar to proposals as, e.g., in [20], we can also store the distances associated with every fixed pivot in a B+-tree and perform a range query to find distances like D(o, p1) that can be used to prune objects. If the data-set does not have frequent object deletions and insertions, we can also use sorted lists (or sorted files if the main memory is not large enough to store the distances).

4.3.1 The Pruning Ability of Virtual Pivots

For the following analysis of the pruning ability of virtual pivots, we assume that we have no prior knowledge of the data, only x fixed pivots that are randomly distributed.

THEOREM 3. In a d-space, assume a query radius δ around a query object q. For any virtual pivot v, supported by x fixed pivots, let the asymptotic pruning power of v be y RPP, and let s = D(v, q); then

    y = f(s, δ) · x,  where  f(s, δ) = ln(erf((δ + s) √(d/2))) / ln(erf(δ √(d/2))).    (4.2)

Proof. Any object o with D(q, o) ≤ δ, by Lemma 7, satisfies Formula (4.1). If v were a true pivot, so that all distances between database objects and v were known, then we would be able to exclude all objects except those in the hyper-ring centered at v that contains the sphere centered at q with radius δ (see Figure 4.1(b)). But, since v is not a true pivot, we can only approximate this region using the fixed pivots, in the following precise sense.
Formula (4.1), which is used for the pruning, can be rewritten as |D(o, p1) − D(p1, v)| ≤ D(q, o) + D(v, q). Therefore, since D(q, o) ≤ δ, for all fixed pivots p_i, o must satisfy |D(o, p_i) − D(p_i, v)| ≤ δ + D(v, q). This means that the original problem of finding objects within distance δ around q using v can be transformed into the problem of finding objects within distance δ + D(v, q) around v using the x fixed reference points. Since the distances between v and all the fixed pivots follow the same distribution as the pairwise distances in the d-space, v can be considered as a random query object in this transformed problem. Consequently, the region that cannot be excluded in the original problem is approximated by the intersection of the hyper-rings centered at the fixed pivots. Therefore, the pruning ability of v in the original problem can be estimated as the pruning ability of x random pivots for a query radius δ + D(v, q) and a random query object.

By Lemma 5, the asymptotic probability of a random object not being pruned by y random pivots, for a query object q with query radius δ (by assumption the same as using v for the original problem), is

    P(y) = (erf(δ √(d/2)))^y.

With s = D(v, q), the corresponding probability using the x fixed pivots in the transformed problem, for query object v with query radius δ + s, is

    P(x) = (erf((δ + s) √(d/2)))^x.

To estimate the pruning ability y of a single virtual pivot v supported by the x fixed pivots, we set P(x) = P(y). Then y can be easily obtained by simple transformations, as stated in the theorem (Formula 4.2).

The factor f(s, δ) essentially indicates how much of the pruning ability of the x fixed pivots is "transferred" to a single virtual pivot. The distributions of the factor f(s, δ) in different d-spaces (d = 50, 100) are plotted in Figure 4.6. The distributions show a similar pattern.
When s is small, f is close to 1, which means that the single virtual pivot has almost the same pruning ability as the x random pivots. Note that this also means that we save x − 1 distance computations in the query stage, since we only compute the distance between the virtual pivot and the query object, and not, as in traditional schemes, the distances between all fixed pivots and q.

Figure 4.6: The factor f(s, δ). (a) d = 50. (b) d = 100.

What about more than one virtual pivot? Assume we have z virtual pivots, selected randomly and independently. Since each virtual pivot performs an independent pruning for a given query, the asymptotic probability of a random object not being pruned away by the z virtual pivots is

    (Π_{i=1}^{z} erf((δ + s_i) √(d/2)))^x,

where v_i is the i-th virtual pivot and s_i = D(v_i, q). Similar to the case of one virtual pivot, we can compute the RPP as

    y = f_z(s, δ) · x,  where  f_z(s, δ) = Σ_{i=1}^{z} ln(erf((δ + s_i) √(d/2))) / ln(erf(δ √(d/2))).

The above formula indicates that if we use more than one virtual pivot, it is possible to achieve a pruning ability that is a multiple of that of the total set of x fixed pivots. Since in our method we do not compute the distances between the query object and all the fixed pivots, we can use a relatively large number of fixed pivots. This will increase the runtime of the pre-computing stage, which is a one-time effort. However, in the query stage, the number of distance computations will very likely decrease, since the larger number of fixed pivots will give the virtual pivots a better pruning ability. In the traditional scheme, since it needs to compute the distances between the query object and all the fixed pivots, it can only use a small number of fixed pivots.
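The combined factor f_z can be evaluated the same way; the sketch below (our own helper, illustrative distances) shows how a few close virtual pivots together can exceed the pruning ability of all x fixed pivots:

```python
import math

def f_multi(s_list, delta, d):
    """Combined transfer factor f_z for z virtual pivots at distances
    s_i = D(v_i, q); their joint pruning ability is f_z * x RPP."""
    denom = math.log(math.erf(delta * math.sqrt(d / 2)))
    return sum(math.log(math.erf((delta + s) * math.sqrt(d / 2)))
               for s in s_list) / denom

# One close virtual pivot transfers a fraction of the fixed pivots'
# ability; three of them together exceed it (f_z > 1).
one = f_multi([0.05], delta=0.1, d=100)
three = f_multi([0.05, 0.05, 0.05], delta=0.1, d=100)
```

Since f_z is a sum of per-pivot factors, its value for equal distances is simply z times the single-pivot factor.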
4.4 Boosting Virtual Pivots

In this section, we answer the question of how to select the virtual pivots for any given query object efficiently and effectively. In order to perform efficient k-nn search using virtual pivots, we want the virtual pivots to be as close to the query object as possible. For a typical metric data-set, however, there is no information available about which object could be close to the query object. In the following, we propose a method for boosting the performance of virtual pivots by selecting them one by one, choosing each next virtual pivot using criteria measured with the previously selected pivots.

A similar idea of pivot selection was first mentioned, without analysis, in AESA [48, 49] and then later in Shasha and Wang's paper [41], as part of their search methods using a full matrix of computed or estimated pairwise distances between all objects. Given a query object, their algorithms pick a random object as the first pivot. Starting with the second pivot, they always choose the object with the smallest lower bound estimation as the next pivot. As we have mentioned before, these methods suffer from extremely high pre-computation cost and from quadratic space complexity in both the preprocessing stage and the query stage.

There are two differences between our boosting method and the previous approaches. (1) Our method uses virtual pivots and thus gets rid of the quadratic space complexity. (2) In a high-dimensional space, the distance between an arbitrary query object and a random pivot is almost always very large, as shown in Figure 4.3, which means that the first selected pivot has in general a low pruning ability, and the convergence of the previous methods to a close pivot can be slow when the lower bound estimations are loose. Therefore, instead of using a random object as the first pivot, we propose a method that tries to select a close neighbor of the query already as the first virtual pivot.
The main idea in selecting the first virtual pivot for a given query object q is to utilize a small subset S ⊆ P of the fixed pivots to find a potentially close neighbor of q. Given that the distances between all pivots in P and all database objects have been pre-computed, we can first compute the distances between q and the fixed pivots in S, in order to obtain the lower bound estimation (LBE)

    E_S(q, o) = max_{p ∈ S} {|D(q, p) − D(o, p)|}.

Then, all the LBEs are sorted and the object with the smallest LBE is chosen as the first pivot. The following theoretical analysis will show that this is indeed a very good heuristic for finding an object close to a query. Note, however, that this heuristic by itself cannot be used to rule out objects as possible k-nn objects using the traditional scheme of triangle inequality pruning, since the upper bound estimation of the k-th nearest neighbor distance could be very large. We will call this ability of the LBE to discriminate close neighbors, rather than to rule out other objects, its discriminatory ability.

4.4.1 Why Boosting Works

The reason why boosting works may seem surprising. It is that the discriminatory ability of the lower bound estimation grows so fast (exponentially with the number of used pivots) that the LBE can rank a close neighbor highly in a sorted list. For a random query object q with query radius δ using m random pivots in a d-space, by Lemma 5, the asymptotic probability of a random object o having an LBE smaller than δ is

    Pr = (erf(δ √(d/2)))^m.

Figure 4.7: Probability of having an LBE smaller than δ, for d = 30, 50, 100, 200. (a) m = 30. (b) m = 100.

The values of Pr for several dimensionalities are plotted in Figure 4.7 against different values of δ.
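This probability can be evaluated directly with the Gauss error function; the short sketch below (helper name is ours) reproduces the kind of values discussed next:

```python
import math

def pr_low_lbe(m, delta, d):
    """Asymptotic probability that a random object's LBE over m random
    pivots stays below delta: Pr = erf(delta * sqrt(d/2)) ** m."""
    return math.erf(delta * math.sqrt(d / 2)) ** m

# With m = 100 pivots in d = 50, almost no random object keeps an LBE
# below 0.1, so a genuinely close neighbor rises to the top of the
# sorted list of lower bounds.
p100 = pr_low_lbe(100, delta=0.1, d=50)
expected_impostors = 10_000_000 * p100   # in a 10-million object data-set
```

The exponential decay in m is what makes the LBE ranking discriminative: doubling the number of pivots squares the (already small) per-object probability.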
Figure 4.7 shows that when δ is small, the probability of a random object getting a low LBE is also small, and this probability goes down exponentially as the number of pivots m increases. For instance, assuming m = 100, d = 50 and δ = 0.1, Pr will be no greater than 10^−30. Let the size of the data-set be 10 million. Then the expected number of objects with an LBE smaller than 0.1 is only 10^−23. If we reduce m to 30, this expected number is still only 10^−2. Therefore, if the query object actually has a close neighbor within δ = 0.1, then this object will be the top object in the sorted list of lower bounds. Once we have found this closer-than-random neighbor, our boosting method will use it as a virtual pivot and prune the other objects. If the query object has more than one close neighbor, then, since the close virtual pivot generates lower LBEs for close neighbors than the LBEs generated by random pivots, as shown in Lemma 5 (when a decreases, Pr increases, and the LBEs of close neighbors decrease, since erf is a monotonically increasing function), we can continue to select a pivot that is even closer to the query object.

This analysis also indicates that the boosting can only be effective when the query object actually has some neighbors that are not too far away. This is a reasonable assumption in data-sets where similarity search is performed, since if all objects are far away from the query object, like random noise, it is hard to justify the need to do k-nn search. Therefore, we will assume that the user will, in practice, be able to provide a parameter ǫ so that only k-nearest neighbors within a radius of ǫ to the query object are considered as being relevant. There is no hard limit on the value for ǫ; it only affects the performance of boosting. In the worst case, the boosting will simply select the virtual pivots randomly.

4.5 Advantages of Using Virtual Pivots

Given a virtual pivot v, Formula (4.1) gives a new lower bound estimation |D(p, o) − D(p, v)| − D(q, v) for D(q, o).
This estimation is derived from two adjacent triangles on four objects, and thus might be looser than a single triangle inequality as in Formula (1.1). Nevertheless, such a pruning scheme based on a virtual pivot v is significantly different from the traditional methods using many fixed pivots (i.e., Formula (1.2)), where a much larger number of distances to the query have to be computed in the query stage. Additionally, in the traditional methods, Formula (1.3) gives the estimated upper bounds, which might all be large; while in our VP method, if we are able to select the virtual pivot v to be close to the query q, then D(q, v) will be small and serve as a tighter upper bound for the nearest neighbor distance. To summarize, there are several advantages to using a virtual pivot:

• One can freely choose from the non-pivot data objects as virtual pivots at query time, based on the given query. In our VP method, we choose the one with the smallest estimated lower bound. This query-dependent and dynamic pivot selection will likely enable us to select virtual pivots that are close to the query object.

• The distances between the close virtual pivots and the query will provide a much tighter k-nn upper bound, δ, than the other methods. This upper bound and the new lower bound estimation by Formula (4.1) can more effectively prune away distant data objects.

• In our VP method, the selection of virtual pivots in the query stage relies on only a small number of fixed pivots whose distances to the query are computed. The subsequent lower bound estimation using Formula (4.1) and a larger number of fixed pivots saves a substantial number of distance computations during the query stage. Consequently, one can opt to pre-compute a large set of fixed pivots to improve the effectiveness of virtual pivot pruning by Formula (4.1).

4.6 Algorithm

Preprocessing Algorithm. Select a set of fixed pivots, denoted by P.
Compute the distances between these pivots and all the objects in the data-set. Store the distances in |P| sorted lists. If the main memory is not large enough, store them in |P| sorted files or B+-trees. All data objects except the |P| fixed pivots can be used as virtual pivots.

Query Algorithm. Given a query object q and threshold values v_max (the maximal number of virtual pivots to use) and ǫ (only k-nn objects within ǫ are relevant), perform the following operations.

Step 1. Randomly select a small subset S ⊆ P of fixed pivots. For all f ∈ S, compute D(q, f). Use the traditional scheme of k-nn search to perform pruning and rank the remaining set according to LBEs. Denote the current k-nn distance upper bound by u and the current k-nn candidate set by C. Denote the set of virtual pivots used so far by V.

Step 2. If |V| = v_max, return with C. Otherwise, use the object that has the lowest LBE but is currently not in P ∪ V as the next virtual pivot (denoted by v). If no such object exists, return with C. Otherwise compute D(q, v).

Step 3. For each fixed pivot p ∈ P \ S, perform a range query of radius min{ǫ, u} + D(v, q) in its sorted list/sorted file/B+-tree to find objects with distances close to D(p, v). Let the objects in this query result be Q. Update the lower and upper bounds of the objects in Q ∩ C using Formula (4.1). Update u and C. Go to Step 2.

Let the size of the data-set be n. In the preprocessing stage, our method computes |P| · n distances, so the space complexity is linear. Since sorting them takes O(n log n) time, the total time complexity for preprocessing is O(n log n). In the query stage, for each virtual pivot, the range query takes O(n) time, so the total time complexity is O(|V| · |P| · n). However, the total number of distance computations, which can be computationally very expensive and dominate the runtime, is only |S| + |V|.
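The preprocessing and query algorithms above can be sketched as follows. This is a simplified 1-nn version under stated assumptions (a linear scan in Step 3 instead of sorted lists/B+-trees, no ǫ cut-off, and our own helper names), not the exact thesis implementation; with v_max equal to the data-set size the scan is exhaustive and the result matches a brute-force nearest neighbor:

```python
import math
import random

def preprocess(objects, dist, num_fixed, seed=0):
    """Select random fixed pivots and precompute their distances to all objects."""
    rng = random.Random(seed)
    pivots = rng.sample(range(len(objects)), num_fixed)
    table = {p: [dist(objects[p], objects[i]) for i in range(len(objects))]
             for p in pivots}
    return pivots, table

def nn_query(q, objects, dist, pivots, table, v_max, subset_size):
    """1-nn sketch of the VP query loop: Step 1 ranks by LBE over a pivot
    subset S; Steps 2-3 pick the smallest-LBE object as the next virtual
    pivot and tighten lower bounds via Formula (4.1)."""
    n = len(objects)
    S = pivots[:subset_size]
    d_q = {p: dist(q, objects[p]) for p in S}                    # Step 1
    lb = {i: max(abs(d_q[p] - table[p][i]) for p in S) for i in range(n)}
    best_i = min(d_q, key=d_q.get)
    u = d_q[best_i]                                              # 1-nn upper bound
    used = set(S)
    for _ in range(v_max):                                       # Step 2
        cand = [i for i in range(n) if i not in used and lb[i] <= u]
        if not cand:
            break
        v = min(cand, key=lambda i: lb[i])
        d_v = dist(q, objects[v])
        used.add(v)
        if d_v < u:
            best_i, u = v, d_v
        for p in table:                                          # Step 3
            for i in range(n):
                if i not in used:
                    lb[i] = max(lb[i], abs(table[p][i] - table[p][v]) - d_v)
    return best_i, u

# Toy 2-d data-set with the Euclidean distance standing in for an
# expensive metric.
objs = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0),
        (6.0, 5.0), (10.0, 0.0), (0.0, 10.0)]
piv, tab = preprocess(objs, math.dist, num_fixed=3, seed=1)
best, d_best = nn_query((0.1, 0.1), objs, math.dist, piv, tab,
                        v_max=len(objs), subset_size=2)
```

Note that the only exact distances computed at query time are those to the pivots in S and to the chosen virtual pivots, mirroring the |S| + |V| bound above.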
We remark that in high dimensional data, when ǫ is large, we should use a linear scan rather than a range query in Step 3, since range queries are not effective in such data.

4.7 Summary

In this chapter, we analyzed the pruning ability of pivots and showed that pivots close to a query object have a much better pruning ability than random pivots. We introduced the novel concept of virtual pivots and showed formally that a single virtual pivot can have the same pruning power as a large set of fixed pivots. A theoretical analysis showed how and why we can select virtual pivots close to the query object (if they exist) with high probability.

Our results indicate that the problem of k-nn search in general metric spaces or high dimensional spaces may not be as difficult as we might have thought, as long as the data is not completely random and k-nn search is not completely meaningless. The LBE is a very good heuristic in high dimensional space for finding close neighbors. The actual problem is rather how to rule out far away objects. Our virtual pivots can simulate the pruning of close pivots to address this problem of ruling out far away objects. We conjecture that the reasons why k-nn search has been so challenging in the literature are a combination of: (1) k-nn search is performed to find not very close neighbors; (2) the number of fixed pivots is too small; and (3) methods lack adaptability with respect to the query. Suppose, in the analysis of Section 4.4.1, using 10 random pivots, we have δ = 0.5 instead of δ = 0.1; then Pr = 0.56, which means that 56% of the data objects will have an LBE smaller than δ. The LBE is now not a good heuristic for finding close neighbors. However, if we use 100 random pivots instead of 10, then Pr = 0.0033. This indicates that using a larger number of pivots can be a potential solution when searching for neighbors that are farther away.
However, in the traditional scheme, this is not affordable, since in the query stage we need to compute the distances between the query object and all the pivots. In our "virtual pivot + boosting" method, we can compute a larger set of fixed pivots without increasing the number of distance computations at query time. Only being able to use a small number of fixed pivots limits the effectiveness of traditional approaches, even when searching for close neighbors, in high dimensional data-sets with many very small clusters, since we cannot have a fixed pivot in every cluster. In our method, the virtual pivot selection has been designed to be adaptive to the query, so that every cluster can have a virtual pivot.

Chapter 5

Efficient Similarity Search with Virtual Pivots and Partial Pivots

Modern biological applications usually involve the similarity comparison between two objects, which is often computationally very expensive, as in whole genome pairwise alignment and protein three-dimensional structure alignment. Nevertheless, being able to quickly identify the closest neighboring objects in very large databases for a newly obtained sequence or structure can provide timely hints to its function and more. This chapter presents a substantial speed-up technique for the well studied k-nearest neighbor (k-nn) search, based on the novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. The new method is able to dynamically locate virtual pivots, according to the query, with increasing pruning ability. Using the same or less database preprocessing effort, the new method outperformed the next best method by using no more than 40% of its distance computations per query, on a database of 10,000 gene sequences, in a comparison that included several of the best known k-nn search methods: M-Tree, OMNI, SA-Tree, and LAESA.
We demonstrated the use of this method on two biological sequence datasets, one of which is for HIV-1 viral strain computational genotyping. In the next section, we detail our method, which implements the above two novel ideas for speeding up k-nn search. We show the computational results on the two biological datasets in Section 5.2, where several existing well-known hierarchical and non-hierarchical k-nn search methods are included for comparison. In Section 5.3, we further discuss our method and its application to real-life data. (Some of the material in this chapter has been published in [58].)

5.1 Methods

From Chapter 4, we see that an effective pivot would be one that is close to the query object. However, a pivot can be effective for only a portion of the data objects, namely those that are close to the pivot. It is often difficult to select universally effective pivots, given that they have to be selected during the database preprocessing stage. Assuming that query objects follow the same distribution as the database objects, each database object is typically close to only a small number of other database objects. Therefore, when one is selected as a pivot, it becomes effective for only a small number of query objects. In other words, one pivot is ineffective for the vast majority of query objects. Consequently, if an effective pivot is desired for all query objects, one has to have a large pool of pivots, despite many careful considerations on how pivots should be selected [20, 11, 37].

5.1.1 Partial Pivots

In Chapter 4, we introduced virtual pivots, which enable us to use virtually any data object as a pivot; such a pivot does not itself have pre-computed distances to other data objects and must be supported by a set of traditional fixed pivots (each having distances to all data objects) to perform pruning. In this section, we extend the concept of virtual pivots to use every data object itself as a pivot.
The difference between such a pivot and a fixed pivot is that for every data object we predict a number of its nearest neighbors, using a simplified version of the pairwise ranking technique developed in Chapter 3, and then compute the distances between the data object and these predicted nearest neighbors. The new pivot thus has distances only to a small set of data objects (its predicted close neighbors); in this sense, it is only a partial pivot. Partial pivots are associated with two phases of ranking: in the preprocessing stage, a pairwise ranking step predicts the close neighbors of every data object in order to construct the partial pivots; in the query stage, a ranking with LBE (described in Section 3.3.1) selects pivots that are close to the query object, and our algorithm uses the predicted closest pivots to perform pruning first. The idea behind partial pivots is that, since such a pivot is expected to be close to its predicted nearest neighbors, it will be effective in estimating the lower (and upper) bounds of the actual distances between the query object and these predicted nearest neighbors. All these steps are done in the database preprocessing stage, and we will demonstrate that the additional preprocessing effort pays off through significantly reduced numbers of distance computations during the query stage.

Figure 5.1: Partial pivots can help when there is no pivot close to the query object. Let q be a query object, p a partial pivot far away from q, and oi (i = 1, 2, 3) neighbors of p. |D(q, p) − D(p, oi)| is close to D(q, p) and can be larger than the current k-nn distance upper bound, so that oi will be pruned away.

Even when the query object has no close neighbors, and thus no close pivots at all, partial pivots can still help in the pruning.
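The effect illustrated in Figure 5.1 can be checked numerically. The numbers below are made up for illustration only:

```python
# A partial pivot p lies far from the query q, but knows exact distances
# to its own predicted neighbors o_i (computed during preprocessing).
D_q_p = 100.0                                  # computed at query time
D_p_o = {"o1": 4.0, "o2": 6.0, "o3": 5.5}      # pre-computed D(p, o_i)
delta = 20.0                                   # current k-nn distance upper bound

# Since |D(q,p) - D(p,o_i)| <= D(q,o_i), any o_i whose bound exceeds delta
# is pruned without ever computing D(q, o_i).
pruned = sorted(o for o, d in D_p_o.items() if abs(D_q_p - d) > delta)
```

Here every bound is at least 100 − 6 = 94 > 20, so all three neighbors of p are excluded as k-nn candidates at the cost of the single distance computation D(q, p).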
By the analysis of Chapter 4, a virtual pivot that is far away from the query object can no longer be effective. A partial pivot, however, has exact distances to some of its close neighbors, and can therefore still be used to exclude those neighbors as close objects to the query, as shown in Figure 5.1. From this point of view, partial pivots can have better pruning ability than virtual pivots, since a partial pivot retains some pruning ability even when it is not close to the query object. The pruning ability of a partial pivot is limited by the number of close neighbors predicted for it. If the number of distances associated with a partial pivot is t, then theoretically using partial pivots alone can at most reduce the number of distance computations in the query stage to 1/t of the total size of the data-set. However, in many real-life metric data-sets, such as those studied in our experimental section, performing k-nn queries is so challenging that the performance of most methods cannot approach this limit even for t as small as 20. Our partial pivots are fully integrated with the virtual pivot algorithm described in Section 4.6. In the database preprocessing stage, after the fixed pivots are selected and their distances to all other data objects computed, we estimate, using Formula (1.2), the lower bounds on D(o′, o) for each non-pivot object o and those data objects o′ with yet unknown distances to o. This way, for each non-pivot object o, we can predict a pre-specified number of data objects o′ as its close neighbors using their estimated lower bounds. These objects o′ form the estimated neighborhood of object o, denoted as N(o). Subsequently, we precompute the distances D(o′, o), for o′ ∈ N(o). In the query stage, when o is selected as a virtual pivot, D(q, o′) can be bounded from below better than by Formula (4.1):

|D(o′, o) − D(q, o)| ≤ D(q, o′) ≤ D(o′, o) + D(q, o), ∀ o′ ∈ N(o).
(5.1)

Also, when the pre-specified number of virtual pivots is reached, o can be used for possible pruning of objects in N(o). That is, for any k-nn candidate o, we compute the distance D(q, o) and can use Formula (5.1) to possibly prune away any o′ ∈ N(o) that is also a candidate.

5.1.2 The Complete Algorithm with Virtual and Partial Pivots

Preprocessing Algorithm. Given a database B, randomly select a set of data objects as fixed pivots P ⊆ B and compute the distance between each fixed pivot in P and every data object in B. Using these pre-computed distances, for each data object o in B, rank all the other data objects o′ in B in non-decreasing order of their estimated lower bounds on D(o′, o) by Formula (1.2). If o ∈ P, let N(o) = B; otherwise, compute the distances between o and the top t objects in the ordered list, which form N(o), with t pre-specified.

Query Algorithm. Given a query object q (q may or may not be in B), perform the following steps:

1. (Compute the distances between q and a number of fixed pivots:) Randomly select a subset S ⊆ P of the fixed pivots, and compute the distances between q and all the pivots in S.

2. (Estimate the initial k-nn distance upper bound δ:) Compute the set of upper bounds U = {uD(q,o)}, where

   uD(q,o) = D(q, o), ∀ o ∈ S;
   uD(q,o) = min_{p ∈ S} {D(p, q) + D(p, o)}, ∀ o ∈ B \ S.

   Set δ to the k-th smallest value in U.

3. (Select the first virtual pivot:) For each data object o ∈ B, estimate its lower bound as

   lD(q,o) = D(q, o), ∀ o ∈ S;
   lD(q,o) = max_{p ∈ S} {|D(p, q) − D(p, o)|}, ∀ o ∈ B \ S.

   If lD(q,o) > δ, object o is excluded from the k-nn candidate list C, which is initialized to B and (always) sorted in non-decreasing order of estimated distance lower bounds. Select the non-pivot object in C with the smallest lower bound as the first virtual pivot v. Add v to the set of virtual pivots V.

4. (Update δ:) Compute D(q, v). If D(q, v) > δ, remove v from C.
Otherwise, update both uD(q,v) and lD(q,v) to D(q, v), and update δ to the k-th smallest value in U.

5. (Pruning using the virtual pivot v:) For each non-pivot k-nn candidate o ∈ C \ N(v),

   lD(q,o) = max{ lD(q,o), max_{p ∈ P \ S} { |D(p, o) − D(p, v)| − D(q, v) } };

   for each non-pivot k-nn candidate o ∈ C ∩ N(v),

   lD(q,o) = max{ lD(q,o), max_{p ∈ P \ S} { |D(p, o) − D(p, v)| − D(q, v) }, |D(o, v) − D(q, v)| }.

   If lD(q,o) > δ, remove o from the k-nn candidate list C.

6. If |C| = k or there is no more virtual pivot to select, return the list of k-nn objects. If |V| reaches the pre-specified limit, go to the next step; otherwise, select the non-pivot object in C with the smallest lower bound as the next virtual pivot v, add v to V, and go to Step 4.

7. (Pruning using partial pivots:) Repeat the following while |C| > k:

   (a) Select the non-pivot object in C with the smallest lower bound as the next partial pivot o. If lD(q,o) > δ, return the k objects in C with the smallest distances to the query; otherwise, compute D(q, o).

   (b) Update uD(q,o) to D(q, o). For each o′ ∈ N(o) ∪ P, update uD(q,o′) using Formula (5.1): uD(q,o′) = min{ uD(q,o′), D(o′, o) + D(q, o) }. Update δ to the k-th smallest value in U.

   (c) Update lD(q,o) to D(q, o), and if lD(q,o) > δ, remove o from C. For each o′ ∈ N(o) ∪ P, update lD(q,o′) using Formula (5.1): lD(q,o′) = max{ lD(q,o′), |D(o′, o) − D(q, o)| }, and if lD(q,o′) > δ, remove o′ from C.

5.2 Results

We tested our k-nn search method, denoted as VP, on two biological sequence datasets. The first dataset consists of 1,198 human immunodeficiency virus type 1 (HIV-1) complete viral sequences studied by Wu et al. [52] (with an average sequence length of 9,103 bases). The second dataset contains 10,000 HA gene sequences of influenza (FLU) virus (encoding hemagglutinin) downloaded from NCBI GenBank [1] (with an average sequence length of 1,480 bases).
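The query procedure of Section 5.1.2 can be sketched in a much-simplified form: all fixed pivots play the role of S, and the virtual/partial pivot bookkeeping of Steps 3–7 is collapsed into a single scan of the candidates in lower-bound order. All names and the toy 1-D metric are illustrative, not the thesis implementation:

```python
def knn_query(q, objects, pivots, d_po, dist, k):
    """Simplified pivot-based exact k-nn: prune with triangle-inequality
    bounds, then scan survivors in lower-bound order.
    d_po[p][o] holds the pre-computed distance D(p, o)."""
    d_qp = {p: dist(q, p) for p in pivots}            # Step 1
    lb, ub = {}, {}
    for o in objects:
        if o in d_qp:                                 # pivots: exact distance known
            lb[o] = ub[o] = d_qp[o]
        else:
            lb[o] = max(abs(d_qp[p] - d_po[p][o]) for p in pivots)
            ub[o] = min(d_qp[p] + d_po[p][o] for p in pivots)
    delta = sorted(ub.values())[k - 1]                # Step 2: initial k-nn radius
    exact = dict(d_qp)
    for o in sorted(objects, key=lb.get):             # Steps 3-7, collapsed
        if lb[o] > delta:
            break                                     # later candidates only worse
        if o not in exact:
            exact[o] = dist(q, o)
            if len(exact) >= k:                       # tighten delta with exact values
                delta = min(delta, sorted(exact.values())[k - 1])
    return sorted(exact, key=exact.get)[:k]

# Toy usage: points on a line, two fixed pivots.
dist = lambda a, b: abs(a - b)
objects = [0.0, 1.0, 2.5, 7.0, 9.0]
pivots = [0.0, 9.0]
d_po = {p: {o: dist(p, o) for o in objects} for p in pivots}
nn = knn_query(2.0, objects, pivots, d_po, dist, k=2)
```

The scan order matters: because candidates are visited in non-decreasing lower-bound order and δ only shrinks, the first candidate whose lower bound exceeds δ lets the whole remaining tail be pruned at once.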
For each of these two datasets, we randomly removed 100 sequences and used them as query objects to perform k-nn searches on the remaining objects. All reported performance measurements are averages over these 100 queries. We compared our VP method with several hierarchical and non-hierarchical k-nn search methods, with respect to the average number of distance computations and the average running time per query. Note that we tuned our VP method so that its preprocessing effort, in terms of the number of pre-computed distances, was less than that of the other methods. 1) The sequential scan, denoted by SeqScan, is the baseline method that computes the distances between the query and all database objects. 2) The naïve non-hierarchical pivot method, denoted by FP, uses a number of randomly selected fixed pivots. In the query stage, FP computes the distances between the query object and all the fixed pivots, and uses Formulas (1.2–1.3) for pruning. Lastly, for each k-nn candidate, FP computes its distance to the query, updates the k-nn distance upper bound, and uses this bound to further prune objects from the candidate list. 3) The non-hierarchical pivot method by Filho et al. [20, 45], denoted by OMNI, selects pivots carefully so that they are far away from each other; in the query stage, it performs the same as FP. 4) The fourth method is M-Tree [13]. 5) The fifth method is SA-Tree [34]. 6) The sixth method is LAESA [37]. Brief descriptions of the last three methods are included in the related work section of Chapter 4. All six methods are implemented in C++. The code for M-Tree was downloaded from its authors' website, and the code for LAESA was provided by its authors. All experiments were performed on an AMD Opteron 2.2GHz CPU with 5GB of RAM.

5.2.1 Results on the HIV-1 Dataset

Wu et al.
studied the HIV-1 viral strain computational genotyping problem, in which 42 pure-subtype HIV-1 strains were used as references and the task is to assign a subtype or a recombination to any given HIV-1 viral strain [52]. In total, 1,198 HIV-1 complete viral strains were used in the study; their average length is around 9,000 nucleotides. These strains form our first dataset, and they are classified into 11 pure subtypes (A, B, C, D, F, G, H, J, K, N, O) and 18 recombinations. One computational genotyping method is, given a query strain, to identify the k most similar strains in the dataset and then use their subtypes to assign a subtype to the query through a majority vote. Similarity or dissimilarity between two genomic sequences can be defined in many ways. In this work we test two of them: the global alignment distance [35] and the Complete Composition Vector (CCV) based Euclidean distance, the latter of which was adopted by Wu et al. [52]. In calculating global alignment distances, we adopted the scoring scheme used in Mega BLAST [55], which is designed for aligning nucleotide sequences that differ slightly. In calculating the CCV-based Euclidean distances, every HIV-1 viral strain is represented as a high-dimensional vector, in which each entry records essentially the amount of evolutionary information carried by a nucleotide string; we considered all nucleotide strings of length 21 or less, as suggested in [52]. The distance between two HIV-1 viral strains is taken as the Euclidean distance between the two corresponding vectors, which have more than two million dimensions. Both distance functions are metric. Computing the global alignment distance and the CCV-based Euclidean distance between two HIV-1 viral strains took about 1.6 and 2.5 seconds on average, respectively.

Overall Performance

In FP, OMNI, and LAESA, 100 fixed pivots were selected and their distances to all data objects were computed.
This way, they had equal preprocessing efforts. In our VP method, only |P| = 80 fixed pivots were randomly selected, and similarly their distances to all data objects were computed. Afterwards, t = 20 close neighbors of each non-pivot object were predicted using these 80 pivots and the triangle inequality, and subsequently the distance between each non-pivot object and every one of its predicted close neighbors was computed. Therefore, VP had no more preprocessing effort than FP, OMNI, and LAESA, in terms of the number of distance computations. In the query stage, for each query object, VP randomly selected and used only 5 of the 80 fixed pivots (i.e., |S| = 5), and the number of virtual pivots to be selected, |V|, was set to 10. For M-Tree, the input parameters were set according to its original paper [13]. For SeqScan and SA-Tree, there were no parameters to set. Obviously, for each query, SeqScan always made exactly 1,098 distance computations, independent of the value of k. The other six methods, which all take advantage of the triangle inequality, made fewer distance computations, and the detailed numbers of distance computations depended on the query HIV-1 viral strain as well as on the value of k.

Figure 5.2: The average numbers of distance computations per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10: (a) using the global alignment distance, (b) using the CCV-based Euclidean distance.

Figure 5.2(a) (respectively, Figure 5.2(b)) plots the average numbers of global alignment (respectively, CCV-based Euclidean) distance computations per query by these seven methods, over the 100 k-nn queries, where k = 1, 3, 5, 10.
With respect to k, each of the six methods other than SeqScan made more distance computations as k increased. For each value of k, the plots show the following general performance order: our VP method performed the best, followed by LAESA, FP, SA-Tree/OMNI, M-Tree, and SeqScan. It was surprising to see that, for all four values of k, M-Tree, OMNI, and SA-Tree made more distance computations per query than FP, which simply used 100 randomly selected fixed pivots for pruning. On the other hand, unsurprisingly, our VP method consistently outperformed all the other methods; specifically, it was at least 17.7% better than the next best method, LAESA. For instance, for 1-nn search, M-Tree made 523 global alignment distance computations, OMNI made 418, SA-Tree made 387, FP made 351, and LAESA made 213, but our VP method made only 181.

Figure 5.3: The average runtime per query by all seven k-nn search methods on the HIV-1 dataset, for k = 1, 3, 5, 10: (a) using the global alignment distance, (b) using the CCV-based Euclidean distance.

The actual runtime per query of a k-nn search method includes the time for distance computations and possibly the time for triangle inequality pruning and other operations, and thus it is not necessarily proportional to the number of distance computations. Nevertheless, the average runtime per query that we collected for each of the seven methods, over the 100 k-nn queries, correlated quite well with the average number of distance computations per query, as shown in Figure 5.3.
Again, for each value of k, the plots show the same general performance order: our VP method performed the best, followed by LAESA, FP, SA-Tree/OMNI, M-Tree, and SeqScan.

Varying Preprocessing Efforts

We dropped SeqScan from this comparison since it performed the worst and its preprocessing effort does not affect its query performance. As mentioned above, the input parameters of M-Tree were set according to its original paper, and SA-Tree has no parameter to tune. For the other four methods, VP, LAESA, FP, and OMNI, we can tune their parameters so that they invest different amounts of preprocessing effort, which subsequently affects their query performance. For LAESA, FP, and OMNI, the parameter is the number of fixed pivots; for our VP method, the parameters are the number of fixed pivots, the number of predicted close neighbors of each data object, the number of fixed pivots used in the query stage, and the number of virtual pivots allowed in the query stage. As shown in Figure 5.4, M-Tree always pre-computed 81,910 global alignment distances on the database of 1,098 HIV-1 viral strains, in 134,332 seconds, and SA-Tree pre-computed 42,142 global alignment distances in 69,112 seconds. We tested four different amounts of preprocessing effort for the other four methods and their query performance in 1-nn search. We set the numbers of fixed pivots in OMNI, FP, and LAESA to 40, 50, 75, and 100, such that their preprocessing efforts fell in the same range as those of M-Tree and SA-Tree. Correspondingly, we set the number of fixed pivots in our VP method to 10, 20, 50, and 75, and kept (|S|, t, |V|) = (5, 20, 10). Figure 5.4 shows how the varying preprocessing efforts affected the average number of distance computations and the average runtime per query for VP, LAESA, FP, and OMNI in 1-nn search.
In these four settings, the preprocessing efforts of VP were always less than those of OMNI, FP, and LAESA, yet VP performed better than them in the query stage. For instance, in the case where the numbers of fixed pivots in OMNI, FP, and LAESA were set to 40 (the leftmost point for each method), our VP method computed 32,669 global alignment distances in 53,577 seconds in the preprocessing. This effort was the smallest by comparison: LAESA computed 43,139 global alignment distances in 70,747 seconds, FP computed 43,140 in 70,749 seconds, and OMNI computed 44,198 in 72,484 seconds (M-Tree and SA-Tree as above). Nevertheless, VP computed only 198 global alignment distances per query (in 241 seconds), while LAESA computed 238 per query (in 293 seconds), FP computed 347 per query (in 419 seconds), SA-Tree computed 387 per query (in 453 seconds), OMNI computed2 441 per query (in 536 seconds), and M-Tree computed 523 per query (in 625 seconds).

Figure 5.4: Performance measures per query by the six k-nn search methods, with different amounts of preprocessing effort, on the HIV-1 dataset using the global alignment distance: (a) the average numbers of distance computations, (b) the average runtime.

5.2.2 Results on the HA Gene Dataset

All seven k-nn search methods were also compared on the FLU HA gene dataset, which contains 10,000 HA gene sequences.
In this section we report only the results using the CCV-based Euclidean distance (the results using the global alignment distance were similar).

2 The author of OMNI suggested using fewer pivots for this experiment. However, since reducing the number of pivots does not increase the pruning ability of OMNI, in the best case OMNI can only reduce its per-query cost by the number of distances computed between the query object and the pivots, i.e., from 441 to 401 distance computations, a performance that is still worse than that of SA-Tree.

Figure 5.5: The average numbers of distance computations per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10: (a) smaller preprocessing effort, (b) bigger preprocessing effort.

Figures 5.5 and 5.6 show the query performance of the seven k-nn search methods on the HA dataset, for k = 1, 3, 5, 10. We plotted only two different amounts of preprocessing effort for those k-nn search methods whose preprocessing efforts can be tuned. For M-Tree, we used its default parameter setting; for OMNI, FP, and LAESA, the numbers of fixed pivots were set to 100 and 200, respectively; correspondingly, for our VP method, the numbers of fixed pivots were set to 75 and 150, with (|S|, t, |V|) = (5, 20, 10). We distinguish these two settings as the smaller and the bigger preprocessing efforts.
Figure 5.6: The average runtime per query by all seven k-nn search methods with two different amounts of preprocessing effort, on the HA dataset, for k = 1, 3, 5, 10: (a) smaller preprocessing effort, (b) bigger preprocessing effort.

Table 5.1: The numbers of distance computations and the runtime (in seconds) spent by the six k-nn search methods in the smaller preprocessing setting, and their resultant query performance in 1-nn search, in terms of the average number of distance computations and the average runtime (in seconds) per query.

            Preprocessing               Per query
Method      #dist comp's  runtime       #dist comp's  runtime
VP          960,705       263,079       485           194
SA-Tree     1,682,030     460,383       1,213         373
FP          985,049       269,614       1,462         468
LAESA       985,049       269,614       2,166         716
OMNI        985,049       269,614       2,388         743
M-Tree      1,280,537     350,492       4,950         2,178

Table 5.1 shows that in the smaller preprocessing setting, our VP method spent the least effort among the six methods, in both the number of distance computations and the actual runtime (SeqScan is excluded since it has no preprocessing stage). Nevertheless, both the table and Figures 5.5(a) and 5.6(a) show that our VP method outperformed the other five methods (and SeqScan) in the query stage, in both the average number of distance computations and the actual runtime per query. Furthermore, the results show a slightly different performance tendency from that on the HIV-1 dataset: for each value of k, our VP method performed the best, followed by SA-Tree/FP, LAESA, OMNI, M-Tree, and SeqScan; that is, LAESA performed worse than SA-Tree and FP on this FLU HA gene dataset. Our VP method can be 150% better than the next best method, SA-Tree. For instance, for a 1-nn search with the smaller preprocessing efforts (cf.
Table 5.1), the next best method, SA-Tree, made 1,213 CCV-based distance computations per query, but our VP method made only 485.

5.3 Discussion

5.3.1 Query Performance Dependence on the Effectiveness of Pivots

We discussed earlier that the effectiveness of pivots is extremely important for the success of a query. Essentially, one effective pivot, fixed or virtual, can estimate the distances between the query object and the non-pivot objects accurately enough that a large portion of the database can be pruned away, avoiding distance computations.

Table 5.2: The distance intervals associated with the five bins and the numbers of query objects therein.

       HIV-1-GA                    HIV-1-CCV                 HA-CCV
Bin    Distance interval  #Obj.    Distance interval  #Obj.  Distance interval  #Obj.
1      [0, 0.08)          25       [0, 50)            8      [0, 20)            51
2      [0.08, 0.16)       11       [50, 100)          9      [20, 40)           34
3      [0.16, 0.24)       43       [100, 150)         9      [40, 60)           10
4      [0.24, 0.32)       20       [150, 200)         12     [60, 80)           4
5      [0.32, 0.4)        1        [200, 250)         62     [80, 100)          1

The reported performance measurements on the two biological sequence datasets for the six k-nn search methods (excluding SeqScan) that use triangle inequality pruning were obtained on the 100 randomly pre-selected query objects. Certainly, whether or not there exists an effective pivot for each of them affects the query performance of these six methods. In fact, the most effective pivots for these 100 query objects have various levels of effectiveness. To study the relation between the query performance of the different methods and the level of effectiveness of pivots, we calculated the nearest neighbor (i.e., 1-nn) distance of each of the 100 query objects, and divided them into 5 bins of equally wide distance intervals. The nearest neighbor distance associated with a query object measures the level of effectiveness of the most effective pivot for that query.
Let HIV-1-GA denote the HIV-1 dataset using the global alignment distance; likewise, let HIV-1-CCV and HA-CCV denote the HIV-1 and HA gene datasets using the CCV-based Euclidean distance, respectively. Table 5.2 lists the distance intervals associated with the five bins and the numbers of query objects therein, for HIV-1-GA, HIV-1-CCV, and HA-CCV. Separately for each bin, we collected the average numbers of distance computations per query by all six methods in a 1-nn search, plotted in Figure 5.7. In this experiment, the input parameters of the M-Tree method were set according to its original paper; FP, OMNI, and LAESA were set to select 100 fixed pivots; and VP was set to use a lower preprocessing effort than all the other five methods, with (|P|, |S|, t, |V|) = (80, 5, 20, 10) for HIV-1-GA and HIV-1-CCV, and (|P|, |S|, t, |V|) = (75, 30, 20, 10) for HA-CCV. From these plots, one can see that the performance of all six methods declined as the average nearest neighbor distance associated with the query objects increased. Also, our VP method almost always outperformed the other five methods, except for the middle bin of HIV-1-CCV. The plots in Figure 5.7 also show that the performance of some methods deteriorates faster than that of others. For instance, on the HA-CCV dataset (Figure 5.7(c)), LAESA outperformed M-Tree and OMNI when the average nearest neighbor distance associated with the query objects was small (the first two bins), but gradually lagged behind as the average nearest neighbor distance increased (the last two bins). Table 5.2 and Figure 5.7(b) together explain why all six methods performed relatively poorly on HIV-1-CCV: most query objects (62 out of 100) were in Bin 5, that is, they have large nearest neighbor distances and consequently low levels of pivot effectiveness, fixed or virtual. Nevertheless, our VP method still outperformed the other five methods on this bin.

5.3.2 Query Performance w.r.t.
Ranking

Ranking plays an important role in our algorithm. In the preprocessing stage, our algorithm uses pairwise ranking to construct partial pivots; at the beginning of the query stage, it uses a naïve ranking with |S| fixed pivots to select data objects close to the query as virtual pivots. To test the performance of our algorithm with respect to ranking, we applied it on the HIV-1-GA data-set in different scenarios, using virtual pivots only or using both virtual pivots and partial pivots, with |P| set to 80.

Figure 5.7: The average numbers of distance computations per query by all six methods in 1-nn search for five bins of query objects with different nearest neighbor distance ranges, on three datasets: (a) HIV-1-GA, (b) HIV-1-CCV, (c) HA-CCV.

Figure 5.8(a) shows the result when using virtual pivots only. For the two curves in the plot, with |S| varied from 0 to 80, the number of virtual pivots |V| was set to 0 and 10, respectively. When |V| = 0, our algorithm uses only the |S| fixed pivots for pruning. The result shows that the number of distance computations drops dramatically as |S| increases from 1 to 30; after that, the curves are almost flat. The result also indicates that using just 10 virtual pivots consistently yields better performance than using no virtual pivots.
Figure 5.8(b) shows the result when using both virtual pivots and partial pivots, under the same setting as the previous experiment.

Figure 5.8: Performance on HIV-1-GA when the number of fixed pivots used to perform ranking, |S|, increases from 1 to 80: (a) with virtual pivots only, (b) with virtual pivots and partial pivots.

Similar to Figure 5.8(a), the number of distance computations decreases quickly as |S| increases from 0 to 5; after that, increasing |S| only deteriorates the performance slightly. The result also shows that on this data-set virtual pivots are superseded by partial pivots, as using 0 or 10 virtual pivots makes no significant difference in performance. We also tested our algorithm on the larger HA-CCV data-set, using a similar setting (|P| = 75). The results, shown in Figure 5.9, are similar to those of Figure 5.8.

Figure 5.9: Performance on HA-CCV when |S| increases from 1 to 70: (a) with virtual pivots only, (b) with virtual pivots and partial pivots.

5.3.3 Query Performance w.r.t. t

We also studied how the query performance of our algorithm is affected by the parameter t, the number of close neighbors predicted for each partial pivot. Figure 5.10 shows the performance of our method as t increases from 0 to 50. For the other parameters, on the HIV-1-GA data-set (|P|, |S|, |V|) = (80, 5, 10), and on the HA-CCV data-set,
(|P|, |S|, |V|) = (75, 30, 10). Both plots in Figure 5.10 show a similar trend: the number of distance computations decreases as t increases, until it levels off around t = 10. The results also show that t has a bigger effect on the larger HA-CCV data-set.

5.3.4 Time and Space Complexity

Given a set of data objects B, a set of randomly selected fixed pivots P, and a constant t specifying the number of predicted close neighbors of each data object, in the preprocessing stage our VP method computes (|P| · |B| + t · |B|) distances, with an overhead of O(|P| · |B| + t · |B| · log |B|) for storing, sorting, and processing the computed distances, when the prediction of the t close neighbors of each data object is done by a frontier search [57]. The space complexity is O(|P| · |B| + t · |B|). In the query stage, with |V| virtual pivots allowed, the time complexity is O(|V| · |P| · |B| + t · |B| + |B| · log |B|) for the distance lower and upper bound estimations and for sorting the candidate list, in addition to the time for distance computations. The space complexity is O(|P| · |B| + t · |B|), mostly for storing the pre-computed distances for query processing.

Figure 5.10: Performance when the number of predicted close neighbors of each data object, t, increases from 1 to 50: (a) HIV-1-GA, (b) HA-CCV.

As pointed out in [34], there are extreme datasets (such as the query objects in Bin 5 of the HIV-1-CCV dataset) on which no pivot can provide any distance information about data objects whose distances to the query are not yet known. Our VP method would also fail to work efficiently on them, and in the worst case the distances between the query and all data objects have to be computed.
A precise theoretical analysis of our VP method under reasonable distance distributions, such as the one in [34], is one of our focuses for future work.

5.3.5 Distance Distributions

Based on a “homogeneous” distance distribution model, Ciaccia et al. [14] provided a theoretical analysis of the CPU cost spent on distance computations and of the I/O overhead. Navarro [34] started from Delaunay graphs and proposed a simplified, uniformly distributed distance model to analyze the SA-Tree. It should be noted that both assumptions are about the relative distances, not the spatial distribution of the data objects. For the complete HIV-1 dataset of 1,198 viral strains, we have calculated all pairwise global alignment distances and all pairwise CCV-based Euclidean distances. These distances are plotted separately in Figure 5.11. It is difficult to determine which distribution these two sets of distances follow.

[Figure 5.11: The distributions of all pairwise global alignment distances (x-axis: edit distance) and all pairwise CCV-based Euclidean distances (x-axis: CCV-based Euclidean distance) on the complete HIV-1 dataset of 1,198 viral strains.]

5.3.6 HIV-1 Genotyping Accuracy

The 1,198 HIV-1 strains include 867 pure subtype strains (70 A, 268 B, 419 C, 54 D, 10 F, 13 G, 3 H, 2 J, 2 K, 5 N, and 21 O) and 331 recombinants (196 CRF01AE, 52 CRF02AG, 3 CRF03AB, 3 CRF04CPX, 3 CRF05DF, 8 CRF06CPX, 7 CRF07BC, 4 CRF08BC, 5 CRF09CPX, 3 CRF10CD, 10 CRF11CPX, 10 CRF12BF, 6 CRF13CPX, 7 CRF14BG, 5 CRF1501B, 2 CRF16A2D, 4 CRF18CPX, and 3 CRF19CPX). Using the majority vote, we examined the subtyping accuracies by k-nn search with k = 1, 3, 5. In more detail, when k = 1, each of the 1,198 strains was used as a query against the other 1,197 strains. The query strain was assigned the subtype of its nearest neighbor.
When the global alignment distance was used, 1,188 queries, or 99.17%, were assigned the correct subtype or recombination; when the CCV-based Euclidean distance was used, 8 more queries, summing up to 99.83%, were assigned the correct subtype or recombination. The only two strains genotyped incorrectly were DQ207943, assigned to the AB recombination instead of pure subtype B, and AY771588, assigned to pure subtype B instead of the BF recombination.

When k = 3 (5, respectively), the subtypes and recombinations consisting of 3 (5, respectively) or fewer strains were removed from consideration. That is, only 8 (7, respectively) pure subtypes and 12 (8, respectively) recombinations were used, which include 1,174 (1,151, respectively) strains. When the global alignment distance was used, 1,160 (1,136, respectively) queries, or 98.81% (98.70%, respectively), were assigned the correct subtype or recombination; when the CCV-based Euclidean distance was used, 6 (11, respectively) more queries, summing up to 99.32% (99.65%, respectively), were assigned the correct subtype.

Table 5.3: The HIV-1 computational genotyping accuracies by k-nn search and majority vote, k = 1, 2, 3, 4, 5.

Distance function    k=1        k=2        k=3        k=4        k=5
HIV-1-GA             99.165%    99.328%    98.807%    98.799%    98.696%
HIV-1-CCV            99.833%    99.832%    99.318%    99.313%    99.652%

Table 5.3 summarizes the HIV-1 computational genotyping accuracies by k-nn search coupled with majority vote, for k = 1, 2, 3, 4, and 5. From these five pairs of genotyping accuracies, one can see that the accuracy obtained with the CCV-based Euclidean distance was always higher than the corresponding one obtained with the global alignment distance. In this sense, the CCV representation of the whole genomic sequences and the CCV-based Euclidean distance seem to better capture the evolutionary information embedded in the whole genomic sequences.
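The k-nn genotyping with majority vote evaluated in this section can be sketched as follows. This is a simplified illustration (names and tie-breaking are ours), not the thesis implementation; it assumes the expensive distances from the query strain to all database strains are already available:

```python
from collections import Counter

def knn_majority_vote(dists, subtypes, k=1):
    """Assign the query the most frequent subtype among its k nearest
    database strains. dists[i] is the distance from the query to strain i
    (e.g. global alignment or CCV-based Euclidean distance); subtypes[i]
    is the known subtype of strain i. Ties break toward the subtype
    encountered first among the k neighbors."""
    order = sorted(range(len(dists)), key=dists.__getitem__)
    votes = Counter(subtypes[i] for i in order[:k])
    return votes.most_common(1)[0][0]
```

In the leave-one-out evaluation described above, each strain in turn serves as the query, with its own entry excluded from `dists` and `subtypes`.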
5.4 Summary

This chapter presents a substantial speed-up technique for the well-studied k-nearest neighbor (k-nn) search, based on the novel concepts of virtual pivots and partial pivots, such that a significant number of the expensive distance computations can be avoided. Using the same amount of database preprocessing effort, the new method performed substantially fewer distance computations during the query stage, compared to several of the best known k-nn search methods, including M-Tree, OMNI, SA-Tree and LAESA. We demonstrated its performance on two real-life data-sets of varied sizes, and its use in HIV-1 subtyping, where the subtype of the query viral strain is assigned through a majority vote among its most similar strains with known subtypes, and the similarity is defined by the global alignment distance between two whole strains or the Euclidean distance between the Complete Composition Vectors of the whole strains.

Chapter 6

Conclusions and Future Work

The rapidly increasing amount of biological sequence and structure data makes it difficult or even impossible to apply traditional database searching and mining approaches. This is especially true when the distance function used to measure the dissimilarity between biological objects is expensive to compute. In this thesis, we studied the challenges that arise when performing clustering or similarity search on biological sequence and structure data using an expensive distance function. We identified the central issue as avoiding expensive distance computations, and proposed several novel solutions, including directional extent in non-vector data bubbles, pairwise ranking, virtual pivots and partial pivots. In-depth theoretical analysis and extensive experimental results confirmed the superiority of our methods over the previous approaches. During our study, we have observed the potential of ranking, in particular pairwise ranking, for clustering and similarity search.
The technique is embedded into several of the algorithms we proposed, and has been shown to play a significant role in the success of our methods (see the experimental sections of Chapter 3 and Chapter 5). Ranking helps derive an approximate solution to our problem, and can be computed efficiently in terms of both the number of distance computations and the overhead (see Chapter 3). Whether and how this (pairwise) ranking technique can be applied to other searching and mining problems is an interesting direction for future research.

Bibliography

[1] www.ncbi.nlm.nih.gov/genomes/FLU/Database/select.cgi.
[2] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25:3389–3402, 1997.
[3] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander. OPTICS: Ordering points to identify the clustering structure. In Proceedings of the 1999 ACM SIGMOD International Conference on Management of Data (SIGMOD’99), pages 49–60, 1999.
[4] V. Athitsos, M. Hadjieleftheriou, G. Kollios, and S. Sclaroff. Query-sensitive embeddings. In Proceedings of the 2005 ACM SIGMOD International Conference on Management of Data (SIGMOD’05), pages 706–717, 2005.
[5] R. A. Baeza-Yates, W. Cunto, U. Manber, and S. Wu. Proximity matching using fixed-queries trees. In Proceedings of the 5th Annual Symposium on Combinatorial Pattern Matching (CPM’94), pages 198–212, 1994.
[6] S.-A. Berrani, L. Amsaleg, and P. Gros. Approximate searches: k-neighbors + precision. In Proceedings of the 2003 Conference on Information and Knowledge Management (CIKM’03), pages 24–31, 2003.
[7] K. S. Beyer, J. Goldstein, R. Ramakrishnan, and U. Shaft. When is “nearest neighbor” meaningful? In Proceedings of the 7th International Conference on Database Theory (ICDT’99), pages 217–235, 1999.
[8] P. S. Bradley, U. M. Fayyad, and C. Reina. Scaling clustering algorithms to large databases.
In Proceedings of the Fourth International Conference on Knowledge Discovery and Data Mining (KDD’98), pages 9–15, 1998.
[9] M. M. Breunig, H.-P. Kriegel, P. Kröger, and J. Sander. Data bubbles: Quality preserving performance boosting for hierarchical clustering. In Proceedings of the 2001 ACM SIGMOD International Conference on Management of Data (SIGMOD’01), pages 79–90, 2001.
[10] W. A. Burkhard and R. M. Keller. Some approaches to best-match file searching. Communications of the ACM, 16:230–236, 1973.
[11] B. Bustos, G. Navarro, and E. Chávez. Pivot selection techniques for proximity searching in metric spaces. Pattern Recognition Letters, 24:2357–2366, 2003.
[12] E. Chávez, G. Navarro, R. A. Baeza-Yates, and J. L. Marroquín. Searching in metric spaces. ACM Computing Surveys, 33:273–321, 2001.
[13] P. Ciaccia, M. Patella, and P. Zezula. M-tree: An efficient access method for similarity search in metric spaces. In Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB’97), pages 426–435, 1997.
[14] P. Ciaccia, M. Patella, and P. Zezula. A cost model for similarity queries in metric spaces. In Proceedings of the 17th ACM SIGACT-SIGMOD-SIGART Symposium on Principles of Database Systems (PODS’98), pages 59–68, 1998.
[15] C. Digout, M. A. Nascimento, and A. Coman. Similarity search and dimensionality reduction: Not all dimensions are equally useful. In Proceedings of the 9th International Conference on Database Systems for Advanced Applications (DASFAA’04), pages 831–842, 2004.
[16] W. DuMouchel, C. Volinsky, T. Johnson, C. Cortes, and D. Pregibon. Squashing flat files flatter. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’99), pages 6–15, 1999.
[17] C. Elkan. Using the triangle inequality to accelerate k-means. In Proceedings of the Twentieth International Conference on Machine Learning (ICML’03), pages 147–153, 2003.
[18] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu.
A density-based algorithm for discovering clusters in large spatial databases with noise. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD’96), pages 226–231, 1996.
[19] C. Faloutsos and K.-I. Lin. FastMap: A fast algorithm for indexing, data-mining and visualization of traditional and multimedia datasets. In Proceedings of the 1995 ACM SIGMOD International Conference on Management of Data (SIGMOD’95), pages 163–174, 1995.
[20] R. F. S. Filho, A. J. M. Traina, C. Traina Jr., and C. Faloutsos. Similarity search without tears: The OMNI family of all-purpose access methods. In Proceedings of the 17th International Conference on Data Engineering (ICDE’01), pages 623–630, 2001.
[21] V. Ganti, R. Ramakrishnan, J. Gehrke, A. L. Powell, and J. C. French. Clustering large datasets in arbitrary metric spaces. In Proceedings of the 15th International Conference on Data Engineering (ICDE’99), pages 502–511, 1999.
[22] A. Guttman. R-trees: A dynamic index structure for spatial searching. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD’84), pages 47–57, 1984.
[23] J. M. Hammersley. The distribution of distance in a hypersphere. The Annals of Mathematical Statistics, 21:447–452, 1950.
[24] J. Han and M. Kamber. Data Mining: Concepts and Techniques (2nd Edition). Morgan Kaufmann, San Francisco, CA, 2006.
[25] G. R. Hjaltason and H. Samet. Index-driven similarity search in metric spaces. ACM Transactions on Database Systems, 28:517–580, 2003.
[26] H. V. Jagadish, B. C. Ooi, K.-L. Tan, C. Yu, and R. Zhang. iDistance: An adaptive B+-tree based indexing method for nearest neighbor search. ACM Transactions on Database Systems, 30:364–397, 2005.
[27] W. Jin, A. K. H. Tung, and J. Han. Mining top-n local outliers in large databases. In Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’01), pages 293–298, 2001.
[28] L. Kaufman and P.
Rousseeuw. Finding Groups in Data: An Introduction to Cluster Analysis. John Wiley and Sons, New York, NY, 1990.
[29] R. Kolodny and N. Linial. Approximate protein structural alignment in polynomial time. Proceedings of the National Academy of Sciences, 101(33):12201–12206, 2004.
[30] B. Larsen and C. Aone. Fast and effective text mining using linear-time document clustering. In Proceedings of the Fifth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD’99), pages 16–22, 1999.
[31] B. Ma, J. Tromp, and M. Li. PatternHunter: Faster and more sensitive homology search. Bioinformatics, 18:440–445, 2002.
[32] J. MacQueen. Some methods for classification and analysis of multivariate observations. Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, 1:281–297, 1967.
[33] M. L. Micó, J. Oncina, and E. Vidal. A new version of the nearest-neighbour approximating and eliminating search algorithm (AESA) with linear preprocessing time and memory requirements. Pattern Recognition Letters, 15:9–17, 1994.
[34] G. Navarro. Searching in metric spaces by spatial approximation. The VLDB Journal, 11:28–46, 2002.
[35] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.
[36] W. R. Pearson and D. J. Lipman. Improved tools for biological sequence comparison. Proceedings of the National Academy of Sciences of the USA, 85:2444–2448, 1988.
[37] J. R. Rico-Juan and L. Micó. Comparison of AESA and LAESA search algorithms using string and tree-edit-distances. Pattern Recognition Letters, 24:1417–1426, 2003.
[38] G. Robertson, M. Bilenky, K. Lin, A. He, W. Yuen, M. Dagpinar, R. Varhol, K. Teague, O. L. Griffith, X. Zhang, Y. Pan, M. Hassel, M. C. Sleumer, W. Pan, E. Pleasance, M. Chuang, H. Hao, Y. Y. Li, N. Robertson, C. Fjell, B. Li, S. Montgomery, T. Astakhova, J. Zhou, J. Sander, A. S.
Siddiqui, and S. J. M. Jones. cisRED: a database system for genome-scale computational discovery of regulatory elements. Nucleic Acids Research, 34(Database Issue):D68–D73, 2006.
[39] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky. Automatic extraction of clusters from hierarchical clustering representations. In Proceedings of the 7th Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD’03), pages 75–87, 2003.
[40] M. Shapiro. The choice of reference points in best-match file searching. Communications of the ACM, 20:339–343, 1977.
[41] D. Shasha and J. T.-L. Wang. New techniques for best-match retrieval. ACM Transactions on Information Systems, 8:140–158, 1990.
[42] R. Sibson. SLINK: An optimally efficient algorithm for the single-link cluster method. Computer Journal, 16:30–34, 1973.
[43] M. Stoldt, J. Wöhnert, M. Görlach, and L. R. Brown. The NMR structure of Escherichia coli ribosomal protein L25 shows homology to general stress proteins and glutaminyl-tRNA synthetases. The EMBO Journal, 17:6377–6384, 1998.
[44] G. D. Stormo. DNA binding sites: representation and discovery. Bioinformatics, 16:16–23, 2000.
[45] C. J. Traina, R. F. S. Filho, A. J. M. Traina, M. R. Vieira, and C. Faloutsos. The Omni-family of all-purpose access methods: a simple and effective way to make similarity search more efficient. VLDB Journal, 16:483–505, 2007.
[46] J. K. Uhlmann. Satisfying general proximity/similarity queries with metric trees. Information Processing Letters, 40:175–179, 1991.
[47] North Carolina State University. RNase P database. http://www.mbio.ncsu.edu/RNaseP/home.html.
[48] E. Vidal. An algorithm for finding nearest neighbours in (approximately) constant average time. Pattern Recognition Letters, 4:145–157, 1986.
[49] E. Vidal. New formulation and improvements of the nearest-neighbour approximating and eliminating search algorithm (AESA). Pattern Recognition Letters, 15:1–7, 1994.
[50] R. Weber, H.-J. Schek, and S. Blott.
A quantitative analysis and performance study for similarity-search methods in high-dimensional spaces. In Proceedings of the 24th International Conference on Very Large Data Bases (VLDB’98), pages 194–205, 1998.
[51] D. S. Wishart, D. Arndt, M. Berjanskii, P. Tang, J. Zhou, and G. Lin. CS23D: a web server for rapid protein structure generation using NMR chemical shifts and sequence data. Nucleic Acids Research, 36(Web Server Issue):W496–W502, 2008.
[52] X. Wu, Z. Cai, X.-F. Wan, T. Hoang, R. Goebel, and G.-H. Lin. Nucleotide composition string selection in HIV-1 subtyping using whole genomes. Bioinformatics, 23:1744–1752, 2007.
[53] P. N. Yianilos. Data structures and algorithms for nearest neighbor search in general metric spaces. In Proceedings of the Fourth Annual ACM-SIAM Symposium on Discrete Algorithms (SODA’93), pages 311–321, Philadelphia, PA, 1993.
[54] T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: An efficient data clustering method for very large databases. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD’96), pages 103–114, 1996.
[55] Z. Zhang, S. Schwartz, L. Wagner, and W. Miller. A greedy algorithm for aligning DNA sequences. Journal of Computational Biology, 7:203–214, 2000.
[56] J. Zhou and J. Sander. Data bubbles for non-vector data: Speeding-up hierarchical clustering in arbitrary metric spaces. In Proceedings of the 29th International Conference on Very Large Data Bases (VLDB’03), pages 452–463, 2003.
[57] J. Zhou and J. Sander. Speedup clustering with hierarchical ranking. In Proceedings of the 6th IEEE International Conference on Data Mining (ICDM’06), pages 1205–1210, 2006.
[58] J. Zhou, J. Sander, Z. Cai, L. Wang, and G. Lin. Finding the nearest neighbors in biological databases using less direct distance computations. IEEE/ACM Transactions on Computational Biology and Bioinformatics. In press.