A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data

IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 9, pp. 1850-1852, 2012

Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander

Abstract: In reference [1], the authors proposed a framework for automated clustering and visualization of biological data sets named AUTO-HDS. This letter complements that framework by showing that it is possible to eliminate a user-defined parameter in such a way that the clustering stage can be implemented more accurately and with reduced computational complexity.

Index Terms: Data Mining, Clustering, Bioinformatics Databases, AUTO-HDS.

(The authors are with the Dept. of Computing Science of the Univ. of Alberta, Edmonton, Canada. Ricardo J. G. B. Campello, currently on a sabbatical leave, is originally from the Dept. of Computer Sciences of the Univ. of São Paulo at São Carlos, Brazil. E-mail: [email protected])

I. INTRODUCTION

AUTO-HDS [1] is an interesting clustering framework that can be used to discover relevant data clusters from biological data sets. It is composed of a clustering stage, a cluster ranking and selection stage, and a visualization stage. The clustering stage is based on the HDS algorithm, proposed by the same authors in [2]. HDS is a density-based hierarchical clustering algorithm that performs a sampling of the possible hierarchical levels (each of which represents a particular density threshold that discriminates between dense objects and noise) by using a geometric sampling rate controlled by a user-defined parameter, $r_{shave}$. The complete hierarchy would be obtained as $r_{shave} \to 0$. In this case, however, the asymptotic running time of the method is the same as the worst-case running time of an analogous method by Wishart (HMA) [3], namely $O(n^3)$, where $n$ is the number of data objects [2], [4]. The use of “sufficiently large” values of $r_{shave}$ allows the sampling of only a logarithmic number of hierarchical levels, reducing this complexity to $O(n^2 \log n)$ [1], [2], [4]. (Further gains have been shown to be possible by using parallel computing techniques, but only for very low-dimensional spaces [5].) However, the sampling of hierarchical levels performed by HDS represents a loss of information that may affect the results of the subsequent stages of AUTO-HDS, i.e., the ranking/selection of clusters based on their stability and the visualization tool. In fact, when hierarchical levels are missed, the “birth” and/or “death” of clusters cannot be precisely captured, so their stability cannot be computed exactly. In the worst case, a cluster may even be born and then disappear between two sampled levels, so that it is never detected and presented to the user. Therefore, $r_{shave}$ represents a trade-off between the accuracy and the computational burden of AUTO-HDS.

In Section II, we show that the complete hierarchy that would be obtained as $r_{shave} \to 0$ can actually be computed in $O(n^2)$ time, without any need for sampling. In Section III, we discuss how the same procedure for ranking and selection of clusters used by AUTO-HDS can still be applied to the complete hierarchy, regardless of $r_{shave}$. We also discuss some implications of our observations for the AUTO-HDS visualization tool.
II. COMPLETE HIERARCHICAL CLUSTERING

A. Basic Idea

Gupta et al. [1] have proposed a framework for clustering and visualization of biological data, which means that its constituent parts are presumably replaceable. In order to replace the HDS clustering algorithm with another one capable of producing a fully compatible yet complete hierarchy, we first need to recall some of the authors' discussions in [1] on the connections between HDS and other related density-based clustering algorithms. In Section 2 of [1], p. 224, when referring to the DBSCAN algorithm [6], particularly to the choice of its parameters ($MinPts$ and $\varepsilon$), Gupta et al. argued that “Different choices of $\varepsilon$ and $MinPts$ can give dramatically different clusterings; choosing these parameters are a potential pitfall for the user”. While this is true as far as the combination of these parameters is concerned, one should notice that $MinPts$ is fully equivalent to the parameter $n_\epsilon$ of HDS, which is a classic smoothing factor found in different density-based clustering algorithms (e.g., [1], [6], [7], [8], [9], [10]) and whose behavior is quite robust and well understood. As for $\varepsilon$, the OPTICS algorithm [7] is known to produce a bar plot, called a reachability plot, which, for a given value of $MinPts$, encodes in a nested way all possible DBSCAN-like clusterings w.r.t. $\varepsilon$, except for possible differences in the assignment of border objects. In [11], it was shown that a hierarchical dendrogram, closely related to Single-Linkage (SL), can be extracted from a reachability plot such that each level of the resulting hierarchy corresponds to a horizontal cut through the plot. A horizontal cut through the plot, in turn, corresponds to a DBSCAN-like clustering (with possible differences in the assignment of border objects) for a specific value of $\varepsilon$ [7]; a sketch of such a cut-based extraction is given at the end of this subsection.

At this point, one should notice that the only difference between a DBSCAN clustering w.r.t. $\varepsilon$ and the HDS clustering at density level $r_\epsilon$ is the presence of border objects in DBSCAN, as observed by Gupta et al. [1] in Section 4, p. 226. Removing the border objects from OPTICS and, accordingly, from the DBSCAN-like hierarchy that can be extracted from it, can be trivially done by simply redefining the reachability distances in a symmetric way, as described in [12]. In this case, as observed in [12], it follows that OPTICS reduces to a Minimum Spanning Tree (MST) algorithm in a transformed space of symmetric reachability distances, which in turn is equivalent to SL in that space. This means that applying SL to the transformed space of symmetric reachability distances produces a complete hierarchy in which the hierarchical levels are fully equivalent to those of HDS w.r.t. different density thresholds $r_\epsilon$. Since the SL algorithm can be implemented in $O(n^2)$, the complete density-based hierarchy can be computed with this complexity without any need for sampling, an idea that has recently been rediscovered in [9] in the context of complex networks. Once the complete hierarchy is available, the relabeling and smoothing (particle removal) procedures described in Section 5.2 of [1] can be applied as suggested in that reference.
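To make the connection above concrete, the following minimal Python sketch illustrates how a flat, DBSCAN-like clustering can be read off an OPTICS output by a horizontal cut at a radius $\varepsilon$, in the spirit of [7] and [11]. It assumes that the objects are given in OPTICS processing order together with their reachability and core distances; the array and function names are illustrative and not taken from [1], [7], or any particular implementation.

```python
import numpy as np

def cut_reachability_plot(reach, core, eps):
    """Flat, DBSCAN-like clustering obtained by a horizontal cut of a
    reachability plot at radius eps (a sketch of the idea in [7], [11]).

    reach : reachability distances in OPTICS processing order
            (np.inf for objects not reachable from their predecessors)
    core  : core distances, in the same order
    eps   : radius of the horizontal cut

    Returns cluster labels in processing order; -1 marks noise.
    """
    labels = np.full(len(reach), -1)
    cluster_id = -1
    for i in range(len(reach)):
        if reach[i] > eps:
            # Not reachable at this radius: the object starts a new
            # cluster only if it is itself dense (core distance <= eps);
            # otherwise it remains labelled as noise.
            if core[i] <= eps:
                cluster_id += 1
                labels[i] = cluster_id
        else:
            labels[i] = cluster_id
    return labels
```

Note that objects with a reachability value not greater than $\varepsilon$ but a core distance greater than $\varepsilon$ are border objects and are still included by this extraction; excluding them, as DS and HDS do, is precisely what the symmetric redefinition of the reachability distance (the rDist of Section II-B) achieves.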
B. Formulation

Let $X = \{x_1, \ldots, x_n\}$ be a data set containing $n$ data objects, each of which is described by a $d$-dimensional attribute vector. In addition, let $M_S$ be an $n \times n$ symmetric matrix containing the distances $d_S(x_i, x_j)$ between pairs of objects of $X$.

Definition 1: (Dense Object) An object $x_i \in X$ is called a dense object w.r.t. both a radius $r_\epsilon$ and an integer threshold $n_\epsilon \geq 1$ if there are at least $n_\epsilon$ objects (including $x_i$) within a closed ball of radius $r_\epsilon$ centered at $x_i$, i.e., if the cardinality of the subset $\{x_j \in X \mid d_S(x_i, x_j) \leq r_\epsilon\}$ is not lower than $n_\epsilon$.

Definition 1 is equivalent to the definition of a core object used by DBSCAN and OPTICS. It is also the basis for the following definition of a cluster, which is used by the algorithms Density Shaving (DS) and Hierarchical Density Shaving (HDS) in [1], [2].

Definition 2: (Cluster and Noise) Let $G \subseteq X$ be the subset of dense objects according to a certain density threshold established by a pair of values of $r_\epsilon$ and $n_\epsilon$. Then, a cluster $C \subseteq G$ is defined as a maximal subset of dense objects satisfying the condition that, for every pair of objects $x_i, x_j \in C$, there exists a chain of dense objects connecting $x_i$ and $x_j$ in $C$ such that every pair of consecutive objects in the chain are within a radius $r_\epsilon$ from each other. Objects that are not part of any cluster (i.e., the non-dense objects) are denoted as noise.

The algorithm DS finds the clusters of a data set $X$ according to Definition 2, given a pair of values of $n_\epsilon$ and $r_\epsilon$. The hierarchical version of DS, HDS, finds a hierarchy of such clusters for different values of $r_\epsilon$. Each level of the resulting hierarchy is equivalent to the clustering solution that would be produced by the DS algorithm with a particular value for this radius. The values for the radius that are actually used, and therefore the hierarchical levels that are represented in the resulting HDS hierarchy, are determined by a geometric sampling of a sorted set of candidate values, $R$, which is defined below.

Definition 3: (Set of Neighborhood Radii) The set of neighborhood radii $R$ of a data set $X$ w.r.t. $n_\epsilon$ is defined as the sorted set of distances from each object of $X$ to its $n_\epsilon$-th neighbor (which includes the object itself).

From Definition 1 one can readily see that each element of $R$ in Definition 3 represents the minimum value of $r_\epsilon$ such that one or more objects in $X$ are deemed dense (the so-called core distance of the corresponding object or objects). Therefore, $R$ is associated in a bijective manner with the set of all possible different subsets $G \subseteq X$ of dense objects that can be obtained for a given $n_\epsilon$ when varying $r_\epsilon$. Accordingly, $R$ is also associated in a bijective manner with the set of all possible different clustering solutions that can be obtained according to Definition 2 for a given $n_\epsilon$. The complete hierarchy containing all these possible solutions, however, can only be produced by HDS as the sampling rate controlled by the user-defined parameter $r_{shave}$ tends to zero, which leads to a cubic running time complexity.
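As a concrete illustration of Definition 3, the following minimal Python sketch computes the core distances and the set $R$ of neighborhood radii from a precomputed distance matrix $M_S$; the function names are illustrative, and the code is a sketch rather than the implementation used in [1], [2].

```python
import numpy as np

def core_distances(MS, n_eps):
    """Core distance of every object w.r.t. n_eps: the distance to its
    n_eps-th neighbor, counting the object itself (cf. Definition 3)."""
    # Sorting each row of the distance matrix puts the self-distance (0)
    # first, so the n_eps-th neighbor (object itself included) sits at
    # index n_eps - 1.
    return np.sort(MS, axis=1)[:, n_eps - 1]

def neighborhood_radii(MS, n_eps):
    """Sorted set R of candidate radii (Definition 3): each value is the
    smallest r_eps at which one or more objects become dense."""
    # np.unique sorts the values and collapses duplicates, since R is a set.
    return np.unique(core_distances(MS, n_eps))
```

Since $R$ contains at most $n$ distinct values, the complete hierarchy discussed in this section has at most $n$ meaningful levels, one per element of $R$.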
At this point, it is worth noticing that the only difference between a cluster according to Definition 2 and the definition of a cluster used by DBSCAN is that the latter also allows those non-dense objects that are within distance $r_\epsilon$ of a dense object in a cluster $C$ (if any) to be included in $C$. These objects are said to be density-reachable from one or more dense objects in a cluster and are called border objects. If such objects were not allowed to be included in clusters, then the clusters found by DBSCAN would be precisely those described in Definition 2, i.e., the same as those found by DS and HDS at the density level corresponding to $r_\epsilon$.

The exclusion of border objects from the clusters that would be found by DBSCAN or, equivalently, from those that would be found through a horizontal cut at level $r_\epsilon$ of the OPTICS hierarchy, can be formally modeled by replacing the original notion of density-reachability used by those algorithms with a symmetric one. Such an alternative notion is based on a transformed distance between two objects, rDist [12], defined below. (Strictly speaking, calling rDist a distance is an abuse of terminology, as rDist in Definition 4 does not satisfy all the formal properties of a distance in mathematics.)

Definition 4: (rDist) Let $cDist_{n_\epsilon}(x_i)$ be the core distance of an object $x_i \in X$ w.r.t. $n_\epsilon$, i.e., the distance from $x_i$ to its $n_\epsilon$-th neighbor. Then, the rDist distance between any two objects $x_i, x_j \in X$ w.r.t. $n_\epsilon$ is defined as $rDist(x_i, x_j) = \max\{cDist_{n_\epsilon}(x_i),\, cDist_{n_\epsilon}(x_j),\, d_S(x_i, x_j)\}$. The $n \times n$ symmetric matrix containing the values $rDist(\cdot, \cdot)$ between all pairs of objects of $X$ is denoted as $M_{rDist}$.

Note that the rDist distance between two objects is the minimum value of $r_\epsilon$ such that both objects are dense and are within a radius $r_\epsilon$ from each other. In particular, the rDist distance between an object and itself is the core distance of the object, i.e., $rDist(x_i, x_i) = cDist_{n_\epsilon}(x_i)$, which is not zero in general (except for $n_\epsilon = 1$). From these observations, we can rewrite Definition 2 in the following, fully equivalent way.

Definition 5: (Cluster and Noise: Equivalent Definition Based on rDist) A cluster $C$ of a data set $X$ w.r.t. $r_\epsilon$ and $n_\epsilon$ can be defined as a maximal subset of data objects satisfying the condition that, for every pair of objects $x_i, x_j \in C$, including $i = j$, there exists a chain of objects connecting $x_i$ and $x_j$ in $C$ such that the rDist distance between every pair of consecutive objects in the chain is not greater than $r_\epsilon$. Objects that are not in any cluster are denoted as noise.

By using this equivalent definition, we can finally show that all clusterings that can be found by the HDS algorithm can be readily computed by applying Single-Linkage (SL) to the transformed space of rDist distances.

Proposition 1: For a given $n_\epsilon$, the set of all possible different clustering solutions that can be obtained according to Definition 2 or, equivalently, according to Definition 5, can be computed by applying the SL algorithm to the matrix $M_{rDist}$ of rDist distances.

Proof: By construction, when applied to a distance matrix $M_S$ of a data set $X$, the SL algorithm is well known to produce a complete hierarchy, in the form of a dendrogram, containing all possible clusterings whose constituent clusters are maximal subsets satisfying the single-link connectivity property [13]. More specifically, at level $d_l$ of the dendrogram scale, any cluster lying on that hierarchical level: (i) satisfies the property that there is always a chain connecting any two objects within the cluster such that the distance between every consecutive pair of objects in the chain is no greater than $d_l$; and (ii) is a maximal subset of $X$ satisfying this property. It is then clear that, by replacing $M_S$ with $M_{rDist}$ and interpreting the dendrogram scale as values of the radius $r_\epsilon$, the result is a complete hierarchy in which the clusters satisfy Definition 5. The only detail is that, in this case, the SL algorithm must also take into account the connection of objects to themselves, i.e., the elements on the diagonal of the rDist matrix, which correspond to the levels of the hierarchy below which the corresponding objects are no longer part of any cluster and are, therefore, deemed noise. (In traditional SL, the elements on the diagonal of the distance matrix are all equal to zero and objects are always considered to be part of clusters, possibly singleton clusters in the case of isolated objects.)
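A minimal Python sketch of Proposition 1 is given below, assuming SciPy is available and that $M_S$ is given as a full symmetric distance matrix (a NumPy array). The sketch is illustrative rather than the implementation referred to in the text: it builds $M_{rDist}$ explicitly and returns a standard single-linkage dendrogram together with the diagonal entries (the core distances), which mark the noise threshold of each object, instead of merging the latter into the dendrogram itself.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def rdist_matrix(MS, n_eps):
    """Matrix M_rDist of Definition 4:
    rDist(xi, xj) = max{cDist(xi), cDist(xj), dS(xi, xj)}."""
    core = np.sort(MS, axis=1)[:, n_eps - 1]          # core distances
    M = np.maximum(MS, np.maximum.outer(core, core))  # pairwise maxima
    np.fill_diagonal(M, core)                         # rDist(xi, xi) = cDist(xi)
    return M

def complete_hierarchy(MS, n_eps):
    """Complete HDS-equivalent hierarchy (Proposition 1): single linkage
    applied to the transformed space of rDist distances."""
    M = rdist_matrix(MS, n_eps)
    off_diag = M.copy()
    np.fill_diagonal(off_diag, 0.0)   # squareform expects a zero diagonal
    Z = linkage(squareform(off_diag), method="single")
    # The diagonal entries (core distances) give, for each object, the level
    # below which it is no longer part of any cluster and counts as noise.
    return Z, np.diag(M)
```

Note that materializing $M_{rDist}$ costs $O(n^2)$ memory, which corresponds to the scenario in which the distance matrix is given as input (see the complexity analysis below); the $O(dn)$-memory alternative computes rDist on demand.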
C. Complexity Analysis

Let us first consider the scenario where the data set $X$ with $n$ $d$-dimensional objects is given as input. Provided that the distance between two objects can be computed in $O(d)$ time, the core distances of all objects can be computed in $O(dn^2)$ time in the worst case. Having the core distances precomputed, each rDist distance can be computed in $O(d)$ time on demand by the SL algorithm. One of the fastest ways to compute SL is by using a divisive method based on the Minimum Spanning Tree (MST) [13]. It is possible to construct an MST in $O(dn^2)$ time by using an implementation of Prim's algorithm based on an ordinary list search (instead of a heap). The divisive extraction of the SL hierarchy from the MST does not exceed this complexity. Once the hierarchy is available, the relabeling and smoothing (particle removal) procedures can be applied as described in Section 5.2 of [1]. These procedures do not exceed the above complexity either. So, in summary, the overall worst-case asymptotic time complexity of the algorithm, when the data set $X$ is given as input, is $O(dn^2)$.

Regarding main memory requirements, one needs $O(dn)$ space to store the data set $X$ and $O(n)$ space to store the core distances. The MST requires $O(n)$ space to be stored. In the divisive extraction stage, only the currently processed hierarchical level is needed at any point in time, which requires $O(n)$ space. Hence, the overall space complexity of the algorithm is $O(dn)$.

If the distance matrix $M_S$ is given instead of the data set $X$, the only change for the runtime is that one can directly access any distance $d_S(x_i, x_j)$ from $M_S$ in constant time. Then, the computations no longer depend on the dimension $d$ of the objects and, as a consequence, the worst-case time complexity reduces to $O(n^2)$. On the other hand, this requires that matrix $M_S$ be stored in main memory, which results in $O(n^2)$ space complexity.
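The $O(dn^2)$-time, $O(dn)$-memory path described above can be sketched in Python as follows. This is a plain list-search Prim construction with rDist computed on demand from precomputed core distances; it is a sketch under those assumptions, not the authors' implementation, and the divisive extraction of the SL dendrogram from the resulting MST (e.g., by processing the MST edges in decreasing order of weight) is omitted.

```python
import numpy as np

def mst_over_rdist(X, n_eps):
    """List-search Prim's algorithm over rDist, computed on demand.

    Runs in O(d n^2) time and O(d n) memory for an n x d data set X.
    Returns the core distances and the MST edges as (parent, child, rdist).
    """
    n = X.shape[0]
    # Core distances in O(d n^2): distance from each object to its
    # n_eps-th neighbor, counting the object itself.
    core = np.empty(n)
    for i in range(n):
        d = np.linalg.norm(X - X[i], axis=1)
        core[i] = np.partition(d, n_eps - 1)[n_eps - 1]

    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)     # cheapest rDist edge into the tree so far
    parent = np.full(n, -1)
    in_tree[0] = True
    current, edges = 0, []
    for _ in range(n - 1):
        # Relax all edges leaving the vertex just added (on-demand rDist).
        d = np.linalg.norm(X - X[current], axis=1)
        rd = np.maximum(d, np.maximum(core, core[current]))
        better = (~in_tree) & (rd < best)
        best[better] = rd[better]
        parent[better] = current
        # Ordinary list search (no heap) for the next vertex to add.
        nxt = int(np.argmin(np.where(in_tree, np.inf, best)))
        edges.append((int(parent[nxt]), nxt, float(best[nxt])))
        in_tree[nxt] = True
        current = nxt
    return core, edges
```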
III. RANKING, SELECTION, AND VISUALIZATION

The selection of clusters described in Section 6.2 of [1] depends only on the ranking (relative ordering) of the clusters according to their stability, defined as $Stab(C) = \left(\log(n_c^{e}) - \log(n_c^{s-1})\right) / \log(1 - r_{shave})$, where $n_c^{i}$ is the total number of dense objects at the $i$-th hierarchical level, $e$ is the level at which cluster $C$ first appears, and $s$ is the last level at which $C$ survives. As observed by the authors on p. 229 of [1], the denominator of $Stab(C)$ is a constant for all clusters $C$. This means that the ranking (and selection) of clusters can be performed using solely the information of $n_c^{e}$ and $n_c^{s-1}$. In HDS, however, these terms implicitly depend on $r_{shave}$. In contrast, in the complete hierarchy that can be produced as described in Section II, $n_c^{e}$ and $n_c^{s-1}$ depend only on the (ranks of the) density thresholds $r_\epsilon$ associated with their respective hierarchical levels. These terms can thus be readily derived from the hierarchy, or even precomputed and stored during its construction. This means that $r_{shave}$ is not needed at all when the complete hierarchy is considered, which makes the tool easier to use.

In spite of this, an interesting side effect of the geometric sampling controlled by $r_{shave}$ is a further compaction of the hierarchy that allows a log-scale visualization able to emphasize smaller clusters with respect to bigger ones, as observed by the authors in Section 7 of [1], p. 229. A direct consequence of having the complete hierarchy available is that such a visualization is now possible for any value of $r_{shave}$ without the need for re-clustering. In other words, one can still use $r_{shave}$ for visualization, if desired, and outcomes for different values can be produced by simply sampling the levels of the complete hierarchy, as sketched below. This can result in significant speed-ups during exploratory data analysis with the AUTO-HDS framework.
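The following Python sketch illustrates this last point. Given the number of dense objects at each level of the complete hierarchy, it selects a geometrically decimated subset of levels for display, for any chosen $r_{shave}$, without re-running the clustering. The decimation rule used here (targets of roughly $n(1 - r_{shave})^k$ dense objects, assuming $0 < r_{shave} < 1$) is only an illustration of the idea and is not claimed to reproduce the exact shaving schedule of HDS/AUTO-HDS.

```python
import numpy as np

def sample_levels_for_display(n_dense_per_level, r_shave):
    """Pick a geometrically decimated subset of levels of the complete
    hierarchy for visualization, without re-clustering.

    n_dense_per_level : number of dense objects at each hierarchical level
    r_shave           : desired shaving fraction, assumed in (0, 1)

    Returns the sorted indices of the selected levels.
    """
    n_dense = np.asarray(n_dense_per_level, dtype=float)
    n = n_dense.max()
    # Target counts n, n(1 - r_shave), n(1 - r_shave)^2, ... down to 1.
    targets, t = [], n
    while t >= 1.0:
        targets.append(t)
        t *= 1.0 - r_shave
    # For each target, keep the level whose dense-object count is closest.
    picked = {int(np.abs(n_dense - t).argmin()) for t in targets}
    return sorted(picked)
```

Changing $r_{shave}$ for display purposes then amounts to re-running this selection on precomputed level counts, which is negligible compared to re-clustering.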
IV. CONCLUSION

This letter is intended as a complement to the AUTO-HDS framework introduced in [1]. We have shown that the particular algorithm adopted in the clustering stage of that framework can be replaced so that a user-defined parameter is eliminated and the clustering procedure can be performed more accurately and with reduced complexity.

ACKNOWLEDGMENTS

The authors thank the Research Foundation of the State of São Paulo, Brazil (FAPESP), the Brazilian National Council for Scientific and Technological Development (CNPq), and the Natural Sciences and Engineering Research Council of Canada (NSERC).

REFERENCES

[1] G. Gupta, A. Liu, and J. Ghosh, “Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological data sets,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 223-237, 2010.
[2] G. Gupta, A. Liu, and J. Ghosh, “Hierarchical density shaving: A clustering and visualization framework for large biological datasets,” in IEEE ICDM Workshop on Data Mining in Bioinformatics (DMB), Hong Kong, China, 2006, pp. 89-93.
[3] D. Wishart, “Mode analysis: A generalization of nearest neighbour which reduces chaining effects,” in Numerical Taxonomy. Academic Press, 1969, pp. 282-311.
[4] G. Gupta, A. Liu, and J. Ghosh, “Automated hierarchical density shaving and gene DIVER,” Tech. Rep. IDEAL-2006-TR05, Dept. of Electrical and Computer Engineering, Univ. of Texas at Austin, 2006.
[5] S. Dhandapani, G. Gupta, and J. Ghosh, “Design and implementation of scalable hierarchical density based clustering,” Tech. Rep. IDEAL-2010-06, Dept. of Electrical and Computer Engineering, Univ. of Texas at Austin, 2010.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Int. Conference on Knowledge Discovery and Data Mining (KDD), Portland, USA, 1996, pp. 226-231.
[7] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “OPTICS: Ordering points to identify the clustering structure,” SIGMOD Record, vol. 28, pp. 49-60, June 1999.
[8] T. Pei, A. Jasra, D. Hand, A.-X. Zhu, and C. Zhou, “DECODE: A new method for discovering clusters of different densities in spatial data,” Data Mining and Knowledge Discovery, vol. 18, pp. 337-369, 2009.
[9] H. Sun, J. Huang, J. Han, H. Deng, P. Zhao, and B. Feng, “gSkeletonClu: Density-based network clustering via structure-connected tree division or agglomeration,” in 10th IEEE Int. Conference on Data Mining (ICDM), Sydney, Australia, 2010, pp. 481-490.
[10] W. Stuetzle and R. Nugent, “A generalized single linkage method for estimating the cluster tree of a density,” J. Computational and Graphical Statistics, vol. 19, no. 2, pp. 397-418, 2010.
[11] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, “Automatic extraction of clusters from hierarchical clustering representations,” in Pacific-Asia Conf. on Advances in Knowledge Discovery and Data Mining (PAKDD), Seoul, Korea, 2003, pp. 75-87.
[12] L. Lelis and J. Sander, “Semi-supervised density-based clustering,” in 9th IEEE Int. Conference on Data Mining (ICDM), Miami, USA, 2009, pp. 842-847.
[13] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.

Ricardo J. G. B. Campello is currently on a sabbatical leave at the Department of Computing Science of the University of Alberta, Canada. He is an Associate Professor at the University of São Paulo at São Carlos, Brazil. He received his MS and PhD degrees in Electrical Engineering in 1997 and 2002, respectively, both from the State University of Campinas, Brazil. His current research interests fall primarily into the areas of data mining and machine learning.

Davoud Moulavi is a PhD student in the Computing Science Department at the University of Alberta, Canada. He received his MS in Computer Science from Moscow Power Engineering University. His research interests include knowledge discovery in large databases, particularly clustering, data mining in biological databases, and machine learning.

Jörg Sander is currently an Associate Professor at the University of Alberta, Canada. He received his MS in Computer Science in 1996 and his PhD in Computer Science in 1998, both from the University of Munich, Germany. His current research interests include knowledge discovery in databases, especially clustering and data mining in biological, spatial, and high-dimensional data sets.