A Simpler and More Accurate AUTO-HDS Framework for Clustering and Visualization of Biological Data
IEEE/ACM Transactions on Computational Biology and Bioinformatics, vol. 9, pp. 1850-1852, 2012 (unformatted manuscript)
Ricardo J. G. B. Campello, Davoud Moulavi, and Joerg Sander
The authors are with the Dept. of Computing Science of the Univ. of Alberta, Edmonton, Canada. Ricardo J. G. B. Campello (currently on sabbatical leave) is originally from the Dept. of Computer Sciences of the Univ. of São Paulo at São Carlos, Brazil. E-mail: [email protected]
Abstract
In [1], the authors proposed AUTO-HDS, a framework for automated clustering and visualization of biological data sets. This
letter complements that framework by showing that a user-defined parameter can be eliminated in such a way that the clustering
stage can be implemented more accurately and with reduced computational complexity.
Index Terms
Data Mining, Clustering, Bioinformatics Databases, AUTO-HDS.
I. INTRODUCTION
AUTO-HDS [1] is an interesting clustering framework that can be used to discover relevant data clusters from biological
data sets. It is composed of a clustering stage, a cluster ranking and selection stage, and a visualization stage. The
clustering stage is based on the HDS algorithm, proposed by the same authors in [2]. HDS is a density-based hierarchical
clustering algorithm that performs a sampling of the possible hierarchical levels (each of which represents a particular density
threshold that discriminates between dense objects and noise) by using a geometric sampling rate controlled by a user-defined
parameter, r_shave. The complete hierarchy would be obtained as r_shave → 0. In this case, however, the asymptotic running time
of the method is the same as the worst-case running time of an analogous method by Wishart (HMA) [3], namely, O(n^3), where
n is the number of data objects [2], [4]. The use of “sufficiently large” values of r_shave allows the sampling of a logarithmic
number of hierarchical levels, reducing this complexity to O(n^2 log n) [1], [2], [4] (further gains have been shown to be possible
by using parallel computing techniques, but only for very low-dimensional spaces [5]). But the sampling of hierarchical levels
performed by HDS represents a loss of information that may affect the results provided by the subsequent stages of AUTO-HDS,
i.e., the ranking/selection of clusters based on their stability and the visualization tool. In fact, by missing hierarchical levels,
the “birth” and/or “death” of clusters cannot be precisely captured, so their stability cannot be exactly computed. In the worst
case, a cluster may even be born and then disappear in between two sampled levels, such that it will not be detected and
presented to the user. Therefore, r_shave represents a trade-off between the accuracy and the computational burden of AUTO-HDS.
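To see why this sampling yields only a logarithmic number of levels, a rough count helps. The reading of the geometric rate used here, namely that each HDS level retains about a (1 − r_shave) fraction of the previously surviving objects, is our own assumption based on [1], [2]:

```latex
% Assumption (our reading of [1], [2]): level t of HDS keeps about
% n_t = n (1 - r_shave)^t of the original n objects.
% The hierarchy bottoms out once fewer than n_eps objects survive:
\[
  n\,(1 - r_{\mathrm{shave}})^{t} < n_{\epsilon}
  \;\iff\;
  t \;>\; \frac{\log\!\left(n / n_{\epsilon}\right)}
               {\log\!\left(1 / (1 - r_{\mathrm{shave}})\right)}
  \;=\; O(\log n)
\]
```

For any fixed r_shave in (0, 1) this gives O(log n) levels; combined with at most quadratic work per level it matches the O(n^2 log n) bound above, and as r_shave → 0 the number of levels, and hence the cost, grows without bound.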
In Section II, we show that the complete hierarchy that would be obtained as r_shave → 0 can actually be computed in O(n^2)
time without any need of sampling. In Section III, we discuss how the same procedure for ranking and selection of clusters
used by AUTO-HDS can still be applied to the complete hierarchy, regardless of r_shave. We also discuss some implications
of our observations for the AUTO-HDS visualization tool.
II. COMPLETE HIERARCHICAL CLUSTERING
A. Basic Idea
Gupta et al. [1] have proposed a framework for clustering and visualization of biological data, which means that its constituent
parts are presumably replaceable. In order to replace the HDS clustering algorithm with another one capable of producing a
fully compatible yet complete hierarchy, we need first to recall some of the authors’ discussions in [1] on the connections
between HDS and other related density-based clustering algorithms.
In Section 2 of [1], p. 224, when referring to the DBSCAN algorithm [6], particularly to the choice of its parameters
(MinPts and ε), Gupta et al. argued that “Different choices of ε and MinPts can give dramatically different clusterings;
choosing these parameters are a potential pitfall for the user”. While this is true in what concerns the combination of these
parameters, one should notice that MinPts is fully equivalent to the parameter n_ε of HDS, which is a classic smoothing
factor found in different density-based clustering algorithms (e.g. [1], [6], [7], [8], [9], [10]) and whose behavior is quite
robust and well understood. In what concerns ε, the OPTICS algorithm [7] is known to produce a bar plot, called reachability
plot, which, for a given value of MinPts, encodes in a nested way all possible DBSCAN-like clusterings w.r.t. ε, except for
possible differences in the assignment of border objects. In [11], it was shown that a hierarchical dendrogram, closely related
to Single-Linkage (SL), can be extracted from a reachability plot such that each level of the resulting hierarchy corresponds to
a horizontal cut through the plot. A horizontal cut through the plot, in turn, corresponds to a DBSCAN-like clustering (with
possible differences in the assignment of border objects) for a specific value of ε [7].
At this point, one should notice that the only difference between a DBSCAN clustering w.r.t. ε and the HDS clustering
at density level r_ε is the presence of border objects in DBSCAN, as observed by Gupta et al. [1] in Section 4, p. 226. But
removing the border objects from OPTICS and, accordingly, from the DBSCAN-like hierarchy that can be extracted from it,
can be trivially done by simply redefining the reachability distances in a symmetric way, as described in [12]. In this case,
as observed in [12], it follows that OPTICS reduces to a Minimum Spanning Tree (MST) algorithm in a transformed space
of symmetric reachability distances, which in turn is equivalent to SL in that space. This means that applying SL to the
transformed space of symmetric reachability distances produces a complete hierarchy in which the hierarchical levels are fully
equivalent to those of HDS w.r.t. different density thresholds r_ε. Since the SL algorithm can be implemented in O(n^2) time, the
complete density-based hierarchy can be computed with this complexity without any need of sampling, an idea that has been
recently rediscovered in [9] in the context of complex networks. Once the complete hierarchy is available, the relabeling and
smoothing (particle removal) procedures described in Section 5.2 of [1] can be applied as suggested in that reference.
B. Formulation
Let X = {x_1, ..., x_n} be a data set containing n data objects, each of which is described by a d-dimensional attribute
vector, x_(·). In addition, let M_S be an n × n symmetric matrix containing the distances d_S(x_i, x_j) between pairs of objects
of X.
Definition 1: (Dense Object) An object x_i ∈ X is called a dense object w.r.t. both a radius r_ε and an integer threshold
n_ε ≥ 1 if there are at least n_ε objects (including x_i) within a closed ball of radius r_ε centered at x_i, i.e., if the cardinality of
the subset {x_j ∈ X | d_S(x_i, x_j) ≤ r_ε} is not lower than n_ε.
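As a concrete illustration, here is a minimal sketch of Definition 1 in Python/NumPy; the function name and the convention that M_S is available as a NumPy array MS are ours, not part of [1], [2]:

```python
import numpy as np

def is_dense(MS: np.ndarray, i: int, r_eps: float, n_eps: int) -> bool:
    """Definition 1: x_i is dense w.r.t. (r_eps, n_eps) if at least n_eps
    objects (x_i included, since MS[i, i] = 0) lie within the closed ball
    of radius r_eps centered at x_i."""
    return int(np.sum(MS[i] <= r_eps)) >= n_eps
```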
Definition 1 is equivalent to the definition of a core object used by DBSCAN and OPTICS. It is also the basis for the
following definition of cluster, which is used by the algorithms Density Shaving (DS) and Hierarchical Density Shaving (HDS)
in [1], [2].
Definition 2: (Cluster and Noise) Let G ⊆ X be the subset of dense objects according to a certain density threshold
established by a pair of values of r_ε and n_ε. Then, a cluster C ⊆ G is defined as a maximal subset of dense objects satisfying
the condition that, for every pair of objects x_i, x_j ∈ C, there exists a chain of dense objects connecting x_i and x_j in C such
that every pair of consecutive objects in the chain are within a radius r_ε from each other. Objects that are not part of any
cluster (i.e., the non-dense objects) are denoted as noise.
The algorithm DS finds the clusters of a data set X according to Definition 2, given a pair of values of n_ε and r_ε. The
hierarchical version of DS, HDS, finds a hierarchy of such clusters for different values of r_ε. Each level of the resulting
hierarchy is equivalent to the clustering solution that would be produced by the DS algorithm with a particular value for this
radius. The values for the radius that are actually used — and, therefore, the hierarchical levels that are represented in the
resulting HDS hierarchy — are determined by a geometric sampling of a sorted set of candidate values, R, which is defined
below.
Definition 3: (Set of Neighborhood Radii) The set of neighborhood radii of a data set X w.r.t. n_ε, R, is defined as the
sorted set of distances from each object of X to its n_ε-th neighbor (which includes the object itself).
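A small sketch of Definition 3 under the same NumPy conventions as above (helper name ours): the distance from an object to its n_ε-th neighbor is simply the n_ε-th smallest entry of its row of M_S, the object itself counting as its own first neighbor.

```python
import numpy as np

def neighborhood_radii(MS: np.ndarray, n_eps: int) -> np.ndarray:
    """Definition 3: sorted distances from each object to its n_eps-th
    neighbor (the object itself counts, since MS[i, i] = 0)."""
    per_object = np.sort(MS, axis=1)[:, n_eps - 1]
    return np.sort(per_object)
```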
From Definition 1 one can readily see that each element of R in Definition 3 represents the minimum value of r_ε such that
one or more objects in X are deemed dense (so-called core distance of the corresponding object or objects). Therefore, R is
associated in a bijective manner with the set of all possible different subsets G ⊆ X of dense objects that can be obtained for
a given n_ε when varying r_ε. Accordingly, R is associated in a bijective manner with the set of all possible different clustering
solutions that can be obtained according to Definition 2 for a given n_ε. The complete hierarchy containing all these possible
solutions, however, can only be produced by HDS as the sampling rate controlled by the user-defined parameter r_shave tends
to zero, which leads to a cubic running time complexity.
At this point, it is worth noticing that the only difference between a cluster according to Definition 2 and the definition of
cluster used by DBSCAN is that the latter allows for those non-dense objects that are within distance r_ε from a dense object
in a cluster C (if any) to be included into C as well. These objects are said to be density-reachable from one or more dense
objects in a cluster and are called border objects. If such objects were not allowed to be included into clusters, then the clusters
that would be found by DBSCAN would be precisely as described in Definition 2, i.e., the same as those found by DS and
HDS at the density level corresponding to r_ε.
The exclusion of border objects from the clusters that would be found by DBSCAN, or, equivalently, from those that would
be found through a horizontal cut at level r_ε of the OPTICS hierarchy, can be formally modeled by replacing the original notion
of density-reachability used by those algorithms with a symmetric one. Such an alternative notion is based on a transformed
distance between two objects, rDist [12], defined below (this is actually an abuse of terminology, as rDist does not satisfy all
the formal properties of a distance in mathematics).
Definition 4: (rDist) Let cDist_{n_ε}(x_i) be the core distance of an object x_i ∈ X w.r.t. n_ε, i.e., the distance from x_i to
its n_ε-th neighbor. Then, the rDist distance between any two objects x_i, x_j ∈ X w.r.t. n_ε is defined as rDist(x_i, x_j) =
max{cDist_{n_ε}(x_i), cDist_{n_ε}(x_j), d_S(x_i, x_j)}. The n × n symmetric matrix containing the values rDist(·, ·) between all pairs
of objects of X is denoted as M_rDist.
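A direct transcription of Definition 4 under the same conventions (MS is the distance matrix; the helper name is ours):

```python
import numpy as np

def rdist_matrix(MS: np.ndarray, n_eps: int) -> np.ndarray:
    """Definition 4: rDist(x_i, x_j) = max{cDist(x_i), cDist(x_j), d_S(x_i, x_j)}.
    The diagonal of the result equals the core distances, as the text notes next."""
    core = np.sort(MS, axis=1)[:, n_eps - 1]              # cDist_{n_eps}(x_i)
    return np.maximum(MS, np.maximum.outer(core, core))   # pairwise max of the three terms
```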
Note that the rDist distance between two objects is the minimum value of r_ε such that both objects are dense and are
within a radius r_ε from each other. In particular, the rDist distance between an object and itself is the core distance of the
object, i.e., rDist(x_i, x_i) = cDist_{n_ε}(x_i), which is not zero in general (except for n_ε = 1). From these observations, we can
rewrite Definition 2 in the following, fully equivalent way.
Definition 5: (Cluster and Noise — Equivalent Definition based on rDist) A cluster C of a data set X w.r.t. r_ε and n_ε
can be defined as a maximal subset of data objects satisfying the condition that, for every pair of objects x_i, x_j ∈ C, including
i = j, there exists a chain of objects connecting x_i and x_j in C such that the rDist distance between every pair of consecutive
objects in the chain is not greater than r_ε. Objects that are not in any cluster are denoted as noise.
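Definition 5 also yields a very short recipe for extracting the clustering at a single density level: keep the links with rDist not greater than r_ε and take connected components, marking as noise every object whose self-connection (core distance) exceeds r_ε. A sketch using SciPy (function name ours); note that any off-diagonal rDist ≤ r_ε already implies that both endpoints are dense, so only the diagonal needs the explicit check.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def clustering_at_level(MrDist: np.ndarray, r_eps: float) -> np.ndarray:
    """Flat clustering per Definition 5 at level r_eps (-1 marks noise)."""
    dense = np.diag(MrDist) <= r_eps      # self-connection = core distance
    adj = MrDist <= r_eps                  # chain links with rDist <= r_eps
    np.fill_diagonal(adj, False)
    _, comp = connected_components(csr_matrix(adj), directed=False)
    return np.where(dense, comp, -1)
```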
By using this equivalent definition, we can finally show that all clusterings that can be found by the HDS algorithm can be
readily computed by applying Single-Linkage (SL) to the transformed space of rDist distances.
Proposition 1: For a given n_ε, the set of all possible different clustering solutions that can be obtained according to Definition
2 or, equivalently, according to Definition 5, can be computed by applying the SL algorithm to the matrix M_rDist of rDist
distances.
Proof: By construction, when applied to a distance matrix M_S of a data set X, the SL algorithm is well known to produce
as a result a complete hierarchy, in the form of a dendrogram, containing all possible clusterings whose composing clusters are
maximal subsets satisfying the single-link connectivity property [13]. More specifically, at level d_l of the dendrogram scale,
any cluster lying on that hierarchical level: (i) satisfies the property that there is always a chain that connects any two objects
within the cluster and such that the distance between every consecutive pair of objects in the chain is no greater than d_l; and
(ii) is a maximal subset of X satisfying this property. It is then clear that, by replacing M_S with M_rDist and considering
the dendrogram scale as values for the radius r_ε, the result is a complete hierarchy in which the clusters satisfy Definition 5.
The only detail is that, in this case, the SL algorithm must also take into account the connection of objects to themselves, i.e.,
the elements in the diagonal of the rDist distance matrix, which will correspond to levels of the hierarchy below which the
corresponding objects are no longer part of any cluster and are, therefore, deemed noise (in the traditional SL, the elements in
the diagonal of the distance matrix are all equal to zero and objects are always considered to be part of clusters, possibly
singleton clusters in the case of isolated objects).
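The proposition translates almost directly into a few lines of SciPy. This is a sketch under the conventions above, not the authors' implementation: standard single linkage ignores the diagonal, so the “only detail” from the proof is handled separately by comparing core distances against the cut level.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform

def complete_hds_hierarchy(MrDist: np.ndarray):
    """Single linkage over the rDist space (Proposition 1). Returns the SciPy
    linkage matrix Z and the core distances (the diagonal of MrDist), which
    encode the level below which each object becomes noise."""
    core = np.diag(MrDist).copy()
    off = MrDist.copy()
    np.fill_diagonal(off, 0.0)                       # squareform expects a zero diagonal
    Z = linkage(squareform(off, checks=False), method='single')
    return Z, core

def cut_at_level(Z, core: np.ndarray, r_eps: float) -> np.ndarray:
    """Clustering at density level r_eps: cut the dendrogram at r_eps and mark
    objects whose core distance exceeds r_eps as noise (-1)."""
    labels = fcluster(Z, t=r_eps, criterion='distance')
    return np.where(core <= r_eps, labels, -1)
```

Any dense object merged only above r_eps comes out as a singleton cluster here, exactly as in Definition 5; the particle-removal step of [1] can then discard such tiny clusters if desired.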
C. Complexity Analysis
Let us first consider the scenario where the data set X with n d-dimensional objects is given as input. Provided that the
distance between objects can be computed in O(d) time, the core distances of all objects can be computed in O(dn^2) time
in the worst case. Having the core distances precomputed, each rDist distance can be computed in O(d) time on demand
by the SL algorithm. One of the fastest ways to compute SL is by using a divisive method based on the Minimum Spanning
Tree (MST) [13]. It is possible to construct an MST in O(dn^2) time by using an implementation of Prim's algorithm based
on an ordinary list search (instead of a heap). The divisive extraction of the SL hierarchy from the MST does not exceed this
complexity. Once the hierarchy is available, relabeling and smoothing (particle removal) procedures can be applied as described
in Section 5.2 of [1]. These procedures do not exceed the above complexity either. So, in summary, the overall worst-case
asymptotic time complexity of the algorithm, when the data set X is given as input, is O(dn^2).
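The construction can be sketched as follows (our own illustration, not the implementation of [1]): Prim's algorithm with a plain array instead of a heap, with every distance and rDist value computed on demand from the data matrix X so that no n × n matrix is ever materialized.

```python
import numpy as np

def dists_to(X: np.ndarray, i: int) -> np.ndarray:
    """Euclidean distances from object i to all objects: O(dn) time, O(n) space."""
    return np.sqrt(((X - X[i]) ** 2).sum(axis=1))

def mst_in_rdist_space(X: np.ndarray, n_eps: int):
    """Prim's algorithm with an ordinary list search over the implicit rDist
    graph: O(dn^2) time and O(dn) space overall."""
    n = X.shape[0]
    # Core distances, one row of the distance matrix at a time.
    core = np.array([np.sort(dists_to(X, i))[n_eps - 1] for i in range(n)])

    in_tree = np.zeros(n, dtype=bool)
    best = np.full(n, np.inf)           # cheapest rDist linking each vertex to the tree
    parent = np.full(n, -1)
    in_tree[0] = True
    current, edges = 0, []
    for _ in range(n - 1):
        # rDist from the vertex just added to all others, computed on demand.
        rd = np.maximum(dists_to(X, current), np.maximum(core[current], core))
        update = (~in_tree) & (rd < best)
        best[update], parent[update] = rd[update], current
        nxt = int(np.argmin(np.where(in_tree, np.inf, best)))   # plain array search
        edges.append((int(parent[nxt]), nxt, float(best[nxt])))
        in_tree[nxt] = True
        current = nxt
    return edges, core
```

Sorting the n − 1 MST edges by weight and processing them in that order (divisively, as in [1], or agglomeratively with a union-find structure) yields the SL dendrogram without exceeding the quadratic bound.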
Regarding main memory requirements, one needs O(dn) space to store the data set X and O(n) space to store the core
distances. The MST requires O(n) space to be stored. In the divisive extraction stage, only the currently processed hierarchical
level is needed at any point in time, which requires O(n) space. Hence, the overall space complexity of the algorithm is O(dn).
If the distance matrix M_S is given instead of the data set X, the only change for the runtime is that one can directly access
any distance d_S(x_i, x_j) from M_S in constant time. Then, the computations no longer depend on the dimension d of the objects
and, as a consequence, the worst-case time complexity reduces to O(n^2). On the other hand, this requires that matrix M_S be
stored in main memory, which results in O(n^2) space complexity.
III. RANKING, SELECTION, AND VISUALIZATION
The selection of clusters described in Section 6.2 of [1] depends only on the ranking (relative ordering) of the clusters
according to their stability, defined as Stab(C) = (log(n^e_c) − log(n^{s−1}_c)) / log(1 − r_shave), where n^i_c is the total number of
dense objects at the ith hierarchical level, e is the level where cluster C first appears, and s is the last level at which C survives.
As observed by the authors on p. 229 of [1], the denominator of Stab(C) is a constant for all clusters C. This means that the
ranking (and selection) of clusters can be performed using solely the information of n^e_c and n^{s−1}_c. In HDS, however, these
terms implicitly depend on r_shave. In contrast, in the complete hierarchy that can be produced as described above (Section II),
n^e_c and n^{s−1}_c depend only on the (ranks of the) density thresholds r_ε associated with their respective hierarchical levels. These
terms can thus be readily derived from the hierarchy, or even precomputed and stored during its construction. This means that
r_shave is not needed at all when the complete hierarchy is considered, which makes the tool easier to use. In spite of this, an
interesting side effect of the geometric sampling controlled by r_shave is a further compaction of the hierarchy that allows a
log-scale visualization able to emphasize smaller clusters with respect to bigger ones, as observed by the authors in Section
7 of [1], p. 229. A direct consequence of having the complete hierarchy available is that such a visualization is now possible
for any value of r_shave without the need of re-clustering. In other words, one can still use r_shave for visualization, if desired,
and outcomes for different values can be produced by just sampling the levels of the complete hierarchy. This can result in
significant speed-ups during exploratory data analysis using the AUTO-HDS framework.
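To make the point concrete, here is a minimal sketch (our bookkeeping, with hypothetical helper names) of how the quantities needed for the ranking can be read directly off the complete hierarchy: the number of dense objects at the level whose density threshold is r is simply the number of core distances not exceeding r. Since the denominator log(1 − r_shave) is the same negative constant for every cluster, ranking by the numerator alone reproduces the ranking by Stab(C) up to a fixed reversal of order, so no value of r_shave is ever needed.

```python
import numpy as np

def dense_count_at(core: np.ndarray, r: float) -> int:
    """Number of dense objects at the hierarchical level with density threshold r."""
    return int(np.sum(core <= r))

def stab_numerator(core: np.ndarray, r_birth: float, r_death: float) -> float:
    """Numerator of Stab(C) = (log(n^e_c) - log(n^{s-1}_c)) / log(1 - r_shave),
    with r_birth and r_death the density thresholds of the levels e and s - 1
    read off the complete hierarchy. Only this term varies across clusters,
    so it is all the ranking needs."""
    return np.log(dense_count_at(core, r_birth)) - np.log(dense_count_at(core, r_death))
```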
IV. CONCLUSION
This letter is intended to be a complement to the AUTO-HDS framework introduced in [1]. We have shown that the particular
algorithm adopted in the clustering stage of that framework can be replaced so that a user-defined parameter is eliminated and
the clustering procedure can be performed more accurately, with reduced complexity.
ACKNOWLEDGMENTS
The authors thank the Research Foundation of the State of São Paulo - Brazil (FAPESP), the Brazilian National Council
for Scientific and Technological Development (CNPq), and the Natural Sciences and Engineering Research Council of Canada
(NSERC).
REFERENCES
[1] G. Gupta, A. Liu, and J. Ghosh, “Automated hierarchical density shaving: A robust automated clustering and visualization framework for large biological
data sets,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 7, no. 2, pp. 223–237, 2010.
[2] ——, “Hierarchical density shaving: A clustering and visualization framework for large biological datasets,” in IEEE ICDM Workshop on Data Mining
in Bioinformatics (DMB), Hong Kong/China, 2006, pp. 89–93.
[3] D. Wishart, “Mode analysis: A generalization of nearest neighbour which reduces chaining effects,” in Numerical Taxonomy. Academic Press, 1969,
pp. 282–311.
[4] G. Gupta, A. Liu, and J. Ghosh, “Automated hierarchical density shaving and gene DIVER,” IDEAL-2006-TR05, Dept. Electrical and Computer Eng.,
Univ. of Texas at Austin, Tech. Rep., 2006.
[5] S. Dhandapani, G. Gupta, and J. Ghosh, “Design and implementation of scalable hierarchical density based clustering,” IDEAL-2010-06, Dept. Electrical
and Computer Eng., Univ. of Texas at Austin, Tech. Rep., 2010.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, “A density-based algorithm for discovering clusters in large spatial databases with noise,” in Int. Conference
on Knowledge Discovery and Data Mining (KDD), Portland/USA, 1996, pp. 226–231.
[7] M. Ankerst, M. M. Breunig, H.-P. Kriegel, and J. Sander, “Optics: ordering points to identify the clustering structure,” SIGMOD Rec., vol. 28, pp. 49–60,
June 1999.
[8] T. Pei, A. Jasra, D. Hand, A.-X. Zhu, and C. Zhou, “Decode: a new method for discovering clusters of different densities in spatial data,” Data Mining
and Knowledge Discovery, vol. 18, pp. 337–369, 2009, doi: 10.1007/s10618-008-0120-3.
[9] H. Sun, J. Huang, J. Han, H. Deng, P. Zhao, and B. Feng, “gSkeletonClu: Density-based network clustering via structure-connected tree division or
agglomeration,” in 10th IEEE Int. Conference on Data Mining (ICDM), Sydney/Australia, 2010, pp. 481–490.
[10] W. Stuetzle and R. Nugent, “A generalized single linkage method for estimating the cluster tree of a density,” J. Computational and Graphical Statistics,
vol. 19, no. 2, pp. 397–418, 2010.
[11] J. Sander, X. Qin, Z. Lu, N. Niu, and A. Kovarsky, “Automatic extraction of clusters from hierarchical clustering representations,” in Pacific-Asia Conf.
of Advances in Knowledge Discovery and Data Mining, Seoul/Korea, 2003, pp. 75–87.
[12] L. Lelis and J. Sander, “Semi-supervised density-based clustering,” in 9th IEEE Int. Conference on Data Mining (ICDM), Miami/USA, 2009, pp. 842–847.
[13] A. K. Jain and R. C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
Ricardo J. G. B. Campello is currently on sabbatical leave at the Department of Computing Science of the University of Alberta,
Canada. He is an Associate Professor at the University of São Paulo at São Carlos, Brazil. He received his MS and PhD degrees
in Electrical Engineering in 1997 and 2002, respectively, both from the State University of Campinas, Brazil. His current research
interests fall primarily into the areas of data mining and machine learning.
Davoud Moulavi is a PhD student in the Computing Science Department at the University of Alberta, Canada. He received his
MS in Computer Science from Moscow Power Engineering University. His research interests include knowledge discovery in large
databases, particularly clustering, data mining in biological databases, and machine learning.
Joerg Sander is currently an Associate Professor at the University of Alberta, Canada. He received his MS in Computer Science in
1996 and his PhD in Computer Science in 1998, both from the University of Munich, Germany. His current research interests include
knowledge discovery in databases, especially clustering and data mining in biological, spatial, and high-dimensional data sets.