* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download CHAMELEON: A Hierarchical Clustering Algorithm Using
Survey
Document related concepts
Mixture model wikipedia , lookup
Human genetic clustering wikipedia , lookup
Expectation–maximization algorithm wikipedia , lookup
K-nearest neighbors algorithm wikipedia , lookup
Nonlinear dimensionality reduction wikipedia , lookup
K-means clustering wikipedia , lookup
Transcript
CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling Paper presentation in data mining class Presenter : 許明壽 ; 蘇建仲 Data : 2001/12/18 2001/12/18 CHAMELEON 1 About this paper … Department of Computer Science and Engineering , University of Minnesota George Karypis Eui-Honh (Sam) Han Vipin Kumar IEEE Computer Journal - Aug. 1999 2001/12/18 CHAMELEON 2 Outline Problems definition Main algorithm Keys features of CHAMELEON Experiment and related worked Conclusion and discussion 2001/12/18 CHAMELEON 3 Problems definition Clustering Intracluster similarity is maximized Intercluster similarity is minimized Problems of existing clustering algorithms 2001/12/18 Static model constrain Breakdown when clusters that are of diverse shapes,densities,and sizes Susceptible to noise , outliers , and artifacts CHAMELEON 4 Static model constrain Data space constrain K means , PAM … etc Cluster shape constrain K means , PAM , CLARANS Assume cluster as ellipsoidal or globular and are similar sizes Cluster density constrain DBScan Suitable only for data in metric spaces Points within genuine cluster are density-reachable and point across different clusters are not Similarity determine constrain CURE , ROCK 2001/12/18 Use static model to determine the most similar cluster to merge CHAMELEON 5 Partition techniques problem (b) Clusters with convex shapes (a) Clusters of widely different sizes 2001/12/18 CHAMELEON 6 Hierarchical technique problem (1/2) The {(c) , (d)} will be choose to merge when we only consider closeness 2001/12/18 CHAMELEON 7 Hierarchical technique problem (2/2) The {(a) , (c)} will be choose to merge when we only consider interconnectivity 2001/12/18 CHAMELEON 8 Main algorithm Two phase algorithm PHASE I PHASE II 2001/12/18 Use graph partitioning algorithm to cluster the data items into a large number of relatively small subclusters. Uses an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly combining together these sub-clusters. CHAMELEON 9 Framework Construct Partition the Graph Sparse Graph Data Set Merge Partition Final Clusters 2001/12/18 CHAMELEON 10 Keys features of CHAMELEON Modeling the data Modeling the cluster similarity Partition algorithms Merge schemes 2001/12/18 CHAMELEON 11 Terms Arguments needed K K-nearest neighbor graph MINSIZE The minima size of initial cluster TRI Threshold of related inter-connectivity TRC Threshold of related intra-connectivity α Coefficient for weight of RI and RC 2001/12/18 CHAMELEON 12 Modeling the data K-nearest neighbor graph approach Advantages 2001/12/18 Data points that are far apart are completely disconnected in the Gk Gk capture the concept of neighborhood dynamically The edge weights of dense regions in Gk tend to be large and the edge weights of sparse tend to be small CHAMELEON 13 Example of k-nearest neighbor graph 2001/12/18 CHAMELEON 14 Modeling the clustering similarity (1/2) Relative interconnectivity | EC{Ci , Cj } | RI (Ci, Cj ) | ECCi | | ECCj | 2 Relative closeness RC (Ci, Cj) 2001/12/18 S EC{Ci, Cj} | Ci | | Cj | S ECCi S ECCj | Ci | | Cj | | Ci | | Cj | CHAMELEON 15 Modeling the clustering similarity (2/2) • If related is considered , {(c) , (d)} will be merged 2001/12/18 CHAMELEON 16 Partition algorithm (PHASE I) What Why Finding the initial sub-clusters RI and RC can’t be accurately calculated for clusters containing only a few data points How 2001/12/18 Utilize multilevel graph partitioning algorithm (hMETIS) Coarsening phase Partitioning phase Uncoarsening phase CHAMELEON 17 Partition algorithm (cont.) Initial Repeat until (size of all clusters < MINSIZE) all points belonging to the same cluster Select the largest cluster and use hMETIS to bisect Post scriptum Balance constrain 2001/12/18 Spilt Ci into CiA and CiB and each sub-clusters contains at least 25% of the node of Ci CHAMELEON 18 2001/12/18 CHAMELEON 19 Merge schemes (Phase II) What Merging sub-clusters using a dynamic framework How Finding and merging the pair of sub-clusters that are the most similar Scheme 1 RI (Ci, Cj ) TRI and RC (Ci, Cj ) TRC Scheme 2 RI (Ci, Cj ) * RC (Ci, Cj ) 2001/12/18 CHAMELEON 20 Experiment and related worked Introduction of CURE Introduction of DBSCAN Results of experiment Performance analysis 2001/12/18 CHAMELEON 21 Introduction of CURE (1/n) Clustering Using Representative points 1. Properties : 2001/12/18 Fit for non-spherical shapes. Shrinking can help to dampen the effects of outliers. Multiple representative points chosen for non-spherical Each iteration , representative points shrunk ratio related to merge procedure by some scattered points chosen Random sampling in data sets is fit for large databases CHAMELEON 22 Introduction of CURE (2/n) 2. Drawbacks : Partitioning method can not prove data points chosen are good. Clustering accuracy with respect to the parameters below : (1) Shrink factor s : CURE always find the right clusters by range of s values from 0.2 to 0.7. (2) Number of representative points c : CURE always found right clusters for value of c greater than 10. (3) Number of Partitions p : with as many as 50 partitions , CURE always discovered the desired clusters. (4) Random Sample size r : 2001/12/18 (a) for sample size up to 2000 , clusters found poor quality (b) from 2500 sample points and above , about 2.5% of the data set size , CURE always correctly find the clusters. CHAMELEON 23 3. Clustering algorithm : Representative points 2001/12/18 CHAMELEON 24 • Merge procedure 2001/12/18 CHAMELEON 25 Introduction of DBSCAN (1/n) Density Based Spatial Clustering of Application With Noise 1. Properties : 2001/12/18 Can discovery clusters of arbitrary shape. Each cluster with a typical density of points which is higher than outside of cluster. The density within the areas of noise is lower than the density in any of the clusters. Input the parameters MinPts only Easy to implement in C++ language using R*-tree Runtime is linear depending on the number of points. Time complexity is O(n * log n) CHAMELEON 26 Introduction of DBSCAN (2/n) 2. Drawbacks : Cannot apply to polygons. Cannot apply to high dimensional feature spaces. Cannot process the shape of k-dist graph with multifeatures. Cannot fit for large database because no method applied to reduce spatial database. 3. Definitions 2001/12/18 Eps-neighborhood of a point p NEps(p)={q€D | dist(p,q)<=Eps} Each cluster with MinPts points CHAMELEON 27 Introduction of DBSCAN (3/n) 4. p is directly density-reachable from q (1) p€ NEps(q) and (2) | NEps(q) | >=MinPts (core point condition) We know directly density-reachable is symmetric when p and q both are core point , otherwise is asymmetric if one core point and one border point. 5. p is density-reachable from q if there is a chain of points between p and q 2001/12/18 Density-reachable is transitive , but not symmetric Density-reachable is symmetric for core points. CHAMELEON 28 Introduction of DBSCAN (4/n) 6. A point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s. Density-connected is symmetric and reflexive relation A cluster is defined to be a set of density-connected points which is maximal density-reachability. Noise is the set of points not belong to any of clusters. 7. How to find cluster C ? Maximality ∆ p , q : if p€ C and q is density-reachable from p , then q € C Connectivity ∆ p , q € C : p is density-connected to q 8. How to find noises ? ∆ p , if p is not belong to any clusters , then p is noise point 2001/12/18 CHAMELEON 29 Results of experiment 2001/12/18 CHAMELEON 30 Performance analysis (1/2) The time of construct the k-nearest neighbor Low-dimensional data sets based on k-d trees , overall complexity of O(n log n) High-dimensional data sets based on k-d trees not applicable , overall complexity of O(n2) Finding initial sub-clusters 2001/12/18 Obtains m clusters by repeated partitioning successively smaller graphs , overall computational complexity is O(n log (n/m)) Is bounded by O(n log n) A faster partitioning algorithm to obtain the initial m clusters in time O(n+m log m) using multilevel m-way partitioning algorithm CHAMELEON 31 Performance analysis (2/2) Merging sub-clusters using a dynamic framework The time of compute the internal inter-connectivity and internal closeness for each initial cluster is which is O(nm) The time of the most similar pair of clusters to merge is O(m2 log m) by using a heap-based priority queue So overall complexity of CHAMELEON’s is O(n log n + nm + m2 log m) 2001/12/18 CHAMELEON 32 Conclusion and discussion Dynamic model with related interconnectivity and closeness This paper ignore the issue of scaling to large data Other graph representation methodology?? Other Partition algorithm?? 2001/12/18 CHAMELEON 33