Download CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling Author: George et al. Advisor: Dr. Hsu Graduate: ZenJohn Huang IDSL seminar 2001/10/23 Outline  Motivation  Objective  Research restrict  Literature review   An overview of related clustering algorithms The limitations of clustering algorithms  CHAMELEON  Concluding remarks  Personal opinion Motivation  Existing clustering algorithms can breakdown  Choice of parameters is incorrect  Model is not adequate to capture the characteristics of clusters  Diverse shapes, densities, and sizes Objective  Presenting a novel hierarchical clustering algorithm – CHAMELEON  Facilitating discovery of natural and homogeneous  Being applicable to all types of data Research Restrict  In this paper, authors ignored the issue of scaling to large data sets that cannot fit in the main memory Literature Review  Clustering  An overview of related clustering algorithms  The limitations of the recently proposed state of the art clustering algorithms Clustering  The intracluster similarity is maximized and the intercluster similarity is minimized [Jain and Dubes, 1988]  Serving as the foundation for data mining and analysis techniques Clustering(cont’d)  Applications  Purchasing patterns  Categorization of documents on WWW [Boley, et al., 1999]  Grouping of genes and proteins that have similar functionality[Harris, et al., 1992]  Grouping if spatial locations prone to earth quakes[Byers and Adrian, 1998] An Overview of Related Clustering Algorithms  Partitional techniques  Hierarchical techniques Partitional Techniques  K means[Jain and Dubes, 1988] Hierarchical Techniques  CURE [Guha, Rastogi and Shim, 1998]  ROCK [Guha, Rastogi and Shim, 1999] Limitations of Existing Hierarchical Schemas  CURE  Fail to take into account special characteristics Limitations of Existing Hierarchical Schemas(cont’d)  ROCK  Irrespective of densities and shapes CHAMELEON  Overview  Modeling the data  Modeling the cluster similarity  A two-phase clustering algorithm  Performance analysis  Experimental Results Overall Framework CHAMELEON Modeling the Data  K-nearest graphs from an original data in 2D Modeling the Cluster Similarity  Relative inter-connectivity Modeling the Cluster Similarity(cont’d)  Relative closeness A Two-phase Clustering Algorithm  Phase I: Finding initial sub-clusters A Two-phase Clustering Algorithm(cont’d)  Phase I: Finding initial sub-clusters  Multilevel paradigm[Karypis & Kumar, 1999]  hMeT|s [Karypis & Kumar, 1999] A Two-phase Clustering Algorithm(cont’d)  Phase II: Merging sub-clusters using a dynamic framework RI (Ci , C j )  TRI and RC (Ci , C j )  TRC TRI, TRC: user specified threshold A Two-phase Clustering Algorithm(cont’d)  Phase II: Merging sub-clusters using a dynamic framework  RI (Ci , C j ) * RC (Ci , C j ) α is a user specified parameter Performance Analysis  The amount of time required to compute  K-nearest neighbor graph  Two-phase clustering Performance Analysis(cont’d)  The amount of time required to compute  K-nearest neighbor graph Low-dimensional data sets = O(n log n)  High-dimensional data sets = O(n2)  Performance Analysis(cont’d)  The amount of time required to compute  Two-phase clustering Computing internal inter-connectivity and closeness for each cluster: O(nm)  Selecting the most similar pair of cluster: O(n log n + m2 log m)   Total time = O(nm + n log n + m2 log m) Experimental Results  Program  DBSCAN: a publicly available version  CURE: a locally implemented version  Data sets  Qualitative comparison Data Sets • • Five clusters Different size, shape, and density Noise point • • • Six clusters Different size, shape, and orientation Random noise point Special artifacts • • • • • • • • Eight clusters Different size, shape, density, and orientation Random noise point Two clusters Close to each other Different region, different densities • • • Eight clusters Different size, shape, and orientation Random noise and special artifacts Concluding remarks  CHAMELEON can discover natural clusters of different shapes and sizes  It is possible to use other algorithms instead of k-nearest neighbor graph  Different domains may require different models for capturing closeness and inter-connectivity Personal Opinion  Without further work

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic