Project Presentation - University of Calgary
Project Presentation
CPSC 695
Prepared by: Priyadarshi Bhattacharya

Outline of Talk
- Introduction to clustering and its relevance to my research interests.
- Discussion of existing clustering techniques and their shortcomings.
- Introduction to a new Delaunay-based clustering algorithm.
- Experimental results and comparison with other methods.
- Directions of future research.

Clustering – Definition
- Automatic identification of groups of similar objects.
- A method of grouping data such that intracluster similarity is maximized and intercluster similarity is minimized.

Properties of clustering
- Scalability: running time should grow no worse than linearly as the data size increases.
- Ability to detect clusters of different shapes.
- Minimal input parameters.
- Robustness with regard to noise.
- Insensitivity to the order of data input.
- Scalability to higher dimensions.
(Properties taken, with minor modifications, from "On Data Clustering Analysis: Scalability, Constraints and Validation".)

Relevance to my research
- Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.
[Diagram: Marine Route Planning – Incident Data (ESRI shape file) → Clustering Algorithm → High-risk areas → Location of SAR bases]

Existing clustering algorithms
- Partitioning: K-Means, K-Medoid
- Hierarchical: BIRCH, CURE, ROCK, CHAMELEON
- Density-based: DBSCAN, TURN*
- Grid-based: WaveCluster¹, CLIQUE

¹ WaveCluster: a novel clustering approach based on wavelet transforms; it applies a multiresolution grid structure on the data space. For more details, refer to "WaveCluster: a multiresolution clustering approach for very large spatial databases", Proc. 24th Conf. on Very Large Databases.

Shortcomings of existing methods
- Require a large number of user-supplied parameters, e.g. the number of clusters, a threshold quantifying "similarity", a stopping condition, the number of nearest neighbors, etc.
- Sensitivity to these user-supplied parameters.
- Capability of identifying clusters degrades as noise increases.
- Inability to identify clusters of widely varying shapes and sizes; most methods detect only spherical clusters.
- Identification of dense clusters in the presence of sparse ones, clusters connected by multiple bridges, and closely lying dense clusters remains elusive.

CRYSTAL – A new Delaunay-based clustering algorithm
The algorithm has three stages:
- Triangulation phase: forms the Delaunay triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges.
- Grow-cluster phase: scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. Growth stops when the boundary of the cluster is determined.
- Noise-removal phase: identifies noise as sparse clusters, which are easily eliminated by removing clusters that are very small in size or have very low density.

Description of Stage I
Triangulation phase:
- The triangulation is computed in O(n log n) time using the incremental algorithm.
- An auxiliary grid structure (O(n) in size) speeds up point location in the Delaunay triangulation, considerably reducing the length of the walk in the graph needed to locate the triangle containing a data point.
- The well-known winged-edge data structure is used to represent the Delaunay triangulation because of its efficiency in answering proximity queries.

Description of Stage II
Grow-cluster phase:
- A queue maintains, in order, the list of vertices from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue.
- To decide whether a point belongs to the cluster, its edge length is compared with the average edge length of the cluster.
- To decide whether a point lies on the boundary of a cluster, the average adjacent edge length of the point is compared with the average edge length of the cluster.
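The grow-cluster phase described above can be sketched in a few lines of Python. This is a much-simplified illustration, not CRYSTAL itself: the Delaunay adjacency is assumed to be precomputed and passed in as a plain neighbour dictionary, the `factor` threshold and running-average acceptance rule are illustrative stand-ins for the edge-length tests, and for simplicity the sketch seeds from locally dense vertices first rather than using the sorted order and explicit boundary test of the real algorithm.

```python
import math
from collections import deque

def grow_clusters(coords, neighbors, factor=1.6):
    """Grow clusters over a precomputed neighbour graph.

    coords    -- {vertex: (x, y)}
    neighbors -- {vertex: set of adjacent vertices}; in CRYSTAL this
                 adjacency would come from the Delaunay triangulation
    factor    -- illustrative threshold: an edge is accepted only if it
                 is at most `factor` times the cluster's average edge length
    """
    def dist(a, b):
        return math.dist(coords[a], coords[b])

    # average length of the edges adjacent to each vertex
    avg_adj = {v: sum(dist(v, u) for u in neighbors[v]) / len(neighbors[v])
               for v in coords}
    # simplification: seed from locally dense vertices (shortest average
    # adjacent edge length) first
    order = sorted(coords, key=lambda v: avg_adj[v])

    assigned, clusters = set(), []
    for seed in order:
        if seed in assigned:
            continue
        cluster = {seed}
        assigned.add(seed)
        total, count = avg_adj[seed], 1  # running average of accepted edges
        q = deque([seed])
        while q:  # breadth-first: first-order neighbours, then second-order...
            v = q.popleft()
            for u in neighbors[v]:
                d = dist(v, u)
                if u not in assigned and d <= factor * (total / count):
                    assigned.add(u)
                    cluster.add(u)
                    total, count = total + d, count + 1
                    q.append(u)
        clusters.append(cluster)
    return clusters
```

On a toy graph of two well-separated point groups joined by two long "bridge" edges, the bridge edges exceed the cluster's average edge length by far more than `factor`, so growth stops at the cluster boundary and two clusters emerge; a Stage III pass could then drop any cluster below a minimum size or density.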
Description of Stage III
Noise-removal phase:
- Noise in the data may take the form of isolated data points or points scattered throughout the data.
- In the former case, clusters seeded at these points cannot grow. If the noise is scattered uniformly throughout the data, our algorithm identifies it as a single sparse cluster.
- This phase removes such noise simply by eliminating the cluster with the highest average edge length. Any trivial clusters (smaller than an acceptable size) are also removed in this phase.

Complexity Analysis
- The algorithm operates in O(n log n) time overall.
- The Delaunay triangulation is generated in O(n log n) time.
- Since a vertex, once assigned to a cluster, is never considered again, the clustering itself is done in O(n) time.
[Plot: cluster size (×1000) vs. time consumed (ms)]

Clustering in action
[Slide figures]

Experimental Results
- Comparison with K-Means-based approaches.

Experimental Results (contd.)
- 1. Clusters of different shapes. 2. Closely lying dense clusters.

Experimental Results (contd.)
- 1. Clusters connected by multiple bridges. 2. Clusters of widely varying density.

Experimental Results (contd.)
[Table: data sets clustered by K-Means, GEM and CRYSTAL]

Experimental Results (contd.)
- Results on t7.10k.dat (originally used in "CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling"). All 11 identified by CRYSTAL!

Conclusion & Future Work
- CRYSTAL is a fast O(n log n) clustering algorithm that automatically identifies clusters of widely varying shapes, sizes and densities without requiring any input from the user.
- Future work:
  - Application of the clustering algorithm to identifying high-risk areas in the sea using the MARIS database.
  - Extension of the algorithm to 3D.
  - Considering physical constraints in clustering: in GIS, physical constraints such as rivers, highways and mountain ranges can hinder or alter the clustering result.

References
- G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497-506
- Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000)
- Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002)
- Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004)
- Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451-461
- Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA
- C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, 201-208. World Scientific, Singapore (1997)
- Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523-530
- Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.

Thank You!
Questions?