Project Presentation
CPSC 695
Prepared By:
Priyadarshi Bhattacharya
Outline of Talk
- Introduction to clustering and its relevance to my research interests.
- Discussion of existing clustering techniques and their shortcomings.
- Introduction to a new Delaunay-based clustering algorithm.
- Experimental results and comparison with other methods.
- Directions for future research.
Clustering – Definition
- Automatic identification of groups of similar objects.
- A method of grouping data such that intracluster similarity is maximized and intercluster similarity is minimized.
Properties of clustering
- Scalability: clustering time should grow roughly linearly as the data size increases.
- Ability to detect clusters of different shapes.
- Minimal input parameters.
- Robustness with regard to noise.
- Insensitivity to data input order.
- Scalability to higher dimensions.
(Properties adapted, with minor modifications, from "On Data Clustering Analysis: Scalability, Constraints and Validation".)
Relevance to my research
- Identification of high-risk areas in the sea based on incident data from the Maritime Activity and Risk Investigation System (MARIS), maintained primarily by the University of Halifax.
[Workflow diagram with components: Incident Data (ESRI Shape File), Clustering Algorithm, High-risk areas, Marine Route Planning, Location of SAR Bases]
Existing clustering algorithms
- Partitioning: K-Means, K-Medoid
- Hierarchical: BIRCH, CURE, ROCK, CHAMELEON
- Density-based: DBSCAN, TURN*
- Grid-based: WaveCluster [1], CLIQUE
[1] WaveCluster: a novel clustering approach based on wavelet transforms that applies a multiresolution grid structure to the data space. For more details, see "WaveCluster: a multiresolution clustering approach for very large spatial databases", Proc. 24th Conf. on Very Large Databases.
Shortcomings of existing methods
- Require a large number of parameters to be input by the user, for example the number of clusters, a threshold to quantify "similarity", a stopping condition, or the number of nearest neighbors.
- Sensitivity to these user-supplied parameters.
- The ability to identify clusters degrades as noise increases.
- Inability to identify clusters of widely varying shapes and sizes; most methods detect only spherical clusters.
- Identification of dense clusters in the presence of sparse ones, of clusters connected by multiple bridges, and of closely lying dense clusters remains elusive.
CRYSTAL – A new Delaunay-based clustering algorithm
The algorithm has three stages:
1. Triangulation phase: forms the Delaunay triangulation of the data points and sorts the vertices in order of decreasing average length of adjacent edges.
2. Grow-cluster phase: scans the sorted vertex list and grows clusters from the vertices in that order, first encompassing first-order neighbors, then second-order neighbors, and so on. Growth stops when the boundary of the cluster is determined.
3. Noise-removal phase: the algorithm identifies noise as sparse clusters, which can easily be eliminated by removing clusters that are very small in size or have very low density.
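To make the flow concrete, here is a minimal sketch of how the three phases could fit together. It assumes the helper functions sketched in the stage descriptions that follow; the function names and parameters are illustrative, not the author's actual implementation.

```python
import numpy as np

def crystal(points, min_size=5):
    """Illustrative driver for the three CRYSTAL phases.

    The helper functions are sketched in the following sections;
    names and parameters are hypothetical.
    """
    points = np.asarray(points, dtype=float)
    # Stage I: Delaunay triangulation; vertices sorted by decreasing
    # average adjacent edge length.
    tri, adjacency, avg_len, order = triangulation_phase(points)
    # Stage II: grow clusters by scanning the sorted vertex list.
    clusters, _ = grow_clusters(points, adjacency, avg_len, order)
    # Stage III: discard trivial and sparse clusters as noise.
    return remove_noise(points, clusters, adjacency, min_size=min_size)
```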
Description of Stage I
Triangulation phase:
- The triangulation is computed in O(n log n) time using the incremental algorithm.
- An auxiliary grid structure (O(n) in size) is used to speed up point location in the Delaunay triangulation. This considerably reduces the length of the walk in the graph needed to locate the triangle containing a data point.
- The well-known winged-edge data structure is used to represent the Delaunay triangulation because of its efficiency in answering proximity queries.
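A minimal sketch of this phase, using SciPy's Delaunay (Qhull) as a stand-in for the incremental algorithm, auxiliary grid, and winged-edge structure described above; it only illustrates building the adjacency and sorting vertices by decreasing average adjacent edge length.

```python
import numpy as np
from scipy.spatial import Delaunay

def triangulation_phase(points):
    """Sketch of Stage I: triangulate, then sort vertices by decreasing
    average adjacent edge length."""
    points = np.asarray(points, dtype=float)
    tri = Delaunay(points)

    # Adjacency of the triangulation: each triangle contributes its
    # three undirected edges.
    adjacency = {i: set() for i in range(len(points))}
    for a, b, c in tri.simplices:
        for u, v in ((a, b), (b, c), (a, c)):
            adjacency[u].add(v)
            adjacency[v].add(u)

    # Average length of the edges adjacent to each vertex.
    avg_len = {
        v: np.mean([np.linalg.norm(points[v] - points[u]) for u in nbrs])
        for v, nbrs in adjacency.items()
    }

    # Vertices in order of decreasing average adjacent edge length.
    order = sorted(adjacency, key=avg_len.get, reverse=True)
    return tri, adjacency, avg_len, order
```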
Description of Stage II
Grow-cluster phase:
- A queue maintains, in order, the list of vertices from which the cluster is grown. Only vertices that are not boundary points are inserted into the queue.
- To decide whether a point belongs to the cluster, the length of the connecting edge is compared with the average edge length of the cluster.
- To decide whether a point lies on the boundary of a cluster, the average adjacent edge length of the point is compared with the average edge length of the cluster.
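A sketch of one possible reading of this phase. The slide does not give the exact comparison thresholds, so the `factor` used below is a hypothetical placeholder for both the membership test and the boundary test.

```python
from collections import deque
import numpy as np

def grow_clusters(points, adjacency, avg_len, order, factor=1.5):
    """Sketch of Stage II: grow clusters from vertices in the sorted order.

    `factor` is a hypothetical threshold; the slide only states that edge
    lengths are compared with the cluster's average edge length.
    """
    points = np.asarray(points, dtype=float)
    cluster_of = {}                 # vertex index -> cluster id
    clusters = []                   # cluster id -> list of member vertices

    for seed in order:
        if seed in cluster_of:
            continue                # already absorbed by an earlier cluster
        cid = len(clusters)
        cluster_of[seed] = cid
        members, edge_lengths = [seed], []
        queue = deque([seed])       # only non-boundary vertices are enqueued

        while queue:
            v = queue.popleft()
            cluster_avg = np.mean(edge_lengths) if edge_lengths else avg_len[v]
            for u in adjacency[v]:
                if u in cluster_of:
                    continue
                d = np.linalg.norm(points[v] - points[u])
                # Membership test: the connecting edge must be comparable to
                # the cluster's average edge length.
                if d <= factor * cluster_avg:
                    cluster_of[u] = cid
                    members.append(u)
                    edge_lengths.append(d)
                    # Boundary test: a vertex whose own average adjacent edge
                    # length is much larger than the cluster average is kept
                    # as a boundary point and not grown further.
                    if avg_len[u] <= factor * cluster_avg:
                        queue.append(u)
        clusters.append(members)

    return clusters, cluster_of
```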
Description of Stage III
Noise-removal phase:
- Noise in the data may take the form of isolated data points or points scattered throughout the data. In the former case, clusters seeded at these points will not be able to grow.
- If the noise is scattered uniformly throughout the data, the algorithm identifies it as a single sparse cluster; this phase removes that noise by eliminating the cluster with the highest average edge length.
- Any trivial clusters (size less than an acceptable number) are also removed in this phase.
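A sketch of this phase under the description above; `min_size` and `drop_sparsest` are hypothetical parameters standing in for the "acceptable number" and the decision to remove the sparsest cluster.

```python
import numpy as np

def remove_noise(points, clusters, adjacency, min_size=5, drop_sparsest=True):
    """Sketch of Stage III: discard trivial clusters and, optionally, the
    single sparsest cluster, which uniformly scattered noise tends to form."""
    points = np.asarray(points, dtype=float)

    def avg_edge_len(cluster):
        # Average length of the triangulation edges internal to the cluster.
        members = set(cluster)
        lengths = [np.linalg.norm(points[u] - points[v])
                   for u in cluster for v in adjacency[u]
                   if v in members and u < v]
        return np.mean(lengths) if lengths else float("inf")

    # Drop trivial clusters first.
    kept = [c for c in clusters if len(c) >= min_size]

    # Then drop the cluster with the highest average edge length,
    # interpreted as uniformly scattered noise.
    if drop_sparsest and len(kept) > 1:
        kept.remove(max(kept, key=avg_edge_len))
    return kept
```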
Complexity Analysis
- The algorithm operates in O(n log n) time overall.
- The Delaunay triangulation is generated in O(n log n) time. Since a vertex, once assigned to a cluster, is not considered again, the clustering itself is done in O(n) time.
[Plot: cluster size (×1000) vs. time consumed (ms)]
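A hypothetical timing harness for the crystal() driver sketched earlier, useful for reproducing a size-versus-time curve like the one referenced on this slide; the sizes and random data below are illustrative only.

```python
import time
import numpy as np

def time_crystal(sizes=(1_000, 2_000, 4_000, 8_000), seed=0):
    """Time crystal() on uniformly random 2D points of increasing size."""
    rng = np.random.default_rng(seed)
    results = []
    for n in sizes:
        pts = rng.random((n, 2))                    # points in the unit square
        start = time.perf_counter()
        crystal(pts)
        results.append((n, (time.perf_counter() - start) * 1000.0))
    return results                                  # list of (n, milliseconds)
```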
Clustering in action
Experimental Results
Comparison with K-Means based approaches
Experimental Results (contd.)
1. Clusters of different shapes
2. Closely lying dense clusters
Experimental Results (contd.)
1. Clusters connected by multiple bridges
2. Clusters of widely varying density
Experimental Results (contd.)
[Figure: clustering of the data set by K-Means, GEM, and CRYSTAL]
Experimental Results (contd.)
Results on t7.10k.dat (originally used in “CHAMELEON: A Hierarchical Clustering Algorithm
Using Dynamic Modeling”)
Conclusion & Future Work
CRYSTAL is a fast O(nlogn) clustering algorithm that
automatically identifies clusters of widely varying shapes, sizes
and densities without requiring any input from user.
Future work will involve:

Application of the clustering algorithm in identification of highrisk areas in the sea using the MARIS database.

Extension of the algorithm to 3D.

Considering physical constraints in clustering. In GIS, physical
constraints such as rivers, highways, mountain ranges can
hinder or alter the clustering result.
References
- G. Papari, N. Petkov: Algorithm That Mimics Human Perceptual Grouping of Dot Patterns. Lecture Notes in Computer Science (2005) 497–506.
- Vladimir Estivill-Castro, Ickjai Lee: AUTOCLUST: Automatic Clustering via Boundary Extraction for Mining Massive Point-Data Sets. Fifth International Conference on Geocomputation (2000).
- Osmar R. Zaiane, Andrew Foss, Chi-Hoon Lee, Weinan Wang: On Data Clustering Analysis: Scalability, Constraints and Validation. Advances in Knowledge Discovery and Data Mining, Springer-Verlag (2002).
- Z.S.H. Chan, N. Kasabov: Efficient global clustering using the Greedy Elimination Method. Electronics Letters 40(25) (2004).
- Aristidis Likas, Nikos Vlassis, Jakob J. Verbeek: The global k-means clustering algorithm. Pattern Recognition 36(2) (2003) 451–461.
- Ying Xu, Victor Olman, Dong Xu: Minimum Spanning Trees for Gene Expression Data Clustering. Computational Protein Structure Group, Life Sciences Division, Oak Ridge National Laboratory, USA.
- C. Eldershaw, M. Hegland: Cluster Analysis using Triangulation. Computational Techniques and Applications CTAC97, 201–208. World Scientific, Singapore (1997).
- Mir Abolfazl Mostafavi, Christopher Gold, Maciej Dakowicz: Delete and insert operations in Voronoi/Delaunay methods and applications. Computers & Geosciences 29(4) (2003) 523–530.
- Atsuyuki Okabe, Barry Boots, Kokichi Sugihara: Spatial Tessellations: Concepts and Applications of Voronoi Diagrams.
Thank You!
[Slide callout: All 11 identified by CRYSTAL!]
Questions?