International Journal of Conceptions on Computing and Information Technology
Vol. 1, Issue. 1, November 2013; ISSN: 2345 – 9808
Recent advances in clustering algorithms: a review
Manivara Kumar Parsha
Dept. of Electronics & Communication Engineering,
Narayana Engineering College,
Nellore, AP, India.
[email protected]

Sreenivasulu Pacha
Dept. of Electronics & Communication Engineering,
PBR Viswodaya Institute of Technology & Science,
Kavali, AP, India.
[email protected]
Abstract— The unsupervised classification of patterns, observations, data items, or feature vectors is called clustering. This classification aims to group the data into clusters. The clustering problem has been addressed in many contexts and by researchers in many disciplines, which reflects its broad applications in data analysis. This paper presents an overview of clustering algorithms, with the goal of providing useful advice and references, from fundamental concepts to present-day advancements in the algorithms discussed, so that the broad community of clustering practitioners can benefit from this review.
Keywords- Clustering Algorithms; Machine learning; Gene Expression; Information retrieval;
I. INTRODUCTION
Clustering is the process of grouping data; the groups are called clusters. Grouping is accomplished by finding similar characteristics in the actual data. Clustering is a form of unsupervised learning, and hence clusters can be defined as groups of like elements. Clustering has been used in many application domains, such as biology, medicine, anthropology, marketing, psychiatry, psychology, archeology, geology, geography, and economics. In any domain, clustering means grouping a given collection of unlabeled data, patterns, or feature vectors into meaningful clusters.
II. CLUSTERING ACTIVITY
The typical pattern clustering activity involves the
following steps[1] (i) Pattern Representations, this stage also
involves the feature extraction and/or selection. (ii) Definition
of a pattern proximity measure appropriate to the data domain.
(iii) Clustering or grouping. (iv) Data Abstraction (v)
Assessment of Output. The Last two steps are optional in
several applications. The Clustering methods are used in
Pattern Recognition, Image processing and information
retrieval. More or less these are also used in unsupervised
learning, vector quantization and Learning by observation.
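As a minimal illustration of steps (i)-(iii), the Python sketch below (assuming scikit-learn is available; the synthetic data and parameter choices are purely illustrative) represents patterns as scaled feature vectors, relies on the squared Euclidean proximity measure implicit in K-Means, and performs the grouping:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

# (i) Pattern representation: feature vectors (here, synthetic 2-D points).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2)) for c in ((0, 0), (3, 3))])

# Feature scaling so no single feature dominates the proximity measure.
X = StandardScaler().fit_transform(X)

# (ii) Proximity measure: K-Means implicitly uses squared Euclidean distance.
# (iii) Clustering/grouping: assign each point to one of k clusters.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])
```

Scaling before clustering is the step most often omitted in practice; without it, features with large ranges dominate the proximity measure.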
III. CLASSIFICATION
The clustering algorithms may be broadly classified as follows [2]:
a. Exclusive clustering: In exclusive clustering, data are grouped in an exclusive way, so that a given datum belongs to only one definite cluster. K-Means clustering is one example of an exclusive clustering algorithm.
b. Overlapping clustering: Overlapping clustering uses fuzzy sets to cluster data, so that each point may belong to two or more clusters with different degrees of membership.
c. Hierarchical clustering: This is connectivity-based clustering, a method of cluster analysis which seeks to build a hierarchy of clusters. Agglomerative clustering and divisive clustering come under hierarchical clustering.
d. Probabilistic clustering: Probabilistic clustering uses a completely probabilistic approach. The mixture-of-Gaussians algorithm is an example.
e. Partitional clustering: Non-hierarchical or partitioning clustering creates the clusters in one step, as opposed to several steps. MST-based, squared-error, K-Means, nearest-neighbor, and PAM algorithms come under the partitional algorithms.
f. Clustering large databases: Algorithms associated with dynamic databases include BIRCH, DBSCAN, and CURE.
g. Clustering large categorical data: Problems that arise when clustering categorical data can be solved with these algorithms. The K-modes algorithm is an example.
h. Centroid-based clustering: Here, clusters are represented by a central vector, which may not necessarily be a member of the data set. K-Means is an example.
i. Density-based clustering: In this type, clusters are defined as areas of higher density than the remainder of the data set. DBSCAN is the most popular example.
IV. ADVANCES IN THE CLUSTERING ALGORITHMS
A. Density Based Algorithm (DBSCAN)
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu in 1996. It is a density-based clustering algorithm because it finds a number of clusters starting from the estimated density distribution of the corresponding nodes. DBSCAN is one of the most common clustering algorithms and also one of the most cited in the scientific literature.
An interesting experiment in parallel clustering with DBSCAN was conducted by Domenica Arlia and Massimo Coppola[3], in which they used a master module and a slave module. The master module performs cluster assignment, while the slave module answers the neighborhood queries. As a result, unlike in the sequential scan, points already labeled are returned again and again by the slaves. The results were similar to those of sequential DBSCAN, but they provided a new code methodology that helps to re-engineer the existing code. Density-based algorithms such as GDBSCAN, PDBSCAN, and DENCLUE are surveyed in [4]. More recently, in 2010, Tepwankul et al. studied the problem of clustering uncertain objects. They proposed a new deviation function that approximates the underlying uncertainty model of objects, and a new density-based clustering algorithm, U-DBSCAN[5], that utilizes the proposed deviation.
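As a brief illustration of the original (sequential) algorithm, DBSCAN is available in scikit-learn. In the sketch below, the eps and min_samples values are assumptions that must be tuned to each data set, and points labeled -1 are noise:

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.cluster import DBSCAN

# Two interleaving half-moons: a shape K-Means handles poorly.
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps: neighborhood radius; min_samples: points required to form a dense core.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print(f"clusters found: {n_clusters}, noise points: {np.sum(labels == -1)}")
```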
B. BIRCH Algorithm
Balanced Iterative Reducing and Clustering using Hierarchies (BIRCH) is designed for very large data sets. Incremental and dynamic clustering of incoming objects can be done with BIRCH: only one scan of the data is necessary, and there is no need to have the whole data set in advance. It is a data clustering method built on the clustering feature (CF) tree data structure, and its authors suggest that the method can be easily parallelized. Two algorithms are used in BIRCH clustering: CF-tree insertion and CF-tree rebuilding. Both were explained by Tian Zhang and Raghu Ramakrishnan in their paper[6]. Recently, a novel method of efficient spam mail classification using BIRCH clustering techniques was presented by M. Basavaraju et al.[7]. This technique was implemented using open-source technology in the C language. The BIRCH algorithm works well for larger data sets.
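A minimal sketch of BIRCH's single-scan, incremental style of use, based on scikit-learn's Birch implementation (the threshold and branching factor shown are illustrative assumptions, not values from [6]):

```python
import numpy as np
from sklearn.cluster import Birch
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=10_000, centers=3, random_state=0)

# threshold: max radius of a CF subcluster; branching_factor: max children per CF-tree node.
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)

# Incremental clustering: feed the data in chunks, as if it arrived as a stream.
for chunk in np.array_split(X, 10):
    model.partial_fit(chunk)

labels = model.predict(X)
print(np.bincount(labels))
```

Because each chunk only updates the CF-tree summaries, memory stays bounded even as the stream grows.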
C. Clustering Using Representatives (CURE) Algorithm
The drawbacks of the traditional clustering algorithms are discussed in [8]: the algorithms generally favor clusters with spherical shapes and similar sizes, or are very fragile in the presence of outliers. Sudipto Guha, Rajeev Rastogi, and Kyuseok Shim proposed a new algorithm, CURE[8]. This algorithm is robust to outliers and identifies clusters with non-spherical shapes and wide variance in size. CURE achieves this by representing each cluster by a fixed number of points that are generated by selecting well-scattered points from the cluster and then shrinking them toward the center of the cluster by a specified fraction. Having more than one representative point per cluster allows CURE to adjust well to the geometry of non-spherical shapes, and the shrinking helps to dampen the effects of outliers.
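The representative-selection-and-shrinking step described above can be sketched in NumPy as follows. This is a simplified illustration of the idea for a single cluster, not the full hierarchical CURE procedure, and the parameters c and alpha are assumptions:

```python
import numpy as np

def cure_representatives(points, c=5, alpha=0.3):
    """Pick c well-scattered points of a cluster, then shrink them toward the centroid.

    points: (n, d) array of one cluster's members; alpha: shrink fraction in [0, 1].
    """
    centroid = points.mean(axis=0)
    # Greedy farthest-point selection: start from the point farthest from the centroid.
    reps = [points[np.argmax(np.linalg.norm(points - centroid, axis=1))]]
    while len(reps) < min(c, len(points)):
        # Next representative: the point farthest from all already-chosen representatives.
        dists = np.min([np.linalg.norm(points - r, axis=1) for r in reps], axis=0)
        reps.append(points[np.argmax(dists)])
    reps = np.array(reps)
    # Shrinking toward the centroid dampens the influence of outliers.
    return reps + alpha * (centroid - reps)

pts = np.vstack([np.random.default_rng(1).normal(0, 1, (100, 2)), [[8.0, 8.0]]])  # one outlier
print(cure_representatives(pts, c=4, alpha=0.3))  # the outlier rep is pulled inward
```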
An improved CURE algorithm was developed to analyze business information[9]. In this technique the weight value can be adjusted dynamically, which increases the flexibility of clustering. Theoretical analysis and experimental results showed that it is a valid method for dynamically identifying and analyzing an enterprise's competitors.
D. K-Means Algorithm
A simple and efficient implementation of Lloyd's K-means clustering algorithm was given by Kanungo et al.[10]. They presented the filtering algorithm, which is based on storing the multidimensional data points in a kd-tree. They established the practical efficiency of the filtering algorithm in two ways: one concerns its running time, and the other concerns studies on synthetically generated data and on real data sets from applications in color quantization, data compression, and image segmentation. Spam mail classification was also done[7] using the K-means algorithm, which fits best for smaller data sets.
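For reference, Lloyd's iteration itself is short. The NumPy sketch below is a didactic version without the kd-tree filtering speed-up of [10]; it simply alternates assignment and centroid-update steps until the assignments stop changing:

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Plain Lloyd's algorithm: X is (n, d); returns (centroids, labels)."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    labels = np.full(len(X), -1)
    for _ in range(n_iter):
        # Assignment step: each point goes to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        new_labels = dists.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # converged: assignments no longer change
        labels = new_labels
        # Update step: each centroid moves to the mean of its assigned points.
        for j in range(k):
            if np.any(labels == j):
                centroids[j] = X[labels == j].mean(axis=0)
    return centroids, labels

X = np.random.default_rng(0).normal(size=(200, 2))
X[:100] += 4  # two separated blobs
centroids, labels = lloyd_kmeans(X, k=2)
print(centroids)
```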
E. PAM (Partitioning Around Medoids)
Compared to the k-means approach, PAM has the following features[11]: (i) it also accepts a dissimilarity matrix; (ii) it is more robust because it minimizes a sum of dissimilarities instead of a sum of squared Euclidean distances; (iii) it provides a novel graphical display, the silhouette plot, which also helps in selecting the number of clusters. The PAM algorithm is based on the search for k representative objects, or medoids, among the objects of the data set. These objects should represent the structure of the data. After finding a set of k medoids, k clusters are constructed by assigning each object to the nearest medoid. The goal is to find k representative objects which minimize the sum of the dissimilarities of the objects to their closest representative object. The algorithm first looks for a good initial set of medoids (this is called the BUILD phase). Then it finds a local minimum for the objective function, that is, a solution such that no single swap of an object with a medoid will decrease the objective (this is called the SWAP phase).
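A compact sketch of PAM's two phases might look as follows in NumPy. The BUILD phase here is a simplified greedy version of the original, and the whole routine is an illustration rather than the reference implementation of [11]:

```python
import numpy as np
from scipy.spatial.distance import cdist

def pam(X, k, max_iter=50):
    """Simplified PAM: returns a list of k medoid indices and the cluster labels."""
    n = len(X)
    D = cdist(X, X)  # pairwise dissimilarity matrix (PAM also accepts one directly)
    # BUILD (simplified greedy): start from the most central object, then keep
    # adding the object that most reduces the total dissimilarity.
    medoids = [int(D.sum(axis=1).argmin())]
    while len(medoids) < k:
        costs = [np.inf if j in medoids
                 else D[:, medoids + [j]].min(axis=1).sum() for j in range(n)]
        medoids.append(int(np.argmin(costs)))
    # SWAP: replace a medoid with a non-medoid whenever that lowers the objective.
    best_cost = D[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        improved = False
        for i in range(k):
            for j in range(n):
                if j in medoids:
                    continue
                trial = medoids[:i] + [j] + medoids[i + 1:]
                cost = D[:, trial].min(axis=1).sum()
                if cost < best_cost:
                    medoids, best_cost, improved = trial, cost, True
        if not improved:
            break  # local minimum: no single swap decreases the objective
    return medoids, D[:, medoids].argmin(axis=1)

X = np.random.default_rng(0).normal(size=(60, 2))
medoids, labels = pam(X, k=3)
print(medoids, np.bincount(labels))
```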
Advances in a feature-level fusion system of face and palmprint traits were recently reported[12]. The Partitioning Around Medoids (PAM) algorithm is used to partition the set of 'n' invariant feature points of the face and palmprint images into 'k' clusters. By partitioning the face and palmprint images with scale-invariant feature transform (SIFT) points, a number of clusters are formed on both images. The results showed better performance compared to the existing techniques.
F. CLARANS Algorithm
CLARANS is an efficient and effective clustering method, especially in spatial data mining, and is applicable to locating objects with polygonal shapes. The content of data mining usually contains discrete data. The traditional way to handle this is to convert discrete data into numerical values; still, it is hard to obtain a meaningful result, since the ordering imposed on the discrete data distorts the problem domain. The clustering algorithm CLARANS (Clustering Large Applications based on RANdomized Search) can avoid this efficiently. The improved, parallel CLARANS[13] can improve clustering performance further: it is able to obtain better clustering results while also shortening the search time. It was applied to a 110 alarm system and turned out to be satisfactory.
CLARA (Clustering LARge Applications) relies on sampling and is designed to handle large data sets. Instead of finding representative objects for the entire data set, CLARA draws a sample of the data set, applies PAM on the sample, and finds the medoids of the sample. The point is that, if the sample is drawn in a sufficiently random way, the medoids of the sample approximate the medoids of the entire data set. The detailed algorithm is given in [14].
G. CHAMELEON Algorithm
Existing clustering algorithms, such as K-means, PAM, CLARANS, DBSCAN, CURE, and ROCK, are designed to find clusters that fit some static model. These algorithms can break down if the choice of parameters in the static model is incorrect with respect to the data set being clustered, or if the model is not adequate to capture the characteristics of the clusters. Furthermore, most of these algorithms break down when the data consist of clusters of diverse shapes, densities, and sizes. The CHAMELEON algorithm[15] measures the similarity of two clusters based on a dynamic model. In the clustering process, two clusters are merged only if the inter-connectivity and closeness (proximity) between the two clusters are high relative to the internal inter-connectivity of the clusters and the closeness of items within the clusters. The merging process using this dynamic model facilitates the discovery of natural and homogeneous clusters. The methodology of dynamic modeling of clusters used in CHAMELEON is applicable to all types of data as long as a similarity matrix can be constructed. Of the five data sets the authors used, CURE was able to find clusters in only two; CHAMELEON was able to find them in the remaining ones as well. Moreover, CHAMELEON can discover clusters of different shapes and sizes.
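CHAMELEON's first phase builds a sparse k-nearest-neighbor graph whose dense subgraphs are then partitioned and dynamically merged. Only that graph-construction step is sketched below with scikit-learn; k = 10 is an illustrative assumption, and the partitioning and merging phases are omitted:

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.neighbors import kneighbors_graph

X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Sparse k-NN graph: edge weights are distances; only each point's k nearest
# neighbors are connected, so dense regions form tightly knit subgraphs.
G = kneighbors_graph(X, n_neighbors=10, mode='distance', include_self=False)
print(G.shape, G.nnz)  # (500, 500) adjacency with 500 * 10 nonzero edges
```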
H. FCDC Algorithm
The FCDC algorithm (Frequent Concepts based Document Clustering)[16] treats text documents as sets of related words instead of bags of words. Different words that share the same meaning are known as synonyms, and a set of such words is known as a concept. Whether documents share the same frequent concepts is therefore used as the measure of their closeness, so the algorithm is able to group documents in the same cluster even if they contain no common words. The authors constructed feature vectors based on concepts and applied an Apriori paradigm for discovering frequent concepts; the frequent concepts are then used to create clusters. FCDC was found to be more efficient and accurate than other clustering algorithms in this application.
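The concept-mapping idea can be sketched as follows. The synonym dictionary is a tiny hypothetical stand-in for a real lexical resource such as WordNet, and the Apriori mining of frequent concepts is reduced to a simple document-frequency threshold:

```python
from collections import Counter

# Hypothetical synonym dictionary mapping words to concept identifiers.
CONCEPTS = {"car": "vehicle", "automobile": "vehicle", "auto": "vehicle",
            "doctor": "physician", "physician": "physician", "medic": "physician"}

def to_concepts(doc):
    """Map a document's words to concepts; unknown words map to themselves."""
    return {CONCEPTS.get(w, w) for w in doc.lower().split()}

docs = ["The car broke down", "An automobile needs repair", "A doctor was consulted"]
concept_sets = [to_concepts(d) for d in docs]

# 'Frequent' concepts: those appearing in at least 2 documents (toy support threshold).
freq = {c for c, n in Counter(c for s in concept_sets for c in s).items() if n >= 2}

# Docs 0 and 1 share the frequent concept 'vehicle' despite having no common words.
for i, s in enumerate(concept_sets):
    print(i, s & freq)
```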
I. Cluster Ensemble Approach
Clustering is one of the most widely and frequently used techniques in gene expression analysis. A comparative study of analyzing ES cell gene expression data with different algorithms, such as K-Means, PAM, and SOM, is given by Gengxin Chen et al.[17]. Their study provides a guideline on how to select a suitable clustering algorithm for the extraction of meaningful biological information from microarray expression data. Later advancements proved that the cluster ensemble approach[18] yields better results than the single best clustering algorithm for analyzing gene expression data sets.
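One common way to realize a cluster ensemble, offered here only as an illustrative sketch rather than the specific method of [18], is the co-association matrix: run a base clusterer several times, count how often each pair of points lands in the same cluster, and then cluster that matrix:

```python
import numpy as np
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=200, centers=3, random_state=0)

# Co-association matrix: fraction of runs in which points i and j co-cluster.
n_runs = 20
co = np.zeros((len(X), len(X)))
for seed in range(n_runs):
    labels = KMeans(n_clusters=3, n_init=1, random_state=seed).fit_predict(X)
    co += (labels[:, None] == labels[None, :])
co /= n_runs

# Consensus clustering: hierarchical clustering on 1 - co as a distance matrix.
# scikit-learn >= 1.2 uses `metric`; older versions call this parameter `affinity`.
consensus = AgglomerativeClustering(n_clusters=3, metric='precomputed',
                                    linkage='average').fit_predict(1 - co)
print(np.bincount(consensus))
```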
J. ROCK Algorithm
The RObust Clustering using links (ROCK) algorithm is targeted at both Boolean and categorical data. A pair of items are said to be neighbors if their similarity exceeds some threshold. The number of links between two items is defined as the number of common neighbors they have. The objective of the clustering algorithm is to group together points that have more links. A recent development in genetics used an improved version of this algorithm called GE-ROCK[19], which combines the techniques of clustering and genetic optimization. In GE-ROCK, the similarity function is used throughout the iterative clustering process, while in the conventional ROCK algorithm the similarity function is used only for the initial calculation. The analysis showed that GE-ROCK leads to superior performance, in both better clustering quality and shorter computing time, when compared with the ROCK algorithm commonly used in the literature.
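The neighbor-and-link computation at the heart of ROCK is compact. The sketch below assumes Boolean data with Jaccard similarity and an illustrative threshold theta; the link counts fall out of a single matrix product over the neighbor matrix:

```python
import numpy as np

def rock_links(X, theta=0.5):
    """X: (n, d) Boolean data matrix. Returns the (n, n) matrix of link counts."""
    Xi = np.asarray(X, dtype=int)
    inter = Xi @ Xi.T                                # |A & B| for every pair of rows
    sizes = Xi.sum(axis=1)
    union = sizes[:, None] + sizes[None, :] - inter  # |A | B|
    jaccard = inter / np.maximum(union, 1)
    A = (jaccard >= theta).astype(int)               # neighbor matrix (similarity above threshold)
    np.fill_diagonal(A, 0)                           # a point is not its own neighbor
    return A @ A                                     # links(i, j) = common neighbors of i and j

X = np.array([[1, 1, 0, 0],
              [1, 1, 1, 0],
              [0, 0, 1, 1],
              [0, 1, 1, 1]])
print(rock_links(X, theta=0.4))
```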
V. CONCLUSION
We have presented an overview of recent advances in clustering algorithms, along with the basic clustering activity and a broad classification, to give a wider view of the field. Improvements in different algorithms were discussed, including advancements in CLARANS, FCDC, cluster ensemble approaches, and ROCK. Recent advances in clustering across different areas, such as gene expression analysis, genetic optimization, text document clustering, image clustering, business information, spam mail classification, and clustering of uncertain objects, are included, which gives a broad spectrum of recent advances in clustering.
REFERENCES
[1] J. R. Prasad, R. S. Prasad, U. V. Kulkarni, "Impact of Feature Selection Methods in Hierarchical Clustering Technique: A Review," Proceedings of IMECS 2008, Vol. I, pp. 19-21, 2008.
[2] M. H. Dunham, Data Mining: Introductory and Advanced Topics, Pearson Education, Inc., 2003.
[3] D. Arlia, M. Coppola, "Experiments in Parallel Clustering with DBSCAN," Euro-Par 2001, LNCS 2150, pp. 326-331, Springer, Heidelberg, 2001.
[4] S. B. Kotsiantis, P. E. Pintelas, "Recent Advances in Clustering: A Brief Survey," Technical Report, 2004.
[5] A. Tepwankul, S. Maneewongvatana, "U-DBSCAN: A Density-Based Clustering Algorithm for Uncertain Objects," 26th IEEE International Conference on Data Engineering Workshops (ICDEW), pp. 136-143, 2010.
[6] T. Zhang, R. Ramakrishnan, M. Livny, "BIRCH: A New Data Clustering Algorithm and Its Applications," Kluwer Academic Publishers, pp. 1-40, 1997.
[7] M. Basavaraju, R. Prabhakar, "A Novel Method of Spam Mail Detection Using Text Based Clustering Approach," IJCA (0975-8887), Vol. 5, No. 4, August 2010.
[8] S. Guha, R. Rastogi, K. Shim, "CURE: An Efficient Clustering Algorithm for Large Databases," SIGMOD, Seattle, WA, USA, 1998.
[9] Y. Zhao, "Research of an Improved CURE Algorithm Used in Enterprise Competitive Intelligence to Dynamic Identify Analysis," IEEE Youth Conference on Information Computing and Telecommunications (YC-ICT), pp. 299-302, Nov. 2010.
[10] T. Kanungo, D. M. Mount, N. S. Netanyahu, C. D. Piatko, R. Silverman, A. Y. Wu, "An Efficient K-Means Clustering Algorithm: Analysis and Implementation," IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 24, No. 7, July 2002.
[11] United Nations Educational, Scientific and Cultural Organization, http://www.unesco.org/webworld/idams/advguide/Chapt7_1_1.htm
[12] Dakshina R. K., Phalguni G., Jammu K. S., "Multibiometrics Feature Level Fusion by Graph Clustering," International Journal of Security and Its Applications, Vol. 5, No. 2, April 2011.
[13] Xue Jing-Sheng, "Parallel CLARANS: Improvement and Application of the CLARANS Algorithm," International Conference on Computer and Communication Technologies in Agriculture Engineering (CCTAE), pp. 248-251, June 2010.
[14] R. T. Ng, J. Han, "Efficient and Effective Clustering Methods for Spatial Data Mining," Proceedings of the 20th VLDB Conference, Santiago, Chile, 1994.
[15] G. Karypis, E.-H. Han, V. Kumar, "Chameleon: A Hierarchical Clustering Algorithm Using Dynamic Modeling," Technical Report.
[16] Rekha B., Renu D., "A Frequent Concepts Based Document Clustering Algorithm," IJCA (0975-8887), Vol. 4, No. 5, July 2010.
[17] Gengxin C., Saied A. J., Nila B., "Evaluation and Comparison of Clustering Algorithms in Analyzing ES Cell Gene Expression Data."
[18] Xiaohua H., Illhoi Y., "Cluster Ensemble and Its Applications in Gene Expression Analysis," 2nd APBC, Dunedin, New Zealand, 2004.
[19] Qiongbing Z., Lixin D., Shanshan Z., "A Genetic Evolutionary ROCK Algorithm," International Conference on Computer Application and System Modeling (ICCASM), pp. 347-351, Oct. 2010.