Volume 4, Issue 3, May-June 2014
Available Online at www.gpublication.com/jcer
ISSN No.: 2250-2637
©Genxcellence Publication 2011, All Rights Reserved
RESEARCH PAPER
Comparative Study of Clustering Techniques
Archana, Jitendra Kumar
Department of Computer Science & Engineering, SET, IFTM University, Moradabad, India
[email protected], [email protected]
Abstract
Data clustering is the grouping of data with similar qualities into clusters; in other words, it is the division of data into groups such that data in the same group are similar to one another and dissimilar to data in other groups. Cluster analysis aims to categorize a set of patterns into clusters based on similarity. In this paper, various clustering techniques are analyzed on a dataset: we use WEKA to compare several clustering algorithms on sample data, taking a number of parameters into account.
Keywords
Clustering, Data Mining, WEKA
I. INTRODUCTION
Data mining is one of the important approaches for extracting a great deal of information from large amounts of data; it is the search for valuable information in large volumes of data [3]. Clustering is an important data mining technique that places similar data into one cluster and dissimilar data into other clusters [9]. Clustering is an unsupervised learning technique: data clusters are created to meet specific requirements that cannot be met using categorical labels alone, and data objects can be combined into a temporary group to obtain a cluster. Clustering techniques can be divided into several categories: partitioning algorithms, hierarchical algorithms, density-based clustering algorithms, farthest-first clustering, filtered clustering, etc. These clustering algorithms are defined and compared in this research. WEKA is used as the data mining tool for this purpose, as it provides a better interface than many other data mining tools.
This study uses the different clustering techniques available in the WEKA data mining tool. To compare the clustering algorithms, a dermatology dataset containing 35 attributes and 366 instances is used. The study takes five data mining clustering algorithms, i.e. K-Means, hierarchical clustering, density-based clustering, farthest-first clustering and filtered clustering, runs them on the dermatology dataset in WEKA, and analyzes the results to predict which algorithm is most useful.
Figure 1: Clustering Process (raw data → clustering algorithm → clusters of data)
II. WEKA
WEKA is a product of the University of Waikato (New Zealand) and was first implemented in its modern form in 1997 [5]. WEKA stands for Waikato Environment for Knowledge Analysis. It is one of the most widely used tools for data mining. WEKA is an open-source workbench developed in Java that aims to provide an interface in which various algorithms are made available to practitioners for research. Users can quickly adapt to the interface because of its simplicity and ease of use [6]. WEKA provides a graphical user interface together with many supporting facilities.
The WEKA GUI Chooser consists of four buttons:
1. Explorer: An environment for exploring data with
WEKA.
2. Experimenter: An environment for performing
experiments and conducting statistical tests between
learning schemes.
3. Knowledge Flow: This environment supports
essentially the same functions as the Explorer but with
a drag and drop interface. One advantage is that it
supports incremental learning.
4. Simple CLI: Provides a simple command-line interface
that allows direct execution of WEKA commands for
operating systems that do not provide their own
command line interface [7].
III. CLUSTERING
Data clustering is regarded as a remarkable approach for finding similarities in data and grouping similar data into clusters. A cluster is a set of objects that are similar to one another and dissimilar to the objects belonging to other clusters. Cluster analysis is used in a wide variety of fields: business analysis, psychology, social science, biology, statistics, information retrieval, machine learning, data mining [4], etc. Data mining adds to clustering the complications of very large datasets with many attributes of different types [1]. A variety of algorithms have recently emerged that meet these requirements and have been successfully applied to real-life data mining problems.
Please Cite this Article at: Archana et al, Journal of Current Engineering Research, 4 (3), May-June 2014, 7-10
IV. DATA CLUSTERING TECHNIQUES
Clustering is a main task of data mining. The most commonly used clustering algorithms are partitioning, hierarchical, farthest-first, filtered and density-based algorithms. Partitioning algorithms are of two types, K-Means and K-Medoids; K-Means is the most popular technique for clustering because it exhibits lower complexity than the other algorithms.
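As a rough illustration of this partitioning idea, the standard K-Means loop can be sketched in a few lines of Python. This is a minimal sketch on 2-D points with Euclidean distance, not WEKA's SimpleKMeans; the `kmeans` helper and the sample data are assumptions for the example:

```python
import random

def kmeans(points, k, iters=100, seed=1):
    """Plain Lloyd's K-Means on 2-D points with squared Euclidean distance."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)             # step i: arbitrary initial centres
    clusters = []
    for _ in range(iters):
        # step ii(a): (re)assign each object to the nearest centroid
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: (p[0]-centroids[c][0])**2 + (p[1]-centroids[c][1])**2)
            clusters[i].append(p)
        # step ii(b): update each centroid to the mean of its cluster
        new = [tuple(sum(x) / len(c) for x in zip(*c)) if c else centroids[i]
               for i, c in enumerate(clusters)]
        if new == centroids:                      # step iii: stop when nothing changes
            break
        centroids = new
    return centroids, clusters

centroids, clusters = kmeans([(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)], k=2)
```

On this toy data the two tight groups end up in separate clusters regardless of which initial centres are sampled.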
A. K-means Clustering
K-Means is a partitioning technique and one of the simplest unsupervised learning methods among the partitioning-based clustering algorithms. The algorithm organizes objects into k partitions, where each partition represents a cluster [8]. Each cluster's centre is represented by the mean value of the objects in the cluster: a centroid is defined for each cluster, and every data object is placed in the cluster whose centroid is nearest to it [10].
Algorithm
Input: 'k', the number of clusters to be partitioned; 'n', the number of objects.
Output: a set of 'k' clusters based on the given similarity function.
Steps
i) Arbitrarily choose 'k' objects as the initial cluster centres;
ii) Repeat
a. (Re)assign each object to the cluster to which the object is most similar, based on the given similarity function;
b. Update the centroids (cluster means), i.e., calculate the mean value of the objects in each cluster;
iii) Until no change.
Fig.3. Results of K-Means algorithm
B. Hierarchical Clustering
Hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Hierarchical clustering algorithms divide or merge a dataset into a sequence of nested partitions [13], building clusters gradually. Every cluster node contains child clusters; sibling clusters partition the points covered by their common parent. Such an approach allows exploring data at different levels of granularity. It is of two types:
i. Agglomerative (bottom-up)
Agglomerative hierarchical clustering is a bottom-up clustering method in which clusters have sub-clusters, which in turn have sub-clusters, and so on. It starts by letting each object form its own cluster and iteratively merges clusters into larger and larger clusters until all the objects are in a single cluster or a certain termination condition is satisfied. The single cluster becomes the hierarchy's root. For the merging step, it finds the two clusters that are closest to each other and combines them into one cluster [8].
ii. Divisive (top-down)
Divisive clustering is a top-down method and is less commonly used. It works in a similar way to agglomerative clustering but in the opposite direction: it starts with a single cluster containing all objects, and then successively splits the resulting clusters until only clusters of individual objects remain [8].
Fig.4 Results of Hierarchical Clustering
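The agglomerative merging step can be sketched as a small, illustrative Python routine: start with singleton clusters and repeatedly combine the two closest ones. This is a minimal single-linkage sketch on 2-D points; the `agglomerative` helper and its data are assumptions for the example, not WEKA code:

```python
def agglomerative(points, target_k=1):
    """Bottom-up merging: each object starts as its own cluster, and the two
    closest clusters are repeatedly combined until target_k clusters remain."""
    clusters = [[p] for p in points]

    def linkage(a, b):
        # single linkage: squared distance between the closest pair of members
        return min((x[0]-y[0])**2 + (x[1]-y[1])**2 for x in a for y in b)

    while len(clusters) > target_k:
        # find the two closest clusters...
        i, j = min(((i, j) for i in range(len(clusters))
                           for j in range(i + 1, len(clusters))),
                   key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)   # ...and combine them into one
    return clusters

result = agglomerative([(0, 0), (0, 1), (5, 5), (5, 6)], target_k=2)
```

Setting target_k=1 would continue merging until the single root cluster of the hierarchy remains.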
C. Density Based Clustering
In density-based clustering, clusters are defined as areas of higher density than the remainder of the data set; objects in the sparse areas that are required to separate clusters are considered to be noise or border points [11]. In this approach, a cluster is regarded as a region in which the density of data objects exceeds a threshold [13]. Representative algorithms include DBSCAN, GDBSCAN and OPTICS.
Density Based Spatial Clustering of Applications with Noise (DBSCAN)
This algorithm was proposed by Ester et al. in 1996. In DBSCAN, a cluster is defined by the set of all points connected to their neighbours. Eps and MinPts are the two parameters of DBSCAN: the basic idea of the algorithm is that, for each object of a cluster, the neighbourhood of a given radius (Eps) has to contain at least a minimum number of objects (MinPts) [11].
Procedure of the DBSCAN algorithm:
Arbitrarily select a point r.
Retrieve all points density-reachable from r w.r.t. Eps and MinPts.
If r is a core point, a cluster is formed.
If r is a border point, no points are density-reachable from r, and DBSCAN visits the next point of the database.
Continue the process until all of the points have been processed [12].
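The procedure above can be sketched in Python. This is a simplified illustration that rescans all points for each neighbourhood query (a real implementation would use a spatial index); the parameter names follow Eps/MinPts, and everything else is an assumption for the example:

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN: grow a cluster from each unvisited core point;
    points belonging to no cluster are labelled noise (-1)."""
    labels = {p: None for p in points}          # None = unvisited, -1 = noise

    def neighbours(p):
        # all points within radius Eps of p (including p itself)
        return [q for q in points
                if (p[0]-q[0])**2 + (p[1]-q[1])**2 <= eps * eps]

    cluster_id = 0
    for p in points:
        if labels[p] is not None:
            continue
        seeds = neighbours(p)
        if len(seeds) < min_pts:                # not a core point: mark as noise for now
            labels[p] = -1
            continue
        labels[p] = cluster_id                  # p is a core point: a cluster is formed
        queue = [q for q in seeds if q != p]
        while queue:                            # expand over density-reachable points
            q = queue.pop()
            if labels[q] == -1:
                labels[q] = cluster_id          # noise reached from a core point: border point
                continue
            if labels[q] is not None:
                continue
            labels[q] = cluster_id
            qn = neighbours(q)
            if len(qn) >= min_pts:              # q is itself a core point: keep growing
                queue.extend(x for x in qn if labels[x] is None)
        cluster_id += 1
    return labels

pts = [(0, 0), (0, 1), (1, 0), (1, 1), (5, 5), (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(pts, eps=1.5, min_pts=3)
```

Here the two dense squares form two clusters, while the isolated point (5, 5) has too few neighbours within Eps and is labelled noise.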
Fig.5 Results of Density Based Clustering
D. Farthest First Clustering
Farthest-first is a variant of K-Means that places each cluster centre in turn at the point farthest from the existing cluster centres [14]. This point must lie within the data region. This greatly speeds up clustering in most cases, since less reassignment and adjustment is needed.
Fig.6 Results of Farthest First Clustering
E. Filtered Clustering
In WEKA, the filtered clusterer passes the data through an arbitrary filter before handing it to a base clustering algorithm, so that filtering and clustering run as a single step; with the default base clusterer (K-Means) and no transforming filter, it produces the same clusters as K-Means, which is consistent with the results in Table 1.
Fig.7 Results of Filtered Clustering
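The farthest-first centre selection described under D above can be sketched as follows. This minimal Python sketch simply starts from the first data point (WEKA's implementation picks the start randomly); names and data are illustrative assumptions:

```python
def farthest_first_centres(points, k):
    """Farthest-first traversal: repeatedly add the point whose distance
    to its nearest already-chosen centre is largest."""
    centres = [points[0]]                   # sketch: take the first point as initial centre
    while len(centres) < k:
        def d2_to_nearest(p):
            # squared distance from p to its closest existing centre
            return min((p[0]-c[0])**2 + (p[1]-c[1])**2 for c in centres)
        centres.append(max(points, key=d2_to_nearest))
    return centres

centres = farthest_first_centres([(0, 0), (1, 0), (5, 0), (10, 0)], k=3)
```

Because each centre is chosen greedily in one pass, no iterative reassignment is needed, which is why this method is fast compared with the other algorithms in this study.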
V. COMPARISON
The sections above study each of the five techniques using the WEKA clustering tool on a dermatology dataset consisting of 35 attributes and 366 records. A comparison of these clustering algorithms using WEKA is given in Table 1.
Algorithm       | Clusters | Clustered instances         | Iterations | Log likelihood | Within-cluster SSE | Time to build model
K-Means         | 2        | 0: 252 (69%), 1: 114 (31%)  | 4          | -              | 3519.6616534260493 | 0.11 seconds
Hierarchical    | 2        | 0: 365 (100%), 1: 1 (0%)    | -          | -              | -                  | 0.53 seconds
Density Based   | 2        | 0: 254 (69%), 1: 112 (31%)  | 4          | -28.94463      | 3519.6616534260493 | 0.11 seconds
Farthest First  | 2        | 0: 277 (76%), 1: 89 (24%)   | -          | -              | -                  | 0.02 seconds
Filtered        | 2        | 0: 252 (69%), 1: 114 (31%)  | 4          | -              | 3519.6616534260493 | 0.05 seconds
Table 1: Comparison of clustering algorithms
VI. CONCLUSION
In this work we study various clustering algorithms on the dermatology data. Clustering is the process of grouping data into clusters; objects are clustered or grouped based on the principle of maximizing the intra-class similarity and minimizing the inter-class similarity, where dissimilarity is based on the attribute values describing the objects. We performed five experiments on the dermatology database. From the results we find that the K-Means algorithm performs better than the hierarchical clustering algorithm, and that in WEKA the K-Means and density-based algorithms give similar results. The hierarchical algorithm takes more time than the K-Means and density-based clustering algorithms, while farthest-first clustering gives the best results in the least time of the algorithms used in our study. As future work, one can take other algorithms, or even devise new ones, and compare the results with the existing ones.
REFERENCES
[1] Pavel Berkhin, "Survey of Clustering Data Mining Techniques", technical report, 2002.
[2] Ankit Bhardwaj, Arvind Sharma, V.K. Shrivastava, "Data Mining Techniques and Their Implementation in Blood Bank Sector – A Review", IJERA, 2012.
[3] Sang Jun Lee and Keng Siau, "A Review of Data Mining Techniques", Industrial Management & Data Systems, 2001.
[4] Arpit Gupta, Ankit Gupta, Amit Mishra, "Research Paper on Cluster Techniques of Data Variations", IJATER, 2011.
[5] Narendra Sharma, Aman Bajpai, Ratnesh Litoriya, "Comparison the Various Clustering Algorithms of WEKA Tools", IJETAE, Volume 2, Issue 5, May 2012.
[6] Namita Bhan, Deepti Mehrotra, "Comparative Study of EM and K-means Clustering Techniques in WEKA Interface", IJATER, 2013.
[7] Bharat Chandhari, Manan Parikh, "A Comparative Study of Clustering Algorithms Using WEKA Tools", IJAIEM, Vol. 1, Issue 2, 2012.
[8] Aastha Joshi, Rajneet Kaur, "A Review: Comparative Study of Various Clustering Techniques in Data Mining", IJARCSSE, 2013.
[9] Manish Verma, Mauly Srivastava, Neha Chack, Atul, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining", IJERA, 2012.
[10] Shalini S. Singh and N. C. Chauhan, "K-Means v/s K-Medoids: A Comparative Study", National Conference on Recent Trends in Engineering & Technology, 2011.
[11] Amandeep Kaur Mann and Navneet Kaur, "Survey Paper on Clustering Techniques", IJSETR, 2013.
[12] Prabhdip Kaur, Shruti Aggrwal, "Comparative Study of Clustering Techniques", IJARET, Vol. 1, Issue III, April 2013.
[13] Shraddha K. Popat and Emmanuel M., "Review and Comparative Study of Clustering Techniques", IJCSIT, 2014.
[14] Vishal Shrivastava and Prem Narayan Arya, "Performance Analysis on Retail Sales Data from Various Clustering Algorithms Using WEKA Tool", IJCITB, 2013.