International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org Email: [email protected], [email protected]
Volume 3, Issue 2, March – April 2014
ISSN 2278-6856
Comparison and Analysis of Various Clustering Methods in Data Mining on an Education Data Set Using the Weka Tool
Suman¹ and Mrs. Pooja Mittal²
¹ Student of Master of Technology, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India
² Assistant Professor, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India
Abstract: Data mining is used to find hidden information, patterns, and relationships in large data sets, which is very useful in decision making. Clustering is a very important technique in data mining: it divides the data into groups, where each group contains data that are similar to one another and dissimilar from the data of other groups. Clustering uses various notions to create the groups; for example, clusters can be groups with low distances among the cluster members, dense areas of the data space, intervals, or particular statistical distributions. This paper provides a comparison of various clustering algorithms such as k-means clustering, hierarchical clustering, density-based clustering, and grid-based clustering. We compare the performance of these major clustering algorithms on the aspect of correct class-wise cluster-building ability. The performance of the techniques is presented and compared using the clustering tool WEKA.
Keywords: Data mining, clustering, k-means clustering, hierarchical clustering, DBSCAN clustering, grid clustering.
I. Introduction
Data mining is also known as knowledge discovery. Within computer science, data mining is an important subfield with the computational ability to discover patterns in large data sets. The main objective of data mining is to discover data and patterns and store them in an understandable form. Data mining applications are used in almost every field, to manage records and in other forms. Data mining is a stepwise process that converts raw data into meaningful information (it follows a series of steps to discover the hidden data and patterns). Data mining has various techniques, each with its own capabilities, but in this paper we concentrate on the clustering technique and its methods.
Fig 2.1: Methods of clustering techniques
II. Clustering
In this technique we split the data into groups, and these groups are known as clusters. Each cluster contains homogeneous data, but its data is heterogeneous with respect to the data of other clusters. A data point is assigned to a cluster according to the attribute values describing the object. Clustering is used in many fields such as education, industry, and agriculture. Clustering uses unsupervised learning techniques. A cluster centre represents its cluster: given an input vector, we can tell which cluster the vector belongs to by measuring a similarity metric between the input vector and all cluster centres and determining which cluster is the nearest or most similar one [1]. There are various methods of clustering:
 PARTITIONING METHODS
o K-means method
o K-medoids method
 HIERARCHICAL METHODS
o Agglomerative
o Divisive
 GRID-BASED METHODS
 DENSITY-BASED METHODS
o DBSCAN
III. Weka
Weka was developed by the University of Waikato (New Zealand) and its first modern form was implemented in 1997. It is open source, which means it is available for public use. The Weka code is written in the Java language, and it contains a GUI for interacting with data files and producing visual results. The front view of Weka is shown in Figure 3.1.
Figure 3.1: Front view of the Weka tool
The GUI Chooser consists of four buttons:
 Explorer: An environment for exploring data with WEKA.
 Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
 Knowledge Flow: This environment supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
 Simple CLI: Provides a simple command-line interface that allows direct execution of WEKA commands for operating systems that do not provide their own command-line interface [8].
IV. Dataset
For performing the comparative analysis, we need datasets. In this research I take an education data set. This data set is very helpful for researchers: we can directly apply it in the data mining tool and predict the result.
V. Methodology
My methodology is very simple. I take the education data set, split into different data sets of student records, and load it into Weka. In Weka I apply different clustering algorithms and derive a useful result that will be very helpful for new users and new researchers.
VI. Performing clustering in Weka
For performing cluster analysis in Weka, I have loaded the data set into Weka, as shown in Fig. 6.1. Weka supports the CSV and ARFF data set formats; here we are using a CSV data set. This data has 2197 instances and 9 attributes.
Figure 6.1: Loading the data set into Weka
After that we have many options, shown in the figure. We perform clustering [10], so we click on the Cluster button. After that we need to choose which algorithm is applied to the data, as shown in Figure 6.2, and then click the OK button.
Fig. 6.2: Various clustering algorithms in Weka
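The same pre-load sanity check can also be scripted outside the GUI. A minimal sketch in Python using pandas (an assumption for illustration; the paper itself works entirely in the Weka GUI, and the file name students.csv is a placeholder):

    import pandas as pd

    # Placeholder file name; the paper's education data set is not distributed.
    df = pd.read_csv("students.csv")

    # The shape should match what Weka's Preprocess tab reports for the
    # full education data set: 2197 instances and 9 attributes.
    print(df.shape)   # expected: (2197, 9)
    print(df.dtypes)  # attribute types (numeric vs. nominal)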
VII. Partitioning methods
As the name suggests, in this method we divide the large set of objects into clusters (groups), and each cluster contains at least one element. This method follows an iterative process by which we can relocate an object from one group to another, more relevant group. This method is effective for small to medium-sized data sets. Examples of partitioning methods include k-means and k-medoids [2].
VII (I) K-Means Algorithm
It is a centroid-based technique. This algorithm takes the input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. A statistical method of assigning rank values can be used to handle categorical data in a cluster. K-means is mainly based on the distance between the object and the cluster mean; it then computes the new mean for each cluster. Here categorical data have been converted into numeric data by assigning rank values [3].
Algorithm: The input is k, the number of clusters, and D, a data set containing n objects. The output is a set of k clusters. The algorithm follows these steps:
Step 1: Randomly choose k objects from D as the initial cluster centres.
Step 2: Calculate the distance from each data point to each cluster.
Step 3: If a data point is closest to its own cluster, leave it where it is; if not, move it into the closest cluster.
Step 4: Repeat steps 2 and 3 until the best relevant cluster is found for each data point.
Step 5: Update the cluster means by calculating the mean value of the objects in each cluster.
Step 6: Stop (every data point is located in a properly positioned cluster).
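To make these steps concrete, here is a minimal k-means sketch in Python/NumPy. It is only an illustration that mirrors the steps above: the results in this paper come from Weka's SimpleKMeans (which adds its own seeding and attribute handling), and the function and variable names here are ours.

    import numpy as np

    def kmeans(D, k, max_iter=100, seed=0):
        """Minimal k-means mirroring Steps 1-6 above.
        D is an (n, d) array of numeric points; k is the number of clusters."""
        rng = np.random.default_rng(seed)
        # Step 1: randomly choose k objects from D as the initial centres.
        centres = D[rng.choice(len(D), size=k, replace=False)]
        for _ in range(max_iter):
            # Step 2: distance from every data point to every cluster centre.
            dists = np.linalg.norm(D[:, None, :] - centres[None, :, :], axis=2)
            # Step 3: assign each point to its closest cluster.
            labels = dists.argmin(axis=1)
            # Step 5: update each cluster mean (empty clusters keep their centre).
            new = np.array([D[labels == i].mean(axis=0) if np.any(labels == i)
                            else centres[i] for i in range(k)])
            # Steps 4 and 6: repeat until the means stop moving, then stop.
            if np.allclose(new, centres):
                break
            centres = new
        return centres, labels

For example, kmeans(np.random.rand(446, 9), k=2) imitates a two-cluster run on a 9-attribute data set of the size used in Table 7.1.1.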
Now I apply k-means in the Weka tool. Table 7.1.1 shows the result of k-means.
Table 7.1.1: k-means clustering results

Dataset Name      Attributes and Instances        Clustered Instances          Time taken to build the model   Squared Error   No. of Iterations
Civil             Instances: 446, Attributes: 9   0: 247 (55%); 1: 199 (45%)   0.02 seconds                    13.5            3
Computer and IT   Instances: 452, Attributes: 9   0: 206 (46%); 1: 246 (54%)   0.02 seconds                    15.6            5
E.C.E             Instances: 539, Attributes: 9   0: 317 (59%); 1: 222 (41%)   0.27 seconds                    16.03           5
Mechanical        Instances: 760, Attributes: 9   0: 327 (43%); 1: 433 (57%)   0.06 seconds                    22.7            3
FIG. 7.1: Comparison between attributes of k-means clustering
VII (II) K-Medoids Algorithm
This is a variation of the k-means algorithm that is less sensitive to outliers [5]. Here, instead of the mean, we use an actual object to represent each cluster, with one representative object per cluster. Clusters are generated by points which are close to their respective representatives. The function used for classification is a measure of the dissimilarities between the points in a cluster and their representative [5]. The partitioning is done by minimizing the sum of the dissimilarities between each object and its cluster representative. This criterion is called the absolute-error criterion.
Sum of absolute error: E = Σ_{i=1}^{N} Σ_{p ∈ C_i} dist(p, o_i)
where p represents an object in the data set, o_i is the representative of the i-th cluster C_i, and N is the number of clusters.
Two well-known types of k-medoids clustering [6] are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications).
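As a rough illustration of the absolute-error criterion, the sketch below (Python/NumPy, our own simplified alternating scheme rather than PAM's full swap search or CLARA's sampling) assigns objects to their nearest medoid and then re-picks each medoid as the cluster member with the smallest total distance to its cluster-mates.

    import numpy as np

    def k_medoids(D, k, max_iter=50, seed=0):
        """Simplified k-medoids: representatives are actual data points
        (medoids), updated to reduce the absolute-error criterion E."""
        rng = np.random.default_rng(seed)
        dist = np.linalg.norm(D[:, None, :] - D[None, :, :], axis=2)  # pairwise distances
        medoids = rng.choice(len(D), size=k, replace=False)
        for _ in range(max_iter):
            labels = dist[:, medoids].argmin(axis=1)   # nearest-medoid assignment
            new_medoids = medoids.copy()
            for i in range(k):
                members = np.flatnonzero(labels == i)
                if len(members) > 0:
                    # The member minimizing total within-cluster distance
                    # becomes the new medoid of cluster i.
                    within = dist[np.ix_(members, members)].sum(axis=1)
                    new_medoids[i] = members[within.argmin()]
            if np.array_equal(np.sort(new_medoids), np.sort(medoids)):
                break
            medoids = new_medoids
        return medoids, labels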
VIII. Hierarchical Clustering
This method produces a tree of relationships between the clusters. In this method the number of clusters matches the data: if we have n data points, then we use n clusters. It is of two types.
Agglomerative (bottom-up): This is a bottom-up approach, so it starts from sub-clusters, then merges the sub-clusters and makes one big cluster at the top.
Divisive (top-down): This works in the opposite way to agglomerative. It starts from the top with one big cluster and decomposes it into smaller clusters; thus it starts from the top and reaches the bottom.
Figure 8.1: Hierarchical Clustering Process [7]
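For a library-based analogue of the bottom-up process (an assumption for illustration, not the Weka setup used for the results), scikit-learn's AgglomerativeClustering merges one cluster per point upward until a requested count remains:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    # Illustrative random stand-in for one 9-attribute student data set.
    X = np.random.rand(446, 9)

    # Bottom-up merging: start with one cluster per point and repeatedly
    # merge the two closest clusters until only 2 remain.
    labels = AgglomerativeClustering(n_clusters=2, linkage="average").fit_predict(X)
    print(np.bincount(labels))  # cluster sizes, analogous to Table 7.2.1

Table 7.2.1 shows the Weka results for hierarchical clustering on the four data sets.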
Table 7.2.1: Hierarchical clustering results

Dataset Name      Attributes and Instances        Clustered Instances          Time taken to build the model
Civil             Instances: 446, Attributes: 9   0: 445 (100%); 1: 1 (0%)     2.09 seconds
Computer and IT   Instances: 452, Attributes: 9   0: 305 (67%); 1: 147 (33%)   4.02 seconds
E.C.E             Instances: 539, Attributes: 9   0: 538 (100%); 1: 1 (0%)     3.53 seconds
Mechanical        Instances: 760, Attributes: 9   0: 758 (100%); 1: 2 (0%)     13 seconds
FIG. 8.2: Comparison between attributes of hierarchical clustering
IX. Grid-based methods
The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure, on which all of the operations for clustering are performed. We present two examples: STING and CLIQUE.
STING (Statistical Information Grid): It is used mainly with numerical values. It is a grid-based multi-resolution clustering technique in which the numerical attributes are computed and stored in rectangular cells. The quality of the clustering produced by this method is directly related to the granularity of the bottom-most layer, approaching the result of DBSCAN as the granularity approaches zero [2].
CLIQUE (Clustering In QUEst): It was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. CLIQUE is a subspace partitioning algorithm introduced in 1998.
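To illustrate just the quantization idea behind grid-based methods, the toy sketch below (our own, and deliberately much simpler than STING or CLIQUE) bins objects into a finite number of cells and keeps the dense cells; real grid algorithms then add per-cell statistics and merge adjacent dense cells.

    import numpy as np
    from collections import defaultdict

    def dense_grid_cells(X, cells_per_dim=10, density_threshold=5):
        """Quantize the object space into a grid and return the cells
        holding at least density_threshold objects."""
        lo, hi = X.min(axis=0), X.max(axis=0)
        width = (hi - lo) / cells_per_dim
        width[width == 0] = 1.0  # guard against constant attributes
        counts = defaultdict(int)
        for x in X:
            idx = np.minimum(((x - lo) / width).astype(int), cells_per_dim - 1)
            counts[tuple(idx)] += 1  # count each object into its grid cell
        return {cell: n for cell, n in counts.items() if n >= density_threshold}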
X. Density-based clustering
X.I. DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It uses the concepts of "density reachability" and "density connectivity", both of which depend upon the input parameters: the size of the epsilon neighborhood e and the minimum number of points in the local distribution of nearest neighbors. The parameter e controls the size of the neighborhood and hence the size of the clusters. The algorithm starts with an arbitrary point that has not been visited [4]. DBSCAN is an important clustering algorithm which is widely used in the scientific literature. Density is measured by the number of objects which are nearest to the cluster.
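The two inputs described above map directly onto the parameters of scikit-learn's DBSCAN, shown here as an illustrative analogue of the Weka run (the data is a random stand-in, not the paper's data set):

    import numpy as np
    from sklearn.cluster import DBSCAN

    X = np.random.rand(446, 9)  # stand-in for a 9-attribute data set

    # eps is the epsilon-neighborhood size e; min_samples is the minimum
    # number of neighbors required for a point to count as dense.
    labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

    # Label -1 marks noise; the other labels are density-connected clusters.
    print("clusters:", len(set(labels) - {-1}), "noise:", int((labels == -1).sum()))

Table 10.1.1 shows the Weka results for DBSCAN.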
Table 10.1.1: DBSCAN clustering results

Dataset Name      Attributes and Instances        Clustered Instances   Time taken to build the model
Civil             Instances: 446, Attributes: 9   446                   4.63 seconds
Computer and IT   Instances: 452, Attributes: 9   452                   6.13 seconds
E.C.E             Instances: 539, Attributes: 9   539                   11.83 seconds
Mechanical        Instances: 760, Attributes: 9   760                   23.95 seconds
X.II. OPTICS: It stands for Ordering Points To Identify the Clustering Structure. DBSCAN burdens the user with choosing the input parameters; moreover, different parts of the data could require different parameters [5]. OPTICS is an algorithm for finding density-based clusters in spatial data which addresses one of DBSCAN's major weaknesses, i.e., detecting meaningful clusters in data of varying density.
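As an illustrative analogue (again scikit-learn, not the Weka implementation used for the results), OPTICS needs no single global epsilon, which is how it copes with varying density:

    import numpy as np
    from sklearn.cluster import OPTICS

    X = np.random.rand(446, 9)  # stand-in for a 9-attribute data set

    # OPTICS orders points by reachability distance instead of fixing one
    # global eps, so clusters of different densities emerge from one run.
    labels = OPTICS(min_samples=5).fit_predict(X)

Table 10.2.1 shows the Weka results for OPTICS.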
Table 10.2.1: OPTICS clustering results

Dataset Name      Attributes and Instances        Clustered Instances   Time taken to build the model
Civil             Instances: 446, Attributes: 9   446                   5.42 seconds
Computer and IT   Instances: 452, Attributes: 9   452                   6.73 seconds
E.C.E             Instances: 539, Attributes: 9   539                   9.81 seconds
Mechanical        Instances: 760, Attributes: 9   760                   23.85 seconds
XI. Experimental results
Here we apply the various clustering methods to the student record data and compare them using the Weka tool. According to these comparisons we find which method gives the better result. Fig. 11.1 shows the comparison according to the time taken to build a model.
Fig. 11.1: Comparison according to the time taken to build a model
According to this result, we can say that k-means provides better results than the other methods. But build time is only a single attribute, so we cannot choose k-means every time; we can use any of the other methods if time is not important.
XII. Conclusion
Data mining covers every field of our life. Mainly we use data mining in banking, education, business, etc. In this paper, we have provided an overview, comparison, and classification of clustering algorithms: partitioning, hierarchical, density-based, and grid-based methods. Under partitioning methods, we have applied k-means, and its variant k-medoids, in the Weka tool. Under hierarchical methods, we have discussed the two approaches, top-down and bottom-up. We have also applied the DBSCAN and OPTICS algorithms under the density-based methods. Finally, we have used the STING and CLIQUE algorithms under the grid-based methods. We have thus described a comparative study of data mining techniques; the comparisons are shown in the above tables. We can say that every technique is important in its functional area, and we can improve the capability of data mining techniques by removing their limitations.
References
[1] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[2] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, 2007.
[3] Patnaik, Sovan Kumar, Soumya Sahoo, and Dillip Kumar Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach," International Journal of Computer Applications, 43.2: 1-3, 2012.
[4] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[5] Pavel Berkhin, "Survey of Clustering Data Mining Techniques," 2002.
[6] C. Y. Lin, M. Wu, J. A. Bloom, I. J. Cox, and M. Miller, "Rotation, scale, and translation resilient public watermarking for images," IEEE Trans. Image Processing, vol. 10, no. 5, pp. 767-782, May 2001.
[7] Pallavi, Sunila Godara, "A Comparative Performance Analysis of Clustering Algorithms," International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 1, Issue 3, pp. 441-445.
[8] Bharat Chaudhari, Manan Parikh, "A Comparative Study of Clustering Algorithms Using Weka Tools," International Journal of Application or Innovation in Engineering & Management (IJAIEM).
[9] Meila, M. and Heckerman, D. (February 1998). An Experimental Comparison of Several Clustering and Initialization Methods. Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA.