International Journal of Innovations & Advancement in Computer Science
IJIACS
ISSN 2347 – 8616
Volume 4, Special Issue
September 2015
Application of Clustering in Data mining Using Weka Interface
Anita and Uttama Pandey
Assistant Professor, D.A.V Centenary College, N.I.T Faridabad
Abstract: Clustering is now an important and widely used data mining technique for grouping data. Groups are formed
according to similarities between records, based on characteristics found in the real data. K-means clustering is an
important technique in data mining: it is a simple algorithm that has been applied in many problem domains, and it
generates a specified number of disjoint, flat clusters. This paper explains the use of k-means clustering through the
Weka interface. The database has been taken from the website of Agricultural Statistics of India. The paper
demonstrates clustering of the population and growth rate database using the Weka interface.
Keywords: K-means clustering, data mining, Weka interface.
I INTRODUCTION
Clustering is the process of dividing a set of objects into a set of meaningful subclasses, called clusters. It helps
users understand the natural grouping or structure in a data set. K-means clustering[1] is one of the
simplest clustering techniques and is commonly used in medical, imaging, biometric and other fields.
Computing has been widely adopted in fields such as agriculture. Gathering and analyzing data and grouping data
objects by their similarities and dissimilarities is very difficult to do by hand; in practice it requires
computer systems, and data mining is one such approach. With the k-means clustering technique it becomes
possible to form such groups from a database. This paper explains the k-means clustering technique using the
Weka interface.
II Meaning of K-means Clustering
K-means clustering gets its name from its method of operation.
The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each
observation to a cluster based on the observation's proximity to the mean of the cluster. The cluster means are then
recomputed and the process begins again. The k-means algorithm works as follows:
1. The algorithm arbitrarily selects k points as the initial cluster centers, called means.
2. Each point in the dataset is assigned to the closest cluster, based on the Euclidean distance between the point
and each cluster center.
3. Each cluster center is then recomputed as the average of the points in that cluster.
4. Steps 2 and 3 are repeated until the clusters converge, that is, until the assignments no longer change.
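The four steps above can be sketched in a few lines of Python. This is a minimal illustration and not the implementation used by Weka; the function names `euclidean` and `kmeans`, the `max_iter` and `seed` parameters, and the sample points are assumptions made for this example.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two equal-length points."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def kmeans(points, k, max_iter=100, seed=0):
    """Return (centers, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    # Step 1: pick k data points as the initial centers.
    centers = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Step 2: assign every point to its closest center.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once no assignment changes.
        if new_labels == labels:
            break
        labels = new_labels
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = tuple(sum(col) / len(members)
                                   for col in zip(*members))
    return centers, labels


# Hypothetical sample data: two visibly separate groups of 2-D points.
data = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
        (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, labels = kmeans(data, k=2)
print(centers)
print(labels)
```

The first three points end up in one cluster and the last three in the other; which cluster is numbered 0 depends on the random initial centers.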
Properties of k-means clustering
1. There are always K clusters.
2. There is always at least one item in each cluster.
3. The clusters are always non-hierarchical and they do not overlap.
4. Every member of a cluster is closer to the center of its own cluster than to the center of any other cluster.
The Euclidean distance used in k-means clustering is the distance between two points (objects or items) in a dataset,
here called X and Y:
$$\mathrm{EUCLIDEAN\ DISTANCE}(X, Y) = \left( |X_1 - Y_1|^2 + |X_2 - Y_2|^2 + \dots + |X_{N-1} - Y_{N-1}|^2 + |X_N - Y_N|^2 \right)^{1/2}$$
where |Z| represents the absolute value of Z, X is the first data point, Y is the second data point, N is the
number of characteristics (attributes in data mining terminology, fields in database terminology),
and EUCLIDEAN DISTANCE(X, Y)[3] is the distance between data point X and data point Y.
Example:
            Growth(%age)   Rate(per unit)   Time(yrs)
Object 1    6              3                2
Object 2    8              2                3
Here Object 1 has coordinate values (6, 3, 2) and Object 2 has coordinate values (8, 2, 3). Hence the Euclidean
distance for the above example can be calculated as:
$$\mathrm{EUCLIDEAN\ DISTANCE}(obj_1, obj_2) = \left( |6-8|^2 + |3-2|^2 + |2-3|^2 \right)^{1/2} = \sqrt{6}$$
This comes out to be 2.449.
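The same calculation can be checked with a short Python sketch. The function name `euclidean_distance` is chosen for this example only; the two tuples are the values of Object 1 and Object 2 from the example above.

```python
import math


def euclidean_distance(x, y):
    """Euclidean distance between two points with N attributes each."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum(abs(a - b) ** 2 for a, b in zip(x, y)))


object_1 = (6, 3, 2)  # growth, rate, time
object_2 = (8, 2, 3)
print(round(euclidean_distance(object_1, object_2), 3))  # 2.449
```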
Flowchart: Start -> Initialize K cluster centroids -> Compute distance of points to centroids -> Make grouping based
on minimum distance -> Compute new centroids -> Do centroids move? (if yes, repeat from the distance step; if no, End)
III Weka Interface
Weka stands for Waikato Environment for Knowledge Analysis[4]. It is a data mining and machine learning tool developed
by the Department of Computer Science, University of Waikato, New Zealand. It is a collection of visualization tools and
algorithms for data analysis and predictive modeling. Weka provides several tools to support data mining tasks,
including data pre-processing, clustering, classification, regression, visualization, and feature selection. All of Weka's
techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data
point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute
types are also supported). Result sets in Weka can be saved as plain-text files with the extension .arff
(Attribute-Relation File Format), which can be opened in a text editor such as Notepad.
File formats[5] supported by Weka:
• CSV: A CSV file is a specially formatted plain text file which stores spreadsheet or basic database-style
information in a very simple format, with one record on each line, and each field within that
record separated by a comma.
• ARFF: An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of
instances sharing a set of attributes. ARFF files were developed to work with the Weka machine
learning software.
• XRFF: The XRFF (eXtensible attribute-Relation File Format) is an XML-based extension of the
ARFF format.
• C4.5 (*.data or *.names)
• Libsvm: Library for Support Vector Machines
• Binary serialized instances (*.bsi)
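To make the ARFF description more concrete, the following Python sketch builds the text of a very small ARFF file: a relation name, one declaration per attribute, and a data section with one instance per line. It handles numeric attributes only and does not quote names containing spaces. The function name `to_arff`, the relation name, and the attribute names are hypothetical; the two rows reuse the values of Object 1 and Object 2 from Section II.

```python
def to_arff(relation, attributes, rows):
    """Build ARFF text for numeric attributes only.

    relation   -- relation name (hypothetical, chosen for illustration)
    attributes -- list of attribute names without spaces
    rows       -- list of numeric tuples, one per instance
    """
    lines = ["@relation " + relation, ""]
    for name in attributes:
        lines.append("@attribute " + name + " numeric")
    lines.append("")
    lines.append("@data")
    for row in rows:
        if len(row) != len(attributes):
            raise ValueError("row length does not match attribute count")
        lines.append(",".join(str(value) for value in row))
    return "\n".join(lines) + "\n"


text = to_arff("example", ["growth", "rate", "time"], [(6, 3, 2), (8, 2, 3)])
print(text)
```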
The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus.
These buttons can be used to start the following applications:
• Explorer: An environment for exploring data with Weka (the application used in the rest of this paper).
• Experimenter: An environment for performing experiments and conducting statistical tests between learning
schemes.
• Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a
drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI: Provides a simple command-line interface that allows direct execution of Weka commands for
operating systems that do not provide their own command-line interface.
Figure 1 : Weka Explorer
The Explorer interface has six tabs:
1. Preprocess: used to choose the data file to be used by the application.
2. Classify: used to train and test different learning schemes on the preprocessed data file under experimentation.
3. Cluster: used to apply different tools that identify clusters within the data file.
4. Associate: used to apply different rules to the data file that identify associations within the data.
5. Select attributes: used to apply different rules to reveal changes based on the inclusion or exclusion of selected
attributes from the experiment.
6. Visualize: used to see what the various manipulations produced on the data set in 2D form, as scatter plot and bar
graph output.
IV K-means clustering using Weka interface
To demonstrate the application of k-means clustering using the Weka interface, the statistics of population and growth
rate[6] have been taken from the Agricultural Statistics of India website, agricoop.nic.in/Agristatistics.htm.
The actual database is in .xls format, but Weka supports the .csv (comma-separated values) format. The conversion
can be done by saving the database with the .csv extension.
Figure 2 Actual Database(.xls)
Figure 3 Database (.csv)
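In the saved result file shown in Step 5, the population figures such as '5,80,282' appear as quoted nominal values rather than numbers, and the first record (Himachal Pradesh) appears in the attribute names. This suggests, though the paper does not state it, that the .csv file had no header row and kept the digit-grouping commas. The following Python sketch shows one possible way to prepare such rows before loading them. The function names and the column names in `header` are hypothetical placeholders, not taken from the original database; the two sample rows are copied from the result listing.

```python
import csv
import io


def parse_grouped_int(text):
    """Turn a digit-grouped string such as '5,80,282' into the int 580282."""
    return int(text.replace(",", "").strip())


def clean_rows(rows, numeric_columns):
    """Strip stray spaces and convert the grouped columns to integers."""
    cleaned = []
    for row in rows:
        row = [cell.strip() for cell in row]
        for i in numeric_columns:
            row[i] = parse_grouped_int(row[i])
        cleaned.append(row)
    return cleaned


def to_csv(header, rows):
    """Return CSV text with an explicit header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(header)
    writer.writerows(rows)
    return buffer.getvalue()


# Placeholder column names (assumption); the source does not name them.
header = ["region", "pop_a", "pop_b", "pop_total", "growth_1", "growth_2"]
raw = [
    ["Chandigarh ", "5,80,282", "4,74,404", "10,54,686", "40.28", "17.1"],
    ["Sikkim", "3,21,661", "2,86,027", "6,07,688", "33.06", "12.36"],
]
print(to_csv(header, clean_rows(raw, numeric_columns=[1, 2, 3])))
```

With a header row and plain integers, a CSV loader has a better chance of treating the population columns as numeric attributes, so that they can contribute to the distance calculation.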
Working with clusters in Weka:
Step 1: Click the Open file button to select the database for which clusters are to be created. The
database must be in .csv format.
Figure 4 Opening database in weka
Step 2: Choose the clustering method from the list of clusterers. SimpleKMeans has been selected here; the number
of clusters is set by right-clicking on the selected k-means method and setting numClusters to 3.
Figure 5 Properties of k-means
Step 3: The available cluster modes are:
• Use training set
• Supplied test set
• Percentage split
• Classes to clusters evaluation
Select Use training set from these modes and click Start.
Figure 6 k-means clusters
Figure 6 shows the evaluation on the training data set, with the instances grouped into three clusters.
Step 4: Visualize the cluster assignments by right-clicking on the entry in the result list window.
Figure 7 Clusters visualization
Step 5: The cluster visualization result can be saved to a text file for further review. To do this, click the
Save button in the “Clusterer visualize” window. The result must be saved in .arff format. When the saved file,
“pop.arff”, is opened in Notepad, it looks like this:
@relation 2.2_clustered
@attribute Instance_number numeric
@attribute 'Himachal Pradesh ' {'Chandigarh ',Sikkim,'Arunachal Pradesh','Nagaland ','Manipur ',Mizoram,Tripura,Meghalaya,'Daman & Diu ','Dadra & Nagar Haveli ',Goa,'Lakshadweep ','Pondicherry ','Andaman & Nicobar Islands '}
@attribute '34,73,892' {'5,80,282','3,21,661','7,20,232','10,25,707','13,69,764','5,52,339','18,71,867','14,92,668','1,50,100','1,93,178','7,40,711','33,106','6,10,485','2,02,330'}
@attribute '33,82,617' {'4,74,404','2,86,027','6,62,379','9,54,895','13,51,992','5,38,675','17,99,165','14,71,339','92,811','1,49,675','7,17,012','31,323','6,33,979','1,77,614'}
@attribute '68,56,509' {'10,54,686','6,07,688','13,82,611','19,80,602','27,21,756','10,91,014','36,71,032','29,64,007','2,42,911','3,42,853','14,57,723','64,429','12,44,464','3,79,944'}
@attribute 17.54 numeric
@attribute 12.81 numeric
@attribute Cluster {cluster0,cluster1,cluster2}
@data
0,'Chandigarh ','5,80,282','4,74,404','10,54,686',40.28,17.1,cluster2
1,Sikkim,'3,21,661','2,86,027','6,07,688',33.06,12.36,cluster1
2,'Arunachal Pradesh','7,20,232','6,62,379','13,82,611',27,25.92,cluster1
3,'Nagaland ','10,25,707','9,54,895','19,80,602',64.53,-0.47,cluster2
4,'Manipur ','13,69,764','13,51,992','27,21,756',24.86,18.65,cluster1
5,Mizoram,'5,52,339','5,38,675','10,91,014',28.82,22.78,cluster1
6,Tripura,'18,71,867','17,99,165','36,71,032',16.03,14.75,cluster1
7,Meghalaya,'14,92,668','14,71,339','29,64,007',30.65,27.82,cluster1
8,'Daman & Diu ','1,50,100','92,811','2,42,911',55.73,53.54,cluster0
9,'Dadra & Nagar Haveli ','1,93,178','1,49,675','3,42,853',59.22,55.5,cluster0
10,Goa,'7,40,711','7,17,012','14,57,723',15.21,8.17,cluster1
11,'Lakshadweep ','33,106','31,323','64,429',17.3,6.23,cluster1
12,'Pondicherry ','6,10,485','6,33,979','12,44,464',20.62,27.72,cluster1
13,'Andaman & Nicobar Islands ','2,02,330','1,77,614','3,79,944',26.9,6.68,cluster1
Note: In addition to the "Instance_number" attribute, Weka has also added a "Cluster" attribute to the original
data set. In the data portion, each instance now has its assigned cluster as the last attribute value. With
some simple manipulation, this data set can easily be converted to a more usable form for additional
analysis or processing, for example by converting it to a comma-separated format and
sorting the result by cluster. Weka offers clustering capabilities not only as standalone schemes, but also as
filters and classifiers.
V FUTURE SCOPE
The concept of a package can be used to add functionality beyond what is already supplied
with weka.jar. A package consists of documentation, metadata, and possibly source code. Weka includes
a facility to manage these packages and a mechanism to load them dynamically at runtime.
VI ACKNOWLEDGEMENT
Special thanks to the Weka team for the privilege of working with this machine learning and
data mining tool, and to all the external contributors for enhancing knowledge of the various applications of
data mining.
VII REFERENCES
[1] K-Means Clustering Tutorial, by Kardi Teknomo, PhD
[2] Kanungo, T.; Mount, D. M.; Netanyahu, N. S.; Piatko, C. D.; Silverman, R.; Wu, A. Y.
"An efficient k-means clustering algorithm: Analysis and implementation".
[3] http://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml
[4] http://www.slideshare.net/wekacontent/an-introduction-to-weka-2875221#btnNext
[5] https://blog.itu.dk/SPVC-E2010/files/2010/11/wekatutorial.pdf
[6] http://agricoop.nic.in/Agristatistics.htm
[7] http://maya.cs.depaul.edu/~Classes/Ect584/Weka/k-means.html
[8] http://www.users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
[9] A Hybridized k-means approach for high dimensional dataset, by Rajashree Dash, International Journal
of Engineering, Science and Technology
[10] http://www.cs.put.poznan.pl/jstefanowski/sed/DM-7clusteringnew.pdf