International Journal of Innovations & Advancement in Computer Science
IJIACS, ISSN 2347 – 8616, Volume 4, Special Issue, September 2015

Application of Clustering in Data Mining Using Weka Interface

Anita and Uttama Pandey
Assistant Professor, D.A.V Centenary College, N.I.T Faridabad

Abstract

Clustering is now an important and widely used data mining technique for grouping data. Groups are formed on the basis of similarities between data objects, according to characteristics found in the real data. K-means clustering is an important technique in data mining: it is a simple algorithm that has been applied in many problem domains, and it generates a specified number of disjoint, flat clusters. This paper explains the use of k-means clustering through the Weka interface. The database has been taken from the website of Agricultural Statistics of India. The paper demonstrates clustering on a database of population and growth rate using the Weka interface.

Keywords: K-means clustering, data mining, Weka interface.

I INTRODUCTION

Clustering is the process of dividing a set of objects into a set of meaningful subclasses, called clusters. It helps users understand the natural grouping or structure in a data set. K-means clustering [1] is one of the simplest clustering techniques and is commonly used in medical, imaging, biometric and other fields. Computing has also been widely adopted in fields such as agriculture. Gathering and analyzing data, and grouping data objects by their similarities and dissimilarities, is very difficult to do by hand; it is practical only with computer systems, and data mining is one such approach. With the k-means clustering technique of data mining, groupings can be derived directly from databases. This paper explains the k-means clustering technique using the Weka interface.

II Meaning of K-means Clustering

K-means clustering gets its name from its method of operation. The algorithm clusters observations into k groups, where k is provided as an input parameter. It then assigns each observation to a cluster based on the observation's proximity to the mean of the cluster. The cluster means are then recomputed and the process begins again. The k-means algorithm works as follows:

1. The algorithm arbitrarily selects k points as the initial cluster centers, called means.
2. Each point in the dataset is assigned to the closest cluster, based on the Euclidean distance between the point and each cluster center.
3. Each cluster center is then recomputed as the average of the points in that cluster.
4. Steps 2 and 3 are repeated until the clusters converge.

Properties of k-means clustering

1. There are always K clusters.
2. There is always at least one item in each cluster.
3. The clusters are always non-hierarchical and they do not overlap.
4. Every member of a cluster is closer to its own cluster than to any other cluster, because closeness does not always involve the center of the clusters.

Euclidean distance in k-means clustering is the distance between two points (objects, items) in a dataset, defined for point X and point Y as

$$\mathrm{EUCLIDEAN\ DISTANCE}(X, Y) = \left( |X_1 - Y_1|^2 + |X_2 - Y_2|^2 + \dots + |X_{N-1} - Y_{N-1}|^2 + |X_N - Y_N|^2 \right)^{1/2}$$

where |Z| represents the absolute value of Z, X is the first data point, Y is the second data point, and N is the number of characteristics (attributes in data mining terminology, or fields in database terminology). EUCLIDEAN DISTANCE(X, Y) [3] is the distance between data point X and data point Y.
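
As a rough illustration of the four steps and the distance formula above, the following Python sketch implements them with the standard library only. The names `euclidean_distance`, `k_means`, `max_iterations` and `seed` are assumptions made for this example and are not part of Weka or of this paper. Keeping the previous center when a cluster becomes empty is likewise just one possible choice, and Weka's own k-means implementation may differ in its details.

```python
import math
import random


def euclidean_distance(x, y):
    """Euclidean distance between two points with the same number of attributes."""
    if len(x) != len(y):
        raise ValueError("points must have the same number of attributes")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def k_means(points, k, max_iterations=100, seed=0):
    """Minimal k-means sketch (illustrative, not Weka's implementation).

    Returns (centers, assignments), where assignments[i] is the index of
    the cluster that points[i] belongs to.
    """
    rng = random.Random(seed)
    # Step 1: arbitrarily select k points as the initial cluster centers.
    centers = [tuple(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iterations):
        # Step 2: assign each point to the closest center.
        new_assignments = [
            min(range(k), key=lambda c: euclidean_distance(p, centers[c]))
            for p in points
        ]
        # Step 4: stop once the assignments no longer change.
        if new_assignments == assignments:
            break
        assignments = new_assignments
        # Step 3: recompute each center as the average of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # assumption: an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignments


if __name__ == "__main__":
    # Made-up 2-D points forming two well-separated groups.
    sample = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
              (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
    centers, labels = k_means(sample, k=2)
    print(centers)
    print(labels)
```

The sample points in the `__main__` block are invented for illustration. The fixed `seed` only makes the arbitrary initial selection repeatable; a different seed may number the clusters differently.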

Example:

            Growth (%age)   Rate (per unit)   Time (yrs)
Object 1    6               3                 2
Object 2    8               2                 3

Here Object 1 has the coordinate values (6, 3, 2) and Object 2 has the coordinate values (8, 2, 3). Hence the Euclidean distance for this example is calculated as:

$$\mathrm{EUCLIDEAN\ DISTANCE}(obj_1, obj_2) = \left( |6-8|^2 + |3-2|^2 + |2-3|^2 \right)^{1/2} = (4 + 1 + 1)^{1/2} \approx 2.449$$

Flowchart: Start → Initialize K cluster centroids → Compute distance of points to centroids → Group points by minimum distance → Compute new centroids → Do the centroids move? (if yes, repeat from the distance step; if no, End)

III Weka Interface

Weka stands for Waikato Environment for Knowledge Analysis [4]. It is a data mining/machine learning tool developed by the Department of Computer Science, University of Waikato, New Zealand. It is a collection of visualization tools and algorithms for data analysis and predictive modeling. Weka supports several standard data mining tasks: data pre-processing, clustering, classification, regression, visualization, and feature selection. All of Weka's techniques are predicated on the assumption that the data is available as a single flat file or relation, where each data point is described by a fixed number of attributes (normally numeric or nominal attributes, but some other attribute types are also supported). Result sets in Weka are saved as text files (which can be opened in Notepad) with the extension .arff (Attribute-Relation File Format).

File formats [5] supported by Weka:

CSV: A CSV file is a specially formatted plain text file which stores spreadsheet or basic database-style information in a very simple format, with one record on each line and each field within that record separated by a comma.
ARFF: An ARFF (Attribute-Relation File Format) file is an ASCII text file that describes a list of instances sharing a set of attributes.
ARFF files were developed to work with the Weka machine learning software.
XRFF: The XRFF (eXtensible attribute-Relation File Format) is an XML-based extension of the ARFF format.
C4.5 (*.data or *.names)
LibSVM: Library for Support Vector Machines
Binary serialized instances (*.bsi)

The GUI Chooser consists of four buttons, one for each of the four major Weka applications, and four menus. The buttons can be used to start the following applications:

• Explorer: An environment for exploring data with Weka.
• Experimenter: An environment for performing experiments and conducting statistical tests between learning schemes.
• Knowledge Flow: This environment supports essentially the same functions as the Explorer but with a drag-and-drop interface. One advantage is that it supports incremental learning.
• Simple CLI: Provides a simple command-line interface that allows direct execution of Weka commands on operating systems that do not provide their own command-line interface.

Figure 1: Weka Explorer

The Explorer interface has six tabs:

1. Preprocess: used to choose the data file to be used by the application.
2. Classify: used to train and test different learning schemes on the preprocessed data file under experimentation.
3. Cluster: used to apply different tools that identify clusters within the data file.
4. Associate: used to apply different rules to the data file that identify associations within the data.
5. Select attributes: used to apply different rules to reveal changes based on the inclusion or exclusion of selected attributes from the experiment.
6. Visualize: used to see what the various manipulations produced on the data set in 2D form, as scatter plot and bar graph output.

IV K-means clustering using Weka interface

To demonstrate the application of k-means clustering using the Weka interface, statistics of population and growth rate [6] were taken from Agricultural Statistics of India, at the website "agricoop.nic.in/Agristatistics.htm". The actual database is in .xls format, but Weka supports the .csv (comma-separated values) format. The conversion can be done by saving the database with the .csv extension.

Figure 2: Actual database (.xls)

Figure 3: Database (.csv)

Working with clusters in Weka:

Step 1: Click "Open file" to select the database for which clusters are to be created. The database must be in .csv format.

Figure 4: Opening database in Weka

Step 2: Choose the clustering method from the list of clusterers. Simple k-means has been selected here; the number of clusters is specified by right-clicking on the selected k-means method and setting numClusters to 3.

Figure 5: Properties of k-means

Step 3: The available cluster modes are:
Use training set
Supplied test set
Percentage split
Classes to clusters evaluation
Select "Use training set" from these modes and click Start.

Figure 6: k-means clusters

Figure 6 shows the evaluation on the training data set, with the instances grouped into three clusters.

Step 4: Visualize the cluster assignments by right-clicking on the result list window.

Figure 7: Clusters visualization

Step 5: The cluster visualization result is saved as a text file for further review. To do this, click the Save button in the "Clusterer visualize" window. The result must be saved in .arff format. When opened in Notepad, the saved "pop.arff" file looks like this:

@relation 2.2_clustered
@attribute Instance_number numeric
@attribute 'Himachal Pradesh ' {'Chandigarh ',Sikkim,'Arunachal Pradesh','Nagaland ','Manipur ',Mizoram,Tripura,Meghalaya,'Daman & Diu ','Dadra & Nagar Haveli ',Goa,'Lakshadweep ','Pondicherry ','Andaman & Nicobar Islands '}
@attribute '34,73,892' {'5,80,282','3,21,661','7,20,232','10,25,707','13,69,764','5,52,339','18,71,867','14,92,668','1,50,100','1,93,178','7,40,711','33,106','6,10,485','2,02,330'}
@attribute '33,82,617' {'4,74,404','2,86,027','6,62,379','9,54,895','13,51,992','5,38,675','17,99,165','14,71,339','92,811','1,49,675','7,17,012','31,323','6,33,979','1,77,614'}
@attribute '68,56,509' {'10,54,686','6,07,688','13,82,611','19,80,602','27,21,756','10,91,014','36,71,032','29,64,007','2,42,911','3,42,853','14,57,723','64,429','12,44,464','3,79,944'}
@attribute 17.54 numeric
@attribute 12.81 numeric
@attribute Cluster {cluster0,cluster1,cluster2}
@data
0,'Chandigarh ','5,80,282','4,74,404','10,54,686',40.28,17.1,cluster2
1,Sikkim,'3,21,661','2,86,027','6,07,688',33.06,12.36,cluster1
2,'Arunachal Pradesh','7,20,232','6,62,379','13,82,611',27,25.92,cluster1
3,'Nagaland ','10,25,707','9,54,895','19,80,602',64.53,-0.47,cluster2
4,'Manipur ','13,69,764','13,51,992','27,21,756',24.86,18.65,cluster1
5,Mizoram,'5,52,339','5,38,675','10,91,014',28.82,22.78,cluster1
6,Tripura,'18,71,867','17,99,165','36,71,032',16.03,14.75,cluster1
7,Meghalaya,'14,92,668','14,71,339','29,64,007',30.65,27.82,cluster1
8,'Daman & Diu ','1,50,100','92,811','2,42,911',55.73,53.54,cluster0
9,'Dadra & Nagar Haveli ','1,93,178','1,49,675','3,42,853',59.22,55.5,cluster0
10,Goa,'7,40,711','7,17,012','14,57,723',15.21,8.17,cluster1
11,'Lakshadweep ','33,106','31,323','64,429',17.3,6.23,cluster1
12,'Pondicherry ','6,10,485','6,33,979','12,44,464',20.62,27.72,cluster1
13,'Andaman & Nicobar Islands ','2,02,330','1,77,614','3,79,944',26.9,6.68,cluster1

Note: In addition to the "Instance_number" attribute, Weka has also added a "Cluster" attribute to the original data set. In the data portion, each instance now has its assigned cluster as the last attribute value. With some simple manipulation, this data set can easily be converted to a more usable form for additional analysis or processing. For example, here we have converted the data set to a comma-separated format and sorted the result by cluster. Weka offers clustering capabilities not only as standalone schemes, but also as filters and classifiers.

V FUTURE SCOPE

Packages can be used to add functionality separate from that already supplied in the weka.jar file. A package consists of documentation, metadata, and possibly source code. Weka includes a facility to manage these packages and a mechanism to load them dynamically at runtime.

VI ACKNOWLEDGEMENT

Special thanks go to the Weka team for giving us the privilege of working with this machine learning and data mining tool, and to all the external contributors for extending knowledge of the various applications of data mining.

VII REFERENCES

[1] Kardi Teknomo, K-Means Clustering Tutorial.
[2] Kanungo, T.; Mount, D. M.; Netanyahu, N. S.; Piatko, C. D.; Silverman, R.; Wu, A. Y. "An efficient k-means clustering algorithm: Analysis and implementation".
[3] http://www.cut-the-knot.org/pythagoras/DistanceFormula.shtml
[4] http://www.slideshare.net/wekacontent/an-introduction-to-weka-2875221#btnNext
[5] https://blog.itu.dk/SPVC-E2010/files/2010/11/wekatutorial.pdf
[6] http://agricoop.nic.in/Agristatistics.htm
[7] http://maya.cs.depaul.edu/~Classes/Ect584/Weka/k-means.html
[8] http://www.users.cs.umn.edu/~kumar/dmbook/dmslides/chap8_basic_cluster_analysis.pdf
[9] Rajashree Dash, "A Hybridized k-means approach for high dimensional dataset", International Journal of Engineering, Science and Technology.
[10] http://www.cs.put.poznan.pl/jstefanowski/sed/DM-7clusteringnew.pdf