International Journal of Emerging Trends & Technology in Computer Science (IJETTCS)
Web Site: www.ijettcs.org  Email: [email protected], [email protected]
Volume 3, Issue 2, March – April 2014  ISSN 2278-6856

Comparison and Analysis of Various Clustering Methods in Data Mining on an Education Data Set Using the Weka Tool

Suman(1) and Mrs. Pooja Mittal(2)
(1) Student of Master of Technology, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India
(2) Assistant Professor, Department of Computer Science and Application, M.D. University, Rohtak, Haryana, India

Abstract: Data mining is used to find hidden patterns and relationships in large data sets, which is very useful in decision making. Clustering is an important data mining technique that divides data into groups such that each group contains similar data and is dissimilar from the other groups. Clustering uses various notions to create the groups: clusters may be groups with small distances among their members, dense areas of the data space, intervals, or particular statistical distributions. This paper compares various clustering algorithms, such as k-means clustering, hierarchical clustering, density-based clustering, and grid-based clustering, with respect to their ability to build clusters that correctly match the classes in the data. The performance of these techniques is presented and compared using the clustering tool WEKA. Each cluster is represented by its center: given an input vector, we can tell which cluster it belongs to by measuring a similarity metric between the input vector and all cluster centers and determining which center is nearest or most similar [1].
The main clustering methods are the following:

PARTITIONING METHODS
o K-means
o K-medoids
HIERARCHICAL METHODS
o Agglomerative
o Divisive
GRID-BASED METHODS
DENSITY-BASED METHODS
o DBSCAN

Keywords: Data mining, clustering, k-means clustering, hierarchical clustering, DBSCAN clustering, grid-based clustering.

I. Introduction
Data mining is also known as knowledge discovery. It is an important subfield of computer science concerned with computationally discovering patterns in large data sets. The main objective of data mining is to discover data and patterns and to store them in an understandable form. Data mining applications are used in almost every field to manage records and in other forms. Data mining is a stepwise process that converts raw data into meaningful information (it follows a sequence of steps to discover the hidden data and patterns). Data mining comprises a number of techniques, each with its own capabilities, but in this paper we concentrate on clustering techniques and their methods.

Fig. 2.1: methods of the clustering techniques

II. Clustering
In this technique we split the data into groups known as clusters. Each cluster contains homogeneous data, which is heterogeneous with respect to the data of the other clusters. An object is assigned to a cluster according to its attribute values. Clustering is used in many fields such as education, industry, and agriculture. Clustering is an unsupervised learning technique.

III. Weka
Weka was developed by the University of Waikato (New Zealand), and its first modern form was released in 1997. It is open source, which means it is publicly available for use. Weka is written in Java and provides a GUI for interacting with data files and producing visual results. The front view of Weka is shown in Figure 3.1.
Figure 3.1: front view of the Weka tool

The GUI Chooser consists of four buttons:
Explorer: an environment for exploring data with WEKA.
Experimenter: an environment for performing experiments and conducting statistical tests between learning schemes.
Knowledge Flow: supports essentially the same functions as the Explorer, but with a drag-and-drop interface. One advantage is that it supports incremental learning.
Simple CLI: provides a simple command-line interface that allows direct execution of WEKA commands, for operating systems that do not provide their own command line interface. [8]

IV. Dataset
To perform the comparative analysis we need data sets. In this research an education data set is used. This data set is very helpful for researchers: it can be applied directly in data mining tools to predict results.

V. Methodology
The methodology is simple: the education data set, split into different sets of student records, is loaded into Weka. In Weka, different clustering algorithms are applied, producing useful results that will be helpful to new users and new researchers.

VI. Performing clustering in Weka
To perform cluster analysis in Weka, the data set is first loaded, as shown in Fig. 6.1. Weka supports the CSV and ARFF data set formats; here a CSV data set is used, with 2197 instances and 9 attributes.

Figure 6.1: loading the data set into Weka

After loading, several options are shown. To perform clustering [10] we click on the Cluster tab, and then we need to choose which algorithm to apply to the data.
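Weka's native format is ARFF, while the data here arrives as CSV. As a rough, hypothetical sketch of what the conversion involves (the `students` relation name and the sample columns are invented; in practice Weka's own CSVLoader should be used, and numeric or nominal attribute types chosen where appropriate):

```python
import csv, io

def csv_to_arff(csv_text, relation="students"):
    """Wrap a CSV table in a minimal ARFF header.
    Every attribute is declared as a free-form string for simplicity."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    lines = ["@relation " + relation, ""]
    # one @attribute declaration per CSV column
    lines += ["@attribute %s string" % name for name in header]
    lines += ["", "@data"]
    # quote each value so commas/spaces inside fields stay intact
    lines += [",".join("'%s'" % v for v in row) for row in data]
    return "\n".join(lines)

sample = "branch,marks\nCivil,72\nECE,81\n"
print(csv_to_arff(sample))
```

The output starts with `@relation students`, declares each column, and lists the quoted rows after `@data`, which is the minimal shape Weka expects.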
The algorithm list is shown in Figure 6.2; after choosing, click the OK button.

Fig. 6.2: various clustering algorithms in Weka

VII. Partitioning methods
As the name suggests, in this method we divide a large set of objects into clusters (groups), each cluster containing at least one element. The method follows an iterative relocation process by which an object can be moved from one group to another, more relevant, group. This method is effective for small to medium-sized data sets. Examples of partitioning methods include k-means and k-medoids [2].

VII(I) K-means algorithm
K-means is a centroid-based technique. The algorithm takes an input parameter k and partitions a set of n objects into k clusters so that the resulting intra-cluster similarity is high but the inter-cluster similarity is low. K-means is based mainly on the distance between an object and the cluster mean, and it computes a new mean for each cluster after every reassignment. Here, categorical data have been converted into numeric data by assigning rank values, a statistical method [3].

Algorithm: the input is k, the number of clusters, and D, a data set containing n objects; the output is a set of k clusters. The algorithm follows these steps:
Step 1: randomly choose k objects from D as initial cluster centers.
Step 2: calculate the distance from each data point to each cluster center.
Step 3: if a data point is closest to its own cluster, leave it where it is; otherwise, move it into the closest cluster.
Step 4: repeat steps 2 and 3 until the most relevant cluster is found for each data point.
Step 5: update the cluster means by calculating the mean value of the objects in each cluster.
Step 6: stop (every data point is located in a properly positioned cluster).

Now applying k-means in the Weka tool, Table 7.1.1 shows the result of k-means.
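The six steps above can be sketched in pure Python (a toy teaching sketch, not Weka's SimpleKMeans implementation; the sample points are invented):

```python
import random

def kmeans(points, k, iterations=20, seed=1):
    # Step 1: randomly choose k objects as initial cluster centers.
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    for _ in range(iterations):
        # Steps 2-3: assign each point to the nearest center (squared Euclidean distance).
        clusters = [[] for _ in range(k)]
        for p in points:
            d = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            clusters[d.index(min(d))].append(p)
        # Step 5: update each center to the mean of its cluster.
        new_centers = []
        for c, cl in zip(centers, clusters):
            if cl:
                new_centers.append(tuple(sum(x) / len(cl) for x in zip(*cl)))
            else:
                new_centers.append(c)  # keep an empty cluster's center unchanged
        # Steps 4/6: stop when the centers no longer move.
        if new_centers == centers:
            break
        centers = new_centers
    return centers, clusters

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1),
          (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
centers, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))  # → [3, 3]
```

On these two well-separated blobs the relocation process converges in a couple of iterations to two balanced clusters, mirroring the iterative reassignment described in steps 2–5.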
Table 7.1.1: k-means clustering
Dataset           Instances  Attributes  Clustered instances           Time to build model  Squared error  Iterations
Civil             446        9           0: 247 (55%); 1: 199 (45%)    0.02 s               13.5           3
Computer and IT   452        9           0: 206 (46%); 1: 246 (54%)    0.02 s               15.6           5
E.C.E             539        9           0: 317 (59%); 1: 222 (41%)    0.27 s               16.03          5
Mechanical        760        9           0: 327 (43%); 1: 433 (57%)    0.06 s               22.7           3

FIG. 7.1: comparison between attributes for k-means

Hierarchical clustering (Section VIII) has two types:

Agglomerative (bottom-up): a bottom-up approach that starts from the individual sub-clusters, then merges the sub-clusters to build one big cluster at the top.

Divisive (top-down): works in the opposite direction to agglomerative; it starts from the top with one big cluster, which is decomposed into smaller clusters until the bottom is reached.

Figure 8.1: hierarchical clustering process [7]

Table 7.2.1 shows the result of hierarchical clustering.

VII(II) K-medoids algorithm
This is a variation of the k-means algorithm that is less sensitive to outliers [5]. Instead of the mean, an actual object is used to represent each cluster, with one representative object per cluster. Clusters are generated by points that are close to their respective representatives. The function used for classification is a measure of the dissimilarity between the points in a cluster and their representative [5]. The partitioning is done by minimizing the sum of the dissimilarities between each object and its cluster representative. This criterion is called the absolute-error criterion.
Sum of absolute error:

E = Σ (i = 1..N) Σ (p ∈ Ci) dist(p, oi)

where p represents an object in cluster Ci, oi is the i-th representative object, and N is the number of clusters. Two well-known types of k-medoids clustering [6] are PAM (Partitioning Around Medoids) and CLARA (Clustering LARge Applications).

VIII. Hierarchical clustering
This method provides a tree of relationships between the clusters. It starts with as many clusters as there are data points: if we have n data points, then we begin with n clusters.

Table 7.2.1: hierarchical clustering
Dataset           Instances  Attributes  Clustered instances           Time to build model
Civil             446        9           0: 445 (100%); 1: 1 (0%)      2.09 s
Computer and IT   452        9           0: 305 (67%); 1: 147 (33%)    4.02 s
E.C.E             539        9           0: 538 (100%); 1: 1 (0%)      3.53 s
Mechanical        760        9           0: 758 (100%); 1: 2 (0%)      13 s

FIG. 7.1: comparison between attributes for hierarchical clustering

IX. Grid-based methods
The grid-based clustering approach uses a multi-resolution grid data structure. It quantizes the object space into a finite number of cells that form a grid structure, on which all of the clustering operations are performed. We present two examples: STING and CLIQUE.

STING (Statistical Information Grid): STING is used mainly with numerical values. It is a grid-based multi-resolution clustering technique in which statistics of the numerical attributes are computed and stored in rectangular cells. The quality of the clustering produced by this method is directly related to the granularity of the bottom-most layer, approaching the result of DBSCAN as the granularity approaches zero [2].
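A hedged, much-simplified sketch of the grid-based idea (not STING itself: it merely counts points per grid cell, keeps the dense cells, and merges side-adjacent dense cells into clusters; the cell size and density threshold are invented for the toy data, and points in sparse cells are simply discarded):

```python
from collections import defaultdict

def grid_cluster(points, cell, density_threshold):
    """Quantize space into cells, keep dense cells, merge adjacent dense cells."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // cell) for x in p)].append(p)  # hash point into its cell
    dense = {c for c, members in cells.items() if len(members) >= density_threshold}
    clusters, seen = [], set()
    for c in dense:
        if c in seen:
            continue
        comp, stack = [], [c]       # flood-fill over side-adjacent dense cells
        seen.add(c)
        while stack:
            cur = stack.pop()
            comp.extend(cells[cur])
            for dim in range(len(cur)):
                for step in (-1, 1):
                    nb = cur[:dim] + (cur[dim] + step,) + cur[dim + 1:]
                    if nb in dense and nb not in seen:
                        seen.add(nb)
                        stack.append(nb)
        clusters.append(comp)       # points in sparse cells are treated as noise
    return clusters

pts = [(0.1, 0.1), (0.4, 0.2), (1.2, 0.3), (5.0, 5.0), (5.2, 5.1), (9.9, 9.9)]
out = grid_cluster(pts, cell=1.0, density_threshold=2)
print(sorted(len(c) for c in out))  # → [2, 2]
```

Because all work happens on the fixed grid rather than on pairwise point distances, the cost is governed by the number of cells, which is the main attraction of grid-based methods.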
CLIQUE (CLustering In QUEst): CLIQUE was the first algorithm proposed for dimension-growth subspace clustering in high-dimensional space. It is a subspace partitioning algorithm introduced in 1998.

X. Density-based methods
X.I. DBSCAN
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a density-based clustering algorithm. It uses the concepts of "density reachability" and "density connectivity", both of which depend on two input parameters: the size e of the epsilon neighborhood and the minimum number of points, defined in terms of the local distribution of nearest neighbors. The parameter e controls the size of the neighborhoods and hence the size of the clusters. The algorithm starts with an arbitrary point that has not been visited [4]. DBSCAN is an important clustering technique that is widely used in the scientific literature. Density is measured by the number of objects nearest to the cluster.

Table 10.1.1: DBSCAN clustering
Dataset           Instances  Attributes  Clustered instances  Time to build model
Civil             446        9           446                  4.63 s
Computer and IT   452        9           452                  6.13 s
E.C.E             539        9           539                  11.83 s
Mechanical        760        9           760                  23.95 s

X.II. OPTICS
OPTICS stands for Ordering Points To Identify the Clustering Structure. DBSCAN burdens the user with choosing the input parameters; moreover, different parts of the data could require different parameters [5]. OPTICS is an algorithm for finding density-based clusters in spatial data that addresses one of DBSCAN's major weaknesses: detecting meaningful clusters in data of varying density.

XI. Experimental results
Here we apply the various clustering methods to the student record data and compare them using the Weka tool. From these comparisons we find which method performs better. Fig. 11.1 shows the comparison according to the time taken to build a model, and Table 10.2.1 reports the OPTICS clustering results.
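Looking back at Section X.I, the density-reachability procedure behind DBSCAN can be sketched in pure Python (a minimal teaching sketch, not Weka's implementation; the eps and min_pts values and the sample points are invented):

```python
def dbscan(points, eps, min_pts):
    """Return a cluster id per point; -1 marks noise."""
    def neighbors(i):
        # all points within eps of point i (including i itself)
        return [j for j, q in enumerate(points)
                if sum((a - b) ** 2 for a, b in zip(points[i], q)) <= eps ** 2]
    labels = [None] * len(points)
    cid = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1            # not a core point: provisionally noise
            continue
        labels[i] = cid               # start a new cluster from this core point
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cid       # former "noise" reached as a border point
            if labels[j] is not None:
                continue
            labels[j] = cid
            nb = neighbors(j)
            if len(nb) >= min_pts:    # j is also a core point: keep expanding
                queue.extend(nb)
        cid += 1
    return labels

pts = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
       (5.0, 5.0), (5.1, 5.0), (9.0, 9.0)]
print(dbscan(pts, eps=0.5, min_pts=2))  # → [0, 0, 0, 1, 1, -1]
```

The isolated point at (9.0, 9.0) has no epsilon neighbors besides itself, so it stays labeled -1 (noise), which is exactly the behavior that distinguishes DBSCAN from partitioning methods.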
Table 10.2.1: OPTICS clustering
Dataset           Instances  Attributes  Clustered instances  Time to build model
Civil             446        9           446                  5.42 s
Computer and IT   452        9           452                  6.73 s
E.C.E             539        9           539                  9.81 s
Mechanical        760        9           760                  23.85 s

Fig. 11.1: comparison according to the time taken to build a model

According to this result, we can say that k-means provides better results than the other methods on build time. But on the basis of a single attribute we cannot use k-means every time; we may use any of the other methods when time is not important.

XII. Conclusion
Data mining now covers every field of our life; it is used mainly in banking, education, business, etc. In this paper, we have provided an overview, comparison, and classification of clustering algorithms, namely partitioning, hierarchical, density-based, and grid-based methods. Under partitioning methods, we applied k-means and its variant k-medoids in the Weka tool. Under hierarchical methods, we discussed the two approaches: top-down and bottom-up. We also applied the DBSCAN and OPTICS algorithms under the density-based methods. Finally, we described the STING and CLIQUE algorithms under the grid-based methods. The comparative results are shown in the tables above. Every technique is important in its own functional area, and we can improve the capability of data mining techniques by removing their limitations.

References
[1] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, and Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[2] Jiawei Han, Micheline Kamber, and Jian Pei, Data Mining: Concepts and Techniques, 3rd Edition, Morgan Kaufmann.
[3] Sovan Kumar Patnaik, Soumya Sahoo, and Dillip Kumar Swain, "Clustering of Categorical Data by Assigning Rank through Statistical Approach," International Journal of Computer Applications, 43.2: 1-3, 2012.
[4] Manish Verma, Mauly Srivastava, Neha Chack, Atul Kumar Diswar, and Nidhi Gupta, "A Comparative Study of Various Clustering Algorithms in Data Mining," International Journal of Engineering Research and Applications (IJERA), Vol. 2, Issue 3, pp. 1379-1384, 2012.
[5] Pavel Berkhin, "Survey of Clustering Data Mining Techniques," 2002.
[6] C. Y. Lin, M. Wu, J. A. Bloom, I. J. Cox, and M. Miller, "Rotation, scale, and translation resilient public watermarking for images," IEEE Trans. Image Processing, vol. 10, no. 5, pp. 767-782, May 2001.
[7] Pallavi and Sunila Godara, "A Comparative Performance Analysis of Clustering Algorithms," International Journal of Engineering Research and Applications (IJERA), ISSN: 2248-9622, Vol. 1, Issue 3, pp. 441-445.
[8] Bharat Chaudhari and Manan Parikh, "A Comparative Study of Clustering Algorithms Using Weka Tools," International Journal of Application or Innovation in Engineering & Management (IJAIEM).
[9] Meila, M. and Heckerman, D., "An Experimental Comparison of Several Clustering and Initialization Methods," Technical Report MSR-TR-98-06, Microsoft Research, Redmond, WA, February 1998.