International Journal of Computer Science Engineering & Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 1, Mar 2013, 165-174
© TJPRC Pvt. Ltd.

IMPLEMENTATION AND EVALUATION OF K-MEANS, KOHONEN-SOM, AND HAC DATA MINING ALGORITHMS BASED ON CLUSTERING

KAPIL SHARMA & RICHA DHIMAN
Computer Science and Engineering, Lovely Professional University, Jalandhar, Punjab, India

ABSTRACT

With the development of information technology and computer science, large volumes of data have appeared in our lives. Data mining technology has become significant because it helps people analyze these data and dig out useful information. Clustering is the most widely used data mining method; it can be used to describe and analyze data. In this research, we use clustering and classification methods to mine the data and extract valuable information. We use hybrid algorithms, i.e. a combination of three algorithms, which produces better results than the traditional algorithms, and we compare them by applying them to data sets. Comparing these three methods makes it possible to bring out the characteristics of the data and the potential rules they contain. This work presents new and improved results on large-scale datasets.

KEYWORDS: Clustering, K-Means, HAC, Data Mining, Kohonen-SOM

INTRODUCTION

A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural networks in that they use a neighborhood function to preserve the topological properties of the input space.

SOM is a clustering method. Indeed, it organizes the data into clusters (cells of the map) such that the instances in the same cell are similar and the instances in different cells are different.
From this point of view, SOM gives results comparable to those of a state-of-the-art clustering algorithm such as K-Means.

SOM can also be viewed as a visualization technique. It allows us to visualize the original dataset in a low-dimensional (2D) representation space. Indeed, the individuals located in adjacent cells are more similar than the individuals located in distant cells. From this point of view, it is comparable to visualization techniques such as multidimensional scaling or PCA (Principal Component Analysis).

In this work, I show how to implement Kohonen's SOM algorithm with a particular tool. I try to assess the properties of this approach by comparing the results with those of the PCA algorithm. Then I compare the results to those of K-Means, which is a clustering algorithm. Finally, I implement the two-step clustering process by combining the SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the usual two-step clustering, in which K-Means and HAC are combined.

DATASET

I analyze the WAVEFORM dataset (Breiman et al., 1984). This is an artificial dataset with 21 descriptors and 5000 instances. In this research, I do not use the CLASS attribute, which classifies the instances into 3 predefined classes.

Kohonen's SOM Approach with Tanagra

The easiest way to import an XLS file is to open the data file in an Excel spreadsheet. Then, using the add-in TANAGRA.XLA3, I can send the dataset to Tanagra, which is launched automatically. I can check the range of selected cells in the worksheet.

Figure 1

Tanagra is launched, a new diagram is created and the dataset is loaded. I have 5000 instances and 21 attributes.

Descriptive Statistics and Outlier Detection

As a first step, I check the integrity of the dataset by computing some descriptive statistics. I insert the DEFINE STATUS component into the diagram using the shortcut in the toolbar. Then I set all the variables as INPUT.
Figure 2

I add the UNIVARIATE CONTINUOUS STAT component (STATISTICS tab). I click on the VIEW menu and obtain the following report.

Figure 3

I note that there is no constant variable in the dataset, i.e. no variable has a standard deviation of 0. I also note that all the variables seem to be defined on the same scale.

The KOHONEN-SOM Component

I now want to launch the analysis. I create a grid with 2 rows and 3 columns, i.e. a partition of the instances into 6 groups (2 x 3 = 6 clusters). I add the KOHONEN-SOM component (CLUSTERING tab) into the diagram. I click on the PARAMETERS menu and set the following settings.

Figure 4

The number of rows of the map is 2 (Row Size) and the number of columns is 3 (Col Size). I standardize the data, i.e. I divide each variable by its standard deviation. This is recommended when the variables are not on the same scale. It is not necessary if they are on the same scale or when I want to take the differences in scale into account explicitly. I do not modify the other settings. I validate and click on the VIEW menu. I obtain the following report.

Figure 5

The KOHONEN-SOM component adds a new column to the current dataset. It states the group membership of each instance. This new attribute is available in the subsequent part of the diagram. I can visualize the current dataset with the VIEW DATASET component. I see, for instance, that the first example belongs to the cell (1, 3), i.e. first row and third column.

Note: I can classify an additional instance within the same framework, i.e. an example which was not involved in the learning process. This deployment phase is one of the most important steps of the data mining process.

Individuals who are in adjacent cells are also close in the original representation space. This is one of the main interests of this method. Let us check this assertion on the WAVEFORM dataset.
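The map training described in this section can be sketched outside Tanagra as follows. This is only a minimal illustration under stated assumptions, not the Tanagra implementation: the online update rule, the Gaussian neighborhood, the linearly decaying learning rate and radius, and the names `train_som` and `assign_cells` are all hypothetical choices, and random data stands in for the WAVEFORM file. Only the 2 x 3 grid, the division by the standard deviation and the (row, column) cell labels come from the text.

```python
import numpy as np

def train_som(X, n_rows=2, n_cols=3, n_iter=2000, lr0=0.5, sigma0=1.0, seed=0):
    """Train a small SOM with online updates; returns a codebook of shape (n_rows*n_cols, d)."""
    rng = np.random.default_rng(seed)
    # Grid coordinates of each cell, in row-major order: (0,0), (0,1), ..., (1,2)
    grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], dtype=float)
    # Assumption: the codebook is initialised with randomly chosen instances
    weights = X[rng.choice(len(X), size=len(grid), replace=False)].astype(float)
    for t in range(n_iter):
        frac = t / n_iter
        lr = lr0 * (1.0 - frac)                  # learning rate decays linearly
        sigma = sigma0 * (1.0 - frac) + 1e-3     # neighborhood radius shrinks, stays positive
        x = X[rng.integers(len(X))]              # pick one instance at random
        bmu = np.argmin(((weights - x) ** 2).sum(axis=1))   # best matching unit
        grid_d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)     # squared distance on the map
        h = np.exp(-grid_d2 / (2.0 * sigma ** 2))           # Gaussian neighborhood function
        weights += lr * h[:, None] * (x - weights)          # pull the BMU and its neighbors toward x
    return weights

def assign_cells(X, weights, n_cols=3):
    """Return the 1-based (row, column) cell of the nearest codebook vector for each instance."""
    d2 = ((X[:, None, :] - weights[None, :, :]) ** 2).sum(axis=2)
    bmu = d2.argmin(axis=1)
    return np.column_stack((bmu // n_cols + 1, bmu % n_cols + 1))

# Stand-in data with the same shape of problem (21 descriptors); not the WAVEFORM dataset
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 21))
X_std = X / X.std(axis=0)        # standardize: divide each variable by its standard deviation
codebook = train_som(X_std)
cells = assign_cells(X_std, codebook)
print(cells[0])                  # cell (row, column) of the first instance
```

Because `assign_cells` only needs the trained codebook, the same call can label an instance that was not used for training, which corresponds to the deployment step mentioned in the note above.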
I cannot visualize the dataset in the original space, so I use a PCA to obtain a 2D representation. I try to visualize the relative positions of the groups (clusters) in the scatter plot. I add the PRINCIPAL COMPONENT ANALYSIS component (FACTORIAL ANALYSIS tab) after the KOHONEN-SOM 1 component. I click on the VIEW menu.

Figure 6

The first two factors account for 53.17% of the total variability. This seems low, but on this dataset it is enough to represent the instances properly. I add the SCATTERPLOT component (DATA VISUALIZATION tab) into the diagram. I set the first factor as the horizontal axis and the second one as the vertical axis.

Figure 7

In a crucial step of this research, I color the points according to the cluster membership supplied by the SOM algorithm (CLUSTER_SOM_1).

Figure 8

I note the correspondence between the proximities in the SOM map and the proximities in the first two factors of the PCA. It also means that the instances in adjacent cells are close in the original representation space (with 21 attributes).

Figure 9

A Comparison with K-Means

Clustering Process with K-Means

K-Means is a state-of-the-art approach for clustering. I add the component into the diagram and ask for 6 clusters. There is no constraint on the relative position of the clusters here.

Figure 10

I click on the VIEW menu in order to launch the calculations.

Figure 11

The share of the total sum of squares explained by the partitioning is 46.31%. It is comparable to the one obtained with SOM (45.16%). But I recall that the K-Means algorithm places no constraint on the relative position of the clusters.

Agreement between the Clusters

If the performances of these approaches seem similar, are the clusters comparable?
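For reference, the share of the total sum of squares explained by a partition, the criterion used above to compare K-Means (46.31%) and SOM (45.16%), is the between-cluster sum of squares divided by the total sum of squares. The sketch below shows that computation on a four-point toy example; the function name `explained_ratio` and the toy data are hypothetical, and it does not reproduce the figures reported by Tanagra.

```python
import numpy as np

def explained_ratio(X, labels):
    """Between-cluster sum of squares divided by the total sum of squares."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum()
    between_ss = 0.0
    for k in np.unique(labels):
        members = X[labels == k]
        # Each cluster contributes its size times the squared distance
        # between its centroid and the grand mean
        between_ss += len(members) * ((members.mean(axis=0) - grand_mean) ** 2).sum()
    return between_ss / total_ss

# Toy example: two well-separated groups of two points each
X_toy = np.array([[0.0, 0.0], [0.0, 2.0], [10.0, 0.0], [10.0, 2.0]])
labels_toy = np.array([0, 0, 1, 1])
ratio = explained_ratio(X_toy, labels_toy)
print(round(ratio, 4))  # 100 / 104 -> 0.9615
```

The same function accepts any integer labelling, so it can be applied both to the K-Means cluster IDs and to SOM cells encoded as a single index.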
To check the correspondence, I create a cross-tabulation between the cluster memberships supplied by the two approaches. I insert a new DEFINE STATUS component into the diagram. I set CLUSTER_SOM_1, the cluster membership column supplied by the SOM algorithm, as TARGET; I set CLUSTER_KMEANS_1, supplied by the K-MEANS algorithm, as INPUT.

Figure 12

In the table below, I show the correspondence between the clusters.

Table 1

Two-Step Clustering

Two-step clustering creates pre-clusters and then clusters the pre-clusters using hierarchical methods (HAC). Two-step clustering handles very large datasets [6]. K-Means is usually used in the first phase, where the pre-clusters are created. In this work, instead of K-Means, we use the SOM results for this first phase. This variant has a very interesting property: adjacent pre-clusters correspond to nearby areas in the original representation space. This strengthens the interpretation of the dendrogram created with the subsequent HAC algorithm.

We add the DEFINE STATUS component into the diagram. We set CLUSTER_SOM_1 as TARGET and the descriptive variables (V1…V21) as INPUT.

Figure 13

I add the HAC component (CLUSTERING tab).

Figure 14

The component automatically detects 3 groups. This choice relies on the gap in height between successive merges; it has no theoretical justification.

The DENDROGRAM tab of the visualization window is very important. By clicking on each node of the tree, we obtain the IDs of the pre-clusters supplied by the SOM algorithm.

Figure 15

CONCLUSIONS

The white nodes of the tree indicate the groups computed with the HAC algorithm. If we select the white node on the right, we obtain the IDs of the SOM pre-clusters, i.e. the individuals in this group come from the pre-clusters (1 ; 1), (1 ; 2) and (2 ; 1).
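The second phase of the two-step procedure, HAC applied to the pre-clusters followed by a lookup of which pre-clusters fall under each group, can be sketched as below. This is an illustration under assumptions: Ward's criterion is assumed as the aggregation rule (the text does not state which linkage Tanagra uses), the function name `ward_merge_centroids` is hypothetical, and the six centroids are toy values rather than the SOM codebook obtained on WAVEFORM.

```python
import numpy as np

def ward_merge_centroids(centroids, sizes, n_groups):
    """Agglomerate weighted pre-cluster centroids with Ward's criterion.

    Returns a list of groups, each a list of pre-cluster indices.
    """
    groups = [[i] for i in range(len(centroids))]
    cents = [np.asarray(c, dtype=float) for c in centroids]
    ns = [float(n) for n in sizes]
    while len(groups) > n_groups:
        best = None
        for a in range(len(groups)):
            for b in range(a + 1, len(groups)):
                # Increase in within-group sum of squares if a and b are merged
                cost = ns[a] * ns[b] / (ns[a] + ns[b]) * ((cents[a] - cents[b]) ** 2).sum()
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        _, a, b = best
        # The merged centroid is the size-weighted mean of the two centroids
        cents[a] = (ns[a] * cents[a] + ns[b] * cents[b]) / (ns[a] + ns[b])
        ns[a] += ns[b]
        groups[a] += groups[b]
        del groups[b], cents[b], ns[b]
    return groups

# Toy pre-clusters: 6 centroids (one per cell of a 2 x 3 map, row-major order), 100 instances each
centroids = [[0.0, 0.0], [0.5, 0.0], [5.0, 0.0], [5.5, 0.0], [10.0, 0.0], [10.4, 0.0]]
sizes = [100] * 6
groups = ward_merge_centroids(centroids, sizes, n_groups=3)

# Translate pre-cluster indices into 1-based (row ; column) cell IDs of the map
n_cols = 3
for g in groups:
    print([(i // n_cols + 1, i % n_cols + 1) for i in sorted(g)])
```

Working on the six weighted centroids instead of the 5000 instances is what makes the second phase cheap: the pairwise search runs over pre-clusters only.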
In the table, we see the correspondence between the SOM pre-cluster IDs and the HAC cluster IDs.

Table 2

The results are strikingly consistent with the theoretical considerations underlying the SOM approach: the HAC merges, above all, the adjacent cells of the Kohonen map.

REFERENCES

1. Rakesh Agrawal, Ramakrishnan Srikant, Fast Algorithms for Mining Association Rules in Large Databases, Proceedings of the 20th International Conference on Very Large Data Bases, pp. 487-499, September 12-15, 1994.
2. A. Baskurt, F. Blum, M. Daoudi, J.L. Dugelay, F. Dupont, A. Dutartre, T. Filali Ansary, F. Fratani, E. Garcia, G. Lavoué, D. Lichau, F. Preteux, J. Ricard, B. Savage, J.P. Vandeborre, T. Zaharia, SEMANTIC-3D : COMPRESSION, INDEXATION ET TATOUAGE DE DONNÉES 3D, Réseau National de Recherche en Télécommunications (RNRT), 2002.
3. T. Zaharia, F. Prêteux, Descripteurs de forme : Etude comparée des approches 3D et 2D/3D (3D versus 2D/3D Shape Descriptors: A Comparative Study).
4. T.F. Ansary, J.P. Vandeborre, M. Daoudi, Recherche de modèles 3D de pièces mécaniques basée sur les moments de Zernike.
5. A. Khotanzad, Y.H. Hong, Invariant image recognition by Zernike moments, IEEE Trans. Pattern Anal. Mach. Intell., 12(5), 489-497, 1990.
6. R. Agrawal, T. Imielinski, A. Swami, Mining association rules between sets of items in large databases, Proceedings of the ACM SIGMOD Conference on Management of Data, Washington DC, USA, 1993.
7. G. Hébrail, Y. Lechevallier, Data mining et analyse des données, in: G. Govaert, Analyse des données, Ed. Lavoisier, Paris, pp. 323-355, 2003.
8. T.F. Ansary, J.P. Vandeborre, M. Daoudi, Une approche bayésienne pour l'indexation de modèles 3D basée sur les vues caractéristiques.
9. T.F. Ansary, M. Daoudi, J.-P. Vandeborre, A Bayesian 3-D Search Engine Using Adaptive Views Clustering, IEEE Transactions on Multimedia, 2007.
10. T.F. Ansary, J.-P. Vandeborre, S. Mahmoudi, M. Daoudi, A Bayesian framework for 3D models retrieval based on characteristic views, Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 2004), 6-9 Sept. 2004.
11. R. Agrawal, R. Srikant, Fast algorithms for mining association rules in large databases, Proceedings of the 20th International Conference on Very Large Data Bases (VLDB'94), pp. 487-499, Morgan Kaufmann, September 1994.
12. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, From Data Mining to Knowledge Discovery in Databases, American Association for Artificial Intelligence, 0738-4602, 1996.
13. S. Lallich, O. Teytaud, Évaluation et validation de l'intérêt des règles d'association.
14. R. Osada, T. Funkhouser, B. Chazelle, D. Dobkin, Matching 3D Models with Shape Distributions, Proceedings of the International Conference on Shape Modeling & Applications (SMI '01), pp. 154-168, IEEE Computer Society, Washington, DC, USA, 2001.
15. W.Y. Kim, Y.S. Kim, A region-based shape descriptor using Zernike moments, Signal Processing: Image Communication, 16:95-100, 2000.
16. A. Khotanzad, Y.H. Hong, Invariant image recognition by Zernike moments, IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), May 1990.
17. N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, Database Theory ICDT'99, Springer, 1999.
18. M. El Far, L. Moumoun, T. Gadi, R. Benslimane, A new technique for the extraction of characteristic views for 2D/3D indexation, International Journal of Engineering Science and Technology (IJEST), accepted session July 2010.