International Journal of Computer Science Engineering & Information Technology Research (IJCSEITR)
ISSN 2249-6831
Vol. 3, Issue 1, Mar 2013, 165-174
© TJPRC Pvt. Ltd.
IMPLEMENTATION AND EVALUATION OF K-MEANS, KOHONEN-SOM, AND HAC
DATA MINING ALGORITHMS BASED ON CLUSTERING
KAPIL SHARMA & RICHA DHIMAN
Computer Science and Engineering, Lovely Professional University, Jalandhar, Punjab, India
ABSTRACT
With the development of information technology and computer science, high-volume data have entered our lives. In
order to help people analyze and dig out useful information, data mining technology has become significant. Clustering is
the most widely used data mining method; it can be used for describing and analyzing data. In this research, we use
clustering and classification methods to mine the data and extract valuable information using hybrid algorithms, i.e., a
combination of three algorithms, which produces better results than the traditional algorithms; we compare them by
applying them to data sets. After comparing these three methods effectively, we can reflect the characteristics of the data
and its potential rules more clearly. This work presents new and improved results on large-scale datasets.
KEYWORDS: Clustering, K-Means, HAC, Data Mining, Kohonen-SOM
INTRODUCTION
A self-organizing map (SOM) or self-organizing feature map (SOFM) is a kind of artificial neural network that is
trained using unsupervised learning to produce a low-dimensional (typically two-dimensional), discretized representation
of the input space of the training samples, called a map. Self-organizing maps differ from other artificial neural
networks in that they use a neighborhood function to preserve the topological properties of the input space.
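For reference, the weight update at the heart of SOM training can be written (in standard Kohonen notation, which we supply here since it is not stated above) as

    w_j(t+1) = w_j(t) + alpha(t) * h_{c(x),j}(t) * (x(t) - w_j(t))
    h_{c,j}(t) = exp(-d(c,j)^2 / (2 * sigma(t)^2))

where c(x) is the best-matching cell for the input x(t), d(c,j) is the grid distance between cells c and j, and alpha(t) and sigma(t) are a learning rate and a neighborhood radius that decrease over time. It is the shared neighborhood factor h that keeps adjacent cells similar and thus preserves the topology.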
SOM is a clustering method. Indeed, it organizes the data in clusters (the cells of the map) such that instances in the
same cell are similar and instances in different cells are different. From this point of view, SOM gives results comparable
to state-of-the-art clustering algorithms such as K-Means.
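To make this clustering interpretation concrete, here is a minimal from-scratch Python (NumPy) sketch of SOM training; the grid size, decay schedules, and iteration count are our own illustrative choices, not Tanagra's internals:

    import numpy as np

    def train_som(X, n_rows=2, n_cols=3, n_iter=5000,
                  alpha0=0.5, sigma0=1.0, seed=0):
        """Return the (n_rows * n_cols, n_features) codebook of a trained map."""
        rng = np.random.default_rng(seed)
        n_cells = n_rows * n_cols
        # Grid coordinates of each cell, used by the neighborhood function.
        grid = np.array([(r, c) for r in range(n_rows) for c in range(n_cols)], dtype=float)
        # Initialize codebook vectors from randomly drawn training instances.
        W = X[rng.integers(0, len(X), n_cells)].astype(float)
        for t in range(n_iter):
            x = X[rng.integers(0, len(X))]
            # Best-matching unit: the cell whose codebook vector is closest to x.
            bmu = int(np.argmin(((W - x) ** 2).sum(axis=1)))
            # Learning rate and neighborhood radius shrink linearly over time.
            frac = 1.0 - t / n_iter
            alpha, sigma = alpha0 * frac, sigma0 * frac + 1e-2
            # Gaussian neighborhood on the grid, centered on the BMU.
            d2 = ((grid - grid[bmu]) ** 2).sum(axis=1)
            h = np.exp(-d2 / (2.0 * sigma ** 2))
            # Kohonen update: pull every cell toward x, weighted by h.
            W += alpha * h[:, None] * (x - W)
        return W

    def assign_cells(X, W, n_cols=3):
        """Map each instance to the (row, col) of its best-matching cell."""
        bmu = ((X[:, None, :] - W[None, :, :]) ** 2).sum(axis=2).argmin(axis=1)
        return np.stack([bmu // n_cols, bmu % n_cols], axis=1)

Instances mapped to the same cell play the role of a cluster; adjacent cells end up holding similar instances because the neighborhood weight h updates them together.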
SOM can also be viewed as a visualization technique: it allows us to visualize the original dataset in a
low-dimensional (2D) representation space. Indeed, individuals located in adjacent cells are more similar than individuals
located in distant cells. From this point of view, it is comparable to visualization techniques such as multidimensional
scaling or PCA (Principal Component Analysis).
In this work, we show how to implement Kohonen's SOM algorithm with a particular tool, Tanagra. We try to assess the
properties of this approach by comparing the results with those of the PCA algorithm. Then, we compare the results with
those of K-Means, a clustering algorithm. Finally, we implement the two-step clustering process by combining the
SOM algorithm with the HAC process (Hierarchical Agglomerative Clustering). It is a variant of the two-step clustering
process, which usually combines K-Means and HAC.
DATASET
We analyze the WAVEFORM dataset (Breiman et al., 1984). It is an artificial dataset with 21 descriptors and
5,000 instances. We do not use the CLASS attribute, which classifies the instances into 3 pre-defined classes, in this
research.
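The WAVEFORM generator is simple enough to reproduce. The sketch below follows the usual description of Breiman et al. (1984): each instance is a random convex combination of two of three triangular base waves plus unit Gaussian noise. Treat the exact peak positions (7, 15 and 11) and the class-to-pair assignment as our reading of the dataset documentation, not something stated in this paper:

    import numpy as np

    def make_waveform(n=5000, seed=0):
        """Sketch of the Breiman et al. (1984) waveform generator:
        21 attributes, each instance a random convex combination of two
        of three triangular base waves, plus unit Gaussian noise."""
        rng = np.random.default_rng(seed)
        i = np.arange(1, 22)                    # attribute positions 1..21
        h = [np.maximum(6 - np.abs(i - p), 0)   # triangular base waves
             for p in (7, 15, 11)]              # assumed peak positions
        pairs = [(0, 1), (0, 2), (1, 2)]        # wave pair used by each class
        y = rng.integers(0, 3, n)
        u = rng.random((n, 1))
        a = np.array([h[pairs[c][0]] for c in y])
        b = np.array([h[pairs[c][1]] for c in y])
        X = u * a + (1 - u) * b + rng.normal(size=(n, 21))
        return X, y

    X, y = make_waveform()   # y (the CLASS attribute) is ignored in this research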
Kohonen’s SOM Approach with Tanagra
The easiest way to import an XLS file is to open the data file in an Excel spreadsheet. Then, using the add-in
TANAGRA.XLA [3], we can send the dataset to Tanagra, which is launched automatically. We can check the range of cells
selected in the worksheet.
Figure 1
Tanagra is launched, a new diagram is created, and the dataset is loaded. We have 5,000 instances and 21 attributes.
Descriptive Statistics and Outlier Detection
As a first step, we check the integrity of the dataset by computing some descriptive statistics. We insert the
DEFINE STATUS component into the diagram using the shortcut in the toolbar. Then we set all the variables as INPUT.
Figure 2
We add the UNIVARIATE CONTINUOUS STAT component (STATISTICS tab) and click on the VIEW menu.
We obtain the following report.
Figure 3
We note that there is no constant attribute in our dataset, i.e., no variable with standard deviation = 0. We note also
that all the variables seem to be defined on the same scale.
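Outside Tanagra, the same integrity check can be sketched with pandas (reusing the data matrix X from the generator sketch above):

    import pandas as pd

    df = pd.DataFrame(X, columns=[f"V{k}" for k in range(1, 22)])
    stats = df.describe().T        # count, mean, std, min, quartiles, max per variable
    print(stats[["mean", "std", "min", "max"]])
    # No constant variable: every standard deviation must be strictly positive.
    assert (stats["std"] > 0).all()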
The KOHONEN-SOM Component
We now want to launch the analysis. We create a grid with 2 rows and 3 columns, i.e., a classification of the instances
into 6 groups (2 x 3 = 6 clusters). We add the KOHONEN-SOM component (CLUSTERING tab) into the diagram, click on
the PARAMETERS menu, and set the following settings.
Figure 4
The number of rows of the map is 2 (Row Size) and the number of columns is 3 (Col Size). We standardize the data, i.e.,
we divide each variable by its standard deviation. This is recommended when the variables are not on the same scale. It is not
necessary if they are defined on the same scale, or when we want to take the differences in scale into account explicitly.
We do not modify the other settings. We validate and click on the VIEW menu. We obtain the following report.
Figure 5
The KOHONEN-SOM component adds a new column to the current dataset. It states the group membership of
each instance. This new attribute is available in the subsequent parts of the diagram. We can visualize the current dataset
with the VIEW DATASET component. We see, for instance, that the first example belongs to cell (1, 3), i.e., first row and
third column.
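The same step can be sketched outside Tanagra with the third-party minisom package (our choice for illustration; the parameters below mirror the 2 x 3 grid and the divide-by-standard-deviation scaling, not Tanagra's exact defaults):

    import numpy as np
    from minisom import MiniSom      # pip install minisom

    X_std = X / X.std(axis=0)        # divide each variable by its standard deviation

    som = MiniSom(2, 3, 21, sigma=1.0, learning_rate=0.5, random_seed=0)
    som.train_random(X_std, 5000)

    # New membership "column": the (row, col) of the winning cell per instance.
    cluster_som = np.array([som.winner(x) for x in X_std])
    print(tuple(cluster_som[0] + 1))   # e.g. (1, 3): first row, third column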
Note: We can classify an additional instance with the same framework, i.e., an example which is not involved in the
learning process. This deployment phase is one of the most important steps of the data mining process.
Individuals who are in adjacent cells are also close in the original representation space. This is one of the main
advantages of this method. Let us check this assertion on the WAVEFORM dataset.
We cannot visualize the dataset in the original space, so we use a PCA in order to obtain a 2D representation. We try to
visualize the relative positions of the groups (clusters) in the scatter plot. We add the PRINCIPAL COMPONENT ANALYSIS
component (FACTORIAL ANALYSIS tab) after the KOHONEN-SOM 1 component and click on the VIEW menu.
Figure 6
The representation space of the first two factors accounts for 53.17% of the total variability. This seems low, but on
this dataset it is enough to represent the instances properly. We add the SCATTERPLOT component (DATA
VISUALIZATION tab) into the diagram and set the first factor as the horizontal axis and the second one as the vertical axis.
Figure 7
In a crucial step of this research, we colorize the points with the cluster membership supplied by the SOM algorithm
(CLUSTER_SOM_1).
Figure 8
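A sketch of the same visualization with scikit-learn and matplotlib, reusing X_std and cluster_som from the previous sketch:

    import matplotlib.pyplot as plt
    from sklearn.decomposition import PCA

    pca = PCA(n_components=2)
    F = pca.fit_transform(X_std)                   # first two factors
    print(pca.explained_variance_ratio_.sum())     # share of the total variability

    # One color per SOM cell: encode (row, col) as a single index.
    cell_id = cluster_som[:, 0] * 3 + cluster_som[:, 1]
    plt.scatter(F[:, 0], F[:, 1], c=cell_id, s=5, cmap="tab10")
    plt.xlabel("Factor 1"); plt.ylabel("Factor 2")
    plt.show()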
We note the correspondence between the proximities in the SOM map and the proximities on the first two factors of the
PCA. This also means that instances in adjacent cells are close in the original representation space (with 21 attributes).
Figure 9
A Comparison with the K-Means Clustering Process
K-Means is a state-of-the-art approach for clustering. We add the component into the diagram and ask for
6 clusters. There is no constraint on the relative positions of the clusters here.
Figure 10
We click on the VIEW menu in order to launch the calculations.
Figure 11
The share of the total sum of squares explained by the partitioning is 46.31%. It is quite comparable to the
one obtained with SOM (45.16%). But remember that there is no constraint on the relative positions of the clusters for
the K-Means algorithm.
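With scikit-learn, the same experiment and the explained share of the total sum of squares can be sketched as follows (same standardized data as above; the exact percentage will differ from Tanagra's run):

    from sklearn.cluster import KMeans

    km = KMeans(n_clusters=6, n_init=10, random_state=0).fit(X_std)

    # Share of the total sum of squares explained by the partition:
    # 1 - within-cluster SS / total SS (Tanagra reports 46.31%).
    total_ss = ((X_std - X_std.mean(axis=0)) ** 2).sum()
    explained = 1.0 - km.inertia_ / total_ss
    print(f"explained: {explained:.2%}")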
Agreement between the Clusters
If the performances of these approaches seem similar, are the clusters comparable?
To check the correspondence, we create a cross-tabulation between the cluster membership columns supplied by the two
approaches. We insert a new DEFINE STATUS component into the diagram. We set CLUSTER_SOM_1, the cluster
membership column supplied by the SOM algorithm, as TARGET; we set CLUSTER_KMEANS_1, supplied by the
K-MEANS algorithm, as INPUT.
Figure 12
In the table below, we show the correspondence between the clusters.
Table 1
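Such a cross-tabulation can be sketched with pandas, reusing the labels produced above:

    import pandas as pd

    som_cell = [f"({r + 1},{c + 1})" for r, c in cluster_som]   # SOM cell per instance
    print(pd.crosstab(pd.Series(som_cell, name="SOM"),
                      pd.Series(km.labels_, name="K-Means")))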
Two-Step Clustering
Two-step clustering first creates pre-clusters, and then it clusters the pre-clusters using a hierarchical method (HAC).
Two-step clustering handles very large datasets [6].
K-Means is usually used in the first phase, where the pre-clusters are created. In this research, instead of
K-Means, we use the SOM results for this first phase.
This variant has a very interesting property: adjacent pre-clusters correspond to nearby areas in the
original representation space. This strengthens the interpretation of the dendrogram created by the subsequent HAC
algorithm.
We add the DEFINE STATUS component into the diagram. We set CLUSTER_SOM_1 as TARGET and the
descriptive variables (V1…V21) as INPUT.
Figure 13
We add the HAC component (CLUSTERING tab).
Figure 14
The component automatically detects 3 groups. This choice relies on the height gaps between successive mergings; there is
no theoretical justification here.
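A sketch of this two-step variant with SciPy, clustering the 6 SOM codebook vectors and cutting the tree into 3 groups (Ward linkage is our assumption; Tanagra's HAC may use a different criterion and may weight pre-clusters by their size):

    from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

    # HAC on the SOM pre-cluster centers (codebook vectors).
    codebook = som.get_weights().reshape(6, 21)
    Z = linkage(codebook, method="ward")

    # Cut the dendrogram into 3 groups, as the component does automatically.
    group_of_cell = fcluster(Z, t=3, criterion="maxclust")

    # Final cluster of each instance = group of its SOM pre-cluster.
    cell_id = cluster_som[:, 0] * 3 + cluster_som[:, 1]
    final_cluster = group_of_cell[cell_id]
    dendrogram(Z)        # shows which pre-clusters merge first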
The DENDROGRAM tab of the visualization window is very important. By clicking on each node of the tree, we
obtain the IDs of the pre-clusters supplied by the SOM algorithm.
Figure 15
CONCLUSIONS
The white nodes of the tree state the groups computed with the HAC algorithm. If we select the white node at the
right, we obtain the SOM pre-cluster IDs, i.e., the individuals in this group come from the pre-clusters
(1, 1), (1, 2) and (2, 1).
In the table, we see the correspondence between the SOM pre-cluster IDs and the HAC cluster IDs.
Table 2
The results are strikingly consistent with the theoretical considerations underlying the SOM approach: the HAC
above all merges the adjacent cells of the Kohonen map.
REFERENCES
1. Rakesh Agrawal, Ramakrishnan Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pp. 487-499, September 12-15, 1994.
2. A. Baskurt, F. Blum, M. Daoudi, J.L. Dugelay, F. Dupont, A. Dutartre, T. Filali Ansary, F. Fratani, E. Garcia, G. Lavoué, D. Lichau, F. Preteux, J. Ricard, B. Savage, J.P. Vandeborre, T. Zaharia, "SEMANTIC-3D: compression, indexation et tatouage de données 3D," Réseau National de Recherche en Télécommunications (RNRT), 2002.
3. T. Zaharia, F. Prêteux, "Descripteurs de forme : étude comparée des approches 3D et 2D/3D" (3D versus 2D/3D Shape Descriptors: A Comparative Study).
4. T.F. Ansary, J.P. Vandeborre, M. Daoudi, "Recherche de modèles 3D de pièces mécaniques basée sur les moments de Zernike."
5. A. Khotanzad, Y.H. Hong, "Invariant Image Recognition by Zernike Moments," IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), pp. 489-497, 1990.
6. R. Agrawal, T. Imielinski, A. Swami, "Mining Association Rules between Sets of Items in Large Databases," in Proceedings of the ACM SIGMOD Conference on Management of Data, Washington, DC, USA, 1993.
7. G. Hébrail, Y. Lechevallier, "Data mining et analyse des données," in G. Govaert (ed.), Analyse des données, Lavoisier, Paris, pp. 323-355, 2003.
8. T.F. Ansary, J.P. Vandeborre, M. Daoudi, "Une approche bayésienne pour l'indexation de modèles 3D basée sur les vues caractéristiques."
9. T.F. Ansary, M. Daoudi, J.-P. Vandeborre, "A Bayesian 3-D Search Engine Using Adaptive Views Clustering," IEEE Transactions on Multimedia, 2007.
10. T.F. Ansary, J.-P. Vandeborre, S. Mahmoudi, M. Daoudi, "A Bayesian Framework for 3D Models Retrieval Based on Characteristic Views," Proceedings of the 2nd International Symposium on 3D Data Processing, Visualization and Transmission (3DPVT 2004), September 6-9, 2004.
11. R. Agrawal, R. Srikant, "Fast Algorithms for Mining Association Rules in Large Databases," in Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), pp. 487-499, Morgan Kaufmann, September 1994.
12. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, "From Data Mining to Knowledge Discovery in Databases," American Association for Artificial Intelligence, ISSN 0738-4602, 1996.
13. S. Lallich, O. Teytaud, "Évaluation et validation de l'intérêt des règles d'association."
14. R. Osada, T. Funkhouser, B. Chazelle, D. Dobkin, "Matching 3D Models with Shape Distributions," in Proceedings of the International Conference on Shape Modeling & Applications (SMI '01), pp. 154-168, IEEE Computer Society, Washington, DC, USA, 2001.
15. W.Y. Kim, Y.S. Kim, "A Region-Based Shape Descriptor Using Zernike Moments," Signal Processing: Image Communication, 16, pp. 95-100, 2000.
16. A. Khotanzad, Y.H. Hong, "Invariant Image Recognition by Zernike Moments," IEEE Transactions on Pattern Analysis and Machine Intelligence, 12(5), May 1990.
17. N. Pasquier, Y. Bastide, R. Taouil, L. Lakhal, in Database Theory (ICDT '99), Springer, 1999.
18. M. El Far, L. Moumoun, T. Gadi, R. Benslimane, "A New Technique for the Extraction of Characteristic Views for 2D/3D Indexation," International Journal of Engineering Science and Technology (IJEST), accepted July 2010.