Communications in Information Science and Management Engineering (CISME)

A New K-means Initial Cluster Class Center Selection Algorithm Based on the Great Group of a Graph

Zhou Haiyan1, Zhao Jianyang2
Faculty of Computer Engineering, Huaiyin Institute of Technology, Huai'an, China
1 [email protected]; 2 [email protected]

Abstract- The K-means algorithm, which is simple and fast and has an intuitive geometric meaning, has been widely used in pattern recognition, image processing and computer vision, and has achieved satisfactory results. However, K-means requires the initial cluster class centers to be determined before execution, and the selected initial cluster class centers have a direct impact on the final clustering result. A K-means initial cluster class center selection algorithm based on the great group is presented. Compared with other initial cluster class center selection algorithms, the method significantly improves the clustering effect.

The paper [5] proposed a method that uses a histogram according to the characteristics of the data distribution; it improves the clustering effect over [4], but the result is still not satisfactory. This paper therefore proposes an initial cluster class center selection algorithm for K-means clustering based on the great groups (maximal cliques) of a graph. First, the initial cluster class centers are determined by the proposed selection algorithm; then the clustering result is obtained by running the K-means clustering algorithm from those centers. Tests on data sets from the UCI machine learning repository verify that, compared with the classic K-means method with randomly selected initial cluster class centers, the clustering results of the proposed great-group-based algorithm are significantly improved.
Keywords- Data Mining; Data Clustering; Initial Cluster Center; Great Group

I. INTRODUCTION
Cluster analysis studies how to use mathematical methods to classify a group of samples (objects, indices, properties). Clustering algorithms include statistical algorithms, machine learning, neural networks and database-oriented methods. Clustering is an important problem in data mining and pattern recognition. The most commonly used clustering methods are distance-based partitioning algorithms, and K-means clustering is a typical partitioning algorithm: a classification method that takes the mean as the class "center". The K-means clustering method is simple and fast, has an intuitive geometric meaning, has been used in pattern recognition, image processing and computer vision, and has achieved satisfactory results. The basic steps of K-means clustering are to take k random data objects as the initial cluster class centers and then to divide the data objects into different cluster classes through iteration; the value of the parameter k is given in advance. Studies have indicated that if the initial cluster class centers are improperly selected, the K-means algorithm may fall into a local optimum and fail to obtain a good clustering result. A number of solutions have been proposed for determining the initial cluster class centers of K-means. Typical ones include the sampling-based initial cluster center algorithm proposed by Paul S. Bradley; the partition-based density estimation method proposed by Moh'd B. Al-Daoud and Stuart A. Roberts; Kaufman's method of estimating the initial cluster class centers from the local density of the data points; and the KKZ method of Katsavounidis et al. To prevent several initial class centers from falling into the same actual category, J. T. Tou et al. proposed restricting the distance between the centers of the initial classes. In China, Qie Xian also proposed an approach to initializing the K-means cluster class centers. To some extent these methods have improved the performance of the K-means algorithm, but the clustering accuracy, the time used to initialize the cluster class centers, and the number of clustering iterations all need further improvement; moreover, when the dimension of the clustering data is high, these methods have difficulty achieving the partitioning. The paper [4] proposes a selection method for the initial cluster class center based on space division.

II. GRAPH-THEORY CLUSTERING ALGORITHM
Classical graph-theory clustering, also known as minimum spanning tree (MST) clustering [6], was first proposed by Zahn. The algorithm first defines a distance between data objects; second, it produces an (edge-)weighted complete graph from the pairwise distances between data objects; third, it obtains a minimum spanning tree (MST) of the complete graph; finally, the MST is divided into a number of subtrees by removing its longest-distance (i.e., maximum-weight) edges, and the resulting subtrees are taken as the clusters. The literature [6] points out that this algorithm can obtain globally optimal clusters, but in fact the method cannot guarantee that data objects a small distance apart belong to the same class while data objects a large distance apart belong to different classes. For example: given the undirected graph G shown in Figure 1, its minimum spanning tree is shown in Figure 2, and deleting the maximum-weight edge (x1, x4) of the minimum spanning tree yields the subgraph shown in Figure 3. The graph in Figure 3 has two subtrees, which the clustering algorithm of [6] takes as the two clusters.
The data set is divided into two cluster classes, namely {x1, x2, x3} and {x4, x5, x6}. However, x1 and x4, whose distance is 4, belong to different cluster classes, while x4 and x6, whose distance is 5, belong to the same cluster class. Moreover, the algorithm finds the minimum spanning tree of the generated complete graph, and the cost of implementing this process is very large: its time complexity is O(n²). The literature [7] uses basically the same method as [6], so it has the same problem. The initial cluster class center selection algorithm based on the connected components of a graph, proposed in [8], improves on the clustering algorithms of [6, 7] in both effect and efficiency, but like the algorithms of [6, 7] it still cannot guarantee that data objects a small distance apart belong to the same class and data objects a large distance apart belong to different classes. We therefore propose an initial cluster class center selection algorithm based on the great groups (maximal cliques) of a graph, and use this algorithm to obtain the initial cluster class centers of the K-means clustering method.

© 2011-2012 World Academic Publishing. CISME Vol. 2 Iss. 8 2012, PP. 17-20, www.jcisme.org

The distance d(x, y) between x and y can be chosen according to the specific circumstances; for example, the Euclidean distance, the absolute distance or another distance measure can be used.

Theorem 1: A given finite data set X can be represented by an (edge-)weighted undirected complete graph.

III. K-MEANS CLUSTERING ALGORITHM
The basic idea of the K-means clustering algorithm is to randomly select k data objects, each representing the initial mean or center of one cluster.
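The MST-clustering behaviour discussed above can be sketched in Python. This is a hypothetical reconstruction, not the paper's code: the vertex names follow Figures 1-3, but the edge weights are invented, chosen only so that the deleted MST edge (x1, x4) has weight 4 while the intra-cluster pair (x4, x6) has the larger distance 5, reproducing the paper's counterexample.

```python
# Zahn-style MST clustering sketch: build an MST with Kruskal's algorithm,
# delete the k-1 heaviest MST edges, and read off the connected components.
def mst_clusters(nodes, edges, k=2):
    """edges: list of (u, v, weight). Returns a list of k vertex sets."""
    parent = {v: v for v in nodes}
    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v
    # Kruskal: scan edges by increasing weight, keep those joining components
    mst = []
    for u, v, w in sorted(edges, key=lambda e: e[2]):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            mst.append((u, v, w))
    # drop the k-1 heaviest MST edges, then collect the components
    keep = sorted(mst, key=lambda e: e[2])[:len(mst) - (k - 1)]
    parent = {v: v for v in nodes}
    for u, v, _ in keep:
        parent[find(u)] = find(v)
    comps = {}
    for v in nodes:
        comps.setdefault(find(v), set()).add(v)
    return list(comps.values())

nodes = ["x1", "x2", "x3", "x4", "x5", "x6"]
edges = [("x1", "x2", 1), ("x2", "x3", 1), ("x1", "x3", 2),
         ("x4", "x5", 1), ("x5", "x6", 2), ("x4", "x6", 5),
         ("x1", "x4", 4), ("x3", "x5", 6)]
# yields {x1,x2,x3} and {x4,x5,x6}: the cut edge (x1,x4) has weight 4,
# yet (x4,x6) with the larger distance 5 ends up inside one cluster
clusters = mst_clusters(nodes, edges, k=2)
```

With these weights the heaviest MST edge is (x1, x4), so the split matches Figure 3, illustrating why edge length alone does not decide cluster membership.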
Each remaining data object is assigned to the nearest cluster according to its distance from each initial cluster center; then the mean or center of each cluster is recalculated. The process is repeated until the criterion function converges.

Proof: Immediate from Definition 1.

Definition 2 (Great group): For a graph G(V, E) and a vertex set V′ ⊆ V, if the subgraph induced by V′ is a complete graph, then V′ is called a group (clique) of G; if, in addition, for every v ∈ V \ V′ the subgraph induced by V′ ∪ {v} is not a complete graph, then V′ is called a great group (maximal clique) of G.

For a given data set X = {x1, x2, ..., xn}, the task is to cluster the data objects of X. First define (choose) a distance measure between data objects, then calculate the distance between every two data objects of X, so as to obtain an (edge-)weighted graph G(X).

The criterion function E is defined as

E = Σ_{i=1}^{k} Σ_{x ∈ C_i} (x − x̄_i)²    (1)

where E is the sum of squared errors between all data objects of the data set and the centers of their clusters. A larger E means the data objects of a cluster are farther from its center and the similarity within the cluster is lower; conversely, a smaller E means the similarity between the data objects of a cluster is higher. Here x is a data object of a cluster, x̄_i is the center of cluster C_i, k is the number of clusters, and C_i is the i-th of the k clusters.

The K-means clustering algorithm is described as follows.
Input: a data set containing n data objects and the number of clusters k; the criterion function is E = M;
Output: k clusters satisfying the convergence of the criterion function.
1) Randomly select k data objects, each as an initial cluster class center;
2) Calculate the distance between each data object and each cluster class center, and assign each object to the nearest cluster according to the distance;
3) Recalculate the mean of the objects of each cluster and update the cluster class centers;
4) Calculate the criterion function E;
5) If the criterion function E no longer changes, go to step 6); otherwise go to step 2);
6) End.

Then choose a threshold α and remove from G(X) the edges whose weights are greater than α, obtaining a subgraph G0 of G(X); the distance between any two adjacent nodes (data objects) in G0 does not exceed α. Find all the great groups of G0; suppose there are p of them in total (here p is required to be slightly greater than k, and the threshold α can be adjusted to meet this requirement). Take the node (data object) set Xi of the i-th great group as a subset; this yields a cover of the data set X consisting of p subsets, Y = {X1, X2, ..., Xp}. Count the number of data objects in all subsets of Y, and take the center c1 of the subset containing the most data objects as the first initial cluster class center (denoted p1). Next, calculate the distance from p1 to the center of each other subset of Y, and select the center c2 of the subset farthest from p1 as the second initial cluster class center (denoted p2). And so on:

p_i = arg max_{c_i} min_{1 ≤ j ≤ i−1} d(c_i, p_j)    (2)

where c_i is the center of a subset of Y, i = 3, 4, ..., p, and p_i is the i-th initial cluster class center, i = 3, 4, ..., k. In this way, (2) yields the k initial cluster class centers.

Based on the above basic idea, the initial cluster class center selection algorithm of the K-means clustering method based on great groups is described as follows.

IV.
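The K-means loop of steps 1)-6) can be sketched as follows. This is a minimal Python illustration, not the authors' implementation; the sample points and the two initial centers are invented for the example, and the initial centers are passed in as a parameter since selecting them is precisely what the rest of the paper addresses.

```python
# K-means from given initial centers, reporting the criterion function
# E = sum_i sum_{x in C_i} ||x - mean(C_i)||^2 from equation (1).
def kmeans(points, centers, max_iter=100):
    def dist2(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    for _ in range(max_iter):
        # step 2: assign each point to its nearest center
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda i: dist2(p, centers[i]))
            clusters[i].append(p)
        # step 3: recompute each center as its cluster's mean
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*cl)) if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        # step 5: stop when the centers no longer change
        if new_centers == centers:
            break
        centers = new_centers
    # step 4: criterion function E (sum of squared errors)
    E = sum(dist2(p, centers[i]) for i, cl in enumerate(clusters) for p in cl)
    return clusters, centers, E

# two well-separated invented groups; one deliberately sensible start
pts = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
       (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
clusters, centers, E = kmeans(pts, centers=[(0.0, 0.0), (9.0, 9.0)])
```

Running the same loop from a poor random start is exactly what produces the spread between Emax and Emin reported later in Table I.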
INITIAL CLUSTER CLASS CENTER SELECTION ALGORITHM BASED ON THE GREAT GROUP

Suppose the given data set is X = {x1, x2, ..., xn} and the value k is given (k is the desired number of cluster classes). Representing the data set X as an undirected complete graph turns the clustering problem into an undirected complete-graph partitioning problem.

Definition 1: Weighted graph G(X) = <V, E, W>. Here the point set is V = {xi | xi ∈ X}, where xi = (xi1, ..., xim) and xij is the j-th attribute of data object xi (the m attributes of the data objects can be chosen so as to measure their similarity or dissimilarity); the edge set is E = {(xi, xj) | xi, xj ∈ X and i ≠ j}; and W = {wij | wij = d(xi, xj)}, where wij is the weight of edge (xi, xj) and d(xi, xj) is the distance (or dissimilarity) between xi and xj. Then G(X) is an undirected (edge-)weighted complete graph.

Algorithm 1: Initial cluster class center selection for K-means clustering based on great groups.
Input: the data set to be clustered, X = {x1, x2, ..., xn}, and the value k (the desired number of cluster classes);
Output: the clustering result.
1. For the given set of data objects X = {x1, x2, ..., xn}, choose an appropriate distance measure d between data objects and establish the undirected (edge-)weighted graph G(V, E, W), where V = X; for xi, xj ∈ X with xi ≠ xj, edge (xi, xj) ∈ E; and W = {wij | wij = d(xi, xj) is the weight of edge (xi, xj)}. Also establish the distance matrix MX of the data objects of X;
2. Select a threshold α;
3. For the threshold value α do

The time complexity of Step 4 is easily seen to be O(1). Thus the time complexity of each cycle of Steps 2 to 4 is O(n²). Because the number of choices of the threshold α in Algorithm 1 can be considered a constant, the number of cycles is constant and the total time complexity of Steps 2 to 4 is O(n²).
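Step 1 of Algorithm 1 builds the weighted complete graph G(V, E, W) and the distance matrix MX. A small sketch of what that step might look like, assuming Euclidean distance as the measure d (the sample points are invented):

```python
# Distance matrix M_X of the complete weighted graph G(X): entry M[i][j]
# is the weight w_ij = d(x_i, x_j) of edge (x_i, x_j).
from math import dist  # Euclidean distance between two points (Python 3.8+)

def distance_matrix(X):
    n = len(X)
    M = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # the graph is undirected, so the matrix is symmetric
            M[i][j] = M[j][i] = dist(X[i], X[j])
    return M

X = [(0, 0), (3, 4), (0, 1)]
M = distance_matrix(X)
# M[0][1] is the Euclidean distance between (0,0) and (3,4), i.e. 5.0
```

The n(n−1)/2 pairwise computations here are also where the O(n²) cost of Step 1, analyzed in the proof of Theorem 2, comes from.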
The time complexity of Steps 5 to 7 is easily seen to be O(n). Combining Step 1, Steps 2 to 4 and Steps 5 to 7 of Algorithm 1 shows that the time complexity of computing the initial cluster class "centers" of the K-means clustering method with Algorithm 1 is O(n²). This completes the proof.

3.1 Establish the (edge-)weighted graph G(V, E, W), where V = X and, for u, v ∈ X, edge (u, v) ∈ E if and only if d(u, v) ≤ α;
3.2 Call an algorithm for finding all maximal cliques of an undirected graph [9] to compute all great groups of G(V, E, W). Let G(V, E, W) have p great groups, and take the node (data object) set of the i-th great group of G(V, E, W) as a subset Xi; this yields a cover of the data set X consisting of the p subsets Y = {X1, X2, ..., Xp};
4. If p < k, take α′ < α, set α = α′ and go to Step 3; otherwise go to Step 5;
5. Count the number of data objects in all subsets of Y, and take the center c1 of the subset containing the most data objects as the first initial cluster class center of the clustering (denoted p1);
6. Next, calculate the distance from p1 to the center of each other subset of Y, and select the center c2 of the subset farthest from p1 as the second initial cluster class center (denoted p2);
7. for (i = 3; i ≤ k; i++)

p_i = arg max_{c_i} min_{1 ≤ j ≤ i−1} d(c_i, p_j)    (2)

TABLE I: THE COMPARISON OF K-MEANS ALGORITHM AND THE PAPER ALGORITHM (CRITERION FUNCTION E)

Algorithm | Data Sets | Emax | Emin | Eavg
K-means | balance | 396.87643 | 332.34532 | 363.74312
K-means | car | 3.1632E3 | 3.1321E3 | 3.1543E3
K-means | iris | 155.36851 | 79.83192 | 96.425347
K-means | wine | 3.15783E4 | 2.9863E4 | 3.07964E4
This paper | balance | 335.65312 | 335.65312 | 335.65312
This paper | car | 3.1421E3 | 3.1421E3 | 3.1421E3
This paper | iris | 80.94756 | 80.94756 | 80.94756
This paper | wine | 2.9634E4 | 2.9634E4 | 2.9634E4

V. EXPERIMENTAL RESULTS AND ANALYSIS
In this paper, the test data are 4 experimental data sets from the UCI database: balance, car, iris and wine.
UCI is a database specialized for the international common testing of machine learning and data mining algorithms. The data sets in this repository have known classifications, so accuracy can be used to represent the quality of clustering directly. To verify the accuracy of the algorithm, the data distribution of the test data sets was left in its original state without any manual processing. We compare the traditional K-means algorithm with randomly selected initial cluster class centers against the selection algorithm of the initial cluster class "centers" based on the great group proposed in this paper. Since the initial cluster class "centers" obtained by the great-group-based selection algorithm are deterministic, a single run of the algorithm suffices: the maximum, minimum and average clustering accuracy coincide, as do the maximum, minimum and average values of the criterion function E.

where c_i is the center of a subset of Y, i = 3, 4, ..., p, and p_i is the i-th initial cluster class center, i = 3, 4, ..., k (this yields the k initial cluster class centers);
8. Starting from these k initial cluster class centers, call the K-means algorithm to obtain the clustering result;
9. The algorithm ends.

Theorem 2: For a given finite data set X = {x1, x2, ..., xn} and a chosen distance d between data objects, Algorithm 1 correctly computes the initial cluster class "centers" of the K-means clustering method, and its time complexity is O(n²).

Proof: The correctness of Algorithm 1 follows from Theorem 1 and the definitions. We now analyze the time complexity of computing the initial cluster class "centers" of the K-means clustering method, which is Steps 1 to 7 of Algorithm 1. In Step 1, the time is spent mainly on representing the finite data set X = {x1, x2, ..., xn} and establishing the (edge-)weighted graph G(V, E, W).
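Steps 2-7 of Algorithm 1 can be sketched in Python. This is an illustrative reconstruction under stated assumptions: Euclidean distance as the measure d, a plain Bron-Kerbosch routine standing in for the maximal-clique algorithm of [9], the max-min reading of the farthest-first rule (2), and invented sample points.

```python
# Great-group initialization sketch: threshold the complete graph at alpha,
# enumerate maximal cliques ("great groups"), take each clique's centroid,
# and pick k centers: largest clique first, then farthest-first.
from math import dist

def maximal_cliques(adj):
    """Bron-Kerbosch enumeration of maximal cliques; adj maps v -> set of neighbours."""
    cliques = []
    def bk(R, P, X):
        if not P and not X:
            cliques.append(R)  # R cannot be extended: it is a great group
            return
        for v in list(P):
            bk(R | {v}, P & adj[v], X & adj[v])
            P = P - {v}
            X = X | {v}
    bk(set(), set(adj), set())
    return cliques

def initial_centers(X, k, alpha):
    # step 3.1: edge (i, j) survives iff d(x_i, x_j) <= alpha
    adj = {i: {j for j in range(len(X))
               if j != i and dist(X[i], X[j]) <= alpha}
           for i in range(len(X))}
    # step 3.2: all great groups, largest first (step 5 wants the biggest)
    cliques = sorted(maximal_cliques(adj), key=len, reverse=True)
    cents = [tuple(sum(c) / len(q) for c in zip(*(X[i] for i in q)))
             for q in cliques]
    # steps 5-7: largest clique's centroid, then max-min farthest-first
    centers = [cents[0]]
    rest = cents[1:]
    while rest and len(centers) < k:
        far = max(rest, key=lambda c: min(dist(c, p) for p in centers))
        centers.append(far)
        rest.remove(far)
    return centers

X = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9), (5, 5)]
# alpha = 1.5 keeps only intra-group edges, giving two 3-cliques plus {(5,5)}
centers = initial_centers(X, k=2, alpha=2.0)
```

Note that the lone point (5, 5) forms a trivial great group of its own, but farthest-first passes over it in favour of the second large clique's centroid; step 4's adjustment of α (taking α′ < α when p < k) is omitted here for brevity.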
If the undirected graph G is represented by the adjacency-list structure of [10], the lists store 2m edge nodes, so the complexity of representing G is O(|E|) (m = |E| = n(n−1)/2), and the time complexity of Step 1 is O(n²). In each main loop of Steps 2 to 4, establishing the undirected graph G in Step 3.1 requires screening the weights of the n(n−1) edges, so the time complexity of Step 3.1 is O(n²). And because the time complexity of the adjacency-list-based algorithm for computing all maximal cliques of a graph is O(|E|) [11], the time complexity of Step 3.2 is O(|E|). In summary, the time complexity of one execution of Step 3 is O(n²) + O(|E|) = O(n²) (since |E| ≤ n²).

The traditional K-means algorithm with randomly selected initial cluster class centers was run 10 times; the experimental results are shown in Table II.

TABLE II: THE COMPARISON OF K-MEANS AND THE PAPER ALGORITHM (CLUSTERING ACCURACY, %)

Algorithm | Data Sets | Highest | Lowest | Average
K-means | balance | 0.713359 | 0.431357 | 0.538663
K-means | car | 0.732235 | 0.635536 | 0.681272
K-means | iris | 0.732546 | 0.576543 | 0.625325
K-means | wine | 0.763592 | 0.495132 | 0.693651
This paper | balance | 0.714371 | 0.714371 | 0.714371
This paper | car | 0.793352 | 0.793352 | 0.793352
This paper | iris | 0.893542 | 0.893542 | 0.893542
This paper | wine | 0.757356 | 0.75356 | 0.757356

VI. CONCLUSION
The K-means initial cluster class center selection algorithm based on the great group, proposed in this paper, uses the great groups of a graph to determine the cluster class centers of the K-means algorithm.
Compared with the traditional K-means algorithm with randomly selected initial cluster class centers, this lays a good foundation for the subsequent execution of the K-means algorithm and improves the accuracy and stability of the algorithm, while avoiding the randomness in accuracy and efficiency caused by randomly selected cluster class centers. Simulation results show that the algorithm avoids the arbitrariness of the selected initial cluster class centers, significantly improves the clustering accuracy on the data samples, significantly reduces the number of iterations of the algorithm, and significantly improves the clustering effect.

REFERENCES
[1] Han J., Kamber M. (trans. Fan M., Meng X.). Data Mining: Concepts and Techniques [M]. Beijing: Mechanical Industry Press, 2001: 232-235.
[2] Kurniawan A., Benech N., Tao Yufei. Towards High-dimensional Clustering [J]. COMP, November 1999: 1-2.
[3] Lee Tae, Shen Xianghong. Double fuzzy K-means algorithm classifier design. Computer Measurement and Control, 2008, 16(9): 1325-1326.
[4] Su Banners, Xue Huifeng, Zhan Hailiang. Optimizing the initial cluster centers of the K-means algorithm based on division. Microelectronics and Computer, 2009, 26(1): 8-11.
[5] Zhang Jianpei, Yang Yue, Yang Jing, et al. Selecting the initial cluster centers of the K-means algorithm based on optimal division. Journal of System Simulation, 2009, 21(9): 2586-2590.
[6] Qian Yuntao, Zhao Rongchun, Xie Weixin. Robust clustering: a method based on graph theory and objective functions [J]. Journal of Electronics, 1998, 26(2): 91-94.
[7] Li Chunsheng, Wang Yaonan. A new method of initializing cluster centers. Control Theory and Applications, 2010, 27(10): 1435-1440.
[8] Zhou Haiyan, Bai Xiaoling. Choice of the initial cluster centers in graph-based K-means clustering. Computer Measurement and Control, 2010, 18(9): 2167-2169.
[9] Chen Anlong, Tang Changjie, Wang Tao, et al. An algorithm for mining association rules based on maximal cliques and FP-Tree [J]. Journal of Software, 2004, 15(8): 1198-1207.
[10] Aho A.V., Hopcroft J.E., Ullman J.D. The Design and Analysis of Computer Algorithms. New York: Addison-Wesley Publishing Company, 1974: 189-195.
[11] Hu Gang, Xu Ming, Liu Lixia, et al. A clique-partition-based spectrum sensing algorithm in cognitive wireless networks. Journal of Software, 2011, 22(2): 298-312.