Proceedings of the Sixth International Conference on Machine Learning and Cybernetics, Hong Kong, 19-22 August 2007

A CLUSTERING ALGORITHM FOR DATA MINING BASED ON SWARM INTELLIGENCE

PENG JIN 1,2, YUN-LONG ZHU 1, KUN-YUAN HU 1

1 Shenyang Institute of Automation of the Chinese Academy of Sciences, Shenyang, 110016, China
2 Graduate School of the Chinese Academy of Sciences, Beijing, 100039, China
E-MAIL: {jinpeng, ylzhu, hukunyuan}@sia.cn

Abstract:
Clustering analysis is an important function of data mining, and different domains and applications need different clustering methods. This paper proposes Ant-Cluster, a clustering algorithm for data mining based on swarm intelligence. Ant-Cluster introduces multiple populations of ants that move at different speeds, and it limits the number of moves an ant may make with the same data object in order to deal with outliers and the locked-ant problem. We experiment on a telecom company's customer data set with SWARM, an agent-based model simulation package, which is integrated in SIMiner, a data mining software system based on swarm intelligence that we developed. The results indicate that Ant-Cluster obtains clustering results effectively without being given the number of clusters and performs better than the k-means algorithm.

Keywords:
Clustering algorithm; Data mining; Swarm intelligence

1. Introduction

Clustering analysis is an unsupervised learning method that groups a set of data objects into clusters, such that objects within the same cluster are similar to one another and dissimilar to the objects in other clusters [1]. The major clustering methods fall into five categories: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods. Some clustering algorithms integrate the ideas of several of these methods, but each kind of method has its own limitations.
For example, the k-means algorithm, a widely used partitioning method, is sensitive to the choice of initial objects, which can lead to a local optimum, and it requires the user to specify the number of clusters.

Swarm intelligence is a kind of evolutionary algorithm inspired by the behaviors of social animals. Its advantages include self-adaptation, self-organization, and parallel computation. It has been applied to the traveling salesman problem (TSP), quadratic assignment, graph coloring, job-shop scheduling, sequential ordering, and vehicle routing [2]. Some clustering algorithms for data mining based on swarm intelligence have also been proposed [3-10], and they solve some problems that exist in other methods. For example, in these algorithms the user need not specify the number of clusters, the amount of computation is reduced because each step works with local objects instead of all objects, and clusters of arbitrary shape can be discovered.

In this paper, we introduce a clustering algorithm based on a swarm intelligence method inspired by the clustering of corpses and the larval-sorting activities observed in real ant colonies [6]. Improving on the existing algorithm, we present the Ant-Cluster algorithm. In Ant-Cluster, we introduce multiple populations of ants with different moving speeds, an idea first proposed by Lumer and Faieta [11], and we process outlier objects explicitly. We experiment on a telecom company's customer data set to evaluate the performance of Ant-Cluster. The results indicate that Ant-Cluster is more effective than the k-means algorithm when the numbers of clusters are the same or similar.

The rest of this paper is organized as follows. Section 2 briefly introduces clustering algorithms based on swarm intelligence and lists the parameters and symbols used in this paper. Section 3 discusses the Ant-Cluster algorithm and presents its improvements.
Section 4 reports experimental results of clustering with Ant-Cluster and compares Ant-Cluster with the k-means algorithm. Finally, Section 5 concludes the paper and points out directions for future research.

1-4244-0973-X/07/$25.00 ©2007 IEEE

2. Clustering Algorithms Based on Swarm Intelligence

In existing research, two kinds of swarm intelligence methods are used for clustering. One is the ant colony optimization algorithm, which is inspired by the behavior of ant colonies finding the shortest path between their nest and a food source [3-5]. The other is ant-based clustering, inspired by the clustering of corpses and the larval-sorting behavior of real ant colonies [6-9]. This paper studies the second kind, which is briefly introduced as follows.

Ants are modeled by simple agents that move randomly in their environment, a two-dimensional grid with periodic boundary conditions. Data objects scattered within this environment can be picked up, transported, and dropped by the agents. The picking and dropping operations are based on the similarity and density of data objects within the ants' local neighborhood: ants are likely to pick up data objects that are either isolated or surrounded by dissimilar ones, and they tend to drop them in the vicinity of similar ones. In this way, clusters of data objects are obtained on the grid.

In this paper, we introduce the concept of multiple populations of ants with different moving speeds and a processing method for outlier objects, and on this basis we propose the Ant-Cluster algorithm. The parameters and symbols used in this paper are listed as follows.
α: swarm similarity coefficient;
r: observing radius of each ant;
N: the maximum number of cycles;
size: the size of the two-dimensional grid;
m_p: the number of ants in each population;
p: index of populations, p = 1, 2, 3;
p_p: picking-up probability;
p_d: dropping probability;
p_r: random probability, p_r ∈ [0, 1);
k_1 and k_2: threshold constants for computing p_p and p_d respectively;
ant_i: the ith ant;
o_i: the ith data object;
loaded and unloaded: state of an ant. If an ant carries a data object, its state is loaded; otherwise, its state is unloaded;
v_high: the speed of ants in the high speed population;
v_low: the speed of ants in the low speed population;
v_MAX: the maximal speed in the variable speed population;
l: the maximum number of consecutive moves an ant may make with the same data object.

3. Ant-Cluster algorithm

We propose the Ant-Cluster algorithm on the basis of existing research on ant-based clustering. A high-level description of Ant-Cluster is shown in Algorithm I.

ALGORITHM I: A High-Level Description of Ant-Cluster

Initialization phase:
    Initialize parameters (α, r, N, size, m_p, v_high, v_low, v_MAX, and l).
    Place data objects on a two-dimensional grid randomly, i.e. assign a pair of coordinates (x, y) to each data object.
    Put three populations of ants with different speeds on this grid.
    The initial state of each ant is unloaded;
while (cycle_time <= N)
    Adjust α with specific step;
    for (p = 1; p <= 3; p++)
        for (i = 1; i <= m_p; i++)
            if (ant_i encounters a data object)
                if (state of ant_i is unloaded)
                    Compute the swarm similarity of the data object within a local region with radius r, and compute the picking-up probability p_p.
                    Compare p_p with a random probability p_r.
                    If p_p > p_r, ant_i picks up this data object, and the state of ant_i is changed to loaded;
            else if (state of ant_i is loaded)
                If ant_i has already moved with the same data object for l steps, the data object is dropped and the state of ant_i is changed to unloaded.
                Otherwise, compute the swarm similarity of the data object within a local region with radius r, and compute the dropping probability p_d.
                Compare p_d with a random probability p_r.
                If p_d > p_r, ant_i drops this data object, and the state of ant_i is changed to unloaded.
        end
    end
end

3.1. General Description of Ant-Cluster

In the initialization phase, the user assigns values to all parameters, including α, r, N, size, m_p, v_high, v_low, v_MAX, and l. Data objects and three populations of ants with different speeds are placed randomly on a two-dimensional grid. Each grid cell holds at most one data object and at most one ant. The initial state of each ant is set to unloaded.

In each iteration of the outer loop, i.e. the while loop, every ant on the grid moves once. Each iteration of the inner loop corresponds to the behavior of one ant. An ant moves one step at a time in a random direction on the grid, with a step length that depends on the population it belongs to. When it encounters a data object and its state is unloaded, the swarm similarity and the picking-up probability are computed to decide whether or not to pick up the data object. When it does not encounter a data object and its state is loaded, the number of moves it has made with the same data object is first compared with l. If the ant has already moved with the same data object l times, the data object is dropped. Otherwise, the swarm similarity and the dropping probability are computed to decide whether or not to drop the data object.

The swarm similarity is computed by the following formula:

$$ f(o_i) = \frac{1}{s} \sum_{o_j \in Neigh(r)} \left[ 1 - \frac{d(o_i, o_j)}{\alpha} \right] \qquad (1) $$

where f(o_i) is a measure of the average similarity of object o_i with the other objects o_j present in the neighborhood of o_i, s is the number of objects o_j, and d(o_i, o_j) is the Euclidean distance between the two objects o_i and o_j in the space of attributes.
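As an illustration of equation (1), the following minimal Python sketch computes the swarm similarity from attribute vectors. The function names `euclidean` and `swarm_similarity` are hypothetical and do not come from the SWARM or SIMiner implementation. The sketch assumes that the caller has already collected the objects within radius r on the grid, and that an object with no neighbors has similarity 0; the paper does not state either detail.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two attribute vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def swarm_similarity(obj, neighbors, alpha):
    """Equation (1): mean of 1 - d(o_i, o_j) / alpha over the s neighbors.

    obj       -- attribute vector of o_i
    neighbors -- attribute vectors of the objects o_j found in Neigh(r)
    alpha     -- swarm similarity coefficient
    """
    s = len(neighbors)
    if s == 0:
        # Assumption: an isolated object is given similarity 0.
        return 0.0
    return sum(1.0 - euclidean(obj, o_j) / alpha for o_j in neighbors) / s
```

Note that the value is negative when the neighbors lie, on average, farther than α from o_i in attribute space; the sketch applies no clipping because equation (1) specifies none.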
The swarm similarity is transformed into the picking-up probability p_p and the dropping probability p_d by the following formulas respectively:

$$ p_p(o_i) = \left( \frac{k_1}{k_1 + f(o_i)} \right)^2 , \qquad p_d(o_i) = \left( \frac{f(o_i)}{k_2 + f(o_i)} \right)^2 \qquad (2) $$

where k_1 and k_2 are two threshold constants assigned by the user. Under these formulas p_p decreases and p_d increases as f(o_i) grows, which matches the picking and dropping behavior described in Section 2.

3.2. Multi-population with Different Speed

The Ant-Cluster algorithm introduces the concept of multiple populations. Three populations of ants with different speeds are adopted in this paper: a high speed population, a low speed population, and a variable speed population. The speed is the length of the step that an ant moves at a time. Ants with high speed make the algorithm converge more quickly. Ants with low speed make the clustering results more precise and finer-grained. An ant with variable speed inspects its neighborhood and then decides its speed according to the following formula:

$$ v = \begin{cases} \lceil p_x \cdot v_{MAX} \rceil & \text{pick up or drop an object successfully} \\ \lceil (1 - p_x) \cdot v_{MAX} \rceil & \text{pick up or drop an object unsuccessfully} \\ \lceil p_r \cdot v_{MAX} \rceil & \text{otherwise} \end{cases} \qquad (3) $$

where p_x is the picking-up probability p_p or the dropping probability p_d, p_r is a random probability, and v_MAX is the maximal speed given by the user.

3.3. Outlier Processing

Data sets contain some special objects called outliers, such as noise, exceptional cases, or incomplete data objects. These objects may confuse the clustering process because they are dissimilar to all other objects. Once an ant has picked up an outlier, the outlier is unlikely to be dropped. The ant is then "locked" and cannot take part in the algorithm effectively. As the number of locked ants increases, the convergence of the algorithm slows down. To solve this problem, an ant that has carried the same object for more than l steps drops its load. The threshold l is assigned by the user.

4. Experimental Results

In this section, we experiment on a telecom company's customer data set, comprising 2669 customer cases, to verify the performance of the Ant-Cluster algorithm. The algorithm is implemented with SWARM, an agent-based model simulation package, which is integrated in SIMiner, a data mining software system based on swarm intelligence that we developed. The data set includes the following attributes.

Table 1. Data attributes used in experiment

| Variable name | Description |
|---|---|
| Regular_dur | Minutes of call in regular time |
| Discount_dur | Minutes of call in discount time |
| Local_dur | Minutes of local call |
| Domestic_dur | Minutes of domestic call |
| Svc_sms | Times of short message service |
| Svc_type | Number of service types |
| Svc_time | Number of service times |
| Age | Customer age |
| Gender | Customer gender |
| Balance | Balance of customer account |
| Arrearage_time | Times of arrearage |
| ARPU | Average Revenue Per User |
| Churn | Customer is churning or not |

The parameters of the Ant-Cluster algorithm are set as follows in this experiment: the swarm similarity coefficient is α = 12~14, the observing radius is r = 10, the maximum number of cycles is N = 8000, the size of the two-dimensional grid is size = 160×160, the number of ants in each population is m_p = 100, the threshold constants are k_1 = 0.1 and k_2 = 0.15, the speed of the high speed population is v_high = 5, the speed of the low speed population is v_low = 1, the maximal speed in the variable speed population is v_MAX = 20, and the threshold of locked-ant moving time is l = 50.

The clustering result is shown in Figure 1, where each cluster on the grid represents one customer cluster. Objects in a cluster have some common characteristics, and these characteristics can be obtained by comparing the distribution of an attribute value in the whole data set with its distribution in a certain cluster.
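The comparison described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the helper names `binned_share` and `compare_distribution` are hypothetical, the fixed-width binning is an assumed way of summarizing a distribution, and any values passed in would be the user's own data rather than the telecom data set.

```python
from collections import Counter

def binned_share(values, bin_width):
    """Share of samples falling in each bin [b, b + bin_width), keyed by b."""
    counts = Counter(int(v // bin_width) * bin_width for v in values)
    n = len(values)
    return {b: c / n for b, c in sorted(counts.items())}

def compare_distribution(all_values, cluster_values, bin_width):
    """Return (bin start, share in whole data set, share in cluster) rows."""
    overall = binned_share(all_values, bin_width)
    cluster = binned_share(cluster_values, bin_width)
    bins = sorted(set(overall) | set(cluster))
    return [(b, overall.get(b, 0.0), cluster.get(b, 0.0)) for b in bins]
```

A cluster whose share is concentrated in higher bins than the whole data set has a higher value of that attribute as one of its characteristics.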
For example, Figures 2 and 3 show the distribution of the domestic call time attribute in the whole customer data set and in a certain customer cluster respectively. As shown in Figure 3, domestic call time in this cluster is longer than in the whole customer data set shown in Figure 2. Therefore, we can conclude that higher domestic call time is one of the characteristics of this cluster. Appropriate marketing strategies should be made according to this result.

[Figure 1. Result of customer segmentation obtained by Swarm with Ant-Cluster]

[Figure 2. Distribution of domestic call time in all customer data set (horizontal axis: Domestic dur/month; vertical axis: Number of samples *100%)]

[Figure 3. Distribution of domestic call time in a certain customer cluster (horizontal axis: Domestic dur/month; vertical axis: Number of samples *100%)]

Figure 4 shows the average value of Domestic_dur in each cluster obtained with the Ant-Cluster algorithm. For comparison, we implemented the k-means algorithm. The average value of Domestic_dur in each cluster obtained with the k-means algorithm (k=19) is shown in Figure 5. In Figure 4, the Ant-Cluster algorithm has obtained six clusters whose average value of Domestic_dur is more than 300, namely clusters 2, 9, 13, 4, 8, and 11. The k-means algorithm has discovered only two clusters whose average value of Domestic_dur is more than 300, namely clusters 2 and 3. We can conclude that the Ant-Cluster algorithm is more effective than the k-means algorithm at discovering clusters with distinct characteristics when the numbers of clusters are the same or similar.

[Figure 4. Result of Ant-Cluster algorithm]

[Figure 5. Result of k-means algorithm]

5. Conclusions

This paper has proposed a clustering algorithm for data mining based on swarm intelligence, called Ant-Cluster. Ant-Cluster introduces multiple populations with different speeds and adopts a fixed limit on the number of moves with the same data object to deal with outliers and the locked-ant problem. SWARM, an agent-based model simulation package, is applied to evaluate the performance of Ant-Cluster by experimenting on a telecom company's customer data set. The results indicate that the proposed algorithm obtains clustering results effectively without being given the number of clusters and performs better than the k-means algorithm. In future research, other methods for computing the picking-up probability p_p and the dropping probability p_d should be investigated to enhance the efficiency of the Ant-Cluster algorithm.

Acknowledgements

This work is supported by the National Natural Science Foundation of China (Grant No. 70431003).

References

[1] Jiawei Han and Micheline Kamber, Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers, San Francisco, 2000.
[2] Eric Bonabeau, Marco Dorigo, Guy Theraulaz, Swarm Intelligence: from Natural to Artificial Intelligence, Oxford University Press, New York, 1999.
[3] Cheng-Fa Tsai, Chun-Wei Tsai, Han-Chang Wu, and Tzer Yang, "ACODF: a novel data clustering approach for data mining in large databases", Journal of Systems and Software, Vol. 73, pp. 133-145, 2004.
[4] Cheng-Fa Tsai, Han-Chang Wu, Chun-Wei Tsai, "A new data clustering approach for data mining in large databases", Proceedings of the International Symposium on Parallel Architectures, Algorithms and Networks, Metro Manila, Philippines, pp. 22-24, 2002.
[5] H. Azzag, N. Monmarché, M. Slimane, C. Guinot, G. Venturini, "AntTree: a new model for clustering with artificial ants", Proceedings of the 7th European Conference on Artificial Life, Dortmund, Germany, pp. 14-17, 2003.
[6] Wu Bin, Shi Zhongzhi, "A clustering algorithm based on swarm intelligence", Proceedings of the International Conferences on Info-tech and Info-net, Beijing, China, pp. 58-66, 2001.
[7] Wu Bin, Zheng Yi, Liu Shaohui, Shi Zhongzhi, "CSIM: a document clustering algorithm based on swarm intelligence", Proceedings of the Congress on Computational Intelligence, Hawaii, USA, pp. 477-482, 2002.
[8] Yan Yang, Mohamed Kamel, "Clustering ensemble using swarm intelligence", Proceedings of the 2003 IEEE Swarm Intelligence Symposium, Piscataway, NJ, USA, pp. 65-71, 2003.
[9] J. Handl, J. Knowles, and M. Dorigo, "Ant-based clustering: a comparative study of its relative performance with respect to k-means, average link and 1D-som", Technical Report TR/IRIDIA/2003-24, IRIDIA, Université Libre de Bruxelles, Belgium, 2003.
[10] Peng Yuqing, Hou Xiangdan, Liu Shang, "The K-means clustering algorithm based on density and ant colony", Proceedings of the International Conference on Neural Networks and Signal Processing, Nanjing, China, pp. 457-460, 2003.
[11] E. Lumer and B. Faieta, "Diversity and adaptation in populations of clustering ants", Proceedings of the third international conference on Simulation of adaptive behavior: from animals to animats 3, Brighton, United Kingdom, pp. 501-508, 1994.
