Proceedings of Sixth International Conference on Machine Learning Cybernetics, Hong Kong, 19-22 August 2007
A CLUSTERING ALGORITHM FOR DATA MINING BASED ON SWARM
INTELLIGENCE
PENG JIN1,2, YUN-LONG ZHU1, KUN-YUAN HU1
1 Shenyang Institute of Automation of the Chinese Academy of Sciences, Shenyang, 110016, China
2 Graduate School of the Chinese Academy of Sciences, Beijing, 100039, China
E-MAIL: {jinpeng, ylzhu, hukunyuan}@sia.cn
Abstract:
Clustering analysis is an important function of data
mining. Different domains and applications need different
clustering methods. This paper proposes a clustering
algorithm for data mining based on swarm intelligence,
called Ant-Cluster. The Ant-Cluster algorithm introduces
multiple populations of ants with different speeds and
adopts a fixed-moving-times method to deal with outliers
and the locked-ant problem. We experiment on a telecom
company’s customer data set with SWARM, an agent-based
model simulation software package, which is integrated in
SIMiner, a data mining software system based on swarm
intelligence that we developed. The results show that the
Ant-Cluster algorithm obtains clustering results effectively
without being given the number of clusters and performs
better than the k-means algorithm.
Keywords:
Clustering algorithm; Data mining; Swarm intelligence
1. Introduction
Clustering analysis is an unsupervised learning method
that groups a set of data objects into clusters such that
objects within the same cluster are similar to one another
and dissimilar to the objects in other clusters [1]. Major
clustering methods are classified into five categories, i.e.
partitioning methods, hierarchical methods, density-based
methods, grid-based methods, and model-based methods. Some
clustering algorithms integrate the ideas of several
clustering methods. But each kind of clustering method has
its own limitations. For example, the k-means algorithm, a
widely used partitioning method, is sensitive to the choice
of initial objects, which can lead to a local optimum, and
needs the user to specify the number of clusters.
Swarm intelligence is a kind of evolutionary algorithm
inspired by the behaviors of social animals. Its advantages
and characteristics include self-adaptation, autonomy, and
parallel computing. It has been applied to the traveling
salesman problem (TSP), quadratic assignment problem, graph
coloring, job-shop scheduling, sequential ordering, and
vehicle routing [2]. Some clustering algorithms for data
mining based on swarm intelligence have also been proposed
[3-10], which can solve some problems existing in other
methods. For example, in these algorithms the number of
clusters need not be specified by the user, the amount of
calculation is reduced because each computation involves
only local objects instead of all objects, and clusters of
arbitrary shape can be discovered.
In this paper, we introduce a clustering algorithm
based on a swarm intelligence method inspired by the
clustering of corpses and the larval-sorting activities
observed in real ant colonies [6]. Improving on the existing
algorithm, we present the Ant-Cluster algorithm. Ant-Cluster
uses multiple populations of ants with different moving
speeds, an idea first proposed by Lumer and Faieta [11], and
processes outlier objects explicitly. We experiment on a
telecom company’s customer data set to evaluate the
performance of the Ant-Cluster algorithm. The results show
that the Ant-Cluster algorithm is more effective than the
k-means algorithm when the numbers of clusters are the same
or similar.
The rest of this paper is organized as follows. Section
2 briefly introduces clustering algorithms based on swarm
intelligence and lists the parameters and symbols used in
this paper. Section 3 discusses the Ant-Cluster algorithm
and presents its improvements. Section 4 reports
experimental results of clustering with the Ant-Cluster
algorithm and compares it with the k-means algorithm.
Finally, Section 5 concludes the paper and points out
directions for future research.
2. Clustering Algorithms Based on Swarm Intelligence
In existing research, two kinds of swarm intelligence
methods are used for clustering. One is the ant colony
optimization algorithm, which is inspired by the behavior of
ant colonies finding the shortest path between their nest and
a food source [3-5]. The other is ant-based clustering, inspired
by the clustering of corpses and the larval-sorting behavior of
real ant colonies [6-9]. This paper studies the second kind,
which is briefly introduced as follows.
1-4244-0973-X/07/$25.00 ©2007 IEEE
Ants are modeled by simple agents that move randomly
in their environment, a 2-dimensional grid with periodic
boundary conditions. Data objects scattered within this
environment can be picked up, transported, and dropped by
the agents. The picking and dropping operations are based on
the similarity and density of data objects within the ants’
local neighborhood: ants are likely to pick up data objects
that are either isolated or surrounded by dissimilar ones,
and they tend to drop them in the vicinity of similar ones.
In this way, clusters of data objects on the grid are
obtained.
In this paper, we introduce the concept of multiple
populations of ants with different moving speeds and a
processing method for outlier objects, and on this basis
propose the Ant-Cluster algorithm. The parameters and
symbols used in this paper are listed as follows.
α: swarm similarity coefficient;
r: observing radius of each ant;
N: the maximum number of cycles;
size: the size of the 2-dimensional grid;
mp: the number of ants in each population;
p: index of populations, p = 1, 2, 3;
pp: picking-up probability;
pd: dropping probability;
pr: random probability, pr ∈ [0, 1);
k1 and k2: threshold constants for computing pp and pd respectively;
anti: the ith ant;
oi: the ith data object;
loaded and unloaded: state of an ant. If there is a data object on an ant, its state is loaded; otherwise, its state is unloaded;
vhigh: the speed of ants in the high speed population;
vlow: the speed of ants in the low speed population;
vMAX: the maximal speed in the variable speed population;
l: the maximum number of consecutive moves an ant may make with the same data object.
3. Ant-Cluster Algorithm
We propose the Ant-Cluster algorithm based on existing
research on ant-based clustering. A high-level description
of the Ant-Cluster algorithm is shown in Algorithm I.
ALGORITHM I: A High-Level Description of Ant-Cluster
Initialization phase: Initialize parameters (α, r, N, size,
mp, vhigh, vlow, vMAX, and l). Place data objects on a
2-dimensional grid randomly, i.e. assign a pair of coordinates
(x, y) to each data object. Put three populations of ants
with different speeds on this 2-dimensional grid. The initial
state of each ant is unloaded;
while (cycle_time <= N)
    Adjust α with specific step;
    for (p = 1; p <= 3; p++)
        for (i = 1; i <= mp; i++)
            if (anti encounters a data object)
                if (state of anti is unloaded)
                    Compute the swarm similarity of the data
                    object within a local region with radius r,
                    and compute picking-up probability pp.
                    Compare pp with a random probability pr.
                    If pp > pr, anti picks up this data object,
                    and the state of anti is changed to loaded;
                end
            else
                if (state of anti is loaded)
                    If anti has already moved with the same
                    data object l steps, the data object is
                    dropped and the state of anti is changed to
                    unloaded. Otherwise, compute the swarm
                    similarity of the data object within a local
                    region with radius r, and compute dropping
                    probability pd. Compare pd with a random
                    probability pr. If pd > pr, anti drops this
                    data object, and the state of anti is
                    changed to unloaded;
                end
            end
        end
    end
end
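The body of the inner loop of Algorithm I can be sketched as a small decision function. This is a hedged illustration rather than the authors' implementation: the function name `ant_action`, its argument names, and the string return values are hypothetical, and the probabilities are passed in instead of being computed from the grid. The drop test compares the dropping probability with the random probability, following the description in Section 3.1.

```python
def ant_action(on_object, loaded, steps_carried, l, p_pick, p_drop, p_r):
    """Decide what one ant does in one inner-loop iteration.

    on_object     -- True if the ant's cell holds a data object
    loaded        -- True if the ant is carrying a data object
    steps_carried -- moves made so far with the carried object
    l             -- maximum moves allowed with the same object
    p_pick/p_drop -- picking-up and dropping probabilities
    p_r           -- random probability drawn from [0, 1)
    Returns "pick", "drop", or "none".
    """
    if on_object:
        # Only an unloaded ant may pick up the object it encounters.
        if not loaded and p_pick > p_r:
            return "pick"
        return "none"
    if loaded:
        # Forced drop: the ant has carried this object for l moves.
        if steps_carried >= l:
            return "drop"
        # Otherwise drop probabilistically.
        if p_drop > p_r:
            return "drop"
    return "none"
```

Keeping the decision free of grid state makes it easy to check each branch of Algorithm I in isolation before wiring it into a simulation loop.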
3.1. General Description of Ant-Cluster
In the initialization phase, all parameters, including α, r, N,
size, mp, vhigh, vlow, vMAX, and l, are given values by the user.
Data objects and three populations of ants with different
speeds are placed on a 2-dimensional grid randomly. Each grid
cell holds at most one data object and at most one ant. The
initial state of each ant is set to unloaded.
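The random placement in the initialization phase can be sketched as follows. This is a minimal sketch under assumptions: the function name `place_randomly`, the dictionary layout of an ant, and the `seed` argument are hypothetical, and the sketch reads the occupancy rule as "at most one object per cell and at most one ant per cell", so an ant may start on a cell that holds an object.

```python
import random


def place_randomly(n_objects, ants_per_population, size, n_populations=3, seed=None):
    """Assign distinct grid cells to data objects and to ants."""
    rng = random.Random(seed)
    cells = [(x, y) for x in range(size) for y in range(size)]
    # Sampling without replacement guarantees one object per cell at most.
    object_cells = rng.sample(cells, n_objects)
    # Ants are sampled separately, so no two ants share a cell.
    n_ants = n_populations * ants_per_population
    ant_cells = rng.sample(cells, n_ants)
    ants = [
        {
            "population": p + 1,  # p = 1, 2, 3 as in the symbol list
            "pos": ant_cells[p * ants_per_population + i],
            "loaded": False,      # every ant starts unloaded
        }
        for p in range(n_populations)
        for i in range(ants_per_population)
    ]
    return object_cells, ants


objects, ants = place_randomly(n_objects=20, ants_per_population=4, size=10, seed=0)
print(len(objects), len(ants))  # 20 12
```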
In each iteration of the outer loop, i.e. the while loop, all
ants on the 2-dimensional grid move once. Each iteration of
the inner loop corresponds to the behavior of one ant. An ant
moves one step on the 2-dimensional grid randomly, with a
speed determined by its population. When it encounters a
data object and its state is unloaded, the swarm similarity
and picking-up probability are computed to decide whether or
not to pick up the data object. When it does not encounter a
data object and its state is loaded, the number of moves made
with the same data object is first compared with l. If the
ant has already moved with the same data object l times, the
data object is dropped. Otherwise, the swarm similarity and
dropping probability are computed to decide whether or not
to drop the data object.
The swarm similarity is computed by the following
formula:
$$
f(o_i) = \frac{1}{S} \sum_{o_j \in \mathrm{Neigh}(r)} \left[ 1 - \frac{d(o_i, o_j)}{\alpha} \right] \tag{1}
$$
where f(oi) is a measure of the average similarity of
object oi with the other objects oj present in the
neighborhood of oi. S is the number of objects oj. d(oi, oj)
is the distance between two objects oi and oj in the space of
attributes measured with Euclidean distance.
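Eq. (1) can be sketched in a few lines of Python. This is an illustration under assumptions: the function name `swarm_similarity` is hypothetical, the neighbours are passed in as attribute vectors that have already been collected from the local region of radius r, and returning 0 for an object with no neighbours is a choice made here, since the paper does not state that case.

```python
import math


def swarm_similarity(o_i, neighbours, alpha):
    """Average of 1 - d(o_i, o_j) / alpha over the neighbours o_j,
    with d the Euclidean distance in attribute space (Eq. 1)."""
    if not neighbours:
        # Assumption: an isolated object is given similarity 0.
        return 0.0
    total = sum(1.0 - math.dist(o_i, o_j) / alpha for o_j in neighbours)
    return total / len(neighbours)  # S is the number of neighbours


# One neighbour at distance 5 with alpha = 10 gives 1 - 5/10.
print(swarm_similarity((0.0, 0.0), [(3.0, 4.0)], alpha=10.0))  # 0.5
```

Note that a neighbour farther away than α contributes a negative term, so a larger α makes the measure more tolerant of dissimilar objects.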
The swarm similarity is transformed into the picking-up
probability pp and the dropping probability pd by the
following formulas respectively.
$$
p_p(o_i) = \left( \frac{k_1}{k_1 + f(o_i)} \right)^2, \qquad
p_d(o_i) = \left( \frac{k_2}{k_2 + f(o_i)} \right)^2 \tag{2}
$$
Here, k1 and k2 are two threshold constants
assigned by the user.
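The two conversions in Eq. (2) can be sketched as follows. Read literally, Eq. (2) places k2 in the numerator of pd, so that pd decreases as the similarity grows. Section 2 states that ants tend to drop objects near similar ones, which would call for a dropping probability that increases with similarity. The sketch therefore implements the formula as it reads in Eq. (2) and, as a clearly labeled assumption that does not come from this paper's text, a variant with f in the numerator. All function names are hypothetical, and the sketch assumes f ≥ 0 so that the denominators stay positive.

```python
def pick_probability(f, k1):
    """p_p = (k1 / (k1 + f))^2: high for isolated or dissimilar objects."""
    return (k1 / (k1 + f)) ** 2


def drop_probability_as_written(f, k2):
    """p_d = (k2 / (k2 + f))^2, following Eq. (2) literally."""
    return (k2 / (k2 + f)) ** 2


def drop_probability_increasing(f, k2):
    """ASSUMED variant, not taken from this paper: (f / (k2 + f))^2,
    which grows with the similarity f."""
    return (f / (k2 + f)) ** 2


k1, k2 = 0.1, 0.15  # the threshold constants used in Section 4
for f in (0.0, 0.15, 0.9):
    print(f,
          round(pick_probability(f, k1), 3),
          round(drop_probability_as_written(f, k2), 3),
          round(drop_probability_increasing(f, k2), 3))
```

Printing the three values side by side for a low, medium, and high similarity makes the difference between the two dropping rules visible before either is used in a simulation.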
3.2. Multi-population with Different Speed
The Ant-Cluster algorithm introduces the concept of
multiple populations. Three populations of ants with
different speeds are adopted in this paper, i.e. a high
speed population, a low speed population, and a variable
speed population. The speed is the length of the step that
an ant moves at a time. Ants with high speed make the
algorithm converge more quickly. Ants with low speed make
the clustering results more precise and fine-grained. An ant
with variable speed examines its neighborhood and then
decides its speed according to the following formula.
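$$
v = \begin{cases}
\lceil p_x \cdot v_{MAX} \rceil & \text{pick up or drop an object successfully} \\
\lceil (1 - p_x) \cdot v_{MAX} \rceil & \text{pick up or drop an object unsuccessfully} \\
\lceil p_r \cdot v_{MAX} \rceil & \text{otherwise}
\end{cases} \tag{3}
$$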
where px is picking-up probability pp or dropping
probability pd. pr is a random probability. vMAX is the
maximal speed given by users.
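The speed rule of Eq. (3) can be sketched as a small function. The function name `variable_speed` and the encoding of the three cases as the strings "success" and "failure" and the value `None` are assumptions made for illustration.

```python
import math


def variable_speed(outcome, p_x, p_r, v_max):
    """Step length of a variable-speed ant according to Eq. (3).

    outcome -- "success" if the ant just picked up or dropped an object,
               "failure" if it tried and did not, None otherwise
    p_x     -- the picking-up or dropping probability that was used
    p_r     -- a random probability in [0, 1)
    v_max   -- the maximal speed vMAX
    """
    if outcome == "success":
        return math.ceil(p_x * v_max)
    if outcome == "failure":
        return math.ceil((1.0 - p_x) * v_max)
    return math.ceil(p_r * v_max)


print(variable_speed("success", 0.25, 0.9, 20))  # 5
print(variable_speed("failure", 0.25, 0.9, 20))  # 15
print(variable_speed(None, 0.25, 0.33, 20))      # 7
```

With this rule, an ant that succeeds with a low probability takes a short step, while an ant that fails with the same probability takes a long one.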
3.3. Outlier Processing
Data sets contain some special objects called outliers,
such as noise, exceptional cases, or incomplete data
objects. These objects may confuse the clustering process
because they are dissimilar to the others. Once an ant has
picked up an outlier, it can hardly drop it. Such an ant is
in effect “locked” and cannot take part in the algorithm
effectively. As the number of locked ants increases, the
convergence of the algorithm slows down. To solve this
problem, an ant that has carried the same object for more
than l steps is forced to drop its load. The threshold l is
assigned by the user.
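A short calculation illustrates why the cap bounds the time an ant stays locked. Assume, purely for illustration (the paper does not state this), that a loaded ant drops its object with a constant probability q at each step, and let T be the number of steps it carries the object when a forced drop happens at step l.

```latex
P(T > n) = (1 - q)^n \quad (0 \le n < l), \qquad P(T > n) = 0 \quad (n \ge l)

E[T] = \sum_{n=0}^{l-1} P(T > n) = \sum_{n=0}^{l-1} (1 - q)^n = \frac{1 - (1 - q)^l}{q} \le l
```

Without the cap the expected carrying time would be 1/q, which grows without bound as q approaches 0, the situation of an outlier. Under this assumption, with q = 0.001 and l = 50 the expected carrying time falls from 1000 steps to about 48.8 steps.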
4. Experimental Results
In this section, we experiment on a telecom company’s
customer data set, comprising 2669 customer cases, to
verify the performance of the Ant-Cluster algorithm. The
algorithm is implemented with SWARM, an agent-based model
simulation software package, which is integrated in SIMiner,
a data mining software system based on swarm intelligence
that we developed. The data set includes the following
attributes.
Table 1. Data attributes used in experiment
Variable name     Description
Regular_dur       Minutes of call in regular time
Discount_dur      Minutes of call in discount time
Local_dur         Minutes of local call
Domestic_dur      Minutes of domestic call
Svc_sms           Times of short message service
Svc_type          Number of service types
Svc_time          Number of service times
Age               Customer age
Gender            Customer gender
Balance           Balance of customer account
Arrearage_time    Times of arrearage
ARPU              Average Revenue Per User
Churn             Customer is churning or not
The parameters of the Ant-Cluster algorithm are set as
follows in this experiment. The swarm similarity
coefficient is α=12~14, the observing radius is r=10, the
maximum number of cycles is N=8000, the size of the
2-dimensional grid is size=160×160, the number of ants in
each population is mp=100, the threshold constants are
k1=0.1 and k2=0.15, the speed of the high speed population
is vhigh=5, the speed of the low speed population is
vlow=1, the maximal speed in the variable speed population
is vMAX=20, and the threshold on the number of moves of a
locked ant is l=50.
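For reference, the experimental settings can be collected in one configuration object. This is only a convenience sketch: the class name `AntClusterConfig`, its field names, and the two derived properties are hypothetical. The values are those listed above. The step with which α is adjusted inside its range is not given in the paper, so only the range is recorded.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AntClusterConfig:
    alpha_range: tuple = (12.0, 14.0)  # swarm similarity coefficient, 12~14
    r: int = 10                        # observing radius
    N: int = 8000                      # maximum number of cycles
    size: int = 160                    # grid is size x size
    m_p: int = 100                     # ants in each population
    n_populations: int = 3             # high, low, and variable speed
    k1: float = 0.1                    # threshold constant for p_p
    k2: float = 0.15                   # threshold constant for p_d
    v_high: int = 5
    v_low: int = 1
    v_max: int = 20
    l: int = 50                        # maximum moves with the same object

    @property
    def total_ants(self):
        return self.n_populations * self.m_p

    @property
    def cells(self):
        return self.size * self.size


cfg = AntClusterConfig()
print(cfg.total_ants, cfg.cells)  # 300 25600
```

With the 2669 customer cases of the data set, the objects occupy roughly 10% of the 25600 grid cells.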
The clustering result is shown in figure 1.
In figure 1, each cluster represents one customer cluster.
Objects in a cluster have some common characteristics,
which can be found by comparing the distribution of an
attribute’s values in the whole data set with its
distribution in a certain cluster. For example, figures 2
and 3 show the distribution of the domestic call time
attribute in the whole customer data set and in a certain
customer cluster respectively. As shown in figure 3,
domestic call time in this cluster is longer than that in
the whole customer data set shown in figure 2. Therefore, we
can conclude that a higher domestic call time is one of the
characteristics of this cluster. Appropriate marketing
strategies should be made according to this result.
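The comparison described above, between the distribution of an attribute in one cluster and in the whole data set, can be sketched as follows. The function name `relative_histogram`, the bin edges, and all data values are hypothetical and invented for illustration. They are not taken from the telecom data set.

```python
from statistics import mean


def relative_histogram(values, edges):
    """Fraction of samples falling in each bin [edges[i], edges[i+1])."""
    counts = [0] * (len(edges) - 1)
    for v in values:
        for i in range(len(counts)):
            if edges[i] <= v < edges[i + 1]:
                counts[i] += 1
                break
    return [c / len(values) for c in counts]


# Hypothetical Domestic_dur values (minutes per month), illustration only.
all_customers = [20, 40, 60, 80, 120, 150, 310, 350, 420, 450]
one_cluster = [310, 350, 420, 450]
edges = [0, 100, 200, 300, 400, 500]

print(relative_histogram(all_customers, edges))  # [0.4, 0.2, 0.0, 0.2, 0.2]
print(relative_histogram(one_cluster, edges))    # [0.0, 0.0, 0.0, 0.5, 0.5]
print(mean(one_cluster) > mean(all_customers))   # True
```

In this toy example the cluster's mass sits entirely in the upper bins, which is the kind of shift that would mark a long domestic call time as a characteristic of the cluster.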
Figure 1. Result of customer segmentation obtained by
Swarm with Ant-Cluster
Figure 2. Distribution of domestic call time in all customer
data set (x-axis: Domestic dur/month; y-axis: Number of
samples *100%)
Figure 3. Distribution of domestic call time in a certain
customer cluster (x-axis: Domestic dur/month; y-axis: Number
of samples *100%)
Figure 4 shows the average value of Domestic_dur in
each cluster obtained with the Ant-Cluster algorithm. For
comparison, we implemented the k-means algorithm. The
average value of Domestic_dur in each cluster obtained
with the k-means algorithm (k=19) is shown in figure 5.
In figure 4, the Ant-Cluster algorithm has obtained six
clusters whose average value of Domestic_dur is more
than 300, namely 2, 9, 13, 4, 8, and 11. The k-means
algorithm has discovered only two clusters whose average
value of Domestic_dur is more than 300, namely 2 and 3.
We can conclude that the Ant-Cluster algorithm is more
effective than the k-means algorithm at discovering clusters
with distinct characteristics when the numbers of clusters
are the same or similar.
Figure 4. Result of Ant-Cluster algorithm
Figure 5. Result of k-means algorithm
5. Conclusions
This paper has proposed a clustering algorithm for
data mining based on swarm intelligence called Ant-Cluster.
The Ant-Cluster algorithm introduces multiple populations
with different speeds and adopts a fixed-moving-times method
to deal with outliers and the locked-ant problem. SWARM, an
agent-based model simulation software package, is used to
evaluate the performance of the Ant-Cluster algorithm on a
telecom company’s customer data set. The results show that
the proposed algorithm obtains clustering results
effectively without being given the number of clusters and
performs better than the k-means algorithm.
In future research, other methods for computing the
picking-up probability pp and the dropping probability pd
should be investigated to enhance the efficiency of the
Ant-Cluster algorithm.
Acknowledgements
This work is supported by the National Natural
Science Foundation of China (Grant No. 70431003).
References
[1] Jiawei Han and Micheline Kamber, Data Mining:
Concepts and Techniques, Morgan Kaufmann
Publishers, San Francisco, 2000.
[2] Eric Bonabeau, Marco Dorigo, and Guy Theraulaz, Swarm
Intelligence: from Natural to Artificial Systems,
Oxford University Press, New York, 1999.
[3] Cheng-Fa Tsai, Chun-Wei Tsai, Han-Chang Wu, and
Tzer Yang, “ACODF: a novel data clustering
approach for data mining in large databases”, Journal
of Systems and Software, Vol 73, pp. 133-145, 2004.
[4] Cheng-Fa Tsai, Han-Chang Wu, and Chun-Wei Tsai, “A
new data clustering approach for data mining in large
databases”, Proceedings of the International
Symposium on Parallel Architectures, Algorithms and
Networks, Metro Manila, Philippines, pp. 22-24, 2002.
[5] H. Azzag, N. Monmarché, M. Slimane, C. Guinot, and
G. Venturini, “AntTree: a new model for clustering
with artificial ants”, Proceedings of the 7th European
Conference on Artificial Life, Dortmund, Germany,
pp. 14-17, 2003.
[6] Wu Bin and Shi Zhongzhi, “A clustering algorithm based
on swarm intelligence”, Proceedings of the
International Conferences on Info-tech and Info-net,
Beijing, China, pp. 58-66, 2001.
[7] Wu Bin, Zheng Yi, Liu Shaohui, and Shi Zhongzhi,
“CSIM: a document clustering algorithm based on
swarm intelligence”, Proceedings of the Congress on
Computational Intelligence, Hawaii, USA, pp.
477-482, 2002.
[8] Yan Yang and Mohamed Kamel, “Clustering ensemble
using swarm intelligence”, Proceedings of the 2003
IEEE Swarm Intelligence Symposium, Piscataway, NJ,
USA, pp. 65-71, 2003.
[9] J. Handl, J. Knowles, and M. Dorigo, “Ant-based
clustering: a comparative study of its relative
performance with respect to k-means, average link and
1D-som”, Technical Report TR/IRIDIA/2003-24,
IRIDIA, Université Libre de Bruxelles, Belgium,
2003.
[10] Peng Yuqing, Hou Xiangdan, and Liu Shang, “The
K-means clustering algorithm based on density and
ant colony”, Proceedings of the International
Conference on Neural Networks and Signal
Processing, Nanjing, China, pp. 457-460, 2003.
[11] E. Lumer and B. Faieta, “Diversity and adaptation in
populations of clustering ants”, Proceedings of the
third international conference on Simulation of
adaptive behavior: from animals to animats 3,
Brighton, United Kingdom, pp. 501-508, 1994.