Communications in Information Science and Management Engineering
CISME
A new K-means Initial Cluster Class of the Center
Selection Algorithm Based on the Great Graph
Zhou Haiyan1, Zhao Jianyang2
Faculty of Computer Engineering, Huaiyin Institute of Technology, Huai’an, China
[email protected]; [email protected]
Abstract- The K-means algorithm, which is simple and fast and has an intuitive geometric meaning, has been widely used in pattern recognition, image processing and computer vision, and has achieved satisfactory results. However, K-means requires the initial cluster centers to be determined before execution, and the selected initial cluster centers have a direct impact on the final clustering results. This paper presents an initial cluster center selection algorithm for K-means based on the great group. Compared with other initial cluster center selection algorithms, the method significantly improves the clustering effect.
Keywords- Data Mining; Data Clustering; Initial Cluster Center; Great Group
I. INTRODUCTION
Cluster analysis studies how to use mathematical methods to classify a group of samples (objects, indices, properties). Clustering algorithms include statistical algorithms, machine learning methods, neural networks and database-oriented methods. Clustering is an important issue in data mining and pattern recognition. The most commonly used clustering methods are distance-based partitioning algorithms, and K-means clustering is a typical partitioning algorithm; it is a classification method that uses the mean of each class as the class "center". The K-means clustering method is simple and fast, has an intuitive geometric meaning, has been used in pattern recognition, image processing and computer vision, and has achieved satisfactory results. The basic steps of K-means clustering are to take k random data objects as the centers of the initial clusters and then to assign the data objects to different clusters iteratively; the value of the parameter k is given in advance. Some studies have indicated that if the centers of the initial clusters are improperly selected, the K-means algorithm may fall into a local optimum and fail to obtain a good clustering result. A number of different solutions have been proposed for determining the initial cluster centers of the K-means clustering method. Typical examples are the sampling-based initial cluster center algorithm proposed by Paul S. Bradley; the partition-based density estimation method proposed by Moh'd B. Al-Daoud and Stuart A. Roberts; Kaufman's method of estimating the initial cluster centers from the local density of the data points; and the KKZ method proposed by Katsavounidis and others. To avoid several initial centers falling within the same actual category, J. T. Tou and others proposed restricting the distance between the centers of the initial classes. In China, Qie Xian also proposed an approach to initializing the K-means cluster centers. To some extent these methods have improved the performance of the K-means algorithm, but the clustering accuracy, the time used to initialize the cluster centers and the number of clustering iterations still need further improvement, and when the dimension of the clustering data is high, such methods have difficulty partitioning the space. The paper [4] proposed a selection method for the initial cluster centers based on space division. The paper [5] proposed a histogram-based method that follows the characteristics of the data distribution; it improves the clustering effect over [4], but the effect is still not satisfactory. This paper therefore proposes an initial cluster center selection algorithm for K-means clustering based on the great groups of a graph. First, the initial cluster centers are determined by the proposed selection algorithm; then the clustering result is obtained with the optimized K-means clustering algorithm. Tests on data sets from the UCI machine learning repository verify that, compared with the classic K-means method whose initial cluster centers are randomly selected, the proposed graph-based great-group selection algorithm significantly improves the clustering results.
II. PREVIOUS GRAPH-BASED CLUSTERING ALGORITHMS
Classical graph-theoretic clustering, also known as minimum spanning tree (MST) clustering [6], was first proposed by Zahn. The algorithm first defines a distance between data objects; second, it produces an undirected (edge-)weighted complete graph from the pairwise distances; third, it obtains a minimum spanning tree of the complete graph; finally, the MST is divided into a number of subtrees by removing its longest (i.e., maximum-weight) edges, and the resulting subtrees are the clusters. The literature [6] points out that this algorithm can obtain globally optimal clusters, but in fact the method cannot guarantee that data at a small distance from each other belong to the same class while data at a large distance belong to different classes. For example, given the undirected graph G shown in Figure 1, whose minimum spanning tree is shown in Figure 2, deleting the spanning tree's maximum-weight edge (x1, x4) yields the subgraph shown in Figure 3. The graph of Figure 3 has two subtrees, so according to the clustering algorithm of [6] the data set is divided into two clusters, namely {x1, x2, x3} and {x4, x5, x6}. However, the distance between x1 and x4 is 4 and they belong to different clusters, while the distance between x4 and x6 is 5 and they belong to the same cluster. Moreover, the algorithm finds the minimum spanning tree from the complete graph, and the cost of this process is very large: its time complexity is O(n²). The literature [7] uses essentially the same method as [6] and therefore has the same problem. The initial cluster center selection algorithm based on the connected components of a graph, proposed in [8], improves on the clustering algorithms of [6, 7] in both effect and efficiency, but, like [6, 7], it cannot guarantee that data at small distances belong to the same class
or that data at large distances belong to different classes. We therefore propose an initial cluster center selection algorithm based on the great groups of a graph and use this algorithm to obtain the initial cluster centers for the K-means clustering method.
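As an illustration of the MST clustering discussed in this section, the following Python sketch (a minimal illustration, not the paper's code; the point coordinates and the use of Euclidean distance are assumptions) runs Kruskal's algorithm on the complete weighted graph and stops when k connected components remain, which is equivalent to deleting the k-1 heaviest edges of the full minimum spanning tree:

```python
import itertools
import math

def mst_clusters(points, k):
    """Kruskal's algorithm on the complete weighted graph, stopped when k
    connected components remain; equivalent to building the full minimum
    spanning tree and deleting its k-1 heaviest edges."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression by halving
            i = parent[i]
        return i

    # Edges of the complete graph, sorted by Euclidean distance.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in itertools.combinations(range(n), 2)
    )
    components = n
    for _, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            if components == k:   # stopping here removes the heaviest MST edges
                break
            parent[ri] = rj
            components -= 1

    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())

# Hypothetical coordinates echoing the x1..x6 example of Figures 1-3.
points = [(0, 0), (1, 0), (0, 1), (4, 0), (5, 0), (4, 1)]
print(mst_clusters(points, 2))   # e.g. [[0, 1, 2], [3, 4, 5]]
```

As the text notes, this splits the data purely by the heaviest edges, which is why pairs such as (x1, x4) and (x4, x6) can end up assigned counter to their actual distances.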
III. K-MEANS CLUSTERING ALGORITHM
The basic idea of the K-means clustering algorithm is to randomly select k data objects, each representing the initial mean or center of one cluster. Each remaining data object is assigned to the nearest cluster according to its distance from each cluster center; then the mean or center of each cluster is recalculated. This process is repeated until the criterion function converges.
The criterion function E is defined as

$$E = \sum_{i=1}^{k} \sum_{x \in C_i} (x - \bar{x}_i)^2 \qquad (1)$$

where E is the sum of squared errors between all data objects in the data set and the centers of their clusters; x is a data object; $\bar{x}_i$ is the center of cluster $C_i$; k is the number of clusters; and $C_i$ is the i-th of the k clusters. The larger E is, the greater the distance between the data objects of a cluster and its center, and the lower the similarity within the cluster; conversely, the smaller E is, the higher the similarity between the data objects within each cluster.
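As a small illustration (a sketch, not from the paper; representing clusters as lists of coordinate tuples and centers as their means is an assumption), E can be computed directly from definition (1):

```python
def criterion_e(clusters, centers):
    """E: sum over clusters C_i of the squared Euclidean distance
    between each object x in C_i and the cluster center."""
    return sum(
        sum((xd - cd) ** 2 for xd, cd in zip(x, center))
        for cluster, center in zip(clusters, centers)
        for x in cluster
    )
```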
The K-means clustering algorithm is described as follows.
Input: a data set containing n data objects and the number of clusters k (the criterion function E is initially set to a large value M);
Output: k clusters satisfying the convergence of the criterion function.
1) Randomly select k data objects, each as the center of one initial cluster;
2) Compute the distance between each data object and each cluster center, and assign each object to the nearest cluster according to that distance;
3) Recalculate the mean of the objects in each cluster and update the cluster centers;
4) Compute the criterion function E;
5) If the criterion function E no longer changes, go to Step 6); otherwise go to Step 2);
6) End.
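A minimal runnable sketch of steps 1)-6) (an illustration, not the paper's implementation; treating data objects as coordinate tuples, keeping a center unchanged when its cluster becomes empty, and capping the number of iterations are assumptions):

```python
import math
import random

def k_means(data, k, max_iter=100):
    """Steps 1)-6): random initial centers, nearest-center assignment,
    mean update, stop when the criterion function E stops changing."""
    centers = random.sample(data, k)               # step 1)
    prev_e = float("inf")                          # E starts at a large value M
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]          # step 2)
        for x in data:
            nearest = min(range(k), key=lambda i: math.dist(x, centers[i]))
            clusters[nearest].append(x)
        for i, c in enumerate(clusters):           # step 3)
            if c:                                  # keep old center if empty
                centers[i] = tuple(sum(col) / len(c) for col in zip(*c))
        e = sum(math.dist(x, centers[i]) ** 2      # step 4)
                for i, c in enumerate(clusters) for x in c)
        if e == prev_e:                            # step 5): E unchanged
            break                                  # step 6)
        prev_e = e
    return clusters, centers
```

The random choice in step 1) is exactly what the next section replaces with a deterministic, great-group-based selection.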
IV. THE INITIAL CLUSTER CENTER SELECTION ALGORITHM BASED ON THE GREAT GROUP
Suppose the given data set is X = {x1, x2, ..., xn} and the value k (the desired number of clusters) is given. Representing the data set X by an undirected complete graph turns the clustering problem into a partitioning problem on an undirected complete graph.
Definition 1 (Weighted graph): G(X) = <V, E, W>, where the vertex set V = {xi | xi ∈ X}, with xi = (xi1, ..., xim) and xij the j-th attribute of data object xi (the m attributes of the data objects can be chosen to measure their similarity or dissimilarity); the edge set E = {(xi, xj) | xi, xj ∈ X and i ≠ j}; and W = {wij | wij = d(xi, xj) is the weight of edge (xi, xj), where d(xi, xj) is the distance (or dissimilarity) between xi and xj}. Then G(X) is an undirected (edge-)weighted complete graph.
The distance d(x, y) between x and y is chosen according to the specific circumstances; for example, the Euclidean distance, the absolute distance or other distance measures can be used.
Theorem 1: A given finite data set X can be represented by an (edge-)weighted undirected complete graph.
Proof: This follows directly from Definition 1.
Definition 2 (Great group): For a graph G(V, E) and V′ ⊆ V, if the subgraph G′(V′, E′) induced by the vertex set V′ is a complete graph, G′ is called a group of the graph G; if, moreover, for every v ∈ V with v ∉ V′ the subgraph induced by V′ ∪ {v} is not a complete graph, then G′ is called a great group (maximal complete subgraph) of G.
For the given data set X = {x1, x2, ..., xn}, the task is to cluster the data objects of X. First a measure of the distance between data objects is defined (chosen); then the distance between every two data objects in X is computed, which yields the undirected (edge-)weighted graph G(X).
Then a threshold α is chosen, and the edges of G(X) whose weights are greater than α are removed, giving a subgraph G0 of G(X) in which the distance between any two adjacent nodes (data objects) does not exceed α. All great groups of G0 are found; suppose there are p of them (p is required to be slightly greater than k, which can be achieved by adjusting the threshold α). The node (data object) set Xi of the i-th great group is taken as a subset, so we obtain a cover Y = {X1, X2, ..., Xp} of the data set X consisting of p subsets. Counting the number of data objects in every subset of Y, the center c1 of the subset containing the most data objects is taken as the first initial cluster center (denoted p1). Next, the distance from p1 to the center of each of the other subsets of Y is computed, and the center c2 of the subset farthest from p1 is selected as the second initial cluster center (denoted p2). And so on, by

$$p_i = \max_{j=1}^{i-1} d(c_i, p_j) \qquad (2)$$

where ci is the center of a subset of Y, i = 3, 4, ..., p, and pi is taken as the i-th initial cluster center, i = 3, 4, ..., k. In this way formula (2) yields the k initial cluster centers.
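To make Definition 2 and the covering step concrete, the following sketch (an illustration, under the assumption that the networkx library is acceptable; nx.find_cliques enumerates maximal cliques, which is what the paper calls great groups) thresholds the weighted complete graph at α and returns the great groups of the resulting subgraph G0:

```python
import itertools
import math
import networkx as nx

def great_groups(points, alpha):
    """Keep only edges with d(x, y) <= alpha to obtain G0, then return the
    node sets of all great groups (maximal cliques) of G0."""
    g0 = nx.Graph()
    g0.add_nodes_from(range(len(points)))
    for i, j in itertools.combinations(range(len(points)), 2):
        if math.dist(points[i], points[j]) <= alpha:
            g0.add_edge(i, j)
    # Bron-Kerbosch enumeration of maximal cliques.
    return [list(c) for c in nx.find_cliques(g0)]
```

Re-running great_groups with a smaller α whenever the number of great groups p falls below k is one simple way to meet the requirement that p be slightly greater than k.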
Based on the above basic idea, the initial cluster center selection algorithm for K-means clustering based on the great group is described as follows.
Algorithm 1: The initial cluster center selection algorithm for K-means clustering based on the great group
Input: the data set to be clustered X = {x1, x2, ..., xn} and the value k (the desired number of clusters);
Output: the clustering result.
1. For the given data set X = {x1, x2, ..., xn}, choose an appropriate distance measure d between data objects and build the undirected (edge-)weighted graph G(V, E, W) (where V = X; for xi, xj ∈ X with xi ≠ xj, edge (xi, xj) ∈ E; and W = {wij | wij = d(xi, xj) is the weight of edge (xi, xj)}); also build the distance matrix MX of the data objects of X;
2. Select a threshold α;
3. For the threshold value α do
3.1 Build the (edge-)weighted graph G(V, E, W) (where V = X and, for u, v ∈ X, edge (u, v) ∈ E if and only if d(u, v) ≤ α);
3.2 Use the algorithm for finding all great groups of an undirected graph [9] to compute all great groups of G(V, E, W). Suppose G(V, E, W) has p great groups; take the node (data object) set of the i-th great group of G(V, E, W) as a subset Xi, thus obtaining a cover Y = {X1, X2, ..., Xp} of the data set X consisting of p subsets;
4. If p < k, take α′ < α, set α = α′ and go to Step 3; otherwise go to Step 5;
5. Count the number of data objects of every subset in Y; the center c1 of the subset containing the most data objects is taken as the first initial cluster center of the clustering (denoted p1);
6. Next, compute the distance from p1 to the center of each of the other subsets in Y; the center c2 of the subset farthest from p1 is selected as the second initial cluster center (denoted p2);
7. for (i = 3; i ≤ k; i++) select pi by formula (2),

$$p_i = \max_{j=1}^{i-1} d(c_i, p_j)$$

(where ci is the center of a subset of Y, i = 3, 4, ..., p, and pi is taken as the i-th initial cluster center, i = 3, 4, ..., k; this gives the k initial cluster centers);
8. Starting from these k initial cluster centers, call the K-means algorithm to obtain the clustering result;
9. Algorithm ends.
Theorem 2: For a given finite data set X = {x1, x2, ..., xn} and a chosen distance d between data objects, Algorithm 1 correctly computes the initial cluster "centers" of the K-means clustering method, and its time complexity is O(n²).
Proof: The correctness of Algorithm 1 follows from Theorem 1 and the definitions. We analyze the time complexity of computing the initial cluster "centers" of the K-means clustering method, which corresponds to Steps 1 to 7 of Algorithm 1. In Step 1 the time is spent mainly on representing the finite data set X = {x1, x2, ..., xn} and building the (edge-)weighted graph G(V, E, W). If the undirected graph G is stored with adjacency lists [10], the lists contain 2m edge entries, so representing G costs O(|E|), where m = |E| = n(n-1)/2, and the time complexity of Step 1 is O(n²). In each main loop of Steps 2 to 4, building the undirected graph in Step 3.1 requires screening the n(n-1) edge weights, so the total time complexity of Step 3.1 is O(n²). Since the algorithm for computing all great groups of a graph from edge adjacency lists has time complexity O(|E|) [11], the time complexity of Step 3.2 is O(|E|). In summary, one execution of Step 3 has time complexity O(n²) + O(|E|) = O(n²) (since |E| ≤ n²). The time complexity of Step 4 is easily seen to be O(1), so each cycle of Steps 2 to 4 costs O(n²). Because the number of choices of the threshold α in Algorithm 1 can be regarded as a constant, the number of cycles is constant and the overall time complexity of Steps 2 to 4 is O(n²). The time complexity of Steps 5 to 7 is easily seen to be O(n). Combining Step 1, Steps 2 to 4 and Steps 5 to 7 shows that the time complexity of computing the initial cluster "centers" of the K-means method with Algorithm 1 is O(n²). This completes the proof.
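Putting the pieces together, the sketch below (an illustration under stated assumptions, not the paper's code: it reuses great_groups from the earlier sketch, takes a fixed α instead of adjusting it in Steps 3-4, and reads formula (2) as the farthest-first rule of maximizing the minimum distance to the centers already chosen) computes the k initial centers of Steps 5-7:

```python
import math

def initial_centers(points, k, alpha):
    """Steps 5-7 of Algorithm 1: centers of the great-group cover Y,
    largest subset first, then repeatedly the subset center farthest
    from the already selected centers (one reading of formula (2))."""
    subsets = great_groups(points, alpha)          # cover Y; assumes p >= k
    dim = len(points[0])
    centroids = [
        tuple(sum(points[i][d] for i in s) / len(s) for d in range(dim))
        for s in subsets
    ]
    # Step 5: the center of the subset with the most data objects is p1.
    order = sorted(range(len(subsets)),
                   key=lambda i: len(subsets[i]), reverse=True)
    centers = [centroids[order[0]]]
    candidates = [centroids[i] for i in order[1:]]
    # Steps 6-7: repeatedly pick the candidate farthest from chosen centers.
    while len(centers) < k and candidates:
        best = max(candidates,
                   key=lambda c: min(math.dist(c, p) for p in centers))
        centers.append(best)
        candidates.remove(best)
    return centers
```

These centers can then replace the random initialization in the k_means sketch of Section III, which is exactly Step 8 of Algorithm 1.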
V. EXPERIMENTAL RESULTS AND ANALYSIS
The test data in this paper are four experimental data sets from the UCI database: balance, car, iris and wine. UCI is a database commonly used internationally for testing machine learning and data mining algorithms. The data in this repository have known classifications, so accuracy can be used to represent the quality of a clustering directly. In order to verify the accuracy of the algorithm, the test data sets were used in their original state, without any manual processing. We compare the traditional K-means algorithm with randomly selected initial cluster centers against the selection algorithm of the initial cluster "centers" based on the great group proposed in this paper. Since the initial cluster "centers" obtained by the proposed selection algorithm are deterministic, a single experiment suffices for it: its maximum, minimum and average clustering accuracy are the same, as are the maximum, minimum and average values of the criterion function E.
j 1
(Where ci is set as the center of the neutron of Y, i = 3,
4, ..., p.) pi is as the initial cluster of the i-th cluster class
center, which i = 3,4, ..., k.. (This will be the initial
cluster of K class centers);
8. From this K initial cluster class centers starting,
calling K-means algorithm, obtained clustering results
9. Algorithm ends.
Theorem 2 For a given finite set of data X = {x1, x2, ...,
xn} and the distance d selected between a data, Algorithm 1
correctly calculate the K-means clustering method the initial
cluster class "center" it’s time complexity is O (n2).
Proof: The correctness of Algorithm 1 is known by
Theorem 1 and the definition. We analyze and calculate the
initial cluster class "center" time complexity of K-means
clustering method. The initial cluster class "center" of Kmeans clustering computing is a step in the algorithm 1 to 7.
Because in step 1, the time spent mainly on representation of
the limited data set X = {x1, x2, ..., xn} and the establishment
of the (edge) weighted graph G (V, E, W). If Undirected
graph G use the edges of the graph G which is adjacent to
node said [10] of the table 2m edges. So that the complexity
1
of Undirected graph G is O(︱E︱)(m=︱E︱= n(n-1))
2
and the time complexity of Step 1 is O (n2). In each main
loop of step 2 to 4, the establishment of an undirected graph G
in step 3.1 is based on the weight of the n (n-1) edges
screening time, so the total time complexity of Step 3.1 is O
(n2). And because the calculation based on edge adjacency
table of algorithm time complexity of all the great groups of
map is O(︱E︱) [11], the time complexity of the algorithm
step 3.2 is O(︱E︱).In summary, the first execution time
complexity of step 3 is O(n2)+ O(︱E︱)=O(n2)(∵︱E︱≤
The traditional K-means algorithm with randomly selected initial cluster centers was run in 10 experiments; the experimental results are shown in Tables I and II.

TABLE I  COMPARISON OF THE CRITERION FUNCTION E FOR THE K-MEANS ALGORITHM AND THE PAPER'S ALGORITHM

Algorithm              Data Sets   Emax        Emin        Eavg
K-means                balance     396.87643   332.34532   363.74312
                       car         3.1632E3    3.1321E3    3.1543E3
                       iris        155.36851   79.83192    96.425347
                       wine        3.15783E4   2.9863E4    3.07964E4
This paper's algorithm balance     335.65312   335.65312   335.65312
                       car         3.1421E3    3.1421E3    3.1421E3
                       iris        80.94756    80.94756    80.94756
                       wine        2.9634E4    2.9634E4    2.9634E4

TABLE II  COMPARISON OF CLUSTERING ACCURACY (%) FOR THE K-MEANS ALGORITHM AND THE PAPER'S ALGORITHM

Algorithm              Data Sets   Highest     Lowest      Average
K-means                balance     0.713359    0.431357    0.538663
                       car         0.732235    0.635536    0.681272
                       iris        0.732546    0.576543    0.625325
                       wine        0.763592    0.495132    0.693651
This paper's algorithm balance     0.714371    0.714371    0.714371
                       car         0.793352    0.793352    0.793352
                       iris        0.893542    0.893542    0.893542
                       wine        0.757356    0.75356     0.757356
VI. CONCLUSION
The initial cluster center selection algorithm for K-means based on the great group, proposed in this paper, uses the great groups of a graph to determine the cluster centers of the K-means algorithm, laying a good foundation for the further execution of K-means. Compared with the traditional K-means algorithm with randomly selected initial cluster centers, it improves the accuracy and stability of the algorithm while avoiding the randomness in accuracy and efficiency caused by randomly selected cluster centers. Simulation results show that the algorithm avoids the arbitrariness of the initial cluster center selection, significantly improves the clustering accuracy on the data samples, significantly reduces the number of iterations of the algorithm, and significantly improves the clustering effect.
REFERENCES
[1] Han J., Kamber M. (translated by Fan M., Meng X.). Data Mining: Concepts and Techniques [M]. Beijing: Mechanical Industry Press, 2001: 232-235.
[2] Kurniawan A., Benech N., Tao Yufei. Towards High-dimensional Clustering [J]. COMP, November 1999: 1-2.
[3] Lee Tae, Shen Xianghong. Double fuzzy K-means algorithm classifier design [J]. Computer Measurement and Control, 2008, 16(9): 1325-1326.
[4] Su Banners, Xue Huifeng, Zhan Hailiang. Division-based K-means algorithm for optimizing the initial cluster centers [J]. Microelectronics and Computer, 2009, 26(1): 8-11.
[5] Zhang Jianpei, Yang Yue, Yang Jing, et al. K-means algorithm for selecting the initial cluster centers based on the optimal division [J]. Journal of System Simulation, 2009, 21(9): 2586-2590.
[6] Qian Yuntao, Zhao Rongchun, Xie Weixin. Robust clustering: a method based on graph theory and objective functions [J]. Journal of Electronics, 1998, 26(2): 91-94.
[7] Li Chunsheng, Wang Yaonan. A new method for initializing cluster centers [J]. Control Theory and Applications, 2010, 27(10): 1435-1440.
[8] Zhou Haiyan, Bai Xiaoling. Graph-based choice of the initial cluster centers in k-means clustering [J]. Computer Measurement and Control, 2010, 18(9): 2167-2169.
[9] Chen Anlong, Tang Changjie, Wang Tao, et al. Algorithm for mining association rules based on great groups and FP-Tree [J]. Journal of Software, 2004, 15(8): 1198-1207.
[10] Aho A. V., Hopcroft J. E., Ullman J. D. The Design and Analysis of Computer Algorithms [M]. New York: Addison-Wesley, 1974: 189-195.
[11] Hu Gang, Xu Ming, Liu Lixia, et al. A group-based spectrum sensing algorithm in cognitive wireless networks [J]. Journal of Software, 2011, 22(2): 298-312.