Efficient Data Clustering Over Peer-to-Peer Networks
Ahmed Elgohary and Mohamed A. Ismail
Computer and Systems Engineering Department
Faculty of Engineering, Alexandria University
Alexandria, Egypt
[email protected], [email protected]
Abstract—Due to the dramatic increase in data volumes in different applications, it is becoming infeasible to keep these data in one centralized machine, and more and more natural to deal with distributed databases and networks.
That is why distributed data mining techniques have been
introduced. One of the most important data mining problems
is data clustering. While many clustering algorithms exist for
centralized databases, there is a lack of efficient algorithms
for distributed databases. In this paper, an efficient algorithm
is proposed for clustering distributed databases. The proposed
methodology employs an iterative optimization technique to achieve a better clustering objective. The experimental results reported in this paper show the superiority of the proposed technique over a recently proposed algorithm based on a distributed version of the well-known K-Means algorithm (Datta et al., 2009) [1].
Keywords—Distributed Data Mining, Clustering Over P2P Networks, Distributed K-Means Clustering, Minimum Variance Clustering, Iterative Optimization
I. INTRODUCTION
For several enterprise applications, the amount of data that needs to be maintained has grown beyond what can feasibly be kept in a single machine. Reliance on distributed storage has become either the only or the most efficient way of maintaining large data volumes. Consequently, mining these data can no longer be done on one centralized machine, and the design of mining algorithms specifically for distributed data has become an active research area.
Peer-to-Peer (P2P) systems are distributed systems without centralized control in which each node shares and
exchanges data across a network [2]. Each node is connected
directly to a number of nodes (its peers). In order for a node
to communicate with any other node in the network, it has
to do so through one of its peers.
Data clustering is one of the major data mining problems. One of the most commonly used clustering algorithms is the K-Means algorithm. The goal of this algorithm is to partition a dataset into separate groups (clusters), each represented by its centroid. The partitioning is based on minimizing the sum of squared Euclidean distances between patterns and their corresponding cluster centers. It was shown in [3] that this algorithm converges to a local minimum solution.
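For reference, the following is a minimal sketch of one K-Means iteration (the assignment step followed by the centroid update) using NumPy; the names are ours and are not tied to any particular implementation.

import numpy as np

def kmeans_iteration(X, centroids):
    """One K-Means iteration: assign each pattern to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    # squared Euclidean distance from every pattern to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

Iterating these two steps decreases the sum of squared errors monotonically, which is why the procedure converges to a local minimum [3].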
One of the main problems of the K-Means algorithm is its sensitivity to the initial configuration (the initial centroids and the required number of clusters). Ismail and Kamel [4]
proposed a variation of the classical K-Means algorithm
that managed to alleviate this problem. Additionally, their
technique showed a consistent superiority over the classical
K-Means algorithm.
Utilizing the idea of finding the best move for each pattern (data item) from its current cluster to another cluster in order to improve the overall objective function, a clustering algorithm for distributed data over Peer-to-Peer networks is developed in this paper. The results obtained by the developed algorithm are compared with the recently proposed distributed K-Means algorithm LSP2P [1].
The rest of this paper is organized as follows: Related
work on distributed data clustering is reviewed in section
II. The development of the newly proposed algorithm is
given in section III. The evaluation process and results of
multiple experiments are given in section IV, followed by
the conclusion of this paper.
II. RELATED WORK
Clustering distributed data has been addressed in many publications over the past decade [4]-[12]. In [6], the authors presented collective Principal Component Analysis (PCA) as a new method for clustering distributed high-dimensional data. In [9], a hierarchical clustering algorithm (HP2PC) was proposed for distributed data over a multi-layer overlay network of peer neighbourhoods. In [10], the authors suggested approximating the distributed high-dimensional data as precisely as possible with a specified number of bytes before sending them to a centralized server that runs the clustering algorithm. Jin et al. [12] presented a new algorithm,
called Fast and Exact K-means Clustering (FEKM), which
typically requires only one or a small number of passes on
the entire dataset, and provably produces the same cluster
centers as reported by the original k-means algorithm. The
algorithm uses sampling to create initial cluster centers, and
then takes one or more passes over the entire dataset to adjust
these cluster centers. Januzaj et al. [13] proposed clustering
the data locally and extracting suitable representatives out
of these clusters. These representatives are sent to a global
server where the complete clustering based on the local
representatives is restored. This approach is characterized by local clusterings that are carried out quickly and independently of each other. Furthermore, the algorithm requires low transmission cost, as the number of transmitted representatives is much smaller than the cardinality of the complete dataset.
One of the most recent works on distributed data clustering is presented in [1]. The authors introduced an
elegant synchronization technique for clustering distributed
data via the K-Means algorithm. The basic idea behind this
algorithm is that each node runs a single K-Means iteration
over its local data then, the resulting centroids are used
to synchronize the clustering results with the neighbour
nodes. Each node sends its centroids to its neighbours.
Upon receiving the centroids of a certain cluster obtained at all the
neighbouring nodes, each node modifies its centroid for that
cluster to be the weighted average of the received centroids
and its current local centroid. Afterwards, each node starts the next iteration using the obtained average centroids, until the stopping criterion is satisfied. The authors also described how the algorithm should behave in dynamic networks, where the network structure or the data may change. The
evaluation results showed that this algorithm achieves high
accuracy compared to the classical centralized K-Means.
The communication efficiency of the algorithm was also demonstrated.
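Schematically, one such synchronization round at a single node could look as follows. This is our own reconstruction from the description above, not the authors' code; the size-weighted merge reappears as Eq. (4) in Section III.

import numpy as np

def lsp2p_round(X, centroids, nbr_msgs):
    """One LSP2P-style round at a node (schematic reconstruction).

    X         : (n, d) local patterns
    centroids : (c, d) current centroids
    nbr_msgs  : list of (centroids, sizes) pairs received from neighbours
    """
    # 1) a single local K-Means iteration
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sizes = np.bincount(labels, minlength=len(centroids)).astype(float)
    local = np.array([X[labels == k].mean(axis=0) if sizes[k] else centroids[k]
                      for k in range(len(centroids))])
    # 2) size-weighted average with the neighbours' centroids, per cluster
    num = local * sizes[:, None]
    den = sizes.copy()
    for m, s in nbr_msgs:
        num += m * s[:, None]
        den += s
    return num / np.maximum(den, 1.0)[:, None], sizes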
III. THE PROPOSED ALGORITHM
Basically, two issues need to be considered when developing a distributed data clustering algorithm: 1) how does the algorithm behave at each node? and 2) how do the nodes synchronize their local results? The same synchronization method proposed in [1] is followed here. Each node runs an iteration over its local data and then synchronizes the obtained centroids with only its direct neighbours. Each node updates its local centroids by performing a weighted averaging of its current centroids and the centroids of its neighbours. Then, each node starts the next iteration, until the stopping criteria are satisfied.
The way the algorithm behaves locally at each node is the major concern of the proposed algorithm. The basic idea is the following: starting from an initial cluster assignment for all the patterns, the algorithm tries to move each pattern from its current cluster to another cluster as long as this move reduces the overall objective function. The key question is how the algorithm assesses the effect of a move, i.e., whether the objective function would increase or decrease. Of course, the assessment needs to be done efficiently, without recalculating the overall objective function. Also, once a move decision is taken, the centroids of both the old and the new clusters of the moved pattern must be updated efficiently. To this end, one formula needs to be developed to assess the effect of a move and another to update the cluster centroids according to a move decision.
Consider the problem where we have a dataset D distributed over N nodes in a Peer-to-Peer network, where each node can directly communicate with its neighbours. The required number of clusters c is initially given to the algorithm. The goal is to find a cluster assignment for each pattern that minimizes the overall sum-of-squared error (the objective function) J_e, where

    J_e = \sum_{i=1}^{c} J_i    (1)

and the effective error of cluster i over all nodes is defined by

    J_i = \sum_{j=1}^{N} J_{ij}    (2)

For a certain node j, the effective error of cluster i is defined by

    J_{ij} = \sum_{x \in D_{ij}} \| x - m_{ij} \|^2    (3)

where D_{ij} is the set of patterns in node j that are assigned to cluster i, and m_{ij} is the centroid of cluster i at node j.
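In code, Eq. (3) is a plain sum of squared distances; a minimal sketch in our notation (NumPy assumed):

import numpy as np

def cluster_error(D_ij, m_ij):
    """J_ij of Eq. (3): sum of squared distances of the patterns of
    cluster i at node j from the cluster centroid m_ij."""
    return float(((np.asarray(D_ij) - m_ij) ** 2).sum())

# J_i of Eq. (2) sums J_ij over all nodes j, and J_e of Eq. (1)
# sums J_i over all clusters i.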
As stated above, after each iteration, each node updates the centroid of each cluster to be a weighted average of the centroids of that cluster obtained by its neighbours:

    m'_{ij} = \frac{m_{ij} s_{ij} + \sum_{r \in N_j} m_{ir} s_{ir}}{s_{ij} + \sum_{r \in N_j} s_{ir}}    (4)

where s_{ij} is the size of cluster i at node j, N_j is the set of neighbour nodes of node j, and m'_{ij} is the updated centroid of cluster i at node j.
Assuming that a pattern \hat{x} at node j is moved to cluster i, the centroid m_{ij} becomes m^*_{ij}, where

    m^*_{ij} = \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1}    (5)

and

    s^A_{ij} = s_{ij} + \sum_{r \in N_j} s_{ir}    (6)
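As a direct transcription of Eqs. (5) and (6), a small sketch (the names are ours):

def effective_size(s_ij, nbr_sizes):
    """Eq. (6): local size of cluster i at node j plus the sizes of the
    same cluster at the neighbouring nodes."""
    return s_ij + sum(nbr_sizes)

def centroid_after_add(m_ij, s_A_ij, x_hat):
    """Eq. (5): centroid of cluster i at node j after pattern x_hat joins."""
    return (m_ij * s_A_ij + x_hat) / (s_A_ij + 1)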
The objective function of cluster i at node j becomes:

    J^*_{ij} = \sum_{x \in D_{ij}} \| x - m^*_{ij} \|^2 + \| \hat{x} - m^*_{ij} \|^2
             = \sum_{x \in D_{ij}} \left\| x - \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1} \right\|^2 + \left\| \hat{x} - \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1} \right\|^2
             = J_{ij} + \frac{s_{ij} + (s^A_{ij})^2}{(s^A_{ij} + 1)^2} \| \hat{x} - m_{ij} \|^2 - \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})    (7)

where X_{ij} = \sum_{x \in D_{ij}} x and, for two patterns a and b of d dimensions, the dot product DP is defined as

    DP(a, b) = \sum_{i=1}^{d} a_i b_i
Similarly, when removing a pattern \hat{x} from cluster i at node j, the objective function changes to:

    J^*_{ij} = J_{ij} - \frac{(s^A_{ij})^2 - s_{ij}}{(s^A_{ij} - 1)^2} \| \hat{x} - m_{ij} \|^2 + \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})    (8)

The transfer of \hat{x} from a cluster j to a cluster i is advantageous if the decrease in J_{jj} is greater than the increase in J_{ij}. This is the case if

    \frac{s_{ij} + (s^A_{ij})^2}{(s^A_{ij} + 1)^2} \| \hat{x} - m_{ij} \|^2 - \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})
    < \frac{(s^A_{jj})^2 - s_{jj}}{(s^A_{jj} - 1)^2} \| \hat{x} - m_{jj} \|^2 + \frac{2}{s^A_{jj} + 1} DP(\hat{x} - m_{jj}, X_{jj} - m_{jj})

Utilizing the developed formulas, the core iteration of our algorithm is described in Algorithm 1.
Algorithm 1 The Core Iteration at Each Node j
  {n is the total number of patterns in node j}
  {c is the required number of clusters}
  for l = 1 to n do
    {Assume x_l currently belongs to cluster C_i and n_i is the number of patterns in C_i}
    if n_i ≠ 1 then
      Δ_i = ((s^A_ij)^2 − s_ij) / (s^A_ij − 1)^2 · ||x_l − m_ij||^2 + 2 / (s^A_ij + 1) · DP(x_l − m_ij, X_ij − m_ij)
      Δ_min = ∞
      for k = 1 to c do
        if k ≠ i then
          Δ_k = (s_kj + (s^A_kj)^2) / (s^A_kj + 1)^2 · ||x_l − m_kj||^2 − 2 / (s^A_kj + 1) · DP(x_l − m_kj, X_kj − m_kj)
          if Δ_k < Δ_min then
            Δ_min = Δ_k
            k1 = k
          end if
        end if
      end for
      if Δ_min < Δ_i then
        Move x_l to C_k1
        Update m_ij and m_k1j according to Eq. (5)
        Update X_ij and X_k1j
        Update s_ij, s_k1j, s^A_ij, s^A_k1j and J_j
      end if
    end if
  end for
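As a concrete companion to Algorithm 1, the following Python sketch implements the core iteration for a single node. It is our own reconstruction: the data structures and names are ours, and the removal counterpart of the Eq. (5) centroid update is our assumption, not stated in the paper.

import numpy as np

def DP(a, b):
    """Dot product of two d-dimensional patterns (defined after Eq. (7))."""
    return float(np.dot(a, b))

def delta_remove(x, m, s, s_A, X):
    """Eq. (8): change in J when pattern x leaves a cluster with centroid m,
    local size s, effective size s_A (Eq. (6)) and pattern sum X."""
    return ((s_A**2 - s) / (s_A - 1)**2) * np.sum((x - m)**2) \
        + (2.0 / (s_A + 1)) * DP(x - m, X - m)

def delta_add(x, m, s, s_A, X):
    """Eq. (7): change in J when pattern x joins a cluster."""
    return ((s + s_A**2) / (s_A + 1)**2) * np.sum((x - m)**2) \
        - (2.0 / (s_A + 1)) * DP(x - m, X - m)

def core_iteration(data, labels, m, s, s_A, X):
    """One pass of Algorithm 1 over the local patterns of one node."""
    c = len(m)
    for l, x in enumerate(data):
        i = labels[l]
        if s[i] == 1:                        # never empty a cluster
            continue
        d_i = delta_remove(x, m[i], s[i], s_A[i], X[i])
        d_min, k1 = np.inf, -1
        for k in range(c):
            if k != i:
                d_k = delta_add(x, m[k], s[k], s_A[k], X[k])
                if d_k < d_min:
                    d_min, k1 = d_k, k
        if d_min < d_i:                      # the move lowers the objective
            labels[l] = k1
            m[k1] = (m[k1] * s_A[k1] + x) / (s_A[k1] + 1)   # Eq. (5)
            m[i] = (m[i] * s_A[i] - x) / (s_A[i] - 1)       # removal analogue (our assumption)
            X[i] = X[i] - x
            X[k1] = X[k1] + x
            s[i] -= 1;  s[k1] += 1
            s_A[i] -= 1;  s_A[k1] += 1
    return labels, m, s, s_A, X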
IV. PERFORMANCE EVALUATION

In order to evaluate the performance of the proposed algorithm, a dataset of 4-dimensional points sampled from 4 multivariate Gaussian distributions is used, with standard deviation σ = 10 and means (150, 150, 0, 100), (100, 100, 10, 100), (50, 50, 50, 50) and (0, 0, 20, 30).

Two network configurations are used for experimentation: 1) a network of 10 nodes with 50k points distributed over the network, and 2) a network of 100 nodes with 250k points distributed over the network. Each point has an equal probability of being sampled from any of the 4 distributions mentioned above. Following [1], the data are distributed randomly over the network; the number of points per node follows a Zipfian distribution with parameter 0.8.

For the Peer-to-Peer network simulation, PeerSim [14] is used due to its scalability and extensibility. Each node is connected to 4 neighbours. The message drop probability and the message delay were always set to 0, as they do not affect the accuracy of the clustering algorithms.

Twenty-five random initial centroids are generated. In each run, one of the centroids is used to initialize the LSP2P algorithm [1] and our proposed algorithm, and the overall objective function achieved by each algorithm is recorded. This process is repeated for cluster sizes C = 4, 6, 8 and 10.
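A sketch of how such a dataset could be generated is shown below (our own reconstruction of the described setup using NumPy; the truncated Zipf weighting is an assumption, since the exact sampling scheme is not spelled out):

import numpy as np

rng = np.random.default_rng(0)
MEANS = np.array([(150, 150, 0, 100), (100, 100, 10, 100),
                  (50, 50, 50, 50), (0, 0, 20, 30)], dtype=float)

def sample_points(n, sigma=10.0):
    """Draw n 4-dimensional points, each from one of the 4 Gaussians
    chosen with equal probability."""
    idx = rng.integers(0, len(MEANS), size=n)
    return MEANS[idx] + rng.normal(0.0, sigma, size=(n, 4))

def zipfian_counts(n_points, n_nodes, a=0.8):
    """Split n_points over n_nodes with Zipf-like weights k^(-a)."""
    w = 1.0 / np.arange(1, n_nodes + 1) ** a
    counts = np.floor(n_points * w / w.sum()).astype(int)
    counts[0] += n_points - counts.sum()    # assign the remainder to node 1
    return counts

# e.g. the 10-node configuration with 50k points:
node_data = [sample_points(n) for n in zipfian_counts(50_000, 10)]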
Figures 1, 2, 3 and 4 show the obtained results for a network of size 10. Figures 5, 6, 7 and 8 show the obtained results for a network of size 100.

Figure 1. 10 Nodes - 4 Clusters
Figure 2. 10 Nodes - 6 Clusters
Figure 3. 10 Nodes - 8 Clusters
Figure 4. 10 Nodes - 10 Clusters
[Each figure plots the objective function against the run number for LSP2P and our algorithm.]
The figures show that our algorithm is superior to the recently proposed LSP2P algorithm when both are run in non-optimal configurations (cluster sizes of 6, 8 and 10), as our algorithm almost always achieves a lower clustering objective function (except for 1 run out of 150).
Figure 5. 100 Nodes - 4 Clusters
[Objective function against run number for LSP2P and our algorithm.]

V. CONCLUSIONS

In this paper, an efficient algorithm for clustering distributed databases is proposed. The algorithm employs iterative optimization to cluster distributed databases organized in the form of a Peer-to-Peer network. Experimental results reported in this paper show the superiority of the proposed methodology over a recently proposed algorithm based on a distributed version of the well-known K-Means algorithm. It is envisaged that the proposed algorithm will find extensive applications in distributed clustering where efficient solutions are required.
Figure 6. 100 Nodes - 6 Clusters
Figure 7. 100 Nodes - 8 Clusters
Figure 8. 100 Nodes - 10 Clusters
[Objective function against run number for LSP2P and our algorithm.]

REFERENCES

[1] S. Datta, C. M. Giannella, and H. Kargupta, "Approximate distributed k-means clustering over a peer-to-peer network," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, 2009.

[2] A. Ismail, A. Barbar, and Z. Ismail, "P2P simulator for queries routing using data mining," International Journal of Database Management Systems (IJDMS), vol. 3, no. 3, 2011.

[3] S. Selim and M. A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, 1984.

[4] M. A. Ismail and M. S. Kamel, "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, vol. 22, no. 1, pp. 75–89, 1989.

[5] S. Parthasarathy and M. Ogihara, "Clustering distributed homogeneous datasets," in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Lyon, France, 2000.

[6] H. Kargupta, W. Huang, K. Sivakumar, and K. Johnson, "Distributed clustering using collective principal component analysis," Knowledge and Information Systems, vol. 3, no. 4, pp. 422–448, 2001.

[7] T. Li, S. Zhu, and M. Ogihara, "Algorithms for clustering high dimensional and distributed data," Intelligent Data Analysis, vol. 7, no. 4, pp. 305–326, 2003.

[8] G. Forman and B. Zhang, "Distributed data clustering can be efficient and exact," ACM SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 34–38, 2000.

[9] K. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, 2009.

[10] P. Kunath, H. Kriegel, M. Pfeifle, and M. Renz, "Approximated clustering of distributed high-dimensional data," in Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hanoi, Vietnam, 2005.

[11] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta, "Clustering distributed data streams in peer-to-peer environments," Information Sciences, vol. 176, no. 14, pp. 1952–1985, 2006.

[12] R. Jin, A. Goswami, and G. Agrawal, "Fast and exact out-of-core and distributed k-means clustering," Knowledge and Information Systems, vol. 10, no. 1, pp. 17–40, 2006.

[13] E. Januzaj, H. P. Kriegel, and M. Pfeifle, "Towards effective and efficient distributed clustering," in International Workshop on Clustering Large Data Sets, ICDM, Melbourne, FL, 2003.

[14] M. Jelasity, A. Montresor, G. P. Jesi, and S. Voulgaris, "The PeerSim simulator," http://peersim.sf.net.

[15] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Chichester: Wiley, 1991.