Efficient Data Clustering Over Peer-to-Peer Networks
Ahmed Elgohary and Mohamed A. Ismail
Computer and Systems Engineering Department
Faculty of Engineering, Alexandria University
Alexandria, Egypt
[email protected], [email protected]
Abstract—Due to the dramatic increase in data volumes in different applications, it is becoming infeasible to keep these data in one centralized machine, and more and more natural to deal with distributed databases and networks.
That is why distributed data mining techniques have been
introduced. One of the most important data mining problems
is data clustering. While many clustering algorithms exist for
centralized databases, there is a lack of efficient algorithms
for distributed databases. In this paper, an efficient algorithm
is proposed for clustering distributed databases. The proposed
methodology employs an iterative optimization technique to achieve a better clustering objective. The experimental results reported in this paper show the superiority of the proposed technique over a recently proposed algorithm based on a distributed version of the well-known K-Means algorithm (Datta et al., 2009) [1].
Keywords—Distributed Data Mining, Clustering Over P2P Networks, Distributed K-Means Clustering, Minimum Variance Clustering, Iterative Optimization
I. INTRODUCTION
For several enterprise applications, the amount of data that needs to be maintained has grown beyond what can feasibly be kept in a single machine. Reliance on distributed storage has become either the only or the most efficient way of maintaining large data volumes. Consequently, mining these data can no longer be done on one centralized machine, and the design of mining algorithms specifically for distributed data has become an active research area.
Peer-to-Peer (P2P) systems are distributed systems without centralized control in which each node shares and
exchanges data across a network [2]. Each node is connected
directly to a number of nodes (its peers). In order for a node
to communicate with any other node in the network, it has
to do so through one of its peers.
Data clustering is one of the major data mining problems. One of the most commonly used clustering algorithms is the K-Means algorithm. The goal of this algorithm is to partition a dataset into separate groups (clusters), each represented by its centroid. The partitioning is based on minimizing the sum of squared Euclidean distances between patterns and their corresponding cluster centers. It was shown in [3] that this algorithm converges to a local minimum solution.
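For reference, the following is a minimal sketch of one K-Means iteration (the assignment step followed by the centroid update) using NumPy; the names are ours and are not tied to any particular implementation.

import numpy as np

def kmeans_iteration(X, centroids):
    """One K-Means iteration: assign each pattern to its nearest
    centroid, then recompute each centroid as the mean of its cluster."""
    # squared Euclidean distance from every pattern to every centroid
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    new_centroids = np.array([
        X[labels == k].mean(axis=0) if np.any(labels == k) else centroids[k]
        for k in range(len(centroids))
    ])
    return labels, new_centroids

Iterating these two steps decreases the sum of squared errors monotonically, which is why the procedure converges to a local minimum [3].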
One of the main problems of the K-Means algorithm is its sensitivity to the initial configuration (the initial centroids and the required number of clusters). Ismail and Kamel [4]
proposed a variation of the classical K-Means algorithm
that managed to alleviate this problem. Additionally, their
technique showed a consistent superiority over the classical
K-Means algorithm.
Utilizing the idea of finding the best move for each pattern (data item) from its current cluster to another cluster in order to improve the overall objective function, a clustering algorithm for distributed data over Peer-to-Peer networks is developed in this paper. The results obtained by the developed algorithm are compared with the recently proposed distributed K-Means algorithm LSP2P [1].
The rest of this paper is organized as follows: Related
work on distributed data clustering is reviewed in section
II. The development of the newly proposed algorithm is
given in section III. The evaluation process and results of
multiple experiments are given in section IV, followed by
the conclusion of this paper.
II. RELATED WORK
Clustering distributed data has been addressed in many publications over the past decade [4]-[12]. In [6], the authors presented collective Principal Component Analysis (PCA) as a new method for clustering distributed high-dimensional data. In [9], a hierarchical clustering algorithm (HP2PC) was proposed for distributed data over a multi-layer overlay network of peer neighbourhoods. In [10], the authors suggested approximating the distributed high-dimensional data as precisely as possible with a specified number of bytes before sending them to a centralized server that runs the clustering algorithm. Jin et al. [12] presented a new algorithm,
called Fast and Exact K-means Clustering (FEKM), which
typically requires only one or a small number of passes on
the entire dataset, and provably produces the same cluster
centers as reported by the original k-means algorithm. The
algorithm uses sampling to create initial cluster centers, and
then takes one or more passes over the entire dataset to adjust
these cluster centers. Januzaj et al. [13] proposed clustering
the data locally and extracting suitable representatives out
of these clusters. These representatives are sent to a global
server where the complete clustering based on the local
representatives is restored. This approach is characterized by local clusterings that are carried out quickly and independently of each other. Furthermore, the algorithm requires low transmission cost, as the number of transmitted representatives is much smaller than the cardinality of the complete dataset.
One of the most recent works on distributed data clustering is presented in [1]. The authors introduced an
elegant synchronization technique for clustering distributed
data via the K-Means algorithm. The basic idea behind this
algorithm is that each node runs a single K-Means iteration
over its local data then, the resulting centroids are used
to synchronize the clustering results with the neighbour
nodes. Each node sends its centroids to its neighbours.
Upon receiving the centroids of a certain cluster obtained at all the
neighbouring nodes, each node modifies its centroid for that
cluster to be the weighted average of the received centroids
and its current local centroid. Afterwards, each node starts the next iteration using the obtained average centroids, until the stopping criterion is satisfied. The authors also described how the algorithm should behave in dynamic networks, where the network structure or the data may change. The
evaluation results showed that this algorithm achieves high
accuracy compared to the classical centralized K-Means.
The communication efficiency of the algorithm was also demonstrated.
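Schematically, one such synchronization round at a single node could look as follows. This is our own reconstruction from the description above, not the authors' code; the size-weighted merge reappears as Eq. (4) in Section III.

import numpy as np

def lsp2p_round(X, centroids, nbr_msgs):
    """One LSP2P-style round at a node (schematic reconstruction).

    X         : (n, d) local patterns
    centroids : (c, d) current centroids
    nbr_msgs  : list of (centroids, sizes) pairs received from neighbours
    """
    # 1) a single local K-Means iteration
    d2 = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)
    sizes = np.bincount(labels, minlength=len(centroids)).astype(float)
    local = np.array([X[labels == k].mean(axis=0) if sizes[k] else centroids[k]
                      for k in range(len(centroids))])
    # 2) size-weighted average with the neighbours' centroids, per cluster
    num = local * sizes[:, None]
    den = sizes.copy()
    for m, s in nbr_msgs:
        num += m * s[:, None]
        den += s
    return num / np.maximum(den, 1.0)[:, None], sizes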
III. THE PROPOSED ALGORITHM
Basically, two issues need to be considered when developing a distributed data clustering algorithm: 1) how does the algorithm behave at each node? and 2) how do the nodes synchronize their local results? The same synchronization method proposed in [1] is followed here. Each node runs an iteration over its local data and then synchronizes the obtained centroids with only its direct neighbours. Each node updates its local centroids by performing a weighted averaging of its current centroids and the centroids of its neighbours. Then, each node starts the next iteration, until the stopping criteria are satisfied.
The way the algorithm behaves locally at each node is the major concern of the proposed algorithm. The basic idea is the following: starting from an initial cluster assignment for all the patterns, the algorithm tries to move each pattern from its current cluster to another cluster as long as this move reduces the overall objective function. The key question is how the algorithm assesses the effect of a move, i.e., whether the objective function would increase or decrease. Of course, the assessment needs to be done efficiently, without recalculating the overall objective function. Also, once a move decision is taken, the centroids of both the old and the new clusters of the moved pattern must be updated efficiently. To this end, one formula needs to be developed to assess the effect of a move and another to update the cluster centroids according to a move decision.
Consider the problem where we have a dataset D distributed over N nodes in a Peer-to-Peer network, where each node can directly communicate with its neighbours. The required number of clusters c is initially given to the algorithm. The goal is to find a cluster assignment for each pattern that minimizes the overall sum-of-squared error (the objective function) J_e, where

    J_e = \sum_{i=1}^{c} J_i    (1)

and the effective error of cluster i over all nodes is defined by

    J_i = \sum_{j=1}^{N} J_{ij}    (2)

For a certain node j, the effective error of cluster i is defined by

    J_{ij} = \sum_{x \in D_{ij}} \| x - m_{ij} \|^2    (3)

where D_{ij} is the set of patterns in node j that are assigned to cluster i, and m_{ij} is the centroid of cluster i at node j.
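In code, Eq. (3) is a plain sum of squared distances; a minimal sketch in our notation (NumPy assumed):

import numpy as np

def cluster_error(D_ij, m_ij):
    """J_ij of Eq. (3): sum of squared distances of the patterns of
    cluster i at node j from the cluster centroid m_ij."""
    return float(((np.asarray(D_ij) - m_ij) ** 2).sum())

# J_i of Eq. (2) sums J_ij over all nodes j, and J_e of Eq. (1)
# sums J_i over all clusters i.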
As stated above, after each iteration, each node updates the centroid of each cluster to be a weighted average of the centroids of that cluster obtained by its neighbours:

    m'_{ij} = \frac{m_{ij} s_{ij} + \sum_{r \in N_j} m_{ir} s_{ir}}{s_{ij} + \sum_{r \in N_j} s_{ir}}    (4)

where s_{ij} is the size of cluster i at node j, N_j is the set of neighbour nodes of node j, and m'_{ij} is the updated centroid of cluster i at node j.
Assuming that a pattern \hat{x} at node j is moved to cluster i, the centroid m_{ij} becomes m^*_{ij}, where

    m^*_{ij} = \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1}    (5)

and

    s^A_{ij} = s_{ij} + \sum_{r \in N_j} s_{ir}    (6)
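As a direct transcription of Eqs. (5) and (6), a small sketch (the names are ours):

def effective_size(s_ij, nbr_sizes):
    """Eq. (6): local size of cluster i at node j plus the sizes of the
    same cluster at the neighbouring nodes."""
    return s_ij + sum(nbr_sizes)

def centroid_after_add(m_ij, s_A_ij, x_hat):
    """Eq. (5): centroid of cluster i at node j after pattern x_hat joins."""
    return (m_ij * s_A_ij + x_hat) / (s_A_ij + 1)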
The objective function of cluster i at node j becomes:

    J^*_{ij} = \sum_{x \in D_{ij}} \| x - m^*_{ij} \|^2 + \| \hat{x} - m^*_{ij} \|^2
             = \sum_{x \in D_{ij}} \left\| x - \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1} \right\|^2 + \left\| \hat{x} - \frac{m_{ij} s^A_{ij} + \hat{x}}{s^A_{ij} + 1} \right\|^2
             = J_{ij} + \frac{s_{ij} + (s^A_{ij})^2}{(s^A_{ij} + 1)^2} \| \hat{x} - m_{ij} \|^2 - \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})    (7)

where X_{ij} = \sum_{x \in D_{ij}} x and, for two patterns a and b of d dimensions, the dot product DP is defined as

    DP(a, b) = \sum_{i=1}^{d} a_i b_i
Similarly, when removing a pattern \hat{x} from cluster i at node j, the objective function changes to:

    J^*_{ij} = J_{ij} - \frac{(s^A_{ij})^2 - s_{ij}}{(s^A_{ij} - 1)^2} \| \hat{x} - m_{ij} \|^2 + \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})    (8)

The transfer of \hat{x} from a cluster j to a cluster i is advantageous if the decrease in J_{jj} is greater than the increase in J_{ij}. This is the case if

    \frac{s_{ij} + (s^A_{ij})^2}{(s^A_{ij} + 1)^2} \| \hat{x} - m_{ij} \|^2 - \frac{2}{s^A_{ij} + 1} DP(\hat{x} - m_{ij}, X_{ij} - m_{ij})
    < \frac{(s^A_{jj})^2 - s_{jj}}{(s^A_{jj} - 1)^2} \| \hat{x} - m_{jj} \|^2 + \frac{2}{s^A_{jj} + 1} DP(\hat{x} - m_{jj}, X_{jj} - m_{jj})

Utilizing the developed formulas, the core iteration of our algorithm is described in Algorithm 1.
Algorithm 1 The Core Iteration at Each Node j
  {n is the total number of patterns in node j}
  {c is the required number of clusters}
  for l = 1 to n do
    {Assume x_l currently belongs to cluster C_i and n_i is the number of patterns in C_i}
    if n_i ≠ 1 then
      Δ_i = ((s^A_ij)^2 − s_ij) / (s^A_ij − 1)^2 · ||x_l − m_ij||^2 + 2 / (s^A_ij + 1) · DP(x_l − m_ij, X_ij − m_ij)
      Δ_min = ∞
      for k = 1 to c do
        if k ≠ i then
          Δ_k = (s_kj + (s^A_kj)^2) / (s^A_kj + 1)^2 · ||x_l − m_kj||^2 − 2 / (s^A_kj + 1) · DP(x_l − m_kj, X_kj − m_kj)
          if Δ_k < Δ_min then
            Δ_min = Δ_k
            k1 = k
          end if
        end if
      end for
      if Δ_min < Δ_i then
        Move x_l to C_k1
        Update m_ij and m_k1j according to Eq. (5)
        Update X_ij and X_k1j
        Update s_ij, s_k1j, s^A_ij, s^A_k1j and J_j
      end if
    end if
  end for
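As a concrete companion to Algorithm 1, the following Python sketch implements the core iteration for a single node. It is our own reconstruction: the data structures and names are ours, and the removal counterpart of the Eq. (5) centroid update is our assumption, not stated in the paper.

import numpy as np

def DP(a, b):
    """Dot product of two d-dimensional patterns (defined after Eq. (7))."""
    return float(np.dot(a, b))

def delta_remove(x, m, s, s_A, X):
    """Eq. (8): change in J when pattern x leaves a cluster with centroid m,
    local size s, effective size s_A (Eq. (6)) and pattern sum X."""
    return ((s_A**2 - s) / (s_A - 1)**2) * np.sum((x - m)**2) \
        + (2.0 / (s_A + 1)) * DP(x - m, X - m)

def delta_add(x, m, s, s_A, X):
    """Eq. (7): change in J when pattern x joins a cluster."""
    return ((s + s_A**2) / (s_A + 1)**2) * np.sum((x - m)**2) \
        - (2.0 / (s_A + 1)) * DP(x - m, X - m)

def core_iteration(data, labels, m, s, s_A, X):
    """One pass of Algorithm 1 over the local patterns of one node."""
    c = len(m)
    for l, x in enumerate(data):
        i = labels[l]
        if s[i] == 1:                        # never empty a cluster
            continue
        d_i = delta_remove(x, m[i], s[i], s_A[i], X[i])
        d_min, k1 = np.inf, -1
        for k in range(c):
            if k != i:
                d_k = delta_add(x, m[k], s[k], s_A[k], X[k])
                if d_k < d_min:
                    d_min, k1 = d_k, k
        if d_min < d_i:                      # the move lowers the objective
            labels[l] = k1
            m[k1] = (m[k1] * s_A[k1] + x) / (s_A[k1] + 1)   # Eq. (5)
            m[i] = (m[i] * s_A[i] - x) / (s_A[i] - 1)       # removal analogue (our assumption)
            X[i] = X[i] - x
            X[k1] = X[k1] + x
            s[i] -= 1;  s[k1] += 1
            s_A[i] -= 1;  s_A[k1] += 1
    return labels, m, s, s_A, X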
IV. PERFORMANCE EVALUATION

In order to evaluate the performance of the proposed algorithm, a dataset of 4-dimensional points sampled from 4 multivariate Gaussian distributions is used, with standard deviation σ = 10 and means (150, 150, 0, 100), (100, 100, 10, 100), (50, 50, 50, 50) and (0, 0, 20, 30).

Two network configurations are used for experimentation: 1) a network of 10 nodes with 50k points distributed over the network, and 2) a network of 100 nodes with 250k points distributed over the network. Each point has an equal probability of being sampled from any of the 4 distributions mentioned above. Following [1], the data are distributed randomly over the network; the number of points per node follows a Zipfian distribution with parameter 0.8.

For the Peer-to-Peer network simulation, PeerSim [14] is used due to its scalability and extensibility. Each node is connected to 4 neighbours. The message drop probability and the message delay were always set to 0, as they do not affect the accuracy of the clustering algorithms.

Twenty-five random initial centroids are generated. In each run, one of the centroids is used to initialize the LSP2P algorithm [1] and our proposed algorithm, and the overall objective function achieved by each algorithm is recorded. This process is repeated for cluster sizes C = 4, 6, 8 and 10.
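A sketch of how such a dataset could be generated is shown below (our own reconstruction of the described setup using NumPy; the truncated Zipf weighting is an assumption, since the exact sampling scheme is not spelled out):

import numpy as np

rng = np.random.default_rng(0)
MEANS = np.array([(150, 150, 0, 100), (100, 100, 10, 100),
                  (50, 50, 50, 50), (0, 0, 20, 30)], dtype=float)

def sample_points(n, sigma=10.0):
    """Draw n 4-dimensional points, each from one of the 4 Gaussians
    chosen with equal probability."""
    idx = rng.integers(0, len(MEANS), size=n)
    return MEANS[idx] + rng.normal(0.0, sigma, size=(n, 4))

def zipfian_counts(n_points, n_nodes, a=0.8):
    """Split n_points over n_nodes with Zipf-like weights k^(-a)."""
    w = 1.0 / np.arange(1, n_nodes + 1) ** a
    counts = np.floor(n_points * w / w.sum()).astype(int)
    counts[0] += n_points - counts.sum()    # assign the remainder to node 1
    return counts

# e.g. the 10-node configuration with 50k points:
node_data = [sample_points(n) for n in zipfian_counts(50_000, 10)]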
Figures 1, 2, 3 and 4 show the obtained results for a network of size 10. Figures 5, 6, 7 and 8 show the obtained results for a network of size 100.

Figure 1. 10 Nodes - 4 Clusters
Figure 2. 10 Nodes - 6 Clusters
Figure 3. 10 Nodes - 8 Clusters
Figure 4. 10 Nodes - 10 Clusters
[Each figure plots the objective function against the run number for LSP2P and our algorithm.]
The figures show that our algorithm is superior to the recently proposed LSP2P algorithm when both are run in non-optimal configurations (cluster sizes of 6, 8 and 10), as our algorithm almost always achieves a lower clustering objective function (except for 1 run out of 150).
Figure 5. 100 Nodes - 4 Clusters
[Objective function against run number for LSP2P and our algorithm.]

V. CONCLUSIONS

In this paper, an efficient algorithm for clustering distributed databases is proposed. The algorithm employs iterative optimization to cluster distributed databases organized in the form of a Peer-to-Peer network. Experimental results reported in this paper show the superiority of the proposed methodology over a recently proposed algorithm based on a distributed version of the well-known K-Means algorithm. It is envisaged that the proposed algorithm will find extensive applications in distributed clustering where efficient solutions are required.
Figure 6. 100 Nodes - 6 Clusters
Figure 7. 100 Nodes - 8 Clusters
Figure 8. 100 Nodes - 10 Clusters
[Objective function against run number for LSP2P and our algorithm.]

REFERENCES

[1] S. Datta, C. M. Giannella, and H. Kargupta, "Approximate distributed k-means clustering over a peer-to-peer network," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 10, 2009.

[2] A. Ismail, A. Barbar, and Z. Ismail, "P2P simulator for queries routing using data mining," International Journal of Database Management Systems (IJDMS), vol. 3, no. 3, 2011.

[3] S. Selim and M. A. Ismail, "K-means-type algorithms: A generalized convergence theorem and characterization of local optimality," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 6, no. 1, 1984.

[4] M. A. Ismail and M. S. Kamel, "Multidimensional data clustering utilizing hybrid search strategies," Pattern Recognition, vol. 22, no. 1, pp. 75–89, 1989.

[5] S. Parthasarathy and M. Ogihara, "Clustering distributed homogeneous datasets," in European Conference on Principles of Data Mining and Knowledge Discovery (PKDD), Lyon, France, 2000.

[6] H. Kargupta, W. Huang, K. Sivakumar, and K. Johnson, "Distributed clustering using collective principal component analysis," Knowledge and Information Systems, vol. 3, no. 4, pp. 422–448, 2001.

[7] T. Li, S. Zhu, and M. Ogihara, "Algorithms for clustering high dimensional and distributed data," Intelligent Data Analysis, vol. 7, no. 4, pp. 305–326, 2003.

[8] G. Forman and B. Zhang, "Distributed data clustering can be efficient and exact," ACM SIGKDD Explorations Newsletter, vol. 2, no. 2, pp. 34–38, 2000.

[9] K. Hammouda and M. S. Kamel, "Hierarchically distributed peer-to-peer document clustering and cluster summarization," IEEE Transactions on Knowledge and Data Engineering, vol. 21, no. 5, 2009.

[10] P. Kunath, H. Kriegel, M. Pfeifle, and M. Renz, "Approximated clustering of distributed high-dimensional data," in Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD), Hanoi, Vietnam, 2005.

[11] S. Bandyopadhyay, C. Giannella, U. Maulik, H. Kargupta, K. Liu, and S. Datta, "Clustering distributed data streams in peer-to-peer environments," Information Sciences, vol. 176, no. 14, pp. 1952–1985, 2006.

[12] R. Jin, A. Goswami, and G. Agrawal, "Fast and exact out-of-core and distributed k-means clustering," Knowledge and Information Systems, vol. 10, no. 1, pp. 17–40, 2006.

[13] E. Januzaj, H. P. Kriegel, and M. Pfeifle, "Towards effective and efficient distributed clustering," in International Workshop on Clustering Large Data Sets, ICDM, Melbourne, FL, 2003.

[14] M. Jelasity, A. Montresor, G. P. Jesi, and S. Voulgaris, "The PeerSim simulator," http://peersim.sf.net.

[15] R. Jain, The Art of Computer Systems Performance Analysis: Techniques for Experimental Design, Measurement, Simulation, and Modeling. Chichester: Wiley, 1991.