
Clustering for Network Planning

LAMIAA FATTOUH1, OMAR KARAM2, MOHAMED A. EL SHARKAWY2, WALAA KHALED2
1 Department of Computer Sciences and Information, Institute of Statistical Studies and Research, Cairo University, Giza, EGYPT
2 Department of Information Systems, Faculty of Computer and Information Sciences, Ain Shams University, Cairo, 11566, EGYPT

Abstract: - The design of urban telephone networks is of key importance during the planning of new cities. This paper addresses the problems of locating the public exchanges and laying out the cabling. They are treated as a clustering-around-medoids problem in which the clustering distance is a weighted shortest path. The weights are associated with links and represent the subscriber loads. Comparisons with other clustering methods are presented, showing the advantages of the CSP-CLARANS (Clustering with Shortest Path CLARANS) algorithm introduced in this paper.

Key-Words: - Spatial data mining, clustering algorithms, network planning, shortest path.

1 Introduction
The problem of urban network planning is of key importance during the construction of new communities and cities, in which telephone services have to be introduced as a component of the overall master plan of the city. The city data is given as a map of streets, the coordinates of the intersection nodes, and the distribution of the subscriber loads within the city. The available cable sizes, the cost per unit for each size, and other technical constraints are also provided. The process of network planning is divided into several subproblems:
1- Determining the locations of the exchanges.
2- Constructing the subscriber network lines from the exchanges to the subscribers so as to satisfy the optimization criteria and design constraints.

Cluster analysis, which groups data to find the overall distribution among data sets, has numerous applications in pattern recognition, spatial data analysis and image processing.
Cluster analysis has been an active area of statistics and data mining research, and many effective clustering methods have been developed. These methods can be categorized into partitioning methods [1, 2, 3], hierarchical methods [1, 4, 5], density-based methods [6, 7, 8], grid-based methods [9, 10, 11], and model-based methods [12, 13].

The clustering task consists of separating a set of objects into groups according to some measure of goodness, which differs from one application to another. A common measure of goodness is the sum of squares of the direct Euclidean distances between the customers and the center of the cluster they belong to. In many real applications the direct Euclidean distance has weaknesses [14]: it ignores the presence of streets and paths that must be taken into consideration during clustering.

In this paper, both subproblems are addressed, and a clustering-based solution is presented that uses the weighted shortest available physical routes. The weights are assigned to represent the subscriber loads. The problem can be stated as follows: given a set P of data points {p1, p2, ..., pn} on a two-dimensional map, the set of streets connecting these points, their corresponding loads, and any other communication constraints, determine the exchange locations and a cabling layout. A clustering scheme is required in which the number of exchanges equals the number of clusters and the exchange locations are the cluster medoids. Clustering is done to minimize the cost function C,

$$C = \sum_{i=1}^{k} \sum_{p_j \in C_i} L_{ij}\, d''(c_i, p_j) \qquad \text{(Eq. 1)}$$

where c_i is the medoid of cluster C_i (the real data point that yields minimum cost), d''(c_i, p_j) is the length of the shortest path from point p_j to c_i, and L_ij is the load cost of this shortest path.

CSP-CLARANS is developed in the spirit of the CLARANS and COD-CLARANS algorithms.
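As a rough illustration of Eq. 1, the sketch below evaluates the cost for a given set of medoids. It is not taken from the paper: it assumes a precomputed shortest-path matrix `dist`, assigns each point to its nearest medoid, and uses a single load per point (`load[j]`) as a simplified stand-in for L_ij. The function and variable names are hypothetical.

```python
def clustering_cost(dist, load, medoids):
    """Evaluate Eq. 1 under the simplifying assumptions stated above.

    dist[i][j] -- shortest-path length between points i and j (assumed precomputed)
    load[j]    -- load attached to point j (simplified stand-in for L_ij)
    medoids    -- indices of the points chosen as medoids
    """
    total = 0.0
    assignment = {}
    for j in range(len(load)):
        # Each point joins the cluster of its nearest medoid.
        nearest = min(medoids, key=lambda c: dist[c][j])
        assignment[j] = nearest
        total += load[j] * dist[nearest][j]
    return total, assignment


# Toy data: four points on a line with unit spacing; point 3 carries a heavy load.
dist = [[abs(i - j) for j in range(4)] for i in range(4)]
load = [1, 1, 1, 5]
cost, assignment = clustering_cost(dist, load, medoids=[0, 3])
print(cost, assignment)
```

In this toy case, placing a medoid on the heavily loaded point gives a lower cost than placing both medoids in the middle, which is the effect the load weighting is meant to capture.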
The following sections briefly introduce each of them. Sections 2 and 3 review the CLARANS and COD-CLARANS algorithms. Section 4 introduces the CSP-CLARANS algorithm. A case study is presented in section 5, and section 6 is the conclusion.

2 The CLARANS algorithm
CLARANS (Clustering Large Applications based upon Randomized Search) is one of the k-medoids algorithms [2]. To cluster a database D of n data objects, CLARANS first selects k initial cluster medoids at random. It then tries to find a better solution by randomly picking one of the k medoids and trying to replace it with another object chosen at random from the other (n - k) objects. If no better solution is found after a certain number of attempts, a local optimum is assumed to have been reached.

Figure 1 shows CLARANS fixing k-1 medoids in step 2 and testing a new randomly selected medoid in step 3. Having found that the new solution is no better than the original one, CLARANS backtracks to step 1 and repeats the process from there. Steps 4 and 5 show CLARANS succeeding in its search for a better solution, after which it proceeds with the search from step 5.

Figure 1: CLARANS searching for better solution.

3 The COD-CLARANS algorithm
The COD-CLARANS algorithm [14, 15] is based mainly on CLARANS and is designed for handling obstacles. While CLARANS can be made to handle obstacles by changing its distance function, COD-CLARANS optimizes this by pushing the task of handling obstacles into the algorithm itself. Figure 2 shows the overall structure of COD-CLARANS, which consists of three main parts: the main algorithm, the computation of the squared error function E, and a pruning function E'. The pruning function E' has two purposes: it helps to prune the search and avoid computing E, and it makes the computation of E more efficient.

The COD-CLARANS algorithm contains two phases:
Phase I: Construction of the BSP (Binary Space Partition) tree and the complete visibility graph, for later use in computing the obstructed distance (the shortest path between any two points that does not cut any obstacle edge).
Phase II: The obstructed distances between points and medoids are calculated, and the estimated lower bound E' of the sum-of-distances error function is computed.

The BSP tree is a data structure that can efficiently determine whether two points are visible to each other. Two points are visible to each other if and only if the straight line joining them does not intersect any obstacle. In COD-CLARANS, the BSP tree is used to determine the set of all obstacle vertices visible from a point p. A visibility graph is a graph whose nodes correspond to the vertices of the polygonal obstacles and whose edges correspond to pairs of vertices that are mutually visible.

COD-CLARANS first randomly selects k cluster medoids from the data points, then randomly selects one of the medoids, c_j, and tries to replace it with a non-medoid point c_random, over many iterations. If the k cluster medoids remain unchanged after a certain number of tests, the sum-of-distances error value and the cluster assignment are recorded. This process is repeated several times, and the solution that yields the least sum-of-distances error E is the output.

COD-CLARANS depends on the visibility graph and the BSP tree to compute the obstructed distance and the corresponding squared error function. The cost of both BSP-tree construction and visibility checking depends on many variables, including the number of edges in each polygon, the distribution of the polygons, and the location of the viewpoint of interest. To analyze the complexity, the worst case needs to be considered. For BSP-tree construction, suppose there is a set of polygons with n edges in total. In the worst case, the complexity is O(n²).
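The randomized medoid-replacement search that CLARANS and COD-CLARANS share can be sketched as follows. This is a minimal illustration rather than the published implementation: `dist` is any caller-supplied distance function (Euclidean here, but an obstructed or shortest-path distance could be passed instead), and `num_local` and `max_neighbor` are assumed names for the number of restarts and the number of consecutive failed replacements tolerated before a local optimum is declared.

```python
import random


def total_error(points, medoids, dist):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)


def clarans(points, k, dist, num_local=5, max_neighbor=50, seed=0):
    """Randomized k-medoids search in the style of CLARANS (illustrative sketch)."""
    rng = random.Random(seed)
    best, best_err = None, float("inf")
    for _ in range(num_local):                    # independent restarts
        current = rng.sample(points, k)           # random initial medoids
        current_err = total_error(points, current, dist)
        failures = 0
        while failures < max_neighbor:
            j = rng.randrange(k)                  # index of the medoid to replace
            candidate = rng.choice([p for p in points if p not in current])
            neighbor = current[:j] + [candidate] + current[j + 1:]
            err = total_error(points, neighbor, dist)
            if err < current_err:                 # accept and reset the failure count
                current, current_err, failures = neighbor, err, 0
            else:
                failures += 1
        if current_err < best_err:                # keep the best local optimum found
            best, best_err = current, current_err
    return sorted(best), best_err


def euclid(a, b):
    return ((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) ** 0.5


# Two well-separated groups of three points each.
pts = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
medoids, err = clarans(pts, k=2, dist=euclid)
print(medoids, err)
```

Keeping the distance function as a parameter mirrors the observation above that CLARANS can be adapted to obstacles simply by changing its distance function.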
For visibility checking, the worst case occurs when all vertices are visible from a given location X; the complexity in this case is O(n). The average complexity is difficult to determine, and further study is needed to obtain it.

In network planning, the visibility idea is not suitable. Planning starts from a map of streets with different lengths and loads, and the important constraint is the shortest path between any two points. CLARANS computes E mainly from the Euclidean distance between data points, ignoring the paths that must be followed. Some modifications must therefore be applied to CLARANS to make it more suitable for network planning, and this paper proposes an algorithm called CSP-CLARANS for that purpose.

4 CSP-CLARANS algorithm
The CSP-CLARANS (Clustering with Shortest Path CLARANS) algorithm is proposed for handling network planning. CSP-CLARANS is based mainly on the idea of CLARANS and handles the constraints of network planning by modifying the COD-CLARANS distance function. The algorithm should ensure a minimized overall travel distance for all the customers in the city.

4.1 Computing the shortest distance
The algorithm is shown in Figure 3. CSP-CLARANS assigns each non-medoid point to its nearest cluster medoid and computes the actual or estimated sum-of-distances error. We are interested in the shortest path from a non-medoid point to the nearest of the k medoids. Given a point p, we denote its nearest cluster medoid by NearestCenter(p) and the distance to that medoid by NearestDistance(p). The Floyd-Warshall algorithm is used to compute the shortest distances, or paths, from the cluster medoid to all data points. Figure 4 outlines the Floyd-Warshall algorithm.

Algorithm 1 CSP-CLARANS
Input: A set of n objects, clustering parameters, set of streets.
Output: A partition of the n objects into k clusters and the cluster medoids c1, c2, ..., ck.
Method:
Function CSP-CLARANS ()
{
  Initialize k to 1 (one cluster)
  randomly select k points to be current
  compute square error function E
  If (clustering constraints satisfied)
    Return current
  Else
  Increment:
    Increment k
    Let currentE = E
    Do {
      Found_new = false
      For (j = 0; j < k; j++) {
        Let remain = current - cj
        compute shortest distance of points to nearest medoid in remain
        For (try = 0; try < max_try; try++) {
          replace cj with a randomly selected point crandom
          compute estimated square error function E'
          If (E' > currentE) continue
          compute square error function E
          If (E < currentE) {
            Found_new = true
            current = new centers
            currentE = E
          }
        }
        If (Found_new == true) break
      }
    } While (Found_new)
    If (clustering constraints satisfied)
      Return current
    Else
      Go to Increment
}
Figure 3: CSP-CLARANS algorithm

Algorithm 2 Floyd-Warshall
Input: a matrix W with weights, s (source)
Output: shortest path from s to all points.
Method:
Function FLOYD (W, s)
{
  n = rows (W)
  D(0) = W
  for k = 1 to n
    for i = 1 to n
      for j = 1 to n
        dij(k) = min { dij(k-1), dik(k-1) + dkj(k-1) }
  return row s of D(n)
}
Figure 4: Floyd-Warshall algorithm

4.2 Estimated Square Error Function
In every iteration, a non-medoid point c_random is selected and tested to see whether it is a better replacement for one of the existing medoids. This test can be very expensive when the shortest-distance function is involved. To reduce the running time, an estimated lower bound E'(c_random) of the sum-of-distances error is used to prune some of the testing cycles. To calculate E'(c_random), all data points are first assigned to the (k-1) remaining medoids c_i, where 1 ≤ i ≤ k and i ≠ j. The distance between a point and its assigned medoid is NearestDistance(p). If the direct Euclidean distance between the point and c_random is shorter than NearestDistance(p), the point is assigned to c_random and NearestCenter(p) is reset to c_random. Because the Euclidean distance never exceeds the shortest-path distance, the resulting sum of NearestDistance(p) over all points is a lower bound on the true sum-of-distances error E.

4.3 The number of clusters
Clustering partitions the data points into a set of k clusters, where k is usually a user input parameter. Each data point is assigned to one of the k clusters according to some distance measure. If k cannot be known ahead of time, various values of k can be evaluated until the most suitable one is found.

In the network planning application, the number of clusters k is unknown, so it is not a user input parameter. There are, however, constraints that must be satisfied, such as the available cable sizes and the acceptable grade of service. In previous work [15], the city is partitioned into four quadrants, which form the initial clusters, and the network constraints are checked for each quadrant. If the constraints are satisfied, the number of clusters is four. The switches are located at the centroids, that is, the centers of gravity of the clusters. If the constraints are not satisfied in one of the four quadrants, the same partitioning is applied to that quadrant, which yields seven clusters. The method is iterated until the network constraints are satisfied, so the resulting number of clusters may be 4, 7, 10, and so on.

Because the previous work always increments the number of clusters by three, it reflects neither the real nature of the clusters nor the suitable number of clusters, and CSP-CLARANS is proposed to address this. As shown in Figure 3, the algorithm starts by treating the whole map as one cluster and increments the number of clusters k until the clustering constraints are satisfied at some k. The medoids of the resulting clusters depend on the distance measure only. Once the weighted streets are taken into account, the medoids move towards the heavily loaded (weighted) streets to achieve minimum cost. Table 1 is a comparison between different clustering approaches.

Table 1: A comparison between different clustering approaches.

5 Case Study
As a real application, the proposed algorithm is applied to a map representing part of Cairo. The actual map, shown in Figure 5, is scanned. The beginning and end of each street are transformed into data points defined by their coordinates, and the streets themselves are transformed into links between data points. The subscriber loads are taken as the weights (loads) of the streets. Figure 6 shows the processed map, in which nodes represent data points and lines represent the streets of the actual map.

The proposed CSP-CLARANS algorithm is applied to the processed map. The output consists of the clusters, the initial medoids, and the final medoids. Initial medoids are represented by dark squares. When the subscriber loads are taken into account, the initial medoids move towards the heavily loaded streets. The final medoids are shown as dark triangles in Figure 7; they will be the locations of the exchanges. Where the subscriber loads within a cluster are uniformly distributed, the initial medoid is kept as the final one.
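The preprocessing and distance computation described in the case study can be sketched on a toy map as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the coordinates and streets are made up, streets are treated as two-way links whose length is the Euclidean distance between their endpoints, and loads are attached to nodes for brevity, whereas the paper attaches them to streets. The helper names are hypothetical.

```python
import math


def build_distance_matrix(coords, streets):
    """Direct street lengths between node indices; inf where no street exists.

    coords  -- list of (x, y) node coordinates (street endpoints)
    streets -- list of (u, v) index pairs, one per street
    """
    n = len(coords)
    w = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for u, v in streets:
        length = math.dist(coords[u], coords[v])
        w[u][v] = w[v][u] = length          # streets are treated as two-way
    return w


def floyd_warshall(w):
    """All-pairs shortest path lengths over the street graph."""
    n = len(w)
    d = [row[:] for row in w]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


def weighted_medoid(d, load, members):
    """Member node minimising the load-weighted sum of path lengths."""
    return min(members, key=lambda c: sum(load[p] * d[c][p] for p in members))


# Hypothetical toy map: four nodes on a unit square, three streets (0-1, 1-2, 2-3).
coords = [(0, 0), (1, 0), (1, 1), (0, 1)]
streets = [(0, 1), (1, 2), (2, 3)]
d = floyd_warshall(build_distance_matrix(coords, streets))
print(d[0][3])                                      # length along the streets
print(weighted_medoid(d, [1, 1, 1, 1], range(4)))   # uniform loads
print(weighted_medoid(d, [1, 1, 1, 10], range(4)))  # heavy load on node 3
```

Nodes 0 and 3 are one unit apart in a straight line but three units apart along the streets, and raising the load on node 3 pulls the medoid onto it, which is the behaviour the case study reports for heavily loaded streets.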
Figure 7 shows the medoids of clusters Cl1, Cl2 and Cl3 moving towards the heavily loaded streets, while cluster Cl4 maintains its initial medoid.

6 Conclusion
Cluster analysis is one of the major tasks in various research areas. It aims at identifying and extracting significant groups in the underlying data. Based on certain clustering criteria, the data are grouped so that the data points in a cluster are more similar to each other than to points in different clusters. In this paper, we introduced a clustering solution to the problem of network planning, the CSP-CLARANS algorithm. It is a medoid clustering algorithm whose distances are weighted shortest paths satisfying the network constraints. The result is a realistic solution that represents the subscriber demand with minimum network cost.

References:
[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[2] J. Han, M. Kamber, and A. K. H. Tung, Spatial clustering methods in data mining: A survey, Geographic Data Mining and Knowledge Discovery, 2001.
[3] P. Bradley, U. Fayyad, and C. Reina, Scaling clustering algorithms to large databases, In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining, 1998.
[4] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An efficient data clustering method for very large databases, In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), 1996.
[5] S. Guha, R. Rastogi, and K. Shim, CURE: An efficient clustering algorithm for large databases, In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), 1998.
[6] M. Ester, H.-P. Kriegel, J. Sander, and X. Xu, A density based algorithm for discovering clusters in large spatial databases, In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), 1996.
[7] M. Ankerst, M. Breunig, H.-P. Kriegel, and J. Sander, OPTICS: Ordering points to identify the clustering structure, In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), 1999.
[8] A. Hinneburg and D. A. Keim, An efficient approach to clustering in large multimedia databases with noise, In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), 1998.
[9] W. Wang, J. Yang, and R. Muntz, STING: A statistical information grid approach to spatial data mining, In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), 1997.
[10] G. Sheikholeslami, S. Chatterjee, and A. Zhang, WaveCluster: A multi-resolution clustering approach for very large spatial databases, In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), 1998.
[11] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), 1998.
[12] J. W. Shavlik and T. G. Dietterich, Readings in Machine Learning, 1990.
[13] T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, 1982.
[14] Anthony K. H. Tung, Jean Hou, and Jiawei Han, Spatial clustering in the presence of obstacles, In Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), 2001.
[15] Jean Hou, Clustering with obstacle entities, Simon Fraser University, 1999.
[16] Ayman El-Dessouki, Abd El Moniem Wahdan, and Lamia Fattouh Ibrahim, The Use of Knowledge-based System for Urban Telephone Planning, ITU/ITC/LAS Regional Seminar on Tele-Traffic Engineering for Arab States, Damascus, 1999.
