Clustering for Network Planning
LAMIAA FATTOUH (1), OMAR KARAM (2), MOHAMED A. EL SHARKAWY (2), WALAA KHALED (2)
(1) Department of Computer Sciences and Information
Institute of Statistical Studies and Research, Cairo University
Giza, EGYPT
(2) Department of Information Systems
Faculty of Computer and Information Sciences, Ain Shams University
Cairo, 11566, EGYPT
Abstract: - The design of urban telephone networks is of key importance during the planning of new cities.
This paper addresses the problems of locating the public exchanges and laying out the cabling.
They are treated as a clustering-around-medoids problem in which the clustering distance is a
weighted shortest path. The weights are associated with links and represent the subscribers' loads. Comparisons
with other clustering methods are presented, showing the advantages of the CSP-CLARANS (Clustering with
Shortest Path-CLARANS) algorithm introduced in this paper.
Key-Words: - Spatial data mining, clustering algorithms, network planning, shortest path.
1 Introduction
The problem of urban network planning is of key importance during the construction of new communities and cities, in which telephone services have to be introduced as a component of the overall master plan of the city. The city data is given as a map of streets, the coordinates of the intersection nodes, and the distribution of the subscribers' loads within the city. The available cable sizes, the cost per unit for each size and other technical constraints are also provided.
The process of network planning is divided into several subproblems:
1- Determining the locations of the exchanges.
2- Constructing subscriber network lines from the exchanges to the subscribers so as to satisfy optimization criteria and design constraints.
Cluster analysis, which groups data to reveal the overall distribution within data sets, has numerous applications in pattern recognition, spatial data analysis and image processing. Cluster analysis has been an active area of statistics and data mining research, with many effective clustering methods developed. These methods can be categorized into partitioning methods [1, 2, 3], hierarchical methods [1, 4, 5], density-based methods [6, 7, 8], grid-based methods [9, 10, 11], and model-based methods [12, 13].
The clustering task consists of separating a set of objects into different groups according to some measure of goodness that differs from one application to another. A common measure of goodness is the sum of squares of the direct Euclidean distances between the customers and the center of the cluster they belong to. In many real applications the use of the direct Euclidean distance has its weaknesses [14]: it ignores the presence of streets and paths that must be taken into consideration during clustering. In this paper, both issues are addressed and a clustering-based solution is presented that uses the weighted physical shortest available routes. The weights are assigned to represent the subscribers' loads.
The problem can be stated as follows: Given a set P of data points {p1, p2, …, pn} on a two-dimensional map, the set of streets connecting these points, their corresponding loads and any other communication constraints, it is required to determine the exchange locations and a cabling layout. A clustering scheme is required in which the number of exchanges equals the number of clusters and the exchange locations are the cluster medoids. Clustering is done to minimize the cost function C,
$$C = \sum_{i=1}^{k} \sum_{p_j \in C_i} L_{ij} \, d''(c_i, p_j) \qquad \text{(Eq. 1)}$$
where ci is the medoid of cluster Ci (the real data point that yields the minimum cost), d''(ci, pj) is the length of the shortest path from a point pj to ci, and Lij is the load cost of this shortest path.
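The cost function of Eq. 1 can be evaluated directly once the shortest-path lengths and load costs are available. The following is a minimal sketch, assuming both are precomputed and stored as matrices; the names `clustering_cost`, `dist` and `load` and the toy values are illustrative and do not come from the paper.

```python
def clustering_cost(clusters, dist, load):
    """Evaluate Eq. 1: sum over clusters of load cost times shortest-path length.

    clusters: dict mapping a medoid index to the list of its member indices
    dist[i][j]: shortest-path length between points i and j (assumed precomputed)
    load[i][j]: load cost attached to that shortest path (assumed precomputed)
    """
    total = 0.0
    for medoid, members in clusters.items():
        for p in members:
            total += load[medoid][p] * dist[medoid][p]
    return total


# Toy data: four points, with points 0 and 3 acting as medoids.
dist = [[0, 2, 5, 9], [2, 0, 3, 7], [5, 3, 0, 4], [9, 7, 4, 0]]
load = [[1, 2, 1, 1], [2, 1, 1, 1], [1, 1, 1, 3], [1, 1, 3, 1]]
clusters = {0: [0, 1], 3: [2, 3]}
cost = clustering_cost(clusters, dist, load)  # 2*2 + 3*4 = 16
```

A medoid contributes zero to the sum for itself, since its distance to itself is zero.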
CSP-CLARANS is developed in the spirit of the CLARANS and COD-CLARANS algorithms, each of which is briefly introduced first. In sections 2 and 3, the CLARANS and COD-CLARANS algorithms are reviewed. In section 4, the CSP-CLARANS algorithm is introduced. A case study is presented in section 5, and section 6 is the conclusion.
2 The CLARANS algorithm
CLARANS (Clustering Large Applications based upon Randomized Search) is one of the k-medoids algorithms [2]. To cluster a database D with n data objects, CLARANS first selects k initial cluster medoids randomly. It then tries to find a better solution by randomly picking one of the k medoids and trying to replace it with another randomly chosen object from the other (n – k) objects. If no better solution is found after a certain number of attempts, a local optimum is assumed to have been reached.
Figure 1 shows CLARANS fixing k-1 medoids in step 2 and testing a new randomly selected medoid at step 3. Having found that the new solution is not better than the original solution, CLARANS backtracks to step 1 and repeats the process from there. Steps 4 and 5 show CLARANS succeeding in finding a better solution, and it proceeds with the search from step 5.
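The randomized swap search described above can be sketched as follows. This is an illustrative simplification, assuming a precomputed distance matrix and n > k; the parameter names `max_neighbor` (failed attempts allowed before a local optimum is assumed) and `num_local` (number of restarts) are this sketch's own labels rather than values given in this paper.

```python
import random


def total_cost(medoids, dist):
    # Each point contributes its distance to the nearest medoid.
    return sum(min(dist[p][m] for m in medoids) for p in range(len(dist)))


def clarans(dist, k, max_neighbor=50, num_local=3, seed=0):
    rng = random.Random(seed)
    n = len(dist)
    best, best_cost = None, float("inf")
    for _ in range(num_local):
        # Start each local search from k randomly chosen medoids.
        current = rng.sample(range(n), k)
        current_cost = total_cost(current, dist)
        failures = 0
        while failures < max_neighbor:
            # Swap one random medoid for one random non-medoid.
            i = rng.randrange(k)
            candidate = rng.choice([p for p in range(n) if p not in current])
            neighbor = current[:i] + [candidate] + current[i + 1:]
            cost = total_cost(neighbor, dist)
            if cost < current_cost:
                current, current_cost, failures = neighbor, cost, 0
            else:
                failures += 1
        # Keep the best local optimum found over all restarts.
        if current_cost < best_cost:
            best, best_cost = sorted(current), current_cost
    return best, best_cost
```

Resetting the failure counter after every successful swap mirrors the behaviour in Figure 1, where the search continues from the improved solution.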
Phase I: Construction of the BSP (Binary Space Partition) tree and the complete visibility graph for later use in computing the obstructed distance (the shortest path between any two points that does not cut through any obstacle edge).
Phase II: The obstructed distances between points and medoids are calculated, and the estimated lower bound E′ for the sum of distances error function is computed.
The BSP tree is a data structure that can efficiently determine whether two points are visible to each other. Two points are defined as visible to each other if and only if the straight line joining them doesn't intersect any obstacle. In COD-CLARANS, the BSP tree is used to determine the set of all obstacle vertices visible from a point p.
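The visibility definition above can be illustrated with a brute-force test that checks the joining segment against every obstacle edge. This sketch deliberately does not use a BSP tree, which COD-CLARANS employs to make the same query efficient; the function names are hypothetical, and segments that merely touch at an endpoint are treated as not intersecting.

```python
def _orient(a, b, c):
    # Sign of the cross product (b - a) x (c - a): 1, 0 or -1.
    v = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (v > 0) - (v < 0)


def segments_cross(p, q, a, b):
    # True when segments pq and ab properly cross each other.
    return (_orient(p, q, a) * _orient(p, q, b) < 0
            and _orient(a, b, p) * _orient(a, b, q) < 0)


def visible(p, q, obstacle_edges):
    # p and q see each other if segment pq crosses no obstacle edge.
    return not any(segments_cross(p, q, a, b) for a, b in obstacle_edges)


# A square obstacle given by its four edges.
square = [((1, -1), (2, -1)), ((2, -1), (2, 1)),
          ((2, 1), (1, 1)), ((1, 1), (1, -1))]
blocked = visible((0, 0), (3, 0), square)   # segment passes through the square
clear = visible((0, 0), (0, 3), square)     # segment stays to the left of it
```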
A visibility graph is a graph whose nodes correspond to the vertices of the polygonal obstacles and whose edges correspond to the pairs of vertices that are mutually visible. COD-CLARANS first randomly selects k cluster medoids from the data points, then randomly selects one of the medoids cj and tries to replace it with a non-medoid point crandom, over many iterations. If, after a certain number of tests, the k cluster medoids remain unchanged, the sum of distances error value and the cluster assignment are recorded. This process is repeated several times and the solution that yields the least sum of distances error (E) is the output.
COD-CLARANS depends on the visibility graph
Figure 1: CLARANS searching for a better solution.
3 The COD-CLARANS algorithm
The COD-CLARANS algorithm [14, 15] depends mainly on CLARANS and is designed for handling obstacles. While the CLARANS algorithm can be made to handle obstacles by changing its distance function, COD-CLARANS optimizes this function by pushing the task of handling obstacles into the algorithm. Figure 2 shows the overall structure of COD-CLARANS, which consists of three main parts: the main algorithm, the computation of the squared error function E, and a pruning function E′. The pruning function E′ has two purposes: it helps prune the search, avoiding the computation of E, and it makes the computation of E more efficient.
The COD-CLARANS algorithm contains two phases:
and BSP tree to compute the obstructed distance and the corresponding square error function. Both the BSP tree and the visibility checking methods depend on many variables. These include the number of edges in each polygon, the distribution of the polygons, and the location of the viewpoint of interest. To analyze the complexity, the worst-case scenario needs to be considered. For BSP tree construction, suppose there is a set of polygons with n edges in total. In the worst case, the complexity is O(n^2). For visibility checking, the worst case occurs when all vertices are visible from a given location X; the complexity in this case is O(n). The average complexity is difficult to determine, and further study is needed to obtain it.
In network planning, the visibility idea isn't suitable. While planning, there is a map of streets with different lengths and loads, and the important constraint to be considered is the shortest path between any two points. CLARANS computes E mainly from the Euclidean distance between data points, ignoring the constraints of the paths that must be considered. Some modifications should therefore be applied to CLARANS to make it more suitable for network planning, so this paper proposes an algorithm called CSP-CLARANS.
between a point and its assigned medoid is NearestDistance(p). If the direct Euclidean distance between the point and crandom is shorter than NearestDistance(p), the point is assigned to crandom, NearestCenter(p) is reset to crandom, and NearestDistance(p) is reset to that Euclidean distance. The resulting sum of NearestDistance(p) over all points is a lower bound of the true sum of distances error E, because the Euclidean distance never exceeds the shortest-path distance.
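The lower-bound estimate described above can be sketched as follows. The sketch assumes that the shortest-path distance from each point to its nearest remaining medoid is already available; the names `estimated_error` and `remain_dist` are illustrative, and `math.dist` requires Python 3.8 or later.

```python
import math


def estimated_error(points, remain_dist, c_random):
    """Lower bound E' on the error after adding candidate medoid c_random.

    points: list of (x, y) coordinates
    remain_dist[p]: shortest-path distance from point p to its nearest
                    medoid among the remaining k-1 medoids (assumed given)
    c_random: index of the candidate medoid
    """
    total = 0.0
    for p, xy in enumerate(points):
        # The straight-line distance can never exceed the street distance,
        # so using it for the candidate keeps the sum a valid lower bound.
        euclid = math.dist(xy, points[c_random])
        total += min(remain_dist[p], euclid)
    return total


points = [(0, 0), (3, 4), (6, 8)]
remain_dist = [0.0, 7.0, 12.0]
e_prime = estimated_error(points, remain_dist, c_random=2)  # 0 + 5 + 0
```

If the value returned already exceeds the current error, the candidate can be rejected without computing any shortest path to it.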
Algorithm 1 CSP-CLARANS
Input: A set of n objects, clustering parameters, set of streets.
Output: A partition of the n objects into k clusters and the clusters' medoids c1, c2, …, ck.
Method:
Function CSP-CLARANS ()
{ Initialize k to 1 (one cluster)
  Randomly select k points to be current
  Compute square error function E
  If (clustering constraints satisfied)
    Return current
  Else
Increment: Increment k
  Let currentE = E
  Do
  {
    Found_new = false
    For (j = 0; j < k; j++)
    { Let remain = current – cj
      Compute shortest distance of points to nearest medoid in remain
      For (try = 0; try < max_try; try++)
      { Replace cj with a randomly selected point crandom
        Compute estimated square error function E′
        If (E′ > currentE)
          Continue
        Compute square error function E
        If (E < currentE)
        { Found_new = true
          current = new centers
          currentE = E
        }
      }
      If (Found_new == true)
        Break
    }
  } While (Found_new)
  If (clustering constraints satisfied)
    Return current
  Else
    Go to Increment
4 CSP-CLARANS algorithm
The CSP-CLARANS (Clustering with Shortest Path-CLARANS) algorithm is proposed for handling network planning. CSP-CLARANS is based mainly on the idea of CLARANS and handles the constraints of network planning by modifying the COD-CLARANS distance function. The algorithm aims to minimize the overall travel distance of all the customers in the city.
4.1 Computing the shortest distance
The algorithm is shown in Figure 3. CSP-CLARANS assigns each non-medoid point to its nearest cluster medoid and computes the actual or estimated sum of distances error. We are interested in the shortest path from a non-medoid point to its nearest medoid among the k medoids. Given a point p, we denote its nearest cluster medoid as NearestCenter(p) and the distance to that medoid as NearestDistance(p). The Floyd-Warshall algorithm is used to compute the shortest distances, or paths, from the cluster medoid to all data points. Figure 4 outlines the Floyd-Warshall algorithm.
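The assignment step can be sketched as follows, assuming the shortest-path distances between all pairs of points are already stored in a matrix. The function name `assign_to_medoids` is hypothetical; the two returned dictionaries play the roles of NearestCenter(p) and NearestDistance(p).

```python
def assign_to_medoids(dist, medoids):
    """Assign every point to its nearest medoid by shortest-path distance."""
    nearest_center, nearest_distance = {}, {}
    for p in range(len(dist)):
        best = min(medoids, key=lambda m: dist[m][p])
        nearest_center[p] = best
        nearest_distance[p] = dist[best][p]
    return nearest_center, nearest_distance


# Toy shortest-path matrix for four points, with medoids 0 and 3.
dist = [[0, 2, 5, 9], [2, 0, 3, 7], [5, 3, 0, 4], [9, 7, 4, 0]]
centers, distances = assign_to_medoids(dist, [0, 3])
error = sum(distances.values())  # unweighted sum of distances
```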
4.2 Estimated Square Error Function
In every iteration, a non-medoid point crandom is selected and tested to see whether it is a better replacement for one of the existing medoids. This testing operation can be very expensive when the shortest distance function is involved. To reduce the running time, an estimated lower bound E′(crandom) of the sum of distances error is used to prune some of the testing cycles. To calculate the value of E′(crandom), all data points are first assigned to the (k−1) remaining medoids ci, where 1 ≤ i ≤ k but i ≠ j. The distance
}
Figure 3: CSP-CLARANS algorithm
4.3 The number of clusters
Clustering partitions data points into a set of k clusters, where k is usually a user input parameter. Each of the data points is assigned to one of the k clusters according to some distance measure. If k cannot be known ahead of time, various values of k can be evaluated until the most suitable one is found. In the network planning application, the number of clusters (k) is unknown, so it isn't a user input parameter. Instead, there are constraints that must be satisfied, such as the available cable sizes and the acceptable grade of service.
In previous work [15], the city is partitioned into four quadrants, which form the initial clusters, and the network constraints are checked for each quadrant. If the constraints are satisfied, the number of clusters is four. The switches are located at the centroids, i.e. the centers of gravity of the clusters. If the constraints are not satisfied in one of the four quadrants, the same partitioning method is applied to that quadrant, which raises the number of clusters to seven. This method is iterated until the network constraints are satisfied. The resulting number of clusters may be 4, 7, 10, etc.
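The progression 4, 7, 10, … follows because each refinement replaces one quadrant with four, a net gain of three clusters. A minimal sketch of this arithmetic, assuming exactly one quadrant is subdivided per refinement (the function name is hypothetical):

```python
def quadrant_cluster_counts(refinements):
    """Cluster counts after 0, 1, ..., `refinements` quadrant subdivisions."""
    # Start from 4 quadrants; each subdivision removes 1 cluster and adds 4.
    return [4 + 3 * r for r in range(refinements + 1)]


counts = quadrant_cluster_counts(3)  # [4, 7, 10, 13]
```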
Algorithm 2 Floyd-Warshall.
Input: a weight matrix W, a source s
Output: shortest paths from s to all points.
Method:
Function FLOYD (W, s)
{ n = rows (W)
  D(0) = W
  for k = 1 to n
    for i = 1 to n
      for j = 1 to n
        dij(k) = min { dij(k-1), dik(k-1) + dkj(k-1) }
  return row s of D(n)
}
Figure 4: Floyd-Warshall algorithm
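For reference, the standard Floyd-Warshall recurrence can be written in a few lines of Python. This is a generic textbook sketch rather than the authors' implementation; it assumes a square weight matrix in which missing links are marked with infinity and the diagonal is zero.

```python
import math


def floyd_warshall(w):
    """Return the all-pairs shortest-path matrix for weight matrix w."""
    n = len(w)
    d = [row[:] for row in w]  # copy so the input matrix is left untouched
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Allow paths that pass through intermediate node k.
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d


INF = math.inf
w = [[0, 4, INF, INF],
     [4, 0, 1, 7],
     [INF, 1, 0, 2],
     [INF, 7, 2, 0]]
d = floyd_warshall(w)
from_source = d[0]  # shortest distances from node 0: [0, 4, 5, 7]
```

The triple loop costs O(n^3) time, but it is run once; every later medoid test can then read its distances from the matrix.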
Since the previous work always increments the number of clusters by three, it reflects neither the real nature of the clusters nor the suitable number of clusters; CSP-CLARANS is proposed to address this. The algorithm, shown in Figure 3, starts by treating the whole map as one cluster and increments the number of clusters (k) until the clustering constraints are satisfied at a certain k. The medoids of the resulting clusters depend on the distance measure only. When the weighted streets are considered, new medoids result that move towards the loaded (weighted) streets to achieve minimum cost.
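The outer loop that grows k until the constraints hold can be sketched as follows. Two simplifications are assumed and are not taken from the paper: the randomized medoid search is replaced by an exhaustive search that is only practical for tiny inputs, and the network constraints are reduced to a single hypothetical `capacity` limit on the total demand served by one exchange.

```python
from itertools import combinations


def best_medoids(dist, k):
    # Stand-in for the randomized search: try every set of k medoids.
    n = len(dist)

    def cost(ms):
        return sum(min(dist[m][p] for m in ms) for p in range(n))

    return min(combinations(range(n), k), key=cost)


def cluster_loads(dist, medoids, demand):
    # Total demand assigned to each medoid under nearest-medoid assignment.
    loads = {m: 0 for m in medoids}
    for p in range(len(dist)):
        nearest = min(medoids, key=lambda m: dist[m][p])
        loads[nearest] += demand[p]
    return loads


def grow_until_feasible(dist, demand, capacity):
    # Start with one cluster and add clusters until no exchange is overloaded.
    for k in range(1, len(dist) + 1):
        medoids = best_medoids(dist, k)
        if max(cluster_loads(dist, medoids, demand).values()) <= capacity:
            return k, medoids
    return None


pos = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in pos] for a in pos]
result = grow_until_feasible(dist, demand=[1] * 6, capacity=3)
```

With this toy data a single cluster would carry a demand of 6, so the loop stops at k = 2 with one medoid in each group of points.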
Table 1 is a comparison between different clustering
approaches.
5 Case Study
As a real application, the proposed algorithm is applied to a map representing a certain part of Cairo. The actual map, shown in Figure 5, is scanned. The beginning and end of each street are transformed into data points, defined by their coordinates. The streets themselves are transformed into links between data points. The subscribers' loads are taken as the weights (loads) of each street. Figure 6 shows the processed map, in which nodes represent data points and lines represent the streets of the actual map.
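The map preprocessing described above can be sketched as follows. The input format (a list of street segments with end-point coordinates and a load) and the function name `build_street_graph` are assumptions made for illustration; street length and street load are kept in separate matrices so that no particular way of combining them is implied.

```python
import math


def build_street_graph(streets):
    """streets: list of ((x1, y1), (x2, y2), load) tuples, one per street."""
    # Give every distinct street end point a node index.
    index = {}
    for a, b, _ in streets:
        for pt in (a, b):
            index.setdefault(pt, len(index))
    n = len(index)
    # Unconnected pairs get infinite length; every node is at distance 0 from itself.
    length = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    load = [[0.0] * n for _ in range(n)]
    for a, b, street_load in streets:
        i, j = index[a], index[b]
        length[i][j] = length[j][i] = math.dist(a, b)
        load[i][j] = load[j][i] = street_load
    return index, length, load


streets = [((0, 0), (3, 4), 10), ((3, 4), (3, 0), 2)]
index, length, load = build_street_graph(streets)
```

The `length` matrix has the form expected by an all-pairs shortest-path routine such as Floyd-Warshall.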
The proposed CSP-CLARANS algorithm is applied
to the processed map. The output is the clusters,
Table 1: A comparison between different clustering approaches.
initial medoids, and final medoids. Initial medoids are represented by dark squares. When the subscribers' loads are taken into account, the initial medoids move towards heavily loaded streets. The final medoids are shown by dark triangles in Figure 7. The
medoid as the final one due to the uniform distribution of the subscribers' loads within this cluster. These final medoids will be the locations of the exchanges.
6 Conclusion
Cluster analysis is one of the major tasks in various research areas. Clustering aims at identifying and extracting significant groups in the underlying data. Based on certain clustering criteria, the data are grouped so that the data points in a cluster are more similar to each other than to points in different clusters. In this paper, we introduced a clustering solution to the problem of network planning, the CSP-CLARANS algorithm. This algorithm is a medoid clustering algorithm using distances that are weighted shortest paths satisfying the network constraints. The result is a realistic solution representing the subscriber demand with minimum network costs.
figure shows the movement of the medoids towards heavily loaded streets. The medoids of clusters Cl1, Cl2 and Cl3 move towards the heavy loads, while cluster Cl4 maintains its initial
References:
[1] L. Kaufman and P. J. Rousseeuw, Finding Groups in Data: An Introduction to Cluster Analysis, John Wiley & Sons, 1990.
[2] J. Han, M. Kamber, and A. K. H. Tung, Spatial clustering methods in data mining: A survey, Geographic Data Mining and Knowledge Discovery, 2001.
[3] P. Bradley, U. Fayyad, and C. Reina, Scaling clustering algorithms to large databases, In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining, 1998.
[4] T. Zhang, R. Ramakrishnan, and M. Livny, BIRCH: An efficient data clustering method for very large databases, In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'96), 1996.
[5] S. Guha, R. Rastogi, and K. Shim, CURE: An efficient clustering algorithm for large databases, In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), 1998.
[6] M. Ester, H. P. Kriegel, J. Sander, and X. Xu, A density-based algorithm for discovering clusters in large spatial databases, In Proc. 1996 Int. Conf. Knowledge Discovery and Data Mining (KDD'96), 1996.
[7] M. Ankerst, M. Breunig, H. P. Kriegel, and J. Sander, OPTICS: Ordering points to identify the clustering structure, In Proc. 1999 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'99), 1999.
[8] A. Hinneburg and D. A. Keim, An efficient approach to clustering in large multimedia databases with noise, In Proc. 1998 Int. Conf. Knowledge Discovery and Data Mining (KDD'98), 1998.
[9] W. Wang, J. Yang, and R. Muntz, STING: A statistical information grid approach to spatial data mining, In Proc. 1997 Int. Conf. Very Large Data Bases (VLDB'97), 1997.
[10] G. Sheikholeslami, S. Chatterjee, and A. Zhang, WaveCluster: A multi-resolution clustering approach for very large spatial databases, In Proc. 1998 Int. Conf. Very Large Data Bases (VLDB'98), 1998.
[11] R. Agrawal, J. Gehrke, D. Gunopulos, and P. Raghavan, Automatic subspace clustering of high dimensional data for data mining applications, In Proc. 1998 ACM-SIGMOD Int. Conf. Management of Data (SIGMOD'98), 1998.
[12] J. W. Shavlik and T. G. Dietterich, Readings in Machine Learning, 1990.
[13] T. Kohonen, Self-organized formation of topologically correct feature maps, Biological Cybernetics, 1982.
[14] Anthony K. H. Tung, Jean Hou, and Jiawei Han, Spatial clustering in the presence of obstacles, In Proc. 2001 Int. Conf. on Data Engineering (ICDE'01), 2001.
[15] Jean Hou, Clustering with obstacle entities, Simon Fraser University, 1999.
[16] Ayman El-Dessouki, Abd El Moniem Wahdan, and Lamia Fattouh Ibrahim, The Use of Knowledge-based System for Urban Telephone Planning, ITU/ITC/LAS Regional Seminar on Tele-Traffic Engineering for Arab States, Damascus, 1999.