
Clustering Algorithms Implementation on ATLaS

CS240B Project Report
Richard Luo
Prof. Carlo Zaniolo
2002/6

Abstract

In this project, we discuss clustering algorithms for spatial data mining, namely the partitioning algorithm PAM and the density-based algorithm DBSCAN, and illustrate their implementation on ATLaS, a database system that supports User-Defined Aggregates (UDAs). With UDAs, such clustering algorithms are convenient to implement. A spatial index structure such as the R-tree would significantly improve performance. Experiments with real data from SEQUOIA 2000 show that the performance of these implementations on ATLaS is satisfactory even without an R-tree index. However, some improvements to ATLaS would benefit the development of these clustering algorithms as well as other general data mining algorithms. We close with a proposal for improving the ATLaS system.

Introduction

Knowledge discovery is becoming more and more important in spatial databases, since increasingly large amounts of data obtained from satellite images, X-ray crystallography or other automatic equipment are stored in them. Several types of clustering algorithms have been proposed in the last few years, such as:

1) Partitioning algorithms: construct various partitions, then evaluate them by some criterion.
2) Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion.
3) Density-based algorithms: based on local connectivity and density functions.

In this report, we discuss the partitioning algorithm PAM and the density-based algorithm DBSCAN, and illustrate their implementation on the UDA database system ATLaS. With UDAs, such clustering algorithms are convenient to implement. A spatial index structure such as the R-tree would significantly improve performance. Experiments with real data from SEQUOIA 2000 show that the performance of these implementations on ATLaS is satisfactory even without an R-tree index.
Finally, we discuss some improvements to ATLaS which may benefit the development of these clustering algorithms as well as other general data mining algorithms. An ATLaS system improvement proposal is presented at the end of the report.

Clustering Algorithms

DBSCAN

The key idea of a density-based cluster is that for each point of a cluster its Eps-neighborhood, for some given Eps > 0, has to contain at least a minimum number of points, i.e. the “density” in the Eps-neighborhood of points has to exceed some threshold. Furthermore, the density within the areas of noise is lower than the density in any of the clusters.

This idea of “density-based clusters” can be generalized in two important ways. First, we can use any notion of a neighborhood instead of an Eps-neighborhood, as long as the definition of the neighborhood is based on a binary predicate NPred which is symmetric and reflexive. Second, instead of simply counting the objects in a neighborhood of an object, we can use other measures to define the “cardinality” of that neighborhood; we call this measure the weighted cardinality wCard.

A naive approach could require, for each object in a density-connected set, that the weighted cardinality of the NPred-neighborhood of that object is at least some value MinCard. However, this approach fails because there may be two kinds of objects in a density-connected set: objects inside the set (core objects) and objects “on the border” of the set (border objects). In general, the NPred-neighborhood of a border object has a significantly lower wCard than the NPred-neighborhood of a core object. Therefore, we would have to set MinCard to a relatively low value in order to include all objects belonging to the same density-connected set. This value, however, would not be characteristic for the respective density-connected set, particularly in the presence of noise objects.
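The difference between core and border objects can be checked on a small example. The following is a minimal sketch, assuming 2-D points, the Euclidean Eps-neighborhood as NPred, and a plain object count as wCard; the function name `eps_neighborhood`, the toy data and the values of `EPS` and `MIN_CARD` are illustrative assumptions, not taken from the experiments in this report.

```python
import math

def eps_neighborhood(points, q, eps):
    """Return all points within distance eps of q (q itself included)."""
    return [p for p in points if math.dist(p, q) <= eps]

# Hypothetical toy data: a tight group of four points, one point on its
# edge, and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (2.0, 1.0), (9.0, 9.0)]
EPS = 1.0
MIN_CARD = 3

# Cardinality of each neighborhood (plain counting stands in for wCard).
sizes = {p: len(eps_neighborhood(points, p, EPS)) for p in points}

# Core objects have a sufficiently large neighborhood of their own.
core = [p for p in points if sizes[p] >= MIN_CARD]

# A border object is not core itself but lies in the neighborhood of a core object.
border = [p for p in points
          if p not in core and any(math.dist(p, q) <= EPS for q in core)]

# Everything else is noise.
noise = [p for p in points if p not in core and p not in border]

print(sizes[(2.0, 1.0)], core, border, noise)
```

In this toy data set the point (2.0, 1.0) has only two objects in its own neighborhood, so a MinCard low enough to accept it directly would be 2, which hardly separates clusters from noise. It is, however, inside the neighborhood of the core object (1.0, 1.0).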
Therefore, we require that for every object p in a density-connected set C there is an object q in C such that p is inside the NPred-neighborhood of q and the weighted cardinality wCard of NPred(q) is at least MinCard. We also require the objects of the set C to be “connected” to each other.

PAM

PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw. To find k clusters, PAM's approach is to determine a representative object for each cluster. This representative object, called a medoid, is meant to be the most centrally located object within the cluster. Once the medoids have been selected, each non-selected object is grouped with the medoid to which it is the most similar. More precisely, if Oj is a non-selected object and Oi is a (selected) medoid, we say that Oj belongs to the cluster represented by Oi if d(Oj, Oi) = min over Oe of d(Oj, Oe), where the minimum is taken over all medoids Oe, and d(Oa, Ob) denotes the dissimilarity or distance between objects Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of a clustering (i.e. the combined quality of the chosen medoids) is measured by the average dissimilarity between an object and the medoid of its cluster.

To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap would result in an improvement of the quality of the clustering. In particular, to calculate the effect of such a swap between Oi and Oh, PAM computes costs Cjih for all non-selected objects Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations below.

First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be more similar to Oj2 than to Oh, i.e. d(Oj, Oh) >= d(Oj, Oj2), where Oj2 is the second most similar medoid to Oj.
Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster represented by Oj2. Hence, the cost of the swap as far as Oj is concerned is:

    Cjih = d(Oj, Oj2) - d(Oj, Oi)

This equation always gives a non-negative Cjih, indicating that there is a non-negative cost incurred in replacing Oi with Oh.

Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less similar to Oj2 than to Oh, i.e. d(Oj, Oh) < d(Oj, Oj2). Then, if Oi is replaced by Oh, Oj would belong to the cluster represented by Oh. Thus, the cost for Oj is given by:

    Cjih = d(Oj, Oh) - d(Oj, Oi)

Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.

Third Case: suppose that Oj currently belongs to a cluster other than the one represented by Oi. Let Oj2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj2 than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj2. Thus, the cost is:

    Cjih = 0

Fourth Case: Oj currently belongs to the cluster represented by Oj2, a cluster other than the one represented by Oi. But Oj is less similar to Oj2 than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of Oj2. Thus, the cost is:

    Cjih = d(Oj, Oh) - d(Oj, Oj2)

and is always negative.

Combining the four cases above, the total cost of replacing Oi with Oh is given by:

    TCih = sum over j of Cjih

We now present Algorithm PAM.

1. Select k representative objects arbitrarily.
2. Compute TCih for all pairs of objects Oi, Oh where Oi is currently selected and Oh is not.
3. Select the pair Oi, Oh which corresponds to the minimum TCih over all such pairs. If the minimum TCih is negative, replace Oi with Oh, and go back to Step (2).
4. Otherwise, for each non-selected object, find the most similar representative object. Halt.

R*-tree spatial index

In the following, we introduce a typical spatial index, the R*-tree.
The R*-tree generalizes the 1-dimensional B-tree to d-dimensional data spaces; specifically, an R*-tree manages d-dimensional hyperrectangles instead of 1-dimensional keys. An R*-tree may organize extended objects such as polygons using minimum bounding rectangles (MBRs) as approximations, as well as point objects as a special case of rectangles. The leaves store the MBRs of the data objects and a pointer to the exact geometry of the polygons. Internal nodes store a sequence of pairs consisting of a rectangle and a pointer to a child node. These rectangles are the MBRs of all data or directory rectangles stored in the subtree having the referenced child node as its root. To answer a region query, starting from the root, the set of rectangles intersecting the query region is determined and then their referenced child nodes are searched until the data pages are reached.

Fig. 1. R*-tree

The height of an R*-tree is O(log n) for a database of n objects in the worst case, and a query with a “small” query region has to traverse only a limited number of paths in the R*-tree.

Implementation

Although ATLaS is still in its alpha stage and provides only basic functionality, we find it convenient for implementing these clustering algorithms. User-Defined Aggregates (UDAs) provide a one-scan approach and flexible access to the database. In this section, we describe the implementation of two clustering algorithms on ATLaS: DBSCAN and PAM.

DBSCAN

    table setofpoints (x real, y real, ClId real);
    /* meaning of ClId: -1: unclassified, 0: noise, 1,2,3...: cluster */
    table nextid (ClusterId real);
    table seeds (sx real, sy real);

    insert into nextid values (1);
    load from dbscan.input into temp;
    insert into setofpoints select x, y, -1 from temp;

    select ExpandCluster(x, y, ClusterId, 1000, 4)
      from setofpoints, nextid
      where ClId <= 0;

The table setofpoints stores the coordinates and cluster ids of all points read from the input file dbscan.input.
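For orientation, the control flow that this script and the aggregates of this section express can also be written in an ordinary programming language. The following is a minimal sketch in Python, assuming 2-D points and a linear-scan region query (no R*-tree); it follows the general structure of DBSCAN as published by Ester et al. rather than the ATLaS code line by line, and the function names are our own. The label values reuse the ClId convention of the setofpoints table.

```python
import math

UNCLASSIFIED, NOISE = -1, 0   # same ClId convention as the setofpoints table

def region_query(points, i, eps):
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, p in enumerate(points) if math.dist(p, points[i]) <= eps]

def expand_cluster(points, labels, i, cluster_id, eps, min_pts):
    """Try to grow a cluster from point i; return True if a cluster was created."""
    seeds = region_query(points, i, eps)
    if len(seeds) < min_pts:          # i is not a core point
        labels[i] = NOISE
        return False
    for j in seeds:                   # i is a core point: claim its neighborhood
        labels[j] = cluster_id
    seeds.remove(i)
    while seeds:
        current = seeds.pop(0)
        result = region_query(points, current, eps)
        if len(result) >= min_pts:    # current is a core point as well
            for j in result:
                if labels[j] == UNCLASSIFIED:
                    seeds.append(j)   # only unclassified points become seeds
                if labels[j] in (UNCLASSIFIED, NOISE):
                    labels[j] = cluster_id
    return True

def dbscan(points, eps, min_pts):
    """Label every point with a cluster id (1, 2, ...) or NOISE."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 1
    for i in range(len(points)):
        if labels[i] == UNCLASSIFIED:
            if expand_cluster(points, labels, i, cluster_id, eps, min_pts):
                cluster_id += 1
    return labels

# Hypothetical toy input: two dense groups and one outlier.
demo = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11), (50, 50)]
print(dbscan(demo, eps=1.5, min_pts=3))   # [1, 1, 1, 1, 2, 2, 2, 2, 0]
```

With the parameters used in this report, the corresponding call would be `dbscan(points, eps=1000, min_pts=4)`.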
After initializing the cluster ids to -1, the script calls the main aggregate of this algorithm, ExpandCluster(), to expand a cluster from a point (x,y). We use global values of 4 for MinPoints and 1000 for Eps. The regionQuery() aggregate returns the Eps-neighborhood of point (qx,qy):

    aggregate regionQuery(qx real, qy real, eps real):(r1 real, r2 real)
    {
      INITIALIZE: ITERATE: {
        INSERT INTO return
          select x, y from setofpoints
          where (x-qx)*(x-qx) + (y-qy)*(y-qy) <= eps*eps;
      }
    }

In changeClId(), points which have been marked as NOISE may be changed later if they are density-reachable from some other point of the database. This happens for border points of a cluster. Those points are not added to the seeds, because we already know that a point with a ClId of NOISE is not a core point. Adding those points to the seeds would only result in additional region queries which would yield no new answers.

    aggregate changeClId (sx real, sy real, ClusterId real, Eps real, MinPts real):real
    {
      table result (rx real, ry real);
      table resultsize (size real);
      initialize: iterate: {
        /* Eps-neighborhood of the current seed (sx, sy) */
        insert into result select regionQuery(sx, sy, Eps);
        insert into resultsize select count(rx) from result;
        /* unclassified neighbors become seeds if the neighborhood has at least MinPts points */
        insert into seeds
          select rx, ry from result
          where (select size from resultsize) >= MinPts
            and (select ClId from setofpoints
                 where x=result.rx and y=result.ry) = -1;
        update setofpoints set ClId=ClusterId
          where SQLCODE=1
            and exists (select rx, ry from result)
            and (ClId=-1 or ClId=0);
        delete from seeds where seeds.sx=sx and seeds.sy=sy;
        delete from resultsize where 1=1;
      }
    }

    AGGREGATE ExpandCluster (ex real, ey real, ClusterId real, Eps real, MinPts real):real
    {
      table seedssize (size real);
      initialize: iterate: {
        insert into seeds select regionQuery(ex, ey, Eps);
        insert into seedssize select count(sx) from seeds;
        /* insert into stdout select ex, ey, size from seedssize; */
        /* fewer than MinPts points in the neighborhood: mark them as noise (ClId 0) */
        update setofpoints set ClId=0
          where exists (select sx from seeds
                        where sx=setofpoints.x and sy=setofpoints.y)
            and (select size from seedssize) < MinPts;
        update setofpoints set ClId=ClusterId
          where exists (select sx from seeds
                        where sx=setofpoints.x and sy=setofpoints.y)
            and SQLCODE=0;
        update nextid set ClusterId=ClusterId+1 where SQLCODE=1;
        delete from seeds where sx=ex and sy=ey and SQLCODE=1;
        select changeClId(sx, sy, ClusterId, Eps, MinPts) from seeds where SQLCODE=1;
        delete from seedssize where 1=1;
        delete from seeds where 1=1;
      }
    }

PAM

    table setofpoints (id int, x real, y real);
    table pointSize (psize int);
    table temp (x real, y real, name char(30));
    table temp1 (x real, y real);
    table mediod (mx real, my real);
    table i (i int);

    /* pick one random point as a medoid for each input row */
    aggregate randSel(size int):int
    {
      table randNo(no real);
      initialize: iterate: {
        insert into randNo values(rand()*size);
        insert into mediod
          select x, y from setofpoints, randNo
          where id-1 < no and no <= id;
        delete from randNo where 1=1;
      }
    }

    /* copy the points into setofpoints, numbering them 1, 2, 3, ... */
    AGGREGATE addid(ax real, ay real) : int
    {
      TABLE tmp(i int);
      INITIALIZE : {
        INSERT INTO tmp VALUES(1);
        INSERT INTO setofpoints values(1, ax, ay);
      }
      ITERATE : {
        UPDATE tmp SET i=i+1;
        INSERT INTO setofpoints SELECT i, ax, ay FROM tmp;
      }
    }

    /* return the row with the smallest cost c */
    aggregate mymin(c real, mx real, my real, x real, y real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table minCost(cc real, cmx real, cmy real, cx real, cy real);
      initialize: {
        insert into minCost values(c, mx, my, x, y);
      }
      iterate: {
        update minCost set cc=c, cmx=mx, cmy=my, cx=x, cy=y where c<cc;
      }
      terminate: {
        insert into return select cc, cmx, cmy, cx, cy from minCost;
      }
    }

    aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table cost(cost real);
      initialize: { }
      iterate: {
        update cost set cost = cost
          + sqrt((jx-hx)*(jx-hx)+(jy-hy)*(jy-hy))
          - sqrt((jx-ix)*(jx-ix)+(jy-iy)*(jy-iy));
      }
      terminate: {
        insert into return select cost, ix, iy, hx, hy from cost;
      }
    }

    aggregate updMediod(ix real, iy real, hx real, hy real):int
    {
      /*
      table cost(cc real, cmx real, cmy real, cx real, cy real);
      table minCost(cc real, cmx real, cmy real, cx real, cy real);
      (cmx, cmy) (ix, iy) selected mediod   --Oi in the paper
      (cx, cy)   (hx, hy) unselected object --Oh in the paper
      */
      initialize: iterate: {
        insert into cost select allCost(x, y, ix, iy, hx, hy) from setofpoints;
      }
      terminate: {
        insert into minCost select mymin(cc, cmx, cmy, cx, cy) from cost;
        delete from cost where 1=1;
        update mediod set mx = (select cx from minCost where cc<0),
                          my = (select cy from minCost where cc<0);
        select updMediod(mx, my, x, y)
          from mediod, setofpoints
          where SQLCODE = 1 and ((mx <> x) or (my <> y));
      }
    }

    load from pam.input into temp;
    select addid(x,y) from temp1;
    insert into pointSize select count(x) from setofpoints;
    insert into stdout select id, x, y from setofpoints;
    insert into i values (0), (0), (0);
    select randSel(psize) from i, pointSize;
    select updMediod(mx, my, x, y) from mediod, setofpoints where mx <> x or my <> y;
    insert into stdout select mx, my from mediod;

Experiment

To test the efficiency of the DBSCAN implementation on ATLaS, we use the SEQUOIA 2000 benchmark data. The SEQUOIA 2000 benchmark database uses real data sets that are typical for Earth Science tasks. There are four types of data in the database: raster data, point data, polygon data and directed graph data. The point data set contains 62,584 Californian names of landmarks, extracted from the US Geological Survey’s Geographic Names Information System, together with their location. The data set looks like this:

    -1651760,-833648,Corral Creek Campground
    -1853558,-861151,Corral De Piedra
    -1828216,-922899,Corral De Quati
    -1956635,-565741,Corral De Tierra (Palomares)
    -1953782,-569635,Corral De Tierra (Vasquez)
    -1920767,-690536,Corral Del Tierra (McCobb)
    ......

Even though we are not using an R-tree index in our current experiment, the result is still satisfactory. Since ATLaS currently does not support a large integer data type, we store the data as real values, which is another source of latency that could be improved.

    Points      In paper    On ATLaS
    3910        11          180
    5213        16          300
    6256        18          400
    62584       233         107

Fig. 2. Comparison of DBSCAN running time

It is interesting to note that the last experiment, which has the most points, is the fastest in our system. The reason is that we use global values of MinPoints and Eps. If the number of points is large enough, there are fewer clusters, so fewer calls of ExpandCluster() are involved.

ATLaS Improvement Proposal

In the sections above, we described the application of the ATLaS system to clustering algorithms and the experimental results. UDAs benefit the developer a lot. However, during our implementation we found that the following suggestion might improve the system’s flexibility and power.

Embedded C Standard

The idea of embedding SQL in a host language such as C is not new. But think of it the other way around! ATLaS currently conforms to the SQL syntax standard. SQL syntax is easy to write and understand, but sometimes it is not flexible and powerful enough, especially for algorithms that contain iteration or other C-language concepts, which is very common. The reasons for embedding standard C into ATLaS are:

1) ATLaS would become more powerful and flexible with embedded C.

Recall the implementation example for PAM. If we need to store a variable within an aggregate implementation, we have to create a table and an attribute:

    aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real, r2 real, r3 real, r4 real, r5 real)
    {
      table cost(cost real);
      ......
    }

Even if this is considered acceptable, consider how to implement an iteration. In our example, we use an indirect way, recursion:

    aggregate updMediod(ix real, iy real, hx real, hy real):int
    {
      ......
      initialize: iterate: { ...... }
      terminate: {
        ......
        select updMediod(mx, my, x, y)
          from mediod, setofpoints
          where SQLCODE = 1 and ((mx <> x) or (my <> y));
      }
    }

This is obviously not a straightforward approach. But SQL doesn't provide any means to do iteration.
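To make the contrast concrete, here is a minimal sketch of the PAM swap loop written with an ordinary loop in a host language. It is written in Python purely for brevity (the proposal itself concerns C), it assumes 2-D points with Euclidean distance and no duplicate points, and it computes the cost of a swap as the difference in total dissimilarity instead of evaluating the four cases separately. The function names and the `seed` parameter are our own assumptions.

```python
import math
import random

def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(math.dist(p, m) for m in medoids) for p in points)

def pam(points, k, seed=0):
    """Return k medoids found by repeatedly applying the best improving swap."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # step 1: arbitrary initial medoids
    while True:                              # a plain loop instead of recursion
        current = total_cost(points, medoids)
        best_delta, best_swap = 0.0, None
        for i in range(len(medoids)):        # step 2: every pair (Oi, Oh)
            for oh in points:
                if oh in medoids:
                    continue                 # Oh must be a non-selected object
                candidate = medoids[:i] + [oh] + medoids[i + 1:]
                delta = total_cost(points, candidate) - current
                if delta < best_delta:
                    best_delta, best_swap = delta, (i, oh)
        if best_swap is None:                # step 4: no improving swap, halt
            return medoids
        i, oh = best_swap                    # step 3: apply the best swap
        medoids[i] = oh

# Hypothetical toy input: two well separated groups of four points.
demo = [(0, 0), (0, 1), (1, 0), (1, 1),
        (10, 10), (10, 11), (11, 10), (11, 11)]
print(pam(demo, k=2))
```

The loop state (the current medoids and the best swap found so far) lives in ordinary variables, which is exactly what the table-plus-recursion pattern above has to emulate.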
However, if we embed C code into ATLaS code, all these problems can be solved by single C statements. Thus, the developer can write ATLaS programs in a more powerful and efficient way.

2) Not much overhead will be added.

ATLaS is built on BerkeleyDB. ATLaS code is first compiled into C code, which then makes use of BerkeleyDB's API. Therefore, every ATLaS file has a corresponding C file. If we embed C code into ATLaS code, it is not hard for the system to "move" it from the ATLaS file to the corresponding C file. The overhead of doing so will be small.

Conclusion

In this report, we discussed two clustering algorithms, the partitioning algorithm PAM and the density-based algorithm DBSCAN, and their implementation on ATLaS. Using the user-defined aggregates provided by the ATLaS system, we found it convenient to implement these clustering algorithms. A spatial index structure such as the R-tree would significantly improve performance. Even though we are not using an R-tree index in our current experiment, the result is still satisfactory.

Our future work will focus on improving the ATLaS system. During our implementation, we found that embedding C might improve the system for the following reasons:

1) ATLaS would become more powerful and flexible with embedded C.
2) Not much overhead will be added.

Reference

1. Ester M., Kriegel H.-P., Sander J. and Xu X. 1996. “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge Discovery and Data Mining, Portland, OR, 226-231.
2. Ramakrishnan R. and Gehrke J. “Database Management Systems (Second Edition)”. McGraw-Hill Companies, Inc.
3. Beckmann N., Kriegel H.-P., Schneider R. and Seeger B. 1990. “The R*-tree: An Efficient and Robust Access Method for Points and Rectangles”. Proc. ACM SIGMOD Int. Conf. on Management of Data, Atlantic City, NJ, 322-331.
4. Jain A.K. and Dubes R.C. 1988. “Algorithms for Clustering Data”. New Jersey: Prentice Hall.
5. Sander J., Ester M., Kriegel H.-P. and Xu X. 1998. “Density-Based Clustering in Spatial Databases: The Algorithm GDBSCAN and its Applications”. Data Mining and Knowledge Discovery, an Int. Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 169-194.
6. Wang H. and Zaniolo C. 2000. “Database System Extensions for Decision Support: the AXL Approach”. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge Discovery, 11-20.
7. Ng R.T. and Han J. 1994. “Efficient and Effective Clustering Methods for Spatial Data Mining”. VLDB, 144-155.
