Clustering Algorithms Implementation
on ATLaS
--CS240B Project Report
Richard Luo
Prof. Carlo Zaniolo
2002/6
Abstract
In this project, we discuss clustering algorithms for spatial data mining, namely the
partitioning algorithm PAM and the density-based algorithm DBSCAN, and illustrate their
implementation on ATLaS, a database system that supports User-Defined Aggregates (UDAs).
UDAs make it convenient to implement such clustering algorithms. A spatial index structure
called the R-tree would significantly improve performance. Experiments with real data from
SEQUOIA 2000 show that the implementation of these algorithms on ATLaS is satisfactory even
without an R-tree index. However, some improvements to ATLaS would benefit the development
of these clustering algorithms as well as other general data mining algorithms. A proposal
for improving the ATLaS system is presented at the end.
Introduction
Knowledge discovery is becoming more and more important in spatial databases, since increasingly
large amounts of data obtained from satellite images, X-ray crystallography or other automatic
equipment are stored in them. Several types of clustering algorithms have been proposed
in the last few years, such as:
1) Partitioning algorithms: construct various partitions, then evaluate them by some criterion
2) Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects)
using some criterion
3) Density-based algorithms: based on local connectivity and density functions
In this report, we discuss the partitioning algorithm PAM and the density-based algorithm DBSCAN.
Their implementation on ATLaS, a database system that supports User-Defined Aggregates (UDAs),
is illustrated. UDAs make it convenient to implement such clustering algorithms. A spatial
index structure called the R-tree would significantly improve performance. Experiments with real
data from SEQUOIA 2000 show that the implementation of these algorithms on ATLaS is satisfactory
even without an R-tree index.
Finally, we discuss some improvements to ATLaS which may benefit the development of
these clustering algorithms as well as other general data mining algorithms. A proposal for
improving the ATLaS system is presented at the end.
Clustering Algorithms
DBSCAN
The key idea of a density-based cluster is that for each point of a cluster its Eps-neighborhood
for some given Eps > 0 has to contain at least a minimum number of points, i.e. the “density” in
the Eps-neighborhood of points has to exceed some threshold. Furthermore, the density within
the areas of noise is lower than the density in any of the clusters.
This idea of “density-based clusters” can be generalized in two important ways. First, we can use
any notion of a neighborhood instead of an Eps-neighborhood, provided the neighborhood is
defined by a binary predicate NPred which is symmetric and reflexive. Second, instead
of simply counting the objects in the neighborhood of an object, we can use other measures
to define the “cardinality” of that neighborhood, i.e. a weighted cardinality wCard.
A naive approach could require, for each object in a density-connected set, that the weighted
cardinality of the NPred-neighborhood of that object is at least some value MinCard. However, this
approach fails because there may be two kinds of objects in a density-connected set: objects
inside the set (core objects) and objects “on the border” of the set (border objects). In
general, the NPred-neighborhood of a border object has a significantly lower wCard than the
NPred-neighborhood of a core object. Therefore, we would have to set MinCard to a
relatively low value in order to include all objects belonging to the same density-connected set.
This value, however, would not be characteristic of the respective density-connected set,
particularly in the presence of noise objects. Instead, we require that for every object p in a
density-connected set C there is an object q in C such that p is inside the NPred-neighborhood
of q and the weighted cardinality wCard of NPred(q) is at least MinCard. We also require the
objects of the set C to be “connected” to each other.
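The core/border/noise distinction above can be sketched in a few lines of Python. This is a minimal illustration, not the ATLaS implementation: it assumes a plain Euclidean Eps-neighborhood with simple counting as the cardinality, uses a linear scan instead of a spatial index, and the names `region_query` and `dbscan` are chosen here for illustration. The label values follow the ClId convention of the implementation section (-1 unclassified, 0 noise, 1, 2, 3... clusters).

```python
from math import hypot

UNCLASSIFIED, NOISE = -1, 0  # assumed label convention: -1 unclassified, 0 noise


def region_query(points, i, eps):
    """Indices of all points within distance eps of points[i] (linear scan)."""
    px, py = points[i]
    return [j for j, (x, y) in enumerate(points) if hypot(x - px, y - py) <= eps]


def dbscan(points, eps, min_pts):
    """Return one label per point: a cluster id (1, 2, ...) or NOISE (0)."""
    labels = [UNCLASSIFIED] * len(points)
    cluster_id = 0
    for i in range(len(points)):
        if labels[i] != UNCLASSIFIED:
            continue
        neighbors = region_query(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = NOISE  # not a core point; may become a border point later
            continue
        cluster_id += 1
        labels[i] = cluster_id
        seeds = [j for j in neighbors if j != i]
        while seeds:
            j = seeds.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point: reachable from a core point
                continue
            if labels[j] != UNCLASSIFIED:
                continue  # already assigned to a cluster
            labels[j] = cluster_id
            j_neighbors = region_query(points, j, eps)
            if len(j_neighbors) >= min_pts:  # j is itself a core point
                seeds.extend(j_neighbors)
    return labels
```

In the second test case below, the end points of a short line are first marked as noise or left unclassified and are then absorbed as border points once a core point reaches them.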
PAM
PAM (Partitioning Around Medoids) was developed by Kaufman and Rousseeuw. To find k
clusters, PAM's approach is to determine a representative object for each cluster. This
representative object, called a medoid, is meant to be the most centrally located object within the
cluster. Once the medoids have been selected, each non-selected object is grouped with the
medoid to which it is the most similar. More precisely, if Oj is a non-selected object and Oi is a
(selected) medoid, we say that Oj belongs to the cluster represented by Oi if
d(Oj, Oi) = min_Oe d(Oj, Oe), where min_Oe denotes the minimum over all
medoids Oe, and d(Oa, Ob) denotes the dissimilarity or distance between objects
Oa and Ob. All the dissimilarity values are given as inputs to PAM. Finally, the quality of
a clustering (i.e. the combined quality of the chosen medoids) is measured by the average
dissimilarity between an object and the medoid of its cluster.
To find the k medoids, PAM begins with an arbitrary selection of k objects. Then in each step, a
swap between a selected object Oi and a non-selected object Oh is made, as long as such a swap
would result in an improvement of the quality of the clustering. In particular, to calculate the
effect of such a swap between Oi and Oh, PAM computes costs Cjih for all non-selected objects
Oj. Depending on which of the following cases Oj is in, Cjih is defined by one of the equations
below:
First Case: suppose Oj currently belongs to the cluster represented by Oi. Furthermore, let Oj be
more similar to Oj2 than to Oh, i.e. d(Oj, Oh) >= d(Oj, Oj2), where Oj2 is the second most
similar medoid to Oj. Thus, if Oi is replaced by Oh as a medoid, Oj would belong to the cluster
represented by Oj2. Hence, the cost of the swap as far as Oj is concerned is:
Cjih = d(Oj, Oj2) - d(Oj, Oi)
This equation always gives a non-negative Cjih, indicating that there is a non-negative cost
incurred in replacing Oi with Oh.
Second Case: Oj currently belongs to the cluster represented by Oi. But this time, Oj is less
similar to Oj2 than to Oh, i.e. d(Oj, Oh) < d(Oj, Oj2). Then, if Oi is replaced by Oh, Oj would
belong to the cluster represented by Oh. Thus, the cost for Oj is given by:
Cjih = d(Oj, Oh) - d(Oj, Oi)
Cjih here can be positive or negative, depending on whether Oj is more similar to Oi or to Oh.
Third Case: suppose that Oj currently belongs to a cluster other than the one represented by Oi.
Let Oj2 be the representative object of that cluster. Furthermore, let Oj be more similar to Oj2
than to Oh. Then even if Oi is replaced by Oh, Oj would stay in the cluster represented by Oj2.
Thus, the cost is:
Cjih = 0
Fourth Case: Oj currently belongs to the cluster represented by Oj2. But Oj is less similar to Oj2
than to Oh. Then replacing Oi with Oh would cause Oj to jump to the cluster of Oh from that of
Oj2. Thus, the cost is:
Cjih = d(Oj, Oh) - d(Oj, Oj2)
and is always negative.
Combining the four cases above, the total cost of replacing Oi with Oh is given by:
TCih = sum of Cjih over all non-selected objects Oj
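The four cases can be written down directly. The following Python sketch assumes points in the plane with Euclidean distance as the dissimilarity d; the function name `swap_cost` and the index-based representation of medoids are illustrative choices, not part of the report. As in the text, the sum runs over the non-selected objects only (which includes Oh itself).

```python
from math import dist, inf  # math.dist requires Python 3.8+


def swap_cost(points, medoids, i, h):
    """Total cost TCih of replacing medoid index i by non-medoid index h.

    points  : list of (x, y) tuples
    medoids : list of indices into points (the currently selected objects)
    """
    others = [m for m in medoids if m != i]
    total = 0.0
    for j, pj in enumerate(points):
        if j in medoids:
            continue  # only non-selected objects Oj contribute
        d_i = dist(pj, points[i])
        d_h = dist(pj, points[h])
        # distance to the nearest medoid other than Oi (this is Oj2 in all four cases)
        d_j2 = min((dist(pj, points[m]) for m in others), default=inf)
        if d_i <= d_j2:              # Oj currently belongs to the cluster of Oi
            if d_h >= d_j2:
                total += d_j2 - d_i  # first case: Oj moves to Oj2
            else:
                total += d_h - d_i   # second case: Oj moves to Oh
        else:                        # Oj currently belongs to the cluster of Oj2
            if d_h < d_j2:
                total += d_h - d_j2  # fourth case: Oj jumps to Oh
            # third case: Oj stays with Oj2, cost 0
    return total
```

A negative return value means the swap lowers the summed dissimilarity of the non-selected objects.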
We now present Algorithm PAM.
1. Select k representative objects arbitrarily.
2. Compute TCih for all pairs of objects Oi, Oh where Oi is currently selected, and Oh is
not.
3. Select the pair Oi, Oh with the minimum TCih over all such pairs. If this minimum TCih is
negative, replace Oi with Oh, and go back to Step (2).
4. Otherwise, for each non-selected object, find the most similar representative object.
Halt.
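The four steps above can be sketched as a short Python loop. This is a simplified illustration under stated assumptions: points are 2-D tuples with Euclidean distance, the "arbitrary" initial selection defaults to the first k points, and each candidate swap is scored by recomputing the total dissimilarity rather than by the four-case formula, which is simpler to read but slower. The names `total_cost` and `pam` are hypothetical.

```python
from itertools import product
from math import dist  # Python 3.8+


def total_cost(points, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)


def pam(points, k, initial=None):
    """Return (medoid indices, per-point index of the assigned medoid)."""
    # Step 1: arbitrary initial selection (assumption: the first k points)
    medoids = list(initial) if initial is not None else list(range(k))
    while True:
        current = total_cost(points, medoids)
        best_delta, best_medoids = 0.0, None
        # Step 2: evaluate every (selected Oi, non-selected Oh) pair
        for pos, h in product(range(k), range(len(points))):
            if h in medoids:
                continue
            candidate = medoids[:pos] + [h] + medoids[pos + 1:]
            delta = total_cost(points, candidate) - current
            if delta < best_delta:
                best_delta, best_medoids = delta, candidate
        # Step 3: apply the best swap only if it improves the clustering
        if best_medoids is None:
            break
        medoids = best_medoids
    # Step 4: assign each object to its most similar medoid
    assignment = [min(medoids, key=lambda m: dist(p, points[m])) for p in points]
    return medoids, assignment
```

Because a swap is accepted only when it strictly lowers the total cost, the loop cannot revisit a previous set of medoids and therefore terminates.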
R*-tree spatial index
In the following, we introduce a typical spatial index, the R*-tree. The R*-tree generalizes
the 1-dimensional B-tree to d-dimensional data spaces: an R*-tree manages
d-dimensional hyperrectangles instead of 1-dimensional keys. An R*-tree may organize
extended objects such as polygons, using minimum bounding rectangles (MBRs) as
approximations, as well as point objects as a special case of rectangles. The leaves store the
MBRs of the data objects and a pointer to the exact geometry of the polygons.
Internal nodes store a sequence of pairs consisting of a rectangle and a pointer to a child node.
These rectangles are the MBRs of all data or directory rectangles stored in the subtree having the
referenced child node as its root. To answer a region query, starting from the root, the set of
rectangles intersecting the query region is determined and then their referenced child nodes are
searched until the data pages are reached.
Fig 1. R*-tree
The height of an R*-tree is O(log n) for a database of n objects in the worst case and a query
with a “small” query region has to traverse only a limited number of paths in the R*-tree.
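The region query described above can be illustrated with a small Python sketch. It assumes a hand-built tree of 2-D rectangles and only shows the top-down search over MBRs; the insertion, split and reinsertion strategies that distinguish an actual R*-tree are not implemented, and the names `Node`, `make_entry`, `make_node` and `region_query` are illustrative.

```python
from dataclasses import dataclass, field
from typing import Any, List, Tuple

Rect = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def intersects(a: Rect, b: Rect) -> bool:
    """True if the two rectangles overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


@dataclass
class Node:
    rect: Rect                                   # MBR of everything below this node
    children: List["Node"] = field(default_factory=list)
    payload: Any = None                          # set only on data entries


def make_entry(rect: Rect, payload: Any) -> Node:
    """Data entry: an MBR plus a reference to the stored object."""
    return Node(rect=rect, payload=payload)


def make_node(children: List[Node]) -> Node:
    """Directory node whose rectangle is the MBR of its children."""
    rects = [c.rect for c in children]
    bounding = (min(r[0] for r in rects), min(r[1] for r in rects),
                max(r[2] for r in rects), max(r[3] for r in rects))
    return Node(rect=bounding, children=children)


def region_query(node: Node, query: Rect) -> List[Any]:
    """Collect payloads of all data entries whose MBR intersects the query."""
    if not intersects(node.rect, query):
        return []          # prune the whole subtree
    if not node.children:
        return [node.payload]
    hits: List[Any] = []
    for child in node.children:
        hits.extend(region_query(child, query))
    return hits
```

Points are stored as degenerate rectangles, for example `make_entry((3, 4, 3, 4), "p")`. Subtrees whose MBR misses the query region are skipped entirely, which is why a small query region touches only a few paths.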
Implementation
Although ATLaS is still in its alpha stage and provides only basic functionality, we
find it convenient for implementing these clustering algorithms. User-Defined Aggregates (UDAs)
provide a one-scan approach and flexible access to the database. In this section, we
describe the implementation of two clustering algorithms on ATLaS: DBSCAN and PAM.
DBSCAN
table setofpoints (x real, y real, ClId real);
/* meaning of ClId: -1: unclassified, 0: noise, 1,2,3...: cluster*/
table nextid(ClusterId real);
table seeds (sx real, sy real);
insert into nextid values (1);
load from dbscan.input into temp;
insert into setofpoints
select x, y, -1
from temp;
select ExpandCluster(x, y, ClusterId, 1000, 4)
from setofpoints, nextid
where ClId<=0;
The table setofpoints stores the coordinates and cluster ids of all points read from the input file
dbscan.input. After initializing every cluster id to -1, the program calls the main aggregate of
the algorithm, ExpandCluster(), to expand a cluster from any point (x,y). We use the global
parameter values MinPts = 4 and Eps = 1000.
The regionQuery() aggregate returns the Eps-neighborhood of point (qx,qy):
aggregate regionQuery(qx real, qy real, eps real):(r1 real,r2 real)
{
INITIALIZE: ITERATE:
{
    INSERT INTO return
    select x, y from setofpoints
    where (x-qx)*(x-qx) + (y-qy)*(y-qy) <= eps*eps;
}
}
In changeClId(), points which have been marked as NOISE may be changed later if they
are density-reachable from some other point of the database. This happens for border points of a
cluster. Those points are not added to the seeds because we already know that a point with a ClId
of NOISE is not a core point. Adding those points to the seeds would only result in additional
region queries which would yield no new answers.
aggregate changeClId (sx real, sy real, ClusterId real, Eps real, MinPts real):real
{
table result (rx real, ry real);
table resultsize (size real);
initialize:
iterate:
{
insert into result select regionQuery(sx, sy, Eps);
insert into resultsize select count(rx) from result;
insert into seeds select rx, ry from result
where (select size from resultsize)>=MinPts
and (select ClId from setofpoints where x=result.rx and y=result.ry)=-1;
update setofpoints set ClId=ClusterId where SQLCODE=1
and exists (select rx,ry from result) and (ClId=-1 or ClId=0);
delete from seeds where seeds.sx=sx and seeds.sy=sy;
delete from resultsize where 1=1;
}
}
AGGREGATE ExpandCluster (ex real, ey real, ClusterId real, Eps real, MinPts real):real
{
table seedssize (size real);
initialize:
iterate:
{
insert into seeds select regionQuery (ex, ey, Eps);
insert into seedssize select count(sx) from seeds;
/*
insert into stdout select ex, ey, size from seedssize;*/
update setofpoints set ClId=0
where exists (select sx from seeds where sx=setofpoints.x and sy=setofpoints.y)
and (select size from seedssize)<MinPts;
update setofpoints set ClId=ClusterId
where exists (select sx from seeds where sx=setofpoints.x and
sy=setofpoints.y)
and SQLCODE=0;
update nextid set ClusterId=ClusterId+1 where SQLCODE=1;
delete from seeds where sx=ex and sy=ey and SQLCODE=1;
select changeClId (sx, sy, ClusterId, Eps, MinPts) from seeds
where SQLCODE=1;
delete from seedssize where 1=1;
delete from seeds where 1=1;
}
}
PAM
table setofpoints (id int, x real, y real);
table pointSize (psize int);
table temp (x real, y real, name char(30));
table temp1 (x real, y real);
table mediod(mx real, my real);
table i(i int);
aggregate randSel(size int):int
{
table randNo(no real);
initialize:iterate:
{
insert into randNo values(rand()*size);
insert into mediod
select x, y
from setofpoints, randNo
where id-1 < no and no <= id;
delete from randNo where 1=1;
}
}
AGGREGATE addid(ax real, ay real) : int
{
TABLE tmp(i int);
INITIALIZE :
{
INSERT INTO tmp VALUES(1);
INSERT INTO setofpoints values(1, ax, ay);
}
ITERATE :
{
UPDATE tmp SET i=i+1;
INSERT INTO setofpoints
SELECT i, ax, ay FROM tmp;
}
}
aggregate mymin(c real, mx real, my real, x real, y real):(r1 real,r2 real,r3 real,r4 real,r5 real)
{
table minCost(cc real, cmx real, cmy real, cx real, cy real);
initialize:
{
insert into minCost values(c, mx, my, x, y);
}
iterate:
{
update minCost
set cc=c, cmx=mx, cmy = my, cx = x, cy = y
where c<cc;
}
terminate:
{
insert into return
select cc, cmx, cmy, cx, cy
from minCost;
}
}
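aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real,r2 real,r3 real,r4 real,r5 real)
{
table cost(cost real);
initialize:
{
/* seed the running total with the first point's cost, so that the
   update in iterate has a row to modify */
insert into cost values(
sqrt((jx-hx)*(jx-hx)+(jy-hy)*(jy-hy))-sqrt((jx-ix)*(jx-ix)+(jy-iy)*(jy-iy)));
}
iterate:
{
/* add d(Oj, Oh) - d(Oj, Oi) for the current point Oj = (jx, jy) */
update cost
set cost = cost +
sqrt((jx-hx)*(jx-hx)+(jy-hy)*(jy-hy))-sqrt((jx-ix)*(jx-ix)+(jy-iy)*(jy-iy));
}
terminate:
{
insert into return
select cost, ix, iy, hx, hy from cost;
}
}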
aggregate updMediod(ix real, iy real, hx real, hy real):int
{
/* cost and minCost are used below and declared nowhere else, so they are declared here */
table cost(cc real, cmx real, cmy real, cx real, cy real);
table minCost(cc real, cmx real, cmy real, cx real, cy real);
/*
(cmx, cmy) = (ix, iy): selected mediod, Oi in the paper
(cx, cy) = (hx, hy): unselected object, Oh in the paper
*/
initialize:iterate:
{
insert into cost
select allCost(x,y, ix, iy, hx, hy)
from setofpoints;
}
terminate:
{
insert into minCost
select mymin(cc, cmx, cmy, cx, cy) from cost;
delete from cost where 1=1;
update mediod
set mx = (select cx from minCost where cc<0),
my = (select cy from minCost where cc<0);
select updMediod(mx, my, x, y)
from mediod, setofpoints
where SQLCODE = 1 and ((mx <> x) or (my <> y));
}
}
load from pam.input into temp;
select addid(x,y)
from temp1;
insert into pointSize
select count(x)
from setofpoints;
insert into stdout select id, x, y from setofpoints;
insert into i values(0),(0),(0);
select randSel(psize)
from i, pointSize;
select updMediod(mx, my, x, y)
from mediod,setofpoints
where mx <> x or my <> y;
insert into stdout select mx,my from mediod;
Experiment
To test the efficiency of the DBSCAN implementation on ATLaS, we use the SEQUOIA
2000 benchmark data. The SEQUOIA 2000 benchmark database uses real data sets that
are typical for Earth Science tasks. There are four types of data in the database: raster data,
point data, polygon data and directed graph data. The point data set contains 62,584 Californian
names of landmarks, extracted from the US Geological Survey’s Geographic Names Information
System, together with their location.
The data set looks like this:
-1651760,-833648,Corral Creek Campground
-1853558,-861151,Corral De Piedra
-1828216,-922899,Corral De Quati
-1956635,-565741,Corral De Tierra (Palomares)
-1953782,-569635,Corral De Tierra (Vasquez)
-1920767,-690536,Corral Del Tierra (McCobb)
......
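For readers who want to experiment with the same input outside ATLaS, the record layout shown above (x, y, name, comma-separated) can be read with a few lines of Python. This is a convenience sketch only: the function name `load_points` is hypothetical, the sample is copied from the lines above, and rejoining any extra commas into the name field is an assumption about the file rather than something the report states.

```python
import csv
from io import StringIO

# A few records copied from the sample above.
SAMPLE = """-1651760,-833648,Corral Creek Campground
-1853558,-861151,Corral De Piedra
-1956635,-565741,Corral De Tierra (Palomares)
"""


def load_points(stream):
    """Parse 'x,y,name' records into (x, y, name) tuples, skipping short lines."""
    points = []
    for row in csv.reader(stream):
        if len(row) < 3:
            continue  # blank or truncated line
        x, y = float(row[0]), float(row[1])
        name = ",".join(row[2:]).strip()  # assumption: extra commas belong to the name
        points.append((x, y, name))
    return points


points = load_points(StringIO(SAMPLE))
```

For clustering, only the first two fields of each tuple are needed.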
Even though we do not use an R-tree index in our current experiment, the result is still
satisfactory. Since ATLaS currently does not support a large integer data type, we store the
coordinates as reals, which is another source of overhead that could be removed.
Points      In paper    On ATLaS
3910        11          180
5213        16          300
6256        18          400
62584       233         107
Fig. 2 Comparison of DBSCAN running time
It is interesting to note that the last experiment, which has the most points, is the fastest on
our system. The reason is that we use the same global values of MinPts and Eps in every run. If
the number of points is large enough, there are fewer clusters, so fewer calls of
ExpandCluster() may be involved.
ATLaS Improvement Proposal
In the sections above, we described the application of the ATLaS system to clustering algorithms
and the experimental results. UDAs clearly benefit developers. However, during
our implementation, we found that the following suggestion might improve the
system’s flexibility and power.
Embedded C Standard
The idea of embedded SQL called from a host language such as C is neither new nor exciting. But
consider it the other way around!
ATLaS currently conforms to the SQL syntax standard. SQL syntax is easy to write and
understand, but sometimes it is not flexible or powerful enough, especially for
algorithms containing iteration or other C-language concepts, which are very common.
The reasons for embedding C standard into ATLaS are:
1) ATLaS would become more powerful and flexible with embedded C;
Recall the implementation example for PAM. If we need to store a variable within an
aggregate implementation, we have to create a table and an attribute:
aggregate allCost(jx real, jy real, ix real, iy real, hx real, hy real):(r1 real,r2 real,r3 real,r4 real,r5 real)
{
table cost(cost real);
......
}
Even if this is considered acceptable, consider how an iteration has to be implemented.
In our example, we use an indirect way, namely recursion:
aggregate updMediod(ix real, iy real, hx real, hy real):int
{
......
initialize:iterate:
{
......
}
terminate:
{
......
select updMediod(mx, my, x, y)
from mediod, setofpoints
where SQLCODE = 1 and ((mx <> x) or (my <> y));
}
}
This is obviously not a straightforward approach, but SQL does not provide
any means of iteration. If we could embed C code into ATLaS code, each of these
problems could be solved by a single C statement. Thus, developers could write ATLaS
programs in a more powerful and efficient way.
2) Not much overhead will be added.
ATLaS is built on BerkeleyDB. ATLaS code is first compiled into a C code file, which
then makes use of BerkeleyDB's API. Therefore, every ATLaS file has a corresponding C
file. If we embed C code into ATLaS code, it is not hard for the system to "move"
it from the ATLaS file to the C file. The overhead for this will be small.
Conclusion
In this report, we discussed two clustering algorithms, the partitioning algorithm PAM and the
density-based algorithm DBSCAN, and their implementation on ATLaS. Using the user-defined
aggregates provided by the ATLaS system, we found it convenient to implement these clustering
algorithms. A spatial index structure called the R-tree would significantly improve performance.
Even though we do not use an R-tree index in our current experiment, the result is still
satisfactory.
Our future work will focus on improving the ATLaS system. During our implementation,
we found that embedding C might improve the system for the following
reasons:
1) ATLaS would become more powerful and flexible with embedded C;
2) Not much overhead will be added.
References
1. Ester M., Kriegel H.-P., Sander J. and Xu X. 1996. “A Density-Based Algorithm for
Discovering Clusters in Large Spatial Databases with Noise”. Proc. 2nd Int. Conf. on Knowledge
Discovery and Data Mining. Portland, OR, 226-231.
2. Ramakrishnan R. and Gehrke J. “Database Management Systems (Second Edition)”.
McGraw-Hill Companies, Inc.
3. Beckmann N., Kriegel H.-P., Schneider R. and Seeger B. 1990. “The R*-tree: An Efficient
and Robust Access Method for Points and Rectangles”. Proc. ACM SIGMOD Int. Conf. on
Management of Data. Atlantic City, NJ, 322-331.
4. Jain A.K. and Dubes R.C. 1988. “Algorithms for Clustering Data”. New Jersey: Prentice Hall.
5. Sander J., Ester M., Kriegel H.-P. and Xu X. 1998. “Density-Based Clustering in Spatial
Databases: The Algorithm GDBSCAN and its Applications”. Data Mining and Knowledge Discovery,
an Int. Journal, Kluwer Academic Publishers, Vol. 2, No. 2, 169-194.
6. Wang H. and Zaniolo C. 2000. “Database System Extensions for Decision Support: the AXL
Approach”. ACM SIGMOD Workshop on Research Issues in Data Mining and Knowledge
Discovery, 11-20.
7. Ng R.T. and Han J. 1994. “Efficient and Effective Clustering Methods for Spatial Data
Mining”. VLDB, 144-155.