A study of a grid- and density-based clustering algorithm
Sida Lin
Natural Science Foundation of Zhejiang, Hangzhou, P.R. China, 310007
E-mail: [email protected]
Abstract: This article discusses a grid- and density-based clustering algorithm for data mining. The algorithm gives up the concept of distance and takes a completely different approach. It can automatically find the subspaces containing the interesting patterns we want and discover all clusters in those subspaces. Besides, it performs well when dealing with high-dimensional data and scales well as the size of the data set increases.
Keywords: grid; density; clustering algorithm
1. The concept of clustering
Clustering aims to group all the records into different classes, or "clusters", when it is unknown how many classes exist in the target database, so that, under some similarity measure, similarity within a cluster is maximized and similarity between clusters is minimized. In fact, a large class of clustering algorithms bases similarity on distance, and because databases contain a variety of data types, there has been much discussion of how to measure the distance between two records that contain non-numeric fields, and corresponding algorithms have been put forward. In many applications, the members of each cluster obtained from cluster analysis can then be treated as a unit. Clustering algorithms can be divided into the following classes: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.
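To make the distance discussion concrete, here is a minimal sketch (mine, not the paper's) of one common way to measure the distance between two records that mix numeric and non-numeric fields: numeric fields contribute a normalized absolute difference, while categorical fields contribute 0 when equal and 1 when different (a Gower-style measure). The field names and ranges are illustrative only.

```python
# A minimal sketch (not from the paper) of a mixed-type record distance:
# numeric fields use a normalized absolute difference, categorical fields
# contribute 0 when equal and 1 when different.

def record_distance(a, b, numeric_ranges):
    """a, b: dicts mapping field name -> value.
    numeric_ranges: dict mapping numeric field name -> (min, max) over the data set."""
    total = 0.0
    for field in a:
        if field in numeric_ranges:
            lo, hi = numeric_ranges[field]
            total += abs(a[field] - b[field]) / (hi - lo) if hi > lo else 0.0
        else:
            total += 0.0 if a[field] == b[field] else 1.0
    return total / len(a)

r1 = {"age": 30, "city": "Hangzhou"}
r2 = {"age": 40, "city": "Beijing"}
print(record_distance(r1, r2, {"age": (20, 60)}))  # 0.5 * (10/40 + 1) = 0.625
```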
2. Basic concepts of the relevant clustering algorithms
1. Density-based methods: the basic difference between density-based methods and other methods is that they are based not on various kinds of distance but on density. They can thus overcome the disadvantage of distance-based methods, which can only find "circle-like" clusters. The guiding idea of this approach is: as long as the density of the points in a region is found to exceed a threshold, the region is added to the cluster it adjoins. Representative algorithms are the DBSCAN algorithm, the OPTICS algorithm, and the DENCLUE algorithm.
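As an illustration of the density idea these methods share, the following is a minimal sketch (mine, not taken from any of the named algorithms) of the core density test used in DBSCAN-style clustering: a point belongs to a dense region if enough points lie within a radius of it. The parameters eps and min_pts are the usual user-chosen ones.

```python
# A minimal sketch of the density test at the heart of DBSCAN-style methods:
# a point is a "core" point if at least min_pts points lie within distance
# eps of it, so clusters can grow along any shape, not just "circles".

import math

def is_core_point(p, points, eps, min_pts):
    """Return True if p has at least min_pts neighbors within radius eps."""
    neighbors = sum(1 for q in points if math.dist(p, q) <= eps)
    return neighbors >= min_pts

data = [(0, 0), (0.1, 0), (0, 0.1), (5, 5)]
print(is_core_point((0, 0), data, eps=0.5, min_pts=3))  # True
print(is_core_point((5, 5), data, eps=0.5, min_pts=3))  # False
```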
2. Grid-based methods: this kind of method first divides the data space into a grid structure of finitely many cells, and all processing then operates on the individual cells. One outstanding advantage of this approach is that processing is very fast: its speed usually has nothing to do with the number of records in the database, only with the number of cells into which the data space is divided. Representative algorithms are the STING algorithm, the CLIQUE algorithm, and the WaveCluster algorithm.
3. Implementation of the algorithm
Description of the problem: the function the algorithm must accomplish is the clustering function in the data mining process.
In a high-dimensional data space, the points are sparsely distributed, so it is hard to form well-supported clusters. We therefore consider performing the clustering analysis in subspaces, and the subspaces that become our targets of analysis emerge automatically during an iteration from low dimensions to high dimensions. The reason we choose subspaces, rather than forming new dimensions from some of the original dimensions by linear methods, is that the former result is easier for the user to interpret and understand. We apply a density-based approach, so this kind of cluster is defined as follows: a cluster is a region in which the density of points is greater than in the neighboring regions. The problem now is to automatically discover such subspaces in the source data space, such that projecting all the data records into the subspace forms regions of higher point density. To make the density of points easy to compute, we divide the data space into a grid (we do this by dividing every dimension of the data space into the same number of intervals, which means every unit has the same "volume", so computing the density of points in a unit reduces to simply counting the points), and then define the density of a cell as the number of points in it. We can now fix a threshold, and when the number of points in a unit exceeds the threshold, we say the unit is dense. Finally, a cluster is defined as a set of connected dense units.
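The grid construction and density test just described can be sketched as follows; this is a minimal illustration under the stated definitions, where xi (the number of intervals per dimension) and tau (the density threshold) are names I introduce for the two parameters.

```python
# A minimal sketch: map each point to its grid cell (xi equal-width
# intervals per dimension), count points per cell, and keep the cells
# whose count exceeds the threshold tau as the dense units.

from collections import Counter

def dense_units(points, mins, maxs, xi, tau):
    """points: list of tuples; mins/maxs: per-dimension bounds; xi: intervals
    per dimension; tau: density threshold. Returns {cell_index: count}."""
    counts = Counter()
    for p in points:
        cell = tuple(
            min(int((v - lo) / (hi - lo) * xi), xi - 1)  # clamp the max value into the last interval
            for v, lo, hi in zip(p, mins, maxs)
        )
        counts[cell] += 1
    return {cell: c for cell, c in counts.items() if c > tau}

pts = [(0.1, 0.1), (0.15, 0.12), (0.9, 0.9)]
print(dense_units(pts, mins=(0, 0), maxs=(1, 1), xi=10, tau=1))
# {(1, 1): 2} -- only the cell holding two points exceeds tau = 1
```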
The implementation of the algorithm consists of three steps: find the subspaces that contain clusters, find the clusters within those subspaces, and produce the descriptions of the clusters. These are discussed below:
(1) The first step: find the subspaces that contain clusters.
In order to accomplish this task, the most direct way is to enumerate all subspaces and count the points in every unit of each, which makes it clear whether clusters exist or not. But since this approach is infeasible for high-dimensional data, we can only resort to a bottom-up plan; before describing it, let us introduce a lemma that will be used later:
Lemma 1 (monotonicity): if a set of records S is a cluster in a k-dimensional space, then its projection onto any (k-1)-dimensional subspace is also a cluster.
The proof of the lemma is easy. That S is a cluster means it consists of a set of connected dense units. Given a dense unit in the k-dimensional space, project its points onto a (k-1)-dimensional subspace: they all land in the same unit, so that unit is also dense. This proves that all the units comprising S remain dense after projection; moreover, neighboring units remain neighboring after projection, so the projection is also a cluster.
The algorithm for this step is shown below (assume the data fields/dimensions are ordered; '<' denotes the dictionary order on the fields):
I: for k = 1, find all one-dimensional dense units by scanning the target database, and call the resulting set D1;
II: by means of the method below, produce the set Ck+1 of (k+1)-dimensional candidate dense units from the set Dk of k-dimensional dense units:
Insert into Ck+1
Select u1.[l1, h1], u1.[l2, h2], ..., u1.[lk, hk], u2.[lk, hk]
From Dk u1, Dk u2
Where u1.d1 = u2.d1, u1.l1 = u2.l1, u1.h1 = u2.h1,
      u1.d2 = u2.d2, u1.l2 = u2.l2, u1.h2 = u2.h2,
      ...,
      u1.dk-1 = u2.dk-1, u1.lk-1 = u2.lk-1, u1.hk-1 = u2.hk-1,
      u1.dk < u2.dk
III: if Ck+1 is empty, go to IV; otherwise, scan the whole target database once more and compute the count for each candidate unit. After deleting the non-dense units (the justification for doing this is the monotonicity lemma referred to above, which guarantees that every dense unit is still found, no matter which subspace it lies in), we obtain the set Dk+1; then set k to k+1 and go to II.
IV: the algorithm ends. The result, which is the goal of this step, is the set of highest-dimensional subspaces that contain clusters.
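The candidate-generation step II can be sketched as follows. This is a minimal illustration (the representation of a unit as sorted (dimension, interval) pairs is mine) mirroring the self-join above: two k-dimensional dense units that agree on their first k-1 dimensions and intervals join into a (k+1)-dimensional candidate.

```python
# A minimal sketch of the bottom-up candidate generation: units are tuples
# of (dimension, interval) pairs sorted by dimension; two k-dim dense units
# that share their first k-1 pairs join into one (k+1)-dim candidate, in
# the spirit of the SQL-like join above. Density is re-checked by the next
# database scan, as in step III.

from itertools import combinations

def generate_candidates(dense_k):
    """dense_k: set of k-dimensional units, each a tuple of (dim, interval)
    pairs sorted by dim. Returns the set of (k+1)-dimensional candidates."""
    candidates = set()
    for u1, u2 in combinations(sorted(dense_k), 2):
        # join condition: identical first k-1 (dim, interval) pairs,
        # and u1's last dimension strictly precedes u2's last dimension
        if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]:
            candidates.add(u1 + (u2[-1],))
    return candidates

# two 1-d dense units on dimensions 0 and 1 join into one 2-d candidate
d1 = {((0, 3),), ((1, 5),)}
print(generate_candidates(d1))  # {((0, 3), (1, 5))}
```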
(2) The second step: find the clusters in the subspaces.
The input of this step: a set D of dense units, all lying in the same subspace.
The output of this step: a partition {D1, D2, ..., Dq} of D that satisfies: all dense units within the same Di are connected with each other; units in different Di, Dj are not connected; Di ∩ Dj = ∅ for i ≠ j; and D1 ∪ D2 ∪ ... ∪ Dq = D.
This process is analogous to finding the connected components of a graph: take each unit in D as a vertex of the graph, and whenever two units are adjacent, draw an edge between them. We can then adopt the standard depth-first search algorithm, or breadth-first search, to complete this task. The key task, we realize, is therefore to build a data structure that can represent the graph. What I adopt is an adjacency matrix representing the graph: to obtain the matrix, I first give each dense unit (stored in a linked list) a serial number and build an index that supports random access, these serial numbers serving as the subscripts of an array of index pointers. Another point worth mentioning is that the DFS adopted in this process simulates the recursive calls with an explicit stack.
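The following minimal sketch (mine) illustrates this second step: dense units act as graph vertices, adjacency means the units differ by one interval in exactly one dimension, and an explicit-stack DFS collects the connected components, as described above.

```python
# A minimal sketch of step (2): treat dense units as graph vertices,
# connect units that differ by one interval in exactly one dimension,
# and collect connected components with an explicit-stack DFS.

def adjacent(u1, u2):
    """Units are tuples of (dim, interval) pairs over the same dimensions.
    They are adjacent if their intervals differ by 1 in exactly one dimension."""
    diffs = [abs(a[1] - b[1]) for a, b in zip(u1, u2)]
    return sum(diffs) == 1

def find_clusters(units):
    units = list(units)
    seen, clusters = set(), []
    for start in units:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:  # DFS with an explicit stack instead of recursion
            u = stack.pop()
            component.append(u)
            for v in units:
                if v not in seen and adjacent(u, v):
                    seen.add(v)
                    stack.append(v)
        clusters.append(component)
    return clusters

d = [((0, 1), (1, 1)), ((0, 2), (1, 1)), ((0, 5), (1, 5))]
print(len(find_clusters(d)))  # 2 -- the first two units touch, the third stands alone
```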
(3) The third step: produce the description of each cluster.
The input of this step is: a set of connected dense units in a k-dimensional subspace whose elements form one cluster C.
The output of this step is: a set R of regions (the concept of a region has already been defined above) such that every member of R is contained in C, and every dense unit of C is contained in at least one member of R.
There is no good exact algorithm for this problem, which has been proved NP-hard; that is, no polynomial-time algorithm is known. What is adopted here is a heuristic greedy algorithm, relying on the heuristic principle that seeking local optima will probably yield a better overall result; whether this algorithm is effective can therefore only be judged by observation. The algorithm takes two steps: first, find all maximal regions of dense units covering C (for the concept of a maximal region, see above), with the result that every dense unit of C is covered by at least one such maximal region; second, minimize the number of maximal regions, so that the set finally obtained can still cover all the dense units of C.
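The second half of this step, minimizing the number of covering regions, is an instance of set cover, and the following minimal sketch (mine) shows the usual greedy heuristic: repeatedly keep the region that covers the most still-uncovered dense units.

```python
# A minimal sketch of the second half of step (3): given maximal regions
# that together cover the cluster's dense units, greedily keep the region
# covering the most still-uncovered units until every unit is covered
# (the classic greedy heuristic for set cover).

def greedy_cover(units, regions):
    """units: set of dense units; regions: list of (region_id, set_of_units).
    Returns the ids of a small sub-collection of regions covering all units."""
    uncovered, chosen = set(units), []
    while uncovered:
        # pick the region that covers the most uncovered units
        rid, cover = max(regions, key=lambda r: len(r[1] & uncovered))
        if not cover & uncovered:
            break  # remaining units cannot be covered by any region
        chosen.append(rid)
        uncovered -= cover
    return chosen

units = {1, 2, 3, 4}
regions = [("A", {1, 2, 3}), ("B", {3, 4}), ("C", {4})]
print(greedy_cover(units, regions))  # ['A', 'B']
```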
4. The main advantages of this algorithm
1) It needs no assumption about the distribution of the input data in order to perform the clustering analysis, and the result it obtains is independent of the order of the input data records;
2) It does not need users to specify the subspaces of the original data in which to cluster; the algorithm can automatically discover the highest-dimensional subspaces in which clusters exist;
3) It scales well with the number of records handled, and it handles high-dimensional data objects well;
4) Although the attributes the algorithm handles are only numeric, it can be extended to other data types; it can discover clusters of any shape, and the clustering result is not sensitive to outliers.
5. Conclusion
The clustering algorithm this paper puts forward solves some difficulties that the original ones may face: for example, the user need not specify the subspace in which to carry out the mining, since the algorithm can discover worthwhile subspaces automatically; it scales well with the size of the data; and it can also cope with high dimensionality.
At the same time, this kind of algorithm has many problems that still need to be discussed thoroughly. First, there is a problem at a deeper level of the structure of clustering analysis: the problem of "meaning". For a given data table, whatever each column means in the concrete application, the clustering analysis the algorithm performs is formally the same. So this method may be useful, meaningful, and interpretable in one kind of application, while in another it may not be. Therefore, finding a way to let the user's knowledge influence the process of clustering mining would be very meaningful. Second, there is the analysis of the time complexity and the related efficiency. Future work will primarily aim at further research on the above aspects, so as to perfect this algorithm.