A study of the grid and density based algorithm clustering

Sida Lin
Natural Science Foundation of Zhejiang, Hangzhou, P.R. China, 310007
E-mail: [email protected]

Abstract: This article discusses a grid- and density-based clustering algorithm used in data mining. The algorithm gives up the concept of distance and takes a totally different approach: it can automatically find the subspaces that contain the interesting patterns we want and discover all the clusters in those subspaces. Besides, it performs well when dealing with high-dimensional data and scales well as the size of the data set increases.

Keywords: grid; density; clustering algorithm

1. The concept of clustering

Clustering aims to group all the records into classes, or "clusters", when it is unknown how many classes there are in the target database, maximizing the similarity within each cluster and minimizing the similarity between clusters according to some measure. In fact, a large class of clustering algorithms bases similarity on distance, and because of the variety of data types in databases, there has been much discussion of how to measure the distance between two records that contain fields of non-numeric type, and corresponding algorithms have been put forward. In many applications, the members of each cluster obtained from the cluster analysis can then be treated collectively as one group. Clustering algorithms can be divided into the following classes: partitioning methods, hierarchical methods, density-based methods, grid-based methods, and model-based methods.

2. The basic concepts of the related clustering algorithms

1. Density-based methods: the basic difference between density-based methods and the other ones is that they are based not on various kinds of distances but on density. Thus we can overcome the disadvantage of distance-based methods, which can only find "circle-like" clusters. The guiding idea of this kind of method is: as long as the density of the points in a region is found to be bigger than a threshold, the region is added to the cluster it adjoins. The representative algorithms are the DBSCAN, OPTICS, and DENCLUE algorithms.

2. Grid-based methods: this kind of method first divides the data space into a grid structure with a limited number of cells, and all of the processing targets single cells. An outstanding advantage of this processing is its speed: the processing time usually has nothing to do with the number of records in the database and depends only on how many cells the data space is divided into. The representative algorithms are the STING, CLIQUE, and WaveCluster algorithms.

3. The implementation of the algorithm

Description of the problem: the function the algorithm has to accomplish is the clustering function in the data mining process. In a high-dimensional data space, the points are sparsely distributed, so it is not easy to form clusters with high support. We therefore consider performing the clustering analysis in subspaces, and the subspaces that become our targets of analysis are generated automatically in a process that iterates from low dimensions to high dimensions. The reason we choose subspaces, rather than forming new dimensions from some of the existing dimensions by linear methods, as our targets of analysis is that the results of the former are easier for the user to interpret and understand.
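Before turning to the formal definitions, it may help to make the grid primitive that everything below builds on concrete. The following minimal Python sketch is not from the paper: the names cell_of and cell_densities and the parameters mins, maxs, and n_intervals are illustrative assumptions. It maps every record to a grid cell by equal-width discretization of each dimension and counts the points per cell.

    from collections import Counter

    def cell_of(point, mins, maxs, n_intervals):
        """Equal-width grid cell of one point: one interval index per dimension."""
        return tuple(
            min(int((x - lo) / (hi - lo) * n_intervals), n_intervals - 1)
            for x, lo, hi in zip(point, mins, maxs))

    def cell_densities(points, mins, maxs, n_intervals):
        """Density of a cell = the number of points that fall into it."""
        return Counter(cell_of(p, mins, maxs, n_intervals) for p in points)

Because every cell has the same volume, a cell can be called dense as soon as its raw point count exceeds a chosen threshold, with no further normalization.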
We apply a density-based algorithm, so the definition of a cluster here is: a cluster is a region in which the density of the points is bigger than in the neighboring regions. The problem is then how to automatically discover such subspaces in the source data space, i.e. subspaces into which the projection of the data records forms regions of higher point density. To make the density of the points easy to compute, we divide the data space into a grid: we divide every dimension of the data space into the same number of intervals, so that every unit has the same "volume" and computing the density of the points in a unit reduces to simply counting the points in it. We then define the number of points in a cell as the density of the cell. Next we fix a threshold: when the number of points in a unit exceeds the threshold, we say the unit is dense. Finally, a cluster is defined as a maximal set of connected dense units.

The implementation of the algorithm consists of three steps: determine the subspaces that contain clusters, determine the clusters within those subspaces, and produce a description of each cluster. They are discussed in turn below.

(1) The first step: find the subspaces that contain clusters. The most direct way to accomplish this task is to enumerate all the subspaces and count the points in every unit, which would make it clear whether clusters exist. Since this is infeasible for high-dimensional data, we resort to a bottom-up plan instead. Before presenting it, let us introduce a lemma that will be used later.

Lemma 1 (monotonicity). If a set of records S is a cluster in a k-dimensional space, then its projection into any (k-1)-dimensional subspace is also a cluster.

The proof of the lemma is easy. That S is a cluster means that it is composed of a set of connected dense units. Project the points of a unit that is dense in the k-dimensional space into a (k-1)-dimensional subspace: they all fall into the same unit, so that unit is also dense. This shows that all the units composing S remain dense after projection, and neighboring units remain neighboring after projection, so the projection of S is also a cluster.

The algorithm for this step is shown below (assume the data fields/dimensions are ordered, with '<' denoting the dictionary order of the fields):

I. For k = 1, find all one-dimensional dense units by scanning the target database, and call the resulting set D1.

II. Produce the set Ck+1 of (k+1)-dimensional candidate dense units from the set Dk of k-dimensional dense units by the following self-join:

    insert into Ck+1
    select u1.[l1, h1], u1.[l2, h2], ..., u1.[lk, hk], u2.[lk, hk]
    from Dk u1, Dk u2
    where u1.d1 = u2.d1, u1.l1 = u2.l1, u1.h1 = u2.h1,
          u1.d2 = u2.d2, u1.l2 = u2.l2, u1.h2 = u2.h2,
          ...,
          u1.dk-1 = u2.dk-1, u1.lk-1 = u2.lk-1, u1.hk-1 = u2.hk-1,
          u1.dk < u2.dk

III. If Ck+1 is empty, go to IV; otherwise scan the target database once more and compute the support of every candidate unit. After deleting the non-dense units (this is justified by the monotonicity lemma above, which guarantees that every dense unit is still found, no matter which subspace it lies in), we obtain the set Dk+1; then set k to k+1 and go to II.

IV. The algorithm ends. The maximal-dimensional subspaces that contain clusters have now been obtained, which is the goal of this step.
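A minimal Python sketch of this bottom-up search follows. It is an illustration under my own conventions, not the paper's code: a unit is represented as a tuple of (dimension, interval) pairs sorted by dimension, cells is one tuple of per-dimension interval indices per point (e.g. as produced by cell_of in the earlier sketch), and tau is the density threshold.

    from collections import Counter

    def one_dim_dense_units(cells, n_dims, tau):
        """Step I: count the points in every (dimension, interval) and keep the dense ones."""
        counts = Counter()
        for cell in cells:
            for d in range(n_dims):
                counts[((d, cell[d]),)] += 1
        return {u for u, n in counts.items() if n > tau}

    def candidate_join(dense_k):
        """Step II: self-join units that agree on their first k-1 (dimension,
        interval) pairs and satisfy u1.dk < u2.dk, mirroring the SQL above."""
        return {u1 + (u2[-1],)
                for u1 in dense_k for u2 in dense_k
                if u1[:-1] == u2[:-1] and u1[-1][0] < u2[-1][0]}

    def support(cells, unit):
        """Number of points whose cell matches the unit on all of its dimensions."""
        return sum(all(cell[d] == i for d, i in unit) for cell in cells)

    def find_dense_units(cells, n_dims, tau):
        """Steps I-IV: level-by-level bottom-up search for all dense units."""
        dense = one_dim_dense_units(cells, n_dims, tau)
        all_dense = set(dense)
        while dense:                                   # step IV: stop when nothing survives
            candidates = candidate_join(dense)         # step II
            dense = {u for u in candidates             # step III: prune candidates
                     if support(cells, u) > tau}       # by their support
            all_dense |= dense
        return all_dense

Note that, exactly as in step III, pruning by support is safe only because of the monotonicity lemma: a unit that is dense in k+1 dimensions projects to dense units in k dimensions, so the join can never miss it.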
(2) The second step: find the clusters in the subspaces.

The input of this step: a set D of dense units, all lying in the same subspace. The output of this step: a partition {D1, D2, ..., Dq} of D such that all dense units in the same Di are connected to each other, units in different Di and Dj are not connected, Di ∩ Dj = ∅ for i ≠ j, and D1 ∪ D2 ∪ ... ∪ Dq = D.

This process is equivalent to finding the connected components of a graph: take the units in D as the vertices of the graph, and draw an edge between two units whenever they are neighbors. We can then adopt the standard depth-first search or breadth-first search algorithm to complete this task (a minimal Python sketch of this search is given at the end of section 4). The key task, therefore, is to build a data structure that represents the graph. What I adopt is an adjacency matrix: to obtain it, I first give each cell unit stored in the chain a serial number and build an index that allows random access; the serial numbers of the cell units are the subscripts of the array of index pointers. One further point: the recursion of the DFS algorithm adopted here is simulated with an explicit stack.

(3) The third step: produce the description of each cluster.

The input of this step: a set of connected dense units in a k-dimensional subspace, whose elements form one cluster C. The output of this step: a set R of regions (the concept of a region has already been defined above) such that every member of R is contained in C, and every dense unit of C is contained in at least one member of R.

There is no good exact algorithm for this problem: it proves to be NP-hard, that is to say, no polynomial-time algorithm exists. A greedy heuristic algorithm is adopted here, relying on the heuristic principle that seeking local optima will probably lead to a good overall result; whether the algorithm is effective can only be judged by observation. The algorithm takes two steps. First, find all maximal regions of dense units that cover C (see the definition of a maximal region above); the result is that every dense unit of C is covered by at least one such maximal region. Second, minimize the number of maximal regions, so that the final set still covers all the dense units of C.

4. The main advantages of this algorithm

1) It needs no assumption about the distribution of the input data in order to perform the clustering analysis, and the result it obtains is independent of the order in which the data records are input;
2) It does not require the user to specify the subspaces of the original data in which to cluster; the algorithm discovers the existing highest-dimensional subspaces automatically;
3) It scales well with the amount of data to be handled and copes well with high-dimensional data;
4) Although the attributes the algorithm handles are only of numeric type, it can be extended to other data types; it can discover clusters of arbitrary shape, and the result of the clustering mining is not sensitive to outliers.
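As mentioned under the second step of section 3, here is a minimal Python sketch of the connected-component search. It follows the unit representation of the earlier sketches; for brevity the adjacency matrix described in the text is replaced by an on-the-fly neighborhood test, while the DFS recursion is simulated with an explicit stack, as in the text.

    def are_neighbors(u1, u2):
        """Two units of the same subspace are neighbors when they span the same
        dimensions and their interval indices differ by 1 in exactly one of them."""
        if [d for d, _ in u1] != [d for d, _ in u2]:
            return False
        return sum(abs(i1 - i2) for (_, i1), (_, i2) in zip(u1, u2)) == 1

    def connected_clusters(dense_units):
        """Step 2: partition the dense units of one subspace into connected
        components using a stack-based depth-first search."""
        unvisited = set(dense_units)
        partition = []
        while unvisited:
            stack = [unvisited.pop()]
            component = list(stack)
            while stack:
                u = stack.pop()
                for v in [v for v in unvisited if are_neighbors(u, v)]:
                    unvisited.discard(v)
                    stack.append(v)
                    component.append(v)
            partition.append(component)
        return partition

Each returned component is one Di of the partition {D1, ..., Dq} described above.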
5. Conclusion

The clustering algorithm put forward in this paper solves some of the difficulties the original algorithms may face: for example, the user need not specify the subspaces in which to mine, because the algorithm can discover worthwhile subspaces automatically; it scales well with the size of the data; and it can cope with high-dimensional data. At the same time, this kind of algorithm still has several problems that need to be discussed thoroughly. First, there is a problem at a deeper structural level of clustering analysis: the problem of "meaning". For a given data table, whatever each column means in the concrete application, the clustering analysis the algorithm performs is formally the same. So this method may be useful, meaningful, and interpretable in one kind of application, while in another it may not be. Therefore, finding a way to let the user's knowledge influence the process of clustering mining is very worthwhile. Second, there is the analysis of the time complexity and the related efficiency. Future work will primarily aim at further research on the above aspects, so as to perfect this algorithm.