Mr. Chetan J. Awati & Prof. D.G. Chougule
Clustering Algorithm for 2D Multi-Density Large Dataset Using
Adaptive Grids
Mr. Chetan J. Awati
[email protected]
Lecturer in Department of Information Technology
Bharati Vidyapeeth’s College of Engineering, Kolhapur
Shivaji University, Kolhapur, 416013, India
Prof. D.G. Chougule
[email protected]
Professor in Department of Computer Science & Engineering
Tatyasaheb Kore Institute of Engineering & Technology, Warnanagar
Shivaji University, Kolhapur, 416113, India.
Abstract
Clustering is a key data mining problem, and density-based clustering algorithms have
recently gained popularity in the data mining field. The combined density- and grid-based
technique is a popular way to mine clusters in large spatial datasets, wherein clusters are
regarded as regions denser than their surroundings. The attribute values and the ranges of
these attributes characterize the clusters. In this paper we present a density-based clustering
algorithm, Grid Density Clustering using Triangle subdivision (GDCT), capable of
identifying arbitrarily shaped embedded clusters as well as multi-density clusters over
massive spatial datasets. Experimental results on a wide variety of synthetic and real
datasets demonstrate the effectiveness of the adaptive-grid and triangle-subdivision
method.
Keywords: Clustering, Density Based, Intrinsic Cluster, Adaptive Grids
1. INTRODUCTION
Clustering involves the grouping of similar objects together. It is one of the fundamental data mining tasks
and can serve as an independent data mining tool or as a preprocessing step for other tasks such as
classification. Clustering is a versatile unsupervised learning method that can be used in several
ways, e.g., outlier detection, data reduction, and identification of natural data types and classes. Major
clustering techniques have been classified into partitional, hierarchical, density-based, grid-based, and
model-based. Among these, the density-based approach is known for its capability of
discovering arbitrarily shaped clusters of good quality even in noisy datasets. The grid-based
approach is well known for its fast processing time, especially for large datasets. In the proposed
algorithm the data points are grouped into blocks and the density of each block is calculated. The blocks
are then clustered by forming adaptive grids. Rough clusters are formed first, and a triangle-subdivision
algorithm is then used to refine them. This yields quality clustering.
International Journal of Data Engineering
1.1 Related Work
Clustering methods can be categorized into partitioning, hierarchical, density-based, grid-based, and
model-based methods. DBSCAN is a density-based clustering algorithm capable of discovering clusters
of various shapes even in the presence of noise. However, because it uses global density parameters, it
fails to detect embedded or nested clusters. Grid-based clustering algorithms are computationally
efficient; their cost depends on the number of cells in each dimension of the quantized space rather than
on the number of data points, and they are not sensitive to input order. STING uses a multi-resolution
approach that is query-independent and easy to parallelize; however, cluster boundaries are only
horizontal or vertical, and no diagonal boundary is detected. WaveCluster also uses a multidimensional
grid structure and can detect clusters at varying levels of accuracy while removing outliers, but it is not
suitable for high-dimensional datasets. CLIQUE automatically finds subspaces of the highest
dimensionality and is insensitive to input order; however, the accuracy of the clustering result may be
degraded by the simplicity of the method.
Some clustering algorithms can handle multi-density datasets. Chameleon [3] is one of them, but for
large datasets its time complexity is too high. SNN (shared nearest neighbor) can find clusters of varying
shapes, sizes, and densities and can deal with multi-density datasets; its disadvantage is low precision
both in multi-density clustering and in finding outliers. DGCL [6] follows the density-grid-based clustering
approach, but since it uses a uniform density threshold, low-density clusters are lost.
Real-life datasets may also have skewed distributions and may contain nested cluster structures,
the discovery of which is very difficult. OPTICS and EnDBSCAN attempt to handle such situations.
OPTICS can identify embedded clusters; however, it is very sensitive to its three user-supplied input
parameters. EnDBSCAN can detect embedded clusters, but its performance degrades as the volume of
data increases, and it is highly dependent on two input parameters. Based on our survey and
experimental analysis, the density-based approach is the most suitable for quality cluster detection over
massive datasets, and the grid-based approach is suitable for fast processing of large datasets. Since
the distributions of most real-life datasets are skewed, handling all such datasets for qualitative cluster
detection with a single global input parameter is impractical.
An algorithm that can handle large data and at the same time effectively detect multiple, nested, or
embedded clusters even in the presence of noise is therefore of utmost importance. This paper presents
a clustering algorithm, GDCT, which effectively addresses the clustering challenges mentioned above.
2. GRID DENSITY CLUSTERING USING TRIANGLE SUBDIVISION
The aim of our clustering algorithm is to discover basic as well as global clusters over spatial datasets of
variable density. We introduce some new terms which are used in GDCT.
Cell Density: The number of spatial point objects within a particular grid cell.
Useful Cell: Only those cells that contain data points are treated as useful cells.
Neighbor Cell: Cells that are edge neighbors or vertex neighbors of the current cell are its neighbor
cells.
Density Confidence of a cell: If the ratio of the densities of the current cell and one of its neighbors is
greater than or equal to a threshold β (user input), then β is the density confidence between them. For
two cells p and q to be merged into the same cluster the following condition should be satisfied:
dn(q) / dn(p) ≥ β, where dn denotes the density of the cell.
Reachability of cell: A cell p is reachable from a cell q if p is a neighbor cell of q and cell p satisfies the
density confidence condition w.r.t. cell q.
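As an illustration, the cell-level conditions above can be sketched in Python; the `Cell` container, its field names, and the helper functions are our assumptions for illustration, not part of the paper's notation:

```python
# Illustrative sketch of density confidence and reachability between cells.
# The Cell class and all names here are assumptions, not the paper's own code.

class Cell:
    def __init__(self, row, col, density):
        self.row, self.col, self.density = row, col, density

def is_neighbor(p, q):
    # Edge or vertex neighbors: the (up to) 8 cells surrounding q.
    return (p.row, p.col) != (q.row, q.col) and \
           abs(p.row - q.row) <= 1 and abs(p.col - q.col) <= 1

def density_confidence(a, b, beta):
    # Density confidence of cell a w.r.t. cell b: d_n(a) / d_n(b) >= beta.
    return b.density > 0 and a.density / b.density >= beta

def reachable(p, q, beta):
    # Cell p is reachable from cell q if p is a neighbor of q and
    # p satisfies the density-confidence condition w.r.t. q.
    return is_neighbor(p, q) and density_confidence(p, q, beta)
```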
Triangle Density: The number of spatial point objects within a particular triangle of a grid cell.
Useful Triangle: Only those triangles that are populated, i.e., contain data points, are treated as useful
triangles.
Neighbor Triangle: Those triangles which have a common edge to the current triangle are neighbors of
the current triangle.
Density Confidence of a triangle: Two triangles can be merged into the same cluster if the ratio of the
densities of the triangle and one of its neighbors is greater than or equal to β/4, i.e., dn(Tp) / dn(Tq) ≥ β/4,
where dn denotes the density of the triangle.
Reachability of a triangle: A triangle p is reachable from a triangle q if p is a neighbor triangle of q and
triangle p satisfies the density confidence condition w.r.t. triangle q.
Cluster: A cluster C is defined to be the set of points belonging to the set of reachable cells and triangles,
i.e., if p ∈ C and q is reachable from p w.r.t. β, then q ∈ C, where p and q are cells or triangles.
Noise: Noise is simply the set of points belonging to cells (or triangles) that do not belong to any cluster.
Let C1, C2, ..., Ck be the clusters w.r.t. β; then
Noise = {no_p | p ∈ n×n grid, ∀i: no_p ∉ Ci},
where no_p is the set of points in cell p and i = 1, ..., k.
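The noise definition can be sketched as follows; the mapping from cells to their points and the cluster representation as sets of cell coordinates are illustrative assumptions:

```python
# Sketch of the Noise definition: points of cells assigned to no cluster.
# cell_points maps a cell's grid coordinates to its points (no_p);
# clusters is a list of sets of cell coordinates C_1..C_k.

def noise_points(cell_points, clusters):
    clustered = set().union(*clusters) if clusters else set()
    noise = []
    for cell, points in cell_points.items():
        if cell not in clustered:      # cell belongs to no C_i
            noise.extend(points)
    return noise
```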
Density Confidence of a Useful Cell: The density confidence for a given set of cells reflects the general
trend of that set. If the density of one cell deviates abnormally from the others, it is excluded from the set.
Similarly, each useful cell has a density confidence with each of its neighbor cells. If the density
confidence of the current cell with one of its neighbor cells does not satisfy the density confidence
condition, then that neighbor cell is not included in the local dense area.
FIGURE 1: Process Flow Diagram (Spatial Dataset → Conversion of spatial dataset into image → Finding
pixels from image dataset → Grid formation and density calculation → Applying GDCT algorithm →
Clusters)
The process flow diagram above depicts the sequence of operations. Since we are dealing with spatial
datasets, the dataset is first converted into an image and the pixels are extracted from it. The grid
structure is then applied and the density is calculated. Finally, the GDCT algorithm is applied to
obtain the clusters. The following modules implement the algorithm.
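The conversion step can be sketched as a simple rasterization: each 2D point is normalized into an n × n index space and the occupied pixels are recorded. A minimal sketch, with all names assumed for illustration:

```python
# Map 2D spatial points onto an n x n pixel grid (simple rasterization).
# Function and variable names are illustrative assumptions.

def points_to_pixels(points, n):
    xs = [x for x, _ in points]
    ys = [y for _, y in points]
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)

    def index(value, lo, hi):
        if hi == lo:                  # degenerate range: everything in pixel 0
            return 0
        # scale into the integer range [0, n-1]
        return min(int((value - lo) / (hi - lo) * n), n - 1)

    return {(index(x, x_min, x_max), index(y, y_min, y_max))
            for x, y in points}
```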
Initially, the GDCT algorithm divides the data space into n × n non-overlapping square grid cells, where n
is a user input, and maps the data set to the cells. It then computes the density of each cell and sorts the
cells by density. Next, the maximum-density cell is identified among the unclassified cells, and the
neighbor cells are traversed starting from this dense cell to form the adaptive grid (rough cluster).
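These first steps can be sketched as follows; the density dictionary keyed by (row, col) and the function names are our assumptions, not the paper's notation:

```python
# Sketch of GDCT's setup: divide the space into n x n cells, count points
# per cell, and sort the useful cells by density. Names are illustrative.

from collections import defaultdict

def grid_densities(points, n, x_range, y_range):
    (x_min, x_max), (y_min, y_max) = x_range, y_range
    density = defaultdict(int)
    for x, y in points:
        col = min(int((x - x_min) / (x_max - x_min) * n), n - 1)
        row = min(int((y - y_min) / (y_max - y_min) * n), n - 1)
        density[(row, col)] += 1      # cell density = point count
    return dict(density)              # only useful (non-empty) cells remain

def cells_by_density(density):
    # Most dense cell first: the seed of the next rough cluster.
    return sorted(density, key=density.get, reverse=True)
```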
Algorithm 1 Expand_adaptive_grid (Cell)
-------------------------------------------------------------------------
1.  IF Cell.classified == CLASSIFIED
2.      RETURN
3.  END IF
4.  Cell.classified = CLASSIFIED
5.  Cell.Clus_id = Clus_id            // assign the cell its cluster id
6.  Neighb = NeighborQuery(Cell)
7.  FOR i FROM 1 TO Neighb.size DO
8.      IF (Neighb.get(i).density / Cell.density) >= β
9.          Expand_adaptive_grid(Neighb.get(i))
10.     END IF
11. END FOR
12. END                               // End Expand_adaptive_grid
-------------------------------------------------------------------------
The formation of the rough cluster is explained in Algorithm 1, Expand_adaptive_grid. It starts with the
maximum-density cell from the sorted list. The adaptive grid expands to the neighboring cells based on
the position of the current cell. A neighboring cell is merged if it is not yet a member of the adaptive grid
and it satisfies the density confidence. The adaptive grid grows recursively in this way until no
neighboring cell satisfies the condition. The cells falling in the same adaptive grid are given the same
cluster id.
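Under the same illustrative assumptions as before, the recursive expansion of Algorithm 1 can be sketched iteratively; a stack replaces the recursion, which avoids deep call chains on large grids:

```python
# Iterative sketch of Algorithm 1 (Expand_adaptive_grid). The density and
# labels dictionaries, keyed by (row, col), are illustrative assumptions.

def expand_adaptive_grid(seed, density, beta, labels, cluster_id):
    stack = [seed]
    while stack:
        cell = stack.pop()
        if cell in labels:             # already classified
            continue
        labels[cell] = cluster_id      # assign the cluster id
        r, c = cell
        for dr in (-1, 0, 1):
            for dc in (-1, 0, 1):
                nb = (r + dr, c + dc)
                if nb == cell or nb not in density or nb in labels:
                    continue
                # merge the neighbor if it satisfies density confidence
                if density[nb] / density[cell] >= beta:
                    stack.append(nb)
```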
Algorithm 2 Expand_triangle (Triangle)
----------------------------------------------------------------------------------------
1.  IF Triangle.classified == CLASSIFIED
2.      Neighbor_Triangle = NeighborQuery(Triangle)
3.      FOR i FROM 1 TO Neighbor_Triangle.size DO
4.          IF (Neighbor_Triangle.get(i).classified == UNCLASSIFIED)
5.              IF (Neighbor_Triangle.get(i).density / Triangle.density) >= β/4
6.                  Neighbor_Triangle.get(i).Clus_id = Triangle.Clus_id
7.                  Neighbor_Triangle.get(i).classified = CLASSIFIED
8.                  Expand_triangle(Neighbor_Triangle.get(i))
9.              END IF
10.         END IF
11.     END FOR
12. END IF
13. END                               // End Expand_triangle
----------------------------------------------------------------------------------------
The process then checks the neighbors of the cells of the last-formed adaptive grid. If any of these
neighbors is an unclassified useful cell, then both the adaptive grid cell and the unclassified neighbor cell
are triangulated; during triangle subdivision a grid cell is divided into four triangles. The subdivision
process is given in Algorithm 2, Expand_triangle. First the neighbors of a triangle are identified. If a
triangle is not part of any adaptive grid and it satisfies the density confidence of a triangle, it is added to
the rough cluster. This process continues recursively until no neighboring triangle satisfies the condition.
After the process of triangle subdivision the triangles are merged into the cluster and assigned the same
cluster id.
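The triangle step can be sketched as follows: a square cell is split by its two diagonals into four triangles (here labelled N, E, S, W), each point is assigned to one of them, and two triangles may merge when the density ratio reaches β/4. The labels, the choice of diagonals about the cell centre, and all function names are our illustrative assumptions:

```python
# Sketch of triangle subdivision for a square cell and the beta/4 merge test.
# All names and the N/E/S/W labelling are illustrative assumptions.

def triangle_of(point, origin, size):
    x, y = point
    ox, oy = origin
    u, v = (x - ox) / size, (y - oy) / size   # normalize into the unit square
    if v >= u and v >= 1 - u:
        return 'N'                            # top triangle
    if v <= u and v <= 1 - u:
        return 'S'                            # bottom triangle
    return 'E' if u > v else 'W'              # right or left triangle

def triangle_densities(points, origin, size):
    counts = {'N': 0, 'E': 0, 'S': 0, 'W': 0}
    for p in points:
        counts[triangle_of(p, origin, size)] += 1
    return counts

def triangles_mergeable(d_p, d_q, beta):
    # Density confidence of a triangle: d_n(T_p) / d_n(T_q) >= beta / 4.
    return d_q > 0 and d_p / d_q >= beta / 4
```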
3. EXPERIMENTAL RESULTS
The proposed algorithm is applied to various synthetic and real datasets. The algorithm was also applied
on Chameleon t4.8k.dat, t5.8k.dat and t7.10k.dat. From the experiment results it has been found that the
clustering result is dependent on threshold β which varies in the interval [0.3, 0.9]. We performed
experiments using a personal computer with 1 GB of RAM and Intel Core 2 Duo with 1.60 GHz and
running Windows 7 Ultimate.
3.1 Clustering on Real Dataset
FIGURE 2: Earthquake Dataset
FIGURE 3: Clustering Result for Earthquake Dataset
This earthquake dataset, taken from the UCI Machine Learning Repository, contains 65000 data points
giving the longitude and latitude of earthquake locations. Here we are able to find four different clusters,
and based on these clusters we can easily determine the frequency of earthquakes in a required area.
FIGURE 4: Hypsography Supplemental points of Alaska State in U.S.A
FIGURE 5: Clustering Result for Hypsography Supplemental points of Alaska State in U.S.A.
The result shows eight different clusters of the real dataset of Hypsography supplemental points of Alaska
State in U.S.A.
FIGURE 6: Global Summation of Vegetation for Calendar Year 1986 To 1988
FIGURE 7: Clustering Result for Global Summation of Vegetation for Calendar Year 1986 To 1988
For the real dataset of the global summation of vegetation, the clusters are found accurately according to
density.
FIGURE 8: Map of Caribbean Island
FIGURE 9: Clustering Result for Map of Caribbean Island
This is the map of the Caribbean islands; the clustering result shows ten clusters.
3.2 Clustering On Chameleon Data
FIGURE 10: Chameleon t4.8k.dat Dataset
FIGURE 11: Clustering Result for Chameleon t4.8k.dat Dataset
The Chameleon t4.8k.dat dataset contains 8000 data points. Here we detect six different clusters
efficiently: embedded and nested clusters are detected properly on the Chameleon dataset, and outliers
are detected successfully as well.
3.3 Clustering On Synthetic Dataset
FIGURE 12: Synthetic Dataset
FIGURE 13: Clustering Result for Synthetic Dataset
This synthetic dataset contains 2913 points, in which we find two clusters efficiently. From the above
results we can conclude that GDCT is capable of detecting intrinsic as well as multi-density clusters
efficiently.
4. COMPARISON WITH ITS COUNTERPARTS

Algorithms    Input Parameters    Outlier Handling  Scalability  Varying Density  Embedded Cluster  Complexity
DBSCAN        2 (MinPts, ε)       Yes               Yes          No               No                O(n log n)
OPTICS (Int)  3 (MinPts, ε, ε′)   Yes               Yes          Yes              No                O(n log n) + O(n)
GDCT          2 (n, β)            Yes               Yes          Yes              Yes               O(n log n)

TABLE 1: Comparison with DBSCAN and OPTICS
A detailed comparative study among the three algorithms (GDCT, DBSCAN, and OPTICS (Int.)) was
carried out on the real and synthetic datasets described above; Table 1 summarizes it in terms of six
crucial factors. As column 1 shows, GDCT, like DBSCAN, requires fewer input parameters than OPTICS.
Column 5 shows that embedded clusters can be detected only by GDCT. From the complexity point of
view, column 6 shows that the performance of GDCT is similar to that of DBSCAN, whereas OPTICS
requires an additional complexity of at least O(n) to classify the points after ordering, on top of the
O(n log n) when a spatial index is used. The remaining columns establish that, in terms of the other
quality parameters, GDCT performs as well as its two counterparts.
FIGURE 14: Time Cost of GDCT for Different Data Points
The time cost of GDCT for different numbers of data points is shown in Figure 14. Three different
categories are shown in the graph; the time required for the formation of the rough clusters is sometimes
greater than the time required for the triangle subdivision. The time cost of GDCT is less than that of
DBSCAN, OPTICS, and EnDBSCAN.
5. CONCLUSION
In this paper, a clustering algorithm for massive spatial datasets with variable density has been
presented. The results show that GDCT can find the clusters correctly and eliminate outliers efficiently,
and it is also fast. Users who want a more exact result can adjust the threshold β. GDCT is not affected by
outliers, handles them properly, and is insensitive to input order. It is observed that the time complexity
depends on the number of cells. From the experimental results given above, we conclude that GDCT is
highly capable of detecting intrinsic as well as multi-density clusters qualitatively. The other sequential
algorithms were found to be slower on the above datasets and could not detect embedded clusters.
6. REFERENCES
[1] Zhi-Wei Sun. "A Cluster Algorithm Identifying the Clustering Structure", IEEE International Conference
on CSSE, 2008.
[2] Karypis, G., Han, E.-H., & Kumar, V. "CHAMELEON: A hierarchical clustering algorithm using dynamic
modeling", IEEE Computer, 32(8), pp. 68-75, 1999.
[3] Nagesh, H. S., et al. "A scalable parallel subspace clustering algorithm for massive data sets", in
Proc. of Parallel Processing, p. 477, 2000.
[4] Ertoz, L., et al. "Finding Clusters of Different Sizes, Shapes, and Densities in Noisy, High
Dimensional Data", in Proc. of SDM '03, 2003.
[5] Ho Seok Kim, Song Gao, Ying Xia, Gyoung Bae Kim, and Hae Young Bae. "DGCL: An Efficient
Density and Grid Based Clustering Algorithm for Large Spatial Database", WAIM 2006, LNCS 4016,
pp. 362-371, 2006.
[6] S. Roy and D.K. Bhattacharyya. "An Approach to Find Embedded Clusters Using Density Based
Techniques", ICDCIT 2005, LNCS 3816, pp. 523-535, 2005.
[7] Ying Xia, GuoYin Wang, Song Gao. "An Efficient Clustering Algorithm for 2D Multi-density Dataset in
Large Database", IEEE International Conference on Multimedia and Ubiquitous Engineering
(MUE '07), 2007.
[8] Ozge Uncu, William A. Gruver, Dilip B. Kotak. "GRIDBSCAN: GRId Density-Based Spatial Clustering
of Applications with Noise", IEEE International Conference on Systems, Man, and Cybernetics,
Taipei, Taiwan, October 8-11, 2006.
[9] Song Gao, Ying Xia. "GDCIC: A Grid-based Density-Confidence-Interval Clustering Algorithm for
Multi-density Dataset in Large Spatial Database", Sixth IEEE International Conference on Intelligent
Systems Design and Applications (ISDA '06), 2006.