A Topological-Based Spatial Data Clustering

Hakam W. Alomari*a, Amer F. Al-Badarnehb
aDepartment of Computer Science and Software Engineering, Miami University, Oxford, OH, USA 45056; bComputer Information Systems Department, Jordan University of Science and Technology, Irbid, Jordan 3030
*[email protected]; phone 513 529-0356; http://www.users.miamioh.edu/alomarhw/

ABSTRACT

An approach is presented that automatically discovers cluster shapes that are hard to find with traditional clustering methods (e.g., non-spherical shapes). It extracts useful knowledge by dividing a dataset into sub-clusters, each containing similar objects. The approach does not compute distances between objects; instead, similarity information is computed as needed, using topological relations as a new similarity measure. An efficient tool was developed to support the approach and was applied to multiple synthetic and real datasets. The results are evaluated and compared against different clustering methods using several comparison measures, such as accuracy, number of parameters, and time complexity. The tool outperforms error-prone distance-based clustering methods in both time complexity and accuracy of results.

Keywords: clustering, spatial data, knowledge discovery, data mining, topological relations

1. INTRODUCTION

Database systems perform well on conventional stored data, but they can run into problems when dealing with space-related and multidimensional data, such as maps and satellite images [1]. To handle spatial data, database systems are augmented with spatial data types (e.g., points, lines, and polygons) in their data models and query languages [2, 3, 4]. Spatial data may contain useful (hidden) knowledge and tends to be large and complex in nature, so the mining process must be automated in order to keep pace with the steadily growing volume of data. Clustering, as a mining task, divides the dataset into clusters (groups) such that objects in the same cluster are similar to each other. The significance of this process is that it uncovers the natural distribution of the data, which in turn leads to the extraction of useful knowledge.

The clustering problem has received great attention over the past two decades; hundreds of research papers have presented new clustering methods or improvements to existing ones, with the aim of providing an optimal solution. Many applications have benefited from clustering to improve decision making, including information retrieval, text mining, spatial database applications, web analysis, medical diagnostics, pattern recognition, computational biology, and market research [5, 6, 7, 8]. Spatial databases are a typical target for clustering methods, given their large volume and complexity.

Among the clustering techniques known in the literature, hierarchical clustering models offer better versatility for clustering spatial data, as they do not require the number of clusters to be defined in advance. Based on how clusters are built, hierarchical clustering methods fall into two main approaches: agglomerative and divisive [5]. The divisive approach initially considers all data points as one big cluster and then iteratively divides this cluster based on the similarity measure used between data points.
The agglomerative approach, on the other hand, starts with each point as an individual cluster and then, depending on the similarity measure used, iteratively groups these points to build the final clusters. Regardless of the type of clustering method used, the computation of spatial clusters is, with few exceptions, based on the notion of a distance as the similarity measure, e.g., the Euclidean distance [9]. Unfortunately, calculating the distance matrix is quite costly in both computational time and space. Such clustering approaches therefore do not scale well, and while there are some (costly) workarounds, generating clusters for a very large dataset can take days of computing time. Additionally, many tools are strictly limited to an upper bound on the dimensionality of the dataset they can cluster.

An accurate, fast, and scalable clustering approach is extremely useful for a number of reasons. Developers gain a low-cost, practical means to discover hidden patterns within reasonable time frames. This is important for clustering new spatial objects and understanding how each object is related to other parts of the pattern. It also provides an inexpensive test to determine whether a full, deep analysis of the data is warranted. Lastly, we think an accurate clustering approach could open up new avenues of research in mining useful knowledge based on clustering. That is, clustering can be conducted on very high-dimensional datasets and on more complex raw data in practical time frames, opening the door to experiments and empirical investigations previously too costly to undertake.

We propose a new spatial clustering approach called Agglomerative Clustering Using Topological Relations Effectively (ACUTE) and conduct a comparison study with different clustering methods from the literature. The results show that the clusters produced by our approach are very reasonable with respect to accuracy. To demonstrate scalability, we applied the tool to several synthetic and real datasets and present the results.

The remainder of this paper is organized as follows. Section 2 gives an overview of the topological operators used in the proposed approach and of related clustering methods. We present our new clustering method in Section 3. Section 4 gives the performance analysis and a comparison with other clustering algorithms. In Section 5, we summarize the paper and discuss possible future research directions.

2. TOPOLOGICAL OPERATORS AND CLUSTERING METHODS

In this section, we present an overview of the topological relations used in our approach and of the modifications needed to identify the closeness between spatial points. An overview of related work on the basic types of clustering methods is also provided.

2.1 Topological operators

Topological relations [3, 18] between two spatial objects A and B can be specified using the well-known 9-intersection model (a 3 × 3 binary matrix).
The sets that intersect to determine these nine relations are the interior, boundary, and exterior of objects A and B. Table 1 shows the standard intersection matrix for objects A and B in two-dimensional space, where the symbols °, ∂, and ¯ denote the interior, boundary, and exterior, respectively. By allowing each matrix cell to be nonempty (equal to one) or empty (equal to zero), we can consider 2⁹ = 512 different topological relationships, of which only the following can be established and represented in two-dimensional space: disjoint, overlap, meet, contain, inside, equal, cover, and covered by.

Table 1. Standard intersection matrix between two objects A and B [3].

            | A° ∩ B°   A° ∩ ∂B   A° ∩ B¯ |
R(A, B) =   | ∂A ∩ B°   ∂A ∩ ∂B   ∂A ∩ B¯ |
            | A¯ ∩ B°   A¯ ∩ ∂B   A¯ ∩ B¯ |

Here we extend the original definitions of the topological relations in two ways. The first is adapting them to spatial data points; the second is using only three relations (disjoint, overlap, and meet) to build the final clusters. The assumption is that all spatial objects we deal with are singleton points of the same size in the Euclidean plane ℝ². Therefore, the relations contains, inside, covers, and coveredBy are of no interest here. Moreover, we do not use the equal relation, since we assume that every point has a unique position in space.

We now give a specification of the constraint rules (denoted Clustering Rules) that are generic for all type combinations considered. The clustering rules we use to cluster the spatial points do not take into account all nine intersections of the standard intersection matrix. In total, eight clustering rules are defined and proved in [19]. The three rules we are interested in are as follows:

Clustering Rule 1 (Disjoint). Two spatial points have a disjoint relationship if the parts of one point intersect at most the exterior of the other point, i.e., disjoint(A, B) = A° ∩ B° = Ø ∧ A° ∩ ∂B = Ø ∧ B° ∩ ∂A = Ø ∧ ∂A ∩ ∂B = Ø.

Clustering Rule 2 (Meet). Two spatial points have a meet relationship if their interiors do not intersect, but the interior or the boundary of one point intersects the boundary of the other, i.e., meet(A, B) = A° ∩ B° = Ø ∧ (A° ∩ ∂B ≠ Ø ∨ B° ∩ ∂A ≠ Ø ∨ ∂A ∩ ∂B ≠ Ø).

Clustering Rule 3 (Overlap). Two spatial points have an overlap relationship if the interior of each point intersects both the interior and the exterior of the other point, i.e., overlap(A, B) = A° ∩ B° ≠ Ø ∧ A° ∩ B¯ ≠ Ø ∧ B° ∩ A¯ ≠ Ø.
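Because the approach later grows every point into a circle of radius r and reads the relation off the quantity 2r − dist(p, q) (Section 3.2), the three rules reduce, for two equal-radius circles, to a comparison of the centre distance against 2r. The following Python sketch illustrates that reading only; the function name, the tolerance parameter eps, and the exact boundary handling are our assumptions, not part of the original tool.

import math

def topological_relation(p, q, r, eps=1e-9):
    """Classify two equal-radius circles (points grown to radius r) as
    'overlap', 'meet', or 'disjoint' from the sign of 2r - dist(p, q).
    p and q are coordinate tuples of the same dimension; eps is a
    numerical tolerance for the boundary-only 'meet' case."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))
    margin = 2 * r - dist          # quantity used by the clustering rules
    if margin > eps:               # interiors intersect -> Rule 3 (Overlap)
        return "overlap"
    if abs(margin) <= eps:         # only boundaries touch -> Rule 2 (Meet)
        return "meet"
    return "disjoint"              # no intersection at all -> Rule 1 (Disjoint)

# Example: two 2-D points 1.0 apart, grown to radius 0.6, overlap.
print(topological_relation((0.0, 0.0), (1.0, 0.0), 0.6))   # prints "overlap"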
2.2 Spatial clustering methods

Cluster analysis is a commonly used approach for understanding and detecting hidden patterns in raw data. The idea is quite simple: given an object of a dataset and its location in space, determine which other objects in the dataset are similar to (affected by) it. The approach has been used successfully for many years in various real-world applications [8, 20, 21, 22]; for example, clustering has been used for image coding, document clustering, and spatial clustering in geo-informatics. These applications require different properties, and a number of different definitions have been proposed [14, 15, 17]. The two main abstract categories are spatial and non-spatial clustering approaches. These definitions are covered in detail in various surveys of the clustering literature [5, 12, 23, 24, 26, 27]. Interestingly, each survey presents the definitions from a slightly different perspective. Some references focus mainly on the applications of clustering techniques. Others compare different implementations and classify clustering techniques according to their empirical results. Finally, some authors compare and classify clustering techniques in order to identify relations between them.

Although there are a number of similarities between spatial and non-spatial clustering algorithms, spatial datasets and databases have specific requirements (e.g., identifying irregular shapes, sensitivity to noise, higher dimensionality) that place special demands on clustering algorithms. Many methods have been proposed for spatial clustering, but the four main families are hierarchical, partitional, grid-based, and density-based clustering. k-means [10, 11] is one of the most famous and simplest partitional clustering algorithms. It defines a cluster as the set of objects similar to the cluster centroid, which is usually the mean of a group of objects. k-means starts by choosing k initial centroids, where k is a user-predefined number of clusters, and then iteratively updates the centroids until they no longer change. Since then, many other clustering techniques and tools have been proposed and implemented. These techniques are broadly distinguished by the similarity measure used and by the input requirements of the clustering method.

3. TOPOLOGICAL CLUSTERING

This section proposes a new clustering algorithm called Agglomerative Clustering Using Topological Relations Effectively (ACUTE). ACUTE improves on existing clustering algorithms in two main respects: first, it is valid for all cluster shapes, given that it agglomerates points in all directions; second, it reduces both the number of parameters and the number of comparisons needed by most clustering algorithms to perform the clustering process.

3.1 Algorithm overview

The algorithm initially computes an appropriate value of the radius r experimentally, then iteratively selects a point whose flag field equals zero from the point table and scans the database to compare the current point with the other points. Points related by the overlap or meet relationship belong to the same cluster, and the algorithm sets their flag fields to one so that they are not compared again. The topological relations are applied by comparing the corresponding dimensions of points A and B; if the relation between the two points is overlap or meet, the points belong to the same cluster. Finally, the clusters are identified by scanning the cluster table sequentially; if a final cluster contains only a single point, or fewer points than a user-predefined threshold, it is marked as an outlier.
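To make the flow of Section 3.1 concrete, the following Python sketch gives one possible reading of the flag-based agglomeration loop. It is our illustration under stated assumptions, not the authors' implementation: the function names, the breadth-first expansion, and the outlier threshold min_cluster_size are ours, and the overlap/meet test is collapsed into a single distance check against 2r.

import math

def acute_cluster(points, r, min_cluster_size=2):
    """Illustrative agglomeration: grow a cluster from each unflagged
    point by absorbing every point that overlaps or meets it."""
    def related(p, q):
        # overlap or meet once each point is grown to a circle of radius r
        return math.dist(p, q) <= 2 * r

    n = len(points)
    flags = [False] * n                 # True once a point has been assigned
    clusters, outliers = [], []
    for i in range(n):
        if flags[i]:
            continue
        cluster, frontier = [i], [i]
        flags[i] = True
        while frontier:                 # expand the cluster in all directions
            p = frontier.pop()
            for j in range(n):
                if not flags[j] and related(points[p], points[j]):
                    flags[j] = True
                    cluster.append(j)
                    frontier.append(j)
        # clusters below the threshold are reported as outliers (threshold is ours)
        (clusters if len(cluster) >= min_cluster_size else outliers).append(cluster)
    return clusters, outliers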
3.2 Radius approximation

Here we discuss how the proposed approach calculates the best value of the radius r and how to approximate this value analytically. Specifically, the proposed algorithm tries to determine a universal value of r that could be applied to any dataset (regardless of the distribution of the data in space), or, in the worst case, a way to dynamically calculate this value from the dataset currently being clustered.

Existing clustering algorithms specify the closeness between spatial objects using a distance similarity measure, assuming that similar objects lie closest to each other and thus form a cluster. Our proposed algorithm uses the same notion by agglomerating all points that overlap or meet each other after the radii of the points are increased. Increasing the radii with an incorrect value of r can lead to extracting an inaccurate number of clusters or oddly shaped clusters. We start by computing the value of r experimentally using a dataset whose true clusters are already known. The algorithm first clusters the data points using a small value of r (e.g., 0.1). It then iteratively increases the value of r and re-clusters the objects. This process continues until the accuracy value (true ratio) starts to degrade or reaches 100%, whichever occurs first. Essentially, the process of increasing the value of r is at its core a distance-calculation process. Hence, for a faster computation of the radius, we can use one of the well-known approximation methods, such as the (1+ε)-approximate nearest neighbor [28], to reduce the computational cost to O(f log n), where f is a factor depending only on the dimension.

The proposed algorithm divides the clustering process into two stages. In the first stage, it uses a cheap distance measure to create overlapping circles (or hyperspheres for dimensions greater than two). The radius r of any circle is equal to half of the maximum among the minimum pairwise distances. Significantly, a data point may fall under more than one circle, and every data point must fall in at least one circle. Note that every circle covers a data point at its center with radius r; since the value of r is obtained from the distance matrix, the proposed algorithm does not require any input parameter. In the second stage, the algorithm applies the topological relations. The values of the overlap, meet, and disjoint relations are obtained from the overlapping circular regions using the quantity 2r − dist(p, q) for a given pair of points p and q. The algorithm thus partitions the data space into overlapping circles. The radius r of these circles should depend on the farthest nearest-neighbor distance in the data space, since we want to ensure that the circle (of radius r) around the farthest point still meets the circle around that point's nearest neighbor.

3.3 Accuracy calculation

The accuracy value measures the proportion of correct cases among all cases. To calculate the accuracy, the true clusters of the points must be known in advance. The number of correct cases is then obtained by comparing the resulting clusters with the existing true ones, so the accuracy is:

Accuracy = (number of correct cases / total number of cases) × 100.
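The paper does not spell out how "correct cases" are counted, so the sketch below makes one common assumption: each produced cluster is credited with its majority ground-truth label, and the radius search of Sections 3.2-3.3 simply re-clusters at increasing radii until this accuracy reaches 100% or starts to degrade. The function names, the step size, and the majority-label scoring are our assumptions; acute_cluster is the hypothetical sketch from Section 3.1, not the authors' tool.

from collections import Counter

def accuracy(clusters, true_labels):
    """Assumed scoring: each cluster counts its majority true label as correct."""
    correct = sum(Counter(true_labels[i] for i in c).most_common(1)[0][1]
                  for c in clusters)
    return 100.0 * correct / len(true_labels)

def search_radius(points, true_labels, start=0.1, step=0.1, max_r=50.0):
    """Increase r until accuracy reaches 100% or begins to degrade,
    and return the best radius found (cf. Figure 1)."""
    best_r, best_acc = start, -1.0
    r = start
    while r <= max_r:
        clusters, _ = acute_cluster(points, r)   # sketch from Section 3.1
        acc = accuracy(clusters, true_labels)
        if acc < best_acc:        # accuracy started to degrade: stop
            break
        best_r, best_acc = r, acc
        if acc >= 100.0:          # true clusters fully recovered
            break
        r += step
    return best_r, best_acc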
[Figure 1. Cluster representations after each change in the r-value; panels (a) through (f) show the points with r = 0, 0.1, 0.3, 0.5, 1.1, and 1.5. The true clusters for this particular example are seen in (d), where r = 0.5.]

Figure 1 shows how the sizes of the points are affected as the r-value increases. The significance of this step is to observe the differences in cluster shapes as the radius changes. As shown in Figure 1(a), all points are initially represented with an r-value of approximately zero. By gradually increasing r, the natural clusters of the dataset start to appear. The process of increasing r continues until either the resulting clusters (after each increment) equal those in the true cluster table or the accuracy value starts to degrade. Figure 1(d) shows the final clusters when the r-value equals 0.5; in this case we have two clusters and one outlier. If we continue increasing r, we obtain cases such as those shown in Figures 1(e) and 1(f): the points start to merge with each other, and the accuracy value starts to decrease. The step just before the decrease gives the best r-value for clustering the dataset.

4. EXPERIMENTS AND PERFORMANCE ANALYSIS

Real and synthetic datasets of different sizes, densities, and shapes were used to evaluate the performance and accuracy of ACUTE. We used four well-known real datasets (Iris, Wine, Congressional Votes, and Mushroom) and six synthetic datasets (Hepta, Density, ACUTE, Forty, Twenty, and Spiral). The characteristics of the datasets are summarized in Table 2.

Table 2. Properties of the four real and six synthetic datasets. The cluster sizes for the first two datasets (Iris and Wine) are already known.

Data          Dimensions   Clusters   Outliers   Points   Cluster size
Iris          4            3          No         150      50, 50, 50
Wine          13           3          No         178      71, 59, 48
Cong. Votes   16           6          Yes        712      -
Mushroom      22           24         Yes        8124     -
Hepta         3            7          No         212      -
Density       2            3          No         65       -
ACUTE         2            5          No         311      -
Forty         2            40         No         1000     -
Twenty        2            20         No         1000     -
Spiral        2            2          No         1000     -

The Iris dataset is one of the most popular datasets in the clustering literature. The Congressional Votes dataset is the United States Congressional Voting Records of 1984; each record represents one congressman's votes on 16 issues (attributes), with a classification label of republican or democrat provided for each record. This dataset contains 435 records: 168 republicans and 267 democrats. The Mushroom dataset has 22 attributes and 8124 records; each record gives the physical characteristics of a single mushroom. The Wine dataset contains the results of a chemical analysis of wines grown in the same region of Italy but derived from three different cultivars. It has 13 variables and, in the clustering context, is a well-posed problem with well-behaved class structures, so it is a good dataset for testing a new clustering algorithm. Complete descriptions of all the datasets are available from the Machine Learning Repository at the University of California, Irvine [25].

4.1 Evaluation criteria

Here we evaluate the clustering results of our tool to determine whether correct clusters are produced, and whether they are produced efficiently. The time and cost of generating the clusters, including execution time and number of parameters, are of particular interest with respect to the usability of the method. In addition, we want to determine whether the results obtained by ACUTE are comparable to others in terms of accuracy.
However, since the implementations of these tools have few aspects in common, it is not meaningful to compare all of the relevant aspects of the different implementations. We therefore focus on evaluating the clusters produced by the tools, taking into consideration correctness, the size of the results, time, and the limitations of each tool.

4.2 Performance analysis

The overall computational complexity of ACUTE depends mainly on the number of comparisons between data points. Gradually increasing the r-value reduces the number of comparisons, since the number of points that overlap or meet increases; in other words, there is no need to compare points that already belong to another cluster. The time required by the comparison process depends mainly on (1) the number of data points in the dataset and (2) the value of the radius r. The worst-case complexity occurs when each data point is considered a single cluster, that is, when the r-value is initially set to a very small value (approximately zero). This is the worst case because the first chosen data point is compared with n − 1 points, the second with n − 2 points, and so on, decreasing by one until the last chosen point is compared with only one data point. The time needed to compare n data points is therefore (n² − n)/2, which is O(n²). On the other hand, as the r-value is increased gradually, the number of comparisons decreases until it equals n − 1, at which point all points are reached from the initial data point.
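As a quick check of the worst-case figure quoted above, the comparison count is simply the arithmetic series (written here in LaTeX for clarity):

\[
(n-1) + (n-2) + \cdots + 1 \;=\; \sum_{i=1}^{n-1} i \;=\; \frac{n(n-1)}{2} \;=\; \frac{n^{2} - n}{2} \in O(n^{2}).
\]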
[Figure 2. Execution time and accuracy (ACC) over different r-values (x-axis) on the Hepta dataset; the time is shown as a percentage of a minute. ACC = 100% when the r-value is between 0.5 and 0.9, with the execution time dropping from 22 to 2 seconds.]

As an example, Figure 2 shows the execution time of ACUTE on the Hepta dataset for different r-values. In general, as the value of r increases, the execution time of ACUTE decreases, owing to the decrease in the total number of comparisons between data points. However, the execution time and the accuracy values (ACC) are sometimes stable over several r-values even as the r-value keeps increasing; this is because the number of comparisons is stable too.

Table 3. Accuracy values (ACC) versus radius values (r-value) on four datasets (Wine, Density, Cong. Votes, and Mushroom).

Dataset       ACC    r-value   Number of iterations
Wine          79.8   47.5      20
Density       100    14        8
Cong. Votes   92.1   32        15
Mushroom      100    11.5      53

4.3 Experimental results

To facilitate choosing an appropriate value of the radius r, and to show the impact of different values of r on the clustering results, Figure 3 illustrates the execution of the ACUTE algorithm on the Hepta, Iris, and Spiral datasets. The figure shows how the accuracy values (ACC) change as the radius value is increased. The Hepta and Spiral datasets reach an accuracy value of 100% (first hit) at r-values of 0.5 and 0.3, respectively. On the other hand, the best ACC value on the Iris dataset is 86.7%, at an r-value of 0.4. These three datasets are plotted together in the same figure because they share the same scale of r-values. For the sake of space and simplicity, the results for the Wine, Cong. Votes, Mushroom, and Density datasets are shown in Table 3. The ACUTE algorithm reaches a 100% accuracy value on both the Density and Mushroom datasets, with r-values of 14 and 11.5, respectively. However, on the Wine and Cong. Votes datasets, the best accuracy values are 79.8 and 92.1, respectively. The number-of-iterations column gives the number of increases of the r-value needed to reach the highest ACC value. This number changes if the amount by which the r-value is increased is larger; for example, in the Wine dataset the r-value is increased in steps of 2.5 units starting from zero, so 20 iterations are needed to reach an r-value of 47.5.

ACUTE iteratively re-clusters the data after each increase in the value of r, and the ACC value is then calculated. This process continues until no further improvement in the accuracy value can be made. For example, as shown in Figure 3, as the r-value increases the accuracy value also increases, since the number of points that overlap or meet increases. This improvement in the accuracy value then starts to degrade and eventually becomes stable once the r-value is very large, because all clusters are merged together. In other words, the r-value is considered large when all the points in the space have merged into one cluster (i.e., the first cluster). In such a case, the accuracy value is stable and equal to (the number of points in the first cluster / the total number of points in the dataset). As an example, the Iris dataset has three clusters of 50 points each, so when the r-value is enlarged too much the accuracy value stays unchanged at 50 / 150 = 0.33. As Figure 3 shows, when the r-value increases the accuracy value first increases, and then either remains stable over several iterations or becomes inversely related to the r-value before stabilizing.

[Figure 3. ACUTE's accuracy values (y-axis) on the Hepta, Iris, and Spiral datasets over different r-values (x-axis), from 0 to 0.9.]

[Figure 4. Accuracy of the clustering algorithms on the ACUTE dataset: k-means (k = 5), k-medoids (k = 5), DBSCAN (20, 5), CLIQUE (18, 5), and ACUTE (r = 13).]

Figure 4 shows the true-point accuracy values of the ACUTE, k-means, k-medoids [13], DBSCAN [16], and CLIQUE [26] clustering algorithms on the ACUTE dataset. ACUTE's accuracy on the ACUTE dataset is 100% with an r-value of 13, while k-means achieves 66.23% with k equal to 5. In addition, a wrong choice of parameters degrades the accuracy of CLIQUE to 32.15%. We also compared the ACUTE clustering algorithm with the other clustering algorithms in terms of the number of parameters needed to accomplish the clustering process. ACUTE, like most of the other tools, needs one parameter, namely the r-value.

5. CONCLUSIONS AND FUTURE WORK

The main contribution of this paper is a new spatial clustering algorithm.
The proposed algorithm benefits from topological relations as a similarity measure for spatial objects in order to avoid the extra overhead incurred when using distance functions. Moreover, the algorithm builds clusters without prior knowledge of any calculated values of the dataset, such as the mean, median, or standard deviation. The proposed algorithm detects the natural clusters in the dataset regardless of cluster shape; it can treat points as representatives of all spatial data types and can merge points in all directions. We compared the ACUTE algorithm with other clustering algorithms using datasets of different sizes, shapes, densities, and numbers of dimensions. Experimental results show that our algorithm reduces the clustering processing time, in the best case, to O(n − 1) by reducing the number of comparisons made while building the clusters. Additionally, the ACUTE algorithm clusters most of the datasets with 100% accuracy.

The ACUTE algorithm determines r by iteratively increasing its value until the accuracy value starts to degrade. As future work, we suggest that the r-value be calculated analytically; the distribution of the dataset in space is a good indicator of the r-value. However, each dataset's space may require a different value of r.

REFERENCES

[1] Al-Badarneh A, Al-Alaj A and Mahafzah B. Multi Small Index (MSI): A spatial indexing structure. Journal of Information Science 2013; 39: 643–660.
[2] Güting RH and Schneider M. Realms: A foundation for spatial data types in database systems. In: Abel DJ and Ooi BC (eds) Proceedings of the Third International Symposium on Advances in Spatial Databases, LNCS 692. Berlin Heidelberg, Germany: Springer-Verlag, 1993, pp.14–35.
[3] Tasdemir K, Milenov P and Tapsall B. Topology-based hierarchical clustering of self-organizing maps. IEEE Transactions on Neural Networks 2011; 22: 474–485.
[4] Samet H. Spatial data structures. In: Kim W (ed.) Modern Database Systems: The Object Model, Interoperability, and Beyond. New York, NY, USA: ACM Press/Addison-Wesley, 1995, pp.361–385.
[5] Fahad A, Alshatri N, Tari Z, Alamri A, Khalil I, Zomaya AY, et al. A survey of clustering algorithms for big data: Taxonomy and empirical analysis. IEEE Transactions on Emerging Topics in Computing 2014; 2: 267–279.
[6] Qian F, He Q, Chiew K and He J. Spatial co-location pattern discovery without thresholds. Knowledge and Information Systems 2012; 33: 419–445.
[7] Read S, Bath PA, Willett P and Maheswaran R. New developments in the spatial scan statistic. Journal of Information Science 2013; 39: 36–47.
[8] Tan P-N, Steinbach M and Kumar V. Introduction to Data Mining. 1st ed. Boston, USA: Addison-Wesley Longman, 2005.
[9] Fabbri R, Costa LDF, Torelli JC and Bruno OM. 2D Euclidean distance transform algorithms: A comparative survey. ACM Computing Surveys 2008; 40: 1–44.
[10] Jain AK. Data clustering: 50 years beyond K-means. Pattern Recognition Letters 2010; 31: 651–666.
[11] Kanungo T, Mount DM, Netanyahu NS, Piatko CD, Silverman R and Wu AY. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2002; 24: 881–892.
[12] Wu X, Kumar V, Quinlan JR, Ghosh J, Yang Q, Motoda H, et al.
Top 10 algorithms in data mining. Knowledge and Information Systems 2007; 14: 1–37.
[13] Park H-S and Jun C-H. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications 2009; 36: 3336–3341.
[14] Ng RT and Han J. CLARANS: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 2002; 14: 1003–1016.
[15] Zhang T, Ramakrishnan R and Livny M. BIRCH: An efficient data clustering method for very large databases. In: Widom J (ed.) Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1996, pp.103–114.
[16] Sander J, Ester M, Kriegel H-P and Xu X. Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery 1998; 2: 169–194.
[17] Guha S, Rastogi R and Shim K. CURE: An efficient clustering algorithm for large databases. In: Tiwary A and Franklin M (eds) Proceedings of the 1998 ACM SIGMOD International Conference on Management of Data. New York, NY, USA: ACM, 1998, pp.73–84.
[18] Egenhofer MJ. A formal definition of binary topological relationships. In: Litwin W and Schek HJ (eds) Proceedings of the 3rd International Conference on Foundations of Data Organization and Algorithms. New York, NY, USA: Springer-Verlag, 1989, pp.457–472.
[19] Schneider M and Behr T. Topological relationships between complex spatial objects. ACM Transactions on Database Systems 2006; 31: 39–81.
[20] Laguia M and Castro JL. Local distance-based classification. Knowledge-Based Systems 2008; 21: 692–703.
[21] Jacquenet F and Largeron C. Discovering unexpected documents in corpora. Knowledge-Based Systems 2009; 22: 421–429.
[22] Li M and Zhang L. Multinomial mixture model with feature selection for text clustering. Knowledge-Based Systems 2008; 21: 704–708.
[23] Filippone M, Camastra F, Masulli F and Rovetta S. A survey of kernel and spectral methods for clustering. Pattern Recognition 2008; 41: 176–190.
[24] Jain AK, Murty MN and Flynn PJ. Data clustering: A review. ACM Computing Surveys 1999; 31: 264–323.
[25] The Machine Learning Repository at the University of California, Irvine. URL: http://archive.ics.uci.edu/ml/.
[26] Berkhin P. Survey of clustering data mining techniques. Technical report, Accrue Software, San Jose, CA, 2002.
[27] Parimala M, Lopez D and Senthilkumar NC. A survey on density based clustering algorithms for mining large spatial databases. International Journal of Advanced Science and Technology 2011; 31.
[28] Arya S, Mount DM, Netanyahu NS, Silverman R and Wu AY. An optimal algorithm for approximate nearest neighbor searching in fixed dimensions. Journal of the ACM 1998; 45: 891–923.