Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction Methods for Knowledge Discovery in Spatial Databases ◦ ◦ ◦ ◦ ◦ Generalization-Based Knowledge Discovery Methods Using Clustering Methods Exploring Spatial Associations Using Approximation and Aggregation Mining in Image Databases Future Directions Conclusion References Our objectives: ◦ Describe existing spatial data mining methods ◦ Give a general perspective of the field’s current state ◦ Summarize the paper’s description of the state of spatial data mining in 1996. A spatial database is a database that is optimized to store and query data that is related to objects in space, including points, lines and polygons. While typical databases can understand various numeric and character types of data, additional functionality needs to be added for databases to process spatial data types. These are typically called geometry or feature. Database systems use indexes to quickly look up values and the way that most databases index data is not optimal for spatial queries. Instead, spatial databases use a spatial index to speed up database operations. In addition to typical SQL queries such as SELECT statements, spatial databases can perform a wide variety of spatial operations. A few examples are: Spatial Measurements Spatial Functions Spatial Predicates Constructor Functions Observer Functions Data mining is a field at the intersection of computer science and statistics, it is the process that attempts to discover patterns in large data sets. It utilizes methods at the intersection of artificial intelligence, machine learning, statistics, and database systems. The overall goal of the data mining process is to extract information from a data set and transform it into an understandable structure for further use. Aside from the raw analysis step, it involves database and data management aspects, data preprocessing, model and inference considerations, interestingness metrics, complexity considerations, post-processing of discovered structures, visualization, and online updating. “Spatial data mining, or knowledge discovery in spatial database, refers to the extraction of implicit knowledge, spatial relations, or other patterns not explicitly stored in spatial databases.” (Koperski and Han, 1995) Data mining, or knowledge discovery in databases, refers to the “ discovery of interesting, implicit, and previously unknown knowledge from large databases.” (Frawley et al, 1992) WHAT’S THE DIFFERENCE? DATA = The (WHAT) dimension. an attribute of an object. SPATIAL DATA = (WHERE) & (WHAT) Attribute data referenced to a specific location. The Attributes of spatial objects are: - Highly dependant on location - Often influenced by neighboring objects Object What Milk Is cold Yogurt Is warm Butter Is warm Object Where What Milk In fridge (X1,Y1) Is cold Milk On table (X2,Y2) Is warm Object Where What SPATIAL DATA MINING: Milk In fridge (X1,Y1) Is warm THE FRIDGE IS BROKEN! Yogurt In fridge (X1,Y1) Is warm Butter In fridge (X1,Y1) Is warm Object() : Location() -> Characteristic() (Milk) : (In Fridge) -> [should be] -> (Cold) To understand spatial data To discover relationships between spatial and non spatial data To capture the general characteristics in a concise way To build spatial knowledge-bases Various kinds of rules can be discovered from databases in general Spatial Characteristic Rule: General description for spatial data. Example: A rule describing the general price range of houses in various geographic regions in a city. Spatial Discriminant Rule: General description of the features discriminating or contrasting a class of spatial data from other classes. Example: Comparison of price ranges of houses in different geographical regions. Spatial Association Rule: Describes the implication of one or a set of features by another set of features in spatial databases. Example: Associating the price range of the houses with nearby spatial features, like beaches. Thematic Maps: Presents the spatial distribution of attributes. Differs from general maps where the objective is to present the positions of objects. Used for discovering different rules 2 ways to represent a thematic map: ◦ Raster ◦ Vector Image Databases: Special kind of spatial databases where data almost entirely consists of images or pictures. Used in remote sensing, medical imaging Usually stored in form of grid arrays representing the image intensity in spectral ranges. Spatial Data Structures: Consists of points, lines, rectangles, etc. Spatial Computations: Spatial join is one of the most expensive spatial operations. Map overlay is an important operation for geographic information systems SD mining algorithms must efficiently overcome: The huge volume of spatial data The complexity of spatial data types/structures The complexity of spatial accessing/query methods Expensive spatial processing operations HUGE S-DB www.jupiterimages.co m Object = Where is: Citizen{Brad} GO! Which highways cross Nat’n Park boundaries? = spatial JOIN ◦ ◦ ◦ ◦ ◦ Generalization-Based Knowledge Discovery Methods Using Clustering Methods Exploring Spatial Associations Using Approximation and Aggregation Mining in Image Databases Spatial Data ◦ Geometric (location, area, perimeter) ◦ Topological (adjacency, inclusion) Non-Spatial Data ◦ Stored in a traditional database with an attribute that is a pointer to the spatial description (Aref et al. - 1991) Need for background knowledge in the form of concept hierarchies : spatial and nonspatial ◦ Non-Spatial: ◦ Spatial: Counties -> Provinces -> Larger Regions Two generalization-baseds algorithms presented by Lu et al. - 1993: ◦ Spatial Data Dominant Generalization: Generalization of the spatial objects continues until the "spatial generalization threshold" is reached. Non-spatial data is analyzed. ◦ Non-Spatial Data Dominant Generalization: The non-spatial data is generalized into higher concept level. Neighbouring areas with similar generalized attributes are merged. Dependent upon the concept hierarchies and need for algorithms that does not use these hierarchies No need for a background information like concept hierarchies Foundation of clusters directly from the data A similar approach in machine learning is called "unsupervised learning" Clustering algorithms: ◦ PAM (Kaufmann and Rousseeuw – 1990) ◦ CLARAN (Kaufmann and Rousseeuw – 1990) ◦ CLARANS (Ng and Han - 1994) SD (CLARANS) – Spatial Dominant Approach NSD (CLARANS) – Non-Spatial Dominant Approach n objects and k clusters Selecting the most representative point for each cluster Most centrally located point in a cluster – medoid Computationally inefficient Total Cost = 20 10 10 10 9 9 9 8 8 8 Arbitrary choose k object as initial medoids 7 6 5 4 3 2 1 Assign each remaining object to nearest medoids 7 6 5 4 3 2 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 K=2 Do loop Until no change 1 2 3 4 5 6 7 8 9 10 7 6 5 4 3 2 1 0 0 10 3 4 5 6 7 8 9 10 10 Compute total cost of swapping 9 If quality is improved. 2 Randomly select a nonmedoid object,Oramdom Total Cost = 26 Swapping O and Oramdom 1 8 7 6 9 8 7 6 5 5 4 4 3 3 2 2 1 1 0 0 0 1 2 3 4 5 6 7 8 9 10 0 1 2 3 4 5 6 7 8 9 10 29 Very similar to PAM Sampling is the difference The idea: If the sample is selected in a random manner, it will be representing the data correctly and, therefore, CLARA can deal with larger data sets with respect to PAM. The drawback: May not do the best clustering -> An object is a medoid and it is not selected when sampling Tries to mix PAM and CLARA The idea: CLARA -> All stages with same sample CLARANS -> At each step a random sample The chance to miss the potential solutions is decreased to minimum Experimentally shown as more efficient than PAM and CLARA Also, detection of outliers i.e points that are not belong to any cluster Based upon CLARANS, two spatial mining algorithms were developed: ◦ SD (CLARANS) – Spatial Dominant Approach ◦ NSD (CLARANS) – Non-Spatial Dominant Approach Clustering of objects spatially by CLARANS Attribute-oriented induction on non-spatial description of the objects Result: Description of each cluster by it’s relative nonspatial attributes Spatial Cluster: Downtown Edmunton Non-Spatial (Attribute) Cluster: 50% Commercial, 40% Residental, 10% Public Services Attribute-oriented generalization on non-spatial description of the objects For each generalized tuple, clustering of the objects by CLARANS In the case of an overlap between clusters, the merging of those clusters Spatial Cluster: Region East of 50 St, South of 35 St, North of Whitemud. Non-Spatial (Attribute) Cluster: Mostly Industrial Two Drawbacks: ◦ The assumption that all objects to be clustered are stored in main memory Requirement of disk-based methods Integration with spatial access methods like R* tree ◦ Efficiency of the algorithm can be improved by some modifications Focusing on the representative objects when selecting medoids Restricting the access to certain objects that do not actually contribute to the computation Presented because of the non-availability or time consuming construction of the R-Trees An algorithm presented by Zhang et al. – 1996 Can be used with any clustering algorithm like CLARANS Two concepts that are used in the algorithm: ◦ Clustering Feature A triple summarizing information about subclusters of points ◦ CF Tree A balanced tree that stores clustering features Clustering Feature: CF = (N, LS, SS) N: Number of data points LS: Ni=1=Xi SS: Ni=1=Xi2 CF = (5, (16,30),(54,190)) 10 9 8 7 6 5 4 3 2 1 0 0 1 2 3 4 5 6 7 8 9 10 (3,4) (2,6) (4,5) (4,7) (3,8) Desire to discover rules that associate spatial objects with other spatial objects Association rules -> Agrawal et al. – 1993 Association rules for spatial databases -> Koperski and Han – 1995 Example Rule: is_a(x, school) -> close_to(x, park) (80%) Other predicates: intersects, overlap, disjoint, left_of, west_of, close_to, far_away Support ( Support (X) ) Probability that a tuple contains X Confidence ( X->Y ) Probability that a tuple having X also contains Y Predefined thresholds (minimum support, minimum confidence) to determine associations Strong Rule: A Rule that has a support, no less than minimum support; and has a confidence, no less than minimum confidence Methods that answer where the groups of data are More interesting: Why the clusters are there? Rephrased: What are the characteristics of the clusters in terms of the features that are close to them? C1 C1 C2 Cluster C1 centered at X1,Y1 Cluster C2 centered at X2,Y2 C2 45% of the objects in Clusters C1 & C2 are close to feature (River) CRH algorithm presented by Knorr and Ng – 1995 CRH: C (encompassing circle), R (isothetic rectangle), H (convex hull) Using filters to reduce the candidate features from a large number of features Input: Any number of features and one cluster Output: List of relevant features and points located from a defined distance from features There have been studies in this field, on the automatic recognition and categorization of objects Usual systems are composed of 3 components: Data focusing Feature extraction Classifaction learning Data Focusing: Increases the overall efficiency of the system by first identifying the portion of the image being analyzed that is most likely to containt the target object Feature Extraction: Extracts interesting features from the data. Pattern recognition methods are used Classification Learning: Discriminates the target objects from other objects that look alike This system has about 80% accuracy. It is difficult for experts to provide classifications with 100% certainty False classifications can produce large errors because they are treated as negative examples “The variety of yet unexplored topics and problems makes knowledge discovery in spatial databases an attractive and challenging research field.” Identified Future Directions for spatial data mining: ◦ Data Mining in Spatial Object Oriented DB ◦ Alternative Clustering Techniques: Clustering overlapping objects, Fuzzy Clustering of Spatial Data ◦ Mining under uncertainty: Evidential reasoning, Fuzzy sets approaches ◦ Spatial Data Deviation and Evolution Rules: ◦ ◦ ◦ ◦ ◦ Rule application to data that changes over time Interleaved Generalization (spatial and non-spatial) Generalization of Temporal Spatial Data (data evolution) Parallel Data Mining (multi-processor systems) Spatial Data Mining Query Language Multidimensional Rule Visualization and Multiple Thematic Maps Topic Currently Active Research Field References DM in Spatial Obj-Oriented DB Yes >10 (1997 – 2012) Alternative / Fuzzy Spatial Clustering Yes 1996, >10 (1997 – 2012) Mining under uncertainty Yes >10 (1997 – 2012) Deviation / Evolution Rules Yes >10 (1997 – 2012) Interleaved Generalization Vague… Generalization of Temporal Spatial Data Yes Parallel Data Mining merged Spatial Data Mining Query Language Yes: GeoMiner, GMQL, Spatial SQL 1997 (h&k), >10 (1997 – 2012) Visualization Topics Yes – esp. GIS >10 (1997 – 2012) >10 (1997 – 2012) 1991, 1994, 1996, Data Mining / Knowledge Discovery of Spatial Data is a large, active research area. While it was a “young” field at the time this survey paper was written, it is quickly maturing in applications such as: ◦ Geographic Information Systems ◦ Medical Imaging ◦ Robotics Navigation http://www.cubrid.org/wp-content/uploads/2011/09/2deminsional-mbr.png http://www.cubrid.org/wp-content/uploads/2011/09/geometryobject-model-examples.png http://www.cubrid.org/wp-content/uploads/2011/09/mercatorprojection.png http://www.lib.unc.edu/reference/gis/images/spatial1.jpg http://postgis.refractions.net/documentation/casestudies/northda kota/Map_Service_01.jpg http://postgis.refractions.net/documentation/casestudies/northda kota/4D_Screen.jpg http://rmbl.info/gis/layers.jpg/layers-full.jpg Krzysztof Koperski, Junas Adhikary, Jiawei Han. Spatial Data Mining: Progress and Challenges Survey Paper. Workshop on Research Issues on Data Mining and Knowledge Discovery, 1996 W. G. Aref and H. Samet. Extending DBMS with Spatial Operations. In Proc. 2nd Symp. SSD’91, pp.299-318, Zurich, Switzerland, Aug. 1991. W. Lu, J. Han, and B. C. Ooi. Discovery of General Knowledge in Large Spatial Databases. In Proc. Far East Workshop on Geographic Information Systems pp. 275-289, Singapore, June 1993. L. Kaufman and P. J. Rousseeuw. Finding Groups in Data: an Introduction to Cluster Analysis. John Wiley & Sons, 1990. R. Ng and J. Han. Efficient and effective clustering method for spatial data mining. In Proc. 1994 Int. Conf. Very Large Data Bases, pp. 144-155, Santiago, Chile, September 1994. T. Zhang, R. Ramakrishnan, and M. Livny. BIRCH: an Efficient Data Clustering Method for Very Large Databases. In Proc. 1996 ACM-SIGMOD Int. Conf. Management of Data, Montreal, Canada, June 1996. R. Agrawal, T. Imielinski, and A. Swami. Mining Association Rules Between Sets of Items in Large Databases. In Proc. 1993 ACM-SIGMOD Int. Conf. Management of Data, pp. 207-216, Washington, D.C., May 1993. K. Koperski and J. Han. Discovery of Spatial Association Rules in Geogrpahic Information Databases. In Proc. 4th Int’l Symp. On Large Spatial Databases (SSD’95), pp. 47-66, Portland, Maine, August 1995. E. Knorr and R. T. Ng. Applying Computational Geometry Concepts to Discovering Spatial Aggregate Proximity Relationships. In Technical Report, University of British Columbia, 1995. Paper Review and Slides by Brad Danielson