
SPATIAL DATA MINING TECHNIQUES
M.Tech Seminar Report

Submitted by: Subhasmita Mahalik, 113050073, CSE, M.Tech 1
Under the guidance of: Prof. N. L. Sarda
Department of Computer Science and Engineering
Indian Institute of Technology, Bombay
Mumbai

ACKNOWLEDGEMENT

I would like to express my sincere gratitude to my guide, Prof. N. L. Sarda, for his constant encouragement and guidance. He has been my primary source of motivation and advice throughout my seminar study. I would also like to thank my friends for the support, suggestions and feedback they have given me.

Subhasmita Mahalik, 113050073, CSE, M.Tech, IIT Bombay

Contents

1. Introduction
2. Database primitives
   2.1 Neighbourhood Relations
   2.2 Neighbourhood graphs and their operation
   2.3 Filter predicates
3. Knowledge Discovery
   3.1 Classification
   3.2 Clustering Methods
       3.2.1 Complete Spatial Randomness, Cluster, and Decluster
       3.2.2 DECODE: a new method of discovering cluster
   3.3 Association Rules
   3.4 Outlier Detection
4. Efficient polygon amalgamation method
   4.1 Adjacency-based method
   4.2 Occupancy-based method
5. Challenges and Applications
   5.1 Challenges
   5.1 Landslide-monitoring data mining
   5.2 In visual data mining
6. Conclusion

ABSTRACT

Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from large spatial datasets. Extracting interesting and useful patterns from spatial datasets is more difficult than extracting the corresponding patterns from traditional numeric and categorical data because of the complexity of spatial data types, spatial relationships, and spatial autocorrelation. The requirements of mining spatial databases are different from those of mining classical relational databases. Spatial data mining techniques are often derived from spatial statistics, spatial analysis, machine learning, and databases, and are customized to analyze massive data sets.
In this report, some of the spatial data mining techniques are discussed along with some of their real-world applications.

Chapter 1 Introduction

The explosive growth of spatial data and the widespread use of spatial databases emphasize the need for the automated discovery of spatial knowledge. Spatial data mining is the process of discovering interesting and previously unknown, but potentially useful, patterns from spatial databases. The complexity of spatial data and intrinsic spatial relationships limits the usefulness of conventional data mining techniques for extracting spatial patterns. Efficient tools for extracting information from geo-spatial data are crucial to organizations that make decisions based on large spatial datasets, including NASA, the National Imagery and Mapping Agency (NIMA), the National Cancer Institute (NCI), and the United States Department of Transportation (USDOT). These organizations are spread across many application domains, including ecology and environmental management, public safety, transportation, Earth science, epidemiology, and climatology [2].

General-purpose data mining tools such as Clementine and Enterprise Miner are designed to analyze large commercial databases. Although these tools were primarily designed to identify customer-buying patterns in market basket data, they have also been used to analyze scientific and engineering data, astronomical data, multimedia data, genomic data, and web data. Extracting interesting and useful patterns from spatial data sets is more difficult than extracting corresponding patterns from traditional numeric and categorical data because of the complexity of spatial data types, spatial relationships, and spatial autocorrelation.
Specific features of geographical data that preclude the use of general-purpose data mining algorithms are: i) rich data types (e.g., extended spatial objects), ii) implicit spatial relationships among the variables, iii) observations that are not independent, and iv) spatial autocorrelation among the features.

Spatial data mining can be used to understand spatial data, discover the relation between spatial and non-spatial data, set up a spatial knowledge base, optimize queries, reorganize spatial databases, and obtain concise overall characteristics. The system structure of spatial data mining can generally be divided into three layers, as shown in Fig 1 [1]. The customer interface layer is mainly used for input and output. The miner layer is mainly used to manage data, select algorithms and store the mined knowledge. The data source layer, which mainly includes the spatial database and other related data and knowledge bases, holds the original data for spatial data mining.

The data inputs of spatial data mining are more complex than the inputs of classical data mining because they include extended objects such as points, lines, and polygons. The data inputs of spatial data mining have two distinct types of attributes: non-spatial attributes and spatial attributes. Non-spatial attributes are used to characterize non-spatial features of objects, such as the name, population, and unemployment rate of a city. They are the same as the attributes used in the data inputs of classical data mining. Spatial attributes are used to define the spatial location and extent of spatial objects. The spatial attributes of a spatial object most often include information related to spatial location, e.g., longitude, latitude and elevation, as well as shape.

Relationships among non-spatial objects are explicit in data inputs, e.g., arithmetic relation, ordering, is-instance-of, subclass-of, and membership-of.
In contrast, relationships among spatial objects are often implicit, such as overlap, intersect, and behind. One possible way to deal with implicit spatial relationships is to materialize the relationships into traditional data input columns and then apply classical data mining techniques. However, the materialization can result in loss of information. Another way to capture implicit spatial relationships is to develop models or techniques that incorporate spatial information into the spatial data mining process.

Fig 1: The systematic structure of spatial data mining.

One of the fundamental assumptions of statistical analysis is that the data samples are independently generated, like successive tosses of a coin or rolls of a die. However, in the analysis of spatial data, the assumption that samples are independent is generally false. In fact, spatial data tends to be highly self-correlated. For example, people with similar characteristics, occupations and backgrounds tend to cluster together in the same neighborhoods. The economies of a region tend to be similar. Natural resources, wildlife, and temperature vary gradually over space. The tendency of like things to cluster in space is so fundamental that geographers have elevated it to the status of the first law of geography: "Everything is related to everything else, but nearby things are more related than distant things". In spatial statistics, an area within statistics devoted to the analysis of spatial data, this property is called spatial autocorrelation. Knowledge discovery techniques that ignore spatial autocorrelation typically perform poorly on spatial data.
Often the spatial dependencies arise from the inherent characteristics of the phenomena under study, but in particular they arise because the spatial resolution of imaging sensors is finer than the size of the objects being observed. For example, remote sensing satellites have resolutions ranging from 30 meters (e.g., the Enhanced Thematic Mapper of NASA's Landsat 7 satellite) to one meter (e.g., the IKONOS satellite from SpaceImaging), while the objects under study (e.g., Urban, Forest, Water) are often much larger than 30 meters. As a result, per-pixel classifiers, which do not take spatial context into account, often produce classified images with salt-and-pepper noise. These classifiers also suffer in terms of classification accuracy.

The spatial relationship among locations in a spatial framework is often modeled via a contiguity matrix. A simple contiguity matrix may represent a neighborhood relationship defined using adjacency, Euclidean distance, etc. Example definitions of neighborhood using adjacency include a four-neighborhood and an eight-neighborhood. Given a gridded spatial framework, a four-neighborhood assumes that a pair of locations influence each other if they share an edge. An eight-neighborhood assumes that a pair of locations influence each other if they share either an edge or a vertex.

Fig 2: A Spatial Framework and Its Four-neighborhood Contiguity Matrix.

Fig 2(a) shows a gridded spatial framework with four locations, A, B, C, and D. A binary matrix representation of a four-neighborhood relationship is shown in Fig 2(b). The row-normalized representation of this matrix is called a contiguity matrix, as shown in Fig 2(c).

The prediction of events occurring at particular geographic locations is very important in several application domains.
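The binary four-neighborhood matrix and its row-normalized contiguity matrix described for Fig 2 can be sketched in a few lines of Python. This is only an illustration and is not taken from the report: it assumes that the four locations A, B, C and D form a 2 x 2 grid stored in row-major order, and the function names four_neighborhood and row_normalize are hypothetical.

```python
def four_neighborhood(rows, cols):
    """Binary matrix: entry [i][j] is 1 if grid cells i and j share an edge."""
    n = rows * cols
    w = [[0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            i = r * cols + c
            # Visit the cells above, below, left and right of (r, c).
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < rows and 0 <= cc < cols:
                    w[i][rr * cols + cc] = 1
    return w


def row_normalize(w):
    """Divide every row by its sum so that each non-empty row adds up to 1."""
    out = []
    for row in w:
        s = sum(row)
        out.append([v / s if s else 0.0 for v in row])
    return out


# Assumed layout (not stated in the text):  A B
#                                           C D
binary = four_neighborhood(2, 2)
contiguity = row_normalize(binary)

for name, row in zip("ABCD", contiguity):
    print(name, row)
```

Under the assumed layout every location has exactly two edge neighbors, so each non-zero entry of the contiguity matrix is 0.5.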
Examples of problems which require location prediction include crime analysis, cellular networking, and natural disasters such as fires, floods, droughts, vegetation diseases, and earthquakes. Two spatial data mining techniques for predicting locations are the Spatial Autoregressive Model (SAR) and Markov Random Fields (MRF).

An Application Domain

We begin by introducing an example to illustrate the different concepts related to location prediction in spatial data mining. We are given data about two wetlands, named Darr and Stubble, on the shores of Lake Erie in Ohio, USA, in order to predict the spatial distribution of a marsh-breeding bird, the red-winged blackbird (Agelaius phoeniceus). The data was collected from April to June in two successive years, 1995 and 1996.

Chapter 2 Database primitives

In this section, we introduce a small set of database primitives for spatial data mining. The major difference between mining in relational databases and mining in spatial databases is that attributes of the neighbors of some object of interest may have an influence on the object itself. Therefore, our database primitives are based on the concept of spatial neighborhood relations.

2.1 Neighborhood Relations

The mutual influence between two objects depends on factors such as the topology, the distance or the direction between the objects. There are three basic types of spatial relations: topological, distance and direction relations. All of them are binary relations, i.e. relations between pairs of objects. Spatial objects may be either points or spatially extended objects such as lines, polygons or polyhedrons. In general, the points p = (p1, p2, ..., pd) are elements of a d-dimensional Euclidean vector space called Points [1]. Topological relations are those relations which are invariant under topological transformations, i.e. they are preserved if both objects are rotated, translated or scaled simultaneously.
Definition 1: (topological relations) The topological relations between two objects A and B are derived from the nine intersections of the interiors, the boundaries and the complements of A and B with each other. The relations are: A disjoint B, A meets B, A overlaps B, A equals B, A covers B, A covered-by B, A contains B, A inside B.

Definition 2: (distance relations) Distance relations are those relations comparing the distance of two objects with a given constant using one of the arithmetic operators. The distance dist between two objects, i.e. sets of points, can then simply be defined as the minimum distance between their points.

Definition 3: (direction relations) To define direction relations O2 R O1, we distinguish between the source object O1 and the destination object O2 of the direction relation R. There are several possibilities to define direction relations, depending on the number of points they consider in the source and the destination object.

Fig 3: Some of the topological, distance and direction relations, illustrated using 2D polygons.

We define the direction relation of two spatially extended objects using one representative point rep(O1) of the source object O1 and all points of the destination object O2. The representative point of a source object may, e.g., be the center of the object. This representative point is used as the origin of a virtual coordinate system, and its quadrants define the directions.

Definition 4: (complex neighborhood relations) If r1 and r2 are neighborhood relations, then r1 ∧ r2 and r1 ∨ r2 are also neighborhood relations, called complex neighborhood relations.

2.2 Neighborhood Graphs and Their Operations

Based on the neighborhood relations, we introduce the concepts of neighborhood graphs and neighborhood paths and some basic operations for their manipulation.

Definition 5: (neighborhood graphs and paths) Let neighbor be a neighborhood relation and DB ⊆ 2^Points be a database of objects.
a) A neighborhood graph G^DB_neighbor = (N, E) is a graph with the set of nodes N, which we identify with the objects o ∈ DB, and the set of edges E ⊆ N × N, where two nodes n_1 and n_2 ∈ N are connected via some edge of E iff neighbor(n_1, n_2) holds. Let n denote the cardinality of N and let e denote the cardinality of E. Then f := e / n denotes the average number of edges of a node, i.e. f is called the "fan out" of the graph.
b) A neighborhood path is a sequence of nodes [n_1, n_2, ..., n_k], where neighbor(n_i, n_{i+1}) holds for all n_i ∈ N, 1 ≤ i < k. The number k of nodes is called the length of the neighborhood path.
c) A neighborhood path [n_1, n_2, ..., n_k] is valid iff ∀ i ≤ k, j < k: i ≠ j ⇔ n_i ≠ n_j.

2.3 Filter Predicates for Neighborhood Paths

Neighborhood graphs will in general contain many paths which are irrelevant, if not "misleading", for spatial data mining algorithms. For finding significant spatial patterns, we have to consider only certain classes of paths which are "leading away" from the starting object in some straightforward sense. Such spatial patterns are most often the effect of some kind of influence of an object on other objects in its neighborhood. Furthermore, this influence typically decreases or increases continuously with increasing or decreasing distance. The task of spatial trend analysis, i.e. finding patterns of systematic change of some non-spatial attributes in the neighborhood of certain database objects, can be considered a typical example. Detecting such trends would be impossible if we did not restrict the pattern space so that paths changing direction in arbitrary ways or containing cycles are eliminated [1].

Fig 4: Filter starlike and filter variable-starlike.

Definition 6: (filter starlike and filter variable-starlike) Let p = [n_1, n_2, ..., n_k] be a neighborhood path and let rel_i be the exact direction for n_i and n_{i+1}, i.e. n_{i+1} rel_i n_i holds.
The predicates starlike and variable-starlike for paths p are defined as follows:

starlike(p) :⇔ (∃ j < k: ∀ i > j: n_{i+1} rel_i n_i ⇔ rel_i ⊆ rel_j), if k > 1; TRUE, if k = 1
variable-starlike(p) :⇔ (∃ j < k: ∀ i > j: n_{i+1} rel_i n_i ⇔ rel_i ⊆ rel_1), if k > 1; TRUE, if k = 1

Chapter 3 Knowledge Discovery

Knowledge discovery in databases (KDD) has been defined as the non-trivial process of discovering valid, novel, potentially useful, and ultimately understandable patterns from data. The process of KDD is interactive and iterative, involving several steps such as the following:
• Selection: selecting a subset of all attributes and a subset of all data from which the knowledge should be discovered.
• Data reduction: using dimensionality reduction or transformation techniques to reduce the effective number of attributes to be considered.
• Data mining: the application of appropriate algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns over the data.
• Evaluation: interpreting and evaluating the discovered patterns with respect to their usefulness in the given application.

Spatial Database Systems (SDBS) are database systems for the management of spatial data. Spatial data mining algorithms are very important for finding implicit regularities, rules or patterns hidden in large spatial databases, e.g. for geomarketing, traffic control or environmental studies. Most existing data mining algorithms run on separate and specially prepared files, but integrating them with a database management system (DBMS) has the following advantages. Redundant storage and potential inconsistencies can be avoided. Furthermore, commercial database systems offer various index structures to support different types of database queries. This functionality can be used without extra implementation effort to speed up the execution of data mining algorithms (which, in general, have to perform many database queries) [4].
As with the relational standard language SQL, the use of standard primitives will speed up the development of new data mining algorithms and will also make them more portable.

3.1 Spatial Classification

The task of classification is to assign an object to a class from a given set of classes based on the attribute values of this object. In spatial classification the attribute values of neighboring objects are also considered.

The classification algorithm works as follows. The relevant attributes are extracted by comparing the attribute values of the target objects with the attribute values of their nearest neighbors. The determination of relevant attributes is based on the concepts of the nearest hit (the nearest neighbor belonging to the same class) and the nearest miss (the nearest neighbor belonging to a different class). In the construction of the decision tree, the neighbors of target objects are not considered individually. Instead, so-called buffers are created around the target objects, and the non-spatial attribute values are aggregated over all objects contained in the buffer. For instance, in the case of shopping malls, a buffer may represent the area where its customers live or work [4]. The size of the buffer yielding the maximum information gain is chosen, and this size is applied to compute the aggregates for all relevant attributes.

In spatial classification the attribute values of neighbouring objects may be relevant for the membership of objects and therefore have to be considered as well. The extension to spatial attributes is to consider also the attributes of objects on a neighbourhood path starting from the current object. Thus, we define generalized attributes for a neighbourhood path p = [o1, ..., ok] as tuples (attribute-name, index), where index is a valid position in p, representing the attribute with name attribute-name of the object at position index. The generalized attribute (economic-power, 2), e.g., represents the attribute economic-power of some (direct) neighbour of object o1.

Because it is reasonable to assume that the influence of neighbouring objects and their attributes decreases with increasing distance, we can limit the length of the relevant neighbourhood paths by an input parameter max-length. Furthermore, the classification algorithm allows the input of a predicate to focus the search for classification rules on the objects of the database fulfilling this predicate. Figure 5 depicts a sample decision tree and two rules derived from it. Economic power has been chosen as the class attribute, and the focus is on all objects of type city.

Fig 5: Sample decision tree and rules discovered by the classification algorithm.

3.2 Spatial Clustering

Spatial clustering is the process of grouping a set of spatial objects into clusters so that objects within a cluster are highly similar to one another but dissimilar to objects in other clusters. For example, clustering is used to determine the "hot spots" in crime analysis and disease tracking. Hot spot analysis is the process of finding unusually dense event clusters across time and space. Many criminal justice agencies are exploring the benefits of computer technologies for identifying crime hot spots in order to adopt preventive strategies such as deploying saturation patrols in hot spot areas [6].

Spatial clustering can be applied to group similar spatial objects together; the implicit assumption is that patterns in space tend to be grouped rather than randomly located. However, the statistical significance of spatial clusters should be measured by testing this assumption on the data. The test is critical before proceeding with any serious clustering analyses.
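As a rough illustration of such a test, the sketch below counts points in quadrats (described in Section 3.2.1) and computes the variance-to-mean ratio of the counts. This statistic is one common choice and is an assumption of this example, not something prescribed by the report: under complete spatial randomness the counts are approximately Poisson, so the ratio is close to 1, while values well above 1 suggest clustering and values well below 1 suggest regular spacing. The sketch also assumes a regular grid of quadrats over a rectangular study area rather than randomly placed quadrats; the function names and the toy data are hypothetical.

```python
def quadrat_counts(points, width, height, nx, ny):
    """Count the points falling in each cell of an nx-by-ny grid of quadrats."""
    counts = [0] * (nx * ny)
    for x, y in points:
        # Clamp so that points on the upper border fall in the last cell.
        ix = min(int(x / width * nx), nx - 1)
        iy = min(int(y / height * ny), ny - 1)
        counts[iy * nx + ix] += 1
    return counts


def variance_to_mean_ratio(counts):
    """Sample variance of the quadrat counts divided by their mean."""
    n = len(counts)
    mean = sum(counts) / n
    var = sum((c - mean) ** 2 for c in counts) / (n - 1)
    return var / mean


# Toy data on the unit square: one point per quadrat versus all points in one spot.
regular = [((i + 0.5) / 4, (j + 0.5) / 4) for i in range(4) for j in range(4)]
clustered = [(0.1, 0.1)] * 16

vmr_regular = variance_to_mean_ratio(quadrat_counts(regular, 1.0, 1.0, 4, 4))
vmr_clustered = variance_to_mean_ratio(quadrat_counts(clustered, 1.0, 1.0, 4, 4))
print(vmr_regular, vmr_clustered)
```

A formal significance test would compare the statistic with its distribution under randomness; the ratio alone only indicates the direction of the departure.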
3.2.1 Complete Spatial Randomness, Cluster, and Decluster

In spatial statistics, the standard against which spatial point patterns are often compared is a completely spatially random point process, and departures from it indicate that the pattern is not distributed randomly in space. Complete spatial randomness (CSR) [Cressie1993] is synonymous with a homogeneous Poisson process. The patterns of the process are independently and uniformly distributed over space, i.e., the patterns are equally likely to occur anywhere and do not interact with each other. However, patterns generated by a non-random process can be either cluster patterns (aggregated patterns) or decluster patterns (uniformly spaced patterns).

Fig 6: Illustration of CSR, Cluster, and Decluster Patterns.

Several statistical methods can be applied to quantify deviations of patterns from a complete spatial randomness point pattern. One type of descriptive statistics is based on quadrats (i.e., well-defined areas, often rectangular in shape) [6]. Usually quadrats of random location and orientation are chosen, the patterns falling in the quadrats are counted, and statistics derived from the counts are computed. Another type of statistics is based on distances between patterns; one example is Ripley's K-function. After the statistical significance of the spatial clustering has been verified, classical clustering algorithms can be used to discover interesting clusters.

3.2.2 DECODE (DiscovEring Clusters Of Different dEnsities)

Discovering clusters in complex spatial data, in which clusters of different densities are superposed, severely challenges existing data mining methods. For instance, in seismic research, foreshocks (which indicate forthcoming strong earthquakes) or aftershocks (which may help to elucidate the mechanism of major earthquakes) are often obscured by background earthquakes.
In this context, data are presumed to consist of various spatial point processes, in each of which points are distributed at a constant but different intensity. DECODE is a new density-based clustering method for discovering clusters of different densities in spatial data [8]. The novelties of DECODE are two-fold: (1) it can identify the number of point processes with little prior knowledge; (2) it can automatically estimate the thresholds for separating point processes and clusters.

Fig 7: Flowchart of the method for discovering clusters of different densities in spatial data.

Two strategies have been adopted for finding density-homogeneous clusters in density-based methods:
(1) Grid-based clustering method: map data into a mesh grid and identify dense regions according to the density in cells. The main advantages are the ability to find arbitrarily shaped clusters and high efficiency in dealing with complex data sets, which are characterized by large amounts of data, high dimensionality and multiple densities.
(2) Distance-based clustering method: often based on the mth nearest neighbor distance.

When clusters with different densities and noise coexist in a data set, few existing methods can determine the number of processes and precisely estimate the parameters. DECODE is intended as a solution to this problem.

DECODE is based upon a reversible jump Markov Chain Monte Carlo (MCMC) strategy and is divided into three steps:
(1) Map each point in the data to its mth nearest distance.
(2) Determine classification thresholds via a reversible jump MCMC strategy.
(3) Form clusters by spatially connecting the points whose mth nearest distances fall into a particular bin defined by the thresholds.

Some aspects of the model

1) Analysis of sensitivity to prior specification

Two strategies can be applied to the selection of β:
a. to update β constantly during the process;
b. to fix β.
Where λmax is the intensity corresponding to min(Xm).

Fig 8: The cumulative occupancy fractions of j at different fb: (a) fb = 2; (b) fb = 10; (c) fb = 50; (d) fb = 200; (e) fb = 1,000; (f) fb = 5,000; (g) fb = 20,000; (h) fb = 250,000; (i) updating β (with δ = 1, α = 1, h = 10/(max(Xm) − min(Xm)), g = 0.2).

As shown above, we find that updating β produces the highest posterior probability. From the results, it is observed that updating β tends to produce a model with more point processes, and that fb is the parameter which significantly influences the results when β is fixed.

The total complexity of the algorithm is O(T(Mk + (k−1)k!)), where T is the number of sweeps, M is the number of points in the data set, and k is the number of point processes that the algorithm determines. DECODE is a robust and automatic clustering method which needs less prior knowledge of the target data.

3.3 Association rules

A spatial association rule is a rule indicating a certain association relationship among a set of spatial and possibly some non-spatial predicates. A strong rule indicates that the patterns in the rule have relatively frequent occurrences in the database and strong implication relationships. Traditional data organization and retrieval tools can only handle the storage and retrieval of explicitly stored data. The extraction and comprehension of the knowledge implied by the huge amount of spatial data, though highly desirable, pose great challenges to currently available spatial database technologies.

A spatial characteristic rule is a general description of a set of spatial-related data. For example, the description of the general weather patterns in a set of geographic regions is a spatial characteristic rule. A spatial discriminant rule is the general description of the contrasting or discriminating features of a class of spatial-related data from other class(es). For example, the comparison of the weather patterns in two geographic regions is a spatial discriminant rule.
A spatial association rule is a rule which describes the implication of one or a set of features by another set of features in spatial databases. For example, a rule like "most big cities in Canada are close to the Canada-U.S. border" is a spatial association rule. A spatial association rule has the form P1 ∧ … ∧ Pm → Q1 ∧ … ∧ Qn (c%), where at least one of the predicates P1, …, Pm, Q1, …, Qn is a spatial predicate, and c% is the confidence of the rule. A rule "P → Q/S" is strong if the predicate "P ∧ Q" is large in set S and the confidence of "P → Q/S" is high.

Steps for extracting the association rules:

STEP 1: Task_relevant_DB := extract_task_relevant_objects(SDB, RDB);
(Relevant objects are collected into one database.)
STEP 2: Coarse_predicate_DB := coarse_spatial_computation(Task_relevant_DB);
(The spatial algorithm is executed at the coarse resolution level.)
STEP 3: Large_Coarse_predicate_DB := filtering_with_minimum_support(Coarse_predicate_DB);
(Computes the support for each predicate in Coarse_predicate_DB and filters out those entries whose support is below the minimum support threshold at the top level.)
STEP 4: Fine_predicate_DB := refined_spatial_computation(Large_Coarse_predicate_DB);
(The algorithm is executed at the fine resolution level.)
STEP 5: Find_large_predicates_and_mine_rules(Fine_predicate_DB);

3.4 Outlier Detection

Spatial outliers represent locations which are significantly different from their neighborhoods even though they may not be significantly different from the entire population. Identification of spatial outliers can lead to the discovery of unexpected, interesting and implicit knowledge, such as local instability [3].
A spatial outlier is a spatially referenced object whose non-spatial attribute values are significantly different from those of other spatially referenced objects in its spatial neighborhood. Informally, a spatial outlier is a local instability, or a spatially referenced object whose non-spatial attributes are extreme relative to its neighbors, even though they may not be significantly different from the entire population. For example, a new house in an old neighborhood of a growing metropolitan area is a spatial outlier based on the non-spatial attribute house age. The problem is therefore: given the components of the S-outlier definition, design a computationally efficient algorithm to detect the S-outliers.

Techniques of outlier detection: graphical methods include variogram clouds and Moran scatterplots. The algebraic methods include the scatterplot and Z(S(x)).

A variogram cloud displays data points related by neighborhood relationships. For each pair of locations, the square root of the absolute difference between the attribute values at the locations is plotted against the Euclidean distance between the locations. In data sets exhibiting strong spatial dependence, the variance in the attribute differences will increase with increasing distance between locations. Locations that are near to one another but have large attribute differences might indicate a spatial outlier, even though the values at both locations may appear reasonable when the dataset is examined non-spatially.

Fig 9: A Dataset for Outlier Detection.

In Fig 9(a), the X-axis is the location of data points in one-dimensional space; the Y-axis is the attribute value for each data point. Global outlier detection methods ignore the spatial location of each data point and fit the distribution model to the values of the non-spatial attribute.
The outlier detected using this approach is the data point G, which has an extremely high attribute value of 7.9, exceeding the threshold of μ + 2σ = 4.49 + 2 ∗ 1.61 = 7.71, as shown in Fig 9(b). This test assumes a normal distribution for attribute values. On the other hand, S is a spatial outlier whose observed value is significantly different from those of its neighbors P and Q [3].

Fig 10: Variogram cloud.

This plot shows that two pairs, (P, S) and (Q, S), on the left-hand side lie above the main group of pairs and are possibly related to outliers. The point S may be identified as a spatial outlier since it occurs in both pairs (Q, S) and (P, S). This technique requires non-trivial post-processing of highlighted pairs to separate spatial outliers from their neighbors, particularly when multiple outliers are present or density varies greatly.

Fig 11: Scatter plot.

A scatterplot shows the attribute values on the X-axis and the average of the attribute values in the neighborhood on the Y-axis. A least-squares regression line is used to identify spatial outliers. A scatter sloping upward to the right indicates positive spatial autocorrelation; a scatter sloping upward to the left indicates negative spatial autocorrelation. The residual is defined as the vertical distance (along the Y-axis) between a point P with location (Xp, Yp) and the regression line Y = mX + b, that is, residual ε = Yp − (mXp + b).

Chapter 4 Efficient polygon amalgamation method

The polygon amalgamation operation computes the boundary of the union of a set of polygons. It is a fundamental operation for emerging applications such as spatial OLAP and spatial data mining. There are two efficient methods for identifying internal polygons without retrieving them from databases [7]:
1) Adjacency-based method
2) Occupancy-based method

4.1 Adjacency-based approach

Two polygons are adjacent if they have at least one pair of identical line segments. The adjacency table of a set of polygons P is defined as a two-column table ADJACENCY(p, p') where p and p' are identifiers of polygons in P. A tuple (p, p') is in the table ADJACENCY if and only if p and p' are adjacent. The basic idea is to use the adjacency table to identify boundary polygons, and then remove identical line segments in the boundary polygons to get t(S). Hence, most of the internal polygons will not be retrieved. A polygon is on the boundary of S if it is adjacent to some polygon that does not belong to S [7].

Fig 12: An example of shadow ring

The adjacency table for the four polygons in Fig 12(a), where P = {p1, p2, p3, p4}, is:

{(p1, p2), (p1, p3), (p1, p4), (p2, p3), (p2, p4), (p3, p4), (p2, D), (p3, D), (p4, D)}

where (p, D) means p has at least one line segment adjacent to no polygon in P (in this case we say p is adjacent to a dummy polygon, labeled D). For S ⊆ P, by definition we have

δS = { p | p ∈ S, ∃(p, p') ∈ ADJACENCY, p' ∉ S }

So, taking S = P in this example, δS = {p2, p3, p4}: p2, p3 and p4 are boundary polygons, and p1 will not be involved in the computation. As a result, those line segments of p2, p3 and p4 that were originally shared with p1 have lost their counterparts. These line segments cannot be removed, and they form a shadow ring.

To deal with this, identify δS+: the set of sub-boundary polygons, which are internal polygons adjacent to boundary polygons. We extract both δS and δS+ and remove all identical line segments in them. We get the target polygon and possibly a shadow ring, but this time we know the shadow ring is formed by line segments of the sub-boundary polygons δS+. So we then remove the line segments belonging to δS+. The remaining line segments will form the boundary of the target polygon.
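The selection of δS and δS+ from the adjacency table can be sketched in Python as follows. This is a minimal illustration, not the implementation from [7]: the function names are hypothetical, the table is assumed to store each adjacent pair once and is treated as symmetric, and the dummy polygon is represented by the string "D". The removal of identical line segments is not shown.

```python
# Adjacency table of the four-polygon example; "D" is the dummy polygon.
ADJACENCY = [
    ("p1", "p2"), ("p1", "p3"), ("p1", "p4"),
    ("p2", "p3"), ("p2", "p4"), ("p3", "p4"),
    ("p2", "D"), ("p3", "D"), ("p4", "D"),
]

def neighbours(adjacency):
    """Build a symmetric neighbour map from the two-column table."""
    nbrs = {}
    for p, q in adjacency:
        nbrs.setdefault(p, set()).add(q)
        nbrs.setdefault(q, set()).add(p)
    return nbrs

def boundary_polygons(s, adjacency):
    """delta S: members of s adjacent to at least one polygon outside s."""
    nbrs = neighbours(adjacency)
    return {p for p in s if nbrs.get(p, set()) - s}

def sub_boundary_polygons(s, adjacency):
    """delta S+: internal members of s adjacent to a boundary polygon."""
    nbrs = neighbours(adjacency)
    boundary = boundary_polygons(s, adjacency)
    internal = s - boundary
    return {p for p in internal if nbrs.get(p, set()) & boundary}

S = {"p1", "p2", "p3", "p4"}
delta_s = boundary_polygons(S, ADJACENCY)
delta_s_plus = sub_boundary_polygons(S, ADJACENCY)
print(sorted(delta_s))       # ['p2', 'p3', 'p4']
print(sorted(delta_s_plus))  # ['p1']
```

Only the polygons in these two sets would need to be fetched from the database; any internal polygon that is not adjacent to a boundary polygon is never retrieved.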
4.2 Occupancy-based approach

Z-values: a spatial data access method that establishes a relationship between the data space and spatial objects or their approximating bounding rectangles. It approximates a given object's shape by recursively decomposing the embedding data space into smaller data spaces known as Peano cells. Z-values are a commonly used spatial indexing mechanism [7]. Our occupancy-based approach is built on top of a simple extension of Z-values.

Fig 13: Z-order and object approximation using z-values

One way to assign Z-values:
1. The z-value of the initial space, D, is 1;
2. The z-values of the four quadrants of a Peano cell whose z-value is z = z1...zn, 1 ≤ zi ≤ 4, 1 ≤ i ≤ n, are z1, z2, z3 and z4 respectively (that is, z followed by the digit 1, 2, 3 or 4), following the z-order.

The accuracy of the approximation can be improved by assigning multiple Z-values to a polygon.

Let C be the set of all Peano cells with which S polygons overlap. If c ∈ C is not completely occupied by S polygons, we call c a boundary cell. An S polygon overlapping with a boundary cell is likely to be a boundary polygon.

The occupancy-based approach will not produce shadow rings, because:
• Any line segment that does not overlap with boundary cells is not part of the target polygon, and thus can be discarded.
• Any line segment inside a boundary cell either is part of the target polygon, or its counterpart from another polygon must also be inside this boundary cell.

Spatial indices using z-values associate objects with Peano cells. That is, each index entry is of the form (z, p), stating that object p overlaps with Peano cell z. There is no information about what percentage of the cell is occupied by p, so this is not sufficient to determine whether a Peano cell is completely occupied by a set of objects. Therefore, we extend the index entry to the form (z, p, α), where α is the occupancy ratio.
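How the extended entries could be used to find boundary cells can be sketched in Python. This is an illustration under assumptions that the text does not state: α is read as the fraction of cell z covered by polygon p, the polygons in S do not overlap one another (so their ratios within a cell can simply be added), z-values are stored as digit strings, and a small tolerance absorbs floating-point error. The function names and the sample entries are hypothetical.

```python
from collections import defaultdict

def boundary_cells(index_entries, s, tol=1e-9):
    """Cells overlapped by S polygons but not completely occupied by them.

    `index_entries` is a list of (z, p, alpha) tuples.
    Assumes the polygons in `s` do not overlap each other.
    """
    occupied = defaultdict(float)
    for z, p, alpha in index_entries:
        if p in s:
            occupied[z] += alpha  # total share of cell z covered by S
    return {z for z, total in occupied.items() if total < 1.0 - tol}

def candidate_boundary_polygons(index_entries, s, tol=1e-9):
    """S polygons that overlap at least one boundary cell."""
    cells = boundary_cells(index_entries, s, tol)
    return {p for z, p, _ in index_entries if p in s and z in cells}

# Hypothetical index entries (z, p, alpha) for illustration only.
entries = [
    ("11", "p1", 1.0),
    ("12", "p1", 0.5), ("12", "p2", 0.5),
    ("13", "p2", 0.6),
    ("14", "p2", 0.25), ("14", "p3", 0.25),
]
S = {"p1", "p2"}

cells = boundary_cells(entries, S)
candidates = candidate_boundary_polygons(entries, S)
print(sorted(cells))       # ['13', '14']
print(sorted(candidates))  # ['p2']
```

Cells 11 and 12 are fully covered by S polygons, so line segments lying entirely inside them can be discarded without looking at the polygons' geometry.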
Let p ∩ z be the polygon produced by clipping polygon p with the Peano cell z; then α = area(p ∩ z) / area(z).

Chapter 5 Challenges and Applications

5.1 Challenges

A noteworthy trend is the increasing size of data sets in common use, such as records of business transactions, environmental data and census demographics. These data sets often contain millions of records, or even far more. This situation creates new challenges in coping with scale. The challenges and impacts can be classified into three main areas, namely, geographic information in knowledge discovery, geographic knowledge discovery in geographic information science, and geographic knowledge discovery in geographic research [2]. The main difficulties are:
• The volume of spatial data is huge, so storage and retrieval are difficult.
• Spatial data types and structures are complex.
• Spatial processing operations are expensive.

5.2 Application in land use dynamic monitoring

The mass data stored in a spatial database includes spatial topology, non-spatial properties, and the changes of objects over time. The main knowledge types that can be discovered in a spatial database are: general geometric knowledge, spatial distribution rules, spatial association rules, spatial clustering rules, spatial characteristic rules, spatial discriminant rules, spatial evolution rules, etc. For land use dynamic monitoring, the knowledge mined from the spatial database has the following applications [4]:

a. Predict land change
Given the geographic location, soil characteristics, geologic conditions, flood prevention and control information, transportation conditions, etc. of the land, analysis using the spatial distribution rules, spatial clustering rules, spatial characteristic rules and spatial discriminant rules can reveal the distribution and the future development of the land.
b. Provide decision support for city planning
Spatial data mining makes use of general geometric knowledge, spatial distribution rules, spatial association rules and spatial evolution rules to obtain many factors about terrain, flood prevention and control, and pollution prevention during city planning, providing a good data environment for city construction.

c. Valid management and analysis of remote sensing monitoring results
Using spatial data mining algorithms based on knowledge discovery, it can effectively output various statistical charts, images, query results and analysis results for decision making.

5.3 Application in Visual Data Mining

For data mining of large data sets to be effective, it is also important to include humans in the data exploration process and combine their flexibility, creativity, and general knowledge with the enormous storage capacity and computational power of today's computers. Visual data mining applies human visual perception to the exploration of large data sets. Presenting data in an interactive, graphical form often fosters new insights, encouraging the formation and validation of new hypotheses to the end of better problem-solving and gaining deeper domain knowledge. Visual data mining often follows a three-step process: overview first, zoom and filter, and then details-on-demand [5]. Some of the key advantages of visual data exploration over automatic data mining techniques alone are that it:
• yields results more quickly, with a higher degree of user satisfaction and confidence in findings.
• is especially useful when little is known about the data and exploration goals are vague, because the analyst guides the search and can shift or adjust goals on the fly.
• can deal with highly non-homogeneous and noisy data.
• can be intuitive and requires less understanding of complex mathematical or statistical algorithms or parameters.
• can provide a qualitative overview of the data, allowing unexpected phenomena to be noticed.
Chapter 6 Conclusion

Spatial data mining extends relational data mining with respect to special features of spatial data, such as the mutual influence of neighboring objects through certain factors (topology, distance, direction). It is based on techniques such as generalization, clustering and mining association rules. Some algorithms require further expert knowledge that cannot be mined from the data, such as concept hierarchies. Spatial data mining is a niche area within data mining for the rapid analysis of spatial data. Spatial data can potentially influence major scientific challenges, including the study of global climate change and genomics. The distinguishing characteristics of spatial data mining can be neatly summarized by the first law of geography: all things are related, but nearby things are more related than distant things. Spatial data mining is being used in various fields, such as satellite remote sensing and visual data mining.

Bibliography

[1] Martin Ester, Alexander Frommelt, Hans-Peter Kriegel, Jörg Sander. Spatial Data Mining: Database Primitives, Algorithms and Efficient DBMS Support. "Integration of Data Mining with Database Technology", Data Mining and Knowledge Discovery, an International Journal, Kluwer Academic Publishers, 1999.
[2] Shashi Shekhar, Pusheng Zhang, Yan Huang, Ranga Raju Vatsavai. Trends in Spatial Data Mining. Department of Computer Science and Engineering, University of Minnesota, 4-192, 200 Union ST SE, Minneapolis, MN 55455.
[3] Shashi Shekhar, Chang-Tien Lu and Pusheng Zhang. A Unified Approach to Detecting Spatial Outliers. GeoInformatica, Volume 7, Issue 2, June 2003.
[4] Zhong Yong, Zhang Jixian, Yan Qin. Research on spatial data mining technique applied in land use dynamic monitoring. Proceedings of International Symposium on Spatio-temporal Modeling, Spatial Reasoning, Analysis, Data Mining and Data Fusion, Volume XXXVI-2/W25, 2005.
[5] Daniel A. Keim, Christian Panse, Mike Sips. Pixel Based Visual Mining of Geo-Spatial Data. IEEE Computer Graphics and Applications, Volume 24, Issue 5, September 2004.
[6] Krzysztof Koperski and Jiawei Han. Discovery of Spatial Association Rules in Geographic Information Databases. Proceedings of the 4th International Symposium on Advances in Spatial Databases (SSD '95), Springer-Verlag, London, UK, 1995.
[7] Xiaofang Zhou, David Truffet, and Jiawei Han. Efficient Polygon Amalgamation Methods for Spatial OLAP and Spatial Data Mining. Proceedings of the 6th International Symposium on Advances in Spatial Databases (SSD '99), Springer-Verlag, London, UK, 1999.
[8] Pei T, Jasra A, Hand DJ, et al. DECODE: A new method for discovering clusters of different densities in spatial data. Springer Science+Business Media, LLC, 2008.
[9] Raymond T. Ng, Jiawei Han. Efficient and Effective Methods for Spatial Data Mining. Proceedings of the 20th International Conference on Very Large Data Bases (VLDB '94), San Francisco, CA, USA, 1994.
