Chapter 7: Spatial Data Mining

Some key notes before reading the book chapter assignment:
1. Currently, only those sections have been added for which new literature has to be incorporated into the existing framework. The new literature sections are marked with a different color in the index.
2. Sections where major changes have been incorporated include association analysis and clustering (where hotspot detection is added).
3. Some of the mathematical symbols, verbatim text, and figures in this first draft have been taken directly from their respective sources; these have to be changed and are subject to revision in subsequent versions.
4. Illustrative examples and use cases for the added material are still being collected and will be incorporated shortly.
5. The section on spatial outliers has not been changed much from the first organization proposed. Some material discovered while writing is still being reviewed, and the topic might be extended in the next version.

1. Introduction
1.1 Data Mining Introduction
1.2 Spatial Data Mining
1.3 Motivation for Spatial Data Mining
2.1 Spatial Statistics
2.1.1 Point Process
2.1.2 Lattice
2.1.3 Geostatistics
2.1.4 Spatial Autocorrelation
3. Spatial Data Mining Tasks
3.1 Spatial Classification and Regression
3.1.1 Linear Regression
3.1.2 Spatial Regression
3.1.3 Model Evaluation
3.1.4 Predicting Location Using Map Similarity
3.1.5 Markov Random Fields
3.2 Spatial Association Rules
3.2.1 Association Rules in Data Mining (includes Apriori)
3.2.2 Spatial Association
3.2.3 Co-location Pattern Discovery
3.2.3.1 Global Co-location Miner Algorithm
3.2.3.2 Local Co-location Miner Algorithm
3.3 Spatial Clustering
3.3.2 Clustering Algorithms
3.3.3 Limitations of Clustering Algorithms
3.3.4 Hotspot Analysis
3.4 Spatial Outlier Detection
3.5 Spatio-Temporal Data Mining
3.5.1 Recent Trends in Spatio-Temporal Data Mining

1. Introduction
1.1 Data Mining:
The amount of data is growing exponentially every day.
Yet with all this data around us, we are still starved for information. Data mining is the process of discovering useful information from large datasets. It is an integral part of knowledge discovery in databases.

Data mining tasks: Data mining tasks fall into two major categories.
Predictive tasks: The objective of these tasks is to predict the value of an unknown attribute using the values of other, known attributes.
Descriptive tasks: The objective of these tasks is to derive patterns, in the form of clusters, trajectories, and anomalies, that give an understanding of the underlying relationships in the data.
The figure summarizes the four broad data mining tasks in use.

Fig. 1: Data mining tasks

Predictive modeling refers to the task of building a model for the target variable as a function of explanatory variables. Regression is used for continuous target variables, whereas classification is used for discrete target variables.
Association analysis is used to discover patterns that describe strongly associated features in the data. The discovered patterns are typically represented in the form of implication rules or feature subsets.
Cluster analysis seeks to partition observations into groups such that the items within each group are similar to each other but different from the items in other groups.
Anomaly detection is the task of identifying observations whose characteristics are significantly different from the rest of the data. Such observations are often called outliers.
In this chapter we extend the ideas and techniques presented above to the spatial domain.

1.2 Spatial Data Mining:
Spatial data mining (SDM) consists of extracting knowledge, spatial relationships, and any other properties of spatial data. SDM is used to find implicit regularities and relations between spatial and/or non-spatial data. Traditional analysis assumes that samples are independent, but spatial data are highly correlated in nature.
For example, people with similar characteristics, occupations, and backgrounds tend to cluster together in the same neighborhoods. A geographical database constitutes a spatio-temporal continuum in which the properties of a particular place are generally linked to, and explained in terms of, the properties of its neighborhood. Spatial relationships are therefore of great importance in the analysis process.

3.2.3 Co-location Pattern Discovery:
3.2.3.1 Global co-location patterns:
Co-location patterns represent subsets of Boolean spatial features whose instances are often located in close geographic proximity. Examples include symbiotic species and crime attractors (e.g., bars, misdemeanors, etc.). Boolean spatial features describe the presence or absence of geographic object types at different locations in a two-dimensional or three-dimensional metric space.
Spatial co-location: Co-location rules are models for inferring the presence of Boolean spatial features in the neighborhood of instances of other Boolean spatial features. For example, 'Nile Crocodiles → Egyptian Plover' predicts the presence of Egyptian Plover birds in areas with Nile Crocodiles.

Fig. 2: Illustration of point spatial co-location patterns

Figure 2 shows a data set consisting of instances of several Boolean spatial features, each represented by a distinct shape. The shapes '+', '×', 'o', and '∗' represent different spatial features. Spatial features in the sets {'+', '×'} and {'o', '∗'} tend to be located together. A careful review reveals two co-location patterns, namely ('+', '×') and ('o', '∗').
Co-location rule discovery is the process of identifying co-location patterns in large spatial data sets with a large number of Boolean features. The spatial co-location rule discovery problem looks similar to association rule mining but is in fact very different, because spatial data have no transactions. In market basket data sets, transactions represent sets of item types bought together by customers.
The support of an association is defined as the fraction of transactions containing the association. Association rules are derived from all associations whose support exceeds a user-given threshold.
Spatial co-location rule mining approaches can be grouped into two broad categories: approaches that use spatial statistics and algorithms that use primitives similar to those of association rule mining. Spatial-statistics-based approaches use measures such as the cross-K function, mean nearest-neighbor distance, and spatial autocorrelation. However, these approaches are computationally expensive.

Global co-location pattern mining algorithm:
This section defines the event-centric model, our approach to modeling co-location patterns. Consider Figure 3 as an example spatial data set to illustrate the model. In the figure, each instance is uniquely identified by T.i, where T is the spatial feature type and i is the unique id within that feature type. For example, B.2 represents instance 2 of spatial feature B. Two instances are connected by an edge if they have a spatial neighbor relationship.
A co-location is a subset of Boolean spatial features. A co-location rule is of the form c1 ⇒ c2 (p, cp), where c1 and c2 are co-locations with c1 ∩ c2 = ∅, p is a number representing the prevalence measure, and cp is a number measuring the conditional probability.

Terminology
R-proximity neighborhood: Given a reflexive and symmetric neighbor relation R over a set S of instances, an R-proximity neighborhood is a set I ⊆ S of instances that form a clique under the relation R. The definition of the neighbor relation R is an input and should be based on the semantics of the application domain. The neighbor relation R can be defined using spatial relationships, metric relationships (e.g., Euclidean distance), or a combination (e.g., shortest-path distance in a graph such as a road map).
The R-proximity neighborhood concept differs from the neighborhood concept in topology, since some supersets of an R-proximity neighborhood may not qualify as R-proximity neighborhoods.
Row instance: Two R-proximity neighborhoods I1 and I2 are R-reachable to each other if I1 ∪ I2 is an R-proximity neighborhood. An R-proximity neighborhood I is a row instance (denoted row_instance(c)) of a co-location c if I contains instances of all the features in c and no proper subset of I does. For example, {A.3, B.4, C.1} is a row instance of co-location {A, B, C} in the spatial data set shown in Figure 3. But {A.2, A.3, B.4, C.1} is not a row instance of {A, B, C}, because its proper subset {A.3, B.4, C.1} already contains instances of all features in {A, B, C}. As another example, {A.2, A.4} is not a row instance of co-location {A}, because its proper subset {A.2} already contains an instance of every feature in {A}. The table instance of a co-location c is the collection of all row instances of c.

Fig. 3: Spatial data set illustrating the event-centric model

Participation ratio: The participation ratio pr(c, fi) of feature type fi in a size-k co-location c = {f1, f2, ..., fk} is the fraction of instances of fi that are R-reachable to some row instance of the co-location c − {fi}. The participation index pi(c) of co-location c = {f1, f2, ..., fk} is the minimum of pr(c, fi) over i = 1, ..., k.
Conditional probability: The conditional probability of a co-location rule c1 ⇒ c2 is the fraction of row instances of c1 that are R-reachable to some row instance of c2. It is computed as

$$\frac{\left|\pi_{c_1}\big(\mathrm{table\_instance}(c_1 \cup c_2)\big)\right|}{\left|\mathrm{table\_instance}(c_1)\right|}$$

where $\pi$ is the relational projection operation with duplicate elimination.
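The definitions above can be made concrete with a small sketch. It is illustrative only: the instance coordinates, the choice of Euclidean distance with a cutoff `max_dist` as the neighbor relation R, and the function names are assumptions, not part of the chapter or of Figure 3. Row instances are enumerated by brute force, taking exactly one instance per feature (which guarantees that no proper subset covers all features) and keeping the combinations that form a clique under R. An instance of fi takes part in a row instance of c exactly when it is R-reachable to a row instance of c − {fi}, so the participation ratio can be read off the table instance.

```python
from itertools import combinations, product
from math import dist

# Hypothetical instances: feature type -> {instance id: (x, y)}.
INSTANCES = {
    "A": {1: (0.0, 0.0), 2: (10.0, 10.0)},
    "B": {1: (0.0, 1.0), 2: (1.0, 0.0), 3: (20.0, 20.0), 4: (30.0, 30.0)},
}


def row_instances(instances, colocation, max_dist):
    """Enumerate row instances of a co-location by brute force.

    Assumption: R holds between two instances when their Euclidean
    distance is at most max_dist.
    """
    features = sorted(colocation)
    pools = [sorted(instances[f].items()) for f in features]
    rows = []
    for combo in product(*pools):  # one instance per feature
        points = [xy for _, xy in combo]
        # Keep the combination only if it is a clique under R.
        if all(dist(p, q) <= max_dist for p, q in combinations(points, 2)):
            rows.append(tuple(f"{f}.{i}" for f, (i, _) in zip(features, combo)))
    return rows


def participation(instances, colocation, max_dist):
    """Return (participation ratios per feature, participation index)."""
    features = sorted(colocation)
    rows = row_instances(instances, colocation, max_dist)
    ratios = {}
    for pos, f in enumerate(features):
        participating = {row[pos] for row in rows}  # distinct instances of f
        ratios[f] = len(participating) / len(instances[f])
    return ratios, min(ratios.values())


rows = row_instances(INSTANCES, {"A", "B"}, max_dist=1.5)
ratios, index = participation(INSTANCES, {"A", "B"}, max_dist=1.5)
print(rows)    # table instance of {A, B}
print(ratios)  # participation ratio of each feature
print(index)   # participation index of {A, B}
```

In this toy data set only A.1 has neighbors of type B, so one of two A instances and two of four B instances participate. Brute-force enumeration is exponential in the co-location size; it is used here only to keep the definitions visible.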
The Co-location Mining Algorithm:
Input:
1) K Boolean spatial feature types and their instances
2) A symmetric and reflexive neighbor relation R
3) A user-specified threshold on the prevalence measure (min_prevalence)
4) A user-specified minimum conditional probability (min_cond_prob)
Output: The set of co-location rules with participation index > min_prevalence and conditional probability > min_cond_prob
Steps:
1) Initialize the prevalent size-1 co-location set, along with its table instances, to P
2) Generate size-2 co-location rules
3) For co-location sizes k from 2 to K-1 do
4) Generate candidate prevalent co-locations using the generalized apriori algorithm
5) Generate table instances and prune based on neighborhood
6) Prune based on the prevalence of co-locations
7) Generate co-location rules
8) End

Step 1 initializes the prevalent size-1 co-location set P from the input of the algorithm. The participation index of every singleton co-location is 1, so all singleton co-locations are prevalent.
Step 2 generates prevalent co-location rules of size 2. Because singleton co-locations offer no opportunity for pruning, it is more efficient to use a spatial join on the neighbor relationship than to apply the generalized apriori algorithm followed by neighbor-based pruning, as is done when generating co-location rules of size 3 or more. The spatial inner join of the instances of all spatial features produces the pairs of instances that satisfy the neighbor relation R.
Steps 3 to 8 loop from 2 to K-1 to generate prevalent co-locations of size 3 or more, iterating over increasing co-location sizes. The loop terminates as soon as an empty co-location set is generated for some size.
In Step 4, candidate generation uses apriori_gen to generate size k+1 candidate co-locations from the size-k prevalent co-locations.
In Step 5, the generation of table instances is carried out using a join query. The join can be computed using a geometric approach, a combinatorial approach, or a hybrid approach.
In Step 6, the candidate co-locations generated are pruned using the threshold on the prevalence measure.

3.2.3.2 Local Co-location Algorithm:
Global statistics seldom provide useful insight, because most relationships in spatial data sets are geographically regional. Robust tools capable of extracting local co-location patterns from large spatial data sets are therefore critical for advancing scientific research. In the following section we present an algorithm called CLEVER (CLustEring using representatiVEs and Randomized hill climbing) that finds local co-location patterns in data sets.
Consider a data set containing objects o1, ..., on: O = {o1, ..., on} ⊆ F, where F is the feature space of the data set and the objects belonging to O are tuples characterized by attributes S ∪ N, where S = {S1, …, Sp} is a set of spatial and temporal attributes and N = {A1, …, Aq} is a set of other, non-geo-referenced attributes. Dom(S) and Dom(N) describe the possible values that the attributes in S and N can take; that is, each object o ∈ O is characterized by a single tuple that takes values in F = Dom(S) × Dom(N). Data sets with this structure are called geo-referenced, and O is assumed to be a geo-referenced data set.
The purpose of the framework is to find interesting places, called regions in the following, in geo-referenced data sets. Regions are assumed to be contiguous areas in the spatio-temporal space Dom(S), which is a subspace of F. A region has an extension, which is the set of objects in O it contains, and an intension, which describes the area it occupies.
The region discovery framework employs additive, plug-in fitness functions q that capture what kind of regions are of interest to the domain expert. Fitness functions are assumed to have the following structure for a set of regions X:

$$q(X) = \sum_{c \in X} i(c)\,|c|^{\beta}$$

where i(c) denotes the interestingness of region c, a quantity that reflects the degree to which a region is "newsworthy". It is important to find regions at different levels of granularity.
The premium put on the size of the extension of a region (|c| denotes the cardinality of c) is controlled by the value of the parameter β. A region's reward is proportional to its interestingness, but rewards increase non-linearly with region size (β > 1) to encourage merging neighboring regions with similar characteristics.
Given a geo-referenced data set O, there are many possible algorithms for seeking interesting regions in O with respect to a plug-in fitness function q, subject to the following specification:
Given: O, q, and possibly other input parameters
Find: X = {r1, ..., rk} that maximizes q({r1, ..., rk}) subject to the following constraints:
(1) ri ⊆ O (i = 1, …, k)
(2) r1, r2, …, rk are contiguous in Dom(S)
(3) ri ∩ rj = ∅ for i ≠ j

CLEVER searches for such a solution by randomized hill climbing over sets of representatives. Neighboring solutions of the current solution are created using three operators: 'Insert' inserts a new representative into the current solution, 'Delete' deletes a representative from the current solution, and 'Replace' replaces a representative with a non-representative. Each operator has a certain selection probability, and the representatives to be manipulated are chosen at random. The algorithm also allows for larger neighborhood sizes; the experiments in the source paper were run with neighborhood size 3, in which case sampled solutions are generated by applying three randomly selected operators to the current solution. Moreover, to counter premature convergence, CLEVER re-samples p' > p solutions before terminating.

Pseudo-code for CLEVER:
Inputs: k', neighborhood-size, p, p'
Outputs: regions, region representatives, number of representatives (k), fitness, interestingness, etc.
Algorithm:
Step 1: Create a current solution by randomly selecting k' representatives from O.
Step 2: Create p neighbors of the current solution randomly, using the given neighborhood definition.
Step 3: If the best neighbor improves the fitness, it becomes the current solution. Go back to Step 2.
Step 4: If the fitness does not improve, the solution neighborhood is re-sampled by generating p' more neighbors. If re-sampling does not improve the current solution, terminate; otherwise, go back to Step 2, replacing the current solution with the best solution found by re-sampling.

By adding and deleting representatives and by using a neighborhood size larger than one, CLEVER samples from a much larger neighborhood of the current solution. This characteristic distinguishes CLEVER from other prototype-based clustering algorithms.

3.3 Spatial Clustering:
Challenges of clustering algorithms: Although there are similarities between spatial and non-spatial clustering, large databases, and spatial databases in particular, have unique requirements that create special needs for clustering algorithms.
1. The algorithms must be scalable and efficient, since they have to deal with large data sets.
2. Algorithms need to be able to identify irregular shapes, including those with lacunae or concave sections and nested shapes. (See figure below.)
3. The clustering mechanism should be insensitive to large amounts of noise.
4. Algorithms should not be sensitive to the order of input. That is, clustering results should be independent of data order.
5. No a priori knowledge of the data or of the number of clusters to be created should be required, and therefore no domain knowledge input should be required from the user.
6. Algorithms should handle data with large numbers of features, that is, high dimensionality.

PAM (Partitioning Around Medoids) uses k-clustering on medoids to identify clusters. It works efficiently on small data sets but is extremely costly for larger ones. This led to the development of CLARA. CLARA (Clustering Large Applications) creates multiple samples of the data set and then applies PAM to each sample.
CLARA chooses the best clustering as the output, judging quality by the similarity and dissimilarity of objects in the entire data set, not just in the samples. One of the first clustering algorithms specifically designed for spatial databases was CLARANS, which uses the k-medoid method of clustering. CLARANS was followed by DBSCAN, a locality-based algorithm that relies on the density of objects for clustering. A further algorithm is also locality-based, but it allows for random distribution of the points. Other density- or locality-based algorithms include STING, an enhancement of DBSCAN; WaveCluster, a method based on wavelets; and DENCLUE, which is a generalization of several locality-based algorithms. Three other algorithms, BIRCH, CURE, and CLIQUE, are hybrid algorithms that make use of both hierarchical techniques and grouping of related items.

3.3.4 Hotspot Analysis:
Hotspots are a special kind of clustered pattern. As in clustered patterns, objects in hotspot regions are highly similar to one another and quite dissimilar to all objects outside the hotspot. One important feature that distinguishes a hotspot from a general cluster is that the objects in the hotspot area are more active than all others (in density, appearance, etc.). The spatial correlation of the attribute values within a hotspot can be high and may drop dramatically at the boundary, whereas in traditional clustering the attribute values within a cluster can be independent and identically distributed (i.i.d.). Hotspot discovery/detection in SDM is the process of identifying spatial regions where more events are likely to happen, or more objects are likely to appear, than in other areas. Hotspot detection is mainly used in the analysis of crime and disease data. Crime data analysis aims at finding areas that have greater than average numbers of criminal or disorderly events, or areas where people have a higher than average risk of victimization. Figure 4 shows two types of hotspots, namely point hotspots and area hotspots.
The design of hotspot maps is primarily oriented toward helping law enforcement place its resources appropriately for crime investigation. For example, Figure 4(a) shows the locations of bars in seven different colors obtained using LISA (local indicators of spatial association); the red squares in the center and on the peripheries of the map show the bars with high crime activity. Maps such as the one in Figure 4(a) show specific bars or hotspots where increased attention to crime mitigation is necessary. On the other hand, if an analyst is interested in the geographic distribution of a particular crime type (e.g., vandalism) relative to an underlying baseline variable, techniques such as kernel density estimation, which is part of tools such as CrimeStat, can be used. For example, Figure 4(b) shows the hotspots of vandalism incidents in the same city: the red cells indicate areas with significantly high clustering of vandalism reports, the blue cells indicate areas with a significantly low concentration of vandalism, and grey indicates areas with no significant concentration. This map shows that there is significant clustering of vandalism incidents in the center of the city, around the downtown areas.
Hotspot analysis also finds applications in cancer and disease data analysis. There, hotspots of locations where a disease is reported intensively are detected, which may indicate a potential outbreak of the disease or suggest an underlying cause. Other application domains include transportation (to identify unusual rates of accidents along highways) and ecology (to conduct geoinformatic surveillance for geospatial hotspot detection).
Many of the standard clustering algorithms have been adapted for spatial hotspot analysis. These include K-means, hierarchical clustering, etc.
Many other methods, such as STAC (spatio-temporal analysis of crime) and LISA (local indicators of spatial association), have been developed to aid law enforcement agencies in crime mitigation. Spatial hotspot analysis methods of particular utility in public health applications, such as syndromic surveillance and outbreak detection, have also been proposed. These methods include various frequentist and Bayesian statistical measures such as the spatial scan and space-time scan statistics.

Fig. 4: Spatial hotspots from the city of Lincoln

Spatial network analysis:
A spatial network is a network of spatial elements. A transportation network is a prime example. Spatial network hotspot analysis has various applications and is particularly important for crime analysis (discovery of streets with high crime density) and law enforcement (planning effective and efficient patrolling strategies). In urban areas, many human activities are centered on spatial infrastructure networks, such as roads and highways, oil/gas pipelines, and utilities (e.g., water, electricity, telephone). Thus, activity reports such as crime logs often use network-based location references (e.g., street addresses). In addition, spatial interaction among activities at nearby locations may be constrained by network connectivity and network distances (e.g., shortest paths along roads or train networks) rather than by the geometric distances used in traditional spatial analysis. Traditional methods that employ a geometric summarization scheme to identify concentrations of crime may miss large crime concentrations that network-based methods normally capture. For example, Figure 5(a) and (b) compare an ellipse-based geometric hotspot method with a network-based hotspot method on a data set from the recent Haiti earthquake.
Crime prevention may focus on identifying subsets of spatio-temporal (ST) networks with high activity levels, understanding the underlying causes in terms of network properties, and designing network control policies. Identifying and quantifying spatial network hotspots is a challenging task because of the need to choose the correct statistical model. In addition, the discovery process in large spatial networks is computationally very expensive.

Fig. 5: Comparison between geometric and network-based hotspots for requests during the Haiti earthquake

3.5 Spatio-Temporal Data Mining:
Spatio-temporal data are often modeled using events and processes, both of which generally represent change of some kind. Processes refer to ongoing phenomena that represent activities of one or more types without a specified endpoint. Events refer to individual occurrences of a process with a specified beginning and end. Event-types and event-instances are distinguished. For example, a hurricane event-type may occur at many different locations and times, such as Katrina (New Orleans, 2005) and Rita (Houston, 2005). Each event-instance is associated with a particular occurrence time and location. The ordering of event-instances may be total if they have disjoint occurrence times. Otherwise, ordering is based on spatio-temporal semantics such as partial order, and spatio-temporal patterns can be modeled as partially ordered subsets. These unique characteristics create new and interesting challenges for discovering spatio-temporal patterns. For example, in contrast to spatial outliers, a spatio-temporal outlier is a spatio-temporal object whose thematic (non-spatial and non-temporal) attributes are significantly different from those of other objects in its spatial and temporal neighborhoods. A spatio-temporal object is defined as a time-evolving spatial object whose evolution or history is represented by a set of instances (o_id, s_i, t_i), where the space stamp s_i is the location of the object o_id at timestamp t_i.
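To make the spatio-temporal outlier definition more tangible, here is a minimal sketch. The chapter does not say how "significantly different" is measured, so the sketch swaps in a simple z-score test against the neighborhood mean; that test, the Euclidean radius, the time window, the threshold, and all names and data below are assumptions for illustration only.

```python
from math import dist
from statistics import mean, pstdev

# Hypothetical instances: (object id, (x, y) space stamp, timestamp, thematic value).
OBSERVATIONS = [
    ("a", (0.0, 0.0), 0, 10.0),
    ("b", (1.0, 0.0), 0, 11.0),
    ("c", (0.0, 1.0), 1, 9.0),
    ("d", (1.0, 1.0), 1, 50.0),
    ("e", (100.0, 100.0), 0, 50.0),
]


def st_outliers(observations, radius, time_window, z_threshold=2.0):
    """Flag instances whose thematic value deviates from their
    spatio-temporal neighbors (assumed test: z-score above z_threshold)."""
    flagged = []
    for idx, (oid, loc, t, value) in enumerate(observations):
        # Neighbors: other instances close in both space and time.
        neighbors = [
            v
            for j, (_, nloc, nt, v) in enumerate(observations)
            if j != idx and dist(loc, nloc) <= radius and abs(t - nt) <= time_window
        ]
        if len(neighbors) < 2:
            continue  # too few neighbors to judge
        mu, sigma = mean(neighbors), pstdev(neighbors)
        if sigma == 0:
            if value != mu:
                flagged.append(oid)
        elif abs(value - mu) / sigma > z_threshold:
            flagged.append(oid)
    return flagged


outliers = st_outliers(OBSERVATIONS, radius=2.0, time_window=1)
print(outliers)
```

Instance "e" has the same extreme value as "d" but no spatio-temporal neighbors, so it is not judged at all. This shows that the definition is relative to a neighborhood rather than to the global distribution of values.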
3.5.1 Recent Trends in Spatio-Temporal Data Mining:
Flow anomalies: Given a percentage threshold and a set of observations across multiple spatial locations, flow anomaly (FA) discovery aims to identify dominant time intervals in which the fraction of time instants with significantly mismatched sensor readings exceeds the given percentage threshold. Figure 6 gives a simple example of flow anomalies. In Figure 6(a), the input to the FA problem consists of two spatial locations [i.e., an upstream (up) and a downstream (down) sensor], 10 time instants, and the notion of travel time or flow between the locations. For simplicity, the travel time is set to a constant of 1, but it can be variable. The output contains two flow anomalies expressed in time instants at the upstream sensor, periods 1–3 and 6–9, in which the majority of time points show significant differences between the two sensors (Figure 6(b)).
Discovering flow anomalies is important for water treatment systems, transportation networks, and video surveillance systems. However, mining flow anomalies is computationally expensive because of the large (potentially infinite) number of time instants across a spatial network of locations. Traditional outlier detection methods (e.g., the t-test) are suited to detecting transient flow anomalies (i.e., time instants of significant mismatch across consecutive sensors) but cannot detect persistent flow anomalies (i.e., long, variable time windows with a high fraction of transient flow anomalies) because no window size is known in advance. Spatial outlier detection techniques do not consider the flow (i.e., travel time) between spatial locations and cannot detect any type of flow anomaly.

Fig. 6: Example of flow anomalies

Teleconnected flow anomalies: A teleconnection represents a strong interaction between paired events that are spatially distant from each other. It builds on flow anomalies.
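Since teleconnections build on flow anomalies, the flow anomaly problem described above can be sketched in a few lines. This is a brute-force illustration under stated assumptions, not a published algorithm: a "significant mismatch" is taken to be an absolute difference above a fixed `tolerance`, a "dominant" interval is taken to be a qualifying window not contained in any other qualifying window, and the readings are hypothetical values chosen so that the result resembles the two periods described for Figure 6.

```python
def mismatch_flags(upstream, downstream, travel_time=1, tolerance=1.0):
    """Flag each upstream time instant whose reading differs from the
    downstream reading observed travel_time steps later (assumed test:
    absolute difference greater than tolerance)."""
    usable = len(upstream) - travel_time
    return [
        abs(upstream[t] - downstream[t + travel_time]) > tolerance
        for t in range(usable)
    ]


def dominant_flow_anomalies(flags, min_fraction, min_length=2):
    """Return maximal windows (start, end), inclusive and 0-based, in which
    the fraction of flagged instants exceeds min_fraction."""
    n = len(flags)
    prefix = [0]  # prefix[i] = number of flagged instants before index i
    for flag in flags:
        prefix.append(prefix[-1] + int(flag))
    qualifying = []
    for start in range(n):
        for end in range(start + min_length - 1, n):
            length = end - start + 1
            hits = prefix[end + 1] - prefix[start]
            if hits > min_fraction * length:
                qualifying.append((start, end))
    # Keep only windows that are not contained in another qualifying window.
    return [
        w
        for w in qualifying
        if not any(o != w and o[0] <= w[0] and w[1] <= o[1] for o in qualifying)
    ]


# Hypothetical readings at 10 time instants, constant travel time of 1.
upstream = [10.0] * 10
downstream = [10.0, 20.0, 20.0, 20.0, 10.0, 10.0, 20.0, 20.0, 20.0, 20.0]

flags = mismatch_flags(upstream, downstream, travel_time=1, tolerance=1.0)
anomalies = dominant_flow_anomalies(flags, min_fraction=0.8)
print(anomalies)
```

With 1-based time instants, the two returned windows correspond to periods 1–3 and 6–9 at the upstream sensor. The enumeration is quadratic in the number of time instants, which hints at why mining flow anomalies at scale is computationally expensive.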
Identifying teleconnected flow events is computationally hard because of the large number of time instants of measurement, sensors, and locations. For example, a well-known teleconnected event pair involves the warming of the eastern Pacific region (i.e., El Niño) and unusual weather patterns throughout the world. Recently, a technique called RAD (Relationship Analysis of Dynamic-neighborhoods) has been proposed that models flow networks to identify teleconnected events.
Mixed-drove co-occurrence patterns: Mixed-drove spatio-temporal co-occurrence patterns (MDCOPs) represent subsets of two or more different object-types whose instances are often located in spatial and temporal proximity. Discovering MDCOPs is potentially useful for identifying tactics in battlefields and games, understanding predator–prey interactions, and transportation (road and network) planning. However, mining MDCOPs is computationally very expensive because the interest measures are computationally complex, the data sets are larger owing to the archival history, and the set of candidate patterns is exponential in the number of object-types. Preliminary work has produced a monotonic composite interest measure for discovering MDCOPs and novel MDCOP mining algorithms.
Cascading spatio-temporal patterns: Cascading spatio-temporal patterns (CSTPs) are partially ordered subsets of event-types whose instances are located together and occur in stages; an example is shown in Figure 7. These are interesting partially ordered patterns that were discovered in real spatio-temporal crime data sets from the city of Lincoln, Nebraska. In the domain of public safety, events such as bar closings and football games are considered generators of crime. Preliminary analysis revealed that football games and bar closing events do indeed generate CSTPs.
CSTP discovery can play an important role in disaster planning, climate change science (e.g., understanding the effects of climate change and global warming), and public health (e.g., tracking the emergence, spread, and re-emergence of multiple infectious diseases).

Fig. 7: Example of a cascading spatio-temporal pattern in public safety