Spatial Databases: Lecture 7
Spatial Statistics
DT249-4 DT228-4 Semester 2 2010
Pat Browne

Outline
• Statistical spatial data
• Review of standard statistical concepts
• Unique features of spatial data statistics
• Spatial autocorrelation
• Spatial regression (SR) and geographically weighted regression (GWR)
• Data mining
• Association rules
• Co-location

Statistical Spatial Data
In this lecture we consider spatial data that contains an attribute, e.g. house prices, occurrences of disease, occurrences of accidents, crop yield, poverty patterns, crime rates, etc.
Earlier parts of the course covered the representation of physical objects such as houses, counties, and roads. These objects were arranged by theme.
Here we consider attributes of those objects, e.g. the population of an ED.

Definitions
Spatial statistics is the statistical study of spatial data that varies over discrete space, e.g. crime rates broken down by neighbourhood. Spatial statistical models can be used for estimation, description, and prediction based on probability theory (not covered).
Geostatistics is the statistical study of spatial data sets that vary over continuous space, e.g. soil quality. Interpolation and prediction techniques include kriging and variograms (not covered on this course).

Standard statistical concepts: Independent Events
Two events A and B are statistically independent if the probability that they both occur is equal to the product of the probabilities of the two individual events, i.e. P(AB) = P(A) P(B).
This is equivalent to saying that learning that one event occurred gives no information about whether the other event occurred too.

Standard statistical concepts: Identically Distributed
Two events A and B are identically distributed if P(A) = P(B), i.e. they have the same probability distribution.
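The product rule for independence can be checked exactly on a small sample space. The sketch below is an illustration only: the two-dice events and the helper names `prob` and `independent` are assumptions introduced here, not part of the lecture.

```python
from fractions import Fraction
from itertools import product

# Sample space: the 36 equally likely outcomes of rolling two fair dice.
outcomes = list(product(range(1, 7), repeat=2))

def prob(event):
    """Exact probability of an event given as a predicate on (d1, d2)."""
    favourable = sum(1 for o in outcomes if event(o))
    return Fraction(favourable, len(outcomes))

def independent(a, b):
    """True when P(AB) equals P(A) * P(B)."""
    return prob(lambda o: a(o) and b(o)) == prob(a) * prob(b)

# Hypothetical events chosen for illustration.
first_is_even = lambda o: o[0] % 2 == 0      # event A
sum_is_seven = lambda o: o[0] + o[1] == 7    # event B
sum_is_eight = lambda o: o[0] + o[1] == 8    # event C

print(independent(first_is_even, sum_is_seven))  # A and B satisfy the product rule
print(independent(first_is_even, sum_is_eight))  # A and C do not
```

Knowing that the first die is even says nothing about whether the total is seven, but it does change the chance that the total is eight, which is the informal reading of independence given above.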
Standard statistical concepts: Identically Distributed variable
Identically distributed variables have the same probability distribution.

Standard statistical concepts: i.i.d
A collection of two or more random variables {X1, X2, …, Xn} is independent and identically distributed if the variables have the same probability distribution and are independent.

Standard statistical concepts: Examples
Example of i.i.d: all other things being equal, a sequence of dice rolls is i.i.d.
Example of non-i.i.d: bird nesting patterns in wetlands, where the independent variables are distance from water, length of grass and depth of water, and the dependent variable is the presence of a nest site. A uniform distribution of these variables on a map would indicate an even distribution; however, a more complex pattern emerges where the variables are spatially dependent.

Standard statistical concepts: Correlation
Correlation: a correlation is a single number that describes the degree of relationship between two normally distributed variables. The variables are not designated as dependent or independent.
The value of a correlation coefficient can vary from minus one to plus one. A minus one indicates a perfect negative correlation, while a plus one indicates a perfect positive correlation. A correlation of zero means there is no linear relationship between the two variables.
When there is a negative correlation between two variables, as the value of one variable increases, the value of the other variable decreases, and vice versa.

Standard statistical concepts: Variance and covariance
Variance is a measure of variation equal to the mean of the squared deviations from the mean. It measures the amount of variation within the values of a variable, taking account of all possible values and their probabilities or weightings.
Covariance is a measure of the variation between variables, say X and Y. The range of covariance values is unrestricted. However, if the X and Y variables are first standardized, then covariance is the same as correlation and the range of covariance (correlation) values is from -1 to +1.

Standard statistical concepts: Correlation
Correlation is a measure of the degree of linear relationship between two variables, say X and Y.
While in regression the emphasis is on predicting one variable from the other, in correlation the emphasis is on the degree to which a linear model may describe the relationship between two variables.
In regression the interest is directional: one variable is predicted and the other is the predictor. In correlation the interest is nondirectional: the relationship is the critical aspect.
The correlation coefficient may take on any value between minus one and plus one (-1 ≤ r ≤ +1).

Standard statistical concepts: Regression
Regression takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behaviour.
It works well with continuous quantitative data like weight, speed or age.
It is not good for categorical data where order is not significant, like colour, name, gender, nest/no nest.
Example: plotting snowfall against height above sea level.

Standard statistical concepts: Regression
Y = A + BX
The response variable is Y, and X is the continuous explanatory variable.
Parameter A is the intercept.
Parameter B is the slope.
The difference between each data point and the value predicted by the line (the model) is called a residual.

Standard statistical concepts: Null hypothesis
The null hypothesis, H0, represents a theory that has been put forward, either because it is believed to be true or because it is to be used as a basis for argument, but has not been proved. For example, in a clinical trial of a new drug, the null hypothesis might be that the new drug is no better, on average, than the current drug.
H0: there is no difference between the two drugs on average.
In general, the null hypothesis for spatial data is that either the features themselves or the values associated with those features are randomly distributed (i.e. there is no spatial pattern or bias).

Relation of i.i.d., regression, and correlation with spatial phenomena
The first law of geography according to Waldo Tobler is "Everything is related to everything else, but near things are more related than distant things."
In statistical terms this is called autocorrelation. The traditional i.i.d. assumption is not valid for spatially dependent variables (e.g. temperature or crime rate), so we need special techniques to handle this type of data (e.g. Moran’s I). These techniques usually involve a weight matrix which contains location information.
The non-i.i.d. nature of spatially dependent variables carries over into regression and correlation, which require spatial weights.

Relation of i.i.d., regression, and correlation with spatial database
Spatial databases are used for spatial data mining, which includes statistical techniques and more specialised DM techniques such as association rules. In this case the data mining algorithms need to have a spatial context. We must explicitly include location information where previously, with the i.i.d. assumption, it was not required.
Typical generic data mining activities such as clustering, regression, classification and association rules all need a spatial context.
Spatial DM is used in a broad range of scientific disciplines, such as analysis of crime, modelling land prices, poverty mapping, epidemiology, air pollution and health, natural and environmental sciences, etc. The analyst must be aware of the special techniques required for SDM.

Relation of i.i.d., regression, and correlation with spatial database
Spatial databases are also used for pure statistical research (e.g. environmental studies). Those variables that are spatially dependent (e.g. the pH of the soil) need to be clearly identified and special techniques applied to take into account their spatial bias.

Unique features of spatial data Statistics
General statistics assumes the samples are independently generated, which may not be the case with spatially dependent data.
Like things tend to cluster together.
Change tends to be gradual over space.

Unique features of spatial data Statistics: Spatial dependent values
The previous maps illustrate two important features of spatial data:
• Spatial autocorrelation: values are not independent. Recall that independence requires the probability that two events both occur to equal the product of their individual probabilities, i.e. P(AB) = P(A) P(B).
• Spatial data is not identically distributed. Recall that two events A and B are identically distributed if P(A) = P(B), i.e. they have the same probability distribution.

Unique features of spatial data Statistics: Autocorrelation & Spatial Heterogeneity
Spatial autocorrelation is detected when the value of a variable in a location is correlated with values of the same variable in the neighbourhood (it can be measured with Moran’s I).
Spatial heterogeneity is characterized by different values or behaviours through space, which can be measured by Local Indicators of Spatial Association (LISA). It characterizes the non-stationarity of most geographic processes, meaning that global parameters may not accurately reflect the process occurring at a particular location.

Spatial Autocorrelation
Autocorrelation: degree of correlation between neighbouring values.
Spatial dependency: neighbouring values are similar (i.e. positive spatial autocorrelation).
Moran’s I enables assessment of the degree to which values tend to be similar to neighbouring values.
We can observe how autocorrelation varies with distance.
The Moran scatter plot relates individual values to weighted averages of neighbouring values. The slope of a regression line fitted to the points in the scatter plot gives the global Moran’s I.
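The link between the Moran scatter plot and the global statistic can be sketched numerically. The example below is a minimal illustration under stated assumptions: the six zones arranged along a line, their values, and the function names are invented here, and the weights are row-normalised (each row sums to 1), which is the case in which the fitted slope and Moran’s I coincide.

```python
import numpy as np

def row_normalise(adjacency):
    """Divide each row of a binary adjacency matrix by its number of neighbours."""
    a = np.asarray(adjacency, dtype=float)
    return a / a.sum(axis=1, keepdims=True)

def morans_i(values, w):
    """Global Moran's I for a row-normalised weight matrix w."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()                      # deviations from the mean
    return float(z @ w @ z / (z @ z))

def moran_scatter_slope(values, w):
    """Slope of the line fitted to (z, Wz), the Moran scatter plot."""
    z = np.asarray(values, dtype=float)
    z = z - z.mean()
    lag = w @ z                           # weighted average of neighbouring deviations
    slope, _intercept = np.polyfit(z, lag, 1)
    return float(slope)

# Hypothetical data: six zones in a row, each adjacent to the previous and next zone.
adjacency = np.eye(6, k=1) + np.eye(6, k=-1)
w = row_normalise(adjacency)
values = [1, 2, 2, 8, 9, 9]               # low values at one end, high at the other

print(morans_i(values, w))
print(moran_scatter_slope(values, w))
```

Both printed numbers agree and are positive for this clustered arrangement, which is what the scatter plot interpretation above predicts.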
Spatial Autocorrelation: Moran’s I
Moran’s I measures the average correlation between the value of a variable at one location and the value at nearby locations.
The essential idea is to specify pairs of locations that influence each other, along with the relative intensity of interaction.
Moran’s I provides a global view of spatial autocorrelation. We will look at details later.
The range of the Moran’s I statistic depends on the spatial weight matrix. When Moran’s I is scaled by its bounds, the statistic is restricted to the range ±1.

Spatial Autocorrelation: Case Study
[Figure: maps of nest locations, distance to open water, vegetation durability and water depth]

Spatial Autocorrelation
Classical statistical assumptions (i.i.d) do not hold for spatially dependent data.

Unique features of spatial data Statistics: First Law of Geography
First law of geography [Tobler]: Everything is related to everything, but nearby things are more related than distant things.
People with similar backgrounds tend to live in the same area.
Economies of nearby regions tend to be similar.
Changes in temperature occur gradually over space (and time), e.g. equator versus poles.

Spatial Autocorrelation: Moran’s I example
Figure 7.5, p. 190
• The pixel value sets in (b) and (c) are the same but their Moran’s I values are different.
• Q? Which dataset, (b) or (c), has higher spatial autocorrelation?

Spatial Autocorrelation: Moran Scatterplot Map
[Figure: Moran scatterplot map for São Paulo, old-aged population. Horizontal axis z, vertical axis Wz; quadrants Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH.]

Spatial Heterogeneity
Spatial heterogeneity: is there such a thing as an average place with respect to some property (e.g. vegetation)?
It is difficult to imagine any subset of the Earth’s surface being a representative sample of the whole.
GWR (later) addresses the localness of spatial data.
Neighbourhood relationship contiguity matrix
[Figure: neighbourhood relationships and the corresponding contiguity matrix]

Spatial autocorrelation
Spatial autocorrelation is determined both by similarities in position and by similarities in attributes.
• Sampling interval
• Self-similarity
• Auto = self
• Correlation = degree of relative correspondence

Spatial autocorrelation
In the following slide, each diagram contains 32 white cells and 32 blue cells = 64 cells.
BB = Blue beside Blue
BW = Blue beside White
WW = White beside White

Spatial autocorrelation
[Figure: join-count diagrams ranging from negative spatial autocorrelation (dispersed), through spatial independence, to positive spatial autocorrelation (spatial clustering)]

Spatial regression (SR)
Spatial regression (SR) is a global spatial modelling technique in which spatial autocorrelation among the regression parameters is taken into account.
SR is usually performed for spatial data obtained from spatial zones or areas. The basic aim in SR modelling is to establish the relationship between a dependent variable measured over a spatial zone and other attributes of the spatial zone, for a given study area, where the spatial zones are subsets of the study area.
While SR is known as a modelling method in the spatial data analysis literature, in the spatial data-mining literature it is considered to be a classification technique.

Geographically weighted regression (GWR)
Geographically weighted regression (GWR) is a powerful exploratory method in spatial data analysis. It serves for detecting local variations in spatial behaviour and understanding local details, which may be masked by global regression models.
Unlike SR, where a regression coefficient for each independent variable and the intercept are obtained for the whole study region, in GWR regression coefficients are computed for every spatial zone.
Therefore, the regression coefficients can be mapped and the appropriateness of the stationarity assumption in conventional regression analyses can be checked.
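The idea of computing a separate set of coefficients at every location can be sketched as a weighted least-squares fit repeated once per location. This is only a sketch under stated assumptions: the Gaussian distance-decay kernel, the bandwidth of 10, the choice of Var 2 as the response and Var 1 as the explanatory variable, and the function names are all assumptions made here for illustration. The coordinates and variables are the nine observations tabulated later in this lecture.

```python
import numpy as np

def gaussian_weights(coords, target, bandwidth):
    """Geographic weights that decay with distance from the target location (assumed kernel)."""
    d = np.linalg.norm(coords - target, axis=1)
    return np.exp(-0.5 * (d / bandwidth) ** 2)

def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (intercept, slope)."""
    X = np.column_stack([np.ones_like(x), x])
    XtW = X.T * w                          # scale each observation by its weight
    a, b = np.linalg.solve(XtW @ X, XtW @ y)
    return float(a), float(b)

def gwr(coords, x, y, bandwidth):
    """Visit every location and fit a locally weighted regression there."""
    coords = np.asarray(coords, dtype=float)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return np.array([
        weighted_fit(x, y, gaussian_weights(coords, p, bandwidth))
        for p in coords
    ])

# The nine observations from the lecture's GWR table.
coords = [(25, 45), (25, 44), (21, 48), (27, 52), (16, 31),
          (42, 35), (9, 65), (29, 76), (61, 66)]
var1 = [12, 34, 32, 12, 11, 14, 56, 75, 43]
var2 = [6, 52, 41, 25, 22, 9, 43, 67, 32]

# One (intercept, slope) row per location; bandwidth 10 is an assumed value.
local_coefficients = gwr(coords, var1, var2, bandwidth=10.0)
print(local_coefficients)
```

Mapping the second column of `local_coefficients` would show how the slope varies from place to place, which a single global regression cannot reveal.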
Geographically weighted regression (GWR)
GWR is an effective technique for exploring spatial nonstationarity, which is characterized by changes in relationships across the study region leading to varying relations between dependent and independent variables.
Hence the need for a better understanding of spatial processes has led to the emergence of local modelling techniques.
GWR has been implemented in various disciplines such as the natural, environmental, social and earth sciences.

Exploring spatial patterning in spatial data values
Two issues:
1. How do variables change from place to place? Is a zone similar to its neighbours?
2. How are variables related? How does the relationship between rainfall and altitude vary from place to place?

Local Statistics: moving window
Geographical weights
• Binary: rook or queen neighbours
• Distance based
• Boundary or perimeter based
• Weights can be row-normalized using the number of adjacent cells

Local Univariate measures: moving window
Standard univariate statistics can be computed for a moving window, supplying the degree and nature of variation in summary statistics across a region of interest (e.g. we could compute the standard deviation for several windows and assess the degree of variability from place to place).
Geographical weighting schemes can be used for the calculation of local statistics.

Local spatial autocorrelation
Global statistics such as Moran’s I can mask local spatial structure.
The local Moran can be used to measure local spatial autocorrelation.
Only if there is little or no variation in the local observations do the global observations provide any reliable information on the local areas within the study area. As the spatial variation of the local observations increases, the reliability of the global observation as representative of local conditions decreases.
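A local Moran value for each cell of a small grid can be computed with row-normalized rook or queen weights as described above. The sketch below is illustrative: the function names are assumptions, the variance uses the n - 1 divisor (which reproduces the 21.861 quoted in this lecture's worked example), and the 3-by-3 grid is the one from that worked example with 42 at the centre.

```python
import numpy as np

def contiguity(nrows, ncols, queen=False):
    """Row-normalised rook (4-neighbour) or queen (8-neighbour) weights for a grid."""
    steps = [(-1, 0), (1, 0), (0, -1), (0, 1)]
    if queen:
        steps += [(-1, -1), (-1, 1), (1, -1), (1, 1)]
    n = nrows * ncols
    w = np.zeros((n, n))
    for r in range(nrows):
        for c in range(ncols):
            for dr, dc in steps:
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrows and 0 <= cc < ncols:
                    w[r * ncols + c, rr * ncols + cc] = 1.0
    return w / w.sum(axis=1, keepdims=True)   # each row sums to 1

def local_moran(values, w):
    """Local Moran I_i = (z_i / s^2) * sum_j w_ij z_j for every cell."""
    x = np.asarray(values, dtype=float).ravel()
    z = x - x.mean()
    return z / x.var(ddof=1) * (w @ z)

grid = [[45, 43, 38],
        [44, 42, 32],
        [44, 39, 34]]

rook = local_moran(grid, contiguity(3, 3))
queen = local_moran(grid, contiguity(3, 3, queen=True))
print(round(rook[4], 3))    # centre cell with rook weights, about -0.053
print(round(queen[4], 3))   # centre cell with queen weights
```

Comparing the rook and queen results for the same cell is a quick way to see how sensitive a local statistic is to the choice of neighbourhood.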
Local spatial autocorrelation
The weights could be based on rook, queen, distance or perimeter, and normalized by the number of neighbours (see slide 28).

Local spatial autocorrelation
[Figure: Map A and Map B, with a legend running from negative spatial autocorrelation (dispersed), through spatial independence, to positive spatial autocorrelation (spatial clustering)]
Map A and Map B each represent a distinct geographic region. The number in each region (cell) represents the number of leukaemia cases in that region. These two sets of values have the same mean and standard deviation. In contrast, Moran’s I for the data on Map A is -0.269, and 0.041 for the data on Map B.
They differ because the values in the regions have a different spatial arrangement. The contiguity (or weight) matrix used by the Moran’s I calculation will be different and hence we get a different result.
A visual inspection of both maps suggests that A has negative autocorrelation (negative Moran’s I): the neighbouring values tend to be dissimilar, so no clustering of like values is suggested. B has little autocorrelation because its Moran’s I is near zero.

Spatial autocorrelation
[Figure: Grid A and Grid B, with the same legend from negative (dispersed) through spatial independence to positive (spatial clustering)]
The grids A and B represent two different spatial resolutions over the same area. Grid A contains 16 cells and Grid B contains 64 cells.
The strength of spatial autocorrelation is often a function of scale or spatial resolution, as illustrated above using black and white cells. High negative spatial autocorrelation is exhibited in A, since each cell has a different colour from its neighbouring cells.
In B each cell is subdivided into four half-size cells, assuming the cell’s homogeneity. The strength of spatial autocorrelation among the black and white cells then increases, while the same cell arrangement is maintained.
This illustrates that spatial autocorrelation varies with the study scale: its strength increases from the 4-by-4 case to the 8-by-8 case.
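The scale effect described above can be reproduced with a checkerboard. The following sketch assumes rook neighbours with row-normalized weights and uses a hypothetical helper `morans_i_rook`; the 4-by-4 checkerboard stands in for Grid A and its subdivided 8-by-8 version for Grid B.

```python
import numpy as np

def morans_i_rook(grid):
    """Global Moran's I on a regular grid with row-normalised rook weights."""
    z = np.asarray(grid, dtype=float)
    z = z - z.mean()
    total = np.zeros_like(z)    # sum of neighbouring deviations
    count = np.zeros_like(z)    # number of rook neighbours of each cell
    total[1:, :] += z[:-1, :]
    count[1:, :] += 1           # neighbour above
    total[:-1, :] += z[1:, :]
    count[:-1, :] += 1          # neighbour below
    total[:, 1:] += z[:, :-1]
    count[:, 1:] += 1           # neighbour to the left
    total[:, :-1] += z[:, 1:]
    count[:, :-1] += 1          # neighbour to the right
    lag = total / count         # average deviation of the neighbours
    return float((z * lag).sum() / (z * z).sum())

# Grid A: 4-by-4 checkerboard of 0 (white) and 1 (black).
coarse = np.indices((4, 4)).sum(axis=0) % 2
# Grid B: every cell split into four half-size cells of the same colour.
fine = np.kron(coarse, np.ones((2, 2), dtype=int))

print(morans_i_rook(coarse))   # -1.0: every neighbour has the opposite colour
print(morans_i_rook(fine))     # 0.1875: half-size cells now have like neighbours
```

The arrangement of colours on the ground is identical in both grids; only the resolution changes, yet the statistic moves from perfect negative autocorrelation to a weakly positive value.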
Calculate local Moran I for central cell (42)
The local statistic for cell i is $I_i = \frac{z_i}{s^2} \sum_j w_{ij} z_j$, where $z_i = x_i - \bar{x}$ and $s^2$ is the variance.

Original data:
45   43   38
44   42   32
44   39   34

Values, differences from the mean, and rook weights standardized to sum to 1:

x_j    z_j       w_ij     w_ij z_j
45      4.889    0.000     0.000
43      2.889    0.250     0.722
38     -2.111    0.000     0.000
44      3.889    0.250     0.972
42      1.889    0.000     0.000
32     -8.111    0.250    -2.028
44      3.889    0.000     0.000
39     -1.111    0.250    -0.278
34     -6.111    0.000     0.000
sum              1.000    -0.611

Mean = 40.111, variance = 21.861
I_i = (1.889 / 21.861) × (-0.611) = -0.053
The low negative value indicates that neighbouring values tend to be dissimilar.

[Figure: map with global Moran’s I = 0.665. Local I shows large positive values in rural areas and is more patchy around Belfast.]

Spatial Regression
The assumption of i.i.d. underlying ordinary least squares regression rarely holds for spatial data.
There are several techniques that handle the spatial case:
• Moving window regression
• Geographically Weighted Regression (GWR)
We will look at GWR.

Geographically Weighted Regression (GWR)
The steps are:
1. Go to a location.
2. Conduct a regression using the raw data and a geographic weighting scheme.
3. Move to the next location and go back to step 2 until all locations have been visited.
The output is a set of regression coefficients (e.g. slope and intercept) at each location.

Coordinates of observations, variables, distance from the first observation, and geographic weights:

point   x    y    Var 1   Var 2   dist   Geo w
1       25   45   12      6       0      1
2       25   44   34      52      1      0.995
3       21   48   32      41      5      0.8825
4       27   52   12      25      8      0.7261
5       16   31   11      22      16     0.278
6       42   35   14      9       20     0.0889
7       9    65   56      43      26     0.034
8       29   76   75      67      32     0.006
9       61   66   43      32      42     0.0002

[Figure: location of the points in the previous table]

[Figure: regression using the previous table and locations] The geographic weighting pulls the line towards the points with larger weights.

Summary of spatial stats
Moran’s I measures the average correlation between the value of a variable at one location and the value at nearby locations.
The local Moran statistic measures spatial dependence on a local basis, allowing the researcher to see its variation over space.
Geographically Weighted Regression allows the parameters of a regression analysis to vary spatially. GWR helps in detecting local variations in spatial behaviour and understanding local details, which may be masked by global regression models. In GWR, regression coefficients are computed for every spatial zone.

[Figure: two scatter plots and fitted lines for different aggregations of the same values. © Oxford University Press, 2010. All rights reserved. Lloyd: Spatial Data Analysis]

Moran’s I
A contiguity matrix may represent a neighbourhood relationship defined using adjacency or Euclidean distance.
Several definitions of adjacency exist, including a four-neighbourhood and an eight-neighbourhood.
Given a gridded spatial framework, a four-neighbourhood assumes that a pair of locations influence each other if they share an edge (rook).
An eight-neighbourhood assumes that a pair of locations influence each other if they share either an edge or a vertex (queen).

Moran’s I
• Using a normalised weight matrix, the values of I range from -1 to 1.
• Value = 1: perfect positive correlation
• Value = 0: no autocorrelation
• Value = -1: perfect negative correlation
• A Moran’s I may appear low (say 0.17) yet be statistically significant; the pattern is clustered since the index is above 0.

Moran’s I
• Global Moran’s I
  • What is the extent of clustering in the total area?
  • Is this clustering significantly different from a random spatial distribution?
• Local Moran’s I
  • Do local clusters (high-high or low-low) or local spatial outliers (high-low or low-high) exist?
  • Are these local clusters and spatial outliers statistically significant?

Moran’s I: A measure of spatial autocorrelation
Given $x = \{x_1, \ldots, x_n\}$ sampled over n locations, Moran’s I is defined as
$I = \frac{z W z^{t}}{z z^{t}}$
where $z = \{x_1 - \bar{x}, \ldots, x_n - \bar{x}\}$ and W is a normalized contiguity matrix.
Fig. 7.5, p. 190

Second Law of Geography
Second law of geography: spatial heterogeneity [Goodchild].
Spatial heterogeneity describes geographic variation in the constants or parameters of relationships.
When it is present, the outcome of an analysis depends on the area over which the analysis is made.
Spatial heterogeneity depends on the spatial resolution.
A global model might be inconsistent with respect to a regional model (or models).

Second Law of Geography
Spatial heterogeneity definitions:
• Quantitative information characterizing the ground spatial structure.
• The spatial variance distribution of the variable considered, within the coarse sample resolution (e.g. pixel or grid).
• The patterning or patchiness in important landscape properties such as vegetation cover.
Second Law of Geography
Spatial heterogeneity has been quantified from remote sensing images by using two basic approaches:
(a) the direct image approach, where straight reflectance or reflectance indices of remote sensing images are used to quantify spatial heterogeneity, using the original pixel size of the image;
(b) the cartographic or patch mosaic approach, where the image is subdivided into homogeneous mapping units through classification.

Second Law of Geography
Suppose there is a relationship between the number of AIDS cases and the number of people living in an area.
The form of this relationship will vary spatially:
• in some areas the number of cases per capita will be higher than in others;
• we could map the constant of proportionality.
Spatial heterogeneity describes this geographic variation in the constants or parameters of relationships.
When it is present, the outcome of an analysis depends on the area over which the analysis is made. Often this area is arbitrarily determined by a map boundary or political jurisdiction.

Second Law of Geography
Second law of geography [Goodchild]: spatial heterogeneity.
A global model is often inconsistent with regional models (e.g. the average does not hold anywhere).

How to decide the weight wij?
The weight indicates the spatial interaction between entities.
1) Binary wij, also called absolute adjacency. This covers the general case, answering the question: is a value in a region similar to or different from its neighbours?
wij = 1 if two geographic entities are adjacent; otherwise, wij = 0.
The adjacency definition may be queen (8) or rook (4).

How to decide the weight wij?
The weight indicates the spatial interaction between entities.
2) The distance between geographic entities. Often the inverse distance is used: farther objects get less weight and nearer objects get more weight, e.g. distance from the centre of an epidemic.
wij = f(dist(i,j)), where dist(i,j) is the distance between i and j.
3) The length of common boundary for area entities, e.g. policing borders, where shorter borders get less weight.
wij = f(leng(i,j)), where leng(i,j) is the length of common boundary between i and j.

How to decide the weight wij?
The choice of weights should ultimately be driven by a rationale for including as neighbours those areas that have a spatial effect on a given location.
This rationale can be derived from theory or be the result of using ESDA to experiment with different weights and connectivity orders.
Since weights matrices are used to create spatial lags that average neighbouring values, the choice of a weights matrix determines which neighbouring values will be averaged.
For instance, since rook weights will usually have fewer neighbours than queen weights, on average each neighbouring observation has more influence.

How to decide the weight wij?
The question of which weights to choose is more pertinent in the context of modelling than ESDA, since modelling is based on substantive notions of spatial effects while ESDA prioritizes the rejection of spatial randomness.
Therefore, if there are no substantive reasons to guide the choice of weights in ESDA, using a weights file with as few neighbours as possible (such as rook) makes sense.
Especially with irregular areal units (as opposed to grids), the difference between rook and queen weights is often minimal.
However, it is advisable to test how sensitive your results are to your weights specification by comparing multiple weights matrices.

Spatial Outlier Detection
Global outliers are observations which appear inconsistent with the remainder of the data set. Global outliers deviate so much from other observations that they may have been generated by a different mechanism.
Spatial outliers are observations that appear inconsistent with their neighbours.

Spatial Outlier Detection
Detecting spatial outliers has important applications in transportation, ecology, public safety, public health, climatology and location based services.
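One way to make "inconsistent with their neighbours" operational is to compare each value with the average of its neighbours and flag unusually large differences. The sketch below is an assumption-laden illustration, not the lecture's method: the threshold of 2 standard deviations, the function name, and the ten sites along a line are all invented here.

```python
import numpy as np

def spatial_outliers(values, neighbours, threshold=2.0):
    """Indices whose difference from their neighbourhood mean is unusually large.

    neighbours[i] lists the indices of the sites adjacent to site i.
    """
    values = np.asarray(values, dtype=float)
    diff = np.array([
        values[i] - values[list(nb)].mean()
        for i, nb in enumerate(neighbours)
    ])
    score = (diff - diff.mean()) / diff.std()   # standardised differences
    return [i for i, s in enumerate(score) if abs(s) > threshold]

# Hypothetical readings at ten sites along a line, rising steadily except at site 5.
values = [10, 20, 30, 40, 50, 20, 70, 80, 90, 100]
neighbours = [[j for j in (i - 1, i + 1) if 0 <= j < len(values)]
              for i in range(len(values))]

print(spatial_outliers(values, neighbours))     # flags site 5 only
```

The flagged value of 20 lies well inside the overall range of the data, so it would not stand out as a global outlier; it is only unusual relative to its neighbours (50 and 70).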
Geographic objects have a spatial component (location, shape, metric and topological properties) and a non-spatial component (house owner, sensor id., soil type).

Spatial Outlier Detection
Spatial neighbourhoods may be defined using spatial attributes and spatial relations.
Comparisons between spatially referenced objects can be based on non-spatial attributes.
A spatial outlier is a spatially referenced object whose non-spatial attribute values differ from those of other spatially referenced objects in its spatial neighbourhood.

Data for Outlier detection
In the diagram on the left, G, P, S and Q show a big change in attribute for a small change in location.
The right-hand diagram shows a normal distribution (it corresponds to the attribute axis in the left diagram).

Spatial Outlier Detection
The upper left and lower right quadrants of figure 7.17 indicate a spatial association of dissimilar values: low values surrounded by high-value neighbours (P and Q) and high values surrounded by low values (S).

Spatial Outlier Detection
A Moran outlier is a point located in the upper left or lower right quadrant of a Moran scatter plot.
[Figure: Moran scatter plot. Horizontal axis z (values in a given location), vertical axis Wz; quadrants Q1 = HH, Q2 = LL, Q3 = HL, Q4 = LH.]

Model Evaluation
Consider the two-class classification problem ‘nest’ or ‘no-nest’. The four possible outcomes (or predictions) are shown on the next slide.
The desired predictions are:
1) where the model says there should be a nest and there is an actual nest (True Positive);
2) where the model says there is no nest and there is no nest (True Negative).
The other outcomes are not desirable and point to a flaw in the model.

Model Evaluation
[Figure: the four possible outcomes]

Spatial Statistical Models
A point process is a model for the spatial distribution of points in a point pattern. Examples: the position of trees in a forest, location of petrol stations in a city.
Actual real-world point patterns can be compared (using distance) with a randomly distributed point pattern.

Calculating the Local Moran I
[Figure: worked calculation, where the variance = 667.32 and the mean = 55.82 are taken from the entire population]

Calculating the Global Moran I
[Figure: worked calculation]

Statistics versus Data Mining
Do we know the statistical properties of the data? Is the data spatially clustered, dispersed, or random?
Data mining is strongly related to statistical analysis.
Data mining can be seen as a filter (exploratory data analysis) before applying a rigorous statistical tool.
Data mining generates hypotheses that are then verified.
The filtering process does not guarantee completeness (wrong elimination or missing data).

"Drowning in Data yet Starving for Knowledge"

Data Mining: Outline
• Background to data mining and spatial data mining
• The data mining process
• Spatial autocorrelation, i.e. the non-independence of phenomena in a contiguous geographic area
• Spatial independence
• Classical data mining concepts: classification, clustering, association rules
• Spatial data mining, e.g. co-location rules
• Summary

Data Mining
Data mining is the process of discovering interesting and potentially useful patterns of information embedded in large databases.
Spatial data mining has the same goals as conventional data mining but requires additional techniques that are tailored to the spatial domain.
A key goal of spatial data mining is to partially automate knowledge discovery, i.e. to search for “nuggets” of information embedded in very large quantities of spatial data.

Data Mining
Data mining lies at the intersection of database management, statistics, machine learning and artificial intelligence. DM provides semi-automatic techniques for discovering unexpected patterns in very large data sets.
We must distinguish between operational systems (e.g. bank account transactions) and decision support systems (e.g. data mining).

Data Mining
Spatial DM can be characterised by Tobler’s first law of geography (near things tend to be more related than far things).
This means that the standard DM assumption that values are independently and identically distributed does not hold in spatially dependent data (SDD).
The term spatial autocorrelation captures this property, which needs to be included in DM techniques.

Data Mining
The important techniques in conventional DM are association rules, clustering, classification, and regression.
These techniques need to be modified for spatial DM.
Two approaches are used when adapting DM techniques to the spatial domain:
1) Correct the underlying (i.i.d.) statistical model.
2) Modify the objective function which drives the search to include a spatial term.

Data Mining
Size of spatial data sets:
• NASA’s Earth Orbiting Satellites capture about a terabyte (10^12 bytes) a day; YouTube 2008 = 6 terabytes.
• Environmental agencies, utilities (e.g. ESB), the Central Statistics Office, government departments such as health/agriculture, and local authorities all have large spatial data sets.
It is very difficult to analyse such large data sets manually.
For examples see Chapter 7 of SDT.

Data Mining: Sub-processes
Data mining involves many sub-processes:
• Data collection: usually the data was collected as part of the operational activities of an organization, not for the data mining task. It is unlikely that the data mining requirements were considered during data collection.
• Data extraction/cleaning: hence data must be extracted and cleaned for the specific data mining task.

Data Mining: Sub-processes
• Feature selection.
• Algorithm design.
• Analysis of output.
• The level of aggregation at which the data is being analysed must be decided. Identical experiments at different levels of scale can sometimes lead to contradictory results (e.g. the choice of basic spatial unit can influence the results of a social survey).
Geographic Data Mining Process
Close interaction between the domain expert and the data-mining analyst. The output consists of hypotheses (data patterns) which can be verified with statistical tools and visualised using a GIS. The analyst can interpret the patterns and recommend appropriate actions.

Unique features of spatial data mining
The difference between classical and spatial data mining parallels the difference between classical and spatial statistics. Statistics assumes the samples are independently generated, which is generally not the case with SDD. Like things tend to cluster together. Change tends to be gradual over space.

Non-Spatial Descriptive Data Mining
Descriptive analysis is an analysis that results in some description or summarization of data. It characterizes the properties of the data by discovering patterns in the data which would be difficult for a human analyst to identify by eye or by using standard statistical techniques. Description involves identifying rules or models that describe data. Both clustering and association rules are employed by supermarket chains.
Clustering (unsupervised learning) is a descriptive data mining technique. Clustering is the task of assigning cases to groups (clusters) so that the cases within a group are similar to each other and as different as possible from the cases in other groups. Clustering can identify groups of customers with similar buying patterns, and this knowledge can be used to help promote certain products.
Clustering can help locate crime 'hot spots' in a city.
Association rules. Association rule discovery (ARD) identifies the relationships within data. A rule can be expressed as a predicate in the form (IF x THEN y). ARD can identify product lines that are bought together in a single shopping trip by many customers, and this knowledge can be used by a supermarket chain to help decide on the layout of the product lines.

Non-Spatial Predictive Data Mining
Predictive DM results in some description or summarization of a sample of data which predicts the form of unobserved data. Prediction involves building a set of rules or a model that will enable unknown or future values of a variable to be predicted from known values of another variable.
Classification is a predictive data mining technique. Classification is the task of finding a model that maps (classifies) each case into one of several predefined classes. Classification is used in risk assessment in the insurance industry.
Regression analysis is a predictive data mining technique that uses a model to predict a value. Regression can be used to predict sales of new product lines based on advertising expenditure.

Case Study
Data from 1995 and 1996 concerning two wetlands on the shores of Lake Erie, USA. Using this information we want to predict the spatial distribution of a marsh-breeding bird called the red-winged blackbird. Where will they build nests? What conditions do they favour? A uniform grid (pixel = 5 square metres) was superimposed on the wetland. Seven attributes were recorded. See the link to Spatial Databases: A Tour for details.

Case Study
The significance of three key variables was established with statistical analysis:
Vegetation durability
Distance to open water
Water depth
The spatial distribution is shown in 7.3.
Case Study
[Maps: nest locations, water depth, distance to open water, vegetation durability.]
Example showing different predictions: (a) the actual locations of nests; (b) pixels with actual nests; (c) locations predicted by one model; and (d) locations predicted by another model. Prediction (d) is spatially more accurate than (c). Classical statistical assumptions do not hold for spatially dependent data.

Case Study
The previous maps illustrate two important features of spatial data:
Spatial autocorrelation (the data is not independent).
Spatial data is not identically distributed. Two random variables are identically distributed if and only if they have the same probability distribution.

Why spatial DBs do not use classical DM
Rich data types (e.g. extended spatial objects).
Implicit spatial relationships among the variables.
Observations that are not independent.
Spatial autocorrelation exists among the features.

Classical Data Mining
Association rules: determination of the interaction between attributes. For example: X -> Y.
Classification: estimation of the attribute of an entity in terms of attribute values of another entity. Some applications are predicting locations (shopping centres, habitat, crime zones) and thematic classification (satellite images).
Clustering: unsupervised learning, where the classes and the number of classes are unknown. Uses a similarity criterion. Applications: clustering pixels from a satellite image on the basis of their spectral signature, identifying hot spots in crime analysis and disease tracking.
Regression: takes a numerical dataset and develops a mathematical formula that fits the data. The results can be used to predict future behaviour. Works well with continuous quantitative data like weight, speed or age. Not good for categorical data where order is not significant, like colour, name, gender, nest/no nest.

Determining the Interaction among Attributes
We wish to discover relationships between attributes of a relation.
is_close(house,beach) -> is_expensive(house)
low(vegetationDurability) -> high(stem density)
Associations and association rules are often used to select subsets of features for more rigorous statistical correlation analysis.

How does data mining differ from conventional methods of data analysis?
Using conventional data analysis, the human analyst formulates and refines the hypothesis. This approach to identifying patterns in data is known as hypothesis verification. For example: "Did the sales of cream increase when strawberries were available?"
Using data mining, the hypothesis is formulated and refined without human input. This approach, in which hypotheses are automatically formulated and refined, is known as hypothesis generation. Knowledge discovery is where the data mining tool formulates and refines the hypothesis by identifying patterns in the data. For example: "What are the factors that determine the sales of cream?"

Association rules
An association rule is a pattern that can be expressed as a predicate in the form (IF x THEN y), where x and y are conditions (about cases). It states that if x (the antecedent) occurs then, in most cases, so will y (the consequence). The antecedent may contain several conditions but the consequence usually contains only one term.

Association rules
Association rules need to be discovered. Rule discovery is a data mining technique that identifies relationships within data. In the non-spatial case rule discovery is usually employed to discover relationships within transactions or between transactions in operational data. The relative frequency with which an antecedent appears in a database is called its support.
High support is support at or above a chosen level, called the support threshold (say 70%), at which the relative frequency is considered significant.

Association rules
Example: market basket analysis is a form of association rule discovery that discovers relationships in the purchases made by a customer during a single shopping trip. An itemset in the context of market basket analysis is the set of items found in a customer's shopping basket.

Association rules & Spatial Domain
Differences with respect to the spatial domain:
1. The notion of transaction or case does not exist, since the data are immersed in a continuous space. Partitioning the space may introduce errors through overestimation or underestimation of confidences. The notion of transaction is replaced by neighbourhood.
2. The size of itemsets is smaller in the spatial domain. Thus, the cost of generating candidates is not a dominant factor. The enumeration of neighbours dominates the final computational cost.
3. In most cases, the spatial items are discrete versions of continuous variables.

Spatial Association Rules
Table 7.5 shows examples of association rules, support, and confidence that were discovered in the Darr 1995 wetland data.

Co-Location rules
Co-location rules attempt to generalise association rules to point collection data sets that are indexed by space. The co-location pattern discovery process finds frequently co-located subsets of spatial event types given a map of their locations; see Figure 7.12.

Co-location Examples
(a) Illustration of point spatial co-location patterns. Shapes represent different spatial feature types. Spatial features in sets {+, x} and {o, *} tend to be located together.
(b) Illustration of line string co-location patterns. Highways and frontage roads are co-located, e.g. Hwy100 is near the frontage road Normandale Road.
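The idea of finding feature types that tend to be located together can be sketched as follows. This is a minimal illustration only: the coordinates, the neighbourhood radius, and the 0.5 cut-off are all hypothetical, and the measure used (the fraction of instances of one type that have an instance of another type within the radius) is a simple stand-in assumed here, not necessarily the measure used in the course text.

```python
from itertools import permutations
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical map of point events: (feature type, x, y).
events = [
    ("+", 1.0, 1.0), ("x", 1.2, 1.1),
    ("+", 5.0, 5.0), ("x", 5.1, 4.8),
    ("o", 9.0, 1.0), ("*", 9.2, 1.3),
    ("o", 3.0, 8.0), ("*", 3.1, 8.2),
]

def colocation_fraction(events, a, b, radius):
    """Fraction of type-a instances with at least one type-b instance within radius."""
    a_pts = [(x, y) for t, x, y in events if t == a]
    b_pts = [(x, y) for t, x, y in events if t == b]
    if not a_pts:
        return 0.0
    hits = sum(1 for p in a_pts if any(dist(p, q) <= radius for q in b_pts))
    return hits / len(a_pts)

# Score every ordered pair of distinct feature types.
types = sorted({t for t, _, _ in events})
scores = {
    (a, b): colocation_fraction(events, a, b, radius=1.0)
    for a, b in permutations(types, 2)
}

# Keep the pairs whose score reaches the (assumed) cut-off of 0.5.
frequent = sorted(pair for pair, score in scores.items() if score >= 0.5)
print(frequent)
```

Note that the neighbourhood test (comparing every point with every other point) does most of the work, which matches the remark that enumerating neighbours dominates the computational cost in the spatial case.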
Two co-location patterns
[Figure and answers not reproduced.]

Spatial Association Rules
A spatial association rule is a rule indicating an association relationship among a set of spatial and possibly some non-spatial predicates. Spatial association rules (SPAR) are defined in terms of spatial predicates rather than items:
P1 ∧ P2 ∧ .. ∧ Pn -> Q1 ∧ .. ∧ Qm
where at least one of the terms (P or Q) is a spatial predicate. For example:
is(x, country) ∧ touches(x, Mediterranean) -> is(x, wine-exporter)

Co-location vs Association Rules
Transactions are disjoint while spatial co-location is not, so a substitute for the transaction is needed. Three main options:
1. Divide the space into areas and treat them as transactions.
2. Choose a reference point pattern and treat the neighbourhood of each of its points as a transaction.
3. Treat all point patterns as equal.

Co-location vs Association Rules
Spatial Association Rules Mining (SARM) is similar to the raster view in the sense that it tessellates a study region S into discrete groups based on spatial or aspatial predicates derived from concept hierarchies. For instance, a spatial predicate close_to(α, β) divides S into two groups: locations close to β and those that are not. So close_to(α, β) can be either true or false depending on α's closeness to β. A spatial association rule is a rule that consists of a set of predicates in which at least one spatial predicate is involved. For instance, is_a(α, house) ∧ close_to(α, beach) -> is_expensive(α). This approach efficiently mines large datasets using a progressive deepening approach.
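The rule is_a(α, house) ∧ close_to(α, beach) -> is_expensive(α) can be evaluated on a small data set as sketched below. Everything here is assumed for illustration: the entities, the beach modelled as a single point, the 1.0 distance cut-off for "close", and the price cut-off for "expensive". Antecedent support follows the definition given earlier (relative frequency of the antecedent); confidence is taken to be the fraction of antecedent cases in which the consequence also holds, which is the usual definition but is not spelled out in this lecture.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical entities with a kind, a location and a price.
entities = [
    {"kind": "house", "xy": (0.0, 0.5), "price": 900_000},
    {"kind": "house", "xy": (0.0, 0.8), "price": 750_000},
    {"kind": "house", "xy": (0.0, 6.0), "price": 300_000},
    {"kind": "house", "xy": (0.0, 0.9), "price": 280_000},
    {"kind": "shop",  "xy": (0.0, 0.4), "price": 950_000},
]

BEACH = (0.0, 0.0)        # assumption: the beach is reduced to one point
CLOSE_DISTANCE = 1.0      # assumption: cut-off for "close to"
EXPENSIVE_PRICE = 500_000 # assumption: cut-off for "expensive"

def is_a(entity, kind):
    """Non-spatial predicate: the entity is of the given kind."""
    return entity["kind"] == kind

def close_to(entity, place):
    """Spatial predicate: true or false depending on distance to place."""
    return dist(entity["xy"], place) <= CLOSE_DISTANCE

def is_expensive(entity):
    """Non-spatial predicate on the price attribute."""
    return entity["price"] >= EXPENSIVE_PRICE

# Cases where the antecedent holds, and those where the consequence also holds.
antecedent = [e for e in entities if is_a(e, "house") and close_to(e, BEACH)]
both = [e for e in antecedent if is_expensive(e)]

antecedent_support = len(antecedent) / len(entities)
confidence = len(both) / len(antecedent)

print(f"antecedent support = {antecedent_support:.2f}, confidence = {confidence:.2f}")
```

A rule would be reported only if its support and confidence reach the thresholds chosen by the analyst.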