Classification in spatial data mining
Kirsi Virrantaus
Maa-123.3585 Spatial Data Mining
Aalto ENG, Department of Built Environment

1. Traditional vs. spatial data mining
• Traditional data mining methods make assumptions that violate Tobler's first law of geography (spatial autocorrelation)
  – data samples are assumed to be independently generated
• Knowledge discovery techniques that ignore spatial autocorrelation perform poorly on spatial data
• There are two major approaches to incorporating spatial dependence into classification problems
  – spatial autoregression models (SAR)
  – Markov random field models (MRF)

Synonyms
• Spatial autocorrelation
• Spatial context
• Spatial dependence
  – usage varies between scientific communities, for example natural resource analysts vs. statisticians
• Classification
• Prediction
• Spatial variation (over space)
• Spatial heterogeneity

2. Classification
• Classification is understood here as prediction
• Classification means finding a function f : D → L, where D is the domain of f and L is the set of labels
  – for example, D can be a three-dimensional space consisting of the three attributes vegetation durability, water depth and distance to open water, and L contains the two labels nest and no-nest
  – f is used to predict the labels in L when only data from the set D is given
  – training data is used to fit f until the required accuracy is reached; the function is then applied to the test data (supervised learning)

Example: bird nests
• The region is divided into grid cells
• For each cell we know three attribute values (vegetation type, water depth, distance to open water)
• The training data includes both the attribute values and the classification result (for example, whether there is a bird nest in the cell or not)
• The trained function can then be used to classify a new data set for which only the attribute values are known (see the code sketch below)

Three approaches to the spatial extension of classification
• Logistic regression models (logistic SAR)
  – spatial autoregression + logistic transformation
  – in spatial autoregression, W stands for the neighbourhood and ρ for the strength of the dependence
• Geographically weighted regression (GWR)
  – takes into account spatial variation in the model parameters and errors
• Bayesian classifiers and the Markov random field extension
  – most typical, for example, in remote sensing applications
  – without the spatial extension they cause a "salt-and-pepper" effect
  – spatial dependency is modelled as the prior term in Bayes' rule together with a neighbourhood matrix
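To make the f : D → L formulation concrete, here is a minimal sketch of the bird-nest workflow. It is not part of the lecture material: the grid-cell attributes and labels are synthetic, and scikit-learn's LogisticRegression merely stands in for whatever classifier a real study would use.

```python
# A minimal sketch of the supervised classification workflow from the
# bird-nest example. All data are synthetic; a real study would use
# measured attribute values per grid cell.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Each grid cell is described by three attributes:
# vegetation durability, water depth, distance to open water.
n_cells = 200
X_train = rng.random((n_cells, 3))

# Label: 1 = nest, 0 = no-nest. Here nests are (artificially) more
# likely in shallow water close to open water.
y_train = ((X_train[:, 1] < 0.4) & (X_train[:, 2] < 0.5)).astype(int)

# Fit f : D -> L on the training data (supervised learning).
f = LogisticRegression().fit(X_train, y_train)

# Apply f to new cells for which only the attributes are known.
X_new = rng.random((5, 3))
print(f.predict(X_new))        # predicted labels in L = {0, 1}
print(f.predict_proba(X_new))  # class membership probabilities
```

Note that this baseline treats the cells as independent samples, which is exactly the assumption the spatial extensions below are designed to relax.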
3. Regression models in prediction
• The basic form of linear regression is y = mx + c, or in matrix form Y = Xβ + ε
  – training data can be used to estimate the parameter vector β, which in turn can be used to predict the value of the class attribute in the test data set
• When the dependent variable is binary, as in our bird-nest example, logistic regression is used
• Regression models assume that the sample observations are independently generated
• This may not be true for spatial data, and it shows up in the residual errors, the ε's
• Systematic variation can be found in the residual errors, which means that the model is unable to capture the spatial relationships that exist in the data

Spatial regression, SAR
• When variables are spatially referenced, they tend to exhibit spatial autocorrelation
• The assumption of independent distributions is then not appropriate: things depend on their neighbours
• The simplest way to modify the regression equation is to use the contiguity/adjacency matrix W
• The spatial autoregressive regression equation is then Y = ρWY + Xβ + ε
  – ρ is the parameter that expresses the strength of the correlation
  – when ρ = 0, the equation collapses to the classical regression model
  – with this model the ε's show much less systematic variation and the model as a whole fits better
  – spatial autoregression = spatial context = spatial autocorrelation

4. GWR, geographically weighted regression
• Another solution that brings spatial autocorrelation into regression models
• Developers: Professor Stewart Fotheringham, Martin Charlton and Chris Brunsdon, Centre for Geocomputation, National University of Ireland, Maynooth
• This course includes a full lecture and exercises on GWR

In GWR
• We explain the dependent variable with the independent ones, as in ordinary regression, build a model and solve it, for example by the OLS (ordinary least squares) method
• But we add a geographical weight based on a kernel, for example a Gaussian one
  – a square matrix whose leading diagonal contains the weights of the observations relative to the location in question
• This is another approach to taking the spatial context into account
• Useful reading material: Miller & Han, Ch 9, pp. 227–254

Geographically weighted regression, GWR (slides: U. Demsar)
Global vs. local linear regression
• A global model assumes a stationary process: the same stimulus provokes the same response in all parts of the study region
  yi = β0 + β1x1i + β2x2i + … + βnxni + εi
• The parameter estimates βj are constant everywhere; spatial non-stationarity can only be seen through the residuals
• GWR measures spatial non-stationarity directly: it allows the relationships we are measuring to vary over space

How does GWR work?
• A local statistical technique for analysing spatial variations in relationships
• Assumption: the regression parameters βj are continuous functions of the location i
  yi = β0(i) + β1(i)x1i + β2(i)x2i + … + βn(i)xni + εi
  with the estimator β̂(i) = (Xᵀ W(i) X)⁻¹ Xᵀ W(i) Y
  – W(i) is a matrix of weights: observations nearer to i are given greater weight than observations further away
  – the weighting kernel can be fixed or spatially adaptive
• A numpy sketch of this estimator is given below

Interpreting the results: the individual behaviour of the parameters
• The main output of GWR is a set of location-specific parameter estimates, which yields
  – statistical summaries for each parameter
  – univariate maps
  – information on spatial non-stationarity in relationships
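The following is a minimal numpy sketch of the local estimator β̂(i) = (Xᵀ W(i) X)⁻¹ Xᵀ W(i) Y with a fixed Gaussian kernel. The data, the bandwidth value and the kernel form are assumptions made here for illustration; real GWR software also calibrates the bandwidth, for example by cross-validation.

```python
# A minimal sketch of the GWR estimator with a fixed Gaussian kernel.
import numpy as np

rng = np.random.default_rng(1)

n = 100
coords = rng.random((n, 2)) * 10           # observation locations
X = np.column_stack([np.ones(n),           # intercept column
                     rng.random((n, 2))])  # two explanatory variables
# Synthetic response whose intercept drifts over space (non-stationarity).
Y = coords[:, 0] + 2.0 * X[:, 1] - 1.0 * X[:, 2] + rng.normal(0, 0.1, n)

bandwidth = 2.0  # assumed fixed kernel bandwidth

def gwr_beta(i):
    """Local estimate beta_hat(i) = (X^T W(i) X)^-1 X^T W(i) Y."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = np.exp(-(d / bandwidth) ** 2 / 2)   # Gaussian kernel weights
    W = np.diag(w)                          # weights on the leading diagonal
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

# One set of parameter estimates per location: the main output of GWR,
# ready to be summarised or mapped per parameter.
betas = np.array([gwr_beta(i) for i in range(n)])
print(betas[:3])  # local estimates of [intercept, beta_1, beta_2]
```

With a very large bandwidth every weight approaches 1 and the local estimates collapse to the global OLS solution, which is one way to see that GWR generalises the global model.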
5. Bayesian classifier
• Bayes' theorem: P(B│A) = P(A│B) × P(B) / P(A), the probability of B given that A has happened
  – typically used in remote sensing image classification problems
  – see the appendix slides and Section 7.8
  – in conditional probability we calculate the probability of A when we know that B has happened
  – in Bayes' rule the question is reversed: knowing that A happened, what is the most probable cause?

Example on bird nests
• Events
  – B = there is a bird's nest in the location in question
  – A = a specified combination of attribute values (vegetation, water depth, distance to open water) occurs in the location
• We know the corresponding probabilities
  – P(B) = the probability that there is a nest in the location
  – P(A) = the probability that the specified combination of attribute values occurs in the location
• We can calculate
  – P(A│B), the probability that the terrain has the specified attributes, given that there is a nest in the location
  – P(B│A), the probability that there is a nest in the location, given that the terrain attributes are as specified
• P(B│A) = P(A│B) × P(B) / P(A)
• See the example in the appendix

[Figure: original maps and the maps simulated by the classifier]

Problem of the conventional Bayes classifier
• The conventional Bayes classifier does not take spatial dependency into account
• The result of the classification is a kind of "salt-and-pepper" map
• The solution is that spatial dependency must be taken into account
• One solution is to use information about the neighbourhood in the classification (spatial autocorrelation)
• The technique is a neighbourhood matrix W, which can be seen as a Markov random field

Markov random field
• A Markov random field (MRF) is a set of random variables whose interdependency is described by an undirected graph (for example a symmetric neighbourhood matrix W)
• The Markov property specifies that a variable depends only on its neighbours and is independent of all other variables
• The location problem (predicting the label) can be modelled by assuming that the class labels of the different locations constitute an MRF
  – the label value is based on the feature value vector and the neighbourhood class label vector

MRF and Bayes
• A Bayes classifier can be built on the MRF
• We then use Bayes probabilities and take spatial autocorrelation into account through the MRF
• The label is predicted from the probability distributions of the feature values and the neighbourhood labels (see the code sketch after the reading list below)
• See formula 6.5 in Miller & Han, p. 130
• Sections 7.3.3–7.3.4 (in Shekhar & Chawla) and the comparison of SAR and MRF do not belong to the reading material
• In this context it is only important to understand the idea of MRF & Bayes and to know that there is a lot of research on this topic
• Very good reading material on classification: Miller & Han, Ch 6, pp. 117–130

Material for this lecture
• Miller, H. & Han, J., Geographic Data Mining and Knowledge Discovery, Ch 6, pp. 117–147
  – available as an e-book
• Shekhar, S. & Chawla, S., Spatial Databases: A Tour; this lecture covers pp. 182–201
  – can be downloaded from http://www.spatial.cs.umn.edu/Book/
  – there are also a few copies of the book in the library
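To illustrate the MRF & Bayes idea (this is not formula 6.5 of Miller & Han), here is a small sketch: a conventional per-pixel Bayes classifier first produces a salt-and-pepper map, and a Potts-style neighbourhood prior optimised with iterated conditional modes (ICM) then smooths it. The image, the class parameters and the coupling strength BETA are all synthetic assumptions.

```python
# Sketch: per-pixel Bayes classification, then MRF smoothing via ICM.
import numpy as np

rng = np.random.default_rng(2)

# Synthetic 40x40 image: two classes with Gaussian feature distributions.
true = np.zeros((40, 40), dtype=int)
true[:, 20:] = 1
features = rng.normal(loc=true.astype(float), scale=0.8)  # noisy observations

means = np.array([0.0, 1.0])  # assumed known class means
BETA = 1.5                    # assumed strength of the neighbourhood prior

def log_likelihood(label):
    """log P(feature | label) for a Gaussian class model (up to a constant)."""
    return -(features - means[label]) ** 2 / (2 * 0.8 ** 2)

# Step 1: conventional per-pixel Bayes classifier (equal priors) maximises
# the likelihood alone and yields the "salt-and-pepper" map.
labels = np.argmax(np.stack([log_likelihood(0), log_likelihood(1)]), axis=0)

# Step 2: ICM with a Potts MRF prior: by the Markov property, each pixel's
# label depends only on its 4-neighbourhood, not on the rest of the map.
for _ in range(5):
    for i in range(labels.shape[0]):
        for j in range(labels.shape[1]):
            nbrs = [labels[x, y]
                    for x, y in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1))
                    if 0 <= x < labels.shape[0] and 0 <= y < labels.shape[1]]
            scores = [log_likelihood(k)[i, j] + BETA * sum(n == k for n in nbrs)
                      for k in (0, 1)]
            labels[i, j] = int(np.argmax(scores))

print("disagreement with the true map:", np.mean(labels != true))
```

ICM is only one simple way to approximate the MRF-MAP estimate; the point of the sketch is merely how the neighbourhood term enters Bayes' rule as a prior.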
Extra slides about probabilities
• Joint probability
• Conditional probability
• Bayes probability

About probabilities
• Joint probability: the case of independent events; computed by multiplying the individual probabilities
• Conditional probability: the case of events that are not independent; one event affects the other; computed with the formula of conditional probability
• Bayes probability: the case of the reverse problem statement; we know the result and that it was caused by one of several possible events, and we ask for the probability of the causing event behind the result

Joint probability
• When A and B are independent
• The situation is the intersection of the events: both A and B must happen at the same time
• The formula is then P(AB) = P(A) × P(B)

Conditional probability
• The case of events that are not independent
• The probability of an event A when we know that some other event B has occurred
  – B has already happened; B gives us prior information
  – what is the probability that A occurs?
  – knowing that B has already occurred either reduces or increases the chance of A
• The conditional probability is denoted P(A│B) = P(AB) / P(B)

Example (Durrett)
• Roll a four-sided die (say it shows 3), then flip that number of coins (here 3 coins, each with two sides)
• What is the probability A that we get exactly one "heads"?
• Let Bi = an i appears on the first roll; P(Bi) = 1/4
• We can easily calculate the conditional probabilities P(A│Bi) = i/2^i, which are 1/2, 1/2, 3/8 and 1/4 for i = 1, …, 4
• P(A) is then solved by applying the multiplication rule to the basic formula of conditional probability:
  P(A) = Σi P(A│Bi) P(Bi) = 1/4 × (1/2 + 1/2 + 3/8 + 1/4) = 13/32

Bayesian probability
• The reverse question can be asked:
  – if A occurs, what is the most likely cause (the value of B)?
  – in the previous example: if we get exactly one heads, what is the most likely value of B?
• This leads to Bayes' formula, which uses the following information:
  – P(A), the probability of the result, which we now know has happened; in this case 13/32
  – P(B) and the conditional probability P(A│B)
  – we ask for the probability of the B that caused A, i.e. the most likely value of B
• P(B│A) = P(A│B) P(B) / P(A)

Using Bayes' rule: an example
• Suppose there is a school whose students are 60% boys and 40% girls.
• The female students wear trousers or skirts in equal numbers; the boys all wear trousers.
• An observer sees a (random) student from a distance, and all the observer can see is that this student is wearing trousers.
• What is the probability that this student is a girl?
• The correct answer can be computed using Bayes' theorem.
• Let A be the event that the observed student is a girl, and B the event that the observed student is wearing trousers. To compute P(A│B), we first need to know:
  – P(B│A), the probability that the student wears trousers given that the student is a girl; since girls are as likely to wear skirts as trousers, this is 0.5 (a conditional probability)
  – P(A), the probability that the student is a girl regardless of any other information; since the observer sees a random student, all students are equally likely to be observed, and as the fraction of girls among the students is 40%, this probability is 0.4
  – P(B), the probability that a (randomly selected) student wears trousers regardless of any other information;
    since half of the girls and all of the boys wear trousers, this is 0.5 × 0.4 + 1.0 × 0.6 = 0.8
• Given all this information, the probability that the observed trouser-wearing student is a girl is obtained by substituting these values into the formula:
  P(A│B) = P(B│A) P(A) / P(B) = (0.5 × 0.4) / 0.8 = 0.25
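For completeness, here is a short sketch that re-derives the numbers in both appendix examples with exact fractions; it adds nothing beyond the arithmetic above.

```python
# Check of the two appendix examples with exact rational arithmetic.
from fractions import Fraction

# Die-and-coins: B_i = die shows i (i = 1..4), A = exactly one head in i flips.
p_B = {i: Fraction(1, 4) for i in range(1, 5)}
p_A_given_B = {i: Fraction(i, 2 ** i) for i in range(1, 5)}  # i / 2^i

# Law of total probability: P(A) = sum_i P(A|B_i) P(B_i).
p_A = sum(p_A_given_B[i] * p_B[i] for i in range(1, 5))
print("P(A) =", p_A)  # 13/32

# Bayes: P(B_i | A) = P(A | B_i) P(B_i) / P(A).
posterior = {i: p_A_given_B[i] * p_B[i] / p_A for i in range(1, 5)}
print(posterior)  # 4/13, 4/13, 3/13, 2/13 -> rolls 1 and 2 are equally likely

# Trousers example: A = girl, B = trousers.
p_girl, p_boy = Fraction(2, 5), Fraction(3, 5)
p_trousers_given_girl = Fraction(1, 2)
p_trousers = p_trousers_given_girl * p_girl + 1 * p_boy   # = 4/5
print(p_trousers_given_girl * p_girl / p_trousers)        # 1/4 = 0.25
```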