Classification in spatial data mining
Kirsi Virrantaus
Maa-123.3585 Spatial Data Mining
AaltoENG, Department of Built Environment
1. Traditional vs. spatial data mining
• traditional data mining methods make assumptions that violate Tobler's first law of geography (spatial autocorrelation)
– data samples are assumed to be independently generated
• knowledge discovery techniques that ignore spatial autocorrelation perform poorly on spatial data
• there are two major approaches to incorporating spatial dependence into classification problems
– Spatial autoregression models (SAR)
– Markov random field models (MRF)
Synonyms
• Spatial autocorrelation
• Spatial context
• Spatial dependence – the term used by natural resource analysts and statisticians
– the terminology depends on the scientific community
• Classification
• Prediction
• Spatial variation (over space)
• Spatial heterogeneity
2. Classification
• Classification is here understood as prediction
• Classification means finding a function f : D → L, where D is the domain of f and L is the set of labels
– for example, D can be a three-dimensional attribute space consisting of vegetation durability, water depth and distance to open water, and L contains the two labels nest and no-nest
– f is used to predict the labels in L when only data from the set D is given
– training data is used to fit f until the required accuracy is reached; the function is then applied to the test data; this is supervised learning
Example: bird nests
• The region is divided into grid cells
• For each cell we know three attribute values (vegetation type, water depth, distance to open water)
• The training data includes both the attribute values and the labels (for example: whether there is a bird nest in the cell or not)
• The trained function can then be used to classify a new data set for which only the attribute values are known
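As an illustration of this supervised workflow, here is a minimal sketch in Python, assuming scikit-learn and entirely synthetic attribute values and labels (the course's bird-nest data is not reproduced here):

```python
# Minimal sketch of the bird-nest classification setup (synthetic data).
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Each grid cell has three attributes: vegetation type (coded numerically),
# water depth, distance to open water; these form the domain D.
n_cells = 500
D = rng.random((n_cells, 3))

# Invented labelling rule just for demonstration: nests tend to occur
# in shallow water close to open water (1 = nest, 0 = no-nest).
labels = ((D[:, 1] < 0.4) & (D[:, 2] < 0.5)).astype(int)

# Supervised learning: fit f on training data, then apply it to test data.
D_train, D_test, y_train, y_test = train_test_split(D, labels, random_state=0)
f = LogisticRegression().fit(D_train, y_train)
print("test accuracy:", f.score(D_test, y_test))
```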
Three approaches for the spatial extension of classification
• Logistic regression models (logistic SAR)
– spatial autoregression + logistic transformation
– in spatial autoregression, W models the neighbourhood and ρ the strength of the spatial dependence
• Geographically weighted regression (GWR)
– takes into account spatial variation in the model parameters and errors
• Bayesian classifiers and the Markov random field extension
– most typical, for example, in remote sensing applications
– cause a "salt-and-pepper effect" without the spatial extension
– spatial dependency is modelled as the prior term in Bayes' rule together with a neighbourhood matrix
3. Regression models in prediction
• Basic form of linear regression: y = mx + c, or in matrix form Y = Xβ + ε
– training data can be used to estimate the parameter vector β, which in turn can be used to predict the value of the class attribute in the test data set
• when the dependent variable is binary, as in our bird-nest example, logistic regression is used
• regression models assume that the sample observations are independently generated
• this may not be true for spatial data, and it shows up in the residual errors, the ε's
• systematic variation in the residual errors means that the model is unable to capture the spatial relationships that exist in the data
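As an illustration, β can be estimated from training data by ordinary least squares via the normal equations; a minimal NumPy sketch on synthetic data:

```python
# Ordinary least squares for Y = Xβ + ε, using the normal equations.
import numpy as np

rng = np.random.default_rng(1)
n = 200
X = np.column_stack([np.ones(n), rng.random((n, 2))])  # intercept + 2 attributes
beta_true = np.array([1.0, 2.0, -0.5])
Y = X @ beta_true + rng.normal(scale=0.1, size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ Y)  # (XᵀX)⁻¹ XᵀY
residuals = Y - X @ beta_hat
print(beta_hat)
# With spatial data, inspect the residuals for systematic
# (spatially autocorrelated) variation before trusting the model.
```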
Spatial regression – SAR
• when variables are spatially referenced, they tend to exhibit spatial autocorrelation
• the assumption of independent observations is then not appropriate: things depend on their neighbours
• the simplest way to modify the regression equation is to use the contiguity (adjacency) matrix W
• the spatial autoregressive regression equation is then Y = ρWY + Xβ + ε
– ρ is the parameter that expresses the strength of the spatial correlation
– when ρ = 0, the equation collapses to the classical regression model
– with this model there is much less systematic variation in ε, and the model as a whole fits better
– spatial autoregression = spatial context = spatial autocorrelation
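A minimal sketch of what the SAR equation means as a data-generating process: Y = ρWY + Xβ + ε implies Y = (I − ρW)⁻¹(Xβ + ε), so it can be simulated directly. NumPy, with an invented chain of locations and invented values for ρ and β:

```python
# Simulating the SAR model Y = ρWY + Xβ + ε, i.e. Y = (I − ρW)⁻¹(Xβ + ε).
import numpy as np

rng = np.random.default_rng(2)
n, rho = 50, 0.6

# Row-standardised contiguity matrix W on a 1-D chain of locations:
# each location's neighbours are the previous and next locations.
W = np.zeros((n, n))
for i in range(n):
    for j in (i - 1, i + 1):
        if 0 <= j < n:
            W[i, j] = 1.0
W /= W.sum(axis=1, keepdims=True)

X = np.column_stack([np.ones(n), rng.random(n)])
beta = np.array([1.0, 2.0])
eps = rng.normal(scale=0.1, size=n)

Y = np.linalg.solve(np.eye(n) - rho * W, X @ beta + eps)
# With rho = 0 this reduces to the classical model Y = Xβ + ε.
```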
4. GWR – Geographically Weighted Regression
• another solution that brings spatial autocorrelation into regression models
• developers
– Professor Stewart Fotheringham, Martin Charlton and Chris Brunsdon, National Centre for Geocomputation, National University of Ireland Maynooth
• this course includes a full lecture and exercises on GWR
In GWR
• we explain the dependent variable by the independent ones, as in ordinary regression, build a model and solve it, for example by OLS, the ordinary least squares method
• but we add a geographical weight based on a kernel, for example a Gaussian
– a square matrix whose leading diagonal contains the weights of the observations relative to the location in question
• this is another approach to taking the spatial context into account
• useful reading material in Miller & Han, Ch. 9, pp. 227–254
Geographically Weighted Regression – GWR (slides: U. Demsar)
Global vs. local linear regression
• global model assumption: a stationary process – the same stimulus provokes the same response in all parts of the study region
• yi = β0 + β1x1i + β2x2i + … + βnxni + εi
• stationary parameter estimates βj – constant everywhere
• spatial non-stationarity can only be seen through the residuals
GWR
• measures spatial non-stationarity directly: the relationships we are measuring are allowed to vary over space
How does GWR work?
• a local statistical technique for analysing spatial variations in relationships
• assumption: the regression parameters βj are continuous functions of location i
• yi = β0(i) + β1(i)x1i + β2(i)x2i + … + βn(i)xni + εi
• with the estimator β̂(i) = (XᵀW(i)X)⁻¹ XᵀW(i)Y
– W(i) is the matrix of weights; observations nearer to location i are given greater weight than observations further away
• the weighting kernel can be fixed or spatially adaptive
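A minimal Python sketch of this local estimator, assuming NumPy, synthetic point locations, and a fixed-bandwidth Gaussian kernel (all data and parameter values are invented for illustration):

```python
# GWR local estimate at location i: β̂(i) = (Xᵀ W(i) X)⁻¹ Xᵀ W(i) Y
import numpy as np

rng = np.random.default_rng(3)
n = 100
coords = rng.random((n, 2))                      # point locations
X = np.column_stack([np.ones(n), rng.random(n)])
# Non-stationary process: the slope drifts with the x-coordinate.
Y = 1.0 + (2.0 + coords[:, 0]) * X[:, 1] + rng.normal(scale=0.1, size=n)

def gwr_beta(i, bandwidth=0.2):
    """Locally weighted regression centred on observation i."""
    d = np.linalg.norm(coords - coords[i], axis=1)
    w = np.exp(-(d / bandwidth) ** 2)            # Gaussian kernel weights
    W = np.diag(w)                               # weights on the leading diagonal
    return np.linalg.solve(X.T @ W @ X, X.T @ W @ Y)

betas = np.array([gwr_beta(i) for i in range(n)])
# betas[:, 1] varies over space; mapping it reveals the non-stationarity.
```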
Interpreting the results – the individual behaviour of the parameters
• the main output of GWR is a set of location-specific parameter estimates
• statistical summaries for each parameter
• univariate maps of the local estimates
• together these give information on spatial non-stationarity in the relationships
5. Bayesian classifier
• Bayes' theorem: P(B|A) = P(A|B) × P(B) / P(A), the probability of B given that A has happened
– typically used in remote sensing image classification problems
• see the Appendix in Section 7.8
– in conditional probability we calculate the probability of A when we know that B has happened
– in Bayes' rule the question is reversed: knowing that A has happened, what is the most probable cause B?
Example: bird nests
• Events
– B = there is a bird's nest in the location in question
– A = a specified combination of attribute values (vegetation, water depth, distance to open water)
• We know the corresponding probabilities
– P(B) = the probability that there is a nest in the location
– P(A) = the probability of the specified combination of attribute values in the location
• We can calculate
– P(A|B), the probability of the specified terrain attributes given that there is a nest in the location
– P(B|A), the probability that there is a nest in the location given the specified terrain attributes
• P(B|A) = P(A|B) × P(B) / P(A)
• See the example in the Appendix
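A minimal numeric sketch of this rule with invented probability values (the course's actual estimates are not reproduced here):

```python
# Bayes' rule for the bird-nest example, with invented probabilities.
p_nest = 0.1             # P(B): prior probability of a nest in a cell
p_attr = 0.25            # P(A): probability of the attribute combination
p_attr_given_nest = 0.8  # P(A|B): estimated from training data

# P(B|A) = P(A|B) P(B) / P(A)
p_nest_given_attr = p_attr_given_nest * p_nest / p_attr
print(p_nest_given_attr)  # 0.32: the attributes make a nest more likely
```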
[Figure: original maps vs. simulated maps]
Problem of the conventional Bayes classifier
• the conventional Bayes classifier does not take spatial dependency into account
• the result of the classification is a kind of "salt-and-pepper" map
• the solution is that spatial dependency must be taken into account
• one solution is to use information about the neighbourhood in the classification (spatial autocorrelation)
• the technique is a neighbourhood matrix W, which can be seen as a Markov random field
Markov Random Field
• Markov random field (MRF):
– a set of random variables whose interdependency is described by an undirected graph (for example a symmetric neighbourhood matrix W)
• the Markov property specifies that
– a variable depends only on its neighbours and is independent of all other variables
• the location problem (predicting the label) can be modelled by assuming that the class labels of the different locations constitute an MRF
– the label value is based on the feature value vector and the neighbourhood class label vector
MRF and Bayes
• a Bayes classifier can be built on the MRF
• we then use Bayes probabilities and take the spatial autocorrelation into account through the MRF
• the label is predicted based on the probability distributions of the feature values and the neighbourhood labels; a small sketch of this idea is given below
• see formula 6.5 in Miller & Han, p. 130
• sections 7.3.3–7.3.4 (in Shekhar & Chawla) and the comparison of SAR and MRF do not belong to the reading materials
• in this context it is only important to understand the idea of MRF & Bayes and to know that there is a lot of research on this topic
• very good reading material about classification in Miller & Han, Ch. 6, pp. 117–130
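A minimal sketch of the MRF-and-Bayes idea as an ICM-style (iterated conditional modes) label update on a grid; this illustrates the concept only, not formula 6.5 itself, and all parameter values are invented:

```python
# Each cell's label is chosen to maximise
# P(feature | label) × P(label | neighbour labels).
import numpy as np

rng = np.random.default_rng(4)
size = 20
labels = rng.integers(0, 2, (size, size))        # initial noisy labels
features = labels + rng.normal(scale=0.8, size=(size, size))

def neighbour_agreement(lab, i, j, value):
    """Count the 4-neighbours of cell (i, j) sharing the candidate label."""
    nbrs = [(i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)]
    return sum(lab[a, b] == value
               for a, b in nbrs if 0 <= a < size and 0 <= b < size)

beta = 1.0                                       # strength of the spatial prior
for _ in range(5):                               # a few update sweeps
    for i in range(size):
        for j in range(size):
            scores = []
            for v in (0, 1):
                loglik = -0.5 * (features[i, j] - v) ** 2  # Gaussian likelihood
                logprior = beta * neighbour_agreement(labels, i, j, v)
                scores.append(loglik + logprior)
            labels[i, j] = int(np.argmax(scores))
# The neighbourhood prior smooths away the salt-and-pepper effect.
```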
Material for this lecture
• Miller, H., Han, J., Geographic data mining and knowledge discovery, Ch. 6, pp. 117–147
– available as an e-book
• Shekhar, S., Chawla, S., Spatial Databases: A Tour
– this lecture covers pp. 182–201
– can be downloaded from http://www.spatial.cs.umn.edu/Book/
• there are also a few copies of the book in the library
Extra slides about probabilities
• Joint probability
• Conditional probability
• Bayes probability
About probabilities
• joint probability: for independent events, obtained by multiplying the individual probabilities
• conditional probability: for dependent events, where one event affects the probability of the other, obtained by the formula of conditional probability
• Bayes probability: for the reverse problem statement; we observe a result and know how it can arise; the question is the probability of the causing event behind the result
Joint probability
• when A and B are independent
• the situation is the intersection of the two events: both A and B happen at the same time
• the formula is then: P(AB) = P(A) × P(B)
Conditional probability
• for events that are not independent
• the probability of an event A when we know that some other event B has occurred
– B has already happened; B gives us prior information
– what is the probability that A occurs?
– knowing that B has occurred either reduces or increases the chance of A
• the conditional (posterior) probability is denoted P(A|B) = P(AB) / P(B)
Example
• roll a four-sided die (say the result is 3)
• then flip that number of coins (here 3 coins, each with two sides)
• what is the probability that we get exactly one "heads"?
• let Bi = the event that i appears on the roll, P(Bi) = 1/4
• the conditional probabilities P(A|Bi) can be calculated for each Bi: 1/2, 1/4, 1/8, 1/16
• P(A) is then solved by applying the multiplication rule to the basic formula of conditional probability, i.e. the law of total probability
• (Durrett)
Bayesian probability
• a reverse question can be asked:
– if A occurs, what is the most likely cause (the value of B)?
– in the previous example: if we get one heads, what is the most likely value of B?
• this leads to Bayes' formula, which uses the following information:
– P(A), the total probability of the observed result; here P(A) = 1/4 × (1/2 + 1/4 + 1/8 + 1/16) = 15/64 by the law of total probability
– we ask for the probability of the B that caused A: what is the most likely value of B, given that we know the probability of B and the conditional probability
• P(B|A) = P(A|B) P(B) / P(A)
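The die-and-coins chain above (total probability, then the reverse Bayes question) can be checked numerically; a minimal Python sketch using the probabilities listed in the example:

```python
# Die-and-coins example: P(Bi) = 1/4 and P(A|Bi) as listed above.
from fractions import Fraction as F

p_B = [F(1, 4)] * 4                                # P(Bi), i = 1..4
p_A_given_B = [F(1, 2), F(1, 4), F(1, 8), F(1, 16)]

# Law of total probability: P(A) = Σ P(A|Bi) P(Bi)
p_A = sum(pa * pb for pa, pb in zip(p_A_given_B, p_B))
print(p_A)                                         # 15/64

# Bayes' rule: P(Bi|A) = P(A|Bi) P(Bi) / P(A)
posteriors = [pa * pb / p_A for pa, pb in zip(p_A_given_B, p_B)]
print(posteriors)                                  # 8/15, 4/15, 2/15, 1/15
```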
Using Bayes' rule: an example
• Suppose there is a school where 60% of the students are boys and 40% are girls.
• The female students wear trousers or skirts in equal
numbers; the boys all wear trousers.
• An observer sees a (random) student from a distance,
and what the observer can see is that this student is
wearing trousers.
• What is the probability this student is a girl?
• The correct answer can be computed using Bayes'
theorem.
• The event A is that the student observed is a girl, and the event
B is that the student observed is wearing trousers. To compute
P(A|B), we first need to know:
• P(B|A), or the probability of the student wearing trousers given
that the student is a girl. Since girls are as likely to wear skirts
as trousers, this is 0.5. (conditional probability)
• P(A), or the probability that the student is a girl regardless of any
other information. Since the observer sees a random student,
meaning that all students have the same probability of being
observed, and the fraction of girls among the students is 40%,
this probability equals 0.4.
• P(B), or the probability of a (randomly selected) student wearing
trousers regardless of any other information. Since half of the
girls and all of the boys are wearing trousers, this is 0.5×0.4 +
1.0×0.6 = 0.8.
• Given all this information, the probability of the observer
having spotted a girl given that the observed student is
wearing trousers can be computed by substituting these
values in the formula:
• P(A| B) = P(B|A) P(A) /P(B) = 0.5 x 0.4 / 0.8 = 0.25
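The same substitution as a few lines of Python, just to verify the arithmetic:

```python
# Trousers example: P(A|B) = P(B|A) P(A) / P(B).
p_girl = 0.4                         # P(A)
p_trousers_given_girl = 0.5          # P(B|A)
p_trousers = 0.5 * 0.4 + 1.0 * 0.6   # P(B) by total probability = 0.8

print(p_trousers_given_girl * p_girl / p_trousers)  # 0.25
```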