Download Mining for Spatial Patterns

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Transcript
Mining for Spatial Patterns
Shashi Shekhar
Department of Computer Science
University of Minnesota
http://www.cs.umn.edu/~shekhar
Collaborators:
U. of Minnesota: V. Kumar, G. Karypis, C.T. Lu, W. Wu, Y. Huang, V. Raju,
P. Zhang, P. Tan, M. Steinbach
NASA Ames Research Center: C. Potter
California State University, Monterey Bay: S. Klooster
This work was partially funded by NASA and Army High Performance Computing Center
Shashi Shekhar
Mining For Spatial Patterns
1
Background
NSF workshop on GIS and DM (3/99)
Spatial data - traffic, bird habitats,
global climate, logistics, ...
For spatial patterns - outliers, location
prediction, associations, sequential
associations, clustering, trends, …
Shashi Shekhar
Mining For Spatial Patterns
2
Framework
Problem statement: capture special
needs
Data exploration: maps, new methods
Try reusing classical methods
from data mining, spatial statistics
If reuse is not possible, invent new
methods
Validation, Performance tuning
Shashi Shekhar
Mining For Spatial Patterns
3
Research Goals
Research Goals:
• modeling of ecological data
 event modeling
 zone modeling.
• finding spatio-temporal patterns
NPP
.
Pressure
 associations
NPP
.
Pressure
.
Precipitation
Precipitation
SST
SST
 predictive models.
Latitude
grid cell
Longitude
Shashi Shekhar
Time
zone
Mining For Spatial Patterns
A key interest is finding
connections between the
ocean and the land.
4
Sources of Earth Science Data
Before 1950, very sparse, unreliable data.
Since 1950, reliable global data.
Ocean temperature and pressure are based on data
from ships.
Most land data, (solar, precipitation, temperature and
pressure) comes from weather stations.
Since 1981, data has been available from Earth
orbiting satellites.
FPAR, a measure related to plant
Since 1999 TERRA, the flagship of the NASA
Earth Observing System, is providing much more
detailed data.
Shashi Shekhar
Mining For Spatial Patterns
5
Example Pattern: Teleconnections
Teleconnections are the simultaneous
variation in climate and related processes
over widely separated points on the Earth.
For example, El Nino is the anomalous
warming of the eastern tropical region of the
Pacific, and has been linked to various climate
phenomena.
Droughts in Australia and Southern Africa
Heavy rainfall along the western coast of South
America
Milder winters in the Midwest
Shashi Shekhar
Mining For Spatial Patterns
6
Net Primary Production (NPP)
Net Primary Production (NPP) is the net assimilation
of atmospheric carbon dioxide (CO2) into organic
matter by plants.
NPP is driven by solar radiation and can be constrained by
precipitation and temperature.
NPP is a key variable for understanding the global
carbon cycle and ecological dynamics of the Earth.
Keeping track of NPP is important because it
includes the food source of humans and all other
organisms.
Sudden changes in the NPP of a region can have a direct impact
on the regional ecology.
An ecosystem model for predicting NPP, CASA (the
Carnegie Ames Stanford Approach) provides a
detailed view of terrestrial productivity.
Shashi Shekhar
Mining For Spatial Patterns
7
Benefits of Data Mining
Data mining provides earth scientist with tools that
allow them to spend more time choosing and
exploring interesting families of hypotheses.
However, statistics is needed to provide methods for determining
the “statistical” significance of results.
By applying the proposed data mining techniques,
some of the steps of hypothesis generation and
evaluation will be automated, facilitated and
improved.
Association rules provide a “new” framework for
detecting relationships between events.
Shashi Shekhar
Mining For Spatial Patterns
8
Approaches
Shashi Shekhar
Mining For Spatial Patterns
9
Clustering
Interested in relationships between regions,
not “points.”
For land, clustering based on NPP or other
variables, e.g., precipitation, temperature.
For ocean, clustering based on SST (Sea
Surface Temperature).
When “raw” NPP and SST are used, clustering
can find seasonal patterns.
Anomalous regions have plant growth patterns
which reversed from those typically observed in
the hemisphere in which they reside, and are easy
to spot.
Shashi Shekhar
Mining For Spatial Patterns
10
Clustering
EL Nino Related SST Clusters
90
60
latitude
30
Niño
Region
Range
Longitude
Range
Latitude
1+2
90°W-80°W
10°S-0°
3
150°W-90°W
5°S-5°N
3.4
170°W-120°W
5°S-5°N
4
160°E-150°W
5°S-5°N
0
El Nino Regions
-30
-60
-90
-180 -150 -120
-90
-60
-30
0
30
longitude
60
90
120
150
180
SNN clusters of SST that are highly correlated with
El Nino indices.
Shashi Shekhar
Mining For Spatial Patterns
11
Spatial Association Rule
Citation: Symp. On Spatial Databases 2001
Problem: Given a set of boolean spatial features
find subsets of co-located features, e.g. (fire, drought, vegetation)
Data - continuous space, partition not natural, no reference feature
Classical data mining approach: association rules
But, Look Ma! No Transactions!!! No support measure!
Approach: Work with continuous data without
transactionizing it!
confidence = Pr.[fire at s | drought in N(s) and vegetation in N(s)]
support: cardinality of spatial join of instances of fire, drought, dry
veg.
participation: min. fraction of instances of a features in join result
new algorithm using spatial joins and apriori_gen filters
Shashi Shekhar
Mining For Spatial Patterns
12
Event Definition
Convert the time series into sequence of events at
each spatial location.
time
y
t1
AK
M
A
B
AB
D
DF
CM
ABE
G
AB
G
DL
J
x
AB
D
AB
DEF
EG
K
BCD
CE
F
EG
M
BCE
Shashi Shekhar
t3
t2
DK
L
AB
GL
CFM
AB
E
Grid Cell (x,y)
(1,1)
(1,2)
(1,3)
(1,4)
(2,1)
(2,2)
(2,3)
(2,4)
(3,1)
(3,2)
(3,3)
(3,4)
(4,1)
(4,2)
(4,3)
(4,4)
Mining For Spatial Patterns
t1
Æ
{A, B, D}
Æ
{A, K, M}
{B, C, E}
Æ
Æ
{A, B}
Æ
{A, B, G}
{C, M}
Æ
Æ
Æ
Æ
Æ
t2
Æ
{D, L, J}
{A, B, E, G}
Æ
{E, G, M}
{C, E, F}
Æ
{D, F}
Æ
Æ
Æ
Æ
Æ
{D, K, L}
Æ
{A, B}
t3
Æ
Æ
{B, C, D}
Æ
{C, F, M}
{A, B, G, L}
Æ
{A, B, D}
Æ
{A, B, E}
Æ
Æ
Æ
Æ
{E, G, K}
{D, E, F}
13
Interesting Association Patterns
Use domain knowledge to eliminate uninteresting
patterns.
A pattern is less interesting if it occurs at random
locations.
Approach:
Partition the land area into distinct groups (e.g., based on landcover type).
For each pattern, find the regions for which the pattern can be
applied.
If the pattern occurs mostly in a certain group of land areas, then it
is potentially interesting.
If the pattern occurs frequently in all groups of land areas, then it
is less interesting.
Shashi Shekhar
Mining For Spatial Patterns
14
Association Rules
Intra-zone non-sequential Patterns
FPAR-Hi  NPP-Hi (support  10)
Shrubland regions
• Region corresponds to semi-arid grasslands, a type of vegetation, which is able to
quickly take advantage of high precipitation than forests.
• Hypothesis: FPAR-Hi events could be related to unusual precipitation conditions.
Shashi Shekhar
Mining For Spatial Patterns
15
Co-location
Can you find co-location patterns from the following sample dataset?
Answers:
Shashi Shekhar
and
Mining For Spatial Patterns
16
Co-location
Can you find co-location patterns from the following sample dataset?
Shashi Shekhar
Mining For Spatial Patterns
17
Co-location
Spatial Co-location
A set of features frequently co-located
Given
A set T of K boolean spatial feature
types
T={f1,f2, … , fk}
A set P of N locations P={p1, …, pN } in
a spatial frame work S, pi P is of
some spatial feature in T
A neighbor relation R over locations in S
Find
Reference Feature Centric
Tc = subsets of T frequently co-located
Objective
Correctness
Completeness
Efficiency
Constraints
R is symmetric and reflexive
Monotonic prevalence measure
Window Centric
Shashi Shekhar
Mining For Spatial Patterns
Event Centric
18
Co-location
Comparison with association rules
Association rules
Co-location rules
underlying space
discrete sets
continuous space
item-types
item-types
events /Boolean spatial features
collections
transactions
neighborhoods
prevalence measure
support
participation index
conditional probability
measure
Pr.[ A in T | B in T ]
Pr.[ A in N(L) | B at L ]
Participation index
Participation ratio pr(fi, c) of feature fi in co-location c = {f1, f2, …, fk}: fraction of instances of fi
with
feature {f1, …, fi-1, fi+1, …, fk} nearby 2.Participation index = min{pr(fi, c)}
Algorithm
Hybrid Co-location Miner
Shashi Shekhar
Mining For Spatial Patterns
19
Spatial Co-location Patterns
Dataset
• Spatial feature A,B,C and their instances
• Possible associations are (A, B), (B, C), etc.
• Neighbor relationship includes following pairs:
•A1, B1
•A2, B1
•A2, B2
•B1, C1
•B2, C2
Shashi Shekhar
Mining For Spatial Patterns
20
Spatial Co-location Patterns
Dataset
Partition approach[Yasuhiko, KDD 2001]
•Support not well defined,i.e. not independent of execution
trace
•Has a fast heuristic which is hard to analyze for
correctness/completeness
Spatial feature A,B, C,
and their instances
Support A,B =2 B,C=2
Shashi Shekhar
Mining For Spatial Patterns
Support A,B=1 B,C=2
21
Spatial Co-location Patterns
Dataset
Reference feature approach [Han SSD 95]
•C as reference feature to get transactions
•Transactions: (B1) (B2)
•Support (A,B) = Ǿ from Apriori algorithm
Spatial feature A,B, C,
and their instances
Shashi Shekhar
•Note: Neighbor relationship includes following pairs:
•A1, B1
•A2, B1
•A2, B2
•B1, C1
•B2, C2
Mining For Spatial Patterns
22
Spatial Co-location Patterns
Dataset
Spatial feature A,B, C,
and their instances
Our approach (Event Centric)
• Neighborhood instead of transactions
• Spatial join on neighbor relationship
• Support  Prevalence
•Participation index = min. p_ratio
•P_ratio(A, (A,B)) = fraction of instance of
A participating in join(A,B, neighbor)
•Examples
Support(A,B)=min(2/2,3/3)=1
Support(B,C)=min(2/2,2/2)=1
Shashi Shekhar
Mining For Spatial Patterns
23
Spatial Co-location Patterns
Dataset
Partition approach
Our approach
Support(A,B)=min(2/2,3/3)=1
Support(B,C)=min(2/2,2/2)=1
Spatial feature A,B, C,
and their instances
Support A,B =2 B,C=2
Reference feature
approach
C as reference feature
Transactions: (B1) (B2)
Support (A,B) = Ǿ
Support A,B=1 B,C=2
Shashi Shekhar
Mining For Spatial Patterns
24
Spatial Outliers
Spatial Outlier: A data point that is extreme relative
to it neighbors
Case Study: traffic stations different from neighbors
[SIGKDD 2001]
Data - space-time plot, distr. Of f(x), S(x)
Distribution of base attribute:
spatially smooth
frequency distribution over value domain: normal
Classical test - Pr.[item in population] is low
Q? distribution of diff.[f(x), neighborhood agg{f(x)}]
Insight: this statistic is distributed normally!
Test: (z-score on the statistics) > 2
Performance - spatial join, clustering methods
Shashi Shekhar
Mining For Spatial Patterns
25
Spatial Outlier Detection
Given
A spatial graph G={V,E}
A neighbor relationship (K neighbors)
An attribute function f : V -> R
An aggregation function : Faggr :R k -> R
A comparison function Fdiff ( f , Faggr )
Confidence level threshold 
Statistic test function ST: R ->{T, F}
Find
O = {vi | vi V, vi is a spatial outlier}
Objective
Correctness: The attribute values of vi
is extreme, compared with its neighbors
Computational efficiency
Constraints
Fdiff and ST are algebraic aggregate
functions of f and Faggr
Computation cost dominated by I/O op.
Shashi Shekhar
Mining For Spatial Patterns
26
Spatial Outlier Detection
Spatial Outlier Detection Test
1. Choice of Spatial Statistic
S(x) = [f(x)–E y N(x)(f(y))]
Theorem: S(x) is normally distributed
if f(x) is normally distributed
2. Test for Outlier Detection
| (S(x) - s) / s | > 
Hypothesis
I/O cost determined by clustering efficiency
f(x)
Shashi Shekhar
Mining For Spatial Patterns
S(x)
27
Spatial Outlier Detection
Results
1. CCAM achieves higher
clustering efficiency (CE)
2. CCAM has lower I/O cost
3. High CE => low I/O cost
4. Big Page => high CE
CE value
Cell-Tree
Shashi Shekhar
CCAM
Mining For Spatial Patterns
I/O cost
Z-order
28
A Unified Approach Spatial Outliers
•Tests : quantitative, graphical
•Results:
•Computation = spatial self-join
•Tests: algebraic functions of join
•Join predicate: neighbor relations
•I/O-cost: f(clustering efficiency)
•Our algorithm is I/O-efficient for
Algebric tests
Scatter Plot
Original Data
Our Approach
Shashi Shekhar
Mining For Spatial Patterns
29
Graphical Spatial Tests
Moran Scatter Plot
Original Data
Variogram Cloud
Shashi Shekhar
Mining For Spatial Patterns
30
Location Prediction
Citations: IEEE Tran. on Multimedia 2002, SIAM DM Conf. 2001,
SIGKDD DMKD 2000
Problem: predict nesting site in marshes
given vegetation, water depth, distance to edge, etc.
Data - maps of nests and attributes
spatially clustered nests, spatially smooth attributes
Classical method: logistic regression, decision trees, bayesian
classifier
but, independence assumption is violated ! Misses autocorrelation !
Spatial auto-regression (SAR), Markov random field bayesian
classifier
Open issues: spatial accuracy vs. classification accurary
Open issue: performance - SAR learning is slow!
Shashi Shekhar
Mining For Spatial Patterns
31
Location Prediction
Given:
1. Spatial Framework S  {s1 ,...sn }
2. Explanatory functions: f X : S  R
3. A dependent class: fC : S  C  {c1 ,...cM }
4. A family  of function
mappings: R ... R  C
k
Find: Classification model: fˆc  
Nest locations
Distance to open water
Objective:maximize
ˆ
classification_accuracy ( f c , f c )
Constraints:
Spatial Autocorrelation exists
Vegetation durability
Shashi Shekhar
Mining For Spatial Patterns
Water depth
32
Motivation and Framework
Shashi Shekhar
Mining For Spatial Patterns
33
Solution Procedures
•
Spatial Autoregression Model (SAR)
• y = Wy + X + 
• W models neighborhood relationships
•  models strength of spatial dependencies
•  error vector
• Solutions
•  and  - can be estimated using ML or Bayesian stat.
• e.g., spatial econometrics package uses Bayesian
approach using sampling-based Markov Chain Monte
Carlo (MCMC) method.
• Likelihood-based estimation requires O(n3) ops.
• Other alternatives – divide and conquer, sparse matrix,
LU decomposition, etc.
Shashi Shekhar
Mining For Spatial Patterns
34
Evaluation
Linear Regression y  X  
Spatial Regression y  Wy  X  
Spatial model is better
Shashi Shekhar
Mining For Spatial Patterns
35
Solution Procedures
•
Markov Random Field based Bayesian Classifiers
• Pr(li | X, Li) = Pr(X|li, Li) Pr(li | Li) / Pr (X)
• Pr(li | Li) can be estimated from training data
• Li denotes set of labels in the neighborhood of si
excluding labels at si
• Pr(X|li, Li) can be estimated using kernel functions
• Solutions
• stochastic relaxation [Geman]
• Iterated conditional modes [Besag]
• Graph cut [Boykov]
Shashi Shekhar
Mining For Spatial Patterns
36
Comparison
•
•
•
•
SAR can be rewritten as y = (QX)  + Q
• where Q = (I- W)-1 which can be viewed as a spatial
smoothing operation.
• This transformation shows that SAR is similar to linear
logistic model, and thus suffers with same limitations –
i.e., SAR model assumes linear separability of classes in
transformed feature space
SAR model also make more restrictive assumptions about the
distribution of features and class shapes than MRF
The relationship between SAR and MRF are analogous to the
relationship between logistic regression and Bayesian
classifiers.
Our experimental results shows that MRF model yields better
spatial and classification accuracies than SAR predictions.
Shashi Shekhar
Mining For Spatial Patterns
37
MRF vs. SAR
Confusion Matrix:
Spatial Confusion Matrix:
Shashi Shekhar
Mining For Spatial Patterns
38
Experiment Design
Shashi Shekhar
Mining For Spatial Patterns
39
Prediction Maps(Learning)
Actual Nest Sites (Real Learning)
MRF-P Prediction (ADNP=3.36)
NZ=85
NZ=138
MRF-GMM Prediction (ADNP=5.88)
SAR Prediction (ADNP=9.80)
NZ=140
Shashi Shekhar
NZ=130
Mining For Spatial Patterns
40
Prediction Maps(Testing)
Actual Nest Sites (Real Testing)
MRF-P Prediction (ADNP=2.84)
Actual Nest Sites (Real Learning)
NZ=30
NZ=80
MRF-GMM Prediction (ADNP=3.35)
NZ=76
Shashi Shekhar
SAR Prediction (ADNP=8.63)
NZ=80
Mining For Spatial Patterns
41
Conclusion and Future Directions
Spatial domains may not satisfy assumptions of classical
methods
data: auto-correlation, continuous geographic space
patterns: global vs. local, e.g. spatial outliers vs. outliers
data exploration: maps and albums
Open Issues
patterns: hot-spots, blobology (shape), spatial trends, …
metrics: spatial accuracy(predicted locations), spatial
contiguity(clusters)
spatio-temporal dataset
scale and resolutions sentivity of patterns
geo-statistical confidence measure for mined patterns
Shashi Shekhar
Mining For Spatial Patterns
42
Reference
1.
S. Shekhar, S. Chawla, S. Ravada, A. Fetterer, X. Liu and C.T. Liu, “Spatial Databases: Accomplishments and
Research Needs”, IEEE Transactions on Knowledge and Data Engineering, Jan.-Feb. 1999.
2.
S. Shekhar and Y. Huang, “Discovering Spatial Co-location Patterns: a Summary of Results”, In Proc. of 7th
International Symposium on Spatial and Temporal Databases (SSTD01), July 2001.
3.
S. Shekhar, C.T. Lu, P. Zhang, "Detecting Graph-based Spatial Outliers: Algorithms and Applications“, the
Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2001.
4.
S. Shekhar, C.T. Lu, P. Zhang, “Detecting Graph-based Saptial Outlier”, Intelligent Data Analysis, To appear in
Vol. 6(3), 2002
5.
S. Shekhar, S. Chawla, the book “Spatial Database: Concepts, Implementation and Trends”, Prentice Hall, 2002
6.
S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Extending Data Mining for Spatial Applications: A Case Study in
Predicting Nest Locations”, Proc. Int. Confi. on 2000 ACM SIGMOD Workshop on Research Issues in Data
Mining and Knowledge Discovery (DMKD 2000), Dallas, TX, May 14, 2000.
7.
S. Chawla, S. Shekhar, W. Wu and U. Ozesmi, “Modeling Spatial Dependencies for Mining Geospatial Data”,
First SIAM International Conference on Data Mining, 2001.
8.
S. Shekhar, P.R. Schrater, R. R. Vatsavai, W. Wu, and S. Chawla, “Spatial Contextual Classification and Prediction
Models for Mining Geospatial Data”,To Appear in IEEE Transactions on Multimedia, 2002.
9.
S. Shekhar, V. Kumar, P. Tan. M. Steinbach, Y. Huang, P. Zhang, C. Potter, S. Klooster, “Mining Patterns in Earth
Science Data”, IEEE Computing in Science and Engineering (Submitted)
Shashi Shekhar
Mining For Spatial Patterns
43
Reference
10.
S. Shekhar, C.T. Lu, P. Zhang, “A Unified Approach to Spatial Outliers Detection”, IEEE Transactions on
Knowledge and Data Engineering (Submitted)
11.
S. Shekhar, C.T. Lu, X. Tan, S. Chawla, Map Cube: A Visualization Tool for Spatial Data Warehouses, as Chapter
of Geographic Data Mining and Knowledge Discovery. Harvey J. Miller and Jiawei Han (eds.), Taylor and
Francis, 2001, ISBN 0-415-23369-0.
12.
S. Shekhar, Y. Huang, W. Wu, C.T. Lu, What's Spatial about Spatial Data Mining: Three Case Studies , as Chapter
of Book: Data Mining for Scientific and Engineering Applications. V. Kumar, R. Grossman, C. Kamath, R.
Namburu (eds.), Kluwer Academic Pub., 2001, ISBN 1-4020-0033-2
13.
Shashi Shekhar and Yan Huang , Multi-resolution Co-location Miner: a New Algorithm to Find Co-location
Patterns in Spatial Datasets, Fifth Workshop on Mining Scientific Datasets (SIAM 2nd Data Mining Conference),
April 2002
Shashi Shekhar
Mining For Spatial Patterns
44