Irvine (ACM-GIS) Talk 11/06/2008

Irvine (CA), November 6, 2008
Finding Regional Co-Location Patterns for Sets
of Continuous Variables in Spatial Datasets
Christoph Eick (University of Houston, USA), Rachana Parmar
(University of Houston, USA), Wei Ding (University of
Massachusetts at Boston, USA), Tomasz Stepinski (Lunar
and Planetary Institute, Houston, USA), Jean-Philippe Nicot
(Bureau of Economic Geology, University of Texas, Austin, USA)
ACM-GIS08
Data Mining & Machine Learning Group
CS@UH
Talk Outline
1. Introduction: Co-location Mining
2. Clustering with Plug-in Fitness Functions
3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
4. Case Study: Arsenic Pollution in the Texas Water Wells
5. CLEVER: A Representative-based Clustering Algorithm
6. Conclusion
1. Introduction
• “Spatial co-locations represent the subsets of features which are frequently located together in geographic space” [Shekhar]
• Most past research centers on finding categorical co-location patterns that are global.
• However, many real-world datasets contain continuous variables, and global knowledge may be inconsistent with regional knowledge.
Regional Co-location Mining
Goal: To discover regional co-location patterns involving continuous variables in which the continuous variables take values from the wings of their statistical distribution.
A novel framework that operates in the continuous domain is proposed to accomplish this goal.
Dataset: (longitude, latitude, <concentrations>+)
Why is Regional Knowledge Important in Spatial Data Mining?
• A special challenge in spatial data mining is that information is usually not uniformly distributed in spatial datasets.
• It has been pointed out in the literature that “whole map statistics are seldom useful”, that “most relationships in spatial data sets are geographically regional, rather than global”, and that “there is no average place on the Earth’s surface” [Goodchild03, Openshaw99].
• Therefore, it is not surprising that domain experts are mostly interested in discovering hidden patterns at a regional or local scale rather than a global scale.
Related Work
• Shekhar et al. discuss several interesting approaches to mine spatial co-location patterns of categorical features.
• Huang et al. address the problem of mining co-location patterns with rare features.
• Srikant and Agrawal use discretization of continuous variables to form categorical variables on which classical association rule mining is applied.
• Calders et al. introduce an approach that uses rank correlation to mine quantitative association rules.
• Achtert et al. give a method to derive quantitative, non-spatial models to describe correlation clusters.
2. Clustering with Plug-in Fitness Functions
Motivation:
• Finding subgroups in geo-referenced datasets has many applications.
• However, in many applications the subgroups to be searched for do not share the characteristics considered by traditional clustering algorithms, such as cluster compactness and separation.
• Domain knowledge frequently imposes additional requirements concerning what constitutes a “good” subgroup.
• Consequently, it is desirable to develop clustering algorithms that provide plug-in fitness functions that allow domain experts to express desirable characteristics of the subgroups they are looking for.
• Only very few clustering algorithms published in the literature provide plug-in fitness functions; consequently, existing clustering paradigms have to be modified and extended by our research to provide such capabilities.
Current Suite of Spatial Clustering Algorithms
• Representative-based: SCEC, SRIDHCR, SPAM, CLEVER
• Grid-based: SCMRG
• Agglomerative: MOSAIC
• Density-based: SCDE, DCONTOUR (not really plug-in, but some fitness functions can be simulated)
Remark: All algorithms partition a dataset into clusters by maximizing a reward-based, plug-in fitness function.
Spatial Clustering Alg. Cont.
• Datasets are assumed to have the following structure:
  (<spatial attributes>; <non-spatial attributes>)
  e.g. (longitude, latitude; <chemical concentrations>+)
• Clusters are found in the subspace of the spatial attributes; they are called regions in the following.
• The non-spatial attributes are used by the fitness function, but neither in distance computations nor by the clustering algorithm itself.
• Clustering algorithms are assumed to maximize reward-based fitness functions that have the following structure:
  $$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{b}$$
  where b is a parameter that determines the premium put on cluster size (larger values → fewer, larger clusters)
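The reward structure q(X) can be sketched in a few lines of Python. This is only an illustration: the function names and the toy interestingness measure (the fraction of flagged objects in a cluster) are assumptions made for the example and are not part of the talk.

```python
def reward(cluster, interestingness, b):
    """reward(c) = i(c) * |c|^b for a single cluster c."""
    return interestingness(cluster) * len(cluster) ** b


def fitness(clustering, interestingness, b):
    """q(X): sum of the rewards of all clusters in the clustering X."""
    return sum(reward(c, interestingness, b) for c in clustering)


def toy_interestingness(cluster):
    # Hypothetical plug-in measure: fraction of objects flagged True.
    return sum(cluster) / len(cluster)


X = [[True, True, False, True], [False, False]]
q = fitness(X, toy_interestingness, b=2.0)  # 0.75 * 4**2 + 0.0 * 2**2

# Effect of b: one region of 4 objects versus the same objects split in two,
# all with interestingness 0.5.
merged = fitness([[True, False, True, False]], toy_interestingness, b=2.0)
split = fitness([[True, False], [True, False]], toy_interestingness, b=2.0)
```

With b = 2.0 the merged region scores higher than the split one, which illustrates why larger values of b favor fewer, larger clusters.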
3. An Interestingness Measure for Co-location Mining Involving Continuous Variables
• The goal is to discover interesting regions with interesting co-location patterns.
• Clustering algorithms that maximize fitness functions of the following form already exist:
  $$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} i(c) \cdot |c|^{b}$$
• To use those algorithms for this task, an interestingness measure has to be designed.
Co-location Measure for Continuous Variables
• Products of z-scores of continuous variables are used to measure the interestingness of co-location patterns.
• Pattern A↑: attribute A has high values
• Pattern A↓: attribute A has low values
$$z(A\uparrow, o) = \begin{cases} \text{z-score}(A, o) & \text{if } \text{z-score}(A, o) > 0 \\ 0 & \text{otherwise} \end{cases}$$
$$z(A\downarrow, o) = \begin{cases} -\text{z-score}(A, o) & \text{if } \text{z-score}(A, o) < 0 \\ 0 & \text{otherwise} \end{cases}$$
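The two cut-off definitions for high-value and low-value patterns can be written directly as code. A minimal sketch, assuming the z-score of attribute A for object o has already been computed; the function names are illustrative only.

```python
def z_up(z_score):
    """Cut-off for a high-value pattern: keep positive z-scores, else 0."""
    return z_score if z_score > 0 else 0.0


def z_down(z_score):
    """Cut-off for a low-value pattern: negate negative z-scores, else 0."""
    return -z_score if z_score < 0 else 0.0


high = z_up(1.5)      # object lies in the upper wing
not_high = z_up(-0.4)  # below the mean, so it does not support the high pattern
low = z_down(-2.0)    # object lies in the lower wing
```

Both functions return non-negative values, so products over several patterns stay non-negative and are zero as soon as one pattern is not exhibited.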
Interestingness of a Pattern
• Interestingness of a pattern B (e.g. B = {C, D, E}) for an object o:
  $$i(B, o) = \prod_{p \in B} z(p, o)$$
• Interestingness of a pattern B for a region c:
  $$\varphi(B, c) = \frac{\sum_{o \in c} i(B, o)}{|c|} \cdot \mathit{purity}(B, c)^{\theta}$$
Remark: purity(B, c) measures the percentage of objects o in region c that exhibit pattern B, i.e. for which i(B, o) > 0.
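A small worked example may help with the object-level and region-level measures. The sketch below assumes that the region-level measure multiplies the average product by the purity raised to a purity parameter theta. Objects are represented as dictionaries of z-scores, and a pattern is a tuple of (attribute, direction) pairs; this representation, the function names and the toy values are assumptions for illustration, and null values are not handled.

```python
from math import prod


def z(p, o):
    """Cut-off z-score of object o for a single pattern p = (attribute, direction)."""
    attribute, direction = p
    score = o[attribute]
    if direction == "up":
        return score if score > 0 else 0.0
    return -score if score < 0 else 0.0


def i_object(B, o):
    """Interestingness of pattern set B for object o: product of cut-off z-scores."""
    return prod(z(p, o) for p in B)


def purity(B, c):
    """Fraction of objects in region c that exhibit B, i.e. i(B, o) > 0."""
    return sum(1 for o in c if i_object(B, o) > 0) / len(c)


def phi(B, c, theta=1.0):
    """Average product over the region, scaled by purity**theta."""
    average_product = sum(i_object(B, o) for o in c) / len(c)
    return average_product * purity(B, c) ** theta


# Toy region with four wells and z-scores for two chemicals.
region = [
    {"As": 2.0, "Mo": 1.0},
    {"As": 1.0, "Mo": 3.0},
    {"As": -1.0, "Mo": 2.0},
    {"As": 0.5, "Mo": -0.5},
]
B = (("As", "up"), ("Mo", "up"))
value_theta_1 = phi(B, region, theta=1.0)  # (2 + 3 + 0 + 0) / 4 * 0.5
value_theta_2 = phi(B, region, theta=2.0)  # same average, purity squared
```

Raising theta lowers the score of this half-pure region, which is how a larger theta steers the search toward purer regions.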
Region Interestingness
• Region interestingness is assessed by computing the most prevalent pattern:
  $$i(c) = \max_{B \subseteq S,\ |B| > 1,\ P(B)} \varphi(B, c)$$
• Region interestingness solely depends on the most interesting co-location set for the region.
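The maximization over pattern sets can be sketched as a brute-force enumeration. This is an illustration under stated assumptions, not the method used in the talk: S is taken to be every attribute in both directions, the constraint P(B) is taken to be "B contains an arsenic pattern and has fewer than 5 elements", and the helper names and toy data are invented for the example.

```python
from itertools import combinations
from math import prod


def z(p, o):
    attribute, direction = p
    score = o[attribute]
    if direction == "up":
        return score if score > 0 else 0.0
    return -score if score < 0 else 0.0


def phi(B, c, theta):
    products = [prod(z(p, o) for p in B) for o in c]
    purity = sum(1 for v in products if v > 0) / len(c)
    return sum(products) / len(c) * purity ** theta


def region_interestingness(c, attributes, theta=1.0, seed="As", max_size=4):
    """Return (best value, best pattern set) over all admissible sets B."""
    S = [(a, d) for a in attributes for d in ("up", "down")]
    best_value, best_B = 0.0, None
    for size in range(2, max_size + 1):          # |B| > 1 and |B| < 5
        for B in combinations(S, size):
            if not any(a == seed for a, _ in B):  # assumed constraint P(B)
                continue
            value = phi(B, c, theta)
            if value > best_value:
                best_value, best_B = value, B
    return best_value, best_B


region = [
    {"As": 2.0, "Mo": 1.0},
    {"As": 1.0, "Mo": 3.0},
    {"As": -1.0, "Mo": 2.0},
    {"As": 0.5, "Mo": -0.5},
]
best_value, best_B = region_interestingness(region, ["As", "Mo"])
```

Sets that contain both directions of the same attribute always score zero, so they never win; a real implementation would likely skip them up front.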
Example of a Result
All experiments: P(B) = (As↑ ∈ B or As↓ ∈ B) and |B| < 5.
Experiment 1: b = 1.3, θ = 1.0

Exp. No.  Top 5 Regions  Region Size  Region Reward  Maximum Valued Pattern in the Region  Purity  Average Product for Maximum Valued Pattern
Exp. 1    1              23           174.3191       AsMoVF-                               0.83    211.0179
Exp. 1    2              40           104.8576       AsMoV                                 0.65    161.3194
Exp. 1    3              11           92.9385        AsMoVSO42-                            0.55    170.3873
Exp. 1    4              36           89.4068        AsBCl-TDS                             0.58    153.2687
Exp. 1    5              7            30.5775        AsMoClTDS                             0.57    53.5107
Summary
• Pattern interestingness in a region is evaluated using products of (cut-off) z-scores. In general, products of z-scores measure correlation.
• Additionally, purity is considered; its influence is controlled by a parameter θ.
• Finally, the parameter b determines how much premium is put on the size of a region when computing region rewards.
4. Case Study
Arsenic Water Pollution Problem
• Arsenic pollution is a serious problem in the Texas water supply.
• It is hard to explain what causes arsenic pollution to occur.
• Several datasets were created using the Ground Water Database (GWDB) of the Texas Water Development Board (TWDB), which tests water wells regularly; one of these datasets was used in the experimental evaluation in the paper:
  - All wells have non-null samples for arsenic
  - Multiple sample values are aggregated using avg/max functions
  - Other chemicals may have null values
  - Format: (Longitude, Latitude, <z-values of chemical concentrations>)
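The preprocessing described for the well data (aggregate multiple samples, tolerate nulls, convert concentrations to z-values) could look roughly like the sketch below. The function names are hypothetical, and the use of the population standard deviation is an assumption; the talk does not say which variant was used.

```python
from statistics import mean, pstdev


def aggregate(samples, how="avg"):
    """Combine several samples of one well; None marks a missing sample."""
    values = [s for s in samples if s is not None]
    if not values:
        return None  # the chemical was never measured for this well
    return max(values) if how == "max" else mean(values)


def z_values(column):
    """Convert one chemical's concentrations to z-values, keeping nulls as None."""
    present = [v for v in column if v is not None]
    mu, sigma = mean(present), pstdev(present)  # assumption: population std dev
    if sigma == 0:
        return [None if v is None else 0.0 for v in column]
    return [None if v is None else (v - mu) / sigma for v in column]


avg_sample = aggregate([1.0, None, 3.0])          # average of the non-null samples
max_sample = aggregate([1.0, 3.0], how="max")     # maximum of the samples
z_column = z_values([2.0, 4.0, None, 6.0])        # the null entry stays None
```

Keeping nulls as None rather than imputing them means that any later product of z-scores has to decide how to treat wells with missing chemicals; that choice is left open here.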
Interesting Observations
• High arsenic is a well-known problem in the Southern Ogallala aquifer in the Texas Panhandle and in the Southern Gulf Coast aquifer. The co-location mining framework was able to identify regions in these areas; for example, for b=1.3, θ=1.0 the rank 1, 2 and 3 regions are in the Ogallala aquifer and the rank 4 region is in the Gulf Coast aquifer. The approach not only identified that high arsenic is associated with high vanadium and molybdenum but was also able to discriminate against companion elements like sulfate and fluoride.
Interesting Observations cont.
• For b=1.5, the extent of arsenic contamination in Texas (Ogallala Aquifer, Southern Gulf Coast, and West Texas basins) could be recognized.
• For b=2.0, loosening the cluster definition results in a display of the known boundaries, often described as sharp, between high and low arsenic areas in the Ogallala (Ranks 2 and 4) and the Gulf Coast (Ranks 1 and 3) aquifers.
• In general, for b=1.3 and b=1.5 the discovered regions tend to lie inside Texas aquifers, which is expected, because wells inside the same aquifer are connected by water flow.
• The algorithm also finds some inconsistent co-location sets. For example, for b=1.5, the rank 3 region in West Texas has high arsenic co-located with high chloride, while the rank 4 region in South Texas has low arsenic with high chloride, which can be attributed to geographical differences between the regions.
• When θ is increased to 5, not surprisingly all top regions have purities of 90% or above.
Example: Differences in Results Medium/High Rewards for Purity
Table 5. Top 5 regions ranked by reward (as per formula 8).
All experiments: (As↑ ∈ B or As↓ ∈ B) and |B| < 5
Experiment 2: b = 1.5, θ = 1.0
Experiment 4: b = 1.5, θ = 5.0

Exp. No.  Top 5 Regions  Region Size  Region Reward  Maximum Valued Pattern in the Region  Purity  Average Product for Maximum Valued Pattern
Exp. 2    1              181          61684.5323     AsMoVF-                               0.49    52.1019
Exp. 2    2              80           24040.6315     AsBCl-TDS                             0.48    70.7322
Exp. 2    3              467          1884.8856      AsTDS                                 0.91    0.2047
Exp. 2    4              23           701.7072       AsCl-SO42-TDS                         0.78    8.1287
Exp. 2    5              189          587.9790       AsF-                                  0.78    0.2909
Exp. 4    1              7            11669.7965     AsBCl-TDS                             1.0     630.1097
Exp. 4    2              117          10407.3250     AsVF-                                 0.91    12.8550
Exp. 4    3              4            2203.2526      AsV SO42-TDS                          1.0     275.4066
Exp. 4    4              2            1531.4887      AsMoVB                                1.0     541.4630
Exp. 4    5              530          1426.9140      AsTDS                                 0.90    0.1939

[Figure: High Reward Regions, panels θ=1 and θ=5]
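The rewards in Table 5 can be checked against the columns next to them. The sketch below assumes that a region's reward is the average product times purity to the power θ times region size to the power b, which is how the fitness and interestingness definitions combine; the function name is illustrative. Both rows used here have purity 1.0, so the value of θ has no effect on the result.

```python
def region_reward(average_product, purity, size, b, theta):
    """reward = (average product * purity**theta) * size**b."""
    return average_product * purity ** theta * size ** b


# Experiment 4, rank 3: size 4, purity 1.0, average product 275.4066.
rank3 = region_reward(275.4066, 1.0, 4, b=1.5, theta=5.0)

# Experiment 4, rank 1: size 7, purity 1.0, average product 630.1097.
rank1 = region_reward(630.1097, 1.0, 7, b=1.5, theta=5.0)

# With purity 1.0 the purity factor is 1 for any theta.
same_for_any_theta = region_reward(275.4066, 1.0, 4, b=1.5, theta=1.0)
```

Both values agree with the tabulated rewards (2203.2526 and 11669.7965) up to rounding of the published columns.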
Challenges
• Kind of a “seeking a needle in a haystack” problem, because we search for both interesting places and interesting patterns.
• The interestingness measure is not anti-monotone: a superset of a co-location set might be more interesting.
• Only considering the maximum valued pattern when evaluating regions is somewhat crude (employed solution: use seeded patterns and run the algorithm multiple times).
• Observation: different fitness function parameter settings lead to quite different results, many of which are valuable to domain experts.
• New challenge: results of many runs have to be analyzed, which is a lot of manual labor → a tool is needed for that.
Representative-based Clustering
[Figure: scatter plot over Attribute1 and Attribute2 with labels 1 to 4]
Objective: Find a set of objects O_R such that the clustering X obtained by using the objects in O_R as representatives minimizes q(X).
Properties: Cluster shapes are convex polygons
Popular Algorithms: K-means, K-medoids
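How a set of representatives induces a clustering can be shown with a nearest-representative assignment. A minimal sketch, assuming Euclidean distance on two attributes; the function name and the sample points are made up for the example.

```python
from math import dist


def assign_to_representatives(objects, representatives):
    """Build one cluster per representative from the objects closest to it."""
    clusters = [[] for _ in representatives]
    for o in objects:
        nearest = min(
            range(len(representatives)),
            key=lambda r: dist(o, representatives[r]),
        )
        clusters[nearest].append(o)
    return clusters


points = [(0, 0), (1, 0), (10, 10), (11, 10)]
reps = [(0, 0), (10, 10)]  # representatives are themselves objects of the dataset
clustering = assign_to_representatives(points, reps)
```

Because every object goes to its closest representative, cluster boundaries are the perpendicular bisectors between representatives, which is why the resulting cluster shapes are convex polygons.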
5. CLEVER
(ClustEring using representatiVEs and Randomized hill climbing)
• Is a representative-based (sometimes called prototype-based) clustering algorithm
• Uses a variable number of clusters and larger neighborhood sizes to battle premature termination, and uses randomized hill climbing and adaptive sampling to reduce complexity.
• Searches for the optimal number of clusters
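The general idea of randomized hill climbing over sets of representatives can be sketched as follows. This is a simplified illustration and not the CLEVER algorithm itself: the insert/delete/replace operators, the fixed sample size p, the stopping rule, and the toy fitness function are all assumptions, and the adaptive sampling and neighborhood-size adjustments mentioned above are left out. Objects are (x, y, value) tuples; only x and y enter the distance computation.

```python
import random
from math import dist


def clusters_for(objects, reps):
    """Assign each object to its nearest representative (spatial part only)."""
    clusters = [[] for _ in reps]
    for o in objects:
        nearest = min(range(len(reps)), key=lambda r: dist(o[:2], reps[r][:2]))
        clusters[nearest].append(o)
    return [c for c in clusters if c]


def neighbor(reps, objects, rng):
    """Randomly insert, delete, or replace one representative."""
    reps = list(reps)
    non_reps = [o for o in objects if o not in reps]
    ops = ["insert", "replace"] if non_reps else []
    if len(reps) > 1:
        ops.append("delete")
    op = rng.choice(ops)
    if op in ("delete", "replace"):
        reps.pop(rng.randrange(len(reps)))
    if op in ("insert", "replace"):
        reps.append(rng.choice(non_reps))
    return reps


def clever_sketch(objects, fitness, k0=2, p=20, max_rounds=50, seed=0):
    """Hill climbing: sample p neighbors, move to the best one if it improves q."""
    rng = random.Random(seed)
    current = rng.sample(objects, k0)
    best_q = fitness(clusters_for(objects, current))
    for _ in range(max_rounds):
        sampled = [neighbor(current, objects, rng) for _ in range(p)]
        q, candidate = max(
            ((fitness(clusters_for(objects, s)), s) for s in sampled),
            key=lambda pair: pair[0],
        )
        if q <= best_q:
            break  # none of the sampled neighbors improves the fitness
        current, best_q = candidate, q
    return current, best_q


def toy_fitness(clusters, b=1.5):
    # Hypothetical reward: positive mean value of a cluster times |c|**b.
    return sum(max(sum(o[2] for o in c) / len(c), 0.0) * len(c) ** b for c in clusters)


objects = [(0, 0, 2.0), (1, 0, 1.5), (0, 1, 1.0), (9, 9, -1.0), (10, 9, -0.5), (9, 10, 0.2)]
reps, q = clever_sketch(objects, toy_fitness)
```

Because the insert and delete operators change the number of representatives, the search is free to settle on a different number of clusters than it started with.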
6. Summary
• A novel framework for mining co-location patterns involving multiple continuous variables with values from the wings of their statistical distribution is proposed.
• Regional co-location mining is approached as a clustering problem in which a reward-based fitness function has to be maximized.
• The approach was successfully applied in a real-world case study involving arsenic contamination. The case study revealed known areas of arsenic contamination and also some unknown areas with interesting features. Different parameters lead to characterization of arsenic patterns at different scales.
• In general, the regional co-location mining framework has been valuable to domain experts in that it provided a data-driven approach that suggests promising hypotheses for future research.
• A novel prototype-based clustering algorithm named CLEVER was also introduced.
References
S. Shekhar and Y. Huang, “Discovering spatial co-location patterns: A summary of results,” Lecture Notes in Computer Science, vol. 2121, pp. 236+, 2001.
Y. Huang, J. Pei, and H. Xiong, “Mining co-location patterns with rare events from spatial data sets,” Geoinformatica, vol. 10, no. 3, pp. 239–260, 2006.
R. Srikant and R. Agrawal, “Mining quantitative association rules in large relational tables,” in SIGMOD ’96: Proceedings of the 1996 ACM SIGMOD international conference on Management of data. New York, NY, USA: ACM, 1996, pp. 1–12.
T. Calders, B. Goethals, and S. Jaroszewicz, “Mining rank-correlated sets of numerical attributes,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 96–105.
E. Achtert, C. Böhm, H.-P. Kriegel, P. Kröger, and A. Zimek, “Deriving quantitative models for correlation clusters,” in KDD ’06: Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. New York, NY, USA: ACM, 2006, pp. 4–13.
C. F. Eick, B. Vaezian, D. Jiang, and J. Wang, “Discovery of interesting regions in spatial datasets using supervised clustering,” in Proceedings of the 10th European conference on Principles of Data Mining and Knowledge discovery, 2006.
Region Discovery Framework
Objective: Develop and implement an integrated
framework to automatically discover interesting
regional patterns in spatial datasets. Treats region
discovery as a clustering problem.
Region Discovery Framework Continued
The clustering algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering X={c1,…,ck} as follows:
$$q(X) = \sum_{c \in X} \mathit{reward}(c) = \sum_{c \in X} \mathit{interestingness}(c) \cdot \mathit{size}(c)^{b} \quad \text{with } b > 1$$
Objective:
Find c1,…,ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X={c1,…,ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous in the spatial subspace
4. c1,…,ck ⊆ O
5. c1,…,ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
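The disjointness requirement and the final ranking step of the problem statement can be illustrated with a short helper. This is a sketch only: the function name, the reporting threshold, and the toy interestingness measure are assumptions, and spatial contiguity (requirement 3) is not checked.

```python
def rank_regions(clusters, interestingness, b=1.5, min_reward=0.0):
    """Check that clusters are pairwise disjoint, then rank them by reward."""
    seen = set()
    for c in clusters:
        for o in c:
            if o in seen:
                raise ValueError("clusters must be pairwise disjoint")
            seen.add(o)
    rewards = [(interestingness(c) * len(c) ** b, c) for c in clusters]
    ranked = sorted(rewards, key=lambda pair: pair[0], reverse=True)
    # Low-reward clusters are dropped from the report.
    return [(r, c) for r, c in ranked if r > min_reward]


def fraction_even(cluster):
    # Hypothetical interestingness: share of even object ids in the cluster.
    return sum(1 for o in cluster if o % 2 == 0) / len(cluster)


report = rank_regions([[1, 2], [4, 6, 8], [3, 5]], fraction_even)
reported_regions = [c for _, c in report]
```

The cluster with zero interestingness receives zero reward and is therefore not reported, mirroring the remark that low-reward clusters are frequently omitted.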