Discovering Interesting Regions in Spatial Data Sets
Christoph F. Eick for Data Mining Class
1. Motivation: Examples of Region Discovery
2. Region Discovery Framework
3. A Fitness Function for Hotspot Discovery
4. Other Fitness Functions
5. A Family of Clustering Algorithms for Region Discovery
6. Summary
1. Motivation: Examples of Region Discovery
Application 1: Supervised Clustering [EVJW07]
Application 2: Regional Association Rule Mining and Scoping [DEWY06, DEYWN07]
Application 3: Find Interesting Regions with respect to a Continuous Variable [CRET08]
Application 4: Regional Co-location Mining Involving Continuous Variables [EPWSN08]
Application 5: Find "representative" regions (Sampling)
[Figure: wells in Texas (green = safe well with respect to arsenic, red = unsafe well), and the regions discovered by the RD-Algorithm for β = 1.01 and β = 1.04.]
2. Region Discovery Framework
• We assume we have spatial or spatio-temporal datasets that have the following structure: (x, y, [z], [t]; <non-spatial attributes>), e.g. (longitude, latitude, class_variable) or (longitude, latitude, continuous_variable).
• Clustering occurs in the (x, y, [z], [t])-space; regions are found in this space.
• The non-spatial attributes are used by the fitness function but neither in distance computations nor by the clustering algorithm itself.
• For the remainder of the talk, we view region discovery as a clustering task and assume that regions and clusters are the same.
Region Discovery Framework Continued
The algorithms we currently investigate solve the following problem:
Given:
A dataset O with a schema R
A distance function d defined on instances of R
A fitness function q(X) that evaluates a clustering X = {c1,…,ck} as follows:
q(X) = Σc∈X reward(c) = Σc∈X interestingness(c)·size(c)^β, with β > 1
Objective:
Find c1,…,ck ⊆ O such that:
1. ci ∩ cj = ∅ if i ≠ j
2. X = {c1,…,ck} maximizes q(X)
3. All clusters ci ∈ X are contiguous (each pair of objects belonging to ci has to be Delaunay-connected with respect to ci and to d)
4. c1 ∪ … ∪ ck ⊆ O
5. c1,…,ck are usually ranked based on the reward each cluster receives, and low-reward clusters are frequently not reported
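A minimal sketch of how this additive fitness function can be evaluated; the slides define q(X) only mathematically, so the function and parameter names below are illustrative assumptions, with interestingness a task-specific plug-in (purity, variance, etc.).

def q(clustering, interestingness, beta=1.1):
    # q(X) = sum over clusters c in X of interestingness(c) * size(c)**beta
    assert beta > 1  # beta > 1 rewards larger regions of equal interestingness
    return sum(interestingness(c) * len(c) ** beta for c in clustering)

Because β > 1, merging two equally interesting regions increases the total reward, which biases the search toward larger contiguous regions.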
Challenges for Region Discovery
1. Recall and precision with respect to the discovered regions should be high.
2. Definition of measures of interestingness and of corresponding parameterized reward-based fitness functions that capture "what domain experts find interesting in spatial datasets".
3. Detection of regions at different levels of granularity (from very local to almost global patterns).
4. Detection of regions of arbitrary shapes.
5. Necessity to cope with very large datasets.
6. Regions should be properly ranked by relevance (reward); in many applications only the top-k regions are of interest.
7. Design and implementation of clustering algorithms that are suitable to address challenges 1, 3, 4, 5, and 6.
3. Fitness Function for Supervised Clustering
Class of interest: Unsafe_Well
Prior probability: prior(Unsafe) = 20%
Parameters: γ1 = 0.5, γ2 = 1.5; R+ = 1, R− = 1; β = 1.1, η = 1.
Purity thresholds: clusters are rewarded as cold spots below γ1·prior = 10% and as hotspots above γ2·prior = 30%.

Cluster   | |c| | P(c, Unsafe)  | Reward
Cluster 1 |  50 | 20/50 = 40%   | 1/7 · 50^1.1
Cluster 2 | 200 | 40/200 = 20%  | 0
Cluster 3 | 200 | 10/200 = 5%   | 1/2 · 200^1.1
Cluster 4 | 350 | 30/350 = 8.6% | 0.143 · 350^1.1
Cluster 5 | 200 | 100/200 = 50% | 2/7 · 200^1.1
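The slide gives the parameter values and the resulting rewards but not the interestingness formula itself. The sketch below reconstructs a purity-based hotspot/cold-spot interestingness from those numbers; treat the exact functional form as an assumption, though it reproduces every reward in the table.

# Hedged reconstruction: rewards R+ * ((p - g2*prior)/(1 - g2*prior))**eta
# for hotspots and R- * ((g1*prior - p)/(g1*prior))**eta for cold spots.
def interestingness(p, prior=0.2, g1=0.5, g2=1.5, r_plus=1.0, r_minus=1.0, eta=1.0):
    low, high = g1 * prior, g2 * prior   # 10% and 30% in this example
    if p >= high:                        # hotspot of Unsafe_Well
        return r_plus * ((p - high) / (1 - high)) ** eta
    if p <= low:                         # cold spot
        return r_minus * ((low - p) / low) ** eta
    return 0.0                           # neither: no reward

beta = 1.1
for size, p in [(50, 0.40), (200, 0.20), (200, 0.05), (350, 30/350), (200, 0.50)]:
    print(size, round(interestingness(p), 3), interestingness(p) * size ** beta)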
4. Fitness Functions for Other Region Discovery Tasks
4.1 Creating Contour Maps for Water Temperature (Temp)
[Fig. 1: Sea surface temperature on July 7, 2002; a single region and its summary: Mean = 11.2, Var = 2.2, Reward: 48.5, Rank: 3.]
1. Examples in the data set WT have the form (x, y, temp); var(c, temp) denotes the variance of variable temp in region c.
2. interestingness(c) =
IF var(c,temp) > var(WT,temp)
THEN 0
ELSE min(1, log20(var(WT,temp)/var(c,temp)))^η
with η being a parameter (with default 1).
3. Basically, regions receive rewards if their variance is lower than the variance of the variable temperature for the whole data set, and regions whose variance is at least 20 times lower receive the maximum reward of 1.
4.2 Finding Regions with High Water Temperature Differences
1. Examples in the data set WT have the form (x, y, temp).
2. Fitness function: Let c be a cluster to be evaluated.
interestingness(c) =
IF var(c,temp) < var(WT,temp)
THEN 0
ELSE min(1, log20(var(c,temp)/var(WT,temp)))^η
with η being a parameter (with default 1).
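A minimal sketch of the two variance-based interestingness functions of Sections 4.1 and 4.2; the function names are illustrative, and the placement of the exponent η follows the reading above.

import math

def low_variance_interestingness(var_c, var_wt, eta=1.0):
    # 4.1: reward regions whose temp variance is LOWER than the dataset's;
    # the reward saturates at 1 once var_wt/var_c >= 20 (log base 20).
    if var_c > var_wt or var_c == 0:   # zero-variance guard added here
        return 0.0
    return min(1.0, math.log(var_wt / var_c, 20)) ** eta

def high_variance_interestingness(var_c, var_wt, eta=1.0):
    # 4.2: the mirror image, rewarding ELEVATED variance.
    if var_c < var_wt:
        return 0.0
    return min(1.0, math.log(var_c / var_wt, 20)) ** eta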
4.3 Programming Project Fitness Functions: Purity
[Figure: three regions with class-distribution vectors r1 = (6, 2, 2), r2 = (0, 0, 5), r3 = (2, 2, 1).]
We assume we have 3 classes; in r1 we have 6 objects of class 1, 2 objects of class 2, and 2 objects of class 3.
We assume th = 0.5 and η = 2.
i(r1) = (0.6 − 0.5)**2 = 0.01
i(r2) = (1 − 0.5)**2 = 0.25
i(r3) = 0
q(X) = q({r1, r2, r3}) = 0.01·10^β + 0.25·5^β
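A hedged sketch of this purity-based fitness; the form i(r) = (purity(r) − th)^η for purity above th (else 0) is inferred from the worked numbers, and the value of β is left symbolic on the slide.

def purity_interestingness(counts, th=0.5, eta=2.0):
    # counts: number of objects per class in the region, e.g. (6, 2, 2)
    purity = max(counts) / sum(counts)   # fraction of the majority class
    return (purity - th) ** eta if purity > th else 0.0

beta = 1.1  # illustrative value; any beta > 1 fits the framework
regions = [(6, 2, 2), (0, 0, 5), (2, 2, 1)]
q = sum(purity_interestingness(r) * sum(r) ** beta for r in regions)
# reproduces q(X) = 0.01 * 10**beta + 0.25 * 5**beta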
Programming Project Fitness Function: Variance
[Figure: dataset O with Var(O) = 100 and four regions: var(r1) = 80, Var(r2) = 200, Var(r3) = 1100, Var(r4) = 20.]
We assume η = 1 and th = 1.5.
i(r1) = 0
i(r2) = (2 − 1.5) = 0.5
i(r3) = (11 − 1.5) = 9.5
i(r4) = 0
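The worked numbers imply an interestingness of the form i(r) = (var(r)/var(O) − th)^η when the variance ratio exceeds th, else 0; the sketch below encodes that inferred form and reproduces all four values.

def variance_interestingness(var_r, var_o, th=1.5, eta=1.0):
    ratio = var_r / var_o   # variance elevation relative to the whole dataset O
    return (ratio - th) ** eta if ratio > th else 0.0

var_o = 100
for name, var_r in [("r1", 80), ("r2", 200), ("r3", 1100), ("r4", 20)]:
    print(name, variance_interestingness(var_r, var_o))
# r1 -> 0, r2 -> 0.5, r3 -> 9.5, r4 -> 0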
Interestingness Function: Binary Co-location
[Figure: four regions with objects given as (zB1(o), zB2(o)) pairs:
r1: (1, 1), (-1, 1), (1, 0.6)
r2: (-1, -4), (-0.5, -1), (-0.5, 0)
r3: (objects not legible in the figure)
r4: (1, -1), (1, 1), (0.3, -0.1)]
Binary co-location: i(o, {B1, B2}) = zB1(o)·zB2(o); e.g. for the object (-1, -4), the z-value of B1 is -1 and the z-value of B2 is -4.
We assume η = 1, th = 0.1 and A = {B1, B2}.
i(r1) = |1 − 1 + 0.6|/3 − 0.1 = 0.1
i(r2) = |4 + 0.5 + 0|/3 − 0.1 = 1.4
i(r3) = …
i(r4) = 0 because |−1 + 1 − 0.03|/3 = 0.01 < 0.1
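From these worked values, the per-region interestingness appears to be i(r) = (|Σo i(o, A)| / |r| − th)^η when the average exceeds th, else 0; this form is an inference, but the sketch below reproduces the three computed values.

def colocation_interestingness(points, th=0.1, eta=1.0):
    # points: (zB1(o), zB2(o)) per object o in the region
    avg = abs(sum(z1 * z2 for z1, z2 in points)) / len(points)
    return (avg - th) ** eta if avg > th else 0.0

r1 = [(1, 1), (-1, 1), (1, 0.6)]
r2 = [(-1, -4), (-0.5, -1), (-0.5, 0)]
r4 = [(1, -1), (1, 1), (0.3, -0.1)]
print(colocation_interestingness(r1))  # 0.1 (up to floating-point error)
print(colocation_interestingness(r2))  # 1.4
print(colocation_interestingness(r4))  # 0.0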
Programming Project Function: MSE
[Figure: region r1 with objects (2,2), (4,4); region r2 with objects (-1,-1), (-7,-7), (-4,-4).]
MSE(r1) = (1**2 + 1**2 + 1**2 + 1**2)/2 = 2
MSE(r2) = (3**2 + 3**2 + 3**2 + 3**2 + 0 + 0)/3 = 12
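A minimal sketch of this MSE computation, the mean squared Euclidean distance of a region's objects to their centroid, which reproduces both values above.

def mse(points):
    # mean squared distance of the region's objects to their centroid
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n

print(mse([(2, 2), (4, 4)]))                # 2.0  (centroid (3, 3))
print(mse([(-1, -1), (-7, -7), (-4, -4)]))  # 12.0 (centroid (-4, -4))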
4.4 Regional Co-location Mining
[Figure: a dataset partitioned into regions R1-R4, illustrating regional co-location patterns versus a global co-location pattern.]
Task: Find co-location patterns for the dataset shown in the figure.
Global co-location: two of the classes (shown as symbols in the figure) are co-located in the whole dataset.
A Reward Function for Binary Co-location
Task: Find regions in which the density of 2 or more classes is elevated. In general, multipliers λC are computed for every region r, indicating how much the density of instances of class C is elevated in region r compared to C's density in the whole space; the interestingness of a region with respect to two classes C1 and C2 is assessed proportional to the product λC1·λC2.
Example: Binary Co-Location Reward Framework:
λC(r) = p(C,r)/prior(C)
ΛC1,C2 = 1/(prior(C1) + prior(C2))   ("maximum multiplier")
kC1,C2(r) = IF λC1(r) < 1 or λC2(r) < 1 THEN 0
ELSE sqrt((λC1(r) − 1)·(λC2(r) − 1))/(ΛC1,C2 − 1)
interestingness(r) = maxC1,C2; C1≠C2 kC1,C2(r)
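A direct transcription of these formulas into code; the names are illustrative, and the per-region class probabilities p(C, r) are assumed to be given as a dictionary.

from itertools import combinations
from math import sqrt

def k(p_r, priors, c1, c2):
    # lambda_C(r) = p(C, r) / prior(C): density elevation of class C in r
    l1 = p_r[c1] / priors[c1]
    l2 = p_r[c2] / priors[c2]
    if l1 < 1 or l2 < 1:
        return 0.0
    lam_max = 1 / (priors[c1] + priors[c2])   # maximum multiplier
    return sqrt((l1 - 1) * (l2 - 1)) / (lam_max - 1)

def interestingness(p_r, priors):
    # best co-location strength over all unordered class pairs
    return max(k(p_r, priors, c1, c2) for c1, c2 in combinations(priors, 2))

# e.g. interestingness({'A': 0.4, 'B': 0.6}, {'A': 0.2, 'B': 0.3}) == 1.0,
# since both densities are doubled and the maximum multiplier is 2.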
The Ultimate Vision of the Presented Research
[Figure: architecture of a region discovery engine. A database integration tool extracts a data set from spatial databases; a measure-of-interestingness acquisition tool lets a domain expert define a fitness function; a family of clustering algorithms takes the data set and the fitness function and produces a ranked set of interesting regions and their properties, which visualization tools present in a region discovery display.]
How to Apply the Suggested Methodology
1. With the assistance of domain experts, determine the structure of the dataset to be used.
2. Acquire a measure of interestingness for the problem at hand (in the examples discussed before, this was purity, variance, MSE, or the probability elevation of two or more classes).
3. Convert the measure of interestingness into a reward-based fitness function. The designed fitness function should assign a reward of 0 to "boring" regions. It is also a good idea to normalize rewards by limiting the maximum reward to 1.
4. After the region discovery algorithm has been run, rank and visualize the top-k regions with respect to the rewards obtained (interestingness(c)·size(c)^β) and their properties, which are usually task-specific.
5. A Family of Clustering Algorithms for Region Discovery
1. Supervised Partitioning Around Medoids (SPAM)
2. Representative-based Clustering Using Randomized Hill Climbing (CLEVER)
3. Supervised Clustering using Evolutionary Computing (SCEC)
4. Single Representative Insertion/Deletion Hill Climbing with Restart (SRIDHCR)
5. Supervised Clustering using Multi-Resolution Grids (SCMRG)
6. Agglomerative Clustering (MOSAIC)
7. Supervised Clustering using Density Estimation Techniques (SCDE)
8. Clustering using Density Contouring (DCONTOUR)
Remark: For more details about SCEC, SPAM, and SRIDHCR see [EZZ04, ZEZ06]; the PKDD06 paper briefly discusses SCMRG.
CLEVER
→ See separate slideshow.
Steps of Grid-based Clustering Algorithms
Basic grid-based algorithm:
1. Define a set of grid-cells.
2. Assign objects to the appropriate grid cell and compute the density of each cell.
3. Eliminate cells whose density is below a certain threshold t.
4. Form clusters from contiguous (adjacent) groups of dense cells (usually minimizing a given objective function).
A simple version of a grid-based algorithm merges cells greedily as long as merging improves q(X); a sketch of steps 1-3 and the adjacency test follows below.
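A minimal sketch of steps 1-3 plus the cell-adjacency test needed in step 4, assuming 2D points and a uniform cell size; the names and parameters are illustrative.

from collections import defaultdict

def dense_cells(points, cell_size, threshold):
    # Steps 1-2: the grid is defined implicitly by integer cell indices;
    # each object lands in the cell containing it, giving cell densities.
    cells = defaultdict(list)
    for x, y in points:
        cells[(int(x // cell_size), int(y // cell_size))].append((x, y))
    # Step 3: eliminate cells whose density is below the threshold t.
    return {c: pts for c, pts in cells.items() if len(pts) >= threshold}

def adjacent(c1, c2):
    # Step 4 forms clusters from contiguous groups of dense cells; two
    # cells are adjacent if their indices differ by at most 1 per axis.
    return c1 != c2 and abs(c1[0] - c2[0]) <= 1 and abs(c1[1] - c2[1]) <= 1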
Advantages of Grid-based Clustering Algorithms
• Fast:
– No distance computations
– Clustering is performed on summaries and not on individual objects; complexity is usually O(#populated-grid-cells) and not O(#objects)
– Easy to determine which clusters are neighboring
• Shapes are limited to unions of grid-cells
Ideas of SCMRG (Divisive, Multi-Resolution Grids)
Cell processing strategy:
1. If a cell receives a reward that is larger than the sum of the rewards of its children: return that cell.
2. If a cell and its children do not receive any reward: prune.
3. Otherwise, process the children of the cell (drill down).
Code SCMRG
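The code shown on the original slide is not included in this transcript; the following is a hedged sketch of the cell-processing strategy described above, assuming a reward function and a way to enumerate a cell's children at the next grid resolution.

def scmrg(cell, reward, children, depth):
    # Hedged sketch, not the original SCMRG code. Returns the list of
    # cells kept as regions in the subtree rooted at `cell`.
    r = reward(cell)
    kids = children(cell)
    if depth == 0 or not kids:              # finest resolution reached
        return [cell] if r > 0 else []
    below = [c for kid in kids for c in scmrg(kid, reward, children, depth - 1)]
    if r > sum(reward(c) for c in below):   # rule 1: the cell itself wins
        return [cell] if r > 0 else []
    if r == 0 and not below:                # rule 2: nothing rewarded, prune
        return []
    return below                            # rule 3: drill down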
Parameters of SCMRG
→ See separate transparency.
6. Summary
1. A framework for region discovery that relies on additive, reward-based fitness functions and views region discovery as a clustering problem has been introduced.
2. The framework finds interesting places and their associated patterns.
3. The framework extracts regional knowledge from spatial datasets.
4. The ultimate vision of this research is the development of region discovery engines that assist earth scientists in finding interesting regions in spatial datasets.
Why should people use Region Discovery Engines (RDE)?
An RDE finds sub-regions with special characteristics in large spatial datasets and presents its findings in an understandable form. This is important for:
• Focused summarization
• Finding interesting subsets in spatial datasets for further studies
• Identifying regions with unexpected patterns; because they are unexpected, they deviate from global patterns, and their regional characteristics are therefore frequently important for domain experts
• Without powerful region discovery algorithms, finding regional patterns tends to be haphazard and only leads to discoveries if ad-hoc region boundaries have enough resemblance to the true decision boundary
• Exploratory data analysis of a mostly unknown dataset
• Co-location statistics are frequently blurred when arbitrary region definitions are used, hiding the true relationship between two co-occurring phenomena: the relationship becomes invisible when averages are taken over regions in which it is watered down by objects that do not contribute to it (example: high crime rates along the major rivers in Texas)
• Data set reduction; focused sampling