Semantic Aspects in Spatial Data Mining
Vania Bogorny
Introduction
• Existing approaches to spatial data mining generally do not make use of prior knowledge
• Bogorny (2006) and Bogorny (2007) introduced the idea of using background knowledge
– in data preprocessing, to reduce spatial joins
– in spatial association rule mining, to eliminate well-known patterns
2
Main Problems
• Unnecessary spatial relationship computation
• Large numbers of association rules
• Many associations are well-known natural geographic dependences
• Existing approaches for mining spatial association rules (SAR) are Apriori-like
– Most approaches do not make use of background knowledge
– They use syntactic constraints for frequent set and rule pruning
– Only the data is considered, not the schema
• Result
– The same associations that the database designer represented explicitly in the schema are extracted by SAR mining algorithms
3
Spatial Relationships (Güting, 1994)
[Figure: examples of the three kinds of spatial relationship between features A, B, and C.
Topological: disjoint, touches, overlaps, contains, inside, equals, crosses.
Distance: distance d between two features.
Order: B north A, C southeast A.]
4
Topological Relationships – GEOMETRICALLY POSSIBLE
OGC standard
• Order and Distance relationships may exist among any
pair of spatial features
• Topological Relations depend on the geometry
[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against the geometric combinations of the two features; the geometry symbols and cell marks were lost in extraction.]
5
Topological Relationships – SEMANTICALLY CONSISTENT
[Table: topological relations (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against semantic combinations of feature types; the cell marks were lost in extraction. Semantic combinations listed:
Factory and Hospital
Bridge and River
Factory and Airport
River and Road
Beach and Sea
State and Country]
6
Spatial Relationships
– Mandatory (spatial constraints, i.e. dependences):
<Island> <inside> <1><1> <Water Body>
– Prohibited:
<River> <contains> <0><0> <Road>
– Possible (normally undefined):
Road crosses River
For data mining and knowledge discovery, the POSSIBLE and PROHIBITED relationships are the interesting ones.
All others are well known.
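The three categories above can be sketched as a small classifier. This is only an illustration: it assumes that the two bracketed numbers in the slide notation (<1><1>, <0><0>) are minimum and maximum cardinalities, and the function name is hypothetical.

```python
from typing import Optional


def classify_relationship(min_card: int, max_card: Optional[int]) -> str:
    """Classify a schema relationship from its cardinality bounds.

    Assumption: <min><max> in the slide notation are cardinalities;
    max_card=None stands for "no constraint declared".
    """
    if max_card == 0:
        return "prohibited"  # e.g. <River> <contains> <0><0> <Road>
    if min_card >= 1:
        return "mandatory"   # e.g. <Island> <inside> <1><1> <Water Body>
    return "possible"        # e.g. Road crosses River


for bounds in [(1, 1), (0, 0), (0, None)]:
    print(bounds, classify_relationship(*bounds))
```

Under this reading, only the "mandatory" class corresponds to well-known dependences that a mining step could skip.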
7
Well Known Geographic Dependences
Non-obvious spatial relationships
Well known dependences
Is_a(gasStation) → intersects(street) (100%)
Is_a(island) → within(waterResource) (100%)
8
Well Known Relationships X Association Rules
contains(gasStation) → contains(Street) (100%)
intersects(busStop) → intersects(Street) (100%)
contains(viaduct) → contains(road) (100%)
[Figure: map with the layers Bus Stop, Street, Bridges & Viaducts, Roads, Vegetation]
9
Well Known Associations – Conceptual Schemas
{State, Country}
{Factory, County}
{Island, WaterBody}
….
10
Well Known Associations – Conceptual Schemas
Source: 1ª Divisão de Levantamento do Exército Brasileiro (1st Survey Division of the Brazilian Army)
11
Well Known Associations – Conceptual Schemas
12
Well Known Associations – Geo-Ontologies
Geographic dependences are explicit in geo-ontologies (Bogorny, 2005b)
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
      <owl:allValuesFrom rdf:resource="#WaterResource"/>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="#Within"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>
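As a rough illustration of how such a dependence could be read out of a geo-ontology, the sketch below parses the fragment above with Python's standard xml.etree.ElementTree. The wrapping rdf:RDF element, the function name, and the rule "minCardinality >= 1 means a dependence" are assumptions made for this example, not part of the original slides.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
RDFS = "http://www.w3.org/2000/01/rdf-schema#"
OWL = "http://www.w3.org/2002/07/owl#"

# The slide fragment, wrapped in an rdf:RDF root so that it parses on its own.
DOC = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:rdfs="{RDFS}" xmlns:owl="{OWL}">
  <owl:Class rdf:ID="Island">
    <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
    <rdfs:subClassOf>
      <owl:Restriction>
        <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
        <owl:allValuesFrom rdf:resource="#WaterResource"/>
        <owl:onProperty>
          <owl:ObjectProperty rdf:about="#Within"/>
        </owl:onProperty>
      </owl:Restriction>
    </rdfs:subClassOf>
  </owl:Class>
</rdf:RDF>"""


def extract_dependences(rdf_xml):
    """Return (class, property, target) for restrictions with minCardinality >= 1."""
    root = ET.fromstring(rdf_xml)
    found = []
    for cls in root.iter(f"{{{OWL}}}Class"):
        name = cls.get(f"{{{RDF}}}ID")
        for restr in cls.iter(f"{{{OWL}}}Restriction"):
            min_card = restr.find(f"{{{OWL}}}minCardinality")
            target = restr.find(f"{{{OWL}}}allValuesFrom")
            prop = restr.find(f"{{{OWL}}}onProperty/{{{OWL}}}ObjectProperty")
            if min_card is None or target is None or prop is None:
                continue  # not the restriction pattern used in the slide
            if int(min_card.text) >= 1:
                found.append((name,
                              prop.get(f"{{{RDF}}}about").lstrip("#"),
                              target.get(f"{{{RDF}}}resource").lstrip("#")))
    return found


print(extract_dependences(DOC))
```

A production system would more likely use a dedicated RDF/OWL library; the standard-library parser is used here only to keep the sketch self-contained.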
13
Well Known Associations – Geo-Ontologies
14
Well known dependences X Spatial Association
Rules (SAR)
• Well-known dependences affect the 3 main steps in the process of mining SAR:
– Spatial predicate computation: unnecessary relationships are computed
– Frequent set generation: frequent itemsets with well-known patterns are generated
– Association rule extraction: a high number of rules with well-known dependences is produced
15
Well Known Dependences in SAR
Example of preprocessed spatial dataset
Tuple (city) | Spatial predicates
1 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2 | contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3 | contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4 | contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6 | contains(Hospital), contains(TreatedWaterNet), contains(Factory)
17
Problem 1 – Geographic Dependences between the Target
Feature Type and Relevant Feature Types
Dependence = City and TreatedWaterNet
contains(Hospital) → contains(TreatedWaterNet)
Tuple (city) | Spatial predicates
1 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2 | contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3 | contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4 | contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5 | contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6 | contains(Hospital), contains(TreatedWaterNet), contains(Factory)
Minconf = 70%
100% support
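The effect of this dependence can be checked with a few lines of Python. The sketch below encodes the six tuples of the example dataset and computes support and confidence in the usual association-rule sense; the helper names are illustrative only.

```python
P, H, N, F, W = ("contains(Port)", "contains(Hospital)",
                 "contains(TreatedWaterNet)", "contains(Factory)",
                 "crosses(WaterBody)")

# Example dataset from the slide: one predicate set per city.
DATASET = {
    1: {P, H, N, F, W},
    2: {H, N, W},
    3: {P, N, F, W},
    4: {P, H, N, W},
    5: {P, H, N, F, W},
    6: {H, N, F},
}


def count(items, dataset):
    """Number of tuples that contain every predicate in items."""
    return sum(items <= row for row in dataset.values())


def support(items, dataset):
    return count(items, dataset) / len(dataset)


def confidence(lhs, rhs, dataset):
    return count(lhs | rhs, dataset) / count(lhs, dataset)


print(support({N}, DATASET))          # contains(TreatedWaterNet) alone
print(confidence({H}, {N}, DATASET))  # contains(Hospital) -> contains(TreatedWaterNet)
```

Because contains(TreatedWaterNet) holds for every city, any rule with it as consequent reaches 100% confidence and passes any minconf threshold without carrying new information.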
18
Problem 2 - Dependences among Relevant Feature Types
Dependence = {Port, WaterBody}
contains(Port) → crosses(WaterBody)
Minsup = 50%
25 frequent sets (6 contain the dependence)
9 closed frequent sets (3 have the dependence)
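The counts on this slide can be reproduced by brute force. The sketch below uses single-letter predicate names, assuming A = contains(Port), C = contains(Hospital), D = contains(TreatedWaterNet), T = contains(Factory), W = crosses(WaterBody), and enumerates every predicate set instead of running Apriori, which is fine for five predicates.

```python
from itertools import combinations

# One predicate set per city (tuples 1..6 of the example dataset).
ROWS = [set("ACDTW"), set("CDW"), set("ADTW"),
        set("ACDW"), set("ACDTW"), set("CDT")]


def frequent_sets(rows, minsup):
    """Enumerate all predicate sets whose support is at least minsup."""
    min_count = minsup * len(rows)
    items = sorted(set().union(*rows))
    found = []
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if sum(set(combo) <= row for row in rows) >= min_count:
                found.append(frozenset(combo))
    return found


all_sets = frequent_sets(ROWS, 0.5)
with_dependence = [s for s in all_sets if {"A", "W"} <= s]
print(len(all_sets), len(with_dependence))  # prints: 25 6
```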
19
Pruning Methods using
Background Knowledge
Frequent Set Pruning (Apriori-KC) (Bogorny, 2006a)
Given: φ, // set of knowledge constraints
Ψ, // dataset generated with spatial_predicate_extraction
minsup // minimum support
L1 = {large 1-predicate sets};
For ( k = 2; Lk-1 != ∅; k++ ) do begin
  Ck = apriori_gen(Lk-1); // generates new candidates
  If (k = 2)
    // remove pairs with dependences
    (step 1) Delete from C2 all pairs with a dependence in φ;
  Forall rows w ∈ Ψ do begin
    Cw = subset(Ck, w); // candidates contained in w
    Forall candidates c ∈ Cw do
      c.count++;
  End;
  Lk = {c ∈ Ck | c.count ≥ minsup};
End;
Answer = ∪k Lk
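A minimal Python sketch of the listing above follows. It is an illustration, not the authors' implementation: the function names, the use of frozensets, and the fractional minsup are assumptions. The only change relative to plain Apriori is the removal of dependence pairs from the candidates when k = 2; the prune step of apriori_gen then keeps every superset of a removed pair from being generated.

```python
from itertools import combinations


def apriori_gen(prev_level):
    """Join and prune steps of Apriori over frozensets of equal size."""
    prev = set(prev_level)
    candidates = set()
    for a in prev:
        for b in prev:
            union = a | b
            if len(union) != len(a) + 1:
                continue
            # Prune: every (k-1)-subset of the candidate must be frequent.
            if all(frozenset(s) in prev for s in combinations(union, len(a))):
                candidates.add(union)
    return candidates


def apriori_kc(dataset, minsup, dependences=()):
    """Frequent predicate sets, skipping pairs listed as known dependences."""
    rows = [frozenset(r) for r in dataset]
    min_count = minsup * len(rows)
    blocked = {frozenset(d) for d in dependences}
    items = {i for r in rows for i in r}
    level = {frozenset([i]) for i in items
             if sum(i in r for r in rows) >= min_count}
    answer = set(level)
    k = 2
    while level:
        candidates = apriori_gen(level)
        if k == 2:
            candidates -= blocked  # step 1: delete pairs with a dependence
        level = {c for c in candidates
                 if sum(c <= r for r in rows) >= min_count}
        answer |= level
        k += 1
    return answer


DATASET = ["ACDTW", "CDW", "ADTW", "ACDW", "ACDTW", "CDT"]
print(len(apriori_kc(DATASET, 0.5)))
print(len(apriori_kc(DATASET, 0.5, dependences=[("A", "W")])))
```

On the six-city example with minsup 50%, the first call yields the 25 frequent sets and the second drops the 6 that contain both A and W.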
21
Understanding the Pruning Methods
Dependences {D} and {A,W}
a) dataset
Tid (city) | Predicate set
1 | A, C, D, T, W
2 | C, D, W
3 | A, D, T, W
4 | A, C, D, W
5 | A, C, D, T, W
6 | C, D, T
b) frequent predicate sets with minsup 50%
k=1: {A}, {C}, {D}, {T}, {W}
k=2: {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
k=3: {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
k=4: {A,C,D,W}, {A,D,T,W}
c) predicates
A = contains(Port), C = contains(Hospital), D = contains(TreatedWaterNet), T = contains(Factory), W = crosses(WaterBody)
22
Understanding the Pruning Methods
(Input Pruning)
[Figure: lattice of the 25 frequent predicate sets, from {} at the bottom up to {A,C,D,W} and {A,D,T,W} at the top, with {D} marked for input pruning.]
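Input pruning can be sketched as dropping the dependent predicate from every row before any mining takes place. The example below assumes that this is what "input pruning" of {D} means here (removing the D column from the dataset); the helper names are hypothetical, and frequent sets are enumerated by brute force for brevity.

```python
from itertools import combinations

ROWS = [set("ACDTW"), set("CDW"), set("ADTW"),
        set("ACDW"), set("ACDTW"), set("CDT")]


def prune_input(rows, dependent):
    """Remove the dependent predicates from every row (column removal)."""
    return [row - set(dependent) for row in rows]


def frequent_sets(rows, min_count):
    """Map each frequent predicate set to its support count."""
    items = sorted(set().union(*rows))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            n = sum(set(combo) <= row for row in rows)
            if n >= min_count:
                result[frozenset(combo)] = n
    return result


before = frequent_sets(ROWS, 3)                     # minsup 50% of 6 rows
after = frequent_sets(prune_input(ROWS, {"D"}), 3)
print(len(before), len(after))
```

Since D occurs in every row, every frequent set containing D duplicates a set without it, which is why removing the column shrinks the lattice without losing non-obvious patterns.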
23
Understanding the Pruning Methods
(Frequent set pruning)
[Figure: the same lattice of frequent predicate sets, with {A,W} marked for frequent set pruning.]
25 frequent sets
24
[Figure: percentage reduction of association rules considering zero (reference), one, and two pairs of dependences with an increasing number of elements (predicates); minconf = 0.]
25
Problem 3 – Redundant Frequent Itemsets
- Of the 25 frequent sets in the example dataset:
- 9 are closed frequent itemsets (3 contain the dependence)
- 16 are redundant (3 contain the dependence)
Dependence {A,W}
Dataset
Tid (city) | Predicate set
1 | A, C, D, T, W
2 | C, D, W
3 | A, D, T, W
4 | A, C, D, W
5 | A, C, D, T, W
6 | C, D, T
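These counts can be verified with a short sketch. It uses the standard definition of a closed itemset (no proper superset with the same tidset) and enumerates the frequent sets by brute force; the function names are illustrative.

```python
from itertools import combinations

ROWS = {1: set("ACDTW"), 2: set("CDW"), 3: set("ADTW"),
        4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}


def frequent_tidsets(rows, min_count):
    """Map each frequent predicate set to the set of tuple ids containing it."""
    items = sorted(set().union(*rows.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            tids = frozenset(t for t, row in rows.items() if set(combo) <= row)
            if len(tids) >= min_count:
                result[frozenset(combo)] = tids
    return result


def closed_sets(freq):
    """Keep the sets that have no proper superset with the same tidset."""
    return {s: t for s, t in freq.items()
            if not any(s < other and t == other_t
                       for other, other_t in freq.items())}


freq = frequent_tidsets(ROWS, 3)
closed = closed_sets(freq)
closed_with_dep = [s for s in closed if {"A", "W"} <= s]
print(len(freq), len(closed), len(closed_with_dep))
```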
26
Problem 3 – Redundant Frequent Itemsets
Remove dependences and then generate closed frequent itemsets
Problem → the resulting frequent sets are not closed
[Figure: lattice of the frequent sets left after removing those that contain {A,W}, each with its tidset:]
k=3: {A,C,D} (145), {A,D,T} (135), {C,D,T} (156), {C,D,W} (1245), {D,T,W} (135)
k=2: {A,C} (145), {A,D} (1345), {A,T} (135), {C,D} (12456), {C,T} (156), {C,W} (1245), {D,T} (1356), {D,W} (12345), {T,W} (135)
k=1: {A} (1345), {C} (12456), {D} (123456), {T} (1356), {W} (12345)
Example: {A,D,T} (135) remains, but it has the same tidset as its removed superset {A,D,T,W}, so it is not closed in the original dataset.
27
Problem 3 – Redundant Frequent Itemsets
- Generate closed frequent itemsets and then eliminate dependences
Problem → information is lost
9 closed frequent itemsets, dependence {A,W}:
{A,C,D,W} (145), {A,D,T,W} (135)
{A,D,W} (1345), {C,D,T} (156), {C,D,W} (1245)
{C,D} (12456), {D,T} (1356), {D,W} (12345)
{D} (123456)
28
Max-FGP (Bogorny 2006c)
- Remove dependences in a first step
[Figure: lattice of the 25 frequent sets with their tidsets; the sets containing {A,W} are removed.]
k=4: {A,C,D,W} (145), {A,D,T,W} (135)
k=3: {A,C,D} (145), {A,C,W} (145), {A,D,T} (135), {A,D,W} (1345), {A,T,W} (135), {C,D,T} (156), {C,D,W} (1245), {D,T,W} (135)
k=2: {A,C} (145), {A,D} (1345), {A,T} (135), {A,W} (1345), {C,D} (12456), {C,T} (156), {C,W} (1245), {D,T} (1356), {D,W} (12345), {T,W} (135)
k=1: {A} (1345), {C} (12456), {D} (123456), {T} (1356), {W} (12345)
{}
29
Max-FGP
- Remove redundant frequent sets in a second step → generating maximal frequent sets
[Figure: lattice of the frequent sets without {A,W}, each annotated with its tidset; a set is redundant when a superset has the same tidset.]
30
Max-FGP (Bogorny, 2006c)
Given: L; // frequent sets without dependences (Apriori-KC)
Ψ; // dataset generated with spatial_predicate_extraction
Find: Maximal M
// find maximal generalized predicate sets
M = L;
For ( k = 2; Mk != ∅; k++ ) do begin
  For ( j = k+1; Mj != ∅; j++ ) do begin
    If (tidSet(Mk) = tidSet(Mj))
      If (Mk ⊂ Mj) // Mj is more general than Mk
        Delete Mk from M;
  End;
End;
Answer = M;
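The second step can be sketched in Python as follows. This is a hedged reading of the listing rather than the published implementation: it compares sets of every size, including single predicates, whereas the listing starts its outer loop at k = 2, and the input L is produced here by brute-force enumeration followed by removal of the sets containing {A,W} instead of by Apriori-KC. All names are illustrative.

```python
from itertools import combinations

ROWS = {1: set("ACDTW"), 2: set("CDW"), 3: set("ADTW"),
        4: set("ACDW"), 5: set("ACDTW"), 6: set("CDT")}


def frequent_tidsets(rows, min_count, dependence):
    """Frequent predicate sets (with tidsets) that do not contain the dependence."""
    items = sorted(set().union(*rows.values()))
    result = {}
    for k in range(1, len(items) + 1):
        for combo in combinations(items, k):
            if set(dependence) <= set(combo):
                continue  # stand-in for the Apriori-KC pruning of step one
            tids = frozenset(t for t, row in rows.items() if set(combo) <= row)
            if len(tids) >= min_count:
                result[frozenset(combo)] = tids
    return result


def max_fgp(freq):
    """Delete every set that has a proper superset in freq with the same tidset."""
    keep = dict(freq)
    for small, tids in freq.items():
        for big, big_tids in freq.items():
            if small < big and tids == big_tids:
                keep.pop(small, None)
                break
    return keep


L = frequent_tidsets(ROWS, 3, dependence={"A", "W"})
M = max_fgp(L)
print(sorted("".join(sorted(s)) for s in M))
```

On the example dataset this keeps 10 of the 19 sets in L; for instance {A} is dropped because {A,D} covers the same cities.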
31
Some results on real databases
32
Input Space Pruning
20 predicates
[Figure: Frequent Geographic Patterns Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). Y axis: Frequent Sets; X axis: Minimum Support (10%, 15%, 20%); series: Apriori, Apriori (removing 1 column), Apriori (removing 2 columns). Annotation: 1 dependence 50%, 2 dependences 70%.]
[Figure: Association Rules Generated with Apriori Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). Y axis: Association Rules; X axis: Minimum Support (10%, 15%, 20%); series: Apriori, Apriori (removing 1 column), Apriori (removing 2 columns). Annotation: 1 dependence 70%, 2 dependences 90%.]
33
Frequent Set Pruning
15 predicates
[Figure: Frequent Geographic Patterns Removing One Dependence among Relevant Features (Frequent Set Pruning). Y axis: Frequent Sets; X axis: Minimum Support (5%, 10%, 15%); series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets.]
[Figure: Computational Time. Y axis: Time (s); X axis: Minimum Support (5%, 10%, 15%); series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets.]
Annotations: 77%, 68%, 58%
34
Summary
• Well-known dependences exist in several non-spatial application domains
– Biology/Bioinformatics
– Pregnant → Female (confidence = 100%)
– Breast_cancer → Female (confidence = 100%)
– ...
• Almost no data mining approaches consider background knowledge or domain knowledge
35
Future Trends
• Data mining methods will consider semantics
• 3 workshops (KDD and ICDM) on domain-driven data mining
• ICDM 2008, 2009 Workshop: Semantic Aspects in Data Mining
• Book 2008: Domain-Driven Data Mining
36
Summary: Mining SAR using Background
Knowledge
• Using background knowledge:
– To prune the input space as much as possible (applicable to any SDM method)
– Apriori-KC → generates frequent itemsets without well-known dependences
– Max-FGP (Maximal Frequent Geographic Patterns) → generates closed frequent itemsets without well-known dependences
37
References
Bogorny, V.; Valiati, J.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints. In: IEEE International Conference on Data Mining, IEEE-ICDM, 6., 2006, Hong Kong. 2006c.
Bogorny, V.; Camargo, S.; Engel, P. M.; Alvares, L. O. Towards elimination of well known geographic domain patterns in spatial association rule mining. In: IEEE International Conference on Intelligent Systems, IEEE-IS, 3., 2006, London. IEEE Computer Society, 2006b. p. 532-537.
Bogorny, V.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Frequent Geographic Patterns with Knowledge Constraints. In: ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS, 14., 2006, Arlington. 2006a. p. 139-146.
Bogorny, V.; Palma, A.; Engel, P.; Alvares, L. O. Weka-GDPM: Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications, WAAMD, Florianopolis, 2006d. p. 9-16.
38
References
Bogorny, V.; Engel, P. M.; Alvares, L. O. Enhancing the Process of Knowledge Discovery in Geographic Databases using Geo-Ontologies. In: NIGRO, H. O.; CISARO, S. G.; XODO, D. (Ed.). Data Mining with Ontologies: Implementations, Findings, and Frameworks. Idea Group, 2007.
CLEMENTINI, E.; DI FELICE, P.; KOPERSKI, K. Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, [S.l.], v.34, n.3, p. 251-270, Sept. 2000.
GÜTING, R. H. An Introduction to Spatial Database Systems. The International Journal on Very Large Data Bases, [S.l.], v.3, n.4, p. 357-399, Oct. 1994.
KOPERSKI, K.; HAN, J. Discovery of Spatial Association Rules in Geographic Information Databases. In: INTERNATIONAL SYMPOSIUM ON LARGE GEOGRAPHICAL DATABASES, SSD, 4., 1995, Portland. Proceedings… [S.l.]: Springer, 1995. p. 47-66.
MENNIS, J.; LIU, J. W. Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS, [S.l.], v.9, n.1, p. 5-17, Jan. 2005.
OPEN GIS CONSORTIUM. OpenGIS simple features specification for SQL. 1999. Available at <http://www.opengeospatial.org/docs/99-054.pdf>. Visited on Aug. 2005.
39