Semantic Aspects in Spatial Data Mining
Vania Bogorny

Introduction
• Existing approaches for spatial data mining, in general, do not make use of prior knowledge
• Bogorny (2006) and Bogorny (2007) introduced the idea of using background knowledge
  – In data preprocessing, to reduce spatial joins
  – In spatial association rule mining, to eliminate well known patterns

Main Problems
• Unnecessary spatial relationship computation
• Large amounts of association rules
• Many associations are well known natural geographic dependences
• Existing approaches for mining spatial association rules (SAR) are Apriori-like
  – Most approaches do not make use of background knowledge
  – They use syntactic constraints for frequent set and rule pruning
  – Only the data is considered, not the schema
• Result
  – The same associations explicitly represented in the schema (by the database designer) are extracted by SAR mining algorithms

Spatial Relationships (Güting, 1994)
[Figure: examples of spatial relationships among features A, B, and C]
• Topological: disjoint, touches, overlaps, contains, inside, equals, crosses
• Distance: distance d between features
• Order: north, southeast

Topological Relationships – GEOMETRICALLY POSSIBLE (OGC standard)
• Order and distance relationships may exist among any pair of spatial features
• Topological relations depend on the geometry
[Table: Topological Relation (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against Geometric Combinations]

Topological Relationships – SEMANTICALLY CONSISTENT
[Table: Topological Relation (Disjoint, Overlaps, Touches, Contains, Inside, Crosses, Equals) against Semantic Combinations; feature types in slide order: Factory, Hospital, Bridge, River, Factory, Airport, River, Road, Beach, Sea, State, Country]

Spatial Relationships
– Mandatory (spatial constraints), i.e. dependences: <Island> <inside> <1><1> <Water Body>
– Prohibited: <River> <contains> <0><0> <Road>
– Possible: normally undefined, e.g. Road crosses River
For data mining and knowledge discovery, POSSIBLE and PROHIBITED relationships are interesting. All others are well known.
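
The three relationship classes above can be read off the cardinality constraints of a schema. The following is a minimal Python sketch under that assumption. The `SchemaRelationship` type, the `classify` function, and the choice to treat missing cardinalities as "possible" are illustrative and do not come from the slides.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SchemaRelationship:
    """Hypothetical record for one schema-level spatial relationship."""
    source: str
    relation: str
    target: str
    min_card: Optional[int] = None  # None means the schema leaves it undefined
    max_card: Optional[int] = None


def classify(rel: SchemaRelationship) -> str:
    """Map cardinalities to the classes named on the slide (assumed reading)."""
    if rel.min_card is None or rel.max_card is None:
        return "possible"      # normally undefined in the schema
    if rel.max_card == 0:
        return "prohibited"    # <0><0>: the relationship may never hold
    if rel.min_card >= 1:
        return "mandatory"     # <1><1>: a well known dependence
    return "possible"


# The three examples given on the slide.
SCHEMA = [
    SchemaRelationship("Island", "inside", "WaterBody", 1, 1),
    SchemaRelationship("River", "contains", "Road", 0, 0),
    SchemaRelationship("Road", "crosses", "River"),
]

# Mandatory relationships are the well known dependences a miner could skip.
dependences = [(r.source, r.target) for r in SCHEMA if classify(r) == "mandatory"]
print(dependences)
```

A list such as `dependences` is the kind of input a pruning step could take as its set of knowledge constraints.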

Well Known Geographic Dependences
[Figure labels: Non-obvious spatial relationships; Well known dependences]
• Is_a(gasStation) → intersects(street) (100%)
• Is_a(island) → within(waterResource) (100%)

Well Known Relationships vs. Association Rules
[Figure labels: Bus Stop, Street, Bridges & Viaducts, Roads, Vegetation]
• contains(gasStation) → contains(Street) (100%)
• intersects(busStop) → intersects(Street) (100%)
• contains(viaduct) → contains(road) (100%)

Well Known Associations – Conceptual Schemas
{State, Country} {Factory, County} {Island, WaterBody} ….
[Figure. Source: 1ª Divisão de Levantamento do Exército Brasileiro]

Well Known Associations – Geo-Ontologies
Geographic dependences are explicit in geo-ontologies (Bogorny, 2005b)
<owl:Class rdf:ID="Island">
  <rdfs:subClassOf rdf:resource="#SpatialFeatureType"/>
  <rdfs:subClassOf>
    <owl:Restriction>
      <owl:minCardinality rdf:datatype="http://www.w3.org/2001/XMLSchema#int">1</owl:minCardinality>
      <owl:allValuesFrom rdf:resource="#WaterResource"/>
      <owl:onProperty>
        <owl:ObjectProperty rdf:about="#Within"/>
      </owl:onProperty>
    </owl:Restriction>
  </rdfs:subClassOf>
</owl:Class>

Well Known Dependences vs. Spatial Association Rules (SAR)
• Well known dependences affect the 3 main steps in the process of mining SAR:
  – Spatial predicate computation: computes unnecessary relationships
  – Frequent set generation: generates frequent itemsets with well known patterns
  – Association rule extraction: produces a high number of rules with well known dependences

Well Known Dependences in SAR
Example of preprocessed spatial dataset
Tuple (city)   Spatial Predicates
1   contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2   contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3   contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4   contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5   contains(Port), contains(Hospital),
    contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6   contains(Hospital), contains(TreatedWaterNet), contains(Factory)

Problem 1 – Geographic Dependences between the Target Feature Type and Relevant Feature Types
Dependence = City and TreatedWaterNet
contains(Hospital) → contains(TreatedWaterNet)
Tuple (city)   Spatial Predicates
1   contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
2   contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
3   contains(Port), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
4   contains(Port), contains(Hospital), contains(TreatedWaterNet), crosses(WaterBody)
5   contains(Port), contains(Hospital), contains(TreatedWaterNet), contains(Factory), crosses(WaterBody)
6   contains(Hospital), contains(TreatedWaterNet), contains(Factory)
Minconf = 70%; support = 100%

Problem 2 – Dependences among Relevant Feature Types
Dependence = {Port, WaterBody}
Minsup = 50%
• 25 frequent sets (6 contain the dependence)
• 9 closed frequent sets (3 contain the dependence)
contains(Port) → crosses(WaterBody)

Pruning Methods using Background Knowledge
Frequent Set Pruning (Apriori-KC) (Bogorny, 2006a)
Given: Φ,        // set of knowledge constraints
       Ψ,        // dataset generated with spatial_predicate_extraction
       minsup    // minimum support
L1 = {large 1-predicate sets};
For (k = 2; Lk-1 != ∅; k++) do begin
    Ck = apriori_gen(Lk-1);          // generates new candidates
    If (k = 2)                       // remove pairs with dependences (step 1)
        Delete from C2 all pairs with a dependence in Φ;
    Forall rows w ∈ Ψ do begin
        Cw = subset(Ck, w);          // candidates contained in w
        Forall candidates c ∈ Cw do
            c.count++;
    End;
    Lk = {c ∈ Ck | c.count ≥ minsup};
End;
Answer = ∪k Lk

Understanding the Pruning Methods
Dependences: {D} and {A,W}
a) Dataset
Tid (city)   Predicate Set
1            A, C, D, T, W
2            C, D, W
3            A, D, T, W
4            A, C, D, W
5            A, C, D, T, W
6            C, D, T
b) Frequent predicate sets with
minsup 50%
k   Frequent sets
1   {A}, {C}, {D}, {T}, {W}
2   {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
3   {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
4   {A,C,D,W}, {A,D,T,W}
c) Predicates
A = contains(Port), C = contains(Hospital), D = contains(Street) / contains(TreatedWaterNet) (both labels are overlaid on the slide), T = contains(Factory), W = crosses(WaterBody)

Understanding the Pruning Methods (Input Pruning)
[Lattice of the 25 frequent sets; {D} is marked for input pruning]
k=4: {A,C,D,W}, {A,D,T,W}
k=3: {A,C,D}, {A,C,W}, {A,D,T}, {A,D,W}, {A,T,W}, {C,D,T}, {C,D,W}, {D,T,W}
k=2: {A,C}, {A,D}, {A,T}, {A,W}, {C,D}, {C,T}, {C,W}, {D,T}, {D,W}, {T,W}
k=1: {A}, {C}, {D}, {T}, {W}
{}

Understanding the Pruning Methods (Frequent Set Pruning)
[Same lattice of the 25 frequent sets; {A,W} is marked for frequent set pruning]

[Figure: percentage reduction of association rules considering zero (reference), one, and two pairs of dependences with an increasing number of elements (predicates); minconf = 0]

Problem 3 – Redundant Frequent Itemsets
Considering the 25 frequent sets in the example dataset:
- 9 are closed frequent itemsets (3 contain the dependence)
- 16 are redundant (3 contain the dependence)
Dependence {A,W}
Dataset
Tid (city)   Predicate Set
1            A, C, D, T, W
2            C, D, W
3            A, D, T, W
4            A, C, D, W
5            A, C, D, T, W
6            C, D, T

Problem 3 – Redundant Frequent Itemsets
- Remove dependences and then generate closed frequent itemsets
Problem: the resulting frequent sets are not closed
[Lattice of the frequent sets without {A,W}, with tid sets in parentheses]
k=3: {A,C,D}(145), {A,D,T}(135), {C,D,T}(156), {C,D,W}(1245), {D,T,W}(135)
k=2: {A,C}(145), {A,D}(1345), {A,T}(135), {C,D}(12456), {C,T}(156), {C,W}(1245), {D,T}(1356), {D,W}(12345), {T,W}(135)
k=1: {A}(1345), {C}(12456), {D}(123456), {T}(1356), {W}(12345)
{}

Problem 3 – Redundant Frequent Itemsets
- Generate closed frequent
itemsets and then eliminate dependences
Problem: loss of information
[Lattice of the 9 closed frequent itemsets, with tid sets in parentheses; dependence {A,W}]
k=4: {A,C,D,W}(145), {A,D,T,W}(135)
k=3: {A,D,W}(1345), {C,D,T}(156), {C,D,W}(1245)
k=2: {C,D}(12456), {D,T}(1356), {D,W}(12345)
k=1: {D}(123456)

Max-FGP (Bogorny, 2006c)
- Remove dependences in a first step
[Lattice of the 25 frequent sets, with tid sets in parentheses]
k=4: {A,C,D,W}(145), {A,D,T,W}(135)
k=3: {A,C,D}(145), {A,C,W}(145), {A,D,T}(135), {A,D,W}(1345), {A,T,W}(135), {C,D,T}(156), {C,D,W}(1245), {D,T,W}(135)
k=2: {A,C}(145), {A,D}(1345), {A,T}(135), {A,W}(1345), {C,D}(12456), {C,T}(156), {C,W}(1245), {D,T}(1356), {D,W}(12345), {T,W}(135)
k=1: {A}(1345), {C}(12456), {D}(123456), {T}(1356), {W}(12345)
{}

Max-FGP
- Remove redundant frequent sets in a second step, generating maximal frequent sets
[Lattice of the frequent sets without {A,W}, with tid sets in parentheses]
k=3: {A,C,D}(145), {A,D,T}(135), {C,D,T}(156), {C,D,W}(1245), {D,T,W}(135)
k=2: {A,C}(145), {A,D}(1345), {A,T}(135), {C,D}(12456), {C,T}(156), {C,W}(1245), {D,T}(1356), {D,W}(12345), {T,W}(135)
k=1: {A}(1345), {C}(12456), {D}(123456), {T}(1356), {W}(12345)
{}

Max-FGP (Bogorny, 2006c)
Given: L;    // frequent sets without dependences (Apriori-KC)
       Ψ;    // dataset generated with spatial_predicate_extraction
Find: Maximal M
// find maximal generalized predicate sets
M = L;
For (k = 2; Mk != ∅; k++) do begin
    For (j = k+1; Mj != ∅; j++) do begin
        If (tidSet(Mk) = tidSet(Mj))
            If (Mk ⊂ Mj)             // Mj is more general than Mk
                Delete Mk from M;
    End;
End;
Answer = M;

Some Results on Real Databases

Input Space Pruning (20 predicates)
[Figure: Association Rules Generated with Apriori Removing One and Two Dependences Between the Target Feature and Two Relevant Features (Input Space Pruning). Series: Apriori, Apriori (removing 1 column), Apriori (removing 2 columns). y-axis: Association Rules]
[Figure: Frequent Geographic Patterns Removing One and Two Dependences between the Target Feature and Two Relevant Features (Input Space Pruning). Series: Apriori, Apriori (removing 1 column), Apriori (removing 2 columns). y-axis: Frequent Sets]
Chart annotations (x-axis: Minimum Support at 10%, 15%, 20%): 1 dependence 50%, 2 dependences 70% on one chart; 1 dependence 70%, 2 dependences 90% on the other.

Frequent Set Pruning (15 predicates)
[Figure: Frequent Geographic Patterns Removing One Dependence among Relevant Features (Frequent Set Pruning). Series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets. y-axis: Frequent Sets; x-axis: Minimum Support (5%, 10%, 15%)]
[Figure: Computational Time. Series: Apriori, Apriori-KC (1 pair), Max-FGP (1 pair), Closed frequent sets. y-axis: Time (s); x-axis: Minimum Support (5%, 10%, 15%)]
Annotated percentages: 77%, 68%, 58%

Summary
• Well known dependences exist in several non-spatial application domains
  – Biology/Bioinformatics
  – Pregnant → Female (confidence = 100%)
  – Breast_cancer → Female (confidence = 100%)
  – ...
• Almost no data mining approaches consider background knowledge or domain knowledge

Future Trends
• Data mining methods will consider semantics
• 3 workshops (KDD and ICDM) for domain-driven data mining
• ICDM 2008, 2009 Workshop: Semantic Aspects in Data Mining
• Book 2008: Domain-Driven Data Mining

Summary: Mining SAR using Background Knowledge
• Using background knowledge:
  – To prune the input space as much as possible (applicable to any SDM method)
  – Apriori-KC generates frequent itemsets without well known dependences
  – Max-FGP (Maximal Frequent Geographic Patterns) generates closed frequent itemsets without well known dependences

References
Bogorny, V.; Valiati, J.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Maximal Generalized Frequent Geographic Patterns with Knowledge Constraints. In: IEEE International Conference on Data Mining, IEEE-ICDM, 6., 2006, Hong Kong. 2006c.
Bogorny, V.; Camargo, S.; Engel, P. M.; Alvares, L. O. Towards elimination of well known geographic domain patterns in spatial association rule mining. In: IEEE International Conference on Intelligent Systems, IEEE-IS, 3., 2006, London.
IEEE Computer Society, 2006b. p. 532-537.
Bogorny, V.; Camargo, S.; Engel, P.; Alvares, L. O. Mining Frequent Geographic Patterns with Knowledge Constraints. In: ACM International Symposium on Advances in Geographic Information Systems, ACM-GIS, 14., 2006, Arlington. 2006a. p. 139-146.
Bogorny, V.; Palma, A.; Engel, P.; Alvares, L. O. Weka-GDPM: Integrating Classical Data Mining Toolkit to Geographic Information Systems. In: SBBD Workshop on Data Mining Algorithms and Applications, WAAMD, Florianopolis, 2006d. p. 9-16.
Bogorny, V.; Engel, P. M.; Alvares, L. O. Enhancing the Process of Knowledge Discovery in Geographic Databases using Geo-Ontologies. In: Nigro, H. O.; Cisaro, S. G.; Xodo, D. (Ed.). Data Mining with Ontologies: Implementations, Findings, and Frameworks. Idea Group, 2007.
Clementini, E.; Di Felice, P.; Koperski, K. Mining multiple-level spatial association rules for objects with a broad boundary. Data & Knowledge Engineering, [S.l.], v.34, n.3, p. 251-270, Sept. 2000.
Güting, R. H. An Introduction to Spatial Database Systems. The International Journal on Very Large Data Bases, [S.l.], v.3, n.4, p. 357-399, Oct. 1994.
Koperski, K.; Han, J. Discovery of Spatial Association Rules in Geographic Information Databases. In: International Symposium on Large Geographical Databases, SSD, 4., 1995, Portland. Proceedings… [S.l.]: Springer, 1995. p. 47-66.
Mennis, J.; Liu, J. W. Mining Association Rules in Spatio-Temporal Data: An Analysis of Urban Socioeconomic and Land Cover Change. Transactions in GIS, [S.l.], v.9, n.1, p. 5-17, Jan. 2005.
Open GIS Consortium. OpenGIS simple features specification for SQL. 1999. Available at <http://www.opengeospatial.org/docs/99-054.pdf>. Visited on Aug. 2005.
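
Worked example: the Apriori-KC and Max-FGP pseudocode from the slides can be tried on the six-row example dataset (predicates A, C, D, T, W). The following Python code is a minimal sketch, not the authors' implementation. The function names `apriori_kc`, `max_fgp`, and `tid_set`, the use of tid-set intersection for counting, and the dictionary layout are assumptions made for illustration.

```python
from itertools import combinations

# Example dataset from the slides: one row per city, one letter per predicate.
DATASET = {
    1: {"A", "C", "D", "T", "W"},
    2: {"C", "D", "W"},
    3: {"A", "D", "T", "W"},
    4: {"A", "C", "D", "W"},
    5: {"A", "C", "D", "T", "W"},
    6: {"C", "D", "T"},
}


def tid_set(itemset, dataset):
    """Return the ids of all rows that contain every predicate in itemset."""
    return frozenset(tid for tid, row in dataset.items() if itemset <= row)


def apriori_kc(dataset, minsup, dependences=()):
    """Frequent sets as {itemset: tid set}; dependence pairs are dropped at k = 2."""
    min_count = minsup * len(dataset)
    blocked = {frozenset(pair) for pair in dependences}
    level = {}
    for item in sorted(set().union(*dataset.values())):
        single = frozenset([item])
        tids = tid_set(single, dataset)
        if len(tids) >= min_count:
            level[single] = tids
    frequent = dict(level)
    k = 2
    while level:
        # Join step: unions of two frequent (k-1)-sets that have exactly k items.
        candidates = {a | b for a in level for b in level if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(sub) in level for sub in combinations(c, k - 1))
        }
        if k == 2:
            candidates -= blocked  # knowledge constraint: remove dependence pairs
        level = {}
        for c in candidates:
            tids = tid_set(c, dataset)
            if len(tids) >= min_count:
                level[c] = tids
        frequent.update(level)
        k += 1
    return frequent


def max_fgp(frequent):
    """Drop every set that has a proper superset with the same tid set."""
    return {
        s: tids
        for s, tids in frequent.items()
        if not any(s < other and tids == frequent[other] for other in frequent)
    }


all_sets = apriori_kc(DATASET, 0.5)
kc_sets = apriori_kc(DATASET, 0.5, dependences=[{"A", "W"}])
maximal = max_fgp(kc_sets)
print(len(all_sets), len(kc_sets), len(maximal))
```

On this dataset the sketch finds 25 frequent sets without constraints and 19 once the pair {A,W} is blocked, which agrees with the slides' count of 6 frequent sets containing the dependence. Because the pair is removed at k = 2, the ordinary Apriori prune step discards its supersets without any extra check.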