
Efficient Maintenance of Semistructured Schema
Katsaros Dimitrios
Aristotle University of Thessaloniki, Hellas
2001 Dimitrios Katsaros, Panhellenic Conference on Informatics (ΕΠΥ’8)

Introduction (1/3)
• Semistructured data
  – Sources: HTML, BibTeX, SGML, etc.
  – Characteristics: no rigid structure, but some implicit structure, i.e., a “schema”
  – Knowledge of the “schema” is crucial for:
    • Querying/browsing information sources
    • Building indexes/views
    • Storage in relational/object-oriented databases
    • Query processing

Introduction (2/3)
[Figure 1: Semistructured “movie” objects. An OEM graph rooted at db with three Movie objects (&1, &2, &3); edge labels include Review, Title, Director, Name, Nationality, Award and Biography.]

Introduction (3/3)
– Discovering the common “schema”
  • Large volume / irregularity of data
– Solution: mining the “schema”
  • Scalable / can deal with irregularity
  • Association rules proposed by Wang & Liu [6]
– Issue: how to deal with dynamic data?

Motivation
Our contributions
– Maintenance of the discovered schema under insertions of new objects
– Schema for the new objects
– Performance evaluation of the method

Presentation Outline
• Problem definition
• Algorithm description
• Performance evaluation
• Conclusion
• References

Object Exchange Model
• An Object Exchange Model (OEM) object has:
  – An identifier o (i.e., &o)
  – A value, which is either
    • Atomic (integer, float, string), or
    • Complex:
      – List: <l1:&o1, l2:&o2, …, lk:&ok>
      – Bag: {l1:&o1, l2:&o2, …, lk:&ok}
  where:
  – li are labels (“roles”)
  – ? denotes the wild card matching any label
  – ⊥ is the nil structure that contains no label

Tree-Expressions
Definition
1. The nil structure ⊥ is a tree-expression.
2. Let te_i be tree-expressions of objects o_i. If val(o) = <l1:&o1, l2:&o2, …, lk:&ok> and i1, i2, …, ir is a subsequence of 1, 2, …, k, then <l_i1:te_i1, l_i2:te_i2, …, l_ir:te_ir> is a tree-expression of object o.
Representation
A tree-expression <l_i1:te_i1, l_i2:te_i2, …, l_ir:te_ir> consists of r subtrees te_ij, each labeled l_ij.

Incremental Schema Mining
Problem definition
Input
1. A collection of transaction objects in an OEM graph, denoted as DB
2. A minimum support threshold MINSUP
3. The frequent tree expressions for DB
4. A number of new objects added into the collection, denoted as db
The incremental schema maintenance problem is to discover all tree expressions whose support in DB ∪ db is greater than or equal to MINSUP.

DeltaSSD
• DeltaSSD utilizes negative borders
Definition [Negative Border]
Given a collection S ⊆ P(R) of tree expressions, closed with respect to the “weaker than” relation [6], the negative border Bd-(S) of S consists of the minimal tree expressions X ⊆ R not in S.

DeltaSSD (notation)
DB, db, DB ∪ db                Regular, increment, current database
L_DB, L_db, L_DB∪db            Frequent tree expressions of DB, db, DB ∪ db
N_DB, N_db, N_DB∪db            Negative border of DB, db, DB ∪ db
TE_DB                          L_DB ∪ N_DB
L, N                           L = L_DB∪db ∩ (L_DB ∪ N_DB); N = negative border of L
SupportOf( set, database )     Updates the support count of the tree expressions in set w.r.t. the database
NB( set )                      Computes the negative border of the set
LargeOf( set, database )       Returns the tree expressions in set whose support count in the database is at least MINSUP

DeltaSSD
BEGIN
  SupportOf( TE_DB, db )
  L = LargeOf( TE_DB, DB ∪ db )
  Small = TE_DB - L
  If ( L == L_DB ) RETURN( L_DB, N_DB )
  N = NB( L )
  If ( N ⊆ Small ) RETURN( L, N )
  Nu = N - Small
  SupportOf( Nu, db )
  C = LargeOf( Nu, db )
  Small_db = Nu - C
  If ( |C| )
  C = C ∪ L
  repeat
    C = C ∪ NB( C )
    C = C - ( Small ∪ Small_db )
  until ( C does not grow )
  C = C - ( L ∪ Nu )
  if ( |C| ) then SupportOf( C, db )
  ScanDB = LargeOf( C ∪ Nu, db )
  N’ = NB( L ∪ ScanDB ) - Small
  SupportOf( N’ ∪ ScanDB, DB )
  L_DB∪db = L ∪ LargeOf( ScanDB, DB ∪ db )
  N_DB∪db = NB( L_DB∪db )
END

Experimental settings
Generation of synthetic data
• One dataset:
  – (L1, N1) = (25, 1000)
  – (L2, N2, T2, I2, P2) = (25, 1000, 4, 2, 50)
  – (N3, T3, I3, P3) = (3000, 4, 2, 50)
• Relatively small database: 3000 objects.
• Short and “bushy” transactions (thus, few database scans).
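The tree-expression definition, and the support counting that SupportOf and LargeOf rely on, can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that an OEM list value is encoded as an ordered Python list of (label, child) pairs, that the nil structure is encoded as None, and that support is the fraction of transaction objects having the expression. The function names and the movie data are hypothetical, loosely modeled on the labels in Figure 1.

```python
from typing import Any, List

WILDCARD = "?"  # matches any label, as in the OEM notation


def is_tree_expression(te: Any, value: Any) -> bool:
    """Return True if `te` is a tree-expression of an object with this value.

    te:    None (nil structure) or a list of (label, sub_expression) pairs.
    value: an atomic value, or an ordered list of (label, child_value) pairs.
    """
    if te is None:                 # rule 1: nil is always a tree-expression
        return True
    if not isinstance(value, list):
        return False               # assumption: atomic values only admit nil
    pos = 0
    # Rule 2: the labels of `te` must match a subsequence of the children.
    # Taking the earliest matching child for each entry is sufficient.
    for label, sub in te:
        while pos < len(value):
            child_label, child = value[pos]
            pos += 1
            if label in (WILDCARD, child_label) and is_tree_expression(sub, child):
                break
        else:
            return False           # ran out of children without a match
    return True


def support(te: Any, objects: List[Any]) -> float:
    """Fraction of transaction objects of which `te` is a tree-expression."""
    return sum(is_tree_expression(te, o) for o in objects) / len(objects)


# Hypothetical transaction objects (illustrative data only).
movie1 = [("Title", "A"), ("Director", [("Name", "X"), ("Nationality", "Y")])]
movie2 = [("Title", "B"), ("Director", [("Name", "Z")]), ("Award", "W")]
movie3 = [("Review", "R"), ("Title", "C")]
movies = [movie1, movie2, movie3]

te = [("Title", None), ("Director", [("Name", None)])]
reversed_te = [("Director", None), ("Title", None)]   # wrong label order
wild = [(WILDCARD, [("Name", None)])]                  # any label with a Name child

print(support(te, movies))
```

Under these assumptions, LargeOf(set, database) amounts to keeping the expressions whose support is at least MINSUP. Because list values are ordered, `reversed_te` is not a tree-expression of `movie1` even though both labels occur in it.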

Performance Evaluation
Database scans

          Wang               ZJZT               DeltaSSD
minsup    Scan DB  Scan db   Scan DB  Scan db   Scan DB  Scan db
0.08      3        3         3        3         1        2
0.10      3        3         3        3         1        2
0.12      3        3         3        3         1        2
0.14      3        3         3        3         1        2

Performance Evaluation
Operations (CPU time)

          Wang      ZJZT                           DeltaSSD
minsup              10%      20%       30%         10%        20%        30%
0.08      1860275   646198   1168914   1621995     28257203   31027562   34679212
0.10      825558    341810   365912    411928      26252973   29664263   32888325
0.12      362021    235920   263484    362156      25606975   28268088   30724951
0.14      108420    98877    101733    113010      25262053   27583482   29764508

Conclusions
• DeltaSSD is very efficient in terms of database scans
• DeltaSSD incurs excessive processing in terms of tree matchings
• Re-computing the frequent tree-expressions is inefficient
• Future work includes:
  – Investigation of the complete closure approach
  – Techniques to reduce the processing cost of tree matching

References
1. Y. Aumann, R. Feldman, O. Lipshtat and H. Mannila, "Borders: An Efficient Algorithm for Association Generation in Dynamic Databases", Journal of Intelligent Information Systems, vol. 12, no. 1, pp. 61-73, 1999.
2. R. Feldman, Y. Aumann, A. Amir and H. Mannila, "Efficient algorithms for discovering frequent sets in incremental databases", Proceedings of the ACM Workshop on Research Issues in Data Mining and Knowledge Discovery (DMKD'97), 1997.
3. H. Mannila and H. Toivonen, "Levelwise Search and Borders of Theories in Knowledge Discovery", Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 241-258, 1997.
4. V. Pudi and J. Haritsa, "Quantifying the utility of the past in mining large databases", Information Systems, vol. 25, no. 5, pp. 323-343, 2000.
5. S. Thomas, S. Bodagala, K. Alsabti and S. Ranka, "An efficient algorithm for the incremental updation of association rules in large databases", Proceedings of the International Conference on Knowledge Discovery and Data Mining (KDD'97), pp. 263-266, 1997.
6. K. Wang and H. Liu, "Discovering Structural Association of Semistructured Data", IEEE Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, 2000.
7. A. Zhou, Jinwen, S. Zhou and Z. Tian, "Incremental Mining of Schema for Semistructured Data", Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data Mining (PAKDD'99), pp. 159-168, 1999.
