Efficient Maintenance of
Semistructured Schema
Katsaros Dimitrios
Aristotle University of Thessaloniki
Hellas
2001 Dimitrios Katsaros
Panhellenic Conference on Informatics (ΕΠΥ’8)
Introduction (1/3)
• Semistructured data
– Sources: HTML, BibTeX, SGML, etc.
– Characteristics: no rigid structure, but some implicit structure, i.e., a “schema”
– Knowledge of the “schema” is crucial for:
  • Querying/browsing information sources
  • Building indexes/views
  • Storage in relational/object-oriented databases
  • Query processing
Introduction (2/3)
[Figure 1: Semistructured “movie” objects. OEM graph rooted at “OEM db” with Movie edges to objects &1, &2 and &3; the edge labels shown are Title, Review, Director, Name, Nationality, Award and Biography.]
Introduction (3/3)
– Discovering the common “schema”
• Large volume / Irregularity of data
– Solution: Mining the “schema”
• Scalable / Can deal with irregularity
• Association rules proposed by Wang & Liu [6]
– Issue: How to deal with dynamic data?
Motivation
Our contributions
– Maintenance of the discovered schema
under insertions of new objects
– Schema for the new objects.
– Performance evaluation of the method.
Presentation Outline
• Problem definition
• Algorithm’s description
• Performance evaluation
• Conclusion
• References
Object Exchange Model
• An Object Exchange Model (OEM) object consists of:
  – Identifier o (i.e., &o)
  – Value
    • Atomic (integer, float, string)
    • Complex
      – List: ⟨l_1:&o_1, l_2:&o_2, …, l_k:&o_k⟩
      – Bag: {l_1:&o_1, l_2:&o_2, …, l_k:&o_k}
where: l_i are labels (“roles”)
       ? denotes the wild card matching any label
       ⊥ is the nil structure that contains no label
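As a rough illustration of the model above, an OEM object could be held in memory as follows. This is only a sketch: the class name `OEMObject`, its helper methods, and the sample movie data are assumptions made for the example, and bags and wild cards are left out.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union

Atomic = Union[int, float, str]

@dataclass
class OEMObject:
    """Hypothetical in-memory form of an OEM object (lists only, no bags)."""
    oid: str                                       # identifier, written &oid on the slide
    value: Union[Atomic, List[Tuple[str, str]]]    # atomic value, or (label, child oid) pairs

    def is_atomic(self) -> bool:
        return not isinstance(self.value, list)

    def labels(self) -> List[str]:
        """Labels ("roles") of the outgoing edges, in list order."""
        return [] if self.is_atomic() else [label for label, _ in self.value]

# Made-up sample data: one movie object with a title and a director.
store = {
    "t1": OEMObject("t1", "Some Title"),
    "n1": OEMObject("n1", "Some Name"),
    "d1": OEMObject("d1", [("Name", "n1")]),
    "m1": OEMObject("m1", [("Title", "t1"), ("Director", "d1")]),
}
```

Children are referenced by identifier rather than embedded, which mirrors the `l:&o` notation and lets several parents share one child.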
Tree-Expressions
Definition
1. The nil structure is a tree-expression.
2. Let te_i be tree-expressions of objects o_i. If val(o) = ⟨l_1:&o_1, l_2:&o_2, …, l_k:&o_k⟩ and i_1, i_2, …, i_r is a subsequence of 1, 2, …, k, then ⟨l_i1:te_i1, l_i2:te_i2, …, l_ir:te_ir⟩ is a tree-expression of object o.
Representation
A tree-expression ⟨l_i1:te_i1, l_i2:te_i2, …, l_ir:te_ir⟩ consists of r subtrees te_ij, each labeled l_ij.
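The recursive definition can be made concrete by enumerating every tree-expression of a small object. The sketch below assumes an acyclic object written as nested tuples of (label, child) pairs, with any non-tuple value treated as atomic and the nil structure written as the empty tuple; the function name and the sample object are illustrative only, and the enumeration is exponential, so it is meant for toy inputs.

```python
from itertools import combinations, product

def tree_expressions(obj):
    """Return all tree-expressions of obj as nested tuples; nil is ()."""
    if not isinstance(obj, tuple):        # atomic object: only the nil structure
        return {()}
    result = set()
    for r in range(len(obj) + 1):
        # every subsequence i_1 < ... < i_r of the child positions
        for idx in combinations(range(len(obj)), r):
            # for each chosen child, every (label, sub-expression) pairing
            choices = [
                [(obj[i][0], te) for te in tree_expressions(obj[i][1])]
                for i in idx
            ]
            for combo in product(*choices):
                result.add(combo)
    return result

# Made-up object: a movie with a Title and a Director that has a Name.
movie = (("Title", "T"), ("Director", (("Name", "N"),)))
exprs = tree_expressions(movie)
```

For this object the Director subtree contributes two expressions (nil and Name), so the movie has six tree-expressions in total.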
Incremental Schema Mining
Problem definition
Input
1. A collection of transaction objects in an OEM
graph, denoted as DB
2. A minimum support threshold MINSUP
3. The frequent tree expressions for DB
4. A number of new objects added into the collection,
denoted as db
The incremental schema maintenance problem is to discover all tree expressions which have support in DB ∪ db greater than or equal to MINSUP.
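A brute-force baseline for this problem simply recounts support over DB ∪ db. The sketch below assumes that support is the fraction of transaction objects of which the expression is a tree-expression, uses nested tuples of (label, child) pairs for objects (nil is the empty tuple), and ignores bags and wild cards; the function names and sample data are hypothetical.

```python
def is_tree_expression_of(te, obj):
    """True if te matches a subsequence of obj's labeled children, recursively."""
    if not te:                            # the nil structure matches any object
        return True
    if not isinstance(obj, tuple):        # atomic objects have no children
        return False
    pos = 0
    for label, sub in te:
        # greedily take the earliest remaining child that matches this subtree
        while pos < len(obj):
            child_label, child = obj[pos]
            pos += 1
            if child_label == label and is_tree_expression_of(sub, child):
                break
        else:
            return False                  # ran out of children without a match
    return True

def support(te, objects):
    return sum(is_tree_expression_of(te, o) for o in objects) / len(objects)

def frequent(candidates, objects, minsup):
    return {te for te in candidates if support(te, objects) >= minsup}

# Made-up data: DB holds two movies, db adds one more.
DB = [(("Title", "A"), ("Director", (("Name", "X"),))), (("Title", "B"),)]
db = [(("Title", "C"), ("Director", (("Name", "Y"), ("Nationality", "Z"))))]
title = (("Title", ()),)
director_name = (("Director", (("Name", ()),)),)
before = frequent({title, director_name}, DB, 0.6)
after = frequent({title, director_name}, DB + db, 0.6)
```

Here the Director/Name expression is infrequent in DB alone but becomes frequent once db is added, which is the kind of change an incremental method has to detect without recounting everything.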
DeltaSSD
• DeltaSSD utilizes negative borders
Definition [Negative Border]
Given a collection S ⊆ P(R) of tree expressions, closed with respect to the “weaker than” relation [6], the negative border Bd⁻(S) of S consists of the minimal tree expressions X ⊆ R not in S.
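To make the definition tangible, the sketch below computes a negative border for plain itemsets, where set inclusion stands in for the “weaker than” relation on tree expressions. This substitution is an assumption made to keep the example short; it is not the tree-expression procedure used by DeltaSSD, and the function name is illustrative.

```python
from itertools import combinations

def negative_border(frequent, items):
    """Minimal itemsets over `items` that are missing from the
    downward-closed collection `frequent` (itemset analogue only)."""
    closed = set(frequent) | {frozenset()}     # the empty set is always a member
    border = set()
    for s in closed:
        for item in items - s:
            cand = s | {item}
            # cand is minimal iff it is missing but all its maximal proper subsets are present
            if cand not in closed and all(
                frozenset(sub) in closed
                for sub in combinations(cand, len(cand) - 1)
            ):
                border.add(cand)
    return border

fs = frozenset
S = {fs("a"), fs("b"), fs("c"), fs("ab")}
border = negative_border(S, fs("abc"))
```

With S = {a, b, c, ab}, the border is {ac, bc}; abc is excluded because its subset ac is already outside S, so abc is not minimal.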
DeltaSSD (notation)
DB, db, DB ∪ db               Regular, increment, current database
L_DB, L_db, L_(DB ∪ db)       Frequent tree expressions of DB, db, DB ∪ db
N_DB, N_db, N_(DB ∪ db)       Negative border of DB, db, DB ∪ db
TE_DB                         L_DB ∪ N_DB
L, N                          L = L_(DB ∪ db) ∩ (L_DB ∪ N_DB); N = negative border of L
SupportOf( set, database )    Updates the support count of the tree expressions in set w.r.t. the database
NB( set )                     Computes the negative border of the set
LargeOf( set, database )      Returns the tree expressions in set whose support count in the database is at least MINSUP
DeltaSSD
BEGIN
  SupportOf(TE_DB, db)
  L = LargeOf(TE_DB, DB ∪ db)
  Small = TE_DB – L
  if ( L == L_DB )
    RETURN( L_DB, N_DB )
  N = NB( L )
  if ( N ⊆ Small )
    RETURN( L, N )
  N_u = N – Small
  SupportOf(N_u, db)
  C = LargeOf(N_u, db)
  Small_db = N_u – C
  if ( |C| )
    C = C ∪ L
    repeat
      C = C ∪ NB( C )
      C = C – (Small ∪ Small_db)
    until ( C does not grow )
    C = C – ( L ∪ N_u )
    if ( |C| ) then SupportOf(C, db)
  ScanDB = LargeOf(C ∪ N_u, db)
  N’ = NB(L ∪ ScanDB) – Small
  SupportOf(N’ ∪ ScanDB, DB)
  L_(DB ∪ db) = L ∪ LargeOf(ScanDB, DB ∪ db)
  N_(DB ∪ db) = NB(L_(DB ∪ db))
END
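The two early exits at the top of the listing (everything up to N_u = N – Small) can be sketched for plain itemsets. This is an illustration under stated assumptions, not the DeltaSSD implementation: itemsets replace tree expressions, set inclusion replaces “weaker than”, MINSUP is a fraction of the object count, and the names `early_exit_update` and `negative_border` are hypothetical.

```python
from itertools import combinations

def negative_border(frequent, items):
    """Minimal itemsets missing from a downward-closed collection."""
    closed = set(frequent) | {frozenset()}
    border = set()
    for s in closed:
        for item in items - s:
            cand = s | {item}
            if cand not in closed and all(
                frozenset(sub) in closed
                for sub in combinations(cand, len(cand) - 1)
            ):
                border.add(cand)
    return border

def early_exit_update(te_counts, l_db, n_db, db_size, increment, minsup, items):
    """te_counts maps every member of L_DB ∪ N_DB to its support count in DB."""
    # SupportOf(TE_DB, db): one pass over the increment only
    counts = {s: c + sum(1 for t in increment if s <= t)
              for s, c in te_counts.items()}
    threshold = minsup * (db_size + len(increment))
    large = {s for s, c in counts.items() if c >= threshold}   # L
    small = set(counts) - large                                # Small
    if large == l_db:
        return "unchanged", l_db, n_db
    border = negative_border(large, items)                     # N
    if border <= small:
        return "updated without scanning DB", large, border
    # Otherwise the members of N_u have unknown support, so the remaining
    # steps of the listing (not sketched here) are required.
    return "DB scan needed", large, border - small

fs = frozenset
items = fs("abc")
l_db = {fs("a"), fs("b"), fs("ab")}
n_db = {fs("c")}
te_counts = {fs("a"): 3, fs("b"): 3, fs("ab"): 2, fs("c"): 1}  # DB has 4 objects

r1 = early_exit_update(te_counts, l_db, n_db, 4, [fs("ab")], 0.5, items)
r2 = early_exit_update(te_counts, l_db, n_db, 4, [fs("a"), fs("b")], 0.5, items)
r3 = early_exit_update(te_counts, l_db, n_db, 4, [fs("c"), fs("c")], 0.5, items)
```

In the first case the frequent set is unchanged; in the second it shrinks but the new border is already covered by known counts; in the third, c becomes frequent and introduces border members ac and bc whose support was never counted, so the old database cannot be skipped.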
Experimental settings
Generation of synthetic data
• One dataset :
– (L1, N1) = (25, 1000)
– (L2, N2, T2, I2, P2) = (25, 1000, 4, 2, 50)
– (N3, T3, I3, P3) = (3000, 4, 2, 50)
• Relatively small database, 3000 objects.
• Short and “bushy” transactions (thus, few
database scans).
Performance Evaluation
Database scans
          Wang                ZJZT                DeltaSSD
minsup    Scan DB   Scan db   Scan DB   Scan db   Scan DB   Scan db
0.08      3         3         3         3         1         2
0.10      3         3         3         3         1         2
0.12      3         3         3         3         1         2
0.14      3         3         3         3         1         2
Performance Evaluation
Operations (CPU time)
          Wang       ZJZT                            DeltaSSD
minsup               10%       20%       30%         10%        20%        30%
0.08      1860275    646198    1168914   1621995     28257203   31027562   34679212
0.10      825558     341810    365912    411928      26252973   29664263   32888325
0.12      362021     235920    263484    362156      25606975   28268088   30724951
0.14      108420     98877     101733    113010      25262053   27583482   29764508
Conclusions
• DeltaSSD is very efficient in terms of database
scans
• DeltaSSD incurs excessive processing in terms of
tree matchings
• Re-computing the frequent tree-expressions is
inefficient
• Future work includes:
– Investigation of the complete closure approach
– Techniques to reduce the processing cost of tree
matching
References
1. Y. Aumann, R. Feldman, O. Lipshtat and H. Mannila, "Borders: An Efficient Algorithm for
Association Generation in Dynamic Databases", Journal of Intelligent Information
Systems, vol. 12, no. 1, pp. 61-73, 1999.
2. R. Feldman, Y. Aumann, A. Amir and H. Mannila, "Efficient algorithms for discovering
frequent sets in incremental databases", Proceedings of the ACM Workshop on Research
Issues in Data Mining and Knowledge Discovery (DMKD'97), 1997.
3. H. Mannila and H. Toivonen, "Levelwise Search and Borders of Theories in Knowledge
Discovery", Data Mining and Knowledge Discovery, vol. 1, no. 3, pp. 241-258, 1997.
4. V. Pudi and J. Haritsa, "Quantifying the utility of the past in mining large databases",
Information Systems, vol. 25, no. 5, pp. 323-343, 2000.
5. S. Thomas, S. Bodagala, K. Alsabti and S. Ranka, "An efficient algorithm for the
incremental updation of association rules in large databases", Proceedings of the
International Conference on Knowledge Discovery and Data Mining (KDD'97), pp.
263-266, 1997.
6. K. Wang and H. Liu, "Discovering Structural Association of Semistructured Data", IEEE
Transactions on Knowledge and Data Engineering, vol. 12, no. 3, pp. 353-371, 2000.
7. A. Zhou, Jinwen, S. Zhou and Z. Tian, "Incremental Mining of Schema for Semistructured
Data", Proceedings of the Pacific-Asia Conference on Knowledge Discovery and Data
Mining (PAKDD'99), pp. 159-168, 1999.