Download Document

Online Mining of Frequent Query Trees over XML Data Streams Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee Department of Computer Science National Chiao-Tung University Hsinchu, Taiwan 300, R.O.C. http://www.csie.nctu.edu.tw/~hfli/ *: corresponding author [email protected] 1 Outline  Introduction   Problem Definition   Online Mining of Frequent Query Trees over XML Data Streams The Proposed Algorithm   Mining of Data Streams, Tree Mining FQT-Stream (Frequent Query Trees of Streams) Conclusions and Future Work [email protected] 2 Mining of Data Streams: Motivations  Many Applications generate data streams        Application characteristics     Day to day business (credit card, ATM transactions, etc) Hot Web services (XML data, record and click streams) Telecommunication (call records) Financial market (stock exchange) Surveillance (sensor network, audio/video) System management (network events) Massive volumes of data (several terabytes) Records arrive at a rapid rate Data distribution changes on the fly What do we want to get from data streams ?  Real time query answering, Statistics, and Pattern discovery [email protected] 3 Mining of Data Streams: Computation Model  Requirements of Mining Data Streams    Single pass: each record is examined at most once Bounded storage: Limited Memory for storing synopsis Real-time: Per record processing time (to maintain synopsis) must be low Synopsis in Memory Buffer Stream Mining Processor (Approximate) Results Data Streams [email protected] 4 Problem Definition of Frequent Query Tree Mining (1/2)  XML Query Tree Stream (XQTS)   A sequence of query trees (QTs) QT1, QT2, …, QTN   N is tree id the latest incoming query tree Support of a Query Tree QTi  sup(QTi): the number of QTs in XQTS containing QTi as a subtree [email protected] 5 Problem Definition of Frequent Query Tree Mining (2/2)  A QTi is a Frequent Query Tree (FQT)  if and only if sup(QTi)  sN   s is a user-defined minimum support threshold in the range of [0, 1] Our Task  To mine the set of all frequent query trees (FQTs) by one scan of the XQTS  Using as smaller memory as possible [email protected] 6 Proposed Algorithm FQT-Stream (Frequent Query Trees of Streams)  FQT-Stream consists of 5 phases       1. read a QT (Query Tree) from the buffer in the main memory 2. transform the QT into a new NQTS (Normalized Query Tree Sequence) representation 3. construct a in-memory summary data structure called FQT-forest (a forest of Frequent Query Trees) by projecting the NQTSs 4. prune the infrequent query trees from FQT-forest 5. find the set of all FQTs (Frequent Query Trees) from current FQT-forest Since phase 1 is straightforward,  We focus on phases 2-5 [email protected] 7 Phase 2 of FQT-Stream: NQTS Transformation  NQTS Transformation of QT  Using DFS on the QT  A sequence of triple (node-id, level, order)    level: the level of the QT order: sequence order of the NQTS For example (5-NQTS in Figure 1) [email protected] 8 Phase 3 of FQT-Stream: FQTforest Construction (1/4)  For each NQTS, 2 steps are performed to construct the FQTforest  Step 1: enumerate each NQTS into a set of sub-sequences using Order-Break (OB) technique  OB is a level-wise method [email protected] 9 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (2/4)  For example, a 5-NQTS = <(A, 0, 1), (B, 1, 2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>  First, the 5-NQTS is broken into three 4NQTSs     <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)> These sequences are 1-OB (One Order Break)   1-OB sequences have one order break in the sequence order The original 5-NQTS is called 0-OB [email protected] 10 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (3/4)  After delete the duplicates   Three 4-NQTSs  Two 3-NQTSs with One Order Break Two 3-NQTSs  One 2-NQTS   <(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1, 2), (C, 1, 5)><(A, 0, 1), (C, 1, 5)> Finally, the set of 1-OB contains 8 NQTSs [email protected] 11 Phase 3 of FQT-Stream: Step 1 of FQT-forest Construction (4/4)  Set of 2-OB is generated from the set of 1-OB  For example    2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from 1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> Repeat this process until no candidate kOB Property 1  The maximum size of order break is k-3, i.e., (k3)-OB, if the query tree has k nodes [email protected] 12 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (1/3)  The OBs (0-OB, 1-OB, 2-OB) are projected and inserted into a FQTforest using Incremental Projection (IP) technique  A NQTS, <X1X2…Xi>, with i nodes is projected into i sub-NQTSs (also called node-suffix NQTSs)  <Xi>, <XiXi-1>, …, <X2>, <X1>  We use one field node-id to represent the fields (node-id, level, order) for simplicity [email protected] 13 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (2/3)  Example of IP  1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected into 4 node-suffix NQTSs as follows      <(C, 1, 5)> <(E, 2, 4), (C, 1, 5)> <(D, 2, 3), (E, 2, 4), (C, 1, 5)> <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> After projection, a tree structure checking is preformed  If the level of the first node in a node-suffix NQTS is not the smallest level  the node-suffix NQTS is deleted [email protected] 14 Phase 3 of FQT-Stream: Step 2 of FQT-forest Construction (3/3)  After tree structure checking  The node-suffix NQTSs are inserted into FQT-forest   Update the corresponding nodes’ supports FQT-forest consists of 2 parts  FN-list    A list of Frequent Nodes Each node Xi in FN-list has a NQTS-tree (Xi.NQTS-tree) NQTS-trees (trees of Normalized Query Tree Sequences)   A sequence (NQTS) is represented by a path And its appearance frequent is maintained in the last of node of the path [email protected] 15 Phase 4 of FQT-Stream: Infrequent Information Pruning  In order to guarantee the limited space requirement   Pruning Infrequent Information Pruning steps  Check each node Xi in the FN-list of FQT-forest   If its sup(Xi) < sN  delete Xi and its NQTS-tree Check other NQTS-trees to prune these infrequent nodes [email protected] 16 Phase 4 of FQT-Stream: Frequent Query Tree Mining  Assume that there are k frequent nodes, <X1, X2, …, Xk>, in the FN-list   FQT-Stream traverses the Xi.NQTS-tree (i, i = 1, 2, …, k) to find the sequences with prefix Xi whose estimated support is greater than or equal to sN in a DFS manner These frequent query trees are stored into a temporal list, called FQT-List [email protected] 17 Conclusions and Future Work  We propose an efficient one-pass algorithm FQT-Stream (Frequent Query Trees of Streams)   To find the set of all frequent query trees over the entire history of online XML data streams Future Work  Online Mining of Frequent Query Trees over Sliding Windows [email protected] 18

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Document