Download Document

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Online Mining of Frequent
Query Trees over XML Data
Streams
Hua-Fu Li*, Man-Kwan Shan and Suh-Yin Lee
Department of Computer Science
National Chiao-Tung University
Hsinchu, Taiwan 300, R.O.C.
http://www.csie.nctu.edu.tw/~hfli/
*: corresponding author
[email protected]
1
Outline

Introduction


Problem Definition


Online Mining of Frequent Query Trees over
XML Data Streams
The Proposed Algorithm


Mining of Data Streams, Tree Mining
FQT-Stream (Frequent Query Trees of
Streams)
Conclusions and Future Work
[email protected]
2
Mining of Data Streams:
Motivations

Many Applications generate data streams







Application characteristics




Day to day business (credit card, ATM transactions, etc)
Hot Web services (XML data, record and click streams)
Telecommunication (call records)
Financial market (stock exchange)
Surveillance (sensor network, audio/video)
System management (network events)
Massive volumes of data (several terabytes)
Records arrive at a rapid rate
Data distribution changes on the fly
What do we want to get from data streams ?

Real time query answering, Statistics, and Pattern
discovery
[email protected]
3
Mining of Data Streams:
Computation Model

Requirements of Mining Data Streams



Single pass: each record is examined at most once
Bounded storage: Limited Memory for storing synopsis
Real-time: Per record processing time (to maintain
synopsis) must be low
Synopsis
in Memory
Buffer
Stream
Mining
Processor
(Approximate)
Results
Data Streams
[email protected]
4
Problem Definition of Frequent
Query Tree Mining (1/2)

XML Query Tree Stream (XQTS)


A sequence of query trees (QTs)
QT1, QT2, …, QTN


N is tree id the latest incoming query tree
Support of a Query Tree QTi

sup(QTi): the number of QTs in XQTS
containing QTi as a subtree
[email protected]
5
Problem Definition of Frequent
Query Tree Mining (2/2)

A QTi is a Frequent Query Tree (FQT)

if and only if sup(QTi)  sN


s is a user-defined minimum support
threshold in the range of [0, 1]
Our Task

To mine the set of all frequent query
trees (FQTs) by one scan of the XQTS

Using as smaller memory as possible
[email protected]
6
Proposed Algorithm FQT-Stream
(Frequent Query Trees of Streams)

FQT-Stream consists of 5 phases






1. read a QT (Query Tree) from the buffer in the main
memory
2. transform the QT into a new NQTS (Normalized
Query Tree Sequence) representation
3. construct a in-memory summary data structure called
FQT-forest (a forest of Frequent Query Trees) by
projecting the NQTSs
4. prune the infrequent query trees from FQT-forest
5. find the set of all FQTs (Frequent Query Trees) from
current FQT-forest
Since phase 1 is straightforward,

We focus on phases 2-5
[email protected]
7
Phase 2 of FQT-Stream: NQTS
Transformation

NQTS Transformation of QT

Using DFS on the QT

A sequence of triple (node-id, level, order)



level: the level of the QT
order: sequence order of the NQTS
For example (5-NQTS in Figure 1)
[email protected]
8
Phase 3 of FQT-Stream: FQTforest Construction (1/4)

For each NQTS, 2 steps are
performed to construct the FQTforest

Step 1: enumerate each NQTS into a
set of sub-sequences using Order-Break
(OB) technique

OB is a level-wise method
[email protected]
9
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (2/4)

For example, a 5-NQTS = <(A, 0, 1), (B, 1,
2), (D, 2, 3), (E, 2, 4), (C, 1, 5)>

First, the 5-NQTS is broken into three 4NQTSs




<(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (B, 1, 2), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (B, 1, 2), (D, 2, 3), (C, 1, 5)>
These sequences are 1-OB (One Order Break)


1-OB sequences have one order break in the sequence
order
The original 5-NQTS is called 0-OB
[email protected]
10
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (3/4)

After delete the duplicates


Three 4-NQTSs  Two 3-NQTSs with
One Order Break
Two 3-NQTSs  One 2-NQTS


<(A, 0, 1), (E, 2, 4), (C, 1, 5)>, <(A, 0, 1), (B, 1,
2), (C, 1, 5)><(A, 0, 1), (C, 1, 5)>
Finally, the set of 1-OB contains 8
NQTSs
[email protected]
11
Phase 3 of FQT-Stream: Step 1
of FQT-forest Construction (4/4)

Set of 2-OB is generated from the set of
1-OB

For example



2-OB <(A, 0, 1), (D, 2, 3), (C, 1, 5)> is generated from
1-OB <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
Repeat this process until no candidate kOB
Property 1

The maximum size of order break is k-3, i.e., (k3)-OB, if the query tree has k nodes
[email protected]
12
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (1/3)

The OBs (0-OB, 1-OB, 2-OB) are
projected and inserted into a FQTforest using Incremental Projection
(IP) technique

A NQTS, <X1X2…Xi>, with i nodes is
projected into i sub-NQTSs (also called
node-suffix NQTSs)

<Xi>, <XiXi-1>, …, <X2>, <X1>

We use one field node-id to represent the fields
(node-id, level, order) for simplicity
[email protected]
13
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (2/3)

Example of IP

1-OB: <(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)> is projected
into 4 node-suffix NQTSs as follows





<(C, 1, 5)>
<(E, 2, 4), (C, 1, 5)>
<(D, 2, 3), (E, 2, 4), (C, 1, 5)>
<(A, 0, 1), (D, 2, 3), (E, 2, 4), (C, 1, 5)>
After projection, a tree structure checking is
preformed

If the level of the first node in a node-suffix NQTS is
not the smallest level

the node-suffix NQTS is deleted
[email protected]
14
Phase 3 of FQT-Stream: Step 2
of FQT-forest Construction (3/3)

After tree structure checking

The node-suffix NQTSs are inserted into FQT-forest


Update the corresponding nodes’ supports
FQT-forest consists of 2 parts

FN-list



A list of Frequent Nodes
Each node Xi in FN-list has a NQTS-tree (Xi.NQTS-tree)
NQTS-trees (trees of Normalized Query Tree
Sequences)


A sequence (NQTS) is represented by a path
And its appearance frequent is maintained in the last of
node of the path
[email protected]
15
Phase 4 of FQT-Stream:
Infrequent Information Pruning

In order to guarantee the limited
space requirement


Pruning Infrequent Information
Pruning steps

Check each node Xi in the FN-list of
FQT-forest


If its sup(Xi) < sN  delete Xi and its
NQTS-tree
Check other NQTS-trees to prune these
infrequent nodes
[email protected]
16
Phase 4 of FQT-Stream:
Frequent Query Tree Mining

Assume that there are k frequent nodes,
<X1, X2, …, Xk>, in the FN-list


FQT-Stream traverses the Xi.NQTS-tree (i, i
= 1, 2, …, k) to find the sequences with prefix
Xi whose estimated support is greater than or
equal to sN in a DFS manner
These frequent query trees are stored into
a temporal list, called FQT-List
[email protected]
17
Conclusions and Future Work

We propose an efficient one-pass
algorithm FQT-Stream (Frequent
Query Trees of Streams)


To find the set of all frequent query
trees over the entire history of online
XML data streams
Future Work

Online Mining of Frequent Query Trees
over Sliding Windows
[email protected]
18