CS525 Paper Presentation
Presented by:
Pei Zhang, Jiahua Liu, Pengfei Geng and Salah Ahmed
Mining Frequent Patterns
in Data Streams
at Multiple Time Granularities
Authors: Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu
Part 1
• Introduction
• Problem definition and analysis
• FP-Stream
Introduction
• Frequent pattern mining has been widely studied and used on
static transaction data sets, but it is challenging to extend it to
data streams.
• Why is it difficult to mine frequent patterns in data streams?
— Mining frequent itemsets is a set of join operations.
Problem definition and analysis
• Our task is to find the complete set of frequent patterns in a
data stream.
• Apriori algorithm: count only those itemsets whose every proper
subset is frequent.
• Problems with using an Apriori-like algorithm:
— Join is a blocking operator
— Infrequent items can become frequent later on and hence
cannot be ignored.
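The Apriori rule above (count an itemset only if every proper subset is frequent) can be illustrated with a minimal Python sketch. The function name `is_candidate` and the sample itemsets are hypothetical and are not taken from the paper.

```python
from itertools import combinations

def is_candidate(itemset, frequent):
    """Return True if every non-empty proper subset of `itemset` is in `frequent`."""
    items = sorted(itemset)
    for size in range(1, len(items)):
        for subset in combinations(items, size):
            if frozenset(subset) not in frequent:
                return False
    return True

# Hypothetical frequent itemsets found so far.
frequent = {frozenset(s) for s in (["a"], ["b"], ["c"], ["a", "b"], ["a", "c"])}

print(is_candidate({"a", "b"}, frequent))       # all proper subsets are frequent
print(is_candidate({"a", "b", "c"}, frequent))  # {"b", "c"} is not frequent
```

On a stream this check is the weak point: an itemset rejected now may have subsets that become frequent in later batches.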
Definition
• The frequency of an itemset I over a time period T is the number
of transactions in T in which I occurs. The support of I is the
frequency divided by the total number of transactions observed
in T.
• I is frequent if its support is no less than min_support σ.
• I is sub-frequent if its support is less than σ but no less than the
maximum support error ε.
• Otherwise, I is infrequent.
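The three definitions above can be written as a short Python sketch. The function names, the sample transactions, and the threshold values σ = 0.5 and ε = 0.2 are illustrative assumptions.

```python
def support(itemset, transactions):
    """Fraction of the transactions that contain every item of `itemset`."""
    itemset = frozenset(itemset)
    frequency = sum(1 for t in transactions if itemset <= frozenset(t))
    return frequency / len(transactions)

def classify(itemset, transactions, sigma, epsilon):
    """Label an itemset as frequent, sub-frequent, or infrequent."""
    s = support(itemset, transactions)
    if s >= sigma:
        return "frequent"
    if s >= epsilon:
        return "sub-frequent"
    return "infrequent"

# Ten hypothetical transactions observed in the period T.
transactions = [
    {"a", "b"}, {"a", "c"}, {"a", "b", "c"}, {"b"}, {"a"},
    {"c"}, {"a", "b"}, {"d"}, {"a"}, {"b", "c"},
]

for itemset in ({"a"}, {"a", "b"}, {"d"}):
    print(sorted(itemset), classify(itemset, transactions, sigma=0.5, epsilon=0.2))
```

Here {a} occurs in 6 of 10 transactions, {a, b} in 3, and {d} in 1, so they fall into the three classes in that order.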
FP-Stream
• This paper proposes a time-sensitive streaming model, FP-Stream,
which includes two major components:
1. A global frequent pattern tree held in main memory.
2. Tilted time windows embedded in this pattern tree.
Part 2
• Mining Time-Sensitive Frequent Patterns in Data Streams
• Maintaining Tilted-Time Windows
Natural tilted-time window
• People are often interested in recent changes.
• Recent changes are depicted at a fine granularity, but long-term
changes at a coarse granularity.
Frequent patterns for tilted-time
windows
• To mine a variety of frequent patterns associated with time
more flexibly, a frequent pattern set can be maintained.
Pattern tree
• For each tilted-time window, one can register a window-based
count for each frequent pattern.
• Each node represents a pattern, and its frequency is recorded in
the node.
FP-Stream
• Usually frequent patterns do not change dramatically over time.
• The pattern trees for different tilted-time windows may therefore overlap.
• To save space, embed the tilted-time window structure into
each node.
Maintaining Tilted-Time Windows
• With the arrival of new data, the tilted-time window table grows.
• To keep the table compact, a tilted-time window maintenance
mechanism is needed.
Logarithmic Tilted-time Window
• In the natural tilted-time window, at most 59 (4+24+31) tilted
windows need to be maintained for a period of one month.
• We can reduce the number of tilted-time windows using a
logarithmic tilted-time window schema.
• According to the logarithmic tilted-time window model, with one
year of data and the finest precision at a quarter of an hour, it needs
$\log_2(365 \times 24 \times 4) + 1 \approx 17$ units of time instead of
$366 \times 24 \times 4 = 35{,}136$ units.
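The window counts quoted above can be checked with a few lines of Python. Rounding the logarithm up before adding 1 is an assumption about how the figure of 17 is reached; the variable names are illustrative.

```python
import math

# Natural tilted-time window: 4 quarters + 24 hours + 31 days.
natural_windows = 4 + 24 + 31

# Quarter-hour units in one year, as used in the logarithmic estimate.
quarters_per_year = 365 * 24 * 4

# Logarithmic tilted-time window: one unit per power of two, plus one.
log_windows = math.ceil(math.log2(quarters_per_year)) + 1

# Flat storage at quarter-hour precision for a 366-day year.
flat_units = 366 * 24 * 4

print(natural_windows, quarters_per_year, log_windows, flat_units)
```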
Logarithmic Tilted-time Window
• Break the stream of transactions into fixed-size batches B1, B2,
B3, …, Bn, …
• Bn is the most recent batch; B1 is the oldest.
• For i ≥ j, let B(i, j) denote $\bigcup_{k=j}^{i} B_k$.
• Let $f_I(i, j)$ denote the frequency of I in B(i, j).
• Frequencies for itemset I with ratio 2 (the growth rate of window
size): f(n, n); f(n-1, n-1); f(n-2, n-3); f(n-4, n-7); …
• Maintain intermediate buffer windows
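As a small illustration of the notation above, the sketch below computes the frequency of an itemset over a run of consecutive batches by summing per-batch counts, which is valid because the batches contain disjoint sets of transactions. The function names and the four sample batches are hypothetical.

```python
def batch_frequency(itemset, batch):
    """Number of transactions in one batch that contain `itemset`."""
    itemset = frozenset(itemset)
    return sum(1 for t in batch if itemset <= frozenset(t))

def window_frequency(itemset, batches, i, j):
    """Frequency of `itemset` in B(i, j), the union of batches B_j .. B_i (i >= j).

    `batches[k - 1]` holds batch B_k, so batch numbers are 1-based.
    """
    if i < j:
        raise ValueError("B(i, j) is only defined for i >= j")
    return sum(batch_frequency(itemset, batches[k - 1]) for k in range(j, i + 1))

# Four hypothetical batches of two transactions each (B1 is the oldest).
batches = [
    [{"a", "b"}, {"a"}],       # B1
    [{"b"}, {"a", "c"}],       # B2
    [{"a"}, {"a", "b"}],       # B3
    [{"c"}, {"b"}],            # B4
]

print(window_frequency({"a"}, batches, 4, 4))  # most recent batch only
print(window_frequency({"a"}, batches, 4, 3))  # two most recent batches
print(window_frequency({"a"}, batches, 4, 1))  # the whole stream so far
```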
Logarithmic Tilted-time Window
Updating
• Given a new batch of transactions B:
• Replace f(n, n) at level 0 with f(B).
• Shift the old f(n, n) back to the next finest level of time (level 1),
where it displaces f(n-1, n-1).
• Check the status of the intermediate window for level 1:
• Not full: place f(n-1, n-1) in the intermediate window and stop
the algorithm.
• Full: shift f(n-1, n-1) + f(intermediate window) back to
level 2.
• Continue this process until shifting stops.
Logarithmic Tilted-time Window
Updating: Example
f(8,8); f(7,7)[]; f(6,5)[]; f(4,1)[]
f(9,9); f(8,8)[f(7,7)]; f(6,5)[]; f(4,1)[]
f(10,10); f(9,9)[]; f(8,7)[f(6,5)]; f(4,1)[]
f(11,11); f(10,10)[f(9,9)]; f(8,7)[f(6,5)]; f(4,1)[]
f(12,12); f(11,11)[]; f(10,9)[]; f(8,5)[f(4,1)]
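The updating steps and the example above can be reproduced with a minimal Python sketch of one itemset's window table. The class name `LogTiltedWindow`, its methods, and the choice to store each window as a `(count, newest_batch, oldest_batch)` tuple are assumptions made for illustration. The square brackets in the output show the intermediate window at each level.

```python
class LogTiltedWindow:
    """Logarithmic tilted-time window table for one itemset (ratio 2)."""

    def __init__(self):
        self.main = []    # main[k]: window stored at level k
        self.buffer = []  # buffer[k]: intermediate window at level k, or None

    def add_batch(self, count, n):
        """Insert the count for batch n and shift older windows back."""
        incoming = (count, n, n)
        level = 0
        while incoming is not None:
            if level == len(self.main):
                # First window ever stored at this level.
                self.main.append(incoming)
                self.buffer.append(None)
                return
            displaced = self.main[level]
            self.main[level] = incoming
            if level == 0:
                # Level 0 has no intermediate window: always shift back.
                incoming = displaced
            elif self.buffer[level] is None:
                # Intermediate window not full: park the displaced window, stop.
                self.buffer[level] = displaced
                incoming = None
            else:
                # Intermediate window full: merge and shift to the next level.
                held = self.buffer[level]
                self.buffer[level] = None
                incoming = (displaced[0] + held[0], displaced[1], held[2])
            level += 1

    def describe(self):
        """Render the table in the f(i,j)[...] notation of the example."""
        parts = []
        for level, (_, i, j) in enumerate(self.main):
            text = f"f({i},{j})"
            if level > 0:
                held = self.buffer[level]
                text += f"[f({held[1]},{held[2]})]" if held else "[]"
            parts.append(text)
        return "; ".join(parts)


table = LogTiltedWindow()
for n in range(1, 13):
    table.add_batch(count=1, n=n)  # one occurrence per batch, purely illustrative
    if n >= 8:
        print(table.describe())
```

Starting from an empty table and feeding batches 1 through 12, the lines printed for batches 8 to 12 list the same windows as the example. Merged windows carry the sum of the counts they absorbed, so no occurrences are lost when resolution is reduced.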
Part 3
• Tail Pruning
• Type I Pruning
• Type II Pruning
• Algorithm
Tail Pruning
• Let $t_0, \ldots, t_n$ be the tilted-time windows, where $t_n$ is the oldest.
• $w_i$ is the window size of $t_i$.
• Drop tail sequences when the following condition holds:
[condition missing from the slide transcript]
Type I and Type II Pruning
• Type I Pruning:
• If I is found in B but is not in the FP-stream structure, no superset
is in the structure.
• Hence, if [condition missing from the slide transcript], then none of the
supersets need be examined.
• Type II Pruning:
• If all of I's tilted-time window table entries are pruned (and I is
dropped), then any superset will also be dropped.
An Algorithm
• FP-streaming: incremental update of the FP-stream structure with
incoming stream data
1. Initialize the FP-tree to empty.
2. Sort each incoming transaction t according to f_list, and then insert
it into the FP-tree without pruning any items.
3. When all the transactions in Bi are accumulated, update the FP-stream
as follows:
• Mine itemsets out of the FP-tree using the FP-growth algorithm.
• Scan the FP-stream structure.
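Steps 1 and 2 of the algorithm above can be sketched as a small prefix tree in Python. This is not the authors' implementation: the `FPNode` class, the `insert_transaction` helper, and the sample batch are hypothetical, and `f_list` is assumed to be a fixed ordering that contains every item. FP-growth mining and the FP-stream update of step 3 are not shown.

```python
class FPNode:
    """One node of a simplified FP-tree: an item, a count, and child nodes."""

    def __init__(self, item=None):
        self.item = item
        self.count = 0
        self.children = {}

def insert_transaction(root, transaction, f_list):
    """Sort `transaction` by its position in `f_list` and insert it as a path."""
    rank = {item: position for position, item in enumerate(f_list)}
    ordered = sorted(set(transaction), key=lambda item: rank[item])
    node = root
    for item in ordered:
        node = node.children.setdefault(item, FPNode(item))
        node.count += 1  # no item is pruned at insertion time
    return root

# Step 1: start from an empty tree.
root = FPNode()

# Step 2: insert every transaction of a hypothetical batch.
f_list = ["a", "b", "c", "d"]
batch = [["b", "a"], ["c", "a", "b"], ["a", "d"], ["b", "c"]]
for transaction in batch:
    insert_transaction(root, transaction, f_list)

print(root.children["a"].count)                # transactions containing "a"
print(root.children["a"].children["b"].count)  # transactions sharing prefix a, b
```

Sorting by a common item order lets transactions with the same leading items share a path, which keeps the tree compact.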
Part 4
• Experimental Set-Up
• Experimental Results
• Discussion
Experimental Set-Up
• Experiments are performed using
• Sun UltraSPARC-IIi processors, 512 MB RAM
• Dataset Generation
• 3 million transactions
• 1k distinct items
• Streams are broken into batches of 50k transactions
• Every 5 batches, 200 random item permutations are applied
FP-stream time requirements
• Item permutations cause the behavior to jump every 5 batches.
• Stability is regained quickly.
• Required time increases as the average itemset length increases.
FP-stream space requirements
• The overall space requirements are very attractive in all cases.
They were less than 3 MB.
FP-stream average itemset length
• The average itemset length does not increase as the average
transaction length increases.
• This result was also verified by Apriori running on 50k transactions.
FP-stream total number of itemsets
• The total number of itemsets increases as the average
transaction length increases.
• This result was also verified by Apriori running on 50k transactions.
Discussion
• Further compression is possible.
• If the support is stable for many entries, the table can be
compressed.
• If the tilted-time windows of a parent node and its child node are the
same, only one tilted-time window needs to be maintained.
• Mining time-sensitive frequent patterns is a very nice idea.
• Mining and maintaining frequent patterns becomes realistic
even with limited main memory.
Feedback
Comments and Questions
Thank You