CS525 Paper Presentation
Presented by: Pei Zhang, Jiahua Liu, Pengfei Geng and Salah Ahmed

Mining Frequent Patterns in Data Streams at Multiple Time Granularities
Authors: Chris Giannella, Jiawei Han, Jian Pei, Xifeng Yan, Philip S. Yu

Part 1
• Introduction
• Problem definition and analysis
• FP-Stream

Introduction
• Frequent pattern mining has been widely studied and used on static transaction data sets, but it is challenging to extend it to data streams.
• Why is it difficult to mine frequent patterns in data streams?
  - Mining frequent itemsets is essentially a set of join operations.

Problem definition and analysis
• Our task is to find the complete set of frequent patterns in a data stream.
• Apriori algorithm: count only those itemsets whose every proper subset is frequent.
• Problems with using an Apriori-like algorithm on streams:
  - Join is a blocking operator.
  - Infrequent items can become frequent later on and hence cannot be ignored.

Definition
• The frequency of an itemset I over a time period T is the number of transactions in T in which I occurs. The support of I is its frequency divided by the total number of transactions observed in T.
• I is frequent if its support is no less than the minimum support σ.
• I is sub-frequent if its support is less than σ but no less than the maximum support error ε.
• Otherwise, I is infrequent.

FP-Stream
• The paper proposes a time-sensitive streaming model, FP-Stream, which has two major components:
1. A global frequent pattern tree held in main memory.
2. Tilted-time windows embedded in this pattern tree.

Part 2
• Mining Time-Sensitive Frequent Patterns in Data Streams
• Maintaining Tilted-Time Windows

Natural tilted-time window
• People are often interested in recent changes.
• Recent changes are recorded at a fine granularity, long-term changes at a coarse granularity.

Frequent patterns for tilted-time windows
• To mine frequent patterns associated with time more flexibly, a frequent pattern set can be maintained for each tilted-time window.
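The frequent / sub-frequent / infrequent definitions from Part 1 can be illustrated with a short Python sketch. The function names `support` and `classify`, the toy transactions, and the threshold values are assumptions made for illustration; they do not come from the paper.

```python
from typing import FrozenSet, List


def support(itemset: FrozenSet[str], transactions: List[FrozenSet[str]]) -> float:
    """Frequency of `itemset` divided by the number of transactions in the period."""
    if not transactions:
        return 0.0
    # Frequency: number of transactions that contain every item of the itemset.
    frequency = sum(1 for t in transactions if itemset <= t)
    return frequency / len(transactions)


def classify(itemset: FrozenSet[str], transactions: List[FrozenSet[str]],
             sigma: float, epsilon: float) -> str:
    """Label an itemset using minimum support sigma and maximum support error epsilon."""
    s = support(itemset, transactions)
    if s >= sigma:
        return "frequent"
    if s >= epsilon:
        return "sub-frequent"
    return "infrequent"


# Hypothetical toy period T with ten transactions.
transactions = [frozenset(t) for t in (
    {"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"a"}, {"b", "d"},
    {"a", "b"}, {"c"}, {"a", "d"}, {"a"}, {"b"},
)]

SIGMA, EPSILON = 0.5, 0.2  # assumed thresholds, with EPSILON < SIGMA

for items in ({"a"}, {"a", "b"}, {"a", "d"}):
    label = classify(frozenset(items), transactions, SIGMA, EPSILON)
    print(sorted(items), label)
```

Sub-frequent itemsets are kept because, as noted under the Apriori discussion, an itemset that is not frequent now may become frequent later in the stream.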
Pattern tree
• For each tilted-time window, a window-based count can be registered for each frequent pattern.
• Each node represents a pattern, and its frequency is recorded in the node.

FP-Stream
• Frequent patterns usually do not change dramatically over time.
• The pattern trees of different tilted-time windows are therefore likely to overlap.
• To save space, the tilted-time window structure is embedded into each node.

Maintaining Tilted-Time Windows
• As new data arrives, the tilted-time window table grows.
• To keep the table compact, a tilted-time window maintenance mechanism is needed.

Logarithmic Tilted-time Window
• In the natural tilted-time window, at most 59 (4 + 24 + 31) tilted windows need to be maintained for a period of one month.
• The number of tilted-time windows can be reduced with a logarithmic tilted-time window scheme.
• Under the logarithmic model, with one year of data and the finest precision at a quarter of an hour, only $\lceil \log_2(365 \times 24 \times 4) \rceil + 1 = 17$ units of time are needed instead of $366 \times 24 \times 4 = 35{,}136$ units.

Logarithmic Tilted-time Window
• Break the stream of transactions into fixed-size batches B1, B2, B3, ..., Bn, ...
• Bn is the most current batch; B1 is the oldest.
• For i ≥ j, let B(i, j) denote $\bigcup_{k=j}^{i} B_k$.
• Let $f_I(i, j)$ denote the frequency of I in B(i, j).
• Frequencies kept for itemset I with ratio 2 (the growth rate of the window size): f(n, n); f(n-1, n-1); f(n-2, n-3); f(n-4, n-7); ...
• Intermediate buffer windows are maintained to support updates.

Logarithmic Tilted-time Window Updating
• Given a new batch of transactions B:
• Replace f(n, n) at level 0 with f(B).
• Shift the old f(n, n) back to the next finest level of time granularity (level 1), where it replaces f(n-1, n-1).
• Check the status of the intermediate window for level 1:
  - Not full: place f(n-1, n-1) in the intermediate window and stop.
  - Full: shift f(n-1, n-1) + f(intermediate window) back to level 2.
• Continue this process until shifting stops.

Logarithmic Tilted-time Window Updating: Example
(square brackets show the contents of each level's intermediate window)
f(8,8); f(7,7)[]; f(6,5)[]; f(4,1)[]
f(9,9); f(8,8)[f(7,7)]; f(6,5)[]; f(4,1)[]
f(10,10); f(9,9)[]; f(8,7)[f(6,5)]; f(4,1)[]
f(11,11); f(10,10)[f(9,9)]; f(8,7)[f(6,5)]; f(4,1)[]
f(12,12); f(11,11)[]; f(10,9)[]; f(8,5)[f(4,1)]

Part 3
• Tail Pruning
• Type I Pruning
• Type II Pruning
• Algorithm

Tail Pruning
• Let t0, ..., tn be the tilted-time windows, where tn is the oldest.
• wi is the window size of ti.
• Drop tail sequences when the following condition holds: [formula not preserved]

Type I and Type II Pruning
• Type I Pruning:
  - If I is found in B but is not in the FP-stream structure, then no superset of I is in the structure.
  - Hence, if f_I(B) < ε|B|, none of the supersets need be examined.
• Type II Pruning:
  - If all of I's tilted-time window table entries are pruned (and I is dropped), then any superset of I will also be dropped.

An Algorithm
• FP-streaming: incremental update of the FP-stream structure with incoming stream data
1. Initialize the FP-tree to empty.
2. Sort each incoming transaction t according to f_list, then insert it into the FP-tree without pruning any items.
3. When all the transactions in Bi have been accumulated, update the FP-stream as follows:
  - Mine itemsets out of the FP-tree using the FP-growth algorithm.
  - Scan the FP-stream structure.

Part 4
• Experimental Set-Up
• Experimental Results
• Discussion

Experimental Set-Up
• Experiments are performed using:
  - Sun UltraSPARC-IIi processors, 512 MB RAM

Dataset Generation
• 3 million transactions
• 1k distinct items
• Streams are broken into batches of 50k transactions.
• Every 5 batches, 200 random permutations are applied.

FP-stream time requirements
• Item permutations cause the behavior to jump every 5 batches.
• Stability is regained quickly.
• Required time increases as the average itemset length increases.
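Returning to the logarithmic tilted-time window update from Part 2, the shifting rule can be sketched in Python as follows. This is a minimal illustration for a single itemset, not the authors' implementation: the names `Window`, `LogTiltedWindow`, `add_batch` and `snapshot` are hypothetical, levels are created on demand, and the window displaced from level 0 is assumed to pass straight to level 1, since only levels 1 and above have intermediate windows in the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Window:
    newest: int  # index i of the most recent batch covered
    oldest: int  # index j of the oldest batch covered
    freq: int    # f(i, j): frequency of the itemset over batches j..i

    def merge(self, older: "Window") -> "Window":
        """Combine this window with an adjacent, older one."""
        return Window(self.newest, older.oldest, self.freq + older.freq)


class LogTiltedWindow:
    """Logarithmic tilted-time window table for one itemset (ratio 2)."""

    def __init__(self) -> None:
        self.main: List[Optional[Window]] = []    # one window per level
        self.buffer: List[Optional[Window]] = []  # intermediate window per level
        self.n = 0                                # number of batches seen so far

    def add_batch(self, freq: int) -> None:
        self.n += 1
        carry: Optional[Window] = Window(self.n, self.n, freq)
        level = 0
        while carry is not None:
            if level == len(self.main):           # create a new level on demand
                self.main.append(None)
                self.buffer.append(None)
            displaced = self.main[level]
            self.main[level] = carry
            if displaced is None:
                carry = None                      # level was empty: nothing to shift
            elif level == 0:
                carry = displaced                 # level 0 always shifts back
            elif self.buffer[level] is None:
                self.buffer[level] = displaced    # not full: park it and stop
                carry = None
            else:                                 # full: merge and shift back
                carry = displaced.merge(self.buffer[level])
                self.buffer[level] = None
            level += 1

    def snapshot(self) -> str:
        """Render the table in the notation of the slides."""
        def label(w: Window) -> str:
            return f"f({w.newest},{w.oldest})"

        parts = []
        for level, w in enumerate(self.main):
            if level == 0:
                parts.append(label(w))
            else:
                b = self.buffer[level]
                parts.append(f"{label(w)}[{label(b) if b is not None else ''}]")
        return "; ".join(parts)


table = LogTiltedWindow()
for batch in range(1, 13):
    table.add_batch(freq=batch)  # assumed toy frequency: batch k contributes k
    if batch >= 8:
        print(table.snapshot())
```

With twelve batches fed in, the last five printed lines follow the same pattern as the update example shown earlier, from `f(8,8); f(7,7)[]; f(6,5)[]; f(4,1)[]` through `f(12,12); f(11,11)[]; f(10,9)[]; f(8,5)[f(4,1)]`. Merging sums the frequencies, so the window labelled f(8,5) holds the itemset's count over batches 5 to 8.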
FP-stream space requirements
• The overall space requirements are very modest in all cases: less than 3 MB.

FP-stream average itemset length
• The average itemset length does not increase as the average transaction length increases.
• This result was also verified by running Apriori on 50k transactions.

FP-stream total number of itemsets
• The total number of itemsets increases as the average transaction length increases.
• This result was also verified by running Apriori on 50k transactions.

Discussion
• Further compression is possible:
  - If the support is stable across many entries, the table can be compressed.
  - If the tilted-time windows of a parent node and its child node are the same, only one tilted-time window needs to be maintained.
• Mining time-sensitive frequent patterns is a very nice idea.
• Mining and maintaining frequent patterns becomes realistic even with limited main memory.

Feedback
Comments and Questions

Thank You