Download Sliding - UCLA Computer Science

Mining Frequent Patterns from Data Streams
Carlo Zaniolo, UCLA CSD

Finding Frequent Patterns for Association Rule Mining
- Given a set of transactions T and a support threshold s, find all patterns with support >= s
  - Apriori [Agrawal’ 94], FP-growth [Han’ 00]
- Fast & light algorithms for data streams
  - More than 30 proposals [Jiang’ 06]
  - For mining windows over streams
  - In particular, DSMSs divide windows into panes, a.k.a. slides

Moment (Maintaining Closed Frequent Itemsets over a Stream Sliding Window)
- Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz. Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. ICDM 2004

Moment Algorithm
- In the absence of concept drifts, not many changes in status
- Maintains two types of boundary nodes:
  1. Freq / non-freq
  2. Closed / non-closed
- Takes specific actions to maintain a shifting boundary whenever a concept shift occurs

CanTree [Leung’ 05]
- Use a fixed canonical order according to decreasing single freq.
- Use a single-round version of FP-growth
- Algorithm: upon each window move:
  - Add/remove new/expired transactions to/from the FP-tree (using the same item order)
  - Run FP-growth! (Without any pruning)

CanTree (cont.)
- Pros:
  - Very efficient for large slides
- Cons:
  - Inefficient for small slides
  - Not scalable for large windows
  - Needs memory for entire window

Frequent Patterns Mining over Data Streams
[Figure: a window sliding over slides …, S4, S5, S6, S7, …; labels: Expired, New, W4, W5]
- Challenges
  - Computation
  - Storage
  - Real-time response
  - Customization
  - Integration with the DSMS
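To make the window/slide terminology concrete, here is a minimal Python sketch of the plain from-scratch baseline: every time a slide arrives, the oldest slide expires and the whole window is mined again. It is not Moment, CanTree, or any other algorithm named above. The function names, the use of absolute counts for support, the brute-force itemset enumeration (standing in for Apriori or FP-growth), and the `max_len` cap are all assumptions made for illustration.

```python
from collections import Counter, deque
from itertools import combinations


def frequent_patterns(transactions, min_support, max_len=3):
    """Brute-force miner: count every itemset of up to max_len items.

    Hypothetical stand-in for Apriori / FP-growth; support is an absolute count.
    """
    counts = Counter()
    for t in transactions:
        items = sorted(set(t))
        for k in range(1, min(max_len, len(items)) + 1):
            counts.update(combinations(items, k))
    return {p: c for p, c in counts.items() if c >= min_support}


def mine_windows(slides, slides_per_window, min_support):
    """Yield the frequent patterns of each full window, recomputed from scratch."""
    window = deque(maxlen=slides_per_window)  # the oldest slide expires automatically
    for slide in slides:
        window.append(slide)
        if len(window) == slides_per_window:
            transactions = [t for s in window for t in s]
            yield frequent_patterns(transactions, min_support)
```

For example, `list(mine_windows(slides, 2, 3))` over three slides of transactions returns one dictionary of frequent patterns per window position. Each window move pays the full mining cost over all transactions in the window, regardless of how little changed.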
Mining for Frequent Patterns on Data Streams
- Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …]
  - Mining each window from scratch: too expensive
    - Subsequent windows have many freq patterns in common
  - Updating frequent patterns on every new tuple: also too expensive
- SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows
- Desiderata: scalability with slide size and window size
- Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo: Verifying and Mining Frequent Patterns from Large Windows over Data Streams. ICDE 2008: 179-188

SWIM: Sliding Window Incremental Miner
- If pattern p is freq in a window, it must be freq in at least one of its slides, so keep a union of the freq patterns of all slides (PT)
[Figure: window W4 moving to W5 over slides …, S4, S5, S6, S7, … (S4 expired, S7 new); the mining algorithm mines S7, F7 is added to PT, frequencies are counted/updated, and PT is pruned, so PT = F4 ∪ F5 ∪ F6 becomes PT = F5 ∪ F6 ∪ F7]

SWIM
- For each new slide Si:
  - Find all frequent patterns in Si (using FP-growth)
  - Verify frequency of these new patterns in each window slide
    - Immediately, or
    - With delay (< N slides)
- Trade-off: max delay vs. computation
- No false negatives or false positives!

SWIM - Design Choices
- Data structure for Si’s: FP-tree [Han’ 00]
- Data structure for PT: FP-tree
- Mining algorithm: FP-growth
- Count/Update frequencies: Naïve? Hash tree?
  - Counting is the bottleneck
  - New and improved counting method named Conditional Counting

Conditional Counting
- Verification
  - Given a set of transactions T, a set of patterns P, and a threshold s
  - Goal: find the exact freq of each p ∈ P w.r.t. T, if and only if its freq is ≥ s
  - If s = 0, verification = counting, but if s > 0 extra computation can be avoided
- Proposed fast verifiers
  - DTV (Double Tree Verifier), DFV (Depth First Verifier)

DTV vs DFV
- DTV scales up well on large trees
  - Much pruning from conditionalization
- However, for smaller trees
  - Less pruning
  - Overhead of conditionalization not always worth it
  - For these use DFV

Comparing Verifiers
[Figure]

Hybrid Verifier
- Start by performing DTV recursively
- Once the resulting trees are small enough, perform DFV

Verifiers vs. Hash Trees (Counting)
[Figure]

SWIM with Hybrid Verifier (I)
[Figure]

SWIM with Hybrid Verifier (II)
[Figure]

Optimization when integrated into a DSMS
- Stream Mill Miner (SMM) provides integrated support for online mining algorithms by
  - User-Defined Aggregates (UDAs)
  - Definition of Mining Models
- Constraints used for optimization
  - Max allowed delay
  - Interesting/Uninteresting items
  - Interesting/Uninteresting patterns
  - These are turned from post-conditions into preconditions

Conclusions
1. SWIM for incremental mining over large windows
  - More efficient than existing approaches on data streams
  - Trade-off between real-time response, efficiency, memory, etc.
2. Efficient algorithms for verification/conditional counting
  - DTV, DFV, and Hybrid
  - These can be used to speed up many applications:
    - Incremental mining, enhancing static algorithms, privacy preserving techniques, …
Implementations of SWIM and the verifiers available at http://wis.cs.ucla.edu/swim/index.htm

References
[Agrawal’ 94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung’ 03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support constraint. In IDEAS, 2003.
[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han’ 00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh’ 04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting FP-tree structures. In DASFAA, 2004.
[Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque. CanTree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen’ 96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.
Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. Verifying and Mining Frequent Patterns from Large Windows over Data Streams. ICDE 2008: 179-188.
Hetal Thakkar, Barzan Mozafari, Carlo Zaniolo. Continuous Post-Mining of Association Rules in a Data Stream Management System. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Yanchang Zhao, Chengqi Zhang, and Longbing Cao (eds.), ISBN: 978-1-60566-404-0.

Thank you! Questions?
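As a supplement to the SWIM slides, the loop they describe (mine each new slide, keep the union PT of the per-slide frequent patterns, verify PT against every slide in the window, drop what expires) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' implementation: plain lists and dictionaries replace the FP-trees, brute-force enumeration replaces FP-growth, a linear scan replaces the DTV/DFV verifiers, verification is always immediate (no delay), and all function names, the `min_ratio` parameter, and the `max_len` cap are hypothetical.

```python
from collections import Counter, deque
from itertools import combinations


def mine_slide(slide, min_count, max_len=3):
    """Stand-in for FP-growth: itemsets (up to max_len items) that occur
    at least min_count times within a single slide."""
    counts = Counter()
    for t in slide:
        items = sorted(set(t))
        for k in range(1, min(max_len, len(items)) + 1):
            counts.update(combinations(items, k))
    return {p for p, c in counts.items() if c >= min_count}


def count_in_slide(pattern, slide):
    """Stand-in for a verifier: exact frequency of one pattern in one slide."""
    needed = set(pattern)
    return sum(1 for t in slide if needed <= set(t))


def swim(slides, n_slides, min_ratio):
    """Yield {pattern: frequency in window} for each full window of n_slides slides.

    min_ratio is a relative support threshold (assumed), applied per slide when
    mining and per window when reporting.
    """
    # Each entry: (slide, F_i = patterns frequent in that slide, cached counts).
    window = deque(maxlen=n_slides)  # an expired slide drops out with its F_i
    for slide in slides:
        window.append((slide, mine_slide(slide, min_ratio * len(slide)), {}))
        if len(window) < n_slides:
            continue
        # PT = union of the F_i of the slides currently in the window.
        pt = set().union(*(f for _, f, _ in window))
        size = sum(len(s) for s, _, _ in window)
        result = {}
        for p in pt:
            total = 0
            for s, _, cache in window:
                if p not in cache:  # verify immediately, once per slide
                    cache[p] = count_in_slide(p, s)
                total += cache[p]
            if total >= min_ratio * size:
                result[p] = total
        yield result
```

Because a pattern that reaches the ratio over the whole window must reach it in at least one slide, every window-frequent pattern of up to `max_len` items ends up in PT, and the exact counts rule out false positives. The per-slide cache is what makes the maintenance incremental: a pattern is counted in a given slide at most once, however many windows that slide belongs to.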
