Mining Frequent Patterns from Data Streams
Carlo Zaniolo, UCLA CSD

Finding Frequent Patterns for Association Rule Mining
- Given a set of transactions T and a support threshold s, find all patterns with support >= s
- Apriori [Agrawal’ 94], FP-growth [Han’ 00]
- Fast & light algorithms for data streams
  - More than 30 proposals [Jiang’ 06]
  - For mining windows over streams
  - In particular, DSMSs divide windows into panes, a.k.a. slides

Moment (Maintaining Closed Frequent Itemsets over a Stream Sliding Window)
- Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz: Moment: Maintaining Closed Frequent Itemsets over a Stream Sliding Window. ICDM 2004

Moment Algorithm
- In the absence of concept drifts, not many changes in status
- Maintains two types of boundary nodes:
  1. Freq / non-freq
  2. Closed / non-closed
- Takes specific actions to maintain a shifting boundary whenever a concept shift occurs

CanTree [Leung’ 05]
- Use a fixed canonical order according to decreasing single freq.
- Use a single-round version of FP-growth
- Algorithm: upon each window move:
  - Add/remove new/expired transactions to/from the FP-tree (using the same item order)
  - Run FP-growth! (Without any pruning)

CanTree (cont.)
- Pros:
  - Very efficient for large slides
- Cons:
  - Inefficient for small slides
  - Not scalable for large windows
  - Needs memory for entire window

Frequent Patterns Mining over Data Streams
[Figure: sliding windows W4 and W5 over slides S4 ... S7, with expired and new slides marked]
- Challenges:
  - Computation
  - Storage
  - Real-time response
  - Customization
  - Integration with the DSMS
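The CanTree maintenance step above (add new transactions and remove expired ones, always using the same item order) can be illustrated with a small prefix tree. This is only a sketch under stated assumptions, not the structure from [Leung’ 05]: the class name `PrefixTree` and its methods are hypothetical, plain lexicographic order stands in for the canonical order, and the FP-growth mining pass is left out entirely.

```python
from collections import Counter


class PrefixTree:
    """Prefix tree over transactions whose items follow one fixed order.

    Hypothetical illustration: lexicographic order is the assumed
    canonical order, and no mining step is included.
    """

    def __init__(self):
        self.count = 0       # transactions passing through this node
        self.children = {}   # item -> PrefixTree

    def add(self, transaction):
        """Insert one transaction from a newly arrived slide."""
        node = self
        for item in sorted(set(transaction)):   # fixed canonical order
            node = node.children.setdefault(item, PrefixTree())
            node.count += 1

    def remove(self, transaction):
        """Delete one previously inserted transaction from an expired slide."""
        node = self
        for item in sorted(set(transaction)):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # Only this transaction used the subtree, so drop it whole.
                del node.children[item]
                return
            node = child

    def item_counts(self):
        """Total count of each single item across the whole tree."""
        totals = Counter()
        stack = [self]
        while stack:
            node = stack.pop()
            for item, child in node.children.items():
                totals[item] += child.count
                stack.append(child)
        return totals


# One window move: the oldest transaction expires, a new one arrives.
tree = PrefixTree()
for t in [["a", "b"], ["b", "c"], ["a", "b", "c"]]:
    tree.add(t)
tree.remove(["a", "b"])
tree.add(["a", "c"])
print(dict(tree.item_counts()))
```

Because the item order never changes, a window move touches only the paths of the added and removed transactions; the tree never has to be rebuilt. The cost noted on the slide remains: the tree holds every transaction of the window.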
Mining for Frequent Patterns on Data Streams
- Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …]
  - Mining each window from scratch is too expensive: subsequent windows have many frequent patterns in common
  - Updating frequent patterns on every new tuple is also too expensive
- SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows
  - Desiderata: scalability with slide size and window size
- Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo: Verifying and Mining Frequent Patterns from Large Windows over Data Streams. ICDE 2008: 179-188

SWIM: Sliding Window Incremental Miner
- If pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep a union of the frequent patterns of all slides (PT)
[Figure: as the window moves from W4 to W5, slide S7 is mined by the mining algorithm, F7 is added to PT, frequencies are counted/updated, and PT is pruned; PT = F4 ∪ F5 ∪ F6 becomes PT = F5 ∪ F6 ∪ F7]

SWIM
- For each new slide Si:
  - Find all frequent patterns in Si (using FP-growth)
  - Verify the frequency of these new patterns in each window slide
    - Immediately, or
    - With delay (< N slides)
- Trade-off: max delay vs. computation
- No false negatives or false positives!

SWIM – Design Choices
- Data structure for Si’s: FP-tree [Han’ 00]
- Data structure for PT: FP-tree
- Mining algorithm: FP-growth
- Count/update frequencies: Naïve? Hash tree?
  - Counting is the bottleneck
  - New and improved counting method named Conditional Counting

Conditional Counting
- Verification
  - Given a set of transactions T, a set of patterns P, and a threshold s
  - Goal: find the exact frequency of each p ∈ P w.r.t. T, if and only if its frequency is ≥ s
  - If s = 0, verification = counting, but if s > 0 extra computation can be avoided
- Proposed fast verifiers
  - DTV (Double Tree Verifier), DFV (Depth First Verifier)

DTV vs DFV
- DTV
  - Scales up well on large trees
  - Much pruning from conditionalization
- However, for smaller trees
  - Less pruning
  - Overhead of conditionalization not always worth it
  - For these use DFV

Comparing Verifiers

Hybrid Verifier
- Start by performing DTV recursively
- Continue until the resulting trees are small enough, then perform DFV

Verifiers vs. Hash Trees (Counting)

SWIM with Hybrid Verifier (I)

SWIM with Hybrid Verifier (II)

Optimization when integrated into a DSMS
- Stream Mill Miner (SMM) provides integrated support for online mining algorithms by
  - User Defined Aggregates (UDAs)
  - Definition of Mining Models
- Constraints used for optimization
  - Max allowed delay
  - Interesting/uninteresting items
  - Interesting/uninteresting patterns
  - These are turned from post-conditions into preconditions

Conclusions
1. SWIM for incremental mining over large windows
   - More efficient than existing approaches on data streams
   - Trade-off between real-time response, efficiency, memory, etc.
2. Efficient algorithms for verification/conditional counting
   - DTV, DFV, and Hybrid
   - These can be used to speed up many applications: incremental mining, enhancing static algorithms, privacy preserving techniques, …
- Implementations of SWIM and the verifiers are available at http://wis.cs.ucla.edu/swim/index.htm

References
[Agrawal’ 94] R. Agrawal and R. Srikant. Fast algorithms for mining association rules in large databases. In VLDB, pages 487–499, 1994.
[Cheung’ 03] W. Cheung and O. R. Zaiane. Incremental mining of frequent patterns without candidate generation or support. In DEAS, 2003.
[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz. Moment: Maintaining closed frequent itemsets over a stream sliding window. In ICDM, November 2004.
[Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant. Limiting privacy breaches in privacy preserving data mining. In PODS, 2003.
[Han’ 00] J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. In SIGMOD, 2000.
[Koh’ 04] J. Koh and S. Shieh. An efficient approach for maintaining association rules based on adjusting fp-tree structures. In DASFAA, 2004.
[Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque. Cantree: A tree structure for efficient incremental mining of frequent patterns. In ICDM, 2005.
[Toivonen’ 96] H. Toivonen. Sampling large databases for association rules. In VLDB, pages 134–145, 1996.
Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo. Verifying and Mining Frequent Patterns from Large Windows over Data Streams. ICDE 2008: 179-188.
Hetal Thakkar, Barzan Mozafari, Carlo Zaniolo. Continuous Post-Mining of Association Rules in a Data Stream Management System. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Yanchang Zhao, Chengqi Zhang, and Longbing Cao (eds.), ISBN: 978-1-60566-404-0.

Thank you! Questions?
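As a supplement to the SWIM slides, the slide-by-slide maintenance of the pattern set PT can be sketched in a few lines. This is a minimal illustration under loud assumptions, not the ICDE 2008 implementation: the names `SlidingWindowMiner`, `mine_slide`, and `count` are hypothetical, brute-force enumeration stands in for FP-growth, naive scanning stands in for the DTV/DFV verifiers, and new patterns are verified immediately (no delay).

```python
from collections import deque
from itertools import combinations


def count(pattern, slide):
    """Naive counting: transactions in the slide that contain the pattern."""
    return sum(1 for t in slide if pattern <= t)


def mine_slide(slide, min_support):
    """Brute-force stand-in for FP-growth: patterns frequent in one slide."""
    threshold = min_support * len(slide)
    candidates = {frozenset(c) for t in slide
                  for k in range(1, len(t) + 1)
                  for c in combinations(sorted(t), k)}
    return {p for p in candidates if count(p, slide) >= threshold}


class SlidingWindowMiner:
    """Hypothetical sketch: PT is the union of per-slide frequent patterns."""

    def __init__(self, slides_per_window, min_support):
        self.n = slides_per_window
        self.min_support = min_support
        self.slides = deque()     # slides currently in the window
        self.frequent = deque()   # frequent-pattern set of each slide
        self.counts = {}          # pattern -> deque of per-slide counts (PT)

    def add_slide(self, transactions):
        slide = [frozenset(t) for t in transactions]
        if len(self.slides) == self.n:
            # Expire the oldest slide together with its counts.
            self.slides.popleft()
            self.frequent.popleft()
            for per_slide in self.counts.values():
                per_slide.popleft()
        new_frequent = mine_slide(slide, self.min_support)
        # Patterns already in PT only need their count in the new slide.
        for pattern, per_slide in self.counts.items():
            per_slide.append(count(pattern, slide))
        # New patterns are verified against the older slides right away.
        for pattern in new_frequent:
            if pattern not in self.counts:
                older = [count(pattern, s) for s in self.slides]
                self.counts[pattern] = deque(older + [count(pattern, slide)])
        self.slides.append(slide)
        self.frequent.append(new_frequent)
        # Prune patterns that are frequent in none of the current slides.
        alive = set().union(*self.frequent)
        self.counts = {p: c for p, c in self.counts.items() if p in alive}
        window_size = sum(len(s) for s in self.slides)
        return {p for p, c in self.counts.items()
                if sum(c) >= self.min_support * window_size}


miner = SlidingWindowMiner(slides_per_window=2, min_support=0.5)
r1 = miner.add_slide([["a", "b"], ["a", "b"], ["a"], ["c"]])
r2 = miner.add_slide([["c"], ["c"], ["a", "c"], ["b"]])
r3 = miner.add_slide([["c"], ["a", "c"], ["a", "c"], ["d"]])
print(sorted("".join(sorted(p)) for p in r3))
```

The sketch relies on the property stated on the SWIM slide: a pattern that is frequent in the window must be frequent in at least one slide, so restricting attention to PT loses nothing, and exact per-slide counts rule out false positives.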
