Mining Frequent Patterns
from Data Streams
Carlo Zaniolo
UCLA CSD
Finding Frequent Patterns for
Association Rule Mining
• Given a set of transactions T and a support threshold s, find all patterns with support >= s
• Apriori [Agrawal’ 94], FP-growth [Han’ 00]
• Fast & light algorithms for data streams
   • More than 30 proposals [Jiang’ 06]
   • For mining windows over streams
   • In particular, DSMSs divide windows into panes, a.k.a. slides
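The problem statement above can be illustrated with a minimal brute-force sketch. It is neither Apriori nor FP-growth; it simply enumerates every itemset of every transaction, which is only viable for tiny inputs. The function name `frequent_patterns` is hypothetical, and `s` is assumed to be an absolute count.

```python
from itertools import combinations


def frequent_patterns(transactions, s):
    """Return {itemset: support} for every itemset whose support is >= s.

    Brute-force enumeration for illustration only (exponential in the
    transaction length); `s` is assumed to be an absolute count.
    """
    counts = {}
    for t in transactions:
        items = sorted(set(t))
        # Every non-empty subset of a transaction is a pattern it supports.
        for k in range(1, len(items) + 1):
            for pattern in combinations(items, k):
                counts[pattern] = counts.get(pattern, 0) + 1
    return {p: c for p, c in counts.items() if c >= s}


T = [{"a", "b", "c"}, {"a", "b"}, {"a", "c"}, {"b"}]
result = frequent_patterns(T, 2)
print(result)
```

Patterns are represented as sorted tuples so that the same itemset always maps to the same dictionary key.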
Moment (Maintaining Closed Frequent Itemsets
over a Stream Sliding Window)

Yun Chi, Haixun Wang, Philip S. Yu, Richard R. Muntz, Moment:
Maintaining Closed Frequent Itemsets over a Stream Sliding Window. ICDM
2004
Moment Algorithm
• In the absence of concept drift, few itemsets change status
• Maintains two types of boundary nodes:
   1. Frequent / non-frequent
   2. Closed / non-closed
• Takes specific actions to maintain the shifting boundary whenever a concept shift occurs

CanTree [Leung’ 05]
• Uses a fixed canonical item order, by decreasing single-item frequency
• Uses a single-round version of FP-growth
Algorithm:
Upon each window move:
• Add new transactions to the FP-tree and remove expired ones from it (using the same item order)
• Run FP-growth (without any pruning)
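The add/remove step can be sketched as a prefix tree whose paths always follow one fixed item order, so a transaction maps to the same path when it arrives and when it expires. This is a simplified illustration, not the CanTree implementation: the class and method names are hypothetical, header links are omitted, and the FP-growth run is not shown.

```python
class Node:
    def __init__(self):
        self.count = 0
        self.children = {}


class CanonicalTree:
    """Prefix tree with a fixed item order (hypothetical sketch)."""

    def __init__(self, item_order):
        # The order is fixed once; every item must appear in it.
        self.rank = {item: i for i, item in enumerate(item_order)}
        self.root = Node()

    def _path(self, transaction):
        return sorted(set(transaction), key=self.rank.__getitem__)

    def add(self, transaction):
        node = self.root
        for item in self._path(transaction):
            node = node.children.setdefault(item, Node())
            node.count += 1

    def remove(self, transaction):
        """Remove a transaction that was previously added."""
        node = self.root
        for item in self._path(transaction):
            child = node.children[item]
            child.count -= 1
            if child.count == 0:
                # Only this transaction used the branch, so drop it whole.
                del node.children[item]
                return
            node = child


tree = CanonicalTree(["a", "b", "c"])
for t in [{"b", "a"}, {"a", "c"}, {"a", "b", "c"}]:
    tree.add(t)
tree.remove({"a", "b"})  # the oldest transaction expires
```

Because the order never changes, no tree restructuring is needed when the window moves; the cost is that the tree must hold every transaction in the window.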

CanTree (cont.)
Pros:
• Very efficient for large slides
Cons:
• Inefficient for small slides
• Not scalable for large windows
   • Needs memory for the entire window
Frequent Pattern Mining over
Data Streams
[Figure: a stream divided into slides … S4, S5, S6, S7 …, with windows W4 and W5 and the labels "Expired" and "New"]
Challenges
• Computation
• Storage
• Real-time response
• Customization
• Integration with the DSMS
Mining for Frequent Patterns
on Data Streams
• Difficult problem: [Chi’ 04, Leung’ 05, Cheung’ 03, Koh’ 04, …]
• Mining each window from scratch is too expensive
• Subsequent windows have many frequent patterns in common
• Updating frequent patterns on every new tuple is also too expensive
• SWIM’s middle-road approach: incrementally maintain frequent patterns over sliding windows
• Desiderata: scalability with slide size and window size

Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo: Verifying and Mining
Frequent Patterns from Large Windows over Data Streams. ICDE 2008:
179-188
SWIM:
Sliding Window Incremental Miner
• If a pattern p is frequent in a window, it must be frequent in at least one of its slides, so keep the union of the frequent patterns of all slides (PT)
[Figure: slides … S4, S5, S6, S7 …, with windows W4 and W5 and the labels "Expired" and "New". The mining algorithm mines the new slide S7 and F7 is added to PT; frequencies are counted/updated and PT is pruned, so PT = F4 U F5 U F6 becomes PT = F5 U F6 U F7.]
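The property that justifies keeping PT can be checked with a short counting argument. The notation is an assumption made for this sketch: s is taken to be a relative threshold (a fraction of the transactions), the window W is partitioned into slides S_1, …, S_n, and count(p, X) is the number of transactions in X that contain p. If p were infrequent in every slide, that is count(p, S_i) < s|S_i| for all i, then:

```latex
\mathrm{count}(p, W) \;=\; \sum_{i=1}^{n} \mathrm{count}(p, S_i)
\;<\; \sum_{i=1}^{n} s\,|S_i| \;=\; s\,|W|
```

So p would be infrequent in W as well. By contraposition, any pattern that is frequent in W is frequent in at least one slide, and therefore appears in the union of the per-slide frequent patterns.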
SWIM
For each new slide Si
1. Find all frequent patterns in Si (using FP-growth)
2. Verify the frequency of these new patterns in each slide of the window
   • Immediately, or
   • With delay (< N slides)
Trade-off: max delay vs. computation.
No false negatives or false positives!
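The loop above can be sketched as follows, under loudly stated simplifications: slides are mined by brute-force enumeration rather than FP-growth, frequencies are counted naively rather than with a fast verifier, new patterns are always verified immediately (no delay), and `s` is assumed to be a relative threshold. The names `SwimSketch`, `mine_slide` and `count` are hypothetical.

```python
from collections import deque
from itertools import combinations


def mine_slide(slide, min_count):
    """Brute-force stand-in for FP-growth on a single slide."""
    counts = {}
    for t in slide:
        items = sorted(set(t))
        for k in range(1, len(items) + 1):
            for p in combinations(items, k):
                counts[p] = counts.get(p, 0) + 1
    return {p for p, c in counts.items() if c >= min_count}


def count(pattern, slide):
    """Naive counting stand-in for a verifier."""
    need = set(pattern)
    return sum(1 for t in slide if need <= set(t))


class SwimSketch:
    def __init__(self, slides_per_window, s):
        self.n = slides_per_window
        self.s = s                # relative support threshold (assumption)
        self.slides = deque()     # slides of the current window
        self.freq_sets = deque()  # frequent patterns of each slide
        self.pt = {}              # pattern -> deque of per-slide counts

    def add_slide(self, slide):
        if len(self.slides) == self.n:  # the oldest slide expires
            self.slides.popleft()
            self.freq_sets.popleft()
            for per_slide in self.pt.values():
                per_slide.popleft()
        # Update the counts of known patterns on the new slide.
        for p, per_slide in self.pt.items():
            per_slide.append(count(p, slide))
        # Mine the new slide; verify genuinely new patterns on older slides.
        new_freq = mine_slide(slide, self.s * len(slide))
        for p in new_freq:
            if p not in self.pt:
                per_slide = deque(count(p, old) for old in self.slides)
                per_slide.append(count(p, slide))
                self.pt[p] = per_slide
        self.slides.append(slide)
        self.freq_sets.append(new_freq)
        # Prune patterns that are frequent in no slide of the window.
        alive = set().union(*self.freq_sets)
        self.pt = {p: c for p, c in self.pt.items() if p in alive}
        window_size = sum(len(sl) for sl in self.slides)
        return {p: sum(c) for p, c in self.pt.items()
                if sum(c) >= self.s * window_size}


swim = SwimSketch(slides_per_window=2, s=0.5)
for slide in ([{"a", "b"}, {"a"}], [{"a", "b"}, {"b"}], [{"c"}, {"a", "c"}]):
    frequent = swim.add_slide(slide)
print(frequent)
```

Because every pattern that is frequent in the window is frequent in some slide, it is guaranteed to be in `pt` with exact per-slide counts, which is why this sketch reports neither false negatives nor false positives.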
SWIM – Design Choices
• Data structure for the Si’s: FP-tree [Han’ 00]
• Data structure for PT: FP-tree
• Mining algorithm: FP-growth
• Count/update frequencies: naïve? hash tree?
   • Counting is the bottleneck
   • New and improved counting method, named Conditional Counting
Conditional Counting
• Verification
   • Given a set of transactions T, a set of patterns P, and a threshold s
   • Goal: find the exact frequency of each p ∈ P w.r.t. T if and only if its frequency is >= s
   • If s = 0, verification = counting, but if s > 0 extra computation can be avoided
• Proposed fast verifiers
   • DTV (Double Tree Verifier), DFV (Depth First Verifier)
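To make the difference between verification and counting concrete, here is a naive verifier that abandons a pattern as soon as it can no longer reach the threshold. It is not DTV or DFV, which work on tree structures; it is only a sketch of the contract described above. The function name `verify` is hypothetical, `s` is assumed to be an absolute count, and transactions are assumed to be Python sets.

```python
def verify(transactions, patterns, s):
    """Return {pattern: exact count} when the count is >= s, else None.

    Naive sketch: scans the transactions once per pattern and stops
    early when the remaining transactions cannot lift the count to s.
    """
    n = len(transactions)
    result = {}
    for p in patterns:
        need = frozenset(p)
        count = 0
        reachable = True
        for i, t in enumerate(transactions):
            if need <= t:
                count += 1
            remaining = n - i - 1
            if count + remaining < s:
                reachable = False  # p cannot reach s; skip the rest
                break
        result[p] = count if reachable and count >= s else None
    return result


T = [{"a", "b"}, {"a"}, {"b"}, {"a", "b", "c"}]
P = [("a",), ("a", "b"), ("c",)]
print(verify(T, P, 2))
print(verify(T, P, 0))
```

With `s = 0` the early exit never fires and the function degenerates to plain counting; with `s > 0` the scan for an infrequent pattern can stop before the end of the data.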
DTV vs DFV
• DTV scales up well on large trees
   • Much pruning from conditionalization
• However, for smaller trees
   • Less pruning
   • Overhead of conditionalization not always worth it
   • For these, use DFV
Comparing Verifiers
Hybrid Verifier
• Start by performing DTV recursively
• Once the resulting trees are small enough, perform DFV
Verifiers vs. Hash Trees
(Counting)
SWIM with Hybrid Verifier (I)
SWIM with Hybrid Verifier (II)
Optimization when integrated
into a DSMS
• Stream Mill Miner (SMM) provides integrated support for online mining algorithms through
   • User-Defined Aggregates (UDAs)
   • Definition of Mining Models
• Constraints used for optimization
   • Max allowed delay
   • Interesting/uninteresting items
   • Interesting/uninteresting patterns
   • These are turned from post-conditions into preconditions
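One way to read "turned from post-conditions into preconditions" is that a constraint such as a list of uninteresting items is applied before mining instead of filtering the mined patterns afterwards. The sketch below illustrates that idea only; it is not SMM code, the function names are hypothetical, the miner is a brute-force stand-in, and transactions are assumed to be Python sets.

```python
from itertools import combinations


def mine(transactions, min_count):
    """Brute-force stand-in for a frequent-pattern miner."""
    counts = {}
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for p in combinations(items, k):
                counts[p] = counts.get(p, 0) + 1
    return {p: c for p, c in counts.items() if c >= min_count}


def mine_then_filter(transactions, min_count, uninteresting):
    """Constraint as a post-condition: mine everything, then discard."""
    mined = mine(transactions, min_count)
    return {p: c for p, c in mined.items() if not uninteresting & set(p)}


def filter_then_mine(transactions, min_count, uninteresting):
    """Constraint as a precondition: drop the items before mining."""
    return mine([t - uninteresting for t in transactions], min_count)


T = [{"a", "b", "x"}, {"a", "x"}, {"a", "b"}]
post = mine_then_filter(T, 2, {"x"})
pre = filter_then_mine(T, 2, {"x"})
print(pre == post)
```

Both routes return the same patterns with the same counts, because removing an item from a transaction does not change the support of itemsets that never contained it; the precondition route simply enumerates fewer candidates.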
Conclusions
1. SWIM for incremental mining over large windows
   • More efficient than existing approaches on data streams
   • Trade-off between real-time response, efficiency, memory, etc.
2. Efficient algorithms for verification/conditional counting
   • DTV, DFV, and Hybrid
   • These can be used to speed up many applications:
      • Incremental mining, enhancing static algorithms, privacy-preserving techniques, …
Implementations of SWIM and the verifiers available at
http://wis.cs.ucla.edu/swim/index.htm
References
[Agrawal’ 94] R. Agrawal and R. Srikant, “Fast algorithms for mining association rules in large databases,” in VLDB, 1994, pp. 487–499.
[Cheung’ 03] W. Cheung and O. R. Zaiane, “Incremental mining of frequent patterns without candidate generation or support,” in DEAS, 2003.
[Chi’ 04] Y. Chi, H. Wang, P. S. Yu, and R. R. Muntz, “Moment: Maintaining closed frequent itemsets over a stream sliding window,” in ICDM, November 2004.
[Evfimievski’ 03] A. Evfimievski, J. Gehrke, and R. Srikant, “Limiting privacy breaches in privacy preserving data mining,” in PODS, 2003.
[Han’ 00] J. Han, J. Pei, and Y. Yin, “Mining frequent patterns without candidate generation,” in SIGMOD, 2000.
[Koh’ 04] J. Koh and S. Shieh, “An efficient approach for maintaining association rules based on adjusting fp-tree structures,” in DASFAA, 2004.
[Leung’ 05] C.-S. Leung, Q. Khan, and T. Hoque, “Cantree: A tree structure for efficient incremental mining of frequent patterns,” in ICDM, 2005.
[Toivonen’ 96] H. Toivonen, “Sampling large databases for association rules,” in VLDB, 1996, pp. 134–145.
Barzan Mozafari, Hetal Thakkar, Carlo Zaniolo: Verifying and Mining Frequent Patterns from Large Windows over Data Streams. ICDE 2008: 179-188
Hetal Thakkar, Barzan Mozafari, Carlo Zaniolo. Continuous Post-Mining of Association Rules in a Data Stream Management System. Chapter VII in Post-Mining of Association Rules: Techniques for Effective Knowledge Extraction, Yanchang Zhao; Chengqi Zhang; and Longbing Cao (eds.), ISBN: 978-1-60566-404-0.
Thank you!
Questions?