Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining time-delayed associations from event datasets Loo, Kin Kong 13 July, 2006 Contents Motivation Definition of time-delayed association A simple algorithm Improving the simple algorithm Experiments Conclusion The “Butterfly Effect”... “A butterfly's wings might create tiny changes in the atmosphere that ultimately cause a tornado to appear” - from Wikipedia ...... http://www.clipartxp.com/Butterflies/04_Butterfly.jpg http://www.wrh.noaa.gov/hnx/newslet/summer05/xwind.jpg Road network Network of roads and roundabouts City Centre 2 A 4 E 3 4 B D 5 C 4 Road network Alerts issued by a network monitoring system City Centre 2 A 4 E 3 4 B D 5 C 4 0 A A A 1 2 3 4 B B 5 6 B 7 8 C A D B 9 10 11 12 13 14 15 Time Episode Proposed by Mannila et al in [Mannila95] “An episode is a partially ordered collection of events occurring together” Episodes can be described as directed acyclic graphs X Y X Y X Y Z Minimal Occurrences A time interval [ts, te) is a minimal occurrence of an episode if exists in [ts, te) does not occur in any proper sub-interval [ts', te') of [ts, te) 0 A A A 1 2 3 4 B B 5 6 B 7 8 C A D B 9 10 11 12 13 14 15 Time Time-delayed associations A time-delayed association r between two event [u, v] types is in the form I J, I and J are two event types 0<u v Informally interpreted as “if an event of type I occurs at time t, then it is likely that an event of type J occurs within the interval [t+u, t+v] Definition Let D be the event dataset A1, ..., Am are event attributes with domains D1, ..., Dm An event e is in the form (A1, ..., Am, t), where t is the time that e occurred OR e is in the form (E, t), where E is an event type Notations: Ee = the event type of the event te = the time when e occurred Let E be the set of all possible event types Definition (Cont’d) A time-delayed association relates two event types, in the form I [u, v] J such that 0 < u v An event i is a match to I [u, v] J if Ei = I j | Ej = J and ti + u tj ti + v The event j here is a consequence to which i corresponds Support = (number of distinct matches) / |D| Confidence = (number of distinct matches) / (number of occurrences of I) Length = 2 Complex association Treating a time-delayed association as a complex event type, it can be extended and associate to another event type Let R be the complex event type representing [u, v] I J A time-delayed association between R and K is in the form R [u, v] K K is an “ordinary” event type u and v are same as that in I [u, v] J Complex association (Cont’d) An event i is a match to R [u, v] K if i is a match of I J j such that j is a consequence of i and k | Ek = K and tj + u tk tj + v [u, v] Support = (number of distinct matches) / |D| Confidence = supp(R [u, v] K ) / supp(I [u, v] J) Length = 1 + length(I [u, v] J) Problem statement Given an event data set D, time constraints u, v such that 0 < u v, find all time-delayed associations with support not smaller than s and confidence not smaller than c. MQ Mappings Matches and their corresponding consequences can be arranged in a table-like structure 0 A A A 1 2 3 A B B 4 5 6 [, 4] B B 7 8 C D A B B B 9 10 11 12 13 14 15 [, 4] B C m q m q 1 5 6 10 2 5 8 10 2 6 3 5 3 6 12 13 12 14 12 15 Time From MQ mappings to MQ mapping 0 A A A A 1 2 3 [, 4] B 4 B B 5 6 B B 7 [, 4] 8 C C D A B B B 9 10 11 12 13 14 15 (A [, 4] B) Time [, 4] m q m q m q 1 5 6 10 2 10 2 5 8 10 3 10 3 5 2 6 3 6 12 13 12 14 12 15 C Algorithm BRUTE-FORCE Room for improvement Every possible permutation is evaluated (until one is found to be infrequent)! Very large number of MQ mappings are generated too much for main memory intermediate results may need to be swapped to secondary storage Hence, we need Good pruning strategies Good cache management Pruning strategy Apriori property does not hold in time-delayed associations “(A B) C is frequent” does not implies that “A C is frequent” Multiplicity of consequences We define the multiplicity of a consequence as the number of distinct matches corresponding to the consequence [, 4] [, 4] [, 4] [, 4] A B B C (A B) C m q multi m q m q 1 5 3 6 10 2 10 2 5 8 10 4 10 3 5 2 6 3 6 12 13 1 12 14 1 12 15 1 2 0 A A A 1 2 3 4 B B 5 6 B 7 8 C D A B B B 9 10 11 12 13 14 15 Time GlobalK Sort all multiplicities in reverse order Find the minimal k such that sum of top-k of the multiplicities is not less than s |D| A [, 4] B B [, 4] C (A [, 4] B) [, 4] m q multi m q m q 1 5 3 6 10 2 10 2 5 8 10 4 10 3 5 2 6 3 6 2 s |D| = 3 k =1 12 13 1 12 14 1 12 15 1 Infrequent C SectTop GlobalK does not make use of temporal information We can divide the whole history into a number of segments and keep information for each segment For each segment, a vector containing multiplicities of consequences in inverse order is kept SectTop (Cont’d) 0 A [, 4] A A A 1 2 3 B m q multi 1 5 3 2 5 3 5 2 6 3 6 12 13 1 12 14 1 12 15 1 4 B B 5 6 B [, 4] m B 7 8 C q C D A B B B 9 10 11 12 13 14 15 Time s |D| = 3 Cannot be frequent 2 6 10 8 10 Cache replacement strategies Common cache replacement strategies: FIFO – “oldest” data in cache are replaced LRU – data that have not been referenced for the longest time are replaced LFU – data that are least frequently referenced are replaced Candidate generation Experiment Open and closing prices of 33 HSI constituents for around 1400 transaction days (since Jan 2000) are obtained for experiments The prices are transformed to events as follows: Closing price Open price Type A Open price Open price = Closing price Closing price Type B Type C Hence, 33 data points are generated each day, each data point in the form “(0008B, date)” Experiment - Effectiveness of pruning strategies No. of candidate actually enumerated s (%) s (%) Experiment - Effectiveness of pruning strategies (Cont’d) Case when [u, v] = [, 1] at high and low s Experiment - Effectiveness of pruning strategies (Cont’d) Case when [u, v] = [, 2] at high and low s Experiment - Cache replacement strategy and candidate generation LRU, [u, v] = [, 1] at high s (0.6%), with and without using pruning strategy Experiment - Cache replacement strategy and candidate generation LFU, [u, v] = [, 1] at high s (0.6%), with and without using pruning strategy Experiment - Cache replacement strategy and candidate generation LRU, [u, v] = [, 1] at low s (0.3%), with and without using pruning strategy Conclusion Time-delayed association rules offer an alternative tool for sequential analysis of event datasets Although the Apriori property does not hold in the model of time-delayed association, we can still derive good pruning strategies to trim the search space for frequent associations Due to the large volume of intermediate data generated, it is essential to have a rightly chosen cache replacement strategy The way candidates are generated can play a crucial role in the performance of the algorithm Reference Mannila, H., Toivonen, H., and Verkamo, A.I. 1995. Discovering frequent episodes in sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD ’95). Montr´eal, Canada, pp. 210–215. Thank you.