Mining Serial Episode Rules with Time Lags over Multiple Data Streams
Tung-Ying Lee, En Tzu Wang, Dept. of CS, National Tsing Hua Univ. (Taiwan)
Arbee L.P. Chen, Dept. of CS, National Chengchi Univ. (Taiwan)
DaWaK’08

Outline
* Introduction
* Related work
* Preliminaries
* Method
* Experiments
* Conclusions

Introduction
* In many applications, data are generated in the form of continuous data streams.
* Continuously measuring the flow and occupancy of a road to assess its congestion produces data streams.
* When roads A and B have heavy traffic, road C will most likely be congested 5 minutes later.
* Serial episode rules with time lags (SER): X →lag Y

Related Work
* Finding episodes and episode rules in static time series data has been studied for decades.
[Figure: an episode, a serial episode, and a serial episode rule with a precursor, a successor, and a time lag L]

Preliminaries
* Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn
* n-tuple event: the set of items arriving from all streams at the same time
* itemset: a subset of an n-tuple event
* serial episode: an ordered list of itemsets, e.g. serial episode (aA)(bB)

time:  1  2  3  4  5  6  7  8
DS1:   a  b  b  c  g  a  b  f
DS2:   A  B  S  G  A  B  A  F
…
DSn:
(each column is an n-tuple event; {gA} at time 5 is an itemset)

Preliminaries (cont.)
* Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S if
  * S occurs in [a, b], and
  * S does not occur in any proper subinterval of [a, b].
* If (b − a + 1) ≤ T, where T is a time bound given by users, [a, b] is valid.
* MO(S): the set of all minimal occurrences of S
* Supp(S): the number of valid minimal occurrences of S

Example (time bound T: 3, streams DS1 and DS2):
Serial episode   Minimal occurrences                            Support
(a A)(b B)       [1, 2], [6, 7], [11, 12], [13, 14], [18, 19]   5
(g G)            [5, 5], [10, 10], [15, 15], [17, 17]           4

Preliminaries (cont.)
* A SER is R: S1 →(Lag = L) S2
* Supp(R) = |{[a, b] | [a, b] ∈ MO(S1), [a, b] is valid, and ∃ [c, d] ∈ MO(S2), [c, d] is valid, s.t. (c − a) = L}|
* Conf(R) = Supp(R) / Supp(S1)

Example (time bound T: 3, streams DS1 and DS2):
Serial episode rule    Minimal occurrences                                                     Support, confidence
(a A)(b B) →4 (g G)    [1, 2]→[5, 5], [6, 7]→[10, 10], [11, 12]→[15, 15], [13, 14]→[17, 17]    Supp: 4, Conf: 4/5 = 0.8

Preliminaries (cont.)
* Problem formulation: given 4 parameters
  * the maximum time lag (Lmax)
  * the minimum support (minsup)
  * the minimum confidence (minconf)
  * the time bound (T)
* Find all SERs, e.g. R: S1 →(Lag = L) S2, satisfying
  * L ≤ Lmax
  * Supp(R) ≥ N × minsup (N: the number of received n-tuple events)
  * Conf(R) ≥ minconf
* Calculating supports for serial episodes and SERs must take T into account.

Example (streams DS1 and DS2):
Lmax   minsup   minconf   T
5      0.2      0.8       3

Serial episode rule    Minimal occurrences                                                     Support, confidence
(a A)(b B) →4 (g G)    [1, 2]→[5, 5], [6, 7]→[10, 10], [11, 12]→[15, 15], [13, 14]→[17, 17]    Supp: 4 ≥ 19 × 0.2 (N = 19), Conf: 4/5 = 0.8

Preliminaries (cont.)
* Using the prefix tree for keeping serial episodes
  * S: a serial episode, X: an item
  * S+X: X follows S
  * S+_X: X and the last itemset in S appear at the same time
[Figure: prefix tree with Root at level 0, items at level 1, and at level 2 under A the children _B, giving serial episode (AB), and B, giving serial episode (A)(B)]

LossyDL
* The concept of LossyDL: keep the valid minimal occurrences of a serial episode for generating rules.
* Each item in the current 2-tuple event needs to be processed (traversing in a bottom-up order).
* The last two minimal occurrences need to be checked.
* Using Lossy Counting [VLDB02], whenever N ≡ 0 (mod 1/ε), the oldest minimal occurrence is removed.
* Example: at time point 3, a 2-tuple event (BC) arrives, T = 3. Processing C can generate (B)(C): [2, 3] and (BC): [3, 3].
[Figure: prefix tree annotated with the minimal occurrences kept at each node; the occurrence [1, 3] is marked "not minimal"]

LossyDL (Rule Generation)
* Mining SERs: any two serial episodes with supports ≥ (minsup − ε) × N are checked to see whether any of their minimal occurrences can be combined.
* Supp(R) can then be computed.
* Each R: S1 →(Lag = L) S2 is returned if Supp(R) ≥ (minsup − ε) × N and (Supp(R) + εN) / Supp(S1) ≥ minconf.

Serial episode   Minimal occurrences
(a A)(b B)       [1, 2], [6, 7], [11, 12], [13, 14], [18, 19]
(g G)            [5, 5], [10, 10], [15, 15], [17, 17]

Serial episode rule    Minimal occurrences
(a A)(b B) →4 (g G)    [1, 2]→[5, 5], [6, 7]→[10, 10], [11, 12]→[15, 15], [13, 14]→[17, 17]

TLT
* Many minimal occurrences are kept in LossyDL, but only the last two are used while updating.
* Observations
  * X →L (AB) and X →L A: obviously Supp(X →L A) ≥ Supp(X →L (AB)), so X →L (AB) is not significant if X →L A fails either minsup or minconf.
  * (AB) →L (CD) and A →L C: obviously Supp(A →L C) ≥ Supp((AB) →L (CD)), so (AB) →L (CD) is not significant if Supp(A →L C) < Supp(AB) × minconf.

TLT (cont.)
* Observations (cont.): given a SER (A)(B) →5 (CD) and T = 3
  * A →1 B or A →2 B, that is, A →p B with 0 < p < T (T − 1 types)
  * A →1 B →4 (CD), A →2 B →3 (CD), that is, A →p B →(L − p) (CD)
  * Supp(A →p B →(L − p) (CD)) ≤ min(Supp(A →p B), Supp(B →(L − p) C))
  * (A)(B) →5 (CD) is not significant if Σp min(Supp(A →p B), Supp(B →(L − p) C)) < Supp((A)(B)) × minconf
* The observations are used to prune insignificant rules.
* Time lag table (TLT)
  * A →L B is a reduced SER if A and B are single items.
  * For finding S1 →Lmax S2, the reduced SERs with a time lag of at most Lmax + T − 1 (from the first itemset of the precursor to the last itemset of the successor) are needed.
  * Lmax + T − 1 time lag tables are used to keep the supports of reduced SERs.

TLT (cont.)
* The support and the last two minimal occurrences of a serial episode are kept in the prefix tree.
  * Supports are kept instead of minimal occurrence lists.
  * The last two minimal occurrences are kept for updating the supports.
* Whenever N ≡ 0 (mod 1/ε), all supports are decreased by 1.
* In addition, the last Lmax + T − 1 n-tuple events are kept for updating the time lag tables.

TLT (Rule Generation)
* Mining SERs
  * Any two serial episodes with supports ≥ (minsup − ε) × N form a candidate SER.
  * A candidate SER is returned if it passes the pruning rules from the observations above.

Experiments
* Real dataset PDOMEI: contains the dryness and climate indices derived by experts, usually used to predict droughts
* Parameter setting: ε = 0.1 × minsup, Lmax = 10

Conclusions
* We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it.
* TLT is more space-efficient, but LossyDL has high precision.
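
As a worked illustration of the Supp(R) and Conf(R) definitions from the Preliminaries, the following Python sketch recomputes the example rule (a A)(b B) →4 (g G) from the minimal-occurrence lists shown on the slides. The function and variable names are assumptions made for this sketch and do not come from the paper. It operates on precomputed minimal occurrences only and does not implement LossyDL or TLT.

```python
from typing import List, Tuple

# Closed time interval [start, end]; a hypothetical type alias for this sketch.
Interval = Tuple[int, int]


def is_valid(occ: Interval, time_bound: int) -> bool:
    """A minimal occurrence [a, b] is valid if (b - a + 1) <= T."""
    a, b = occ
    return b - a + 1 <= time_bound


def episode_support(mo: List[Interval], time_bound: int) -> int:
    """Supp(S): the number of valid minimal occurrences of S."""
    return sum(1 for occ in mo if is_valid(occ, time_bound))


def rule_support(mo_s1: List[Interval], mo_s2: List[Interval],
                 lag: int, time_bound: int) -> int:
    """Supp(R): valid [a, b] in MO(S1) with some valid [c, d] in MO(S2), c - a = lag."""
    valid_starts_s2 = {c for (c, d) in mo_s2 if is_valid((c, d), time_bound)}
    return sum(
        1
        for (a, b) in mo_s1
        if is_valid((a, b), time_bound) and a + lag in valid_starts_s2
    )


# Minimal occurrences and parameters taken from the slide example.
mo_s1 = [(1, 2), (6, 7), (11, 12), (13, 14), (18, 19)]  # MO((a A)(b B))
mo_s2 = [(5, 5), (10, 10), (15, 15), (17, 17)]          # MO((g G))
T, L, L_MAX, N = 3, 4, 5, 19
MINSUP, MINCONF = 0.2, 0.8

supp_r = rule_support(mo_s1, mo_s2, L, T)
conf_r = supp_r / episode_support(mo_s1, T)

# The three conditions of the problem formulation.
is_significant = L <= L_MAX and supp_r >= N * MINSUP and conf_r >= MINCONF

print(supp_r, conf_r, is_significant)  # prints: 4 0.8 True
```

The occurrence [18, 19] of the precursor contributes nothing because no occurrence of (g G) starts at time 22, which is why the confidence is 4/5 rather than 1.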