Download Mining Serial Episod..

Mining Serial Episode Rules with Time Lags over Multiple Data Streams Tung-Ying Lee, En Tzu Wang Dept. of CS, National Tsing Hua Univ. (Taiwan) Arbee L.P. Chen Dept. of CS, National Chengchi Univ. (Taiwan) DaWaK’08 Outline  Introduction  Related work  Preliminaries  Method  Experiments  Conclusions Introduction  In many applications, data are generated as a form of continuous data streams.  Continuously detecting flow and occupancy of a road to qualify the congestion condition of a road forms data streams  When roads A and B have heavy traffic, 5 mins later, road C will most likely be congested  Serial episode rules with time lags (SER) : XlagY Related Work  Finding episodes/episode rules from static time series data has been studied for decades  Episodes Precursor B Successor D A A B D C Episode Serial episode A B D L Serial episode rule E Preliminaries  Environment: a centralized system collecting n synchronized data streams DS1, DS2, …, DSn  n-tuple event: a set of items coming from all streams at the same time  itemset: a subset of an n-tuple event  serial episode: described as an ordered list of itemsets e.g. serial episode (aA)(bB) Itemset {gA} time: 1, 2, 3, 4, 5, 6, 7, 8 DS1: a, b, b, c, g, a, b, f DS2: A, B, S, G, A, B, A, F … DSn: , , , , , , ,  n-tuple event Preliminaries (cont.)  Minimal occurrence: given a serial episode S, a time interval [a, b] is a minimal occurrence of S, if  S occurs in [a, b]  S does not occur in any proper subintervals of [a, b]  If (b-a+1)  T, a time bound given by users, [a, b] is valid  MO(S): the set of all minimal occurrences of S  Supp(S): the number of valid minimal occurrences of S Time bound T: 3 DS1 DS2 Serial episodes Minimal Occurrences Support (a A)(b B) [1, 2], [6, 7], [11, 12], [13, 14], [18, 19] 5 (g G) [5, 5], [10, 10], [15, 15], [17, 17] 4 6 Preliminaries (cont.)  A SER is R: S1Lag = LS2  Supp(R): |{[a, b]|[a, b]MO(S1)[a, b]: valid  [c, d] MO(S2)[c, d]: valid s.t. (c-a) = L}  Conf(R) = Supp(R)/Supp(S1) 4 Time bound T: 3 DS1 DS2 Serial episode rules (a A)(b B)→4 (g G) 7 Support, Confidence [1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4, 12]→[15, 15], [13, 14]→[17, 17] Conf: 4/5 = 0.8 Minimal Occurrences Preliminaries (cont.)  Problem Formulation: given 4 parameters  the maximum time lag (Lmax)  the minimum support (minsup)  the minimum confidence (minconf)  the time bound (T)  Find all SERs e.g. R: S1Lag = LS2 satisfying  L  Lmax  Supp(R)  N  minsup, (N: the number of received n-tuple events)  Conf(R)  minconf  Calculating supports for serial episodes and SERs must take T into account Time bound T: 3 DS1 DS2 Serial episode rules Minimal Occurrences Support, Confidence [1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4, 4  (N=19)  0.2 (a A)(b B)→4 (g G) 12]→[15, 15], [13, 14]→[17, 17] Conf: 4/5 = 0.8 8 Lmax Minsup 5 0.2 Minconf 0.8 T 3 Preliminaries (cont.)  Using the prefix tree for keeping serial episodes  S: a serial episode, X: an item  S+X: X follows S  S+_X: X and the last itemset in S appear at the same time A Serial episode (AB) _B Level 0  Root B B Serial episode (A)(B) 9 Level 1 Level 2 LossyDL  The concept of LossyDL: keeping the valid minimal occurrences of a serial episode for generating rules Each item in the current 2-tuple event needs to be processed (traversing in a bottom-up order) The last two minimal occurrences needs to be checked Using Lossy Counting [VLDB02], whenever N  0 mod 1/, the oldest minimal occurrence is removed 10 Processing C can generate (B)(C): [2, 3] and (BC): [3, 3]  At time point = 3, a 2-tupe event (BC) arrives, T = 3 B A [1, 1] B [2, 2] [3, 3] B [1, 2] [1, 3]: not minimal B [1, 3] [2, 3] LossyDL (Rule Generation)  Mining SERs  For any two serial episode with supports  (minsup  )  N are checked to see if any minimal occurrences of them can be combined. Then, Supp(R) can be computed  For each R: S1Lag = LS2, it will be returned if  Supp(R)  (minsup  )  N, and  (Supp(R) + N)/Supp(S1)  minconf Serial episodes (a A)(b B) (g G) Serial episode rules 11 (a A)(b B)→4 (g G) Minimal Occurrences [1, 2], [6, 7], [11, 12], [13, 14], [18, 19] [5, 5], [10, 10], [15, 15], [17, 17] Minimal Occurrences [1, 2]→[5, 5], [6, 7]→[10, 10], [11, 12]→[15, 15], [13, 14]→[17, 17] TLT  A lot of minimal occurrences are kept in LossyDL, but only the last two are used while updating  Observations  XL(AB) and XLA, obviously Supp(XLA)  Supp(XL(AB)): XL(AB) is not significant if XLA does not satisfy one of minsup and minconf  (AB)L(CD) and ALC, obviously Supp(ALC)  Supp((AB)L(CD)): (AB)L(CD) is not significant if Supp(ALC) < Supp(AB)  minconf 12 TLT (cont.)  Observations (cont.):  Given a SER: (A)(B)5(CD), and T = 3  A1B or A2B, that is ApB, 0<p< T (T1 types)  A1B4(CD), A2B3(CD), that is ApBLp(CD)  Supp(ApBLp(CD))  min(Supp(ApB), Supp(BLpC))  (A)(B)5(CD) is not significant, if  pmin(Supp(ApB), Supp(BLpC)) < Supp(A)(B)  minconf  Using the observations to prune insignificant rules  Time lag table (TLT)  ALB is a reduced SER, if A and B are single items  For finding S1LmaxS2, the reduced SERs having a time lag at most Lmax+T1 (from the first itemset of precursor to the last itemset of successor)  Using Lmax+T1 Time Lag Tables to keep the supports of reduced SER 13 TLT (cont.)  The support and the last two minimal occurrences of an serial episode are kept in the prefix tree  Keeping supports instead of keeping minimal occurrence lists  Keeping the last two minimal occurrences for updating the supports  Whenever N  0 mod 1/, all supports are decreased by 1  In addition, the last Lmax+T1 n-tuple events are kept for updating the Time Lag Tables 14 TLT (Rule Generation)  Mining SERs  Any two serial episode with supports  (minsup  )  N form the candidate SERs  A candidate SER will be returned if it can pass the pruning rules from the above observations 15 Experiments  real dataset  PDOMEI: the dataset contains the dryness and climate indices derived by experts, usually used to predict droughts  Parameter setting   = 0.1minsup  Lmax = 10 16 17 Conclusions  We address the problem of finding significant serial episode rules with time lags over multiple data streams and propose two methods to solve it. TLT is more space-efficient, but LossyDL has high precision 18

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download Mining Serial Episod..