Download Mining Serial Episod..

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Mining Serial Episode Rules with Time
Lags over Multiple Data Streams
Tung-Ying Lee, En Tzu Wang
Dept. of CS, National Tsing Hua Univ. (Taiwan)
Arbee L.P. Chen
Dept. of CS, National Chengchi Univ. (Taiwan)
DaWaK’08
Outline
 Introduction
 Related work
 Preliminaries
 Method
 Experiments
 Conclusions
Introduction
 In many applications, data are generated as a form of
continuous data streams.
 Continuously detecting flow and occupancy of a road to qualify
the congestion condition of a road forms data streams
 When roads A and B have heavy traffic, 5 mins later, road C will
most likely be congested
 Serial episode rules with time lags (SER) : XlagY
Related Work
 Finding episodes/episode rules from static time series data
has been studied for decades
 Episodes
Precursor
B
Successor
D
A
A
B
D
C
Episode
Serial episode
A
B
D
L
Serial episode rule
E
Preliminaries
 Environment: a centralized system collecting n synchronized data
streams DS1, DS2, …, DSn
 n-tuple event: a set of items coming from all streams at the same time
 itemset: a subset of an n-tuple event
 serial episode: described as an ordered list of itemsets
e.g. serial episode (aA)(bB)
Itemset {gA}
time:
1, 2, 3, 4, 5, 6, 7, 8
DS1: a, b, b, c, g, a, b, f
DS2: A, B, S, G, A, B, A, F
…
DSn: , , , , , , , 
n-tuple event
Preliminaries (cont.)
 Minimal occurrence: given a serial episode S, a time interval [a, b] is a
minimal occurrence of S, if
 S occurs in [a, b]
 S does not occur in any proper subintervals of [a, b]
 If (b-a+1)  T, a time bound given by users, [a, b] is valid
 MO(S): the set of all minimal occurrences of S
 Supp(S): the number of valid minimal occurrences of S
Time bound T: 3
DS1
DS2
Serial episodes Minimal Occurrences
Support
(a A)(b B)
[1, 2], [6, 7], [11, 12], [13, 14], [18, 19] 5
(g G)
[5, 5], [10, 10], [15, 15], [17, 17]
4
6
Preliminaries (cont.)
 A SER is R: S1Lag = LS2
 Supp(R): |{[a, b]|[a, b]MO(S1)[a, b]: valid  [c, d] MO(S2)[c,
d]: valid s.t. (c-a) = L}
 Conf(R) = Supp(R)/Supp(S1)
4
Time bound T: 3
DS1
DS2
Serial episode rules
(a A)(b B)→4 (g G)
7
Support,
Confidence
[1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4,
12]→[15, 15], [13, 14]→[17, 17]
Conf: 4/5 = 0.8
Minimal Occurrences
Preliminaries (cont.)

Problem Formulation: given 4 parameters
 the maximum time lag (Lmax)
 the minimum support (minsup)
 the minimum confidence (minconf)
 the time bound (T)

Find all SERs e.g. R: S1Lag = LS2 satisfying
 L  Lmax
 Supp(R)  N  minsup, (N: the number of received n-tuple events)
 Conf(R)  minconf
 Calculating supports for serial episodes and SERs must take T into account
Time bound T: 3
DS1
DS2
Serial episode rules Minimal Occurrences
Support, Confidence
[1, 2]→[5, 5], [6, 7]→[10, 10], [11, Supp: 4, 4  (N=19)  0.2
(a A)(b B)→4 (g G)
12]→[15, 15], [13, 14]→[17, 17]
Conf: 4/5 = 0.8
8
Lmax Minsup
5
0.2
Minconf
0.8
T
3
Preliminaries (cont.)
 Using the prefix tree for keeping serial episodes
 S: a serial episode, X: an item
 S+X: X follows S
 S+_X: X and the last itemset in S appear at the same time
A
Serial episode (AB)
_B
Level 0

Root
B
B
Serial episode (A)(B)
9
Level 1
Level 2
LossyDL
 The concept of LossyDL: keeping the valid minimal occurrences of a
serial episode for generating rules
Each item in the current 2-tuple
event needs to be processed
(traversing in a bottom-up order)
The last two minimal occurrences
needs to be checked
Using Lossy Counting [VLDB02],
whenever N  0 mod 1/, the
oldest minimal occurrence is
removed
10
Processing C can
generate (B)(C): [2, 3]
and (BC): [3, 3]

At time point = 3, a 2-tupe
event (BC) arrives, T = 3
B
A
[1, 1]
B
[2, 2] [3, 3]
B
[1, 2] [1, 3]: not minimal
B
[1, 3]
[2, 3]
LossyDL (Rule Generation)
 Mining SERs
 For any two serial episode with supports  (minsup  )  N are
checked to see if any minimal occurrences of them can be combined.
Then, Supp(R) can be computed
 For each R: S1Lag = LS2, it will be returned if
 Supp(R)  (minsup  )  N, and
 (Supp(R) + N)/Supp(S1)  minconf
Serial episodes
(a A)(b B)
(g G)
Serial episode rules
11
(a A)(b B)→4 (g G)
Minimal Occurrences
[1, 2], [6, 7], [11, 12], [13, 14], [18, 19]
[5, 5], [10, 10], [15, 15], [17, 17]
Minimal Occurrences
[1, 2]→[5, 5], [6, 7]→[10, 10], [11,
12]→[15, 15], [13, 14]→[17, 17]
TLT
 A lot of minimal occurrences are kept in LossyDL, but only the last two are used
while updating
 Observations
 XL(AB) and XLA, obviously Supp(XLA)  Supp(XL(AB)): XL(AB) is
not significant if XLA does not satisfy one of minsup and minconf
 (AB)L(CD) and ALC, obviously Supp(ALC)  Supp((AB)L(CD)):
(AB)L(CD) is not significant if Supp(ALC) < Supp(AB)  minconf
12
TLT (cont.)
 Observations (cont.):
 Given a SER: (A)(B)5(CD), and T = 3
 A1B or A2B, that is ApB, 0<p< T (T1 types)
 A1B4(CD), A2B3(CD), that is ApBLp(CD)
 Supp(ApBLp(CD))  min(Supp(ApB), Supp(BLpC))
 (A)(B)5(CD) is not significant, if

pmin(Supp(ApB), Supp(BLpC)) < Supp(A)(B)  minconf
 Using the observations to prune insignificant rules
 Time lag table (TLT)
 ALB is a reduced SER, if A and B are single items
 For finding S1LmaxS2, the reduced SERs having a time lag at most Lmax+T1 (from
the first itemset of precursor to the last itemset of successor)
 Using Lmax+T1 Time Lag Tables to keep the supports of reduced SER
13
TLT (cont.)
 The support and the last two minimal occurrences of an serial episode are
kept in the prefix tree
 Keeping supports instead of keeping minimal occurrence lists
 Keeping the last two minimal occurrences for updating the supports
 Whenever N  0 mod 1/, all supports are decreased by 1
 In addition, the last Lmax+T1 n-tuple events are kept for updating the Time
Lag Tables
14
TLT (Rule Generation)
 Mining SERs
 Any two serial episode with supports  (minsup  )  N form the
candidate SERs
 A candidate SER will be returned if it can pass the pruning rules from the
above observations
15
Experiments
 real dataset
 PDOMEI: the dataset contains the dryness and climate indices
derived by experts, usually used to predict droughts
 Parameter setting
  = 0.1minsup
 Lmax = 10
16
17
Conclusions
 We address the problem of finding significant serial episode rules with time
lags over multiple data streams and propose two methods to solve it. TLT is
more space-efficient, but LossyDL has high precision
18