Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Mining Patterns in Long Sequential Data with Noise Wei Wang, Jiong Yang, Philip S. Yu ACM SIGKDD Explorations Newsletter Volume 2 , Issue 2 (December 2000) Special issue on “Scalable data mining algorithms” Outline • Introduction • Injection of noises – Asynchronous Patterns – Meta Patterns • Over-population of uninteresting patterns • Conclusions Introduction • Pattern discovery in time series data or some inherent physical structure → mining patterns in long sequential data • Application : – Bio-Medical Study : chromosomes as sequences of amino acids – Performance Analysis : system-monitoring application – Client Profile : User profiles can be built based on the discovered pattern on trace logs Introduction (cont.) • Tolerable noises may in different formats (depend on the type of application and the user’s interests) : – Injection of noises. – Over-population of uninteresting patterns. Injection of noises • Two models are proposed to address the issues of accommodating insertion of random noises and characterizing change of behavior. – Asynchronous Patterns – Meta Patterns Asynchronous Patterns • Mining periodic pattern assumed that Disturbance is allowed only in terms of "missing occurrences" but not as general as any "insertion of random noise events". – "Smith reads newspaper every morning" is a periodic pattern. missing occurrences (synchronization) – inventory replenishment of cold medicine : the refill time shifts to the 3rd week of a month (not the beginning of the month any longer). insertion of random noise event (asynchronous) Asynchronous Patterns (cont.) • Valid segments : is required to be of at least min_rep contiguous repetitions of the pattern and the length of each piece of disturbance is allowed only up to max_dis. • Valid subsequence : is a set of nonoverlapping valid segments. • Longest valid subsequence : A valid subsequence with the most overall repetitions. D1,D2~D19 are 19 matches of (d1,*,*) If min-rep=5, S1,S2,S4 are valid segments , S3 is not S1 & S4 : valid segment (dis=9 > mix_dis=6 ) S2 & S4 can be a valid subsequence whose overall # of repetitions is 10 Both X And Y are extendible (given position i and ending position j (j<i), if j≧ Imax_dis-1 then j is extendible), X dominates Y at position 20 (iif the number of repetitions of X ≧ Y) Asynchronous Patterns (cont.) • Distance-based pruning candidate patterns For example, if (d1,*,*,*) and of (d2,*,*,*) are valid, then – Given a symbol2-patterns d and a period l, ifgenerated: Cdl ≧ min_rep three candidates can be (d1threshold, ,d2,*,*) , (d , (d1,*,*,d then it’s possible that2)d. might participate in 1,*,d 2,*) Similarly, (d1, pattern d2, d3,*)ofcan become some valid period l. a candidates 3pattern only if (d{A,B,D,A,C,A,A,C,A,A,A,B,A,A,C d3,*) 1, d2,*,*), (d1,*, d3,*) and (*, d}2,and – For example, aremin_rep=3, all valid. then A,C may be valid pattern and B,D not be • Apriori property of complex patterns – a valid segment of a pattern is also a valid segment of any pattern with fewer symbols specified in the pattern. – For example, a valid segment for (dl,d2,*) will also be one for (dl,*,*). • Extendibility and subsequence dominate Meta patterns • Let S={a,b,c…} be a set of literals. – Basic pattern: each component in the pattern is restricted to be either a literal or a “*”. • biweekly replenishment P1=(r:[1,1],*:[2,2]) • triweekly replenishment P2=(r:[1,1],*:[2,3]) – Meta pattern : may have pattern(s)/metapattern(s) as its component(s). • Two-level periodicity (P1:[1,24],*:[25,25],P2:[26,52]) • Three components: P1,*,P2 Meta patterns (cont.) • ((r:[1,1],*:[2,2]):[1,24],*:[25,25],(r:[1,1],*:[2,3]):[26,52]) – – – – length of a component : (r:[1,1],*:[2,2])=24 Span of meta-pattern : 52 Abbreviation : ((r,*):[1,24],*,(r,*:[2,3]):[26,52]) Level of meta-pattern: max level of its component +1 • Level of basic pattern is 1. For instance, (r,*:[2,3]) is level 1 • P1 = ((r,*):[1,24],*,(r,*:[2,3]):[26,52]) is level 2 • the components of a meta-pattern do not have to be of the same level. For instance, (P1:[1,260],*:[261,300]) is level 3 (P1 is level 2) Meta patterns (cont.) • Figure 2(a) : min_rep=3, max_dis=4, a meta-pattern ((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],(a,b,*):[31,49],*:[50, 51],(b,c):[52,57],*:[58,60]) • Figure 2(b) : Many patterns/meta-patterns may collocate or overlap for any given portion of a sequence. For example,both of (a, b, a, *) and (a, *) are valid within the subsequence. Meta patterns (cont.) How to identify the “proper” candidate ? • Component location property : can provide substantial inter-level pruning during of a A valid low level meta-pattern may serveeffect as a component higher level meta-pattern onlylevel if its presence in the symbol the generation of high candidates from sequence exhibits cyclic behavior and such cyclic valid low levelsome meta-patterns. behavior has to follow the same periodicity as the higher level • Apriori property : cannumber render some pruning meta-pattern by sufficient of times (i.e., at least min_rep to times). power conduct the mining process of • For example: X1 can server as a component of a higher level meta-patterns of the same level. meta-pattern X2 X1=((a,b,*):[1,19],*:[20,21]) X2=((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],X1:[31,150] Meta patterns (cont.) Figure 3, the pruning effects provided by the component location property and the Apriori property are indicated by dashed arrows and solid arrows, respectively. Over-population of uninteresting patterns • In some applications, the number of occurrences may not represent the significance of a pattern. – Computational Biology : gene expressions – Web server load : the high workload on all servers may occur at a much lower frequency than other states. – Earthquake : big earthquake is much more valuable even though it occurs at a much lower frequency than smaller ones. Over-population of uninteresting patterns (cont.) • Information gain : is a measurement of how likely a pattern will occur or the amount of "surprise" when a pattern actually occurs. • Information model : For a given minimum information gain threshold, let Ψ be the set of patterns that satisfy this threshold. • Support model : in order to find all patterns in Ψ, the minimum support threshold has to be set very low. --> too many patterns discovered. next Information gain Let E = {a1 , a2 , . . . an} be a set of distinct events. The event sequence is a sequence of events in E. • information carried by an event ai (ai E) is defined to be I(ai) = -log|E| Prob(ai) – |E| : is # of events in E – Prob(ai) : the probability that ai occurs = Num(ai)/N • information gain : a pattern P in an event sequence D, the information gain of P in D is defined as G(P) = I ( P ) x (Support(P) - 1). Information gain (cont.) I(ai)= -log|E| Prob(ai) G(P)= I (P) x (Support(P) - 1) E = {a1 , a2 , . . . a6} |E| = 40 Support (a2 , a6 , * , *) = Repetition((a2 , a6 , * , * )) = 3 G((a2 , a6 , * ,* )) = I(I(a2)+I(a6)) x (Support ((a2 , a6 , * ,* ))-1 ) = (0.90+1.45) x (3-1) = 2.35 x 2 = 4.70 back Over-population of uninteresting patterns (cont.) • Two traces : For example, the pattern (nodea_fail,*, nodeb_saturated,*) – Scour is a web search engine that is has the eighth highest information gain. This pattern means that a specialized for(node multimedia contents. short time after a router a) fails, the CPU on another node (nodeb) is saturated. Under a thorough investigation, we found – IBM Intranet traces consist of 160 critical that nodeb is a file server and after nodea fails, all requests to nodes, e.g.,file routers, etc., in the some files are sent to nodebservers, , thus causes the bottleneck. IBM T. J. Watson Intranet. Conclusion • This paper discuss three recent research advances of mining patterns in time series data given the presence of noise. – J. Yang, W. Wang, and P. Yu. Mining asynchronous periodic patterns in time series data. Proc. ACM SIGKDD Int. Conf. on Knowledge Discovery and Data Mining (SIGKDD), pp. 275-279, 2000. – J. Yang, W. Yang, and P. Yu. Meta-patterns: revealing hierarchy of periodic patterns. IBM Research Report,2001. – J. Yang, W. Yaug, and P. Yu. InfoMiner: mining significant periodic patterns with rare events in time series data. IBM Research Report, 2001.