Mining Patterns in Long
Sequential Data with Noise
Wei Wang, Jiong Yang, Philip S. Yu
ACM SIGKDD Explorations Newsletter
Volume 2, Issue 2 (December 2000)
Special issue on “Scalable data mining algorithms”
Outline
• Introduction
• Injection of noises
– Asynchronous Patterns
– Meta Patterns
• Over-population of uninteresting patterns
• Conclusions
Introduction
• Pattern discovery in time series data or
some inherent physical structure →
mining patterns in long sequential data
• Application :
– Bio-Medical Study : chromosomes as
sequences of amino acids
– Performance Analysis : system-monitoring
application
– Client Profile : User profiles can be built
based on the discovered pattern on trace logs
Introduction (cont.)
• Tolerable noise may take different
forms (depending on the type of
application and the user's interests):
– Injection of noises.
– Over-population of uninteresting
patterns.
Injection of noises
• Two models are proposed to address
the issues of accommodating insertion
of random noises and characterizing
change of behavior.
– Asynchronous Patterns
– Meta Patterns
Asynchronous Patterns
• Earlier periodic-pattern mining assumed that
disturbance is allowed only in the form of "missing
occurrences", not in the more general form of
"insertion of random noise events".
– "Smith reads newspaper every morning" is a periodic
pattern → missing occurrences only (synchronous)
– Inventory replenishment of cold medicine: the refill
time shifts to the 3rd week of the month (no longer the
beginning of the month) → insertion of
random noise events (asynchronous)
Asynchronous Patterns (cont.)
• Valid segment: must consist of at least
min_rep contiguous repetitions of the
pattern; each piece of disturbance may be
at most max_dis long.
• Valid subsequence: a set of non-overlapping valid segments.
• Longest valid subsequence: a valid
subsequence with the most overall
repetitions.
Example (figure annotations):
• D1, D2, ..., D19 are 19 matches of (d1,*,*).
• If min_rep = 5, then S1, S2 and S4 are valid segments; S3 is not.
• S1 and S4 are valid segments, but dis = 9 > max_dis = 6.
• S2 and S4 can form a valid subsequence whose overall number of repetitions is 10.
• Both X and Y are extendible (given a position i and a subsequence ending at position j, j < i: if j ≧ i - max_dis - 1, then the subsequence ending at j is extendible).
• X dominates Y at position 20 (iff the number of repetitions of X ≧ that of Y).
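The valid-segment definition can be illustrated with a short sketch. It only finds runs of contiguous repetitions that satisfy min_rep; it does not join segments across disturbances (max_dis) or search for the longest valid subsequence. The function names and the tuple-with-"*" pattern representation are assumptions made for illustration, not the authors' implementation.

```python
WILDCARD = "*"

def matches(seq, start, pattern):
    """True if `pattern` matches `seq` at index `start` ('*' matches anything)."""
    end = start + len(pattern)
    if end > len(seq):
        return False
    return all(p == WILDCARD or p == s for p, s in zip(pattern, seq[start:end]))

def valid_segments(seq, pattern, min_rep):
    """Return (start_index, repetitions) for every maximal run of contiguous
    matches of `pattern` that has at least `min_rep` repetitions."""
    period = len(pattern)
    segments = []
    covered = set()  # match positions already counted inside an earlier run
    for i in range(len(seq)):
        if i in covered or not matches(seq, i, pattern):
            continue
        reps, j = 0, i
        while matches(seq, j, pattern):
            covered.add(j)
            reps += 1
            j += period
        if reps >= min_rep:
            segments.append((i, reps))
    return segments

if __name__ == "__main__":
    events = list("axxaxxaxxbbaxxaxx")
    print(valid_segments(events, ("a", "*", "*"), min_rep=3))
```

In this made-up sequence the first run has three repetitions and qualifies for min_rep = 3, while the second run has only two and is rejected.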
Asynchronous Patterns (cont.)
• Distance-based pruning of candidate patterns
– Given a symbol d and a period l, if Cdl ≧ min_rep threshold,
then it is possible that d might participate in some valid pattern of
period l.
– For example, given {A,B,D,A,C,A,A,C,A,A,A,B,A,A,C} and
min_rep=3, A and C may belong to a valid pattern, while B and D cannot.
• Apriori property of complex patterns
– A valid segment of a pattern is also a valid segment of any
pattern with fewer symbols specified in the pattern.
– For example, a valid segment for (d1,d2,*) will also be one
for (d1,*,*).
– For example, if (d1,*,*,*) and (d2,*,*,*) are valid, then three
candidate 2-patterns can be generated: (d1,d2,*,*), (d1,*,d2,*) and (d1,*,*,d2).
– Similarly, (d1,d2,d3,*) can become a candidate 3-pattern
only if (d1,d2,*,*), (d1,*,d3,*) and (*,d2,d3,*) are all valid.
• Extendibility and subsequence dominance
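The Apriori-style candidate check can be sketched as follows: a pattern is kept as a candidate only if every pattern obtained by replacing one of its specified symbols with "*" is already known to be valid. The helper names `sub_patterns` and `is_candidate` are hypothetical, and patterns are represented as plain tuples purely for illustration.

```python
def sub_patterns(pattern):
    """All patterns obtained by replacing exactly one specified symbol with '*'."""
    subs = []
    for i, symbol in enumerate(pattern):
        if symbol != "*":
            subs.append(pattern[:i] + ("*",) + pattern[i + 1:])
    return subs

def is_candidate(pattern, valid_patterns):
    """A k-pattern is a candidate only if all of its (k-1)-sub-patterns are valid."""
    return all(sub in valid_patterns for sub in sub_patterns(pattern))

if __name__ == "__main__":
    valid = {
        ("d1", "d2", "*", "*"),
        ("d1", "*", "d3", "*"),
        ("*", "d2", "d3", "*"),
    }
    print(is_candidate(("d1", "d2", "d3", "*"), valid))
```

If any one of the three 2-patterns is missing from the valid set, the 3-pattern is pruned without scanning the sequence.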
Meta patterns
• Let S={a,b,c…} be a set of literals.
– Basic pattern: each component in the
pattern is restricted to be either a literal or
a “*”.
• biweekly replenishment P1=(r:[1,1],*:[2,2])
• triweekly replenishment P2=(r:[1,1],*:[2,3])
– Meta pattern: may have pattern(s)/meta-pattern(s) as its component(s).
• Two-level periodicity
(P1:[1,24],*:[25,25],P2:[26,52])
• Three components: P1,*,P2
Meta patterns (cont.)
• ((r:[1,1],*:[2,2]):[1,24],*:[25,25],(r:[1,1],*:[2,3]):[26,52])
– Length of a component: (r:[1,1],*:[2,2]) = 24
– Span of meta-pattern: 52
– Abbreviation: ((r,*):[1,24],*,(r,*:[2,3]):[26,52])
– Level of meta-pattern: max level of its components + 1
• Level of basic pattern is 1. For instance, (r,*:[2,3])
is level 1
• P1 = ((r,*):[1,24],*,(r,*:[2,3]):[26,52]) is level 2
• the components of a meta-pattern do not have to
be of the same level. For instance,
(P1:[1,260],*:[261,300]) is level 3 (P1 is level 2)
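The level and span rules can be checked with a small sketch. It assumes a hypothetical representation in which a pattern is a list of (component, start, end) triples and a component is either a string (a literal or "*") or a nested pattern. Literals are treated as level 0, so a basic pattern comes out at level 1.

```python
def level(pattern):
    """Level = 1 + max level of the nested components (a basic pattern is level 1)."""
    nested = [level(comp) for comp, _, _ in pattern if not isinstance(comp, str)]
    return 1 + max(nested, default=0)

def span(pattern):
    """Span = last position covered by any component."""
    return max(end for _, _, end in pattern)

# Basic patterns from the slides: (r:[1,1],*:[2,2]) and (r:[1,1],*:[2,3])
biweekly = [("r", 1, 1), ("*", 2, 2)]
triweekly = [("r", 1, 1), ("*", 2, 3)]

# ((r,*):[1,24], *:[25,25], (r,*:[2,3]):[26,52])
two_level = [(biweekly, 1, 24), ("*", 25, 25), (triweekly, 26, 52)]

# (P1:[1,260], *:[261,300]) with P1 being the level-2 pattern above
three_level = [(two_level, 1, 260), ("*", 261, 300)]

if __name__ == "__main__":
    for name, p in [("biweekly", biweekly), ("two_level", two_level),
                    ("three_level", three_level)]:
        print(name, "level", level(p), "span", span(p))
```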
Meta patterns (cont.)
• Figure 2(a) : min_rep=3, max_dis=4, a meta-pattern
((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],(a,b,*):[31,49],*:[50,
51],(b,c):[52,57],*:[58,60])
• Figure 2(b): Many patterns/meta-patterns may collocate or overlap
within any given portion of a sequence. For example, both (a, b, a, *)
and (a, *) are valid within the subsequence.
Meta patterns (cont.)
How to identify the “proper” candidate ?
• Component location property: can provide a substantial
inter-level pruning effect during the generation of high level
candidates from valid low level meta-patterns.
– A valid low level meta-pattern may serve as a component of a higher
level meta-pattern only if its presence in the symbol sequence
exhibits some cyclic behavior, and such cyclic behavior has to follow
the same periodicity as the higher level meta-pattern a sufficient
number of times (i.e., at least min_rep times).
– For example, X1 can serve as a component of a higher level
meta-pattern X2:
X1=((a,b,*):[1,19],*:[20,21])
X2=((a,b,*):[1,19],*:[20,21],(b,c):[22,27],*:[28,30],X1:[31,150]
• Apriori property: can render some pruning power to
conduct the mining process of meta-patterns of the same level.
Meta patterns (cont.)
In Figure 3, the pruning effects provided by the component
location property and the Apriori property are indicated
by dashed arrows and solid arrows, respectively.
Over-population of uninteresting
patterns
• In some applications, the number of
occurrences may not represent the
significance of a pattern.
– Computational Biology : gene expressions
– Web server load : the high workload on all
servers may occur at a much lower frequency
than other states.
– Earthquake: a big earthquake is much more
valuable even though it occurs at a much
lower frequency than smaller ones.
Over-population of uninteresting
patterns (cont.)
• Information gain : is a measurement of how
likely a pattern will occur or the amount of
"surprise" when a pattern actually occurs.
• Information model : For a given minimum
information gain threshold, let Ψ be the set
of patterns that satisfy this threshold.
• Support model: in order to find all patterns
in Ψ, the minimum support threshold has to
be set very low → too many patterns are
discovered.
Information gain
Let E = {a1 , a2 , . . . an} be a set of distinct events.
The event sequence is a sequence of events in E.
• Information carried by an event ai (ai ∈ E) is
defined to be I(ai) = -log_|E| Prob(ai), i.e. the logarithm is taken to base |E|.
– |E|: the number of events in E
– Prob(ai): the probability that ai occurs =
Num(ai)/N
• Information gain: for a pattern P in an event
sequence D, the information gain of P in D is
defined as G(P) = I(P) × (Support(P) - 1).
Information gain (cont.)
I(ai) = -log_|E| Prob(ai)
G(P) = I(P) × (Support(P) - 1)
E = {a1, a2, ..., a6}
|E| = 6, N = 40
Support((a2, a6, *, *)) = Repetition((a2, a6, *, *)) = 3
G((a2, a6, *, *)) = (I(a2) + I(a6)) × (Support((a2, a6, *, *)) - 1)
= (0.90 + 1.45) × (3 - 1) = 2.35 × 2 = 4.70
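The arithmetic of this example can be reproduced with a short sketch. The slides give only the resulting values 0.90 and 1.45. The occurrence counts used below (a2 occurring 8 times and a6 occurring 3 times in a sequence of 40 events over 6 distinct events) are assumptions chosen because they reproduce those values; they are not stated in the slides. The function names are likewise illustrative.

```python
import math

def information(count, total, n_events):
    """I(a) = -log_{|E|} Prob(a), with Prob(a) = Num(a) / N."""
    return -math.log(count / total, n_events)

def information_gain(event_infos, support):
    """G(P) = I(P) * (Support(P) - 1), where I(P) sums the information
    of the specified (non-'*') events in the pattern."""
    return sum(event_infos) * (support - 1)

N_EVENTS = 6   # |E|, from E = {a1, ..., a6}
TOTAL = 40     # assumed sequence length N
i_a2 = information(8, TOTAL, N_EVENTS)   # assumed Num(a2) = 8
i_a6 = information(3, TOTAL, N_EVENTS)   # assumed Num(a6) = 3

if __name__ == "__main__":
    print(round(i_a2, 2), round(i_a6, 2))
    print(round(information_gain([i_a2, i_a6], support=3), 2))
```

The slide value 4.70 comes from summing the rounded terms 0.90 and 1.45, so the unrounded gain computed here differs from it slightly.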
Over-population of uninteresting
patterns (cont.)
• Two traces:
– Scour is a web search engine that is
specialized for multimedia contents.
– IBM Intranet traces consist of 160 critical
nodes, e.g., file servers, routers, etc., in the
IBM T. J. Watson Intranet.
• For example, the pattern (nodea_fail,*, nodeb_saturated,*)
has the eighth highest information gain. This pattern means that a
short time after a router (nodea) fails, the CPU on another node
(nodeb) is saturated. Under a thorough investigation, we found
that nodeb is a file server and, after nodea fails, all requests to
some files are sent to nodeb, which causes the bottleneck.
Conclusion
• This paper discusses three recent research
advances in mining patterns in time series
data in the presence of noise.
– J. Yang, W. Wang, and P. Yu. Mining asynchronous
periodic patterns in time series data. Proc. ACM
SIGKDD Int. Conf. on Knowledge Discovery and Data
Mining (SIGKDD), pp. 275-279, 2000.
– J. Yang, W. Wang, and P. Yu. Meta-patterns: revealing
hierarchy of periodic patterns. IBM Research
Report, 2001.
– J. Yang, W. Wang, and P. Yu. InfoMiner: mining
significant periodic patterns with rare events in time
series data. IBM Research Report, 2001.