7. Sequence Mining
Sequences and Strings
Recognition with Strings
MM & HMM
Sequence Association Rules
7/03
Data Mining – Sequences
H. Liu (ASU) & G Dong (WSU)
1
Sequences and Strings
• A sequence x is an ordered list of discrete items, such
as a sequence of letters or a gene sequence
– Sequences and strings are often used as synonyms
– String elements (characters, letters, or symbols) are nominal
– Text is a type of particularly long string
• |x| denotes the length of sequence x
– |AGCTTC| is 6
• Any contiguous string that is part of x is called a
substring, segment, or factor of x
– GCT is a factor of AGCTTC
Recognition with Strings
• String matching
– Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
– Given two strings x and y, compute the minimum
number of basic operations (character insertions,
deletions and exchanges) needed to transform x into y
String Matching
• Given |text| >> |x|, with characters taken from an
alphabet A
– A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…}
• A shift s is an offset needed to align the first
character of x with character number s+1 in text
• Find if there exists a valid shift where there is a
perfect match between characters in x and the
corresponding ones in text
Naïve (Brute-Force) String Matching
• Given A, x, text, n = |text|, m = |x|
s = 0
while s ≤ n-m
  if x[1 … m] = text[s+1 … s+m]
    then print "pattern occurs at shift" s
  s = s + 1
• Time complexity (worst case): O((n-m+1)m)
• One character shift at a time is not necessary
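The brute-force scan above can be sketched in Python (a minimal sketch; the function name and the 0-based indexing are my own, where the slide's x[1…m] becomes x[0:m]):

```python
def naive_match(x: str, text: str) -> list[int]:
    """Return every shift s at which pattern x occurs in text.

    Direct transcription of the brute-force algorithm: try each
    shift s = 0 .. n-m and compare the m pattern characters.
    """
    n, m = len(text), len(x)
    shifts = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:  # compare x[1..m] with text[s+1..s+m]
            shifts.append(s)
        s += 1
    return shifts

# GCT is a factor of AGCTTC, at shift 1 (0-based)
print(naive_match("GCT", "AGCTTC"))  # -> [1]
```

Each shift costs up to m comparisons and there are n-m+1 shifts, matching the O((n-m+1)m) worst case stated above.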
Boyer-Moore and KMP
• See StringMatching.ppt; do not use the following algorithm as given
• Given A, x, text, n = |text|, m = |x|
F(x) = last-occurrence function
G(x) = good-suffix function; s = 0
while s ≤ n-m
  j = m
  while j > 0 and x[j] = text[s+j]
    j = j - 1
  if j = 0
    then print "pattern occurs at shift" s
    s = s + G(0)
  else s = s + max[G(j), j - F(text[s+j])]
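A sketch of Boyer-Moore using only the last-occurrence (bad-character) function F; the good-suffix function G is omitted here, so after a full match the pattern simply shifts by one (function names and the 1-based j are mapped to 0-based Python indices):

```python
def last_occurrence(x: str, alphabet) -> dict:
    """F: map each character to its last (1-based) position in x, else 0."""
    F = {c: 0 for c in alphabet}
    for i, c in enumerate(x, start=1):
        F[c] = i
    return F

def bm_bad_character(x: str, text: str) -> list[int]:
    """Boyer-Moore scan using only the bad-character rule."""
    n, m = len(text), len(x)
    F = last_occurrence(x, set(text) | set(x))
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and x[j - 1] == text[s + j - 1]:  # compare right to left
            j -= 1
        if j == 0:
            shifts.append(s)
            s += 1                      # no good-suffix table: shift by 1
        else:
            # skip ahead so the mismatched text character aligns with
            # its last occurrence in x (never shift backwards)
            s += max(1, j - F[text[s + j - 1]])
    return shifts

print(bm_bad_character("GCT", "AGCTTC"))  # -> [1]
```

Because comparison runs right to left, a mismatch on a character absent from x lets the pattern jump m positions at once, which is where the speedup over the naïve scan comes from.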
Edit Distance
• ED between x and y describes how many fundamental
operations are required to transform x to y.
• Fundamental operations (x=‘excused’, y=‘exhausted’)
– Substitutions e.g. ‘c’ is replaced by ‘h’
– Insertions e.g. ‘a’ is inserted into x after ‘h’
– Deletions e.g. a character in x is deleted
• ED is one way of measuring similarity between two
strings
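ED under these three operations is the classic Levenshtein distance, computed by dynamic programming; a minimal sketch:

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to transform x into y (Levenshtein distance)."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
    return D[m][n]

# the slide's example: substitute c->h, insert a, insert t
print(edit_distance("excused", "exhausted"))  # -> 3
```

The table has (m+1)(n+1) cells and each cell costs O(1), so the computation is O(mn) time.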
Classification using ED
• The nearest-neighbor algorithm can be applied for
pattern recognition.
– Training: strings are stored together with their class labels
– Classification (testing): a test string is compared to each
stored string and an ED is computed; the nearest stored
string's label is assigned to the test string.
• The key is how to calculate ED.
• An example of calculating ED
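The training/testing procedure above can be sketched as follows (the labeled training strings are invented for illustration; edit_distance is the usual Levenshtein DP, here in a space-saving single-row form):

```python
def edit_distance(x: str, y: str) -> int:
    # Levenshtein DP keeping only one row of the table
    m, n = len(x), len(y)
    D = list(range(n + 1))
    for i in range(1, m + 1):
        prev, D[0] = D[0], i
        for j in range(1, n + 1):
            cur = D[j]
            D[j] = min(D[j] + 1,                       # deletion
                       D[j - 1] + 1,                   # insertion
                       prev + (x[i - 1] != y[j - 1]))  # substitution
            prev = cur
    return D[n]

def nn_classify(test: str, training: list) -> str:
    """1-NN: assign the label of the stored string nearest to test."""
    return min(training, key=lambda item: edit_distance(test, item[0]))[1]

# hypothetical labeled training strings
train = [("AAGCT", "geneA"), ("GGTTC", "geneB"), ("AACCT", "geneA")]
print(nn_classify("AAGTT", train))  # -> geneA (AAGCT is 1 edit away)
```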
Hidden Markov Model
• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning
Markov Model
• The Markov property:
– given the current state, the transition
probability is independent of any
previous states.
• A simple Markov Model
– State ω(t) at time t
– Sequence of length T:
• ωT = {ω(1), ω(2), …, ω(T)}
– Transition probability
• P(ωj(t+1) | ωi(t)) = aij
– It’s not required that aij = aji
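By the Markov property, the probability of a state sequence factorizes into the initial probability times successive transition probabilities; a small sketch (the 3-state transition matrix and initial distribution are invented):

```python
# hypothetical transition matrix: a[i][j] = P(state j at t+1 | state i at t)
a = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4]]

def sequence_probability(states: list, initial: list) -> float:
    """P(w(1), ..., w(T)) = P(w(1)) * product over t of a[w(t)][w(t+1)]."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[prev][nxt]
    return p

# probability of the state sequence 0 -> 1 -> 1 -> 2, starting in state 0
print(sequence_probability([0, 1, 1, 2], initial=[1.0, 0.0, 0.0]))
```

Note that nothing requires a to be symmetric (aij ≠ aji in general), which the matrix above illustrates.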
Hidden Markov Model
• Visible states
– VT = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t)
– P(vk(t) | ωj(t)) = bjk
• Only the visible states vk(t) are
accessible; the states ωi(t) are
unobservable.
• A Markov model is ergodic if
every state has a nonzero probability of
occurring given some starting state.
Three Key Issues with HMM
• Evaluation
– Given an HMM, complete with transition probabilities aij
and emission probabilities bjk, determine the probability that a
particular sequence of visible states VT was generated by that model
• Decoding
– Given an HMM and a set of observations VT, determine
the most likely sequence of hidden states ωT that led to VT
• Learning
– Given the number of hidden and visible states and a set of
training observations of visible symbols, determine the
probabilities aij and bjk
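The evaluation problem is solved by the forward algorithm, which sums over all hidden paths in O(T·c²) time instead of enumerating them; a minimal sketch with invented 2-state, 2-symbol parameters:

```python
def forward(obs, a, b, pi):
    """P(V^T | model) via the forward algorithm.

    obs:    observed symbol indices v(1..T)
    a[i][j]: transition probability, b[j][k]: emission probability
    pi[i]:  initial state probability
    """
    n_states = len(pi)
    # alpha[i] = P(observations so far, hidden state i at current time)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n_states)]
    for k in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n_states)) * b[j][k]
                 for j in range(n_states)]
    return sum(alpha)

# invented model parameters
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(forward([0, 1, 0], a, b, pi))
```

A sanity check: the probabilities of all possible observation sequences of a fixed length must sum to 1.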
Other Sequential Pattern Mining Problems
• Sequence alignment (homology) and sequence
assembly (genome sequencing)
• Trend analysis
– Trend movement vs. cyclic variations, seasonal variations
and random fluctuations
• Sequential pattern mining
– Various kinds of sequences (weblogs)
– Various methods: from GSP to PrefixSpan
• Periodicity analysis
– Full periodicity, partial periodicity, cyclic association rules
Periodic Pattern (in sequences of transactions)
• Full periodic pattern
– ABC ABC ABC
• Partial periodic pattern
– ABC ADC ACC ABC
• Pattern hierarchy
– ABC ABC ABC DE DE DE DE ABC ABC ABC DE
DE DE DE ABC ABC ABC DE DE DE DE
[ABC:3 | DE:4]
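The two notions above can be checked directly; a small sketch (the '*' wildcard notation for partial periods is my own shorthand for the "don't care" positions):

```python
def is_fully_periodic(seq: str, pattern: str) -> bool:
    """True if seq is an exact repetition of pattern, e.g. ABCABCABC of ABC."""
    p = len(pattern)
    return len(seq) % p == 0 and all(
        seq[i] == pattern[i % p] for i in range(len(seq)))

def matches_partial_period(seq: str, pattern: str) -> bool:
    """Partial periodic pattern: '*' positions in pattern match anything."""
    p = len(pattern)
    return len(seq) % p == 0 and all(
        pattern[i % p] in ('*', seq[i]) for i in range(len(seq)))

print(is_fully_periodic("ABCABCABC", "ABC"))          # -> True
# the slide's partial pattern: A in position 1, C in position 3
print(matches_partial_period("ABCADCACCABC", "A*C"))  # -> True
```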
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using
Equivalence classes)
• Constrained sequence mining (SPIRIT)
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork. Pattern
Classification, 2nd Edition. Wiley-Interscience, 2001.
[Figure (slide 17): three-state Markov model; directed arcs among states 1, 2, 3 labeled with transition probabilities a11, a12, a13, a21, a22, a23, a31, a32, a33]
[Figure (slide 18): three-state hidden Markov model; hidden states 1, 2, 3 with transition probabilities aij, each emitting visible symbols v1–v4 with emission probabilities bjk]
[Figure (slide 19): trellis for evaluation; hidden states 1, 2, 3, …, c unfolded over times t = 1 … T, showing the transitions aij into a state and the emission probability b2k of the observed symbol vk]
[Figure (slide 20): forward-algorithm example for observation sequence v3 v1 v3 v2 followed by the final symbol v0; the highlighted step computes 0.3 × 0.3. Per-state probabilities over time:]

state \ t    0    1      2        3        4
0            0    0      0        0        0.0011
1            1    0.09   0.0052   0.0024   0
2            0    0.01   0.0077   0.0002   0
3            0    0.2    0.0057   0.0007   0
[Figure (slide 21): decoding example; hidden states 1–7 emit the phonemes /v/ /i/ /t/ /e/ /r/ /b/ /i/ and state 0 emits /-/]
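Decoding is solved by the Viterbi algorithm, which replaces the forward algorithm's sum with a max and keeps back-pointers to recover the best hidden path; a sketch reusing the same invented 2-state, 2-symbol parameters as the evaluation example:

```python
def viterbi(obs, a, b, pi):
    """Most likely hidden state sequence for the observed symbols."""
    n = len(pi)
    # delta[i]: best path probability ending in state i; psi: back-pointers
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    psi = []
    for k in obs[1:]:
        back, new = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] * a[i][j])
            back.append(i_best)
            new.append(delta[i_best] * a[i_best][j] * b[j][k])
        psi.append(back)
        delta = new
    # backtrack from the best final state
    path = [max(range(n), key=lambda i: delta[i])]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

# invented model parameters
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(viterbi([0, 1, 1, 0], a, b, pi))  # -> [0, 1, 1, 0]
```

With these nearly diagonal emission probabilities the best hidden path tracks the observations, but with noisier bjk the decoded path can differ from the observed symbols.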
[Figure (slide 22): trellis for decoding; at each time t = 1 … T the maximizing state max(1), max(2), …, max(T) is selected among states 1, 2, 3, …, c]
[Figure (slide 23): decoding on the slide-20 example; the same table of per-state probabilities over t = 0 … 4 for observation sequence v3 v1 v3 v2 followed by v0]