7. Sequence Mining
Sequences and Strings
Recognition with Strings
MM & HMM
Sequence Association Rules
Sequences and Strings
• A sequence x is an ordered list of discrete items, such
as a sequence of letters or a gene sequence
– Sequences and strings are often used as synonyms
– String elements (characters, letters, or symbols) are nominal
– A type of particularly long string: text
• |x| denotes the length of sequence x
– |AGCTTC| is 6
• Any contiguous string that is part of x is called a
substring, segment, or factor of x
– GCT is a factor of AGCTTC
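These definitions map directly onto strings in code; a minimal Python illustration (the variable name is ours):

    # |x| is the length of the sequence; a factor is any contiguous substring.
    x = "AGCTTC"
    print(len(x))        # 6, i.e. |AGCTTC| = 6
    print("GCT" in x)    # True: GCT is a factor (substring) of AGCTTC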
Recognition with Strings
• String matching
– Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
– Given two strings x and y, compute the minimum
number of basic operations (character insertions,
deletions and exchanges) needed to transform x into y
String Matching
• Given |text| >> |x|, with characters taken from an
alphabet A
– A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…}
• A shift s is the offset needed to align the first character of x with character number s+1 in text
• Determine whether there exists a valid shift at which every character of x matches the corresponding character in text
Naïve (Brute-Force) String Matching
• Given A, x, text, n = |text|, m = |x|
    s = 0
    while s ≤ n-m
        if x[1 … m] = text[s+1 … s+m]
            then print "pattern occurs at shift" s
        s = s + 1
• Time complexity (worst case): O((n-m+1)m)
• Shifting by one character at a time is not necessary
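A minimal Python sketch of this brute-force matcher (0-indexed, so shift s aligns x[0] with text[s]; the function name is ours):

    # Try every shift s = 0 .. n-m and compare x against text[s : s+m].
    def naive_match(x, text):
        n, m = len(text), len(x)
        shifts = []
        for s in range(n - m + 1):
            if text[s:s + m] == x:
                shifts.append(s)   # pattern occurs at shift s
        return shifts

    print(naive_match("GCT", "AGCTTC"))   # [1]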
Boyer-Moore and KMP
• See StringMatching.ppt and do not use the following algorithm
• Given A, x, text, n = |text|, m = |x|
    F(x) = last-occurrence function
    G(x) = good-suffix function
    s = 0
    while s ≤ n-m
        j = m
        while j > 0 and x[j] = text[s+j]
            j = j - 1
        if j = 0
            then print "pattern occurs at shift" s
                 s = s + G(0)
            else s = s + max[G(j), j - F(text[s+j])]
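For a runnable point of comparison, here is a sketch of the simpler Boyer-Moore-Horspool variant, which keeps only the last-occurrence (bad-character) heuristic F and drops the good-suffix function G; it is a named simplification, not the full algorithm above:

    # Horspool: compare right to left; on a mismatch, slide the pattern by
    # the last-occurrence rule for the text character under its last position.
    def horspool_match(x, text):
        n, m = len(text), len(x)
        # distance from the last occurrence of each character in x[0..m-2]
        # to the end of the pattern; absent characters shift by the full m.
        shift = {c: m - 1 - i for i, c in enumerate(x[:-1])}
        shifts, s = [], 0
        while s <= n - m:
            j = m - 1
            while j >= 0 and x[j] == text[s + j]:
                j -= 1
            if j < 0:
                shifts.append(s)   # pattern occurs at shift s
            s += shift.get(text[s + m - 1], m)
        return shifts

    print(horspool_match("GCT", "AGCTTC"))   # [1]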
Edit Distance
• The edit distance (ED) between x and y is the minimum number of fundamental operations required to transform x into y
• Fundamental operations (x=‘excused’, y=‘exhausted’)
– Substitutions e.g. ‘c’ is replaced by ‘h’
– Insertions e.g. ‘a’ is inserted into x after ‘h’
– Deletions e.g. a character in x is deleted
• ED is one way of measuring similarity between two
strings
Classification using ED
• The nearest-neighbor algorithm can be applied for pattern recognition
  – Training: strings are stored together with their class labels
  – Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string's label is assigned to the test string
• The key is how to calculate ED
• An example of calculating ED is sketched below
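A minimal dynamic-programming sketch of ED (the standard Levenshtein recurrence), plus the 1-NN classification step described above; the helper names are ours:

    # D[i][j] = minimum number of operations to transform x[:i] into y[:j].
    def edit_distance(x, y):
        m, n = len(x), len(y)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i                               # i deletions
        for j in range(n + 1):
            D[0][j] = j                               # j insertions
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                sub = 0 if x[i - 1] == y[j - 1] else 1
                D[i][j] = min(D[i - 1][j] + 1,        # delete x[i-1]
                              D[i][j - 1] + 1,        # insert y[j-1]
                              D[i - 1][j - 1] + sub)  # substitute (or match)
        return D[m][n]

    # 'excused' -> 'exhausted': substitute c->h, insert a, insert t.
    print(edit_distance("excused", "exhausted"))      # 3

    # Nearest-neighbor classification: label of the closest stored string.
    def classify(test, training):                     # training: [(string, label)]
        return min(training, key=lambda p: edit_distance(test, p[0]))[1]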
Hidden Markov Model
• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning
Markov Model
• The Markov property: given the current state, the transition probability is independent of any previous states
• A simple Markov Model
  – State ω(t) at time t
  – Sequence of length T:
    • ω^T = {ω(1), ω(2), …, ω(T)}
  – Transition probability
    • P(ω_j(t+1) | ω_i(t)) = a_ij
  – It is not required that a_ij = a_ji
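A minimal sketch of generating a state sequence from a transition matrix a_ij (the 3-state matrix below is an illustrative assumption, not from the slides):

    import random

    # a[i][j] = P(next state j | current state i); each row sums to 1.
    a = [[0.7, 0.2, 0.1],
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]

    def sample_chain(a, start, T):
        states = [start]
        for _ in range(T - 1):
            i = states[-1]
            # the Markov property: the draw depends only on the current state i
            states.append(random.choices(range(len(a)), weights=a[i])[0])
        return states

    print(sample_chain(a, start=0, T=10))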
Hidden Markov Model
• Visible states
  – V^T = {v(1), v(2), …, v(T)}
• Emitting a visible state v_k(t)
  – P(v_k(t) | ω_j(t)) = b_jk
• Only the visible states v_k(t) are accessible; the hidden states ω_i(t) are unobservable
• A Markov model is ergodic if every state has a nonzero probability of occurring given some starting state
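Extending the sketch above with an emission matrix b_jk (values again illustrative), an HMM generates a visible sequence while the underlying state sequence stays hidden:

    import random

    a = [[0.7, 0.2, 0.1],          # transition probabilities a_ij
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]
    b = [[0.5, 0.3, 0.2],          # emission probabilities b_jk:
         [0.1, 0.8, 0.1],          # state j emits symbol k with prob b[j][k]
         [0.3, 0.3, 0.4]]

    def sample_hmm(a, b, start, T):
        state, visibles = start, []
        for _ in range(T):
            visibles.append(random.choices(range(len(b[0])), weights=b[state])[0])
            state = random.choices(range(len(a)), weights=a[state])[0]
        return visibles            # only the visible symbols are observable

    print(sample_hmm(a, b, start=0, T=8))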
Three Key Issues with HMM
• Evaluation
  – Given an HMM, complete with transition probabilities a_ij and emission probabilities b_jk, determine the probability that a particular sequence of visible states V^T was generated by that model
• Decoding
  – Given an HMM and a set of observations V^T, determine the most likely sequence of hidden states ω^T that led to V^T
• Learning
  – Given the number of hidden and visible states and a set of training observations of visible symbols, determine the probabilities a_ij and b_jk
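Minimal sketches of the first two problems: the forward algorithm for evaluation and the Viterbi algorithm for decoding. The matrices a and b, the initial distribution pi, and the observation sequence V are illustrative assumptions:

    a = [[0.7, 0.2, 0.1],                # transition probabilities a_ij
         [0.3, 0.4, 0.3],
         [0.2, 0.3, 0.5]]
    b = [[0.5, 0.3, 0.2],                # emission probabilities b_jk
         [0.1, 0.8, 0.1],
         [0.3, 0.3, 0.4]]

    # Evaluation: alpha[j] = P(observations so far, hidden state j at time t);
    # summing over all states at the end gives P(V^T | model).
    def forward(a, b, pi, V):
        n = len(a)
        alpha = [pi[j] * b[j][V[0]] for j in range(n)]
        for v in V[1:]:
            alpha = [sum(alpha[i] * a[i][j] for i in range(n)) * b[j][v]
                     for j in range(n)]
        return sum(alpha)

    # Decoding: like forward, but take the max over predecessors and keep
    # backpointers so the most likely hidden path can be recovered.
    def viterbi(a, b, pi, V):
        n = len(a)
        delta = [pi[j] * b[j][V[0]] for j in range(n)]
        paths = [[j] for j in range(n)]
        for v in V[1:]:
            best = [max(range(n), key=lambda i: delta[i] * a[i][j])
                    for j in range(n)]
            delta = [delta[best[j]] * a[best[j]][j] * b[j][v] for j in range(n)]
            paths = [paths[best[j]] + [j] for j in range(n)]
        j = max(range(n), key=lambda j: delta[j])
        return paths[j], delta[j]

    pi = [1.0, 0.0, 0.0]                 # assume the chain starts in state 0
    V = [2, 0, 2, 1]                     # an illustrative observation sequence
    print(forward(a, b, pi, V))          # P(V | model)
    print(viterbi(a, b, pi, V))          # most likely hidden path, its prob.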
Other Sequential Pattern Mining Problems
• Sequence alignment (homology) and sequence assembly (genome sequencing)
• Trend analysis
  – Trend movement vs. cyclic variations, seasonal variations, and random fluctuations
• Sequential pattern mining
  – Various kinds of sequences (weblogs)
  – Various methods: from GSP to PrefixSpan
• Periodicity analysis
  – Full periodicity, partial periodicity, cyclic association rules
Periodic Pattern (in sequences of transactions)
• Full periodic pattern
  – ABC ABC ABC
• Partial periodic pattern
  – ABC ADC ACC ABC
• Pattern hierarchy
  – ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE ABC ABC ABC DE DE DE DE
    [ABC:3 | DE:4]
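A minimal sketch of testing full periodicity of a symbol sequence (the helper name is ours):

    # Fully periodic with period p: every symbol equals the one p positions
    # earlier, i.e. the sequence is a repetition of its first p symbols.
    def fully_periodic(seq, p):
        return all(seq[i] == seq[i % p] for i in range(len(seq)))

    print(fully_periodic("ABCABCABC", 3))   # True: full period ABC
    print(fully_periodic("ABCADCACC", 3))   # False: only partially periodic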
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using
Equivalence classes)
• Constrained sequence mining (SPIRIT)
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork. Pattern Classification, 2nd Edition. Wiley-Interscience, 2001.
[Figure: a fully connected three-state Markov model; arrows between states 1, 2, and 3 are labeled with the transition probabilities a_11, a_12, a_13, a_21, a_22, a_23, a_31, a_32, a_33]
[Figure: a three-state hidden Markov model; hidden states 1, 2, 3 with transition probabilities a_ij, each emitting the visible symbols v1–v4 with emission probabilities b_jk (b_11 … b_34)]
[Figure: the evaluation (forward) trellis; hidden states 1 … c unfolded over time t = 1 … T, with arrows labeled by transition probabilities (e.g. a_12, a_22, a_32) and each node at time t emitting the visible symbol v_k with probability b_jk]
Example: evaluation with the forward algorithm. Observed visible sequence v3, v1, v3, v2 at t = 1 … 4, followed by the null symbol v0. Entries are α_i(t), the probability of the visible sequence up to time t with hidden state i at t; the chain starts in state 1 (α_1(0) = 1), and state 0 is the final (absorbing) state:

  state |  t=0   t=1    t=2     t=3     t=4
  ------+-------------------------------------
    0   |   0     0      0       0     0.0011
    1   |   1    0.09   0.0052  0.0024   0
    2   |   0    0.01   0.0077  0.0002   0
    3   |   0    0.2    0.0057  0.0007   0

(The annotation "0.3 × 0.3" marks one computation step in the figure, e.g. α_1(1) = a_11 · b_1(v3).) The probability of the full sequence is α_0(4) = 0.0011.
[Figure: a left-to-right HMM word model; states 1–7 emit the phonemes /v/, /i/, /t/, /e/, /r/, /b/, /i/ (spelling "viterbi"), and the final state 0 emits the null symbol /-/]
[Figure: the decoding (Viterbi) trellis; at each time step t = 1 … T the maximizing state max(t) is selected among states 1 … c, tracing the most likely hidden-state path through the trellis]
Example: decoding on the same observed sequence v3, v1, v3, v2 (followed by v0); the table of probabilities is the same as in the evaluation example above. Taking the maximal entry at each time step traces the most likely hidden-state path 1 → 3 → 2 → 1 → 0.