Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
7. Sequence Mining Sequences and Strings Recognition with Strings MM & HMM Sequence Association Rules Spring 2003 Data Mining by H. Liu, ASU 1 Sequences and Strings • A sequence x is an ordered list of discrete items, such as a sequence of letters or a gene sequence – Sequences and strings are often used interchangeably – String elements (characters, letters, or symbols) are nominal – A particularly long string is called text • |x| denotes the length of sequence x – |AGCTTC| is 6 • Any contiguous string that is part of x is called a substring, segment, or factor of x – GCT is a factor of AGCTTC Spring 2003 Data Mining by H. Liu, ASU 2 Recognition with Strings • String matching – Given x and text, determine whether x is a factor of text • Edit distance – Given two strings x and y, compute the min number of basic operations (character insertions, deletions and exchanges) needed to transform x into y Spring 2003 Data Mining by H. Liu, ASU 3 String Matching • Given |text| >> |x|, each discrete character is taken from an alphabet A – A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…} • A shift s is an offset needed to align the first character of x with character number s+1 in text • Find if there exists a valid shift where there is a perfect match between each character in x and the corresponding one in text Spring 2003 Data Mining by H. Liu, ASU 4 Naïve String Matching • Given alphabet A, x, text, n = |text|, m = |x| s=0 while s ≤ n-m if x[1 …m] = text [s+1 … s+m] then print “pattern occurs at shift” s s=s+1 • Time complexity (worst case): O((n-m+1)m) • One character shift at a time is not necessary Spring 2003 Data Mining by H. Liu, ASU 5 Boyer-Moore String Matching • Given A, x, text, n = |text|, m = |x| F(x) = last-occurrence function G(x) = good-suffix function; s = 0 while s ≤ n-m j=m while j>0 and x[j] = text [s+j] j = j-1 if j = 0 then print “pattern occurs at shift” s s = s + G(0) else s = s + max[G(j), j-F(text[s+j0])] Spring 2003 Data Mining by H. Liu, ASU 6 Edit Distance • ED between x and y describes how many fundamental operations are required to transform x to y. • Fundamental operations (x=‘excused’, y=‘exhausted’) – Substitutions, ‘c’ is replaced by ‘h’ – Insertions, ‘a’ is inserted into x after ‘h’ – Deletions, a character in x is deleted • ED is one way of measuring similarity between two strings Spring 2003 Data Mining by H. Liu, ASU 7 Classification using ED • Nearest-neighbor algorithm can be applied for pattern recognition. – Training: data of strings with their class labels stored – Classification (testing): a test string is compared to each stored string and an ED is computed; the nearest stored string’s label is assigned to the test string. • The key is how to calculate ED. • An example of calculating ED Spring 2003 Data Mining by H. Liu, ASU 8 Hidden Markov Model • • • • • Markov Model: transitional states Hidden Markov Model: additional visible states Evaluation Decoding Learning Spring 2003 Data Mining by H. Liu, ASU 9 Markov Model • The Markov property: – given the current state, the transition probability is independent of any previous states. • A simple Markov Model – State ω(t) at time t – Sequence of length T: • ωT = {ω(1), ω(2), …, ω(T)} – Transition probability • P(ω j(t+1)| ω i(t)) = aij – It’s not required that aij = aji Spring 2003 Data Mining by H. Liu, ASU 10 Hidden Markov Model • Visible states – VT = {v(1), v(2), …, v(T)} • Emitting a visible state vk(t) – P(v k(t)| ω j(t)) = bjk • Only visible states vk (t) are accessible and states ωi (t) are unobservable. • A Markov model is ergodic if every state has a nonzero prob of occuring give some starting state. Spring 2003 Data Mining by H. Liu, ASU 11 Three Key Issues with HMM • Evaluation – Given an HMM, complete with transition probabilities aij and bjk. Determine the probability that a particular sequence of visible states VT was generated by that model • Decoding – Given an HMM and a set of observations VT. Determine the most likely sequence of hidden states ωT that led to VT. • Learning – Given the number of states and visible states and a set of training observations of visible symbols, determine the probabilities aij and bjk. Spring 2003 Data Mining by H. Liu, ASU 12 Sequence Association Rule Mining • SPADE (Sequential Pattern Discovery using Equivalence classes) • Constrained sequence mining (SPIRIT) Spring 2003 Data Mining by H. Liu, ASU 13 Bibliography • R.O. Duda, P.E. Hart, and D.G. Stork, 2001. Pattern Classification. 2nd Edition. Wiley Interscience. Spring 2003 Data Mining by H. Liu, ASU 14