7. Sequence Mining
Sequences and Strings
Recognition with Strings
MM & HMM
Sequence Association Rules
7/03
Data Mining – Sequences
H. Liu (ASU) & G Dong (WSU)
1
Sequences and Strings
• A sequence x is an ordered list of discrete items, such
as a sequence of letters or a gene sequence
– Sequences and strings are often used as synonyms
– String elements (characters, letters, or symbols) are nominal
– Text is a type of particularly long string
• |x| denotes the length of sequence x
– |AGCTTC| is 6
• Any contiguous string that is part of x is called a
substring, segment, or factor of x
– GCT is a factor of AGCTTC
Recognition with Strings
• String matching
– Given x and text, determine whether x is a factor of text
• Edit distance (for inexact string matching)
– Given two strings x and y, compute the minimum
number of basic operations (character insertions,
deletions and exchanges) needed to transform x into y
String Matching
• Given |text| >> |x|, with characters taken from an
alphabet A
– A can be {0, 1}, {0, 1, 2,…, 9}, {A,G,C,T}, or {A, B,…}
• A shift s is an offset needed to align the first
character of x with character number s+1 in text
• Find if there exists a valid shift where there is a
perfect match between characters in x and the
corresponding ones in text
Naïve (Brute-Force) String Matching
• Given A, x, text, n = |text|, m = |x|
s = 0
while s ≤ n-m
  if x[1 … m] = text[s+1 … s+m]
    then print "pattern occurs at shift" s
  s = s + 1
• Time complexity (worst case): O((n-m+1)m)
• One character shift at a time is not necessary
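The brute-force scan above can be sketched in Python (a minimal sketch; the function name and the 0-based indexing are my own, where the slide's x[1…m] becomes x[0:m]):

```python
def naive_match(x: str, text: str) -> list[int]:
    """Return every shift s at which pattern x occurs in text.

    Direct transcription of the brute-force algorithm: try each
    shift s = 0 .. n-m and compare the m pattern characters.
    """
    n, m = len(text), len(x)
    shifts = []
    s = 0
    while s <= n - m:
        if text[s:s + m] == x:  # compare x[1..m] with text[s+1..s+m]
            shifts.append(s)
        s += 1
    return shifts

# GCT is a factor of AGCTTC, at shift 1 (0-based)
print(naive_match("GCT", "AGCTTC"))  # -> [1]
```

Each shift costs up to m comparisons and there are n-m+1 shifts, matching the O((n-m+1)m) worst case stated above.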
Boyer-Moore and KMP
• See StringMatching.ppt; do not use the following algorithm as given
• Given A, x, text, n = |text|, m = |x|
F(x) = last-occurrence function
G(x) = good-suffix function; s = 0
while s ≤ n-m
  j = m
  while j > 0 and x[j] = text[s+j]
    j = j - 1
  if j = 0
    then print "pattern occurs at shift" s
    s = s + G(0)
  else s = s + max[G(j), j - F(text[s+j])]
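A sketch of Boyer-Moore using only the last-occurrence (bad-character) function F; the good-suffix function G is omitted here, so after a full match the pattern simply shifts by one (function names and the 1-based j are mapped to 0-based Python indices):

```python
def last_occurrence(x: str, alphabet) -> dict:
    """F: map each character to its last (1-based) position in x, else 0."""
    F = {c: 0 for c in alphabet}
    for i, c in enumerate(x, start=1):
        F[c] = i
    return F

def bm_bad_character(x: str, text: str) -> list[int]:
    """Boyer-Moore scan using only the bad-character rule."""
    n, m = len(text), len(x)
    F = last_occurrence(x, set(text) | set(x))
    shifts = []
    s = 0
    while s <= n - m:
        j = m
        while j > 0 and x[j - 1] == text[s + j - 1]:  # compare right to left
            j -= 1
        if j == 0:
            shifts.append(s)
            s += 1                      # no good-suffix table: shift by 1
        else:
            # skip ahead so the mismatched text character aligns with
            # its last occurrence in x (never shift backwards)
            s += max(1, j - F[text[s + j - 1]])
    return shifts

print(bm_bad_character("GCT", "AGCTTC"))  # -> [1]
```

Because comparison runs right to left, a mismatch on a character absent from x lets the pattern jump m positions at once, which is where the speedup over the naïve scan comes from.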
Edit Distance
• ED between x and y describes how many fundamental
operations are required to transform x to y.
• Fundamental operations (x=‘excused’, y=‘exhausted’)
– Substitutions e.g. ‘c’ is replaced by ‘h’
– Insertions e.g. ‘a’ is inserted into x after ‘h’
– Deletions e.g. a character in x is deleted
• ED is one way of measuring similarity between two
strings
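ED under these three operations is the classic Levenshtein distance, computed by dynamic programming; a minimal sketch:

```python
def edit_distance(x: str, y: str) -> int:
    """Minimum number of substitutions, insertions, and deletions
    needed to transform x into y (Levenshtein distance)."""
    m, n = len(x), len(y)
    # D[i][j] = edit distance between x[:i] and y[:j]
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i          # delete all of x[:i]
    for j in range(n + 1):
        D[0][j] = j          # insert all of y[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if x[i - 1] == y[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,         # deletion
                          D[i][j - 1] + 1,         # insertion
                          D[i - 1][j - 1] + cost)  # substitution / match
    return D[m][n]

# the slide's example: substitute c->h, insert a, insert t
print(edit_distance("excused", "exhausted"))  # -> 3
```

The table has (m+1)(n+1) cells and each cell costs O(1), so the computation is O(mn) time.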
Classification using ED
• The nearest-neighbor algorithm can be applied for
pattern recognition.
– Training: strings are stored together with their class labels
– Classification (testing): a test string is compared to each
stored string and an ED is computed; the nearest stored
string's label is assigned to the test string.
• The key is how to calculate ED.
• An example of calculating ED
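The training/testing procedure above can be sketched as follows (the labeled training strings are invented for illustration; edit_distance is the usual Levenshtein DP, here in a space-saving single-row form):

```python
def edit_distance(x: str, y: str) -> int:
    # Levenshtein DP keeping only one row of the table
    m, n = len(x), len(y)
    D = list(range(n + 1))
    for i in range(1, m + 1):
        prev, D[0] = D[0], i
        for j in range(1, n + 1):
            cur = D[j]
            D[j] = min(D[j] + 1,                       # deletion
                       D[j - 1] + 1,                   # insertion
                       prev + (x[i - 1] != y[j - 1]))  # substitution
            prev = cur
    return D[n]

def nn_classify(test: str, training: list) -> str:
    """1-NN: assign the label of the stored string nearest to test."""
    return min(training, key=lambda item: edit_distance(test, item[0]))[1]

# hypothetical labeled training strings
train = [("AAGCT", "geneA"), ("GGTTC", "geneB"), ("AACCT", "geneA")]
print(nn_classify("AAGTT", train))  # -> geneA (AAGCT is 1 edit away)
```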
Hidden Markov Model
• Markov Model: transitional states
• Hidden Markov Model: additional visible states
• Evaluation
• Decoding
• Learning
Markov Model
• The Markov property:
– given the current state, the transition
probability is independent of any
previous states.
• A simple Markov Model
– State ω(t) at time t
– Sequence of length T:
• ωT = {ω(1), ω(2), …, ω(T)}
– Transition probability
• P(ωj(t+1) | ωi(t)) = aij
– It’s not required that aij = aji
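By the Markov property, the probability of a state sequence factorizes into the initial probability times successive transition probabilities; a small sketch (the 3-state transition matrix and initial distribution are invented):

```python
# hypothetical transition matrix: a[i][j] = P(state j at t+1 | state i at t)
a = [[0.5, 0.3, 0.2],
     [0.1, 0.6, 0.3],
     [0.4, 0.2, 0.4]]

def sequence_probability(states: list, initial: list) -> float:
    """P(w(1), ..., w(T)) = P(w(1)) * product over t of a[w(t)][w(t+1)]."""
    p = initial[states[0]]
    for prev, nxt in zip(states, states[1:]):
        p *= a[prev][nxt]
    return p

# probability of the state sequence 0 -> 1 -> 1 -> 2, starting in state 0
print(sequence_probability([0, 1, 1, 2], initial=[1.0, 0.0, 0.0]))
```

Note that nothing requires a to be symmetric (aij ≠ aji in general), which the matrix above illustrates.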
Hidden Markov Model
• Visible states
– VT = {v(1), v(2), …, v(T)}
• Emitting a visible state vk(t)
– P(vk(t) | ωj(t)) = bjk
• Only the visible states vk(t) are
accessible; the states ωi(t) are
unobservable.
• A Markov model is ergodic if
every state has a nonzero probability of
occurring given some starting state.
Three Key Issues with HMM
• Evaluation
– Given an HMM, complete with transition probabilities aij
and emission probabilities bjk, determine the probability that a
particular sequence of visible states VT was generated by that model
• Decoding
– Given an HMM and a set of observations VT, determine
the most likely sequence of hidden states ωT that led to VT
• Learning
– Given the number of hidden and visible states and a set of
training observations of visible symbols, determine the
probabilities aij and bjk
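The evaluation problem is solved by the forward algorithm, which sums over all hidden paths in O(T·c²) time instead of enumerating them; a minimal sketch with invented 2-state, 2-symbol parameters:

```python
def forward(obs, a, b, pi):
    """P(V^T | model) via the forward algorithm.

    obs:    observed symbol indices v(1..T)
    a[i][j]: transition probability, b[j][k]: emission probability
    pi[i]:  initial state probability
    """
    n_states = len(pi)
    # alpha[i] = P(observations so far, hidden state i at current time)
    alpha = [pi[i] * b[i][obs[0]] for i in range(n_states)]
    for k in obs[1:]:
        alpha = [sum(alpha[i] * a[i][j] for i in range(n_states)) * b[j][k]
                 for j in range(n_states)]
    return sum(alpha)

# invented model parameters
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(forward([0, 1, 0], a, b, pi))
```

A sanity check: the probabilities of all possible observation sequences of a fixed length must sum to 1.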
Other Sequential Pattern Mining Problems
• Sequence alignment (homology) and sequence
assembly (genome sequencing)
• Trend analysis
– Trend movement vs. cyclic variations, seasonal variations
and random fluctuations
• Sequential pattern mining
– Various kinds of sequences (weblogs)
– Various methods: from GSP to PrefixSpan
• Periodicity analysis
– Full periodicity, partial periodicity, cyclic association rules
Periodic Pattern (in sequences of transactions)
• Full periodic pattern
– ABC ABC ABC
• Partial periodic pattern
– ABC ADC ACC ABC
• Pattern hierarchy
– ABC ABC ABC DE DE DE DE ABC ABC ABC DE
DE DE DE ABC ABC ABC DE DE DE DE
[ABC:3 | DE:4]
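The two notions above can be checked directly; a small sketch (the '*' wildcard notation for partial periods is my own shorthand for the "don't care" positions):

```python
def is_fully_periodic(seq: str, pattern: str) -> bool:
    """True if seq is an exact repetition of pattern, e.g. ABCABCABC of ABC."""
    p = len(pattern)
    return len(seq) % p == 0 and all(
        seq[i] == pattern[i % p] for i in range(len(seq)))

def matches_partial_period(seq: str, pattern: str) -> bool:
    """Partial periodic pattern: '*' positions in pattern match anything."""
    p = len(pattern)
    return len(seq) % p == 0 and all(
        pattern[i % p] in ('*', seq[i]) for i in range(len(seq)))

print(is_fully_periodic("ABCABCABC", "ABC"))          # -> True
# the slide's partial pattern: A in position 1, C in position 3
print(matches_partial_period("ABCADCACCABC", "A*C"))  # -> True
```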
Sequence Association Rule Mining
• SPADE (Sequential Pattern Discovery using
Equivalence classes)
• Constrained sequence mining (SPIRIT)
Bibliography
• R.O. Duda, P.E. Hart, and D.G. Stork. Pattern
Classification, 2nd Edition. Wiley-Interscience, 2001.
[Figure (slide 17): three-state Markov model; directed arcs among states 1, 2, 3 labeled with transition probabilities a11, a12, a13, a21, a22, a23, a31, a32, a33]
[Figure (slide 18): three-state hidden Markov model; hidden states 1, 2, 3 with transition probabilities aij, each emitting visible symbols v1–v4 with emission probabilities bjk]
[Figure (slide 19): trellis for evaluation; hidden states 1, 2, 3, …, c unfolded over times t = 1 … T, showing the transitions aij into a state and the emission probability b2k of the observed symbol vk]
[Figure (slide 20): forward-algorithm example for observation sequence v3 v1 v3 v2 followed by the final symbol v0; the highlighted step computes 0.3 × 0.3. Per-state probabilities over time:]

state \ t    0    1      2        3        4
0            0    0      0        0        0.0011
1            1    0.09   0.0052   0.0024   0
2            0    0.01   0.0077   0.0002   0
3            0    0.2    0.0057   0.0007   0
[Figure (slide 21): decoding example; hidden states 1–7 emit the phonemes /v/ /i/ /t/ /e/ /r/ /b/ /i/ and state 0 emits /-/]
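Decoding is solved by the Viterbi algorithm, which replaces the forward algorithm's sum with a max and keeps back-pointers to recover the best hidden path; a sketch reusing the same invented 2-state, 2-symbol parameters as the evaluation example:

```python
def viterbi(obs, a, b, pi):
    """Most likely hidden state sequence for the observed symbols."""
    n = len(pi)
    # delta[i]: best path probability ending in state i; psi: back-pointers
    delta = [pi[i] * b[i][obs[0]] for i in range(n)]
    psi = []
    for k in obs[1:]:
        back, new = [], []
        for j in range(n):
            i_best = max(range(n), key=lambda i: delta[i] * a[i][j])
            back.append(i_best)
            new.append(delta[i_best] * a[i_best][j] * b[j][k])
        psi.append(back)
        delta = new
    # backtrack from the best final state
    path = [max(range(n), key=lambda i: delta[i])]
    for back in reversed(psi):
        path.append(back[path[-1]])
    return list(reversed(path))

# invented model parameters
a  = [[0.7, 0.3], [0.4, 0.6]]
b  = [[0.9, 0.1], [0.2, 0.8]]
pi = [0.6, 0.4]
print(viterbi([0, 1, 1, 0], a, b, pi))  # -> [0, 1, 1, 0]
```

With these nearly diagonal emission probabilities the best hidden path tracks the observations, but with noisier bjk the decoded path can differ from the observed symbols.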
[Figure (slide 22): trellis for decoding; at each time t = 1 … T the maximizing state max(1), max(2), …, max(T) is selected among states 1, 2, 3, …, c]
[Figure (slide 23): decoding on the slide-20 example; the same table of per-state probabilities over t = 0 … 4 for observation sequence v3 v1 v3 v2 followed by v0]