Efficient Computation of Substring Equivalence Classes with Suffix Arrays
Kazuyuki Narisawa, Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan

Contents
• Introduction
• Problem definition
• Suffix tree based algorithm
• Simulation by suffix array
• Computational experiment
• Application
• Summary

Main contribution
Time and space efficient computation of substring equivalence classes [Blumer et al. 1987] with suffix arrays
• Linear time and space
• Faster and requires less memory than suffix tree and CDAWG based methods

Equivalence relation and classes
Given a string w, the maximal extension of a substring x is the string αxβ such that
• every time x occurs in w, it is preceded by α and followed by β;
• the strings α and β are the longest possible.
Equivalence relation: x ≡ y ⇔ x and y have the same maximal extension
Equivalence class: [x] = { y | y ≡ x }
The elements of one class are substrings with essentially identical occurrences in w.

Example
w = Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
The maximal extension of bet is -better-b, so bet ∈ [-better-b].

Problem
• Input: string w of length n
• Output: the equivalence classes on w
• Difficulty
  ▫ The total number of elements in the equivalence classes (ECs for short) is O(n²).
• Solution
  ▫ The number of ECs is O(n).
  ▫ Each EC can be succinctly represented in O(1) space.

Succinct representation of the ECs
• representative: the longest element (the maximal extension)
• minimal strings: the elements that fall into another EC whichever end character (leftmost or rightmost) is deleted
[x] = Substring(r) ∩ ( ∪ Superstring(y) ), where r is the representative of [x] and the union is taken over the minimal strings y of [x]
The elements of [x] can be enumerated from the representative and the minimal strings.

Example
w = Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
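The definitions above can be checked by brute force on small inputs. The following is a minimal sketch, not the linear-time method of the talk: it enumerates every substring, so it is only meant to make the definitions concrete. The function names (occurrences, maximal_extension, equivalence_classes, minimal_strings) are hypothetical and chosen for this illustration.

```python
from collections import defaultdict


def occurrences(w, x):
    """Start positions of every (possibly overlapping) occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Longest string a+x+b such that every occurrence of x in w is
    preceded by a and followed by b."""
    occ = occurrences(w, x)
    left = right = 0
    # Grow to the left while all occurrences share the preceding character.
    while all(i - left - 1 >= 0 for i in occ) and \
            len({w[i - left - 1] for i in occ}) == 1:
        left += 1
    # Grow to the right while all occurrences share the following character.
    while all(i + len(x) + right < len(w) for i in occ) and \
            len({w[i + len(x) + right] for i in occ}) == 1:
        right += 1
    return w[occ[0] - left:occ[0] + len(x) + right]


def equivalence_classes(w):
    """Map each representative to the sorted list of substrings in its class."""
    classes = defaultdict(set)
    for i in range(len(w)):
        for j in range(i + 1, len(w) + 1):
            x = w[i:j]
            classes[maximal_extension(w, x)].add(x)
    return {rep: sorted(members) for rep, members in classes.items()}


def minimal_strings(members):
    """Members that leave the class whichever end character is deleted."""
    s = set(members)
    return sorted(y for y in s if y[1:] not in s and y[:-1] not in s)


betty = ("Betty-bought-a-bit-of-better-butter-and-"
         "made-a-better-batter-after-breakfast.")
print(maximal_extension(betty, "bet"))   # -better-b
print(len(occurrences(betty, "bet")))    # frequency of the class: 2

ecs = equivalence_classes("ababbbabbc$")
print(ecs["babb"])                       # ['abb', 'ba', 'bab', 'babb']
print(minimal_strings(ecs["babb"]))      # ['abb', 'ba']
```

On ababbbabbc$ the class with representative babb has four elements, and every one of them is a substring of babb that contains ba or abb, which is what the succinct representation states.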
[Figure: the representative -better-b of the Betty example, with the minimal strings of its class aligned beneath it]

Problem
• Input: string w of length n
• Output: succinct representations of the equivalence classes on w
  ▫ additionally, for each EC we output
    its size (the number of elements in the EC)
    its frequency (the number of occurrences of the elements in the EC)

Possible solutions
• Suffix tree [Weiner 1973]
• Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
• ECs can be computed with either of the data structures in linear time and space.

Suffix tree (with suffix links)
[Figure: suffix tree of ababbbabbc$ with leaves numbered 1 to 11]
Ignore leaves here because they form a trivial EC.

Equivalence classes on suffix tree
[Figure: suffix tree of ababbbabbc$ with the nodes of each EC grouped; the strings shown include abb, babb, bab, ba and b]
• EC definition: substrings with essentially the same occurrences
• Equivalence relation on nodes: two nodes are connected by a suffix link and their subtrees have the same number of leaves

Suffix tree algorithm
foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            follow suffix link;
            compute size and minimal strings;
        }
    }
    output succinct representation of EC [v]≡;
}

Algorithm with suffix tree
[Figure: step-by-step run on the suffix tree of ababbbabbc$. Annotations on the steps: representative? (number of incoming suffix links = 1 or not?); follow suffix link; in the same EC? (same number of leaves?); back to representative; output representative, minimal strings, size, frequency; not representative]
Suffix tree requires large memory space.
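The walk used by the suffix tree algorithm (start at a representative, follow suffix links while the number of leaves stays the same) can be imitated without building a tree. The sketch below is an illustration under stated assumptions, not the implementation from the talk: the argument is assumed to be the label of an internal node, following a suffix link is modelled as dropping the first character, and naive occurrence counting stands in for the stored leaf counts. The function names are hypothetical.

```python
def count_occurrences(w, x):
    """Number of (possibly overlapping) occurrences of x in w; this equals
    the number of leaves below the suffix tree node labelled x."""
    return sum(1 for i in range(len(w) - len(x) + 1) if w.startswith(x, i))


def ec_nodes_via_suffix_links(w, representative):
    """Labels of the nodes reached from `representative` by suffix links
    that still belong to its equivalence class."""
    leaves = count_occurrences(w, representative)
    chain = [representative]
    node = representative[1:]        # suffix link: drop the first character
    while node and count_occurrences(w, node) == leaves:
        chain.append(node)
        node = node[1:]
    return chain


print(ec_nodes_via_suffix_links("ababbbabbc$", "babb"))   # ['babb', 'abb']
```

For ababbbabbc$ the walk from babb stops at bb, which occurs three times instead of two, so the class is spread over the two nodes babb and abb.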
[Figure annotation: continue suffix tree traversal]

Suffix array [Manber and Myers 1993]
• Can simulate traversal on suffix tree using lcp and rank arrays [Kasai et al. 2003]
• Can simulate traversal on suffix links using an additional data structure: suffix link table [Abouelhoda et al. 2004]
Our algorithm simulates traversal on suffix links without the suffix link table.

Suffix array
Lexicographically sort the suffixes of ababbbabbc$ (in this example the end marker $ sorts after every other character).
i        1  2  3  4  5  6  7  8  9 10 11
SA[i]    1  3  7  2  6  5  4  8  9 10 11
[Figure: the suffixes of ababbbabbc$ in text order and in sorted order, next to the suffix tree]

Lcp array
lcp[i]: the length of the longest common prefix of the i-th and (i-1)-th suffixes in sorted order
i        1  2  3  4  5  6  7  8  9 10 11
lcp[i]  -1  2  3  0  4  1  2  2  1  0  0
SA[i]    1  3  7  2  6  5  4  8  9 10 11
[Figure: suffix tree of ababbbabbc$]

Rank array
rank[SA[i]] = i
i        1  2  3  4  5  6  7  8  9 10 11
rank[i]  1  4  2  7  6  5  3  8  9 10 11
SA[i]    1  3  7  2  6  5  4  8  9 10 11
[Figure: suffix tree of ababbbabbc$]

Suffix array has less information
Information available during traversal for each data structure, when visiting node v
Suffix Tree
1. label from root to each node
2. label from parent to each node
3. number of leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node
Suffix Array
1. length of label from root to v
2. length of label from root to the parent of v
3. leftmost leaf ID in subtree rooted at v
4. rightmost leaf ID in subtree rooted at v

Suffix array has less information
[Figure: suffix tree of ababbbabbc$ above its suffix array, highlighting a node v whose label length from the root is 4 and whose parent label length from the root is 1]

Suffix array algorithm
foreach v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {      ← difficulty 1
        follow suffix link;
        while (node is in EC [v]≡) {                ← difficulty 2
            follow suffix link;
            compute size and minimal strings;       ← difficulty 3
        }
    }
    output succinct representation of EC [v]≡;
}
These are difficult because suffix array has less information.

Solving difficulty 1 (representative judge)
Let x be the label of node v, let l and r be the leftmost and rightmost leaf IDs in the subtree rooted at v, and let
L = rank(l), R = rank(r), L' = rank(l - 1), R' = rank(r - 1).
Lemma: x is equal to its maximal extension ⇔ 1, 2, or 3 holds.
1. R - L ≠ R' - L' (different number of leaves?)
2. w[l - 1] ≠ w[r - 1] (different first character?)
3. l - 1 = 0 or r - 1 = 0 (first character in the string?)

Solving difficulty 2 (equivalence relation judge)
Let ax be the label from the root of node v and x the label from the root of the node reached by its suffix link. Let l and r be the leftmost and rightmost leaf IDs in the subtree rooted at v, and let
L = rank(l), R = rank(r), L' = rank(l + 1), R' = rank(r + 1).
Lemma: ax ≡ x ⇔ 1, 2, and 3 hold.
1. R - L = R' - L' (same number of leaves?)
2. lcp(L') < |ax| - 1 (leftmost?)
3. lcp(R' + 1) < |ax| - 1 (rightmost?)

Solving difficulty 3 (size computation)
[Figure: three cases for the suffix array interval [L, R] of a node; the label length of the parent is lcp(R + 1) in cases 1 and 2 and lcp(L) in case 3]
Lemma: size = Σ { lcp(R) - max{ lcp(L), lcp(R + 1) } }, summed over the nodes of the EC

Computational experiment
• Comparison of algorithms
  ▫ suffix tree
  ▫ CDAWG
  ▫ suffix array
• Data
  ▫ two English and two Genome corpora
    Canterbury corpus, Protein corpus
• Machine spec.
  ▫ Red Hat Linux
  ▫ CPU 2.8GHz, 1 GB memory

Experimental result
data name           size (MB)  data structure  construction (sec)  enumeration (sec)  total (sec)  memory (MB)
cantrby/plrabn12    0.47       suffix tree      0.95                0.21                1.16        21.446
                               CDAWG            0.97                0.18                1.15         9.278
                               suffix array     0.43                0.14                0.57         5.392
Protein Corpus/sc   2.8        suffix tree     12.08                1.43               13.51       121.887
                               CDAWG           12.76                1.12               13.88        69.648
                               suffix array     3.08                0.63                3.71        33.192
large/bible.txt     3.9        suffix tree      7.33                2.23                9.56       191.869
                               CDAWG            6.68                1.62                8.30        56.255
                               suffix array     4.71                1.50                6.21        46.319
large/E.coli        4.5        suffix tree      8.17                2.91               11.08       232.467
                               CDAWG            8.58                2.31               10.89       139.802
                               suffix array     5.95                1.46                7.41        53.086

Application: spam detection
• The equivalence classes formed by spam are larger than those formed by non-spam.
• SPAM: many copies of the same message are sent.
[Figure: plot with axes "the size of the equivalence class" and "the number of the equivalence class"]
[Photo: Japanese "sushi" made with Spam; this Spam is not related to this study.]

Application: spam detection
"Unsupervised Spam Detection based on String Alienness Measures"
by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano and Masayuki Takeda
Accepted at the Tenth International Conference on Discovery Science (DS '07), Sendai, Japan, 1-4 October 2007.
If you are interested in our study and want to come to the conference, search for "Discovery Science 2007" rather than "DS 07".
Summary
• Presented an algorithm for computing the equivalence classes with suffix array
  ▫ simulating traversal on suffix tree + suffix links
  ▫ using only lcp and rank arrays
  ▫ running in linear time and space
• Compared with other data structures
  ▫ less memory
  ▫ faster computation
• Can be applied to spam detection [DS '07]

Thank You

Compute size of the EC
[Figure: suffix tree of ababbbabbc$]
The size is the sum of the lengths of the labels from the parent to each node of the EC: 1 + 3 = 4.

Compute minimal strings of the EC
[Figure: nodes x, y and z with labels x1 x2 ... xk, y1 ... yk and z1 z2 ... zm]
• case 1: if the node is the representative, the label "xx1" is one of the minimal strings
• case 2: if the two label lengths satisfy k > m, the label "zz1" is one of the minimal strings

suffix tree
• each node has:
  ▫ parent
  ▫ leftmost child
  ▫ right sibling
  ▫ suffix link
  ▫ label of the incoming edge
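The suffix, lcp and rank arrays of the running example ababbbabbc$, and the three lemmas stated on the "Solving difficulty" slides, can be tried out with a short sketch. Everything below is an assumption made for illustration: the arrays are built naively (not in linear time), they are 1-indexed as on the slides with index 0 as unused padding, $ is made to sort after the letters to match the slide tables, lcp gets a sentinel -1 at index n + 1, and the function names are hypothetical. Node intervals [L, R] are passed in by hand rather than found by a traversal.

```python
def build_arrays(w):
    """Naive 1-indexed suffix, lcp and rank arrays (index 0 is padding)."""
    n = len(w)
    # Assumption: '$' sorts after every letter, as in the slide tables.
    sa = [0] + sorted(range(1, n + 1),
                      key=lambda i: w[i - 1:].replace("$", "~"))
    rank = [0] * (n + 1)
    for i in range(1, n + 1):
        rank[sa[i]] = i
    lcp = [0] * (n + 2)
    lcp[1] = lcp[n + 1] = -1          # lcp[n + 1] is an added sentinel
    for i in range(2, n + 1):
        a, b = w[sa[i - 1] - 1:], w[sa[i] - 1:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcp[i] = k
    return sa, lcp, rank


def is_representative(L, R, w, sa, rank):
    """Difficulty 1: conditions 1, 2 or 3 for the node with interval [L, R]."""
    l, r = sa[L], sa[R]
    if l - 1 == 0 or r - 1 == 0:                      # condition 3
        return True
    L2, R2 = rank[l - 1], rank[r - 1]
    # w is 0-indexed in Python, so slide position l - 1 is w[l - 2].
    return R - L != R2 - L2 or w[l - 2] != w[r - 2]   # conditions 1, 2


def same_class_after_suffix_link(L, R, length, sa, lcp, rank):
    """Difficulty 2: is the node with interval [L, R] and label length
    `length` equivalent to the node reached by its suffix link?"""
    l, r = sa[L], sa[R]
    L2, R2 = rank[l + 1], rank[r + 1]
    return (R - L == R2 - L2
            and lcp[L2] < length - 1
            and lcp[R2 + 1] < length - 1)


def ec_size(intervals, lcp):
    """Difficulty 3: sum of lcp(R) - max(lcp(L), lcp(R + 1)) over the nodes."""
    return sum(lcp[R] - max(lcp[L], lcp[R + 1]) for L, R in intervals)


W = "ababbbabbc$"
SA, LCP, RANK = build_arrays(W)
# Node babb has interval [4, 5]; node abb has interval [2, 3].
print(is_representative(4, 5, W, SA, RANK))                  # True
print(same_class_after_suffix_link(4, 5, 4, SA, LCP, RANK))  # True
print(same_class_after_suffix_link(2, 3, 3, SA, LCP, RANK))  # False
print(ec_size([(4, 5), (2, 3)], LCP))                        # 4
```

The last value agrees with the 1 + 3 = 4 computed from edge labels on the "Compute size of the EC" slide.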
