Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ESA2008@Universitat Karlsruhe, Sep 15, 2008 An Online Algorithm for Finding the Longest Previous Factors Daisuke Okanohara Kunihiko Sadakane University of Tokyo Kyushu University Problem: Finding the longest previous factors (matching) • Input : A text T[0…n-1] • At all position k, report the longest substring T[k-len+1…k] that also occurs in previous positions (history) T[pos…pos+len-1] = T[k-len+1…k] – c.f. LZ77, LZ-factorization a b r a c a d a b r a a b r a c a d a b r a d (pos, len) = (0, 4) (pos, len) = (5, 2) Applications • Data Compression – LZ77, Prediction by Partial Matching • Pattern Analysis – Log analysis • Data Mining Previous approach • Sequential search on the fly – O(n2) time for a text of length n • Offline- Index approach – Read an whole text beforehand, and build an index (suffix array/trees) for it. – Search the match using the index [Chen 07] [Chen 08] [Crochemore 08] [Kolpakov 01] [Larsson 99] – 6n bytes, and O(n log n) time [Chen 08] Suffix Arrays with Range Minimum Query New Problem: Online finding the longest previous factors • Report match information just after reading each character – A case where we don’t know the length of data beforehand, e.g. streaming data • Previous approaches cannot deal with this problem Our approach for new problem • Online construction of enhanced prefix arrays – Update an index just after reading each character – Although many methods used in LZ77 cannot report the longest match, our method can. • Succinct data structures – Keep all information very compactly; using about the same space for an original text Prefix arrays • Keep NOT suffix arrays (SA), but prefix arrays (PA) – because when a character is added at the last of a text, SA may cause W (n) changes, but PA not – In PA, prefixes are sorted in the reverse-lexicographic order T=aaaa SA for T 0$ 1 a$ 2 aa$ 3 aaa$ 4 aaaa$ Tnew=aaaaz SA for Tnew PA for T PA for Tnew 0$ 4 aaaaz$ 3 aaaz$ 2 aaz$ 1 az$ 5 z$ 0 $ 1 $a 2 $aa 3 $aaa 4 $aaaa 0 $ 1 $a 2 $aa 3 $aaa 4 $aaaa 5 $aaaaz Our idea • Weiner’s suffix tree construction algorithm – Insert the suffixes from the shortest ones – Modify it to the insert prefixes form the shortest ones – Similar idea is used for the incremental construction of compressed suffix arrays [Chan, et. al 2007], [Lippert 2005] • We extend this work to the succinct version – Our algorithm reports matching information as a byproduct of construction – Do not require tree representation, we just use array information Preliminary: Dynamic Rank/Select Dictionary (DRSD) • For an text T[0…n-1], DRSD supports: – – – – rank(T, c, i): return the number of c in T[0…i] select(T, c, i): return the position of i-th c in T insert(T, c, i): insert c at T[i] delete(T, i): delete T[i] • These operations can be supported in time (O(logn) time if s < logn), bits space where s is the alphabet size [Lee, et. al. 07], Preliminary: Range Minimum Query (RMQ) • Given an array E[0…n-1] of elements from totally ordered set, rmq(E, l, r) returns the index of the smallest element in E[l…r] – i.e. rmq(E, l, r) = argmink∈[l, r]E[k] – return the leftmost such element in the tie • In the static case, RMQ can be supported in O(1) time using 2n+o(n) bits space [Fischer, 2007] • In the dynamic case, RMQ/insert/delete can be supported in O(Tlogn) time using O(n) bits if the lookup cost (E[i]) is O(T) Data structures • Keep the following data structures for T[0…k] – Assume T[0]=$, $ is the unique smallest character • B[0…k]: (Prefix-) BW-transformed Text – B[i] = T[PA[i]+1] and B[i] = $ if PA[i]=k • H[0…k]: Height Array – will be explained in the next slide • C[0…s-1] : Cumulative Array – C[c] = the total number of characters c’ s.t. c’ < c in T • s: The position for the next prefix to be inserted T = $abaababa i PA prefix B 0 0 $ a 1 1 $a b 2 4 $abaa b 3 3 $aba a 4 8 $abaababa $ 5 6 $abaaba b 6 2 $ab a 7 5 $abaab a 8 7 $abaabab a H 0 1 1 3 3 0 2 2 0 T = $abaababa i PA prefix 0 0 1 1 2 4 3 4 5 6 7 8 3 8 6 2 5 7 B $ a $a b $abaa b $aba $abaababa $abaaba $ab $abaab $abaabab a $ b a a a H 0 1 1 3 3 0 2 2 0 PA stores the end position of each prefix (we will omit this) Prefix stores prefixes sorted in the reverselexicographic order (Neither PA nor prefix are stored explicitly) We can examine PA[i] by using SAlookup operation using O(log2n) time as in FM-index [Ferragina 00] T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a B stores the next character for each prefix H (Burrows Wheeler’s 0 transform for prefix 1 arrays) 1 3 3 0 2 2 0 T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 H stores the length of the longest common suffix between adjacent prefixes T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 s=4 s denotes the position where $ in B, and the longest prefix is placed. T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix C[c] = the number of characters c’ that is smaller than c in T(=B) B H $ a 0 C[$]=0 $a b 1 C[a]=1 $abaa b 1 $aba a 3 $abaababa $ 3 $abaaba b 0 C[b]=6 $ab a 2 $abaab a 2 $abaabab a 0 T = $abaababaa i 0 1 2 3 4 5 6 7 8 The next character `a’ comes ! prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $a b a a a H 0 1 1 3 3 0 2 2 0 Replace $ in B[s] with a (because $ is placed in the position of the longest prefix) T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a a b a a a H 0 1 1 3 3 0 2 2 0 Find the position for the new prefix $abaababaa Count the number of a in B[0…s-1] = rank(B, a, s-1) = 2 T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $abaababaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b $ a a b a a a H 0 1 1 Insert $abaababaa at 3rd position in a 3 3 0 2 2 0 C[a]+rank(B, a, s-1) =3 s := C[a]+rank(B, a, s-1), insert(B, s, $) T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $abaababaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b $ a a b a a a H 0 1 3 3 0 2 2 0 Update H This is actually the length of the longest match in the history T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 Recall that in the previous step, $abaa and $aba are placed in the prefixes whose B is `a’ These positions can be found by using rank and select c. f. succ(T, `c’, s) = select(T, c, rank(T, s, c)) T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 RMQ(H, 4, 6) = 5, H[5] = 0 Therefore RMQ(H, 4, 6) + 1 is the new value for the next H entry T = $abaababa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b a $ b a a a H 0 1 1 3 3 0 2 2 0 RMQ(H, 3, 3) = 3, H[3] = 3 Therefore RMQ(H, 3, 3) + 1 is the new value for the next H entry T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $abaababaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b $ a a b a a a H 0 1 4 1 3 3 0 2 2 0 rmq(H, 3, 3) + 1 rmq(H, 4, 6) + 1 T = $abaababaa i 0 1 2 3 4 5 6 7 8 prefix $ $a $abaa $abaababaa $aba $abaababa $abaaba $ab $abaab $abaabab B a b b $ a a b a a a H 0 1 4 1 3 3 0 2 2 0 Report max(4, 1) = 4 as the length of the longest factor and report the position of $abaa as SAlookup[2] - len = 0 Report (pos=0, len=4) as the max. matching Overall algorithm All operations are rank, select, RMQ Overall Analysis • H is stored in 2n bits [Sadakane , Soda 02] – naïve representation requires O(n log n) bits – requires one SA lookup operation to decode • B is stored in nlogs+ o(nlogs) bits – by using dynamic rank/select dictionary • The bottleneck of our algorithm is rmq(H, I, r) which requires O(log3n) time – SAlookup requires O(log2n) time Overall Analysis (cont.) • We can solve the online longest previous factor problem in O(log3n) time for each character, using nlog2s+ o(nlogs) + O(n) bits of space – where s is the alphabet size, and n is the length of a text Simulating window buffer • If the working space is limited, we often discards the history from the oldest ones • We can simulate this by using the almost the same operations as in the insertion operation • We actually do not discard a character but ignore it – If we actually discard an oldest character , it may cause W(n) changes in B and H – The effect of discarded character is remained (prefixes are sorted according to the discarded characters) – But this does not cause the problem if we only report the matching information up to the history size Experiments • In experiment, we used a simpler data structure (algorithm is same) – B and H is store in the balanced binary tree – Each leaf stores the small block of B and H – We call this implementation as OS • Compare OS with other offline algorithms – – – – Require to read the whole text beforehand CPSa, CPSd: SA+LCP with stack [Chen, et. al. 07] CPS6n: SA with RMQ [Chen, et. al. 08] kk-lz: mreps, specialized for σ=4 [Kolpakov 01] Peak memory usage in bytes per input symbol • The space of OS is smallest in many real data especially when the values in H is small Runtime in milliseconds for searching the longest previous factors • OS is about 2~10 times slower than the fastest ones due to the dynamic operations Conclusion • Solve online longest matching problem by using enhanced prefix arrays – Simple and easy to implement – Require about 3~6 times space of the input text – Actually this is a by-product of construction of compressed suffix trees c.f. Weiner’s algorithm • Simple; and much room for improvements – by using better rank/select/rmq implementation Future work • Construction of compressed suffix trees – Update the parenthesis tree efficiently – Actually, the time complexity for this is smaller • Practical improvements – Currently, dynamic succinct data structure is not efficient due to cache misses, and memory fragmentation – Approximated version of longest matching problem; enough for many application Thank you for you attention ! Weiner’s suffix tree’s construction alg. . $a $abraca $abracada $abra $ab $abracadab $abracad $abr $a $abraca $abracada $abra $ab $abracad $abr a a $ $abrac $abr $ab $abr $abracad $abrac $a $abraca $abracada $abra $ab $abracadab $abracad $abr $abracadabr $ $abrac ba $ $abr $abrac $abr $abracad $abracada abr a $ $abrac ab $ $abr $abrac $ $abracad $abracad $abracada