Efficient Computation of
Substring Equivalence
Classes with Suffix Arrays
Kazuyuki Narisawa,
Shunsuke Inenaga, Hideo Bannai and Masayuki Takeda
Kyushu University, Japan
Contents
• Introduction
• Problem definition
• Suffix tree based algorithm
• Simulation by suffix array
• Computational experiment
• Application
• Summary
Main contribution
Time and space efficient computation
of substring equivalence classes
[Blumer et al. 1987] with suffix arrays
• Linear time and space
• Faster and requires less memory than suffix tree and CDAWG based methods
Equivalence relation and classes
Given a string w, the maximal extension of a substring x is $\overleftrightarrow{x} = \alpha x \beta$, where
・every time x occurs in w, it is preceded by α and followed by β;
・strings α and β are the longest possible.
Equivalence relation: $x \equiv y \iff \overleftrightarrow{x} = \overleftrightarrow{y}$
Equivalence class: $[x] = \{\, y \mid y \equiv x \,\}$
Substrings in the same class have essentially identical occurrences in w.
example
Betty–bought–a–bit–of–better–butter–and–made–a–better–batter–after–breakfast.
$\overleftrightarrow{\text{bet}}$ = –better–b
bet ∈ [–better–b]
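The definition can be checked by brute force on this example. The following is a minimal sketch, not the method of the talk: the helper names `occurrences` and `maximal_extension` are made up for illustration, and the ASCII hyphen stands in for the dash used on the slide.

```python
def occurrences(w, x):
    """0-based start positions of every occurrence of x in w (overlaps included)."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def maximal_extension(w, x):
    """Grow x left and right while all occurrences agree on the next character."""
    occ = occurrences(w, x)
    if not occ:
        raise ValueError("x does not occur in w")
    left = 0   # length of alpha
    while (all(i - left - 1 >= 0 for i in occ)
           and len({w[i - left - 1] for i in occ}) == 1):
        left += 1
    right = 0  # length of beta
    while (all(i + len(x) + right < len(w) for i in occ)
           and len({w[i + len(x) + right] for i in occ}) == 1):
        right += 1
    start = occ[0]
    return w[start - left:start + len(x) + right]


w = "Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast."
print(maximal_extension(w, "bet"))   # -better-b
print(maximal_extension(w, "ette"))  # -better-b, so ette is equivalent to bet
```

This takes quadratic time or worse and is only meant to make the definition concrete.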
Problem
• Input : string w of length n
• Output: the equivalence classes on w
• Difficulty
▫ The total number of elements in the equivalence classes (ECs for short) is O(n^2).
• Solution
▫ The number of ECs is O(n).
▫ Each EC can be succinctly represented in O(1) space.
Succinct representation of the ECs
• representative ・・・ the longest element (the maximal extension)
• minimal strings ・・・ the elements that fall into another EC when their leftmost or rightmost character is deleted
$[x] = \mathrm{Substring}(\overleftrightarrow{x}) \cap \bigcup_{y \text{ is minimal}} \mathrm{Superstring}(y)$
The elements of [x] can be enumerated from the representative and the minimal strings.
example
Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast.
[Figure: the representative -better-b, with the minimal strings of its equivalence class marked inside it.]
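The succinct representation can be tried out on the example string with a brute-force sketch. It is an illustration only, not the algorithm of the talk: the function names are hypothetical, the hyphen stands in for the slide's dash, and `equivalence_class` assumes that `rep` really is a representative.

```python
def occurrences(w, x):
    """0-based start positions of every occurrence of x in w."""
    return [i for i in range(len(w) - len(x) + 1) if w.startswith(x, i)]


def substrings(s):
    """All non-empty substrings of s."""
    return {s[i:j] for i in range(len(s)) for j in range(i + 1, len(s) + 1)}


def equivalence_class(w, rep):
    """Substrings of rep that occur in w exactly where rep does (brute force)."""
    rep_occ = occurrences(w, rep)
    members = set()
    for x in substrings(rep):
        offset = rep.find(x)
        if occurrences(w, x) == [i + offset for i in rep_occ]:
            members.add(x)
    return members


def minimal_strings(members):
    """Members that leave the class when either end character is deleted."""
    return {x for x in members
            if x[1:] not in members and x[:-1] not in members}


def from_succinct(rep, minimal):
    """Substring(rep) intersected with the superstrings of the minimal strings."""
    return {x for x in substrings(rep) if any(m in x for m in minimal)}


w = "Betty-bought-a-bit-of-better-butter-and-made-a-better-batter-after-breakfast."
rep = "-better-b"
members = equivalence_class(w, rep)
minimal = minimal_strings(members)
print(sorted(minimal))                         # ['be', 'ette', 'tter-b']
print(from_succinct(rep, minimal) == members)  # True
```

The representative and three minimal strings are enough to regenerate all 19 members of this class.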
Problem
• Input : string w of length n
• Output: succinct representations of the
equivalence classes on w
▫ Additionally, for each EC we will output
 - size (the number of elements in the EC)
 - frequency (the number of occurrences of the elements in the EC)
Possible solutions
• Suffix tree [Weiner 1973]
• Compact Directed Acyclic Word Graph (CDAWG) [Blumer et al. 1985]
• ECs can be computed with either data structure in linear time and space.
Suffix tree (with suffix link)
ababbbabbc$
[Figure: suffix tree of ababbbabbc$ with suffix links; leaves are numbered 1 to 11 by suffix position.]
Ignore leaves here because they form trivial ECs.
Equivalence classes on suffix tree
ababbbabbc$
EC def.: substrings with essentially the same occurrences
Equivalence relation on the suffix tree: two nodes are connected by a suffix link and their subtrees have the same number of leaves
[Figure: suffix tree of ababbbabbc$ with the ECs marked; the strings ba, bab, babb and abb are shown grouped together.]
Suffix tree algorithm
foreach node v in suffix tree {
    if (node v is representative of EC [v]≡) {
        follow suffix link;
        while (node is in EC [v]≡) {
            follow suffix link;
            compute size and minimal strings;
        }
    }
    output succinct representation of EC [v]≡;
}
Algorithm with suffix tree
[Figure: step-by-step traversal of the suffix tree of ababbbabbc$. Captions: "representative?", "number of incoming suffix links = 1 or not?", "follow suffix link", "same EC?", "same number of leaves?", "back to representative", "not representative: continue suffix tree traversal", "output: representative, minimal strings, size, frequency".]
Suffix tree requires large memory space.
Suffix array [Manber and Myers 1993]
• Can simulate traversal on the suffix tree using the lcp and rank arrays [Kasai et al. 2003]
• Can simulate traversal on suffix links using an additional data structure, the suffix link table [Abouelhoda et al. 2004]
Our algorithm simulates traversal on suffix links without a suffix link table.
Suffix array
ababbbabbc$
Lexicographically sort the suffixes:
 i   SA[i]  suffix
 1    1     ababbbabbc$
 2    3     abbbabbc$
 3    7     abbc$
 4    2     babbbabbc$
 5    6     babbc$
 6    5     bbabbc$
 7    4     bbbabbc$
 8    8     bbc$
 9    9     bc$
10   10     c$
11   11     $
Suffix Array
1 2 3 4 5 6 7 8 9 10 11
1 3 7 2 6 5 4 8 9 10 11
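The suffix array of the running example can be reproduced with a plain sort. This is a minimal sketch, not a linear-time construction. It assumes, as the order on the slide implies, that `$` sorts after every other character; mapping `$` to `\x7f` is just one way to get that order, and the function name is hypothetical.

```python
def suffix_array(w):
    """1-based start positions of the suffixes of w in sorted order."""
    key = w.replace("$", "\x7f")  # assumption: '$' is the largest character
    return sorted(range(1, len(w) + 1), key=lambda i: key[i - 1:])


print(suffix_array("ababbbabbc$"))  # [1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]
```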
Lcp array
ababbbabbc$
lcp[i]: the length of the longest common prefix of the i-th and (i–1)-th suffixes in the suffix array
[Figure: suffix tree of ababbbabbc$ with leaves labeled by suffix position.]
index         1  2  3  4  5  6  7  8  9 10 11
Lcp Array    -1  2  3  0  4  1  2  2  1  0  0
Suffix Array  1  3  7  2  6  5  4  8  9 10 11
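The lcp array can be computed from the text and the suffix array in linear time. The sketch below follows the well-known scheme attributed to Kasai et al.: visit the suffixes in text order and reuse the previous match length minus one. The function name and the 1-based conventions, including `-1` for the first entry, are chosen here to match the table above.

```python
def lcp_array(w, sa):
    """lcp values for ranks 1..n; sa holds 1-based suffix positions."""
    n = len(w)
    rank = [0] * (n + 1)
    for i, p in enumerate(sa, start=1):
        rank[p] = i
    lcp = [0] * (n + 1)   # lcp[0] is unused padding
    lcp[1] = -1           # the first suffix has no predecessor
    h = 0
    for p in range(1, n + 1):          # suffixes in text order
        r = rank[p]
        if r == 1:
            h = 0
            continue
        q = sa[r - 2]                  # suffix just before p in sorted order
        while p + h <= n and q + h <= n and w[p + h - 1] == w[q + h - 1]:
            h += 1
        lcp[r] = h
        if h > 0:
            h -= 1                     # the next suffix keeps at least h - 1
    return lcp[1:]


sa = [1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]
print(lcp_array("ababbbabbc$", sa))  # [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]
```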
Rank array
ababbbabbc$
rank[SA[i]] = i
[Figure: suffix tree of ababbbabbc$ with leaves labeled by suffix position.]
index         1  2  3  4  5  6  7  8  9 10 11
Rank Array    1  4  2  7  6  5  3  8  9 10 11
Suffix Array  1  3  7  2  6  5  4  8  9 10 11
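Since rank is the inverse permutation of the suffix array, it can be filled in one pass. A short sketch with a hypothetical function name, using 1-based values as on the slide:

```python
def rank_array(sa):
    """Inverse of the suffix array: rank[SA[i]] = i, with 1-based values."""
    rank = [0] * len(sa)
    for i, p in enumerate(sa, start=1):
        rank[p - 1] = i   # list slot p - 1 holds the rank of suffix p
    return rank


print(rank_array([1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]))
# [1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]
```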
Suffix array has less information
Information available during traversal for each data structure, when visiting node v
Suffix Tree
1. label from root to each node
2. label from parent to each node
3. num. leaves in each subtree
4. parent of each node
5. children of each node
6. suffix link of each node
Suffix Array
1. length of label from root to v
2. length of label from root to the parent of v
3. leftmost leaf ID in the subtree rooted at v
4. rightmost leaf ID in the subtree rooted at v
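The four items on the suffix array side can be collected with one stack-based scan of the lcp array. The sketch below uses a standard lcp-interval traversal, which is not necessarily the exact procedure of the talk. The names are hypothetical, the root is skipped, and the lcp values are those of the running example ababbbabbc$.

```python
LCP = [-1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]  # lcp values for ranks 1..11


def internal_nodes(lcp):
    """Yield (label length, parent label length, leftmost rank, rightmost rank)."""
    n = len(lcp)

    def at(i):
        # 1-based access; positions outside 1..n act as -1 sentinels
        return lcp[i - 1] if 1 <= i <= n else -1

    nodes = []
    stack = [(0, 1)]                       # (label length, leftmost rank); root
    for i in range(2, n + 2):
        left = i - 1
        while stack[-1][0] > max(at(i), 0):   # close nodes deeper than lcp[i]
            depth, left = stack.pop()
            right = i - 1
            parent_depth = max(at(left), at(right + 1))
            nodes.append((depth, parent_depth, left, right))
        if stack[-1][0] < at(i):              # open a new, deeper node
            stack.append((at(i), left))
    return nodes


for node in internal_nodes(LCP):
    print(node)
# (3, 2, 2, 3)  abb
# (2, 0, 1, 3)  ab
# (4, 1, 4, 5)  babb
# (2, 1, 6, 8)  bb
# (1, 0, 4, 9)  b
```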
[Figure: suffix tree of ababbbabbc$ annotated with the information available from the suffix array, e.g. "length of parent label from root: 1" and "label length from root: 4"; each leaf shows its suffix position and its rank.]
Suffix array algorithm
foreach node v in suffix tree (simulated by suffix array) {
    if (node v is representative of EC [v]≡) {      // difficulty 1
        follow suffix link;                          // difficulty 2
        while (node is in EC [v]≡) {
            follow suffix link;
            compute size and minimal strings;        // difficulty 3
        }
    }
    output succinct representation of EC [v]≡;
}
These are difficult because the suffix array has less information.
Solving difficulty 1 (representative judge)
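[Figure: node v covers the suffix array interval from index L to index R; l and r are the suffixes at its two ends, and l–1 and r–1 are the suffixes starting one position earlier.]
L = rank(l), R = rank(r)
L’ = rank(l – 1), R’ = rank(r – 1)
Lemma: $x = \overleftrightarrow{x}$ ⇔ 1., 2. or 3.
1. R – L ≠ R’ – L’ (different num. leaves?),
2. w[l – 1] ≠ w[r – 1] (different first char?), or
3. l – 1 = 0 or r – 1 = 0 (first char in string?)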
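The three conditions translate almost directly into code. The sketch below hard-codes the arrays of the running example with a padding entry at index 0, so the indices match the 1-based notation. The function name is hypothetical, and x is taken to be the label of the node whose interval is [L, R].

```python
W = "ababbbabbc$"
SA = [0, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]    # SA[1..11], index 0 unused
RANK = [0, 1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]  # RANK[1..11], index 0 unused


def is_representative(L, R):
    """True if the node with suffix array interval [L, R] is a representative."""
    l, r = SA[L], SA[R]
    if l == 1 or r == 1:                       # condition 3: l-1 = 0 or r-1 = 0
        return True
    if W[l - 2] != W[r - 2]:                   # condition 2: w[l-1] != w[r-1]
        return True
    return RANK[r - 1] - RANK[l - 1] != R - L  # condition 1: R'-L' != R-L


print(is_representative(4, 5))  # True:  babb
print(is_representative(2, 3))  # False: abb is always preceded by b
```

Condition 3 is tested first so that `W[l - 2]` and `RANK[l - 1]` are never read for a suffix starting at position 1.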
Solving difficulty 2 (equivalence relation judge)
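ax: label from root
x: label from root
[Figure: node v covers the suffix array interval from index L to index R; l and r are the suffixes at its two ends, and l+1 and r+1 are the suffixes starting one position later.]
L = rank(l), R = rank(r)
L’ = rank(l + 1), R’ = rank(r + 1)
Lemma: $ax \equiv x$ ⇔ 1., 2. and 3.
1. R – L = R’ – L’ (same number of leaves?),
2. lcp(L’) < |ax| – 1 (leftmost?), and
3. lcp(R’ + 1) < |ax| – 1 (rightmost?)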
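This lemma can be tried on the running example in the same way. In the sketch below, [L, R] is assumed to be the interval of the node labeled ax and `depth` is |ax|. The arrays are hard-coded with a padding entry at index 0. Treating lcp(n + 1) as -1 is an assumption made here so that condition 3 is defined at the right end of the array.

```python
SA = [0, 1, 3, 7, 2, 6, 5, 4, 8, 9, 10, 11]    # SA[1..11], index 0 unused
RANK = [0, 1, 4, 2, 7, 6, 5, 3, 8, 9, 10, 11]  # RANK[1..11], index 0 unused
LCP = [None, -1, 2, 3, 0, 4, 1, 2, 2, 1, 0, 0]  # LCP[1..11], index 0 unused


def lcp(i):
    """1-based lcp lookup with a -1 sentinel past the end (assumption)."""
    return LCP[i] if i < len(LCP) else -1


def same_class_as_suffix_link(depth, L, R):
    """True if ax (interval [L, R], |ax| = depth) is equivalent to x."""
    L2, R2 = RANK[SA[L] + 1], RANK[SA[R] + 1]
    return (R - L == R2 - L2             # condition 1: same number of leaves
            and lcp(L2) < depth - 1      # condition 2: L' is the leftmost leaf
            and lcp(R2 + 1) < depth - 1)  # condition 3: R' is the rightmost leaf


print(same_class_as_suffix_link(4, 4, 5))  # True:  babb and abb
print(same_class_as_suffix_link(3, 2, 3))  # False: abb and bb
```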
Solving difficulty 3 (size computation)
[Figure: three cases (case 1, case 2, case 3) of a node with suffix array interval from L to R; the label length of its parent is lcp(L) or lcp(R + 1).]
size = sum of the following over the nodes of the EC
Lemma
size = Σ { lcp(R) – max{ lcp(L), lcp(R + 1) } }
Computational experiment
• Comparison of algorithms
▫ suffix tree
▫ CDAWG
▫ suffix array
• Data
▫ two English and two Genome corpora
 - Canterbury corpus, Protein corpus
• Machine spec.
▫ Red Hat Linux
▫ CPU 2.8GHz, 1 GB memory
Experimental result
data name          size (MB)  data structure  construction (sec)  enumeration (sec)  total (sec)  memory (MB)
cantrby/plrabn12   0.47       suffix tree      0.95                0.21                1.16         21.446
                              CDAWG            0.97                0.18                1.15          9.278
                              suffix array     0.43                0.14                0.57          5.392
Protein Corpus/sc  2.8        suffix tree     12.08                1.43               13.51        121.887
                              CDAWG           12.76                1.12               13.88         69.648
                              suffix array     3.08                0.63                3.71         33.192
large/bible.txt    3.9        suffix tree      7.33                2.23                9.56        191.869
                              CDAWG            6.68                1.62                8.30         56.255
                              suffix array     4.71                1.50                6.21         46.319
large/E.coli       4.5        suffix tree      8.17                2.91               11.08        232.467
                              CDAWG            8.58                2.31               10.89        139.802
                              suffix array     5.95                1.46                7.41         53.086
Application : spam detection
The sizes of the equivalence classes formed by spam are larger than those formed by non-spam.
SPAM: many copies of the same message are sent.
[Figure: plot with axes "the number of the equivalence class" and "the size of the equivalence class".]
(The sample shown is a Japanese spam message that uses the word “Sushi”; the spam itself is not related to this study.)
Application : spam detection
“Unsupervised Spam Detection based on
String Alienness Measures”
by Kazuyuki Narisawa, Hideo Bannai, Kohei Hatano
and Masayuki Takeda
Accepted at the Tenth International Conference on Discovery Science (DS ’07),
Sendai, Japan, 1-4 October 2007.
If you are interested in our study and want to come to the conference, search for “Discovery Science 2007” rather than “DS 07”.
Summary
• Presented an algorithm for computing the equivalence classes with suffix arrays
▫ simulating traversal on suffix tree + suffix links
▫ using only lcp and rank arrays
▫ running in linear time and space
• Compared with other data structures
▫ less memory
▫ faster computation
• Can be applied to spam detection [DS ’07]
Thank You
Compute size of the EC
[Figure: suffix tree of ababbbabbc$.]
size = sum of the lengths of the labels from the parent to each node
Example: 1 + 3 = 4
Compute minimal strings of the EC
[Figure: nodes x, y and z, with labels x1 ... xk, y1 ... yk and z1 ... zm.]
case 1: if the node is the representative, the label “xx1” is one of the minimal strings
case 2: if the two label lengths satisfy k > m, the label “zz1” is one of the minimal strings
suffix tree
• each node has:
▫ parent
▫ leftmost child
▫ right sibling
▫ suffix link
▫ label of the incoming edge