A New Compressed Suffix Tree Supporting Fast Search and its Construction Algorithm Using Optimal Working Space
Dong Kyue Kim1 and Heejin Park2
1 School of Electrical and Computer Engineering, Pusan National Univ.
2 College of Information and Communications, Hanyang Univ.
Contents
Preliminaries
Previous results
Our contribution
Conclusion
Suffix Tree
The suffix tree (ST) of a text T
A compacted trie for all the suffixes of T.
An example for accagat#.
[Figure: the suffix tree of accagat#, whose eight leaves are the suffixes #, accagat#, agat#, at#, cagat#, ccagat#, gat#, and t#.]
We assume that # is the lexicographically smallest special symbol.
Suffix Array
The suffix array (SA) of a text T
pos array
lcp array
The pos array of T stores the starting positions of
the lexicographically sorted suffixes of T.
index  pos  suffix
1      8    #
2      1    accagat#
3      4    agat#
4      6    at#
5      3    cagat#
6      2    ccagat#
7      5    gat#
8      7    t#
T = accagat#
The lcp array of T stores the length of the longest common prefix of each pair of adjacent suffixes in the pos array.
For example, lcp[3] stores 1, which is the length of the longest common prefix of accagat# and agat#.
index  pos  lcp  suffix
1      8    -    #
2      1    0    accagat#
3      4    1    agat#
4      6    1    at#
5      3    0    cagat#
6      2    1    ccagat#
7      5    0    gat#
8      7    0    t#
T = accagat#
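The pos and lcp arrays for T = accagat# can be reproduced with a few lines of Python. This is only a minimal sketch for checking the example: it sorts the suffixes naively, which is far from the space-efficient construction discussed in this talk, and the function names suffix_array and lcp_array are made up for illustration. It relies on # being smaller than every letter in ASCII, which matches the assumption on the slides.

```python
def suffix_array(text):
    """Return the 1-based starting positions of the sorted suffixes (the pos array)."""
    order = sorted(range(len(text)), key=lambda start: text[start:])
    return [start + 1 for start in order]


def lcp_array(text, pos):
    """Return lcp, where lcp[k] compares the k-th and (k+1)-th sorted suffixes (0-based k).

    The first entry has no predecessor, so it is None.
    """
    def common_prefix_length(a, b):
        length = 0
        while length < min(len(a), len(b)) and a[length] == b[length]:
            length += 1
        return length

    lcp = [None]
    for k in range(1, len(pos)):
        previous_suffix = text[pos[k - 1] - 1:]
        current_suffix = text[pos[k] - 1:]
        lcp.append(common_prefix_length(previous_suffix, current_suffix))
    return lcp


T = "accagat#"
pos = suffix_array(T)
lcp = lcp_array(T, pos)
print(pos)  # [8, 1, 4, 6, 3, 2, 5, 7]
print(lcp)  # [None, 0, 1, 1, 0, 1, 0, 0]
```

The printed lists agree with the pos and lcp columns of the example, with list index 0 corresponding to index 1 on the slide.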
Storing Suffix Trees in Arrays
If a suffix tree is used as a static data structure, it can be implemented using arrays instead of nodes and pointers, in a way similar to how a complete binary tree is stored in an array.
Array-based data structures storing suffix trees
Enhanced suffix arrays (ESA)
Linearized suffix trees (LST)
Enhanced Suffix Array
Enhanced suffix array
developed by Abouelhoda et al. [SPIRE ’02, WABI ’02, JDA ’04]
a pos array + an lcp array + a child table
The child table is an array implementation of the suffix tree topology in which the children of each node are organized as a linked list.
Pattern search takes O(m|Σ|) time, where m is the pattern length and |Σ| is the alphabet size.
Linearized Suffix Tree
Linearized suffix tree
An improvement on ESA developed by Kim et al. [SPIRE ’04]
a pos array + an lcp array + a new child table
The new child table is an array implementation of the suffix tree topology in which the children of each node are organized as a complete binary tree.
Pattern search takes O(m log |Σ|) time, where m is the pattern length and |Σ| is the alphabet size.
Compressed Full-text Indices
Compressed full-text indices
Occupy O(n log|Σ|)-bit space.
By contrast, all the full-text indices introduced so far (ST, SA, ESA, LST) occupy O(n)-word space, i.e., O(n log n) bits.

Compressed suffix array (CSA)

Succinct representation of pos array.
Compressed suffix tree (CST)

Succinct representation of a pos array, an lcp array, and
a suffix tree topology.
Previous Results
Munro et al. [1998], Sadakane[2002]

A succinct representation of a suffix tree topology
Grossi and Vitter [2000]

A succinct representation of a pos array
Sadakane [2002]

A succinct representation of an lcp array
These data structures occupy O(n log|Σ|)-bit space. However, when they were introduced, the working space needed to construct them was more than O(n log|Σ|) bits.
Previous Results
Hon et al. [2002][2003] developed O(n log|Σ|)-bit working space algorithms for constructing CSTs and CSAs that run in O(n log^ε n) time.
Their construction algorithm for CSTs can construct CSTs supporting O(m log^ε n |Σ|)-time pattern search, where m is the pattern length.
However, it cannot construct CSTs supporting O(m log^ε n log|Σ|)-time pattern search.
Our Contribution
We first present a new CST supporting O(m log^ε n log|Σ|)-time pattern search, where m is the pattern length.
Then, we present an algorithm for constructing the new CST running in optimal O(n log|Σ|)-bit working space and O(n log^ε n) time.
New Compressed Suffix Tree
Our new compressed suffix tree is a succinct representation of the linearized suffix tree (LST). It consists of:
a succinct representation of a pos array (Grossi & Vitter), the same as before,
a succinct representation of an lcp array (Sadakane), the same as before, and
a succinct representation of a child table, which stores a suffix tree topology. This part is new.
Previous Compressed Suffix Tree Topology
The previous succinct representation of a suffix tree topology is the parentheses representation.
In this representation, every node is represented by a pair of parentheses.
The pair of parentheses of a node encloses its children’s parentheses.
For accagat#, with leaves numbered 1 to 8:
  1    2  3  4      5  6    7  8
( () ( () () () ) ( () () ) () () )
In this representation, the parent-child relationship is stored implicitly.
To find a child of a node, a range-minima query is required.
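To make the parentheses representation concrete, the sketch below encodes the suffix tree topology of accagat# and then recovers the children of a node by matching parentheses. The nested-list encoding of the tree and the function names are assumptions made for this example only. The child lookup here is a naive linear scan. It does not implement the range-minima machinery mentioned above, and only illustrates that the parent-child relationship has to be derived from the bit string rather than read from it.

```python
# Topology of the suffix tree for accagat#: integers are leaves (numbered as
# in the pos array), lists are internal nodes. This encoding is hypothetical.
TREE = [1, [2, 3, 4], [5, 6], 7, 8]


def to_parentheses(node):
    """Encode a node as '(' + encodings of its children + ')'."""
    if isinstance(node, int):
        return "()"
    return "(" + "".join(to_parentheses(child) for child in node) + ")"


def find_close(bits, open_pos):
    """Return the position of the ')' matching the '(' at open_pos."""
    depth = 0
    for pos in range(open_pos, len(bits)):
        depth += 1 if bits[pos] == "(" else -1
        if depth == 0:
            return pos
    raise ValueError("unbalanced parentheses")


def child_spans(bits, open_pos):
    """Return (open, close) positions of every child of the node at open_pos."""
    close_pos = find_close(bits, open_pos)
    spans = []
    pos = open_pos + 1
    while pos < close_pos:
        end = find_close(bits, pos)
        spans.append((pos, end))
        pos = end + 1
    return spans


bits = to_parentheses(TREE)
print(bits)                  # (()(()()())(()())()())
print(child_spans(bits, 0))  # [(1, 2), (3, 10), (11, 16), (17, 18), (19, 20)]
```

The root has five children, and each lookup walks over the enclosed parentheses, which is the cost that a succinct index on the string is meant to avoid.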
New compressed tree topology
Our succinct representation differs from the
previous one in that we store the parent-child
relationship explicitly rather than implicitly.
Range-minima query is not required.
Child Table
We first describe a child table and then the
succinct representation of a child table, i.e., the
compressed child table.
A child table stores an lcp-interval tree that is a
modification of a suffix tree.
We first show how to modify a suffix tree into an lcp-interval tree.
Then, we show how to store an lcp-interval tree in a child table.

Child Table
suffix tree -> lcp-interval tree -> child table
[Figure: the suffix tree for accagat# (left) and the suffix tree for accagat# whose node branching is a complete binary tree (right). In both trees, leaves 1 to 8 are the suffixes #, accagat#, agat#, at#, cagat#, ccagat#, gat#, and t#.]
Child Table
suffix tree -> lcp-interval tree -> child table
Each node in the suffix tree is replaced by the interval in the pos array which stores the suffixes in the subtree rooted at the node.
lcp-interval tree
[1..8]
  [1..6]
    [1..4]
      [1]  #
      [2..4]
        [2..3]
          [2]  accagat#
          [3]  agat#
        [4]  at#
    [5..6]
      [5]  cagat#
      [6]  ccagat#
  [7..8]
    [7]  gat#
    [8]  t#
Child Table
suffix tree -> lcp-interval tree -> child table
Each interval [i..j] needs to store only the first index of its right child, denoted by child(i,j), to compute its two children [i..child(i,j)-1] and [child(i,j)..j].
Interval [1..8] needs to store only 7 to compute its two children [1..6] and [7..8].
Interval [1..6] stores 5 to compute its two children [1..4] and [5..6].
child table
index   1  2  3  4  5  6  7
cldtab  7  4  3  2  6  5  8
Child Table
suffix tree -> lcp-interval tree -> child table
Where is child(i,j) stored?
We store child(i,j) in cldtab[i] or cldtab[j].
If [i..j] is a right child, child(i,j) is stored in cldtab[i].
If [i..j] is a left child, child(i,j) is stored in cldtab[j].
Interval [7..8] is a right child, so child(7,8) = 8 is stored in cldtab[7].
Interval [1..6] is a left child, so child(1,6) = 5 is stored in cldtab[6].
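The storage rule can be checked against the example with a short Python sketch. The interval list below is copied by hand from the lcp-interval tree for accagat#, and the names INTERVALS, build_child_table, and children are hypothetical. The slides do not say where the root's entry goes. Storing it in cldtab[i], like a right child, is an assumption that happens to match the child table shown (cldtab[1] = 7).

```python
# Internal intervals of the lcp-interval tree for accagat#:
# (i, j, child(i,j), role of [i..j] with respect to its parent).
INTERVALS = [
    (1, 8, 7, "root"),   # assumption: the root is stored like a right child
    (1, 6, 5, "left"),
    (7, 8, 8, "right"),
    (1, 4, 2, "left"),
    (5, 6, 6, "right"),
    (2, 4, 4, "right"),
    (2, 3, 3, "left"),
]


def build_child_table(intervals):
    """Place child(i,j) in cldtab[j] for left children and cldtab[i] otherwise."""
    cldtab = {}
    for i, j, child, role in intervals:
        slot = j if role == "left" else i
        assert slot not in cldtab, "two intervals mapped to the same slot"
        cldtab[slot] = child
    return [cldtab[slot] for slot in sorted(cldtab)]


def children(i, j, child):
    """Recover the two children of [i..j] from child(i,j)."""
    return (i, child - 1), (child, j)


print(build_child_table(INTERVALS))  # [7, 4, 3, 2, 6, 5, 8]
print(children(1, 8, 7))             # ((1, 6), (7, 8))
```

Every internal interval lands in a distinct slot from 1 to 7, so the table needs exactly n - 1 entries for n = 8 suffixes.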
Compressed Child Table
child table -> difference child table -> compressed child table
Difference child table
diff array
sign array
index   1  2  3  4  5  6  7
cldtab  7  4  3  2  6  5  8
diff    1  0  0  1  0  1  0
sign    0  0  0  1  0  0  0
In a diff array, instead of storing child(i,j), we store min{j-child(i,j), child(i,j)-i}.
For an interval [1..4] whose child(1,4) = 2, we compute 4-2=2 and 2-1=1 and the minimum 1 is stored in diff[4].
Whether diff[k] stores j-child(i,j) or child(i,j)-i is indicated by sign[k], where k is the index at which child(i,j) is stored. sign[k] stores 0 if j-child(i,j) is stored in diff[k] and 1 if child(i,j)-i is stored.
Since diff[4] stores child(1,4)-1, sign[4] stores 1.
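The diff and sign arrays of the example can be recomputed from the child table as follows. This is a sketch under stated assumptions: the ENTRIES list is copied by hand from the example, the function names are invented, and the tie-break when both gaps are equal (prefer j-child(i,j), sign 0) is a guess, since no tie occurs in the example.

```python
# (i, j, child(i,j), k): interval [i..j], its child value, and the index k
# of the child table at which that value is stored (taken from the example).
ENTRIES = [
    (1, 8, 7, 1),
    (2, 4, 4, 2),
    (2, 3, 3, 3),
    (1, 4, 2, 4),
    (5, 6, 6, 5),
    (1, 6, 5, 6),
    (7, 8, 8, 7),
]


def encode(i, j, child):
    """Return (diff, sign): the smaller gap and which gap it is."""
    right_gap = j - child
    left_gap = child - i
    if right_gap <= left_gap:   # tie-break is an assumption
        return right_gap, 0
    return left_gap, 1


def decode(i, j, diff_value, sign_value):
    """Recover child(i,j) from the interval and its diff and sign entries."""
    return j - diff_value if sign_value == 0 else i + diff_value


diff = [None] * len(ENTRIES)
sign = [None] * len(ENTRIES)
for i, j, child, k in ENTRIES:
    diff[k - 1], sign[k - 1] = encode(i, j, child)

print(diff)  # [1, 0, 0, 1, 0, 1, 0]
print(sign)  # [0, 0, 0, 1, 0, 0, 0]
```

Decoding needs the interval boundaries i and j, which a traversal already knows when it arrives at a node.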
Compressed Child Table
child table -> difference child table -> compressed child table
Compressed child table
Compressed diff array
C array: a concatenated bit string of the integers in the diff array
D array: a bit string of the same length as the C array where most bits are 0 except the starting bit of each integer in the diff array
Data structures for rank and select for the D array to find the ith leftmost 1 in the D array
sign array
diff     4    0  3   1  0  1  0
C array  100  0  11  1  0  1  0
D array  100  1  10  1  1  1  1
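The C and D arrays for the diff values 4, 0, 3, 1, 0, 1, 0 can be built and queried as sketched below. The function names are invented for this example, and select1 is a naive linear scan standing in for the o(n)-bit rank and select structures, so the sketch shows only the layout and not the constant-time access.

```python
def compress_diff(diff):
    """Return (C, D): concatenated binary values and a marker string with a 1
    at the first bit of each value."""
    c_parts, d_parts = [], []
    for value in diff:
        word = format(value, "b")                  # binary, no leading zeros
        c_parts.append(word)
        d_parts.append("1" + "0" * (len(word) - 1))
    return "".join(c_parts), "".join(d_parts)


def select1(bits, rank):
    """Return the 0-based position of the rank-th 1 (rank counts from 1),
    or len(bits) if there are fewer than rank ones."""
    seen = 0
    for position, bit in enumerate(bits):
        if bit == "1":
            seen += 1
            if seen == rank:
                return position
    return len(bits)


def access(c_array, d_array, k):
    """Return the k-th diff value (k counts from 1)."""
    start = select1(d_array, k)
    end = select1(d_array, k + 1)
    return int(c_array[start:end], 2)


C, D = compress_diff([4, 0, 3, 1, 0, 1, 0])
print(C)  # 1000111010
print(D)  # 1001101111
print([access(C, D, k) for k in range(1, 8)])  # [4, 0, 3, 1, 0, 1, 0]
```

Two select queries delimit the bits of one value, so no per-entry length field is needed.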
Compressed Child Table
Space consumption of a compressed child table

Compressed child table requires 5n + o(n) bits.




C array: 2n bits
D array: 2n bits
Data structures for rank and select: o(n) bits
sign array: n bits
Construction Algorithm
We construct the compressed child table directly from the
lcp array without building a suffix tree or an lcp-interval
tree as intermediate data structures.

The child table can be constructed directly from the lcp array in O(n) time using the algorithm of Kim et al. [SPIRE ’04].

They first construct the extended lcp array and then compute the child table.
We modify their construction algorithm so that it
constructs the compressed child table directly from the
compressed lcp array.
Construction Algorithm
The construction algorithm consists of two procedures
EXTLCP and CHILD.

Procedure EXTLCP constructs the compressed extended lcp
array from the compressed lcp array.

Procedure CHILD constructs the compressed child table, which consists of the C, D, and sign arrays, from the compressed extended lcp array.
Construction Algorithm
Pseudo-code for EXTLCP
Construction Algorithm
To obtain the O(n log|Σ|)-bit working space, the size of temporary data structures should be O(n log|Σ|) bits.
The arrays ranking and numchild are of size O(n log|Σ|) bits because a node may have at most |Σ| children, so each entry of these arrays consumes log|Σ| bits.
The size of the stack is O(n log|Σ|) bits because it can be encoded by δ-code.
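Assuming that δ-code here means the Elias delta code, the following sketch shows how a sequence of positive integers can be packed into a bit string and read back. How the stack entries of EXTLCP are actually laid out is not given in the slides, so this only illustrates the code itself. The function names are invented, and values that may be 0 would have to be shifted by one before encoding.

```python
def delta_encode(value):
    """Elias delta code of a positive integer, as a string of '0'/'1'."""
    if value < 1:
        raise ValueError("Elias delta encodes integers >= 1")
    binary = format(value, "b")              # N bits, starting with 1
    length_bits = format(len(binary), "b")   # N in binary
    gamma_of_length = "0" * (len(length_bits) - 1) + length_bits
    return gamma_of_length + binary[1:]      # the leading 1 is implicit


def delta_decode_all(bits):
    """Decode a concatenation of Elias delta codewords."""
    values = []
    pos = 0
    while pos < len(bits):
        zeros = 0
        while bits[pos] == "0":
            zeros += 1
            pos += 1
        length = int(bits[pos:pos + zeros + 1], 2)
        pos += zeros + 1
        values.append(int("1" + bits[pos:pos + length - 1], 2))
        pos += length - 1
    return values


stream = "".join(delta_encode(v) for v in [1, 2, 10, 7])
print(delta_encode(10))          # 00100010
print(delta_decode_all(stream))  # [1, 2, 10, 7]
```

Small values get short codewords (1 is a single bit), which is what keeps a stack of mostly small integers compact.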
Construction Algorithm
Pseudo-code for CHILD
We also developed some techniques to reduce the working space.
Conclusion
We presented a new compressed suffix tree supporting O(m log^ε n log|Σ|)-time pattern search, whose compressed child table consumes 5n + o(n) bits.
We also presented a construction algorithm for our compressed suffix tree running in O(n log|Σ|)-bit working space and O(n log^ε n) time.
Compressed Child Table
Space consumption of a compressed child table
Compressed child table requires 5n + o(n) bits.
C array: 2n bits
$S(n) = \max_{1 \le k \le n/2} \{ S(k) + S(n-k) + \log(k+1) \}$
D array: 2n bits
Data structures for rank and select: o(n) bits
sign array: n bits
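The recurrence for S(n) appears to bound the number of C-array bits needed for an interval of n suffixes, where the smaller side of a split has size k. A quick numeric check of the 2n bound can be sketched as follows, under two assumptions that the slide does not state: the base case S(1) = 0 (a single suffix has no child entry) and base-2 logarithms. The function name is hypothetical.

```python
import math


def worst_case_bits(n_max):
    """Evaluate S(n) = max over 1 <= k <= n/2 of S(k) + S(n-k) + log2(k+1)
    by dynamic programming, with S(1) = 0 assumed as the base case."""
    s = [0.0] * (n_max + 1)
    for n in range(2, n_max + 1):
        s[n] = max(
            s[k] + s[n - k] + math.log2(k + 1)
            for k in range(1, n // 2 + 1)
        )
    return s


S = worst_case_bits(200)
for n in (2, 4, 50, 200):
    print(n, round(S[n], 3), 2 * n)
```

Under these assumptions, an induction on n gives S(n) <= 2n - log2(n+1) - 1, because k <= n/2 implies n + 1 <= 2(n - k + 1). This is consistent with the 2n bits stated for the C array.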