Obtaining Provably Good Performance from
Suffix Trees in Secondary Storage*

Pang Ko and Srinivas Aluru

Department of Electrical and Computer Engineering
Laurence H. Baker Center for Bioinformatics and Biological Statistics
Iowa State University
{kopang,aluru}@iastate.edu
Abstract. Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological
sequence data. The suffix tree is an important indexing structure that
provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees
in external memory or layout schemes for suffix trees that preserve link
locality. In this paper, we present a new suffix tree layout scheme for
secondary storage and present construction, substring search, insertion
and deletion algorithms that are competitive with the string B-tree. For
a set of strings of total length n, a pattern p and disk blocks of size B,
we provide a substring search algorithm that uses O(|p|/B + logB n) disk
accesses. We present algorithms for insertion and deletion of all suffixes
of a string of length m that take O(m logB (n + m)) and O(m logB n)
disk accesses, respectively. Our results demonstrate that suffix trees can
be directly used as efficient secondary storage data structures for string
and sequence data.
1 Introduction
The suffix tree data structure is widely used in text processing, information
retrieval, and computational biology. It is especially useful when there are no
word or sentence structures, such as in biological sequences, for which the suffix
tree is uniquely suited for indexing and querying. With the continued explosion
in the size of biological sequence databases, there is growing interest in string
indexing schemes in general, and disk-based suffix trees in particular.
The suffix tree of a set of strings is a compacted trie of all suffixes of all the
strings. Since the introduction of this data structure by Weiner [15], several linear
time algorithms for in-memory construction of suffix trees have been designed:
notable ones include McCreight’s linear space algorithm [11], Ukkonen’s on-line
algorithm [13], and Farach’s algorithm for integer alphabets [4]. To extend the
scale of data that can be handled in-memory, Grossi and Vitter [9] developed
* Research supported by the National Science Foundation under IIS-0430853.
compressed suffix trees and suffix arrays. In the last few years, there has been
significant research on disk-based storage of suffix trees for exploiting their utility on ever growing sequence databases. Many algorithms and strategies have
been proposed to reduce the number of disk accesses during the construction of
suffix trees [1, 2, 5, 10, 12]. Of these, only Farach et al. provided a construction
algorithm for secondary storage that achieves the optimal worst case bound of
Θ((n/B) logM/B (n/B)) disk accesses (where M is the size of main memory).
While these algorithms and techniques focused on suffix tree construction in
secondary storage, the problems of searching and updating suffix trees (insertion/deletion of all suffixes of a string) on disks have not received as much attention. An interesting solution is provided by Clark and Munro [3] that achieves
efficient space utilization. This is used to obtain a bound on disk accesses for substring search as a function of the height of the tree. In the worst case, the height
of a suffix tree can be proportional to the length of the text indexed, although it
is rarely the case and Clark and Munro’s approach provides good experimental
performance. To date, algorithms with provably good worst-case performance
for substring searches and updates for suffix trees in secondary storage are not
known. To overcome these theoretical limitations, Ferragina and Grossi have
proposed the string B-tree data structure [6, 7]. String B-trees provide the best
known bounds for the worst-case number of disk accesses required for queries and
updates. It is not known if the same performance bounds can be achieved with
suffix trees. The unbalanced nature of suffix trees appears to be a major obstacle
to designing efficient disk-based algorithms.
In this paper, we propose a new suffix tree layout scheme, and present algorithms with provably good worst-case bounds on disk accesses required for
search and update operations, while maintaining our layout. Let n denote the
number of leaves in the suffix tree, and B denote the size of a disk block. We
provide algorithms that
– search for a pattern p in O(|p|/B + logB n) disk accesses,
– insert (all suffixes of) a string of length m in O(m logB (n+m)) disk accesses,
and
– delete (all suffixes of) a string of length m in O(m logB n) disk accesses.
Since suffix tree construction can be achieved by starting from an empty tree
and inserting strings one after another, the number of disk accesses needed for
suffix tree construction is O(n logB n). Our results provide the same worst-case
performance as string B-trees, thus showing that suffix trees can be stored on
disk and searched as efficiently as string B-trees.
The rest of the paper is organized as follows: In Section 2 we present our
layout scheme. The scheme partitions the suffix tree such that the number of
partitions encountered on any root to leaf path is bounded by logB n. This is
a crucial feature that helps in overcoming problems caused by the unbalanced
nature of suffix trees. Section 3 contains our algorithm for substring search using
the proposed layout. Algorithms for inserting and deleting a new string are discussed in Section 4. Section 5 contains further discussion and Section 6 concludes
the paper.
Fig. 1. The suffix tree of the string AABBAABBABAB$. The number in an internal
node is the number of leaves in its subtree. A partitioning is shown with C = 3. Two
example partitions of rank zero are circled with dashed lines, and two of rank one
with dotted lines.
2 Suffix Tree Disk Layout
Consider a set S of strings of total length n and a fixed size alphabet Σ. Without
loss of generality, we assume that the last character of each string is a special
character $ ∉ Σ, and the remaining characters are drawn from Σ. Let s ∈ S
be a string of length m. We use s[i] to denote the i-th character of s. Let s[i..j]
denote the substring s[i]s[i + 1] . . . s[j]. The i-th suffix of s, s[i..m], is denoted
by si . The suffix tree of the set of strings S, abbreviated ST , is a compacted trie
of all suffixes of all strings in S. Another commonly used notation is to use the
term generalized suffix tree when dealing with multiple strings and reserve the
term suffix tree when dealing with a single string. For convenience, we use the
term suffix tree to denote either case.
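The data structure itself can be sketched in a few lines. The following is a naive O(n²) in-memory construction of a compacted trie of all suffixes of all strings in S, for illustration only; the names Node, insert_suffix and suffix_tree are ours, and it implements none of the external-memory machinery developed below.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (full edge label, child node)
        self.children = {}

def insert_suffix(root, suffix):
    """Insert one suffix into the compacted trie, splitting an edge if needed."""
    node, i = root, 0
    while i < len(suffix):
        c = suffix[i]
        if c not in node.children:
            node.children[c] = (suffix[i:], Node())      # new leaf edge
            return
        label, child = node.children[c]
        j = 0                                            # match along the edge label
        while j < len(label) and i + j < len(suffix) and label[j] == suffix[i + j]:
            j += 1
        if j == len(label):                              # whole edge matched: descend
            node, i = child, i + j
        else:                                            # split the edge, add a leaf
            mid = Node()
            mid.children[label[j]] = (label[j:], child)
            mid.children[suffix[i + j]] = (suffix[i + j:], Node())
            node.children[c] = (label[:j], mid)
            return

def suffix_tree(strings):
    """Build the (generalized) suffix tree of a set of $-terminated strings."""
    root = Node()
    for s in strings:
        assert s.endswith("$")
        for i in range(len(s)):
            insert_suffix(root, s[i:])
    return root
```

Because every string ends with the unique terminator $, no suffix is a proper prefix of another, so every suffix ends at its own leaf; the string AABBAABBABAB$ of Figure 1, for instance, yields a tree with 13 leaves.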
For a node v in ST , the string depth of v is the total length of all edge labels
on the path from the root to v. The number of leaves in the subtree under v is
referred to as size(v). If v is a leaf node then size(v) = 1. The rank of node
v, denoted rank(v), is i if and only if C^i ≤ size(v) < C^{i+1}, for some integer
constant C of choice. Nodes u and v belong to the same partition if all nodes
on the undirected path between u and v have the same rank. It is easy to see
that the entire suffix tree is partitioned into disjoint parts. Figure 1 shows an
example of a suffix tree, and some of its partitions.
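The size and rank definitions can be computed directly from the tree shape. A small sketch, with a tree representation and names of our choosing rather than the paper's on-disk layout:

```python
class Node:
    def __init__(self, *children):
        self.children = list(children)

def size(node):
    # size(v): number of leaves in the subtree under v; a leaf has size 1.
    if not node.children:
        return 1
    return sum(size(c) for c in node.children)

def rank(node, C):
    # rank(v) = i  iff  C^i <= size(v) < C^(i+1); computed by repeated
    # integer division, i.e. floor(log_C size(v)).
    s, r = size(node), 0
    while s >= C:
        s //= C
        r += 1
    return r
```

With C = 3 as in Figure 1, every leaf has rank 0, a node with 4 leaves below it has rank 1, and the root of the example tree, with 13 leaves, has rank 2; a partition is then a maximal connected set of nodes sharing one rank value.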
Fig. 2. Illustration of partitions and skeleton partition trees. (a) An example
partition with all characters of the edge labels shown. Nodes u, v, w are branching
nodes. (b) The skeleton partition tree of the partition on the left. The number
next to each node denotes the total length of the labels between the nodes in TP.
Only the first character of the first label between two nodes is used in the labels.
The rank of a partition P is the same as the rank of the nodes in P, i.e.
rank(P) = rank(v) for any v in P. Node v in P is a leaf in P if and only if none
of v’s children in ST belong to P. Node u is termed the root of P if and only if
u’s parent is not a node in P. Figure 2(a) shows an example of a partition.
Lemma 1. There are at most C − 1 leaves in a partition.

Proof. Let P be a partition of rank i that has C′ ≥ C leaves, and let node u be its root.
Since size(u) ≥ C^i · C′ ≥ C^i · C = C^{i+1}, rank(u) > rank(P), a contradiction. ⊓⊔
Node v in P is a branching node if two or more of its children are in P. All
other nodes are referred to as non-branching nodes. From Figure 2(a) we see that
a partition P need not be a compacted trie. For each partition P, a compacted
trie is constructed containing the root node of P, all branching nodes and all the
leaves. Furthermore, only the first character of each edge label is stored. This
resulting compacted trie is referred to as the skeleton partition tree of P, or TP .
Figure 2(b) shows the skeleton partition tree of the partition in Figure 2(a).
Lemma 2. For a partition P, the number of nodes in TP is at most 2C − 2.

Proof. By Lemma 1 there are at most C − 1 leaf nodes in a skeleton partition
tree. Therefore, there can be at most C − 2 branching nodes. In addition, the
root node may or may not be a branching node. So the total number of nodes
in a skeleton partition tree is at most 2C − 2 = O(C). ⊓⊔
While Lemma 2 shows that the size of TP is bounded by O(C), it gives no
bound on the size of P. The worst case number of nodes in partition P of rank
i is C^{i+1} − C^i, corresponding to a chain of C^{i+1} − C^i nodes in ST where the
bottom node has a subtree with C^i leaves and all other nodes have an additional
leaf child each. Note that TP in this case has only two nodes, the top and bottom
nodes of the chain. So TP can be viewed as an additional data structure built
on top of P in order to traverse P effectively. A node u in ST appearing in a
partition P is described as uP ; similarly, its appearance in TP is described as
uT. The information stored in uT and uP is different, because uT's function is
to help navigate TP to locate a part of P, while uP is used to navigate from one
partition to another in ST . Any disk block can contain either skeleton partition
trees, part of a partition, or some of the input strings, but not a mixture of them.
Let uT and vT be nodes in TP such that uT is the parent of vT . All the
non-branching nodes between uP and vP (including uP , vP ) form a linked list.
Since uP is a branching node, it will be the head of multiple such linked lists.
Also, uP will be the tail of another linked list. For storage efficiency, we require
each node in P to be stored in exactly one linked list. So uP is not stored as
the head (first node) of the linked list from uP to vP , but rather as the tail (last
node) in the linked list that ends at uP. Note that if uP is the root of a partition, it
is stored by itself. Since the linked list between uP and vP does not contain uP
we refer to this linked list as LL(uP , vP ].
To summarize, our layout scheme first divides the suffix tree into partitions.
All the nodes in a partition are further divided into linked lists. The skeleton
partition tree allows us to find any of the linked lists in a partition efficiently. All
links in our data structure are bidirectional for navigation. We will now describe
the augmenting information needed to efficiently perform the search operation.
The function of a skeleton partition tree TP is to allow the identification of
a LL(uP , vP ] in P, such that one of the nodes in LL(uP , vP ] has a child pointer
to the next partition we need to load for our search. Let vT be a child of uT in
TP .
– Store in vT the first character of the first edge label on the path from uT to
vT in ST, and refer to this character as first_char(vT).
– Store in vT a pointer ptr_LL(vT) to the tail (last node) of LL(uP, vP].
– Store in uT a pointer ptr_LL(uv) to the head (first node) of LL(uP, vP].
– Store in uT the string depth of uT in ST, denoted as string_depth(uT).
– Store in uT the pair (s, rep_suff(uT)), such that the suffix of s starting at
position rep_suff(uT) is a suffix represented by one of the leaves in ST under uT.
For each node uP we store the following information:
– The string depth of uP, also denoted as string_depth(uP).
– For each child w of u in the suffix tree, store in uP a pointer to wT. Note
that this pointer is stored irrespective of whether w is in the same partition as u.
Also store the first character of the edge label from uP to wT in the suffix
tree. We call this character LL_first_char(uw).
3 Substring Search
Given a pattern p and the suffix tree for a set of strings S, the substring matching
problem is to locate a position i and a string s ∈ S such that s[i..i + |p| − 1] = p
where |p| is the length of p, or conclude that it is impossible to find such a
match. In a suffix tree we match p character by character with edge labels of
the suffix tree until we can proceed no longer, or until all p’s characters have
been exhausted in which case a match is found. To search for a pattern p in
a suffix tree with our proposed layout, we will traverse the tree partition by
partition. The search begins with the partition containing the root node of ST .
Let ℓ be a counter that is initialized to zero. The following steps are performed
and repeated for each partition P we encounter.
1. Load TP into the main memory.
2. Start from the root r of TP and travel down TP as follows. Suppose we are at
node uT, and let vT be a child of uT. If first_char(vT) = p[string_depth(uT) +
1] then travel to vT and repeat this process. Stop if no such vT can be found,
or when uT is a leaf node in TP, or p is exhausted.
3. Suppose the previous step stopped at node vT. Compare the substring
s[rep_suff(vT) + ℓ .. rep_suff(vT) + min{string_depth(vT), |p|}] with the
substring p[ℓ .. min{string_depth(vT), |p|}]. Let lcp be the number of characters
matched, and set ℓ = ℓ + lcp.
4. Repeat Steps 2 and 3 until the first node wT such that string_depth(wT) ≥ ℓ
is located. Let uT be the parent of wT in TP, and load LL(uP, wP] by using
the pointer at uT. Starting from the first node u′P of LL(uP, wP], locate the
first node u″P in LL(uP, wP] such that string_depth(u″P) ≥ ℓ. So far the
process is similar to the search of the PAT-tree proposed by Gonnet et al. [8].
5. Suppose we stopped at node uP. There are three cases:
(a) If string_depth(uP) = ℓ and ℓ = |p|, then a match is found at node u
and the search is stopped.
(b) Else if string_depth(uP) = ℓ but ℓ ≠ |p|, i.e., the mismatch occurred
immediately after uP, find LL_first_char(uw) = p[ℓ + 1] and use
the pointer to wT to find the next partition. If no such character is found, we
terminate the search and report no match for p in S.
(c) Otherwise, if string_depth(uP) > ℓ, then we also terminate the search
and report no match for p in S.
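Stripped of the partition machinery, Steps 2 through 5 amount to the classic compacted-trie descent: at each node follow the unique edge whose label starts with the next pattern character, verify the pattern against the edge label, and stop when p is exhausted or a mismatch occurs. The following is a minimal in-memory sketch of that descent, using a hand-built suffix tree of the toy string ABAB$; the nested-dict representation and the name find are ours, for illustration only.

```python
# Hand-built suffix tree of "ABAB$" as nested dicts: edge label -> subtree.
# Suffixes ABAB$ and AB$ share the edge "AB"; BAB$ and B$ share "B".
TREE = {
    "AB": {"AB$": {}, "$": {}},
    "B":  {"AB$": {}, "$": {}},
    "$":  {},
}

def find(tree, p):
    """Return True iff pattern p occurs in the indexed text."""
    if not p:
        return True                      # p exhausted: match found
    for label, subtree in tree.items():
        if label[0] == p[0]:             # at most one edge can start with p[0]
            k = min(len(label), len(p))
            if label[:k] != p[:k]:       # mismatch inside the edge label
                return False
            return find(subtree, p[k:])  # descend (or finish inside the edge)
    return False                         # no edge starts with p[0]
```

The external-memory algorithm performs the same walk, but first descends a skeleton partition tree by first characters only and then verifies a whole stretch of the pattern with a single contiguous comparison against a representative suffix.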
Using the two-level memory structure proposed by Vitter and Shriver [14], we
assume the size of a disk block is B. Since each node requires constant amount
of space, the number of nodes a disk block can hold is Θ(B).
Lemma 3. Given a string s of length n and a pattern p, the number of disk
accesses needed to locate p in the suffix tree of s is O(|p|/B + logB n).

Proof. We choose C such that the skeleton partition tree can fit in one disk
block. Since the number of nodes in a skeleton partition tree is at most 2C − 2 by
Lemma 2 and each node requires constant space, C = Θ(B). For each partition
P its TP is stored in one disk block, so Steps 1 and 2 can be done with one
disk access for each partition. Since the search goes through at most O(logB n)
partitions, the total number of disk accesses for these steps is O(logB n). Over the course
of the entire search process, Step 3 makes at most O(|p|) character comparisons,
and requires O(|p|/B + logB n) disk accesses because both p and the substring
being compared are stored contiguously on disk. By similar reasoning, Step 4
requires O(|p|/B + logB n) disk accesses because the number of nodes loaded for
all linked lists combined is less than |p|. Finally, all the information needed for
Step 5 is stored with the node, and no additional disk access is needed. Therefore,
the total number of disk accesses is O(|p|/B + logB n). ⊓⊔
Note that one occurrence of the pattern in the given set of strings can be
found by identifying the first node uT at or below the position after matching
all characters of p, and retrieving the representative suffix (s, rep_suff(uT)).
The algorithm can be extended to return all occurrences of the pattern p using
O(|p|/B + logB n + occ/B) disk accesses, where occ denotes the number of
occurrences.
4 Updating the Suffix Tree
A dynamic suffix tree must support insertion, deletion, and modification of
strings. Since a modification operation can be viewed as a deletion followed
by an insertion, we concentrate our discussion on the insertion and deletion operations. During insertion and deletion the size of a node may be changed. In
order to facilitate the calculation of size, we choose C = 2. By Lemma 1, if
C = 2 then each partition only has one leaf, and is now a path. The skeleton
partition tree contains only two nodes, the root and the leaf, denoted R and L,
respectively. With C = 2, it can be easily verified that all the leaves of ST are
in a partition of their own.
Note that the asymptotic bound on the number of disk accesses for substring
search in Lemma 3 is obtained using C = Θ(B). This bound can be achieved
even when C = 2 by packing multiple skeleton partition trees into the same disk
block, as outlined in Lemma 5.
4.1 Insertion
In order to insert a string s into an existing suffix tree with O(n) nodes, the suffixes of s are inserted into the suffix tree one by one. Therefore we first introduce
the procedure to insert a suffix of s into the suffix tree. When a suffix is inserted
into the suffix tree a leaf is added, and an internal node may also be added. For
every node u in the suffix tree on the path from the root of the suffix tree to the
newly inserted leaf node v, size(u) is increased by one. Because of this change,
rank(u) may also increase by one, which will change the partition P. However,
the number of nodes that have to be moved to another partition is limited.
Lemma 4. The insertion of a new suffix into the suffix tree may increase the
rank of a node by 1 only if it is an ancestor of the new leaf and is the root node
of its partition. The ranks of all other nodes are unaffected.
Proof. If a node is not an ancestor of the new leaf, its size and hence its rank
does not change. The size of each node that is an ancestor of the newly inserted
leaf will increase by one. Consider a node v that is an ancestor of the new leaf.
Suppose v is not the root of a partition and let r denote the root of the partition
containing v. If rank(v) were to increase, then size(v) = C^i − 1 just before the
insertion. Since r is an ancestor of v, then size(r) > size(v) ⇒ size(r) ≥ C^i, so
r could not have been in the same partition as v, a contradiction. ⊓⊔
While Lemma 4 applies for any choice of C, we choose C = 2 as described in
the beginning of the section. While the size of many nodes will change after an
insertion, it is not necessary to keep track of the correct size of all nodes at all
times. It is enough to only have the correct size for the root R of all partitions.
Since we have chosen C = 2, each partition P is now a path in ST, i.e., P has
no branching nodes. So the linked list between RP and LP can
now contain the node RP as its head, without fear of duplication. The linked
list under the new definition is referred to as LL[RP, LP]. We alter slightly the
information stored with RT and LT to facilitate the insertion operation.
– In RT, size(RT) contains the number of leaves in the suffix tree under RP.
– Since RT has only one child, the pointer to the head of LL[RP, LP] is now
ptr_LL(RT). The pointer ptr_LL(LT) still points to the tail of LL[RP, LP].
– The definitions of string_depth(v), first_char(v) and (s, rep_suff(v)), where
v is either RT or LT, remain unchanged from before.
The insertion algorithm is applied iteratively to each partition it encounters.
Each iteration is divided into two stages. In the first stage we find the appropriate
place to insert the new leaf and add a new internal node if necessary. In the
second stage we update the partition if the root needs to be moved to another
partition. Assume that the size parameters are correct for RT of each partition.
Suppose we are at partition P. Perform Steps 1 to 4 of the search algorithm
described in Section 3, and assume that after Step 4 the search algorithm stops at a
node uP. One of the following three scenarios will apply.
1. An internal node w needs to be inserted between R and its parent v in
partition P′, and the new leaf attached to wP. In this case size(w) = size(RT) + 1
and its rank can be calculated accordingly. Based on its rank, one of the
following cases is true.
(a) The new node w has the same rank as RT, so it is the new root of P.
Put wP as the head of LL[wP, LP], and set the pointer ptr_LL(wP) to wP.
(b) The new node w is in a partition by itself; then a new partition is made
containing only w. Pointers in vP′ and RP are updated accordingly.
(c) The new node wP is in the same partition P′ as vP′. Node w is inserted
after vP′ in LL[RP′, LP′] and the tail pointer stored in LT is updated.
2. Else if string_depth(uP) = ℓ but ℓ ≠ |p|, i.e., the mismatch occurred
immediately after uP, find LL_first_char(uw) = p[ℓ + 1] and use the pointer
to wT to find the next partition. If no match is found then node uP is where
the new leaf should be attached. In either case increment size(RP) by one.
3. Otherwise, if string_depth(uP) > ℓ, then a new internal node wP is inserted
between uP and its parent vP ∈ LL[RP, LP], the new leaf is attached to wP,
and size(RP) is incremented by one.
It is easy to verify that the size parameter is correctly set at the end of this
stage. From Lemma 4 we know that for each partition P encountered during the
insertion, only the root may need to be moved. If so, first remove the first node of
LL[RP, LP] by changing ptr_LL(RP) to point at the next node in LL[RP, LP].
The next node becomes the new root of P, so we update the values we stored for
RP. All these values can be found except for the new size(RP). Let r be the
old root of P; then size(RP) = size(r) − Σ_{i=1}^{k} size(vi), where the vi are the
children of r other than the new root. The sizes of the vi are known because they
are roots of different partitions. The old root r will either become a partition on
its own or become a part of the partition of its parent. In the former case, carry
out the procedure in 1(b); the procedure in 1(c) should be followed in the latter case.
For a partition P, LL[RP , LP ] may not be able to fit in one disk block. When
a disk block becomes full, a new disk block is opened. The second half of the
linked list on the current block is copied to the new block. The nodes of the
second half of the linked list could be scattered across the disk block.
Lemma 5. The number of disk accesses needed for the insertion of a suffix is
O(m/B + logB n), where m is the length of the suffix.

Proof. Assume the number of nodes that can be contained in a block is O(B),
and let B = 2^k. Under the new scheme each skeleton partition tree contains only
two nodes. For a partition P with rank rank(P), we calculate a block rank
block_rank(P) = ⌊rank(P)/k⌋. For partitions P′ and P″, without loss of
generality assume RP″ is a child of v ∈ P′. If block_rank(P′) = block_rank(P″),
we put TP′ and TP″ on the same disk block. So each time a new disk block
containing skeleton partition trees is loaded the block rank decreases by one,
and O(logB n) disk accesses are sufficient for loading all blocks containing the
skeleton partition trees needed by the algorithm.

Let RP′ be a child of RP in ST such that block_rank(P′) = block_rank(P).
We store in RT of P a pointer to RT of partition P′, and the first character
of the edge label. Therefore, we can find a partition P″ in the same disk block
as P, using Step 2 of the substring search algorithm. Then we use rep_suff at P″
to decide how to navigate through all the skeleton partition trees on the same
disk block. Thus the complexity is the same as for search. ⊓⊔
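The packing rule in the proof can be stated in two lines. A sketch, with function names of our choosing and B assumed to be a power of two as in the proof:

```python
def block_rank(rank, B):
    # With B = 2^k nodes per disk block, a partition of rank r
    # gets block rank floor(r / k).
    k = B.bit_length() - 1          # k such that B = 2^k
    return rank // k

def packed_together(rank_parent, rank_child, B):
    # A skeleton partition tree shares a disk block with its parent
    # partition's skeleton tree iff their block ranks agree.
    return block_rank(rank_parent, B) == block_rank(rank_child, B)
```

For example, with B = 8 (k = 3), partitions of ranks 3, 4 and 5 all have block rank 1, so a chain of their skeleton partition trees can share one block, while descending from rank 3 to rank 2 crosses into a new block; since the block rank drops by one at each such crossing, the number of block loads on any root-to-leaf walk is O(logB n).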
To speed up insertion of all suffixes of a string, suffix links are used. For
nodes u, v ∈ ST there is a suffix link from u to v, denoted SL(u) = v, if the
concatenations of the edge labels from the root to u and to v are aβ and β, respectively. We first briefly introduce McCreight's suffix tree construction algorithm
[11], then show how these ideas can be used in our layout scheme.
To insert a string s of length m into a suffix tree of size n, the suffixes of s
are inserted into the suffix tree one by one. Suppose we have just inserted suffix
si as a leaf. If a new internal node w was created to attach the leaf representing
si, go to w's parent and let l be the length of the edge label between w
and its parent; otherwise go to the parent of the leaf and let l = 0. Assume
we are now at node u, and take the suffix link from node u to v. Compare only the
first character of each edge label with the appropriate characters of s and skip
down, repeating this until l characters have been skipped. If we are inside an edge
label, insert a new internal node w′, attach the leaf representing si+1 to w′ and
set SL(w) = w′. Otherwise, suppose we are at a node w′; set SL(w) = w′ and
continue down from w′ by comparing appropriate characters of s with the edge
labels, until a place to insert si+1 is found.
If we maintain a suffix link for each node uP to vP , then we can follow
the above algorithm to insert suffixes one by one in our layout. Skipping the l
characters takes at most O(l/B + logB (n + m)) disk accesses. After the insertion
of each suffix, for each partition P encountered on the path from the root of ST
to the new leaf, the size of RP is incremented by one, and RP is moved to another
partition if necessary. Updating all of these partitions takes at most O(logB (n +
m)) disk accesses. Therefore, the total number of disk accesses required
for inserting all suffixes of a string of length m is O(m logB (n + m)).
4.2 Maintaining suffix links
In our string insertion and deletion algorithms, suffix links need to be maintained
for each node uP. This can be accomplished by maintaining bidirectional suffix
links such that when uP is moved, all nodes vP′ with SL(vP′) = uP can be
identified and updated. We note that the string depth of a node in ST never
changes. So for nodes uP, vP in LL[RP, LP] such that uP is the parent of vP,
we can leave enough space between uP and vP for future insertions. The amount
of space is proportional to the length of the edge label between uP and vP . This
way the suffix links will only change when a node is moved to another partition.
The complexity of searching, insertion and deletion is not affected. Ferragina
and Grossi [7] have proposed three approaches to maintain their succ pointers,
which can also be applied to maintain the suffix links in our approach.
4.3 Deletion
The deletion process is analogous to the insertion process. Similar to Lemma 4,
only the leaf of a partition may need to be moved to another partition for each of
the partitions encountered during the deletion process. In places where size(u)
is incremented by one in the insertion process, size(u) should be decremented
by one. Since some of the strings will be deleted, the (s, rep_suff(uT)) entries
may no longer be valid for a node uT that is an ancestor of a deleted leaf. In this
case, after the deletion of a leaf we traverse upwards in the tree, and replace the
(s, rep_suff(uT)) entries with (s, rep_suff(vT)), where vT is a sibling of the
deleted leaf.
5 Discussion
In our layout scheme, it is possible that many skeleton partition trees may be
small and not occupy a full disk block. Even though a skeleton partition tree
can have as many as 2C − 2 nodes, it can also have as few as just two nodes.
For example, the suffix tree of the string A^n$ has Θ(n) disk blocks
with only one node. Note that this is true even if a large value of C such as
Θ(B) is chosen. Choosing C = Θ(B) ensures an optimal number of disk accesses
for substring search, even if many disk blocks are sparsely occupied. This is an
interesting contrast with the suffix tree layout of Clark and Munro [3], where the
focus is on succinct representation of suffix trees to conserve secondary storage,
but the scheme does not provide an optimal worst case bound for substring search.
Our scheme provides such guarantees despite not trying to minimize the number
of disk blocks for suffix tree storage.
While not needed for ensuring asymptotic performance when C = Θ(B), we
can and should pack as many skeleton partition trees as possible into a single
disk block for efficient storage. This packing is always beneficial and has no
harmful side effects. If C is chosen to be small, such packing is necessary to
obtain optimal disk accesses, as outlined in the proof of Lemma 5.
The main advantage of our approach is that it provides the same performance
guarantees as string B-trees without sacrificing the structure of suffix trees. This
will make it easier to co-exist with the large number of applications that already
use suffix trees as the underlying data structure. One limitation of our approach
is that it is applicable only to constant-sized alphabets, while string B-trees do
not have this limitation. Traversing any root to leaf path in a string B-tree incurs
the same number of disk accesses. This number is potentially different for each
path in our layout, with the worst-case asymptotic performance the same as for
string B-trees. Each disk access in our algorithm increases the number of characters
matched with p, and this may not be true for the string B-tree. Thus, searching for
a pattern p takes O(min{|p|, |p|/B + logB n}) disk accesses, where n is the
number of characters indexed.
6 Conclusions
In this paper, we present a new suffix tree layout scheme for secondary storage
and provide algorithms with provably good worst-case performance for search
and update operations. The performance of our algorithms matches what can be
obtained by the use of string B-trees, a data structure specifically designed to
efficiently support string operations on secondary storage. Suffix trees are
extensively used in biological applications. As our scheme shows how to store and
operate on suffix trees in secondary storage in a manner competitive with the
best available alternatives, the research presented here justifies the use of
suffix trees in secondary storage as well. It remains to be seen how the presented
algorithms compare in practice with other storage schemes developed so far, both
those with and those without provable bounds on disk accesses.
References
1. S.J. Bedathur and J.R. Haritsa. Engineering a fast online persistent suffix tree
construction. In Proc. 20th International Conference on Data Engineering, pages
720–731, 2004.
2. S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological
applications. In Proc. 12th IEEE International Conference on High Performance
Computing, pages 29–39, 2005.
3. D.R. Clark and J.I. Munro. Efficient suffix trees on secondary storage. In Proc.
7th ACM-SIAM Symposium on Discrete Algorithms, pages 383–391, 1996.
4. M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th
Annual Symposium on Foundations of Computer Science, pages 137–143, 1997.
5. M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity
of suffix tree construction. Journal of the ACM, 47(6):987–1011, 2000.
6. P. Ferragina and R. Grossi. Fast string searching in secondary storage: theoretical
developments and experimental results. In Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 373–382, 1996.
7. P. Ferragina and R. Grossi. The string B-tree: A new data structure for string
search in external memory and its applications. Journal of the ACM, 46(2):236–
280, 1999.
8. G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. Information Retrieval: Data Structures & Algorithms, chapter 5:“New indices for text: PAT trees and PAT arrays”,
pages 66–82. 1992.
9. R. Grossi and J.S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proc. 32nd Annual ACM Symposium
on Theory of Computing, pages 397–406, 2000.
10. E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and
protein sequence collections. The VLDB Journal, 11(3):256–271, 2002.
11. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal
of the ACM, 23:262–272, 1976.
12. S. Tata, R.A. Hankins, and J.M. Patel. Practical suffix tree construction. In Proc.
13th International Conference on Very Large Data Bases, pages 36–47, 2004.
13. E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14:249–260, 1995.
14. J.S. Vitter and E.A.M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2/3):110–147, 1994.
15. P. Weiner. Linear pattern matching algorithms. In Proc. 14th Symposium on
Switching and Automata Theory, pages 1–11, 1973.