Obtaining Provably Good Performance from
Suffix Trees in Secondary Storage*

Pang Ko and Srinivas Aluru

Department of Electrical and Computer Engineering
Laurence H. Baker Center for Bioinformatics and Biological Statistics
Iowa State University
{kopang,aluru}@iastate.edu
Abstract. Designing external memory data structures for string databases is of significant recent interest due to the proliferation of biological
sequence data. The suffix tree is an important indexing structure that
provides optimal algorithms for memory-bound data. However, string B-trees provide the best known asymptotic performance in external memory for substring search and update operations. Work on external memory variants of suffix trees has largely focused on constructing suffix trees
in external memory or layout schemes for suffix trees that preserve link
locality. In this paper, we present a new suffix tree layout scheme for
secondary storage and present construction, substring search, insertion
and deletion algorithms that are competitive with the string B-tree. For
a set of strings of total length n, a pattern p and disk blocks of size B,
we provide a substring search algorithm that uses O(|p|/B + logB n) disk
accesses. We present algorithms for insertion and deletion of all suffixes
of a string of length m that take O(m logB (n + m)) and O(m logB n)
disk accesses, respectively. Our results demonstrate that suffix trees can
be directly used as efficient secondary storage data structures for string
and sequence data.
1 Introduction
The suffix tree data structure is widely used in text processing, information
retrieval, and computational biology. It is especially useful when there are no
word or sentence structures, such as in biological sequences, for which the suffix
tree is uniquely suited for indexing and querying. With the continued explosion
in the size of biological sequence databases, there is growing interest in string
indexing schemes in general, and disk-based suffix trees in particular.
The suffix tree of a set of strings is a compacted trie of all suffixes of all the
strings. Since the introduction of this data structure by Weiner [15], several linear
time algorithms for in-memory construction of suffix trees have been designed:
notable ones include McCreight’s linear space algorithm [11], Ukkonen’s on-line
algorithm [13], and Farach’s algorithm for integer alphabets [4]. To extend the
scale of data that can be handled in-memory, Grossi and Vitter [9] developed
* Research supported by the National Science Foundation under IIS-0430853.
compressed suffix trees and suffix arrays. In the last few years, there has been
significant research on disk-based storage of suffix trees for exploiting their utility on ever growing sequence databases. Many algorithms and strategies have
been proposed to reduce the number of disk accesses during the construction of
suffix trees [1, 2, 5, 10, 12]. Of these, only Farach et al. provided a construction
algorithm for secondary storage that achieves the optimal worst case bound of
Θ((n/B) logM/B (n/B)) disk accesses (where M is the size of main memory).
While these algorithms and techniques focused on suffix tree construction in
secondary storage, the problems of searching and updating suffix trees (insertion/deletion of all suffixes of a string) on disks have not received as much attention. An interesting solution is provided by Clark and Munro [3] that achieves
efficient space utilization. This is used to obtain a bound on disk accesses for substring search as a function of the height of the tree. In the worst case, the height
of a suffix tree can be proportional to the length of the text indexed, although it
is rarely the case and Clark and Munro’s approach provides good experimental
performance. To date, algorithms with provably good worst-case performance
for substring searches and updates for suffix trees in secondary storage are not
known. To overcome these theoretical limitations, Ferragina and Grossi have
proposed the string B-tree data structure [6, 7]. String B-trees provide the best
known bounds for the worst-case number of disk accesses required for queries and
updates. It is not known if the same performance bounds can be achieved with
suffix trees. The unbalanced nature of suffix trees appears to be a major obstacle
to designing efficient disk-based algorithms.
In this paper, we propose a new suffix tree layout scheme, and present algorithms with provably good worst-case bounds on disk accesses required for
search and update operations, while maintaining our layout. Let n denote the
number of leaves in the suffix tree, and B denote the size of a disk block. We
provide algorithms that
– search for a pattern p in O(|p|/B + logB n) disk accesses,
– insert (all suffixes of) a string of length m in O(m logB (n+m)) disk accesses,
and
– delete (all suffixes of) a string of length m in O(m logB n) disk accesses.
Since suffix tree construction can be achieved by starting from an empty tree
and inserting strings one after another, the number of disk accesses needed for
suffix tree construction is O(n logB n). Our results provide the same worst-case
performance as string B-trees, thus showing that suffix trees can be stored on
disk and searched as efficiently as string B-trees.
The rest of the paper is organized as follows: In Section 2 we present our
layout scheme. The scheme partitions the suffix tree such that the number of
partitions encountered on any root to leaf path is bounded by logB n. This is
a crucial feature that helps in overcoming problems caused by the unbalanced
nature of suffix trees. Section 3 contains our algorithm for substring search using
the proposed layout. Algorithms for inserting and deleting a new string are discussed in Section 4. Section 5 contains further discussion and Section 6 concludes
the paper.
Fig. 1. The suffix tree of the string AABBAABBABAB$. The number in an internal
node is the number of leaves in its subtree. A partitioning is shown with C = 3. Two
example partitions of rank zero are circled with dashed lines, and two of rank one
with dotted lines.
2 Suffix Tree Disk Layout
Consider a set S of strings of total length n and a fixed size alphabet Σ. Without
loss of generality, we assume that the last character of each string is a special
character $ ∉ Σ, and the remaining characters are drawn from Σ. Let s ∈ S
be a string of length m. We use s[i] to denote the i-th character of s. Let s[i..j]
denote the substring s[i]s[i + 1] . . . s[j]. The i-th suffix of s, s[i..m], is denoted
by si . The suffix tree of the set of strings S, abbreviated ST , is a compacted trie
of all suffixes of all strings in S. Another commonly used notation is to use the
term generalized suffix tree when dealing with multiple strings and reserve the
term suffix tree when dealing with a single string. For convenience, we use the
term suffix tree to denote either case.
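The data structure itself can be sketched in a few lines. The following is a naive O(n²) in-memory construction of a compacted trie of all suffixes of all strings in S, for illustration only; the names Node, insert_suffix and suffix_tree are ours, and it implements none of the external-memory machinery developed below.

```python
class Node:
    def __init__(self):
        # first character of edge label -> (full edge label, child node)
        self.children = {}

def insert_suffix(root, suffix):
    """Insert one suffix into the compacted trie, splitting an edge if needed."""
    node, i = root, 0
    while i < len(suffix):
        c = suffix[i]
        if c not in node.children:
            node.children[c] = (suffix[i:], Node())      # new leaf edge
            return
        label, child = node.children[c]
        j = 0                                            # match along the edge label
        while j < len(label) and i + j < len(suffix) and label[j] == suffix[i + j]:
            j += 1
        if j == len(label):                              # whole edge matched: descend
            node, i = child, i + j
        else:                                            # split the edge, add a leaf
            mid = Node()
            mid.children[label[j]] = (label[j:], child)
            mid.children[suffix[i + j]] = (suffix[i + j:], Node())
            node.children[c] = (label[:j], mid)
            return

def suffix_tree(strings):
    """Build the (generalized) suffix tree of a set of $-terminated strings."""
    root = Node()
    for s in strings:
        assert s.endswith("$")
        for i in range(len(s)):
            insert_suffix(root, s[i:])
    return root
```

Because every string ends with the unique terminator $, no suffix is a proper prefix of another, so every suffix ends at its own leaf; the string AABBAABBABAB$ of Figure 1, for instance, yields a tree with 13 leaves.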
For a node v in ST , the string depth of v is the total length of all edge labels
on the path from the root to v. The number of leaves in the subtree under v is
referred to as size(v). If v is a leaf node then size(v) = 1. The rank of node
v, denoted rank(v), is i if and only if C^i ≤ size(v) < C^{i+1}, for some integer
constant C of choice. Nodes u and v belong to the same partition if all nodes
on the undirected path between u and v have the same rank. It is easy to see
that the entire suffix tree is partitioned into disjoint parts. Figure 1 shows an
example of a suffix tree, and some of its partitions.
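The size and rank definitions can be computed directly from the tree shape. A small sketch, with a tree representation and names of our choosing rather than the paper's on-disk layout:

```python
class Node:
    def __init__(self, *children):
        self.children = list(children)

def size(node):
    # size(v): number of leaves in the subtree under v; a leaf has size 1.
    if not node.children:
        return 1
    return sum(size(c) for c in node.children)

def rank(node, C):
    # rank(v) = i  iff  C^i <= size(v) < C^(i+1); computed by repeated
    # integer division, i.e. floor(log_C size(v)).
    s, r = size(node), 0
    while s >= C:
        s //= C
        r += 1
    return r
```

With C = 3 as in Figure 1, every leaf has rank 0, a node with 4 leaves below it has rank 1, and the root of the example tree, with 13 leaves, has rank 2; a partition is then a maximal connected set of nodes sharing one rank value.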
Fig. 2. Illustration of partitions and skeleton partition trees. (a) An example
partition with all characters of the edge labels shown. Nodes u, v, w are branching
nodes. (b) The skeleton partition tree of the partition on the left. The number
next to each node denotes the total length of the labels between the nodes in TP.
Only the first character of the first label between two nodes is used in the labels.
The rank of a partition P is the same as the rank of the nodes in P, i.e.
rank(P) = rank(v) for any v in P. Node v in P is a leaf in P if and only if none
of v’s children in ST belong to P. Node u is termed the root of P if and only if
u’s parent is not a node in P. Figure 2(a) shows an example of a partition.
Lemma 1. There are at most C − 1 leaves in a partition.

Proof. Let P be a partition of rank i that has C′ ≥ C leaves, and let node u be its root.
Since size(u) ≥ C^i · C′ ≥ C^i · C = C^{i+1}, rank(u) > rank(P), a contradiction. ⊓⊔
Node v in P is a branching node if two or more of its children are in P. All
other nodes are referred to as non-branching nodes. From Figure 2(a) we see that
a partition P need not be a compacted trie. For each partition P, a compacted
trie is constructed containing the root node of P, all branching nodes and all the
leaves. Furthermore, only the first character of each edge label is stored. This
resulting compacted trie is referred to as the skeleton partition tree of P, or TP .
Figure 2(b) shows the skeleton partition tree of the partition in Figure 2(a).
Lemma 2. For a partition P, the number of nodes in TP is at most 2C − 2.

Proof. By Lemma 1 there are at most C − 1 leaf nodes in a skeleton partition
tree. Therefore, there can be at most C − 2 branching nodes. In addition, the
root node may or may not be a branching node. So the total number of nodes
in a skeleton partition tree is at most 2C − 2 = O(C). ⊓⊔
While Lemma 2 shows that the size of TP is bounded by O(C), it gives no
bound on the size of P. The worst case number of nodes in partition P of rank
i is C^{i+1} − C^i, corresponding to a chain of C^{i+1} − C^i nodes in ST where the
bottom node has a subtree with C^i leaves and all other nodes have an additional
leaf child each. Note that TP in this case has only two nodes, the top and bottom
nodes of the chain. So TP can be viewed as an additional data structure built
on top of P in order to traverse P effectively. A node u in ST appearing in a
partition P is described as uP ; similarly, its appearance in TP is described as
uT. The information stored in uT and uP is different, because uT's function is
to help navigate TP to locate a part of P, while uP is used to navigate from one
partition to another in ST . Any disk block can contain either skeleton partition
trees, part of a partition, or some of the input strings, but not a mixture of them.
Let uT and vT be nodes in TP such that uT is the parent of vT . All the
non-branching nodes between uP and vP (including uP , vP ) form a linked list.
Since uP is a branching node, it will be the head of multiple such linked lists.
Also, uP will be the tail of another linked list. For storage efficiency, we require
each node in P to be stored in exactly one linked list. So uP is not stored as
the head (first node) of the linked list from uP to vP , but rather as the tail (last
node) in the linked list that ends at uP. Note that if uP is the root of a partition, it
is stored by itself. Since the linked list between uP and vP does not contain uP
we refer to this linked list as LL(uP , vP ].
To summarize, our layout scheme first divides the suffix tree into partitions.
All the nodes in a partition are further divided into linked lists. The skeleton
partition tree allows us to find any of the linked lists in a partition efficiently. All
links in our data structure are bidirectional for navigation. We will now describe
the augmenting information needed to efficiently perform the search operation.
The function of a skeleton partition tree TP is to allow the identification of
a LL(uP , vP ] in P, such that one of the nodes in LL(uP , vP ] has a child pointer
to the next partition we need to load for our search. Let vT be a child of uT in
TP .
– Store in vT the first character of the first edge label on the path from uT to
vT in ST, and refer to this character as first_char(vT).
– Store in vT a pointer ptr_LL(vT) to the tail (last node) of LL(uP, vP].
– Store in uT a pointer ptr_LL(uv) to the head (first node) of LL(uP, vP].
– Store in uT the string depth of uT in ST, denoted as string_depth(uT).
– Store in uT the pair (s, rep_suff(uT)), such that the suffix of s starting at
position rep_suff(uT) is a suffix represented by one of the leaves in ST under uT.
For each node uP we store the following information:
– The string depth of uP, also denoted as string_depth(uP).
– For each child w of u in the suffix tree, store in uP a pointer to wT. Note
that this pointer is stored irrespective of whether w is in the same partition as u.
Also store the first character of the edge label from uP to wT in the suffix
tree. We call this character LL_first_char(uw).
3 Substring Search
Given a pattern p and the suffix tree for a set of strings S, the substring matching
problem is to locate a position i and a string s ∈ S such that s[i..i + |p| − 1] = p
where |p| is the length of p, or conclude that it is impossible to find such a
match. In a suffix tree we match p character by character with edge labels of
the suffix tree until we can proceed no longer, or until all p’s characters have
been exhausted in which case a match is found. To search for a pattern p in
a suffix tree with our proposed layout, we will traverse the tree partition by
partition. The search begins with the partition containing the root node of ST .
Let ℓ be a counter that is initialized to zero. The following steps are performed
and repeated for each partition P we encounter.
1. Load TP into the main memory.
2. Start from the root r of TP and travel down TP as follows. Suppose we are at
node uT, and let vT be a child of uT. If first_char(vT) = p[string_depth(uT) +
1] then travel to vT and repeat this process. Stop if no such vT can be found,
or when uT is a leaf node in TP, or p is exhausted.
3. Suppose the previous step stopped at node vT. Compare the substring
s[rep_suff(vT) + ℓ .. rep_suff(vT) + min{string_depth(vT), |p|}] with the
substring p[ℓ .. min{string_depth(vT), |p|}]. Let lcp be the number of characters
matched, and set ℓ = ℓ + lcp.
4. Repeat Steps 2 and 3 until the first node wT such that string_depth(wT) ≥ ℓ
is located. Let uT be the parent of wT in TP, and load LL(uP, wP] by using
the pointer at uT. Starting from the first node u′P of LL(uP, wP], locate the
first node u″P in LL(uP, wP] such that string_depth(u″P) ≥ ℓ. So far the
process is similar to the search of the PAT-tree proposed by Gonnet et al. [8].
5. Suppose we stopped at node uP. There are three cases:
(a) If string_depth(uP) = ℓ and ℓ = |p|, then a match is found at node u
and the search is stopped.
(b) Else if string_depth(uP) = ℓ but ℓ ≠ |p|, i.e., the mismatch occurred
immediately after uP, find LL_first_char(uw) = p[ℓ + 1] and use
the pointer to wT to find the next partition. If no such character is found, we
terminate the search and report no match for p in S.
(c) Otherwise, if string_depth(uP) > ℓ, then we also terminate the search
and report no match for p in S.
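Stripped of the partition machinery, Steps 2 through 5 amount to the classic compacted-trie descent: at each node follow the unique edge whose label starts with the next pattern character, verify the pattern against the edge label, and stop when p is exhausted or a mismatch occurs. The following is a minimal in-memory sketch of that descent, using a hand-built suffix tree of the toy string ABAB$; the nested-dict representation and the name find are ours, for illustration only.

```python
# Hand-built suffix tree of "ABAB$" as nested dicts: edge label -> subtree.
# Suffixes ABAB$ and AB$ share the edge "AB"; BAB$ and B$ share "B".
TREE = {
    "AB": {"AB$": {}, "$": {}},
    "B":  {"AB$": {}, "$": {}},
    "$":  {},
}

def find(tree, p):
    """Return True iff pattern p occurs in the indexed text."""
    if not p:
        return True                      # p exhausted: match found
    for label, subtree in tree.items():
        if label[0] == p[0]:             # at most one edge can start with p[0]
            k = min(len(label), len(p))
            if label[:k] != p[:k]:       # mismatch inside the edge label
                return False
            return find(subtree, p[k:])  # descend (or finish inside the edge)
    return False                         # no edge starts with p[0]
```

The external-memory algorithm performs the same walk, but first descends a skeleton partition tree by first characters only and then verifies a whole stretch of the pattern with a single contiguous comparison against a representative suffix.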
Using the two-level memory structure proposed by Vitter and Shriver [14], we
assume the size of a disk block is B. Since each node requires constant amount
of space, the number of nodes a disk block can hold is Θ(B).
Lemma 3. Given a string s of length n and a pattern p, the number of disk
accesses needed to locate p in the suffix tree of s is O(|p|/B + logB n).

Proof. We choose C such that the skeleton partition tree can fit in one disk
block. Since the number of nodes in a skeleton partition tree is at most 2C − 2 by
Lemma 2 and each node requires constant space, C = Θ(B). For each partition
P its TP is stored in one disk block, so Steps 1 and 2 can be done with one
disk access for each partition. Since the search goes through at most O(logB n)
partitions, the total number of disk accesses for these steps is O(logB n). Over the course
of the entire search process, Step 3 makes at most O(|p|) character comparisons,
and requires O(|p|/B + logB n) disk accesses because both p and the substring
being compared are stored contiguously on disk. By similar reasoning, Step 4
requires O(|p|/B + logB n) disk accesses because the number of nodes loaded for
all linked lists combined is less than |p|. Finally, all the information needed for
Step 5 is stored with the node, and no additional disk access is needed. Therefore,
the total number of disk accesses is O(|p|/B + logB n). ⊓⊔
Note that one occurrence of the pattern in the given set of strings can be
found by identifying the first node uT at or below the position after matching
all characters of p, and retrieving the representative suffix (s, rep_suff(uT)).
The algorithm can be extended to return all occurrences of the pattern p using
O(|p|/B + logB n + occ/B) disk accesses, where occ denotes the number of
occurrences.
4 Updating the Suffix Tree
A dynamic suffix tree must support insertion, deletion, and modification of
strings. Since a modification operation can be viewed as a deletion followed
by an insertion, we concentrate our discussion on the insertion and deletion operations. During insertion and deletion the size of a node may be changed. In
order to facilitate the calculation of size, we choose C = 2. By Lemma 1, if
C = 2 then each partition only has one leaf, and is now a path. The skeleton
partition tree contains only two nodes, the root and the leaf, denoted R and L,
respectively. With C = 2, it can be easily verified that all the leaves of ST are
in a partition of their own.
Note that the asymptotic bound on the number of disk accesses for substring
search in Lemma 3 is obtained using C = Θ(B). This bound can be achieved
even when C = 2 by packing multiple skeleton partition trees into the same disk
block, as outlined in Lemma 5.
4.1 Insertion
In order to insert a string s into an existing suffix tree with O(n) nodes, the suffixes of s are inserted into the suffix tree one by one. Therefore we first introduce
the procedure to insert a suffix of s into the suffix tree. When a suffix is inserted
into the suffix tree a leaf is added, and an internal node may also be added. For
every node u in the suffix tree on the path from the root of the suffix tree to the
newly inserted leaf node v, size(u) is increased by one. Because of this change,
rank(u) may also increase by one, which will change the partition P. However,
the number of nodes that have to be moved to another partition is limited.
Lemma 4. The insertion of a new suffix into the suffix tree may increase the
rank of a node by 1 only if it is an ancestor of the new leaf and is the root node
of its partition. The ranks of all other nodes are unaffected.
Proof. If a node is not an ancestor of the new leaf, its size and hence its rank
does not change. The size of each node that is an ancestor of the newly inserted
leaf will increase by one. Consider a node v that is an ancestor of the new leaf.
Suppose v is not the root of a partition and let r denote the root of the partition
containing v. If rank(v) were to increase, then size(v) = C^i − 1 just before the
insertion. Since r is an ancestor of v, then size(r) > size(v) ⇒ size(r) ≥ C^i, so
r could not have been in the same partition as v, a contradiction. ⊓⊔
While Lemma 4 applies for any choice of C, we choose C = 2 as described in
the beginning of the section. While the size of many nodes will change after an
insertion, it is not necessary to keep track of the correct size of all nodes at all
times. It is enough to only have the correct size for the root R of all partitions.
Since we have chosen C = 2, each partition P is now a path in ST, i.e., P has
no branching nodes. So the linked list between RP and LP can
now contain the node RP as its head, without fear of duplication. The linked
list under the new definition is referred to as LL[RP, LP]. We alter slightly the
information stored with RT and LT to facilitate the insertion operation.
– In RT, size(RT) contains the number of leaves in the suffix tree under RP.
– Since RT has only one child, the pointer to the head of LL[RP, LP] is now
ptr_LL(RT). The pointer ptr_LL(LT) still points to the tail of LL[RP, LP].
– The definitions of string_depth(v), first_char(v) and (s, rep_suff(v)), where
v is either RT or LT, remain unchanged from before.
The insertion algorithm is applied iteratively to each partition it encounters.
Each iteration is divided into two stages. In the first stage we find the appropriate
place to insert the new leaf and add a new internal node if necessary. In the
second stage we update the partition if the root needs to be moved to another
partition. Assume that the size parameters are correct for RT of each partition.
Suppose we are at partition P. Perform Steps 1 to 4 of the search algorithm
described in Section 3, and assume that after Step 4 the search algorithm stops at a
node uP. One of the following three scenarios will apply.
1. An internal node w needs to be inserted between R and its parent v in
partition P′, and the new leaf attached to wP. In this case size(w) = size(RT) + 1
and its rank can be calculated accordingly. Based on its rank, one of the
following cases is true.
(a) The new node w has the same rank as RT, so it is the new root of P.
Put wP as the head of LL[wP, LP], and set the pointer ptr_LL(wP) to wP.
(b) The new node w is in a partition by itself; then a new partition is made
containing only w. Pointers in vP′ and RP are updated accordingly.
(c) The new node wP is in the same partition P′ as vP′. Node w is inserted
after vP′ in LL[RP′, LP′] and the tail pointer stored in LT is updated.
2. Else if string_depth(uP) = ℓ but ℓ ≠ |p|, i.e., the mismatch occurred
immediately after uP, find LL_first_char(uw) = p[ℓ + 1] and use the pointer
to wT to find the next partition. If no match is found then node uP is where
the new leaf should be attached. In either case increment size(RP) by one.
3. Otherwise, if string_depth(uP) > ℓ, then a new internal node wP is inserted
between uP and its parent vP ∈ LL[RP, LP], the new leaf is attached to wP,
and size(RP) is incremented by one.
It is easy to verify that the size parameter is correctly set at the end of this
stage. From Lemma 4 we know that for each partition P encountered during the
insertion, only the root may need to be moved. If so, first remove the first node of
LL[RP, LP] by changing ptr_LL(RP) to point at the next node in LL[RP, LP].
The next node becomes the new root of P, so we update the values we stored for
RP. All these values can be found except for the new size(RP). Let r be the
old root of P; then size(RP) = size(r) − Σ_{i=1}^{k} size(vi), where the vi are the
children of r other than the new root. The sizes of the vi are known because they
are roots of different partitions. The old root r will either become a partition on
its own or become a part of the partition of its parent. In the former case, carry
out the procedure in 1(b); the procedure in 1(c) should be followed in the latter case.
For a partition P, LL[RP , LP ] may not be able to fit in one disk block. When
a disk block becomes full, a new disk block is opened. The second half of the
linked list on the current block is copied to the new block. The nodes of the
second half of the linked list could be scattered across the disk block.
Lemma 5. The number of disk accesses needed for the insertion of a suffix is
O(m/B + logB n), where m is the length of the suffix.

Proof. Assume the number of nodes that can be contained in a block is O(B),
and let B = 2^k. Under the new scheme each skeleton partition tree contains only
two nodes. For a partition P with rank rank(P), we calculate a block rank
block_rank(P) = ⌊rank(P)/k⌋. For partitions P′ and P″, without loss of
generality assume RP″ is a child of v ∈ P′. If block_rank(P′) = block_rank(P″),
we put TP′ and TP″ on the same disk block. So each time a new disk block
containing skeleton partition trees is loaded the block rank decreases by one,
and O(logB n) disk accesses are sufficient for loading all blocks containing the
skeleton partition trees needed by the algorithm.

Let RP′ be a child of RP in ST such that block_rank(P′) = block_rank(P).
We store in RT of P a pointer to RT of partition P′, and the first character
of the edge label. Therefore, we can find a partition P″ in the same disk block
as P, using Step 2 of the substring search algorithm. Then we use rep_suff at P″
to decide how to navigate through all the skeleton partition trees on the same
disk block. Thus the complexity is the same as for search. ⊓⊔
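The packing rule in the proof can be stated in two lines. A sketch, with function names of our choosing and B assumed to be a power of two as in the proof:

```python
def block_rank(rank, B):
    # With B = 2^k nodes per disk block, a partition of rank r
    # gets block rank floor(r / k).
    k = B.bit_length() - 1          # k such that B = 2^k
    return rank // k

def packed_together(rank_parent, rank_child, B):
    # A skeleton partition tree shares a disk block with its parent
    # partition's skeleton tree iff their block ranks agree.
    return block_rank(rank_parent, B) == block_rank(rank_child, B)
```

For example, with B = 8 (k = 3), partitions of ranks 3, 4 and 5 all have block rank 1, so a chain of their skeleton partition trees can share one block, while descending from rank 3 to rank 2 crosses into a new block; since the block rank drops by one at each such crossing, the number of block loads on any root-to-leaf walk is O(logB n).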
To speed up insertion of all suffixes of a string, suffix links are used. For
nodes u, v ∈ ST there is a suffix link from u to v, denoted SL(u) = v, if the
concatenations of the edge labels from the root to u and to v are aβ and β, respectively. We first briefly introduce McCreight's suffix tree construction algorithm
[11], then show how these ideas can be used in our layout scheme.
To insert a string s of length m into a suffix tree of size n, the suffixes of s
are inserted into the suffix tree one by one. Suppose we have just inserted suffix
si as a leaf. If a new internal node w was created to attach the leaf representing
si, go to w's parent and let l be the length of the edge label between w
and its parent; otherwise go to the parent of the leaf and let l = 0. Assume
we are now at node u, and take the suffix link from node u to v. Compare only the
first character of each edge label with the appropriate characters of s and skip
down, repeating this until l characters have been skipped. If we are inside an edge
label, insert a new internal node w′, attach the leaf representing si+1 to w′ and
set SL(w) = w′. Otherwise, suppose we are at a node w′; set SL(w) = w′ and
continue down from w′ by comparing appropriate characters of s with the edge
labels, until a place to insert si+1 is found.
If we maintain a suffix link for each node uP to vP , then we can follow
the above algorithm to insert suffixes one by one in our layout. Skipping the l
characters takes at most O(l/B + logB (n + m)) disk accesses. After the insertion
of each suffix, for each partition P encountered on the path from the root of ST
to the new leaf, the size of RP is incremented by one, and RP is moved to another
partition if necessary. Updating all of these partitions takes at most O(logB (n +
m)) disk accesses. Therefore, the total number of disk accesses required
for inserting all suffixes of a string of length m is O(m logB (n + m)).
4.2 Maintaining suffix links
In our string insertion and deletion algorithms, suffix links need to be maintained
for each node uP. This can be accomplished by maintaining bidirectional suffix
links such that when uP is moved, all nodes vP′ with SL(vP′) = uP can be
identified and updated. We note that the string depth of a node in ST never
changes. So for nodes uP, vP in LL[RP, LP] such that uP is the parent of vP,
we can leave enough space between uP and vP for future insertions. The amount
of space is proportional to the length of the edge label between uP and vP . This
way the suffix links will only change when a node is moved to another partition.
The complexity of searching, insertion and deletion is not affected. Ferragina
and Grossi [7] have proposed three approaches to maintain their succ pointers,
which can also be applied to maintain the suffix links in our approach.
4.3 Deletion
The deletion process is analogous to the insertion process. Similar to Lemma 4,
only the leaf of a partition may need to be moved to another partition for each of
the partitions encountered during the deletion process. In places where size(u)
is incremented by one in the insertion process, size(u) should be decremented
by one. Since some of the strings will be deleted, the (s, rep_suff(uT)) entries
may no longer be valid for a node uT that is an ancestor of a deleted leaf. In this
case, after the deletion of a leaf we traverse upwards in the tree, and replace the
(s, rep_suff(uT)) entries with (s, rep_suff(vT)), where vT is a sibling of the
deleted leaf.
5 Discussion
In our layout scheme, it is possible that many skeleton partition trees may be
small and not occupy a full disk block. Even though a skeleton partition tree
can have as many as 2C − 2 nodes, it can also have as few as just two nodes.
For example, the suffix tree of the string A^n$ has Θ(n) disk blocks
with only one node. Note that this is true even if a large value of C such as
Θ(B) is chosen. Choosing C = Θ(B) ensures an optimal number of disk accesses
for substring search, even if many disk blocks are sparsely occupied. This is an
interesting contrast with the suffix tree layout of Clark and Munro [3], where the
focus is on succinct representation of suffix trees to conserve secondary storage,
but the scheme does not provide an optimal worst case bound for substring search.
Our scheme provides such guarantees despite not trying to minimize the number
of disk blocks for suffix tree storage.
While not needed for ensuring asymptotic performance when C = Θ(B), we
can and should pack as many skeleton partition trees as possible into a single
disk block for efficient storage. This packing is always beneficial and has no
harmful side effects. If C is chosen to be small, such packing is necessary to
obtain optimal disk accesses, as outlined in the proof of Lemma 5.
The main advantage of our approach is that it provides the same performance
guarantees as string B-trees without sacrificing the structure of suffix trees. This
will make it easier to co-exist with the large number of applications that already
use suffix trees as the underlying data structure. One limitation of our approach
is that it is applicable only to constant-sized alphabets, while string B-trees do
not have this limitation. Traversing any root to leaf path in a string B-tree incurs
the same number of disk accesses. This number is potentially different for each
path in our layout, with the worst-case asymptotic performance the same as for
string B-trees. Each disk access in our algorithm increases the number of characters
matched with p, and this may not be true for the string B-tree. Thus, searching for
a pattern p takes O(min{|p|, |p|/B + logB n}) disk accesses, where n is the
number of characters indexed.
6 Conclusions
In this paper, we present a new suffix tree layout scheme for secondary storage
and provide algorithms with provably good worst-case performance for search
and update operations. The performance of our algorithms matches what can be
obtained by the use of string B-trees, a data structure specifically designed to
efficiently support string operations on secondary storage. Suffix trees are
extensively used in biological applications. As our scheme shows how to store and
operate on suffix trees in secondary storage in a manner competitive with the
best available alternatives, the research presented here justifies the use of
suffix trees in secondary storage as well. It remains to be seen how the presented
algorithms compare in practice with other storage schemes developed so far, both
those with and those without provable bounds on disk accesses.
References
1. S.J. Bedathur and J.R. Haritsa. Engineering a fast online persistent suffix tree
construction. In Proc. 20th International Conference on Data Engineering, pages
720–731, 2004.
2. S.J. Bedathur and J.R. Haritsa. Search-optimized suffix-tree storage for biological
applications. In Proc. 12th IEEE International Conference on High Performance
Computing, pages 29–39, 2005.
3. D.R. Clark and J.I. Munro. Efficient suffix trees on secondary storage. In Proc.
7th ACM-SIAM Symposium on Discrete Algorithms, pages 383–391, 1996.
4. M. Farach. Optimal suffix tree construction with large alphabets. In Proc. 38th
Annual Symposium on Foundations of Computer Science, pages 137–143, 1997.
5. M. Farach-Colton, P. Ferragina, and S. Muthukrishnan. On the sorting-complexity
of suffix tree construction. Journal of the ACM, 47(6):987–1011, 2000.
6. P. Ferragina and R. Grossi. Fast string searching in secondary storage: theoretical
developments and experimental results. In Proc. 7th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 373–382, 1996.
7. P. Ferragina and R. Grossi. The string B-tree: A new data structure for string
search in external memory and its applications. Journal of the ACM, 46(2):236–
280, 1999.
8. G.H. Gonnet, R.A. Baeza-Yates, and T. Snider. Information Retrieval: Data Structures & Algorithms, chapter 5:“New indices for text: PAT trees and PAT arrays”,
pages 66–82. 1992.
9. R. Grossi and J.S. Vitter. Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In Proc. 32nd Annual ACM Symposium
on Theory of Computing, pages 397–406, 2000.
10. E. Hunt, M.P. Atkinson, and R.W. Irving. Database indexing for large DNA and
protein sequence collections. The VLDB Journal, 11(3):256–271, 2002.
11. E. M. McCreight. A space-economical suffix tree construction algorithm. Journal
of the ACM, 23:262–272, 1976.
12. S. Tata, R.A. Hankins, and J.M. Patel. Practical suffix tree construction. In Proc.
13th International Conference on Very Large Data Bases, pages 36–47, 2004.
13. E. Ukkonen. On-line construction of suffix-trees. Algorithmica, 14:249–260, 1995.
14. J.S. Vitter and E.A.M. Shriver. Algorithms for parallel memory I: Two-level memories. Algorithmica, 12(2/3):110–147, 1994.
15. P. Weiner. Linear pattern matching algorithms. In Proc. 14th Symposium on
Switching and Automata Theory, pages 1–11, 1973.