Implementing a Simulated Directed Acyclic Word Graph for Computing Local Alignment

Jakob Schultz-Nielsen, 20061951
Master's Thesis, Computer Science, February 2014
Advisor: Christian Nørgaard Storm Pedersen
Aarhus University, Department of Computer Science

Abstract

Computing local alignments is used to identify similar regions within highly dissimilar sequences. It has a wide variety of uses, especially within Bioinformatics for analysing DNA, RNA or protein sequences. Many different solutions have therefore been presented in the past with the intent of reducing the time needed to identify local alignments, as naïve solutions become insufficient as sequence lengths increase. This thesis presents a simulated directed acyclic word graph data structure which will be used to compute local alignments by means of dynamic programming and an effective pruning strategy. This will be the basis for a number of experiments, and it will be shown that this solution outperforms a naïve implementation and that its worst-case time complexity equals that of the naïve solution. Furthermore, experiments show that the cache miss rate unexpectedly affects the running time of the simulated data structure to such a degree that an accurate approximation has not been possible, though experiments on small inputs suggest that an average case of O(n^0.628 m) is feasible.

Acknowledgements

First and foremost I would like to thank my thesis advisor Christian Nørgaard Storm Pedersen for his guidance and uncanny ability to make problems disappear by putting things in perspective. A special thanks goes to TÅGEKAMMERET for making my extended stay at Aarhus University a great pleasure. I would like to thank Lauge Mølgaard Hoyer and Johan Sigfred Abildskov for sacrificing their spare time to proofread and make suggestions to my thesis. A great thank you also goes to Vickie Falk Jensen for proofreading my thesis and for keeping me fed and motivated while I was writing it. Many thanks also go out to everyone who kept pestering me to write a thesis. You may stop now.

Jakob Schultz-Nielsen, Aarhus, February 3, 2014.

Contents

1 Introduction
2 Preliminaries
   2.1 Alignments
   2.2 Smith-Waterman Local Alignment Solution
      2.2.1 Definition
      2.2.2 Implementation
   2.3 Suffixes
3 Simulated Directed Acyclic Word Graph
   3.1 Directed Acyclic Word Graph
      3.1.1 End-Set Equivalence
      3.1.2 Definition
   3.2 Depth-First Unary Degree Sequence
      3.2.1 Definition
      3.2.2 Basic Constant-time Operations
      3.2.3 Required Constant-time Operations
   3.3 FM-index
      3.3.1 Definition
      3.3.2 Operations
   3.4 Simulating a Directed Acyclic Word Graph
      3.4.1 Merging Data Structures
      3.4.2 Navigation
      3.4.3 Extracting Information
      3.4.4 Computing Local Alignment
4 Experiments
   4.1 Experimental Setup
   4.2 Comparing Algorithms
   4.3 Cache Misses
   4.4 Scoring Schemes
   4.5 Build Time
5 Conclusion
   5.1 Conclusion
   5.2 Future Work
Bibliography

Chapter 1 Introduction

Effectively computing local alignments between sequences is a fundamental problem in Bioinformatics. It is used to identify similar regions within sequences which are dissimilar when viewed in their entirety. These similar regions can then indicate common traits between otherwise unrelated sources, for example DNA sequences from two separate species. Finding local alignments is also widely used on other sequences such as RNA and proteins, and it has applications beyond Bioinformatics, such as language reconstruction [10]. As a consequence, any advancement made in reducing the time required to compute local alignments will assist many areas of research.

Many different approaches for computing local alignments have been presented in the past and a large number of tools are available, of which BLAST, found at http://blast.ncbi.nlm.nih.gov/, is probably the most well known. Common to most of these tools is the use of a scoring scheme, which is used to calculate the similarity of aligned sequences, with a high score indicating that the two sequences in the alignment are very similar.

The goal of this thesis is to examine an approach to solving the local alignment problem presented by Do and Sung [3]. This approach seeks to reduce the space consumption normally required by other data structures, while improving worst-case performance and achieving an expected average time complexity of O(n^0.628 m) when using a standard BLAST scoring scheme. Since it is not feasible to implement the approach in its entirety due to time constraints, this thesis will focus on investigating the theoretical time consumption presented by Do and Sung [3] and will allow for a greater space complexity by using alternative auxiliary data structures when it is practical. However, the implementation will follow the general outline described in [3] as closely as possible, so that it is possible to infer whether the approach is feasible to implement. Put in other words, this thesis will attempt to answer the following questions:

• Is the data structure and the local alignment algorithm described by Do and Sung [3] feasible to implement?
• Does the local alignment algorithm described in [3] have a worst-case time performance of O(nm)?
• Is its average time performance O(n^0.628 m) as expected?

To answer these questions the thesis will present an implementation of the simulated directed acyclic word graph which offers the same time complexity as presented by Do and Sung [3]. Whether or not this implementation works will verify whether the data structure is feasible to implement. Moreover, by running a number of experiments designed to test the time consumption we should gain insight into the behaviour of the data structure as a function of the input size.
The thesis will be structured as follows. Chapter 2 describes a number of concepts and data structures which are used throughout the thesis. The chapter also introduces a naïve approach for solving the local alignment problem, which will be used in chapter 4 for comparison with the simulated directed acyclic word graph implementation. Chapter 3 then defines the directed acyclic word graph data structure and describes the depth-first unary degree sequence and FM-index data structures, which will be used to simulate the directed acyclic word graph. The chapter ends by describing an algorithm to compute the local alignment using the simulated data structure. In chapter 4 a number of experiments and their results will be presented and analysed. This chapter will eventually yield answers regarding the time complexity of the implementation and also give additional insight into the simulated directed acyclic word graph. The experimental results will also be the basis for the conclusions drawn in chapter 5.

The source code used for the experiments, and described throughout the thesis, can be found at http://daimi.au.dk/~bubbi/thesis/ along with a PDF version of this thesis.

Chapter 2 Preliminaries

In this chapter a number of concepts and data structures are introduced which will be used throughout the thesis. These form the basis for many of the approaches used and are therefore essential for understanding the arguments made later. To form a basis for comparison with the local alignment algorithm used on the simulated directed acyclic word graph later on, this chapter will also present a naïve implementation of the computation of the local alignment, first proposed by Smith and Waterman [17]. Before presenting this algorithm the different types of alignments used in this thesis will be defined. Lastly, this chapter will present data structures and concepts regarding suffixes, which will be used conceptually multiple times when presenting the simulated directed acyclic word graph data structure.

2.1 Alignments

Given two strings X and Y over the alphabet Σ, let an alignment A be a pair of strings X′ and Y′ of equal length over the alphabet Σ ∪ {−}, where '−' is a special character indicating a gap. The following two properties must then hold for A.

• Removing all gap characters from the alignment strings X′ and Y′ returns the strings to their original form, X and Y.
• For any index i, at most one of the characters X′[i] and Y′[i] is a gap character.

To minimize confusion it should be noted that for any i, a pair of characters (X′[i], Y′[i]) is called an indel (short for insertion/deletion) or a gap if either of them is a gap character. Moreover, if neither of the characters is a gap, we call the pair a match if they are the same character, and a mismatch if they differ.

To enable comparisons between alignments we need a way of scoring any alignment. So let δ be a scoring scheme which is defined over all character pairs. The total score of an alignment with respect to δ is then defined as Σ_i δ(X′[i], Y′[i]).

Next we will introduce three types of alignments which will recur throughout this thesis. To define these we let S be a string of length n and P be a string of length m over a common alphabet Σ.

Global Alignment

We define the global alignment problem as finding an alignment A between S and P which maximizes the alignment score with respect to the specified scoring scheme δ.
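To make the scoring concrete, here is a minimal sketch in Python. The scoring values below (+2 for a match, −1 for a gap, −3 for a mismatch) are illustrative assumptions, not values taken from the thesis.

    # Score an alignment (X', Y') under a scoring scheme delta.
    def delta(a, b):
        if a == '-' or b == '-':
            return -1               # indel: one character aligned to a gap
        return 2 if a == b else -3  # match / mismatch

    def alignment_score(x_prime, y_prime):
        assert len(x_prime) == len(y_prime)
        return sum(delta(a, b) for a, b in zip(x_prime, y_prime))

    # Example: "CAT-A" aligned with "CATTA" scores 4 * 2 + (-1) = 7.
    print(alignment_score("CAT-A", "CATTA"))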
For the sake of brevity we denote the global alignment score between S and P as global-score(S, P). For the purpose of this thesis, it should be noted that scoring schemes are assumed to have values greater than or equal to zero for matches and less than or equal to zero for gaps and mismatches.

Local Alignment

The local alignment problem attempts to find an alignment A between substrings of S[1 . . . n] and P[1 . . . m] which maximizes the alignment score of A. We denote the score of the local alignment as local-score(S, P) and can express it using our previous definition of the global alignment problem:

    local-score(S, P) = max_{1≤h≤i≤n, 1≤l≤j≤m} global-score(S[h . . . i], P[l . . . j]).

Meaningful Alignment

To define the meaningful alignment problem we look at the original definition of a meaningless alignment by Lam et al. [11]. Given an alignment A = (X, Y) of S and P, where X = S[h . . . i] and Y = P[l . . . j], the alignment is defined as meaningful if the following holds:

    ∀k ∈ {1 . . . |X|} : global-score(X[1 . . . k], Y[1 . . . k]) > 0.

This means that for the alignment A to be meaningful, all non-empty prefixes of the aligned strings X and Y must have a positive global alignment score. If this is not the case, the alignment is said to be meaningless. We denote the meaningful score as meaningful-score(S, P).

Using this definition Lam et al. also show that the local alignment and the meaningful alignment have the following relation to each other:

    local-score(S, P) = max_{1≤h≤i≤n, 1≤l≤j≤m} meaningful-score(S[h . . . i], P[l . . . j]).

This relation can be interpreted the following way. Every meaningless alignment has a prefix with a negative global score, and if this prefix were removed from the alignment, the new alignment would have a greater global alignment score than the original. Therefore a meaningless alignment can never be a local alignment, which, in turn, means that the local alignment score can be found by finding the meaningful alignment which maximizes the alignment score given δ.

2.2 Smith-Waterman Local Alignment Solution

In the last section we defined the different alignments which will be used in this thesis. However, we have not yet given any algorithm to compute the local alignment, which means we have nothing to compare the simulated directed acyclic word graph solution, presented later, against. This section will therefore present a naïve solution to the local alignment problem, first proposed by Smith and Waterman [17], though with a slight variation to ensure linear space consumption. The algorithm is based on dynamic programming and in its original form requires O(nm) space.

2.2.1 Definition

Given an input string S[1 . . . n], a query string P[1 . . . m] and a scoring scheme δ, we define a table T of size n × m where each entry T[i, j] is calculated using the following recursive definition:

    T[i, j] = max of
        0                                  (reset)
        T[i − 1, j − 1] + δ(S[i], P[j])    (match/mismatch)
        T[i − 1, j] + δ(S[i], −)           (S[i] aligned to a gap)
        T[i, j − 1] + δ(−, P[j])           (P[j] aligned to a gap)

For convenience we will also define T[i, j] = 0 if i = 0 or j = 0. Three of the cases propagate scores forwards through the table and one case ensures that no entry receives a negative score. This "reset" case ensures that the algorithm can, at any position in the table, restart the local alignment search. The scoring scheme δ is defined as having positive values for matching pairs of characters and negative values for mismatching characters and gaps.
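As an illustration, here is a minimal full-table sketch of the recurrence in Python, using O(nm) space; the linear-space variant actually used in the thesis follows in section 2.2.2. The scoring scheme is passed in as a function.

    # Naive Smith-Waterman: fill the full (n+1) x (m+1) table and return the
    # local alignment score, i.e. the maximum entry.
    def smith_waterman(S, P, delta):
        n, m = len(S), len(P)
        T = [[0] * (m + 1) for _ in range(n + 1)]
        best = 0
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                T[i][j] = max(0,                                    # reset
                              T[i-1][j-1] + delta(S[i-1], P[j-1]),  # match/mismatch
                              T[i-1][j]   + delta(S[i-1], '-'),     # S[i] against a gap
                              T[i][j-1]   + delta('-', P[j-1]))     # P[j] against a gap
                best = max(best, T[i][j])
        return best

    # With a scheme scoring +2 for a match and -1 otherwise, this returns 6,
    # the maximum entry of the table shown in figure 2.1.
    print(smith_waterman("CTCATA", "CATT",
                         lambda a, b: 2 if a == b and a != '-' else -1))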
This results in a table T where each entry T[i, j] ≥ 0 holds the value of the best alignment of S[h . . . i] and P[l . . . j] for some h and l where 1 ≤ h ≤ i ≤ n and 1 ≤ l ≤ j ≤ m. The local alignment score of S and P is then simply the maximum value in T:

    local-score(S[1 . . . n], P[1 . . . m]) = max_{1≤i≤n, 1≤j≤m} T[i, j].

To obtain the actual alignment of S and P it is necessary to backtrack from the position, or positions, in T which match the local alignment score. Backtracking is done by calculating where the entry's value originated from, given the recursive definition, and following the path, or paths, until a zero-valued entry is reached. The path is then the local alignment between S and P, with each edge between entries identifying whether a match/mismatch or a gap has occurred in the alignment.

        -  C  T  C  A  T  A
    -   0  0  0  0  0  0  0
    C   0  2  0  2  0  0  0
    A   0  0  1  0  4  1  0
    T   0  0  2  0  1  6  3
    T   0  0  2  0  0  3  5

    Figure 2.1: Searching for local alignment using backtracking.

2.2.2 Implementation

The actual implementation of the Smith-Waterman algorithm used in this thesis follows the definition closely, but takes advantage of the fact that only three values, T[i − 1, j − 1], T[i − 1, j] and T[i, j − 1], need to be available before T[i, j] can be calculated. This makes it possible to calculate all entries using only two length-n arrays, last and next, holding the previous and current column of T, thereby requiring only O(n) memory. This removes the possibility of traversing the table afterwards to extract the maximum value, so we instead need a variable to save the largest value which has been propagated forward. When all entries have been calculated, this variable will hold the local alignment score between S and P.

    Algorithm 1 Smith-Waterman

    procedure Linear-Smith-Waterman(S[1 . . . n], P[1 . . . m])
        score ← 0
        for j ← 1, m do
            for i ← 1, n do
                a ← last[i − 1] + δ(S[i], P[j])
                b ← last[i] + δ(−, P[j])
                c ← next[i − 1] + δ(S[i], −)
                next[i] ← max(0, a, b, c)
                if max(a, b, c) > score then
                    score ← max(a, b, c)
            last ← next                    ⊲ the current column becomes the previous one
        return score

The drawback of using this variation of the Smith-Waterman algorithm is that we have no way of knowing, based on the score alone, what the local alignment between S and P is, nor whether there are several alignments which have the same score.

    [Figure 2.2: Computing local alignment score in linear space; only the two arrays last and next are kept in memory.]

To remedy this we need to propagate more information through the table. This is done by letting each entry of the table contain the coordinates of the entry where the alignment began. The information is then forwarded through the table by the propagation cases, while the reset case sets an entry's origin coordinates to its own coordinates. When a score better than the previous value is observed, the value, the start coordinates of the score, and the current entry's own coordinates are saved. If it is necessary to output all the local alignments, an array can be maintained with the information of each entry whose score equals the largest observed value.

Given start coordinates (h, l) and end coordinates (i, j) of an alignment computed by the approach described above, the substrings of S and P which make up the local alignment are S[h . . . i] and P[l . . . j]. The local alignment can then be extracted by calculating the global alignment between these substrings using the same score function.
Since the substrings which make up the local alignment are usually much shorter than the original S and P, this additional computation should not affect the overall performance of the algorithm significantly, even if the algorithm used is a simple naïve solution.

Smith-Waterman's algorithm has a time complexity of O(nm), since the table T has n × m entries and each entry is computed in constant time. The linear-space variation also has a time complexity of O(nm), since the same number of entries are computed, each in constant time.

2.3 Suffixes

This section will present two data structures, the suffix tree and the suffix array, and a general concept, the suffix link, which will be used repeatedly in this thesis. The suffix tree will be used directly in the implementation of the simulated directed acyclic word graph, and therefore its construction will be discussed. However, this thesis will only refer to the suffix array as a conceptual data structure, as it is never directly implemented for the simulated directed acyclic word graph, which relies on a different data structure to enable suffix array operations.

Suffix Tree

Let S[1 . . . n] be a string whose last character is the special end marker character $. The suffix tree of S is then denoted T_S and is a tree whose edges are labelled with strings, so that every suffix of S is represented as the path label of a leaf in T_S. For notation we denote the path label of a node u as label(u); it is the concatenation of all edge strings on the path from the root node to the node u. An example of a suffix tree can be seen in figure 2.3.

    [Figure 2.3: Suffix tree of the string "CTCATA$" including suffix links.]

A suffix tree T_S of a string S can be constructed in O(n) time using several construction algorithms; however, for the purposes of this thesis we only consider the construction algorithm defined by McCreight [14], since this is the construction algorithm used when a suffix tree is needed in the implementation.

Suffix Link

A key component of the suffix tree construction algorithm created by McCreight [14] is the concept of suffix links. They allow the algorithm to run in linear rather than quadratic time. Let u be a node in the suffix tree T_S with path label label(u) = cx, where c is a single character and x is a (possibly empty) string. The suffix link of u, denoted suffix-link(u), then points to a node v in the suffix tree with label(v) = x. If x is the empty string then the suffix link points to the root node. An example of suffix links in a suffix tree can be seen in figure 2.3. Suffix links can, of course, also be used in other data structures which index suffixes.

Suffix Array

A suffix array of a string S, denoted A_S, is a space-efficient data structure which indexes the suffixes according to their lexicographical ordering. Suffixes are represented by their starting position in the original string S, i.e. the starting position of the i'th lexically smallest suffix of S can be found in A_S[i]. An example of a suffix array can be seen in figure 2.4.

    i   A_S[i]   Represented suffix
    1   7        $
    2   6        A$
    3   4        ATA$
    4   3        CATA$
    5   1        CTCATA$
    6   5        TA$
    7   2        TCATA$

    Figure 2.4: Suffix array A_S of the string S = "CTCATA$" (S[1 . . . 7] = C T C A T A $).
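For illustration, a minimal (non-linear-time) sketch of building A_S by sorting suffix start positions; the actual implementation extracts the ordering from McCreight's suffix tree instead of sorting.

    # Naive suffix array construction: sort 1-indexed start positions by the
    # suffix they start. O(n^2 log n) worst case, for illustration only.
    def suffix_array(S):
        # S is assumed to end with '$', the lexicographically smallest character.
        return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])

    print(suffix_array("CTCATA$"))  # [7, 6, 4, 3, 1, 5, 2], as in figure 2.4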
Chapter 3 Simulated Directed Acyclic Word Graph

Now that all the fundamental concepts and data structures regarding alignments and suffixes have been introduced, it is possible to begin constructing a variation of the simulated directed acyclic word graph data structure presented by Do and Sung [3], which will be the basis for the experiments conducted in the next chapter.

This chapter will start off by introducing the general definition of a directed acyclic word graph, which the implementation will attempt to simulate. Afterwards the chapter will present the depth-first unary degree sequence, which will define the topology of the simulated directed acyclic word graph, followed by a data structure called the FM-index, used to simulate the edge labels and to allow access to substring information of the string used to build the data structure. When the two data structures have been defined, the chapter will present the approach used to merge them into a single data structure which simulates the directed acyclic word graph described in the beginning of the chapter. Lastly the chapter will describe an algorithm used on the simulated data structure to solve the local alignment problem defined in section 2.1. For two strings S[1 . . . n] and P[1 . . . m], this algorithm is described by Do and Sung [3] as having a worst-case time consumption of O(nm) and an expected average time consumption of O(n^0.628 m) when using a scoring scheme which rewards matches with 1 and punishes mismatches and gaps with −3. The following chapter will investigate the legitimacy of these claims.

3.1 Directed Acyclic Word Graph

A Directed Acyclic Word Graph (DAWG) is a data structure originally proposed by Blumer et al. [2] as an alternative to suffix trees and other structures used for exact pattern matching. It was derived from deterministic finite automata with the intent of creating the smallest possible automaton which recognizes all substrings of a string T. Moreover, Blumer et al. suggest that the data structure has additional properties which make it more useful in some cases.

    [Figure 3.1: DAWG of "CTCATA" with end-sets and each set's path labels shown to the left and right respectively.]

Blumer et al. present a linear-time algorithm for constructing the DAWG D_w given the word w. Since this thesis will focus on a simulated implementation of a DAWG, this construction algorithm will not be presented. Instead this section will focus on the definition and features of the DAWG data structure.

3.1.1 End-Set Equivalence

Before we can define the Directed Acyclic Word Graph, we have to define the end-set equivalence relation, which will define the partitioning of all substrings of a given string S in the DAWG D_S. Let S[1 . . . n] = s_1 . . . s_n be a string with every character s_i ∈ Σ and let y be an arbitrary string over the alphabet Σ with |y| > 0. The end-set of y in S is then defined as

    end-set(S, y) = { i | s_{i−|y|+1} . . . s_i = y }.

If the string y is not a substring of S, then its end-set in S is empty. Conversely, the end-set of the empty string λ in S is the end-set containing all elements, i.e. end-set(S, λ) = {0, 1, . . . , n}. It should be noted that the zero position in the string S is included in the end-set of the empty string as a special case, but does not appear in any other end-set.
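A minimal sketch of computing end-sets directly from the definition (a quadratic enumeration, for illustration only; the DAWG never materializes these sets explicitly):

    # end-set(S, y): all 1-indexed positions i where an occurrence of y ends
    # in S. The empty string gets the special end-set {0, ..., n}.
    def end_set(S, y):
        n = len(S)
        if y == "":
            return set(range(n + 1))
        return {i for i in range(len(y), n + 1) if S[i - len(y):i] == y}

    S = "CTCATA"
    print(end_set(S, "A"))                         # {4, 6}
    print(end_set(S, "AT") == end_set(S, "CAT"))   # True: end-set equivalent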
An example of a DAWG with its end-sets can be seen in figure 3.1. Two strings x and y over the alphabet Σ are said to be end-set equivalent on S if end-set(S, x) = end-set(S, y). Extending this concept, we define an end-set equivalence class as the set of substrings of S which have the same end-set. For notation we define [x]_S as the end-set equivalence class to which the substring x belongs. It should be noted that the set of all end-set equivalence classes is a partitioning of all the substrings of S. Moreover, given two end-set equivalence classes [x]_S and [y]_S in S, if [x]_S = [y]_S, then one of the substrings x or y is a suffix of the other.

3.1.2 Definition

For a string S over Σ the DAWG D_S is defined as a directed acyclic graph (V, E), with the set of vertices V being the set of end-set equivalence classes of S, as defined above. Since the end-set equivalence classes are a partitioning of all substrings of S, then so is D_S. We define the edges as E = {([x]_S, [xc]_S)}, where c is a single character denoting the edge label, x and xc are substrings of S, and end-set(S, x) ≠ end-set(S, xc). We also introduce the notation c_(v,u), denoting the label on the edge (v, u).

The source (or root) node of the DAWG D_S is the end-set equivalence class of the empty string, i.e. [λ]_S; moreover, the sink node is the end-set equivalence class containing the entire string, [S]_S. This is obvious when we remember that the DAWG is derived from a deterministic finite automaton. The set of distinct paths in D_S starting from the source node [λ]_S now represents the set of substrings of S. More precisely, the concatenations of the edge labels along the paths starting from the source node are exactly the substrings of S. This is due to the fact that any path label leading to a node u in D_S belongs to the end-set equivalence class which u denotes. An example of this can be seen in figure 3.1.

Blumer et al. [2] also provide properties regarding the size of the DAWG D_S built from any string S[1 . . . n]. For any string S where n ≥ 3, the Directed Acyclic Word Graph D_S = (V, E) has the following size bounds:

    |V| ≤ 2n − 1, and |E| ≤ 3n − 4.

These constraints on D_S's size are vital for the effectiveness of the local alignment solution which will be built upon the simulated DAWG.

3.2 Depth-First Unary Degree Sequence

In this section the Depth-first Unary Degree Sequence (DFUDS) representation introduced by Benoit et al. [1] is presented. The representation is one of the two main components which make up the simulated DAWG and is responsible for the DAWG's topology. The section will first define the DFUDS representation and show how it can be used to represent the topology of any ordered tree. Afterwards, we will define a series of basic operations which will be used as the foundation for interacting with the representation. Using these basic operations we will define a number of operations which are required to simulate a DAWG data structure later on in the chapter.

The section will mention certain theoretical possibilities of the representation and its operations, but will mostly focus on the actual implementation. This is due to the fact that several of the possible auxiliary data structures which are presented throughout the literature are, albeit succinct, quite impractical to implement.
3.2.1 Definition

The Depth-First Unary Degree Sequence, DFUDS, is a representation of the topology of any ordered tree with n nodes using a string of parentheses of length 2n. Given an ordered tree T and an initially empty DFUDS string U, the construction is done by visiting all nodes of T in a pre-order depth-first traversal starting at the root node. For each node in T with degree d, a string of d '('s followed by a single ')' is appended to U. When all nodes have been visited, U contains n − 1 '('s and n ')'s. To achieve a balanced parenthesis sequence, we prepend a '(', which is also considered the imaginary super-root of U. Its only function is to balance the representation, which gives access to a number of constant-time operations originally intended for the BP representation described by Munro and Raman [15]. The imaginary super-root is located at position zero, which also means the root can always be found at position one, since it is always the first node to be visited given the traversal order, and all nodes are represented by their leftmost parenthesis.

Since the representation is binary, being made up of only opening and closing parentheses, it is obviously possible to reduce the space consumption by using bits instead of characters. However, using parentheses instead of bits increases the legibility and is therefore the form chosen for this thesis.

Before presenting the operations it should be noted that there is, to the author's knowledge, no algorithm for constructing the DFUDS directly from a string S. It is therefore necessary to create a suffix tree T_S and perform a depth-first traversal of this ordered tree to extract the representation U_S. However, since McCreight [14] presents an O(n)-time algorithm for constructing the suffix tree T_S from S, this is unlikely to affect the overall construction time significantly.

3.2.2 Basic Constant-time Operations

With the DFUDS representation it is possible to store the topology of any ordered tree using only O(n) space, but to be practical it still needs a number of constant-time operations to enable fast and efficient navigation of the representation. The operations presented in this section are the basic operations presented by Benoit et al. [1]. These will form the basis for the operations required to build an effective simulated DAWG. Given a valid DFUDS U, the following operations can be supported using auxiliary data structures consuming o(n) bits of space, as presented by Benoit et al. [1] and Jansson et al. [9]. For the purpose of this thesis these data structures will be substituted with simpler constructions with a slightly larger space consumption, while still yielding constant time complexity.

    [Figure 3.2: The suffix tree for "ATACTC$" and its DFUDS representation. Nodes are labelled with their pre-order numbering.]

Rank

The rank operation is effectively split in two, one for opening parentheses and another for closing parentheses, written rank_((i) and rank_)(i) respectively. rank_((i) and rank_)(i) return the number of '(' and ')' respectively up to and including position i in U. To obtain constant time we create two tables, one for each operation, which contain the answer for every i. This requires two tables of size |U|.

Select

Similarly to rank, the select operation is split into select_((x) and select_)(x). select_((x) returns the position of the opening parenthesis with rank x, and select_)(x) is defined similarly for closing parentheses.
With this definition of the select operation the following relations with the rank operation should be noted: rank_((select_((x)) = x, and select_((rank_((i)) = i if U[i] = '('. The same relations hold for closing parentheses. Select is also implemented using a table for each case. The tables have |U|/2 entries each, since they only need to hold information about their respective parenthesis type.

Find

The find operation returns the position of the opening parenthesis matching a closing parenthesis and vice versa. The operation is defined as two separate operations by Benoit et al. [1] and is therefore also implemented as two separate functions, find_((i) and find_)(i), given an index i. However, these return the value from the same entry in the same table, given the same parameter. This is partly to be faithful to the original definition and partly to increase the legibility of the code, since we imply which parenthesis type we originate from depending on which operation is called. The table has |U| entries, which hold the position of the matching parenthesis given its index.

Excess

Excess is the first operation where no additional auxiliary data structure is needed to guarantee constant-time execution. The operation is quite simple and is defined as

    excess(i) = rank_((i) − rank_)(i).

This can be read as the number of opening parentheses up to and including position i minus the number of closing parentheses up to and including position i in U. Since rank_( and rank_) are constant-time operations, excess must also be a constant-time operation.

Enclose

The enclose operation is defined as taking an opening parenthesis at position i and returning the opening parenthesis of the matching pair of parentheses which most tightly encloses the pair at position i. This operation is, again, implemented using a table containing the answer to all possible queries, thereby reducing it to a simple table lookup and ensuring a constant-time operation. The implementation additionally allows querying a closing parenthesis, but the opening parenthesis of the enclosing pair is still the position returned.

3.2.3 Required Constant-time Operations

While the basic operations described in the previous section allow navigation around the DFUDS representation, a layer of more complex operations is needed for navigating the topology of the ordered tree encoded in the representation. This section will focus on the required constant-time operations which are needed to simulate the DAWG according to Do and Sung [3]. As with the basic operations, the implementation will not be faithful to the low theoretical space consumption presented by Do and Sung [3]. Instead it will use a number of simpler constructions to simplify the implementation while still upholding the needed time complexity for each operation. This should result in a solution which follows the spirit of the theory while still being practical to implement.

Parent

The parent of a node encoded in the DFUDS U can be found in constant time using the basic constant-time operations described earlier, without any additional data structures. First we ensure that the queried node at position u is not the root node, which is the only node in the tree with no parent. Since the root is always represented at position one, this check is trivial. If u is not the position of the root, we find the opening parenthesis that matches the closing parenthesis which appears immediately before the queried position.
This puts us within the description of the parent, but does not necessarily give us the leftmost parenthesis which represents the parent of the node at position u. It is therefore necessary to find the first closing parenthesis from this position, and then move one position forward. This is the parent node's leftmost parenthesis, and therefore the answer. To be specific, if u is not the position of the root, the parent of the node whose description begins at u is found by the following combination of basic operations:

    parent(u) = select_)(rank_)(find_((u − 1))) + 1.

Leaf-rank

The leaf-rank definition is closely related to the definition of the original rank operations. Given a DFUDS U, each leaf can be found by an occurrence of the pattern '))', with the right parenthesis being the representation of the leaf. Moreover, the leaves appear in the sequence according to their pre-order numbering. This means that leaf-rank can be interpreted as rank_))(u), i.e. the rank operation where the pattern ')' is replaced by '))'. The operation returns the number of leaves up to and including the queried position, given a pre-order traversal of the topology. Since this operation is very similar to the original rank operation, we use the same approach to ensure constant time complexity. This means maintaining a table with the answer to every possible query u.

Leaf-select

This operation can, for a given value i, also be interpreted as the simple select operation, only with the new pattern '))', i.e. select_))(i), which returns the position of the i'th leaf in the DFUDS. This operation is implemented similarly to the original select operation, which means we once again simply maintain a table with answers to all possible queries. This also ensures constant time complexity.

Leftmost-leaf

Given the DFUDS U and a node identified by its leftmost parenthesis v in U, the leftmost-leaf operation returns the leftmost leaf of the subtree rooted in v. This is achieved, in constant time, by the following combination of leaf-rank and leaf-select operations:

    leftmost-leaf(v) = leaf-select(leaf-rank(v − 1) + 1).

This operation starts by finding the number of leaves before the subtree rooted in v and then selects the next leaf in the order, which in turn is the leftmost leaf of the subtree.

Rightmost-leaf

Like its leftmost counterpart, the rightmost leaf is also found using a combination of the operations described earlier, and is arguably even simpler:

    rightmost-leaf(v) = find_)(enclose(v)).

Because the subtree rooted in v is not balanced, due to the leftmost opening parenthesis being omitted, it is necessary to find the opening parenthesis which encloses the entire node and then find its closing parenthesis. This is then the representation of the rightmost leaf in the subtree. Once again we use only two constant-time operations, resulting in a constant-time operation.

Lowest Common Ancestor

Given two nodes v and u, the lowest common ancestor operation lca(v, u) returns the common ancestor of v and u with the greatest depth in the tree represented by the DFUDS. Jansson et al. [9] showed that it is possible to extract this information using the excess sequence E, where each entry i is defined as E[i] = excess(i). This sequence is only conceptual, since we have access to the constant-time excess operation. To extract the lowest common ancestor information out of the excess sequence E, an additional operation on E called range minimum query is needed, denoted RMQ_E.
The operation takes two arguments i, j and returns the position of the smallest element in E[i . . . j]. If there is a tie, the leftmost element is returned. Given two nodes x and y in the DFUDS, where x < y and x is not an ancestor of y, let i = select_)(x) and j = select_)(y), i.e. the position of the closing parenthesis of the representation of each node. If the nodes are leaves, then i = x and j = y. The lowest common ancestor can then be found using the following approach:

    lca(x, y) = parent(RMQ_E(i, j − 1) + 1).

Using an O(n(log log n)^2 / log n)-bit auxiliary data structure as described by Jansson et al. [9], it is possible to support RMQ_E(i, j) in constant time. However, this data structure is, again, quite impractical to implement. Instead the implementation uses a sparse table with a space consumption of O(n log n) to support constant-time range minimum queries on E.

The sparse table is an n × ⌈log n⌉ table M, where each entry M[i, j] holds the index of the minimum value in the subsequence E[i . . . i + 2^j − 1]. The recursive definition for calculating each entry is therefore

    M[i, j] = M[i, j − 1]              if excess(M[i, j − 1]) ≤ excess(M[i + 2^(j−1), j − 1]),
              M[i + 2^(j−1), j − 1]    otherwise.

    [Figure 3.3: Finding the lowest common ancestor between the nodes positioned at 10 and 16 in the DFUDS for "ATACTC$". Nodes are labelled with their representative index in the DFUDS.]

When the table has been created we can compute RMQ_E(i, j) by selecting the two entries in the table which correspond to the two subsequences that together fully cover the subsequence E[i . . . j]. The entry which corresponds to the smallest excess value is then returned. Let k = ⌊log(j − i + 1)⌋; the equation for calculating RMQ_E(i, j) is then

    RMQ_E(i, j) = M[i, k]              if excess(M[i, k]) ≤ excess(M[j − 2^k + 1, k]),
                  M[j − 2^k + 1, k]    otherwise.

Given a sparse table of E, RMQ_E(i, j) is a constant-time operation, since it is merely a constant number of lookups. While the sparse table gives us the required time complexity with a reasonable space complexity, it creates one small problem: in the event of a tie, it returns the rightmost element instead of the leftmost, which the operation requires. This is corrected by creating the sparse table on the reversed excess sequence and reversing the queries i and j, so the ranges correspond to the correct regions. When a result has been extracted it is mapped back to the correct index in the original sequence. These transformations are calculated in constant time, since each position i in the original sequence can be mapped to its reverse, rev(i) = length − i, in constant time.
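A minimal sketch of this sparse-table construction and query over a plain integer sequence (queried on the values directly rather than through the excess operation, 0-indexed, and without the reversal trick; ties here resolve to the leftmost of the two covering blocks):

    # M[j][i] holds the index of the minimum of E[i .. i + 2^j - 1].
    # Build is O(n log n); each query is O(1).
    def build_sparse_table(E):
        n = len(E)
        M = [list(range(n))]  # j = 0: each position is its own minimum
        j = 1
        while (1 << j) <= n:
            prev, row = M[j - 1], []
            for i in range(n - (1 << j) + 1):
                a, b = prev[i], prev[i + (1 << (j - 1))]
                row.append(a if E[a] <= E[b] else b)
            M.append(row)
            j += 1
        return M

    def rmq(E, M, i, j):
        # Index of the minimum value in E[i..j], inclusive.
        k = (j - i + 1).bit_length() - 1  # floor(log2(j - i + 1))
        a, b = M[k][i], M[k][j - (1 << k) + 1]
        return a if E[a] <= E[b] else b

    # The excess sequence of the DFUDS for "ATACTC$" as sample data.
    E = [1, 2, 3, 4, 5, 4, 3, 4, 5, 4, 3, 2, 3, 4, 3, 2, 1, 2, 3, 2, 1, 0]
    M = build_sparse_table(E)
    print(rmq(E, M, 9, 14))  # 11: E[11] = 2 is the minimum of E[9..14]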
These answers are obtained by traversing the topology of T , which is encoded in U , in a depth-first order and saving their depth in a table according to the index of their representation in the DFUDS. Level Ancestor A level ancestor query level-ancestor(u, d) of a node u and a depth d returns the ancestor of u with the depth d in the tree. Jansson et al. once again present an approach with a low space consumption and in constant time. Again, we choose to implement the operation naively, since a succinct implementation of a DFUDS is not the focus of this thesis. The implementation visits each node and traverses up the tree, saving the position and depth of each ancestor in a table belonging to the specific node. This method obviously utilizes a generous amount of space, since each node has a table containing all its ancestors. However, it does provide us with a means to solve a level ancestor query in constant time by two table lookups. We have now defined the DFUDS representation and given it a number of basic and more complicated operations which enables us to efficiently navigate around in the ordered tree topology which it encodes. All operations have a constant time complexity after the DFUDS and its underlying data structures have been built. The space complexity is largely determined by two constructions, the sparse table used by the lca(x, y) operation, and the tables containing information regarding each node’s ancestors used by the level-ancestor(u, d) operation. These both have O(n log n) space complexity which dominates the other O(n) data structures, giving us an O(n log n) space complexity for the implemented DFUDS. 3.3 FM-index This section will present the FM-index which will be responsible for labelling the edges between nodes in the simulated DAWG. It was developed by Ferragina and Manzini [6] with the purpose of creating a Full-text index which permits fast substring querying, yet which can be compressed considerably. While it is possible to implement the FM-index to reduce the space consumption considerably, by using different compression techniques, this thesis will forego this approach and settle for implementing an FM-index which follows the spirit of the theory described by Mäkinen and Navarro [13], yet allows 20 F $ A A C C T T C $ T A T A C T C A T C $ A C T $ A A C T A C C $ T T A T A T C A C $ L A T C T $ A C Figure 3.4: The conceptual M matrix for the string "CTCATA$". for a larger space consumption. The section will first define the required components needed for the FMindex and will then present solutions to implement these components. Moreover this section will present how the FM-index has been implemented which will be used for the experiments in chapter 4. Lastly a number of operations needed to simulate the DAWG data structure later on will be presented. 3.3.1 Definition Let S be a string whose terminating character is the special end character $ ∈ Σ, defined to be the lexicographically smallest element in Σ. An FM-index of S is then created by implementing three components, the Burrows-Wheeler transform of S, a table C and a function occ(c, i). Let L denote the Burrows-Wheeler transform (BWT) of S. L is then built by arranging every possible cyclical shift of S according to their lexicographical order, and then concatenating the ending character of each row into the string L. Each row of the conceptual matrix M , shown in figure 3.4, can be seen as an entry in the suffix array of S, if one only considers the string up to and including the end marker $. 
This transformation is implemented by using the suffix array of S to give us the order of the rows. The last character of each row is then the character which appears in the position immediately before the starting character of the row. Since the DFUDS defined in section 3.2 requires the suffix tree to create its representation, the FM-index merely extracts the suffix array from this construction instead of calculating the ordering again.

The second component, the table C, is a simple lookup table for which, given the alphabet Σ of the BWT L and a character c ∈ Σ, the entry C[c] contains the number of occurrences in L of characters with lexicographically lower value than c. This construction enables us to efficiently compute where the first occurrence of a character c takes place in the first column of the conceptual matrix M seen in figure 3.4. This component is implemented using a dictionary structure which, given a key c, returns the value as described above. The value of each entry is inserted into the dictionary by a simple traversal which counts the number of elements lexically smaller than its key. The dictionary structure returns the value in O(log(size of container)) time, where the size of the container equals the size of our alphabet; however, since we will restrict our experiments to an alphabet of size 5, i.e. Σ = {$, A, C, G, T}, this should not affect the time complexity.

The last component needed to complete the FM-index is an implementation of the function occ(c, i). The function takes a character c and an index i as arguments and returns the number of occurrences of c in the BWT L up to and including position i. This component is implemented using the wavelet tree data structure, which was first introduced by Grossi et al. [8].

    [Figure 3.5: The wavelet tree for the string "ATCT$AC"; the strings are shown for convenience but are not stored.]

A wavelet tree is a recursively defined data structure which partitions the input string in two, for each step down the tree, according to a partitioning of its alphabet Σ. Given an alphabet Σ for the string L, let Σ_0 be the alphabet containing the first half of Σ and Σ_1 the second half. Denote each character c ∈ L with a bit 0 or 1 depending on whether c ∈ Σ_0 or c ∈ Σ_1. The resulting bit-vector is stored in the root node. Partition L into two strings, L_0 over the alphabet Σ_0 and L_1 over Σ_1, which contain the characters denoted by 0- and 1-bits respectively. Repeat the process described above until the alphabet of the new nodes contains two or fewer elements. Using only the bit-vectors, as seen in figure 3.5, a traversal of the string would be necessary in order to answer rank and select queries, so other auxiliary data structures are usually used. The partitioning remains the same, however.

With the wavelet tree it is possible to answer rank(c, i) and select(c, x) queries in O(log |Σ|) time using a succinct representation of each node's bit-vector, which requires only nH_0 + o(n log |Σ|) bits of space, as described by Mäkinen and Navarro [12]. For the purpose of this thesis, however, the implementation will merely use two tables per node to contain the rank information for 0- and 1-bits respectively. This allows us to make rank queries on each node in constant time.

The rank(c, i) query is reduced to a traversal down the tree.
For every node 22 {$, A, C, G, T} = {0, 0, 1, 1, 1} A T C T $ A C 0 1 1 1 0 0 1 rank(1, 5) = 3 1 0 {$, A} = {0, 1} {C, G, T} = {0, 1, 1} A $ A 1 0 1 T C T C 1 0 1 0 rank(0, 3) = 1 0 1 {C} = {0} {G, T} = {0, 1} C C 0 0 T T 1 1 rank(0, 1) = 1 = rank(’C’, 5) Figure 3.6: Querying rank(′ C ′ , 5) on the wavelet tree of the string "ATCT$AC". we identify which bit encodes c and make a rank query inext = rank(encoding(c), i). We then go to the child node which represents the encoding of c and continue querying, i.e. rank(encoding(c), inext ). When a leaf is reached, its rank query is the result returned. This method also ensures a time complexity of O(log |Σ|), however it uses quite a bit more space than the solution presented by Mäkinen and Navarro [12]. The rank(c, i) query is then the occ(c, i) function on the FM-index. The select(c, x) query is similar to the rank(c, i) query. We also maintain two select tables for each node in the Wavelet Tree which contain the answer to every possible select query on the 0-, and 1-bits. However instead of querying at the root and propagating down the tree, this approach starts at the leaf node whose alphabet contains c and then propagates up the tree. An example can be seen in figure 3.7. The time complexity for the select(c, x) query is the same as rank(c, i), being O(log |Σ|). Now that all the components required for the FM-index have been presented, it is possible to unveil an important property called Last-to-First mapping LF(i). The LF-mapping allows us to find where the occurrence of L[i] takes place in the first column of the conceptual matrix M , an example of which can be seen in figure 3.4. This is done by the following definition. LF(i) = C[L[i]] + rank(L[i], i) The approach is quite simple. To retrieve the position of L[i] in F [i] it is first necessary to find out where the section containing the character L[i] begins in F . This is done by using the table C, since we know that C[c] + 1 indicates the first occurrence of c in F . Instead of choosing the first occurrence, we choose the same numbered occurrence as the character L[i] has in L. Since L[i] precedes F [i] in the original string S we know that each character is in the same order in L and F , i.e. the first occurrence of any character F [i] is also the 23 {$, A, C, G, T} = {0, 0, 1, 1, 1} A T C T $ A C 0 1 1 1 0 0 1 select(1, 3) = 4 = select(’T’, 2) 0 {$, A} = {0, 1} 1 {C, G, T} = {0, 1, 1} A $ A 1 0 1 T C T C 1 0 1 0 select(1, 2) = 3 0 1 {C} = {0} {G, T} = {0, 1} C C 0 0 T T 1 1 select(1, 2) = 2 Figure 3.7: Querying select(′ T ′ , 2) on the wavelet tree of the string "ATCT$AC". first occurrence of that character in L. By using the rank(L[i], i) function we find which numbered occurrence we have in L and therefore in F . The LF-mapping is used by some of the operations defined in the next section and is an essential property of the FM-index. 3.3.2 Operations Having defined the FM-index’s construction and its core property, the LFmapping, we can now implement a number of operations, which are needed to support the simulated DAWG. Additionally, this section will present an approach to simulate the Ψ-table defined by Do and Sung [3], which is used later on in the implementation of the suffix-link(u) operation. Backward Search The backward search needed for the simulated DAWG is derived from the FMCount algorithm defined by Mäkinen and Navarro [13]. 
3.3.2 Operations

Having defined the FM-index's construction and its core property, the LF-mapping, we can now implement a number of operations which are needed to support the simulated DAWG. Additionally, this section will present an approach to simulating the Ψ-table defined by Do and Sung [3], which is used later on in the implementation of the suffix-link(u) operation.

Backward Search

The backward search needed for the simulated DAWG is derived from the FMCount algorithm defined by Mäkinen and Navarro [13]. The FMCount algorithm will therefore be presented first, and then adjusted to fit the description of the operation backward-search(st, ed, c) used by Do and Sung [3].

The FMCount algorithm uses the FM-index to count the number of times a pattern P occurs in the original string S. This is done by searching backwards through P[1 . . . p] using the C table and rank(c, i). At each iteration i, st points to the first row with the prefix P[i . . . p] and ed points to the last row with the prefix P[i . . . p] in the conceptual matrix M. When the entire string P has been processed, the range between st and ed gives the number of suffixes of S which have P as a prefix, i.e. the number of occurrences of P in S. This algorithm takes O(|P| log |Σ|) time, since each character of P is processed in O(log |Σ|) time due to the time complexity of the rank operation.

We adjust this algorithm to search for only a single character c within a specified range st to ed, all given as arguments; hence we denote the operation backward-search(st, ed, c). The range [st . . . ed] denotes some shared prefix (which may be the empty string), and the operation finds the range which shares the prefix c concatenated with the previous shared prefix. The operation is shown in algorithm 2 and takes O(log |Σ|) time, since the rank operation of the FM-index takes O(log |Σ|). However, since our experiments will focus on alphabets of size 5, this can be viewed as constant time.

    Algorithm 2 Backward Search

    procedure Backward-Search(st, ed, c)
        sp ← C[c] + rank(c, st − 1) + 1
        ep ← C[c] + rank(c, ed)
        return sp, ep

Lookup

The lookup operation, lookup(i), is defined by Do and Sung [3] as returning the i'th entry of the suffix array of S, denoted A_S. The time complexity of the operation is stated as O(log n); however, no implementation details are presented, nor does the cited source offer any insight as to how this operation is implemented within the specified time constraint. This implies that the implementation is based on the approach presented by Mäkinen and Navarro [12], which stores parts of the suffix array A_S and uses this information to infer A_S[i] in O(log n) time.

The actual implementation has access to the suffix array when the FM-index is built. This enables us to save a partial sample A′_S of A_S for later lookups. The sampling is done at log n intervals, thereby reducing the space consumption of the sample to O(n / log n). By using a hash-map structure we can achieve amortized constant-time access to the sampled elements. This is done by applying a hash function to a key, in this case the index of an element in A_S, which maps the element to a bucket in the hash-map. If the bucket contains only one element then the query can be answered in constant time; otherwise it has to iterate over all elements of the bucket to find the correct key. By allocating a large number of buckets these non-constant lookups should be rare.

To utilize the sampled suffix array A′_S for a query lookup(i), we move backwards in S by using the LF-mapping until we find an entry which is present in A′_S. We can then infer, from the sampled value and the number of iterations it took to find it, the value of A_S[i]. This operation can be seen in algorithm 3. The entire approach is based on the knowledge that L[i] occurs immediately before F[i] in S. This allows the operation to step backwards through S until it finds an entry in the sampled suffix array A′_S; the difference is then merely the number of steps taken across S. Since the suffix array A_S is sampled at every log n'th position, the worst-case number of steps needed to reach an i′ which has a value in A′_S is O(log n), which is also the time constraint stated by Do and Sung [3].
Since the suffix array AS is sampled at each log n interval, the worst case number of steps needed to reach an i′ which has a value in A′S , is O(log n), which is also the time constraint stated by Do and Sung [3]. 25 Algorithm 3 Lookup procedure Lookup(i) i′ ← i t←0 while i′ 6∈ A′S do i′ ← C[L[i′ ]] + rank(i′ , L[i′ ]) t←t+1 return A′S [i′ ] + t ⊲ LF (i′ ) which equals one step in S Simulated Ψ-table To enable the implementation of the suffix-link(u) operation in the simulated DAWG, the FM-index needs to support the Ψ-table definition in constant time. The Ψ-table is defined by Do and Sung [3] as follows. Ψ[i] = ( i′ 0 if AS [i′ ] = AS [i] + 1 else. The implementation used is derived from a method introduced by Golynski et al. [7] which is used to decode text after a certain position in a suffix array. While the article uses succinct data structures instead of the FM-index, which obfuscates the approach considerably, these data structures do hold the same information as our FM-index. For example the selectX (c) operation on the bit vector X is equivalent to the C table in the FM-index. Algorithm 4 Simulated Ψ procedure Ψ(i) if i = 1 then return 0 else c←$ for a ∈ Σ do if C[a] < i then c←a break i′ ← select(c, i − C[c]) return i′ ⊲ Begin with the lexically largest ⊲ Correct c found ⊲ Use select operation to find correct i′ The only possible scenario where the case AS [i′ ] = A[i]+1 is not solvable for any i′ , is if A[i] points to the end marker in S. However, since the end marker $ is the lexically smallest element in S we know that it is always found at AS [1], we therefore simply check whether the argument is 1, and return 0 if this is the case. If this is not the case we first try to deduce which character entry AS [i] begins with. This is done by iterating over the alphabet Σ in reverse and finding the entry in C which is smaller than i. This entry indicates that the associated key is the beginning character of AS [i]. When c is found, we apply 26 the derived calculations from [7], thereby obtaining i′ , which we return. This operation requires an iteration over the entire alphabet in worst case, yielding a time complexity of O(|Σ|), however, as stated previously, the alphabet size will be fixed during the experiments, and can therefore be viewed as constant. 3.4 Simulating a Directed Acyclic Word Graph The chapter began by defining the Directed Acyclic Word Graph data structure and then introduced the DFUDS and FM-index data structures and a number of operations to interact with them. This section will merge the two data structures into a single simulated Directed Acyclic Word Graph data structure and introduce a number of operations which enable efficient navigation of the data structure. Lastly the section will present the method used for computing the local alignment given a DAWG and a pattern P [1 . . . m]. 3.4.1 Merging Data Structures In this section it will be shown that we can simulate the DAWG DS using only the DFUDS and FM-index. To do this we will first present a way of simulating the suffix tree TS using the DFUDS and FM-index of the reversed string S. We then present a one-to-one mapping between the suffix tree of TS and the DAWG DS which, together with the simulation of TS , gives us a way of simulating DS by using the merged DFUDS and FM-index data structures. Simulating TS In order to show how to simulate the suffix tree TS using the DFUDS and the FM-index we will first make some observations regarding the underlying data structures they represent. 
3.4 Simulating a Directed Acyclic Word Graph

The chapter began by defining the Directed Acyclic Word Graph data structure and then introduced the DFUDS and FM-index data structures along with a number of operations to interact with them. This section will merge the two data structures into a single simulated Directed Acyclic Word Graph and introduce a number of operations which enable efficient navigation of the data structure. Lastly, the section will present the method used for computing the local alignment given a DAWG and a pattern P[1 . . . m].

3.4.1 Merging Data Structures

In this section it will be shown that we can simulate the DAWG DS using only the DFUDS and the FM-index. To do this we will first present a way of simulating the suffix tree TS using the DFUDS and FM-index of the reversed string. We then present a one-to-one mapping between the suffix tree TS and the DAWG DS which, together with the simulation of TS, gives us a way of simulating DS using the merged DFUDS and FM-index data structures.

Simulating TS

In order to show how to simulate the suffix tree TS using the DFUDS and the FM-index, we will first make some observations regarding the underlying data structures they represent. The DFUDS representation is a succinct encoding of a suffix tree's topology, and the FM-index is essentially a highly compressible suffix array. Given the suffix tree TS and a suffix array AS of a string S, let the nodes of TS be ordered lexicographically according to the edge labels. This results in the same ordering of the suffixes, which are represented by the leaf nodes, as the entries in AS. Put in other words, we define the rank of a leaf as its numbering when we visit the leaves from left to right; the i'th ranked leaf is then a one-to-one mapping to the entry AS[i]. An example of this mapping can be seen in figure 3.8.

Given a node x in the suffix tree TS, let u and v be the leftmost and rightmost leaves, respectively, in the subtree rooted in x. The suffix range of the suffix label(x) in AS is then rank(u) to rank(v). Since every node in TS corresponds to a range in the suffix array, we do not actually need to store information on the edges of the suffix tree to know which suffix range a node represents via its path label. And since the FM-index contains the same information as the suffix array, we can utilize it, together with the topological information from the DFUDS, to simulate the suffix tree.

[Figure 3.8: The suffix tree and its DFUDS representation of the string "ATACTC$", along with each leaf's mapping to the suffix array.]

Mapping from TS to DS

To define a mapping from the suffix tree TS to a DAWG data structure DS, Do and Sung [3] present the function γ, which maps every non-trivial node u in TS to the end-set equivalence class in S of the reverse of label(u). A non-trivial node is any node for which the label between it and its parent is not merely the end marker $. To illustrate this mapping we will look at the non-trivial nodes in the suffix tree seen in figure 3.8, and see how they map to the DAWG seen in figure 3.9.

The first obvious mapping is the root node of US to the source node of DS. Using the mapping function γ on the root node results in [λ]S, which is the end-set containing all positions; this is the same as the definition in 3.1.1. Going a step further, we see that every node with a single character as its path label in TS is mapped to a node connected to the source node of DS by a single-character edge. This is an obvious mapping, as a single character is its own reverse, so [c]S is the correct class for any c ∈ Σ. Lastly, we look at the node represented in the DFUDS as 20, with the path label "TACTC$". It is mapped by the γ function to [CTCAT]S = {AT, CAT, TCAT, CTCAT}, since the reverse of "TACTC" is "CTCAT".

It should be noted that the end marker is never present in the simulated DAWG, since it is not considered part of the alphabet of DS. The fact that we never encounter an end marker in the simulated DAWG is due to the definitions of the navigation operations, which only operate on alphabets without the end marker; this is also why trivial nodes are never mapped over to DS. As the function is a one-to-one mapping from the suffix tree TS to DS, it does not change the actual representation of the node: each node is still represented by its leftmost parenthesis in the DFUDS representation.

[Figure 3.9: The DAWG of "CTCATA" with each node's set of path labels. Each node is marked with its DFUDS representation.]

The mapping merely allows us to interpret a node's path label as the correct end-set equivalence class of DS for S.
3.4.2 Navigation

While the mapping function enables us to regard the non-trivial nodes of the DFUDS as nodes in the DAWG data structure DS, we still lack a means of navigating between these nodes in DS, as the implicit edges in the DFUDS do not match the edges needed in DS. To navigate the DAWG based on the mapped DFUDS nodes we need some basic operations. For this we introduce the following four constant-time operations.

• Get-Source() - Returns the source (or root) node of the DAWG DS.
• Find-Child(v, c) - Returns the child u of v in DS for which c(v,u) = c. Nil is returned if no such child exists.
• Parent-Count(u) - Returns the number of parent nodes of the node u.
• Extract-Parent(u, i) - Returns the i'th parent of u. It is of course assumed that 1 ≤ i ≤ Parent-Count(u).

Get-Source

As defined in section 3.1.2, the source node of the DAWG DS is the node which has the end-set equivalence class of the empty string, i.e. [λ]S. This can be interpreted as the node in the DFUDS US which has the empty path label and therefore has the suffix range of the entire suffix array. This property only holds for the root node in the suffix tree topology, so the root node of the DFUDS is the same as the source node of the DAWG DS. Since we know from 3.2 that the root node of the DFUDS is always in position 1, the implementation is trivial.

Algorithm 5 Get-Source
procedure Get-Source()
    return 1    ⊲ Position of the root node in US

Find-Child

For any node v in DS and a character c ∈ Σ, the find-child operation returns the node u which has an edge (v, u) with the label c. The approach first takes the suffix range (st, ed) of v; as we have seen earlier, this range is found by locating the leftmost and rightmost leaves of the subtree rooted in v, so st = leftmost-leaf(v) and ed = rightmost-leaf(v). Given this range and the character c, the FM-index can then give us the suffix range (sp, ep) of the concatenated string c·label(v). If this range is not empty, a node must exist for which sp is its leftmost leaf and ep is its rightmost leaf; querying for the lowest common ancestor of their representations (found with leaf-select) then yields the node u which has c·label(v) as its path label in the suffix tree TS. Using the mapping function we get γ(v) = [label(v)]S and γ(u) = [label(v)c]S, which tells us that they do not belong to the same end-set equivalence class. By the definition of DS in 3.1.2, (γ(v), γ(u)) = ([label(v)]S, [label(v)c]S) is then an edge in DS with the label c, so we return the representation of u in the DFUDS. If the backward-search operation returns an empty range, then the substring c·label(v) does not occur, which in turn means that the end-set equivalence class [label(v)c]S does not exist, and therefore there is no child u of v with edge label c in DS. Since all the operations used are computed in constant time, as explained in their respective sections, the Find-Child operation also has constant time complexity.

Algorithm 6 Find-Child
procedure Find-Child(v, c)
    st ← leftmost-leaf(v)
    ed ← rightmost-leaf(v)
    sp, ep ← Backward-Search(st, ed, c)
    if ep − sp + 1 > 0 then
        return lca(leaf-select(sp), leaf-select(ep))
    else
        return 0
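The following C++ fragment sketches Find-Child on top of the operations introduced in this chapter. The SimulatedDawg interface is a hypothetical stand-in: leftmost_leaf, rightmost_leaf, leaf_select and lca are the DFUDS operations from section 3.2, and backward_search is the FM-index operation from section 3.3; only their signatures are shown here.

    #include <cstdint>
    #include <utility>

    // Hypothetical interface over the merged DFUDS + FM-index structure.
    // Nodes and leaves are identified by the position of their leftmost
    // parenthesis in the DFUDS, as in the text; 0 means "no such node".
    struct SimulatedDawg {
        int64_t leftmost_leaf(int64_t v) const;
        int64_t rightmost_leaf(int64_t v) const;
        std::pair<int64_t, int64_t> backward_search(int64_t st, int64_t ed, char c) const;
        int64_t leaf_select(int64_t i) const;   // DFUDS position of the i'th leaf
        int64_t lca(int64_t u, int64_t v) const;

        // Get-Source(): the source of D_S is always DFUDS position 1 (Algorithm 5).
        int64_t get_source() const { return 1; }

        // Find-Child(v, c): the child of v in D_S along an edge labelled c, or 0.
        int64_t find_child(int64_t v, char c) const {
            int64_t st = leftmost_leaf(v);
            int64_t ed = rightmost_leaf(v);
            auto range = backward_search(st, ed, c);   // suffix range of c.label(v)
            if (range.second - range.first + 1 > 0)    // non-empty: the child exists
                return lca(leaf_select(range.first), leaf_select(range.second));
            return 0;                                  // no child with label c
        }
    };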
Suffix-Link

Before presenting the Parent-Count(u) operation, we first introduce the suffix-link(u) implementation, as it is an essential part of both parent operations. Given the Ψ-table which we defined in 3.3.2, Sadakane [16] offers the following method for computing the suffix link of a node u in constant time.

Algorithm 7 Suffix-Link
procedure Suffix-Link(u)
    l ← leaf-rank(leftmost-leaf(u))
    r ← leaf-rank(rightmost-leaf(u))
    l′ ← Ψ[l]
    r′ ← Ψ[r]
    return lca(leaf-select(l′), leaf-select(r′))

In other words, this method takes the suffix range of u and then essentially strips the first character from the path labels of the two leaves by using the Ψ-table, which yields a new suffix range. Calling the lowest common ancestor operation on the two leaves denoting this suffix range then returns the node which represents the suffix link of u.

Parent-Count

To enable the parent operations, Do and Sung [3] present three lemmas which will be used without proof in this thesis. These lemmas rely upon the nature of the mapping function from the suffix tree TS to DS to compute the parent operations in constant time. The Parent-Count(u) operation relies directly on the following three lemmas.

Lemma 1 (Do and Sung [3]). Consider a non-trivial node u such that u is not the root of TS; let v be u's parent, x = label(v) and xy = label(u). For any non-empty prefix z of y, we have γ(u) = [(xy)]S = [(xz)]S. In fact, γ(u) = {(xz) | z is a non-empty prefix of y}.

Lemma 2 (Do and Sung [3]). Consider a non-trivial node u whose parent is the root node of TS. Suppose suffix-link(u) = b. The set of parents of γ(u) in DS is {γ(p) | p is any node on the path from b to the root in TS}.

Lemma 3 (Do and Sung [3]). Consider a non-trivial node u whose parent, v, is not the root node in TS. Suppose suffix-link(u) = b and suffix-link(v) = e. For every node p on the path from b to e (excluding e) in TS, γ(p) is a parent of γ(u) in DS.

The general idea behind the parent operations is that a node u in the suffix tree TS, with a label of the form label(u) = axy, where a is a single character and x and y are strings, has a parent v in DS which has the label label(v) = xy in TS. This is due to the definition of the mapping function and the DAWG, which together state that (γ(v), γ(u)) = ([xy]S, [axy]S) is an edge from γ(v) to γ(u) with edge character a. Lemma 1 then states that every non-empty prefix z of y also generates a parent of γ(u); that is, for every non-empty prefix z of y, ([xz]S, [axy]S) is an edge in DS if there is a node v with label(v) = xz in TS. To find all the nodes in TS which fit this description, the algorithm finds a range in which all occurrences of these nodes must be contained. For this the suffix-link operation is used together with lemmas 2 and 3, as seen in algorithm 8.

Algorithm 8 Parent-Count
procedure Parent-Count(u)
    if u = 1 then                    ⊲ If u is the root it has no parent
        return 0
    else
        v ← parent(u)
        b ← Suffix-Link(u)
        if v = 1 then                ⊲ Lemma 2
            return depth(b) − depth(v) + 1
        else                          ⊲ Lemma 3
            e ← Suffix-Link(v)
            return depth(b) − depth(e)

It should be noted that the pseudo-code described by Do and Sung [3] has a small flaw in it, where the calculation of the parent range is the wrong way around, yielding negative ranges. The basic underlying concepts and approach do, however, work. A small quirk in the algorithm is that the calculations seem to make certain trivial leaves a parent of some nodes in DS, even though trivial leaves should not appear in the simulated DAWG. Do and Sung [3] do not seem to offer any method of avoiding this, even though it could possibly have strange side effects.
However, since the parent operations are not used in the local alignment algorithm described later on, the consequences of this will not be investigated further in this thesis.

Extract-Parent

To extract the actual nodes which correspond to the parents of a node u in the simulated DAWG DS, we simply take the start of the range of parents, i.e. b = suffix-link(u), and walk the specified number of steps along the path from b towards the root of TS. So, given an index i with 1 ≤ i ≤ Parent-Count(u), we can extract the i'th parent of u by finding the ancestor of b at depth depth(b) − i + 1, where b denotes the start of the range of parents. Algorithm 9 describes the operation in its entirety.

Algorithm 9 Extract-Parent
procedure Extract-Parent(u, i)
    b ← Suffix-Link(u)
    return Level-Ancestor(b, depth(b) − i + 1)

As with the operation Parent-Count(u), the algorithm presented by Do and Sung [3] is reversed, yet the fundamental approach is sound and works in the implementation, provided the trivial-leaf quirk described in 3.4.2 is disregarded.

3.4.3 Extracting Information

The simulated Directed Acyclic Word Graph now has a number of operations which enable us to navigate it efficiently. However, we still lack the ability to extract information regarding the substrings of S. It is therefore necessary to implement the following two operations, which can be used to extract substring information from the nodes in DS.

• End-Set-Count(u) - Returns the number of points in the end-set of node u.
• Extract-End-Point(u, i) - Returns the i'th end point in u's end-set. The operation assumes 1 ≤ i ≤ End-Set-Count(u).

End-Set-Count

The End-Set-Count(u) operation is quite simple. Every node in DS is a mapping from a node u in the suffix tree TS, and every node u in TS represents the suffix range of label(u). The substring label(u) occurs exactly as many times as there are suffixes with label(u) as a prefix, so the size of the end-set equals the size of u's suffix range in TS, which is what we return.

Algorithm 10 End-Set-Count
procedure End-Set-Count(u)
    st ← leftmost-leaf(u)
    ed ← rightmost-leaf(u)
    return ed − st + 1

Since leftmost-leaf(u) and rightmost-leaf(u) are constant-time operations, End-Set-Count(u) is also a constant-time operation.

Extract-End-Point

Given a node u in the DAWG DS, we look at its mapping in the suffix tree TS. Its path label, label(u), then has the starting positions {AS[i] | i = st, . . . , ed}. This means we can extract the end points of label(u) in S by mirroring these starting positions, i.e. {n + 1 − AS[i] | i = st, . . . , ed}. Since the FM-index includes the lookup(i) operation, which returns AS[i], we can use it to extract an end point. So for any node u and any index 1 ≤ i ≤ End-Set-Count(u), the end-point position can be found by the following simple algorithm.

Algorithm 11 Extract-End-Point
procedure Extract-End-Point(u, i)
    st ← leftmost-leaf(u)
    return n + 1 − lookup(i + st − 1)

The lookup(i) operation on the FM-index takes O(log n) time, which makes Extract-End-Point(u, i) a worst-case O(log n) time operation.
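A minimal C++ sketch of the two extraction operations, reusing the hypothetical interface style of the Find-Child sketch together with the FM-index lookup from earlier; n_ is assumed to hold the length of the indexed string.

    #include <cstdint>

    // Sketch of the extraction operations; leftmost_leaf, rightmost_leaf and
    // lookup are the hypothetical operations described above.
    struct ExtractOps {
        int64_t n_;  // length of the indexed string
        int64_t leftmost_leaf(int64_t u) const;
        int64_t rightmost_leaf(int64_t u) const;
        int64_t lookup(int64_t i) const;  // A_S[i] via the sampled suffix array

        // End-Set-Count(u): the size of u's suffix range (Algorithm 10).
        int64_t end_set_count(int64_t u) const {
            return rightmost_leaf(u) - leftmost_leaf(u) + 1;
        }

        // Extract-End-Point(u, i), 1 <= i <= end_set_count(u): mirror a
        // suffix-array start position into an end position (Algorithm 11).
        int64_t extract_end_point(int64_t u, int64_t i) const {
            int64_t st = leftmost_leaf(u);
            return n_ + 1 - lookup(i + st - 1);
        }
    };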
3.4.4 Computing Local Alignment

Now that we have seen how a simulated Directed Acyclic Word Graph can be implemented, we will use it to calculate the local alignment score between a string S and a query string P. We first define a recursive approach, which will be the foundation for the actual iterative implementation used in the experiments.

Definition

Let DS = (V, E) be the simulated DAWG data structure, where V is a partitioning of all substrings of S[1 . . . n] as described in section 3.1, and let P[1 . . . m] be a query string. We wish to find the local alignment score between the two strings S and P as defined in section 2.1. Since a node u ∈ V represents the set of path labels from the source node to the node, we define a string x as belonging to u if x equals one of these path labels. The scoring scheme δ for computing the local alignment is assumed to return negative values for mismatches and insertions/deletions, and some value greater than or equal to zero for a match. This allows us to use the definition of a meaningful alignment in 2.1 to later leave out calculations of alignments which cannot be a local alignment, as explained in 2.1.

For every 1 ≤ j ≤ m and every node u ∈ DS we wish to recursively define an entry Nj[u] which maximizes the meaningful score:

    Nj[u] = max { meaningful-score(P[k . . . j], x) | k ≤ j, x ∈ u }

The recursive definition for which this relation holds is given by Do and Sung [3] as the following formula:

    Nj[λ] = 0       for all j = 0 . . . m       (Reset)
    N0[u] = −∞      for all u ∈ V \ {λ}

    Nj[u] = filter( max over edges (v, u) ∈ E of
        Nj−1[v] + δ(P[j], c(v,u))     (Match)
        Nj−1[u] + δ(P[j], −)          (Gap in S)
        Nj[v]   + δ(−, c(v,u)) )      (Gap in P)

The inner equation calculates which transition yields the best alignment over every edge (v, u) ∈ E into the node u. To ensure that this is a meaningful score, we then apply the filter function, defined as:

    filter(a) = a    if a > 0
                −∞   otherwise

Put simply, it ensures that the best transition yields a positive score. If it does not, the entry is set to negative infinity, thereby ensuring that the meaningless alignment does not affect any further calculations; any meaningless alignment is disregarded as soon as it appears.

When every entry in the table N has been calculated according to the recursive definition, all we need to do is extract the greatest value, which is the local alignment score between S and P. It should be noted that we do not view the source node as representing an alignment, since it represents the empty string. Likewise, we do not consider the empty string in P as being a valid alignment with any substring in S, so we also disregard N0[u], which gives us:

    local-score(S, P) = max { Nj[u] | 1 ≤ j ≤ m, u ∈ V \ {λ} }

Implementation

The implementation follows the general algorithm outline specified by Do and Sung [3]. Just like the Smith-Waterman algorithm in section 2.2.2, we note from the recursive definition that each entry Nj[u] only requires entries from the (j−1)'th iteration, apart from its own iteration, to be computed. Therefore we only need to maintain the tables Nj−1 and Nj in each iteration. Instead of keeping two tables with entries for every node, the implementation uses two hash-maps, last and next. Each node is then represented as a pair (key, value), where key is the node's representation and value is the entry Nj−1[u] or Nj[u] for last and next, respectively. The hash-map has amortized constant-time operations for insertion and search. To help ensure this by avoiding clashes, and to reduce the likelihood of having to rehash the tables if they exceed the initial size, which is also time consuming, we reserve enough memory for each table to hold every node in DS.
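As a small illustration of these implementation choices, the following is a hedged C++ sketch of the filter function and the two pre-reserved hash-maps; NEG_INF and num_nodes are hypothetical placeholders for the implementation's actual constants.

    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // Hypothetical "-infinity", chosen so further additions cannot overflow.
    constexpr int64_t NEG_INF = INT64_MIN / 2;

    // filter: only meaningful (positive) scores survive, as in the recursion.
    inline int64_t filter(int64_t a) { return a > 0 ? a : NEG_INF; }

    int main() {
        // last holds the positive entries of N_{j-1}, next those of N_j;
        // keys are the DFUDS node representations.
        std::unordered_map<int64_t, int64_t> last, next;
        const std::size_t num_nodes = 1 << 20;  // assumed upper bound on |V|
        last.reserve(num_nodes);                // reserve up front to avoid rehashing
        next.reserve(num_nodes);
        last[1] = 0;                            // reset condition: N_j[lambda] = 0
        next[1] = 0;
        return 0;
    }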
The algorithm given by Do and Sung [3] also specifies that, in order to correctly handle a gap transition, the nodes need to be visited in some topological order. How this order is found, inferred or maintained is not specified; the implementation therefore initializes a hash-map which, given a node's representation, returns its topological order. The topological order is found using a recursive depth-first traversal of the DAWG, storing the order number of each node. The traversal is described in pseudo-code in algorithm 12. Not included in the pseudo-code is the conversion from a list to a hash-map; this is done with a simple iteration over the list, where each node representation is paired with its order number.

Algorithm 12 Computing Topological Ordering
order ← {}    ⊲ Empty list which will contain the ordered nodes
procedure Get-Topology(DS)
    while there is an unmarked node v ∈ DS do
        Visit(v)
    return order

procedure Visit(v)
    if v is not marked then
        for all u such that (v, u) ∈ E do
            Visit(u)
        Mark v
        Add v to head of order
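The recursive traversal is straightforward, but, as the experiments in chapter 4 will show, the recursion depth becomes a practical problem on large inputs. As a hedged alternative, the following C++ sketch computes the same reverse-post-order numbering with an explicit stack; it assumes the hypothetical SimulatedDawg interface from the Find-Child sketch and an alphabet string, neither of which is part of the original pseudo-code.

    #include <cstddef>
    #include <cstdint>
    #include <stack>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Iterative depth-first traversal producing a topological numbering of
    // the DAWG nodes; the explicit stack avoids deep recursion.
    std::unordered_map<int64_t, int64_t>
    get_topology(const SimulatedDawg& dawg, const std::string& alphabet) {
        std::unordered_map<int64_t, int64_t> state;  // 0 = unseen, 1 = open, 2 = done
        std::vector<int64_t> post_order;
        std::stack<std::pair<int64_t, bool>> stack;  // (node, children already pushed?)

        stack.push({dawg.get_source(), false});
        while (!stack.empty()) {
            std::pair<int64_t, bool> top = stack.top();
            stack.pop();
            int64_t v = top.first;
            if (top.second) {                        // all children finished
                state[v] = 2;
                post_order.push_back(v);
            } else if (state[v] == 0) {
                state[v] = 1;
                stack.push({v, true});               // finish v after its children
                for (char c : alphabet) {
                    int64_t u = dawg.find_child(v, c);
                    if (u != 0 && state[u] == 0) stack.push({u, false});
                }
            }
        }
        // Reverse post-order is a topological order of a DAG: number accordingly.
        std::unordered_map<int64_t, int64_t> order;
        for (std::size_t i = 0; i < post_order.size(); ++i)
            order[post_order[post_order.size() - 1 - i]] = static_cast<int64_t>(i);
        return order;
    }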
The implemented method can be seen in pseudo-code in algorithm 13.

The implementation begins by initializing the topology mapping using the approach described above. Afterwards it initializes N0[u], which is the last hash-map, for all nodes u ∈ DS. Since we only keep positive entries in our hash-maps, this reduces to giving the source node the value zero, i.e. N0[λ] = last[1] = 0. Moreover, since the recursive definition tells us that the source node always has a score of zero, this is equivalent to the reset case in Smith-Waterman. We also set next[1] = 0. Lastly, we hold the largest value seen so far in the variable max, which we initialize to negative infinity.

Algorithm 13 Computing Local Alignment Score
procedure Local-Alignment-Score(DS, P)
    order ← Get-Topology(DS)
    last ← {(1, 0)}
    next ← {(1, 0)}
    max ← −∞
    for j ← 1, . . . , m do
        for v ∈ last do
            for c ∈ Σ do
                u ← Find-Child(v, c)
                if u ≠ 0 then
                    tmpA ← last[v] + δ(P[j], c)          ⊲ Match case
                    if tmpA > 0 and tmpA > next[u] then
                        next[u] ← tmpA
                        if tmpA > max then max ← tmpA
            tmpB ← last[v] + δ(P[j], −)                  ⊲ Gap in S case
            if tmpB > 0 and tmpB > next[v] then
                next[v] ← tmpB
                if tmpB > max then max ← tmpB
        heap ← {}
        for v ∈ next do
            heap ← heap ∪ {(order[v], v)}                ⊲ Nodes are compared by their order
        while |heap| > 0 do
            v ← heap.pop()                               ⊲ Extract and delete the lowest-order node
            for c ∈ Σ do
                u ← Find-Child(v, c)
                if u ≠ 0 then
                    tmpC ← next[v] + δ(−, c)             ⊲ Gap in P case
                    if tmpC > 0 and tmpC > next[u] then
                        next[u] ← tmpC
                        if tmpC > max then max ← tmpC
        last ← next
        next ← {(1, 0)}                                  ⊲ Reset condition
    return max

[Figure 3.10: How the three cases conceptually propagate values forward in the local alignment score algorithm. The left figure shows the match/mismatch case, the center figure the gap in S case, and the right figure the gap in P case.]

Though several of the steps are quite obvious given the pseudo-code, we will go through it step by step to be thorough. After initializing the required hash-maps and variables, the implementation iterates over the pattern P. For each iteration j it first takes each node v in the hash-map last, which equals each positive entry in Nj−1[v], and attempts to apply the two cases match/mismatch and gap in S. It should be noted that in the case of a match we propagate a value from v in last to u in next, since we conceptually take a step in both S and P. In the case of a gap in S, we only take a step in P, meaning that the value is propagated from v in last to v in next; we conceptually remain in the same substring of S, since we have chosen to add a gap. To satisfy the filter operation defined in the recursive formula, we ensure that a score only has a chance of being propagated forward if it is positive. To ensure that we hold on to the best score at all times, we also require that a score is better than the score currently held by the respective node. Lastly, we update our max variable whenever a score greater than any seen before is calculated.

The second part of iteration j calculates the case in which a gap in P leads to a meaningful alignment. However, since each entry in the hash-map next is now influenced by other entries in next, the order in which we calculate each node's entry becomes important for the resulting alignment and its score. Each node u in the hash-map next can only be calculated once we have calculated the entry of every node v that has an outgoing edge (v, u). This is where the topological ordering of the nodes in DS, which we calculated at the beginning of the algorithm, becomes important. However, we do not wish to iterate over every node according to the topological ordering, since this would mean iterating over a lot of nodes which have negative infinity as their value in Nj. Instead we create a priority queue and insert each node according to its topological ordering, which allows us to extract and delete the node with the lowest order in O(log |heap|) time; insertion can be achieved in constant time or O(log |heap|) time, depending on the type of heap. Since a topological ordering ensures that all parents of a node have been evaluated before the node itself, this approach ensures that every case where there is a gap in P is handled correctly.

It should be noted that we also encounter this problem in the naïve Smith-Waterman solution. Since we calculate the rows from left to right, we implicitly have a topological ordering, and therefore we do not have to implement a strategy to ensure this. It is also easy to see how a calculation would fail if the entry to its left had not been calculated.

Lastly, before the implementation goes on to iteration j + 1, the last hash-map is set to the next hash-map, which in turn is reset to only holding the reset condition (1, 0). After iterating over the entire pattern P, the variable max holds the largest value seen during the run, which, as in Smith-Waterman, is the local alignment score between the strings S and P. This value is finally returned.
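A small C++ sketch of the gap-in-P pass, using a standard min-heap ordered by topological number in place of the thesis's Fibonacci heap (the two behave equivalently here). The SimulatedDawg interface and the order and next maps are the hypothetical structures from the earlier sketches, and the gap cost is assumed to be independent of the edge character.

    #include <cstdint>
    #include <functional>
    #include <queue>
    #include <string>
    #include <unordered_map>
    #include <utility>
    #include <vector>

    // Gap-in-P pass: visit the nodes currently holding a positive score in
    // topological order, so every parent is finalized before its children.
    void gap_in_p_pass(const SimulatedDawg& dawg,
                       const std::unordered_map<int64_t, int64_t>& order,
                       std::unordered_map<int64_t, int64_t>& next,
                       const std::string& alphabet, int64_t gap_cost,
                       int64_t& best) {
        using Item = std::pair<int64_t, int64_t>;        // (topological order, node)
        std::priority_queue<Item, std::vector<Item>, std::greater<Item>> heap;
        for (const auto& kv : next) heap.push({order.at(kv.first), kv.first});

        while (!heap.empty()) {
            int64_t v = heap.top().second;               // lowest order first
            heap.pop();
            for (char c : alphabet) {
                int64_t u = dawg.find_child(v, c);
                if (u == 0) continue;
                int64_t tmp = next[v] + gap_cost;        // delta(-, c): gap in P
                if (tmp > 0 && (next.count(u) == 0 || tmp > next[u])) {
                    next[u] = tmp;                       // meaningful improvement
                    if (tmp > best) best = tmp;
                    heap.push({order.at(u), u});         // u may propagate further
                }
            }
        }
    }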
Extracting Local Alignments

While the above implementation returns the local alignment score between the two strings S and P, it does not provide the actual alignments which have this score. This is the same problem faced with the Smith-Waterman algorithm in section 2.2.2, and it is solved in a similar fashion in the simulated DAWG implementation. The first step is to expand the value stored in each entry Nj[u]. Instead of merely saving the best score for u in iteration j, we store a tuple containing the best score, a number Ij,u, the iteration number j, and a number Lj,u. The scores are maintained and calculated as previously, the j value is set to the number of the iteration whenever a score is propagated forward, and the two remaining values are updated according to the following definition:

    For all j = 0 . . . m:  Ij,λ = j,  Lj,λ = −1    (Reset)

    Ij,u = Ij−1,v    (Match)        Lj,u = Lj−1,v + 1    (Match)
           Ij−1,u    (Gap in S)            Lj−1,u        (Gap in S)
           Ij,v      (Gap in P)            Lj,v + 1      (Gap in P)

Let Nj[u] be an entry which has been calculated to contain at least one local alignment between S and some pattern string P. Its alignments are then made up of pairs of substrings of P and S, respectively, which are found using the calculated values, i.e. {(P[Ij,u . . . j], S[q − Lj,u . . . q]) | q ∈ end-setS(u)}. To find q ∈ end-setS(u) we use the End-Set-Count(u) and Extract-End-Point(u, i) operations described earlier.

Let us begin by describing the I value. As seen above, this value gives us the start position of the substring in the pattern P. The reset condition ensures that the substring begins at the j'th position, in accordance with the j'th iteration over P. In the match case, the starting position does not change, and u therefore inherits the value from its parent v in the last iteration. If there is a gap in S, we do not move through S, and u therefore keeps its own value from the last iteration. Lastly, if there is a gap in P, then there is a step through S, making u inherit from its parent v in the current iteration. The value j, which equals the iteration number, is saved every time, no matter which case occurs; this ensures that the end of the substring in P is saved.

Lj,u denotes the length of the substring in S that ends at position q. Since end-setS(u) denotes the end points in S of paths from the source node to the node u, it is clear that we extract the correct substring by taking this range. The reset condition ensures that the initial range is empty. The match case increases the value from the last iteration's parent by one, since a match equals a conceptual step in S; this is also the reason why a gap in S merely inherits the value, since no conceptual step is taken in S. The last case, gap in P, takes the length Lj,v already computed for the parent v in the current iteration and adds one for the conceptual step just taken in S, extending the range of the substring in S accordingly.

To ensure that we return all the local alignments which achieve the local alignment score, we simply change the variable max into a list maintaining all nodes u, with their associated values, which equal the best score. If a new best score is found, the list is cleared and the appropriate nodes with associated values are inserted. This generates a small overhead, but nothing compared to the actual search. The list can then be iterated over to find all local alignments. Since the operation Extract-End-Point takes O(log n) time, we can find all pairs of substrings which represent a local alignment in O(#alignments · log n) time.
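A compact C++ sketch of the expanded entry and its update rules as reconstructed above; the struct and function names are hypothetical.

    #include <cstdint>

    // Expanded table entry: the best score plus the bookkeeping values (I, j, L)
    // needed to recover the aligned substrings of P and S afterwards.
    struct Entry {
        int64_t score;  // N_j[u]
        int64_t I;      // start position of the aligned substring in P
        int64_t j;      // end position of the aligned substring in P
        int64_t L;      // the substring of S ending at q is S[q - L .. q]
    };

    // Update rules mirroring the definition above; `iter` is the current j.
    Entry on_match(const Entry& parent_last, int64_t iter, int64_t score) {
        return {score, parent_last.I, iter, parent_last.L + 1};  // step in P and S
    }
    Entry on_gap_in_s(const Entry& own_last, int64_t iter, int64_t score) {
        return {score, own_last.I, iter, own_last.L};            // step in P only
    }
    Entry on_gap_in_p(const Entry& parent_now, int64_t iter, int64_t score) {
        return {score, parent_now.I, iter, parent_now.L + 1};    // step in S only
    }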
If it is necessary to find the actual alignment, instead of merely the two substrings which make it up, some algorithm for computing the global alignment between S[q − Lj,u . . . q] and P[Ij,u . . . j] must be applied. However, since a local alignment usually yields substrings which are far shorter than the original strings S and P, this extra computation is insignificant compared to actually locating the substrings. We therefore generally disregard this additional overhead.

Chapter 4

Experiments

The previous chapter began by introducing the DAWG data structure, and then introduced and described two additional data structures, the DFUDS and the FM-index, which were shown to be able to simulate the DAWG by combining the operations of each. The chapter then presented an algorithm for computing local alignment using the simulated DAWG, which Do and Sung [3] conclude has a worst-case time complexity of O(nm), and which they expect to have an O(n^0.628 m) average time complexity when using a scoring scheme that rewards a match with 1 and punishes a mismatch or gap with −3. This chapter will present a number of experiments run on the simulated DAWG with the purpose of investigating whether these theoretical time complexities hold. It will also present a few other experiments intended to give further insight into the data structure, such as the effect of different scoring schemes on the DAWG's performance. Before describing the experiments and showing their results, however, the chapter will present the experimental setup.

4.1 Experimental Setup

This section presents the different aspects which influence the experiments, such as the libraries used and the format of the data. It also briefly presents the means of analysing the data output by the experiments. The contents of this section set the premise for all experiments run on the naïve Smith-Waterman algorithm for computing the local alignment score, as well as on the simulated directed acyclic word graph approach. It should be noted that all experiments have been run on an Intel Core i3 550, 3.2 GHz processor with 8 GB of memory, running a 64-bit Xubuntu 13.10 distribution. To ensure that as little overhead as possible is caused by other processes on the system, as few programs as possible are running while the tests execute.

Implementation

The experiments are based on an implementation of the simulated directed acyclic word graph as described throughout chapter 3, and an implementation of the naïve Smith-Waterman approach to computing the local alignment score described in 2.2. Both implementations are written in C++ and compiled using GCC 4.8 with the optimization flag -O3 set. While the basic approach and data structures used for the implementations have been described in their respective sections, the actual C++ libraries used have not been mentioned. For tables and arrays the implementation uses the Standard Template Library (STL) container vector. In certain parts of the implementation these are also nested, yielding two-dimensional tables. This approach does seem to be controversial in some communities, but since the tables are only used for lookups, for example in the DFUDS, and never modified, it should not affect the runtime, as lookups are done in constant time. The dictionary structure used for the C table in the FM-index described in section 3.3 uses the map<char, int> data structure from the STL to map characters to their C value.
It offers lookups in O(log |Σ|) time, which is acceptable since the experiments will be restricted to an alphabet of size four, i.e. DNA. The wavelet tree also uses the map<char, bool> data structure to map characters to their bit representation; the alphabet size again ensures an acceptable time complexity.

Another data structure used multiple times is the hash-map. It is used in a couple of places, e.g. to hold the sampled suffix array mentioned in section 3.3.1 and to hold the alignment scores for each node in the simulated DAWG. The implementation uses the STL unordered_map data structure, which has average-case constant-time lookups. There is a risk that it will need to rehash the entire structure when adding elements, which is very time consuming; however, by reserving memory in advance this should be avoided.

The last data structure to be mentioned is the priority queue, which is used to ensure that the algorithm traverses the nodes in the correct topological order in the local alignment score algorithm of the simulated DAWG. For this the implementation uses the fibonacci_heap data structure from the BOOST library collection found at http://www.boost.org. Initially the implementation had two versions of the simulated DAWG local alignment algorithm, to test whether there was any difference in performance between the STL priority_queue and the BOOST fibonacci_heap data structures. Experiments showed that there was no difference, and therefore, to simplify things, only one implementation was used in the experiments.

Data

The experiments are based on two forms of data, both over the alphabet Σ = {A, C, G, T}. The first type of data is randomly constructed over the alphabet using the C++ pseudo-random number generator rand(), which yields a uniform distribution over the characters. The other type of data relies on DNA samples from http://www.ensembl.org. The first DNA sample used is a 1.1 GB Turkey (Meleagris gallopavo) sample found at [4]; it is used to create the strings S from which the simulated DAWG DS is constructed. For the query string P, a 2.6 GB DNA sample originating from a Bushbaby (Otolemur garnettii), found at [5], is used. To keep the overhead of creating test strings as low as possible, a large portion of each DNA sample is loaded into memory before any experiments are performed. The range from which this portion originates is random, which should reduce the likelihood of using the same strings multiple times. When an experiment needs a DNA string, it takes a random section of one of the strings loaded into memory, depending on whether the string will be used for constructing a DAWG or as a query string.

Experimental analysis

For the purpose of analysing the data produced by the experiments, this thesis relies heavily upon Gnuplot and its associated tools. Aside from producing the graphs used in this chapter, Gnuplot also plays a fundamental role in approximating the time complexity of the simulated DAWG. This is done with its fit command, which uses a non-linear least-squares regression algorithm to estimate the parameters of a function given the plotted data. This approach is very common, and given a correlated equation it estimates the parameters extremely well.

4.2 Comparing Algorithms

The first experiment presented is a basic comparison of the naïve algorithm and the more complex DAWG local alignment score algorithm.
Both approaches are run with two strings S[1 . . . n] and P[1 . . . m], where n = m, with an initial size of 100, incremented by 100 for each iteration. Each iteration uses a single S and 10 query strings P, and saves the average time for the computation of the local alignment score to a data file. The time used to create the simulated DAWG data structure is not included in the experiments, since it is not included by Do and Sung [3]; the build time is examined separately in section 4.5. The experiment is run both for randomly generated data and for actual DNA sequences, as described in section 4.1. It is not expected to yield remarkably different results, as the number of meaningful alignments is expected to remain roughly the same due to the scoring scheme used. The scoring scheme returns a score of 1 for matches and −3 for gaps and mismatches, since this is the scoring scheme presented by Do and Sung [3].

[Figure 4.1: Initial comparison of the naïve Smith-Waterman and the simulated DAWG algorithm, with random sequences to the left and DNA to the right. The DAWG plot to the left is approximated by 25n^0.861·n + 12, and the Smith-Waterman plot to the left by 30n^1.000·n + 12.]

The experiment is expected to yield a graph where the naïve solution approximates a quadratic function, since it is a pure O(nm) algorithm. Likewise, the DAWG implementation is expected to approximate some function from input size to time of the form a·x^b·x + c, with a, b and c being free parameters. This function is also described by Do and Sung [3], in which they also present the expected theoretical average time of O(n^0.628 m). Using Gnuplot's approximation command fit with the function described above, the resulting approximation of the naïve Smith-Waterman data, as seen in figure 4.1, is 30n^1.000·n + 12, which solidifies the O(nm) time complexity assumed in the initial definition of the algorithm. Applying the same approximation to the simulated DAWG approach, however, returns the function 25n^0.861·n + 12. This is far from the expected theoretical average time stated earlier, though it is still well below the theoretical worst case of O(nm) described by Do and Sung [3].

Though the experiment yields an exponent which is larger than expected for the entire dataset, the exponent is not constant throughout all ranges of the set. Figure 4.2 shows the changing exponent over certain ranges of the dataset from the initial experiment. There are two noticeable jumps, which seem to occur when n exceeds 10000 and again when n exceeds 20000. This dramatic increase of the exponent of the time complexity cannot be explained merely by looking at the algorithm, since nothing in it should be affected so dramatically by such a modest increase of the input size. Figure 4.2 also shows the exponent change for the experiment run on actual DNA sequences as described earlier. Since actual DNA sequences do not have uniformly distributed characters, the results fluctuate much more than for the randomly generated data. This influences the approximation quite a bit, and the exponent can be seen to fluctuate much more than for the random data.
    Range              Random   DNA
    [0 : 5000]         0.609    0.537
    [5000 : 10000]     0.611    0.549
    [10000 : 15000]    0.777    0.927
    [15000 : 20000]    0.779    0.927
    [20000 : 25000]    0.869    0.896
    [25000 : 30000]    0.869    0.895

Figure 4.2: The change in the exponent of the DAWG's time function depending on the range of the input size.

However, the exponent does seem to converge towards a value similar to the random data approximation, indicating that the larger exponents in the middle ranges are probably caused by the larger variance in the data.

4.3 Cache Misses

A possible explanation for the sudden increases of the exponent in the time complexity of the simulated DAWG is that the CPU caches exceed their capacity, making it necessary to fetch the required data from memory. A cache miss costs extra time: the system searches through the cache for the item and then proceeds down the hierarchy until the item is eventually found in a larger, but slower, cache, or in memory. The worst case would of course be if the item were only available on the hard drive; however, all the necessary data is always loaded into memory before running the experiments, and the memory capacity is never exceeded.

To explore this possibility, the experiment described in section 4.2 has been extended to count the number of L2 and L3 misses which occur. This has been implemented using the PAPI interface, found at http://icl.cs.utk.edu/papi, to gain access to the hardware counters available on the system. The resulting data has been plotted in two separate graphs, which can be seen in figure 4.3.

[Figure 4.3: L2 and L3 cache misses for random and DNA inputs.]

The plotted data seems to offer some explanation for the dramatic increase of the exponent seen in the previous section. It can clearly be seen that the first increase of the exponent, which occurs at around n = 10000, coincides with a rapid increase of L2 cache misses at around the same input size. This would affect the performance of the algorithm, as each lookup resulting in a cache miss would require an iteration through the L3 cache and a fetch operation to access the item needed in the computation of the alignment score. The second increase also coincides with a rapidly increasing number of L3 cache misses, which forces the system to fetch data from memory before being able to continue execution. But this overhead should only result in a larger constant per fetch; it should not increase the overall time complexity, and therefore the cache misses alone cannot explain the increase of the exponent. One possible explanation could be that the exponent is temporarily affected by the increasing miss rate percentage, which would mean that the exponent should stabilize when the miss rate nears 100%.

Unfortunately, the PAPI interface was not able to access any cache hit counters on the system used for the experiment, so it is not possible to show what percentage the cache misses represent of the overall cache queries. Despite the lack of cache hit data, it seems reasonable to assume that as the input size increases, the cache miss rate percentage increases as well, since it becomes less likely that the data needed for the next computation is available in the caches, thereby affecting the time complexity of the algorithm in some way, possibly through an increased exponent as observed.
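For reference, the following is a minimal sketch of how such a measurement can be wired up with the classic high-level counter API available in the PAPI versions contemporary with this thesis; the measured function run_alignment is a hypothetical placeholder for the computation under test.

    #include <cstdio>
    #include <papi.h>

    void run_alignment() { /* hypothetical workload being measured */ }

    int main() {
        // PAPI_L2_TCM / PAPI_L3_TCM are the preset events for total L2 and
        // L3 cache misses used in this experiment.
        int events[2] = {PAPI_L2_TCM, PAPI_L3_TCM};
        long long values[2] = {0, 0};

        if (PAPI_start_counters(events, 2) != PAPI_OK)
            return 1;                    // counters unavailable on this system

        run_alignment();                 // the code under measurement

        if (PAPI_stop_counters(values, 2) != PAPI_OK)
            return 1;
        std::printf("L2 misses: %lld, L3 misses: %lld\n", values[0], values[1]);
        return 0;
    }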
Cache Miss Rate

This section investigates the possibility of the cache miss rate affecting the exponent observed in the previous experiment. If this is the case, the effect would be expected to occur only while the miss rate is increasing with respect to the input size; as the miss rate nears 100%, it should no longer be able to affect the exponent, and should merely result in a larger constant factor on the time complexity. An experiment has therefore been run on input sizes ranging from 30000 to 107250 using randomly generated strings. Due to time constraints the size is incremented by 250 per iteration, and each size is only run four times, which results in a greater variance in the data. The recursive depth-first traversal used to extract the topology of the DAWG causes a segmentation fault when the input size nears 110000, thereby limiting the maximum range; this was unfortunately discovered too late to allow for a reimplementation.

The data from the experiment is approximated for increasing start ranges up to the end range, and the resulting exponents can be seen in figure 4.4.

    Range             Exponent        Range             Exponent
    [30 : 107.25]     1.165           [65 : 107.25]     1.187
    [35 : 107.25]     1.171           [70 : 107.25]     1.144
    [40 : 107.25]     1.179           [75 : 107.25]     1.012
    [45 : 107.25]     1.186           [80 : 107.25]     0.857
    [50 : 107.25]     1.195           [85 : 107.25]     0.842
    [55 : 107.25]     1.202           [90 : 107.25]     0.836
    [60 : 107.25]     1.202

Figure 4.4: The change in exponent for runs on long strings on the simulated DAWG. The range values are in thousands of characters.

The initial exponents show that the effect of the cache misses does not end at the 30000 mark; they even exceed the exponent of the naïve implementation. This pattern seems to disappear around 60000, where the exponent begins to fall rapidly, which could indicate that the growth of the cache miss rate is decelerating, reducing its effect on the exponent. Though the exponent never reaches the average time predicted by Do and Sung [3], this could merely be caused by the limited range. Attempting to approximate the data after the 90000 mark results in an asymptotic standard error above 5% when using Gnuplot's approximation algorithm, and these results have therefore not been included; this is likely caused by the lower number of data points available and by the greater variance due to the lower number of runs per data point.

The experiment seems to indicate that it is indeed the cache miss rate which causes the unexpected increase of the exponent for smaller input ranges. Further experiments are necessary to determine where the exponent stabilizes, which would indicate the actual time complexity of the algorithm, though the results from the early ranges, before the cache misses begin to affect the algorithm, do suggest that the algorithm is capable of reaching the expected O(n^0.628 m) average time predicted by Do and Sung [3].

A possible reason for the very high number of cache misses could be the number of lookups required to navigate the simulated DAWG. As the algorithm for computing the local alignment score in 3.4.4 shows, each iteration makes a large number of calls to the Find-Child operation, which determines whether a given node has a child by making lookups in every underlying data structure. This increases the probability that at least some entries are not present in the cache, be it L2 or L3. This problem is not present in the naïve solution, as it only makes three lookups per calculation, from two tables.
4.4 Scoring Schemes

This section compares the effect of different scoring schemes on the performance of the simulated DAWG when computing the local alignment score. Since the DAWG approach relies heavily on the pruning derived from only propagating meaningful alignments forward in the computation, the scoring scheme should influence the algorithm's performance substantially. To investigate this hypothesis, three different scoring schemes have been chosen: two of them opposite extremes, and the third a standard scoring scheme of the widely used sequence analysis tool BLAST, which is also presented by Do and Sung [3] and was used for the previous experiments. The scoring schemes are defined as follows:

• The first extreme scoring scheme rewards a match with 1, while a mismatch or gap costs 0; it is denoted {1, 0, 0}. This scoring scheme makes negative scores impossible once a node has received a value, and the DAWG is therefore forced to make a calculation for all nodes very early on in the experiment. This should result in the worst-case scenario, which is O(nm) according to Do and Sung [3]. The alignment score should equal the global alignment score between S and P, since gaps are not punished, leading to an alignment where as many characters as possible are matched up.

• The other extreme is a scoring scheme which rewards matches with 1 but punishes mismatches and gaps with −∞, denoted {1, −∞, −∞}. Due to integer range constraints, the implementation represents negative infinity by half the smallest representable signed integer value, so that further additions cannot overflow (see the sketch after this list). The alignments produced by this scoring scheme equal the longest common substrings of the input strings S and P. This greatly reduces the number of meaningful alignments, since every alignment which includes a gap or mismatch is meaningless under this scheme, which should increase performance significantly.

• The standard scoring scheme rewards matches with 1 and punishes mismatches and gaps with −3; it is denoted {1, −3, −3} in the graph. This is a widely used scheme, commonly found in other tools for analysing sequence alignments.
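As a hedged C++ illustration of how such schemes can be represented, under the stated assumption that negative infinity is encoded as half the smallest signed value so that one further addition cannot wrap around:

    #include <cstdint>
    #include <limits>

    // "-infinity" that survives one more addition without overflowing.
    constexpr int64_t NEG_INF = std::numeric_limits<int64_t>::min() / 2;

    // A scoring scheme {match, mismatch, gap}; '-' denotes a gap character.
    struct ScoringScheme {
        int64_t match, mismatch, gap;
        int64_t delta(char a, char b) const {
            if (a == '-' || b == '-') return gap;
            return a == b ? match : mismatch;
        }
    };

    const ScoringScheme kNoPenalty {1, 0, 0};              // {1, 0, 0}
    const ScoringScheme kStandard  {1, -3, -3};            // {1, -3, -3}, BLAST-like
    const ScoringScheme kLcsLike   {1, NEG_INF, NEG_INF};  // {1, -inf, -inf}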
[Figure 4.5: The resulting graphs from the experiment described in section 4.4, with the results for randomly generated inputs to the left and the results for DNA input to the right. Each plot's exponent is shown in figure 4.6.]

The experiment is set up similarly to the experiment described in section 4.2, with S and P of equal size, starting at 100 and incrementing by 100 up to 30000 characters. As a small variation, the number of runs for each size of P has been reduced from 10 to 4, due to a practical time constraint which made a full test impossible. This should result in a slightly larger variance in the data, but should still be adequate. The experiment is once again split between randomly generated strings and actual DNA, with the resulting data seen in figure 4.5. The performance of the Smith-Waterman algorithm with the standard scoring scheme is included in the graphs for comparison; since this algorithm's performance is not affected by the scoring scheme, there is no reason to include data from tests with other scoring schemes.

    Scoring Scheme      Random   DNA
    {1, 0, 0}           1.188    1.192
    Smith-Waterman      1.004    1.009
    {1, −3, −3}         0.861    0.849
    {1, −∞, −∞}         0.721    0.712

Figure 4.6: The exponent of the equation a·n^b·n + c for the different plots in figure 4.5, retrieved using Gnuplot's regression command fit.

The resulting graphs show exactly the pattern we were expecting: the {1, 0, 0} scoring scheme produces by far the worst performance of the simulated DAWG, even being outperformed by the naïve Smith-Waterman algorithm. Since the theoretical worst case presented by Do and Sung [3] is O(nm), just as for the naïve algorithm, this is not a surprise, as we have already seen how cache misses affect the DAWG's time complexity. In figure 4.6 the exponents of the different plots have been approximated, and it is clear that the time complexity is quite a bit worse than the theoretical assumption. However, if we approximate only the range [0 : 2000], the exponent is reduced to 1.08, indicating that the cache miss rate is once again affecting the time complexity. Since this extreme scoring scheme forces the algorithm to calculate all entries with no pruning, it is not surprising that the experiment encounters a rapidly increasing cache miss rate earlier than under the other scoring schemes.

The second extreme scoring scheme, {1, −∞, −∞}, results in a plot with the exponent seen in figure 4.6. Even in this best-case scenario, where the algorithm is able to prune all alignments which contain gaps, the approximated time complexity exceeds, by a noticeable margin, the expected average-case time complexity of O(n^0.628 m) presented by Do and Sung [3]. While this scoring scheme is the last to be affected by the cache miss rate, it is still affected substantially, even at these relatively short input sizes.

The standard {1, −3, −3} scoring scheme yields approximately the same exponent as in section 4.2. Since that previous experiment covered this scoring scheme, no further comments will be made.

4.5 Build Time

For the simulated directed acyclic word graph to be practical to use, the time needed to build the data structure has to be insignificant compared to the computation of the local alignment score. Since the experiments on the simulated DAWG are not based on a succinct implementation as described by Do and Sung [3], the time consumption reported in this section is not necessarily equivalent to that of the succinct simulated DAWG. It should be noted that Do and Sung [3] do not present a theoretical expected time complexity for building the data structure. The build time is expected to be dominated by the construction of the O(n log n)-sized range minimum query table described in section 3.2.3; the table also takes O(n log n) time to construct, as each entry is created in constant time.

The experiment is run by constructing the simulated DAWG on strings of increasing length, starting from 1000 and increasing by 1000 each iteration up to 300000. Each iteration runs ten different DAWG constructions, and the average time is then output to a file. The resulting data is plotted and the graph approximated using Gnuplot. The data is fitted to the same function as in the previous experiments, i.e. a·x^b·x + c, partly to be able to compare with the local alignment score algorithm, but mostly because this function approximates the plot better than the expected a·x·log x + c.
The resulting approximation is 1335n^0.12·n − 10^6, which can be said to have very little effect on the overall time consumption compared to computing the local alignment. Since the plotted graph is not very interesting, due to the lack of other plots for comparison, it is not included.

Chapter 5

Conclusion

The previous chapter presented a number of experiments which have been run on the simulated DAWG data structure with the goal of investigating the claims made by Do and Sung [3]. This chapter will evaluate these findings and attempt to answer the questions raised in chapter 1. Afterwards it will discuss what improvements could be made to the implementation, and what experiments could be run to give more insight into the simulated DAWG data structure and its local alignment algorithm.

5.1 Conclusion

Beginning with the first question raised in chapter 1, it does seem feasible to implement the simulated directed acyclic word graph data structure, as the implementation used in this thesis works and is able to compute local alignments. To achieve the specific data structure presented by Do and Sung [3], however, it is necessary to substitute or extend a number of the auxiliary data structures used, which means that the feasibility relies upon the feasibility of implementing these changes. While some of them may be somewhat complex to implement, there does not seem to be any reason to suggest that they are not feasible.

The second question set out to investigate whether the experiments could verify that the worst-case time consumption is equivalent to the O(nm) time complexity presented in [3]. This was investigated by giving the simulated DAWG a scoring scheme which ensured that no pruning was possible, thereby forcing the local alignment algorithm to make calculations for all nodes at each iteration. This experiment resulted in a time complexity which was approximately the same as the expected worst case when looking at small inputs, indicating that the algorithm does indeed have a worst case of O(nm) time.

The last and most interesting question raised was whether the simulated DAWG data structure has an average time complexity of O(n^0.628 m). The first attempt to answer this question, in section 4.2, showed that the local alignment algorithm had an average time complexity approaching the expected time for input sizes ranging from 0 to 10000, but that this time complexity quickly deteriorated once the input size exceeded this range. Further investigation in section 4.3 revealed that cache misses appeared to be affecting the exponent much more than expected, which prompted an experiment into whether the cache miss rate was causing this anomaly. The cache miss experiment on large input sizes seemed to indicate that once the miss rate approached 100%, the exponent would return to the actual time complexity of the algorithm. Unfortunately, due to technical problems, the experiment was not able to identify the input size at which the exponent of the time complexity stabilizes. However, it did show that the time complexity observed when the input size exceeded 10000 was not the actual time complexity of the algorithm as a whole. While there is not enough data to definitively conclude that the time complexity would stabilize at the expected O(n^0.628 m), the initial experiments with small input sizes do suggest that it is feasible.
Lastly, an experiment was run to deduce whether the time complexity of building the simulated DAWG would generate enough overhead to make the approach impractical. This experiment showed that the build time is insignificant when compared to the local alignment computation, even if the latter achieves the expected average-case time complexity.

5.2 Future Work

The next step would be to fix the implementation so that input sizes above 110000 can be investigated. This requires an improvement of the depth-first traversal of the simulated DAWG, which currently causes a segmentation fault at large input sizes. This would allow for experiments which could determine when the algorithm's time exponent stabilizes, which would indicate the true time complexity of computing local alignment using the simulated DAWG.

Since this thesis has focused almost exclusively on experiments run on the simulated DAWG, an experiment in which the simulated DAWG is matched up against well-established tools for computing local alignment would be very interesting. While the time complexity is an important aspect, a direct comparison of identical strings run through different tools might produce surprising outcomes due to the different tactics employed.

Another aspect which has not been investigated in this thesis, but which could yield interesting results, is using affine gap costs in the scoring scheme of the simulated DAWG. While the implementation in its current form cannot operate on scoring schemes using affine gap costs, it should not be difficult to modify the local alignment algorithm to be compatible with such a scheme. How this change in gap cost affects the time performance would be interesting to see.

One thing that has not been taken into consideration in this thesis is the effect of using the succinct and compressed data structures originally used by Do and Sung [3] for the simulated DAWG. While it is argued that these merely add an additional constant time penalty to the local alignment calculation, there does not seem to be any estimate of how these complex constructions affect the build time of the data structure. If the time complexity for building the simulated DAWG begins to approach the time complexity of the local alignment computation, it would probably not be very practical to use. Implementing these structures should give insight into whether the succinct simulated DAWG is practical.

Bibliography

[1] Benoit, D., E. D. Demaine, J. I. Munro, R. Raman, V. Raman, and S. S. Rao (2005, December). Representing trees of higher degree. Algorithmica 43(4), 275–292.

[2] Blumer, A., J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. I. Seiferas (1985). The smallest automaton recognizing the subwords of a text. Theor. Comput. Sci. 40, 31–55.

[3] Do, H. H. and W.-K. Sung (2011). Compressed directed acyclic word graph with application in local alignment. In B. Fu and D.-Z. Du (Eds.), COCOON, Volume 6842 of Lecture Notes in Computer Science, pp. 503–518. Springer.

[4] Ensembl.org. ftp://ftp.ensembl.org/pub/release-74/fasta/meleagris_gallopavo/dna/Meleagris_gallopavo.UMD2.74.dna.toplevel.fa.gz.

[5] Ensembl.org. ftp://ftp.ensembl.org/pub/release-74/fasta/otolemur_garnettii/dna/Otolemur_garnettii.OtoGar3.74.dna_rm.toplevel.fa.gz.

[6] Ferragina, P. and G. Manzini (2005, July). Indexing compressed text. J. ACM 52(4), 552–581.

[7] Golynski, A., J. I. Munro, and S. S. Rao (2006). Rank/select operations on large alphabets: A tool for text indexing.
In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '06, New York, NY, USA, pp. 368–373. ACM.

[8] Grossi, R., A. Gupta, and J. S. Vitter (2003). High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA '03, Philadelphia, PA, USA, pp. 841–850. Society for Industrial and Applied Mathematics.

[9] Jansson, J., K. Sadakane, and W.-K. Sung (2012, March). Ultra-succinct representation of ordered trees with applications. J. Comput. Syst. Sci. 78(2), 619–631.

[10] Kondrak, G. (2002). Algorithms for language reconstruction.

[11] Lam, T. W., W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu (2008, March). Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797.

[12] Mäkinen, V. and G. Navarro (2005). Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing, 40–66.

[13] Mäkinen, V. and G. Navarro (2007). Implicit compression boosting with applications to self-indexing. In N. Ziviani and R. Baeza-Yates (Eds.), String Processing and Information Retrieval, Volume 4726 of Lecture Notes in Computer Science, pp. 229–241. Springer Berlin Heidelberg.

[14] McCreight, E. M. (1976, April). A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272.

[15] Munro, J. I. and V. Raman (1997). Succinct representation of balanced parentheses, static trees and planar graphs. In FOCS, pp. 118–126. IEEE Computer Society.

[16] Sadakane, K. (2007). Compressed suffix trees with full functionality. Theory Comput. Syst. 41(4), 589–607.

[17] Smith, T. F. and M. S. Waterman (1981, March). Identification of common molecular subsequences. Journal of Molecular Biology 147(1), 195–197.