Implementing a Simulated
Directed Acyclic Word Graph
for Computing Local Alignment
Jakob Schultz-Nielsen, 20061951
Master’s Thesis, Computer Science
February 2014
Advisor: Christian Nørgaard Storm Pedersen
AARHUS UNIVERSITY
DEPARTMENT OF COMPUTER SCIENCE
Abstract

Computing local alignments is used to identify similar regions within highly dissimilar sequences. It has a wide variety of uses, especially within Bioinformatics for analysing DNA, RNA or protein sequences. Many different solutions have therefore been presented in the past with the intent of reducing the time needed to identify local alignments, as naïve solutions become insufficient when sequence lengths increase.

This thesis will present a simulated directed acyclic word graph data structure which will be used to compute local alignments using dynamic programming and an effective pruning strategy. This will be the basis for a number of experiments, and it will be shown that this solution outperforms a naïve implementation and that its worst-case time complexity equals the time complexity of the naïve solution. Furthermore, experiments show that the cache miss rate unexpectedly affects the running time of the simulated data structure to such a degree that an accurate approximation has not been possible, though experiments on small inputs suggest that an average case of O(n^0.628 m) is feasible.
Acknowledgements

First and foremost I would like to thank my thesis advisor Christian Nørgaard Storm Pedersen for his guidance and uncanny ability to make problems disappear by putting things in perspective.

A special thanks goes to TÅGEKAMMERET for making my extended stay at Aarhus University a great pleasure.

I would like to thank Lauge Mølgaard Hoyer and Johan Sigfred Abildskov for sacrificing their spare time to proofread and make suggestions to my thesis. A great thank you also goes to Vickie Falk Jensen for proofreading my thesis and for keeping me fed and motivated while I was writing my thesis.

Many thanks also go out to everyone who kept pestering me to write a thesis. You may stop now.

Jakob Schultz-Nielsen,
Aarhus, February 3, 2014.
Contents

Abstract
Acknowledgements
1 Introduction
2 Preliminaries
   2.1 Alignments
   2.2 Smith-Waterman Local Alignment Solution
      2.2.1 Definition
      2.2.2 Implementation
   2.3 Suffixes
3 Simulated Directed Acyclic Word Graph
   3.1 Directed Acyclic Word Graph
      3.1.1 End-Set Equivalence
      3.1.2 Definition
   3.2 Depth-First Unary Degree Sequence
      3.2.1 Definition
      3.2.2 Basic Constant-time Operations
      3.2.3 Required Constant-time Operations
   3.3 FM-index
      3.3.1 Definition
      3.3.2 Operations
   3.4 Simulating a Directed Acyclic Word Graph
      3.4.1 Merging Data Structures
      3.4.2 Navigation
      3.4.3 Extracting Information
      3.4.4 Computing Local Alignment
4 Experiments
   4.1 Experimental Setup
   4.2 Comparing Algorithms
   4.3 Cache Misses
   4.4 Scoring Schemes
   4.5 Build Time
5 Conclusion
   5.1 Conclusion
   5.2 Future Work
Bibliography
Chapter 1
Introduction
Effectively computing local alignments between sequences is a fundamental problem in Bioinformatics. It is used to identify similar regions within sequences which are dissimilar when viewed in their entirety. These similar regions can then indicate common traits between otherwise unrelated sources, for example DNA sequences from two separate species. Finding local alignments is also widely used on other sequences such as RNA and proteins, but also has applications beyond Bioinformatics such as language reconstruction [10]. As a consequence, any advancement made in reducing the time required to compute local alignments will assist many areas of research.

Many different approaches have been presented in the past for computing local alignments and there is a large number of tools available, of which BLAST, found at http://blast.ncbi.nlm.nih.gov/, is probably the most well known. Common between most of these tools is the use of a scoring scheme, which is used to calculate the similarity of aligned sequences, with a high score indicating that the two sequences in the alignment are very similar.

The goal of this thesis will be to examine an approach to solving the local alignment problem presented by Do and Sung [3]. This approach seeks to reduce the space consumption normally required by other data structures, while improving worst-case performance and achieving an expected average time complexity of O(n^0.628 m) when using a standard BLAST scoring scheme. Since it is not feasible to implement the approach in its entirety due to time constraints, this thesis will focus on investigating the theoretical time consumption presented by Do and Sung [3] and will allow for a greater space complexity by using alternative auxiliary data structures where practical. However, the implementation will follow the general outline described in [3] as closely as possible, so it is possible to infer whether it is feasible to implement.

Put in other words, this thesis will attempt to answer the following questions:

• Is the data structure and the local alignment algorithm described by Do and Sung [3] feasible to implement?

• Does the local alignment algorithm described in [3] have a worst-case time performance of O(nm)?

• Is its average time performance O(n^0.628 m) as expected?
To answer these questions the thesis will present an implementation of the simulated directed acyclic word graph which offers the same time complexity as presented by Do and Sung [3]. Whether or not this implementation works will verify whether the data structure is feasible to implement. Moreover, by running a number of experiments designed to test the time consumption we should gain insight into the behaviour of the data structure as a function of the input size.

The thesis will be structured as follows. Chapter 2 will describe a number of concepts and data structures which are used throughout the thesis. The chapter will also introduce a naïve approach for solving the local alignment problem that will be used in chapter 4 to compare with the simulated directed acyclic word graph implementation.

Chapter 3 will then define the directed acyclic word graph data structure and describe the depth-first unary degree sequence and FM-index data structures, which will be used to simulate the directed acyclic word graph. The chapter ends by describing an algorithm to compute the local alignment using the simulated data structure.

In chapter 4 a number of experiments and their results will be presented and analysed. This chapter will eventually yield answers regarding the time complexity of the implementation and also give additional insight into the simulated directed acyclic word graph. The experimental results will also be the basis for the conclusions drawn in chapter 5.

The source code used for the experiments, and which is described throughout the thesis, can be found at http://daimi.au.dk/~bubbi/thesis/ along with a PDF version of this thesis.
Chapter 2
Preliminaries
In this chapter a number of concepts and data structures are introduced which will be used throughout the thesis. These form the basis for many of the approaches used and are therefore essential for understanding the argumentation that follows.

To form a basis for comparison with the local alignment algorithm used on the simulated directed acyclic word graph later on, this chapter will also present a naïve implementation for the computation of the local alignment, first proposed by Smith and Waterman [17]. Before presenting this algorithm the different types of alignments used in this thesis will be defined.

Lastly, this chapter will present data structures and concepts regarding suffixes, which will be used conceptually multiple times when presenting the simulated directed acyclic word graph data structure.
2.1 Alignments
Given two strings X and Y over the alphabet Σ, let an alignment A be the pair
of strings X ′ and Y ′ of equal length over an alphabet Σ ∪ {−}, where '−' is a
special character indicating a gap. The following two properties are then true
for A.
• Removing all gap characters from the alignment strings X ′ and Y ′ returns
the strings to their original form, X and Y .
• For any index i at most one of the characters X ′ [i] and Y ′ [i] is a gap
character.
To minimize confusion it should be noted that for any i, a pair of characters X′[i], Y′[i] is called an indel (short for insertion/deletion) or a gap if either of them is a gap character. Moreover, if neither character is a gap, we call the pair a match if they are the same character, and a mismatch if they differ.
To enable comparisons between alignments we need to provide a way of scoring any alignment. So let δ be a scoring scheme which is defined over all character pairs. The total score of an alignment with respect to δ is then defined as Σ_i δ(X′[i], Y′[i]).
Next we will introduce three types of alignments which will be recurring
throughout this thesis. To define these we will let S be a string of length n and
P be a string of length m over a common alphabet Σ.
Global Alignment We define the global alignment problem as finding an
alignment A between S and P so that it maximizes the alignment score with
respect to the specified scoring scheme δ. For the sake of brevity we denote the
global alignment score between S and P as global-score(S, P ).
For the purpose of this thesis, it should be noted that scoring schemes are assumed to have values greater than or equal to zero for matches and lower than or equal to zero for gaps and mismatches.
Local Alignment The local alignment problem attempts to find an alignment A between substrings of the given strings S[1 . . . n] and P[1 . . . m] which maximizes the alignment score of A. We denote the score of the local alignment as local-score(S, P) and can express it using our previous definition of the global alignment problem:

local-score(S, P) = max_{1 ≤ h ≤ i ≤ n, 1 ≤ l ≤ j ≤ m} global-score(S[h . . . i], P[l . . . j]).
Meaningful Alignment To define the meaningful alignment problem we look at the original definition of a meaningless alignment by Lam et al. [11]. Given an alignment A = (X, Y) of S and P where X = S[h . . . i] and Y = P[l . . . j], the alignment is defined as meaningful if the following holds:

∀k ∈ {1 . . . |X|} : global-score(X[1 . . . k], Y[1 . . . k]) > 0.

This means that for the alignment A to be meaningful, all non-empty prefixes of the aligned strings X and Y must have a positive global alignment score. If this is not the case, the alignment is said to be meaningless. We denote the meaningful score as meaningful-score(S, P).
Using this definition Lam et al. also show that the local alignment and the meaningful alignment have the following relation to each other:

local-score(S, P) = max_{1 ≤ h ≤ i ≤ n, 1 ≤ l ≤ j ≤ m} meaningful-score(S[h . . . i], P[l . . . j]).
This relation can be interpreted the following way. Every meaningless alignment has a prefix with a negative global score, so if this prefix were removed from the alignment, the new alignment would have a greater global alignment score than the original. Therefore a meaningless alignment can never be a local alignment, which, in turn, means that the local alignment score can be found by finding the meaningful alignment which maximizes the alignment score given δ.
2.2 Smith-Waterman Local Alignment Solution
In the previous section we defined the different alignments which will be used in this thesis. However, we have not yet given any algorithm to compute the local alignment, which means we have nothing to compare to the simulated directed acyclic word graph solution presented later. This section will therefore present a naïve solution to the local alignment problem, first proposed by Smith and Waterman [17], though with a slight variation to ensure linear space consumption. The algorithm is based on dynamic programming and in its original form it requires O(nm) space.
2.2.1 Definition
Given an input string S[1 . . . n], a query string P[1 . . . m] and a scoring scheme δ, we define a table T of size n × m where each entry T[i, j] is calculated using the following recursive definition:

T[i, j] = max of:
    0                                  (Reset)
    T[i − 1, j − 1] + δ(S[i], P[j])    (Match)
    T[i − 1, j] + δ(S[i], −)           (Gap in S)
    T[i, j − 1] + δ(−, P[j])           (Gap in P)
For convenience we will also define T [i, j] = 0 if i = 0 or j = 0.
There are three cases which propagate scores forwards through the table
and one case which ensures that no entry receives a negative score. This "reset"
case ensures that the algorithm can, at any position in the table, restart the
local alignment search.
The scoring scheme δ is defined as having positive values for matching pairs of characters and negative values for mismatching characters and gaps. This results in a table T where each entry T[i, j] ≥ 0 holds the value of the best alignment for S[h . . . i] and P[l . . . j] for some h and l where 1 ≤ h ≤ i ≤ n and 1 ≤ l ≤ j ≤ m.

The local alignment score of S and P can then be seen as being the maximum value in T:

local-score(S[1 . . . n], P[1 . . . m]) = max_{1 ≤ i ≤ n, 1 ≤ j ≤ m} T[i, j].

To obtain the actual alignment of S and P it is necessary to backtrack from the position, or positions, in T which match the local alignment score. Backtracking is done by calculating where the entry's value originated from given the recursive definition, and following the path, or paths, until a zero-valued entry is reached. The path is then the local alignment between S and P, with each edge between entries identifying whether a match/mismatch or a gap has occurred in the alignment.

      -  C  T  C  A  T  A
   -  0  0  0  0  0  0  0
   C  0  2  0  2  0  0  0
   A  0  0  1  0  4  1  0
   T  0  0  2  0  1  6  3
   T  0  0  2  0  0  3  5

Figure 2.1: Searching for local alignment using backtracking.
2.2.2 Implementation
The actual implementation of the Smith-Waterman algorithm used in this thesis follows the definition closely, but takes advantage of the fact that only three values, T[i − 1, j − 1], T[i − 1, j] and T[i, j − 1], need to be available before T[i, j] can be calculated. This makes it possible to calculate all entries using only two rows of length m, thereby requiring only O(m) memory. This removes the possibility of traversing the full table to extract the maximum value, so we need a variable to save the largest value which has been propagated forward. When all entries have been calculated, this variable will hold the local alignment score between S and P.
Algorithm 1 Smith-Waterman
procedure Linear-Smith-Waterman(S[1 . . . n], P[1 . . . m])
    score ← 0
    for i ← 1, n do
        for j ← 1, m do
            a ← last[j − 1] + δ(S[i], P[j])
            b ← last[j] + δ(S[i], −)
            c ← next[j − 1] + δ(−, P[j])
            next[j] ← max(0, a, b, c)
            if max(a, b, c) > score then
                score ← max(a, b, c)
        swap last and next
    return score
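To make the linear-space procedure concrete, here is a minimal runnable sketch of Algorithm 1 in Python. The function name and the concrete scoring values (match 2, mismatch −1, gap −3, which appear to be the values underlying figure 2.1) are illustrative assumptions, not taken from the thesis' actual source code.

def linear_smith_waterman(S, P, match=2, mismatch=-1, gap=-3):
    """Compute local-score(S, P) keeping only two rows of T."""
    def delta(a, b):
        if a == "-" or b == "-":
            return gap
        return match if a == b else mismatch

    last = [0] * (len(P) + 1)       # row i-1 of the table T
    score = 0
    for i in range(1, len(S) + 1):
        next_row = [0] * (len(P) + 1)
        for j in range(1, len(P) + 1):
            a = last[j - 1] + delta(S[i - 1], P[j - 1])  # match/mismatch
            b = last[j] + delta(S[i - 1], "-")           # gap consuming S[i]
            c = next_row[j - 1] + delta("-", P[j - 1])   # gap consuming P[j]
            next_row[j] = max(0, a, b, c)                # reset case
            score = max(score, next_row[j])
        last = next_row                                  # roll the two rows
    return score

# "CAT" aligned against "CAT" inside the two strings scores 3 * 2 = 6:
assert linear_smith_waterman("CATT", "CTCATA") == 6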
      -  C  T  C  A  T  A
   -  0  0  0  0  0  0  0
   C  0  2  0  2  0  0  0
   A  0  0  1  0             ← last
   T                         ← next
   T

Figure 2.2: Computing local alignment score in linear space.

The drawback of using this variation of the Smith-Waterman algorithm is that we have no way of knowing what the local alignment between S and P is, just based on the score, nor whether there are several alignments which have the same score. To remedy this we need to propagate more information
through the table. This is done by letting each entry of the table contain the coordinates of the entry where the alignment began. The information is then forwarded through the table by the propagation cases, while the reset case sets an entry's origin coordinates to its own coordinates. When a score better than the previous value is observed, the value, the start coordinates of the score, and the current entry's own coordinates are saved. If it is necessary to output all the local alignments, an array can be maintained with the information of each entry whose score equals the largest observed value.
Given start coordinates (h, l) and end coordinates (i, j) of an alignment computed by the approach described above, the substrings of S and P which make up the local alignment are S[h . . . i] and P[l . . . j]. The local alignment can then be extracted by calculating the global alignment between these substrings using the same score function. Since the substrings which make up the local alignment are usually much shorter than the original S and P, this additional computation should not affect the overall performance of the algorithm significantly, even if the algorithm used is a simple naïve solution.

The Smith-Waterman algorithm has a time complexity of O(nm) since the table T has n × m entries with each entry being computed in constant time. The linear variation also has a time complexity of O(nm), since the same number of entries are computed, also in constant time.
2.3 Suffixes
This section will present two data structures, the suffix tree and the suffix
array, and a general concept, the suffix link, which will be used repeatedly in
this thesis. The suffix tree will be used directly in the implementation of the
simulated directed acyclic word graph and therefore the construction algorithm
will be defined. However, this thesis will only refer to the suffix array as a
conceptual data structure, as it is never directly implemented for the simulated
directed acyclic word graph, which relies on a different data structure to enable
suffix array operations.

Figure 2.3: Suffix tree of the string "CTCATA$" including suffix links.
Suffix Tree Let S[1 . . . n] be a string where the last character is the special end marker character $. The suffix tree (or suffix trie) of S is then denoted TS and is a tree where the edges are labelled with strings so that every suffix of S is represented as the path label of a leaf in TS. For notation we denote the path label of a node u as label(u); it is the concatenation of all edge strings on the path from the root node to the node u. An example of a suffix tree can be seen in figure 2.3.

A suffix tree TS of a string S can be constructed in O(n) time using several construction algorithms; however, for the purposes of this thesis we only consider McCreight's construction algorithm [14], since this is the construction algorithm used when a suffix tree is needed in the implementation.
Suffix Link A key component of the suffix tree construction algorithm created by McCreight [14] is the concept of suffix links. They allow the algorithm to run in linear instead of quadratic time.

Let u be a node in the suffix tree TS where its path label label(u) = cx, where c is a single character and x is a (possibly empty) string. The suffix link of u, denoted suffix-link(u), then points to a node v in the suffix tree with label(v) = x. If x is the empty string then the suffix link points to the root node. An example of suffix links in a suffix tree can be seen in figure 2.3. Suffix links can, of course, also be used in other data structures which index suffixes.
Suffix Array A suffix array of a string S, denoted AS, is a space-efficient data structure which indexes the suffixes according to their lexicographical ordering. Suffixes are represented by their starting position in the original string S, i.e. the start position of the i'th lexically smallest suffix of S can be found in AS[i]. An example of a suffix array can be seen in figure 2.4.
8
i
1
2
3
4
5
6
7
S[i]
C
T
C
A
T
A
$
AS [i]
7
6
4
3
1
5
2
Represented Suffix
$
A$
ATA$
CATA$
CTCATA$
TA$
TCATA$
Figure 2.4: Suffix array AS of the string S = "CTCATA$".
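Because the suffix array only appears conceptually in this thesis (the implementation extracts it from the suffix tree construction), a naive sketch suffices for illustration. The following hypothetical Python function simply sorts the 1-based suffix start positions and reproduces figure 2.4; note that '$' sorts before the letters, matching its role as the smallest character.

def suffix_array(S):
    """Naive 1-based suffix array: sort suffix start positions
    by the lexicographic order of the suffixes they denote."""
    return sorted(range(1, len(S) + 1), key=lambda i: S[i - 1:])

assert suffix_array("CTCATA$") == [7, 6, 4, 3, 1, 5, 2]  # figure 2.4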
Chapter 3
Simulated Directed Acyclic Word Graph
Now that all the fundamental concepts and data structures regarding alignments and suffixes have been introduced, it is possible to begin constructing a variation of the simulated directed acyclic word graph data structure presented by Do and Sung [3], which will be the basis for the experiments conducted in the next chapter.

This chapter will start off by introducing the general definition of a directed acyclic word graph, which the implementation will attempt to simulate. Afterwards the chapter will present the depth-first unary degree sequence that will define the topology of the simulated directed acyclic word graph, followed by a data structure called the FM-index, used to simulate the edge labels and allow access to substring information of the string used to build the data structure.

When the two data structures have been defined, the chapter will present the approach used to merge them into a single data structure which simulates the directed acyclic word graph described in the beginning of the chapter.

Lastly, the chapter will describe an algorithm used on the simulated data structure to solve the local alignment problem defined in section 2.1. The algorithm is described by Do and Sung [3] as having, for two strings S[1 . . . n] and P[1 . . . m], a worst-case time consumption of O(nm) and an expected average time consumption of O(n^0.628 m) when using a scoring scheme which rewards matches with 1 and punishes mismatches and gaps with −3. The following chapter will investigate the legitimacy of these claims.
3.1 Directed Acyclic Word Graph
A Directed Acyclic Word Graph (DAWG) is a data structure originally proposed by Blumer et al. [2] as an alternative to suffix trees and other structures used for exact pattern matching. It was derived from deterministic finite automata with the intent of creating the smallest possible automaton which would recognize all substrings of a string T. Moreover, Blumer et al. suggest that the data structure has additional properties which make it more useful in some cases.
Figure 3.1: DAWG of "CTCATA" with end-sets and each set's path labels to the left and right respectively.
Blumer et al. present a linear-time algorithm for constructing the DAWG Dw given the word w. Since this thesis will focus on a simulated implementation of a DAWG, this construction algorithm will not be presented. Instead this section will focus on the definition and features of the DAWG data structure.
3.1.1 End-Set Equivalence
Before we can define the Directed Acyclic Word Graph, we have to define the end-set equivalence relation which will define the partitioning of all substrings of a given string S in the DAWG DS.

Let S[1 . . . n] = s1 . . . sn be a string with every character si ∈ Σ and let y be an arbitrary string over the alphabet Σ with |y| > 0. The end-set of y in S is then defined as end-set(S, y) = {i | s_{i−|y|+1} . . . s_i = y}. If the string y is not a substring of S, then its end-set in S is empty. Conversely, the end-set of the empty string λ in S is the end-set containing all elements, i.e. end-set(S, λ) = {0, 1, . . . , n}. It should be noted that the zero position in the string S is included in the end-set of the empty string as a special case, but does not appear in any other end-set. An example of a DAWG with its end-sets can be seen in figure 3.1.

Two strings x and y over the alphabet Σ are said to be end-set equivalent on S if end-set(S, x) = end-set(S, y). Extending this concept we define an end-set equivalence class as being the set of substrings in S which have the same end-set. For notation we define [x]S as being the end-set equivalence class to which the substring x belongs. It should be noted that the set of all end-set equivalence classes is a partitioning of all the substrings in S. Moreover, given two end-set equivalence classes [x]S and [y]S in S, if [x]S = [y]S, then one of the substrings x or y is a suffix of the other.
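To make the definitions concrete, here is a small illustrative sketch (the names and the quadratic enumeration are this example's own, not the thesis' code) that computes end-sets directly from the definition and partitions the substrings of S into equivalence classes.

from collections import defaultdict

def end_set(S, y):
    """end-set(S, y) = { i | S[i-|y|+1 .. i] = y }, with 1-based i;
    the empty string maps to {0, 1, ..., n} as a special case."""
    n = len(S)
    if not y:
        return set(range(n + 1))
    return {i for i in range(len(y), n + 1) if S[i - len(y):i] == y}

def end_set_classes(S):
    """Partition all non-empty substrings of S by their end-set."""
    classes = defaultdict(set)
    for i in range(len(S)):
        for j in range(i + 1, len(S) + 1):
            classes[frozenset(end_set(S, S[i:j]))].add(S[i:j])
    return classes

# "AT" and "CAT" both end only at position 5 of "CTCATA" (figure 3.1),
# so they are end-set equivalent and belong to the same class:
assert end_set("CTCATA", "AT") == end_set("CTCATA", "CAT") == {5}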
3.1.2 Definition
For a string S over Σ the DAWG DS is defined as a directed acyclic graph (V, E), with the set of vertices V being the set of end-set equivalence classes of S, as defined above. Since the end-set equivalence classes are a partitioning of all substrings in S, then so is DS. We define the edges as E = {([x]S, [xc]S)}, with c being a single character denoting the edge label, x and xc substrings in S, and end-set(S, x) ≠ end-set(S, xc). We also introduce the notation c(v,u) denoting the label on the edge (v, u).

The source (or root) node of the DAWG DS is the end-set equivalence class of the empty string, i.e. [λ]S; moreover, the sink node is the end-set equivalence class containing the entire string, [S]S. This is obvious when we remember that the DAWG is derived from a deterministic finite automaton. The set of distinct paths in DS starting from the source node [λ]S now represents the set of substrings in S. More precisely, the concatenation of all the edge labels on each path starting from the source node is exactly the set of substrings in S. This is due to the fact that any path label for any node u in DS denotes the same end-set equivalence class as u. An example of this can be seen in figure 3.1.

Blumer et al. [2] also provide us with properties regarding the size of the DAWG DS built from any string S[1 . . . n]. For any string S where n ≥ 3, the Directed Acyclic Word Graph DS = (V, E) has the following size bounds: |V| ≤ 2n − 1 and |E| ≤ 3n − 4. These constraints on DS's size are vital for the effectiveness of the local alignment solution which will be built upon the simulated DAWG.
3.2 Depth-First Unary Degree Sequence
In this section the Depth-First Unary Degree Sequence (DFUDS) representation introduced by Benoit et al. [1] is presented. The representation is one of the two main components which make up the simulated DAWG and is responsible for the DAWG's topology.

The section will first define the DFUDS representation and show how it can be used to represent the topology of any ordered tree. Afterwards, we will define a series of basic operations which will be used as the foundation for interacting with the representation. Using these basic operations we will define a number of operations which are required to simulate a DAWG data structure later on in the chapter.

The section will present certain theoretical possibilities of the representation and its operations, but will mostly focus on the actual implementation. This is due to the fact that several of the possible auxiliary data structures which are presented throughout the literature are, albeit succinct, quite impractical to implement.
3.2.1 Definition
The Depth-First Unary Degree Sequence, DFUDS, is a representation of the topology of any ordered tree with n nodes using a string of parentheses of length 2n. Given an ordered tree T and an empty DFUDS string U, the construction is done by visiting all nodes in a pre-order depth-first traversal starting at the root node. For each node in T with degree d, append a string of d '('s and a single ')' to U.

When all nodes have been visited, U will contain n − 1 '('s and n ')'s. To achieve a balanced parenthesis sequence, we prepend a '(', which is also considered the imaginary super-root of U. Its only function is to balance the representation, which gives access to a number of constant-time operations originally intended for the BP representation described by Munro and Raman [15]. The imaginary super-root is located at position zero, which also means the root can always be found at position one, since it is always the first node to be visited given the traversal order, and all nodes are represented by their leftmost parenthesis.

Since the representation is binary, made up of only open and closed parentheses, it is obviously possible to reduce the space consumption by using bits instead of characters. However, using parentheses instead of bits increases the legibility and is therefore the chosen form for this thesis.

Before presenting the operations it should be noted that there is, to the author's knowledge, no direct construction algorithm for the DFUDS given a string S. It is therefore necessary to create a suffix tree TS and do a depth-first traversal of this ordered tree to extract the representation US. However, since McCreight [14] presents an O(n)-time algorithm for constructing the suffix tree TS from S, this is unlikely to affect the overall construction time significantly.
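As a sketch of this construction step, assume the ordered tree (in the implementation, the McCreight-built suffix tree) is available as a mapping from each node to its ordered list of children; the pre-order traversal then emits d opening parentheses and one closing parenthesis per node, with the balancing '(' of the imaginary super-root prepended. The input format here is an assumption made for the example.

def dfuds(children, root=0):
    """Build the DFUDS string of an ordered tree given as a
    node -> ordered list of children mapping."""
    out = ["("]                      # imaginary super-root
    def visit(u):
        out.append("(" * len(children[u]) + ")")
        for v in children[u]:        # pre-order, children left to right
            visit(v)
    visit(root)
    return "".join(out)

# A root with three children, the first of which has two leaf children:
tree = {0: [1, 2, 3], 1: [4, 5], 2: [], 3: [], 4: [], 5: []}
assert dfuds(tree) == "(((()(()))))"   # 2n = 12 parentheses for n = 6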
3.2.2 Basic Constant-time Operations
With the DFUDS representation it is possible to store the topology of any ordered tree using only O(n) space, but to be practical it still needs a number of constant-time operations to enable fast and efficient navigation of the representation. The operations presented in this section are the basic operations presented by Benoit et al. [1]. These will form the basis for the operations required to build an effective simulated DAWG.

Given a valid DFUDS U the following operations can be supported using auxiliary data structures consuming o(n) bits of space, as presented by Benoit et al. [1] and Jansson et al. [9]. For the purpose of this thesis these data structures will be replaced by simpler constructions with a slightly larger space consumption while still yielding constant time complexity.
Figure 3.2: The suffix tree for "ATACTC$" and its DFUDS representation. Nodes
are labelled with their pre-order numbering.
Rank
The rank operation is effectively split in two, one for opening parentheses and another for closing parentheses, written as rank_((i) and rank_)(i) respectively. rank_((i) and rank_)(i) return the number of '(' and ')' respectively, up to and including position i in U. To obtain constant time we create two tables, one for each operation, which contain the answer for every i. This requires two tables of size |U|.
Select
Similarly to rank, the select operation is split into select_((x) and select_)(x). select_((x) returns the position of the opening parenthesis with rank x, and select_)(x) is defined similarly for closing parentheses.

With this definition of the select operation the following relations with the rank operation should be noted: rank_((select_((x)) = x, and select_((rank_((i)) = i if U[i] = '('. The same relations are valid for closing parentheses.

Select is also implemented using a table for each case. The tables have |U|/2 entries each since they only need to hold information about their respective parenthesis type.
Find
The find operation returns the position of the open parenthesis matching a closed parenthesis and vice versa. The operation is defined as two separate operations by Benoit et al. [1] and is therefore also implemented as two separate functions, find_((i) and find_)(i), given an index i. However, these return the value from the same entry in the same table, given the same parameter. This is partly to be faithful to the original definition and partly to increase the legibility of the code, since we imply which parenthesis type we originate from depending on which operation is called.

The table has |U| entries which hold the position of the matching parenthesis given its index.
Excess
Excess is the first operation where no additional auxiliary data structure is needed to guarantee constant-time execution. The operation is quite simple and is defined as

excess(i) = rank_((i) − rank_)(i).

This can be translated as the number of open parentheses up to and including position i minus the number of closed parentheses up to and including position i in U. Since rank_( and rank_) are constant-time operations, excess must also be a constant-time operation.
Enclose
The enclose operation is defined as taking an opening parenthesis at position i and returning the opening parenthesis of the matching pair of parentheses which most tightly encloses the pair at position i. This operation is, again, implemented using a table containing the answer to all possible queries, thereby reducing it to a simple table lookup and ensuring a constant-time operation. The implementation further allows querying a closing parenthesis, but the opening parenthesis of the enclosing pair is still the position returned.
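The table-based approach used for these basic operations can be sketched as a single pass over U that fills every table at once, after which each operation is a plain list lookup; the sketch below uses 1-based positions as in the text and is a simplified stand-in for the succinct o(n)-bit structures in the literature.

def preprocess(U):
    """Precompute answer tables for rank, select and find on a
    balanced parenthesis string U (1-based positions)."""
    rank_open, rank_close = [0], [0]
    select_open, select_close = [None], [None]   # entry 0 unused
    find = [None] * (len(U) + 1)
    stack = []
    for i, ch in enumerate(U, start=1):
        rank_open.append(rank_open[-1] + (ch == "("))
        rank_close.append(rank_close[-1] + (ch == ")"))
        if ch == "(":
            select_open.append(i)
            stack.append(i)
        else:
            select_close.append(i)
            j = stack.pop()
            find[i], find[j] = j, i              # matching pair (j, i)
    return rank_open, rank_close, select_open, select_close, find

# excess(i) then falls out as rank_open[i] - rank_close[i]; an enclose
# table can be filled in the same pass from the entry below the stack top.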
3.2.3 Required Constant-time Operations
While the basic operations described in the previous section allow navigation around the DFUDS representation, a layer of more complex operations is needed for navigating the topology of the ordered tree encoded in the representation. This section will focus on the required constant-time operations which are needed to simulate the DAWG according to Do and Sung [3].

As with the basic operations, the implementation will not be faithful to the low theoretical space consumption presented by Do and Sung [3]. Instead it will use a number of simpler constructions to simplify the implementation while still upholding the needed time complexity for each operation. This should result in a solution which follows the spirit of the theory while still being practical to implement.
Parent
The parent of a node encoded in the DFUDS U can be found in constant time using the basic constant-time operations described earlier without any additional data structures. First we ensure that the queried node at position u is not the root node, which is the only node in the tree with no parent. Since the root is always represented at position one, this check is trivial.

If u is not the position of the root, we find the opening parenthesis that matches the closing parenthesis which appears before the queried position. This puts us within the description of the parent, but does not necessarily give us the leftmost parenthesis which represents the parent of the node at position u. It is therefore necessary to find the first closing parenthesis from this position, and then move one position forwards. This is the parent node's leftmost parenthesis, and therefore the answer.

To be specific, if u is not the position of the root, the parent of the node whose description begins at u is found by the following combination of basic operations:

parent(u) = select_)(rank_)(find_((u − 1))) + 1.
Leaf-rank
The leaf-rank definition is closely related to the original rank operation's definition. Given a DFUDS U, each leaf can be found by the occurrence of the pattern '))', with the right parenthesis being the representation of the leaf. Moreover, leaves appear in the sequence according to their pre-order numbering. This means that leaf-rank can be interpreted as rank_))(u), i.e. the rank operation where the pattern ')' is replaced by '))'. The operation returns the number of leaves up to and including the queried position given a pre-order traversal of the topology.

Since this operation is very similar to the original rank operation, we use the same approach to ensure constant time complexity. This means maintaining a table with the answer to every possible query u.
Leaf-select
This operation can, for a given value i, also be interpreted as the simple select operation given the new pattern '))', i.e. select_))(i), which returns the position of the i'th leaf in the DFUDS. This operation is implemented similarly to the original select operation, which means we once again simply maintain a table with answers to all possible queries. This also ensures constant time complexity.
Leftmost-leaf
Given the DFUDS U and a node identified by its leftmost parenthesis v in U
the leftmost-leaf operation returns the leftmost-leaf of the subtree rooted in v.
This is achieved, in constant time, by the following combination of leaf-rank
and leaf-select operations:
leftmost-leaf(v) = leaf-select(leaf-rank(v − 1) + 1).
This operation basically starts by finding the number of leaves before the
subtree rooted in v and then selects the next leaf in the order, which in turn is
the leftmost-leaf for the subtree.
Rightmost-leaf
Like its leftmost counterpart, the rightmost leaf is also found using a combination of the operations described earlier and is arguably even simpler:

rightmost-leaf(v) = find_)(enclose(v)).

Because the subtree rooted in v is not balanced, due to the leftmost opening parenthesis being omitted, it is necessary to find the opening parenthesis which encloses the entire node, and then find its closing parenthesis. This is then the representation of the rightmost leaf in the subtree. Once again we only use constant-time operations twice, resulting in a constant-time operation.
Lowest Common Ancestor
Given two nodes v and u, the lowest common ancestor operation lca(v, u) returns the common ancestor of v and u with the greatest depth in the tree represented by the DFUDS. Jansson et al. [9] showed that it is possible to extract this information using the excess sequence E, where each entry i is defined as E[i] = excess(i). This sequence is only conceptual, since we have access to the constant-time excess operation.

To extract the lowest common ancestor information out of the excess sequence E, an additional operation called range minimum query on E is needed, denoted RMQ_E. The operation takes two arguments i, j and returns the position of the element with the smallest value in E[i . . . j]. If there is a tie, the leftmost element is returned.

Given two nodes x and y in the DFUDS, where x < y and x is not an ancestor of y, let i = select_)(x) and j = select_)(y), i.e. the position of the closing parenthesis of the representation of each node. If the nodes are leaves, then i = x and j = y. The lowest common ancestor can then be found using the following approach:

lca(x, y) = parent(RMQ_E(i, j − 1) + 1).

Using an O(n(log log n)^2 / log n)-bit auxiliary data structure as described by Jansson et al. [9], it is possible to support RMQ_E(i, j) in constant time. However, this data structure is, again, quite impractical to implement. Instead the implementation uses a sparse table with a space consumption of O(n log n) to support constant-time range minimum queries on E.
The sparse table creates an n × ⌈log n⌉ table M, where each entry M[i, j] corresponds to the index of the minimum value in the subsequence E[i . . . i + 2^j − 1]. The recursive definition for calculating each entry is therefore

M[i, j] = M[i, j − 1]                if excess(M[i, j − 1]) ≤ excess(M[i + 2^(j−1), j − 1]),
M[i, j] = M[i + 2^(j−1), j − 1]      otherwise.
Figure 3.3: Finding the lowest common ancestor between the nodes positioned at 10 and 16 in the DFUDS for "ATACTC$". Nodes are labelled with their representative index in the DFUDS.
When the table has been created we can compute RMQ_E(i, j) by selecting the two entries in the table which correspond to the two subsequences that fully cover the subsequence E[i . . . j]. The entry which corresponds to the smallest excess value is then returned. Let k = ⌊log(j − i + 1)⌋; the equation for calculating RMQ_E(i, j) is then

RMQ_E(i, j) = M[i, k]                if excess(M[i, k]) ≤ excess(M[j − 2^k + 1, k]),
RMQ_E(i, j) = M[j − 2^k + 1, k]      otherwise.
Given a sparse table of E, RMQ_E(i, j) is a constant-time operation since it is merely a constant number of lookups.
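A sketch of the sparse table follows, using 0-based indices; resolving ties directly with ≤ comparisons already yields the leftmost minimum, so this simplified form does not need the reversal trick discussed next.

def build_rmq_table(E):
    """M[j][i] = index of the minimum of E[i .. i + 2**j - 1];
    ties resolve to the leftmost index throughout."""
    n = len(E)
    M = [list(range(n))]                 # ranges of length 2**0 = 1
    j = 1
    while (1 << j) <= n:
        prev, half, row = M[j - 1], 1 << (j - 1), []
        for i in range(n - (1 << j) + 1):
            l, r = prev[i], prev[i + half]
            row.append(l if E[l] <= E[r] else r)
        M.append(row)
        j += 1
    return M

def rmq(E, M, i, j):
    """Index of the leftmost minimum of E[i .. j] (inclusive), answered
    with two overlapping constant-time table lookups."""
    k = (j - i + 1).bit_length() - 1     # floor(log2(j - i + 1))
    l, r = M[k][i], M[k][j - (1 << k) + 1]
    return l if E[l] <= E[r] else r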
While the sparse table gives us the required time complexity with a reasonable space complexity, it creates one small problem. In the event of a tie, it returns the rightmost element instead of the leftmost, which the operation requires. This is corrected by creating the sparse table on the reversed excess sequence and reversing the queries i and j so the ranges correspond to the correct regions. When a result has been extracted it is mapped back to the correct index in the original sequence. These transformations are calculated in constant time since each position i in the original sequence can be mapped to its reverse, rev(i) = length − i, in constant time.
Depth
A depth query depth(v) returns the number of ancestors of a node v and can be implemented in a number of ways. Jansson et al. [9] present a two-level data structure over the DFUDS to represent the depth information of the nodes. This data structure makes it possible to answer a depth query in constant time without the need of auxiliary structures.

However, this data structure is quite impractical to implement, so for the purpose of this thesis the depth operation is implemented using an auxiliary table containing the answer to every possible query. These answers are obtained by traversing the topology of T, which is encoded in U, in a depth-first order, saving each node's depth in a table according to the index of its representation in the DFUDS.
Level Ancestor
A level ancestor query level-ancestor(u, d) of a node u and a depth d returns the ancestor of u with depth d in the tree. Jansson et al. once again present an approach with a low space consumption and constant query time. Again, we choose to implement the operation naively, since a succinct implementation of a DFUDS is not the focus of this thesis.

The implementation visits each node and traverses up the tree, saving the position and depth of each ancestor in a table belonging to the specific node. This method obviously uses a generous amount of space, since each node has a table containing all its ancestors. However, it does provide us with a means to answer a level ancestor query in constant time by two table lookups.
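A sketch of the two naive tables: one traversal fills a depth table and, for each node, the full list of its ancestors indexed by depth, so both depth(v) and level-ancestor(u, d) become single lookups. The parent-map input and abstract node identifiers are assumptions made for the example.

def build_depth_tables(parent, root):
    """depth[u] and ancestor[u][d] (the ancestor of u at depth d, with
    ancestor[u][depth[u]] = u itself) for every node; the per-node
    ancestor lists can use considerably more space than the other tables."""
    children = {}
    for u, p in parent.items():
        children.setdefault(p, []).append(u)
    depth, ancestor = {root: 0}, {root: [root]}
    stack = [root]
    while stack:
        u = stack.pop()
        for v in children.get(u, []):
            depth[v] = depth[u] + 1
            ancestor[v] = ancestor[u] + [v]
            stack.append(v)
    return depth, ancestor

# level-ancestor(u, d) is then simply the lookup ancestor[u][d].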
We have now defined the DFUDS representation and given it a number of basic and more complicated operations which enable us to efficiently navigate the ordered tree topology which it encodes. All operations have constant time complexity after the DFUDS and its underlying data structures have been built. The space complexity is largely determined by two constructions: the sparse table used by the lca(x, y) operation, and the tables containing information regarding each node's ancestors used by the level-ancestor(u, d) operation. These both have O(n log n) space complexity, which dominates the other O(n) data structures, giving an O(n log n) space complexity for the implemented DFUDS.
3.3 FM-index
This section will present the FM-index which will be responsible for labelling
the edges between nodes in the simulated DAWG. It was developed by Ferragina
and Manzini [6] with the purpose of creating a Full-text index which permits
fast substring querying, yet which can be compressed considerably.
While it is possible to implement the FM-index to reduce the space consumption considerably, by using different compression techniques, this thesis
will forego this approach and settle for implementing an FM-index which follows the spirit of the theory described by Mäkinen and Navarro [13], yet allows
for a larger space consumption.

F           L
$ C T C A T A
A $ C T C A T
A T A $ C T C
C A T A $ C T
C T C A T A $
T A $ C T C A
T C A T A $ C

Figure 3.4: The conceptual M matrix for the string "CTCATA$"; the first column is F and the last column is L.
The section will first define the required components of the FM-index and will then present solutions to implement these components. Moreover, this section will present how the FM-index has been implemented for use in the experiments in chapter 4. Lastly, a number of operations needed to simulate the DAWG data structure later on will be presented.
3.3.1 Definition
Let S be a string whose terminating character is the special end character $ ∈ Σ, defined to be the lexicographically smallest element in Σ. An FM-index of S is then created by implementing three components: the Burrows-Wheeler transform of S, a table C and a function occ(c, i).

Let L denote the Burrows-Wheeler transform (BWT) of S. L is built by arranging every possible cyclical shift of S according to their lexicographical order and then concatenating the ending character of each row into the string L. Each row of the conceptual matrix M, shown in figure 3.4, can be seen as an entry in the suffix array of S, if one only considers the string up to and including the end marker $. This transformation is implemented by using the suffix array of S to give us the order of the rows; the last character in each row is then the character which appears in the position before the starting character of that row. Since the DFUDS defined in section 3.2 requires the suffix tree to create its representation, the FM-index merely extracts the suffix array from this construction instead of calculating the order again.
The second component, the table C, is a simple lookup table for which, given the alphabet Σ of the BWT L and a character c ∈ Σ, the entry C[c] contains the number of characters in L with lexicographically lower value than c. This construction enables us to efficiently compute where the first occurrence of a character c takes place in the first column of the conceptual matrix M seen in figure 3.4.
This component is implemented using a dictionary structure which, given a key c, returns the value as described above. The value of each entry is inserted into the dictionary by a simple traversal which counts the number of elements lexically smaller than its key. The dictionary structure returns the value in O(log(container size)) time, where the size of the container equals the size of our alphabet; however, since we will restrict our experiments to an alphabet of size 5, i.e. Σ = {$, A, C, G, T}, this should not affect the time complexity.

Figure 3.5: The Wavelet Tree for the string "ATCT$AC". At the root, {$, A, C, G, T} is encoded as the bits {0, 0, 1, 1, 1}, giving the bit-vector 0111001; the left child holds "A$A" over {$, A} and the right child holds "TCTC" over {C, G, T}, with leaves for {C} and {G, T}. The strings are shown for convenience but are not stored.
The last component needed to complete the FM-index is an implementation of the function occ(c, i). The function takes a character c and an index i as arguments and returns the number of occurrences of c in the BWT L up to and including position i. This component is implemented using the Wavelet Tree data structure, which was first introduced by Grossi et al. [8].

A Wavelet Tree is a recursively defined data structure which partitions the input string in two, for each step down the tree, according to the partitioning of its alphabet Σ. Given an alphabet Σ for the string L, let Σ0 be the alphabet containing the first half of Σ and Σ1 the second half of Σ. Denote each character c ∈ L with a bit 0 or 1 depending on whether c ∈ Σ0 or c ∈ Σ1. The resulting bit-vector is stored in the root node. Partition L into two strings, L0 with the alphabet Σ0 and L1 with Σ1, which contain the characters denoted by 0- or 1-bits respectively. Repeat the process described above until the alphabet of the new nodes contains two or fewer elements.

Using only the bit-vector as seen in figure 3.5 makes a traversal of the string necessary in order to answer rank and select queries, so other auxiliary data structures are usually used. The partitioning remains the same, however.
With the Wavelet Tree it is possible to answer rank(c, i) and select(c, x) queries in O(log |Σ|) time using a succinct representation of each node's bit-vector, which requires only nH0 + o(n log |Σ|) bits of space as described by Mäkinen and Navarro [12]. For the purpose of this thesis, however, the implementation will merely use two tables per node to contain the rank information for 0- and 1-bits respectively. This allows us to make rank queries on each node in constant time.
The rank(c, i) query is reduced to a traversal down the tree. For every node we identify which bit encodes c and make a rank query i_next = rank(encoding(c), i). We then go to the child node which represents the encoding of c and continue querying, i.e. rank(encoding(c), i_next). When a leaf is reached, its rank query is the result returned. This method also ensures a time complexity of O(log |Σ|); however, it uses quite a bit more space than the solution presented by Mäkinen and Navarro [12].

The rank(c, i) query is then the occ(c, i) function on the FM-index.

Figure 3.6: Querying rank('C', 5) on the wavelet tree of the string "ATCT$AC": rank(1, 5) = 3 at the root, then rank(0, 3) = 1 at the node for {C, G, T}, then rank(0, 1) = 1 at the leaf for {C}, so rank('C', 5) = 1.
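The traversal can be sketched as follows; the class stores explicit per-node rank tables as in the implementation, but recurses down to single-character leaves for simplicity rather than stopping at two-character alphabets, and the names are illustrative. Called on the root, wavelet_rank plays the role of occ(c, i).

class WaveletNode:
    def __init__(self, text, alphabet):
        self.alphabet = alphabet
        self.left = self.right = None
        if len(alphabet) <= 1:
            return                                   # leaf node
        half = alphabet[: len(alphabet) // 2]
        bits = [0 if c in half else 1 for c in text]
        self.rank0, self.rank1 = [0], [0]            # per-node rank tables
        for b in bits:
            self.rank0.append(self.rank0[-1] + (b == 0))
            self.rank1.append(self.rank1[-1] + (b == 1))
        self.left = WaveletNode([c for c in text if c in half], half)
        self.right = WaveletNode([c for c in text if c not in half],
                                 alphabet[len(alphabet) // 2:])

def wavelet_rank(node, c, i):
    """Occurrences of c in the node's text up to and including
    position i (1-based); equals occ(c, i) when called on the root."""
    if node.left is None:            # leaf: every character equals c
        return i
    if c in node.left.alphabet:
        return wavelet_rank(node.left, c, node.rank0[i])
    return wavelet_rank(node.right, c, node.rank1[i])

root = WaveletNode("ATCT$AC", "$ACGT")
assert wavelet_rank(root, "C", 5) == 1   # the query from figure 3.6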
The select(c, x) query is similar to the rank(c, i) query. We also maintain
two select tables for each node in the Wavelet Tree which contain the answer to
every possible select query on the 0-, and 1-bits. However instead of querying
at the root and propagating down the tree, this approach starts at the leaf node
whose alphabet contains c and then propagates up the tree. An example can
be seen in figure 3.7. The time complexity for the select(c, x) query is the same
as rank(c, i), being O(log |Σ|).
Now that all the components required for the FM-index have been presented, it is possible to unveil an important property called Last-to-First mapping LF(i). The LF-mapping allows us to find where the occurrence of L[i] takes
place in the first column of the conceptual matrix M , an example of which can
be seen in figure 3.4. This is done by the following definition.
LF(i) = C[L[i]] + rank(L[i], i)
The approach is quite simple. To retrieve the position of L[i] in F it is first necessary to find out where the section containing the character L[i] begins in F. This is done by using the table C, since we know that C[c] + 1 indicates the first occurrence of c in F. Instead of choosing the first occurrence, we choose the same numbered occurrence as the character L[i] has in L. Since L[i] precedes F[i] in the original string S, we know that the characters appear in the same relative order in L and F, i.e. the first occurrence of any character in F is also the first occurrence of that character in L. By using the rank(L[i], i) function we find which numbered occurrence we have in L and therefore in F.

Figure 3.7: Querying select('T', 2) on the wavelet tree of the string "ATCT$AC": select(1, 2) = 2 at the leaf for {G, T}, then select(1, 2) = 3 at the node for {C, G, T}, then select(1, 3) = 4 at the root, so select('T', 2) = 4.
The LF-mapping is used by some of the operations defined in the next
section and is an essential property of the FM-index.
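Tying the components together, the following self-contained sketch derives L from the suffix array, builds the C table by counting, and evaluates LF(i); a direct scan over L stands in for the wavelet-tree rank.

def bwt_from_sa(S, SA):
    """L[i] is the character preceding the suffix starting at SA[i]
    (1-based); SA[i] = 1 wraps around to the end marker's row."""
    return "".join(S[sa - 2] for sa in SA)   # S[-1] wraps when sa == 1

def c_table(L):
    """C[c] = number of characters in L lexically smaller than c."""
    return {c: sum(1 for x in L if x < c) for c in set(L)}

def lf(L, C, i):
    """LF(i) = C[L[i]] + rank(L[i], i): the row of F holding the
    same character occurrence as L[i]."""
    c = L[i - 1]
    return C[c] + L[:i].count(c)

S, SA = "CTCATA$", [7, 6, 4, 3, 1, 5, 2]
L = bwt_from_sa(S, SA)
assert L == "ATCT$AC"                # the L column of figure 3.4
assert lf(L, c_table(L), 1) == 2     # L[1] = 'A' maps to F[2]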
3.3.2 Operations
Having defined the FM-index's construction and its core property, the LF-mapping, we can now implement a number of operations which are needed to support the simulated DAWG. Additionally, this section will present an approach to simulate the Ψ-table defined by Do and Sung [3], which is used later on in the implementation of the suffix-link(u) operation.
Backward Search
The backward search needed for the simulated DAWG is derived from the FMCount algorithm defined by Mäkinen and Navarro [13]. The FMCount algorithm will therefore be presented first, and then adjusted to fit the description of the operation backward-search(st, ed, c) used by Do and Sung [3].

The FMCount algorithm uses the FM-index to count the number of times a pattern P occurs in the original string S. This is done by searching backwards through P[1 . . . p] using the C table and rank(c, i). At each iteration i, st points to the first row with the prefix P[i . . . p] and ed points to the last row with the prefix P[i . . . p], in the conceptual matrix M. When the entire string P has been processed, the size of the range between st and ed is the number of suffixes of S with the prefix P, i.e. the number of occurrences of P. This algorithm takes O(|P| log |Σ|) time, since each character in P is processed in time O(log |Σ|) due to the time complexity of the rank operation.

We adjust this algorithm to search for only a single character c within a specified range st to ed, all given as arguments; hence we denote the operation backward-search(st, ed, c). The range [st . . . ed] denotes some shared prefix (which may be the empty string), and the operation finds the range which shares the prefix c concatenated with the previous shared prefix. The operation is shown in algorithm 2 and takes O(log |Σ|) time, since the rank operation of the FM-index takes O(log |Σ|). However, since our experiments will focus on alphabets of size 5, this can be viewed as constant time.
Algorithm 2 Backward Search
procedure Backward-Search(st, ed, c)
sp ← C[c] + rank(c, st − 1) + 1
ep ← C[c] + rank(c, ed)
return sp, ep
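Chaining backward-search over a pattern read right to left reproduces FMCount. The sketch below is self-contained for the running example; the direct scan over L is an illustrative stand-in for the wavelet-tree occ(c, i).

def backward_search(C, occ, st, ed, c):
    """Narrow the row range [st, ed] sharing some prefix x to the
    range of rows sharing the prefix c + x (Algorithm 2)."""
    sp = C[c] + occ(c, st - 1) + 1
    ep = C[c] + occ(c, ed)
    return sp, ep

def fm_count(C, occ, n, P):
    """FMCount: the number of occurrences of P in S, obtained by
    processing P backwards starting from the full range [1, n]."""
    st, ed = 1, n
    for c in reversed(P):
        st, ed = backward_search(C, occ, st, ed, c)
        if st > ed:
            return 0
    return ed - st + 1

L = "ATCT$AC"                         # BWT of "CTCATA$" (figure 3.4)
C = {"$": 0, "A": 1, "C": 3, "T": 5}
occ = lambda c, i: L[:i].count(c)     # direct-scan stand-in
assert fm_count(C, occ, len(L), "TA") == 1   # "TA" occurs once in S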
Lookup
The lookup operation, lookup(i), is defined by Do and Sung [3] as returning the i'th entry of the suffix array of S, denoted AS. The time complexity of the operation is stated as O(log n); however, no implementation details are presented, nor does the cited source offer any insight as to how this operation is implemented within the specified time constraint. This implies that the implementation is based on the approach presented by Mäkinen and Navarro [12], which stores parts of the suffix array AS and uses this information to infer AS[i] in O(log n) time.

The actual implementation has access to the suffix array when the FM-index is built. This enables us to save a partial sample A′S of AS for later lookups. The sampling is done at log n intervals, so that A′S contains O(n/log n) entries. By using a hash-map structure we can achieve amortized constant-time access to the elements. This is done by applying a hash function to a key, the index of an element in AS in this case, which maps the element to a bucket in the hash-map. If the bucket only contains one element then the query can be answered in constant time; otherwise it has to iterate over all elements of the bucket to find the correct key. By allocating a large number of buckets these non-constant lookups should be rare.
To utilize the sampled suffix array A′S for a query lookup(i), we move backwards in S by using the LF-mapping until we find an entry in A′S which has a value. We can then infer, from the value and from how many iterations it took to find it, the value of AS[i]. This operation can be seen in algorithm 3.

The entire approach is based on the knowledge that L[i] occurs immediately before F[i] in S. This allows the operation to step backwards through S until it finds an entry in the sampled suffix array A′S. The difference is then merely the number of steps that were taken across S.

Since the suffix array AS is sampled at each log n interval, the worst-case number of steps needed to reach an i′ which has a value in A′S is O(log n), which is also the time constraint stated by Do and Sung [3].
Algorithm 3 Lookup
procedure Lookup(i)
    i′ ← i
    t ← 0
    while i′ ∉ A′S do
        i′ ← C[L[i′]] + rank(L[i′], i′)    ⊲ LF(i′), which equals one step backwards in S
        t ← t + 1
    return A′S[i′] + t
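A sketch of the sampled lookup follows. Here rows whose suffix array value lies on the sample grid are kept, which both keeps AS[i] = 1 sampled (so the LF walk never wraps past the end marker) and bounds the walk by the sample rate; a plain dictionary stands in for the hash-map A′S.

import math

def build_lookup(S, SA, L, C):
    """Return a lookup(i) closure over a sampled suffix array;
    rows with (SA[i] - 1) % rate == 0 are kept."""
    rate = max(1, math.ceil(math.log2(len(S))))
    sample = {i: sa for i, sa in enumerate(SA, start=1)
              if (sa - 1) % rate == 0}
    def lf(i):                        # direct-scan LF-mapping
        c = L[i - 1]
        return C[c] + L[:i].count(c)
    def lookup(i):
        t = 0
        while i not in sample:        # step backwards through S
            i = lf(i)
            t += 1
        return sample[i] + t
    return lookup

S, SA = "CTCATA$", [7, 6, 4, 3, 1, 5, 2]
L, C = "ATCT$AC", {"$": 0, "A": 1, "C": 3, "T": 5}
lookup = build_lookup(S, SA, L, C)
assert all(lookup(i) == SA[i - 1] for i in range(1, 8))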
Simulated Ψ-table
To enable the implementation of the suffix-link(u) operation in the simulated DAWG, the FM-index needs to support the Ψ-table definition in constant time. The Ψ-table is defined by Do and Sung [3] as follows:

Ψ[i] = i′ if AS[i′] = AS[i] + 1, and Ψ[i] = 0 otherwise.
The implementation used is derived from a method introduced by Golynski et al. [7] which is used to decode text after a certain position in a suffix array. While the article uses succinct data structures instead of the FM-index, which obfuscates the approach considerably, these data structures hold the same information as our FM-index. For example, the select_X(c) operation on the bit vector X is equivalent to the C table in the FM-index.
Algorithm 4 Simulated Ψ
procedure Ψ(i)
if i = 1 then
return 0
else
c←$
for a ∈ Σ do
if C[a] < i then
c←a
break
i′ ← select(c, i − C[c])
return i′
⊲ Begin with the lexically largest
⊲ Correct c found
⊲ Use select operation to find correct i′
The only possible scenario where the case AS [i′ ] = A[i]+1 is not solvable for
any i′ , is if A[i] points to the end marker in S. However, since the end marker $
is the lexically smallest element in S we know that it is always found at AS [1],
we therefore simply check whether the argument is 1, and return 0 if this is
the case. If this is not the case we first try to deduce which character entry
AS [i] begins with. This is done by iterating over the alphabet Σ in reverse and
finding the entry in C which is smaller than i. This entry indicates that the
associated key is the beginning character of AS [i]. When c is found, we apply
26
the derived calculations from [7], thereby obtaining i′ , which we return.
This operation requires an iteration over the entire alphabet in worst case,
yielding a time complexity of O(|Σ|), however, as stated previously, the alphabet size will be fixed during the experiments, and can therefore be viewed as
constant.
3.4
Simulating a Directed Acyclic Word Graph
The chapter began by defining the Directed Acyclic Word Graph data structure
and then introduced the DFUDS and FM-index data structures and a number
of operations to interact with them. This section will merge the two data
structures into a single simulated Directed Acyclic Word Graph data structure
and introduce a number of operations which enable efficient navigation of the
data structure. Lastly the section will present the method used for computing
the local alignment given a DAWG and a pattern P [1 . . . m].
3.4.1
Merging Data Structures
In this section it will be shown that we can simulate the DAWG DS using only
the DFUDS and FM-index. To do this we will first present a way of simulating
the suffix tree TS using the DFUDS and FM-index of the reversed string S. We
then present a one-to-one mapping between the suffix tree of TS and the DAWG
DS which, together with the simulation of TS , gives us a way of simulating DS
by using the merged DFUDS and FM-index data structures.
Simulating TS
In order to show how to simulate the suffix tree TS using the DFUDS and the
FM-index we will first make some observations regarding the underlying data
structures they represent. The DFUDS representation is a succinct encoding of
a suffix tree’s topology, and the FM-index is essentially a highly compressible
suffix array.
Given the suffix tree TS and a suffix array AS of a string S let the nodes of
TS be ordered lexicographically according to the edge labels. This results in the
same ordering of suffixes, which are represented by the leaf nodes, as the entries
in AS . Put in other words, we define the rank of a leaf as its numbering when
we visit the leaves from left to right. The i’th ranked leaf is then a one-to-one
mapping to the entry AS [i]. An example of this mapping can be seen in figure
3.8.
Given a node x in the suffix tree TS , let u and v be the leftmost and rightmost
leaves respectively in the subtree rooted in x. The suffix range of the suffix
label(x) in AS is then rank(u) to rank(v).
Since every node in the TS is a range in the suffix array, we do not actually
need to store information on the edges of the suffix tree to know what suffix
range a node represents via their path label. Since the FM-index contains
the same information as the suffix array, we can utilize it together with the
topological information from the DFUDS, to simulate the suffix tree.
27
1
$
A
C
7
6
1
T
12
CTC$
TACTC$
$
17
TC$
ACTC$
C$
10
11
15
16
20
21
2
3
4
5
6
7
1
( (((()
6
7
)
(()
7
10
11
)
)
3
1
12
(()
15
16
)
)
6
4
17
(()
20
21
)
)
2
5
Figure 3.8: The suffix tree and its DFUDS representation of the string "ATACTC$"
along with each leaf mapping to the suffix array.
Mapping from TS to DS
To define a mapping from the suffix tree TS to a DAWG data structure DS , Do
and Sung [3] presents the function γ as [label(u)]S for every non-trivial node
u in TS . A non-trivial node is any node for which the label between it and its
parent is not merely the end marker $.
To illustrate this mapping we will look at the non-trivial nodes in the suffix
tree TS seen in figure 3.8, and see how they map to the DAWG seen in 3.9. The
first obvious mapping is the root node of US to the source node of DS . Using
the mapping function γ on the root node results in [λ]S = [λ]S , which is the
end-set containing all elements. This is the same as the definition in 3.1.1.
Going a step further we see that all nodes with a single character in their
path label in TS is mapped to a node with a single character to the source node
in DS . This is an obvious mapping as [c]S = [c]S for any c ∈ Σ.
Lastly we will look at the node represented in the DFUDS as 20 with the
path label "TACTC$". It is mapped using the γ function to [TACTC]S =
[CTCAT]S = {AT, CAT, TCAT, CTCAT}. It should be noted that the end
marker is never present in the simulated DAWG since it is not considered a
part of the alphabet of DS . The fact that we never encounter an end marker
in the simulated DAWG is due to the definitions of the navigation operations,
which only operate on alphabets without the end marker. This is also why
trivial nodes are never mapped over to DS .
As the function is a one-to-one mapping from the suffix tree TS to DS ,
it does not change the actual representation of the node. Each node is still
represented by its leftmost parenthesis in the DFUDS representation. The
28
1
A
T
C
12
C
7
A
T
17
T
21
CT
A
C
C
16
TC,CTC
T
A
10
CA
A
T
20
AT, CAT,
TCAT, CTCAT
A
11
TA, ATA, CATA,
TCATA, CTCATA
Figure 3.9: DAWG of "CTCATA" with each sets path labels. Each node is marked
with its DFUDS representation.
mapping merely allows us to change the node’s path label in S into the correct
end-set equivalence class of DS for S.
3.4.2
Navigation
While the mapping function enables us to regard the non-trivial nodes of the
DFUDS as nodes in the DAWG data structure DS , we still have means of
navigating between these nodes in DS , as the implicit edges in the DFUDS
do not match the edges needed in DS . To navigate the DAWG based on the
mapped DFUDS nodes we need some basic operations. For this we introduce
the four following constant-time operations.
• Get-Source() - Returns the source (or root) node of the DAWG DS .
• Find-Child(v, c) - Returns the child of v in DS where c(v,u) = c. Nil is
returned if no such child exists.
• Parent-Count(u) - Returns the number of parent nodes of the node u.
• Extract-Parent(u, i) - Returns the i’th parent of u. It is of course assumed
that 1 ≤ i ≤ Parent-Count(u).
29
Get-Source
As defined in section 3.1.2 the source node of the DAWG DS is the node which
has the end-set equivalence class of the empty string, i.e. [λ]S . This can be
interpreted as the node in the DFUDS US , which has the empty path label,
and therefore has the suffix range of the entire suffix array. This property only
holds for the root node in the suffix tree topology, and therefore the root node
of the DFUDS is the same as the source node in the DAWG DS , and since we
know from 3.2 that the root node of the DFUDS is always in position 1, the
implementation is trivial.
Algorithm 5 Get-Source
procedure Get-Source()
return 1
⊲ Position of root node in US
Find-Child
For any node v in DS and a character c ∈ Σ the find-child operation return
the node u which has an edge (v, u) with the label c. The approach first takes
the suffix range of v, (st, ed), as we have seen earlier, this range is found by
finding the leftmost- and rightmost-leaf of the subtree rooted in the node v,
so st = leftmost-leaf(v) and ed = rightmost-leaf(v). Then, given the range
and the character c, the FM-index can give us the suffix range (sp, ep) of the
concatenated string clabel(v). If this range is not empty, a node must exist for
which sp is its leftmost-leaf and ep is its rightmost-leaf, querying for the lowest
common ancestor on their representation (which is found with leaf-select) should
then yield the node u which has clabel(v) as its path label in the suffix tree TS .
By using the mapping function we get γ(v) = [label(v)]S and γ(u) =
[label(v)c]S , which tells us that they do not belong in the same end-set equivalence class. By using the definition of DS in 3.1.2 we get that (γ(v), γ(u)) =
([label(v)]S , [label(v)c]S ) is an edge in DS with the label c. So we return the
representation of u in the DFUDS.
If the backwards-search operation returns an empty range, then the substring clabel(v) does not appear in S, which in turn means that label(v)c does
not appear in S. This means that the end-set equivalence class [label(v)c] does
not exist, and therefore there is no child u of v with an edge label c in DS .
Since all the operations used are computed in constant time, as explained
in their respective sections, the complete time complexity of the Find-Child
operation also has a constant time complexity.
Suffix-Link
Before presenting the Parent-Count(u) operation, we will first introduce the
suffix-link(u) implementation, as it is an essential part in both parent operations.
Given the Ψ-table which we defined in 3.3.2, Sadakane [16] then offers the
following method for computing the suffix link of a node u in constant time.
30
Algorithm 6 Find-Child
procedure Find-Child(v, c)
st ← leftmost-leaf(v)
ed ← rightmost-leaf(v)
sp, ep ← Backward-Search(st, ed, c)
if ep − sp + 1 > 0 then
return lca(leaf-select(sp), leaf-select(ep))
else
return 0
Algorithm 7 Suffix-Link
procedure Suffix-Link(u)
l ← leaf-rank(leftmost-leaf(u))
r ← leaf-rank(rightmost-leaf(u))
l′ ← Ψ[l]
r ′ ← Ψ[r]
return lca(leaf-select(l′ ), leaf-select(r ′ ))
In other words, this method takes the suffix range of u, then essentially
strips the first character in the path label of the two leaves by using the Ψtable, which yields a new suffix range. Calling the lowest common ancestor
operation on the two leaves denoting this suffix range then returns the node
which represents the suffix link of u.
Parent-Count
To enable the parent operations, Do and Sung [3] present three lemmas which
will be used without proof in this thesis. These lemmas rely upon the nature
of the mapping function from the suffix tree TS to DS to compute the parent
operations in constant time.
The Parent-Count(u) operation relies directly on the following three lemmas.
Lemma 1 (Do and Sung [3]). Consider a non-trivial node u such that u is
not the root of TS , let v be u’s parent and x = label(v) and xy = label(u).
For any non-empty prefix z of y, we have γ(u) = [(xy)]S = [(xz)]S . In fact,
γ(u) = {(xz) | z is a non-empty prefix of y}
Lemma 2 (Do and Sung [3]). Consider a non-trivial node u whose parent is
the root node of TS . Suppose suffix-link(u) = b. The set of parents of γ(u) in
DS is {γ(p) | p is any node on the path from b to the root in TS }.
Lemma 3 (Do and Sung [3]). Consider a non-trivial node u whose parent, v,
is not the root node in TS . Suppose suffix-link(u) = b and suffix-link(v) = e.
For every node p in the path from b to e (excluding e) in TS , γ(p) is a parent
of γ(u) in DS .
31
The general idea behind the parent operations is that a node u in the suffix
tree TS , with a label on the form label(u) = axy where a is a single character
and x and y are strings, has a parent in DS , v, which has the label label(v) = xy
in TS . This is of course due to the definition of the mapping function and the
DAWG which together state that (γ(v), γ(u)) = ([xy]S , [axy]S ) is an edge from
γ(v) to γ(u) with edge character a.
Lemma 1 then states that every non-empty prefix, z, of y also generates a
parent to γ(u), i.e. for all non-empty prefix z of y, ([xz]S , [axy]S ) is an edge
in DS if there is a node v where label(v) = xz in TS . To find all the nodes in
TS which fit this description, the algorithm finds a range where all occurrences
of these nodes must be contained. For this the suffix-link operation is used
together with lemma 2 and 3, as seen in algorithm 8.
Algorithm 8 Parent-Count
procedure Parent-Count(u)
if u == 1 then
⊲ If u is the root it has no parent
return 0
else
v ← parent(u)
b ← Suffix-Link(u)
if v == 1 then
⊲ Lemma 2
return depth(b) − depth(v) + 1
else
⊲ Lemma 3
e ← Suffix-Link(v)
return depth(b) − depth(e)
It should be noted that the pseudo-code described by Do and Sung [3] has
small flaw in it where the calculation of the parent range is the wrong way
around, yielding negative ranges. However the basic underlying concepts and
approach do work. A small quirk in the algorithm is that the calculations seem
to make certain trivial leaves a parent of some nodes in DS , even though trivial
leaves should not appear in the simulated DAWG. Do and Sung [3] does not
seem to offer any methods of avoiding this even though it could possibly have
strange side effects. However, since the parent operations are not used in the
local alignment algorithm described later on, the consequences of this will not
be investigated further in this thesis.
Extract-Parent
To extract the actual nodes which correspond to the parents of a node u in
the simulated DAWG DS , we simply take the start of the range of parents, i.e.
b = suffix-link(u) and take the number of steps specified, on the path from b to
the root of TS . So, given an index i | 1 ≤ i ≤ Parent-Count(u), we can extract
the i’th parent of u by simply finding the correct level ancestor of the node b,
which denotes the start of the range of parents, i.e. finding the node which is i
steps from b. Algorithm 9 describes the operation in its entirety.
32
Algorithm 9 Extract-Parent
procedure Extract-Parent(u, i)
b ← Suffix-Link(u)
return Level-Ancestor(b, depth(b) − i + 1)
As with the operation Parent-Count(u) the algorithm presented by Do and
Sung [3] is reversed, yet the fundamental approach is sound, and works in the
implementation, if the trivial leaf quirk described in 3.4.2 is disregarded.
3.4.3
Extracting Information
The simulated Directed Acyclic Word Graph now has a number of operations
which enables us to navigate it efficiently. However, we still lack the ability to
extract information regarding the substrings in S. It is therefore necessary to
implement the following two operations which can be used to extract substring
information on the nodes in DS .
• End-Set-Count(u) - Returns the number of points in the end-set of node
u.
• Extract-End-Point(u, i) - Returns the i’th end-point in u’s end-set. The
operation assumes 1 ≤ i ≤ End-Set-Count(u).
End-Set-Count
The End-Set-Count(u) operation is quite simple. Since every node in DS is a
mapping from the suffix tree TS , and since every node u in TS represents the
suffix range of label(u) in S, we can calculate the number of times the substring
label(u) appears in S, since it is exactly as many times as label(u) appears in
S.
This effectively means that we return the size of u’s suffix range in TS .
Algorithm 10 End-Set-Count
procedure End-Set-Count(u)
st ← leftmost-leaf(u)
ed ← rightmost-leaf(u)
return ed − st + 1
Since the leftmost-leaf(u) and rightmost-leaf(u) are constant-time operations, the End-Set-Count(u) operation is also a constant-time operation.
Extract-End-Point
Given a node u in the DAWG DS we look at its mapping in the suffix tree
TS . Its path label, label(u), then has the starting locations in S defined as
{AS [i] | i = st, . . . , ed}. This means we can extract the starting position of
label(u) in S by reversing the starting points of label(u), i.e. {n + 1 − AS [i] | i =
33
st, . . . , ed}. Since the FM-index includes the lookup(i) operation which returns
AS [i] we can use it to extract an end-point. So for any node u and any index
1 ≤ i ≤ End-Set-Count(u), the end point position can be found by the following,
simple, algorithm.
Algorithm 11 Extract-End-Point
procedure Extract-End-Point(u, i)
st ← leftmost-leaf(u)
return n + 1 − lookup(i + st − 1)
The lookup(i) operation on the FM-index takes O(log n) time which makes
Extract-End-Point(u, i) a worst-case O(log n) time operation.
3.4.4
Computing Local Alignment
Now that we have seen how a simulated Directed Acyclic Word Graph can
be implemented, we will use it to calculate the local alignment score between
a string S and a query string P . We will first define a recursive approach
which will be the foundation to the actual iterative implementation used for
the experiments.
Definition
Let DS = (V, E) be the simulated DAWG data structure, where V is a partitioning of all substrings of S[1 . . . n] as described in section 3.1, and let P [1 . . . m] be
a query string. We then wish to find the local alignment score between the two
strings S and P as it is defined in section 2.1. Since a node u in V represents
the set of path labels from the source node to the node, we can define a string
x as belonging to u if x equals one of these path labels.
The scoring scheme δ for computing the local alignment will be assumed
to return negative values for mismatches and insertion/deletions, and some
value larger or equal to zero for a match. This allows us to use the definition
of a meaningful alignment in 2.1 to later on leave out certain calculations of
alignments which can not be a local alignment as explained in 2.1.
For every 1 ≤ j ≤ m and every node u ∈ DS we wish to define recursively
an entry Nj [u] which maximizes the meaningful score
Nj [u] = max meaningful-score(P [k . . . j], x)
k≤j, x∈u
The recursive definition for which this relation holds is defined by Do and Sung
[3] as the following formula.
∀ j = 0 . . . m,
Nj [λ] = 0
∀ u ∈ V \ {λ},


Reset
N0 [u] = −∞
Nj [u] = filter 
 max



Nj−1 [v] + δ(P [j], c(v,
(v, u)∈E 

Nj−1 [u] + δ(P [j], −)
u) )
N [v] + δ(−, c
j
(v,
34
u) )

Match

Gap in S 

Gap in P
The inner equation calculates which transition yields the best alignment for
every edge (v, u) ∈ E for the node u. To ensure that this is a meaningful score
we then apply the filter function which is defined by the following equation.
filter(a) =
(
a
−∞
if a > 0
else.
Put simply, it ensures that the best transition yields a positive score. If
this is not true, it sets the entry as negative infinity, thereby ensuring that the
meaningless alignment does not affect any further calculations. This ensures
that any meaningless alignment is disregarded as soon as it appears.
When every entry in the table N has been calculated according to the
recursive definition, all we need to do is extract the greatest value which will
be the local alignment score between S and P . It should be noted that we do
not view the source node as representing an alignment, since it represents the
empty string. Likewise we do not consider the empty string in P as a being
valid alignment with any substring in S, so we also disregard N0 [u], which gives
us
local-score(S, P ) =
max
1≤j≤m, u∈V \{λ}
Nj [u]
Implementation
The implementation follows the general algorithm outline specified by Do and
Sung [3]. Just like the Smith-Waterman algorithm in section 2.2.2 we note
from the recursive definition that each entry Nj [u] only requires entries from
the j − 1’th iteration, apart from its own iteration, to be computed. Therefore
we only need to maintain the tables Nj−1 and Nj for each iteration.
Instead of keeping two tables with entries for every node, the implementation uses two hash-maps last and next. Each node is then represented as a
pair (key, value) where the key is the nodes representation, and value is the
entry Nj−1 [u] and Nj [u] for last and next, respectively. The hash-map has
amortized constant-time operations for insertion and search. To help ensure
this by avoiding clashes, and to reduce the likelihood of having to rehash the
tables if they exceed the initial size, which is also time consuming, we reserve
enough memory for each table so that it can hold every node in DS .
The algorithm given by Do and Sung [3] also specifies that in order to
correctly handle a gap transition, the nodes need to be visited in any topological
order. How this order is found, inferred or maintained is not specified, therefore
the implementation initializes a hash-map where, given a node’s representation,
its topological order is returned. The topological order is found using a recursive
depth-first traversal of the DAWG and storing the order numbering of each
node. The traversal is described in pseudo-code in algorithm 12.
Not included in the pseudo-code is the conversion from a list to a hash-map.
This is done with a simple iteration over the list where the node representation
is paired with its order number.
The implemented method can be seen in pseudo-code in algorithm 13.
35
Algorithm 12 Computing Topological Ordering
order ← {}
⊲ Empty list which will contain the ordered nodes
procedure Get-Topology(DS )
while There is an unmarked node v ∈ DS do
Visit(v)
return order
procedure Visit(v)
if v is not marked then
for ∀u | (v, u) ∈ E do
Visit(u)
Mark v
Add v to head of order
The implementation begins by initializing the topology mapping using the
approach described above. Afterwards it initializes N0 [u], which is the last
hash-map, for all nodes u ∈ DS . Since we will only keep entries which are
positive in our hash-maps, this is reduced to giving the source node the value
zero, i.e. N0 [λ] = last[1] = 0. Moreover, since the recursive definition tells us
that the source node always has a score of zero, this is equivalent to the reset
case in Smith-Waterman. We also set next[1] = 0. Lastly we hold the largest
value seen so far in the variable max, which we initialize to negative infinity.
Though several of the steps are quite obvious given the pseudo-code, we will
go through it step by step to be thorough. After initializing the required hashmaps and variables, the implementation iterates over the pattern P . For each
iteration j it first takes each node v in the hash-map last, which equals each
positive entry in Nj−1 [v], and attempts to apply the two cases, match/mismatch
and gap in S. It should be noted that in the case of a match we can propagate
a value from v in last to u in next, since we conceptually take a step in both S
and P . In the case of a gap in S, we only take a step in P , meaning that the
Figure 3.10: The figures show how the three cases propagate values forwards conceptually in the local alignment score algorithm. The left figure shows the match/mismatch
case, the center figure shows the gap in S case and the right figure shows gap in P
case.
36
Algorithm 13 Computing Local Alignment Score
procedure Local-Alignment-Score(DS , P )
order ← Get-Topology(DS )
last ← {(1, 0)}
next ← {(1, 0)}
max ← −∞
for j ← 1, . . . , m do
for v ∈ last do
for c ∈ Σ do
u ← Find-Child(v, c)
if u 6= 0 then
tmpA ← last[v] + δ(P [j], c)
⊲ Match case
if tmpA > 0 & tmpA > next[u] then
next[u] ← tmpA
if tmpA > max then
max ← tmpA
tmpB ← last[v] + δ(P [j], −)
⊲ Gap in S case
if tmpB > 0 & tmpB > next[u] then
next[v] ← tmpB
if tmpB > max then
max ← tmpB
heap ← {}
for v ∈ next do
heap ← (order[v], v)
⊲ Nodes are compared by their order
while |heap| > 0 do
v ← heap.pop()
⊲ Extract and delete lowest order node in heap
for c ∈ Σ do
u ← Find-Child(v, c)
if u 6= 0 then
tmpC ← next[v] + δ(−, c)
⊲ Gap in P case
if tmpC > 0 & tmpC > next[u] then
next[u] ← tmpC
if tmpC > max then
max ← tmpC
last ← next
next ← {(1, 0)}
⊲ Reset condition
return max
37
value is propagated from v in last to v in next, which means we conceptually
remain in the same substring in S, since we’ve chosen to add a gap.
To satisfy the f ilter operation defined in the recursive formula, we ensure
that a score only has a chance of being propagated forward if it is positive. To
ensure that we hold on to the best score at all times, we also ensure that the
score has to be better than the score currently held by the respective node.
Lastly, we update our max variable if a score is greater than any seen before is
calculated.
The second part of the iteration j calculates the case in which a gap in
P leads to a meaningful alignment. However, due to the fact that each entry
in the hash-map next is now influenced by another entry in next, the order
in which we calculate each node’s entry becomes important for the resulting
alignment and its score. Each node u in the hash-map next can therefore first
be calculated once we have calculated the entry for every node v, that has an
outgoing edge (v, u).
This is where the topological ordering of the nodes in DS , which we calculated at the beginning of the algorithm, becomes important. However we do not
wish to iterate over every node according to the topological ordering since this
would mean iterating over a lot of nodes which has negative infinity as value
in Nj . Instead we create a priority queue and insert each node according to its
topological ordering, which should allow us to extract and delete the node with
the lowest order in O(log |heap|). Insertion can be achieved in constant time or
O(log |heap|), depending on the type of heap. Since we know that a topological
ordering ensures that all parents of a node have been evaluated before the node
itself, this approach ensures that every case where there is a gap in P is handled
correctly.
It should be noted that we also encounter this problem with the naïve SmithWaterman solution. Since we calculate the rows from left to right, we implicitly
have a topological ordering, and therefore we do not have to implement a strategy to ensure this. It is also easily seen how a calculation would fail if the entry
to its left has not been calculated.
Lastly, before the implementation goes on to iteration j + 1 the last hashmap is set to the next hash-map, which in turn is reset to only having the reset
condition (1, 0).
After iterating over the entire pattern P , the variable max holds the largest
value seen during the run, which, like in Smith-Waterman, is the local alignment
score between the string S and P . This value is finally returned.
Extracting Local Alignments While the above implementation returns the
local alignment score between two string S and P , it does not provide the actual
alignments which have this score. This is the same problem faced with the
Smith-Waterman algorithm in section 2.2.2, and it is solved in a similar fashion
in the simulated DAWG implementation.
The first step is to expand the value stored in each entry Nj [u]. Instead of
merely saving the best score for u in iteration j, we store a tuple containing the
best score, a number Ij,u , j and a number Lj,u. The scores are maintained and
38
calculated as previously, and the j value is set to the number of the iteration if a
score is propagated forward, the last two values are updated using the following
definition.
∀j = 0 . . . m,
Ij,u =
Lj,u =
Ij,λ = j, Lj,λ = −1 Reset



Ij−1,v
I
Match
Gap in S
Gap in P
j−1,u


I
j,v


Lj−1,v + 1

Match
Gap in S
Gap in P
L
j−1,u


I + 1
j,v
Let Nj [u] be a node which has been calculated to contain at least one local
alignment between S and some pattern string P . Its alignments are then made
up of the pair of substrings in S and P respectively, which are found by using the
calculated values, i.e. {(P [Ij,u . . . j], S[q − Lj,u . . . q]) | q ∈ end-setS (u)}. To
find q ∈ end-setS (u) we use the End-Set-Count(u) and Extract-End-Point(u, i)
operations which we described earlier.
Let us begin by describing the I value. As seen above this value gives us the
start position of the substring in the pattern P . The reset condition ensures
that the substring begins in the j’th position which is in accordance with the
j’th iteration of P . If there is a match case, then the starting value does not
change for the substring, and u therefore inherits the value from its parent v in
the last iteration. If there is a gap in S, we do not propagate through S, and u
therefore keeps its own value from the last iteration. Lastly if there is a gap in
P , then there is a propagation through S, making u inherit from its parent v
in the current iteration. The value j which is equal to the iteration number, is
saved every time no matter which case occurs, this ensure that the end of the
substring is saved.
Lj,u denotes the length of the substring in S that ends at position q. Since
end-setS (u) denotes the end points in S of paths from the source node to node
u, it is obvious that we extract the correct substring by taking this range. The
reset condition ensures that the initial range is empty. The match case increases
the value front the last iterations parent with one, since a match case equals a
conceptual step in S. This is also the reason why a gap in S merely inherits the
value, since no conceptual step is taken in S. The last case, gap in P , deduces
how many gaps have been, conceptually, inserted into P previously by taking
the number Ij,v and then adds one more for the gap which is just being inserted,
this equals the range of the substring in S.
To ensure that we return all the local alignments which fit the local alignment score, we simply modify the variable max to being a list maintaining all
nodes u and their associated values, which equal the best score. If a new best
score is found, the list is cleared and the appropriated nodes with associated
values are inserted. This generates a small overhead, but nothing compared to
the actual search. The list can then be iterated over to find all local alignments.
Since the operation Extract-End-Point takes O(log n) time we can find all pairs
39
of substrings which represent a local alignment in O(#alignments log n).
If it is necessary to find the actual alignment, instead of merely the two
substrings which make up the alignment, it is necessary to use some algorithm
for computing the global alignment between S[q − Lj,u . . . q] and P [Ij,u . . . j].
However, since the local alignment usually yields substrings which are far
shorter than the original string S and P , this extra computation is insignificant
compared to actually locating the substrings. We therefore generally disregard
this additional overhead.
40
Chapter 4
Experiments
The previous chapter began by introducing the DAWG data structure and then
went on to introduce and described two additional data structures, the DFUDS
and FM-index which were shown to be able to simulate the DAWG by combining
the operations of each. The chapter then presented an algorithm for computing
local alignment using the simulated DAWG, which Do and Sung [3] concludes
has a worst-case time complexity of O(nm) and expects it to have an O(n0.628 m)
average time complexity when using a scoring scheme which rewards a match
with 1 and punishes a mismatch or gap with −3.
This chapter will present a number of experiments run on the simulated
DAWG with the purpose of investigating whether these theoretical time complexities hold. It will also present a few other experiments which are intended
to give further insight into the data structure, such as the effect of different
scoring schemes on the DAWGs performance.
However, before describing the experiments and show their results, the chapter will provide the experimental set up used for the experiments.
4.1
Experimental Setup
This section will present the different aspects which influence the experiments,
such as libraries or format of the data used. It will also, briefly, present the
means of analysing the data outputted by the experiments. The contents of
this section will set the premise for all experiments run on the naïve SmithWaterman algorithm for computing the local alignment score, as well as the
simulated directed acyclic word graph approach.
It should be noted that all experiments have been run on an Intel Core i3
550, 3.2 GHz processor with 8 GB memory and running a 64-bit Xubuntu 13.10
distribution. To ensure the least possible amount of overhead is caused by other
processes running on the system, as few programs as possible are running while
the tests are executing.
41
Implementation
The experiments are based on an implementation of the simulated directed
acyclic word graph which has been described throughout chapter 3, and an implementation of the naïve Smith-Waterman approach to computing local alignment score described in 2.2. Both implementations are written in C++ and
compiled using GCC-4.8 with the optimization flag -O3 set.
While the basic approach and data structures used for the implementations
have been described in their respective sections, the actual C++ libraries used
have not been mentioned.
For tables and arrays the implementation uses the Standard Template Library (STL) container vector. In certain parts of the implementation these
are also nested, yielding two-dimensional tables. This approach does seem to
be controversial in some communities, but since they are only used as lookup
tables, for example, in the DFUDS, and never modified, it should not affect the
runtime since lookups are done in constant time.
The dictionary structure used for the C table in the FM-index described
in section 3.3 uses the map<char, int> data structure from the STL to map
characters to their C value. It makes lookups in O(log |Σ|) time, which is
acceptable since the experiments will be restricted to an alphabet of size four,
i.e. DNA.
The wavelet tree also uses the map<char, bool> data structure to map
characters to their bit representation. The alphabet size, again, ensures an
acceptable time complexity.
Another data structure used multiple times is the hash-map. This is used
a couple of places, e.g. to hold the sample suffix array mentioned in section
3.3.1 and to hold on to alignment scores for each node in the simulated DAWG.
The implementation uses the STL unordered_map data structure, which has
an average case of constant time for lookups in the hash-map. There is a risk
that it will need to rehash the entire structure, when adding elements, which is
very time consuming. However by reserving memory in advance this should be
avoided.
The last data structure to be mentioned is the priority queue structure
which is used for ensuring that the algorithm traverses the nodes in the correct
topological order in the local alignment score algorithm of the simulated DAWG.
For this the implementation uses the fibonacci_heap data structure from the
BOOST library collection found at http://www.boost.org.
Initially the implementation had two versions of the simulated DAWG local alignment algorithm to test whether there was any difference in performance between using the STL priority_queue and BOOST fibonacci_heap
data structures. However, experiments showed that there was no difference and
therefore, to simplify things, only one implementation was chosen to be used in
the experiments.
42
Data
The experiments will be based on two forms of data, both of which will be
over the alphabet Σ = {A, C, G, T }. The first type of data will be randomly
constructed over the alphabet using the C++ pseudo-random number generator,
rand(), which yields a uniform distribution over the characters.
The other type of data will rely on DNA samples from http://www.ensembl.
org. The first DNA sample used is 1.1 GB Turkey (Meleagris Gallopavo) found
at [4], this sample is used to create the strings S, used for constructing the simulated DAWG DS . For the query string P , a DNA sample of 2.6 GB originating
from a Bushbabys (Otolemur Garnettii) DNA, found at [5] is used.
To ensure that the overhead for creating test strings is as low as possible,
a large portion of each DNA sample is loaded into memory before any experiments are performed. The range from which this portion originates is random,
which should reduce the likelihood of using the same strings multiple times.
When an experiment need a DNA string it then takes a random section of one
of the strings loaded into memory, depending on whether it will be used for
constructing a DAWG or be used as a query string.
Experimental analysis
For the purpose of analysing the data produced by the experiments, this thesis
will rely heavily upon Gnuplot and its associated tools. Aside from producing the graphs used in this chapter, it will also play a fundamental role in
approximating the time complexity of the simulated DAWG. This is done by
its fit command which uses a non-linear least-squares regression algorithm to
estimate the parameters of a function, given the plotted data. This approach
is very common and given an correlated equation it estimates the parameters
extremely well.
4.2
Comparing Algorithms
The first experiment presented will be a basic comparison of the naïve algorithm versus the more complex DAWG local alignment score algorithm. Both
approaches are run with two strings S[1 . . . n] and P [1 . . . m] where n = m with
an initial size of 100 and incremented with 100 for each iteration. Each iteration uses a single S and 10 query strings P and saves the average time for the
computation of the local alignment score to a data file. The time used to create
the simulated DAWG data structure is not included in the experiments, since it
is not included by Do and Sung [3]. However, the build time will be examined
separately in section 4.5.
The experiment will be run for both randomly generated data, and for actual
DNA sequences as described in section 4.1. It is not expected to yield remarkably different results, as the number of meaningful alignments is expected to
remain roughly to same due to the scoring scheme used. The scoring scheme
used returns a score of 1 for matches and −3 for gaps and mismatches, since
this is the scoring scheme presented by Do and Sung [3].
43
3e+10
DAWG
Smith-Waterman
2.5e+10
2.5e+10
2e+10
2e+10
nanoseconds
nanoseconds
3e+10
1.5e+10
1e+10
5e+09
DAWG
Smith-Waterman
1.5e+10
1e+10
5e+09
0
0
0
5000
10000
15000
20000
25000
30000
n
0
5000
10000
15000
20000
n
Figure 4.1: Initial comparison of the naïve Smith-Waterman and the simulated
DAWG algorithm with random sequence to the left and DNA to the right. The DAWG
plot to the left is approximated to 25n0.861 n + 12, and the Smith-Waterman plot to
the left is approximated to 30n1.000 n + 12.
The experiment is expected to yield a graph with the naïve solution approximating a quadratic function since it is a pure O(nm) algorithm. Likewise it
is expected that the DAWG implementation approximates some function from
input size to time on the form axb x + c, with a, b and c being free parameters.
This function is also described by Do and Sung [3], in which they also present
the expected theoretical average time of O(n0.628 m).
Using Gnuplot’s approximation command fit over the function described
earlier, the resulting approximation of the naïve Smith-Waterman data as seen
in figure 4.1 is 30n1.000 n + 12, which solidifies the theoretical time complexity of
O(nm) which was assumed in the initial definition of the algorithm. However,
applying the same approximation command on the simulated DAWG approach
returns the function 25n0.861 n + 12. This is far from the expected theoretical
average time stated earlier, however it is still far from its theoretical worst case
of O(nm) described by Do and Sung [3].
Though the experiment yields an exponent which is larger than we expected
for the entire dataset, it is not constant throughout all ranges of the set. Table
4.2 shows the changing exponent over certain ranges of the dataset from the
initial experiment. There are two noticeable jumps which seem to occur when n
exceeds 10000 and again when n exceeds 20000. This dramatic increase of the
exponent on the time complexity can not be explained merely by looking at the
algorithm, since nothing should affect it so dramatically from such a modest
increase of the input size.
Table 4.2 also shows the exponent change for the experiment running actual
DNA sequences as described earlier. Since actual DNA sequences do not have
uniformly distributed characters the results seem to fluctuate much more than
the randomly generated data. This influences the approximation quite a bit,
and the exponent can be seen to fluctuate much more than in the random data.
44
25000
30000
Range
[0 : 5000]
[5000 : 10000]
[10000 : 15000]
[15000 : 20000]
[20000 : 25000]
[25000 : 30000]
Random
0.609
0.611
0.777
0.779
0.869
0.869
DNA
0.537
0.549
0.927
0.927
0.896
0.895
Figure 4.2: The change in the exponent of the DAWG’s time function depending on
the range of the input size.
However it does seem to converge to an exponent which is similar to the random
data approximation, indicating that the larger exponents in the middle ranges
are probably caused by the larger variance in the data.
4.3
Cache Misses
A possible explanation for the sudden increases of the exponent in the time
complexity of the simulated DAWG, is that the CPU caches exceed their capacity, thereby making it necessary to fetch required data from memory. Since
a cache miss equals extra time spent searching through the cache for the item,
then proceeding down until it eventually finds it in a larger, but slower, cache
or even in the memory. Worst-case would of course be if the item was only
available on the hard drive, however all the necessary data is always loaded
into memory before running the experiments, and will not exceed its capacity.
To explore this possibility, the experiment described in section 4.2 has been
extended to count the number of L2 and L3 misses which occur. This has been
implemented by using the PAPI interface, found at http://icl.cs.utk.edu/
papi, to gain access to the hardware counters available on the system. The
resulting data has been plotted in two separate graphs which can be seen in
figure 4.3. The plotted data seems to offer some explanation for the dramatic
increase of the exponent seen in the previous section. It can be clearly seen
that the first increase of the exponent, which occurs at around n = 10000,
coincides with a rapid increase of L2 cache misses at around the same input
size. This would affect the performance of the algorithm as each lookup which
resulted in a cache miss would require an iteration through the L3 cache and a
fetch operation to access the item needed in the computation of the alignment
score. The second increase also coincides with a rapidly increasing number of
L3 cache misses, which forces the system to fetch data from memory before
being able to continue its execution. But this overhead should only result in a
larger constant per fetch, it should not increase the overall time-complexity and
therefore the cache misses alone can not explain the increase of the exponent.
One possible explanation could be that the exponent is temporarily affected
by the increasing miss rate percentage, which would mean that the exponent
should stabilize when the miss rate nears 100%.
Unfortunately the PAPI interface was not able to access any cache hit coun45
3.5e+08
1e+07
Random
DNA
3e+08
Random
DNA
9e+06
8e+06
7e+06
L3 misses
L2 misses
2.5e+08
2e+08
1.5e+08
6e+06
5e+06
4e+06
3e+06
1e+08
2e+06
5e+07
1e+06
0
0
0
5000
10000
15000
20000
25000
30000
0
n
5000
10000
15000
20000
n
Figure 4.3: L2 and L3 cache misses
ters on the system used for the experiment, so it is not possible to show what
percentage the cache misses represent on the overall cache queries. Though, despite the lack of cache hit data, it seems reasonable to assume that as the input
size increases, the cache miss rate percentage would increase as it becomes less
likely that the data needed for the next computation is available in the caches,
thereby affecting the time complexity of the algorithm in some way - possibly
through an increased exponent as observed.
Cache Miss Rate
This section will investigate the possibility of the cache miss rate affecting
the exponent in the previous experiment. If this is the case then it would
be expected that the effect only occurs when the miss rate is increasing with
respect to the input size. But as the miss rate nears 100% it should no longer
be able to affect the exponent, and merely result in a larger factor on the time
complexity.
Therefore an experiment has been run on input sizes ranging from 30000 to
107250 using randomly generated strings. Due to time constraints the size is
incremented by 250 per iteration, and only run four times, which will result in a
greater variance in the data. The recursive depth-first traversal used to extract
the topology of the DAWG causes a segmentation fault when it nears 110000,
thereby limiting the maximum range. This was unfortunately discovered too
late to allow for reimplementation.
The data from the experiment is approximated for increasing start ranges
to the end range and the exponents approximated can be seen in figure 4.4.
The initial exponents show that the effect of cache misses do not end at the
30000-mark, and they even exceed the exponent of the naïve implementation.
Though this pattern seems to disappear around 60000 where the exponent begins falling rapidly. This could indicate that the growth of the cache miss rate
is decelerating, reducing the effect on the exponent. Though it never reaches
46
25000
30000
Range
[30 : 107.25]
[35 : 107.25]
[40 : 107.25]
[45 : 107.25]
[50 : 107.25]
[55 : 107.25]
[60 : 107.25]
Exponent
1.165
1.171
1.179
1.186
1.195
1.202
1.202
Range
[65 : 107.25]
[70 : 107.25]
[75 : 107.25]
[80 : 107.25]
[85 : 107.25]
[90 : 107.25]
Exponent
1.187
1.144
1.012
0.857
0.842
0.836
Figure 4.4: Table showing the change in exponent for runs on long strings on the
simulated DAWG. The range values are in thousands of characters.
the average time predicted by Do and Sung [3], this could merely be caused by
the limited range.
Attempting to approximate the data after the 90000-mark results in an
asymptotic standard error above 5% when using the Gnuplot’s approximation
algorithm and have therefore not been included. This is likely caused by the
lower number of data points available, and a greater variance due to a lower
number of runs per data point.
This experiment seems to indicate that it is indeed the cache miss rate
which causes the unexpected increase on the exponent for smaller input ranges.
Further experiments are necessary to determine where the exponent stabilizes,
which would indicate the actual time complexity of the algorithm, though the
results from the early ranges, before the cache misses begin to affect the algorithm, do suggest that the algorithm is capable of reaching the expected
O(n0.628 m) average time predicted by Do and Sung [3].
A possible reason for the horrible number cache misses could be due to
the number of lookups required in the simulated DAWG to navigate the data
structure. As the algorithm for computing the local alignment score, seen in
3.4.4, shows each iteration makes a large number of calls to the Find-Child
operation, which calculates whether a given node has a child by making lookups
in every underlying data structure. This increases the probability that at least
some entries are not present in the cache, be it L2 or L3. This problem is not
present in the naïve solution as it only makes three lookups per calculation from
two tables.
4.4
Scoring Schemes
This section will attempt to compare the effect on performance of the simulated
DAWG when different scoring schemes are used for the computation of the local
alignment score. Since the DAWG approach relies heavily on the pruning derived from only propagating meaningful alignments forward in the computation,
the scoring scheme should influence the algorithms performance substantially.
To investigate this hypothesis three different scoring schemes have been chosen,
two of which are opposite extremes and the last being a standard scoring scheme
47
of the widely used sequence analysis tool BLAST, which is also presented by Do
and Sung [3]. This scoring scheme was also used for the previous experiments.
The scoring schemes are defined as the following:
1e+10
1, 0, 0
Smith-Waterman
1, −3, −3
1, −∞, −∞
8e+09
nanoseconds
8e+09
nanoseconds
1e+10
1, 0, 0
Smith-Waterman
1, −3, −3
1, −∞, −∞
6e+09
4e+09
2e+09
6e+09
4e+09
2e+09
0
0
0
5000
10000
15000
20000
25000
30000
n
0
5000
10000
15000
20000
n
Figure 4.5: The resulting graph from the experiment described in section 4.4. The
results from randomly generated inputs are shown to the left, while the results from
using DNA as input is shown to the right. Each plots exponent is shown in figure 4.6
• The first extreme scoring scheme rewards match with 1, while a mismatch
or gap is punished with 0, and is denoted as {1, 0, 0}. This scoring scheme
makes negative scores impossible once a node has received a value, and
therefore the DAWG should be forced to make a calculation for all nodes
very early on in the experiment. This should result in a worst-case scenario which is defined as O(nm) by Do and Sung [3]. The alignment score
should equal the global alignment score between S and P since gaps are
not punished, leading to an alignment where most characters possible are
matched up.
• The other extreme is a scoring scheme which rewards matches with 1 but
punishes mismatches and gaps by −∞ denoted as {1, −∞, −∞}. Due
to integer range constraints the implementation uses a signed integer
with half the size of the smallest representable signed integer to represent
negative infinity. The alignments produced by the scoring scheme then
equal the longest common substrings between the input strings S and
P . This greatly reduces the number of meaningful alignments, since all
alignments which include a gap are meaningless with this scoring scheme,
which should increase performance significantly.
• The standard scoring scheme rewards matches with 1 and punishes a
mismatch and gap with −3, denoted as {1, −3, −3} in the graph. This
is a widely used scheme and is commonly found in other tools used to
analyse sequence alignments.
48
25000
30000
Scoring Scheme
{1, 0, 0}
Smith-Waterman
{1, −3, −3}
{1, −∞, −∞}
Random
1.188
1.004
0.861
0.721
DNA
1.192
1.009
0.849
0.712
Figure 4.6: The exponent of the equation anb n + c of the different plots in figure 4.5
retrieved using Gnuplot’s regression command fit.
The experiment is set up similarly to the experiment described in section
4.2 with S and P being of equal size starting at 100 and then incrementing by
100 up to 30000 characters. As a small variation the number of runs on each
size P has been reduced to 4, down from 10. This is due to a practical time
constraint where it was not possible to run a full test. This should result in a
slightly larger variance in the data, but should still be adequate.
The experiment is once again split up in randomly generated strings and
actual DNA with the resulting data seen in figure 4.5. The performance of the
Smith-Waterman algorithm with the standard scoring scheme has also been
included in the graph for comparison. Since this algorithm’s performance is
not affected by the scoring scheme there is no reason to include data from tests
with other scoring schemes.
The resulting graph shows the exact pattern we were expecting, the {1, 0, 0}
scoring scheme produces the worst performance in the simulated DAWG by far,
even being out performed by the naïve Smith-Waterman algorithm. Since the
theoretical worst case presented by Do and Sung [3] is O(nm) just as the naïve
algorithm this is not a surprise, since we have already seen how the cache misses
seem to affect the DAWG’s time-complexity. In figure 4.6 the exponents of the
different plots have been approximated and it is clear that the time-complexity
is quite a bit worse than the theoretical assumption. However, if we approximate
the range [0 : 2000] the exponent is reduced to 1.08, indicating that the cache
miss rate is once again affecting the time-complexity. Since the extreme scoring
scheme forces the algorithm to calculate all entries with no pruning, it is not
surprising that the experiment encounters a rapidly increasing cache miss rate
earlier than when using the other scoring schemes.
The second extreme scoring scheme {1, −∞, −∞} results in a plot with the
exponent seen in figure 4.6. Even in the best case scenario where the algorithm is able to prune all alignments which contain gaps, the approximated
time-complexity exceeds, with a noticeable margin, the expected average case
time-complexity of O(n0.628 n) presented by Do and Sung [3]. While this scoring scheme is the last to be affected by the cache miss rate it is still affected
substantially even at these relatively short input sizes.
The standard {1, −3, −3} scoring scheme yields approximately the same
exponent as in section 4.2. Since this previous experiment covers this scoring
scheme, further comments will not be made.
49
4.5
Build Time
For the simulated directed acyclic word graph to be practical to use, the time
complexity for building the data structure has to be insignificant compared to
the computation of the local alignment score. Since the experiments on the
simulated DAWG are not based on a succinct implementation as described by
Do and Sung [3] the resulting time consumption in this section is not necessarily equivalent to the succinct simulated DAWGs time-complexity. It should
be noted that Do and Sung [3] do not present a theoretical expected timecomplexity for building the data structure.
The build time is expected to be dominated by construction of the O(n log n)
sized range minimum query table described in section 3.2.3. The table also takes
O(n log n) time to construct, as each entry is created in constant time.
The experiment is run by constructing the simulated DAWG on strings of
increasing length starting from 1000 and increasing by 1000 each iteration to
300000. Each iteration runs ten different DAWG constructions and the average
time is then outputted to a file. The resulting data is then plotted and the
graph is approximated using Gnuplot.
The data is plotted to the same function as the previous experiments, i.e.
axb x+c, partly to be able compare with the local alignment score algorithm, but
mostly because this function approximates the plot better than the expected
ax log x + c. The resulting approximation is 1335n0.12 n − 106 , which can be
said to have very little effect on the overall time consumption compared to
computing the local alignment.
Since the plotted graph is not very interesting, due to the lack of other plots
for comparison, it will not be included.
50
Chapter 5
Conclusion
The previous chapter presented a number of experiments which have been run
on the simulated DAWG data structure with the goal of investigating the claims
made by Do and Sung [3]. This section will evaluate these findings and attempt
to answer the questions raised in chapter 1. Afterwards the section will discuss
what improvements could be made to the implementation and what experiments
could be run to give more insight into the simulated DAWG data structure and
its local alignment algorithm.
5.1
Conclusion
Beginning with the first question raised in chapter 1, it does seem feasible to
implement the simulated directed acyclic word graph data structure, as the implementation used in this thesis works and is able to compute local alignments.
Though to achieve the specific data structure presented by Do and Sung [3] it
is necessary to substitute or extend a number of the auxiliary data structures
used, which means the feasibility will rely upon the feasibility of implementing
these changes. While some of them may be somewhat complex to implement
there does not seem to be any reason to suggest that they are not feasible.
The second question set out to investigate whether the experiments could
verify that the worst-case time consumption was equivalent to the O(nm) timecomplexity presented in [3]. This was investigated by giving the simulated
DAWG a scoring scheme which would ensure that no pruning was possible,
thereby forcing the local alignment algorithm to make calculations on all nodes
at each iteration. This experiment resulted in a time-complexity which was approximately the same as the expected worst-case when looking at small inputs,
which indicates that it indeed has a worst-case of O(nm) time.
The last and most interesting question raised was whether the simulated
DAWG data structure had an average time-complexity of O(n0.628 m). The
first attempt to answer this question in section 4.2 showed that the local alignment algorithm had an average time-complexity that approached the expected
time for input sizes ranging from 0 to 10000, but that this time-complexity
quickly deteriorated when the input size exceeded this range. Further investigation in section 4.3 revealed that the cache misses appeared to be affecting
51
the exponent much more than expected, which prompted an experiment into whether the cache miss rate was causing this anomaly. The cache miss experiment on large input sizes seemed to indicate that once the miss rate approached 100%, the exponent would settle back towards the algorithm's actual time-complexity.
Unfortunately, due to technical problems, the experiment was not able to
identify at which input size the exponent on the time-complexity stabilized.
However, it did show that the time-complexity observed when the input size
exceeded 10000 was not the actual time-complexity of the algorithm as a whole.
While there is not enough data to definitively conclude that the time-complexity would stabilize at the expected O(n^{0.628}m), the initial experiment with small input sizes does seem to suggest that it is feasible.
Lastly an experiment was run to deduce whether the time-complexity of
building the simulated DAWG would generate enough overhead for the approach
to be impractical. This experiment showed that the build time was insignificant
when compared to the local alignment computation, even when the latter achieves the expected average time-complexity.
5.2 Future Work
The next step would be to fix the implementation so that input sizes above 110000 can be investigated. This requires improving the depth-first traversal of the simulated DAWG, which currently causes a segmentation fault at large input sizes. Fixing this would allow experiments to determine when the algorithm's time exponent stabilizes, which would indicate the true time-complexity of computing local alignment using the simulated DAWG.
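If the segmentation fault stems from exhausting the call stack through deep recursion, one standard remedy is an iterative traversal with an explicit heap-allocated stack. The sketch below illustrates the idea under that assumption; Node and children are hypothetical stand-ins for the corresponding parts of the implementation.

#include <stack>
#include <vector>

struct Node { /* hypothetical handle into the simulated DAWG */ };

// Hypothetical accessor returning the children of a node.
std::vector<Node> children(const Node& node);

// Iterative depth-first traversal: the explicit stack lives on the
// heap and grows with the input, where a recursive version would
// overflow the fixed-size call stack on large inputs.
void depth_first(Node root) {
    std::stack<Node> pending;
    pending.push(root);
    while (!pending.empty()) {
        Node current = pending.top();
        pending.pop();
        // ... process current here ...
        for (const Node& child : children(current)) {
            pending.push(child);
        }
    }
}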
Since this thesis has focused almost exclusively on experiments run on the simulated DAWG, an experiment in which the simulated DAWG is matched up against well-established tools for computing local alignment would be very interesting. While the time-complexity is an important aspect, a direct comparison in which identical strings are run through different tools might produce surprising outcomes due to the different strategies employed.
Another aspect which has not been investigated in this thesis, but which could yield interesting results, is using affine gap costs in the scoring scheme of the simulated DAWG. While the implementation in its current form cannot operate on scoring schemes with affine gap costs, it should not be difficult to modify the local alignment algorithm to support them; the standard recurrence is sketched below. It would be interesting to see how this change in gap cost affects the running time.
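For reference, the usual way to handle an affine gap cost g(k) = α + βk for a gap of length k is Gotoh's three-matrix variant of the Smith-Waterman recurrence. This is not part of the thesis implementation, but a sketch of the local alignment version, with gap-open penalty α and gap-extension penalty β, is:

E(i, j) = max{ E(i, j−1) − β, H(i, j−1) − α − β }
F(i, j) = max{ F(i−1, j) − β, H(i−1, j) − α − β }
H(i, j) = max{ 0, H(i−1, j−1) + s(x_i, y_j), E(i, j), F(i, j) }

Here E and F hold the best scores of alignments ending with a gap in the first and second sequence respectively, and the 0 term preserves the local alignment property of the Smith-Waterman solution. Adapting this to the simulated DAWG would amount to carrying three values per node instead of one.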
One thing that is not taken into consideration in this thesis is the effect of using the succinct and compressed data structures originally used by Do and Sung [3] for the simulated DAWG. While it is argued that these merely add an additional constant time penalty to the local alignment calculation, there does not seem to be any estimate of how these complex constructions affect the build time of the data structure. If the time-complexity for building the simulated DAWG begins to approach the time-complexity of the local alignment computation,
then it would probably not be very practical to use. Implementing these should
give insight into whether the succinct simulated DAWG is practical.
Bibliography
[1] Benoit, D., E. D. Demaine, J. I. Munro, R. Raman, V. Raman, and S. S. Rao
(2005, December). Representing trees of higher degree. Algorithmica 43 (4),
275–292.
[2] Blumer, A., J. Blumer, D. Haussler, A. Ehrenfeucht, M. T. Chen, and J. I.
Seiferas (1985). The smallest automaton recognizing the subwords of a text.
Theor. Comput. Sci. 40, 31–55.
[3] Do, H. H. and W.-K. Sung (2011). Compressed directed acyclic word graph
with application in local alignment. In B. Fu and D.-Z. Du (Eds.), COCOON,
Volume 6842 of Lecture Notes in Computer Science, pp. 503–518. Springer.
[4] Ensembl.org. ftp://ftp.ensembl.org/pub/release-74/fasta/meleagris_gallopavo/dna/Meleagris_gallopavo.UMD2.74.dna.toplevel.fa.gz.
[5] Ensembl.org. ftp://ftp.ensembl.org/pub/release-74/fasta/otolemur_garnettii/dna/Otolemur_garnettii.OtoGar3.74.dna_rm.toplevel.fa.gz.
[6] Ferragina, P. and G. Manzini (2005, July). Indexing compressed text. J.
ACM 52 (4), 552–581.
[7] Golynski, A., J. I. Munro, and S. S. Rao (2006). Rank/select operations
on large alphabets: A tool for text indexing. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithm, SODA ’06,
New York, NY, USA, pp. 368–373. ACM.
[8] Grossi, R., A. Gupta, and J. S. Vitter (2003). High-order entropy-compressed text indexes. In Proceedings of the Fourteenth Annual ACM-SIAM Symposium on Discrete Algorithms, SODA ’03, Philadelphia, PA, USA, pp. 841–850. Society for Industrial and Applied Mathematics.
[9] Jansson, J., K. Sadakane, and W.-K. Sung (2012, March). Ultra-succinct
representation of ordered trees with applications. J. Comput. Syst. Sci. 78 (2),
619–631.
[10] Kondrak, G. (2002). Algorithms for language reconstruction. PhD thesis, University of Toronto.
[11] Lam, T. W., W. K. Sung, S. L. Tam, C. K. Wong, and S. M. Yiu (2008,
March). Compressed indexing and local alignment of DNA. Bioinformatics 24 (6), 791–797.
[12] Mäkinen, V. and G. Navarro (2005). Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12 (1), 40–66.
[13] Mäkinen, V. and G. Navarro (2007). Implicit compression boosting with
applications to self-indexing. In N. Ziviani and R. Baeza-Yates (Eds.), String
Processing and Information Retrieval, Volume 4726 of Lecture Notes in Computer Science, pp. 229–241. Springer Berlin Heidelberg.
[14] McCreight, E. M. (1976, April). A space-economical suffix tree construction algorithm. J. ACM 23 (2), 262–272.
[15] Munro, J. I. and V. Raman (1997). Succinct representation of balanced
parentheses, static trees and planar graphs. In FOCS, pp. 118–126. IEEE
Computer Society.
[16] Sadakane, K. (2007). Compressed suffix trees with full functionality. Theory Comput. Syst. 41 (4), 589–607.
[17] Smith, T. F. and M. S. Waterman (1981, March). Identification of common
molecular subsequences. Journal of Molecular Biology 147 (1), 195–197.