Suffix trees: How to do google search in bioinformatics?
Curriculum module designer: Ananth Kalyanaraman ([email protected] )
This document is intended for the instructor. It contains information pertaining to the project and the
different ways in which it can be designed. In the Appendix section some template algorithms for parallel
implementation under different programming models are also provided. C++/MPI-implementations are
provided in the Project Source folder.
The document is presented in the form of a FAQs with the hope that different questions about the project
can be addressed in a focused manner. Please contact the module designer for clarifications and
comments.
Q) What is the goal of this project and what are expected learning outcomes?
The primary goal of this project is to introduce the problem of pattern matching and have the students
design, implement and evaluate different algorithmic approaches to solve this problem. The pattern
matching problem is the problem of checking to see if a set of query sequences occur as substrings in a
given string database (e.g., genome).
Upon successful completion, the students should:
a) Be able to design and analyze algorithms for the problem of pattern matching (and extend the
ideas to other closely related string matching problems);
b) Have acquired extensive experience in using string data structures (both tree-based and array-based) and their APIs for matching problems;
c) Be able to identify the main challenges in implementing a parallel solution to string matching.
d) Be able to design and implement parallel approaches to support scalable processing of multiple
queries and large sequence databases.
e) Be able to identify the tradeoffs among the different string matching approaches using different data structures/techniques.
f) Be able to identify the primary challenges in carrying forward the techniques from theory to
practice.
g) Be able to identify the kind of pattern matching techniques that are better suited for different
practical settings (use-cases).
h) Be able to differentiate exact matching from inexact matching techniques.
i) Be able to identify and appreciate real world applications that can benefit from the use of scalable
pattern matching techniques.
Q) Are there other variants of the problem that can be implemented?
Yes, there are multiple variants of the original problem. Given a query sequence:
(i) The first, simple variant is that the pattern matching routine returns a boolean (true if the query is found as a substring, and false otherwise).
(ii) A second variant is that, in addition to returning the boolean, the function also returns the frequency of the query in the database – i.e., how many times each query occurs as a substring in the database.
(iii) A third variant is to expect the function to return not just the count but also the exact genomic positions at which the query occurs as a substring.
(iv) A fourth (albeit independent) variant could be to allow for some errors while matching a query against the input genome/database. In other words, treat the problem as one of inexact matching.
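For concreteness, a minimal C++ sketch of variant (i) using std::string::find is shown below. It is an illustration only (the toy genome and query are made up) and is not part of the provided codebase; the other variants build on the same idea.

    #include <iostream>
    #include <string>

    // Variant (i) only: boolean substring check using std::string::find.
    // Variants (ii) and (iii) would additionally count/collect every match
    // position; variant (iv) would allow mismatches up to a tolerance.
    bool occurs(const std::string& genome, const std::string& query) {
        return genome.find(query) != std::string::npos;
    }

    int main() {
        const std::string genome = "ACGTACGTTGACCA";   // toy stand-in for a genome
        const std::string query  = "GTTGA";
        std::cout << (occurs(genome, query) ? "found" : "not found") << std::endl;
        return 0;
    }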
Q) What is contained in the “ProjectSource” folder? What implementations are provided?
The project source folder contains multiple codebases. The Appendix explains the algorithms implemented by the code in the source folder. All code is in C++ and MPI.
Here is a brief outline of the folder structure:
a) Pmatch_naive: This folder contains the Naïve search algorithm.
b) Pmatch_st: This folder contains the suffix tree based search algorithm.
c) “scripts”: This folder contains all the Perl scripts for use in the project. The scripts can be used to generate synthetic inputs (by simulating DNA sequencing) and to collect statistics about FASTA formatted sequence files. A Readme.txt is provided to help use the scripts.
d) data: This folder contains numerous FASTA files which could be used as test and experimental inputs (both databases and queries).
Q) What kind of background should the students have to work on this project?
a) Students should have strong programming experience in C/C++. Additional knowledge of the STL would be a plus.
b) Students should have taken (or be co-enrolled in) the CS undergraduate algorithm design and analysis course, and should have already taken the undergraduate data structures class.
c) Students need not have any biological background.
d) Students should be comfortable with at least one of the standard parallel programming environments such as MPI, MapReduce or OpenMP.
Q) Does the project have a multiscale component – either in terms of data or in terms of
computation?
Yes, the ProjectSource folder has two subfolders:
a) Pmatch_naive: implements a naïve searching algorithm that is better suited for smaller input sizes (genome & #queries).
b) Pmatch_st: implements a suffix-tree searching algorithm that is better suited for use cases where the genome (big or small) needs to be preprocessed only once (or infrequently), while large batches of queries arrive over a period of time.
A runtime analysis of these two algorithms is provided in the Appendix.
Both algorithms can be run on a single processor, or on multiple processors, providing the
project with multiple computation scales. In addition to working implementations in the MPI
model, ideas for parallelizing under other frameworks such as OpenMP and MapReduce are
provided in the Appendix.
Also, the project allows for three different test plans when it comes to performance analysis
(as listed in the sample job script, sub.sh, in the test folders of these codebases):
a) Studying runtime as a function of genome size (keeping #queries and number of
processors fixed)
b) Studying runtime as a function of number of queries (keeping genome and number of
processors fixed)
c) Studying runtime as a function of number of processors (keeping genome and #queries
fixed)
Numerous input files (both genomes and query files) are provided as part of the package
(please see under the “data” folder). These can be used to test the scalability of the
different implementations. Also, in the “scripts” folder, Perl scripts are provided (see
readgenerator.pl) to generate arbitrarily sized inputs. Please refer to the readme file in that
folder for more details.
Q) How do I know if the students have designed their algorithm well? Are reference
algorithms provided along with the project?
Yes, please refer to the Appendix section below.
Q) Are there extensions and other variants of this project that can be pursued?
Yes, please refer to the Appendix section below.
Q) How long is the project expected to take, and is it a team-based project?
The expected duration for implementing both projects (naïve and suffix tree-based) is
roughly 2 weeks, assuming the students are proficient in writing programs in C/C++ and have
basic programming experience in MPI/OpenMP/MapReduce.
It is desirable that the students work in teams of 2 or 3.
Q) From the ProjectSource folder, what are the project components that the students
should be asked to implement and what parts should be provided to them (as libraries)
for use in the project?
For the naïve implementation: It can be expected that the entire codebase be implemented by
the students. Alternatively, if the instructor wants the students to save some time in
reading the FASTA file and focus more on the implementation details of pattern matching, then
the instructor can provide the FastaReader library (API: fasta_reader.h, and implementation in
fasta_reader.cpp), which the students should be able to use (as shown in pmatch_naive.cpp and
pmatch_st.cpp) to load the input sequence files, both genome and queries.
For the suffix tree-based implementation: The students are expected to be provided with the
suffix tree data structure API and implementation in SuffixTree.h and SuffixTree.cpp,
respectively. They need to use this library to build the suffix tree on the genome and
then use the tree, via its API functions, which are well documented (refer to the public
methods in the SuffixTree class), to implement pattern matching (as shown in pmatch_st.cpp).
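As a rough illustration only, the overall flow the students follow might resemble the sketch below. The FASTA loading here is a minimal stand-in (not the provided FastaReader), the file names are placeholders, and the suffix tree step is only described in comments, since the real class and method names live in fasta_reader.h and SuffixTree.h.

    #include <fstream>
    #include <iostream>
    #include <string>
    #include <vector>

    // Minimal FASTA loading sketch: lines starting with '>' begin a new record,
    // all other lines are sequence data appended to the current record.
    std::vector<std::string> load_fasta(const std::string& path) {
        std::vector<std::string> seqs;
        std::ifstream in(path);
        std::string line;
        while (std::getline(in, line)) {
            if (line.empty()) continue;
            if (line[0] == '>') seqs.push_back("");
            else if (!seqs.empty()) seqs.back() += line;
        }
        return seqs;
    }

    int main() {
        // File names are placeholders for files under the "data" folder.
        std::vector<std::string> genome  = load_fasta("genome.fasta");
        std::vector<std::string> queries = load_fasta("queries.fasta");
        std::cout << "genome records: " << genome.size()
                  << ", queries: " << queries.size() << "\n";
        // Students would now build the suffix tree on the genome using the
        // provided SuffixTree class and search each query via its public methods.
        return 0;
    }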
Q) What is the practical relevance of solving the pattern matching problem? Are there
real world applications (both in bioinformatics and outside) that can benefit from an
efficient implementation for this problem?
Yes, very briefly, the formulation of pattern matching that is targeted here has direct applications
in genome read-to-reference mapping and genome searching applications. More specifically,
the genome read-to-reference mapping problem is a common theme in resequencing projects,
where a biologist has: i) a reference (already sequenced) genome that is representative of a
species, and alongside ii) new raw reads (sequences) collected from specific individuals of that
species (or a closely related one). The goal is to find the minor variations (called single
nucleotide polymorphisms [SNPs]) in the individual’s genome relative to the reference.
The second problem is genome searching, where a biologist has sequenced a new genic
sequence and wants to locate it along the source genome or in genomes from closely related
species (this is common in comparative genomics projects).
APPENDIX
I) Pattern matching with error tolerance – Problem definition
Inputs:
- G: Genome FASTA file containing a single genome sequence with N characters
- Q: Queries FASTA file containing m queries, each of a possibly different length
- Error tolerance level, expressed as a percentage of the query length
Problem statement: Search for the input queries in the genome and report the
topmost hit for each query. The “topmost hit” for a query is defined as the location in
the genome that matches the query with the least number of mismatching
characters (errors), where the number of errors is no more than the error tolerance
level indicated by the user. If there is no such location, then the code outputs a “query
not found” message. If there is more than one such location, then any one of
those locations (chosen arbitrarily) is reported as the output.
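For example (an illustrative instance, not taken from the provided data): with an error tolerance of 10% and a query of length 20, at most ⌊0.10 × 20⌋ = 2 mismatches are allowed, so the reported hit is the length-20 window in the genome with the fewest mismatches, provided that count is at most 2.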
Possible extensions:
1) One possible extension could be to ask the students to provide all locations where a
query is found.
2) Another extension could be to have students implement a more optimized version of
their code assuming no errors (i.e., queries need to match exactly to be reported as
found).
3) A much more involved extension would be to assume there is no genome; instead,
given only a set of queries, the goal is to find all other queries in the set that
match significantly with each query. This is equivalent to performing an all-against-all comparison.
II) Pattern matching with error tolerance – A Naïve algorithm
Implementation source folder: pmatch_naive
A Naïve serial algorithm:
Assumptions:
1) The queries are of roughly the same length (i.e., negligible standard
deviation in the average query length).
2) There is sufficient memory available (local RAM) to store the entire Genome input.
This is a naive implementation that does a brute force search for each query against the
entire genome, by sliding a window of the length of the query along the genome and
enumerating only those positions that correspond to the topmost hit.
Note: This version does *NOT* use suffix trees or any other sophisticated string data
structure.
The steps are outlined below:
1) Load the input genome into a string array in memory (call it G[1..N]).
(Optionally, read all the queries at once or one at a time.)
2) For each query q ∈ Q:
i) Set “maximum allowed errors” := error tolerance (%) × query length (|q|)
ii) Slide a window of length |q| along the entire length of the genome.
(Note: There will be N-|q|+1 windows.)
iii) For each window:
a. Compare q against the characters in that window.
b. If the number of mismatching positions is below the maximum allowed errors
for this query, then check if it is the best hit so far. If it is, update the best-hit
information and remember the genomic location (i.e., the window).
iv) At the end, output one best hit for the query.
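A minimal serial C++ sketch of these steps is shown below for illustration; it is independent of the provided pmatch_naive code and simply mirrors the outline above (FASTA loading and tie-breaking are omitted for brevity, and the genome/query data are made up).

    #include <cstddef>
    #include <iostream>
    #include <string>

    // Naïve sliding-window search: returns true and sets best_pos if some window
    // of length |q| matches q with at most max_errors mismatches.
    bool naive_best_hit(const std::string& G, const std::string& q,
                        std::size_t max_errors, std::size_t& best_pos) {
        if (q.empty() || q.size() > G.size()) return false;
        bool found = false;
        std::size_t best_err = max_errors + 1;
        for (std::size_t w = 0; w + q.size() <= G.size(); ++w) {   // N-|q|+1 windows
            std::size_t err = 0;
            for (std::size_t i = 0; i < q.size() && err < best_err; ++i)
                if (G[w + i] != q[i]) ++err;                        // count mismatches
            if (err <= max_errors && err < best_err) {              // new best hit
                best_err = err;
                best_pos = w;
                found = true;
            }
        }
        return found;
    }

    int main() {
        const std::string G = "ACGTACGTTGACCA";     // toy genome
        const std::string q = "CGTTC";              // toy query (1 mismatch vs. CGTTG)
        std::size_t pos = 0;
        if (naive_best_hit(G, q, /*max_errors=*/1, pos))
            std::cout << "best hit at position " << pos << "\n";
        else
            std::cout << "query not found\n";
        return 0;
    }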
Asymptotic analysis:
Let m: number of queries
M: sum of the length of all the queries
N: length of the genome
l: average length of a query
(p: number of processors)
Time complexity:
O(lN) per query, implying a total of O(mlN) (=O(MN)) for all queries.
Space complexity:
Each processor stores the entire genome, and queries can be read one at a
time, so the space complexity is O(N).
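To make the naïve cost concrete (numbers chosen for illustration only): for a genome of N = 10^7 characters and m = 10^4 queries of average length l = 100, the total work is on the order of m × l × N = 10^13 character comparisons, which motivates both the parallel versions below and the suffix tree approach in Section III.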
Parallel algorithm for MPI platforms:
Assuming the lengths of the queries are all roughly equal to one another (i.e., negligible
standard deviation), this algorithm is easy to parallelize.
1) Each processor loads the entire genome G.
2) Each processor loads O(m/p) of the queries. This can be done in a
distributed manner without needing any communication.
3) Each processor runs the serial algorithm to search its local set of O(m/p) queries
against the genome.
Let p be the number of processors. This algorithm will take O(MN/p) time and O(N + M/p)
space per processor.
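An MPI sketch of this query distribution is shown below for illustration. The genome and queries are hard-coded toy data standing in for FASTA input, and an exact-match check stands in for the error-tolerant search (the sliding-window routine sketched earlier would be substituted for the find() call).

    #include <mpi.h>
    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, p = 1;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &p);

        // Every rank holds the entire genome; toy data stands in for FASTA input.
        const std::string G = "ACGTACGTTGACCA";
        const std::vector<std::string> Q = {"CGTT", "GACC", "TTTT", "ACGT"};

        // Distribute the m queries: rank r handles queries r, r+p, r+2p, ...
        // (For the error-tolerant problem, replace the exact find() below with
        //  the naive sliding-window search from the previous sketch.)
        for (std::size_t i = rank; i < Q.size(); i += static_cast<std::size_t>(p)) {
            std::size_t pos = G.find(Q[i]);
            if (pos != std::string::npos)
                std::cout << "rank " << rank << ": query " << i
                          << " found at position " << pos << "\n";
            else
                std::cout << "rank " << rank << ": query " << i << " not found\n";
        }
        MPI_Finalize();
        return 0;
    }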
Parallel algorithm for OpenMP platforms:
The same approach and analysis carry forward to an OpenMP multithreaded
setting as well, with some minor changes:
1) The master thread loads the entire genome G.
2) The master thread also loads all the queries into memory.
3) The master thread spawns p worker threads.
4) Each worker thread picks an unclaimed query (the next one in the queue) and runs the serial
algorithm to search it against the genome. As threads pick a query they also
mark it in the query array so that other threads do not pick it. This is the only step
that needs a lock/unlock (see the OpenMP sketch below).
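A minimal OpenMP sketch is shown below for illustration. Rather than a hand-rolled lock around a shared query index, it uses OpenMP's dynamic loop scheduling, which achieves the same "pick the next unclaimed query" behavior; the toy data and the exact-match check are stand-ins for the FASTA inputs and the error-tolerant search.

    #include <omp.h>
    #include <iostream>
    #include <string>
    #include <vector>

    int main() {
        // Toy data standing in for the FASTA inputs.
        const std::string G = "ACGTACGTTGACCA";
        const std::vector<std::string> Q = {"CGTT", "GACC", "TTTT", "ACGT"};

        // schedule(dynamic) lets each thread grab the next unclaimed query,
        // mirroring step 4 without an explicit lock around a shared index.
        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < static_cast<int>(Q.size()); ++i) {
            // For the error-tolerant variant, call the sliding-window search here.
            bool found = (G.find(Q[i]) != std::string::npos);
            #pragma omp critical   // serialize output only
            std::cout << "thread " << omp_get_thread_num() << ": query " << i
                      << (found ? " found" : " not found") << "\n";
        }
        return 0;
    }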
Parallel algorithm for MapReduce platforms:
One MapReduce algorithm is as follows:
Let us assume there are pm mappers and pr reducers.
1) Each mapper loads roughly O(m/pm) queries (using the input splitter function)
2) Each mapper loads the entire genome once.
3) Each mapper then searches each local query against the genome and outputs the results.
4) There is no need for reducers in this model.
Another variant of the MapReduce algorithm could be that each mapper reads roughly an
equal portion of both the query set and the genome. Then each mapper compares the
local queries against the locally stored genome portion, and emits intermediate <key, value> pairs
of the form <query id, best local hit>. The MapReduce shuffle stage will then gather all
hits corresponding to each query at a designated reducer, where that reducer can report
the global best hit.
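As one possible way to prototype the first (map-only) variant, a Hadoop Streaming-style mapper written in C++ is sketched below. It assumes queries arrive one per line on standard input, the genome file path shown is a placeholder, and exact matching stands in for the error-tolerant search; none of this is part of the provided codebase.

    #include <fstream>
    #include <iostream>
    #include <sstream>
    #include <string>

    // Map-only sketch in the Hadoop Streaming style: each input line is one query,
    // and the mapper emits "<query>\t<position or NOT_FOUND>" on standard output.
    int main() {
        // Placeholder genome path; in a real job this file would be shipped to
        // every mapper (e.g., via a distributed cache).
        std::ifstream gin("genome.txt");
        std::stringstream buffer;
        buffer << gin.rdbuf();
        const std::string G = buffer.str();

        std::string query;
        while (std::getline(std::cin, query)) {
            if (query.empty()) continue;
            std::size_t pos = G.find(query);   // exact match; swap in tolerant search
            if (pos != std::string::npos)
                std::cout << query << "\t" << pos << "\n";
            else
                std::cout << query << "\tNOT_FOUND\n";
        }
        return 0;
    }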
III) Pattern matching with error tolerance – A suffix tree based algorithm
Implementation source folder: pmatch_st
This is a suffix tree based implementation that searches for each query against
the suffix tree built for the entire genome.
Assumptions:
1) The queries are of roughly the same length (i.e., negligible standard
deviation in the average query length).
2) There is sufficient memory available locally to each processor
to store the entire Genome input, its suffix tree, and part of the queries.
Serial Algorithm:
In this suffix tree based algorithm, each query is searched against the suffix
tree built for the entire genome. The main algorithmic steps are as follows:
(One-time preprocessing step)
1) Load the input genome and build the suffix tree for the genome. This can be done in
O(N) time using linear-time algorithms such as McCreight's algorithm. The code
provided in pmatch_st does this.
(Query matching)
2) Load the queries (incrementally or all at once).
3) For each query q ∈ Q:
i) Set “maximum allowed errors” := error tolerance (%) × query length (|q|)
ii) Search q in the suffix tree of G as follows:

// Search algorithm:
// A) FIND THE LONGEST MATCHING SUBSTRING POSITIONS
//    i) Start at the root and walk down the tree, comparing the query against
//       the tree path one character at a time, until one of the following happens:
//       a) there is a mismatch:
//          action: update the longest path if this is the longest so far;
//                  keep track of the matching locations
//                  (by querying the internal node immediately below the path);
//                  if the number of mismatches is greater than the cutoff,
//                  then quit the search
//       or, b) the query's length has been successfully exhausted:
//          action: select the longest matching path among the paths visited
//                  so far for extension
// B) EXTEND AND EVALUATE TO FIND THE BEST HIT
//    (ACCOUNTING FOR ERRORS BELOW THE CUTOFF)
//    i) extract the window of |q| characters from the genome and perform a
//       simple comparison
//    ii) score each window by the number of errors
//    iii) output and report the window with the least number of errors
This logic is implemented in the function st_search() of pmatch_st.cpp. The main API
calls to the suffix tree that this search function needs are those defined as
public methods in the SuffixTree class (SuffixTree.h).
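The sketch below mirrors the two-phase logic (find candidate positions, then extend and score windows) in a simplified form only: the tree walk is replaced by a brute-force helper that returns the starts of the longest exactly matching prefix of q, and the class and method names are placeholders, not the real SuffixTree API (st_search() in pmatch_st.cpp is the authoritative reference).

    #include <cstddef>
    #include <iostream>
    #include <string>
    #include <vector>

    // Placeholder stand-in: NOT the real SuffixTree API from SuffixTree.h.
    // longest_match() reports how many leading characters of q match somewhere
    // in G, and collects the genome positions where that longest prefix starts.
    struct SuffixTreePlaceholder {
        explicit SuffixTreePlaceholder(const std::string& genome) : G(genome) {}
        std::size_t longest_match(const std::string& q,
                                  std::vector<std::size_t>& starts) const {
            // Stub: a real tree walk would take O(|q|); brute force for brevity.
            std::size_t best = 0;
            for (std::size_t s = 0; s < G.size(); ++s) {
                std::size_t k = 0;
                while (s + k < G.size() && k < q.size() && G[s + k] == q[k]) ++k;
                if (k > best) { best = k; starts.clear(); }
                if (k == best && best > 0) starts.push_back(s);
            }
            return best;
        }
        const std::string& G;
    };

    // Phase B of the outlined search: extend each candidate start to a |q|-length
    // window and keep the window with the fewest mismatches (if under the cutoff).
    bool best_hit(const SuffixTreePlaceholder& tree, const std::string& q,
                  std::size_t max_errors, std::size_t& best_pos) {
        std::vector<std::size_t> starts;
        tree.longest_match(q, starts);                     // Phase A (tree walk)
        bool found = false;
        std::size_t best_err = max_errors + 1;
        for (std::size_t s : starts) {
            if (s + q.size() > tree.G.size()) continue;
            std::size_t err = 0;
            for (std::size_t i = 0; i < q.size(); ++i)
                if (tree.G[s + i] != q[i]) ++err;          // score the window
            if (err < best_err) { best_err = err; best_pos = s; found = true; }
        }
        return found && best_err <= max_errors;
    }

    int main() {
        const std::string genome = "ACGTACGTTGACCA";       // toy genome
        SuffixTreePlaceholder tree(genome);
        std::size_t pos = 0;
        if (best_hit(tree, "CGTTC", /*max_errors=*/1, pos))
            std::cout << "best hit at " << pos << "\n";
        return 0;
    }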
Parallelization:
All the parallelization methods described above for the Naïve approach can also be used
here. Of course, a more interesting (and much more challenging) extension would be
to build the suffix tree itself in parallel. There are some fairly complex solutions available
in the literature which may not be best suited for an undergraduate course curriculum. This
could, however, provide a great discussion point in class regarding the challenges in
parallel algorithm design.
Asymptotic analysis:
Time complexity:
Preprocessing time (one time):
O(N) to build the suffix tree in each processor for the entire genome
Query time:
O(l) per query, implying a total of O(ml) (=O(M)) for all queries.
In parallel, assuming linear scaling, this would imply O(M/p) time.
Space complexity:
Each processor stores the entire genome and O(m/p) queries.
Per processor peak space complexity = O(N + M/p).
The constant of proportionality of the suffix tree is roughly 150 in the
current implementation; that is, for every input byte the suffix tree takes about 150 bytes
(so, for example, a 10 MB genome would require roughly 1.5 GB for its suffix tree).
IV) Pattern matching with error tolerance – Naïve vs. suffix tree based approach: An evaluation
Both of the suggested algorithms (Naïve and suffix tree-based) have their
individual strengths and weaknesses.
The Naïve approach is certainly simple and straightforward to code. In fact, most of my
coding effort on that component was consumed in implementing the input/output
operations (e.g., FASTA file reading) that are necessary to load the inputs. The solution is
also very simple to analyze in terms of time and memory complexities. The only
disadvantage is that for each query the algorithm scans through the entire genome input.
Therefore, the approach cannot be considered scalable for very large genome sizes.
This is an issue in practice, because DNA databases are expanding in size, and it
is not desirable to have a method whose time for searching a single query depends
on the size of the database on that particular day! Perhaps one of the experiments that
the students can do is to check up to what genome size this implementation can
still finish in a practical amount of time (i.e., at most a few seconds per query).
The Suffix Tree approach overcomes the above disadvantage by performing a query
search in time proportional to the query’s length (and independent of the genome size).
The suffix tree construction is a one-time preprocessing activity. In the real world, if the
genomic database grows, then the tree has to be rebuilt occasionally (but not
frequently). Therefore, in a setting where the genome database does not change
frequently and a huge number of queries appear over time (something like a
Google search setting), a suffix tree based approach can be highly desirable.
It is also not much more complex to code the pattern matching component
using the suffix tree. The programmer only has to know how to navigate the tree top-down
from the root, and for that needs to understand the API of the SuffixTree class
(which has only a handful of public methods). In fact, I was able to reuse most of my
code from the naïve implementation in my suffix tree based pattern matching code
(compare st_search() in pmatch_st.cpp with the main function in pmatch_naive.cpp).
The only disadvantage of the suffix tree based approach is its space requirement.
Although linear, O(N), the constant of proportionality tends to be large in practice. In our
current implementation it is well over 100, which makes the approach harder to run on serial
computers with limited RAM. There are more space-efficient data structures, such as the suffix
array, but those could be difficult to cover as part of an undergraduate curriculum.
Nevertheless, this project could be a great introduction to the value of string data
structures in general.