Suffix trees: How to do Google search in bioinformatics?

Curriculum module designer: Ananth Kalyanaraman ([email protected])

This document is intended for the instructor. It contains information pertaining to the project and the different ways in which it can be designed. The Appendix provides template algorithms for parallel implementation under different programming models, and C++/MPI implementations are provided in the ProjectSource folder. The document is presented in the form of FAQs with the hope that the different questions about the project can be addressed in a focused manner. Please contact the module designer for clarifications and comments.

Q) What is the goal of this project and what are the expected learning outcomes?

The primary goal of this project is to introduce the problem of pattern matching and have the students design, implement and evaluate different algorithmic approaches to solve it. The pattern matching problem is the problem of checking whether a set of query sequences occur as substrings in a given string database (e.g., a genome). Upon successful completion, the students should:

a) Be able to design and analyze algorithms for the problem of pattern matching (and extend the ideas to other closely related string matching problems);
b) Have acquired extensive experience in using string data structures (both tree-based and array-based) and their APIs for matching problems;
c) Be able to identify the main challenges in implementing a parallel solution to string matching;
d) Be able to design and implement parallel approaches to support scalable processing of multiple queries and large sequence databases;
e) Be able to identify the tradeoffs among the different string matching approaches using different data structures/techniques;
f) Be able to identify the primary challenges in carrying the techniques forward from theory to practice;
g) Be able to identify the kinds of pattern matching techniques that are better suited for different practical settings (use-cases);
h) Be able to differentiate exact matching from inexact matching techniques;
i) Be able to identify and appreciate real-world applications that can benefit from the use of scalable pattern matching techniques.

Q) Are there other variants of the problem that can be implemented?

Yes, there are multiple variants of the original problem. Given a query sequence:

(i) The first, simple variant is that the pattern matching routine returns a boolean (true if the query is found as a substring, and false otherwise).
(ii) A second variant is that, in addition to returning the boolean, the function also returns the frequency of the query in the database, i.e., how many times each query occurs as a substring in the database.
(iii) A third variant is to expect the function to return not just the count but also the exact genomic positions at which the query occurs as a substring.
(iv) A fourth (albeit independent) variant could be to allow for some errors while matching a query against the input genome/database; in other words, treat the problem as one of inexact matching.
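To make the four variants concrete, the following is a minimal C++ interface sketch. The type and function names here are purely illustrative and are not taken from the provided codebase.

// Hypothetical interfaces illustrating the four variants of the problem;
// all names are illustrative only.
#include <cstddef>
#include <string>
#include <vector>

// (i) boolean: is the query a substring of the genome?
bool contains(const std::string& genome, const std::string& query);

// (ii) frequency: how many times does the query occur as a substring?
std::size_t count_occurrences(const std::string& genome, const std::string& query);

// (iii) positions: all genomic positions at which the query occurs
std::vector<std::size_t> find_occurrences(const std::string& genome, const std::string& query);

// (iv) inexact: best-matching position allowing up to max_errors mismatches;
// returns false if no window matches within the tolerance.
bool best_inexact_hit(const std::string& genome, const std::string& query,
                      std::size_t max_errors,
                      std::size_t& best_pos, std::size_t& best_errors);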
Q) What is contained in the "ProjectSource" folder? What implementations are provided?

The ProjectSource folder contains multiple codebases. The Appendix explains the algorithms implemented by the code in the source folder. All code is in C++ and MPI. Here is a brief outline of the folder structure:

a) Pmatch_naive: This folder contains the naïve search algorithm.
b) Pmatch_st: This folder contains the suffix tree based search algorithm.
c) scripts: This folder contains all the Perl scripts for use in the project. The scripts can be used to generate synthetic inputs (by simulating DNA sequencing) and to collect statistics about FASTA-formatted sequence files. A Readme.txt is provided to help use the scripts.
d) data: This folder contains numerous FASTA files which can be used as test and experimental inputs (both database and queries).

Q) What kind of background should the students have to work on this project?

a) Students should have strong programming experience in C/C++. Additional knowledge of the STL is a plus.
b) Students should have taken (or be co-enrolled in) the CS undergraduate algorithm design and analysis course, and should have already taken the undergraduate data structures course.
c) Students need not have any biological background.
d) Students should be comfortable in at least one of the standard parallel programming environments, such as MPI, MapReduce or OpenMP.

Q) Does the project have a multiscale component – either in terms of data or in terms of computation?

Yes. The ProjectSource folder has two subfolders:

a) Pmatch_naive: implements a naïve searching algorithm that is better suited for smaller input sizes (genome size and number of queries).
b) Pmatch_st: implements a suffix tree based searching algorithm that is better suited for use-cases where the genome (big or small) needs to be preprocessed only once (or infrequently), while large batches of queries arrive over a period of time.

Runtime analysis of these two algorithms is provided in the Appendix. Both algorithms can be run on a single processor or on multiple processors, providing the project with multiple computation scales. In addition to working implementations in the MPI model, ideas for parallelizing under other frameworks such as OpenMP and MapReduce are provided in the Appendix.

Also, the project allows for three different test plans when it comes to performance analysis (as listed in the sample job script, sub.sh, in the test folders of these codebases):

a) studying runtime as a function of genome size (keeping the number of queries and the number of processors fixed);
b) studying runtime as a function of the number of queries (keeping the genome and the number of processors fixed);
c) studying runtime as a function of the number of processors (keeping the genome and the number of queries fixed).

Numerous input files (both genomes and query files) are provided as part of the package (see the "data" folder). These can be used to test and scale the performance of the different implementations. In addition, Perl scripts are provided in the "scripts" folder (see readgenerator.pl) to generate arbitrarily sized inputs. Please refer to the readme file in that folder for more details.

Q) How do I know if the students have designed their algorithm well? Are reference algorithms provided along with the project?

Yes, please refer to the Appendix section below.

Q) Are there extensions and other variants of this project that can be pursued?

Yes, please refer to the Appendix section below.

Q) How long is the project expected to take, and is it a team-based project?

The expected duration for implementing both projects (naïve and suffix tree based) is roughly 2 weeks, assuming the students are proficient at writing programs in C/C++ and have basic programming experience in MPI/OpenMP/MapReduce. It is desirable that the students work in teams of 2 or 3.
Q) From the ProjectSource folder, what are the project components that the students should be asked to implement, and what parts should be provided to them (as libraries) for use in the project?

For the naïve implementation: The entire codebase can be expected to be implemented by the students. Alternatively, if the instructor wants the students to save some time on reading the FASTA file and to focus more on the implementation details of pattern matching, then the instructor can provide the FastaReader library (API: fasta_reader.h, with implementation in fasta_reader.cpp), which the students should be able to use (as shown in pmatch_naive.cpp and pmatch_st.cpp) to load the input sequence files, both genome and queries.

For the suffix tree based implementation: The students are expected to be provided with the suffix tree data structure API and implementation, in SuffixTree.h and SuffixTree.cpp respectively. They need to use this library to build the suffix tree on the genome, and then use the tree, through its well-documented API functions (refer to the public methods in the SuffixTree class), to implement pattern matching (as shown in pmatch_st.cpp).

Q) What is the practical relevance of solving the pattern matching problem? Are there real-world applications (both in bioinformatics and outside) that can benefit from an efficient implementation for this problem?

Yes. Very briefly, the formulation of pattern matching targeted here has direct applications in genome read-to-reference mapping and genome searching. More specifically, the genome read-to-reference mapping problem is a common theme in resequencing projects, where a biologist has: i) a reference (already sequenced) genome that is representative of a species, and alongside it ii) new raw reads (sequences) collected from specific individuals of that species (or a closely related one). The goal is to find the minor variations (called single nucleotide polymorphisms [SNPs]) in the individual's genome relative to the reference. The second problem is genome searching, where a biologist has sequenced a new genic sequence and wants to locate it along the source genome or in genomes from closely related species (this is common in comparative genomics projects).

APPENDIX

I) Pattern matching with error tolerance – Problem definition

Inputs:
{G: Genome FASTA file containing a single genome sequence with N characters}
{Q: Queries FASTA file containing m queries, each of possibly a different length}
{Error tolerance level expressed in % (of query length)}

Problem statement: Search for the input queries in the genome, and report the topmost hit for each query. The "topmost hit for a query" is defined as the location in the genome which matches the query with the least number of mismatching characters (errors), where the number of errors is no more than the error tolerance level indicated by the user. If there is no such location, then the code outputs a "query not found" message. If there is more than one such location, then any one of those locations (chosen arbitrarily) is output.

Possible extensions:
1) One possible extension could be to ask the students to report all locations where a query is found.
2) Another extension could be to have the students implement a more optimized version of their code assuming no errors (i.e., queries need to match exactly to be reported as found).
3) A much more involved extension would be to assume there is no genome and instead, given only a set of queries, find all other queries in the set that match significantly with each query. This is equivalent to performing an all-against-all comparison.
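To make the "topmost hit" definition concrete, here is a minimal serial sketch in C++. It assumes the genome and query are held in plain std::strings; the function names are illustrative and are not taken from the provided codebase.

// Minimal sketch of the "topmost hit" computation for a single query.
#include <cstddef>
#include <string>

// Count mismatches between query q and the genome window starting at pos,
// stopping early once the error budget is exceeded.
static std::size_t window_errors(const std::string& G, const std::string& q,
                                 std::size_t pos, std::size_t max_errors) {
    std::size_t errors = 0;
    for (std::size_t i = 0; i < q.size() && errors <= max_errors; ++i)
        if (G[pos + i] != q[i]) ++errors;
    return errors;
}

// Return true and fill best_pos/best_errors if some window matches q with at
// most max_errors mismatches; otherwise return false ("query not found").
bool topmost_hit(const std::string& G, const std::string& q,
                 std::size_t max_errors,
                 std::size_t& best_pos, std::size_t& best_errors) {
    bool found = false;
    best_errors = max_errors + 1;
    for (std::size_t pos = 0; pos + q.size() <= G.size(); ++pos) {
        std::size_t e = window_errors(G, q, pos, max_errors);
        if (e <= max_errors && e < best_errors) {
            best_errors = e;
            best_pos = pos;
            found = true;
            if (e == 0) break;  // cannot improve on an exact match
        }
    }
    return found;
}

The naïve algorithm in Section II below slides this comparison across all N-|q|+1 windows; the suffix tree based algorithm in Section III uses the tree to restrict which windows need to be extended and evaluated.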
II) Pattern matching with error tolerance – A naïve algorithm

Implementation source folder: pmatch_naive

A naïve serial algorithm:

Assumptions:
1) The queries are of roughly the same length (i.e., negligible standard deviation around the average query length).
2) There is sufficient memory available (local RAM) to store the entire genome input.

This is a naïve implementation that does a brute-force search for each query against the entire genome, by sliding a window of the length of the query along the genome and enumerating only those positions that correspond to the topmost hit. Note: This version does *NOT* use suffix trees or any other sophisticated string data structure. The steps are outlined below:

1) Load the input genome into a string array in memory (call it G[1..N]). (Optionally, read all the queries at once, or read the input queries one at a time.)
2) For each query q in Q:
   i) Set "maximum allowed errors" := error tolerance (%) x query length (|q|).
   ii) Slide a window of length |q| along the entire length of the genome. (Note: there will be N-|q|+1 windows.)
   iii) For each window:
       a. compare q against the characters in that window;
       b. if the number of mismatching positions is below the maximum allowed errors for this query, then check if it is the best hit so far; if it is, then update the best-hit information and remember the genomic location (i.e., the window).
   iv) At the end, output one best hit for the query.

Asymptotic analysis:

Let
m: number of queries
M: sum of the lengths of all the queries
N: length of the genome
l: average length of a query
p: number of processors (used in the parallel analysis)

Time complexity: O(lN) per query, implying a total of O(mlN) (= O(MN)) for all queries.
Space complexity: Each processor stores the entire genome, and queries can be read one at a time, so the space complexity is O(N).

Parallel algorithm for MPI platforms:

Assuming the lengths of the queries are all roughly equal to one another (i.e., negligible standard deviation), this algorithm is easy to parallelize (a sketch of this scheme is given at the end of this section):
1) Each processor loads the entire genome G.
2) Each processor loads O(m/p) of the queries. This can be done in a distributed manner without needing any communication.
3) Each processor runs the serial algorithm to search its local set of O(m/p) queries against the genome.

Let p be the number of processors. This algorithm takes O(MN/p) time and O(N + M/p) space.

Parallel algorithm for OpenMP platforms:

The same approach and analysis carry forward to an OpenMP multithreaded setting as well, with some minor changes:
1) The master thread loads the entire genome G.
2) The master thread also loads all the queries into memory.
3) The master thread spawns p worker threads.
4) Each worker thread picks an unclaimed query (the next in the queue) and runs the serial algorithm to search it against the genome. As threads pick a query they also mark it in the query array so that other threads do not pick it. This is the only step that needs a lock/unlock.

Parallel algorithm for MapReduce platforms:

One MapReduce algorithm is as follows. Let us assume there are p_m mappers and p_r reducers.
1) Each mapper loads roughly O(m/p_m) queries (using the input splitter function).
2) Each mapper loads the entire genome once.
3) Each mapper then searches each local query against the genome and outputs the result.
4) There is no need for reducers in this model.

Another variant of the MapReduce algorithm could be that each mapper reads roughly an equal portion of both the query set and the genome. Each mapper then compares its local queries against the locally stored portion of the genome, and emits intermediate <key, value> pairs of the form <query id, best local hits>. The MapReduce shuffle stage then gathers all hits corresponding to each query at a designated reducer, and that reducer reports the global best hit.
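The following is a minimal MPI sketch of the query-partitioning scheme described above for the naïve approach. The FASTA-loading helpers and the topmost_hit() kernel are hypothetical placeholders (the provided codebase has its own FastaReader and search routines); a cyclic distribution of queries by rank is one simple way to realize the O(m/p) split without any communication.

#include <mpi.h>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

std::string load_fasta_genome(const char* path);                 // hypothetical helper
std::vector<std::string> load_fasta_queries(const char* path);   // hypothetical helper
bool topmost_hit(const std::string& G, const std::string& q,     // as in the Appendix I sketch
                 std::size_t max_errors, std::size_t& pos, std::size_t& errs);

int main(int argc, char* argv[]) {
    MPI_Init(&argc, &argv);
    int rank = 0, p = 1;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // 1) Every rank loads the entire genome (no communication needed).
    std::string G = load_fasta_genome(argv[1]);

    // 2) Every rank reads the query file and keeps only ~m/p queries:
    //    here, rank r processes queries r, r+p, r+2p, ...
    std::vector<std::string> Q = load_fasta_queries(argv[2]);
    const double tolerance = 0.10;  // 10% of the query length, for illustration

    // 3) Each rank searches its local queries with the serial algorithm.
    for (std::size_t i = rank; i < Q.size(); i += p) {
        std::size_t max_errors = static_cast<std::size_t>(tolerance * Q[i].size());
        std::size_t pos = 0, errs = 0;
        if (topmost_hit(G, Q[i], max_errors, pos, errs))
            std::printf("query %zu: best hit at position %zu with %zu errors\n", i, pos, errs);
        else
            std::printf("query %zu: not found\n", i);
    }

    MPI_Finalize();
    return 0;
}

Since the ranks never communicate after the initial load, this sketch follows the O(MN/p) analysis above; load imbalance only appears when query lengths vary widely, which is why assumption 1 matters.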
III) Pattern matching with error tolerance – A suffix tree based algorithm

Implementation source folder: pmatch_st

This is a suffix tree based implementation that searches for each query against the suffix tree built for the entire genome.

Assumptions:
1) The queries are of roughly the same length (i.e., negligible standard deviation around the average query length).
2) There is sufficient memory available locally on each processor to store the entire genome input, its suffix tree, and part of the queries.

Serial algorithm:

In this suffix tree based algorithm, each query is searched against the suffix tree built for the entire genome. The main algorithmic steps are as follows:

(One-time preprocessing step)
1) Load the input genome and build the suffix tree for the genome. This can be done in O(N) time using linear-time algorithms such as McCreight's algorithm. The code provided in pmatch_st does this.

(Query matching)
2) Load the queries (incrementally or all at once).
3) For each query q in Q:
   i) Set "maximum allowed errors" := error tolerance (%) x query length (|q|).
   ii) Search q in the suffix tree of G as follows:

// Search algorithm:
// A) FIND THE LONGEST MATCHING SUBSTRING POSITIONS
//    i) Start at the root and walk down the tree, comparing the query against
//       the tree path one character at a time, until one of the following happens:
//       a) there is a mismatch:
//          action: update the longest path if this is the longest so far;
//                  keep track of the matching locations
//                  (by querying the internal node immediately below the path);
//                  if the number of mismatches is greater than the cutoff, then
//                  quit the search;
//       or, b) the query's length has been successfully exhausted:
//          action: select the longest matching path among the paths visited so
//                  far for extension.
// B) EXTEND AND EVALUATE TO FIND THE BEST HIT (ACCOUNTING FOR ERRORS BELOW THE CUTOFF)
//    i) extract each candidate window of |q| characters from the genome and perform
//       a simple comparison;
//    ii) score each window by its number of errors;
//    iii) output and report the window with the least number of errors.

This logic is implemented in the function st_search() of pmatch_st.cpp. The main API calls to the suffix tree which this search function needs to use are those defined as "public methods" in the SuffixTree class (SuffixTree.h).

Parallelization:

All the parallelization methods described above for the naïve approach can also be used here (an OpenMP sketch of query-level parallelism is given below). Of course, a more interesting (and much more challenging) extension would be to build the suffix tree itself in parallel. There are some fairly complex solutions available in the literature, which may not be best suited for an undergraduate course curriculum. This could, however, provide a great discussion point in class regarding the challenges in parallel algorithm design.
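As one concrete instance of reusing the naïve approach's parallelization here, the following is a minimal OpenMP sketch of query-level parallelism. The search_index() function is a hypothetical stand-in for the suffix tree walk plus the extend-and-evaluate step performed by st_search(); it is not the API of the provided SuffixTree class.

#include <omp.h>
#include <cstddef>
#include <cstdio>
#include <string>
#include <vector>

struct Hit { std::size_t pos; std::size_t errors; bool found; };

// Hypothetical stand-in for the suffix-tree search of one query.
Hit search_index(const std::string& genome, const std::string& query,
                 std::size_t max_errors);

void search_all(const std::string& genome, const std::vector<std::string>& queries,
                double tolerance) {
    // schedule(dynamic) hands one query at a time to whichever thread is idle,
    // which plays the role of "pick the next unclaimed query" without an
    // explicit lock on the query array. Output lines may interleave across threads.
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(queries.size()); ++i) {
        std::size_t max_errors =
            static_cast<std::size_t>(tolerance * queries[i].size());
        Hit h = search_index(genome, queries[i], max_errors);
        if (h.found)
            std::printf("query %ld: best hit at position %zu with %zu errors\n",
                        i, h.pos, h.errors);
        else
            std::printf("query %ld: not found\n", i);
    }
}

Using the dynamic schedule is a design choice: it achieves the same effect as the lock-protected "unclaimed query" array described in Section II, while leaving the bookkeeping to the OpenMP runtime.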
Asymptotic analysis:

Time complexity:
Preprocessing time (one time): O(N) to build the suffix tree on each processor for the entire genome.
Query time: O(l) per query, implying a total of O(ml) (= O(M)) for all queries. In parallel, assuming linear scaling, this would imply O(M/p) time.

Space complexity: Each processor stores the entire genome and O(m/p) queries. The per-processor peak space complexity is O(N + M/p). The constant of proportionality of the suffix tree is roughly 150 in the current implementation; that is, for every input byte the suffix tree takes about 150 bytes.

IV) Pattern matching with error tolerance – Naïve vs. suffix tree based approach: An evaluation

Both suggested algorithms (naïve and suffix tree based) have their individual strengths and weaknesses.

The naïve approach is certainly simple and straightforward to code. In fact, most of my coding effort on that component was consumed in implementing the input/output operations (e.g., FASTA file reading) necessary to load the inputs. The solution is also very simple to analyze in terms of time and memory complexity. The main disadvantage is that, for each query, the algorithm scans through the entire genome input. Therefore, the approach cannot be considered scalable for very large genome sizes. This is an issue in practice, because DNA databases keep expanding in size, and it is not desirable to have a method whose time for searching a single query depends on the size of the database on that particular day! One experiment the students can do is to check up to what genome size this implementation can still finish in a practical amount of time (i.e., at most a few seconds per query).

The suffix tree approach overcomes this disadvantage by performing a query search in time proportional to the query's length (and independent of the genome size). The suffix tree construction is a one-time preprocessing activity. In the real world, if the genomic database grows, then the tree has to be rebuilt occasionally (but never frequently). Therefore, in a setting where the genome database does not change frequently and a huge number of queries arrive over time (something like a Google search setting), the suffix tree based approach can be highly desirable. It is also not much more complex to code the pattern matching component using the suffix tree: the programmer only has to know how to navigate the tree top-down from the root, and for that needs to understand the API of the SuffixTree class (which has only a handful of public methods). In fact, I was able to reuse most of my code from the naïve implementation in my suffix tree based pattern matching code (compare st_search() in pmatch_st.cpp with the main function in pmatch_naive.cpp).

The only disadvantage of the suffix tree based approach is its space requirement. Although linear, O(N), the constant of proportionality tends to be large in practice. In our current implementation it is well over 100, which makes the approach harder to run on serial computers with limited RAM (a back-of-the-envelope estimate is given below). There are more space-efficient data structures, such as suffix arrays, but those could be difficult to cover as part of an undergraduate curriculum. Nevertheless, this project can be a great introduction to the value of string data structures in general.
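As a back-of-the-envelope illustration of the space analysis above, using the ~150 bytes-per-character constant of the current implementation (the 100 Mbp genome size is an illustrative assumption, not a figure from the project data):

\[
\text{per-processor memory} \;\approx\; \underbrace{N}_{\text{genome}} \;+\; \underbrace{c\,N}_{\text{suffix tree},\ c \approx 150} \;+\; \underbrace{M/p}_{\text{local queries}}
\]
\[
N = 10^{8}\ \text{characters (100 Mbp)} \;\Rightarrow\; c\,N \approx 1.5 \times 10^{10}\ \text{bytes} \approx 15\ \text{GB for the tree alone.}
\]

Estimates of this kind can help students decide which of the provided genome inputs are feasible for the suffix tree based implementation on their available hardware.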