Journal of Engineering Research and Studies, E-ISSN 0976-7916
Research Article
ENHANCED SUFFIX ARRAY TO PROTEIN SEQUENCE ALIGNMENT
A. Kunthavai*1, S. Vasantharathna2
Address for Correspondence
1 Assistant Professor (SG), Department of CSE/IT, Coimbatore Institute of Technology, Coimbatore, Tamilnadu, India
2 Associate Professor, Department of Electrical Engineering, Coimbatore Institute of Technology, Coimbatore, Tamilnadu, India
ABSTRACT
Sequence alignment is a popular bioinformatics application that determines the degree of similarity between nucleotide or amino acid sequences that are assumed to share ancestral relationships. A sequence alignment method reads a query sequence from the user, aligns it against large protein and gene sequence data sets, and locates targets that are similar to the query. Traditional accurate algorithms, such as Smith-Waterman and FASTA, are computationally very expensive, which limits their use in practice. The current popular search tools, such as BLAST and WU-BLAST, employ heuristics to improve the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable. This paper presents the ESAPRO Tool, which uses an enhanced suffix array to perform accurate and faster biological sequence analysis, improving on the computation time of existing tools in this area. The main idea is to pick matched patterns of the query sequence and identify sequences in the database that share a large number of these matched patterns. The database is then reduced to the few sequences found closest to the query sequence. Experimental results are cross validated using a data mining technique. They show that the ESAPRO Tool effectively reduces the database and obtains results very similar to those of the traditional algorithms in approximately half the time.
JERS/Vol.II/ Issue III/July-September,2011/73-77
KEYWORDS
Sequence alignment, enhanced suffix array, local alignment, data mining
INTRODUCTION
Bioinformatics is the field of analyzing biological information using computers and statistical techniques; it is the science of developing and utilizing computer databases and algorithms to accelerate and enhance biological research. Sequence alignment is one of the most fundamental operations in bioinformatics. It has been successfully applied to predict the function, structure and evolution of biological sequences. It can reveal biological relationships among organisms, for example by uncovering evolutionary information or helping to determine the causes and cures of diseases. Sequence comparison is a basic operation of the protein sequencing problem. Sequences can be aligned across their entire length (global alignment) or only in certain regions (local alignment). Local sequence alignment plays a major role in the analysis of DNA and protein sequences [1]. This paper describes pairwise local alignment, which is the basic step for many other applications such as detecting homology, finding protein structure and function, and deciphering evolutionary relationships. Smith and Waterman developed a dynamic programming approach to the sequence alignment problem that is widely used [2]. BLAST [3], FASTA [4] and WU-BLAST [7] are three commonly used programs for similarity searching on biological sequences. While most of the methods in use are based on heuristic paradigms and have a relatively fast execution time, they do not produce optimal alignments. An increasing number of organisms have been entirely sequenced, and this reality presents the need for comparing long protein sequences, which is a challenging task because of its high computational requirements (power and memory). One important feature of BLAST is its ability to compare a query with a database of sequences.
Considering the rapid growth of database sizes, this problem demands ever-growing computational resources and remains a computational challenge. D. R. Singh et al. developed the Sequence Comparison Tool (SCT) [5], which searches for sequence similarity in a database instead of performing pairwise sequence alignment. This tool preprocesses the database to create a special generalized suffix tree from the sequences in the database. The suffix tree [5] is one of the most important data structures in string processing and comparative genomics. The space consumption of the suffix tree is a bottleneck in large-scale applications such as genome analysis. To overcome this bottleneck, in this paper the suffix tree is replaced with a suffix array enhanced with the lcp-table. A generalized suffix array (GSArray) is formed for the query sequence, the longest common subsequence (LCS) is obtained and compared against the database, and the sequences that contain the LCS are identified from the database. Sequences are ranked with respect to the number of significant patterns they share with the query sequence. Finally, the database is reduced by selecting only a given number of sequences with the topmost ranks.
2. AN OVERVIEW OF ENHANCED SUFFIX ARRAY
Every algorithm that uses a suffix tree as its data structure can systematically be replaced with an algorithm that uses an enhanced suffix array [6] and solves the same problem in the same time complexity but with improved space complexity. The generic name enhanced suffix array stands for data structures consisting of the suffix array and additional lcp tables. Let Σ be a finite ordered alphabet. Σ* is the set of all strings over Σ, and Σ+ denotes the set Σ* \ {ε} of nonempty strings. Let S be a string of length |S| = n over Σ. To simplify the analysis, it is assumed that the size of the alphabet is a constant and that n < 2^32. In the case of proteins, the alphabet is basically composed of 20 characters.
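As a rough illustration of the data structure named above, the following Python sketch builds a suffix array and its lcp table for the example string S = acaaacatat$ used in this paper's Table 1. It is not the construction algorithm of [6]: it sorts the suffixes naively, which is far slower than the methods cited, and the function names `suffix_array` and `lcp_table` are assumptions made for this example. The sketch also assumes that the sentinel $ sorts after every other character, which is the ordering Table 1 implies.

```python
def suffix_array(text: str, sentinel: str = "$") -> list[int]:
    """Return the start positions of all suffixes of `text` in sorted order.

    Assumption: the sentinel is treated as larger than every other character.
    Naive construction by sorting whole suffixes, for illustration only.
    """
    def rank(ch: str) -> int:
        # 0x110000 is above every valid Unicode code point.
        return 0x110000 if ch == sentinel else ord(ch)

    return sorted(range(len(text)), key=lambda i: [rank(c) for c in text[i:]])


def lcp_table(text: str, suftab: list[int]) -> list[int]:
    """lcptab[i] = length of the longest common prefix of the suffixes
    starting at suftab[i-1] and suftab[i]; lcptab[0] is 0 by convention."""
    lcptab = [0] * len(suftab)
    for i in range(1, len(suftab)):
        a, b = text[suftab[i - 1]:], text[suftab[i]:]
        k = 0
        while k < min(len(a), len(b)) and a[k] == b[k]:
            k += 1
        lcptab[i] = k
    return lcptab


S = "acaaacatat$"
suftab = suffix_array(S)
lcptab = lcp_table(S, suftab)
for i, (pos, lcp) in enumerate(zip(suftab, lcptab)):
    print(i, pos, lcp, S[pos:])
```

The two lists together are what the text calls an enhanced suffix array in its simplest form: integer tables over the input string, with no explicit tree nodes.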
2.1 Suffix Tree
A suffix tree for the string S is a rooted directed tree with exactly n+1 leaves numbered 0 to n. Each internal node, other than the root, has at least two children, and each edge is labeled with a nonempty substring of S$. No two edges out of a node can have edge-labels beginning with the same character. The key feature of the suffix tree is that for any leaf i, the concatenation of the edge-labels on the path from the root to leaf i exactly spells out the string Si, where Si = S[i..n − 1]$ denotes the ith nonempty suffix of the string S$, 0 ≤ i ≤ n. The space and time complexity of this suffix tree is O(n). Fig. 1 shows the suffix tree for the string S = acaaacatat.
Figure 1. The suffix tree for S = acaaacatat.
2.2 Suffix Array
More space-efficient data structures than the suffix tree exist. The most prominent one is the suffix array [8], which requires only 4n bytes (4 bytes per input character) in its basic form. The suffix array is designed for efficient searching of a large text; a search can be performed by binary search over the suffix array. Suffix trees can be constructed in O(n) time in the worst case, versus O(n log n) time for suffix arrays. Nevertheless, suffix arrays prove to be better than suffix trees for many applications. The suffix array (denoted by suftab) of the string S is an array of integers in the range 0 to n, specifying the lexicographic ordering of the n + 1 suffixes of the string S$. That is, S(suftab[0]), S(suftab[1]), . . . , S(suftab[n]) is the sequence of suffixes of S$ in ascending lexicographic order (with $ sorted after all other characters), as shown in Table 1.
Table 1. The lcp-interval table of the enhanced suffix array of S = acaaacatat$
 i    suftab   lcptab   S(suftab[i])
 0    2        0        aaacatat$
 1    3        2        aacatat$
 2    0        1        acaaacatat$
 3    4        3        acatat$
 4    6        1        atat$
 5    8        2        at$
 6    1        0        caaacatat$
 7    5        2        catat$
 8    7        0        tat$
 9    9        1        t$
 10   10       0        $
2.3 Suffix array generation
Creating a suffix array for a string of length m requires O(m log m) time, and searching it for a pattern of length n requires O(n log m) time. To make the search faster still, the enhanced suffix array is used, which is the basic suffix array enhanced with the lcp-table. The enhanced suffix array is generated with information about the internal nodes of the suffix tree in the lcptab and suftab fields using the algorithm of Abouelhoda et al. [6]. lcptab[i] stores the length of the longest common prefix of the suffixes suftab[i] and suftab[i−1]. The longest common prefix is the longest string that is a prefix of two or more strings. The lcp-table can be constructed in linear time and is shown in Table 1. The lcp-interval tree of the enhanced suffix array of the given sequence is shown in figure 2.
Figure 2. The lcp-interval tree of S = acaaacatat$. [Nodes: 0-[0..10] at the root; 1-[0..5], 2-[6..7] and 1-[8..9] below it; 2-[0..1], 3-[2..3] and 2-[4..5] below 1-[0..5].]
3.0 METHODOLOGY
The enhanced suffix array is generated for the Oryza Sativa protein sequences in the database using the algorithm of Abouelhoda et al. [6] and is named GSArray. Substrings of a string S in GSArray are used to identify similarity between homologous sequences, because similar sequences contain conserved regions. Significantly similar sequences are identified with the help of a score. A pattern with a high score carries important information that belongs to a family of sequences with a highly conserved region. The score is calculated from the length of the pattern and its frequency, i.e., the number of occurrences in the database. A given pattern p is classified as significant if it satisfies the following constraints:
• The length of p ≥ a given length-threshold: a significant pattern must be sufficiently long to carry important biological information.
• The score of p ≥ a given score-threshold: a significant pattern must have a sufficiently high score.
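The two constraints above can be sketched as a simple filter. The exact scoring formula is not reproduced here, so `placeholder_score` below (length multiplied by frequency) is purely a hypothetical stand-in chosen for illustration, not the paper's equation (1). The class name `Pattern`, the threshold values and the sample patterns are likewise invented for the example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pattern:
    text: str        # the pattern p
    frequency: int   # number of occurrences of p in the database


def placeholder_score(p: Pattern) -> float:
    # HYPOTHETICAL stand-in for the paper's score function; the text only
    # states that the score depends on pattern length and frequency.
    return len(p.text) * p.frequency


def is_significant(
    p: Pattern,
    length_threshold: int,
    score_threshold: float,
    score: Callable[[Pattern], float] = placeholder_score,
) -> bool:
    """A pattern is significant if it is long enough AND scores high enough."""
    return len(p.text) >= length_threshold and score(p) >= score_threshold


# Invented toy patterns: too short, significant, and long but too rare.
candidates = [Pattern("MKV", 9), Pattern("MKVLLGA", 4), Pattern("MKVLLGAATQ", 1)]
significant = [
    p.text
    for p in candidates
    if is_significant(p, length_threshold=5, score_threshold=20)
]
print(significant)
```

Passing the score function as a parameter keeps the filter independent of whichever formula is actually used.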
GSArray is constructed for all the sequences in the database. While constructing GSArray, at each node i the length l(pi) and the frequency f(pi) (i.e., the number of occurrences of pi in the database) are stored for the corresponding pattern pi; this frequency is incremented for every new node. Then the score function w(pi) is calculated from l(pi) and f(pi) using equation (1). The score function w(pi) is used to find the significant similarity of pattern pi. Then the query sequence Q and the number of sequences to be selected from the database are read from the user, and Q is temporarily added to GSArray. This makes it possible to determine which suffixes of the query are shared by the sequences in the database. The query sequence is added only temporarily so that GSArray is not affected for future sequence searches. Initially the visit flags of all nodes in GSArray are 0. When the suffixes of the query sequence Q are added, the flag of each node visited is set to 1. This expedites the search for common patterns within GSArray, because only those paths for patterns that contain substrings of the query sequence are examined. Starting at the root, the nodes are visited in depth-first manner to check for the value 1. If the current node has no child whose value is 1, the search backtracks to its parent node. During this traversal all the significant patterns are collected. The sequences may share other common patterns that are not significant. An optimal alignment between two sequences in an ideal case contains all significant patterns. After this process the query sequence is deleted from GSArray. The top ten significant patterns are selected and stored. The sequences that contain significant patterns are extracted and stored. A reverse check is made to verify the accuracy of the results; it computes how many of the chosen patterns are shared by each of the sequences already extracted. The higher this number (weight), the greater the similarity of the corresponding sequence to the query.
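The reverse check and the weighting it produces can be sketched as follows. This is a minimal illustration that tests pattern membership with plain substring search rather than through GSArray. The function name `rank_by_shared_patterns`, the sequence names and the toy sequences are all assumptions made for the example.

```python
def rank_by_shared_patterns(
    sequences: dict[str, str],
    patterns: list[str],
    top_n: int,
) -> list[tuple[str, int]]:
    """Weight each sequence by how many chosen patterns it contains,
    then return the top_n (name, weight) pairs, highest weight first."""
    weights = {
        name: sum(1 for p in patterns if p in seq)
        for name, seq in sequences.items()
    }
    # Sort by descending weight; ties are broken by name for determinism.
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Invented toy database and patterns, for illustration only.
database = {
    "seq1": "MKVLLGAAT",
    "seq2": "MKVQQQ",
    "seq3": "TTTT",
    "seq4": "LLGAAT",
}
chosen_patterns = ["MKV", "LLG", "AAT"]
reduced = rank_by_shared_patterns(database, chosen_patterns, top_n=2)
print(reduced)
```

Here `seq1` contains all three patterns and `seq4` contains two, so these two sequences would form the reduced database for this toy query.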
Based on this weight the sequences are ranked. The top n sequences are transferred to a new database. The algorithm for ESAPRO Tool is shown in figure 3.
// Input: set of PROTEIN sequences
// Output: reduced set of PROTEIN sequences
S1: Read PROTEIN database;
S2: Construct GSArray for the input sequences, set node visit = 0;
S3: While constructing the suffix array, store the information of label-length, frequency at nodes and sequences traversing through the branches;
S4: Use equation (1) to calculate the degree of similarity of patterns in the form of prefixes;
S5: Read query sequence Q and the number of sequences n to select from the user;
S6: Temporarily add the suffixes of the query to the generalized GSArray;
S7: While adding the suffixes, highlight the nodes of the paths which are traversed by the query sequence;
S8: Post-process the GSArray to extract patterns shared by the query sequence that lie above a defined threshold on the function-value;
S9: Pick the top ten of these patterns and store;
S10: Do a reverse check to compute the weight of each sequence in the subset;
S11: Rank the sequences according to these weights;
S12: Pick top n sequences from the subset and write to a new database;
Figure 3 ESAPRO Tool Algorithm
3.1 Cross validation
The real-world protein sequence database of the Oryza Sativa (GSS) group is extracted from the NCBI website, and an enhanced suffix array is formed for all the sequences in the database. Based on the user-given query sequence, the significant patterns are generated. The sequences in the protein database are given weights according to the number of patterns they contain. The reduced database is formed by selecting the top n sequences with the highest ranks and writing them into a new database. A data mining technique, 7-fold cross validation, is used to validate the results obtained from ESAPRO Tool and WU-BLAST.
The main idea of the 7-fold cross validation approach is "train on 6 folds, test on 1 fold". The data set is divided into 7 parts; 6/7 of the data are used as the training data set and the remaining 1/7 as the testing data set. For a single user-given query sequence, ESAPRO Tool is applied to the training data set and WU-BLAST is then run on the reduced database. For the same query sequence, WU-BLAST alone is applied to the testing data set. The average of the seven runs is computed for analysis. In this paper seven such queries are taken and analyzed. 49 sequences from Oryza Sativa are taken as the database set; 42 sequences are used as the training data set and 7 sequences as the test data set. A single query sequence is first applied to ESAPRO Tool and the database is reduced. WU-BLAST is performed on the new database with the same query sequence. Then, for the same sequence, WU-BLAST alone is applied. The sequences in the training and test data sets are interchanged and the above steps are repeated until every fold has been used for testing. The average of the results from the 7 runs is calculated and stored. This is repeated for 7 different queries. The objective is to test whether the results are consistent for all the queries on a particular database in terms of computation time.
4.0 RESULT AND DISCUSSION
The reduced database from ESAPRO Tool is cross validated using the 7-fold cross validation approach. The cross validation results of seven different queries are shown in figures 4 and 5. The idea is to test whether the results are consistent for all queries on a particular database in terms of computation time. The first series in the figures represents the computation time obtained after ESAPRO Tool and the second represents the result of applying WU-BLAST alone. The figures show that the results obtained by ESAPRO Tool are consistent, that sequence comparison is performed with good accuracy, and that a practical time improvement is achieved over WU-BLAST.
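The 7-fold partitioning described in Section 3.1 (49 sequences, with 42 used for training and 7 for testing in each run) can be sketched as follows. This shows only how the folds are formed and rotated; running ESAPRO Tool or WU-BLAST on each split is outside the sketch. The function name `k_fold_splits` and the sequence identifiers are placeholders invented for the example.

```python
from typing import Iterator


def k_fold_splits(items: list[str], k: int = 7) -> Iterator[tuple[list[str], list[str]]]:
    """Yield (train, test) pairs; each fold serves as the test set once."""
    # Deal items round-robin into k folds so fold sizes differ by at most 1.
    folds = [items[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [x for j, fold in enumerate(folds) if j != held_out for x in fold]
        yield train, test


# Placeholder identifiers standing in for the 49 database sequences.
sequence_ids = [f"seq{i:02d}" for i in range(49)]
splits = list(k_fold_splits(sequence_ids, k=7))
for run, (train, test) in enumerate(splits, start=1):
    print(run, len(train), len(test))
```

Averaging a per-run measurement such as computation time over the seven `(train, test)` pairs then gives the figure reported for one query.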
The enhanced suffix array algorithm used in ESAPRO Tool requires about 5n bytes (5 bytes per input character), whereas SCT, which uses a suffix tree, requires 20n bytes. The space requirement of ESAPRO Tool is therefore approximately four times smaller than that of SCT (20n / 5n = 4). Experimental results show that the running time of the developed ESAPRO Tool using the enhanced suffix array is much better than that of SCT, which uses a suffix tree.
Figure 4 Results of 7-fold cross validation
Fig. 5 Results of 7-fold cross validation
5.0 CONCLUSION
In this paper the ESAPRO Tool using an enhanced suffix array has been developed. ESAPRO Tool pre-processes the database to create a generalized enhanced suffix array, extended by adding frequency and length information for the patterns. ESAPRO Tool distinguishes patterns by computing significance-scores. A pattern is regarded as significant if it is long enough and appears frequently enough in the database. The scoring function takes into account a pattern's length and frequency and the given threshold values, and determines whether a pattern is significant. Using these, for a given query sequence ESAPRO Tool reduces the database to only a few sequences that share the most significant patterns with the query. This reduction in database size speeds up the local alignment of the query sequence against the database. Experimental results have shown that ESAPRO Tool provides a speed-up over WU-BLAST, which is currently the dominant search engine for database searches. It is able to reduce the time of a database search by a factor of nearly five compared with the time originally taken by WU-BLAST. Results from WU-BLAST have shown that this method is experimentally effective, as the results obtained by ESAPRO Tool are accurate. Combined with the extended suffix array, ESAPRO Tool has the advantage of using WU-BLAST to do the local sequence alignment. A data mining technique, 7-fold cross validation, is applied to attain greater accuracy in the results. The 7 runs of the cross validation help to establish that ESAPRO Tool performs consistently well for all the queries for a particular database included in our tests.
The enhanced suffix array algorithm used in ESAPRO Tool requires about 5n bytes (5 bytes per input character), whereas SCT, which uses a suffix tree, requires 20n bytes, so the space requirement is approximately four times smaller than that of SCT. Experimental results show that the running time of ESAPRO Tool is much better than that of SCT. In this paper a small domain of sequences has been selected from the protein database, and it has been shown experimentally that the enhanced suffix array reduces the space requirement by a factor of about four. The most valuable future work is to compress the protein sequences and form the GSArray for the compressed sequences. ESAPRO Tool can also be applied to global sequence alignment and multiple sequence alignment. This work can also be continued by covering further databases of protein sequences for alignment.
REFERENCES
1. Gusfield, D., Algorithms on Strings, Trees, and Sequences, Cambridge University Press, 1997.
2. Smith, T. F. and M. S. Waterman, Identification of common molecular subsequences, Journal of Molecular Biology, 147:195-197, 1981.
3. Altschul, S. F., W. Gish, W. Miller, E. W. Myers, and D. Lipman, Basic Local Alignment Search Tool, Journal of Molecular Biology, 215:403-410, 1990.
4. Pearson, W. R., Flexible sequence similarity searching with the FASTA3 program package, Methods Mol. Biol., 132:185-219, 2000.
5. Divya R. Singh, Abdullah N. Arslan, Xindong Wu, Using an extended suffix tree to speed-up sequence alignment, IADIS International Conference Applied Computing 2006, 655-660.
6. M. I. Abouelhoda, Stefan Kurtz, Enno Ohlebusch, Replacing suffix trees with enhanced suffix arrays, Journal of Discrete Algorithms 2 (2004) 53–86.
7. http://blast.wustl.edu/