Journal of Engineering Research and Studies
E-ISSN0976-7916
Research Article
ENHANCED SUFFIX ARRAY TO PROTEIN SEQUENCE ALIGNMENT
A. Kunthavai*1, S. Vasantharathna2
Address for Correspondence
1 Assistant Professor (SG), Department of CSE/IT, Coimbatore Institute of Technology, Coimbatore, Tamilnadu, India
2 Associate Professor, Department of Electrical Engineering, Coimbatore Institute of Technology, Coimbatore, Tamilnadu, India
ABSTRACT
Sequence alignment is a popular bioinformatics application that determines the degree of similarity between nucleotide or amino
acid sequences that are assumed to share ancestral relationships. The sequence alignment method reads a query sequence from
the user, aligns it against large protein and gene sequence data sets, and locates targets that are similar to the input
query sequence. Traditional accurate algorithms, such as Smith-Waterman and FASTA, are computationally very expensive, which
limits their use in practice. The current set of popular search tools, such as BLAST and WU-BLAST, employ heuristics to improve
the speed of such searches. However, such heuristics can sometimes miss targets, which in many cases is undesirable. This paper
presents the ESAPRO Tool, based on the enhanced suffix array, to perform accurate and faster biological sequence analysis, improving
on the computation time of existing tools in this area. The main idea is to pick matched patterns of the query sequence and identify
sequences in the database that share a large number of these matched patterns, thereby reducing the database to the few
sequences that are closest to the query sequence. Experimental results are cross validated using a data mining
technique. They show that the ESAPRO Tool effectively reduces the database and obtains results very similar
to those of the traditional algorithms in approximately half the time taken by them.
KEYWORDS Sequence alignment, enhanced suffix array, local alignment, Data mining
INTRODUCTION
Bioinformatics is the field of analyzing biological
information using computers and statistical techniques;
it is the science of developing and utilizing computer
databases and algorithms to accelerate and enhance
biological research. Sequence alignment is one of the
most important fundamental operations in
bioinformatics. It has been successfully applied to
predict the function, structure and evolution of
biological sequences. It can reveal biological
relationships among organisms, for example by finding
evolutionary information or determining the causes and
cures of diseases.
Sequence comparison is a basic operation of the protein
sequencing problem. Sequences can be aligned across
their entire length (global alignment) or only in certain
regions (local alignment). Local sequence alignment
plays a major role in the analysis of DNA and protein
sequences [1]. This paper describes pairwise local
alignment, which is the basic step for many other
applications such as detecting homology, finding protein
structure and function, and deciphering evolutionary
relationships.
Smith and Waterman developed a
dynamic programming approach to the sequence alignment
problem that is widely used [2]. BLAST [3], FASTA
[4] and WU-BLAST [7] are three commonly used
programs for similarity searching on biological
sequences. While most of the methods used are based
on heuristic paradigms and have a relatively fast
execution time, they do not produce optimal alignments
sought by entirely sequenced, and this reality presents
JERS/Vol.II/ Issue III/July-September,2011/73-77
the need for comparing long protein sequences, which is
a challenging task because of its high computational
demands (processing power and memory). One
important feature of BLAST is its ability to compare a
query with a database of sequences. Considering the
rapid growth of database sizes, this problem demands
ever-growing computational resources and remains a
computational challenge. D. R. Singh et al. developed the
Sequence Comparison Tool (SCT) [5], which searches for
sequence similarity in a database instead of performing
pairwise sequence alignment. This tool preprocesses the
database to create a special generalized suffix tree from
the sequences in the database. The suffix tree [5] is one
of the most important data structures in string processing
and comparative genomics. The space consumption of the
suffix tree is a bottleneck in large-scale applications
such as genome analysis. To overcome this bottleneck,
in this paper the suffix tree is replaced with a suffix array
enhanced with the lcp-table. A generalized suffix array
(GSArray) is formed for the query sequence, the longest
common subsequence (LCS) is obtained and compared
against the database, and the sequences that contain the
LCS are identified in the database. Sequences are
ranked with respect to the number of significant patterns
they share with the query sequence. Finally the database is
reduced by selecting only a given number of sequences
with the topmost ranks.
2. AN OVERVIEW OF ENHANCED SUFFIX ARRAY
Every algorithm that uses a suffix tree as data structure
can systematically be replaced with an algorithm that
uses an enhanced suffix array [6] and solves the same
problem in the same time complexity but with improved
space complexity. The generic name enhanced suffix
array stands for data structures consisting of the suffix
array and additional lcp tables.
Let ∑ be a finite ordered alphabet. ∑* is the set of all
strings over ∑, and ∑+ denotes the set ∑* \ {ε} of nonempty strings. Let S be a string of length |S| = n over ∑.
To simplify analysis, it is assumed that the size of the
alphabet is a constant and that n < 2^32. In the case of
protein sequences, the alphabet is composed of 20
characters.
2.1 Suffix Tree
A suffix tree for the string S is a rooted directed tree
with exactly n+1 leaves numbered 0 to n. Each internal
node, other than the root, has at least two children and
each edge is labeled with a nonempty substring of S$.
No two edges out of a node can have edge-labels
beginning with the same character. The key feature of
the suffix tree is that for any leaf i, the concatenation of
the edge-labels on the path from the root to leaf i exactly
spells out the string Si, where Si = S[i..n − 1]$ denotes
the ith nonempty suffix of the string S$, 0 ≤ i ≤ n. The
space and time complexity of this suffix tree is O(n).
Fig. 1 shows the suffix tree for the string S = acaaacatat.
Figure 1. The suffix tree for S = acaaacatat.
2.2 Suffix Array
More space-efficient data structures than the suffix tree
exist. The most prominent one is the suffix array [8],
which requires only 4n bytes (4 bytes per input
character) in its basic form. The suffix array is designed
for efficient searching of a large text: a pattern can be
located by binary search over the suffix array.
Suffix trees can be constructed in O(n) time in the worst
case, versus O(n log n) time for suffix arrays. Even so,
suffix arrays prove to be better than suffix trees for many
applications. The suffix array (denoted by suftab) of the
string S is an array of integers in the range 0 to n,
specifying the lexicographic ordering of the n + 1
suffixes of the string S$. That is, S(suftab[0]),
S(suftab[1]), . . . , S(suftab[n]) is the sequence of
suffixes of S$ in ascending lexicographic order, as shown
in Table 1.
Table 1 The lcp-interval table of the enhanced suffix
array of S = acaaacatat$
i     suftab   lcptab   S(suftab[i])
0     2        0        aaacatat$
1     3        2        aacatat$
2     0        1        acaaacatat$
3     4        3        acatat$
4     6        1        atat$
5     8        2        at$
6     1        0        caaacatat$
7     5        2        catat$
8     7        0        tat$
9     9        1        t$
10    10       0        $
2.3 Suffix array generation
Creating a suffix array requires O(n log n) time and
searching for a pattern in it requires O(m log n) time,
where m is the length of the pattern and n is the length of
the string. To make the search faster still, the enhanced
suffix array is used, which is the basic suffix array
enhanced with the lcp-table.
The enhanced suffix array is generated with information
about the internal nodes of the suffix tree in the lcptab and
suftab tables using the algorithm of [6]. lcptab[i] stores
the length of the longest common prefix of the suffixes
at suftab[i] and suftab[i−1]. The longest common prefix of
two strings is the longest string that is a prefix of both. The
lcp-table can be constructed in linear time; Table 1 shows
it for S = acaaacatat$. The lcp-interval tree of the enhanced
suffix array of the same sequence is shown in Figure 2.
0-[0..10]
    1-[0..5]
        2-[0..1]
        3-[2..3]
        2-[4..5]
    2-[6..7]
    1-[8..9]
Figure 2 The lcp-interval tree of S = acaaacatat$
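The tables of Section 2.2 and 2.3 can be sketched in a few lines of Python. This is an illustrative sketch only, not the construction used in ESAPRO Tool: the suffix array is built by naive sorting, the lcp-table uses the linear-time method of Kasai et al. (one possible choice, assumed here), and the function names build_suftab, build_lcptab and find_occurrences are hypothetical. Table 1 lists the suffix $ last, so the sketch assumes that $ compares greater than every other character.

```python
from typing import List

SENTINEL_RANK = 0x110000  # larger than any Unicode code point


def sort_key(text: str) -> List[int]:
    """Map characters to integers so that '$' sorts after everything else."""
    return [SENTINEL_RANK if c == "$" else ord(c) for c in text]


def build_suftab(s: str) -> List[int]:
    """Naive suffix array: sort suffix start positions by suffix text.

    Simple but slow; shown for illustration, not for large databases.
    """
    return sorted(range(len(s)), key=lambda i: sort_key(s[i:]))


def build_lcptab(s: str, suftab: List[int]) -> List[int]:
    """lcptab[i] = length of the longest common prefix of the suffixes
    at suftab[i] and suftab[i-1]; lcptab[0] = 0 (Kasai-style, linear time)."""
    n = len(s)
    rank = [0] * n
    for i, start in enumerate(suftab):
        rank[start] = i
    lcptab = [0] * n
    h = 0
    for start in range(n):
        if rank[start] == 0:
            h = 0
            continue
        prev = suftab[rank[start] - 1]
        while start + h < n and prev + h < n and s[start + h] == s[prev + h]:
            h += 1
        lcptab[rank[start]] = h
        if h > 0:
            h -= 1
    return lcptab


def find_occurrences(s: str, suftab: List[int], pattern: str) -> List[int]:
    """Binary search for all start positions of pattern (O(m log n) comparisons)."""
    m, key = len(pattern), sort_key(pattern)

    def prefix(i: int) -> List[int]:
        return sort_key(s[suftab[i]:suftab[i] + m])

    lo, hi = 0, len(suftab)
    while lo < hi:  # first suffix whose m-prefix is >= pattern
        mid = (lo + hi) // 2
        if prefix(mid) < key:
            lo = mid + 1
        else:
            hi = mid
    first, hi = lo, len(suftab)
    while lo < hi:  # first suffix whose m-prefix is > pattern
        mid = (lo + hi) // 2
        if prefix(mid) <= key:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suftab[first:lo])


if __name__ == "__main__":
    text = "acaaacatat$"
    suftab = build_suftab(text)
    print(suftab)
    print(build_lcptab(text, suftab))
    print(find_occurrences(text, suftab, "aca"))
```

For S = acaaacatat$ the first two printed lists match the suftab and lcptab columns of Table 1, and the pattern aca is found at positions 0 and 4.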
3.0 METHODOLOGY
The enhanced suffix array is generated for the Oryza
Sativa protein sequences in the database using
the algorithm of [6] and is named GSArray. Substrings
of the sequences in GSArray are used to identify
similarity between homologous sequences, because
similar sequences contain conserved regions.
Significantly similar sequences are identified with the
help of a score. A pattern with a high score carries
important information: it belongs to a family of
sequences with a highly conserved region. The score is
calculated from the length of the pattern and its frequency,
i.e. the number of occurrences in the database. A given
pattern p is classified as significant if it satisfies the
following constraints:
• The length of p ≥ a given length-threshold: a
significant pattern must be sufficiently long to
carry important biological information.
• The score of p ≥ a given score-threshold: a
significant pattern must have a sufficiently
high score.
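The two constraints can be expressed as a small predicate. This is a sketch under stated assumptions: the score is only known to depend on pattern length and frequency, so the product used in placeholder_score is a hypothetical stand-in and not the paper's score function; the names Pattern, placeholder_score and is_significant are likewise illustrative.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Pattern:
    text: str        # the pattern itself
    frequency: int   # number of occurrences in the database


def placeholder_score(p: Pattern) -> float:
    """HYPOTHETICAL score: length times frequency.

    Any function of length and frequency can be plugged in instead.
    """
    return float(len(p.text) * p.frequency)


def is_significant(
    p: Pattern,
    length_threshold: int,
    score_threshold: float,
    score_fn: Callable[[Pattern], float] = placeholder_score,
) -> bool:
    """A pattern is significant if it is long enough AND scores high enough."""
    return len(p.text) >= length_threshold and score_fn(p) >= score_threshold


if __name__ == "__main__":
    print(is_significant(Pattern("ACDEF", 3), length_threshold=4, score_threshold=10))
```

Passing the score function as a parameter keeps the two threshold checks separate from whichever scoring formula is actually used.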
GSArray is constructed for all the sequences in the
database. While constructing GSArray, at each node i
the length l(pi) and the frequency f(pi) (i.e. the number of
occurrences of pi in the database) are stored for the
corresponding pattern pi; this frequency is incremented
for every new node. Then the score function w(pi) is
calculated using equation (1)
(1)
The score function w(pi) is used to find the significant
similarity of pattern pi. Then the query sequence Q and
the number of sequences to be selected from the database
are read from the user, and Q is temporarily added to GSArray.
This makes it possible to determine which suffixes of the
query are shared by the sequences in the database. The query
sequence is added only temporarily so that
GSArray is not affected for future sequence searches.
Initially the visit flag of every node in the GSArray is 0.
When the query sequence Q is added as a suffix, the flag of
each node visited is set to 1. This expedites the search for
common patterns within the GSArray because only those paths
whose patterns contain substrings of the query
sequence are examined. In depth-first manner, starting
at the root, the nodes are visited to check for the value 1.
If the current node has no child whose value is 1, then
the search backtracks to its parent node. During this
traversal all the significant patterns are collected. The
sequences may have other common patterns that are not
significant. An optimal alignment between two
sequences in the ideal case contains all significant
patterns. After this process the query sequence is deleted
from GSArray. The top ten significant patterns are
selected and stored. The sequences that contain
significant patterns are extracted and stored. A reverse
check is made to assess the accuracy of the results: it
computes how many of the chosen patterns are shared by
each of the extracted sequences. The higher this
number (weight), the greater the similarity of the
corresponding sequence to the query. Based on this
weight the sequences are ranked. The top n sequences are
transferred to a new database.
The algorithm for ESAPRO Tool is shown in Figure 3.
// Input: set of protein sequences
// Output: reduced set of protein sequences
S1: Read protein database;
S2: Construct GSArray for the input sequences, set node
visit = 0;
S3: While constructing the suffix array, store the information
of label-length, frequency at nodes and sequences
traversing through the branches;
S4: Use equation (1) to calculate the degree of
similarity of patterns in the form of prefixes;
S5: Read query sequence Q and the number of
sequences n to select from the user;
S6: Temporarily add the suffixes of the query to the
generalized GSArray;
S7: While adding the suffixes, highlight the nodes of the
paths which are traversed by the query sequence;
S8: Post-process the GSArray to extract patterns shared
by the query sequence that lie above a defined
threshold on the function-value;
S9: Pick the top ten of these patterns and store;
S10: Do a reverse check to compute the weight of each
sequence in the subset;
S11: Rank the sequences according to these weights;
S12: Pick top n sequences from the subset and write to a
new database;
Figure 3 ESAPRO Tool Algorithm
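The flow of Figure 3 can be sketched in Python as follows. This is a simplified illustration under several assumptions, not the ESAPRO Tool implementation: shared patterns are found by brute-force substring enumeration instead of a GSArray, frequency is counted as the number of database sequences containing a pattern, the score is a hypothetical length-times-frequency placeholder, and the names substrings and reduce_database are invented for the example.

```python
from collections import Counter
from typing import Dict, List, Set


def substrings(seq: str, min_len: int) -> Set[str]:
    """All substrings of seq with at least min_len characters."""
    return {
        seq[i:j]
        for i in range(len(seq))
        for j in range(i + min_len, len(seq) + 1)
    }


def reduce_database(
    database: Dict[str, str],
    query: str,
    n: int,
    min_len: int = 3,
    top_patterns: int = 10,
) -> List[str]:
    """Return the names of the n sequences sharing most top patterns with query."""
    query_patterns = substrings(query, min_len)

    # Count, for each pattern shared with the query, how many sequences contain it.
    frequency: Counter = Counter()
    for seq in database.values():
        for pattern in substrings(seq, min_len) & query_patterns:
            frequency[pattern] += 1

    # HYPOTHETICAL score: length * frequency; ties broken alphabetically.
    ranked = sorted(frequency, key=lambda p: (-len(p) * frequency[p], p))
    chosen = ranked[:top_patterns]

    # Reverse check: weight = number of chosen patterns each sequence contains.
    weights = {
        name: sum(pattern in seq for pattern in chosen)
        for name, seq in database.items()
    }
    candidates = [name for name, weight in weights.items() if weight > 0]
    candidates.sort(key=lambda name: (-weights[name], name))
    return candidates[:n]


if __name__ == "__main__":
    db = {"s1": "MKTAYIAKQR", "s2": "MKTAYIAGGG", "s3": "WWWWWWWWWW"}
    print(reduce_database(db, "MKTAYIAKQR", n=2))
```

In the toy database, s1 equals the query and s2 shares a seven-residue prefix with it, so both survive the reduction while s3 shares nothing and is dropped.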
3.1 Cross validation
Real-world protein sequences from the Oryza
Sativa (GSS) group are extracted from the NCBI website and
an enhanced suffix array is formed for all the sequences in
the database. Based on the user-given query sequence,
the significant patterns are generated. The sequences in
the protein database are given weights according to the
number of patterns they contain. The reduced database
is formed by selecting the top n sequences with the highest
ranks and writing them into a new database. The data
mining technique of 7-fold cross validation is used to
validate the results obtained from ESAPRO Tool and
WU-BLAST. The main idea of the 7-fold cross validation
approach is "train on 6 folds, test on 1 fold". The data
set is divided into 7 parts: 6/7 of the
data are used as the training data set and the remaining 1/7
is used as the testing data set. For a single user-given
query sequence, ESAPRO Tool is applied to the training
data set and WU-BLAST is then run on the reduced database.
For the same query sequence
WU-BLAST alone is applied to the testing data set. The
average of these seven runs is computed for analysis. In
this paper seven such queries are taken and analyzed. 49
sequences from Oryza Sativa are taken as the database set;
42 sequences are used as the training data set and 7
sequences as the test data set. A single query
sequence is first applied to ESAPRO Tool and the database is
reduced. WU-BLAST is then performed on the new database
with the same query sequence. Then, for the same sequence,
WU-BLAST alone is applied. The sequences in the
training and test data sets are interchanged and the above
steps are repeated until every fold has been used as the test set.
The average of the results from the 7 runs is calculated and
stored. This experiment is repeated for 7 different queries. The
objective is to test whether the results are consistent for
all the queries on a particular database in terms of
computation time.
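The fold arithmetic described above (49 sequences, 42 for training and 7 for testing in each of 7 runs) can be sketched as follows. The function name k_fold_splits and the round-robin assignment of sequences to folds are assumptions made for illustration; the paper does not state how sequences were assigned to folds.

```python
from typing import Iterator, List, Sequence, Tuple


def k_fold_splits(
    items: Sequence[str], k: int = 7
) -> Iterator[Tuple[List[str], List[str]]]:
    """Yield (training set, test set) pairs; each fold is the test set once."""
    # Round-robin assignment: item i goes to fold i mod k (an assumption).
    folds = [list(items[i::k]) for i in range(k)]
    for test_index in range(k):
        test = folds[test_index]
        train = [
            item
            for fold_index, fold in enumerate(folds)
            if fold_index != test_index
            for item in fold
        ]
        yield train, test


if __name__ == "__main__":
    sequences = [f"seq{i}" for i in range(49)]
    for run, (train, test) in enumerate(k_fold_splits(sequences), start=1):
        print(run, len(train), len(test))
```

Each of the seven runs then uses 42 training sequences and 7 test sequences, and every sequence appears in exactly one test set.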
4.0 RESULT AND DISCUSSION
The reduced database from ESAPRO Tool is cross validated
using the 7-fold cross validation approach. The cross
validation results for seven different queries are shown in
Figures 4 and 5. The idea is to test whether the results are
consistent for all queries on a particular database in
terms of computation time. The first series in the figures
represents the computation time obtained with ESAPRO Tool
and the second one represents the result of applying
WU-BLAST alone. The figures show that the results
obtained by ESAPRO Tool are consistent, that it performs
sequence comparison with good accuracy, and that a
practical time improvement is achieved over WU-BLAST.
The enhanced suffix array algorithm used in
ESAPRO Tool requires 5n bytes (5 bytes per input
character), whereas SCT, which uses a suffix tree, requires
20n bytes. Hence the space requirement of ESAPRO Tool is
approximately four times lower than that of SCT
(20n / 5n = 4). Experimental results show that the
running time of ESAPRO Tool
using the enhanced suffix array is much better than that of
SCT, which uses a suffix tree.
Figure 4 Results of 7-fold cross validation
Figure 5 Results of 7-fold cross validation
5.0 CONCLUSION
In this paper the ESAPRO Tool using the enhanced suffix
array has been developed. ESAPRO Tool pre-processes
the database to create a generalized enhanced suffix
array, extended with frequency and length
information for the patterns. ESAPRO Tool
distinguishes patterns by computing significance scores.
A pattern is regarded as significant if it is long enough
and appears frequently enough in the database. The
scoring function takes into account a pattern's length and
frequency and the given threshold values, and determines
whether a pattern is significant. Using these, for a given query
sequence ESAPRO Tool reduces the database to only a
few sequences that share the most significant patterns
with the query. This reduction in database size speeds up
the local alignment of the query sequence against the
database. Experimental results have shown that
ESAPRO Tool provides a speed-up over WU-BLAST,
which is currently the dominant search engine for
database searches. It reduces the time of a
database search by a factor of nearly five relative to the
time originally taken by WU-BLAST. Results from WU-BLAST
have shown that this method is experimentally effective, as the
results obtained by ESAPRO Tool are accurate.
Combined with the extended suffix array, ESAPRO
Tool has the advantage of using WU-BLAST to do the
local sequence alignment. The data mining technique of
7-fold cross validation is applied to attain greater
accuracy in the results. The 7 runs of the cross validation
help to establish that ESAPRO Tool performs
consistently well for all the queries on a particular
database included in our tests. The enhanced suffix
array algorithm used in ESAPRO Tool requires 5n bytes
(5 bytes per input character), whereas SCT, which uses a
suffix tree, requires 20n bytes. So the space requirement is
approximately four times lower than that of SCT
(20n / 5n = 4). Experimental results show that
the running time of ESAPRO Tool is much better than that of
SCT, which uses a suffix tree. In this paper, a small
domain of sequences has been selected from the
protein database, and the experiments confirm that the
enhanced suffix array reduces the space requirement by
roughly this factor. The most valuable future work is to
compress the protein sequences and form the GSArray over
the compressed sequences. ESAPRO Tool can also
be applied to global sequence alignment and multiple
sequence alignment. This work can also be continued
by covering further databases of protein sequences for
alignment.
REFERENCES
1. Gusfield, D., Algorithms on Strings, Trees, and
Sequences, Cambridge University Press, 1997.
2. Smith, T. F. and Waterman, M. S., Identification of
common molecular subsequences, Journal of
Molecular Biology, 147:195-197, 1981.
3. Altschul, S. F., Gish, W., Miller, W., Myers, E. W.,
and Lipman, D. J., Basic Local Alignment
Search Tool, Journal of Molecular Biology,
215:403-410, 1990.
4. Pearson, W. R., Flexible sequence similarity
searching with the FASTA3 program package,
Methods Mol. Biol., 132:185-219, 2000.
5. Singh, D. R., Arslan, A. N., and Wu, X.,
Using an extended suffix tree to speed-up sequence
alignment, IADIS International Conference Applied
Computing 2006, 655-660.
6. Abouelhoda, M. I., Kurtz, S., and Ohlebusch, E.,
Replacing suffix trees with enhanced suffix arrays,
Journal of Discrete Algorithms, 2:53-86, 2004.
7. http://blast.wustl.edu/