Download (STAN): looking for nucleotidic and peptidic patterns in

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Gel electrophoresis of nucleic acids wikipedia , lookup

Promoter (genetics) wikipedia , lookup

DNA barcoding wikipedia , lookup

Silencer (genetics) wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Molecular cloning wikipedia , lookup

Cre-Lox recombination wikipedia , lookup

Genome evolution wikipedia , lookup

Community fingerprinting wikipedia , lookup

Genomic library wikipedia , lookup

Deoxyribozyme wikipedia , lookup

Non-coding DNA wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Molecular evolution wikipedia , lookup

Transcript
BIOINFORMATICSAPPLICATIONS NOTE
Vol. 21 no. 24 2005, pages 4408–4410
doi:10.1093/bioinformatics/bti710
Genome analysis
Suffix-tree analyser (STAN): looking for nucleotidic and
peptidic patterns in chromosomes
Jacques Nicolas , Patrick Durand, Grégory Ranchy, Sébastien Tempel and
Anne-Sophie Valin
IRISA-INRIA, Campus de Beaulieu, 35402 Rennes cedex, France
Received on June 27, 2005; revised on July 13, 2005; accepted on October 6, 2005
Advance Access publication October 13, 2005
ABSTRACT
Summary: We have developed STAN (suffix-tree analyser), a tool to
search for nucleotidic and peptidic patterns within whole chromosomes.
Pattern syntax uses a string variable grammar-like formalism which
allows the description of complex patterns including ambiguities, insertions/deletions, gaps, repeats and palindromes. STAN is based on a
reduction to multipart matching on a suffix-tree data structure and can
handle large DNA sequences, whether assembled or not.
Availability: STAN is accessible online at http://www.irisa.fr/
symbiose/STAN
Contact: [email protected]
Sequence pattern matching within biological sequences (DNA,
RNA and proteins) can be achieved very efficiently using regular
expressions. A number of tools have been developed in this context
(Kucherov and Rusinowitch, 1995; Altschul et al., 1997; Gatiker
et al., 2002). However, when considering high-order sequence
organization (such as copy, palindrome or even structural properties) a more expressive formalism is required. During the past
decade, definite clause grammars (DCG) (Pereira and Warren,
1980), a particular form of context-free grammars, have been
used in various works to model DNA sequence features (Searls,
1989; Helgesen and Sibbald, 1993; Leung et al., 2001), as well as to
model gene regulation (Collado-Vides, 1992). Among those formalisms, the pioneering work of David Searls on string variable
grammars (SVGs) is of particular interest (Searls, 1995, 2002).
SVG introduces the concept of a variable that can be associated
to a string during a pattern search. SVGs can be used to model not
only DNA/RNA sequence features but also structural features such
as repeats, palindromes, stem–loop or pseudo-knots (Searls, 1993).
To our knowledge, the only two tools capable of searching for SVGbased patterns in biological sequences are GenLang (Dong and
Searls, 1994) and PatScan (Dsouza et al., 1997). However, GenLang
is no longer maintained and, because of its time complexity, was
restricted to the analysis of medium size sequences (several
Mbases). PatScan, on the other hand, does not guarantee to find
To whom correspondence should be addressed.
all occurrences of complex patterns (once a hit is found, it does not
check overlapping alternative solutions). This paper describes a new
tool, STAN (suffix-tree analyser), allowing to search for a subset of
SVG patterns in fully sequenced chromosomes. STAN is capable of
efficiently scanning sequences as large as human chromosomes, it
guarantees to find all occurrences of complex patterns and provides
three different types of search operations as explained below.
STAN patterns can be used to model sequence and/or structural
features of nucleotidic and peptidic sequences. STAN patterns are
made of words, separated by a ‘-’character, belonging to the following language. Letters are taken from standard IUPAC alphabets
describing DNA or protein sequences. Tokens are made of a single
letter or a string of letters. Ambiguous tokens are described using
either curly braces ({C} for all nucleotides but cytosine in a DNA
pattern) or brackets ([AC] for A or C letter) or brackets and pipe
characters ([AC|AG] for AC or AG tokens). Spacers are described
using the syntax ‘x(a,b)’. In this syntax, and the following, a and b
denote positive integers, a b. Tokens with insertions and/or deletions are described as ‘indel(m,ai,bi,ad,bd)’ where m is a token and
ai and bi (respectively ad and bd) give the minimum and maximum
numbers of letter insertions (respectively deletions) allowed in m.
Tokens with mismatches are described as ‘m:a’ where m is a token
and a in the maximum number of letter substitutions allowed in m.
String variables are represented by identifiers starting with the upper
letter X. They may be constrained in size using the syntax ‘X:[a,b]’
where a and b are the lower and upper limits of X size. A token or a
string variable may be prefixed with operator ‘’ to search for the
reverse complement. In that way, pattern ‘X - X’ allows to search
for palindromes. Finally, mismatches are allowed when defining a
string variable (‘X - X:a’ allows to search for palindromes with a
letter mismatches).
The current implementation of STAN allows searching for SVGbased patterns in most of publicly available genome sequences from
Archaea, Bacteria and Eukaryota, as well as in user-provided DNA
sequences. To perform a pattern search efficiently, STAN uses
suffix trees, one of the most effective computational representations
of a sequence (McCreight, 1976; Kurtz and Schleiermacher, 1999;
Kurtz et al., 2004). Given a pattern and a sequence, STAN works as
follows. The sequence, stored on disk as a binary representation of
Ó The Author 2005. Published by Oxford University Press. All rights reserved. For Permissions, please email: [email protected]
The online version of this article has been published under an open access model. Users are entitled to use, reproduce, disseminate, or display the open access
version of this article for non-commercial purposes provided that: the original authorship is properly and fully attributed; the Journal and Oxford University
Press are attributed as the original place of publication with the correct citation details given; if an article is subsequently reproduced or disseminated not in its
entirety but only in part or as a derivative work this must be clearly indicated. For commercial re-use, please contact [email protected]
STAN: looking for patterns in genomes
A
B
T-[CT]-x(0,1)-TAC:1-x(2)-TAT-[TA]-AT-[TC]-T-GGGAAG:2-T-ACA:1-TT:1-[AT]-x(0,1)-TAA:1-[ATG]-TGT:2 –
5’ end
x(100,3000)-AAATCGT:1-X:[7]-x(4)-~X:5-TTAAAATCTAG:2
spacer
hairpin
3’ end
C
D
Fig. 1. (A) AtREP3 model from Kapitonov and Jurka (2001). (B) Refined AtREP3 model deduced by iterative database scanning using STAN. (C) STAN webform interface used to format the result of a search for the AtREP3 model in the entire genome of Arabidopsis thaliana. (D) Search time comparison, looking for
AtREP3 model in the A.thaliana chromosome 1 with STAN (S), PatScan (P) and GenLang (G) (see text).
its suffix tree, is fully loaded into the computer memory. The pattern
is interpreted by a Prolog program and converted into a series of
search operations that are directly applied to the suffix tree using
C compiled software calls. The search for patterns is performed by
exhaustive enumeration and search of all possible candidates generated from the given pattern in case of insertion and deletion and
by exhaustive search of the pattern in all the suffix-tree paths till the
number of errors is reached in case of mismatches. Three kinds of
searches can be achieved using STAN: single nucleotide-based
pattern versus DNA sequence, single amino acid-based pattern
versus six frames translation of a DNA sequence and multiple
amino acid-based patterns versus DNA sequence. The last type of
search can be used to produce all the combinations of up to five
patterns located anywhere in the six coding frames of a DNA
sequence. It can be of high interest to locate patterns spanning
several exons of a coding sequence in eukaryote genomes.
As an example of STAN application, we have searched the genome of Arabidopsis thaliana for the AtREP3 transposable element.
AtREP3 is a novel family of A.thaliana’s non-autonomous transposons described by a TC 50 terminus, a CTRR 30 terminus and
a 30 subterminal short hairpin structure (Fig. 1A) (Kapitonov and
Jurka, 2001). On account of the strong variability of internal
sequence, AtREP3 cannot be detected by BLAST or other classical
software. In order to identify all family members, we have used
STAN iteratively to scan database with refined models starting with
the model from Figure 1A. The final model presented in Figure 1B
4409
J.Nicolas et al.
was then used to scan the entire genome of A.thaliana (5 chromosomes for a total of 119 Mb) in both normal and reverse directions.
We identified 125 different AtREP3 sizing from 450 to 2500 nt.
Cumulated search time was 20 s on a SunFire 6800 computer.
Figure 1D shows a search time comparison, looking for our
AtREP3 model in the A.thaliana chromosome 1 (29 Mb) with
STAN, PatScan and GenLang. For fair comparison, all tools
were installed and executed from the command-line on our SunFire
computer, and reported times are averaged from four successive
executions. STAN, in contrast to the other tools with which it has
been compared, runs on four processors and the time reported in the
graph is the sum of the user times for each processor.
More details on STAN’s pattern language, implementation and
performance are provided in the STAN manual available online
(see http://www.irisa.fr/symbiose/STAN).
ACKNOWLEDGEMENTS
Funding to pay the Open Access publication charges for this article
was provided by Région Bretagne (Plateforme bioinformatique
Quest Genopole).
Conflict of Interest: none declared.
REFERENCES
Altschul,S.F. et al. (1997) Gapped BLAST and PSI-BLAST: a new generation of
protein database search programs. Nucleic Acids Res., 25, 3389–3402.
Collado-Vides,J. (1992) Grammatical model of the regulation of gene expression.
Proc. Natl Acad. Sci. USA, 89, 9405–9409.
4410
Dong,S. and Searls,D. (1994) Gene structure prediction by linguistic methods.
Genomics, 23, 540–551.
Dsouza,M. et al. (1997) Searching for patterns in genomic data. Trends Genet., 13,
497–498.
Gatiker,A. et al. (2002) ScanProsite: a reference implementation of a PROSITE
scanning tool. Appl. Bioinformatics, 1, 107–108.
Helgesen,C. and Sibbald,P.R. (1993) PALM—a pattern language for molecular
biology. In 1st International Conference on Intelligent Systems for Molecular
Biology, Bethesda, MD, AAAI Press, pp. 172–180.
Kapitonov,V. and Jurka,J. (2001) Rolling-circle transposons in eukaryotes. Proc. Natl
Acad. Sci. USA, 98, 8714–8719.
Kucherov,G. and Rusinowitch,M. (1995) Matching a set of strings with variable length
don0 t cares. LNCS, 937, 230–247.
Kurtz,S. and Schleiermacher,C. (1999) REPuter: fast computation of maximal repeats
in complete genomes. Bioinformatics, 15, 426–427.
Kurtz,S. et al. (2004) Versatile and open software for comparing large genomes.
Genome Biol., 5, R12.
Leung,S. et al. (2001) Basic gene grammars and DNA-ChartParser for language
processing of Escherichia coli promoter DNA sequences. Bioinformatics, 17,
226–236.
McCreight,E. (1976) A space-economical suffix tree construction algorithm. J. ACM,
23, 262–272.
Pereira,F. and Warren,D. (1980) Definite clause grammars for language analysis—a
survey of the formalism and a comparison with augmented transition networks.
Artif. Intell., 13, 231–278.
Searls,D. (1989) Investigating the linguistics of DNA with definite clause grammars.
In Lusk,E. and Overbeek,R. (eds), Logic Programming: Proceedings of the North
American Conference on Logic Programming. MIT Press, Cambridge, MA,
pp. 189–208.
Searls,D. (1993) The computational linguistics of biological sequences. In Hunter,L.
(ed.), Artificial Intelligence and Molecular Biology. AAAI Press, Menlo Park, CA,
pp. 47–120.
Searls,D. (1995) String variable grammar : a logic grammar formalism for the biological language of DNA. J. Logic Prog., 14, 73–102.
Searls,D. (2002) The language of genes. Nature, 420, 211–217.