Technical report TR No: IIIT-H/TR/2004/012
IPSVAC – Integrated Platform for Sequence Visualization,
Analysis and Comparison
Naveena V.K. Yanamala 1 , Harikrishna V. Rekapalli 1, Sri Jyothsna Yeleswarapu,
Ram Sateesh Talari , Abhijit Mitra 2
International Institute of Information Technology
Gachibowli
Hyderabad, 500019 India
091-040-23001969
[email protected], {harikrishna, jyothsna, ramsateesh_iiit}@msitprogram.net,
[email protected]
Keywords
Integrated toolkit, Sequence Analysis tool, Multiple Sequence Alignment, Tandem Repeats, Sequence
Viewer
Abstract
Motivation: With the increased availability of sequence information and concomitant increase in the
number of automated analysis servers, biologists today need to deal with multiple data sources and
multiple software tools which use diverse methods and algorithms for analysis. For knowledge mining
and hypothesis building exercises, user level intervention in terms of comparing, validating and
visualizing second order features derived from the sequence data is of crucial importance. Convenient
options for doing this from an integrated platform which enables the user to operate with a single input
format and to retrieve parsed outputs, relevant to his research context, from different servers in
customizable user defined visual formats for easy comparison and analysis are an urgent requirement for
biologists in the post genomic era. The seemingly inevitable necessity of having to negotiate with
heterogeneous legacy resources, which have come up because of rapid parallel developments on all
fronts related to technology, approach and algorithms, constitutes the essential challenge involved.
Results: IPSVAC is an integrated toolkit that is interactive, customizable and modular. It can be
used for the analysis, display and storage of results related to biological sequence data, obtained from a
choice of tools that are either publicly available or have been incorporated as add-on modules.
In this paper we present an overview of our approach towards addressing the needs of individual
users while developing this prototype. An integrated file format converter addresses the requirement for
different input formats. The modular design provides the option of adding alternative and
complementary algorithms for analysis. Interactive modules that provide the user with options for data
sources and software tools also allow the user to customize the rendering and visualization of the
outputs. An additional feature allows the registered user to maintain a personalized page containing
summarized results accumulated over a period of time for follow-up analysis.
Availability: IPSVAC is a web-based tool that can be accessed freely at
http://bioinformatics.iiit.net/ipsvac.
1. Contributed towards major part of the development of algorithms and integration of the modules
2. To whom all correspondence may be addressed ([email protected])
Introduction
Recent advances in high throughput experimental techniques have resulted in a rapid explosion of
sequence data on the World Wide Web. Efficient handling and analysis of these data is required for
diverse research areas such as biomedical research, molecular evolutionary analysis, functional
genomics and personalized medicine, and calls for the development of easy-to-use computer programs.
Although computer programs and database resources for bioinformatics applications are becoming more
widely available, these resources do not conform to any uniform or specific standard and are frequently
incompatible. Platforms that integrate heterogeneous software are of immense importance to the field.
There are several integrated software packages developed to address this requirement, such as
EMBOSS (Rice, Longden et al. 2000), GCG (Womble 2000) and MacVector (Rastogi 2000). Some of them
are commercial products, while the others are command-line driven and do not support
carrying forward sequences and results from one application to another. An experimentalist needs to
perform various operations on a sequence, and uploading the sequence again for each of these operations
can become a laborious process (Figure 1). In this paper, we describe an integrated toolkit, called
Integrated Platform for Sequence Visualization, Analysis and Comparison (IPSVAC), that integrates
several analysis applications into a generalized software platform. The toolkit is written using the Java
language, CGI, and Perl 5.1.8 and is freely available for research and academic purposes
(http://bioinformatics.iiit.net/ipsvac). It is supported on all popular browsers such as IE, Netscape,
Mozilla and Opera.
Step 1: A DNA sequence is stored in a file.
Step 2: The DNA sequence is blasted against the NCBI database to find similar sequences.
Step 3: All the sequences that have a good similarity with the query sequence are retrieved from the
available global databases using accession ids.
Step 4: All the retrieved sequences are aligned using a web-based tool.
Step 5: The alignment is stored and submitted to another program for phylogenetic analysis.
Figure 1: Sequence of steps performed by a researcher/scientist for an analysis task
Integrated Toolkit
Integrated Platform for Sequence Visualization, Analysis and Comparison (IPSVAC) is an integrated
toolkit for performing most common sequence analysis tasks (Figure 2). The primary feature of this
toolkit is that it supports analyzing a set of sequences on an integrated platform, carrying
sequences forward from one tool to another. Secondly, the features derived during the analysis and the
results are displayed as meaningful color-coded annotations rather than as plain text. Finally, the
toolkit supports file format conversion for extended functionality.
The toolkit broadly consists of four components as shown in Figure 2: Sequence Retriever, File Format
Inter-converter, Applications module and storage module. The applications module is in turn further
divided into sub modules.
1. Sequence Retriever
The first step in IPSVAC is to retrieve a sequence, either from a locally available database or from remote
databases. The sequence retriever provides an interface for either uploading a
sequence present in the user account or fetching one from an available remote database. Sequence
Retriever supports fetching records from GenBank, Swissprot and RefSeq, and journal citations from
MEDLINE, given the accession id and the database name. It displays the sequence by default,
and other subsections of the record, such as the annotation and feature table, on explicit selection.
Several interfaces are provided for performing further analysis on the fetched data.
IPSVAC
  Sequence retrieval
  File format conversion
  Applications
    Primary analysis
      1. Restriction sites
      2. Sequence statistics
      3. Tandem repeats
      4. Cleavage sites
      5. BLAST
    Comparative analysis
      1. Pair-wise sequence alignment
      2. Multiple sequence alignment
      3. Phylogenetic trees
    Visual analysis
      1. Feature viewer
      2. Sequence viewer
      3. Phylogenetic tree viewer
  Storage
Figure 2: Overall functionality of the toolkit IPSVAC
The interface provides a menu that allows the user to perform different operations on the retrieved
sequence. Figure 3 displays the screenshot of the interface with the menu items that include: File,
Sequence, Analysis, and Viewers.
The File menu allows the user to open and save sequence records. The Sequence menu shows the
sequence, feature table and annotation information separately. The annotation gives details of the
sequence record such as the accession id, the definition, the keywords, the source organism, the references
and the authors, while the feature table lists all the special features within the sequence, such as genes,
CDS regions, translated sequences and signal peptide regions. These three sections together present
all the details of the sequence in text format. A graphical representation of the feature table is also
provided, described later as the Feature Viewer. The Analysis category provides users with different analysis
options as submenu items depending on the type of the sequence: nucleotide or protein. The Viewers category
provides two options: a feature viewer, to render graphically the features annotated on a DNA or protein
sequence, and a sequence viewer, to color-code the nucleotide bases or amino acid residues. The Analysis and
Viewers categories are described in detail in the Applications Module section.
2. File Format Inter-converter
This tool converts sequence data between a variety of file formats: FastA, GenBank,
EMBL, GCG, PIR, ACE, SWISS-PROT and FASTQ. The output file after conversion can be used as input
to other programs or can be saved for later use.
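As an illustration of what one such conversion involves, the following is a minimal sketch in Python that reads FASTA text and writes it in the NBRF/PIR layout. It is not the IPSVAC converter itself; the function names `parse_fasta` and `to_pir` are hypothetical, and the sketch assumes protein input (PIR type code `P1`).

```python
def _make_record(header, chunks):
    # Split "id description" on the first space; join wrapped sequence lines.
    ident, _, desc = header.partition(" ")
    return ident, desc, "".join(chunks)


def parse_fasta(text):
    """Parse FASTA text into a list of (identifier, description, sequence)."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append(_make_record(header, chunks))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append(_make_record(header, chunks))
    return records


def to_pir(records, code="P1", width=60):
    """Write records in NBRF/PIR layout: '>P1;id', a title line, then the
    sequence terminated by '*' and wrapped at `width` characters."""
    out = []
    for ident, desc, seq in records:
        out.append(f">{code};{ident}")
        out.append(desc)
        body = seq + "*"
        out.extend(body[i:i + width] for i in range(0, len(body), width))
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    fasta = ">seq1 test protein\nMKV\nLLA\n"
    print(to_pir(parse_fasta(fasta)), end="")
```

Parsing into an intermediate record and writing from that record keeps each format's reader and writer independent, so adding a format requires one new reader and one new writer rather than a converter for every pair.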
Figure 3: The figure shows a sequence with accession-id P43780 retrieved from Swiss-Prot. It also
shows different menu options.
3. Applications Module
The application modules may be further classified into interdependent groups based on their functions as
follows:
a. Applications for primary analysis
b. Applications for comparative analysis
c. Applications for visual analysis
(A) Applications for Primary Analysis
The primary and the most important goal of IPSVAC is to serve varied sequence analysis options on a
single platter to the user. The Analysis category of IPSVAC generates different options as submenu
items depending on the type of the sequence: nucleotide or protein.
For nucleotide sequences, the submenu items displayed are: Statistics, Restriction sites, Tandem
repeats and BLAST. The Statistics option displays measures such as the nucleotide base count,
codon count and molecular weight of the sequence.
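The base and codon counts mentioned above can be sketched in a few lines of Python. This is an illustrative sketch only, with hypothetical function names; molecular weight is left out because it depends on the table of nucleotide weights chosen.

```python
from collections import Counter


def base_counts(seq):
    """Count each nucleotide symbol in the sequence (case-insensitive)."""
    return Counter(seq.upper())


def codon_counts(seq, frame=0):
    """Count non-overlapping codons starting at the given 0-based frame.
    A trailing partial codon is ignored."""
    seq = seq.upper()
    return Counter(seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))


def gc_fraction(seq):
    """Fraction of G and C bases; returns 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


if __name__ == "__main__":
    dna = "ATGGCCATGTAA"
    print(dict(base_counts(dna)))
    print(dict(codon_counts(dna)))
    print(round(gc_fraction(dna), 3))
```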
Restriction site analysis: Restriction enzymes (REs) recognize rather short sequences of double-stranded
DNA as targets for cleavage. Each restriction enzyme has a particular target in duplex DNA, usually a
specific sequence of four to six bp. REs are found in a wide range of bacterial species, and different REs
have different target sequences; thousands of REs have been discovered, and together
they recognize over 100 different DNA sequences. A restriction map is a linear
array of the sites on DNA cleaved by various REs. REs are the foundation of modern molecular biology and
the building blocks of commonly used recombinant DNA technology (RDT) techniques. RFLPs (restriction
fragment length polymorphisms), which are inherited differences in RE sites, are used in genetic mapping
to link the genome to a genetic marker. REs are also used in cloning techniques, construction of DNA libraries
etc. REBASE (Roberts, Vincze et al. 2003) is a database that provides information about restriction
enzymes and related proteins. Webcutter (Maarek 1997), NebCutter V2.0 (Vincze, Posfai et al. 2003) and
Restrict of EMBOSS are tools developed to find restriction enzyme cutting sites. The restriction sites
interface of IPSVAC performs this task. It provides an in-house database of important REs
obtained from the Bioperl library of restriction enzymes. When a sequence is given as input for
restriction site analysis, the cleavage patterns and fragmentation statistics for each
of the enzymes available in our library are displayed to the user in a formatted report.
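The kind of report described above can be sketched as follows. The sketch assumes a linear DNA molecule and scans only the given strand, which is sufficient for palindromic sites; the three enzymes in the `ENZYMES` table and the function names are chosen for illustration and are not the IPSVAC enzyme library.

```python
# Recognition site and cut offset within the site (cut after this many bases).
# EcoRI: G^AATTC, BamHI: G^GATCC, HindIII: A^AGCTT.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),
    "BamHI": ("GGATCC", 1),
    "HindIII": ("AAGCTT", 1),
}


def cut_positions(seq, site, offset):
    """Return 0-based cut positions (index of the first base after each cut)."""
    seq = seq.upper()
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + offset)
        start = seq.find(site, start + 1)
    return cuts


def fragment_lengths(seq, cuts):
    """Lengths of the fragments produced by cutting a linear molecule."""
    bounds = [0] + sorted(cuts) + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


if __name__ == "__main__":
    dna = "AAGAATTCTTTTGGATCCAA"
    for name, (site, offset) in ENZYMES.items():
        cuts = cut_positions(dna, site, offset)
        print(name, cuts, fragment_lengths(dna, cuts))
```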
Analysis of Tandem Repeats: The human genome shows considerable variability among
individuals in a population. Part of this variability is due to tandem repeats, that is, two or more
contiguous, approximate copies of a pattern of nucleotides. Tandem repeats have important applications
including genetic mapping (Mariat, De Gouyon et al. 1993), population studies (Cagnon), diagnosis
(Zischler, Nanda et al. 1989), forensics and DNA fingerprinting (Fredman, Siegfried et al. 2002). Depending
on the length of the repeat unit they are classified into several groups: mini-satellites (variable number of
tandem repeats, VNTRs) have core repeats of 9-80 bp, while micro-satellites (short tandem repeats, STRs)
contain 2-6 bp repeats. Extensive knowledge about pattern size, copy number, mutational history, etc. of
tandem repeats has been limited by the inability to easily detect them in genomic sequence data.
Tandem Repeats Finder (Benson 1999), MaskerAid (Bedell, Korf et al. 2000) and String (Parisi, De Fonzo et
al. 2003) are some of the tools available for finding tandem repeat regions in sequences. The Tandem
Repeat Detector (TRD) of IPSVAC takes as input a single sequence or a set of sequences in FASTA
format along with the desired tuple size (2 to 6) and the match, mismatch and threshold values. The program
reports all the micro-satellite regions, that is, direct and approximate repeats of the
specified tuple size, along with their alignment, the score calculated from the match and mismatch values,
the copy number, and the sequence name if multiple sequences are given as input.
The BLAST module performs a remote BLAST search using BLASTn on the sequence, taking as input the
database (the nr database), the organism (the search runs against all organisms) and the E-value (the default
value being 10e), and displays results in descending order of score along with the alignment (please see
applications for comparative analysis for more details).
For protein sequences, the submenu items include Cleavage Sites, which displays the signal cleavage
sites in the sequence, and the BLAST module, which performs a remote BLAST search using BLASTp (the results
of which are displayed in a graphical format, similar to those of BLASTn). In future versions of IPSVAC
we plan to include all the flavors of BLAST available at NCBI (Jenuth 2000).
Protein Cleavage site analysis: Specific signals present in polypeptide chains are responsible for the
transport and sorting of proteins to their appropriate destinations. Blobel, who was awarded the 1999
Nobel Prize in Physiology or Medicine for this work, established that
“proteins have intrinsic signals that govern transport and localization in the cell” (Blobel 1999). One
well-documented transport signal is the signal peptide, which forms the N-terminus of a protein that
is secreted through the cell membrane. These peptides mediate translocation across the endoplasmic
reticulum membrane in eukaryotes. In prokaryotes, signal peptides mediate translocation across the inner
and outer membranes of the cell. Translocation is mediated by cleavage at signal sites that span
the membrane regions. The prediction of these protein cleavage sites is helpful in various areas
such as cancer research (Smith 1994), recombinant protein expression and purification, and many
others. SignalP (Menne, Hermjakob et al. 2000) (using neural networks (Nielsen, Engelbrecht et
al. 1997) or HMMs (Nielsen and Krogh 1998)), SPScan of GCG10 and Sigcleave of EMBOSS (using a
weight matrix approach) are programs for predicting signal peptide cleavage sites. For
comparison and evaluation of these tools see . The IPSVAC cleavage site option finds the positions
where a specified enzyme might cut a peptide sequence and reports the starting and ending positions of
the fragments, sequence of fragment etc.
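As an example of the fragment report described above, the sketch below digests a peptide with one enzyme. Trypsin is used purely as an illustration, with the commonly stated rule that it cuts after K or R unless the next residue is P; the set of enzymes IPSVAC supports is not specified here, and the function name is hypothetical.

```python
def trypsin_fragments(peptide):
    """Cut after K or R, except when the next residue is P.

    Returns a list of (start, end, fragment) with 1-based inclusive positions.
    """
    peptide = peptide.upper()
    fragments = []
    start = 0
    for i, residue in enumerate(peptide):
        is_last = i == len(peptide) - 1
        if residue in "KR" and not is_last and peptide[i + 1] != "P":
            fragments.append((start + 1, i + 1, peptide[start:i + 1]))
            start = i + 1
    if start < len(peptide):
        # Whatever remains after the final cut is the C-terminal fragment.
        fragments.append((start + 1, len(peptide), peptide[start:]))
    return fragments


if __name__ == "__main__":
    for start, end, frag in trypsin_fragments("MKTAYRPGLKE"):
        print(start, end, frag)
```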
(B) Applications for Comparative Analysis
Pairwise sequence comparison: Sequence alignment is a crucial operation in bioinformatics and
genetics research. Typically a scoring function is used to rank different alignments so that biologically
plausible alignments score higher. Sequence alignments are used to probe sequence databases for similar
sequences, in genetic disease research, in the construction of phylogenetic trees, to compare functions
between similar genes, and to find how far they have diverged. The most basic algorithm for aligning two
sequences globally was developed by Needleman and Wunsch (Needleman and Wunsch 1970). Short and highly
similar subsequences may be missed in a global alignment because they are outweighed by the rest of the
sequence. In some cases, identifying such local alignments proves useful. The Smith and Waterman
algorithm (Waterman 1984) finds the pair of subsequences, one from each sequence, whose alignment
gives the maximum degree of similarity.
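The two algorithms can be illustrated with a short Python sketch. It assumes a simple match/mismatch score and a linear gap penalty (the default values below are arbitrary) rather than the substitution matrices used in practice, and the function names are hypothetical.

```python
def _sub(x, y, match, mismatch):
    return match if x == y else mismatch


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment; returns (score, aligned_a, aligned_b)."""
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                score[i - 1][j] + gap,   # gap in b
                score[i][j - 1] + gap,   # gap in a
            )
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == (
            score[i - 1][j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch)
        ):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))


def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-2):
    """Best local alignment score; cells are floored at zero."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            curr[j] = max(
                0,
                prev[j - 1] + _sub(a[i - 1], b[j - 1], match, mismatch),
                prev[j] + gap,
                curr[j - 1] + gap,
            )
            best = max(best, curr[j])
        prev = curr
    return best


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
    print(smith_waterman_score("TTTACGTTT", "GGACGGG"))
```

The only differences between the two recurrences are the zero floor and the choice of the best cell anywhere in the matrix, which is what lets the local method ignore poorly matching flanks.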
Pairwise sequence alignment indicates whether two sequences show enough similarity to infer that they are
homologous. Proteins that have a significant biological relationship to one another often
share only isolated regions of sequence similarity. For identifying relationships of this nature, the ability
to find local regions of optimal similarity is an advantage over global alignment. BLAST (Altschul,
Gish et al. 1990) is an alignment tool that uses a measure of local similarity to score sequence
alignments in such a way as to identify regions of good local alignment. The basic BLAST algorithm
can be used in DNA and protein sequence database searches, motif searches, gene identification
searches, and in the analysis of multiple regions of similarity in long DNA sequences. FASTA (Pearson
1990) and WU-BLAST (Lopez, Silventoinen et al. 2003) are two other sequence searching tools. The exact
global and local alignment methods are implemented using dynamic programming, which can be extended to
more complex situations such as overlapped matches, used for putting together a set of sequenced DNA
fragments (fragment assembly), and repeated matches. However, full dynamic programming is not
computationally feasible when the sequences become very large. Alternative methods have been proposed for
finding an optimal alignment in linear space (Gotoh 1982), (Myers and Miller 1988), (Hirschberg 1975).
The sequence alignment tool of IPSVAC can align sequences globally or locally and
can also align fragments using the overlapped matches method. The software
accepts two sequences as input, either by browsing the directory for sequence files or
by pasting the sequences into the space provided. The user also selects the type of
alignment to be performed. Four types of alignment are available:
global alignment (Needleman-Wunsch, Hirschberg), local alignment (Smith-Waterman,
Hirschberg), overlapped alignment and alignment for repeated matches (dynamic programming). The inputs
required by these algorithms, other than the sequences, are the scoring matrices. For nucleotide sequences
the options are the identity matrix, the transition-transversion matrix and the BLAST matrix, and for
protein sequences BLOSUM50, BLOSUM62, PAM30 and PAM250 are supported (Wilbur 1985; Henikoff and Henikoff
1992). Gap scores can be either linear (default) or affine and, depending on this selection, the gap open
penalty and gap extension penalty (defaults 10 and 2) are accepted as input. The output
contains the score of the alignment and the aligned sequences. The alignment tool can
display the alignment in a color-coded format to aid interpretation and to locate the aligned
regions of the sequence at a glance. The algorithms are implemented using dynamic programming.
The Hirschberg algorithm was implemented for
efficient use of space in processing the alignment results.
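The space saving behind the Hirschberg approach can be illustrated with its two building blocks: computing the last row of the global alignment matrix while keeping only two rows in memory, and using a forward and a reverse pass to find where an optimal alignment crosses the middle of the first sequence. The sketch below stops there; it is not the full divide-and-conquer recursion and not the IPSVAC code, and it assumes a match/mismatch score with a linear gap penalty. The function names are hypothetical.

```python
def nw_last_row(a, b, match=1, mismatch=-1, gap=-2):
    """Last row of the Needleman-Wunsch matrix using O(len(b)) memory."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap] + [0] * len(b)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr[j] = max(diag, prev[j] + gap, curr[j - 1] + gap)
        prev = curr
    return prev


def hirschberg_split(a, b, **scoring):
    """Find the column of b where an optimal global alignment crosses the
    midpoint of a. Returns (split_column, optimal_score)."""
    mid = len(a) // 2
    left = nw_last_row(a[:mid], b, **scoring)               # prefix scores
    right = nw_last_row(a[mid:][::-1], b[::-1], **scoring)  # suffix scores
    totals = [left[j] + right[len(b) - j] for j in range(len(b) + 1)]
    best = max(range(len(b) + 1), key=totals.__getitem__)
    return best, totals[best]


if __name__ == "__main__":
    print(nw_last_row("ACGT", "AGT")[-1])
    print(hirschberg_split("ACGT", "AGT"))
```

In the full algorithm the split is applied recursively to the two halves, which recovers the alignment itself in linear space at roughly twice the running time.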
Multiple sequence comparison: Multiple alignments of protein sequences are important tools in
studying sequences. The basic information they provide is the identification of conserved sequence regions.
Aligning several sequences provides insight into molecular evolution through the construction
of phylogenetic trees, the characterization of protein families, the identification of shared regions of
homology and the determination of a consensus of the aligned sequences. It also helps in predicting the
secondary and tertiary structures of protein sequences. Bains suggested an iterative method which involves
successive applications of the standard algorithms. It begins with a trial consensus alignment (say, the
alignment between sequences 1 and 2). Then the third sequence is aligned against the consensus sequence and a
new consensus emerges. This continues until the consensus alignment converges to a global consensus.
This type of method is very dependent on the order in which the sequences are introduced. One of the most
popular multiple alignment programs begins with all pairwise alignments and is called Clustal (Higgins
and Sharp 1988). Two variants currently being used are ClustalW (Thompson, Higgins et al. 1994) - a
local alignment algorithm, and the older ClustalV (Higgins 1994) - a global alignment method. Other
methods make use of a multiple dimensional dot plot and then look for dots that are common to each
group. Still others rely heavily on user input, such as the popular Windows program MACAW (Schuler,
Altschul et al. 1991). Others such as MSA (Gupta, Kececioglu et al. 1995) attempt to provide a near-optimal
sum-of-pairs global solution to the multiple alignment.
IPSVAC relies on ClustalW to perform multiple sequence alignment. A set of sequences is taken
as input and aligned using ClustalW, which is one of the most accurate
programs for multiple sequence alignment. It is a general-purpose multiple sequence alignment program
for DNA or proteins that produces biologically meaningful alignments of divergent
sequences. It calculates the best match for the selected sequences and lines them up so that the
identities, similarities and differences can be seen. The program accepts input in seven different
formats: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup),
GCG9/RSF and GDE flat file. All the sequences should be in one file, one after another.
Once the sequences are aligned, IPSVAC can display the output on screen and
color-code the alignment for better display and understanding. It also gives the user the option to
display the single nucleotide alterations at each position in the set of aligned sequences. The
details are discussed later in “applications for visual analysis” as an application for SNP search.
Phylogenetic tree analysis: Phylogenetics is the area of research concerned with finding the genetic
connections and relationships between species. The term phylogeny refers to these relationships, usually
presented as a phylogenetic tree. There are three kinds of methods for constructing phylogenetic trees:
distance matrix, maximum likelihood and parsimony. Distance matrix methods first estimate the
pairwise distances between the sequences, while the other methods construct many trees from all the
information in the multiple alignment and decide which is best. The simplest distance-based method is the
unweighted pair-group method using arithmetic averages (UPGMA). Ideally a researcher would like to
have a black box into which to throw sequences and get out a fully annotated phylogenetic tree. This is,
however, not possible for two reasons. First, an algorithm that considers all possible multiple sequence
alignments and then, for each alignment, all possible phylogenetic trees, and picks out the best one,
would take too much time. That is why most phylogenetic programs work on previously aligned
sequences. Second, the result is always strongly influenced by the criteria that are used to define the best
tree. Phylogenetic trees take several forms: they can be rooted or unrooted, binary or general, and may
or may not show edge lengths. TreeView (Page 1996), Phylip (Retief 2000), MEGA (Kumar, Tamura
et al. 2001) and TreeFinder are tools that compute or display phylogenetic trees.
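To make the distance matrix approach concrete, a minimal UPGMA sketch in Python follows. It repeatedly merges the closest pair of clusters and averages distances weighted by cluster size. The input layout (a dictionary keyed by pairs of names) and the function name are assumptions made for the example; this is not how PHYLIP is invoked.

```python
from itertools import combinations


def upgma(names, distances):
    """Build a UPGMA tree.

    names: list of leaf labels.
    distances: dict mapping a (name_a, name_b) tuple to their distance.
    Returns (tree, merges): the tree as nested 2-tuples, and a list of
    (cluster, height) entries in the order the clusters were formed.
    """
    sizes = {name: 1 for name in names}
    d = {frozenset(pair): dist for pair, dist in distances.items()}
    merges = []
    while len(sizes) > 1:
        # Pick the closest pair of current clusters.
        a, b = min(combinations(sizes, 2), key=lambda p: d[frozenset(p)])
        height = d[frozenset((a, b))] / 2
        merged = (a, b)
        for other in sizes:
            if other in (a, b):
                continue
            # Size-weighted average distance to the merged cluster.
            d[frozenset((merged, other))] = (
                sizes[a] * d[frozenset((a, other))]
                + sizes[b] * d[frozenset((b, other))]
            ) / (sizes[a] + sizes[b])
        sizes[merged] = sizes.pop(a) + sizes.pop(b)
        merges.append((merged, height))
    (tree,) = sizes
    return tree, merges


if __name__ == "__main__":
    dist = {("A", "B"): 2, ("A", "C"): 6, ("B", "C"): 6}
    print(upgma(["A", "B", "C"], dist))
```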
IPSVAC provides a graphical interface for constructing phylogenetic trees using the PHYLIP (Retief
2000) software. PHYLIP is a widely used package of programs for inferring phylogenies (evolutionary
trees). It supports the tree construction methods discussed earlier: distance matrix, maximum likelihood and
parsimony. The IPSVAC interface
supports a wide range of the methods provided by the PHYLIP suite. A set of sequences
is taken as input and, as a first step in preparing the PHYLIP input format, the sequences are aligned using
ClustalW. The output from ClustalW is then used as input to the PHYLIP program, and the method
chosen by the user is invoked to generate the phylogenetic tree.
(C) Applications for Visual Analysis
The Viewers provide two options: feature viewer, to render graphically the features annotated on DNA
or protein sequence, and a sequence viewer, to color-code the nucleotide bases/amino acid residues.
Numerous databases have been constructed to store these sequence regions and derived features, and
their associated functions. Common features of DNA sequence include introns, exons, 3' or 5'
untranslated regions, transcription start sites, cis-elements and other protein binding sites, repeats, low
complexity regions and single nucleotide polymorphisms (SNPs). Protein sequence features include
secondary structures (α-helices and β-strands), transmembrane regions, and post-translational
modifications such as phosphorylation and glycosylation sites. Because a large number of features can be
associated with a single sequence, it is extremely difficult for a text record to reveal all the
salient features in an intuitive fashion. THEATRE (Edwards, Carver et al. 2003) is an attempt to
combine the features produced by widely used sequence analysis tools or databases. SeqVISTA is
another graphical tool that facilitates visualization of sequence features.
The feature viewer of IPSVAC displays the feature information associated with the sequence,
such as CDS regions, exons, introns, poly-A signal sites, promoter regions and repeat regions. For each
feature annotated on the sequence, a thick colored line is drawn to show its extent and location
in that sequence. The whole sequence is scaled to fit the image, so the length and
position of each line represent the number of bases the feature spans and where it lies. A different color
is used for each type of feature. This provides a bird’s eye view of all the important
features in a sequence (Figure 4). In future we would like to add the ability to zoom
into each feature, so that the user can view its details more clearly.
In future versions we would also like to include a module for the automated annotation of new, unannotated
sequences.
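The scaling step described above amounts to mapping base coordinates onto pixel coordinates. The sketch below shows one way to do this in Python; the color palette, the default image width and the function names are hypothetical and are not taken from IPSVAC.

```python
# Hypothetical palette: one color per feature type.
FEATURE_COLOURS = {"CDS": "red", "exon": "blue", "intron": "grey"}


def feature_to_pixels(start, end, seq_len, width_px):
    """Map 1-based inclusive base coordinates to a half-open pixel span.

    Integer arithmetic avoids rounding surprises; every feature is drawn
    at least one pixel wide so short features stay visible.
    """
    x0 = (start - 1) * width_px // seq_len
    x1 = max(x0 + 1, end * width_px // seq_len)
    return x0, x1


def layout(features, seq_len, width_px=800):
    """Turn (type, start, end) features into (type, color, x0, x1) rows."""
    rows = []
    for kind, start, end in features:
        x0, x1 = feature_to_pixels(start, end, seq_len, width_px)
        rows.append((kind, FEATURE_COLOURS.get(kind, "black"), x0, x1))
    return rows


if __name__ == "__main__":
    feats = [("CDS", 1, 500), ("exon", 501, 1000), ("repeat", 10, 10)]
    for row in layout(feats, seq_len=1000):
        print(row)
```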
The sequence viewer module in IPSVAC takes one or more sequences and improves the
understanding of their important features, including alignment features, by providing convenient visual
images of sequences color-coded according to user selections.
For nucleotide sequences, the options allow the user to highlight individual bases or sets of
bases such as purines/pyrimidines, AT/GC, or start/stop codons in different colors. For protein
sequences, different color codes represent residue properties such as hydrophobicity
and polarity. Each of these options, when selected, displays all residues with that
property in a given color and all others in white (Figure 5).
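Before anything can be colored, the positions to highlight have to be located. As a small illustration, the sketch below marks start and stop codons in one reading frame, assuming the standard genetic code (ATG as start; TAA, TAG and TGA as stops). The function name and the output layout are hypothetical.

```python
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codon_marks(seq, frame=0):
    """List (position, codon, kind) for start/stop codons in one frame.

    Positions are 1-based and refer to the first base of the codon;
    `frame` is the 0-based offset (0, 1 or 2) of the reading frame.
    """
    seq = seq.upper()
    marks = []
    for i in range(frame, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in START_CODONS:
            marks.append((i + 1, codon, "start"))
        elif codon in STOP_CODONS:
            marks.append((i + 1, codon, "stop"))
    return marks


if __name__ == "__main__":
    for frame in range(3):
        print(frame, codon_marks("ATGAAATAGCC", frame))
```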
Figure 4: The output of the feature viewer for a DNA sequence with accession id AE009723
A special feature of this viewer module is its integration with sub modules and its interfacing with external
servers for pre- and post-processing of sequences. A single input and the click of a button are all
that is required from the user to visualize important derived features rendered in useful color-coded
formats. For example, the user can input up to two sequences retrieved into the user
account in one go, and obtain all six reading frames for both of them with start/stop codons highlighted
in color. For each individual reading frame, the user also has the option of viewing the corresponding
translated amino acid sequence with individual amino acids colored according to user-selected
attributes. For any given pair of sequences, in addition to the options mentioned above, the user can
choose to view pair-wise alignment features in color-coded format. This involves automated
interfacing with Align::NW, a Bioperl module, for the required pair-wise alignment using the
Needleman and Wunsch algorithm. For multiple sequences, IPSVAC interfaces with ClustalW and
takes multiple files in FASTA format as input. The ClustalW output is parsed to remove the exact
matches, also referred to as blocks, and is processed to find the altered nucleotide base positions across
the length of the sequences, which are rendered in the SUBINDEL visualizer. This feature is designed to find
substitutions, insertions and deletions of nucleotide bases at each position in the alignment.
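The column-by-column processing described above can be sketched as follows. The sketch takes already aligned, gapped sequences (rather than parsing ClustalW output), drops the exactly matching columns, and labels the rest. It is an illustration under these assumptions, not the SUBINDEL code; a column containing both a gap and differing bases is simply labelled as an indel.

```python
def classify_columns(aligned):
    """Report the non-identical columns of a gapped alignment.

    aligned: list of equal-length strings using '-' for gaps.
    Returns a list of (position, column, kind) with 1-based positions,
    where kind is 'substitution' or 'indel'.
    """
    if not aligned:
        return []
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("aligned sequences must have equal length")
    variants = []
    for pos in range(length):
        column = "".join(s[pos] for s in aligned).upper()
        symbols = set(column)
        if len(symbols) == 1:
            continue  # exact match (part of a conserved block)
        kind = "indel" if "-" in symbols else "substitution"
        variants.append((pos + 1, column, kind))
    return variants


if __name__ == "__main__":
    msa = ["ACGTA", "ACCTA", "AC-TA"]
    for entry in classify_columns(msa):
        print(entry)
```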
It may be mentioned here that polymorphisms in nucleotide sequences are used extensively as
molecular markers in human genome analysis as well as in crop breeding programs. Compared to other
types of polymorphisms, the detection and discovery of SNPs have attracted greater interest because of
their abundance, evolutionary stability and usefulness for the genetic dissection of complex traits (Wu, Di
Rienzo et al. 2001) and diseases (Simmons 2001). They are an invaluable tool for genome mapping,
offering the potential for generating very high-density genetic maps, which can be used to develop
haplotyping systems for genes or regions of interest. Unlike RFLPs (Restriction Fragment Length
Polymorphisms), SNPs are direct markers providing the exact nature of allelic variants. They are far
more prevalent than SSRs (Simple Sequence Repeats) and hence provide a high density of markers near
a locus of interest. Several SNP databases have been developed, such as dbSNP (Sherry, Ward et al.
2001) and HGVBase (Fredman, Siegfried et al. 2002). Efforts are also under way to develop in silico tools
for detecting SNPs. However, to our knowledge most such tools have to be used in conjunction with
laboratory experiments, and their results must in any case be validated in the laboratory. A convenient
visual rendering of MSA output, as provided by the SUBINDEL visualizer, may be useful for, among other things,
a preliminary short-listing of potential single nucleotide polymorphisms (SNPs).
Figure 5: The figure displays the output of sequence viewer for a protein sequence
4. Storage Module
All sequences retrieved by the user are stored in a separate, dynamically allotted area. They
automatically appear in a drop-down list in the GUI and can easily be given as inputs to any of the
modules. The storage module also stores the summary results and operational details on a separate page
maintained for each registered user. Provision has been made for this summary to
generate the detailed results automatically when given as an input to the stand-alone version of IPSVAC.
A ready-to-use, downloadable stand-alone version of IPSVAC is to be placed on the web site very soon.
Conclusions
We have developed a tool that provides a number of facilities for the analysis of biological sequences
on a single platform, from uploading a sequence to pairwise alignment, database search, multiple
alignment and phylogeny analysis. The work described in this paper outlines our approach,
involving the integration of modular components, towards addressing an urgent need of biologists.
This preliminary result leaves room for improvement in terms of increasing the scope of
application areas. Interfaces with, for example, tools involving new algorithms for the analysis
of large-scale gene expression data are planned for the future.
Acknowledgements
Sri Jyothsna Yeleswarapu contributed towards the color-coding modules. Ram Sateesh Talari
contributed towards development of the web interface.
The authors wish to thank Dr. Jayashree Balaji from ICRISAT for initial guidance on different
modules, Dr. Nita Parekh for providing helpful suggestions for refinement and improvements in tool
design, and Dr. C. K. Mitra for reviewing the write-up. One of the authors (Naveena V. K. Y.) is
currently at CMU under a research exchange program with IIIT Hyderabad and is thankful to Dr. Raj
Reddy and Dr. Judith Klein-Seetharaman for encouraging her to write up this work for publication.
References
Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.
Bedell, J. A., I. Korf, et al. (2000). "MaskerAid: a performance enhancement to RepeatMasker."
Bioinformatics 16(11): 1040-1.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res
27(2): 573-80.
Blobel, G. (1999). "Proteins have intrinsic signals that govern their transport and localization in the
cell." http://www.nobel.se/medicine/laureates/1999/presentation-speech.html
Cagnon, C. "http://www.univpau.fr/RECHERCHE/LEM/LEM/Lemenglish/LEMresearch/Biology/Study1.htm."
Edwards, Y. J., T. J. Carver, et al. (2003). "Theatre: A software tool for detailed comparative analysis
and visualization of genomic sequence." Nucleic Acids Res 31(13): 3510-7.
Fredman, D., M. Siegfried, et al. (2002). "HGVbase: a human sequence variation database emphasizing
data quality and a broad spectrum of data sources." Nucleic Acids Res 30(1): 387-91.
Gotoh, O. (1982). "An improved algorithm for matching biological sequences." J Mol Biol 162(3): 705-8.
Gupta, S. K., J. D. Kececioglu, et al. (1995). "Improving the practical space and time efficiency of the
shortest-paths approach to sum-of-pairs multiple sequence alignment." J Comput Biol 2(3): 459-72.
Henikoff, S. and J. G. Henikoff (1992). "Amino acid substitution matrices from protein blocks." Proc
Natl Acad Sci U S A 89(22): 10915-9.
Higgins, D. G. (1994). "CLUSTAL V: multiple alignment of DNA and protein sequences." Methods
Mol Biol 25: 307-18.
Higgins, D. G. and P. M. Sharp (1988). "CLUSTAL: a package for performing multiple sequence
alignment on a microcomputer." Gene 73(1): 237-44.
Hirschberg, D. S. (1975). "A linear space algorithm for computing maximal common subsequences."
Commun ACM 18(6): 341-3.
Jenuth, J. P. (2000). "The NCBI. Publicly available tools and resources on the Web." Methods Mol Biol
132: 301-12.
Kumar, S., K. Tamura, et al. (2001). "MEGA2: molecular evolutionary genetics analysis software."
Bioinformatics 17(12): 1244-5.
Lopez, R., V. Silventoinen, et al. (2003). "WU-Blast2 server at the European Bioinformatics Institute."
Nucleic Acids Res 31(13): 3795-8.
Mariat, D., B. De Gouyon, et al. (1993). "Genetic mapping through the use of synthetic tandem repeats
in the mouse genome." Mamm Genome 4(3): 135-40.
Menne, K. M., H. Hermjakob, et al. (2000). "A comparison of signal sequence prediction methods using
a test set of signal peptides." Bioinformatics 16(8): 741-2.
Myers, E. W. and W. Miller (1988). "Optimal alignments in linear space." Comput Appl Biosci 4(1): 11-7.
Needleman, S. B. and C. D. Wunsch (1970). "A general method applicable to the search for similarities
in the amino acid sequence of two proteins." J Mol Biol 48(3): 443-53.
Nielsen, H., J. Engelbrecht, et al. (1997). "A neural network method for identification of prokaryotic and
eukaryotic signal peptides and prediction of their cleavage sites." Int J Neural Syst 8(5-6): 581-99.
Nielsen, H. and A. Krogh (1998). "Prediction of signal peptides and signal anchors by a hidden Markov
model." Proc Int Conf Intell Syst Mol Biol 6: 122-30.
Page, R. D. (1996). "TreeView: an application to display phylogenetic trees on personal computers."
Comput Appl Biosci 12(4): 357-8.
Parisi, V., V. De Fonzo, et al. (2003). "STRING: finding tandem repeats in DNA sequences."
Bioinformatics 19(14): 1733-8.
Pearson, W. R. (1990). "Rapid and sensitive sequence comparison with FASTP and FASTA." Methods
Enzymol 183: 63-98.
Rastogi, P. A. (2000). "MacVector. Integrated sequence analysis for the Macintosh." Methods Mol Biol
132: 47-69.
Retief, J. D. (2000). "Phylogenetic analysis using PHYLIP." Methods Mol Biol 132: 243-58.
Rice, P., I. Longden, et al. (2000). "EMBOSS: the European Molecular Biology Open Software Suite."
Trends Genet 16(6): 276-7.
Schuler, G. D., S. F. Altschul, et al. (1991). "A workbench for multiple alignment construction and
analysis." Proteins 9(3): 180-90.
Sherry, S. T., M. H. Ward, et al. (2001). "dbSNP: the NCBI database of genetic variation." Nucleic
Acids Res 29(1): 308-11.
Smith, B. J. (1994). "Chemical cleavage of proteins." Methods Mol Biol 32: 297-309.
Thompson, J. D., D. G. Higgins, et al. (1994). "CLUSTAL W: improving the sensitivity of progressive
multiple sequence alignment through sequence weighting, position-specific gap penalties and
weight matrix choice." Nucleic Acids Res 22(22): 4673-80.
Waterman, M. S. (1984). "Efficient sequence alignment algorithms." J Theor Biol 108(3): 333-7.
Wilbur, W. J. (1985). "On the PAM matrix model of protein evolution." Mol Biol Evol 2(5): 434-47.
Womble, D. D. (2000). "GCG: The Wisconsin Package of sequence analysis programs." Methods Mol
Biol 132: 3-22.
Zischler, H., I. Nanda, et al. (1989). "Digoxigenated oligonucleotide probes specific for simple repeats
in DNA fingerprinting and hybridization in situ." Hum Genet 82(3): 227-33.