
Technical report TR No: IIIT-H/TR/2004/012

IPSVAC – Integrated Platform for Sequence Visualization, Analysis and Comparison

Naveena V.K. Yanamala 1, Harikrishna V. Rekapalli 1, Sri Jyothsna Yeleswarapu, Ram Sateesh Talari, Abhijit Mitra 2
International Institute of Information Technology, Gachibowli, Hyderabad 500019, India
091-040-23001969
[email protected], {harikrishna, jyothsna, ramsateesh_iiit}@msitprogram.net, [email protected]

Keywords
Integrated toolkit, Sequence Analysis tool, Multiple Sequence Alignment, Tandem Repeats, Sequence Viewer

Abstract
Motivation: With the increased availability of sequence information and the concomitant increase in the number of automated analysis servers, biologists today need to deal with multiple data sources and multiple software tools that use diverse methods and algorithms for analysis. For knowledge mining and hypothesis building exercises, user-level intervention in terms of comparing, validating and visualizing second-order features derived from the sequence data is of crucial importance. Biologists in the post-genomic era urgently need convenient options for doing this from an integrated platform that lets the user operate with a single input format and retrieve parsed outputs relevant to their research context from different servers, in customizable, user-defined visual formats for easy comparison and analysis. The essential challenge is the seemingly inevitable need to negotiate with heterogeneous legacy resources, which have arisen from rapid parallel developments in technology, approach and algorithms.
Results: IPSVAC is an integrated toolkit that is interactive, customizable and modular. It can be used for the analysis, display and storage of results related to biological sequence data, obtained from a choice of tools that are either publicly available or incorporated as add-on modules.
In this paper we present an overview of our approach towards addressing the needs of individual users while developing this prototype. An integrated file format converter addresses the requirement for different input formats. The modular design provides the option of adding alternative and complementary algorithms for analysis. Interactive modules that provide the user with options for data sources and software tools also allow the user to customize the rendering and visualization of the outputs. An additional feature allows the registered user to maintain a personalized page containing summarized results accumulated over a period of time for follow-up analysis.
Availability: IPSVAC is a web-based tool and can be accessed freely from http://bioinformatics.iiit.net/ipsvac.

1. Contributed towards the major part of the development of algorithms and integration of the modules
2. To whom all correspondence may be addressed ([email protected])

Introduction
Recent advances in high-throughput experimental techniques have resulted in a rapid explosion of sequence data on the World Wide Web. Efficient handling and analysis of these data is required for diverse research areas such as biomedical research, molecular evolutionary analysis, functional genomics and personalized medicine, and calls for the development of easy-to-use computer programs. Although computer programs and database resources for bioinformatics applications are becoming more widely available, these resources do not conform to any uniform or specific standard and are frequently incompatible. Platforms that integrate heterogeneous software are therefore of immense importance to the field. Several integrated software packages have been developed to address this requirement, such as EMBOSS (Rice, Longden et al. 2000), GCG (Womble 2000) and MacVector (Rastogi 2000).
Some of them are commercial, while others are command-line driven and do not support carrying forward sequences and results from one application to another. An experimentalist needs to perform various operations on a sequence, and uploading the sequence again for each operation can become a laborious process (Figure 1). In this paper, we describe an integrated toolkit, called Integrated Platform for Sequence Visualization, Analysis and Comparison (IPSVAC), that integrates several analysis applications into a generalized software platform. The toolkit is written using Java, CGI and Perl 5.1.8 and is freely available for research and academic purposes (http://bioinformatics.iiit.net/ipsvac). It is supported on all popular browsers such as IE, Netscape, Mozilla and Opera.

Figure 1: Sequence of steps performed for an analysis task. The figure shows a researcher/scientist carrying out the following steps in order:
Step 1: A DNA sequence is stored in a file.
Step 2: The DNA sequence is blasted against the NCBI database to find similar sequences.
Step 3: All the sequences that have a good similarity with the query sequence are retrieved from the available global databases using their accession ids.
Step 4: All the retrieved sequences are aligned using a web-based tool.
Step 5: The alignment is stored and submitted to another software package for phylogenetic analysis.

Integrated Toolkit
Integrated Platform for Sequence Visualization, Analysis and Comparison (IPSVAC) is an integrated toolkit for performing most sequence analysis applications (Figure 2). The primary feature of this toolkit is that it supports analyzing a set of sequences on an integrated platform, carrying forward sequences from one tool to another. Secondly, the features derived during the analysis and the results are displayed as meaningful color-coded annotations as opposed to plain text. Finally, the toolkit supports file format conversion for extended functionality.
The toolkit broadly consists of four components, as shown in Figure 2: Sequence Retriever, File Format Inter-converter, Applications module and Storage module. The Applications module is further divided into sub-modules.

1. Sequence Retriever
The first step in IPSVAC is to retrieve a sequence, either from a locally available database or from remote databases. The Sequence Retriever provides an interface with two options: uploading a sequence present in the user account, or fetching one from an available remote database. It supports fetching records from GenBank, Swissprot and RefSeq, and journal citations from MEDLINE, given the accession id and the database name. It displays the sequence by default, and other subsections of the record, such as the annotation and the feature table, on explicit selection. Several interfaces are provided for performing further analysis on the fetched data.

Figure 2: Overall functionality of the toolkit IPSVAC. The figure shows the following hierarchy:
IPSVAC
  Sequence retrieval
  File format conversion
  Applications
    Primary analysis
      1. Restriction sites
      2. Sequence statistics
      3. Tandem repeats
      4. Cleavage sites
      5. BLAST
    Comparative analysis
      1. Pair-wise sequence alignment
      2. Multiple sequence alignment
      3. Phylogenetic trees
    Visual analysis
      1. Feature viewer
      2. Sequence viewer
      3. Phylogenetic tree viewer
  Storage

The interface provides a menu that allows the user to perform different operations on the retrieved sequence. Figure 3 displays a screenshot of the interface with the menu items File, Sequence, Analysis and Viewers. The File menu allows the user to open and save sequence records. The Sequence menu allows the user to view the Sequence, Feature table and Annotation information separately.
The Annotation gives details of the sequence record such as the accession id, the definition, the keywords, the source organism, the references and the authors, while the Feature table displays all the special features within the sequence, such as genes, CDS regions, translated sequences and signal peptide regions. These three sections together present all the details of the sequence in text format. A graphical representation of the feature table is also provided, described later as the Feature Viewer. The Analysis category provides users with different analysis options as submenu items depending on the type of the sequence: nucleotide or protein. The Viewers category provides two options: a feature viewer, which graphically renders the features annotated on a DNA or protein sequence, and a sequence viewer, which color-codes the nucleotide bases or amino acid residues. The Analysis and Viewers categories are described in detail in the Applications Module section.

2. File Format Inter-converter
This tool supports converting sequence data between a variety of file formats: FastA, GenBank, EMBL, GCG, PIR, ACE, SWISS-PROT and FASTQ. The output file after conversion can be used as input to other programs or saved for later use.

Figure 3: A sequence with accession id P43780 retrieved from Swiss-Prot, together with the different menu options.

3. Applications Module
The application modules may be further classified into interdependent groups based on their functions as follows:
a. Applications for primary analysis
b. Applications for comparative analysis
c. Applications for visual analysis

(A) Applications for Primary Analysis
The primary and most important goal of IPSVAC is to offer the user varied sequence analysis options on a single platform. The Analysis category of IPSVAC generates different options as submenu items depending on the type of the sequence: nucleotide or protein.
For nucleotide sequences, the submenu items displayed are: Statistics, Restriction sites, Tandem repeats and BLAST. The Statistics option displays measures such as the nucleotide base count, the codon count and the molecular weight of the sequence.

Restriction site analysis: Restriction enzymes (REs) recognize rather short sequences of double-stranded DNA as targets for cleavage. Each restriction enzyme has a particular target in duplex DNA, usually a specific sequence of four to six bp. REs are found in a wide range of bacterial species, and different REs have different target sequences. Thousands of REs have been discovered, and between them they recognize over 100 different DNA sequences. A restriction map is a linear array of the sites on DNA cleaved by various REs. REs are among the foundations of modern molecular biology and are building blocks of commonly used recombinant DNA technology (RDT) techniques. RFLPs (restriction fragment length polymorphisms), which are inherited differences in RE sites, are used in genetic mapping to link the genome to a genetic marker. REs are also used in cloning techniques, the construction of DNA libraries and so on. REBASE (Roberts, Vincze et al. 2003) is a database that provides information about restriction enzymes and related proteins. Webcutter (Maarek 1997), NebCutter V2.0 (Vincze, Posfai et al. 2003) and Restrict of EMBOSS are tools developed to find restriction enzyme cutting sites. The Restriction sites interface in IPSVAC performs this task. It provides an in-house database of important REs obtained from the Bioperl library of restriction enzymes. When a sequence is given as input for restriction site analysis, the cleavage patterns and fragmentation statistics for each of the enzymes available in our library are displayed to the user in a formatted report.
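The restriction site search described above can be illustrated with a minimal sketch. It is not the IPSVAC implementation: the three-enzyme table, the function names and the report layout are assumptions made for illustration, the sequence is treated as linear, and only the top strand is scanned.

```python
# Hypothetical mini-library: enzyme name -> (recognition site, cut offset
# inside the site on the top strand). IPSVAC uses the Bioperl enzyme
# collection instead; these three well-known enzymes are only an example.
ENZYMES = {
    "EcoRI": ("GAATTC", 1),    # G^AATTC
    "BamHI": ("GGATCC", 1),    # G^GATCC
    "HindIII": ("AAGCTT", 1),  # A^AGCTT
}


def find_cut_positions(sequence, site, offset):
    """Return 0-based indices before which the top strand is cut."""
    sequence = sequence.upper()
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + offset)
        # Step by one so that overlapping sites are also reported.
        start = sequence.find(site, start + 1)
    return cuts


def fragment_lengths(sequence, cuts):
    """Lengths of the fragments of a linear molecule cut at the given positions."""
    bounds = [0] + sorted(cuts) + [len(sequence)]
    return [end - begin for begin, end in zip(bounds, bounds[1:])]


def restriction_report(sequence):
    """Cut positions and fragment lengths for every enzyme in ENZYMES."""
    report = {}
    for name, (site, offset) in ENZYMES.items():
        cuts = find_cut_positions(sequence, site, offset)
        report[name] = {
            "cuts": cuts,
            "fragments": fragment_lengths(sequence, cuts),
        }
    return report
```

Calling `restriction_report` on a sequence gives, per enzyme, the list of cut positions and the resulting fragment lengths; an enzyme with no site in the sequence yields an empty cut list and a single fragment covering the whole sequence.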
Analysis of Tandem Repeats: The human genome shows considerable variability among individuals in a population. This variability is due to tandem repeats, that is, two or more contiguous, approximate copies of a nucleotide pattern. Tandem repeats have important applications including genetic mapping (Mariat, De Gouyon et al. 1993), population studies (Cagnon), diagnosis (Zischler, Nanda et al. 1989), forensics and DNA fingerprinting (Fredman, Siegfried et al. 2002). Depending on the repeat region they are classified into several groups: mini-satellites (variable number of tandem repeats, VNTRs) have core repeats of 9-80 bp, while micro-satellites (short tandem repeats, STRs) contain repeats of 2-6 bp. Extensive knowledge about pattern size, copy number, mutational history, etc. for tandem repeats has been limited by the inability to detect them easily in genomic sequence data. Tandem Repeats Finder (Benson 1999), MaskerAid (Bedell, Korf et al. 2000) and STRING (Parisi, De Fonzo et al. 2003) are some of the tools available for finding tandem repeat regions in sequences. The Tandem Repeat Detector (TRD) of IPSVAC takes as input a single sequence or a set of sequences in FASTA format, along with the desired tuple size (2 to 6) and the match, mismatch and threshold values. The program reports all micro-satellite regions, both direct and approximate repeats of the specified tuple size (along with their alignment), a score calculated from the match and mismatch values, the copy number and, if multiple sequences are given as input, the sequence name.

The BLAST module performs a remote BLASTn search on the sequence. It takes as input the database (which is the nr database), the organism (it blasts against all organisms) and the E-value (the default value being 10e), and displays the results in descending order of score along with the alignments (see Applications for Comparative Analysis for more details).
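To make the micro-satellite detection described for the Tandem Repeat Detector more concrete, here is a minimal sketch that finds exact (direct) repeats only. The approximate-repeat search, the mismatch penalty and the threshold used by TRD are not reproduced; the function name, the `min_copies` and `match_score` parameters and the scoring rule are assumptions for illustration.

```python
import re


def find_exact_repeats(sequence, unit_size, min_copies=3, match_score=2):
    """Find runs of an exactly repeated unit of `unit_size` bases.

    Returns one dict per run with the 1-based start position, the repeat
    unit, the copy number and a score. The score here is simply
    match_score times the number of repeated bases (an assumed rule).
    """
    if not 2 <= unit_size <= 6:
        raise ValueError("unit_size must be between 2 and 6")
    if min_copies < 2:
        raise ValueError("min_copies must be at least 2")

    # A unit of the requested size followed by at least (min_copies - 1)
    # further exact copies of itself.
    pattern = re.compile(r"([ACGT]{%d})\1{%d,}" % (unit_size, min_copies - 1))

    hits = []
    for match in pattern.finditer(sequence.upper()):
        copies = (match.end() - match.start()) // unit_size
        hits.append({
            "start": match.start() + 1,
            "unit": match.group(1),
            "copies": copies,
            "score": copies * unit_size * match_score,
        })
    return hits
```

Note that a homopolymer run such as AAAAAA is also reported as a repeat of the unit AA; a fuller detector would normally filter such cases and extend each hit by alignment to tolerate mismatches.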
For protein sequences, the submenu items include Cleavage Sites, which displays the signal cleavage sites in the sequence, and the BLAST module, which performs a remote BLAST using BLASTp (the results are displayed in a graphical format, similar to those of BLASTn). In future versions of IPSVAC we plan to include all the different flavors of BLAST available at NCBI (Jenuth 2000).

Protein cleavage site analysis: Specific signals present in polypeptide chains are responsible for the transport and sorting of proteins to their appropriate destinations. Blobel was awarded the 1999 Nobel Prize for discovering that "proteins have intrinsic signals that govern transport and localization in the cell" (Blobel 1999). One well-documented transport signal is the signal peptide, which forms the N-terminal of a protein that is secreted through the cell membrane. These peptides mediate translocation across the endoplasmic reticulum membrane in eukaryotes. In prokaryotes, signal peptides mediate translocation across the inner and outer membranes of the cell. The translocation is mediated by cleavage at signal sites that span the membrane regions. The prediction of these protein cleavage sites is helpful in areas such as cancer research (Smith 1994), recombinant protein expression and purification, and many others. SignalP (Menne, Hermjakob et al. 2000), which uses neural networks (Nielsen, Engelbrecht et al. 1997) or HMMs (Nielsen and Krogh 1998), SPScan of GCG10 and Sigcleave of EMBOSS (which use a weighted matrix approach) are programs for predicting signal peptide cleavage sites. For a comparison and evaluation of these tools see (Menne, Hermjakob et al. 2000). The IPSVAC cleavage site option finds the positions where a specified enzyme might cut a peptide sequence and reports the starting and ending positions of the fragments, the sequence of each fragment, etc.
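The enzyme-based fragment report described above can be sketched as follows. The rule used is the common simplified trypsin rule (cut after K or R unless the next residue is P), chosen here only as an example; the function name, its parameters and the tuple layout of the result are assumptions, not the IPSVAC interface.

```python
def cleave(peptide, residues="KR", blocked_by="P"):
    """Cut a peptide after any residue in `residues`, unless the following
    residue is in `blocked_by`.

    Returns a list of (start, end, fragment) tuples with 1-based,
    inclusive positions. The defaults model a simplified trypsin rule.
    """
    peptide = peptide.upper()
    fragments = []
    start = 0
    for i, amino_acid in enumerate(peptide):
        is_last = i == len(peptide) - 1
        # Never cut after the final residue: there is nothing to separate.
        if amino_acid in residues and not is_last and peptide[i + 1] not in blocked_by:
            fragments.append((start + 1, i + 1, peptide[start:i + 1]))
            start = i + 1
    if start < len(peptide):
        fragments.append((start + 1, len(peptide), peptide[start:]))
    return fragments
```

For the peptide AKRPGKLM this yields three fragments: AK (1-2), RPGK (3-6) and LM (7-8); the R at position 3 is not cut because it is followed by P.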
(B) Applications for Comparative Analysis

Pairwise sequence comparison: Sequence alignment is a crucial operation in bioinformatics and genetics research. Typically a scoring function is used to rank different alignments so that biologically plausible alignments score higher. Sequence alignments are used to probe sequence databases for similar sequences, in genetic disease research, in the construction of phylogenetic trees, to compare functions between similar genes, and to find how much they have diverged. The most basic algorithm for aligning two sequences was developed by Needleman and Wunsch (Needleman and Wunsch 1970). Short and highly similar subsequences may be missed in a global alignment because they are outweighed by the rest of the sequence. In some cases, identifying such local alignments proves useful. The Smith and Waterman algorithm (Waterman 1984) finds the subsequence pair that gives the maximum degree of similarity between the two sequences. A pairwise sequence alignment conveys whether two sequences show enough similarity to infer that they are homologous to one another. Proteins that have a significant biological relationship to one another often share only isolated regions of sequence similarity. For identifying relationships of this nature, the ability to find local regions of optimal similarity is an advantage over global alignment. BLAST (Altschul, Gish et al. 1990) is an alignment tool that uses a measure of local similarity to score sequence alignments in such a way as to identify regions of good local alignment. The basic BLAST algorithm can be used in DNA and protein sequence database searches, motif searches, gene identification searches, and in the analysis of multiple regions of similarity in long DNA sequences. FASTA (Pearson 1990) and WU-BLAST (Lopez, Silventoinen et al. 2003) are two other sequence searching tools.
The Needleman-Wunsch and Smith-Waterman methods are based on the dynamic programming technique, which can be extended to more complex situations such as overlapped matches, used for putting together a set of sequenced DNA fragments (fragment assembly), and repeated matches. However, the standard formulation needs memory proportional to the product of the two sequence lengths, so it is not computationally feasible when the sequences become very large. Alternative methods were proposed for finding an optimal alignment in linear space (Gotoh 1982), (Myers and Miller 1988), (Hirschberg 1975). The sequence alignment tool of IPSVAC can align sequences globally and locally, and can also align fragments using the overlapped matches method. The software accepts two sequences as input, either by browsing the directory for sequence files or by pasting the sequences into the space provided. The user also selects the type of alignment to be performed. Four types of algorithm are available: global alignment (Needleman-Wunsch, Hirschberg), local alignment (Smith-Waterman, Hirschberg), overlapped alignment, and alignment for repeated matches (dynamic programming). Besides the sequences, these algorithms require a scoring matrix as input. For nucleotide sequences the options are the identity matrix, the transition-transversion matrix and the Blast-Matrix, and for protein sequences BLOSUM50, BLOSUM62, PAM30 and PAM250 are supported (Wilbur 1985; Henikoff and Henikoff 1992). Gap scoring can be either linear (default) or affine and, depending on this selection, the gap open penalty and gap extension penalty (defaults 10 and 2) are accepted as input. The output contains the score of the alignment and the aligned sequences. The alignment tool can represent the alignment in a color-coded format, which makes it easier to interpret and to locate the aligned regions of the sequences at a glance.
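As an illustration of the global alignment option with a linear gap penalty, the following is a minimal Needleman-Wunsch sketch. It uses a simple match/mismatch score in place of the scoring matrices listed above and keeps the full quadratic table rather than the linear-space Hirschberg variant; the function name and the default scores are assumptions chosen for the example.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Global alignment of strings a and b with a linear gap penalty.

    Returns (score, aligned_a, aligned_b). Uses O(len(a) * len(b)) memory.
    """
    n, m = len(a), len(b)

    def substitution(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # score[i][j] = best score for aligning a[:i] with b[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + substitution(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,                     # gap in b
                score[i][j - 1] + gap,                     # gap in a
            )

    # Trace back from the bottom-right corner, preferring the diagonal move.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + substitution(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], "".join(reversed(out_a)), "".join(reversed(out_b))
```

With these default scores, aligning ACGT against AGT gives a score of 1 (three matches and one gap) with the alignment ACGT over A-GT. A local (Smith-Waterman) variant would additionally clamp each cell at zero and start the traceback from the highest-scoring cell.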
The program uses the dynamic programming technique to implement these algorithms. The Hirschberg algorithm was implemented for efficient utilization of space in processing the alignment results.

Multiple sequence comparison: Multiple alignments of protein sequences are important tools in studying sequences. The basic information they provide is the identification of conserved sequence regions. Aligning several sequences provides insight into molecular evolutionary analysis for constructing phylogenetic trees, characterizing protein families, identifying shared regions of homology, and determining the consensus of aligned sequences. It also helps in predicting the secondary and tertiary structures of protein sequences. Bains suggested an iterative method which involves successive applications of the standard algorithms. It begins with a trial consensus alignment (say, the alignment between sequences 1 and 2). Then the third sequence is aligned against the consensus sequence and a new consensus emerges. This continues until the consensus alignment converges to a global consensus. This type of method is very dependent on the order in which the sequences are introduced. One of the most popular multiple alignment programs begins with all pairwise alignments and is called Clustal (Higgins and Sharp 1988). Two variants currently in use are ClustalW (Thompson, Higgins et al. 1994) and the older ClustalV (Higgins 1994), both of which build a progressive global alignment. Other methods make use of a multi-dimensional dot plot and then look for dots that are common to each group. Still others rely heavily on user input, such as the popular Windows program MACAW (Schuler, Altschul et al. 1991). Others, such as MSA (Gupta, Kececioglu et al. 1995), attempt to provide a near-optimal sum-of-pairs global solution to the multiple alignment. IPSVAC relies on ClustalW to perform multiple sequence alignment.
A set of sequences is taken as input and multiple sequence alignment is performed using ClustalW, which is one of the most accurate programs for multiple sequence alignment. It is a general-purpose multiple sequence alignment program for DNA or proteins, and it produces biologically meaningful multiple sequence alignments of divergent sequences. It calculates the best match for the selected sequences and lines them up so that the identities, similarities and differences can be seen. The program accepts input in seven different formats: NBRF/PIR, EMBL/SWISSPROT, Pearson (Fasta), Clustal (*.aln), GCG/MSF (Pileup), GCG9/RSF and GDE flat file. All the sequences should be in one file, one followed by another. Once the sequences are aligned, IPSVAC can display the output on screen and color-code the alignment for better display and understanding. It also gives the user the option to display the single nucleotide alterations at each position in the set of aligned sequences. The details are discussed later, under Applications for Visual Analysis, as an application for SNP search.

Phylogenetic tree analysis: Phylogenetics is the area of research concerned with finding the genetic connections and relationships between species. The term phylogeny refers to these relationships, usually presented as a phylogenetic tree. There are three kinds of methods for constructing phylogenetic trees: distance matrix, maximum likelihood and parsimony. Distance matrix methods first estimate the pairwise distances between the sequences, while the other methods construct many trees from all the information in the multiple alignment and decide which is best. The simplest distance-based method is the unweighted pair-group method using arithmetic averages (UPGMA). Ideally a researcher would like to have a black box in which to throw sequences and get out a fully annotated phylogenetic tree.
This is, however, not possible for two reasons. First, an algorithm that considers all possible multiple sequence alignments and then, for each alignment, all possible phylogenetic trees, and picks out the best one, would take too much time. That is why most phylogenetic programs work on previously aligned sequences. Second, the result is always strongly influenced by the criteria used to define the best tree. Phylogenetic trees take several forms: they can be rooted or unrooted, binary or general, and may or may not show edge lengths. TreeView (Page 1996), Phylip (Retief 2000), MEGA (Kumar, Tamura et al. 2001) and TreeFinder are tools for computing or displaying phylogenetic trees. IPSVAC provides a graphical interface for constructing phylogenetic trees using the PHYLIP (Retief 2000) software. PHYLIP is one of the most popular tools for generating evolutionary trees and comes as a package of programs for inferring phylogenies. It supports the tree generation methods discussed earlier: distance matrix, maximum likelihood and parsimony. IPSVAC generates phylogenetic trees by relying on this software package, and an interface is provided to support a wide range of the methods offered by the PHYLIP suite. A set of sequences is taken as input and, as a first step in preparing the PHYLIP input format, the sequences are aligned using ClustalW. The output from ClustalW is then used as input to the PHYLIP program, and the method chosen by the user is invoked to generate the corresponding phylogenetic tree.

(C) Applications for Visual Analysis
The Viewers provide two options: a feature viewer, which graphically renders the features annotated on a DNA or protein sequence, and a sequence viewer, which color-codes the nucleotide bases or amino acid residues. Numerous databases have been constructed to store these sequence regions and derived features, and their associated functions.
Common features of DNA sequences include introns, exons, 3' or 5' untranslated regions, transcription start sites, cis-elements and other protein binding sites, repeats, low complexity regions and single nucleotide polymorphisms (SNPs). Protein sequence features include secondary structures (α-helices and β-strands), transmembrane regions, and post-translational modifications such as phosphorylation and glycosylation sites. Because a single sequence can have a large number of associated features, it is extremely difficult for a text record to reveal all the salient features in an intuitive fashion. THEATRE (Edwards, Carver et al. 2003) is an attempt to combine the features produced by widely used sequence analysis tools or databases. SeqVISTA is another graphical tool that facilitates the visualization of sequence features. The feature viewer of IPSVAC displays the feature information associated with a sequence, such as CDS regions, exons, introns, poly-A signal sites, promoter regions and repeat regions. For each feature annotated on the sequence, a thick colored line is displayed to show its span and location. The total sequence is scaled to fit the overall image, so the length and position of each line represent the number of bases the feature spans and its location in the sequence. A different color is selected for each type of feature. This provides a bird's eye view of all the important features in a sequence (Figure 4). In future versions we would like to add the ability to zoom into each feature, so that the user can view the details associated with a particular feature more clearly, and to include a module for the automated annotation of new, non-annotated sequences.
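The scaling step of the feature viewer, in which the whole sequence is mapped onto the width of the image and each feature becomes a colored line, can be sketched as below. The image width, the color palette and the function names are assumptions for illustration only; the actual rendering code of IPSVAC is not described in this report.

```python
# Hypothetical palette; one color is assigned per feature type in order of
# first appearance.
PALETTE = ["red", "blue", "green", "orange", "purple"]


def assign_colors(features):
    """Map each feature type to a color, cycling through PALETTE."""
    colors = {}
    for name, _start, _end in features:
        if name not in colors:
            colors[name] = PALETTE[len(colors) % len(PALETTE)]
    return colors


def scale_features(features, sequence_length, image_width=800):
    """Convert (name, start, end) features with 1-based inclusive base
    positions into (name, x_start, x_end) pixel spans.

    Every feature is drawn at least one pixel wide so that short features
    remain visible on long sequences.
    """
    scale = image_width / sequence_length
    scaled = []
    for name, start, end in features:
        x_start = int((start - 1) * scale)
        x_end = max(x_start + 1, int(end * scale))
        scaled.append((name, x_start, x_end))
    return scaled
```

For a 1000-base sequence drawn 500 pixels wide, an exon at bases 1-100 maps to pixels 0-50 and a CDS at bases 101-400 maps to pixels 50-200.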
The sequence viewer module in IPSVAC takes one or more sequences and improves the understanding of their important features, including alignment features, by providing convenient visual images of the sequences color-coded according to user selections. For nucleotide sequences, the options provided allow the user to highlight individual bases or sets of bases such as purines/pyrimidines, AT/GC, or start/stop codons in different colors. For protein sequences, different color codes are assigned to represent different properties of the residues, such as hydrophobicity and polarity. Each of these options, when selected, displays all residues with that particular property in a given color, and all others in white (Figure 5).

Figure 4: The output of the feature viewer for a DNA sequence with accession id AE009723

A special feature of this viewer module is its integration with sub-modules and its interfacing with external servers for pre- and post-processing of sequences. A single-point input and the click of a button are all that is required from the user to visualize important derived features rendered in useful color-coded formats. For example, the user can input up to two sequences, retrieved into the user's account, in one go to get all six reading frames for both of them with start/stop codons highlighted in color. For each individual reading frame, the user also has the option of viewing the corresponding translated amino acid sequence with individual amino acids colored according to user-selected attributes. For any given pair of sequences, in addition to the options mentioned above, the user can choose to view pair-wise alignment features in color-coded format. This involves automated interfacing with Align::NW, a Bioperl module, for the required pair-wise alignment using the Needleman and Wunsch algorithm.
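The six-reading-frame view with highlighted start/stop codons mentioned above can be sketched as follows. The standard start codon ATG and stop codons TAA, TAG and TGA are used; the frame labels and the bracket markers that stand in for colors are assumptions for illustration.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")
START_CODONS = {"ATG"}
STOP_CODONS = {"TAA", "TAG", "TGA"}


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A, C, G, T only)."""
    return sequence.upper().translate(COMPLEMENT)[::-1]


def codons(sequence, offset):
    """Split a sequence into complete codons starting at `offset`."""
    return [sequence[i:i + 3] for i in range(offset, len(sequence) - 2, 3)]


def six_frames(sequence):
    """Return the codon lists of the three forward frames (+1, +2, +3)
    and the three reverse-complement frames (-1, -2, -3)."""
    forward = sequence.upper()
    frames = {}
    for strand, strand_seq in (("+", forward), ("-", reverse_complement(forward))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = codons(strand_seq, offset)
    return frames


def mark(codon):
    """Text stand-in for color: [ATG] for start codons, <TAA> etc. for stops."""
    if codon in START_CODONS:
        return f"[{codon}]"
    if codon in STOP_CODONS:
        return f"<{codon}>"
    return codon
```

Joining `mark(c)` over the codons of a frame produces a plain-text version of the highlighted display; a graphical viewer would map the same two codon classes to colors instead.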
For multiple sequences, IPSVAC interfaces with ClustalW and takes multiple files in FASTA format as input. The ClustalW output is parsed to remove the exact matches, also referred to as blocks, and is processed to find the altered nucleotide base positions across the length of the sequences and render them in the SUBINDEL visualizer. This feature is designed to find substitutions, insertions and deletions of nucleotide bases at each position in the alignment. It may be mentioned here that polymorphisms in nucleotide sequences are being used extensively as molecular markers in human genome analysis as well as in crop breeding programs. Compared to other types of polymorphisms, the detection and discovery of SNPs have evoked greater interest because of their abundance, evolutionary stability and usefulness for the genetic dissection of complex traits (Wu, Di Rienzo et al. 2001) and diseases (Simmons 2001). They are an invaluable tool for genome mapping, offering the potential for generating very high-density genetic maps, which can be used to develop haplotyping systems for genes or regions of interest. Unlike RFLPs (restriction fragment length polymorphisms), SNPs are direct markers providing the exact nature of allelic variants. They are far more prevalent than SSRs (simple sequence repeats) and hence provide a high density of markers near a locus of interest. Several SNP databases have been developed, such as dbSNP (Sherry, Ward et al. 2001) and HGVbase (Fredman, Siegfried et al. 2002). Efforts are also under way to develop in silico tools for detecting SNPs. However, to our knowledge most such tools have to be used in conjunction with laboratory experiments, and their results have to be validated in the laboratory in any case. The convenient visual rendering of MSA output provided by the SUBINDEL visualizer may be useful for, among other things, a preliminary short-listing of potential single nucleotide polymorphisms (SNPs).
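The column-by-column processing described for the SUBINDEL visualizer can be illustrated with a small sketch that takes already aligned, gapped sequences (for example, rows parsed from a ClustalW alignment), drops the fully conserved columns and labels the rest. Treating the first sequence as the reference and the three labels used here are simplifying assumptions, not the documented behavior of SUBINDEL.

```python
def classify_columns(aligned):
    """Classify the variable columns of a multiple alignment.

    `aligned` is a list of equal-length gapped strings; the first one is
    treated as the reference. Returns (position, column, kind) tuples with
    1-based alignment positions, where kind is one of:
      "substitution" - no gaps, but the bases differ
      "insertion"    - the reference has a gap at this column
      "deletion"     - the reference has a base, some other sequence a gap
    Fully conserved columns (the exact-match blocks) are skipped.
    """
    length = len(aligned[0])
    if any(len(row) != length for row in aligned):
        raise ValueError("all aligned sequences must have the same length")

    variants = []
    for pos in range(length):
        column = [row[pos].upper() for row in aligned]
        symbols = set(column)
        if len(symbols) == 1:
            continue  # conserved column, part of an exact-match block
        if "-" in symbols:
            kind = "insertion" if column[0] == "-" else "deletion"
        else:
            kind = "substitution"
        variants.append((pos + 1, "".join(column), kind))
    return variants
```

Columns labeled as substitutions are the candidates for the preliminary SNP short-list mentioned above. A column that mixes a gap with a substitution is reported only as an insertion or deletion in this simplified version.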
Figure 5: The output of the sequence viewer for a protein sequence

4. Storage Module
All sequences retrieved by the user are stored in a separate, dynamically allotted area. They automatically appear in a drop-down list in the GUI and can easily be given as inputs to any of the modules. The storage module also stores the summary results and operational details in a separate page maintained for each registered user. Provision has been made for this summary to regenerate the detailed results automatically when given as input to the stand-alone version of IPSVAC. A ready-to-use downloadable stand-alone version of IPSVAC is to be placed on the web site very soon.

Conclusions
We have developed a tool which provides a number of facilities for the analysis of biological sequences on a single platform, starting with uploading a sequence and extending to pair-wise alignment, database search, multiple alignment and phylogeny analysis. The work described in this paper outlines our approach, involving the integration of modular components, towards addressing an urgent need of biologists. This preliminary result leaves room for improvement in terms of increasing the scope of application areas. Interfaces with, for example, tools involving new algorithms for the analysis of large-scale gene expression data are planned for the future.

Acknowledgements
Sri Jyothsna Yeleswarapu has contributed towards the color-coding modules. Ramsateesh Talari has contributed towards development of the web interface. The authors wish to thank Dr. Jayashree Balaji from ICRISAT for initial guidance on different modules, Dr. Nita Parekh for providing helpful suggestions for refinement and improvements in tool design and Dr. C. K. Mitra for reviewing the write-up. One of the authors (Naveena V. K. Y.) is currently at CMU under a research exchange program with IIIT Hyderabad and is thankful to Dr. Raj Reddy and Dr.
Judith Klein-Seetharaman for encouraging her to write up this work for publication.

References

Altschul, S. F., W. Gish, et al. (1990). "Basic local alignment search tool." J Mol Biol 215(3): 403-10.
Bedell, J. A., I. Korf, et al. (2000). "MaskerAid: a performance enhancement to RepeatMasker." Bioinformatics 16(11): 1040-1.
Benson, G. (1999). "Tandem repeats finder: a program to analyze DNA sequences." Nucleic Acids Res 27(2): 573-80.
Blobel, G. (1999). "Proteins have intrinsic signals that govern their transport and localization in the cell." http://www.nobel.se/medicine/laureates/1999/presentation-speech.html
Cagnon, C. http://www.univpau.fr/RECHERCHE/LEM/LEM/Lemenglish/LEMresearch/Biology/Study1.htm
Edwards, Y. J., T. J. Carver, et al. (2003). "Theatre: A software tool for detailed comparative analysis and visualization of genomic sequence." Nucleic Acids Res 31(13): 3510-7.
Fredman, D., M. Siegfried, et al. (2002). "HGVbase: a human sequence variation database emphasizing data quality and a broad spectrum of data sources." Nucleic Acids Res 30(1): 387-91.
Gotoh, O. (1982). "An improved algorithm for matching biological sequences." J Mol Biol 162(3): 705-8.
Gupta, S. K., J. D. Kececioglu, et al. (1995). "Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment." J Comput Biol 2(3): 459-72.
Henikoff, S. and J. G. Henikoff (1992). "Amino acid substitution matrices from protein blocks." Proc Natl Acad Sci U S A 89(22): 10915-9.
Higgins, D. G. (1994). "CLUSTAL V: multiple alignment of DNA and protein sequences." Methods Mol Biol 25: 307-18.
Higgins, D. G. and P. M. Sharp (1988). "CLUSTAL: a package for performing multiple sequence alignment on a microcomputer." Gene 73(1): 237-44.
Hirschberg, D. S. (1975). "A Linear Space Algorithm for computing Maximal Common Subsequences." Commun ACM 18(6): 341-343.
Jenuth, J. P. (2000). "The NCBI. Publicly available tools and resources on the Web." Methods Mol Biol 132: 301-12.
Kumar, S., K. Tamura, et al. (2001). "MEGA2: molecular evolutionary genetics analysis software." Bioinformatics 17(12): 1244-5.
Lopez, R., V. Silventoinen, et al. (2003). "WU-Blast2 server at the European Bioinformatics Institute." Nucleic Acids Res 31(13): 3795-8.
Mariat, D., B. De Gouyon, et al. (1993). "Genetic mapping through the use of synthetic tandem repeats in the mouse genome." Mamm Genome 4(3): 135-40.
Menne, K. M., H. Hermjakob, et al. (2000). "A comparison of signal sequence prediction methods using a test set of signal peptides." Bioinformatics 16(8): 741-2.
Myers, E. W. and W. Miller (1988). "Optimal alignments in linear space." Comput Appl Biosci 4(1): 11-7.
Needleman, S. B. and C. D. Wunsch (1970). "A general method applicable to the search for similarities in the amino acid sequence of two proteins." J Mol Biol 48(3): 443-53.
Nielsen, H., J. Engelbrecht, et al. (1997). "A neural network method for identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites." Int J Neural Syst 8(5-6): 581-99.
Nielsen, H. and A. Krogh (1998). "Prediction of signal peptides and signal anchors by a hidden Markov model." Proc Int Conf Intell Syst Mol Biol 6: 122-30.
Page, R. D. (1996). "TreeView: an application to display phylogenetic trees on personal computers." Comput Appl Biosci 12(4): 357-8.
Parisi, V., V. De Fonzo, et al. (2003). "STRING: finding tandem repeats in DNA sequences." Bioinformatics 19(14): 1733-8.
Pearson, W. R. (1990). "Rapid and sensitive sequence comparison with FASTP and FASTA." Methods Enzymol 183: 63-98.
Rastogi, P. A. (2000). "MacVector. Integrated sequence analysis for the Macintosh." Methods Mol Biol 132: 47-69.
Retief, J. D. (2000). "Phylogenetic analysis using PHYLIP." Methods Mol Biol 132: 243-58.
Rice, P., I. Longden, et al. (2000). "EMBOSS: the European Molecular Biology Open Software Suite." Trends Genet 16(6): 276-7.
Schuler, G. D., S. F. Altschul, et al. (1991). "A workbench for multiple alignment construction and analysis." Proteins 9(3): 180-90.
Sherry, S. T., M. H. Ward, et al. (2001). "dbSNP: the NCBI database of genetic variation." Nucleic Acids Res 29(1): 308-11.
Smith, B. J. (1994). "Chemical cleavage of proteins." Methods Mol Biol 32: 297-309.
Thompson, J. D., D. G. Higgins, et al. (1994). "CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice." Nucleic Acids Res 22(22): 4673-80.
Waterman, M. S. (1984). "Efficient sequence alignment algorithms." J Theor Biol 108(3): 333-7.
Wilbur, W. J. (1985). "On the PAM matrix model of protein evolution." Mol Biol Evol 2(5): 434-47.
Womble, D. D. (2000). "GCG: The Wisconsin Package of sequence analysis programs." Methods Mol Biol 132: 3-22.
Zischler, H., I. Nanda, et al. (1989). "Digoxigenated oligonucleotide probes specific for simple repeats in DNA fingerprinting and hybridization in situ." Hum Genet 82(3): 227-33.
