Download Bosque, a software system for phylogenetic analysis

Bosque: Software system for phylogenetic analysis

Salvador Ramírez Flandes
Laboratorio PROFC, Departamento de Oceanografía
Facultad de Ciencias Naturales y Oceanográficas
Universidad de Concepción

Table of Contents

QUICK START TO BOSQUE
TREE PROJECTS AND TREE WINDOWS
    THE SEQUENCE TAB
    THE ALIGNMENT TAB
    THE TREE TAB
DATA MANAGEMENT IN BOSQUE
    THE LOCAL DATABASE
    IMPORTING DATA INTO BOSQUE
        From a local file
        From NCBI's Entrez-Genbank
        From a Blast query
COMPUTING ALIGNMENTS AND TREES
USEFUL TOOLS OF BOSQUE
    THE SEQUENCE EDITOR
    THE ALIGNMENT EDITOR
    THE TREE EDITOR
    THE SEQUENCES WINDOW
NETWORKING OPTIONS OF BOSQUE
    REMOTE EXECUTION OF JOBS
    PUBLIC CHANNEL
    SHARING RESOURCES
THE BOSQUE SERVER
    INSTALLATION OF THE BOSQUE SERVER
APPENDIX
    SEQUENCE FILE FORMATS
        Fasta file format
        Genbank file format
INDEX

Quick start to Bosque

1. Download Bosque from http://bosque.udec.cl/downloads/BosqueSetup.exe
2. Install the software on your computer.
3. Run Bosque. The first time, you will need to specify where to store the file containing the local database.
4. Create a Project by giving it any descriptive name.
5. Create a Tree Project within the Project you just created (step 4). To do so, click the button on the right labeled “New Tree”. This creates a Tree Project and presents it in a Tree Window.
6. Add sequences to your Tree Project. In the Sequences Tab of your Tree Window (created in step 5) there is a button labeled “Add Seqs”. Press this button and the Sequences Window will appear. Here you can import sequences from different sources. Then accept, and the sequences will be added to the Tree Window.
7. Switch to the “Alignment” tab of the Tree Window. Here you will see all your sequences unaligned. Press the “Align Sequences” button at the bottom (if you move the mouse pointer over the buttons, each one shows a label; look for the one that displays “align sequences”).
8. After the sequences are aligned, press the “construct tree” button in the upper toolbar. Its icon looks like a phylogenetic tree.
9. You now have a tree in your Tree Window. You can construct more trees in this same Tree Window by using other methods, or you can create another Tree Project by pressing the “New Tree” button, and so on.

Tree Projects and Tree Windows

Current molecular phylogenetic analyses use nucleotide and amino-acid sequence data to infer the phylogeny of organisms. These analyses therefore begin by assembling a set of sequences of interest and then aligning them. The alignment is the input data for the different phylogenetic methods that ultimately produce a phylogenetic tree. With this basic pipeline in mind, Bosque defines a Tree Project as a set consisting of:

1. a set of sequences,
2. an alignment of these sequences, and
3. a set of trees derived from this alignment.

Since there is no perfect tree-reconstruction method, it is common to use different techniques with different models to produce multiple trees, which can then be analyzed or merged into a so-called consensus tree.

Bosque stores these Tree Projects in a local relational database (implemented on SQLite [1]) whose format is transparent to the user. By selecting and loading a Tree Project by its name, the user automatically loads all the sequences, the alignment and the trees of that Tree Project.

In Bosque, Tree Projects are manipulated in a special window called the Tree Window. The Tree Window expresses the Tree Project concept graphically, and so it is divided into tabs: one for the list of sequences, another for the alignment, and one or more tabs for the trees. Bosque can handle multiple Tree Windows at a time, in what is technically called an MDI, or Multiple Document Interface [2].
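The Tree Project concept described above (a set of sequences, one alignment of them, and any number of trees) can be pictured as a small data structure. The following is only an illustrative sketch: the class and field names are hypothetical, and storing trees as Newick strings is an assumption. It does not reflect how Bosque actually stores its data.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Sequence:
    """One sequence record (hypothetical fields)."""
    accession: str
    definition: str
    residues: str  # DNA bases or amino-acid letters


@dataclass
class TreeProject:
    """A named set of sequences, their alignment, and trees built from it."""
    name: str
    sequences: List[Sequence] = field(default_factory=list)
    # Maps accession -> gapped row of the alignment; None until aligned.
    alignment: Optional[Dict[str, str]] = None
    # Trees kept as Newick strings (an assumption made for this sketch).
    trees: List[str] = field(default_factory=list)

    def is_aligned(self) -> bool:
        """True if every sequence has an alignment row and all rows have equal length."""
        if not self.alignment:
            return False
        if set(self.alignment) != {s.accession for s in self.sequences}:
            return False
        return len({len(row) for row in self.alignment.values()}) == 1

    def add_tree(self, newick: str) -> None:
        """Attach a tree; trees only make sense once an alignment exists."""
        if not self.is_aligned():
            raise ValueError("align the sequences before adding trees")
        self.trees.append(newick)
```

The `add_tree` check mirrors the pipeline order in the text: sequences first, then an alignment, then trees.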
The Tree Window has a toolbar at the top with icon buttons for common operations on the Tree Project. These operations include: save the Tree Project to the local database, export data (which data depends on the tab that is active when the operation is requested), search for a particular sequence, configure special options of the Tree Project, reconstruct a tree from the alignment (if an alignment already exists), print the tree, and close the Tree Project. The following sections review each tab of the Tree Window in detail.

[1] SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine. http://www.sqlite.org
[2] See http://en.wikipedia.org/wiki/Multiple_document_interface for more information about MDI.

The sequence Tab

Figure 1 shows a screenshot of a Tree Window displaying the sequence tab. It is composed, vertically, of three parts: the toolbar at the top (common to all the tabs of the Tree Window), the sequences table, and a toolbar at the bottom for operations applicable only to sequences.

Figure 1. Tree Window displaying the Sequences Tab

The sequences table shows information about the sequences, and the columns it displays are configurable. To change the default columns (accession number of the sequence, definition, size in base pairs and organism name), right-click the header of the table and a popup menu with options will appear.

The toolbar at the bottom consists of buttons for adding more sequences to the Tree Project, for editing a particular sequence, for removing sequences from the Tree Project, and for exporting the sequences to foreign formats such as fasta or genbank [3]. Remember that at this point all the sequences are stored in the local database, so they have already been imported through the Sequences Window, which is reviewed later in this tutorial.
[3] See the appendix on sequence file formats.

The alignment tab

After we have collected the sequences for the analysis we need to align them, so that only homologous sites (in theory) are compared by the tree-reconstruction methods in the next stage. When we select the alignment tab for the first time after collecting the sequences, they are shown unaligned. We can then use the “align button” at the bottom (indicated with the number 4 in Figure 2) to carry out the alignment. This opens a window for running an external, widely used program called Muscle.

Figure 2. The Alignment tab of a Tree Window showing a set of aligned sequences

After the alignment is done, the sequences appear arranged as Muscle dictates, given the data provided and the options selected in the Muscle Window [4]. The sequences are presented in a special table called the Alignment Editor. This editor shows the bases of the alignment in cells that can be edited by double-clicking them, or by selecting regions and cutting them with the scissors button marked as 5 in Figure 2. Please refer to the section “The Alignment Editor” for further information.

[4] For details of how Muscle performs alignments, please refer to http://www.drive5.com/muscle/.

The Tree Tab

Once we are satisfied with the alignment we are ready for the tree reconstruction, which is started by pressing the “tree button” on the toolbar at the top of the Tree Window.

Figure 3. Tree on a Tree Window

There are numerous tree-reconstruction methods, and Bosque covers some of them by using well-known command-line phylogenetic programs. For now, Bosque integrates the programs of the Phylip package and the Tree-Puzzle program, which implements maximum likelihood by the quartet puzzling method [5].
Please note that if the program is connected to a Bosque server, additional options appear in the popup menu of the “tree button” on the Tree Window's toolbar. These options allow the programs to be executed remotely on the server. This feature is particularly useful when analyzing a dataset with many sequences, where the analysis may take a long time.

Finally, the tree is not only displayed for visualization; it can also be edited with the mouse. For example, right-clicking a particular tip of the tree opens a context popup menu. It is also possible to move tips along the tree, change the appearance of the text, expand or shrink a whole tip, rename a particular tip, and so on.

[5] For more information about these programs, please see the webpage of each package. The webpage for Phylip is http://evolution.genetics.washington.edu/phylip.html. The webpage for Tree-Puzzle is http://www.tree-puzzle.de.

Data management in Bosque

All data in Bosque is stored in a local relational database, using the SQLite library. By local we mean that this database is located (as a file) on the same computer where Bosque runs. The main advantage, for this application, of having a local database to store everything is that the user avoids the complications of:

1. Manipulating computer files in different formats.
2. Organizing these files across the different phylogenetic projects that a particular user may be carrying out.

Recognized or not, many phylogenetic analyses rely on trial and error, so one normally needs to compose several sets of sequences, analyze them, add more sequences to those sets, analyze them again, remove some, repeat the analysis, and so on. If a user manages multiple projects, it is easy to imagine that the number of different sequence files grows rapidly.
The names of the files also need to be very informative, so that just by looking at the list of files we know what they contain. To avoid all of this effort, Bosque implements all data management in a single database, organized in tables, so the user can manipulate elements at a “phylogenetic level”, such as sequences, trees and alignments, and not at a multipurpose level, such as plain computer files. Plain files would have to be complemented with information from the phylogenetic level by giving them special names, using special file extensions, organizing them in special directories, and so on.

The local database

This database is composed of different tables that store the different elements used, such as sequences, trees, jobs and servers. To see where this database is located in the filesystem, select the option “database properties” from the “Database” menu of the main window of Bosque. This window also displays the size (in megabytes) of the file containing the database and the number of projects, trees and sequences.

As already mentioned, this database is a SQLite database, and SQLite is an open source relational database implementation. This means that the tables of this database (and so their actual data) can also be manipulated by programs other than Bosque. This feature is important because the user does not have to rely on Bosque alone for access to his data. In case of trouble with Bosque, the user can always use SQLite programs (downloadable from the SQLite webpage: http://www.sqlite.org) to access the data in the tables of the Bosque database. Of course this practice is not recommended and should only be used when there is no other way to rescue the desired data at a particular moment.
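As an illustration of the rescue scenario described above, the following minimal sketch lists the tables in a SQLite file and counts their rows using Python's standard `sqlite3` module. It opens the file read-only, so it cannot modify the data. The sketch assumes nothing about Bosque's actual table names, which this manual does not document, and the file name in the usage note is hypothetical.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path):
    """Return {table_name: row_count} for a SQLite file, opened read-only."""
    # A file: URI with mode=ro makes SQLite refuse any write to the file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    con = sqlite3.connect(uri, uri=True)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        names = [row[0] for row in con.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            # Quote the identifier so unusual table names are handled safely.
            quoted = '"' + name.replace('"', '""') + '"'
            counts[name] = con.execute(
                "SELECT COUNT(*) FROM " + quoted).fetchone()[0]
        return counts
    finally:
        con.close()
```

For example, `list_tables("bosque.db")` (a hypothetical file name; the real location is shown under “database properties”) would return a dictionary of table names and row counts. It is safest to work on a copy of the database file.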
If the data in the database becomes corrupted in some way (by external manipulation of the database, for example), Bosque may behave unpredictably.

Importing data into Bosque

To import data into Bosque we use the Sequences Window, shown in Figure 4.

Figure 4. The Sequences Window

The Sequences Window is invoked by pressing the “sequences” button on the toolbar of the main window. This window displays sequences by searching the local database. If no criteria are specified in the boxes, all the sequences are listed in the table. This window is covered in more detail later in this tutorial. For now we only review the importing options at the bottom of the window. There are three ways to import sequence data into Bosque:

1. from a fasta or genbank file
2. from a query on Genbank Entrez
3. from a query on Blast servers

From a local file

Typically, the data from sequencing projects include files in Fasta format with the sequences. To enter these data into Bosque, select the first option (import from local file) and choose the file containing the sequence data. This option also allows importing sequences in the genbank format. Both formats (fasta and genbank) can contain either a single sequence or multiple sequences within one file. Files in fasta format normally have the extension .fas or .fasta. Files in the genbank format typically use the extension .gb. Please refer to the appendix for a brief description of the fasta and genbank formats.

From NCBI's Entrez-Genbank

The second option is Entrez, an integrated, text-based search and retrieval system used at NCBI for the major databases, including PubMed, Nucleotide and Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and others.
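Entrez searches can also be expressed as plain HTTP requests through NCBI's public E-utilities service. The sketch below only builds such a request URL from a database name, a query text and a maximum number of results; it does not contact NCBI. It is an illustration of what an Entrez query looks like, not a description of how Bosque itself communicates with NCBI, which this manual does not specify. The function name and the example query are hypothetical.

```python
from urllib.parse import urlencode

# ESearch endpoint of NCBI's E-utilities.
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database, query, max_results=20):
    """Build an ESearch URL for the given Entrez database and query text."""
    if max_results < 1:
        raise ValueError("max_results must be at least 1")
    params = {
        "db": database,         # Entrez database to search, e.g. "nucleotide"
        "term": query,          # query text, in Entrez search syntax
        "retmax": max_results,  # maximum number of identifiers to return
    }
    return EUTILS_ESEARCH + "?" + urlencode(params)


if __name__ == "__main__":
    print(build_esearch_url("nucleotide", "Escherichia coli[Organism]", 5))
```

Keeping URL construction separate from the network call makes the query easy to inspect before anything is sent to the server.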
Bosque implements a tool window for specifying a search on the NCBI servers and bringing the resulting sequences into the local database. Figure 5 shows this window.

Figure 5. Genbank Entrez tool window

In this window it is possible to indicate which database will be searched, the number of sequences to retrieve, and a query text. Pressing the “advanced search” button makes it possible to specify more search fields, so users with more experience can compose complex queries. Once the search text is entered, press the “search” button and Bosque will connect to the NCBI servers to run the query and bring back the results.

Figure 6. Entrez window with the sequences obtained on a search

If the search is successful, a list of sequences is shown on the other tab of the window (Figure 6). Here we can select sequences and import them with the import button. An important feature is that large sequences composed of many CDS (coding sequences) appear in the table in a different color so that the user notices them. Since we are normally interested in just one of the CDS, we would like to browse the CDS of such a sequence and select the particular one we are interested in. This can be done by double-clicking the sequence. A window like the one in Figure 7 appears.

Figure 7. Exploring the CDS of a particular complex sequence

Since the list can be long, it may take a couple of minutes to download, depending on the available internet connection. Another important point about this tool is that not all large sequences (complete genomes, complete chromosomes, etc.) are annotated and mapped in the “Gene” database at NCBI.
This means that it may not be possible to browse every large sequence that appears in this table.

From a Blast query

The third method to import data into Bosque is through a Blast query against NCBI's Blast servers. Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing primary biological sequence information, such as the amino-acid sequences of different proteins or the nucleotides of DNA sequences. A BLAST search enables a researcher to compare a query sequence with a library or database of sequences, and identify library sequences that resemble the query sequence above a certain threshold [6].

[6] Definition obtained from http://en.wikipedia.org/wiki/BLAST

Figure 8 shows the Bosque window for importing sequences from a Blast query.

Figure 8. Window to import sequences from a Blast query

To run a BLAST query one needs to supply the query sequence in the “Sequence Data” box. There are two ways of getting a sequence into that box. The first is to select an external file (in Fasta format) containing sequences, which are then listed in the sequences table. When a particular sequence is selected, its data is automatically displayed in the “Sequence Data” box and we are ready to run the BLAST. The other method is simply to copy and paste the sequence into that box.

Selecting the blast program

The following information was extracted from http://www.ncbi.nlm.nih.gov/Education/blasttutorial.html

Bosque allows the selection of the different blast programs, which are listed in the table below.

Program    Description
blastp     Compares an amino acid query sequence against a protein sequence database.
blastn     Compares a nucleotide query sequence against a nucleotide sequence database.
blastx     Compares a nucleotide query sequence translated in all reading frames against a protein sequence database. You could use this option to find potential translation products of an unknown nucleotide sequence.
tblastn    Compares a protein query sequence against a nucleotide sequence database dynamically translated in all reading frames.
tblastx    Compares the six-frame translations of a nucleotide query sequence against the six-frame translations of a nucleotide sequence database. Please note that the tblastx program cannot be used with the nr database on the BLAST Web page because it is computationally intensive.

Selecting the blast database

It is possible to select among several NCBI databases to compare the query sequences against. Note that some databases are specific to proteins or nucleotides and cannot be used in combination with certain BLAST programs (for example, a blastn search against swissprot).

Proteins

Database     Description
nr           All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month        All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the last 30 days.
swissprot    The last major release of the SWISS-PROT protein sequence database (no updates). These are uploaded to our system when they are received from EMBL.
patents      Protein sequences derived from the Patent division of GenBank.
yeast        Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be confused with a listing of all Yeast protein sequences. It is a database of the protein translations of the Yeast complete genome.
E. coli      E. coli (Escherichia coli) genomic CDS translations.
pdb          Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
kabat        Kabat's database of sequences of immunological interest. For more information [kabatpro] http://immuno.bme.nwu.edu/
alu          Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
Nucleotides

Database      Description
nr            All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS, or HTGS sequences).
month         All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30 days.
dbest         Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.
dbsts         Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.
mouse ests    The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism mouse.
human ests    The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the organism human.
other ests    The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms except mouse and human.
yeast         Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of all Yeast nucleotide sequences, but the sequence fragments from the Yeast complete genome.
E. coli       E. coli (Escherichia coli) genomic nucleotide sequences.
pdb           Sequences derived from the 3-dimensional structure of proteins.
kabat         Kabat's database of sequences of immunological interest. For more information [kabatnuc] http://immuno.bme.nwu.edu/
patents       Nucleotide sequences derived from the Patent division of GenBank.
vector        Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito          Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu           Select Alu repeats from REPBASE, suitable for masking Alu repeats from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd           Eukaryotic Promoter Database, ISREC in Epalinges s/Lausanne (Switzerland).
gss           Genome Survey Sequence, includes single-pass genomic data, exon-trapped sequences, and Alu PCR sequences.
htgs          High Throughput Genomic Sequences.
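The rule that a BLAST program must be paired with a database of the right kind can be written down as a small lookup. The sketch below is only an illustration built from the tables in this section: the program descriptions give the kind of database each program searches, and the database names are those listed under Proteins and Nucleotides. The function and variable names are hypothetical, and this is not Bosque's own validation code.

```python
# Kind of query each program takes and kind of database it searches,
# as stated in the program table of this section.
PROGRAMS = {
    "blastp": ("protein", "protein"),
    "blastn": ("nucleotide", "nucleotide"),
    "blastx": ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}

# Database names as listed in the two database tables of this section.
PROTEIN_DBS = {"nr", "month", "swissprot", "patents", "yeast", "E. coli",
               "pdb", "kabat", "alu"}
NUCLEOTIDE_DBS = {"nr", "month", "dbest", "dbsts", "mouse ests", "human ests",
                  "other ests", "yeast", "E. coli", "pdb", "kabat", "patents",
                  "vector", "mito", "alu", "epd", "gss", "htgs"}


def is_valid_combination(program, database):
    """Return True if the program can search the named database."""
    if program not in PROGRAMS:
        raise ValueError("unknown BLAST program: " + program)
    # The text notes that tblastx cannot be used with nr on the BLAST Web page.
    if program == "tblastx" and database == "nr":
        return False
    _query_kind, db_kind = PROGRAMS[program]
    allowed = PROTEIN_DBS if db_kind == "protein" else NUCLEOTIDE_DBS
    return database in allowed


if __name__ == "__main__":
    # The example from the text: blastn cannot search swissprot.
    print(is_valid_combination("blastn", "swissprot"))
    print(is_valid_combination("blastp", "swissprot"))
```

Names such as nr or yeast appear in both tables, so the kind of database actually searched is determined by the program chosen.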
Blast execution and retrieving of results

After the sequence data has been provided and the options have been set, press the “BLAST” button to send the blast job to NCBI's servers. First check that your internet connection is working. The BLAST server accepts the job and returns an estimated time to completion. Bosque waits for that time and then connects to the BLAST server again to check whether the job has actually finished. If so, Bosque retrieves the results (with their scores and E-values) and shows the sequences found, as in Figure 9.

Figure 9. The result of a blast search

This table is similar to the table of the Entrez window reviewed earlier. Here the user selects the sequences to import and then presses the “import” button. If no sequence is selected, all the sequences are imported. The last column of this table indicates whether each sequence is already present in the local database. Naturally, a sequence that is already in the local database does not need to be imported.

Computing alignments and trees

At this point we already know how to get sequence data into Bosque. Now we will see how to obtain a phylogenetic tree from these sequences. The first step is to create a Tree Project, which is done by pressing the “New Tree” button on the main window of Bosque.

Figure 10. Main window of Bosque

This opens a Tree Window that contains all the data of a Tree Project, i.e. Sequences, Alignment and Trees. For this reason a Tree Window is composed of multiple tabs: a Sequences Tab, an Alignment Tab and zero or more Tree Tabs. All these tabs have already been described in this tutorial. A newly created Tree Window looks like the one in Figure 1, but without any sequences in the table. To add sequences, press the “Add Seqs” button at the bottom left.
This opens the Sequences Window (Figure 4), where sequences can be searched for in the local database or imported from files, Genbank Entrez or BLAST (see the earlier section “Importing data into Bosque” for further details).

Once the sequences are in the Sequences Tab, we can switch to the Alignment Tab (Figure 2) of the Tree Window and press the “Align Sequences” button (identified with label 4 in Figure 2). Currently, the alignment is carried out using Muscle, a widely used command-line program for computing multiple sequence alignments.

Figure 11. Muscle interface of Bosque

Figure 11 shows the window used to invoke Muscle for the sequence alignment. Depending on the number of sequences and their length, multiple sequence alignment can take considerable time. The process can also consume a lot of CPU and memory. When the alignment is finished, the sequences are arranged in the Alignment Editor on the Alignment tab (Figure 2). The user should review this alignment to ensure it is correct for the purposes of the analysis. Remember that an alignment is just a hypothesis about how these sequences evolved in nature. The researcher needs to assist this automatic alignment computation.

Once the user is satisfied with the alignment, it is time to try a tree reconstruction with these data. For this purpose, press the “Tree reconstruction” button on the upper toolbar of the Tree Window.

Figure 12. Construct a Tree out of an Alignment

As Figure 12 shows, two tree-reconstruction methods are provided: Distance methods (carried out by Phylip) and Maximum Likelihood by Quartet Puzzle. (When the user is connected to a server, other methods are provided. See the later section on networking with Bosque.)
The distance methods are less time consuming than the others and so are suitable for a first inspection of the data. Different distances have been defined depending on the kind of data used for the tree reconstruction. Distances for amino-acid sequences are, of course, different from the distances for DNA sequences.

Figure 13. Neighbor Joining for DNA sequences using Phylip

Figure 13 shows the Bosque interface for calculating a phylogenetic tree from DNA sequences using the Neighbor Joining method. Here it is possible to specify the sequence to be used as the outgroup and the type of distance with which to construct the phylogenetic tree. For DNA data the distances are the ones defined in the Phylip package (program dnadist): F84 (default), Kimura 2 parameters and Jukes-Cantor. For amino-acid data the distances are: Jones-Taylor-Thornton matrix, Henikoff/Tillier PMB matrix and Dayhoff PAM matrix. In the “Resampling” tab it is possible to request that the output tree be a consensus tree obtained by resampling the data with a statistical method such as bootstrapping.

Figure 14 shows the window for tree reconstruction using the Tree-Puzzle program.

Figure 14. Maximum Likelihood by Quartet Puzzle algorithm using Tree-Puzzle

In this window we can select the outgroup and other parameters related to the model of evolution for DNA sequences and for amino-acid sequences. For DNA sequences the models of evolution available are Hasegawa et al. (1985) and Tamura-Nei (1993). The user does not need to specify every option. If an option is omitted, its default value is calculated, which is recommended for most users. Finally, in the “Rate Heterogeneity” tab it is possible to specify the rate heterogeneity among the different sites of the sequences, if required.
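To make the idea of a distance between two aligned DNA sequences more concrete, here is a minimal sketch of the simplest distance named in this section, Jukes-Cantor. It uses the textbook formula d = -(3/4) ln(1 - 4p/3), where p is the proportion of aligned sites that differ. The function names are hypothetical, and the treatment of gaps and ambiguous bases (such columns are simply skipped) is an assumption; Phylip's dnadist may handle these cases differently, so this is not a reimplementation of it.

```python
import math


def p_distance(seq_a, seq_b):
    """Proportion of differing sites between two aligned DNA sequences.

    Columns where either sequence has a gap or a non-ACGT symbol are
    skipped (an assumption made for this sketch).
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a in "ACGT" and b in "ACGT":
            compared += 1
            if a != b:
                differences += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    return differences / compared


def jukes_cantor(seq_a, seq_b):
    """Jukes-Cantor distance: d = -(3/4) * ln(1 - 4p/3)."""
    p = p_distance(seq_a, seq_b)
    if p == 0:
        return 0.0
    if p >= 0.75:
        # The correction is undefined once p reaches saturation (3/4).
        return math.inf
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # One difference in eight sites: p = 0.125.
    print(round(jukes_cantor("ACGTACGT", "ACGTACGA"), 4))
```

The corrected distance is always at least as large as p, because it accounts for multiple substitutions that may have occurred at the same site.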
Useful tools of Bosque

The Sequence Editor

Either on the sequence tab of the Tree Window (Figure 1) or in the Sequences Window (Figure 4) it is possible to edit a sequence in order to view (or modify) all the information that Bosque has about it. The amount of information that Bosque holds for a particular sequence depends on how much information the source of the sequence provided when it was imported. For example, if we import a sequence from a FASTA file [7], the sequence will have only a definition and its sequence data (DNA bases or amino acids). On the other hand, if we import the sequence from a BLAST query, it will probably carry more information from the Genbank database.

Figure 15. Editing a sequence: Tab: sequence information

[7] See the appendix for a description of the FASTA format.

Figure 15 shows the Sequence Editor on its sequence information tab. As the figure shows, this sequence has plenty of information obtained from the Genbank database. Apart from the normal sequence fields (those defined for a sequence entry in the Genbank database), Bosque adds three new fields: Custom Name, User tag1 and User tag2. If a sequence has a custom name, this is the name displayed on the tree when the sequence is used to construct a phylogenetic tree. The tags are useful for searching sequences in the Sequences Window (Figure 4).

Bosque also stores the complete Genbank file when possible. It is shown on the second tab of the Sequence Editor, as in Figure 16.

Figure 16. The Sequence Editor displaying the Genbank tab

This is particularly useful if we are interested in the source of the sequence and any associated publications.

Finally, in the lower-right corner of Figure 15 there is an option called “protein coding sequence”.
This option is available only for DNA sequences. Activating it declares that the sequence codes for a protein [8] and enables a third tab in the Sequence Editor. On this tab we can translate the sequence, and later use the translation to align and construct phylogenetic trees from the amino-acid sequences instead of the DNA sequences.

[8] This is not necessarily the case for all DNA sequences. The 16S ribosomal RNA gene, for example, does not code for any protein.

Figure 17. The Sequence Editor displaying the translation tab and set to use a translation from the genbank file

If the sequence has its Genbank information, and this provides translations of the CDS (coding sequence) features, we can select one of the translations for the different CDS that the sequence may include, as shown in Figure 17. When a translation is selected, it is automatically displayed in the corresponding box.

The second method is what Bosque calls a manual translation. It consists of indicating the start codon (with a single click on the sequence) and the translation table; Bosque then translates the codons into amino acids and shows the result in the box at the bottom. This is shown in Figure 18.

Figure 18. The Sequence Editor displaying the translation tab and set to use a manual translation to translate the DNA sequence

A third method, provided in the same window, chooses among the six possible translations of a sequence [9] the one that best aligns to a given template sequence. If we already have an amino-acid sequence, or a DNA sequence with its translation, we can use it as a template for the translation of a DNA sequence. The quality of the alignment is measured as the number of similar residues between the sequences, and Bosque chooses the translation that maximizes this number.

[9] Three on the original sequence and three on the reverse complement sequence.

After a translation has been chosen in this window, we need to accept the changes with the “Accept” button at the bottom of the window. Bosque then asks for confirmation and saves the translation in the local database for future use in phylogenetic trees. DNA sequences with a translation set are displayed in a different color in the Sequences Window (Figure 4).

The Alignment Editor

The Alignment Editor is a table that uses a single cell for every base of each sequence selected for the current Tree Project. Figure 19 shows this editor with a set of aligned sequences.

Figure 19. The Alignment Editor on the alignment tab of a Tree Window

As mentioned in a previous section, the Alignment Editor is placed on the alignment tab of a Tree Window. After the integration of sequences, these are automatically placed in the Alignment Editor without any alignment, so they may appear unordered (the bases at a particular site or column do not match those of most other sequences at the same site). While the sequences are in this state, Bosque does not allow tree-reconstruction operations, to ensure that the user completes this stage. With the button marked “4” in Figure 19 we can align the sequences using a program called Muscle [10]. After the alignment is done, the sequences appear ordered as Muscle dictates, given the data provided and the options selected in the Muscle window.
[10] For details of how Muscle performs alignments, please refer to http://www.drive5.com/muscle/.

The editor shows the bases of the alignment in cells, which can be edited by double-clicking on them or by selecting regions for other editing operations. For example, we can select a rectangular region and cut it with the scissor tool on the button marked 5 in Figure 19. This tool has three options:

• Cut the left end: requires a single-column selection and removes everything from the beginning of the alignment to the selected column.
• Cut in the middle: requires a rectangular selection and removes all the columns from the first to the last selected.
• Cut the right end: requires a single-column selection and removes everything from the selected column to the last column of the alignment.

These operations may be of interest when the sequences have heterogeneous sizes and we want to keep only the columns where all (or most) of the sequences have data rather than gaps.

The button marked 1 reloads the unaligned sequences from the sequence tab into this editor so that they can be aligned again. This is useful when we add more sequences to our analysis later and need to recalculate the alignment. With button 2 we can customize the names of the sequences for their presentation on the tree.

To compare two particular sequences we can select the “comparing” checkbutton and then select two sequences by clicking on their names on the left side of the table. Once the pair is selected, only these two sequences are shown, together with the number of differences between them and their percentage of similarity. Here, a gap weight of zero (the default value) indicates that sites with at least one gap are not considered in the comparison.
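The pairwise comparison with a gap weight of zero can be sketched as follows. This is a minimal illustration, not Bosque's own code: the function name is hypothetical, the similarity is assumed to be the share of identical sites among the sites actually compared, and non-zero gap weights are not modelled because the manual does not specify how they are applied.

```python
GAP = "-"


def compare_pair(seq1, seq2):
    """Compare two aligned sequences, ignoring every site that contains a gap.

    Returns (number_of_differences, percent_similarity).
    """
    if len(seq1) != len(seq2):
        raise ValueError("sequences must be aligned to the same length")
    compared = 0
    differences = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a == GAP or b == GAP:
            continue  # gap weight zero: the site is not considered at all
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        return differences, 0.0
    return differences, 100.0 * (compared - differences) / compared
```

For example, `compare_pair("ACGT-A", "ACCTTA")` compares five gap-free sites, finds one difference and reports 80% similarity.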
Sometimes it is necessary to compare all the sequences at once. For this purpose we can use the similarity table window, which is opened by pressing the button marked 3 in Figure 19 and is shown in Figure 20.

Figure 20. The similarity table from the alignment tab

The similarity table presents all the sequences against each other with their percentage of similarity. From these data it may be useful to group the closest sequences, given a threshold percentage. For this, the option “group by similarity” is offered as a button at the bottom of the window. It asks for a percentage (say 95%) and Bosque then organizes all the sequences into groups (normally called OTUs [11]). An example is shown in Figure 21.

[11] OTU stands for Operational Taxonomic Unit.

Figure 21. The sequences grouped by their percentage of sequence similarity

Basic export operations for these tables are offered at the bottom of each of the windows just reviewed. Prefer exporting the data as ASCII text, either tab-separated (TSV) or comma-separated (CSV), rather than as Excel; CSV files can easily be imported into most spreadsheet programs.

Returning to the Alignment Editor (Figure 19), a few options are available when the user selects sequence names in the alignment (on the left of the table) and right-clicks. Here the user can tag sequences for convenience (normally to make them easier to find in the Sequences Window, for example to add more sequences to a Tree Project), execute a BLAST query using the selected sequence, and remove one or several selected sequences from the alignment (and from the Tree Project or the database if desired).

Finally, it is possible to customize the colors with which the bases are displayed in the Alignment Editor. This option is under “Edit” in the menu of the Main Window.
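As an illustration of what the “group by similarity” option of the similarity table does conceptually, the sketch below groups sequences from an all-against-all similarity matrix and a threshold. The manual does not state which clustering rule Bosque applies, so single linkage is assumed here (two sequences end up in the same group if a chain of pairs at or above the threshold connects them); the function name and the matrix layout are also hypothetical.

```python
def group_by_similarity(names, matrix, threshold):
    """Group sequences whose pairwise similarity (in percent) reaches the threshold.

    names  : list of sequence names
    matrix : square list of lists, matrix[i][j] = percent similarity of names[i], names[j]
    Returns a list of groups, each a list of names (single-linkage assumption).
    """
    parent = list(range(len(names)))  # union-find forest, one node per sequence

    def find(i):
        # Follow parent links to the root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                parent[find(j)] = find(i)  # merge the two groups

    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return list(groups.values())
```

With a stricter rule (for instance, requiring every pair in a group to reach the threshold) the groups can come out smaller, so results from this sketch should not be expected to reproduce Bosque's groups exactly.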
The Tree Editor

The Tree Editor is a graphical tool of a Tree Window that serves to display and modify the appearance of a tree. Since a Tree Project may be composed of several trees, we may have multiple Tree Editors in the same Tree Window.

Figure 22. The Tree Editor on a tab of a Tree Window

As Figure 22 shows, in the Tree Editor the appearance of a phylogenetic tree can be modified through a set of options in a context menu that appears when a tip of the tree is selected and right-clicked. The tips can also be moved along the phylogenetic tree as desired, which is particularly useful for re-rooting a tree graphically. In all these tip rearrangements the branch lengths are conserved, so the distance relations among the sequences are kept.

To adjust the width and height of the tree, the Tree Editor has scale factors that shrink the tree graphics proportionally, either horizontally (branch length scale) or vertically (Y margin scale). Both options appear at the top of the Tree Editor, as can be seen in Figure 22 and Figure 23.

When the properties of a selected tip are edited, a window like the one in Figure 23 appears.

Figure 23. Editing the properties of a tip of a tree

In this figure we can see a whole tip that has been collapsed. This was done by unchecking the option “Expanded” in the properties. We have also given it a particular name, which is displayed inside the box on the tree. We can also change the branch length and the support or significance value of a tip, even though the legitimacy of this practice may be questioned. All the fonts and colors can also be changed in the properties window; this is straightforward, so we omit further explanation.
Finally, to remove a particular tree, use the red button at the upper-right corner of the Tree Editor.

The Sequences Window

Figure 4 shows a Sequences Window displaying a list of DNA sequences. The Sequences Window was already introduced in the section “Importing data into Bosque”, so we will not review its import functions again here.

When we select DNA sequences in this table and activate the context menu (with a right-click on the selection), two options appear. The first is the automatic translation of the selected DNA sequences given a particular amino-acid sequence template. Remember that in the Sequence Editor we can translate a single DNA sequence by specifying that it is a “protein coding” DNA sequence. The option in the Sequences Window is useful when we want to translate multiple sequences, because it avoids a sequence-by-sequence translation that could be time-consuming depending on the number of sequences.

Figure 24. A massive translation of sequences from the Sequences Window using an amino-acid sequence template

Figure 24 shows the window for translating a set of DNA sequences. The first option in this window is the translation table to use. Then we can indicate a particular sequence already present in the local database to serve as a template for the translation. Of course, this template should be homologous to the selected sequences for the translated sequences to make sense. If we do not yet have a homologous sequence in the local database, we can copy and paste the amino-acid template sequence directly into the lower box.
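The template-based translation can be pictured with the following sketch, which translates a DNA sequence in its six reading frames and keeps the frame that shares the most residues with an amino-acid template. It is an illustration under stated assumptions, not Bosque's implementation: all names are hypothetical, only the standard genetic code is included, and the score is a plain position-by-position identity count rather than a real alignment.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered TTT, TTC, TTA, TTG, ...
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(dna):
    """Return the reverse complement of a DNA string."""
    return dna.upper().translate(COMPLEMENT)[::-1]


def translate(dna, offset=0):
    """Translate complete codons starting at offset; unknown codons become 'X'."""
    dna = dna.upper()
    codons = (dna[i:i + 3] for i in range(offset, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frames(dna):
    """Return (strand, offset, protein) for the three forward and three reverse frames."""
    frames = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for offset in range(3):
            frames.append((strand, offset, translate(seq, offset)))
    return frames


def identity_score(protein, template):
    """Simplified score: residues identical at the same position (no gaps allowed)."""
    return sum(a == b for a, b in zip(protein, template))


def best_frame(dna, template):
    """Pick the frame whose translation shares the most residues with the template."""
    template = template.upper()
    return max(six_frames(dna), key=lambda frame: identity_score(frame[2], template))
```

Applying `best_frame` to each selected DNA sequence with the same template mirrors the idea of the batch translation; a proper pairwise alignment would be needed to cope with insertions or deletions relative to the template.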
Once these parameters are set, we can press the OK button and the translation begins for all the selected sequences. Afterwards we can use these sequences to construct phylogenetic trees from amino-acid sequences.

The other option in the context menu for a selection in the Sequences Window is the tagging of sequences. Bosque handles two tags per sequence, and the user is free to use them as desired. Some people find it useful, for example, to use tag1 for the gene name (or the protein name in the case of amino-acid sequences) and tag2 for a particular grouping that the user has created for the purposes of a phylogenetic analysis.

Networking options of Bosque

Remote execution of jobs

Public channel

Sharing resources

The Bosque Server

In the previous section we reviewed the networking options of Bosque, which allow users to work with a remote server in order to execute jobs remotely and to interact with other users connected to the same server. In this section we will see how to install a Bosque Server on a Linux system. The following figure shows the relations among the actors of a Bosque system.

Figure 25. A Bosque System

As Figure 25 shows, Bosque applications connect to Bosque Servers, which are installed on typical Linux servers. The Bosque Server uses a MySQL database to manage the data related to users and the jobs they execute. In general the Bosque Server provides the following services:

• Remote job execution. Typical phylogenetic analyses include many sequences, and the best phylogenetic analysis methods are resource-consuming. For this reason, most users may want to run their computational analyses on dedicated servers, which normally have much more main memory and faster processors than common desktop computers; this gives results in less time and avoids tying up a personal computer with a long-running job.
• Interaction with other users. When two or more users are connected to the same Bosque Server, they can share phylogenetic resources such as sequences, alignments and trees, in a way very similar to most file-sharing programs. This makes cooperation on research projects easy, since the user does not need to be aware of file formats of any kind. Finally, a public channel for talking is provided on the Bosque Server.

Installation of the Bosque Server

The Bosque Server has the following prerequisites:

• A Linux server
• A MySQL server with a database and a user with privileges to create tables

The steps to install the Bosque Server are:

1. Download http://bosque.udec.cl/downloads/bosqueserver-0.9-1.i586.rpm
2. Install the server rpm (rpm -ivh bosqueserver-0.9-1.i586.rpm). The files will be installed in /usr/local/bosque-server
3. Create a MySQL database (dbname) and user (dbuser)
4. Choose a TCP port on which the Bosque Server will listen
5. Use the data from steps 3 and 4 to modify /usr/local/bosque-server/etc/bosque-server.conf accordingly
6. Configure your Linux firewall (if any) to allow TCP connections to the port chosen in step 4
7. Create the necessary tables in the database: mysql -u dbuser dbname -p < /usr/local/bosque-server/sql/bosque.sql
8. Execute the server at /usr/local/bosque-server/sbin/bosque-server

Please note that if you choose a port lower than 1024 you will have to run the server as root.
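After the installation steps above, it can be handy to verify from a client machine that the chosen TCP port is actually reachable through the firewall. The following sketch is a generic connectivity check and not part of Bosque; the function name is hypothetical, and the host name and port in the usage comment are placeholders to be replaced with your own values.

```python
import socket


def port_is_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name-resolution failures.
        return False


# Hypothetical usage (placeholder host and port, replace with your server's values):
#   port_is_open("bosque.example.org", 9000)
```

A result of False only says that no TCP connection could be opened; the cause may be the firewall, a server that is not running, or a wrong port in the configuration file.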
Appendix

Sequence file formats

Fasta file format

The following was obtained from: http://en.wikipedia.org/wiki/FASTA_format

FASTA format is a text-based format for representing either nucleic acid sequences or protein sequences, in which base pairs or protein residues are represented using single-letter codes. The format also allows for sequence names and comments to precede the sequences. The simplicity of FASTA format makes it easy to manipulate and parse sequences using text-processing tools and scripting languages like Python and Perl.

A sequence in FASTA format begins with a single-line description, followed by lines of sequence data. The description line is distinguished from the sequence data by a greater-than (">") symbol in the first column. The word following the ">" symbol is the identifier of the sequence, and the rest of the line is the description (both are optional). There should be no space between the ">" and the first letter of the identifier. It is recommended that all lines of text be shorter than 80 characters. The sequence ends if another line starting with a ">" appears; this indicates the start of another sequence.

A simple example of two sequences in FASTA format:

>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH

Genbank file format

A sequence file in GenBank format can contain several sequences. One sequence in GenBank format starts with a line containing the word LOCUS and a number of annotation lines. The start of the sequence is marked by a line containing "ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:

LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
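Both file formats described in this appendix are simple enough to read with a few lines of code. The sketch below is a minimal reader written only from the rules stated above (a ">" description line for FASTA; "ORIGIN" and "//" markers for GenBank). The function names are hypothetical, annotation fields are ignored, and real files may need a more tolerant parser.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples from FASTA text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # First word after ">" is the identifier, the rest is the description.
            identifier, _, description = line[1:].partition(" ")
            records.append([identifier, description, []])
        elif records:
            records[-1][2].append(line)
    return [(ident, desc, "".join(chunks)) for ident, desc, chunks in records]


def genbank_sequences(text):
    """Return the sequence of every record, read between 'ORIGIN' and '//'."""
    sequences = []
    chunks = []
    in_origin = False
    for line in text.splitlines():
        if line.startswith("ORIGIN"):
            in_origin = True
            chunks = []
        elif line.startswith("//"):
            if in_origin:
                sequences.append("".join(chunks))
            in_origin = False
        elif in_origin:
            # Drop the position numbers and the blanks between blocks of ten bases.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return sequences
```

Applied to the FASTA example of this appendix, `parse_fasta` would return two records named SEQUENCE_1 and SEQUENCE_2 with empty descriptions.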
