Bosque: Software system for phylogenetic analysis
Salvador Ramírez Flandes
Laboratorio PROFC, Departamento de Oceanografía
Facultad de Ciencias Naturales y Oceanográficas
Universidad de Concepción
Table of Contents
QUICK START TO BOSQUE
TREE PROJECTS AND TREE WINDOWS
  THE SEQUENCE TAB
  THE ALIGNMENT TAB
  THE TREE TAB
DATA MANAGEMENT IN BOSQUE
  THE LOCAL DATABASE
  IMPORTING DATA INTO BOSQUE
    From a local file
    From NCBI's Entrez-Genbank
    From a Blast query
COMPUTING ALIGNMENTS AND TREES
USEFUL TOOLS OF BOSQUE
  THE SEQUENCE EDITOR
  THE ALIGNMENT EDITOR
  THE TREE EDITOR
  THE SEQUENCES WINDOW
NETWORKING OPTIONS OF BOSQUE
  REMOTE EXECUTION OF JOBS
  PUBLIC CHANNEL
  SHARING RESOURCES
THE BOSQUE SERVER
  INSTALLATION OF THE BOSQUE SERVER
APPENDIX
  SEQUENCE FILE FORMATS
    Fasta file format
    Genbank file format
INDEX
Quick start to Bosque
1. Download Bosque from http://bosque.udec.cl/downloads/BosqueSetup.exe
2. Install the software on your computer.
3. Run Bosque. The first time, you will need to specify where to store the
file containing the local database.
4. Create a Project by giving it a descriptive name.
5. Create a Tree Project within the Project you just created (step 4). To
create a Tree Project, click the button on the right labeled “New Tree”.
This creates a Tree Project and displays it in a Tree Window.
6. Add sequences to your Tree Project. In the Sequences Tab of your Tree
Window (created in step 5), there is a button labeled “Add Seqs”.
Press this button and the Sequences Window will appear. Here you can
import sequences from different sources. Then accept, and the sequences
will be added to the Tree Window.
7. Switch to the “Alignment” tab of the Tree Window. Here you will see all
your sequences unaligned. Press the “Align Sequences” button at the bottom
(if you move your mouse pointer over the buttons, each one shows a label;
look for the one that displays “align sequences”).
8. After the sequences are aligned, press the “construct tree” button in the
upper toolbar. This button has an icon that looks like a phylogenetic tree.
9. You now have a tree in your Tree Window. You can construct more trees
in this same Tree Window by using other methods, or you can create
another Tree Project by pressing the “New Tree” button, and so on.
Tree Projects and Tree Windows
Current molecular phylogenetic analyses use nucleotide and amino-acid sequence
data to infer the phylogeny of organisms. These analyses therefore begin with the
collection of a set of sequences of interest, followed by their alignment, which is
the input for the different phylogenetic methods that ultimately produce a
phylogenetic tree. With this basic pipeline in mind, Bosque defines the concept of
a Tree Project as a set consisting of:
1. a set of sequences
2. an alignment of these sequences, and
3. a set of trees out of this alignment
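The three components listed above can be pictured as a small data structure. The following is only an illustrative sketch: the class name, the field names and the choice of Newick strings for trees are assumptions made for this example, not Bosque's actual internal representation.

```python
from dataclasses import dataclass, field


@dataclass
class TreeProject:
    """Hypothetical in-memory model of a Tree Project (not Bosque's real schema)."""
    name: str
    # identifier -> raw (unaligned) sequence
    sequences: dict[str, str] = field(default_factory=dict)
    # identifier -> aligned sequence, gaps written as '-'
    alignment: dict[str, str] = field(default_factory=dict)
    # trees computed from the alignment, assumed here to be Newick strings
    trees: list[str] = field(default_factory=list)

    def is_aligned(self) -> bool:
        """True when every sequence has an aligned row and all rows share one length."""
        if not self.sequences or set(self.alignment) != set(self.sequences):
            return False
        return len({len(row) for row in self.alignment.values()}) == 1
```

A project would be filled in the same order as the pipeline: sequences first, then the alignment, then one or more trees.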
Since there is no perfect tree-reconstruction method, it is common to use
different techniques with different models to produce multiple trees, which can
then be analyzed or merged in some way into a so-called consensus tree. Bosque
stores these Tree Projects in a local relational database (implemented with
SQLite[1]) whose format is transparent to the user. By selecting a Tree Project by
its name, the user automatically loads all the sequences, the alignment and the
trees of that Tree Project.
In Bosque these Tree Projects are manipulated in a special window called (not
surprisingly) the Tree Window. The Tree Window expresses the Tree Project
concept graphically, and thus it is divided into different tabs: one for the list of
sequences, another for the alignment, and one or more tabs for the trees.
Bosque can handle multiple Tree Windows at a time, in what is technically called
an MDI, or Multiple Document Interface[2].
The Tree Window has a toolbar at the top with icon buttons for common
operations on the Tree Project. These operations include: saving the Tree
Project to the local database, exporting data (which data depends on the tab
that is selected when the operation is requested), searching for a particular
sequence, configuring special options of the Tree Project, reconstructing a tree
from the alignment (if it already exists), printing the tree, and closing the Tree Project.
In the following sections we will review every tab of the Tree Window in detail.
[1] SQLite is a small C library that implements a self-contained, embeddable, zero-configuration SQL database engine. http://www.sqlite.org
[2] See http://en.wikipedia.org/wiki/Multiple_document_interface for more information about MDI.
The sequence Tab
In Figure 1 we can see a screenshot of a Tree Window displaying the sequence
tab. It is composed, vertically, of three parts: the toolbar on top (common to all
the tabs of the Tree Window), the sequences table, and a toolbar at the bottom
for operations applicable only to sequences.
Figure 1. Tree Window displaying the Sequences Tab
The sequences table shows information about the sequences, and the columns it
displays are configurable. To change the default columns (accession number of
the sequence, definition, size in base pairs and organism name), right-click the
header of the table and a popup menu with options will appear.
The toolbar at the bottom consists of buttons for adding more sequences to the
Tree Project, for editing a particular sequence, for removing sequences from the
Tree Project, and for exporting the sequences to external formats such as fasta or
genbank[3].
Remember that at this point all the sequences are stored in the local
database; they have already been imported through the Sequences
Window, which we will review later in this tutorial.
[3] See the appendix on sequence file formats.
The alignment tab
After we have collected the sequences for the analysis we need to align them, so
that only homologous sites (in theory) are compared by the tree-reconstruction
methods in the next stage. When we select the alignment tab for the first time
after collecting the sequences, they are, of course, not yet aligned. We can then
use the “align” button at the bottom (indicated with the number 4 in Figure 2) to
carry out the alignment. This opens a window for running an external, widely
used program called Muscle.
Figure 2. The Alignment tab of a Tree Window showing a set of aligned sequences
After the alignment is done, the sequences appear arranged as Muscle dictates,
given the data provided and the options selected in the Muscle window[4].
The sequences are presented in a special table called the Alignment Editor.
This editor shows the bases of the alignment in cells, which can be edited by
double-clicking them, or by selecting regions and cutting them with the scissors
button marked as 5 in Figure 2.
Please refer to the section “The Alignment Editor” for further information.
[4] For details on how Muscle performs alignments, please refer to http://www.drive5.com/muscle/.
The Tree Tab
Once we are satisfied with the alignment we are ready to reconstruct a tree,
which we can do by pressing the “tree” button on the toolbar at the top of the Tree Window.
Figure 3. Tree on a Tree Window
There are numerous tree-reconstruction methods, and we have covered some of
them by using well-known phylogenetic command-line programs. For now we have
integrated into Bosque the programs from the Phylip package and the Tree-Puzzle
program, which implements maximum likelihood by the quartet puzzling method[5].
Please note that if the program is connected to a Bosque server, additional
options appear in the popup menu of the “tree” button on the Tree Window’s
toolbar. These options allow the programs to be executed remotely on the
server. This feature is particularly useful when analyzing a dataset with
many sequences, where the analysis may take a long time.
Finally, the tree is not only displayed for visualization; it can also be edited
with the mouse. For example, right-clicking a particular tip of the tree opens a
context popup menu. It is also possible to move tips along the tree, change the
appearance of the text, expand or shrink a whole tip, rename a particular tip,
and so on.
[5] For more information about these programs, please see the webpage of each package. The webpage for Phylip is
http://evolution.genetics.washington.edu/phylip.html. The webpage for Tree-Puzzle is http://www.tree-puzzle.de.
Data management in Bosque
All data in Bosque is stored in a local relational database, using the SQLite library.
By local we mean that this database is located (as a file) on the same computer
where Bosque is executed.
The main advantage (for this application) of having a local database to store
everything is that it spares the user the complications of:
1. Manipulating computer files in different formats
2. Organizing these files across the different phylogenetic projects that a
particular user may be carrying out.
Acknowledged or not, many phylogenetic analyses rely on trial and error, so one
normally needs to compose several sets of sequences, analyze them, add more
sequences to those sets, analyze them again, remove some, repeat the analysis,
and so on. If a user manages multiple projects, one can easily imagine that the
number of different sequence files grows rapidly. The names of the files also
need to be very informative, so that we can tell what they contain just by
looking at the list of files.
To avoid all this effort, Bosque implements all data management in a single
database, organized in tables, so the user can manipulate elements at a
“phylogenetic level”, such as sequences, trees and alignments, and not at a
multipurpose level, such as mere computer files, which must be complemented
with information from the phylogenetic level by giving them special names, using
special file extensions, organizing them in special directories, and so on.
The local database
This database is composed of different tables that store the different elements
used, such as sequences, trees, jobs and servers. To see where the database is
located on the filesystem, select the option “database properties” from the
“Database” menu in the main window of Bosque. This window also displays the
size (in megabytes) of the file containing the database and the number of
projects, trees and sequences.
As we have already said, this database is actually an SQLite database, and
SQLite is an open source relational database implementation. This means that the
tables of this database (and thus their data) can also be manipulated by programs
other than Bosque. This feature is important because the user does not need to
rely on Bosque alone for access to their data. In case of trouble with Bosque, the
user can always use SQLite programs (downloadable from the SQLite webpage:
http://www.sqlite.org) to access their data from the tables of the Bosque
database. Of course this practice is not recommended and should only be
used when there is no other way to rescue the desired data. If the data in the
database is corrupted in some way (for example, by external manipulation of the
database), Bosque may behave unpredictably.
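As an illustration of such a rescue, the sketch below opens an SQLite file read-only with Python's standard sqlite3 module and lists its tables with their row counts. It makes no assumption about the names of Bosque's tables, which are not documented here; the function name list_tables is invented for this example.

```python
import sqlite3
from pathlib import Path


def list_tables(db_path: str) -> dict[str, int]:
    """Return {table name: row count} for every table in an SQLite file."""
    # Open the file read-only so that inspecting it can never modify the data.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        names = [row[0] for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")]
        counts = {}
        for name in names:
            quoted = '"' + name.replace('"', '""') + '"'  # quote the identifier
            counts[name] = conn.execute(
                f"SELECT COUNT(*) FROM {quoted}").fetchone()[0]
        return counts
    finally:
        conn.close()
```

The path to pass is the one reported by the “database properties” option. Working on a copy of the file is the safer choice.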
Importing data into Bosque
To import data into Bosque we use the Sequences Window, which is shown in
Figure 4.
Figure 4. The Sequences Window
The Sequences Window is opened by pressing the “sequences” button on the
toolbar of the main window. This window displays sequences by searching the
local database. If no criteria are specified in the search boxes, all the
sequences are listed in the table. We will see more of this window later in this
tutorial; for now we will just review the import options at the bottom of the window.
There are three ways to import sequence data into Bosque:
1. from a fasta or genbank file
2. from a query on genbank entrez
3. from a query on blast servers
From a local file
Typically, the data from sequencing projects include files with the sequences in
Fasta format. To enter these data into Bosque, choose the first option (import
from local file) and select the file containing the sequence data. This option also
allows importing sequences in the genbank format. Both formats (fasta and
genbank) can contain either a single sequence or multiple sequences within a
single file. Files in fasta format normally have the extension .fas or .fasta. Files in
the genbank format typically use the extension .gb. Please refer to the appendix for
a brief description of the fasta and genbank formats.
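To make the multi-sequence case concrete, here is a minimal sketch of a Fasta reader. It is not Bosque's importer: the function name parse_fasta is invented for this example, and it relies only on the general Fasta convention that each record starts with a “>” definition line followed by one or more lines of sequence data.

```python
def parse_fasta(text: str) -> list[tuple[str, str]]:
    """Parse Fasta text into (definition, sequence) pairs, one per record."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # blank lines carry no data
        if line.startswith(">"):
            # A new definition line closes the previous record, if any.
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:
            chunks.append(line)  # sequence data may span several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records
```

A file with two records therefore yields two pairs, each holding the definition line and the concatenated sequence data.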
From NCBI's Entrez-Genbank
The second option is Entrez, an integrated, text-based search and retrieval
system used at NCBI for its major databases, including PubMed, Nucleotide and
Protein Sequences, Protein Structures, Complete Genomes, Taxonomy, and
others.
Bosque implements a tool window for specifying a search on the NCBI servers and
bringing the resulting sequences into the local database. This window is shown in
Figure 5:
Figure 5. Genbank Entrez tool window
As we can see in this window, it is possible to indicate which database will be
searched, the number of sequences to retrieve, and a query text. When the
“advanced search” button is pressed, more search fields can be specified, so
that users with more experience can compose complex queries.
Once the search text has been entered, press the “search” button; Bosque will
connect to the NCBI servers to run the query and retrieve the results.
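The three inputs of this window (database, number of results, query text) map naturally onto a search request. How Bosque talks to NCBI internally is not described in this manual; the sketch below only builds a request URL for NCBI's public E-utilities esearch service, using its db, term and retmax parameters. The function name and the example query are assumptions made for illustration.

```python
from urllib.parse import urlencode

# Public NCBI E-utilities search endpoint (not necessarily what Bosque uses).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(database: str, query: str, max_results: int = 20) -> str:
    """Build an esearch URL from a database name, a query text and a result limit."""
    params = {"db": database, "term": query, "retmax": max_results}
    return f"{ESEARCH_URL}?{urlencode(params)}"


# Hypothetical example query, limited to 50 results.
url = build_esearch_url("nucleotide", "Synechococcus[Organism] AND 16S", 50)
```

Fetching the URL returns a list of matching record identifiers, which a second request would then have to resolve into sequence records.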
Figure 6. Entrez window with the sequences obtained on a search
If the search is successful, a list of sequences is shown on the other tab of the
window (Figure 6). Here we can select sequences and import them with the import
button. An important feature is that large sequences composed of many
CDS (coding sequences) appear in a different color in the table so that the user
can notice them. Since we are normally interested in just one of the CDS, we
would like to browse the CDS of these sequences and select the particular one
we are interested in. This can be done by double-clicking the sequence; a
window like the one in Figure 7 appears.
Figure 7. Exploring the CDS of a particular complex sequence
Since the list can be long, it may take a couple of minutes to download,
depending on the available internet connection. Another important point about
this tool is that not all large sequences (complete genomes, complete
chromosomes, etc.) are annotated and mapped in the “Gene” database at NCBI.
This means that it may not be possible to browse every large sequence that
appears in this table.
From a Blast query
The third method to import data into Bosque is through a BLAST query against
NCBI's BLAST servers.
The Basic Local Alignment Search Tool, or BLAST, is an algorithm for comparing
primary biological sequence information, such as the amino-acid sequences of
different proteins or the nucleotides of DNA sequences. A BLAST search enables a
researcher to compare a query sequence with a library or database of sequences,
and identify library sequences that resemble the query sequence above a certain
threshold[6].
[6] Definition obtained from http://en.wikipedia.org/wiki/BLAST
Figure 8 shows the Bosque window for importing sequences from a BLAST
query.
Figure 8. Window to import sequences from a Blast query
To run a BLAST query, the query sequence must be supplied in the “Sequence
Data” box. There are two ways of getting the sequence into that box. The first is
to select an external file (in Fasta format); its sequences are then listed in the
sequences table, and when a particular sequence is selected its data is
automatically displayed in the “Sequence Data” box, ready for the BLAST. The
other is simply to copy and paste the sequence into the box.
Selecting the blast program
The following information was extracted from http://www.ncbi.nlm.nih.gov/Education/blasttutorial.html
Bosque allows the selection of the different BLAST programs, which are listed in
the table below.
Program    Description
blastp     Compares an amino acid query sequence against a protein sequence database.
blastn     Compares a nucleotide query sequence against a nucleotide sequence database.
blastx     Compares a nucleotide query sequence translated in all reading frames against a
           protein sequence database. You could use this option to find potential translation
           products of an unknown nucleotide sequence.
tblastn    Compares a protein query sequence against a nucleotide sequence database
           dynamically translated in all reading frames.
tblastx    Compares the six-frame translations of a nucleotide query sequence against the
           six-frame translations of a nucleotide sequence database. Please note that the
           tblastx program cannot be used with the nr database on the BLAST Web page
           because it is computationally intensive.
Selecting the blast database
It is possible to select several NCBI databases to compare the query sequences
against. Note that some databases are specific to proteins or nucleotides and
cannot be used in combination with certain BLAST programs (for example a blastn
search against swissprot).
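The compatibility rule can be written down as a small lookup. The sketch below encodes, for each program in the table above, the type of query it takes and the type of database it searches; the names BLAST_PROGRAMS and is_compatible are invented for this example and are not part of Bosque or BLAST.

```python
# program -> (query sequence type, database sequence type), from the program table
BLAST_PROGRAMS = {
    "blastp":  ("protein", "protein"),
    "blastn":  ("nucleotide", "nucleotide"),
    "blastx":  ("nucleotide", "protein"),
    "tblastn": ("protein", "nucleotide"),
    "tblastx": ("nucleotide", "nucleotide"),
}


def is_compatible(program: str, database_type: str) -> bool:
    """True if the program searches databases of the given type."""
    return BLAST_PROGRAMS[program][1] == database_type


# swissprot is a protein database, so blastn cannot search it, but blastx can.
blastn_ok = is_compatible("blastn", "protein")
blastx_ok = is_compatible("blastx", "protein")
```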
Proteins
Database   Description
nr         All non-redundant GenBank CDS translations+PDB+SwissProt+PIR+PRF
month      All new or revised GenBank CDS translation+PDB+SwissProt+PIR released in the
           last 30 days.
swissprot  The last major release of the SWISS-PROT protein sequence database (no updates).
           These are uploaded to our system when they are received from EMBL.
patents    Protein sequences derived from the Patent division of GenBank.
yeast      Yeast (Saccharomyces cerevisiae) protein sequences. This database is not to be
           confused with a listing of all Yeast protein sequences. It is a database of the
           protein translations of the Yeast complete genome.
E. coli    E. coli (Escherichia coli) genomic CDS translations.
pdb        Sequences derived from the 3-dimensional structure Brookhaven Protein Data Bank.
kabat      Kabat's database of sequences of immunological interest. For more information
           [kabatpro] http://immuno.bme.nwu.edu/
alu        Translations of select Alu repeats from REPBASE, suitable for masking Alu repeats
           from query sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See
           "Alu alert" by Claverie and Makalowski, Nature vol. 371, page 752 (1994).
Nucleotides
Database    Description
nr          All non-redundant GenBank+EMBL+DDBJ+PDB sequences (but no EST, STS, GSS,
            or HTGS sequences).
month       All new or revised GenBank+EMBL+DDBJ+PDB sequences released in the last 30
            days.
dbest       Non-redundant database of GenBank+EMBL+DDBJ EST Divisions.
dbsts       Non-redundant database of GenBank+EMBL+DDBJ STS Divisions.
mouse ests  The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the
            organism mouse.
human ests  The non-redundant database of GenBank+EMBL+DDBJ EST Divisions limited to the
            organism human.
other ests  The non-redundant database of GenBank+EMBL+DDBJ EST Divisions for all organisms
            except mouse and human.
yeast       Yeast (Saccharomyces cerevisiae) genomic nucleotide sequences. Not a collection of
            all Yeast nucleotide sequences, but the sequence fragments from the Yeast
            complete genome.
E. coli     E. coli (Escherichia coli) genomic nucleotide sequences.
pdb         Sequences derived from the 3-dimensional structure of proteins.
kabat       Kabat's database of sequences of immunological interest. For more information
            [kabatnuc] http://immuno.bme.nwu.edu/
patents     Nucleotide sequences derived from the Patent division of GenBank.
vector      Vector subset of GenBank(R), NCBI, (ftp://ncbi.nlm.nih.gov/pub/blast/db/ directory).
mito        Database of mitochondrial sequences (Rel. 1.0, July 1995).
alu         Select Alu repeats from REPBASE, suitable for masking Alu repeats from query
            sequences. It is available at ftp://ncbi.nlm.nih.gov/pub/jmc/alu. See "Alu alert" by
            Claverie and Makalowski, Nature vol. 371, page 752 (1994).
epd         Eukaryotic Promoter Database ISREC in Epalinges s/Lausanne (Switzerland).
gss         Genome Survey Sequence, includes single-pass genomic data, exon-trapped
            sequences, and Alu PCR sequences.
htgs        High Throughput Genomic Sequences.
Blast execution and retrieving of results
After the sequence data has been provided and the options have been set
appropriately, the user should press the “BLAST” button to send the BLAST job to
NCBI's servers. Of course, first check that your internet connection is working.
The BLAST server accepts the job and returns an estimated time to completion.
Bosque waits for that time and then connects to the BLAST server to check
whether the job has actually finished. If so, Bosque retrieves the results (with
their scores and E-values) and shows the sequences found, as in Figure 9.
Figure 9. The result of a blast search
This table is similar to the one already reviewed in the Entrez window. Here the
user should select the sequences to import and then press the “import”
button. If no sequence is selected, all the sequences are imported. In the last
column of this table, Bosque indicates whether each sequence is already present
in the local database. Naturally, a sequence that is already in the local database
does not need to be imported.
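The submit, wait and check cycle described for BLAST jobs can be sketched as a generic polling loop. This is not Bosque's code: the three callables stand in for the real network requests, and the behaviour after an unfinished check (re-polling at a fixed interval, up to a maximum number of attempts) is an assumption, since the manual only describes the first check.

```python
import time
from typing import Callable


def wait_for_job(submit: Callable[[], tuple[str, float]],
                 is_finished: Callable[[str], bool],
                 fetch: Callable[[str], str],
                 poll_interval: float = 60.0,
                 max_polls: int = 30,
                 sleep: Callable[[float], None] = time.sleep) -> str:
    """Submit a job, wait for the server's estimate, then poll until it is done."""
    job_id, estimated_seconds = submit()
    sleep(estimated_seconds)          # first wait: the estimate returned by the server
    for _ in range(max_polls):
        if is_finished(job_id):
            return fetch(job_id)      # job done: retrieve the results
        sleep(poll_interval)          # assumed policy: retry at a fixed interval
    raise TimeoutError(f"job {job_id} did not finish after {max_polls} checks")
```

Passing the sleep function as a parameter keeps the loop testable without real waiting.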
Computing alignments and trees
At this point we already know how to get sequence data into Bosque. Now we will
see how to obtain a phylogenetic tree from these sequences. The first step is to
create a Tree Project, which is done by pressing the “New Tree” button in the
main window of Bosque.
Figure 10. Main window of Bosque
This opens a Tree Window that contains all the data of a Tree Project, i.e.
sequences, alignment and trees. For this reason a Tree Window is composed of
multiple tabs: a Sequences Tab, an Alignment Tab and zero or more Tree Tabs. All
these tabs have already been described in this tutorial.
A newly created Tree Window looks like the one in Figure 1, but without any
sequences in the table. To add sequences, press the “Add Seqs” button at the
bottom left. This opens the Sequences Window (Figure 4), where sequences can
be searched for in the local database or imported from files, genbank-entrez or
BLAST (see the earlier section “Importing data into Bosque” for further details).
Once the sequences are in the Sequences Tab, we can switch to the
Alignment Tab (Figure 2) of the Tree Window and press the “Align
Sequences” button (identified with label 4 in Figure 2).
Currently, the alignment is carried out using Muscle, a widely used
command-line program for computing multiple sequence alignments.
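For readers curious about what running a command-line aligner involves, the sketch below builds and runs a basic Muscle call from Python. It assumes the MUSCLE 3.x option names -in and -out (later major versions use different options) and that an executable named muscle is on the search path. The file names are placeholders, and the options Bosque itself passes to Muscle are not documented here.

```python
import subprocess


def muscle_command(input_fasta: str, output_fasta: str,
                   executable: str = "muscle") -> list[str]:
    """Build a MUSCLE 3.x command line: read Fasta input, write the alignment as Fasta."""
    return [executable, "-in", input_fasta, "-out", output_fasta]


def run_muscle(input_fasta: str, output_fasta: str) -> None:
    """Run Muscle and raise CalledProcessError if it exits with an error."""
    subprocess.run(muscle_command(input_fasta, output_fasta), check=True)
```

Building the argument list separately from running it makes the command easy to inspect before anything is executed.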
Figure 11. Muscle interface of Bosque
Figure 11 shows the window used to invoke Muscle for the sequence alignment.
Depending on the number of sequences and their length, multiple sequence
alignment can take considerable time. It can also consume a lot of CPU and
memory.
When the alignment is finished, the sequences are arranged in the Alignment
Editor on the Alignment tab (Figure 2). The user should review this alignment to
ensure it is suitable for the purposes of the analysis. Remember that an
alignment is just a “theory” about how these sequences evolved in nature. The
researcher needs to assist the automatic alignment computation.
Once the user is satisfied with the alignment, it is time to try a tree
reconstruction with these data. For this purpose we press the “Tree
reconstruction” button on the upper toolbar of the Tree Window.
Figure 12. Construct a Tree out of an Alignment
As can be seen in Figure 12, two tree-reconstruction methods are provided:
distance methods (carried out by Phylip) and maximum likelihood by quartet
puzzling. (When the user is connected to a server, other methods are available.
See the later section on networking with Bosque.)
Distance methods are less time-consuming than others and so are suitable for a
first inspection of the data. Different distances are defined depending on the
data used for the tree reconstruction. Distances for amino-acid sequences are,
of course, different from distances for DNA sequences.
Figure 13. Neighbor Joining for DNA sequences using Phylip
Figure 13 shows the Bosque interface for calculating a phylogenetic tree from
DNA sequences using the Neighbor Joining method. Here it is possible to specify
the sequence to be used as the outgroup and the type of distance with which to
construct the phylogenetic tree. For DNA data the distances are the ones
defined in the Phylip package (program dnadist): F84 (default), Kimura
2-parameter and Jukes-Cantor. For amino-acid data the distances are:
Jones-Taylor-Thornton matrix, Henikoff/Tillier PMB matrix and Dayhoff PAM matrix.
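To give a feeling for what such a distance is, the sketch below computes the textbook closed-form Jukes-Cantor and Kimura 2-parameter distances between two aligned DNA sequences. It is an illustration only: Phylip's dnadist has its own implementation and options, and the function names here are invented. With p the proportion of differing sites, Jukes-Cantor gives d = -(3/4) ln(1 - 4p/3). With P and Q the proportions of transitions and transversions, Kimura gives d = -(1/2) ln((1 - 2P - Q) sqrt(1 - 2Q)).

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def _count_differences(seq1: str, seq2: str) -> tuple[int, int, int]:
    """Return (compared sites, transitions, transversions), skipping gaps and ambiguity codes."""
    sites = transitions = transversions = 0
    for a, b in zip(seq1.upper(), seq2.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue
        sites += 1
        if a == b:
            continue
        if {a, b} <= PURINES or {a, b} <= PYRIMIDINES:
            transitions += 1      # A<->G or C<->T
        else:
            transversions += 1    # purine <-> pyrimidine
    if sites == 0:
        raise ValueError("no comparable sites")
    return sites, transitions, transversions


def jukes_cantor(seq1: str, seq2: str) -> float:
    sites, ts, tv = _count_differences(seq1, seq2)
    p = (ts + tv) / sites
    return -0.75 * math.log(1 - 4 * p / 3)


def kimura_2p(seq1: str, seq2: str) -> float:
    sites, ts, tv = _count_differences(seq1, seq2)
    P, Q = ts / sites, tv / sites
    return -0.5 * math.log((1 - 2 * P - Q) * math.sqrt(1 - 2 * Q))
```

Both formulas correct the observed proportion of differences for multiple substitutions at the same site, which is why the distance grows faster than the raw proportion.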
In the “Resampling” tab it is possible to request that the output tree be a
consensus tree obtained by resampling the data with statistical methods such as
bootstrapping.
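Bootstrapping an alignment means drawing its columns at random, with replacement, until a new alignment of the same length is obtained; a tree is built from each replicate and the trees are then summarized as a consensus. The sketch below shows only the resampling step. It is a generic illustration with an invented function name, not the Phylip code that Bosque runs.

```python
import random


def bootstrap_alignment(alignment: dict[str, str], rng: random.Random) -> dict[str, str]:
    """Return one bootstrap replicate: columns sampled with replacement."""
    length = len(next(iter(alignment.values())))
    # Pick as many column indices as the alignment has columns, repeats allowed.
    columns = [rng.randrange(length) for _ in range(length)]
    # Apply the same column choice to every sequence so columns stay intact.
    return {name: "".join(row[i] for i in columns) for name, row in alignment.items()}
```

Passing a seeded random.Random makes a set of replicates reproducible.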
In Figure 14 we can see the window for tree reconstruction using the Tree-Puzzle program.
Figure 14. Maximum Likelihood by Quartet Puzzle algorithm using Tree-Puzzle
In this window we can select the outgroup and other parameters related to the
model of evolution for DNA sequences and for amino-acid sequences. For DNA
sequences the models of evolution available are Hasegawa et al. (1985) and
Tamura-Nei (1993).
The user does not need to specify every option. If an option is omitted, its
default value is calculated, which is recommended for most users.
Finally, in the “Rate Heterogeneity” tab it is possible to specify rate
heterogeneity among the different sites of the sequences, if required.
Useful tools of Bosque
The Sequence Editor
Either in the sequence tab of the Tree Window (Figure 1) or in the Sequences
Window (Figure 4), it is possible to edit a sequence in order to view (or modify)
all the information that Bosque has about it. The amount of information about a
particular sequence within Bosque depends on how much information its source
provided when we imported it. For example, if we import a sequence from
a FASTA file[7], the sequence will have only a definition and its sequence data
(DNA bases or amino acids). On the other hand, if we import the sequence from
a BLAST query, it will probably have more information from the
Genbank database.
Figure 15. Editing a sequence: Tab: sequence information
[7] See the appendix for a description of the FASTA format.
In Figure 15 we can see the Sequence Editor on a sequence tab. As the
figure shows, this sequence has plenty of information obtained from the
Genbank database. Apart from the normal sequence fields (defined for a sequence
entry in the Genbank database) Bosque adds three new fields: Custom
Name, User tag1 and User tag2. If a sequence has a custom name, this will be
the name displayed on the tree when the sequence is used to construct a
phylogenetic tree. The tags are useful for searching sequences in the Sequences
Window (Figure 4).
Bosque also stores the complete Genbank file when possible. This is the second
tab of the Sequence Editor and we can see this in Figure 16.
Figure 16. The Sequence Editor displaying the Genbank tab
This is particularly useful if we are interested in the source of the sequence and
any associated publications.
Finally, in the lower-right corner of Figure 15 we can see an option
called “protein coding sequence”. This option is only available for DNA sequences.
If we activate it, we are specifying that our sequence codes
for a protein[8], and a third tab of the Sequence Editor is enabled. On this tab
we can carry out the translation of the sequence, and we can later use the
translation to align and construct phylogenetic trees using the amino-acid
sequences instead of the DNA sequences.
Figure 17. The Sequence Editor displaying the translation tab and set to use a translation
from the genbank file
If the sequence has its Genbank information and this provides translations of the
CDS (coding sequence) features, then we can select one of the translations for
the different CDS that the sequence may include. In Figure 17 we can see this.
[8] This is not necessarily the case for all DNA sequences, of course. The 16S ribosomal RNA gene, for example, does not code for any
protein.
When we select one translation, it is automatically displayed in
the corresponding box.
The other method to translate a particular sequence is what Bosque calls a manual
translation. This consists of indicating the start codon (with a simple click on the
sequence) and the translation table; Bosque will then automatically translate the
codons into amino acids and show the result in the box at the bottom. This can be
seen in Figure 18.
Figure 18. The Sequence Editor displaying the translation tab and set to use a manual
translation to translate the DNA sequence
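The manual translation described above can be sketched in a few lines of Python. This is an illustration only, not Bosque's code: the function name is hypothetical, only the standard genetic code (NCBI translation table 1) is included, and stopping at the first stop codon is an assumption.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons ordered T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
STANDARD_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate_from(dna, start, table=STANDARD_TABLE):
    """Translate `dna` codon by codon from 0-based index `start`.
    Unknown codons become 'X'; translation stops at the first stop codon."""
    dna = dna.upper().replace("U", "T")
    protein = []
    for i in range(start, len(dna) - 2, 3):
        aa = table.get(dna[i:i + 3], "X")
        if aa == "*":
            break
        protein.append(aa)
    return "".join(protein)

print(translate_from("CCATGGCCTAAGG", 2))
```

Here the start index plays the role of the click on the start codon, and the `table` argument plays the role of the selected translation table.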
A third method is provided in this same window. It chooses, among the
six possible translations that a sequence may have[9], the one that has the best
alignment to a given template sequence. If we already have an amino-acid
sequence, or a DNA sequence with its translation, then we can use this sequence
as a template for the translation of a DNA sequence. The best alignment
is calculated as the number of similar bases between the sequences, and
Bosque will choose the translation that maximizes this number.
[9] Three on the original sequence and three on the reverse complement sequence.
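The template-based choice among the six reading frames can be sketched as follows. All names are hypothetical, and the score used here is a deliberate simplification: it counts identical residues at the same position without inserting gaps, whereas Bosque's own alignment score may be computed differently.

```python
from itertools import product

BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def translate(dna):
    """Translate every complete codon of `dna` (standard genetic code)."""
    return "".join(TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))

def six_frames(dna):
    """Three frames on the given strand (+1..+3) and three on its reverse complement (-1..-3)."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]
    frames = {}
    for strand, seq in (("+", dna), ("-", reverse)):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(seq[offset:])
    return frames

def identity_score(a, b):
    """Simplified stand-in for an alignment score: matches at equal positions."""
    return sum(x == y for x, y in zip(a, b))

def best_frame(dna, template):
    """Return (frame name, translation) with the highest score against the template."""
    return max(six_frames(dna).items(), key=lambda kv: identity_score(kv[1], template))

print(best_frame("GATGAAATTTGGG", "MKFG"))
```

A real implementation would align each candidate translation to the template before counting matches, so that insertions and deletions do not shift the comparison.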
After a translation is chosen in this window, we need to accept the changes with the
“Accept” button at the bottom of the window. Bosque will then ask for
confirmation and will save the translation in the local database for future use in
a phylogenetic tree. DNA sequences with a translation set are displayed
in a different color in the Sequences Window (Figure 4).
The Alignment Editor
The Alignment Editor is a table that uses a single cell for every base of each
sequence selected for the current Tree Project. Figure 19 shows the
appearance of this editor with a set of aligned sequences.
Figure 19. The Alignment Editor on the alignment tab of a Tree Window
As mentioned in a previous section, the Alignment Editor is placed on
the alignment tab of a Tree Window. After the integration of sequences, these are
automatically put in the Alignment Editor without any alignment, so they may
appear unordered (the bases at a particular site or column do not match those of the
other sequences at the same site). While the sequences are in this state, Bosque
will not allow tree-reconstruction operations, to ensure that the user
completes this stage first.
With the button marked as “4” in Figure 19, we can align these sequences using a
program called Muscle[10]. After the alignment is done, the sequences will appear
ordered as Muscle dictates, given the data provided and the options selected in the
Muscle Window.
[10] For details of how Muscle performs the alignments, please refer to http://www.drive5.com/muscle/.
The editor shows the bases of the alignment in cells, which can be edited by
double-clicking on them, or by selecting regions for other editing operations. For
example, we can select a rectangular region and perform cuts with the scissor on the
button marked as 5 in Figure 19. This tool has three options. A cut of the left end
requires a single-column selection and removes the whole left end, from
the beginning to the selected column. A cut in the middle requires a rectangular
selection and removes all the columns from the first to the last selected. A cut of the
right end requires a single-column selection and removes everything from the
selected column to the last column in the alignment. These operations may be of
interest when the sequences have heterogeneous sizes and we want
to keep only the columns where all (or the majority) of the sequences have data (and not
gaps).
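The three cut operations can be pictured with the following Python sketch, which treats the alignment as a dictionary of equal-length strings. The function names are hypothetical, columns are 0-based, and treating the selected column as part of the removed region is an assumption about the behavior described above.

```python
def cut_left(alignment, col):
    """Remove the left end, from the first column through `col` (inclusive)."""
    return {name: seq[col + 1:] for name, seq in alignment.items()}

def cut_middle(alignment, first, last):
    """Remove the columns from `first` through `last` (inclusive)."""
    return {name: seq[:first] + seq[last + 1:] for name, seq in alignment.items()}

def cut_right(alignment, col):
    """Remove the right end, from `col` (inclusive) to the last column."""
    return {name: seq[:col] for name, seq in alignment.items()}

# Toy alignment, invented for this example: the first sequence is shorter,
# so the flanking columns contain gaps.
aln = {"short": "--ACGT--", "long": "TTACGTAA"}
trimmed = cut_right(cut_left(aln, 1), 4)
print(trimmed)
```

After both cuts only the columns where every sequence has data remain, which is the typical use of this tool.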
With the button marked as 1 we can reload the unaligned sequences from the sequence
tab into this editor, so we can align them again. This function is useful when we
add more sequences to our analysis later and need to recalculate the alignment.
With button 2 we can customize the names of the sequences for their presentation
on the tree.
If we would like to compare two particular sequences, we can select the check button
“comparing” and select two sequences by clicking on their names on the left
side of the table. When the pair of sequences is selected, only these sequences
are shown, together with the number of differences between them
and the percentage of similarity. Here, a gap weight of zero (the default
value) indicates that sites with at least one gap will not be considered in the
comparison.
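A minimal sketch of this pairwise comparison, for the default case of a gap weight of zero, could look like this. The function name is hypothetical, and how Bosque weighs gaps for non-zero gap weights is not covered here.

```python
def compare_pair(a, b):
    """Return (differences, percent_similarity) for two aligned sequences.
    Sites where either sequence has a gap are ignored (gap weight of zero)."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    compared = differences = 0
    for x, y in zip(a.upper(), b.upper()):
        if x == "-" or y == "-":
            continue
        compared += 1
        if x != y:
            differences += 1
    similarity = 100.0 * (compared - differences) / compared if compared else 0.0
    return differences, similarity

print(compare_pair("ACGT-ACG", "ACTTAAC-"))
```

In this example two of the eight columns contain a gap, so only six sites are compared and one of them differs.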
Sometimes it is necessary to compare all the sequences at once. For this purpose
we can use the similarity table window, which is opened by pressing the button
marked as 3 in Figure 19 and which is shown in Figure 20.
Figure 20. The similarity table from the alignment tab
This similarity table presents all the sequences against each other, with their
percentage of similarity. From this data, it might be useful to group the closest
sequences, given a threshold percentage. To do this, the option “group by
similarity” is presented as a button at the bottom of the window. This option will
ask for a percentage (say 95%) and Bosque will organize all the sequences into
groups (normally called OTUs[11]). An example of this is presented in Figure 21.
Figure 21. The sequences grouped by their percentage of sequence similarity
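One possible way to form such groups is single-linkage clustering: two sequences end up in the same group whenever a chain of pairs at or above the threshold connects them. The sketch below assumes this rule for illustration; the manual does not state which grouping rule Bosque applies, and all names are hypothetical.

```python
def percent_similarity(a, b):
    """Percentage of identical sites, ignoring sites with a gap."""
    pairs = [(x, y) for x, y in zip(a, b) if x != "-" and y != "-"]
    if not pairs:
        return 0.0
    return 100.0 * sum(x == y for x, y in pairs) / len(pairs)

def group_by_similarity(alignment, threshold):
    """Single-linkage grouping (an assumed rule) using a small union-find."""
    names = list(alignment)
    parent = {name: name for name in names}

    def find(name):
        while parent[name] != name:
            parent[name] = parent[parent[name]]  # path halving
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if percent_similarity(alignment[a], alignment[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), []).append(name)
    return list(groups.values())

# Toy alignment, invented for this example.
aln = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "TTTTTTTTTT"}
print(group_by_similarity(aln, 90))
```

With a threshold of 90%, s1 and s2 (which differ at one site out of ten) fall into one group and s3 remains alone.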
Basic export operations for these tables are offered at the bottom of each of the
windows just reviewed. Prefer exporting the data as ASCII text, either tab-separated
(TSV) or comma-separated (CSV), rather than Excel. CSV files can be
easily imported into most spreadsheet programs.
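To show why these plain-text formats are convenient, here is a small Python sketch that writes a similarity table as CSV or TSV with the standard csv module. The table layout (names in the first row and first column) and the function name are assumptions made for the example, not Bosque's actual export layout.

```python
import csv
import io

def write_similarity_table(names, matrix, handle, delimiter=","):
    """Write a square similarity matrix with a header row and a name column."""
    writer = csv.writer(handle, delimiter=delimiter, lineterminator="\n")
    writer.writerow([""] + list(names))
    for name, row in zip(names, matrix):
        writer.writerow([name] + [f"{value:.1f}" for value in row])

# Toy values, invented for this example.
names = ["seqA", "seqB"]
matrix = [[100.0, 95.5], [95.5, 100.0]]

buffer = io.StringIO()
write_similarity_table(names, matrix, buffer, delimiter="\t")  # TSV
print(buffer.getvalue())
```

The same function writes CSV when the default delimiter is kept, and the result can be read back with csv.reader or opened in a spreadsheet program.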
Returning to the Alignment Editor (Figure 19), a few options are available when the
user selects sequence names in the alignment (on the left of the table) and
right-clicks. Here the user can tag sequences for convenience (normally to
make them easier to find in the Sequences Window, for example to add more
sequences to a Tree Project), execute a BLAST query using the selected
sequence, and remove one or more selected sequences from the alignment
(and from the Tree Project or the database if desired).
Finally, it is possible to customize the colors with which the bases are displayed in
the Alignment Editor. This option is under “Edit” in the menu of the Main Window.
[11] OTU stands for Operational Taxonomic Unit.
The Tree Editor
The Tree Editor is a special graphical tool of a Tree Window and serves to display
and modify the appearance of a tree. Since a Tree Project may be composed of several
trees, we may have multiple Tree Editors in the same Tree Window.
Figure 22. The Tree Editor on a tab of a Tree Window
As we can see in Figure 22, in the Tree Editor it is possible to modify the appearance of
a phylogenetic tree through a set of options in a context menu that appears when a tip
of the tree is selected and right-clicked. The tips can also be moved along the
phylogenetic tree as desired. This feature is particularly useful to re-root a tree
graphically. In all these tip re-arrangements the branch lengths are always
conserved, to keep the distance relations among the sequences.
To adjust the width and height of the tree, the Tree Editor has scale factors to
shrink (proportionally) the tree graphics either horizontally (branch length scale)
or vertically (Y margin scale). Both options appear at the top of the Tree Editor,
as can be seen in Figure 22 and Figure 23.
When the properties of a selected tip are edited, a window like the one in
Figure 23 appears.
Figure 23. Editing the properties of a tip of a tree
In this figure we can see a whole tip that has been shrunk. This has been
done by unchecking the option “Expanded” in the properties. We have also given
it a particular name, which is displayed inside the box on the tree. We
can also change the branch length and the support or significance value of a tip,
even though the “legality” of this practice may be questioned.
All the fonts and colors can also be changed using the properties window. This
is quite straightforward, so we omit further explanation.
Finally, to remove a particular tree, use the red button
at the upper-right corner of the Tree Editor.
The Sequences Window
Figure 4 shows a Sequences Window displaying a list of DNA sequences. The
Sequences Window has already been introduced in the section “Importing data into
Bosque”, so we will not review again the import functionality that this
window provides.
When we select DNA sequences from this table and activate the context menu (with
a right-click over the selection) two options appear. The first option is the automatic
translation of the selected DNA sequences given a particular amino-acid sequence
template. Remember that in the Sequence Editor we can
translate a single DNA sequence by specifying that it is a “protein coding” DNA
sequence. The option in the Sequences Window is useful when we are interested
in the translation of multiple sequences, since it avoids a sequence-by-sequence
translation, a process that could be time-consuming depending on the number of
sequences to translate.
Figure 24. A massive translation of sequences from the Sequences Window using an amino-acid sequence template
In Figure 24 we can see the window for the translation of a set of DNA sequences.
The first option in this window is the selection of the translation table to use. Then
we can indicate a particular sequence already present in the local database to
serve as a template for the translation. Of course, this template sequence should
be homologous to the selected sequences for the translated
sequences to make sense. If we do not already have a homologous
sequence in the local database, we can copy and paste
the amino-acid template sequence directly into the lower box. Once these
parameters are set, we can press the OK button and the translation
will begin for all the selected sequences. After this process we can use these
sequences to construct phylogenetic trees of amino-acid sequences.
The other option in the context menu for a selection in the Sequences Window is
the tagging of sequences. Bosque can handle two tags per sequence, and the
user is free to use these tags as desired. Some people find it useful, for example, to
use tag1 as the gene name (or protein name in the case of amino-acid sequences),
and tag2 as a particular grouping that the user has created for the purposes of a
phylogenetic analysis.
Networking options of Bosque
Remote execution of jobs
Public channel
Sharing resources
The Bosque Server
In the previous section we have reviewed the networking options of Bosque, which
allow the users to use a remote server in order to perform actions such as the
remote execution of jobs and interactions with other users connected to the same
server.
In this section we will see how to install a Bosque Server on a Linux system. The
following figure shows the relations among the actors of a Bosque System.
Figure 25. A Bosque System
As we can see from Figure 25, Bosque applications can connect to
Bosque Servers, which are installed on typical Linux servers. The Bosque Server uses a
MySQL database to manage the data related to users and the jobs they execute.
In general the Bosque Server provides the following services:
• Remote job execution. Typical phylogenetic analyses include many
sequences, and the best phylogenetic analysis methods are resource-consuming.
For this reason, most users may want to run their computational
analyses on dedicated servers, which normally have much more main
memory and faster processors than common desktop computers. This provides
results in less time and avoids tying up a personal computer with
a long-running job.
• Interaction with other users. When two or more users are connected to the
same Bosque Server, they can share phylogenetic resources such as
sequences, alignments and trees in a way very similar to most file-sharing
programs. This feature makes cooperation on research projects easier,
since the users do not need to be aware of file formats of any kind. Finally, a
public channel for talking is provided on the Bosque Server.
Installation of the Bosque Server
The Bosque Server has the following prerequisites:
• A Linux server
• A MySQL server with a database and a user with privileges to create tables
The steps to install the Bosque Server are:
1. Download http://bosque.udec.cl/downloads/bosqueserver-0.9-1.i586.rpm
2. Install the server rpm (rpm -ivh bosqueserver-0.9-1.i586.rpm). The files
will be installed on /usr/local/bosque-server
3. Create a MySQL database (dbname) and user (dbuser)
4. Choose a TCP port in which the Bosque Server will listen
5. Use data of steps 3 and 4 to modify
/usr/local/bosque-server/etc/bosque-server.conf properly
6. Configure your Linux firewall (if any) to allow TCP connections to the port
chosen in step 4 above
7. Create the necessary tables on the database:
mysql -u dbuser dbname -p < /usr/local/bosque-server/sql/bosque.sql
8. Execute the server at /usr/local/bosque-server/sbin/bosque-server
Please note that if you choose a port lower than 1024 you will have
to run the server as root.
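Before settling on a port in step 4, it can help to check that the port is not already in use and whether it falls in the privileged range. The following Python sketch is one way to do that; it is not part of the Bosque Server, the function names are hypothetical, and port 5000 is only an example value.

```python
import socket

def needs_root(port):
    """Ports below 1024 are privileged on Linux."""
    return 0 < port < 1024

def port_is_free(port, host="0.0.0.0"):
    """Try to bind a TCP socket to the port; failure means it is unavailable."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True

candidate = 5000  # hypothetical port, chosen only for this example
print(candidate, "needs root:", needs_root(candidate))
print(candidate, "is free:", port_is_free(candidate))
```

The check only reflects the state of the machine at the moment it runs, so the firewall rule of step 6 is still required for remote clients to reach the port.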
Appendix
Sequence file formats
Fasta file format
The following was obtained from: http://en.wikipedia.org/wiki/FASTA_format
FASTA format is a text-based format for representing either nucleic acid
sequences or protein sequences, in which base pairs or protein residues are
represented using single-letter codes. The format also allows for sequence names
and comments to precede the sequences.
The simplicity of FASTA format makes it easy to manipulate and parse sequences
using text-processing tools and scripting languages like Python and Perl.
A sequence in FASTA format begins with a single-line description, followed by
lines of sequence data. The description line is distinguished from the sequence
data by a greater-than (">") symbol in the first column. The word following the ">"
symbol is the identifier of the sequence, and the rest of the line is the description
(both are optional). There should be no space between the ">" and the first letter of
the identifier. It is recommended that all lines of text be shorter than 80 characters.
The sequence ends if another line starting with a ">" appears; this indicates the
start of another sequence. A simple example of two sequences in FASTA format:
>SEQUENCE_1
MTEITAAMVKELRESTGAGMMDCKNALSETNGDFDKAVQLLREKGLGKAAKKADRLAAEG
LVSVKVSDDFTIAAMRPSYLSYEDLDMTFVENEYKALVAELEKENEERRRLKDPNKPEHK
IPQFASRKQLSDAILKEAEEKIKEELKAQGKPEKIWDNIIPGKMNSFIADNSQLDSKLTL
MGQFYVMDDKKTVEQVIAEKEKEFGGKIKIVEFICFEVGEGLEKKTEDFAAEVAAQL
>SEQUENCE_2
SATVSEINSETDFVAKNDQFIALTKDTTAHIQSNSLQSVEELHSSTINGVKFEEYLKSQI
ATIGENLVVRRFATLKAGANGVVNGYIHTNGRVGVVIAAACDSAEVASKSRDLLRQICMH
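The rules above are simple enough that a FASTA file can be read with a few lines of code. The following Python sketch is one possible reader, not the parser used by Bosque; the function name and the sample records are invented for the example.

```python
def parse_fasta(text):
    """Return a list of (identifier, description, sequence) tuples."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)
    if header is not None:
        records.append((header, "".join(chunks)))
    result = []
    for header, sequence in records:
        # The first word is the identifier; the rest of the line is the description.
        identifier, _, description = header.partition(" ")
        result.append((identifier, description, sequence))
    return result

sample = ">SEQ_1 first test record\nMTEIT\nAAMVK\n>SEQ_2\nSATVS\n"
for identifier, description, sequence in parse_fasta(sample):
    print(identifier, description, len(sequence))
```

Sequence lines are joined without separators, so the line length used in the file does not affect the result.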
Genbank file format
A sequence file in GenBank format can contain several sequences. One sequence
in GenBank format starts with a line containing the word LOCUS and a number of
annotation lines. The start of the sequence is marked by a line containing
"ORIGIN" and the end of the sequence is marked by two slashes ("//").
An example sequence in GenBank format is:
LOCUS       AAU03518      237 bp    DNA             PLN       04-FEB-1995
DEFINITION  Aspergillus awamori internal transcribed spacer 1 (ITS1) and 18S
            rRNA and 5.8S rRNA genes, partial sequence.
ACCESSION   U03518
BASE COUNT       41 a     77 c     67 g     52 t
ORIGIN
        1 aacctgcgga aggatcatta ccgagtgcgg gtcctttggg cccaacctcc catccgtgtc
       61 tattgtaccc tgttgcttcg gcgggcccgc cgcttgtcgg ccgccggggg ggcgcctctg
      121 ccccccgggc ccgtgcccgc cggagacccc aacacgaaca ctgtctgaaa gcgtgcagtc
      181 tgagttgatt gaatgcaatc agttaaaact ttcaacaatg gatctcttgg ttccggc
//
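Using only the markers described in this section (LOCUS, ORIGIN and "//"), the sequence data of a GenBank file can be extracted with a short Python sketch like the one below. It ignores all annotation lines, the function name is hypothetical, and the sample record is a shortened, invented one; a complete parser would need to handle many more fields.

```python
def parse_genbank_sequences(text):
    """Return a list of (locus name, sequence) pairs from GenBank-formatted text."""
    records = []
    locus, in_origin, chunks = None, False, []
    for line in text.splitlines():
        if line.startswith("LOCUS"):
            parts = line.split()
            locus = parts[1] if len(parts) > 1 else ""
            in_origin, chunks = False, []
        elif line.startswith("ORIGIN"):
            in_origin = True
        elif line.startswith("//"):
            if locus is not None:
                records.append((locus, "".join(chunks)))
            locus, in_origin, chunks = None, False, []
        elif in_origin:
            # Sequence lines mix position numbers, spaces and bases; keep the letters.
            chunks.append("".join(ch for ch in line if ch.isalpha()))
    return records

sample = (
    "LOCUS       TEST1      22 bp    DNA\n"
    "DEFINITION  Invented record for this example.\n"
    "ORIGIN\n"
    "        1 aacctgcgga aggatcatta\n"
    "       21 cc\n"
    "//\n"
)
print(parse_genbank_sequences(sample))
```

Because a file can contain several records, the function returns a list with one entry per LOCUS line.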
Index
Alignment Editor, 2, 6, 19, 27, 29
amino-acids, 4, 20, 22
ASCII, 29
best alignment, 25
BLAST, 13, 14, 15, 16, 18, 22, 29
Bootstrapping, 20
Bosque, 1, 2, 3, 4, 7, 8, 9, 10, 11, 13,
14, 16, 17, 18, 19, 20, 22, 23, 25,
26, 27, 29, 32, 33, 34
bosque-server, 7
branch length, 30, 31
CDS, 12, 13, 15, 24
Coding Data Sequence, 12
codons, 25
comparing sequences, 28
consensus-tree, 4
CSV, 29
Custom Name, 23
database, 2, 3, 4, 8, 9, 10, 11, 13, 15,
16, 17, 18, 22, 23, 26, 29, 32
Dayhoff, 20
dnadist, 20
Entrez, 10, 17
E-value, 17
F84, 20
fasta, 5, 10
Fasta, 10, 14, 35
FASTA, 22, 35
gap weight, 28
genbank, 5, 10, 18, 24
GenBank, 15, 16, 35, 36
group by similarity, 29
Hasegawa, 21
Henikoff/Tillier, 20
homologous sites, 6
Jones-Taylor-Thornton, 20
Jukes-Cantor, 20
Kimura 2 parameters, 20
maximum likelihood, 7
Maximum Likelihood, 20, 21
Multiple Document Interface, 4
Muscle, 6, 18, 19, 27
NCBI, 10, 11, 13, 15, 16
nucleotides, 4, 13, 15
OTU, 29
outgroup, 20, 21
percentage of similarity, 28, 29
Phylip, 7, 20
phylogenetic tree, 3, 4, 18, 20, 23, 26,
30
Project, 3
protein coding, 24, 32
Quartet Puzzle, 20, 21
quartet puzzling, 7
Rate Heterogeneity, 21
right-click, 5, 29, 30, 32
scissor, 6, 28
Sequence Editor, 2, 22, 23, 24, 25, 32
Sequences Tab, 3, 5, 18
Sequences Window, 2, 3, 9, 18, 22,
23, 26, 29, 32, 33
similarity table, 28, 29
SQLite, 4, 8
tag1, 23, 33
tag2, 23, 33
Tamura-Nei, 21
translation, 15, 24, 25, 26, 32
Tree Project, 3, 4, 18, 27, 29, 30
Tree Projects, 4
Tree Puzzle, 7
Tree Window, 3, 4, 5, 6, 7, 18, 19, 22,
27, 30
Tree Windows, 4
tree-reconstruction, 4, 6, 7, 19, 20,
21, 27
tree-reconstruction methods, 6, 7, 20
try and error practice, 8
TSV, 29