ADOPS - Automatic Detection Of Positively Selected Sites

Journal of Integrative Bioinformatics, 9(3):200, 2012
http://journal.imbio.de
David Reboiro-Jato1, Miguel Reboiro-Jato1, Florentino Fdez-Riverola1, Cristina P. Vieira4, Nuno A. Fonseca2,3, and Jorge Vieira4*
1 Departamento de Informática, Universidade de Vigo, Spain, http://www.esei.uvigo.es/
2 CRACS-INESC Porto LA, Universidade do Porto, Portugal, http://cracs.fc.up.pt/
3 EMBL-European Bioinformatics Institute, UK, http://www.ebi.ac.uk/
4 Instituto de Biologia Molecular e Celular, Universidade do Porto, Portugal, http://www.ibmc.up.pt
Summary
Maximum-likelihood methods based on models of codon substitution have been widely
used to infer positively selected amino acid sites that are responsible for adaptive changes.
Nevertheless, in order to use such an approach, software applications are required to align
protein and DNA sequences, infer a phylogenetic tree and run the maximum-likelihood
models. Therefore, a significant effort is made in order to prepare input files for the different software applications and in the analysis of the output of every analysis. In this paper
we present the ADOPS (Automatic Detection Of Positively Selected Sites) software. It was
developed with the goal of providing an automatic and flexible tool for detecting positively
selected sites given a set of unaligned nucleotide sequence data. An example of the usefulness of such a pipeline is given by showing, under different conditions, positively selected
amino acid sites in a set of 54 Coffea putative S-RNase sequences. ADOPS software is
freely available and can be downloaded from http://sing.ei.uvigo.es/ADOPS.
1 Introduction
Understanding the molecular basis of evolution is one of the main goals of Biology. Changes
in gene expression levels, as well as in protein sequences, may be adaptive and several processes to infer deviations from neutrality using DNA sequence data are available (see for instance, [1, 2, 3, 4, 5]). When changes at few amino acid sites are the target of selection,
such positively selected amino acid sites may be detected using maximum-likelihood methods based on models of codon substitution [6], implemented in the software developed by
Yang [7, 8], named codeml (footnote 1). This approach has been applied numerous times to infer positively selected amino acid sites in host immune response genes [9, 10], the envelope glycoprotein of the dengue viruses [11], the attachment glycoprotein of respiratory syncytial virus [12],
the haemagglutinin gene of measles virus [13], the influenza B virus hemagglutinin gene [14], HIV
genes [15], the hemagglutinin-neuraminidase gene of Newcastle disease virus [16], Trypanosoma
brucei genes [17], the vertebrate skeletal muscle sodium channel gene [18], the p53
gene [19], the fruitless gene in Anastrepha fruit flies [20], CC chemokine receptor proteins [21],
or the plant genes that are involved in gametophytic self-incompatibility specificity determination [22, 23, 24, 25], to name just a few.
* To whom correspondence should be addressed. Email: [email protected]
1 codeml: http://abacus.gene.ucl.ac.uk/software/paml.html
doi:10.2390/biecoll-jib-2012-200
Copyright 2012 The Author(s). Published by Journal of Integrative Bioinformatics.
This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivs 3.0 Unported License (http://creativecommons.org/licenses/by-nc-nd/3.0/).
The software developed by Yang requires as input files the aligned nucleotide coding sequences,
a tree showing the inferred relationship of the sequences being used, and a control file. Since the
programs commonly used to align nucleotide sequences and infer phylogenetic relationships
use different input file formats this is a tedious and time consuming process. Moreover, multiple
aspects of the process may interfere with the ability to infer positively selected amino acid sites
or, even worse, erroneously infer positively selected amino acid sites. Such issues have been
discussed in the published literature [26, 27, 28, 29], as well as in the PAML FAQ (footnote 2; PAML is the software
package that includes codeml), and are summarized below.
Under neutrality, the dN/dS (or Ka/Ks) ratio (frequently called ω) is expected to be
one [30, 31, 32]. When using codeml, the likelihood of models that do not allow codon classes
with a ω value larger than one can be compared with that of models that do. If models allowing
for classes with ω larger than one fit the data better, then this can be taken as evidence for positive
selection at the codon sites in the ω > 1 class. It is clear that there must be enough variation
in the data set for the methods to have any power. Neither highly similar sequences nor very
divergent sequences are informative, but it is very hard to specify exact values. When using
divergent sequences, the occurrence of multiple substitutions at the same sites may be hard to
infer (saturation of substitutions). Nevertheless, this is likely not a problem, as the performance
of the models does not seem to deteriorate until the sequences become very divergent (about
10 nucleotide substitutions per nucleotide site along the tree) [6, 27]. Moreover, a large tree
with many branches will be able to tolerate more changes than a small tree. High sequence
divergence is, nevertheless, often associated with other problems, such as the ability to infer the
correct sequence alignment.
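The model comparison described above is usually carried out as a likelihood-ratio test. The following is a minimal sketch, not code taken from ADOPS or PAML. It assumes that the two nested models differ by two free parameters, so that twice the log-likelihood difference is compared with a chi-square distribution with two degrees of freedom (for which the tail probability has the closed form exp(-x/2)). The function name and the log-likelihood values are hypothetical.

```python
import math
from typing import Tuple


def lrt_two_df(lnl_null: float, lnl_alt: float) -> Tuple[float, float]:
    """Likelihood-ratio test assuming 2 degrees of freedom.

    lnl_null: log-likelihood of the model without an omega > 1 class.
    lnl_alt:  log-likelihood of the model allowing an omega > 1 class.
    Returns (delta, p_value), where delta is twice the log-likelihood difference.
    """
    # Numerical noise can make the nested model look slightly better; clamp at 0.
    delta = max(2.0 * (lnl_alt - lnl_null), 0.0)
    # Survival function of the chi-square distribution with 2 degrees of freedom.
    p_value = math.exp(-delta / 2.0)
    return delta, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    delta, p = lrt_two_df(lnl_null=-5230.4, lnl_alt=-5221.9)
    print(f"delta = {delta:.2f}, p = {p:.6f}")
```

A significant result only indicates that the model with the additional ω class fits better; which sites fall in that class is a separate question.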
When performing multiple DNA sequence alignments, the sequences should be first translated
into amino acids and the resulting amino acid sequences aligned. Then, the DNA sequences
should be aligned, using as a guide the amino acid alignment, an option incorporated in few
software applications, such as DAMBE [33]. There is, however, no multiple sequence alignment scheme that outperforms the rest in producing reliable phylogenetic trees [34]. Therefore,
the possible effect on sequence relationship inferences of the use of a given alignment scheme
must be taken into consideration, either by using only codon sites that are aligned with high
confidence, or by using different alignment algorithms to determine which codons would be
identified as positively selected, no matter the alignment scheme used. Clearly, aligning nonorthologous codons may result in the detection of false positively selected amino acid sites.
When sequences are very divergent they may also have different sequence nucleotide compositions, and this may have an impact when inferring, through phylogenetic reconstruction, the
relationship of the different sequences.
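The protein-guided nucleotide alignment recommended above can be sketched as follows. This is an illustrative sketch under simple assumptions (each coding sequence is in frame, and the aligned protein is its direct translation); the function name and the toy sequences are hypothetical and do not come from DAMBE, T-Coffee or ADOPS.

```python
def thread_codons(aligned_protein: str, cds: str, gap: str = "-") -> str:
    """Map an aligned protein back onto its coding sequence, codon by codon.

    Every residue consumes the next codon of the coding sequence and every
    gap in the protein alignment becomes a three-character gap.
    """
    n_residues = sum(1 for aa in aligned_protein if aa != gap)
    if len(cds) < 3 * n_residues:
        raise ValueError("coding sequence is too short for the aligned protein")
    pieces = []
    pos = 0
    for aa in aligned_protein:
        if aa == gap:
            pieces.append(gap * 3)
        else:
            pieces.append(cds[pos:pos + 3])
            pos += 3
    return "".join(pieces)


if __name__ == "__main__":
    # Toy protein alignment and the matching unaligned coding sequences.
    aligned = {"seqA": "MK-L", "seqB": "MKTL"}
    coding = {"seqA": "ATGAAACTG", "seqB": "ATGAAGACCCTG"}
    for name, protein in aligned.items():
        print(name, thread_codons(protein, coding[name]))
```

Because gaps are always inserted in multiples of three, codons stay intact and the resulting nucleotide alignment keeps the reading frame.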
The presence of recombinant sequences in the data set can also be problematic, because the
sequence relationships can no longer be described by a single phylogenetic tree. The violation
of this basic assumption of the model affects the likelihood method for detecting positive
selection, generating false positives [27, 29]. Nevertheless, parameter-rich models, such as M7
(a nearly neutral model using a beta distribution and 10 classes) and M8 (a positive selection
model using a beta distribution and 10 classes, as well as one class where ω is larger than one),
seem to deal better with this issue than parameter-poor models such as M1 (a nearly neutral
model with two classes) and M2 (a positive selection model with three classes) [8]. There
are several methods available to detect recombinant sequences (most of them implemented in
the Recombination Detection Program (RDP3) software [35]), and thus, removing sequences
involved in such events should increase the performance of the method. Nevertheless, it is
likely that not all recombinant sequences are ever identified. One possibility is to use sequence
subsets to determine whether the results are dependent on the inclusion of a few sequences.
2 PAML FAQ webpage: http://abacus.gene.ucl.ac.uk/software/pamlFAQs.pdf
Genome sequence annotations are in many cases made by inference, and thus, the possibility
of wrong gene annotation should be taken into account (see for instance [36]). In some cases, it
is thus desirable to determine the possible impact of the inclusion of a few sequences that may
be wrongly annotated.
Addressing the issues above requires a significant amount of
file preparation. Here, we address the problem of automating the whole process of detecting
positively selected sites.
ADOPS has a graphical interface that provides an integrated view of all the results, and allows
comparison of multiple runs, side by side. ADOPS runs under the AIBench environment [37,
38] and, in principle, can run under a Windows or Unix-like environment. Nevertheless, there
is at this stage no 64-bit Windows T-Coffee software available, and thus the pipeline currently runs under
Unix-like environments only.
An example of the usefulness of the pipeline is given, by performing analyses with the Coffea
(Rubiaceae) putative S-RNase gene (a non-redundant set of 54 complete sequences), the female
component of the gametophytic self-incompatibility (GSI) system.
2 ADOPS Software
ADOPS starts with a file with a set of unaligned DNA sequences (in FASTA format). T-Coffee [39] is used to align and manipulate the sequences. T-Coffee allows the user to select a
variety of alignment methods and produces confidence statistics for each amino acid position
in the alignment. The phylogenetic tree is generated by MrBayes [40]. The tree and aligned
sequences are given as input to codeml to detect positively selected amino acid sites. ADOPS
software is freely available and can be downloaded from http://sing.ei.uvigo.es/ADOPS.
2.1 Workflow
The ADOPS workflow is depicted in Figure 1. There are three main stages in the workflow: 1)
alignment of the sequences; 2) phylogenetic tree reconstruction; and 3) detection of positively
selected amino acid sites. We next describe each of the stages in more detail.
Figure 1: Schematic ADOPS workflow.
2.1.1 Alignments
The first stage of the pipeline is the alignment. The nucleotide sequences are translated (using
the transeq tool available in the EMBOSS package) and aligned at the amino acid level using
T-Coffee [39, 41], a resourceful multiple sequence alignment (MSA) software.
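The translation step can be illustrated with a few lines of code. This sketch is not the EMBOSS transeq tool; it is a stand-in that assumes the standard genetic code, a sequence that starts in frame, and "X" for any codon containing an ambiguous base. All names are hypothetical.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(cds: str) -> str:
    """Translate an in-frame coding sequence; trailing partial codons are ignored."""
    cds = cds.upper().replace("U", "T")
    usable = len(cds) - len(cds) % 3
    return "".join(
        CODON_TABLE.get(cds[i:i + 3], "X")  # "X" marks codons with ambiguous bases
        for i in range(0, usable, 3)
    )


if __name__ == "__main__":
    print(translate("ATGAAACTGTAA"))
```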
T-Coffee allows the user to align sequences using third party aligners and provides a meta-method to combine the output of several individual methods into one single MSA. Furthermore,
it also includes an algorithm that integrates structural (or homology) information when building
MSAs.
The alignment generated by T-Coffee is used as a guide to produce the corresponding nucleotide
alignment. However, since not all amino acid positions in a given alignment will be aligned
with equal confidence, it is advisable to use only the amino acid positions that are well supported. As mentioned, T-Coffee allows the user to select a variety of alignment methods and
produces confidence statistics for each amino acid position in the alignment. ADOPS can filter out
amino acid positions in the alignment with a low confidence score (by default, aligned amino
acid positions with a confidence score below 3 are discarded).
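The confidence filter can be sketched as a simple column mask. The example below is an assumption-laden illustration, not ADOPS code: it supposes that one integer score per alignment column is already available (how T-Coffee reports its scores is not reproduced here), and the function name and toy data are hypothetical.

```python
from typing import Dict, List


def filter_columns(alignment: Dict[str, str], scores: List[int],
                   min_score: int = 3) -> Dict[str, str]:
    """Keep only the alignment columns whose confidence score is >= min_score."""
    lengths = {len(seq) for seq in alignment.values()}
    if lengths != {len(scores)}:
        raise ValueError("all sequences must have one score per column")
    keep = [i for i, score in enumerate(scores) if score >= min_score]
    return {
        name: "".join(seq[i] for i in keep)
        for name, seq in alignment.items()
    }


if __name__ == "__main__":
    # Toy amino acid alignment with one hypothetical score per column.
    toy = {"seqA": "MKTL", "seqB": "MR-L"}
    print(filter_columns(toy, scores=[9, 2, 5, 3]))
```

For the corresponding nucleotide alignment, the same mask would be applied to whole codons, that is, three nucleotide columns per discarded amino acid column.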
The size and divergence of the nucleotide sequences may have a considerable impact on the
number of gaps included in the MSA, which may reduce the number of informative positions
in the MSA. A position in a MSA is called informative if the column in the MSA does not
contain a gap. Only informative positions are used during phylogenetic tree reconstruction.
Therefore, using a subset of the initial nucleotide sequences may increase the signal to infer the
phylogenetic tree.
In the alignment stage ADOPS provides the option to automatically optimize the alignment,
i.e., maximize the number of informative positions. To this end, ADOPS includes an algorithm
that determines the subset of sequences that maximizes the number of informative positions.
The algorithm performs a greedy-search. At each iteration the algorithm evaluates the impact
on the MSA of removing a particular sequence. If the number of informative positions increases
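above a determined threshold then the sequence is removed and the algorithm continues. The algorithm stops when the “optimal” subset is found or when a user-defined maximum number of sequences has been removed.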
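A greedy search of this kind can be sketched as follows. This is a simplified illustration, not the ADOPS implementation: it works on a fixed alignment without re-aligning after each removal, it defines the gain as the increase in gap-free columns, and the names `min_gain` and `max_removed` are hypothetical stand-ins for the threshold and the user-defined removal limit.

```python
from typing import Dict, List, Tuple


def informative_positions(alignment: Dict[str, str]) -> int:
    """Count the alignment columns that contain no gap character."""
    if not alignment:
        return 0
    return sum(1 for column in zip(*alignment.values()) if "-" not in column)


def greedy_prune(alignment: Dict[str, str], min_gain: int = 1,
                 max_removed: int = 2) -> Tuple[Dict[str, str], List[str]]:
    """Repeatedly drop the sequence whose removal adds most informative positions."""
    current = dict(alignment)
    removed: List[str] = []
    while len(removed) < max_removed and len(current) > 2:
        base = informative_positions(current)
        best_name, best_gain = None, 0
        for name in current:
            rest = {k: v for k, v in current.items() if k != name}
            gain = informative_positions(rest) - base
            if gain > best_gain:
                best_name, best_gain = name, gain
        # Stop when no removal improves the alignment by at least min_gain.
        if best_name is None or best_gain < min_gain:
            break
        del current[best_name]
        removed.append(best_name)
    return current, removed


if __name__ == "__main__":
    toy = {"s1": "ATGAAA", "s2": "ATGAAA", "s3": "ATG---", "s4": "ATGAAA"}
    pruned, dropped = greedy_prune(toy)
    print(dropped, informative_positions(pruned))
```

Like any greedy procedure, this finds a locally good subset, which is why the text puts “optimal” in quotation marks.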
2.1.2 Phylogenetic trees
The phylogenetic tree is generated by MrBayes [40], a program for the Bayesian estimation of
phylogeny. MrBayes allows the implementation of a parameter-rich generalized time reversible
(GTR) model of sequence evolution, among-site rate variation, and a proportion of invariable
sites.
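For orientation, the model just described corresponds to a short MrBayes command block. The sketch below only builds such a block as text. It is not the block that ADOPS generates; the option names follow MrBayes 3.2 conventions and may differ in other versions, and the default values (500000 generations, sampling every 100th generation, 25% burn-in) are borrowed from the biological example later in this paper. Partitioning by codon position is deliberately left out.

```python
def mrbayes_block(ngen: int = 500000, samplefreq: int = 100,
                  burnin_fraction: float = 0.25) -> str:
    """Build a MrBayes command block for GTR + gamma + invariable sites."""
    lines = [
        "begin mrbayes;",
        "  set autoclose=yes nowarn=yes;",
        # nst=6 selects GTR; invgamma adds gamma rate variation and invariable sites.
        "  lset nst=6 rates=invgamma;",
        f"  mcmc ngen={ngen} samplefreq={samplefreq};",
        f"  sump relburnin=yes burninfrac={burnin_fraction};",
        f"  sumt relburnin=yes burninfrac={burnin_fraction};",
        "end;",
    ]
    return "\n".join(lines)


if __name__ == "__main__":
    print(mrbayes_block())
```

Such a block would typically be appended to the NEXUS file that holds the alignment.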
2.1.3 Positive selection detection
The MSA and respective tree are given as input to codeml to detect positively selected amino
acid sites. The MSA is automatically converted, using ALTER [42], to the NEXUS format in
order to be included in the codeml configuration file.
ADOPS can be configured to assess the impact in positive selection detection of different alignment strategies (i.e., different alignment algorithms or subsets of sequences). In this case, a tree
is generated for each MSA and then a consensus tree is built and given as input to codeml. By
default, the codeml [8] random-sites models M0, M1, M2, M3, M7 and M8 are used.
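To make the codeml step more concrete, the sketch below writes a control file that requests the random-sites models M0, M1, M2, M3, M7 and M8 through the NSsites variable. The variable names are standard codeml control-file variables, but the values chosen here (codon frequency model, starting values, file names) are illustrative assumptions and not the settings used by ADOPS.

```python
from typing import Sequence


def codeml_control(seqfile: str, treefile: str, outfile: str,
                   nssites: Sequence[int] = (0, 1, 2, 3, 7, 8)) -> str:
    """Return the text of a codeml control file for random-sites models."""
    settings = [
        ("seqfile", seqfile),
        ("treefile", treefile),
        ("outfile", outfile),
        ("noisy", 3),
        ("verbose", 0),
        ("runmode", 0),      # use the user-supplied tree
        ("seqtype", 1),      # codon sequences
        ("CodonFreq", 2),    # F3x4 codon frequencies (assumed choice)
        ("model", 0),        # one omega ratio for all branches
        ("NSsites", " ".join(str(m) for m in nssites)),
        ("icode", 0),        # universal genetic code
        ("fix_kappa", 0),
        ("kappa", 2),        # starting value, estimated by codeml
        ("fix_omega", 0),
        ("omega", 1),        # starting value, estimated by codeml
        ("cleandata", 0),
    ]
    width = max(len(key) for key, _ in settings)
    return "\n".join(f"{key:>{width}} = {value}" for key, value in settings) + "\n"


if __name__ == "__main__":
    # Hypothetical file names.
    print(codeml_control("aligned_codons.txt", "consensus.tre", "codeml_results.txt"))
```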
2.1.4 Implementation
In order to implement the end-user application for the workflow depicted in Figure 1 we have
used our previous AIBench project [37], a Java application framework for scientific software
development specifically designed to support the IPO (Input-Process-Output) model.
The integration of the results produced by T-Coffee and codeml is achieved by using a reference
sequence and a multiple track system (as implemented in PileLine [43]). By default, amino acid
sites identified as positively selected are shown, together with their probability of being positively
selected and whether they were identified using Naive Empirical Bayes (NEB) or Bayes
Empirical Bayes (BEB) analyses [8].
2.2 Available Functionalities
ADOPS includes a graphical user interface (GUI) which facilitates workflow configuration
and execution control. This GUI also provides several informative views and summaries, containing the most important information generated by the underlying programs. The ADOPS
GUI is divided into three main sections: (i) menu, with the main options of the application, (ii)
clipboard, which provides quick access to the projects and experiments, and (iii) main panel,
where the project view is shown. Figure 2 shows the ADOPS GUI with a project view
opened. As can be seen, through the project view the user can manage projects and experiments,
launch experiment executions and visualize the generated results.
Figure 2: ADOPS graphical user interface.
2.2.1 Data Types
ADOPS has two main classes of data types: (i) projects, which are associated with a single
FASTA file, and (ii) experiments, which are associated with a project and can be executed
with a custom configuration. A new project can be created from a file containing a set of
unaligned nucleotide sequences in FASTA format. For project creation, an empty folder is
needed (if it does not exist, it will be created) to store the project and its experiments’ files. In
order to apply the ADOPS workflow to an input FASTA file, an experiment is needed. Each
experiment has its own configuration, which overrides the project configuration, and stores the
result files generated during the workflow execution in a folder created inside the project’s
folder. Each experiment can be configured to use a subset of the project sequences or all of
them.
2.2.2 Configuration
The execution configuration of the ADOPS workflow, including the parameters of the underlying programs, is stored in a plain text file using the Java properties file format. This is a very simple
format where each line contains either a comment or one parameter separated from its assigned value
by an equals symbol (e.g. tcoffee.alignMethod=CLUSTALW2). The properties dialogs included in the ADOPS GUI make editing the configuration easier by providing helpful widgets
for each configuration property.
ADOPS uses a hierarchical configuration system, having three levels of configuration: (i) system configuration, (ii) project configuration and (iii) experiment configuration. System configuration is the default configuration used when creating a new project. Project configuration
overrides the system configuration and it is used when creating a new experiment. Finally,
experiment configuration overrides the project configuration establishing the final parameters
that are used when executing the ADOPS workflow. This hierarchical system encourages the
configuration reuse and facilitates the testing of different execution configurations.
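The three-level override scheme can be sketched with a simplified properties parser and a chained lookup. This is an illustration only: the parser handles plain key=value lines and comment lines but none of the other features of the Java properties format, and apart from tcoffee.alignMethod=CLUSTALW2, which appears in the text, all keys and values are hypothetical.

```python
from collections import ChainMap
from typing import Dict


def parse_properties(text: str) -> Dict[str, str]:
    """Parse simple key=value lines, skipping blank lines and # or ! comments."""
    props: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", "!")):
            continue
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


if __name__ == "__main__":
    system = parse_properties(
        "tcoffee.alignMethod=CLUSTALW2\ncodeml.models=M0,M1,M2,M3,M7,M8\n"
    )
    project = parse_properties("# project-level override\ntcoffee.alignMethod=MUSCLE\n")
    experiment = parse_properties("codeml.models=M1,M2\n")

    # Lookups try the experiment first, then the project, then the system defaults.
    effective = ChainMap(experiment, project, system)
    for key in sorted(effective):
        print(f"{key}={effective[key]}")
```

A chained lookup keeps each level's file small, since a level only needs to list the parameters it changes.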
2.2.3 Workflow Execution
Once an experiment is correctly configured, it can be executed. The workflow execution is
completely automated and controlled by ADOPS, which invokes the corresponding programs when
needed. During the execution, intermediate results are shown as soon as they are generated,
allowing the user to view them without having to wait for the workflow to finish. Given the
long execution time of some of the programs used, this is a useful feature, as it reduces
the waiting time of the user and allows the execution to be aborted if the intermediate results
are not as expected.
In a full workflow execution, ADOPS produces eleven result or summary views, including an
output log. For the T-Coffee program, aligned nucleotide and amino acid sequences are shown
in FASTA format. If any of the Clustal [44] alignment methods was chosen, then the contents of
the aln file are also shown. The phylogenetic tree generated by MrBayes is displayed both
as plain text and as a graphic tree using a customized version of the Archaeopteryx software
tool (footnote 3; see Figure 3). Additionally, a summary of the potential scale reduction factor (PSRF)
obtained with MrBayes is also shown. A view with the complete codeml output is shown
along with a summary of the Naive Empirical Bayes (NEB) and Bayes Empirical Bayes (BEB)
analysis results obtained for the selected models. Finally, when the execution ends, a summary
report is generated, and the positively selected sites (PSS) view is created using the amino acid
alignment results and the codeml probabilities information. As can be seen in Figure 4, PSS
is a configurable view where the selected sites are highlighted with different colors depending
on the NEB and BEB probabilities for the chosen model.
All the files generated during the workflow execution, along with the summaries displayed in
the ADOPS GUI are stored in the experiment’s folder and can be accessed and opened with any
plain text editor.
Figure 3: Phylogenetic tree visualization with Archaeopteryx.
3 Archaeopteryx: http://www.phylosoft.org/archaeopteryx/
Figure 4: Positively selected sites view.
Figure 5: Coffea S-RNase Ka/Ks scatter plot.
3 Biological example
GSI is a genetic mechanism present in flowering plants that prevents self-fertilization, by enabling the pistil to reject pollen from genetically related individuals [45]. The S-pistil gene
product in Rosaceae, Solanaceae and Plantaginaceae is an extracellular ribonuclease, called S-RNase [46, 47]. Phylogenetic analyses suggest that RNase-based GSI has evolved only once,
before the split of the Asteridae and Rosidae, about 120 million years ago [48, 49, 50]. For
the Rosaceae and Solanaceae families, the pistil gene shows the expected features for a gene determining GSI specificity, namely, expression in pistils only, high levels of synonymous and
non-synonymous divergence, as well as positively selected amino acid sites that account for the
many specificities known to be present in natural populations [24]. The putative Coffea (Rubiaceae) S-RNase gene presents expression in pistil tissues and polymorphism compatible with
that observed for the Solanaceae, Plantaginaceae and Rosaceae S-RNase genes [24]. Here we use
ADOPS to infer the presence/absence of amino acids under positive selection, those that are
involved in determining S-allele specificities. We also address the effect of the different alignments on the identification of those amino acid sites. In contrast with Solanaceae, in Coffea,
recently derived S-allele paralogs have been found [47, 51]. Such sequences could be amplified in all 10 individuals from three Coffea species analyzed [51]. These sequences show the
amino acid pattern [HY]EW found in 54 out of 69 S-like RNases and in only seven out of 658
S-RNase sequences [50]. This motif is, however, also present in 10 of the putative Coffea S-RNase sequences. Therefore, we also look for positively selected amino acid sites in sequences
harboring the HEW motif only.
The putative S-RNase alleles of self-incompatible Coffea species here used are from Nowak
et al. [47]. The C. andrambovatensis S8 and C. tsirananae S8 sequences have not been used
because they are shorter than 350 bp. It should be noted that C. resinosa S10 sequence is not
available in GenBank. Therefore, our data set contains 54 Coffea putative S-RNase sequences.
C. resinosa S51 is assigned as S53 in GenBank, while C. canephora S17 and S27 sequences
are assigned as S40 and S43 in GenBank. The 54 Coffea putative S-RNase sequences were
aligned using the ClustalW algorithm as implemented in ADOPS. This implies that sequences
were first translated into their corresponding amino acid sequences, aligned, and the resulting
alignment was then used as a guide to align the corresponding DNA sequences.
Pair-wise Ka (the Jukes-Cantor corrected non-synonymous rate of substitution per non-synonymous site) and Ks (the Jukes-Cantor corrected synonymous rate of substitution per synonymous site) values were estimated using DnaSP [52]. Figure 5 shows the Ka/Ks scatter plot.
There are no pair-wise comparisons that produce a Ka/Ks ratio (frequently called ω) above
one. It should be noted that the average Ka and Ks values are 0.18 and 0.56, respectively.
Therefore, the sequences are very divergent at the nucleotide level, although a few sequence
pairs show little divergence (those results close to the origin in Fig. 5). Despite this high level
of divergence, the alignment is, for the most part, not ambiguous, since few gaps are introduced in the alignment, independently of the multiple sequence alignment algorithm used (data
not shown). Nevertheless, inspection of the tree generated with MrBayes (as implemented in
ADOPS with the standard options, i.e., the use of a GTR model with different gamma distributions for the first/second codon positions and third codon positions; 500000 iterations,
sampling every 100th iterations and a burn-in of 25%) reveals that many allele relationships
are poorly resolved (data not shown). When using the RDP (1000 permutations), GeneConv,
Bootscan, MaxChi, Chimaera, SiScan and 3seq methods, implemented in the RDP3 software
package [35], and our set of 54 sequences aligned using ClustalW, as implemented in ADOPS,
no recombinant sequences are, however, detected. If present, recombinant sequences should
have been detected, given the observed high sequence divergence [29].
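The Jukes-Cantor correction mentioned above for the Ka and Ks estimates converts an observed proportion of differing sites p into an estimated number of substitutions per site, d = -(3/4) ln(1 - 4p/3). The sketch below shows only this correction and a ratio computed from the average values reported in the text; it does not reproduce how DnaSP counts synonymous and non-synonymous sites, and the function name is hypothetical.

```python
import math


def jukes_cantor(p: float) -> float:
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0.0 <= p < 0.75:
        raise ValueError("p must be in the interval [0, 0.75)")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The correction grows with p because multiple hits become more likely.
    for p in (0.05, 0.30, 0.50):
        print(f"p = {p:.2f} -> d = {jukes_cantor(p):.3f}")

    # Ratio of the average Ka and Ks values reported in the text. This is the
    # ratio of the averages, not the average of the pair-wise ratios.
    average_ka, average_ks = 0.18, 0.56
    print(f"Ka/Ks from averages = {average_ka / average_ks:.2f}")
```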
Figure 6: Positively selected amino acid sites. Gray - NEB (Naive Empirical Bayes) and BEB
(Bayes Empirical Bayes) > 95%; Underscore - 90% < NEB and BEB < 95%. White spaces were
introduced in the alignment in order to have a better view of the orthologous positions, under the
different multiple alignments produced by different algorithms. Bold - positively selected amino
acids identified with model M8 but not with model M2. For T-Coffee, results are only shown for
model M2, since there was no convergence when using model M8.
Figure 6 summarizes the results of the different analyses. Positively selected amino acid sites
are detected when the complete data set is used, and this finding is independent of the multiple sequence alignment algorithm used (runs were performed using a subset of the multiple
sequence alignment algorithms implemented in ADOPS, namely, ClustalW, Muscle and T-Coffee). For this data set, such independence is not unexpected, given that, for the most part,
the alignment is not ambiguous. Therefore, this set of putative S-RNases shows the expected
signature of true S-RNases, which supports the notion that at least some of the sequences in the data
set are S-RNases. When only the sequences not having the unexpected HEW motif (see introduction) are used (this analysis as well as those described below can be easily done with the
Select Sequences option available in ADOPS, otherwise it would require a significant amount
of file preparation), five positively selected amino acid sites are detected. Since there is little
ambiguity in the alignment (see above), for this analysis we used the ClustalW algorithm only.
When only the 10 sequences showing the unexpected HEW motif (see introduction) are used,
no positively selected amino acid sites are detected. It should be noted that, on average, the
sequences harboring the HEW motif are as divergent as the whole set (the Ka and Ks values
are 0.17 and 0.57 for the sequences containing the HEW motif and 0.18 and 0.56 for the entire
data set). Nevertheless, it could be argued that with 10 sequences only, there is little power
to detect positively selected amino acid sites. In order to address this issue, we performed 40
runs, each with 10 non-HEW randomly selected sequences (Figure 7). Indeed, when using the
M2 model 75% of those runs do not produce evidence for positively selected amino acid sites,
while when using the M8 model, 17.5% of those runs do not produce evidence for positively
selected amino acid sites.
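The resampling design described above (40 runs, each with 10 randomly selected non-HEW sequences) can be sketched as follows. This is an illustration, not the procedure used by the authors: the sequence names are placeholders, the pool size of 44 is simply 54 sequences minus the 10 carrying the HEW motif, and the fixed seed is an assumption made so that the draw is reproducible.

```python
import random
from typing import List, Sequence


def random_subsets(names: Sequence[str], subset_size: int = 10,
                   runs: int = 40, seed: int = 0) -> List[List[str]]:
    """Draw `runs` random subsets, each sampled without replacement."""
    rng = random.Random(seed)  # fixed seed so the subsets can be regenerated
    return [sorted(rng.sample(list(names), subset_size)) for _ in range(runs)]


if __name__ == "__main__":
    # Placeholder names for the 44 sequences that lack the HEW motif.
    non_hew = [f"seq{i:02d}" for i in range(1, 45)]
    subsets = random_subsets(non_hew)
    print(len(subsets), "subsets; first:", subsets[0])
```

Each subset would then be run through the full pipeline as a separate experiment, and the fraction of runs without evidence for positive selection tallied per model.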
In conclusion, as expected, the set of 54 putative S-RNase Coffea sequences here examined,
shows evidence for positively selected amino acid sites. Although sequences with the unexpected HEW motif do not show evidence for positively selected amino acid sites, this could be
due to the low sample size (10 sequences). The number of amino acids under positive selection in Coffea (Rubiaceae) and Solanaceae (the two groups have been diverging for 90 million
years [47]) for the orthologous region here analysed is similar (7 versus 10, respectively). Thus,
there is support for these sequences being the S-RNase gene in Coffea.
4 Conclusions
ADOPS is well suited to research projects involving the analysis of tens of genes or detailed analyses of a single gene. By using ADOPS the tedious and time consuming process of input file
preparation for the software applications included in the pipeline is eliminated. For instance,
without ADOPS, for the biological example here given, more than 100 input files would need to
be prepared. Moreover, ADOPS provides a graphical interface that greatly eases the interpretation of the results obtained with the software applications included in the pipeline. The
entire pipeline can also be run using a command line option, making it adequate for processing hundreds or thousands of genes. ADOPS software can be freely downloaded from
http://sing.ei.uvigo.es/ADOPS/.
Figure 7: Evidence for positive selection when using 10 randomly chosen non-HEW Coffea putative S-RNase sequences. Delta - twice the difference of the log-likelihoods of the models being
compared. A - M2 (Positive Selection) vs M1 (Nearly Neutral); B - M8 (beta & ω > 1) vs M7 (beta).
The line indicates the level above which the difference in the log-likelihoods is significant.
Acknowledgements
This work has been partially supported by the Spanish Ministry of Science and Innovation, the Plan E from the Spanish Government and the European Union through
the ERDF (TIN2009-14057-C03-02 and AIB2010PT-00353), and by the projects PTDC/EIA-EIA/100897/2008, PTDC/BIA-BEC/100616/2008, PTDC/EIA-EIA/121686/2010, and PTDC/BIA-BEC/099933/2008 comp-01-0124-FEDER-008916, funded by Programa Operacional para
a Ciência e Inovação (POCI) 2010, co-funded by Fundo Europeu para o Desenvolvimento Regional (FEDER) funds and Programa Operacional para a Promoção da competividade (COMPETE; FCOMP-01-0124-FEDER-008923 and FCOMP-01-0124-FEDER-022701).
References
[1] R. R. Hudson, M. Kreitman, and M. Aguade. A Test of Neutral Molecular Evolution Based on Nucleotide Data. Genetics, 116(1):153–159, 1987.
[2] F. Tajima. Statistical Method for Testing the Neutral Mutation Hypothesis by DNA Polymorphism. Genetics, 123(3):585–595, 1989.
[3] Y. X. Fu. Statistical Tests of Neutrality of Mutations Against Population Growth, Hitchhiking and Background Selection. Genetics, 147(2):915–925, 1997.
[4] J. K. Kelly. A test of neutrality based on interlocus associations. Genetics, 146(3):1197–1206, 1997.
[5] J. H. McDonald and M. Kreitman. Adaptive protein evolution at the Adh locus in Drosophila. Nature, 351:652–654, 1991.
[6] Z. Yang and R. Nielsen. Synonymous and nonsynonymous rate variation in nuclear genes of mammals. Journal of Molecular Evolution, 46(4):409–418, 1998.
[7] Z. Yang. PAML: a program package for phylogenetic analysis by maximum likelihood. Computer applications in the biosciences: CABIOS, 13(5):555–556, 1997.
[8] Z. Yang. PAML 4: Phylogenetic Analysis by Maximum Likelihood. Molecular Biology and Evolution, 24, 2007.
[9] F. M. Jiggins and K. W. Kim. A screen for immunity genes evolving under positive selection in Drosophila. Journal of Evolutionary Biology, 20(3):965–970, 2007.
[10] R. Morales-Hojas, C. P. Vieira, M. Reis, and J. Vieira. Comparative analysis of five immunity-related genes reveals different levels of adaptive evolution in the virilis and melanogaster groups of Drosophila. Heredity, 102(6):573–578, 2009.
[11] S. S. Twiddy, C. H. Woelk, and E. C. Holmes. Phylogenetic evidence for adaptive evolution of dengue viruses in nature. Journal of General Virology, 83(7):1679–1689, 2002.
[12] C. H. Woelk and E. C. Holmes. Variable immune-driven natural selection in the attachment (G) glycoprotein of respiratory syncytial virus (RSV). Journal of Molecular Evolution, 52(2):182–192, 2001.
[13] C. H. Woelk, L. Jin, E. C. Holmes, and D. W. G. Brown. Immune and artificial selection in the haemagglutinin (H) glycoprotein of measles virus. Journal of General Virology, 82(10):2463, 2001.
[14] J. Shen, B. D. Kirk, J. Ma, and Q. Wang. Diversifying selective pressure on influenza B virus hemagglutinin. Journal of Medical Virology, 81(1):114–124, 2009.
[15] W. Yang, J. P. Bielawski, and Z. Yang. Widespread adaptive evolution in the human immunodeficiency virus type 1 genome. Journal of Molecular Evolution, 57(2):212–221, 2003.
[16] M. Gu, W. Liu, L. Xu, Y. Cao, C. Yao, S. Hu, and X. Liu. Positive selection in the hemagglutinin-neuraminidase gene of Newcastle disease virus and its effect on vaccine efficacy. Virology, 8(1):150, 2011.
[17] R. D. Emes and Z. Yang. Duplicated paralogous genes subject to positive selection in the genome of Trypanosoma brucei. PloS one, 3(5):e2295, 2008.
[18] J. Lu, J. Zheng, Q. Xu, K. Chen, and C. Zhang. Adaptive evolution of the vertebrate skeletal muscle sodium channel. Genetics and Molecular Biology, 34(2):323–328, 2011.
[19] M. M. Khan, A. M. Rydén, M. S. Chowdhury, M. D. Hasan, and J. U. Kazi. Maximum likelihood analysis of mammalian p53 indicates the presence of positively selected sites and higher tumorigenic mutations in purifying sites. Gene, 483(1-2):29–35, 2011.
[20] I. Sobrinho and R. de Brito. Evidence for positive selection in the gene fruitless in Anastrepha fruit flies. BMC Evolutionary Biology, 10(1):293, 2010.
[21] K. J. Metzger and M. A. Thomas. Evidence of positive selection at codon sites localized in extracellular domains of mammalian CC motif chemokine receptor proteins. BMC Evolutionary Biology, 10(1):139, 2010.
[22] C. P. Vieira, D. Charlesworth, and J. Vieira. Evidence for rare recombination at the gametophytic self-incompatibility locus. Heredity, 91(3):262–267, 2003.
[23] M. D. S. Nunes, R. A. M. Santos, S. M. Ferreira, J. Vieira, and C. P. Vieira. Variability patterns and positively selected sites at the gametophytic self-incompatibility pollen SFB gene in a wild self-incompatible Prunus spinosa (Rosaceae) population. New phytologist, 172(3):577–587, 2006.
[24] J. Vieira, R. Morales-Hojas, R. A. M. Santos, and C. P. Vieira. Different positively selected sites at the gametophytic self-incompatibility pistil S-RNase gene in the Solanaceae and Rosaceae (Prunus, Pyrus, and Malus). Journal Molecular Evolution, 65(2):175–185, 2007.
[25] J. Vieira, R. A. M. Santos, S. M. Ferreira, and C. P. Vieira. Inferences on the number and frequency of S-pollen gene (SFB) specificities in the polyploid Prunus spinosa. Heredity, 101(4):351–358, 2008.
[26] Z. Yang and J. P. Bielawski. Statistical methods for detecting molecular adaptation. Trends in Ecology & Evolution, 15(12):496–503, 2000.
[27] M. Anisimova, J. P. Bielawski, and Z. Yang. Accuracy and power of Bayes prediction of amino acid sites under positive selection. Molecular biology and evolution, 19(6):950–958, 2002.
[28] Z. Yang and W. J. Swanson. Codon-substitution models to detect adaptive evolution that account for heterogeneous selective pressures among site classes. Molecular Biology and Evolution, 19(1):49–57, 2002.
[29] C. Casola and M. W. Hahn. Gene conversion among paralogs results in moderate false detection of positive selection using likelihood methods. Journal of Molecular Evolution, 68(6):679–687, 2009.
[30] M. Kimura. The neutral theory of molecular evolution. Cambridge Univ Pr, 1985.
[31] T. Ohta. The nearly neutral theory of molecular evolution. Annual Review of Ecology and Systematics, 23:263–286, 1992.
[32] M. Kimura. Preponderance of synonymous changes as evidence for the neutral theory of molecular evolution. Nature, 1977.
[33] X. Xia and Z. Xie. DAMBE: software package for data analysis in molecular biology and evolution. Heredity, 92(4):371–373, 2001.
[34] N. Essoussi, K. Boujenfa, and M. Limam. A comparison of MSA tools. Bioinformation, 2(10):452, 2008.
[35] D. P. Martin, P. Lemey, M. Lott, V. Moulton, D. Posada, and P. Lefeuvre. RDP3: a flexible and fast computer program for analyzing recombination. Bioinformatics (Oxford, England), 26(19):2462–3, 2010.
[36] M. Reis, S. Sousa-Guimarães, C. P. Vieira, C. E. Sunkel, and J. Vieira. Drosophila genes that affect meiosis duration are among the meiosis related genes that are more often found duplicated. Plos One, 10(3):e17512, 2011.
[37] D. Glez-Peña, M. Reboiro-Jato, P. Maia, M. Rocha, F. Díaz, and F. Fdez-Riverola. AIBench: A rapid application development framework for translational research in biomedicine. Computer Methods and Programs in Biomedicine, 98:191–203, May 2010.
[38] H. López-Fernández, M. Reboiro-Jato, D. Glez-Peña, J. R. Méndez Reboredo, H. M. Santos, R. J. Carreira, J. L. Capelo-Martínez, and F. Fdez-Riverola. Rapid development of proteomic applications with the AIBench framework. Journal of Integrative Bioinformatics, 8(3):171, 2011.
[39] C. Notredame, D. G. Higgins, and J. Heringa. T-Coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1):205–217, 2000.
[40] J. P. Huelsenbeck and F. Ronquist. MRBAYES: Bayesian inference of phylogenetic trees. Bioinformatics, 17(8):754–755, August 2001.
[41] J. F. Taly, C. Magis, G. Bussotti, J. M. Chang, P. Di Tommaso, I. Erb, J. Espinosa-Carrasco, C. Kemena, and C. Notredame. Using the T-Coffee package to build multiple sequence alignments of protein, RNA, DNA sequences and 3D structures. Nature Protocols, 6(11):1669–1682, 2011.
[42] D. Glez-Peña, D. Gómez-Blanco, M. Reboiro-Jato, F. Fdez-Riverola, and D. Posada. ALTER: program-oriented conversion of DNA and protein alignments. Nucleic Acids Research, 38(suppl 2):W14–W18, 2010.
[43] D. Glez-Peña, G. Gómez-López, M. Reboiro-Jato, F. Fdez-Riverola, and D. G. Pisano. PileLine: a toolbox to handle genome position information in next-generation sequencing studies. BMC Bioinformatics, 12(1), 2011.
[44] M. A. Larkin, G. Blackshields, N. P. Brown, R. Chenna, P. A. McGettigan, H. McWilliam, F. Valentin, I. M. Wallace, A. Wilm, R. Lopez, J. D. Thompson, T. J. Gibson, and D. G. Higgins. Clustal W and Clustal X version 2.0. Bioinformatics, 23(21):2947–2948, 2007.
[45] D. de Nettancourt. Incompatibility in angiosperms. Sexual Plant Reproduction, 10(4):185–199, 1997.
[46] E. H. Roalson and A. G. McCubbin. S-RNases and sexual incompatibility: structure, functions, and evolutionary perspectives. Molecular phylogenetics and evolution, 29(3):490–506, 2003.
[47] M. D. Nowak, A. P. Davis, F. Anthony, and A. D. Yoder. Expression and Trans-Specific Polymorphism of Self-Incompatibility RNases in Coffea (Rubiaceae). PloS one, 6(6):e21019, 2011.
[48] B. Igic and J. R. Kohn. Evolutionary relationships among self-incompatibility RNases. Proceedings of the National Academy of Sciences, 98(23):13167, 2001.
[49] J. E. Steinbachs and K. E. Holsinger. S-RNase–mediated gametophytic self-incompatibility is ancestral in Eudicots. Molecular Biology and Evolution, 19(6):825–829, 2002.
[50] J. Vieira, N. A. Fonseca, and C. P. Vieira. An S-RNase-based gametophytic self-incompatibility system evolved only once in eudicots. Journal of Molecular Evolution, 67(2):179–190, 2008.
[51] E. Asquini, M. Gerdol, D. Gasperini, B. Igic, G. Graziosi, and A. Pallavicini. S-RNase-like Sequences in Styles of Coffea (Rubiaceae). Evidence for S-RNase Based Gametophytic Self-Incompatibility? Tropical Plant Biology, 4(3–4):237–249, 2011.
[52] P. Librado and J. Rozas. DnaSP v5: a software for comprehensive analysis of DNA polymorphism data. Bioinformatics, 25(11):1451–1452, 2009.