Download Evolutionary Distances for Protein

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Biochemistry wikipedia , lookup

Multi-state modeling of biomolecules wikipedia , lookup

Molecular ecology wikipedia , lookup

Biosynthesis wikipedia , lookup

Protein structure prediction wikipedia , lookup

Homology modeling wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Point mutation wikipedia , lookup

Genetic code wikipedia , lookup

Molecular evolution wikipedia , lookup

Transcript
Evolutionary Distances for Protein-Coding Sequences: Modeling SiteSpecific Residue Frequencies
Aaron L. Halpern1 and William J. Bruno
Theoretical Biology and Biophysics, Los Alamos National Laboratory; and Santa Fe Institute
Estimation of evolutionary distances from coding sequences must take into account protein-level selection to avoid
relative underestimation of longer evolutionary distances. Current modeling of selection via site-to-site rate heterogeneity generally neglects another aspect of selection, namely position-specific amino acid frequencies. These frequencies determine the maximum dissimilarity expected for highly diverged but functionally and structurally conserved sequences, and hence are crucial for estimating long distances. We introduce a codon-level model of coding
sequence evolution in which position-specific amino acid frequencies are free parameters. In our implementation,
these are estimated from an alignment using methods described previously. We use simulations to demonstrate the
importance and feasibility of modeling such behavior; our model produces linear distance estimates over a wide
range of distances, while several alternative models underestimate long distances relative to short distances. Siteto-site differences in rates, as well as synonymous/nonsynonymous and first/second/third-codon-position differences,
arise as a natural consequence of the site-to-site differences in amino acid frequencies.
Introduction
The estimation of evolutionary distances from genetic sequences is important both for our understanding
of evolutionary history and for our understanding of
evolutionary processes, including matters such as mutation rates, population sizes, and selectional pressures.
If mutation rates were constant from site to site and the
process of fixation occurred purely through genetic drift
(i.e., no selection), as may be the case for pseudogenes
or other noncoding regions of a genome, models of evolution and methods of inferring evolutionary distances
based purely on the mutation process, such as the JukesCantor, Kimura two-parameter, and Felsenstein/HKY
models, would be adequate; see Swofford et al. (1996)
for discussion of such models. However, when evolutionary distances are calculated based on coding sequences, as is often the case, failure to take into account
effects of selection may lead to substantial underestimates of evolutionary distances. The factor by which
distances are underestimated will increase with increasing distance.
That (protein-level) selection has a significant impact on sequence evolution and on the estimation of
evolutionary patterns (phylogeny, patterns of synonymous and nonsynonymous substitutions, etc.) is amply
clear, as illustrated by the interest in models of evolution
which vary either rates of change at a site (Felsenstein
and Churchill 1996; Yang 1996) or rates of change between different character states (e.g., the use of PAM
matrices as in Felsenstein [1993] or biochemical measures of similarity between amino acid residues in calculating distances between proteins as in Goldman and
Yang [1994] and Thorne, Goldman, and Jones [1996]).
1 Present address: Department of Molecular Genetics and Microbiology, Health Sciences Center, University of New Mexico.
Key words: site-specific frequencies, evolutionary distances, selection, maximum likelihood, saturation, variable-rate models.
Address for correspondence and reprints: Aaron L. Halpern, Department of Molecular Genetics and Microbiology, Health Sciences
Center, University of New Mexico, 915 Camino de Salud, Albuquerque, New Mexico 87131. E-mail: [email protected].
Mol. Biol. Evol. 15(7):910–917. 1998
q 1998 by the Society for Molecular Biology and Evolution. ISSN: 0737-4038
910
From work on nucleic acid sequence evolution,
however, it is clear that not only rates of substitution
between states, but also the equilibrium frequencies of
the different states are important; Felsenstein (1981) and
Gojobori (1983) show that changes in equilibrium frequencies may strongly influence estimates of the degree
of mutational saturation which sequences have achieved.
We show here that similar effects on the protein level
due to selection must be modeled in order to achieve
accurate estimates of the evolutionary distances for
highly divergent coding sequences.
Large-scale sequencing efforts are leading to a rapid increase in both number and size of data sets containing highly divergent but clearly homologous and
functionally conserved protein-coding sequences. For
coding sequence data sets of sufficient size and diversity, amino acid frequencies can be estimated directly
from a multiple alignment of the sequences, and these
can be incorporated into a model of protein evolution
specific to that family of sequences, much as positionspecific scoring matrices, profiles, and Markov models
which make use of site-specific amino acid frequencies
have come to be used in homology searches (Henikoff
and Henikoff 1992, 1994; Baldi et al. 1994; Krogh et
al. 1994; Eddy, Mitchison, and Durbin 1995; Gribskov
and Veretnik 1996).
In order to enable such an approach, we provide a
model of nucleic acid sequence evolution which is codon-based and employs site-specific modeling of the effects of protein-level selection. The model of selection
is a simple one, in which the following four conditions
hold: (1) The effects of selection are on the probabilities
of fixation once a mutation has occurred in an otherwise
homogeneous population, with fixation being rapid relative to mutation so that polymorphism within a population may be neglected, as modeled by Kimura (1962).
(2) Selectional pressures (i.e, fixation probabilities) are
assumed to be constant at a given position for all lineages in the alignment under consideration. (3) Each
position is assumed to be independent of all other positions. (4) Only purifying (negative) selection is present. Since the model is codon-based, it could be used to
Site-Specific Model for Coding Sequences
model codon bias, but to keep the number of parameters
more manageable, we will normally assume that selection depends only on the encoded amino acid. Without
denying that such issues may be biologically and analytically important, we further neglect variation in the
form of the mutational and selectional matrices between
lineages, other forms of selection (RNA and/or DNA
structure requirements, etc.), and uncertainty of alignment.
Thus, a site-invariant model of codon-to-codon mutation is combined with site-specific estimates of equilibrium frequencies of each amino acid. Rates of replacement are determined for each site from these two
components according to the effect of selection on the
probability of fixation of mutations, following Kimura
(1962) and Golding and Felsenstein (1990). A consequence of the model, rather than a feature which is explicitly modeled, is that rates of substitution are lower
at positions under greater purifying selection. From this
model, evolutionary distances between pairs of taxa may
be calculated by the method of maximum likelihood.
Likewise, data sets may be simulated according to given
parameter values. By estimating parameters from real
data sets and then simulating sequences from these parameter values, we may construct realistic data sets with
known pairwise evolutionary distances.
We will show, through analysis of such simulated
sequences, that a variety of current methods for estimating evolutionary distances which do not take sitespecific character frequencies into accounts are in danger of underestimating true distances under conditions
which are empirically relevant. Methods for which this
may be a problem include those which model mutational
biases in varying detail but do not model selection at
all (from Jukes-Cantor to the general time-reversible
model of mutation); those which model selection in
terms of an average effect on substitution probabilities,
such as the use of PAM matrices on protein sequences
in PROTDIST (Felsenstein 1993); models based on biochemical measures of similarity (Goldman and Yang
1994); models including a synonymous/nonsynonymous
distinction (Muse and Gaut 1994); and models allowing
different rates of substitution between positions (Felsenstein and Churchill 1996; Yang 1996). Recent works by
Mitchison and Durbin (1995) and Thorne, Goldman, and
Jones (1996) which distinguish substitution patterns for
a few classes of positions—by similarity of amino acid
usage in the former report and by secondary structure
in the latter—go part way toward modeling selectional
character biases but, in our estimation, are unlikely to
be sufficiently detailed to more than partially overcome
the problem. Real data sets which are currently being
analyzed to determine times of divergence for early evolutionary events (e.g., Eigen and Nieselt-Struwe 1990;
Myers, MacInnes, and Korber 1992; Chan et al. 1995;
Wray, Levinton, and Shapiro 1996) are likely to lead to
underestimations of the more remote divergence times
using these methods. Furthermore, we will show that for
a sufficiently large and diverse alignment, it is possible
to estimate the equilibrium frequencies at each site with
911
sufficient accuracy to obtain estimates of evolutionary
distances which are linear out to very large distances.
Materials and Methods
Model
The qualitative impression one gets from working
with protein alignments is that while some positions in
a protein permit almost any residue (and indeed loop
regions permit extensive length variation), many positions in an alignment appear to allow only a limited
subset of amino acids, or at least show strong preferences for a few (sometimes only one) specific residues.
Without denying that covariation and structural or functional evolution may lead proteins from different taxa to
be under different selectional pressures at a given position, we will pursue a model of coding-sequence evolution in which each position in a protein is under sitespecific selectional pressures which have remained constant since the common ancestor of the data set at hand.
These pressures are assumed to have a twofold effect: they influence, for each site, the relative frequencies of each residue and, in combination with a model
of the mutational process, determine the rates at which
substitution between given codons occurs. As we show
below, for a given model of mutation and a model of
fixation involving purifying selection and genetic drift,
the relative frequencies of two residues at a site may be
used to derive rates of substitution between codons relative to the mutation rate. Consequently, we may parameterize site-specific selection in terms of residue frequencies, which, as we shall show, may be estimated
from the observed sequences in a large enough data set
with sufficient accuracy to reconstruct evolutionary distances.
We model sequence evolution as a stationary Markov process in which the unit of description is the codon. The choice of the codon permits us to take into
account the mutational process, the genetic code, and
the selectional pressures applying at the protein level, as
in the work of Goldman and Yang (1994) and Muse and
Gaut (1994). A model of sequence evolution then corresponds to the specification of rates of substitution between codons. We assume that sites are independent,
that reversibility applies, and that the mutation process,
but not the selection process, is identical at all sites.
For position (codon) i, we define a 64 3 64 matrix
of substitution rates r i such that r iab is the rate at which
codon a is replaced by codon b at position i. These
position-specific substitution rates are determined as the
product of three terms: a site-invariant probability of
mutation (in one replication cycle) between codons, pab,
a site-specific probability of fixation of a mutation, f iab,
and an arbitrary scaling constant, k. Thus,
i 5 k 3 p
i
rab
ab 3 f ab,b±a
(1)
i 5 2
raa
(2)
Or
b,b±a
i
ab .
To specify pab, we make use of a discrete-time
model of mutation; letting aj and bj refer to the nucleotides in the jth positions of codons a and b, we take
912
Halpern and Bruno
the probability of mutation between two codons in a
single replication cycle, pab, to be the product of the
mutation probabilities at each nucleotide position j within the codon:
Pp
3
p ab 5
j51
aj bj .
(3)
The terms pajbj are specified according to an overall
probability of mutation per cycle and any reversible
model of mutation. In analyses presented in this paper,
we make use of a model which includes nucleotide frequencies and a transition/transversion ratio, akin to the
models known as HKY and F84 (Hasegawa, Kishino,
and Yano 1985; Felsenstein 1993). The extension to the
general reversible model is straightforward. For our implementation, we specify parameters for nucleotide frequency (nA, nC, nG, nT), the probability of a transition
per replication cycle (pti), and the ratio of transitions to
transversions (pti/ptv). Each of these parameters reflects
the values that would be expected from mutation alone,
in the absence of selection. For comparison, equation
(4) shows the relation between pti/ptv and the transition/
transversion rate ratio of HKY, k:
k5
pti (nA 1 nG )(nC 1 nT )
.
p tv
n AnG 1 nCnT
(4)
From these specifications, we may derive the probabilities of specific mutations per replication cycle. For
a transition x → y, we have
p xy 5
ny
p ,
2(nAnG 1 nCn T ) ti
and for a transversion x → z, we have
pxz 5
equations of Kimura (1962) for the probability of fixation of a single mutant allele in an otherwise homogeneous population apply. In this model, if the relative
fitness of b to a is 1 1 s, s K 1, and the effective
population size is N, the fixation probabilities are:
nz
p ti
.
2(nA 1 nG )(nC 1 nT ) pti /ptv
The probability pxx is, of course, 1 2 Sy,y±xpxy. The resulting single-nucleotide probabilities are combined via
equation (3) to obtain probabilities per replication cycle
of codon-to-codon mutation.
Since the rates of mutation per cycle are low (pab
K 1), we may use a continuous-time approximation of
the behavior of this discrete time model of mutation
once our rates are defined. Indeed, replacement of our
model with one in which rates of mutation between codons differing in two or three nucleotides are taken to
be zero (Goldman and Yang 1994; Muse and Gaut 1994)
has very little effect on our results under the conditions
we have investigated. Similarly, our results are not very
sensitive to the choice of pti. However, if one were to
take the characteristic frequencies of unobserved amino
acids to be very near zero (small relative to the percycle mutation probability), disallowing multiple hits
per cycle could give significant changes in results.
As for the probability of fixation of a mutant, f iab,
we make use of the weak-mutation model of Golding
and Felsenstein (1990), according to which the time between introduction of a mutant and its eventual fixation
is taken to be small (zero) relative to the time between
fixations, so that polymorphism may be ignored and the
f ab 5
1 2 e 22s
2s
ø
22Ns
12e
1 2 e 22Ns
(5)
f ba ø
1 2 e 2s
22s
ø
.
2Ns
12e
1 2 e 2Ns
(6)
(In the discussion of eqs. 5–14, we omit the position
index i; all fixation probabilities and codon frequencies
are for a given site.) For s K 1, the ratio of fixation
rates is then:
f ab
f ba
2s
1 2 e 22Ns
e 2Ns 2 1
ø
5
5 e 2Ns .
22s
1 2 e 22Ns
1 2 e 2Ns
(7)
Since relative equilibrium frequencies between a
and b are determined by the ratios of their substitution
rates, given reversibility, we may also express the ratio
of fixation rates in terms of equilibrium frequencies (pa,
pb) and mutation rates (pab, pba),
pb p ba
f
5 ab 5 e 2Ns .
pa p ab
f ba
This implies we may substitute ln
(8)
1p p 2 for 2Ns into
pb p ba
a ab
equation (5), giving
ln
f ab }
1p p 2
pb p ba
a ab
p p
1 2 a ab
pa p ba
.
(9)
(For the case of pa pab 5 pb pba, we use the l’Hôpital’s
Rule to derive fab } 1.) This permits us to determine the
substitution rate in terms of the mutation probabilities
and the equilibrium frequencies as follows, with k an
arbitrary constant (see below):
ln
rab 5 k 3 p ab 3
1p p 2
pb p ba
a ab
p p
1 2 a ab
pa p ba
.
(10)
Thus, given a set of parameter values specifying
the mutation rates and, for each position, the equilibrium
frequencies of each codon, we may specify positionspecific substitution rates.
The resulting rates have the following properties:
As required by reversibility, rab/rba 5 pb/pa. However,
the total amount of substitution (the ‘‘flux’’) between
two codons is not uniform but, rather, depends on their
relative fitnesses: the greater the difference in fitness,
the less overall substitution. This has an effect not only
Site-Specific Model for Coding Sequences
on rates between individual pairs of codons, but also on
overall rates of substitutions at different sites: the maximum rate is achieved when all codons are equally fit,
and the overall rate of substitution tends to decrease with
decreasing variability of a position (as measured, e.g.,
by entropy). Equally, pairs of synonymous codons, for
which fitnesses are the same (according to the model),
will have the highest possible flux, and nonsynonymous
pairs will generally have lower flux. Finally, because of
the greater degeneracy of third-position substitutions,
third-position changes will be more frequent than firstor second-position substitutions. Although the specific
assumptions regarding fitness and fixation underlying
the model may be false for a given data set, the model
may still be of interest for these general properties.
We note that the rates of substitution may be scaled
arbitrarily to choose the units of evolutionary distance
as we please. By calculating the expected overall rate
of change at a position under no selectional pressure,
we may make our units the expected number of ‘‘hits’’
(codon substitutions) separating sequences at a site under no selectional pressure:
k5
O
a±b
1
p9a r9a→ b
(11)
p9a 5 n a1na2na3

p9b p ba
p9a p ab
pab
r9ab 5  1 2 p9a p ab
p9b p ba


 p ab
(12)
log
for p9a p ab ± p9b p ba
(13)
for p9a p ab 5 p9b p ba .
This has the appeal that it should not depend on
the strength of the selectional pressures on the region of
interest, allowing direct comparison of distances calculated from different regions. Assuming that most hits on
a site involve single-point mutations, we may approximate the number of hits per nucleotide at an unselected
site as one third the number of hits per unselected codon.
Given a model of (codon) substitution rates at each
position, the maximum likelihood estimate (MLE) of
evolutionary distance between two sequences may be
obtained using standard methods. Briefly, the matrix of
substitution rates at each position, r i, is multiplied by t
and exponentiated to give a matrix, Pi(t), which has elements piab(t), the likelihood of codon a at site i in one
sequence corresponding to that of codon b in another as
a function of evolutionary distance, t. (See Swofford et
al. [1996] for an introduction to maximum-likelihood
distance and phylogeny estimation.) From this, the likelihood of two observed sequences being separated by
distance t can be determined. The value of t which maximizes this likelihood is t̂, our estimate of the distance
between the two sequences.
Moreover, given such a model, it is possible to generate a simulated family of related proteins. For a given
phylogenetic tree (topology and branch lengths), and a
913
specification of the mutation and selection parameters,
we may generate an initial random sequence according
to the equilibrium frequencies, associate it with some
arbitrarily chosen node in the tree, and then ‘‘evolve’’
this sequence along the branches of the tree according
to the model.
Frequency Estimation
We turn now to the issue of estimating the parameters required by the model for the calculation of distances. In principle, all parameters could be simultaneously estimated from a large enough data set, given
sufficient computing power. We note that the approach
of Lewis (1998), involving the use of genetic algorithms
to simultaneously estimate a phylogeny and various
model parameters, may make this more feasible for our
model. At present, we have chosen to implement our
model in a program which requires as input specification
of the mutational and selectional parameters. In this section, we briefly discuss how such specifications may be
determined using existing methods.
Regarding the mutational parameters, nucleotide
frequencies may be estimated by the observed frequencies in the data set. The transition/transversion ratio, as
well as additional parameters if one considers a more
complicated mutational model than discussed above,
could be estimated using methods described elsewhere
(Yang 1995; Bruno and Arvestad 1997). We note, however, that application of these methods to coding sequences is not ideal, since it may result in estimates of
values reflecting selection, rather than mutation alone.
For instance, the estimated ratio of transitions to transversions might be higher than the true ratio of such mutations because of (among other things) the bias in third
positions: selection tends to favor silent changes over
amino-acid-replacing changes, and third-position transitions tend to be silent, in contrast to third-position
transversions. Estimation of mutational parameters from
noncoding sequences such as pseudogenes, if available,
might be preferred. In the simulations discussed below,
since we are mostly concerned with illustrating the importance and feasibility of modeling site-specific frequency effects, we will simply set the nucleotide frequencies and transition/transversion ratio to the values
used in simulating the sequences.
As for the estimation of site-specific codon frequencies, given a sufficiently large data set, individual
codon frequencies could in principle be estimated, allowing the modeling of codon bias (see also Goldman
and Yang 1994), but the large number of parameters to
be estimated would require unrealistically large data
sets. To make the problem more tractable, we make one
simplification which considerably reduces the estimation
problem, albeit at the loss of some of the true complexity of the evolutionary process: we ignore the possibility
of site-specific nucleic-acid-level selection effects (e.g.,
codon bias or RNA secondary structure) and calculate
the frequency of a given codon as a function of the
frequency of the corresponding amino acid and the nucleotide frequencies. Concretely, we let the equilibrium
frequency of codon i be estimated as follows, where
914
Halpern and Bruno
aa(i) is the amino acid coded for by codon i, and Paa(i)
is the frequency of aa(i):
pi 5
P
aa(i)
n i1n i2n i3
O
aa(j)5aa(i)
n j1n j2n j3
.
(14)
We thus reduce the codon frequency estimation
problem to the estimation of overall nucleotide frequencies (for which the observed frequencies in the alignment of interest may be used), and the estimation of
equilibrium frequencies of each amino acid at a given
site. This latter problem is a topic which has received
considerable attention in the literature on detection of
sequence homologs and sequence alignment (Baldi et al.
1994; Eddy 1996; Henikoff and Henikoff 1996; Sjolander et al. 1996; and references cited therein). We discuss
here a specific method for the estimation of equilibrium
frequencies which may be used to set the selectional
parameters required by our model for distance calculations (or sequence simulation). We show below that for
simulated but relatively realistic data, the method gives
useful results. We wish to emphasize, however, that the
choice of a specific method for estimating amino acid
frequencies is separate from the use of these estimates
in the model described above, so the specific method we
are about to describe could be replaced by other methods.
The estimation involves two steps, the estimation
of the number of times each amino acid has been introduced at a given position over the evolution of the sequences in the data set, and the addition of pseudocounts
to these counts. Both of these procedures are described
in detail elsewhere. Briefly, the effect of phylogenetic
structure on the estimates is taken into account using the
RIND program described by Bruno (1996), which iteratively estimates the evolutionary history and amino
acid frequencies under a model of sequence evolution
which ignores the genetic code. One result is a set of
estimated ‘‘counts’’ of the number of times each amino
acid is introduced at a given position. The final estimate
of site-specific frequencies is obtained by adding pseudocounts to these counts and taking the fraction of the
total at a given position of all counts and pseudocounts
which is contributed by a given amino acid to be the
frequency of that amino acid, as described by Karplus
(1995a, 1995b) and Henikoff and Henikoff (1996). We
find, empirically, that adding a uniform pseudocount in
the range of 0.5 to 1.0 counts to each category results
in good distance estimates (see below). We expect that
since estimating amino acid frequencies is of considerable interest for homology search methods and other
sorts of sequence analyses, better methods will continue
to be developed; see Sjolander et al. (1996) for an ambitious approach.
Sequence Simulation
In order to illustrate the potential effects of sitespecific frequencies on the estimation of evolutionary
distances, we have constructed simulated sequences
based on parameters estimated from a real alignment of
sequences of the major capsid protein, L1, of the pap-
illomaviruses. An alignment of 74 full-length L1 sequences was obtained from the Papillomavirus Database
(Myers et al. 1996) and slightly modified by hand; all
positions containing gaps were removed, leaving an
amino acid alignment of 430 residues and the corresponding nucleotide alignment of 1,320 positions. The
sequences in the resulting nucleotide alignment vary in
dissimilarity (Hamming distance) from 6% to 48%.
A phylogeny was estimated from the nucleotide
alignment using DNADIST (Felsenstein 1993; Kimura
two-parameter model with default settings) and a version of NEIGHBOR which returns only positive branch
lengths, NNNEIGHBOR (Bruno 1995). In order to permit investigation of long evolutionary distances, the
lengths of all branches in the tree were multiplied by a
factor of 6.0; we note that since the distances between
the more divergent sequences are likely to have been
substantially underestimated by DNADIST, this scaling
does not necessarily result in unrealistically diverged sequences. From the protein alignment, amino acid frequencies at each position were estimated as described
above using RIND; a small pseudocount (0.2) was added to each category on the assumption that the counts
underestimated the true variability in the original data
set.
Simulated sequences were generated from the
scaled tree according to the model described above using the amino acid frequency estimates derived from the
real sequence set, with mutational parameters specified
as follows: the probability of a transition at a given nucleotide in a single replication cycle, pti, was given as
0.000001 (1026), the transition-transversion ratio (pti/ptn)
as 2.0, and the nucleotide frequences nA, nC, nG, nT as
0.3, 0.2, 0.2, and 0.3, respectively.
Having an (artificial) data set corresponding to a
known phylogeny permits comparison of the true pairwise distances (those corresponding to the distances in
the tree used to generate the sequences) with those estimated by various methods. Figure 1 compares results
from DNADIST with all sites identical; DNArates and
fastDNAml, allowing site-to-site rate variation; the model described here, using the input parameters (that is, the
model used to generate the sequences); and the model
described here with parameters estimated from the simulated data set. Details are given in the legend to figure 1.
Programs for distance estimation and sequence
simulation implementing the methods presented above
are available from the authors.
Results and Discussion
Figure 1A shows that site-invariant models may seriously underestimate longer distances if sequences
evolve under selection that results in site-specific residue
frequencies. Figure 1B shows that this underestimation
is also seen with models allowing site-to-site rate variation but not modeling site-specific frequencies. Figure
1C shows, not surprisingly, that when distances are estimated based on the correct model, they remain linear
out to greater distances. Figure 1D shows that it is possible to estimate amino acid frequencies from the (sim-
Site-Specific Model for Coding Sequences
FIG. 1.—Comparison of pairwise distance estimates with correct
distances for simulated data set described in text. In each figure, the
x-axis gives the path length between pairs of sequences through the
tree used in simulating the data set. The y-axis gives the pairwise
distance estimate for each method. A, Pairwise distances were estimated using the maximum-likelihood model of DNADIST (Felsenstein
1993) with parameter settings corresponding to the mutational parameters used to simulate the sequences (frequencies of A, C, G, T 5 0.3,
0.2, 0.2, 0.3, respectively; transition/transversion ratio 2.0, corresponding to setting the T option of the program to 1.5833), with rates at all
sites equal. B, Distances were estimated under the same model as in
A, but allowing site-to-site rate variation. Positions were assigned to
one of 9 rate categories by DNArates (Olsen et al. 1994), using an
initial topology estimated with DNADIST (as above) and NEIGHBOR.
Pairwise distances between sequences were estimated using these rate
assignments using fastDNAml (Olsen et al. 1994). C, Distances were
estimated with the model described here, using the same parameter
values as were used to simulate the sequences. D, Distances were estimated with the model described here, with parameters estimated from
the simulated data set using RIND counts (Bruno 1996), adding 0.65
pseudocounts to each category. In all cases, the line indicates the best
linear fit through the origin for points with input distance #1.5.
ulated) data set with sufficient accuracy to give estimates of evolutionary distance which are linear out to
large distances. As expected (Gojobori, Ishii, and Nei
1982), the error on the estimate for long distances is
greater than that for short distances.
Three points should be made regarding these observations. First, the different distance measures return
results in different units; a slope of one in figure 1 is
thus not necessarily expected. However, the different
measures would nonetheless be linearly related to the
original distances if the methods were giving unbiased
estimates of the evolutionary distances between sequences.
Second, the conditions under which the simpler
models failed to give linear distance estimates can occur
in real data sets. Although the model introduced here
and the parameters used to generate the sequences under
discussion involve many assumptions that are only approximately correct (e.g., constancy of evolutionary
pressures on all lineages, or site-to-site independence),
we believe that they are a reasonable approximation of
real evolutionary processes. The amino acid frequencies
915
used in the simulation were estimated from a real data
set, and the branch lengths likewise reflect an attainable
degree of divergence.
Third, there remains one ‘‘free parameter’’ in our
estimation of amino acid frequencies which must currently be specified by the user: the number of pseudocounts which are added (z) to regularize the estimates.
Figure 1D shows the best results out of several attempts
using different values of z. The results in figure 1D show
that it is possible to choose a value of z which gives
linear estimates out to quite long evolutionary distances.
It should, in principle, be possible to provide a method
for deciding on this value for a given data set without
knowing the distances a priori. As a point of departure,
we note that methods based on urn models of substitution, involving either uniform pseudocounts (‘‘zero-offsets’’) or more complicated prior distributions such as
Dirichlet mixtures estimated from protein alignments
(see Sjolander et al. 1996 and references therein), appear
to give underestimates of the amount of regularization
needed, as has been discussed by Henikoff and Henikoff
(1996) in connection with defining profiles for homology searches. We believe the effect is due to the genetic
code: substitutions do not correspond to draws from an
urn, since the choice of the new state is dependent on
the old state; failure to recognize this dependence leads
to an overestimation of the amount of evidence, which
can be (partially) compensated for by adding more pseudocounts. If the lack of independence of counts can be
dealt with, it would certainly be worthwhile to consider
more complicated methods of determining pseudocounts, such as Dirichlet mixtures (Sjolander et al. 1996)
or substitution matrix methods (Henikoff and Henikoff
1996).
Conclusion
Perhaps the key contribution of the current work is
the modeling of selection in terms of its effect on equilibrium frequencies of residues. We have shown that
failing to model such effects on the evolution of coding
regions may lead to substantial underestimation of longer distances. We have shown that it is possible to model
these effects using standard models of mutation and
population genetics, deriving rates of substitution which
would lead to the estimated equilibrium frequencies at
each position. Finally, we have shown that it is possible
to estimate equilibrium frequencies from data sets of
realistic size and divergence with sufficient accuracy to
allow estimation of distances between highly divergent
sequences.
By separating the mutational process from the fixation process, we may capture differences in codon-tocodon rates due to the genetic code that are relevant at
small degrees of divergence, as well as differences in
amino acid frequencies that are key to estimation of
larger distances. We also have a separation of site-independent parameters (mutational) and site-specific parameters (selectional), each of which may be estimated
by previous methods. This contrasts with the rates implied by, e.g., PAM matrices, which fail to distinguish
916
Halpern and Bruno
effects of the code from effects of protein-level selection. An additional feature of the model that we find
appealing is that it results in site-to-site rate variation
without including a set of site-specific rate parameters
in the model; the model thus accounts for heterogeneity
of rates in a novel, and probably biologically meaningful, way. Without denying the existence of rate variation
due to positive selection (e.g., as result of immune pressure on an antigenic region of a pathogen) or purifying
selection at levels other than the amino acid level (e.g.,
RNA secondary structure, same-residue codon bias,
etc.), it would be interesting to see the extent to which
observed rate heterogeneity in coding regions could be
reduced to variation in the strength of residue preferences as indicated by their frequencies.
In the implementation of distance estimation presented here, we have assumed that mutational and selectional parameters would be estimated independently,
using established methods. However, separating the estimation of these parameters from the estimation of distances is clearly not optimal. Ideally, these parameters
and a phylogeny would be estimated simultaneously.
The likelihood functions described here make that possible in principle, although the computation may not
currently be practical.
We feel that the method presented here will be of
use in assessing relative evolutionary distances for data
sets containing a mixture of highly divergent and less
divergent pairs of sequences, whereas previous methods
are likely to underestimate longer distances and thus underestimate differences between more and less divergent
pairs. We also hope that the method will find use in
evaluating the degree to which the more divergent
clades of a tree have approached saturation, which
should permit us to consider questions regarding the
likelihood of increased divergence over time. Finally,
the method may be helpful in assessing whether uncertainty regarding the inner branchings of a tree reflects a
period of explosive diversification or loss of evolutionary signal in the noise of time.
Acknowledgments
We thank Stanley Sawyer and an anonymous referee for editorial input. Funding for A.L.H. was provided by the Sexually-Transmitted Diseases Branch of the
National Institute of Allergy and Infectious Diseases and
Family Health International; W.J.B. was supported by a
grant from the Department of Energy Office of Biological and Environmental Research.
LITERATURE CITED
BALDI, P., Y. CHAUVIN, T. HUNKAPILLER, and M. A. MCCLURE.
1994. Hidden Markov models of biological primary sequence information. Proc. Natl. Acad. Sci. USA 91:1059–
1063.
BRUNO, W. 1995. Unpublished revisions to J. Felsenstein’s program NEIGHBOR; available at ftp://atlas.lanl.gov/pub/
billb/rind/nnneighbor.c.
. 1996. Modeling residue usage in aligned protein sequences via maximum likelihood. Mol. Biol. Evol. 13:
1368–1375.
BRUNO, W. J., and L. ARVESTAD. 1997. Estimation of reversible substitution matrices and evolutionary distance. J. Mol.
Evol. 45:696–703.
CHAN, S.-Y., H. DELIUIS, A. L. HALPERN, and H.-U. BERNARD.
1995. Analysis of genomic sequences of 95 papillomavirus
types—uniting typing, phylogeny, and taxonomy. J. Virol.
69:3074–3083.
EDDY, S. R. 1996. Hidden Markov models. Curr. Opin. Struct.
Biol. 6:361–365.
EDDY, S. R., G. MITCHISON, and R. DURBIN. 1995. Maximum
discrimination hidden Markov models of sequence consensus. J. Comput. Biol. 2:9–23.
EIGEN, M., and K. NIESELT-STRUWE. 1990. How old is the
immunodeficiency virus? AIDS 4:S85–S93.
FELSENSTEIN, J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J. Mol. Evol. 17:368–
376.
. 1993. PHYLIP (phylogeny inference package). Version 3.5c. Distributed by author, Department of Genetics,
University of Washington, Seattle.
FELSENSTEIN, J., and G. A. CHURCHILL. 1996. A hidden Markov model approach to variation among sites in rate of evolution. Mol. Biol. Evol. 13:93–104.
GOJOBORI, T. 1983. Codon substitution in evolution and the
‘‘saturation’’ of synonymous changes. Genetics 105:1011–
1027.
GOJOBORI, T., K. ISHII, and M. NEI. 1982. Estimation of average number of nucleotide substitutions when the rate of
substitution varies with nucleotide. J. Mol. Evol. 18:414–
423.
GOLDING, B., and J. FELSENSTEIN. 1990. A maximum likelihood approach to the detection of selection from a phylogeny. J. Mol. Evol. 31:511–523.
GOLDMAN, N., and Z. YANG. 1994. A codon-based model of
nucleotide substitution for protein-coding DNA sequences.
Mol. Biol. Evol. 11:725–736.
GRIBSKOV, M., and S. VERETNIK. 1996. Identification of sequence pattern with profile analysis. Methods Enzymol.
266:198–212.
HASEGAWA, M., H. KISHINO, and T. YANO. 1985. Dating of
the human–ape splitting by a molecular clock of mitochondrial DNA. J. Mol. Evol. 21:160–174.
HENIKOFF, J. G., and S. HENIKOFF. 1996. Using substitution
probabilities to improve position-specific scoring matrices.
CABIOS 12:135–143.
HENIKOFF, S., and J. G. HENIKOFF. 1992. Amino acid substitution matrices from protein blocks. Proc. Natl. Acad. Sci.
USA 3:10915–10919.
. 1994. Protein family classification based on searching
a database of blocks. Genomics 19:97–107.
KARPLUS, K. 1995a. Regularizers for estimating distributions
of amino acids from small samples. Technical report, University of California, Santa Cruz. Technical Report UCSCCRL-95-11, URL ftp://ftp.cse.ucsc.edu/pub/tr/ucsc-crl-9511.ps.Z.
. 1995b. Regularizers for estimating distributions of
amino acids from small samples. Pp. 188–196 in C. RAWLINGS, D. CLARK, R. ALTMAN, L. HUNTER, T. LENGAUER,
and S. WODAK, eds. ISMB-95: Proceedings, Third International Conference on Intelligent Systems for Molecular
Biology. AAAI Press, Menlo Park, Calif.
KIMURA, M. 1962. On the probability of fixation of mutant
genes in a population. Genetics 4:713–719.
KROGH, A., M. BROWN, I. S. MIAN, K. SJOLANDER, and D.
HAUSSLER. 1994. Hidden Markov models in computational
biology: applications to protein modeling. J. Mol. Biol. 235:
1501–1531.
Site-Specific Model for Coding Sequences
LEWIS, P. O. 1998. A genetic algorithm for maximum-likelihood phylogeny inference using nucleotide sequence data.
Mol. Biol. Evol. 15:277–283.
MITCHISON, G., and R. DURBIN. 1995. Tree-based maximal
likelihood substitution matrices and hidden Markov models.
J. Mol. Evol. 41:1139–1151.
MUSE, S. V., and B. S. GAUT. 1994. A likelihood approach for
comparing synonymous and nonsynonymous nucleotide
substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11:715–724.
MYERS, G., A. HALPERN, C. BAKER, A. MCBRIDE, C. WHEELER, and J. DOORBAR. 1996. Human papillomaviruses 1996,
a compilation and analysis of nucleic acid and amino acid
sequences. Theoretical Biology and Biophysics, Los Alamos National Laboratory, Los Alamos, N.M.
MYERS, G., K. MACINNES, and B. KORBER. 1992. The emergence of simian/human immunodeficiency viruses. AIDS
Res. Hum. Retroviruses 8:373–386.
OLSEN, G. J., H. MATSUDA, R. HAGSTROM, and R. OVERBEEK.
1994. fastDNAml—a tool for construction of phylogenetic
trees of DNA sequences using maximum likelihood. CABIOS 10:41–48.
SOJLANDER, K., K. KARPLUS, M. BROWN, R. HUGHEY, A.
KROGH, I. S. MIAN, and D. HAUSSLER. 1996. Dirichlet
917
mixtures: a method for improving detection of weak but
significant protein sequence homology. CABIOS 12:327–
345.
SWOFFORD, D. L., G. J. OLSEN, P. J. WADDELL, and D. M.
HILLIS. 1996. Phylogenetic inference. Pp. 407–514 in D. M.
HILLIS, C. MORITZ, and B. K. MABLE, eds. Molecular Systematics. 2nd edition. Sinauer, Sunderland, Mass.
THORNE, J. L., N. GOLDMAN, and D. T. JONES. 1996. Combining protein evolution and secondary structure. Mol. Biol.
Evol. 13:666–673.
WRAY, G. A., J. S. LEVINTON, and L. H. SHAPIRO. 1996. Molecular evidence for deep pre-Cambrian divergences among
metazoan phyla. Science 274:568–573.
YANG, Z. 1995. Phylogenetic analysis by maximum likelihood
(PAML). Version 1.1. http://abacus.gene.ucl.ac.uk/ziheng/
paml.html.
. 1996. Among-site rate variation and its impact on phylogenetic distances. Trends Ecol. Evol. 11:367–372.
STANLEY SAWYER, reviewing editor
Accepted April 8, 1998