Counterexample to a Claim About the
Reconstruction of Ancestral Character States
Brian Lucena
Division of Computer Science
University of California, Berkeley
David Haussler
Howard Hughes Medical Institute
Department of Biomolecular Engineering
University of California, Santa Cruz
Since Pauling and Zuckerkandl first suggested it more than 40 years ago,
the idea of reconstructing ancestral proteins and DNA sequences from the information contained in sequences of present day species has held considerable
fascination (Pauling & Zuckerkandl, 1963). Such reconstructions can provide
a unifying framework for understanding the molecular origins and evolution
of key components in living organisms. However, only recently has it become relatively straightforward to perform such reconstructions and then test
the reconstructed molecules functionally in the lab. Now there is a surge
of activity in this area. Ancestral protein sequences for rhodopsin (Chang,
et al., 2002), ultra-violet vision gene SWS1 (Shi & Yokoyama, 2003), ribonucleases (Jermann, et al., 1995; Zhang & Rosenberg, 2002), Tu elongation factors (Gaucher, et al., 2003), and steroid receptors (Thornton, et al., 2003) have
been reconstructed and tested (see the reviews (Chang & Donoghue, 2000)
and (Thornton, 2004)). In addition, DNA from the common ancestor of placental mammals has been reconstructed for a megabase-sized region containing
the cystic fibrosis gene (Blanchette, et al., 2004), for several families of transposons (Adey, et al., 1994; Ivics, et al., 1997; Smit & Riggs, 1996; Jurka, 2002),
and for some complete small genomes like HIV (Hillis, et al., 1994). In the
latter case the predicted ancestral sequences were compared to the known ones,
so the accuracy of the reconstruction could be measured directly. However, in
other cases theoretical results (Yang, et al., 1995; Schultz, et al., 1996; Schultz
& Churchill, 1999) and computer simulations (Zhang & Nei, 1997; Blanchette,
et al., 2004; Cai, et al., 2004) are required to assess the accuracy of the reconstructed sequence.
In these investigations it has been observed that the topology of the phylogenetic
tree relating the present day species to the target ancestral species can
affect the accuracy obtainable in reconstruction of the target ancestral character states. Simulations show that a star-like phylogeny, i.e. a rapid radiation
of many different lineages from a common target ancestor, such as occurred in
the radiation of placental mammals, allows the ancestral character states of
that target ancestor to be more accurately reconstructed than those of more
recent ancestors in parts of the tree where speciation events are more regularly spaced (Blanchette, et al., 2004). More generally, it has been claimed
that the star phylogeny always “represents the best case for ancestral character state reconstruction, because each observation is conditionally independent
and yields maximum information about the ancestor” (Schultz, et al., 1996; see
also Schultz & Churchill, 1999). Here we show that the actual situation is more
complex and depends on the branch lengths. This complexity occurs even for
the simplest evolutionary model with only one parameter: the Poisson model
(known also as the Neyman r-state model, generalized Jukes-Cantor model, and
the Potts model), where the parameter determines the rate of substitution and
all substitutions are equally likely.
Consider the tree topology shown in Figure 1, where A represents the common ancestor of 3 present day species, designated by C1, C2, and C3. B2 represents the common ancestor of C2 and C3 while B1 represents the ancestor of C1
at the same moment in evolutionary time as B2. This tree contains a subtree
with the simplest star topology, the two-leaf subtree shown in bold in Figure
1b, and it also contains a subtree with the simplest non-star topology, the “Y
topology” shown in Figure 1c. The question is: which of these two topologies
(b or c) is better for reconstructing the ancestral character at A? To make this
more concrete, suppose each character takes a value in the 20-letter alphabet
of amino acids. You want to learn about the ancestral character A, which has
a discrete uniform marginal distribution. Imagine that your budget only allows
you to determine two of the three characters C1, C2 and C3 of the contemporary
species. Which two should you choose?
Figure 1: (a) The evolutionary tree. (b) The star topology is highlighted. (c)
The Y-topology is highlighted.
The information present about a variable X from related variables Y and Z
is given by the mutual information I(X; Y, Z) (Cover & Thomas, 1991). The
higher the mutual information, the more reconstructible is X from Y and Z.
Since C2 and C3 are interchangeable, our problem thus reduces to the question
of which is the larger mutual information, I(A; C1, C3) or I(A; C2, C3). For short
branches I(A; C1, C3) is higher, as claimed in (Schultz, et al., 1996; Schultz &
Churchill, 1999) and as has been observed in simulations (Blanchette, et al., 2004).
That is, you would prefer a subtopology that is a two-branch star, where the
species you observe share no common ancestor except A (Figure 1b). As a
concrete example, suppose A is drawn uniformly from the set of all 20 amino
acids and the conservation probability is .75 for each branch of the tree (i.e.
from A to B1 or B2, from B1 to C1, or from B2 to C2 or C3). This corresponds
to about .29 expected substitutions per site (for each branch). Under these
conditions I(A; C1, C3) = 2.419 but I(A; C2, C3) = 1.949. This is intuitive,
because C1 and C3 give independent evidence about the ancestral character
(they are conditionally independent given the ancestor A). But in a long branch
setting with a much lower conservation probability, I(A; C2, C3) is higher. For
example, if the conservation probability is .15 (expected number of substitutions
≈ 2.139), we have I(A; C1, C3) = .003162 but I(A; C2, C3) = .003215. Thus,
somewhat nonintuitively, for reconstruction of the ancestral character in a long
branch setting it is better to have a Y-topology, where there is an intermediate
common ancestor, such as in Figure 1c. In this case the observed characters are
conditionally dependent given the ancestral character A.
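The comparison above can be checked by brute-force enumeration. The following is a minimal sketch, not the authors' code. It assumes mutual information is measured in bits, that every branch has the same conservation probability p, and that branch length relates to p through the usual Poisson-model normalization p = 1/n + (1 - 1/n) exp(-nt/(n - 1)). All function names are hypothetical.

```python
import math
from itertools import product

def poisson_matrix(n, p):
    """One-branch transition matrix of the n-state Poisson model: a state is
    conserved with probability p, otherwise it changes to each of the other
    n - 1 states with equal probability."""
    off = (1.0 - p) / (n - 1)
    return [[p if i == j else off for j in range(n)] for i in range(n)]

def compose(P, Q):
    """Transition matrix for branch P followed by branch Q."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mutual_information_bits(joint):
    """I(A; O) in bits (assumed unit), where joint[a][o] = P(A = a, O = o)."""
    pa = [sum(row) for row in joint]
    po = [sum(col) for col in zip(*joint)]
    return sum(pao * math.log2(pao / (pa[a] * po[o]))
               for a, row in enumerate(joint)
               for o, pao in enumerate(row) if pao > 0.0)

def star_joint(n, p):
    """Joint of (A, (C1, C3)): two leaves on separate two-branch lineages,
    with A uniform on the n states."""
    Q = compose(poisson_matrix(n, p), poisson_matrix(n, p))
    pairs = list(product(range(n), repeat=2))
    return [[Q[a][x] * Q[a][y] / n for x, y in pairs] for a in range(n)]

def y_joint(n, p):
    """Joint of (A, (C2, C3)): both leaves descend from B2, which is summed out."""
    P = poisson_matrix(n, p)
    pairs = list(product(range(n), repeat=2))
    return [[sum(P[a][b] * P[b][x] * P[b][y] for b in range(n)) / n
             for x, y in pairs] for a in range(n)]

def expected_substitutions(n, p):
    """Branch length t solving p = 1/n + (1 - 1/n) * exp(-n t / (n - 1));
    this normalization is an assumption, not stated in the text."""
    return -(n - 1) / n * math.log((n * p - 1) / (n - 1))

if __name__ == "__main__":
    for p in (0.75, 0.15):
        star = mutual_information_bits(star_joint(20, p))
        y = mutual_information_bits(y_joint(20, p))
        print(f"p={p}: t={expected_substitutions(20, p):.3f}, "
              f"I(A;C1,C3)={star:.6f}, I(A;C2,C3)={y:.6f}")
```

If these assumptions match the setting described above, the printed values should be close to the ones quoted in the text, with the star pair ahead at p = .75 and the Y pair ahead at p = .15.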
Figure 2: (a) The evolutionary tree. (b) The star topology is highlighted. (c)
A correlated topology is highlighted.
This effect remains if a fourth (conditionally) independent species branching from A is allowed (Figure 2) and we are allowed to choose 3 of the 4
species. Using the same parameters (conservation probability = .15), we get
I(A; C1, C2, C3) = .004796 and I(A; C1, C2, C4) = .004743. This shows that
even in cases where the target ancestor for reconstruction is the last common
ancestor of all observed present day species, the star topology is not always best.
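The four-species comparison can be enumerated in the same way. This is again a hedged sketch: it assumes the tree of Figure 2 is A with children B1, B2, B3, where B1 leads to C1, B2 to C2 and C3, and B3 to C4, that all branches share one conservation probability, and that information is measured in bits. The function names are hypothetical.

```python
import math
from itertools import product

def poisson_matrix(n, p):
    """One-branch n-state Poisson transition matrix with conservation probability p."""
    off = (1.0 - p) / (n - 1)
    return [[p if i == j else off for j in range(n)] for i in range(n)]

def compose(P, Q):
    """Transition matrix for branch P followed by branch Q."""
    n = len(P)
    return [[sum(P[i][k] * Q[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def mutual_information_bits(joint):
    """I(A; O) in bits (assumed unit), where joint[a][o] = P(A = a, O = o)."""
    pa = [sum(row) for row in joint]
    po = [sum(col) for col in zip(*joint)]
    return sum(pao * math.log2(pao / (pa[a] * po[o]))
               for a, row in enumerate(joint)
               for o, pao in enumerate(row) if pao > 0.0)

def three_leaf_joints(n, p):
    """Return the joints of (A, (C1, C2, C3)) and (A, (C1, C2, C4)) for the
    assumed Figure 2 tree, with A uniform on the n states."""
    states = range(n)
    P = poisson_matrix(n, p)
    Q = compose(P, P)  # root-to-leaf transition along a single lineage
    # F[a][x][y] = P(C2 = x, C3 = y | A = a), with B2 summed out
    F = [[[sum(P[a][b] * P[b][x] * P[b][y] for b in states)
           for y in states] for x in states] for a in states]
    triples = list(product(states, repeat=3))
    correlated = [[Q[a][w] * F[a][x][y] / n for w, x, y in triples]
                  for a in states]
    star = [[Q[a][w] * Q[a][x] * Q[a][y] / n for w, x, y in triples]
            for a in states]
    return correlated, star

if __name__ == "__main__":
    correlated, star = three_leaf_joints(20, 0.15)
    print("I(A;C1,C2,C3) =", mutual_information_bits(correlated))
    print("I(A;C1,C2,C4) =", mutual_information_bits(star))
```

The enumeration covers 20^4 root-and-leaf configurations for each choice of three leaves, which is still cheap but already illustrates how quickly the cost grows with the number of leaves.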
These results contrast with those of the case of binary characters. There
it has been proven that the star topology is always best for reconstruction of
the ancestral character state for a tree with any number of leaves under the
generalized Jukes-Cantor model (Evans, et al., 2000, Theorem 6.1), validating
the claim of (Schultz, et al., 1996; Schultz & Churchill, 1999) for this case. The
counterexamples we give here show that there is a fundamental difference in the
behavior of this problem when there are 2 states versus when there are many.
Phenomena quite similar to the results in this paper (and mathematically deeper) have been demonstrated in the mathematical literature regarding
probability on trees. The asymptotic analysis of the Poisson model
in (Mossel, 2001) and (Mossel & Peres, 2003) (there referred to as the Potts
model) demonstrates that there exist non-star topologies which have strictly
positive information about the root at the leaves (asymptotically), while the
corresponding star topology would have information approaching zero. A thorough understanding of this result makes the results described in this paper somewhat less surprising. It should also be noted that similar phenomena can occur
with a binary state space if the model is asymmetric. This is suggested by the
results in (Mossel, 2001). Furthermore, Mossel (2001) demonstrates another example where (in the language of this paper) I(A; C1, C2, C4) = I(A; C1, C2) = 0
yet I(A; C2, C3) > 0. However, the model used there is unlikely to be biologically meaningful as it involves a transition matrix that does not correspond to
a continuous or reversible Markov process.
Biologists may be interested in computing the information some subset of
species has about the root of a tree, given some probability distribution on the
tree. The typical model specification is given by a marginal distribution of the
root of a tree and a conditional distribution for each node given its parent. While
computing the information is theoretically quite straightforward, it effectively
requires evaluating the joint probability of every state configuration of the root
and the leaves. In a general n-state model on an arbitrary tree with k leaves
this requires n^(k+1) evaluations of the joint probability function. Symmetry in
the model and/or the tree can often make this fairly simple. For example, for
the Poisson model with a star topology with k leaves, the joint probability of a
particular configuration depends only on how many leaves are the same as the
root, so only k + 1 different probabilities need be computed. It should also be
noted that, in the typical model specification, computing a joint probability of
the root and leaves requires taking a sum over the internal nodes of the tree.
Doing this with efficient methods contributes an additional factor of cn^2 (or just
cn if the tree has only two levels) to the computational complexity, where c is the
number of internal nodes. So, in general, computing the mutual information
in these scenarios can become computationally infeasible unless there is a great
deal of symmetry. For this reason, theoretical results which shed light on optimal
choices can be of great practical value.
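The star-topology symmetry mentioned above can be written out explicitly. As a sketch, assume a uniform root distribution and that every root-to-leaf path has the same conservation probability q (a symbol introduced here for illustration, not used in the original):

```latex
P(A = a,\; C_1 = c_1, \dots, C_k = c_k)
  = \frac{1}{n}\, q^{m} \left( \frac{1 - q}{n - 1} \right)^{k - m},
\qquad
m = \bigl|\{\, i : c_i = a \,\}\bigr| \in \{0, 1, \dots, k\}.
```

Under these assumptions the n^(k+1) joint probabilities take at most k + 1 distinct values, one for each possible number m of leaves that agree with the root.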
Acknowledgements
This work was supported in part by an NSF Mathematical Sciences Postdoctoral
Fellowship.
References
N. Adey, et al. (1994). ‘Molecular resurrection of an extinct ancestral promoter
for mouse L1’. Proc Natl Acad Sci. 91(4):1569–1573.
M. Blanchette, et al. (2004). ‘Reconstructing large regions of an ancestral mammalian genome in silico’. Genome Research 14(12):2412–2423.
W. Cai, et al. (2004). ‘Reconstruction of ancestral protein sequences and its
application’. BMC Evolutionary Biology 4:33.
B. Chang & M. Donoghue (2000). ‘Recreating ancestral proteins’. Trends Ecol
Evol 15(3):109–114.
B. Chang, et al. (2002). ‘Recreating a functional ancestral archosaur visual
pigment’. Mol Biol Evol. 19(9):1483–1483.
T. Cover & J. Thomas (1991). Elements of Information Theory. John Wiley &
sons.
W. Evans, et al. (2000). ‘Broadcasting on Trees and the Ising Model’. Ann.
Appl. Prob. 10(2):410–433.
E. Gaucher, et al. (2003). ‘Inferring the palaeoenvironment of ancient bacteria
on the basis of resurrected proteins’. Nature 425(6955):285–8.
D. Hillis, et al. (1994). ‘Application and accuracy of molecular phylogenies’.
Science 264:671.
Z. Ivics, et al. (1997). ‘Molecular reconstruction of Sleeping Beauty, a Tc1-like
transposon from fish, and its transposition in human cells’. Cell 91:501–10.
T. Jermann, et al. (1995). ‘Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily’. Nature 374(6517):57–59.
J. Jurka (2002). ‘Repbase update: a database and an electronic journal of
repetitive elements’. Trends Genet 16(9):418–20.
E. Mossel (2001). ‘Reconstruction on trees: beating the second eigenvalue’.
Ann. Appl. Probab. 11(1):285–300.
E. Mossel & Y. Peres (2003). ‘Information flow on trees’. Ann. Appl. Probab.
13(3):817–844.
L. Pauling & E. Zuckerkandl (1963). ‘Chemical paleogenetics, molecular restoration studies of extinct forms of life’. Acta Chemica Scandinavica 17(1).
T. Schultz & G. Churchill (1999). ‘The role of subjectivity in reconstructing
ancestral character states: A Bayesian approach to unknown rates, states and
transformation asymmetries’. Syst. Biol. 48:651–664.
T. Schultz, et al. (1996). ‘The reconstruction of ancestral character states’.
Evolution 50:504.
Y. Shi & S. Yokoyama (2003). ‘Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates’. Proc. Natl. Acad. Sci. USA.
100(14):8308–13.
A. Smit & A. Riggs (1996). ‘Tiggers and other DNA transposon fossils in the
human genome’. Proc. Natl. Acad. Sci. 93:1443–1448.
J. Thornton (2004). ‘Resurrecting ancient genes: experimental analysis of extinct molecules’. Nature Reviews Genetics 5:366–375.
J. W. Thornton, et al. (2003). ‘Resurrecting the ancestral steroid receptor:
ancient origin of estrogen signaling’. Science 301(5640):1714–7.
Z. Yang, et al. (1995). ‘A new method of inference of ancestral nucleotide and
amino acid sequences’. Genetics 141:1641–1650.
J. Zhang & M. Nei (1997). ‘Accuracies of ancestral amino acid sequences inferred
by parsimony, likelihood and distance methods’. J. Mol. Evol. 44(S1):S139–
46.
J. Zhang & H. Rosenberg (2002). ‘Complementary advantageous substitutions
in the evolution of an antiviral RNase of higher primates’. Proc Natl Acad
Sci USA. 99(8):5486–91.