Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Counterexample to a Claim About the Reconstruction of Ancestral Character States Brian Lucena Division of Computer Science University of California, Berkeley [email protected] David Haussler Howard Hughes Medical Institute Department of Biomolecular Engineering University of California, Santa Cruz [email protected] Since Pauling and Zuckerkandl first suggested it more than 40 years ago, the idea of reconstructing ancestral proteins and DNA sequences from the information contained in sequences of present day species has held considerable fascination (Pauling & Zuckerkandl, 1963). Such reconstructions can provide a unifying framework for understanding the molecular origins and evolution of key components in living organisms. However, only recently has it become relatively straightforward to perform such reconstructions and then test the reconstructed molecules functionally in the lab. Now there is a surge of activity in this area. Ancestral protein sequences for rhodopsin (Chang, et al., 2002), ultra-violet vision gene SWS1 (Shi & Yokoyama, 2003), ribonucleases (Jermann, et al., 1995; Zhang & Rosenberg, 2002), Tu elongation factors (Gaucher, et al., 2003), and steroid receptors (Thornton, et al., 2003) have been reconstructed and tested (see the reviews (Chang & Donoghue, 2000) and (Thornton, 2004)). In addition, DNA from the common ancestor of placental mammals has been reconstructed for a megabase-sized region containing the cystic fibrosis gene (Blanchette, et al., 2004), for several families of transposons (Adey, et al., 1994; Ivics, et al., 1997; Smit & Riggs, 1996; Jurka, 2002), and for some complete small genomes like HIV (Hillis, et al., 1994). In the latter case the predicted ancestral sequences were compared to the known ones, so the accuracy of the resonstruction could be measured directly. However, in other cases theoretical results (Yang, et al., 1995; Schultz, et al., 1996; Schultz & Churchill, 1999) and computer simulations (Zhang & Nei, 1997; Blanchette, et al., 2004; Cai, et al., 2004) are required to assess the accuracy of the reconstructed sequence. In these investigations it has been observed that the topology of the phylo- 1 genetic tree relating the present day species to the target ancestral species can affect the accuracy obtainable in reconstruction of the target ancestral character states. Simulations show that a star-like phylogeny, i.e. a rapid radiation of many different lineages from a common target ancestor, such as occured in the radiation of placental mammals, allows the ancestral character states of that target ancestor to be more accurately reconstructed than those of more recent ancestors in parts of the tree where speciation events are more regularly spaced (Blanchette, et al., 2004). More generally, it has been claimed that the star phylogeny always “represents the best case for ancestral character state reconstruction, because each observation is conditionally independent and yields maximum information about the ancestor” (Schultz, et al., 1996), see also (Schultz & Churchill, 1999). Here we show that the actual situation is more complex and depends on the branch lengths. This complexity occurs even for the simplest evolutionary model with only one parameter: the Poisson model (known also as the Neyman r-state model, generalized Jukes-Cantor model, and the Potts model), where the parameter determines the rate of substitution and all substitutions are equally likely. Consider the tree topology shown in Figure 1, where A represents the common ancestor of 3 present day species, designated by C1 , C2, and C3 . B2 represents the common ancestor of C2 and C3 while B1 represents the ancestor of C1 at the same moment in evolutionary time as B2 . This tree contains a subtree with the simplest star topology, the two-leaf subtree shown in bold in figure 1b, and it also contains a subtree with the simplest non-star topology, the ”Y topology” shown in figure 1c. The question is: which of these two topologies (b or c) is better for reconstructing the ancestral character at A? To make this more concrete, suppose each character takes a value in the 20-letter alphabet of amino acids. You want to learn about the ancestral character A, which has a discrete uniform marginal distribution. Imagine that your budget only allows you to determine two of the three characters C1, C2 and C3 of the contemporary species. Which two should you choose? A A B1 B2 C1 C2 (a) C3 A B1 B2 C1 C2 (b) C3 B1 C1 B2 C2 C3 (c) Figure 1: (a) The evolutionary tree. (b) The star topology is highlighted. (c) The Y-topology is highlighted. The information present about a variable X from related variables Y and Z is given by the mutual information I(X; Y, Z) (Cover & Thomas, 1991). The higher the mutual information, the more reconstructible is X from Y and Z. Since C2 and C3 are interchangeable, our problem thus reduces to the question 2 of which is the larger mutual information, I(A; C1, C3) or I(A; C2, C3). For short branches I(A; C1 , C3) is higher, as claimed in (Schultz, et al., 1996; Schultz & Churchill, 1999) and has been observed in simulations (Blanchette et al. 2004). That is, you would prefer a subtopology that is a two-branch star, where the species you observe share no common ancestor except A (Figure 1b). As a concrete example, suppose A is drawn uniformly from the set of all 20 amino acids and the conservation probability is .75 for each branch of the tree (i.e. from A to B1 or B2 , from B1 to C1, or from B2 to C2 or C3). This corresponds to about .29 expected substitutions per site (for each branch). Under these conditions I(A; C1 , C3) = 2.419 but I(A; C2, C3) = 1.949. This is intuitive, because these species give independent evidence about the ancestral character (they are conditionally independent given the ancestor A). But in a long branch setting with a much lower conservation probability, I(A; C2, C3) is higher. For example, if the conservation probability is .15 (expected number of substitutions ≈ 2.139) , we have I(A; C1, C3) = .003162 but I(A; C2 , C3) = .003215. Thus, somewhat nonintuitively, for reconstruction of the ancestral character in a long branch setting it is better to have a Y-topology, where there is an intermediate common ancestor, such as in Figure 1c. In this case the observed characters are conditionally dependent given the ancestral character A. A B1 A B2 C1 C2 C3 (a) B3 C4 B1 A B2 C1 C2 C3 (b) B3 C4 B1 B2 C1 C2 C3 B3 C4 (c) Figure 2: (a) The evolutionary tree. (b) The star topology is highlighted. (c) A correlated topology is highlighted. This effect remains if a fourth (conditionally) independent species branching from A is allowed (Figure 2) and we are allowed to choose 3 of the 4 species. Using the same parameters (conservation probability = .15), we get I(A; C1 , C2, C3) = .004796 and I(A; C1, C2, C4) = .004743. This shows that even in cases where the target ancestor for reconstruction is the last common ancestor of all observed present day species, the star topology is not always best. These results contrast with those of the case of binary characters. There it has been proven that the star topology is always best for reconstruction of the ancestoral character state for a tree with any number of leaves under the generalized Jukes-Cantor model (Evans, et al., 2000) (Theorem 6.1), validating the claim of (Schultz, et al., 1996; Schultz & Churchill, 1999) for this case. The counterexamples we give here show that there is a fundamental difference in the behavior of this problem when there are 2 states versus when there are many. Phenomena which are quite similar (and mathematically deeper) to the results in this paper have been demonstrated in the mathematical literature re3 garding probability on trees. The asymptotic analysis of the Poisson model in (Mossel, 2001) and (Mossel & Peres, 2003) (there referred to as the Potts model) demonstrates that there exist non-star topologies which have strictly positive information about the roots at the leaves (asymptotically), while the corresponding star topology would have information approaching zero. A thorough understanding of this result makes the results described in this paper somewhat less surprising. It should also be noted that similar phenomenon can occur with a binary state space if the model is asymmetric. This is suggested by the results in (Mossel, 2001). Furthermore, Mossel (2001) demonstrates another example where (in the language of this paper) I(A; C1 , C2, C4) = I(A; C1 , C2) = 0 yet I(A; C2 , C3) > 0. However, the model used there is unlikely to be biologically meaningful as it involves a transition matrix that does not correspond to a continuous or reversible markov process Biologists may be interested in computing the information some subset of species has about the root of a tree, given some probability distribution on the tree. The typical model specification is given by a marginal distribution of the root of a tree and a conditional distribution for each node given its parent. While computing the information is theoretically quite straightforward, it effectively requires evaluating the joint probability of every state configuration of the root and the leaves. In a general n-state model on an arbitrary tree with k leaves this requires n(k+1) evaluations of the joint probability function. Symmetry in the model and/or the tree can often make this fairly simple. For example, for the Poisson model with a star topology with k leaves, the joint probability of a particular configuration depends only on how many leaves are the same as the root, so only k + 1 different probabilities need be computed. It should also be noted that, in the typical model specification, computing a joint probability of the root and leaves requires taking a sum over the internal nodes of the tree. Doing this with efficient methods contributes an additional factor of cn2 (or just cn if the tree has only two levels) where c is the number of internal nodes to the computational complexity. So, in general, computing the mutual information in these scenarios can become computationally infeasible unless there is a great deal of symmetry. For this reason, theoretical results which shed light on optimal choices can be of great practical value. Acknowledgements This work was supported in part by an NSF Mathematical Sciences Postdoctoral Fellowship. References N. Adey, et al. (1994). ‘Molecular resurrection of an extinct ancestral promoter for mouse L1’. Proc Natl Acad Sci. 91(4):1569–1573. 4 M. Blanchette, et al. (2004). ‘Reconstructing large regions of an ancestral mammalian genome in silico’. Genome Research 14(12):2412–2423. W. Cai, et al. (2004). ‘Reconstruction of ancestral protein sequences and its application’. BMC Evolutionary Biology 4:33. B. Chang & M. Donoghue (2000). ‘Recreating ancestral proteins’. Trends Ecol Evol 15(3):109–114. B. Chang, et al. (2002). ‘Recreating a functional ancestral archosaur visual pigment’. Mol Biol Evol. 19(9):1483–1483. T. Cover & J. Thomas (1991). Elements of Information Theory. John Wiley & sons. W. Evans, et al. (2000). ‘Broadcasting on Trees and the Ising Model’. Ann. Appl. Prob. 10(2):410–433. E. Gaucher, et al. (2003). ‘Inferring the palaeoenvironment of ancient bacteria on the basis of resurrected proteins’. Nature 425(6955):285–8. D. Hillis, et al. (1994). ‘Application and accuracy of molecular phylogenies’. Science 264:671. Z. Ivics, et al. (1997). ‘Molecular reconstruction of Sleeping Beauty, a Tc1-like transposon from fish, and its transposition in human cells’. Cell 91:501–10. T. Jermann, et al. (1995). ‘Reconstructing the evolutionary history of the artiodactyl ribonuclease superfamily’. Nature 2;374(6517):57–59. J. Jurka (2002). ‘Repbase update: a database and an electronic journal of repetitive elements’. Trends Genet 16(9):418–20. E. Mossel (2001). ‘Reconstruction on trees: beating the second eigenvalue’. Ann. Appl. Probab. 11(1):285–300. E. Mossel & Y. Peres (2003). ‘Information flow on trees’. Ann. Appl. Probab. 13(3):817–844. L. Pauling & E. Zuckerkandl (1963). ‘Chemical paleogenetics, molecular restoration studies of extinct forms of life’. Acta Chemica Scandinavica 17(1). T. Schultz & G. Churchill (1999). ‘The role of subjectivity in reconstructing ancestral character states: A Bayesian approach to unknown rates, states and transformation asymmetries.’. Syst. Biol. 48:651–664. T. Schultz, et al. (1996). ‘The reconstruction of ancestral character states’. Evolution 50:504. Y. Shi & S. Yokoyama (2003). ‘Molecular analysis of the evolutionary significance of ultraviolet vision in vertebrates’. Proc. Natl. Acad. Sci. USA. 100(14):8308–13. 5 A. Smit & A. Riggs (1996). ‘Tiggers and other DNA transposon fossils in the human genome’. Proc. Natl. Acad. Sci. 93:1443–1448. J. Thornton (2004). ‘Resurrecting ancient genes: experimental analysis of extinct molecules’. Nature Reviews Genetics 5:366–375. J. W. Thornton, et al. (2003). ‘Resurrecting the ancestral steroid receptor: ancient origin of estrogen signaling’. Science 301(5640):1714–7. Z. Yang, et al. (1995). ‘A new method of inference of ancestral nucleotide and amino acid sequences’. Genetics 141:1641–1650. J. Zhang & M. Nei (1997). ‘Accuracies of ancestral amino acid sequences inferred by parsimony, likelihood and distance methods’. J. Mol. Evol. 44(S1):S139– 46. J. Zhang & H. Rosenberg (2002). ‘Complementary advantageous substitutions in the evolution of an antiviral RNase of higher primates’. Proc Natl Acad Sci USA. 99(8):5486–91. 6