
A Model of Substitution Trajectories in Sequence Space and Long-Term Protein Evolution

Dinara R. Usmanova,1,2,3 Luca Ferretti,4,5 Inna S. Povolotskaya,2,3 Peter K. Vlasov,2,3 and Fyodor A. Kondrashov*,2,3,6

1 Moscow Institute of Physics and Technology, Institutskiy Pereulok 9, g.Dolgoprudny, Russia
2 Bioinformatics and Genomics Programme, Centre for Genomic Regulation (CRG), Barcelona, Spain
3 Universitat Pompeu Fabra (UPF), Barcelona, Spain
4 Systematique, Adaptation et Evolution (UMR 7138), UPMC University Paris 06, CNRS, MNHN, IRD, Paris, France
5 CIRB, Collège de France, Paris, France
6 Institucio Catalana de Recerca i Estudis Avançats (ICREA), Barcelona, Spain

*Corresponding author: E-mail: [email protected].
Associate editor: Hideki Innan

Abstract

The nature of factors governing the tempo and mode of protein evolution is a fundamental issue in evolutionary biology. Specifically, whether or not interactions between different sites, or epistasis, are important in directing the course of evolution has become one of the central questions. Several recent reports have scrutinized patterns of long-term protein evolution, claiming that they are compatible only with an epistatic fitness landscape. However, these claims have not yet been substantiated with a formal model of protein evolution. Here, we formulate a simple covarion-like model of protein evolution focusing on the rate at which the fitness impact of amino acids at a site changes with time. We then apply the model to data on convergent and divergent protein evolution to test whether or not the incorporation of epistatic interactions is necessary to explain the data. We find that convergent evolution cannot be explained without the incorporation of epistasis, and that the rate at which an amino acid state switches from being acceptable at a site to being deleterious is faster than the rate of amino acid substitution.
Specifically, for proteins that have persisted in modern prokaryotic organisms since the last universal common ancestor, for every amino acid substitution approximately ten amino acid states switch from being accessible to being deleterious, or vice versa. Thus, molecular evolution can only be perceived in the context of rapid turnover of which amino acids are available for evolution.

Key words: molecular evolution, fitness landscape, epistasis.

Introduction

Whether or not epistasis, a situation in which fitness depends on the interaction of alleles, plays a major role in molecular evolution is the subject of scrutiny and debate (de Visser et al. 2011; Lehner 2011; Breen et al. 2012; Hansen 2013; McCandlish et al. 2013; de Visser and Krug 2014). Two types of approaches are being used to reveal the type and amount of epistasis in protein evolution. First, there are studies that aim to reconstruct recent evolutionary trajectories, revealing potential epistatic interactions among substitutions that occurred recently (Weinreich et al. 2006; Bridgham et al. 2009; Lozovsky et al. 2009; Romero and Arnold 2009; Lunzer et al. 2010; Khan et al. 2011; Zhang et al. 2012; Covert et al. 2013). Second, there are studies that quantify the degree of epistasis among substitutions that occurred across long evolutionary periods (Miyamoto and Fitch 1995; Huelsenbeck 2002; Kondrashov et al. 2002; Choi et al. 2005; Bazykin et al. 2007; Wang et al. 2007; Rogozin et al. 2008; Rokas and Carroll 2008; Bollback and Huelsenbeck 2009; Povolotskaya and Kondrashov 2010; Gloor et al. 2010; Soylemez and Kondrashov 2012; Naumenko et al. 2012; de Juan et al. 2013; Wellner et al. 2013). Studies of the second type are often statistical in nature and usually cannot identify specific interactions, yet they provide a broader outlook on the nature of the fitness landscape across different areas of the sequence space.
Many of these studies claimed that different aspects of molecular evolution are not compatible with evolutionary models devoid of epistatic interactions. However, the dynamics of protein sequence divergence has not been subject to modeling with explicit parameters of epistasis.

Modeling protein (or DNA) sequence evolution is often concerned with estimating the rate at which two sequences diverge, typically describing sequence divergence as a Markov chain process. Initially, such models considered the neutral divergence of DNA sequences. The most general neutral, site-independent, general time-reversible (GTR) model (Tavare 1986), which was created following more restricted models (Jukes and Cantor 1969; Kimura 1980; Felsenstein 1981), allows different substitution rates for each nucleotide pair (see O’Meara 2012 for review). Within the existing Markov chain models, the probabilities of each site being occupied by each of the four nucleotides are estimated across time using a 4 × 4 matrix of nucleotide substitution rates (fig. 1). Amino acid substitution models are analogous in the sense that a 20 × 20 matrix of amino acid substitution rates can be used to estimate the probabilities of a site being occupied by each of the 20 amino acids.

© The Author 2014. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]
Mol. Biol. Evol. 32(2):542–554 doi:10.1093/molbev/msu318 Advance Access publication November 17, 2014

FIG. 1. Three categories of Markov chain models of protein evolution. The general time-reversible models estimate the probability that a site is occupied by a specific nucleotide, $Z$. The probability of finding specific nucleotides at a site changes with time and the rate of change is described by a 4 × 4 matrix, $R$, because each of the four nucleotides can change into the other three nucleotides with a certain rate $r_{i \to j}$. The $r_{i \to j}$ rates typically reflect the rate of mutation and, therefore, $Z[t+\tau] = Z[t]\,e^{R\tau}$ models the neutral rate of change of nucleotides across sites. As selection influences the rate of substitution in sites, it is introduced as a parameter $\omega$, giving models of the form $Z[t+\tau] = Z[t]\,e^{\omega R\tau}$. In that case $\omega > 1$ reflects the action of positive selection, which accelerates the rate of evolution, and $\omega < 1$ reflects negative selection, which slows down the rate of change of $Z$. As the action of selection may be different in different sites, some models attempt to capture the resulting rate variation across sites by assigning a different $\omega$ to different sites. The covarion models reflect the possibility that the rate of evolution of a site is itself subject to change with time. They introduce extra parameters allowing sites to switch among the different $\omega$ categories.

The first level of complication of these models arises from the impact of selection on sequence divergence. When the matrix of substitution rates reflects only the rate of mutation, the Markov chain models reflect neutral sequence divergence. The typical way to introduce selection into these models is to assume a multiplier of the mutation rate, $\omega$, which models selection by slowing down the rate of substitutions at sites under selection (fig. 1). For DNA sequence models the rate of evolution can be estimated separately for synonymous and nonsynonymous sites, with the matrix of nonsynonymous substitution rates multiplied by $\omega$, which is the single parameter determining the strength of selection. Amino acid sequence divergence models often use a precomputed matrix of amino acid substitution rates, such as BLOSUM or PAM (Dayhoff et al. 1978; Henikoff S and Henikoff JG 1992; Whelan and Goldman 2001).
The second level of complication of these models comes from the realization that the strength of selection may be different at different sites. A class of Markov chain models with variable rate of evolution across sites has been created (Nei and Gojobori 1986; Yang 1994) in which $\omega$ follows a specific distribution, typically a Gamma distribution. In practice, a discrete Gamma distribution is used and sites are, therefore, classified into a number of categories, each with a different rate multiplier $\omega_i$. Within this approach some sites may be completely invariable, in which case $\omega_i = 0$ for those sites.

The final level of complication considers the possibility that the strength of selection at a site changes with time. Such covarion models (Fitch and Markowitz 1970) introduce parameters that describe the rate of change between different substitution rate multipliers. The first such models allowed a site to switch between invariable and variable states (Tuffley and Steel 1998; Huelsenbeck 2002), and later ones allowed a site to switch between several $\omega_i$ categories (Galtier 2001). Wang et al. (2007) combined both approaches into a more general covarion model.

The models that incorporate selection do so by varying the rate of evolution across different sites without varying the probability of different substitutions across sites. Biologically speaking, we know that the fitness impact of an L→C substitution in one site may be radically different from that of the L→C substitution at a different site. However, assuming a different substitution matrix for different sites is impractical as it overparametrizes the model. Therefore, for the time being the assumption that sites differ from each other only in their rate of evolution, which in a covarion model is time-dependent, remains widely accepted. This assumption may not interfere with the model's ability to reliably estimate the overall rate of evolution.
However, such models may not be appropriate when differences in the rate of the same type of substitutions across different sites contribute substantially to the sequence divergence process.

A widely utilized verbal model of molecular evolution was formulated by Maynard Smith (1970). It compares protein evolution to a word game, whereby two words (proteins) with meaning (function) are connected by a series of one-letter (amino acid) changes (substitutions) such that a continuous pathway between these two words is created. The example used by Maynard Smith is that of the word “WORD” evolving into the word “GENE” through meaningful intermediates comprising a trajectory of substitutions in sequence space: WORD↔WORE↔GORE↔GONE↔GENE. This trajectory reveals an important property of time-dependence, or epistasis, of substitutions in different sites. For example, the substitution of R→N at the third site is meaningful only after the D→E and W→G substitutions have occurred in the fourth and first sites, respectively. If R→N at the third site were the first substitution in WORD, it would lead to the sequence WOND, which does not have meaning in the English language. The existing Markov chain based models may accurately estimate the rate of evolution of sequences where such time-dependence is common. However, such models cannot be used to study the time-dependence itself.

Here, we develop a mathematical model of long-term protein evolution focusing on the dynamics of sequences evolving through sequence space in a way similar to that described by Maynard Smith (1970). Instead of focusing on the rate of amino acid substitution as the estimated parameter of the model, we focus on the prevalence of epistatic interactions between amino acid states as the parameter of interest. We then attempt to fit our model to several recent observations of long-term evolution. We do not investigate issues related to phylogeny inference.
Results

The Concept of the Model

Verbal Model

We take Maynard Smith’s (1970) analogy as the basis for our model of sequence evolution, which investigates the impact of interactions of sites across the protein sequence. An exhaustive description of all possible interactions is impossible for a sequence of even moderate length ($L$), as it requires considering all $20^L$ sequences in sequence space. This vast number can be reduced by focusing on evolutionary trajectories within the sequence space. For the trajectory considered by Maynard Smith (1970), WORD→WORE→GORE→GONE→GENE, the entire sequence space, in English, is $26^4 = 456{,}976$. However, to understand the local restrictions imposed by the interactions of letters in different places across the word it is sufficient to describe the fitness landscape one substitution away from each of the five words in the trajectory (fig. 2).

The current sequence as well as the potential fitness impact of all single-letter substitutions can be shown in a matrix, which we call the fitness matrix. In the case of a four-letter word in English, a cell of the fitness matrix may have three different states, A, B, or C. First, C, or the “current” letter, represents the current state of the letter in a specific position. For example, the second letter of WORD is O and, therefore, the (2,O) cell of the matrix has the state C. Second, A stands for a letter that is currently “available” for evolution. In the second letter of the word WORD, the (2,A) cell has the state A because the O→A substitution in the second letter creates the word WARD, which has meaning in English. Finally, a substitution may be “blocked” (B), whereby if the substitution were to occur it would not create a meaningful word, but this letter at the same site may be present in another word. All substitutions in WORD in the second site other than O→A are of this type.
For example, the O→J substitution creates a meaningless WJRD; however, the letter J at the second site can be part of an actual word, AJAR, and, therefore, the (2,J) cell is occupied by state B (see table 1 for a list of all abbreviations).

Given a trajectory of substitutions in sequence space, it is possible to track not only the substitutions that have occurred but also the associated changes in the fitness impact of other substitutions. The first substitution in the given trajectory is the D→E substitution in the fourth letter. This is reflected in the A→C switch of the cell (4,E) and the reciprocal C→A switch of the (4,D) cell. Thus, a C↔A switch represents one substitution. Furthermore, each substitution may cause a number of A→B and B→A switches. In this case, the D→E substitution at site 4 causes two B→A switches in cells (2,E) and (2,I). This occurs because these substitutions, O→E and O→I, now lead to meaningful words (WERE and WIRE, respectively), whereas the same substitutions prior to the D→E substitution did not lead to a meaningful combination of letters (WERD and WIRD). If we consider the second substitution in the sequence, W→G in the first letter (leading to WORE→GORE), another B→A switch occurs in the cell (2,Y), which leads to the word GYRE, and three A→B switches occur at the same site in the cells (2,A), (2,E), and (2,I), which lead to the meaningless combinations GARE, GERE, and GIRE, respectively.

FIG. 2. The fitness matrices of words encountered in the trajectory of substitutions WORD→GENE described by Maynard Smith (1970). The fitness matrix of a specific sequence reflects both the current (C) sequence, with the state C in the corresponding cell of the matrix, and the fitness impact of all possible single-letter substitutions. For example, in the first word in the trajectory, “WORD”, there are 16 available (A) substitutions, out of 100 total possible ones, that would lead to another word in English (having high fitness). All other 84 states are blocked (B), meaning that if such a substitution were to occur it would not lead to a meaningful sequence of letters. A substitution that actually occurred in the trajectory is reflected by a bidirectional C↔A switch in two cells of the matrix. With every substitution the potential impact of other substitutions also changes (changes between the current and the previous fitness matrix are shown in orange).

The fitness matrix approach to investigating the trajectory of substitutions in the WORD↔GENE analogy can be further dissected to reveal various parallels with protein evolution. Some letters (sites) tend to have more substitutions available at any given moment (the first letter compared with the second in the example), some substitutions can transition rapidly between allowed and blocked, whereas others switch between these states at a much slower rate, and so on. However, investigating many or long trajectories of substitutions using the full fitness matrix may still be troublesome, although this has been done experimentally for a small number of proteins with very short trajectories (McLaughlin et al. 2012; Gong et al. 2013; Roscoe et al. 2013; Thyagarajan and Bloom 2014). Therefore, for ease of mathematical treatment we follow site-independent models and assume that all sites across the sequence have the same properties.

Our model aims to study the probability that a cell in the fitness matrix is in one of these three states (C, A, or B) and the rate of A↔B switches with every sequence substitution (A↔C switch). The GTR models investigate the rate of nucleotide substitutions on average across the entire sequence without tracking individual sites, yet they can reconstruct the expected sequence divergence between the evolving and the ancestral sequence.
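The word-game bookkeeping described above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `LEXICON` is a hypothetical, deliberately tiny stand-in for an English dictionary, and the function names are invented here. Because the lexicon is so small, many cells come out as F (a letter never seen at that position), which would not happen with a full dictionary, and the counts will not match those reported for figure 2.

```python
from string import ascii_uppercase

# Hypothetical toy lexicon (assumption): a tiny stand-in for an English dictionary.
LEXICON = {"WORD", "WORE", "GORE", "GONE", "GENE",
           "WARD", "WARE", "WERE", "WIRE", "GYRE", "AJAR"}

def fitness_matrix(word, lexicon):
    """Map (site, letter) -> 'C', 'A', 'B' or 'F'; sites are numbered from 1."""
    matrix = {}
    for pos, current in enumerate(word):
        # Letters that occur at this position in at least one word of the lexicon.
        seen_here = {w[pos] for w in lexicon}
        for letter in ascii_uppercase:
            mutant = word[:pos] + letter + word[pos + 1:]
            if letter == current:
                state = "C"            # current letter
            elif mutant in lexicon:
                state = "A"            # available: one change gives a real word
            elif letter in seen_here:
                state = "B"            # blocked now, usable in another background
            else:
                state = "F"            # never usable at this site (toy lexicon only)
            matrix[(pos + 1, letter)] = state
    return matrix

def switches(before, after):
    """Cells whose state differs between two fitness matrices."""
    return {cell: (before[cell], after[cell])
            for cell in before if before[cell] != after[cell]}

step1 = switches(fitness_matrix("WORD", LEXICON), fitness_matrix("WORE", LEXICON))
step2 = switches(fitness_matrix("WORE", LEXICON), fitness_matrix("GORE", LEXICON))
print(step1[(2, "E")], step2[(2, "Y")])
```

With this lexicon the D→E substitution shows up as the paired switches in cells (4,D) and (4,E), and the switches at site 2 discussed in the text, (2,E) and (2,I) becoming available and later (2,A), (2,E), and (2,I) becoming blocked, can be read off `step1` and `step2`.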
Similarly, our model investigates the rate of switches between different states of the fitness matrix cells across the entire matrix without tracking individual sites, and retains the ability to determine the expected divergence between the evolving and the ancestral sequence.

Fitness Matrix

The protein fitness matrix has $L$ columns and 20 rows, where $L$ is the length of the protein and 20 is the number of all amino acids. In addition to the three aforementioned states A, B, and C we introduce another state F, or a “forbidden” amino acid, which confers low fitness in all possible genetic backgrounds. In the WORD↔GENE analogy, forbidden states do not appear because for each letter of the alphabet at each of the four sites there exists at least one word in English in which that letter is used at the site in question.

Table 1. Abbreviations.

Possible states of cells in the fitness matrix
  C          Current
  $A_n$      Available and neighboring
  $A_f$      Available and far
  $B_n$      Blocked and neighboring
  $B_f$      Blocked and far
  F          Forbidden

Constants
  $L$        Length of protein sequence
  $m$        Number of non-F amino acids per site
  $\alpha$   Fraction of A cells among A and B cells
  $\gamma$   Number of switches A↔B occurring with one amino acid substitution
  $\delta$   Fraction of n cells among n and f cells
  $\varphi$  Number of n↔f switches occurring with one amino acid substitution

Rates of switches
  $s_{C \to A_n}$
  $s_{A_n \to C}$
  $s_{A_n \to B_n} = s_{A_f \to B_f} = s_{A \to B}$
  $s_{B_n \to A_n} = s_{B_f \to A_f} = s_{B \to A}$
  $s_{A_n \to A_f} = s_{B_n \to B_f} = s_{n \to f}$
  $s_{A_f \to A_n} = s_{B_f \to B_n} = s_{f \to n}$

Markov process variables
  $Z = (Z_C, Z_{A_n}, Z_{A_f}, Z_{B_n}, Z_{B_f})$   Vector of probabilities that a cell is in a particular state
  $R$        Rate matrix of switches

Empirical observations (Povolotskaya and Kondrashov 2010)
  $D$        Protein distance
  $U$        Amino acid usage
  $N_t$      Number of substitutions toward the reference sequence
  $N_a$      Number of substitutions away from the reference sequence
  $K_c$      Rate of convergent evolution
  $K_d$      Rate of divergent evolution
  $K_4$      Rate of synonymous evolution in 4-fold sites
The fitness matrix as we define it assumes a binary distribution of fitnesses, such that a genotype can have a high, 1, or low, 0, fitness without intermediate values (C and A correspond to high fitness, whereas B and F correspond to low fitness).

Because we consider trajectories of amino acid substitutions we must also take into account properties of the genetic code table. Some amino acid substitutions may be available from a fitness perspective; however, they cannot occur at the present moment because more than one nucleotide substitution is required at the DNA level. We, therefore, segregate the available (A) and blocked (B) cells into mutationally “neighboring” (n) states, $A_n$ and $B_n$, and nonneighboring states labeled with the f subscript (for “far”), $A_f$ and $B_f$. Thus, we define six distinct states: C, $A_n$, $A_f$, $B_n$, $B_f$, and F, with $A_n$ and $A_f$ forming set A and $B_n$ and $B_f$ forming set B. Conversely, $A_n$ and $B_n$ form set n, all amino acid states that can be reached by a single nucleotide substitution, and $A_f$ and $B_f$ form set f, those amino acid states that cannot. Each cell of the fitness matrix has one of these six states and the model considers the rate of switches among them (fig. 3).

FIG. 3. Switches between five states in the fitness matrix. The current amino acid state can switch into an available amino acid that is one nucleotide substitution away ($C \leftrightarrow A_n$), which reflects one amino acid substitution. With every $C \leftrightarrow A_n$ switch, amino acid states that were previously available to evolution become blocked ($A_{n/f} \to B_{n/f}$) and, vice versa, other amino acid states that were blocked become available ($B_{n/f} \to A_{n/f}$). Furthermore, with every $C \leftrightarrow A_n$ switch, $\varphi$ amino acid states that were previously in the mutational neighborhood become unreachable with one nucleotide mutation ($A_n \to A_f$ or $B_n \to B_f$ switches) and vice versa ($A_f \to A_n$ or $B_f \to B_n$ switches). F never changes because it reflects those amino acid states that can never be found in a protein sequence.
Evolution and Switches between States in the Fitness Matrix

Three types of switches of the cell state are possible in the fitness matrix. First, an amino acid substitution causes one cell to switch from state C to state $A_n$ and another cell to make the opposite switch, from state $A_n$ to state C. Second, in the same site the sets of n-labeled and f-labeled cells also change, with some amino acids that were previously more than one nucleotide substitution away becoming neighboring amino acids and vice versa. Third, with every substitution some previously blocked amino acids become available and some available amino acids become blocked. Thus, the dynamics of accumulating substitutions in a sequence can be described by the change of state of cells of the fitness matrix, with a total of ten different switches: $C \to A_n$, $A_n \to C$, $A_n \to A_f$, $A_f \to A_n$, $B_n \to B_f$, $B_f \to B_n$, $A_n \to B_n$, $B_n \to A_n$, $A_f \to B_f$, and $B_f \to A_f$ (fig. 3).

Markov Process for States of a Single Cell in the Fitness Matrix

We introduce an approach to analyze the state of a single cell of the fitness matrix. We describe a non-F cell with a vector $Z$ that contains the probabilities that the cell is in each of the five possible states, $Z = (Z_C, Z_{A_n}, Z_{A_f}, Z_{B_n}, Z_{B_f})$. The probability of state F is not included because non-F cells never become F and vice versa. Vector $Z$ changes with time because cells switch between states. We measure the rate of the switches $s_{i \to j}$ as the expected number of switches occurring with every substitution in a site. Therefore, $s_{C \to A_n} = 1$. Moreover, we consider the fitness impact of an amino acid state to be independent of its mutational neighborhood, that is, whether a cell is in state A or B is independent of whether it is n or f. Therefore,

\[
s_{A_n \to B_n} = s_{A_f \to B_f} = s_{A \to B}, \quad
s_{B_n \to A_n} = s_{B_f \to A_f} = s_{B \to A}, \quad
s_{A_n \to A_f} = s_{B_n \to B_f} = s_{n \to f}, \quad
s_{A_f \to A_n} = s_{B_f \to B_n} = s_{f \to n}.
\]
The switches form a Markov process with the rate matrix $R$:

\[
R =
\begin{array}{c|ccccc}
 & [C] & [A_n] & [A_f] & [B_n] & [B_f] \\ \hline
[C] & \cdot & 1 & 0 & 0 & 0 \\
[A_n] & s_{A_n \to C} & \cdot & s_{n \to f} & s_{A \to B} & 0 \\
[A_f] & 0 & s_{f \to n} & \cdot & 0 & s_{A \to B} \\
[B_n] & 0 & s_{B \to A} & 0 & \cdot & s_{n \to f} \\
[B_f] & 0 & 0 & s_{B \to A} & s_{f \to n} & \cdot
\end{array}
\]

where the diagonal entries (shown as dots) are determined by the constraint that rows sum to 0. The transition probabilities between different states can then be obtained by taking the exponent of the product of the rate matrix and time measured as the number of substitutions per site, or $e^{Rt}$. The distribution of probabilities of finding a cell in a particular state at time $t$ equals

\[
Z[t] = Z[0]\, e^{Rt}. \tag{1}
\]

Figure 4 shows an example of the evaluation of $Z[t]$ for $Z[0] = (1,0,0,0,0)$ and for switching rates in the same value range as that obtained for real protein families (see next section).

Constants

To estimate the $s_{i \to j}$ rates in the rate matrix, several constants are necessary. First, we introduce $m$, the average number of non-F cells per column. Biologically, $m$ is the number of amino acids per site that can confer nonzero fitness in at least one genetic background. Second, we introduce $\alpha$, the fraction of cells with state A among all cells with states A and B. Third, we introduce $\gamma$, the number of switches A↔B that happen with one amino acid substitution. The $\alpha$ and $\gamma$ constants are related in that $\alpha$ describes the relative amount of A and B states and $\gamma$ describes the rate of change between them.

Two more constants are estimated from the genetic code. The fraction of cells with state n among all cells with states n and f is $\delta = 0.4$, calculated by considering two amino acids as neighbors if any of their codons are separated by one nucleotide substitution. The number of switches n↔f is $\varphi_0 = 3.3$, also calculated as an average across all codons for all 20 amino acids. The $\delta$ and $\varphi$ constants are related in that $\delta$ stands for the number of neighboring amino acids, whereas $\varphi$ stands for how rapidly this set of neighboring amino acids changes.
However, we are interested only in switches between non-F states. Thus, we define the number of possible n↔f switches per site as

\[
\varphi = \varphi_0\, \frac{m-1}{20-1}. \tag{2}
\]

The constant 1 appears in the numerator and denominator because at each site one amino acid is C, which is neither n nor f.

FIG. 4. Numerical evaluation of components of $Z[t]$. The initial condition is $Z[0] = (1,0,0,0,0)$ with constants $\alpha = 0.06$, $\gamma = 5$, $m = 7.3$.

Using these constants we calculate the number of cells with different states in one column of the fitness matrix and the stationary probabilities of non-F states to which the components of vector $Z$ converge independently of the initial state (table 2).

Table 2. Number of Cells with Various States in the Fitness Matrix.
$N_j$, number of cells with state $j$ in one column of the fitness matrix; $Z_j^{\infty}$, equilibrium frequency of state $j$ for a non-F cell.

  State $j$   $N_j$                          $Z_j^{\infty}$
  C           $1$                            $\frac{1}{m}$
  $A_n$       $\alpha\delta(m-1)$            $\alpha\delta\,\frac{m-1}{m}$
  $A_f$       $\alpha(1-\delta)(m-1)$        $\alpha(1-\delta)\,\frac{m-1}{m}$
  $B_n$       $(1-\alpha)\delta(m-1)$        $(1-\alpha)\delta\,\frac{m-1}{m}$
  $B_f$       $(1-\alpha)(1-\delta)(m-1)$    $(1-\alpha)(1-\delta)\,\frac{m-1}{m}$
  F           $20-m$                         not applicable

We then derive the rates of switches as the number of switches divided by the number of cells from which a switch could have occurred:

\[
s_{A_n \to C} = \frac{1}{N_{A_n}} = \frac{1}{\alpha\delta(m-1)},
\]
\[
s_{A \to B} = \frac{\gamma}{N_{A_n} + N_{A_f}} = \frac{\gamma}{\alpha(m-1)}, \qquad
s_{B \to A} = \frac{\gamma}{N_{B_n} + N_{B_f}} = \frac{\gamma}{(1-\alpha)(m-1)},
\]
\[
s_{n \to f} = \frac{\varphi}{N_{A_n} + N_{B_n}} = \frac{\varphi}{\delta(m-1)}, \qquad
s_{f \to n} = \frac{\varphi}{N_{A_f} + N_{B_f}} = \frac{\varphi}{(1-\delta)(m-1)}.
\]

Thus, our model has five constants: two of them, $\delta$ and $\varphi$, characterize the genetic code, whereas the other three, $m$, $\alpha$, and $\gamma$, characterize the fitness landscape. To estimate realistic ranges of $m$, $\alpha$, and $\gamma$, we use empirical data on the evolution of protein sequences.

Fitting Observations to Model

Our model is aimed at testing recent statements regarding the epistatic nature of long-term protein evolution. Specifically, the model is used to fit previously obtained data on protein sequence divergence from Povolotskaya and Kondrashov (2010) and Breen et al. (2012). We thus provide a brief description of the previously published data that we investigate here with the present model.
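As a check on the switch rates and on the stationary frequencies of table 2, the following minimal Python sketch builds the rate matrix in the state order (C, An, Af, Bn, Bf) and verifies numerically that the stationary vector is left unchanged by it. The function and variable names are assumptions made for this illustration, and the constants are the values quoted in the caption of figure 4 together with the genetic-code constants 0.4 and 3.3 given in the text.

```python
def switch_rates(m, alpha, gamma, delta=0.4, phi0=3.3):
    """Per-substitution switch rates; names are illustrative, not the authors'."""
    k = m - 1                      # non-F cells per column other than C
    phi = phi0 * k / (20 - 1)      # equation (2)
    return {
        "An->C": 1 / (alpha * delta * k),
        "A->B": gamma / (alpha * k),
        "B->A": gamma / ((1 - alpha) * k),
        "n->f": phi / (delta * k),
        "f->n": phi / ((1 - delta) * k),
    }

def rate_matrix(s):
    """Rate matrix R with rows and columns ordered C, An, Af, Bn, Bf."""
    R = [
        [0, 1, 0, 0, 0],                         # C  -> An at rate 1
        [s["An->C"], 0, s["n->f"], s["A->B"], 0],
        [0, s["f->n"], 0, 0, s["A->B"]],
        [0, s["B->A"], 0, 0, s["n->f"]],
        [0, 0, s["B->A"], s["f->n"], 0],
    ]
    for i, row in enumerate(R):
        row[i] = -sum(row)                       # each row sums to zero
    return R

def stationary(m, alpha, delta=0.4):
    """Equilibrium frequencies of the non-F states (table 2)."""
    k = (m - 1) / m
    return [1 / m,
            alpha * delta * k,
            alpha * (1 - delta) * k,
            (1 - alpha) * delta * k,
            (1 - alpha) * (1 - delta) * k]

R = rate_matrix(switch_rates(m=7.3, alpha=0.06, gamma=5))
Z_inf = stationary(m=7.3, alpha=0.06)
# Net probability flux into each state at equilibrium, i.e. the components of Z_inf * R.
flux = [sum(Z_inf[i] * R[i][j] for i in range(5)) for j in range(5)]
print(max(abs(x) for x in flux))
```

The printed value should be zero up to floating-point rounding, which is one way to see that the rates and the table 2 frequencies are mutually consistent.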
Polarization of amino acid substitutions with one or more outgroup sequences reveals the directionality of evolution. For example, if one sequence has an Alanine (A), another sequence contains a Threonine (T) in the orthologous site, and closely related outgroup sequences also contain a Threonine, then a T→A substitution is inferred. Such polarized substitutions can then be related to a fourth, reference, sequence when the reference sequence is an outgroup to all three sequences involved in the polarization. Following our example, if the orthologous site in the reference sequence is occupied by a T, then the T→A substitution can be inferred to be a substitution away from the reference sequence. Conversely, if the orthologous site in the reference sequence is occupied by an A, then the T→A substitution can be inferred to be a substitution toward the reference sequence. The ratio of the sum of all toward substitutions ($N_t$) to the sum of all away substitutions ($N_a$), $N_t/N_a$, can then be taken as a measure of the relative rate of divergence of the sister sequences and the reference sequence. The observation that sister and reference sequences that have already diverged considerably continue to do so, that is, $N_t/N_a < 1$ for large values of sequence divergence (protein distance, $D$), has been claimed to be a consequence of epistatic interactions between sites in the evolving sequences (Povolotskaya and Kondrashov 2010).

The rate of divergence of sequences from each other can be estimated in a manner analogous to the $N_t/N_a$ measure by decomposing $N_t/N_a$ into two independent rates of evolution: the rate of divergent evolution ($K_d$) and the rate of convergent evolution ($K_c$). The rate of divergent evolution is estimated as $N_a$ divided by the number of sites in which an away substitution could have occurred. The number of such divergent sites is estimated as the number of sites in which the ancestral state of the two sister sequences matches the reference sequence.
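The polarization procedure can be illustrated with a short Python sketch. It is not the pipeline used to produce the published data: the function name, the use of a single outgroup, and the four short aligned sequences are assumptions invented for illustration, and the sketch presumes that the reference is an outgroup to the other three sequences.

```python
def count_toward_away(sister1, sister2, outgroup, reference):
    """Count substitutions toward (N_t) and away from (N_a) the reference.

    All four sequences are assumed to be aligned and of equal length.
    """
    n_toward = n_away = 0
    for a, b, out, ref in zip(sister1, sister2, outgroup, reference):
        if a == b:
            continue                      # no substitution between the sisters
        if out == a:
            ancestral, derived = a, b     # substitution occurred on sister2's lineage
        elif out == b:
            ancestral, derived = b, a     # substitution occurred on sister1's lineage
        else:
            continue                      # outgroup matches neither: cannot polarize
        if ref == ancestral:
            n_away += 1                   # moved away from the reference state
        elif ref == derived:
            n_toward += 1                 # moved toward the reference state
        # otherwise the reference carries a third state and the site is not counted
    return n_toward, n_away

# Hypothetical toy alignment (five orthologous sites).
n_t, n_a = count_toward_away("ATKLG", "TTRLA", "TTKLA", "ATKVA")
print(n_t, n_a, n_t / n_a)
```

In this toy alignment the first site is a T→A substitution toward the reference, whereas the third and fifth sites are substitutions away from it.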
Similarly, $K_c$ is $N_t$ divided by the number of convergent sites, in which a toward substitution could have occurred. The number of convergent sites is the number of sites that are occupied by a different amino acid in the ancestor of the sister sequences and in the reference sequence, and for which the two amino acids are separated by only a single nucleotide substitution. $K_d$ was estimated to be substantially slower than the rate of synonymous divergence in 4-fold sites ($K_4$), independently of $D$ (Povolotskaya and Kondrashov 2010). In contrast, $K_c$ is comparable to $K_4$ when $D$ is near 0 and rapidly declines as $D$ increases, reaching a plateau slightly above $K_d$ (fig. 5). The dependence of $K_c/K_4$ but not $K_d/K_4$ on $D$, and similar observations (Kondrashov et al. 2002; Choi et al. 2005; Rogozin et al. 2008; Rokas and Carroll 2008; Bollback and Huelsenbeck 2009; Gloor et al. 2010; Naumenko et al. 2012; Soylemez and Kondrashov 2012), have been claimed to show support for epistasis in protein evolution but have not been modeled.

FIG. 5. Relative rate of protein evolution. $K_c/K_4$ and $K_d/K_4$ are plotted with distinct markers (from Povolotskaya and Kondrashov 2010). We fit the observed $K_c/K_4$ to that calculated by the solution of equation (4), varying $\gamma$ as a parameter. The optimal fit was found for $\gamma \approx 5$ (thick solid line). Two near fits for $\gamma = 4$ and $\gamma = 6$ are depicted with thin solid lines. The thick dashed line shows $K_c/K_4$ for significantly higher values of $\gamma$ and the thick dotted line for significantly lower values.

The final observation that at present requires more formal modeling deals with amino acid usage ($U$), the number of different amino acids in an orthologous site. In a large multiple sequence alignment, $U \approx 9$ (Breen et al. 2012), or approximately half of all possible amino acids. However, the same proteins exhibit a short-term rate of nonsynonymous evolution ($K_n$) up to an order of magnitude lower than the $K_n \approx 0.5$ that is expected if an amino acid state has the same effect in all species.
The reason why amino acids are accepted in the long term yet rejected in the short term may be that epistatic interactions are common (Fitch and Markowitz 1970; Maynard Smith 1970; Povolotskaya and Kondrashov 2010; Breen et al. 2012); however, this observation has also not been put into the formal context of a macroevolutionary model.

We obtained quadruplet alignments from Povolotskaya and Kondrashov (2010), where each quadruplet alignment consisted of two sister sequences, one outgroup sequence, and one reference sequence. Such quadruplet alignments were available for 572 clusters of orthologous groups (COGs), functional gene families that were predicted to have been present in the last universal common ancestor, LUCA (Mirkin et al. 2003). The reference sequence from the alignments corresponds to the sequence at t = 0 in our model.

Distance

As a sequence diverges from the sequence it was at t = 0, we can estimate the protein distance between them as a function of time. The protein distance (D) between two sequences is defined as 1 minus the sequence identity. The sequence identity between the sequence at time t and at t = 0 equals the probability that a cell that was in state C at t = 0 is in state C at time t, that is, $Z_C[t]$ when Z[0] = (1,0,0,0,0). This gives:

$$D[t] = 1 - Z_C[t]\,\Big|_{Z[0]=(1,0,0,0,0)}. \tag{3}$$

We then consider the value of D at equilibrium, at t = ∞, as $D^{\infty} = 1 - Z_C^{\infty} = 1 - 1/m$. In Povolotskaya and Kondrashov (2010), the distance limit has been estimated as the point at which Nt/Na = 1. In principle, we can use $D^{\infty}$ to estimate one of the constants, the number of non-F amino acids m, as

$$m = \frac{1}{1 - D^{\infty}}. \tag{4}$$

From Povolotskaya and Kondrashov (2010) the divergence equilibrium (Nt/Na = 1) was estimated as $D^{\infty} \approx 0.90$–$0.95$. However, from equation (4) it is clear that when $D^{\infty}$ is large a difference of just 5% leads to a large scatter of m, with m = 10–20. Thus, although it appears that m is likely to be high, consistent with the large U observed by Breen et al. (2012), this approach is not suitable for an accurate estimation of m.

Sequence Divergence and Convergence

The ratio Kd/K4 relates the rate of divergence of nonsynonymous substitutions to the rate of 4-fold synonymous evolution. In terms of our model, a nucleotide substitution leading to an amino acid in state A fixes with the same probability as a synonymous substitution. Thus, Kd/K4 equals the proportion of substitutions that lead to an amino acid in state A; in other words, it is the ratio of the number of cells in state A to the number of cells in all non-C states. Therefore,

$$\frac{K_d}{K_4} = \frac{N_A}{20 - N_C} = \frac{N_A}{19}, \tag{5}$$

where $N_A$, the number of available amino acids per site, is the fraction of available amino acids multiplied by (m − 1).

The ratio Kc/K4 relates the rate of convergent amino acid substitutions to the 4-fold synonymous rate of evolution. Kc measures the rate of substitution toward the reference sequence, or toward cells in the fitness matrix that were in state C at t = 0. Amino acids in state A fix with the same probability as synonymous substitutions. However, when calculating Kc only amino acids in the mutational neighborhood are taken into account. Thus, Kc/K4 equals the probability that a cell that was in state C at time t = 0 is in state An at time t divided by the probability that the cell is in state n:

$$\frac{K_c}{K_4}[t] = \left.\frac{Z_{An}[t]}{Z_{An}[t] + Z_{Bn}[t]}\right|_{Z[0]=(1,0,0,0,0)}. \tag{6}$$

First, we explore equation (6) as t → 0. At t = 0, $Z_{An} = 0$ and $Z_{Bn} = 0$, so Kc/K4 is not defined, because there are no convergent sites. Applying L'Hôpital's rule, we find

$$\frac{K_c}{K_4}[t \to 0] = \frac{dZ_{An}[0]/dt}{dZ_{An}[0]/dt + dZ_{Bn}[0]/dt} = 1.$$

That is in good agreement with Kc/K4 from Povolotskaya and Kondrashov (2010). Second, we explore Kc/K4[t] when t → ∞. From table 2:

$$\frac{K_c}{K_4}\big[D = D^{\infty}\big] = \frac{Z_{An}^{\infty}}{Z_{An}^{\infty} + Z_{Bn}^{\infty}}, \tag{7}$$

which equals the fraction of available amino acids among all available and blocked states. From data we estimate Kc/K4 → 0.06, which implies that this fraction is 0.06. Kd/K4 is approximately constant and equals 0.02; substituting it and the fraction of available amino acids into equation (5), we get m = 7.3. From equation (2), we calculate φ = 1.1.
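As a rough numerical check of equations (4) and (5), the following Python sketch reproduces the arithmetic quoted above. The function and argument names are illustrative assumptions and are not part of the original analysis; `avail_fraction` stands for the fraction of available amino acids (0.06) and `kd_over_k4` for the approximately constant Kd/K4 (0.02).

```python
def m_from_distance_limit(d_inf):
    """Equation (4): m = 1 / (1 - D_inf), with D_inf the equilibrium distance."""
    return 1.0 / (1.0 - d_inf)


def m_from_divergence_rate(kd_over_k4, avail_fraction):
    """Equation (5) solved for m.

    Kd/K4 = N_A / 19 with N_A = avail_fraction * (m - 1),
    hence m = 1 + 19 * (Kd/K4) / avail_fraction.
    """
    return 1.0 + 19.0 * kd_over_k4 / avail_fraction


if __name__ == "__main__":
    # A 5% change in the distance limit doubles the implied m.
    for d_inf in (0.90, 0.95):
        print(f"D_inf = {d_inf:.2f} -> m = {m_from_distance_limit(d_inf):.1f}")

    # Estimate based on the rates of divergent and convergent evolution.
    m = m_from_divergence_rate(kd_over_k4=0.02, avail_fraction=0.06)
    print(f"m from equation (5) = {m:.1f}")
    print(f"N_A = {0.06 * (m - 1):.2f}")
```

The first two lines of output illustrate why equation (4) is unsuitable for estimating m precisely, whereas the rate-based estimate from equation (5) gives m of about 7.3.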
Estimating the Degree of Epistasis in Protein Evolution

In the previous sections, we estimated values for all constants used in the model except for the rate of A↔B switches, which reflects the amount of epistasis in evolution. To obtain it, we fit the empirically obtained Kc/K4 from Povolotskaya and Kondrashov (2010) to the Kc/K4[D] predicted by our model. We evaluate equation (1) numerically, varying the rate of A↔B switches as a parameter. For a given rate, we use the functions $Z_C(t)$, $Z_{An}(t)$, $Z_{Af}(t)$, $Z_{Bn}(t)$, and $Z_{Bf}(t)$ obtained from equation (1) to calculate Kc/K4[t] in equation (7) and D[t] in equation (3). We repeat this process, varying the parameter until we obtain a good fit between the observed and predicted Kc/K4 across D (fig. 5). Mathematically speaking, we minimize the mean squared relative error of the fit:

$$\min\; \frac{1}{N_D} \sum_{i=1}^{N_D} \left( \frac{(K_c/K_4)_{\mathrm{emp}}(D_i) - (K_c/K_4)_{\mathrm{theor}}(D_i)}{(K_c/K_4)_{\mathrm{theor}}(D_i)} \right)^{2}, \tag{8}$$

where $N_D$ is the number of data points for Kc/K4[D]. A rate of 5 ± 1 provides the best fit (fig. 5).

Evolution Relative to a Reference Sequence, Nt/Na

As we have estimated all parameters of the model, we can now estimate Nt/Na:

$$\frac{N_t}{N_a} = \frac{N_t^{\mathrm{sites}}\, N_t^{\mathrm{subst}}}{N_a^{\mathrm{sites}}\, N_a^{\mathrm{subst}}}, \tag{9}$$

where $N_t^{\mathrm{sites}}$ ($N_a^{\mathrm{sites}}$) is the number of convergent (divergent) sites in the protein, and $N_t^{\mathrm{subst}}$ ($N_a^{\mathrm{subst}}$) defines how many substitutions from one convergent (divergent) site toward (away from) the reference are possible. Convergent sites are those from which a single nucleotide substitution can lead to the reference amino acid, provided that amino acid is currently available. Thus, $N_t^{\mathrm{sites}} \sim Z_{An}$. For every convergent site, only one substitution can lead toward the reference amino acid: $N_t^{\mathrm{subst}} = 1$. Sites with an amino acid that matches the reference are divergent, so $N_a^{\mathrm{sites}} \sim Z_C$. $N_a^{\mathrm{subst}}$ equals the average number of available and neighboring amino acids per column in the fitness matrix: $N_a^{\mathrm{subst}} = N_{An}$. Therefore, using equations (5) and (6), we get:

$$\frac{N_t}{N_a}[t] = \frac{Z_{An}[t]}{Z_C[t]\, N_{An}} = \frac{K_c}{K_d}[t]\, \left.\frac{N_A \left( Z_{An}[t] + Z_{Bn}[t] \right)}{19\, N_{An}\, Z_C[t]}\right|_{Z[0]=(1,0,0,0,0)}. \tag{10}$$

Equation (10) allows us to convert Kc/Kd into Nt/Na and vice versa.
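The fitting criterion of equation (8) can be sketched in Python as follows. This is a minimal illustration only: the function names are hypothetical, the theoretical curves would come from numerically solving equation (1), which is not reproduced here, and the numbers in the usage example are made-up toy values rather than data from Povolotskaya and Kondrashov (2010).

```python
def mean_squared_relative_error(empirical, theoretical):
    """Equation (8): mean over data points of ((emp - theor) / theor) ** 2."""
    if len(empirical) != len(theoretical) or not empirical:
        raise ValueError("both series must be non-empty and of equal length")
    total = sum(((e - t) / t) ** 2 for e, t in zip(empirical, theoretical))
    return total / len(empirical)


def best_fit_parameter(empirical, predictions):
    """Return the candidate parameter value whose predicted curve minimizes equation (8).

    `predictions` maps each candidate value to the list of predicted Kc/K4
    values at the same distances D_i as the empirical series.
    """
    return min(
        predictions,
        key=lambda value: mean_squared_relative_error(empirical, predictions[value]),
    )


if __name__ == "__main__":
    # Toy values for illustration only; they are NOT taken from the paper.
    toy_empirical = [0.80, 0.40, 0.15, 0.08]
    toy_predictions = {
        4: [0.85, 0.48, 0.20, 0.10],
        5: [0.80, 0.41, 0.15, 0.08],
        6: [0.75, 0.33, 0.11, 0.07],
    }
    print(best_fit_parameter(toy_empirical, toy_predictions))
```

Using the relative rather than the absolute error keeps the small Kc/K4 values at large D from being swamped by the large values near D = 0.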
We calculate Nt/Na using the rates of divergent and convergent evolution from Povolotskaya and Kondrashov (2010) and estimate Z from equation (1) when Z[0] = (1,0,0,0,0) and with the five estimated constants. Figure 6 shows the comparison of the data (fig. 3 from Povolotskaya and Kondrashov 2010) and the estimate obtained with our model.

FIG. 6. Observed and predicted relative rates of sequence divergence. The observed values of Nt/Na are shown together with the fit predicted by our model using optimal parameters.

Usage

$N_A$, the fraction of available amino acids multiplied by (m − 1), equals 0.06(7.3 − 1) = 0.38 and indicates how many amino acid substitutions are available at a given moment. Thus, on average less than 1 amino acid substitution per site is acceptable at a time. However, when longer time periods are taken into account, up to eight different amino acids may be found at the same site across different species (Breen et al. 2013). These two observations can be reconciled if the number of A↔B switches associated with one C↔An switch is relatively high. For the data from Povolotskaya and Kondrashov (2010), our model predicts a value of ≈5. Such a high value implies that per site there are five times more switches between the available and blocked states of amino acids than there are actual amino acid substitutions at the same site. It follows that the distribution of available amino acids changes substantially even for a very small number of substitutions at that site. Therefore, we can model all non-F amino acids (m) as having an equal probability of occurring at a site after a substitution, and the usage expected after a certain amount of sequence divergence becomes a simple combinatorial problem.

Let N be the number of substitutions observed per site, with every substitution being a random choice from m. Reformulating the problem, we have a pool of objects of m different colors; we then take N random objects from the pool and calculate n, the number of different colors among the N objects taken. The value of n varies across trials, but we consider amino acid usage to be equal to the expectation of the distribution of n. The probability of getting n different colors in a set of N objects is given by the number of possible sets with n different colors divided by the total number of possible sets, which is $m^N$. The number of ways to partition N objects into n nonempty subsets equals the Stirling number of the second kind, S(N,n). There are m!/(m − n)! ways to associate these subsets with n different colors:

$$P[n] = S(N,n)\,\frac{m!}{m^{N}\,(m-n)!}, \qquad \mathrm{usage} = E(n). \tag{11}$$

We estimate m = 9 (see next section) as the mean value of the distribution of m among different COGs (fig. 7) and, using equation (11), calculate the dependence of usage on N (fig. 8). As a result, the amino acid usage is considerable even for N ~ 10 substitutions per site, with usage almost approaching its maximal value at 30 substitutions per site.

Considering the Distribution of Parameters across Gene Families

In the previous sections, we used data that were obtained by taking average values across all 572 COGs (Povolotskaya and Kondrashov 2010). Here, we use data on the same variables that were obtained for each COG separately. For some protein families, there are not enough data to calculate Kc/K4 and Kd/K4 with an acceptable degree of precision. Therefore, we selected 119 COGs for which Kd/K4 and Kc/K4 were defined across distances 0.1–0.8 without strong outliers. For each COG, we obtained Kd/K4 as an average Kd/K4 across different D between the trios and the reference sequence in bins of 0.1 of D. We presume that the fraction of available amino acids is the limit of Kc/K4 as D approaches its equilibrium value. Then, we calculate m using equation (5) and φ using equation (2). Finally, we fit Kc/K4 with numerical evaluation of equation (1) and find the rate of A↔B switches. Distributions of these parameters for the 119 COGs are shown in figure 7.

Discussion

The model presented here makes three assumptions.
The first assumption is that fitness is either 0 or 1. Binary fitness landscapes, while unrealistic in certain situations, are nevertheless useful when recapitulating the multidimensionality of the sequence space (Gavrilets 1997; Gavrilets and Gravner 1997; van Nimwegen et al. 1999; Aita et al. 2003; Gravner et al. 2007). When fitness values are either 1 or 0, mutations are either neutral or lethal and all permitted substitutions have an equal probability of occurrence. Thus, the impact of slightly deleterious and beneficial substitutions on molecular evolution cannot be taken into account, even though their contribution may be considerable (Ohta 1998; Andolfatto 2005; Popadin et al. 2007; McCandlish et al. 2013). The reason for excluding the effects of such substitutions is twofold. First, there are the practical considerations of modeling a complex and multidimensional fitness landscape in which substitutions are not equal in effect. The impact of slightly deleterious and beneficial substitutions on evolution in the absence of epistasis is well characterized (Crow and Kimura 1970), whereas the effects of epistasis on long-term molecular evolution have not been subject to the same level of scrutiny. Our work develops a null model that may be expanded to incorporate these effects. Second, it appears certain that the fixation of neither slightly deleterious nor beneficial alleles can serve as the basis for explaining the measurements considered here, at least without the impact of epistasis.

The fixation of slightly deleterious alleles is compatible with a large U under some circumstances (McCandlish et al. 2013). However, slightly deleterious alleles alone cannot lead to large sequence divergence in fast-evolving proteins and are incompatible with reaching maximum sequence divergence distances substantially slower than the rate of divergence of neutral sequences (Kondrashov et al. 2010). There is a tradeoff between the contribution of slightly deleterious alleles toward amino acid usage and toward sequence divergence; it is not possible to have a high U and a high D at the same time solely due to the fixation of deleterious alleles (Breen et al. 2013). Furthermore, periodic fixation of nonepistatic slightly deleterious alleles cannot lead to a decline in the rate of convergent evolution, in contrast to the observed Kc/K4 relationship (Povolotskaya and Kondrashov 2010). Therefore, slightly deleterious alleles may contribute to molecular evolution, but a model based on the accumulation of slightly deleterious alleles alone cannot explain all of the available data (also see Kondrashov et al. 2010).

A similar argument is applicable to the impact of beneficial alleles on the observations of molecular evolution considered here. The large sequence divergence observed between sequences is compatible with most differences between sequences having been fixed by positive selection. However, the high values of Kc/K4, especially for small D, are incompatible with sequence divergence being largely driven by the accumulation of beneficial alleles.

FIG. 7. Distribution of estimated parameters for 119 COGs. The distributions of the number of nonforbidden amino acids per site (m), the proportion of available amino acids over all available and blocked states, and the rate of A↔B switches are shown.

FIG. 8. Estimating amino acid usage. Usage calculated as the most probable number of amino acids to be observed at a site as a function of the number of accumulated substitutions per site. The solid line represents the number of nonforbidden amino acids at a site (m).

The second assumption is that all sites in the protein are expected to have the same properties (described by the five parameters of the model).
This includes genetic properties, such as the number of mutational neighbors at each site, and properties of the protein, such as the number of available, blocked, and forbidden amino acids at a site. It may be possible to relax this assumption by introducing variation across sites into our model, in a manner similar to what has been done for models of the rate of protein evolution (Nei and Gojobori 1986; Yang 1994). We hope that the present work will be a stepping stone in this direction. The third assumption is that all five parameters are independent of each other. This may not be entirely the case; for example, amino acids in the mutational neighborhood of each other are more likely to be available (Freeland et al. 2000).

Our model is not unique in modeling interactions between different alleles to describe the fitness landscape. However, the model presented here differs substantially from models developed previously. Previous efforts were principally concerned with the shape of the fitness landscape (Kauffman and Levin 1987; Macken and Perelson 1989; Gavrilets 1997; Gavrilets and Gravner 1997; Ohta 1998; Kondrashov FA and Kondrashov AS 2001; Gravner et al. 2007; Ferrada and Wagner 2010; Crona et al. 2013; Lobkovsky et al. 2013). Our model, on the other hand, considers patterns of long-term divergent and convergent molecular evolution as a function of the epistatic interactions arising on a multidimensional fitness landscape. Our model shares the same conceptual basis as covarion models (Fitch and Markowitz 1970; Tuffley and Steel 1998; Galtier 2001; Huelsenbeck 2002), which allow the rate of evolution at a site to change over time. Our model approaches the same issue but from a perspective that, in our opinion, more accurately reflects the biological basis for interaction between sites.

Our analysis has revealed several important aspects of protein evolution.
The crucial aspect of our model is that we can find a limited parameter space that is consistent with all of the available observations of protein evolution. The three estimated parameters that reveal the nature of the fitness landscape are the number of amino acids that in principle can confer nonzero fitness (m), the fraction of them that are allowed for substitution at any one moment, and the number of switches between amino acids in available and blocked states per single amino acid substitution.

We estimate an m between 5 and 15 for most protein families (fig. 7), which indicates that few states are forbidden, that is, correspond to low fitness regardless of the amino acid composition of other sites. Here, a modest number of universally forbidden amino acids implies that most amino acid sites can accept most of the possible amino acids given the right combination of states at other sites (Fitch and Markowitz 1970; Maynard Smith 1970; Povolotskaya and Kondrashov 2010). In turn, this implies that a single amino acid site can accept many different amino acids, consistent with the previous observations of a high amino acid usage (Breen et al. 2012).

We estimate the fraction of available amino acids at a given time to be 0.06, just a small fraction of all amino acid states. Therefore, at any given time many sites cannot accept any substitutions. This estimate is consistent with the slow rate of sequence divergence observed in our data set, with Kd/K4 ~ 0.02, and with mutational studies (Guo et al. 2004). Evidently, for faster evolving protein families this fraction is likely to be higher; however, because we considered old gene families, which tend to be conservative, our estimate is unlikely to be representative of the entire diversity of protein rates of evolution found across all taxa.

Within the confines of our model, it is not possible to set the rate of A↔B switches to 0 while maintaining a fit to all of the observed parameters of protein evolution.
Specifically, the rate of A↔B switches appears to be the key parameter that allows the model to fit the observed decline in the rate of convergent evolution, Kc/K4 (fig. 5). Because the model is scaled to the number of amino acid substitutions that occur in a protein, the estimate of 5 cannot be taken as a direct indication that epistatic interactions are intragenic. Individual proteins do not evolve in isolation, and cases of intergenic epistatic interaction have been documented (Lehner 2011). Indeed, our model does not imply a causative interaction between amino acid substitutions and switches between available and blocked states of amino acids in the fitness matrix. A general interpretation of a rate of 5 is that on average five switches between available and blocked states occur on the same timeframe as a single amino acid substitution. These switches may be a consequence of changes either in the same protein or in the rest of the genome. If most interactions are intragenic, then the best-fit value of 5 implies that the fitness matrix of a protein changes faster than its sequence, which in turn implies that the inherently epistatic nature of the fitness landscape is an inseparable and defining factor of molecular evolution.

Our model can be used to analyze the patterns of molecular evolution of specific gene families, as was done here (fig. 7). We observed that all estimated parameters vary substantially across different gene families, implying that the nature of the fitness landscape is a feature that may differ across protein families. The application of this model to a broader set of proteins than considered here can lead the way toward a more general characterization of fitness landscapes and evolutionary trajectories in nature.
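Returning to the combinatorial estimate of amino acid usage, equation (11) can be checked numerically. The Python sketch below computes P[n] from Stirling numbers of the second kind and takes the expectation of n. The function names are illustrative assumptions, m is taken as the integer 9 quoted in the text, and the closed form m(1 − (1 − 1/m)^N) for the expected number of distinct colors is a standard occupancy result used here only as a cross-check; it is not stated in the text above.

```python
def stirling2_row(N):
    """Return [S(N, 0), ..., S(N, N)], Stirling numbers of the second kind.

    Uses the recurrence S(i, k) = k * S(i - 1, k) + S(i - 1, k - 1).
    """
    row = [1]  # S(0, 0) = 1
    for i in range(1, N + 1):
        new = [0] * (i + 1)
        for k in range(1, i + 1):
            prev_same = row[k] if k < len(row) else 0  # S(i - 1, k)
            new[k] = k * prev_same + row[k - 1]
        row = new
    return row


def usage_distribution(N, m):
    """Equation (11): P[n] = S(N, n) * m! / (m^N * (m - n)!) for n = 1..min(N, m)."""
    stirling = stirling2_row(N)
    total_sets = m ** N
    probs = {}
    falling = 1  # running value of m! / (m - n)!
    for n in range(1, min(N, m) + 1):
        falling *= m - n + 1
        probs[n] = stirling[n] * falling / total_sets
    return probs


def expected_usage(N, m):
    """Usage defined as E(n), the expected number of distinct amino acids."""
    return sum(n * p for n, p in usage_distribution(N, m).items())


if __name__ == "__main__":
    m = 9
    for N in (1, 10, 30):
        closed_form = m * (1 - (1 - 1 / m) ** N)  # cross-check only
        print(N, round(expected_usage(N, m), 3), round(closed_form, 3))
```

Under these assumptions the expected usage is roughly 6.2 of 9 amino acids after 10 substitutions per site and roughly 8.7 after 30, in line with the statement that usage is considerable at N ~ 10 and close to its maximum at N = 30.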
Acknowledgments

The work was supported by grants from Agence Nationale de la Recherche (ANR-12-JSV7-0007), HHMI International Early Career Scientist Program (55007424), the EMBO Young Investigator Programme, MINECO (BFU2012-31329 and Sev-2012-0208), and an ERC Starting Grant (335980_EinME). All authors participated in the design of the model. D.R.U. and L.F. performed the mathematical analysis. D.R.U. and I.S.P. obtained and analyzed data on protein evolution. D.R.U. and F.A.K. wrote the draft.

References

Aita T, Ota M, Husimi Y. 2003. An in silico exploration of the neutral network in protein sequence space. J Theor Biol. 221:599–613.
Andolfatto P. 2005. Adaptive evolution of non-coding DNA in Drosophila. Nature 437:1149–1152.
Bazykin GA, Kondrashov FA, Brudno M, Poliakov A, Dubchak I, Kondrashov AS. 2007. Extensive parallelism in protein evolution. Biol Direct. 2:20.
Bollback JP, Huelsenbeck JP. 2009. Parallel genetic evolution within and between bacteriophage species of varying degrees of divergence. Genetics 181:225–234.
Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. 2012. Epistasis as the primary factor in molecular evolution. Nature 490:535–538.
Breen MS, Kemena C, Vlasov PK, Notredame C, Kondrashov FA. 2013. Reply to: The role of epistasis in protein evolution. Nature 497:E2–E3.
Bridgham JT, Ortlund EA, Thornton JW. 2009. An epistatic ratchet constrains the direction of glucocorticoid receptor evolution. Nature 461:515–519.
Choi SS, Li W, Lahn BT. 2005. Robust signals of coevolution of interacting residues in mammalian proteomes identified by phylogeny-aided structural analysis. Nat Genet. 37:1367–1371.
Covert AW 3rd, Lenski RE, Wilke CO, Ofria C. 2013. Experiments on the role of deleterious mutations as stepping stones in adaptive evolution. Proc Natl Acad Sci U S A. 110:E3171–E3178.
Crona K, Greene D, Barlow M. 2013. The peaks and geometry of fitness landscapes. J Theor Biol. 317:1–10.
Crow JF, Kimura M. 1970. An introduction to population genetics theory. New York: Harper and Row.
Dayhoff MO, Schwartz RM, Orcutt BC. 1978. A model of evolutionary change in proteins. In: Atlas of protein sequence and structure, Vol. 5. Washington (DC): National Biomedical Research Foundation. p. 345–352.
de Juan D, Pazos F, Valencia A. 2013. Emerging methods in protein co-evolution. Nat Rev Genet. 14:249–261.
de Visser JA, Cooper TF, Elena SF. 2011. The causes of epistasis. Proc Biol Sci. 278:3617–3624.
de Visser JA, Krug J. 2014. Empirical fitness landscapes and the predictability of evolution. Nat Rev Genet. 15:480–490.
Felsenstein J. 1981. Evolutionary trees from DNA sequences: a maximum likelihood approach. J Mol Evol. 17:368–376.
Ferrada E, Wagner A. 2010. Evolutionary innovations and the organization of protein functions in genotype space. PLoS One 5:e14172.
Fitch WM, Markowitz E. 1970. An improved method for determining codon variability in a gene and its application to the rate of fixation of mutations in evolution. Biochem Genet. 4:579–593.
Freeland SJ, Knight RD, Landweber LF, Hurst LD. 2000. Early fixation of an optimal genetic code. Mol Biol Evol. 17:511–518.
Galtier N. 2001. Maximum-likelihood phylogenetic analysis under a covarion-like model. Mol Biol Evol. 18:866–873.
Gavrilets S. 1997. Evolution and speciation on holey adaptive landscapes. Trends Ecol Evol. 12:307–312.
Gavrilets S, Gravner J. 1997. Percolation on the fitness hypercube and the evolution of reproductive isolation. J Theor Biol. 184:51–64.
Gloor GB, Tyagi G, Abrassart DM, Kingston AJ, Fernandes AD, Dunn SD, Brandl CJ. 2010. Functionally compensating coevolving positions are neither homoplasic nor conserved in clades. Mol Biol Evol. 27:1181–1191.
Gong LI, Suchard MA, Bloom JD. 2013. Stability-mediated epistasis constrains the evolution of an influenza protein. Elife 2:e00631.
Gravner J, Pitman D, Gavrilets S. 2007. Percolation on fitness landscapes: effects of correlation, phenotype, and incompatibilities. J Theor Biol. 248:627–645.
Guo HH, Choe J, Loeb LA. 2004. Protein tolerance to random amino acid change. Proc Natl Acad Sci U S A. 101:9205–9210.
Hansen TF. 2013. Why epistasis is important for selection and adaptation. Evolution 67:3501–3511.
Henikoff S, Henikoff JG. 1992. Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A. 89:10915–10919.
Huelsenbeck JP. 2002. Testing a covariotide model of DNA substitution. Mol Biol Evol. 19:698–707.
Jukes T, Cantor C. 1969. Evolution of protein molecules. In: Munro H, editor. Mammalian protein metabolism. New York: Academic Press. p. 21–132.
Kauffman S, Levin S. 1987. Towards a general theory of adaptive walks on rugged landscapes. J Theor Biol. 128:11–45.
Khan AI, Dinh DM, Schneider D, Lenski RE, Cooper TF. 2011. Negative epistasis between beneficial mutations in an evolving bacterial population. Science 332:1193–1196.
Kimura M. 1980. A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol. 16:111–120.
Kondrashov FA, Kondrashov AS. 2001. Multidimensional epistasis and the disadvantage of sex. Proc Natl Acad Sci U S A. 98:12089–12092.
Kondrashov AS, Povolotskaya IS, Ivankov DN, Kondrashov FA. 2010. Rate of sequence divergence under constant selection. Biol Direct. 5:5.
Kondrashov AS, Sunyaev S, Kondrashov FA. 2002. Dobzhansky-Muller incompatibilities in protein evolution. Proc Natl Acad Sci U S A. 99:14878–14883.
Lehner B. 2011. Molecular mechanisms of epistasis within and between genes. Trends Genet. 27:323–331.
Lobkovsky AE, Wolf YI, Koonin EV. 2013. Quantifying the similarity of monotonic trajectories in rough and smooth fitness landscapes. Mol Biosyst. 9:1627–1631.
Lozovsky ER, Chookajorn T, Brown KM, Imwong M, Shaw PJ, Kamchonwongpaisan S, Neafsey DE, Weinreich DM, Hartl DL. 2009. Stepwise acquisition of pyrimethamine resistance in the malaria parasite. Proc Natl Acad Sci U S A. 106:12025–12030.
Lunzer M, Golding GM, Dean AM. 2010. Pervasive cryptic epistasis in molecular evolution. PLoS Genet. 6:e1001162.
Macken CA, Perelson AS. 1989. Protein evolution on rugged landscapes. Proc Natl Acad Sci U S A. 86:6191–6195.
Maynard Smith J. 1970. Natural selection and the concept of a protein space. Nature 225:563–564.
McCandlish DM, Rajon E, Shah P, Ding Y, Plotkin JB. 2013. The role of epistasis in protein evolution. Nature 497:E1–E2.
McLaughlin RN Jr, Poelwijk FJ, Raman A, Gosal WS, Ranganathan R. 2012. The spatial architecture of protein function and adaptation. Nature 491:138–142.
Mirkin BG, Fenner TI, Galperin MY, Koonin EV. 2003. Algorithms for computing parsimonious evolutionary scenarios for genome evolution, the last universal common ancestor and dominance of horizontal gene transfer in the evolution of prokaryotes. BMC Evol Biol. 3:2.
Miyamoto MM, Fitch WM. 1995. Testing the covarion hypothesis of molecular evolution. Mol Biol Evol. 12:503–513.
Naumenko SA, Kondrashov AS, Bazykin GA. 2012. Fitness conferred by replaced amino acids declines with time. Biol Lett. 8:825–828.
Nei M, Gojobori T. 1986. Simple methods for estimating the numbers of synonymous and nonsynonymous nucleotide substitutions. Mol Biol Evol. 3(5):418–426.
Ohta T. 1998. Evolution by nearly-neutral mutations. Genetica 102–103(1–6):83–90.
O'Meara BC. 2012. Evolutionary inferences from phylogenies: a review of methods. Annu Rev Ecol Evol Syst. 43(1):267–285.
Popadin K, Polishchuk LV, Mamirova L, Knorre D, Gunbin K. 2007. Accumulation of slightly deleterious mutations in mitochondrial protein-coding genes of large versus small mammals. Proc Natl Acad Sci U S A. 104:13390–13395.
Povolotskaya IS, Kondrashov FA. 2010. Sequence space and the ongoing expansion of the protein universe. Nature 465:922–926.
Rogozin IB, Thomson K, Csűrös M, Carmel L, Koonin EV. 2008. Homoplasy in genome-wide analysis of rare amino acid replacements: the molecular-evolutionary basis for Vavilov's law of homologous series. Biol Direct. 3:7.
Rokas A, Carroll SB. 2008. Frequent and widespread parallel evolution of protein sequences. Mol Biol Evol. 25:1943–1953.
Romero PA, Arnold FH. 2009. Exploring protein fitness landscapes by directed evolution. Nat Rev Mol Cell Biol. 10:866–876.
Roscoe BP, Thayer KM, Zeldovich KB, Fushman D, Bolon DN. 2013. Analyses of the effects of all ubiquitin point mutants on yeast growth rate. J Mol Biol. 425:1363–1377.
Soylemez O, Kondrashov FA. 2012. Estimating the rate of irreversibility in protein evolution. Genome Biol Evol. 4:1213–1222.
Tavaré S. 1986. Some probabilistic and statistical problems on the analysis of DNA sequences. Lect Math Life Sci. 17:57–86.
Thyagarajan B, Bloom JD. 2014. The inherent mutational tolerance and antigenic evolvability of influenza hemagglutinin. Elife e03300.
Tuffley C, Steel M. 1998. Modeling the covarion hypothesis of nucleotide substitution. Math Biosci. 147:63–91.
van Nimwegen E, Crutchfield JP, Huynen M. 1999. Neutral evolution of mutational robustness. Proc Natl Acad Sci U S A. 96:9716–9720.
Wang HC, Spencer M, Susko E, Roger AJ. 2007. Testing for covarion-like evolution in protein sequences. Mol Biol Evol. 24:294–305.
Weinreich DM, Delaney NF, Depristo MA, Hartl DL. 2006. Darwinian evolution can follow only very few mutational paths to fitter proteins. Science 312:111–114.
Wellner A, Raitses Gurevich M, Tawfik DS. 2013. Mechanisms of protein sequence divergence and incompatibility. PLoS Genet. 9:e1003665.
Whelan S, Goldman N. 2001. A general empirical model of protein evolution derived from multiple protein families using a maximum-likelihood approach. Mol Biol Evol. 18:691–699.
Yang Z. 1994. Maximum likelihood phylogenetic estimation from DNA sequences with variable rates over sites: approximate methods. J Mol Evol. 39(3):306–314.
Zhang W, Dourado DF, Fernandes PA, Ramos MJ, Mannervik B. 2012. Multidimensional epistasis and fitness landscapes in enzyme evolution. Biochem J. 445:39–46.
