Molecular Clocks, Base Substitutions, & Phylogenetic Distances

Definition: A mutation is either an exchange of one nucleotide for another within a DNA sequence or an indel (insertion or deletion) event. In effect, it is a mistake in the replication or repair of DNA.

Mutations are divided into three categories:
1. Deleterious: disadvantageous to the survival of the organism.
2. Advantageous: contribute to the continued survival of the organism.
3. Neutral: for example, a change in the third nucleotide of a codon for valine.

Advantageous changes are in the minority. Also, some changes can greatly affect an organism.

A deceptively simple, important equation:

$$r = \frac{K}{2T}$$

Where:
r = the rate at which substitutions occur
K = the number of substitutions two sequences have undergone since they last shared a common ancestor, expressed in substitutions per site
T = the divergence time

The factor of 2 appears because both lineages have been accumulating substitutions independently for time T.

Unfortunately, none of these variables is known directly. T can be estimated from archaeological evidence, if it exists. K can be approximated by sequence comparison.

Different portions of genes accumulate changes at widely varying rates. Codon positions also experience different substitution rates, depending on their degeneracy:
• Four-fold degenerate sites, where replacing the nucleotide with any of the other three does not change the amino acid (e.g. the third site of glycine codons), accumulate substitutions most rapidly.
• Two-fold degenerate sites, where two of the four nucleotides code for one amino acid and the other two code for another (e.g. the third site of codons for aspartic acid and glutamic acid), accumulate substitutions less rapidly.
• Nondegenerate sites, where any change always results in a change in the amino acid (e.g. almost any of the middle sites in Table 1.1 on p. 11 of K&R), accumulate substitutions least rapidly.

Natural selection makes it difficult to assess mutation rates because it tends to eliminate deleterious mutations before they can be observed. Substitutions are mutations that have passed through the filter of selection.
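The degeneracy classes described above can be checked mechanically. The following is a minimal Python sketch, assuming the standard genetic code; the helper name `site_degeneracy` and the compact table encoding are illustrative choices and do not come from K&R. For a given codon position, it counts how many of the four nucleotides leave the encoded amino acid unchanged.

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order
# (TTT, TTC, TTA, TTG, TCT, ...). '*' marks a stop codon.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def site_degeneracy(codon, position):
    """Count the nucleotides (including the original one) that can occupy
    `position` (0, 1 or 2) of `codon` without changing the amino acid.

    4 -> four-fold degenerate, 2 -> two-fold degenerate, 1 -> nondegenerate.
    """
    original = CODON_TABLE[codon]
    count = 0
    for base in BASES:
        variant = codon[:position] + base + codon[position + 1:]
        if CODON_TABLE[variant] == original:
            count += 1
    return count


# Examples taken from the text: glycine (GGT) and aspartic acid (GAT).
print("GGT, third site:", site_degeneracy("GGT", 2))
print("GAT, third site:", site_degeneracy("GAT", 2))
print("GGT, middle site:", site_degeneracy("GGT", 1))
```

A count of 3 is also possible under the standard code (the third site of the isoleucine codons ATT, ATC, and ATA), a case that the three-way classification above does not cover.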
We consider two types of substitutions:
Synonymous: those that do not result in a change of the amino acid.
Nonsynonymous: those that result in a change of the amino acid.

Synonymous changes are less affected by selection and thus reflect the true mutation rate more closely than nonsynonymous changes.

Table of synonymous and nonsynonymous substitution rates for various genes in four mammalian species. See Table 3.3 on page 64 of K&R for identification of the genes.

Because selective constraints on substitutions differ from protein to protein, differences in amino acid replacement rates between nuclear genes can be quite striking. On the other hand, rates of molecular evolution for loci with similar functional constraints can be quite uniform over long periods of evolutionary time. This observation led Zuckerkandl and Pauling in the 1960s to suggest that within homologous proteins the substitution rates were so constant that they were like the ticking of a Molecular Clock. While the clock may run at different rates for different proteins, the number of differences between two homologous proteins correlated well with the time since speciation caused them to diverge.

This hypothesis is controversial. Classical evolutionists maintain that the erratic tempo of morphological evolution is inconsistent with a steady rate of molecular change. Furthermore, disagreements about divergence times have called into question the uniformity of evolutionary rates promised by a "molecular clock." See, as one example, the article on the time of divergence of the human and the chimp. One of the hypotheses there is that humans, because of their longer life span, have a 'slower' molecular clock.

On the other hand, these varying rates can be explained in several different ways, and much useful information has been obtained from sequence comparison. For the moment we will proceed with the assumption of a molecular clock for highly conserved sequences.
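Under the clock assumption, the relation r = K/(2T) introduced earlier can be used in both directions: a pair of sequences with a known divergence time calibrates r, and that r then dates another pair from its K alone. The sketch below illustrates this in Python. The function names and all numerical values are hypothetical, chosen only to make the arithmetic easy to follow; they are not measurements from K&R or any other source.

```python
def substitution_rate(k, t):
    """Rate r = K / (2T), in substitutions per site per unit time."""
    if t <= 0:
        raise ValueError("divergence time must be positive")
    return k / (2.0 * t)


def divergence_time(k, r):
    """Invert the same relation: T = K / (2r)."""
    if r <= 0:
        raise ValueError("substitution rate must be positive")
    return k / (2.0 * r)


# Hypothetical calibration pair: K = 0.20 substitutions per site,
# divergence time assumed known to be 50 million years.
r = substitution_rate(0.20, 50e6)

# Hypothetical second pair of homologous sequences with K = 0.08,
# dated using the calibrated rate.
t = divergence_time(0.08, r)

print("calibrated rate per site per year:", r)
print("estimated divergence time in years:", t)
```

The estimate is only as good as the clock assumption itself: if the two pairs evolve at different rates, the calibrated r does not transfer.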
However, we are not yet out of the woods. For sequences with relatively few substitutions, a simple count of differences provides a reasonable approximation of K. For sequences with many differences, however, simple counting may significantly underestimate the actual number of substitutions. Why? A site that has changed more than once still counts as at most one difference, and a site that has changed back to its original nucleotide counts as none.

Jukes and Cantor in 1969 developed the first, and simplest, model of nucleotide substitution. It corrects for the underestimate produced by simple counting of differences and gives a more accurate accounting of the number of substitutions since two sequences last shared a common ancestor. In 1980 Kimura developed a more sophisticated model that took into account different rates for transitions and transversions.

To begin, we will investigate the ramifications of the Jukes-Cantor model. This model assumes that a certain proportion of each nucleotide changes during any one evolutionary time period, and that a nucleotide that changes is equally likely to become any of the other three. This assumption leads to a table that can be expressed in the following way, with rows and columns indexed by the four nucleotides:

$$M = \begin{pmatrix} 1-\alpha & \alpha/3 & \alpha/3 & \alpha/3 \\ \alpha/3 & 1-\alpha & \alpha/3 & \alpha/3 \\ \alpha/3 & \alpha/3 & 1-\alpha & \alpha/3 \\ \alpha/3 & \alpha/3 & \alpha/3 & 1-\alpha \end{pmatrix}$$

α = the proportion of a particular nucleotide that changes during any one evolutionary time period.

Reiterating the formula for p implied by the Jukes-Cantor model, where p is the proportion of sites at which the two sequences differ after t time periods:

$$p = \frac{3}{4}\left(1 - \left(1 - \frac{4}{3}\alpha\right)^{t}\right)$$

We can solve for the elapsed time, t, based on α and p:

$$t = \frac{\ln\left(1 - \frac{4}{3}p\right)}{\ln\left(1 - \frac{4}{3}\alpha\right)}$$

p can be approximated by the proportion of observed differences between the two sequences. However, that still leaves us with one equation in two unknowns, α and t. This is not good! Or is it? If we look at the product αt and think about its meaning for a minute, we see that this product is the number of time steps times the substitution rate per step, i.e. the expected number of substitutions per site during the elapsed time. This includes even those that do not appear in the count of differences, i.e. the "hidden substitutions" (for example, those that eventually resulted in a position once again being occupied by its original nucleotide).
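To see how hidden substitutions make simple counting an underestimate, the per-step model can be iterated numerically. The Python sketch below assumes the discrete-time reading of the Jukes-Cantor model described above: in each period a site keeps its nucleotide with probability 1 − α and moves to each of the other three with probability α/3. It compares the iterated result with the closed form p = (3/4)(1 − (1 − 4α/3)^t) that follows from that assumption. The function names and the values of α and t are hypothetical.

```python
def fraction_differing_stepwise(alpha, steps):
    """Iterate the per-step model and return the probability that a site
    differs from its starting nucleotide after `steps` periods."""
    same = 1.0  # probability the site currently shows its original nucleotide
    for _ in range(steps):
        # Stay the same with probability 1 - alpha, or return to the
        # original from one of the other three with probability alpha / 3.
        same = same * (1.0 - alpha) + (1.0 - same) * (alpha / 3.0)
    return 1.0 - same


def fraction_differing_closed_form(alpha, steps):
    """Closed form implied by the same model."""
    return 0.75 * (1.0 - (1.0 - 4.0 * alpha / 3.0) ** steps)


alpha, steps = 0.01, 100  # hypothetical values for illustration
p_step = fraction_differing_stepwise(alpha, steps)
p_closed = fraction_differing_closed_form(alpha, steps)

print("observable fraction differing, stepwise:   ", p_step)
print("observable fraction differing, closed form:", p_closed)
print("expected substitutions per site, alpha * t:", alpha * steps)
```

With these values the expected number of substitutions per site, αt, is 1.0, while the observable fraction of differing sites is well below that and can never exceed 3/4. The gap between the two is the contribution of hidden substitutions.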
We define a new variable d = αt, which is called the Jukes-Cantor distance. Notice that this distance is proportional to t. We are almost where we want to be. We make one last observation: if x is small, $\ln(1 - x) \approx -x$. For example, $\ln(1 - .00001) = -.00001000005$.

Thus, since α is very small, we have:

$$\ln\left(1 - \frac{4}{3}\alpha\right) \approx -\frac{4}{3}\alpha$$

This approximation allows us to solve for d, the Jukes-Cantor distance:

$$t = \frac{\ln\left(1 - \frac{4}{3}p\right)}{\ln\left(1 - \frac{4}{3}\alpha\right)} \approx \frac{\ln\left(1 - \frac{4}{3}p\right)}{-\frac{4}{3}\alpha} = -\frac{3}{4\alpha}\ln\left(1 - \frac{4}{3}p\right)$$

Multiplying both sides by α,

$$d = \alpha t \approx -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

Thus, given two sequences, S0 and S1,

$$d_{JC}(S_0, S_1) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}p\right)$$

We conclude with an example: Consider two sequences with 40 sites

S0: AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG
S1: AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG

The two sequences differ at five sites. Thus p = 5/40 = 1/8 = .125

Thus,

$$d_{JC}(S_0, S_1) = -\frac{3}{4}\ln\left(1 - \frac{4}{3}(.125)\right) = -\frac{3}{4}\ln(.833333) = -\frac{3}{4}(-.182321556) = .13674117$$

This is the expected number of substitutions per site, i.e. about 13.7% of sites. Over 40 sites, 40 × .13674117 ≈ 5.5 is the expected number of substitutions, slightly more than the five differences observed between the two sequences.
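The worked example can be reproduced with a few lines of Python. This is a minimal sketch with hypothetical function names; it assumes the two sequences are already aligned and of equal length, and it treats the distance as undefined when p reaches 3/4, where the logarithm's argument is no longer positive.

```python
import math


def observed_difference_fraction(s0, s1):
    """Proportion p of aligned sites at which the two sequences differ."""
    if len(s0) != len(s1) or not s0:
        raise ValueError("sequences must be non-empty and of equal length")
    differences = sum(1 for a, b in zip(s0, s1) if a != b)
    return differences / len(s0)


def jukes_cantor_distance(p):
    """Jukes-Cantor distance d = -(3/4) * ln(1 - (4/3) * p)."""
    if not 0.0 <= p < 0.75:
        raise ValueError("distance is only defined for 0 <= p < 3/4")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


# The two 40-site sequences from the example above.
S0 = "AGCTTCCGATCCGCTATAATCGTTAGTTGTTACACCTCTG"
S1 = "AGCTTCTGATACGCTATAATCGTGAGTTGTTACATCTCCG"

p = observed_difference_fraction(S0, S1)
d = jukes_cantor_distance(p)

print("p =", p)
print("d_JC =", d)
print("expected substitutions over all sites =", d * len(S0))
```

For these sequences p is 0.125 and the distance comes out at roughly 0.1367 substitutions per site, which is about 5.5 substitutions over the 40 sites.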