Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Introduction to molecular evolution Lecture 13, Statistics 246 March 4, 2004 1 Evolution using molecules: implicit assumptions Our DNA is inherited from our parents more or less unchanged. Molecular evolution is dominated by mutations that are neutral from the standpoint of natural selection. Mutations accumulate at fairly steady rates in surviving lineages. We can study the evolution of (macro) molecules and reconstruct the evolutionary history of organisms using their molecules. 2 Some important dates in history (billions of years ago) Origin of the universe 15 4 Formation of the solar system 4.6 First self-replicating system 3.5 0.5 Prokaryotic-eukaryotic divergence 1.8 0.3 Plant-animal divergence 1.0 Invertebrate-vertebrate divergence 0.5 Mammalian radiation beginning 0.1 86 CSH Doolittle et al. 3 The three kingdoms 4 Two important early observations Different proteins evolve at different rates, and this seems more or less independent of the host organism, including its generation time. It is necessary to adjust the observed percent difference between two homologous proteins to get a distance more or less linearly related to the time since their common ancestor. ( Later we offer a rational basis for doing this.) An striking early version of these observations is next. 5 g V ertebrates/ Insects f Carp/Lamprey abcde Reptiles/Fish 200 Birds/Reptiles Mammals/ Reptiles 220 Mammals Corrected amino acid changes per 100 residues Rates of macromolecular evolution h i j 10 180 160 6 5 140 7 89 Evolution of the globins 120 100 80 60 1 40 4 Separation of ancestors of plants and animals Pliocene Miocene Oligocene Eocene Paleocene 100 200 300 400 500 600 Huronian Algonkian Cambrian Devonian Silurian Ordovician Jurassic Triassic Permian Cretaceous 0 Carboniferous 20 3 2 700 800 900 1000 1100 1200 1300 1400 6 Millions of years since divergence After Dickerson (1971) Different rates of change for different proteins Protein aPAMs/100 Pseudogenes Fibrinopeptides Lactalbumins Lysozymes Ribonucleases Hemoglobins Acid proteases Triosephosphate isomerase Phosphoglyceraldehyde dehydrogenase Glutamate dehydrogenase residues/108 years Theoretical lookback timeb 45c 200c 670c 750c 850c 1.5d 2.3d 6d 400 90 27 24 21 12 8 3 9d 2 1 18d ______________________________________________________________________________________ aPAMs, Accepted point mutations (explained shortly). bUseful lookback time = 360 PAMs. cMillion years. dBillion years. From Doolittle 1986 7 Rates of change in protein families Protein Ratea Protein Fibrinopeptides Growth hormone Ig kappa chain C region Kappa casein Ig gamma chain C region Lutropin beta chain Ig lambda chain C region Complement C3a Lactalbumin Epidermal growth factor Somatotropin Pancreatic ribonuclease Lipotropin beta Haptoglobin alpha chain Serum albumin Phospholipase A2 Protease inhibitor PST1 type Prolactin Pancreatic hormone Carbonic anydrase C Lutropin alpha chain Hemoglobin alpha chain Hemoglobin beta chain Lipid-binding protein A-II Gastrin Animal lysozyme Myoglobin Amyloid A Nerve growth factor Acid proteases Myelin basic protein 90 37 37 33 31 30 27 27 27 26 25 21 21 20 19 19 18 17 17 16 16 12 12 10 9.8 9.8 8.9 8.7 8.5 8.4 7.4 Thyrotropin beta chain Parathyrin Parvalbumin BPTI Protease inhibitors Trypsin Melanotropin beta Alpha crystallin A chain Endorphin Cytochrome b5 Insulin Calcitonin Neurophysin 2 Plastocyanin Lactate dehydrogenase Adenylate cyclase Triosephosphate isomerase Vasoactive intestinal peptide Corticotropin Glyceraldehyde 3-P DH Cytochrome C Plant ferredoxin Collagen Troponin C, skeletal muscle Alpha crystallin B-chain Glucagon Glutamate DH Histone H2B Histone H2A Histone H3 Ubiquitin Histone H4 apercent/100My From (Nei, 1987; Dayhoff et al., 1978) Rate 7.4 7.3 7.0 6.2 5.9 5.6 5.0 4.8 4.5 4.4 4.3 3.6 3.5 3.4 3.2 2.8 2.6 2.5 2.2 2.2 1.9 1.7 1.5 1.5 1.2 0.9 0.9 0.5 0.14 0.1 8 0.1 Some terminology In evolution, homology (here of proteins), means similarity due to common ancestry. A common mode of protein evolution is by duplication. Depending on the relations between duplication and speciation dates, we have two different types of homologous proteins. Loosely, Orthologues: the “same” gene in different organisms;common ancestry goes back to a speciation event. Paralogues: different genes in the same organism; common ancestry goes back to a gene duplication. Lateral gene transfer gives another form of homology. 9 Beta-globins (orthologues) 10 BG-human BG-macaque BG-bovine BG-platypus BG-chicken BG-shark 60 70 80 100 110 120 NLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFG .......Q................K............... D......A................K.......V...RN.. D......K................NR.....IV...R..S .I.N..SQ....................DI.II...A..S DV.SQ.TD..KK.AEE....V.S.K..AKCF.VE.GILLK 130 BG-human BG-macaque BG-bovine BG-platypus BG-chicken BG-shark 40 RFFESFGDLSTPDAVMGNPKVKAHGKKVLGAFSDGLAHLD ..........S.........................N... ...........A....N............DS..N.MK... ....A.....SAG............A...TS.G.A.KN.. ...A...N..S.T.IL...M.R.......TS.G.AVKN.. .Y.GNLKEFTACSYG-----..E.A...T..LGVAVT..G 90 BG-human BG-macaque BG-bovine BG-platypus BG-chicken BG-shark 30 MVHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPWTQ -........N...T.......................... --M..A...A....F....K.................... -...SGG......N......IN.L................ ...W.A...QLI.G.......A.C.A...A...I...... -..WSEV.LHEI.TT.KSIDKHSL.AK..A.MFI.....T 50 BG-human BG-macaque BG-bovine BG-platypus BG-chicken BG-shark 20 140 KEFTPPVQAAYQKVVAGVANALAHKYH .....Q..................... .....VL..DF.............R.. .D.S.E....W..L.S...H..G.... .D...EC...W..L.RV..H...R... DK.A.QT..IWE.YFGV.VD.ISKE.. . means same as reference sequence - means deletion 10 Beta-globins: uncorrected pairwise distances DISTANCES between protein sequences, calculated over: 1 to 147 Below diagonal: observed number of differences Above diagonal: number of differences per 100 amino acids hum mac bov pla chi sha hum ---- 5 16 23 31 65 mac 7 ---- 17 23 30 62 bov 23 24 ---- 27 37 65 pla 34 34 39 ---- 29 64 chi 45 44 52 42 ---- sha 91 88 91 90 87 61 ---- 11 Beta-globins: corrected pairwise distances DISTANCES between protein sequences, calculated over 1 to 147. Below diagonal: observed number of differences Above diagonal: estimated number of substitutions per 100 amino acids Correction method: Jukes-Cantor hum mac bov pla chi sha hum ---- 5 17 27 37 108 mac 7 ---- 18 27 36 102 bov 23 24 ---- 32 46 110 pla 34 34 39 ---- 34 106 chi 45 44 52 42 ---- 98 sha 91 88 91 90 87 ---12 UPGMA tree BG-shark BG-chicken BG-platypus BG-bovine BG-macaque BG-human 13 Human globins (paralogues) 10 alpha-human beta-human delta-human epsilon-human gamma-human myo-human 50 70 90 100 110 DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHL .NLKGTFAT..E..CD..H...E..R..GNV.VCV..H.F .NLKGTF.Q..E..CD..H...E..R..GNV.VCV..RNF .NLKP.FAK..E..CD..H...E.....GNVMVII..T.F ..LKGTFAQ..E..CD..H...E.....GNV.VTV..I.F GHHEAEIKP.AQS..T.HKIPVKYLEFI.E.IIQV.QSKH 120 alpha-human beta-human delta-human epsilon-human gamma-human myo-human 60 KTYFPHF-DLSHGSA-----QVKGHGKKVADALTNAVAHV QRF.ES.G...TPD.VMGNPK..A.....LG.FSDGL..L QRF.ES.G...SPD.VMGNPK..A.....LG.FSDGL..L QRF.DS.GN..SP..ILGNPK..A.....LTSFGD.IKNM QRF.DS.GN..SA..IMGNPK..A.....LTS.GD.IK.L LEK.DK.KH.KSEDEMKASEDL.K..AT.LT..GGILKKK 80 alpha-human beta-human delta-human epsilon-human gamma-human myo-human 30 -VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTT VH.T.EE.SA.T.L....--NVD.V.G...G.LLVVY.W. VH.T.EE..A.N.L....--NVDAV.G...G.LLVVY.W. VHFTAEE.AA.TSL.S.M--NVE.A.G...G.LLVVY.W. GHFTEE..ATITSL....--NVEDA.G.T.G.LLVVY.W. -G..DGEWQL.LNV....E.DIPGH.Q.V.I.L.KGH.E. 40 alpha-human beta-human delta-human epsilon-human gamma-human myo-human 20 130 140 PAEFTPAVHASLDKFLASVSTVLTSKYR-----GK....P.Q.AYQ.VV.G.ANA.AH..H...... GK....QMQ.AYQ.VV.G.ANA.AH..H...... GK....E.Q.AWQ.LVSA.AIA.AH..H...... GK....E.Q..WQ.MVTA.ASA.S.R.H...... .GD.GADAQGAMN.A.ELFRKDMA.N.KELGFQG 14 Human globins: uncorrected pairwise distances DISTANCES between protein sequence, calculated over 1 to 154. Below diagonal: observed number of differences Above diagonal: number of differences per 100 amino acids alpha beta ---- 55 beta 82 ---- delta 82 10 epsil 89 gamma myo alpha delta epsil gamma 55 60 57 74 7 25 27 75 ---- 27 29 74 35 39 ---- 20 77 85 39 42 29 ---- 76 116 117 119 118 116 myo 15 ---- Human globins: corrected pairwise distances DISTANCES between protein sequences,calculated over 1-141 Below diagonal: observed number of differences Above diagonal: estimated number of substitutions per 100 amino acids Correction method: Jukes-Cantor alpha beta delta epsil gamma myo alpha ---- 281 281 281 313 208 beta 82 ---- 7 30 31 1000 delta 82 10 ---- 34 33 470 epsil 89 35 39 ---- 21 402 gamma 85 39 42 29 ---- 470 myo 116 117 119 118 ----16 116 Alu sequences (a-globin2 Alu 1, Knight et al., 1996) hum-1 hum-2 hum-3 chimp bonob goril orang C . . . . . . C . . . . . G G . . . . . . A . . . . . . C . . . . . . A . . . . . . G . . . . . . G . . A A . A C . . . . . . 10 A . . . . . . C . . . . . . G . . . . A . G . . . . . . T . . . . . . G . . . . . . G . . . . . . C . . . . . . T . . . . . . C . . . . . . 20 A C . . . . . . . . . . . . A . . . . . G C . . . . . . C . . . . . . T . . . . . . G . . . . . . T . . . . . . A . . C C . . A . . . . . . 30 T C . . . . . . . . . . . . C . . . . . . C . . . . . . A . . . . . . G . . . . . . T . . C C C C A . . . . . . C . . . . . . T . . . . . . 40 T T . . . . . . . . . . . . G . . . . . . G . . . . . . G . . . . . . A . . . . . . G . . . . . . G . . . . C . C . . . . . . T . . . . . C 50 G A . . . . . . . . . . . . G . . . . . . G . . . . . . C . . . . . T G . . . . . . A . . G G G G G . . . . . . A . . . . . C G . . . . . . 60 G . . . . . . hum-1 hum-2 hum-3 chimp bonob goril orang A . . . . . . T . . . . . . C . . . . . . A . . . . . . C . . . . . . C . . . . . . T . . . . . . G . . . . . . A . . . . . . 70 G . . . . . . G . . . . . . T . . . . . . C . . . . . T G . . . . . . G . . . . . . G . . . . . . A . . . . . . G . . . . . . T . . . . . . 80 T T . . . . . C . C . C . C G . . . . . . A . . . . . . G . . . . . A A . . . . . . C T . . . . . C . . . . . . A . . . . . . G . . . . . . 90 C C . . . . . . . . . . . . T . . . . . . G . . . . . . A . . . . . . C . . . . . . C . . . . . . A . . . . . . A . . . . . . T . . . . . . 100 A T . . . . . . . . . . . . G . . . . . . G . . . . . . A . . . . . . G . . . . . . A . . . . . . A . . . . . . A . . . . . . C . . . . T . 110 C C . . . . . . . . . . . . C . . . . . . A . . . . . . G . . . . . . T . C . . . . T . . . . . . A . . . . . C T . . . . . . A . . . . . . 120 C . . . . . . hum-1 hum-2 hum-3 chimp bonob goril orang T . . . . . . A . . . . . . A . . . . . . A . . . . . . A . . . . . . A . . . . . . T . . . . . . A . . . . . . 130 C A A . . . . . . . . . . . . . . . . . . A . . . . . . A . . . . . . T . . . . . . T . . . . . . A . . . . . . G . . . . . . C . . . . . . T . . . . . . 140 G G . . . . . . . . . . . . G . . . . . . T . . . . . C G . . . . . . T . . . . G . G . . . . . . G . . C . . . T . . . . . . G . . . . . . 150 G C . . . . . . . . . . . . G . . . . . . C . . . . . . A . . . . . . T . . . . . . G . . . . . . C . . . . . . C . . . . . . T . . . . . . 160 G T . . . . . . . . . . . . A . . . . . . A . . . . . . T . . . . . . C . . . . . T C . . . . . . T . . C C C C A . . . . . . G . . . . . . 170 C T . . . . . . . . . . . . A . . . . . . C . . . . . . T . . . . . . A . . . . . . G . . . . . . G . . . . . . A . . G G G G A . . . . . . 180 G . . . . . . hum-1 hum-2 hum-3 chimp bonob goril orang G . . . . . . C . . . . . . T . . . . . . G . . . . . . A . . . . . . G . . . . . . G . . . . . . C . . . . . . 190 A G G . . . . . . . . . . . . . . . . . . A . . . . . . G . . A A . . A . . . . . . A . . . . . . T . . . . . . C . . . . . . G . . . . A . C . . . . . . 200 T T . . . . . . . . . . . . G . . . . . . A . . . . . . A . . . . . . C . . . . . . C . . . . . . C . . . . . . G . . . . . . G . . . . . . 210 G A . . . . . . . . . . . . G . . . . A . G . . . . . . T . . . . . . G . . . . . . G . . . . . . A . . . . . . G . . . . . . G . . . . . . 220 T T . . . . . . . . . . . . G . . . . T . A . . . . . T G . . . . . . G . . . . . . T . . . . . . G . . . . . . A . . . . . . G . . . . . . 230 C T . . . . . . . . . . . . G . . . . . . A . . . . . . G . . . . . . A . . . . . . T . . . . . . C . . . . . . A . . . . . G C . . . . . . 240 G . . . . A . hum-1 hum-2 hum-3 chimp bonob goril orang C . . . . . . C . . . . . . A . . . . . . T . . C C . . T . . . . . . G . . . . . . C . . . . . . A . . . . . . 250 C T C . . . . . . . . . . . . . . . . . . C . . . . . . A . . . . . . G . . . . . . C . . . . . . C . . . . . . T . . . . . . G . . . . . . G . . . . . . 260 G C . . . . . . . . . . . . A . . . . . . A . . . . . . C . . . . . . A . . . . . . A . . . . . . G . . . . . . A . . . . . . G . . . . . . 270 C A . . . . . . . . . . . . A . . . . . . A . . . . . . A . . . . . . C . . . . . . T . . . . . . C . . . . . . C . . . . . . G . . A A A A 280 T C . . . . . . . . . . . . T . . . . . . C . . . . . T A . . . . . . A . . . . . . A . . . . . . A . . . . . . A . . . . . . A . . . . . . 290 T A . . . . . . . . . . . . A . . . . . . A . . . . . . T . . . . . . A . . . . . . A A . . . . C . . . 17 . . . . T . . . . . . A . . . . . . 300 A . . . . . . Alu sequences:uncorrected pairwise distances DISTANCES between nucleic acid sequences calculated over: 1 to 300, considering all base positions Below diagonal: observed number of differences Above diagonal: number of differences per 100 bases hum-1 hum-2 hum-3 chimp bonob goril orang hum-1 ---- 0 0 4 3 5 7 hum-2 1 ---- 1 4 4 5 7 hum-3 1 2 ---- 4 4 5 7 chimp 12 13 13 ---- 1 5 6 bonob 10 11 11 2 ---- 4 5 15 15 14 12 ---- 7 21 21 18 16 22 ---- goril orang 14 20 18 Alu sequences: corrected pairwise distances DISTANCES between nucleic acid sequences, calculated over: 1 to 300, considering all base positions Below diagonal: observed number of differences Above diagonal: estimated number of substitutions per 100 bases Correction method: Jukes-Cantor hum-1 hum-2 hum-3 chimp bonob goril orang hum-1 ---- 0 0 4 3 5 7 hum-2 1 ---- 1 4 4 5 7 hum-3 1 2 ---- 4 4 5 7 chimp 12 13 13 ---- 1 5 6 bonob 10 11 11 2 ---- 4 6 goril 14 15 15 14 12 ---- 8 19 orang 20 21 21 18 16 22 ---- Correcting distances between DNA and protein sequences We mentioned earlier that it is necessary to adjust observed percent differences to get a distance measure which scales linearly with time. This is because we can have multiple and back substitutions at a given position along a lineage. All of the correction methods (with names like Jukes-Cantor, 2-parameter Kimura, etc) are justified by simple probabilistic arguments involving Markov chains whose basis is worth mastering. The same molecular evolutionary models are used in scoring sequence alignments. 20 Markov chain State space = {A,C,G,T}. p(i,j) = pr(next state Sj | current state Si) Markov assumption: p(i,j) = pr(next state Sj | current state Si & any configuration of states before this) Only the present state, not previous states, affects the probs of moving to next states. 21 A simple 4-state Markov chain: the Kimura 2-parameter model for nucleotide change A A c G a C b T b G a c b b C b b c a T b b a c Transition probabilities: Horizontal: a Diagonal & vertical: b Self: c = a2b A G C T 22 The multiplication rule pr(state after next is Sk | current state is Si) = ∑j pr(state after next is Sk, next state is Sj | current state is Si) [addition rule] = ∑j pr(next state is Sj| current state is Si) x pr(state after next is Sk | current state is Si, next state is Sj) = ∑j p(i,j) x p(j,k) [multiplication rule] [Markov assumption] = (i,k)-element of P2, where P=(p(i,j)). More generally, pr(state t steps from now is Sk | current state is Si) = i,k element of Pt 23 Continuous-time version For any s,t write pij(t) = pr(Sj at time t+s | Si at time s) for the stationary (time-homogeneous) transition probabilities. Write P(t) = (pij(t)) for the matrix of pij(t)’s. Then for any t,u: P(t+u) = P(t) P(u). It follows that P(t) = exp(Qt), where Q = P’(0) is the derivative of P(t) at t = 0. Q is called the infinitesimal matrix of P(t), and satisfies P’(t) = QP(t) = P(t)Q. 24 Interpretation of Q Roughly, q(i,j) is the rate of transitions of i to j, while q(i,i) = j q(i,j), so each row sum is 0. If under some initial conditions, we have a Markov chain evolving in continuous time with infinitesimal matrix Q, and pj(t) = pr(Sj at time t), then pj(t+h) =i pr(Si at t, Sj at t+h) = i pr(Si at t)pr(Sj at t+h | Si at t) = pj(t)x(1+hqjj) + i jpi(t)x hqij i.e., h-1[pj(t+h) - pj(t)] = pj(t)q(j,j) + i j pi(t)q(i,j) which becomes P’ = QP as h 0. Important approximation: when t is small, P(t) I + Qt. 25 The Jukes-Cantor model (1969) -3 Q= Common ancestor of human and orang. t time units P(t) = human (now) Consider e.g. the 2nd position in a-globin2 Alu1. -3 r s s s s r s s -3 -3 s s r s s s s r r = (1+3e4t)/4, s = (1 e4t)/4. 26 Definition of PAM Let P(t) = exp(Qt). Then the A,G element of P(t) is pr(G now | A then) = (1 e4t)/4. Same for all pairs of different nucleotides. Overall rate of change k = 3t. When k = .01, described as 1 PAM PAM = accepted point mutation Put t = .01/3 = 1/300. Then the resulting P = P(1/300) is called the PAM(1) matrix. Why use PAMs? 27 Evolutionary time, PAM Since sequences evolve at different rates, it is convenient to rescale time so that 1 PAM of evolutionary time corresponds to 1% expected substitutions. For Jukes-Cantor, k = 3t is the expected number of substitutions in [0,t], so is a distance. (Show this.) Set 3t = 1/100, or t = 1/300, so 1 PAM = 1/300 years. 28 Distance adjustment For a pair of sequences, k = 3t is the desired metric, but not observable. Instead, pr(different) is observed. So we use a model to convert pr(different) to k. This is completely analogous to the conversion of = pr(recombination) to genetic (map) distance (= expected number of crossovers) using the Haldane map function = 1/2 (1 e-2d), assuming the no-interference (Poisson) model. 29 Towards Jukes-Cantor adjustment common ancestor still 2nd position in a-globin Alu 1 Assume that the common ancestor has A, G, C or T with probability 1/4. G orang C human Then the chance of the nt differing 3/4 p≠ = 3/4 (1 e8t) = 3/4 (1 e4k/3), since k =2 3t 30 t Jukes-Cantor adjustment If we suppose all nucleotide positions behave identically and independently, and n≠ differ out of n, we can invert this, obtaining = 3/4 log(1 4/3 n≠/n). This is the corrected or adjusted fraction of differences (under this simple model). 100 to get PAMs The analogous simple model for amino acid sequences has = 19/20 log(1 20/19 n≠/n). 100 for PAM. 31 Illustration 1. Human and bovine beta-globins are aligned with no deletions at 145 out of 147 sites. They differ at 23 of these sites. Thus n≠/n = 23/145, and the corrected distance using the Jukes-Cantor formula is (natural logs) 19/20 log(1 20/19 23/145) = 17.3 10-2. 2. The hum-1 and gorilla sequences are aligned without gaps across all 300 bp, and differ at 14 sites. Thus n≠/n = 14/300, and the corrected distance using the Jukes-Cantor formula is 3/4 log(1 4/3 14/300) = 4.8 10-2. 32 Correspondence between observed a.a. differences and the evolutionary distance (Dayhoff et al., 1978) Observed Percent Difference 1 5 10 15 20 25 30 35 40 45 50 55 60 65 70 75 80 85 Evolutionary Distance in PAMs 1 5 11 17 23 30 38 47 56 67 80 94 112 133 159 195 246 328 33