* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Genome evolution: a sequence
Survey
Document related concepts
Transcript
Genome Evolution © Amos Tanay, The Weizmann Institute Genome evolution Lecture 1: Introduction (and some background on Markov Processes) Amos Tanay, Ziskind 204, ext 3579 עמוס תנאי [email protected] http://www.wisdom.weizmann.ac.il/~atanay/GenomeEvo/ Genome Evolution © Amos Tanay, The Weizmann Institute Evolution Code Genome Genome Evolution © Amos Tanay, The Weizmann Institute (Most of us) don’t think we evolved to be optimal.. Talking and eating… Retina in human and octopus octopus human retina Nerve fibers The Appendix Evolving tiny iphone fingers? Genome Evolution © Amos Tanay, The Weizmann Institute But when looking at the genome.. • Very commonly, evolution is assigned with super-powers • Any non random pattern in the genome is assumed to have a “meaning” • ..because otherwise “evolution would have eliminated it” • In a way – the current attitude to evolution among many biologists is semi-religious Mathematics helps demystifying the inner works of evolution Genome Evolution © Amos Tanay, The Weizmann Institute The Lamarckian view – directed evolution Jean Baptiste Lamarck Genome Evolution © Amos Tanay, The Weizmann Institute The Darwinian view – natural selection Darwin Genome Evolution © Amos Tanay, The Weizmann Institute Statistics, Genetics and Molecular Biology Frequency of recessive allele (blue flower color) in “desert snow” flowers (Lynanthus parruae) FSR 0.1589 0.717 0.005 0.657 Haldane FRT 0.3299 0.000 0.000 0.032 0.573 Fischer 0.000 0.009 0.302 0.000 0.007 0.002 0.004 0.000 0.000 0.504 0.005 0.008 0.126 0.000 0.339 0.000 H 0.4995 0.010 0.106 0.224 0.068 0.000 H 0.0272 Dobzhansky 0.014 0.411 H 0.3062 Mayr Genome Evolution © Amos Tanay, The Weizmann Institute Modeling evolution – take I Blue allele A A Generations/time a Yellow allele a Modeling the dynamics of allele frequencies AA Generations/time AA Aa aa Aa aa Modeling the dynamics of allele frequencies Genome Evolution © Amos Tanay, The Weizmann Institute Modeling evolution – take I Blue allele A Generations/time a Yellow allele A a Modeling the dynamics of allele frequencies t t+1 Genome Evolution © Amos Tanay, The Weizmann Institute Try it at home I: 1. Simulate a population of 10,000 “genomes”, all having the same genotype AA 2. Replace 10 individuals with an “aA” allele 3. Simulate a new generation: synthesize 20,000 random pairs and select 10,000 out of them 4. Plot the fraction of the “a” allele in the population 5. Repeat the experiment many times: can you say something (empirical/analytic) about the probability of ending up with a population lacking “a” completely? Genome Evolution © Amos Tanay, The Weizmann Institute Statistics, Genetics and Molecular Biology The code – Genomic sequences …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… Monod Jacob Crick The machine – Protein networks in cells Genome Evolution © Amos Tanay, The Weizmann Institute Statistics, Genetics and Molecular Biology A. Quantitative description of populations their genes and their evolution 1920 Fischer 1930 1950 B. Molecular understanding of the mechanisms that store, transmit and process biological information Haldane Dobzhansky Crick A+B = The (only?) real quantitative theory of biology! Mayr Monod Jacob Genome Evolution © Amos Tanay, The Weizmann Institute Neutral Evolution Selectionists: Mutations are occurring by chance - some get selected and these are the changes we see between genomes Kimura et al.: Most of the changes between genomes are neutral - not a result of selection! …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAAAGGGCAGATGGCATTCTAGATCGAAAGCATGAAACTAGATAGCAT… …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… Kimura …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… Genome Evolution © Amos Tanay, The Weizmann Institute Neutral Evolution Kimura’s analytic achievement was the solution of a certain class of Partial Differential Equations that describe the dynamic of allele frequencies under neutral evolution 1 V ( x) ( x, t ) ( x, t ) M ( x) ( x, t ) t x 2 x 2 But we can try and understand the essence of neutral evolution even without fancy mathematics: Neutral changes Along the path are fixated Last common ancestor t=1 Coalescent time t=n Genome Evolution © Amos Tanay, The Weizmann Institute Modeling evolution – take II One letter: t=1 t=2 t=3 t=4 t=5 A A A G G t=6 t=7 G Markov A pAG A Markov Chain: A pGA G A set of states (A,C,G,T) Transition probabilities 1 P0 i P01 i 0 1 P1i P10 i 1 .. .. .. .. P Pn 0 n0 C P02 .. P0 n P12 .. P1 n .. .. .. .. .. .. Pn 0 .. 1 Pni in T Genome Evolution © Amos Tanay, The Weizmann Institute Modeling evolution – take II Pij (t ) 0, P (t ) 1 ij Kolmogorov j P (t ) P (h) P (t h) ik kj ij t, h 0 A Markov process: k 1 i j lim Pij (t ) t 0 0 i j Theorem: 1 Pii (t ) Pii ' (0) lim qii t 0 t Pij (t ) Pij ' (0) lim qij t 0 t P' (t ) QP (t ) P(t ) exp( Qt ) A set of states (A,C,G,T) Transition rates rates Genome Evolution © Amos Tanay, The Weizmann Institute From hundreds to billions loci…. Genome = many independent nucleotides x1 Universal x2 x3 x4 x5 x6 Q 1960 Multiple copies of the same Markov process 1970 1980 Protein analysis Phylogenetic reconstruction 1990 2000 2010 Genome Evolution © Amos Tanay, The Weizmann Institute From hundreds to billions loci…. Genome = many independent nucleotides x1 Universal x2 x3 x4 x5 x6 Q Multiple copies of the same Markov process 1960 1970 1980 1990 2000 2010 Genome Evolution © Amos Tanay, The Weizmann Institute “Its conserved” Don’t change! Warn you! …ACGAATAGCAAATGGGCAGATGGCAGTCTAGATCGAAAGCATGAAACTAGATAGCAT… The everlasting power of conservation Conserved == Functional? Conserved == because it was selected for? Genome Evolution © Amos Tanay, The Weizmann Institute Mutations are not simple Mutations cannot be simple! DNA replication – wiki version.. Try it at home II: 1. Download human-chimp pairwise alignment: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/vsPanTro2/ 2. Count how many time a human T is align to chimp A,C,G,T 3. Recount, this time classifying according to the bases before and after (16 possibilities) 4. Do you see differences in the fraction of conserved T’s? Genome Evolution © Amos Tanay, The Weizmann Institute Evolution of an AAAA..AA sequence X’t Maximum Likelihood flanking aware model Xt Xt+ 1 X’’t Xt Xt+ 1 Maximum Likelihood independent loci model Movie removed to save space Simulation based on real parameters from yeast genomes Genome Evolution © Amos Tanay, The Weizmann Institute Evolution of an AAAA..AA sequence X’t Maximum Likelihood flanking aware model Xt Xt+ 1 X’’t Xt Xt+ 1 Maximum Likelihood independent loci model Movie removed to save space Simulation based on real parameters from yeast genomes Genome Evolution © Amos Tanay, The Weizmann Institute What happened? stationary distributions TCT TTT TAT TGT TTT Stationary distribution condition pTTT ( pTTT TCT pTTT TAT pTTT TGT ) pTAT pTAT TTT pTCT pTCT TTT pTGT pTGT TTT QTXT CTT Theorem: (under reasonable conditions..) - A Markov process will converge to a stationary distribution Theorem: the stationary distribution will be uniform iff the transition probabilities are symmetric QCXT CTT Genome Evolution © Amos Tanay, The Weizmann Institute Try it at home III: 1. Download a human chromosome: http://hgdownload.cse.ucsc.edu/goldenPath/hg18/ 2. Count the number of appearances of each word on 6 DNA bases 3. Now do it again on windows of 1MB. 4. Are the statistics always the same? How can this be tested? What can explain possible differences? Genome Evolution © Amos Tanay, The Weizmann Institute Example: CpG islands CG 1% diverged 40% diverged >99% of the genome: CpG is methylated <1% of the genome CpG is unmethylated CG CG CA TG CA TG CpG islands: high CpG stationary distribution Due to variable mutability Got very famous: (3.4M hits in google..) – and are still very confusing May play a role in cancer – no optimality here Genome Evolution © Amos Tanay, The Weizmann Institute More complex example: codon bias nucleotides Amino-acids Ala/A GCU, GCC, GCA, GCG Leu/L UUA, UUG, CUU, CUC, CUA, CUG Arg/R CGU, CGC, CGA, CGG, AGA, AGG Lys/K AAA, AAG Asn/N AAU, AAC Met/M AUG Asp/D GAU, GAC Phe/F UUU, UUC Cys/C UGU, UGC Pro/P CCU, CCC, CCA, CCG Gln/Q CAA, CAG Ser/S UCU, UCC, UCA, UCG, AGU, AGC Glu/E GAA, GAG Thr/T ACU, ACC, ACA, ACG Gly/G GGU, GGC, GGA, GGG Trp/W UGG His/H CAU, CAC Tyr/Y UAU, UAC AUU, AUC, AUA Val/V GUU, GUC, GUA, GUG AUG STOP UAG, UGA, UAA Ile/I START The genetic code is degenerate: More than one codon (triplet) for each amino acid Codon bias: codons are not appearing uniformly in the genome Popular theory: codon usage is optimized to maximize protein yield Genome Evolution © Amos Tanay, The Weizmann Institute Evolution Genome (Genetic code) CODE1 CODE2 (Codon bias) Function Function2 (Protein coding sequence) (Translation control) Genome Evolution © Amos Tanay, The Weizmann Institute The sequence context affect the probability of a short insertion or deletion within >100 fold Insertion/Deletion usually break the gene frame and destroy the derived protein Sequence context Around deletions l=2 Translating garbage A C G T Codons are selected to reduce the potential of frame shifting mutations! Distance from deletion junction Sequence context Around inserts l=2 Relative Number of Insert susceptible loci Relative Number of delete susceptible loci Randomized exons vs. Real exons A C G T Insert length Distance from insert Deletion length Genome Evolution © Amos Tanay, The Weizmann Institute Evolution Mutability Genome (Genetic code) CODE1 CODE2 (Codon bias) Function Function2 (Protein coding sequence) (Translation control) Genome Evolution © Amos Tanay, The Weizmann Institute Genome sequences are something new and exciting We can approach them using the old techniques and methodologies… • Focusing on one feature at a time • Looking at the static picture • Boring ourselves to death Or we can take new bold approaches… This is what Lamark, Darwin, Myer, Crick or Kimura would have done… Jacob Genome Evolution © Amos Tanay, The Weizmann Institute Humans and Chimps ~5-7 million years 3X109 {ACGT} 3X109 {ACGT} Genome alignment • Where are the “important” differences? • How did they happen? Genome Evolution © Amos Tanay, The Weizmann Institute 9% 1.2% 0.8% 3% 1.5% 0.5% Human Chimp Gorilla Orangutan Gibbon Baboon Macaque Where are the “important” differences? How did new features were gained? Marmoset 0.5% Genome Evolution © Amos Tanay, The Weizmann Institute Antibiotic resistance: Staphylococcus aureus Timeline for the evolution of bacterial resistance in an S. aureus patient (Mwangi et al., PNAS 2007) •Skin based •killed 19,000 people in the US during 2005 (more than AIDS) •Resistance to Penicillin: 50% in 1950, 80% in 1960, ~98% today •2.9MB genome, 30K plasmid How do bacteria become resistant to antibiotics? Can we eliminate resistance by better treatment protocols, given understanding of the evolutionary process? Genome Evolution © Amos Tanay, The Weizmann Institute Ultimate experiment: sequence the entire genome of the evolving S. aureus Mutations Resistance to Antibiotics Vanco. Rifampi Oxacili Dapto. 20/7 1 0.012 0.75 0.01 20/9 4 16 25 0.05 1/10 6 16 0.75 0.05 6/10 8 16 1.5 1.0 13/10 8 16 0.75 1.0 1 2 3 4-6 7 8 9 10 11 12 13 14 15…18 S. Aureus got found just few “right” mutations and survived multi-antibiotics Genome Evolution © Amos Tanay, The Weizmann Institute Yeast Genome duplication • The budding yeast S. cerevisiae genome have extensive duplicates • We can trace a whole genome duplication by looking at yeast species that lack the duplicates (K. waltii, A. gosypii) • Only a small fraction (5%) of the yeast genome remain duplicated Genome Evolution © Amos Tanay, The Weizmann Institute • How can an organism tolerate genome duplication and massive gene loss? • Is this critical in evolving new functionality? Genome Evolution © Amos Tanay, The Weizmann Institute “Junk” and ultraconservation Baker’s yeast 12MB ~6000 genes The worm c.elegans 100MB ~20,000 genes Humans 3GB ~27,000 genes 1 cell ~1000 cells ~50 trillions cells Genome Evolution © Amos Tanay, The Weizmann Institute From: Lynch 2007 Genome Evolution © Amos Tanay, The Weizmann Institute intergenic exon intron exon intron exon intron ENCODE Data exon intergenic Genome Evolution © Amos Tanay, The Weizmann Institute Things you need to know or catch up with: • • • Basic discrete probability, std distributions Basic graphs and combinatorics A clue on biology will be helpful, not vital What you’ll learn: • • • • • Foundations: Population genetics Foundations: Molecular evolution inference in graphical models comparative genomics Intro to genome organization and key concepts in evolution Books: Graur and Li, Molecular Evolution Lynch, Origin of genome architecture Hartl and Clark, Population genetics Durbin et al. Biological sequence analysis Karlin and Taylor, Markov Processes Freidman and Koller draft textbook (handouts) Papers as we go along.. N. Friedman D. Koller BN and beyond Genome Evolution © Amos Tanay, The Weizmann Institute Course duties • 4 exercises, 40% of the grade • 1 Genomic exercise (in pairs) for 10% of the grade – Compare two genomes of your choice: mammals, worms, flies, yeasts, bacteria, plants • Exam: 60% – Master key computational results Genome Evolution © Amos Tanay, The Weizmann Institute Probabilities • Our probability space: – DNA/Protein sequences: {A,C,G,T} – Time/populations • Queries: – If a locus have an A at time t, what is the chance it will be C at time t+1? – If a locus have an A in an individual from population P, what is the chance it will be C in another individual from the same population? – What is the chance to find the motif ACGCGT anywhere in a random individual of population P? what is the chance it will remain the same after 2m years? P( ) P( ) Conditional Probability: P( | ) Chain Rule: P( ) P( | ) P( ) Bayes Rule: P( | ) A B P( | ) P( ) P( ) Genome Evolution © Amos Tanay, The Weizmann Institute Modeling processes Modeling the process t Comparing end points (probability distribution Genome Evolution © Amos Tanay, The Weizmann Institute Poisson process 0 1 2 3 4 0 1 2 3 Random walk -1 Markov chain A Brownian motion B t C D Discrete time T=1 T=2 T=3 T=4 T=5 Continuous time X R1 , R 2 Genome Evolution © Amos Tanay, The Weizmann Institute Markov chains General Stochastic process: X s, Pr( state, time) The Markov property: u t s, knowing X t makes X s and X u indepdent A set of states: Finite or Countable. (e.g., Integers, {A,C,G,T}) Discrete time: T=0,1,2,3,…. P( x, s; t , A) Pr( X t A | X s x) Transition probability P( x, s; t , A) P( x, t s, A) Stationary transition probabilities Pij Pr( X t 1 j | X t i ) ( X t1 h , X t 2 h,..., X tn h) ~ ( X t1 , X t 2,..., X tn) One step transitions Stationary process Genome Evolution © Amos Tanay, The Weizmann Institute Markov chains 4 Nucleotides The loaded coin A B T=1 A G C T A B T=2 A G C T A B T=3 A G C T A B T=4 A G C T 20 Amino Acids pab ARNDCEQGHILKMFPSTWYV ARNDCEQGHILKMFPSTWYV 1-pab A B pba 1-pba ARNDCEQGHILKMFPSTWYV ARNDCEQGHILKMFPSTWYV Genome Evolution © Amos Tanay, The Weizmann Institute Markov chains Transition matrix P: 1 P0 i P01 i 0 1 P1i P10 i 1 .. .. .. .. P Pn 0 n0 P02 .. P0 n P12 .. P1 n .. .. .. .. .. .. Pn 0 .. 1 Pni in A discrete time Markov chain is completely defined given an initial condition and a probability matrix. The Markov chain graph G is defined on the states. We connect (a,b) whenever Pab>0 ( P )T x Matrix power Distribution after T time steps given x as an initial condition Genome Evolution © Amos Tanay, The Weizmann Institute Spectral decomposition T=1 A T=2 A paa B pab T=3 A paa paa pab pba B paa pab pab pbb P 1 P0 i P01 i 0 1 P1i P10 i 1 .. .. .. .. P Pn 0 n0 Right,left eigenvector: Px x, xP x When an eigen-basis exists x (1) , x ( 2 ) ,..., x ( n ) We can find right eigenvectors: (1) ,.., ( n ) And left eigenvectors: With the eigenvalue spectrum: P02 .. P0 n P12 .. P1 n .. .. .. .. .. .. Pn 0 .. 1 Pni in (1) ,.., ( n ) 1 ,.., n Which are bi-orthogonal: ( (1) , (1) ) k 1ik jk ij And define the spectral decomposition: P n Genome Evolution © Amos Tanay, The Weizmann Institute Spectral decomposition P PT ()T ... T 1T 0 T 0 0 T2 0 0 0 Tn To compute transition probabilities: O(|E|)*T ~ O(N2)*T per initial condition T matrix multiplications to preprocess for time T Using spectral decomposition: O(Spectral pre-process) + 2 matrix multiplications per condition Genome Evolution © Amos Tanay, The Weizmann Institute Convergence Spec(P) = P’s eignvalues, 1 2... 1 = largest, always = 1. A Markov Chain is irreducible if its underlying graph is connected. In that case there is a single eigenvalue that equals 1. What does the left eigenvector corresponding to 1 represent? Fixed point: out (i , j )E Pi Pij ( j ,i )E Pj Pji in 2 = second largest eigenvalue. Controlling the rate of process convergence