Genome Evolution © Amos Tanay, The Weizmann Institute

Genome evolution
Lecture 3: population genetics I: mutation and recombination

Population genetics
Drift: the process by which allele frequencies change through the generations
Mutation: the process by which new alleles are introduced
Recombination: the process by which multi-allelic genomes are mixed
Selection: the effect of fitness on the dynamics of allele drift
Epistasis: the effects of fitness dependencies among different alleles
"Organismal" effects: ecology, geography, behavior

Wright-Fisher model for genetic drift
[Figure: N individuals → ∞ gametes → N individuals → ∞ gametes]
We follow the frequency of an allele in the population until fixation (2N copies) or loss (0 copies).
We can model this as a Markov process on a variable X (the number of A alleles) with transition probabilities:
$$T_{ij} = \binom{2N}{j} \left(\frac{i}{2N}\right)^{j} \left(1 - \frac{i}{2N}\right)^{2N-j}$$
This is the probability of sampling j copies of A from a population of 2N alleles that currently holds i copies.
In a larger population the frequency changes more slowly: the variance of the binomial frequency is pq/2N, so sampling changes it less.
[Figure: chain of states 0 (loss), 1, ..., 2N-1, 2N (fixation)]

Mutations vs Drift
Diversity ($\theta$) = chance of having the same genotype in two random individuals
Mutations generate population diversity
Mutation happens at some biologically dependent rate $\mu$ (more on that later in the course)
Drift eliminates the population's diversity through fixation
Fixation happens on a time scale of ~4N generations
What will the population look like given both forces?

Stationary distribution when drift is dominating
If mutation is slow compared to drift, we can model the population as a single random variable.
Evolution is then a Markov process on two or more states of that variable.
Simplest model: assume two alleles, A and a, with mutation probabilities:
$$\Pr(A \to a) = \mu, \qquad \Pr(a \to A) = \nu$$
If the process runs long enough, it converges to a stationary distribution:
$$\Pr(A) = \frac{\nu}{\mu + \nu}$$
[Figure: two states, A and a, with transitions in both directions]
Remember: under these assumptions, we are likely to sample the entire population in either the A or the a state.
Think: what conditions on the mutation rate can justify this model?

What happens when mutations are rapid?
If mutation is rapid compared to drift, we lose all population structure
This is just a random mixing process
Evolution cannot work in this way: information must be propagated
In practice, populations maintain a non-trivial balance between mutation and drift
But we do not know the mutation rate (or the effective population size)

A coalescent model approach: Infinite alleles model
When alleles were measured at the protein level, it was reasonable to assume that mutations generate new variants (isozymes), never reversing or repeating a variant.
Adding mutations with probability $\mu$, the coalescent process is extended by killing lineages (time is sped up by a factor of 2N):
Coalescence: $\frac{k(k-1)}{2} \cdot \frac{1}{2N}$ per generation, i.e. $\frac{k(k-1)}{2}$ in scaled time
Mutation: $k\mu$ per generation, i.e. $2Nk\mu = \frac{k\theta}{2}$ in scaled time $(\theta = 4N\mu)$
[Figure: genealogy traced back in time, the "coalescent with killing"]

Hoppe's Urn
Probability model (Hoppe's urn): we select from an urn holding one black ball of mass $\theta$ and further balls of other colors, each of mass 1. Each time the black ball is selected, a new ball with a new color is added to the urn. If another color is selected, the selected ball is returned to the urn together with another ball of the same color.
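The urn scheme just described is easy to simulate. Below is a minimal Python sketch, not the lecture's own code: the function name `hoppe_urn`, the parameter name `theta` (the mass of the black ball) and the integer color labels are illustrative assumptions.

```python
import random


def hoppe_urn(n, theta, rng=None):
    """Draw n times from Hoppe's urn; return the color of each ball added.

    Assumptions (illustrative): the black ball has mass `theta` > 0, every
    colored ball has mass 1, and colors are labeled 0, 1, 2, ... in order
    of first appearance.
    """
    rng = rng or random.Random()
    colors = []        # one entry per unit-mass ball currently in the urn
    next_color = 0
    for _ in range(n):
        total_mass = theta + len(colors)
        if rng.random() * total_mass < theta:
            # Black ball drawn: add a ball with a brand-new color.
            colors.append(next_color)
            next_color += 1
        else:
            # Colored ball drawn: return it together with a copy of its color.
            colors.append(rng.choice(colors))
    return colors


if __name__ == "__main__":
    sample = hoppe_urn(20, theta=2.0, rng=random.Random(0))
    print("colors:", sample)
    print("distinct colors:", len(set(sample)))
```

Each distinct color plays the role of a distinct allele in a sample of size n, so repeated runs give an empirical distribution for the number of alleles at a given value of `theta`.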
Theorem: Hoppe's urn and the coalescent with killing are equivalent (this is also the Chinese restaurant process)
[Figure: with n colored balls in the urn, each colored ball is drawn with probability $1/(n+\theta)$ and the black ball with probability $\theta/(n+\theta)$]

Testing the infinite alleles model
Theorem (Ewens sampling formula): Let $a_i$ be the number of alleles present i times in a sample of size n. When the scaled mutation rate is $\theta = 4N\mu$,
$$\Pr(a_1, \dots, a_n) = \frac{n!}{\theta(\theta+1)\cdots(\theta+n-1)} \prod_{j=1}^{n} \frac{(\theta/j)^{a_j}}{a_j!}$$
A simpler statistic is the number of distinct alleles k. It should have the expected value:
$$E(k) = 1 + \frac{\theta}{\theta+1} + \frac{\theta}{\theta+2} + \dots + \frac{\theta}{\theta+n-1}$$
Proof: At step i of Hoppe's process, we draw the black ball with probability $\frac{\theta}{\theta+i-1}$.

Testing the infinite alleles model
[Figures 7.16, 7.17]
Not quite neutral: VNTR locus in humans, observed (open columns) and Ewens-predicted allele counts.
Highly non-neutral: F computed from the number of Xdh alleles in 89 D. pseudoobscura lines (52 had a common allele, 8 were singletons), compared to a simulation assuming the infinite alleles model.

Infinite sites model
In the infinite sites model, mutations occur at distinct sites, exactly once.
This model is appropriate for long DNA sequences.
Theorem: Let $\mu$ be the mutation rate for a locus under consideration, and set $\theta = 4N\mu$. Under the infinite sites model, the expected number of segregating sites is:
$$E(S_n) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$
Proof: Let $t_j$ be the amount of time in the coalescent during which there are j lineages. We showed earlier that $t_j$ has approximately an exponential distribution with mean $2/(j(j-1))$. The total amount of time in the tree for a sample of size n is:
$$T_{tot} = \sum_{j=2}^{n} j\,t_j, \qquad E(T_{tot}) = \sum_{j=2}^{n} j \cdot \frac{2}{j(j-1)} = 2\sum_{j=2}^{n} \frac{1}{j-1}$$
Mutations occur at rate $2N\mu$:
$$E(S_n) = 2N\mu\,E(T_{tot}) = \theta \sum_{i=1}^{n-1} \frac{1}{i}$$

Infinite sites model
Theorem: Let $\theta = 4N\mu$.
Under the infinite sites model, the number of segregating sites $S_n$ has variance
$$V(S_n) = \theta \sum_{i=1}^{n-1} \frac{1}{i} + \theta^2 \sum_{i=1}^{n-1} \frac{1}{i^2}$$
Proof: Let $s_j$ be the number of segregating sites created when there were j lineages. While there are j lineages, we get mutations at rate $2N\mu j$ and coalescence at rate $j(j-1)/2$. A mutation occurs before coalescence with probability:
$$\frac{2N\mu j}{2N\mu j + j(j-1)/2} = \frac{4N\mu}{4N\mu + j - 1} = \frac{\theta}{\theta + j - 1}$$
The probability of k mutations before the coalescence is:
$$\Pr(s_j = k) = \left(\frac{\theta}{\theta+j-1}\right)^{k} \frac{j-1}{\theta+j-1}, \qquad k = 0, 1, 2, \dots$$
This is a shifted geometric distribution with success probability $p = \frac{j-1}{\theta+j-1}$:
$$Var(s_j) = \frac{1-p}{p^2} = \frac{\theta}{\theta+j-1} \cdot \frac{(\theta+j-1)^2}{(j-1)^2} = \frac{\theta^2 + (j-1)\theta}{(j-1)^2} = \frac{\theta}{j-1} + \frac{\theta^2}{(j-1)^2}$$
The $s_j$ are independent, so summing over j = 2, ..., n gives the stated variance.

Watterson's estimator, using the infinite sites model
We can estimate $\theta = 4N\mu$ from an empirical $S_n$.
Theorem: For Watterson's estimator
$$\theta_W = \frac{S_n}{h_n}, \qquad h_n = \sum_{i=1}^{n-1} \frac{1}{i}, \qquad g_n = \sum_{i=1}^{n-1} \frac{1}{i^2}$$
we have
$$E(\theta_W) = \theta, \qquad V(\theta_W) = \frac{\theta}{h_n} + \theta^2 \frac{g_n}{h_n^2}$$
So we can build a model of the population from as little data as S.
What will happen if we want to incorporate more complex models (e.g., expansion, migration)?

Finite alleles model
If we think of a single DNA base, we only have 4 possible alleles
Our model must then include recurrent mutations
[Figure: transitions among the four bases A, G, T, C]
Even if we assume neutrality, our mutations can become dependent:
- We may have different rates at different sites
- We may have coupling between one base and the bases nearby
We may need to consider insertions and deletions
Importantly, if all these are neutral, then the basic coalescent structure is not affected
The Poisson process:
$$\Pr(m = j) = \frac{(\lambda t)^{j}}{j!} e^{-\lambda t}, \qquad \text{Expected} = \lambda t$$

Using simulations
The sampling procedure:
Generate a large number of populations (using the model we presented)
Compute the distribution of your statistic on these random cases
Compare it to the value you observe in your population
If you find a significant bias, some modeling assumption must be wrong
In principle, we can sample generation after generation, for sufficient time (how much?)
Direct simulation using Wright-Fisher is painfully expensive (why?)
If you are only interested in the current population, most of your coin tossing will be useless
We can use the coalescent approach and just sample genealogies, going back in time
For example, using the coalescent with killing
Important: this is analogous to first sampling a tree and then scattering the mutations on it
We can also think of simulating evolution while ignoring the population, based on the Markov process shown above (what are the limitations here?)

Recombination and linkage
Assume two loci have alleles A1, A2 and B1, B2
Linkage equilibrium:
$$P(A_1B_1) = p_1q_1, \quad P(A_1B_2) = p_1q_2, \quad P(A_2B_1) = p_2q_1, \quad P(A_2B_2) = p_2q_2$$
Only double heterozygotes allow recombination to change gamete frequencies:
A1B1/A2B2 yields the recombinant gametes A1B2 and A2B1
A1B2/A2B1 yields the recombinant gametes A1B1 and A2B2
The recombination fraction r: the proportion of recombinant gametes generated from a double heterozygote
For different chromosomes: r = 0.5
For the same chromosome: a function of the distance and possibly other factors

Linkage disequilibrium (LD)
$$P_{11} = P(A_1B_1), \quad P_{12} = P(A_1B_2), \quad P_{21} = P(A_2B_1), \quad P_{22} = P(A_2B_2)$$
[Figure: an A1B1 gamete arises with probability 1 - r without recombination, from an A1B1 chromosome, and with probability r by recombination between any A1 chromosome and any B1 chromosome]
Next generation:
$$P_{11}' = (1-r)P_{11} + r\,p_1q_1$$
$$P_{11}' - p_1q_1 = (1-r)(P_{11} - p_1q_1)$$
Define the linkage disequilibrium parameter D as:
$$D = P_{11} - p_1q_1$$
$$D_n = (1-r)D_{n-1} = (1-r)^{n} D_0$$
Equivalently:
$$D = P_{11}P_{22} - P_{12}P_{21}$$
[Figure: D against generation for r = 0.05, r = 0.2 and r = 0.5]

Linkage disequilibrium (LD) - example
Blood group genotypes M/N and S/s. Both loci are in Hardy-Weinberg equilibrium.
For M/N: p1 = 0.5425, p2 = 0.4575
For S/s: q1 = 0.3080, q2 = 0.6920

Gamete   Observed   Expected if unlinked
MS       474        334.2
Ms       611        750.8
NS       142        281.8
Ns       773        633.2

$$\chi^2 = \sum \frac{(obs - exp)^2}{exp} = 184.7$$
Linkage equilibrium highly unlikely!
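The χ² statistic and the disequilibrium D for this example can be recomputed directly from the gamete counts. The following Python sketch is only an illustration: the function name `ld_summary` and the dictionary keys are hypothetical, and the MS count is taken as 474, the value implied by p1 = 0.5425, q1 = 0.3080 and expected counts that sum to 2000 gametes (an assumption, flagged again in the code).

```python
def ld_summary(counts):
    """Return (chi2, D) for gamete counts keyed 'A1B1', 'A1B2', 'A2B1', 'A2B2'."""
    n = sum(counts.values())
    p11 = counts["A1B1"] / n
    p12 = counts["A1B2"] / n
    p21 = counts["A2B1"] / n
    p22 = counts["A2B2"] / n
    p1, p2 = p11 + p12, p21 + p22      # allele frequencies at the first locus
    q1, q2 = p11 + p21, p12 + p22      # allele frequencies at the second locus
    # Expected counts under linkage equilibrium: product of allele frequencies.
    expected = {
        "A1B1": n * p1 * q1,
        "A1B2": n * p1 * q2,
        "A2B1": n * p2 * q1,
        "A2B2": n * p2 * q2,
    }
    chi2 = sum((counts[k] - expected[k]) ** 2 / expected[k] for k in expected)
    d = p11 * p22 - p12 * p21
    return chi2, d


# M/N is treated as the first locus (M = A1), S/s as the second (S = B1).
# Assumption: MS = 474, so that the counts total 2000 gametes.
blood_groups = {"A1B1": 474, "A1B2": 611, "A2B1": 142, "A2B2": 773}

if __name__ == "__main__":
    chi2, d = ld_summary(blood_groups)
    print(f"chi2 = {chi2:.1f}, D = {d:.4f}")
```

Under these assumptions the statistic should land close to the quoted value of 184.7; small differences come from rounding the allele frequencies.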
$$D = P_{11}P_{22} - P_{12}P_{21} = 0.07$$

Sources of Linkage disequilibrium
LD in the original population that has not yet decayed because r is low
Genetic coadaptation: regions of the genome that are not subject to recombination (for example, inverted chromosomal fragments)
Admixture of populations with different allele frequencies:

               P11      P12      P21      P22      D
Population 1   0.0025   0.0475   0.0475   0.9025   0
Population 2   0.9025   0.0475   0.0475   0.0025   0
Equal mix      0.4525   0.0475   0.0475   0.4525   0.2025

Recombination rates in the human population: LD blocks
[Figure: LD blocks]

Recombination rates in the human population
Recombination rates are highly non-uniform, with major effects on genome structure!

Selection
Fitness: the relative reproductive success of an individual (or genome)
Fitness is only defined with respect to the current population.
Fitness is unlikely to remain constant in all conditions and environments
Sampling probability is multiplied by a selection factor 1+s
Mutations can change fitness
A deleterious mutation decreases fitness. It would therefore be selected against. This process is called negative or purifying selection.
An advantageous or beneficial mutation increases fitness. It would therefore be subject to positive selection.
A neutral mutation is one that does not change fitness.

The Moran model
Instead of working with discrete generations, we replace at most one individual at each time step
[Figure: a population of A and a alleles in which one individual (X) is removed and replaced by sampling from the current population; time step Δt → 0]
We assume time steps are small. What kind of mathematical model describes the process?
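The single-replacement step of the Moran model can be sketched as a small simulation. This is a minimal Python illustration under stated assumptions rather than the lecture's implementation: the names `moran_step` and `moran_until_absorbed` are hypothetical, `size` stands for the number of gene copies (2N), and the replacement is sampled from the whole current population, including the copy that dies.

```python
import random


def moran_step(count_a, size, rng):
    """Apply one replacement event; return the new number of A copies."""
    dies_a = rng.random() < count_a / size    # the copy chosen to die is A
    born_a = rng.random() < count_a / size    # its replacement is sampled as A
    return count_a + int(born_a) - int(dies_a)


def moran_until_absorbed(count_a, size, rng=None):
    """Iterate replacement events until loss (0) or fixation (size)."""
    rng = rng or random.Random()
    steps = 0
    while 0 < count_a < size:
        count_a = moran_step(count_a, size, rng)
        steps += 1
    return count_a, steps


if __name__ == "__main__":
    final, steps = moran_until_absorbed(5, 10, random.Random(0))
    print("fixed" if final == 10 else "lost", "after", steps, "replacement events")
```

The count changes by at most one per event, which is what makes the process a birth-death chain once the time step is taken to zero.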
Continuous time Markov processes
$$P(x, s; t, A) = \Pr(X_t \in A \mid X_s = x), \qquad t \in [0, \infty) \quad \text{(Markov)}$$
Conditions on transitions:
$$P_{ij}(t) \ge 0, \qquad \sum_j P_{ij}(t) = 1$$
$$\sum_k P_{ik}(t)\,P_{kj}(h) = P_{ij}(t+h), \qquad t, h \ge 0 \quad \text{(Chapman-Kolmogorov)}$$
$$\lim_{t \to 0} P_{ij}(t) = \begin{cases} 1 & i = j \\ 0 & i \ne j \end{cases}$$
Theorem:
$$-P_{ii}'(0) = \lim_{t \to 0} \frac{1 - P_{ii}(t)}{t} = q_{ii} \quad \text{exists (may be infinite)}$$
$$P_{ij}'(0) = \lim_{t \to 0} \frac{P_{ij}(t)}{t} = q_{ij} \quad \text{exists and is finite } (i \ne j)$$

Rates and transition probabilities
The process's rate matrix, with diagonal entries $-q_{ii} = -\sum_{k \ne i} q_{ik}$:
$$Q = \begin{pmatrix} -\sum_{i \ne 0} q_{0i} & q_{01} & q_{02} & \cdots & q_{0n} \\ q_{10} & -\sum_{i \ne 1} q_{1i} & q_{12} & \cdots & q_{1n} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ q_{n0} & q_{n1} & q_{n2} & \cdots & -\sum_{i \ne n} q_{ni} \end{pmatrix}$$
Transition differential equations (backward form):
$$P_{ij}(s+t) - P_{ij}(t) = \sum_k P_{ik}(s)P_{kj}(t) - P_{ij}(t) = \sum_{k \ne i} P_{ik}(s)P_{kj}(t) + [P_{ii}(s) - 1]\,P_{ij}(t)$$
Dividing by s and letting $s \to 0$:
$$P_{ij}'(t) = \sum_{k \ne i} q_{ik}P_{kj}(t) - q_{ii}P_{ij}(t)$$
$$P'(t) = QP(t), \qquad P(t) = \exp(Qt)$$

The Moran model
[Figure: a population of A and a alleles in which one individual (X) is removed and replaced by sampling from the current population; time step Δt → 0]
Assume the rate of replacement for each individual is 1.
We derive a model similar to Wright-Fisher, but in continuous time.
A process on a random variable counting the number of A alleles:
[Figure: birth-death chain on states 0 (loss), 1, ..., i-1, i, i+1, ..., 2N-1, 2N (fixation)]
Rates:
"Birth", $i \to i+1$: $b_i = (2N - i)\,\frac{i}{2N}$
"Death", $i \to i-1$: $d_i = i\,\frac{2N - i}{2N}$

Fixation probability
[Figure: the same birth-death chain, with rates $b_i$ and $d_i$ as above]
In fact, in the limit the Moran model converges to the Wright-Fisher model. For example:
Theorem: Going backward in time, the Moran model generates the same distribution of genealogies as Wright-Fisher, except that time runs twice as fast.
Theorem: In the Moran model, the probability that A becomes fixed when there are initially i copies is i/2N.
Proof: like the proof for the Wright-Fisher model.
The expected value of X is unchanged, since births and deaths are equally likely.

Fixation time
Expected fixation time assuming fixation:
$$\bar{E}_i\,\tau = E_i(\tau \mid T_{2N} < T_0)$$
Theorem: In the Moran model, let p = i/2N. Then:
$$\bar{E}_i\,\tau \approx -\frac{2N(1-p)}{p}\,\log(1-p)$$
Proof: not here.
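Both statements, the fixation probability i/2N and the approximate conditional fixation time, can be checked numerically. The Python sketch below is an illustration under assumptions: the function names are hypothetical, `size` stands for 2N, and the continuous-time chain is simulated with exponential waiting times of rate b_i + d_i followed by an up or down move with equal probability (since b_i = d_i).

```python
import math
import random


def simulate_moran(i, size, rng):
    """Run the birth-death chain from i copies; return (fixed, elapsed_time)."""
    t = 0.0
    while 0 < i < size:
        rate = i * (size - i) / size        # b_i = d_i = i (2N - i) / 2N
        t += rng.expovariate(2 * rate)      # waiting time to the next event
        i += 1 if rng.random() < 0.5 else -1
    return i == size, t


def summarize(i, size, runs=4000, seed=0):
    """Estimate the fixation probability and the mean time to fixation."""
    rng = random.Random(seed)
    results = [simulate_moran(i, size, rng) for _ in range(runs)]
    fix_times = [t for fixed, t in results if fixed]
    p_fix = len(fix_times) / runs
    mean_time = sum(fix_times) / len(fix_times) if fix_times else float("nan")
    return p_fix, mean_time


def approx_fixation_time(i, size):
    """Approximation -2N (1 - p) / p * log(1 - p) with p = i / 2N."""
    p = i / size
    return -size * (1 - p) / p * math.log(1 - p)


if __name__ == "__main__":
    p_fix, mean_time = summarize(5, 20)
    print(f"fixation fraction {p_fix:.3f} (theory {5 / 20:.3f})")
    print(f"mean time given fixation {mean_time:.2f} "
          f"(approximation {approx_fixation_time(5, 20):.2f})")
```

For small populations the simulated mean sits somewhat below the approximation, as expected for a formula that is only asymptotic in N.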