Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
CSE280b: Population Genetics Vineet Bafna/Pavel Pevzner March 2006 www.cse.ucsd.edu/classes/sp05/cse291 Vineet Bafna Review • • Hardy Weinberg Equilibrium Linkage Equlibrium March 2006 Vineet Bafna Recombination and Linkage Equilibrium 0 0 0 0 1 1 0 • • • 1 1 1 1 0 0 0 1 1 In a freely mixing population, an individual chromosome randomly chooses its parent from the available pool With unimpeded recombination (Linkage Equilibrium), the individual freely chooses its two parent chromsomes, and then freely chooses alleles from the parents What is the probability of seeing the allele <1 1>? March 2006 Vineet Bafna Measures of LD • • Consider two bi-allelic sites with alleles marked with 0 and 1 Define – – • • P00 = Pr[Allele 0 in locus 1, and 0 in locus 2] P0* = Pr[Allele 0 in locus 1] Linkage equilibrium if P00 = P0* P*0 D = abs(P00 - P0* P*0) = abs(P01 - P0* P*1) = … March 2006 Vineet Bafna LD over time • With random mating, and fixed recombination rate r between the sites, Linkage Disequilibrium will disappear over time –Let D(t) = LD at time t –P(t)00 = (1-r) P(t-1)00 + r P(t-1)0* P(t-1)*0 –D(t) = P(t)00 - –D(t) =(1-r) March 2006 P(t)0* P(t)*0 = P(t)00 - P(t-1)0* P(t-1)*0 D(t-1) =(1-r)t D(0) Vineet Bafna (HW) LD over distance • Assumption – • Recombination rate increases linearly with distance Let r be the (constant) recombination rate per bp per generation D(t) = P(t)00- P0* P*0 = (1-r)d P(t-1)00+ [1- (1-r)d ] P0* P*0 - P0* P*0 = (1-r)d D(t-1) •The assumption is reasonable, but recombination rates vary from region to region, adding to complexity March 2006 Vineet Bafna LD and disease mapping • • • Consider a mutation that is causal for a disease. The goal of disease gene mapping is to discover which gene (locus) carries the mutation. Consider every polymorphism, and check: – – • There might be too many polymorphisms Multiple mutations (even at a single locus) that lead to the same disease Instead, consider a dense sample of polymorphisms that span the genome March 2006 Vineet Bafna LD can be used to map disease genes LD 0 1 1 0 0 1 D N N D D N • • LD decays with distance from the disease allele. By plotting LD, one can short list the region containing the disease gene. March 2006 Vineet Bafna LD and disease gene mapping problems • • • Marker density? Complex diseases Population sub-structure March 2006 Vineet Bafna Population Genetics • • • • Often we look at these equilibria (Linkage/HW) and their deviations in specific populations These deviations offer insight into evolution. However, what is Normal? A combination of empirical (simulation) and theoretical insight helps distinguish between expected and unexpected. March 2006 Vineet Bafna Topic 2: Simulating population data • • We described various population genetic concepts (HW, LD), and their applicability The values of these parameters depend critically upon the population assumptions. – – – – – • What if we do not have infinite populations No random mating (Ex: geographic isolation) Sudden growth Bottlenecks Ad-mixture It would be nice to have a simulation of such a population to test various ideas. How would you do this simulation? March 2006 Vineet Bafna Wright Fisher Model of Evolution • • Fixed population size from generation to generation Random mating March 2006 Vineet Bafna Coalescent model • Insight 1: – – – Separate the genealogy from allelic states (mutations) First generate the genealogy (who begat whom) Assign an allelic state (0) to the ancestor. Drop mutations on the branches. March 2006 Vineet Bafna Coalescent model • – – – – • 0 0 0 0 0 Insight 1: Assign an allelic state (0) to the ancestor. Drop mutations on the branches. The mutations are proportional to the branch length. Each site (locus) mutates at most once. At the end, drop any fixed sites Loci How efficient is this? How many generations do we need to simulate today’s population? 0 1 1 0 1 0 0 1 1 1 0 1 1 0 1 0 0 0 0 1 Individuals March 2006 Vineet Bafna Loci Coalescent theory • Insight 2: – – Much of the genealogy is irrelevant, because it disappears. Better to go backwards March 2006 Vineet Bafna Coalescent • • • • • Note that in the Wright Fischer model, the population is freely mixing, and constant size N One way to think about it is that each individual in the current generation selects a parent uniformly at random from the N individuals in the previous generation. When two individuals choose the same parent, they coalesce. Once they coalesce, they stay together. We continue until only one individual is left. Note that this only gives a random topology with labeled leaves. The only thing of interest is branch length (number of generations to MRCA) March 2006 Vineet Bafna 1 2 3 4 Coalescent theory (Kingman) • Input – • (Fixed population (N individuals), random mating) Consider 2 individuals. – Probability that they coalesce in the previous generation (have the same parent)= 1 N • Probability that they do not coalesce after t generations= 1 1 t e t N March 2006 N Vineet Bafna Coalescent theory • Consider k individuals. – Probability that no pair coalesces after 1 generation – Probability that no pair coalesces after t generations k t k t 2 2 k2 N 1 e e N March 2006 Vineet Bafna is time in units of N generations Coalescent approximation • Insight 3: – – Topology is independent of coalescent times If you have n individuals, generate a random binary topology • Iterate (until one individual) – • Pick a pair at random, and coalesce Insight 4: – To generate coalescent times, there is no need to go back generation by generation. Generate n random variables to get the n coalescence times. March 2006 Vineet Bafna Coalescent approximation • • At any step, there are 1 <= k <= n individuals To generate time to coalesce (k to k-1 individuals) – – Pick a number from exponential distribution with rate k(k-1)/2 Mean time to coalescence (in units of N generations) = 2/(k(k-1)) March 2006 Vineet Bafna Mean time to coalesce • If there are k individuals, the Probability for a coalescence in one generation is – • k(k-1)/2N Expected time to coalesce = 2N/k(k-1) k 2 p N E(k) p 2qp 3q 2 p Note that 1 + q q 2 q 3 Differentiating, March 2006 E(k) 1 p 11 q 1 + 2q 3q 2 4q 3 Vineet Bafna 1 p2 Typical coalescents • • 4 random examples with n=6 (Note that we do not need to specify N. Why?) Expected time to coalesce to 1 node? March 2006 Vineet Bafna Coalescent properties • Expected time for the last step =1 • • • The last step is half of the total time to coalesce Studying larger number of individuals does not change numbers tremendously EX: Number of mutations in a population is proportional to the total branch length of the tree – E(Ttot) March 2006 Vineet Bafna Variants (exponentially growing populations) • • If the population is growing exponentially, the branch lengths become similar, or even star-like. Why? With appropriate scaling of time, the same process can be extended to various scenarios: malefemale, hermaphrodite, segregation, migration, etc. March 2006 Vineet Bafna Simulating population data • • • Generate a coalescent (Topology + Branch lengths) For each branch length, drop mutations with rate Generate sequence data • • • Note that the resulting sequence is a perfect phylogeny. Given such sequence data, can you reconstruct the coalescent tree? (Only the topology, not the branch lengths) Also, note that all pairs of positions are correlated (should have high LD). March 2006 Vineet Bafna Coalescent with Recombination • An individual may have one parent, or 2 parents March 2006 Vineet Bafna ARG: Coalescent with recombination • • • • Given: mutation rate , recombination rate , population size 2N (diploid), sample size n. How can you generate the ARG (topology+branch lengths) efficiently? How will you generate sequences for n individuals? Given sequence data, can you reconstruct the ARG (topology) March 2006 Vineet Bafna Recombination • Define r as the probability of recombining per generation. – • Note that the parameter is a value which will be defined later Assume k individuals in a generation. The following might happen: 1. 2. 3. 4. An individual arises because of a recombination event between two individuals (It will have 2 parents). Two individuals coalesce. Neither (Each individual has a distinct parent). Multiple events (low probability). March 2006 Vineet Bafna Recombination • • • • • We ignore the case of multiple (> 1) events in one generation Pr (No recombination) = 1-kr k Pr (No coalescence) 2 1 2N Consider scaled time in units of 2N generations. Thus the number of individuals increase with rate kr2N, and decrease with rate k 2 The value 2rN is usually small, and therefore, the process will ultimately coalesce to a single individual (MRCA) March 2006 Vineet Bafna ARG • • • Let k = n, Define 4rN Iterate until k= 1 – – What is the flaw in this procedure? Choose time from an exponential distribution with rate k k 2 2 Pick event as recombination with probability – – (k 1) If event is recombination, choose an individual to recombine, and a position, else choose a pair to coalesce. Update k, and continue March 2006 Vineet Bafna Ancestral Recombination Graph March 2006 Vineet Bafna Simulating sequences on the ARG • • • Generate topology and branch lengths as before For each recombination, generate a position. Next generate mutations at random on branch lengths – For a mutation, select a position as well. March 2006 Vineet Bafna Review: Coalescent theory applications • Coalescent simulations allow us to test various hypothesis. The March 2006 Vineet Bafna coalescent/ARG is usually not inferred, unlike in phylogenies. Coalescent theory: example • Ex: ~1400bp at Sod locus in Dros. – – – 10 taxa 5 were identical. The other 5 had 55 mutations. Q: Is this a chance event, or is there selection for this haplotype. March 2006 Vineet Bafna Coalescent application – – – – – 10000 coalescent simulations were performed on 10 taxa. 55 mutations on the coalescent branches Count the number of times 5 lineages are identical The event happened in 1.1% of the cases. Conclusion: selection, or some other mechanism explains this data. March 2006 Vineet Bafna Coalescent example: Out of Africa hypothesis • • Looking at lineage specific mutations might help discard the candelabra model. How? How do we decide between the multi-regional and Out-of-Africa model? How do we decide if the ancestor was African? March 2006 Vineet Bafna Human Samples • • We look at data from human samples Gabriel et al. Science 2002. – 3 populations were sampled at multiple regions spanning the genome • • • • • • March 2006 54 regions (Average size 250Kb) SNP density 1 over 2Kb 90 Individuals from Nigeria (Yoruban) 93 Europeans 42 Asian 50 African American Vineet Bafna Population specific recombination • • D’ was used as the measure between SNP pairs. SNP pairs were classified in one of the following – – – • Strong LD Strong evidence for recombination Others (13% of cases) This roughly favors out-ofafrica. A Coalescent simulation can help give confidence values on this. March 2006 Gabriel et al., Science 2002 Vineet Bafna Recombination events and • • • Given , n, can you compute the expected number of recombination events? It can be shown that E(n, ) = log (n) Questions that people are interested in • • Given a set of sequences from a population, compute the recombination rate Given a population reconstruct the most likely history (as an ancestral recombination graph) March 2006 Vineet Bafna Re-constructing history without the coalescent March 2006 Vineet Bafna An algorithm for constructing a perfect phylogeny • • We will consider the case where 0 is the ancestral state, and 1 is the mutated state. This will be fixed later. In any tree, each node (except the root) has a single parent. – • • It is sufficient to construct a parent for every node. In each step, we add a column and refine some of the nodes containing multiple children. Stop if all columns have been considered. March 2006 Vineet Bafna Inclusion Property • • For any pair of columns i,j – i < j if and only if i1 j1 Note that if i<j then the edge containing i is an ancestor of the edge containing i i j March 2006 Vineet Bafna Example r 1 2 3 4 5 A 1 1 0 0 0 B 0 0 1 0 0 C 1 1 0 1 0 D 0 0 1 0 1 E 1 0 0 0 0 A B Initially, there is a single clade r, and each node has r as its parent March 2006 Vineet Bafna C D E Sort columns • • Sort columns according to the inclusion property (note that the columns are already sorted here). This can be achieved by considering the columns as binary representations of numbers (most significant bit in row 1) and sorting in decreasing order March 2006 Vineet Bafna A B C D E 1 1 0 1 0 1 2 1 0 1 0 0 3 0 1 0 1 0 4 0 0 1 0 0 5 0 0 0 1 0 Add first column • In adding column i – – Check each edge and decide which side you belong. Finally add a node if you can resolve a clade A B C D E 1 2 3 4 5 1 1 0 0 0 0 0 1 0 0 1 1 0 1 0 0 0 1 0 1 1 0 0 0 0 r u A March 2006 Vineet Bafna C E B D Adding other columns • Add other columns on edges using the ordering property A B C D E 1 1 0 1 0 1 2 1 0 1 0 0 3 0 1 0 1 0 4 0 0 1 0 0 r 1 E 3 2 B 5 4 D C March 2006 Vineet Bafna A 5 0 0 0 1 0 Unrooted case • • Switch the values in each column, so that 0 is the majority element. Apply the algorithm for the rooted case March 2006 Vineet Bafna March 2006 Vineet Bafna March 2006 Vineet Bafna