The Coalescent Theory
By Mireya Diaz
Department of Epidemiology and Biostatistics
for EECS 458

Agenda
• Basic concepts of population genetics
• The coalescent theory
• Coalescent process of two sequences
• Coalescent time
• Statistical inference
• Applications: reconstruction of human evolutionary history
• Future venues

Basic Concepts in Population Genetics
• Mutation
• Random genetic drift
• Selection
[Figure labels: f1, f2, fk]

Basic Concepts in Population Genetics
• Mutation: limited role in evolution because its effect is slow; however, it contributes to the maintenance of alleles in the population
  Locus with 2 alleles: A1 (frequency p(n)) and A2 (frequency q(n) = 1 - p(n))
  Non-overlapping generations
  A1 -> A2 at rate u and A2 -> A1 at rate v (u, v ~ 10^-5, 10^-6)
  An allele can mutate at most once per generation
  $$p(n+1) = (1-u)\,p(n) + v\,(1 - p(n))$$
  If the initial gene frequency of A1 is p(0):
  $$p(n) = \frac{v}{u+v} + \left(p(0) - \frac{v}{u+v}\right)(1-u-v)^n$$
  As n -> ∞, "equilibrium":
  $$\hat{p} = \frac{v}{u+v}, \qquad \hat{q} = \frac{u}{u+v}$$

Basic Concepts in Population Genetics
• Random genetic drift: change in gene frequency due to random sampling of gametes from a finite population. Important for small populations
  Each generation, 2N gametes are sampled at random from the parent generation
  y(n): number of gametes of type A1, in the absence of mutation and selection
  Wright-Fisher model:
  $$P(y(n+1) = j \mid y(n) = i) = \binom{2N}{j}\, p^j (1-p)^{2N-j}, \qquad p = \frac{i}{2N}$$
• One allele will be lost
  $$P(\text{fixation of } A_1) = f(A_1 \mid t = 0)$$

Basic Concepts in Population Genetics
• Selection: can act at different stages of the life of an organism (e.g.
differential fecundity, viability)
  Locus with 2 alleles A1, A2
  Three genotypes: A1A1 (w11), A1A2 (w12), A2A2 (w22), with fitness wij = relative survival chance of zygotes of genotype AiAj
  Under Hardy-Weinberg equilibrium:
  $$p(n+1) = \frac{p(n)\,[\,p(n)\,w_{11} + q(n)\,w_{12}\,]}{\bar{w}}, \qquad q(n+1) = \frac{q(n)\,[\,p(n)\,w_{12} + q(n)\,w_{22}\,]}{\bar{w}}$$
  $$\bar{w} = p^2(n)\,w_{11} + 2\,p(n)\,q(n)\,w_{12} + q^2(n)\,w_{22}, \qquad p(n+1) + q(n+1) = 1$$
  If
  w11 > w12 > w22   ->  A1 becomes fixed
  w11 < w12 < w22   ->  A2 becomes fixed
  w11, w22 < w12    ->  overdominance, stable polymorphism
  w12 < w11, w22    ->  underdominance, unstable polymorphism, A1 or A2 becomes fixed
[Figure label: f(0)]

The Coalescent Theory
• Stochastic process: continuous-time Markov process
• Large-population approximation of the Wright-Fisher model and other neutral models
• Probability model for the genealogical tree of a random sample of n genes from a large population
• Most significant progress in theoretical population genetics (past 2 decades). Cornerstone for rigorous statistical analysis of molecular data from populations
• Need: inferring the past from samples taken from the present population
• Seminal work: Kingman, J Appl Prob 19A:27, 1982

The Coalescent Theory – Key Idea
• Start with a sample and trace backwards in time to identify EVENTS in the past since the Most Recent Common Ancestor (MRCA) of the sample
• Consider a sample of n sequences of a DNA region from a population
• Assume no recombination between sequences
• The n sequences are connected by a single phylogenetic tree (genealogy) whose root is the MRCA
[Figure: genealogy with labels MRCA, Diverge, Coalesce]

The Coalescent Theory: Usefulness
• Sample-based theory
• By-product: development of highly efficient algorithms for simulating samples under various population genetics models
• Particularly suitable for molecular data
• Estimate parameters of evolutionary models (vs.
history of specific locus – phylogenetics)

The Coalescent Process of Two Sequences
• Consider diploid organisms
• Wright-Fisher model:
  – The sequences in a population at a given generation = random sample, with replacement, from those in the previous generation
  – Mutations at the locus of interest: selectively neutral (do not affect reproductive success; all individuals equally likely to reproduce; all lineages equally likely to coalesce)
• P(coalescence at previous generation) = ?
  P = 1/2N, N = effective population size
  $$P(\text{coalescence } t+1 \text{ generations ago}) = \left(1 - \frac{1}{2N}\right)^{t} \frac{1}{2N}$$
• For haploid structures, use N rather than 2N

The Coalescent Tree
[Figure: coalescent tree rooted at the MRCA, with intervals T2, T3, T4, T5]
Genealogical relationship of a sample of genes
• Topology is independent of branch lengths
• Branch lengths are independent, exponential random variables (waiting times between coalescent events)
• Topology is generated by randomly picking lineages to coalesce -> "all topologies are equally likely"

The Coalescent Time
• Assume: number of mutations in a given period ~ Poisson
  Mean time to the common ancestor of two sequences: 2N generations
  Mean number of mutations between two sequences = 4Nm (m: mutation rate per sequence per generation)
• Underlying assumption: random mating (~ organisms with high mobility)
• Coalescent time: time between two successive coalescent events
• Exponential variable, mean = 2/(k(k-1))
  k: number of ancestral sequences between the two events

Coalescent Tree Parameters
  P(2 lineages pick the same parent and coalesce) = 1/N
  P(2 lineages remain distinct) = 1 - 1/N
Expected time to MRCA (height of the tree):
  $$E\left[\sum_{k=2}^{n} T(k)\right] = \sum_{k=2}^{n} E[T(k)] = \sum_{k=2}^{n} \frac{2}{k(k-1)} = 2\left(1 - \frac{1}{n}\right)$$
Expected total branch length of the tree:
  $$E[T_{tot}(n)] = E\left[\sum_{k=2}^{n} k\,T(k)\right] = \sum_{k=1}^{n-1} \frac{2}{k} \sim 2(\gamma + \log n)$$
  (γ: Euler's constant)

The Coalescent Theory & Statistical Inference
• Mutation rate
• Age of MRCA
• Recombination rate
• Ancestral population size
• Migration rate

Reconstruction of Human Evolutionary History
• Goal: estimate times of evolutionary events (major migrations), demographic history (population bottlenecks, expansions)
• Haploid sequences: mtDNA, Y chromosome
• Case study:
recent common ancestry of human Y chromosome
• Source: Thomson et al. PNAS 2000; 97:7360-5
• Estimations: expected time to MRCA and ages of certain mutations
• Data: 53-70 chromosomes, sequence variation at three genes (SMCY, DBY, DFFRY) on the Y chromosome

Recent common ancestry of Y chromosome
• For ages of major events: need a mutation rate estimate (single-nucleotide (SN) substitutions)
• Substitutions between chimpanzee and human sequences
• Mutation rate per site per year = No. subst. / (2 * Tsplit * L)
• Tsplit: time since chimp and human split (~5M years ago); L: sequence length
• Assumptions: selective neutrality of all changes on Y since divergence

Summary of gene characteristics from sample

Gene    Seq length   Sample size   No. polym.   No. substitutions   Mutation rate
SMCY    39,931       53            47 (41)      528                 1.32 x 10^-9
DBY     8,547        70            14 (12)      107                 1.25 x 10^-9
DFFRY   15,642       70            17 (15)      159                 1.02 x 10^-9
All     64,120       43            65 (56)      794                 1.24 x 10^-9

Source: Table 1 from article. (#): no. polymorphisms after removal of length variants, repeat sequences, indels

GENETREE Analysis
• Software: www.stats.ox.ac.uk/~stephens/group/software.html
• Estimate mean number of mutations: θ = 2 Ne m
  Ne: effective number of Y chromosomes in the population
  m: mutation rate per gene per generation
• Also: expected ages of mutations, time since MRCA
• Assumptions: coalescent process, infinitely-many-sites mutation (mutation rate low enough -> each mutation occurs at a new site)
• Four insertions, three deletions, two repeat mutations (different rates from SN substitutions)
• Only one segregating site in SMCY appeared to have mutated more than once -> data fit the infinitely-many-sites model

Recent common ancestry of Y chromosome

MRCA distribution under constant population

Gene    TMRCA (1)   95% CI            TMRCA (2)   95% CI
SMCY    0.56        (0.40, 0.82)      85,000      (61,000, 125,000)
DBY     0.83        (0.60, 1.10)      154,000     (112,000, 206,000)
DFFRY   0.96        (0.55, 1.21)      120,000     (69,000, 152,000)
All     0.55        (0.36, 0.98)      84,000      (55,000, 149,000)

MRCA distribution under exponential population growth

Gene    TMRCA (1)   95% CI              TMRCA (2)   95% CI
SMCY    0.0731      (0.0618, 0.1030)    48,000      (41,000, 68,000)
DBY     0.0538      (0.0382, 0.0975)    55,000      (39,000, 100,000)
DFFRY   0.0582      (0.0440, 0.0720)    53,000      (40,000, 65,000)
All     0.0853      (0.0580, 0.2070)    59,000      (40,000, 140,000)

(1) Expected age in Ne generations. (2) Value in years = Ne*25

GENETREE Analysis
[Figure: gene tree with numbered mutations and sample counts by region: Africa, Asia, Oceania]
Expected ages of mutations in tree:
  Mutation 1: 47,000 years (35,000; 89,000) – male movement out of Africa
  Mutation 2: 40,000 years (31,000; 79,000) – beginning of global expansion

Future Venues
• Population genetics models: incorporation of migration, population growth, recombination, natural selection
• Longitudinal analysis
• Evolutionary analysis of quantitative trait loci (QTL)
• Properties of CT:
  – Accuracy of the coalescent approximation under combinations of population size, sample size, mutation rate
  – Properties of estimators under MCMC

References
• Handbook of Statistical Genetics, 2nd edition, Vol. 2
• Nature 2002; 3:380-390
• Theoretical Population Biology 1999; 56:1-10
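
Supplementary sketch: the expected tree height and total branch length given on the Coalescent Tree Parameters slide can be checked numerically. The Python below is a minimal illustration, not part of the original slides. It assumes time is scaled so that each pair of lineages coalesces at rate 1, which makes the waiting time T(k) exponential with mean 2/(k(k-1)) as stated under The Coalescent Time. All function names (expected_tmrca, expected_total_length, simulate_tree, simulate_means) are hypothetical.

```python
import random


def expected_tmrca(n):
    """Analytic E[T_MRCA] = sum_{k=2}^{n} 2/(k(k-1)), which equals 2(1 - 1/n)."""
    return sum(2.0 / (k * (k - 1)) for k in range(2, n + 1))


def expected_total_length(n):
    """Analytic E[T_tot] = sum_{k=1}^{n-1} 2/k."""
    return sum(2.0 / k for k in range(1, n))


def simulate_tree(n, rng):
    """Draw the coalescent times of one tree for a sample of n sequences.

    Assumption: scaled time, so T(k) ~ Exponential(rate = k(k-1)/2).
    Returns (tree height, total branch length).
    """
    height = 0.0
    total = 0.0
    for k in range(n, 1, -1):           # k lineages remain until the next event
        rate = k * (k - 1) / 2.0
        t_k = rng.expovariate(rate)     # mean 1/rate = 2/(k(k-1))
        height += t_k
        total += k * t_k                # k branches each grow by t_k
    return height, total


def simulate_means(n, replicates=20000, seed=1):
    """Average height and total length over many independent trees."""
    rng = random.Random(seed)
    height_sum = 0.0
    total_sum = 0.0
    for _ in range(replicates):
        height, total = simulate_tree(n, rng)
        height_sum += height
        total_sum += total
    return height_sum / replicates, total_sum / replicates


if __name__ == "__main__":
    n = 10
    mean_height, mean_total = simulate_means(n)
    print("height: simulated", round(mean_height, 3), "analytic", round(expected_tmrca(n), 3))
    print("length: simulated", round(mean_total, 3), "analytic", round(expected_total_length(n), 3))
```

Only the waiting times are simulated here, not the topology, since the slides note that topology is independent of branch lengths. Under the scaling assumed above, multiplying by the number of generations per time unit would convert the results to generations.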