Approximate Bayesian Computation: Studying demographic parameters
Joao Lopes, Mark Beaumont
University of Reading
[email protected]

1. ABC algorithm

Assumptions:
- Discordance between gene and species trees is not expected
- Mutation rate is variable in space, but not in time

Features:
- Based on construction of gene trees using the coalescent model
- Easily applied to 4 or 5 populations/species
- Some tweaks are necessary for use with more populations

But most importantly:
- Handles large datasets (typically hundreds of samples per population/species)
- Complex population/species models can be used (e.g. presence of gene flow)
- Assumptions can be greatly relaxed (e.g. variable mutation rate over time)

1. ABC algorithm

Parameters: F = {Ne1, Ne2, NeA, m1, m2, t}

ABC algorithm:
1. Sample from the prior(s): Fi ~ p(F)
2. Simulate data, given Fi: Di ~ p(D | Fi)
3. Summarize Di with a set of summary statistics, obtaining Si; go to 1 until N points (S, F) have been created.
4. Accept the points whose S is within a distance d of s', the real data summarized by the same set of statistics.
5. Correct the accepted values of F according to their distance from the real data by performing a local linear regression.

[Figure: The population model. Labels: Popanc (NeA), Pop1 (Ne1, m1), Pop2 (Ne2, m2), t.]

2. Simulated data

DNA sequence data (1 locus)
- Pop1: 45 samples
- Pop2: 55 samples
- ABC: 200 data sets
- Comparison with MCMC: 10 data sets

Relative Mean Integrated Square Error (relMISE):

\mathrm{relMISE} = \frac{1}{n} \sum_{i=1}^{n} \frac{(f_i - f')^2}{f'^2}

where n is the number of accepted points, f_i is the value of a given parameter for the i-th point and f' is the true value of that parameter.

Summary statistics used:
1. mean of pairwise differences
   a) in each population
   b) both populations joined together
2. number of segregating sites
   a) in each population
   b) both populations joined together
3. number of haplotypes
   a) in each population
   b) both populations joined together

2. Simulated data: 'real' data and prior information

[Figure: 'real' data and prior information for Ne1, Ne2, NeA, m1, m2 and t. Legend: ABC, "real" data, MCMC, prior distribution.]

2. Simulated data

ABC (500 000 iter, tol=0.02, logit transf, sstats=9):

[Figure: Simulation 8. Panels: Ne1, Ne2, Neanc, Mig1, Mig2, Tev.]

average relMISE (10 data sets):

          Ne1            Ne2            NeA            m1                      m2                      t
ABC       0.05  0.011    0.22  0.035    0.27  0.034    23.00E-09   3.28E-09     8.74E-09   1.62E-09    0.24  0.020
MCMC      0.04  0.007    0.11  0.029    0.16  0.015     1.28E-09   0.25E-09     0.60E-09   0.18E-09    0.05  0.013
Priors    0.27  -        0.33  -        0.33  -        83.33E-09          -    83.33E-09          -    0.33  -

2. Simulated data: optimized ABC method

ABC* (2 500 000 iter, tol=0.004, log transf, sstats=9):

[Figure: Simulation 8. Panels: Ne1, Ne2, Neanc, Mig1, Mig2, Tev.]

average relMISE (10 data sets):

          Ne1            Ne2            NeA            m1                      m2                      t
ABC       0.05  0.011    0.22  0.035    0.27  0.034    23.00E-09   3.28E-09     8.74E-09   1.62E-09    0.24  0.020
ABC*      0.06  0.012    0.18  0.033    0.24  0.035    10.10E-09   2.11E-09     3.07E-09   0.92E-09    0.18  0.019
MCMC      0.04  0.007    0.11  0.029    0.16  0.015     1.28E-09   0.25E-09     0.60E-09   0.18E-09    0.05  0.013
Priors    0.27  -        0.33  -        0.33  -        83.33E-09          -    83.33E-09          -    0.33  -

2. Simulated data: adding summary stats

ABC** (2 500 000 iter, tol=0.004, log transf, sstats=21):

[Figure: Simulation 8. Panels: Ne1, Ne2, Neanc, Mig1, Mig2, Tev.]

average relMISE (10 data sets):

          Ne1            Ne2            NeA            m1                      m2                      t
ABC       0.05  0.011    0.22  0.035    0.27  0.034    23.00E-09   3.28E-09     8.74E-09   1.62E-09    0.24  0.020
ABC*      0.06  0.012    0.18  0.033    0.24  0.035    10.10E-09   2.11E-09     3.07E-09   0.92E-09    0.18  0.019
ABC**     0.05  0.003    0.11  0.005    0.23  0.006     6.21E-09   0.26E-09     1.87E-09   0.08E-09    0.15  0.005
MCMC      0.04  0.007    0.11  0.029    0.16  0.015     1.28E-09   0.25E-09     0.60E-09   0.18E-09    0.05  0.013
Priors    0.27  -        0.33  -        0.33  -        83.33E-09          -    83.33E-09          -    0.33  -

Model-choice: migration present/absent

ABC (1 000 000 iter, tol=0.004, log transf, sstats=21), 10 data sets:
- Population model 1 (M = M1), migration present: pM1 = 2%
- Population model 2 (M = M2), migration absent (Pop1 x Pop2): pM2 = 98%

[Figure: the two population models, each with Popanc splitting into Pop1 and Pop2.]

2. Simulated data: using model-choice step

ABC*** (2 500 000 iter, tol=0.004, log transf, sstats=21):

[Figure: Simulation 8. Panels: Ne1, Ne2, Neanc, Mig1, Mig2, Tev.]

average relMISE (10 data sets):

          Ne1            Ne2            NeA            m1                      m2                      t
ABC       0.05  0.011    0.22  0.035    0.27  0.034    23.00E-09   3.28E-09     8.74E-09   1.62E-09    0.24  0.020
ABC*      0.06  0.012    0.18  0.033    0.24  0.035    10.10E-09   2.11E-09     3.07E-09   0.92E-09    0.18  0.019
ABC**     0.05  0.003    0.11  0.005    0.23  0.006     6.21E-09   0.26E-09     1.87E-09   0.08E-09    0.15  0.005
ABC***    0.03  0.001    0.12  0.007    0.19  0.005            -          -            -          -    0.07  0.007
MCMC      0.04  0.007    0.11  0.029    0.16  0.015     1.28E-09   2.53E-10     6.03E-10   1.84E-10    0.05  0.013
Priors    0.27  -        0.33  -        0.33  -         8.33E-08          -     8.33E-08          -    0.33  -

2. Simulated data: 10 vs 200 datasets

ABC*** (2 500 000 iter, tol=0.004, log transf, sstats=21):

[Figure: Simulation 8. Panels: Ne1, Ne2, Neanc, Mig1, Mig2, Tev.]

average relMISE (10 data sets and 200 data sets):

                         Ne1            Ne2            NeA            m1                      m2                      t
ABC*** (10 data sets)    0.03  0.001    0.12  0.007    0.19  0.005            -          -            -          -    0.07  0.007
ABC*** (200 data sets)   0.04  0.002    0.09  0.005    0.19  0.005            -          -            -          -    0.06  0.003
Priors                   0.27  -        0.33  -        0.33  -         8.33E-08          -     8.33E-08          -    0.33  -

3. Comparison between ABC and MCMC methods

Conclusions:
- ABC is up to 2 orders of magnitude faster than the MCMC method for a single locus
- ABC modes are similar to those of MCMC (a full-likelihood method)
- ABC can easily incorporate more complex population models with relaxed assumptions
- A model-choice framework comes naturally from the ABC approach
- ABC easily handles multi-modal posterior distributions
- ABC does not have the problems associated with local maxima in likelihood distributions

ABC improves with:
- parameter transformation
- more iterations
- more summary statistics
- a model-choice framework

Take-home message:
Phylogenetic methods based on gene trees using the coalescent are being actively explored. These methods will be available in the near future.

Acknowledgements

I would like to acknowledge David Balding for providing frequent meetings on the subject, with special thanks to Mark Beaumont for advice and comments on the work. Support for this work was provided by EPSRC.
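Illustrative sketch (not the authors' code)

The five ABC steps in section 1 and the relMISE measure in section 2 can be sketched in a few lines of Python with NumPy. This is only a minimal illustration under stated assumptions: the coalescent simulator is swapped for a toy model (inferring the mean of a normal distribution), "tol" is taken to mean the fraction of simulated points that are accepted, the kernel is Epanechnikov, and the parameter transformations (logit/log) mentioned in the slides are left out. All function names are hypothetical.

```python
import numpy as np


def abc_with_regression(s_obs, sample_prior, simulate, summarize,
                        n_sims=20_000, tol=0.02, seed=None):
    """Rejection ABC followed by a local linear regression adjustment.

    Assumption: `tol` is the fraction of simulated points to accept.
    """
    rng = np.random.default_rng(seed)
    s_obs = np.asarray(s_obs, dtype=float)

    # Steps 1-3: sample F_i from the prior, simulate D_i, summarize to S_i.
    params = np.array([sample_prior(rng) for _ in range(n_sims)], dtype=float)
    stats = np.array([summarize(simulate(p, rng)) for p in params], dtype=float)

    # Put the summary statistics on a common scale before measuring distance.
    scale = stats.std(axis=0)
    scale[scale == 0] = 1.0
    diff = (stats - s_obs) / scale
    dist = np.sqrt((diff ** 2).sum(axis=1))

    # Step 4: accept the points whose S lies within distance d of s'.
    d = max(np.quantile(dist, tol), np.finfo(float).tiny)
    keep = dist <= d
    accepted, diff_acc = params[keep], diff[keep]

    # Step 5: weighted linear regression of F on (S - s'), then shift each
    # accepted F to the value it would take at S = s'.
    weights = 1.0 - (dist[keep] / d) ** 2  # Epanechnikov kernel (assumption)
    root_w = np.sqrt(weights)[:, None]
    design = np.column_stack([np.ones(len(accepted)), diff_acc])
    coef, *_ = np.linalg.lstsq(design * root_w, accepted * root_w, rcond=None)
    adjusted = accepted - diff_acc @ coef[1:]
    return accepted, adjusted, weights


def rel_mise(values, true_value):
    """relMISE = (1/n) * sum((f_i - f')^2) / f'^2 over the accepted points."""
    values = np.asarray(values, dtype=float)
    return np.mean((values - true_value) ** 2) / true_value ** 2


def run_toy_example(seed=0):
    """Toy stand-in for the population model: infer the mean of a normal."""
    rng = np.random.default_rng(seed)
    true_mu = 2.0
    observed = rng.normal(true_mu, 1.0, size=50)

    def summarize(data):
        return np.array([data.mean(), data.std()])

    accepted, adjusted, _ = abc_with_regression(
        s_obs=summarize(observed),
        sample_prior=lambda r: r.uniform(-5.0, 5.0, size=1),
        simulate=lambda p, r: r.normal(p[0], 1.0, size=50),
        summarize=summarize,
        seed=seed + 1,
    )
    return observed.mean(), accepted[:, 0], adjusted[:, 0], true_mu


if __name__ == "__main__":
    obs_mean, accepted, adjusted, true_mu = run_toy_example()
    print("accepted points:", len(accepted))
    print("relMISE before regression:", rel_mise(accepted, true_mu))
    print("relMISE after regression: ", rel_mise(adjusted, true_mu))
```

In this toy setting the regression step should pull the accepted values towards the observed summary statistics, which is the role step 5 plays in the algorithm above. Replacing the toy simulator with a coalescent simulator and the two toy statistics with the summary statistics listed in section 2 would be the natural next step, but that is outside what this sketch shows.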
[email protected] http://www.rdg.ac.uk/~sar05sal