* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Learning about modes of speciation by computational approaches
Survey
Document related concepts
Transcript
Learning about modes of speciation by computational approaches: good and bad news Celine Becquet and Molly Przeworski Dept. of Human Genetics, University of Chicago, Chicago IL USA Results Background Enduring debate in evolutionary biology centers around the question of whether the early stages of speciation can occur in the presence of gene flow. Allopatric speciation Early stage of divergence occurs in the absence of gene flow, when an exogenous barrier isolates the populations. Divergence occurs homogeneously across the genome, as the populations accumulate differences through either genetic drift or local adaptation. Reproductive isolation (RI) is a by-product of divergence rather than its cause. Many cases documented. This model is sometimes approximated by the isolation model: Parapatric & sympatric speciation Species start diverging while hybridizing and exchanging migrants (i.e., they occupy contiguous areas, or the same area) Divergence occurs in a number of stages: 1. Natural selection plays a major role • Leads to fixation of alleles that contribute to differential adaptation. • These RI loci reduce the hybrid fitness, preventing gene flow in linked regions; while other unlinked regions are free to introgress. • Hence, these models lead to far greater heterogeneity than expected under allopatriy. 2. Over time, more RI genes accumulate. 3. Speciation is complete when gene flow is prevented throughout the genome. Allopatric divergence from a structured ancestral population Applications of MIMAR and IM to real data often yield an estimate of the ancestral effective population size larger than either of the descendant populations. Could such results reflect geographical structure in the ancestral population? We investigated this possibility by estimating the parameters of the I-M model from data simulated under models of isolation from a structured ancestral population (model c). We find that: IM tends to provide large estimates of NA when there is structure in the ancestral population. Nonetheless, the results of IM would usually be interpreted correctly as allopatric speciation (i.e. the estimates of gene flow are not biased). MIMAR does not seem to provide biased estimates of NA. However, the results would often be interpreted incorrectly as rejecting a model of allopatric speciation (i.e., the estimates of gene flow tend to be biased upwards). MIMAR IM Estimates of NA divided by the true value for data simulated with models a (see Methods). These modes are sometimes approximated by the isolation-migration model (I-M): Isolation without structure Allopatric divergence followed by secondary contact T generations ago, a panmictic ancestral population of constant effective population size NA suddenly split into two panmictic populations of constant effective population sizes N1 and N2, Two populations split T generations ago as for the isolation respectively, which subsequently diverged neutrally in total reproductive isolation (i.e., the number of migrants exchanged model, but then diverged in the presence of constant gene flow M. between the populations each generation, M, is 0). Modeling as well as from empirical studies suggest that the early stages of speciation may often occur in presence of some gene flow. However, the parapatric model of speciation remains controversial, in part because of the difficulty of distinguishing parapatry from allopatry followed by secondary contact. Recently, computational approaches have been developed to test the predictions of simple models of divergence (usually an I-M model). Application of these to a variety of species has yielded extensive evidence for migration, and the results have been interpreted as supporting the widespread occurrence of parapatric speciation. However, the methods rely on numerous simplifying assumptions, which may make them unreliable. We conducted a simulation study to assess the reliability of such inferences using a program that we recently developed (MIMAR) as well as the program IM of Hey and Nielsen (2004) and more generally, investigated how violations of model assumptions affect parameter estimates of the I-M model. Methods Simulated data. We simulated data sets of 20 independently-evolving 1kb nonrecombining loci under I-M models but with assumptions of the model violated. The parameter values were such that T=3.2N1. We generated 10 data sets for each of the combination of parameters: T1=0.25, 0.50 and 0.75xT (where T=2x106 generations) and M=1 or 10 for the following models: a) Model of isolation from a structured ancestral population b) Model of isolation followed by secondary contact Neutral cartoons of allopatric speciation c) Model of isolation with migration at an early stage. Neutral cartoons of parapatric speciation Structured ancestral populations ≠ from the populations in isolation Recent papers have argued that applications of IM lends support to parapatric speciation (Hey, 2006; Niemiller et al., 2008). We investigated the effect of isolation followed by secondary contact on estimates provided by IM and MIMAR (model d). We find that: The estimates are less precise then when the data follow the simple isolation model. Because the methods consider a constant gene flow rate since the split, recent gene flow may be incorrectly interpreted as reflecting parapatry (i.e., gene flow at the early stage). MIMAR IM Isolation without secondary contact Estimates of T divided by the true value for data simulated with models b (see Methods). IM tends to overestimate the time of divergence, while MIMAR tends to underestimate this parameter. Parapatry with gene flow at an early stage We investigated the effect of a change in gene flow rates over time on the parameter estimates (model e). Estimates of NA are qualitatively similar than for model a; thus, large estimates of NA provided by IM on real data may also reflect non constant gene flow since the split. Moreover, IM tends not to detect gene flow, which could be interpreted incorrectly as consistent with allopatric speciation. Both methods tend to underestimate T in this case. Detecting loci linked to a RI factor or a target of local selection To date, only a few genes involved in RI between species have been characterized, through time-consuming molecular approaches. Could computational approaches such as MIMAR be used to help identify candidate loci for further investigation? To assess this, we simulated two series of data sets in which one locus is assumed to be linked to a RI locus, using models i and ii, and analyzed them using MIMAR. MIMAR does not seem greatly affected by the inclusion of a locus with a different history. A simple locus-specific goodness-of-fit tests may help to detect the locus of interest: We obtained the posterior predictive distribution of statistics for each single locus in the data set given the model estimated by MIMAR (see Table below). We ranked the loci by the p-values (in increasing order for all but Sf and FST) and considered how often the locus of interest is an outlier (i.e., has high rank). Data with a modeled RI locus. The best statistics to identify outlier loci modeled with model i or ii are highlighted. We assumed that a sample of 20 independently-evolving loci contains one locus closely linked to a RI factor. Specifically, polymorphism data at the loci were simulated under a model of I-M with one outlier that has experienced either no gene flow since the split (model i), or positive directional selection in population one recently (model ii). Estimation approaches. We analyzed the simulated data with MIMAR and IM. Both programs use a Markov Chain Monte Carlo approach and rely on multi-locus polymorphism data from two species to estimate the parameters of the I-M model. IM relies on the full polymorphism data, but assumes no intra-locus recombination, a limitation that can lead to biased estimates. MIMAR estimates parameters of the I-M model from recombining loci. In contrast to IM, it relies on summaries of the polymorphism data at each of multiple independentlyevolving loci – specifically, four summary statistics known to be sensitive to the parameters of the I-M model: • • • the number of derived polymorphisms unique to the samples from populations 1 and 2 (S1 and S2) the number of shared derived alleles between the two samples (Ss) the number of fixed derived alleles in either sample (Sf ) assuming a known ancestral state. In simulations, MIMAR performs comparably to IM for medium size data sets, providing unbiased estimates of the parameters and reasonable power to detect gene flow (Becquet and Preeworski 2007). Conclusion We find that when one of many assumptions of the I-M model are violated, the methods tend to yield upwardly biased estimates of gene flow, potentially lending spurious support for parapatric speciation. To some extent, this problem can be avoided by carefully testing the fit of estimated models to the data. When a parapatric model is appropriate, we propose a test that can help to pinpoint candidate loci involved in RI. References Becquet and Przeworsli 2007 Genome Res. 17:1505-1519. Hey and Nielsen 2004 Genetics 167:747-760. Hey 2006 Curr. Opin. Genet. Dev. 16:592-596. Niemiller et al. 2008 Molecular Ecology 17:2258-2275.