Download Learning about modes of speciation by computational approaches

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

RNA-Seq wikipedia , lookup

Quantitative trait locus wikipedia , lookup

Human genetic variation wikipedia , lookup

Population genetics wikipedia , lookup

Gene expression programming wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Designer baby wikipedia , lookup

Helitron (biology) wikipedia , lookup

Microevolution wikipedia , lookup

Transcript
Learning about modes of speciation by computational
approaches: good and bad news
Celine Becquet and Molly Przeworski
Dept. of Human Genetics, University of Chicago, Chicago IL USA
Results
Background
Enduring debate in evolutionary biology centers around the question of whether the
early stages of speciation can occur in the presence of gene flow.
Allopatric speciation
 Early stage of divergence occurs in the
absence of gene flow, when an exogenous
barrier isolates the populations.
 Divergence occurs homogeneously across
the genome, as the populations accumulate
differences through either genetic drift or
local adaptation.
 Reproductive isolation (RI) is a by-product
of divergence rather than its cause.
 Many cases documented.
 This model is sometimes approximated by
the isolation model:
Parapatric & sympatric speciation
 Species start diverging while hybridizing
and exchanging migrants (i.e., they occupy
contiguous areas, or the same area)
 Divergence occurs in a number of stages:
1. Natural selection plays a major role
• Leads to fixation of alleles that contribute to
differential adaptation.
• These RI loci reduce the hybrid fitness, preventing
gene flow in linked regions; while other unlinked regions
are free to introgress.
• Hence, these models lead to far greater
heterogeneity than expected under allopatriy.
2. Over time, more RI genes accumulate.
3. Speciation is complete when gene flow is
prevented throughout the genome.
Allopatric divergence from a structured ancestral population
Applications of MIMAR and IM to real data often yield an estimate of the ancestral
effective population size larger than either of the descendant populations. Could such
results reflect geographical structure in the ancestral population? We investigated this
possibility by estimating the parameters of the I-M model from data simulated under
models of isolation from a structured ancestral population (model c). We find that:
 IM tends to provide large estimates of NA when there is structure in the ancestral population.
 Nonetheless, the results of IM would usually be interpreted correctly as allopatric speciation (i.e.
the estimates of gene flow are not biased).
 MIMAR does not seem to provide biased estimates of NA.
 However, the results would often be interpreted incorrectly as rejecting a model of allopatric
speciation (i.e., the estimates of gene flow tend to be biased upwards).
MIMAR
IM
Estimates of NA divided by
the true value for data
simulated with models a (see
Methods).
 These modes are sometimes approximated
by the isolation-migration model (I-M):
Isolation without
structure
Allopatric divergence followed by secondary contact
T generations ago, a panmictic ancestral population of constant
effective population size NA suddenly split into two panmictic
populations of constant effective population sizes N1 and N2,
Two populations split T generations ago as for the isolation
respectively, which subsequently diverged neutrally in total
reproductive isolation (i.e., the number of migrants exchanged model, but then diverged in the presence of constant gene
flow M.
between the populations each generation, M, is 0).
Modeling as well as from empirical studies suggest that the early stages of speciation
may often occur in presence of some gene flow. However, the parapatric model of
speciation remains controversial, in part because of the difficulty of distinguishing
parapatry from allopatry followed by secondary contact.
Recently, computational approaches have been developed to test the predictions of
simple models of divergence (usually an I-M model). Application of these to a variety of
species has yielded extensive evidence for migration, and the results have been
interpreted as supporting the widespread occurrence of parapatric speciation. However,
the methods rely on numerous simplifying assumptions, which may make them unreliable.
We conducted a simulation study to assess the reliability of such inferences using a
program that we recently developed (MIMAR) as well as the program IM of Hey and
Nielsen (2004) and more generally, investigated how violations of model assumptions
affect parameter estimates of the I-M model.
Methods
Simulated data. We simulated data sets of 20 independently-evolving 1kb nonrecombining loci under I-M models but with assumptions of the model violated. The
parameter values were such that T=3.2N1. We generated 10 data sets for each of the
combination of parameters: T1=0.25, 0.50 and 0.75xT (where T=2x106 generations) and
M=1 or 10 for the following models:
a) Model of isolation from a
structured ancestral population
b) Model of isolation followed by
secondary contact
Neutral cartoons of
allopatric speciation
c) Model of isolation with
migration at an early stage.
Neutral cartoons of
parapatric speciation
Structured
ancestral
populations
≠ from the
populations in
isolation
Recent papers have argued that applications of IM lends support to parapatric
speciation (Hey, 2006; Niemiller et al., 2008). We investigated the effect of isolation
followed by secondary contact on estimates provided by IM and MIMAR (model d). We find
that:
 The estimates are less precise then when the data follow the simple isolation model.
 Because the methods consider a constant gene flow rate since the split, recent gene flow may be
incorrectly interpreted as reflecting parapatry (i.e., gene flow at the early stage).
MIMAR
IM
Isolation without
secondary contact
Estimates of T divided by
the true value for data
simulated with models b (see
Methods). IM tends to
overestimate the time of
divergence, while MIMAR
tends to underestimate this
parameter.
Parapatry with gene flow at an early stage
We investigated the effect of a change in gene flow rates over time on the parameter
estimates (model e).
 Estimates of NA are qualitatively similar than for model a; thus, large estimates of NA provided by
IM on real data may also reflect non constant gene flow since the split.
 Moreover, IM tends not to detect gene flow, which could be interpreted incorrectly as consistent
with allopatric speciation.
 Both methods tend to underestimate T in this case.
Detecting loci linked to a RI factor or a target of local selection
To date, only a few genes involved in RI between species have been characterized,
through time-consuming molecular approaches. Could computational approaches such as
MIMAR be used to help identify candidate loci for further investigation? To assess this,
we simulated two series of data sets in which one locus is assumed to be linked to a RI
locus, using models i and ii, and analyzed them using MIMAR.
 MIMAR does not seem greatly affected by the inclusion of a locus with a different history.
A simple locus-specific goodness-of-fit tests may help to detect the locus of interest:
 We obtained the posterior predictive distribution of statistics for each single locus in the data set
given the model estimated by MIMAR (see Table below).
 We ranked the loci by the p-values (in increasing order for all but Sf and FST) and considered how
often the locus of interest is an outlier (i.e., has high rank).
Data with a modeled RI locus.
The best statistics to identify
outlier loci modeled with model
i or ii are highlighted.
We assumed that a sample of 20 independently-evolving loci contains one locus closely
linked to a RI factor. Specifically, polymorphism data at the loci were simulated under a
model of I-M with one outlier that has experienced either no gene flow since the split
(model i), or positive directional selection in population one recently (model ii).
Estimation approaches. We analyzed the simulated data with MIMAR and IM.
 Both programs use a Markov Chain Monte Carlo approach and rely on multi-locus
polymorphism data from two species to estimate the parameters of the I-M model.
 IM relies on the full polymorphism data, but assumes no intra-locus recombination, a
limitation that can lead to biased estimates.
 MIMAR estimates parameters of the I-M model from recombining loci. In contrast to
IM, it relies on summaries of the polymorphism data at each of multiple independentlyevolving loci – specifically, four summary statistics known to be sensitive to the
parameters of the I-M model:
•
•
•
the number of derived polymorphisms unique to the samples from populations 1 and 2 (S1 and S2)
the number of shared derived alleles between the two samples (Ss)
the number of fixed derived alleles in either sample (Sf ) assuming a known ancestral state.
 In simulations, MIMAR performs comparably to IM for medium size data sets,
providing unbiased estimates of the parameters and reasonable power to detect gene
flow (Becquet and Preeworski 2007).
Conclusion
We find that when one of many assumptions of the I-M model are violated, the methods
tend to yield upwardly biased estimates of gene flow, potentially lending spurious support
for parapatric speciation. To some extent, this problem can be avoided by carefully
testing the fit of estimated models to the data. When a parapatric model is appropriate,
we propose a test that can help to pinpoint candidate loci involved in RI.
References
Becquet and Przeworsli 2007 Genome Res. 17:1505-1519.
Hey and Nielsen 2004 Genetics 167:747-760.
Hey 2006 Curr. Opin. Genet. Dev. 16:592-596.
Niemiller et al. 2008 Molecular Ecology 17:2258-2275.