Reference: Biol. Bull. 227: 133–145. (October 2014)
© 2014 Marine Biological Laboratory
Models of Selection, Isolation, and Gene Flow
in Speciation
MICHAEL W. HART*
Department of Biological Sciences, Simon Fraser University, Burnaby, British Columbia,
V5A 1S6, Canada
Abstract. Many marine ecologists aspire to use genetic
data to understand how selection and demographic history
shape the evolution of diverging populations as they become reproductively isolated species. I propose combining
two types of genetic analysis focused on this key early stage
of the speciation process to identify the selective agents
directly responsible for population divergence. Isolation-with-migration (IM) models can be used to characterize
reproductive isolation between populations (low gene flow),
while codon models can be used to characterize selection
for population differences at the molecular level (especially
positive selection for high rates of amino acid substitution).
Accessible transcriptome sequencing methods can generate
the large quantities of data needed for both types of analysis.
I highlight recent examples (including our work on fertilization genes in sea stars) in which this confluence of
interest, models, and data has led to taxonomically broad
advances in understanding marine speciation at the molecular level. I also highlight new models that incorporate both
demography and selection: simulations based on these theoretical advances suggest that polymorphisms shared
among individuals (a key source of information in IM
models) may lead to false-positive evidence of selection (in
codon models), especially during the early stages of population divergence and speciation that are most in need of
study. The false-positive problem may be resolved through
a combination of model improvements plus experiments
that document the phenotypic and fitness effects of specific
polymorphisms for which codon models and IM models
indicate selection and reproductive isolation (such as genes
that mediate sperm-egg compatibility at fertilization).
Introduction
An important problem in biodiversity research is to understand the ecological and evolutionary origins of reproductive isolation and the formation of biological species
(Coyne and Orr, 2004; Hey et al., 2005). Although diversity
evolves at both higher and lower levels in the hierarchy of
life, the adaptive divergence of reproductively isolated populations to form new species is fundamentally important
because this process can set groups of organisms onto
independent evolutionary trajectories from which they can
no longer influence each other directly through mating and
recombination. Key elements of that research program (Butlin et al., 2012) include analyses of (1) the targets of
selection acting on diverging lineages of organisms as they
adapt to different environments or habitats, and (2) the
barriers to gene flow between lineages that would otherwise
oppose their divergence under selection for phenotypic differences. Understanding the targets of selection at the molecular level is critical for linking genotypic divergence to
phenotypic variation (Yang and Bielawski, 2000; Nielsen,
2005); quantifying gene flow between diverging lineages is
critical for documenting the history of reproductive isolation between lineages and their progress on the continuum
from conspecific populations to biological species (Hey and
Nielsen, 2004; Rundle and Nosil, 2005; Seehausen et al.,
2014).
Received 29 March 2014; accepted 21 June 2014.
* To whom correspondence should be addressed. E-mail: mwhart@sfu.ca
Abbreviations: IM, isolation-with-migration; RNA-Seq, RNA sequencing (whole transcriptome shotgun sequencing); SNP, single nucleotide polymorphism.
This content downloaded from 088.099.165.207 on May 05, 2017 11:54:40 AM
All use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).
Studies of adaptation and gene flow in the ocean have
been especially useful and interesting because many marine
species have broad geographic ranges (e.g., spanning ocean
basins), and ocean currents seem likely to promote gene
flow among populations (especially for organisms that
spend weeks or months developing as larvae in the marine
plankton), such that barriers to marine gene flow are often
not readily apparent (e.g., Mayr, 1954; Lessios et al., 2001;
Palumbi and Lessios, 2005). Moreover, the relatively simple mating systems of many marine plants and animals with
broadcast spawning of planktonic gametes include a small
number of ligand and receptor molecules expressed in
sperm and eggs that regulate interaction and binding between gametes and determine the specificity of mating
between males and females (Hirohashi et al., 2008). These
gamete recognition molecules have been a particular focus
of both functional and genetic analyses because they represent a relatively simple basis for the evolution of reproductive isolation and speciation in marine organisms that lack
complex behavioral interactions between male and female
individuals (compared to organisms like terrestrial vertebrates and arthropods), and genetic analyses suggest that
such genes are frequent targets of selection during speciation (Palumbi, 2009; Lessios, 2011). In these ways, marine
speciation has been seen as a challenging problem with
some potentially accessible solutions, one in which both the
demography of speciation and its adaptive basis could be
understood at the molecular level, and evolutionary ecologists have long sought to understand its origins and mechanisms (Evans and Sherman, 2013).
The roles of selection and gene flow during speciation
(and their inference from genetic data) have been recently
and extensively reviewed (Lawton-Rauh, 2008; Nosil et al.,
2009; Bird et al., 2012; Crisci et al., 2012; Feder et al.,
2012; Nosil and Feder, 2012, 2013; Faria et al., 2014). In
this essay I focus on two specific methods, one for modeling
selection at the level of codons in protein-coding sequences,
and one for modeling gene flow within the context of the
demographic history of reproductive isolation at the level of
nucleotide sequences. The two types of analysis share in
common a key modeling feature (the use of insights from
gene tree structure and variation) and two key data requirements (many loci, few samples of individual organisms).
The similarities in their data requirements and the complementary nature of the insights from these two types of
analysis together form an argument for combining these two
methods in analyses of data from next-generation sequencing studies. The essay and its main argument are motivated
by the interests and efforts of organismal biologists and
empiricists (like myself) to use quantitative models and
software from population-genetics theory to study the formation of new species, and by the increasing accessibility of
genome-scale sequence data for identifying the targets of
selection at the molecular level leading to adaptation and
speciation in non-model organisms.
The following section (Background) introduces those
types of analysis for readers who are not already familiar
with population-genetic methods based on large samples of
gene trees. More experienced readers could go directly to
the subsequent section (Recent Progress) that highlights
examples of selection and gene flow analyses in speciation
using transcriptome data from non-model organisms, including a detailed example from our work on gamete recognition in sea stars. The few available examples suggest
that combined analyses of adaptation and demographic history from transcriptome sequencing may be a surprisingly
accessible source of insight into the causes of population
divergence and speciation. However, the empirical examples and new theoretical model developments also point
toward some potentially significant limitations of this combined approach caused by false positives in analyses of
selection on coding sequences. In the closing section (Future Prospects) I review this false-positive problem and
some ways in which it might be resolved in new studies of
selection and gene flow during speciation.
Background: Selection and Gene Flow in Speciation
Coalescent models of genetic diversity and disparity
New sequencing technologies have greatly increased our
ability to characterize genomic variation underlying the
evolution of reproductive isolation among natural populations of virtually any marine organism (Claw and Swanson,
2012; Ellegren, 2014). Elements of that variation include
both diversity of DNA sequences and disparity among
them. The mutational processes that generate new sequence
diversity include point mutation of single nucleotides, duplication, deletion, and gene conversion. These processes
operate on relatively short time scales within lineages of
cells and organisms, and give rise to new allelic variants that
differ from an original or ancestral allele by as little as one
nucleotide change. In contrast, the processes that generate
sequence disparity (greater than that created by single mutations) operate on longer time scales at the level of populations and species, and include genetic drift, selection, and
gene flow. When their effects are integrated over time, these
two sets of processes produce a pattern of ancestor-descendant relationships that can be represented by gene trees or
genealogies, and the information content of such gene trees
has many practical applications (e.g., in molecular taxonomy; Puillandre et al., 2012).
The best methods for inferring the effects of selection and
gene flow on new species formation, and distinguishing
those effects from other processes (such as mutation), are
based on the concept of the coalescent: the pattern of
merging or coalescence of alleles backward in time from the
present to a common ancestral allele at some time in the past
(Kingman, 1982; Hudson, 1990). These methods characterize gene genealogies, fit parameter values (including mutation, population size, gene flow, and selection) for a population model, and use likelihood or Bayesian methods to
identify the best-fit population model that can account for
the underlying gene tree structure (Bahlo and Griffiths,
2000; Pannell, 2003). The concept and models can be used
to simulate genealogies under different combinations of
population-model parameters, and to illustrate their individual or combined effects on expected patterns of gene trees
that might be observed in empirical studies (Ewing and
Hermisson, 2010).
Selection and gene flow in genealogies
Among several coalescent population-genetic methods
that have been developed to model the evolution of genetic
diversity and disparity associated with new species formation, two are especially informative and broadly useful.
Both methods use gene trees of diverging lineages to extract
useful insights into speciation processes.
Codon models of selection analyze disparity among protein-coding DNA sequences, and they make inferences
about the sources of disparity on the basis of the relative
rates of nonsynonymous (dN) and synonymous nucleotide
substitutions (dS) that do or do not alter the predicted amino
acid sequence (Yang and Bielawski, 2000; Anisimova,
2012). By mapping substitutions onto a gene tree of relationships among haplotypes, such models can characterize
the relative contributions of purifying selection or positive
Darwinian selection as the ratio of those relative rates ω =
dN/dS. If the rate of accumulation of synonymous substitution differences among sequences is mainly attributed to
the accumulation of new mutations over time, and if those
substitutions are largely invisible to selection on protein
function, then a higher (or lower) rate of accumulation of
nonsynonymous differences can be attributed to the effects
of selection favoring (or opposing) new amino acid variants.
In particular, several classes of such codon models (sometimes called branch sites models; Yang and Nielsen, 2002)
can identify the effects of positive selection acting on particular parts of a sequence alignment or on particular lineages within a gene tree. In comparison to other population-genetic methods for detecting a signal of selection at the
molecular level based on the relative frequencies of intraspecific polymorphisms (Anisimova and Liberles, 2012),
branch sites models have the significant advantage that they
can be used to identify specific molecular traits (codons
with ω > 1 among some lineages) or specific branches in
the gene tree (with ω > 1 at some codons) as the targets of
positive selection for divergence among lineages. This potential to link genotypes (at the nucleotide level) with phenotypes under selection (at the amino acid level) may partly
account for the wide application of codon models to molecular studies of adaptive divergence associated with speciation (Yang and Bielawski, 2000; Nielsen, 2005).
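The ratio ω can be made concrete with a much simpler calculation than the likelihood-based codon models discussed here. The following Python sketch is an illustration only: it assumes that the numbers of nonsynonymous and synonymous differences and sites have already been counted for a pair of aligned coding sequences (a simple counting approach, not the branch sites models cited above), applies a Jukes-Cantor correction for multiple substitutions at the same site, and returns ω. The function names and the input counts are hypothetical.

```python
import math


def jukes_cantor(p):
    """Convert a proportion of differing sites into substitutions per site."""
    if not 0 <= p < 0.75:
        raise ValueError("proportion of differences must be in [0, 0.75)")
    return -0.75 * math.log(1 - 4 * p / 3)


def omega(nonsyn_diffs, nonsyn_sites, syn_diffs, syn_sites):
    """Return dN/dS from pre-counted differences and sites (hypothetical inputs)."""
    d_n = jukes_cantor(nonsyn_diffs / nonsyn_sites)
    d_s = jukes_cantor(syn_diffs / syn_sites)
    if d_s == 0:
        raise ValueError("dS is zero, so omega is undefined")
    return d_n / d_s


if __name__ == "__main__":
    # Made-up counts: few amino acid changes relative to silent changes.
    print(omega(nonsyn_diffs=5, nonsyn_sites=700, syn_diffs=20, syn_sites=300))
    # Made-up counts: an excess of amino acid changes.
    print(omega(nonsyn_diffs=30, nonsyn_sites=700, syn_diffs=5, syn_sites=300))
```

In a calculation of this kind, values of ω above 1 would be read as an excess of amino acid changes relative to the synonymous rate and values below 1 as a deficit, which is the same logic that the codon models apply to individual codons and branches of a gene tree.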
Isolation-with-migration (IM) models analyze disparity
among any type of DNA sequence (including protein-coding genes), and make inferences about the sources of disparity on the basis of estimates of population-model parameters including gene flow, population sizes, and population
divergence times (Hey and Nielsen, 2004, 2007; Hey,
2010a). In particular, IM methods use coalescent models of
gene tree evolution to characterize the demographic history
of population samples. By fitting the population model and
its parameters to sequence data for multiple unlinked loci,
IM methods can independently model multiple population
demographic parameters. In comparison to other population-genetic methods for inferring demographic history, IM
models have the significant advantage that they can distinguish among the contributions of low gene flow, reduced
population size, and old divergence time to the evolution of
genetic polymorphisms within populations and differentiation between populations (Marko and Hart, 2011). One of
the most important insights into speciation from such models is the ability to characterize the magnitude of gene flow
between populations independently of other population parameters that affect disparity, and (by comparison among
analyses of different populations and genes) to identify
those populations or genes with evidence of low or zero
gene flow that might reflect divergent selection (Sousa et
al., 2013). Like the particular advantages of codon models
for understanding selection at the molecular level, the potential to model the specific role of gene flow variation in
speciation (separately from the role of other population
demographic parameters) may partly account for the widespread use of IM models in analyses of phylogeographic
structure and population divergence (Nosil, 2008; Pinho and
Hey, 2010).
Both approaches include an explicit model of DNA sequence changes leading to diversity, and each includes
parameters for some but not all of the processes expected to
shape the evolution of sequence disparity (selection in
codon models, genetic drift and gene flow in IM models).
Thus, the two approaches can use the same sequence data to
give complementary insight into the evolutionary causes of
population divergence and speciation. The two approaches
also share similar data requirements: in both cases the
critical model parameters (especially the selection parameter ω in codon models, and the gene flow parameter m in IM
models) can be estimated from fitting the model to small
alignments of relatively few samples (individuals or diverging populations or species) for each gene or locus (e.g.,
Carneiro et al., 2012).
The problem: we need many gene trees
Both IM models and codon models depend on information from gene trees to infer population-model parameters.
Among theoreticians it has long been recognized that recombination (e.g., between loci on different chromosomes,
or separated by recombination hotspots) obscures genealogical information about ancestor-descendant relationships
among alleles, and that the coalescent applies only to haplotypes or linked groups of nonrecombining polymorphisms
(Donnelly and Tavaré, 1995). As a result, unlinked loci
experience the effects of mutation, selection, and other
processes independently of each other, and the coalescent
pattern of relationships among alleles is expected to differ
among unlinked loci due to stochastic effects of sampling or
genetic drift (Nordborg, 2003).
Figure 1. Structure of coalescent gene trees depends on stochastic variation, gene flow, and selection. (A)
Three gene trees (a1–a3) from coalescent simulations using the software program msms ver. 1.3 (Ewing and
Hermisson, 2010) for two structured populations with no selection and high gene flow (Nm = 10). (B) Gene trees
(b1–b3) from a second simulation with low gene flow (Nm = 0.5). (C) Gene trees (c1–c3) from a third
simulation with high gene flow as in A but with selection against one of two alleles. In each of the three
simulations only six haplotypes for each population were output in the gene trees for comparison to gene trees
from small population samples of sea star gamete recognition genes (Fig. 2). Other details of the simulation
conditions are given in the main text. Each output tree was drawn in FigTree ver. 1.4 using midpoint rooting and
a rescaled root depth of 1.0 units.
For empiricists, the most important consequence of recombination is that the demographic history of a population
cannot reliably be characterized from one or a few parts of
the genome because each has evolved independently under
the coalescent. Instead the population history—and individual population-model parameters—must be estimated as a
statistical property of a sample from many parts of the
genome. Like all sampling problems, the quality of such
estimates and the insight they give into the causes of divergence and speciation are now well known to depend critically on the number of loci analyzed and the genomic
breadth of the study (Felsenstein, 2006; Nosil et al., 2009;
Feder et al., 2012).
Theory shows that small samples of haplotypes (on the
order of 10 per population) will typically capture most of
the underlying coalescent population history that generated
the gene tree for a single locus, so that relatively few
individuals can be used effectively (Pluzhnikov and Donnelly, 1996; Felsenstein, 2006). However, gene trees drawn
from the same coalescent history can be highly variable
among replicate samples in nature or among replicate simulations. For example, I used the popular coalescent software program msms ver. 1.3 (Ewing and Hermisson, 2010)
to simulate genealogies drawn from two populations under
a neutral model without selection (using the command line
string ./msms 12 3 -t 1 -T -I 2 6 6 10). Three replicate
genealogies (i.e., three loci with 12 haplotypes sampled for
each gene) from this single simulation are shown in Figure
1A. In this example, gene flow was relatively high (Nm =
10) and the population size parameter θ was realistic for
typical mutation rates (-t 1). The extent of population differentiation is shown by the clustering of gene copies from
the two populations (labeled x, o) into clades. However, the
most striking result is the variation in gene tree structure
(the coalescent depth of nodes) among three loci from the
same simulation (e.g., genealogies a1 vs. a3 in Fig. 1).
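The variation among loci described above can be reproduced in miniature without msms. The sketch below is a simplified stand-in, not the simulation used for Figure 1: it assumes a single panmictic population (no population structure or migration), measures time in units of 2N generations, and draws only the depth of each gene tree (the time to the most recent common ancestor of 12 haplotypes) under the standard neutral coalescent. The function names and the random seed are arbitrary choices.

```python
import random
import statistics


def simulate_tmrca(n_haplotypes, rng):
    """Depth of one gene tree (time to the common ancestor), in units of 2N generations."""
    t = 0.0
    k = n_haplotypes
    while k > 1:
        rate = k * (k - 1) / 2  # number of lineage pairs that could coalesce
        t += rng.expovariate(rate)  # waiting time to the next coalescence
        k -= 1
    return t


def tmrca_across_loci(n_loci, n_haplotypes=12, seed=1):
    """Tree depths for n_loci unlinked loci drawn from the same population history."""
    rng = random.Random(seed)
    return [simulate_tmrca(n_haplotypes, rng) for _ in range(n_loci)]


if __name__ == "__main__":
    few = tmrca_across_loci(3)
    many = tmrca_across_loci(1000)
    print("three loci:", [round(t, 2) for t in few])
    print("mean over 1000 loci:", round(statistics.mean(many), 2))
    print("standard deviation over 1000 loci:", round(statistics.stdev(many), 2))
```

Under these assumptions the expected depth is 2(1 - 1/n), about 1.83 for n = 12, but individual loci scatter widely around that value, which is why a handful of gene trees is a poor guide to the underlying population history.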
This dependence on genome sampling is a feature of all
coalescent methods, but its implications for empirical inference are clearest in the case of IM models. In a second
example, I simulated the expected effects of low gene flow
(Nm = 0.5) but otherwise identical conditions (using the
command line string ./msms 12 3 -t 1 -T -I 2 6 6 0.5), and
compared the genealogies (Fig. 1B) to those expected under
a neutral model with high gene flow (Fig. 1A). For some
loci, reproductive isolation and reduced gene flow caused
strong phylogeographic structure due to genetic drift (e.g.,
genealogy b1 in Fig. 1, with reciprocally monophyletic
clades of haplotypes in each population), but at other loci in
the same simulated population history there was still extensive sharing of haplotype lineages between the two populations despite the effects of low gene flow and genetic drift
(e.g., genealogies b2, b3 in Fig. 1).
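The effect of gene flow on the sharing of lineages between populations can be sketched in a similar way. The code below assumes a symmetric two-population island model, follows just two gene copies backward in time, and records when they coalesce. M stands for the scaled migration rate 4Nm, time is in units of 2N generations (N being the size of each population), and all names are hypothetical. It is meant as a caricature of the contrast between high and low gene flow, not a re-implementation of the msms runs, and its M is not claimed to match the scaling of the migration argument in the command lines above.

```python
import random


def pair_coalescence_time(M, same_deme, rng):
    """Coalescence time of two lineages in a symmetric two-deme island model."""
    if M <= 0:
        raise ValueError("M must be positive")
    t = 0.0
    while True:
        if same_deme:
            # Competing events: coalescence (rate 1) or migration of either lineage (rate M).
            t += rng.expovariate(1 + M)
            if rng.random() < 1 / (1 + M):
                return t
            same_deme = False
        else:
            # Lineages in different demes cannot coalesce until one of them migrates.
            t += rng.expovariate(M)
            same_deme = True


def mean_time(M, same_deme, reps=20000, seed=1):
    """Average coalescence time over many independent replicate pairs."""
    rng = random.Random(seed)
    total = sum(pair_coalescence_time(M, same_deme, rng) for _ in range(reps))
    return total / reps


if __name__ == "__main__":
    for M in (40.0, 2.0):
        print("M =", M, "mean time between demes:", round(mean_time(M, False), 2))
```

For this model the expected coalescence time is 2 for two copies sampled in the same population and 2 + 1/M for copies sampled in different populations, so lower gene flow deepens between-population splits on average, while individual loci still vary widely around these means.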
As a consequence of this variation among loci, estimating
gene flow and other population-model parameters in an
empirical IM analysis for real sequence data may require
sampling dozens of loci across the genome in order to distill
the signal of population history from the noise associated
with stochastic variation (Edwards and Beerli, 2000; Arbogast et al., 2002). Estimating a large number of populationmodel parameters (especially for studies of many populations) may require information from hundreds of loci (e.g.,
Hey, 2010b). In such analyses, gene trees for individual loci
are treated as separate instances of the coalescent history of
the modeled populations, and the estimation of the model
parameters is improved by information from both the individual loci and the variance among loci.
Similarly in the case of codon models, sequence differences between diverging lineages are assumed to represent
fixed substitution differences (rather than polymorphisms
within populations). As a consequence, selection can be
modeled using a single representative haplotype from each
lineage. However, codon models share with IM models the
difficulty of distinguishing the effects of selection from
stochastic variation. In a third example, I simulated the
expected effects of selection against one of two alleles
(using the command line string ./msms 12 3 -t 1 -T -I 2 6 6
10 -SAA 20 -SaA 10 -SF 0.001 -N 10000) but with otherwise identical conditions including high gene flow (Nm =
10), and compared the genealogies (Fig. 1C) to those under
a neutral model without selection (Fig. 1A). For some loci,
directional selection caused striking population differences
(e.g., genealogy c2 in Fig. 1, with a single deep split
between two lineages of haplotypes). However, at other loci
subject to the same pattern of selection in the same simulation, populations shared several deeply divergent lineages
of haplotypes with less striking patterns of population difference (e.g., genealogies c1, c3 in Fig. 1).
This inherent variation in genealogical patterns under the
coalescent has important implications for the interpretation
of codon models of selection leading to speciation. Several
forms of codon models explore and test hypotheses about
variation in ω values among codons in the alignment or
among lineages in the gene tree (Anisimova and Liberles,
2012), but demonstrating that particular genes and molecular traits are the particular targets of selection leading to new
species has proven to be difficult: codon models of selection
may be relatively conservative (Anisimova, 2012), and demographic variation can produce patterns of codon evolution similar to the effects of selection (e.g., Stajich and
Hahn, 2005). Analogous difficulties are also evident among
studies that have used indirect methods, for example, based
on genome scans of single nucleotide polymorphisms
(SNPs) and models of population-genetic differentiation to
identify SNPs that are statistical outliers and possibly linked
to genomic variants under selection for frequency differences between diverging populations or species (summarized by Bierne et al., 2013). Selection may be widespread
across the genome (Hahn, 2008), but distinguishing genes
that are the true targets of selection from the statistical noise
associated with coalescent variation depends on the breadth
of comparisons among genes across the genome.
A solution: many gene trees from RNA-Seq
Because both codon models of selection and isolation-with-migration models of gene flow either explicitly require
or inherently benefit from comparison of gene tree structure
among many loci across the genome, the two approaches
can potentially benefit from the expanded capacity of next-generation sequencing technologies to characterize genome-wide variation. Despite their well-known drawbacks and
limits (DeWoody et al., 2013), transcriptome or RNA-Seq
methods might be the best source of data for such combined
analyses, especially for relatively understudied non-model
organisms for which there is no reference genome (Wheat,
2010; Cahais et al., 2012; Gayral et al., 2013). In RNA-Seq
workflows for non-model organisms (e.g., Feldmeyer et al.,
2011), bulk RNA is extracted from a specific tissue or organ
(or from the whole organism), fragmented into lengths
suitable for sequencing (typically a few hundred nucleotides), and converted to cDNA. Alternative workflows differ
in the order of these steps, first converting mRNA to cDNA
and then fragmenting the cDNA molecules. A key detail is
that transcripts are fragmented (by mechanical or biochemical methods) at random with respect to the nucleotide
sequence, so that different copies of the same gene transcript will be fragmented in different locations. In some
applications, the proportion of transcripts from protein-coding genes can be increased (and the abundance of ribosomal RNA molecules can be decreased) by hybridization
methods that capture messenger RNA molecules or remove
ribosomal RNA molecules.
The resulting cDNA samples are then processed into
libraries for massively parallel sequencing. Because the
resulting libraries consist mainly of processed gene transcripts, other parts of the genome including introns, regulatory regions, unexpressed genes, and noncoding intergenic
regions are not sequenced. A common form of sequence
data is paired-end 100- or 150-base sequences from Illumina instruments. Pairs of sequence reads (on the order of
10^7 to 10^9 per instrument run) are then computationally
assembled into genes by sequence alignment of reads from
partially overlapping fragments that represent contiguous
parts of the same expressed gene (contigs). Homologous
genes from different assemblies (e.g., assembled from
RNA-Seq data for different individual organisms) can be
identified by sequence similarity comparisons between assemblies (e.g., using BLAST), extracted from each assembly,
aligned to each other, and analyzed in codon models or IM
models.
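The assembly step described above can be illustrated with a toy example. The sketch below assumes error-free reads from a single transcript and merges them greedily by their longest suffix-prefix overlap. Real de novo transcriptome assemblers rely on more elaborate graph-based algorithms and must cope with sequencing errors, paralogs, and alternative splicing. The sequence, the read coordinates, and the function names are invented for illustration.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of a that matches a prefix of b (0 if shorter than min_len)."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len=3):
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicate reads, keep order
    while True:
        best_k, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    k = overlap(a, b, min_len)
                    if k > best_k:
                        best_k, best_i, best_j = k, i, j
        if best_k == 0:
            return contigs  # no remaining overlaps: these are the final contigs
        merged = contigs[best_i] + contigs[best_j][best_k:]
        contigs = [c for idx, c in enumerate(contigs) if idx not in (best_i, best_j)]
        contigs.append(merged)


if __name__ == "__main__":
    transcript = "ATGGCGTACGTTAGCCTGAAGTCA"  # invented sequence
    # Three overlapping fragments, listed out of order as reads would be.
    reads = [transcript[12:24], transcript[0:10], transcript[6:16]]
    print(greedy_assemble(reads))
```

Because the fragments overlap at different positions, the reads can be stitched back into a single contig, which is the property that random fragmentation of transcripts is meant to provide.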
RNA-Seq data have several potential advantages in comparison to traditional PCR-based methods or whole-genome
sequences.
● Transcriptomes can be analyzed using both codon models (which require coding sequences) and IM models (which can use coding sequences or any other type of DNA sequence).
● RNA-Seq can target genes expressed in specific tissues or organs that are associated with specific hypotheses about ecological adaptation, sexual selection, or other speciation processes (such as seminal fluid proteins involved in mate compatibility; Larson et al., 2013).
● The relatively high cost of RNA-Seq sample processing and sequencing (DeWoody et al., 2013) may be partly offset by the relatively low cost per gene, which makes RNA-Seq (and other next-generation sequencing technologies) a relatively good match to the data requirements for codon models of selection and IM models of demographic history (many loci; few samples).
● Contigs of expressed genes can often be assembled from RNA-Seq data using de novo assembly algorithms that compare and align sequence reads to each other (rather than mapping each read to a reference genome). In contrast, whole genome assembly methods typically depend on mapping sequence reads to a reference genome from the same species (or a closely related species) due to the much larger size and greater complexity of genomes compared to the transcriptomes for single tissues or organs.
For studies of molecular adaptation, gene flow, and speciation in non-model organisms without a reference genome, such de novo assembly of transcriptomes may provide relatively easy and inexpensive access to high-quality
genome-scale sequence information (Hornett and Wheat,
2012). Even for organisms with high-quality reference genomes, RNA-Seq data from small samples of individuals
can be an important source of insight into the targets of
selection for adaptive divergence (e.g., Carneiro et al.,
2012).
Recent Progress: Selection and Gene Flow From
Transcriptomes
Because affordable RNA-Seq data and the analytical
methods used to assemble the data are relatively recent
innovations, there are still relatively few applications to
studies of speciation in natural populations (e.g., Soria-Carrasco et al., 2014). However, the available examples
suggest that combining isolation-with-migration model
analyses of gene flow (and reproductive isolation) with
codon model analyses of selection (and population genetic
divergence) can be a useful source of new insight into the
earliest stages of speciation in the sea.
Gamete recognition genes in bat stars
We used this approach to identify possible targets of
selection leading to reproductive isolation and incipient
speciation in diverging populations of a northeastern Pacific
sea star (Patiria miniata (Brandt, 1835)) that have been
isolated with low gene flow for about 3 × 10^5 years (Keever
et al., 2009; McGovern et al., 2010; Hart and Marko, 2010).
From small population samples we generated transcriptomes of ovary-expressed genes (Hart and Foster, 2013).
We then used codon models to identify genes under positive
selection for population differences, and IM models of the
same genes to characterize the extent of long-term reproductive isolation between diverging populations. An important feature of this approach is that the same within-population sequence polymorphisms—which are key to
parameter fitting in the IM models—were also used in our
codon model analyses.
We compared two gamete recognition genes (the OBi1
receptor expressed in the egg coat and bindin expressed in
the sperm head) to other genes expressed only in gametes or
to genes expressed in gametes and other cell types. In
branch sites codon models we found a strong signal of
positive selection on a few specific codons in OBi1 (Hart et
al., 2014), similar to the signal of positive selection on a few
codons in bindin (based on data from traditional PCR-based
methods; Sunday and Hart, 2013). In IM models we found
zero gene flow for the parts of OBi1 and bindin that included specific codons under positive selection, and nonzero gene flow between the same diverging populations for
all other genes, including those sequenced from small population samples of transcriptomes (Hart et al., 2014) and
those sequenced from larger population samples of PCR
amplicons (Keever et al., 2009; McGovern et al., 2010; Hart
et al., 2014).
False positives in codon model analyses
Codon model analyses within this context (populations
under selection for divergence, combined with IM models),
especially for genes directly related to reproductive compatibility or isolation, are potentially highly informative
because they focus on evolutionary processes during the
earliest stages of speciation (Swanson and Vacquier, 2002;
Palumbi, 2009; Lessios, 2011; Vacquier and Swanson,
2011; Evans and Sherman, 2013). Such analyses are also
potentially controversial: we used branch sites models of
among-codon and among-lineage variation in ω (Yang and
Nielsen, 2002; Yang et al., 2005; Pond et al., 2011; Murrell
et al., 2012) that are descended from the original models of
Goldman and Yang (1994) and Muse and Gaut (1994) in
which sequence differences are assumed to be fixed substitutions (for example, between highly divergent species) and
not segregating polymorphisms (within single populations).
An alternative form of this assumption is that mutation is
weak relative to the strength of selection or relative to the
time since divergence between lineages, so that new recurrent mutations within single populations or ancestral polymorphisms shared between populations make a negligible
contribution to the pattern of synonymous and nonsynonymous sequence variation (Muse and Gaut, 1994; Mugal et
al., 2014).
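As a minimal sketch of the quantities behind ω, the following Python example classifies single-nucleotide codon differences between two aligned coding sequences as synonymous or nonsynonymous. It is not one of the likelihood-based codon models cited above: it returns raw counts only, without the per-site normalization needed for dN and dS, and the function name and example sequences are hypothetical. Note that the counts cannot distinguish fixed substitutions from segregating polymorphisms, which is the assumption at issue here.

```python
from itertools import product

# Standard genetic code (NCBI translation table 1), codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {"".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)}


def count_differences(seq_a, seq_b):
    """Return (synonymous, nonsynonymous) counts of codons that differ
    at exactly one nucleotide between two aligned coding sequences.

    Codons differing at two or three positions are skipped because the
    mutational pathway between them is ambiguous in this simple count.
    """
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must be aligned and a multiple of 3 long")
    syn = nonsyn = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codons, or ambiguous multi-step differences
        if CODE[codon_a] == CODE[codon_b]:
            syn += 1
        else:
            nonsyn += 1
    return syn, nonsyn


# Hypothetical example: Met-Leu-Glu versus Met-Leu-Asp.
# CTT -> CTC leaves leucine unchanged; GAA -> GAC changes Glu to Asp.
counts = count_differences("ATGCTTGAA", "ATGCTCGAC")
```

Whether each counted difference is a fixed substitution or a polymorphism still segregating within one population is invisible to this count, and the same limitation applies to models that treat all differences as substitutions.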
Failing to account for the effects of mutation can have
important undesirable consequences. It is well known
(Kryazhimskiy and Plotkin, 2008; Anisimova and Liberles,
2012; Mugal et al., 2014) that violating this assumption may
have a high associated risk of identifying false positives in
analyses of positive selection (e.g., when alignments of
genes under purifying selection include rare nonsynonymous mutations that have not yet been eliminated from the
population by selection), or of false negatives (e.g., when
stochastic variation in the number of synonymous polymorphisms within a single population masks the underlying
effects of positive selection for high rates of nonsynonymous substitution). This view was confirmed by simulations
showing that, “when applied to a single population, dN/dS is
not particularly sensitive to the strength of selection and it is
not a reliable indicator of the sign of selection” (Kryazhimskiy and Plotkin, 2008). Nevertheless, empiricists studying
adaptive molecular evolution associated with speciation
have often ignored those concerns by focusing on systems
of populations or sister species in which the divergence time
is implicitly (and optimistically) assumed to be sufficiently
long to negate the risks of false positives caused by the
effects of mutation and of segregating polymorphisms (e.g.,
Hart et al., 2012, 2014; Sunday and Hart, 2013; Popovic et
al., 2014).
Are bindin and OBi1 under positive selection leading to
incipient speciation in Patiria miniata populations, or are
these false positives caused by mutation and segregating
polymorphism (and not caused by selection)? Several structural and functional features of those results suggest that
they are not false positives.
● The OBi1 receptor protein in the egg coat and the bindin protein in the sperm head are known to interact in a species-specific fashion at fertilization in sea urchins (Foltz et al., 1993; Vacquier, 2012), and bindin is the target of positive selection for among-species differences in many sea urchin genera (Palumbi, 2009; Lessios, 2011).
● In P. miniata the key positively selected codons in those genes are located in the two parts of the gene structure predicted to be most sensitive to selection on specificity of sperm-egg binding: in the known substrate-binding site of the OBi1 receptor, and adjacent to the highly conserved ligand for that binding site in bindin (Hart et al., 2014).
● The key positively selected bindin codon in P. miniata is in the same part of the gene structure as the well-known hot spot of bindin positive selection among closely related sea urchin species (Metz and Palumbi, 1996; Biermann, 1998).
● Most important, variation at positively selected codons in OBi1 and bindin explained a significant proportion of the variation in laboratory fertilization rates between P. miniata individuals from diverging populations and between individual males and females with different genotypes at one key positively selected codon in each gene (Hart et al., 2014).
This combination of IM and codon results, especially the
evidence for reproductive isolation between populations
(zero gene flow only in parts of two positively selected
gamete recognition genes) and the evidence for lower gamete compatibility between individuals associated with the
same positively selected codons, thus seems to point directly toward a pair of genes subject to positive selection
and divergence between populations, possibly driven by
local selection on sperm-egg compatibility within populations.
Selection and gene flow in other transcriptome studies
Transcriptome-based analyses of adaptation and ecological speciation, for example in studies of diverging ecotypes
along habitat gradients (Cheviron and Brumfield, 2012), are
increasingly common but have not often used this same
combination of codon and IM models applied to the same
sequence alignments of genes from diverging populations.
One obstacle to such analyses—and a possible reason for
their rarity—is the difficulty of resolving single nucleotide
polymorphisms (SNPs) within a single transcriptome into a
diploid pair of haplotypes (consisting of linked variable
sites) in a de novo transcriptome assembly in order to
generate gene trees based on multiple differences among
individual haplotypes (e.g., Hart, 2013). Some RNA-Seq
studies of diverging populations or ecotypes instead focus
on SNP allele frequency variation without using analyses
based on gene trees derived from haplotype sequences (e.g.,
Lemay et al., 2013). De Wit and Palumbi (2013) used
transcriptomes from abalone populations to discover outlier
SNP loci with unusually large differences in allele frequency in genes that are candidates for subsequent analyses
of selection, especially genes involved in energy metabolism and shell mineralization that might be subject to locally
varying effects of selection imposed by climate change and
ocean acidification.
A second obstacle to such analyses is the risk of errors in
estimates of ω caused by the inclusion of segregating polymorphisms (and the effects of mutation and ancestral polymorphisms). In one of the earliest transcriptome studies of
ecotype divergence between marine populations, Barreto et
al. (2011) resolved within-population variation into a single
haplotype per gene for two diverging populations of a
tidepool copepod. This approach allowed the authors to
ignore rare segregating polymorphisms and to use codon
models to discover genes that were under positive selection
and likely to underlie the previously well-known genomic
incompatibilities between these same populations. This approach, however, precluded the use of IM models to contrast demographic parameters and the history of gene flow
for sets of genes that were or were not subject to positive
selection. Using a similar compromise, Koester et al. (2013)
analyzed positive selection in codon models (but not gene
flow and reproductive isolation in IM models) for seven
strains of a planktonic diatom species in order to identify
candidate genes under selection for divergence, especially
between ecotypes from different oceans or different planktonic habitats.
Few other studies have attempted to combine both demographic models and codon models applied to the same
transcriptome data. Osborne et al. (2013) analyzed divergence between ragwort ecotypes adapted to different habitats along an altitudinal cline using both codon models of
positive selection and a demographic model that estimated
some parameters (divergence time; population size) but not
all parameters (gene flow) included in IM models. However,
like the studies noted above, that study analyzed a single
haplotype per gene for each ecotype, which precluded analyses using IM models. Other examples of this combination
of analytical approaches have used traditional PCR-based
methods focused on a relatively small number of genes
targeted for their likely role in adaptive evolution (e.g.,
Wlasiuk et al., 2009) or reinforcement in hybrid zones (e.g.,
Maroja et al., 2009).
Future Prospects: Combined Models of Selection and Gene Flow in Speciation
Mitigating the false-positive problem in codon models
Because segregating polymorphisms are the primary
source of information for isolation-with-migration (IM)
models (Kuhner, 2009) but a known potential source of
error for codon models (Anisimova and Liberles, 2012),
applying both approaches to the same sequence data in the
same study may lead to key insights into speciation but at
the risk of some errors. One alternative approach to mitigating those risks in studies of closely related lineages
(diverging on short evolutionary time scales where mutation
effects might be strong) is to focus on IM methods for
characterizing gene flow and other demographic parameters, while using approximate methods to characterize selection effects such as the HKA test that do not link selection to specific codons or lineages (e.g., Muir et al., 2012).
A significant drawback of this compromise is the limited
ability to characterize the relative strength of response to
selection (ω) in standard codon models, or to identify individual codons or lineages as targets of selection in branch
sites models.
A second alternative for studies of more highly divergent
populations (on longer evolutionary time scales) is to focus
on codon model methods for characterizing the effects of
selection, while using approximations such as fixation indices (FST) as proxies for gene flow (e.g., Fraser et al., 2010).
A drawback of this compromise is that other demographic
processes and other characteristics of the genetic markers,
including old or recent divergence time and small or large
effective population size, can contribute to patterns of differentiation (FST) without variation in gene flow (Marko and
Hart, 2011).
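The limits of FST as a proxy for gene flow can be made concrete with a short Python sketch. It computes FST for one biallelic locus from allele frequencies in two equally weighted populations, and separately gives the classical equilibrium expectation under Wright's island model, FST ≈ 1/(4Nm + 1). The function names and allele frequencies are hypothetical, and the island-model formula assumes migration-drift equilibrium, which is exactly the assumption that recently diverged populations may violate.

```python
def fst_biallelic(p1, p2):
    """F_ST = (H_T - H_S) / H_T for two equally weighted populations,
    given the frequency of one allele (p1, p2) at a biallelic locus."""
    # Mean expected heterozygosity within populations.
    h_s = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2
    # Expected heterozygosity of the pooled populations.
    p_bar = (p1 + p2) / 2
    h_t = 2 * p_bar * (1 - p_bar)
    if h_t == 0:
        return 0.0  # locus is monomorphic overall; no differentiation to measure
    return (h_t - h_s) / h_t


def island_model_fst(nm):
    """Equilibrium expectation under Wright's island model, where nm is the
    number of migrants per generation (N * m). Assumes equilibrium."""
    return 1.0 / (4.0 * nm + 1.0)


# Hypothetical allele frequencies in two populations.
observed_fst = fst_biallelic(0.2, 0.8)

# Inverting the island model turns an observed F_ST into an "implied" Nm,
# but the same F_ST could also arise with zero gene flow after a short
# period of isolation, so the inversion is only as good as its assumptions.
implied_nm = (1.0 / observed_fst - 1.0) / 4.0
```

The inversion in the last line shows the issue: it attributes all of the observed differentiation to a particular level of gene flow, with no way to account for divergence time or effective population size.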
A third approach to this problem is to apply both IM
models and codon models to the same multilocus sequence
alignments from transcriptomes, and then draw post hoc
inferences about the reliability of those results on the basis
of consistency among the codon models (showing selection
acting on specific codons and lineages), IM models (showing restricted gene flow for genes under positive selection),
predictions based on structure-function relationships (showing positive selection on known binding sites and ligands),
and genotype-phenotype correlations in functional studies
(such as reproductive incompatibility correlated with positive selection on gamete recognition genes). We depended
on this mode of inference in our study of the apparent
coevolution of OBi1 and bindin in old diverging population
pairs in Patiria miniata.
The good news: better models may avoid false positives
Better understanding of the dynamics of demography and
selection during speciation will come from better models
that incorporate both IM-based demographic parameters
and codon-based selection parameters into the same framework (Lawton-Rauh, 2008; Crisci et al., 2012). Unfortunately, this hard problem has had few proposed solutions.
One early step toward a synthesis of isolation-with-selection
models included a codon model of positive selection plus a
mutation rate parameter intended to account in part for the
effect of mutations and segregating polymorphisms on the
estimate of ω (Wilson and McVean, 2006). This model
includes variation in ω among codons but not among lineages or populations, so it is not specifically designed to
detect signals of selection for population divergence, but has
been used effectively to identify likely targets of selection at
late stages in the speciation process (e.g., reinforcement
against hybridization in a cricket hybrid zone; Maroja et al.,
2009).
Figure 2. Reanalysis of positive selection in small population samples of the egg coat gene OBi1 from
ovary transcriptomes of diverging sea star populations (Patiria miniata) using the software program omegaMap
ver. 0.5 (Wilson and McVean, 2006). The Markov chain Monte Carlo search ran 300,000 steps with model
parameters sampled at intervals of 100 steps (after a burnin of 30,000 steps); I used uniform priors on parameter
values including the mutation rate (muParam = 0.1, 10); parameter estimates for selection (oBlock = 4) and
recombination (rBlock = 5) were fitted using a sliding window method (omega_model = variable; rho_model =
variable). Black curve shows variation among codons in estimated ω; red dots show the posterior probability of
positive selection (ω > 1). For three codons, including two in the predicted substrate-binding site of the protein
(enclosed within the blue box), ω ~ 3–5 and P > 0.90; the same two codons in the substrate-binding site were
positively selected in branch sites models fitted to the same sequence data in previous published analyses (Hart
et al., 2014). The inset gene tree (from Hart et al., 2014) shows reciprocally monophyletic relationships (strong
population divergence) among six OBi1 haplotypes from northern and from southern female sea stars for the
gene partition that includes the substrate-binding site of the receptor.
Does a better model like this indicate false positives from
analyses that used traditional branch sites codon models that
assume fixed substitution differences? I used the method of
Wilson and McVean (called omegaMap) to estimate
among-codon variation in ω for OBi1 from P. miniata
transcriptomes (Fig. 2). Two codons in the OBi1 substrate-binding site that were positively selected in branch sites
models (Hart et al., 2014) were also identified as positively
selected in the omegaMap analysis, with high posterior
probabilities (P > 0.9) of positive selection (ω > 1), but
with more realistic estimated values for the selection parameter at those codons (ω ~ 3–5) in comparison to the
unrealistic and imprecise estimates from codon models that
assumed fixed substitution differences (ω ~ 200; Hart et al.,
2014).
The concordance between results from traditional branch
sites models (Hart et al., 2014) and from omegaMap analyses (Fig. 2) gives some encouragement that the signal of
positive selection on the OBi1 binding site (and its correlation with fertilization rates and reproductive isolation) may
not be a false positive. Similarly, Kryazhimskiy and Plotkin
(2008) simulated within-population polymorphisms for two
isolated populations, and showed that the deviation between
observed and expected ω values for this simulated two-population system depended on the mutation rate (scaled for
the effective population size as θ = 4Neμ). They concluded
that “when comparing divergent lineages the magnitude of
dN/dS compared to unity is a faithful indicator of the sign of
selection.” Overall, such results give some reason for optimism that transcriptome studies of incipient species formation using both IM models and codon models may avoid
false positives and may correctly identify genes and codons
under positive selection if applied to systems of relatively
old conspecific population divergence (Hart and Marko,
2010).
The bad news: time dependence of false positives
Unfortunately, the newest model developments suggest
that this optimistic view should be tempered with caution.
Mugal et al. (2014) recently developed an isolation-with-selection model that
incorporates most of the desirable features of both IM
models and traditional codon models. It includes a fully
specified codon substitution matrix, a model of divergence
between two populations descended from a single panmictic
ancestral population, and parameters for synonymous and
nonsynonymous mutations that arise in the ancestral population (before the divergence time) or in one of the descendant populations (after the divergence time), and then segregate as polymorphisms or go to fixation under the effects
of drift and selection. The important theoretical improvement of this model is its demonstration of the time dependence of ω for diverging populations: new mutations and
segregating polymorphisms will cause a long delay after
population divergence in the approach of ω to the expected
equilibrium value predicted by the selection coefficient.
For empiricists who desire to identify genes under positive selection for divergence during the critical early stages
of speciation, the more important improvement of this
model is its independent characterization of dN and dS as a
function of the time since population divergence. Because
“the ratio of [these] expected values is in general not equal
to the expected value of a ratio [ω = dN/dS]” (Mugal et al.,
2014), the stochastic variation in both synonymous and
nonsynonymous mutations must be accounted for in a characterization of ω from real or simulated sequence data.
Mugal et al. (2014) simulated that stochastic variation under
realistic values of the model parameters and data variables
(population size, sequence length, population divergence
time, mutation rate). Their most sobering discovery was that
“during initial divergence extremely high dN/dS values that
would be commonly taken as evidence for positive selection
[simulations with ω = 1.2–1.9] are frequently obtained even
under negative selection pressure [γ = −1, or purifying
selection with expected ω < 1]” (Mugal et al., 2014). In
these simulations, the risk of false positives (inferring ω >
1 under conditions of purifying selection) depended on both
divergence time and effective population size, and was
relatively high for at least 5–10 Ne generations after population divergence.
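The point that a ratio of expected values differs from the expected value of a ratio can be illustrated with a toy calculation in Python. This is not the model of Mugal et al. (2014): it simply assumes, for illustration, that the numbers of nonsynonymous (N) and synonymous (S) differences in a gene are independent Poisson counts with hypothetical means of 2 and 4, so that the expected ratio under this toy "purifying selection" is below 1. The function names and parameter values are assumptions.

```python
import math


def poisson_pmf(k, lam):
    """Probability of observing k events under a Poisson distribution with mean lam."""
    return math.exp(-lam) * lam ** k / math.factorial(k)


def ratio_summaries(lam_n, lam_s, kmax=80):
    """Exact summaries of N/S for independent Poisson counts N and S,
    conditioning on S > 0 so that the ratio is defined.

    Returns (ratio of means, mean of ratios, probability that N/S > 1).
    Sums are truncated at kmax, which is far in the tail for small means.
    """
    p_s_positive = 1.0 - poisson_pmf(0, lam_s)
    # E[1/S | S > 0], the term that makes the mean of ratios exceed the ratio of means.
    mean_inverse_s = sum(poisson_pmf(s, lam_s) / s for s in range(1, kmax)) / p_s_positive
    mean_of_ratios = lam_n * mean_inverse_s
    ratio_of_means = lam_n / (lam_s / p_s_positive)
    # Probability that the nonsynonymous count exceeds the synonymous count.
    p_ratio_above_one = sum(
        poisson_pmf(n, lam_n) * poisson_pmf(s, lam_s)
        for s in range(1, kmax)
        for n in range(s + 1, kmax)
    ) / p_s_positive
    return ratio_of_means, mean_of_ratios, p_ratio_above_one


# Hypothetical means: half as many nonsynonymous as synonymous differences expected.
ratio_of_means, mean_of_ratios, p_above_one = ratio_summaries(lam_n=2.0, lam_s=4.0)
```

Under these assumed values the mean of the ratios is larger than the ratio of the means, and a non-negligible fraction of genes would show N/S above 1 by chance alone. The effect shrinks as the expected counts grow, which is consistent in spirit with the dependence on divergence time and sequence length described above.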
The specific magnitude of that risk also depended on
other parameters or variables (sequence length, selection
coefficient, mutation rate). In general, however, for large
populations or recent population divergence, a point estimate of ω for a single gene was likely to be inflated by
stochastic variation in the frequency of nonsynonymous
polymorphisms within populations until long after the onset
of population divergence and the action of selection leading
to new species formation. For a system of populations like
Patiria miniata that have diverged for about 6 Ne generations (divergence time of ~3 × 10^5 years, generation time
of ~5 years, large effective population sizes > 10^4; Keever
et al., 2009; McGovern et al., 2010), the risk of false
positives would be high under the conditions simulated by
Mugal et al. (2014; fig. 5). Ironically, their results suggest
that the false-positive problem may be most acute during the
earliest stages of speciation that are of strong interest to
evolutionary ecologists and most in need of careful study
and improved understanding.
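The arithmetic behind the figure of about 6 Ne generations can be checked with a short Python sketch. The input values are those given in the text; the function name is hypothetical, and the upper bound of 10 Ne generations is taken from the range of 5–10 Ne generations quoted above rather than from any exact threshold.

```python
def divergence_in_ne_generations(years, generation_time_years, ne):
    """Express a divergence time in units of Ne generations."""
    generations = years / generation_time_years
    return generations / ne


# Values from the text: ~3 x 10^5 years, ~5 years per generation, Ne of about 10^4.
t_scaled = divergence_in_ne_generations(3e5, 5, 1e4)

# Because Ne is described as greater than 10^4, the scaled time is at most about 6,
# which falls inside the window of elevated false-positive risk (up to ~10 Ne generations).
within_risk_window = t_scaled <= 10
```

A larger effective population size would make the scaled divergence time smaller still, which moves the system further into the period where inflated point estimates of ω are expected.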
Moreover, Mugal et al. (2014) used a population divergence model that includes no gene flow between populations after the divergence time (an isolation model, without the migration component of IM models). They predict
that incorporating gene flow effects would further delay the
approach of ω to the expected equilibrium by adding segregating polymorphisms from one population to the other
via migration. Thus, for more realistic speciation scenarios
that include gene flow between populations as they diverge
under selection (as usually envisioned in ecological speciation models; Nosil, 2008; Feder et al., 2012; Faria et al.,
2014), it may be more difficult than expected to use codon
models of isolation-with-selection to identify the targets of
selection during speciation. Similarly, the stochastic variation in apparent response to selection (ω) in these simulations may have interesting implications for efforts to infer
selection effects based on variation among loci in population differentiation (from outlier methods; Narum and Hess,
2011) or in gene flow (Sousa et al., 2013) if analyses of
early stages in speciation often focus on genes and populations that are far from the predicted equilibrium ω value for
any underlying direction and strength of selection.
Keep calm and carry on
Although the simulation results of Mugal et al. (2014)
seem to paint a dim picture of prospects for codon model
analyses of the key early stages of speciation, several other
considerations argue for pursuing this combination of IM
models and codon models for characterizing gene flow and
selection in speciation. First, Mugal et al. (2014) used a
relatively simple codon model that characterizes dN and dS
as a uniform feature of all codons in a gene and all lineages
or alleles in an alignment (in contrast to among-codon rate
variation in the sites model of Wilson and McVean, 2006).
Such codon models are widely viewed as relatively insensitive to detecting the effects of selection acting only on
specific functional features of a protein (such as ligand
binding sites; Pond et al., 2011; Anisimova, 2012). Mugal et
al. (2014) noted that branch sites models of among-codon
and among-lineage variation might be useful additions to
their simulation method. It is currently not known whether
estimates of site-specific ω from more complex branch sites
models with several selection parameters (e.g., Fig. 2) can
be compared directly to simulations of a single time-dependent selection parameter in a simpler codon model (e.g.,
Mugal et al., 2014, fig. 5). If the time dependence of ω
estimates for specific codons or lineages under a branch
sites model is much shorter than the typical divergence
times of populations undergoing selection for adaptive differences leading to speciation, then branch sites codon models applied to diverging populations may be less susceptible
to false positives than they currently appear to be.
Second, simulation methods like those of Mugal et al.
(2014) could form the basis for quantifying (rather than
merely fearing) the false-positive problem. One simple approach could simulate population divergence under realistic
values of population-model parameters (including population size and divergence time) for different underlying values of the selection coefficient (γ) and ask how weak
purifying selection (γ < 0) would need to be in order to
generate a frequency distribution of ω values in which the
confidence limits include the ω value for the observed data
(and constitute a false inference of positive selection rather
than purifying selection). Alternatively, a more complex
approach could simulate the coalescent under a prior distribution of values for the selection coefficient and other
population-model parameters, using estimates of dN and dS
as summary statistics to characterize the sequence alignment resulting from each simulation. Simulated population
histories and their dN and dS values could then be compared
to observed sequence data from population samples, and
used in an approximate Bayesian computation (ABC) of the
posterior probability distributions for selection, gene flow,
and other parameter values. Analogous ABC methods are in
widespread use in other areas of population genetics, including efforts to characterize gene flow and divergence
times in large, complex population models with many parameters where direct optimization methods have proven
difficult to apply (Beaumont, 2010). Recent steps toward a
more comprehensive ABC-based approach to quantifying
selection and demography seem very promising (e.g., Lopes
et al., 2014).
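The logic of the ABC approach described above can be sketched with a minimal rejection sampler in Python. This is a toy illustration only: the simulator draws synonymous and nonsynonymous counts from Poisson distributions rather than simulating a coalescent with demography, and the prior, tolerance, expected synonymous count (mu), and observed counts are all hypothetical values chosen for the example.

```python
import math
import random


def sample_poisson(lam, rng):
    """Draw a Poisson variate by Knuth's multiplication method
    (adequate for the small means used in this sketch)."""
    limit = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_counts(omega, mu, rng):
    """Toy simulator: synonymous count ~ Poisson(mu) and
    nonsynonymous count ~ Poisson(omega * mu). Returns (N, S)."""
    return sample_poisson(omega * mu, rng), sample_poisson(mu, rng)


def abc_rejection(observed, mu=20.0, n_draws=20000, tol=2, seed=1):
    """Keep prior draws of omega whose simulated (N, S) summary statistics
    fall within tol of the observed ones (summed absolute differences)."""
    rng = random.Random(seed)
    accepted = []
    for _ in range(n_draws):
        omega = rng.uniform(0.0, 2.0)  # assumed uniform prior on omega
        n_sim, s_sim = simulate_counts(omega, mu, rng)
        if abs(n_sim - observed[0]) + abs(s_sim - observed[1]) <= tol:
            accepted.append(omega)
    return accepted


# Hypothetical observed summary statistics: 10 nonsynonymous, 20 synonymous differences.
posterior = abc_rejection(observed=(10, 20))
posterior_mean = sum(posterior) / len(posterior)
# Fraction of accepted draws with omega > 1, an approximate posterior probability.
prob_positive = sum(w > 1.0 for w in posterior) / len(posterior)
```

In a realistic application the simulator would be replaced by a coalescent simulation that includes divergence time, population sizes, gene flow, and selection, and the accepted draws would approximate the joint posterior distribution of those parameters rather than of a single ω value.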
Third, the identification of specific codons or lineages as
targets of positive selection (such as ligands and their receptor binding sites encoded in gamete recognition genes)
can be treated as a working hypothesis, and can be tested in
field or laboratory experiments of the fitness consequences
of variation at those codons. For example, sperm bindin
divergence predicts reproductive isolation between congeneric sea urchin species (McCartney and Lessios, 2004;
Zigler et al., 2005), and sexual conflict within sea urchin
populations has been proposed as the selective agent responsible for population divergence (reviewed by Palumbi,
2009; Lessios, 2011), but few experimentalists have pursued bindin variation and its consequences for reproductive
isolation within and among conspecific sea urchin populations (Palumbi, 1999; Levitan and Ferrell, 2006). More such
studies are needed (a need emphasized in other reviews as
well; Anisimova and Liberles, 2012), particularly to test the
reliability of inferences about selection acting on early
stages of population divergence leading to new species
formation.
Acknowledgments
Thanks to Ken Halanych for the invitation to write this
essay. I am grateful to several reviewers of this and other
manuscripts whose comments helped me to develop a fuller
appreciation of the population genetics of codon evolution
under selection. My recent financial support has come from
the Natural Sciences and Engineering Research Council,
Genome BC, and Simon Fraser University.
Literature Cited
Anisimova, M. 2012. Parametric models of codon evolution. Pp. 12–33
in Codon Evolution, G. M. Cannarozzi and A. Schneider, eds. Oxford
University Press, Oxford.
Anisimova, M., and D. A. Liberles. 2012. Detecting and understanding
natural selection. Pp. 73–96 in Codon Evolution, G. M. Cannarozzi and
A. Schneider, eds. Oxford University Press, Oxford.
Arbogast, B. S., S. V. Edwards, J. Wakeley, P. Beerli, and J. B.
Slowinski. 2002. Estimating divergence times from molecular data
on phylogenetic and population genetic timescales. Annu. Rev. Ecol.
Syst. 33: 707–740.
Bahlo, M., and R. C. Griffiths. 2000. Inference from gene trees in a
subdivided population. Theor. Popul. Biol. 57: 79–95.
Barreto, F. S., G. W. Moy, and R. S. Burton. 2011. Interpopulation
patterns of divergence and selection across the transcriptome of the
copepod Tigriopus californicus. Mol. Ecol. 20: 560–572.
Beaumont, M. A. 2010. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41: 379–406.
Biermann, C. H. 1998. The molecular evolution of sperm bindin in six
species of sea urchins (Echinoida: Strongylocentrotidae). Mol. Biol.
Evol. 15: 1761–1771.
Bierne, N., D. Roze, and J. J. Welch. 2013. Pervasive selection or is
it. . .? Why are FST outliers sometimes so frequent? Mol. Ecol. 22:
2061–2064.
Bird, C. E., I. Fernandez-Silva, D. J. Skillings, and R. J. Toonen. 2012.
Sympatric speciation in the post “Modern Synthesis” era of evolutionary biology. Evol. Biol. 39: 158–180.
Butlin, R., A. Debelle, C. Kerth, R. R. Snook, L. W. Beukeboom, R. F.
Castillo Cajas, W. Diao, M. E. Maan, S. Paolucci, F. J. Weissing et
al. 2012. What do we need to know about speciation? Trends Ecol.
Evol. 27: 27–39.
Cahais, V., P. Gayral, G. Tsagkogeorga, J. Melo-Ferreira, M. Ballenghien, L. Weinert, Y. Chiari, K. Belkhir, V. Ranwez, and N. Galtier.
2012. Reference-free transcriptome assembly in non-model animals
from next-generation sequencing data. Mol. Ecol. Res. 12: 834–845.
Carneiro, M., F. W. Albert, J. Melo-Ferreira, N. Galtier, P. Gayral,
J. A. Blanco-Aguiar, R. Villafuerte, M. W. Nachman, and N.
Ferrand. 2012. Evidence for widespread positive and purifying
selection across the European rabbit (Oryctolagus cuniculus) genome.
Mol. Biol. Evol. 29: 1837–1849.
Cheviron, Z. A., and R. T. Brumfield. 2012. Genomic insights into
adaptation to high-altitude environments. Heredity 108: 354–361.
Claw, K. G., and W. J. Swanson. 2012. Evolution of the egg: new
findings and challenges. Annu. Rev. Genomics Hum. Genet. 13: 109–125.
Coyne, J. A., and H. A. Orr. 2004. Speciation. Sinauer Associates,
Sunderland, MA.
Crisci, J. L., Y. P. Poh, A. Bean, A. Simkin, and J. D. Jensen. 2012.
Recent progress in polymorphism-based population genetic inference.
J. Hered. 103: 287–296.
De Wit, P., and S. R. Palumbi. 2013. Transcriptome-wide polymorphisms of red abalone (Haliotis rufescens) reveal patterns of gene flow
and local adaptation. Mol. Ecol. 22: 2884–2897.
DeWoody, J. A., K. C. Abts, A. L. Fahey, Y. Z. Ji, S. J. A. Kimble, N. J.
Marra, B. K. Wijayawardena, and J. R. Willoughby. 2013. Of
contigs and quagmires: next-generation sequencing pitfalls associated
with transcriptomic studies. Mol. Ecol. Res. 13: 551–558.
Donnelly, P., and S. Tavaré. 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421.
Edwards, S. V., and P. Beerli. 2000. Perspective: gene divergence,
population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54: 1839–1854.
Ellegren, H. 2014. Genome sequencing and population genomics in
non-model organisms. Trends Ecol. Evol. 29: 51–63.
Evans, J. P., and C. D. H. Sherman. 2013. Sexual selection and the
evolution of egg-sperm interactions in broadcast-spawning invertebrates. Biol. Bull. 224: 166–183.
Ewing, G., and J. Hermisson. 2010. MSMS: a coalescent simulation
program including recombination, demographic structure and selection
at a single locus. Bioinformatics 26: 2064–2065.
Faria, R., S. Renaut, J. Galindo, C. Pinho, J. Melo-Ferreira, M. Melo,
F. Jones, W. Salzburger, D. Schluter, and R. Butlin. 2014. Advances in ecological speciation: an integrative approach. Mol. Ecol. 23:
513–521.
Feder, J. L., S. P. Egan, and P. Nosil. 2012. The genomics of speciation-with-gene-flow. Trends Genet. 28: 342–350.
Feldmeyer, B., C. W. Wheat, N. Krezdorn, B. Rotter, and M. Pfenninger. 2011. Short read Illumina data for the de novo assembly of
a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC
Genomics 12: 317.
Felsenstein, J. 2006. Accuracy of coalescent likelihood estimates: Do
we need more sites, more sequences, or more loci? Mol. Biol. Evol. 23:
691–700.
Foltz, K. R., J. A. Partin, and W. J. Lennarz. 1993. Sea urchin egg
receptor for sperm: sequence similarity of binding domain and hsp70.
Science 259: 1421–1425.
Fraser, B. A., I. W. Ramnarine, and B. D. Neff. 2010. Selection at the
MHC class IIB locus across guppy (Poecilia reticulata) populations.
Heredity 104: 155–167.
Gayral, P., J. Melo-Ferreira, S. Glémin, N. Bierne, M. Carneiro, B.
Nabholz, J. M. Lourenco, P. C. Alves, M. Ballenghien, N. Faivre et
al. 2013. Reference-free population genomics from next-generation
transcriptome data and the vertebrate-invertebrate gap. PLoS Genet. 9:
e1003457.
Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide
substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11:
725–736.
Hahn, M. W. 2008. Toward a selection theory of molecular evolution.
Evolution 62: 255–265.
Hart, M. W. 2013. Structure and evolution of the sea star egg receptor
for sperm bindin. Mol. Ecol. 22: 2143–2156.
Hart, M. W., and A. Foster. 2013. Highly expressed genes in gonads
of the bat star Patiria miniata: gene ontology, expression differences,
and gamete recognition loci. Invertebr. Biol. 132: 241–250.
Hart, M. W., and P. B. Marko. 2010. It’s about time: divergence,
demography, and the evolution of developmental modes in marine
invertebrates. Integr. Comp. Biol. 50: 643–661.
Hart, M. W., I. Popovic, and R. B. Emlet. 2012. Low rates of bindin
codon evolution in lecithotrophic Heliocidaris sea urchins. Evolution
66: 1709–1721.
Hart, M. W., J. M. Sunday, I. Popovic, K. J. Learning, and C. M.
Konrad. 2014. Incipient speciation of sea star populations by adaptive gamete recognition coevolution. Evolution 68: 1294–1305.
Hey, J. 2010a. Isolation with migration models for more than two
populations. Mol. Biol. Evol. 27: 905–920.
Hey, J. 2010b. The divergence of chimpanzee species and subspecies as
revealed in multipopulation isolation-with-migration analyses. Mol.
Biol. Evol. 27: 921–933.
Hey, J., and R. Nielsen. 2004. Multilocus methods for estimating
population sizes, migration rates and divergence time, with applications
to the divergence of Drosophila pseudoobscura and D. persimilis.
Genetics 167: 747–760.
Hey, J., and R. Nielsen. 2007. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population
genetics. Proc. Natl. Acad. Sci. USA 104: 2785–2790.
Hey, J., W. M. Fitch, and F. Ayala, eds. 2005. Systematics and the
Origin of Species: On Ernst Mayr’s 100th Anniversary. National Academies Press, Washington, DC.
Hirohashi, N., N. Kamei, H. Kubo, H. Sawada, M. Matsumoto, and M.
Hoshi. 2008. Egg and sperm recognition systems during fertilization. Dev. Growth Differ. 50: S221–S238.
Hornett, E. A., and C. W. Wheat. 2012. Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a
scaffold and the utility of evolutionary divergent genomic reference
species. BMC Genomics 13: 361.
Hudson, R. R. 1990. Gene genealogies and the coalescent process. Oxf.
Surv. Evol. Biol. 7: 1–44.
Keever, C. C., J. Sunday, J. B. Puritz, J. A. Addison, R. J. Toonen,
R. K. Grosberg, and M. W. Hart. 2009. Discordant distribution of
populations and genetic variation in a sea star with high dispersal
potential. Evolution 63: 3214–3227.
Kingman, J. F. C. 1982. On the genealogy of large populations. J. Appl.
Probab. 19A: 27–43.
Koester, J. A., W. J. Swanson, and E. V. Armbrust. 2013. Positive
selection within a diatom species acts on putative protein interactions
and transcriptional regulation. Mol. Biol. Evol. 30: 422–434.
Kryazhimskiy, S., and J. B. Plotkin. 2008. The population genetics of
dN/dS. PLoS Genet. 4: e1000304.
Kuhner, M. K. 2009. Coalescent genealogy samplers: windows into
population history. Trends Ecol. Evol. 24: 86–93.
Larson, E. L., J. A. Andres, S. M. Bogdanowicz, and R. G. Harrison.
2013. Differential introgression in a mosaic hybrid zone reveals
candidate barrier genes. Evolution 67: 3653–3661.
Lawton-Rauh, A. 2008. Demographic processes shaping genetic variation. Curr. Opin. Plant Biol. 11: 103–109.
Lemay, M. A., D. J. Donnelly, and M. A. Russello. 2013. Transcriptome-wide comparison of sequence variation in divergent ecotypes of
kokanee salmon. BMC Genomics 14: 308.
Lessios, H. A. 2011. Speciation genes in free-spawning marine invertebrates. Integr. Comp. Biol. 51: 456–465.
Lessios, H. A., B. D. Kessing, and J. S. Pearse. 2001. Population
structure and speciation in tropical seas: global phylogeography of the
sea urchin Diadema. Evolution 55: 955–975.
Levitan, D. R., and D. L. Ferrell. 2006. Selection on gamete recognition proteins depends on sex, density, and genotype frequency. Science
312: 267–269.
Lopes, J. S., M. Arenas, D. Posada, and M. A. Beaumont. 2014.
Coestimation of recombination, substitution and molecular adaptation
rates by approximate Bayesian computation. Heredity 112: 255–264.
Marko, P. B., and M. W. Hart. 2011. The complex analytical landscape of gene flow inference. Trends Ecol. Evol. 26: 448–456.
Maroja, L. S., J. A. Andres, and R. G. Harrison. 2009. Genealogical
discordance and patterns of introgression and selection across a cricket
hybrid zone. Evolution 63: 2999–3015.
Mayr, E. 1954. Geographic speciation in tropical echinoids. Evolution
8: 1–18.
McCartney, M. A., and H. A. Lessios. 2004. Adaptive evolution of
sperm bindin tracks egg incompatibility in neotropical sea urchins of
the genus Echinometra. Mol. Biol. Evol. 21: 732–745.
McGovern, T. M., C. C. Keever, C. A. Saski, M. W. Hart, and P. B.
Marko. 2010. Divergence genetics analysis reveals historical population genetic processes leading to contrasting phylogeographic patterns in co-distributed species. Mol. Ecol. 19: 5043–5060.
Metz, E. C., and S. R. Palumbi. 1996. Positive selection and sequence
rearrangements generate extensive polymorphism in the gamete recognition protein bindin. Mol. Biol. Evol. 13: 397–406.
Mugal, C. F., J. B. W. Wolf, and I. Kaj. 2014. Why time matters:
codon evolution and the temporal dynamics of dN/dS. Mol. Biol. Evol.
31: 212–231.
Muir, G., C. J. Dixon, A. L. Harper, and D. A. Filatov. 2012. Dynamics of drift, gene flow, and selection during speciation in Silene.
Evolution 66: 1447–1458.
Murrell, B., J. O. Wertheim, S. Moola, T. Weighill, I. K. Scheffler, and
S. L. K. Pond. 2012. Detecting individual sites subject to episodic
diversifying selection. PLoS Genet. 8: e1002764.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates,
with application to the chloroplast genome. Mol. Biol. Evol. 11: 715–724.
Narum, S. R., and J. E. Hess. 2011. Comparison of FST outlier tests for
SNP loci under selection. Mol. Ecol. Res. 11: 184–194.
Nielsen, R. 2005. Molecular signatures of natural selection. Annu. Rev.
Genet. 39: 197–218.
Nordborg, M. 2003. Coalescent theory. Pp. 602–635 in Handbook of
Statistical Genetics, D. J. Balding, M. Bishop, and C. Canning, eds.
John Wiley, Hoboken, NJ.
Nosil, P. 2008. Speciation with gene flow could be common. Mol. Ecol.
17: 2103–2106.
Nosil, P., and J. L. Feder. 2012. Genomic divergence during speciation:
causes and consequences. Philos. Trans. R. Soc. Lond. B 367: 332–342.
Nosil, P., and J. L. Feder. 2013. Genome evolution and speciation:
toward quantitative descriptions of pattern and process. Evolution 67:
2461–2467.
Nosil, P., D. J. Funk, and D. Ortiz-Barrientos. 2009. Divergent
selection and heterogeneous genomic divergence. Mol. Ecol. 18: 375–402.
Osborne, O. G., T. E. Batstone, S. J. Hiscock, and D. A. Filatov. 2013.
Rapid speciation with gene flow following the formation of Mt. Etna.
Genome Biol. Evol. 5: 1704–1715.
Palumbi, S. R. 1999. All males are not created equal: fertility differences depend on gamete recognition polymorphisms in sea urchins.
Proc. Natl. Acad. Sci. USA 96: 12632–12637.
Palumbi, S. R. 2009. Speciation and the evolution of gamete recognition genes: pattern and process. Heredity 102: 66–76.
Palumbi, S. R., and H. A. Lessios. 2005. Evolutionary animation: How
do molecular phylogenies compare to Mayr’s reconstruction of speciation in the sea? Proc. Natl. Acad. Sci. USA 102: 6566–6572.
Pannell, J. R. 2003. Coalescence in a metapopulation with recurrent
local extinction and recolonization. Evolution 57: 949–961.
Pinho, C., and J. Hey. 2010. Divergence with gene flow: models and
data. Annu. Rev. Ecol. Evol. Syst. 41: 215–230.
Pluzhnikov, A., and P. Donnelly. 1996. Optimal sequencing strategies
for surveying molecular genetic diversity. Genetics 144: 1247–1262.
Pond, S. L. K., B. Murrell, M. Fourment, S. D. W. Frost, W. Delport,
and K. Scheffler. 2011. A random effects branch-site model for
detecting episodic diversifying selection. Mol. Biol. Evol. 28: 3033–3043.
Popovic, I., P. B. Marko, J. P. Wares, and M. W. Hart. 2014. Selection and demographic history shape the molecular evolution of the
gamete compatibility protein bindin in Pisaster sea stars. Ecol. Evol. 4:
1567–1588.
Puillandre, N., A. Lambert, S. Brouillet, and G. Achaz. 2012. ABGD,
automatic barcode gap discovery for primary species delimitation. Mol.
Ecol. 21: 1864–1877.
Rundle, H. D., and P. Nosil. 2005. Ecological speciation. Ecol. Lett. 8:
336–352.
Seehausen, O., R. K. Butlin, I. Keller, C. E. Wagner, J. W. Boughman,
P. A. Hohenlohe, C. L. Peichel, G.-P. Saetre, C. Bank, A. Brännström et al. 2014. Genomics and the origin of species. Nat. Rev.
Genet. 15: 176–192.
Soria-Carrasco, V., Z. Gompert, A. A. Comeault, T. E. Farkas, T. L.
Parchman, J. S. Johnston, C. A. Buerkle, J. L. Feder, J. Bast, T.
Schwander et al. 2014. Stick insect genomes reveal natural selection’s role in parallel speciation. Science 344: 738–742.
Sousa, V. C., M. Carneiro, N. Ferrand, and J. Hey. 2013. Identifying
loci under selection against gene flow in isolation-with-migration models. Genetics 194: 211–233.
Stajich, J. E., and M. W. Hahn. 2005. Disentangling the effects of
demography and selection in human history. Mol. Biol. Evol. 22:
63–73.
Swanson, W. J., and V. D. Vacquier. 2002. The rapid evolution of
reproductive proteins. Nat. Rev. Genet. 3: 137–144.
Sunday, J. M., and M. W. Hart. 2013. Sea star populations diverge by
positive selection at a sperm-egg compatibility locus. Ecol. Evol. 3:
640–654.
Vacquier, V. D. 2012. The quest for the sea urchin egg receptor for
sperm. Biochem. Biophys. Res. Comm. 425: 583–587.
Vacquier, V. D., and W. J. Swanson. 2011. Selection in the rapid
evolution of gamete recognition proteins in marine invertebrates. Cold
Spring Harb. Perspect. Biol. 2011: a002931.
Wheat, C. W. 2010. Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica
138: 433–451.
Wilson, D. J., and G. McVean. 2006. Estimating diversifying selection
and functional constraint in the presence of recombination. Genetics
172: 1411–1425.
Wlasiuk, G., S. Khan, W. M. Switzer, and M. W. Nachman. 2009. A
history of recurrent positive selection at the Toll-like receptor 5 in
primates. Mol. Biol. Evol. 26: 937–949.
Yang, Z., and J. P. Bielawski. 2000. Statistical methods for detecting
molecular adaptation. Trends Ecol. Evol. 15: 496–503.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages.
Mol. Biol. Evol. 19: 908 –917.
Yang, Z., W. S. W. Wong, and R. Nielsen. 2005. Bayes empirical
Bayes inference of amino acid sites under positive selection. Mol. Biol.
Evol. 22: 1107–1118.
Zigler, K. S., M. A. McCartney, D. R. Levitan, and H. A. Lessios. 2005.
Sea urchin bindin divergence predicts gamete compatibility. Evolution
59: 2399–2404.