Reference: Biol. Bull. 227: 133–145. (October 2014)
© 2014 Marine Biological Laboratory

Models of Selection, Isolation, and Gene Flow in Speciation

MICHAEL W. HART*
Department of Biological Sciences, Simon Fraser University, Burnaby, British Columbia, V5A 1S6, Canada

Abstract. Many marine ecologists aspire to use genetic data to understand how selection and demographic history shape the evolution of diverging populations as they become reproductively isolated species. I propose combining two types of genetic analysis focused on this key early stage of the speciation process to identify the selective agents directly responsible for population divergence. Isolation-with-migration (IM) models can be used to characterize reproductive isolation between populations (low gene flow), while codon models can be used to characterize selection for population differences at the molecular level (especially positive selection for high rates of amino acid substitution). Accessible transcriptome sequencing methods can generate the large quantities of data needed for both types of analysis. I highlight recent examples (including our work on fertilization genes in sea stars) in which this confluence of interest, models, and data has led to taxonomically broad advances in understanding marine speciation at the molecular level. I also highlight new models that incorporate both demography and selection: simulations based on these theoretical advances suggest that polymorphisms shared among individuals (a key source of information in IM models) may lead to false-positive evidence of selection (in codon models), especially during the early stages of population divergence and speciation that are most in need of study.
The false-positive problem may be resolved through a combination of model improvements plus experiments that document the phenotypic and fitness effects of specific polymorphisms for which codon models and IM models indicate selection and reproductive isolation (such as genes that mediate sperm-egg compatibility at fertilization).

Introduction

An important problem in biodiversity research is to understand the ecological and evolutionary origins of reproductive isolation and the formation of biological species (Coyne and Orr, 2004; Hey et al., 2005). Although diversity evolves at both higher and lower levels in the hierarchy of life, the adaptive divergence of reproductively isolated populations to form new species is fundamentally important because this process can set groups of organisms onto independent evolutionary trajectories from which they can no longer influence each other directly through mating and recombination. Key elements of that research program (Butlin et al., 2012) include analyses of (1) the targets of selection acting on diverging lineages of organisms as they adapt to different environments or habitats, and (2) the barriers to gene flow between lineages that would otherwise oppose their divergence under selection for phenotypic differences. Understanding the targets of selection at the molecular level is critical for linking genotypic divergence to phenotypic variation (Yang and Bielawski, 2000; Nielsen, 2005); quantifying gene flow between diverging lineages is critical for documenting the history of reproductive isolation between lineages and their progress on the continuum from conspecific populations to biological species (Hey and Nielsen, 2004; Rundle and Nosil, 2005; Seehausen et al., 2014).
Studies of adaptation and gene flow in the ocean have been especially useful and interesting because many marine species have broad geographic ranges (e.g., spanning ocean basins), and ocean currents seem likely to promote gene flow among populations (especially for organisms that spend weeks or months developing as larvae in the marine plankton), such that barriers to marine gene flow are often not readily apparent (e.g., Mayr, 1954; Lessios et al., 2001; Palumbi and Lessios, 2005). Moreover, the relatively simple mating systems of many marine plants and animals with broadcast spawning of planktonic gametes include a small number of ligand and receptor molecules expressed in sperm and eggs that regulate interaction and binding between gametes and determine the specificity of mating between males and females (Hirohashi et al., 2008). These gamete recognition molecules have been a particular focus of both functional and genetic analyses because they represent a relatively simple basis for the evolution of reproductive isolation and speciation in marine organisms that lack complex behavioral interactions between male and female individuals (compared to organisms like terrestrial vertebrates and arthropods), and genetic analyses suggest that such genes are frequent targets of selection during speciation (Palumbi, 2009; Lessios, 2011).

Received 29 March 2014; accepted 21 June 2014.
* To whom correspondence should be addressed. E-mail: mwhart@sfu.ca
Abbreviations: IM, isolation-with-migration; RNA-Seq, RNA sequencing (whole transcriptome shotgun sequencing); SNP, single nucleotide polymorphism.

This content downloaded from 088.099.165.207 on May 05, 2017 11:54:40 AM All use subject to University of Chicago Press Terms and Conditions (http://www.journals.uchicago.edu/t-and-c).

In these ways, marine speciation has been seen as a challenging problem with some potentially accessible solutions, one in which both the demography of speciation and its adaptive basis could be understood at the molecular level, and evolutionary ecologists have long sought to understand its origins and mechanisms (Evans and Sherman, 2013). The roles of selection and gene flow during speciation— and their inference from genetic data— have been recently and extensively reviewed (Lawton-Rauh, 2008; Nosil et al., 2009; Bird et al., 2012; Crisci et al., 2012; Feder et al., 2012; Nosil and Feder, 2012, 2013; Faria et al., 2014). In this essay I focus on two specific methods, one for modeling selection at the level of codons in protein-coding sequences, and one for modeling gene flow within the context of the demographic history of reproductive isolation at the level of nucleotide sequences. The two types of analysis share in common a key modeling feature (the use of insights from gene tree structure and variation) and two key data requirements (many loci, few samples of individual organisms). The similarities in their data requirements and the complementary nature of the insights from these two types of analysis together form an argument for combining these two methods in analyses of data from next-generation sequencing studies. The essay and its main argument are motivated by the interests and efforts of organismal biologists and empiricists (like myself) to use quantitative models and software from population-genetics theory to study the formation of new species, and by the increasing accessibility of genome-scale sequence data for identifying the targets of selection at the molecular level leading to adaptation and speciation in non-model organisms. The following section (Background) introduces those types of analysis for readers who are not already familiar with population-genetic methods based on large samples of gene trees. 
More experienced readers could go directly to the subsequent section (Recent Progress) that highlights examples of selection and gene flow analyses in speciation using transcriptome data from non-model organisms, including a detailed example from our work on gamete recognition in sea stars. The few available examples suggest that combined analyses of adaptation and demographic history from transcriptome sequencing may be a surprisingly accessible source of insight into the causes of population divergence and speciation. However, the empirical examples and new theoretical model developments also point toward some potentially significant limitations of this combined approach caused by false positives in analyses of selection on coding sequences. In the closing section (Future Prospects) I review this false-positive problem and some ways in which it might be resolved in new studies of selection and gene flow during speciation.

Background: Selection and Gene Flow in Speciation

Coalescent models of genetic diversity and disparity

New sequencing technologies have greatly increased our ability to characterize genomic variation underlying the evolution of reproductive isolation among natural populations of virtually any marine organism (Claw and Swanson, 2012; Ellegren, 2014). Elements of that variation include both diversity of DNA sequences and disparity among them. The mutational processes that generate new sequence diversity include point mutation of single nucleotides, duplication, deletion, or gene conversion. These processes operate on relatively short time scales within lineages of cells and organisms, and give rise to new allelic variants that differ from an original or ancestral allele by as little as one nucleotide change. In contrast, the processes that generate sequence disparity (greater than that created by single mutations) operate on longer time scales at the level of populations and species, and include genetic drift, selection, and gene flow.
When their effects are integrated over time, these two sets of processes produce a pattern of ancestor-descendant relationships that can be represented by gene trees or genealogies, and the information content of such gene trees has many practical applications (e.g., in molecular taxonomy; Puillandre et al., 2012). The best methods for inferring the effects of selection and gene flow on new species formation, and distinguishing those effects from other processes (such as mutation), are based on the concept of the coalescent: the pattern of merging or coalescence of alleles backward in time from the present to a common ancestral allele at some time in the past (Kingman, 1982; Hudson, 1990). These methods characterize gene genealogies, fit parameter values (including mutation, population size, gene flow, and selection) for a population model, and use likelihood or Bayesian methods to identify the best-fit population model that can account for the underlying gene tree structure (Bahlo and Griffiths, 2000; Pannell, 2003). The concept and models can be used to simulate genealogies under different combinations of population-model parameters, and to illustrate their individual or combined effects on expected patterns of gene trees that might be observed in empirical studies (Ewing and Hermisson, 2010).

Selection and gene flow in genealogies

Among several coalescent population-genetic methods that have been developed to model the evolution of genetic diversity and disparity associated with new species formation, two are especially informative and broadly useful. Both methods use gene trees of diverging lineages to extract useful insights into speciation processes.
Codon models of selection analyze disparity among protein-coding DNA sequences, and they make inferences about the sources of disparity on the basis of the relative rates of nonsynonymous (dN) and synonymous nucleotide substitutions (dS) that do or do not alter the predicted amino acid sequence (Yang and Bielawski, 2000; Anisimova, 2012). By mapping substitutions onto a gene tree of relationships among haplotypes, such models can characterize the relative contributions of purifying selection or positive Darwinian selection as the ratio of those relative rates ω = dN/dS. If the rate of accumulation of synonymous substitution differences among sequences is mainly attributed to the accumulation of new mutations over time, and if those substitutions are largely invisible to selection on protein function, then a higher (or lower) rate of accumulation of nonsynonymous differences can be attributed to the effects of selection favoring (or opposing) new amino acid variants. In particular, several classes of such codon models (sometimes called branch sites models; Yang and Nielsen, 2002) can identify the effects of positive selection acting on particular parts of a sequence alignment or on particular lineages within a gene tree. In comparison to other population-genetic methods for detecting a signal of selection at the molecular level based on the relative frequencies of intraspecific polymorphisms (Anisimova and Liberles, 2012), branch sites models have the significant advantage that they can be used to identify specific molecular traits (codons with ω > 1 among some lineages) or specific branches in the gene tree (with ω > 1 at some codons) as the targets of positive selection for divergence among lineages.
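The distinction between synonymous and nonsynonymous differences that underlies ω can be illustrated with a minimal sketch. It assumes the standard genetic code and two aligned, in-frame, gap-free coding sequences, and it only tallies codons that differ at exactly one nucleotide. The function name `classify_codon_differences` is hypothetical. The sketch returns raw counts, not dN, dS, or ω: it does not normalize by the numbers of synonymous and nonsynonymous sites, does not use a gene tree, and is not a likelihood-based codon or branch sites model.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order
# (first base varies slowest, third base fastest).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_codon_differences(seq_a, seq_b):
    """Count (synonymous, nonsynonymous) single-nucleotide codon differences.

    Both sequences must be aligned, in frame, the same length, and free of
    gaps or ambiguity codes. Codons that differ at two or three positions
    are skipped, because the order of the changes is unknown.
    """
    seq_a, seq_b = seq_a.upper(), seq_b.upper()
    if len(seq_a) != len(seq_b) or len(seq_a) % 3 != 0:
        raise ValueError("sequences must have equal length divisible by 3")

    synonymous = nonsynonymous = 0
    for i in range(0, len(seq_a), 3):
        codon_a, codon_b = seq_a[i:i + 3], seq_b[i:i + 3]
        n_diff = sum(x != y for x, y in zip(codon_a, codon_b))
        if n_diff != 1:
            continue  # identical codon, or multiple changes in one codon
        if CODON_TABLE[codon_a] == CODON_TABLE[codon_b]:
            synonymous += 1
        else:
            nonsynonymous += 1
    return synonymous, nonsynonymous


# GCT -> GCC leaves alanine unchanged; AAA -> AGA changes lysine to arginine.
print(classify_codon_differences("ATGGCTAAA", "ATGGCCAGA"))  # (1, 1)
```

Turning such counts into rates requires dividing by the numbers of synonymous and nonsynonymous sites and correcting for multiple substitutions, which is part of what the codon models cited above handle.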
This potential to link genotypes (at the nucleotide level) with phenotypes under selection (at the amino acid level) may partly account for the wide application of codon models to molecular studies of adaptive divergence associated with speciation (Yang and Bielawski, 2000; Nielsen, 2005).

Isolation-with-migration (IM) models analyze disparity among any type of DNA sequence (including protein-coding genes), and make inferences about the sources of disparity on the basis of estimates of population-model parameters including gene flow, population sizes, and population divergence times (Hey and Nielsen, 2004, 2007; Hey, 2010a). In particular, IM methods use coalescent models of gene tree evolution to characterize the demographic history of population samples. By fitting the population model and its parameters to sequence data for multiple unlinked loci, IM methods can independently model multiple population demographic parameters. In comparison to other population-genetic methods for inferring demographic history, IM models have the significant advantage that they can distinguish among the contributions of low gene flow, reduced population size, and old divergence time to the evolution of genetic polymorphisms within populations and differentiation between populations (Marko and Hart, 2011). One of the most important insights into speciation from such models is the ability to characterize the magnitude of gene flow between populations independently of other population parameters that affect disparity, and (by comparison among analyses of different populations and genes) to identify those populations or genes with evidence of low or zero gene flow that might reflect divergent selection (Sousa et al., 2013).
Like the particular advantages of codon models for understanding selection at the molecular level, the potential to model the specific role of gene flow variation in speciation (separately from the role of other population demographic parameters) may partly account for the widespread use of IM models in analyses of phylogeographic structure and population divergence (Nosil, 2008; Pinho and Hey, 2010). Both approaches include an explicit model of DNA sequence changes leading to diversity, and each includes parameters for some but not all of the processes expected to shape the evolution of sequence disparity (selection in codon models, genetic drift and gene flow in IM models). Thus, the two approaches can use the same sequence data to give complementary insight into the evolutionary causes of population divergence and speciation. The two approaches also share similar data requirements: in both cases the critical model parameters (especially the selection parameter ω in codon models, and the gene flow parameter m in IM models) can be estimated from fitting the model to small alignments of relatively few samples (individuals or diverging populations or species) for each gene or locus (e.g., Carneiro et al., 2012).

The problem: we need many gene trees

Both IM models and codon models depend on information from gene trees to infer population-model parameters. Among theoreticians it has long been recognized that recombination (e.g., between loci on different chromosomes, or separated by recombination hotspots) obscures genealogical information about ancestor-descendant relationships among alleles, and that the coalescent applies only to haplotypes or linked groups of nonrecombining polymorphisms (Donnelly and Tavaré, 1995). As a result, unlinked loci experience the effects of mutation, selection, and other processes independently of each other, and the coalescent pattern of relationships among alleles is expected to differ among unlinked loci due to stochastic effects of sampling or genetic drift (Nordborg, 2003). For empiricists, the most important consequence of recombination is that the demographic history of a population cannot reliably be characterized from one or a few parts of the genome because each has evolved independently under the coalescent. Instead the population history, and individual population-model parameters, must be estimated as a statistical property of a sample from many parts of the genome. Like all sampling problems, the quality of such estimates and the insight they give into the causes of divergence and speciation are now well known to depend critically on the number of loci analyzed and the genomic breadth of the study (Felsenstein, 2006; Nosil et al., 2009; Feder et al., 2012).

Figure 1. Structure of coalescent gene trees depends on stochastic variation, gene flow, and selection. (A) Three gene trees (a1–a3) from coalescent simulations using the software program msms ver. 1.3 (Ewing and Hermisson, 2010) for two structured populations with no selection and high gene flow (Nm = 10). (B) Gene trees (b1–b3) from a second simulation with low gene flow (Nm = 0.5). (C) Gene trees (c1–c3) from a third simulation with high gene flow as in A but with selection against one of two alleles. In each of the three simulations only six haplotypes for each population were output in the gene trees for comparison to gene trees from small population samples of sea star gamete recognition genes (Fig. 2). Other details of the simulation conditions are given in the main text. Each output tree was drawn in FigTree ver. 1.4 using midpoint rooting and a rescaled root depth of 1.0 units.
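The dependence of parameter estimates on the number of loci can be illustrated with a minimal sketch of the standard neutral coalescent for a single panmictic population. This is a deliberate simplification: it includes no population structure, gene flow, selection, or mutation, and it is not the msms simulation behind Figure 1. The function names and the choice of 3 versus 100 loci are assumptions made for illustration. Each locus gets an independent time to the most recent common ancestor (in coalescent time units), and the sketch compares how much the among-locus average varies between hypothetical studies that sample few or many unlinked loci.

```python
import random
import statistics


def simulate_tmrca(n_haplotypes, rng):
    """Time to the most recent common ancestor for one locus.

    While k lineages remain, the waiting time to the next coalescent event
    is exponential with rate k * (k - 1) / 2 (coalescent time units).
    """
    total_time = 0.0
    for k in range(n_haplotypes, 1, -1):
        total_time += rng.expovariate(k * (k - 1) / 2)
    return total_time


def mean_tmrca_per_study(n_loci, n_studies, n_haplotypes=12, seed=1):
    """Average TMRCA across n_loci unlinked loci, for each simulated study."""
    rng = random.Random(seed)
    return [
        statistics.mean(
            [simulate_tmrca(n_haplotypes, rng) for _ in range(n_loci)]
        )
        for _ in range(n_studies)
    ]


few_loci = mean_tmrca_per_study(n_loci=3, n_studies=500)
many_loci = mean_tmrca_per_study(n_loci=100, n_studies=500)

# Spread of the study-level estimate; it shrinks as more loci are sampled.
print("SD with 3 loci:  ", round(statistics.stdev(few_loci), 3))
print("SD with 100 loci:", round(statistics.stdev(many_loci), 3))
```

Under this simple model the expected value is 2(1 - 1/n) for n sampled haplotypes, but any single locus can fall far from it, which is the sampling problem that motivates analyzing many loci.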
Theory shows that small samples of haplotypes (on the order of 10 per population) will typically capture most of the underlying coalescent population history that generated the gene tree for a single locus, so that relatively few individuals can be used effectively (Pluzhnikov and Donnelly, 1996; Felsenstein, 2006). However, gene trees drawn from the same coalescent history can be highly variable among replicate samples in nature or among replicate simulations. For example, I used the popular coalescent software program msms ver. 1.3 (Ewing and Hermisson, 2010) to simulate genealogies drawn from two populations under a neutral model without selection (using the command line string ./msms 12 3 -t 1 -T -I 2 6 6 10). Three replicate genealogies (i.e., three loci with 12 haplotypes sampled for each gene) from this single simulation are shown in Figure 1A. In this example, gene flow was relatively high (Nm = 10) and the population size parameter was realistic for typical mutation rates (-t 1). The extent of population differentiation is shown by the clustering of gene copies from the two populations (labeled x, o) into clades. However, the most striking result is the variation in gene tree structure (the coalescent depth of nodes) among three loci from the same simulation (e.g., genealogies a1 vs. a3 in Fig. 1). This dependence on genome sampling is a feature of all coalescent methods, but its implications for empirical inference are clearest in the case of IM models. In a second example, I simulated the expected effects of low gene flow (Nm = 0.5) but otherwise identical conditions (using the command line string ./msms 12 3 -t 1 -T -I 2 6 6 0.5), and compared the genealogies (Fig. 1B) to those expected under a neutral model with high gene flow (Fig. 1A). For some loci, reproductive isolation and reduced gene flow caused strong phylogeographic structure due to genetic drift (e.g., genealogy b1 in Fig. 1, with reciprocally monophyletic clades of haplotypes in each population), but at other loci in the same simulated population history there was still extensive sharing of haplotype lineages between the two populations despite the effects of low gene flow and genetic drift (e.g., genealogies b2, b3 in Fig. 1). As a consequence of this variation among loci, estimating gene flow and other population-model parameters in an empirical IM analysis for real sequence data may require sampling dozens of loci across the genome in order to distill the signal of population history from the noise associated with stochastic variation (Edwards and Beerli, 2000; Arbogast et al., 2002). Estimating a large number of population-model parameters (especially for studies of many populations) may require information from hundreds of loci (e.g., Hey, 2010b). In such analyses, gene trees for individual loci are treated as separate instances of the coalescent history of the modeled populations, and the estimation of the model parameters is improved by information from both the individual loci and the variance among loci. Similarly in the case of codon models, sequence differences between diverging lineages are assumed to represent fixed substitution differences (rather than polymorphisms within populations). As a consequence, selection can be modeled using a single representative haplotype from each lineage. However, codon models share with IM models the difficulty of distinguishing the effects of selection from stochastic variation.
In a third example, I simulated the expected effects of selection against one of two alleles (using the command line string ./msms 12 3 -t 1 -T -I 2 6 6 10 -SAA 20 -SaA 10 -SF 0.001 -N 10000) but with otherwise identical conditions including high gene flow (Nm = 10), and compared the genealogies (Fig. 1C) to those under a neutral model without selection (Fig. 1A). For some loci, directional selection caused striking population differences (e.g., genealogy c2 in Fig. 1, with a single deep split between two lineages of haplotypes). However, at other loci subject to the same pattern of selection in the same simulation, populations shared several deeply divergent lineages of haplotypes with less striking patterns of population difference (e.g., genealogies c1, c3 in Fig. 1). This inherent variation in genealogical patterns under the coalescent has important implications for the interpretation of codon models of selection leading to speciation. Several forms of codon models explore and test hypotheses about variation in ω values among codons in the alignment or among lineages in the gene tree (Anisimova and Liberles, 2012), but demonstrating that particular genes and molecular traits are the particular targets of selection leading to new species has proven to be difficult: codon models of selection may be relatively conservative (Anisimova, 2012), and demographic variation can produce patterns of codon evolution similar to the effects of selection (e.g., Stajich and Hahn, 2005). Analogous difficulties are also evident among studies that have used indirect methods, for example, based on genome scans of single nucleotide polymorphisms (SNPs) and models of population-genetic differentiation to identify SNPs that are statistical outliers and possibly linked to genomic variants under selection for frequency differences between diverging populations or species (summarized by Bierne et al., 2013).
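The three msms command lines quoted in this section share a common core and differ only in the final value of the -I option and in the trailing selection options. As a minimal sketch, the hypothetical helper below (not part of msms) assembles those strings from named arguments so that the differences among the three simulations are explicit. The argument names follow the descriptions given in the text (12 haplotypes, three loci, -t 1, two populations of six haplotypes, and the gene flow value reported as Nm); the selection options are passed through verbatim without interpretation. The helper only builds the string and does not run the program.

```python
def msms_command(n_haplotypes, n_loci, theta, samples_per_population,
                 gene_flow, extra_options=()):
    """Assemble an msms command line like those quoted in the text.

    All parameter names are illustrative labels, not msms terminology.
    extra_options is appended unchanged (e.g., the selection options).
    """
    args = [
        "./msms", str(n_haplotypes), str(n_loci),
        "-t", str(theta),
        "-T",  # present in all three quoted commands
        "-I", str(len(samples_per_population)),
        *[str(n) for n in samples_per_population],
        str(gene_flow),
    ]
    args.extend(str(option) for option in extra_options)
    return " ".join(args)


high_gene_flow = msms_command(12, 3, 1, (6, 6), 10)
low_gene_flow = msms_command(12, 3, 1, (6, 6), 0.5)
with_selection = msms_command(
    12, 3, 1, (6, 6), 10,
    extra_options=("-SAA", 20, "-SaA", 10, "-SF", 0.001, "-N", 10000),
)

for command in (high_gene_flow, low_gene_flow, with_selection):
    print(command)
```

Keeping the shared arguments in one place makes it easier to confirm that the low gene flow and selection simulations were otherwise identical to the neutral, high gene flow simulation.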
Selection may be widespread across the genome (Hahn, 2008), but distinguishing genes that are the true targets of selection from the statistical noise associated with coalescent variation depends on the breadth of comparisons among genes across the genome.

A solution: many gene trees from RNA-Seq

Because both codon models of selection and isolation-with-migration models of gene flow either explicitly require or inherently benefit from comparison of gene tree structure among many loci across the genome, the two approaches can potentially benefit from the expanded capacity of next-generation sequencing technologies to characterize genome-wide variation. Despite their well-known drawbacks and limits (DeWoody et al., 2013), transcriptome or RNA-Seq methods might be the best source of data for such combined analyses, especially for relatively understudied non-model organisms for which there is no reference genome (Wheat, 2010; Cahais et al., 2012; Gayral et al., 2013). In RNA-Seq workflows for non-model organisms (e.g., Feldmeyer et al., 2011), bulk RNA is extracted from a specific tissue or organ (or from the whole organism), fragmented into lengths suitable for sequencing (typically a few hundred nucleotides), and converted to cDNA. Alternative workflows differ in the order of these steps, first converting mRNA to cDNA and then fragmenting the cDNA molecules. A key detail is that transcripts are fragmented (by mechanical or biochemical methods) at random with respect to the nucleotide sequence, so that different copies of the same gene transcript will be fragmented in different locations. In some applications, the proportion of transcripts from protein-coding genes can be increased (and the abundance of ribosomal RNA molecules can be decreased) by hybridization methods that capture messenger RNA molecules or remove ribosomal RNA molecules. The resulting cDNA samples are then processed into libraries for massively parallel sequencing.
Because the resulting libraries consist mainly of processed gene transcripts, other parts of the genome including introns, regulatory regions, unexpressed genes, and noncoding intergenic regions are not sequenced. A common form of sequence data is paired-end 100- or 150-base sequences from Illumina instruments. Pairs of sequence reads (on the order of 10^7 to 10^9 per instrument run) are then computationally assembled into genes by sequence alignment of reads from partially overlapping fragments that represent contiguous parts of the same expressed gene (contigs). Homologous genes from different assemblies (e.g., assembled from RNA-Seq data for different individual organisms) can be identified by sequence similarity comparisons between assemblies (e.g., using Blast), extracted from each assembly, aligned to each other, and analyzed in codon models or IM models. RNA-Seq data have several potential advantages in comparison to traditional PCR-based methods or whole-genome sequences.

● Transcriptomes can be analyzed using both codon models (which require coding sequences) and IM models (which can use coding sequences or any other type of DNA sequence).
● RNA-Seq can target genes expressed in specific tissues or organs that are associated with specific hypotheses about ecological adaptation, sexual selection, or other speciation processes (such as seminal fluid proteins involved in mate compatibility; Larson et al., 2013).
● The relatively high cost of RNA-Seq sample processing and sequencing (DeWoody et al., 2013) may be partly offset by the relatively low cost per gene, which makes RNA-Seq (and other next-generation sequencing technologies) a relatively good match to the data requirements for codon models of selection and IM models of demographic history (many loci; few samples).
● Contigs of expressed genes can often be assembled from RNA-Seq data using de novo assembly algorithms that compare and align sequence reads to each other (rather than mapping each read to a reference genome). In contrast, whole genome assembly methods typically depend on mapping sequence reads to a reference genome from the same species (or a closely related species) due to the much larger size and greater complexity of genomes compared to the transcriptomes for single tissues or organs.

For studies of molecular adaptation, gene flow, and speciation in non-model organisms without a reference genome, such de novo assembly of transcriptomes may provide relatively easy and inexpensive access to high-quality genome-scale sequence information (Hornett and Wheat, 2012). Even for organisms with high-quality reference genomes, RNA-Seq data from small samples of individuals can be an important source of insight into the targets of selection for adaptive divergence (e.g., Carneiro et al., 2012).

Recent Progress: Selection and Gene Flow From Transcriptomes

Because affordable RNA-Seq data and the analytical methods used to assemble the data are relatively recent innovations, there are still relatively few applications to studies of speciation in natural populations (e.g., Soria-Carrasco et al., 2014). However, the available examples suggest that combining isolation-with-migration model analyses of gene flow (and reproductive isolation) with codon model analyses of selection (and population genetic divergence) can be a useful source of new insight into the earliest stages of speciation in the sea.
Gamete recognition genes in bat stars

We used this approach to identify possible targets of selection leading to reproductive isolation and incipient speciation in diverging populations of a northeastern Pacific sea star (Patiria miniata (Brandt, 1835)) that have been isolated with low gene flow for about 3 × 10^5 years (Keever et al., 2009; McGovern et al., 2010; Hart and Marko, 2010). From small population samples we generated transcriptomes of ovary-expressed genes (Hart and Foster, 2013). We then used codon models to identify genes under positive selection for population differences, and IM models of the same genes to characterize the extent of long-term reproductive isolation between diverging populations. An important feature of this approach is that the same within-population sequence polymorphisms (which are key to parameter fitting in the IM models) were also used in our codon model analyses. We compared two gamete recognition genes (the OBi1 receptor expressed in the egg coat and bindin expressed in the sperm head) to other genes expressed only in gametes or to genes expressed in gametes and other cell types. In branch sites codon models we found a strong signal of positive selection on a few specific codons in OBi1 (Hart et al., 2014), similar to the signal of positive selection on a few codons in bindin (based on data from traditional PCR-based methods; Sunday and Hart, 2013). In IM models we found zero gene flow for the parts of OBi1 and bindin that included specific codons under positive selection, and nonzero gene flow between the same diverging populations for all other genes, including those sequenced from small population samples of transcriptomes (Hart et al., 2014) and those sequenced from larger population samples of PCR amplicons (Keever et al., 2009; McGovern et al., 2010; Hart et al., 2014).
False positives in codon model analyses

Codon model analyses within this context (populations under selection for divergence, combined with IM models), especially for genes directly related to reproductive compatibility or isolation, are potentially highly informative because they focus on evolutionary processes during the earliest stages of speciation (Swanson and Vacquier, 2002; Palumbi, 2009; Lessios, 2011; Vacquier and Swanson, 2011; Evans and Sherman, 2013). Such analyses are also potentially controversial: we used branch sites models of among-codon and among-lineage variation in ω (Yang and Nielsen, 2002; Yang et al., 2005; Pond et al., 2011; Murrell et al., 2012) that are descended from the original models of Goldman and Yang (1994) and Muse and Gaut (1994) in which sequence differences are assumed to be fixed substitutions (for example, between highly divergent species) and not segregating polymorphisms (within single populations). An alternative form of this assumption is that mutation is weak relative to the strength of selection or relative to the time since divergence between lineages, so that new recurrent mutations within single populations or ancestral polymorphisms shared between populations make a negligible contribution to the pattern of synonymous and nonsynonymous sequence variation (Muse and Gaut, 1994; Mugal et al., 2014). Failing to account for the effects of mutation can have important undesirable consequences.
It is well known (Kryazhimskiy and Plotkin, 2008; Anisimova and Liberles, 2012; Mugal et al., 2014) that violating this assumption may have a high associated risk of identifying false positives in analyses of positive selection (e.g., when alignments of genes under purifying selection include rare nonsynonymous mutations that have not yet been eliminated from the population by selection), or of false negatives (e.g., when stochastic variation in the number of synonymous polymorphisms within a single population masks the underlying effects of positive selection for high rates of nonsynonymous substitution). This view was confirmed by simulations showing that, “when applied to a single population, dN/dS is not particularly sensitive to the strength of selection and it is not a reliable indicator of the sign of selection” (Kryazhimskiy and Plotkin, 2008). Nevertheless, empiricists studying adaptive molecular evolution associated with speciation have often ignored those concerns by focusing on systems of populations or sister species in which the divergence time is implicitly (and optimistically) assumed to be sufficiently long to negate the risks of false positives caused by the effects of mutation and of segregating polymorphisms (e.g., Hart et al., 2012, 2014; Sunday and Hart, 2013; Popovic et al., 2014). Are bindin and OBi1 under positive selection leading to incipient speciation in Patiria miniata populations, or are these false positives caused by mutation and segregating polymorphism (and not caused by selection)? Other structural and functional features of those results suggest that these results are not false positives. 
● The OBi1 receptor protein in the egg coat and the bindin protein in the sperm head are known to interact in a species-specific fashion at fertilization in sea urchins (Foltz et al., 1993; Vacquier, 2012), and bindin is the target of positive selection for among-species differences in many sea urchin genera (Palumbi, 2009; Lessios, 2011).
● In P. miniata the key positively selected codons in those genes are located in the two parts of the gene structure predicted to be most sensitive to selection on specificity of sperm-egg binding: in the known substrate-binding site of the OBi1 receptor, and adjacent to the highly conserved ligand for that binding site in bindin (Hart et al., 2014).
● The key positively selected bindin codon in P. miniata is in the same part of the gene structure as the well-known hot spot of bindin positive selection among closely related sea urchin species (Metz and Palumbi, 1996; Biermann, 1998).
● Most important, variation at positively selected codons in OBi1 and bindin explained a significant proportion of the variation in laboratory fertilization rates between P. miniata individuals from diverging populations and between individual males and females with different genotypes at one key positively selected codon in each gene (Hart et al., 2014).

This combination of IM and codon results, especially the evidence for reproductive isolation between populations (zero gene flow only in parts of two positively selected gamete recognition genes) and the evidence for lower gamete compatibility between individuals associated with the same positively selected codons, thus seems to point directly toward a pair of genes subject to positive selection and divergence between populations, possibly driven by local selection on sperm-egg compatibility within populations.
Selection and gene flow in other transcriptome studies

Transcriptome-based analyses of adaptation and ecological speciation, for example in studies of diverging ecotypes along habitat gradients (Cheviron and Brumfield, 2012), are increasingly common but have not often used this same combination of codon and IM models applied to the same sequence alignments of genes from diverging populations. One obstacle to such analyses (and a possible reason for their rarity) is the difficulty of resolving single nucleotide polymorphisms (SNPs) within a single transcriptome into a diploid pair of haplotypes (consisting of linked variable sites) in a de novo transcriptome assembly in order to generate gene trees based on multiple differences among individual haplotypes (e.g., Hart, 2013). Some RNA-Seq studies of diverging populations or ecotypes instead focus on SNP allele frequency variation without using analyses based on gene trees derived from haplotype sequences (e.g., Lemay et al., 2013). De Wit and Palumbi (2013) used transcriptomes from abalone populations to discover outlier SNP loci with unusually large differences in allele frequency in genes that are candidates for subsequent analyses of selection, especially genes involved in energy metabolism and shell mineralization that might be subject to locally varying effects of selection imposed by climate change and ocean acidification. A second obstacle to such analyses is the risk of errors in estimates of ω caused by the inclusion of segregating polymorphisms (and the effects of mutation and ancestral polymorphisms). In one of the earliest transcriptome studies of ecotype divergence between marine populations, Barreto et al.
(2011) resolved within-population variation into a single haplotype per gene for two diverging populations of a tidepool copepod. This approach allowed the authors to ignore rare segregating polymorphisms and to use codon models to discover genes that were under positive selection and likely to underlie the previously well-known genomic incompatibilities between these same populations. This approach, however, precluded the use of IM models to contrast demographic parameters and the history of gene flow for sets of genes that were or were not subject to positive selection. Using a similar compromise, Koester et al. (2013) analyzed positive selection in codon models (but not gene flow and reproductive isolation in IM models) for seven strains of a planktonic diatom species in order to identify candidate genes under selection for divergence, especially between ecotypes from different oceans or different planktonic habitats. Few other studies have attempted to combine both demographic models and codon models applied to the same transcriptome data. Osborne et al. (2013) analyzed divergence between ragwort ecotypes adapted to different habitats along an altitudinal cline using both codon models of positive selection and a demographic model that estimated some parameters (divergence time; population size) but not all parameters (gene flow) included in IM models. However, like the studies noted above, that study analyzed a single haplotype per gene for each ecotype, which precluded analyses using IM models. Other examples of this combination of analytical approaches have used traditional PCR-based methods focused on a relatively small number of genes targeted for their likely role in adaptive evolution (e.g., Wlasiuk et al., 2009) or reinforcement in hybrid zones (e.g., Maroja et al., 2009). 
Future Prospects: Combined Models of Selection and Gene Flow in Speciation

Mitigating the false-positive problem in codon models

Because segregating polymorphisms are the primary source of information for isolation-with-migration (IM) models (Kuhner, 2009) but a known potential source of error for codon models (Anisimova and Liberles, 2012), applying both approaches to the same sequence data in the same study may lead to key insights into speciation but at the risk of some errors. One alternative approach to mitigating those risks in studies of closely related lineages (diverging on short evolutionary time scales where mutation effects might be strong) is to focus on IM methods for characterizing gene flow and other demographic parameters, while characterizing selection effects with approximate methods, such as the HKA test, that do not link selection to specific codons or lineages (e.g., Muir et al., 2012). A significant drawback of this compromise is the limited ability to characterize the relative strength of response to selection (ω) in standard codon models, or to identify individual codons or lineages as targets of selection in branch sites models. A second alternative for studies of more highly divergent populations (on longer evolutionary time scales) is to focus on codon model methods for characterizing the effects of selection, while using approximations such as fixation indices (FST) as proxies for gene flow (e.g., Fraser et al., 2010). A drawback of this compromise is that other demographic processes and other characteristics of the genetic markers, including old or recent divergence time and small or large effective population size, can contribute to patterns of differentiation (FST) without variation in gene flow (Marko and Hart, 2011).
A third approach to this problem is to apply both IM models and codon models to the same multilocus sequence alignments from transcriptomes, and then draw post hoc inferences about the reliability of those results on the basis of consistency among the codon models (showing selection acting on specific codons and lineages), IM models (showing restricted gene flow for genes under positive selection), predictions based on structure-function relationships (showing positive selection on known binding sites and ligands), and genotype-phenotype correlations in functional studies (such as reproductive incompatibility correlated with positive selection on gamete recognition genes). We depended on this mode of inference in our study of the apparent coevolution of OBi1 and bindin in old diverging population pairs in Patiria miniata.

Figure 2. Reanalysis of positive selection in small population samples of the egg coat gene OBi1 from ovary transcriptomes of diverging sea star populations (Patiria miniata) using the software program omegaMap ver. 0.5 (Wilson and McVean, 2006). The Markov chain Monte Carlo search ran 300,000 steps with model parameters sampled at intervals of 100 steps (after a burn-in of 30,000 steps); I used uniform priors on parameter values including the mutation rate (muParam = 0.1, 10); parameter estimates for selection (oBlock = 4) and recombination (rBlock = 5) were fitted using a sliding window method (omega_model = variable; rho_model = variable). Black curve shows variation among codons in estimated ω; red dots show the posterior probability of positive selection (ω > 1). For three codons, including two in the predicted substrate-binding site of the protein (enclosed within the blue box), ω ~ 3–5 and P > 0.90; the same two codons in the substrate-binding site were positively selected in branch sites models fitted to the same sequence data in previous published analyses (Hart et al., 2014). The inset gene tree (from Hart et al., 2014) shows reciprocally monophyletic relationships (strong population divergence) among six OBi1 haplotypes from northern and from southern female sea stars for the gene partition that includes the substrate-binding site of the receptor.

The good news: better models may avoid false positives

Better understanding of the dynamics of demography and selection during speciation will come from better models that incorporate both IM-based demographic parameters and codon-based selection parameters into the same framework (Lawton-Rauh, 2008; Crisci et al., 2012). Unfortunately, this hard problem has had few proposed solutions. One early step toward a synthesis of isolation-with-selection models included a codon model of positive selection plus a mutation rate parameter intended to account in part for the effect of mutations and segregating polymorphisms on the estimate of ω (Wilson and McVean, 2006). This model includes variation in ω among codons but not among lineages or populations, so it is not specifically designed to detect signals of selection for population divergence, but has been used effectively to identify likely targets of selection at late stages in the speciation process (e.g., reinforcement against hybridization in a cricket hybrid zone; Maroja et al., 2009). Does a better model like this indicate false positives from analyses that used traditional branch sites codon models that assume fixed substitution differences? I used the method of Wilson and McVean (called omegaMap) to estimate among-codon variation in ω for OBi1 from P. miniata transcriptomes (Fig. 2).
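As a rough illustration of how a posterior probability of positive selection like the one reported in the Figure 2 legend can be summarized from MCMC output, the sketch below discards the burn-in, thins the chain, and reports the fraction of retained samples with ω > 1 for one codon. It is a minimal sketch and not omegaMap's own code: the function names and the toy chain are hypothetical, and it assumes that the 30,000 burn-in steps are counted within the 300,000 total steps, which the legend does not state explicitly.

```python
def posterior_prob_positive(chain, burn_in, thin):
    """Fraction of retained MCMC samples with omega > 1 for one codon.

    chain   : sequence of sampled omega values, one per MCMC step
    burn_in : number of initial steps to discard
    thin    : keep every `thin`-th step after the burn-in
    """
    retained = chain[burn_in::thin]
    if not retained:
        raise ValueError("no samples left after burn-in and thinning")
    return sum(1 for omega in retained if omega > 1.0) / len(retained)


def n_retained(total_steps, burn_in, thin):
    """Number of samples kept under the same burn-in and thinning scheme."""
    return len(range(burn_in, total_steps, thin))


# Hypothetical toy chain (not real omegaMap output): ten burn-in steps,
# then a repeating pattern in which three of every four samples exceed 1.
toy_chain = [0.1] * 10 + [0.5, 4.0, 4.0, 4.0] * 5
toy_prob = posterior_prob_positive(toy_chain, burn_in=10, thin=1)

# Chain settings from the Figure 2 legend, under the assumption stated above.
kept = n_retained(total_steps=300_000, burn_in=30_000, thin=100)
```

Under that reading of the legend, each per-codon posterior probability would rest on a few thousand retained samples. If the burn-in was run in addition to the 300,000 steps, the count would be correspondingly larger.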
Two codons in the OBi1 substrate-binding site that were positively selected in branch sites models (Hart et al., 2014) were also identified as positively selected in the omegaMap analysis, with high posterior probabilities (P > 0.9) of positive selection (ω > 1), but with more realistic estimated values for the selection parameter at those codons (ω ~ 3–5) in comparison to the unrealistic and imprecise estimates from codon models that assumed fixed substitution differences (ω ~ 200; Hart et al., 2014). The concordance between results from traditional branch sites models (Hart et al., 2014) and from omegaMap analyses (Fig. 2) gives some encouragement that the signal of positive selection on the OBi1 binding site (and its correlation with fertilization rates and reproductive isolation) may not be a false positive. Similarly, Kryazhimskiy and Plotkin (2008) simulated within-population polymorphisms for two isolated populations, and showed that the deviation between observed and expected values of ω for this simulated two-population system depended on the mutation rate (scaled for the effective population size as θ = 4Neμ). They concluded that “when comparing divergent lineages the magnitude of dN/dS compared to unity is a faithful indicator of the sign of selection.” Overall, such results give some reason for optimism that transcriptome studies of incipient species formation using both IM models and codon models may avoid false positives and may correctly identify genes and codons under positive selection if applied to systems of relatively old conspecific population divergence (Hart and Marko, 2010).

The bad news: time dependence of false positives

Unfortunately, the newest model developments suggest that this optimistic view should be tempered with caution. Mugal et al.
(2014) recently developed an IM model that incorporates most of the desirable features of both IM models and traditional codon models. It includes a fully specified codon substitution matrix, a model of divergence between two populations descended from a single panmictic ancestral population, and parameters for synonymous and nonsynonymous mutations that arise in the ancestral population (before the divergence time) or in one of the descendant populations (after the divergence time), and then segregate as polymorphisms or go to fixation under the effects of drift and selection. The important theoretical improvement of this model is its demonstration of the time dependence of ω for diverging populations: new mutations and segregating polymorphisms will cause a long delay after population divergence in the approach of ω to the expected equilibrium value predicted by the selection coefficient. For empiricists who desire to identify genes under positive selection for divergence during the critical early stages of speciation, the more important improvement of this model is its independent characterization of dN and dS as a function of the time since population divergence. Because “the ratio of [these] expected values is in general not equal to the expected value of a ratio [ω = dN/dS]” (Mugal et al., 2014), the stochastic variation in both synonymous and nonsynonymous mutations must be accounted for in a characterization of ω from real or simulated sequence data. Mugal et al. (2014) simulated that stochastic variation under realistic values of the model parameters and data variables (population size, sequence length, population divergence time, mutation rate). Their most sobering discovery was that “during initial divergence extremely high dN/dS values that would be commonly taken as evidence for positive selection [simulations with ω = 1.2–1.9] are frequently obtained even under negative selection pressure [γ = −1, or purifying selection with expected ω < 1]” (Mugal et al., 2014).
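The quoted point that a ratio of expected values differs from the expected value of a ratio can be checked numerically. The sketch below is only a toy illustration of that statistical point and is not the model of Mugal et al. (2014): it assumes independent binomial counts of synonymous and nonsynonymous differences, hypothetical numbers of sites and per-site probabilities, and it excludes alignments with zero synonymous differences (for which dN/dS is undefined).

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of k successes in n independent trials with success rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def ratio_summaries(n_syn, n_non, p_syn, true_omega):
    """Return (E[dN]/E[dS], E[dN/dS | S > 0]) for independent binomial counts.

    Toy assumptions: each of n_syn synonymous sites differs with probability
    p_syn, and each of n_non nonsynonymous sites differs with probability
    true_omega * p_syn. dN and dS are per-site proportions of differences.
    """
    p_non = true_omega * p_syn
    ratio_of_means = p_non / p_syn  # E[dN] / E[dS] equals true_omega

    # E[dN], computed exactly by summing over the nonsynonymous count.
    mean_dn = sum(
        (n / n_non) * binom_pmf(n, n_non, p_non) for n in range(n_non + 1)
    )
    # E[1/dS | S > 0], computed exactly over the synonymous count.
    p_zero = binom_pmf(0, n_syn, p_syn)
    mean_inv_ds = sum(
        (n_syn / s) * binom_pmf(s, n_syn, p_syn) for s in range(1, n_syn + 1)
    ) / (1 - p_zero)

    # Independence of the two counts lets the expectation factorize.
    mean_of_ratios = mean_dn * mean_inv_ds
    return ratio_of_means, mean_of_ratios


ratio_of_means, mean_of_ratios = ratio_summaries(
    n_syn=100, n_non=300, p_syn=0.05, true_omega=0.5
)
```

With few synonymous differences per alignment, sampling noise in dS pushes the average of per-alignment ratios above the ratio of averages, even though the underlying selection regime in this toy example is purifying. The size of the effect depends entirely on the assumed values.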
In these simulations, the risk of false positives (inferring ω > 1 under conditions of purifying selection) depended on both divergence time and effective population size, and was relatively high for at least 5–10 Ne generations after population divergence. The specific magnitude of that risk also depended on other parameters or variables (sequence length, selection coefficient, mutation rate). In general, however, for large populations or recent population divergence, a point estimate of ω for a single gene was likely to be inflated by stochastic variation in the frequency of nonsynonymous polymorphisms within populations until long after the onset of population divergence and the action of selection leading to new species formation. For a system of populations like Patiria miniata that have diverged for about 6 Ne generations (divergence time of ~3 × 10^5 years, generation time of ~5 years, large effective population sizes > 10^4; Keever et al., 2009; McGovern et al., 2010), the risk of false positives would be high under the conditions simulated by Mugal et al. (2014; fig. 5). Ironically, their results suggest that the false-positive problem may be most acute during the earliest stages of speciation that are of strong interest to evolutionary ecologists and most in need of careful study and improved understanding. Moreover, Mugal et al. (2014) used a population divergence model that includes no gene flow between populations after the divergence time (an IM model). They predict that incorporating gene flow effects would further delay the approach of ω to the expected equilibrium by adding segregating polymorphisms from one population to the other via migration.
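The scaled divergence time quoted above for Patiria miniata follows from simple arithmetic, sketched here using the values given in the text. The function name is hypothetical, and the effective population size is taken at its stated lower bound of 10^4, so the result is best read as an upper bound on the scaled divergence time.

```python
def divergence_in_ne_generations(divergence_years, generation_years, effective_size):
    """Divergence time expressed in units of Ne generations."""
    generations = divergence_years / generation_years
    return generations / effective_size


# Values quoted in the text: ~3 x 10^5 years of divergence, ~5 years per
# generation, and Ne taken at its stated lower bound of 10^4.
t_scaled = divergence_in_ne_generations(3e5, 5, 1e4)

# A larger (hypothetical) effective size shortens the scaled divergence time.
t_scaled_large_ne = divergence_in_ne_generations(3e5, 5, 1e5)
```

Because the effective population size is given only as a lower bound, any larger value places these populations even earlier in the 5–10 Ne generation window in which the simulated risk of false positives was relatively high.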
Thus, for more realistic speciation scenarios that include gene flow between populations as they diverge under selection (as usually envisioned in ecological speciation models; Nosil, 2008; Feder et al., 2012; Faria et al., 2014), it may be more difficult than expected to use codon models of isolation-with-selection to identify the targets of selection during speciation. Similarly, the stochastic variation in apparent response to selection (ω) in these simulations may have interesting implications for efforts to infer selection effects based on variation among loci in population differentiation (from outlier methods; Narum and Hess, 2011) or in gene flow (Sousa et al., 2013) if analyses of early stages in speciation often focus on genes and populations that are far from the predicted equilibrium value for any underlying direction and strength of selection.

Keep calm and carry on

Although the simulation results of Mugal et al. (2014) seem to paint a dim picture of prospects for codon model analyses of the key early stages of speciation, several other considerations argue for pursuing this combination of IM models and codon models for characterizing gene flow and selection in speciation. First, Mugal et al. (2014) used a relatively simple codon model that characterizes dN and dS as a uniform feature of all codons in a gene and all lineages or alleles in an alignment (in contrast to among-codon rate variation in the sites model of Wilson and McVean, 2006). Such codon models are widely viewed as relatively insensitive to the effects of selection acting only on specific functional features of a protein (such as ligand binding sites; Pond et al., 2011; Anisimova, 2012). Mugal et al. (2014) noted that branch sites models of among-codon and among-lineage variation might be useful additions to their simulation method. It is currently not known whether estimates of site-specific ω from more complex branch sites models with several selection parameters (e.g., Fig.
2) can be compared directly to simulations of a single time-dependent selection parameter in a simpler codon model (e.g., Mugal et al., 2014, fig. 5). If the time dependence of ω estimates for specific codons or lineages under a branch sites model is much shorter than the typical divergence times of populations undergoing selection for adaptive differences leading to speciation, then branch sites codon models applied to diverging populations may be less susceptible to false positives than they currently appear to be.

Second, simulation methods like those of Mugal et al. (2014) could form the basis for quantifying (rather than merely fearing) the false-positive problem. One simple approach could simulate population divergence under realistic values of population-model parameters (including population size and divergence time) for different underlying values of the selection coefficient (γ) and ask how weak purifying selection (γ < 0) would need to be in order to generate a frequency distribution of ω values in which the confidence limits include the value for the observed data (and constitute a false inference of positive selection rather than purifying selection). Alternatively, a more complex approach could simulate the coalescent under a prior distribution of values for the selection coefficient and other population-model parameters, using estimates of dN and dS as summary statistics to characterize the sequence alignment resulting from each simulation. Simulated population histories and their dN and dS values could then be compared to observed sequence data from population samples, and used in an approximate Bayesian computation (ABC) of the posterior probability distributions for selection, gene flow, and other parameter values.
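The ABC idea described above can be sketched as a simple rejection sampler. Everything specific in this sketch is an assumption made for illustration: the simulator is a toy binomial model rather than a coalescent or the model of Mugal et al. (2014), the mapping ω = γ / (1 − e^(−γ)) from a scaled selection coefficient to expected ω is assumed, and the prior range, site counts, tolerances, and observed counts are all hypothetical. Only selection is estimated here, whereas a full analysis would also place priors on gene flow and other demographic parameters.

```python
import math
import random


def omega_from_gamma(gamma):
    """Toy mapping from a scaled selection coefficient to expected omega.

    Assumption for illustration only: omega = gamma / (1 - exp(-gamma)),
    with omega = 1 in the neutral limit gamma -> 0.
    """
    if abs(gamma) < 1e-9:
        return 1.0
    return gamma / (1.0 - math.exp(-gamma))


def simulate_counts(gamma, n_syn, n_non, p_syn, rng):
    """Draw nonsynonymous and synonymous difference counts (toy simulator)."""
    p_non = min(1.0, p_syn * omega_from_gamma(gamma))
    s = sum(rng.random() < p_syn for _ in range(n_syn))
    n = sum(rng.random() < p_non for _ in range(n_non))
    return n, s


def abc_rejection(obs_n, obs_s, n_draws, tol_n, tol_s, rng,
                  n_syn=100, n_non=300, p_syn=0.1, prior=(-5.0, 2.0)):
    """Keep prior draws of gamma whose simulated counts fall near the data."""
    accepted = []
    for _ in range(n_draws):
        gamma = rng.uniform(*prior)  # draw from a uniform prior on gamma
        sim_n, sim_s = simulate_counts(gamma, n_syn, n_non, p_syn, rng)
        # Summary statistics: counts underlying dN and dS.
        if abs(sim_n - obs_n) <= tol_n and abs(sim_s - obs_s) <= tol_s:
            accepted.append(gamma)
    return accepted


rng = random.Random(1)
# Hypothetical observed data: 6 nonsynonymous and 10 synonymous differences.
posterior_gamma = abc_rejection(obs_n=6, obs_s=10, n_draws=2000,
                                tol_n=2, tol_s=3, rng=rng)
posterior_mean = sum(posterior_gamma) / len(posterior_gamma)
```

The accepted values approximate the posterior distribution of γ under the toy model. Tightening the tolerances gives a closer approximation but lowers the acceptance rate, which is the usual trade-off in rejection-based ABC.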
Analogous ABC methods are in widespread use in other areas of population genetics, including efforts to characterize gene flow and divergence times in large, complex population models with many parameters where direct optimization methods have proven difficult to apply (Beaumont, 2010). Recent steps toward a more comprehensive ABC-based approach to quantifying selection and demography seem very promising (e.g., Lopes et al., 2014).

Third, the identification of specific codons or lineages as targets of positive selection (such as ligands and their receptor binding sites encoded in gamete recognition genes) can be treated as a working hypothesis, and can be tested in field or laboratory experiments of the fitness consequences of variation at those codons. For example, sperm bindin divergence predicts reproductive isolation between congeneric sea urchin species (McCartney and Lessios, 2004; Zigler et al., 2005), and sexual conflict within sea urchin populations has been proposed as the selective agent responsible for population divergence (reviewed by Palumbi, 2009; Lessios, 2011), but few experimentalists have pursued bindin variation and its consequences for reproductive isolation within and among conspecific sea urchin populations (Palumbi, 1999; Levitan and Ferrell, 2006). More such studies are needed (a need emphasized in other reviews as well; Anisimova and Liberles, 2012), particularly to test the reliability of inferences about selection acting on early stages of population divergence leading to new species formation.

Acknowledgments

Thanks to Ken Halanych for the invitation to write this essay. I am grateful to several reviewers of this and other manuscripts whose comments helped me to develop a fuller appreciation of the population genetics of codon evolution under selection. My recent financial support has come from the Natural Sciences and Engineering Research Council, Genome BC, and Simon Fraser University.

Literature Cited

Anisimova, M. 2012.
Parametric models of codon evolution. Pp. 12–33 in Codon Evolution, G. M. Cannarozzi and A. Schneider, eds. Oxford University Press, Oxford. Anisimova, M., and D. A. Liberles. 2012. Detecting and understanding natural selection. Pp. 73–96 in Codon Evolution, G. M. Cannarozzi and A. Schneider, eds. Oxford University Press, Oxford. Arbogast, B. S., S. V. Edwards, J. Wakeley, P. Beerli, and J. B. Slowinski. 2002. Estimating divergence times from molecular data on phylogenetic and population genetic timescales. Annu. Rev. Ecol. Syst. 33: 707–740. Bahlo, M., and R. C. Griffiths. 2000. Inference from gene trees in a subdivided population. Theor. Popul. Biol. 57: 79 –95. Barreto, F. S., G. W. Moy, and R. S. Burton. 2011. Interpopulation patterns of divergence and selection across the transcriptome of the copepod Tigriopus californicus. Mol. Ecol. 20: 560 –572. Beaumont, M. A. 2010. Approximate Bayesian computation in evolution and ecology. Annu. Rev. Ecol. Evol. Syst. 41: 379 – 406. Biermann, C. H. 1998. The molecular evolution of sperm bindin in six species of sea urchins (Echinoida: Strongylocentrotidae). Mol. Biol. Evol. 15: 1761–1771. Bierne, N., D. Roze, and J. J. Welch. 2013. Pervasive selection or is it. . .? Why are FST outliers sometimes so frequent? Mol. Ecol. 22: 2061–2064. Bird, C. E., I. Fernandez-Silva, D. J. Skillings, and R. J. Toonen. 2012. Sympatric speciation in the post “Modern Synthesis” era of evolutionary biology. Evol. Biol. 39: 158 –180. Butlin, R., A. Debelle, C. Kerth. R. R. Snook, L. W. Beukeboom, R. F. Castillo Cajas, W. Diao, M. E. Maan, S. Paolucci, F. J. Weissing et al. 2012. What do we need to know about speciation? Trends Ecol. Evol. 27: 27–39. Cahais, V., P. Gayral, G. Tsagkogeorga, J. Melo-Ferreira, M. Ballenghien, L. Weinert, Y. Chiari, K. Belkhir, V. Ranwez, and N. Galtier. 2012. Reference-free transcriptome assembly in non-model animals from next-generation sequencing data. Mol. Ecol. Res. 12: 834 – 845. Carneiro, M., F. W. 
Albert, J. Melo-Ferreira, N. Galtier, P. Gayral, J. A. Blanco-Aguiar, R. Villafuerte, M. W. Nachman, and N. Ferrand. 2012. Evidence for widespread positive and purifying selection across the European rabbit (Oryctolagus cuniculus) genome. Mol. Biol. Evol. 29: 1837–1849. Cheviron, Z. A., and R. T. Brumfield. 2012. Genomic insights into adaptation to high-altitude environments. Heredity 108: 354–361. Claw, K. G., and W. J. Swanson. 2012. Evolution of the egg: new findings and challenges. Annu. Rev. Genomics Hum. Genet. 13: 109–125. Coyne, J. A., and H. A. Orr. 2004. Speciation. Sinauer Associates, Sunderland, MA. Crisci, J. L., Y. P. Poh, A. Bean, A. Simkin, and J. D. Jensen. 2012. Recent progress in polymorphism-based population genetic inference. J. Hered. 103: 287–296. De Wit, P., and S. R. Palumbi. 2013. Transcriptome-wide polymorphisms of red abalone (Haliotis rufescens) reveal patterns of gene flow and local adaptation. Mol. Ecol. 22: 2884–2897. DeWoody, J. A., K. C. Abts, A. L. Fahey, Y. Z. Ji, S. J. A. Kimble, N. J. Marra, B. K. Wijayawardena, and J. R. Willoughby. 2013. Of contigs and quagmires: next-generation sequencing pitfalls associated with transcriptomic studies. Mol. Ecol. Res. 13: 551–558. Donnelly, P., and S. Tavaré. 1995. Coalescents and genealogical structure under neutrality. Annu. Rev. Genet. 29: 401–421. Edwards, S. V., and P. Beerli. 2000. Perspective: gene divergence, population divergence, and the variance in coalescence time in phylogeographic studies. Evolution 54: 1839–1854. Ellegren, H. 2014. Genome sequencing and population genomics in non-model organisms. Trends Ecol. Evol. 29: 51–63. Evans, J. P., and C. D. H. Sherman. 2013. Sexual selection and the evolution of egg-sperm interactions in broadcast-spawning invertebrates. Biol.
Bull. 224: 166 –183. Ewing G., and J. Hermisson. 2010. MSMS: a coalescent simulation program including recombination, demographic structure and selection at a single locus. Bioinformatics 26: 2064 –2065. Faria, R., S. Renaut, J. Galindo, C. Pinho, J. Melo-Ferreira, M. Melo, F. Jones, W. Salzburger, D. Schluter, and R. Butlin. 2014. Advances in ecological speciation: an integrative approach. Mol. Ecol. 23: 513–521. Feder, J. L., S. P. Egan, and P. Nosil. 2012. The genomics of speciation-with-gene-flow. Trends Genet. 28: 342–350. Feldmeyer, B., C. W. Wheat, N. Krezdom, B. Rotter, and M. Pfenninger. 2011. Short read Illumina data for the de novo assembly of a non-model snail species transcriptome (Radix balthica, Basommatophora, Pulmonata), and a comparison of assembler performance. BMC Genomics 12: 317. Felsenstein, J. 2006. Accuracy of coalescent likelihood estimates: Do we need more sites, more sequences, or more loci? Mol. Biol. Evol. 23: 691–700. Foltz, K. R., J. A. Partin, and W. J. Lennarz. 1993. Sea urchin egg receptor for sperm: sequence similarity of binding domain and hsp70. Science 259: 1421–1425. Fraser, B. A., I. W. Ramnarine, and B. D. Neff. 2010. Selection at the MHC class IIB locus across guppy (Poecilia reticulata) populations. Heredity 104: 155–167. Gayral, P., J. Melo-Ferreira, S. Glémin, N. Bierne, M. Carneiro, B. Nabholz, J. M. Lourenco, P. C. Alves, M. Ballenghien, N. Faivre et al. 2013. Reference-free population genomics from next-generation transcriptome data and the vertebrate-invertebrate gap. PLoS Genet. 9: e1003457. Goldman, N., and Z. Yang. 1994. A codon-based model of nucleotide substitution for protein-coding DNA sequences. Mol. Biol. Evol. 11: 725–736. Hahn, M. W. 2008. Toward a selection theory of molecular evolution. Evolution 62: 255–265. Hart, M. W. 2013. Structure and evolution of the sea star egg receptor for sperm bindin. Mol. Ecol. 22: 2143–2156. Hart, M. W., and A. Foster. 2013. 
Highly expressed genes in gonads of the bat star Patiria miniata: gene ontology, expression differences, and gamete recognition loci. Invertebr. Biol. 132: 241–250.
Hart, M. W., and P. B. Marko. 2010. It’s about time: divergence, demography, and the evolution of developmental modes in marine invertebrates. Integr. Comp. Biol. 50: 643–661.
Hart, M. W., I. Popovic, and R. B. Emlet. 2012. Low rates of bindin codon evolution in lecithotrophic Heliocidaris sea urchins. Evolution 66: 1709–1721.
Hart, M. W., J. M. Sunday, I. Popovic, K. J. Learning, and C. M. Konrad. 2014. Incipient speciation of sea star populations by adaptive gamete recognition coevolution. Evolution 68: 1294–1305.
Hey, J. 2010a. Isolation with migration models for more than two populations. Mol. Biol. Evol. 27: 905–920.
Hey, J. 2010b. The divergence of chimpanzee species and subspecies as revealed in multipopulation isolation-with-migration analyses. Mol. Biol. Evol. 27: 921–933.
Hey, J., and R. Nielsen. 2004. Multilocus methods for estimating population sizes, migration rates and divergence time, with applications to the divergence of Drosophila pseudoobscura and D. persimilis. Genetics 167: 747–760.
Hey, J., and R. Nielsen. 2007. Integration within the Felsenstein equation for improved Markov chain Monte Carlo methods in population genetics. Proc. Natl. Acad. Sci. USA 104: 2785–2790.
Hey, J., W. M. Fitch, and F. Ayala, eds. 2005. Systematics and the Origin of Species: On Ernst Mayr’s 100th Anniversary. National Academies Press, Washington, DC.
Hirohashi, N., N. Kamei, H. Kubo, H. Sawada, M. Matsumoto, and M. Hoshi. 2008. Egg and sperm recognition systems during fertilization. Dev. Growth Differ. 50: S221–S238.
Hornett, E. A., and C. W. Wheat. 2012. Quantitative RNA-Seq analysis in non-model species: assessing transcriptome assemblies as a scaffold and the utility of evolutionary divergent genomic reference species. BMC Genomics 13: 361.
Hudson, R. R. 1990.
Gene genealogies and the coalescent process. Oxf. Surv. Evol. Biol. 7: 1–44.
Keever, C. C., J. Sunday, J. B. Puritz, J. A. Addison, R. J. Toonen, R. K. Grosberg, and M. W. Hart. 2009. Discordant distribution of populations and genetic variation in a sea star with high dispersal potential. Evolution 63: 3214–3227.
Kingman, J. F. C. 1982. On the genealogy of large populations. J. Appl. Probab. 19A: 27–43.
Koester, J. A., W. J. Swanson, and E. V. Armbrust. 2013. Positive selection within a diatom species acts on putative protein interactions and transcriptional regulation. Mol. Biol. Evol. 30: 422–434.
Kryazhimskiy, S., and J. B. Plotkin. 2008. The population genetics of dN/dS. PLoS Genet. 4: e1000304.
Kuhner, M. K. 2009. Coalescent genealogy samplers: windows into population history. Trends Ecol. Evol. 24: 86–93.
Larson, E. L., J. A. Andres, S. M. Bogdanowicz, and R. G. Harrison. 2013. Differential introgression in a mosaic hybrid zone reveals candidate barrier genes. Evolution 67: 3653–3661.
Lawton-Rauh, A. 2008. Demographic processes shaping genetic variation. Curr. Opin. Plant Biol. 11: 103–109.
Lemay, M. A., D. J. Donnelly, and M. A. Russello. 2013. Transcriptome-wide comparison of sequence variation in divergent ecotypes of kokanee salmon. BMC Genomics 14: 308.
Lessios, H. A. 2011. Speciation genes in free-spawning marine invertebrates. Integr. Comp. Biol. 51: 456–465.
Lessios, H. A., B. D. Kessing, and J. S. Pearse. 2001. Population structure and speciation in tropical seas: global phylogeography of the sea urchin Diadema. Evolution 55: 955–975.
Levitan, D. R., and D. L. Ferrell. 2006. Selection on gamete recognition proteins depends on sex, density, and genotype frequency. Science 312: 267–269.
Lopes, J. S., M. Arenas, D. Posada, and M. A. Beaumont. 2014. Coestimation of recombination, substitution and molecular adaptation rates by approximate Bayesian computation. Heredity 112: 255–264.
Marko, P. B., and M. W. Hart. 2011.
The complex analytical landscape of gene flow inference. Trends Ecol. Evol. 26: 448–456.
Maroja, L. S., J. A. Andres, and R. G. Harrison. 2009. Genealogical discordance and patterns of introgression and selection across a cricket hybrid zone. Evolution 63: 2999–3015.
Mayr, E. 1954. Geographic speciation in tropical echinoids. Evolution 8: 1–18.
McCartney, M. A., and H. A. Lessios. 2004. Adaptive evolution of sperm bindin tracks egg incompatibility in neotropical sea urchins of the genus Echinometra. Mol. Biol. Evol. 21: 732–745.
McGovern, T. M., C. C. Keever, C. A. Saski, M. W. Hart, and P. B. Marko. 2010. Divergence genetics analysis reveals historical population genetic processes leading to contrasting phylogeographic patterns in co-distributed species. Mol. Ecol. 19: 5043–5060.
Metz, E. C., and S. R. Palumbi. 1996. Positive selection and sequence rearrangements generate extensive polymorphism in the gamete recognition protein bindin. Mol. Biol. Evol. 13: 397–406.
Mugal, C. F., J. B. W. Wolf, and I. Kaj. 2014. Why time matters: codon evolution and the temporal dynamics of dN/dS. Mol. Biol. Evol. 31: 212–231.
Muir, G., C. J. Dixon, A. L. Harper, and D. A. Filatov. 2012. Dynamics of drift, gene flow, and selection during speciation in Silene. Evolution 66: 1447–1458.
Murrell, B., J. O. Wertheim, S. Moola, T. Weighill, I. K. Scheffler, and S. L. K. Pond. 2012. Detecting individual sites subject to episodic diversifying selection. PLoS Genet. 8: e1002764.
Muse, S. V., and B. S. Gaut. 1994. A likelihood approach for comparing synonymous and nonsynonymous nucleotide substitution rates, with application to the chloroplast genome. Mol. Biol. Evol. 11: 715–724.
Narum, S. R., and J. E. Hess. 2011.
Comparison of FST outlier tests for SNP loci under selection. Mol. Ecol. Res. 11: 184–194.
Nielsen, R. 2005. Molecular signatures of natural selection. Annu. Rev. Genet. 39: 197–218.
Nordborg, M. 2003. Coalescent theory. Pp. 602–635 in Handbook of Statistical Genetics, D. J. Balding, M. Bishop, and C. Cannings, eds. John Wiley, Hoboken, NJ.
Nosil, P. 2008. Speciation with gene flow could be common. Mol. Ecol. 17: 2103–2106.
Nosil, P., and J. L. Feder. 2012. Genomic divergence during speciation: causes and consequences. Philos. Trans. R. Soc. Lond. B 367: 332–342.
Nosil, P., and J. L. Feder. 2013. Genome evolution and speciation: toward quantitative descriptions of pattern and process. Evolution 67: 2461–2467.
Nosil, P., D. J. Funk, and D. Ortiz-Barrientos. 2009. Divergent selection and heterogeneous genomic divergence. Mol. Ecol. 18: 375–402.
Osborne, O. G., T. E. Batstone, S. J. Hiscock, and D. A. Filatov. 2013. Rapid speciation with gene flow following the formation of Mt. Etna. Genome Biol. Evol. 5: 1704–1715.
Palumbi, S. R. 1999. All males are not created equal: fertility differences depend on gamete recognition polymorphisms in sea urchins. Proc. Natl. Acad. Sci. USA 96: 12632–12637.
Palumbi, S. R. 2009. Speciation and the evolution of gamete recognition genes: pattern and process. Heredity 102: 66–76.
Palumbi, S. R., and H. A. Lessios. 2005. Evolutionary animation: How do molecular phylogenies compare to Mayr’s reconstruction of speciation in the sea? Proc. Natl. Acad. Sci. USA 102: 6566–6572.
Pannell, J. R. 2003. Coalescence in a metapopulation with recurrent local extinction and recolonization. Evolution 57: 949–961.
Pinho, C., and J. Hey. 2010. Divergence with gene flow: models and data. Annu. Rev. Ecol. Evol. Syst. 41: 215–230.
Pluzhnikov, A., and P. Donnelly. 1996. Optimal sequencing strategies for surveying molecular genetic diversity. Genetics 144: 1247–1262.
Pond, S. L. K., B. Murrell, M. Fourment, S. D. W. Frost, W. Delport, and K. Scheffler.
2011. A random effects branch-site model for detecting episodic diversifying selection. Mol. Biol. Evol. 28: 3033–3043.
Popovic, I., P. B. Marko, J. P. Wares, and M. W. Hart. 2014. Selection and demographic history shape the molecular evolution of the gamete compatibility protein bindin in Pisaster sea stars. Ecol. Evol. 4: 1567–1588.
Puillandre, N., A. Lambert, S. Brouillet, and G. Achaz. 2012. ABGD, automatic barcode gap discovery for primary species delimitation. Mol. Ecol. 21: 1864–1877.
Rundle, H. D., and P. Nosil. 2005. Ecological speciation. Ecol. Lett. 8: 336–352.
Seehausen, O., R. K. Butlin, I. Keller, C. E. Wagner, J. W. Boughman, P. A. Hohenlohe, C. L. Peichel, G.-P. Saetre, C. Bank, A. Brännström et al. 2014. Genomics and the origin of species. Nat. Rev. Genet. 15: 176–192.
Soria-Carrasco, V., Z. Gompert, A. A. Comeault, T. E. Farkas, T. L. Parchman, J. S. Johnston, C. A. Buerkle, J. L. Feder, J. Bast, T. Schwander et al. 2014. Stick insect genomes reveal natural selection’s role in parallel speciation. Science 344: 738–742.
Sousa, V. C., M. Carneiro, N. Ferrand, and J. Hey. 2013. Identifying loci under selection against gene flow in isolation-with-migration models. Genetics 194: 211–233.
Stajich, J. E., and M. W. Hahn. 2005. Disentangling the effects of demography and selection in human history. Mol. Biol. Evol. 22: 63–73.
Swanson, W. J., and V. D. Vacquier. 2002. The rapid evolution of reproductive proteins. Nat. Rev. Genet. 3: 137–144.
Sunday, J. M., and M. W. Hart. 2013. Sea star populations diverge by positive selection at a sperm-egg compatibility locus. Ecol. Evol. 3: 640–654.
Vacquier, V. D. 2012. The quest for the sea urchin egg receptor for sperm. Biochem. Biophys. Res. Comm. 425: 583–587.
Vacquier, V. D., and W. J. Swanson. 2011. Selection in the rapid evolution of gamete recognition proteins in marine invertebrates. Cold Spring Harb. Perspect. Biol. 2011: a002931.
Wheat, C. W. 2010.
Rapidly developing functional genomics in ecological model systems via 454 transcriptome sequencing. Genetica 138: 433–451.
Wilson, D. J., and G. McVean. 2006. Estimating diversifying selection and functional constraint in the presence of recombination. Genetics 172: 1411–1425.
Wlasiuk, G., S. Khan, W. M. Switzer, and M. W. Nachman. 2009. A history of recurrent positive selection at the Toll-like receptor 5 in primates. Mol. Biol. Evol. 26: 937–949.
Yang, Z., and J. P. Bielawski. 2000. Statistical methods for detecting molecular adaptation. Trends Ecol. Evol. 15: 496–503.
Yang, Z., and R. Nielsen. 2002. Codon-substitution models for detecting molecular adaptation at individual sites along specific lineages. Mol. Biol. Evol. 19: 908–917.
Yang, Z., W. S. W. Wong, and R. Nielsen. 2005. Bayes empirical Bayes inference of amino acid sites under positive selection. Mol. Biol. Evol. 22: 1107–1118.
Zigler, K. S., M. A. McCartney, D. R. Levitan, and H. A. Lessios. 2005. Sea urchin bindin divergence predicts gamete compatibility. Evolution 59: 2399–2404.