
Molecular population genetics

Magnus Nordborg* and Hideki Innan

Molecular population genetics is entering a new era dominated by studies of genomic polymorphism. Some of the theory that will be needed to analyze data generated by such studies is already available, but much more work is needed. Furthermore, population genetics is becoming increasingly relevant to other fields of biology, for example to genetic epidemiology, because of disease gene mapping in general populations.

Addresses
Department of Biological Sciences, University of Southern California, 835 W 37th St, SHS 172, Los Angeles, California 90089-1340, USA
*e-mail: [email protected]

Current Opinion in Plant Biology 2002, 5:69–73
1369-5266/02/$ - see front matter
© 2002 Elsevier Science Ltd. All rights reserved.

Abbreviations
Ka    rate of nucleotide substitution at nonsynonymous sites
Ks    rate of nucleotide substitution at synonymous sites
LD    linkage disequilibrium
Rpm1  Resistance to Pseudomonas syringae ssp. maculicola1
tb1   teosinte branched1

Introduction

We are currently witnessing a technology-driven explosion in the availability of genetic polymorphism data. These developments are revolutionizing evolutionary genetics, which for most of its history has been a field noticeably lacking in data [1]. They are also bringing the theory of population genetics back into the limelight as researchers are faced with analyzing their sequences [2]. Our purpose here is to review these developments, in particular as they apply to plant biology (although we will not hesitate to use examples from other organisms when more appropriate). Because the relevant population genetics theory is not common knowledge, we will explain the basic ideas before proceeding to the data.

Modeling polymorphism

The ‘Neutral Theory’ of molecular evolution [3] plays a central role in the analysis of population genetic data.
The Neutral Theory holds that the majority of the polymorphisms seen within and among species are selectively neutral, or at least nearly so. Neutrality makes mathematical modeling relatively easy; however, much more important is the fact that assuming neutrality gives rise to a natural null model. Selection and other phenomena of interest, such as particular scenarios of population structure (i.e. subdivision) and migration, can then be viewed as perturbations of a standard neutral model. Consider a sample of copies of a particular gene or short chromosomal segment. Focus on a particular site. All of the sampled copies of this site must be related to each other through some kind of genealogical tree. This is true regardless of whether the sites were sampled from the same or different populations, or even from different species (see Figure 1). It is also true in the presence of recombination, although different sites will then typically have different trees. Selectively neutral mutations at a site can be thought of as having occurred according to a random process along the branches of the tree for that site. Precisely because they are neutral, they will not have affected the tree itself. The pattern of polymorphism will thus reflect the tree in a statistical sense. To understand the genomic pattern of polymorphism expected in a population, we need to understand the underlying pattern of trees. It is extremely important to understand the difference between gene trees and species trees. A species tree is an abstract notion that refers to the genealogical relationship among a number of species (i.e. which species begat which). A gene tree, on the other hand, is concrete and refers to the genealogical relationship among a number of homologous copies of a site in the genome. 
When a single gene copy is sampled from each of a number of species that have been separated (without hybridization) for a sufficiently long time, the gene tree must reflect the species tree very closely. For example, regardless of which part of the genome is studied, a copy of a specific gene from a human and a copy from a chimpanzee will always be more closely related to each other than either is to a copy from a cow. If we consider copies from less well separated species such as human, chimp, and gorilla, however, the picture is less clear: in this case, it is at least possible that for some genes, humans are closer to gorillas than to chimps [4•]. Finally, if we consider copies sampled from members of the same species, it is clear that the picture is completely different: for example, in humans, three copies sampled from a Swede, a Japanese, and an African could be related in any way. Of course, some trees may be more common than others, but the point is that any particular tree must be treated as random. To the extent that genomic patterns exist, they can only be discovered using statistics. The fact that the genealogy of genes sampled from a population is random adds an extra level of randomness to the pattern of variation we expect to see across the genome. The existence of an underlying genealogy means that most classical statistical methods (which rely on independent samples) do not apply to population genetic data [5•,6•]. In order to handle such data, a simple stochastic model known as ‘the coalescent’ has been developed [1,5•,6•,7–10]. The coalescent is a powerful simulation tool that makes it easy to gain insight into the pattern of variation expected under various evolutionary scenarios [5•]. It also forms the basis of the modern, computationally intensive, inference methods that are currently being developed [6•]. 
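To make the preceding description concrete, the following minimal Python sketch simulates coalescence times under the standard neutral model with constant population size and no recombination. It is an illustration under stated assumptions, not the method of any cited reference: time is measured in coalescent units in which each pair of lineages coalesces at rate 1, so that k lineages wait an exponentially distributed time with rate k(k-1)/2 before the next coalescence. The function names, the sample size and the random seed are hypothetical choices made here.

```python
import random
import statistics


def simulate_waiting_times(n, rng):
    """Waiting times while there are k = n, n-1, ..., 2 ancestral lineages.

    Assumption: standard neutral coalescent, so with k lineages the time to
    the next coalescence is exponential with rate k*(k-1)/2 (coalescent units).
    """
    return {k: rng.expovariate(k * (k - 1) / 2) for k in range(n, 1, -1)}


def tmrca(times):
    """Time to the most recent common ancestor: the sum of all waiting times."""
    return sum(times.values())


def total_branch_length(times):
    """Total tree length: k branches exist during the waiting time for k lineages."""
    return sum(k * t for k, t in times.items())


def expected_tmrca(n):
    """Theoretical mean TMRCA under the same assumptions: 2 * (1 - 1/n)."""
    return 2 * (1 - 1 / n)


def expected_total_length(n):
    """Theoretical mean total tree length: 2 * (1 + 1/2 + ... + 1/(n-1))."""
    return 2 * sum(1 / i for i in range(1, n))


if __name__ == "__main__":
    rng = random.Random(1)  # arbitrary seed, used only for reproducibility
    n = 10                  # hypothetical sample size
    heights = [tmrca(simulate_waiting_times(n, rng)) for _ in range(10000)]
    print("mean simulated TMRCA:", statistics.mean(heights))
    print("theoretical mean TMRCA:", expected_tmrca(n))
    print("standard deviation of TMRCA:", statistics.stdev(heights))
```

Comparing the standard deviation of the simulated tree heights with their mean gives a feel for why any particular gene tree has to be treated as random, and why patterns can only be discovered statistically.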
Figure 1. Examples of gene trees in: (a) a standard neutral model with constant population size; (b) an exponentially growing population; and (c) a structured population with two subpopulations. The open circles represent sampled copies and the closed circles represent their most recent common ancestors (MRCA). Selection can mimic demography, but affects only the selected site (and the region surrounding it). (d) A gene tree after the fixation of an advantageous mutation. The shape of the tree is expected to be similar to that in a growing population, as seen in (b). (e) A gene tree that occurs when two alleles are maintained for a long time by balancing selection. The shape of the tree is similar to that in a structured population (c), with mutation taking the place of migration [5•].

We now turn to specific questions that have been addressed empirically, while keeping these general remarks in mind.

Inferring demography

Demographic history affects the shape of gene trees and thus the pattern of polymorphism. Figure 1 illustrates this by comparing a few simple examples to a standard model without structure and with constant population size. In a population whose size has been growing exponentially (Figure 1b), the most recent common ancestor (MRCA) of the sample is relatively young and the external branches of the gene tree are expected to be long in comparison with the internal branches. Such a tree is said to be ‘star-like’. A large proportion of mutations on the tree will appear only once in the sample (singleton polymorphisms). Similar trees are expected in a population that has recovered from a severe reduction of population size (i.e. a ‘bottleneck event’).

In contrast, a completely different shape of tree is expected in a structured population. Consider a population in which there are two subpopulations connected by infrequent migration (Figure 1c).
Although genes sampled from the same subpopulation may have a relatively recent common ancestor, the MRCA of the whole sample may be ancient. As a result, most mutations on the tree will result in fixed differences between the subpopulations.

As would be expected given the trees in Figure 1, it is possible to infer the demographic history of a population from the pattern of DNA polymorphism. This is much more difficult than one might expect, however, because of the huge stochastic variance of the coalescent process. Even in a constant-size population, we sometimes observe trees that look like they came from a growing or structured population. The only way to reduce this variance is to collect data from a number of independent (unlinked) loci, and to rely on the fact that the demographic history affects the entire genome in the same way. The conclusions that can be drawn from a single locus (or non-recombining genome, such as the mitochondrial DNA [mtDNA] or chloroplast DNA) are in general extremely limited.

The best-known example is from humans, in which early studies of mtDNA (which has a star-like tree similar to that illustrated in Figure 1b) suggested a rapid population expansion from a very small population [11], a conclusion that has since been contradicted by studies of nuclear regions [12]. Similarly, in Arabidopsis thaliana, early studies indicated extreme population subdivision, perhaps as a result of ancient admixture [13], but later studies revealed that this pattern is not seen in the entire genome [14]. The situation is further complicated by surveys of genome-wide amplified fragment-length polymorphism (AFLP) markers that suggest weak isolation by distance, and a relatively recent population expansion [15–17]. The picture that emerges is much more complex than the simple models illustrated in Figure 1, and seems to involve both ancient subdivision and recent expansion.
This is perhaps not unexpected given the history of glaciation in the areas where A. thaliana is currently found [16]. Large differences in the pattern of polymorphism between genomic regions are also seen in barley [18].

It is clear from these examples that studies designed to investigate demography require a large number of genome-wide markers. Few such studies are yet available, but they will surely become commonplace. In many cases, estimating the population structure will not be a goal in itself but a prerequisite for other analyses. This is exemplified by a recent study of maize, in which population structure was inferred using 141 SSR (simple sequence repeat) loci in order to carry out so-called linkage disequilibrium (LD) mapping [19••]. We return to this topic below.

Domesticated species such as maize are of particular interest because the process of domestication may well have involved a dramatic bottleneck. As a result, the level of variation might be expected to be very low in domesticated species. However, for grasses, this does not appear to be the case [20]. For example, maize has about 80% of the diversity found in its wild relative [21]. Nevertheless, variability does seem to have been reduced in a short region of the teosinte branched1 (tb1) locus, which might have been subject to strong artificial selection during the domestication of this crop by ancient agriculturists [22••]. As we will see below, such a reduction in variability is precisely what population genetics theory would predict.

Detecting selection

Selection differs from demography in that it affects specific sites (i.e. those that are functionally important and, indirectly, those that are linked to functionally important sites) rather than the entire genome. Most ‘tests of selection’ take advantage of this by comparing different genomic regions or different kinds of sites [23•].
Directional selection between species

Adaptive evolution can be detected by comparing the substitution rate at nonsynonymous sites (Ka) with that at synonymous sites (Ks). Because most amino-acid changes are likely to be deleterious, Ka is usually much smaller than Ks. However, Ka can be increased by strong selection for a novel protein function. Very rapid amino-acid evolution (Ka>Ks for some comparisons) can be seen in chitinases in the genus Arabis [24]. Chitinases are involved in plant defense and so the strong selection for novel chitinase function may be the result of plant–pathogen co-evolution.

Another example is provided by the Hawaiian silversword alliance [25]. These species are distributed on six of the eight main islands of Hawaii, and many of them are single-island endemics. There is great variation in habitat, growth form and morphology among the various species, which seem to have diverged from each other quite recently. Intriguingly, the rate of nonsynonymous substitution in certain regulatory genes seems to be faster in the Hawaiian silversword alliance than in their North American relatives. The same increase in nonsynonymous substitution is not seen for structural genes, indicating that the regulatory genes have undergone rapid evolution. Some caution is needed, however, as an increase in Ka can have several causes. When Ka>Ks, there can be little doubt that selection has been involved, but this is an extremely conservative criterion.

Directional selection within species (selective sweeps)

Whenever an advantageous mutation is driven by selection through a population, it will leave a trace in the surrounding chromosomal region. Intuitively, selection causes a form of ‘bottleneck’ that is limited to the selected site and the surrounding chromosomal region (Figure 1d). The result is a local loss of variation [26]. Such a footprint of selection appears to exist in the vicinity of the maize gene tb1.
The amount of variation in the 5′ regulatory region of tb1 is significantly smaller than that in other regions [22••]. Furthermore, the shape of the gene tree for this region seems to be star-like. These observations are consistent with the hypothesis that tb1 played an important role during the domestication of maize. It is well known that tb1 accounts for significant morphological differences in crosses between maize and its progenitor, teosinte.

Balancing selection

Balancing selection refers to any kind of selection that preserves polymorphism, that is, keeps alleles from drifting to low frequencies and being lost by chance. This means that the selectively different alleles will be older than expected for different alleles at loci that are not subject to balancing selection (i.e. most of the genome) (Figure 1e). As the oldest alleles will have had most time to diverge (i.e. to accumulate selectively neutral differences in their flanking regions), a peak of increased polymorphism is formed surrounding the locus or site under selection [27]. The canonical example of this phenomenon is the increased variability associated with the major histocompatibility complex (MHC) in mammals [28,29]. An equally impressive example from plants is provided by self-incompatibility loci, where selection maintains polymorphism because rare alleles always have a selective advantage [30].

A more recently discovered example comes from the disease-resistance locus Rpm1 (Resistance to Pseudomonas syringae ssp. maculicola1) in A. thaliana [31••]. Two major alleles exist at this locus: resistant and non-resistant. The non-resistant allele turns out to be a deletion of the entire Rpm1 gene. Sequence analysis of the flanking region in a number of A. thaliana accessions revealed a gene tree that has a very long branch between the two alleles (Figure 1e), and a pattern of polymorphism that is significantly different from that expected under the standard neutral model.
It should be noted that determining whether an observed peak is indeed significant is difficult, in particular because population structure increases the risk of false positives [32].

Genomic patterns of selection

About ten years ago, it was discovered that the level of sequence variation in Drosophila melanogaster is not constant across the chromosome, but is positively correlated with the recombination rate [33]. The pattern of variation cannot be explained by a correlation between mutation and recombination, because the rate of divergence between species is not correlated with the recombination rate. Instead, it seems that the pattern is caused by some form of continual selection (e.g. purifying selection against deleterious mutations [34] or recurrent selective sweeps [26]), the rationale being that each selective event would have a greater effect on variation in regions of low recombination. Since its initial discovery [33], a correlation between the level of sequence variation and the recombination rate has been found in many other organisms. In plants, this pattern has now been observed in wheat [35], tomato [36,37], sea beet [38], and maize [39], although the correlation is nowhere near as clear in plants as in D. melanogaster, possibly because the pattern is obscured by the effects of population structure.

Recombination, linkage disequilibrium, and mapping

Although recombination has often been ignored in studies of DNA sequence polymorphism, it is easily incorporated into the standard coalescent model [5•,9,40]. The main effect of recombination is that it allows linked sites to have different trees. These trees will be correlated; the strength of the correlation depends on the genetic distance between the sites. The correlation in the underlying genealogies may result in correlation among alleles in haplotypes. Such non-random association among alleles is known as LD.
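As a minimal illustration of how such non-random association can be quantified, the Python sketch below computes two commonly used summary statistics for a pair of biallelic sites from haplotype counts: the coefficient D (the observed frequency of one haplotype minus its expectation if the two sites were independent) and the squared correlation r². The choice of these particular statistics, the function name and the example counts are assumptions introduced here for illustration; the text does not specify a measure.

```python
import math


def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, r2) for two biallelic sites, given counts of the four haplotypes.

    A/a are the alleles at the first site and B/b those at the second
    (hypothetical labels). r2 is NaN if either site is monomorphic.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("at least one haplotype is required")
    p_AB = n_AB / total              # frequency of the A-B haplotype
    p_A = (n_AB + n_Ab) / total      # frequency of allele A
    p_B = (n_AB + n_aB) / total      # frequency of allele B
    d = p_AB - p_A * p_B             # departure from independence
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = d * d / denom if denom > 0 else math.nan
    return d, r2


if __name__ == "__main__":
    # Complete association: only the A-B and a-b haplotypes are present.
    print(ld_stats(50, 0, 0, 50))    # (0.25, 1.0)
    # No association: all four haplotypes are equally common.
    print(ld_stats(25, 25, 25, 25))  # (0.0, 0.0)
```

Applied to pairs of sites at increasing distances along a chromosome, a statistic of this kind is one way to describe how quickly LD decays with distance.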
LD has received much attention recently because it may be used for fine-scale mapping [41] of genes that are responsible for naturally occurring phenotypic variation (e.g. human disease loci). The idea behind LD mapping is simply to look for marker alleles, or multi-locus haplotypes, that are associated with the phenotype in the general population. Neither crosses nor pedigrees are needed. LD mapping depends crucially on the chromosomal extent of LD. If LD decays too slowly with distance, it cannot be used for fine-scale mapping; if it decays too rapidly, an impracticably dense map is needed [42,43]. Genomic data on the extent of LD are only available for a few organisms. Several studies have been made in humans, but the picture is far from clear [44]. Depending on the region and sample, estimates range from ten to several hundred kilobase pairs. In Drosophila, LD typically decays within 1 kb [45]. Genomic surveys of LD are available for two plant species. In maize, LD decays rapidly, on a scale similar to or even faster than that observed in Drosophila [39,46]. In Arabidopsis, LD is much more extensive, decaying within perhaps 250 kb (M Nordborg et al., unpublished data). This is consistent with the difference in breeding system between these two species: whereas maize undergoes outcrossing, Arabidopsis is highly selfing, and selfing is expected to increase LD greatly [47]. The difference in LD between these two organisms has implications for LD mapping. A recent study in maize found association between particular polymorphic sites and phenotypic variation for flowering time [19••]. Given the much more extensive LD, mapping at such a fine scale will almost certainly not be possible in A. thaliana. On the other hand, the extensive LD in this species may make it feasible to carry out genomic screens using markers every 50 kb or so, an approach that is unlikely to work in maize. 
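The practical consequence of this difference in the extent of LD can be made concrete with a rough calculation of the number of evenly spaced markers a genome-wide screen would require. The Python sketch below is only a back-of-the-envelope illustration. The 50 kb spacing for A. thaliana comes from the text, whereas the genome sizes (about 125 Mb for A. thaliana and about 2,500 Mb for maize) and the 1 kb spacing for maize (chosen to match LD that decays within roughly 1 kb) are round-figure assumptions introduced here, as is the function name.

```python
import math


def markers_needed(genome_length_bp, spacing_bp):
    """Number of evenly spaced markers needed to cover a genome of the given length."""
    if spacing_bp <= 0:
        raise ValueError("spacing must be positive")
    return math.ceil(genome_length_bp / spacing_bp)


# Assumed round figures, not taken from the text.
ARABIDOPSIS_GENOME_BP = 125_000_000
MAIZE_GENOME_BP = 2_500_000_000

if __name__ == "__main__":
    # Spacing of 50 kb, as suggested in the text for A. thaliana.
    print("A. thaliana:", markers_needed(ARABIDOPSIS_GENOME_BP, 50_000))
    # Spacing of 1 kb is an assumption for maize.
    print("maize:", markers_needed(MAIZE_GENOME_BP, 1_000))
```

Under these assumptions the required marker density differs by about three orders of magnitude between the two species, which is the sense in which a genomic screen is feasible in one and impractical in the other.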
Conclusions

Even though the first study of DNA sequence polymorphism was published almost 20 years ago [48], such studies have started to become common only recently [1]. Ironically, we are just about to witness another technological leap: from studies of single loci to genomic polymorphism studies. It is certain that several completely sequenced genomes of model organisms will soon be available. To make sense of these data, much new theory will be needed. The amount of data will make it possible to take population structure into account [49] and to identify many polymorphisms that are selectively important. Comparative studies between closely related species or variants may reveal much about the molecular basis of adaptive evolution. In plants, the evolution of development is likely to be of particular interest. Because polymorphism data are now important in fields other than evolutionary biology (e.g. in genetic epidemiology), it seems certain that population genetics will receive much more attention in the next 20 years than in the past 20 years.

Acknowledgements

We would like to thank D. Weigel for comments on the manuscript.

References and recommended reading

Papers of particular interest, published within the annual period of review, have been highlighted as:
• of special interest
•• of outstanding interest

1. Felsenstein J: From population genetics to evolutionary genetics: a view through the trees. In Evolutionary Genetics: From Molecules to Morphology. Edited by Singh RS, Krimbas CB. New York: Cambridge University Press; 2000:609-627.
2. Chakravarti A: Population genetics - making sense out of sequence. Nat Genet 1999, 21:56-60.
3. Kimura M: The Neutral Theory of Molecular Evolution. Cambridge, UK: Cambridge University Press; 1983.
4. • Chen F-C, Li W-H: Genomic divergences between humans and other hominoids and the effective population size of the common ancestor of humans and chimpanzees. Am J Hum Genet 2001, 68:444-456.
   This paper investigates the human–chimp–gorilla relationship using many genes: an excellent introduction to the difference between species trees and gene trees.
5. • Nordborg M: Coalescent theory. In Handbook of Statistical Genetics. Edited by Balding DJ, Bishop M, Cannings C. Chichester, UK: John Wiley & Sons Inc; 2001:179-212.
   See annotation [6•].
6. • Stephens M: Inference under the coalescent. In Handbook of Statistical Genetics. Edited by Balding DJ, Bishop M, Cannings C. Chichester, UK: John Wiley & Sons Inc; 2001:213-238.
   These two papers [5•,6•] introduce coalescent theory, and discuss the statistical issues involved in the analysis of polymorphism data. The original description of the coalescent can be found in [7–10].
7. Kingman JFC: On the genealogy of large populations. J Appl Prob 1982, 19A:27-43.
8. Kingman JFC: The coalescent. Stochastic Proc Appl 1982, 13:235-248.
9. Hudson RR: Properties of a neutral allele model with intragenic recombination. Theor Popul Biol 1983, 23:183-201.
10. Tajima F: Evolutionary relationship of DNA sequences in finite populations. Genetics 1983, 105:437-460.
11. Cann RL, Stoneking M, Wilson AC: Mitochondrial DNA and human evolution. Nature 1987, 325:31-36.
12. Przeworski M, Hudson RR, Di Rienzo A: Adjusting the focus on human variation. Trends Genet 2000, 16:296-302.
13. Innan H, Tajima F, Terauchi R, Miyashita NT: Intragenic recombination in the Adh locus of the wild plant Arabidopsis thaliana. Genetics 1996, 143:1761-1770.
14. Aguadé M: Nucleotide sequence variation at two genes of the phenylpropanoid pathway, the FAH1 and F3H genes, in Arabidopsis thaliana. Mol Biol Evol 2001, 18:1-9.
15. Miyashita NT, Kawabe A, Innan H: DNA variation in the wild plant Arabidopsis thaliana revealed by amplified fragment length polymorphism analysis. Genetics 1999, 152:1723-1731.
16. Sharbel TF, Haubold B, Mitchell-Olds T: Genetic isolation by distance in Arabidopsis thaliana: biogeography and postglacial colonization of Europe. Mol Ecol 2000, 9:2109-2118.
17. Innan H, Stephan W: The coalescent in an exponentially growing metapopulation and its application to Arabidopsis thaliana. Genetics 2000, 155:2015-2019.
18. Lin J-Z, Brown AHD, Clegg MT: Heterogeneous geographic patterns of nucleotide sequence diversity between two alcohol dehydrogenase genes in wild barley (Hordeum vulgare subspecies spontaneum). Proc Natl Acad Sci USA 2001, 98:531-536.
19. •• Thornsberry JM, Goodman MM, Doebley J, Kresovich S, Nielsen D, Buckler ES IV: Dwarf8 polymorphisms associate with variation in flowering time. Nat Genet 2001, 28:286-289.
   The first example of LD mapping in plants, and the first example in any organism of the use of unlinked markers to correct for the effects of population structure.
20. Buckler ES IV, Thornsberry JM, Kresovich S: Molecular diversity, structure and domestication of grasses. Genet Res 2001, 77:213-218.
21. White SE, Doebley JF: The molecular evolution of terminal ear1, a regulatory gene in the genus Zea. Genetics 1999, 153:1455-1462.
22. •• Wang R-L, Stec A, Hey J, Lukens L, Doebley J: The limits of selection during maize domestication. Nature 1999, 398:236-239.
   This study suggests that the tb1 locus may have been subject to a ‘selective sweep’ during the domestication of maize. Similar studies of other loci, and in other species, are likely to follow.
23. • Kreitman M: Methods to detect selection in populations with applications to the human. Annu Rev Genomics Hum Genet 2000, 1:539-559.
   A good summary of the many methods for detecting selection that have been developed during the past 20 years.
24. Bishop JG, Dean AM, Mitchell-Olds T: Rapid evolution in plant chitinases: molecular targets of selection in plant–pathogen coevolution. Proc Natl Acad Sci USA 2000, 97:5322-5327.
25. Barrier M, Robichaux RH, Purugganan MD: Accelerated regulatory gene evolution in an adaptive radiation. Proc Natl Acad Sci USA 2001, 98:10208-10213.
26. Kaplan NL, Hudson RR, Langley CH: The ‘hitch-hiking’ effect revisited. Genetics 1989, 123:887-899.
27. Hudson RR, Kaplan NL: The coalescent process in models with selection and recombination. Genetics 1988, 120:831-840.
28. Hughes AL, Nei M: Pattern of nucleotide substitution at major histocompatibility complex loci reveals overdominant selection. Nature 1988, 335:167-170.
29. Parham P, Ohta T: Population biology of antigen presentation by MHC class I molecules. Science 1996, 272:67-74.
30. Charlesworth D, Awadalla P: Flowering plant self-incompatibility: the molecular population genetics of Brassica S-loci. Heredity 1998, 81:1-9.
31. •• Stahl EA, Dwyer G, Mauricio R, Kreitman M, Bergelson J: Dynamics of disease resistance polymorphism at the Rpm1 locus of Arabidopsis. Nature 1999, 400:667-671.
   The authors describe the striking pattern of DNA polymorphisms in the flanking region of an insertion/deletion polymorphism of the Rpm1 gene, which is presumably maintained by strong balancing selection.
32. Filatov DA, Charlesworth D: DNA polymorphism, haplotype structure and balancing selection in the Leavenworthia PgiC locus. Genetics 1999, 153:1423-1434.
33. Begun DJ, Aquadro CF: Levels of naturally occurring DNA polymorphism correlate with recombination rates in D. melanogaster. Nature 1992, 356:519-520.
34. Charlesworth B, Morgan MT, Charlesworth D: The effect of deleterious mutations on neutral molecular variation. Genetics 1993, 134:1289-1303.
35. Dvorák J, Luo M-C, Yang Z-L: Restriction fragment length polymorphism and divergence in the genomic regions of high and low recombination in self-fertilizing and cross-fertilizing Aegilops species. Genetics 1998, 148:423-434.
36. Stephan W, Langley CH: DNA polymorphism in Lycopersicon and crossing-over per physical length. Genetics 1998, 150:1585-1593.
37. Baudry E, Kerdelhué C, Innan H, Stephan W: Species and recombination effects on DNA variability in the tomato genus. Genetics 2001, 158:1725-1735.
38. Kraft T, Säll T, Magnusson-Rading I, Nilsson N-O, Halldén C: Positive correlation between recombination rates and levels of genetic variation in natural populations of sea beet (Beta vulgaris subsp. maritima). Genetics 1998, 150:1239-1244.
39. Tenaillon MI, Sawkins MC, Long AD, Gaut RL, Doebley JF, Gaut BS: Patterns of DNA sequence polymorphism along chromosome 1 of maize (Zea mays ssp. mays L.). Proc Natl Acad Sci USA 2001, 98:9161-9166.
40. Griffiths RC, Marjoram P: An ancestral recombination graph. In Progress in Population Genetics and Human Evolution. Edited by Donnelly P, Tavaré S. New York: Springer-Verlag; 1997:257-270.
41. Cardon LR, Bell JI: Association study designs for complex diseases. Nature Rev Genet 2001, 2:91-99.
42. Kruglyak L: Prospects for whole-genome linkage disequilibrium mapping of common disease genes. Nat Genet 1999, 22:139-144.
43. Altshuler D, Daly M, Kruglyak L: Guilt by association. Nat Genet 2000, 26:135-137.
44. Pritchard JK, Przeworski M: Linkage disequilibrium in humans: models and data. Am J Hum Genet 2001, 69:1-14.
45. Langley CH, Lazzaro BP, Phillips W, Heikkinen E, Braverman JM: Linkage disequilibria and the site frequency spectra in the su(s) and su(wa) regions of the Drosophila melanogaster X chromosome. Genetics 2000, 156:1837-1852.
46. Remington DL, Thornsberry JM, Matsuoka Y, Wilson LM, Whitt SR, Doebley J, Kresovich S, Goodman MM, Buckler ES IV: Structure of linkage disequilibrium and phenotypic associations in the maize genome. Proc Natl Acad Sci USA 2001, 98:11479-11484.
47. Nordborg M: Linkage disequilibrium, gene trees and selfing: an ancestral recombination graph with partial self-fertilization. Genetics 2000, 154:923-929.
48. Kreitman M: Nucleotide polymorphism at the alcohol dehydrogenase locus of Drosophila melanogaster. Nature 1983, 304:412-417.
49. Pritchard J, Stephens M, Donnelly P: Inference of population structure using multilocus genotype data. Genetics 2000, 155:945-959.
