PopGen1: Introduction to population genetics

Introduction

MICROEVOLUTION is the term used to describe the dynamics of evolutionary change in populations and species over time. The discipline devoted to the study of microevolution at the molecular level is called POPULATION GENETICS. Note that the complementary term, MACROEVOLUTION, is most often used to describe patterns and processes of diversification above the species level, such as rates of speciation and the origin of evolutionary novelties. Disciplines such as PHYLOGENETICS, COMPARATIVE BIOLOGY, and PALAEONTOLOGY concern themselves with macroevolution. It is important to understand that the process of microevolution can generate most, if not all, macroevolution by operating over the relevant time spans.

Since we are interested in the outcome of evolution at the molecular level, i.e., the diversity of genes, proteins, chromosomes, genomes, and organisms that we see in the world around us today, we must understand population genetics. The principles of population genetics are used to (i) measure the extent of genetic variation in natural populations, and (ii) explain natural genetic variation in terms of its origin, maintenance, and evolutionary significance.

The practice of population genetics is based on using the information contained in a sample of a population to make inferences about the evolutionary processes affecting that population. The discipline often uses explicit models as the basis for testing hypotheses about the evolutionary processes that generated the sample of data, and for estimating parameters; this approach is called PARAMETRIC INFERENCE. Sometimes no model is assumed, and that approach is called NONPARAMETRIC INFERENCE.

The discipline of population genetics has practical value in a wide variety of other disciplines. Medical biology relies heavily on population genetics to detect links between genetic variation and disease.
Agriculture uses the principles in the development of breeding programs. Conservation, fisheries, wildlife, and forestry biologists all use population genetics. In fact, the combination of molecular data and population genetic principles has revolutionized the fields of conservation biology and ecology. Anthropologists are very interested in population genetics because it can be used to make inferences about patterns of migration and interbreeding during the history of human evolution.

Until this point the term POPULATION has been used rather informally, but now is the time for a formal definition. A population is a subgroup of individuals of the same species living within some set of restrictions, usually a restricted geographic area. The term population is not used to refer to an entire species. Individuals of a species are rarely distributed uniformly; they aggregate according to the most favourable aspects of the environment. For terrestrial vertebrates such aggregations are often obvious: herds, colonies, habitat preferences such as marshes or deserts, etc. Less clear are the boundaries associated with genetic similarity.

In population genetics, the population is defined by the restriction of local interbreeding. Sometimes such populations are called LOCAL POPULATIONS or DEMES, but very often they are simply called a POPULATION. The practical definition will vary from species to species, and also from researcher to researcher. Although a more precise definition is difficult, for our purposes we need only assume that we have defined the population appropriately. We want to define the local population correctly because it is the unit within which the evolution of adaptive characteristics occurs; i.e., the evolving unit of the species.

So, for the purposes of this course we come to a definition of EVOLUTION: the change in the genetic constitution of a population over time, or simply the change in allele frequencies in a population over time.

Allele frequencies in populations

In population genetics, variation is often quantified using the notion of ALLELE FREQUENCIES. Alleles are the alternative forms of genes, including the so-called “normal” form or “wild-type” form. The allele frequencies are the simple proportions of different alleles within a local population. Note that sometimes the term GENE FREQUENCY will be used synonymously with allele frequency. Long-term changes in allele frequencies, i.e., evolution, occur within the unit of the local population. Before we can measure evolution by the change in allele frequencies in a population, we must be able to measure the frequencies of those alleles.

Consider an autosomal locus in a collection of diploid individuals. Let’s take the human M-N blood group as one example. In this blood group there are two alleles, M and N. In this case there are three possible GENOTYPES: MM, MN, or NN. A sample of the population of Icelandic Eskimos was typed according to the MN blood group and the data are presented in the table below:

Genotype    MM     MN     NN     Total (n)
Count       233    385    129    747

From the data in this table we can compute the allele frequencies of M and N in this sample. By convention we use p and q to represent the frequencies of M and N, with p + q = 1.

$$p = \frac{2(MM) + 1(MN)}{2n} \quad \text{and} \quad q = 1 - p$$

Using the data in the table:

$$p = \frac{2(233) + 385}{2 \times 747} = 0.57 \quad \text{and} \quad q = 1 - 0.57 = 0.43$$

The exercise in calculating the frequencies of M and N leads us to an important point: the quantities p and q are PARAMETERS of a population with unknown true values. We had to estimate the values of those parameters from a sample of the population. Such estimates will have an error, and we must judge the reliability of our estimate by estimating its VARIANCE.
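
The allele-counting estimate above can be checked with a few lines of code. What follows is a minimal sketch, not part of the original notes: the function names are hypothetical, and the second function anticipates the binomial sampling variance p(1 − p)/(2n) and the ± 2 standard error interval that the text develops next.

```python
import math


def allele_frequencies(n_mm, n_mn, n_nn):
    """Estimate p (frequency of M) and q (frequency of N) by counting alleles.

    Each MM individual carries two M alleles, each MN individual carries one,
    and the sample holds 2n alleles in total.
    """
    n = n_mm + n_mn + n_nn
    p = (2 * n_mm + n_mn) / (2 * n)
    return p, 1.0 - p


def approx_interval(p, n, z=2.0):
    """Return (variance, standard error, interval) for an estimate p.

    Assumes the binomial sampling variance p(1 - p) / (2n) and a normal
    approximation; z=2.0 gives the roughly 95% interval used in the notes.
    """
    var = p * (1.0 - p) / (2 * n)
    se = math.sqrt(var)
    return var, se, (p - z * se, p + z * se)


# MN blood group counts from the table above.
p, q = allele_frequencies(233, 385, 129)
var, se, ci = approx_interval(p, 747)
print(f"p = {p:.2f}, q = {q:.2f}, SE = {se:.4f}, interval = ({ci[0]:.3f}, {ci[1]:.3f})")
```

Counting alleles directly from genotype counts, rather than working from rounded genotype frequencies, avoids compounding rounding error in later steps.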
We can estimate the sampling variance of p with:

$$\mathrm{Var}(p) = \frac{p(1 - p)}{2n}$$

Assuming that repeated estimates would be normally distributed, we can use the variance to make predictions about how close the true frequency of the allele in the population is to our estimate of the frequency. First we take the square root of the variance; this is known as the STANDARD ERROR of the estimate. Under the normal distribution approximately 68% of the estimates of a parameter would fall within the interval defined by the true value ± 1 standard error. This is called a CONFIDENCE INTERVAL. The 95% confidence interval is defined by ± 2 standard errors, and the 99.7% confidence interval is defined by ± 3 standard errors.

Let’s use the variance to evaluate the effect of sample size on the precision of an estimate of allele frequencies. Suppose we obtain the following sample from a population: AA = 8, Aa = 6, and aa = 2. Let’s use the above formulas to estimate p and q and the 95% confidence interval.

p = 0.6875, Var(p) = 0.0067, 95% CI = (0.524, 0.851)

Now let’s recalculate the results after multiplying every genotype count in the sample by a factor of 10.

p = 0.6875, Var(p) = 0.00067, 95% CI = (0.636, 0.739)

Increasing the sample size narrowed our confidence interval, indicating that the precision of our estimate has increased. Of course, had we actually taken another sample 10 times larger, we would not have estimated p exactly as we had in our small sample. The 95% CI is a tool for quantifying this sort of sampling error.

Changes in gene frequencies between populations

Let’s return to our sample of the frequencies of the MN blood groups from Iceland, and compare it to a sample from Greenland. The data, presented in the table below, clearly indicate a difference in genotype frequencies between these populations. We can make at least one conclusion based on these data: evolution has occurred!

             Genotypes
             MM       MN       NN       Total (n)
Iceland      0.312    0.515    0.173    747
Greenland    0.835    0.156    0.009    569

We are now faced with the task of identifying the origin and evolutionary significance of such change. The causes of change are the subject matter for later lectures. For now we will simply list the possible causes of change:

1. Finite population size: The alleles passed from one generation to another are a sample of the alleles of the parental generation. Because the proportion of alleles in the sample is subject to sampling variation between generations, allele frequencies will vary from generation to generation (evolution by our definition). The size of the sample, in this case the number of members of the population that are reproducing, influences the strength of this effect. Think back to the difference in the variance of our samples of allele frequencies above.

2. Mutation: Of course if a mutation occurs then a change in allele frequencies has occurred. You have probably already guessed that a single mutational event will have a very small effect in large populations. It is other forces, such as sampling variance or selection, that determine the evolutionary significance of a new mutant over evolutionary time.

3. Non-random mating: Suppose that potential mates in the parental generation have a preference for a certain trait with a genetic basis. If they preferentially choose mates with such a trait then the frequency of the allele encoding that trait will increase in the next generation. This is just one example of how non-random mating influences allele frequencies.

4. Natural selection: Individuals in a population differ in their ability to survive and reproduce (fitness). If an allele confers an increased fitness on those individuals who carry it, then they will make a greater contribution to subsequent generations, thus increasing the frequency of that allele over time. Of course it can work the other way.
If a new allele arose by mutation and reduced that individual’s fitness, then that individual might not reproduce and the allele would not appear in the next generation.

5. Migration or Gene Flow: Given two or more populations, immigration of individuals from outside the population will change the frequencies of alleles in that population.

In order to identify which of these is relevant, we must first understand the simple case where all these factors are excluded from the system. This very simple model results in a type of equilibrium that will be the focus of the remainder of this topic.

Hardy-Weinberg equilibrium

“A MODEL is an intentional simplification of a complex situation designed to eliminate extraneous detail in order to focus attention on the essentials of the situation” (Daniel L. Hartl).

We need a model of an idealized, unchanging population as our starting point. Under such a model gene frequencies will not change between generations: i.e., no evolution occurs. Starting with such a model we can then add different causes of change, such as natural selection or non-random mating, and see how such processes affect the relationship between genotype frequencies and allele frequencies across generations. Modern population and quantitative genetics depends on understanding this relationship.

The HARDY-WEINBERG (HW) MODEL provides us with the model we need. It represents a population at equilibrium with respect to allele and genotype frequencies. Independently specified in 1908 by G.H. Hardy and W. Weinberg, the model is based on the following assumptions:

Assumptions of the HW model
1. The organism is diploid
2. Reproduction is sexual
3. Mating is random
4. Generations are discrete
5. Population size is infinite (or very large)
6. No migration
7. No mutation
8. No natural selection

Under these assumptions we may specify the probabilities of all possible combinations of gametes in the next generation as:

$$(p + q) \times (p + q)$$

Do you remember the PRODUCT RULE and ADDITIVE RULE from your genetics class? Also, the proportions of all possible genotypes must sum to 1:

$$(p + q) \times (p + q) = 1$$

Multiplication gives:

$$p^2 + 2pq + q^2 = 1$$

Let’s say that p is the frequency of the A allele and q is the frequency of the a allele. Then the genotype frequencies in the next generation are as follows:

$$f_{AA} = p^2 \qquad f_{Aa} = 2pq \qquad f_{aa} = q^2$$

We can see this in the form of a table:

                           Male gametes
                           A (p)           a (q)
Female gametes    A (p)    AA ($p^2$)      Aa ($pq$)
                  a (q)    Aa ($pq$)       aa ($q^2$)

Keynotes of the HW model:

1. The HW model specifies the relationship between allele frequencies and genotype frequencies in a population. Such a relationship can be tested with real data. When a population is consistent with HW expectations we say that population is in HARDY-WEINBERG EQUILIBRIUM.

2. The standard mechanisms of Mendelian inheritance ensure that allele frequencies do not change from generation to generation unless some external pressure is acting.

3. No matter what the initial frequencies, just one generation of random mating will result in HW frequencies.

These features provide the foundation for many methods of investigation of the genetics of natural populations. In particular it is the departure from HW equilibrium, and the cause of such a departure (i.e., which assumptions of the model have been violated), that interests population geneticists. Unfortunately, HW is not very sensitive to certain kinds of violations. This means that failing to reject HW equilibrium for a natural population does not allow us to conclude that all the assumptions are valid.
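
Keynotes 2 and 3 can be illustrated numerically. The following is a minimal sketch under the HW assumptions listed above; the function names and the starting genotype frequencies are hypothetical choices made only for illustration.

```python
def hw_genotype_frequencies(p):
    """Genotype frequencies produced by random union of gametes.

    p is the frequency of allele A; q = 1 - p is the frequency of allele a.
    """
    q = 1.0 - p
    return {"AA": p * p, "Aa": 2.0 * p * q, "aa": q * q}


def allele_frequency(f_AA, f_Aa):
    """Frequency of A: all alleles in AA individuals plus half of those in Aa."""
    return f_AA + 0.5 * f_Aa


# Hypothetical starting population far from HW proportions: no heterozygotes.
start = {"AA": 0.5, "Aa": 0.0, "aa": 0.5}
p0 = allele_frequency(start["AA"], start["Aa"])

# One generation of random mating.
gen1 = hw_genotype_frequencies(p0)
p1 = allele_frequency(gen1["AA"], gen1["Aa"])

# A second generation changes nothing further.
gen2 = hw_genotype_frequencies(p1)

print("generation 0:", start, "p =", p0)
print("generation 1:", gen1, "p =", p1)
print("generation 2:", gen2)
```

The allele frequency is the same before and after mating, while the genotype frequencies reach HW proportions in the first generation and stay there in the second.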

Rare recessive alleles can “hide” in the heterozygotes

The relationship between allele and genotype frequency under HW is informative: when a recessive allele is rare, most of the individuals who carry the allele are heterozygous. The ratio of heterozygotes to rare homozygotes is $2pq : q^2$:

q              0.1     0.01     0.001
$2pq : q^2$    18:1    198:1    1998:1

Let’s look at a real example. Cystic fibrosis (CF) is a serious human illness caused by an autosomal recessive allele. An individual homozygous recessive for the CF “gene” will have abnormally thickened glandular secretions, which block the normal functions of organs such as the lungs and digestive organs and cause difficulties in breathing and digestion. The frequency of CF is about 1 in 1700 newborn Caucasians.

$$q^2 = 1/1700$$

$$q = (1/1700)^{1/2} = 0.024$$

ASSUMING HW:

Frequency of heterozygotes in population $= 2pq = 2 \times 0.024 \times (1 - 0.024) = 0.047$

So, if we allowed a generation of random mating and no selection, 1 in 21 people would be heterozygous “carriers” of the CF allele, even though CF itself is quite rare in this population, at 1 in 1700!

Note that in this case we just assumed HW without testing whether the assumption was reasonable. CF is clearly under selection, and population sizes are finite. However, heterozygous individuals are not subject to selection, and the example illustrates that deleterious alleles can “lurk” in a large fraction of the population without any overt phenotypic effect. More realistic models of CF require consideration of selection, mutation, and drift; the frequency of the CF allele in human populations reflects the interaction of all these evolutionary forces. We will return to this example later in the course.
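
The carrier calculation above is easy to reproduce. This is a minimal sketch that, like the worked example, simply assumes HW proportions; the function names are hypothetical.

```python
import math


def carrier_frequency(affected_frequency):
    """Return (q, 2pq) given the frequency of recessive homozygotes, q^2.

    Assumes HW proportions, so q is the square root of the affected frequency.
    """
    q = math.sqrt(affected_frequency)
    p = 1.0 - q
    return q, 2.0 * p * q


def het_to_hom_ratio(q):
    """Ratio of heterozygotes to rare homozygotes, 2pq / q^2 = 2(1 - q) / q."""
    return 2.0 * (1.0 - q) / q


q, carriers = carrier_frequency(1 / 1700)
print(f"q = {q:.3f}, carrier frequency = {carriers:.3f}, about 1 in {round(1 / carriers)}")

for rare_q in (0.1, 0.01, 0.001):
    print(f"q = {rare_q}: heterozygotes per rare homozygote = {het_to_hom_ratio(rare_q):.0f}")
```

The ratio simplifies to 2(1 − q)/q, which makes it plain why heterozygotes dominate as q becomes small.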

A proof of the HW principle

Let: $P_1 = p^2$, $P_2 = 2pq$, $P_3 = q^2$ in the parental generation

Genotypes               AA       Aa       aa
Genotype frequencies    $P_1$    $P_2$    $P_3$

Allele frequencies:
A: $p = (P_1 \times 1) + (P_2 \times 1/2)$
a: $q = (P_3 \times 1) + (P_2 \times 1/2)$

The question is: what happens to the three genotype frequencies ($P_1'$, $P_2'$, and $P_3'$) in the next generation? Under HW, $P_1 = P_1'$ and $P_2 = P_2'$ and $P_3 = P_3'$. We can test for this outcome by listing all possible matings and the frequencies of their offspring genotypes. Then, for each of the three possible genotypes, we sum over matings the product of the mating frequency and the fraction of offspring with that genotype. The results are shown in the table below.

                                Genotype frequencies of offspring
Mating       Frequency          AA       Aa       aa
AA × AA      $P_1^2$            1        0        0
AA × Aa      $2 P_1 P_2$        1/2      1/2      0
AA × aa      $2 P_1 P_3$        0        1        0
Aa × Aa      $P_2^2$            1/4      1/2      1/4
Aa × aa      $2 P_2 P_3$        0        1/2      1/2
aa × aa      $P_3^2$            0        0        1
Total in next generation:       $P_1'$   $P_2'$   $P_3'$

$$P_1' = P_1^2 + P_1 P_2 + \tfrac{1}{4} P_2^2 = \left(P_1 + \frac{P_2}{2}\right)^2 = p^2$$

$$P_2' = P_1 P_2 + 2 P_1 P_3 + \tfrac{1}{2} P_2^2 + P_2 P_3 = 2\left(P_1 + \frac{P_2}{2}\right)\left(P_3 + \frac{P_2}{2}\right) = 2pq$$

$$P_3' = \tfrac{1}{4} P_2^2 + P_2 P_3 + P_3^2 = \left(P_3 + \frac{P_2}{2}\right)^2 = q^2$$

So, we have our proof; after all this we see that allele frequencies do NOT change from generation to generation under HW!

Testing for HW proportions

We test a population for HW frequencies in three steps: (i) compute observed genotype frequencies; (ii) compute expected genotype frequencies under HW; and (iii) conduct a goodness of fit test. We will use the following data as an example:

MN blood types in Pueblo Indians
Genotypes          MM    MN    NN    Total (n)
Observed counts    83    46    11    140

PART 1:

Observed genotype frequencies:
MM = 83/140 = 0.59
MN = 46/140 = 0.33
NN = 11/140 = 0.08

Observed allele frequencies:
M = p = 0.59 + 0.33(1/2) = 0.755
N = q = (1 − 0.755) = 0.245

Do NOT compute the allele frequencies at this step by assuming HW (i.e., $p^2 \neq 0.59$). Anyone who does this will automatically get an F in the class!
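
The three steps listed at the start of this section can be sketched end to end in code, running ahead of the hand calculation that continues in PART 2 and PART 3. This is a minimal sketch with a hypothetical function name. It carries full precision rather than the rounded intermediate values used in the hand calculation, so its test statistic differs slightly from the hand-computed one. The P-value line relies on the identity P = erfc(sqrt(χ²/2)), which holds only for a chi-square distribution with 1 degree of freedom.

```python
import math


def hw_chi_square(n_mm, n_mn, n_nn):
    """Goodness-of-fit test of HW proportions for a two-allele locus.

    Returns (p, expected_counts, chi2, p_value). The p_value formula is valid
    only for 1 degree of freedom (3 classes - 1 - 1 estimated parameter).
    """
    n = n_mm + n_mn + n_nn

    # Step (i): observed allele frequency, counted directly from genotypes
    # (never by taking the square root of a genotype frequency).
    p = (2 * n_mm + n_mn) / (2 * n)
    q = 1.0 - p

    # Step (ii): expected genotype counts under HW.
    observed = [n_mm, n_mn, n_nn]
    expected = [p * p * n, 2.0 * p * q * n, q * q * n]

    # Step (iii): chi-square statistic and its upper-tail probability (1 d.f.).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return p, expected, chi2, p_value


p, expected, chi2, p_value = hw_chi_square(83, 46, 11)
print(f"p = {p:.3f}")
print("expected counts:", [round(e, 1) for e in expected])
print(f"chi-square = {chi2:.2f}, P = {p_value:.2f}")
```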

PART 2:

Expected genotype frequencies:
$p^2 = (0.755)^2 = 0.57$
$2pq = 2 \times 0.755 \times 0.245 = 0.37$
$q^2 = (0.245)^2 = 0.06$

Expected genotype counts:
$p^2 \times n = 0.57 \times 140 = 79.8$
$2pq \times n = 0.37 \times 140 = 51.8$
$q^2 \times n = 0.06 \times 140 = 8.4$

Compare these counts to the observed counts in the table above.

PART 3:

$$\chi^2 = \sum \frac{(\text{observed} - \text{expected})^2}{\text{expected}} = \frac{(83 - 79.8)^2}{79.8} + \frac{(46 - 51.8)^2}{51.8} + \frac{(11 - 8.4)^2}{8.4}$$

$$\chi^2 = 0.128 + 0.649 + 0.805$$

$$\chi^2 = 1.58$$

d.f. = [(number of classes) – (non-independent categories) – (calculate p from data)] = [3 – 1 – 1] = 1

P = 0.20; i.e., there is about a 20% chance that we would have observed a test statistic at least this large under HW, so we do not reject HW at the conventional 5% level.

Testing for HW is not possible under dominance

DOMINANCE occurs when the phenotypic effect of one allele conceals the effect of an alternate allele. An example is the D allele of the human Rh blood group:

Genotype     DD     Dd     dd
Phenotype    Rh+    Rh+    Rh-

In North America the frequency of the Rh+ phenotype (DD or Dd) is 0.858 and the frequency of the Rh- phenotype (dd) is 0.142. If we are willing to assume HW then we can calculate the expected frequencies of D and d:

$$q = (0.142)^{1/2} = 0.3768$$

$$p = (1 - 0.3768) = 0.6232$$

Why not use these to test HW? These are EXPECTED frequencies under HW! The calculated frequencies fit the expected frequencies exactly, and no degrees of freedom remain for a test:

d.f. = 2 (for two classes of data) – 1 – 1 (for estimating q) = 0

HW equilibrium with three alleles

There is no reason why the allelic variation at a locus must be limited to two variants. Micro-satellite loci commonly have very large numbers of alleles per locus. Let’s look at HW for the case of three autosomal alleles:

Alleles: $A_1$, $A_2$, and $A_3$
Frequencies: $p_1$, $p_2$, and $p_3$
$p_1 + p_2 + p_3 = 1$

We can do a cross multiplication square that shows the HW frequencies for three alleles at a single autosomal locus. Remember we are still making all the HW assumptions listed above.
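
The same cross multiplication extends to any number of alleles: each homozygote has frequency $p_i^2$ and each heterozygote has frequency $2 p_i p_j$. The following is a minimal sketch under the HW assumptions; the function name and the example allele frequencies are hypothetical and chosen only for illustration.

```python
from itertools import combinations_with_replacement


def hw_multiallele(freqs):
    """HW genotype frequencies for any number of alleles at one autosomal locus.

    freqs maps allele name -> frequency; the frequencies must sum to 1.
    Returns a dict mapping an unordered genotype (a, b) to its frequency.
    """
    if abs(sum(freqs.values()) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")

    genotypes = {}
    for a, b in combinations_with_replacement(sorted(freqs), 2):
        if a == b:
            genotypes[(a, b)] = freqs[a] ** 2            # homozygote: p_i^2
        else:
            genotypes[(a, b)] = 2.0 * freqs[a] * freqs[b]  # heterozygote: 2 p_i p_j
    return genotypes


# Hypothetical allele frequencies for illustration only.
example = {"A1": 0.5, "A2": 0.3, "A3": 0.2}
for genotype, frequency in hw_multiallele(example).items():
    print("".join(genotype), round(frequency, 4))
```

The factor of 2 for heterozygotes reflects the two cells of the square (female $A_i$ with male $A_j$, and the reverse) that produce the same genotype.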

                               Male gametes
                               A1 ($p_1$)          A2 ($p_2$)          A3 ($p_3$)
Female gametes    A1 ($p_1$)   A1A1 $p_1^2$        A1A2 $p_1 p_2$      A1A3 $p_1 p_3$
                  A2 ($p_2$)   A1A2 $p_1 p_2$      A2A2 $p_2^2$        A2A3 $p_2 p_3$
                  A3 ($p_3$)   A1A3 $p_1 p_3$      A2A3 $p_2 p_3$      A3A3 $p_3^2$

Frequencies in the next generation:
A1A1: $p_1^2$
A1A2: $2 p_1 p_2$
A1A3: $2 p_1 p_3$
A2A2: $p_2^2$
A2A3: $2 p_2 p_3$
A3A3: $p_3^2$

Hardy-Weinberg as a null model

We have seen that under the assumptions of the Hardy-Weinberg model, i.e., a large random mating population with no selection, the allele frequencies are stable over time. In other words our model is the equivalent of “no evolution”. Under this model nothing interesting ever happens: what good is such a model? It might not be interesting, but it is valuable as our NULL MODEL. By changing the assumptions of HW one at a time, through models that explicitly relax the assumption in question, we can investigate the effect of each on evolution. This approach is the basis of the next five topics in population genetics. The table below summarizes some of these assumptions and their effects.

                        Change in frequencies
Agency                  Genotype    Allele    Notes
Linkage                 no          no        Creates disequilibrium among loci
Inbreeding              yes         no        Acts on all loci in genome; results in loss of heterozygosity
Assortative Mating      yes         no        Only acts on the locus subject to assortment, and those loci linked to it
Migration (a)           yes         yes       Depends on migration rate and frequency differences between populations
Mutation                yes         yes       Very, very, very slow
Natural Selection       yes         yes       Acts on the locus subject to selection, and those loci linked to it
Genetic Drift           yes         yes       Acts on all loci in the genome; results in loss of heterozygosity and loss of alleles

(a) We use the term migration in population genetics to mean movement of genes or individuals among populations, in the sense of “gene flow”. Migration has a different meaning in the discipline of ecology. We do not examine the impact of migration on evolution in this course; a full course in population genetics will cover migration in detail.
Don’t worry if the table does not make much sense the first time you look at it. At some point you will look back on it and it will.