Practical Guide to Population Genetics

André Drenth
The University of Queensland 4072 Australia
Version 1.0

1 General Introduction

Population genetics is by no means a new scientific discipline. Most of the important theorems were worked out in the first part of the 20th century. For a long time there was a gap between theoretical advances and experimental research. With the development of neutral markers such as isozymes in the 1960s and molecular markers in the 1980s, experimental research caught up with the theory. However, because of the abstract nature of population genetics and the overuse of mathematical language by population geneticists, the discipline has suffered and is generally not taken up by many students in biology, who tend to shun mathematics and statistics. The challenge to students in population genetics is to bring the biology of the organism and the mathematics together in an effort to address important biological questions.

In Mycology and Plant Pathology the population biology of the organisms under investigation has often been ignored. There are several reasons for this. The first is the lack of phenotypic characters showing variation in the population. The second is the lack of neutral genetic markers. The third is a lack of insight into how useful population genetics can be once one considers that diseases are caused by populations of pathogens and not by individuals. Plant pathologists have long been aware of the variation in phenotypic characters such as virulence in fungal populations. However, no systematic attempts were made to study this genetic diversity in detail, and unfortunately natural populations of fungi are seldom studied at all.
With the advent of molecular markers in the 1980s and the realisation that fungal pathogen populations are more variable than was initially thought, the number of research papers published in this area has increased significantly. However, since mycologists, plant pathologists, and molecular biologists are typically not well trained in genetics and population genetics, the advances in this field have been somewhat disappointing due to a lack of understanding of the underlying principles. The science of population genetics is ignored, and a race for the latest molecular marker systems has erupted, giving rise to method-oriented instead of problem-oriented publications. Experiments are conducted without clearly defined biological questions and without experimental designs and sampling strategies that allow statistical testing of hypotheses.

Hence the need for this practical guide, which outlines some of the underlying genetic issues that are particularly relevant to studying the population genetics of fungi. I have opted for a simple and straightforward style and give ample numerical examples the student may use to master the computations. Theoretical background is provided where needed to give students the reasons for using a particular test or diversity measure. My hope is that the student, armed with this practical guide, is on the way to rigorously testing important hypotheses concerning the population biology of fungal plant pathogens.

André Drenth, Brisbane, January 1998

© Copyright: No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the author.
Contents

1 Introduction to the workshop
2 DNA and genetic variation
   The structure of DNA
   Basis of genetic variation
   Measuring genetic variation
   Molecular markers in Plant Pathology
3 Population genetic research questions in Plant Pathology
4 Population genetic tools
5 How to get started?
6 Population genetic theory
   6.1 Individuals and Populations
   6.2 Forces on populations
   6.3 Alleles versus genotypes
   6.4 Calculating allele frequencies
   6.5 Hardy Weinberg equilibrium
   6.6 Genetic diversity and evolution
   6.7 Measuring and quantifying genetic diversity
      1 Polymorphic loci
      2 Heterozygosity
      3 Gene diversity
      4 Genotypic diversity
   6.8 Linkage disequilibrium
   6.9 Population differentiation
   6.10 Partitioning of genetic diversity
   6.11 Fixation index
   6.12 Genetic distance
   6.13 Similarity and dissimilarity indices
   6.14 Suitability of markers for population genetics
      1 Isozymes
      2 RAPD
      3 How to obtain a set of neutral RFLP markers
7 Literature cited

2 DNA and Genetic Variation

THE STRUCTURE OF DNA

We are all familiar with heritable attributes, at least in a general sense. We speak of a child looking "just like its father", and we talk of brown eyes and characteristic features "running in the family". These heritable attributes are part of our genetic make-up, a blueprint that provides the plan for our development.

Most eukaryotic organisms are composed of millions of different cells. Regardless of its size and function, each cell contains a defined structure called a nucleus. Within the nucleus is an identical copy of the individual's genetic material. This genetic material holds a complete set of instructions that programs the life processes of that cell. The genetic material inside each nucleus is organised into chromosomes. Chromosomes are not easily distinguishable in the nucleus of a normal, active cell.
At the time of cell division, however, the chromosomes condense and can be seen using a light microscope or an electron microscope. Diploid organisms carry their chromosomes in pairs of homologous chromosomes. As organisms grow and cells divide, the chromosomes are duplicated and transferred to new cells (mitosis). Chromosomes are transmitted between generations through sexual reproduction. For this purpose special cells, called germ cells, undergo reduction division (meiosis), leading to haploid cells which contain only one chromosome of each homologous pair.

The core of each chromosome, the material of heredity itself, is DNA (deoxyribonucleic acid). The physical structure of DNA is simple, yet effective. A single strand of DNA is built from four kinds of nucleotides. Each nucleotide is made up of three parts: a phosphate group, a sugar known as deoxyribose, and one of four nitrogen-containing bases. The four bases are adenine (A), cytosine (C), guanine (G), and thymine (T). DNA consists of two single strands of nucleotides bound together in a double helix to form double stranded DNA. The two strands run in opposite directions (they are antiparallel), and a "T" in one strand is always paired with an "A" in the other strand; similarly, a "C" is always paired with a "G". Hence, the two strands of DNA complement each other. This complementary base pairing underlies the mechanism by which DNA replicates each time a cell divides, and it is fundamental to the role of DNA as the genetic material which stores information and can copy it. When a cell divides, its double stranded DNA is unwound so that each strand serves as a template for the synthesis of a second, complementary strand by the enzyme DNA polymerase.
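The pairing rules described above (A with T, C with G, strands running in opposite directions) are simple enough to express in a few lines of code. The following is a minimal sketch in Python; the function names are illustrative choices and not part of the guide.

```python
# Watson-Crick pairing rules stated in the text: A pairs with T, C pairs with G.
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base complement of a DNA strand."""
    return "".join(COMPLEMENT[base] for base in strand.upper())


def reverse_complement(strand):
    """Return the partner strand written in its own reading direction.

    Because the two strands are antiparallel, the partner strand is the
    complement read backwards.
    """
    return complement(strand)[::-1]


if __name__ == "__main__":
    template = "CAGGTTCGTAATGC"  # short example sequence
    print(complement(template))          # GTCCAAGCATTACG
    print(reverse_complement(template))  # GCATTACGAACCTG
```

Applying `reverse_complement` twice returns the original strand, which mirrors the way each strand can serve as a template for rebuilding the other during replication.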
Each chromosome contains a continuous molecule of double stranded DNA that is tightly packaged, coiled and supercoiled with other components of the cell (proteins such as histones, and ribonucleic acid (RNA)), allowing its enormous length to be compressed into the nucleus. The DNA in the 46 chromosomes of each human cell would total about two metres if fully extended, and the entire amount of DNA in an adult human body when fully extended would reach from the earth to the sun and back 25 times.

Each strand of DNA in every chromosome consists of a linear sequence of the four bases. Since this sequence of nucleotides is the sole distinguishing feature of the genetic information, the essential information of any segment can simply be represented by writing its sequence of bases (e.g. CAGGTTCGTAATGC). This linear sequence of bases is what we usually refer to as the DNA sequence. Although the DNA sequence is continuous, the information it encodes is not. It is organised in discrete locations that we refer to as genes. A gene is a particular sequence of nucleotides that is transcribed into RNA, which in turn is translated into a chain of amino acids that forms the basis of proteins and enzymes.

The genetic code is the relationship between the sequence of bases in DNA and the sequence of amino acids in proteins. A group of three bases, called a codon, codes for one amino acid. Since there are 64 possible base triplets and only twenty amino acids, the genetic code is degenerate: for most amino acids there is more than one codon. Hence, changes in DNA sequence do not always affect the amino acid sequence they encode.

Chromosomes contain many genes, which are the discrete units of inheritance: the hereditary particles that Mendel first perceived in 1866. Each chromosome of a homologous pair contains the same gene in the same location. The location on the chromosome where any particular gene is found is called a locus.
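As a brief aside on the degeneracy of the genetic code described above, the sketch below builds the standard codon table and counts how many codons specify each amino acid. It assumes the standard (universal) genetic code written in the conventional T, C, A, G base order; some organelles and organisms use variant codes, which are not covered here. The function names and the example sequences are illustrative only.

```python
from collections import Counter
from itertools import product

BASES = "TCAG"
# Standard genetic code, one letter per codon, with codons ordered
# TTT, TTC, TTA, TTG, TCT, ... ('*' marks a stop codon).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

CODON_TABLE = {
    "".join(codon): amino_acid
    for codon, amino_acid in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(dna):
    """Translate a coding DNA sequence codon by codon (trailing bases ignored)."""
    usable = len(dna) - len(dna) % 3
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, usable, 3))


def codons_per_amino_acid():
    """Count how many codons encode each amino acid (stop codons excluded)."""
    return Counter(aa for aa in CODON_TABLE.values() if aa != "*")


if __name__ == "__main__":
    counts = codons_per_amino_acid()
    print(len(CODON_TABLE), "codons,", len(counts), "amino acids")
    print("Leucine codons:", counts["L"], " Methionine codons:", counts["M"])
    # A third-position change that leaves the protein unchanged, and a
    # second-position change that does not:
    print(translate("GAACTT"), translate("GAACTC"), translate("GAACCT"))
```

In this example CTT and CTC both encode leucine, so the substitution changes the DNA sequence without changing the protein, whereas CTT to CCT replaces leucine with proline.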
Thus a locus is a defined region of DNA base sequence. Homologous chromosomes have the same genes in the same order, but differences between the genes exist. Most genes exist in a number of different forms that we refer to as alleles (Fig. 1); different alleles may give rise to variants of the trait concerned. Logically, an allele must be due to the presence of a different nucleotide sequence at the locus. Most alleles, however, are minor variants that have little or no effect on the normal function of a gene. In the next section we will discuss the processes involved in generating and maintaining genetic variation.

[Figure 1. Homologous chromosomes with loci in the same location but different alleles at each locus. Labels: homologous chromosome pair, locus, allele 1, allele 2.]

BASIS OF GENETIC VARIATION

Population genetic studies on many organisms have revealed that natural populations of sexual species are genetically variable. Now that we are familiar with the structure of DNA, we should look at the processes involved in generating and maintaining genetic variation in this DNA sequence. The processes involved are: (i) mutation, (ii) mating system, (iii) migration and gene flow, (iv) genetic drift, and (v) selection.

(i) Mutation

Genetic variation is created by changes in the genetic material, and mutation forms the basis of all genetic variation. A mutation can be defined as any change in the base sequence of the DNA in the genome. Mutations are typically lethal if essential genes are affected. Different forms of mutation may occur, including:

(a) Base substitution: the replacement of one nucleotide by another. If a base substitution occurs in a codon within a protein coding region, an amino acid change in the primary structure of the protein may result. Sometimes a base pair change does not alter the amino acid specified by the codon, and no change in the protein structure occurs.
A nucleotide substitution that causes no change in the protein product of the gene is known as a silent mutation.

(b) Insertion or deletion of a nucleotide. Such mutations cause a frame shift in the process of translation and usually result in non-functional gene products.

(c) Inversion of a section of DNA. Even major rearrangements of this type may be harmless as long as no genetic material is lost and no important genes are disrupted at the breakpoints of the inversion.

(d) Duplication or deletion of a section of the DNA.

(e) Translocation. Rearrangements of genetic material resulting from an exchange of material between non-homologous chromosomes. Non-homologous chromosomes do not normally pair with each other.

(f) Gene conversion. Mutations due to gene conversion stem from misalignment of DNA. This is especially important in the evolution of tandemly repeated clusters of related genes or multigene families. Gene conversion is often associated with meiotic recombination, in which the mismatch repair system of the cell converts one allele to the other.

These kinds of mutation occur at different rates and are differently affected by mutagenic agents. We have to realise that there is no constraint at the molecular level of DNA on what mutations can occur. Constraints on genetic variation arise from the physiology and development of an individual and not from the mutational process itself. Mutations occur at random and can either increase or decrease the fitness of an individual. Fitness can be defined as the ability to survive and reproduce in a particular environment. Many mutations take place in parts of the genome which do not encode genes; these mutations are generally neutral. The ultimate source of genetic variation is gene mutation, and it takes place continuously in a population.
However, mutation is such a rare event (about $10^{-6}$ per gene per generation) that on its own it would change the genetic constitution of a population so slowly as to be almost negligible. The following mechanisms enhance and amplify the effects of mutation.

(ii) Mating system

In nature, gene mutations provide different forms of a gene, and these are spread throughout the population by sexual reproduction, which entails independent assortment and recombination through crossing over. This makes possible different combinations of newly arisen alleles with each other and with those already established in the gene pool; as a result, the effect of gene mutation is amplified. Different forms of mating exist in nature. Micro-organisms can outbreed, inbreed, or reproduce asexually. Typically, fungal populations have mixed mating systems in which they reproduce asexually within the season and sexually between seasons. In a random mating, outbreeding population, the loci are randomly assorted in each generation. This leads to many combinations of alleles in the progeny. Hence, each individual will have a unique genetic make-up. Mating does not create alleles, but it combines already existing alleles into new combinations, leading to higher levels of genetic variation. In contrast, individuals in asexually reproducing populations have an identical genetic make-up.

(iii) Migration and gene flow

Populations of most species exhibit at least some degree of genetic differentiation between geographic locations. Migration of individuals from one population to another will reduce the differences between these populations. It is easy to see that emigration has only a minor effect, whereas immigration can have large effects by introducing new alleles into the population. Thus the genetic structure of populations can change as a result of immigration or gene flow.
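To put the mutation rate quoted in this section in perspective, a short worked calculation may help. It is a sketch under simplifying assumptions that are not stated in the guide: mutation is one-way (allele A mutates to allele a at rate $\mu$ per generation), there is no back mutation, and no other force is acting. Under those assumptions the frequency $p_t$ of allele A after $t$ generations is:

```latex
\[
p_t = p_0 (1-\mu)^t \approx p_0 \, e^{-\mu t},
\qquad
t_{1/2} = \frac{\ln 2}{\mu} \approx \frac{0.693}{10^{-6}} \approx 6.9 \times 10^{5}\ \text{generations}.
\]
```

With $\mu = 10^{-6}$, mutation pressure alone would therefore need roughly 700,000 generations to halve the frequency of an allele, which is why mutation by itself is described as an almost negligible force over short time scales.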
(iv) Genetic drift

In small populations, allele frequencies can change by chance in each generation, and particular alleles may be lost. This leads to changes in the population genetic structure over time, and these changes occur independently at different locations. Because genetic drift is random, small populations which are isolated from each other will change in different directions. This will typically happen in pathogen populations, which have extremely low effective population sizes in the absence of their host plant.

(v) Selection

Natural selection changes the gene pool by giving a reproductive advantage to those individuals with favoured combinations of alleles which, in certain environments, lead to a greater fitness. Natural selection is the process by which genotypes with greater fitness leave, on average, more offspring than less fit genotypes, so that favourable alleles promoting higher fitness are over-represented in succeeding generations. As a result, the types and frequencies of alleles in the population gradually change so as to promote greater adaptation to the environment.

MEASURING GENETIC VARIATION

Two different types of genetic markers are used in population genetics: genotypic and phenotypic markers. Genotypic markers such as isozymes and Restriction Fragment Length Polymorphisms (RFLPs) identify a number of alleles at a designated locus. This allows the analysis of populations using only a relatively low number of individuals, because the allele frequency forms the basis of analysis. Allele frequencies can be used to test for random mating or to analyse gene diversity, gene flow, population substructuring, etc. Isozymes are quick and cheap to detect, but their number is limited. RFLPs are more numerous but are time consuming to perform. Phenotypic markers include morphological characters and molecular markers such as Random Amplified Polymorphic DNA (RAPD) and DNA Amplification Fingerprinting (DAF).
RAPD and DAF markers are not as powerful as RFLPs in resolving population genetic structure, but they are typically used to estimate the fraction of clonal individuals in a population and to measure the number and frequency of the different phenotypes present, which enables measurement of phenotypic diversity. In addition, the spread and occurrence of particular phenotypes can be followed over time. The disadvantage of these types of markers is their low power to infer population genetic structure. Their big advantage is their speed and simplicity, with the potential to sample large numbers of individuals. DNA fingerprinting and DNA profiling techniques are used in many disciplines of science. In the next section the genetic basis underlying RFLPs, RAPDs and DAFs will be discussed.

RFLP

The basis of RFLP lies in the fact that restriction enzymes cut the DNA duplex at specific base sequences. Many restriction enzymes recognise a particular sequence of six bases in duplex DNA (e.g. EcoRI recognises GAATTC), and the enzymes cut the DNA only at this sequence. If the four bases are equally frequent and randomly ordered, a particular sequence of six base pairs occurs on average once every $4^6 = 4096$ bases, i.e. roughly once every 4,000 bases. In examining a locus in a number of individuals we might find that in some individuals the DNA duplex surrounding the locus is not cut at the normal sites by EcoRI. In these individuals the normal recognition sequence is no longer recognised by EcoRI; therefore one of the six bases comprising the recognition site of EcoRI must be different. Hence, a different allele, which is not always apparent by obvious criteria since the base change has not affected gene function, can nevertheless be identified by a restriction enzyme. To determine whether a restriction enzyme cuts at a particular site, we measure the length of the DNA fragments generated by the restriction enzyme.
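The logic of an RFLP, where a single base change in a recognition site changes the fragment lengths, can be illustrated with a small sketch. The two toy sequences below are invented for illustration and do not come from any real locus; the function names are likewise hypothetical. The only biological facts relied on are the EcoRI recognition sequence GAATTC given in the text and the fact that EcoRI cuts between the G and the first A.

```python
ECORI_SITE = "GAATTC"  # recognition sequence given in the text


def expected_site_spacing(site_length):
    """Average distance between occurrences of one specific site,
    assuming four equally frequent, randomly ordered bases."""
    return 4 ** site_length


def fragment_lengths(sequence, site=ECORI_SITE, cut_offset=1):
    """Fragment lengths after complete digestion of a linear molecule.

    cut_offset=1 places the cut after the first base of the site
    (G^AATTC for EcoRI).
    """
    cuts = []
    start = sequence.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = sequence.find(site, start + 1)
    boundaries = [0] + cuts + [len(sequence)]
    return [right - left for left, right in zip(boundaries, boundaries[1:])]


# Two made-up alleles of the same 50-base region. Allele 2 differs from
# allele 1 by a single base (C -> T) in the second recognition site.
ALLELE_1 = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTC" + "T" * 8
ALLELE_2 = "A" * 10 + "GAATTC" + "C" * 20 + "GAATTT" + "T" * 8

if __name__ == "__main__":
    print(expected_site_spacing(6))    # 4096
    print(fragment_lengths(ALLELE_1))  # [11, 26, 13]
    print(fragment_lengths(ALLELE_2))  # [11, 39]
```

The single substitution removes one cut site, so the 26-base and 13-base fragments of allele 1 appear as one 39-base fragment in allele 2. In a real experiment the sequences are thousands of bases long and the fragments are detected with a probe rather than read directly.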
Because restriction enzymes generate thousands of restriction fragments, a single cloned DNA fragment is used as a probe. Owing to the complementary nature of double stranded DNA, the probe hybridises specifically to its complementary sequence among the many fragments bound to a DNA binding membrane (i.e. a Southern blot). The allele is then identifiable as a Restriction Fragment Length Polymorphism (RFLP). RFLPs are codominant markers (all alleles at a locus can be identified in a diploid organism), and the ability to screen RFLPs has added extraordinary power to our analyses of gene structure and function.

RAPD and DAF

RAPD and DAF technologies are both based on the Polymerase Chain Reaction (PCR). The PCR technique involves three steps: 1) denaturation of the double stranded DNA by heating, 2) annealing of primers to sites flanking the region to be amplified, and 3) primer extension, in which strands complementary to the region between the flanking primers are synthesised using a thermostable DNA polymerase (e.g. Taq polymerase). The double stranded products are cycled repeatedly through steps 1-2-3. In each round of denaturation-annealing-extension, the amount of target sequence in the reaction mixture is roughly doubled. After 20 cycles a target sequence can thus be amplified about a million-fold. The PCR technique is extremely powerful in that only small amounts of starting material are necessary for an assay. The primers used to initiate the PCR process are short oligonucleotides (typically 20-30 bases in length) that specifically amplify the DNA sequence from a particular locus. Instead of using specific sequences to target certain loci, arbitrary primers can be used to amplify at random a number of anonymous genomic sequences, which can then be size fractionated on a gel to provide an individual-specific DNA fingerprint. This forms the basis of the RAPD and DAF techniques.
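The arithmetic behind the PCR amplification figures mentioned above can be written out explicitly. In the idealised case every cycle doubles the number of target copies; the per-cycle efficiency $E$ in the second expression is an illustrative assumption (the value 0.9 is hypothetical), introduced only to show why real reactions, which "roughly" double, need a few more cycles to reach the same yield.

```latex
\[
N_c = N_0 \, 2^{c}
\quad\Rightarrow\quad
\frac{N_{20}}{N_0} = 2^{20} = 1\,048\,576 \approx 10^{6},
\]
\[
N_c = N_0 \, (1+E)^{c},
\qquad
E = 0.9:\quad (1.9)^{20} \approx 3.8 \times 10^{5},
\quad (1.9)^{22} \approx 1.4 \times 10^{6}.
\]
```

Here $N_0$ is the starting number of target molecules and $N_c$ the number after $c$ cycles.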
Because only the presence or absence of specific amplified fragments can be scored, individual alleles cannot be distinguished, which makes these markers dominant. Hence, allele frequencies cannot be calculated directly, which is a major disadvantage of these techniques compared to RFLP analysis. On the other hand, only minute amounts of tissue are needed for DNA fingerprinting using arbitrary primer techniques, and they are quicker than RFLPs. For many questions they are therefore an extremely rapid alternative to RFLP markers.

MOLECULAR MARKERS IN PLANT PATHOLOGY

The life span of an individual runs but a short course, during which time the genotype remains constant. In contrast, the population persists over generations and has a genetic constitution that continues to vary. The population, not the individual, is the main unit of evolution. Evolution can be defined as change in the diversity and adaptations of populations of organisms. A population can be defined as a group of individuals that share a common gene pool (mate with each other) in a defined location. Members of a population therefore share more alleles with individuals belonging to their own group than with those in related populations. Most species are composed of more than one population, and the number of individuals varies from one population to another. Natural populations almost always display differences in allele and genotype frequencies from one geographic region to another.

In order to study genetic variation in pathogens in plant pathology we have to look at the population level. Because many fungal pathogens have no morphological characters which allow us to identify individuals in a population, many research questions remain unanswered. Molecular marker technology in general, and DNA fingerprinting especially, enable the rapid identification of each individual in a population.
This enables the researcher to investigate in greater detail the mode of reproduction, the mode and extent of spread, the mode of survival, and the origin of and evolutionary relationships among closely related pathogens. With the advent of an almost unlimited number of molecular markers available at relatively low cost, it is now possible to deduce the genetic structure of pathogen populations. Questions in plant pathology which can be addressed using molecular markers include:

(i) Where does the pathogen come from?
(ii) Where do new races come from?
(iii) What is the level of genetic variation in the pathogen population?
(iv) How far does the pathogen spread?
(v) How does the pathogen survive between seasons?
(vi) How important is the sexual cycle of the pathogen?
(vii) Is the pathogen population confined to one plant, one field, one region, one continent?

By using a population genetic approach we can deduce the mode of reproduction: the levels of inbreeding, outbreeding or asexual reproduction. It is easy to understand that asexual reproduction will give rise to clones which cannot be distinguished from the parental types. Continuous selfing will lead to the same situation. Outbreeding, however, will recombine the genetic information of both parents and give rise to new individuals which can be easily distinguished from the parents using DNA fingerprinting. Population genetic information concerning the mode of reproduction provides insights into the ability of pathogens to form new pathogenic races, and into which new races are combinations of already existing ones. Sexual reproduction in fungi often involves the formation of specific resting spores, which increase the ability of the pathogen to survive between cropping seasons. We will first look in more detail at questions in plant pathology related to population genetics, before discussing the relevant population genetic theory and background needed to tackle these questions.
A few case studies will provide some insights into the way this technology can be used to answer biological questions of importance in plant pathology. Molecular approaches are immensely popular at the present time, but they nonetheless provide only one of many avenues towards the goal of understanding the biology of organisms.

3 Population genetic research questions in Plant Pathology

Before starting on the theory of population genetics, it is a good idea to look at the biological questions of relevance to plant pathology which we want to address. I have listed a small collection in a few broad categories. Of course, there are many more questions, but these will provide a starting point.

Population structure
• What is a population? Geographic boundaries
• How much genetic diversity exists in a population?
• How is genetic diversity distributed within a population?
• How is genetic diversity distributed between populations?
• One large panmictic population or many small subpopulations (island model)?
• Continuous population (incomplete isolation by distance)?

Population boundaries
• Is the pathogen population on host plant A hybridising with the population on host plant B?

Geographic differentiation
• Is the pathogen population in field A the same as in field B?
• Relationships between populations from different areas

Host specialisation
• Is the pathogen population on host A the same as on host B?

Migration
• Do migration and gene flow occur between different populations?
• How do migration and genetic drift affect population structure?
• Was a new race or genotype introduced to field A, or did it evolve locally? If it was introduced, where did it come from?
Life-cycle biology (mating system - occurrence and maintenance of genetic diversity)
• How do sexual and asexual reproduction affect population genetic structure?
• Inbreeding
• Outbreeding
• Asexual reproduction

Selection
• How does selection affect population structure?
• Influence of the host plant on the pathogen population
• Deployment of resistance in the host
• Application of fungicides

Disease control strategies
• How do different control strategies affect genetic structure?

Phylogenetic relationships (systematics)
• What is the evolutionary potential?
• What are the evolutionary relationships between closely related pathogen species A and B?

4 Population genetic tools

After we have defined our questions, we need to find out how we can apply population genetic tools to address them. Three tools of population genetics need to be taken into account:

• Genetic markers
• Sampling strategies
• Data analysis

Each particular research question calls for a particular choice of markers and sampling strategy, combined with a particular way of analysing the data. After the biological question has been defined, it is important to choose the right tools. The aim of this practical guide is to help you choose these tools.

5 How to get started

Population genetic studies require careful planning because they are relatively expensive and run over an extended period of time. Project planning is vital in order to obtain maximum information from your samples. Always start by defining very clearly the biological questions being asked. I cannot stress this enough. The questions need to be written down in as specific and detailed a form as possible. Based on the questions, hypotheses need to be constructed, and particular attention should be paid to constructing testable hypotheses.
After this most important step you need to work out how to test these hypotheses, which will involve the following steps.

1. Sampling strategy
2. Sample collection
3. Sample analysis using genetic markers
4. Data analysis

One of the aims of the sampling strategy should be to minimise both the number of specimens and their handling and analysis, while still allowing statistical testing of the hypotheses which answer your biological question. The sampling strategy largely depends on the biological questions being asked and the level of error one agrees to accept. Population genetic data such as allele frequencies often follow a binomial distribution, which allows us to estimate the variance ($s_p^2$) from the mean according to $s_p^2 = p(1-p)/n$, where $p$ is the allele frequency and $n$ is the sample size. This approach is appropriate for loci in diploid populations, provided the locus is in Hardy Weinberg equilibrium.

For most populations of fungal pathogens we do not know the population structure, so this would be a good place to start any population genetic study. The three most fundamental population structures one is likely to encounter are:

• One single random mating population
• A series of small isolated subpopulations (island or stepping stone model)
• A continuous population where individuals exchange genes with geographically proximate individuals (isolation by distance model)

See the section on biological questions for more details on other relevant questions. After you have defined your question, it is time for a pilot experiment, which should have the following three major aims.

• Choice of genetic markers
• Determine if the markers are suitable in a practical sense
• Feasibility of a large scale sampling program

To start with, samples need to be obtained from a variety of populations. It is best to obtain material that is as variable as possible first, in order to select your markers.
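The binomial variance formula given in this section, $s_p^2 = p(1-p)/n$, can be turned into a quick planning aid for a pilot study. The sketch below is illustrative rather than a prescribed procedure. It makes one assumption that the text does not spell out: $n$ is read as the number of allele copies scored (for a diploid organism that would be twice the number of individuals). The function names are hypothetical.

```python
import math


def allele_frequency_variance(p, n):
    """Binomial sampling variance of an allele frequency estimate.

    p: allele frequency (0..1)
    n: sample size, taken here as the number of allele copies scored
       (an assumption; see the lead-in above).
    """
    return p * (1.0 - p) / n


def standard_error(p, n):
    """Standard error of the allele frequency estimate."""
    return math.sqrt(allele_frequency_variance(p, n))


def sample_size_for_standard_error(p, target_se):
    """Smallest n giving a standard error no larger than target_se.

    A tiny tolerance is subtracted before rounding up so that
    floating-point noise does not add a spurious extra sample.
    """
    return math.ceil(p * (1.0 - p) / target_se ** 2 - 1e-9)


if __name__ == "__main__":
    for n in (25, 100, 400):
        print(n, round(standard_error(0.5, n), 4))
    print(sample_size_for_standard_error(0.5, 0.05))
```

Because the variance is largest at $p = 0.5$, planning for that value is the conservative choice when allele frequencies are unknown before the pilot experiment. The loop also shows that quadrupling the sample size only halves the standard error.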
Because a large number of samples needs to be analysed in your main study, markers should be easy to score and inexpensive. Moreover, it is vital that heterozygotes can be easily distinguished from homozygotes and that all alleles can be easily distinguished from each other. The optimisation of sample handling, storage, DNA isolation and manipulation, data scoring and the logistics of the project also needs to be worked out.

Sample size and strategy

After the biological questions have been worked out in great detail, your hypotheses are clearly defined, and you have an appropriate marker, it is time to design a proper sampling strategy. The design of a sampling strategy largely depends on:

• The question being asked
• Biology of the organism: spread, mode of reproduction, ploidy, etc.
• What levels of error one agrees to accept
• Frequency of polymorphic loci in the population

Sampling strategies and the statistics behind them will be discussed in a separate chapter.

6 Population genetic theory

6.1 Individuals and Populations

The most obvious unit of living matter is the individual organism. In unicellular organisms, each cell is an individual; multicellular organisms consist of many interdependent cells, many of which die and are replaced by other cells throughout the life of an individual. In evolution, the relevant unit is not the individual but the population. A population is a community of individuals linked by bonds of mating and parenthood. In other words, a population is a community of individuals of the same species. A Mendelian population is a community of interbreeding, sexually reproducing individuals. The individuals of a species are not usually homogeneously distributed in space; rather, they exist in more or less well defined clusters, or local populations.
The concept of local populations may seem clear but its application in practice entails difficulties because the boundaries between local populations are not well defined and often unknown. In addition most organisms are not homogeneously distributed and migration occurs.

6.2 Forces on populations

Populations are not static over time but fluctuate in size and genetic make-up. A number of forces act upon populations because food supplies are always limited and predation, migration and selection occur. The largest changes in populations typically occur when selection leads to greater fitness in some individuals in the population compared to others. For this to happen there need to be genetic differences between individuals for these forces to act upon. The most common forces on populations are listed below.

Mutation
• Source of all genetic diversity
• Spontaneous mutations occur continuously without regard to their immediate need or usefulness (mutation rate 1 per 10^6 per generation)
• Selective forces act to increase the frequency of a favoured mutant allele in the population at the expense of the less favoured allele

Selection
• Some individuals have more offspring than others based on differences in fitness
• Natural selection - not defined (fitness, vitality, fertility)
• Artificial selection - human involvement (breeding resistant plants)

Migration (gene flow)
• Emigration - negative selection (limited influence on population)
• Immigration - influence on population (allele frequency changes)

Drift
• Small populations, inbreeding - fluctuations of allele frequencies
• Reduction of heterozygotes - loss of genetic variability
• Random nature - different strains become homozygous for different allelic combinations, so isolated subpopulations become genetically distinct from each other

Mating system
• Influences genotype frequency but not allele frequency
• Random mating
• Assortative mating - breeding according to phenotypic similarity (period of flowering)
• Inbreeding - selfing

Fungi
• Asexual - vcg (parasexual recombination)
• Homothallic
• Heterothallic (mating types)

6.3 Alleles versus genotypes

In population genetics the frequency of alleles rather than the frequency of genotypes is the basis used to answer most biological questions. The reason for this is that the frequencies of alleles are much higher than the frequencies of genotypes, since there are usually fewer alleles than genotypes. With two alleles the number of possible genotypes in a diploid organism is 3, with 3 alleles it is 6, and with 4 alleles it is 10. In general, if the number of different alleles is k, the number of different possible genotypes is k(k+1)/2. Table 6.1 illustrates this point for haploid and diploid organisms.

Table 6.1 Number of possible genotypes in haploid and diploid organisms (A = number of alleles per locus, L = number of loci).

Ploidy   Loci   Alleles   Possible genotypes                      Genotypes
1n       1      5         $A^L$                                   5
2n       1      5         $A + \binom{A}{2}$                      15
1n       4      5         $A^L$                                   625
2n       4      5         $\left(A + \binom{A}{2}\right)^{4}$     50625

From table 6.1 it becomes immediately clear that enormous sample sizes are required to estimate the frequency of genotypes, and we can therefore conclude that in sexual systems the frequencies of genotypes are extremely difficult (if not impossible) to determine. The frequency of alleles is higher than the frequency of genotypes. This is illustrated in table 6.2. It is immediately clear that the estimate of allele frequencies is much more precise and requires a smaller sample size. Moreover, frequencies of genotypes can be estimated indirectly from the allele frequencies by invoking Hardy Weinberg equilibrium, discussed in section 6.5.
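The genotype counts of Table 6.1 can be checked with a few lines of Python. This is only an illustrative sketch: the function name `possible_genotypes` is an assumption made for this example, and it handles haploid and diploid organisms only.

```python
from math import comb

def possible_genotypes(alleles: int, loci: int, ploidy: int) -> int:
    """Number of distinct multilocus genotypes (hypothetical helper)."""
    if ploidy == 1:
        per_locus = alleles                      # one allele per locus
    elif ploidy == 2:
        # homozygotes plus heterozygotes, which equals k(k+1)/2
        per_locus = alleles + comb(alleles, 2)
    else:
        raise ValueError("this sketch only handles ploidy 1 or 2")
    return per_locus ** loci                     # loci combine independently

for ploidy, loci in [(1, 1), (2, 1), (1, 4), (2, 4)]:
    n = possible_genotypes(5, loci, ploidy)
    print(f"{ploidy}n, {loci} locus/loci, 5 alleles: {n} genotypes")
```

The steep growth with the number of loci is the reason genotype frequencies need far larger samples than allele frequencies.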
Table 6.2 Numerical example based on the Lap-5 gene (leucine aminopeptidase) of Drosophila willistoni, from isozyme analysis of a population of 500 individuals.

Genotype   Number   Genotype frequency
98/98      2        0.004
100/100    172      0.344
103/103    54       0.108
98/100     38       0.076
98/103     20       0.040
100/103    214      0.428
Total      500      1.000

The allele frequencies for the above example are:

Allele   Frequency
98       0.062
100      0.596
103      0.342
Total    1.000

6.4 Calculating allele frequencies

How to calculate allele frequencies

The frequency of an allele is the frequency of individuals homozygous for that allele plus half the frequency of heterozygotes for that allele.

f(A) = p    f(a) = q    p + q = 1

            A (p)          a (q)
A (p)       AA ($p^2$)     Aa ($pq$)
a (q)       Aa ($pq$)      aa ($q^2$)

Genotypic frequency: $(p + q)^2 = p^2 + 2pq + q^2 = 1$

For example, take a population with the following genotype frequencies:

AA     Aa     aa
0.3    0.6    0.1

p = 0.3 + (0.5 x 0.6) = 0.6
q = 0.1 + (0.5 x 0.6) = 0.4
p + q = 1

After one generation of random mating ($p^2 + 2pq + q^2 = 1$) the genotype frequencies are:

AA     Aa     aa
0.36   0.48   0.16

Co-dominant genes

A population of 200 individuals of a diploid micro-organism has 2 red individuals, 36 orange and 162 white ones. What are the frequencies of the red and white alleles?

Red    R = (2 x number of red + number of orange)/2N = (2 x 2 + 36)/400 = 0.1
White  W = (2 x number of white + number of orange)/2N = (2 x 162 + 36)/400 = 0.9

Since p + q = 1, q = 1 - p.

Note that if we can identify the heterozygotes in a population we can calculate the allele frequencies directly.

Dominant genes

A population of 200 individuals of a diploid organism has 182 red and 18 white individuals. What are the allele frequencies for the red and white allele? This cannot be directly calculated but we can estimate it if we make a number of assumptions. If we assume that the population is in equilibrium we can take the square root of the frequency of the recessive phenotype in the population as our estimator of the frequency of the recessive allele.

White  $q^2 = 18/200 = 0.09$, so $q = \sqrt{0.09} = 0.3$
Red    p = 1 - q = 0.7

In cases where dominance is involved the heterozygous class is phenotypically indistinguishable from the homozygous dominant class. Hence, there is no way of checking the Hardy Weinberg expectations against observed sample data unless the dominant phenotypes have been genetically analysed by observation of their progeny from test crosses. Only when co-dominant alleles are involved can we easily check our observations against the expected equilibrium values through the chi-square test. Note that this is especially a problem in diploid organisms when using dominant markers. What is the difference and the problem when working with haploid fungi? Some dominant markers only recognise one allele while the alternative allele is the absence of a fragment. What is the problem with these markers?

In mammalian systems there is an extra complication, namely the occurrence of sex chromosomes. For example, human males have XY while females have XX. The expression of dominance and recessive relationships is markedly changed when this happens. In sex influenced traits the heterozygous genotype usually will produce different phenotypes in the two sexes, making the dominance and recessive relationships of the alleles appear to reverse themselves. Since in fungi the sex of an organism is often under the control of a single gene and no sex chromosomes are known in fungi, we will not discuss this matter here any further.

Loci with multiple alleles

Consider three alleles A, a' and a with the dominance hierarchy A > a' > a occurring in the gene pool of a diploid organism with frequencies p, q, and r. In this case random mating will generate zygotes with the following frequencies:

$(p + q + r)^2 = p^2 + 2pq + 2pr + q^2 + 2qr + r^2 = 1$

Genotype   Frequency   Phenotype
AA         $p^2$       A
Aa'        $2pq$       A
Aa         $2pr$       A
a'a'       $q^2$       a'
a'a        $2qr$       a'
aa         $r^2$       a

Precision of allele frequency estimates

The effect of sample size on an estimate's precision is expressed as the sample variance ($s^2$), where

$s^2 = \dfrac{p \times q}{2 \times N}$

where p and q are the frequencies in the sample of two alternative alleles and N is the number of individuals in the sample. For diploid organisms we use 2 x N, as this is the number of copies of a specific locus in the sample; for haploids use N. When multiallelic series are involved take p as the frequency of one allele and q as the combined frequency of all other alleles. Because N is in the denominator of this equation, it is clear that sample variance is inversely proportional to sample size, i.e. the variance is smaller for a larger sample size. It is also apparent from the equation that sample variance is dependent on allele frequency. This can be illustrated by a numerical example. Take a sample of 50 individuals and an allele frequency of 0.5. The sample variance is 0.5 x 0.5/(2 x 50) = 0.0025. The sample variance for an allele with a frequency of 0.05 in the sample is 0.05 x 0.95/(2 x 50) = 0.000475, less than one-fifth of that of the first allele. In absolute terms the variance is therefore largest for alleles of intermediate frequency. Relative to the size of the estimate, however, rare alleles are estimated least precisely: the standard deviation of the rare allele (0.022) is 44% of its estimated frequency, against 10% (0.05/0.5) for the allele with a frequency of 0.5.

The square root of the sample variance provides an estimate of the sample standard deviation, and this in turn can be used to obtain confidence limits of the estimate of allele frequencies. Confidence limits are values either side of the estimate that delimit the confidence interval, a range of values within which we can be confident, to a given degree, that the true population frequency lies.
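The allele frequency and precision calculations of this section can be sketched in Python as follows. The genotype counts are those of Table 6.2; the function names `allele_frequencies` and `sample_variance` are hypothetical helpers chosen for this illustration, not part of any package.

```python
from collections import Counter
from math import sqrt

# Genotype counts from Table 6.2 (Lap-5 locus, 500 diploid individuals).
genotype_counts = {
    ("98", "98"): 2, ("100", "100"): 172, ("103", "103"): 54,
    ("98", "100"): 38, ("98", "103"): 20, ("100", "103"): 214,
}

def allele_frequencies(counts):
    """Count every allele copy; a homozygote contributes two copies."""
    copies = Counter()
    for (a1, a2), n in counts.items():
        copies[a1] += n
        copies[a2] += n
    total = sum(copies.values())                 # 2N gene copies
    return {allele: c / total for allele, c in copies.items()}

def sample_variance(p, n_individuals, ploidy=2):
    """s^2 = p*q / (ploidy*N), with q the combined frequency of all other alleles."""
    return p * (1 - p) / (ploidy * n_individuals)

freqs = allele_frequencies(genotype_counts)
n = sum(genotype_counts.values())
for allele, p in freqs.items():
    s2 = sample_variance(p, n)
    print(f"allele {allele}: p = {p:.3f}, s^2 = {s2:.6f}, s = {sqrt(s2):.4f}")
```

Calling `sample_variance(0.5, 50)` and `sample_variance(0.05, 50)` gives the two variances of the numerical example in the text; passing `ploidy=1` gives the haploid case.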
In biological sciences we often use the 95% confidence limits, which delineate a range of values that we can be 95% confident contains the true frequency and which are positioned 1.96 standard deviations either side of the frequency estimate. For the example of the allele above having an estimated frequency of 0.05, the sample standard deviation is $\sqrt{0.000475} = 0.022$. The 95% confidence limits are thus 1.96 x 0.022 = 0.04 either side of 0.05. This means that we can be 95% confident that the true population frequency of the allele lies between 0.01 and 0.09. The only way to obtain the true allele frequency is to analyse the entire population. Since this is rather impractical, a decision needs to be made as to what level of confidence is acceptable.

6.5 Hardy Weinberg equilibrium

The Hardy Weinberg law was formulated independently in 1908 by the British mathematician G.H. Hardy and the German doctor Wilhelm Weinberg. It states that the process of heredity by itself does not alter the frequency of either alleles or genotypes in a population in which mating occurs at random. Furthermore, after a single generation of random mating genotype frequencies reach equilibrium if the allele frequencies are the same in both males and females; the equilibrium state is thus predictable from a knowledge of allelic frequencies.

A population in which the genotype frequencies are as predicted by the Hardy Weinberg law is often referred to as being in Hardy Weinberg equilibrium. A very important characteristic of Hardy Weinberg equilibrium is that it is achieved after only a single generation of random mating. Regardless of what might happen to disturb the state of equilibrium in one generation, it will be restored in the following generation.

Assume a locus with two alleles, A and a, and that their frequencies are p for A and q for a. If mating occurs at random then the frequency of a given genotype will simply be the product of the frequencies of the two corresponding alleles. The probability that an individual of a diploid species will have the AA genotype is the probability (p) of receiving the A allele from one parent multiplied by the probability (p) of receiving the A allele from the other parent, or p x p = p². Similarly, the probability that an individual will have the aa genotype is q². The genotype Aa can arise in two ways: A from the first parent and a from the second, which will occur with a frequency of pq, or a from the first parent and A from the second, occurring at the same frequency pq. The total frequency of Aa is therefore 2pq.

Three general statements concerning the HWE can be made, as well as a number of assumptions:

1. The process of heredity does not alter the frequencies of alleles or genotypes in a population in which mating occurs at random
2. HWE will always be restored in one generation
3. The equilibrium state can be calculated from the allele frequencies

Assumptions
• No selection
• No mutation
• No gene flow
• Large population size
• Random mating

One application of the Hardy Weinberg law is that it permits the computation of gene and genotypic frequencies in cases where not all genotypes can be distinguished because of dominance (see the example in 6.4). The other is to test whether the population is actually random mating, which is exemplified below.

Testing a locus for equilibrium

Do the genotypes in the following population conform to the frequencies expected for a Hardy Weinberg population within statistically acceptable limits?

100 individuals with the following genotypes:

AA    Aa    aa
10    35    55

Calculate the allele frequencies:

p = (2 x 10 + 35)/(2 x 100) = 0.275
q = 1 - p = 0.725

Calculate the genotypic frequencies according to the Hardy Weinberg equation based on the allelic frequencies:

Genotype   HW equilibrium   Expected genotypic frequency   Expected number in population
AA         $p^2$            0.076                          7.6
Aa         $2pq$            0.399                          39.9
aa         $q^2$            0.526                          52.6

Chi-square test

Genotype   Observed   Expected   (o-e)²/e
AA         10         7.6        0.79
Aa         35         39.9       0.60
aa         55         52.6       0.11
Total      100        100        1.50

Degrees of freedom: df = k phenotypes - r alleles = 3 - 2 = 1

X² is 1.5, which gives a probability (P) of 0.2-0.3. Hence this population does not deviate significantly from Hardy Weinberg equilibrium, and the data are consistent with random mating.

Degrees of freedom

The number of degrees of freedom in a chi-square test of Hardy Weinberg equilibrium is not simply the number of phenotypes minus 1 (as in chi-square tests of classical Mendelian ratios). It is further restricted because the expected Hardy Weinberg frequencies are themselves generated from the allele frequencies estimated from the same sample. Hence, the number of degrees of freedom is the number of phenotypes minus one (k-1) minus the number of alleles minus one (r-1), which equals the number of phenotypes minus the number of alleles (k-r).

6.6 Genetic diversity and evolution

The existence of genetic diversity is a necessary condition for evolution. If all individuals are homozygous for the same allele at a certain locus, evolution cannot take place at that locus because the allele frequencies cannot change from generation to generation. The occurrence of diversity in natural populations was the starting point for Darwin's argument for evolution by a process of natural selection. Individuals having advantageous variations are more likely to be successful than others in passing on their genes to their offspring.
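Returning briefly to the chi-square test of section 6.5: the worked example (10 AA, 35 Aa, 55 aa) can be reproduced with the sketch below. It assumes a single locus with two co-dominant alleles, so that there is exactly 1 degree of freedom; for that case the chi-square tail probability equals `erfc(sqrt(X²/2))`, which avoids the need for a statistics package. The function name `hwe_chi_square` is hypothetical.

```python
from math import erfc, sqrt

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy Weinberg proportions for one biallelic locus."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)              # frequency of allele A
    q = 1 - p
    expected = [p * p * n, 2 * p * q * n, q * q * n]
    observed = [n_AA, n_Aa, n_aa]
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # df = 3 genotype classes - 2 alleles = 1; this tail formula is valid for df = 1 only
    p_value = erfc(sqrt(chi2 / 2))
    return p, expected, chi2, p_value

p, expected, chi2, p_value = hwe_chi_square(10, 35, 55)
print(f"p = {p:.3f}, expected = {[round(e, 1) for e in expected]}")
print(f"chi-square = {chi2:.2f}, P = {p_value:.2f}")
```

For loci with more than two alleles the degrees of freedom differ and a general chi-square distribution function would be needed instead of the `erfc` shortcut.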
As a consequence of this differential success, useful variations will become more prevalent through the generations, while harmful or less useful ones will be eliminated. There is a direct correlation between the amount of genetic diversity in a population and the rate of evolutionary change by natural selection with respect to fitness. This was mathematically demonstrated by Fisher in his Fundamental Theorem of Natural Selection (1930).

In agriculture we continually try to improve the genetic make-up of our crop species by selecting for favourable characteristics such as yield and resistance to pests and diseases. However, this resistance is often not durable and can be overcome by changes in the pathogen population. Since the ability of a pathogen to adapt to new environments is to a certain degree dependent on the level of genetic diversity present in the pathogen population, it is of practical importance to have methods to quantify the genetic diversity present in pathogen populations.

6.7 Measuring and quantifying genetic diversity

It is now known that natural populations of many organisms possess a great deal of genetic diversity and that genetic diversity is a common phenomenon in nature. Through experiments on all kinds of organisms involving inbreeding, geneticists have discovered that much more genetic diversity exists than is apparent when organisms living in nature are observed. Inbreeding experiments give rise to homozygosity and the expression of recessive genes which otherwise go undetected in the population. Another source of evidence that genetic diversity is present in populations came from artificial selection experiments. There are many examples of this in agriculture, such as the spectacular increases in cereal yields, milk production in cows, egg production in chickens etc. which have taken place over most of this century.

If we want to quantify how much genetic diversity there actually is we run into a classical problem. The traditional methods of classical genetic analysis possess a severe handicap: how do we measure what proportion of genes are polymorphic in a population? Since we cannot study every gene in the population we need to look at only a sample of gene loci. Ideally we need a random sample, truly representative of the whole population, from which values can be extrapolated to the whole population. Traditional classical genetic analysis becomes extremely cumbersome here, because to find out whether differences between phenotypes are based on different alleles in the isolates we need to conduct testcrosses between all the different phenotypic classes to find out if one or more genes are involved. The dominance relationships need to be sorted out using labour and time consuming testcrosses. The other problem of the classical genetic approach is that the only genes known to exist are those that are variable. Invariant genes cannot be included in the sample and, hence, it is impossible to obtain an unbiased sample of the genome to accurately assess genetic diversity.

Discoveries in molecular genetics provided a way out of this dilemma. It was established that the genetic information encoded in the nucleotide sequence of the DNA of a structural gene is translated into a sequence of amino acids making up a polypeptide. This allows us to select a series of proteins without previously knowing whether or not they are variable in a population. Hence, this allows us to obtain an unbiased sample of all the structural genes in the organism. With the introduction of gel electrophoresis it has become possible to study protein variation quickly in large numbers of individuals with only a moderate investment of time and money.
Since the late 1960's estimates of genetic diversity have been obtained for many different natural populations of all kinds of organisms. Electrophoretic techniques show what the genotypes of the individuals in the sample are:

• how many are homozygous
• how many are heterozygous
• and for what alleles

In order to obtain a reasonable estimate of the amount of genetic diversity in a population, 15-20 or more loci need to be studied. After the laboratory work it is desirable to summarise the information obtained for all the loci in a simple way that expresses the genetic diversity of the population and permits comparing one population to another. In addition, it is vital to use measures of genetic diversity which allow statistical testing of hypotheses concerning population structure and comparison of different populations. In the next sections we will deal with ways to measure genetic diversity and ways to statistically test hypotheses concerning population structure.

Measuring genetic diversity

In most population genetic analyses, allele frequencies form the basis for measuring genetic diversity. Allele frequencies are preferred over genotypic frequencies because allele frequencies remain relatively stable over time and are independent of the mating system, in contrast to genotypic frequencies which are randomised at each generation of mating. See the section on alleles versus genotypes for more detail on this.

At this point fungal population genetics starts to deviate significantly from the established theory. This is due to the fact that fungi differ in a number of ways from the strictly outbreeding diploid organisms with clearly identifiable individuals for which most of the population genetics theory was developed. Special characteristics of fungi relevant to population genetics are:

• Difficulties in identifying the individual
• Some fungi are haploid
• Overlapping generations
• Outbreeding, inbreeding and strictly asexual reproduction can all occur to varying degrees at the same time
• At different geographic locations and on different host plants different modes of reproduction can occur
• The occurrence of more than one mating type
• The occurrence of vegetative incompatibility
• Strong host specialisation of some fungal species, which can have a large influence on population structure

The above demonstrates that the population genetics of fungi needs to be approached differently and more cautiously than if we were dealing with plants or mammals.

Genetic diversity can be measured in a number of ways:

1. Proportion of polymorphic loci
2. Heterozygosity
3. Gene diversity
4. Genotypic diversity
   • Shannon Index
   • Clonal Fraction
   • Genotypic diversity (Nei)
   • Stoddart and Taylor

1 Polymorphic loci

One measure of genetic diversity is the proportion of polymorphic loci, or simply the polymorphism (P) in a population. If we use a co-dominant marker, examine 20 loci of a fungal species, and find that 16 loci show no polymorphism but some polymorphism is present at the other 4 loci, we can say that 4/20 = 0.2 of the loci are polymorphic in that population. Hence, the degree of polymorphism in the population is 0.2.

Polymorphism can be a useful measure of genetic diversity but it suffers from two important problems: arbitrariness and imprecision. The number of variable loci observed depends on how many individuals are examined. If we examine more individuals we might identify more polymorphisms and the measure tends to increase. To counteract this, a criterion of polymorphism is often used to the effect that a locus is only considered polymorphic when the most common allele has a frequency no greater than 0.95. With this criterion the proportion of polymorphic loci does not change when additional rare variants are occasionally identified. However, the criterion is a rather arbitrary decision. More importantly, the degree of polymorphism in a population is imprecise because a slightly polymorphic locus counts as much as a very polymorphic locus containing many different alleles.

2 Heterozygosity

A better measure of genetic diversity, which is not arbitrary and much more precise, is the heterozygosity (H) of the population. Observed heterozygosity (Hobs) is defined as the average frequency of heterozygous individuals per locus. Heterozygosity is calculated by first obtaining the frequency of heterozygous individuals at each locus and then averaging these frequencies over all loci.

Example Heterozygosity

Locus   Heterozygotes in sample   Total sample   Heterozygosity (Hobs)
1       40                        100            0.40
2       20                        100            0.20
3       35                        100            0.35
Mean                                             0.32

In order for an estimate of heterozygosity to be valid it must be based on 15-20 or more loci. Heterozygosity is an estimate of the average proportion of loci within an individual that are in the heterozygous state. The variance associated with this estimate can be reduced both by increasing the number of loci examined and by increasing the number of individuals sampled from the population.

In a random mating population, observed heterozygosity (Hobs) is a good measure of genetic diversity because it estimates the probability that two alleles taken at random from the population are different. However, the observed heterozygosity does not reflect well the amount of genetic diversity in populations of organisms that reproduce by self fertilization (homothallic fungi) or organisms in which matings between relatives are common. In self fertilising populations most individuals will be homozygous, even though different individuals may carry different alleles if the locus is variable in the population. Mating between close relatives has the same effect.
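The heterozygosity example above (40, 20 and 35 heterozygotes out of 100 individuals at three loci) can be written as a minimal Python sketch. The function name `observed_heterozygosity` is an assumption made for illustration, and the sketch assumes the same number of individuals was scored at every locus.

```python
def observed_heterozygosity(heterozygotes_per_locus, sample_size):
    """Return the per-locus heterozygote frequencies and their mean (Hobs)."""
    per_locus = [h / sample_size for h in heterozygotes_per_locus]
    h_obs = sum(per_locus) / len(per_locus)      # average over all loci
    return per_locus, h_obs

per_locus, h_obs = observed_heterozygosity([40, 20, 35], 100)
print("per locus:", per_locus)
print(f"Hobs = {h_obs:.2f}")
```

In a real study the list would hold 15-20 or more loci, as recommended in the text.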
3 Gene diversity

In order to overcome the problems with the observed heterozygosity measure we can calculate the expected heterozygosity (Hexp) of a population. Nei (1973) introduced the concept of gene diversity to describe genetic variation in a way that is applicable to both sexual and asexual populations. Gene diversity (Hexp) is defined as the probability of obtaining two different alleles at a locus when two haploid individuals are sampled from a population. Nei's formula for gene diversity is:

$H = 1 - \sum_{k} x_k^2$

where H is gene diversity for a non-random mating population, and $x_k$ is the frequency of the kth allele (Nei, 1973).

A gene diversity close to 1 means that the diversity is so high that any two alleles at a locus sampled from a population are almost certain to be different. At the other extreme, a genetically uniform population (with no allelic variation at the loci sampled) will have a diversity of 0 since any two individuals sampled will be identical. In a diploid mating population gene diversity is equivalent to the proportion of heterozygosity at the locus expected under random mating, the so-called expected heterozygosity (Hexp).

Calculation of gene diversity (Hexp)

Gene diversity can be calculated from the allele frequencies under the assumption that the individuals in the population are mating with each other at random. Applying Nei's (1973) formula to the following problem with two loci, each with 4 alleles, we get:

Locus A               Locus B
Allele   Frequency    Allele   Frequency
A1       0.5          B1       0.2
A2       0.3          B2       0.3
A3       0.1          B3       0.4
A4       0.1          B4       0.1
Total    1.0          Total    1.0

Hexp = 1 - (freq A1² + freq A2² + freq A3² + freq A4²)
Locus A: Hexp = 1 - (0.5² + 0.3² + 0.1² + 0.1²) = 0.64
Locus B: Hexp = 1 - (0.2² + 0.3² + 0.4² + 0.1²) = 0.70
Mean:    Hexp = (0.64 + 0.70)/2 = 0.67

Note that the expected level of heterozygosity is the same as Nei's gene diversity measure (Nei, 1973; PNAS 70: 3321-3323).

Differences between expected and observed levels of heterozygosity may be due to the occurrence of a certain amount of self-fertilization.
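The gene diversity calculation for loci A and B can be reproduced with the following sketch of Nei's formula. The function name `gene_diversity` is hypothetical, and the check that the frequencies sum to 1 is an added safeguard rather than part of the formula.

```python
def gene_diversity(allele_freqs):
    """Nei's gene diversity for one locus: H = 1 - sum of squared allele frequencies."""
    if abs(sum(allele_freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1 - sum(x * x for x in allele_freqs)

locus_a = [0.5, 0.3, 0.1, 0.1]
locus_b = [0.2, 0.3, 0.4, 0.1]

h_a = gene_diversity(locus_a)
h_b = gene_diversity(locus_b)
h_mean = (h_a + h_b) / 2                         # average over loci
print(f"Hexp locus A = {h_a:.2f}, locus B = {h_b:.2f}, mean = {h_mean:.2f}")
```

Setting these expected values beside the observed heterozygosity of the same loci gives the difference between expected and observed heterozygosity referred to above.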
The difference between expected and observed heterozygosity can be quantified as the so-called fixation index, discussed in section 6.11.

4 Genotypic diversity

One of the more challenging aspects of the population genetics of fungal pathogens is the variability in mode of reproduction. Fungi can reproduce sexually, by outbreeding or inbreeding, or reproduce asexually. These reproductive modes are not mutually exclusive either and may occur to varying degrees at the same time. This may lead to population structures varying from the predominance of one or a few clones to strictly sexual reproduction, with many combinations in between. This poses some problems for population genetic approaches and requires well thought out experimental designs to answer biological questions. Where high levels of outbreeding occur, the analysis should involve gene diversity analysis as described before. Where we are dealing with mixed mating populations containing large numbers of clones we have to consider using gene diversity after correcting the population for clones and/or using a measure of genotypic diversity.

When dominant markers are used for population studies we cannot accurately estimate the allele frequencies. Often, when we do not know if we are dealing with sexual or asexual fungal populations, we combine the phenotypic traits (e.g. RAPD fragments) into a multilocus phenotype, sometimes called a haplotype. Often markers are used which give us a so-called DNA fingerprint, or single copy RFLP or isozyme marker data from several loci can be combined to give a so-called multilocus haplotype or multilocus phenotype, depending on the ploidy level of the fungus under investigation. Various measures of genotypic diversity exist:

• Shannon Index
• Clonal Fraction
• Genotypic diversity
• Stoddart and Taylor

A Shannon Index

A diversity measure commonly used for phenotypic analysis of pathogen populations is the Shannon index (SI) (Bowman et al., 1971; Groth and Roelfs, 1987). Genotypic diversity for a population can be calculated according to:

$SI = -\sum_{i=1}^{k} p_i \ln p_i$

where $p_i$ is the frequency of isolates with the ith phenotype in the population, k is the number of phenotypes in the population and $\ln p_i$ is the natural log of $p_i$. The Shannon index takes into account the frequency and evenness of distribution of the phenotypes.

When sample sizes of different populations vary, the Shannon index can be converted into a normalised Shannon diversity index: HS = H/HMAX, in which H is the usual Shannon diversity index over genotypes, and HMAX is ln(N), the maximum diversity for a sample of size N. This statistic is relatively stable when sample sizes vary (Sheldon, 1969).

The Shannon index has a useful property in that it is linearly related to the addition of phenotypic characters that are independent of those already included (Groth and Roelfs, 1989). Because of this simple additive effect, a linear model of the contribution of characters such as virulence, RFLPs etc. is possible. An inherent problem in using phenotypic and haplotypic diversity measures as a genetic diversity measure is that the actual genetic differences between unique multilocus genotypes are not compared. Multilocus phenotypes that differ from each other by one character or by many characters are weighted equally.

B Clonal Fraction

Clonal fraction is a simple statistic which indicates the fraction of clones in a population. This statistic is calculated as

$Cf = \dfrac{N - C}{N}$

where N is the sample size and C is the number of distinct genotypes or clones. If we have a sample size of 10, with 5 isolates of unique genotype and 1 genotype which occurs 5 times (C = 6), then the clonal fraction is (10 - 6)/10 = 0.4. A larger sample with the same genetic make-up, for example N = 100 with 50 unique genotypes and 1 genotype occurring 50 times (C = 51), will have a clonal fraction of (100 - 51)/100 = 0.49.
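The Shannon index, its normalised form and the clonal fraction can all be computed from a list holding the number of isolates found for each distinct genotype. The sketch below assumes that input format; the function names are hypothetical.

```python
from math import log

def shannon_index(counts):
    """SI = -sum(p_i * ln p_i) over the genotypes (or phenotypes) observed."""
    n = sum(counts)
    return -sum((c / n) * log(c / n) for c in counts if c > 0)

def normalised_shannon(counts):
    """HS = H / ln(N); requires a sample of more than one isolate."""
    return shannon_index(counts) / log(sum(counts))

def clonal_fraction(counts):
    """Cf = (N - C) / N, with C the number of distinct genotypes."""
    n = sum(counts)
    return (n - len(counts)) / n

small = [5, 1, 1, 1, 1, 1]        # N = 10: one clone found 5 times, 5 unique isolates
large = [50] + [1] * 50           # N = 100: one clone found 50 times, 50 unique isolates

for name, sample in [("small", small), ("large", large)]:
    print(f"{name}: SI = {shannon_index(sample):.3f}, "
          f"HS = {normalised_shannon(sample):.3f}, "
          f"Cf = {clonal_fraction(sample):.2f}")
```

The two samples reproduce the clonal fractions of 0.4 and 0.49 from the text, although half of each sample belongs to a single clone.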
The comparison of these two samples immediately drives home the message that clonal fraction is dependent upon sample size. Hence, clonal fractions should not be compared between samples of different sizes. The use of this statistic is further restricted because no sampling statistics have been worked out for it.

C Genotypic diversity (Nei)

Where different phenotypes are identified using DNA fingerprinting, or by combining single loci to construct a multilocus genotype (often called a haplotype), genotypic diversity can be calculated using Nei's formula by substituting the genotype frequencies for the allele frequencies:

$H = 1 - \sum_{k} x_k^2$

where H is genotypic diversity for a non-random mating population, and $x_k$ is the frequency of the kth genotype.

D Stoddart and Taylor

When using a measure of genotypic diversity, or any genetic diversity measure for that matter, it is important that the measure has its statistics worked out so we can use it to test hypotheses. Stoddart and Taylor (1988) have done just that for genotypic diversity. Genotypic diversity can be calculated using the formula:

$\hat{G} = \dfrac{1}{\sum_{x=0}^{N} \left[ f_x \left( \dfrac{x}{N} \right)^2 \right]}$

where N is the sample size, and $f_x$ is the number of genotypes observed x times in the sample (Stoddart and Taylor, 1988). The maximum possible value for Ĝ is the number of individuals in the sample, which occurs when each individual in the sample has a different genotype. To compare Ĝ between collections of different sample sizes, Ĝ can be divided by N to calculate the percentage of maximum diversity obtained (Ĝ/N) (McDonald et al., 1994). The significance of differences between the percentages of maximum diversity (Ĝ/N) for each sub-population can be calculated using a t-test (Chen et al., 1994), given by the formula:

$t = \dfrac{\dfrac{\hat{G}_1}{N_1} - \dfrac{\hat{G}_2}{N_2}}{\sqrt{\dfrac{\mathrm{Var}(\hat{G}_1)}{N_1^2} + \dfrac{\mathrm{Var}(\hat{G}_2)}{N_2^2}}}$

where

$\mathrm{Var}(\hat{G}) = \dfrac{4}{N} G^2 \left[ G^2 \sum_{i=1}^{K} p_i^3 - 1 \right]$

G is the population genotypic diversity, K is the number of genotypes in the sample, and $p_i$ is the frequency of the ith genotype in the sample. Ĝ is the maximum likelihood estimator for G in this formula (Stoddart and Taylor, 1988). The t-test is evaluated at a significance level of P = 0.05. The number of degrees of freedom is N1 + N2 - 2. This formula can be used to reflect to what degree asexual reproduction contributes to population genetic structure in fungal species with both asexual and sexual reproductive stages in their life cycle.

Example of calculating genotypic diversity

The next example is obtained from Chen et al. (1994), a population genetic study of the haploid fungus Septoria tritici. Two populations were collected, one early in the season and the other late. The data, identified using low copy RFLP markers, are summarised in the tables below.

                 Early season   Late season
Total sample     129            277
Nr. genotypes    120            251

Early season
Number of genotypes     112    7    1
Times observed          1      2    3

Late season
Number of genotypes     231    15   4    1
Times observed          1      2    3    4

Genotypic diversity in the early population
Ĝ = 1/{112 x (1/129)² + 7 x (2/129)² + 1 x (3/129)²} = 112
Ĝ/N% = 112/129 = 86.8%

Genotypic diversity in the late population
Ĝ = 1/{231 x (1/277)² + 15 x (2/277)² + 4 x (3/277)² + 1 x (4/277)²} = 224
Ĝ/N% = 224/277 = 80.8%

From these data it appears that genotypic diversity has decreased from early to late in the season. Of course we want to know whether or not this is a significant difference. In order to do that we first need to calculate the variance.
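Before moving on to the variance, the two values of Ĝ above can be reproduced with a short sketch of the Stoddart and Taylor formula. The input format (a dictionary mapping x, the number of times a genotype was observed, to f_x, the number of such genotypes) and the function name are assumptions made for this illustration.

```python
def stoddart_taylor_g(f):
    """Return (N, G-hat) where G-hat = 1 / sum(f_x * (x / N)^2)."""
    n = sum(x * fx for x, fx in f.items())       # total number of isolates
    g_hat = 1 / sum(fx * (x / n) ** 2 for x, fx in f.items())
    return n, g_hat

# Data from the Septoria tritici example: {times observed: number of genotypes}
early = {1: 112, 2: 7, 3: 1}
late = {1: 231, 2: 15, 3: 4, 4: 1}

for name, data in [("early", early), ("late", late)]:
    n, g_hat = stoddart_taylor_g(data)
    print(f"{name}: N = {n}, G = {g_hat:.1f}, G/N = {100 * g_hat / n:.1f}%")
```

Because the unrounded Ĝ is used, the printed percentages can differ in the last digit from those obtained with Ĝ rounded to 112 and 224.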
Variance of Ĝ in the early population

Var Ĝ = 4/129 x 112^2 x [112^2 x (112 x (1/120)^3 + 7 x (2/120)^3 + 1 x (3/120)^3) - 1] = 161.6

Since the standard deviation is the square root of the variance, Ĝ = 112 ± 12.7. Note that this is the variance of Ĝ and not of Ĝ corrected for sample size.

Variance of Ĝ in the late population

Var Ĝ = 4/277 x 224^2 x [224^2 x (231 x (1/251)^3 + 15 x (2/251)^3 + 4 x (3/251)^3 + 1 x (4/251)^3) - 1] = 477.8

Ĝ = 224 ± 21.9

In order to test whether the early and late populations are different we perform the t-test:

t = |112/129 - 224/277| / √(161.6/129^2 + 477.8/277^2) = 0.47

The number of degrees of freedom is 129 + 277 - 2 = 404. According to the t-table this difference is not significant at the 5% level. Hence, the null hypothesis that the early and late populations do not differ in genotypic diversity cannot be rejected.

6.8 Linkage disequilibrium

Population structure is largely affected by the mode of reproduction. Sexual reproduction affects the association of genes in an individual. The genotypic proportions for a particular locus in a random mating population are the product of the allelic frequencies of the two uniting gametes, a condition called the Hardy-Weinberg equilibrium. Hardy-Weinberg equilibrium at a single locus is achieved in one generation of random mating; it therefore only tells us about the pattern of mating of the generation directly before the observation is made (Crow, 1986). To learn about the history and pattern of reproduction and gene association, a measure of linkage disequilibrium (alternatively called gametic disequilibrium) is used to describe the deviation of the observed genotypic frequencies from those expected under random association of alleles at different loci.
Linkage disequilibrium (D) can be measured for two loci A and B according to

$$D_{ij} = P_{ij} - p_i p_j$$

where P_ij is the observed frequency of A_iB_j genotypes in the population, and p_i and p_j are the frequencies of alleles A_i and B_j at loci A and B, respectively, in the population (Hartl and Clark, 1989).

A pair of loci is said to be in linkage equilibrium when the observed genotypic frequencies do not deviate significantly from the product of the individual allelic frequencies at the two loci examined. Physical linkage between loci contributes to disequilibrium because genes on the same chromosome, unless separated by recombination, will be associated more often than expected at random. Besides physical linkage, association of alleles at different loci can be caused by selection and genetic drift. Linkage disequilibrium is an important concept in the analysis of fungal populations because selection for virulence alleles by resistant host genotypes, coupled with a high asexual reproductive capacity, can result in strong linkage disequilibrium. Genetic exchange and recombination, either sexual or parasexual, dissipate disequilibrium. Hence, the level of linkage disequilibrium can be used as an indirect measure of the significance of genetic exchange and recombination in a presumably asexual population.

When alleles at different loci are not associated at random, they are said to be in linkage disequilibrium. When alleles at different loci are associated at random (i.e. in proportion to their frequencies), the loci are in linkage equilibrium and D approaches 0. The maximum absolute value D can have is 0.25, namely when linkage disequilibrium is complete and the allelic frequencies are 0.5 at both loci. When the allelic frequencies at the two loci differ, complete linkage disequilibrium is not possible. The Hardy-Weinberg law dictates that equilibrium among alleles at a single locus is reached in a single generation of random mating.
In contrast, linkage disequilibrium decreases gradually with every generation of random mating in the absence of selection. Permanent linkage disequilibrium may result from selection if some gametic combinations result in higher fitness than other combinations. Assume two loci, A and B, with two alleles at each locus, (A and a) and (B and b). Assume further that alleles A and B interact well with each other, producing a well adapted phenotype, and that the same is true for alleles a and b. However, the combinations Ab and aB yield poorly adapted phenotypes. The population as a whole would benefit if in most cases the alleles were transmitted in the combinations AB and ab and rarely in Ab and aB.

If the allele frequencies at the above two loci are:

Locus        A          B
Allele       A    a     B    b
Frequency    p    q     r    s

and the alleles at the two loci are associated at random, then the four possible gametic classes are expected to have frequencies that are the product of the frequencies of the alleles involved, that is:

Gametes      AB    ab    Ab    aB
Frequency    pr    qs    ps    qr

If the alleles are associated at random, the product of the frequencies of the two coupling gametes (pr x qs = pqrs) is the same as the product of the frequencies of the two repulsion gametes (ps x qr = pqrs). Where the alleles are not randomly associated, the two products will differ. This difference is called linkage disequilibrium:

D = prqs - psqr

where pr, qs, ps and qr now stand for the observed frequencies of the gametes AB, ab, Ab and aB. Where there is linkage equilibrium, D = 0. Linkage disequilibrium is complete when only two gametic combinations exist, either only the two coupling gametes or only the two repulsion gametes.
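The definition of D can be checked with a small Python sketch. The function names are hypothetical, and the observed gametic frequencies are the ones used in the numerical example of this section; the sketch also evaluates the two limiting cases described above (random association and complete disequilibrium with allele frequencies of 0.5).

```python
def linkage_disequilibrium(f_AB, f_ab, f_Ab, f_aB):
    """D = product of coupling gametes minus product of repulsion gametes."""
    total = f_AB + f_ab + f_Ab + f_aB
    if abs(total - 1.0) > 1e-6:
        raise ValueError("gametic frequencies should sum to 1")
    return f_AB * f_ab - f_Ab * f_aB


def random_association(p, r):
    """Gametic frequencies (AB, ab, Ab, aB) expected when alleles associate at random."""
    q, s = 1.0 - p, 1.0 - r
    return p * r, q * s, p * s, q * r


# Observed gametic frequencies from the numerical example in this section
d_observed = linkage_disequilibrium(0.453, 0.019, 0.076, 0.452)
# Random association with allele frequencies of 0.5 at both loci
d_random = linkage_disequilibrium(*random_association(p=0.5, r=0.5))
# Only the two coupling gametes exist, each at frequency 0.5
d_complete = linkage_disequilibrium(0.5, 0.5, 0.0, 0.0)

print(f"observed gametes:        D = {d_observed:.3f}")
print(f"random association:      D = {d_random:.3f}")
print(f"complete disequilibrium: D = {d_complete:.3f}")
```

The same value of D is obtained from the form P_AB - p_A p_B, because the allele frequencies are simply the marginal sums of the gametic frequencies.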
The maximum absolute value that D can have is 0.25, namely when linkage disequilibrium is complete and the allele frequencies are 0.5 at both loci.

Example of linkage disequilibrium

If we put some numbers into the above example:

Gametes      AB       ab       Ab       aB
Frequency    pr       qs       ps       qr
             0.453    0.019    0.076    0.452

D = prqs - psqr
D = 0.453 x 0.019 - 0.076 x 0.452 = -0.026

A test for the significance of the disequilibrium coefficient between each pair of alleles at two loci can be formulated with the following chi-square statistic:

$$X^2_{AB} = \frac{n \hat{D}_{ab}^2}{P_a (1 - P_a) P_b (1 - P_b)}$$

where n is the number of individuals in the sample and D̂_ab is the maximum likelihood estimator of the disequilibrium between alleles A and B. The observed allele frequencies at the two loci are P_a and P_b, respectively. This chi-square statistic has one degree of freedom (Weir, 1990). It is common to exclude alleles with a frequency lower than 5% from the analysis, as extremely large sample sizes are needed to obtain any meaningful disequilibrium coefficient for them. Also common is the combining of rare alleles into a single class with a frequency larger than 5%.

6.9 Population differentiation

In plant pathology we are often interested in population differentiation. Typical questions include: are pathogen populations in different fields, in different regions, or obtained from different crops the same, or are they significantly different? There are a number of tests we can perform to find out whether or not populations are the same. At the genotypic level we can use Stoddart and Taylor's measure of genotypic diversity, for which there is a t-test. This test can be used in asexually reproducing populations to see if there are significant differences in mode of reproduction between different fields. The problem with this approach is that this measure does not take the level of differences between the clones into account.
For a numerical example of testing differences in genotypic diversity see section 6.7d. In highly sexually reproducing populations, comparing levels of genotypic diversity is not very useful as both populations will show extremely high levels. Different levels of genotypic diversity are more important when the biological question relates to the importance of various modes of reproduction over the course of an epidemic or a season, in a temporal sense. This is the case when populations have been sampled early and late in the season, and again the following year, and we want to test whether genotypic diversity has changed significantly over the course of the epidemic or the life cycle of the pathogen.

When co-dominant markers have been used in such a population study, we can use a Chi-square test to test whether allele frequencies are significantly different between the populations. One of the most straightforward ways of testing for genetic differentiation is to use contingency table Chi-square tests. With v alleles at a locus, the genotype counts in each of r samples are arranged in a [v(v+1)/2] x r contingency table, giving a Chi-square statistic with [v(v+1)/2 - 1] x (r - 1) degrees of freedom. A problem with this method arises when low allele frequencies are encountered, which give rise to rather large test statistics. To avoid this it may be necessary to combine rare alleles. This problem increases when the number of alleles increases. As a rule of thumb, goodness of fit Chi-square tests should not be performed with expected class sizes of less than 5.

Contingency Chi-square analysis

Differences in allele frequencies between populations can be tested using a X2 test for heterogeneity (Workman and Niswander, 1970). For a diploid fungus the allele frequencies at each RFLP locus can be tested in the following manner.
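As a companion to the hand calculation in this section, the contingency Chi-square procedure can be sketched in plain Python. The function name and the small table of 5% critical values are illustrative choices; the allele counts are those of the two-population example worked by hand here (40 and 10 in population 1, 30 and 20 in population 2). The sketch assumes every row and column total is greater than zero.

```python
# 5% critical values of the Chi-square distribution for 1 to 3 degrees of freedom
CHI2_CRITICAL_5PCT = {1: 3.841, 2: 5.991, 3: 7.815}


def contingency_chi_square(observed):
    """Return (X2, df) for a table of counts: rows are populations, columns are alleles."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count under the null hypothesis of no differentiation
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi2, df


counts = [
    [40, 10],  # population 1: alleles A, a
    [30, 20],  # population 2: alleles A, a
]
chi2, df = contingency_chi_square(counts)
significant = chi2 > CHI2_CRITICAL_5PCT[df]
print(f"X2 = {chi2:.2f}, df = {df}, significant at 5%: {significant}")
```

The same function accepts tables with more populations or more alleles, in which case the degrees of freedom follow from (r-1)(c-1).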
Consider two populations of 50 individuals each with the following allele frequencies:

Allele          A      a
Population 1    0.8    0.2
Population 2    0.6    0.4

There is no single hypothesis that tells us what frequency to expect in each class, but we can test whether the two populations are independent by means of a 2x2 contingency table. First set up a table with the observed results:

Allele          A     a     Totals
Population 1    40    10    50
Population 2    30    20    50
Totals          70    30    100

We can now calculate the expected result for each of the classes, under the null hypothesis that there is no genetic differentiation, by multiplying the corresponding subtotals and dividing by the grand total. For example, the expected number of allele A in population 1 is (70x50)/100 = 35. The 2x2 contingency table of expected results is:

Allele          A     a     Totals
Population 1    35    15    50
Population 2    35    15    50
Totals          70    30    100

This gives rise to the following X2 value:

X2 = (35-40)^2/35 + (35-30)^2/35 + (10-15)^2/15 + (20-15)^2/15 = 4.76

Although there are four classes, the number of degrees of freedom in this case is 1, not 3, because only one of the four values in the 2x2 contingency table needs to be known in order to calculate the others by subtraction from the subtotals. In general the number of degrees of freedom is (r-1)(c-1), where r is the number of rows and c the number of columns (not including the subtotals). The Chi-square value of 4.76 is larger than the critical Chi-square value for one degree of freedom at the 5% level of significance (3.84). Hence, we can conclude that there is a significant difference in allele frequencies between these populations. Note that the Chi-square statistic needs to be corrected for haploids by dividing the value by 2.

F statistics

The definition of a population is a difficult one involving time, geographic distance, and biology of the organism.
A population of a pathogen on one continent can range from one large population, to overlapping subpopulations, to numerous small distinct subpopulations. In order to determine at what level and scale a population of a pathogen operates, we can subdivide or partition the variation in the population. Sewall Wright (1951) was the first to develop methods to partition variation. His measurements, often called F-statistics, are based on the idea of inbreeding in a diploid mating population. If a population is subdivided into several genetically related subpopulations, then two randomly uniting gametes chosen within a subpopulation are more likely to be related by descent than two gametes from different subpopulations. Wright's fixation index (Fst) is a measure of the genetic differentiation of a subpopulation relative to the total population due to non-random mating. For a single locus with two alleles, Fst is calculated as

$$F_{st} = \frac{\sigma_q^2}{\bar{q}(1 - \bar{q})}$$

where σ_q^2 is the variance in the frequency of allele A2 among subpopulations and q̄ = Σ w_i q_i is the weighted frequency of A2 in the total population. When subpopulations do not differ significantly in sample size, equal weight is given to each subpopulation. A Chi-square test is commonly used where X2 = 2NFst, which may only be applied if sample sizes are identical for all populations and only two alleles occur at the locus compared. When there are more than two alleles per polymorphic locus, a powerful test for significant deviation of Fst from zero is the log-likelihood Chi-square test (G-test) of homogeneity of the allele distributions themselves. The null hypothesis that there is no substructuring of the population (i.e. Fst = 0) can be examined by testing for heterogeneity of allele frequencies between subpopulations using the G-test (with M-1 degrees of freedom, where M is the number of populations). A common problem with this approach is the presence of alleles at very low frequencies.
These may be combined in the contingency tables until all expected cell frequencies exceed 1.0. However, it is then no longer possible to test for significant heterogeneity between subpopulations because of the combining of different alleles. Nevertheless, by combining cells containing alleles at low frequencies, the G-test may be used to determine the maximum number of cells that can be combined before heterogeneity reaches statistical significance. Using this approach the mean of Gst over all variable loci may be compared to zero by the t-test.

Wright's F-statistics have the advantage of allowing a simultaneous comparison of allele frequencies for a number of populations among many loci. However, they assume loci that are effectively neutral. Fst is the most commonly used of the F-statistics and gives a measure of the extent to which a species is organised into subpopulations with restricted gene flow. It represents the correlation between alleles of gametes sampled at random from two subdivisions of a population, relative to the distribution of alleles within the entire population sampled. Fst reflects the extent of local differentiation into subpopulations. Fst is always positive (Wright, 1965) and its calculation requires genotypic information for single loci.

Nei (1973) generalised Wright's population subdivision concept to haploids and asexual populations. Nei's approach to measuring genetic differentiation is to partition the total gene diversity into component diversities according to subgroups, such as those from geographic locations or pathotypic groups in the case of pathogen populations. A genetic differentiation coefficient (Gst) is defined as

$$G_{st} = \frac{\hat{H}_t - \hat{H}_s}{\hat{H}_t}$$

Ĥt = Total gene diversity over all groups, which is identical to Nei's gene diversity (1973) and to the expected level of heterozygosity in the total population.

Ĥs = Average gene diversity over all subgroups.
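A minimal Python sketch of this definition follows. It is an illustration under stated assumptions, not Nei's full estimator: it treats a single locus, gives equal weight to every subpopulation, and applies no correction for sample size. The function names are hypothetical, and the allele frequencies are those of the two-population example used for the contingency Chi-square test (0.8/0.2 and 0.6/0.4).

```python
def gene_diversity(freqs):
    """Nei's gene diversity for one locus: H = 1 - sum of squared allele frequencies."""
    return 1.0 - sum(x * x for x in freqs)


def gst(subpop_freqs):
    """Return (Ht, Hs, Gst) for one locus, weighting all subpopulations equally.

    subpop_freqs is a list with one allele-frequency list per subpopulation,
    with alleles in the same order in each. Ht must be non-zero, so the locus
    has to be polymorphic in the total sample.
    """
    k = len(subpop_freqs)
    # Hs: average of the within-subpopulation gene diversities
    h_s = sum(gene_diversity(f) for f in subpop_freqs) / k
    # Ht: gene diversity computed from the mean allele frequencies
    n_alleles = len(subpop_freqs[0])
    mean_freqs = [sum(f[a] for f in subpop_freqs) / k for a in range(n_alleles)]
    h_t = gene_diversity(mean_freqs)
    return h_t, h_s, (h_t - h_s) / h_t


h_t, h_s, g = gst([[0.8, 0.2], [0.6, 0.4]])
print(f"Ht = {h_t:.3f}, Hs = {h_s:.3f}, Gst = {g:.3f}")
```

For this two-allele case the result coincides with Fst computed as the variance of the allele frequency divided by q̄(1 - q̄), which is a useful check on either calculation.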
Individual Hs estimates are the Nei (1973) gene diversity measure based on the allele frequencies found within a particular subpopulation. Gst, like Fst, describes the average amount of genetic diversity attributed to a particular subdividing factor relative to the total level of genetic diversity. Gst is a rather useful measure as it is not contingent upon any assumption about the mode of reproduction of the population. A low Gst value indicates that most of the gene diversity is found within the subpopulations and that there is little gene diversity between them, thus indicating low levels of population differentiation. Nei's Gst provides a good measure of the degree of similarity between taxa; however, it lacks the usual test statistics with their associated levels of confidence. When a Gst value is high, the gene diversity within subpopulations is small compared to the gene diversity within the total population, indicating substantial genetic differentiation between the subpopulations. When Gst approaches 1, each subpopulation becomes homogeneous and most of the variation exists between the subpopulations. Note that Nei's gene diversity measure can also be used for genotypic diversity. This allows partitioning of genotypic diversity within and between subpopulations.

Subdivision itself can also affect genotype frequencies. If a species is divided into subpopulations within which there is random mating, and the allele frequencies differ between subpopulations, then for the species as a whole homozygous genotypes will increase at the expense of heterozygous genotypes. This is known as the Wahlund effect, and it has the same effect on overall heterozygosity as inbreeding.

Of course the question is what value of Gst is indicative of significant differentiation between subpopulations. In other words, which observed distribution of gene diversity supports the null hypothesis of no differentiation among subpopulations?
Population differentiation based on Gst is fairly straightforward, but Nei did not provide any statistics with this measure. A simple nonparametric method to test for population differentiation is described in Hudson et al. (1992). This procedure involves tabulating the observed gene diversity (or genotypic frequencies) of each subpopulation and subjecting the data to a Chi-square homogeneity test. Significant deviation of observed from expected frequencies leads to rejection of the null hypothesis of no differentiation between the subpopulations.

Slatkin (1993) has used Gst to devise a statistic which allows testing of isolation by distance among different subpopulations. This is based on the reasoning that if there is gene flow between populations which are geographically close and the population is in equilibrium, the logarithm of the average number of migrants per generation between each pair of subpopulations, M̂, is expected to be negatively correlated with the logarithm of the geographic distance between the subpopulations. M̂ can be estimated as:

$$\hat{M} = \frac{1}{4}\left(\frac{1}{G_{st}} - 1\right) \quad \text{for diploids}$$

$$\hat{M} = \frac{1}{2}\left(\frac{1}{G_{st}} - 1\right) \quad \text{for haploids}$$

After M̂ is calculated, the log of M̂ for each pair of populations can be plotted against the log of the geographic distance for that pairwise comparison and tested as to whether there is a significant negative correlation. See Milgroom and Lipari (1995) for an example of this.

6.10 Partitioning of genetic diversity

Partitioning of genotypic diversity coupled with hierarchical sampling schemes is often used in plant pathology. Consider a two-level hierarchical sampling scheme (fields within regions) used to collect isolates: the total genotypic diversity can be partitioned into components based on the amount of diversity within and among subpopulations. The relative magnitude of each component can be assessed following methods developed by Lewontin (1972), Zhang et al. (1987), and Goodwin et al. (1992a).
Partitioning works in much the same way for all indices and the following approach is generally applicable. For each region, hfield can be calculated as the mean of ho for all fields in the region, and hregion can be calculated from the mean frequency of all genotypes within the region. The total diversity, htotal, is then determined from the mean frequencies of all genotypes in the entire sample. The mean within-field and among-field within-region diversity values, hfield and hregion, are the averages of the hfield and hregion values, weighted by the number of fields in each region. The total diversity is allocated to hierarchical components as follows: hfield/htotal is the proportion of total diversity that is due to differences within fields; (hregion - hfield)/htotal is the proportion of total diversity due to differences among fields within regions; and (htotal - hregion)/htotal is the proportion of total diversity due to differences among regions.

6.11 Fixation index

Differences between expected and observed levels of heterozygosity may be due to the occurrence of a certain amount of self-fertilization. This difference can be exploited in a fixation index (F), which tells us something about the kind of reproduction in the population. Higher levels of inbreeding result in lower levels of heterozygosity and lower overall levels of genetic diversity, and this can be shown at any hierarchical level. Thus the mating system can be analysed by comparing the observed proportion of heterozygotes in a population to that expected assuming random mating. Wright's fixation index (F) can be calculated as:

$$F = 1 - \frac{H_{obs}}{H_{exp}}$$

in which Hobs is the observed mean heterozygosity per locus, and Hexp is the expected mean heterozygosity (Brown, 1979). Hexp is the same as Nei's gene diversity (Nei, 1973). Random mating populations should have F-values close to 0, as Hobs approaches Hexp.
Under complete selfing, F-values close to 1 are expected as Hobs approaches 0. Values between 0 and 1 indicate various levels of inbreeding. A value of F less than 0 indicates an excess of heterozygotes (Goodwin, 1997), arising through either disassortative mating or asexual reproduction. Asexual reproduction can skew F-values in either direction, depending on the level of heterozygosity present in the most prevalent clones (Goodwin, 1997) and on whether most of the polymorphisms are present as heterozygotes or homozygotes. Needless to say, the fixation index only applies to diploid organisms.

6.12 Genetic distance

Heterozygosity gives one measure of genetic variation, but there are other methods which have the advantage that values may be directly compared between populations and between different sized groupings (i.e. between versus within species), and which can thus be used to measure genetic differentiation during the speciation process and possible phylogenetic relationships. Nei's genetic identity (I) estimates the normalised probability that two alleles, one taken from each population, are identical. Essentially, it provides a measure of the similarity in frequency of each allele, summed over all alleles. It is given by:

$$I = \frac{I_{xy}}{\sqrt{I_x I_y}}$$

where Ixy, Ix and Iy are the averages over all loci (including monomorphic ones) of Σ x_i y_i, Σ x_i^2 and Σ y_i^2, respectively, and x_i and y_i are the frequencies of the ith allele in the two populations X and Y. Genetic identity may vary from zero (no alleles shared between the two populations) to one (both populations have identical allele frequencies). For traits with two or more alleles the probabilities must be calculated for each allele separately and summed. The genetic distance between the two populations is then calculated as D = -ln I. Genetic distance varies from 0 for populations with identical allele frequencies to infinity for populations that do not share any alleles.
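Nei's identity and distance as defined above can be sketched in a few lines of Python. The function names are illustrative; the allele frequencies are those of the three-locus example used in this section, and the sketch applies no small-sample correction.

```python
from math import log, sqrt


def nei_identity(pop_x, pop_y):
    """Nei's genetic identity I = Ixy / sqrt(Ix * Iy).

    pop_x and pop_y each hold one allele-frequency list per locus, with loci
    and alleles in the same order in both populations.
    """
    n_loci = len(pop_x)
    i_xy = sum(sum(x * y for x, y in zip(lx, ly))
               for lx, ly in zip(pop_x, pop_y)) / n_loci
    i_x = sum(sum(x * x for x in lx) for lx in pop_x) / n_loci
    i_y = sum(sum(y * y for y in ly) for ly in pop_y) / n_loci
    return i_xy / sqrt(i_x * i_y)


def nei_distance(pop_x, pop_y):
    """Genetic distance D = -ln(I); infinite when no alleles are shared."""
    identity = nei_identity(pop_x, pop_y)
    return float("inf") if identity == 0 else -log(identity)


# Loci A, B and C with two alleles each
population_1 = [[0.1, 0.9], [0.4, 0.6], [1.0, 0.0]]
population_2 = [[0.2, 0.8], [0.3, 0.7], [0.0, 1.0]]

print(f"I = {nei_identity(population_1, population_2):.3f}")
print(f"D = {nei_distance(population_1, population_2):.3f}")
```

Averaging Ixy, Ix and Iy over loci before taking the ratio, rather than averaging per-locus identities, is what allows a locus with no shared alleles (locus C here) to lower I without making D infinite.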
(See Nei 1978 for modifications concerning small sample sizes.)

Calculation of genetic distance

Consider two populations, 1 and 2, with 3 loci and the following allele frequencies:

Locus    Allele    Population 1    Population 2
A        A         0.1             0.2
         a         0.9             0.8
B        B         0.4             0.3
         b         0.6             0.7
C        C         1               0
         c         0               1

In order to calculate genetic identity it is easiest to set up a table:

Locus      Ixy                            Ix                     Iy
A          (0.1x0.2)+(0.9x0.8) = 0.74     0.1^2+0.9^2 = 0.82     0.2^2+0.8^2 = 0.68
B          (0.4x0.3)+(0.6x0.7) = 0.54     0.4^2+0.6^2 = 0.52     0.3^2+0.7^2 = 0.58
C          (1x0)+(0x1) = 0                1^2+0^2 = 1            0^2+1^2 = 1
Total      1.28                           2.34                   2.26
Average    0.427                          0.78                   0.753

Genetic identity is

I = Ixy / √(Ix Iy) = 0.427 / √(0.78 x 0.753) = 0.557

Genetic distance is D = -ln I, so D = -ln 0.557 = 0.585.

That is, it is estimated that 0.585 allelic substitutions per locus (or 58.5 allelic substitutions per 100 loci) have occurred in the separate evolution of the two populations. Note that more than three loci need to be studied in order to obtain a reliable estimate of genetic distance or genetic differentiation between any two populations. The following table (Ayala, 1975) gives an idea of what distances to expect at various levels. See also the review by Avise and Aguado (1982).

Level of comparison                  Genetic identity I    Genetic distance D
Local populations                    0.970 ± 0.006         0.031 ± 0.007
Subspecies                           0.795 ± 0.013         0.230 ± 0.016
Incipient species                    0.798 ± 0.026         0.226 ± 0.033
Sibling species                      0.563 ± 0.023         0.581 ± 0.039
Morphologically different species    0.352 ± 0.023         1.056 ± 0.068

Genetic distance and genetic identity are often used as the basis for analyses that reconstruct evolutionary relationships. Nei's (1972, 1978) identity (I) has a clear biological meaning: the probability that alleles drawn at random from two populations or taxa are identical.
Genetic distance is therefore an estimate of the number of nucleotide base-pair substitutions per gene locus that have accumulated during evolutionary time since the divergence of taxa from their common ancestor. This is under the assumption that evolutionary rates are identical along lineages and among loci. If we compare two taxa, a D value of 0.01 would indicate that one allelic substitution has occurred per 100 loci since the divergence of these taxa from their common ancestor.

6.13 Similarity and dissimilarity indices

There are other measures of diversity which are often used in combination with qualitative data (presence or absence of characters such as RAPD fragments). There are a large number of these so-called similarity indices and I have outlined a few using a small data set.

For qualitative characters (presence or absence of characters), variables for two individuals (1 and 2) can be classed as follows:

a = both individuals have the fragment (++)
b = individual 1 has the fragment, individual 2 does not (+-)
c = individual 1 does not have the fragment, individual 2 does (-+)
d = neither individual has the fragment (--)
p = a + b + c + d

                     Individual 2
                     +        -
Individual 1    +    a        b        a+b
                -    c        d        c+d
                     a+c      b+d      p

Sample data set

Variable    1  2  3  4  5  6  7  8  9  10
Strain 1    1  0  0  0  1  1  0  0  1  0
Strain 2    0  0  0  0  1  0  0  1  1  0

For this data set a = 2, b = 2, c = 1, d = 5 and p = 10.

1. Simple matching coefficient      SI = (a+d)/p
2. Jaccard's coefficient            SI = a/(a+b+c)
3. Czekanowski's coefficient        SI = 2a/(2a+b+c)
4. Sokal & Sneath coefficient       SI = 2(a+d)/(2(a+d)+b+c)
5.                                  SI = a/(a+2(b+c))
6. Russel & Rao's coefficient       SI = a/p

Index    1       2       3       4       5       6
SI       0.70    0.40    0.57    0.82    0.25    0.20

According to SI 4 the strains are similar, but according to SI 6 they are dissimilar. Which similarity index are you going to use?

6.14 Suitability of markers for population genetics

Each particular biological problem can be attacked using one or a series of different genetic markers.
Each marker system has its own advantages and disadvantages. The following table gives a rough indication of which markers are most suited to which problem.

Questions: Population dynamics; Geographic distribution; Gene flow; Drift; Recombination; Clonality; Random mating; Parasexuality; Gene diversity; Genotypic diversity; Selection (4)

Marker classes: Biological markers; Cytoplasmic markers; Neutral molecular markers (Phenotypic, Genotypic)

Ratings: + + + ++ +++ +++ +++ + +2 +1 +++ +++ +5 ++ ++ ++ +++ + ++3 +++ +++ +++ +++ +++ +++ +++ +++ +++ +++

Key
(blank)  Unsuitable
+        Limited information can be deduced
++       Estimations can be made with these tools
+++      Most informative techniques available

Mating type as a marker is a good indicator of the potential for sexual reproduction in heterothallic species; however, other biological markers such as virulence are typically infrequent and under selection, which makes them unsuitable for assessment of random mating.
3 Genotypic diversity can be estimated with a phenotypic marker such as RAPDs. However, results are not always reproducible, and codominance is not distinguished. Hence, a genotypic marker technique such as RFLPs, which may initially appear more time consuming, may actually save time and costs in the long term by reducing the repetitions required.
4 Selection is usually assessed by a combination of neutral and selectable markers.
5 Useful if the selected trait is located on extrachromosomal elements such as mitochondrial DNA.

Isozymes

The amount of genetic diversity detected by using isozymes is subject to strong bias. A major source of bias is that only approximately one-third of all amino acid substitutions can be detected by electrophoresis (Lewontin, 1974). The other substitutions do not change the charge of the protein and will thus not result in separation of the isozymes in an electrical field.
In addition, small differences in rate of migration are not always detectable, so some amino acid substitutions that do influence net charge are also “silent”. Be aware of the strong bias in favour of enzyme systems that show polymorphisms when screening enzyme systems for use in population studies.

Problems with RAPD markers

RAPD markers are dominant and only allow one fragment per locus to be identified, in the form of presence or absence. After reading the previous sections it must have become clear that the use of dominant markers is rather limited in population genetics. However, they do allow estimates of phenotypic diversity and are good for identification of clonal phenotypes. I have listed a few of the problems and possibilities with these markers without going into great detail (Lynch and Milligan, 1994. Molecular Ecology 3: 91-99).

Practical problems
1. How to assign bands to loci (need to do pedigree analysis)
2. Products of different loci can have similar molecular weight
3. Dominance of RAPD markers

Lynch and Milligan give estimates of
• Allele frequency
• Genotype frequency
• Gene diversity
• Population subdivision
• Genetic distance
• Relatedness (very limited)

Conclusions:
• For RAPDs, 2-10 times more individuals need to be sampled per RAPD locus than per RFLP or isozyme locus
• Loci with high frequency cannot be used
• Many more loci need to be scored

Possibilities with RAPD markers (see Peever and Milgroom, 1994. Can. J. Bot. 72: 915-923)

HOW TO OBTAIN A SET OF NEUTRAL RFLP MARKERS

After you have clearly defined your biological question and have come up with an experimental design which allows you to rigorously test your hypothesis using the appropriate statistics, it is time to start thinking about some aspects of the markers you are going to use. I have outlined some details of how one can obtain useful RFLP markers, but the principles are the same for virtually any marker.
Issues to be worked out involve:
• How to grow large numbers of your fungus in an efficient way
• Try some DNA isolation procedures and choose one which gives reliable and good quality DNA
• Make sure the DNA cuts well with a number of different restriction enzymes
• You need 2-5 µg DNA per lane and enough DNA for about 4-5 gels
• Store your DNA at -20 °C and never at 4 °C (don’t worry about the shearing as that is a myth). At 4 °C your DNA will slowly degrade.

You will need probes from a library, either an existing cDNA or genomic library, or you need to make one. If you make a library, screen it against high copy probes by labelling part of the entire library and probing it back to a plated-out copy of the library. Also select against ribosomal repeats. Obtain about 50-100 clones from this library, which then need to be screened to determine whether or not they are from single loci. This can be done by screening the clones on about four genetically different isolates cut with the same enzyme as the one the library was constructed with. Make a number of duplicate blots with 4 different enzymes on them so you can rapidly screen 4-8 probes at a time, depending on the capacity of your hybridisation oven. Screen the 50-100 clones in 5-10 hybridisation rounds. Since you have used 4 different enzymes, you have screened a minimum of 200 probe-enzyme combinations. This should yield 20-25 single copy probes. Ideally the isolates used for probe selection should be different from the population you are going to analyse, in an effort to avoid bias towards selecting monomorphic or polymorphic loci. If you work with a sexually reproducing fungus and you have progeny available, you can do even better and demonstrate that you have truly single copy markers and prove that all the markers segregate independently from each other. In order to avoid the occurrence of complex patterns you need to cut your clones with the three other enzymes as well.
If you still find more than two fragments for a diploid organism, then it is likely that two or more loci are involved or that your enzyme cuts within the locus. By using only a fragment within which none of the restriction enzymes cuts, you increase the chance of getting easy-to-interpret patterns. You will need at least 10 different clones, and you may use two to three different enzymes with the same clone. This should yield between 20 and 30 useful probe-enzyme combinations, enough for most population genetic studies. Thus instead of using a single blot 20 times (which is beyond most blots) it is easier to make 2-3 blots with different enzymes and probe them with the same clone. This significantly reduces the number of hybridisations which need to be performed and speeds up your research.

Also consider the logistics of your approach. If you have gels with 20-well combs and you have two controls on each gel, it is handy to choose 36 as your sampling unit in the experimental design (18 samples per gel, two gels). It is worthwhile spending a little time to come up with an efficient approach, as you will be rewarded later with an efficient sample throughput.

LITERATURE CITED

Avise, J.C. and Aguado, C.F. 1982. A comparative summary of genetic distances in vertebrates. Evolutionary Biology 15: 151-185.
Ayala, F.J. 1975. Genetic differentiation during the speciation process. Evolutionary Biology 8: 1-78.
Bowman, K.O., Hutcheson, K., Odum, E.P. and Shenton, L.R. 1971. Comments on the distribution of indices of diversity. pp. 315-359 in: Patil, G.P., Pielou, E.C. and Waters, W.E. (eds.). Statistical Ecology Volume 3. Many species populations, ecosystems, and systems analysis. The Pennsylvania State University Press, University Park and London. 462 pp.
Brown, A.H.D. 1979. Enzyme polymorphism in plant populations. Theoretical Population Biology 15: 1-42.
Chen, R.S., Boeger, J.M. and McDonald, B.A. 1994. Genetic stability in a population of a plant pathogenic fungus over time. Molecular Ecology 3: 209-218.
Cheung, W.Y., Hubert, N. and Landry, B.S. 1993. A simple and rapid DNA microextraction method for plant, animal, and insect suitable for RAPD and other PCR analyses. PCR Methods and Applications 3: 69-70.
Fry, W.E., Goodwin, S.B., Dyer, A.T., Matuszak, J.M., Drenth, A., Tooley, P.W., Sujkowski, L.S., Koh, Y.J., Cohen, B.A., Spielman, L.J., Deahl, K.L., Inglis, D.A. and Sandlan, K.P. 1993. Historical and recent migrations of Phytophthora infestans: chronology, pathways, and implications. Plant Disease 77: 653-661.
Goodwin, S.B., Allard, R.W., Hardy, S.A. and Webster, R.K. 1992a. Hierarchical structure of pathogenic variation among Rhynchosporium secalis populations in Idaho and Oregon. Canadian Journal of Botany 70: 810-817.
Goodwin, S.B. 1997. The population genetics of Phytophthora. Phytopathology 87: 463-473.
Groth, J.V. and Roelfs, A.P. 1986. The analysis of genetic variation in populations of rust fungi. In: Leonard, K.J. and Fry, W.E. (eds.). Plant disease epidemiology. Volume 2. Genetics, Resistance and Management.
Hardy, G.H. 1908. Mendelian proportions in a mixed population. Science 28: 49-50.
Hartl, D.L. and Clark, A.G. 1989. Principles of population genetics. Sinauer Associates Inc., Sunderland, Mass.
Hudson, R.R., Boos, D.D. and Kaplan, N.L. 1992. A statistical test for detecting geographic subdivision. Molecular Biology and Evolution 9: 138-151.
Lewontin, R.C. 1972. The apportionment of human diversity. Evolutionary Biology 6: 381-398.
McDonald, B.A. 1997. The population genetics of fungi: tools and techniques. Phytopathology 87: 448-453.
McDonald, B.A., Miles, J., Nelson, L.R. and Pettway, R.E. 1994. Genetic variability in nuclear DNA in field populations of Stagonospora nodorum. Phytopathology 84: 250-255.
Milgroom, M.G. and Lipari, S.E. 1995. Population differentiation in the chestnut blight fungus, Cryphonectria parasitica, in eastern North America. Phytopathology 85: 155-160.
Nei, M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the National Academy of Sciences USA 70: 3321-3323.
Nei, M. 1978. Molecular evolutionary genetics. Columbia Press. USA.
Nei, M. and Li, W.K. 1979. Mathematical model for studying genetic variation in terms of restriction endonucleases. Proceedings of the National Academy of Sciences USA 76: 5269-5273.
Peever, T.L. and Milgroom, M.G. 1994. Genetic structure of Pyrenophora teres populations determined with Random Amplified Polymorphic DNA markers. Canadian Journal of Botany 72: 915-923.
Sheldon, A.L. 1969. Equitability indices: dependence on the species count. Ecology 50: 466-467.
Slatkin, M. 1993. Isolation by distance in equilibrium and non-equilibrium populations. Evolution 47: 264-279.
Stoddart, J.A. and Taylor, J.F. 1988. Genotypic diversity: estimation and prediction in samples. Genetics 118: 705-711.
Workman, P.L. and Niswander, J.D. 1970. Population studies on southwestern Indian tribes II. Local genetic differentiation in the Papago. American Journal of Human Genetics 22: 24-29.
Zhang, Q., Webster, R.K. and Allard, R.W. 1987. Geographical distribution and associations between resistance to four races of Rhynchosporium secalis. Phytopathology 77: 352-357.
