Practical Guide to Population Genetics
André Drenth
The University of Queensland 4072
Australia
Version 1.0
A. Drenth Practical Guide to Population Genetics
1 General Introduction
Population genetics is by no means a new scientific discipline. Most of the important theorems
were worked out in the first part of the 20th century. For a long time there has been a gap between
theoretical advance and experimental research. With the development of neutral markers such as
isozymes in the 1960’s and molecular markers in the 1980’s the experimental research caught up
with the theoretical advance. However, due to the abstract nature of population genetics and the
overuse of mathematical language by population geneticists, the discipline has suffered and is
generally not taken up by many students in biology, who tend to shun mathematics
and statistics. The challenge to students in population genetics is to bring the biology of the
organism and the mathematics together in an effort to address important biological questions.
In Mycology and Plant Pathology the population biology of the organisms under investigation
has often been ignored. There are several reasons for this: first, the lack of
phenotypic characters showing variation in the population; second, the lack of neutral genetic
markers; and third, the lack of insight into how useful population genetics can be if one considers that
diseases are caused by populations of pathogens and not by individuals. Plant pathologists have
long been aware of the variation in phenotypic characters such as virulence in fungal
populations. However, no systematic attempts were made to study this genetic diversity in detail
and unfortunately natural populations of fungi are seldom studied at all.
With the advent of molecular markers in the 1980's and the realisation that fungal pathogen
populations are more variable than was initially thought, a significant increase in the number of
research papers in this area has been published. However, since mycologists, plant pathologists,
and molecular biologists are typically not well trained in genetics and population genetics, the
advances in this field have been somewhat disappointing due to a lack of understanding of the
underlying principles. The science of population genetics is ignored and a race for the latest
molecular marker systems has erupted, giving rise to method-oriented instead of problem-oriented
publications. Experiments are conducted without clearly defining the biological questions and
without the experimental designs and sampling strategies that allow statistical testing of hypotheses.
Hence the need for this practical guide, which outlines some of the underlying genetic issues that
are particularly relevant to studying the population genetics of fungi. I have opted for a simple
and straightforward style and give ample numerical examples the student may use to master the
computations. Theoretical background is provided where needed to give students
reasons for using a particular test or diversity measure. Armed with this practical guide it is
my hope that the student is on the way to rigorously testing important hypotheses concerning
population biology of fungal plant pathogens.
André Drenth,
Brisbane, January 1998
© Copyright: No part of this publication may be reproduced, stored in a retrieval system, transmitted, in any form or
any means, electronic, mechanical, photocopying, recording, or otherwise, without the prior written permission of the
author.
Contents
1 Introduction to the workshop
2 DNA and genetic variation
    The structure of DNA
    Basis of genetic variation
    Measuring genetic variation
    Molecular markers in Plant Pathology
3 Population genetic research questions in Plant Pathology
4 Population genetic tools
5 How to get started
6 Population genetic theory
    6.1 Individuals and Populations
    6.2 Forces on populations
    6.3 Alleles versus genotypes
    6.4 Calculating allele frequencies
    6.5 Hardy Weinberg equilibrium
    6.6 Genetic diversity and evolution
    6.7 Measuring and quantifying genetic diversity
        1 Polymorphic loci
        2 Heterozygosity
        3 Gene diversity
        4 Genotypic diversity
    6.8 Linkage disequilibrium
    6.9 Population differentiation
    6.10 Partitioning of genetic diversity
    6.11 Fixation index
    6.12 Genetic distance
    6.13 Similarity and dissimilarity indices
    6.14 Suitability of markers for population genetics
        1 Isozymes
        2 RAPD
        3 How to obtain a set of neutral RFLP markers
7 Literature cited
2 DNA and Genetic Variation
THE STRUCTURE OF DNA
We are all familiar with heritable attributes, at least in a general sense. We speak of a child
looking "just like its father", we talk of brown eyes and characteristic features "running in the
family". These heritable attributes are a part of our genetic make up; a blueprint that provides the
plan for our development.
Most eucaryotic organisms are composed of millions of different cells. Regardless of its size and
function each cell contains a defined structure called a nucleus. Within the nucleus is an identical
copy of the individual's genetic material. This genetic material has a complete set of instructions
that programs the life processes of that cell.
The genetic material inside each nucleus is organised into chromosomes. Chromosomes are not
easily distinguishable in the nucleus of a normal, active cell. At the time of cell division however,
the chromosomes condense and can be seen using a light microscope or an electron microscope.
Diploid organisms contain their chromosomes in pairs of homologous chromosomes. As
organisms grow and cells divide, the chromosomes are duplicated (mitosis) and transferred to
new cells. Chromosomes are transmitted between different generations, through sexual
reproduction. For this purpose, special cells, called germ cells undergo reduction division
(meiosis) leading to haploid cells which contain only one chromosome of each homologous pair.
The core of each chromosome, the material of heredity itself, is DNA (deoxyribonucleic acid).
The physical structure of DNA is simple, yet effective. A single strand of DNA is a chain built from
four kinds of nucleotide. Each nucleotide is made up of three parts: a phosphate group, a sugar known
as deoxyribose and one of four nitrogen-containing bases. The four bases are adenine (A),
cytosine (C), guanine (G), and thymine (T).
DNA consists of two single strands of nucleotides bound together in a double helix to form
double-stranded DNA. The two strands run in opposite directions (they are antiparallel), and a
"T" in one strand is always paired with an "A" in the other strand and similarly a "C" is always
paired with a "G". Hence, the two strands of DNA complement each other. This complementary
base pairing makes possible the mechanism by which DNA replicates itself each time a cell
divides. The complementary nature of DNA is also fundamental to its role as the genetic
material which stores information and can replicate it. When a cell divides, its double-stranded
DNA is unwound so that each strand serves as a template for synthesis of a second
complementary strand by the enzyme DNA polymerase.
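The pairing rules can be written down in a few lines of code. The following is a minimal sketch in Python; the function names and the example sequence are illustrative only and are not part of any particular software package.

```python
# Watson-Crick pairing rules: A pairs with T, C pairs with G.
PAIRING = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand: str) -> str:
    """Return the base-by-base partner of a DNA strand."""
    return "".join(PAIRING[base] for base in strand.upper())


def reverse_complement(strand: str) -> str:
    """Return the partner strand read in its own direction.

    Because the two strands are antiparallel, the partner strand is
    conventionally written in reverse order.
    """
    return complement(strand)[::-1]


template = "CAGGTTCGTAATGC"      # arbitrary example sequence
partner = complement(template)   # what a polymerase would pair against the template
print(template)
print(partner)
print(reverse_complement(template))
```

Taking the complement twice returns the original strand, which is the property that allows each strand to serve as a template for the other.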
Each chromosome contains a continuous molecule of double-stranded DNA that is packaged
tightly, coiled and supercoiled with other components of the cell (proteins such as histones, and
ribonucleic acid (RNA)), allowing its enormous length to be compressed into the nucleus. The
DNA in the 46 chromosomes of each human cell would total about two metres if fully extended,
and the entire amount of DNA in an adult human body when fully extended would reach from
the earth to the sun and back 25 times.
Each strand of DNA in every chromosome consists of a linear sequence of the four bases.
Since this sequence of nucleotides is the sole distinguishing feature of a stretch of DNA,
the essential information of any segment can simply be represented by writing its sequence
of bases (e.g. CAGGTTCGTAATGC). We usually refer to this linear sequence of bases as the
DNA sequence.
Although the DNA sequence is continuous, the information it encodes is not. It is organised in
discrete locations that we refer to as genes. A gene is a particular sequence of nucleotides that is
transcribed into RNA, which in turn is translated into a chain of amino acids that forms the basis of
proteins and enzymes. The genetic code is the relationship between the sequence of bases in
DNA and the sequence of amino acids in proteins. A group of three bases, called a codon, codes
for one amino acid. Since there are 64 possible base triplets and only twenty amino acids, the
genetic code is degenerate: for most amino acids there is more than one codon. Hence, changes
in DNA sequence do not always affect the amino acid sequence they encode.
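The count of 64 triplets and the idea of degeneracy can be checked with a short sketch. The codon table below is only a small excerpt of the standard genetic code, written as DNA triplets, and the helper name `synonymous` is an assumption made for this illustration.

```python
from itertools import product

BASES = "ACGT"

# Every possible triplet of the four bases: 4 * 4 * 4 = 64 codons.
all_codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Excerpt of the standard genetic code (DNA coding-strand triplets).
# Only a few amino acids are listed; the full table has 64 entries.
PARTIAL_CODE = {
    "GCT": "Ala", "GCC": "Ala", "GCA": "Ala", "GCG": "Ala",
    "GAT": "Asp", "GAC": "Asp",
    "GAA": "Glu", "GAG": "Glu",
    "ATG": "Met",
    "TGG": "Trp",
}


def synonymous(codon_a: str, codon_b: str) -> bool:
    """True if both codons specify the same amino acid in the excerpt."""
    return PARTIAL_CODE[codon_a] == PARTIAL_CODE[codon_b]


print(len(all_codons))            # number of possible triplets
print(synonymous("GCT", "GCC"))   # third-base change, still alanine
print(synonymous("GAT", "GAA"))   # third-base change, Asp becomes Glu
```

In this excerpt alanine has four codons while methionine and tryptophan have one each, which is why some base changes leave the protein unaltered and others do not.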
Chromosomes contain many genes, which are the discrete units of inheritance; the hereditary
particles that Mendel first described in 1866. Each chromosome of a homologous pair contains
the same gene in the same location. The location on the chromosome where any particular gene
is found is called a locus. Thus a locus is a defined region of DNA base sequence. Homologous
chromosomes have the same genes in the same order but differences between the genes exist.
Most genes exist in a number of different forms that we refer to as alleles (Fig. 1), and
different alleles may give rise to variants of the trait concerned. Logically, an allele must be due
to the presence of a different nucleotide
sequence at the locus. Most alleles, however, are minor variants that have little or no effect on
the normal function of a gene. In the next section we will discuss the processes involved in
generating and maintaining genetic variation.
[Figure: homologous chromosome pair; labels: Locus, Allele 1, Allele 2]
Figure 1. Homologous chromosomes with loci in the same location but different alleles at
each locus
BASIS OF GENETIC VARIATION
Population genetic studies on many organisms have revealed that natural populations of sexual
species are genetically variable. Now that we are familiar with the structure of DNA we should
look at the processes involved in generating and maintaining genetic variation in this DNA
sequence. The processes involved are: (i) mutation, (ii) mating system, (iii) migration and gene
flow, (iv) genetic drift, and (v) selection.
(i) Mutation
Genetic variation is created by changes in the genetic material and mutation forms the basis of all
genetic variation. A mutation can be defined as any change in the base sequence of the DNA in
the genome. Mutations are typically lethal if essential genes are affected. Different forms of
mutation may occur, including:
(a) Base substitution; the replacement of one nucleotide by another. If a base substitution
occurs in a codon within a protein coding region, an amino acid change in the primary
structure of the protein may result. Sometimes a base-pair change does not alter the amino acid
specified by the codon, and a change in the protein structure does not occur. If a nucleotide
substitution causes no change in the protein product of the gene, it is known as a silent
mutation.
(b) Insertion or deletion of a nucleotide. Such mutations cause a frame shift in the
process of translation and usually result in non-functional gene products.
(c) Inversion of a section of DNA. Even major rearrangements of this type may be harmless
as long as no genetic material is lost and no important genes are disrupted at the
breakpoints of the inversion.
(d) Duplication or deletion of a section of the DNA.
(e) Translocation; a rearrangement of genetic material resulting from an exchange of
material between non-homologous chromosomes. Non-homologous chromosomes do
not normally pair with each other.
(f) Gene conversion. Mutations due to gene conversion stem from misalignment of DNA.
This is especially important in the evolution of tandemly repeated clusters of related
genes or multigene families. Gene conversion is often associated with meiotic
recombination in which the mismatch repair system of the cell converts one allele to the
other.
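The difference between a base substitution (a) and a single-nucleotide insertion (b) is easy to see when a coding sequence is split into codons. The sketch below uses a made-up 15-base sequence; the function names are illustrative and the reading frame is assumed to start at the first base.

```python
def codons(seq: str) -> list:
    """Split a coding sequence into complete triplets, reading from the first base."""
    usable = len(seq) - len(seq) % 3   # ignore an incomplete trailing triplet
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def changed_codons(original: str, mutant: str) -> list:
    """Indices of the codon positions at which the two sequences differ."""
    pairs = zip(codons(original), codons(mutant))
    return [i for i, (a, b) in enumerate(pairs) if a != b]


original = "ATGGCTGATTGGGCA"        # ATG GCT GAT TGG GCA
substituted = "ATGGCCGATTGGGCA"     # one base replaced inside the second codon
inserted = "ATGGACTGATTGGGCA"       # one base inserted inside the second codon

print(codons(original))
print(codons(substituted), changed_codons(original, substituted))
print(codons(inserted), changed_codons(original, inserted))
```

The substitution alters a single codon, whereas the insertion shifts the reading frame so that every codon downstream of the insertion point is read differently.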
These kinds of mutation occur at different rates and are differently affected by mutagenic agents.
We have to realise that there is no constraint at the molecular level of DNA on what mutations
can occur. Constraints on genetic variation arise from physiology and development of an
individual and not from the mutational process itself. Mutations occur at random and can
either increase or decrease the fitness of an individual. Fitness can be defined as the ability to
survive and reproduce in a particular environment. Many mutations take place in parts of the
genome which do not encode genes. These mutations are typically neutral. The ultimate source of
genetic variation is gene mutation and it takes place continuously in a population. However,
mutation is such a rare event (in the order of 10^-6 per gene per generation) that on its own it
would change the genetic constitution of a population so slowly as to be almost negligible. The
following mechanisms enhance and amplify the effects of mutation.
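How slow "almost negligible" is can be illustrated with a small calculation. The sketch below assumes the simplest possible model, which is not spelled out in the text: allele A mutates to allele a at rate u per generation, there is no back mutation, and drift and selection are ignored, so that the frequency of A after t generations is p0 * (1 - u)^t.

```python
import math


def freq_after_mutation(p0: float, u: float, generations: int) -> float:
    """Frequency of allele A after one-way mutation A -> a at rate u."""
    return p0 * (1.0 - u) ** generations


def generations_to_reach(p0: float, p_target: float, u: float) -> float:
    """Generations needed for the frequency of A to fall from p0 to p_target."""
    return math.log(p_target / p0) / math.log(1.0 - u)


u = 1e-6  # mutation rate per gene per generation, as quoted above
print(freq_after_mutation(1.0, u, 1000))   # frequency of A after 1000 generations
print(generations_to_reach(1.0, 0.5, u))   # generations needed to halve the frequency
```

Under these assumptions it takes several hundred thousand generations for mutation alone to halve the frequency of an allele.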
(ii) Mating system
In nature, gene mutations provide different forms of a gene, and these are spread throughout the
population by sexual reproduction, which entails independent assortment and recombination
through crossing over. This makes possible different combinations of newly arisen alleles with
each other and with those already established in the gene pool; as a result, the effect of gene
mutation is amplified.
Different forms of mating exist in nature. Micro-organisms can either outbreed, inbreed, or
reproduce asexually. Typically, fungal populations have mixed mating systems in which they can
reproduce asexually within the season and sexually between seasons. In a random mating
outbreeding population, the loci are randomly assorted in each generation. This leads to many
combinations of alleles in the progeny. Hence, each individual will have a unique genetic make
up. Mating does not create alleles but it combines already existing alleles into new combinations
leading to higher levels of genetic variation. In contrast, individuals in asexually reproducing
populations that descend from the same parent have an identical genetic make up, apart from new
mutations.
(iii) Migration and gene flow
Populations of most species exhibit at least some degree of genetic differentiation between
geographic locations. Migration of individuals from one population to another will lead to a
reduction of differences between these populations. It is easy to see that emigration only has a
minor effect whereas immigration can have large effects by introducing new alleles into the
population. Thus the genetic structure of populations can change as a result of immigration or
gene flow.
(iv) Genetic drift
In small populations, allele frequencies can change by chance from one generation to the next and
particular alleles may be lost. This leads to changes in the population genetic structure over
time, and these changes occur independently at different locations. Because genetic drift is
random, small populations which are isolated from each other will come to differ from one
another. This will typically happen in pathogen populations, which can have extremely low
effective population sizes in the absence of their host plant.
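The effect of population size on drift can be explored with a small simulation. The sketch below assumes the simple Wright-Fisher sampling scheme, in which every gene copy of the next generation is drawn at random from the current gene pool; the function name, population sizes and seed are arbitrary choices made for illustration.

```python
import random


def wright_fisher(p0: float, n_gene_copies: int, generations: int, rng: random.Random) -> list:
    """Simulate the frequency of one allele under random sampling alone."""
    freqs = [p0]
    p = p0
    for _ in range(generations):
        # Each gene copy of the next generation carries the allele with probability p.
        count = sum(1 for _ in range(n_gene_copies) if rng.random() < p)
        p = count / n_gene_copies
        freqs.append(p)
    return freqs


rng = random.Random(1)  # fixed seed so that the run can be repeated
small = wright_fisher(0.5, 20, 50, rng)     # 20 gene copies per generation
large = wright_fisher(0.5, 2000, 50, rng)   # 2000 gene copies per generation

print([round(f, 2) for f in small[::10]])
print([round(f, 2) for f in large[::10]])
```

Running the simulation repeatedly with different seeds shows the pattern described above: frequencies wander much further in the small population, and once an allele reaches a frequency of 0 or 1 it stays there because no mutation or migration is included.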
(v) Selection
Natural selection changes the gene pool by giving a reproductive advantage to those individuals
with favoured combinations of alleles which, in certain environments, lead to a greater fitness.
Because of natural selection, i.e. the process by which genotypes with greater fitness will leave,
on average, more offspring than less fit genotypes, favourable alleles promoting higher fitness
will be over-represented in succeeding generations. As a result, the types and frequencies of
alleles in the population gradually change so as to promote greater adaptation to the environment.
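A minimal numerical sketch of this process is given below. It assumes a haploid organism with two alleles, A and a, with relative fitnesses w_A and w_a, no mutation or drift, and discrete generations; under these assumptions the frequency of A in the next generation is p' = p * w_A / (p * w_A + (1 - p) * w_a). The fitness values are invented for illustration.

```python
def next_frequency(p: float, w_A: float, w_a: float) -> float:
    """Frequency of allele A after one generation of haploid selection."""
    mean_fitness = p * w_A + (1.0 - p) * w_a
    return p * w_A / mean_fitness


def trajectory(p0: float, w_A: float, w_a: float, generations: int) -> list:
    """Allele frequency of A over a number of generations."""
    freqs = [p0]
    for _ in range(generations):
        freqs.append(next_frequency(freqs[-1], w_A, w_a))
    return freqs


# A rare allele (1%) with a 10% fitness advantage.
path = trajectory(0.01, 1.1, 1.0, 100)
print([round(f, 3) for f in path[::20]])
```

Even a modest fitness advantage moves a rare allele to high frequency within a hundred generations in this model, which is far faster than the change expected from mutation alone.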
MEASURING GENETIC VARIATION
Two different types of genetic markers are used in population genetics: genotypic and
phenotypic markers. Genotypic markers such as isozymes and Restriction Fragment Length
Polymorphisms (RFLPs) identify a number of alleles at a designated locus. This allows the
analysis of populations using only a relatively low number of individuals because the allele
frequency forms the basis of analysis. Allele frequencies can be used to test for random mating
or analyse gene diversity, gene flow, population substructuring etc. Isozymes are quick and
cheap to detect but their number is limited. RFLPs are more numerous but are time consuming to
perform.
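As an illustration of how genotypic markers lead to allele frequencies, the sketch below counts alleles at a single locus with two alleles in a diploid organism. The genotype counts are invented and the function name is an assumption made for this example.

```python
def allele_frequencies(n_AA: int, n_Aa: int, n_aa: int) -> tuple:
    """Allele frequencies (p for A, q for a) from diploid genotype counts.

    Each AA individual carries two copies of A, each Aa individual one copy
    of A and one of a, and each aa individual two copies of a.
    """
    total_gene_copies = 2 * (n_AA + n_Aa + n_aa)
    p = (2 * n_AA + n_Aa) / total_gene_copies
    q = (2 * n_aa + n_Aa) / total_gene_copies
    return p, q


# Hypothetical sample of 100 isolates scored at one codominant locus.
p, q = allele_frequencies(n_AA=30, n_Aa=50, n_aa=20)
print(p, q)
```

This direct count is only possible when heterozygotes can be told apart from homozygotes, which is why the distinction between genotypic and phenotypic markers matters.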
Phenotypic markers involve morphological characters and molecular markers such as Random
Amplified Polymorphic DNA (RAPD) and DNA Amplification Fingerprinting (DAF). RAPD
and DAF technologies are not as powerful as RFLPs in resolving population genetic structures but
are typically used to estimate the fraction of clonal individuals in a population and to measure the
number and frequency of the different phenotypes present, which enables measurement of phenotypic
diversity. In addition, the spread and occurrence of particular phenotypes can be followed over
time. The disadvantage of these types of markers is their low power to infer population genetic
structure. However, their big advantage is their speed and simplicity, with the potential to sample
large numbers of individuals. DNA fingerprinting and DNA profiling techniques are used in
many disciplines of science. In the next section the genetic basis underlying RFLPs, RAPDs
and DAFs will be discussed.
RFLP
The basis of RFLP analysis lies in the fact that restriction enzymes cut the DNA duplex at
specific base sequences. Restriction enzymes find a particular sequence of six bases (e.g. EcoRI
recognises GAATTC) in duplex DNA, and the enzymes cut the DNA only at this sequence. A
particular sequence of six base pairs occurs on average about once every 4,000 bases (4^6 = 4,096).
In examining a locus in a number of individuals we might find that in some individuals the DNA
duplex surrounding the locus is not cut at the normal sites by EcoRI. In these individuals the normal
recognition sequence is no longer recognised by EcoRI, therefore at least one of the six bases
comprising the recognition site of EcoRI must be different. Hence, a different allele, which is not
always apparent by obvious criteria since the base change may not have affected gene function, can
nevertheless be identified by a restriction enzyme. To determine whether a restriction enzyme
cuts at a particular site we measure the length of the DNA fragments generated by the restriction
enzyme. Because restriction enzymes generate thousands of restriction fragments, one of these
DNA fragments is used as a probe which, owing to the complementary double-stranded nature of DNA,
hybridises to its complement on a DNA binding membrane (i.e. a Southern blot) and so specifically
recognises its complementary DNA sequence among many. The allele is identifiable as a
Restriction Fragment Length Polymorphism (RFLP). RFLPs are codominant markers (all
alleles at a locus can be identified in a diploid organism) and the ability to screen RFLPs has
added extraordinary power to our analyses of gene structure and function.
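The principle can be mimicked on a computer. The sketch below cuts two made-up sequences wherever the EcoRI recognition sequence GAATTC occurs and reports the fragment lengths; the sequences, the function name and the handling of only one strand are simplifications made for illustration.

```python
SITE = "GAATTC"   # EcoRI recognition sequence
CUT_OFFSET = 1    # EcoRI cuts between the G and the first A


def fragment_lengths(seq: str, site: str = SITE, cut_offset: int = CUT_OFFSET) -> list:
    """Lengths of the fragments produced by cutting seq at every site."""
    cuts = []
    start = seq.find(site)
    while start != -1:
        cuts.append(start + cut_offset)
        start = seq.find(site, start + 1)
    edges = [0] + cuts + [len(seq)]
    return [right - left for left, right in zip(edges, edges[1:])]


# Invented 44-base "locus" carrying two EcoRI sites.
allele_1 = "ATCG" * 3 + SITE + "TTAGC" * 2 + SITE + "CCGTA" * 2
# Same locus with one base changed in the second site (GAATTC -> GAGTTC).
allele_2 = allele_1[:30] + "G" + allele_1[31:]

print(4 ** 6)                       # number of possible six-base sequences
print(fragment_lengths(allele_1))   # both sites are cut
print(fragment_lengths(allele_2))   # the altered site is no longer cut
```

A single base change removes one cut site, so two short fragments are replaced by one longer fragment. This difference in fragment length is what is scored as an RFLP.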
RAPD and DAF
RAPD and DAF technologies are both based on the Polymerase Chain Reaction (PCR). The
PCR technique involves three steps: 1) denaturation of the double-stranded DNA by heating, 2)
annealing of primers to sites flanking the region to be amplified and 3) primer extension, in
which strands complementary to the region between the flanking primers are synthesised using a
thermostable DNA polymerase (e.g. Taq polymerase). The double-stranded products are cycled
repeatedly through steps 1-2-3. In each round of denaturation-annealing-extension, the target
sequence is roughly doubled in the reaction mixture. After 20 or more cycles a target sequence
can be amplified more than a million fold. The PCR technique is extremely powerful in that only
small amounts of starting material are necessary for an assay. The primers used to initiate the
PCR process are short oligonucleotides (typically 20-30 nucleotides in length) that specifically
amplify the DNA sequence from a particular locus. Instead of using specific sequences to target
certain loci, arbitrary primers can be used to amplify at random a number of anonymous genomic
sequences, which can then be size fractionated on a gel to provide an individual-specific DNA
fingerprint. This forms the basis of the RAPD and DAF techniques. Because only the presence or
absence of specific amplified fragments can be scored, individual alleles cannot be distinguished,
which makes these markers dominant. Hence, allele frequencies cannot be directly calculated,
which is a major disadvantage of these techniques compared to RFLP analysis. On the other hand,
only minute amounts of tissue are needed for DNA fingerprinting using arbitrary primer
techniques, and they are quicker than RFLPs. For many questions they are therefore an
extremely rapid alternative to RFLP markers.
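The arithmetic behind the million-fold amplification is simple doubling. The sketch below includes an `efficiency` parameter, which is an assumption added here to allow for less than perfect doubling per cycle; with an efficiency of 1.0 the amount of target doubles in every cycle.

```python
import math


def fold_amplification(cycles: int, efficiency: float = 1.0) -> float:
    """Fold increase of the target after a number of PCR cycles.

    efficiency = 1.0 means perfect doubling in each cycle.
    """
    return (1.0 + efficiency) ** cycles


def cycles_needed(fold: float, efficiency: float = 1.0) -> int:
    """Smallest whole number of cycles giving at least the requested fold increase."""
    return math.ceil(math.log(fold) / math.log(1.0 + efficiency))


print(fold_amplification(20))        # perfect doubling for 20 cycles
print(cycles_needed(1e6))            # cycles needed for a million-fold increase
print(cycles_needed(1e6, 0.8))       # the same with 80% efficiency per cycle
```

With perfect doubling, 20 cycles give 2^20 = 1,048,576 copies per starting molecule, which agrees with the figure quoted above.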
MOLECULAR MARKERS IN PLANT PATHOLOGY
The life span of an individual runs but a short course, during which time the genotype remains
constant. In contrast, the population persists over generations and has a genetic constitution that
continues to vary. The population, not the individual, is the main unit of evolution. Evolution
can be defined as change in the diversity and adaptations of populations of organisms.
Populations can be defined as individuals who share a common gene pool (mate with each other)
in a defined location. Therefore population members share more alleles with individuals
belonging to the group than with those in related populations. Most species are composed of
more than one population, and the number of individuals varies from one population to another.
Natural populations almost always display differences in allele and genotype frequencies
from one geographic region to another. In order to study genetic variation in pathogens in plant
pathology we have to look at the population level.
Because many fungal pathogens have no morphological characters which allow us to identify
individuals in a population, many research questions remain unanswered. Molecular marker
technology in general, and DNA fingerprinting especially, enable the rapid identification of each
individual in a population. This enables the researcher to investigate in greater detail the mode
of reproduction, the mode and extent of spread, the mode of survival, and the origin of and
evolutionary relationships among closely related pathogens.
With the advent of an almost unlimited number of molecular markers available at relatively low
cost it is now possible to deduce the genetic structure of pathogen populations. Questions in plant
pathology which can be addressed using molecular markers include:
(i) Where does the pathogen come from?
(ii) Where do new races come from?
(iii) What is the level of genetic variation in the pathogen population?
(iv) How far does the pathogen spread?
(v) How does the pathogen survive between seasons?
(vi) How important is the sexual cycle of the pathogen?
(vii) Is the pathogen population confined to one plant, one field, one region, one continent?
By using a population genetic approach we can deduce the mode of reproduction, that is, the levels
of inbreeding, outbreeding or asexual reproduction. It is easy to understand that asexual
reproduction will give rise to clones which cannot be distinguished from the parental types.
Continuous selfing will lead to the same situation. However, outbreeding will recombine the
genetic information of both parents and give rise to new individuals which can be easily
distinguished from the parents using DNA fingerprinting.
Population genetic information concerning the mode of reproduction provides insights into the
ability of pathogens to form new pathogenic races and which new races are combinations of
already existing ones. Sexual reproduction in fungi often involves the formation of specific
resting spores which increase the ability of the pathogen to survive between different cropping
seasons.
We will first look in more detail at questions in plant pathology related to population genetics,
before discussing the relevant population genetic theory and background in more detail to tackle
these questions. A few case studies will provide some insights into the way this technology can
be used to answer biological questions of importance in plant pathology. Molecular approaches
carry immense popularity at the present time, but they nonetheless provide only one of many
avenues towards the goal of understanding the biology of organisms.
3 Population genetic research questions in Plant Pathology
Before starting on the theory of population genetics it is a good idea to look at the biological
questions of relevance to plant pathology that we want to address. I have
listed a small collection in a few broad categories. Of course, there are many more questions but
these will provide a starting point.
Population structure
• What is a population? Geographic boundaries
• How much genetic diversity exists in a population?
• How is genetic diversity distributed within a population?
• How is genetic diversity distributed between populations?
• One large panmictic population or many small subpopulations (island model)?
• Continuous population (incomplete isolation by distance)?
Population boundaries
• Is the pathogen population on host plant A hybridising with the population on host plant B?
Geographic differentiation
• Is the pathogen population in field A the same as in field B?
• Relationships between populations from different areas
Host specialisation
• Is the pathogen population on host A the same as on host B?
Migration
• Do migration and gene flow occur between different populations?
• How do migration and genetic drift affect population structure?
• Is a new race or genotype introduced to field A or did it evolve locally? If it was
introduced, where did it come from?
Life-cycle biology (mating system - occurrence and maintenance of genetic diversity)
• How do sexual and asexual reproduction affect population genetic structure?
• Inbreeding
• Outbreeding
• Asexual reproduction
Selection
• How does selection affect population structure?
• Influence of the host plant on the pathogen population
• Deployment of resistance in the host
• Application of fungicides
Disease control strategies
• How do different control strategies affect genetic structure?
Phylogenetic relationships (systematics)
• What is the evolutionary potential?
• What are the evolutionary relationships between closely related pathogen species A and B?
4 Population genetic tools
After we have defined our questions we need to find out how we can apply population genetic
tools to address them. Three tools of population genetics need to be taken
into account:
• Genetic markers
• Sampling strategies
• Data analysis
Each particular research question asks for a particular choice of markers and sampling strategy,
combined with a particular way of analysing the data. After the biological question has been
defined it is important to choose the right tools. The aim of this practical guide is to help you
choose these tools.
5 How to get started
Population genetic studies require careful planning because they are relatively expensive and run
over an extended period of time. Project planning is vital in order to obtain maximum
information from your samples. Always start with very clearly defining the biological questions
being asked. I cannot stress this enough. The questions need to be written down as specifically and
in as much detail as possible. Based on the questions, hypotheses need to be constructed, and
particular attention should be paid to constructing testable hypotheses. After this most important
step you need to work out how to test these hypotheses, which will involve the following steps:
1. Sampling strategy
2. Sample collection
3. Sample analysis using genetic markers
4. Data analysis
One of the aims of the sampling strategy should be to minimise the number of specimens, and
their handling and analysis, while still allowing statistical testing of the hypotheses which
answer your biological question.
The sampling strategy largely depends on the biological questions being asked and the level of
error one agrees to accept. Population genetic data such as allele frequencies often follow a
binomial distribution, which allows us to estimate the variance (s^2) from the mean according to
s_p^2 = p(1 - p)/n
where p is the allele frequency and n is the sample size. This approach is appropriate for loci in
diploid populations when the locus is in Hardy-Weinberg equilibrium.
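The formula can be turned around to ask how large a sample is needed for a given precision. The sketch below is illustrative only: the function names are invented, and n is treated simply as the sample size used in the formula above. Whether n should count isolates or sampled gene copies (two per diploid individual) depends on the organism and the marker, and is left as an assumption for the user to decide.

```python
import math


def allele_freq_variance(p: float, n: int) -> float:
    """Binomial variance of an allele frequency estimate: p(1 - p)/n."""
    return p * (1.0 - p) / n


def standard_error(p: float, n: int) -> float:
    """Standard error of the allele frequency estimate."""
    return math.sqrt(allele_freq_variance(p, n))


def sample_size_for_se(p: float, target_se: float) -> int:
    """Smallest n for which the standard error is at most target_se."""
    return math.ceil(p * (1.0 - p) / target_se ** 2)


print(allele_freq_variance(0.5, 100))   # variance for p = 0.5 and n = 100
print(standard_error(0.5, 100))         # corresponding standard error
print(sample_size_for_se(0.5, 0.03))    # n needed for a standard error of 0.03
```

The variance is largest at p = 0.5, so planning the sample size for an allele frequency of 0.5 gives a conservative figure for loci with more uneven frequencies.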
For most populations of fungal pathogens we do not know the population structure so this would
be a good place to start any population genetic study. The three most fundamental population
structures one is likely to encounter are:
• One single random mating population
• A series of small isolated subpopulations (island or stepping stone model)
• A continuous population where individuals exchange genes with geographically proximate
individuals (isolation by distance model)
See the section on biological questions for more details on other relevant questions. After you
have defined your question it is time for a pilot experiment which should have the following
three major aims.
• Choice of genetic markers
• Determine if the markers are suitable in a practical sense
• Feasibility of large scale sampling program
Samples need to be obtained from a variety of populations to start with. It is best to first obtain
material that is as variable as possible in order to select your markers. Because a large number of
samples need to be analysed in your main study, markers should be easy to score and inexpensive.
Moreover, it is vital that heterozygotes can be easily distinguished from homozygotes and that all
alleles can be easily distinguished from each other. Optimisation of sample handling, storage,
DNA isolation and manipulation, data scoring and the logistics of the project also need to be worked
out.
Sample size and strategy
After the biological questions have been worked out in great detail, your hypotheses are clearly
defined, and you have an appropriate marker it is time to design a proper sampling strategy. The
design of a sampling strategy largely depends on:
• The question being asked
• Biology of the organism: spread, mode of reproduction, ploidy, etc.
• What levels of error one agrees to accept
• Frequency of polymorphic loci in the population
Sampling strategies and the statistics behind it will be discussed in a separate chapter.
6 Population genetic theory
6.1 Individuals and Populations
The most obvious unit of living matter is the individual organism. In unicellular organisms, each
cell is an individual; multicellular organisms consist of many interdependent cells, many of
which die and are replaced by other cells throughout the life of an individual. In evolution, the
relevant unit is not the individual but a population. A population is a community of individuals
linked by bonds of mating and parenthood. In other words, a population is a community of
individuals of the same species. A Mendelian population is a community of interbreeding,
sexually reproducing individuals.
The individuals of a species are not usually homogeneously distributed in space, rather they exist
in more or less well defined clusters, or local populations. The concept of local populations may
seem clear but its application in practice entails difficulties because the boundaries between local
populations are not well defined and often unknown. In addition most organisms are not
homogeneously distributed and migration occurs.
6.2 Forces on populations
Populations are not static over time but fluctuate in size and genetic make-up. A number of
forces act upon populations because food supplies are always limited and predation, migration
and selection occur. The largest changes in populations typically occur when selection leads to
greater fitness in some individuals in the population compared to others. For this to happen
there need to be genetic differences between individuals for these forces to act upon. The most
common forces on populations are listed below.
Mutation
• Source of all genetic diversity
• Spontaneous mutations occur continuously without regard to their immediate need
or usefulness (mutation rate 1 per 10⁶ per generation)
• Selective forces act to increase the frequency of a favoured mutant allele in the population at the expense of its less
favoured allele
Selection
• Some individuals have more offspring than others based on differences in fitness
• Natural selection - not defined (fitness, vitality, fertility)
• Artificial selection - human involvement (breeding resistant plants)
Migration (gene flow)
• Emigration - negative selection (limited influence on population)
• Immigration - influence on population (allele frequency changes)
Drift
• Small populations inbreeding - fluctuations of allele frequencies
• Reduction of heterozygotes - loss of genetic variability
• Random nature - different strains become homozygous for different allelic combinations
so isolated subpopulations become genetically distinct from each other
Mating system
• Influences genotype frequency but not allele frequency
• Random mating
• Assortative mating - breeding to phenotypic similarity (period of flowering)
• Inbreeding - selfing
Fungi
• Asexual - vcg (parasexual recombination)
• Homothallic
• Heterothallic (mating types)
6.3 Alleles versus genotypes
In population genetics the frequency of the allele rather than the frequency of the genotypes is
the basis used to answer most of the biological questions. The reason for this is that allele
frequencies are much higher than genotype frequencies, since there are usually fewer alleles
than genotypes. With two alleles the number of possible genotypes in a diploid organism is
three, with 3 alleles it is 6, and with 4 alleles it is 10. In general, if the number of different alleles is k,
the number of different possible genotypes is k(k+1)/2. Table 6.1 illustrates this point for haploid
and diploid organisms.
Table 6.1 Number of possible genotypes in haploid and diploid organisms (A = number of alleles per locus, L = number of loci).
Ploidy   Loci   Alleles   Possible genotypes        Genotypes
1n       1      5         $A^L$                     5
2n       1      5         $A + \binom{A}{2}$        15
1n       4      5         $A^L$                     625
2n       4      5         $(A + \binom{A}{2})^4$    50625
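The counts in table 6.1 can be reproduced with a short Python sketch. The function names are illustrative only and are not part of the guide; the sketch assumes every locus carries the same number of alleles and that loci combine independently.

```python
from math import comb

def genotypes_per_locus(alleles: int, ploidy: int) -> int:
    """Number of distinct genotypes at a single locus."""
    if ploidy == 1:
        # haploid: each allele is its own genotype
        return alleles
    if ploidy == 2:
        # diploid: homozygotes + heterozygotes = k + C(k, 2) = k(k+1)/2
        return alleles + comb(alleles, 2)
    raise ValueError("only haploid (1) and diploid (2) are handled in this sketch")

def multilocus_genotypes(alleles: int, loci: int, ploidy: int) -> int:
    """Number of multilocus genotypes when every locus has the same allele count."""
    return genotypes_per_locus(alleles, ploidy) ** loci

for ploidy, loci in [(1, 1), (2, 1), (1, 4), (2, 4)]:
    print(f"{ploidy}n, {loci} loci, 5 alleles:", multilocus_genotypes(5, loci, ploidy))
```

With 5 alleles this gives 5, 15, 625 and 50625 genotypes, the values listed in the table.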
From table 6.1 it becomes immediately clear that enormous sample sizes are required to estimate
the frequencies of genotypes, and therefore we can conclude that in sexual systems
frequencies of genotypes are extremely difficult (if not impossible) to determine. The frequency of
alleles is higher than the frequency of genotypes. This is illustrated in table 6.2. It is immediately
clear that the estimate of allele frequencies is much more precise and requires a smaller sample
size. Moreover, frequencies of genotypes can be estimated indirectly from the allele frequencies
by invoking Hardy-Weinberg equilibrium, which is discussed in section 6.5.
Table 6.2 Numerical example based on the Lap-5 gene (leucine aminopeptidase) of
Drosophila willistoni, from isozyme analysis of a population of 500 individuals.
Genotype   Number   Genotype frequency
98/98      2        0.004
100/100    172      0.344
103/103    54       0.108
98/100     38       0.076
98/103     20       0.040
100/103    214      0.428
Total      500      1.000
The allele frequencies for the above example are:
Allele   Frequency
98       0.062
100      0.596
103      0.342
Total    1.000
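The allele frequencies in table 6.2 follow from simple gene counting: each diploid individual contributes two allele copies. A minimal Python sketch, in which the function name and data layout are assumptions made for illustration:

```python
from collections import Counter

def allele_frequencies(genotype_counts):
    """genotype_counts maps (allele1, allele2) -> number of diploid individuals."""
    allele_counts = Counter()
    for (a1, a2), n in genotype_counts.items():
        # every individual carries two copies; a homozygote adds both to one allele
        allele_counts[a1] += n
        allele_counts[a2] += n
    total = sum(allele_counts.values())  # equals 2N
    return {allele: count / total for allele, count in allele_counts.items()}

# Genotype counts taken from table 6.2 (Lap-5, N = 500)
lap5 = {
    ("98", "98"): 2, ("100", "100"): 172, ("103", "103"): 54,
    ("98", "100"): 38, ("98", "103"): 20, ("100", "103"): 214,
}
freqs = allele_frequencies(lap5)
for allele in ("98", "100", "103"):
    print(allele, round(freqs[allele], 3))
```

The rounded values match the allele frequency table above.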
6.4 Calculating allele frequencies
How to calculate allele frequencies
The frequency of an allele is the frequency of individuals homozygous for that allele plus half the
frequency of heterozygotes for that allele.
f(A) = p
f(a) = q
p + q = 1
          A (p)      a (q)
A (p)     AA (p²)    Aa (pq)
a (q)     Aa (pq)    aa (q²)
Genotypic frequency: (p + q)² = p² + 2pq + q² = 1
Example: observed genotype frequencies
AA     Aa     aa
0.3    0.6    0.1
p = 0.3 + (0.5 x 0.6) = 0.6
q = 0.1 + (0.5 x 0.6) = 0.4
p + q = 1
Genotype frequencies after random mating (p² + 2pq + q² = 1)
AA     Aa     aa
0.36   0.48   0.16
Co-dominant genes
A population of 200 individuals of a diploid micro-organism has 2 red individuals, 36
orange and 162 white ones. What are the frequencies of the red and white alleles?
Red p = (2 x red + orange)/2N = (2 x 2 + 36)/400 = 0.1
White q = (2 x white + orange)/2N = (2 x 162 + 36)/400 = 0.9
Since p + q = 1, q = 1 - p
Note that if we can identify the heterozygotes in a population we can calculate the allele
frequencies.
Dominant genes
A population of 200 individuals of a diploid organism has 182 red and 18 white individuals.
What are the allele frequencies for the red and white allele?
This cannot be calculated directly, but we can estimate it if we make a number of assumptions.
If we assume that the population is in Hardy-Weinberg equilibrium we can take the square root of the frequency of
the recessive phenotype as our estimator of the frequency of the recessive allele.
White q = √(q²) = √(18/200) = √0.09 = 0.3
Red p = 1 - q = 0.7
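The two worked examples above (co-dominant and dominant markers) can be checked with the following Python sketch. The function names are hypothetical, and the dominant-marker estimate is only meaningful under the Hardy-Weinberg assumption stated in the text.

```python
from math import sqrt

def codominant_allele_freq(n_hom, n_het, n_total):
    """Gene counting: (2 x homozygotes + heterozygotes) / 2N."""
    return (2 * n_hom + n_het) / (2 * n_total)

def recessive_allele_freq(n_recessive, n_total):
    """Estimate q as the square root of the recessive phenotype frequency.

    Assumes Hardy-Weinberg equilibrium; heterozygotes are not observed.
    """
    return sqrt(n_recessive / n_total)

# Co-dominant example: 2 red, 36 orange, 162 white (N = 200)
p_red = codominant_allele_freq(2, 36, 200)
q_white = codominant_allele_freq(162, 36, 200)

# Dominant example: 182 red, 18 white (N = 200)
q_dom = recessive_allele_freq(18, 200)
p_dom = 1 - q_dom

print(round(p_red, 3), round(q_white, 3), round(q_dom, 3), round(p_dom, 3))
```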
In cases where dominance is involved the heterozygous class is indistinguishable phenotypically
from the homozygous dominant class. Hence, there is no way of checking the Hardy Weinberg
expectations against observed sample data unless the dominant phenotypes have been genetically
analysed by observation of their progeny from test crosses. Only when co-dominant alleles
are involved can we easily check our observations against the expected equilibrium values
through the chi-square test. Note that this is especially a problem in diploid organisms when
using dominant markers. What is the difference and the problem when working with haploid
fungi? Some dominant markers only recognise one allele while the alternative allele is absence
of a fragment. What is the problem with these markers?
In mammalian systems there is an extra complication namely the occurrence of sex
chromosomes. For example, human males have XY while females have XX. The expression of
dominance and recessive relationships is markedly changed when this happens. In sex influenced
traits the heterozygous genotype usually will produce different phenotypes in the two sexes,
making the dominance and recessive relationships of the alleles appear to reverse themselves.
Since in fungi the sex of an organism is often under the control of a single gene and no sex
chromosomes are known in fungi we will not discuss this matter here any further.
Loci with multiple alleles
Consider three alleles A, a' and a with the dominance hierarchy A > a' > a occurring in the gene
pool of a diploid organism with frequencies p, q, and r. In this case random mating will generate
zygotes with the following frequencies:
(p + q + r)² = p² + 2pq + 2pr + q² + 2qr + r² = 1
Genotype    AA    Aa'    Aa     a'a'   a'a    aa
Frequency   p²    2pq    2pr    q²     2qr    r²
Phenotype   A     A      A      a'     a'     a
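As a sketch of how the expansion above translates into phenotype frequencies under the hierarchy A > a' > a, the Python function below groups the six genotype terms by phenotype. The function name and the allele frequencies in the example are assumptions chosen for illustration and do not come from the guide.

```python
def phenotype_frequencies(p, q, r):
    """Expected phenotype frequencies for alleles A, a', a with A > a' > a.

    Assumes random mating and that p + q + r = 1.
    """
    if abs(p + q + r - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return {
        "A": p * p + 2 * p * q + 2 * p * r,   # AA, Aa', Aa
        "a'": q * q + 2 * q * r,              # a'a', a'a
        "a": r * r,                           # aa
    }

# Illustrative allele frequencies (not from the guide)
pheno = phenotype_frequencies(0.5, 0.3, 0.2)
for name, freq in pheno.items():
    print(name, round(freq, 3))
```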
Precision of allele frequency estimates
The effect of sample size on an estimate's precision is expressed as the sample variance (s²):
$s^2 = \frac{p \times q}{2 \times N}$
where p and q are the allele frequencies in the sample of two alternative alleles and N is the
number of individuals in the sample. For diploid organisms we use 2 x N as this is the number of
occurrences of a specific locus in the sample; for haploids use N. When multiallelic series are
involved, take p as the frequency of one allele and q as the combined frequency of all other
alleles.
Because N is in the denominator of this equation, it is clear that sample variance is inversely
proportional to sample size, i.e. the variance is smaller for a larger sample size. It is also apparent
from the equation that sample variance is dependent on allele frequency. This can be illustrated
by a numerical example. Take a sample of 50 individuals and an allele frequency of 0.5. The
sample variance is 0.5 x 0.5/(2 x 50) = 0.0025. The sample variance for an allele with a frequency
of 0.05 in the sample is 0.05 x 0.95/(2 x 50) = 0.000475, less than one-fifth of that of the first allele.
For a given sample size, the frequencies of very common and very rare alleles are therefore estimated
with a smaller sampling variance than those of alleles with intermediate frequencies, although the
error remains large relative to the frequency of a rare allele.
The square root of the sample variance provides an estimate of the sample standard deviation,
and this in turn can be used to obtain confidence limits for the estimate of allele frequencies.
Confidence limits are values either side of the estimate that delimit the confidence interval, a
range of values within which we can be confident, to a given degree, that the true population
frequency lies. In biological sciences we often use the 95% confidence limits, which delineate a
range of values that we can be 95% confident contains the true frequency and are positioned 1.96
standard deviations either side of the frequency estimate. For the example of the allele above
having an estimated frequency of 0.05, the sample standard deviation is √0.000475 = 0.022. The
95% confidence limits are thus 1.96 x 0.022 = 0.04 either side of 0.05. This means that we can be
95% confident that the true population frequency of the allele lies between 0.01 and 0.09. The
only way to obtain the true allele frequency is to analyse the entire population. Since this is rather
impractical, a decision needs to be made as to what level of confidence is acceptable.
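The variance and confidence limits described above can be sketched in Python as follows. The function name and its parameters are assumptions for illustration; the interval uses the same normal approximation as the text, which is rough for rare alleles and small samples.

```python
from math import sqrt

def allele_freq_ci(p, n_individuals, ploidy=2, z=1.96):
    """Sample variance, standard deviation and approximate confidence limits.

    Uses s^2 = p(1 - p) / (ploidy x N); z = 1.96 gives 95% limits.
    """
    n_copies = ploidy * n_individuals      # 2N for diploids, N for haploids
    variance = p * (1 - p) / n_copies
    sd = sqrt(variance)
    return variance, sd, (p - z * sd, p + z * sd)

# Worked example from the text: 50 diploid individuals
var_common, sd_common, ci_common = allele_freq_ci(0.5, 50)
var_rare, sd_rare, ci_rare = allele_freq_ci(0.05, 50)

print(var_common, var_rare)
print(round(sd_rare, 3), [round(limit, 2) for limit in ci_rare])
```

Passing ploidy=1 gives the haploid case, where the number of gene copies equals the number of individuals.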
6.5 Hardy Weinberg equilibrium
The Hardy-Weinberg law was formulated independently in 1908 by the British mathematician
G.H. Hardy and the German doctor Wilhelm Weinberg. It states that the process of heredity by
itself does not alter the frequency of either alleles or genotypes in a population in which mating
occurs at random. Furthermore, after a single generation of random mating genotype frequencies
reach equilibrium if the allele frequencies are the same in both males and females; thus the
equilibrium state is predictable from a knowledge of allelic frequencies.
A population in which the genotype frequencies are as predicted by the Hardy-Weinberg law is
often referred to as being in Hardy-Weinberg equilibrium. A very important characteristic of Hardy-Weinberg equilibrium is that it is achieved after only a single generation of random mating.
Regardless of what might happen to disturb the state of equilibrium in one generation, it will be
restored in the following generation.
Assume a locus with two alleles, A and a, and that their frequencies are p for A and q for a. If
mating occurs at random then the frequency of a given genotype will simply be the product of the
frequencies of the two corresponding alleles. The probability that an individual of a diploid
species will have the AA genotype is the probability (p) of receiving the A allele from one parent
multiplied by the probability (p) of receiving the A allele from the other parent, or p x p = p².
Similarly, the probability that an individual will have the aa genotype is q². The genotype Aa can
arise in two ways: A from the first parent and a from the second, which will occur with a
frequency of pq, a from the first parent and A from the second, occurring at the same frequency
pq, therefore the total frequency of Aa is 2pq.
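The claim that equilibrium is reached after a single generation of random mating can be illustrated with a small Python sketch. The function name is hypothetical, and the starting genotype frequencies (0.3, 0.6, 0.1) are the ones used in the example in section 6.4.

```python
def random_mating(f_AA, f_Aa, f_aa):
    """Genotype frequencies after one generation of random mating.

    Assumes two alleles, equal allele frequencies in both sexes, and no
    selection, mutation, migration or drift.
    """
    p = f_AA + 0.5 * f_Aa      # frequency of allele A
    q = 1 - p                  # frequency of allele a
    return p * p, 2 * p * q, q * q

generation_1 = random_mating(0.3, 0.6, 0.1)
generation_2 = random_mating(*generation_1)

print([round(f, 2) for f in generation_1])
print([round(f, 2) for f in generation_2])
```

Both generations give the same genotype frequencies, because the allele frequencies (p = 0.6, q = 0.4) do not change under random mating.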
Three general statements concerning the HWE can be made, as well as a number of assumptions:
1. Process of heredity does not alter the frequencies of alleles or genotypes in a population in
which mating occurs at random
2. HWE will always be restored in one generation
3. Equilibrium state can be calculated from the allele frequencies
Assumptions
• No selection
• No mutation
• No gene flow
• Large population size
• Random mating
One application of the Hardy-Weinberg law is that it permits the computation of gene and
genotypic frequencies in cases where not all genotypes can be distinguished, because of
dominance (see example in 6.4). The other is to test if the population is actually random mating
which is exemplified below.
Testing a locus for equilibrium
Do the genotypes in the following population conform to the frequencies expected for a Hardy-Weinberg
population within statistically acceptable limits?
100 individuals with the following genotypes
AA   10
Aa   35
aa   55
Calculate allele frequencies
p = (2 x AA + Aa)/2N = (2 x 10 + 35)/200 = 0.275
q = 1 - p = 0.725
Calculate genotypic frequencies according to the Hardy-Weinberg equation based on the allelic
frequencies
Genotype   HW equilibrium   Expected genotypic frequency   Absolute frequency in population
AA         p²               0.076                          7.6
Aa         2pq              0.399                          39.9
aa         q²               0.526                          52.6
Chi square test
Genotype   Observed   Expected   (o-e)²/e
AA         10         7.6        0.79
Aa         35         39.9       0.60
aa         55         52.6       0.11
Total      100        100        1.50
Degrees of freedom: df = k phenotypes - r alleles = 3 - 2 = 1
χ² is 1.5, which gives a probability (P) of 0.2-0.3. Hence this population does not deviate significantly
from Hardy-Weinberg equilibrium, and the data are consistent with random mating.
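The test above can be reproduced with a short Python sketch. The function name is an assumption; the P value is computed with the identity P = erfc(√(χ²/2)), which holds only for one degree of freedom, as in this two-allele example.

```python
from math import erfc, sqrt

def hwe_chi_square(n_AA, n_Aa, n_aa):
    """Chi-square test of Hardy-Weinberg proportions for two co-dominant alleles."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)          # gene counting
    q = 1 - p
    expected = (p * p * n, 2 * p * q * n, q * q * n)
    observed = (n_AA, n_Aa, n_aa)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = erfc(sqrt(chi2 / 2))           # valid for df = 1 only
    return p, expected, chi2, p_value

p_A, expected, chi2, p_value = hwe_chi_square(10, 35, 55)
print(round(p_A, 3), [round(e, 1) for e in expected])
print(round(chi2, 2), round(p_value, 2))
```

Computed without intermediate rounding, χ² comes out at about 1.49 rather than the 1.50 obtained by summing the rounded terms in the table; the conclusion is unchanged.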
Degrees of freedom
The number of degrees of freedom in a chi-square test of Hardy-Weinberg equilibrium is not simply the
number of phenotypes minus 1 (as in chi-square tests of classical Mendelian ratios). It is
further reduced because the expected Hardy-Weinberg frequencies are themselves generated from
allele frequencies estimated from the same data. Hence, the
number of degrees of freedom is the number of phenotypes minus one (k-1) minus the
number of alleles minus one (r-1), which equals the number of phenotypes minus the
number of alleles (k-r).
6.6 Genetic diversity and evolution
The existence of genetic diversity is a necessary condition for evolution. If all individuals
are homozygous for the same allele at a certain locus, evolution cannot take place at that locus
because the allele frequencies cannot change from generation to generation. The occurrence of
diversity in natural populations was the starting point for Darwin's argument for evolution by a
process of natural selection. Individuals having advantageous variations are more likely to be
successful than others in passing on their genes to their offspring. As a consequence useful
variation will become more prevalent through the generations, while harmful or less useful ones
will be eliminated.
There is a direct correlation between the amount of genetic diversity in a population and the rate
of evolutionary change by natural selection with respect to fitness. This was mathematically
demonstrated by Fisher in his Fundamental Theorem of Natural Selection (1930). In agriculture
we continually try to improve the genetic make-up of our crop species by selecting for
favourable characteristics such as yield and resistance to pests and diseases. However, this resistance
is often not durable and can be overcome by changes in the pathogen population. Since the ability of
a pathogen to adapt to new environments is to a certain degree dependent on the level of genetic
diversity present in the pathogen population, it is of practical importance to have methods to
quantify the genetic diversity present in pathogen populations.
6.7 Measuring and quantifying genetic diversity
It is now known that natural populations of many organisms possess a great deal of genetic
diversity and that genetic diversity is a common phenomenon in nature. Through experiments on
all kinds of organisms involving inbreeding, geneticists have discovered that much more genetic
diversity exists than is apparent when organisms living in nature are observed. Inbreeding
experiments give rise to homozygosity and the expression of recessive genes which otherwise go
undetected in the population. Another source of evidence indicating that genetic diversity
is present in populations came from artificial selection experiments. There are many examples of
this in agriculture, such as the spectacular increases in cereal yields, milk production in cows and
egg production in chickens, which have taken place over most of this century.
If we want to quantify how much genetic diversity there actually is we run into a classical
problem. The traditional methods of classical genetic analysis possess a severe handicap. How
do we measure what proportion of genes are polymorphic in a population? Since we cannot study
every gene in the population we need to look at only a sample of gene loci. Ideally we need a
random sample, truly representative of the whole population, from which values can be
extrapolated to the whole population. Traditional classical genetic analysis becomes extremely
cumbersome here, because to find out if differences between phenotypes are based on different
alleles in the isolates we need to conduct testcrosses between all the different phenotypic classes
to find out if one or more genes are involved. The dominance relationships need to be sorted out
using laborious and time-consuming testcrosses. The other problem of the classical genetic approach
is that the only genes known to exist are those that are variable. Invariant genes cannot be included
in the sample and, hence, it is impossible to obtain an unbiased sample of the genome to
accurately assess genetic diversity.
Discoveries in molecular genetics provided a way out of this dilemma. It was established that the
genetic information encoded in the nucleotide sequence of the DNA of a structural gene is
translated into a sequence of amino acids making up a polypeptide. This allows us to select a
series of proteins without previously knowing whether or not they are variable in a population.
Hence, this allows us to obtain an unbiased sample of all the structural genes in the organism.
With the introduction of gel electrophoresis it has become possible to study protein variation
quickly in large numbers of individuals with only a moderate investment of time and money.
Since the late 1960's estimates of genetic diversity have been obtained for many different natural
populations of all kinds of organisms.
Electrophoretic techniques show what the genotypes of the individuals in the sample are:
• how many are homozygous
• how many are heterozygous
• and for what alleles
In order to obtain a reasonable estimate of the amount of genetic diversity in a population,
at least 15-20 loci need to be studied. After the laboratory work it is desirable to
summarise the information obtained for all the loci in a simple way that would express the
genetic diversity of the population and that would permit comparing one population to another.
In addition, it is vital to use measures of genetic diversity which allow statistical testing of
hypotheses concerning population structure and comparing different populations. In the next
sections we will deal with ways to measure genetic diversity and ways to statistically test
hypotheses concerning population structure.
Measuring genetic diversity
In most population genetic analyses, allele frequencies form the basis to measure genetic
diversity. Allele frequencies are preferred over genotypic frequencies because allele frequencies
remain relatively stable over time and are independent of the mating system in contrast to
genotypic frequencies which are randomised at each generation of mating. See the section on
alleles versus genotypes for more detail on this.
At this point fungal population genetics starts to deviate significantly from the established
theory. This is due to the fact that fungi are different in a number of ways to strictly outbreeding
diploid organisms with clearly identifiable individuals for which most of the population genetics
theory was developed. Special characteristics of fungi relevant to population genetics are:
• Difficulties in identifying the individual
• Some fungi are haploid
• Overlapping generations
• Outbreeding, inbreeding and strictly asexual reproduction can all occur at varying degrees at
the same time
• At different geographic locations and on different host plants different modes of reproduction
can occur
• The occurrence of more than one mating type
• The occurrence of vegetative incompatibility
• Strong host specialisation of some fungal species which can have a large influence on
population structure
The above demonstrates that the population genetics of fungi needs to be approached differently and
more cautiously than if we were dealing with plants or mammals.
Genetic diversity can be measured in a number of ways:
1. Proportion of polymorphic loci
2. Heterozygosity
3. Gene diversity
4. Genotypic diversity
   • Shannon Index
   • Clonal Fraction
   • Genotypic diversity (Nei)
   • Stoddart and Taylor
1 Polymorphic loci
One measure of genetic diversity is the proportion of polymorphic loci, or simply the
polymorphism (P) in a population. If we use a co-dominant marker and we examine 20 loci of a
fungal species and find that 16 loci show no polymorphism but some polymorphisms are present
at the other 4 loci we can say that 4/20 = 0.2 of the loci are polymorphic in that population.
Hence, the degree of polymorphism in the population is 0.2. Polymorphism can be a useful
measure of genetic diversity but it suffers from two important problems: arbitrariness and
imprecision. The number of variable loci observed depends on how many individuals are
examined. If we examine more individuals we might identify more polymorphisms and the
measure tends to increase. To counteract this, a criterion of polymorphism is often used to the
effect that a locus is only considered polymorphic when the most common allele has a frequency
no greater than 0.95. Under this criterion, rare variants that are occasionally identified as more
individuals are examined will not change the proportion of
polymorphic loci. However, the criterion is a rather arbitrary decision. More
importantly, the degree of polymorphism in a population is imprecise because a slightly
polymorphic locus counts as much as a very polymorphic locus containing many different alleles.
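A minimal Python sketch of the 0.95 criterion described above. The function name and the allele frequencies are hypothetical and serve only to mirror the example of 20 loci of which 4 are polymorphic.

```python
def proportion_polymorphic(loci, threshold=0.95):
    """Proportion of loci whose most common allele has frequency <= threshold.

    `loci` is a list of per-locus allele frequency lists.
    """
    polymorphic = [freqs for freqs in loci if max(freqs) <= threshold]
    return len(polymorphic) / len(loci)

# Hypothetical data: 15 monomorphic loci, 1 locus with a rare variant that
# fails the criterion, and 4 loci that count as polymorphic
loci = [[1.0]] * 15 + [[0.98, 0.02]] + [
    [0.6, 0.4], [0.9, 0.1], [0.5, 0.3, 0.2], [0.7, 0.3],
]
P = proportion_polymorphic(loci)
print(P)
```

Note that the locus with a 0.02 variant is counted as monomorphic, and that the two-allele locus at 0.9/0.1 counts exactly as much as the three-allele locus, which illustrates the imprecision discussed above.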
2 Heterozygosity
A better measure of genetic diversity which is not arbitrary and much more precise is the
heterozygosity (H) of the population. Heterozygosity (Hobs) is defined as the average frequency
of heterozygous individuals per locus. Heterozygosity is calculated by first obtaining the
frequency of heterozygous individuals of each locus and then averaging these frequencies over
all loci.
Example Heterozygosity
Locus   Heterozygotes in sample   Total in sample   Heterozygosity (Hobs)
1       40                        100               0.40
2       20                        100               0.20
3       35                        100               0.35
Mean                                                0.32
In order for an estimate of heterozygosity to be valid it must be based on 15-20 or more loci.
Heterozygosity is an estimate of the average proportion of loci within an individual that are in the
heterozygous state. The variance associated with this estimate can be reduced both by increasing
the number of loci examined and by increasing the number of individuals sampled from the
population. Observed heterozygosity (Hobs) is a good measure of genetic diversity because it
estimates the probability that two alleles taken at random from the population are different.
However, the observed heterozygosity does not reflect well the amount of genetic diversity in
populations of organisms that reproduce by self-fertilization (homothallic fungi) or organisms in
which mating between relatives is common. In self-fertilising populations most individuals will
be homozygous even though different individuals may carry different alleles if the locus is
variable in the population. Mating between close relatives has the same effect.
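The heterozygosity example above amounts to a per-locus proportion followed by an average over loci. A small Python sketch, with a function name chosen for illustration:

```python
def observed_heterozygosity(per_locus_counts):
    """Mean over loci of (heterozygotes / individuals scored).

    `per_locus_counts` is a list of (n_heterozygotes, n_scored) pairs.
    """
    per_locus = [het / total for het, total in per_locus_counts]
    return per_locus, sum(per_locus) / len(per_locus)

# Counts from the example table: three loci, 100 individuals each
per_locus, h_obs = observed_heterozygosity([(40, 100), (20, 100), (35, 100)])
print(per_locus, round(h_obs, 2))
```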
3 Gene diversity
In order to overcome the problems with the observed heterozygosity measure we can calculate
the expected heterozygosity (Hexp) of a population. Nei (1973) introduced the concept of gene
diversity to describe genetic variation that is applicable to both sexual and asexual populations.
Gene diversity (Hexp) is defined as the probability of obtaining two different alleles at a locus
when two haploid individuals are sampled from a population.
Nei’s formula for gene diversity:
$H = 1 - \sum_k x_k^2$
where H is gene diversity for a non-random mating population, and xk is the frequency of the
kth allele (Nei, 1973).
A gene diversity of 1 means that the diversity is so high that any two alleles at a locus sampled
from a population are different. At the other extreme, a genetically uniform population (with no
allelic variation at the loci sampled) will have a diversity of 0 since any two individuals
sampled will be identical. In a diploid mating population gene diversity is equivalent to the
proportion of heterozygosity at the locus expected under random mating, so called expected
heterozygosity (Hexp).
Calculation of gene diversity (Hexp)
Gene diversity can be calculated from the allele frequencies under the assumption that the
individuals in the population are mating with each other at random. Applying Nei's (1973)
formula to the following problem gives:
Two loci, 4 alleles each
Locus A                Locus B
Allele   Frequency     Allele   Frequency
A1       0.5           B1       0.2
A2       0.3           B2       0.3
A3       0.1           B3       0.4
A4       0.1           B4       0.1
Total    1.0           Total    1.0
Hexp = 1 - (freq A1² + freq A2² + freq A3² + freq A4²)
Locus A: Hexp = 1 - (0.5² + 0.3² + 0.1² + 0.1²) = 0.64
Locus B: Hexp = 1 - (0.2² + 0.3² + 0.4² + 0.1²) = 0.70
Mean Hexp = (0.64 + 0.70)/2 = 0.67
Note that the expected level of heterozygosity is the same as Nei's gene diversity measure (Nei,
1973; PNAS 70: 3321-3323).
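The calculation above is easily scripted. The Python sketch below uses an illustrative function name and the allele frequencies from the two example loci.

```python
def gene_diversity(allele_freqs):
    """Nei's gene diversity: H = 1 - sum of squared allele frequencies."""
    return 1 - sum(x * x for x in allele_freqs)

locus_a = [0.5, 0.3, 0.1, 0.1]
locus_b = [0.2, 0.3, 0.4, 0.1]

h_a = gene_diversity(locus_a)
h_b = gene_diversity(locus_b)
h_mean = (h_a + h_b) / 2     # average over loci

print(round(h_a, 2), round(h_b, 2), round(h_mean, 2))
```

A locus with a single allele gives a gene diversity of 0, the lower extreme mentioned in the text.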
Differences between expected and observed levels of heterozygosity may be due to the
occurrence of a certain amount of self-fertilization. This difference can be quantified as a so-called fixation index, discussed in section 6.11.
4 Genotypic diversity
One of the more challenging aspects of population genetics of fungal pathogens is the variability
in mode of reproduction. Fungi can reproduce sexually, by outbreeding or inbreeding, or
asexually. These reproductive modes are not mutually exclusive and may occur to varying
degrees at the same time. This may lead to population structures varying from the predominance
of one or a few clones to strictly sexual reproduction, with many combinations in between. This
poses some problems for population genetic approaches and requires well thought-out
experimental designs to answer biological questions. Where high levels of outbreeding occur,
the analysis should involve gene diversity analysis as described before. Where we are dealing
with mixed mating populations containing large numbers of clones, we have to consider using
gene diversity after correcting the population for clones and/or using a measure of genotypic
diversity.
When dominant markers are used for population studies we cannot accurately estimate the allele
frequencies. Often, when we do not know if we are dealing with sexual or asexual fungal
populations, we combine the phenotypic traits (e.g. RAPD fragments) into a multilocus
phenotype, sometimes called a haplotype. Often markers are used which give us a so-called DNA
fingerprint, or single copy RFLP or isozyme marker data from several loci can be combined to give
a so-called multilocus haplotype or multilocus phenotype, depending on the ploidy level of the
fungus under investigation.
Various measures of genotypic diversity exist:
• Shannon Index
• Clonal Fraction
• Genotypic diversity
• Stoddart and Taylor
A Shannon Index
A diversity measure commonly used for phenotypic analysis of pathogen populations is the
Shannon index (SI) (Bowman et al., 1971; Groth and Roelfs, 1987). Genotypic diversity for a
population can be calculated according to:
$SI = -\sum_{i=1}^{k} p_i \ln p_i$
where $p_i$ is the frequency of isolates with the ith phenotype in the population, k is the number
of phenotypes in the population and $\ln p_i$ is the natural log of $p_i$. The Shannon index takes into
account both the number of phenotypes and the evenness of their frequency distribution. When sample sizes
of different populations vary the Shannon index can be converted into a normalised Shannon
diversity index: HS = H/HMAX, in which H is the usual Shannon diversity index over genotypes (SI above),
and HMAX is ln(N), the maximum diversity for a sample of size N. This statistic is relatively
stable when sample sizes vary (Sheldon, 1969). The Shannon index has a useful property in that
it is linearly related to addition of phenotypic characters that are independent of those already
included (Groth and Roelfs, 1989). Because of this simple additive effect, a linear model of the
contribution of characters such as virulence, RFLPs etc. is possible.
An inherent problem in using phenotypic and haplotypic diversity measures as a genetic
diversity measure is that the actual genetic differences between unique multilocus
genotypes are not compared. Multilocus phenotypes differing from each other by one or many
characters are weighted equally.
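The Shannon index and its normalised form can be sketched in Python as below. The function names and the sample of phenotype labels are assumptions made for illustration; the input is simply one label per isolate.

```python
from collections import Counter
from math import log

def shannon_index(phenotypes):
    """SI = -sum(p_i * ln p_i) over the distinct phenotypes in the sample."""
    n = len(phenotypes)
    counts = Counter(phenotypes).values()
    return -sum((c / n) * log(c / n) for c in counts)

def normalised_shannon(phenotypes):
    """HS = SI / ln(N), where N is the sample size (needs N >= 2)."""
    n = len(phenotypes)
    if n < 2:
        raise ValueError("need at least two isolates")
    return shannon_index(phenotypes) / log(n)

# Hypothetical sample: one phenotype found 5 times plus 5 unique phenotypes
sample = ["P1"] * 5 + ["P2", "P3", "P4", "P5", "P6"]
print(round(shannon_index(sample), 3), round(normalised_shannon(sample), 3))
```

A sample in which every isolate is different gives HS = 1, and a sample consisting of a single phenotype gives SI = 0.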
B Clonal Fraction
Clonal fraction is a simple statistic which indicates the fraction of clones in a population. This
statistic is calculated as
$C_f = \frac{N - C}{N}$
where N is sample size and C is the number of distinct genotypes or clones. If we have a
sample size of 10, with 5 distinct isolates and 1 genotype which occurs 5 times, there are 6 distinct
genotypes and the clonal fraction
is (10-6)/10 = 0.4. A larger sample with the same genetic make-up, for example N = 100 with
50 different genotypes and 1 genotype occurring 50 times, will have a clonal fraction of (100-51)/100 = 0.49. This immediately drives home the message that clonal fraction is dependent upon
sample size. Hence, clonal fractions should not be compared between samples of different
sizes. The use of this statistic is further restricted by the lack of an associated statistical test.
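The two clonal fraction examples can be reproduced with the following Python sketch. The function name and the genotype labels are hypothetical; the input is one genotype label per isolate.

```python
def clonal_fraction(genotypes):
    """Cf = (N - C) / N, with N isolates and C distinct genotypes."""
    n = len(genotypes)
    c = len(set(genotypes))
    return (n - c) / n

# One genotype found 5 times plus 5 unique genotypes (N = 10, C = 6)
small = ["clone"] * 5 + [f"g{i}" for i in range(5)]
# Same make-up at ten times the sample size (N = 100, C = 51)
large = ["clone"] * 50 + [f"g{i}" for i in range(50)]

print(clonal_fraction(small), clonal_fraction(large))
```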
C Genotypic diversity (Nei)
In case different phenotypes are identified using DNA fingerprinting or combining single loci to
construct a multilocus genotype, often called haplotype, genotypic diversity can be calculated
using Nei’s formula by substituting allele frequencies with the genotype frequencies.
$H = 1 - \sum_k x_k^2$
where H is genotypic diversity for a non-random mating population, and xk is the frequency
of the kth genotype.
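As a small illustration of Nei's formula applied to genotypes, the sketch below counts genotype labels in a sample and computes H. The sample (one genotype found five times plus five unique genotypes) and the function name are assumptions made for the example only.

```python
from collections import Counter

def nei_genotypic_diversity(genotypes):
    """H = 1 - sum(x_k^2), where x_k is the frequency of the kth genotype."""
    counts = Counter(genotypes)
    n = len(genotypes)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())

# Hypothetical sample of 10 isolates: genotype g1 five times, five unique genotypes
sample = ["g1"] * 5 + ["g2", "g3", "g4", "g5", "g6"]
print(round(nei_genotypic_diversity(sample), 2))  # 0.7
```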
D
Stoddart and Taylor
When using a measure of genotypic diversity, or any genetic diversity measure for that matter, it
is important that this measure has its statistics worked out so we can use it to test hypotheses.
Stoddart and Taylor (1988) have done just that for genotypic diversity.
Genotypic diversity can be calculated using the formula:
$$\hat{G} = \frac{1}{\sum_{x=0}^{N} f_x \left( \frac{x}{N} \right)^2}$$
where N is the sample size, and fx is the number of genotypes observed x times in the sample
(Stoddart and Taylor, 1988). The maximum possible value for Ĝ is the number of
individuals in the sample, which occurs when each individual in the sample has a
different genotype. To compare Ĝ between collections of different sample sizes, Ĝ can be
divided by N to calculate the percentage of maximum diversity obtained (Ĝ/N) (McDonald
et al., 1994). The significance of differences between the percentages of maximum diversity
(Ĝ/N) for each sub-population can be calculated using a t-test (Chen et al., 1994), given by
the formula:
$$t = \frac{\left| \frac{\hat{G}_1}{N_1} - \frac{\hat{G}_2}{N_2} \right|}{\sqrt{\frac{\mathrm{Var}(\hat{G}_1)}{N_1^2} + \frac{\mathrm{Var}(\hat{G}_2)}{N_2^2}}}$$
where
$$\mathrm{Var}(\hat{G}) = \frac{4}{N} G^2 \left[ G^2 \sum_{i=1}^{K} p_i^3 - 1 \right]$$
G is the population genotypic diversity, K is the number of genotypes in the sample, and pi is
the frequency of the ith genotype in the sample. Ĝ is the maximum likelihood estimator for G
in this formula (Stoddart and Taylor, 1988). The t-test is evaluated at the 5% significance level
(P = 0.05). The number of degrees of freedom is N1 + N2 - 2.
This formula can be used to reflect to what degree asexual reproduction contributes to population
genetic structure in fungal species with both asexual and sexual reproductive stages in their
life cycle.
Example of calculating genotypic diversity
The next example is taken from Chen et al. (1994), a population genetic study of the
haploid fungus Septoria tritici. Two populations were collected, one early in the season and the
other late. The data, obtained using low copy RFLP markers, are summarised in the tables
below.

                   Early season    Late season
Total sample       129             277
Nr. genotypes      120             251

Early season
Number of genotypes (fx)    112    7      1
Times observed (x)          1      2      3

Late season
Number of genotypes (fx)    231    15     4      1
Times observed (x)          1      2      3      4

Genotypic diversity in the early population
Ĝ = 1/{112 x (1/129)² + 7 x (2/129)² + 1 x (3/129)²} = 112
Ĝ/N% = 112/129 = 86.8%
Genotypic diversity in the late population
Ĝ = 1/{231 x (1/277)² + 15 x (2/277)² + 4 x (3/277)² + 1 x (4/277)²} = 224
Ĝ/N% = 224/277 = 80.9%
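The Ĝ calculation for the early and late Septoria tritici samples can be checked with the short sketch below. The function name `stoddart_taylor_G` and the dictionary layout (times observed mapped to number of genotypes) are choices made for this illustration.

```python
def stoddart_taylor_G(freq_table, n):
    """G-hat = 1 / sum(f_x * (x / N)^2).

    freq_table maps x (times a genotype was observed) to f_x (number of such genotypes).
    """
    return 1.0 / sum(f_x * (x / n) ** 2 for x, f_x in freq_table.items())

early = {1: 112, 2: 7, 3: 1}
late = {1: 231, 2: 15, 3: 4, 4: 1}

for label, table in (("early", early), ("late", late)):
    n = sum(x * f_x for x, f_x in table.items())  # sample size: 129 and 277
    g_hat = stoddart_taylor_G(table, n)
    print(label, n, round(g_hat, 1), f"{100 * g_hat / n:.1f}%")
```

Before rounding, Ĝ is about 111.7 for the early and 223.7 for the late sample, so percentages computed from the unrounded values differ slightly in the last digit from those based on 112 and 224.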
From these data it appears that genotypic diversity has decreased from early to late in the
season. Of course we want to know whether or not this is a significant difference. In
order to do that we first need to calculate the variance.
Variance of Ĝ in the early population
Var Ĝ = 4/129 x 112² [112² (112 x (1/120)³ + 7 x (2/120)³ + 1 x (3/120)³) - 1] = 161.6
Since the standard deviation is the square root of the variance, Ĝ = 112 ± 12.7
Note that this is the variance of Ĝ and not of Ĝ corrected for sample size.
Variance of Ĝ in the late population
Var Ĝ = 4/277 x 224² [224² (231 x (1/251)³ + 15 x (2/251)³ + 4 x (3/251)³ + 1 x (4/251)³) - 1] =
477.8
Ĝ = 224 ± 21.9
In order to test whether the early and late populations are different we perform the t-test
t = |112/129 - 224/277| / √(161.6/129² + 477.8/277²) = 0.47
Degrees of freedom: 129 + 277 - 2 = 404
In the t-table this value is not significant at the 5% level (P > 0.05). Hence, the null hypothesis
that the early and late populations do not differ in genotypic diversity cannot be rejected.
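The variance and t calculations of this example can be sketched as follows. The function names are illustrative. Note one assumption made explicit in the code: the worked example divides the genotype counts by the number of distinct genotypes (120 and 251), so the denominator is passed as the argument `denom`; passing the sample size N instead would follow the definition of pi as a frequency in the sample and gives smaller variances.

```python
import math

def variance_G(freq_table, g_hat, n, denom):
    """Var(G-hat) = (4/N) * G^2 * (G^2 * sum(p_i^3) - 1), with p_i = x / denom."""
    sum_p3 = sum(f_x * (x / denom) ** 3 for x, f_x in freq_table.items())
    return (4.0 / n) * g_hat ** 2 * (g_hat ** 2 * sum_p3 - 1.0)

def t_statistic(g1, var1, n1, g2, var2, n2):
    """t for the difference between G1/N1 and G2/N2."""
    diff = abs(g1 / n1 - g2 / n2)
    return diff / math.sqrt(var1 / n1 ** 2 + var2 / n2 ** 2)

early = {1: 112, 2: 7, 3: 1}          # times observed -> number of genotypes
late = {1: 231, 2: 15, 3: 4, 4: 1}

# Denominators 120 and 251 follow the worked example (number of distinct genotypes)
var_early = variance_G(early, 112, 129, denom=120)
var_late = variance_G(late, 224, 277, denom=251)
t = t_statistic(112, var_early, 129, 224, var_late, 277)
print(round(var_early, 1), round(var_late, 1), round(t, 2))
```

With these inputs t comes out at about 0.47 with 404 degrees of freedom, far below the 5% critical value of roughly 1.97.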
6.8 Linkage disequilibrium
Population structure is largely affected by the mode of reproduction. Sexual reproduction affects
the association of genes in an individual. The genotypic proportions for a particular locus in a
random mating population are the product of the allelic frequencies of the two uniting gametes, a
condition called the Hardy Weinberg equilibrium. Hardy Weinberg equilibrium at a single locus
is achieved in one generation of random mating; it therefore only tells us about the pattern of
mating of the generation directly before the observation is made (Crow, 1986).
To know the history and pattern of reproduction and gene association, a measure of linkage
disequilibrium (alternatively called gametic disequilibrium) is used to describe the deviation of
observed genotypic frequencies from those expected under random association of alleles at
different loci. Linkage disequilibrium (D) can be measured for two loci A and B according to
Dij = Pij - pi pj
where Pij is the observed frequency of AiBj genotypes in the population and pi and pj are the
frequencies of alleles Ai and Bj at loci A and B respectively in the population (Hartl and Clark,
1989).
A pair of loci is said to be in linkage equilibrium when the observed genotypic frequencies do
not deviate significantly from the product of the individual allelic frequencies at the two loci
examined. Physical linkage between loci contributes to disequilibrium because genes on the
same chromosomes, unless separated by recombination, will be associated more often than
expected at random. Besides physical linkage, association of alleles at different loci can be caused
by selection and genetic drift.
Linkage disequilibrium is an important concept in the analysis of fungal populations because
selection for virulence alleles by resistant host genotypes coupled with a high asexual
reproductive capacity can result in strong linkage disequilibrium. Genetic exchange and
recombination, either sexual or parasexual, dissipate disequilibrium. Hence, the level of linkage
disequilibrium can be used as an indirect measure of the significance of genetic
exchange and recombination in presumably asexual populations.
When alleles at different loci are not associated at random, they are said to be in linkage
disequilibrium. When alleles at different loci are associated at random (i.e. in proportion to their
frequencies), the loci are in linkage equilibrium and D approaches 0. The maximum absolute
value D can have is 0.25, namely when linkage disequilibrium is complete and the allelic
frequencies are 0.5 at both loci. When the allelic frequencies at the two loci differ, complete
linkage disequilibrium is not possible.
The Hardy Weinberg law dictates that equilibrium among alleles at a single locus is reached in
one single generation of random mating. In contrast, linkage disequilibrium decays only gradually,
decreasing with every generation of random mating in the absence of selection. Permanent linkage
disequilibrium may result from selection if some gametic combinations result in higher fitness
than other combinations.
Assume two loci, A and B, and at each locus there are two alleles, (A and a) and (B and b).
Assume further that alleles A and B interact well with each other, producing a well adapted
phenotype, and that the same is true for alleles a and b. However, combinations Ab and aB yield
poorly adapted phenotypes. The population as a whole would benefit if in most cases the alleles
were transmitted in the combinations AB and ab and rarely in Ab and aB. If the allele frequencies
of the above two loci are:
Locus        A           B
Allele       A     a     B     b
Frequency    p     q     r     s
and the alleles at the two loci are associated at random, then the four possible gametic classes are
expected to have frequencies that are the product of the frequencies of the alleles involved,
that is:

Gametes      AB    ab    Ab    aB
Frequency    pr    qs    ps    qr
If the alleles are associated at random, the product of the frequencies of the two coupling
gametes (pr x qs = pqrs) is the same as the product of the frequencies of the two repulsion
gametes (ps x qr = pqrs). Where the alleles are not randomly associated, the two products will be
different, and this difference is used as the measure of linkage disequilibrium:
D = prqs - psqr
where pr, qs, ps and qr now stand for the observed frequencies of the AB, ab, Ab and aB gametes.
Where there is linkage equilibrium D = 0. Linkage disequilibrium is complete when only two
gametic combinations exist, either only the two coupling gametes or only the two repulsion
gametes. The maximum absolute value that D can have is 0.25, namely when linkage
disequilibrium is complete and the allele frequencies are 0.5 at both loci.
Example of linkage disequilibrium
If we put some numbers into the above example
Gametes                 AB       ab       Ab       aB
Frequency (symbol)      pr       qs       ps       qr
Frequency (observed)    0.453    0.019    0.076    0.452

D = prqs - psqr = 0.453 x 0.019 - 0.076 x 0.452 = -0.026
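The numerical example can be checked in Python, which also shows that the product form used here and the deviation form Dij = Pij - pi pj given earlier lead to the same value when the four gamete frequencies sum to one. Variable names are illustrative.

```python
# Observed gamete frequencies from the example
gametes = {"AB": 0.453, "ab": 0.019, "Ab": 0.076, "aB": 0.452}

# Product form: coupling product minus repulsion product
d_product = gametes["AB"] * gametes["ab"] - gametes["Ab"] * gametes["aB"]

# Deviation form: D = P(AB) - p(A) * p(B), allele frequencies derived from the gametes
p_A = gametes["AB"] + gametes["Ab"]
p_B = gametes["AB"] + gametes["aB"]
d_deviation = gametes["AB"] - p_A * p_B

print(round(d_product, 3), round(d_deviation, 3))  # -0.026 -0.026
```

The negative sign indicates an excess of the repulsion gametes Ab and aB relative to random association.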
A test for the significance of the disequilibrium coefficient between each pair of alleles at two
loci can be formulated with the following chi-square statistic:
$$X^2_{AB} = \frac{n D_{ab}^2}{P_a (1 - P_a) P_b (1 - P_b)}$$
where n is the number of individuals in the sample and Dab is the maximum likelihood estimator
of disequilibrium between alleles A and B. The observed allele frequencies at the two loci are Pa
and Pb, respectively. This chi-square statistic has one degree of freedom (Weir, 1990). It is
common to exclude alleles with a frequency lower than 5% from the analysis, as extremely large
sample sizes are needed to obtain any meaningful disequilibrium coefficient for them. Also common
is the combining of rare alleles into a single class with a frequency larger than 5%.
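A sketch of this test applied to the gamete frequencies of the example is given below. It assumes the usual form of the statistic with the squared disequilibrium coefficient in the numerator, and the sample size n = 200 is purely hypothetical because the example does not state one. The critical value 3.841 is the tabulated chi-square value for one degree of freedom at the 5% level.

```python
def ld_chi_square(n, d, p_a, p_b):
    """X2 = n * D^2 / (Pa (1 - Pa) Pb (1 - Pb)), one degree of freedom."""
    return n * d ** 2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

CHI2_CRIT_1DF_5PCT = 3.841  # tabulated critical value, 1 d.f., 5% level

n = 200                      # hypothetical number of individuals sampled
p_a, p_b = 0.529, 0.905      # frequencies of alleles A and B from the example
d = 0.453 - p_a * p_b        # D = P(AB) - p(A) p(B)
x2 = ld_chi_square(n, d, p_a, p_b)
print(round(x2, 2), x2 > CHI2_CRIT_1DF_5PCT)
```

With this assumed sample size the statistic is about 6.2, so the disequilibrium would be judged significant; with a much smaller sample the same D would not be.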
6.9 Population differentiation
In plant pathology we are often interested in population differentiation. Typical questions
include: are pathogen populations in different fields, in different regions, or obtained from
different crops the same, or are they significantly different? There are a number of tests we can
perform to find out whether or not populations are the same.
At the genotypic level we can use Stoddart and Taylor's measure of genotypic diversity, for which
there is a t-test. This test can be used in asexually reproducing populations to see if there are
significant differences in mode of reproduction between different fields. The problem with this
approach is that this measure does not take the level of differences between the clones into
account.
For a numerical example of testing differences in genotypic diversity see section 6.7d.
In highly sexually reproducing populations comparing levels of genotypic diversity is not very
useful as both populations will show extremely high levels. Different levels of genotypic
diversity are more important when the biological question relates to the importance of various
modes of reproduction over the course of an epidemic or a season. This is the case when
populations have been sampled early and late in the season, and again the following year, and we
want to test whether the genotypic diversity has changed significantly over the course of the
epidemic or the life cycle of the pathogen.
When co-dominant markers have been used in such a population study we can use a Chi-square
test to test if allele frequencies are significantly different between the populations. One of the
most straightforward ways of testing for genetic differentiation is to use contingency table
chi-square tests. With v alleles at a locus, the genotype counts in each of r samples are arranged
in a [v(v+1)/2] x r contingency table, giving a chi-square statistic with [v(v+1)/2 - 1] x (r - 1)
degrees of freedom. A problem with this method arises when low allele frequencies are encountered,
which give rise to rather large test statistics. To avoid this it may be necessary to combine rare
alleles. This problem increases when the number of alleles increases. As a rule of thumb,
Chi-square tests should not be performed when the expected count in any class is less than 5.
Contingency Chi-square analysis
Differences in allele frequencies between populations can be tested using a X2 test for
heterogeneity (Workman and Niswander, 1970). For a diploid fungus the Chi-square values for
each RFLP locus can be tested in the following manner. Consider two populations of 50
individuals each with the following allele frequencies
Allele          A      a
Population 1    0.8    0.2
Population 2    0.6    0.4

There is no single hypothesis that tells us what frequency to expect in each class, but we can
test whether allele frequencies are independent of population by means of a 2x2 contingency table.
First set up a table with the observed results:

Allele          A      a      Totals
Population 1    40     10     50
Population 2    30     20     50
Totals          70     30     100
We can now calculate the expected results for each of the classes under the null hypothesis that
there is no genetic differentiation by multiplying the corresponding subtotals and dividing by
the grand total. For example the expected count of allele A in population 1 is
(70x50)/100 = 35. The 2x2 contingency table of expected results is:

Allele          A      a      Totals
Population 1    35     15     50
Population 2    35     15     50
Totals          70     30     100

This gives rise to the following X2 value:
X2 = (35-40)²/35 + (35-30)²/35 + (10-15)²/15 + (20-15)²/15 = 4.76
Although there are four classes, the number of degrees of freedom in this case is 1, not 3, because
only one of the four values in the 2x2 contingency table needs to be known in order to calculate
the other ones by subtracting from the subtotals. In general the number of degrees of
freedom is (r-1)(c-1) where r is the number of rows and c the number of columns (do not include
the subtotals). The Chi-square value of 4.76 is larger than the critical value of 3.84 for one
degree of freedom at the 5% level of significance. Hence, we can conclude that there is a
significant difference in the allele frequencies between these populations. Note that the
Chi-square statistic needs to be corrected for haploids by dividing the value by 2.
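The contingency calculation generalises to any number of populations and alleles. The sketch below, with an illustrative function name, derives the expected counts from the row and column totals and reproduces the X2 value and degrees of freedom of the worked example.

```python
def contingency_chi_square(observed):
    """Chi-square for an r x c table of counts; returns (X2, degrees of freedom)."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    grand_total = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(observed):
        for j, obs in enumerate(row):
            # Expected count = row subtotal x column subtotal / grand total
            expected = row_totals[i] * col_totals[j] / grand_total
            chi2 += (obs - expected) ** 2 / expected
    dof = (len(observed) - 1) * (len(observed[0]) - 1)
    return chi2, dof

# Rows: populations 1 and 2; columns: counts of alleles A and a
observed = [[40, 10], [30, 20]]
chi2, dof = contingency_chi_square(observed)
print(round(chi2, 2), dof)  # 4.76 1
```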
F statistics
The definition of a population is a difficult one involving time, geographic distance, and biology
of the organism. A population of a pathogen in one continent can range from one large
population, to overlapping subpopulations to numerous small distinct subpopulations. In order to
determine at what level and scale a population of a pathogen operates we can subdivide or partition
variation in the population.
Sewall Wright (1951) was the first to develop methods to partition variation. His measurements,
often called F-statistics, are based on the idea of inbreeding in a diploid mating population. If a
population is subdivided into several genetically related subpopulations, then two randomly
uniting gametes chosen within a subpopulation are more likely to be related by descent than two
gametes from different subpopulations. Wright's fixation index (Fst) is a measure of the genetic
differentiation of a subpopulation relative to the total population due to non-random mating.
In the case of a single locus with two alleles Fst is calculated as
$$F_{st} = \frac{\sigma_q^2}{\bar{q} (1 - \bar{q})}$$
where σ²q is the variance in the frequency of allele A2 among subpopulations and q̄ = Σwiqi is the
weighted mean frequency of A2 in the total population. When subpopulations do not differ
significantly in sample size, equal weight is given to each subpopulation. A Chi-square test is
commonly used where X2 = 2NFst, which may only be applied if sample sizes are identical for all
populations and only two alleles occur at the locus compared. When there are more than two alleles
per polymorphic locus, a powerful test for significant deviation of Fst from zero is the
log-likelihood Chi-squared test (G-test) of homogeneity of the allele distributions themselves.
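A minimal sketch of the two-allele Fst calculation follows. It assumes equal weights when none are given and uses the variance among subpopulations divided by the number of subpopulations (not by that number minus one); the allele frequencies 0.2 and 0.4 are those of allele a in the two populations of the contingency example. The function name is illustrative.

```python
def fst_two_alleles(q_values, weights=None):
    """Fst = var(q) / (q_bar * (1 - q_bar)) for one locus with two alleles."""
    k = len(q_values)
    if weights is None:
        weights = [1.0 / k] * k          # equal weight for each subpopulation
    q_bar = sum(w * q for w, q in zip(weights, q_values))
    var_q = sum(w * (q - q_bar) ** 2 for w, q in zip(weights, q_values))
    return var_q / (q_bar * (1.0 - q_bar))

fst = fst_two_alleles([0.2, 0.4])
print(round(fst, 4))  # 0.0476
```

If 2N is taken as the 100 allele copies scored in that example, X2 = 2NFst is about 4.76, the same value as the contingency chi-square.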
The null hypothesis that there is no substructuring of the population (i.e. Fst = 0) can be
examined by testing for heterogeneity of allele frequencies between subpopulations using the
G-test (with M-1 degrees of freedom, where M = the number of populations). A common problem
with this approach is the presence of alleles at very low frequencies. These may be combined
in the contingency tables until all expected cell frequencies exceed 1.0. However, it is then no
longer possible to test for significant heterogeneity between subpopulations because of the
combining of different alleles. However, by combining cells containing alleles at low frequencies,
the G-test may be used to determine the maximum number of cells that can be combined before
heterogeneity reaches statistical significance. Using this approach the mean of Gst over all
variable loci may be compared to zero by the t-test.
Wright’s F-statistics have the advantage of allowing a simultaneous comparison of allele
frequencies for a number of populations among many loci. However, they assume that the loci
are effectively neutral. Fst is the most commonly used of the F-statistics and gives
a measure of the extent to which a species is organised into subpopulations with restricted gene
flow. It represents the correlation between alleles of gametes sampled at random within
subdivisions of a population, relative to the distribution of alleles within the entire population
sampled. Fst reflects the extent of local differentiation into subpopulations. Fst is always
positive (Wright, 1965) and its calculation requires genotypic information for single loci.
Nei (1973) generalised Wright's population subdivision concept to haploids and asexual
populations. Nei's approach to measuring genetic differentiation is to partition the total gene
diversity into component diversities according to subgroups, such as those from geographic
locations or pathotypic groups as in the case of pathogen populations. A genetic differentiation
coefficient (Gst) is defined as
$$G_{st} = \frac{\hat{H}_t - \hat{H}_s}{\hat{H}_t}$$
where
Ĥt = Total gene diversity over all groups, which is identical to Nei's gene diversity (1973)
and to the expected level of heterozygosity in the total population.
Ĥs = Average gene diversity over all subgroups. Individual Hs estimates are the Nei (1973)
gene diversity measure based on the allele frequencies found within a particular
subpopulation.
Gst, like Fst, describes the average amount of genetic diversity attributed to a particular
subdividing factor relative to the total level of genetic diversity. Gst is a rather useful measure as
it is not contingent upon any assumption about the mode of reproduction of the population. A
low Gst value indicates that most of the gene diversity is found within the subpopulation and
there is not much gene diversity between the populations thus indicating low levels of population
differentiation. Nei’s Gst provides a good measure of the degree of similarity between taxa;
however, it lacks the usual test statistics with their associated levels of confidence.
When a Gst value is high the gene diversity within subpopulations is small compared to the gene
diversity within the total population indicating substantial genetic differentiation between the
subpopulations. When Gst approaches 1, each subpopulation becomes homogeneous and most of
the variation exists between the subpopulations. Note that Nei’s gene diversity measure can also
be used for genotypic diversity. This will allow partitioning of genotypic diversity within and
between subpopulations.
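The partitioning behind Gst can be sketched for a single locus as follows. The sketch assumes subpopulations of equal sample size (unweighted averages) and applies no small-sample correction; function names and the two sets of allele frequencies, borrowed from the contingency example, are illustrative.

```python
def gene_diversity(freqs):
    """Nei's gene diversity 1 - sum(x_i^2) for one locus."""
    return 1.0 - sum(x ** 2 for x in freqs)

def gst(subpop_freqs):
    """Return (Ht, Hs, Gst) for one locus; subpop_freqs holds one frequency list per subpopulation."""
    n_sub = len(subpop_freqs)
    # Hs: average of the within-subpopulation gene diversities
    h_s = sum(gene_diversity(f) for f in subpop_freqs) / n_sub
    # Ht: gene diversity of the pooled (mean) allele frequencies
    n_alleles = len(subpop_freqs[0])
    mean_freqs = [sum(f[i] for f in subpop_freqs) / n_sub for i in range(n_alleles)]
    h_t = gene_diversity(mean_freqs)
    return h_t, h_s, (h_t - h_s) / h_t

h_t, h_s, g_st = gst([[0.8, 0.2], [0.6, 0.4]])
print(round(h_t, 2), round(h_s, 2), round(g_st, 4))  # 0.42 0.4 0.0476
```

Here most of the gene diversity (0.40 out of 0.42) lies within the subpopulations, so Gst is low.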
Subdivision itself can also affect genotype frequencies. If a species is divided into
subpopulations where there is random mating, and the allele frequencies differ between
subpopulations then for the species as a whole, homozygote genotypes will increase at the
expense of heterozygote genotypes. This is known as the Wahlund effect, and has the same
effect on overall heterozygosity as inbreeding.
Of course the question is what value of Gst is indicative of significant differentiation between
subpopulations. In other words, which observed distributions of gene diversity are compatible with
the null hypothesis of no differentiation among subpopulations? Measuring population
differentiation with Gst is fairly straightforward, but Nei did not provide any statistics with
this measure. A simple non-parametric method to test for population differentiation is described
in Hudson et al. (1992). This procedure involves tabulating the observed gene diversity (or
genotypic frequencies) of each subpopulation and subjecting the data to a Chi-square homogeneity
test. Significant deviation of observed from expected frequencies leads to rejection of the null
hypothesis of no differentiation between the subpopulations.
Slatkin (1993) has used Gst to devise a statistic which allows testing of isolation by distance
among different subpopulations. This is based on the reasoning that if there is gene flow between
two populations which are geographically close and the population is in equilibrium, the
logarithm of the average number of migrants per generation between each pair of subpopulations,
M̂, is expected to be negatively correlated with the logarithm of the geographic distance
between subpopulations.
M̂ can be estimated as:
M̂ = (1/Gst - 1)/4 for diploids
M̂ = (1/Gst - 1)/2 for haploids
After M̂ is calculated, log M̂ for each pair of subpopulations can be plotted against the log of
the geographic distance for that pair and tested as to whether there is a significant negative
correlation. See Milgroom and Lipari (1995) for an example of this.
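The conversion from pairwise Gst to M̂ and the log-log pairs used for the plot can be sketched as below. The three Gst values and distances are invented for illustration only, and the function name is an assumption.

```python
import math

def migrants_per_generation(g_st, haploid=False):
    """M-hat = (1/Gst - 1)/4 for diploids, (1/Gst - 1)/2 for haploids."""
    divisor = 2.0 if haploid else 4.0
    return (1.0 / g_st - 1.0) / divisor

# Hypothetical pairwise Gst values and geographic distances (km)
pairs = [(0.02, 10.0), (0.05, 100.0), (0.20, 1000.0)]
for g_st, distance in pairs:
    m_hat = migrants_per_generation(g_st, haploid=True)
    # log10(distance) and log10(M-hat) are the coordinates plotted against each other
    print(round(math.log10(distance), 2), round(math.log10(m_hat), 2))
```

In this made-up data set log M̂ falls as log distance rises, which is the pattern expected under isolation by distance.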
6.10 Partitioning of genetic diversity
Partitioning of genotypic diversity coupled with hierarchical sampling schemes is often used in
plant pathology. When a two-level hierarchical sampling scheme (fields within regions) is used
to collect isolates, the total genotypic diversity can be partitioned into components based on the
amount of diversity within and among subpopulations. The relative magnitude of each
component can be assessed following methods developed by Lewontin (1972), Zhang et al.
(1987), and Goodwin et al. (1992a). Partitioning works in much the same way for all indices and the
following approach is generally applicable. For each region, hfield can be calculated as the mean
of ho for all fields in the region, and hregion can be calculated from the mean frequency of all
genotypes within the region. The total diversity, htotal, is then determined from the mean
frequencies of all genotypes in the entire sample. The mean within-field and within-region
diversity values over the whole sample are the averages of the per-region hfield and hregion
values, weighted by the number of fields in each region. The total diversity is allocated to
hierarchical components as follows: hfield/htotal is the proportion of total diversity that is due
to differences within fields; (hregion - hfield)/htotal is the proportion of total diversity due
to differences among fields within regions; and (htotal - hregion)/htotal is the proportion of
total diversity due to differences among regions.
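The hierarchical partitioning can be illustrated with a small, entirely hypothetical data set of two regions with two fields each. The sketch uses Nei's 1 − Σx² as the diversity index and pools isolates to obtain region and total frequencies, which equals using mean frequencies only because all fields here have the same sample size; both choices are assumptions of this example.

```python
from collections import Counter

def diversity(genotypes):
    """Nei-type diversity 1 - sum(x_k^2) from a list of genotype labels."""
    n = len(genotypes)
    return 1.0 - sum((c / n) ** 2 for c in Counter(genotypes).values())

# Hypothetical data: region -> list of fields -> genotype labels of sampled isolates
regions = {
    "north": [["g1", "g1", "g2", "g3"], ["g1", "g2", "g2", "g4"]],
    "south": [["g5", "g5", "g5", "g6"], ["g5", "g6", "g7", "g7"]],
}

n_fields = sum(len(fields) for fields in regions.values())
h_field = 0.0    # mean within-field diversity, weighted by fields per region
h_region = 0.0   # mean within-region diversity, weighted by fields per region
everything = []
for fields in regions.values():
    pooled = [g for field in fields for g in field]
    everything.extend(pooled)
    weight = len(fields) / n_fields
    h_field += weight * sum(diversity(f) for f in fields) / len(fields)
    h_region += weight * diversity(pooled)
h_total = diversity(everything)

within_fields = h_field / h_total
among_fields = (h_region - h_field) / h_total
among_regions = (h_total - h_region) / h_total
print(round(within_fields, 3), round(among_fields, 3), round(among_regions, 3))
```

The three proportions add up to one, so each can be read directly as the share of total diversity found at that level of the hierarchy.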
6.11 Fixation index
Differences between expected and observed levels of heterozygosity may be due to the
occurrence of a certain amount of self-fertilization. This difference can be exploited in a Fixation
index (F) which tells us something about the kind of reproduction in the population. Higher
levels of inbreeding result in lower levels of heterozygosity and lower overall levels of genetic
diversity and this can be shown at any hierarchical level. Thus the mating system can be
analysed by comparing the observed proportion of heterozygotes in a population to that
expected assuming random mating. Wright’s fixation index (F) can be calculated as:
F = 1 - (Hobs/Hexp)
in which Hobs is the observed mean heterozygosity per locus, and Hexp is the expected mean
heterozygosity (Brown, 1979). Hexp is the same as Nei’s gene diversity (Nei, 1973). Random
mating populations should have F-values close to 0, as Hobs approaches Hexp. Under complete
selfing, F-values close to 1 are expected as Hobs approaches 0. Values between 0 and 1
indicate various levels of inbreeding. A value of F less than 0 indicates an excess of
heterozygotes (Goodwin, 1997), either through disassortative mating or asexual reproduction.
Asexual reproduction could skew F-values in either direction, depending on the level of
heterozygosity present in the most prevalent clones (Goodwin, 1997) and on whether most of
the polymorphisms are present as heterozygotes or homozygotes. Needless to say, the fixation
index only applies to diploid organisms.
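For a single locus with two alleles the fixation index can be computed directly from genotype counts, as in the sketch below. The genotype counts are hypothetical, Hexp is taken as 2pq without a small-sample correction, and the text's mean over loci is reduced here to one locus for brevity.

```python
def fixation_index(n_AA, n_Aa, n_aa):
    """F = 1 - Hobs/Hexp for one diploid locus with two alleles."""
    n = n_AA + n_Aa + n_aa
    p = (2 * n_AA + n_Aa) / (2 * n)   # frequency of allele A
    h_obs = n_Aa / n                  # observed proportion of heterozygotes
    h_exp = 2 * p * (1 - p)           # expected under random mating
    return 1.0 - h_obs / h_exp

# Hypothetical counts: a deficit of heterozygotes, then Hardy-Weinberg proportions
f_inbred = fixation_index(30, 20, 50)
f_random = fixation_index(16, 48, 36)
print(round(f_inbred, 3), abs(round(f_random, 3)))
```

The first sample gives F of roughly 0.58, indicating strong inbreeding, while the second gives a value of essentially zero.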
6.12 Genetic distance
Heterozygosity gives one measure of genetic variation but there are other methods which have
the advantage that values may be directly compared between populations and between different
sized groupings (i.e. between versus within species) and thus used to measure genetic
differentiation during the speciation process and possible phylogenetic relationships. Nei’s
genetic identity (I) estimates the normalised probability that two alleles, one taken from each
population, are identical. Essentially, it provides a measure of the similarity in frequency of each
allele, summed over all alleles. It is given by:
$$I = \frac{I_{xy}}{\sqrt{I_x I_y}}$$
where Ixy, Ix, and Iy are the averages over all loci (including monomorphic ones) of ∑xiyi, ∑xi²
and ∑yi², respectively, and xi and yi are the frequencies of the ith allele in the two populations
X and Y. Genetic identity may vary from zero (no alleles shared between the two populations) to
one (where both populations have identical allele frequencies). For traits with two or more alleles
the probabilities must be calculated for each allele separately and summed.
The genetic distance between the two populations is then calculated by:
D = -ln I
Genetic distance varies from 0 for populations with identical allele frequencies to infinity for
populations that do not share any alleles. (See Nei 1978 for modifications concerning small
sample sizes).
Calculation of genetic distance
Consider two populations 1 and 2 with 3 loci and the following allele frequencies

Locus           A             B             C
Allele          A      a      B      b      C      c
Population 1    0.1    0.9    0.4    0.6    1      0
Population 2    0.2    0.8    0.3    0.7    0      1

In order to calculate genetic identity it is easiest to set up a table

Locus      Ixy                            Ix                     Iy
A          (0.1x0.2)+(0.9x0.8) = 0.74     0.1² + 0.9² = 0.82     0.2² + 0.8² = 0.68
B          (0.4x0.3)+(0.6x0.7) = 0.54     0.4² + 0.6² = 0.52     0.3² + 0.7² = 0.58
C          (1x0)+(0x1) = 0                1² + 0² = 1            0² + 1² = 1
Total      1.28                           2.34                   2.26
Average    0.427                          0.78                   0.753

Genetic identity is I = Ixy/√(IxIy) = 0.427/√(0.78 x 0.753) = 0.557
Genetic distance D = -ln I = -ln 0.557 = 0.585
That is, it is estimated that 0.585 allelic substitutions per locus (or 58.5 allelic substitutions
per 100 loci) have occurred in the separate evolution of the two populations. Note that more than
three loci need to be studied in order to obtain a reliable estimate of genetic distance or genetic
differentiation between any two populations. The following table (Ayala, 1975) gives an idea of
what distances to expect at various levels. See also a review by Avise and Aguado (1982).

Level of comparison                  Genetic identity I    Genetic distance D
Local populations                    0.970 ± 0.006         0.031 ± 0.007
Subspecies                           0.795 ± 0.013         0.230 ± 0.016
Incipient species                    0.798 ± 0.026         0.226 ± 0.033
Sibling species                      0.563 ± 0.023         0.581 ± 0.039
Morphologically different species    0.352 ± 0.023         1.056 ± 0.068
Genetic distance and genetic identity are often used as the basis for analyses that reconstruct
evolutionary relationships. Nei’s (1972, 1978) identity (I) has a clear biological meaning: the
probability that alleles drawn at random from two populations or taxa are identical. Genetic
distance is therefore an estimate of the number of nucleotide base-pair substitutions per gene
locus that have accumulated during evolutionary time since the divergence of taxa from their
common ancestor. This is under the assumption that evolutionary rates are identical along
lineages and among loci. If we compare two taxa, a D value of 0.01 would indicate that one
allelic substitution had occurred per 100 loci since the divergence of these taxa from their
common ancestor.
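The genetic identity and distance calculation for the three-locus example in this section can be sketched as follows. The function name is illustrative and no small-sample correction (Nei, 1978) is applied.

```python
import math

def nei_identity_distance(pop_x, pop_y):
    """Return (I, D) from per-locus allele frequency lists given in the same order."""
    n_loci = len(pop_x)
    # Averages over loci of sum(x_i * y_i), sum(x_i^2) and sum(y_i^2)
    i_xy = sum(sum(x * y for x, y in zip(lx, ly)) for lx, ly in zip(pop_x, pop_y)) / n_loci
    i_x = sum(sum(x * x for x in lx) for lx in pop_x) / n_loci
    i_y = sum(sum(y * y for y in ly) for ly in pop_y) / n_loci
    identity = i_xy / math.sqrt(i_x * i_y)
    return identity, -math.log(identity)

# Loci A, B and C for populations 1 and 2
pop1 = [[0.1, 0.9], [0.4, 0.6], [1.0, 0.0]]
pop2 = [[0.2, 0.8], [0.3, 0.7], [0.0, 1.0]]
identity, distance = nei_identity_distance(pop1, pop2)
print(round(identity, 3), round(distance, 3))
```

Carrying full precision through gives I of about 0.557 and D of about 0.586; rounding I to three decimals before taking the logarithm changes the last digit of D.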
6.13 Similarity and dissimilarity indices
There are other measures of diversity which are often used in combination with qualitative data
(presence or absence of characters such as RAPD fragments). There are a large number of
these so-called similarity indices, and I have outlined a few using a small data set.
For qualitative characters (presence or absence of characters)
Variables for two individuals (1 and 2) can be classed as follows
a = both individuals have fragment (++)
b = individual 1 has fragment, 2 has no fragment (+-)
c = individual 1 has no fragment 2 has one (-+)
d = individual 1 and 2 both have no fragment (--)
p=a+b+c+d

                       Individual 2
                       +        -
Individual 1    +      a        b        a+b
                -      c        d        c+d
                       a+c      b+d      p
Sample data set
Variable    1   2   3   4   5   6   7   8   9   10
Strain 1    1   0   0   0   1   1   0   0   1   0
Strain 2    0   0   0   0   1   0   0   1   1   0

For this data set a = 2, b = 2, c = 1, d = 5 and p = 10.
Index                                 Formula                          Value
1. Simple matching coefficient        SI = (a+d)/p                     0.70
2. Jaccard's coefficient              SI = a/(a+b+c)                   0.40
3. Czekanowski's coefficient          SI = 2a/(2a+b+c)                 0.57
4. Sokal & Sneath coefficient         SI = 2(a+d)/(2(a+d)+b+c)         0.82
5.                                    SI = a/(a+2(b+c))                0.25
6. Russel & Rao's coefficient         SI = a/p                         0.20
According to SI 4 the strains are similar, but according to SI 6 they are dissimilar.
Which similarity index are you going to use?
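The six indices can be recomputed from the two strain profiles of the sample data set with the sketch below, which first tallies a, b, c and d and then applies each formula. The dictionary labels are illustrative; index 5 is left unnamed as in the list above.

```python
strain1 = [1, 0, 0, 0, 1, 1, 0, 0, 1, 0]
strain2 = [0, 0, 0, 0, 1, 0, 0, 1, 1, 0]

pairs = list(zip(strain1, strain2))
a = pairs.count((1, 1))   # fragment present in both strains
b = pairs.count((1, 0))   # present in strain 1 only
c = pairs.count((0, 1))   # present in strain 2 only
d = pairs.count((0, 0))   # absent in both strains
p = a + b + c + d

indices = {
    "1 simple matching": (a + d) / p,
    "2 Jaccard": a / (a + b + c),
    "3 Czekanowski": 2 * a / (2 * a + b + c),
    "4 Sokal & Sneath": 2 * (a + d) / (2 * (a + d) + b + c),
    "5 unnamed": a / (a + 2 * (b + c)),
    "6 Russel & Rao": a / p,
}
for name, value in indices.items():
    print(f"{name}: {value:.2f}")
```

Indices that count shared absences (1 and 4) give the highest similarity here because half of the variables are absent in both strains.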
6.14 Suitability of markers for population genetics
Each particular biological problem can be attacked using one or a series of different genetic
markers. Each marker system has its own advantages and disadvantages. The following table
gives a rough indication of which markers are most suited to which problem.
Questions (table rows): Population dynamics; Geographic distribution; Gene flow; Drift;
Recombination; Clonality; Random mating; Parasexuality; Gene diversity; Genotypic diversity;
Selection (4)
Marker classes (table columns): Biological markers; Cytoplasmic markers; Neutral molecular
markers, phenotypic; Neutral molecular markers, genotypic
Ratings in the order recovered (their assignment to individual cells is not recoverable):
+  +  +  ++  +++  +++  +++  +  + (2)  + (1)  +++  +++  + (5)  ++  ++  ++  +++  +  ++ (3)
+++  +++  +++  +++  +++  +++  +++  +++  +++  +++

Key
(no entry)   Unsuitable
+            Limited information can be deduced
++           Estimations can be made with these tools
+++          Most informative techniques available

Footnotes
2  Mating type as a marker is a good indicator of the potential for sexual reproduction in
   heterothallic species; however, other biological markers such as virulence are typically
   infrequent and under selection, which makes them unsuitable for assessment of random mating.
3  Genotypic diversity can be estimated with a phenotypic marker such as RAPDs. However, results
   are not always reproducible, and codominance is not distinguished. Hence, a genotypic marker
   technique such as RFLPs, which may initially appear more time consuming, may actually save on
   time and costs in the long term by reducing the repetitions required.
4  Selection is usually assessed by a combination of neutral and selectable markers.
5  Useful if the selected trait is located on extrachromosomal elements such as mitochondrial DNA.
Isozymes
The amount of genetic diversity detected by using isozymes is subject to strong bias. A major
source of bias is that only approximately one-third of all amino acid substitutions can be detected
by using electrophoresis (Lewontin, 1974). The other substitutions do not change the charge of the
protein and will thus not result in separation of the isozymes in electrical fields. In addition, small
differences in rate of migration are not always detectable, so some amino acid substitutions that do
influence net charge are also “silent”. Be aware of the strong bias in favour of enzyme systems
that show polymorphisms when screening enzyme systems for use in population studies.
Problems with RAPD markers
RAPD markers are dominant and only allow one fragment per locus to be identified, scored as
presence or absence. After reading the previous sections it must have become clear that the use of
dominant markers is rather limited in population genetics. However, they do allow estimates of
phenotypic diversity and are good for identification of clonal phenotypes. I have listed a few of the
problems and possibilities with these markers without going into great detail.
(Lynch and Milligan, 1994. Molecular Ecology 3: 91-99)
Practical problems
1. Assigning bands to loci (requires pedigree analysis)
2. Products of different loci may have similar molecular weights and so co-migrate
3. Dominance of RAPD markers
Lynch and Milligan give estimators of:
• Allele frequency
• Genotype frequency
• Gene diversity
• Population subdivision
• Genetic distance
• Relatedness (very limited)
Conclusions:
• With RAPDs, 2-10 times more individuals need to be sampled per locus than with RFLP or
isozyme loci
• Loci at which the band is present at high frequency (so that null homozygotes are rare) cannot
be used
• Many more loci need to be scored
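The sampling conclusion above can be illustrated with a rough calculation. The sketch below is not taken from Lynch and Milligan. It assumes a diploid population in Hardy-Weinberg equilibrium, estimates the null-allele frequency q as the square root of the band-absent fraction, and uses a first-order (delta-method) approximation for the sampling variance. All function names are hypothetical.

```python
import math


def null_allele_freq(n_band_absent, n_sampled):
    """Naive estimate of the null (band-absent) allele frequency q.

    Assumption: diploid population in Hardy-Weinberg equilibrium, so
    band-absent individuals are null homozygotes with frequency q**2.
    """
    return math.sqrt(n_band_absent / n_sampled)


def var_q_dominant(q, n_sampled):
    """Approximate sampling variance of q from a dominant marker (delta method)."""
    return (1.0 - q ** 2) / (4.0 * n_sampled)


def var_q_codominant(q, n_sampled):
    """Binomial sampling variance of q when all 2N alleles can be counted."""
    return q * (1.0 - q) / (2.0 * n_sampled)


def sample_size_ratio(q):
    """Factor by which the sample must grow for a dominant marker to match
    the precision of a codominant one; simplifies to (1 + q) / (2 * q)."""
    return var_q_dominant(q, 1) / var_q_codominant(q, 1)


if __name__ == "__main__":
    print("q estimated from 16 band-absent isolates out of 100:",
          null_allele_freq(16, 100))
    for q in (0.05, 0.1, 0.25, 0.5):
        print("q =", q, "sample size ratio =", round(sample_size_ratio(q), 2))
```

Under these assumptions the ratio is 1.5 at q = 0.5 and 10.5 at q = 0.05, which is of the same order as the 2-10 fold figure quoted above. Rare null alleles are the costly case.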
Possibilities with RAPD markers
(see Peever and Milgroom, 1994. Can. J. Bot. 72: 915-923)
HOW TO OBTAIN A SET OF NEUTRAL RFLP MARKERS
After you have clearly defined your biological question and have come up with an experimental
design which allows you to rigorously test your hypothesis using the appropriate statistics, it is
time to start thinking about the markers you are going to use. I have outlined in some detail how
to obtain useful RFLP markers, but the principles are the same for virtually any marker.
Issues to be worked out include:
• How to grow large numbers of your fungus efficiently
• Try several DNA isolation procedures and choose one that gives reliable, good-quality DNA
• Make sure the DNA cuts well with a number of different restriction enzymes
• You need 2-5 µg of DNA per lane, and enough DNA for about 4-5 gels
• Store your DNA at -20 °C and never at 4 °C, where it will slowly degrade (don’t worry about
shearing, as that is a myth).
You will need probes from a library: either an existing cDNA or genomic library, or one you
make yourself. If you make a library, screen it for high-copy sequences by labelling part of the
entire library and probing it back against a plated-out copy of the library. Also select against
ribosomal repeats. Obtain about 50-100 clones from this library, which then need to be screened
to determine whether they come from single loci.
This can be done by screening the clones on about four genetically different isolates cut with the
same enzyme as the library was constructed with. Make a number of duplicate blots with 4
different enzymes on them so you can rapidly screen 4-8 probes at a time, depending on the
capacity of your hybridisation oven.
Screen the 50-100 clones in 5-10 hybridisation rounds. Since you have used 4 different enzymes
you have screened a minimum of 200 probe-enzyme combinations (50 clones × 4 enzymes). This
should yield 20-25 single-copy probes. Ideally the isolates used for probe selection should differ
from the population you are going to analyse, to avoid bias towards selecting monomorphic or
polymorphic loci.
If you work with a sexually reproducing fungus and have progeny available, you can do even
better: demonstrate that you have truly single-copy markers and prove that all the markers
segregate independently of each other.
In order to avoid complex patterns you need to cut your clones with the three other enzymes as
well. If you still find more than two fragments for a diploid organism, then it is likely that two
or more loci are involved or that your enzyme cuts within the locus. By using only a fragment
within which none of the restriction enzymes cuts, you increase the chance of getting
easy-to-interpret patterns.
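The fragment-count check described above can be sketched in a few lines. This is only an illustration under stated assumptions: the fragments per lane have already been counted by eye, the function name and data layout are hypothetical, and the probe and enzyme labels in the example are made up.

```python
def flag_complex_patterns(fragment_counts, ploidy=2):
    """Return probe-enzyme combinations that show more fragments than expected.

    fragment_counts maps a (probe, enzyme) pair to a list holding the number
    of hybridising fragments scored in each isolate's lane. A single-copy
    locus should show at most `ploidy` fragments per lane (use ploidy=1 for
    a haploid fungus).
    """
    flagged = {}
    for combo, counts in fragment_counts.items():
        worst = max(counts)
        if worst > ploidy:
            flagged[combo] = worst
    return flagged


if __name__ == "__main__":
    # Hypothetical scores for four isolates; names are placeholders only.
    scores = {
        ("probe_A", "enzyme_1"): [1, 2, 2, 1],
        ("probe_A", "enzyme_2"): [2, 3, 2, 2],
        ("probe_B", "enzyme_1"): [4, 4, 5, 4],
    }
    for combo, worst in flag_complex_patterns(scores).items():
        print(combo, "shows up to", worst,
              "fragments: possibly several loci or an internal cut site")
```

A flagged combination is not necessarily useless. It only signals that the pattern needs a closer look before the probe is accepted as single copy.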
You will need at least 10 different clones, and you may use two to three different enzymes with
the same clone. This should yield between 20 and 30 useful probe-enzyme combinations,
enough for most population genetic studies. Thus instead of reprobing a single blot 20 times
(which is beyond most blots) it is easier to probe 2-3 blots, each cut with a different enzyme,
with the same clone. This significantly reduces the number of hybridisations that need to be
performed and speeds up your research.
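The saving can be made concrete with a small calculation. The sketch below assumes, as a simplification not stated in the text, that every clone gives a useful pattern with every enzyme. The function name is hypothetical.

```python
import math


def probings_per_blot(target_combinations, enzymes_per_clone):
    """Number of clones, and hence successive probings of each blot, needed
    to reach the target number of probe-enzyme combinations.

    Assumption: every clone is informative with every enzyme used.
    """
    return math.ceil(target_combinations / enzymes_per_clone)


if __name__ == "__main__":
    for target, enzymes in ((20, 1), (20, 2), (30, 3)):
        print(target, "combinations with", enzymes, "enzyme(s) per clone:",
              probings_per_blot(target, enzymes), "probings of each blot")
```

With one enzyme a blot has to survive 20 probings, whereas with two or three enzymes on separate blots each blot is probed only 10 times for 20 to 30 combinations.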
Also consider the logistics of your approach. If you have gels with 20-well combs and you have
two controls on each gel, it is handy to choose 36 as your sampling unit in the experimental
design, since each gel then holds 18 samples and one unit fills exactly two gels. It is worthwhile
spending a little time to come up with an efficient approach, as you will be rewarded later with
an efficient sample throughput.
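The same bookkeeping can be done for other comb sizes and sample numbers. The helper functions below are hypothetical and only restate the arithmetic of the example: wells minus controls gives the samples per gel, and the sampling unit is best chosen as a multiple of that number.

```python
import math


def samples_per_gel(wells=20, controls=2):
    """Lanes left for samples once the control lanes are reserved."""
    return wells - controls


def gels_needed(n_samples, wells=20, controls=2):
    """Number of gels required to run n_samples."""
    per_gel = samples_per_gel(wells, controls)
    if per_gel <= 0:
        raise ValueError("controls must leave at least one free well")
    return math.ceil(n_samples / per_gel)


def empty_lanes(n_samples, wells=20, controls=2):
    """Sample lanes left unused on the last gel."""
    per_gel = samples_per_gel(wells, controls)
    return gels_needed(n_samples, wells, controls) * per_gel - n_samples


if __name__ == "__main__":
    for n in (36, 40):
        print(n, "samples:", gels_needed(n), "gels,",
              empty_lanes(n), "empty sample lanes")
```

A unit of 36 fills two gels with no empty lanes, while a unit of 40 needs a third gel that runs mostly empty.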
LITERATURE CITED
Avise, J.C. and Aquadro, C.F. 1982. A comparative summary of genetic distances in the vertebrates.
Evolutionary Biology 15: 151-185.
Ayala, F.J. 1975. Genetic differentiation during the speciation process. Evolutionary Biology
8: 1-78.
Bowman, K.O., Hutcheson, K., Odum, E.P. & Shenton, L.R. 1971. Comments on the distribution
of indices of diversity. p 315-359 in: Patil, G.P., Pielou, E.C. & Waters W.E. (eds.).
Statistical Ecology Volume 3. Many species populations, ecosystems, and system
analysis. The Pennsylvania State University Press, University Park and London. pp
462.
Brown, A.H.D. 1979. Enzyme polymorphism in plant populations. Theoretical Population
Biology. 15: 1-42.
Chen, R.S., Boeger, J.M., and McDonald, B.A. 1994. Genetic stability in a population of a
plant pathogenic fungus over time. Molecular Ecology. 3: 209-218.
Cheung, W.Y., Hubert, N., and Landry, B.S. 1993. A simple and rapid DNA microextraction
method for plant, animal, and insect suitable for RAPD and other PCR analyses. PCR
Methods and Applications. 3: 69-70.
Goodwin, S.B., Allard R.W., Hardy, S.A. & Webster, R.K. 1992a. Hierarchical structure of
pathogenic variation among Rhynchosporium secalis populations in Idaho and Oregon.
Canadian Journal of Botany 70: 810-817.
Fry, W.E., Goodwin, S.B., Dyer, A.T., Matuszak, J.M., Drenth, A., Tooley, P.W., Sujkowski,
L.S., Koh, Y.J., Cohen, B.A., Spielman, L.J., Deahl, K.L., Inglis, D.A., and Sandlan,
K.P. 1993. Historical and recent migrations of Phytophthora infestans: chronology,
pathways, and implications. Plant Disease. 77: 653-661.
Goodwin, S.B. 1997. The population genetics of Phytophthora. Phytopathology. 87: 463-473.
Groth, J.V. and Roelfs, A.P. 1986. The analysis of genetic variation in populations of rust
fungi. In: Leonard, K.J. and Fry, W.E. Plant disease epidemiology. Volume 2.
Genetics, Resistance and Management.
Hardy, G.H. 1908. Mendelian proportions in a mixed population. Science 28: 49-50.
Hartl, D.L. and Clark A.G. 1989. Principles of population genetics. Sinauer Associates Inc.
Sunderland. Mass.
Hudson, R.R., Boos, D.D. & Kaplan, N.L. 1992. A statistical test for detecting geographic
subdivision. Molecular Biology and Evolution 9: 138-151.
Lewontin, R.C. 1972. The apportionment of human diversity. Evolutionary Biology 6: 381-398.
McDonald, B.A. 1997. The population genetics of fungi: tools and techniques.
Phytopathology. 87: 448-453.
McDonald, B.A., Miles, J., Nelson, L.R., and Pettway, R.E. 1994. Genetic variability in
nuclear DNA in field populations of Stagonospora nodorum. Phytopathology. 84:
250-255.
Milgroom, M.G. and Lipari, S.E. 1995. Population differentiation in the chestnut blight fungus,
Cryphonectria parasitica, in eastern North America. Phytopathology 85:155-160.
Nei, M. 1973. Analysis of gene diversity in subdivided populations. Proceedings of the
National Academy of Sciences USA. 70: 3321-3323.
Nei, M. 1978. Molecular evolutionary genetics. Columbia University Press. USA.
Nei, M., and Li, W.-H. 1979. Mathematical model for studying genetic variation in terms of
restriction endonucleases. Proc. Natl. Acad. Sci. USA 76: 5269-5273.
Peever, T.L., and Milgroom, M.G. 1994. Genetic structure of Pyrenophora teres populations
determined with Random Amplified Polymorphic DNA markers. Canadian Journal
of Botany. 72: 915-923.
Sheldon, A.L. 1969. Equitability indices: dependence on the species count. Ecology 50: 466-467.
Slatkin, M. 1993. Isolation by distance in equilibrium and non-equilibrium populations.
Evolution 47:264-279.
Stoddart, J.A., and Taylor, J.F. 1988. Genotypic diversity: estimation and prediction in
samples. Genetics. 118: 705-711.
Workman, P.L. and Niswander, J.D. 1970. Population studies on southwestern Indian tribes
II. Local genetic differentiation in the Papago. American Journal of Human Genetics
22:24-29.
Zhang, Q., Webster, R.K. & Allard, R.W. 1987. Geographical distribution and associations
between resistance to four races of Rhynchosporium secalis. Phytopathology 77: 352-357.