Drosophila Population Genetics
Brian Charlesworth
Institute of Evolutionary Biology
School of Biological Sciences
University of Edinburgh
Why is intra-specific variability
interesting?
A high degree of variability is obviously favourable, as
freely giving the materials for selection to work on…
Charles Darwin, The Origin of Species, Chap. 1.
Darwin was the first person to recognize clearly that
evolutionary change over time is the result of processes
acting on genetically controlled variability among
individuals within a population, which eventually cause
differences between ancestral and descendant populations.
Knowledge of the nature and causes of this variability is
crucial for an understanding of the mechanisms of
evolution, animal and plant breeding, and human genetic
diseases.
Classical and quantitative genetic studies
of variation
Classical genetics reveals the existence of discrete
polymorphisms in natural populations, but is necessarily
limited either to chromosomal rearrangements such as
inversions that can be detected cytologically, or to
conspicuous phenotypes such as eye colour or body colour
(flies carrying certain eye-colour mutations such as
cardinal can be found in natural populations).
Within a given species, only a handful of such
polymorphisms can easily be detected. Relatively few
cases of discrete polymorphisms affecting morphological
traits are known.
A human inversion
polymorphism
The classic polymorphism of
Drosophila pseudoobscura
Quantitative genetics reveals the existence of ubiquitous
genetic variation in metrical and meristic traits.
Most metric traits have a coefficient of variation (the ratio
of the standard deviation to the mean) of 5-10%.
Measurements of the resemblances between relatives show
that 20%-80% of the variance in such traits is typically due
to genetic factors.
This type of variation is of great evolutionary, medical and
economic significance, but measuring it does not tell us
anything about the details of its genetic control (numbers
of loci involved, frequencies of variant alleles, etc.).
Studies of concealed variability (revealed by inbreeding)
indicate the existence of low-frequency recessive alleles,
usually with deleterious effects, that are not normally
detectable in a large random-mating population.
The results of close inbreeding (e.g. by brother-sister
matings) are:
1. Reduced mean performance of a set of inbred lines,
with respect to traits like survival, fertility and growth rate.
2. Increased variability among lines, sometimes involving
abnormalities caused by single gene mutations.
While amply validating Darwin’s view that there is
plenty of variation available for evolution to utilize, this
evidence leaves two important questions unanswered:
(a) How much variation within a natural population is there
at an average locus? Classical genetics provides no
means of sampling loci at random from the genome,
without respect to their functional importance or level of
natural variability.
(b) To what extent does natural selection as opposed to
mutation and/or genetic drift control the frequencies of
allelic variants within populations? The classical
genetics bias towards genes with conspicuous phenotypic
effects means that strong selective forces are likely to be
operating. Such genes might well be unrepresentative of
the global picture.
Molecular genetics to the rescue
The solution to question (a) is to use the fact that genes
correspond to stretches of DNA that code for proteins.
If either the protein sequence corresponding to a gene, or
its DNA sequence, can be studied directly, then we can
look at variation within the population without having to
follow visible mutations, i.e. there is no need for prior
knowledge of the existence of variation.
We can also look at variation in non-coding sequences.
Electrophoretic variation
The first steps were taken in the mid-1960s by Lewontin
and Hubby, working in Chicago on the fruitfly Drosophila
pseudoobscura, and by Harris in London, working on
humans.
They used the technique of gel electrophoresis of proteins
to screen populations for variants in a large number of
soluble proteins controlled by independent loci, mostly
enzymes with well-established metabolic roles. The
proteins were chosen purely because they could be studied
easily.
The results of the early electrophoretic surveys were
startling: a large fraction (as high as 40%) of loci were
found to be polymorphic (i.e. they exhibited one or more
minority alleles with frequencies greater than 1%).
An average D. pseudoobscura individual was estimated to
be heterozygous at 13% of the 24 protein loci that had been
studied by 1974, i.e. a random individual sampled from the
population would be expected to have distinct maternal
and paternal alleles at 13% of its protein-coding loci.
Much lower levels of heterozygosity (or gene diversity: the
chance that two randomly chosen copies of a gene are
different) were found in mammals, and much higher levels
in bacteria.
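As a rough illustration of the two summary statistics used in these surveys, the sketch below computes gene diversity (the chance that two randomly chosen copies of a gene differ) and applies the 1% criterion for calling a locus polymorphic. The allele frequencies and the function names are hypothetical, chosen only for illustration; they are not data from the surveys described above.

```python
def gene_diversity(freqs):
    """Chance that two randomly chosen copies of a gene differ: 1 - sum(p_i^2)."""
    if abs(sum(freqs) - 1.0) > 1e-9:
        raise ValueError("allele frequencies must sum to 1")
    return 1.0 - sum(p * p for p in freqs)


def is_polymorphic(freqs, threshold=0.01):
    """True if at least one minority allele has a frequency above the threshold."""
    minority = sorted(freqs, reverse=True)[1:]  # all alleles except the commonest
    return any(p > threshold for p in minority)


# Hypothetical allele frequencies at three protein loci (illustration only).
loci = {
    "locus_1": [0.90, 0.10],
    "locus_2": [0.995, 0.005],
    "locus_3": [0.60, 0.30, 0.10],
}

for name, freqs in loci.items():
    print(name, round(gene_diversity(freqs), 3), is_polymorphic(freqs))
```

Averaging the gene diversity over all loci surveyed gives the kind of mean heterozygosity figure quoted above.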
This work conclusively refuted the view that loci are only
rarely polymorphic.
However, it raised more questions than it answered. In
particular, there were several biases in the data. Only
soluble proteins could easily be studied, and amino-acid
changes that do not affect the mobility of proteins on gels
are not detected by electrophoresis.
Similarly, any changes in the DNA that do not affect the
protein sequence go undetected.
DNA sequence variation
The advent in the late 1970s of methods for cloning and
sequencing of DNA meant that studies of natural variation
could be carried out at the DNA level. This eliminates
virtually all the possible biases in quantifying variability.
With the advent of PCR amplification for isolating specific
regions of the DNA, and with relatively cheap automated
sequencing, this is now the method most commonly used
in surveys of variation.
Efforts are currently under way in D. melanogaster to scale
this “resequencing” up to the whole genome level.
The pioneering work on directly comparing homologous
DNA sequences sampled within a species was carried out
by Martin Kreitman in Lewontin’s lab at Harvard in the
early 1980s.
Kreitman sequenced 11 independent copies (alleles) of the
Adh (alcohol dehydrogenase) gene of D. melanogaster,
isolated from collections made around the world. He
sequenced 2379 bases from each of these alleles, an heroic
effort in those days.
His work succeeded in:
• Demonstrating a high level of variability at the level of
individual nucleotide sites, a factor of ten or so higher than
would have been expected from the typical level of
heterozygosity for protein polymorphisms
• Showing that nearly all of this variability involved silent
changes that did not affect protein sequences, i.e. the
changes were either in regions that did not code for
amino-acids or involved synonymous changes in codons.
• The only amino-acid polymorphism detected was that
already known to cause the difference between the fast (F)
and slow (S) electrophoretic alleles of Adh.
Kreitman’s Adh Results

                              Intron 1   Coding Region   3' Non-Transcr.
% Silent Sites Segregating    1.7        6.7             0.6
No. Sites                     654        765             767

No non-silent substitutions found (other than F/S): 39 are
expected if variability were the same as for silent sites.
These results demonstrate that the protein sequence is
highly constrained by selection, i.e. most mutations
affecting the amino-acid sequence of a protein cause
selectively disadvantageous changes to its functioning, and
are eliminated rapidly from the population.
Most variation that is detected in coding sequences
(typically over 85% in Drosophila) thus involves
synonymous variants. Non-coding region variation shows
a similar level to synonymous variation.
These results suggest that most variation and evolution at
the DNA level may be due to neutral or nearly neutral
mutations, whose fate is controlled by genetic drift rather
than selection, especially as much of the genome is
non-coding, even in Drosophila.
How to measure DNA sequence variation
Allele 1
ATGCTTAGCGTTGGCATCCTAGCGATCGAG
Allele 2
ATGCTTGGCGTTGGCATCCTAGCGATCGAG
Allele 3
ATACTTAGCGTTGGCATCCTCGCGATTGAG
The nucleotide site diversity (π) for a given set of alleles
sampled from a population is the frequency with which a
randomly chosen pair of alleles differ at a given site.
It can be calculated from data on a sample of homologous
DNA sequences, by determining the sum of the numbers of
differences between all possible pairs of sequences.
The result is divided by the product of the number of
pairs of sequences that were compared (this equals n(n-1)/2, if
there are n independent alleles), and the number of bases
studied.
In the example, n= 3, so n(n-1)/2 = 3.
The total number of pairwise differences between all 3
combinations of sequences is 1 + 3 + 4 = 8.
To get the pairwise diversity per site, we divide this by 3
times the number of sites, so that
π = 8/(3 × 30) = 0.089
An alternative method of measuring variation is simply by
counting the number of sites that are segregating in the
sample, S.
By dividing S by the product of the number of bases in the
sequence and the sum
a = 1 + 1/2 + 1/3 + ... + 1/(n - 1)
we obtain a statistic called Watterson’s θw.
If the population is at equilibrium and there is no selection,
θw is expected to be similar in value to π.
In the example, we have S = 4, and a ...
Hence:
θw = 4/(30 × 1.5) = 0.089
Under the neutral theory of evolution, variability
in DNA sequences reflects the balance between
the input of new variants by mutation and their
loss by random fluctuations in frequencies caused
by finite population size (genetic drift).
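The worked example of the two diversity statistics (π and Watterson's θw) for the three sample alleles can be reproduced with the short sketch below. It is only an illustration: the function names are not from the slides, and Allele 2 is written out to the full 30 bases, which is an assumption chosen to agree with the stated pairwise differences (1 + 3 + 4 = 8) and S = 4.

```python
from itertools import combinations

# The three aligned alleles of the example. Allele 2 is given all 30 bases
# here (an assumption consistent with the stated counts of differences).
alleles = [
    "ATGCTTAGCGTTGGCATCCTAGCGATCGAG",
    "ATGCTTGGCGTTGGCATCCTAGCGATCGAG",
    "ATACTTAGCGTTGGCATCCTCGCGATTGAG",
]


def pairwise_differences(a, b):
    """Number of sites at which two aligned sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def nucleotide_diversity(seqs):
    """Pairwise diversity per site: total differences / (pairs * sites)."""
    n, length = len(seqs), len(seqs[0])
    total = sum(pairwise_differences(a, b) for a, b in combinations(seqs, 2))
    n_pairs = n * (n - 1) // 2
    return total / (n_pairs * length)


def segregating_sites(seqs):
    """Number of alignment columns with more than one base."""
    return sum(len(set(column)) > 1 for column in zip(*seqs))


def watterson_theta(seqs):
    """S divided by (a * number of sites), with a = 1 + 1/2 + ... + 1/(n-1)."""
    n, length = len(seqs), len(seqs[0])
    a = sum(1.0 / i for i in range(1, n))
    return segregating_sites(seqs) / (a * length)


print(round(nucleotide_diversity(alleles), 3))
print(segregating_sites(alleles))
print(round(watterson_theta(alleles), 3))
```

With only three sequences the two statistics are necessarily identical, so a larger sample is needed before a difference between them can carry any information.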
Under this model, variant frequencies at a locus are always
shifting around, but a statistical equilibrium will eventually
be reached if population size stays constant.
The expected value of the pairwise diversity in the
population is then given by:
θ = 4Nem
where m is the neutral mutation rate per site, and Ne is the
effective population size, which controls the rate of genetic
drift.
The expected values of both π and θw are equal to θ.
Estimates of θ have now been obtained from many
different kinds of organisms, by sampling sets of
homologous genes from natural populations and
sequencing them.
Rough average values over many genes for silent
nucleotide site diversity are as follows:
• Escherichia coli (bacterium): 0.05
• Drosophila melanogaster (African): 0.02
• Homo sapiens: 0.001
Knowledge of m enables us to estimate Ne from θ.
For example, with m = 4 × 10^-9, and θ = 0.02, we
obtain Ne = 1.25 × 10^6.
Drosophila effective population sizes are therefore
very large.
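The arithmetic behind this estimate is just a rearrangement of the neutral expectation (diversity = 4Nem). A minimal sketch, using the Drosophila values quoted above; the function name is illustrative only:

```python
def effective_population_size(diversity, mutation_rate):
    """Rearrange diversity = 4 * Ne * m to give Ne = diversity / (4 * m)."""
    return diversity / (4.0 * mutation_rate)


# Values quoted in the text for Drosophila silent sites.
ne = effective_population_size(diversity=0.02, mutation_rate=4e-9)
print(f"Ne = {ne:.3g}")
```

The same function applies to other species only if a mutation rate appropriate to that species is supplied; the value of m used here is the one given for Drosophila.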
Detecting Selection
One of the major goals of evolutionary genetics is to
understand to what extent selection, as opposed to neutral
forces of mutation and genetic drift, controls variation and
evolution in DNA and protein sequences.
The methods for doing this often involve combining data
on sequence divergence between species with data on
polymorphism within species.
Forms of selection
• Purifying selection, which acts to prevent the spread of
deleterious mutations, e.g. those affecting the amino-acid
sequences of proteins.
• Positive directional selection, which causes an adaptive
mutation to spread through a species
• Balancing selection, which maintains alternative variants in
the population
Directional and balancing selection are often collectively
referred to as positive selection.
Use of sequence divergence data
The simplest situation is when we have two homologous
(aligned) DNA sequences from a pair of related species.
For the purpose of discussion, assume that all evolutionary
change occurs by nucleotide substitutions, i.e. the sequence
differences are caused entirely by one nucleotide base
changing into another by mutation.
This is usually the case for coding sequences, since
insertions or deletions cause disruption of functionality.
[Figure: Species 1 and Species 2, each separated from their common ancestor by time T]
The total time separating a pair of sequences from the two
species is 2T
Neutral sequence evolution
Under neutral evolution, the divergence K (the number of
nucleotide substitutions per site separating the two sequences)
is expected to be equal to the mutation rate (m) times the
total time separating the two species, i.e.
K=2mT
The simplest way to understand this is to note that, under
neutral evolution, the expected number of mutations that
distinguish a pair of sequences is equal to the time
separating them (2T) times the rate of mutation per unit
time (m).
We compare K values for nucleotide sites where mutations
can reasonably be assumed to be neutral or nearly neutral
with K for sites where we wish to test for selection; larger
than neutral K values indicate directional selection, and
smaller than neutral K values indicate purifying selection.
Nonsynonymous sites are usually used as the candidates
for selection, but there is increasing use of defined types of
non-coding sequences.
Evidence for pervasive purifying selection
This comes from the fact that both K and π for
nonsynonymous variants are nearly always much
smaller than for synonymous and noncoding
sites.
Statistics on diversity and divergence in D. miranda
(species 1: 18 loci) and D. pseudoobscura (species 2: 14 loci)

        πA1              πS1              πA2              πS2            KA             KS
Means   0.088            0.478            0.206            2.73           2.48           22.2
        (0.044 / 0.141)  (0.342 / 0.626)  (0.124 / 0.300)  (2.31 / 3.14)  (1.30 / 3.76)  (19.9 / 24.8)

All values are percentages.
Divergence (K) is measured between D. miranda and D. affinis
(KS between D. miranda and D. pseudoobscura is 3.5%).
L. Loewe et al. 2006 Genetics 172: 1079-1092.
Divergence of mel-sim introns
[Figure: divergence (0 to 0.35) plotted against length of intron (base pairs; 10 to 100,000), shown separately for first introns and non-first introns]
P. Haddrill et al. 2005 Genome Biol. 6: R67. 1-8.
Effects of deleterious mutations on fitness
• There are clearly a lot of deleterious mutations
entering the population each generation, most of
which will eventually be eliminated by selection
• While the mean level of variability is much lower
for nonsynonymous than synonymous mutations,
this could simply mean that all the deleterious
ones are rapidly removed by selection, so that the
amino-acid variants that we see segregating are in
fact selectively neutral.
• It is a topic of current research to try to estimate the
distribution of selection coefficients on deleterious
amino-acid and silent variants in natural populations
• Estimates for amino-acid variants indicate a wide
distribution, such that the mean selection coefficient
against a heterozygous non-synonymous variant is of the
order of 10^-5
• Values for synonymous or silent variants are much smaller,
of the order of 10^-6.
Positive directional selection
Faster divergence in coding than non-coding sequences suggests
positive selection
[Figure: divergence (0 to 0.2) in the homeodomain, non-homeodomain and intron regions, for the species comparisons mel-sim, mel-mau and sim-mau, labelled as distantly related and closely related species]
• In the OdsH gene of three Drosophila species,
divergence in the homeodomain is highly
significantly accelerated
• This directly suggests selection
C. Ting et al. 1998 Science 282:1501-1504
The McDonald-Kreitman test
• Compares non-synonymous and synonymous site
divergence between species, and non-synonymous and
synonymous site diversity within species, in the same gene
• If variants at both kinds of sites were neutral, the numbers
of substitutions at the two kinds of sites between two
species should be in the same ratio as the polymorphism
within either species, assuming equilibrium between drift
and mutation:
Neutral divergence = 2Tm
Neutral diversity = 4Nem
• If the ratio of non-synonymous variants to synonymous
variants for differences between species is greater than the
ratio for within-species variation, this suggests positive
directional selection
• If the opposite is the case, either purifying selection or
balancing selection is acting
Centromeric histone protein evolution
• Alignment of the Cid proteins of five melanogaster subgroup
species with histone H3 proteins from D. melanogaster (2.3
million years divergence) with E. histolytica (> 1 billion years
divergence)
– The most divergent histone H3 sequences have >75% identity to each
other, whereas centromeric H3-like proteins are much more diverged
(35–50% identical to histone H3).
Sliding window analysis of Cid
50-nucleotide (nt) window, in steps of 10 nt, using all sites
[Figure: π or K along the gene, showing intraspecific polymorphism within D. simulans (π) and interspecific divergence (K); the N-terminal tail region shows mostly nonsynonymous substitutions and the C-terminal core mostly synonymous substitutions]
Evidence for adaptive evolution in D.
melanogaster & simulans Cid
• Polymorphism was studied in D. melanogaster (15 strains) and D. simulans
(8 strains), and divergence between them
• Non-synonymous: synonymous (N:S) ratios differ significantly (P < 0.0025)
– For divergence between the species = 18:10
– For pooled polymorphic sites within the two species = 9:28
• McDonald-Kreitman test for the D. melanogaster lineage (box):
P < 0.006
H. Malik & S. Henikoff 2001 Genetics 157: 1293-1298
Fixed diffs Polymorphic sites
Non-syn
Synonymous
8
4
0
9
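The four counts in the box (8, 4, 0, 9) form a 2 × 2 McDonald-Kreitman table, and one common way to test such a table is Fisher's exact test. The sketch below is an illustration only: the slides do not say which test produced P < 0.006, so the choice of Fisher's exact test is an assumption, as is the reading of the box as fixed differences (non-synonymous 8, synonymous 4) and polymorphic sites (non-synonymous 0, synonymous 9). The p-value is the same for the table and its transpose, so it does not depend on which of the two plausible readings of the flattened box is correct.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability that the top-left cell equals x,
        # given the observed row and column totals.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_obs = prob(a)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    # Sum the probabilities of all tables no more likely than the observed one.
    return sum(prob(x) for x in range(lo, hi + 1) if prob(x) <= p_obs * (1 + 1e-9))


# Assumed layout: rows = fixed differences, polymorphic sites;
# columns = non-synonymous, synonymous.
p_value = fisher_exact_two_sided(8, 4, 0, 9)
print(f"two-sided p = {p_value:.4f}")
```

A significant result with an excess of non-synonymous fixed differences, as here, is the pattern the McDonald-Kreitman test interprets as positive directional selection.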
Using data on many different genes, methods have
been developed to use the McDonald-Kreitman
approach to estimate what fraction of amino-acid
differences between D. melanogaster and D.
simulans are caused by directional selection.
This fraction is of the order of 25%, a surprisingly
high value.
N. Bierne & A. Eyre-Walker 2004 Mol. Biol. Evol. 21: 1350-1360.
Indirect evidence for selection: selective sweeps
• After an advantageous mutation has spread through a
population, the level of polymorphism will be reduced across
the region (i.e. at closely linked neutral sites)
• This is because a unique selectively favourable mutation may
arise at a site in a DNA sequence that is completely linked to
a polymorphic variant segregating in a population
J. Maynard Smith & J. Haigh 1974 Genet. Res. 23: 23-35.
A selective sweep fixes variants linked to the selected site
It is a form of hitch-hiking:
• as the black (advantageous) variant increases in frequency
in a population, it causes low diversity at closely linked
sites in a sequence (white circles)
A recent selective sweep is detectable if the time since
selective substitution is sufficiently small (around 0.25Ne
generations), but there is a lot of noise
Indirect evidence for selection: statistics of
variant frequency distributions
• It is also possible to work out the frequencies at which
variants are expected to be found in equilibrium populations,
under both neutrality and selection
– Under neutrality, most variants are expected to be quite rare
• If selection is operating on the sequence, it will affect the
frequencies of variants in the sample
– This forms the basis for some tests for selection, and methods for
estimating the intensity of selection.
• Assuming neutrality and equilibrium, the expected value of
both π and θw = 4Nem
• If π ≠ θw, it suggests the possibility of selection
– If there are excess rare variants, compared with what is
expected under neutrality, this suggests purifying selection
– Excess high frequency variants might suggest balancing
selection or the presence of advantageous mutations
spreading in the population
• BUT there are two problems
– We have to test whether the difference could be produced
by chance
– The population may not have been constant in size, as
assumed in the model, and so its demographic history may
cause π ≠ θw
Statistical tests must be used!
• Things we estimate from a sample may look very different
from the average that is expected
• Statistical tests are necessary to decide whether a sample
could not have arisen by a process of neutral mutation and
drift. Only if we can say this, can we conclude that
something such as selection has affected the sequences.
• Neutrality is used as a null hypothesis
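A widely used test statistic for the difference between the pairwise diversity and Watterson's estimate is Tajima's D, which is referred to on later slides. The sketch below follows the standard published formula (Tajima 1989) as far as I am aware; the slides themselves do not give it, the function names are illustrative, and the sample (5 sequences, 4 segregating sites) is hypothetical. It takes the total number of segregating sites and the mean number of pairwise differences for the region, not per-site values.

```python
from math import sqrt


def tajimas_d(n, seg_sites, mean_pairwise_diffs):
    """Tajima's D from sample size, segregating sites and mean pairwise differences."""
    if n < 4:
        raise ValueError("need at least 4 sequences (the variance is zero for n = 3)")
    if seg_sites < 1:
        raise ValueError("need at least one segregating site")
    a1 = sum(1.0 / i for i in range(1, n))
    a2 = sum(1.0 / i**2 for i in range(1, n))
    b1 = (n + 1) / (3.0 * (n - 1))
    b2 = 2.0 * (n * n + n + 3) / (9.0 * n * (n - 1))
    c1 = b1 - 1.0 / a1
    c2 = b2 - (n + 2) / (a1 * n) + a2 / a1**2
    e1 = c1 / a1
    e2 = c2 / (a1**2 + a2)
    variance = e1 * seg_sites + e2 * seg_sites * (seg_sites - 1)
    return (mean_pairwise_diffs - seg_sites / a1) / sqrt(variance)


def mean_pairwise_from_counts(n, minor_counts):
    """Mean pairwise differences, given the count of the rarer base at each site."""
    pairs = n * (n - 1) / 2.0
    return sum(j * (n - j) for j in minor_counts) / pairs


# Hypothetical sample of 5 sequences with 4 segregating sites.
n = 5
d_rare = tajimas_d(n, 4, mean_pairwise_from_counts(n, [1, 1, 1, 1]))  # all singletons
d_mid = tajimas_d(n, 4, mean_pairwise_from_counts(n, [2, 2, 2, 2]))   # intermediate frequencies
print(round(d_rare, 2), round(d_mid, 2))
```

An excess of rare variants gives a negative D and an excess of intermediate-frequency variants a positive D; whether a given value is unusual still has to be judged against the neutral null distribution, as the points above emphasise.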
The spread of an advantageous mutation
affects diversity very much like a bottleneck,
but only on the region around the gene
Extreme bottleneck: one haplotype present, then new
neutral variants occur; π < θw, negative Tajima’s D
Fixed advantageous mutation: one haplotype selected, then new
neutral variants occur; π < θw, Tajima’s D < 0
Evidence for a selective sweep
on the neo-X chromosome
of D. miranda
D. Bachtrog 2003 Nat. Genet. 34:
215-219.
Genome scans for selective sweeps
There is currently a lot of interest in using scans of
variability across the genome, to look for patterns that
suggest a recent selective sweep.
The hope is that this will lead to identification of the
mutations that have been favoured by selection.
One subject of study is non-African populations of D.
melanogaster and D. simulans, which are believed to have
originated relatively recently (10,000 years ago??) from
ancestral African populations.
They must have adapted to their new environments. It
should be possible to see which regions of the genome
show evidence of selective sweeps.
The problem is that they have also gone through
bottlenecks of small population size, which have similar
effects to sweeps, but act across the whole
genome.
Relative values of microsatellite (A) and sequence diversity (B) in
non-African and African populations of D. melanogaster
B. Harr et al. (2002) Proc. Natl. Acad. Sci. USA 99, 12949-12954
Scan of 250 non-coding sequences of approximately 500 bp each
across the X chromosome of D. melanogaster
(L. Ometto et al. 2005 M.B.E. 22: 2119-2130)
Q is the probability of getting as many as the observed number of
polymorphisms in the European sample on a bottleneck model
Empty and filled circles indicate significantly negative or positive Tajima’s D.
Some recent research problems in my lab
• What is the typical magnitude of selection on
mutations that alter codon usage?
• Are non-coding sequences evolving neutrally?
• The genetic code is degenerate; there are at least
two codons for each amino-acid except
methionine and tryptophan
• The 3rd coding position is often redundant, so that
at least some changes in it frequently result in no
change in the protein sequence
• It might be thought that synonymous changes would have
no effect on fitness, so that such changes could be treated
as selectively neutral
• If this is so, the frequency with which codons
corresponding to a particular amino-acid are used should
correspond to the frequencies with which they would be
expected to be produced by randomly combining their
constituent nucleotides
• It quickly became apparent in the early days of DNA
sequencing that this was not the case, and that there is
considerable codon usage bias in many species
• The proportion of codons in a gene that are preferred
(major codons) provides an index of overall codon bias
(major codon usage or MCU)
• A variant of this method has become popular with the
advent of databases of levels of gene expression to identify
codons that are more frequently used in genes with high
levels of expression
• These are often called optimal codons, and the frequency
of optimal codons in a gene is known as Fop. This term is
now often used for MCU
• An important observation is that there is a general
tendency for patterns of codon usage to be fairly consistent
across different genes in the genome i.e. the same codons
are preferred in different genes, although the level of bias
varies considerably, and there are differences between
species in the nature of the preferred codons
• General levels of codon usage are well-conserved
evolutionarily
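To make the index concrete, the sketch below computes the frequency of optimal codons (Fop) for a coding sequence. It is a simplified illustration: the set of optimal codons is a hypothetical placeholder (a real analysis would use a species-specific table), the function name is invented, and the choice to skip methionine, tryptophan and stop codons is an assumption based on the fact that they offer no synonymous alternative.

```python
# Hypothetical optimal codons, for illustration only.
OPTIMAL_CODONS = {"CTG", "GCC", "AAG", "GGC", "ATC"}

# Codons with no synonymous alternative (Met, Trp) and the stop codons.
EXCLUDED_CODONS = {"ATG", "TGG", "TAA", "TAG", "TGA"}


def frequency_of_optimal_codons(cds, optimal=OPTIMAL_CODONS):
    """Proportion of informative codons in a coding sequence that are optimal."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore any trailing partial codon
    codons = [cds[i:i + 3] for i in range(0, usable, 3)]
    informative = [c for c in codons if c not in EXCLUDED_CODONS]
    if not informative:
        raise ValueError("no informative codons in sequence")
    return sum(c in optimal for c in informative) / len(informative)


# Made-up coding sequence: ATG CTG GCC AAA GGC ATT TGG TAA
example_cds = "ATGCTGGCCAAAGGCATTTGGTAA"
fop = frequency_of_optimal_codons(example_cds)
print(round(fop, 2))
```

In this made-up sequence five codons are informative and three of them are in the optimal set.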
Gene        Fop (D. pseudoobscura)   Fop (D. melanogaster)
bcd         0.50                     0.52
Bruce       0.55                     0.56
ftz         0.60                     0.56
Gld         0.60                     0.57
hb          0.55                     0.50
hyd         0.21                     0.25
nop56       0.59                     0.64
rh1         0.57                     0.62
rp49        0.67                     0.75
sry-alpha   0.42                     0.57
T1          0.66                     0.63
Xdh         0.64                     0.61
ade3        0.48                     0.48
Adh         0.66                     0.69
Adh-dup     0.56                     0.57
amd         0.53                     0.55
Ddc         0.60                     0.62
dpp         0.44                     0.41
Eno         0.76                     0.77
Gpdh        0.48                     0.46
Lam         0.64                     0.64
smo         0.50                     0.52
Uro         0.64                     0.64
cyp1        0.66                     0.67
Annx        0.69                     0.65
Est-5B      0.42                     0.40
Gapdh2      0.33                     0.40
Hsp82       0.67                     0.70
scute       0.63                     0.61
sesB        0.66                     0.72
sisA        0.63                     0.56
Sod         0.69                     0.73
swallow     0.58                     0.55
Average     0.57                     0.58
These facts suggest that the forces affecting the use of
preferred codons mainly operate across the whole
genome, rather than being specific for individual genes,
although the magnitude of these forces varies
considerably.
The evolution of codon usage bias
• In most species there is substantial variation at
synonymous nucleotide sites, even in genes with high
levels of codon usage bias (of the order of 1-2%
diversity per site in many Drosophila species)
• This means that any selection on codon usage must be
weak in relation to other evolutionary factors, such as
genetic drift and mutation.
• In order to understand codon bias, we need population
genetic models that take all three factors into account
Modelling codon usage evolution
(the Li-Bulmer model)
The simplest model that can be made is for a
random-mating population with a large number of independently
evolving sites
Each site has two alternatives: preferred and unpreferred
codons (A versus a)
Evolutionary forces
• Selection for preferred over unpreferred codons
• Mutation in either direction (preferred to
unpreferred, and vice-versa).
• Genetic drift (random sampling of allele
frequencies). Its effectiveness is inversely related
to the effective population size (Ne )
• Selection is less effective at preventing deleterious
mutations from becoming polymorphic than at preventing
them from spreading to fixation.
• It was suggested in 1995 by Hiroshi Akashi that
this result could be used to test for present-day
selection on codon usage
• This requires a species in which synonymous
single nucleotide polymorphisms at numerous
codons exist, and in which the ancestral state of
each SNP can be inferred
• Polymorphic mutations can then be classified as
preferred to unpreferred (P→U) or unpreferred to preferred (U→P)
• In addition, we need to identify fixed differences
from a related species as P→U or U→P, to
check whether codon bias is in evolutionary
equilibrium.
These differences are assumed to have
accumulated in the two focal species since the
split between them
• If codon usage is in equilibrium, the numbers of
fixations in the two directions must be equal
• Since selection has less of an effect on
polymorphic mutations than fixations, we thus
expect a deficiency of U→P polymorphisms, and
an excess of P→U polymorphisms
• Mutational bias and mutation rates do not affect
these statistics, if codon usage is in equilibrium
The species of choice
We have been using three Drosophila species for this
purpose:
D. miranda is used for the polymorphism study
D. pseudoobscura is a very close relative (less than 4%
silent site divergence from miranda)
D. affinis is a more distant outgroup species (about 23%
silent site divergence from the other two)
Codons were classified as preferred (P) versus unpreferred
(U), using Akashi’s codon usage table for D.
pseudoobscura.
Polymorphism/divergence for codon usage
changes for 18 X and autosomal genes

              P→U     U→P
Fixed         19      12
Polymorphic   37      6
rpd           1.95    0.50

Ratio of rpd values = 3.9
C. Bartolomé et al. 2005 Genetics 169: 1495-1507
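The rpd values in this table appear to be simply the number of polymorphic changes divided by the number of fixed changes in each direction; that reading is inferred from the numbers themselves, which it reproduces. A minimal sketch of the arithmetic, with illustrative variable names:

```python
# Counts of codon usage changes from the table (P = preferred, U = unpreferred).
counts = {
    "P->U": {"fixed": 19, "polymorphic": 37},
    "U->P": {"fixed": 12, "polymorphic": 6},
}


def rpd(entry):
    """Ratio of polymorphic to fixed changes for one direction of change."""
    return entry["polymorphic"] / entry["fixed"]


rpd_pu = rpd(counts["P->U"])
rpd_up = rpd(counts["U->P"])
ratio = rpd_pu / rpd_up

print(round(rpd_pu, 2), round(rpd_up, 2), round(ratio, 1))
```

A ratio well above 1 is the excess of P→U polymorphism, relative to fixation, that is expected if unpreferred codons are weakly selected against.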
For a sample of n homologous sequences from the
population, the expected fraction of P→U
polymorphisms among both P→U and U→P
polymorphisms is:
$$\frac{u p I_0}{u p I_0 + v (1 - p) I_1}$$
where:
I0 is the probability that a P→U polymorphism is
detected in the sample;
I1 is the probability of detecting a U→P polymorphism;
p is the proportion of P codons in the sequence;
u and v are the mutation rates for P→U and U→P
changes
$$I_i = \int_{1/(2N)}^{1} \left\{ 1 - x^n - (1 - x)^n \right\} \phi_i(x) \, dx$$
$$\phi_0(x) \propto \left[ x (1 - x) \right]^{-1} \left( 1 - \exp \gamma \right)^{-1} \left\{ 1 - \exp \left[ \gamma (1 - x) \right] \right\}$$
with γ = 4Net.
If the Li-Bulmer formula for equilibrium p is substituted into
this equation, the expected fraction of P→U polymorphisms
takes the simple form:
$$\frac{I_0}{I_0 + I_1 e^{-\gamma}}$$
i.e. the proportion of P→U polymorphisms depends only on γ =
4Net.
This allows us to use maximum likelihood to estimate the
value of γ and its approximate 95% confidence limits.
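A numerical sketch of this relation is given below. It rests on several assumptions that are not stated on the slides: the frequency density for a variant with scaled selection γ is taken to be the standard diffusion result, proportional to [1 - exp(-γ(1 - x))] / [x(1 - x)(1 - exp(-γ))], with -γ used for P→U changes and +γ for U→P changes; the lower limit 1/(2N) is replaced by 0, which is harmless because the integrand stays finite there; and the sample size of 12 is an arbitrary placeholder. The function names are illustrative only.

```python
from math import exp


def frequency_density(x, gamma):
    """Relative density of variants at frequency x under scaled selection gamma."""
    if abs(gamma) < 1e-9:
        return 1.0 / x  # neutral limit of the expression below
    return (1.0 - exp(-gamma * (1.0 - x))) / (
        x * (1.0 - x) * (1.0 - exp(-gamma))
    )


def detection_integral(gamma, n, steps=20000):
    """Midpoint-rule integral of {1 - x^n - (1 - x)^n} * density over (0, 1)."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        x = (k + 0.5) * h
        detectable = 1.0 - x**n - (1.0 - x) ** n  # variant is polymorphic in the sample
        total += detectable * frequency_density(x, gamma)
    return total * h


def expected_fraction_pu(gamma, n):
    """Expected fraction of P->U among all polymorphisms: I0 / (I0 + I1 * exp(-gamma))."""
    i0 = detection_integral(-gamma, n)  # P->U changes are selected against
    i1 = detection_integral(gamma, n)   # U->P changes are favoured
    return i0 / (i0 + i1 * exp(-gamma))


sample_size = 12  # arbitrary placeholder, not the sample size of the study
for gamma in (0.0, 1.0, 2.5):
    print(gamma, round(expected_fraction_pu(gamma, sample_size), 3))
```

Maximum likelihood estimation then amounts to finding the value of γ for which this expected fraction best matches the observed counts of P→U and U→P polymorphisms, treating them as a binomial sample.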
• For all 18 genes together, the maximum likelihood estimate
of γ (= 4Net) was 2.5 (2-unit support limits 1.5 - 3.8).
• This value is not significantly different from those
obtained after dividing the dataset into two groups of
genes with low bias (Fop < 0.60, γ = 2.6) and high bias
(Fop > 0.63, γ = 2.2).
• This lack of an apparent difference may reflect the
limited range of Fop values; the average Fop values for
the low and high bias groups were 0.50 ± 0.024 and
0.66 ± 0.009, respectively.
• These results suggest that Net for mutations changing
codon usage in D. miranda is between 0.38 and 0.96, with
an ML value of 0.62
• Silent polymorphism data suggest an Ne of about 800,000
for D. miranda. The selection coefficient s is thus about
8 × 10^-7
• This is much lower than previous estimates of Net by
Akashi and coworkers for D. simulans and D. pseudoobscura
(around 1 or more)
• It agrees well with an estimate using the same approach for
D. americana
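The selection coefficient quoted above follows from dividing the estimate of Net by the estimate of Ne. A minimal sketch of that arithmetic, assuming the selection coefficient is simply the scaled value divided by Ne; the variable names are illustrative:

```python
def selection_coefficient(scaled_selection, ne):
    """Divide the scaled estimate (Ne times the selection coefficient) by Ne."""
    return scaled_selection / ne


ne_miranda = 800_000  # from silent polymorphism data, as quoted above

s_ml = selection_coefficient(0.62, ne_miranda)     # maximum likelihood value
s_lower = selection_coefficient(0.38, ne_miranda)  # lower limit
s_upper = selection_coefficient(0.96, ne_miranda)  # upper limit

print(f"{s_lower:.2g} to {s_upper:.2g}, ML {s_ml:.2g}")
```

Selection coefficients of this size are far too small to measure directly, which is why they have to be inferred from patterns of polymorphism and divergence.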
GC to AT changes

                  GC->AT    AT->GC
Coding
  Fixed           30        12
  Polymorphic     48        4
  rpd             1.60      0.33
Non-coding
  Fixed           16        22
  Polymorphic     13        9
  rpd             0.81      0.41

rc = 4.80
rnc = 1.99
• Similar methods to those applied to P and U codons can be
applied to GC content at 3rd coding positions (GC3); to
explain the observed mean value of 69% with the
estimated level of selection requires a mutational bias of
over 3-fold in favour of GC to AT mutations
• This predicts a GC content of 23% for non-coding
sequences, if these are evolving neutrally, as opposed to an
observed value of around 36%
• The implication is that non-coding sequences are subject to
non-neutral evolution, despite our failure to detect it.
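The link between mutational bias, selection and equilibrium base composition can be sketched with the Li-Bulmer equilibrium, written here as p = 1 / (1 + κ exp(-γ)), where κ is the mutational bias and γ the scaled selection. The assumptions in this sketch are mine: GC is treated as the favoured state, κ is taken as the ratio of GC→AT to AT→GC mutation rates, and γ is back-calculated from the two percentages quoted above, since the slides do not state the value of γ used for GC3.

```python
from math import exp, log


def equilibrium_frequency(gamma, kappa):
    """Li-Bulmer equilibrium frequency of the favoured state."""
    return 1.0 / (1.0 + kappa * exp(-gamma))


neutral_gc = 0.23    # predicted GC content with no selection (gamma = 0)
observed_gc3 = 0.69  # observed mean GC content at 3rd coding positions

# With gamma = 0 the equilibrium is 1 / (1 + kappa), so kappa follows directly.
kappa = (1.0 - neutral_gc) / neutral_gc

# Scaled selection needed to raise the equilibrium from 23% to 69%.
gamma = log(kappa * observed_gc3 / (1.0 - observed_gc3))

print(round(kappa, 2), round(gamma, 2))
```

The bias that comes out is a little over 3-fold, in line with the statement above, and the implied scaled selection on GC3 is of the same order as the estimate for preferred codons.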
Formation of a neo-Y chromosome
• The two autosomal copies in males segregate with
the sex chromosomes in the first division of
meiosis, in such a way that one always
accompanies the X into a sperm, and the other
accompanies the Y.
• The lack of crossing over in male Drosophila
means that the neo-Y chromosome is immediately
placed in a genetic environment that is identical to
that of the true Y chromosome.
From: Bachtrog & Charlesworth (2002) Nature 416: 323-326.
Relaxed selection on codon usage
Fixations were assigned to the neo-X and neo-Y branches,
subsequent to the neo-X/neo-Y split
          Neo-X    Neo-Y
P→U       15       47
U→P       7        4

p = 0.014
Bartolomé and Charlesworth 2006 Genetics 174:2033-2044
Polymorphisms on the neo-X
versus the neo-Y
On a Mantel-Haenszel test, there is a
significant excess (p <0.001) of nonsynonymous relative to silent
polymorphisms on the neo-Y compared
with the neo-X, indicating a relaxation of
purifying selection on the neo-Y.
ACKNOWLEDGEMENTS
• THE HARD EXPERIMENTAL WORK: Doris Bachtrog, Carolina
Bartolomé, and Soojin Yi
• HELP WITH FLY-COLLECTING: Deborah Charlesworth
• PROVISION OF LAB FACILITIES ON COLLECTING TRIP:
Dan Barbash, Chuck Langley
• IDENTIFICATION OF MIRANDA STRAINS: Doris Bachtrog
• TECHNICAL ASSISTANCE: Helen Borthwick and Helen Cowan
• MONEY: BBSRC, Royal Society
• THEODOSIUS DOBZHANSKY: for discovering D. miranda 71
years ago, and for the posthumous loan of his field microscope