9 Neutrality Tests
Up to now, we calculated different things from various models and compared our findings
with data. But to be able to state, with some quantifiable certainty, that our data do not
fit with the predictions, we need statistics. That means that it is not enough to describe
phenomena, we need to quantify them and assign probabilities to observed data to see
how probable they are under certain scenarios, such as neutrality or population expansion.
Next we will deal with neutrality tests, which measure if data deviate from the expectations
under a neutral model with constant population size. The first three tests we introduce,
each named after its test statistic, are Tajima's D, Fu and Li's D and Fay and Wu's H.
These are based on data from a single population (plus one line of an outgroup to see
which states are ancestral and which derived). We then deal with the HKA test and the
McDonald-Kreitman test, which both use data from two or more species or populations.
The statistical procedure is very general; only the methods used, i.e. what to compute
from the data, are specific to population genetics. Therefore we start with some general
statistics.
9.1 Statistical inference
Let us start with a brief summary about statistical testing: Statistical testing starts by
stating a null hypothesis, H0 . A test statistic, T , is chosen. If the value of T , calculated
from the data, falls within a certain range of values called the critical region, R, then the
null hypothesis H0 is rejected. The size of the test is α = P[T ∈ R | H0 ]. If the test
statistic is scalar (this is most often the case, and then one can say whether the statistic
is smaller or bigger than a certain value), we can calculate a p-value, p = P[T ≥ t | H0 ],
where t is the observed value of the test statistic. What does all of this mean? And more
importantly, why do we do statistical testing this way?
Firstly, the null hypothesis H0 . This has to be a mathematically explicit hypothesis.
“There has been an effect of natural selection on my data” is not mathematically explicit,
but unfortunately “There has not been an effect of natural selection on my data” is not
sufficiently explicit either. Mathematical models are required for statistical testing. An
example of a sufficiently explicit hypothesis would be:

  The population is in equilibrium and behaves like a Wright-Fisher model with
  constant population size 2Ne. All mutations are neutral. The mutation rate is
  µ. There is no recombination.

With an H0 like this, many relevant quantities can be calculated and compared with the
observed data.
The test statistic T is a function of the data. It usually summarizes or condenses the
data. There is a range of possible statistics that could be chosen. The aim is to choose
one that contains the information we want, and ignores the information that we believe
is irrelevant, as far as is possible. For the test to work at all, it is essential to know the
distribution of T when H0 is true, written P[T | H0 ]. Sometimes this distribution can be
calculated analytically but for more complex tests it may have to be estimated by computer
simulation.
The size of the test, α = P[T ∈ R | H0 ], is the probability that the null hypothesis will
be rejected when it is in fact true. This is a false positive error, a.k.a. a type I error. We
can control the chance of such an error occurring, by our choice of either T or R. Note that
some supposedly authoritative sources say that the p-value is the probability of making a
type I error. This is not true! Only the size of the test, if determined before the data are
inspected, has this property.
The p-value is the probability, under the null hypothesis, of observing data at least as
extreme as the data actually observed. Such p-values can only be calculated for scalar
test statistics, because we need to define an order, so that we can say which data are
more extreme than others.
The other type of error is a false negative, a.k.a. a type II error, which is a failure to
reject the null hypothesis when it is in fact wrong. We cannot control the chance of these
errors occurring; they depend on which alternative hypothesis is true instead of H0. If an
alternative hypothesis H1 is in fact true, then the power of the test for H1 is P[T ∈ R | H1 ],
which is determined by the choice of T and the critical region R. High power is desirable.
Therefore, in order to design a good test it is important to have a good idea about which
alternative hypotheses could be true. For genetic data there are usually infinitely many
possible choices for T and R, so some kind of biological insight is important.
Exercise 9.1. Assume you are given two datasets, data1 and data2. You perform a test
of neutrality on both of them. For data1 you get a significant result (with α = 5%) and
for data2 a non-significant one. Which of the following conclusions can you draw?
• The dataset data1 does not stem from a neutral model of constant size.
• The dataset data2 stems from a neutral model of constant size.
Which error is involved in these two conclusions? Which error is controlled by the size of
the test?
□
Example: Fisher’s exact test
Let us look at an example, taken from Sokal and Rohlf (1994). We will be dealing
with Fisher's exact test, which will be of relevance in this section. This test uses as data
a 2 × 2 contingency table, i.e. a table of the form given in Figure 9.1. Here acacia trees
were studied, recording whether or not they are invaded by ant colonies. The aim of the
study was to find out whether species A is more often invaded by ant colonies than
species B.

                    Not invaded   Invaded   Total
  Acacia species
    A                    2           13       15
    B                   10            3       13
  Total                 12           16       28

  Figure 9.1: Example of data to use Fisher's exact test.

From species A, 13 out of 15 trees were invaded, but only 3 out of 13 from species B.
So it certainly looks as if A is more often invaded, but we would like to know whether
this is statistically significant. Now, you know that in the study there is a total of 15
trees from species A and 13 from species B, and you know that 16 trees are invaded by
ants and 12 are not. Using only this information, and assuming that both species are
equally likely to be invaded, you would expect that 16 · 15/28 ≈ 8.57 trees of species A
would be invaded. This value is the expected value under the assumption that the species
of a tree and whether or not it is invaded are independent.
You already know the χ²-test, which can also be used in this case. For the χ²-test all
you need is your data in the contingency table and expectations like the one we just
computed. Then you calculate, as on page 58,

  χ² = Σ (Observed − Expected)² / Expected
     = (2 − 15·12/28)²/(15·12/28) + (13 − 15·16/28)²/(15·16/28)
       + (10 − 13·12/28)²/(13·12/28) + (3 − 13·16/28)²/(13·16/28)
     ≈ 11.4991.
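This computation can be reproduced in R, which is also used in later exercises; a minimal sketch, where the matrix holds the observed counts of Figure 9.1:

```r
# Observed counts of Figure 9.1: rows = species A and B,
# columns = not invaded / invaded.
obs <- matrix(c( 2, 13,
                10,  3), nrow = 2, byrow = TRUE)

# Pearson's chi^2 without continuity correction; the reported
# statistic should match the ~11.4991 computed above.
chisq.test(obs, correct = FALSE)
```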
Usually, you now say that the statistic you just computed is χ²-distributed with
(rows − 1)(columns − 1) = 1 degree of freedom. You can then look up in a table of the
χ²-distribution with 1 degree of freedom that a value of 11.4991 or larger only appears
with probability p = 0.0006963, which is then also the p-value. However, all of this relies
on the statistic we computed being χ²-distributed. And in this case the statistic is not
exactly χ²-distributed, because our data are discrete and not continuous, as they would
have to be in order to be χ²-distributed. There are corrections for this, but here we will
use a different method: Fisher's exact test.
The test is called exact because the distribution of the test statistic is exact and not
only approximate, as it would be for the χ²-test. Fisher's exact test relies on computing
the probability of obtaining the observed data given the marginals of the contingency
table. To compute these, note that the number of ways to put 28 balls into four urns
such that all marginals are correct is

  C(28,15) · C(28,16),

where C(n,k) denotes the binomial coefficient. To calculate the number of ways to obtain
not only the marginals but the numbers in the four cells, assume you must lay 28 balls
in a row, where 2 have color a, 13 have color b, 10 have color c and 3 have color d. The
color-a balls can be put on the 28 sites in C(28,2) ways. There are 26 positions remaining.
Next choose 13 positions for the balls of color b, which gives C(26,13) possibilities.
Afterwards, for color c you have C(13,10) possibilities. The balls of color d must then be
put in the remaining positions. In total this gives

  C(28,2) · C(26,13) · C(13,10) = 28! / (2! 13! 10! 3!).
Let us assume we have data a, b, c, d instead of the given numbers. Then, as the
probability of the observed data must be the number of ways of obtaining it divided by
the number of all possibilities, we have

  P[(a, b, c, d)] = ( n! / (a! b! c! d!) ) / ( C(n, a+b) · C(n, a+c) )
                  = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!).        (9.1)
So in our case we have

  P[(2, 13, 10, 3)] = 15! 13! 12! 16! / (28! 2! 13! 10! 3!) = 0.00098712,
which is the probability of finding the contingency table that we had. However, the
p-value was defined as the probability that, given the null hypothesis, the data are at
least as extreme as the observed data. The data would be more extreme if they looked
like one of the tables given in Figure 9.2. Note, however, that the marginals of the
contingency table are fixed here. We only check independence of invasion and species
given these marginals.
Using these more extreme cases we can calculate the p-value by adding up the
probabilities of all the more extreme cases. It turns out to be 0.00162. This means that
the test is highly significant and the hypothesis that the invasion of ants is independent
of the species can be rejected. Fisher's exact test is summarized in Figure 9.3.
The easiest way to perform tests on contingency tables is by using a web-based
calculator. You can find a χ²-calculator e.g. at http://schnoodles.com/cgi-bin/web chi.cgi;
one for Fisher's exact test is found at http://www.matforsk.no/ola/fisher.htm. Also
programs like DNASP (use Tools->Tests of Independence: 2x2 table) and of course
any statistical package like R can do such tests.
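As a sketch of the last option, Fisher's exact test for the acacia data takes one line in R; the two-sided p-value should agree with the 0.00162 computed above:

```r
# Same counts as in Figure 9.1.
obs <- matrix(c( 2, 13,
                10,  3), nrow = 2, byrow = TRUE)

# Exact test conditioning on the marginals; the p-value sums the
# probabilities of all tables at least as extreme as the data.
fisher.test(obs)
```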
Exercise 9.2. A plant ecologist samples 100 trees of a rare species from a 400-square-
kilometer area. He records for each tree whether or not it is rooted in serpentine soil and
whether its leaves are pubescent or smooth. The data he collected are shown in Figure 9.4.
Use

1. a χ²-test,

2. Fisher's exact test

to assess whether the kind of soil and the kind of leaves are independent. Compute the
p-values in each case and interpret your results. □
                    Not invaded   Invaded   Total
  Acacia species
    A                    1           14       15
    B                   11            2       13
  Total                 12           16       28

                    Not invaded   Invaded   Total
  Acacia species
    A                    0           15       15
    B                   12            1       13
  Total                 12           16       28

                    Not invaded   Invaded   Total
  Acacia species
    A                   11            4       15
    B                    1           12       13
  Total                 12           16       28

                    Not invaded   Invaded   Total
  Acacia species
    A                   12            3       15
    B                    0           13       13
  Total                 12           16       28

  Figure 9.2: More extreme cases for Fisher's exact test.
9.2 Tajima's D
Recall that, if the neutral theory and the infinite sites model hold, there are a number
of different unbiased estimators of θ = 4Ne µ. These include the estimator θ̂S (see (2.7)),
where S is the number of segregating sites in a sample of n sequences. A second one, θ̂π,
was given in (2.6); it is the mean pairwise difference between the sequences in the sample.
Both these estimators are unbiased, but they use different information in the sample. This
motivated Tajima (1989) to propose his d statistic, which is defined as

  d := θ̂π − θ̂S.        (9.2)

Since both estimators are unbiased, under the neutral/infinite sites model E[d] = 0. However,
because the two statistics have different sensitivities to deviations from the neutral model,
d contains information.
Exercise 9.3. Take the data from Exercise 5.1. Compute d in this case either by hand or
using R.
□
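For readers who want to try this in R, here is a minimal sketch of how θ̂π, θ̂S and d can be computed from a 0/1 data matrix; the small matrix is made up for illustration and is not the dataset of Exercise 5.1:

```r
# Rows = sampled sequences, columns = sites; 0 = ancestral, 1 = derived.
# Made-up example data.
snps <- matrix(c(0, 0, 1, 0,
                 0, 1, 1, 0,
                 1, 1, 0, 0,
                 0, 1, 0, 1), nrow = 4, byrow = TRUE)
n <- nrow(snps)

# Number of segregating sites and Watterson's estimator theta_S.
S  <- sum(apply(snps, 2, function(col) length(unique(col)) > 1))
a1 <- sum(1 / (1:(n - 1)))
theta_S <- S / a1

# theta_pi: mean number of pairwise differences.
pairs    <- combn(n, 2)
theta_pi <- mean(apply(pairs, 2, function(p) sum(snps[p[1], ] != snps[p[2], ])))

d <- theta_pi - theta_S
```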
We know that the expectation of d is 0, but in order to use it as a test statistic we also
need to know its variance. As Tajima (1989) showed, the variance can be estimated by
(we do not give the derivation because it is too long)

  Vâr[θ̂π − θ̂S] = c1 S / a1 + c2 S(S − 1) / (a1² + a2)

with

  a1 = Σ_{i=1}^{n−1} 1/i,    a2 = Σ_{i=1}^{n−1} 1/i²,        (9.3)

  b1 = (n + 1) / (3(n − 1)),    b2 = 2(n² + n + 3) / (9n(n − 1)),        (9.4)

  c1 = b1 − 1/a1,    c2 = b2 − (n + 2)/(a1 n) + a2/a1².        (9.5)

  Fisher's exact test

  checks for independence in a 2 × 2 contingency table.

  Data: a, b, c, d as given below; their distribution is any distribution for which
  a + b, c + d, a + c and b + d are fixed.

                 Case 1   Case 2   Total
    Case A          a        b     a + b
    Case B          c        d     c + d
    Total         a + c    b + d     n

  Null hypothesis: a, b, c, d are distributed independently on the table, leaving
  a + b, c + d, a + c, b + d fixed, so that

    P[(a, b, c, d) | a+b, c+d, a+c, b+d, n] = (a+b)! (c+d)! (a+c)! (b+d)! / (a! b! c! d! n!).

  p-value:

    p = Σ P[(a, b, c, d) | a+b, c+d, a+c, b+d, n],

  where the sum runs over all (a, b, c, d) at least as extreme as the data.

  Figure 9.3: Schematic use of Fisher's exact test.

                   Serpentine soil   No serpentine soil   Total
  Leaf form
    Pubescent            19                  18             37
    Smooth               17                  46             63
  Total                  36                  64            100

  Figure 9.4: Data for Exercise 9.2.

  Neutral: D = 0;  balancing selection: D > 0;  recent sweep: D < 0.

  Figure 9.5: Genealogies under different forms of selection.
Using this variance estimate, Tajima (1989) defined the test statistic

  D = (θ̂π − θ̂S) / √ Vâr[θ̂π − θ̂S].        (9.6)
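Putting (9.3)-(9.6) together, a sketch of the whole statistic in R (theta_pi, S and n as in the earlier sketch; note that the variance estimate, and hence D, is undefined for S = 0):

```r
# Tajima's D from theta_pi, the number of segregating sites S,
# and the sample size n, following (9.3)-(9.6).
tajima_D <- function(theta_pi, S, n) {
  i  <- 1:(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  var_d <- c1 * S / a1 + c2 * S * (S - 1) / (a1^2 + a2)
  (theta_pi - S / a1) / sqrt(var_d)
}
```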
Tajima’s D statistic is probably the most widely used test of neutrality. It has the
appealing property that its mean is approximately 0 and its variance approximately 1.
Also, for large sample sizes it is approximately normally distributed under neutrality, and
more generally it is approximately β-distributed. However, it is still preferable not to use
the approximate distribution, but to get the real one by doing computer simulations.
What values of Tajima’s D do we expect when there is a deviation from the neutral
model? The statistic reflects the shape of the genealogy. Consider the genealogies of equal
total length shown in Figure 9.5.
Keeping the total length constant means that E[θ̂S] is constant, because a mutation
anywhere on the tree causes a segregating site. We can transform between genealogies
by moving one node up and another node down, without changing the total length. For
example, moving the root down one unit increases the pairwise distance by two units for
9 · 3 = 27 pairwise comparisons, but moving any point up by two units increases the
pairwise distance by four units for a smaller number of pairwise comparisons. The net
effect on the expected pairwise difference is thus positive. This illustrates why
genealogies which have a deep bifurcation tend to have positive values of Tajima's D. For
the same reasons, genealogies that are approximately star-like tend to have negative values
of Tajima’s D.
We expect a selective sweep to cause a star shaped genealogy and balancing selection
to cause a deeply bifurcated genealogy. Purifying selection against recurrent deleterious
mutations causes a small reduction in D because deleterious alleles are at low frequency
on short branches.
Exercise 9.4. The last paragraph gave a heuristic view about genealogical trees under
different selective forces. Explain in your own words why
• selective sweeps cause star like genealogies,
• balancing selection leads to deeply bifurcated trees.
Which kind of tree topologies from Figure 9.5 would you expect in a substructured population?
□
Conditional Testing of Tajima’s D
Recall that for the denominator of D the variance of d was estimated using an estimator
of θ, namely θ̂. So when we calculate a p-value for Tajima's D we use exactly this estimator,
implicitly assuming that it gives us the correct value. So, when we want to find the
distribution of Tajima's D and the p-values that follow from it, we compute

  p = P[D < d | θ̂, neutrality].        (9.7)
With modern computers and fast coalescent simulations, there is really no reason to approximate the sampling distribution of D with a normal or beta distribution. Instead, the
exact sampling distribution should be estimated by simulation.
Exercise 9.5. DNASP can do these coalescent simulations for you. Open the file hla-b.nex.
Here you see 50 lines from the human HLA region. This locus is supposed to be under
balancing selection, so Tajima's D should be positive. Always use the complete dataset in
this exercise.

1. Does Tajima's D-test give a significant result for the region as a whole?

2. Do a sliding window analysis to see how Tajima's D behaves in the region under
   consideration. Here DNASP uses a predefined distribution of Tajima's D, probably a
   normal distribution.

3. You will see that there is a region where Tajima's D tends to be positive but not
   significantly so (i.e. with a p-value above 5%). Use Tools->Coalescent Simulations
   to compute the critical region of the Tajima's D test, given the overall level of
   polymorphism, e.g. given by θ̂S or θ̂π. Do your simulation results support the
   hypothesis of balancing selection better than your results in 1?

□
Some more recent papers have instead calculated p-values using the conditional
sampling distribution of D, obtained by conditioning on the observed number of segregating
sites, i.e. S = s. The reason for doing this is that no assumption has to be made that
an estimated θ-value is correct. The number of segregating sites is just as it is. So these
papers compute

  p = P[D < d | S = s, neutrality].        (9.8)

These two ways of testing using Tajima's D produce different p-values and different critical
regions. For example, suppose we know θ = 1, or we estimated θ̂ = 1 and assume this is
correct. Then by generating 10⁵ random samples from the distribution of D, for n = 30,
we find that the 0.025 and 0.975 quantiles are −1.585 and 1.969, respectively.
Thus R = {D ≤ −1.585 or D ≥ 1.969} is a rejection region for a size 0.05 test, i.e. a
test that guarantees to make false rejection errors with a frequency at worst 5% in a long
series of trials. However, we conditioned here on having found the correct θ. When we
condition on the number of segregating sites s, the frequency of making a type I error by
rejecting neutrality using rejection region R is not 5%, as shown in Figure 9.6.
Exercise 9.6. In Tools->Coalescent Simulations DNASP offers a coalescent simulation
interface. Let us use this to see the differences between conditioning on θ̂ = θ and on
S = s.

1. Make 10000 coalescent simulations to check whether you also obtain the critical region
   R. It is possible, though, that the values DNASP gives you deviate from the above
   ones. How do you interpret this?

2. DNASP offers simulations Given Theta and Given Segregating Sites, which is
   exactly the distinction we also draw here. Using the coalescent interface, can you also
   obtain the values given in Figure 9.6?

□
9.3 Fu and Li's D
Another test that is directly based on the coalescent was proposed by Fu and Li (1993).
Their test statistic is based on the fact that singletons, i.e. polymorphisms that only affect
one individual in a sample, play a special role for different population histories. From the
frequency spectrum (see Section 5) we know that

  E[Si] = θ / i,
where Si is the number of segregating sites that affect i individuals in the sample. Using
this, we can e.g. conclude that

  E[S] = Σ_{i=1}^{n−1} E[Si] = θ Σ_{i=1}^{n−1} 1/i,

which gives a better understanding of the estimator θ̂S. We can also use it to predict the
number of singletons: E[S1] = θ. This gives us a new unbiased estimator of θ:

  θ̂S1 = S1.

    s    P[D ∈ R]    bias
    1      0%        less type I
    4      6.22%     more type I
    7      6.93%     more type I

  Figure 9.6: Assume a critical region R = {D ≤ −1.585 or D ≥ 1.969}. This should
  correspond to a test of size 5% where n = 30 and θ = 1. Conditioning on the number of
  segregating sites, the frequencies of type I errors change. A total of 10⁵ unconditional
  trials were simulated.
With their test statistic, Fu and Li (1993) compared the level of polymorphism on external
branches, which are the singleton mutations, with the level of polymorphism on internal
branches, i.e. all other mutations. As

  Σ_{i=2}^{n−1} E[Si] = θ Σ_{i=2}^{n−1} 1/i,

another unbiased estimator of θ is

  θ̂S>1 = S>1 / Σ_{i=2}^{n−1} (1/i),

where S>1 is the number of segregating sites that affect more than one individual. In the
same spirit as Tajima (1989), Fu and Li (1993) proposed the statistic

  d = θ̂S>1 − θ̂S1.
Exercise 9.7. Again use data from Exercise 5.1 (plus the outgroup of Exercise 5.3).
Compute d in this case either by hand or using R.
□
Again, in order to make d a better test statistic, they calculated the variance of d
and found that a good estimator of this is

  Vâr[θ̂S>1 − θ̂S1] = c1 S + c2 S²

with

  c1 = 1 + (a1² / (a2 + a1²)) (b − (n + 1)/(n − 1)),

  c2 = an − 1 − c1,

  b = (2n an − 4(n − 1)) / ((n − 1)(n − 2)),

with a1, a2 from (9.3) and an := Σ_{i=1}^{n−1} 1/i (so an = a1).
Analogously to Tajima's D statistic, Fu and Li (1993) proposed the test statistic

  D = (θ̂S>1 − θ̂S1) / √ Vâr[θ̂S>1 − θ̂S1].

The interpretation for selective models can again be read from Figure 9.5. When the tree
looks like the one in the middle (balancing selection), singletons are rare, so θ̂S1 will be
small and D will be positive. When the genealogical tree looks like the one on the right
(recent sweep), θ̂S>1 will be far too small and so D will be negative.
Exercise 9.8. Use the same dataset as in Exercise 9.5. Does Fu and Li's D suggest that
balancing selection is going on at the human HLA locus? Again answer this question using
the dataset as a whole, with a sliding window analysis and using coalescent simulations. □
Exercise 9.9. Obviously Fu and Li's D and Tajima's D use different information from the
data. Can you draw a genealogical tree (with mutations on the tree) for the case that

• Tajima's D is negative and Fu and Li's D is approximately 0?

• Fu and Li's D is positive and Tajima's D is approximately 0?

□
9.4 Fay and Wu's H
Tests based on polymorphism data only are easily confounded by demography and by
selection at linked sites. These problems can be addressed to some extent by combining
divergence and polymorphism data, and by combining data from several loci at once.
As we know from Section 8, at loci closely but not completely linked to a site at which
a selective sweep has occurred, there is a tendency for genealogies to resemble the one in
Figure 8.3.
Figure 9.7: Sliding window analysis of the accessory gland protein gene by Fay and Wu
(2000).
In particular, we saw in Exercise Ex:hitchSFS that hitchhiking gives rise to high-frequency
derived alleles. The statistic

  θ̂H = Σ_{i=1}^{n−1} 2 Si i² / (n(n − 1))        (9.9)

is particularly sensitive to the presence of such variants, and therefore suggests the test
statistic

  H = θ̂T − θ̂H.        (9.10)
Fay and Wu did not provide an analytical calculation of the variance for this statistic and
therefore p-values have to be estimated by simulation. In both simulated and real data for
which divergence data have indicated positive selection, Fay and Wu (2000) found smaller
p-values for H-tests than for Tajima’s D-tests.
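A sketch of (9.9) and (9.10) in R, assuming the unfolded site frequency spectrum is given as a vector Si, where Si[i] is the number of sites at which the derived allele appears i times in the sample; θ̂T (the mean pairwise difference) is computed from the same spectrum:

```r
# Fay and Wu's theta_H, see (9.9).
theta_H <- function(Si, n) sum(2 * Si * (1:(n - 1))^2) / (n * (n - 1))

# theta_T (the mean pairwise difference) from the same spectrum.
theta_T <- function(Si, n) {
  i <- 1:(n - 1)
  sum(2 * Si * i * (n - i)) / (n * (n - 1))
}

# H of (9.10); p-values would have to come from coalescent simulation.
H <- function(Si, n) theta_T(Si, n) - theta_H(Si, n)
```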
Particularly spectacular are the large peaks in θ̂H for the Drosophila accessory gland
protein gene (Figure 9.7): perhaps two regions where there could have been incomplete
hitchhiking, one on either side of a ∼ 350bp region where almost all variability has been
eliminated.
9.5 The HKA Test
The HKA test, introduced by Hudson et al. (1987), uses polymorphism and divergence
data from two or more loci, and tests for loci that have an unusual pattern of molecular
variation. Here we will consider the case of L loci and polymorphism data from two
species A and B.
The idea and the model

The main idea of the HKA test is that under neutrality you would expect the same ratio
between polymorphism within and divergence between species at all L loci under
observation.
The test assumes that the polymorphism data are taken from two species with fixed
effective population sizes of 2Ne and 2Ne f haploids. These species are assumed to have
diverged from each other some time T ago, where T is measured in units of 2Ne
generations. The ancestral species is assumed to have had a fixed effective population size
of 2Ne (1 + f)/2 diploids. Further assumptions are:

1. at each locus an infinite sites model with selectively neutral mutations holds;

2. the mutation rate at locus i is µi (i = 1, . . . , L);

3. there is no recombination within the loci;

4. there is free recombination between the loci, i.e. loci are unlinked.
Parameters and data

On the data side we have:

  S1A, . . . , SLA : numbers of segregating sites in species A at loci 1, . . . , L;
  S1B, . . . , SLB : numbers of segregating sites in species B at loci 1, . . . , L;
  D1, . . . , DL : divergence between two randomly picked lines from species A and B
                  at loci 1, . . . , L.

So there are 3L numbers that we observe. On the parameter side we have:

  T : time since the split of the two species;
  f : ratio of the two population sizes of species B and A;
  θ1, . . . , θL : scaled mutation rates at loci 1, . . . , L.

So there are L + 2 model parameters. As long as L ≥ 2, i.e. when data from two or more
loci are available, there are more observations than parameters. This means that we can
test the model.

Exercise 9.10. Why does it only make sense to test the model if the data consist of more
numbers than the number of model parameters? □
Estimation of model parameters

First of all we have to estimate the model parameters. This can be done by different means.
As the estimator θ̂S is unbiased for one locus, we can calculate that

  Σ_{i=1}^{L} SiA = a_{nA} Σ_{i=1}^{L} θ̂i,        (9.11)

  Σ_{i=1}^{L} SiB = a_{nB} f̂ Σ_{i=1}^{L} θ̂i,        (9.12)

where

  an := Σ_{i=1}^{n−1} 1/i,

and nA is the number of sequences we have for species A. The divergence of two randomly
picked lines from the two populations is on average 2Ne (T + (1 + f)/2) generations, so we
can estimate

  Σ_{i=1}^{L} Di = (T̂ + (1 + f̂)/2) Σ_{i=1}^{L} θ̂i.        (9.13)

For each single locus i = 1, . . . , L we have polymorphism in A, polymorphism in B and
divergence, which adds up to

  SiA + SiB + Di = θ̂i (T̂ + (1 + f̂)/2 + a_{nA} + a_{nB}).        (9.14)
Using these equations we have L + 3 equations for L + 2 unknowns, the model parameters.
So it is not guaranteed that we find a solution, but we can try to find one by a least
squares approach. Assume you take some combination of T̂, f̂, θ̂1, . . . , θ̂L. When you plug
these into the right-hand sides of (9.11)-(9.14), they produce some numbers for the
left-hand sides of these equations. They are least squares estimators if and only if the sum
of squared differences between the produced left-hand sides and the observed left-hand
sides of (9.11)-(9.14) is minimal.
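Such a least squares fit can be sketched in R with a general-purpose optimizer. This is a rough illustration, not the original HKA fitting procedure; SA, SB, D are the per-locus counts and nA, nB the sample sizes:

```r
a1 <- function(n) sum(1 / (1:(n - 1)))  # a_n as defined above

# Least squares fit of (T, f, theta_1, ..., theta_L) to (9.11)-(9.14).
# No positivity constraints are imposed; good enough for a sketch.
hka_fit <- function(SA, SB, D, nA, nB) {
  L <- length(SA)
  sq <- function(par) {
    Tsplit <- par[1]; f <- par[2]; theta <- par[-(1:2)]
    (sum(SA) - a1(nA) * sum(theta))^2 +                            # (9.11)
      (sum(SB) - a1(nB) * f * sum(theta))^2 +                      # (9.12)
      (sum(D) - (Tsplit + (1 + f) / 2) * sum(theta))^2 +           # (9.13)
      sum((SA + SB + D -
             theta * (Tsplit + (1 + f) / 2 + a1(nA) + a1(nB)))^2)  # (9.14)
  }
  fit <- optim(c(1, 1, rep(1, L)), sq)  # crude starting values
  fit$par
}
```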
The test statistic

Ultimately we want to test if our model fits the observations. So far we have estimated
the model parameters. The HKA test now makes the assumption that

• the random numbers S1A, . . . , SLA, S1B, . . . , SLB, D1, . . . , DL are independent and
  normally distributed.

For large data sets the normal approximation may hold at least approximately. The
independence, however, is certainly false, e.g. because the divergence uses the same
segregating sites as the polymorphism. The HKA test has been criticized for this. Let us
see what happens under this assumption.
Next we need two properties of probability distributions.
Maths 9.1. When X is normally distributed, then

  Y := (X − E[X]) / √(Var[X])

is also normally distributed, with E[Y] = 0, Var[Y] = 1.

Maths 9.2. Let X1, . . . , Xn be independent, normally distributed random variables with
E[Xi] = 0, Var[Xi] = 1 (i = 1, . . . , n). Then the distribution of

  Z = X1² + . . . + Xn²        (9.15)

is χ²(n) distributed. Here n denotes the number of degrees of freedom.
When the above assumption holds at least approximately, we now see that

  (SiA − E[SiA]) / √(Var[SiA])

is approximately normally distributed, and

  Σ_{i=1}^{L} (SiA − E[SiA])² / Var[SiA] + Σ_{i=1}^{L} (SiB − E[SiB])² / Var[SiB]
    + Σ_{i=1}^{L} (Di − E[Di])² / Var[Di]

is approximately χ²-distributed. But as we do not know E[SiA] and Var[SiA], we have to
estimate them before we can use this. This is easy for the expectation, but for the variance
we have to compute something.
Assume a coalescent tree for n individuals with mutation rate µ. Let L be the length
of the whole tree and Ti be the time the tree spends with i lines. As usual, S is the number
of segregating sites. Then

  E[S(S − 1)] = ∫₀^∞ E[S(S − 1) | L = ℓ] P[L ∈ dℓ]
              = ∫₀^∞ Σ_{s=0}^∞ s(s − 1) e^{−ℓµ} (ℓµ)^s / s! P[L ∈ dℓ]
              = ∫₀^∞ (ℓµ)² Σ_{s=2}^∞ e^{−ℓµ} (ℓµ)^{s−2} / (s − 2)! P[L ∈ dℓ]
              = µ² ∫₀^∞ ℓ² P[L ∈ dℓ] = µ² E[L²] = µ² (Var[L] + (E[L])²).
So we have to compute the variance of L. As we calculated in Maths 1.3 the variance of
an exponentially distributed variable, we can continue

  Var[L] = Var[ Σ_{i=2}^{n} i Ti ] = Σ_{i=2}^{n} i² Var[Ti]
         = Σ_{i=2}^{n} i² (2N · 2/(i(i − 1)))² = (4N)² Σ_{i=1}^{n−1} 1/i²
  i   locus                 si   di
  1   Adh                   20   16
  2   5' flanking region    30   78
  3   Adh-dup               13   50

  Figure 9.8: Data from 11 D. melanogaster and 1 D. simulans sequences.
and so

  Var[S] = E[S] + E[S(S − 1)] − (E[S])² = θa1 + θ²(a2 + a1²) − θ²a1² = θa1 + θ²a2.        (9.16)
Using this we can estimate, for the i-th locus in species A,

  Vâr[SiA] = θ̂i a1 + θ̂i² a2,

and in species B,

  Vâr[SiB] = θ̂i f̂ a1 + (θ̂i f̂)² a2.

The same calculation also works for divergence, in which case

  Vâr[Di] = θ̂i (T̂ + (1 + f̂)/2) + (θ̂i (T̂ + (1 + f̂)/2))².
Now we can use the test statistic

  χ² = Σ_{i=1}^{L} (SiA − Ê[SiA])² / Vâr[SiA] + Σ_{i=1}^{L} (SiB − Ê[SiB])² / Vâr[SiB]
       + Σ_{i=1}^{L} (Di − Ê[Di])² / Vâr[Di].        (9.17)

Each estimated parameter reduces the degrees of freedom of this test statistic by 1, so
χ² has a χ²-distribution with 3L − (L + 2) = 2L − 2 degrees of freedom. Now we have
a test statistic and also know its approximate distribution, so we can calculate critical
regions and p-values.
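Continuing the sketch above, the statistic (9.17) and its approximate p-value can be computed as follows; note that evaluating a1 and a2 at each species' own sample size is an assumption here, since the text simply writes a1 and a2:

```r
a2 <- function(n) sum(1 / (1:(n - 1))^2)

# chi^2 of (9.17) with 2L - 2 degrees of freedom; Tsplit, f, theta as
# returned, e.g., by the hypothetical hka_fit() above.
hka_chisq <- function(SA, SB, D, nA, nB, Tsplit, f, theta) {
  L  <- length(SA)
  EA <- a1(nA) * theta
  VA <- theta * a1(nA) + theta^2 * a2(nA)
  EB <- a1(nB) * f * theta
  VB <- theta * f * a1(nB) + (theta * f)^2 * a2(nB)
  ED <- theta * (Tsplit + (1 + f) / 2)
  VD <- ED + ED^2
  chi2 <- sum((SA - EA)^2 / VA) +
          sum((SB - EB)^2 / VB) +
          sum((D  - ED)^2 / VD)
  c(chi2 = chi2, p = pchisq(chi2, df = 2 * L - 2, lower.tail = FALSE))
}
```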
Example

We can apply the HKA test to data for the Adh gene, its 5' flanking region and an ancient
duplicate gene Adh-dup, shown in Figure 9.8. The data are from the paper in which the
HKA test was introduced, Hudson et al. (1987).
Here we have 11 lines from D. melanogaster but only one from D. simulans. The
consequence is that for D. simulans we do not have polymorphism data. As we have less
data, we also need fewer model parameters. We achieve this by assuming f = 0, i.e. that
D. simulans and D. melanogaster emerged from one species that had the size of today's
D. melanogaster, and that D. simulans split off as a negligibly small fraction of this
ancestral species. This is illustrated in Figure 9.9.
  [Two species of size 2Ne that split time t ago, with two loci of mutation rates µ1 and µ2.]

  Figure 9.9: Model assumed by the HKA test in the special case f = 0 and polymorphism
  data only from one species.
The only significant pairwise comparison is between Adh and Adh-dup, which has χ² =
4.574 and p = 0.0325, assuming the χ² approximation. For this test the expected quantities
were Ê[S1] = 12.0, Ê[S3] = 21.0, Ê[D1] = 24.0 and Ê[D3] = 42.0. Comparing these with
the observed values suggests that the Adh gene has an unusually high level of polymorphism
or an unusually low divergence, or vice versa for Adh-dup. This has been interpreted as
evidence of balancing selection acting on the Adh fast–slow polymorphism.
Exercise 9.11. Two versions of the HKA test are implemented in DNASP. However, it
can only deal with two loci and always sets f = 0 in the above analysis. For the first
(Analysis->HKA, Hudson, Kreitman, Aguade Test) you have to define the two loci in
your data set. The second version (Tools->HKA Test (Direct Mode)) only needs the
numbers of segregating sites and the divergence between the two species as input. Here
it is assumed that only one line from species B is available, so we find ourselves in the
regime of the above example.

1. Can you reproduce the numbers of the above example, e.g. χ² = 4.574?

2. Is the comparison of Adh and Adh-dup really the only one that gives a significant result?

□
Note that the parameter estimates obtained using equations (9.11)-(9.14) are not the
values that minimize the test statistic. The least squares procedure minimizes the sum of
squares of the deviations in equations (9.11)-(9.14) without regard to differences in the
expected variance of each quantity. This is illustrated in Figure 9.10 for the comparison
between Adh and its 5' flanking region (Figure 9.8). Parameter estimates obtained by
equations (9.11)-(9.14) are t = 4.5, θ1 = 4.3, θ2 = 12.8 and give χ² = 3.2. However, the
minimum χ²Min = 1.79 is obtained when t = 2.4, θ1 = 5.6, θ2 = 20.3.

  Figure 9.10: Data (×) for the HKA test for the Adh locus versus its 5' flanking region.
  Expectations ± 1 s.d. are shown for each quantity (s1, s2, d1, d2), calculated using either
  the parameter estimates obtained by equations (9.11)-(9.14) (closed symbols) or the
  estimates that minimize the test statistic of equation (9.17) (open symbols).
Estimating parameters by minimizing χ² produces smaller values of the test statistic, and
hence larger p-values, if it is assumed that the null distribution of χ² is χ²-distributed
with 1 degree of freedom. The HKA test is quite a robust test. Many deviations from the
model, for example linkage between the loci or a population bottleneck in the past, generate
correlations between the genealogies at the two loci and therefore reduce the variance of
the test statistic, making the test conservative.
Exercise 9.12. One property of an estimator is consistency, meaning that it gets better
when more data are available. Assume an estimator •̂(n) of • is based on n data.
Consistency means that

  Var[•̂(n)] → 0 as n → ∞.

You know the estimator θ̂S, which is based only on S. We could also call it θ̂S(n) when
the sequences of n individuals are available. Is this estimator consistent?
Hint: use (9.16). □
9.6 The McDonald–Kreitman Test
The McDonald and Kreitman (1991) test is similar to the HKA test in that it compares
the levels of polymorphism and divergence at two sets of sites. Whereas for the HKA test
the two sets of sites are two different loci, the McDonald-Kreitman test examines sites that
are interspersed: synonymous and nonsynonymous sites in the same locus. Because the
sites are interspersed, it is safe to assume that the genealogies for the two are the same.
The model therefore has four parameters: the synonymous mutation rate µs, the
nonsynonymous mutation rate µa, and the total lengths of the divergence and within-species
branches in the genealogy, td and tw (see Figure 9.11).
  [Genealogy sketch: the within-species branches have total length Σ = tw; the divergence
  branch has length td.]

  Figure 9.11: Model assumed for the McDonald-Kreitman test.
Assuming the infinite sites model, the numbers of segregating sites of each of the four
possible types (synonymous or nonsynonymous, and diverged or polymorphic within a
species) are independently Poisson distributed with means given in Figure 9.12.
Here µ = µa + µs is the total mutation rate and t = td + tw is the total length of the
tree. The observations can therefore be arranged in the corresponding 2 × 2 contingency
table and tested for goodness of fit using a χ2 test, Fisher’s exact test or others (e.g. a
G-test).
The McDonald-Kreitman test is a very robust test, because no assumption about the
shape of the genealogy is made. It is therefore insensitive to the demographic histories,
geographic structuring and non-equilibrium statuses of the populations sampled.

                      diverged   polymorphic   Total
    synonymous         µs td        µs tw       µs t
    non-synonymous     µa td        µa tw       µa t
    Total               µ td         µ tw        µ t

  Figure 9.12: Expected means for the McDonald-Kreitman test under neutrality.

                      diverged   polymorphic   Total
    synonymous            17          42         59
    non-synonymous         7           2          9
    Total                 24          44         68

  Figure 9.13: Data from McDonald and Kreitman (1991) given in a 2 × 2 contingency
  table.
If synonymous mutations are considered neutral on an a priori basis, then a significant
departure from independence in the test is an indicator of selection at nonsynonymous
sites. An excess of substitutions can be interpreted as evidence of adaptive evolution,
and an excess of polymorphism can be interpreted as evidence of purifying selection since
deleterious alleles contribute to polymorphism but rarely contribute to divergence.
McDonald and Kreitman (1991) studied the Adh gene using 12 sequences from D.
melanogaster, 6 from D. simulans and 12 from D. yakuba and obtained the data as given
in Figure 9.13. These show a significant departure from independence with p < 0.01. The
departure is in the direction of excess non-synonymous divergence, providing evidence of
adaptively driven substitutions.
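In R the test itself is a one-liner; a sketch with the counts of Figure 9.13 (a χ²- or G-test on the same table gives similar results):

```r
# Figure 9.13: rows = synonymous / non-synonymous,
# columns = diverged / polymorphic.
mk <- matrix(c(17, 42,
                7,  2), nrow = 2, byrow = TRUE)

# Significant departure from independence (p < 0.01, as in the text).
fisher.test(mk)
```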