UNIVERSITY OF CINCINNATI
Date: August 6th, 2007
I, Ran He, hereby submit this work as part of the requirements for the degree of:
DOCTOR OF PHILOSOPHY
in:
ENVIRONMENTAL HEALTH
It is entitled:
Some Statistical Aspects of Association Studies in Genetics
and Tests of the Hardy-Weinberg Equilibrium
This work and its defense approved by:
Chair: Dr. Marepalli Rao
Dr. Ranajit Chakraborty
Dr. Ranjan Deka
Dr. Ning Wang
Some Statistical Aspects of Association Studies in Genetics and
Tests of the Hardy-Weinberg Equilibrium
A dissertation submitted to the
Division of Research and Advanced Studies
of the University of Cincinnati
In partial fulfillment of the
requirements for the degree of
DOCTOR OF PHILOSOPHY (Ph.D.)
in the Division of Epidemiology and Biostatistics
of Department of Environmental Health
of the College of Medicine
2007
By
Ran He
B. S., Sichuan University, China, 2001
Committee Chair: Dr. Marepalli Rao
Abstract
The applicability of a statistical method hinges on how well the assumptions needed for its validity
are met. Some statistical tests are robust when the assumptions are relaxed, while others are not.
In the first part of the dissertation, we focus on exploring assumption
violations in some statistical methods for genetic association studies and use simulations
to test the robustness of these methods. In genetic studies, one of the major objectives is
to apply statistical models to identify genes contributing to variations in specific
quantitative traits. In order to correlate such quantitative phenotypes with underlying
genotypes, the method of analysis of variance (ANOVA) is most commonly used. If the
null hypothesis of equality of means is rejected, it implies that the gene under investigation is
associated with the phenotype. However, we show that this method raises a paradox by
violating the assumptions of its validity. An alternative method, namely Bartlett’s test, is
available to overcome the paradox. We compare the performances of the ANOVA test
and Bartlett’s test to answer the underlying question. Our study indicates that the
ANOVA test works despite the violation of its validity assumptions.
In the second part of the dissertation, we focus on tests of the Hardy-Weinberg
Equilibrium (HWE). In population genetics, HWE states that, under certain conditions,
after one generation of random mating genotype frequencies at a single gene locus will
attain a particular set of equilibrium values. The most commonly used method for testing
HWE is the goodness-of-fit Chi-squared statistic, which does not discriminate
homozygote excess from heterozygote deficiency because of its two-sided nature. We
propose alternative methods and use simulations to assess their power. The proposed
methods are amenable to sample size calculations. We compare our sample size
calculations with those available in the literature and find that ours are smaller. For more
than two alleles, testing the HWE is computationally complex. We propose a new method
of testing the HWE for multi-allelic cases by reducing the dimensionality of the problem.
Mathematical, statistical, and computational aspects of the new method are set out in
detail.
Acknowledgments
I would like to express my sincere gratitude to Dr. Marepalli Rao, who is my
advisor, and Dr. Ranajit Chakraborty, who is the director of Center for Genome
Information, for their inspiration, professional guidance and support for my dissertation. I
have greatly profited from their solid knowledge and great personalities. I am indebted to them for
their constant encouragement and mentoring throughout my graduate studies.
I would also like to thank Dr. Ranjan Deka and Dr. Ning Wang for serving on my
committee and providing many insightful suggestions and comments and discussing with
me some of the difficult points in the dissertation. I would also like to express my
appreciation to the faculty and staff of the Department of Environmental Health with whom
I have had the good fortune to interact. Finally, I want to thank my family, who always
encouraged me to succeed in achieving high goals.
Table of contents

1   Introduction
2   Purpose, Hypotheses, Specific Aims, and Significance
    2.1   Purpose
    2.2   Research Hypotheses
    2.3   Specific Aims
    2.4   Significance
3   On testing that genotypes at a marker locus are associated with a given phenotype
    3.1   Background: Traditional Approach
    3.2   Statistical Methods
          3.2.1   Analysis of Variance (ANOVA)
          3.2.2   Bartlett’s Test
          3.2.3   Linkage Disequilibrium Coefficient and Joint distribution
          3.2.4   Joint distribution of the Phenotype and Genotypes of G and G′
          3.2.5   Conditional Expectations and Variances
    3.3   Some Facts and Paradox
          3.3.1   Some Facts
          3.3.2   The Paradox
    3.4   Efficacy of ANOVA
    3.5   Power Comparison of ANOVA and Bartlett’s test
          3.5.1   Different choices of mean (λ)
          3.5.2   Conclusion
4   Hardy-Weinberg Equilibrium in the case of two alleles
    4.1   Introduction
          4.1.1   What is Hardy-Weinberg Equilibrium?
          4.1.2   Assumptions of HWE
          4.1.3   Departures from the Equilibrium
          4.1.4   Inbreeding Coefficient θ
    4.2   Properties of the Inbreeding coefficient θ
          4.2.1   Formulation of the problem
          4.2.2   Bounds on θ
          4.2.3   Homozygote excesses and Heterozygote deficiencies
    4.3   Maximum Likelihood estimates
    4.4   Testing the validity of HWE
          4.4.1   Hypothesis Testing on θ
          4.4.2   A likelihood test of the null hypothesis
          4.4.3   Siegmund’s T-Test
          4.4.4   χ2-Test
          4.4.5   Relationship between θ̂, Wald’s Z-test, Siegmund’s T-test and the χ2-test
    4.5   Advantages of the Wald’s Z-test or Siegmund’s T-Test
    4.6   Sample size calculation
          4.6.1   Sample size calculation based on the Z-test or Siegmund’s T-test
          4.6.2   Sample size calculation based on Ward and Sing’s χ2-test
          4.6.3   Power comparison between the T and χ2 tests via simulations
    4.7   Conclusion
5   Hardy-Weinberg Equilibrium in the case of three alleles
    5.1   Introduction
    5.2   Joint distribution of genotypes
          5.2.1   Parameter spaces
          5.2.2   Bounds on θ
          5.2.3   Biological scenario
    5.3   Structure of the case of 3 alleles: data and Likelihood
          5.3.1   Structure of the case of 3 alleles: data
          5.3.2   Maximum Likelihood estimators
    5.4   Joint distribution of the type Ωθ and connection to lower dimensional joint distributions
          5.4.1   The case of A1 vs. (not A1)
          5.4.2   The case of A2 vs. (not A2)
          5.4.3   The case of A3 vs. (not A3)
    5.5   Estimation of inbreeding coefficient and hypothesis testing
          5.5.1   Estimation of inbreeding coefficient in a model of the type Ωθ
          5.5.2   Testing that the joint distribution of the alleles is of the type Ωθ
    5.6   Conclusions
6   Generalization to multiple alleles
    6.1   Formulation of the problem
    6.2   Data and Likelihood
          6.2.1   Data Structure
          6.2.2   Maximum Likelihood estimators
    6.3   Lower dimensional joint distributions
7   Conclusions and Future Research
Bibliographic references
List of Appendices

Appendix 1:   Derivation of Conditional Expectations and Variances
Appendix 2:   SAS code of different scenarios of the ANOVA test
Appendix 3:   SAS code of the power comparison of ANOVA and Bartlett's test
Appendix 4:   Derivation of Expectation and Variance of Siegmund's T-test
Appendix 5:   SAS code of the sample size calculation of Wald's Z test
Appendix 6:   SAS code of the power comparison of Ward and Sing's χ2 test and Wald's Z test
Appendix 7:   SAS code of Rao's Homogeneity Test
Appendix 8:   Derivatives of the θ̂'s with respect to frequencies
Appendix 9:   Mathematica code for power and size computations of the Q-test
List of tables

Table 3-1    Joint distribution of the genotypes of G and G′
Table 3-2    Conditional distributions under Scenario 1
Table 3-3    Conditional distributions under Scenario 2
Table 3-4    Summarized results of simulations
Table 4-1    Punnett square for Hardy-Weinberg Equilibrium
Table 4-2    Genotype frequencies in the population
Table 4-3    Joint distribution with inbreeding coefficient θ
Table 4-4    Sample size n to achieve a specified power, 1−β, using Wald’s Z-test for various values of the allele frequency q, true inbreeding coefficient θ, and level α
Table 5-1    Joint distribution of genotypes
Table 5-2    Joint distribution is of type Ωθ
Table 5-3    Joint distribution for Genotypes under Equilibrium (Ω0)
Table 5-4    Example: A distribution in Ω but not in Ω*
Table 5-5    Population subdivision with respect to tri-alleles
Table 5-6    Data on Genotypes
Table 5-7    Joint distribution: A1 vs. (not A1)
Table 5-8    Joint distribution: A1 vs. (not A1) with inbreeding coefficient θ1
Table 5-9    Joint distribution: A2 vs. (not A2)
Table 5-10   Joint distribution: A2 vs. (not A2) with inbreeding coefficient θ2
Table 5-11   Joint distribution: A3 vs. (not A3)
Table 5-12   Joint distribution: A3 vs. (not A3) with inbreeding coefficient θ3
Table 6-1    General Joint distribution of genotypes
Table 6-2    A joint distribution of type Ω*
Table 6-3    Joint distribution of genotypes under Equilibrium (Ω0)
Table 6-4    Data collected for any multiple alleles
Table 6-5    2x2 distribution required for a test about inbreeding coefficient
List of figures

Figure 3-1   Common conditional pdf of P | G′ under Scenario 1
Figure 3-2   Common conditional pdf of P | G under Scenario 1
Figure 3-3   Conditional pdf of P | G′ under Scenario 2
Figure 3-4   Conditional pdf of P | G′ under Scenario 3
Figure 3-5   Conditional pdf of P | G′ under Scenario 4
Figure 3-6   Power comparison of ANOVA and Bartlett’s test, when λ = 1
Figure 3-7   Power comparison of ANOVA and Bartlett’s test, when λ = 50
Figure 4-1   Power comparison of Z and χ2, when p = 0.5
Figure 4-2   Power comparison of Z and χ2, when p = 0.2
Figure 4-3   Power comparison of Z and χ2, when p = 0.05
Figure 4-4   Histogram and Normal Q-Q Plot for Z’s, when p = 0.5
Figure 4-5   Histogram and Normal Q-Q Plot for Z’s, when p = 0.2
Figure 4-6   Histogram and Normal Q-Q Plot for Z’s, when p = 0.05
Figure 5-1   Empirical power of χ2 Test Q for testing H0: θ1 = θ2 = θ3 = θ = 0
Figure 5-2   Empirical size of χ2 Test Q for testing H0: θ1 = θ2 = θ3 = θ (θ unknown)
1 Introduction
One of the important problems in genetic studies is to explore association between a gene
and a quantitative phenotype, such as blood pressure, body mass index (BMI), and lipid levels in
blood. A standard additive model is generally postulated exemplifying the connection between the
genotypes and phenotype. The model assumes a normal distribution for the phenotype under each
genotype with additive effects. The relevant question is whether a gene of interest is associated
with the phenotype. Data collected on the phenotype of a random sample of subjects are classified
according to the genotypes and the ANOVA method is most commonly used. By comparing the
mean phenotypic values across the different genotypic groups, the proportion of variance
explained by the marker loci is examined. The rejection of the null hypothesis would lead us to
believe that there is a statistically significant difference among these groups, which implies the
conclusion that the genotypes are correlated with the given quantitative phenotype.
This test is reasonable and the method is easy to use. However, we note that the method of
analysis of variance raises a paradox by violating the assumptions of its validity. For the validity
of the ANOVA method, homogeneity of variances of the genotype populations is needed. We
observe that homogeneity holds if and only if the population means are equal, which the ANOVA
method purports to test. This is a paradoxical situation. Thus, the motivation for this part of our
work stems from the concern that, if the test assumptions are violated, the test results may not be
valid.
The Bartlett’s test can be used to test homogeneity of variances. In Chapter 3, we compare
the performance of the ANOVA method and Bartlett’s test via simulations. Our conclusion is that
the ANOVA method works despite violation of the assumptions of its validity.
In Chapter 4 through 6, we focus on the assumption of the Hardy-Weinberg Equilibrium
(HWE) to describe genotype frequencies at autosomal codominant loci. In population genetics, the
HWE or Hardy–Weinberg law, named after G. H. Hardy and W. Weinberg, states that, under
certain conditions, after one generation of random mating, the genotype frequencies at a single
gene locus will attain a particular set of equilibrium values. It also specifies that those equilibrium
frequencies can be represented as a simple function of allele frequencies at that locus.
In Chapter 4, we focus on testing the HWE in the bi-allelic case. The most commonly used
entity for testing is the goodness-of-test Chi-squared statistic, which does not
discriminate
homozygote excess from heterozygote deficiency because of its two-sided nature. We propose
alternative methods for testing the HWE against an one-sided alternative. We use simulations to
compare the power of the new test and goodness-of-test Chi-squared test. We also compare sample
sizes required to achieve a given power. Sample sizes calculated based on the new method are
lower than what are available in the literature.
For more than two alleles, testing the HWE raises severe computational problems. In
Chapter 5, we propose a new method of testing the HWE following the dimensionality reduction
principle. We set out in great detail execution of the new method and computational routines, and
demonstrate its feasibility and effectiveness by simulations.
In Chapter 6, results and methods are extended to multi-allelic cases. As the number of
alleles increases, so does the computational complexity. The available computational power is
adequate to cover cases with a reasonable number of alleles.
In Chapter 7, we draw conclusions from the work presented. We also outline research
problems which we wish to pursue in the future.
2 Purpose, Hypotheses, Specific Aims, and Significance
2.1 Purpose
There are two main goals pursued in this dissertation:
1. First, we establish that the assumptions needed for the commonly used ANOVA method
for testing whether a gene (or genes) is associated with a given phenotype are not met,
giving rise to a paradox. This motivated one of the goals of this dissertation: to scrutinize
the violations and check the appropriateness of using ANOVA.
Bartlett’s test seems to be more appropriate to use in this context, even though the (marker)
genotypic group populations are not normal. In fact, the populations are mixed normal. We
compare the performances of Bartlett’s test and ANOVA procedure, using simulations
under different choices of parameters (i.e., allele frequencies, allelic effect, linkage
disequilibrium, type I error) to examine whether the ANOVA procedure still works, even
when the assumptions are violated, and how robust it is compared to the Bartlett’s test.
2. The second aim is to propose a new method to test Hardy-Weinberg Equilibrium since the
commonly used Chi-squared statistic can not discriminate homozygote excess from
heterozygote deficiency. The new method is simple to use in the bi-allelic case. For more
than two alleles, testing the HWE raises severe computational problems. We detail a new
technique reducing the multi-allelic problem to several bi-allelic problems. The
mathematical and computational details have been set out.
2.2 Research Hypotheses
Research hypotheses to be examined in this dissertation are the following:
1. The goal is to test the hypothesis that genotypes are associated with a given quantitative
phenotype. The method commonly used is ANOVA. We show that the assumptions for the
validity of ANOVA are violated, giving rise to a paradox. The Bartlett’s test seems more
appropriate. We hypothesize that the ANOVA procedure still works despite the violations.
2. On testing the Hardy-Weinberg Equilibrium, in the case of bi-allelic genes, the chi-squared
test cannot be used for one-sided alternatives. We propose a new method of testing which
can accommodate one-sided alternatives. We hypothesize that the new method provides
lower sample sizes than the chi-squared method.
3. Computational problems are insurmountable when testing the HWE in cases with more
than two alleles. In the tri-allelic case, we propose a new method of testing the HWE. The
new method reduces the tri-allelic problem to several bi-allelic problems. The research
hypothesis is that it will work. We use simulations to examine the power of the new
method.
4. The new proposed method can be generalized to any case of multiple alleles.
2.3 Specific Aims
The specific aims of this dissertation are outlined below:
1. Under the additive model of allelic effects on a quantitative phenotype, the ANOVA
method is commonly used to check the influence of the underlying gene on the phenotype.
We want to examine whether the assumptions for the validity of the ANOVA method are
met. If not, propose an alternative method to answer the question of interest and compare
its performance with that of the ANOVA method. This is pursued in Chapter 3.
2. For testing the Hardy-Weinberg equilibrium in the two-allelic case, the goodness-of-fit Chi-squared statistic is commonly used. The Chi-squared statistic cannot be used to test one-sided alternatives. We want to explore whether a new test can be developed to achieve the
objective. We propose such a new test. We use the new test for power and sample size
calculations. This is pursued in Chapter 4.
3. Testing the HWE in the tri-allelic case is computationally intractable. We want to explore
ways of overcoming computational complexities. We propose a new method for testing the
HWE in this case. We want to detail the new procedure for practical implementation. This
is pursued in Chapter 5.
4. Generalize the testing procedure developed to any multi-allelic case. This is pursued in
Chapter 6.
2.4 Significance
In genetic studies, one of the major objectives is to apply statistical models to assist in
identification of genes contributing to specific quantitative traits. The appropriateness of these
methods depends on the validity of the assumptions needed to carry them out. The
difficulty is that the violation is often not readily apparent (i.e., it is deeply buried and
not detectable without extensive algebraic computation). Unavoidably, geneticists sometimes
pick the most commonly used statistical method without checking its validity,
which may bias or even jeopardize the integrity of the research. Some methods are robust when
the assumptions are violated, while others are not. Therefore, checking the validity of the
assumptions becomes very critical for assuring the validity of the method used.
In order to correlate a given quantitative phenotype with a gene, the method of analysis of
variance (ANOVA) is most commonly used. However, we have observed a paradox when
applying the ANOVA method to test the null hypothesis H0: the genotypes of a gene G′ do not
discriminate the phenotype P, which is equivalent to H0: Δ = 0, where Δ is the linkage
disequilibrium between G′ and the causative gene locus (G) of the phenotype P. For the
applicability of ANOVA for testing H0: Δ = 0, we need to assume homogeneity of variances of the
phenotype P across all genotypes of G′ (i.e., the groups formed by the genotypes of G′ have to
have the same variance). However, ANOVA tests equality of means, and the assumption of
homogeneity of variances holds only if Δ = 0, under which the means are also all equal, which is
what we are testing. This is a paradoxical situation. Using simulation to compare its power with an
alternative method, Bartlett’s test, would give scientists a better idea about the performance of
ANOVA. Research carried out on this problem would provide quantitative geneticists with a
good understanding of the ANOVA method vis-à-vis Bartlett’s test in this context.
The second half of the dissertation focuses on Tests of the Hardy-Weinberg Equilibrium
(HWE). HWE is one of the most important assumptions to be checked in genetic analysis. In
population genetics, the HWE states that, under certain conditions, after one generation of random
mating, the genotype frequencies at a single gene locus will attain a set of specific equilibrium
values. The most commonly used method for testing HWE is the Chi-squared statistic, which does
not discriminate homozygote excess from heterozygote deficiency because of its two-sided nature.
Based on this deficiency, we propose an alternative method and use simulations to assess its
power. For more than two alleles, testing the HWE is computationally complex. We propose a
new method of testing the HWE for multi-allelic cases by reducing it to several bi-allelic cases.
We expect quantitative geneticists to find our routines useful when examining issues surrounding
the HWE.
3 On testing that genotypes at a marker locus are associated with a given phenotype
3.1 Background: Traditional Approach
Suppose that we are investigating association between a particular quantitative phenotype
and a gene. The focus of genotype-phenotype mapping is to identify if the candidate gene or genes
have some bearing on this given phenotype. Let P denote the given phenotype, which will be
measured for each participant. In addition, a blood sample will be collected for determining the
genotype of each participant in a randomly selected sample. It is believed that there is a gene G,
bi-allelic with alleles M and m, which impacts the phenotype. The genotypes MM, Mm, and mm
of the gene G influence the phenotype in the following sense:
P | G = MM ~ N(λ, σ2);
P | G = Mm ~ N(0, σ2);
P | G = mm ~ N(−λ, σ2).
for some λ ≠ 0, where G stands for ‘Genotype.’ In the subpopulation of those with genotype MM
the phenotype P is normally distributed with mean λ and variance σ2, in the subpopulation of those
with genotype Mm the phenotype P is normally distributed with mean 0 and variance σ2, and in
the subpopulation of those with genotype mm the phenotype is normally distributed with mean −λ and
variance σ2. This is essentially an additive model of allelic effects of the causative gene (G) on the
phenotype P.
Let PM², 2PM Pm, and Pm² be the relative frequencies of the genotypes MM, Mm, and mm,
respectively, in the population, where the probabilities (allele frequencies) PM and Pm satisfy the
condition PM + Pm = 1. We are assuming the gene G is in Hardy–Weinberg equilibrium, with PM
and Pm representing the frequencies of alleles M and m at the causative locus G in the entire
population (Li, 1976).
Unconditionally, P has a distribution which is a mixture of normal distributions. More
precisely,
P ~ PM² N(λ, σ²) + 2PM Pm N(0, σ²) + Pm² N(−λ, σ²).
The joint distribution of P and G is:
f(P, MM) = PM² N(λ, σ²)
f(P, Mm) = 2PM Pm N(0, σ²)
f(P, mm) = Pm² N(−λ, σ²)
Suppose G′ is another gene at a chosen site of the genome. Suppose G′ is also a bi-allelic
gene with alleles A and a. Since we do not know where G is truly located yet, we choose some
gene that is probably located close to the true gene; we call this gene the “Marker.” The question
of interest is whether or not the genotypes AA, Aa, and aa of the Marker locus discriminate the
phenotype. If we have success to find some association between the marker and the phenotype, we
can do more analysis to locate the true gene closer. For this purpose, a sample of n individuals is
selected, their phenotypes measured, and genotypes determined. The phenotype data are classified
according to genotypes. In the literature (Chakraborty,1986), the ANOVA method on the one-way
classified data is used to answer the question raised above. Accepting the null hypothesis of equal
means is tantamount to declaring that G′ is not the gene that discriminates the phenotype. We
notice that the assumptions needed for the validity of ANOVA are not met giving rise to a paradox.
One of the goals of this dissertation is that if traditional ANOVA is used to answer the question,
check its power via simulations. We also compare the ANOVA method with the Bartlett’s test by
9
checking their powers via simulations. A broad conclusion of the investigation is that the ANOVA
method still works.
3.2 Statistical Methods
3.2.1 Analysis of Variance (ANOVA)
In general, experimenters assume that markers segregate randomly. Once the data are
collected on each individual, statistical associations between the markers and quantitative traits are
established through statistical approaches that range from simple techniques, such as analysis of
variance (ANOVA), to models that include multiple markers and interactions. The simpler
statistical approaches tend to be methods of Quantitative Trait Locus (QTL) detection that assess
differences in the phenotypic means for single-marker genotypic classes. The actual localization of
a QTL relies on an estimated genetic map with known distances between markers and on evaluation
of a likelihood function that is maximized over the established parameter space.
Typically, the null hypothesis tested is that the mean of the trait value is independent of the
genotype at a particular marker. The null hypothesis is rejected when the test statistic is larger than
a critical value, and the implication is that the QTL is linked to the marker under investigation.
Single-marker analyses investigate individual markers independently without reference to their
position or order.
The data classified according to the genotypes of G′ have the following structure (the Pij's
stand for phenotypic values):

Genotype      Phenotypic values
AA            P11, P12, ..., P1,n1
Aa            P21, P22, ..., P2,n2
aa            P31, P32, ..., P3,n3

n = total sample size = n1 + n2 + n3
The ANOVA technique can be used to test the equality of the group means corresponding to the
genotypes.
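As a concrete illustration of how this layout is analyzed (a sketch only, not the code of Appendix 2), a one-way ANOVA could be run in SAS as below; the dataset name pheno and the variable names genotype and p are hypothetical placeholders.

proc glm data=pheno;
  class genotype;        /* marker genotype: AA, Aa, aa                   */
  model p = genotype;    /* F-test of equality of the genotypic group means */
run;
quit;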
3.2.2 Bartlett’s Test
As we shall see later, Bartlett’s test can also be used to test homogeneity of means for the
data set-up of Section 3.2.1. In this section, we provide a brief introduction to the Bartlett’s test.
3.2.2.1 Introduction
Bartlett's test (Snedecor and Cochran, 1983) is used to test if k normal populations have
equal variances. Equal variances across samples is called homoscedasticity or homogeneity of
variances. Some statistical tests, for example the analysis of variance (ANOVA), assume that
variances are equal across groups or samples. The Bartlett’s test can be used to verify that
assumption.
Bartlett's test is sensitive to departures from normality. That is, if the samples come from
non-normal distributions, then Bartlett's test may simply be testing for non-normality. The Levene
test (Milliken and Johnson, 1989) is an alternative to the Bartlett’s test that is less sensitive to
departures from normality. Here, we are investigating mixed normally distributed data and
focusing only on the Bartlett’s test.
3.2.2.2 Definition
The Bartlett’s test statistic is designed to test for equality of variances across groups
against the alternative that variances are unequal for at least two groups.
The hypotheses are:
H0: σ1 = σ2 = ... = σk
Ha: σi ≠ σj for at least one pair (i, j)
The test statistic is given by:

T = [ (N − k) ln sp² − Σ (Ni − 1) ln si² ] / [ 1 + ( Σ 1/(Ni − 1) − 1/(N − k) ) / (3(k − 1)) ]

where the sums run over the k groups, i = 1, 2, ..., k. In the above, si² is the sample variance of the
ith group, N is the total sample size, Ni is the sample size of the ith group, k is the number of
groups, and sp² is the pooled variance. The pooled variance is a weighted average of the group
variances and is defined as:

sp² = Σ (Ni − 1) si² / (N − k)

The variances are judged to be unequal if T > χ²(α, k−1), where χ²(α, k−1) is the upper 100α
percentile critical value of the chi-squared distribution with k − 1 degrees of freedom at the
significance level α.
The above formula for the critical region follows the convention that χ²α is the upper
critical value and χ²(1−α) is the lower critical value from the chi-squared distribution.
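To make the formula concrete, T can be computed from the group summaries in a short SAS sketch such as the one below; the dataset pheno and the variables genotype and p are hypothetical names, and this is not the code used in the dissertation (PROC GLM can also report Bartlett's test directly through MEANS genotype / HOVTEST=BARTLETT).

proc means data=pheno noprint nway;
  class genotype;
  var p;
  output out=grp n=ni var=si2;      /* per-group size and sample variance */
run;

data bartlett;
  set grp end=last;
  k      + 1;                       /* number of groups                  */
  nsum   + ni;                      /* total sample size N               */
  ss     + (ni - 1)*si2;
  sumlog + (ni - 1)*log(si2);
  suminv + 1/(ni - 1);
  if last then do;
    sp2 = ss/(nsum - k);                                     /* pooled variance */
    T   = ((nsum - k)*log(sp2) - sumlog)
          / (1 + (suminv - 1/(nsum - k))/(3*(k - 1)));
    p_value = 1 - probchi(T, k - 1);                         /* chi-square, k-1 df */
    put T= p_value=;
  end;
run;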
3.2.3 Linkage Disequilibrium Coefficient and Joint distribution
Let Δ be the linkage disequilibrium coefficient between the genes G and G′. Since G is
unknown, Δ is unknown. The joint distribution of the genotypes of G and G′ is given in the
following table:
Table 3-1   Joint distribution of the genotypes of G and G′

                                      G′
G                AA              Aa                             aa             Marginal frequencies
MM               PMA²            2 PMA PMa                      PMa²           PM²
Mm               2 PMA PmA       2 PMA Pma + 2 PMa PmA          2 PMa Pma      2 PM Pm
mm               PmA²            2 PmA Pma                      Pma²           Pm²
Marginal
frequencies      PA²             2 PA Pa                        Pa²            1
Using the concept of haplotype frequencies, the entities in the joint distribution are defined by:
PMA = PMPA + Δ;
PMa = PMPa − Δ;
PmA = PmPA − Δ;
Pma = PmPa + Δ.
where PA and Pa are the frequencies of alleles A and a of the marker locus (G′) in the entire
population. The conditional distribution of P given the genotypes of G and G′ depends only on the
genotype of the true gene G. More precisely,
P | G = MM, G′ = AA ~ N(λ, σ2);
P | G = MM, G′ = Aa ~ N(λ, σ2);
P | G = MM, G′ = aa ~ N(λ, σ2);
P | G = Mm, G′ = AA ~ N(0, σ2);
P | G = Mm, G′ = Aa ~ N(0, σ2);
P | G = Mm, G′ = aa ~ N(0, σ2);
P | G = mm, G′ = AA ~ N(-λ, σ2);
P | G = mm, G′ = Aa ~ N(-λ, σ2);
P | G = mm, G′ = aa ~ N(-λ, σ2).
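As a small numerical sketch of the haplotype-frequency relations above (the values of PM, PA, and Δ are arbitrary illustrations, and this code is not part of the appendices):

data haplo;
  PM = 0.6;  PA = 0.7;  Delta = 0.10;          /* illustrative values only */
  Pm = 1 - PM;  Pa = 1 - PA;
  PMA = PM*PA + Delta;   PMa = PM*Pa - Delta;
  PmA = Pm*PA - Delta;   Pma = Pm*Pa + Delta;
  /* a negative value here signals an inadmissible (PM, PA, Delta) choice,
     the situation noted at the end of Section 3.5.2 */
  put PMA= PMa= PmA= Pma=;
run;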
3.2.4 Joint distribution of the Phenotype and Genotypes of G and G′
The joint distribution of the Phenotype, genotypes of G and G′ is given as follows:
f(P, MM, AA) = PMA² N(λ, σ²)
f(P, MM, Aa) = 2 PMA PMa N(λ, σ²)
f(P, MM, aa) = PMa² N(λ, σ²)
f(P, Mm, AA) = 2 PMA PmA N(0, σ²)
f(P, Mm, Aa) = (2 PMA Pma + 2 PMa PmA) N(0, σ²)
f(P, Mm, aa) = 2 PMa Pma N(0, σ²)
f(P, mm, AA) = PmA² N(−λ, σ²)
f(P, mm, Aa) = 2 PmA Pma N(−λ, σ²)
f(P, mm, aa) = Pma² N(−λ, σ²)
The joint pdf of phenotype and marker G′ is given by:
f(P, AA) = PMA² N(λ, σ²) + 2 PMA PmA N(0, σ²) + PmA² N(−λ, σ²)
f(P, Aa) = 2 PMA PMa N(λ, σ²) + (2 PMA Pma + 2 PMa PmA) N(0, σ²) + 2 PmA Pma N(−λ, σ²)
f(P, aa) = PMa² N(λ, σ²) + 2 PMa Pma N(0, σ²) + Pma² N(−λ, σ²)
3.2.5 Conditional Expectations and Variances
From these joint distributions, the following conditional distributions are derived:
(P | G′ ~ AA) = [ PMA² N(λ, σ²) + 2 PMA PmA N(0, σ²) + PmA² N(−λ, σ²) ] / PA²
(P | G′ ~ Aa) = [ PMA PMa N(λ, σ²) + (PMA Pma + PMa PmA) N(0, σ²) + PmA Pma N(−λ, σ²) ] / (PA Pa)
(P | G′ ~ aa) = [ PMa² N(λ, σ²) + 2 PMa Pma N(0, σ²) + Pma² N(−λ, σ²) ] / Pa²
The conditional expectations and variances are summarized below (derivations are given in
Appendix 1):

E(P | G′ = AA) = λ(PM − Pm) + 2Δλ/PA

Var(P | G′ = AA) = σ² + 2λ²PM Pm + 2Δλ²(Pm − PM)/PA − 2Δ²λ²/PA²

E(P | G′ = Aa) = λ(PM − Pm) − Δλ(PA − Pa)/(PA Pa)

Var(P | G′ = Aa) = σ² + 2λ²PM Pm + Δλ²(PM − Pm)(PA − Pa)/(PA Pa) − Δ²λ²(PA² + Pa²)/(PA² Pa²)

E(P | G′ = aa) = λ(PM − Pm) − 2Δλ/Pa

Var(P | G′ = aa) = σ² + 2λ²PM Pm + 2Δλ²(PM − Pm)/Pa − 2Δ²λ²/Pa²
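The formulas can be checked quickly by Monte Carlo. The SAS sketch below simulates the mixture defining P | G′ = AA for one arbitrary choice of PM, PA, Δ, λ, and σ and compares the sample mean and variance with E(P | G′ = AA) and Var(P | G′ = AA); it is illustrative only and is not the derivation of Appendix 1.

data check_AA;
  call streaminit(2007);
  PM = 0.6;  PA = 0.7;  Delta = 0.10;  lambda = 1;  sd = 1;   /* sd = sigma */
  PMA = PM*PA + Delta;   PmA = (1 - PM)*PA - Delta;
  do i = 1 to 200000;
    /* within the G' = AA slice, G = MM, Mm, mm with probabilities
       PMA**2/PA**2, 2*PMA*PmA/PA**2, PmA**2/PA**2                  */
    g = rand('TABLE', PMA**2/PA**2, 2*PMA*PmA/PA**2);
    if      g = 1 then p = rand('NORMAL',  lambda, sd);
    else if g = 2 then p = rand('NORMAL',  0,      sd);
    else               p = rand('NORMAL', -lambda, sd);
    output;
  end;
run;

proc means data=check_AA mean var;
  var p;   /* compare with E(P | G' = AA) and Var(P | G' = AA) above */
run;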
3.3 Some Facts and Paradox
3.3.1 Some Facts
1. If Δ ≠ 0, the conditional means as well as the conditional variances are different. The
genotypes of G′ do discriminate the phenotype P.
2. If Δ = 0, all the conditional expectations and all the conditional variances are equal. The
genotypes of G′ do not discriminate the phenotype P.
3. If PA = 0.5 and PM = 0.5, all the conditional variances are equal.
4. Each conditional distribution of P | Genotype of G′ is a mixed normal.
3.3.2 The Paradox
We will formulate the null hypothesis H0: the genotypes of G′ do not discriminate the
phenotype P, which is equivalent to H0: Δ = 0. For applicability of ANOVA for testing H0: Δ = 0,
we need to assume homogeneity of variances (i.e., the groups formed by the genotypes of G′ have
to have the same variance). The assumption of homogeneity of variances holds if Δ = 0, which
implies that the means are also all equal, which is what we are testing using the ANOVA method.
This is a paradoxical situation.
3.4 Efficacy of ANOVA
Despite the paradoxical situation outlined above, in this section we examine how well an
application of ANOVA to the one-way classified data achieves the objective. This study will be
conducted by resorting to simulations.
The SAS code of simulations is presented in Appendix 2.
Scenario 1. The first situation we considered has the following specifications for the parameters
involved:
%sim (Δ=0, λ=1, σ2=1, Pm=0.5, Pa=0.5, sample=200, alphalevel=0.05);
Under this scenario, the conditional distributions are tabulated below:
Table 3-2   Conditional distributions under Scenario 1

Conditional distribution                                           Mean    Variance
1. P | G′ = AA    ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)                  0       1.5
2. P | G′ = Aa    ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)                  0       1.5
3. P | G′ = aa    ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)                  0       1.5
4. P | G = MM     N(1,1)                                           1       1
5. P | G = Mm     N(0,1)                                           0       1
6. P | G = mm     N(−1,1)                                          −1      1
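For reference, the common mean and variance in rows 1–3 follow from the formulas of Section 3.2.5 with Δ = 0:
E(P | G′) = λ(PM − Pm) = 0 and Var(P | G′) = σ² + 2λ²PM Pm = 1 + 2(1)²(0.5)(0.5) = 1.5.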
The common conditional probability density function under distributions 1, 2, and 3 above is
plotted in Figure 3-1:
Figure 3-1   Common conditional pdf of P | G′ under Scenario 1
Distribution: ¼ N(−1,1) + ½ N(0,1) + ¼ N(1,1)
This plot does not show trimodality as expected because the assumed variance is as large as the
allelic effect (i.e., σ2 = 1 and λ = 1). In consequence, the three modes are smoothed out. If we
assume a smaller variance, say 0.4, and keep λ the same (λ = 1), the conditional
distribution with three modes is shown as follows:
Distribution: ¼ N(−1,0.4) + ½ N(0,0.4) + ¼ N(1,0.4)
For comparison, when we choose σ2 = 0.5, the modes are smoothed out as follows:
Distribution: ¼ N(−1,0.5) + ½ N(0,0.5) + ¼ N(1,0.5)
The conditional probability density functions under Distributions 4, 5, and 6 of Table 3-2 are
plotted in Figure 3-2:
Figure 3-2   Common conditional pdf of P | G under Scenario 1
Distributions: N(−1,1); N(0,1); N(1,1)
We now focus on testing the hypothesis H0: Δ = 0, which is true under Scenario 1. We generate a
random sample of size n = 200 from the following joint distribution of P and G′.
f (P, AA) = 1/16 N(1,1) + 1/8 N(0,1) + 1/16 N(−1,1)
f (P, Aa) = 1/8 N(1,1) + 1/4 N(0,1) + 1/8 N(−1,1)
f (P, aa) = 1/16 N(1,1) + 1/8 N(0,1) + 1/16 N(−1,1)
−∞ < P < ∞
Simulations are conducted according to the following steps.
Step 1. Draw a random sample of size 200 from the following trinomial distribution:
G′:    AA     Aa     aa
Pr:    1/4    1/2    1/4
Let n1 , n2 , n3 be the observed frequencies of the genotypes. Obviously, n1 + n2 + n3 = 200 .
Step 2. For each individual, whatever the genotype G′ = AA, Aa, or aa, simulate the phenotype from the mixed normal distribution
¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1).
Step 3. Arrange the genotype-phenotype data generated in Steps 1 and 2 in the following way.

Genotype      Phenotypic values
AA            P11, P12, ..., P1,n1
Aa            P21, P22, ..., P2,n2
aa            P31, P32, ..., P3,n3

n = total sample size = n1 + n2 + n3
Step 4. Carry out Analysis of Variance and Bartlett’s test on the phenotype data of Step 3. Use
level α = 0.05 . Note down whether or not H0 is rejected under each test.
Step 5. Repeat steps 1, 2, 3, and 4 10,000 times.
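A compressed SAS sketch of Steps 1 and 2 for a single replicate is given below; the full simulation, including the tests of Step 4 and the 10,000 repetitions, is the code of Appendix 2, and the dataset and variable names used here are placeholders.

data one_rep;
  call streaminit(1);
  do i = 1 to 200;
    gprime = rand('TABLE', 0.25, 0.50);            /* Step 1: 1=AA, 2=Aa, 3=aa       */
    m      = rand('TABLE', 0.25, 0.50);            /* Step 2: mixture component for  */
    if      m = 1 then p = rand('NORMAL',  1, 1);  /* 1/4 N(1,1) + 1/2 N(0,1)        */
    else if m = 2 then p = rand('NORMAL',  0, 1);  /*            + 1/4 N(-1,1);      */
    else               p = rand('NORMAL', -1, 1);  /* drawn independently of gprime  */
    output;                                        /* because Delta = 0 in Scenario 1 */
  end;
run;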
Summary statistics of the simulations:
1. Empirical size under the ANOVA test = (No. of times H0 is rejected)/10,000 = 0.052
2. Empirical size under the Bartlett’s test = (No. of times H0 is rejected)/10,000 = 0.059
The observed size under the Bartlett’s test is slightly larger than that under the ANOVA method.
However, the observed sizes are not significantly different from the nominal size 0.05.
Scenario 2. We now look at scenarios when the null hypothesis H0: Δ = 0 is not true.
We have the following specifications for the parameters involved:
%sim (Δ=0.1875, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);
Note that
PMA = PMPA + Δ = 0.4375
PMa = PMPa − Δ = 0.0625
PmA = PmPA − Δ = 0.0625
Pma = PmPa + Δ = 0.4375
The conditional distributions of P | G′ are given by
(P | G′ ~ AA) = [ PMA² N(1,1) + 2 PMA PmA N(0,1) + PmA² N(−1,1) ] / PA²
(P | G′ ~ Aa) = [ PMA PMa N(1,1) + (PMA Pma + PMa PmA) N(0,1) + PmA Pma N(−1,1) ] / (PA Pa)
(P | G′ ~ aa) = [ PMa² N(1,1) + 2 PMa Pma N(0,1) + Pma² N(−1,1) ] / Pa²
The conditional means and variances are given by:

E(P | G′ = AA) = λ(PM − Pm) + 2Δλ/PA = 0.75

Var(P | G′ = AA) = σ² + 2λ²PM Pm + 2Δλ²(Pm − PM)/PA − 2Δ²λ²/PA² = 1.21875

E(P | G′ = Aa) = λ(PM − Pm) − Δλ(PA − Pa)/(PA Pa) = 0

Var(P | G′ = Aa) = σ² + 2λ²PM Pm + Δλ²(PM − Pm)(PA − Pa)/(PA Pa) − Δ²λ²(PA² + Pa²)/(PA² Pa²) = 1.21875

E(P | G′ = aa) = λ(PM − Pm) − 2Δλ/Pa = −0.75

Var(P | G′ = aa) = σ² + 2λ²PM Pm + 2Δλ²(PM − Pm)/Pa − 2Δ²λ²/Pa² = 1.21875
Under this scenario, the conditional distributions are tabulated below:
Table 3-3   Conditional distributions under Scenario 2

Conditional distribution                                                        Mean     Variance
1. P | G′ = AA    0.765625 N(1,1) + 0.21875 N(0,1) + 0.015625 N(−1,1)           0.75     1.21875
2. P | G′ = Aa    0.109375 N(1,1) + 0.78125 N(0,1) + 0.109375 N(−1,1)           0        1.21875
3. P | G′ = aa    0.015625 N(1,1) + 0.21875 N(0,1) + 0.765625 N(−1,1)           −0.75    1.21875
4. P | G = MM     N(1,1)                                                        1        1
5. P | G = Mm     N(0,1)                                                        0        1
6. P | G = mm     N(−1,1)                                                       −1       1
These three conditional densities 1, 2, and 3 of Table 3-3 are plotted in Figure 3-3.
Figure 3-3   Conditional pdf of P | G′ under Scenario 2
Distributions: 0.765625 N(1,1) + 0.21875 N(0,1) + 0.015625 N(−1,1);
0.109375 N(1,1) + 0.78125 N(0,1) + 0.109375 N(−1,1);
0.015625 N(1,1) + 0.21875 N(0,1) + 0.765625 N(−1,1)
As a contrast, the conditional densities 4, 5, and 6 of P|G were plotted in Figure 3-2.
In this scenario, the null hypothesis H0: Δ = 0 is not true. A random sample of size n = 200 is
generated from the joint distribution of P and G′ given by
f(P, AA) = PMA² N(1,1) + 2 PMA PmA N(0,1) + PmA² N(−1,1)
f(P, Aa) = 2 PMA PMa N(1,1) + (2 PMA Pma + 2 PMa PmA) N(0,1) + 2 PmA Pma N(−1,1)
f(P, aa) = PMa² N(1,1) + 2 PMa Pma N(0,1) + Pma² N(−1,1)
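One way to draw such a sample (a sketch, not the Appendix 2 code) is to generate the (G, G′) genotype pair from the cell probabilities of Table 3-1 and then draw the phenotype from the normal component determined by G; the dataset and variable names are placeholders.

data scen2;
  call streaminit(2);
  PMA = 0.4375;  PMa = 0.0625;  PmA = 0.0625;  Pma = 0.4375;
  do i = 1 to 200;
    /* nine (G, G') cells in the order MM-AA, MM-Aa, MM-aa, Mm-AA, ... ;
       the last cell probability (mm-aa) is implicit                       */
    cell = rand('TABLE',
                PMA**2, 2*PMA*PMa, PMa**2,
                2*PMA*PmA, 2*PMA*Pma + 2*PMa*PmA, 2*PMa*Pma,
                PmA**2, 2*PmA*Pma);
    g      = ceil(cell/3);          /* 1 = MM, 2 = Mm, 3 = mm */
    gprime = mod(cell - 1, 3) + 1;  /* 1 = AA, 2 = Aa, 3 = aa */
    if      g = 1 then p = rand('NORMAL',  1, 1);
    else if g = 2 then p = rand('NORMAL',  0, 1);
    else               p = rand('NORMAL', -1, 1);
    output;
  end;
run;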
With the protocol outlined under Scenario 1, we calculate the following entities:
1. Empirical power under the ANOVA test = (No. of times H0 is rejected)/10,000 = 0.27
2. Empirical power under the Bartlett’s test = (No. of times H0 is rejected)/10,000 = 0.14
Scenario 3.
We consider the following specifications for the parameters involved:
%sim (Δ=0.125, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);
Under this scenario, the conditional distributions are tabulated below:
Conditional distribution                                            Mean    Variance
1. P | G′ = AA    0.5625 N(1,1) + 0.375 N(0,1) + 0.0625 N(−1,1)     0.5     1.375
2. P | G′ = Aa    0.1875 N(1,1) + 0.625 N(0,1) + 0.1875 N(−1,1)     0       1.375
3. P | G′ = aa    0.0625 N(1,1) + 0.375 N(0,1) + 0.5625 N(−1,1)     −0.5    1.375
The graphs of the conditional distributions of P | G′ are given in Figure 3-4.
Figure 3-4   Conditional pdf of P | G′ under Scenario 3
Distributions: 0.0625 N(1,1) + 0.375 N(0,1) + 0.5625 N(−1,1);
0.1875 N(1,1) + 0.625 N(0,1) + 0.1875 N(−1,1);
0.5625 N(1,1) + 0.375 N(0,1) + 0.0625 N(−1,1)
Simulations are conducted to calculate empirical powers under the ANOVA and Bartlett’s tests.
1. Empirical power under the ANOVA test = (No. of times H0 is rejected)/10,000 = 0.225
2. Empirical power under the Bartlett’s test = (No. of times H0 is rejected)/10,000 = 0.10
Scenario 4.
We consider the following specifications for the parameters involved:
%sim (Δ=0.0625, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);
Under this scenario, the conditional distributions are tabulated below:
Conditional distribution                                                 Mean     Variance
1. P | G′ = AA    0.390625 N(1,1) + 0.46875 N(0,1) + 0.140625 N(−1,1)    0.25     1.46875
2. P | G′ = Aa    0.234375 N(1,1) + 0.53125 N(0,1) + 0.234375 N(−1,1)    0        1.46875
3. P | G′ = aa    0.140625 N(1,1) + 0.46875 N(0,1) + 0.390625 N(−1,1)    −0.25    1.46875
The graphs of the conditional distributions of P | G′ are given in Figure 3-5.
Figure 3-5   Conditional pdf of P | G′ under Scenario 4
Distributions: 0.140625 N(1,1) + 0.46875 N(0,1) + 0.390625 N(−1,1);
0.234375 N(1,1) + 0.53125 N(0,1) + 0.234375 N(−1,1);
0.390625 N(1,1) + 0.46875 N(0,1) + 0.140625 N(−1,1)
Simulations are conducted to calculate empirical powers under the ANOVA and Bartlett’s tests.
1. Empirical power under the ANOVA test = (No. of times H0 is rejected)/10,000 = 0.194
2. Empirical power under the Bartlett’s test = (No. of times H0 is rejected)/10,000 = 0.086
The results of the simulations under each of Scenarios 1, 2, 3, and 4 are summarized below
(10,000 simulations).
Specifications (common parameters): λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alpha level=0.05

Table 3-4   Summarized results of simulations

Δ          Empirical power
           ANOVA      Bartlett
0          0.059      0.052
0.0625     0.194      0.086
0.125      0.225      0.101
0.1875     0.273      0.142
Conclusions: We have considered four sets of parameter values involved in the simulations. In
one set of simulations, the observed sizes are not significantly different from the nominal size of
0.05. In all other cases, the observed powers are significantly different and the ANOVA test is
superior to the Bartlett’s test. In all the scenarios, homogeneity of variances holds. The genotypic
distributions are not normal but mixed normal. The ANOVA procedure is robust to violation of
the normality assumption but the Bartlett’s test is not.
3.5 Power Comparison of ANOVA and Bartlett’s test
Checking the conditional expectations and variances, we know that the assumption of
homogeneity of variances needed to carry out ANOVA is violated. The normality assumption required for
the applicability of both the ANOVA and Bartlett tests is also violated. We used simulations (10,000
replications) to compare the power of these two tests in general. Please refer to Appendix 3 for the SAS code.
Selection of parameter values for the simulations (a sketch of one admissible draw is given after this list):
• Allele frequencies PM and PA are randomly selected: PM ∈ [0, 1], PA ∈ [0, 1]
• Disequilibrium values (Δ) are randomly selected: Δ ∈ [0, 0.2]
• Significance level = 5%
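The sketch below draws one parameter configuration. Configurations that yield a negative haplotype frequency cannot generate data; the sketch simply redraws, whereas the actual program (Appendix 3) ignores such configurations, as noted at the end of Section 3.5.2.

data one_draw;
  call streaminit(35);
  do until (min(PMA, PMa, PmA, Pma) >= 0);
    PM    = rand('UNIFORM');               /* PM in [0, 1]      */
    PA    = rand('UNIFORM');               /* PA in [0, 1]      */
    Delta = 0.2*rand('UNIFORM');           /* Delta in [0, 0.2] */
    PMA = PM*PA + Delta;         PMa = PM*(1 - PA) - Delta;
    PmA = (1 - PM)*PA - Delta;   Pma = (1 - PM)*(1 - PA) + Delta;
  end;
  put PM= PA= Delta=;
run;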
In Section 3.4, PM and PA are fixed at 0.5. Here we randomly select their frequencies from
0 to 1. We know that the genotypes MM, Mm, and mm of the gene G discriminate the phenotype
in the following sense:
P | G = MM ~ N(λ, σ2);
P | G = Mm ~ N(0, σ2);
P | G = mm ~ N(-λ, σ2).
In addition, we need to choose λ, σ2 , and n in the simulations.
We simulated many scenarios under different choices of parameter values for a
comparison of powers. More specifically, we look at:
• Same mean (λ), different variance (σ2), same sample size (n);
• Different mean (λ), same variance (σ2), same sample size (n);
• Same mean (λ), same variance (σ2), but different sample size (n).
The simulations showed that the power of both tests increases as the sample size (n) increases,
especially when Δ is low (Δ ∈ [0, 0.2]), and that the power of both tests does not depend much on
the variance (σ2 ∈ [1, 50]); what matters most is the mean (λ). The performances of these
two tests differ under different choices of mean (λ). We report two special cases.
3.5.1 Different choices of mean (λ)
1. Let λ = 1, n = 200 and σ 2 = 1.
The power of both tests is low (<30%), but the ANOVA method has better power than
the Bartlett’s test. See Figure 3-6.
Figure 3-6   Power comparison of ANOVA and Bartlett’s test, when λ = 1
2. Let λ = 50, n = 200 and σ2 = 1.
When the mean (λ) gets larger, both tests have power between 0.6 and 0.75 for Δ ≥ 0.06. The
ANOVA method is still better than the Bartlett’s test, but the difference is not that wide. See
Figure 3-7.
Figure 3-7   Power comparison of ANOVA and Bartlett’s test, when λ = 50

3.5.2 Conclusion
In testing that genotypes at a marker locus are associated with a certain quantitative
phenotype, the ANOVA method is widely used. However, we showed that the assumptions
required for the validity of the ANOVA method are violated. First, each conditional distribution is a
mixed normal rather than a normal distribution, and the homogeneity of variances is violated.
Homogeneity of variances holds if the disequilibrium (Δ) is zero or both alleles at the marker locus
are equi-probable. If Δ = 0, all conditional means are equal, which is the null hypothesis we are
testing.
Such a paradox triggered our interest to question the appropriateness of using ANOVA.
An alternative procedure for testing Δ = 0 is the Bartlett’s test. The assumptions needed for the
validity of the Bartlett’s test are not met either: the underlying genotypic populations are not normal
but mixed normal. We want to examine the appropriateness of using the Bartlett’s test.
With 10,000 replications of simulations, we compared the power of both tests under a
variety of specifications of parameter values. We found that, generally speaking, ANOVA
performed better than the Bartlett’s test. With higher values of the allelic effect (i.e., λ,
which represents the mean of the underlying normal distribution), the power of the Bartlett’s test
gradually approached that of ANOVA. However, the observed size of the Bartlett’s
test is higher than the nominal level, and this bias increases as λ increases. Even though the
homogeneity of variances is violated, the normality assumption does not hold, and the application
of ANOVA is therefore not completely correct from a statistical point of view, the ANOVA method is still
robust, and its power is better than that of the Bartlett’s test.
Besides, Bartlett’s test is sensitive to departures from normality, and each conditional
distribution here is a mixed normal instead of a normal distribution. If Δ is small, we need a very large
sample for the ANOVA test to have decent power.
Is ANOVA intrinsically more powerful than the Bartlett’s test? The answer is yes, because
ANOVA is a test for centrality based on the first moment and Bartlett’s test is a test for equality of
variances, which is about the second moment. The sampling variance of the first moment is smaller
than the sampling variance of the second moment. However, as our simulations show, for smaller
Δ, Bartlett’s test appears to have a higher power than ANOVA. This is so because with smaller Δ,
the distributions remain mixed normals, but the means of the component distributions are more
nearly equal, making their differences less detectable by ANOVA. Nevertheless, for such small Δ, the
absolute power of either of these two procedures is small.
From Figure 3-6 and Figure 3-7, it appears that the Bartlett test is biased, i.e., it fails to
achieve the nominal size. When simulations are conducted with PM ∈ [0, 1], PA ∈ [0, 1], and Δ
∈ [0, 0.2] randomly generated, certain configurations of PMA, PMa, PmA, and Pma could lead to
negative constituents. In such situations, data can not be generated. In the program, whenever such
a configuration arises, it is ignored. This may have some bearing on the bias of the Bartlett’s test.
4 Hardy-Weinberg Equilibrium in the case of two alleles
4.1 Introduction
4.1.1 What is Hardy-Weinberg Equilibrium?
Hardy-Weinberg Equilibrium is one of the most important assumptions to be checked in
genetic analysis. In population genetics, the Hardy–Weinberg equilibrium (HWE), or Hardy–
Weinberg law, named after G. H. Hardy and W. Weinberg, states that, under certain conditions,
after one generation of random mating, the genotype frequencies at a single gene locus will attain
a set of particular equilibrium values. It also specifies that those equilibrium frequencies can be
represented as simple functions of the allele frequencies at that locus.
We focus on a diploid organism and a specific gene locus with alleles A and a. The entity
A is the dominant allele and a is the recessive allele for a certain trait. An intuitive question would
be “Will the dominant character eventually dominate the whole population?” Hardy and Weinberg
discovered independently that dominance will not happen from random mating.
Let the population of organisms be infinite or very large. Let the frequencies (first
generation) of the genotypes be given by
AA: r,   Aa: 2s,   aa: t
with r, s, t ≥ 0 and r + 2s + t = 1. Suppose the population undergoes random mating. The
genotype frequencies of the offspring population (second generation) are given by
AA: (r + s)²,   Aa: 2(r + s)(s + t),   aa: (s + t)²
An important question is when the genotype frequencies of the two generations are the same, i.e.,
r = (r + s)²,   2s = 2(r + s)(s + t),   and   t = (s + t)².
Note that r = (r + s)² ⇔ r = r² + 2rs + s²
⇔ r = r(r + 2s) + s²
⇔ r = r(1 − t) + s²
⇔ s² = rt
Thus the two sets of genotype frequencies are equal if and only if s² = rt.
Interestingly, if we denote the genotype frequencies of the second generation by
r1 = (r + s)²,   2s1 = 2(r + s)(s + t),   and   t1 = (s + t)²,
it follows that s1² = r1 t1. If random mating occurs in the second generation, the genotype
frequencies in the third generation are identical to those in the second generation. Under random
mating, subsequent generations will have genotype frequencies identical to those of the second
generation. This is essentially the gist of the Hardy-Weinberg Law.
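The argument can also be verified numerically: starting from arbitrary genotype frequencies (r, 2s, t), one round of random mating already produces frequencies satisfying s1² = r1 t1. A minimal SAS sketch with made-up starting values:

data hwe_onegen;
  r = 0.5;  s = 0.1;  t = 0.3;        /* AA = r, Aa = 2s, aa = t; r + 2s + t = 1 */
  pA = r + s;   pa = s + t;           /* allele frequencies                      */
  r1 = pA**2;   s1 = pA*pa;   t1 = pa**2;
  check = s1**2 - r1*t1;              /* equals 0: second generation is in HWE   */
  put r1= s1= t1= check=;
run;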
It should be noted that for the second generation, or for that matter, any subsequent
generation, the allele frequencies are given by:
A: r + s,   a: s + t
In view of this observation, one can now say that the population is in Hardy-Weinberg
equilibrium (HWE). More informally, if the allele frequencies are p and q, then the genotype
frequencies are p², 2pq, and q², provided the population is in HWE.
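Looking ahead to the testing problem treated later in this chapter, the usual goodness-of-fit Chi-squared test of HWE in the bi-allelic case can be written as a short SAS data step; the genotype counts below are made-up numbers, and the statistic has one degree of freedom because the allele frequency is estimated from the data. Being a squared statistic, it does not indicate whether the departure is a homozygote excess or a heterozygote deficiency.

data hwe_gof;
  nAA = 380;  nAa = 440;  naa = 180;           /* made-up genotype counts    */
  n  = nAA + nAa + naa;
  pA = (2*nAA + nAa)/(2*n);                    /* estimated allele frequency */
  qa = 1 - pA;
  eAA = n*pA**2;   eAa = 2*n*pA*qa;   eaa = n*qa**2;
  chisq  = (nAA - eAA)**2/eAA + (nAa - eAa)**2/eAa + (naa - eaa)**2/eaa;
  pvalue = 1 - probchi(chisq, 1);
  put chisq= pvalue=;
run;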
4.1.2 Assumptions of HWE
The assumptions governing the Hardy–Weinberg equilibrium (HWE) are that the organism
under consideration:
• is diploid and the trait under consideration is not on a chromosome that has different copy
  numbers for different sexes, such as the X chromosome in humans (i.e., the trait is autosomal);
• is sexually reproducing, either monoecious or dioecious;
• has discrete generations;
• in addition, the population under consideration is idealised, that is:
    • random mating within a single population
    • infinite population size (or sufficiently large so as to minimize the effect of genetic drift)
and experiences:
• no selection;
• no mutation;
• no migration (gene flow).
The first group of assumptions are required for the mathematics involved. It is relatively
easy to expand the definition of HWE to include modifications of these, such as for sex-linked
traits. The other assumptions are inherent in the Hardy-Weinberg principle. A Hardy-Weinberg
population is used as a reference population when discussing various genetic issues.
4.1.3 Departures from the Equilibrium
To analyze departures, we can look at populations to see if they conform to these
numerical patterns. If they differ, we seek the reasons for the difference in some violation of the
Hardy-Weinberg assumptions. Two processes, natural selection and genetic drift, are the most
common and important factors at work in most populations that are not at equilibrium. Inbreeding
and nonrandom mating are also forces of departure.
For example, suppose we find a population in which the recessive allele frequency is
declining over time. We might then investigate whether homozygous recessives are dying earlier.
(Many genetic diseases, such as cystic fibrosis, are due to recessive alleles.) This could be due to
natural selection, in which those that are better adapted to the environment survive longer and
reproduce more frequently.
Suppose we find a population in which there is a smaller-than-expected number of homozygotes of both types and a larger number of heterozygotes. This could be due to heterozygote superiority, where the heterozygote is more fit than either homozygote. In humans, this is the case for the allele causing sickle cell disease, a type of hemoglobinopathy.
Nonrandom mating is another potential source of departure from the Hardy-Weinberg equilibrium. Imagine that two alleles give rise to two very different appearances and individuals choose to mate with those whose appearance is closest to their own. This may lead to divergence of the two groups over time and perhaps ultimately to their separation into two distinct subpopulations.
In very small populations, allele frequencies may change dramatically from one generation to the next, due to the vagaries of mate choice or other random events. For instance, half a dozen individuals with the dominant allele may, by chance, have fewer offspring than half a dozen with the recessive allele. This would have comparatively little effect in a population of one thousand, but it could have a dramatic effect in a population of twenty. Such changes are known as genetic drift.
In this dissertation, we assume inbreeding is the only force for a departure from the Hardy-Weinberg Equilibrium.
4.1.4 Inbreeding Coefficient θ
Inbreeding depression was recognized early by plant and animal breeders (Wright 1977), in zoo populations (Ralls, Brugger and Ballow 1979; Senner 1980), and in the management and restocking of endangered populations in the wild.
Inbreeding
Inbreeding is defined as mating between related individuals. It is also called consanguinity,
meaning "mixing of the blood." Although some plants successfully self-fertilize (the most extreme
case of inbreeding), biological mechanisms are in place in many organisms, from fungi to humans,
to encourage cross-fertilization. In human populations, customs and laws in many countries have
been developed to prevent marriages between closely related individuals (e.g., siblings and first
cousins). Despite these proscriptions, genetic counselors are frequently presented with the question
"If I marry my cousin, what are the chances that we will have a baby who has a disease?" The
answer is that when two partners are related their chance to have a baby with a disease or birth
defect is higher than the background risk in the general population.
Increased Disease Risk
Many genetic diseases are recessive, meaning only people who inherit two disease alleles
develop the disease. Many of us carry several single alleles for genetic diseases. Since close
relatives have more genes in common than unrelated individuals, there is an increased chance that
parents who are closely related will have the same disease alleles and thus have a child who is
homozygous for a recessive disease.
For instance, cousins share approximately one-eighth or 12.5 percent of their alleles. So, at
any locus the chance that cousins share an allele inherited from a common parent is one-eighth.
The chance that their offspring will inherit this allele from both parents, if each parent has one
copy of the allele, is one-fourth. Thus, the risk the offspring will inherit two copies of the same
allele is 1/8 × 1/4, or 1/32, about 3 percent. If this allele is deleterious, then the homozygous child
will be affected by the disease. Overall, the risk associated with having a child affected with a
recessive disease as a result of a first cousin mating is approximately 3 percent, in addition to the
background risk of 3 to 4 percent that all couples face.
Detecting Inbreeding
Unfortunately, inbreeding cannot be detected directly from pedigrees, because pedigrees are not usually available for individuals in these populations. However, the inbreeding coefficient (θ) can be measured indirectly from genotypic data.
This is the probability that two genes at any locus in one individual are identical by
descent (have been inherited from a common ancestor). The more closely related the parents are,
the larger the value of θ is. For example, the coefficient of inbreeding for an offspring of two
siblings is one-fourth (0.25), for an offspring of two half-siblings it is one-eighth (0.125), and for
an offspring of two first cousins it is one-sixteenth (0.0625). (This is a different calculation than
the calculation of shared alleles between cousins, above.)
In general, inbreeding in human populations is rare. The average inbreeding coefficient is
0.03 for the Dunker population in Pennsylvania and 0.04 for islanders on Tristan da Cunha.
Inbreeding occurs in both those populations. Some isolated populations actively avoid inbreeding
and have maintained low average inbreeding coefficients even though they are small. For example,
polar Eskimos have an average inbreeding coefficient that is less than 0.003.
Beneficial changes can also come from inbreeding, and inbreeding is practiced routinely in
animal breeding to enhance specific characteristics, such as milk production or low fat-to-muscle
ratios in cows. However, there can often be deleterious effects of such selective breeding when
genes controlling unselected traits are influenced too. Generations of inbreeding decrease genetic
diversity, and this can be problematic for a species. Some endangered species, which have had
their mating groups reduced to very small numbers, are losing important diversity as a result of
inbreeding.
4.2 Properties of the inbreeding coefficient θ
4.2.1 Formulation of the problem
We first consider the case of a single locus with two alleles and then consider a single locus with three alleles. We cannot examine the entire population to check the equilibrium law. Instead, we take a random sample of n individuals from the population. At a single autosomal locus with two alleles, a diploid can be one of three genotypes: AA, Aa and aa. Let n1, n2, and n3 be the frequencies of the genotypes AA, Aa and aa in the sample. We need to formulate the equilibrium law as a hypothesis to be tested using the data collected.
Consider two alleles, A and a, and let p and q, respectively, be their frequencies, which are unknown in the population. A Punnett square (Table 4-1) can be used to formulate the problem, where the fraction in each cell is equal to the product of the row and column probabilities.
Table 4-1   Punnett square for Hardy-Weinberg Equilibrium

              A      a      Marginal frequencies
  A           p²     pq     p
  a           pq     q²     q
  Marginal
  frequencies p      q      1
In general, let p1, 2q1, and r1 be the unknown genotype frequencies in the population. The
genotype frequencies can be written in the form of a bivariate frequency Table 4-2 as follows:
Table 4-2   Genotype frequencies in the population

              A      a      Marginal frequencies
  A           p1     q1     p
  a           q1     r1     q
  Marginal
  frequencies p      q      1
Thus the joint distribution of the alleles is symmetric with identical marginal frequencies.
Theorem: Given frequencies p1, 2q1, and r1, there exists a number θ such that
p1 = p² + θpq,  q1 = pq − θpq,  and  r1 = q² + θpq.
Proof. Let θ = (p1 − p²)/(pq). This readily implies that p1 = p² + θpq. Likewise,
q1 = p − p1 = p − (p² + θpq) = p(1 − p) − θpq = pq − θpq.
In a similar vein, one can show that r1 = q² + θpq. The number θ is called the inbreeding coefficient.
If θ = 0, the joint distribution matches the one spelled out under the Hardy-Weinberg Equilibrium. The inbreeding coefficient θ measures the extent of departure from the Hardy-Weinberg Equilibrium. Consequently, the joint distribution can be put in the form of Table 4-3:
Table 4-3   Joint distribution with inbreeding coefficient θ

              A           a           Marginal frequencies
  A           p² + θpq    pq − θpq    p
  a           pq − θpq    q² + θpq    q
  Marginal
  frequencies p           q           1

4.2.2 Bounds on θ
Since each entry in the above table is a frequency, each entry has to be greater than or equal to zero. It is easy to find the bounds on θ:
p² + θpq ≥ 0  ⇒  θ ≥ −p/q,
q² + θpq ≥ 0  ⇒  θ ≥ −q/p,
pq − θpq ≥ 0  ⇒  θ ≤ 1.
Therefore, θ has to satisfy
−min{p/q, q/p} < θ < 1.
4.2.3 Homozygote excesses and heterozygote deficiencies
A genotype can be classified as homozygous or heterozygous. Phenotypic traits are determined by the genotypes and, therefore, it is important to know whether, at a specific locus, we have a homozygote or heterozygote excess. Homozygote genotypes are represented by AA and aa; the heterozygote genotype is represented by Aa. The question is: when do homozygotes outnumber heterozygotes?
• Homozygote proportion = p² + q² + 2pqθ = (p + q)² − 2pq + 2pqθ = 1 − 2pq(1 − θ)
• Heterozygote proportion = 2pq(1 − θ)
1) Homozygotes outnumber heterozygotes if
1 − 2pq(1 − θ) > 2pq(1 − θ)  ⇔  θ > 1 − 1/(4pq),  where 1 − 1/(4pq) ≤ 0.
2) Heterozygotes outnumber homozygotes if
1 − 2pq(1 − θ) < 2pq(1 − θ)  ⇔  θ < 1 − 1/(4pq).
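The comparison above is easy to verify numerically. The short Python sketch below (illustrative only; the (p, θ) values are hypothetical) computes both proportions and checks that homozygotes outnumber heterozygotes exactly when θ > 1 − 1/(4pq).

```python
# Illustrative check of the homozygote/heterozygote comparison above.
def proportions(p, theta):
    q = 1.0 - p
    het = 2 * p * q * (1 - theta)    # heterozygote proportion
    hom = 1 - het                    # homozygote proportion
    return hom, het

for p, theta in [(0.5, 0.0), (0.5, 0.2), (0.1, 0.0)]:     # hypothetical (p, theta) pairs
    hom, het = proportions(p, theta)
    threshold = 1 - 1 / (4 * p * (1 - p))                 # theta > threshold <=> homozygote excess
    print(p, theta, hom > het, theta > threshold)         # last two columns always agree
```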
Homozygote excesses and heterozygote deficiencies have important genetic meanings, but the currently widely used tests of the Hardy-Weinberg proportions, such as the chi-squared test and the exact test (Guo and Thompson 1992) for small sample sizes and large numbers of alleles, do not take homozygote excesses and heterozygote deficiencies into account because of their two-sided hypothesis testing nature.
4.3 Maximum Likelihood estimates
Theoretically, the frequencies (n1, n2, n3) under genotypes AA, Aa, aa have a multinomial distribution: Multinomial(n, p² + θpq, 2pq(1 − θ), q² + θpq). The parameters of the model are p and θ, which are unknown. We employ the maximum likelihood method to obtain the estimates.
The likelihood can be written as
L = (p² + θpq)^n1 [2pq(1 − θ)]^n2 (q² + θpq)^n3,
so that
ln L(p, θ) = n1 ln(p² + θpq) + n2 ln(2pq(1 − θ)) + n3 ln(q² + θpq).
The partial derivatives are
∂ln L(p, θ)/∂p = n1(2p + θ(1 − 2p))/(p² + θpq) + n2·2(1 − θ)(1 − 2p)/(2pq(1 − θ)) + n3(−2q + θ(1 − 2p))/(q² + θpq),
∂ln L(p, θ)/∂θ = n1·pq/(p² + θpq) − n2·2pq/(2pq(1 − θ)) + n3·pq/(q² + θpq).
Solving the equations ∂ln L(p, θ)/∂p = 0 and ∂ln L(p, θ)/∂θ = 0 for p and θ gives
p̂ = (2n1 + n2)/(2n)  and  θ̂ = (4n1n3 − n2²)/((2n1 + n2)(n2 + 2n3)).
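Because both estimates are available in closed form, they are straightforward to compute. The following Python sketch (hypothetical counts; not the dissertation's SAS code) evaluates p̂ and θ̂ from observed genotype counts.

```python
# Closed-form MLEs for the bi-allelic case, as derived above.
def mle_two_alleles(n1, n2, n3):
    """n1, n2, n3 = observed counts of AA, Aa, aa."""
    n = n1 + n2 + n3
    p_hat = (2 * n1 + n2) / (2 * n)
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    return p_hat, theta_hat

print(mle_two_alleles(30, 40, 30))   # hypothetical sample of n = 100: p_hat = 0.5, theta_hat = 0.2
```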
4.4 Testing the validity of HWE
4.4.1 Hypothesis Testing on θ
In reality, the departure from the Hardy-Weinberg law is affected not only by consanguinity (inbreeding) but also by selection, genetic drift, assortative mating, and other evolutionary forces (Cockerham 1973), which are beyond the scope of this dissertation. Here we assume that the departure is caused by inbreeding alone, so that we can concentrate on studying and interpreting inbreeding only. As a first step toward an accurate and efficient measure of inbreeding in a small population, it is important to first resolve the single-locus measure of inbreeding, determine its sampling variance, and also build hypotheses and test their validity. In this dissertation, we explore different tests of the inbreeding coefficient calculated by maximum likelihood from the genotypic data at a single locus, namely Siegmund's T-test (Choudhry 2006), Wald's test, and the chi-squared test, and use computer simulations to examine and compare their power.
Assume that a random sample is drawn from the population. We may wish to compare this
sample with what would be expected from an “idealized inbred population’. An idealized inbred
population is defined as an infinitely large population with no mutation, migration, or selection (so
that gene frequencies remain constant) and with random mating except for a fixed amount of
inbred matings resulting in an average inbreeding coefficient of θ. The sampling of a whole
population may be considered as a random sample from an infinitely large pool of zygotes. In an
idealized inbred population, only the fixed amount of inbreeding (consanguinity) will affect the
proportion of heterozygotes. For an autosomal codominant locus having two alleles A and a with
frequencies p and q (p + q = 1), the proportions of homozygotes are p² + θpq and q² + θpq for AA and aa, respectively, and the proportion of heterozygotes Aa is 2pq(1 − θ) (Crow and Kimura 1970). In view of the formulation of the joint distribution and the assumption that inbreeding is the only force for departure from the Hardy-Weinberg equilibrium, it now follows that the population is in HWE if and only if the inbreeding coefficient θ = 0.
The null and alternative hypotheses now are:
H0: θ = 0    versus    H1: θ ≠ 0.
The null hypothesis is equivalent to the hypothesis of the Hardy–Weinberg equilibrium if we assume inbreeding is the only force for departure from HWE.
4.4.2 A likelihood test of the null hypothesis
Using the likelihood estimator of the coefficient, a Wald test statistic can be built as
Z = (θ̂ − 0)/√(Var(θ̂) | H0).
Using the asymptotic theory of likelihood estimators, it follows that Z has a standard normal distribution under the null hypothesis. One-sided alternatives (H1: θ > 0 or H1: θ < 0) can easily be used to discriminate homozygote excess from heterozygote deficiency.
We will determine the large-sample properties of the estimator θ̂ using the delta method (Rao 1983, page 388). The maximum likelihood estimate of θ is
θ̂ = (4n1n3 − n2²)/((2n1 + n2)(n2 + 2n3)).
After rewriting,
θ̂ = 1 − 2n·n2/((2n1 + n2)(n2 + 2n3)).
The derivatives are:
∂θ̂/∂n1 = 4n·n2/((2n1 + n2)²(n2 + 2n3)),
∂θ̂/∂n2 = 2n(n2² − 4n1n3)/((2n1 + n2)²(n2 + 2n3)²),
∂θ̂/∂n3 = 4n·n2/((2n1 + n2)(n2 + 2n3)²),  and
∂θ̂/∂n = −2n2/((2n1 + n2)(n2 + 2n3)).
Evaluating these at the expected values of the three genotypic counts and simplifying the variance expression gives
Var(θ̂) ≈ n[ (∂θ̂/∂n1)² P_AA + (∂θ̂/∂n2)² P_Aa + (∂θ̂/∂n3)² P_aa − (∂θ̂/∂n)² ];
see Weir (1996, page 65). This gives the approximate variance of θ̂:
Var(θ̂) = (1/n)(1 − θ)²(1 − 2θ) + θ(1 − θ)(2 − θ)/(2np(1 − p)).
Therefore,
1) under the null hypothesis H0: Z | H0 ~ N(0, 1);
2) under the alternative hypothesis H1: θ̂ | H1 ~ N( θ, (1/n)(1 − θ)²(1 − 2θ) + θ(1 − θ)(2 − θ)/(2np(1 − p)) ).
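A hedged Python sketch of the resulting Wald test is given below (the counts are hypothetical). It uses the fact, implied by the variance formula above, that Var(θ̂) reduces to 1/n under H0, so Z = √n·θ̂, and it returns a one-sided p-value for either alternative.

```python
# Wald Z-test for HWE with a one-sided alternative, as described above.
from scipy.stats import norm

def wald_z_test(n1, n2, n3, alternative="greater"):
    """alternative='greater' tests homozygote excess (theta > 0), 'less' tests deficiency."""
    n = n1 + n2 + n3
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    z = theta_hat * n**0.5                       # Z = theta_hat / sqrt(1/n) under H0
    p_value = 1 - norm.cdf(z) if alternative == "greater" else norm.cdf(z)
    return z, p_value

print(wald_z_test(30, 40, 30))   # z = 2.0, one-sided p-value about 0.023
```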
4.4.3 Siegmund's T-Test
A study on the Genetics of Asthma in Latino Americans (GALA) (Choudhry 2006) conducted a test for deviations from the Hardy-Weinberg equilibrium (HWE). The test can also distinguish between homozygote excess and heterozygote deficiency. The test statistic was originally proposed by David Siegmund; its properties and asymptotic distribution were not discussed in that paper. The test statistic takes the form
T = (n1/p + n3/q − n)/√n,
where n1 and n3 denote the homozygote genotypic counts, p and q denote the estimated allele frequencies, and n is the total number of individuals.
As far as we are aware, there is no published paper available on the properties of the T-statistic. In this section, we focus on the T-statistic and establish some important properties.
Under the HWE, T has approximately a standard normal distribution. An excess of homozygous individuals leads to a positive value of T, while an excess of heterozygotes gives a negative value of T.
4.4.3.1 Expectation and Variance of T
After rewriting, T takes the following form:
T = ( 2n·n1/(n + n1 − n3) + 2n·n3/(n + n3 − n1) − n ) / √n.
The details are given in Appendix 4. The expectation and variance of T are determined using the delta method.
1. Expectation of T:  E(T) = √n·θ.
2. Variance of T:  V(T) = (1 − θ)[θ(6p² − 6p + 2) + 2p(1 − p) − θ²(1 − 2p)²] / (2p(1 − p)).
4.4.3.2 Distribution of T
1. Under the null hypothesis (when the inbreeding coefficient equals 0):
E(T) ≈ (np²/p + nq²/q − n)/√n = (np + nq − n)/√n = 0, and
substituting θ = 0 into the above expression for the variance of T gives
V(T) = (1 − θ)[θ(6p² − 6p + 2) + 2p(1 − p) − θ²(1 − 2p)²] / (2p(1 − p)) = 1.
Therefore, under the null hypothesis, T has asymptotically a standard normal distribution, i.e., T | H0 ~ N(0, 1).
2. Under the alternative hypothesis (when the inbreeding coefficient is not equal to 0):
T | H1 ~ N( √n·θ, (1 − θ)[θ(6p² − 6p + 2) + 2p(1 − p) − θ²(1 − 2p)²] / (2p(1 − p)) ).
From Section 4.4.2, we recall that:
1) under the null hypothesis H0: Z | H0 ~ N(0, 1);
2) under the alternative hypothesis H1: θ̂ | H1 ~ N( θ, (1/n)(1 − θ)²(1 − 2θ) + θ(1 − θ)(2 − θ)/(2npq) ).
One can show that
(1 − θ)[θ(6p² − 6p + 2) + 2p(1 − p) − θ²(1 − 2p)²] / (2p(1 − p)) = n[ (1/n)(1 − θ)²(1 − 2θ) + θ(1 − θ)(2 − θ)/(2npq) ],
that is, V(T) = n·Var(θ̂). The derivations of the expectation and variance of T are reported in Appendix 4.
Further, it can be shown that θ̂ = T/√n.
4.4.4 χ²-Test
Testing deviation from the HWE is generally performed using Pearson’s chi-squared test,
using the observed genotype frequencies obtained from the data and the expected genotype
frequencies obtained using the HWE. The null hypothesis is that the population is in Hardy–
Weinberg proportions, and the alternative hypothesis is that the population is not in Hardy–
Weinberg proportions.
χ² = (n1 − n·p̂²)²/(n·p̂²) + (n2 − 2n·p̂q̂)²/(2n·p̂q̂) + (n3 − n·q̂²)²/(n·q̂²),
where p̂ = (2n1 + n2)/(2n) and q̂ = (n2 + 2n3)/(2n).
Substituting p̂ and q̂ into the chi-squared formula, the chi-squared test statistic can be simplified as
χ² = 4n(n1 − (2n1 + n2)²/(4n))²/(2n1 + n2)² + 2n(n2 − (2n1 + n2)(n2 + 2n3)/(2n))²/((2n1 + n2)(n2 + 2n3)) + 4n(n3 − (n2 + 2n3)²/(4n))²/(n2 + 2n3)²
   = n(n2² − 4n1n3)²/((2n1 + n2)²(n2 + 2n3)²)
   = nθ̂².
The classical chi-squared goodness-of-fit test generally performs well, but it has sometimes been pointed out that the chi-squared test is not appropriate when the alternative hypothesis of the test is heterozygote deficiency (Pamilo and Varvio-Aho 1984; Lessios 1992), and the generally used exact tests may also have this problem. In Ward and Sing (1970), θ is
estimated from the chi-squared test statistic value. The derivation runs as follows. Assume that the sample is large. Let p1, p2, ..., pk be the allele frequencies of a k-allele gene. Let nij be the observed frequency of the genotype AiAj, i ≤ j. Since the sample is large,
nii ≅ n[pi² + pi(1 − pi)θ]  and  nij ≅ n[(1 − θ)pi pj].
Note that
χ² = Σ_{i=1}^{k} (nii − npi²)²/(npi²) + Σ_{i<j} (2nij − 2npipj)²/(2npipj)
   ≅ Σ_{i=1}^{k} [npi(1 − pi)θ]²/(npi²) + Σ_{i<j} (2nθpipj)²/(2npipj)
   = nθ²[ Σ_{i=1}^{k} (1 − pi)² + 2Σ_{i<j} pipj ]
   = nθ²[ k − 2 + Σ_{i=1}^{k} pi² + 2Σ_{i<j} pipj ]
   = nθ²[ k − 2 + (p1 + p2 + ... + pk)² ]
   = n(k − 1)θ².
They showed that, for samples of the size commonly collected from natural populations, a significant χ² is obtained only for large values of θ. In other words, under the null hypothesis (θ = 0) it takes a very large sample to detect levels of inbreeding characteristic of natural populations (θ ≤ 0.10). (This means that, for sample sizes usually collected, the estimate of θ can take on biologically important values that are not statistically significant.)
Otherwise stated, the type II error, failure to reject θ = 0 when θ ≠ 0, is an overwhelming reality in most studies.
Solving the above formula for θ, a third estimate of average inbreeding is taken as the positive root: θ̂ = √(χ²/(n(k − 1))).
This estimate is very appealing, since the chi-squared value can simultaneously provide an estimate θ̂ and a significance test of the hypothesis θ = 0. However, because θ̂ is the positive root of a quadratic equation, its sampling properties are very difficult to determine. In contrast, the likelihood-based Z-test or Siegmund's T-test can easily be employed to determine the sample size required for a given size, power, and alternative value of θ.
Under the alternative hypothesis (θ ≠ 0), the χ² test has approximately a non-central chi-squared distribution with non-centrality parameter λ = nθ²(k − 1) (Ward and Sing 1970; Harber 1980), so that
V(θ̂) = [k(k − 1) + 4nθ²(k − 1)]/[n²(k − 1)²] = k/[n²(k − 1)] + 4θ²/[n(k − 1)].
When only a single locus is considered, the sample size needed to detect small but significant deviations is unrealistically large (Ward and Sing 1970). At a locus with two alleles, the χ² test can detect an inbreeding coefficient of θ = 0.0001 (a realistic value for human populations) at the 5% significance level only 50% of the time in a sample as large as 4×10⁸, almost twice the population of the U.S. (Curie-Cohen 1981).
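The order of magnitude quoted above can be checked with the non-central chi-squared approximation. The following Python sketch (an illustration, not taken from the dissertation) computes the power of the χ² test for θ = 0.0001 and k = 2 at a few sample sizes.

```python
# Power of the chi-squared HWE test via the non-central chi-squared approximation above.
from scipy.stats import chi2, ncx2

theta, k, alpha = 1e-4, 2, 0.05
crit = chi2.ppf(1 - alpha, df=k - 1)
for n in (1e8, 3.84e8, 4e8):
    power = 1 - ncx2.cdf(crit, df=k - 1, nc=n * theta**2 * (k - 1))
    print(n, round(power, 3))   # power reaches roughly 0.5 only near n = 4e8
```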
4.4.5 Relationship between θ̂, Wald's Z-test, Siegmund's T-test and the χ²-test
In summary, the above tests are all connected:
• θ̂ and χ²: in Section 4.4.4, we have shown that χ² = nθ̂².
• θ̂ and Siegmund's T: in Section 4.4.3.1, we have shown that E(T) = √n·θ.
• Siegmund's T and χ²:
T = (2n·n1/(2n1 + n2) + 2n·n3/(n2 + 2n3) − n)/√n  ⇒  T² = n(n2² − 4n1n3)²/((2n1 + n2)²(n2 + 2n3)²) = χ².
Therefore, T² = χ² = nθ̂², and T = √n·θ̂.
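These identities are easy to confirm numerically. The Python sketch below (hypothetical counts) computes θ̂, T, and χ² and verifies that T² = χ² = nθ̂².

```python
# Numerical check of the identities T^2 = chi-square = n * theta_hat^2.
def statistics(n1, n2, n3):
    n = n1 + n2 + n3
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    T = (2 * n * n1 / (2 * n1 + n2) + 2 * n * n3 / (n2 + 2 * n3) - n) / n**0.5
    chi_sq = n * (n2**2 - 4 * n1 * n3)**2 / ((2 * n1 + n2)**2 * (n2 + 2 * n3)**2)
    return theta_hat, T, chi_sq

theta_hat, T, chi_sq = statistics(30, 40, 30)      # hypothetical counts, n = 100
print(T**2, chi_sq, 100 * theta_hat**2)            # all three equal 4.0 here
```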
4.5 Advantages of the Wald Z-test or Siegmund's T-test
Using the Wald Z-test instead of χ² to test the Hardy-Weinberg Equilibrium has desirable properties:
1) The Z-test can be used to test for homozygote excess or heterozygote deficiency because it can be built as a one-sided test, while the χ²-test cannot: χ² is always positive and can only be used for two-sided alternatives.
2) Sample size calculation based on the Z-test is much easier than that based on the χ²-test.
4.6 Sample size calculation
4.6.1 Sample size calculation based on the Z-test or Siegmund's T-test
Let σ be the standard deviation of T under H1: θ = θ1 > 0 and note that
σ² = (1 − θ1)²(1 − 2θ1) + θ1(1 − θ1)(2 − θ1)/(2pq).
Set 1 − β = Pr(Reject H0 | θ = θ1) = Pr(T > Zα | θ = θ1), where Zα is the upper 100α percentile of the standard normal distribution. Then
1 − β = Pr( (T − √n·θ1)/σ > (Zα − √n·θ1)/σ | θ = θ1 ) = Pr( Z > (Zα − √n·θ1)/σ ),  where Z ~ N(0, 1).
Setting (Zα − √n·θ1)/σ = −Zβ, i.e., Zα − √n·θ1 = −σZβ, gives
n = (Zα + σZβ)²/θ1².
The sample size formula requires the value of the allele frequency P(A). The sample size n needed to obtain a specified power 1 − β, using Wald's Z-test for various values of the allele frequency q, level α, and true inbreeding coefficient θ in the bi-allelic case (k = 2), is presented in Table 4-4. The SAS code is presented in Appendix 5.
Table 4-4   Sample size n to achieve a specified power, 1 − β, using Wald's Z-test for various values of the allele frequency q, true inbreeding coefficient θ, and level α

q = 0.1, k = 2, α = 0.05:
                                        power
θ             0.2            0.5              0.9              0.95
0.0001        64,564,100     270,746,708      856,993,620      1,082,986,832
0.0005        2,589,898      10,860,621       34,377,086       43,442,484
0.001         649,763        2,724,751        8,624,646        10,899,005
0.002         163,582        685,974          2,171,311        2,743,896
0.005         26,717         112,038          354,634          448,152
0.01          6,903          28,948           91,629           115,792
0.02          1,835          7,694            24,355           30,778
0.05          342            1,436            4,545            5,744
0.1           103            432              1,369            1,729
0.25          22             91               288              364
0.5           5              23               71               90
1             0              0                0                0
q = 0.1, k = 2, α = 0.01:
                                        power
θ             0.2            0.5              0.9              0.95
0.0001        220,598,052    541,574,226      1,302,619,334    1,578,165,406
0.0005        8,848,979      21,724,484       52,252,731       63,305,872
0.001         2,220,063      5,450,316        13,109,351       15,882,403
0.002         558,916        1,372,153        3,300,365        3,998,499
0.005         91,286         224,110          539,039          653,063
0.01          23,586         57,904           139,274          168,736
0.02          6,269          15,391           37,020           44,851
0.05          1,170          2,872            6,909            8,370
0.1           352            865              2,080            2,520
0.25          74             182              438              531
0.5           18             45               108              131
1             0              0                0                0

q = 0.5, k = 2, α = 0.05:
                                        power
θ             0.2            0.5              0.9              0.95
0.0001        64,518,227     270,554,343      856,384,727      1,082,217,371
0.0005        2,580,728      10,822,171       34,255,381       43,288,684
0.001         645,182        2,705,541        8,563,839        10,822,163
0.002         161,295        676,383          2,140,953        2,705,533
0.005         25,807         108,219          342,545          432,876
0.01          6,451          27,053           85,630           108,211
0.02          1,612          6,761            21,401           27,045
0.05          257            1,080            3,417            4,318
0.1           64             268              848              1,071
0.25          10             41               128              162
0.5           2              8                26               32
1             0              0                0                0

q = 0.5, k = 2, α = 0.01:
                                        power
θ             0.2            0.5              0.9              0.95
0.0001        220,441,317    541,189,438      1,301,693,824    1,577,044,120
0.0005        8,817,651      21,647,572       52,067,740       63,081,750
0.001         2,204,411      5,411,889        13,016,925       15,770,426
0.002         551,101        1,352,968        3,254,222        3,942,595
0.005         88,174         216,470          520,665          630,802
0.01          22,042         54,114           130,156          157,689
0.02          5,509          13,524           32,529           39,410
0.05          880            2,159            5,194            6,292
0.1           218            536              1,289            1,561
0.25          33             81               195              237
0.5           7              16               39               47
1             0              0                0                0
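The tabled values were produced by the SAS program in Appendix 5. As a rough cross-check, the following Python sketch evaluates the sample-size formula n = (Zα + σZβ)²/θ1² derived above; because of rounding and implementation details in the original program, its output is close to, though not always identical with, the tabled entries.

```python
# Sample size for the Wald Z-test of HWE, following the derivation above.
from math import ceil
from scipy.stats import norm

def sample_size(theta1, q, alpha, power):
    p = 1 - q
    sigma2 = (1 - theta1)**2 * (1 - 2 * theta1) + theta1 * (1 - theta1) * (2 - theta1) / (2 * p * q)
    z_alpha, z_beta = norm.ppf(1 - alpha), norm.ppf(power)
    return ceil((z_alpha + sigma2**0.5 * z_beta)**2 / theta1**2)

print(sample_size(0.05, 0.5, 0.05, 0.95))   # about 4,324; Table 4-4 lists 4,318
print(sample_size(0.01, 0.1, 0.05, 0.90))   # about 88,238; Table 4-4 lists 91,629
```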
4.6.2 Sample size calculation based on Ward and Sing's χ²-test
The sample size required under different choices of parameters based on Ward and Sing's χ²-test (Ward and Sing 1970) is reproduced below. It turns out that the sample sizes calculated based on the Z-test are lower than those based on Ward and Sing's χ²-test.
4.6.3 Power comparison between T and χ² tests via simulations
We have discussed different properties of the Wald Z-test statistic and of Ward and Sing's χ² test. Each has its own advantages. The Wald Z-test can be built as a one-sided test, which accommodates discriminating homozygote excesses from heterozygote deficiencies. On the other hand, the χ² test has its own advantage, namely that it can be generalized to any number of alleles. We invoked SAS to compare their power; the code is presented in Appendix 6.
The null and alternative hypotheses are:
H0: θ = 0
H1: θ > 0 (Z-test)
H1: θ ≠ 0 (χ²-test)
Simulation procedure:
Randomly select 100 θ's from (0, 0.2). For each θ, simulate the multinomial distribution (n = 100, with a given allele frequency P(A) = p), apply both the Wald Z-test and the χ² test at significance level α = 5%, and repeat this process 10,000 times. The empirical power is the proportion of times H0 is rejected.
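The dissertation's simulations were run in SAS (Appendix 6); the following Python sketch is a rough re-implementation of the same procedure for a single (θ, p) configuration and is included only as an illustration.

```python
# Empirical power of the one-sided Z-test and the chi-squared test under the inbreeding model.
import numpy as np
from scipy.stats import norm, chi2

rng = np.random.default_rng(1)

def empirical_power(theta, p, n=100, reps=10_000, alpha=0.05):
    q = 1 - p
    probs = [p**2 + theta * p * q, 2 * p * q * (1 - theta), q**2 + theta * p * q]
    counts = rng.multinomial(n, probs, size=reps)              # each row is (n1, n2, n3)
    n1, n2, n3 = counts[:, 0], counts[:, 1], counts[:, 2]
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    z = np.sqrt(n) * theta_hat
    power_z = np.mean(z > norm.ppf(1 - alpha))                 # H1: theta > 0
    power_chi = np.mean(n * theta_hat**2 > chi2.ppf(1 - alpha, df=1))
    return power_z, power_chi

print(empirical_power(theta=0.1, p=0.5))
```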
1. Power comparison of Z-test and χ2-test
In Figures 4-1, 4-2, and 4-3, empirical powers under both the tests are graphed.
Figure 4-1   Power comparison of Z and χ², when p = 0.5
Figure 4-2   Power comparison of Z and χ², when p = 0.2
Figure 4-3   Power comparison of Z and χ², when p = 0.05
2. Normality check of Z
For a very low value of p (p = 0.05), the chi-squared test has better power than the Z-test; for other values of p, the Z-test is better. Apparently, the relative performance of the Z-test and the χ²-test depends on the allele frequency. When p = 0.5, the Z-test has higher power than the χ²-test; however, when p becomes extreme (p = 0.05), the Z-test has lower power than the χ²-test. The normality of Z, taken for granted, may be the issue. Therefore, we check the normality of Z using histograms and Q-Q plots. When a linear point pattern exists in the Q-Q plot, it indicates that the specified family reasonably describes the data distribution, and the location and scale parameters can be estimated visually as the intercept and slope of the linear pattern. For different choices of the allele frequency p, the degrees of normality of Z are different: when p = 0.5 or 0.2, the histograms visually appear to be normal, whereas in the case of p = 0.05 the histogram is bimodal.
Therefore, we can conclude that the different power performances of the Z and χ² tests depend on the degree of normality of Z, which is affected by the allele frequency p.
1) p = 0.5
Figure 4-4   Histogram and normal Q-Q plot of the Z values, when p = 0.5 (fitted normal curve: Mu = 1.9973, Sigma = 1.0532)
2) p = 0.2
Figure 4-5   Histogram and normal Q-Q plot of the Z values, when p = 0.2 (fitted normal curve: Mu = 1.6738, Sigma = 1.0413)
3) p = 0.05
Figure 4-6   Histogram and normal Q-Q plot of the Z values, when p = 0.05 (fitted normal curve: Mu = 0.1395, Sigma = 2.0432)
For a small allele frequency p, there may be no homozygote for the rare allele in the sample. Consequently, the sampling distribution of Z is based on heterozygotes and common homozygotes, which, as shown in the histogram, makes the distribution bimodal.
4.7 Conclusion
The traditional χ² test for the Hardy-Weinberg equilibrium cannot accommodate one-sided alternatives. Wald's Z-test and Siegmund's T-test can, and these two tests are shown to be essentially the same. Using the Z-test, a sample size formula for achieving a specified power has been developed; the resulting sample sizes are lower than those obtained under the χ² test. However, the Z-test is biased (i.e., the nominal significance level is not achieved under H0 for skewed allele frequencies); see Figure 4-3. The bias may be due to non-normality of the Z-statistic. For moderate values of the allele frequency p, the Z-test has higher power than the χ² test. For very low values of p, the χ² test is superior.
5 Hardy-Weinberg Equilibrium in the case of three alleles
5.1 Introduction
When a population has HW proportions, the disequilibrium coefficient is zero, suggesting that a test of the hypothesis that the disequilibrium is zero is equivalent to testing for HWE. The word "equilibrium" here means, strictly speaking, a state in which the properties of the population are not changing over successive generations. In the HWE case, this requires the continued absence of disturbing forces such as selection, migration, and mutation, as well as the continuation of random mating (no inbreeding).
In this study, testing whether θ equals zero is equivalent to testing the HWE under two biological scenarios:
1) inbreeding is the only force affecting the equilibrium state;
2) the population is substructured.
The argument given by Hardy for the stability of allele frequencies under random mating
extends to the case of any number of alleles. In this chapter, we will consider the problem of
testing the HWE in the tri-allelic case. The bi-allelic and tri-allelic cases are different with respect
to mathematical, statistical, and computational issues. We will present a new method of handling the tri-allelic problem by reducing it to several bi-allelic problems.
5.2 Joint distribution of genotypes
In this section, we will focus on the case of tri-allelic genes and the attendant Hardy-Weinberg equilibrium issues. Consider a tri-allelic gene with alleles A1, A2 and A3. The customary notation for the joint distribution of the genotypes A1A1, A1A2, A1A3, A2A2, A2A3, A3A3 is p11, 2p12, 2p13, p22, 2p23, p33, respectively. The reason for this special notation is that any heterozygous genotype AiAj (i ≠ j) is indistinguishable from AjAi. This notation facilitates writing the joint distribution of the genotypes in the following form (Table 5-1), where pij = pji for all i and j.
Table 5-1   Joint distribution of genotypes

              A1     A2     A3     Marginal frequencies
  A1          p11    p12    p13    p1
  A2          p21    p22    p23    p2
  A3          p31    p32    p33    p3
  Marginal
  frequencies p1     p2     p3     1

Let us use the generic symbol A (a 3×3 array) for the joint distribution. The marginal probabilities p1, p2 and p3 are called allelic frequencies. The following are the properties of the joint distribution A:
1) A is symmetric: p12 = p21, p13 = p31 and p23 = p32.
2) The row and column marginal frequencies are the same.
5.2.1 Parameter spaces
(1) Parameter space Ω
The purpose in this chapter is to make inferences on the unknown joint distribution A of a population of interest, based on a random sample of individuals drawn from the population whose genotypes are determined. The underlying parameter space is defined by
Ω = {A : A is a joint distribution of the type in Table 5-1}.
Because of the special structure of A, it is enough to choose a set of five entries in Table 5-1 for specification; the rest of the entries in the table are then determined. For example, one can spell out p11, p12, p13, p22 and p23, all non-negative with sum ≤ 1, and the rest of the entries are automatically determined. Equivalently, one can spell out p1, p2, p11, p12 and p22 with the requisite natural constraints, and the rest of the entries in the table are determined uniquely. What this means is that the parameter space Ω has five free parameters and can therefore be deemed 5-dimensional.
(2) Parameter space Ω_{p1, p2, p3}
We now consider a specified subset of Ω. Let p1, p2 and p3 be such that each pi ≥ 0 and p1 + p2 + p3 = 1. Suppose p1, p2 and p3 are given. Let
Ω_{p1, p2, p3} = {A : A is a joint distribution of the type in Ω with p1, p2 and p3 known}.
The set Ω_{p1, p2, p3} has three free parameters. For example, one can spell out p11, p12 and p22 freely subject to the relevant natural constraints, and the rest of the entries in A are determined uniquely. As a consequence, Ω_{p1, p2, p3} is deemed to be 3-dimensional.
(3) Parameter space Ω*
We now introduce a definition.
Definition: A joint distribution A of the type in Table 5-1 is of type Ωθ if it is of the following form (Table 5-2) for some θ, p1, p2 and p3, with the marginals p1, p2 and p3 unknown:
Table 5-2   Joint distribution of type Ωθ

              A1                 A2                 A3                 Marginal frequencies
  A1          (1−θ)p1² + θp1     (1−θ)p1p2          (1−θ)p1p3          p1
  A2          (1−θ)p1p2          (1−θ)p2² + θp2     (1−θ)p2p3          p2
  A3          (1−θ)p1p3          (1−θ)p2p3          (1−θ)p3² + θp3     p3
  Marginal
  frequencies p1                 p2                 p3                 1

The entity θ is defined to be the inbreeding coefficient.
Let
Ω* = {A : the joint distribution A is of the type Ωθ for some θ ≠ 0}.
It is clear that the parameter space Ω* is 3-dimensional. One can take θ, p1 and p2 as free parameters.
(4) Parameter space Ω*_{p1, p2, p3} with given p1, p2 and p3
A special subset of Ω* is of interest. Let p1, p2 and p3 be given marginal frequencies. Let
Ω*_{p1, p2, p3} = {A ∈ Ω* : A has marginal frequencies p1, p2 and p3}.
It is clear that Ω*_{p1, p2, p3} is only 1-dimensional and the free parameter is θ.
Let A ∈ Ω. The case θ = 0 leads to the following joint distribution (Table 5-3). In this case, we say that the population has achieved the Hardy-Weinberg equilibrium (Ω0).
Table 5-3   Joint distribution of genotypes under equilibrium (Ω0)

              A1       A2       A3       Marginal frequencies
  A1          p1²      p1p2     p1p3     p1
  A2          p1p2     p2²      p2p3     p2
  A3          p1p3     p2p3     p3²      p3
  Marginal
  frequencies p1       p2       p3       1

The inclusion Ω* ⊂ Ω is strict. This follows from dimension(Ω*) < dimension(Ω). The following is a specific example of a joint distribution A in Ω but not in Ω* (Table 5-4):

Table 5-4   Example: a distribution in Ω but not in Ω*

              A1       A2       A3       Marginal frequencies
  A1          1/9      1/9      1/9      1/3
  A2          1/9      2/27     4/27     1/3
  A3          1/9      4/27     2/27     1/3
  Marginal
  frequencies 1/3      1/3      1/3      1

We claim that the distribution in Table 5-4 is not in Ω*. Suppose it is, and let θ be the corresponding inbreeding coefficient. Then (1−θ)p1p2 = 1/9 = (1−θ)(1/9), which implies that θ = 0. On the other hand, (1−θ)p2p3 = 4/27 = (1−θ)(1/9), which implies that θ = −1/3. This is a contradiction.
5.2.2 Bounds on θ
The inbreeding coefficient θ can be measured indirectly from genetic data; the higher its value, the more closely related the parents are. Suppose the joint distribution is of the type Ωθ for some θ. We can find bounds for θ from Table 5-2. Since each entry in that table is a frequency, each entry has to be greater than or equal to zero, and bounds on θ follow by setting each entry ≥ 0. Therefore, θ has to satisfy
−min{ p1/(1−p1), p2/(1−p2), p3/(1−p3) } < θ < 1.
5.2.3 Biological scenario
For the case of two alleles, every distribution in Ω can be put as one of the type Ωθ. However, when there are three or more alleles, not every distribution in Ω can be put as one of the type Ωθ (see the example in Table 5-4). A question that naturally arises is: under what conditions can a genotype probability distribution with fixed marginals be put in the form of Table 5-2?
Population substructure
Li, C.C. (1969) provided an explanation. The departure from the Hardy-Weinberg Equilibrium can be described by the following table:
Table 5-5   Population subdivision with respect to three alleles

              A1             A2               A3               Marginal frequencies
  A1          p1² + σ1²      2p1p2 + 2σ12     2p1p3 + 2σ13     p1
  A2          .              p2² + σ2²        2p2p3 + 2σ23     p2
  A3          .              .                p3² + σ3²        p3
  Marginal
  frequencies p1             p2               p3               1

This joint distribution has 9 parameters (σ1², σ2², σ3², σ12, σ13, σ23 and the three marginal frequencies p1, p2, p3) subject to 4 constraints:
p1 + p2 + p3 = 1,
σ1² + σ12 + σ13 = 0,
σ12 + σ2² + σ23 = 0,
σ13 + σ23 + σ3² = 0.
Consequently, the parameter space associated with all possible distributions of the type in Table 5-5 is 9 − 4 = 5-dimensional. We would therefore have many θi's, which means that not every distribution in Ω can be put as one of the type Ωθ. However, the five parameters can be reduced further from a biological standpoint.
Under population subdivision, the evolutionary expectations of the variances and covariances can be represented by a single parameter θ, leading to Pr(AiAi) = (1−θ)pi² + θpi and Pr(AiAj) = 2pipj(1−θ), which is explained solely by an inbreeding (co-ancestry) coefficient θ.
If the departure from the Hardy-Weinberg Equilibrium is caused solely by inbreeding, like alleles combine with like alleles. Consequently, deviations of the genotype frequencies from the HWE can be explained by just one inbreeding coefficient.
For illustration, consider the probability of genotype AiAi: Pr(AiAi) = (1−θ)pi² + θpi. This can be seen as combining two situations: complete inbreeding (pi), weighted by the inbreeding degree θ, plus complete independence (pi²), weighted by (1 − θ).
5.3 Structure of the case of 3 alleles: data and likelihood
5.3.1 Structure of the case of 3 alleles: data
In order to make inferences on the unknown joint distribution A ∈ Ω, we select a random sample of n subjects from the population of interest and determine their genotypes. Let nij be the number of subjects in the sample with genotype AiAj, 1 ≤ i ≤ j ≤ 3. The data collected can be arranged in the following table (Table 5-6):

Table 5-6   Data on genotypes

         A1     A2     A3
  A1     n11    n12    n13
  A2     .      n22    n23
  A3     .      .      n33
Note that n11 + n12+ n13 + n22 + n23 + n33 = n. The notation for the frequencies nij’s is
markedly different from the genotype probabilities pij’s. The genotype probability of AiAj ( i ≠ j )
is written as 2 pij, whereas the genotype frequency of AiAj is nij.
5.3.2 Maximum Likelihood estimators
Theoretically, the frequencies (n11, n12, n13, n22, n23, n33) of the genotypes A1A1, A1A2, A1A3, A2A2, A2A3, A3A3 have a multinomial distribution with parameters (n, p11, 2p12, 2p13, p22, 2p23, p33).
1. Maximum Likelihood Estimation over Ω
Let A ∈ Ω. The likelihood of the data at A is given by
L(A) = (p11)^n11 (p22)^n22 (p33)^n33 (2p12)^n12 (2p13)^n13 (2p23)^n23.
Maximizing the likelihood over all A ∈ Ω yields the following maximum likelihood estimates:
p̂11 = n11/n,  p̂22 = n22/n,  p̂33 = n33/n,  2p̂12 = n12/n,  2p̂13 = n13/n,  2p̂23 = n23/n.
2. Maximum Likelihood Estimation over Ω*
We know that Ω* is a 3-dimensional parameter space with free parameters θ, p1, and p2 (p3 = 1 − p1 − p2). Let A ∈ Ω*. The likelihood of the data is given by
L(A) = [(1−θ)p1² + θp1]^n11 [2(1−θ)p1p2]^n12 [2(1−θ)p1p3]^n13 [2(1−θ)p2p3]^n23 [(1−θ)p2² + θp2]^n22 [(1−θ)p3² + θp3]^n33.
The log likelihood is given by
ln L(θ, p1, p2) = constant + n11 ln[(1−θ)p1² + θp1] + n22 ln[(1−θ)p2² + θp2] + n33 ln[(1−θ)(1−p1−p2)² + θ(1−p1−p2)]
+ (n12 + n13 + n23) ln(1−θ) + (n12 + n13) ln p1 + (n12 + n23) ln p2 + (n13 + n23) ln(1−p1−p2).
The partial derivatives are
∂ln L(p1, p2, θ)/∂p1 = n11[2(1−θ)p1 + θ]/[(1−θ)p1² + θp1] − n33[θ + 2(1−θ)(1−p1−p2)]/[θ(1−p1−p2) + (1−θ)(1−p1−p2)²] + (n12 + n13)/p1 − (n13 + n23)/(1−p1−p2),
∂ln L(p1, p2, θ)/∂p2 = n22[2(1−θ)p2 + θ]/[(1−θ)p2² + θp2] − n33[θ + 2(1−θ)(1−p1−p2)]/[θ(1−p1−p2) + (1−θ)(1−p1−p2)²] + (n12 + n23)/p2 − (n13 + n23)/(1−p1−p2),
∂ln L(p1, p2, θ)/∂θ = n11 p1(1−p1)/[(1−θ)p1² + θp1] + n22 p2(1−p2)/[(1−θ)p2² + θp2] + n33(1−p1−p2)(p1+p2)/[θ(1−p1−p2) + (1−θ)(1−p1−p2)²] − (n12 + n13 + n23)/(1−θ)
= (1/(1−θ)) [ n11(1−p1)/(θ/(1−θ) + p1) + n22(1−p2)/(θ/(1−θ) + p2) + n33(1−p3)/(θ/(1−θ) + p3) ] − (n12 + n13 + n23)/(1−θ).
Set
∂ln L(p1, p2, θ)/∂p1 = 0,  ∂ln L(p1, p2, θ)/∂p2 = 0,  ∂ln L(p1, p2, θ)/∂θ = 0
and solve for p1, p2, and θ.
Technically, these equations should yield maximum likelihood estimates of p1, p2, and θ. However, the amount of computation required is enormous. We invoked the software Mathematica (10.1) to obtain a solution to these equations symbolically. The software ran for a long time (over 48 hours) and produced no solution. Suppose we plug the data values nij into the equations spelled out above. Is there a solution?
More specifically, suppose n11 = n12 = n13 = n22 = n23 = n33 = 10. The log likelihood is given by
ln L(θ, p1, p2) = 10 ln[(1−θ)p1² + θp1] + 10 ln[(1−θ)p2² + θp2] + 10 ln[(1−θ)(1−p1−p2)² + θ(1−p1−p2)] + 30 ln(1−θ) + 20 ln p1 + 20 ln p2 + 20 ln(1−p1−p2).
Maximizing ln L with respect to p1, p2 and θ, the solution is
θ̂ = 1/4,  p̂1 = 1/3,  p̂2 = 1/3,  p̂3 = 1/3.
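As a numerical check of this example (the counts are the hypothetical ones just given; the dissertation's own computations used Mathematica), the log likelihood can be maximized directly, for instance with scipy:

```python
# Numerical maximization of the tri-allelic log likelihood for the example above.
import numpy as np
from scipy.optimize import minimize

counts = {"n11": 10, "n12": 10, "n13": 10, "n22": 10, "n23": 10, "n33": 10}

def neg_log_lik(params):
    theta, p1, p2 = params
    p3 = 1.0 - p1 - p2
    hom = lambda p: (1 - theta) * p**2 + theta * p       # Pr(AiAi)
    het = lambda pa, pb: 2 * (1 - theta) * pa * pb       # Pr(AiAj), i != j
    probs = [hom(p1), het(p1, p2), het(p1, p3), hom(p2), het(p2, p3), hom(p3)]
    n = [counts[k] for k in ("n11", "n12", "n13", "n22", "n23", "n33")]
    if min(probs) <= 0:
        return np.inf                                    # outside the admissible region
    return -sum(ni * np.log(pi) for ni, pi in zip(n, probs))

res = minimize(neg_log_lik, x0=[0.1, 0.3, 0.3], method="Nelder-Mead")
print(res.x)   # approximately [0.25, 1/3, 1/3], matching the stated solution
```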
The gene count estimates of the allele frequencies,
p̂1 = (2n11 + n12 + n13)/(2n),  p̂2 = (2n22 + n12 + n23)/(2n),  p̂3 = (2n33 + n13 + n23)/(2n),
all equal 1/3 here and coincide, in this example, with the maximum likelihood estimates. We have tried another example, taking n11 = 10, n12 = 10, n13 = 10, n22 = 10, n23 = 20, n33 = 20. Mathematica was again employed, and no explicit numerical solution to the likelihood equations was obtained. There are examples of data in which the likelihood estimates of p1, p2, p3 and the natural estimates of p1, p2, p3 do not match.
Maximum likelihood estimates are often preferred because they are functions of sufficient statistics and attain the minimum variance as the sample size gets infinitely large. Unfortunately, the maximum likelihood estimate of the inbreeding coefficient cannot be written explicitly; it must be obtained numerically by iteration. Moreover, if the likelihood of the observed numbers of each genotype is maximized simultaneously over the gene frequencies pi (i = 1, 2, 3) and θ, the gene frequency estimates are not, in general, equal to the natural unbiased gene count estimators given above.
We provide two different approaches to resolve the difficulties we are facing.
Option (1): Use the gene count estimates of p1, p2, and p3 and obtain the maximum likelihood estimate of θ. As we demonstrate below, the likelihood equation then reduces to a cubic polynomial, and the asymptotic theory extends to this case.
Option (2): Reduce the 3×3 distribution to three 2×2 distributions.
We now elaborate on Option (1) in the following section.
3. Maximum Likelihood Estimation over Ω_{θ, p1, p2, p3}
Suppose the marginal frequencies p1, p2, and p3 are fixed and known. The only unknown parameter is θ, so the model is one-dimensional. The likelihood of the data is given by
L(θ) = [(1−θ)p1² + θp1]^n11 [2(1−θ)p1p2]^n12 [2(1−θ)p1p3]^n13 [2(1−θ)p2p3]^n23 [(1−θ)p2² + θp2]^n22 [(1−θ)p3² + θp3]^n33,
and
∂ln L/∂θ = n11 p1(1−p1)/[(1−θ)p1² + θp1] + n22 p2(1−p2)/[(1−θ)p2² + θp2] + n33 p3(1−p3)/[(1−θ)p3² + θp3] − (n12 + n13 + n23)/(1−θ).
We set ∂ln L/∂θ = 0. The likelihood equation can be simplified as follows:
∂ln L/∂θ = (1/(1−θ)) [ n11(1−p1)/(θ/(1−θ) + p1) + n22(1−p2)/(θ/(1−θ) + p2) + n33(1−p3)/(θ/(1−θ) + p3) ] − (n12 + n13 + n23)/(1−θ) = 0.
This gives
n11(1−p1)/(θ/(1−θ) + p1) + n22(1−p2)/(θ/(1−θ) + p2) + n33(1−p3)/(θ/(1−θ) + p3) = n12 + n13 + n23.
Let η = θ/(1−θ). The likelihood equation is a third-degree polynomial in η, given by
n11(1−p1)/(η + p1) + n22(1−p2)/(η + p2) + n33(1−p3)/(η + p3) = n12 + n13 + n23,
i.e.,
n11(1−p1)(η + p2)(η + p3) + n22(1−p2)(η + p1)(η + p3) + n33(1−p3)(η + p1)(η + p2) = (n12 + n13 + n23)(η + p1)(η + p2)(η + p3).
Alternatively,
(n12 + n13 + n23)η³ − η²[n11(1−p1) + n22(1−p2) + n33(1−p3) − (n12 + n13 + n23)]
− η[n11(1−p1)(p2 + p3) + n22(1−p2)(p1 + p3) + n33(1−p3)(p1 + p2) − (n12 + n13 + n23)(p1p2 + p1p3 + p2p3)]
− n11(1−p1)p2p3 − n22(1−p2)p1p3 − n33(1−p3)p1p2 + (n12 + n13 + n23)p1p2p3 = 0.
As an example, suppose p1 = p2 = p3 = 1/3. Then the polynomial is
(n12 + n13 + n23)η³ − η²[(2/3)(n11 + n22 + n33) − (n12 + n13 + n23)] − η[(4/9)(n11 + n22 + n33) − (1/3)(n12 + n13 + n23)] − (2/27)(n11 + n22 + n33) + (1/27)(n12 + n13 + n23) = 0.
Let αi be the coefficient of the i-th power of η; that is, for the above equation,
α3 = n12 + n13 + n23,
α2 = −[(2/3)(n11 + n22 + n33) − (n12 + n13 + n23)],
α1 = −[(4/9)(n11 + n22 + n33) − (1/3)(n12 + n13 + n23)],
α0 = −(2/27)(n11 + n22 + n33) + (1/27)(n12 + n13 + n23),
so that α3η³ + α2η² + α1η + α0 = 0.
Solving this third-degree polynomial in η, we have
η = θ/(1−θ) = −1/3  ⇒  θ̂ = −1/2.
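The cubic is easy to solve numerically once the coefficients are assembled. The following Python sketch (hypothetical counts, with p1 = p2 = p3 = 1/3 assumed known) forms the coefficients as above, recovers the roots in η, and maps them back to θ = η/(1 + η).

```python
# Solving the cubic likelihood equation in eta = theta/(1 - theta) for known allele frequencies.
import numpy as np

n11 = n12 = n13 = n22 = n23 = n33 = 10          # hypothetical genotype counts
p1 = p2 = p3 = 1.0 / 3.0                        # allele frequencies assumed known
S = n12 + n13 + n23                             # sum of heterozygote counts
a1, a2, a3 = n11 * (1 - p1), n22 * (1 - p2), n33 * (1 - p3)
e2, e3 = p1 * p2 + p1 * p3 + p2 * p3, p1 * p2 * p3

coeffs = [S,
          -(a1 + a2 + a3 - S),
          -(a1 * (p2 + p3) + a2 * (p1 + p3) + a3 * (p1 + p2) - S * e2),
          -(a1 * p2 * p3 + a2 * p1 * p3 + a3 * p1 * p2 - S * e3)]
etas = np.roots(coeffs)                         # roots of the cubic in eta
thetas = etas / (1 + etas)                      # back-transform: theta = eta/(1 + eta)
print(np.round(thetas.real, 3))                 # includes 0.25 and -0.5 for these counts
```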
For the asymptotic distribution of the likelihood estimator θ̂, we need E(∂²ln L/∂θ²). After simplification,
∂ln L/∂θ = n11(1−p1)/[(1−θ)p1 + θ] + n22(1−p2)/[(1−θ)p2 + θ] + n33(1−p3)/[(1−θ)p3 + θ] − (n12 + n13 + n23)/(1−θ),
∂²ln L/∂θ² = −n11(1−p1)²/[(1−θ)p1 + θ]² − n22(1−p2)²/[(1−θ)p2 + θ]² − n33(1−p3)²/[(1−θ)p3 + θ]² − (n12 + n13 + n23)/(1−θ)².
The asymptotic variance of θ̂ is V(θ̂) = 1/E(−∂²ln L/∂θ²), where
E(−∂²ln L/∂θ²) = n[θp1 + (1−θ)p1²](1−p1)²/[(1−θ)p1 + θ]² + n[θp2 + (1−θ)p2²](1−p2)²/[(1−θ)p2 + θ]² + n[θp3 + (1−θ)p3²](1−p3)²/[(1−θ)p3 + θ]² + 2n(1−θ)(p1p2 + p1p3 + p2p3)/(1−θ)²
= n p1(1−p1)²/[(1−θ)p1 + θ] + n p2(1−p2)²/[(1−θ)p2 + θ] + n p3(1−p3)²/[(1−θ)p3 + θ] + 2n(p1p2 + p1p3 + p2p3)/(1−θ).
In the special case p1 = p2 = p3 = 1/3,
E(−∂²ln L/∂θ²) = (4n/3)/(1 + 2θ) + (2n/3)/(1 − θ).
Therefore, if the gene frequencies are known, the maximum likelihood estimator θ̂ of θ has asymptotically a normal distribution with mean θ and variance:
V(θ̂) = 1/E(−∂²ln L/∂θ²) = (1/n) [ p1(1−p1)²/((1−θ)p1 + θ) + p2(1−p2)²/((1−θ)p2 + θ) + p3(1−p3)²/((1−θ)p3 + θ) + 2(p1p2 + p1p3 + p2p3)/(1−θ) ]⁻¹.
A test of the null hypothesis H0: θ = θ0 can be built based on the asymptotic theory of the likelihood estimator:
Test statistic  Z = (θ̂ − θ0)/√(Var(θ̂) | H0).
If p1, p2, p3 are not known, one could use the following consistent gene count estimates of p1, p2, and p3, respectively, in the asymptotic variance formula of θ̂:
p̂1 = (2n11 + n12 + n13)/(2n),  p̂2 = (2n22 + n12 + n23)/(2n),  p̂3 = (2n33 + n13 + n23)/(2n).
The asymptotic normal distribution is still valid in view of Slutsky's Theorem (Cramer, 1946).
5.4 Joint distribution of the type Ωθ and connection to lower-dimensional joint distributions
In this section, we pursue Option (2). Maximum likelihood estimation and testing of the inbreeding coefficient in the case of bi-allelic genes are simple to execute. We will reduce the tri-allelic case to three bi-allelic cases and explore the connection between them. The connection discovered helps us tackle the tri-allelic case.
Reduction to 2×2 joint distributions
We reduce a given genotype distribution in the tri-allelic case to three bi-allelic cases: A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3). Their corresponding inbreeding coefficients are θ1, θ2, and θ3.
5.4.1 The case of A1 vs. (not A1)
From Table 5-1, by combining alleles A2 and A3, we obtain the following reduced joint distribution.
Table 5-7   Joint distribution: A1 vs. (not A1)

              A1             not A1             Marginal frequencies
  A1          p11            p12 + p13          p1
  not A1      p12 + p13      p22 + 2p23 + p33   p2 + p3
  Marginal
  frequencies p1             p2 + p3            1

As pointed out in Chapter 4, we can always write the above joint distribution in the following form for some inbreeding coefficient θ1:

Table 5-8   Joint distribution: A1 vs. (not A1) with inbreeding coefficient θ1

              A1                           not A1                         Marginal frequencies
  A1          p1² + θ1 p1(p2+p3)           p1(p2+p3) − θ1 p1(p2+p3)       p1
  not A1      p1(p2+p3) − θ1 p1(p2+p3)     (p2+p3)² + θ1 p1(p2+p3)        p2 + p3
  Marginal
  frequencies p1                           p2 + p3                        1

5.4.2 The case of A2 vs. (not A2)
By combining alleles A1 and A3, we have the following reduced joint distribution:
Table 5-9   Joint distribution: A2 vs. (not A2)

              not A2             A2             Marginal frequencies
  not A2      p11 + 2p13 + p33   p12 + p23      p1 + p3
  A2          p12 + p23          p22            p2
  Marginal
  frequencies p1 + p3            p2             1

The above joint distribution can be rewritten in the following form for some inbreeding coefficient θ2:

Table 5-10   Joint distribution: A2 vs. (not A2) with inbreeding coefficient θ2

              not A2                         A2                             Marginal frequencies
  not A2      (p1+p3)² + θ2 (p1+p3)p2        (p1+p3)p2 − θ2 (p1+p3)p2       p1 + p3
  A2          (p1+p3)p2 − θ2 (p1+p3)p2       p2² + θ2 (p1+p3)p2             p2
  Marginal
  frequencies p1 + p3                        p2                             1
5.4.3 The case of A3 vs. (not A3)
By combining alleles A1 and A2, we have the following reduced joint distribution:
Table 5-11   Joint distribution: A3 vs. (not A3)

              not A3             A3             Marginal frequencies
  not A3      p11 + 2p12 + p22   p13 + p23      p1 + p2
  A3          p13 + p23          p33            p3
  Marginal
  frequencies p1 + p2            p3             1

The above joint distribution can be rewritten in the following form for some inbreeding coefficient θ3:

Table 5-12   Joint distribution: A3 vs. (not A3) with inbreeding coefficient θ3

              not A3                         A3                             Marginal frequencies
  not A3      (p1+p2)² + θ3 (p1+p2)p3        (p1+p2)p3 − θ3 (p1+p2)p3       p1 + p2
  A3          (p1+p2)p3 − θ3 (p1+p2)p3       p3² + θ3 (p1+p2)p3             p3
  Marginal
  frequencies p1 + p2                        p3                             1

5.5 Estimation of inbreeding coefficient and hypotheses testing
5.5.1 Estimation of inbreeding coefficient in a model of the type Ωθ
Suppose the joint distribution in a tri-allelic case is of Ωθ .
Let θ be the inbreeding coefficient. Let θ1, θ2 and θ3 be the inbreeding coefficients of the
joint distributions of A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3), respectively. The following
is the fundamental result of this section.
Theorem
The joint distribution is one-dimensional (A is of the type Ωθ ) if and only if θ1 = θ2 = θ3.
The result has two parts:
1. If the joint distribution A of the alleles is of the type Ωθ , then θ1 = θ2 = θ3 = θ.
2. The converse is true. If θ1 = θ2 = θ3 = θ, then the joint distribution A of the alleles is of
the type Ωθ .
Proof of 1. Let A be given by the Ωθ form of Table 5-2:

              A1                 A2                 A3                 Marginal frequencies
  A1          (1−θ)p1² + θp1     (1−θ)p1p2          (1−θ)p1p3          p1
  A2          (1−θ)p1p2          (1−θ)p2² + θp2     (1−θ)p2p3          p2
  A3          (1−θ)p1p3          (1−θ)p2p3          (1−θ)p3² + θp3     p3
  Marginal
  frequencies p1                 p2                 p3                 1

The joint distribution of A1 vs. (not A1) becomes

              A1                    not A1                          Marginal frequencies
  A1          (1−θ)p1² + θp1        (1−θ)p1(p2+p3)                  p1
  not A1      (1−θ)p1(p2+p3)        (1−θ)(p2+p3)² + θ(p2+p3)        p2 + p3
  Marginal
  frequencies p1                    p2 + p3                         1

Note that (1−θ)p1² + θp1 = p1² + θp1(1−p1) and (1−θ)p1(p2+p3) = (1−θ)p1(1−p1). Consequently, the inbreeding coefficient θ1 stemming from the above 2×2 joint distribution is indeed equal to θ. Thus θ = θ1. In a similar vein, one can show that θ = θ2 and θ = θ3.
Proof of 2. Let A be any arbitrary joint distribution of the alleles, as in Table 5-1, and suppose θ1 = θ2 = θ3 = θ. Looking at the joint distribution in the case of A1 vs. (not A1),

              A1             not A1             Marginal frequencies
  A1          p11            p12 + p13          p1
  not A1      p12 + p13      p22 + 2p23 + p33   p2 + p3
  Marginal
  frequencies p1             p2 + p3            1

we have the following identities:
p11 = p1² + θ p1(p2+p3),
p12 + p13 = (1−θ) p1(p2+p3),
p22 + 2p23 + p33 = (p2+p3)² + θ p1(p2+p3).
Similarly, by focusing on the case of A2 vs. (not A2), we have the identities:
p22 = p2² + θ p2(p1+p3),
p12 + p23 = (1−θ) p2(p1+p3),
p11 + 2p13 + p33 = (p1+p3)² + θ p2(p1+p3).
Focusing on the case of A3 vs. (not A3), we have the identities:
p33 = p3² + θ p3(p1+p2),
p13 + p23 = (1−θ) p3(p1+p2),
p11 + 2p12 + p22 = (p1+p2)² + θ p3(p1+p2).
In these nine identities, p11, p22, and p33 are determined directly. From the first set of identities,
2p23 = (p2+p3)² + θp1(p2+p3) − p22 − p33
     = (p2+p3)² + θp1(p2+p3) − p2² − θp2(p1+p3) − p3² − θp3(p1+p2)
     = 2p2p3 + θ(p1p2 + p1p3 − p1p2 − p2p3 − p1p3 − p2p3)
     = 2(1−θ)p2p3.
Consequently, p23 = (1−θ)p2p3. In a similar vein, it follows that p12 = (1−θ)p1p2 and p13 = (1−θ)p1p3. This completes the proof.
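To make the reduction concrete, the following Python sketch (hypothetical counts; not part of the dissertation) collapses tri-allelic genotype counts into the three 2×2 tables A_i vs. (not A_i) and applies the bi-allelic estimator of Chapter 4 to each; these are the estimators θ̂1, θ̂2, θ̂3 used in the next subsection.

```python
# Collapsing tri-allelic counts to three 2x2 tables and estimating theta_1, theta_2, theta_3.
def theta_hats(n11, n12, n13, n22, n23, n33):
    def bi_allelic(hom_i, het_i, hom_not_i):
        # Chapter 4 estimator with counts (n1, n2, n3) = (hom_i, het_i, hom_not_i)
        return (4 * hom_i * hom_not_i - het_i**2) / ((2 * hom_i + het_i) * (het_i + 2 * hom_not_i))
    t1 = bi_allelic(n11, n12 + n13, n22 + n23 + n33)   # A1 vs. (not A1)
    t2 = bi_allelic(n22, n12 + n23, n11 + n13 + n33)   # A2 vs. (not A2)
    t3 = bi_allelic(n33, n13 + n23, n11 + n12 + n22)   # A3 vs. (not A3)
    return t1, t2, t3

print(theta_hats(10, 10, 10, 10, 10, 10))   # equal counts give theta_hat_i = 0.25 for each i
```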
5.5.2 Testing that the joint distribution of the alleles is of the type Ωθ
The hypothesis that the joint distribution of the alleles is of the type Ωθ for some θ is equivalent to the hypothesis H0: θ1 = θ2 = θ3 = θ.
We now develop a strategy for testing H0. Let θ̂1, θ̂2 and θ̂3 be the maximum likelihood estimators of θ1, θ2 and θ3, respectively. Under H0, we thus have three unbiased estimators of θ. We combine these three estimators linearly. Let
θ̂ = l1 θ̂1 + l2 θ̂2 + l3 θ̂3
for some scalars l1, l2, l3. The estimator θ̂ is unbiased for θ if l1 + l2 + l3 = 1. We want to choose l1, l2, l3 so that Var(θ̂) is minimized. Let Σ0 be the variance-covariance matrix of θ̂1, θ̂2 and θ̂3 under H0. Then Var(θ̂) = lᵀ Σ0 l, where l = (l1, l2, l3)ᵀ and lᵀ is the transpose of l.
We minimize lᵀ Σ0 l subject to the constraint l1 + l2 + l3 = 1. Using Lagrange multipliers, the solution turns out to be
l = Σ0⁻¹ 1 / (1ᵀ Σ0⁻¹ 1),  where 1 = (1, 1, 1)ᵀ.
The dispersion matrix Σ0 depends on the common inbreeding coefficient θ and the allele frequencies p1, p2, and p3.
Case 1. θ is known.
Suppose p1, p2, and p3 are known. The test statistic we propose is
Q = (θ̂ − θ·1)ᵀ Σ0⁻¹ (θ̂ − θ·1),
where θ̂ = (θ̂1, θ̂2, θ̂3)ᵀ and 1 = (1, 1, 1)ᵀ. Under H0, Q has asymptotically a χ² distribution with 3 degrees of freedom.
Test: Reject H0 if Q > χ²(3, α), where χ²(3, α) is the upper 100α percentile of a χ² distribution with 3 degrees of freedom and α is the prescribed level of significance.
If p1, p2, and p3 are not known, they can be estimated from the data. Using the data abridged to the format of A1 vs. (not A1), p1 can be estimated. More specifically, let the abridged data be

            A1       not A1
  A1        n11      n12 + n13
  not A1    .        n22 + n23 + n33

From Chapter 4, the gene count estimator of p1 is given by p̂1 = (2n11 + n12 + n13)/(2n). This is an unbiased, consistent estimator of p1. In a similar vein, estimators of p2 and p3 are given, respectively, by p̂2 = (2n22 + n12 + n23)/(2n) and p̂3 = (2n33 + n13 + n23)/(2n).
These estimators can be substituted into the statistic Q. The asymptotic null distribution of Q still remains χ² with 3 degrees of freedom.
For an illustration, suppose θ = 0. The null hypothesis is H0: θ1 = θ2 = θ3 = θ = 0. We want to examine the power of the χ²-test based on Q. Simulations are carried out following the script outlined below.
Choice of the parameters and steps of the simulations:
Step 1. Choose θ ∈ {0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2}.
Step 2. Choose p1, p2, and p3 randomly from [0, 1] subject to p1 + p2 + p3 = 1.
Step 3. Choose the total sample size n = 1000.
Step 4. Simulate the multinomial distribution with n = 1000 and cell probabilities
((1−θ)p1² + θp1, (1−θ)p2² + θp2, (1−θ)p3² + θp3, 2(1−θ)p1p2, 2(1−θ)p1p3, 2(1−θ)p2p3).
Let the data be n11, n12, n13, n22, n23, and n33.
Step 5. Obtain the estimates of p1, p2, and p3 as outlined above.
Step 6. Calculate Σ0 under the scenario θ = 0 with the estimates p̂1, p̂2, and p̂3.
Step 7. Obtain the estimates θ̂ = (θ̂1, θ̂2, θ̂3)ᵀ.
Step 8. Calculate l = Σ0⁻¹ 1 / (1ᵀ Σ0⁻¹ 1), where 1 = (1, 1, 1)ᵀ.
Step 9. Calculate θ̂ = l1 θ̂1 + l2 θ̂2 + l3 θ̂3.
Step 10. Calculate the test statistic Q = (θ̂1, θ̂2, θ̂3) Σ0⁻¹ (θ̂1, θ̂2, θ̂3)ᵀ (i.e., Q above with θ = 0).
Step 11. Choose level α = 0.05.
Step 12. Check whether or not H0 is rejected, i.e., whether Q > χ²(3, α).
For each fixed θ, repeat Steps 2 to 12 10,000 times. Calculate:
83
Empirical size =
No. of times H 0 is rejected
, if θ = 0, and
10,000
Empirical power =
No. of times H 0 is rejected
, if θ > 0
10,000
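Steps 2 and 4 can be sketched as follows. This is a Python stand-in for the dissertation's own simulation code (Mathematica, Appendix 9); the function name is illustrative, and a Dirichlet(1, 1, 1) draw is used here simply as one convenient way of generating random allele frequencies that sum to one.

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_genotype_counts(n, p1, p2, p3, theta):
    """One draw of (n11, n22, n33, n12, n13, n23) from the Omega_theta genotype model (Step 4)."""
    probs = [(1 - theta) * p1**2 + theta * p1,
             (1 - theta) * p2**2 + theta * p2,
             (1 - theta) * p3**2 + theta * p3,
             2 * (1 - theta) * p1 * p2,
             2 * (1 - theta) * p1 * p3,
             2 * (1 - theta) * p2 * p3]
    return rng.multinomial(n, probs)

# Step 2 (random allele frequencies) and Step 4 for one replicate
p1, p2, p3 = rng.dirichlet([1.0, 1.0, 1.0])
n11, n22, n33, n12, n13, n23 = simulate_genotype_counts(1000, p1, p2, p3, theta=0.1)
```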
The results are given in Figure 5-1. Power is plotted against θ .
Figure 5-1  Empirical power of the Q-test for testing H0: θ1 = θ2 = θ3 = θ = 0
Comments: As θ rises from 0 to 0.2, power rises very sharply. At θ = 0, we have valid
nominal size (approximately 5%); at θ = 0.2, the power is approximately 95%. The relevant
Mathematica Code is provided in Appendix 9.
Case 2. θ is unknown
This case is more complicated. We need to estimate the common inbreeding coefficient of the null hypothesis before we can use the test statistic Q. Note that $\Sigma_0$ is a function of θ and the allele frequencies. Let $\hat\theta$ be the best linear unbiased estimator of θ, given by
$$\hat\theta = \ell_1\hat\theta_1 + \ell_2\hat\theta_2 + \ell_3\hat\theta_3, \qquad \ell = (\ell_1, \ell_2, \ell_3)^T = \frac{1}{\mathbf{1}^T\Sigma_0^{-1}\mathbf{1}}\,\Sigma_0^{-1}\mathbf{1}.$$
The vector $\ell$ is not computable, as $\Sigma_0$ involves θ and the allele frequencies $p_1$, $p_2$, and $p_3$. In place of $p_1$, $p_2$, and $p_3$, we use their gene count estimators:
$$\hat{\hat p}_1 = \frac{2n_{11} + n_{12} + n_{13}}{2n}, \qquad \hat{\hat p}_2 = \frac{2n_{22} + n_{12} + n_{23}}{2n}, \qquad \hat{\hat p}_3 = \frac{2n_{33} + n_{13} + n_{23}}{2n}.$$
We obtain $\hat\theta$ iteratively:
Step 1. Let $\ell_1^{(0)} = \ell_2^{(0)} = \ell_3^{(0)} = \tfrac{1}{3}$ and calculate $\hat\theta_{(0)} = \ell_1^{(0)}\hat\theta_1 + \ell_2^{(0)}\hat\theta_2 + \ell_3^{(0)}\hat\theta_3$.
Step 2. Use $\hat\theta_{(0)}$ of Step 1 in $\Sigma_0$, calculate $\ell^{(1)} = \dfrac{1}{\mathbf{1}^T\Sigma_0^{-1}\mathbf{1}}\,\Sigma_0^{-1}\mathbf{1}$, and set $\hat\theta_{(1)} = \ell_1^{(1)}\hat\theta_1 + \ell_2^{(1)}\hat\theta_2 + \ell_3^{(1)}\hat\theta_3$.
Step 2 is then repeated, each time with the latest estimate of θ, until convergence takes place. The test statistic is
$$Q = \big(\hat\theta_1 - \hat\theta,\ \hat\theta_2 - \hat\theta,\ \hat\theta_3 - \hat\theta\big)\, \Sigma_0^{-1} \begin{pmatrix}\hat\theta_1 - \hat\theta\\ \hat\theta_2 - \hat\theta\\ \hat\theta_3 - \hat\theta\end{pmatrix}.$$
Under H0, Q ~ $\chi^2_2$, asymptotically.
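A sketch of the iterative estimate and the resulting 2-degree-of-freedom test. For brevity it uses only the diagonal of $\Sigma_0$, built from the Var($\hat\theta_i$) expression given under the technical details later in this section; the dissertation's own code (Appendices 8 and 9) also carries the delta-method covariance terms. All input numbers are hypothetical, and the behaviour is only intended for moderate values of θ.

```python
import numpy as np
from scipy.stats import chi2

def var_theta_i(theta, p_i, n):
    """Var(theta_i_hat) under H0 (closed form given under 'Technical details' below)."""
    return (1 - theta)**2 * (1 - 2*theta) / n + theta*(1 - theta)*(2 - theta) / (2*n*p_i*(1 - p_i))

def iterate_blue(theta_hats, p_hats, n, tol=1e-8, max_iter=200):
    """Start with equal weights (Step 1), then re-weight with Sigma0 evaluated at the
    current estimate (Step 2) until the combined estimate stabilizes."""
    theta = float(np.mean(theta_hats))                                  # Step 1
    for _ in range(max_iter):
        Sigma0 = np.diag([var_theta_i(theta, p, n) for p in p_hats])    # diagonal stand-in
        w = np.linalg.solve(Sigma0, np.ones(3))
        w /= w.sum()
        new_theta = float(w @ theta_hats)                               # Step 2
        if abs(new_theta - theta) < tol:
            theta = new_theta
            break
        theta = new_theta
    d = np.asarray(theta_hats) - theta
    Q = d @ np.linalg.solve(Sigma0, d)
    return theta, Q, Q > chi2.ppf(0.95, df=2)    # reject H0 at alpha = 0.05 if True

# hypothetical inputs
theta_hat, Q, reject = iterate_blue(np.array([0.08, 0.11, 0.10]), p_hats=(0.38, 0.36, 0.26), n=1000)
```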
From the perspective of size, we want to examine how well this test works. We carried out
simulations following the script below.
Choice of the parameters and steps of the simulations:
Step 1. Choose θ from {0.2k/12 : k = 0, 1, 2, ..., 12}.
Step 2. Choose $p_1$, $p_2$, and $p_3$ randomly from [0, 1] subject to $p_1 + p_2 + p_3 = 1$.
Step 3. Choose total sample size n = 1000.
Step 4. For the θ chosen in Step 1, simulate the multinomial distribution with parameters
$$\big(1000;\ (1-\theta)p_1^2 + \theta p_1,\ (1-\theta)p_2^2 + \theta p_2,\ (1-\theta)p_3^2 + \theta p_3,\ 2(1-\theta)p_1 p_2,\ 2(1-\theta)p_1 p_3,\ 2(1-\theta)p_2 p_3\big)$$
and obtain the data $n_{11}$, $n_{12}$, $n_{13}$, $n_{22}$, $n_{23}$, and $n_{33}$.
Step 5. Obtain the estimates of $p_1$, $p_2$, and $p_3$ as outlined above.
Step 6. Obtain the estimates $\hat{\underline\theta} = (\hat\theta_1, \hat\theta_2, \hat\theta_3)^T$.
Step 7. Initiate the iterative procedure outlined above for estimating θ.
Step 8. Calculate $\Sigma_0$ using the estimates of θ, $p_1$, $p_2$, and $p_3$.
Step 9. Calculate the test statistic
$$Q = \big(\hat\theta_1 - \hat\theta,\ \hat\theta_2 - \hat\theta,\ \hat\theta_3 - \hat\theta\big)\,\Sigma_0^{-1}\begin{pmatrix}\hat\theta_1 - \hat\theta\\ \hat\theta_2 - \hat\theta\\ \hat\theta_3 - \hat\theta\end{pmatrix}.$$
Step 10. Choose level α = 0.05.
Step 11. Check whether or not H0 is rejected, i.e., whether $Q > \chi^2_{2,\alpha}$.
For each fixed θ, Steps 2 to 11 are repeated 10,000 times, and we calculate
Empirical size = (No. of times H0 is rejected) / 10,000.
The results are given in Figure 5-2. Size is plotted versus θ .
Comments: The empirical size holds close to the nominal size of 0.05.
The relevant Mathematica Code is provided in Appendix 9.
Figure 5-2  Empirical size of the Q-test for testing H0: θ1 = θ2 = θ3 = θ (unknown)
In the following, we provide some technical details involved in the computation, under H0, of the variance-covariance matrix $\Sigma_0$ of the estimators $\hat{\underline\theta} = (\hat\theta_1, \hat\theta_2, \hat\theta_3)^T$.
The underlying technique used is the delta method. The computations are carried out symbolically; Mathematica is used for the symbolic computations. All the relevant calculations are incorporated in Appendices 8 and 9.
Technical details:
$$\hat\theta_1 = f(n_{11}, n_{12}, n_{13}, n_{22}, n_{23}, n_{33}) = 1 - \frac{2n(n_{12} + n_{13})}{(2n_{11} + n_{12} + n_{13})(n_{12} + n_{13} + 2n_{22} + 2n_{23} + 2n_{33})}$$
$$\hat\theta_2 = g(n_{11}, n_{12}, n_{13}, n_{22}, n_{23}, n_{33}) = 1 - \frac{2n(n_{12} + n_{23})}{(2n_{22} + n_{12} + n_{23})(n_{12} + n_{23} + 2n_{11} + 2n_{13} + 2n_{33})}$$
$$\hat\theta_3 = h(n_{11}, n_{12}, n_{13}, n_{22}, n_{23}, n_{33}) = 1 - \frac{2n(n_{13} + n_{23})}{(2n_{33} + n_{13} + n_{23})(n_{13} + n_{23} + 2n_{11} + 2n_{12} + 2n_{22})}$$
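For instance, a direct evaluation of these three formulas on one hypothetical set of counts (a sketch, not the dissertation's own code). The counts below are the expected counts under $p = (0.5, 0.3, 0.2)$ and θ = 0.1 with n = 1000, so all three estimates come out approximately equal to 0.1, consistent with the theorem of this chapter.

```python
def theta_hat_123(n11, n12, n13, n22, n23, n33):
    """Evaluate the three collapsed-table estimators f, g, h defined above."""
    n = n11 + n12 + n13 + n22 + n23 + n33
    t1 = 1 - 2*n*(n12 + n13) / ((2*n11 + n12 + n13) * (n12 + n13 + 2*n22 + 2*n23 + 2*n33))
    t2 = 1 - 2*n*(n12 + n23) / ((2*n22 + n12 + n23) * (n12 + n23 + 2*n11 + 2*n13 + 2*n33))
    t3 = 1 - 2*n*(n13 + n23) / ((2*n33 + n13 + n23) * (n13 + n23 + 2*n11 + 2*n12 + 2*n22))
    return t1, t2, t3

# expected counts for p = (0.5, 0.3, 0.2), theta = 0.1, n = 1000
print(theta_hat_123(n11=275, n12=270, n13=180, n22=111, n23=108, n33=56))
```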
Under H0,
$$\mathrm{Var}(\hat\theta_1) = \frac{1}{n}(1-\theta)^2(1-2\theta) + \frac{\theta(1-\theta)(2-\theta)}{2np_1(1-p_1)}$$
$$\mathrm{Var}(\hat\theta_2) = \frac{1}{n}(1-\theta)^2(1-2\theta) + \frac{\theta(1-\theta)(2-\theta)}{2np_2(1-p_2)}$$
$$\mathrm{Var}(\hat\theta_3) = \frac{1}{n}(1-\theta)^2(1-2\theta) + \frac{\theta(1-\theta)(2-\theta)}{2np_3(1-p_3)}$$
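These variance expressions are easy to evaluate directly; note from the formula that at θ = 0 each variance collapses to 1/n, regardless of the allele frequencies. A short sketch with hypothetical values:

```python
def var_theta(theta, p, n=1000):
    """Evaluate the Var(theta_i_hat) expression above."""
    return (1 - theta)**2 * (1 - 2*theta) / n + theta*(1 - theta)*(2 - theta) / (2*n*p*(1 - p))

print([var_theta(0.1, p) for p in (0.5, 0.3, 0.2)])   # diagonal entries of Sigma0 at theta = 0.1
print([var_theta(0.0, p) for p in (0.5, 0.3, 0.2)])   # at theta = 0 each variance equals 1/n
```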
Let $\Sigma$ be the variance-covariance matrix of $\hat\theta_1$, $\hat\theta_2$, $\hat\theta_3$:
$$\Sigma = \begin{pmatrix} \mathrm{Var}(\hat\theta_1) & \mathrm{Cov}(\hat\theta_1,\hat\theta_2) & \mathrm{Cov}(\hat\theta_1,\hat\theta_3) \\ \mathrm{Cov}(\hat\theta_1,\hat\theta_2) & \mathrm{Var}(\hat\theta_2) & \mathrm{Cov}(\hat\theta_2,\hat\theta_3) \\ \mathrm{Cov}(\hat\theta_1,\hat\theta_3) & \mathrm{Cov}(\hat\theta_2,\hat\theta_3) & \mathrm{Var}(\hat\theta_3) \end{pmatrix}.$$
To derive the covariance terms, recall that $\hat\theta_1 = f(n_{11}, \ldots, n_{33})$, $\hat\theta_2 = g(n_{11}, \ldots, n_{33})$, and $\hat\theta_3 = h(n_{11}, \ldots, n_{33})$, where
$$(n_{11}, n_{12}, n_{13}, n_{22}, n_{23}, n_{33}) \sim \mathrm{Multinomial}\big(n;\ (1-\theta)p_1^2 + \theta p_1,\ 2(1-\theta)p_1 p_2,\ 2(1-\theta)p_1 p_3,\ (1-\theta)p_2^2 + \theta p_2,\ 2(1-\theta)p_2 p_3,\ (1-\theta)p_3^2 + \theta p_3\big).$$
We will now calculate $\mathrm{Cov}(\hat\theta_1,\hat\theta_2)$, $\mathrm{Cov}(\hat\theta_1,\hat\theta_3)$, and $\mathrm{Cov}(\hat\theta_2,\hat\theta_3)$ under H0. Let A denote the statements
$$n_{11} = E\,n_{11} = n\big[p_1^2 + \theta p_1(p_2 + p_3)\big], \quad n_{22} = E\,n_{22} = n\big[p_2^2 + \theta p_2(p_1 + p_3)\big], \quad n_{33} = E\,n_{33} = n\big[p_3^2 + \theta p_3(p_1 + p_2)\big],$$
$$n_{12} = E\,n_{12} = 2n(1-\theta)p_1 p_2, \qquad n_{13} = E\,n_{13} = 2n(1-\theta)p_1 p_3, \qquad n_{23} = E\,n_{23} = 2n(1-\theta)p_2 p_3.$$
Calculate, for example,
$$\frac{\partial \hat\theta_1}{\partial n_{11}} = \frac{4n(n_{12} + n_{13})}{(2n_{11} + n_{12} + n_{13})^2\,(n_{12} + n_{13} + 2n_{22} + 2n_{23} + 2n_{33})}.$$
The term $\left(\frac{\partial \hat\theta_1}{\partial n_{11}}\right)_A$ is obtained by plugging the expected values of the $n_{ij}$ from statement A into this expression, and $\left(\frac{\partial \hat\theta_1}{\partial n_{12}}\right)_A$ and the remaining derivative terms of $\hat\theta_1$, namely
$$\left(\frac{\partial \hat\theta_1}{\partial n_{13}}\right)_A,\ \left(\frac{\partial \hat\theta_1}{\partial n_{22}}\right)_A,\ \left(\frac{\partial \hat\theta_1}{\partial n_{23}}\right)_A,\ \left(\frac{\partial \hat\theta_1}{\partial n_{33}}\right)_A,$$
are derived in the same way. In a similar vein, we get the derivative terms $\left(\frac{\partial \hat\theta_2}{\partial n_{ij}}\right)_A$ and $\left(\frac{\partial \hat\theta_3}{\partial n_{ij}}\right)_A$ for all $i \le j$. The explicit terms are included in Appendix 8.

Expanding by the delta method, using the multinomial moments $\mathrm{Var}(n_{ij}) = np_{ij}(1-p_{ij})$ and $\mathrm{Cov}(n_{ij}, n_{kl}) = -np_{ij}p_{kl}$ for $(i,j) \neq (k,l)$, where $p_{ij}$ denotes the multinomial cell probability of genotype $A_iA_j$, and collecting terms gives
$$\mathrm{Cov}(\hat\theta_1,\hat\theta_2) = n\sum_{\substack{i,j=1\\ i\le j}}^{3}\left(\frac{\partial \hat\theta_1}{\partial n_{ij}}\right)_A\left(\frac{\partial \hat\theta_2}{\partial n_{ij}}\right)_A p_{ij} - n\left[\sum_{\substack{i,j=1\\ i\le j}}^{3}\left(\frac{\partial \hat\theta_1}{\partial n_{ij}}\right)_A p_{ij}\right]\left[\sum_{\substack{i,j=1\\ i\le j}}^{3}\left(\frac{\partial \hat\theta_2}{\partial n_{ij}}\right)_A p_{ij}\right].$$
In a similar vein,
$$\mathrm{Cov}(\hat\theta_1,\hat\theta_3) = n\sum_{i\le j}\left(\frac{\partial \hat\theta_1}{\partial n_{ij}}\right)_A\left(\frac{\partial \hat\theta_3}{\partial n_{ij}}\right)_A p_{ij} - n\left[\sum_{i\le j}\left(\frac{\partial \hat\theta_1}{\partial n_{ij}}\right)_A p_{ij}\right]\left[\sum_{i\le j}\left(\frac{\partial \hat\theta_3}{\partial n_{ij}}\right)_A p_{ij}\right],$$
$$\mathrm{Cov}(\hat\theta_2,\hat\theta_3) = n\sum_{i\le j}\left(\frac{\partial \hat\theta_2}{\partial n_{ij}}\right)_A\left(\frac{\partial \hat\theta_3}{\partial n_{ij}}\right)_A p_{ij} - n\left[\sum_{i\le j}\left(\frac{\partial \hat\theta_2}{\partial n_{ij}}\right)_A p_{ij}\right]\left[\sum_{i\le j}\left(\frac{\partial \hat\theta_3}{\partial n_{ij}}\right)_A p_{ij}\right].$$
Again, the Mathematica code and the individual terms can be found in Appendices 8 and 9.
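These delta-method formulas can also be cross-checked numerically. The sketch below (hypothetical parameter values, not the dissertation's symbolic Mathematica computation) takes the gradients by finite differences at the expected counts of statement A and applies the multinomial covariance of the counts directly; the result is a numerical approximation to the full matrix Σ, whose diagonal can be compared with the closed-form variances above.

```python
import numpy as np

def theta_fns(counts, n):
    """theta1_hat, theta2_hat, theta3_hat as functions of (n11, n12, n13, n22, n23, n33)."""
    n11, n12, n13, n22, n23, n33 = counts
    t1 = 1 - 2*n*(n12+n13) / ((2*n11+n12+n13) * (n12+n13+2*n22+2*n23+2*n33))
    t2 = 1 - 2*n*(n12+n23) / ((2*n22+n12+n23) * (n12+n23+2*n11+2*n13+2*n33))
    t3 = 1 - 2*n*(n13+n23) / ((2*n33+n13+n23) * (n13+n23+2*n11+2*n12+2*n22))
    return np.array([t1, t2, t3])

def delta_method_cov(p_alleles, theta, n=1000, eps=1e-4):
    p1, p2, p3 = p_alleles
    # multinomial cell probabilities in the order (n11, n12, n13, n22, n23, n33)
    cell = np.array([(1-theta)*p1**2 + theta*p1, 2*(1-theta)*p1*p2, 2*(1-theta)*p1*p3,
                     (1-theta)*p2**2 + theta*p2, 2*(1-theta)*p2*p3,
                     (1-theta)*p3**2 + theta*p3])
    mean_counts = n * cell
    # numerical gradients of each estimator at the expected counts (statement A)
    G = np.zeros((3, 6))
    for a in range(6):
        up, dn = mean_counts.copy(), mean_counts.copy()
        up[a] += eps
        dn[a] -= eps
        G[:, a] = (theta_fns(up, n) - theta_fns(dn, n)) / (2 * eps)
    # multinomial covariance of the counts: n * (diag(cell) - cell cell')
    V_counts = n * (np.diag(cell) - np.outer(cell, cell))
    return G @ V_counts @ G.T        # approximate Sigma for (theta1_hat, theta2_hat, theta3_hat)

Sigma = delta_method_cov(p_alleles=(0.5, 0.3, 0.2), theta=0.1)
```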
5.6  Conclusions
Every 2x2 joint distribution of alleles in the bi-allelic case can be reformatted to bring into focus the inbreeding coefficient θ. This is not possible in the tri-allelic case (A1, A2, and A3). Consequently, formulating the HWE in this case is fraught with difficulties. Going straight after the inbreeding coefficient once the data are given creates computational complexities of a very high order. In this chapter, we made a systematic exploration of the issues involved. We have looked at three 2x2 joint distributions, A1 vs. (not A1), A2 vs. (not A2), and A3 vs. (not A3), stemming from the 3x3 joint distribution of the alleles, and their corresponding inbreeding coefficients θ1, θ2, and θ3, respectively. We have shown that the 3x3 joint distribution of the alleles admits a single inbreeding coefficient θ if and only if θ1 = θ2 = θ3 = θ. This basic result paved the way to tackle the HWE problem. We proposed a Chi-squared test Q to test hypotheses about θ. This test is shown to have good power.
There is a caveat in the way we conducted the simulations. The allele frequencies p1 and p2 are randomly generated from [0, 1] subject to p1 + p2 ≤ 1. The inbreeding coefficient θ has natural bounds set by the allele frequencies. In the simulations, whenever the natural bounds are violated, data are not generated, and the simulations are counted only for those cases in which data are generated. This does not seem to impact the observed size (see Figure 5-2), and we believe that it has no impact on power either (see Figure 5-1).
6  Generalization to multiple alleles
In Chapter 5, a comprehensive discussion was carried out about the HWE in the context of 3 alleles. A crucial result reducing the 3x3 joint distribution to several 2x2 joint distributions helped the testing problem. In this chapter, we generalize that key result to the multi-allelic case with any number of alleles.
When a specific locus has three or more alleles (k alleles), it is apparently not possible, when maximizing the likelihood over the parameter space Ω* with respect to all the marginals and the inbreeding coefficient θ, to find an explicit solution to the maximum likelihood equations. As in Chapter 5, we need to find a way to reduce the dimensionality of the problem. This is what we plan to do in this chapter.
6.1  Formulation of the problem
For three or more alleles (k alleles), the joint distribution of the alleles A1, A2, ..., Ak can be represented in the following contingency table (Table 6-1).

Table 6-1  General joint distribution of genotypes of multiple alleles

  Alleles       A1        A2        ...   A(k-1)      Ak        Marginal frequencies
  A1            p11       p12       ...   p1,k-1      p1,k      p1
  A2            p21       p22       ...   p2,k-1      p2,k      p2
  ...           ...       ...       ...   ...         ...       ...
  A(k-1)        pk-1,1    pk-1,2    ...   pk-1,k-1    pk-1,k    pk-1
  Ak            pk,1      pk,2      ...   pk,k-1      pk,k      pk
  Marginal
  frequencies   p1        p2        ...   pk-1        pk        1
Let us use the generic symbol $A_{k\times k}$ for the joint distribution. The marginal probabilities p1, p2, p3, ..., pk are usually called allelic frequencies. The genotypes A1A2 and A2A1, for example, are not distinguishable. The matrix $A_{k\times k}$ has the following properties:
1. $A_{k\times k}$ is symmetric;
2. The two marginal distributions are the same.
The purpose of this chapter is to make inferences on the unknown $A_{k\times k}$ of the population of interest based on a random sample of individuals and their genotypes. The underlying parameter spaces are defined by:
1) Ω = {$A_{k\times k}$ : A is a joint distribution that is symmetric with identical marginals}. The dimension of Ω is given by $\dfrac{k(k+1)}{2} - 1$.
2) Ω_P = {all joint distributions in Ω with fixed marginals}, where p1, p2, ..., pk are fixed and known. Ω_P is a special subset of Ω of interest. The dimension of Ω_P is given by $\dfrac{k(k-1)}{2}$.
3) Ω* = {A : the joint distribution A is of the type Ω_θ for some θ}.
Definition: A joint distribution is of type Ω* if it is of the form given in Table 6-2 for some θ, p1, p2, p3, ..., pk.
Table 6-2  A joint distribution of type Ω*

  Alleles       A1                A2                ...   A(k-1)                   Ak                Marginal frequencies
  A1            (1-θ)p1² + θp1    (1-θ)p1p2         ...   (1-θ)p1p(k-1)            (1-θ)p1pk         p1
  A2                              (1-θ)p2² + θp2    ...   (1-θ)p2p(k-1)            (1-θ)p2pk         p2
  ...                                               ...   ...                      ...               ...
  A(k-1)                                                  (1-θ)p(k-1)² + θp(k-1)   (1-θ)p(k-1)pk     pk-1
  Ak                                                                               (1-θ)pk² + θpk    pk
  Marginal
  frequencies   p1                p2                ...   pk-1                     pk                1
The dimension of Ω* is clearly k. The entity θ can be labeled as the inbreeding coefficient.
When θ = 0, the equilibrium is achieved. The equilibrium distribution is given in Table 6-3. It is
clear that Ω* is a subset of Ω .
4) Parameter space: Ω* p1, p2, …, pk = {A ∈ Ω* : A has marginal frequencies p1, p2 ,…, pk}
If marginals p1, p2 , p3, …and pk-1 are all known, only one parameter θ is left unknown. The
dimension of Ω* p1, p2, …, pk is one.
Table 6-3  Joint distribution of genotypes under equilibrium (Ω0)

  Alleles       A1      A2      ...   A(k-1)      Ak         Marginal frequencies
  A1            p1²     p1p2    ...   p1p(k-1)    p1pk       p1
  A2                    p2²     ...   p2p(k-1)    p2pk       p2
  ...                           ...   ...         ...        ...
  A(k-1)                              p(k-1)²     p(k-1)pk   pk-1
  Ak                                              pk²        pk
  Marginal
  frequencies   p1      p2      ...   pk-1        pk         1
6.2  Data and Likelihood

6.2.1  Data Structure
A random sample of n individuals is selected from the population of interest. Let nij (i ≤ j) be the number of individuals in the sample with genotype AiAj. The data are arranged in Table 6-4.

Table 6-4  Data collected for multiple alleles

  Alleles       A1      A2      ...   A(k-1)      Ak        Marginal frequencies
  A1            n11     n12     ...   n1,k-1      n1,k      n1
  A2                    n22     ...   n2,k-1      n2,k      n2
  ...                           ...   ...         ...       ...
  A(k-1)                              nk-1,k-1    nk-1,k    nk-1
  Ak                                              nk,k      nk
  Marginal
  frequencies   n1      n2      ...   nk-1        nk        n

6.2.2  Maximum Likelihood estimators
The likelihood of the data is given by
$$L(A) = \prod_{i=1}^{k} (p_{ii})^{n_{ii}} \prod_{i<j} (2p_{ij})^{n_{ij}}.$$
1. Maximize L over all A ∈ Ω. The maximum likelihood estimators are given by
$$\hat p_{ii} = \frac{n_{ii}}{n},\quad i = 1, 2, \ldots, k, \qquad\qquad 2\hat p_{ij} = \frac{n_{ij}}{n},\quad i < j.$$
2. Maximize L over A ∈ Ω*. This is an intractable optimization problem.
3. Maximize L over $\Omega^*_{p_1, p_2, \ldots, p_k}$ (see the sketch following this discussion).
Suppose the marginal frequencies p1, p2, ..., pk are known. The only unknown parameter is θ. The likelihood is given by
$$L(A) = \prod_{i=1}^{k}\big[(1-\theta)p_i^2 + \theta p_i\big]^{n_{ii}} \prod_{i<j}\big[2(1-\theta)p_i p_j\big]^{n_{ij}},$$
so that
$$\ln L(\theta) = \sum_{i=1}^{k} n_{ii}\,\ln\!\big[(1-\theta)p_i^2 + \theta p_i\big] + \sum_{i<j} n_{ij}\,\ln(1-\theta) + \text{constant}.$$
Setting
$$\frac{\partial \ln L}{\partial \theta} = \sum_{i=1}^{k} \frac{n_{ii}(p_i - p_i^2)}{(1-\theta)p_i^2 + \theta p_i} - \sum_{i<j} \frac{n_{ij}}{1-\theta} \overset{\text{set}}{=} 0$$
and solving for $\hat\theta$: let $\eta = \dfrac{\theta}{1-\theta}$. The likelihood equation then becomes
$$\sum_{i=1}^{k} \frac{n_{ii}(p_i - p_i^2)}{p_i^2 + \eta p_i} = \sum_{i<j} n_{ij}.$$
This is a polynomial equation in η of degree k. The solution is tractable: once the data and the marginal frequencies are given, good software should be able to provide all the roots of the polynomial, and from the set of roots one can pick the root that maximizes the likelihood.
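The sketch below is one way to carry this out numerically for the tri-allelic case; it is an illustrative stand-in, not the dissertation's own software, and the function name, inputs, and example counts are hypothetical. It builds the degree-k polynomial, finds all of its roots, converts each admissible root back to θ, and keeps the one with the largest log-likelihood. For the expected counts used in the example (p = (0.5, 0.3, 0.2), θ = 0.1, n = 1000) the returned value should be close to 0.1.

```python
import numpy as np
from numpy.polynomial import Polynomial as Poly

def mle_theta_known_margins(n_diag, n_off_total, p):
    """Solve sum_i n_ii (p_i - p_i^2)/(p_i^2 + eta p_i) = sum_{i<j} n_ij for eta = theta/(1-theta),
    then return the admissible root that maximizes the log-likelihood. Degenerate data
    (e.g. empty count categories) are not handled in this sketch."""
    p = np.asarray(p, dtype=float)
    n_diag = np.asarray(n_diag, dtype=float)
    factors = [Poly([pi**2, pi]) for pi in p]            # p_i^2 + eta * p_i
    all_prod = Poly([1.0])
    for f in factors:
        all_prod = all_prod * f
    lhs = Poly([0.0])
    for i, pi in enumerate(p):
        prod_others = Poly([1.0])
        for j, f in enumerate(factors):
            if j != i:
                prod_others = prod_others * f
        lhs = lhs + prod_others * float(n_diag[i] * (pi - pi**2))
    score_poly = lhs - all_prod * float(n_off_total)     # roots of this polynomial solve the equation

    roots = score_poly.roots()
    real = roots.real[np.abs(roots.imag) < 1e-9]
    thetas = real / (1.0 + real)                         # theta = eta / (1 + eta)
    lo = max(-pi / (1.0 - pi) for pi in p)               # keep all cell probabilities non-negative
    cands = [t for t in thetas if lo < t < 1.0]

    def loglik(theta):
        return (np.sum(n_diag * np.log((1 - theta) * p**2 + theta * p))
                + n_off_total * np.log(1 - theta))

    return max(cands, key=loglik)

# hypothetical tri-allelic data with known margins; should recover theta close to 0.1
theta_hat = mle_theta_known_margins(n_diag=[275, 111, 56],
                                    n_off_total=270 + 180 + 108,
                                    p=(0.5, 0.3, 0.2))
```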
6.3  Lower dimensional joint distributions
As in Chapter 5, we will identify 2x2 joint distributions whose inbreeding coefficients uniquely determine the inbreeding coefficient of the kxk joint distribution.
Let θi be the inbreeding coefficient stemming from the 2x2 joint distribution associated with the case Ai vs. (not Ai), i = 1, 2, ..., k.
For i < j, with i, j ∈ {1, 2, ..., k}, let θij be the inbreeding coefficient stemming from the 2x2 joint distribution associated with the case (Ai or Aj) vs. (not Ai and not Aj). The following is the main result.
Theorem. A joint distribution A of the genotypes is of the type Ωθ for some θ if and only if θi = θ for every i and θij = θ for every i < j.

Proof. It is easy to show that if A is of type Ωθ, then θi = θ for every i. For example, look at the 2x2 joint distribution associated with A1 vs. (not A1):

  Alleles       A1                           not A1                       Marginal frequencies
  A1            (1-θ)p1² + θp1               Σ_{i=2}^{k} (1-θ)p1pi        p1
  not A1        Σ_{i=2}^{k} (1-θ)p1pi        *                            p2 + ... + pk
  Marginal
  frequencies   p1                           p2 + ... + pk                1

  (* is obtained by subtraction.)

From the table, it follows that θ1 = θ. In a similar fashion, one can show that θ1 = θ2 = ... = θk = θ.
Look at the 2x2 joint distribution associated with (A1 or A2) vs. (not A1 and not A2):

  Alleles              A1 or A2                                    not A1 and not A2    Marginal frequencies
  A1 or A2             (1-θ)(p1² + p2²) + θ(p1 + p2) + 2(1-θ)p1p2  a                    p1 + p2
  not A1 and not A2    b                                           c                    p3 + ... + pk
  Marginal
  frequencies          p1 + p2                                     p3 + ... + pk        1

  (a, b, and c are obtained by subtraction.)

Note that $(1-\theta)(p_1^2 + p_2^2) + \theta(p_1 + p_2) + 2(1-\theta)p_1 p_2 = (1-\theta)(p_1 + p_2)^2 + \theta(p_1 + p_2)$, which implies that θ12 = θ. In a similar fashion, one can show that θij = θ for every i < j.

Let us prove the converse. Suppose θi = θ for every i and θij = θ for every i < j. We want to show that in A = (pij),
$$p_{ii} = (1-\theta)p_i^2 + \theta p_i,\quad i = 1, 2, \ldots, k, \qquad \text{and} \qquad p_{ij} = (1-\theta)p_i p_j \ \text{ for every } i < j.$$
The diagonal identities hold directly by the hypothesis θi = θ. Looking at the case (Ai or Aj) vs. (not Ai and not Aj), i < j, we have
$$p_{ii} + 2p_{ij} + p_{jj} = (1-\theta_{ij})(p_i + p_j)^2 + \theta_{ij}(p_i + p_j) = (1-\theta)(p_i + p_j)^2 + \theta(p_i + p_j).$$
Since $p_{ii} = (1-\theta)p_i^2 + \theta p_i$ and $p_{jj} = (1-\theta)p_j^2 + \theta p_j$, we have
$$2p_{ij} = (1-\theta)(p_i + p_j)^2 + \theta(p_i + p_j) - p_{ii} - p_{jj} = 2(1-\theta)p_i p_j,$$
which implies $p_{ij} = (1-\theta)p_i p_j$.
This completes the proof.
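A quick numerical check of the theorem (a sketch with hypothetical p and θ, not part of the dissertation's code): build a kxk joint distribution of type Ωθ, collapse it to the 2x2 tables above, and confirm that every collapsed inbreeding coefficient equals θ.

```python
import numpy as np
from itertools import combinations

def omega_theta(p, theta):
    """k x k joint distribution of type Omega_theta (both triangles filled symmetrically)."""
    p = np.asarray(p, dtype=float)
    A = (1 - theta) * np.outer(p, p)
    A[np.diag_indices_from(A)] += theta * p
    return A

def collapsed_theta(A, subset):
    """Inbreeding coefficient of the 2x2 table 'subset' vs. its complement."""
    idx = list(subset)
    q = A[idx, :].sum()                    # marginal probability of the pooled allele class
    a = A[np.ix_(idx, idx)].sum()          # (subset, subset) cell
    return (a - q**2) / (q * (1 - q))      # solves a = (1 - t) q^2 + t q for t

p, theta = (0.4, 0.3, 0.2, 0.1), 0.07
A = omega_theta(p, theta)
checks = [collapsed_theta(A, s) for r in (1, 2) for s in combinations(range(len(p)), r)]
assert np.allclose(checks, theta)          # every theta_i and theta_ij equals theta
```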
Remarks: The number of 2x2 joint distributions required for a unique determination of the inbreeding coefficient of the kxk joint distribution is at most
$$k + \frac{k(k-1)}{2} = \frac{k(k+1)}{2}.$$
This is the upper bound. For lower values of k, there will be some duplication of the θi's with the θij's. We give a list for k ∈ {3, 4, 5} in Table 6-5 below, followed by a small enumeration sketch.
Table 6-5  2x2 distributions required for a test about the inbreeding coefficient

  k = 3 (total: 3)
    A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3)

  k = 4 (total: 7)
    A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3); A4 vs. (not A4);
    (A1 or A2) vs. (A3 or A4); (A1 or A3) vs. (A2 or A4); (A1 or A4) vs. (A2 or A3)

  k = 5 (total: 15)
    A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3); A4 vs. (not A4); A5 vs. (not A5);
    (A1 or A2) vs. (A3, A4 or A5); (A1 or A3) vs. (A2, A4 or A5); (A1 or A4) vs. (A2, A3 or A5);
    (A1 or A5) vs. (A2, A3 or A4); (A2 or A3) vs. (A1, A4 or A5); (A2 or A4) vs. (A1, A3 or A5);
    (A2 or A5) vs. (A1, A3 or A4); (A3 or A4) vs. (A1, A2 or A5); (A3 or A5) vs. (A1, A2 or A4);
    (A4 or A5) vs. (A1, A2 or A3)

  k = 6 (total: 21), unlisted

And so on.
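The totals in Table 6-5 can be reproduced by enumerating the distinct binary collapsings {S, complement of S} with |S| equal to 1 or 2, as in this sketch (the function name is illustrative):

```python
from itertools import combinations

def required_tables(k):
    """Distinct 2x2 collapsings {S, complement of S} with |S| = 1 or 2."""
    alleles = frozenset(range(1, k + 1))
    parts = set()
    for r in (1, 2):
        for s in combinations(sorted(alleles), r):
            s = frozenset(s)
            parts.add(frozenset({s, alleles - s}))   # a partition and its mirror count once
    return parts

for k in (3, 4, 5, 6):
    print(k, len(required_tables(k)))   # 3, 7, 15, 21, matching Table 6-5
```

For k = 3 and k = 4 the pooled pairs duplicate other collapsings, which is why the totals fall below the upper bound k(k+1)/2.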
We have not exploited this result to build tests on the inbreeding coefficient.
7  Conclusions and Future Research
The broad theme of research carried out in this dissertation is on association studies in
genetics. The work done can be classified into three segments.
In the first segment (Chapter 3), association between a bi-allelic gene and a quantitative
phenotype is the main focus. An additive model is assumed exemplifying the connection between
the genotypes of the (true) gene and phenotype of interest. The true gene is unknown. Data are
collected on the phenotype of subjects classified according to the genotypes for a gene under
investigation. The ANOVA method is the standard recipe to examine whether the investigative
gene is associated with the phenotype. We have discovered that the assumptions needed for the
validity of the ANOVA method are not met. Normality failed; homogeneity of variances does not
hold. Bartlett’s test is a viable option. One needs normality for the validity of the test, which tests
homogeneity of variances.
We made a comparison of the performances of both procedures in terms of power. We
have shown that the ANOVA procedure is superior to the Bartlett’s test. Violation of the normality
condition is the main reason, we believe, for the poor performance of the Bartlett’s test. The
ANOVA procedure seems more robust despite the violation of its assumptions. There are a
number of tests of homogeneity of variances (eg. Levene test) available in the literature which are
more robust than the Bartlett’s test. Future work would consist of examining details of these tests
and make a comprehensive comparison of powers.
In the second segment (Chapter 4), the focus is on testing hypotheses about the Hardy-Weinberg equilibrium. The standard method used is the χ²-test, which is not usable when one wants to entertain one-sided alternatives. Two alternative tests (the Z-test and Siegmund’s test) are considered to fill this lacuna. We have shown that these two tests are essentially the same. A sample size formula has been established using the Z-test. A comparison of the sample sizes obtained using the Z-test and Ward and Sing’s χ²-test is made. The sample sizes under the Z-test are smaller than those given by the χ²-test.
In the future, we want to develop R code for the sample size calculations. After it is developed, we want to make it available to R users and seek their input.
Examining the Hardy-Weinberg equilibrium issues in a case of tri-allelic gene is fraught
with mathematical, statistical, and computational difficulties. This problem is taken up in Segment
3 (Chapter 5). Two solutions have been proposed. One was to find the maximum likelihood
estimate of the inbreeding coefficient after the standard estimates of the allele frequencies have
been plugged into the equation. The resulting equation is a third degree polynomial in the
inbreeding coefficient.
The other solution is to reduce the problem to several 2x2 joint distributions. Testing about
the inbreeding coefficient is equivalent to testing the equality of several inbreeding coefficients
stemming from the 2x2 distributions. A test statistic, which is a quadratic form, has been proposed
to carry out the testing problem. This test is shown to have good power.
We want to compare the performance of the two procedures we have proposed in this connection
in terms of power in a future endeavor.
In Chapter 6, we have broadened the Hardy-Weinberg equilibrium problem to multi-allelic cases. The details have not been completely worked out, and we want to pursue the extension more extensively. As the number of alleles increases, the computational complexity increases. We want to develop R code to ease the heavy burden of the computations.
Bibliographic references
Barr, J (1991) Liver slices in dynamic organ culture. II. An in vitro cellular technique for the study of integrated drug metabolism using human tissue. Xenobiotica; 21(3): 341-350.
Chakraborty, R, Hanis, CL, and Boerwinkle, E (1986) Effect of a marker locus on the quantitative variability of a risk factor to chronic diseases. American Journal of Human Genetics; 39: A231.
Chakraborty, R, and Zhong, Y (1994) Statistical power of an exact test of Hardy-Weinberg proportions of genotypic data at a multiallelic locus. Human Heredity; 44: 1-9.
Choudhry, S and Coyle, NE (2006) Population stratification confounds genetic association studies. Human Genetics; 118: 652-664.
Cockerham, CC (1973) Analyses of gene frequencies. Genetics; 74: 679-700.
Cramer, H (1961) Mathematical Methods of Statistics. Princeton University Press, USA.
Crow, JF and Kimura, M (1970) An Introduction to Population Genetics Theory. Harper and Row, New York.
Curie-Cohen, M (1982) Estimates of inbreeding in a natural population: A comparison of sampling properties. Genetics; 100: 339-358.
Guo, SW and Thompson, EA (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics; 48: 361-372.
Haber, M (1980) Detection of inbreeding effects by the χ2 test on genotypic and phenotypic frequencies. American Journal of Human Genetics; 32: 754-760.
Li, CC (1969) Population subdivision with respect to multiple alleles. Annals of Human Genetics, London; 33: 23-29.
Li, CC and Horvitz, DG (1953) Some methods of estimating the inbreeding coefficient. American Journal of Human Genetics; 5(2): 107-117.
Lessios, H (1992) Testing electrophoretic data for agreement with Hardy-Weinberg expectations. Marine Biology; 112: 517-523.
Milliken, GA and Johnson, DE (1989) Analysis of Messy Data. Van Nostrand Reinhold, New York.
Pamilo, P and Varvio-Aho, S (1984) Testing genotype frequencies and heterozygosities. Marine Biology; 79: 99-100.
Ralls, K, Brugger, K and Ballou, J (1979) Inbreeding and juvenile mortality in small populations of ungulates. Science; 206: 1101-1103.
Rao, CR (1983) Linear Statistical Inference and Its Applications. Wiley Publications, NY.
Senner, JW (1980) Inbreeding and the survival of zoo populations. In: Conservation Biology; 209-229. Edited by M. Soule and B. Wilcox. Sinauer Associates, Sunderland, Massachusetts.
Ward, R and Sing, C (1970) A consideration of the power of the χ2 test to detect inbreeding effects in natural populations. American Naturalist; 104(938): 355-365.
Wright, S (1977) Evolution and the Genetics of Populations; Vol. 3. University of Chicago Press, Chicago.
Weir, B (1996) Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland, Massachusetts.
Appendices
Mathematical derivations,
SAS Programs and Mathematica Codes
Appendix 1: Derivation of Conditional Expectations and Variances
In the additive model presented in Chapter 3, conditional means and variances are needed to check
on the assumptions of ANOVA. In the Appendix here, the calculations are set out.
For genotype AA of the marker gene G′:
2
2
λPMA
− λPmA
E(P | G′ = AA) =
=
2
A
P
=
=
=
λ ( PMA + PmA )( PMA − PmA )
PA2
λ ( PM PA + Δ + Pm PA − Δ )( PM PA + Δ − Pm PA + Δ )
PA2
λ ( PA )( PA ( PM − Pm ) + 2Δ )
PA2
λ (PM − Pm ) +
2 Δλ
PA
2
2
(λ2 + σ 2 ) + 2 PMA PmA (σ 2 ) + PmA
(λ 2 + σ 2 )
PMA
E(P | G′ = AA) =
PA2
2
2
2
2
2
λ2 + PMA
σ 2 + 2 PMA PmA (σ 2 ) + PmA
λ2 + PmA
σ2
PMA
=
PA2
=
=
=
2
2
2
2
+ PmA
+ 2 PMA PmA + PmA
λ2 ( PMA
) + σ 2 ( PMA
)
PA2
2
2
+ PmA
λ2 ( PMA
) + σ 2 ( PMA + PmA ) 2
PA2
2
2
+ PmA
λ2 ( PMA
) + σ 2 ( PM PA + Δ + Pm PA − Δ) 2
PA2
=σ +
2
Var(P | G′ = AA) = σ +
2
2
2
+ PmA
λ2 ( PMA
)
PA2
2
2
λ2 ( PMA
)
+ PmA
PA2
−
λ2
PA4
2
2 2
( PMA
− PmA
)
=σ +
2
2
=σ +
2
=σ +
= σ2 +
= σ2 +
= σ2 +
2
2
+ PmA
λ2 ( PMA
)
PA2
2
2
λ2 ( PMA
)
+ PmA
2
A
P
2
2
λ2 ( PMA
)
+ PmA
2
A
P
λ2
2
A
P
λ2
PA2
−
−
−
λ2
4
A
P
λ2
4
A
P
λ2
2
A
P
[( PMA − PmA )( PMA + PmA )]2
[( PMA − PmA )( PA )]2
( PMA − PmA ) 2
2
2
[( PMA
+ PmA
) − ( PMA − PmA ) 2 ]
(2 PMA PmA )
2λ2
( PM PA + Δ)( Pm PA − Δ)
PA2
2λ2
2
2
= σ + 2 ( PM PA Pm + ΔPm PA − ΔPM PA − Δ )
PA
2
= σ 2 + 2λ2 PM Pm +
2Δλ2 ( Pm − PM ) 2Δ2 λ2
−
PA
PA2
Also, for genotype Aa of marker G′:
2PMA PMa N( λ , σ 2 ) + (2PMA Pma + 2PMa PmA )N(0, σ 2 ) + 2PmA Pma N( − λ , σ 2 )
P | G′ = Aa =
2PA Pa
E(P | G′ = Aa) =
=
2 PMA PMa λ − 2 PmA Pma λ
2 PA Pa
2λ( PMA PMa − PmA Pma)
2 PA Pa
=
=
λ[( PM PA + Δ)( PM Pa − Δ) − ( Pm PA − Δ)( Pm Pa + Δ)]
PA Pa
λ[( PM2 PA Pa + ΔPM Pa − ΔPM PA − Δ2 ) − ( Pm2 PA Pa − ΔPm Pa + ΔPm PA − Δ2 )]
PA Pa
=
=
=
λ ( PM2 PA Pa + ΔPM Pa − ΔPM PA − Δ2 − Pm2 PA Pa + ΔPm Pa − ΔPm PA + Δ2 )
PA Pa
λ[( PM2 − Pm2 ) PA Pa + ΔPa ( PM + Pm ) − ΔPA ( PM + Pm )]
PA Pa
λ[( PM2 − Pm2 ) PA Pa + ΔPa − ΔPA ]
PA Pa
= λ ( PM − Pm ) −
E(P2 | G′ = Aa) =
Δλ ( PA − Pa )
PA Pa
2 PMA PMa (λ2 + σ 2 ) + (2 PMA Pma + 2 PMa PmA )(σ 2 ) + 2 PmA Pma (λ2 + σ 2 )
2 PA Pa
( PMA PMa + PmA Pma )λ2 + ( PMA PMa + PMA Pma + PMa PmA + PmA Pma )(σ 2 )
=
PA Pa
=
( PMA PMa + PmA Pma )λ2 + ( PMA + PmA )( PMa + Pma )(σ 2 )
PA Pa
=
( PMA PMa + PmA Pma )λ2 + PA Pa (σ 2 )
PA Pa
( PMA PMa + PmA Pma )λ2
=σ +
PA Pa
2
= σ2 +
= σ2 +
[( PM PA + Δ)( PM Pa − Δ ) + ( Pm PA − Δ)( Pm Pa + Δ)]λ2
PA Pa
( PM2 PA Pa + ΔPM Pa − ΔPM PA − Δ2 + Pm2 PA Pa − ΔPm Pa + ΔPm PA − Δ2 )λ2
PA Pa
[( PM2 + Pm2 ) PA Pa + ΔPM ( Pa − PA ) − ΔPm ( Pa − PA ) − 2Δ2 )]λ2
=σ +
PA Pa
2
= σ 2 + ( PM2 + Pm2 )λ2 +
Var(P | G′ = Aa) = σ 2 + ( PM2 + Pm2 )λ2 +
[Δ( PM − Pm )( Pa − PA ) − 2Δ2 )]λ2
PA Pa
[Δ( PM − Pm )( Pa − PA ) − 2Δ2 )]λ2
PA Pa
− [λ ( PM − Pm ) −
Δλ ( PA − Pa ) 2
]
PA Pa
Δλ2 ( PM − Pm )( PA − Pa ) − 2Δ2 λ2
= σ + ( P + P )λ −
PA Pa
2
2
M
2
m
2
− λ2 ( PM − Pm ) 2 +
2Δλ2 ( PM − Pm )( PA − Pa ) Δ2 λ2 ( PA − Pa ) 2
−
PA Pa
PA2 Pa2
= σ 2 + 2λ2 PM Pm +
Δλ2 ( PM − Pm )( PA − Pa ) Δ2 λ2
( P − Pa ) 2
−
[2 + A
]
PA Pa
PA Pa
PA Pa
= σ 2 + 2λ2 PM Pm +
Δλ2 ( PM − Pm )( PA − Pa ) Δ2 λ2 PA2 + Pa2
(
)
−
PA Pa
PA Pa PA Pa
Δλ2 ( PM − Pm )( PA − Pa ) Δ2 λ2 ( PA2 + Pa2 )
−
= σ + 2λ PM Pm +
PA Pa
PA2 Pa2
2
2
Again, for genotype aa of marker G′:
2
2
PMa
N( λ , σ 2 ) + 2PMa Pma N(0, σ 2 ) + Pma
N( − λ , σ 2 )
P | G′ = aa =
Pa2
2
PMa
λ − Pma2 λ
E(P | G′ = aa) =
Pa2
=
=
=
λ ( PMa + Pma )( PMa − Pma )
Pa2
λ ( Pa )( PM Pa − Δ − Pm Pa − Δ)
Pa2
λ ( Pa )( Pa ( PM − Pm ) − 2Δ )
Pa2
= λ ( PM − Pm ) −
2Δλ
Pa
2
2
(λ2 + σ 2 ) + 2 PMa Pma (σ 2 ) + Pma
(λ 2 + σ 2 )
PMa
E(P | G′ = aa) =
Pa2
2
=
2
2
( PMa + Pma ) 2 σ 2 λ2 ( PMa
)
+ Pma
+
2
2
Pa
Pa
=σ +
2
Var(P | G′ = aa) = σ 2 +
2
2
+ Pma
λ2 ( PMa
)
Pa2
2
2
+ Pma
λ2 ( PMa
)
2
a
P
= σ2 +
− [λ ( PM − Pm ) −
2Δλ 2
]
Pa
2
2
λ2 ( PMa
)
+ Pma
Pa2
4Δλ 2 (PM − Pm ) 4Δ 2λ 2
−[λ (PM − Pm ) −
+
]
Pa
Pa2
2
2
λ 2 [(PM Pa − Δ ) 2 + (Pm Pa + Δ ) 2 ]
= σ2 +
Pa2
−[λ 2 (PM − Pm ) 2 −
=σ +
2
λ2
2
a
P
[(PM2 + Pm2 )Pa2 − 2ΔPa (PM − Pm ) + 2Δ 2 ]
−[λ 2 (PM − Pm ) 2 −
=
4Δλ 2 (PM − Pm ) 4Δ 2λ 2
+
]
Pa
Pa2
4Δλ 2 (PM − Pm ) 4Δ 2λ 2
+
]
Pa
Pa2
σ 2 + λ 2 (PM2 + Pm2 ) −
2Δλ 2 (PM − Pm ) 2Δ 2λ 2
+
Pa
Pa2
4Δλ 2 (PM − Pm ) 4Δ 2λ 2
−[λ (PM − Pm ) −
+
]
Pa
Pa2
2
=
2
σ 2 + 2λ 2 PM Pm +
2Δλ 2 (PM − Pm ) 2Δ 2λ 2
−
Pa
Pa2
Therefore, the derived conditional expectations and variances are summarized as follows:
$$E(P \mid G' = AA) = \lambda(P_M - P_m) + \frac{2\Delta\lambda}{P_A}$$
$$\mathrm{Var}(P \mid G' = AA) = \sigma^2 + 2\lambda^2 P_M P_m + \frac{2\Delta\lambda^2(P_m - P_M)}{P_A} - \frac{2\Delta^2\lambda^2}{P_A^2}$$
$$E(P \mid G' = Aa) = \lambda(P_M - P_m) - \frac{\Delta\lambda(P_A - P_a)}{P_A P_a}$$
$$\mathrm{Var}(P \mid G' = Aa) = \sigma^2 + 2\lambda^2 P_M P_m + \frac{\Delta\lambda^2(P_M - P_m)(P_A - P_a)}{P_A P_a} - \frac{\Delta^2\lambda^2(P_A^2 + P_a^2)}{P_A^2 P_a^2}$$
$$E(P \mid G' = aa) = \lambda(P_M - P_m) - \frac{2\Delta\lambda}{P_a}$$
$$\mathrm{Var}(P \mid G' = aa) = \sigma^2 + 2\lambda^2 P_M P_m + \frac{2\Delta\lambda^2(P_M - P_m)}{P_a} - \frac{2\Delta^2\lambda^2}{P_a^2}$$
Appendix 2: SAS code of different scenarios of ANOVA test
Power calculations under the ANOVA test are set out in this appendix in the form of SAS macros.
(Chapter 3)
options nosource nodate;
ods trace off;
%macro
sim(delta=0,lamda=1,v=1,Pm=0.5,Pa=0.5,sample=200,alphalevel=0.05,seed_0=0,seed_1
=0,simu=10);
data one;
retain sd0-sd1 (&seed_0 &seed_1);
BmBm=(1-&pm)**2;BmLm=2*(1-&pm)*&pm;LmLm=(&pm)**2;
BaBa=(1-&pa)**2;BaLa=2*(1-&pa)*&pm;LaLa=(&pa)**2;
BmBa=(1-&pm)*(1-&pa)+&delta;
BmLa=(1-&pm)*&pa-&delta;
LmBa=&pm*(1-&pa)-&delta;
LmLa=&pm*&pa+&delta;
p1=BmBa**2;
p2=2*BmBa*BmLa;
p3=BmLa**2;
p4=2*BmBa*LmBa;
p5=2*BmBa*LmLa+2*BmLa*LmBa;
p6=2*BmLa*LmLa;
p7=LmBa**2;
p8=2*LmBa*LmLa;
p9=LmLa**2;
g1=BmBa**2/0.25;
g4=2*BmBa*LmBa/0.25;
g7=LmBa**2/0.25;
g2=BmBa*BmLa/0.25;
g5=(BmBa*LmLa+BmLa*LmBa)/0.25;
g8=LmBa*LmLa/0.25;
g3=BmLa**2/0.25;
g6=2*BmLa*LmLa/0.25;
g9=LmLa**2/0.25;
/*
p1=f(
p2=f(
p3=f(
p4=f(
p5=f(
p6=f(
p7=f(
p8=f(
p9=f(
*/;
MM,
MM,
MM,
Mm,
Mm,
Mm,
mm,
mm,
mm,
AA)
Aa)
aa)
AA)
Aa)
aa)
AA)
Aa)
aa)
mean=&lamda*(1-2*&Pm);
VBaBa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(2*&Pm-1)/(1-&pa)-2*&delta**2*&lamda**2/(1-&pa)**2;
VBaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+&delta*&lamda**2*(1-2*&pa)*(1-2*&pm)/(1-&pa)/&pa-&delta**2*&lamda**2*((1-&pa)**2+&pa**2)/(1-&pa)**2/&pa**2;
VLaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(1-2*&Pm)/&pa-2*&delta**2*&lamda**2/&pa**2;
mBaBa=mean+2*&delta*&lamda/(1-&pa);
mBaLa=mean-&delta*&lamda*(1-2*&pa)/(1-&pa)/&pa;
mLaLa=mean-2*&delta*&lamda/&pa;
do i=1 to &simu;
do j=1 to &sample;
call rantbl (sd0,p1,p2,p3,p4,p5,p6,p7,p8,p9,g);
call rannor (sd1,x);
if g=1 then do;
genotype='MM AA';genotype_M='MM';genotype_A='AA';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=2 then do;
genotype='MM Aa';genotype_M='MM';genotype_A='Aa';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=3 then do;
genotype='MM aa';genotype_M='MM';genotype_A='aa';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=4 then do;
genotype='Mm AA';genotype_M='Mm';genotype_A='AA';
phenotype=0+sqrt(&V)*x;
end;
else if g=5 then do;
genotype='Mm Aa';genotype_M='Mm';genotype_A='Aa';
phenotype=0+sqrt(&V)*x;
end;
else if g=6 then do;
genotype='Mm aa';genotype_M='Mm';genotype_A='aa';
phenotype=0+sqrt(&V)*x;
end;
else if g=7 then do;
genotype='mm AA';genotype_M='mm';genotype_A='AA';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
else if g=8 then do;
genotype='mm Aa';genotype_M='mm';genotype_A='Aa';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
else if g=9 then do;
genotype='mm aa';genotype_M='mm';genotype_A='aa';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
sim=i;
output;
end;
end;
run;
ods output bartlett=bart;
ods listing close;
/*ods output modelanova=anova_raw bartlett=bartlett_raw welch=welch_raw;*/
proc anova data=one outstat=anova_raw;
class genotype_A;
by sim;
model phenotype=genotype_A ;
means genotype_A /HOVTEST=BARTLETT ;
run;
ods listing;
/*ods output close;*/
data anova_raw;
set anova_raw;
where _type_^='ERROR';
if prob<=&alphalevel then reject=1;else reject=0;
run;
data bartlett_raw;
set bart;
if probchisq<=&alphalevel then reject=1;else reject=0;
run;
proc freq noprint data=anova_raw;
tables reject /out=anova;
run;
proc freq noprint data=bartlett_raw;
tables reject /out=bartlett;
run;
data anova;
set anova;
prob_anova=percent/100;if reject^=1 then delete;
keep prob_anova;
run;
data bartlett;
set bartlett;
prob_bartlett=percent/100;if reject^=1 then delete;
keep prob_bartlett;
run;
data result;
merge anova bartlett;
delta=&delta;lamda=&lamda;v=&v;Pm=&pm;Pa=&pa;Sample=&sample;Sim=&simu;
keep delta lamda v pm pa sample prob_anova prob_bartlett sim;
label prob_anova='Power of Anova'
prob_bartlett='Power of Bartlett’s test'
Sim='Simulation Times';
file 'C:\Personal Folder\power.txt' mod;
put delta lamda v pm pa sample prob_anova prob_bartlett sim;
run;
proc print data=result;
var delta lamda v pm pa sample prob_anova prob_bartlett sim;
run;
quit;
%mend sim;
%sim (delta=0.0625,lamda=1,v=1,Pm=0.5,Pa=0.5,sample=200,alphalevel=0.05);
axis1 offset=(11,11);
symbol1 color=red
interpol=none
value=dot
height=0.5;
proc gplot data=one; plot phenotype*genotype/
haxis=axis1 hminor=2
vaxis=axis2 vminor=1;
run;
proc boxplot data=one;
plot phenotype*genotype/
haxis=axis1 hminor=2
vaxis=axis2 vminor=1;
run;
proc capability data=one noprint;
spec lsl=6.8 llsl=2 clsl=black;
cdf phenotype / cframe = ligr
legend = legend2;
run;
**** Draw plot by AA Aa aa****;
proc sort data=one;
by genotype_A;
run;
proc univariate data=one;
by genotype_A;
var phenotype;
probplot phenotype;
histogram phenotype /normal(noprint)
outhistogram=raw_graph
midpoints=-10 to 10 by 0.2;
run;
/*proc print data=raw_graph; run;
data graph;
set raw_graph;
if genotype='Aa' then _obspct_=_obspct_+50;
else if genotype='AA' then _obspct_=_obspct_+100;
run;
*/
proc sort data=raw_graph;
by genotype_A descending _midpt_;
run;
symbol1 color=red
interpol=join
value=dot
height=0.5;
symbol2 color=blue
interpol=join
value=star
height=0.5;
symbol3 color=yellow
interpol=join
value=diamond
height=0.5;
proc gplot data=raw_graph;
plot _obspct_*_midpt_=genotype_A;
run;
**** Draw plot by MM Mm mm****;
proc sort data=one;
by genotype_M;
run;
proc univariate data=one;
by genotype_M;
var phenotype;
probplot phenotype;
histogram phenotype /normal(noprint)
outhistogram=raw_graph
midpoints=-10 to 10 by 0.2;
run;
/*proc print data=raw_graph; run;
data graph;
set raw_graph;
if genotype='Aa' then _obspct_=_obspct_+50;
else if genotype='AA' then _obspct_=_obspct_+100;
run;
*/
proc sort data=raw_graph;
by genotype_M descending _midpt_;
run;
symbol1 color=red
interpol=join
value=dot
height=0.5;
symbol2 color=blue
interpol=join
value=star
height=0.5;
symbol3 color=yellow
interpol=join
value=diamond
height=0.5;
proc gplot data=raw_graph;
plot _obspct_*_midpt_=genotype_M;
run;
quit;
Appendix 3: SAS code of Power comparison of ANOVA and Bartlett test
SAS macros are presented for a Power comparison. (Chapter 3)
options nosource nodate;
ods trace off;
%macro sim(delta=0.1,v=1,sample=200,alphalevel=0.05,seed_0=0,seed_1=0,simu=1000);
data one;
retain sd0-sd1 (&seed_0 &seed_1);
do i=1 to &simu;
lamda=-1+2*ranuni(0);
pm=ranuni(0);
pa=ranuni(0);
%let pm=pm;
%let pa=pa;
%let lamda=lamda;
BmBm=(1-pm)**2;BmLm=2*(1-pm)*pm;LmLm=(pm)**2;
BaBa=(1-pa)**2;BaLa=2*(1-pa)*pm;LaLa=(pa)**2;
BmBa=(1-pm)*(1-pa)+&delta;
BmLa=(1-pm)*pa-&delta;
LmBa=pm*(1-pa)-&delta;
LmLa=pm*pa+&delta;
p1=BmBa**2;
p2=2*BmBa*BmLa;
p3=BmLa**2;
p4=2*BmBa*LmBa;
p5=2*BmBa*LmLa+2*BmLa*LmBa;
p6=2*BmLa*LmLa;
p7=LmBa**2;
p8=2*LmBa*LmLa;
p9=LmLa**2;
p=p1+p2+p3+p4+p5+p6+p7+p8+p9;
/*
p1=f(
p2=f(
p3=f(
p4=f(
p5=f(
p6=f(
p7=f(
p8=f(
p9=f(
*/
MM,
MM,
MM,
Mm,
Mm,
Mm,
mm,
mm,
mm,
AA)
Aa)
aa)
AA)
Aa)
aa)
AA)
Aa)
aa)
mean=&lamda*(1-2*Pm);
VBaBa=&v+2*&lamda**2*(1-&Pm)*Pm+2*&delta*&lamda**2*(2*Pm-1)/(1-pa)-2*&delta**2*&lamda**2/(1-pa)**2;
VBaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+&delta*&lamda**2*(1-2*&pa)*(1-2*&pm)/(1-&pa)/&pa-&delta**2*&lamda**2*((1-&pa)**2+&pa**2)/(1-&pa)**2/&pa**2;
VLaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(1-2*&Pm)/&pa-2*&delta**2*&lamda**2/&pa**2;
mBaBa=mean+2*&delta*&lamda/(1-&pa);
mBaLa=mean-&delta*&lamda*(1-2*&pa)/(1-&pa)/&pa;
mLaLa=mean-2*&delta*&lamda/&pa;
do j=1 to &sample;
call rantbl (sd0,p1,p2,p3,p4,p5,p6,p7,p8,p9,g);
call rannor (sd1,x);
if g=1 then do;
genotype='MM AA';genotype_M='MM';genotype_A='AA';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=2 then do;
genotype='MM Aa';genotype_M='MM';genotype_A='Aa';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=3 then do;
genotype='MM aa';genotype_M='MM';genotype_A='aa';
phenotype=&lamda+sqrt(&V)*x;
end;
else if g=4 then do;
genotype='Mm AA';genotype_M='Mm';genotype_A='AA';
phenotype=0+sqrt(&V)*x;
end;
else if g=5 then do;
genotype='Mm Aa';genotype_M='Mm';genotype_A='Aa';
phenotype=0+sqrt(&V)*x;
end;
else if g=6 then do;
genotype='Mm aa';genotype_M='Mm';genotype_A='aa';
phenotype=0+sqrt(&V)*x;
end;
else if g=7 then do;
genotype='mm AA';genotype_M='mm';genotype_A='AA';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
else if g=8 then do;
genotype='mm Aa';genotype_M='mm';genotype_A='Aa';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
else if g=9 then do;
genotype='mm aa';genotype_M='mm';genotype_A='aa';
phenotype=(-&lamda)+sqrt(&V)*x;
end;
sim=i;
output;
end;
end;
run;
ODS SELECT NONE;
ods output bartlett=bart;
proc anova data=one outstat=anova_raw;
class genotype_A;
by sim;
model phenotype=genotype_A ;
means genotype_A /HOVTEST=BARTLETT ;
run;
/*ods output close;*/
data anova_raw;
set anova_raw;
where _type_^='ERROR';
if prob<=&alphalevel then reject=1;else reject=0;
run;
data bartlett_raw;
set bart;
if probchisq<=&alphalevel then reject=1;else reject=0;
run;
proc freq noprint data=anova_raw;
tables reject /out=anova;
run;
proc freq noprint data=bartlett_raw;
tables reject /out=bartlett;
run;
data anova;
set anova;
prob_anova=percent/100;if reject^=1 then delete;
keep prob_anova;
run;
data bartlett;
set bartlett;
prob_bartlett=percent/100;if reject^=1 then delete;
keep prob_bartlett;
run;
data result;
merge anova bartlett;
delta=&delta;theata=&delta;v=&v;Sample=&sample;Sim=&simu;
keep delta theata v sample prob_anova prob_bartlett sim;
label prob_anova='Power of Anova'
prob_bartlett='Power of Bartlett’s test'
Sim='Simulation Times';run;
quit;
%mend sim;
data power;
run;
%macro power(theata_start=0, theata_end=0.2,
sample=100,alphalevel=0.05,seed_0=0,simu=1000,devide=10);
%do i=1 %to &devide;
%let theata=&theata_start + (&i-1)*(&theata_end-&theata_start)/&devide;
%sim (delta=&theata,v=1,sample=&sample,alphalevel=&alphalevel,simu=&simu);
data power;
set power result;
if theata=. then delete;
run;
%end;
%mend power;
%power(theata_start=0, theata_end=0.2,
sample=200,alphalevel=0.05,seed_0=0,simu=400,devide=20);
data power;
set power;
label prob_anova='Rejecting delta=0 using ANOVA'
      prob_bartlett='Rejecting delta=0 using Bartlett’s test';
run;
symbol1 interpol=join
value=dot
height=1
width=2
cv=red
CI=red;
symbol2 interpol=join
value=dot
height=1
width=2
cv=blue
CI=blue;
legend1 label=none
shape=symbol(5,1)
position=(top center inside)
mode=share;
axis1 lable=('Power');
proc gplot data=power;
plot prob_anova*delta
prob_bartlett*delta /overlay legend=legend1 vaxis=axis1;
run;
Appendix 4: Derivation of Expectation and Variance of Siegmund’s T-test
The derivations of expectation and variance of Siegmund’s T-test is given. The technique used is
the Δ-method. (Chapter 4)
•
ٛ Expectation of T
T=(
2nn3
2nn1
+
− n) / n
n + n1 − n3 n + n3 − n1
n3
n1
+
− n] / n }
(2n1 + n 2 ) / 2n (2n3 + n2 ) / 2n
2nE (n1 )
2nE (n3 )
=(
+
− n) / n
n + E (n1 ) − E (n3 ) n + E (n3 ) − E (n1 )
E (T) = E{[
=(
2n * n * ( p 2 + θpq )
2n * n * (q 2 + θpq )
+
− n) / n
n + n * ( p 2 + θpq ) − n * (q 2 + θpq ) n + n * (q 2 + θpq ) − n * ( p 2 + θpq )
E (n1 ) = n(θ (1 − p ) p + p 2 )
E (n2 ) = 2 p (1 − p )(1 − θ )
E (n3 ) = n((1 − p ) 2 + θ (1 − p ) p )
E (T) =
=
n(θ (1 − p ) p + p 2 )
{
+
[2(n(θ (1 − p ) p + p 2 )) + 2 p(1 − p )(1 − θ )] / 2n
n((1 − p) 2 + θ (1 − p ) p)
− n} / n
[2(n((1 − p) 2 + θ (1 − p) p)) + 2 p (1 − p)(1 − θ )] / 2n
1
2n 2 ((1 − p ) 2 + θ (1 − p ) p )
(
+
2
2
n n + n((1 − p ) + θ (1 − p ) p ) − n(θ (1 − p ) p + p )
2n 2 (θ (1 − p ) p + p 2 )
− n)
n − n((1 − p ) 2 + θ (1 − p) p) + n(θ (1 − p ) p + p 2 )
After simplification,Î E(T) =
n *θ
• Variance of T
By employing Delta-Method, we could find the variance of T
E (n1 ) = n * ( p 2 + θpq )
E (n3 ) = n * (q 2 + θpq)
V (n1 ) = n * ( p 2 + θpq)(1 − p 2 − θpq)
V (n3 ) = n * (q 2 + θpq)(1 − q 2 − θpq)
Cov(n1 , n3 ) = - n * ( p 2 + θpq)(q 2 + θpq )
2n(n − E (n3 ))
2n * E (n3 )
+
2
(n + E (n1 ) − E (n3 )) (n + E (n3 ) − E (n1 )) 2
2n(n − E (n1 ))
2n * E (n1 )
b=
+
2
(n + E (n3 ) − E (n1 )) (n + E (n1 ) − E (n3 )) 2
a=
V (T ) = (V (n1 ) * a 2 + V (n3 ) * b 2 + 2Cov(n1 , n3 ) * a * b) / n
=
(1 − θ )(θ 2 (1 − 2 p ) 2 + 2( p − 1) p − θ (6 p 2 − 6 p + 2))
2( p − 1) p
Ef (n1 , n3 ) ≅ 0
∂f 2 n = En
∂f 2 n = En
) n13 = En13 + Var(n 3 )(
) 1 1 +
∂n1
∂n3 n3 = En3
∂f ∂f n = En
2Cov(n 1 , n 3 )(
)(
) 1 1)
∂n1 ∂n3 n3 = En3
f (n1 , n3 ) ~ N (0, Var(n1 )(
⎧V (n1 ) = n( p 2 + fpq)(1 − P 2 − fpq )
⎪
2
2
⎪V (n3 ) = n(q + fpq )(1 − P − fpq)
⎪
2
2
⎨Cov(n1 , n3 ) = − n( p + fpq )(q + fpq)
⎪
2
⎪ E (n1 ) = n( f (1 − p) p + p )
⎪ E (n2 ) = 2 p(1 − p)(1 − f )
⎩
2n ( n − n 3 )
2nn3
∂f
=
+
2
n1 (n + n1 − n3 )
(n + n1 − n3 ) 2
2n ( n − n3 )
2nn1
∂f
=
+
2
n3 (n + n3 − n1 )
(n + n1 − n3 ) 2
Î V(T) =
(1 − θ )(θ 2 (1 − 2 p ) 2 + 2( p − 1) p − θ (6 p 2 − 6 p + 2))
2( p − 1) p
Appendix 5: SAS code of sample size calculation of Wald’s Z test
SAS code for calculating sample size under the Wald’s Z test is given. (Chapter 4)
data sample_size;
do i=1 to 2;
select (i);
when (1) alpha=0.05;
when (2) alpha=0.01;
end;
do j=1 to 4;
select (j);
when (1) belta=.2;
when (2) belta=.5;
when (3) belta=.9;
when (4) belta=.95;
end;
do k=1 to 5;
Pa=k*0.1;
do l=1 to 12;
select (l);
when (1) f=.0001;
when (2) f=.0005;
when (3) f=.001;
when (4) f=.002;
when (5) f=.005;
when (6) f=.01;
when (7) f=.02;
when (8) f=.05;
when (9) f=.1;
when (10) f=.25;
when (11) f=.5;
when (12) f=1;
end;
n=(probit(1-alpha)+probit(belta))**2*((1-f)**2*(1-2*f)+f*(1-f)*(2-f)/(2*Pa*(1Pa)))/(f**2);
output;
end;
end;
end;
end;
run;
Appendix 6: SAS code of power comparison of Ward and Sing’s χ2 Test and
Wald’s Z test
SAS macros are developed. (Chapter 4)
options nosource nodate;
ods trace off;
proc printto;
%macro sim(f=0.2, sample=100,alphalevel=0.05,seed_0=0,simu=1000);
data temp_1;
do i=1 to &simu;
do j=1 to 3;
if j=1 then genotype='AA';
if j=2 then genotype='Aa';
if j=3 then genotype='aa';
sim=i;
output;
end;
end;
run;
data one;
retain sd0 (&seed_0);
do i=1 to &simu;
p=ranuni(0);
q=1-p;
f=&f;
p1=p**2+f*p*q;
p2=2*p*q*(1-f);
p3=q**2+f*p*q;
do j=1 to &sample;
g=rantbl(0,p1,p2,p3);
if g=1 then do;
genotype='AA';
end;
else if g=2 then do;
genotype='Aa';
end;
else if g=3 then do;
genotype='aa';
end;
sim=i;
output;
end;
end;
run;
proc freq data=one noprint;
table genotype / out=FreqCnt ;
by sim;
run;
data freqcnt(drop=i j);
merge temp_1 freqcnt;
by sim genotype;
if count=. then count=0;
run;
proc transpose data=freqcnt out=T prefix=n;
by sim;
var count;
run;
data T;
set T;
n=n1+n2+n3;
f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
p_e=(2*n1+n2)/2/n;
V_f_e=1/n*(1-f_e)**2*(1-2*f_e) +f_e*(1-f_e)*(2-f_e)/(2*n*p_e*(1-p_e));
z=f_e/(V_f_e)**0.5;
if abs(z)>-probit(&alphalevel/2) then reject_z='Yes' ;
else if z=. then reject_z='N/A';
else reject_z='No ';
chisq=(n1-n*p_e**2)**2/(n*p_e**2)+(n2-n*2*p_e*(1-p_e))**2/(n*2*p_e*(1-p_e))+(n3n*(1-p_e)**2)**2/(n*(1-p_e)**2);
if chisq>cinv(1-&alphalevel,1) then reject_chi='Yes';
else if chisq=. then reject_chi='N/A';
else reject_chi='No ';
f=&f; sample=&sample;
run;
proc freq data=t noprint;
table reject_z /out=freq_z;
table reject_chi / out=freq_chi;
run;
data freq_z1;
set freq_z;
rename percent=power_z;
theata=&f;
if reject_z^='Yes' then delete;
run;
data freq_chi1;
set freq_chi;
rename percent=power_chi;
theata=&f;
if reject_chi^='Yes' then delete;
run;
data freq;
merge freq_chi1 freq_z1;
keep power_chi Power_z theata;
run;
%mend sim;
%macro power(theata_start=0, theata_end=0.2,
sample=1000,alphalevel=0.05,seed_0=0,simu=100,devide=2);
%do i=1 %to &devide;
%let theata=&theata_start + (&i-1)*(&theata_end-&theata_start)/&devide;
%sim( f=&theata,sample=&sample,alphalevel=&alphalevel, simu=&simu);
data power;
set power freq;
if theata=. then delete;
run;
%end;
%mend power;
data power;
run;
%power(theata_start=0, theata_end=0.2,
sample=1000,alphalevel=0.05,simu=10000,devide=10);
data power;
set power;
label power_z='Rejecting theata=0 using Z-test'
      power_chi='Rejecting theata=0 using Chi-square';
run;
symbol1 interpol=join
value=dot
height=1
width=2
cv=red
CI=red;
symbol2 interpol=join
value=circle
height=1
width=2
cv=blue
CI=blue;
legend1 label=none
shape=symbol(5,1)
position=(top center inside)
mode=share;
axis1 label=('Power');
proc gplot data=power;
plot power_z*theata power_chi*theata /overlay legend=legend1 vaxis=axis1;
run;
quit;
title1 "Normal Q-Q Plot for Z's";
axis2 label=('Z');
proc capability data=T noprint;
qqplot Z / normal(mu=est sigma=est color=orange l=2 w=7)
square
vaxis=axis2;
histogram Z/ normal;
run;
Appendix 7: SAS code of Rao’s Homogeneity Test
The Q test for testing the homogeneity of several inbreeding coefficients stemming from 2x2 joint distributions is dealt with here. (Chapter 5)
options nosource nodate;
ods trace off;
%macro data(theta=,sample_z=,simu_z=);
data temp_1;
do i=1 to &simu_z;
do g=1 to 6;
select (g);
when (1) genotype='A1A1';
when (2) genotype='A1A2';
when (3) genotype='A1A3';
when (4) genotype='A2A2';
when (5) genotype='A2A3';
when (6) genotype='A3A3';
otherwise;
end;
simu_z=i;
output;
end;
end;
run;
data z_1;
do i=1 to &simu_z;
P1=ranuni(0);
P2=(1-P1)*ranuni(0);
P3=1-P1-P2;
t=&theta;
P11=(1-t)*(P1**2)+t*P1;
P12=(1-t)*p1*p2;
P13=(1-t)*p1*p3;
P22=(1-t)*(P2**2)+t*P2;
P23=(1-t)*p2*p3;
P33=(1-t)*(P3**2)+t*P3;
simu_z=i;
do j=1 to &sample_z;
g=rantbl(0,p11,2*p12,2*p13,p22,2*p23,p33);
select (g);
when (1) genotype='A1A1';
when (2) genotype='A1A2';
when (3) genotype='A1A3';
when (4) genotype='A2A2';
when (5) genotype='A2A3';
when (6) genotype='A3A3';
otherwise;
end;
output;
end;
end;
run;
proc freq data=z_1 noprint;
table genotype / out=FreqCnt ;
by simu_z;
run;
data freqcnt(drop=i g);
merge temp_1 freqcnt;
by simu_z genotype;
if count=. then count=0;
run;
proc transpose data=freqcnt out=T prefix=n;
by simu_z;
var count;
run;
data t;
set t;
rename n1=n11 n2=n12 n3=n13 n4=n22 n5=n23 n6=n33;n=&sample_z;
run;
%mend data;
%macro Z(data=,theta=,alphalevel=);
data z1;
set &data;
n1=n11;n2=n12+n13;n3=n22+n33+n23;
f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
p_e=(2*n1+n2)/2/n;
V_e=1/n*(1-f_e)**2*(1-2*f_e) +f_e*(1-f_e)*(2-1*f_e)/(2*n*p_e*(1-p_e));
rename f_e=f1_e V_e=V1_e;
run;
data z2;
set &data;
n1=n22;n2=n12+n23;n3=n11+n33+n13;
f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
p_e=(2*n1+n2)/2/n;
V_e=1/n*(1-f_e)**2*(1-2*f_e) +f_e*(1-f_e)*(2-1*f_e)/(2*n*p_e*(1-p_e));
rename f_e=f2_e V_e=V2_e;
run;
data z3;
set &data;
n1=n33;n2=n13+n23;n3=n11+n22+n23;
f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
p_e=(2*n1+n2)/2/n;
V_e=1/n*(1-f_e)**2*(1-2*f_e) +f_e*(1-f_e)*(2-1*f_e)/(2*n*p_e*(1-p_e));
rename f_e=f3_e V_e=V3_e;
run;
proc sql;
create table z as
select a.*, b.f2_e, b.v2_e,c.f3_e,c.v3_e
from z1 as a,z2 as b, z3 as c
where a.simu_z=b.simu_z=c.simu_z;
quit;
data Z;
set z;
v_e=1/(1/v1_e+1/v2_e+1/v3_e);f_e=(f1_e/v1_e+f2_e/v2_e+f3_e/v3_e)*V_e;
Z_test=f_e/(v_e)**0.5;
if abs(z_test)>-probit(&alphalevel/2) then reject_z='Yes' ;
else if z_test=. then reject_z='N/A';
else reject_z='No ';
H_test=(f1_e-f_e)**2/v1_e+(f2_e-f_e)**2/v2_e+(f3_e-f_e)**2/v3_e;
if H_test>cinv(1-&alphalevel,2) then reject_H='Yes';
else if H_test=. then reject_H='N/A';
else reject_H='No ';
run;
proc freq data=z noprint;
table reject_z /out=freq_z;
table reject_h / out=freq_h;
run;
data freq_z1;
set freq_z;
rename percent=power_z;
theta=&theta;
if reject_z^='Yes' then delete;
run;
data freq_h1;
set freq_h;
rename percent=power_h;
if reject_h^='No' then delete;
run;
data freq;
merge freq_h1 freq_z1;
keep power_h Power_z theta;
run;
%mend Z;
data power;
run;
%macro power(theta_start=0, theta_end=0.2,
sample=100,alphalevel=0.05,seed_0=0,simu=10000,
devide=100);
%do i=1 %to &devide;
%let theta=&theta_start + &i*(&theta_end-&theta_start)/&devide;
%data( theta=&theta,sample_z=&sample,simu_z=&simu);
%z(data=t,theta=&theta,alphalevel=&alphalevel);
data power;
set power freq;
if theta=. then delete;
run;
%end;
%mend power;
%power(theta_start=0, theta_end=0.2,
sample=100,alphalevel=0.05,simu=1000,devide=10);
data power;
set power;
label power_z='Rejecting theta=0 using Z-test'
      power_h='Accepting all thetas are equal using Chi-square';
run;
symbol1 interpol=join
value=dot
height=1
width=2
cv=red
CI=red;
symbol2 interpol=join
value=dot
height=1
width=2
cv=blue
CI=blue;
legend1 label=none
shape=symbol(5,1)
position=(top center inside)
mode=share;
axis1 label=('Power');
proc gplot data=power;
plot power_z*theta power_h*theta /overlay legend=legend1 vaxis=axis1;
run;
quit;
Appendix 8: Derivatives of the θ̂'s with respect to the genotype frequencies
The following expressions are useful in calculating Cov(θ̂1, θ̂2), Cov(θ̂1, θ̂3), and Cov(θ̂2, θ̂3). These expressions are used in Appendix 9. (Chapter 5)
[The explicit expressions for ∂θ̂r/∂nij, r = 1, 2, 3 and i ≤ j, and their values evaluated at the expected counts of statement A, appear as equation images in the original document and are not reproduced in this transcription.]
Appendix 9: Mathematica code for power and size computations of the Q-test
The iterative computations for finding the optimal estimate of the inbreeding coefficient, the derivatives for the variance-covariance terms, and the different choices of parameters needed for the simulations are developed here for the power and size computations of the Q-test. (Chapter 5)
[The Mathematica code appears as images in the original document and is not reproduced in this transcription.]