UNIVERSITY OF CINCINNATI

Date: August 6, 2007

I, Ran He, hereby submit this work as part of the requirements for the degree of Doctor of Philosophy in Environmental Health. It is entitled:

Some Statistical Aspects of Association Studies in Genetics and Tests of the Hardy-Weinberg Equilibrium

This work and its defense approved by:

Chair: Dr. Marepalli Rao
Dr. Ranajit Chakraborty
Dr. Ranjan Deka
Dr. Ning Wang

A dissertation submitted to the Division of Research and Advanced Studies of the University of Cincinnati in partial fulfillment of the requirements for the degree of Doctor of Philosophy (Ph.D.) in the Division of Epidemiology and Biostatistics, Department of Environmental Health, College of Medicine, 2007, by Ran He, B.S., Sichuan University, China, 2001.

Committee Chair: Dr. Marepalli Rao

Abstract

The applicability of a statistical method hinges on how far the assumptions for its validity are met. Some statistical tests are robust when assumptions are relaxed, while others are not. In the first part of the dissertation, we focus on exploring assumption violations in some statistical methods for genetic association studies and use simulations to test the robustness of these methods. In genetic studies, one of the major objectives is to apply statistical models to identify genes contributing to variation in specific quantitative traits. In order to correlate such quantitative phenotypes with underlying genotypes, the method of analysis of variance (ANOVA) is most commonly used. If the null hypothesis of equality of means is rejected, the implication is that the gene under investigation is associated with the phenotype. However, we show that this method raises a paradox by violating the assumptions of its validity. An alternative method, namely Bartlett's test, is available to overcome the paradox.
We compare the performances of the ANOVA test and Bartlett's test to answer the underlying question. Our study indicates that the ANOVA test works despite the violation of its validity assumptions.

In the second part of the dissertation, we focus on tests of the Hardy-Weinberg Equilibrium (HWE). In population genetics, HWE states that, under certain conditions, after one generation of random mating, genotype frequencies at a single gene locus will attain a particular set of equilibrium values. The most commonly used method for testing HWE is the goodness-of-fit Chi-squared statistic, which does not discriminate homozygote excess from heterozygote deficiency because of its two-sided nature. We propose alternative methods and use simulations to assess their power. The proposed methods are amenable to sample size calculations. We compare our sample size calculations with those available in the literature and find that ours are smaller. For more than two alleles, testing the HWE is computationally complex. We propose a new method of testing the HWE for multi-allelic cases by reducing the dimensionality of the problem. Mathematical, statistical, and computational aspects of the new method are set out in detail.

Acknowledgments

I would like to express my sincere gratitude to Dr. Marepalli Rao, my advisor, and Dr. Ranajit Chakraborty, director of the Center for Genome Information, for their inspiration, professional guidance, and support for my dissertation. I have greatly profited from their solid knowledge and great personalities. I am indebted to them for their constant encouragement and mentoring throughout my graduate studies. I would also like to thank Dr. Ranjan Deka and Dr. Ning Wang for serving on my committee, providing many insightful suggestions and comments, and discussing with me some of the difficult points in the dissertation.
I would also like to express my appreciation to the faculty and staff in the Department of Environmental Health, with whom I have had the good fortune to interact. Finally, I want to thank my family, who always encouraged me to succeed in achieving high goals.

Table of Contents

1 Introduction
2 Purpose, Hypotheses, Specific Aims, and Significance
   2.1 Purpose
   2.2 Research Hypotheses
   2.3 Specific Aims
   2.4 Significance
3 On testing that genotypes at a marker locus are associated with a given phenotype
   3.1 Background: Traditional Approach
   3.2 Statistical Methods
      3.2.1 Analysis of Variance (ANOVA)
      3.2.2 Bartlett's Test
      3.2.3 Linkage Disequilibrium Coefficient and Joint Distribution
      3.2.4 Joint Distribution of the Phenotype and Genotypes of G and G′
      3.2.5 Conditional Expectations and Variances
   3.3 Some Facts and Paradox
      3.3.1 Some Facts
      3.3.2 The Paradox
   3.4 Efficacy of ANOVA
   3.5 Power Comparison of ANOVA and Bartlett's Test
      3.5.1 Different choices of mean (λ)
      3.5.2 Conclusion
4 Hardy-Weinberg Equilibrium in the case of two alleles
   4.1 Introduction
      4.1.1 What is Hardy-Weinberg Equilibrium?
      4.1.2 Assumptions of HWE
      4.1.3 Departures from the Equilibrium
      4.1.4 Inbreeding Coefficient θ
   4.2 Properties of the Inbreeding Coefficient θ
      4.2.1 Formulation of the problem
      4.2.2 Bounds on θ
      4.2.3 Homozygote excesses and heterozygote deficiencies
   4.3 Maximum Likelihood Estimates
   4.4 Testing the Validity of HWE
      4.4.1 Hypothesis testing on θ
      4.4.2 A likelihood test of the null hypothesis
      4.4.3 Siegmund's T-Test
      4.4.4 χ²-Test
      4.4.5 Relationship between θ̂, Wald's Z-test, Siegmund's T-test, and the χ²-test
   4.5 Advantages of Wald's Z-test or Siegmund's T-test
   4.6 Sample Size Calculation
      4.6.1 Sample size calculation based on the Z-test or Siegmund's T-test
      4.6.2 Sample size calculation based on Ward and Sing's χ²-test
      4.6.3 Power comparison between the T and χ² tests via simulations
   4.7 Conclusion
5 Hardy-Weinberg Equilibrium in the case of three alleles
   5.1 Introduction
   5.2 Joint distribution of genotypes
      5.2.1 Parameter spaces
      5.2.2 Bounds on θ
      5.2.3 Biological scenario
   5.3 Structure of the case of 3 alleles: data and likelihood
      5.3.1 Structure of the case of 3 alleles: data
      5.3.2 Maximum likelihood estimators
   5.4 Joint distribution of the type Ωθ and connection to lower-dimensional joint distributions
      5.4.1 The case of A1 vs. (not A1)
      5.4.2 The case of A2 vs. (not A2)
      5.4.3 The case of A3 vs. (not A3)
   5.5 Estimation of the inbreeding coefficient and hypothesis testing
      5.5.1 Estimation of the inbreeding coefficient in a model of the type Ωθ
      5.5.2 Testing that the joint distribution of the alleles is of the type Ωθ
   5.6 Conclusions
6 Generalization to multiple alleles
   6.1 Formulation of the problem
   6.2 Data and Likelihood
      6.2.1 Data structure
      6.2.2 Maximum likelihood estimators
   6.3 Lower-dimensional joint distributions
7 Conclusions and Future Research
Bibliographic References

List of Appendices

Appendix 1: Derivation of Conditional Expectations and Variances
Appendix 2: SAS code for different scenarios of the ANOVA test
Appendix 3: SAS code for the power comparison of the ANOVA and Bartlett tests
Appendix 4: Derivation of the Expectation and Variance of Siegmund's T-test
Appendix 5: SAS code for the sample size calculation of Wald's Z-test
Appendix 6: SAS code for the power comparison of Ward and Sing's χ² test and Wald's Z-test
Appendix 7: SAS code for Rao's Homogeneity Test
Appendix 8: Derivatives of the θ̂'s with respect to frequencies
Appendix 9: Mathematica code for power and size computations of the Q-test

List of Tables

Table 3-1 Joint distribution of the genotypes of G and G′
Table 3-2 Conditional distributions under Scenario 1
Table 3-3 Conditional distributions under Scenario 2
Table 3-4 Summarized results of simulations
Table 4-1 Punnett square for Hardy-Weinberg Equilibrium
Table 4-2 Genotype frequencies in the population
Table 4-3 Joint distribution with inbreeding coefficient θ
Table 4-4 Sample size n to achieve a specified power, 1−β, using Wald's Z-test for various values of allele frequency q, true inbreeding coefficient θ, and level α
Table 5-1 Joint distribution of genotypes
Table 5-2 Joint distribution of type Ωθ
Table 5-3 Joint distribution for genotypes under equilibrium (Ω0)
Table 5-4 Example: a distribution in Ω but not in Ω*
Table 5-5 Population subdivision with respect to tri-alleles
Table 5-6 Data on genotypes
Table 5-7 Joint distribution: A1 vs. (not A1)
Table 5-8 Joint distribution: A1 vs. (not A1) with inbreeding coefficient θ1
Table 5-9 Joint distribution: A2 vs. (not A2)
Table 5-10 Joint distribution: A2 vs. (not A2) with inbreeding coefficient θ2
Table 5-11 Joint distribution: A3 vs. (not A3)
Table 5-12 Joint distribution: A3 vs. (not A3) with inbreeding coefficient θ3
Table 6-1 General joint distribution of genotypes
Table 6-2 A joint distribution of type Ω*
Table 6-3 Joint distribution of genotypes under equilibrium (Ω0)
Table 6-4 Data collected for any multiple alleles
Table 6-5 2×2 distribution required for a test about the inbreeding coefficient

List of Figures

Figure 3-1 Common conditional pdf of P | G′ under Scenario 1
Figure 3-2 Common conditional pdf of P | G under Scenario 1
Figure 3-3 Conditional pdf of P | G′ under Scenario 2
Figure 3-4 Conditional pdf of P | G′ under Scenario 3
Figure 3-5 Conditional pdf of P | G′ under Scenario 4
Figure 3-6 Power comparison of ANOVA and Bartlett's test, when λ = 1
Figure 3-7 Power comparison of ANOVA and Bartlett's test, when λ = 50
Figure 4-1 Power comparison of Z and χ², when p = 0.5
Figure 4-2 Power comparison of Z and χ², when p = 0.2
Figure 4-3 Power comparison of Z and χ², when p = 0.05
Figure 4-4 Histogram and normal Q-Q plot for the Z's, when p = 0.5
Figure 4-5 Histogram and normal Q-Q plot for the Z's, when p = 0.2
Figure 4-6 Histogram and normal Q-Q plot for the Z's, when p = 0.05
Figure 5-1 Empirical power of the χ² test Q for testing H0: θ1 = θ2 = θ3 = θ = 0
Figure 5-2 Empirical size of the χ² test Q for testing H0: θ1 = θ2 = θ3 = θ (θ unknown)

1 Introduction

One of the important problems in genetic studies is to explore association between a gene and a quantitative phenotype, such as blood pressure, body mass index (BMI), and lipid levels in blood. A standard additive model is generally postulated to capture the connection between the genotypes and the phenotype. The model assumes a normal distribution for the phenotype under each genotype, with additive effects. The relevant question is whether a gene of interest is associated with the phenotype. Data collected on the phenotype of a random sample of subjects are classified according to the genotypes, and the ANOVA method is most commonly used. By comparing the mean phenotypic values across the different genotypic groups, the proportion of variance explained by the marker locus is examined. Rejection of the null hypothesis would lead us to believe that there is a statistically significant difference among these groups, implying that the genotypes are correlated with the given quantitative phenotype. This test is reasonable and the method is easy to use.
However, we note that the method of analysis of variance raises a paradox by violating the assumptions of its validity. For the validity of the ANOVA method, homogeneity of variances of the genotype populations is needed. We observe that homogeneity holds if and only if the population means are equal, which is precisely what the ANOVA method purports to test. This is a paradoxical situation. The motivation for this part of our work stems from the concern that if the test assumptions are violated, the test results may not be valid. Bartlett's test can be used to test homogeneity of variances. In Chapter 3, we compare the performance of the ANOVA method and Bartlett's test via simulations. Our conclusion is that the ANOVA method works despite the violation of the assumptions of its validity.

In Chapters 4 through 6, we focus on the assumption of the Hardy-Weinberg Equilibrium (HWE) used to describe genotype frequencies at autosomal codominant loci. In population genetics, the HWE or Hardy-Weinberg law, named after G. H. Hardy and W. Weinberg, states that, under certain conditions, after one generation of random mating, the genotype frequencies at a single gene locus will attain a particular set of equilibrium values. It also specifies that those equilibrium frequencies can be represented as a simple function of the allele frequencies at that locus.

In Chapter 4, we focus on testing the HWE in the bi-allelic case. The most commonly used entity for testing is the goodness-of-fit Chi-squared statistic, which does not discriminate homozygote excess from heterozygote deficiency because of its two-sided nature. We propose alternative methods for testing the HWE against a one-sided alternative. We use simulations to compare the power of the new test and the goodness-of-fit Chi-squared test. We also compare the sample sizes required to achieve a given power. Sample sizes calculated based on the new method are lower than those available in the literature.
For more than two alleles, testing the HWE raises severe computational problems. In Chapter 5, we propose a new method of testing the HWE following the dimensionality reduction principle. We set out in great detail the execution of the new method and its computational routines, and demonstrate its feasibility and effectiveness by simulations. In Chapter 6, the results and methods are extended to multi-allelic cases. As the number of alleles increases, so does the computational complexity; available computational power is adequate to cover cases with a reasonable number of alleles. In Chapter 7, we draw conclusions from the work presented. We also outline research problems which we wish to pursue in the future.

2 Purpose, Hypotheses, Specific Aims, and Significance

2.1 Purpose

There are two main goals pursued in this dissertation:

1. First, we establish that the assumptions needed for the commonly used ANOVA method, for testing whether a gene or genes are associated with a given phenotype, are not met, giving rise to a paradox. This motivated one of the goals of this dissertation: to scrutinize the violations and check the appropriateness of using ANOVA. Bartlett's test seems more appropriate to use in this context, even though the (marker) genotypic group populations are not normal; in fact, the populations are mixtures of normals. We compare the performances of Bartlett's test and the ANOVA procedure, using simulations under different choices of parameters (i.e., allele frequencies, allelic effect, linkage disequilibrium, type I error), to examine whether the ANOVA procedure still works even when the assumptions are violated, and how robust it is compared to Bartlett's test.

2. The second aim is to propose a new method to test the Hardy-Weinberg Equilibrium, since the commonly used Chi-squared statistic cannot discriminate homozygote excess from heterozygote deficiency. The new method is simple to use in the bi-allelic case.
For more than two alleles, testing the HWE raises severe computational problems. We detail a new technique reducing the multi-allelic problem to several bi-allelic problems. The mathematical and computational details have been set out.

2.2 Research Hypotheses

The research hypotheses to be examined in this dissertation are the following:

1. The goal is to test the hypothesis that genotypes are associated with a given quantitative phenotype. The method commonly used is ANOVA. We show that the assumptions for the validity of ANOVA are violated, giving rise to a paradox. Bartlett's test seems more appropriate. We hypothesize that the ANOVA procedure still works despite the violations.

2. On testing the Hardy-Weinberg Equilibrium in the case of bi-allelic genes, the chi-squared test cannot be used for one-sided alternatives. We propose a new method of testing which can accommodate one-sided alternatives. We hypothesize that the new method provides lower sample sizes than the chi-squared method.

3. Computational problems are insurmountable when testing the HWE in cases with more than two alleles. In the tri-allelic case, we propose a new method of testing the HWE. The new method reduces the tri-allelic problem to several bi-allelic problems. The research hypothesis is that it will work. We use simulations to examine the power of the new method.

4. The newly proposed method can be generalized to any case of multiple alleles.

2.3 Specific Aims

The specific aims of this dissertation are outlined below:

1. Under the additive model of allelic effects on a quantitative phenotype, the ANOVA method is commonly used to check the influence of the underlying gene on the phenotype. We want to examine whether the assumptions for the validity of the ANOVA method are met. If not, we propose an alternative method to answer the question of interest and compare its performance with that of the ANOVA method. This is pursued in Chapter 3.

2.
For testing the Hardy-Weinberg equilibrium in the two-allelic case, the goodness-of-fit Chi-squared statistic is commonly used. The Chi-squared statistic cannot be used to test one-sided alternatives. We want to explore whether a new test can be developed to achieve this objective. We propose such a new test and use it for power and sample size calculations. This is pursued in Chapter 4.

3. Testing for HWE in the tri-allelic case is computationally intractable. We want to explore ways of overcoming the computational complexities. We propose a new method for testing the HWE in this case, and we detail the new procedure for practical implementation. This is pursued in Chapter 5.

4. Generalize the testing procedure developed to any multi-allelic case. This is pursued in Chapter 6.

2.4 Significance

In genetic studies, one of the major objectives is to apply statistical models to assist in the identification of genes contributing to specific quantitative traits. The appropriateness of these methods depends on the validity of the assumptions needed to carry them out. The difficulty is that the violation is often not readily apparent (i.e., it is deeply buried and not detectable without extensive algebraic computations). Unavoidably, geneticists sometimes pick the most commonly used statistical method without checking its validity, which may bias or even jeopardize the integrity of the research. Some methods are robust when the assumptions are violated, while others are not. Therefore, checking the validity of the assumptions becomes critical for assuring the validity of the method used. In order to correlate a given quantitative phenotype with a gene, the method of analysis of variance (ANOVA) is most commonly used.
However, we have observed a paradox when applying the ANOVA method to test the null hypothesis H0: the genotypes of a gene G′ do not discriminate the phenotype P, which is equivalent to H0: Δ = 0, where Δ is the linkage disequilibrium between G′ and the causative gene locus (G) of the phenotype P. For the applicability of ANOVA for testing H0: Δ = 0, we need to assume homogeneity of variances of the phenotype P across all genotypes of G′ (i.e., the groups formed by the genotypes of G′ must have the same variance). However, ANOVA tests equality of means, and the assumption of homogeneity of variances holds only if Δ = 0, under which the means are also all equal, which is what we are testing. This is a paradoxical situation. Using simulations to compare its power with that of an alternative method, Bartlett's test, gives scientists a better idea of the performance of ANOVA. Research carried out on this problem should provide quantitative geneticists a good understanding of the ANOVA method vis-à-vis Bartlett's test in this context.

The second half of the dissertation focuses on tests of the Hardy-Weinberg Equilibrium (HWE). HWE is one of the most important assumptions to be checked in genetic analysis. In population genetics, the HWE states that, under certain conditions, after one generation of random mating, the genotype frequencies at a single gene locus will attain a set of specific equilibrium values. The most commonly used method for testing HWE is the Chi-squared statistic, which does not discriminate homozygote excess from heterozygote deficiency because of its two-sided nature. To address this deficiency, we propose an alternative method and use simulations to assess its power. For more than two alleles, testing the HWE is computationally complex. We propose a new method of testing the HWE for multi-allelic cases by reducing it to several bi-allelic cases.
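As a point of reference, the classical bi-allelic goodness-of-fit test discussed above can be sketched in a few lines. The genotype counts below are hypothetical and Python is used purely for illustration (the dissertation's own analyses were coded in SAS); the signed statistic based on the estimated inbreeding coefficient is one standard way to direct the test at homozygote excess versus heterozygote deficiency, and is not necessarily identical to the dissertation's proposed test.

```python
# Hypothetical genotype counts at a bi-allelic locus (illustration only).
from scipy import stats

n_MM, n_Mm, n_mm = 50, 30, 20
n = n_MM + n_Mm + n_mm
p = (2*n_MM + n_Mm) / (2*n)            # ML estimate of the frequency of allele M
q = 1 - p

observed = [n_MM, n_Mm, n_mm]
expected = [n*p*p, 2*n*p*q, n*q*q]     # HWE expectations p^2, 2pq, q^2 (times n)
chi2 = sum((o - e)**2 / e for o, e in zip(observed, expected))
p_two_sided = stats.chi2.sf(chi2, df=1)   # 1 df: 3 classes - 1 - 1 estimated parameter

# Estimated inbreeding coefficient: 1 - observed/expected heterozygosity.
theta_hat = 1 - n_Mm / (2*n*p*q)
z = n**0.5 * theta_hat                 # signed statistic; z**2 equals chi2 here
p_one_sided = stats.norm.sf(z)         # directed at homozygote excess (theta > 0)

print(round(chi2, 3), round(z**2, 3))  # → 11.605 11.605
```

Note that the chi-squared statistic equals n·θ̂², so it carries no information about the sign of θ̂; the signed version recovers the direction of the departure.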
Quantitative geneticists are expected to use our routines when examining issues surrounding the HWE.

3 On testing that genotypes at a marker locus are associated with a given phenotype

3.1 Background: Traditional Approach

Suppose that we are investigating the association between a particular quantitative phenotype and a gene. The focus of genotype-phenotype mapping is to identify whether the candidate gene or genes have some bearing on this given phenotype. Let P denote the given phenotype, which will be measured for each participant. In addition, a blood sample will be collected for determining the genotype of each participant in a randomly selected sample. It is believed that there is a gene G, bi-allelic with alleles M and m, which impacts the phenotype. The genotypes MM, Mm, and mm of the gene G influence the phenotype in the following sense:

P | G = MM ~ N(λ, σ^2); P | G = Mm ~ N(0, σ^2); P | G = mm ~ N(−λ, σ^2),

for some λ ≠ 0, where G stands for 'Genotype.' In the subpopulation of those with genotype MM, the phenotype P is normally distributed with mean λ and variance σ^2; in the subpopulation with genotype Mm, P is normally distributed with mean 0 and variance σ^2; and in the subpopulation with genotype mm, P is normally distributed with mean −λ and variance σ^2. This is essentially an additive model of allelic effects of the causative gene (G) on the phenotype P.

Let P_M^2, 2 P_M P_m, and P_m^2 be the relative frequencies of the genotypes MM, Mm, and mm, respectively, in the population, where the allele frequencies P_M and P_m satisfy the condition P_M + P_m = 1. We are assuming that the gene G is in Hardy-Weinberg equilibrium, with P_M and P_m representing the frequencies of alleles M and m at the causative locus G in the entire population (Li, 1976). Unconditionally, P has a distribution which is a mixture of normal distributions. More precisely,

P ~ P_M^2 N(λ, σ^2) + 2 P_M P_m N(0, σ^2) + P_m^2 N(−λ, σ^2).
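The additive model above is easy to simulate. The following sketch is purely illustrative, with all parameter values assumed, and is written in Python (the dissertation's own simulations used SAS); it draws genotypes from the HWE frequencies, draws phenotypes from the corresponding normals, and checks the moments of the resulting normal mixture. It also runs a one-way ANOVA across the genotype classes of the causative locus, which, as expected, rejects equality of means.

```python
# Illustrative simulation of the additive model (assumed parameter values).
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
p_M, lam, sigma, n = 0.6, 1.0, 1.0, 100_000   # assumed: allele freq, effect, sd, n
p_m = 1 - p_M

# Genotypes under HWE: frequencies P_M^2, 2 P_M P_m, P_m^2 (0=MM, 1=Mm, 2=mm).
geno = rng.choice(3, size=n, p=[p_M**2, 2*p_M*p_m, p_m**2])
means = np.array([lam, 0.0, -lam])
P = rng.normal(means[geno], sigma)            # phenotype draws

# Marginally P is the normal mixture above, with mean lam*(p_M - p_m) and
# variance sigma^2 + lam^2*[(p_M^2 + p_m^2) - (p_M - p_m)^2].
print(abs(P.mean() - lam*(p_M - p_m)) < 0.02)                                   # → True
print(abs(P.var() - (sigma**2 + lam**2*((p_M**2 + p_m**2) - (p_M - p_m)**2))) < 0.05)  # → True

# One-way ANOVA on the data classified by genotype rejects equality of means.
F, pval = stats.f_oneway(*(P[geno == g] for g in range(3)))
print(pval < 0.05)                                                              # → True
```

With λ on the order of σ and a large sample, the group means λ, 0, −λ are sharply separated, so the F-test rejects decisively.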
The joint distribution of P and G is:

f(P, MM) = P_M^2 N(λ, σ^2)
f(P, Mm) = 2 P_M P_m N(0, σ^2)
f(P, mm) = P_m^2 N(−λ, σ^2)

Suppose G′ is another gene at a chosen site of the genome, also bi-allelic, with alleles A and a. Since we do not yet know where G is truly located, we choose some gene that is probably located close to the true gene; we call this gene the "marker." The question of interest is whether or not the genotypes AA, Aa, and aa of the marker locus discriminate the phenotype. If we succeed in finding some association between the marker and the phenotype, we can do further analysis to locate the true gene more precisely. For this purpose, a sample of n individuals is selected, their phenotypes measured, and their genotypes determined. The phenotype data are classified according to genotypes. In the literature (Chakraborty, 1986), the ANOVA method on the one-way classified data is used to answer the question raised above. Accepting the null hypothesis of equal means is tantamount to declaring that G′ is not a gene that discriminates the phenotype.

We notice that the assumptions needed for the validity of ANOVA are not met, giving rise to a paradox. One of the goals of this dissertation is to check, via simulations, the power of traditional ANOVA when it is used to answer this question. We also compare the ANOVA method with Bartlett's test by checking their powers via simulations. A broad conclusion of the investigation is that the ANOVA method still works.

3.2 Statistical Methods

3.2.1 Analysis of Variance (ANOVA)

In general, experimenters assume that markers segregate randomly. Once the data are collected on each individual, statistical associations between the markers and quantitative traits are established through statistical approaches that range from simple techniques, such as analysis of variance (ANOVA), to models that include multiple markers and interactions.
The simpler statistical approaches tend to be methods of Quantitative Trait Locus (QTL) detection that assess differences in the phenotypic means of single-marker genotypic classes. The actual localization of a QTL relies on an estimated genetic map with known distances between markers, and on the evaluation of a likelihood function that is maximized over the established parameter space. Typically, the null hypothesis tested is that the mean of the trait value is independent of the genotype at a particular marker. The null hypothesis is rejected when the test statistic is larger than a critical value, and the implication is that the QTL is linked to the marker under investigation. Single-marker analyses investigate individual markers independently, without reference to their position or order. The data classified according to the genotypes of G′ have the following structure (the P_ij's stand for phenotypic values):

Genotype AA: P_11, P_12, ..., P_{1,n_1}
Genotype Aa: P_21, P_22, ..., P_{2,n_2}
Genotype aa: P_31, P_32, ..., P_{3,n_3}

n = total sample size = n_1 + n_2 + n_3

The ANOVA technique can be used to test the equality of the group means corresponding to the genotypes.

3.2.2 Bartlett's Test

As we shall see later, Bartlett's test can also be used to test homogeneity of means for the data set-up of Section 3.2.1. In this section, we provide a brief introduction to the Bartlett's test.

3.2.2.1 Introduction

Bartlett's test (Snedecor and Cochran, 1983) is used to test whether k normal populations have equal variances. Equality of variances across samples is called homoscedasticity or homogeneity of variances. Some statistical tests, for example the analysis of variance (ANOVA), assume that variances are equal across groups or samples. The Bartlett's test can be used to verify that assumption. Bartlett's test is sensitive to departures from normality. That is, if the samples come from non-normal distributions, then Bartlett's test may simply be testing for non-normality.
The Levene test (Milliken and Johnson, 1989) is an alternative to the Bartlett's test that is less sensitive to departures from normality. Here, we are investigating mixed normally distributed data and focusing only on the Bartlett's test.

3.2.2.2 Definition

The Bartlett's test statistic is designed to test for equality of variances across groups against the alternative that the variances are unequal for at least two groups. The hypotheses are:

H0: σ_1² = σ_2² = ... = σ_k²
Ha: σ_i² ≠ σ_j² for at least one pair (i, j)

The test statistic is given by:

T = [(N − k) ln s_p² − Σ_{i=1}^{k} (N_i − 1) ln s_i²] / [1 + (1/(3(k − 1))) (Σ_{i=1}^{k} 1/(N_i − 1) − 1/(N − k))]

In the above, s_i² is the sample variance of the ith group (i = 1, 2, ..., k), N is the total sample size, N_i is the sample size of the ith group, k is the number of groups, and s_p² is the pooled variance. The pooled variance is a weighted average of the group variances and is defined as:

s_p² = Σ_{i=1}^{k} (N_i − 1) s_i² / (N − k)

The variances are judged to be unequal if T > χ²_{α, k−1}, where χ²_{α, k−1} is the upper 100α percentile of the chi-squared distribution with k − 1 degrees of freedom at the significance level α. The above formula for the critical region follows the convention that χ²_α is the upper critical value and χ²_{1−α} is the lower critical value from the chi-squared distribution.

3.2.3 Linkage Disequilibrium Coefficient and Joint Distribution

Let Δ be the linkage disequilibrium coefficient between the genes G and G′. Since G is unknown, Δ is unknown.
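Before turning to the joint distribution, note that the Bartlett statistic of Section 3.2.2.2 can be computed directly from grouped data. The following Python sketch (with hypothetical group data; the dissertation's own simulations use SAS) implements the formula for T:

```python
import math

def bartlett_T(groups):
    """Bartlett's statistic T for homogeneity of variances across k groups."""
    k = len(groups)
    Ni = [len(g) for g in groups]
    N = sum(Ni)
    def samp_var(g):                       # unbiased sample variance s_i^2
        m = sum(g) / len(g)
        return sum((x - m) ** 2 for x in g) / (len(g) - 1)
    si2 = [samp_var(g) for g in groups]
    sp2 = sum((n - 1) * v for n, v in zip(Ni, si2)) / (N - k)   # pooled variance
    num = (N - k) * math.log(sp2) - sum((n - 1) * math.log(v) for n, v in zip(Ni, si2))
    den = 1.0 + (sum(1.0 / (n - 1) for n in Ni) - 1.0 / (N - k)) / (3.0 * (k - 1))
    return num / den

# hypothetical data for three genotype groups
groups = [[4.1, 5.2, 6.0, 5.5], [3.9, 4.0, 4.1, 4.2], [7.0, 2.0, 5.0, 9.0]]
T = bartlett_T(groups)
# reject H0 at alpha = 0.05 if T exceeds 5.991, the upper 5% point of chi-squared with k - 1 = 2 df
```

By Jensen's inequality the numerator is non-negative, so T ≥ 0, with T = 0 exactly when all group variances coincide.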
The joint distribution of the genotypes of G and G′ is given in the following table:

Table 3-1 Joint distribution of the genotypes of G and G′

G \ G′         AA               Aa                                aa               Marginal frequencies
MM             P_MA²            2 P_MA P_Ma                       P_Ma²            P_M²
Mm             2 P_MA P_mA      2 P_MA P_ma + 2 P_Ma P_mA         2 P_Ma P_ma      2 P_M P_m
mm             P_mA²            2 P_mA P_ma                       P_ma²            P_m²
Marginal       P_A²             2 P_A P_a                         P_a²             1
frequencies

Using the concept of haplotype frequencies, the entries in the joint distribution are defined by:

P_MA = P_M P_A + Δ;  P_Ma = P_M P_a − Δ;  P_mA = P_m P_A − Δ;  P_ma = P_m P_a + Δ,

where P_A and P_a are the frequencies of alleles A and a of the marker locus (G′) in the entire population. The conditional distribution of P given the genotypes of G and G′ depends only on the genotype of the true gene G. More precisely,

P | G = MM, G′ = AA ~ N(λ, σ²); P | G = MM, G′ = Aa ~ N(λ, σ²); P | G = MM, G′ = aa ~ N(λ, σ²);
P | G = Mm, G′ = AA ~ N(0, σ²); P | G = Mm, G′ = Aa ~ N(0, σ²); P | G = Mm, G′ = aa ~ N(0, σ²);
P | G = mm, G′ = AA ~ N(−λ, σ²); P | G = mm, G′ = Aa ~ N(−λ, σ²); P | G = mm, G′ = aa ~ N(−λ, σ²).
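The parameterization of Table 3-1 can be checked numerically: whatever the value of Δ within its admissible range, the row and column sums of the 3×3 genotype table must reproduce the single-locus HWE marginals. A sketch, using the Scenario 2 values Δ = 0.1875, P_M = P_A = 0.5 of Section 3.4 for illustration:

```python
def joint_genotype_table(p_M, p_A, delta):
    """3x3 joint distribution of genotypes at G (rows MM, Mm, mm) and
    G' (columns AA, Aa, aa), built from the haplotype frequencies of Table 3-1."""
    p_m, p_a = 1 - p_M, 1 - p_A
    P_MA = p_M * p_A + delta
    P_Ma = p_M * p_a - delta
    P_mA = p_m * p_A - delta
    P_ma = p_m * p_a + delta
    return [
        [P_MA**2,     2*P_MA*P_Ma,                P_Ma**2],
        [2*P_MA*P_mA, 2*P_MA*P_ma + 2*P_Ma*P_mA, 2*P_Ma*P_ma],
        [P_mA**2,     2*P_mA*P_ma,                P_ma**2],
    ]

tab = joint_genotype_table(0.5, 0.5, 0.1875)
row_marg = [sum(row) for row in tab]        # should equal P_M^2, 2 P_M P_m, P_m^2
col_marg = [sum(col) for col in zip(*tab)]  # should equal P_A^2, 2 P_A P_a, P_a^2
```

Linkage disequilibrium redistributes probability mass within the table without disturbing the HWE marginals at either locus, since P_MA + P_Ma = P_M, P_mA + P_ma = P_m, and similarly for the columns.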
3.2.4 Joint Distribution of the Phenotype and the Genotypes of G and G′

The joint distribution of the phenotype and the genotypes of G and G′ is given as follows:

f(P, MM, AA) = P_MA² N(λ, σ²)
f(P, MM, Aa) = 2 P_MA P_Ma N(λ, σ²)
f(P, MM, aa) = P_Ma² N(λ, σ²)
f(P, Mm, AA) = 2 P_MA P_mA N(0, σ²)
f(P, Mm, Aa) = (2 P_MA P_ma + 2 P_Ma P_mA) N(0, σ²)
f(P, Mm, aa) = 2 P_Ma P_ma N(0, σ²)
f(P, mm, AA) = P_mA² N(−λ, σ²)
f(P, mm, Aa) = 2 P_mA P_ma N(−λ, σ²)
f(P, mm, aa) = P_ma² N(−λ, σ²)

The joint pdf of the phenotype and the marker G′ is given by:

f(P, AA) = P_MA² N(λ, σ²) + 2 P_MA P_mA N(0, σ²) + P_mA² N(−λ, σ²)
f(P, Aa) = 2 P_MA P_Ma N(λ, σ²) + (2 P_MA P_ma + 2 P_Ma P_mA) N(0, σ²) + 2 P_mA P_ma N(−λ, σ²)
f(P, aa) = P_Ma² N(λ, σ²) + 2 P_Ma P_ma N(0, σ²) + P_ma² N(−λ, σ²)

3.2.5 Conditional Expectations and Variances

From these joint distributions, the following conditional distributions are derived:

(P | G′ = AA) = [P_MA² N(λ, σ²) + 2 P_MA P_mA N(0, σ²) + P_mA² N(−λ, σ²)] / P_A²
(P | G′ = Aa) = [P_MA P_Ma N(λ, σ²) + (P_MA P_ma + P_Ma P_mA) N(0, σ²) + P_mA P_ma N(−λ, σ²)] / (P_A P_a)
(P | G′ = aa) = [P_Ma² N(λ, σ²) + 2 P_Ma P_ma N(0, σ²) + P_ma² N(−λ, σ²)] / P_a²

The conditional expectations and variances are summarized below (the derivations are given in Appendix 1):

E(P | G′ = AA) = λ(P_M − P_m) + 2Δλ/P_A
Var(P | G′ = AA) = σ² + 2λ² P_M P_m + 2Δλ²(P_m − P_M)/P_A − 2Δ²λ²/P_A²

E(P | G′ = Aa) = λ(P_M − P_m) − Δλ(P_A − P_a)/(P_A P_a)
Var(P | G′ = Aa) = σ² + 2λ² P_M P_m + Δλ²(P_M − P_m)(P_A − P_a)/(P_A P_a) − Δ²λ²(P_A² + P_a²)/(P_A² P_a²)

E(P | G′ = aa) = λ(P_M − P_m) − 2Δλ/P_a
Var(P | G′ = aa) = σ² + 2λ² P_M P_m + 2Δλ²(P_M − P_m)/P_a − 2Δ²λ²/P_a²

3.3 Some Facts and a Paradox

3.3.1 Some Facts

1. If Δ ≠ 0, the conditional means differ, and in general so do the conditional variances. The genotypes of G′ do discriminate the phenotype P.
2. If Δ = 0, all the conditional expectations and all the conditional variances are equal.
The genotypes of G′ do not discriminate the phenotype P.
3. If P_A = 0.5 and P_M = 0.5, all the conditional variances are equal.
4. Each conditional distribution of P given a genotype of G′ is a mixed normal.

3.3.2 The Paradox

We formulate the null hypothesis H0: the genotypes of G′ do not discriminate the phenotype P, which is equivalent to H0: Δ = 0. For the applicability of ANOVA for testing H0: Δ = 0, we need to assume homogeneity of variances (i.e., the groups formed by the genotypes of G′ have to have the same variance). The assumption of homogeneity of variances holds if Δ = 0, which in turn implies that the means are all equal, which is precisely what we are testing using the ANOVA method. This is a paradoxical situation.

3.4 Efficacy of ANOVA

Despite the paradoxical situation outlined above, in this section we examine to what extent an application of ANOVA to the one-way classified data achieves the objectives. This study is conducted by resorting to simulations. The SAS code of the simulations is presented in Appendix 2.

Scenario 1. The first situation we considered has the following specifications for the parameters involved:

%sim (Δ=0, λ=1, σ2=1, Pm=0.5, Pa=0.5, sample=200, alphalevel=0.05);

Under this scenario, the conditional distributions are tabulated below:

Table 3-2 Conditional distributions under Scenario 1

Conditional distribution                                  Mean    Variance
1. P | G′ = AA   ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)          0       1.5
2. P | G′ = Aa   ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)          0       1.5
3. P | G′ = aa   ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1)          0       1.5
4. P | G = MM    N(1,1)                                    1       1
5. P | G = Mm    N(0,1)                                    0       1
6. P | G = mm    N(−1,1)                                  −1       1

The common conditional probability density function under distributions 1, 2, and 3 above is plotted in Figure 3-1.

Figure 3-1 Common conditional pdf of P | G′ under Scenario 1. Distribution: ¼ N(−1,1) + ½ N(0,1) + ¼ N(1,1)

This plot does not show trimodality as expected, because the assumed variance is as large as the allelic effect (i.e., σ² = 1 and λ = 1).
In consequence, the three modes are smoothed out. If we assume a smaller variance, for example σ² = 0.4, and keep λ the same (λ = 1), the conditional distribution shows three modes:

Distribution: ¼ N(−1,0.4) + ½ N(0,0.4) + ¼ N(1,0.4)

For comparison, when we choose σ² = 0.5, the modes are smoothed out:

Distribution: ¼ N(−1,0.5) + ½ N(0,0.5) + ¼ N(1,0.5)

The conditional probability density functions under distributions 4, 5, and 6 of Table 3-2 are plotted in Figure 3-2.

Figure 3-2 Conditional pdfs of P | G under Scenario 1. Distributions: N(−1,1); N(0,1); N(1,1)

We now focus on testing the hypothesis H0: Δ = 0, which is true under Scenario 1. We generate a random sample of size n = 200 from the following joint distribution of P and G′:

f(P, AA) = 1/16 N(1,1) + 1/8 N(0,1) + 1/16 N(−1,1)
f(P, Aa) = 1/8 N(1,1) + 1/4 N(0,1) + 1/8 N(−1,1)
f(P, aa) = 1/16 N(1,1) + 1/8 N(0,1) + 1/16 N(−1,1),  −∞ < P < ∞

Simulations are conducted according to the following steps.

Step 1. Draw a random sample of size 200 from the following trinomial distribution:

G′:  AA   Aa   aa
Pr:  1/4  1/2  1/4

Let n_1, n_2, n_3 be the observed frequencies of the genotypes. Obviously, n_1 + n_2 + n_3 = 200.

Step 2. For G′ = AA, Aa, or aa alike, simulate the phenotype from the mixed normal distribution ¼ N(1,1) + ½ N(0,1) + ¼ N(−1,1) (the same mixture for every genotype, since Δ = 0).

Step 3. Arrange the genotype-phenotype data generated in Steps 1 and 2 in the following way:

Genotype AA: P_11, P_12, ..., P_{1,n_1}
Genotype Aa: P_21, P_22, ..., P_{2,n_2}
Genotype aa: P_31, P_32, ..., P_{3,n_3}

n = total sample size = n_1 + n_2 + n_3

Step 4. Carry out the analysis of variance and Bartlett's test on the phenotype data of Step 3. Use level α = 0.05. Note down whether or not H0 is rejected under each test.

Step 5. Repeat Steps 1, 2, 3, and 4 10,000 times.

Summary statistics of the simulations:

1. Empirical size under the ANOVA test = (number of times H0 is rejected)/10,000 = 0.052
2. Empirical size under the Bartlett's test = (number of times H0 is rejected)/10,000 = 0.059

The observed size under the Bartlett's test is slightly larger than that under the ANOVA method. However, neither observed size is significantly different from the nominal size of 0.05.

Scenario 2. We now look at a scenario in which the null hypothesis H0: Δ = 0 is not true. We have the following specifications for the parameters involved:

%sim (Δ=0.1875, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);

Note that

P_MA = P_M P_A + Δ = 0.4375
P_Ma = P_M P_a − Δ = 0.0625
P_mA = P_m P_A − Δ = 0.0625
P_ma = P_m P_a + Δ = 0.4375

The conditional distributions of P | G′ are given by:

(P | G′ = AA) = [P_MA² N(1,1) + 2 P_MA P_mA N(0,1) + P_mA² N(−1,1)] / P_A²
(P | G′ = Aa) = [P_MA P_Ma N(1,1) + (P_MA P_ma + P_Ma P_mA) N(0,1) + P_mA P_ma N(−1,1)] / (P_A P_a)
(P | G′ = aa) = [P_Ma² N(1,1) + 2 P_Ma P_ma N(0,1) + P_ma² N(−1,1)] / P_a²

The conditional means and variances are given by:

E(P | G′ = AA) = λ(P_M − P_m) + 2Δλ/P_A = 0.75
Var(P | G′ = AA) = σ² + 2λ² P_M P_m + 2Δλ²(P_m − P_M)/P_A − 2Δ²λ²/P_A² = 1.21875

E(P | G′ = Aa) = λ(P_M − P_m) − Δλ(P_A − P_a)/(P_A P_a) = 0
Var(P | G′ = Aa) = σ² + 2λ² P_M P_m + Δλ²(P_M − P_m)(P_A − P_a)/(P_A P_a) − Δ²λ²(P_A² + P_a²)/(P_A² P_a²) = 1.21875

E(P | G′ = aa) = λ(P_M − P_m) − 2Δλ/P_a = −0.75
Var(P | G′ = aa) = σ² + 2λ² P_M P_m + 2Δλ²(P_M − P_m)/P_a − 2Δ²λ²/P_a² = 1.21875

Under this scenario, the conditional distributions are tabulated below:

Table 3-3 Conditional distributions under Scenario 2

Conditional distribution                                                     Mean     Variance
1. P | G′ = AA   0.765625 N(1,1) + 0.21875 N(0,1) + 0.015625 N(−1,1)         0.75     1.21875
2. P | G′ = Aa   0.109375 N(1,1) + 0.78125 N(0,1) + 0.109375 N(−1,1)         0        1.21875
3. P | G′ = aa   0.015625 N(1,1) + 0.21875 N(0,1) + 0.765625 N(−1,1)        −0.75     1.21875
4. P | G = MM    N(1,1)                                                      1        1
5. P | G = Mm    N(0,1)                                                      0        1
6. P | G = mm    N(−1,1)                                                    −1        1

The conditional densities 1, 2, and 3 of Table 3-3 are plotted in Figure 3-3.
Figure 3-3 Conditional pdfs of P | G′ under Scenario 2. Distributions: 0.015625 N(1,1) + 0.21875 N(0,1) + 0.765625 N(−1,1); 0.109375 N(1,1) + 0.78125 N(0,1) + 0.109375 N(−1,1); 0.765625 N(1,1) + 0.21875 N(0,1) + 0.015625 N(−1,1)

As a contrast, the conditional densities 4, 5, and 6 of P | G were plotted in Figure 3-2. In this scenario, the null hypothesis H0: Δ = 0 is not true. A random sample of size n = 200 is generated from the joint distribution of P and G′ given by:

f(P, AA) = P_MA² N(1,1) + 2 P_MA P_mA N(0,1) + P_mA² N(−1,1)
f(P, Aa) = 2 P_MA P_Ma N(1,1) + (2 P_MA P_ma + 2 P_Ma P_mA) N(0,1) + 2 P_mA P_ma N(−1,1)
f(P, aa) = P_Ma² N(1,1) + 2 P_Ma P_ma N(0,1) + P_ma² N(−1,1)

With the protocol outlined under Scenario 1, we calculate the following entities:

1. Empirical power under the ANOVA test = (number of times H0 is rejected)/10,000 = 0.27
2. Empirical power under the Bartlett's test = (number of times H0 is rejected)/10,000 = 0.14

Scenario 3. We consider the following specifications for the parameters involved:

%sim (Δ=0.125, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);

Under this scenario, the conditional distributions are tabulated below:

Conditional distribution                                              Mean    Variance
1. P | G′ = AA   0.5625 N(1,1) + 0.375 N(0,1) + 0.0625 N(−1,1)        0.5     1.375
2. P | G′ = Aa   0.1875 N(1,1) + 0.625 N(0,1) + 0.1875 N(−1,1)        0       1.375
3. P | G′ = aa   0.0625 N(1,1) + 0.375 N(0,1) + 0.5625 N(−1,1)       −0.5     1.375

The graphs of the conditional distributions of P | G′ are given in Figure 3-4.

Figure 3-4 Conditional pdfs of P | G′ under Scenario 3. Distributions: 0.0625 N(1,1) + 0.375 N(0,1) + 0.5625 N(−1,1); 0.1875 N(1,1) + 0.625 N(0,1) + 0.1875 N(−1,1); 0.5625 N(1,1) + 0.375 N(0,1) + 0.0625 N(−1,1)

Simulations are conducted to calculate the empirical powers under ANOVA and Bartlett's test:

1. Empirical power under the ANOVA test = (number of times H0 is rejected)/10,000 = 0.225
2. Empirical power under the Bartlett's test = (number of times H0 is rejected)/10,000 = 0.10

Scenario 4. We consider the following specifications for the parameters involved:

%sim (Δ=0.0625, λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05);

Under this scenario, the conditional distributions are tabulated below:

Conditional distribution                                                     Mean     Variance
1. P | G′ = AA   0.390625 N(1,1) + 0.46875 N(0,1) + 0.140625 N(−1,1)         0.25     1.46875
2. P | G′ = Aa   0.234375 N(1,1) + 0.53125 N(0,1) + 0.234375 N(−1,1)         0        1.46875
3. P | G′ = aa   0.140625 N(1,1) + 0.46875 N(0,1) + 0.390625 N(−1,1)        −0.25     1.46875

The graphs of the conditional distributions of P | G′ are given in Figure 3-5.

Figure 3-5 Conditional pdfs of P | G′ under Scenario 4. Distributions: 0.140625 N(1,1) + 0.46875 N(0,1) + 0.390625 N(−1,1); 0.234375 N(1,1) + 0.53125 N(0,1) + 0.234375 N(−1,1); 0.390625 N(1,1) + 0.46875 N(0,1) + 0.140625 N(−1,1)

Simulations are conducted to calculate the empirical powers under ANOVA and Bartlett's test:

1. Empirical power under the ANOVA test = (number of times H0 is rejected)/10,000 = 0.194
2. Empirical power under the Bartlett's test = (number of times H0 is rejected)/10,000 = 0.086

The results of the simulations under each of Scenarios 1, 2, 3, and 4 are summarized below (10,000 simulations). Specifications (common parameters): λ=1, σ2=1, Pm=0.5, Pa=0.5, n=200, alphalevel=0.05.

Table 3-4 Summarized results of simulations

            Empirical power
Δ           ANOVA      Bartlett
0           0.052      0.059
0.0625      0.194      0.086
0.125       0.225      0.101
0.1875      0.273      0.142

(For Δ = 0, the entries are empirical sizes.)

Conclusions: We have considered four sets of parameter values in the simulations. In the set with Δ = 0, the observed sizes are not significantly different from the nominal size of 0.05. In all other cases, the observed powers of the two tests differ appreciably, and the ANOVA test is superior to the Bartlett's test. In all the scenarios, homogeneity of variances holds. The genotypic distributions are not normal but mixed normal.
The ANOVA procedure is robust to violation of the normality assumption, but the Bartlett's test is not.

3.5 Power Comparison of ANOVA and Bartlett's Test

Checking the conditional expectations and variances, we know that the assumption of homogeneity of variances needed to carry out ANOVA is violated. The normality assumption needed for the applicability of both the ANOVA and Bartlett tests is violated. We used simulations (10,000 replications) to compare the power of these two tests in general. Please refer to Appendix 3 for the SAS code.

Selection of parameter values for the simulations:

• Allele frequencies PM and PA are randomly selected: PM ∈ [0, 1], PA ∈ [0, 1]
• Disequilibrium values (Δ) are randomly selected: Δ ∈ [0, 0.2]
• Significance level = 5%

In Section 3.4, PM and PA were fixed at 0.5. Here we randomly select their values from 0 to 1. Recall that the genotypes MM, Mm, and mm of the gene G discriminate the phenotype in the following sense:

P | G = MM ~ N(λ, σ²); P | G = Mm ~ N(0, σ²); P | G = mm ~ N(−λ, σ²).

In addition, we need to choose λ, σ², and n in the simulations. We simulated many scenarios under different choices of parameter values for a comparison of powers. More specifically, we look at:

• same mean (λ), different variance (σ²), same sample size (n);
• different mean (λ), same variance (σ²), same sample size (n);
• same mean (λ), same variance (σ²), but different sample size (n).

The simulations showed that the power of both tests increases as the sample size (n) increases, especially when Δ is low (Δ ∈ [0, 0.2]), and that the power of both tests does not depend much on the variance (σ² ∈ [1, 50]); what matters most is the mean (λ). The performances of the two tests differ under different choices of the mean (λ). We report two special cases.

3.5.1 Different Choices of the Mean (λ)

1. Let λ = 1, n = 200, and σ² = 1. The powers of both tests are low (< 30%), but the ANOVA method has better power than the Bartlett's test. See Figure 3-6.
Figure 3-6 Power comparison of ANOVA and Bartlett's test when λ = 1

2. Let λ = 50, n = 200, and σ² = 1. When the mean (λ) gets larger, both tests have power between 0.6 and 0.75 for Δ ≥ 0.06. The ANOVA method is still better than the Bartlett's test, but the difference is not as wide. See Figure 3-7.

Figure 3-7 Power comparison of ANOVA and Bartlett's test when λ = 50

3.5.2 Conclusion

In testing whether genotypes at a marker locus are associated with a certain quantitative phenotype, the ANOVA method is widely used. However, we showed that the assumptions required for the validity of the ANOVA method are violated: each conditional distribution is a mixed normal distribution instead of a normal one, and the homogeneity of variances is violated. Homogeneity of variances holds if the disequilibrium (Δ) is zero or both alleles at the marker locus are equi-probable. If Δ = 0, all conditional means are equal, which is the null hypothesis we are testing. Such a paradox triggered our interest to question the appropriateness of using ANOVA. An alternative procedure for testing Δ = 0 is the Bartlett's test. The assumptions needed for the validity of the Bartlett's test are not valid either: the underlying genotypic populations are not normal but mixed normal. We therefore also examine the appropriateness of using the Bartlett's test. With 10,000 replications of the simulations, we compared the powers of both tests under a variety of specifications of parameter values. We found that, generally speaking, ANOVA performed better than the Bartlett's test. With higher values of the allelic effect (i.e., λ, which represents the mean of the underlying component normal distributions), the power of the Bartlett's test gradually approached that of ANOVA. However, the observed size of the Bartlett's test is higher than the nominal level, and this bias increases as λ increases.
Even though the homogeneity-of-variances and normality assumptions do not hold, so that the application of ANOVA is not completely correct from a statistical point of view, the ANOVA method is still robust, and its power is better than that of the Bartlett's test. Moreover, Bartlett's test is sensitive to departures from normality, and each conditional distribution here is a mixed normal rather than a normal distribution. If Δ is small, we need a very large sample for the ANOVA test to have decent power.

Is ANOVA intrinsically more powerful than the Bartlett's test? The answer is yes, because ANOVA is a test of centrality based on the first moment, while Bartlett's test is a test of equality of variances, which concerns the second moment. The sampling variance of the first moment is smaller than the sampling variance of the second moment. However, as our simulations show, for smaller Δ, Bartlett's test appears to have a higher power than ANOVA. This is so because with smaller Δ the distributions remain mixed normals, but the means of the component distributions are closer to one another, making their differences less detectable by ANOVA. Nevertheless, for such small Δ, the absolute power of either of the two procedures is small.

From Figure 3-6 and Figure 3-7, it appears that the Bartlett test is biased, i.e., it fails to achieve the nominal size. When simulations are conducted with PM ∈ [0, 1], PA ∈ [0, 1], and Δ ∈ [0, 0.2] randomly generated, certain configurations lead to negative values of PMA, PMa, PmA, or Pma. In such situations, data cannot be generated. In the program, whenever such a configuration arises, it is ignored. This may have some bearing on the bias of the Bartlett's test.

4 Hardy-Weinberg Equilibrium in the Case of Two Alleles

4.1 Introduction

4.1.1 What is Hardy-Weinberg Equilibrium?

Hardy-Weinberg Equilibrium is one of the most important assumptions to be checked in genetic analysis.
In population genetics, the Hardy-Weinberg equilibrium (HWE), or Hardy-Weinberg law, named after G. H. Hardy and W. Weinberg, states that, under certain conditions, after one generation of random mating, the genotype frequencies at a single gene locus will attain a particular set of equilibrium values. It also specifies that those equilibrium frequencies can be represented as simple functions of the allele frequencies at that locus.

We focus on a diploid organism and a specific gene locus with alleles A and a, where A is the dominant allele and a is the recessive allele for a certain trait. An intuitive question would be: "Will the dominant character eventually dominate the whole population?" Hardy and Weinberg discovered independently that such dominance will not arise from random mating. Let the population of organisms be infinite or very large, and let the frequencies (first generation) of the genotypes be given by

AA: r,  Aa: 2s,  aa: t,

with r, s, t ≥ 0 and r + 2s + t = 1. Suppose the population undergoes random mating. The genotype frequencies in the offspring population (second generation) are given by

AA: (r + s)²,  Aa: 2(r + s)(s + t),  aa: (s + t)².

An important question is when the genotype frequencies of the two generations are the same, i.e., r = (r + s)², 2s = 2(r + s)(s + t), and t = (s + t)². Note that

r = (r + s)² ⇔ r = r² + 2rs + s² ⇔ r = r(r + 2s) + s² ⇔ r = r(1 − t) + s² ⇔ s² = rt.

Thus the two sets of genotype frequencies are equal if and only if s² = rt. Interestingly, if we denote the genotype frequencies of the second generation by

r1 = (r + s)²,  2s1 = 2(r + s)(s + t),  t1 = (s + t)²,

it follows that s1² = r1 t1. If random mating occurs in the second generation, the genotype frequencies in the third generation are identical to those in the second generation. Under random mating, all subsequent generations will have genotype frequencies identical to those of the second generation. This is essentially the gist of the Hardy-Weinberg Law.
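The fixed-point argument above is easy to verify numerically: starting from arbitrary genotype frequencies, one round of random mating lands on frequencies satisfying s² = rt, and a second round leaves them unchanged. A small Python sketch (the starting frequencies are arbitrary and chosen for illustration):

```python
def random_mating(r, two_s, t):
    """One generation of random mating: map genotype frequencies
    (AA, Aa, aa) = (r, 2s, t) to ((r+s)^2, 2(r+s)(s+t), (s+t)^2)."""
    s = two_s / 2.0
    p, q = r + s, s + t        # allele frequencies of A and a
    return p * p, 2.0 * p * q, q * q

# arbitrary first-generation frequencies (not in equilibrium: s^2 != rt)
g1 = (0.5, 0.2, 0.3)
g2 = random_mating(*g1)        # second generation: now s^2 = rt holds
g3 = random_mating(*g2)        # third generation: identical to the second
```

After one generation the genotype frequencies are p², 2pq, q² with p = r + s and q = s + t, and they remain there under continued random mating.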
It should be noted that in the second generation, or for that matter in any subsequent generation, the allele frequencies are given by

A: r + s,  a: s + t.

In view of this observation, one can now say that the population is in Hardy-Weinberg equilibrium (HWE). More informally, if the allele frequencies are p and q, then the genotype frequencies are p², 2pq, and q², provided the population is in HWE.

4.1.2 Assumptions of HWE

The assumptions governing the Hardy-Weinberg equilibrium (HWE) are that the organism under consideration:

• is diploid, and the trait under consideration is not on a chromosome that has different copy numbers for different sexes, such as the X chromosome in humans (i.e., the trait is autosomal);
• is sexually reproducing, either monoecious or dioecious;
• has discrete generations;

in addition, the population under consideration is idealised, that is, it has:

• random mating within a single population;
• infinite population size (or sufficiently large so as to minimize the effect of genetic drift);

and it experiences:

• no selection;
• no mutation;
• no migration (gene flow).

The first group of assumptions is required for the mathematics involved. It is relatively easy to expand the definition of HWE to include modifications of these, such as for sex-linked traits. The other assumptions are inherent in the Hardy-Weinberg principle. A Hardy-Weinberg population is used as a reference population when discussing various genetic issues.

4.1.3 Departures from the Equilibrium

To analyze departures, we can look at populations to see whether they conform to these numerical patterns. If they differ, we seek the reasons for the difference in some violation of the Hardy-Weinberg assumptions. Two processes, natural selection and genetic drift, are the most common and important factors at work in most populations that are not at equilibrium. Inbreeding and nonrandom mating are also forces of departure.
For example, suppose we find a population in which the frequency of the recessive allele is declining over time. We might then investigate whether homozygous recessives are dying earlier. (Many genetic diseases, such as cystic fibrosis, are due to recessive alleles.) This could be due to natural selection, in which those that are better adapted to the environment survive longer and reproduce more frequently. Suppose we find a population in which there is a smaller-than-expected number of homozygotes of both types and a larger number of heterozygotes. This could be due to heterozygote superiority, where the heterozygote is more fit than either homozygote. In humans, this is the case for the allele causing sickle cell disease, a type of hemoglobinopathy.

Nonrandom mating is another potential source of departure from the Hardy-Weinberg equilibrium. Imagine that two alleles give rise to two very different appearances. Individuals may choose to mate with those whose appearance is closest to theirs. This may lead to divergence of the two groups over time and perhaps ultimately to a split into two distinct subpopulations. In very small populations, allele frequencies may change dramatically from one generation to the next, owing to the vagaries of mate choice or other random events. For instance, half a dozen individuals with the dominant allele may, by chance, have fewer offspring than half a dozen with the recessive allele. This would have comparatively little effect in a population of one thousand, but it could have a dramatic effect in a population of twenty. Such changes are known as genetic drift.

In this dissertation, we assume inbreeding is the only force for a departure from the Hardy-Weinberg Equilibrium.
4.1.4 Inbreeding Coefficient θ

Inbreeding depression was recognized early by plant and animal breeders (Wright 1977), in zoo populations (Ralls, Brugger and Ballou 1979; Senner 1980), and in the management and restocking of endangered populations in the wild.

Inbreeding

Inbreeding is defined as mating between related individuals. It is also called consanguinity, meaning "mixing of the blood." Although some plants successfully self-fertilize (the most extreme case of inbreeding), biological mechanisms are in place in many organisms, from fungi to humans, to encourage cross-fertilization. In human populations, customs and laws in many countries have been developed to prevent marriages between closely related individuals (e.g., siblings and first cousins). Despite these proscriptions, genetic counselors are frequently presented with the question "If I marry my cousin, what are the chances that we will have a baby who has a disease?" The answer is that when two partners are related, their chance of having a baby with a disease or birth defect is higher than the background risk in the general population.

Increased Disease Risk

Many genetic diseases are recessive, meaning that only people who inherit two disease alleles develop the disease. Many of us carry single alleles for several genetic diseases. Since close relatives have more genes in common than unrelated individuals, there is an increased chance that parents who are closely related will carry the same disease alleles and thus have a child who is homozygous for a recessive disease. For instance, cousins share approximately one-eighth, or 12.5 percent, of their alleles. So, at any locus the chance that cousins share an allele inherited from a common ancestor is one-eighth. The chance that their offspring will inherit this allele from both parents, if each parent has one copy of the allele, is one-fourth. Thus, the risk that the offspring will inherit two copies of the same allele is 1/8 × 1/4, or 1/32, about 3 percent.
If this allele is deleterious, then the homozygous child will be affected by the disease. Overall, the risk of having a child affected with a recessive disease as a result of a first-cousin mating is approximately 3 percent, in addition to the background risk of 3 to 4 percent that all couples face.

Detecting Inbreeding

Unfortunately, inbreeding cannot usually be detected from pedigrees directly, because pedigrees are not generally available for individuals in these populations. However, the inbreeding coefficient (θ) can be measured indirectly from genotypic data. This is the probability that two genes at any locus in one individual are identical by descent (i.e., have been inherited from a common ancestor). The more closely related the parents are, the larger the value of θ. For example, the coefficient of inbreeding for an offspring of two siblings is one-fourth (0.25); for an offspring of two half-siblings it is one-eighth (0.125); and for an offspring of two first cousins it is one-sixteenth (0.0625). (This is a different calculation from the calculation of shared alleles between cousins, above.)

In general, inbreeding in human populations is rare. The average inbreeding coefficient is 0.03 for the Dunker population in Pennsylvania and 0.04 for islanders on Tristan da Cunha; inbreeding occurs in both of those populations. Some isolated populations actively avoid inbreeding and have maintained low average inbreeding coefficients even though they are small. For example, polar Eskimos have an average inbreeding coefficient of less than 0.003.

Beneficial changes can also come from inbreeding, and inbreeding is practiced routinely in animal breeding to enhance specific characteristics, such as milk production or low fat-to-muscle ratios in cows. However, there can often be deleterious effects of such selective breeding when genes controlling unselected traits are affected as well.
Generations of inbreeding decrease genetic diversity, and this can be problematic for a species. Some endangered species, whose mating groups have been reduced to very small numbers, are losing important diversity as a result of inbreeding.

4.2 Properties of the Inbreeding Coefficient θ

4.2.1 Formulation of the Problem

We first consider the case of a single locus with two alleles and then consider a single locus with three alleles. We cannot examine the entire population to check the equilibrium law, so we take a random sample of n individuals from the population. At a single autosomal locus with two alleles, a diploid can have one of three genotypes: AA, Aa, and aa. Let n1, n2, and n3 be the frequencies of the genotypes AA, Aa, and aa in the sample. We need to formulate the equilibrium law as a hypothesis to be tested using the data collected.

Consider two alleles, A and a, and let p and q, respectively, be their frequencies, which are unknown in the population. A Punnett square (Table 4-1) can be used to formulate the problem, where the entry in each cell equals the product of the row and column probabilities.

Table 4-1 Punnett square for the Hardy-Weinberg Equilibrium

Alleles        A       a       Marginal frequencies
A              p²      pq      p
a              pq      q²      q
Marginal       p       q       1
frequencies

In general, let p1, 2q1, and r1 be the unknown genotype frequencies of AA, Aa, and aa in the population. The genotype frequencies can be written in the form of a bivariate frequency table (Table 4-2) as follows:

Table 4-2 Genotype frequencies in the population

Alleles        A       a       Marginal frequencies
A              p1      q1      p
a              q1      r1      q
Marginal       p       q       1
frequencies

Thus the joint distribution of the alleles is symmetric, with identical marginal frequencies.

Theorem: Given frequencies p1, 2q1, and r1, there exists a number θ such that

p1 = p² + θpq,  q1 = pq − θpq,  and  r1 = q² + θpq.

Proof: Let θ = (p1 − p²)/(pq). This readily implies that p1 = p² + θpq. Likewise,

q1 = p − p1 = p − (p² + θpq) = p(1 − p) − θpq = pq − θpq.
In a similar vein, one can show that r1 = q^2 + θpq. The number θ is called the inbreeding coefficient. If θ = 0, the joint distribution matches the one spelled out under the Hardy-Weinberg Equilibrium; the inbreeding coefficient θ thus measures the extent of departure from the Hardy-Weinberg Equilibrium. Consequently, the joint distribution can be put in the form of Table 4-3:

Table 4-3 Joint distribution with inbreeding coefficient θ

  Alleles        A              a              Marginal frequencies
  A              p^2 + θpq      pq − θpq       p
  a              pq − θpq       q^2 + θpq      q
  Marginal       p              q              1
  frequencies

4.2.2 Bounds on θ

Since each entry in the above table is a frequency, each entry has to be greater than or equal to zero:

  p^2 + θpq ≥ 0  ⇒  θ ≥ −p/q
  q^2 + θpq ≥ 0  ⇒  θ ≥ −q/p
  pq − θpq ≥ 0   ⇒  θ ≤ 1

Therefore, θ has to satisfy:

  −min{p/q, q/p} < θ < 1.

4.2.3 Homozygote excesses and heterozygote deficiencies

Genotypes can be classified as homozygous or heterozygous. Phenotypic traits are determined by the genotypes, and it is therefore important to know whether a specific locus shows a homozygote or a heterozygote excess. The homozygous genotypes are AA and aa; the heterozygous genotype is Aa. The question is: when do homozygotes outnumber heterozygotes?
• Homozygote proportion: p^2 + q^2 + 2pqθ = (p + q)^2 − 2pq + 2pqθ = 1 − 2pq(1 − θ).
• Heterozygote proportion: 2pq(1 − θ).

1) Thus homozygotes outnumber heterozygotes if

  1 − 2pq(1 − θ) > 2pq(1 − θ)  ⇒  θ > 1 − 1/(4pq),

where 1 − 1/(4pq) ≤ 0 because pq ≤ 1/4.

2) Heterozygotes outnumber homozygotes if

  1 − 2pq(1 − θ) < 2pq(1 − θ)  ⇒  θ < 1 − 1/(4pq).

Homozygote excesses and heterozygote deficiencies have important genetic meanings, but the currently widely used tests of the Hardy-Weinberg proportions, such as the chi-squared test and the exact test for small sample sizes and large numbers of alleles (Guo and Thompson 1992), do not take homozygote excesses and heterozygote deficiencies into account because of their two-sided hypothesis-testing nature.

4.3 Maximum Likelihood estimates

Theoretically, the frequencies (n1, n2, n3) under the genotypes AA, Aa, aa have a multinomial distribution: Multinomial(n, p^2 + θpq, 2pq(1 − θ), q^2 + θpq). The parameters of the model are p and θ, which are unknown. We employ the maximum likelihood method to obtain the estimates. The likelihood can be written as:

  L = (p^2 + θpq)^n1 [2pq(1 − θ)]^n2 (q^2 + θpq)^n3

  ln L(p, θ) = n1 ln(p^2 + θpq) + n2 ln(2pq(1 − θ)) + n3 ln(q^2 + θpq)

  ∂ln L(p, θ)/∂p = n1 (2p + θ(1 − 2p))/(p^2 + θpq) + n2 · 2(1 − θ)(1 − 2p)/(2pq(1 − θ)) + n3 (−2q + θ(1 − 2p))/(q^2 + θpq)

  ∂ln L(p, θ)/∂θ = n1 pq/(p^2 + θpq) − n2 · 2pq/(2pq(1 − θ)) + n3 pq/(q^2 + θpq)

Solving the equations ∂ln L/∂p = 0 and ∂ln L/∂θ = 0 for p and θ gives

  p̂ = (2n1 + n2)/(2n),
  θ̂ = (4n1n3 − n2^2) / ((2n1 + n2)(n2 + 2n3)).

4.4 Testing the validity of HWE

4.4.1 Hypothesis Testing on θ

In reality, the departure from the Hardy-Weinberg law is affected not only by consanguinity (inbreeding) but also by selection, genetic drift, assortative mating, and other evolutionary forces (Cockerham 1973), which are beyond the scope of this dissertation.
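The closed-form estimates above are easy to evaluate. A minimal Python sketch, using hypothetical genotype counts (n1, n2, n3) = (30, 40, 30), not data from the text:

```python
def hwe_mle(n1, n2, n3):
    """Closed-form MLEs of p and theta from genotype counts (AA, Aa, aa)."""
    n = n1 + n2 + n3
    p_hat = (2 * n1 + n2) / (2 * n)
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    return p_hat, theta_hat

p_hat, theta_hat = hwe_mle(30, 40, 30)   # hypothetical sample of n = 100
print(p_hat, theta_hat)                  # -> 0.5 0.2
```

Since θ̂ = 0.2 > 1 − 1/(4p̂q̂) = 0 here, this hypothetical sample shows a homozygote excess.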
Here we assume that the departure is caused by inbreeding alone, so that we can focus on studying and interpreting inbreeding only. As a first step toward an accurate and efficient measure of inbreeding in a small population, it is important to resolve the single-locus measure of inbreeding, determine its sampling variance, and build hypotheses and test their validity. In this dissertation, we explore different tests of the inbreeding coefficient calculated from the genotypic data at a single locus by the MLE, Siegmund's T-test (Choudhry 2006), Wald's test, or the chi-squared test, and use computer simulations to examine and compare their power.

Assume that a random sample is drawn from the population. We may wish to compare this sample with what would be expected from an "idealized inbred population". An idealized inbred population is defined as an infinitely large population with no mutation, migration, or selection (so that gene frequencies remain constant) and with random mating except for a fixed amount of inbred matings resulting in an average inbreeding coefficient of θ. The sampling of a whole population may be considered a random sample from an infinitely large pool of zygotes. In an idealized inbred population, only the fixed amount of inbreeding (consanguinity) will affect the proportion of heterozygotes. For an autosomal codominant locus having two alleles A and a with frequencies p and q (p + q = 1), the proportions of homozygotes are p^2 + θpq and q^2 + θpq for AA and aa, respectively, and the proportion of heterozygotes Aa is 2pq(1 − θ) (Crow and Kimura 1970). In view of the formulation of the joint distribution and the assumption that inbreeding is the only force driving departure from the Hardy-Weinberg equilibrium, it now follows that the population is in HWE if and only if the inbreeding coefficient θ = 0.
The null and alternative hypotheses now are:

  H0: θ = 0
  H1: θ ≠ 0

The null hypothesis is equivalent to the hypothesis of the Hardy-Weinberg equilibrium if we assume inbreeding is the only force for departure from HWE.

4.4.2 A likelihood test of the null hypothesis

Using the maximum likelihood estimator of the coefficient, a Wald test statistic can be built as:

  Z = (θ̂ − 0) / sqrt( Var(θ̂ | H0) ).

Using the asymptotic theory of maximum likelihood estimators, it follows that Z has a standard normal distribution under the null hypothesis. The alternative hypothesis can easily be framed to discriminate between homozygote excess and heterozygote deficiency (H1: θ > 0 or H1: θ < 0).

We will determine the large-sample properties of the estimator θ̂ using the delta method (Rao 1983, page 388). The maximum likelihood estimate of θ is:

  θ̂ = (4n1n3 − n2^2) / ((2n1 + n2)(n2 + 2n3)).

After rewriting,

  θ̂ = 1 − 2n·n2 / ((2n1 + n2)(n2 + 2n3)).

The derivatives are:

  ∂θ̂/∂n1 = 4n·n2 / ((2n1 + n2)^2 (n2 + 2n3)),
  ∂θ̂/∂n2 = 2n(n2^2 − 4n1n3) / ((2n1 + n2)^2 (n2 + 2n3)^2),
  ∂θ̂/∂n3 = 4n·n2 / ((2n1 + n2)(n2 + 2n3)^2), and
  ∂θ̂/∂n = −2n2 / ((2n1 + n2)(n2 + 2n3)).

Evaluating these at the expected values of the three genotypic counts and simplifying the variance expression,

  Var(θ̂) = (1/n)[ (∂θ̂/∂n1)^2 P_AA + (∂θ̂/∂n2)^2 P_Aa + (∂θ̂/∂n3)^2 P_aa − (∂θ̂/∂n)^2 ];

see Weir (1996, page 65). This gives the approximate variance

  Var(θ̂) = (1 − θ)^2 (1 − 2θ)/n + θ(1 − θ)(2 − θ)/(2np(1 − p)).

Therefore,

1) under the null hypothesis H0: Z | H0 ~ N(0, 1);
2) under the alternative hypothesis H1: θ̂ | H1 ~ N( θ, (1 − θ)^2 (1 − 2θ)/n + θ(1 − θ)(2 − θ)/(2np(1 − p)) ).

4.4.3 Siegmund's T-Test

A study on the genetics of asthma in Latino Americans (GALA) (Choudhry 2006) conducted a test for deviations from the Hardy-Weinberg equilibrium (HWE). The test can also distinguish between homozygote excess and heterozygote deficiency. The test statistic was originally proposed by David Siegmund.
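Note that under H0 (θ = 0) the variance above reduces to Var(θ̂ | H0) = 1/n, so the Wald statistic is simply Z = √n · θ̂. A small Python sketch with hypothetical counts (not data from the text):

```python
import math

def wald_z(n1, n2, n3):
    """Wald Z for H0: theta = 0; Var(theta_hat | H0) reduces to 1/n."""
    n = n1 + n2 + n3
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    return theta_hat * math.sqrt(n)

z = wald_z(30, 40, 30)    # hypothetical counts: theta_hat = 0.2, n = 100
print(round(z, 3))        # -> 2.0
```

A one-sided test of H1: θ > 0 at α = 0.05 would reject here, since 2.0 > 1.645.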
The properties and asymptotic distributions of the test statistic were not discussed in that paper. The test statistic takes the form:

  T = (n1/p + n3/q − n) / √n,

where n1 and n3 denote the homozygote genotypic counts, p and q denote the estimated allele frequencies, and n is the total number of individuals. As far as we are aware, there is no published paper on the properties of the T-statistic. In this section, we focus on the T-statistic and establish some important properties. Under the HWE, T has approximately a standard normal distribution. An excess of homozygous individuals leads to a positive value of T, while an excess of heterozygotes gives a negative value of T.

4.4.3.1 Expectation and Variance of T

After rewriting, T takes the following form:

  T = ( 2n·n1/(n + n1 − n3) + 2n·n3/(n + n3 − n1) − n ) / √n.

The details are given in Appendix 4. The expectation and variance of T are determined using the delta method.

1. Expectation of T:  E(T) = √n·θ

2. Variance of T:  V(T) = (1 − θ)[θ(6p^2 − 6p + 2) + 2p(1 − p) − θ^2(1 − 2p)^2] / (2p(1 − p))

4.4.3.2 Distribution of T

1. Under the null hypothesis (when the inbreeding coefficient is equal to 0):

• Expectation of T under H0:  E(T) ≈ (np^2/p + nq^2/q − n)/√n = (np + nq − n)/√n = 0, and
• Substituting θ = 0 in the above expression for the variance of T, it can be shown that

  V(T) = (1)[0 + 2p(1 − p) − 0] / (2p(1 − p)) = 1.

Therefore, under the null hypothesis, T has asymptotically a standard normal distribution, i.e., T | H0 ~ N(0, 1).

2. Under the alternative hypothesis (when the inbreeding coefficient is not equal to 0):

  T | H1 ~ N( √n·θ, (1 − θ)[θ(6p^2 − 6p + 2) + 2p(1 − p) − θ^2(1 − 2p)^2] / (2p(1 − p)) ).
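The statistic is straightforward to compute from genotype counts; the sketch below (hypothetical counts, with p and q estimated by gene counting) evaluates T directly from its first form:

```python
import math

def siegmund_t(n1, n2, n3):
    """Siegmund's T from genotype counts, with allele frequencies estimated."""
    n = n1 + n2 + n3
    p = (2 * n1 + n2) / (2 * n)      # estimated frequency of allele A
    q = (n2 + 2 * n3) / (2 * n)      # estimated frequency of allele a
    return (n1 / p + n3 / q - n) / math.sqrt(n)

t = siegmund_t(30, 40, 30)   # hypothetical counts
print(round(t, 3))           # -> 2.0  (positive: homozygote excess)
```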
From Section 4.4.2, we recall that:

1) under the null hypothesis H0: Z | H0 ~ N(0, 1);
2) under the alternative hypothesis H1: θ̂ | H1 ~ N( θ, (1 − θ)^2 (1 − 2θ)/n + θ(1 − θ)(2 − θ)/(2npq) ).

One can show that

  (1/n) · (1 − θ)[θ(6p^2 − 6p + 2) + 2p(1 − p) − θ^2(1 − 2p)^2] / (2p(1 − p)) = (1 − θ)^2 (1 − 2θ)/n + θ(1 − θ)(2 − θ)/(2npq),

that is, V(T)/n = Var(θ̂). The derivations of the expectation and variance of T are reported in Appendix 4. Further, it can be shown that θ̂ = T/√n.

4.4.4 χ2-Test

Testing deviation from the HWE is generally performed using Pearson's chi-squared test, with the observed genotype frequencies obtained from the data and the expected genotype frequencies obtained under the HWE. The null hypothesis is that the population is in Hardy-Weinberg proportions, and the alternative hypothesis is that the population is not in Hardy-Weinberg proportions:

  χ2 = (n1 − n·p̂^2)^2/(n·p̂^2) + (n2 − 2n·p̂q̂)^2/(2n·p̂q̂) + (n3 − n·q̂^2)^2/(n·q̂^2),

where p̂ = (2n1 + n2)/(2n) and q̂ = (n2 + 2n3)/(2n).

Substituting p̂ and q̂ into the formula, the chi-squared test statistic simplifies to:

  χ2 = n(n2^2 − 4n1n3)^2 / ((2n1 + n2)^2 (n2 + 2n3)^2) = n·θ̂^2.

The classical chi-squared goodness-of-fit test generally performs well, but it has sometimes been pointed out that the chi-squared test is not appropriate when the alternative hypothesis of the test is heterozygote deficiency (Pamilo and Varvio-Aho 1984; Lessios 1992), and the generally used exact tests may also have this problem. In Ward and Sing (1970), θ is estimated from the value of the chi-squared test statistic. The derivation runs as follows. Assume that the sample is large. Let p1, p2, ..., pk be the allele frequencies of a k-allele gene. Let nij be the observed frequency of the genotype AiAj, i ≤ j.
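The chain of identities T^2 = χ2 = n·θ̂^2 can be verified numerically; the following sketch uses arbitrary hypothetical counts:

```python
import math

def hwe_stats(n1, n2, n3):
    """theta_hat, Siegmund's T, and Pearson's chi-squared from genotype counts."""
    n = n1 + n2 + n3
    p = (2 * n1 + n2) / (2 * n)
    q = (n2 + 2 * n3) / (2 * n)
    theta_hat = (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))
    t = (n1 / p + n3 / q - n) / math.sqrt(n)
    chi2 = ((n1 - n * p * p)**2 / (n * p * p)
            + (n2 - 2 * n * p * q)**2 / (2 * n * p * q)
            + (n3 - n * q * q)**2 / (n * q * q))
    return theta_hat, t, chi2

theta_hat, t, chi2 = hwe_stats(22, 46, 32)     # arbitrary hypothetical counts
n = 22 + 46 + 32
print(round(t**2, 6), round(chi2, 6), round(n * theta_hat**2, 6))  # all three agree
```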
Since the sample is large,

  nii ≅ n[pi^2 + pi(1 − pi)θ]  and  nij ≅ n(1 − θ)pi·pj.

Note that

  χ2 = Σi (nii − n·pi^2)^2/(n·pi^2) + Σ_{i<j} (2nij − 2n·pi·pj)^2/(2n·pi·pj)
     ≅ Σi [n·pi(1 − pi)θ]^2/(n·pi^2) + Σ_{i<j} (2nθ·pi·pj)^2/(2n·pi·pj)
     = nθ^2 [ Σi (1 − pi)^2 + 2Σ_{i<j} pi·pj ]
     = nθ^2 [ k − 2 + Σi pi^2 + 2Σ_{i<j} pi·pj ]
     = nθ^2 [ k − 2 + (p1 + p2 + ... + pk)^2 ]
     = n(k − 1)θ^2.

They showed that, for samples of the size commonly collected from natural populations, a significant χ2 is obtained only for large values of θ. In other words, under the null hypothesis (θ = 0), it takes a very large sample to detect levels of inbreeding characteristic of natural populations (θ ≤ 0.10). (This means that, for sample sizes usually collected, the estimate of θ can take on biologically crucial values that are not statistically significant.) Otherwise stated, the type II error, failure to reject θ = 0 when θ ≠ 0, is an overwhelming reality in most studies.

Solving the above formula for θ, a third estimate of average inbreeding is taken as the positive root:

  θ̂ = sqrt( χ2 / (n(k − 1)) ).

This estimate is very appealing, since the chi-squared value simultaneously provides an estimate θ̂ and a significance test of the hypothesis θ = 0. However, since θ̂ is the positive root of a quadratic equation, its sampling properties are very difficult to determine. The likelihood-based Z-test or Siegmund's T-test, by contrast, can easily be employed to determine the sample size required for a given size, power, and alternative value of θ.

Under the alternative hypothesis (θ ≠ 0), the χ2 statistic approximately has a non-central chi-squared distribution with non-centrality parameter λ = nθ^2(k − 1) (Ward and Sing 1970; Haber 1980), so that

  V(θ̂^2) = [k(k − 1) + 4nθ^2(k − 1)] / (n^2 (k − 1)^2) = k/(n^2(k − 1)) + 4θ^2/(n(k − 1)).

When only a single locus is considered, the sample size needed to detect small but significant deviations is unrealistically large (Ward and Sing 1970).
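The approximation χ2 ≅ n(k − 1)θ^2 can be seen directly by plugging the expected (noise-free) counts of the inbreeding model into the chi-squared statistic; a sketch with k = 3 equifrequent alleles and an assumed θ = 0.1:

```python
import math

k, n, theta = 3, 900, 0.1        # illustrative values, not data from the text
p = [1/3, 1/3, 1/3]

chi2 = 0.0
for i in range(k):
    # Expected homozygote count under the inbreeding model:
    n_ii = n * ((1 - theta) * p[i]**2 + theta * p[i])
    chi2 += (n_ii - n * p[i]**2)**2 / (n * p[i]**2)
    for j in range(i + 1, k):
        # Expected heterozygote count under the inbreeding model:
        n_ij = n * 2 * (1 - theta) * p[i] * p[j]
        chi2 += (n_ij - 2 * n * p[i] * p[j])**2 / (2 * n * p[i] * p[j])

print(round(chi2, 6))                        # -> 18.0, i.e. n*(k-1)*theta**2
theta_est = math.sqrt(chi2 / (n * (k - 1)))  # Ward and Sing's estimate
print(round(theta_est, 6))                   # -> 0.1, recovering theta
```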
At a locus with two alleles, the χ2 test can detect an inbreeding coefficient of θ = 0.0001 (a realistic value for human populations) at the 5% significance level only 50% of the time in a sample as large as 4×10^8, almost twice the population of the U.S. (Curie-Cohen 1981).

4.4.5 Relationship between θ̂, Wald's Z-test, Siegmund's T-test, and the χ2-test

In summary, the above tests are all connected:

• θ̂ and χ2: In Section 4.4.4, we have shown χ2 = n·θ̂^2.

• θ̂ and Siegmund's T: In Section 4.4.3.1, we have shown E(T) = √n·θ.

• Siegmund's T and χ2:

  T = ( 2n·n1/(2n1 + n2) + 2n·n3/(n2 + 2n3) − n ) / √n  ⇒  T^2 = n(n2^2 − 4n1n3)^2 / ((2n1 + n2)^2 (n2 + 2n3)^2) = χ2.

Therefore, T^2 = χ2 = n·θ̂^2, and T = √n·θ̂.

4.5 Advantages of Wald's Z-test or Siegmund's T-Test

Using Wald's Z-test instead of the χ2 statistic to test the Hardy-Weinberg Equilibrium has more desirable properties:

1) The Z-test can be used to test homozygote excess and heterozygote deficiency because it can be built as a one-sided test, while the χ2 test cannot: χ2 is always positive and can only be used for two-sided alternatives.

2) Sample size calculation based on the Z-test is much easier than that based on the χ2 test.

4.6 Sample size calculation

4.6.1 Sample size calculation based on the Z-test or Siegmund's T-test

Let σ be the standard deviation of T under H1: θ = θ1 > 0, and note that

  σ^2 = (1 − θ1)^2 (1 − 2θ1) + θ1(1 − θ1)(2 − θ1)/(2pq).

Set

  1 − β = Pr(Rej H0 | θ = θ1) = Pr(T > Zα | θ = θ1),

where Zα is the upper 100α percentile of the standard normal distribution. Then

  1 − β = Pr( (T − √n·θ1)/σ > (Zα − √n·θ1)/σ | θ = θ1 ) = Pr( Z > (Zα − √n·θ1)/σ ),  where Z ~ N(0, 1).

Setting (Zα − √n·θ1)/σ = −Zβ gives √n·θ1 = Zα + σZβ, so that

  n = (Zα + σZβ)^2 / θ1^2.

The sample size formula requires the value of the allele frequency P(A) = p. The sample size n needed to obtain a specified power 1 − β using Wald's Z-test, for various values of the allele frequency q, level α, and true inbreeding coefficient θ in the bi-allelic case (k = 2), is presented in Table 4-4.
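The formula n = (Zα + σZβ)^2 / θ1^2 is easy to evaluate with standard normal quantiles; a Python sketch follows (the values of θ1 and p in the example call are illustrative). Because tabulated results may round quantiles and handle σ slightly differently, values computed this way can differ a little from Table 4-4:

```python
import math
from statistics import NormalDist

def sample_size(theta1, p, alpha=0.05, power=0.90):
    """n = (Z_alpha + sigma * Z_beta)^2 / theta1^2, from the derivation above."""
    q = 1 - p
    z_a = NormalDist().inv_cdf(1 - alpha)   # upper-alpha standard normal quantile
    z_b = NormalDist().inv_cdf(power)
    sigma2 = (1 - theta1)**2 * (1 - 2 * theta1) \
             + theta1 * (1 - theta1) * (2 - theta1) / (2 * p * q)
    return math.ceil((z_a + math.sqrt(sigma2) * z_b)**2 / theta1**2)

# Illustrative call: theta1 = 0.01, allele frequency p = 0.9 (so q = 0.1)
n_req = sample_size(0.01, p=0.9)
print(n_req)
```

As expected, the required n grows roughly like 1/θ1^2 as the alternative shrinks toward zero.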
The SAS code is presented in Appendix 5.

Table 4-4 Sample size n to achieve a specified power 1 − β, using Wald's Z-test, for various values of the allele frequency q and true inbreeding coefficient θ, and level α (k = 2)

q = 0.1, α = 0.05:
                               power
  θ          0.2           0.5           0.9             0.95
  0.0001     64,564,100    270,746,708   856,993,620     1,082,986,832
  0.0005     2,589,898     10,860,621    34,377,086      43,442,484
  0.001      649,763       2,724,751     8,624,646       10,899,005
  0.002      163,582       685,974       2,171,311       2,743,896
  0.005      26,717        112,038       354,634         448,152
  0.01       6,903         28,948        91,629          115,792
  0.02       1,835         7,694         24,355          30,778
  0.05       342           1,436         4,545           5,744
  0.1        103           432           1,369           1,729
  0.25       22            91            288             364
  0.5        5             23            71              90
  1          0             0             0               0

q = 0.1, α = 0.01:
  θ          0.2           0.5           0.9             0.95
  0.0001     220,598,052   541,574,226   1,302,619,334   1,578,165,406
  0.0005     8,848,979     21,724,484    52,252,731      63,305,872
  0.001      2,220,063     5,450,316     13,109,351      15,882,403
  0.002      558,916       1,372,153     3,300,365       3,998,499
  0.005      91,286        224,110       539,039         653,063
  0.01       23,586        57,904        139,274         168,736
  0.02       6,269         15,391        37,020          44,851
  0.05       1,170         2,872         6,909           8,370
  0.1        352           865           2,080           2,520
  0.25       74            182           438             531
  0.5        18            45            108             131
  1          0             0             0               0

q = 0.5, α = 0.05:
  θ          0.2           0.5           0.9             0.95
  0.0001     64,518,227    270,554,343   856,384,727     1,082,217,371
  0.0005     2,580,728     10,822,171    34,255,381      43,288,684
  0.001      645,182       2,705,541     8,563,839       10,822,163
  0.002      161,295       676,383       2,140,953       2,705,533
  0.005      25,807        108,219       342,545         432,876
  0.01       6,451         27,053        85,630          108,211
  0.02       1,612         6,761         21,401          27,045
  0.05       257           1,080         3,417           4,318
  0.1        64            268           848             1,071
  0.25       10            41            128             162
  0.5        2             8             26              32
  1          0             0             0               0

q = 0.5, α = 0.01:
  θ          0.2           0.5           0.9             0.95
  0.0001     220,441,317   541,189,438   1,301,693,824   1,577,044,120
  0.0005     8,817,651     21,647,572    52,067,740      63,081,750
  0.001      2,204,411     5,411,889     13,016,925      15,770,426
  0.002      551,101       1,352,968     3,254,222       3,942,595
  0.005      88,174        216,470       520,665         630,802
  0.01       22,042        54,114        130,156         157,689
  0.02       5,509         13,524        32,529          39,410
  0.05       880           2,159         5,194           6,292
  0.1        218           536           1,289           1,561
  0.25       33            81            195             237
  0.5        7             16            39              47
  1          0             0             0               0

4.6.2 Sample size calculation based on Ward and Sing's χ2-test

The sample
size required under different choices of parameters based on Ward and Sing's χ2-test (Ward and Sing 1970) is reproduced below. It turns out that the sample sizes calculated from the Z-test are lower than those based on Ward and Sing's χ2-test.

4.6.3 Power comparison between the T and χ2 tests via simulations

We have discussed the different properties of Wald's Z-test statistic and Ward and Sing's χ2 test. Each has its own advantages. Wald's Z-test statistic can be built as a one-sided test, which accommodates discriminating homozygote excesses from heterozygote deficiencies. On the other hand, the χ2 test has its own advantage, namely that the statistic can be generalized to any number of alleles. We invoked SAS to compare their power; the code is presented in Appendix 6. The null and alternative hypotheses are:

  H0: θ = 0
  H1: θ > 0 (Z-test)
  H1: θ ≠ 0 (χ2-test)

Simulation procedure: Randomly select 100 θ's from (0, 0.2). For each θ, simulate the multinomial distribution (n = 100, for a given allele frequency P(A) = p), apply both Wald's Z-test and the χ2 test at significance level α = 5%, and repeat this process 10,000 times. The empirical power is the proportion of times H0 is rejected.

1. Power comparison of the Z-test and χ2-test

In Figures 4-1, 4-2, and 4-3, the empirical powers under both tests are graphed.

Figure 4-1 Power comparison of Z and χ2, when p = 0.5

Figure 4-2 Power comparison of Z and χ2, when p = 0.2

Figure 4-3 Power comparison of Z and χ2, when p = 0.05

2. Normality Check of Z

For a very low value of p (p = 0.05), the chi-squared test has better power than the Z-test; for other values of p, the Z-test is better. Apparently, the relative performance of the Z-test and the χ2-test depends on the choice of allele frequency. When p = 0.5, the Z-test has higher power than the χ2-test; however, when p becomes extreme (p = 0.05), the Z-test has lower power than the χ2-test. The normality of Z, taken for granted, may be the issue.
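The simulation can be reproduced in miniature with standard-library Python (the dissertation's own simulations used SAS; the sketch below is an independent re-implementation with fewer replicates and a fixed seed):

```python
import math
import random

def theta_hat(n1, n2, n3):
    return (4 * n1 * n3 - n2**2) / ((2 * n1 + n2) * (n2 + 2 * n3))

def simulate_power(theta, p=0.5, n=100, reps=2000, seed=1):
    """Empirical rejection rates of the one-sided Z-test and two-sided chi2 test."""
    rng = random.Random(seed)
    q = 1 - p
    probs = [p * p + theta * p * q, 2 * p * q * (1 - theta), q * q + theta * p * q]
    z_crit, chi_crit = 1.6449, 3.8415   # 5% one-sided normal / chi2(1) critical values
    rej_z = rej_chi = 0
    for _ in range(reps):
        draws = rng.choices([0, 1, 2], weights=probs, k=n)
        n1, n2, n3 = draws.count(0), draws.count(1), draws.count(2)
        if 2 * n1 + n2 == 0 or n2 + 2 * n3 == 0:
            continue                     # degenerate sample: an allele is absent
        th = theta_hat(n1, n2, n3)
        rej_z += math.sqrt(n) * th > z_crit      # Wald Z = sqrt(n) * theta_hat
        rej_chi += n * th * th > chi_crit        # identity: chi2 = n * theta_hat^2
    return rej_z / reps, rej_chi / reps

size_z, size_chi = simulate_power(theta=0.0)     # empirical size under H0
pow_z, pow_chi = simulate_power(theta=0.2)       # empirical power at theta = 0.2
print(size_z, size_chi, pow_z, pow_chi)
```

With θ = 0.2 and p = 0.5, the one-sided Z-test rejects noticeably more often than the two-sided χ2 test, consistent with the pattern described for Figure 4-1.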
Therefore, we check the normality of Z, using histograms and Q-Q plots. When a linear point pattern exists in the Q-Q plot, it indicates that the specified family reasonably describes the data distribution, and the location and scale parameters can be estimated visually as the intercept and slope of the linear pattern. The degree of normality of Z differs across choices of the allele frequency p: when p = 0.5 or 0.2, the histograms visually appear normal, whereas in the case of p = 0.05 the histogram is bimodal. Therefore, we can conclude that the different power performances of the Z and χ2 tests depend on the degree of normality of Z, which is affected by the choice of allele frequency p.

1) p = 0.5
Figure 4-4 Histogram and normal Q-Q plot of Z, when p = 0.5 (fitted normal curve: Mu = 1.9973, Sigma = 1.0532)

2) p = 0.2
Figure 4-5 Histogram and normal Q-Q plot of Z, when p = 0.2 (fitted normal curve: Mu = 1.6738, Sigma = 1.0413)

3) p = 0.05
Figure 4-6 Histogram and normal Q-Q plot of Z, when p = 0.05 (fitted normal curve: Mu = 0.1395, Sigma = 2.0432)

For a small allele frequency p, there may be no homozygote for the rare allele in the sample. Consequently, the sampling distribution of Z is based on heterozygotes and common homozygotes, which, as shown in the histogram, makes the distribution bimodal.

4.7 Conclusion

The traditional χ2 test of the Hardy-Weinberg equilibrium is not applicable for one-sided alternatives. Wald's Z-test and Siegmund's T-test are entertained for one-sided alternatives; these two tests are shown to be essentially the same.
Using the Z-test, a sample size formula for achieving a specified power has been developed. The sample sizes are lower than those obtained under the χ2 test. However, the Z-test is biased (i.e., the nominal significance level is not achieved under H0 for skewed allele frequencies; see Figure 4-3). The bias may be due to non-normality of the Z-statistic. For moderate values of the allele frequency p, the Z-test has higher power than the χ2 test. For very low values of p, the χ2 test is superior.

5 Hardy-Weinberg Equilibrium in the case of three alleles

5.1 Introduction

When a population has HW proportions, the disequilibrium coefficient is zero, suggesting that a test of the hypothesis that the disequilibrium is zero is equivalent to testing for HWE. The word "equilibrium" here means, strictly speaking, a state in which the properties of the population do not change over successive generations. In the HWE case, this implies the continued absence of disturbing forces such as selection, migration, and mutation, as well as the absence of nonrandom mating (inbreeding). In this study, testing whether θ equals zero is equivalent to testing the HWE under two biological scenarios: 1) inbreeding is the only force affecting the equilibrium state; 2) the population is substructured. The argument given by Hardy for the stability of allele frequencies under random mating extends to the case of any number of alleles. In this chapter, we consider the problem of testing the HWE in the tri-allelic case. The bi-allelic and tri-allelic cases differ with respect to mathematical, statistical, and computational issues. We will present a new method of handling the tri-allelic problem by reducing it to several bi-allelic problems.

5.2 Joint distribution of genotypes

In this section, we focus on the case of tri-allelic genes and the attendant Hardy-Weinberg equilibrium issues. Consider a tri-allelic gene with alleles A1, A2, and A3.
The customary notation for the joint distribution of the genotypes A1A1, A1A2, A1A3, A2A2, A2A3, A3A3 is p11, 2p12, 2p13, p22, 2p23, p33, respectively. The reason for this special notation is that any heterozygous genotype AiAj (i ≠ j) is indistinguishable from AjAi. This particular notation lets us write the joint distribution of the genotypes in the following form (Table 5-1), where pij = pji for all i and j.

Table 5-1 Joint distribution of genotypes

  Alleles        A1     A2     A3     Marginal frequencies
  A1             p11    p12    p13    p1
  A2             p21    p22    p23    p2
  A3             p31    p32    p33    p3
  Marginal       p1     p2     p3     1
  frequencies

Let us use the generic symbol A (a 3×3 array) for the joint distribution. The marginal probabilities p1, p2, and p3 are called allelic frequencies. The following are the properties of the joint distribution A:

1) A is symmetric: p12 = p21, p13 = p31, and p23 = p32.
2) The row and column marginal frequencies are the same.

5.2.1 Parameter spaces

(1) Parameter space Ω

The purpose in this chapter is to make inferences on the unknown A of a population of interest based on a random sample of individuals drawn from the population and their determined genotypes. The underlying parameter space is defined by:

  Ω = {A : A is a joint distribution of the type in Table 5-1}.

Because of the special structure of A, it is enough to choose a set of five entries in Table 5-1 for specification; the rest of the entries in the table can then be determined. For example, one can spell out p11, p12, p13, p22, and p23, all non-negative with sum ≤ 1. The rest of the entries are automatically determined. Equivalently, one can spell out p1, p2, p11, p12, and p22 with the requisite natural constraints; the rest of the entries in the table are determined uniquely. What this means is that the parameter space Ω has five free parameters and therefore can be deemed 5-dimensional.

(2) Parameter space Ω_{p1,p2,p3}

We now consider a specified subset of Ω. Let p1, p2, and p3 be such that each pi ≥ 0 and p1 + p2 + p3 = 1.
Suppose p1, p2, and p3 are given. Let

  Ω_{p1,p2,p3} = {A : A is a joint distribution of the type in Ω with p1, p2, and p3 known}.

The set Ω_{p1,p2,p3} has three free parameters. For example, one can spell out p11, p12, and p22 freely, subject to the relevant natural constraints; the rest of the entries in A are then determined uniquely. As a consequence, Ω_{p1,p2,p3} is deemed to be 3-dimensional.

(3) Parameter space Ω*

We now introduce a definition.

Definition: A joint distribution A of the type in Table 5-1 is of type Ωθ if it has the following form (Table 5-2) for some θ, p1, p2, and p3, with the marginals p1, p2, and p3 unknown:

Table 5-2 Joint distribution of type Ωθ

  Alleles     A1                A2                A3                Marginal frequencies
  A1          (1−θ)p1^2 + θp1   (1−θ)p1p2         (1−θ)p1p3         p1
  A2          (1−θ)p1p2         (1−θ)p2^2 + θp2   (1−θ)p2p3         p2
  A3          (1−θ)p1p3         (1−θ)p2p3         (1−θ)p3^2 + θp3   p3
  Marginal    p1                p2                p3                1
  frequencies

The entity θ is defined to be the inbreeding coefficient. Let

  Ω* = {A : the joint distribution A is of the type Ωθ for some θ ≠ 0}.

It is clear that the parameter space Ω* is 3-dimensional: one can take θ, p1, and p2 as free parameters.

(4) Parameter space Ω*_{p1,p2,p3} with given p1, p2, and p3

A special subset of Ω* is of interest. Let p1, p2, and p3 be given marginal frequencies. Let

  Ω*_{p1,p2,p3} = {A ∈ Ω* : A has marginal frequencies p1, p2, and p3}.

It is clear that Ω*_{p1,p2,p3} is only 1-dimensional, and the free parameter is θ.

Let A ∈ Ω. The case θ = 0 leads to the following joint distribution (Table 5-3). In this case, we say that the population has achieved the Hardy-Weinberg equilibrium (Ω0).

Table 5-3 Joint distribution of genotypes under equilibrium (Ω0)

  Alleles     A1       A2       A3       Marginal frequencies
  A1          p1^2     p1p2     p1p3     p1
  A2          p1p2     p2^2     p2p3     p2
  A3          p1p3     p2p3     p3^2     p3
  Marginal    p1       p2       p3       1
  frequencies

The inclusion Ω* ⊂ Ω is strict. This follows from dimension(Ω*) < dimension(Ω).
The following is a specific example of a joint distribution A in Ω but not in Ω* (Table 5-4):

Table 5-4 Example: a distribution in Ω but not in Ω*

  Alleles     A1       A2       A3       Marginal frequencies
  A1          1/9      1/9      1/9      1/3
  A2          1/9      2/27     4/27     1/3
  A3          1/9      4/27     2/27     1/3
  Marginal    1/3      1/3      1/3      1
  frequencies

We claim that the distribution in Table 5-4 is not in Ω*. Suppose it is, and let θ be the corresponding inbreeding coefficient. Then the (A1, A2) cell gives (1 − θ)p1p2 = (1 − θ)/9 = 1/9, which implies θ = 0. On the other hand, the (A2, A3) cell gives (1 − θ)p2p3 = (1 − θ)/9 = 4/27, which implies θ = −1/3. This is a contradiction.

5.2.2 Bounds on θ

The inbreeding coefficient θ can be measured indirectly from genetic data; the higher the value, the more closely related the parents are. Suppose the joint distribution is of the type Ωθ for some θ. We can find bounds for θ from Table 5-2. Since each entry in that table is a frequency, each entry has to be greater than or equal to zero; the bounds follow by setting each entry ≥ 0. Therefore, θ has to satisfy:

  −min{ p1/(1 − p1), p2/(1 − p2), p3/(1 − p3) } < θ < 1.

5.2.3 Biological scenario

For the case of two alleles, every distribution in Ω can be put in the form Ωθ. However, when there are three or more alleles, not every distribution in Ω can be put in the form Ωθ (see the example in Table 5-4). A question that naturally arises is: under what conditions can a genotype probability distribution with fixed marginals be put in the form of Table 5-2?

Population substructure

Li, C.C. (1969) provided an explanation. The departure from the Hardy-Weinberg Equilibrium can be explained by the following table:

Table 5-5 Population subdivision with respect to tri-alleles

  Alleles     A1             A2                A3                Marginal frequencies
  A1          p1^2 + σ1^2    2p1p2 + 2σ12      2p1p3 + 2σ13      p1
  A2          .              p2^2 + σ2^2       2p2p3 + 2σ23      p2
  A3          .              .                 p3^2 + σ3^2       p3
  Marginal    p1             p2                p3                1
  frequencies
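Whether a given A ∈ Ω is of the type Ωθ can be checked by computing the θ implied by each off-diagonal cell and testing whether they all agree (here the off-diagonal cells already disagree, so the diagonal need not be checked). The sketch below applies this to the distribution of Table 5-4, using exact rational arithmetic:

```python
from fractions import Fraction as F

# Cells of Table 5-4; entry (i, j) is p_ij (for i != j the genotype
# probability of AiAj is 2 * p_ij).
A = [[F(1, 9), F(1, 9),  F(1, 9)],
     [F(1, 9), F(2, 27), F(4, 27)],
     [F(1, 9), F(4, 27), F(2, 27)]]
p = [sum(row) for row in A]      # marginal allele frequencies: all 1/3

# A is of type Omega_theta only if every off-diagonal cell (1-theta)*pi*pj
# implies the same theta.
thetas = {1 - A[i][j] / (p[i] * p[j]) for i in range(3) for j in range(3) if i != j}
print(sorted(thetas))   # -> [Fraction(-1, 3), Fraction(0, 1)]  : not in Omega*
```

Two distinct implied values, 0 and −1/3, reproduce the contradiction derived above.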
This joint distribution has 9 parameters (σ1, σ2, σ3, σ12, σ13, σ23 and the three marginal frequencies p1, p2, p3) subject to 4 constraints:

  p1 + p2 + p3 = 1
  σ1^2 + σ12 + σ13 = 0
  σ12 + σ2^2 + σ23 = 0
  σ13 + σ23 + σ3^2 = 0

Consequently, the parameter space associated with all possible distributions of the type of Table 5-5 is 9 − 4 = 5-dimensional. Therefore, we would need many θi's, which means not every distribution in Ω can be put in the form Ωθ. However, the 5 parameters can be reduced further from a biological standpoint: under population subdivision, the evolutionary expectations of the variances and covariances can be represented by a single parameter θ, leading to

  Pr(AiAi) = (1 − θ)pi^2 + θpi  and  Pr(AiAj) = 2pi·pj(1 − θ),

which is explained solely by an inbreeding (co-ancestry) coefficient θ. If the departure from the Hardy-Weinberg Equilibrium is caused solely by inbreeding, like alleles combine with like alleles; consequently, deviations of the genotype frequencies from the HWE can be explained by just one inbreeding coefficient. For illustration, consider the probability of the genotype AiAi, Pr(AiAi) = (1 − θ)pi^2 + θpi. This can be seen as the sum of two situations: complete inbreeding (pi) weighted by the inbreeding degree θ, plus complete independence (pi^2) weighted by (1 − θ).

5.3 Structure of the case of 3 alleles: data and likelihood

5.3.1 Structure of the case of 3 alleles: data

In order to make inferences on the unknown joint distribution A ∈ Ω, we select a random sample of n subjects from the population of interest, and their genotypes are determined. Let nij = the number of subjects in the sample with genotype AiAj, 1 ≤ i ≤ j ≤ 3. The data collected can be arranged in the following table (Table 5-6):

Table 5-6 Data on genotypes

  Alleles     A1     A2     A3
  A1          n11    n12    n13
  A2          .      n22    n23
  A3          .      .      n33
Note that n11 + n12 + n13 + n22 + n23 + n33 = n. The notation for the frequencies nij is markedly different from that for the genotype probabilities pij: the genotype probability of AiAj (i ≠ j) is written as 2pij, whereas the genotype frequency of AiAj is nij.

5.3.2 Maximum Likelihood estimators

Theoretically, the frequencies (n11, n12, n13, n22, n23, n33) of the genotypes A1A1, A1A2, A1A3, A2A2, A2A3, A3A3 have a multinomial distribution (n, p11, 2p12, 2p13, p22, 2p23, p33).

1. Maximum Likelihood Estimation over Ω

Let A ∈ Ω. The likelihood of the data at A is given by

  L(A) = (p11)^n11 (p22)^n22 (p33)^n33 (2p12)^n12 (2p13)^n13 (2p23)^n23.

Maximizing the likelihood over all A ∈ Ω yields the following maximum likelihood estimates:

  p̂11 = n11/n,  p̂22 = n22/n,  p̂33 = n33/n,
  2p̂12 = n12/n,  2p̂13 = n13/n,  2p̂23 = n23/n.

2. Maximum Likelihood Estimation over Ω*

We know that Ω* is a 3-dimensional parameter space with free parameters θ, p1, and p2 (p3 = 1 − p1 − p2). Let A ∈ Ω*.
The likelihood of the data is given by:

  L(A) = [(1−θ)p1^2 + θp1]^n11 · [2(1−θ)p1p2]^n12 · [2(1−θ)p1p3]^n13 · [2(1−θ)p2p3]^n23 · [(1−θ)p2^2 + θp2]^n22 · [(1−θ)p3^2 + θp3]^n33

The log-likelihood is given by:

  ln L(θ, p1, p2) = constant + n11 ln[(1−θ)p1^2 + θp1] + n22 ln[(1−θ)p2^2 + θp2] + n33 ln[(1−θ)(1−p1−p2)^2 + θ(1−p1−p2)] + (n12 + n13 + n23) ln(1−θ) + (n12 + n13) ln p1 + (n12 + n23) ln p2 + (n13 + n23) ln(1−p1−p2)

The partial derivatives are:

  ∂ln L/∂p1 = n11[2(1−θ)p1 + θ] / [(1−θ)p1^2 + θp1] − n33[2(1−θ)(1−p1−p2) + θ] / [(1−θ)(1−p1−p2)^2 + θ(1−p1−p2)] + (n12 + n13)/p1 − (n13 + n23)/(1−p1−p2)

  ∂ln L/∂p2 = n22[2(1−θ)p2 + θ] / [(1−θ)p2^2 + θp2] − n33[2(1−θ)(1−p1−p2) + θ] / [(1−θ)(1−p1−p2)^2 + θ(1−p1−p2)] + (n12 + n23)/p2 − (n13 + n23)/(1−p1−p2)

  ∂ln L/∂θ = n11 p1(1−p1)/[(1−θ)p1^2 + θp1] + n22 p2(1−p2)/[(1−θ)p2^2 + θp2] + n33 p3(1−p3)/[(1−θ)p3^2 + θp3] − (n12 + n13 + n23)/(1−θ)

           = (1/(1−θ)) [ n11(1−p1)/(θ/(1−θ) + p1) + n22(1−p2)/(θ/(1−θ) + p2) + n33(1−p3)/(θ/(1−θ) + p3) ] − (n12 + n13 + n23)/(1−θ)

Set

  ∂ln L/∂p1 = 0,  ∂ln L/∂p2 = 0,  ∂ln L/∂θ = 0,

and solve for p1, p2, and θ. Technically, these equations should yield the maximum likelihood estimates of p1, p2, and θ. However, the amount of computation required is enormous. We invoked the software Mathematica (10.1) to obtain a symbolic solution to these equations; the software ran for a long time (over 48 hours), and there was no solution. Suppose we plug the data values nij into the equations spelled out above. Is there a solution? More specifically, suppose n11 = n12 = n13 = n22 = n23 = n33 = 10.
The log likelihood is given by

ln L(θ, p1, p2) = 10 ln[(1−θ)p1² + θp1] + 10 ln[(1−θ)p2² + θp2] + 10 ln[(1−θ)(1−p1−p2)² + θ(1−p1−p2)] + 30 ln(1−θ) + 20 ln p1 + 20 ln p2 + 20 ln(1−p1−p2).

Maximizing ln L with respect to p1, p2, and θ gives the solution

θ̂ = 1/4,  p̂1 = 1/3,  p̂2 = 1/3,  p̂3 = 1/3.

The gene count estimates of the allele frequencies,

p̂̂1 = (2n11 + n12 + n13)/(2n) = 1/3,
p̂̂2 = (2n22 + n12 + n23)/(2n) = 1/3,
p̂̂3 = (2n33 + n13 + n23)/(2n) = 1/3,

coincide, in this example, with the maximum likelihood estimates.

We have tried another example, taking n11 = 10, n12 = 10, n13 = 10, n22 = 10, n23 = 20, n33 = 20. Mathematica was again employed, and no explicit numerical solution to the likelihood equations was provided. There are data sets for which the maximum likelihood estimates of p1, p2, p3 and the natural estimates of p1, p2, p3 do not match.

Maximum likelihood estimates are often preferred because they are functions of the sufficient statistics and are asymptotically efficient, attaining the minimum achievable variance as the sample size grows without bound. Unfortunately, the maximum likelihood estimate of the inbreeding coefficient cannot be written explicitly; it must be solved for numerically by iteration. Moreover, if the likelihood of the observed genotype counts is maximized simultaneously over the gene frequencies pi (i = 1, 2, 3) and θ, the gene frequency estimates are not in general equal to the natural unbiased (gene count) estimators

p̂̂i = (2nii + Σ_{j≠i} nij)/(2n),  i = 1, 2, 3.

We provide two different approaches to resolve the difficulties we are facing.

Option (1): Use the gene count estimates of p1, p2, and p3 and obtain the maximum likelihood estimate of θ alone. As we demonstrate below, the likelihood equation is then a cubic polynomial in η = θ/(1−θ), and the asymptotic theory extends to this case.

Option (2): Reduce the 3x3 distribution to three 2x2 distributions.

We now elaborate on Option (1) in the following section.

3. Maximum Likelihood Estimation over Ω_{θ, p1, p2, p3}
Suppose we assume the marginal frequencies p1, p2, and p3 are fixed and known. The only unknown parameter is θ, so the model is one-dimensional. The likelihood of the data is given by

L(θ) = [(1−θ)p1² + θp1]^n11 [2(1−θ)p1p2]^n12 [2(1−θ)p1p3]^n13 [(1−θ)p2² + θp2]^n22 [2(1−θ)p2p3]^n23 [(1−θ)p3² + θp3]^n33,

so that

∂lnL/∂θ = n11 p1(1−p1)/[(1−θ)p1² + θp1] + n22 p2(1−p2)/[(1−θ)p2² + θp2] + n33 p3(1−p3)/[(1−θ)p3² + θp3] − (n12+n13+n23)/(1−θ).

We set ∂lnL/∂θ = 0. The likelihood equation can be simplified as follows:

∂lnL/∂θ = (1/(1−θ)) { n11(1−p1)/[θ/(1−θ) + p1] + n22(1−p2)/[θ/(1−θ) + p2] + n33(1−p3)/[θ/(1−θ) + p3] − (n12+n13+n23) } = 0.

This gives

n11(1−p1)/[θ/(1−θ) + p1] + n22(1−p2)/[θ/(1−θ) + p2] + n33(1−p3)/[θ/(1−θ) + p3] = n12 + n13 + n23.

Let η = θ/(1−θ). The likelihood equation is a third-degree polynomial equation in η:

n11(1−p1)/(η+p1) + n22(1−p2)/(η+p2) + n33(1−p3)/(η+p3) = n12 + n13 + n23,

that is,

n11(1−p1)(η+p2)(η+p3) + n22(1−p2)(η+p1)(η+p3) + n33(1−p3)(η+p1)(η+p2) = (n12+n13+n23)(η+p1)(η+p2)(η+p3).

Alternatively,

(n12+n13+n23)η³ − η²[n11(1−p1) + n22(1−p2) + n33(1−p3) − (n12+n13+n23)] − η[n11(1−p1)(p2+p3) + n22(1−p2)(p1+p3) + n33(1−p3)(p1+p2) − (n12+n13+n23)(p1p2+p1p3+p2p3)] − n11(1−p1)p2p3 − n22(1−p2)p1p3 − n33(1−p3)p1p2 + (n12+n13+n23)p1p2p3 = 0.

As an example, suppose p1 = p2 = p3 = 1/3.
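Rather than selecting among the roots of the cubic, one can equivalently maximize the one-dimensional log likelihood directly once p1, p2, p3 are fixed. A minimal sketch (plain Python; the function and argument names are illustrative, not from the appendices):

```python
import math

def theta_mle(counts, p):
    """MLE of theta with allele frequencies p = (p1, p2, p3) held fixed
    (Option 1).  counts = (n11, n22, n33, n12, n13, n23)."""
    n11, n22, n33, n12, n13, n23 = counts

    def log_lik(t):
        cells = [(n11, (1 - t) * p[0] ** 2 + t * p[0]),
                 (n22, (1 - t) * p[1] ** 2 + t * p[1]),
                 (n33, (1 - t) * p[2] ** 2 + t * p[2]),
                 (n12, 2 * (1 - t) * p[0] * p[1]),
                 (n13, 2 * (1 - t) * p[0] * p[2]),
                 (n23, 2 * (1 - t) * p[1] * p[2])]
        if any(pr <= 0 for _, pr in cells):
            return -math.inf      # theta outside its natural bounds
        return sum(c * math.log(pr) for c, pr in cells)

    # shrinking grid search on the concave one-dimensional log likelihood
    center, width = 0.0, 1.0
    for _ in range(30):
        grid = [center + width * j / 20 for j in range(-20, 21)]
        center = max(grid, key=log_lik)
        width /= 3
    return center
```

For the equal-frequency example that follows, theta_mle((10, 10, 10, 10, 10, 10), (1/3, 1/3, 1/3)) returns θ̂ = 0.25; fed counts equal to their Hardy-Weinberg expectations it returns θ̂ ≈ 0, as it should.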
Then the polynomial is

(n12+n13+n23)η³ − η²[(2/3)(n11+n22+n33) − (n12+n13+n23)] − η[(4/9)(n11+n22+n33) − (1/3)(n12+n13+n23)] − (2/27)(n11+n22+n33) + (1/27)(n12+n13+n23) = 0.

Let αi be the coefficient of the i-th power of η; for the above equation,

α3 = n12 + n13 + n23,
α2 = −[(2/3)(n11+n22+n33) − (n12+n13+n23)],
α1 = −[(4/9)(n11+n22+n33) − (1/3)(n12+n13+n23)],
α0 = −(2/27)(n11+n22+n33) + (1/27)(n12+n13+n23),

so that α3η³ + α2η² + α1η + α0 = 0.

For the equal-counts data (nij = 10 for every genotype), this cubic factors as (10/9)(3η − 1)(3η + 1)² = 0, with roots η = 1/3 and η = −1/3 (a double root). The root η = −1/3 is extraneous: it makes the denominators η + pi vanish, and the corresponding θ = −1/2 puts the homozygote probabilities on the boundary, where the likelihood diverges to −∞. The likelihood-maximizing root is therefore

η = θ/(1−θ) = 1/3, which gives θ̂ = 1/4,

in agreement with the solution obtained earlier by maximizing over all three parameters.

For the asymptotic distribution of the likelihood estimator θ̂, we need E(∂²lnL/∂θ²). After simplification,

∂lnL/∂θ = n11(1−p1)/[(1−θ)p1 + θ] + n22(1−p2)/[(1−θ)p2 + θ] + n33(1−p3)/[(1−θ)p3 + θ] − (n12+n13+n23)/(1−θ),

∂²lnL/∂θ² = −n11(1−p1)²/[(1−θ)p1 + θ]² − n22(1−p2)²/[(1−θ)p2 + θ]² − n33(1−p3)²/[(1−θ)p3 + θ]² − (n12+n13+n23)/(1−θ)².

Taking expectations, with E nii = n[(1−θ)pi² + θpi] = n pi[(1−θ)pi + θ] and E(n12+n13+n23) = 2n(1−θ)(p1p2+p1p3+p2p3),

E(−∂²lnL/∂θ²) = n p1(1−p1)²/[(1−θ)p1 + θ] + n p2(1−p2)²/[(1−θ)p2 + θ] + n p3(1−p3)²/[(1−θ)p3 + θ] + 2n(p1p2+p1p3+p2p3)/(1−θ).

In the special case p1 = p2 = p3 = 1/3,

E(−∂²lnL/∂θ²) = 4n/[3(1+2θ)] + 2n/[3(1−θ)] = 2n/[(1+2θ)(1−θ)].

Therefore, if the gene frequencies are known, the maximum likelihood estimator θ̂ of θ has asymptotically a normal distribution with mean θ and variance

V(θ̂) = 1/E(−∂²lnL/∂θ²) = (1/n) { p1(1−p1)²/[(1−θ)p1 + θ] + p2(1−p2)²/[(1−θ)p2 + θ] + p3(1−p3)²/[(1−θ)p3 + θ] + 2(p1p2+p1p3+p2p3)/(1−θ) }⁻¹.

A test of the null hypothesis H0: θ = θ0 can be built based on the asymptotic theory of the likelihood estimator. The test statistic is

Z = (θ̂ − θ0)/√Var(θ̂)|H0.

If p1, p2, p3 are not known, one could use the consistent gene count estimates

p̂̂1 = (2n11 + n12 + n13)/(2n),  p̂̂2 = (2n22 + n12 + n23)/(2n),  p̂̂3 = (2n33 + n13 + n23)/(2n)

of p1, p2, and p3, respectively, in the asymptotic variance formula of θ̂. The asymptotic normal distribution is still valid in view of Slutsky's theorem (Cramér, 1946).

5.4 Joint distributions of the type Ωθ and connection to lower-dimensional joint distributions

In this section, we pursue Option (2). Maximum likelihood estimation and testing of the inbreeding coefficient in the case of bi-allelic genes are simple to execute. We will reduce the tri-allelic case to three bi-allelic cases and explore the connection between them; the connection discovered helps us tackle the tri-allelic case.

Reduction to 2x2 joint distributions

We reduce a given genotype distribution in the tri-allelic case to three bi-allelic cases, A1 vs. (not A1), A2 vs. (not A2), and A3 vs. (not A3), with corresponding inbreeding coefficients θ1, θ2, and θ3.

5.4.1 The case of A1 vs. (not A1)

From Table 5-1, by combining alleles A2 and A3, we have the following reduced joint distribution.

Table 5-7 Joint distribution: A1 vs. (not A1)

              A1           not A1            Marginal
A1            p11          p12+p13           p1
not A1        p12+p13      p22+2p23+p33      p2+p3
Marginal      p1           p2+p3             1

As pointed out in Chapter 4, we can always write the above joint distribution in the following form for some inbreeding coefficient θ1.

Table 5-8 Joint distribution: A1 vs. (not A1) with inbreeding coefficient θ1

              A1                          not A1                      Marginal
A1            p1² + θ1 p1(p2+p3)          p1(p2+p3) − θ1 p1(p2+p3)    p1
not A1        p1(p2+p3) − θ1 p1(p2+p3)    (p2+p3)² + θ1 p1(p2+p3)     p2+p3
Marginal      p1                          p2+p3                       1

5.4.2 The case of A2 vs. (not A2)

By combining alleles A1 and A3, we have the following reduced joint distribution.

Table 5-9 Joint distribution: A2 vs. (not A2)

              not A2            A2           Marginal
not A2        p11+2p13+p33      p12+p23      p1+p3
A2            p12+p23           p22          p2
Marginal      p1+p3             p2           1

The above joint distribution can be rewritten in the following form for some inbreeding coefficient θ2.

Table 5-10 Joint distribution: A2 vs. (not A2) with inbreeding coefficient θ2

              not A2                          A2                            Marginal
not A2        (p1+p3)² + θ2 (p1+p3)p2         (p1+p3)p2 − θ2 (p1+p3)p2      p1+p3
A2            (p1+p3)p2 − θ2 (p1+p3)p2        p2² + θ2 (p1+p3)p2            p2
Marginal      p1+p3                           p2                            1

5.4.3 The case of A3 vs. (not A3)

By combining alleles A1 and A2, we have the following reduced joint distribution.

Table 5-11 Joint distribution: A3 vs. (not A3)

              not A3            A3           Marginal
not A3        p11+2p12+p22      p13+p23      p1+p2
A3            p13+p23           p33          p3
Marginal      p1+p2             p3           1

The above joint distribution can be rewritten in the following form for some inbreeding coefficient θ3.

Table 5-12 Joint distribution: A3 vs. (not A3) with inbreeding coefficient θ3

              not A3                          A3                            Marginal
not A3        (p1+p2)² + θ3 (p1+p2)p3         (p1+p2)p3 − θ3 (p1+p2)p3      p1+p2
A3            (p1+p2)p3 − θ3 (p1+p2)p3        p3² + θ3 (p1+p2)p3            p3
Marginal      p1+p2                           p3                            1

5.5 Estimation of inbreeding coefficient and hypothesis testing

5.5.1 Estimation of the inbreeding coefficient in a model of the type Ωθ

Suppose the joint distribution in a tri-allelic case is of the type Ωθ, and let θ be its inbreeding coefficient. Let θ1, θ2, and θ3 be the inbreeding coefficients of the joint distributions of A1 vs. (not A1), A2 vs. (not A2), and A3 vs. (not A3), respectively. The following is the fundamental result of this section.

Theorem. The joint distribution is one-dimensional (A is of the type Ωθ) if and only if θ1 = θ2 = θ3.

The result has two parts:

1. If the joint distribution A of the alleles is of the type Ωθ, then θ1 = θ2 = θ3 = θ.
2. Conversely, if θ1 = θ2 = θ3 = θ, then the joint distribution A of the alleles is of the type Ωθ.

Proof of 1. Let A be given by

              A1                  A2                  A3                  Marginal
A1            (1−θ)p1² + θp1      (1−θ)p1p2           (1−θ)p1p3           p1
A2            (1−θ)p1p2           (1−θ)p2² + θp2      (1−θ)p2p3           p2
A3            (1−θ)p1p3           (1−θ)p2p3           (1−θ)p3² + θp3      p3
Marginal      p1                  p2                  p3                  1

The joint distribution of A1 vs. (not A1) becomes

              A1                  not A1                        Marginal
A1            (1−θ)p1² + θp1      (1−θ)p1(p2+p3)                p1
not A1        (1−θ)p1(p2+p3)      (1−θ)(p2+p3)² + θ(p2+p3)      p2+p3
Marginal      p1                  p2+p3                         1

Note that (1−θ)p1² + θp1 = p1² + θp1(1−p1) and (1−θ)p1(p2+p3) = (1−θ)p1(1−p1). Consequently, the inbreeding coefficient θ1 stemming from the above 2x2 joint distribution is indeed equal to θ; thus θ = θ1. In a similar vein, one can show that θ = θ2 and θ = θ3.

Proof of 2. Let A be an arbitrary joint distribution of the alleles,

              A1       A2       A3       Marginal
A1            p11      p12      p13      p1
A2            p21      p22      p23      p2
A3            p31      p32      p33      p3
Marginal      p1       p2       p3       1

and suppose θ1 = θ2 = θ3 = θ. Looking at the joint distribution in the case A1 vs. (not A1),

              A1           not A1            Marginal
A1            p11          p12+p13           p1
not A1        p12+p13      p22+2p23+p33      p2+p3

we have the following identities:

p11 = p1² + θp1(p2+p3),
p12 + p13 = (1−θ)p1(p2+p3),
p22 + 2p23 + p33 = (p2+p3)² + θp1(p2+p3).

Similarly, by focusing on the case of A2 vs. (not A2), we have

p22 = p2² + θp2(p1+p3),
p12 + p23 = (1−θ)p2(p1+p3),
p11 + 2p13 + p33 = (p1+p3)² + θp2(p1+p3),

and focusing on the case of A3 vs. (not A3), we have

p33 = p3² + θp3(p1+p2),
p13 + p23 = (1−θ)p3(p1+p2),
p11 + 2p12 + p22 = (p1+p2)² + θp3(p1+p2).

In these nine identities, p11, p22, and p33 are uniquely determined directly. From the first set of identities,

2p23 = (p2+p3)² + θp1(p2+p3) − p22 − p33
     = (p2+p3)² + θp1(p2+p3) − p2² − θp2(p1+p3) − p3² − θp3(p1+p2)
     = 2p2p3 + θ(p1p2 + p1p3 − p1p2 − p2p3 − p1p3 − p2p3)
     = 2(1−θ)p2p3.

Consequently, p23 = (1−θ)p2p3. In a similar vein, it follows that p12 = (1−θ)p1p2 and p13 = (1−θ)p1p3. This completes the proof.

5.5.2 Testing that the joint distribution of the alleles is of the type Ωθ

The hypothesis that the joint distribution of the alleles is of the type Ωθ for some θ is equivalent to the hypothesis H0: θ1 = θ2 = θ3 (= θ). We now develop a strategy for testing H0.

Let θ̂1, θ̂2, and θ̂3 be the maximum likelihood estimators of θ1, θ2, and θ3, respectively. Under H0 we have three unbiased estimators of θ, which we combine linearly:

θ̂ = l1θ̂1 + l2θ̂2 + l3θ̂3

for some scalars l1, l2, l3. The estimator θ̂ is unbiased for θ if l1 + l2 + l3 = 1. We want to choose l1, l2, l3 so that Var(θ̂) is minimized. Let Σ0 be the variance-covariance matrix of θ̂1, θ̂2, and θ̂3 under H0. Then Var(θ̂) = lᵀΣ0 l, where l = (l1, l2, l3)ᵀ and lᵀ is the transpose of l. Minimizing lᵀΣ0 l subject to the constraint l1 + l2 + l3 = 1 via Lagrange multipliers, the solution turns out to be

l = Σ0⁻¹1/(1ᵀΣ0⁻¹1), where 1 = (1, 1, 1)ᵀ.

The dispersion matrix Σ0 depends on the common inbreeding coefficient θ and the allele frequencies p1, p2, and p3.

Case 1: θ is known.

Suppose first that p1, p2, and p3 are also known. The test statistic we propose is

Q = (θ̂ − θ·1)ᵀ Σ0⁻¹ (θ̂ − θ·1),

where θ̂ here denotes the vector (θ̂1, θ̂2, θ̂3)ᵀ and 1 = (1, 1, 1)ᵀ. Under H0, Q has asymptotically a χ²-distribution with 3 degrees of freedom. Test: reject H0 if Q > χ²(3, α), where χ²(3, α) is the upper 100α percentile of the χ²-distribution with 3 degrees of freedom and α is the prescribed level of significance.

If p1, p2, and p3 are not known, they can be estimated from the data. Using the data abridged to the format A1 vs. (not A1),

              A1           not A1
A1            n11          n12+n13
not A1        n12+n13      n22+2n23+n33

the gene count estimator of p1 is p̂̂1 = (2n11 + n12 + n13)/(2n). This is an unbiased, consistent estimator of p1. In a similar vein, estimators of p2 and p3 are given, respectively, by p̂̂2 = (2n22 + n12 + n23)/(2n) and p̂̂3 = (2n33 + n13 + n23)/(2n). These estimators can be substituted into the statistic Q; the asymptotic null distribution of Q remains χ² with 3 degrees of freedom.

For an illustration, suppose θ = 0, so the null hypothesis is H0: θ1 = θ2 = θ3 = θ = 0. We want to examine the power of the χ²-test based on Q. Simulations are carried out following the script outlined below.

Choice of the parameters and steps of the simulations:

Step 1. Choose θ ∈ {0, 0.025, 0.05, 0.075, 0.1, 0.125, 0.15, 0.175, 0.2}.
Step 2. Choose p1, p2, and p3 randomly from [0, 1] subject to p1 + p2 + p3 = 1.
Step 3. Choose total sample size n = 1000.
Step 4. Simulate the multinomial distribution (1000; (1−θ)p1² + θp1, (1−θ)p2² + θp2, (1−θ)p3² + θp3, 2(1−θ)p1p2, 2(1−θ)p1p3, 2(1−θ)p2p3). Let the data be n11, n22, n33, n12, n13, n23.
Step 5. Obtain the estimates of p1, p2, and p3 as outlined above.
Step 6. Calculate Σ0 under the scenario θ = 0 with p̂̂1, p̂̂2, and p̂̂3.
Step 7. Obtain the estimates θ̂1, θ̂2, θ̂3.
Step 8. Calculate l = Σ0⁻¹1/(1ᵀΣ0⁻¹1).
Step 9. Calculate θ̂ = l1θ̂1 + l2θ̂2 + l3θ̂3.
Step 10. Calculate the test statistic Q = (θ̂1, θ̂2, θ̂3) Σ0⁻¹ (θ̂1, θ̂2, θ̂3)ᵀ (since θ = 0 under H0).
Step 11. Choose level α = 0.05.
Step 12. Check whether or not H0 is rejected, i.e., whether Q > χ²(3, α).

For each fixed θ, repeat Steps 2 to 12 10,000 times. Calculate

Empirical size = (No. of times H0 is rejected)/10,000, if θ = 0, and
Empirical power = (No. of times H0 is rejected)/10,000, if θ > 0.

The results are given in Figure 5-1, where power is plotted against θ.

Figure 5-1 Empirical power of the Q-test for testing H0: θ1 = θ2 = θ3 = θ = 0

Comments: As θ rises from 0 to 0.2, the power rises very sharply. At θ = 0 we obtain a valid nominal size (approximately 5%); at θ = 0.2 the power is approximately 95%. The relevant Mathematica code is provided in Appendix 9.

Case 2: θ is unknown.

This case is more complicated: we need to estimate the common inbreeding coefficient of the null hypothesis before we can use the test statistic Q. Note that Σ0 is a function of θ and the allele frequencies. Let θ̂ be the best linear unbiased estimator of θ, given by θ̂ = l1θ̂1 + l2θ̂2 + l3θ̂3 with

l = (l1, l2, l3)ᵀ = Σ0⁻¹1/(1ᵀΣ0⁻¹1).

The vector l is not computable as it stands, since Σ0 involves θ and the allele frequencies p1, p2, and p3. In place of p1, p2, and p3, we use their gene count estimators

p̂̂1 = (2n11 + n12 + n13)/(2n),  p̂̂2 = (2n22 + n12 + n23)/(2n),  p̂̂3 = (2n33 + n13 + n23)/(2n).

We obtain θ̂ iteratively:

Step 1. Set l1⁽⁰⁾ = l2⁽⁰⁾ = l3⁽⁰⁾ = 1/3 and calculate θ̂⁽⁰⁾ = l1⁽⁰⁾θ̂1 + l2⁽⁰⁾θ̂2 + l3⁽⁰⁾θ̂3.
Step 2. Use θ̂⁽⁰⁾ of Step 1 to calculate Σ0 and l⁽¹⁾ = Σ0⁻¹1/(1ᵀΣ0⁻¹1), and set θ̂⁽¹⁾ = l1⁽¹⁾θ̂1 + l2⁽¹⁾θ̂2 + l3⁽¹⁾θ̂3.

Repeat Steps 1 and 2 until convergence takes place. The test statistic is

Q = (θ̂1 − θ̂, θ̂2 − θ̂, θ̂3 − θ̂) Σ0⁻¹ (θ̂1 − θ̂, θ̂2 − θ̂, θ̂3 − θ̂)ᵀ.

Under H0, Q has asymptotically a χ²-distribution with 2 degrees of freedom. From the perspective of size, we want to examine how well this test works. We carried out simulations following the script below.

Choice of the parameters and steps of the simulations:

Step 1. Choose θ from {k(0.2)/12, k = 0, 1, 2, ..., 12}.
Step 2. Choose p1, p2, and p3 randomly from [0, 1] subject to p1 + p2 + p3 = 1.
Step 3. Choose total sample size n = 1000.
Step 4. For the chosen θ, simulate the multinomial distribution (1000; (1−θ)p1² + θp1, (1−θ)p2² + θp2, (1−θ)p3² + θp3, 2(1−θ)p1p2, 2(1−θ)p1p3, 2(1−θ)p2p3) and obtain the data n11, n22, n33, n12, n13, n23.
Step 5. Obtain the estimates of p1, p2, and p3 as outlined above.
Step 6. Obtain the estimates θ̂1, θ̂2, θ̂3.
Step 7. Initiate the iterative procedure outlined above for estimating θ.
Step 8. Calculate Σ0 using the estimates of θ, p1, p2, and p3.
Step 9. Calculate the test statistic Q as above.
Step 10. Choose level α = 0.05.
Step 11. Check whether or not H0 is rejected, i.e., whether Q > χ²(2, α).

For each fixed θ, repeat Steps 2 to 11 10,000 times. Calculate

Empirical size = (No. of times H0 is rejected)/10,000.

The results are given in Figure 5-2, where size is plotted against θ.

Figure 5-2 Empirical size of the Q-test for testing H0: θ1 = θ2 = θ3 = θ (θ unknown)

Comments: The empirical size holds true to the nominal size 0.05. The relevant Mathematica code is provided in Appendix 9.

In the following, we provide some technical details involved in the computation of the variance-covariance matrix Σ0, under H0, of the estimators θ̂1, θ̂2, θ̂3. The underlying technique is the delta method. The computations are carried out symbolically; Mathematica is used for the symbolic computations.
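The variance expressions that the delta method produces can also be sanity-checked by simulation. The sketch below (plain Python; the values θ = 0, p = (0.5, 0.3, 0.2), n = 1000 are chosen purely for illustration) draws multinomial genotype counts under H0 and compares the empirical variance of θ̂1 with the asymptotic formula for Var(θ̂1) stated below, which reduces to 1/n at θ = 0:

```python
import random

random.seed(2007)
n = 1000
theta = 0.0                      # H0 with theta known to be 0 (HWE)
p1, p2, p3 = 0.5, 0.3, 0.2       # illustrative allele frequencies

# genotype cell probabilities, in the order (n11, n12, n13, n22, n23, n33)
cell_probs = [(1 - theta) * p1 * p1 + theta * p1,
              2 * (1 - theta) * p1 * p2,
              2 * (1 - theta) * p1 * p3,
              (1 - theta) * p2 * p2 + theta * p2,
              2 * (1 - theta) * p2 * p3,
              (1 - theta) * p3 * p3 + theta * p3]
cum, s = [], 0.0
for pr in cell_probs:
    s += pr
    cum.append(s)

def draw_counts():
    """One multinomial draw of size n by inverse-CDF sampling."""
    counts = [0] * 6
    for _ in range(n):
        u = random.random()
        for i, c in enumerate(cum):
            if u <= c or i == len(cum) - 1:
                counts[i] += 1
                break
    return counts

def theta1_hat(c):
    n11, n12, n13, n22, n23, n33 = c
    return 1 - 2 * n * (n12 + n13) / ((2 * n11 + n12 + n13)
                                      * (n12 + n13 + 2 * n22 + 2 * n23 + 2 * n33))

reps = 2000
ests = [theta1_hat(draw_counts()) for _ in range(reps)]
avg = sum(ests) / reps
emp_var = sum((x - avg) ** 2 for x in ests) / (reps - 1)

# asymptotic variance from the delta method (equals 1/n when theta = 0)
asy_var = ((1 - theta) ** 2 * (1 - 2 * theta) / n
           + theta * (1 - theta) * (2 - theta) / (2 * n * p1 * (1 - p1)))
```

With these settings emp_var comes out close to asy_var = 0.001; rerunning with θ > 0 exercises the full formula.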
All the relevant calculations are incorporated in Appendices 8 and 9.

Technical details:

θ̂1 = f(n11, n12, n13, n22, n23, n33) = 1 − 2n(n12+n13)/[(2n11+n12+n13)(n12+n13+2n22+2n23+2n33)],
θ̂2 = g(n11, n12, n13, n22, n23, n33) = 1 − 2n(n12+n23)/[(2n22+n12+n23)(n12+n23+2n11+2n13+2n33)],
θ̂3 = h(n11, n12, n13, n22, n23, n33) = 1 − 2n(n13+n23)/[(2n33+n13+n23)(n13+n23+2n11+2n12+2n22)].

Under H0,

Var(θ̂1) = (1/n)(1−θ)²(1−2θ) + θ(1−θ)(2−θ)/[2n p1(1−p1)],
Var(θ̂2) = (1/n)(1−θ)²(1−2θ) + θ(1−θ)(2−θ)/[2n p2(1−p2)],
Var(θ̂3) = (1/n)(1−θ)²(1−2θ) + θ(1−θ)(2−θ)/[2n p3(1−p3)].

Let Σ be the variance-covariance matrix of θ̂1, θ̂2, θ̂3:

      | Var(θ̂1)        Cov(θ̂1,θ̂2)    Cov(θ̂1,θ̂3) |
Σ =   | Cov(θ̂1,θ̂2)    Var(θ̂2)        Cov(θ̂2,θ̂3) |
      | Cov(θ̂1,θ̂3)    Cov(θ̂2,θ̂3)    Var(θ̂3)     |

To derive the covariance terms, recall that θ̂1 = f(n11, ..., n33), θ̂2 = g(n11, ..., n33), θ̂3 = h(n11, ..., n33), where

(n11, n12, n13, n22, n23, n33) ~ Multinomial(n; (1−θ)p1² + θp1, 2(1−θ)p1p2, 2(1−θ)p1p3, (1−θ)p2² + θp2, 2(1−θ)p2p3, (1−θ)p3² + θp3).

We will now calculate Cov(θ̂1, θ̂2), Cov(θ̂1, θ̂3), and Cov(θ̂2, θ̂3) under H0.
Let A denote the statements

n11 = E n11 = n[p1² + θp1(p2+p3)],
n22 = E n22 = n[p2² + θp2(p1+p3)],
n33 = E n33 = n[p3² + θp3(p1+p2)],
n12 = E n12 = 2n(1−θ)p1p2,
n13 = E n13 = 2n(1−θ)p1p3,
n23 = E n23 = 2n(1−θ)p2p3.

Write S = n12 + n13, D1 = 2n11 + n12 + n13, and D2 = n12 + n13 + 2n22 + 2n23 + 2n33, so that θ̂1 = 1 − 2nS/(D1D2). Then, for example,

∂θ̂1/∂n11 = 4nS/(D1²D2),
∂θ̂1/∂n12 = −2n/(D1D2) + 2nS(D1+D2)/(D1D2)².

The quantity (∂θ̂1/∂n11)_A is obtained by plugging the expected values of the nij from statement A into the expression above, yielding an explicit, if lengthy, formula; (∂θ̂1/∂n12)_A is derived in the same way. In a similar vein, we collect all six derivative terms for θ̂1,

(∂θ̂1/∂n11)_A, (∂θ̂1/∂n12)_A, (∂θ̂1/∂n13)_A, (∂θ̂1/∂n22)_A, (∂θ̂1/∂n23)_A, (∂θ̂1/∂n33)_A,

the corresponding six derivative terms for θ̂2,

(∂θ̂2/∂n11)_A, (∂θ̂2/∂n12)_A, (∂θ̂2/∂n13)_A, (∂θ̂2/∂n22)_A, (∂θ̂2/∂n23)_A, (∂θ̂2/∂n33)_A,

and the corresponding six derivative terms for θ̂3,

(∂θ̂3/∂n11)_A, (∂θ̂3/∂n12)_A, (∂θ̂3/∂n13)_A, (∂θ̂3/∂n22)_A, (∂θ̂3/∂n23)_A, (∂θ̂3/∂n33)_A.

The explicit terms are included in Appendix 8.
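These analytic derivatives can be spot-checked numerically. The sketch below (plain Python; the values θ = 0, p = (0.5, 0.3, 0.2), n = 1000 are illustrative) evaluates ∂θ̂1/∂n11 at the expected counts of statement A both from the closed form and by a central finite difference:

```python
n = 1000
theta, p1, p2, p3 = 0.0, 0.5, 0.3, 0.2   # assumed values for the check

# expected counts under statement A
n11 = n * (p1 ** 2 + theta * p1 * (p2 + p3))
n22 = n * (p2 ** 2 + theta * p2 * (p1 + p3))
n33 = n * (p3 ** 2 + theta * p3 * (p1 + p2))
n12 = 2 * n * (1 - theta) * p1 * p2
n13 = 2 * n * (1 - theta) * p1 * p3
n23 = 2 * n * (1 - theta) * p2 * p3

def theta1_hat(n11, n12, n13, n22, n23, n33):
    d1 = 2 * n11 + n12 + n13
    d2 = n12 + n13 + 2 * n22 + 2 * n23 + 2 * n33
    return 1 - 2 * n * (n12 + n13) / (d1 * d2)

# analytic partial derivative with respect to n11, evaluated at A
d1 = 2 * n11 + n12 + n13
d2 = n12 + n13 + 2 * n22 + 2 * n23 + 2 * n33
analytic = 4 * n * (n12 + n13) / (d1 ** 2 * d2)

# central finite difference (counts treated as continuous)
h = 1e-3
numeric = (theta1_hat(n11 + h, n12, n13, n22, n23, n33)
           - theta1_hat(n11 - h, n12, n13, n22, n23, n33)) / (2 * h)
```

For these particular values both routes give ∂θ̂1/∂n11 ≈ 0.002; any of the 18 derivative terms can be checked the same way.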
By the delta method, with πij denoting the multinomial cell probability of genotype AiAj (so πii = (1−θ)pi² + θpi and πij = 2(1−θ)pipj for i < j),

Cov(θ̂1, θ̂2) = n Σ_{i≤j} (∂θ̂1/∂nij)_A (∂θ̂2/∂nij)_A πij − n [ Σ_{i≤j} (∂θ̂1/∂nij)_A πij ] · [ Σ_{i≤j} (∂θ̂2/∂nij)_A πij ].

In a similar vein,

Cov(θ̂1, θ̂3) = n Σ_{i≤j} (∂θ̂1/∂nij)_A (∂θ̂3/∂nij)_A πij − n [ Σ_{i≤j} (∂θ̂1/∂nij)_A πij ] · [ Σ_{i≤j} (∂θ̂3/∂nij)_A πij ],

Cov(θ̂2, θ̂3) = n Σ_{i≤j} (∂θ̂2/∂nij)_A (∂θ̂3/∂nij)_A πij − n [ Σ_{i≤j} (∂θ̂2/∂nij)_A πij ] · [ Σ_{i≤j} (∂θ̂3/∂nij)_A πij ],

all sums running over the six genotype cells (i ≤ j). Again, the Mathematica code and the individual terms can be found in Appendices 8 and 9.

5.6 Conclusions

Every 2x2 joint distribution of alleles in the bi-allelic case can be reformatted to bring the inbreeding coefficient θ into focus. This is not possible in the tri-allelic case (A1, A2, and A3). Consequently, formulating the HWE in that case is fraught with difficulties, and going straight after the inbreeding coefficient once the data are given creates computational complexities of a very high order. In this chapter, we made a systematic exploration of the issues involved. We have looked at three 2x2 joint distributions, A1 vs. (not A1), A2 vs. (not A2), A3 vs.
(not A3), stemming from the 3x3 joint distribution of the alleles, with corresponding inbreeding coefficients θ1, θ2, and θ3. We have shown that the 3x3 joint distribution of the alleles admits a single inbreeding coefficient θ if and only if θ1 = θ2 = θ3 = θ. This basic result paved the way to tackle the HWE problem. We proposed a chi-squared test statistic Q to test hypotheses about θ; this test is shown to have good power.

There is a caveat in the way we conducted the simulations. The allele frequencies p1 and p2 are randomly generated from [0, 1] subject to p1 + p2 ≤ 1, and the inbreeding coefficient θ has natural bounds set by the allele frequencies. In the simulations, whenever the natural bounds are violated, data are not generated, and only those cases in which data are generated are counted. This does not seem to impact the observed size (see Figure 5-2), and we believe it has no impact on power either (see Figure 5-1).

6 Generalization to multiple alleles

In Chapter 5, a comprehensive discussion was carried out of the HWE in the context of 3 alleles. A crucial result reducing the 3x3 joint distribution to several 2x2 joint distributions helped with the testing problem. In this chapter, we generalize the key result to the case of any number of alleles. When a specific locus has three or more alleles (k alleles), maximizing the likelihood over the parameter space Ω* with respect to all the marginals and the inbreeding coefficient θ does not lead to an explicit solution of the maximum likelihood equations. As in Chapter 5, we need to find a way to reduce the dimensionality of the problem; that is what we do in this chapter.

6.1 Formulation of the problem

For three or more alleles (k alleles), the joint distribution of the alleles A1, A2, ..., Ak can be represented in the following contingency table (Table 6-1).
Table 6-1 General joint distribution of genotypes of multiple alleles

              A1         A2         ...    Ak-1          Ak        Marginal
A1            p11        p12        ...    p1,k-1        p1,k      p1
A2            p21        p22        ...    p2,k-1        p2,k      p2
...           ...        ...        ...    ...           ...       ...
Ak-1          pk-1,1     pk-1,2     ...    pk-1,k-1      pk-1,k    pk-1
Ak            pk,1       pk,2       ...    pk,k-1        pk,k      pk
Marginal      p1         p2         ...    pk-1          pk        1

Let us use the generic symbol A (a kxk matrix) for the joint distribution. The marginal probabilities p1, p2, p3, ..., pk are usually called allele frequencies. The genotypes A1A2 and A2A1, for example, are not distinguishable. The matrix A has the following properties:

1. A is symmetric;
2. the two marginal distributions are the same.

The purpose of this chapter is to make inferences about the unknown A of the population of interest based on a random sample of individuals and their genotypes. The underlying parameter spaces are defined by:

1) Ω = {A : A is a joint distribution, symmetric, with identical marginals}. The dimension of Ω is k(k+1)/2 − 1.

2) ΩP = {all joint distributions with fixed marginals}, where p1, p2, p3, ..., pk are fixed and known. ΩP is a special subset of Ω of interest. The dimension of ΩP is k(k−1)/2.

3) Ω* = {A : A is of the form of Table 6-2 for some θ and marginals p1, p2, ..., pk}.

Definition: A joint distribution is of type Ω* if it is of the form of Table 6-2 for some θ, p1, p2, p3, ..., pk.

Table 6-2 A joint distribution of type Ω*

              A1                A2                ...    Ak-1                    Ak                Marginal
A1            (1−θ)p1² + θp1    (1−θ)p1p2         ...    (1−θ)p1pk-1             (1−θ)p1pk         p1
A2            (1−θ)p1p2         (1−θ)p2² + θp2    ...    (1−θ)p2pk-1             (1−θ)p2pk         p2
...           ...               ...               ...    ...                     ...               ...
Ak-1          (1−θ)p1pk-1       (1−θ)p2pk-1       ...    (1−θ)pk-1² + θpk-1      (1−θ)pk-1pk       pk-1
Ak            (1−θ)p1pk         (1−θ)p2pk         ...    (1−θ)pk-1pk             (1−θ)pk² + θpk    pk
Marginal      p1                p2                ...    pk-1                    pk                1

The dimension of Ω* is clearly k. The entity θ can be labeled the inbreeding coefficient. When θ = 0, equilibrium is achieved; the equilibrium distribution is given in Table 6-3. It is clear that Ω* is a subset of Ω.
4) Parameter space Ω*_{p1, p2, ..., pk} = {A ∈ Ω* : A has marginal frequencies p1, p2, ..., pk}. If the marginals p1, p2, p3, ..., pk are all known, only the parameter θ is left unknown; the dimension of Ω*_{p1, p2, ..., pk} is one.

Table 6-3 Joint distribution of genotypes under equilibrium (Ω0)

              A1         A2         ...    Ak-1         Ak         Marginal
A1            p1²        p1p2       ...    p1pk-1       p1pk       p1
A2            p1p2       p2²        ...    p2pk-1       p2pk       p2
...           ...        ...        ...    ...          ...        ...
Ak-1          p1pk-1     p2pk-1     ...    pk-1²        pk-1pk     pk-1
Ak            p1pk       p2pk       ...    pk-1pk       pk²        pk
Marginal      p1         p2         ...    pk-1         pk         1

6.2 Data and Likelihood

6.2.1 Data Structure

A random sample of n individuals is selected at random from the population of interest. Let nij (i ≤ j) be the number of individuals in the sample with genotype AiAj. The data are arranged in Table 6-4.

Table 6-4 Data collected for multiple alleles

              A1      A2      ...    Ak-1        Ak        Marginal
A1            n11     n12     ...    n1,k-1      n1,k      n1
A2                    n22     ...    n2,k-1      n2,k      n2
...                           ...    ...         ...       ...
Ak-1                                 nk-1,k-1    nk-1,k    nk-1
Ak                                               nk,k      nk
Marginal      n1      n2      ...    nk-1        nk        n

6.2.2 Maximum Likelihood Estimators

The likelihood of the data is given by

L(A) = Π_{i=1}^{k} (pii)^{nii} Π_{i<j} (2pij)^{nij}.

1. Maximize L over all A ∈ Ω. The maximum likelihood estimators are given by

p̂ii = nii/n, i = 1, 2, ..., k, and 2p̂ij = nij/n, i < j.

2. Maximize L over A ∈ Ω*. This is an intractable optimization problem.

3. Maximize L over Ω*_{p1, p2, ..., pk}. Suppose the marginal frequencies p1, p2, p3, ..., pk are known; the only unknown parameter is θ. The likelihood is given by

L(A) = Π_{i=1}^{k} [(1−θ)pi² + θpi]^{nii} Π_{i<j} [2(1−θ)pipj]^{nij},

so that

ln L(θ) = Σ_{i=1}^{k} nii ln[(1−θ)pi² + θpi] + Σ_{i<j} nij ln(1−θ) + constant,

∂lnL/∂θ = Σ_{i=1}^{k} nii(pi − pi²)/[(1−θ)pi² + θpi] − Σ_{i<j} nij/(1−θ).

Set ∂lnL/∂θ = 0 to solve for θ̂. Let η = θ/(1−θ). The likelihood equation is given by

Σ_{i=1}^{k} nii(pi − pi²)/(pi² + ηpi) = Σ_{i<j} nij.

This is a polynomial equation in η of degree k, and its solution is tractable.
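The direct-maximization device used in the tri-allelic case carries over verbatim to k alleles, since the heterozygote cells contribute only a common ln(1−θ) term. A sketch (plain Python; the function and container names are illustrative):

```python
import math

def theta_mle_k(hom, het, p):
    """One-parameter MLE of theta for a k-allelic locus with known allele
    frequencies (a sketch under assumed data containers).
    hom[i]    : count of homozygote AiAi
    het[(i,j)]: count of heterozygote AiAj, i < j
    p[i]      : allele frequency of Ai."""
    n_het = sum(het.values())

    def log_lik(t):
        if t >= 1.0:
            return -math.inf
        # the p_i p_j factors are constants in theta, so only ln(1-t) matters
        ll = n_het * math.log(1.0 - t)
        for i, c in hom.items():
            pr = (1.0 - t) * p[i] ** 2 + t * p[i]
            if pr <= 0.0:
                return -math.inf  # theta below its natural lower bound
            ll += c * math.log(pr)
        return ll

    # shrinking grid search; equivalently one could solve the degree-k
    # polynomial in eta = theta/(1-theta) and pick the maximizing root
    center, width = 0.0, 1.0
    for _ in range(30):
        grid = [center + width * j / 20 for j in range(-20, 21)]
        center = max(grid, key=log_lik)
        width /= 3
    return center
```

For the tri-allelic equal-counts data this returns θ̂ = 0.25; fed exact Hardy-Weinberg expected counts for a hypothetical 4-allele locus (p = 0.4, 0.3, 0.2, 0.1, n = 1000) it returns θ̂ ≈ 0.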
Once the data and the marginal frequencies are given, good software should be able to provide all the roots of the polynomial; from the set of roots, one picks the root that maximizes the likelihood.

6.3 Lower dimensional joint distributions

As in Chapter 5, we will identify 2x2 joint distributions whose inbreeding coefficients uniquely determine the inbreeding coefficient of the kxk joint distribution. Let θi be the inbreeding coefficient stemming from the 2x2 joint distribution associated with the case Ai vs. (not Ai), i = 1, 2, ..., k. For i < j, let θij be the inbreeding coefficient stemming from the 2x2 joint distribution associated with the case (Ai or Aj) vs. (not Ai and not Aj). The following is the main result.

Theorem. A joint distribution A of the genotypes is of the type Ωθ for some θ if and only if θi = θ for every i and θij = θ for every i < j.

Proof. It is easy to show that if A is of type Ωθ, then θi = θ for every i. For example, look at the 2x2 joint distribution associated with A1 vs. (not A1):

              A1                      not A1          Marginal
A1            (1−θ)p1² + θp1          Σ_{i=2}^{k} (1−θ)p1pi    p1
not A1        Σ_{i=2}^{k} (1−θ)p1pi   *               p2+...+pk
Marginal      p1                      p2+...+pk       1

where * is obtained by subtraction. From the table, it follows that θ1 = θ. In a similar fashion, one can show that θ1 = θ2 = ... = θk = θ.

Now look at the 2x2 joint distribution associated with (A1 or A2) vs. (not A1 and not A2):

                      A1 or A2                                     not A1 and not A2    Marginal
A1 or A2              (1−θ)(p1² + p2²) + θ(p1+p2) + 2(1−θ)p1p2     a                    p1+p2
not A1 and not A2     b                                            c                    p3+...+pk
Marginal              p1+p2                                        p3+...+pk            1

where a, b, and c are obtained by subtraction. Note that

(1−θ)(p1² + p2²) + θ(p1+p2) + 2(1−θ)p1p2 = (1−θ)(p1+p2)² + θ(p1+p2),

which implies that θ12 = θ. In a similar fashion, one can show that θij = θ for every i < j.

Let us prove the converse.
Suppose θi = θ for every i and θij = θ for every i < j. We want to show that, in A = (pij), pii = (1−θ)pi² + θpi for i = 1, 2, ..., k and pij = (1−θ)pipj for every i < j. The diagonal cells follow directly from the hypothesis θi = θ applied to the case Ai vs. (not Ai). Looking at the case (Ai or Aj) vs. (not Ai and not Aj) (i < j), we have

pii + 2pij + pjj = (1−θij)(pi+pj)² + θij(pi+pj) = (1−θ)(pi+pj)² + θ(pi+pj).

Since pii = (1−θ)pi² + θpi and pjj = (1−θ)pj² + θpj, we get

2pij = (1−θ)(pi+pj)² + θ(pi+pj) − pii − pjj = 2(1−θ)pipj,

which implies pij = (1−θ)pipj. This completes the proof.

Remarks: The number of 2x2 joint distributions required for a unique determination of the inbreeding coefficient of the kxk joint distribution is k + k(k−1)/2 = k(k+1)/2. This is an upper bound; for lower values of k there is some duplication of the θi's with the θij's. Table 6-5 below gives a list for k ∈ {3, 4, 5}.

Table 6-5 2x2 distributions required for a test about the inbreeding coefficient

k = 3 (total 3):
  A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3)

k = 4 (total 7):
  A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3); A4 vs. (not A4);
  (A1 or A2) vs. (A3 or A4); (A1 or A3) vs. (A2 or A4); (A1 or A4) vs. (A2 or A3)

k = 5 (total 15):
  A1 vs. (not A1); A2 vs. (not A2); A3 vs. (not A3); A4 vs. (not A4); A5 vs. (not A5);
  (A1 or A2) vs. (A3, A4 or A5); (A1 or A3) vs. (A2, A4 or A5); (A1 or A4) vs. (A2, A3 or A5);
  (A1 or A5) vs. (A2, A3 or A4); (A2 or A3) vs. (A1, A4 or A5); (A2 or A4) vs. (A1, A3 or A5);
  (A2 or A5) vs. (A1, A3 or A4); (A3 or A4) vs. (A1, A2 or A5); (A3 or A5) vs. (A1, A2 or A4);
  (A4 or A5) vs. (A1, A2 or A3)

k = 6 (total 21): unlisted

And so on. We have not yet exploited this result to build tests on the inbreeding coefficient.

7 Conclusions and Future Research

The broad theme of the research carried out in this dissertation is association studies in genetics.
The work done can be classified into three segments. In the first segment (Chapter 3), association between a bi-allelic gene and a quantitative phenotype is the main focus. An additive model is assumed, exemplifying the connection between the genotypes of the (true) gene and the phenotype of interest. The true gene is unknown. Data are collected on the phenotype of subjects classified according to the genotypes of a gene under investigation. The ANOVA method is the standard recipe for examining whether the gene under investigation is associated with the phenotype. We have discovered that the assumptions needed for the validity of the ANOVA method are not met: normality fails, and homogeneity of variances does not hold. Bartlett's test, which tests homogeneity of variances, is a viable alternative, although it needs normality for its own validity. We made a comparison of the performance of both procedures in terms of power and have shown that the ANOVA procedure is superior to Bartlett's test. Violation of the normality condition is, we believe, the main reason for the poor performance of Bartlett's test. The ANOVA procedure appears robust despite the violation of its assumptions. There are a number of tests of homogeneity of variances (e.g., Levene's test) available in the literature that are more robust than Bartlett's test. Future work will consist of examining the details of these tests and making a comprehensive comparison of their powers.

In the second segment (Chapter 4), the focus is on testing hypotheses about the Hardy-Weinberg equilibrium. The standard method used is the χ2-test. The χ2-test is not usable when one wants to entertain one-sided alternatives. Two alternative tests (the Z-test and Siegmund's test) are considered to fill this lacuna. We have shown that these two tests are essentially the same. A sample size formula has been established using the Z-test. A comparison of the sample sizes obtained using the Z-test and Ward and Sing's χ2-test is made.
The sample sizes under the Z-test are smaller than those given by the χ2-test. In the future, we want to develop an R code for sample size calculations. After it is developed, we want to make it available to R users and seek their input.

Examining Hardy-Weinberg equilibrium issues in the case of a tri-allelic gene is fraught with mathematical, statistical, and computational difficulties. This problem is taken up in the third segment (Chapter 5). Two solutions have been proposed. One is to find the maximum likelihood estimate of the inbreeding coefficient after the standard estimates of the allele frequencies have been plugged into the likelihood equation; the resulting equation is a third-degree polynomial in the inbreeding coefficient. The other solution is to reduce the problem to several 2x2 joint distributions: testing about the inbreeding coefficient is then equivalent to testing the equality of the several inbreeding coefficients stemming from the 2x2 distributions. A test statistic, which is a quadratic form, has been proposed to carry out the test, and it is shown to have good power. In a future endeavor, we want to compare the performance of the two procedures in terms of power.

In Chapter 6, we have broadened the Hardy-Weinberg equilibrium problem to multi-allelic cases. The details have not been completely worked out, and we want to pursue the extension further. As the number of alleles increases, computational complexity increases; we want to develop an R code to ease the heavy burden of computation.

Bibliographic references

Barr, J (1991) Liver slices in dynamic organ culture. II. An in vitro cellular technique for the study of integrated drug metabolism using human tissue. Xenobiotica; 21(3): 341-350.

Chakraborty, R, Hanis, CL and Boerwinkle, E (1986) Effect of a marker locus on the quantitative variability of a risk factor to chronic diseases. Am. J. Hum. Genet.; 39: A231.
Chakraborty, R and Zhong, Y (1994) Statistical power of an exact test of Hardy-Weinberg proportions of genotypic data at a multiallelic locus. Hum. Hered.; 44: 1-9.

Choudhry, S and Coyle, NE (2006) Population stratification confounds genetic association studies. Hum. Genet.; 118: 652-664.

Cockerham, CC (1973) Analyses of gene frequencies. Genetics; 74: 679-700.

Cramer, H (1961) Mathematical Methods of Statistics. Princeton University Press, USA.

Crow, JF and Kimura, M (1970) An Introduction to Population Genetics Theory. Harper and Row, New York.

Curie-Cohen, M (1982) Estimates of inbreeding in a natural population: A comparison of sampling properties. Genetics; 100: 339-358.

Guo, SW and Thompson, EA (1992) Performing the exact test of Hardy-Weinberg proportion for multiple alleles. Biometrics; 48: 361-372.

Haber, M (1980) Detection of inbreeding effects by the χ2 test on genotypic and phenotypic frequencies. Am. J. Hum. Genet.; 32: 754-760.

Lessios, H (1992) Testing electrophoretic data for agreement with Hardy-Weinberg expectations. Mar. Biol.; 112: 517-523.

Li, CC (1969) Population subdivision with respect to multiple alleles. Ann. Hum. Genet., Lond.; 33: 23-29.

Li, CC and Horvitz, DG (1953) Some methods of estimating the inbreeding coefficient. Am. J. Hum. Genet.; 5(2): 107-117.

Milliken, GA and Johnson, DE (1989) Analysis of Messy Data. Van Nostrand Reinhold, New York.

Pamilo, P and Varvio-Aho, S (1984) Testing genotype frequencies and heterozygosities. Mar. Biol.; 79: 99-100.

Ralls, K, Brugger, K and Ballou, J (1979) Inbreeding and juvenile mortality in small populations of ungulates. Science; 206: 1101-1103.

Rao, CR (1983) Linear Statistical Inference and Its Applications. Wiley, New York.

Senner, JW (1980) Inbreeding and the survival of zoo populations. In: Conservation Biology (M. Soule and B. Wilcox, eds.), pp. 209-229. Sinauer Associates, Sunderland, Massachusetts.
Ward, R and Sing, C (1970) A consideration of the power of the χ2 test to detect inbreeding effects in natural populations. American Naturalist; 104(938): 355-365.

Weir, B (1996) Genetic Data Analysis II. Sinauer Associates, Inc., Sunderland, Massachusetts.

Wright, S (1977) Evolution and the Genetics of Populations, Vol. 3. University of Chicago Press, Chicago.

Appendices

Mathematical Derivations, SAS Programs and Mathematica Codes

Appendix 1: Derivation of Conditional Expectations and Variances

In the additive model presented in Chapter 3, conditional means and variances are needed to check the assumptions of ANOVA. The calculations are set out here. Throughout, the haplotype frequencies are PMA = PM PA + Δ, PMa = PM Pa − Δ, PmA = Pm PA − Δ and Pma = Pm Pa + Δ, so that PMA + PmA = PA and PMa + Pma = Pa.

For genotype AA of the marker gene G′, the phenotype is the normal mixture

P | G′ = AA ~ [PMA² N(λ, σ²) + 2 PMA PmA N(0, σ²) + PmA² N(−λ, σ²)] / PA².

Hence

E(P | G′ = AA) = λ(PMA² − PmA²)/PA²
= λ(PMA + PmA)(PMA − PmA)/PA²
= λ PA [PA(PM − Pm) + 2Δ]/PA²
= λ(PM − Pm) + 2Δλ/PA.

E(P² | G′ = AA) = [PMA²(λ² + σ²) + 2 PMA PmA σ² + PmA²(λ² + σ²)]/PA²
= σ² + λ²(PMA² + PmA²)/PA²,

so that

Var(P | G′ = AA) = σ² + λ²(PMA² + PmA²)/PA² − λ²(PMA² − PmA²)²/PA⁴
= σ² + (λ²/PA²)[(PMA² + PmA²) − (PMA − PmA)²]
= σ² + 2λ² PMA PmA/PA²
= σ² + 2λ²(PM PA + Δ)(Pm PA − Δ)/PA²
= σ² + 2λ² PM Pm + 2Δλ²(Pm − PM)/PA − 2Δ²λ²/PA².

Also, for genotype Aa of the marker gene G′:

P | G′ = Aa ~ [2 PMA PMa N(λ, σ²) + (2 PMA Pma + 2 PMa PmA) N(0, σ²) + 2 PmA Pma N(−λ, σ²)] / (2 PA Pa).

E(P | G′ = Aa) = λ(PMA PMa − PmA Pma)/(PA Pa)
= λ[(PM² − Pm²)PA Pa + Δ(Pa − PA)(PM + Pm)]/(PA Pa)
= λ(PM − Pm) − Δλ(PA − Pa)/(PA Pa),

using PM + Pm = 1. Next,

E(P² | G′ = Aa) = σ² + λ²(PMA PMa + PmA Pma)/(PA Pa),

and, expanding the haplotype products,

PMA PMa + PmA Pma = (PM² + Pm²)PA Pa + Δ(PM − Pm)(Pa − PA) − 2Δ².

Therefore

Var(P | G′ = Aa) = σ² + λ²(PM² + Pm²) + λ²[Δ(PM − Pm)(Pa − PA) − 2Δ²]/(PA Pa) − [λ(PM − Pm) − Δλ(PA − Pa)/(PA Pa)]²
= σ² + 2λ² PM Pm + Δλ²(PM − Pm)(PA − Pa)/(PA Pa) − Δ²λ²(PA² + Pa²)/(PA² Pa²).

Again, for genotype aa of the marker gene G′:

P | G′ = aa ~ [PMa² N(λ, σ²) + 2 PMa Pma N(0, σ²) + Pma² N(−λ, σ²)] / Pa².

E(P | G′ = aa) = λ(PMa² − Pma²)/Pa²
= λ(PMa + Pma)(PMa − Pma)/Pa²
= λ Pa [Pa(PM − Pm) − 2Δ]/Pa²
= λ(PM − Pm) − 2Δλ/Pa.

By the same argument as for genotype AA,

Var(P | G′ = aa) = σ² + 2λ² PMa Pma/Pa²
= σ² + 2λ²(PM Pa − Δ)(Pm Pa + Δ)/Pa²
= σ² + 2λ² PM Pm + 2Δλ²(PM − Pm)/Pa − 2Δ²λ²/Pa².

Therefore, the derived conditional expectations and variances are summarized as follows:

E(P | G′ = AA) = λ(PM − Pm) + 2Δλ/PA
Var(P | G′ = AA) = σ² + 2λ² PM Pm + 2Δλ²(Pm − PM)/PA − 2Δ²λ²/PA²

E(P | G′ = Aa) = λ(PM − Pm) − Δλ(PA − Pa)/(PA Pa)
Var(P | G′ = Aa) = σ² + 2λ² PM Pm + Δλ²(PM − Pm)(PA − Pa)/(PA Pa) − Δ²λ²(PA² + Pa²)/(PA² Pa²)

E(P | G′ = aa) = λ(PM − Pm) − 2Δλ/Pa
Var(P | G′ = aa) = σ² + 2λ² PM Pm + 2Δλ²(PM − Pm)/Pa − 2Δ²λ²/Pa²

Appendix 2: SAS code for different scenarios of the ANOVA test

Power calculations under the ANOVA test are set out in this appendix in the form of SAS macros.
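Before the macros, the simulation scheme may be easier to follow in compact form. The following is a rough Python/numpy sketch of a single replicate, not part of the dissertation's SAS code; the parameter values are illustrative and mirror the %sim macro arguments below.

```python
# Sketch of one replicate of the Chapter 3 simulation (assumed parameter
# values; numpy stands in for SAS's RANTBL/RANNOR calls).
import numpy as np

rng = np.random.default_rng(1)
delta, lamda, v, pm, pa, n = 0.0625, 1.0, 1.0, 0.5, 0.5, 200

# Haplotype frequencies for (true gene M/m) x (marker A/a), with LD = delta.
hap = np.array([(1 - pm) * (1 - pa) + delta,   # M A
                (1 - pm) * pa - delta,         # M a
                pm * (1 - pa) - delta,         # m A
                pm * pa + delta])              # m a
h1 = rng.choice(4, size=n, p=hap)              # two haplotypes per subject
h2 = rng.choice(4, size=n, p=hap)
m_copies = (h1 < 2).astype(int) + (h2 < 2).astype(int)          # M alleles: 0, 1, 2
a_copies = (h1 % 2 == 0).astype(int) + (h2 % 2 == 0).astype(int)

# Additive model on the true gene: means lamda (MM), 0 (Mm), -lamda (mm).
pheno = lamda * (m_copies - 1) + np.sqrt(v) * rng.standard_normal(n)
groups = [pheno[a_copies == g] for g in (0, 1, 2)]

# One-way ANOVA F statistic, grouping by the marker genotype.
N, k, gm = n, len(groups), pheno.mean()
ssb = sum(len(g) * (g.mean() - gm) ** 2 for g in groups)
ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
F = (ssb / (k - 1)) / (ssw / (N - k))

# Bartlett's statistic for homogeneity of the three group variances.
sp2 = ssw / (N - k)
T = (N - k) * np.log(sp2) - sum((len(g) - 1) * np.log(g.var(ddof=1)) for g in groups)
T /= 1 + (sum(1 / (len(g) - 1) for g in groups) - 1 / (N - k)) / (3 * (k - 1))
```

F is referred to an F(k − 1, N − k) distribution and T to a χ²(k − 1) distribution, just as PROC ANOVA with HOVTEST=BARTLETT does in the macro below.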
(Chapter 3)

options nosource nodate;
ods trace off;

%macro sim(delta=0,lamda=1,v=1,Pm=0.5,Pa=0.5,sample=200,alphalevel=0.05,
           seed_0=0,seed_1=0,simu=10);
data one;
  retain sd0-sd1 (&seed_0 &seed_1);
  BmBm=(1-&pm)**2; BmLm=2*(1-&pm)*&pm; LmLm=(&pm)**2;
  BaBa=(1-&pa)**2; BaLa=2*(1-&pa)*&pm; LaLa=(&pa)**2;
  BmBa=(1-&pm)*(1-&pa)+&delta;
  BmLa=(1-&pm)*&pa-&delta;
  LmBa=&pm*(1-&pa)-&delta;
  LmLa=&pm*&pa+&delta;
  p1=BmBa**2;
  p2=2*BmBa*BmLa;
  p3=BmLa**2;
  p4=2*BmBa*LmBa;
  p5=2*BmBa*LmLa+2*BmLa*LmBa;
  p6=2*BmLa*LmLa;
  p7=LmBa**2;
  p8=2*LmBa*LmLa;
  p9=LmLa**2;
  g1=BmBa**2/0.25;      g4=2*BmBa*LmBa/0.25;             g7=LmBa**2/0.25;
  g2=BmBa*BmLa/0.25;    g5=(BmBa*LmLa+BmLa*LmBa)/0.25;   g8=LmBa*LmLa/0.25;
  g3=BmLa**2/0.25;      g6=2*BmLa*LmLa/0.25;             g9=LmLa**2/0.25;
  /* p1=f(MM, AA)  p2=f(MM, Aa)  p3=f(MM, aa)
     p4=f(Mm, AA)  p5=f(Mm, Aa)  p6=f(Mm, aa)
     p7=f(mm, AA)  p8=f(mm, Aa)  p9=f(mm, aa) */
  mean=&lamda*(1-2*&Pm);
  VBaBa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(2*&Pm-1)/(1-&pa)
        -2*&delta**2*&lamda**2/(1-&pa)**2;
  VBaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+&delta*&lamda**2*(1-2*&pa)*(1-2*&pm)/(1-&pa)/&pa
        -&delta**2*&lamda**2*((1-&pa)**2+&pa**2)/(1-&pa)**2/&pa**2;
  VLaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(1-2*&Pm)/&pa
        -2*&delta**2*&lamda**2/&pa**2;
  mBaBa=mean+2*&delta*&lamda/(1-&pa);
  mBaLa=mean-&delta*&lamda*(1-2*&pa)/(1-&pa)/&pa;
  mLaLa=mean-2*&delta*&lamda/&pa;
  do i=1 to &simu;
    do j=1 to &sample;
      call rantbl (sd0,p1,p2,p3,p4,p5,p6,p7,p8,p9,g);
      call rannor (sd1,x);
      if g=1 then do; genotype='MM AA'; genotype_M='MM'; genotype_A='AA'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=2 then do; genotype='MM Aa'; genotype_M='MM'; genotype_A='Aa'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=3 then do; genotype='MM aa'; genotype_M='MM'; genotype_A='aa'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=4 then do; genotype='Mm AA'; genotype_M='Mm'; genotype_A='AA'; phenotype=0+sqrt(&V)*x; end;
      else if g=5 then do; genotype='Mm Aa'; genotype_M='Mm'; genotype_A='Aa'; phenotype=0+sqrt(&V)*x; end;
      else if g=6 then do; genotype='Mm aa'; genotype_M='Mm'; genotype_A='aa'; phenotype=0+sqrt(&V)*x; end;
      else if g=7 then do; genotype='mm AA'; genotype_M='mm'; genotype_A='AA'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      else if g=8 then do; genotype='mm Aa'; genotype_M='mm'; genotype_A='Aa'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      else if g=9 then do; genotype='mm aa'; genotype_M='mm'; genotype_A='aa'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      sim=i;
      output;
    end;
  end;
run;

ods output bartlett=bart;
ods listing close;
/*ods output modelanova=anova_raw bartlett=bartlett_raw welch=welch_raw;*/
proc anova data=one outstat=anova_raw;
  class genotype_A;
  by sim;
  model phenotype=genotype_A;
  means genotype_A /HOVTEST=BARTLETT;
run;
ods listing;
/*ods output close;*/

data anova_raw; set anova_raw; where _type_^='ERROR';
  if prob<=&alphalevel then reject=1; else reject=0;
run;
data bartlett_raw; set bart;
  if probchisq<=&alphalevel then reject=1; else reject=0;
run;
proc freq noprint data=anova_raw;    tables reject /out=anova;    run;
proc freq noprint data=bartlett_raw; tables reject /out=bartlett; run;
data anova;    set anova;    prob_anova=percent/100;    if reject^=1 then delete; keep prob_anova;    run;
data bartlett; set bartlett; prob_bartlett=percent/100; if reject^=1 then delete; keep prob_bartlett; run;

data result;
  merge anova bartlett;
  delta=&delta; lamda=&lamda; v=&v; Pm=&pm; Pa=&pa; Sample=&sample; Sim=&simu;
  keep delta lamda v pm pa sample prob_anova prob_bartlett sim;
  label prob_anova='Power of Anova'
        prob_bartlett='Power of Bartlett’s test'
        Sim='Simulation Times';
  file 'C:\Personal Folder\power.txt' mod;
  put delta lamda v pm pa sample prob_anova prob_bartlett sim;
run;
proc print data=result;
  var delta lamda v pm pa sample prob_anova prob_bartlett sim;
run;
quit;
%mend sim;

%sim (delta=0.0625,lamda=1,v=1,Pm=0.5,Pa=0.5,sample=200,alphalevel=0.05);

axis1 offset=(11,11);
symbol1 color=red interpol=none value=dot height=0.5;
proc gplot data=one;
  plot phenotype*genotype/ haxis=axis1 hminor=2 vaxis=axis2 vminor=1;
run;
proc boxplot data=one;
  plot phenotype*genotype/ haxis=axis1 hminor=2 vaxis=axis2 vminor=1;
run;
proc capability data=one noprint;
  spec lsl=6.8 llsl=2 clsl=black;
  cdf phenotype / cframe = ligr legend = legend2;
run;

**** Draw plot by AA Aa aa ****;
proc sort data=one; by genotype_A; run;
proc univariate data=one;
  by genotype_A;
  var phenotype;
  probplot phenotype;
  histogram phenotype /normal(noprint) outhistogram=raw_graph midpoints=-10 to 10 by 0.2;
run;
/*proc print data=raw_graph; run;
data graph; set raw_graph;
  if genotype='Aa' then _obspct_=_obspct_+50;
  else if genotype='AA' then _obspct_=_obspct_+100;
run; */
proc sort data=raw_graph; by genotype_A descending _midpt_; run;
symbol1 color=red    interpol=join value=dot     height=0.5;
symbol2 color=blue   interpol=join value=star    height=0.5;
symbol3 color=yellow interpol=join value=diamond height=0.5;
proc gplot data=raw_graph; plot _obspct_*_midpt_=genotype_A; run;

**** Draw plot by MM Mm mm ****;
proc sort data=one; by genotype_M; run;
proc univariate data=one;
  by genotype_M;
  var phenotype;
  probplot phenotype;
  histogram phenotype /normal(noprint) outhistogram=raw_graph midpoints=-10 to 10 by 0.2;
run;
/*proc print data=raw_graph; run;
data graph; set raw_graph;
  if genotype='Aa' then _obspct_=_obspct_+50;
  else if genotype='AA' then _obspct_=_obspct_+100;
run; */
proc sort data=raw_graph; by genotype_M descending _midpt_; run;
symbol1 color=red    interpol=join value=dot     height=0.5;
symbol2 color=blue   interpol=join value=star    height=0.5;
symbol3 color=yellow interpol=join value=diamond height=0.5;
proc gplot data=raw_graph; plot _obspct_*_midpt_=genotype_M; run;
quit;

Appendix 3: SAS code for the power comparison of the ANOVA and Bartlett tests

SAS macros for the power comparison are presented.
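As with Appendix 2, a rough Python analogue of the power comparison is sketched here for orientation; it is not the SAS macros themselves. The parameters are illustrative, and the critical values are hard-coded (5.991 is the χ²(2) 95th percentile; 3.04 is an approximation to the F(2, 197) 95th percentile).

```python
# Sketch: estimate the powers of ANOVA and of Bartlett's test for detecting
# delta > 0 by Monte Carlo, as the %power macro below does.
import numpy as np

def one_replicate(rng, delta, lamda=1.0, v=1.0, pm=0.5, pa=0.5, n=200):
    """Return (ANOVA F, Bartlett T) for one simulated sample."""
    hap = np.array([(1 - pm) * (1 - pa) + delta, (1 - pm) * pa - delta,
                    pm * (1 - pa) - delta, pm * pa + delta])
    h1, h2 = rng.choice(4, size=(2, n), p=hap)
    m = (h1 < 2).astype(int) + (h2 < 2).astype(int)
    a = (h1 % 2 == 0).astype(int) + (h2 % 2 == 0).astype(int)
    y = lamda * (m - 1) + np.sqrt(v) * rng.standard_normal(n)
    groups = [y[a == g] for g in (0, 1, 2)]
    k, N, gm = len(groups), n, y.mean()
    ssb = sum(len(g) * (g.mean() - gm) ** 2 for g in groups)
    ssw = sum(((g - g.mean()) ** 2).sum() for g in groups)
    F = (ssb / (k - 1)) / (ssw / (N - k))
    T = (N - k) * np.log(ssw / (N - k)) \
        - sum((len(g) - 1) * np.log(g.var(ddof=1)) for g in groups)
    T /= 1 + (sum(1 / (len(g) - 1) for g in groups) - 1 / (N - k)) / (3 * (k - 1))
    return F, T

rng = np.random.default_rng(2)
sim_stats = np.array([one_replicate(rng, delta=0.1) for _ in range(200)])
power_anova = np.mean(sim_stats[:, 0] > 3.04)      # approx. F(2, 197) cutoff
power_bartlett = np.mean(sim_stats[:, 1] > 5.991)  # chi-square(2) cutoff
```

Plotting power_anova and power_bartlett over a grid of delta values reproduces the comparison drawn by the %power macro.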
(Chapter 3)

options nosource nodate;
ods trace off;

%macro sim(delta=0.1,v=1,sample=200,alphalevel=0.05,seed_0=0,seed_1=0,simu=1000);
data one;
  retain sd0-sd1 (&seed_0 &seed_1);
  do i=1 to &simu;
    lamda=-1+2*ranuni(0);
    pm=ranuni(0);
    pa=ranuni(0);
    %let pm=pm; %let pa=pa; %let lamda=lamda;
    BmBm=(1-pm)**2; BmLm=2*(1-pm)*pm; LmLm=(pm)**2;
    BaBa=(1-pa)**2; BaLa=2*(1-pa)*pm; LaLa=(pa)**2;
    BmBa=(1-pm)*(1-pa)+&delta;
    BmLa=(1-pm)*pa-&delta;
    LmBa=pm*(1-pa)-&delta;
    LmLa=pm*pa+&delta;
    p1=BmBa**2;
    p2=2*BmBa*BmLa;
    p3=BmLa**2;
    p4=2*BmBa*LmBa;
    p5=2*BmBa*LmLa+2*BmLa*LmBa;
    p6=2*BmLa*LmLa;
    p7=LmBa**2;
    p8=2*LmBa*LmLa;
    p9=LmLa**2;
    p=p1+p2+p3+p4+p5+p6+p7+p8+p9;
    /* p1=f(MM, AA)  p2=f(MM, Aa)  p3=f(MM, aa)
       p4=f(Mm, AA)  p5=f(Mm, Aa)  p6=f(Mm, aa)
       p7=f(mm, AA)  p8=f(mm, Aa)  p9=f(mm, aa) */
    mean=&lamda*(1-2*Pm);
    VBaBa=&v+2*&lamda**2*(1-&Pm)*Pm+2*&delta*&lamda**2*(2*Pm-1)/(1-pa)
          -2*&delta**2*&lamda**2/(1-pa)**2;
    VBaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+&delta*&lamda**2*(1-2*&pa)*(1-2*&pm)/(1-&pa)/&pa
          -&delta**2*&lamda**2*((1-&pa)**2+&pa**2)/(1-&pa)**2/&pa**2;
    VLaLa=&v+2*&lamda**2*(1-&Pm)*&Pm+2*&delta*&lamda**2*(1-2*&Pm)/&pa
          -2*&delta**2*&lamda**2/&pa**2;
    mBaBa=mean+2*&delta*&lamda/(1-&pa);
    mBaLa=mean-&delta*&lamda*(1-2*&pa)/(1-&pa)/&pa;
    mLaLa=mean-2*&delta*&lamda/&pa;
    do j=1 to &sample;
      call rantbl (sd0,p1,p2,p3,p4,p5,p6,p7,p8,p9,g);
      call rannor (sd1,x);
      if g=1 then do; genotype='MM AA'; genotype_M='MM'; genotype_A='AA'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=2 then do; genotype='MM Aa'; genotype_M='MM'; genotype_A='Aa'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=3 then do; genotype='MM aa'; genotype_M='MM'; genotype_A='aa'; phenotype=&lamda+sqrt(&V)*x; end;
      else if g=4 then do; genotype='Mm AA'; genotype_M='Mm'; genotype_A='AA'; phenotype=0+sqrt(&V)*x; end;
      else if g=5 then do; genotype='Mm Aa'; genotype_M='Mm'; genotype_A='Aa'; phenotype=0+sqrt(&V)*x; end;
      else if g=6 then do; genotype='Mm aa'; genotype_M='Mm'; genotype_A='aa'; phenotype=0+sqrt(&V)*x; end;
      else if g=7 then do; genotype='mm AA'; genotype_M='mm'; genotype_A='AA'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      else if g=8 then do; genotype='mm Aa'; genotype_M='mm'; genotype_A='Aa'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      else if g=9 then do; genotype='mm aa'; genotype_M='mm'; genotype_A='aa'; phenotype=(-&lamda)+sqrt(&V)*x; end;
      sim=i;
      output;
    end;
  end;
run;

ODS SELECT NONE;
ods output bartlett=bart;
proc anova data=one outstat=anova_raw;
  class genotype_A;
  by sim;
  model phenotype=genotype_A;
  means genotype_A /HOVTEST=BARTLETT;
run;
/*ods output close;*/

data anova_raw; set anova_raw; where _type_^='ERROR';
  if prob<=&alphalevel then reject=1; else reject=0;
run;
data bartlett_raw; set bart;
  if probchisq<=&alphalevel then reject=1; else reject=0;
run;
proc freq noprint data=anova_raw;    tables reject /out=anova;    run;
proc freq noprint data=bartlett_raw; tables reject /out=bartlett; run;
data anova;    set anova;    prob_anova=percent/100;    if reject^=1 then delete; keep prob_anova;    run;
data bartlett; set bartlett; prob_bartlett=percent/100; if reject^=1 then delete; keep prob_bartlett; run;

data result;
  merge anova bartlett;
  delta=&delta; theata=&delta; v=&v; Sample=&sample; Sim=&simu;
  keep delta theata v sample prob_anova prob_bartlett sim;
  label prob_anova='Power of Anova'
        prob_bartlett='Power of Bartlett’s test'
        Sim='Simulation Times';
run;
quit;
%mend sim;

data power; run;

%macro power(theata_start=0, theata_end=0.2,
             sample=100,alphalevel=0.05,seed_0=0,simu=1000,devide=10);
%do i=1 %to &devide;
  %let theata=&theata_start + (&i-1)*(&theata_end-&theata_start)/&devide;
  %sim (delta=&theata,v=1,sample=&sample,alphalevel=&alphalevel,simu=&simu);
  data power; set power result; if theata=. then delete; run;
%end;
%mend power;

%power(theata_start=0, theata_end=0.2,
       sample=200,alphalevel=0.05,seed_0=0,simu=400,devide=20);

data power;
  set power;
  label prob_anova='Rejecting delta=0 using ANOVA'
        prob_bartlett='Rejecting delta=0 using Bartlett’s test';
run;
symbol1 interpol=join value=dot height=1 width=2 cv=red  CI=red;
symbol2 interpol=join value=dot height=1 width=2 cv=blue CI=blue;
legend1 label=none shape=symbol(5,1) position=(top center inside) mode=share;
axis1 label=('Power');
proc gplot data=power;
  plot prob_anova*delta prob_bartlett*delta /overlay legend=legend1 vaxis=axis1;
run;

Appendix 4: Derivation of the Expectation and Variance of Siegmund's T-test

The derivation of the expectation and variance of Siegmund's T statistic is given. The technique used is the Δ-method. (Chapter 4)

• Expectation of T

With genotype counts n1 (AA), n2 (Aa) and n3 (aa), and n = n1 + n2 + n3, the statistic is

T = [2n·n1/(n + n1 − n3) + 2n·n3/(n + n3 − n1) − n] / √n.

Writing p̂ = (2n1 + n2)/2n and q̂ = (2n3 + n2)/2n, note that n + n1 − n3 = 2n1 + n2 = 2np̂ and n + n3 − n1 = 2n3 + n2 = 2nq̂, so that T = (n1/p̂ + n3/q̂ − n)/√n. Under inbreeding coefficient θ,

E(n1) = n(p² + θpq),  E(n2) = 2npq(1 − θ),  E(n3) = n(q² + θpq).

Approximating E(T) by replacing n1 and n3 with their expectations,

E(T) ≈ [2n·E(n1)/(n + E(n1) − E(n3)) + 2n·E(n3)/(n + E(n3) − E(n1)) − n] / √n.

Since n + E(n1) − E(n3) = n(1 + p² − q²) = 2np and, similarly, n + E(n3) − E(n1) = 2nq, this simplifies to

E(T) = [n(p + θq) + n(q + θp) − n] / √n = √n·θ.

• Variance of T

By the Δ-method, T is approximately normal with variance

V(T) = [V(n1)·a² + V(n3)·b² + 2 Cov(n1, n3)·a·b] / n,

where the moments of the genotype counts are

V(n1) = n(p² + θpq)(1 − p² − θpq),
V(n3) = n(q² + θpq)(1 − q² − θpq),
Cov(n1, n3) = −n(p² + θpq)(q² + θpq),

and a and b are the partial derivatives of √n·T with respect to n1 and n3, evaluated at the expected counts:

a = 2n(n − E(n3))/(n + E(n1) − E(n3))² + 2n·E(n3)/(n + E(n3) − E(n1))²,
b = 2n(n − E(n1))/(n + E(n3) − E(n1))² + 2n·E(n1)/(n + E(n1) − E(n3))².

After simplification,

V(T) = (1 − θ)[θ²(1 − 2p)² + 2(p − 1)p − θ(6p² − 6p + 2)] / [2(p − 1)p].

In particular, at θ = 0, V(T) = 1, so that T is asymptotically standard normal under the null hypothesis.

Appendix 5: SAS code for sample size calculation under Wald's Z test

SAS code for calculating the sample size under Wald's Z test is given. (Chapter 4)

data sample_size;
  do i=1 to 2;
    select (i);
      when (1) alpha=0.05;
      when (2) alpha=0.01;
    end;
    do j=1 to 4;
      select (j);
        when (1) belta=.2;  when (2) belta=.5;  when (3) belta=.9;  when (4) belta=.95;
      end;
      do k=1 to 5;
        Pa=k*0.1;
        do l=1 to 12;
          select (l);
            when (1) f=.0001; when (2) f=.0005; when (3) f=.001; when (4) f=.002;
            when (5) f=.005;  when (6) f=.01;   when (7) f=.02;  when (8) f=.05;
            when (9) f=.1;    when (10) f=.25;  when (11) f=.5;  when (12) f=1;
          end;
          n=(probit(1-alpha)+probit(belta))**2
            *((1-f)**2*(1-2*f)+f*(1-f)*(2-f)/(2*Pa*(1-Pa)))/(f**2);
          output;
        end;
      end;
    end;
  end;
run;

Appendix 6: SAS code for the power comparison of Ward and Sing's χ2 test and Wald's Z test

SAS macros are developed.
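The Appendix 5 data step evaluates one closed-form expression over a grid of (α, power, Pa, f). As a sketch, the same formula in Python, standard library only; the function name and the rounding up to an integer are additions of this sketch, and the SAS variable belta corresponds to the power argument here.

```python
# Sample size for the Wald Z test of H0: f = 0 against a one-sided
# alternative f > 0, following the formula used in the data step above.
from math import ceil
from statistics import NormalDist

def hwe_z_sample_size(f, pa, alpha=0.05, power=0.80):
    # z_{1-alpha} + z_{power}, via the stdlib normal quantile (SAS's probit).
    z = NormalDist().inv_cdf(1 - alpha) + NormalDist().inv_cdf(power)
    # n * Var(f_hat) evaluated at the alternative f.
    var1 = (1 - f) ** 2 * (1 - 2 * f) \
        + f * (1 - f) * (2 - f) / (2 * pa * (1 - pa))
    return ceil(z ** 2 * var1 / f ** 2)   # subjects needed

n_needed = hwe_z_sample_size(f=0.05, pa=0.3, alpha=0.05, power=0.90)
```

As expected, the required sample size grows rapidly as the inbreeding coefficient f to be detected shrinks, since the formula carries a 1/f² factor.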
(Chapter 4)

options nosource nodate;
ods trace off;
proc printto;

%macro sim(f=0.2, sample=100,alphalevel=0.05,seed_0=0,simu=1000);
data temp_1;
  do i=1 to &simu;
    do j=1 to 3;
      if j=1 then genotype='AA';
      if j=2 then genotype='Aa';
      if j=3 then genotype='aa';
      sim=i;
      output;
    end;
  end;
run;
data one;
  retain sd0 (&seed_0);
  do i=1 to &simu;
    p=ranuni(0); q=1-p; f=&f;
    p1=p**2+f*p*q;
    p2=2*p*q*(1-f);
    p3=q**2+f*p*q;
    do j=1 to &sample;
      g=rantbl(0,p1,p2,p3);
      if g=1 then do; genotype='AA'; end;
      else if g=2 then do; genotype='Aa'; end;
      else if g=3 then do; genotype='aa'; end;
      sim=i;
      output;
    end;
  end;
run;
proc freq data=one noprint; table genotype / out=FreqCnt; by sim; run;
data freqcnt(drop=i j);
  merge temp_1 freqcnt;
  by sim genotype;
  if count=. then count=0;
run;
proc transpose data=freqcnt out=T prefix=n; by sim; var count; run;
data T;
  set T;
  n=n1+n2+n3;
  f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
  p_e=(2*n1+n2)/2/n;
  V_f_e=1/n*(1-f_e)**2*(1-2*f_e)
        +f_e*(1-f_e)*(2-f_e)/(2*n*p_e*(1-p_e));
  z=f_e/(V_f_e)**0.5;
  if abs(z)>-probit(&alphalevel/2) then reject_z='Yes';
  else if z=. then reject_z='N/A';
  else reject_z='No ';
  chisq=(n1-n*p_e**2)**2/(n*p_e**2)
        +(n2-n*2*p_e*(1-p_e))**2/(n*2*p_e*(1-p_e))
        +(n3-n*(1-p_e)**2)**2/(n*(1-p_e)**2);
  if chisq>cinv(1-&alphalevel,1) then reject_chi='Yes';
  else if chisq=. then reject_chi='N/A';
  else reject_chi='No ';
  f=&f;
  sample=&sample;
run;
proc freq data=t noprint;
  table reject_z /out=freq_z;
  table reject_chi / out=freq_chi;
run;
data freq_z1;   set freq_z;   rename percent=power_z;   theata=&f; if reject_z^='Yes'   then delete; run;
data freq_chi1; set freq_chi; rename percent=power_chi; theata=&f; if reject_chi^='Yes' then delete; run;
data freq; merge freq_chi1 freq_z1; keep power_chi Power_z theata; run;
%mend sim;

%macro power(theata_start=0, theata_end=0.2,
             sample=1000,alphalevel=0.05,seed_0=0,simu=100,devide=2);
%do i=1 %to &devide;
  %let theata=&theata_start + (&i-1)*(&theata_end-&theata_start)/&devide;
  %sim( f=&theata,sample=&sample,alphalevel=&alphalevel, simu=&simu);
  data power; set power freq; if theata=. then delete; run;
%end;
%mend power;

data power; run;
%power(theata_start=0, theata_end=0.2,
       sample=1000,alphalevel=0.05,simu=10000,devide=10);

data power;
  set power;
  label power_z='Rejecting theata=0 using Z-test'
        power_chi='Rejecting theata=0 using Chi-square';
run;
symbol1 interpol=join value=dot    height=1 width=2 cv=red  CI=red;
symbol2 interpol=join value=circle height=1 width=2 cv=blue CI=blue;
legend1 label=none shape=symbol(5,1) position=(top center inside) mode=share;
axis1 label=('Power');
proc gplot data=power;
  plot power_z*theata power_chi*theata /overlay legend=legend1 vaxis=axis1;
run;
quit;

title1 "Normal Q-Q Plot for Z's";
axis2 label=('Z');
proc capability data=T noprint;
  qqplot Z / normal(mu=est sigma=est color=orange l=2 w=7) square vaxis=axis2;
  histogram Z/ normal;
run;

Appendix 7: SAS code of Rao's Homogeneity Test

The Q test for testing the homogeneity of several inbreeding coefficients obtained from 2x2 joint distributions is dealt with here.
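Before the macro, the mechanics of the test can be sketched compactly in Python; the genotype counts below are purely illustrative, while the collapsing scheme and the formulas follow the macro.

```python
# Sketch of the Q (homogeneity) test for a tri-allelic locus: collapse the
# genotype table three ways into 2x2 form, estimate an inbreeding
# coefficient and its variance from each, then combine.
import math

def theta_hat(n1, n2, n3):
    """(theta, Var) from a collapsed table: n1 homozygotes for the chosen
    allele class, n2 heterozygotes, n3 the remaining genotypes."""
    n = n1 + n2 + n3
    f = 1 - 2 * n * n2 / ((2 * n1 + n2) * (2 * n3 + n2))
    p = (2 * n1 + n2) / (2 * n)
    v = ((1 - f) ** 2 * (1 - 2 * f)) / n \
        + f * (1 - f) * (2 - f) / (2 * n * p * (1 - p))
    return f, v

def q_test(n11, n12, n13, n22, n23, n33):
    collapses = [(n11, n12 + n13, n22 + n23 + n33),   # A1 vs. not A1
                 (n22, n12 + n23, n11 + n13 + n33),   # A2 vs. not A2
                 (n33, n13 + n23, n11 + n12 + n22)]   # A3 vs. not A3
    est = [theta_hat(*c) for c in collapses]
    w = [1.0 / v for _, v in est]                     # inverse-variance weights
    pooled = sum(f * wi for (f, _), wi in zip(est, w)) / sum(w)
    z = pooled * math.sqrt(sum(w))                    # pooled Z statistic
    q = sum(wi * (f - pooled) ** 2 for (f, _), wi in zip(est, w))
    return z, q

z, q = q_test(n11=30, n12=40, n13=30, n22=25, n23=45, n33=30)
```

z is referred to N(0, 1) and q to a chi-square distribution with 2 degrees of freedom, matching the probit and cinv cutoffs in the macro.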
(Chapter 5)

options nosource nodate;
ods trace off;

%macro data(theta=,sample_z=,simu_z=);
data temp_1;
  do i=1 to &simu_z;
    do g=1 to 6;
      select (g);
        when (1) genotype='A1A1'; when (2) genotype='A1A2'; when (3) genotype='A1A3';
        when (4) genotype='A2A2'; when (5) genotype='A2A3'; when (6) genotype='A3A3';
        otherwise;
      end;
      simu_z=i;
      output;
    end;
  end;
run;
data z_1;
  do i=1 to &simu_z;
    P1=ranuni(0);
    P2=(1-P1)*ranuni(0);
    P3=1-P1-P2;
    t=&theta;
    P11=(1-t)*(P1**2)+t*P1;
    P12=(1-t)*p1*p2;
    P13=(1-t)*p1*p3;
    P22=(1-t)*(P2**2)+t*P2;
    P23=(1-t)*p2*p3;
    P33=(1-t)*(P3**2)+t*P3;
    simu_z=i;
    do j=1 to &sample_z;
      g=rantbl(0,p11,2*p12,2*p13,p22,2*p23,p33);
      select (g);
        when (1) genotype='A1A1'; when (2) genotype='A1A2'; when (3) genotype='A1A3';
        when (4) genotype='A2A2'; when (5) genotype='A2A3'; when (6) genotype='A3A3';
        otherwise;
      end;
      output;
    end;
  end;
run;
proc freq data=z_1 noprint; table genotype / out=FreqCnt; by simu_z; run;
data freqcnt(drop=i g);
  merge temp_1 freqcnt;
  by simu_z genotype;
  if count=. then count=0;
run;
proc transpose data=freqcnt out=T prefix=n; by simu_z; var count; run;
data t;
  set t;
  rename n1=n11 n2=n12 n3=n13 n4=n22 n5=n23 n6=n33;
  n=&sample_z;
run;
%mend data;

%macro Z(data=,theta=,alphalevel=);
data z1;
  set &data;
  n1=n11; n2=n12+n13; n3=n22+n33+n23;
  f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
  p_e=(2*n1+n2)/2/n;
  V_e=1/n*(1-f_e)**2*(1-2*f_e)+f_e*(1-f_e)*(2-f_e)/(2*n*p_e*(1-p_e));
  rename f_e=f1_e V_e=V1_e;
run;
data z2;
  set &data;
  n1=n22; n2=n12+n23; n3=n11+n33+n13;
  f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
  p_e=(2*n1+n2)/2/n;
  V_e=1/n*(1-f_e)**2*(1-2*f_e)+f_e*(1-f_e)*(2-f_e)/(2*n*p_e*(1-p_e));
  rename f_e=f2_e V_e=V2_e;
run;
data z3;
  set &data;
  n1=n33; n2=n13+n23; n3=n11+n12+n22;
  f_e=1-2*n*n2/(2*n1+n2)/(2*n3+n2);
  p_e=(2*n1+n2)/2/n;
  V_e=1/n*(1-f_e)**2*(1-2*f_e)+f_e*(1-f_e)*(2-f_e)/(2*n*p_e*(1-p_e));
  rename f_e=f3_e V_e=V3_e;
run;
proc sql;
  create table z as
  select a.*, b.f2_e, b.v2_e, c.f3_e, c.v3_e
  from z1 as a, z2 as b, z3 as c
  where a.simu_z=b.simu_z=c.simu_z;
quit;
data Z;
  set z;
  v_e=1/(1/v1_e+1/v2_e+1/v3_e);
  f_e=(f1_e/v1_e+f2_e/v2_e+f3_e/v3_e)*V_e;
  Z_test=f_e/(v_e)**0.5;
  if abs(z_test)>-probit(&alphalevel/2) then reject_z='Yes';
  else if z_test=. then reject_z='N/A';
  else reject_z='No ';
  H_test=(f1_e-f_e)**2/v1_e+(f2_e-f_e)**2/v2_e+(f3_e-f_e)**2/v3_e;
  if H_test>cinv(1-&alphalevel,2) then reject_H='Yes';
  else if H_test=. then reject_H='N/A';
  else reject_H='No ';
run;
proc freq data=z noprint;
  table reject_z /out=freq_z;
  table reject_h / out=freq_h;
run;
data freq_z1; set freq_z; rename percent=power_z; theta=&theta; if reject_z^='Yes' then delete; run;
data freq_h1; set freq_h; rename percent=power_h; if reject_h^='No' then delete; run;
data freq; merge freq_h1 freq_z1; keep power_h Power_z theta; run;
%mend Z;

data power; run;

%macro power(theta_start=0, theta_end=0.2,
             sample=100,alphalevel=0.05,seed_0=0,simu=10000,devide=100);
%do i=1 %to &devide;
  %let theta=&theta_start + &i*(&theta_end-&theta_start)/&devide;
  %data( theta=&theta,sample_z=&sample,simu_z=&simu);
  %z(data=t,theta=&theta,alphalevel=&alphalevel);
  data power; set power freq; if theta=. then delete; run;
%end;
%mend power;

%power(theta_start=0, theta_end=0.2,
       sample=100,alphalevel=0.05,simu=1000,devide=10);

data power;
  set power;
  label power_z='Rejecting theta=0 using Z-test'
        power_h='Accepting all thetas are equal using Chi-square';
run;
symbol1 interpol=join value=dot height=1 width=2 cv=red  CI=red;
symbol2 interpol=join value=dot height=1 width=2 cv=blue CI=blue;
legend1 label=none shape=symbol(5,1) position=(top center inside) mode=share;
axis1 label=('Power');
proc gplot data=power;
  plot power_z*theta power_h*theta /overlay legend=legend1 vaxis=axis1;
run;
quit;

Appendix 8: Derivatives of the θ̂s with respect to the frequencies

The following expressions are useful in calculating Cov(θ̂1, θ̂2), Cov(θ̂1, θ̂3) and Cov(θ̂2, θ̂3). These expressions are used in Appendix 9.
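The closed-form derivatives themselves were produced with Mathematica. As a hedged stand-in, any one of them can be checked numerically by a central finite difference on the estimate; the Python sketch below uses illustrative counts, with theta1 denoting the A1 vs. (not A1) estimate used in Appendix 7.

```python
# Numerical check of a derivative of theta_1 with respect to a genotype
# count, via central finite differences.
def theta1(n11, n12, n13, n22, n23, n33):
    n = n11 + n12 + n13 + n22 + n23 + n33
    n1, n2, n3 = n11, n12 + n13, n22 + n23 + n33    # A1 vs. (not A1) collapse
    return 1 - 2 * n * n2 / ((2 * n1 + n2) * (2 * n3 + n2))

def d_theta1(counts, name, h=1e-5):
    up = dict(counts); up[name] += h
    dn = dict(counts); dn[name] -= h
    return (theta1(**up) - theta1(**dn)) / (2 * h)

counts = dict(n11=30, n12=40, n13=30, n22=25, n23=45, n33=30)
slope = d_theta1(counts, "n11")   # approximates d(theta_1)/d(n11)
```

The same finite-difference check applies to each of the partial derivatives listed on the following pages, and to the delta-method covariance terms assembled from them.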
(Chapter 5)

[The closed-form expressions for ∂θ̂1/∂nij, ∂θ̂2/∂nij and ∂θ̂3/∂nij (i ≤ j), together with the same derivatives evaluated under the hypothesis A, were rendered as images in the original manuscript and are not reproducible here.]

Appendix 9: Mathematica code for power and size computations of the Q-test

The iterative computations for finding the optimal estimate of the inbreeding coefficient, the derivatives for the variance-covariance terms, and the different choices of parameters needed for simulation are developed for the power and size computations of the Q-test. (Chapter 5)

[The Mathematica code was likewise rendered as images in the original and is not reproducible here.]
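As a stand-in for the Mathematica computations, the following Python sketch carries out the Chapter 6 recipe with illustrative genotype counts: plug in the standard allele-frequency estimates, then maximize the multinomial likelihood over the admissible range of θ. A dense grid search is used here in place of extracting the roots of the cubic score equation and picking the likelihood-maximizing one.

```python
# MLE of the inbreeding coefficient with allele frequencies plugged in
# (tri-allelic case; genotype model p_ii = (1-t)p_i^2 + t*p_i,
#  p_ij = 2(1-t)p_i*p_j, as in the Appendix 7 data step).
import numpy as np

counts = {"11": 30, "12": 40, "13": 30, "22": 25, "23": 45, "33": 30}
n = sum(counts.values())
p = np.array([(2 * counts["11"] + counts["12"] + counts["13"]) / (2 * n),
              (2 * counts["22"] + counts["12"] + counts["23"]) / (2 * n),
              (2 * counts["33"] + counts["13"] + counts["23"]) / (2 * n)])

def loglik(t):
    cell = {"11": (1 - t) * p[0] ** 2 + t * p[0],
            "22": (1 - t) * p[1] ** 2 + t * p[1],
            "33": (1 - t) * p[2] ** 2 + t * p[2],
            "12": 2 * (1 - t) * p[0] * p[1],
            "13": 2 * (1 - t) * p[0] * p[2],
            "23": 2 * (1 - t) * p[1] * p[2]}
    return sum(counts[k] * np.log(cell[k]) for k in counts)

# Admissible range: every cell probability must stay positive.
lo = -min(p / (1 - p)) + 1e-6
grid = np.linspace(lo, 1 - 1e-6, 20001)
theta_hat = grid[np.argmax([loglik(t) for t in grid])]
```

For these counts the homozygote excess pushes theta_hat above zero; a root-finder on the cubic dl/dθ = 0 would give the same answer to within the grid spacing.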