Association Study for the Relationship Between Genotype with Diplotype Configuration and Phenotype of Multiple Quantitative Responses (Sample Format)

Makoto Tomita, Noboru Hashimoto† and Yutaka Tanaka‡

Abstract: Though there have been several works on the analysis of association between genotype and phenotype, little can be found on association analysis between a haplotype or haplotype set and multivariate quantitative responses. For example, QTLmarc is available for the analysis of multivariate responses, but it cannot be applied to the case of stochastic diplotype configurations and complex genetic models. The present paper proposes a method of association analysis between diplotype configuration and multivariate quantitative responses assuming the dominant, recessive and additive models. A comparative study is performed between the proposed method and QTLmarc by applying the two methods to numerical examples and to small simulated data sets, with actual genotype information taken from the data set of the HapMap project and artificial quantitative phenotype data that follow multivariate normal distributions. The results show that the proposed method is superior to QTLmarc in finding the assumed association.

Keywords: Multivariate analysis; quantitative responses; haplotype; likelihood ratio test.

1 Introduction

Recently, the association between genotype and phenotype has been actively studied in post-genomic research. Here 'genotype' means not only the genotype itself but also the haplotypes and diplotype configurations that are estimated from the genotypes of the sample, and 'phenotype' indicates qualitative or quantitative variables that may be related to some specific diseases. Quantitative phenotype variables, here called QTL (quantitative trait locus), include covariates such as BMI and glucose level. Several algorithms have been proposed so far to analyze the association between the genotype information and the quantitative phenotypic QTL.
The algorithm QTLhaplo (Shibata et al., 2004) deals with the association between the genotype and a univariate phenotype, assuming normality of the conditional distribution of the phenotype given the genotype information. The likelihood is calculated on the basis of the frequencies of diplotype configurations (the joint probability of the frequencies of the haplotypes that compose the diplotype) and the density function of a normal distribution. The algorithm QTLmarc (Kamitsuji and Kamatani, 2006) has been proposed for multivariate analysis of multiple quantitative responses; however, it can deal only with the case where each subject's haplotype is determined uniquely from its genotype. It is doubtful whether it can evaluate the association properly in general cases. Therefore, it is valuable to develop a general method of association analysis for multivariate quantitative responses. In this paper, we extend the algorithm QTLhaplo so that it can deal with the association between the genotype and multiple quantitative variables assuming three types of models, i.e., the dominant, recessive and additive models.

Clinical Research Center, Tokyo Medical and Dental University Hospital Faculty of Medicine, 1-5-45 Yushima, Bunkyo-ku, Tokyo, 113-8519, Japan, E-mail: [email protected], Tel: +81-3-5803-5612
† Department of Biochemistry II, Nagoya University Graduate School of Medicine, 466-8550, Japan
‡ The Institute of Statistical Mathematics, 190-8562, Japan

2 Methodology

2.1 Univariate models

Shibata et al. (2004) describe the algorithm QTLhaplo as follows. Suppose that there exist $l$ linked loci. As DNA is of double helix structure and each haplotype has its counterpart, the number of possible haplotypes is $L = 2^l$ in total. Let the relative frequencies of the haplotypes be given as $\Theta = (\theta_1, \ldots, \theta_j, \ldots, \theta_L)$, where $\theta_j \ge 0$ and $\sum_{j=1}^{L} \theta_j = 1$. As each subject has a combination of two haplotypes, there are $L^2$ possible combinations $a_1, a_2, \ldots, a_{L^2}$.
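The counting in the paragraph above can be illustrated with a minimal Python sketch. It assumes two alleles per locus (coded 0 and 1) and treats a combination as an ordered pair of haplotype indices. The function names and the uniform frequency vector are hypothetical choices made only for illustration, not part of QTLhaplo.

```python
from itertools import product


def enumerate_haplotypes(n_loci):
    """Return all haplotypes over n_loci biallelic loci as tuples of 0/1 (assumed coding)."""
    return list(product((0, 1), repeat=n_loci))


def enumerate_diplotypes(haplotypes):
    """Return all ordered pairs (l, m) of haplotype indices, L^2 pairs in total."""
    n_hap = len(haplotypes)
    return [(l, m) for l in range(n_hap) for m in range(n_hap)]


def is_valid_frequency_vector(theta, tol=1e-9):
    """Check that the frequencies are non-negative and sum to one."""
    return all(t >= 0.0 for t in theta) and abs(sum(theta) - 1.0) <= tol


# Three linked loci, as in the numerical study of section 3.
haplotypes = enumerate_haplotypes(3)
diplotypes = enumerate_diplotypes(haplotypes)

# Hypothetical uniform haplotype frequencies, used only as a placeholder.
theta = [1.0 / len(haplotypes)] * len(haplotypes)

print(len(haplotypes), len(diplotypes), is_valid_frequency_vector(theta))
```

With three loci this gives 8 haplotypes and 64 ordered combinations, which is the search space over which the haplotype frequencies are defined.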
The probability that the $i$th subject has a diplotype configuration $a_k$ composed of the $l$th and the $m$th haplotypes is given by $P(d_i = a_k \mid \Theta) = \theta_l \theta_m$, where $d_i$ is the diplotype configuration of the $i$th subject. Also suppose that the $i$th subject has quantitative phenotype $\psi_i$ with probability density function $f$. Now consider that a sample of size $N$ is observed in an experiment. The phenotype for each diplotype configuration is assumed to follow a normal distribution with a common variance but with a mean that varies depending on the diplotype configuration. The outcome of the experiment can then be expressed as $(\Theta, D, \psi)$, where $D = (d_1, \ldots, d_N)$ indicates the vector of the diplotype configurations and $\psi = (\psi_1, \ldots, \psi_N)$ collects the quantitative phenotypes. The observations for the $N$ subjects are classified into two groups, those with and those without a specified haplotype $h_t$ in their diplotype configurations, and from these observations the group means $\mu_k$ and the common variance $\sigma^2$ of the phenotype distributions can be estimated. The problem is to test whether there exists any difference in the distribution of the phenotype between the two groups.

For the dominant model, $D^+$ is defined as the set of diplotype configurations that contain haplotype $h_t$, and $D^-$ is defined as the set of those that do not contain $h_t$. Then the distribution of the phenotype is given by $N(\mu_1, \sigma^2)$ for $d_i \in D^+$ or by $N(\mu_2, \sigma^2)$ for $d_i \in D^-$. Denote the probability density functions by $f_j(x)$, $j = 1, 2$. Thus the probability density function for $\psi_i$ is defined by $f_1(x) = f(\psi_i = x \mid d_i \in D^+)$ in case $d_i \in D^+$ and $f_2(x) = f(\psi_i = x \mid d_i \in D^-)$ in case $d_i \in D^-$. Let A and B denote a haplotype that is the specified $h_t$ and one that is not, respectively. Then every diplotype configuration is expressed as AA, AB, or BB, and the sets $D^+$ and $D^-$ can be defined as follows. In the case of dominant models, AA and AB belong to $D^+$, while BB belongs to $D^-$.
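As a hedged illustration of the dominant model just described, the following Python sketch combines the diplotype probability (the product of two haplotype frequencies) with the group-specific normal density. Summing over all diplotype configurations compatible with a subject's genotype is an assumption made here about how an unphased subject contributes to the likelihood. The function names, the toy frequencies and the candidate list are hypothetical.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a univariate normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def diplotype_prob(theta, l, m):
    """Probability of the configuration made of haplotypes l and m: theta_l * theta_m."""
    return theta[l] * theta[m]


def in_d_plus_dominant(l, m, target):
    """Dominant model: AA and AB (at least one copy of the target haplotype) are in D+."""
    return l == target or m == target


def subject_likelihood(x, candidates, theta, target, mu_plus, mu_minus, var):
    """Assumed likelihood contribution of one subject with phenotype x.

    candidates lists the (l, m) configurations compatible with the subject's
    genotype; each term is weighted by its diplotype probability.
    """
    total = 0.0
    for l, m in candidates:
        mean = mu_plus if in_d_plus_dominant(l, m, target) else mu_minus
        total += diplotype_prob(theta, l, m) * normal_pdf(x, mean, var)
    return total


# Toy example: three haplotypes, target haplotype index 0 (hypothetical values).
theta = [0.5, 0.3, 0.2]
value = subject_likelihood(1.0, [(0, 1), (1, 2)], theta, target=0,
                           mu_plus=1.0, mu_minus=0.0, var=1.0)
print(value)
```

When a subject's diplotype configuration is determined uniquely, the candidate list has a single entry and the sum reduces to one weighted density.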
In the case of recessive models, AA belongs to $D^+$, while AB and BB belong to $D^-$. For additive models, the distributions of $\psi_i$ for AA, BB and AB are given by $N(\mu_1, \sigma^2)$, $N(\mu_2, \sigma^2)$ and $N(\mu_3, \sigma^2)$, respectively, where $\mu_3 = (\mu_1 + \mu_2)/2$.

2.2 Extension to multivariate models

We extend the above univariate model to a multivariate model. Suppose that the quantitative phenotype vector $\psi_i$ follows a multivariate normal distribution with a common variance-covariance matrix but with a different mean vector for each group defined by the diplotype configurations. In dominant models, the density function is given by

$$ f(\psi_i = x \mid d_i = a_k, \mu_1, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left( -\frac{1}{2} (x - \mu_1)' \Sigma^{-1} (x - \mu_1) \right) \quad \text{if } a_k \in D^+, \qquad (2.1) $$

where $\mu_1$ and $\Sigma$ indicate the mean vector and the variance-covariance matrix, respectively, $x$ is the vector of individual quantitative phenotypes, and $p$ is the number of phenotype variables.

3 Numerical study and the results of analyses

We consider the case where there are three loci with two kinds of alleles as genotypes and two quantitative phenotype variables. As a genotype input data set, an actual data set for 44 subjects was downloaded from the HapMap project, and the information on a region (107,189 loci) of the X chromosome was used. Among this large number of loci, 40 loci with linkage disequilibrium were selected by Tomita et al. (2008), who studied this area of the X chromosome from the HapMap project. See Table 1 for detailed information (rs numbers, chromosome positions) on the loci of the data. Note that we do not have any information on haplotypes, as the data set does not contain phase information, and that there are no missing observations. The LD map was made using the GUI software Haploview (Barrett et al., 2005). In Figure 1, we can confirm that there are LD blocks.

Table 1: Information of loci.
(rs number, chromosome position)

locus#  rs#         position
1       rs197000    28409449
2       rs197005    28413819
3       rs197006    28416655
4       rs197012    28424066
5       rs197014    28430651
6       rs197016    28433916
7       rs197018    28435927
8       rs197021    28441493
9       rs197022    28442352
10      rs5943527   28442857
11      rs642519    28445155
12      rs404274    28446901
13      rs196983    28448532
14      rs115126    28449670
15      rs115125    28449943
16      rs196985    28452952
17      rs196986    28453150
18      rs17348455  28453190
19      rs196988    28453742
20      rs196990    28456060
21      rs1265497   28458373
22      rs6630730   28468511
23      rs196982    28468715
24      rs196975    28475390
25      rs1468134   28526074
26      rs724087    28644979
27      rs5985808   28645245
28      rs4103136   28645826
29      rs12863731  28650807
30      rs1586093   28675987
31      rs5985930   28680456
32      rs5985809   28681164
33      rs5943575   28681984
34      rs2521807   28703122
35      rs634270    28705667
36      rs6630793   28709418
37      rs5943579   28723040
38      rs628704    28724158
39      rs629965    28724381
40      rs11095138  28725017

Figure 1: LD map ($r^2$) produced with Haploview.

4 Discussions and summary

We proposed a method of multivariate association analysis and developed an R program. It is an extension of QTLhaplo to the multivariate case: it can analyze cases where the diplotype configuration is not determined uniquely from the genotype, and it can assume any of the dominant, recessive and additive models. To study the effectiveness of our method for the recessive and additive models, we carried out a numerical study analyzing two data sets with real genotype data (HapMap project) and simulated QTL data, and for the dominant model we performed a small simulation study to compare its power with that of QTLmarc. The results showed that our method was more effective in detecting the genotype-to-phenotype relationship than QTLmarc of Kamitsuji and Kamatani (2006) in all of the dominant, recessive and additive models.
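To make the three genetic models concrete in the multivariate setting, the following Python sketch selects a group mean vector from the number of copies of the specified haplotype and evaluates a multivariate normal density with a common covariance matrix. It is an illustration only and is not the authors' R program: the function names are hypothetical, the density is computed with a plain Cholesky factorization, and applying the midpoint of the two mean vectors component-wise for the additive model is an assumption that carries over the univariate definition.

```python
import math


def cholesky(a):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                low[i][j] = math.sqrt(a[i][i] - s)
            else:
                low[i][j] = (a[i][j] - s) / low[j][j]
    return low


def mvn_pdf(x, mean, cov):
    """Multivariate normal density for a p-dimensional phenotype vector x."""
    p = len(x)
    low = cholesky(cov)
    diff = [xi - mi for xi, mi in zip(x, mean)]
    # Forward substitution: solve low * z = diff, so that z'z is the quadratic form.
    z = []
    for i in range(p):
        s = sum(low[i][k] * z[k] for k in range(i))
        z.append((diff[i] - s) / low[i][i])
    quad = sum(v * v for v in z)
    log_det = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    return math.exp(-0.5 * (quad + log_det + p * math.log(2.0 * math.pi)))


def group_mean(copies, model, mu1, mu2):
    """Mean vector for a configuration with `copies` copies of h_t (2=AA, 1=AB, 0=BB)."""
    if model == "dominant":
        return mu1 if copies >= 1 else mu2
    if model == "recessive":
        return mu1 if copies == 2 else mu2
    if model == "additive":
        if copies == 2:
            return mu1
        if copies == 0:
            return mu2
        # Assumed component-wise midpoint for the heterozygous configuration AB.
        return [(a + b) / 2.0 for a, b in zip(mu1, mu2)]
    raise ValueError("unknown model: " + model)


# Two phenotype variables, hypothetical parameter values.
mu1, mu2 = [2.0, 4.0], [0.0, 0.0]
cov = [[2.0, 0.5], [0.5, 1.0]]
density_ab = mvn_pdf([1.0, 1.0], group_mean(1, "additive", mu1, mu2), cov)
print(density_ab)
```

Under this sketch the three models differ only in which mean vector the AB configuration receives, which is why they share the same null model.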
So far we have discussed the comparison of power between our method and QTLmarc. In actual data analysis, however, the objective is to find out whether there exists any haplotype that is closely related to the phenotypes, assuming an appropriate model among the dominant, recessive and additive models. For this purpose we may operationally choose the model with the highest significance. Note that, if we wish to use the AIC statistics, we can compute them up to an additive constant from the values of the likelihood ratio statistics, because the log-likelihood of the null model is common to the dominant, recessive and additive models. To study how this idea works in actual data analysis, we analyzed each of the three data sets generated in the manner described in section 3 under each of the three models, i.e., the dominant, recessive and additive models. The chi-squared statistics and the AIC statistics showed that the true models could be detected correctly in all cases, but that there might exist a tendency for the data set generated under the additive model to be misjudged as coming from one of the other models (Tomita et al., 2011).

There are two advantages of our method compared to the QTLmarc algorithm. One is that our method can treat genotype data with stochastically determined diplotypes, and the other is that we can assume any model among the dominant, recessive and additive models. It is expected that our method will be useful in association studies of complex diseases such as schizophrenia and autism, where the causes of the diseases are not yet resolved and there exist multiple candidate responses.

Acknowledgment

This work was partly supported by KAKENHI (21700317; Grant-in-Aid for Young Scientists (B)).

References

[1] Barrett J. C., Fry B., Maller J. and Daly M. J. (2005). Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21(2):263-265.
[2] Kamitsuji S. and Kamatani N. (2006).
Estimation of haplotype associated with several quantitative phenotypes based on maximization of area under a receiver operating characteristic (ROC) curve. Journal of Human Genetics, 51(4):314-325.
[3] Shibata K., Ito T., Kitamura Y., Iwaaki N., Tanaka H. and Kamatani N. (2004). Simultaneous estimation of haplotype frequencies and quantitative trait parameters: applications to the test of association between phenotype and diplotype configuration. Genetics, 168:525-539.
[4] The International HapMap Consortium (2003). The International HapMap Project. Nature, 426(18):789-796.
[5] Tomita M., Hatsumichi M. and Kurihara K. (2008). Identify LD Blocks Based on Hierarchical Spatial Data. Computational Statistics and Data Analysis, 52(4):1806-1820.
[6] Tomita M., Hashimoto N. and Tanaka Y. (2011). Association Study for the Relationship Between a Haplotype or Haplotype Set and Multiple Quantitative Responses. Computational Statistics and Data Analysis, 55(6):2104-2113.