Association Study for the Relationship Between
Genotype with Diplotype Configuration and Phenotype
of Multiple Quantitative Responses (Sample Format)
Makoto Tomita, Noboru Hashimoto† and Yutaka Tanaka‡
Abstract — Though there have been several works for the analysis of association between genotype and
phenotype, little can be found for the association analysis between a haplotype or haplotype sets and
multivariate quantitative responses. For example, QTLmarc is available for the analysis of multivariate
responses, but it cannot be applied to the case of stochastic diplotype configurations and complex genetic
models. The present paper proposes a method of association analysis between diplotype configuration and
multivariate quantitative responses assuming the dominant, recessive and additive models. The proposed
method is compared with QTLmarc on numerical examples and on small simulated data sets that combine
actual genotype information from the HapMap project with artificial quantitative phenotype data
following multivariate normal distributions. The results show that the proposed method is superior to
QTLmarc in finding the assumed association.
Keywords: Multivariate analysis; quantitative responses; haplotype; likelihood ratio test.
1 Introduction
In post-genomic research, the association between genotype and phenotype has recently been studied
actively. Here ‘genotype’ means not only the genotype itself but also the haplotypes and diplotype
configurations that are estimated from the genotypes of the sample, and ‘phenotype’ indicates qualitative or
quantitative variables which may be related to specific diseases. Quantitative phenotype variables,
referred to here as QTL (quantitative trait locus) variables, include covariates such as BMI and glucose
level. Several algorithms have been proposed to analyze the association between genotype information and
quantitative phenotypes. The algorithm QTLhaplo (Shibata et al., 2004) deals with the association
between the genotype and a univariate phenotype, assuming normality of the conditional
distribution of the phenotype given the genotype information. The likelihood is calculated on the basis of
the frequencies of diplotype configurations (the joint probability given by the frequencies of the
haplotypes that compose the diplotype) and the density function of a normal distribution. The algorithm
QTLmarc (Kamitsuji and Kamatani, 2006) has been proposed for the multivariate analysis of multiple
quantitative responses. However, it can deal only with the case where each subject’s haplotypes are
determined uniquely from the genotype, so it is doubtful whether it can evaluate the association properly
in general cases. It is therefore valuable to develop a general method of association analysis for
multivariate quantitative responses. In this paper, we extend the algorithm QTLhaplo so that it can deal
with the association between the genotype and multiple quantitative variables under three types of models:
the dominant, recessive and additive models.
* Clinical Research Center, Tokyo Medical and Dental University Hospital Faculty of Medicine, 1-5-45 Yushima, Bunkyo-ku,
Tokyo, 113-8519, Japan, E-mail: [email protected], Tel: +81-3-5803-5612
† Department of Biochemistry II, Nagoya University Graduate School of Medicine, 466-8550, Japan
‡ The Institute of Statistical Mathematics, 190-8562, Japan
2 Methodology
2.1 Univariate models
Shibata et al. (2004) describe the algorithm QTLhaplo as follows. Suppose that there exist $l$ linked loci.
As DNA is of double helix structure and each haplotype has its counterpart, the number of possible
haplotypes is $L = 2^l$ in total. Let the relative frequencies of the haplotypes be given as
$\Theta = (\theta_1, \ldots, \theta_j, \ldots, \theta_L)$, where $\theta_j \ge 0$ and
$\sum_{j=1}^{L} \theta_j = 1$. As each subject has a combination of two haplotypes, there are $L^2$
possible combinations $a_1, a_2, \ldots, a_{L^2}$. The probability that the $i$th subject has a diplotype
configuration $a_k$ composed of the $l$th and the $m$th haplotypes is given by
$P(d_i = a_k \mid \Theta) = \theta_l \theta_m$, where $d_i$ is the diplotype configuration of the $i$th
subject. Also suppose that the $i$th subject has a quantitative phenotype $\psi_i$ with probability density
function $f$. Now consider that a sample of size $N$ is observed in an experiment. The phenotype for each
diplotype configuration is assumed to follow a normal distribution with a common variance but with a mean
which varies depending on the diplotype configuration. The outcome of the experiment can then be expressed
as $(\Theta, D, \psi)$, where $D = (d_1, \ldots, d_N)$ indicates the vector of the diplotype configurations
and $\psi = (\psi_1, \ldots, \psi_N)$ collects the quantitative phenotypes. The observations for the $N$
subjects are classified into two groups, subjects with and without a specified haplotype $h_t$ in their
diplotype configurations, and from these observations the group means $\mu_k$ and the common variance
$\sigma^2$ of the phenotype distributions can be estimated. The problem is to test whether there exists any
difference in the distribution of the phenotype between the two groups. For the dominant model, $D_+$ is
defined as the set of diplotype configurations containing haplotype $h_t$, and $D_-$ as the set of those
not containing $h_t$. Then the distribution of the phenotype is given by $N(\mu_1, \sigma^2)$ for
$d_i \in D_+$ and by $N(\mu_2, \sigma^2)$ for $d_i \in D_-$. Denote the corresponding probability density
functions by $f_j(x)$, $j = 1, 2$. Thus the probability density function for $\psi_i$ is
$f_1(x) = f(\psi_i = x \mid d_i \in D_+)$ in case $d_i \in D_+$ and
$f_2(x) = f(\psi_i = x \mid d_i \in D_-)$ in case $d_i \in D_-$.
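The diplotype probabilities described above can be illustrated with a short Python sketch. It enumerates the haplotypes of $l$ biallelic loci and assigns each ordered haplotype pair the product of the two haplotype frequencies. The function names and the uniform frequencies are assumptions made for illustration only; this is not the QTLhaplo implementation.

```python
from itertools import product


def enumerate_haplotypes(n_loci):
    """Return all 2**n_loci haplotypes over biallelic loci as 0/1 tuples."""
    return list(product((0, 1), repeat=n_loci))


def diplotype_probabilities(theta):
    """Map each ordered haplotype index pair (l, m) to theta[l] * theta[m]."""
    if any(t < 0 for t in theta) or abs(sum(theta) - 1.0) > 1e-9:
        raise ValueError("theta must be non-negative and sum to 1")
    n = len(theta)
    return {(l, m): theta[l] * theta[m] for l in range(n) for m in range(n)}


# Three loci give L = 2**3 = 8 haplotypes and L**2 = 64 ordered pairs.
haplotypes = enumerate_haplotypes(3)
# Hypothetical uniform haplotype frequencies, for illustration only.
theta = [1.0 / len(haplotypes)] * len(haplotypes)
probs = diplotype_probabilities(theta)
```

Because the pairs are ordered, the probabilities of all $L^2$ configurations sum to one whenever the haplotype frequencies do.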
Let A denote the specified haplotype $h_t$ and B any other haplotype. Then every diplotype
configuration is expressed as AA, AB, or BB, and the sets $D_+$ and $D_-$ can be defined as follows. In the
case of dominant models, AA and AB belong to $D_+$, while BB belongs to $D_-$. In the case of recessive
models, AA belongs to $D_+$, while AB and BB belong to $D_-$. For additive models, the distributions of
$\psi_i$ for AA, BB and AB are given by $N(\mu_1, \sigma^2)$, $N(\mu_2, \sigma^2)$ and $N(\mu_3, \sigma^2)$,
respectively, where $\mu_3 = (\mu_1 + \mu_2)/2$.
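The three genetic models differ only in how the number of copies of the specified haplotype maps to a group mean. A minimal Python sketch of that mapping follows; the function name `group_mean` and the coding of AA, AB and BB as 2, 1 and 0 copies are assumptions introduced here for illustration.

```python
def group_mean(copies, model, mu1, mu2):
    """Phenotype mean for a diplotype carrying `copies` copies of h_t.

    AA = 2 copies, AB = 1 copy, BB = 0 copies.
    """
    if copies not in (0, 1, 2):
        raise ValueError("copies must be 0, 1 or 2")
    if model == "dominant":
        # AA and AB belong to D+, BB belongs to D-.
        return mu1 if copies >= 1 else mu2
    if model == "recessive":
        # AA belongs to D+, AB and BB belong to D-.
        return mu1 if copies == 2 else mu2
    if model == "additive":
        # AA -> mu1, BB -> mu2, AB -> midpoint of the two.
        return {2: mu1, 0: mu2, 1: (mu1 + mu2) / 2.0}[copies]
    raise ValueError("unknown model: " + str(model))
```

For a heterozygous AB subject with hypothetical means 10.0 and 4.0, the dominant model returns 10.0, the recessive model 4.0, and the additive model 7.0.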
2.2 Extension to multivariate models
We now extend the above univariate model to a multivariate model. Suppose that the quantitative
phenotype vector $\psi_i$ follows a multivariate normal distribution with a common variance-covariance
matrix but with a mean vector that depends on the group defined by the diplotype configuration. In
dominant models, the density function is given by
$$f(\psi_i = x \mid d_i = a_k, \mu_1, \Sigma) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2} (x - \mu_1)' \Sigma^{-1} (x - \mu_1) \right\} \quad \text{if } a_k \in D_+, \tag{2.1}$$
where $\mu_1$ and $\Sigma$ indicate the mean vector for $D_+$ and the common variance-covariance matrix,
respectively, $x$ is the vector of individual quantitative phenotypes, and $p$ is the number of phenotype
variables. For $a_k \in D_-$ the density has the same form with the mean vector $\mu_2$ in place of $\mu_1$.
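As a sketch of how the multivariate normal density in (2.1) might be evaluated, the following Python code computes the log density through a Cholesky factorization, using only the standard library. It also shows one plausible way to combine the density with diplotype weights when a subject's configuration is uncertain. The function names and the weighted-sum form of `subject_likelihood` are assumptions for illustration, not the authors' R program.

```python
import math


def cholesky(a):
    """Lower-triangular factor of a symmetric positive definite matrix."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    raise ValueError("matrix is not positive definite")
                low[i][j] = math.sqrt(s)
            else:
                low[i][j] = s / low[j][j]
    return low


def mvn_logpdf(x, mu, sigma):
    """Log density of the p-variate normal N(mu, sigma) at x."""
    p = len(x)
    low = cholesky(sigma)
    # Forward substitution: solve low * z = x - mu, so the quadratic form is z'z.
    z = []
    for i in range(p):
        s = (x[i] - mu[i]) - sum(low[i][k] * z[k] for k in range(i))
        z.append(s / low[i][i])
    quad = sum(v * v for v in z)
    logdet = 2.0 * sum(math.log(low[i][i]) for i in range(p))
    return -0.5 * (p * math.log(2.0 * math.pi) + logdet + quad)


def subject_likelihood(x, configs, mu_plus, mu_minus, sigma):
    """Weighted sum of densities over candidate diplotype configurations.

    configs: list of (weight, in_d_plus) pairs; the weights are assumed to be
    the diplotype probabilities of the configurations compatible with the
    subject's genotype.
    """
    total = 0.0
    for weight, in_d_plus in configs:
        mu = mu_plus if in_d_plus else mu_minus
        total += weight * math.exp(mvn_logpdf(x, mu, sigma))
    return total
```

Working with the Cholesky factor avoids forming $\Sigma^{-1}$ explicitly and yields $\log|\Sigma|$ as a by-product.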
3 Numerical study and the results of analyses
We consider the case where there are three loci with two kinds of alleles as genotypes and two
quantitative phenotype variables. As genotype input, an actual data set for 44 subjects was
downloaded from the HapMap project, and a region of the X chromosome (107,189 loci) was used.
From this large number of loci, 40 loci in linkage disequilibrium were selected by
Tomita et al. (2008), who studied this region of the X chromosome in the HapMap project data. See Table
1 for detailed information (rs numbers, chromosome positions) on the loci. Note that we do not have
any information on haplotypes, as the data set does not contain phase information, and that there are no
missing observations. The LD map was drawn with the Haploview GUI software (Barrett et al., 2005). In
Figure 1, we can confirm that there are LD blocks.
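Because the genotype data carry no phase information, a single genotype can be compatible with several diplotype configurations. The following Python sketch enumerates them for biallelic loci. The coding of a genotype as the count (0, 1 or 2) of one allele per locus and the function name `compatible_diplotypes` are assumptions made for illustration.

```python
from itertools import product


def compatible_diplotypes(genotype):
    """Unordered haplotype pairs consistent with an unphased genotype.

    genotype: tuple with the count (0, 1 or 2) of allele "1" at each locus.
    """
    if any(g not in (0, 1, 2) for g in genotype):
        raise ValueError("genotype counts must be 0, 1 or 2")
    het = [i for i, g in enumerate(genotype) if g == 1]
    # Homozygous loci are fixed (0 -> allele 0, 2 -> allele 1);
    # heterozygous positions are overwritten below.
    base = [g // 2 for g in genotype]
    pairs = set()
    for bits in product((0, 1), repeat=len(het)):
        h1, h2 = list(base), list(base)
        for pos, b in zip(het, bits):
            h1[pos], h2[pos] = b, 1 - b
        pairs.add(tuple(sorted((tuple(h1), tuple(h2)))))
    return pairs
```

With three loci, a subject heterozygous at all three has four compatible unordered pairs, while a subject heterozygous at no more than one locus has exactly one, which is the only situation in which the diplotype is determined uniquely.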
Table 1: Information on the loci (rs number, chromosome position).
locus#  rs#         position
1       rs197000    28409449
2       rs197005    28413819
3       rs197006    28416655
4       rs197012    28424066
5       rs197014    28430651
6       rs197016    28433916
7       rs197018    28435927
8       rs197021    28441493
9       rs197022    28442352
10      rs5943527   28442857
11      rs642519    28445155
12      rs404274    28446901
13      rs196983    28448532
14      rs115126    28449670
15      rs115125    28449943
16      rs196985    28452952
17      rs196986    28453150
18      rs17348455  28453190
19      rs196988    28453742
20      rs196990    28456060
21      rs1265497   28458373
22      rs6630730   28468511
23      rs196982    28468715
24      rs196975    28475390
25      rs1468134   28526074
26      rs724087    28644979
27      rs5985808   28645245
28      rs4103136   28645826
29      rs12863731  28650807
30      rs1586093   28675987
31      rs5985930   28680456
32      rs5985809   28681164
33      rs5943575   28681984
34      rs2521807   28703122
35      rs634270    28705667
36      rs6630793   28709418
37      rs5943579   28723040
38      rs628704    28724158
39      rs629965    28724381
40      rs11095138  28725017
Figure 1: LD map ($r^2$) drawn with Haploview.
4 Discussions and summary
We proposed a method of multivariate association analysis and developed an R program for it. The method
extends QTLhaplo to the multivariate case: it can handle data in which the
diplotype configuration is not determined uniquely from the genotype, and it can assume any of the
dominant, recessive and additive models. To study the effectiveness of our method for the recessive and
additive models, we carried out a numerical study on two data sets combining real genotype data
(HapMap project) with simulated QTL data; for the dominant model, we performed a small
simulation study to compare its power with that of QTLmarc. The results showed that our method was more
effective than the QTLmarc of Kamitsuji and Kamatani (2006) at detecting the genotype-to-phenotype
relationship under all of the dominant, recessive and additive models.
So far we have discussed the comparison of power between our method and QTLmarc. In actual data
analysis, however, the objective is to find out whether any haplotype is closely related to the
phenotypes under an appropriate choice among the dominant, recessive and additive models. For this
purpose we may operationally choose the model with the highest significance. Note that, if we wish to use
the AIC statistics, we can compute them up to an additive constant from the values of the likelihood ratio
statistics, because the log-likelihood of the null model is common to the dominant,
recessive and additive models. To study how this idea works in actual data analysis, we analyzed each of
the three datasets generated in the manner described in Section 3 under each of the three models.
The chi-squared and AIC statistics showed that the true models were detected correctly in all cases, but
that there may be a tendency for the dataset generated under the additive model to be misjudged as coming
from one of the other models (Tomita et al., 2011).
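The remark about computing AIC up to an additive constant can be sketched as follows. If the likelihood ratio statistic is taken as $2(\log L_{\text{model}} - \log L_{\text{null}})$, then $\mathrm{AIC} = -2 \log L_{\text{model}} + 2k$ equals $2k$ minus the statistic, plus a term that is the same for every model. The function names, the statistic values and the parameter counts below are hypothetical placeholders, not results from the paper.

```python
def relative_aic(lr_stat, n_params):
    """AIC up to the additive constant shared by all candidate models.

    Assumes lr_stat = 2 * (logL_model - logL_null), so that
    AIC = (2 * n_params - lr_stat) - 2 * logL_null.
    """
    return 2.0 * n_params - lr_stat


def select_model(lr_stats, n_params):
    """Return the model with the smallest relative AIC, plus all scores."""
    scores = {name: relative_aic(lr_stats[name], n_params[name]) for name in lr_stats}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical values for illustration only.
lr_stats = {"dominant": 12.4, "recessive": 3.1, "additive": 9.8}
n_params = {"dominant": 7, "recessive": 7, "additive": 7}
best, scores = select_model(lr_stats, n_params)
```

When the candidate models have the same number of free parameters, as assumed in this example, ranking by relative AIC coincides with ranking by the likelihood ratio statistic.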
Our method has two advantages over the QTLmarc algorithm. One is that it
can treat genotype data with stochastically determined diplotypes; the other is that it can assume any
of the dominant, recessive and additive models. We expect the method to be useful in
association studies of complex diseases such as schizophrenia and autism, where the causes of the
diseases are not yet resolved and there exist multiple candidate responses.
Acknowledgment
This work was partly supported by KAKENHI (21700317; Grant-in-Aid for Young Scientists (B)).
References
[1] Barrett J. C., Fry B., Maller J. and Daly M. J. (2005). Haploview: analysis and visualization of LD and
haplotype maps. Bioinformatics, 21(2):263-265.
[2] Kamitsuji S. and Kamatani N. (2006). Estimation of haplotype associated with several quantitative
phenotypes based on maximization of area under a receiver operating characteristic (ROC) curve. Journal
of Human Genetics, 51(4):314-325.
[3] Shibata K., Ito T., Kitamura Y., Iwaaki N., Tanaka H. and Kamatani N. (2004). Simultaneous
estimation of haplotype frequencies and quantitative trait parameters: applications to the test of
association between phenotype and diplotype configuration. Genetics, 168:525-539.
[4] The International HapMap Consortium (2003). The International HapMap Project. Nature, 426(18):
789-796.
[5] Tomita M., Hatsumichi M. and Kurihara K. (2008). Identify LD Blocks Based on Hierarchical Spatial
Data. Computational Statistics and Data Analysis, 52(4):1806-1820.
[6] Tomita M., Hashimoto N. and Tanaka Y. (2011). Association Study for the Relationship Between a
Haplotype or Haplotype Set and Multiple Quantitative Responses. Computational Statistics and Data
Analysis, 55(6):2104-2113.