Download [Full text/PDF]

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Epistasis wikipedia , lookup

Population genetics wikipedia , lookup

Genome (book) wikipedia , lookup

Public health genomics wikipedia , lookup

Human genetic variation wikipedia , lookup

Microevolution wikipedia , lookup

Genome-wide association study wikipedia , lookup

Behavioural genetics wikipedia , lookup

Twin study wikipedia , lookup

Designer baby wikipedia , lookup

Heritability of IQ wikipedia , lookup

Quantitative trait locus wikipedia , lookup

Transcript
NIH Public Access
Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.
NIH-PA Author Manuscript
Published in final edited form as:
Stat Interface. 2011 January 1; 4(3): 295–304.
A faster pedigree-based generalized multifactor dimensionality
reduction method for detecting gene-gene interactions
Guo-Bo Chen,
Institute of Bioinformatics, Zhejiang University, China
Jun Zhu, and
Institute of Bioinformatics, Zhejiang University, China
Xiang-Yang Lou
Section on Statistical Genetics, Department of Biostatistics, University of Alabama at
Birmingham, USA
Jun Zhu: [email protected]; Xiang-Yang Lou: [email protected]
NIH-PA Author Manuscript
Abstract
NIH-PA Author Manuscript
We proposed a faster pedigree-based generalized multifactor dimensionality reduction algorithm,
called PedG-MDR II (PII), to detect gene-gene interactions underlying complex traits. Inherited
from our previous framework of PedGMDR (PI), PII can handle both dichotomous and continuous
traits in pedigree-based designs and allows for covariate adjustment. Compared with PI, this faster
version can theoretically halve the computing burden and memory requirement. To evaluate the
performance of PII, we performed comprehensive simulations across a wide variety of
experimental scenarios, in which we considered two study designs, discordant sib pairs and mixed
families of varying size, and, for each study design, we considered five common factors that may
potentially affect statistical power: minor allele frequency, missing rate of parental genotypes,
covariate effect, gene-gene interaction, and scheme to adjust phenotypic outcomes. Simulations
showed that PII gave well controlled type I error rates against population admixture. Under a total
of 4,096 scenarios simulated, PII, in general, had a higher average power than PI for both
dichotomous and continuous traits, and the advantage was more pronounced for continuous traits.
PII also appeared to be less sensitive than PI to changes in the other four factors than the
magnitude of genetic effects considered in this study. Applied to the Mid-South Tobacco Family
study, PII detected a significant interaction with a p value of 5.4 × 10−5 between two taster
receptor genes, TAS2R16 and TAS2R38, responsible for nicotine dependence. In conclusion, PII is
a faster supplementary version of our previous PI for detecting multifactor interactions.
Keywords and phrases
Gene-gene interaction; Pedigree-based design; GMDR; Population admixture; Statistical power
1. INTRODUCTION
Although their exact inheritance pattern remains unknown, complex traits are influenced by
a combination of relevant genes and environmental factors, and often lack a one-to-one
genotype to phenotype correspondence (Phillips 2008). This poses a great challenge in
Correspondence to: Xiang-Yang Lou, [email protected].
Current Address: Section on Statistical Genetics, Department of Biostatistics, University of Alabama at Birmingham, USA,
[email protected]
Chen et al.
Page 2
NIH-PA Author Manuscript
dissecting the genetic architecture underlying them. The traditional single factor-based
statistical strategies assuming that a gene causes a detectable perturbation in a phenotypic
outcome, although having achieved a limited success in hunting determinants for complex
traits, are underpowered for most risk factors given the widespread existence of gene-gene
(G × G) and gene-environment (G × E) interactions. A preferable strategy is to tackle
interacting factors simultaneously as much as possible.
NIH-PA Author Manuscript
Many approaches have been proposed to detect G × G interactions for various genetic
designs. Logistic regression methods are well adapted to estimate the effects of interactions
(Bastone et al. 2004; Cook et al. 2004; Kooperberg et al. 2001; Tahri-Daizadeh et al. 2003;
Zhu and Hastie 2004) but confront a dramatic explosion of parameters in terms of
multifactor dimension searching of interacting terms. Recently, a novel category of methods
that can project multi-dimension searching of interaction down to one-dimension space have
been proposed, alleviating the restrictions associated with the logistic regression methods.
Depending on the category of a phenotypic outcome, multifactor dimensionality reduction
(MDR) method (Ritchie et al. 2001) and its modifications offer solutions for detecting
interactions for dichotomous traits (Hahn and Moore 2004; Hahn et al. 2003; Lee et al.
2007; Moore et al. 2006), while the combinatorial partition method (Nelson et al. 2001) and
its variants (Culverhouse et al. 2004) are dedicated to quantitative traits. By implanting the
generalized linear model into the MDR framework, Lou et al. (2007) proposed a generalized
multifactor dimensionality reduction (GMDR) approach, which provides a unified
framework for handling both continuous and discrete traits and further permits adjustment of
phenotypes for covariates. These methods aforementioned, however, are largely applicable
to population-based design (MDR can analyze discordant sib pairs, viewed as a special case
of case-control samples), a genetic design that is well appreciated but requires control and
case samples of a homogeneous genetic origin, and, if possible, being well matched on other
related factors. A population-based design is subject to spurious association in the presence
of population admixture and thus a technical adjustment is usually performed prior to
association analysis, to rule out effects of population admixture (Price et al. 2006).
NIH-PA Author Manuscript
Pedigree-based design, another popular alternative in genetic studies, is inherently robust
against the effect of population admixture in population-based design. In sexual
reproduction, a pair of genetic complementary haploids is produced from diploid germline
cells in a form of cell division called meiosis; a human genome is composed of two, from
each of one’s parents respectively, haploid genomes. Instead of recruiting a control that can
potentially come from a heterogeneous population, we can use the untransmitted genetic
counterpart of an offspring, which is inferable given sufficient pedigree information, as an
internal control, so pedigree-based design largely reduces spurious association even in the
existence of population admixture, balancing a sound statistical power and a controlled type
I error rate. Martin et al. (2006) proposed a pedigree-based MDR to detect G × G
interactions. To utilize the genetic information in pedigrees more thoroughly and handle
both dichotomous and continuous traits, we developed PedGMDR (abbreviated as PI
thereafter), which built a minimal sufficient statistic approach (Rabinowitz and Laird 2000)
into the GMDR framework (Lou et al. 2008).
In the present study, we propose a new pedigree-based framework, called PedGMDR II
(PII), that can handle both dichotomous and continuous traits and permits adjustment of
covariates with arbitrary missing marker information. PII is more computationally efficient
and also, as demonstrated in simulation, outperforms, or is comparable to, PI especially for
quantitative traits.
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 3
2. METHODS
2.1 Test statistics for PII
NIH-PA Author Manuscript
Consider a set of biallelic loci and there are up to three genotypes at each locus, e.g., aa, Aa,
and AA for locus A, and bb, Bb, and BB for locus B, and so on. Let g(xij ) denote an
indicator vector of xij, a set of genotypes at loci of interest for individual j in family i, whose
length is determined by the number of loci for a G × G interaction being tested. Let yij
denote the phenotypic value of offspring j in family i, and t(yij ) is its phenotypic function,
which can take the form of score statistics in the exponential family class of distributions by
choosing appropriate link functions (Lunetta et al. 2000). Let μ = E(yij ) and l(·) be an
appropriate link function depending on the distributions of phenotypic outcomes, and a
generalized linear model can be expressed as follows,
(1)
NIH-PA Author Manuscript
where β0 is the intercept, β1 is a vector of the effects of the loci being tested, g(xij ) indicates
a vector coding for genotype xij, β2 represents the effects of the covariate(s), and zij is the
covariate value(s). The above model is easy to extend by adding other covariates or
interaction terms if there are any. When yij follows a normal distribution, the natural link
function is the identity; or it can be,
if yij is a dichotomous trait. We can further define a general score statistic,
(2)
NIH-PA Author Manuscript
as an analogue to the statistic in the FBAT (Laird et al. 2000). However, here xij refers to a
combination of loci, whereas xij codes a single locus only in the FBAT statistic, viewed as a
special case of our statistics. We suggest here to use the score of Eq. (1) under the null
hypothesis: β1 = 0, in the place of t(yij ), and different schemes for covariate adjustment can
be considered in generating the score statistics — for example, we can either adjust the
phenotypes with covariates or not, and either include the founders or not in adjustment.
Different from PI in which an informative nonfounder generates a pair of statistics for
transmitted and pseudo nontransmitted individuals, respectively, we only use here the
transmitted to construct the statistic, but the non-transmitted individuals contribute to
construct the genotypic distribution under the null hypothesis of G × G interaction
associated with a trait being tested. Thus, in contrast to PI, the sample size entering into
multifactor reduction will be halved, as is the computing burden and memory requirement,
thus providing a faster implementation.
2.2 Multifactor-reduction algorithm
The new method is devised by integrating the family statistic defined in Eq. (2) into the
GMDR framework, whose implementation of k-fold cross-validation is summarized as
follows. The six steps involved in PII are illustrated in Figure 1.
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 4
NIH-PA Author Manuscript
In step one, randomly partition the nonfounder individuals, regardless of their family
origins, into k even or nearly even subdivisions. We use k = 10, which can be other
integers, throughout the paper. Motsinger and Ritchie (2006) showed that reducing the
number of CV intervals from ten to five caused no loss of power and accuracy.
In step two, a subset of γ discrete factors of either genetic and/or environmental origin
are selected from all ω factors of interest. We have a total of
combinations.
In step three, this set of factors stretches into γ-dimensional space, and each genotyped
subject in the training set is allocated to a cell accordingly. The values of statistic,
defined in equation 2, are averaged over each cell respectively. Each nonempty cell is
labeled either high-risk if its average statistic value is not less than some threshold T
(
, where n is the number of individuals employed in the multifactor-reduction
algorithm, the overall mean that is a natural extension of T = 0 to unbalanced casecontrol studies, is used throughout the paragraphs below), or low-risk otherwise.
In step four, an interaction model is formed by pooling high- and low-risk cells into
two distinct groups, i.e., high-risk and low-risk groups. The classification accuracy can
be assessed by the averages of the statistic values in the high-risk and the low-risk
groups: a higher accuracy indicates a better classification for the two groups.
NIH-PA Author Manuscript
In step five, all other possible γ factor combinations in the training set are iterated, and
the best γ-factor model is selected based on the classification accuracy.
In step six, the independent testing set is used to evaluate the best model from step five.
As there are k different pairs of training-testing sets, the above procedure repeats k rounds
on the k training sets.
2.3 Distribution of the test statistic for interaction
The PII allows different statistics to evaluate an interaction, and we employed testing
accuracy (TA) as a testing statistic
(3)
NIH-PA Author Manuscript
where TP is True Positive defined as having a high-risk value in the high-risk group, TN is
True Negative defined as having a low-risk value in the low-risk group, FP is False Positive
defined as having a low-risk value in the high-risk group, and FN is False Negative defined
as having a high-risk value in the low-risk group. Although the theoretical distribution of TA
remains unknown, when the sample size is sufficiently large, as the result of the central limit
theorem, an approximate Z score statistic can be constructed
, where E(TA) and
Var (TA) are the mathematical expectation and the variance of TA under the null
hypothesis.
The test procedure takes the genetic dependence among the relatives into account. Given a
mating type (parental genotypes) or its minimal sufficient statistic, we have the genotypic
distribution of offspring under the null hypothesis, denoted by GM; different mating types
have their respective genotypic distributions of offspring. Each of these genotypic
distributions follows Mendel’s law only, and thus is independent of any phenotype and can
serve as the reference distribution. Nevertheless, the genotypic distribution of offspring may
differ conditional on the mating type and a trait of interest in the presence of genotype-
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 5
NIH-PA Author Manuscript
phenotype association, denoted by GM,T. The difference between GM and GM,T is the basis
for detecting gene-gene interactions under-lying the trait. Noticeably, the numerator of
Testing Accuracy consists of two parts which are respectively calculated from the observed
family data, following GM,T, and evaluated from the null hypothesis, following GM. As the
genetic dependency affects both parts in parallel, the discrepancy between them will ascribe
to the association of the combination of loci with the trait only, thus eventually eliminating
the impact from genetic dependency through comparison between GM,T and its reference
distribution GM.
NIH-PA Author Manuscript
Evaluating the p value of the Z score test above needs to calculate the three terms involved,
where the first term, TA, can be calculated directly conditioning on the traits and the marker
scores, yet its mathematical expectation and variance under the null hypothesis, the second
and the third terms accordingly, closely depend on the distributions of genotypes and the
traits observed, the ascertainment condition, and other factors which might be unavailable.
However, these two terms can be investigated empirically conditioning on the traits
observed and the offspring’s genotypes under the null hypothesis of no association of
interacting factors with the phenotype. Following Mendel’s law, each parent transmits either
allele to each offspring independently with a probability of 0.5, and the genotypic
distribution under the null hypothesis is easily constructed when all parent genotypes are
available. It is plausible that the genotypes of founders are incomplete for late onset
diseases, e.g., Parkinson’s disease and Alzheimer’s disease. Rabinowitz and Laird gave a
unified algorithm (Rabinowitz and Laird 2000) on constructing the genotype distribution of
offspring under the null hypothesis for various scenarios of incomplete parental genotypes.
In general, conditioning on the traits and the null distribution of offspring genotypes, E(TA)
and Var (TA) under the null distribution can be evaluated by Monte Carlo simulations.
3. MONTE CARLO SIMULATIONS
NIH-PA Author Manuscript
To evaluate the performance of the proposed method and compare with PI, we carried out a
comprehensive simulation study. Without loss of generality, we considered a total of 10
independent diallelic markers in Hardy-Weinberg equilibrium, none of and two of which are
functional loci, respectively, for assessment of the Type I error rate and the power. In the
latter case, loci 1 and 3 were chosen as interacting loci, and two digenic interaction models
of low marginal effects were adopted to demonstrate the ability of identifying interacting
loci: checkerboard models (aaBb, Aabb, AaBB, and AABb are labeled to a high-value
genetic group and the rest to low) and diagonal models (aabb, AaBb, and AABB are labeled
to a high-value genotypic group and the rest to low) (Culverhouse et al. 2004). Eq. (1) was
used to simulate phenotypic outcomes, and corresponding to the type of phenotypes, we
chose an appropriate link function, i.e., logit for dichotomous traits or identity for
continuous traits. We set β0 = −5.3 for dichotomous traits and 0 for continuous ones, xij = 1
for high risk genotype and 0 otherwise, and zij ~ N (0, 1). We set four levels of the
interaction: β1=0.25, 0.50, 0.75, and 1.00, respectively, for dichotomous traits, and β1 =
0.125, 0.250, 0.375, and 0.500, respectively, for continuous traits. β2 was assigned the
values of 0.25, 0.50, 0.75, and 1.00, respectively. Thus, there are up to 16 combinations of
the two factors, genotype relative risk of high- to low-risk genotypes ranging from 1.25 to
2.60 for dichotomous traits, and for continuous traits heritability ranged from about 0.0018
to 0.0500 given equi-frequent biallelic loci; such ranges are reasonably well established in
the literature (Flint and Mackay 2009; Iles 2008). In addition, three other factors that
potentially affect statistical power were examined in simulations: minor allele frequency
(MAF) (three levels: 0.10, 0.25, 0.50), average genotype missing rate for each parent (five
levels: 0.00, 0.25, 0.50, 0.75, 1.00), and the schemes for generating score statistics (four
schemes: scheme 1, using the phenotype of both parents and offspring with covariate
adjustment; scheme 2, using the phenotype of offspring with covariate adjustment; scheme
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 6
NIH-PA Author Manuscript
3, using the phenotype of both parents and offspring without adjustment; scheme 4, using
the phenotype of offspring without adjustment). These five factors took up to 960 scenarios,
a comprehensive coverage that was expected to provide a broad reference for the method
investigated.
The samples were simulated based on two genetic designs in this study. The first one was a
discordant sib pair (DSP) design consisting of 300 families. If a sibling was affected for a
dichotomous trait, or a continuous phenotypic value of interest located in the upper 10% of
the distribution of a phenotype simulated, this individual was identified as a proband. When
a full sib of the proband did not reach the criterion for proband status, these two sibs as well
as their parents were recruited to the study. The second one comprised a mixture of three
categories of families (MF) consisting of two, three, and four sibs, respectively, and each
category had 100 families, of which at least two sibs had proband phenotypes. In the MF
design, the MAF of each locus was randomly assigned either 0.10, 0.25, or 0.50, and
parental genotype was set to a missing rate of 0.25.
NIH-PA Author Manuscript
Type I error rates of PII were examined with a DSP design consisting of a total of 300
families for both dichotomous and continuous outcomes to verify the robustness to
population admixture. To generate an admixed population, we portioned the 300 families
into 3 even groups; families in each group were randomly assigned an MAF of either 0.10,
0.25, or 0.50 to each locus, and simulated samples according to the ascertainment criteria
described above. Furthermore, we adopted β2 = 1 to check whether the existence of
covariates would inflate type I error rates under different schemes of calculating score
statistics. Simulations were replicated for 500 times and the empirical Type I error rate was
calculated. The simulated data were analyzed with PI and PII. In an exhaustive searching
strategy for all possible digenic models, the one that had the greatest CVC (the one with the
greatest TA was preferred if there was a tie in CVC) after 10 cross-validations was selected.
After the mathematical expectation and variance of TA were computed, the p value of the Z
score could be calculated, and we counted the interaction was significant at alpha level 0.05
if its p value was less than 0.05. Statistical power was calculated as the proportion of the
simulations yielding a significant p value at 0.05 significance level and the correct model in
all 200 simulations. For PI, 1,000 replications of shuffling the transmitted set and the
nontransmitted set, with each family as a permuting unit of phenotypic score, were used to
evaluate the empirical cutoff point of nominal 0.05 significance level (Lou et al. 2008).
The simulation results showed that, under the given scenarios, the empirical type I error
rates were well controlled as indicated in Table 1, regardless of population admixture and
schemes of generating score statistics, verifying the validity of the proposed test procedure.
NIH-PA Author Manuscript
To provide an overall picture of comparison between PI and PII, the average powers were
listed in Table 2. Each number derived from the DSP design listed in Table 2 was the mean
of 960 scenarios, whereas for the MF design that was the mean of 64 scenarios because
different levels of MAF and parental genotype missing rates had already been randomly
assigned in the MF design. In general, PII outperformed PI in average power for all cases
except for one; the average improvement in power for PII ranged from 0 to 0.14. The only
outlier was under the MF design for dichotomous traits simulated under the diagonal
models: PI was advanced by 0.02 in power. For both genetic designs, the power difference
of the two methods was < 0.10 for dichotomous traits but > 0.10 for continuous traits. In
three respects, PII appeared to perform better for continuous traits. First, there was a higher
averaged power compared with that for dichotomous traits. Second, the advantage of
averaged power of PII over PI was bigger for the continuous traits. Third, for two genetic
designs used, the greatest average powers were observed for continuous traits; those of the
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 7
DSP design and the MF design were 0.46 and 0.54, respectively, under the checkerboard
interaction model.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
To better demonstrate and compare the performance of two PedGMDR versions, we detailed
two sets, highlighted in bold in Table 2, of statistical power, in which one was of the DSP
design for dichotomous traits simulated under the checkerboard model and the other was of
the MF design for continuous traits simulated under the diagonal model, by drawing
Probability-Probability (PP) plots in which points should go along the diagonal if the two
methods were of equivalent performance in power. For the case of the DSP design for
dichotomous traits, simulated under the checkerboard model where the average powers were
of 0.40 for PII and of 0.35 for PI, respectively, the scattered points seemed to largely fall in
four groups as shown in the PP plot (Figure 2). Although a small proportion of points was
located off the diagonal due to a dropping of power of PI when interaction was either of 0.75
or of 1, most points fell on or near the diagonal area, in which, given a scenario, the power
difference of PII and PI was not greater than 0.20. To further investigate the robustness of
both methods in regard to power given different factors, we projected each point into two
sets of secondary panels, vertically oriented and horizontally oriented, generating the
distribution of 960 power scores along a simulation parameter in each panel. In PII (Figure
2), the power of 960 scenarios (points) was distributed in four clearly cut blocks defined by
the size of epistatic interaction effects (highlighted in yellow) while in PI, the power blocks
appeared to overlap with their neighboring one(s), suggesting that the interaction effect size
is a key determinant of power and that PII has a better discriminability. There was no
recognized difference in power distribution across different levels of covariate effects,
implying that the impact of covariates can be controlled in both PI and PII. Although,
compared with that of PI, the power distribution of PII seemed more unevenly distributed at
different levels of MAF, the effect of MAF was not essential for PII (neither for PI). Neither
parental genotype missing rates nor schemes of adjusting phenotypes were major players in
determining power distributions. The powers of both methods were largely determined by
the size of epistatic effects and remained robust to the other factors. A similar trend was also
found in the comparison of power for all dichotomous traits studied (data not shown).
NIH-PA Author Manuscript
For continuous traits simulated under the diagonal model in the MF design, we plotted a PP
distribution for the power values of PII and PI (Figure 3), where there was a difference of
0.11 in the average power as listed in Table 2. As the MAF and a genotype missing rate of
0.25 had already been used for generating the MF design, only three factors remained,
yielding a total of 64 scenarios (points). As shown in Figure 3, the power points were largely
located in the lower triangles, indicating a dominant performance of PII, which carried out
an average power of 0.45 compared with that of 0.34 for PI. When the effect of interaction
was small, where the powers of both PI and PII were close to zero, PII did not have a
distinguishably better power than that of PI. But PII had increased power when the
interaction effect parameter was greater than 0.25. As with the dichotomous traits, the
magnitude of interaction effects was the major factor affecting statistical power. The
magnitude of covariate effects and the schemes on adjusting phenotypes did not appear to
influence the power very much. For the other three average power comparisons between PII
and PI for continuous traits, the PP plots had a similar pattern (data not shown). In general,
PII appeared to have a better power than PI in detecting interaction underlying continuous
traits.
4. WORKED EXAMPLE
We applied PII to detect susceptibility genes to nicotine dependence (ND) in the U.S. MidSouth Tobacco Family. The data come from our previous reports (Lou et al. 2008; Mangold
et al. 2008). Briefly, all the participants involved in this study were recruited primarily from
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 8
NIH-PA Author Manuscript
Tennessee, Mississippi, and Arkansas in the U.S. during 1999–2004, and are of either
African-American (AA) or European-American (EA) origin. A proband smoker was
required to have smoked for at least the last 5 years, to smoke an average of 20 or more
cigarettes per day for the last 12 months, and to be at least 21 years of age. Once a smoker
proband was identified, all siblings and biological parents of the proband of interest were
recruited whenever possible, regardless of their smoking status. In this sample, there were a
total of 2,037 individuals, 1,366 individuals from 402 AA families, and 671 individuals from
200 EA families. For more detailed demographic and clinical characteristics of this study,
please refer to previous reports (Li et al. 2005; Li et al. 2006). All participants provided
informed consent. Institutional review boards approved all protocols, forms, and procedures
used in this study.
NIH-PA Author Manuscript
We focused on a pair of taste receptor genes TAS2R16 and TAS2R38 both on chromosome 7,
and each of them had three SNPs genotyped. The detailed genetic information of the six
SNPs is shown in Table 3. To demonstrate the use of the proposed PII method and
investigate whether there was an epistatic interaction between these two genes, we used the
Fagerström Test for ND score (FTND) (Heatherton et al. 1991), a well appreciated measure
for ND, as phenotype. The phenotype was adjusted for covariates age, sex, and ethnicity.
The PII results are summarized in Table 4. As shown in Table 4, a trilocus model of
rs846664 from TAS2R16, and rs1726866 and rs10246939 from TAS2R38, gave a significant
interaction with a p value of 5.4 × 10−5.
Human taste receptors, including type 2 taste receptor (TAS2Rs) are rich in taste buds of
gustatory papillae on the tongue surface and palate epithelia. Bitter sensitivity varies among
individuals, and previous genetic studies pointed to association between genetic variants
with TAS2R and diverse bitterness sensitivity (Kim et al. 2003). Psychologically,
stimulation at the receptors of bitterness in human tongues feedbacks a rejection of a
substance to avoid a potential toxic. As tobacco smoking basically exerts on human tongues
a pharmacological signal equivalent to bitterness, interaction among genes seems possible to
associate with ND. However, the role of TAS2R in the plasticity of smoking behavior is
complex; to profile their metabolic details, further investigation is required.
5. DISCUSSION
NIH-PA Author Manuscript
To detect G × G interactions poses a great challenge to statistical genetics in both the aspects
of statistical methodologies and computation feasibility. GMDR was recognized as an
efficient method to detect interactions, and in this study we proposed a new pedigree-based
approach to detecting G × G interactions underlying complex traits. As a pedigree-based
approach, it was robust to population admixture. Compared with a previously proposed
pedigree-based GMDR approach (Lou et al. 2008), the proposed method showed an
increased statistical power in a comprehensive set of simulation scenarios. As only
transmitted genotypes are used, PII halves the computing sample size compared with PI,
which uses both transmitted and nontransmitted genotypes. PII is consequently faster and
more economical in utilizing computer memory, representing a progress that may be
nontrivial in the exercise of genome-wide association studies for detecting G × G
interactions.
Both PII and PI combine GMDR and sufficient statistics together, but they are different in
using the transmitted and nontransmitted genotypes. PI infers the nontransmitted genotypes
of an individual to construct a control for each offspring, doubling the sample size. Then
statistics, such as TA and CVC, are calculated. Permutation is employed to evaluate the
significance of a selected interaction in PI. However, PII calculates the statistics on the
observed sample directly, and evaluates their p values by constructing the empirical
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 9
NIH-PA Author Manuscript
reference distributions on the basis of the sufficient statistic on a null distribution. An
analytical solution, in the context of discordant sib pair design, profiles the mechanical
difference of PII and PI in details (Chen 2009). Compared with other similar methods, PII is
advanced in tackling both dichotomous and continuous traits and allows phenotypic
outcomes to be adjusted with covariates.
As simulation studies remain a rule of thumb in evaluating the performance of methods, we
carried out an intensive simulation study of 4,096 scenarios, covering a set of consensus
factors probably perturbing the power of statistical methods in genetic epidemiological
studies. Given the present study, it was demonstrated that PII rivaled PI for dichotomous
traits and was more advantageous to detect interactions for continuous traits. However, the
real world involves more complicated circumstances which cannot be thoroughly scrutinized
but may distort statistical power of PII and PI. As the simulation study had largely focused
on digenic epistasis, whether the conclusion can straightforwardly be applied to detect G ×
G interaction over two loci still needs to be determined by simulations.
NIH-PA Author Manuscript
PII is model free, but prior information on genetic models can be taken into account because
g() promises flexibility to code additional information. For example, if there is evidence
supporting biological equivalence of a pair of genotypes, say, AA and Aa, we can tune g() to
code AA and Aa the same indicator. Currently, high throughput genotyping platforms
generate high density SNP data, promising a productive future for genome-wide association
studies of G × G interactions. To enhance the selection of tagging SNPs, it may be important
to select biologically relevant SNPs to control the burden of computation.
NIH-PA Author Manuscript
In general, as the generalized linear model is embedded to handle both continuous and
dichotomous traits, it is feasible to accommodate various kinds of data in genetic
epidemiology. The frameworks of PII, as well as PI, are flexible in that after some
straightforward modification, they can evolve to handle other kinds of issues in genetic
epidemiological studies. If we change the coding schema of t(), replacing it with a output
score given survival analysis, it can be applied to survival analysis in terms of the
fundamental roles of G × G interaction. It should also be noted that, in genetic
epidemiological studies, a set of related phenotypes, such as longitudinal data, are measured,
and PII can be easily extended to accommodate phenotypes of interest, and a test statistic
can be constructed as [TA−E(TA)]T V −[TA−E(TA)] ~ χ2, where V is the variancecovariance matrix of TA (a vector), has an asymptotically central χ2 distribution with its
degrees equal to the rank of V. However, heterogeneity of interactions, probably common in
ethnicity-specific diseases, can be a concern to the current approach, which only chooses the
best model but discards other competitive ones that potentially reveal diseases of different
etiologies. The appropriate methods to entertain such heterogeneities are needed.
Acknowledgments
This work was funded in part by the National Institutes of Health Grant R01 DA025095, R01 GM081488, R01
DK080100, and the National Science Foundation of China 30571131. We thank Dr. Yann C. Klimentidis for
critical reading of the manuscript.
References
Bastone L, Reilly M, Rader DJ, Foulkes AS. MDR and PRP: a comparison of methods for high-order
genotype-phenotype associations. Hum Hered. 2004; 58:82–92. [PubMed: 15711088]
Chen, GB. Developing Statistical Approaches to Detecting Gene-Gene Interactions Underlying
Complex Traits. Institute of Bioinformatics, Zhejiang University; Hangzhou, China: 2009.
Cook NR, Zee RY, Ridker PM. Tree and spline based association analysis of gene-gene interaction
models for ischemic stroke. Stat Med. 2004; 23:1439–1453. [PubMed: 15116352]
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 10
NIH-PA Author Manuscript
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Culverhouse R, Klein T, Shannon W. Detecting epistatic interactions contributing to quantitative traits.
Genet Epidemiol. 2004; 27:141–152. [PubMed: 15305330]
Flint J, Mackay TF. Genetic architecture of quantitative traits in mice, flies, and humans. Genome Res.
2009; 19:723–733. [PubMed: 19411597]
Hahn LW, Moore JH. Ideal discrimination of discrete clinical endpoints using multilocus genotypes. In
Silico Biol. 2004; 4:183–194. [PubMed: 15107022]
Hahn LW, Ritchie MD, Moore JH. Multifactor dimensionality reduction software for detecting genegene and gene-environment interactions. Bioinformatics. 2003; 19:376–382. [PubMed: 12584123]
Heatherton TF, Kozlowski LT, Frecker RC, Fagerstrom KO. The Fagerstrom test for nicotine
dependence: a revision of the Fagerstrom tolerance questionnaire. Br J Addict. 1991; 86:1119–
1127. [PubMed: 1932883]
Iles MM. What can genome-wide association studies tell us about the genetics of common disease.
PLoS Genet. 2008; 4:e33. [PubMed: 18454206]
Kim UK, Jorgenson E, Coon H, Leppert M, Risch N, et al. Positional cloning of the human
quantitative trait locus underlying taste sensitivity to phenylthiocarbamide. Science. 2003;
299:1221–1225. [PubMed: 12595690]
Kooperberg C, Ruczinski I, Leblanc ML, Hsu L. Sequence analysis using logic regression. Genet
Epidemiol. 2001; 21(Suppl 1):S626–631. [PubMed: 11793751]
Laird NM, Horvath S, Xu X. Implementing a unified approach to family-based tests of association.
Genet Epidemiol. 2000; 19(Suppl 1):S36–42. [PubMed: 11055368]
Lee S, Chung Y, Elston R, Kim Y, Park T. Log-linear model-based multifactor dimensionality
reduction method to detect gene gene interactions. Bioinformatics. 2007; 23:2589–2595.
[PubMed: 17872915]
Li MD, Beuten J, Ma JZ, Payne TJ, Lou XY, et al. Ethnic- and gender-specific association of the
nicotinic acetylcholine receptor alpha4 subunit gene (CHRNA4 ) with nicotine dependence. Hum
Mol Genet. 2005; 14:1211–1219. [PubMed: 15790597]
Li MD, Payne TJ, Ma JZ, Lou XY, Zhang D, et al. A genomewide search finds major susceptibility
loci for nicotine dependence on chromosome 10 in African Americans. Am J Hum Genet. 2006;
79:745–751. [PubMed: 16960812]
Lou XY, Chen GB, Yan L, Ma JZ, Mangold JE, et al. A combinatorial approach to detecting genegene and gene-environment interactions in family studies. Am J Hum Genet. 2008; 83:457–467.
[PubMed: 18834969]
Lou XY, Chen GB, Yan L, Ma JZ, Zhu J, et al. A generalized combinatorial approach for detecting
gene-by-gene and gene-by-environment interactions with application to nicotine dependence. Am
J Hum Genet. 2007; 80:1125–1137. [PubMed: 17503330]
Lunetta KL, Faraone SV, Biederman J, Laird NM. Family-based tests of association and linkage that
use unaffected sibs, covariates, and interactions. Am J Hum Genet. 2000; 66:605–614. [PubMed:
10677320]
Mangold JE, Payne TJ, Ma JZ, Chen G, Li MD. Bitter taste receptor gene polymorphisms are an
important factor in the development of nicotine dependence in African Americans. J Med Genet.
2008; 45:578–582. [PubMed: 18524836]
Martin ER, Ritchie MD, Hahn L, Kang S, Moore JH. A novel method to identify gene-gene effects in
nuclear families: the MDR-PDT. Genet Epidemiol. 2006; 30:111–123. [PubMed: 16374833]
Moore JH, Gilbert JC, Tsai CT, Chiang FT, Holden T, et al. A flexible computational framework for
detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of
human disease susceptibility. J Theor Biol. 2006; 241:252–261. [PubMed: 16457852]
Motsinger AA, Ritchie MD. The effect of reduction in cross-validation intervals on the performance of
multifactor dimensionality reduction. Genet Epidemiol. 2006; 30:546–555. [PubMed: 16800004]
Nelson MR, Kardia SL, Ferrell RE, Sing CF. A combinatorial partitioning method to identify
multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 2001;
11:458–470. [PubMed: 11230170]
Phillips P. Epistasis–the essential role of gene interactions in the structure and evolution of genetic
systems. Nat Rev Genet. 2008; 9:855–867. [PubMed: 18852697]
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 11
NIH-PA Author Manuscript
Price A, Patterson N, Plenge R, Weinblatt M, Shadick N, et al. Principal components analysis corrects
for stratification in genome-wide association studies. Nat Genet. 2006; 38:904–909. [PubMed:
16862161]
Rabinowitz D, Laird N. A unified approach to adjusting association tests for population admixture
with arbitrary pedigree structure and arbitrary missing marker information. Hum Hered. 2000;
50:211–223. [PubMed: 10782012]
Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, et al. Multifactor-dimensionality reduction
reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am J
Hum Genet. 2001; 69:138–147. [PubMed: 11404819]
Tahri-Daizadeh N, Tregouet DA, Nicaud V, Manuel N, Cambien F, et al. Automated detection of
informative combined effects in genetic association studies of complex traits. Genome Res. 2003;
13:1952–1960. [PubMed: 12902385]
Zhu J, Hastie T. Classification of gene microarrays by penalized logistic regression. Biostatistics.
2004; 5:427–443. [PubMed: 15208204]
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 12
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Figure 1.
Illustration of the multifactor reduction algorithm of PII. Summary of the steps involved in
implementing the data reduction algorithm (adapted from the work of Lou et al. (2007)) in
PII, under the context of discordant sib pair design and without adjustment of phenotypic
outcomes with the covariate(s). For a detailed description of the steps, please see the
“Multifactor-reduction algorithm” subsection. In step 3, bars represent hypothetical
distributions of affected individuals (left, dark shading) and unaffected individuals (right,
light shading); numbers not in parentheses above bars are the numbers of affected and
unaffected individuals, and those in parentheses are the sums of the scores. In steps 4 and 6,
numbers not in parentheses are the ratios of the number of cases to the number of controls,
and those in parentheses are the average scores. “High-risk” cells are indicated by dark
shading, “low-risk” cells by light shading, and “empty” cells by no shading.
NIH-PA Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 13
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Figure 2.
NIH-PA Author Manuscript
A Probability-Probability plot of the power of PI and PII across 960 scenarios for
dichotomous traits simulated under the checkerboard model in the discordant sib pair design.
In the main panel, the top-right one, the horizontal and the vertical axes are statistical
powers of PII and PI, respectively, and the horizontal and the vertical coordinates of each
point are determined by the statistical power of PII and PI of a given scenario. The
distributions of the simulation parameters are represented graphically in the vertically tiled
panels and the horizontally tiled panels for PI and PII, respectively. The horizontal axes of
the vertically tiled panels are the power of PII; the vertical axes of the horizontally tiled
panels are the power of PI.
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 14
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Figure 3.
NIH-PA Author Manuscript
A Probability-Probability plot of the power of PI and PII across 64 scenarios for continuous
traits simulated under the diagonal model in the mixed families design. In the main panel,
the horizontal and the vertical axes are statistical powers of PII and PI, respectively, and the
horizontal and the vertical coordinates of each point are determined by the statistical power
of PII and PI of a given scenario. The distributions of the simulation parameters are
represented graphically in the vertical panels and the horizontal panels for PI and PII,
respectively (in the mixed families design, minor allele frequency and parental missing
genotype rates were built-in parameters in generating populations). The distributions of the
simulation parameters are represented graphically in the vertical panels and the horizontal
panels for PI and PII, respectively. The meanings of the vertical and horizontal axes are the
same as in their corresponding panels in Figure 2.
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 15
Table 1
Type I error rate of PII at 0.05 significance level
NIH-PA Author Manuscript
Dichotomous Traits
Including Founders
Continuous Traits
Adjustment
Without Adjustment
Adjustment
Without Adjustment
True
0.044
0.044
0.050
0.048
False
0.060
0.045
0.050
0.036
We simulated 300, of three subpopulations consisting of 100 families in each, discordant sib pairs. For families in each category, the MAF of each
locus was randomly assigned either 0.1, 0.25, or 0.5 independently.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.
NIH-PA Author Manuscript
0.34
Diagonal
0.36
0.38
0.32
0.35
PI
0.45
0.54
0.41
0.46
PII
0.34
0.40
0.29
0.35
PI
Continuous Traits
For the MF design, the power was averaged over 64 scenarios from different combinations of 3 factors, excluding the MAF and genotype missing rate of parents, which were built into the MF design.
Average powers in bold were compared and illustrated in Figures 2 and 3.
b
For the DSP design, the power was averaged over 960 scenarios from different combinations of 5 factors, MAF, genotype missing rate of parents, magnitude of a covariate, magnitude of a interaction, and
the scheme of adjusting phenotypes.
a
0.46
Diagonal
Checkerboard
0.32
Checkerboard
DSPa
MFb
0.40
Model
Design
PII
Dichotomous Traits
NIH-PA Author Manuscript
Average power of PII and PI under various scenarios
NIH-PA Author Manuscript
Table 2
Chen et al.
Page 16
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 17
Table 3
Information on the SNPs scored in the two genes in this study
NIH-PA Author Manuscript
Gene
Chromosome
TAS2R38
7
TAS2R16
7
dbSNP ID
Positiona
rs713598
141319814
C/G
rs1726866
141319174
G/A
Allele
rs10246939
141319073
T/C
rs2233989
122422465
A/G
rs846664
122422409
A/C
rs1204014
122422079
C/T
a
The information was provided at NCBI dbSNP Build 131 for Human.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.
Chen et al.
Page 18
Table 4
Interaction of TAS2R16 and TAS2R38 detected by PII
NIH-PA Author Manuscript
No. of Loci
Model
Testing Accuracy
Z score
p value
0.099
1
rs846664
0.745
1.29
2
rs1204014, rs846664
0.745
−0.27
0.606
3
rs846664, rs1726866, rs10246939
0.816
3.87
5.4 × 10−5
4
rs846664, rs713598, rs1726866, rs10246939
0.818
3.19
0.00071
5
rs1204014, rs846664, rs713598, rs1726866, rs10246939
0.823
3.24
0.00060
The SNP IDs in italic font are located in TAS2R38.
NIH-PA Author Manuscript
NIH-PA Author Manuscript
Stat Interface. Author manuscript; available in PMC 2011 September 15.