Gene-Environment Case-Control
Studies
Raymond J. Carroll
Department of Statistics
Center for Statistical Bioinformatics
Institute for Applied Mathematics and
Computational Science
Texas A&M University
http://stat.tamu.edu/~carroll
Advertising
• Training: We are finishing Year 08 of an NCI-funded R25T training program
• http://www.stat.tamu.edu/b3nc
• We train statistically and computationally
oriented post-docs in the biology of nutrition and
cancer
• Active seminar series
Outline
• Problem: Case-Control Studies with Gene-Environment relationships
• Efficient formulation when genes are observed
• Haplotype modeling and Robustness
• Applications
Acknowledgment
• This work is joint with Nilanjan Chatterjee (NCI)
and Yi-Hau Chen (Academia Sinica)
Software
• SAS and Matlab Programs Available at my web
site under the software button
http://stat.tamu.edu/~carroll
• Examples are given in the programs
• Papers are in Biometrika (2005), Genetic
Epidemiology (2006), Biostatistics (2007),
Biometrics (2008) and JASA (2009)
• R programs available from the NCI
Basic Problem Formalized
• Gene and Environment
• Question: For women who carry the BRCA1/2
mutation, does oral contraceptive use
provide any protection against ovarian cancer?
Basic Problem Formalized
• Gene and Environment
• Question: For people carrying a particular
haplotype in the VDR pathway, do higher
levels of serum Vitamin D protect against
prostate cancer?
Basic Problem Formalized
• Gene and Environment
• Question: If you are a current smoker, are
you protected against colorectal adenoma if you
carry a particular haplotype in the NAT2
smoking metabolism region?
Prospective and Retrospective Studies
• D = disease status (binary)
• X = environmental variables
• Smoking status
• Vitamin D
• Oral contraceptive use
• G = gene status
• Mutation or not
• Multiple or single SNP
• Haplotypes
Prospective and Retrospective Studies
• Prospective: Classic random sampling of a
population
• You measure gene and environment on a cohort
• You then follow up people for disease
occurrence
Prospective and Retrospective Studies
• Prospective Studies:
• Expensive: disease states are rare, so large
sample sizes needed
• Time-consuming: you have to wait for disease
to develop
• They Exist: Framingham Heart Study, NIH-AARP Diet and Health Study, Women’s Health
Initiative, etc.
Prospective and Retrospective Studies
• Prospective Studies:
• Daunting Task: Only very large, very
expensive prospective studies can find gene-environment interactions
• Data Access: Access to the Framingham Heart
Study requires a university commitment to
security
Prospective and Retrospective Studies
• Retrospective Studies: Usually called case-control studies
• Find a population of cases, i.e., people with a
disease, and sample from it.
• Find a population of controls, i.e., people
without the disease, and sample from it.
Prospective and Retrospective Studies
• Retrospective Studies: So called because the gene G
and the environment X are sampled after disease
status is ascertained
• Microarray studies on humans: most are
case-control studies
• Genome Wide Association Studies
(GWAS): most are case-control studies
Prospective and Retrospective Studies
• Case-control Studies:
• Fast: no need to wait for disease to develop
• Cheap: sample sizes are much smaller
• Subtle: The controls need to be representative
of the population of people without the disease.
Basic Problem Formalized
• Case control sample: D = disease
• Gene expression: G
• Environment, can include strata: X
• We are interested in main effects for G and X
along with their interaction as they affect
development of disease
Basic Problem Formalized
• 99.9999% of analyses of case-control data use
logistic regression
• Closely related to Fisher’s Linear Discriminant
Analysis (LDA)
• Difference: we want to understand what
targets affect disease, not just predict disease
Logistic Regression
• Logistic Function:
$$H(x) = \frac{1}{1 + \exp(-x)} \approx \exp(x)$$
• The approximation works for rare diseases
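A quick numerical check of the rare-disease approximation may help here. The sketch below compares H(x) with exp(x); the name `H` follows the slide, while the particular x values are illustrative assumptions.

```python
import math

def H(x):
    """Logistic function H(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

# Compare H(x) with exp(x). The two agree when exp(x) is small,
# i.e. when the disease probability is small.
for x in (-7.0, -5.0, -3.0, 0.0):
    ratio = H(x) / math.exp(x)  # algebraically equal to 1 / (1 + exp(x))
    print(f"x={x:5.1f}  H(x)={H(x):.5f}  exp(x)={math.exp(x):.5f}  ratio={ratio:.3f}")
```

The ratio approaches 1 as x becomes more negative, and drops to one half at x = 0, where the disease would no longer be rare.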
Prospective Models
• Simplest logistic model without an interaction
$$\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_1 G + \beta_2 X)$$
• The effect of having a mutation (G=1) versus not
(G=0) is
$$\mathrm{pr}(D=1 \mid G=1, X) \approx \mathrm{pr}(D=1 \mid G=0, X)\exp(\beta_1)$$
Prospective Models
• Simplest logistic model with an interaction
$$\mathrm{pr}(D=1 \mid G, X) = H(\beta_0 + \beta_1 G + \beta_2 X + \beta_3 G X)$$
• The effect of having a mutation (G=1) versus not
(G=0) is
$$\mathrm{pr}(D=1 \mid G=1, X) \approx \mathrm{pr}(D=1 \mid G=0, X)\exp(\beta_1 + \beta_3 X)$$
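To make the role of the interaction concrete, here is a minimal sketch of how the mutation effect changes with X. The quantity exp(β1 + β3 X) is exactly the odds ratio for G=1 versus G=0 and, for a rare disease, approximately the risk ratio. The function name `mutation_effect` and the coefficient values are hypothetical, chosen only for illustration.

```python
import math

def mutation_effect(beta1, beta3, x):
    """Odds ratio for G=1 versus G=0 at environment level x.

    Equals exp(beta1 + beta3 * x); for a rare disease this is also
    approximately the risk ratio.
    """
    return math.exp(beta1 + beta3 * x)

# Hypothetical coefficients, not taken from any study.
beta1, beta3 = 0.7, -0.5
for x in (0, 1):
    print(f"X={x}: effect of mutation = {mutation_effect(beta1, beta3, x):.3f}")
```

With β3 = 0 the effect would be the same at every X, which is the no-interaction model.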
Empirical Observations
• Logistic regression is in every statistical package
• Unfortunately, logistic regression is not
efficient for understanding interactions
• Much larger sample sizes are required for
interactions than for just gene effects
• Most gene-environment interaction case-control
studies fail for this reason
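One way to see why interactions need more data is Woolf's approximation, in which the variance of a log odds ratio from a 2x2 table is the sum of the reciprocals of the four cell counts. The sketch below is an illustration under that approximation with binary G and X; all counts are hypothetical.

```python
def woolf_var(a, b, c, d):
    """Approximate variance of a log odds ratio from a 2x2 table (Woolf's formula)."""
    return 1 / a + 1 / b + 1 / c + 1 / d

# Hypothetical counts, ordered as
# (carrier cases, carrier controls, non-carrier cases, non-carrier controls).
unexposed = (30, 20, 220, 230)   # stratum X = 0
exposed = (25, 15, 225, 235)     # stratum X = 1

# Gene effect: one table pooled over X.
pooled = tuple(u + e for u, e in zip(unexposed, exposed))
var_main = woolf_var(*pooled)

# Interaction: difference of the two stratum-specific log odds ratios,
# so the two variances add (the strata are independent).
var_interaction = woolf_var(*unexposed) + woolf_var(*exposed)

print(f"variance ratio, interaction vs. gene effect: {var_interaction / var_main:.2f}")
```

Because the interaction estimate relies on eight cells, each holding fewer subjects, its variance is several times that of the pooled gene effect in this toy example.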
Empirical Observations
• Statistical Theory: There is a lovely statistical
theory available
• It says: ignore the fact that you have a case-control sample, and pretend you have a
prospective study
• It all works out: don’t worry, be happy!
Empirical Observations
• Statistical Theory: Ordinary logistic regression
applied to a case-control study makes no
assumptions about the population distribution
of (G,X)
• Remember: we do not have a sample from a
population, only a case-control sample
• Logistic regression is robust: to assumptions
about the population distribution of (G,X)
Likelihood Function
• The likelihood is
$$\mathrm{pr}(X=x, G=g \mid D=d) = \mathrm{pr}(X=x, G=g)\,\frac{\mathrm{pr}(D=d \mid X=x, G=g)}{\mathrm{pr}(D=d)}$$
• Note how the likelihood depends on two things:
• The distribution of (X,G) in the population
• The probability of disease in the population
• Neither can be estimated from the case-control study
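The retrospective likelihood can be evaluated exactly on a toy population. The sketch below applies Bayes' rule with binary G and X, and checks that the G-D odds ratio computed from the retrospective distribution equals exp(β1) at X=0, even though the likelihood itself involves the population quantities pr(G, X) and pr(D). All parameter values and population frequencies are hypothetical.

```python
import math

def H(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical logistic parameters and population distribution pr(G=g, X=x).
beta0, beta1, beta2, beta3 = -4.0, 0.7, 0.4, -0.5
p_gx = {(0, 0): 0.56, (0, 1): 0.24, (1, 0): 0.14, (1, 1): 0.06}

def p_d_given(d, g, x):
    """Prospective model pr(D=d | G=g, X=x)."""
    p1 = H(beta0 + beta1 * g + beta2 * x + beta3 * g * x)
    return p1 if d == 1 else 1.0 - p1

def p_d(d):
    """Marginal pr(D=d); requires the population distribution of (G, X)."""
    return sum(p_gx[g, x] * p_d_given(d, g, x) for g, x in p_gx)

def retrospective(g, x, d):
    """pr(G=g, X=x | D=d) by Bayes' rule."""
    return p_gx[g, x] * p_d_given(d, g, x) / p_d(d)

# Odds ratio of D with G at X = 0, computed from the retrospective distribution.
or_x0 = (retrospective(1, 0, 1) * retrospective(0, 0, 0)) / (
    retrospective(0, 0, 1) * retrospective(1, 0, 0)
)
print(or_x0, math.exp(beta1))
```

The population terms cancel in the odds ratio, which is one way to see why ordinary logistic regression still recovers the effects of G and X from case-control data.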
When G is observed
• Logistic regression is thus robust to any
modeling assumptions about the covariates in
the population
• Unfortunately it is not very efficient for
understanding interactions
Gene-Environment Independence
• In many situations, it may be reasonable to
assume G and X are independently distributed in
the underlying population, possibly after
conditioning on strata
• This assumption is often used in gene-environment interaction studies
G-E Independence
• Does not always hold!
• Example: polymorphisms in the smoking
metabolism pathway may affect the degree of
addiction
Gene-Environment Independence
• If you’re willing to make assumptions about
the distributions of the covariates in the
population, more efficiency can be obtained.
• This is NOT TRUE for prospective studies, only
true for retrospective studies.
Gene-Environment Independence
• The reason is that you are putting a constraint on
the retrospective likelihood
$$\mathrm{pr}(X=x, G=g \mid D=d) = \mathrm{pr}(X=x, G=g)\,\frac{\mathrm{pr}(D=d \mid X=x, G=g)}{\mathrm{pr}(D=d)} = \mathrm{pr}(X=x)\,\mathrm{pr}(G=g)\,\frac{\mathrm{pr}(D=d \mid X=x, G=g)}{\mathrm{pr}(D=d)}$$
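A simple consequence of the independence constraint, offered here as an illustration rather than as the method of the talk: when G and X are independent in the population and the disease is rare, the G-X odds ratio among cases alone is close to exp(β3). The sketch below checks this on a toy population with binary G and X; all numerical values are hypothetical.

```python
import math

def H(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for illustration only.
beta0, beta1, beta2, beta3 = -5.0, 0.7, 0.4, -0.5
p_g1, p_x1 = 0.2, 0.3   # pr(G=1) and pr(X=1) in the population

def p_gx(g, x):
    """Population pr(G=g, X=x) under gene-environment independence."""
    return (p_g1 if g else 1 - p_g1) * (p_x1 if x else 1 - p_x1)

def case_weight(g, x):
    """Proportional to pr(G=g, X=x | D=1); the factor 1/pr(D=1) cancels in odds ratios."""
    return p_gx(g, x) * H(beta0 + beta1 * g + beta2 * x + beta3 * g * x)

# Odds ratio between G and X among the cases only.
case_only_or = (case_weight(1, 1) * case_weight(0, 0)) / (
    case_weight(1, 0) * case_weight(0, 1)
)
print(case_only_or, math.exp(beta3))
```

The two printed numbers differ only through the rare-disease approximation. If G and X were associated in the population, the case-only odds ratio would absorb that association and no longer track the interaction.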
Gene-Environment Independence
• Our Methodology: Is far more general than
assuming that genetic status and environment
are independent
• We have developed capacity for modeling the
distribution of genetic status given strata
and environmental factors
• I will skip this and just pretend G-E independence
here
More Efficiency, G Observed
• Our model: G-E independence and a genetic
model, e.g., Hardy-Weinberg Equilibrium
$$\mathrm{pr}(G=g) = q(g \mid \theta)$$
• Consequences:
• More efficient estimation of G effects
• Much more efficient estimation of G-E interactions.
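As a concrete instance of a genetic model q(g | θ), the sketch below writes out the Hardy-Weinberg genotype probabilities for a single biallelic locus, where g counts copies of the variant allele and θ is its allele frequency. The function name and the value θ = 0.1 are assumptions for illustration.

```python
def hwe_genotype_probs(theta):
    """q(g | theta) for g = 0, 1, 2 copies of the variant allele under HWE.

    theta is the population frequency of the variant allele.
    """
    return {
        0: (1 - theta) ** 2,
        1: 2 * theta * (1 - theta),
        2: theta ** 2,
    }

probs = hwe_genotype_probs(0.1)  # theta = 0.1 is an illustrative value
for g, p in probs.items():
    print(f"pr(G={g}) = {p:.3f}")
```

A single parameter θ determines the whole genotype distribution, which is part of what makes the model-based analysis more efficient.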
The Formulation
• Any logistic model works
$$\mathrm{pr}(D=1 \mid G, X) = H\{\beta_0 + m(G, X, \beta_1)\}, \qquad \mathrm{pr}(G=g) = q(g \mid \theta)$$
• X: nonparametric, multi-dimensional
• Question: What methods do we have to
construct estimators?
Methodology
• I won’t give you the full methodology, but it
works as follows.
• Case-control studies are very close to a
prospective (random sampling) study, with the
exception that sometimes you do not observe
people
Pretend Missing Data Formulation
• Suppose you have a large but finite population
of size N
• Then there are $N\pi_1$ people with the disease and $N\pi_0$
without the disease, where $\pi_d = \mathrm{pr}(D=d)$
Pretend Missing Data Formulation
• In a case-control sample, we randomly select n1
with the disease, and n0 without.
• The fraction of people with disease status D=d
that we observe is
$$\frac{n_d}{N\pi_d}$$
Pretend Missing Data Formulation
• Pretend you randomly sample a population
• You observe a person who has D=d with the probability
$$\frac{n_d}{N\pi_d}$$
(the selection indicator, written $\Delta$ here, equals 1 for an observed person)
• Statisticians know how to deal with missing data, e.g.,
compute probabilities for what you actually see
Pretend Missing Data Formulation
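• In this pretend missing data formulation,
ordinary logistic regression is simply
$$\mathrm{pr}(D=d \mid G=g, \Delta=1, X)$$
where $\Delta = 1$ indicates that the person is selected into the sample
• We have a model for G given X, hence we
compute
$$\mathrm{pr}(D=d, G=g \mid \Delta=1, X)$$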
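The pretend missing data view also shows why ordinary logistic regression works on case-control data. If a person with D=d is selected with probability n_d/(N π_d), then the disease probability among selected people is again logistic, with the intercept shifted by the log ratio of the two selection probabilities. The sketch below checks this identity numerically; N, π1, n1 and n0 are hypothetical values.

```python
import math

def H(x):
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical values for illustration only.
N = 1_000_000        # population size
pi1 = 0.01           # pr(D=1) in the population
n1, n0 = 500, 500    # numbers of sampled cases and controls

s1 = n1 / (N * pi1)          # selection probability for cases
s0 = n0 / (N * (1 - pi1))    # selection probability for controls

def p_case_given_selected(eta):
    """pr(D=1 | G, X, selected) when pr(D=1 | G, X) = H(eta)."""
    p1 = H(eta)
    return s1 * p1 / (s1 * p1 + s0 * (1 - p1))

shift = math.log(s1 / s0)    # intercept shift induced by the sampling
for eta in (-6.0, -4.5, -3.0):
    print(eta, p_case_given_selected(eta), H(eta + shift))
```

Only the intercept changes, so the coefficients of G, X and their interaction are untouched by the sampling.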
Methodology
• Our method has an explicit form, i.e., no
integrals or anything nasty
• It is easy to program the method to estimate
the logistic model
• It is likelihood based. Technically, a
semiparametric profile likelihood
Methodology
• We can handle missing gene data
• We can handle error in genotyping
• We can handle measurement errors in
environmental variables, e.g., diet
Methodology
• Our method results in much more efficient
statistical inference
More Data
• What does “more efficient statistical
inference” mean?
• It means, effectively, that you have more data
• When G is a simple mutation, our
method is typically equivalent to having 3
times more data
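The "more data" interpretation can be made explicit. Since standard errors shrink roughly like one over the square root of the sample size, the squared ratio of two standard errors gives the approximate factor by which the sample size would have to grow for the less efficient method to catch up. The function name below is a hypothetical helper, not part of the authors' software.

```python
def relative_efficiency(se_free, se_model):
    """Squared ratio of standard errors (a variance ratio).

    Roughly the factor by which the sample size would have to grow for the
    model-free analysis to match the precision of the model-based one.
    """
    return (se_free / se_model) ** 2

# Halving a standard error (or a confidence-interval length)
# corresponds to about four times the data.
print(relative_efficiency(1.0, 0.5))  # 4.0
```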
How much more data: Typical
Simulation Example
• The increase in effective sample size when
using our methodology
[Figure: increase in effective sample size (vertical axis from 0 to 4) for the G, X, and G times X effects, with series pr(G)=.05 and pr(G)=.20]
Real Data Complexities
• The Israeli Ovarian Cancer Study
• G = BRCA1/2 mutation (very deadly)
• X includes
• age,
• ethnic status (below),
• parity,
• oral contraceptive use
• Family history
• Smoking
• Etc.
Real Data Complexities
• In the Israeli Study, G is missing in 50% of
the controls, and 10% of the cases
• Also, among Jewish citizens, Israel has two
dominant ethnic types
• Ashkenazi (European)
• Sephardic (North African)
Real Data Complexities
• The gene mutation BRCA1/2 is frequent
among the Ashkenazi, but rare among the
Sephardic
• Thus, if one component of X is ethnic status,
then pr(G=1 | X) depends on X
• Gene-Environment independence fails
here
• What can be done? Model pr(G=1 | X) as
binary with different probabilities!
Israeli Ovarian Cancer Study
• Question: Can carriers of the BRCA1/2
mutation be protected via OC-use?
Typical Empirical Example
Israeli Ovarian Cancer Study
• Main Effect of BRCA1/2:
Israeli Ovarian Cancer Study
• Odds ratio for OC use among carriers = 1.04
(0.98, 1.09)
• No evidence for protective effect
• Not available from case-only analysis
• The confidence interval is ½ the length of
the interval from the usual analysis
Haplotypes
• Haplotypes consist of what we get from our
mother and father at more than one site
• Mother gives us the haplotype hm = (Am,Bm)
• Father gives us the haplotype hf = (af,bf)
• Our diplotype is Hdip = {(Am,Bm), (af,bf)}
Haplotypes
• Unfortunately, we cannot presently observe the
two haplotypes
• We can only observe genotypes
• Thus, if we were really Hdip = {(Am,Bm), (af,bf)},
then the data we would see would simply be
the unordered set (A,a,B,b)
Missing Haplotypes
• Thus, if we were really Hdip = {(Am,Bm), (af,bf)},
then the data we would see would simply be
the unordered set (A,a,B,b)
• However, this is also consistent with a different
diplotype, namely Hdip = {(am,Bm), (Af,bf)}
• Note that the number of copies of the (a,b)
haplotype differs in these two cases
• The true diplotype (haplotype pair) is missing
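The phase ambiguity can be enumerated directly. The sketch below lists every unordered haplotype pair that is consistent with an unphased genotype, using the two-site example from the slides. The function name and the data representation (one unordered allele pair per site) are assumptions made for illustration.

```python
from itertools import product

def compatible_diplotypes(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype.

    genotype: one unordered allele pair per site, e.g. [("A", "a"), ("B", "b")].
    """
    diplotypes = set()
    # At each site, choose which of the two alleles sits on the first haplotype.
    for choice in product((0, 1), repeat=len(genotype)):
        h1 = tuple(site[c] for site, c in zip(genotype, choice))
        h2 = tuple(site[1 - c] for site, c in zip(genotype, choice))
        diplotypes.add(tuple(sorted((h1, h2))))  # sorting makes the pair unordered
    return sorted(diplotypes)

pairs = compatible_diplotypes([("A", "a"), ("B", "b")])
for h1, h2 in pairs:
    print(h1, h2)
```

For the doubly heterozygous genotype (A,a,B,b) two diplotypes come out, one with a copy of the (a,b) haplotype and one without. A genotype with at most one heterozygous site has a single compatible diplotype.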
Missing Haplotypes
• Our methods handle unphased diplotypes
(missing haplotypes) with no problem.
• Standard EM-algorithm calculations can be
used
• We assume that the haplotypes are in HWE, and
have extended to cases of non-HWE
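To show what a standard EM calculation for missing phase looks like, here is a minimal sketch that estimates haplotype frequencies from unphased genotypes under HWE in a simple random sample. It is the textbook haplotype-frequency EM, not the authors' retrospective-likelihood method. Genotypes are coded as per-site counts (0, 1 or 2) of the variant allele, and the data at the bottom are hypothetical.

```python
from itertools import product
from collections import Counter

def compatible_pairs(genotype):
    """Unordered haplotype pairs consistent with per-site variant-allele counts."""
    site_options = []
    for count in genotype:
        if count == 0:
            site_options.append([(0, 0)])
        elif count == 2:
            site_options.append([(1, 1)])
        else:  # heterozygous site: phase unknown
            site_options.append([(0, 1), (1, 0)])
    pairs = set()
    for combo in product(*site_options):
        h1 = tuple(a for a, _ in combo)
        h2 = tuple(b for _, b in combo)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_freqs(genotypes, n_iter=200):
    """EM estimate of haplotype frequencies from unphased genotypes, assuming HWE."""
    n_sites = len(genotypes[0])
    haps = list(product((0, 1), repeat=n_sites))
    freq = {h: 1.0 / len(haps) for h in haps}  # uniform starting values
    for _ in range(n_iter):
        expected = Counter()
        for g in genotypes:
            pairs = compatible_pairs(g)
            # E-step: posterior weight of each compatible pair under HWE
            w = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2) for h1, h2 in pairs]
            total = sum(w)
            for (h1, h2), wi in zip(pairs, w):
                expected[h1] += wi / total
                expected[h2] += wi / total
        # M-step: expected haplotype counts divided by the 2n chromosomes
        freq = {h: expected[h] / (2 * len(genotypes)) for h in haps}
    return freq

# Hypothetical data: three people homozygous at both sites, one double heterozygote.
genotypes = [(0, 0)] * 3 + [(1, 1)]
freq = em_haplotype_freqs(genotypes)
for h, f in freq.items():
    print(h, round(f, 4))
```

In this toy data set the only ambiguous person is the double heterozygote, and the EM iterations assign that person's phase toward the haplotypes already common in the rest of the sample.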
Robustness
• Robustness: We are making assumptions to
gain efficiency = “get more data”
• What happens if the assumptions are wrong?
• Biases, incorrect conclusions, etc.
• How can we gain efficiency when it is warranted,
and yet have valid inferences?
Two Likelihoods
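• In our “pretend” missing data formulation, the
model-free estimator uses the likelihood
$$\mathrm{pr}(D=d \mid G=g, \Delta=1, X)$$
• The model-based estimator uses the likelihood
$$\mathrm{pr}(D=d, G=g \mid \Delta=1, X)$$
(here $\Delta = 1$ indicates selection into the case-control sample)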
Two Likelihoods
• The two likelihoods lead to two estimators, $\hat\beta_{\mathrm{free}}$ and $\hat\beta_{\mathrm{model}}$
• The former is robust but not efficient
• The latter is efficient but not robust
• What to do?
Empirical Bayes
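• We chose an Empirical Bayes approach
$$\hat\beta_{\mathrm{EB}} = \hat\beta_{\mathrm{free}} + K(\hat\beta_{\mathrm{model}} - \hat\beta_{\mathrm{free}})$$
• Let $V = \mathrm{cov}(\hat\beta_{\mathrm{model}} - \hat\beta_{\mathrm{free}})$ and $\hat\psi = \hat\beta_{\mathrm{model}} - \hat\beta_{\mathrm{free}}$
• Then $K$ is diagonal with elements
$$\frac{v_j}{v_j + \hat\psi_j^2}$$
where $v_j$ is the j-th diagonal element of $V$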
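The shrinkage rule is easy to compute once the two estimates and the variances v_j are available. The sketch below applies it componentwise, with weight v_j / (v_j + ψ_j²) on the difference ψ_j between the model-based and model-free estimates. The function name and the input values are hypothetical, and the variances are assumed to be supplied by the user.

```python
def empirical_bayes(beta_free, beta_model, v):
    """Componentwise shrinkage of the model-free estimate toward the model-based one.

    beta_EB_j = beta_free_j + K_j * psi_j, with
    psi_j = beta_model_j - beta_free_j and K_j = v_j / (v_j + psi_j ** 2).
    """
    beta_eb = []
    for bf, bm, vj in zip(beta_free, beta_model, v):
        psi = bm - bf
        k = vj / (vj + psi ** 2)
        beta_eb.append(bf + k * psi)
    return beta_eb

# Hypothetical inputs: the estimates agree closely in the first component
# and disagree strongly in the second.
print(empirical_bayes([0.50, 0.0], [0.51, 10.0], [0.04, 0.01]))
```

When the two estimates are close relative to v_j, the weight is near 1 and the result is near the model-based estimate. When they are far apart, the weight is near 0 and the result falls back to the model-free estimate.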
Comments on Empirical Bayes
• If the model fails, then the estimator
converges to the model-free estimator
• If the model holds, the estimator estimates
the right thing, but is much more efficient
than the model-free estimator
Simulations
• Various simulations show the following
• If the model holds, EB is
• slightly less efficient than model-based
• much more efficient than model-free
• If the model fails,
• Model-based is badly biased
• EB and shrinkage eliminate most bias, at least as
efficient as model-free
Example 1: Prostate Cancer
• G = SNPs in the Vitamin D Pathway
• X = Serum-level biomarker of vitamin D (diet
and sun)
• The VDR gene is downstream in the pathway,
hence unlikely to influence the level of X
• Gene-environment independence likely
Example I: Vitamin D
Example 2: Colorectal Adenoma
• G = SNPs in the NAT2 gene, which is important
in the metabolism of
• X = Various measures of smoking history
• The NAT2 gene may make smokers more
addicted
• Gene-environment independence unlikely
The NAT2 Example
• Current smoking and 101010 haplotype
interaction coefficient

Method            Estimate   s.e.   p-value
Model Free          -0.63    0.17    0.014
Independence        -0.33    0.16    0.048
Consistent EB1      -0.59    0.25    0.017

• Current smokers with this haplotype are 50%
less likely to develop a colorectal adenoma
The VDR Example
• Serum Vitamin D and 000 haplotype interaction
coefficient

Method            Estimate   s.e.   p-value
Model Free          -0.21    0.12    0.093
Independence        -0.18    0.08    0.019
Consistent EB1      -0.19    0.08    0.021

• Men with 1 sd greater serum vitamin D than
the norm are 70% less likely to develop
prostate cancer
Genome-Wide Association Studies
• These methods can be applied to GWAS
• My last two examples were actually from the
PLCO GWAS
• Also, the “environment” can be another SNP
Identifying Genetic Markers for Prostate
& Breast Cancer
[Flowchart of the genome-wide analysis, http://cgems.cancer.gov]
• Public Health Problem: Prostate (1 in 8 Men), Breast (1 in 9 Women)
• Analyze Long-Term Studies: NCI PLCO Study, Nurses’ Health Study
• Genome-Wide Analysis: Initial Study, Follow-up #1, Follow-up #2, Establish Loci
• Fine Mapping, Functional Studies, Validate Plausible Variants, Possible Clinical Testing
Identifying Genetic Markers for Prostate
& Breast Cancer
• Case-control studies nested in prospective
cohorts were used in the CGEMS GWAS
• NHS cohort, post-menopausal breast cancer: 1183 cases and
1185 controls, from 32,826 eligible participants with blood sample collection
• PLCO cohort, prostate cancer: 737 aggressive cases, 493 non-aggressive cases
and 1230 controls, from 28,521 eligible participants with blood sample collection
• Non-aggressive: stage <= 2 (non-invasive) and Gleason score <= 6
• Aggressive: stage > 2 (invasive) or Gleason score > 6
[Timeline figure: cohort start, blood sample collection and case ascertainment dates, 1976 to 2004]
Genome-Wide Association Studies
• The methodology I will describe is now the
standard gene-environment analysis at the
National Cancer Institute for GWAS
• There are now 500,000 SNPs in a typical GWAS,
and our method is fast enough to handle this
Genome-Wide Association Studies
• Typically, loci are identified initially for main
effects, then followed up for gene-environment
interactions
• My analyses have come from the PLCO study
• In some cases, the “environment” is other
genes on different chromosomes, i.e., gene-gene interactions
Genome-Wide Association Studies
• Despite the fact that the genes are on different
chromosomes, they are not always independent
• For example they might be in the same
pathway
Genome-Wide Association Studies
• When genes on different chromosomes are
independent, our methods give huge gains in
efficiency = “more data” = smaller standard
errors
• When they are not, our methods give, in effect,
the robust method of ordinary logistic
regression
Summary
• Case-control studies are the backbone of
epidemiology in general, and genetic
epidemiology in particular
• Their retrospective nature distinguishes them
from random samples = prospective studies
Summary
• We start by assuming relationships between the
genes and the “environment” in the
population, e.g., independence
• This model can be fully flexible
• We also, where necessary, specify distributions
for genes
Summary
• We calculated a new likelihood function, leading
to much more precise inferences
• The method can handle missing genes,
genotyping errors, measurement errors in
the environment
• Calculations are straightforward via the EM
algorithm
Summary
• Forced to face the dilemma
• Lousy but robust method
• Great but not robust method
• We developed a fast, data adaptive, novel way
of addressing this issue
• In cases where one can predict the outcome,
the EB method works as desired