TECHNICAL REPORT 2003/GE1
On the accuracy of estimates of the effect of
phenotype on disease derived from Mendelian
randomisation studies
John R Thompson*, Martin D Tobin, Cosetta Minelli
Genetic Epidemiology Unit,
Centre for Biostatistics,
Department of Health Sciences,
University of Leicester, U.K.
*E-mail : [email protected]
Date: 4th Aug 2003
Revision: 11th Aug 2003
We are grateful to Roger Harbord from the University of Bristol for pointing out that it is much better to
represent the data in Table 1 in terms of the odds ratio for a 1 standard deviation change in the phenotype
rather than a 1-unit change in the phenotype. This modification has been incorporated.
SUMMARY
Observational studies of the effects of intermediate phenotypes on disease are frequently
difficult to interpret because of their susceptibility to confounding and reverse causation.
An alternative approach, called Mendelian randomisation, involves selecting a gene that
influences the level of the phenotype and assessing whether that gene is associated with
the disease. Because genotype is decided by a random process at conception it is very
unlikely that the gene:disease or gene:phenotype association will be biased by
confounding. Here we consider the accuracy of the estimate of the effect of intermediate
phenotype on disease derived by combining data from Mendelian randomisation studies
of gene on disease and of gene on phenotype. The final accuracy is especially poor if the
effect of genotype on phenotype is poorly estimated. Even when data are obtained from
meta-analyses, Mendelian randomisation studies may not be accurate enough to confirm
or contradict estimates of the effect of phenotype on disease derived from observational
studies.
INTRODUCTION
Observational studies that investigate the relationship between intermediate phenotypes,
such as biochemical measurements, and disease are often difficult to interpret due to their
potential for reverse causation and confounding. That is, either the biochemical is
affected by the presence of the disease, or the measurement is influenced by some other
factor that is itself associated with the disease (1, 2). Yet where effective interventions are
available that influence the intermediate phenotype, unbiased estimates of the
relationship between the intermediate phenotype and disease are of real public health
importance.
As a way of avoiding the problems of confounding and reverse causation, several authors
have suggested using studies based on a gene that is known to influence the level of the
intermediate phenotype. It is highly unlikely that the relationship between the gene and
the disease is subject to reverse causation or confounding, since a subject's genotype is
determined by a random process at conception. This naturally occurring randomisation,
known as Mendelian randomisation, gives studies in which the exposure is the subject's
genotype a similar status to randomised trials (2-5). If such a study shows an association
between the gene and the disease, and if it is known that the gene acts through the
biochemical phenotype, then we have strong evidence to suggest that the phenotype is
causally linked to the disease. Furthermore, the estimates from studies of the effect of the
gene on disease and the gene on biochemical phenotype can be used to estimate the size
of the effect of the phenotype on disease, and this can be compared to the results from
observational studies as a way of investigating whether or not the observational studies
had been adequately adjusted for confounding (2, 5).
An important example of Mendelian randomisation is the investigation of the possible
relationship between homocysteine and coronary heart disease (CHD) through the use of
the methylene tetrahydrofolate reductase (MTHFR) gene. Numerous observational
studies have found an association between the concentration of homocysteine in blood
samples and the development of coronary heart disease. However, a large meta-analysis
suggests that the effect is small; a decrease of 3 µmol/L leading to a reduction of about
11% in the risk of CHD (homocysteine levels in healthy subjects average about
12 µmol/L with a standard deviation of about 4 µmol/L) (6). The observational studies
used in the meta-analysis show significant heterogeneity, which may well be because
some of the studies are confounded by factors such as smoking, blood pressure and
cholesterol. Despite the relatively small size of the effect, the relationship is of real
importance because CHD is so common and homocysteine levels are reduced by dietary
folate supplementation. Supplementation is easily implemented; one example is the FDA fortification
program for refined flour, which commenced in the USA in 1998
(http://www.cfsan.fda.gov/~dms/fdafolic.html) (7).
The MTHFR gene has a common polymorphism that involves a C-to-T substitution at
base 677. The effect of the C-to-T substitution is to reduce the enzyme's effectiveness and
so to produce lower folate and raise homocysteine concentrations. The prevalence of the
C-to-T substitution varies between populations, but in Europe and North America the
polymorphism is found in about a third of all copies of the gene, so that approximately
11% of people carry two copies of the C-to-T substitution, referred to as TT. The meta-analysis of studies of the MTHFR gene in populations that have not been subject to
dietary folate supplementation suggests an increased risk of CHD of about 16% when
comparing TT subjects to subjects without any copy of the polymorphism (CC) (8).
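The 11% figure can be checked from the allele frequency if genotypes are assumed to be in Hardy-Weinberg proportions; the report does not state this assumption, and the symbol q (the frequency of the T allele, about one third) is introduced here only for illustration:

```latex
P(TT) = q^2 \approx \left(\tfrac{1}{3}\right)^2 = \tfrac{1}{9} \approx 0.11, \qquad
P(CT) = 2q(1-q) \approx \tfrac{4}{9}, \qquad
P(CC) = (1-q)^2 \approx \tfrac{4}{9}
```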
The relationship between the MTHFR polymorphism and CHD and the fact that the
polymorphism affects homocysteine together are said to provide good evidence that the
relationship found in observational studies is not just a result of reverse causation or
confounding (2, 8). In the following sections we consider the use of Mendelian
randomisation studies of gene on disease and gene on phenotype to estimate the size of the
effect of phenotype on disease. We also consider whether the consistency between this
estimate and the measurements made in observational studies can act as a test of the
reliability of the observational results.
GENOTYPE, PHENOTYPE AND DISEASE
Suppose that we have genotype, G, a continuous intermediate phenotype, IP, and a binary
disease outcome D. The gene is selected because of its influence on the phenotype and
we wish to establish the relationship between phenotype and disease. The hypothesised
causal pathway is
G → IP → D
Denote two of the genotypes that we wish to compare by G=1 and G=0, and assume that
the intermediate phenotype measurement, x, is normally distributed with mean $\mu_1$ if G=1
and $\mu_0$ if G=0 but with common standard deviation $\sigma$. Further suppose that the
probability of disease, D=1, is related to x by a logistic function, so that
$$p(D=1 \mid x) = \frac{\exp(\alpha+\beta x)}{1+\exp(\alpha+\beta x)}$$
Here $\beta$ represents the log odds ratio of disease given a unit change in the measurement of
phenotype, $\log(OR_{PD})$.
Using $\phi(x; \mu, \sigma)$ to represent the density of data x from a normal distribution with mean $\mu$
and standard deviation $\sigma$, it follows that
$$p(D=1 \mid G=i) = I_i = \int \phi(x; \mu_i, \sigma)\, \frac{\exp(\alpha+\beta x)}{1+\exp(\alpha+\beta x)}\, dx$$
which may be approximated by
$$I_i \approx \frac{\exp(\alpha+\beta\mu_i)}{1+\exp(\alpha+\beta\mu_i)}$$
Thus the log odds of disease given G=i is approximately $\alpha+\beta\mu_i$ and the log odds ratio of
disease comparing genotypes is approximately $\beta(\mu_1-\mu_0)$. In other words, the log odds ratio of
disease comparing the two genotypes, $\log(OR_{GD})$, is approximately equal to the log odds
ratio of disease given a unit change in the phenotype, $\log(OR_{PD})$, multiplied by the
average difference between the intermediate phenotype measurements under the two
genotypes.
$$\log(OR_{GD}) \approx (\mu_1-\mu_0)\,\log(OR_{PD})$$
Equivalently this may be represented as
$$OR_{GD} \approx OR_{PD}^{\,\mu_1-\mu_0}$$
This adjustment has been assumed without comment in several articles on Mendelian
randomisation or the meta-analysis of genetic studies (9).
Under this model the full expression for the log odds ratio of disease comparing the
genotypes is
log[I1/(1-I1) . (1-I0)/I0]
Table 1 compares the approximate log odds ratio, $\beta(\mu_1-\mu_0)$, with the actual value
obtained by numerical integration over the intermediate phenotype using a trapezium
rule. In this table the disease frequency refers to the percentage of subjects with the
average level of the phenotype for the healthier genotype that develop the disease. The
phenotype measurements are compared in terms of the effect size of the genotype on
phenotype, that is $(\mu_1-\mu_0)/\sigma$, and $\beta\sigma$, the log odds ratio for a one standard deviation change
in the phenotype. The Stata program used for the numerical integration is given in
Appendix 1.
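The comparison behind Table 1 can be sketched in a few lines of Python. This is an illustrative re-implementation under stated assumptions, not the authors' Stata program: the function names, the integration range of 8 standard deviations either side of the mean, the number of grid points, and the choice to express the error relative to the exact value are all assumptions made here. The healthier genotype is given mean 0, so the intercept is the logit of the disease frequency.

```python
import math


def expit(t):
    """Inverse logit: maps a log odds to a probability."""
    return 1.0 / (1.0 + math.exp(-t))


def logit(p):
    """Log odds of a probability."""
    return math.log(p / (1.0 - p))


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and sd sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def disease_prob(mu, sigma, alpha, beta, n=4000, half_width=8.0):
    """Trapezium-rule value of I = integral of phi(x; mu, sigma) * expit(alpha + beta*x)."""
    lower = mu - half_width * sigma
    step = 2.0 * half_width * sigma / n
    total = 0.0
    for k in range(n + 1):
        x = lower + k * step
        weight = 0.5 if k in (0, n) else 1.0  # end points get half weight
        total += weight * normal_pdf(x, mu, sigma) * expit(alpha + beta * x)
    return total * step


def percent_error(effect_size, or_per_sd, disease_freq, sigma=1.0):
    """Percentage error of beta*(mu1 - mu0) relative to the integrated log odds ratio."""
    mu0, mu1 = 0.0, effect_size * sigma
    alpha = logit(disease_freq)           # disease frequency when x = mu0 = 0
    beta = math.log(or_per_sd) / sigma    # log odds ratio per unit of phenotype
    approx = beta * (mu1 - mu0)
    exact = (logit(disease_prob(mu1, sigma, alpha, beta))
             - logit(disease_prob(mu0, sigma, alpha, beta)))
    return 100.0 * (approx - exact) / exact


if __name__ == "__main__":
    for odds_ratio in (1.1, 1.2, 1.5, 2.0):
        print(odds_ratio, round(percent_error(1.0, odds_ratio, 0.05), 1))
```

Calling `percent_error(1.0, 2.0, 0.05)` corresponds to an effect size of 1, an odds ratio of 2 per standard deviation and a 5% disease frequency.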
Table 1 shows that the approximation is least reliable when the disease is common in the
populations studied, the effect of the phenotype on disease is large, or the difference in
phenotype measurements between genotypes is small relative to their variability. Since
the method of Mendelian randomisation is typically applied to data from case-control
studies the rare disease assumption will usually hold. Further since potential confounding
is usually not a possible alternative explanation to causality when the odds ratio is large
(10), the only practical concern with the approximation arises when genes are studied that
have a small effect on the phenotype. When selecting an appropriate gene for Mendelian
randomisation it would be natural to choose one that has as large an effect on the
phenotype as possible, in which case the approximation is likely to be reasonable. The
average difference between the levels of homocysteine in the CC and TT genotypes of
the MTHFR gene is about 1.5 µmol/L; taking the standard deviation to be 4 µmol/L, the
effect size is approximately 0.38. For populations in which the disease is rare the
numerical integration shows the approximation to be accurate to within 3 or 4%.
TABLE 1. Percentage error in using the approximation $\beta(\mu_1-\mu_0)$ to the log odds ratio of the disease
given genotype

Disease                      Phenotype difference (3)
frequency (1)   ORPD (2)     0.5       1.0       2.0
1%              1.1          0.0%      0.0%      0.0%
                1.2          0.0%      0.0%      0.0%
                1.5          0.2%      0.2%      0.3%
                2.0          0.9%      1.0%      1.5%
5%              1.1          0.0%      0.1%      0.1%
                1.2          0.2%      0.2%      0.2%
                1.5          1.0%      1.1%      1.3%
                2.0          3.6%      4.1%      5.2%
10%             1.1          0.1%      0.1%      0.1%
                1.2          0.3%      0.3%      0.4%
                1.5          1.7%      1.9%      2.1%
                2.0          5.9%      6.5%      7.6%

1. Disease Frequency is the percentage of subjects with phenotype equal to the average
for the healthier genotype who develop the disease
2. ORPD is the odds ratio of a one standard deviation change in phenotype on disease
3. Difference in average phenotype between the two genotypes divided by the standard
deviation (assumed equal in the two groups)
ESTIMATING THE EFFECT OF PHENOTYPE ON DISEASE
Under the reasonable assumptions discussed in section 5, Mendelian randomisation
allows the unconfounded estimation of both the mean effect of genotype on intermediate
phenotype $(\mu_1-\mu_0)$ and the log odds ratio of genotype on disease, which will be denoted
by $\theta=\log(OR_{GD})$. Using the approximate relationship given in the previous section, the
log odds ratio of a unit change in phenotype on disease, $\beta$, can be estimated as
$\hat\theta/(\hat\mu_1-\hat\mu_0)$.
In Mendelian randomisation studies it would be usual to select a gene for which there is
good evidence that $(\mu_1-\mu_0) > 0$ so that a test of a non-zero effect of phenotype on disease
is almost equivalent to a test of a non-zero effect of genotype on disease. However
uncertainty about the size of the difference $(\mu_1-\mu_0)$ can greatly increase the uncertainty
in the estimate, $\hat\theta/(\hat\mu_1-\hat\mu_0)$. In large samples it will be reasonable to assume that $\hat\theta$ and
$(\hat\mu_1-\hat\mu_0)$ are both normally distributed. The problem of finding a confidence interval
for the ratio of two normal variates is considered by Kendall and Stuart (11) following
the work of Fieller (12). The confidence interval for the ratio of two means $\mu_y/\mu_x$ is the set
of $\rho$ that satisfy
$$\left(\hat\mu_x^2 - t^2\frac{\hat\sigma_x^2}{n-1}\right)\rho^2 - 2\left(\hat\mu_x\hat\mu_y - t^2\frac{\hat\sigma_{xy}}{n-1}\right)\rho + \left(\hat\mu_y^2 - t^2\frac{\hat\sigma_y^2}{n-1}\right) \le 0$$
where t represents the $(100-\alpha/2)\%$ point from a t-distribution with n-1 degrees of freedom
and the hat denotes the usual estimates of mean, variance and covariance. Adapting their
formula for large samples in which the two estimates used in the ratio are independent,
for instance, when they come from separate meta-analyses, and denoting the difference in
means by $\delta = \mu_1-\mu_0$, then a $(100-\alpha)\%$ confidence interval for the ratio has limits,
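$$\frac{\hat\theta}{\hat\delta}\cdot\frac{1 \pm z_{\alpha/2}\sqrt{\dfrac{s_\theta^2}{\hat\theta^2} + \dfrac{s_\delta^2}{\hat\delta^2} - z_{\alpha/2}^2\,\dfrac{s_\theta^2\, s_\delta^2}{\hat\theta^2\,\hat\delta^2}}}{1 - z_{\alpha/2}^2\,\dfrac{s_\delta^2}{\hat\delta^2}}$$
where $\hat\theta$ is the estimated log odds ratio of genotype on disease, $\hat\delta$ is the estimated difference in
mean phenotype, $z_{\alpha/2}$ is the $(100-\alpha/2)\%$ value from a standard normal distribution, for instance
1.96 for a 95% confidence interval, and s denotes the standard error of the corresponding
mean.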
If the divisor $1 - z_{\alpha/2}^2\, s_\delta^2/\hat\delta^2$ is positive, as it usually will be if the gene is well chosen,
then the confidence interval consists of all points within these limits, but if it is negative
then it consists of all points outside the limits. Thus, assuming that $\theta$ is positive, when it
is possible that $\delta$ is close to zero, the confidence interval for the ratio $\theta/\delta$ will stretch to
$+\infty$ because $\delta$ may be small and positive and to $-\infty$ because $\delta$ may be small and negative;
but because $\delta$ is known not to be large, the values excluded from the confidence interval
will be those corresponding to small values of $\theta/\delta$. In the extreme case the quadratic may
have no solution, in which case the confidence interval is $(-\infty,\infty)$ and the sample is not
informative enough to exclude any ratio.
The confidence limits depend on the ratio of the estimates $\hat\theta$ and $(\hat\mu_1-\hat\mu_0)$ to their
respective standard errors. We refer to the ratio of the estimate to its standard error as the
standardised effect, Z. Thus the 95% confidence interval will consist of points between the
limits provided $Z(\hat\delta)^2 > 1.96^2$, that is provided the mean effect of genotype on phenotype
is accurately estimated.
In the homocysteine example described in the introduction the odds ratio comparing the
TT with the CC genotype has been estimated in a recent meta-analysis to be 1.16 with a
95% confidence interval (1.05,1.28). Thus the log odds ratio, $\hat\theta$, is 0.15 with standard
error 0.05, so that the standardised effect, $Z(\hat\theta)$=3. A similar meta-analysis found the
relationship between average levels (95% CI) of homocysteine (µmol/L) in disease-free
subjects to be: CC 9.9 (9.8,10.0), TT 11.4 (11.0,11.8). The difference is 1.5 with a
standard error of 0.2 and $Z(\hat\delta)$=7.5. The log odds ratio of CHD due to 1 µmol/L change
in homocysteine is therefore estimated to be 0.10 (0.03,0.18) equivalent to an odds ratio
of 1.11 (1.03,1.20). Assuming linearity of effect, a more direct comparison is obtained by
calculating the effect of a 1.5 µmol/L change in homocysteine which would give an odds
ratio of 1.16 (1.05,1.31).
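The homocysteine calculation can be reproduced with a short Python sketch of the large-sample Fieller limits for two independent estimates. The function name and return convention are assumptions made here for illustration; the limits are computed as the roots of the quadratic, which is algebraically the same as the factored form of the limits and avoids dividing by the gene:disease estimate.

```python
import math


def fieller_limits(theta, se_theta, delta, se_delta, z=1.96):
    """Large-sample Fieller limits for theta/delta with independent estimates.

    Returns (lower, upper, bounded). When bounded is False the divisor is not
    positive and the confidence set lies outside the two limits. Returns None
    when the quadratic has no real roots (the set is the whole real line).
    """
    divisor = delta ** 2 - (z * se_delta) ** 2
    disc = (delta * se_theta) ** 2 + (theta * se_delta) ** 2 - (z * se_theta * se_delta) ** 2
    if disc < 0 or divisor == 0:
        return None
    half = z * math.sqrt(disc)
    roots = sorted([(theta * delta - half) / divisor, (theta * delta + half) / divisor])
    return roots[0], roots[1], divisor > 0


if __name__ == "__main__":
    # Rounded values quoted in the text: log odds ratio 0.15 (SE 0.05),
    # difference in mean homocysteine 1.5 (SE 0.2).
    lower, upper, bounded = fieller_limits(0.15, 0.05, 1.5, 0.2)
    print(round(lower, 2), round(upper, 2), bounded)
    print(round(math.exp(lower), 2), round(math.exp(upper), 2))
    print(round(math.exp(1.5 * lower), 2), round(math.exp(1.5 * upper), 2))
```

With the rounded inputs above the limits agree with the intervals quoted in the text: about (0.03, 0.18) on the log scale, (1.03, 1.20) as an odds ratio per 1 µmol/L, and (1.05, 1.31) for a 1.5 µmol/L change.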
TABLE 2. Median percent* increase in the width of the 95% confidence interval for the
log odds ratio of a unit change in phenotype on disease compared to that for genotype on
disease.

                   Z§($\theta$)
Z§($\delta$)       3       5       10      20
3                  80      160     380     840
5                  30      65      165     390
10                 8       15      50      135
20                 2       4       15      40

*percentages over 10 are rounded to the nearest 5.
§ Z = expected standardised effect = average of estimate/standard error for $\theta$, the log odds
ratio of genotype on disease, and $\delta$, the average effect of genotype on phenotype
The uncertainty in the size of the average change in homocysteine due to genotype has
led to a widening of the 95% confidence interval, from (1.05,1.28) for genotype on
disease, to (1.05,1.31) for intermediate phenotype on disease. Table 2 relates the width of
the confidence interval for the log odds ratio of genotype on disease to the corresponding
estimate of the effect of phenotype on disease. The results were obtained from sets of
50,000 simulations from normal populations with odds ratio of genotype on disease of
1.5 and average effect of genotype on phenotype of 1. The Stata program used for the
calculations is given in Appendix 2. The table shows the median increase for cases where
the confidence interval for the ratio is finite and thus excludes cases where the divisor in
the formula for the standard confidence limits is negative. In Table 2 a negative divisor
occurs in about 1 sample in 6 when the expected value of $Z(\hat\delta)$=3 and almost never in the
rest of the table.
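A simulation of this kind can be sketched in Python as follows. It is not the Stata program of Appendix 2: the function name, the seed, the smaller number of simulations and the decision to treat the standard errors as known constants are assumptions made here for illustration, so the results should only be expected to follow the general pattern of Table 2.

```python
import math
import random
import statistics


def median_width_increase(z_theta, z_delta, n_sims=20000, seed=1):
    """Median % increase in 95% CI width (ratio vs gene:disease) and the share
    of samples dropped because the Fieller divisor is not positive."""
    rng = random.Random(seed)
    z = 1.96
    theta, delta = math.log(1.5), 1.0                  # true values used in the text
    s_theta, s_delta = theta / z_theta, delta / z_delta  # standard errors, taken as known
    increases = []
    for _ in range(n_sims):
        t_hat = rng.gauss(theta, s_theta)
        d_hat = rng.gauss(delta, s_delta)
        divisor = d_hat ** 2 - (z * s_delta) ** 2
        if divisor <= 0:
            continue                                   # interval is not finite
        disc = (d_hat * s_theta) ** 2 + (t_hat * s_delta) ** 2 - (z * s_theta * s_delta) ** 2
        ratio_width = 2.0 * z * math.sqrt(disc) / divisor
        gene_width = 2.0 * z * s_theta
        increases.append(100.0 * (ratio_width / gene_width - 1.0))
    dropped = 1.0 - len(increases) / n_sims
    return statistics.median(increases), dropped


if __name__ == "__main__":
    for z_delta in (3, 5, 10, 20):
        row = [round(median_width_increase(z_theta, z_delta)[0]) for z_theta in (3, 5, 10, 20)]
        print(z_delta, row)
```

Because the true difference in phenotype is set to 1, the width of the interval for the ratio is on the same scale as the width of the interval for the gene:disease log odds ratio, which is what makes the percentage comparison meaningful.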
If the mean difference in phenotype due to genotype, $\delta$, is poorly estimated, say with a ratio
of the mean to the standard error of 5 or less, then the precision with which the effect of
phenotype on disease can be estimated will be appreciably worse than the precision of the
estimate of genotype on disease. Even with a good degree of accuracy for $\delta$, there will be
a sizeable loss in the relative precision if $\theta$ is very accurately estimated. In the
homocysteine example $Z(\hat\delta)$ is 7.5 and $Z(\hat\theta)$ is 3 so the loss in precision is small.
TESTING FOR LACK OF CONFOUNDING IN THE OBSERVATIONAL STUDIES
Observational studies will, after adjustment for measured confounders, provide an
estimate of the log odds ratio of a unit change in the phenotype on disease, $\hat\beta$. Mendelian
randomisation provides the alternative estimate $\hat\theta/(\hat\mu_1-\hat\mu_0)$, which is very unlikely to
be biased by confounding. One possible measure of the degree of residual confounding in
the observational study is
$$\hat\lambda = \frac{\hat\beta - \dfrac{\hat\theta}{\hat\mu_1-\hat\mu_0}}{\dfrac{1}{2}\left[\hat\beta + \dfrac{\hat\theta}{\hat\mu_1-\hat\mu_0}\right]}$$
Taking a parallel from bioequivalence studies that place limits on the ratio of mean
responses to two treatments (13), the amount of residual confounding might be
considered negligible if $\hat\lambda$ lies within specified limits, for instance provided that
$$-0.2 < \hat\lambda < 0.2$$
When planning a Mendelian randomisation study to assess whether information on the
effect of phenotype on disease from observational studies is biased, it might be sensible
to adopt the requirement that there is a 90% chance of finding $-0.2 < \hat\lambda < 0.2$ for the
confounding measure $\hat\lambda$ when there is no actual confounding. The problem in practice is that to achieve this degree of
accuracy in the estimate $\hat\lambda$, extremely large sample sizes would be required in both the
Mendelian randomisation studies and the observational study. Once again the chance that
$-0.2 < \hat\lambda < 0.2$ depends on the ratios of the individual estimates to their standard errors.
Table 3 sets out the percentage of time that the criterion will be met in terms of the
expected values of those ratios. Each figure is obtained by generating 50000 samples
from normal distributions with means $\theta$ and $\beta$ equal to log(1.2) and mean $\delta$=1, although
the general pattern shown in Table 3 depends only on the standardised effects and not the
means. The Stata program used for the calculations is given in Appendix 3. The only
combinations that produce a 90% chance that the criterion is met are those for which the
expected ratios are at least 10. Even if the estimates of $\theta$, $\beta$ and the $\mu$'s come from large
meta-analyses it is unlikely that this degree of accuracy will be achieved and hence
Mendelian randomisation is unlikely to have the power to rule out residual confounding
in the studies of phenotype and disease.

TABLE 3. Percent of samples for which $-0.2 < \hat\lambda < 0.2$ when there is no confounding of
the relationship between phenotype and disease, in terms of the expected standardised
estimates (Z) of $\theta$, the log odds ratio of genotype on disease, $\delta$, the average effect of
genotype on phenotype, and $\beta$, the log odds ratio of phenotype on disease.

                                 Z§($\theta$)
Z§($\delta$)     Z§($\beta$)     3       5       10      20
3                3               26      29      31      31
                 5               30      34      38      38
                 10              32      38      42      43
                 20              32      39      43      45
5                3               30      35      38      39
                 5               35      43      49      51
                 10              38      49      58      62
                 20              39      51      62      65
10               3               32      38      42      43
                 5               38      49      58      62
                 10              42      58      75      82
                 20              43      62      81      90
20               3               33      39      43      44
                 5               39      51      62      65
                 10              44      62      82      90
                 20              44      66      90      98

§ Z = expected standardised effect = average of estimate/standard error
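The calculation behind Table 3 can be sketched in Python. This is an illustrative sketch rather than the Stata program of Appendix 3: it assumes the confounding measure is the difference between the observational and Mendelian randomisation estimates divided by their average, that the three estimates are independent and normally distributed, and that the standard errors are known; the function name, seed and number of simulations are likewise choices made here.

```python
import math
import random


def percent_within_limits(z_theta, z_delta, z_beta, n_sims=20000, limit=0.2, seed=1):
    """Percent of simulated samples whose confounding measure lies within +/- limit
    when there is no confounding (beta equals theta/delta)."""
    rng = random.Random(seed)
    theta = beta = math.log(1.2)   # true log odds ratios used in the text
    delta = 1.0                    # true difference in mean phenotype
    hits = 0
    for _ in range(n_sims):
        t_hat = rng.gauss(theta, theta / z_theta)
        d_hat = rng.gauss(delta, delta / z_delta)
        b_hat = rng.gauss(beta, beta / z_beta)
        if d_hat == 0.0:
            continue                           # ratio undefined; counts as a miss
        mr_hat = t_hat / d_hat                 # Mendelian randomisation estimate
        average = 0.5 * (b_hat + mr_hat)
        if average == 0.0:
            continue                           # measure undefined; counts as a miss
        if -limit < (b_hat - mr_hat) / average < limit:
            hits += 1
    return 100.0 * hits / n_sims


if __name__ == "__main__":
    for z in (3, 5, 10, 20):
        print(z, round(percent_within_limits(z, z, z)))
```

Setting all three standardised effects equal, as in the usage loop, traces the diagonal of the table and shows how slowly the chance of meeting the criterion rises with precision.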
ASSUMPTIONS BEHIND MENDELIAN RANDOMISATION
Models similar to those described in the section on Genotype, Phenotype and Disease
underlie most Mendelian randomisation studies. However the idea of a single gene acting
through a single phenotype to increase the risk of disease is clearly a simplification of
reality even before assumptions are made about the mathematical patterns describing the
associations.
First, it is possible for genes to demonstrate a pleiotropic effect (14), that is, they may
influence more than one phenotype. For example, the number of apolipoprotein ε4 alleles
affects both plasma cholesterol and triglyceride levels, and each of these intermediate
phenotypes predisposes to coronary heart disease (15). If the other phenotypes do not
affect the risk of the disease under study this will not matter, but it may be that only part
of the effect of the gene on disease acts through the selected phenotype. In such
situations, analysis by Mendelian randomisation will tend to overestimate the impact of
the selected phenotype on disease (unless the genotype affects two intermediate
phenotypes that have opposing effects on the risk of disease).
Second, it is possible that the level of the phenotype is affected by more than one gene
(genetic heterogeneity). This will not invalidate the model unless the genes interact. If
there are several independent genes, it is important to select one with a large effect in
order to achieve a powerful study, but it may also be possible to validate the estimates of
phenotype on disease obtained from studies with one of the genes by comparison with
results from studies of another gene, again assuming no interaction. More difficult is the
situation in which the phenotype is affected by environmental factors that interact with
the effect of the gene. There is some evidence to suggest that this may happen with
homocysteine when dietary supplementation with folate lowers homocysteine and masks
the impact of the MTHFR gene (8). In this situation it is important to allow for the
environmental factor, for instance only analysing studies in populations that have not had
folate supplementation.
Third, it is possible for the gene variant under study to be in linkage disequilibrium with a
gene variant that itself has an effect on the phenotype and/or disease (16); that is, the two
genes tend to occur together in the population either because they are physically linked or
for reasons of population origin. This may distort the estimates of genotype:disease
association and/or genotype: phenotype association and thereby the indirect estimate of
phenotype:disease association, and because linkage disequilibrium varies between
populations, these estimates may differ substantially between populations.
Genetic association studies can be susceptible to confounding by population
stratification, that is, confounding due to differences in the prevalence of variant alleles
and disease prevalence in genetically distinct subsections of the population. One may be
able to test for hidden population stratification in published Mendelian randomisation
studies using genotype frequencies from unlinked markers (17). In other instances, this
may not be possible; even so, confounding due to population stratification will only occur
in relatively rare circumstances (18). More problematic is that many genetic association
studies have been undertaken using family-based controls; these designs reduce power
and make the interpretation of Mendelian randomisation studies difficult.
DISCUSSION
Under reasonable assumptions Mendelian randomisation provides unconfounded
estimates of the effects of genotype on phenotype and genotype on disease. By
combining these two estimates it is possible to derive an estimate of the effect of
phenotype on disease. Unlike a direct estimate from an observational study, the derived
estimate should be free of confounding or reverse causation. However, the information on
the gene:disease and gene:phenotype relationships will often come from case-control
studies that may be susceptible to other forms of bias; it is still important that this is
minimised by good design, such as case-control studies nested within major cohorts, as
planned by the UK Biobank (http://www.ukbiobank.ac.uk/) and the US National
Children’s Study (http://nationalchildrensstudy.gov/).
In section 3 the confidence interval for the derived estimate was shown to depend on the
standardised effects (estimate/standard error) of the log odds ratio of genotype on disease
and the mean effect of genotype on phenotype. Because the derived estimate of
phenotype on disease is obtained as a ratio, the accuracy of the divisor is particularly
important, and unless the standardised effect of genotype on phenotype is over 5 the
confidence interval of the ratio can be very wide. In section 4 the derived estimate is
compared with the direct estimate from observational studies and once again the ability to
establish equivalence, and hence a lack of material confounding, is limited unless all
three studies provide a very high degree of precision.
Given the general lack of power of individual genetic association studies (16, 19, 20), the
indirect estimate of the effect size of the phenotype on disease will usually be based on
meta-analyses of studies on genotype:phenotype and genotype:disease associations.
However, even when the estimates of the effects of genotype on phenotype and genotype
on disease come from large meta-analyses that provide relatively small standard errors,
our results illustrate that they may not provide sufficient precision to derive a precise
estimate of the effect of phenotype on disease. Moreover, our study shows how the
precision of the derived estimate depends on the “balance” in the amount of information
(i.e. the precision) available for genotype:phenotype and for genotype:disease association
studies, the precision of the genotype:phenotype association being crucially important.
In practice, the genotype:phenotype association will often be measured on a subset of
subjects within a study primarily designed to assess genotype:disease association.
Consequently genotype:phenotype studies tend to be fewer in number and smaller in size
than those assessing genotype:disease association. As the disease under study might itself
influence the phenotype, the genotype:phenotype association is often assessed among
controls in case-control studies designed to investigate genotype:disease association.
Such reverse causality is seen in studies of myocardial infarction where blood lipids are
deranged around the time of myocardial infarction (21). Furthermore, many biochemical
intermediates are difficult to measure, for example, homocysteine levels increase in blood
after collection and differences in the timing of analysis of samples and in laboratory
techniques may account for important heterogeneity among studies. Although the
precision of meta-analyses of genotype:disease association studies is generally high, as
shown by a recent review of 55 meta-analyses (20), this does not help when using these
results to indirectly estimate the effect of phenotype on disease if not coupled with
accurate data on the association between genotype and phenotype.
A further important difficulty is that the pooled estimates of genotype:disease association
and genotype:phenotype association upon which Mendelian randomisation studies
often depend may be distorted by publication bias. Publication bias is particularly
pronounced in genetic association studies (16). Moreover, pooling of data in meta-analyses is often hampered by differences in disease definition between studies (16).
These limitations mean that accurate estimates of the effects of genotype on disease and
the effects of genotype on phenotype are difficult to obtain in genetic association studies,
and that caution must be exercised in combining estimates from published studies (14,
16).
Mendelian randomisation studies provide a theoretically attractive solution to the
problems of confounded estimates from classical epidemiological studies. However,
unrealistic assumptions about the simplicity of the causal pathway underlying common
diseases may lead to biased estimates. Furthermore, the imprecision of the derived
estimates in Mendelian randomisation studies suggests that their utility will be limited
given the sample sizes currently used in genetic association studies. In summary,
Mendelian randomisation studies may provide a useful adjunct to classical
epidemiological studies, but they do not signal their end. There is still a need for well-designed epidemiological studies to provide evidence on intermediate phenotypes.
REFERENCES
1.
Greenland S, Rothman K. Measures of Effect and Measures of Association. In: S
G, ed. Modern epidemiology. Philadelphia: Lippincott-Raven, 1998:47-64.
2.
Davey Smith G, Ebrahim S. "Mendelian randomisation": can genetic
epidemiology contribute to understanding environmental determinants of disease?
International Journal of Epidemiology 2003;in press.
3.
Youngman L, Keavney B. Plasma fibrinogen and fibrinogen genotypes in 4685
cases of myocardial infarction and in 6002 controls: test of causality by
"Mendelian randomisation". Circulation 2000;102:31-32.
4.
Clayton D, McKeigue PM. Epidemiological methods for studying genes and
environmental factors in complex diseases. Lancet 2001;358:1356-60.
5.
Keavney B. Genetic epidemiological studies of coronary heart disease. Int J
Epidemiol 2002;31:730-6.
6.
The Homocysteine Studies Collaboration. Homocysteine and risk of ischemic
heart disease and stroke: a meta-analysis. Jama 2002;288:2015-22.
7.
Rader JI. Folic acid fortification, folate status and plasma homocysteine. J Nutr
2002;132:2466S-2470S.
8.
Klerk M, Verhoef P, Clarke R, Blom HJ, Kok FJ, Schouten EG. MTHFR 677C->T polymorphism and risk of coronary heart disease: a meta-analysis. Jama
2002;288:2023-31.
9.
Wald DS, Law M, Morris JK. Homocysteine and cardiovascular disease: evidence
on causality from a meta-analysis. Bmj 2002;325:1202.
18
Technical Report 2003/GE1
10. Hill A. The environment and disease: association or causation? J R Soc Med
1965;58:295-300.
11. Kendall M, Stuart A. The advanced theory of statistics. Volume 2. Inference and
relationship. London: Griffin, 1973.
12. Fieller E. Some problems in interval estimation. Journal of the Royal Statistical
Society, B 1954;16:175.
13. Berger R, Hsu J. Bioequivalence trials, intersection-union tests and equivalence
confidence sets. Statistical Science 1996;11:283-319.
14. Palmer LJ. Complex Diseases. In: Palmer LJ, ed. Biostatistical Genetics and
Genetic Epidemiology. Chichester: Wiley, 2002:206-217.
15. Dallongeville J, Lussier-Cacan S, Davignon J. Modulation of plasma triglyceride
levels by apoE phenotype: a meta-analysis. J Lipid Res 1992;33:447-54.
16. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev
Genet 2001;2:91-9.
17. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect
population stratification in association studies. Am J Hum Genet 1999;65:220-8.
18. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic
studies of common genetic variants and cancer: quantification of bias. J Natl
Cancer Inst 2000;92:1151-8.
19. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review
of genetic association studies. Genet Med 2002;4:45-61.
20. Ioannidis JP, Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG. Genetic
associations in large versus small studies: an empirical assessment. Lancet
2003;361:567-71.
21. Clark S, Youngman LD, Sullivan J, Peto R, Collins R. Stabilization of
Homocysteine in Unseparated Blood over Several Days: A Solution for
Epidemiological Studies. Clin Chem 2003;49:518-20.
APPENDIX 1
Stata program for calculating the values in Table 1
***********************************************************
* COMPARISON OF RATIO ESTIMATE WITH NUMERICAL INTEGRATION
* Parameters set for the middle entry in Table 1
***********************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
local mu1=1             /* phenotype mean genotype=1 */
local mu0=0             /* phenotype mean genotype=0 */
local sigma=1           /* sd of phenotype */
local alpha=logit(0.1)  /* 10% prevalence when x=mu0 */
local N 50000           /* points per integration */
*-------------------------------------
* Try a range of Odds ratios
* per sd change in phenotype
* (phenotype on disease)
*
foreach OR of numlist 1.1 1.2 1.5 2.0 {
local beta=log(`OR')/`sigma'
local approx = `beta' * ( `mu1' - `mu0' )
quietly {
clear
set obs `N'
*-------------------------------------
* Integrate over +-6 st devs
*
local delta=(`mu1'-`mu0'+12*`sigma')/`N'  /* step size */
gen double x=`mu0'-6*`sigma'+`delta'*_n    /* grid points */
gen double f1=normden((x-`mu1')/`sigma')/`sigma'
gen double f0=normden((x-`mu0')/`sigma')/`sigma'
gen double P1 = f1*invlogit(`alpha'+`beta'*(x-`mu0'))
gen double P0 = f0*invlogit(`alpha'+`beta'*(x-`mu0'))
summarize P1
local I1=r(sum)*`delta'
summarize P0
local I0=r(sum)*`delta'
local actual = log((`I1'/(1-`I1'))/(`I0'/(1-`I0')))
}
local error = 100*(`approx'-`actual')/`actual'
di "OR " %6.2f `OR' " Actual " %10.4f `actual' " Approx " /*
*/ %10.4f `approx' " Error " %6.2f `error' "%"
}
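As an independent cross-check of the program above, the same comparison can be sketched in Python using only the standard library. This is a minimal sketch under stated assumptions, not the authors' code: the function name log_odds_ratio_gd and its argument names are hypothetical, and the grid uses interval midpoints, which should change the result only negligibly at this step size.

```python
import math


def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def expit(t):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-t))


def log_odds_ratio_gd(odds_ratio, mu0=0.0, mu1=1.0, sigma=1.0,
                      prevalence=0.1, n=50000):
    """Return (actual, approx) gene-disease log odds ratios.

    odds_ratio is the phenotype-disease odds ratio per sd of phenotype.
    Hypothetical helper: defaults copy the settings of the Stata program.
    """
    beta = math.log(odds_ratio) / sigma
    alpha = math.log(prevalence / (1.0 - prevalence))  # logit(prevalence)
    approx = beta * (mu1 - mu0)  # ratio-based approximation

    # Integrate disease risk over the phenotype distribution of each
    # genotype, from mu0 - 6 sd to mu1 + 6 sd, on n interval midpoints.
    lower = mu0 - 6.0 * sigma
    delta = (mu1 - mu0 + 12.0 * sigma) / n
    i0 = 0.0
    i1 = 0.0
    for k in range(n):
        x = lower + delta * (k + 0.5)
        risk = expit(alpha + beta * (x - mu0))
        i0 += normal_pdf(x, mu0, sigma) * risk * delta
        i1 += normal_pdf(x, mu1, sigma) * risk * delta

    actual = math.log((i1 / (1.0 - i1)) / (i0 / (1.0 - i0)))
    return actual, approx


if __name__ == "__main__":
    for odds_ratio in (1.1, 1.2, 1.5, 2.0):
        actual, approx = log_odds_ratio_gd(odds_ratio)
        error = 100.0 * (approx - actual) / actual
        print(f"OR {odds_ratio:6.2f} Actual {actual:10.4f} "
              f"Approx {approx:10.4f} Error {error:6.2f}%")
```

Run as a script, it prints one line per odds ratio in the same layout as the Stata display command.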
APPENDIX 2
Stata program for calculating the values in Table 2
*****************************************************
* RELATIVE SIZES OF CI in MENDELIAN RANDOMISATION
* calculates the values for Table 2
*****************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
*----------------------------------------
* Base Data
*
local me=log(1.5)  /* log OR GD */
local md=1         /* difference in phenotype */
local zn=1.96      /* 95% N(0,1) */
*----------------------------------------
* set ranges for standardised effects
*   e=log odds ratio GD
*   d=mean effect of PD
*
matrix xe=(3, 5, 10, 20)
matrix xd=(3, 5, 10, 20)
*----------------------------------------
* Define matrices to store results
*   P=% increase in width
*   H=% infinite CI
*
matrix P=J(4,4,0)
matrix H=J(4,4,0)
*----------------------------------------
* loop over set ranges
*
forvalues i=1/4 {
forvalues j=1/4 {
local sde=`me'/xe[1,`j']
local sdd=`md'/xd[1,`i']
local b=`me'/`md'
local sdse=`sde'/5  /* assumption re variation in st dev */
local sdsd=`sdd'/5  /* actually depends on sample size */
                    /* but not critical */
quietly {
drop _all
*----------------------------------------
* generate 50,000 random datasets
*   e=log odds ratio GD
*   d=mean effect of PD
*   ratio=estimate log odds PD
*
set obs 50000
gen e=`me'+`sde'*invnorm(uniform())
gen d=`md'+`sdd'*invnorm(uniform())
gen ratio=e/d
gen se=`sde' +`sdse'*invnorm(uniform())
gen sd=`sdd' +`sdsd'*invnorm(uniform())
*----------------------------------------
* CI for log odds PD by Eq of Page 9
*
gen re=(se/e)^2
gen rd=(sd/d)^2
gen g=`zn'*sqrt(rd+re-`zn'^2*re*rd)
gen h=1-`zn'^2*rd
gen rp=ratio*(1+g)/h
gen rm=ratio*(1-g)/h
egen sl=rmin(rp rm)
egen su=rmax(rp rm)
gen width=(su-sl)
*----------------------------------------
* width CI log odds GD
*
gen widthe=1.96*2*se
*----------------------------------------
* relative width
*
gen c=width/widthe
summarize c if h > 0, detail
local perc=100*r(p50)-100
matrix P[`i',`j']=`perc'
count if h < 0
local perc=100*r(N)/50000
matrix H[`i',`j']=`perc'
}
}
}
matrix list P, format(%8.0f)
matrix list H, format(%8.0f)
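The interval calculation inside the loop above can be isolated as a small function. The following Python sketch mirrors the variables re, rd, g and h of the Stata program. The function name fieller_interval and its arguments are hypothetical, and it assumes, as the simulation does, that the estimates e and d are independent and that e is non-zero.

```python
import math


def fieller_interval(e, se, d, sd, z=1.96):
    """Confidence limits for the ratio e/d, following the Stata program.

    e, se: gene-disease log odds ratio and its standard error.
    d, sd: genotype difference in phenotype and its standard error.
    Returns (lower, upper), or None when h <= 0, the case that the
    Stata program counts as an infinite interval.
    """
    ratio = e / d
    re = (se / e) ** 2  # squared relative standard error of e
    rd = (sd / d) ** 2  # squared relative standard error of d
    h = 1.0 - z * z * rd
    if h <= 0.0:
        # d is not significantly different from zero at this level,
        # so the interval for the ratio is unbounded.
        return None
    g = z * math.sqrt(rd + re - z * z * re * rd)
    limits = (ratio * (1.0 - g) / h, ratio * (1.0 + g) / h)
    return min(limits), max(limits)


if __name__ == "__main__":
    e = math.log(1.5)
    print(fieller_interval(e, e / 10.0, 1.0, 0.1))  # bounded interval
    print(fieller_interval(e, e / 10.0, 1.0, 0.6))  # None: unbounded
```

Returning None for h <= 0 is a design choice of this sketch; the Stata program instead keeps such replicates in the dataset and reports their percentage in the matrix H.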
APPENDIX 3
Stata program for calculating the values in Table 3
***********************************************************
* COMPARISON OF EPIDEMIOLOGICAL AND MENDELIAN
* RANDOMISATION ESTIMATES OF PD
* Calculate entries for table 3
***********************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
*----------------------------------
* Parameter choice
*   d=Phenotype difference
*   e=GD log odds ratio
*   b=PD log odds ratio
*
local Zd=20
local md=1
local sed=`md'/`Zd'
local me=log(1.2)
local mb=`me'/`md'
*----------------------------------
* Loop over e and b
*
foreach Ze of numlist 3 5 10 20 {
local see=`me'/`Ze'
foreach Zb of numlist 3 5 10 20 {
quietly {
set obs 50000
local seb=`mb'/`Zb'
gen b=`mb'+`seb'*invnorm(uniform())
gen e=`me'+`see'*invnorm(uniform())
gen d=`md'+`sed'*invnorm(uniform())
gen lambda=2*(b-e*d)/(b+e*d)
replace lambda=(lambda>-0.2)*(lambda < 0.2)
summarize lambda
local y=100*r(sum)/_N
drop b e d lambda
}
di "Zd= " %4.0f `Zd' " Zb=" %4.0f `Zb' " Ze=" %4.0f `Ze' " p=" /*
*/ %4.0f `y'
}
}