Technical Report 2003/GE1

On the accuracy of estimates of the effect of phenotype on disease derived from Mendelian randomisation studies

John R Thompson*, Martin D Tobin, Cosetta Minelli
Genetic Epidemiology Unit, Centre for Biostatistics, Department of Health Sciences, University of Leicester, U.K.
*E-mail: [email protected]

Date: 4th Aug 2003
Revision: 11th Aug 2003

We are grateful to Roger Harbord from the University of Bristol for pointing out that it is much better to represent the data in Table 1 in terms of the odds ratio for a 1 standard deviation change in the phenotype rather than a 1-unit change in the phenotype. This modification has been incorporated.

SUMMARY

Observational studies of the effects of intermediate phenotypes on disease are frequently difficult to interpret because of their susceptibility to confounding and reverse causation. An alternative approach, called Mendelian randomisation, involves selecting a gene that influences the level of the phenotype and assessing whether that gene is associated with the disease. Because genotype is decided by a random process at conception it is very unlikely that the gene:disease or gene:phenotype association will be biased by confounding. Here we consider the accuracy of the estimate of the effect of intermediate phenotype on disease derived by combining data from Mendelian randomisation studies of gene on disease and of gene on phenotype. The final accuracy is especially poor if the effect of genotype on phenotype is poorly estimated. Even when data are obtained from meta-analyses, Mendelian randomisation studies may not be accurate enough to confirm or contradict estimates of the effect of phenotype on disease derived from observational studies.
INTRODUCTION

Observational studies that investigate the relationship between intermediate phenotypes, such as biochemical measurements, and disease are often difficult to interpret due to their potential for reverse causation and confounding. That is, either the biochemical is affected by the presence of the disease, or the measurement is influenced by some other factor that is itself associated with the disease (1, 2). Yet where effective interventions are available that influence the intermediate phenotype, unbiased estimates of the relationship between the intermediate phenotype and disease are of real public health importance.

As a way of avoiding the problems of confounding and reverse causation, several authors have suggested using studies based on a gene that is known to influence the level of the intermediate phenotype. It is highly unlikely that the relationship between the gene and the disease is subject to reverse causation or confounding, since a subject's genotype is determined by a random process at conception. This naturally occurring randomisation, known as Mendelian randomisation, gives studies in which the exposure is the subject's genotype a similar status to randomised trials (2-5). If such a study shows an association between the gene and the disease, and if it is known that the gene acts through the biochemical phenotype, then we have strong evidence to suggest that the phenotype is causally linked to the disease. Furthermore, the estimates from studies of the effect of the gene on disease and the gene on biochemical phenotype can be used to estimate the size of the effect of the phenotype on disease, and this can be compared to the results from observational studies as a way of investigating whether or not the observational studies had been adequately adjusted for confounding (2, 5).
An important example of Mendelian randomisation is the investigation of the possible relationship between homocysteine and coronary heart disease (CHD) through the use of the methylene tetrahydrofolate reductase (MTHFR) gene. Numerous observational studies have found an association between the concentration of homocysteine in blood samples and the development of coronary heart disease. However, a large meta-analysis suggests that the effect is small: a decrease of 3 µmol/L leads to a reduction of about 11% in the risk of CHD (homocysteine levels in healthy subjects average about 12 µmol/L with a standard deviation of about 4 µmol/L) (6). The observational studies used in the meta-analysis show significant heterogeneity, which may well be because some of the studies are confounded by factors such as smoking, blood pressure and cholesterol. Despite the relatively small size of the effect, the relationship is of real importance because CHD is so common and homocysteine levels are reduced by dietary folate supplementation. Supplementation is easily implemented, for example through the FDA fortification program for refined flour, which commenced in the USA in 1998 (http://www.cfsan.fda.gov/~dms/fdafolic.html) (7).

The MTHFR gene has a common polymorphism that involves a C-to-T substitution at base 677. The effect of the C-to-T substitution is to reduce the enzyme's effectiveness and so to lower folate and raise homocysteine concentrations. The prevalence of the C-to-T substitution varies between populations, but in Europe and North America the polymorphism is found in about a third of all copies of the gene, so that approximately 11% of people carry two copies of the C-to-T substitution, referred to as TT. The meta-analysis of studies of the MTHFR gene in populations that have not been subject to dietary folate supplementation suggests an increased risk of CHD of about 16% when comparing TT subjects to subjects without any copy of the polymorphism (CC) (8).
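The 11% figure for TT carriers is what Hardy–Weinberg proportions give for an allele frequency of one third. The arithmetic below is a short illustration that assumes random mating, which the report does not state explicitly; the symbol q for the T allele frequency is introduced here for convenience.

```latex
\[
q = \tfrac{1}{3}:\qquad
P(\mathrm{TT}) = q^2 = \tfrac{1}{9} \approx 0.11,\qquad
P(\mathrm{CT}) = 2q(1-q) = \tfrac{4}{9} \approx 0.44,\qquad
P(\mathrm{CC}) = (1-q)^2 = \tfrac{4}{9} \approx 0.44 .
\]
```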
The relationship between the MTHFR polymorphism and CHD and the fact that the polymorphism affects homocysteine together are said to provide good evidence that the relationship found in observational studies is not just a result of reverse causation or confounding (2, 8). In the following sections we consider the use of Mendelian randomisation studies of gene on disease and gene on phenotype to estimate the size of the effect of phenotype on disease. We also consider whether the consistency between this estimate and the measurements made in observational studies can act as a test of the reliability of the observational results.

GENOTYPE, PHENOTYPE AND DISEASE

Suppose that we have genotype, G, a continuous intermediate phenotype, IP, and a binary disease outcome, D. The gene is selected because of its influence on the phenotype and we wish to establish the relationship between phenotype and disease. The hypothesised causal pathway is

\[ G \rightarrow IP \rightarrow D \]

Denote two of the genotypes that we wish to compare by G=1 and G=0, and assume that the intermediate phenotype measurement, $x$, is normally distributed with mean $\mu_1$ if G=1 and $\mu_0$ if G=0 but with common standard deviation $\sigma$. Further suppose that the probability of disease, D=1, is related to $x$ by a logistic function, so that

\[ p(D=1 \mid x) = \frac{\exp(\alpha+\beta x)}{1+\exp(\alpha+\beta x)} \]

Here $\beta$ represents the log odds ratio of disease given a unit change in the measurement of phenotype, $\log(OR_{PD})$. Using $\phi(x;\mu,\sigma)$ to represent the density of data $x$ from a normal distribution with mean $\mu$ and standard deviation $\sigma$, it follows that

\[ p(D=1 \mid G=i) = I_i = \int \phi(x;\mu_i,\sigma)\,\frac{\exp(\alpha+\beta x)}{1+\exp(\alpha+\beta x)}\,dx \]

which may be approximated by

\[ I_i \approx \frac{\exp(\alpha+\beta\mu_i)}{1+\exp(\alpha+\beta\mu_i)} \]

Thus the log odds of disease given G=i is approximately $\alpha+\beta\mu_i$ and the log odds ratio of disease comparing genotypes is approximately $\beta(\mu_1-\mu_0)$.
Thus the log odds ratio of disease comparing the two genotypes, $\log(OR_{GD})$, is approximately equal to the log odds ratio of disease given a unit change in the phenotype, $\log(OR_{PD})$, multiplied by the average difference between the intermediate phenotype measurements under the two genotypes:

\[ \log(OR_{GD}) \approx (\mu_1-\mu_0)\,\log(OR_{PD}) \]

Equivalently this may be represented as

\[ OR_{GD} \approx OR_{PD}^{\,(\mu_1-\mu_0)} \]

This adjustment has been assumed without comment in several articles on Mendelian randomisation or the meta-analysis of genetic studies (9). Under this model the full expression for the log odds ratio of disease comparing the genotypes is

\[ \log\left[\frac{I_1}{1-I_1}\cdot\frac{1-I_0}{I_0}\right] \]

Table 1 compares the approximate log odds ratio, $\beta(\mu_1-\mu_0)$, with the actual value obtained by numerical integration over the intermediate phenotype using a trapezium rule. In this table the disease frequency refers to the percentage of subjects with the average level of the phenotype for the healthier genotype that develop the disease. The phenotype measurements are compared in terms of the effect size of the genotype on phenotype, that is $(\mu_1-\mu_0)/\sigma$, and $\beta\sigma$, the log odds ratio for a one standard deviation change in the phenotype. The Stata program used for the numerical integration is given in Appendix 1.

Table 1 shows that the approximation is least reliable when the disease is common in the populations studied, the effect of the phenotype on disease is large, or the difference in phenotype measurements between genotypes is small relative to their variability. Since the method of Mendelian randomisation is typically applied to data from case-control studies, the rare disease assumption will usually hold. Further, since confounding is usually not a plausible alternative explanation to causality when the odds ratio is large (10), the only practical concern with the approximation arises when genes are studied that have a small effect on the phenotype.
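The comparison between the approximate and the integrated log odds ratio can be sketched in a few lines. The block below is a Python illustration, not the Stata program of Appendix 1; the function names, the integration range of plus or minus 8 standard deviations and the number of steps are assumptions chosen for the sketch. It places the healthier genotype at mean 0 so that the intercept fixes the disease frequency at that phenotype level.

```python
import math


def expit(t):
    """Logistic function exp(t) / (1 + exp(t))."""
    return 1.0 / (1.0 + math.exp(-t))


def normal_pdf(x, mu, sigma):
    """Density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def disease_prob(mu, sigma, alpha, beta, n_steps=20000, width=8.0):
    """Trapezium-rule value of I = integral of phi(x; mu, sigma) * expit(alpha + beta*x) dx."""
    lo = mu - width * sigma
    hi = mu + width * sigma
    h = (hi - lo) / n_steps
    total = 0.0
    for k in range(n_steps + 1):
        x = lo + k * h
        weight = 0.5 if k in (0, n_steps) else 1.0  # end points get half weight
        total += weight * normal_pdf(x, mu, sigma) * expit(alpha + beta * x)
    return total * h


def log_or_error(or_per_sd, effect_size, base_risk, sigma=1.0):
    """Return (approximate log OR, integrated log OR, percentage error)."""
    mu0, mu1 = 0.0, effect_size * sigma
    beta = math.log(or_per_sd) / sigma
    alpha = math.log(base_risk / (1.0 - base_risk))  # risk when x = mu0 = 0
    i0 = disease_prob(mu0, sigma, alpha, beta)
    i1 = disease_prob(mu1, sigma, alpha, beta)
    actual = math.log(i1 / (1.0 - i1)) - math.log(i0 / (1.0 - i0))
    approx = beta * (mu1 - mu0)
    return approx, actual, 100.0 * (approx - actual) / actual


approx, actual, error = log_or_error(or_per_sd=1.5, effect_size=1.0, base_risk=0.10)
print(f"approx {approx:.4f}  actual {actual:.4f}  error {error:.2f}%")
```

Calling `log_or_error` over a grid of odds ratios, effect sizes and base risks gives a table of percentage errors in the same spirit as Table 1, although the exact figures depend on the integration settings assumed here.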
When selecting an appropriate gene for Mendelian randomisation it would be natural to choose one that has as large an effect on the phenotype as possible, in which case the approximation is likely to be reasonable. The average difference between the levels of homocysteine in the CC and TT genotypes of the MTHFR gene is about 1.5 µmol/L; taking the standard deviation to be 4 µmol/L, the effect size is approximately 0.38. For populations in which the disease is rare the numerical integration shows the approximation to be accurate to within 3 or 4%.

TABLE 1. Percentage error in using the approximation $\beta(\mu_1-\mu_0)$ to the log odds ratio of the disease given genotype

Phenotype                     Disease Frequency (1)
Difference (3)   ORPD (2)      1%       5%      10%
0.5              1.1          0.0%     0.0%     0.0%
                 1.2          0.0%     0.0%     0.0%
                 1.5          0.2%     0.2%     0.3%
                 2.0          0.9%     1.0%     1.5%
1.0              1.1          0.0%     0.1%     0.1%
                 1.2          0.2%     0.2%     0.2%
                 1.5          1.0%     1.1%     1.3%
                 2.0          3.6%     4.1%     5.2%
2.0              1.1          0.1%     0.1%     0.1%
                 1.2          0.3%     0.3%     0.4%
                 1.5          1.7%     1.9%     2.1%
                 2.0          5.9%     6.5%     7.6%

1. Disease Frequency is the percentage of subjects with phenotype equal to the average for the healthier genotype who develop the disease
2. ORPD is the odds ratio of a one standard deviation change in phenotype on disease
3. Difference in average phenotype between the two genotypes divided by the standard deviation (assumed equal in the two groups)

ESTIMATING THE EFFECT OF PHENOTYPE ON DISEASE

Under the reasonable assumptions discussed in section 5, Mendelian randomisation allows the unconfounded estimation of both the mean effect of genotype on intermediate phenotype, $(\mu_1-\mu_0)$, and the log odds ratio of genotype on disease, which will be denoted by $\theta = \log(OR_{GD})$. Using the approximate relationship given in the previous section, the log odds ratio of a unit change in phenotype on disease, $\beta$, can be estimated as $\hat\theta/(\hat\mu_1-\hat\mu_0)$.
In Mendelian randomisation studies it would be usual to select a gene for which there is good evidence that $(\mu_1-\mu_0) > 0$, so that a test of a non-zero effect of phenotype on disease is almost equivalent to a test of a non-zero effect of genotype on disease. However, uncertainty about the size of the difference $(\mu_1-\mu_0)$ can greatly increase the uncertainty in the estimate, $\hat\theta/(\hat\mu_1-\hat\mu_0)$, where $\hat\theta$ is the estimated log odds ratio of genotype on disease. In large samples it will be reasonable to assume that $\hat\theta$ and $(\hat\mu_1-\hat\mu_0)$ are both normally distributed. The problem of finding a confidence interval for the ratio of two normal variates is considered by Kendall and Stuart (11) following the work of Fieller (12). The confidence interval for the ratio of two means $\mu_y/\mu_x$ is the set of $\rho$ that satisfy

\[ \rho^2\left(\hat\mu_x^2 - t^2\,\frac{\hat\sigma_x^2}{n-1}\right) - 2\rho\left(\hat\mu_x\hat\mu_y - t^2\,\frac{\hat\sigma_{xy}}{n-1}\right) + \left(\hat\mu_y^2 - t^2\,\frac{\hat\sigma_y^2}{n-1}\right) \le 0 \]

where $t$ represents the $(100-\alpha/2)\%$ point from a t-distribution with $n-1$ degrees of freedom and the hat denotes the usual estimates of mean, variance and covariance. Adapting their formula for large samples in which the two estimates used in the ratio are independent, for instance when they come from separate meta-analyses, and denoting the difference in means by $\delta = \mu_1-\mu_0$, a $(100-\alpha)\%$ confidence interval for the ratio $\theta/\delta$ has limits

\[ \frac{\hat\theta}{\hat\delta}\;\cdot\;\frac{1 \pm z_{\alpha/2}\sqrt{\dfrac{s_\theta^2}{\hat\theta^2} + \dfrac{s_\delta^2}{\hat\delta^2} - z_{\alpha/2}^2\,\dfrac{s_\theta^2\,s_\delta^2}{\hat\theta^2\,\hat\delta^2}}}{1 - z_{\alpha/2}^2\,\dfrac{s_\delta^2}{\hat\delta^2}} \]

where $z_{\alpha/2}$ is the $(100-\alpha/2)\%$ value from a standard normal distribution, for instance 1.96 for a 95% confidence interval, and $s$ denotes the standard error of the corresponding estimate. If the divisor $1 - z_{\alpha/2}^2\,s_\delta^2/\hat\delta^2$ is positive, as it usually will be if the gene is well chosen, then the confidence interval consists of all points within these limits, but if it is negative then it consists of all points outside the limits.
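The large-sample limits can be evaluated directly. The sketch below is a Python illustration with a hypothetical function name; it follows the same algebra as the Stata code in Appendix 2 (squared relative standard errors, then the limits as ratio × (1 ± g) / divisor) and is applied to the report's homocysteine figures: a log odds ratio of 0.15 with standard error 0.05, and a phenotype difference of 1.5 with standard error 0.2.

```python
import math


def ratio_confidence_limits(theta, se_theta, delta, se_delta, z=1.96):
    """Large-sample Fieller-type limits for theta/delta with independent estimates.

    Returns (estimate, lower, upper, divisor). The range between the limits is
    the confidence interval only when the divisor is positive; a negative value
    under the square root means no ratio can be excluded.
    """
    ratio = theta / delta
    re = (se_theta / theta) ** 2          # squared relative SE of theta
    rd = (se_delta / delta) ** 2          # squared relative SE of delta
    divisor = 1.0 - z * z * rd
    inside_root = re + rd - z * z * re * rd
    if inside_root < 0:
        return ratio, -math.inf, math.inf, divisor
    g = z * math.sqrt(inside_root)
    a = ratio * (1.0 - g) / divisor
    b = ratio * (1.0 + g) / divisor
    return ratio, min(a, b), max(a, b), divisor


# Homocysteine example: genotype on CHD, and genotype on homocysteine (umol/L).
est, lower, upper, divisor = ratio_confidence_limits(0.15, 0.05, 1.5, 0.2)
print(f"log OR per umol/L: {est:.2f} ({lower:.2f}, {upper:.2f})")
print(f"OR per umol/L: {math.exp(est):.2f} ({math.exp(lower):.2f}, {math.exp(upper):.2f})")
```

With these inputs the divisor is positive, so the interval is the bounded range between the two limits; multiplying the limits by 1.5 before exponentiating gives the odds ratio for a 1.5 µmol/L change.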
Thus, assuming that $\theta$ is positive, when it is possible that $\delta$ is close to zero, the confidence interval for the ratio $\theta/\delta$ will stretch to $+\infty$ because $\delta$ may be small and positive, and to $-\infty$ because $\delta$ may be small and negative; but because $\delta$ is known not to be large, the values excluded from the confidence interval will be those corresponding to small values of $\theta/\delta$. In the extreme case the quadratic may have no solution, in which case the confidence interval is $(-\infty,\infty)$ and the sample is not informative enough to exclude any ratio.

The confidence limits depend on the ratio of the estimates $\hat\theta$ and $(\hat\mu_1-\hat\mu_0)$ to their respective standard errors. We refer to the ratio of the estimate to its standard error as the standardised effect, Z. Thus the confidence interval will consist of points between the limits provided $Z(\hat\delta)^2 > 1.96^2$, that is, provided the mean effect of genotype on phenotype is accurately estimated.

In the homocysteine example described in the introduction the odds ratio comparing the TT with the CC genotype has been estimated in a recent meta-analysis to be 1.16 with a 95% confidence interval (1.05,1.28). Thus the log odds ratio, $\hat\theta$, is 0.15 with standard error 0.05, so that the standardised effect is $Z(\hat\theta)=3$. A similar meta-analysis found the average levels (95% CI) of homocysteine (µmol/L) in disease-free subjects to be: CC 9.9 (9.8,10.0), TT 11.4 (11.0,11.8). The difference is 1.5 with a standard error of 0.2 and $Z(\hat\delta)=7.5$. The log odds ratio of CHD due to a 1 µmol/L change in homocysteine is therefore estimated to be 0.10 (0.03,0.18), equivalent to an odds ratio of 1.11 (1.03,1.20). Assuming linearity of effect, a more direct comparison is obtained by calculating the effect of a 1.5 µmol/L change in homocysteine, which would give an odds ratio of 1.16 (1.05,1.31).

TABLE 2. Median percent* increase in the width of the 95% confidence interval for the log odds ratio of a unit change in phenotype on disease compared to that for genotype on disease.

                   Z§(θ)
Z§(δ)        3      5     10     20
  3         80    160    380    840
  5         30     65    165    390
 10          8     15     50    135
 20          2      4     15     40

* percentages over 10 are rounded to the nearest 5.
§ Z = expected standardised effect = average of estimate/standard error for the log odds ratio of genotype on disease (θ) and the average effect of genotype on phenotype (δ)

The uncertainty in the size of the average change in homocysteine due to genotype has led to a widening of the 95% confidence interval, from (1.05,1.28) for genotype on disease, to (1.05,1.31) for intermediate phenotype on disease. Table 2 relates the width of the confidence interval for the log odds ratio of genotype on disease to that of the corresponding estimate of the effect of phenotype on disease. The results were obtained from sets of 50,000 simulations from normal populations with an odds ratio of genotype on disease of 1.5 and an average effect of genotype on phenotype of 1. The Stata program used for the calculations is given in Appendix 2. The table shows the median increase for cases where the confidence interval for the ratio is finite and thus excludes cases where the divisor in the formula for the standard confidence limits is negative. In Table 2 a negative divisor occurs in about 1 sample in 6 when the expected value of $Z(\hat\delta)=3$ and almost never in the rest of the table.

If the mean difference in phenotype due to genotype, $\delta$, is poorly estimated, say with a ratio of the mean to its standard error of 5 or less, then the precision with which the effect of phenotype on disease can be estimated will be appreciably worse than the precision of the estimate of genotype on disease. Even with a good degree of accuracy for $\delta$, there will be a sizeable loss in the relative precision if $\theta$ is very accurately estimated.
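The simulation behind Table 2 can be mimicked with a short script. The following is a Python sketch of the scheme described in Appendix 2 rather than a port of the Stata program: the function name, the seed and the reduced number of simulations (20,000 instead of 50,000) are assumptions made here, and the standard errors are drawn with a spread of one fifth of their expected value, as in the appendix.

```python
import math
import random
import statistics


def median_width_increase(z_theta, z_delta, n_sims=20000, seed=1):
    """Median % increase in 95% CI width for theta/delta relative to that for theta.

    z_theta, z_delta: expected standardised effects (estimate / standard error).
    Samples whose interval is unbounded (non-positive divisor) are excluded.
    """
    rng = random.Random(seed)
    theta, delta, z = math.log(1.5), 1.0, 1.96
    se_theta, se_delta = theta / z_theta, delta / z_delta
    relative_widths = []
    for _ in range(n_sims):
        e = rng.gauss(theta, se_theta)           # simulated log OR, genotype on disease
        d = rng.gauss(delta, se_delta)           # simulated phenotype difference
        se = rng.gauss(se_theta, se_theta / 5)   # simulated standard error of e
        sd = rng.gauss(se_delta, se_delta / 5)   # simulated standard error of d
        re, rd = (se / e) ** 2, (sd / d) ** 2
        divisor = 1.0 - z * z * rd
        inside_root = re + rd - z * z * re * rd
        if divisor <= 0 or inside_root < 0:
            continue                             # interval not bounded
        g = z * math.sqrt(inside_root)
        width_ratio = abs((e / d) * 2.0 * g / divisor)
        width_theta = 2.0 * z * abs(se)
        relative_widths.append(width_ratio / width_theta)
    return 100.0 * statistics.median(relative_widths) - 100.0


for z_delta in (3, 5, 10, 20):
    row = [median_width_increase(z_theta, z_delta) for z_theta in (3, 5, 10, 20)]
    print(z_delta, [round(v) for v in row])
```

Each printed row corresponds to one value of Z for the genotype:phenotype effect, with Z for the genotype:disease effect increasing across the row; the figures will differ slightly from Table 2 because of simulation error and the unrounded output.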
In the homocysteine example $Z(\hat\delta)$ is 7.5 and $Z(\hat\theta)$ is 3 so the loss in precision is small.

TESTING FOR LACK OF CONFOUNDING IN THE OBSERVATIONAL STUDIES

Observational studies will, after adjustment for measured confounders, provide an estimate, $\hat\beta$, of the log odds ratio of a unit change in the phenotype on disease. Mendelian randomisation provides the alternative estimate $\hat\theta/(\hat\mu_1-\hat\mu_0)$, which is very unlikely to be biased by confounding. One possible measure of the degree of residual confounding in the observational study is

\[ \hat\gamma = \frac{\hat\beta - \hat\theta/(\hat\mu_1-\hat\mu_0)}{\tfrac{1}{2}\left[\hat\beta + \hat\theta/(\hat\mu_1-\hat\mu_0)\right]} \]

Taking a parallel from bioequivalence studies that place limits on the ratio of mean responses to two treatments (13), the amount of residual confounding might be considered negligible if $\hat\gamma$ lies within specified limits, for instance provided that $-0.2 < \hat\gamma < 0.2$.

When planning a Mendelian randomisation study to assess whether information on the effect of phenotype on disease from observational studies is biased, it might be sensible to adopt the requirement that there is a 90% chance of finding $-0.2 < \hat\gamma < 0.2$ when there is no actual confounding.

TABLE 3. Percent of samples for which $-0.2 < \hat\gamma < 0.2$ when there is no confounding of the relationship between phenotype and disease, in terms of the expected standardised estimates (Z) of θ, the log odds ratio of genotype on disease, δ, the average effect of genotype on phenotype, and β, the log odds ratio of phenotype on disease.

                         Z§(β)
Z§(θ)   Z§(δ)      3      5     10     20
  3       3       26     29     31     31
          5       30     34     38     38
         10       32     38     42     43
         20       32     39     43     45
  5       3       30     35     38     39
          5       35     43     49     51
         10       38     49     58     62
         20       39     51     62     65
 10       3       32     38     42     43
          5       38     49     58     62
         10       42     58     75     82
         20       43     62     81     90
 20       3       33     39     43     44
          5       39     51     62     65
         10       44     62     82     90
         20       44     66     90     98

§ Z = expected standardised effect = average of estimate/standard error
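The chance of meeting the ±0.2 criterion when there is no confounding can be explored by simulation. The sketch below is a Python illustration, not the report's Appendix 3 program. It assumes that the measure of residual confounding is the difference between the observational and Mendelian randomisation estimates divided by their average, which is one reading of the formula, and it uses log(1.2) for both log odds ratios and 1 for the phenotype difference, the values the report quotes for Table 3. The function name, seed and number of simulations are hypothetical choices.

```python
import math
import random


def chance_within_limits(z_beta, z_theta, z_delta, limit=0.2, n_sims=20000, seed=1):
    """Percent of simulated samples with -limit < gamma < limit under no confounding.

    gamma is taken to be (b - t/d) / (0.5 * (b + t/d)), an assumed form of the
    residual-confounding measure; z_* are expected estimate / standard error ratios.
    """
    rng = random.Random(seed)
    beta = theta = math.log(1.2)   # no confounding: both routes target the same value
    delta = 1.0
    hits = 0
    for _ in range(n_sims):
        b = rng.gauss(beta, beta / z_beta)      # observational estimate
        t = rng.gauss(theta, theta / z_theta)   # genotype on disease
        d = rng.gauss(delta, delta / z_delta)   # genotype on phenotype
        mr = t / d                              # Mendelian randomisation estimate
        gamma = (b - mr) / (0.5 * (b + mr))
        if -limit < gamma < limit:
            hits += 1
    return 100.0 * hits / n_sims


for z in (3, 5, 10, 20):
    print(z, round(chance_within_limits(z, z, z)))
```

The loop prints the chance for the cases in which all three standardised effects are equal; other combinations can be obtained by passing different values for each argument.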
The problem in practice is that to achieve this degree of accuracy in the estimate $\hat\gamma$, extremely large sample sizes would be required in both the Mendelian randomisation studies and the observational study. Once again the chance that $-0.2 < \hat\gamma < 0.2$ depends on the ratios of the individual estimates to their standard errors. Table 3 sets out the percentage of time that the criterion will be met in terms of the expected values of those ratios. Each figure is obtained by generating 50,000 samples from normal distributions with means $\beta$ and $\theta$ equal to log(1.2) and mean $\delta=1$, although the general pattern shown in Table 3 depends only on the standardised effects and not the means. The Stata program used for the calculations is given in Appendix 3. The only combinations that produce a 90% chance that the criterion is met are those for which the expected ratios are at least 10. Even if the estimates of $\beta$, $\theta$ and the $\mu$'s come from large meta-analyses it is unlikely that this degree of accuracy will be achieved, and hence Mendelian randomisation is unlikely to have the power to rule out residual confounding in the studies of phenotype and disease.

ASSUMPTIONS BEHIND MENDELIAN RANDOMISATION

Models similar to those described in the section on Genotype, Phenotype and Disease underlie most Mendelian randomisation studies. However, the idea of a single gene acting through a single phenotype to increase the risk of disease is clearly a simplification of reality even before assumptions are made about the mathematical patterns describing the associations.

First, it is possible for genes to demonstrate a pleiotropic effect (14), that is, they may influence more than one phenotype. For example, the number of apolipoprotein ε4 alleles affects both plasma cholesterol and triglyceride levels, and each of these intermediate phenotypes predisposes to coronary heart disease (15).
If the other phenotypes do not affect the risk of the disease under study this will not matter, but it may be that only part of the effect of the gene on disease acts through the selected phenotype. In such situations, analysis by Mendelian randomisation will tend to overestimate the impact of the selected phenotype on disease (unless the genotype affects two intermediate phenotypes that have opposing effects on the risk of disease).

Second, it is possible that the level of the phenotype is affected by more than one gene (genetic heterogeneity). This will not invalidate the model unless the genes interact. If there are several independent genes, it is important to select one with a large effect in order to achieve a powerful study, but it may also be possible to validate the estimates of phenotype on disease obtained from studies with one of the genes by comparison with results from studies of another gene, again assuming no interaction. More difficult is the situation in which the phenotype is affected by environmental factors that interact with the effect of the gene. There is some evidence to suggest that this may happen with homocysteine when dietary supplementation with folate lowers homocysteine and masks the impact of the MTHFR gene (8). In this situation it is important to allow for the environmental factor, for instance by only analysing studies in populations that have not had folate supplementation.

Third, it is possible for the gene variant under study to be in linkage disequilibrium with a gene variant that itself has an effect on the phenotype and/or disease (16); that is, the two genes tend to occur together in the population either because they are physically linked or for reasons of population origin.
This may distort the estimates of genotype:disease association and/or genotype:phenotype association and thereby the indirect estimate of phenotype:disease association, and because linkage disequilibrium varies between populations, these estimates may differ substantially between populations.

Genetic association studies can be susceptible to confounding by population stratification, that is, confounding due to differences in the prevalence of variant alleles and disease prevalence in genetically distinct subsections of the population. One may be able to test for hidden population stratification in published Mendelian randomisation studies using genotype frequencies from unlinked markers (17). In other instances, this may not be possible; even so, confounding due to population stratification will only occur in relatively rare circumstances (18). More problematic is that many genetic association studies have been undertaken using family-based controls; these designs reduce power and make the interpretation of Mendelian randomisation studies difficult.

DISCUSSION

Under reasonable assumptions Mendelian randomisation provides unconfounded estimates of the effects of genotype on phenotype and genotype on disease. By combining these two estimates it is possible to derive an estimate of the effect of phenotype on disease. Unlike a direct estimate from an observational study, the derived estimate should be free of confounding or reverse causation. However, the information on the gene:disease and gene:phenotype relationships will often come from case-control studies that may be susceptible to other forms of bias; it is still important that this is minimised by good design, such as case-control studies nested within major cohorts, as planned by the UK Biobank (http://www.ukbiobank.ac.uk/) and the US National Children’s Study (http://nationalchildrensstudy.gov/).
In section 3 the confidence interval for the derived estimate was shown to depend on the standardised effects (estimate/standard error) of the log odds ratio of genotype on disease and the mean effect of genotype on phenotype. Because the derived estimate of phenotype on disease is obtained as a ratio, the accuracy of the divisor is particularly important, and unless the standardised effect of genotype on phenotype is over 5 the confidence interval of the ratio can be very wide. In section 4 the derived estimate is compared with the direct estimate from observational studies and once again the ability to establish equivalence, and hence a lack of material confounding, is limited unless all three studies provide a very high degree of precision.

Given the general lack of power of individual genetic association studies (16, 19, 20), the indirect estimate of the effect size of the phenotype on disease will usually be based on meta-analyses of studies on genotype:phenotype and genotype:disease associations. However, even when the estimates of the effects of genotype on phenotype and genotype on disease come from large meta-analyses that provide relatively small standard errors, our results illustrate that they may not provide sufficient precision to derive a precise estimate of the effect of phenotype on disease. Moreover, our study shows how the precision of the derived estimate depends on the “balance” in the amount of information (i.e. the precision) available for genotype:phenotype and for genotype:disease association studies, the precision of the genotype:phenotype association being crucially important.

In practice, the genotype:phenotype association will often be measured on a subset of subjects within a study primarily designed to assess genotype:disease association. Consequently genotype:phenotype studies tend to be fewer in number and smaller in size than those assessing genotype:disease association.
As the disease under study might itself influence the phenotype, the genotype:phenotype association is often assessed among controls in case-control studies designed to investigate genotype:disease association. Such reverse causality is seen in studies of myocardial infarction, where blood lipids are deranged around the time of myocardial infarction (21). Furthermore, many biochemical intermediates are difficult to measure; for example, homocysteine levels increase in blood after collection, and differences in the timing of analysis of samples and in laboratory techniques may account for important heterogeneity among studies. Although the precision of meta-analyses of genotype:disease association studies is generally high, as shown by a recent review of 55 meta-analyses (20), this does not help when using these results to indirectly estimate the effect of phenotype on disease if not coupled with accurate data on the association between genotype and phenotype.

A further important difficulty is that the pooled estimates of genotype:disease association and genotype:phenotype association upon which Mendelian randomisation studies often depend may be distorted by publication bias. Publication bias is particularly pronounced in genetic association studies (16). Moreover, pooling of data in meta-analyses is often hampered by differences in disease definition between studies (16). These limitations mean that accurate estimates of the effects of genotype on disease and the effects of genotype on phenotype are difficult to obtain in genetic association studies, and that caution must be exercised in combining estimates from published studies (14, 16).

Mendelian randomisation studies provide a theoretically attractive solution to the problems of confounded estimates from classical epidemiological studies. However, unrealistic assumptions about the simplicity of the causal pathway underlying common diseases may lead to biased estimates.
Furthermore, the imprecision of the derived estimates in Mendelian randomisation studies suggests that their utility will be limited given the sample sizes currently used in genetic association studies. In summary, Mendelian randomisation studies may provide a useful adjunct to classical epidemiological studies, but they do not signal their end. There is still a need for well-designed epidemiological studies to provide evidence on intermediate phenotypes.

REFERENCES

1. Greenland S, Rothman K. Measures of Effect and Measures of Association. In: S G, ed. Modern epidemiology. Philadelphia: Lippincott-Raven, 1998:47-64.
2. Davey Smith G, Ebrahim S. "Mendelian randomisation": can genetic epidemiology contribute to understanding environmental determinants of disease? International Journal of Epidemiology 2003;in press.
3. Youngman L, Keavney B. Plasma fibrinogen and fibrinogen genotypes in 4685 cases of myocardial infarction and in 6002 controls: test of causality by "Mendelian randomisation". Circulation 2000;102:31-32.
4. Clayton D, McKeigue PM. Epidemiological methods for studying genes and environmental factors in complex diseases. Lancet 2001;358:1356-60.
5. Keavney B. Genetic epidemiological studies of coronary heart disease. Int J Epidemiol 2002;31:730-6.
6. The Homocysteine Studies Collaboration. Homocysteine and risk of ischemic heart disease and stroke: a meta-analysis. JAMA 2002;288:2015-22.
7. Rader JI. Folic acid fortification, folate status and plasma homocysteine. J Nutr 2002;132:2466S-2470S.
8. Klerk M, Verhoef P, Clarke R, Blom HJ, Kok FJ, Schouten EG. MTHFR 677C->T polymorphism and risk of coronary heart disease: a meta-analysis. JAMA 2002;288:2023-31.
9. Wald DS, Law M, Morris JK. Homocysteine and cardiovascular disease: evidence on causality from a meta-analysis. BMJ 2002;325:1202.
10. Hill A. The environment and disease: association or causation? J R Soc Med 1965;58:295-300.
11. Kendall M, Stuart A. The advanced theory of statistics. Volume 2. Inference and relationship. London: Griffin, 1973.
12. Fieller E. Some problems in interval estimation. Journal of the Royal Statistical Society, B 1954;16:175.
13. Berger R, Hsu J. Bioequivalence trials, intersection-union tests and equivalence confidence sets. Statistical Science 1996;11:283-319.
14. Palmer LJ. Complex Diseases. In: Palmer LJ, ed. Biostatistical Genetics and Genetic Epidemiology. Chichester: Wiley, 2002:206-217.
15. Dallongeville J, Lussier-Cacan S, Davignon J. Modulation of plasma triglyceride levels by apoE phenotype: a meta-analysis. J Lipid Res 1992;33:447-54.
16. Cardon LR, Bell JI. Association study designs for complex diseases. Nat Rev Genet 2001;2:91-9.
17. Pritchard JK, Rosenberg NA. Use of unlinked genetic markers to detect population stratification in association studies. Am J Hum Genet 1999;65:220-8.
18. Wacholder S, Rothman N, Caporaso N. Population stratification in epidemiologic studies of common genetic variants and cancer: quantification of bias. J Natl Cancer Inst 2000;92:1151-8.
19. Hirschhorn JN, Lohmueller K, Byrne E, Hirschhorn K. A comprehensive review of genetic association studies. Genet Med 2002;4:45-61.
20. Ioannidis JP, Trikalinos TA, Ntzani EE, Contopoulos-Ioannidis DG. Genetic associations in large versus small studies: an empirical assessment. Lancet 2003;361:567-71.
21. Clark S, Youngman LD, Sullivan J, Peto R, Collins R. Stabilization of Homocysteine in Unseparated Blood over Several Days: A Solution for Epidemiological Studies. Clin Chem 2003;49:518-20.
APPENDIX 1

Stata program for calculating the values in Table 1

***********************************************************
* COMPARISON OF RATIO ESTIMATE WITH NUMERICAL INTEGRATION
* Parameters set for the middle entry in Table 1
***********************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
local mu1=1              /* phenotype mean, genotype=1 */
local mu0=0              /* phenotype mean, genotype=0 */
local sigma=1            /* sd of phenotype */
local alpha=logit(0.1)   /* log odds of disease at x=mu0; logit(0.1) is 10% prevalence */
local N 50000            /* points per integration */
*-------------------------------------
* Try a range of Odds ratios
* per sd change in phenotype
* (phenotype on disease)
*
foreach OR of numlist 1.1 1.2 1.5 2.0 {
    local beta=log(`OR')/`sigma'
    * gene:disease log odds ratio implied by the ratio estimate
    local approx = `beta' * ( `mu1' - `mu0' )
    quietly {
        clear
        set obs `N'
        *-------------------------------------
        * Integrate over +-6 st devs
        *
        local delta=(`mu1'-`mu0'+12*`sigma')/`N'             /* step size */
        gen double x=`mu0'-6*`sigma'+`delta'*(_n-0.5)        /* mid points */
        * phenotype densities in each genotype group
        gen double f1=normden((x-`mu1')/`sigma')/`sigma'
        gen double f0=normden((x-`mu0')/`sigma')/`sigma'
        * density times probability of disease at phenotype x
        gen double P1 = f1*invlogit(`alpha'+`beta'*(x-`mu0'))
        gen double P0 = f0*invlogit(`alpha'+`beta'*(x-`mu0'))
        * marginal disease risk in each genotype group
        summarize P1
        local I1=r(sum)*`delta'
        summarize P0
        local I0=r(sum)*`delta'
        * gene:disease log odds ratio from the integration
        local actual = log((`I1'/(1-`I1'))/(`I0'/(1-`I0')))
    }
    local error = 100*(`approx'-`actual')/`actual'
    di "OR " %6.2f `OR' " Actual " %10.4f `actual' " Approx " /*
    */ %10.4f `approx' " Error " %6.2f `error' "%"
}

APPENDIX 2

Stata program for calculating the values in Table 2

*****************************************************
* RELATIVE SIZES OF CI in MENDELIAN RANDOMISATION
* calculates the values for Table 2
*****************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
*----------------------------------------*
* Base Data
*
local me=log(1.5)   /* log OR GD */
local md=1          /* difference in phenotype */
local zn=1.96       /* 95% N(0,1) */
*----------------------------------------
* set ranges for standardised effects
* e=log odds ratio GD
* d=mean phenotype difference (gene on phenotype)
*
matrix xe=(3, 5, 10, 20)
matrix xd=(3, 5, 10, 20)
*----------------------------------------
* Define matrices to store results
* P=% increase in width
* H=% infinite CI
*
matrix P=J(4,4,0)
matrix H=J(4,4,0)
*----------------------------------------
* loop over set ranges
*
forvalues i=1/4 {
    forvalues j=1/4 {
        local sde=`me'/xe[1,`j']
        local sdd=`md'/xd[1,`i']
        local b=`me'/`md'
        local sdse=`sde'/5   /* assumption re variation in st dev */
        local sdsd=`sdd'/5   /* actually depends on sample size   */
                             /* but not critical                  */
        quietly {
            drop _all
            *----------------------------------------
            * generate 50,000 random datasets
            * e=log odds ratio GD
            * d=mean phenotype difference (gene on phenotype)
            * ratio=estimate log odds PD
            *
            set obs 50000
            gen e=`me'+`sde'*invnorm(uniform())
            gen d=`md'+`sdd'*invnorm(uniform())
            gen ratio=e/d
            gen se=`sde' +`sdse'*invnorm(uniform())
            gen sd=`sdd' +`sdsd'*invnorm(uniform())
            *----------------------------------------
            * CI for log odds PD by Eq of Page 9
            *
            gen re=(se/e)^2
            gen rd=(sd/d)^2
            gen g=`zn'*sqrt(rd+re-`zn'^2*re*rd)
            gen h=1-`zn'^2*rd
            gen rp=ratio*(1+g)/h
            gen rm=ratio*(1-g)/h
            egen sl=rmin(rp rm)
            egen su=rmax(rp rm)
            gen width=(su-sl)
            *----------------------------------------
            * width CI log odds GD
            *
            gen widthe=2*`zn'*se
            *----------------------------------------
            * relative width
            *
            gen c=width/widthe
            * median % increase in width, finite intervals only
            summarize c if h > 0, detail
            local perc=100*r(p50)-100
            matrix P[`i',`j']=`perc'
            * % of datasets giving an infinite interval
            count if h < 0
            local perc=100*r(N)/50000
            matrix H[`i',`j']=`perc'
        }
    }
}
matrix list P, format(%8.0f)
matrix list H, format(%8.0f)

APPENDIX 3

Stata program for calculating the values in Table 3

***********************************************************
* COMPARISON OF EPIDEMIOLOGICAL AND MENDELIAN
* RANDOMISATION
* ESTIMATES OF PD
* Calculate entries for table 3
***********************************************************
version 8
cd "d:\research\genetics\mendelian randomisation\power\stata"
clear
set more off
*----------------------------------
* Parameter choice
* d=Phenotype difference
* e=GD log odds ratio
* b=PD log odds ratio
*
local Zd=20
local md=1
local sed=`md'/`Zd'
local me=log(1.2)
local mb=`me'/`md'
*----------------------------------
* Loop over e and b
*
foreach Ze of numlist 3 5 10 20 {
    local see=`me'/`Ze'
    foreach Zb of numlist 3 5 10 20 {
        quietly {
            set obs 50000
            local seb=`mb'/`Zb'
            gen b=`mb'+`seb'*invnorm(uniform())
            gen e=`me'+`see'*invnorm(uniform())
            gen d=`md'+`sed'*invnorm(uniform())
            * relative difference between b and the ratio estimate e/d
            * (e/d rather than e*d, consistent with mb=me/md above)
            gen lambda=2*(b-e/d)/(b+e/d)
            * indicator that the two estimates agree to within 20%
            replace lambda=(lambda>-0.2)*(lambda < 0.2)
            summarize lambda
            local y=100*r(sum)/_N
            drop b e d lambda
        }
        di "Zd= " %4.0f `Zd' " Zb="%4.0f `Zb' " Ze="%4.0f `Ze' " p=" %4.0f `y'
    }
}
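
The interval calculated in the Appendix 2 program can be written out explicitly. The following is a sketch read off the code, not a quotation of the equation on page 9 that the program refers to. The symbols $\hat e$, $\hat d$, $s_e$, $s_d$ and $z$ are shorthand introduced here for the variables e, d, se, sd and the local zn, and the two estimates are assumed to be independent.

```latex
\[
r_e = \left(\frac{s_e}{\hat e}\right)^{2}, \qquad
r_d = \left(\frac{s_d}{\hat d}\right)^{2}, \qquad
g = z\sqrt{r_d + r_e - z^{2} r_e r_d}, \qquad
h = 1 - z^{2} r_d
\]
\[
\text{limits for } \frac{\hat e}{\hat d}:\quad
\frac{\hat e}{\hat d}\cdot\frac{1-g}{h}
\quad\text{and}\quad
\frac{\hat e}{\hat d}\cdot\frac{1+g}{h}
\]
```

These limits bound a finite interval only when $h > 0$, that is when $|\hat d| > z\,s_d$. The program counts the simulated datasets with h < 0 in the matrix H and summarises the relative width only over those with h > 0.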