American Journal of Epidemiology
Copyright © 1998 by The Johns Hopkins University School of Hygiene and Public Health
All rights reserved
Vol. 148, No. 9
Printed in U.S.A.

A New Method for Estimating the Risk Ratio in Studies Using Case-Parental Control Design

Fengzhu Sun,1 W. Dana Flanders,2 Quanhe Yang,3 and Muin J. Khoury4

The authors describe a new, simple, noniterative, yet efficient method to estimate the risk ratio in studies using case-parental control design. The new method is compared with two other noniterative methods, Khoury's method and Flanders and Khoury's method, and with a maximum likelihood-based method of Schaid and Sommer. The authors found that the variance of the new estimation method is usually smaller than that of Khoury's method or Flanders and Khoury's method and that it is slightly larger than that of the maximum likelihood-based method of Schaid and Sommer. Despite the slightly larger variance of the new estimator compared with that of the maximum likelihood-based method, the simplicity of the new estimator and its variance makes the new method appealing. When genotypic information for only one parent is available, the authors also describe a method to estimate the risk ratio without assuming Hardy-Weinberg equilibrium or random mating. A simple formula for the variance of the estimator is given. Am J Epidemiol 1998;148:902-9.

case-control studies; gene frequency; genes; genetic markers; genetics; odds ratio; risk

Association studies are widely used in epidemiology. One such approach is the case-control design in which case subjects and appropriate control subjects are selected from the population, and the fraction of case subjects with a risk factor is then compared with the fraction of control subjects with that risk factor. A potential problem with the case-control design involves the selection of appropriate control subjects. Rubinstein et al.
(1) and Falk and Rubinstein (2) proposed a case-parental control design to avoid this problem. With the case-parental control design, case subjects are randomly sampled and the hypothetical control subjects are assumed to carry the genotypes formed by the nontransmitted parental alleles. If the candidate locus for disease susceptibility has alleles "M" and "N," with "M" the susceptible allele, the fraction of case subjects with allele "M" is compared with the fraction of hypothetical control subjects with that allele. The haplotype relative risk is the odds ratio used with the case-parental control design. Ott (3) studied the statistical properties of the haplotype relative risk under the recessive model, and Knapp et al. (4) showed that the distribution of the estimator for haplotype relative risk is stochastically smaller than that for risk ratio in the presence of positive linkage disequilibrium between a marker and disease allele.

For epidemiologic purposes, it is important to estimate the risk ratio of individuals carrying genotypes "MM" and "MN" versus those carrying "NN." Schaid and Sommer (5) gave maximum likelihood approaches to estimate the risk ratio with and without the Hardy-Weinberg equilibrium assumption. Because the main purpose of the case-parental control design is to avoid population stratification, the Hardy-Weinberg equilibrium assumption may not hold in some situations. Schaid and Sommer (5) proposed a maximum likelihood-based conditional on parental genotype (CPG) method to estimate the risk ratio without using the Hardy-Weinberg equilibrium assumption. Knapp et al. (6) showed how to estimate the relative risks based on the CPG method without using iterative algorithms. Khoury (7) and Flanders and Khoury (8) proposed noniterative methods to estimate the risk ratio. Both methods provide approximately unbiased estimates. Here we propose another simple noniterative method to estimate the risk ratio.
The variance of the new estimator is generally smaller than those obtained using Khoury's method and Flanders and Khoury's method, and it is slightly larger than that obtained with the CPG method. We also give a simple formula to estimate the variance of the new estimator. The simplicity of the new estimator and its variance makes the new method appealing.

Received for publication October 29, 1997, and accepted for publication March 25, 1998.
Abbreviations: CPG, conditional on parental genotype.
1 Department of Genetics, Emory University, Atlanta, GA.
2 Department of Epidemiology, Emory University, Atlanta, GA.
3 Birth Defects and Genetic Diseases Branch, Centers for Disease Control and Prevention, Atlanta, GA.
4 Office of Genetics and Disease Prevention, Centers for Disease Control and Prevention, Atlanta, GA.
Reprint requests to Fengzhu Sun, Department of Genetics, Emory University School of Medicine, 1462 Clifton Road, 4th Floor, Atlanta, GA 30322.

When genotypic information from only one parent is available, Schaid and Sommer (5) proposed an estimation method for the risk ratio assuming both Hardy-Weinberg equilibrium and random mating. Flanders and Khoury (8) proposed another method that only assumed random mating. The method we propose here requires neither of these two assumptions and is thus applicable to more general situations.

The organization of this paper is as follows. First, we describe the case-parental control design and present a new 4 × 4 table that summarizes relevant data by distinguishing the paternal and maternal alleles for case subjects and the hypothetical control subjects formed by the two nontransmitted parental alleles. Second, we calculate the probabilities for the 16 cells. Third, we present Khoury's method, Flanders and Khoury's method, and the CPG method of Schaid and Sommer in terms of the 16 cells.
Fourth, we present the new estimation method and the variance of the new estimator. Finally, we compare the new method with the other three methods using Monte-Carlo simulations.

MATERIALS AND METHODS

The case-parental control design

In the case-parental control design, case subjects are randomly sampled from all new case subjects in the population, and both the case subjects and their parents are genotyped at the candidate locus. Relevant data thus consist of the genotypes of case subjects and their parents. To facilitate analysis, we summarize the data using a new 4 × 4 table (table 1) formed by the genotypes of case subjects and the genotypes of hypothetical control subjects carrying the nontransmitted parental alleles. In this table, we distinguish between paternal and maternal alleles; that is, the first allele is the paternal allele, and the second allele is the maternal allele. We also use the following notation: NN = 0, NM = 1, MN = 1', and MM = 2, with the numbers corresponding to the number of susceptible alleles "M" a genotype has. Let $F_{ij}$ be the number of case subjects in cell (i, j). The genotypes of parents are completely determined in each cell and are given in parentheses. The first genotype in each set of parentheses is the father's genotype, and the second is the mother's genotype. The first allele in each parental genotype is the transmitted allele, and the second allele is the nontransmitted allele. All the estimating methods are based on this table.

Calculation of cell probabilities. First let us calculate the cell probabilities for table 1. We do not assume the Hardy-Weinberg equilibrium nor do we assume random mating. There are a total of nine mating types in the parental generation that can be identified by distinguishing paternal and maternal genotypes: (NN, NN), (NN, MN), (NN, MM), (MN, NN), (MN, MN), (MN, MM), (MM, NN), (MM, MN), and (MM, MM).
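The cell probabilities of table 2 can be checked by brute-force enumeration of Mendelian transmissions. The following Python sketch is only an illustration under stated assumptions: the names `cell_probabilities`, `q`, and `p` are hypothetical, `q[(i, j)]` stands for the probability of mating type (father i, mother j), `p[k]` for the disease risk of a genotype with k copies of "M," and the numerical inputs are made up rather than taken from the paper.

```python
from fractions import Fraction
from itertools import product

# Parental genotypes indexed by their number of "M" alleles.
PARENT_ALLELES = {0: ("N", "N"), 1: ("M", "N"), 2: ("M", "M")}

# Ordered genotype codes of table 1 (paternal allele first).
CODE = {"NN": "0", "NM": "1", "MN": "1'", "MM": "2"}


def cell_probabilities(q, p):
    """Return {(case_code, control_code): P(cell and disease)} by enumeration.

    q[(i, j)] -- hypothetical probability of mating type (father i, mother j)
    p[k]      -- hypothetical disease risk for a genotype with k "M" alleles
    """
    cells = {}
    for (i, j), q_ij in q.items():
        father, mother = PARENT_ALLELES[i], PARENT_ALLELES[j]
        # Each parent transmits either of its two alleles with probability 1/2.
        for a, b in product((0, 1), repeat=2):
            case = father[a] + mother[b]             # transmitted alleles
            control = father[1 - a] + mother[1 - b]  # nontransmitted alleles
            key = (CODE[case], CODE[control])
            risk = p[case.count("M")]
            cells[key] = cells.get(key, 0) + q_ij * Fraction(1, 4) * risk
    return cells


# Illustrative inputs only: equal mating-type probabilities, risks 1, 4, 10.
q = {(i, j): Fraction(1, 9) for i in range(3) for j in range(3)}
p = {0: 1, 1: 4, 2: 10}
cells = cell_probabilities(q, p)
```

Exact fractions are used so that individual entries can be compared directly with closed-form expressions; the sum of all 16 entries is the overall risk P(D) for the assumed inputs.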
As above, we denote NN = 0, MN = 1, and MM = 2 corresponding to the number of mutant alleles in the genotype. Let $q_{ij}$ be the probability of mating type (i, j) in the parental generation. Let $p_i$ be the risk of the disease among people with genotype i. Then it is possible to calculate the cell probabilities in table 1 conditional on the case subjects being affected. Table 2 gives the cell probabilities corresponding to table 1 multiplied by P(D), the overall risk of the disease in the general population. The risk ratio for individuals having genotype "MN" versus those having genotype "NN" is defined as $\lambda_1 = p_1/p_0$, and the risk ratio for individuals having genotype "MM" versus those having genotype "NN" is defined as $\lambda_2 = p_2/p_0$. We also define $\lambda_{-1} = p_1/p_2$. The objective of this study is to obtain accurate noniterative methods for estimating $\lambda_1$ and $\lambda_2$.

TABLE 1. Case-parental control design by transmitted and nontransmitted genotypes distinguishing paternal and maternal alleles

                                      Control
Case      NN(0)             NM(1)             MN(1')             MM(2)
NN(0)     F_00 (NN, NN)     F_01 (NN, NM)     F_01' (NM, NN)     F_02 (NM, NM)
NM(1)     F_10 (NN, MN)     F_11 (NN, MM)     F_11' (NM, MN)     F_12 (NM, MM)
MN(1')    F_1'0 (MN, NN)    F_1'1 (MN, NM)    F_1'1' (MM, NN)    F_1'2 (MM, NM)
MM(2)     F_20 (MN, MN)     F_21 (MN, MM)     F_21' (MM, MN)     F_22 (MM, MM)

TABLE 2. Cell probabilities multiplied by P(D) corresponding to table 1

                                 Control
Case      NN(0)           NM(1)           MN(1')          MM(2)
NN(0)     q_00 p_0        1/2 q_01 p_0    1/2 q_10 p_0    1/4 q_11 p_0
NM(1)     1/2 q_01 p_1    q_02 p_1        1/4 q_11 p_1    1/2 q_12 p_1
MN(1')    1/2 q_10 p_1    1/4 q_11 p_1    q_20 p_1        1/2 q_21 p_1
MM(2)     1/4 q_11 p_2    1/2 q_12 p_2    1/2 q_21 p_2    q_22 p_2

The Khoury, Flanders and Khoury, and CPG methods

To simplify the presentation, we define the following variables.

$$F_{01} + F_{01'} = m_{01},\quad F_{02} = m_{02},\quad F_{10} + F_{1'0} = m_{10},\quad F_{11'} + F_{1'1} = m_{11},$$
$$F_{12} + F_{1'2} = m_{12},\quad F_{20} = m_{20},\quad F_{21} + F_{21'} = m_{21}.$$

All the estimation methods depend on these seven variables. Khoury's estimator for $\lambda_1$ and $\lambda_2$ is given by

$$\hat{\lambda}_1 = \frac{m_{10}}{m_{01}}, \qquad \hat{\lambda}_2 = \frac{m_{20}}{m_{02}}.$$

By comparing table 1 above with table 1 in the paper by Flanders and Khoury (8), we calculate Flanders and Khoury's estimator as

$$\hat{\lambda}_1 = \frac{m_{10}/(m_{10} + m_{01}) + m_{11}/(m_{02} + m_{11} + m_{20})}{m_{01}/(m_{10} + m_{01}) + 2m_{02}/(m_{02} + m_{11} + m_{20})}, \qquad \hat{\lambda}_2 = \frac{m_{20}}{m_{02}}.$$

Note here that the estimator of $\lambda_2$ is the same as that given by Khoury (7). Flanders and Khoury (8) also proposed a modified method to estimate $\lambda_2$ in which $\lambda_{-1} = p_1/p_2$ can be estimated in a similar manner as $\lambda_1$:

$$\hat{\lambda}_{-1} = \frac{m_{12}/(m_{12} + m_{21}) + m_{11}/(m_{02} + m_{11} + m_{20})}{m_{21}/(m_{12} + m_{21}) + 2m_{20}/(m_{02} + m_{11} + m_{20})},$$

and then $\lambda_2$ can be estimated by $\hat{\lambda}_1/\hat{\lambda}_{-1}$.

Similarly, by comparing table 1 above with table 1 in Schaid and Sommer (5), the log-likelihood function to be maximized is given by

$$L(\lambda_1, \lambda_2) = (m_{20} + m_{21})\log(\lambda_2) + (m_{10} + m_{11} + m_{12})\log(\lambda_1) - (m_{12} + m_{21})\log(\lambda_1 + \lambda_2)$$
$$\qquad - (m_{02} + m_{11} + m_{20})\log(\lambda_2 + 2\lambda_1 + 1) - (m_{01} + m_{10})\log(\lambda_1 + 1).$$

The maximum point of this log-likelihood function yields the CPG estimator. We use the approach of Knapp et al. (6) to calculate the CPG estimator.

The new estimation method

When the number of case subjects is large, table 1 approximately equals a constant times table 2. From tables 1 and 2, we obtain one approximately unbiased estimator for $\lambda_{-1}$ and $\lambda_1$ similar to that of the Khoury method based on $m_{01}$, $m_{10}$, $m_{12}$, and $m_{21}$.

$$\hat{\lambda}_{-1}^{(0)} = \frac{m_{12}}{m_{21}}, \qquad \hat{\lambda}_1^{(0)} = \frac{m_{10}}{m_{01}}.$$

We can also obtain another approximately unbiased estimator for $\lambda_{-1}$ and $\lambda_1$ based on $m_{02}$, $m_{11}$, and $m_{20}$.

$$\hat{\lambda}_{-1}^{(1)} = \frac{m_{11}}{2m_{20}}, \qquad \hat{\lambda}_1^{(1)} = \frac{m_{11}}{2m_{02}}.$$

Now we have two estimators for both $\lambda_{-1}$ and $\lambda_1$. The weighted average of $\hat{\lambda}_{-1}^{(0)}$ with weight $m_{21}$ and $\hat{\lambda}_{-1}^{(1)}$ with weight $2m_{20}$ gives our estimator $\hat{\lambda}_{-1}$. Similarly, the weighted average of $\hat{\lambda}_1^{(0)}$ with weight $m_{01}$ and $\hat{\lambda}_1^{(1)}$ with weight $2m_{02}$ gives our estimator $\hat{\lambda}_1$. Because each weight equals the denominator of the estimator it multiplies, the final estimator is

$$\hat{\lambda}_{-1} = \frac{m_{11} + m_{12}}{2m_{20} + m_{21}}, \qquad \hat{\lambda}_1 = \frac{m_{10} + m_{11}}{2m_{02} + m_{01}}.$$

Finally, we estimate $\lambda_2$ by $\hat{\lambda}_1/\hat{\lambda}_{-1}$. From table 2, it can be shown that the above estimator is an approximately unbiased estimator of the risk ratio. We note that Flanders and Khoury's estimator of $\lambda_{-1}$ can be regarded as a weighted average of $\hat{\lambda}_{-1}^{(0)}$ with weight $m_{21}/(m_{12} + m_{21})$ and $\hat{\lambda}_{-1}^{(1)}$ with weight $2m_{20}/(m_{02} + m_{11} + m_{20})$. Similarly, their estimator of $\lambda_1$ is the weighted average of $\hat{\lambda}_1^{(0)}$ with weight $m_{01}/(m_{01} + m_{10})$ and $\hat{\lambda}_1^{(1)}$ with weight $2m_{02}/(m_{02} + m_{11} + m_{20})$.

Next we treat ($m_{ij}$, i, j = 0, 1, 2) as a multinomial random vector. Using the delta method, the variance of the logarithm of $\hat{\lambda}_{-1}$ and $\hat{\lambda}_1$ can be approximated by

$$\mathrm{Var}(\ln(\hat{\lambda}_{-1})) \approx \frac{1}{m_{11} + m_{12}} + \frac{1}{2m_{20} + m_{21}} + \frac{2m_{20}}{(2m_{20} + m_{21})^2},$$

$$\mathrm{Var}(\ln(\hat{\lambda}_1)) \approx \frac{1}{m_{10} + m_{11}} + \frac{1}{2m_{02} + m_{01}} + \frac{2m_{02}}{(2m_{02} + m_{01})^2}.$$

The variance of the logarithm of $\hat{\lambda}_2$ can be approximated by

$$\mathrm{Var}(\ln(\hat{\lambda}_2)) \approx \mathrm{Var}(\ln(\hat{\lambda}_{-1})) + \mathrm{Var}(\ln(\hat{\lambda}_1)) - \frac{2m_{11}}{(m_{10} + m_{11})(m_{11} + m_{12})}.$$

As previously shown (2, 8), we can approximate the confidence limits of the risk ratio by treating the logarithm of the estimated risk ratio as though it is normally distributed.

Only one parental genotype is known. Sometimes genotypic information from only one parent is available. We summarize the relevant data as in table 3 where $A_{ij}$ is the number of genotype i case subjects whose available parent has genotype j. We define the following estimator of $\lambda_{-1}$ and $\lambda_1$, respectively.

$$\hat{\lambda}_{-1} = \frac{A_{11} + A_{12} - A_{10}}{2A_{21}}, \qquad \hat{\lambda}_1 = \frac{A_{11} + A_{10} - A_{12}}{2A_{01}}.$$

Then $\lambda_2$ is estimated by $\hat{\lambda}_1/\hat{\lambda}_{-1}$. It can be shown that the estimator is an approximately unbiased estimator of the corresponding risk ratio when ($q_{ij}$) are symmetric. It is also an approximately unbiased estimator of the corresponding risk ratio if father or mother is missing with equal probability 1/2. Using the delta method, the variance of the logarithm of the estimated risk ratio is given by

$$\mathrm{Var}(\ln(\hat{\lambda}_{-1})) \approx \frac{1}{A_{21}} + \frac{1}{A_{11} + A_{12} - A_{10}} + \frac{2A_{10}}{(A_{11} + A_{12} - A_{10})^2},$$

$$\mathrm{Var}(\ln(\hat{\lambda}_1)) \approx \frac{1}{A_{01}} + \frac{1}{A_{11} + A_{10} - A_{12}} + \frac{2A_{12}}{(A_{11} + A_{10} - A_{12})^2},$$

$$\mathrm{Var}(\ln(\hat{\lambda}_2)) \approx \mathrm{Var}(\ln(\hat{\lambda}_{-1})) + \mathrm{Var}(\ln(\hat{\lambda}_1)) - \frac{2(A_{11} - A_{10} - A_{12})}{A_{11}^2 - (A_{10} - A_{12})^2}.$$

When ($q_{ij}$) are not symmetric, $\hat{\lambda}_{-1}$ and $\hat{\lambda}_1$ given above are not approximately unbiased estimators of $\lambda_{-1}$ and $\lambda_1$. Next we give another estimation method that is unbiased even if $q_{ij}$ are not symmetric. Let P and M be the numbers of available fathers and mothers, respectively. Let $P_{ij}$ ($M_{ij}$) be the number of genotype i case subjects whose father (mother) has genotype j. Define $A'_{ij} = MP_{ij} + PM_{ij}$. The new estimator is given as above by replacing $A_{ij}$ with $A'_{ij}$.
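The one-parent calculation can be sketched in a few lines of Python. This is a hedged illustration, not the authors' software: the function names are hypothetical, the table is assumed to be stored as a 3 × 3 nested list `A[i][j]` (case genotype i, parental genotype j), P and M are assumed to be the totals of the father-only and mother-only tables, and the counts are made up. The variance function applies only to the first (symmetric or equal-missingness) estimator, since the paper omits the variance of the pooled version.

```python
import math


def one_parent_estimates(A):
    """Point estimates (lambda_-1, lambda_1, lambda_2) from a table-3 style array.

    A[i][j] -- number of genotype-i cases whose available parent has genotype j.
    """
    lam_inv = (A[1][1] + A[1][2] - A[1][0]) / (2 * A[2][1])
    lam_1 = (A[1][1] + A[1][0] - A[1][2]) / (2 * A[0][1])
    return lam_inv, lam_1, lam_1 / lam_inv


def one_parent_log_variances(A):
    """Delta-method variances of ln(lambda_-1), ln(lambda_1), ln(lambda_2)."""
    n_inv = A[1][1] + A[1][2] - A[1][0]
    n_1 = A[1][1] + A[1][0] - A[1][2]
    v_inv = 1 / A[2][1] + 1 / n_inv + 2 * A[1][0] / n_inv ** 2
    v_1 = 1 / A[0][1] + 1 / n_1 + 2 * A[1][2] / n_1 ** 2
    # n_inv * n_1 equals A11^2 - (A10 - A12)^2.
    v_2 = v_inv + v_1 - 2 * (A[1][1] - A[1][0] - A[1][2]) / (n_inv * n_1)
    return v_inv, v_1, v_2


def pooled_table(P_tab, M_tab):
    """A'_ij = M * P_ij + P * M_ij, taking P and M as the two table totals."""
    P = sum(map(sum, P_tab))
    M = sum(map(sum, M_tab))
    return [[M * P_tab[i][j] + P * M_tab[i][j] for j in range(3)]
            for i in range(3)]


# Made-up counts for illustration only (not data from the paper).
A = [[10, 10, 0], [40, 80, 40], [0, 100, 100]]
lam_inv, lam_1, lam_2 = one_parent_estimates(A)
se_log_lam_1 = math.sqrt(one_parent_log_variances(A)[1])
```

With these made-up counts the three estimates come out at about 0.4, 4, and 10. For data with unequal numbers of available fathers and mothers, `one_parent_estimates(pooled_table(P_tab, M_tab))` gives the second estimator's point estimates.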
From table 1 we can obtain the number of fathers and mothers with genotypes 0, 1, and 2 given the genotypes of case subjects as shown in table 4. From table 2, the corresponding cell probabilities multiplied by P(D) for table 4 can be obtained as given in table 5. From tables 4 and 5, we can easily calculate the expected values of $A'_{ij}$. Substitution of the expected values of $A'_{ij}$ into the expression for the estimated risk ratio shows approximate unbiasedness. We assume that the availability of the maternal genotype is independent of the individual genotype as is the availability of paternal genotype. The variance of this estimator is much more complicated and is omitted from the paper. Interested readers can request the exact formula from the first author.

TABLE 3. Case-parental control design when only one parental genotype is available

                      Parental genotype
Case genotype     NN(0)     NM(1)     MM(2)
NN(0)             A_00      A_01      0
NM(1)             A_10      A_11      A_12
MM(2)             0         A_21      A_22

TABLE 4. Case-parental control design when only one parental genotype is available

                                    Parental genotype
Case genotype     NN(0)              NM(1)                               MM(2)
When only father is available
NN(0)             F_00 + F_01        F_01' + F_02                        0
NM(1)             F_10 + F_11        F_11' + F_12 + F_1'0 + F_1'1        F_1'1' + F_1'2
MM(2)             0                  F_20 + F_21                         F_22 + F_21'
When only mother is available
NN(0)             F_00 + F_01'       F_01 + F_02                         0
NM(1)             F_1'0 + F_1'1'     F_1'1 + F_1'2 + F_10 + F_11'        F_11 + F_12
MM(2)             0                  F_20 + F_21'                        F_22 + F_21

TABLE 5. Cell probabilities multiplied by P(D) corresponding to table 4

                                    Parental genotype
Case genotype     NN(0)                     NM(1)                          MM(2)
When only father is available
NN(0)             1/2(2q_00 + q_01)p_0      1/4(2q_10 + q_11)p_0           0
NM(1)             1/2(q_01 + 2q_02)p_1      1/2(q_12 + q_11 + q_10)p_1     1/2(2q_20 + q_21)p_1
MM(2)             0                         1/4(q_11 + 2q_12)p_2           1/2(q_21 + 2q_22)p_2
When only mother is available
NN(0)             1/2(2q_00 + q_10)p_0      1/4(2q_01 + q_11)p_0           0
NM(1)             1/2(q_10 + 2q_20)p_1      1/2(q_21 + q_11 + q_01)p_1     1/2(2q_02 + q_12)p_1
MM(2)             0                         1/4(q_11 + 2q_21)p_2           1/2(q_12 + 2q_22)p_2

Comparisons of the estimation methods

Next we use Monte-Carlo simulations to compare the four estimation methods: the CPG method of Schaid and Sommer (5), Khoury's method (7), Flanders and Khoury's method (8), and the new method. Under the Hardy-Weinberg equilibrium and random mating, the nine mating probabilities are determined by a single parameter, f, the prevalence of allele "N" in the parental population. Under these assumptions each $q_{ij}$ is the product of the Hardy-Weinberg frequencies of the paternal and maternal genotypes; for example,

$$q_{11} = 4f^2(1 - f)^2.$$

To simplify the presentation, we assumed the Hardy-Weinberg equilibrium and random mating in the parental population in our simulations, although our results should hold even if these conditions are not present. We also fixed the three risks $p_0$, $p_1$, and $p_2$ for individuals harboring genotypes 0, 1, and 2, respectively. For all the four estimators to be meaningful, none of the $m_{ij}$ should equal zero. We chose the number of case subjects such that the minimum expected value of $m_{ij}$ is about 10 for given f, prevalence of allele "N" in the general population, and for given risks $p_0$, $p_1$, and $p_2$. For that number of case subjects, we ran the simulations 10,000 times and recorded the estimated risk ratios calculated by each of the four methods. We calculated the average of the estimated risk ratios and the square root of the average squared differences of the estimated risk ratios from the true risk ratio for each method. Table 6 gives the averages of the estimated risk ratios from 10,000 simulations using Khoury's method, Flanders and Khoury's method, the new method, and the CPG method. Table 7 gives the square roots of the average squared differences of the estimated risk ratios from the true risk ratio. In both table 6 and table 7, we let $\lambda_1 = p_1/p_0 = 4$, $\lambda_2 = p_2/p_0 = 10$, and f = 0.2, 0.5, and 0.8.

RESULTS

In these simulations, the average number of cases was 550, 480, and 6,400 for f = 0.2, 0.5, and 0.8, respectively. Some general patterns emerge from our simulation results (tables 6 and 7), although they may not hold for all parameters. As seen in table 6, all simulations yield risk ratios that are close to the true value on average, although all averages are slightly above the true value with these sample sizes and the new and CPG methods are the closest. The variances of the four estimators are in the order of Khoury's method > Flanders and Khoury's method > the new method > CPG. The variance of the new estimator is only slightly larger than that of the CPG method. We also simulated the process using other parameters (data not shown). The patterns are similar.

TABLE 6. The average estimated risk ratios through 10,000 simulations using Khoury's method (K), Flanders and Khoury's method (FK), the new method (N), and the conditional on parental genotype method (CPG), with λ1 = 4 and λ2 = 10 and f, allele frequency of N

              λ1                               λ2
f       K       FK      N       CPG      K        FK       N        CPG
0.2     4.56    4.23    4.10    4.09     10.29    10.59    10.25    10.25
0.5     4.56    4.27    4.25    4.23     11.35    10.74    10.68    10.64
0.8     4.12    4.10    4.10    4.09     11.22    10.45    10.32    10.30

TABLE 7. The square roots of the average squared differences of estimated risk ratios with the true risk ratio through 10,000 simulations, with λ1 = 4 and λ2 = 10 using Khoury's method (K), Flanders and Khoury's method (FK), the new method (N), and the conditional on parental genotype method (CPG)

              λ1                               λ2
f       K       FK      N       CPG      K       FK      N       CPG
0.2     2.22    1.13    0.67    0.65     1.83    2.85    1.66    1.62
0.5     2.31    1.27    1.21    1.15     5.38    3.39    3.11    3.01
0.8     0.78    0.71    0.71    0.68     5.13    2.55    2.11    2.10

We also calculated the variance and the lower and upper confidence limits of the logarithm of the estimated risk ratio.
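As a rough illustration of that calculation for the two-parent design, the Python sketch below computes the new estimator, the delta-method variances of its logarithm, and normal-theory limits from the seven counts. The function names, the dictionary layout keyed by cell label, the 90 percent two-sided level, and the counts themselves are assumptions made for the example and are not taken from the paper.

```python
import math
from statistics import NormalDist


def new_estimates(m):
    """New noniterative estimates (lambda_-1, lambda_1, lambda_2) from the m-counts."""
    lam_inv = (m["11"] + m["12"]) / (2 * m["20"] + m["21"])
    lam_1 = (m["10"] + m["11"]) / (2 * m["02"] + m["01"])
    return lam_inv, lam_1, lam_1 / lam_inv


def log_variances(m):
    """Delta-method variances of ln(lambda_-1), ln(lambda_1), ln(lambda_2)."""
    d_inv = 2 * m["20"] + m["21"]
    d_1 = 2 * m["02"] + m["01"]
    v_inv = 1 / (m["11"] + m["12"]) + 1 / d_inv + 2 * m["20"] / d_inv ** 2
    v_1 = 1 / (m["10"] + m["11"]) + 1 / d_1 + 2 * m["02"] / d_1 ** 2
    v_2 = v_inv + v_1 - 2 * m["11"] / ((m["10"] + m["11"]) * (m["11"] + m["12"]))
    return v_inv, v_1, v_2


def log_limits(estimate, variance, level=0.90):
    """Two-sided normal-theory limits for ln(estimate); the level is an assumption."""
    z = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z * math.sqrt(variance)
    return math.log(estimate) - half_width, math.log(estimate) + half_width


# Made-up counts for illustration only (not data from the paper).
m = {"01": 20, "10": 80, "02": 10, "11": 80, "20": 100, "12": 80, "21": 200}
lam_inv, lam_1, lam_2 = new_estimates(m)
v_inv, v_1, v_2 = log_variances(m)
lower_1, upper_1 = log_limits(lam_1, v_1)
```

Exponentiating `lower_1` and `upper_1` gives limits on the risk-ratio scale; the same helper can be applied to `lam_2` with `v_2`.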
We used $\lambda_1 = 4$, $\lambda_2 = 10$, and f = 0.2 in these simulations. The histograms of Var(log($\hat{\lambda}_1$)) and Var(log($\hat{\lambda}_2$)) are given in figure 1. The real simulated variances of log($\hat{\lambda}_1$) and log($\hat{\lambda}_2$) are 0.026 and 0.039, respectively. From figure 1, we see that the estimated variances of log($\hat{\lambda}_1$) and log($\hat{\lambda}_2$) using our formulas center around the true variances. The histograms for the upper 10 percent confidence limits of log($\hat{\lambda}_1$) and log($\hat{\lambda}_2$) using the proposed approach are given in figure 2. The real values of the upper 10 percent confidence limits of log($\hat{\lambda}_1$) and log($\hat{\lambda}_2$) are 1.67 and 2.65, respectively. Thus, the estimated confidence limits also center around the true values.

FIGURE 1. The histograms for the estimated variances of log($\hat{\lambda}_1$) (a) and log($\hat{\lambda}_2$) (b). $\lambda_1 = 4$, $\lambda_2 = 10$, and f = 0.2. The real variances are 0.026 (a) and 0.039 (b), respectively. (Horizontal axis: estimated variance × 100.)

FIGURE 2. The histograms for the estimated upper 10 percent confidence limits for log($\hat{\lambda}_1$) (a) and log($\hat{\lambda}_2$) (b). $\lambda_1 = 4$, $\lambda_2 = 10$, and f = 0.2. The real upper 10 percent confidence limits are 1.67 (a) and 2.65 (b), respectively. (Horizontal axis: estimated upper 10 percent confidence limit.)

In the simulations, log($\lambda_1$) = log(4) = 1.39, and log($\lambda_2$) = log(10) = 2.30.
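A coverage check of this kind can be imitated with a small Monte-Carlo sketch. The Python code below is an assumption-laden illustration rather than a reproduction of the authors' simulation: it uses only the standard library, draws the table 2 cells from a multinomial distribution under Hardy-Weinberg equilibrium and random mating, sets the allele frequency to 0.5 (where the two alleles are interchangeable) with 480 cases, runs far fewer replicates than the paper, and interprets the upper limit as the estimate plus 1.645 standard errors on the log scale. All function and variable names are hypothetical.

```python
import math
import random
from statistics import NormalDist

CELLS = ("01", "10", "02", "11", "20", "12", "21", "other")


def cell_weights(freq_m, p0, p1, p2):
    """Table-2 weights of the seven m-cells plus the remaining cells, assuming
    Hardy-Weinberg equilibrium and random mating; freq_m is the M frequency."""
    g = ((1 - freq_m) ** 2, 2 * freq_m * (1 - freq_m), freq_m ** 2)
    q = [[g[i] * g[j] for j in range(3)] for i in range(3)]
    return (
        0.5 * (q[0][1] + q[1][0]) * p0,   # m01
        0.5 * (q[0][1] + q[1][0]) * p1,   # m10
        0.25 * q[1][1] * p0,              # m02
        0.5 * q[1][1] * p1,               # m11
        0.25 * q[1][1] * p2,              # m20
        0.5 * (q[1][2] + q[2][1]) * p1,   # m12
        0.5 * (q[1][2] + q[2][1]) * p2,   # m21
        q[0][0] * p0 + (q[0][2] + q[2][0]) * p1 + q[2][2] * p2,  # other cells
    )


def coverage_of_upper_limit(n_cases, reps, freq_m=0.5,
                            risks=(1.0, 4.0, 10.0), seed=0):
    """Return (mean estimate of lambda_1, share of replicates whose upper
    limit for ln(lambda_1) exceeds the true value)."""
    rng = random.Random(seed)
    weights = cell_weights(freq_m, *risks)
    z = NormalDist().inv_cdf(0.95)  # one-sided 95 percent point (assumption)
    true_log = math.log(risks[1] / risks[0])
    total, covered = 0.0, 0
    for _ in range(reps):
        draws = rng.choices(CELLS, weights=weights, k=n_cases)
        m = {c: draws.count(c) for c in CELLS}
        num = m["10"] + m["11"]
        den = 2 * m["02"] + m["01"]
        lam_1 = num / den
        var = 1 / num + 1 / den + 2 * m["02"] / den ** 2
        total += lam_1
        covered += math.log(lam_1) + z * math.sqrt(var) > true_log
    return total / reps, covered / reps


mean_lam_1, coverage = coverage_of_upper_limit(n_cases=480, reps=400)
```

The sketch does not guard against empty cells, which is acceptable at these expected counts but would need handling for rarer alleles or smaller samples.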
From the simulated data, we see that in 94.8 percent of the simulations, the estimated upper 10 percent confidence limits of log($\hat{\lambda}_1$) are greater than its true value. Similarly, in 95.2 percent of the simulations, the estimated upper 10 percent confidence limits of log($\hat{\lambda}_2$) are greater than its true value.

DISCUSSION

In this paper, we propose a new simple noniterative method for estimating the risk ratio in studies using case-parental control design. The new estimation method has several important features. First, it is an approximately unbiased estimator of the risk ratio. Second, the variance of the new estimator is smaller than that of Khoury's estimator and Flanders and Khoury's estimator, and it is roughly the same as that of the maximum likelihood-based method of Schaid and Sommer. Third, there exists a simple approximate formula for the variance of the logarithm of the estimated risk ratio. The simplicity of the new estimator and its variance makes the new method appealing. When information from only one parent is available, we propose a method to estimate the risk ratio without assuming the Hardy-Weinberg equilibrium and random mating, thus relaxing the conditions in earlier studies (5, 8). Therefore, the new method is applicable to more general situations.

ACKNOWLEDGMENTS

This research is partly supported by a grant from the Research Council of Emory University and NIH FIRST award R29-DK53392 (to F. Sun).

The authors thank Dr. Schaid for bringing their attention to the noniterative solution of the CPG method by Dr. Knapp et al. (6). They also acknowledge Dr. Knapp for suggestions that led to the improvement of the presentation.

REFERENCES

1. Rubinstein P, Walker M, Carpenter C, et al. Genetics of HLA disease associations: the use of the haplotype relative risk (HRR) and the "haplo-delta" (Dh) estimates in juvenile diabetes from three racial groups. (Abstract). Hum Immunol 1981;3:384.
2. Falk CT, Rubinstein P. Haplotype risk ratio: an easy reliable way to construct a proper control sample for risk calculations. Ann Hum Genet 1987;51:227-33.
3. Ott J. Statistical properties of the haplotype relative risk. Genet Epidemiol 1989;6:127-30.
4. Knapp M, Seuchter SA, Baur MP. The haplotype-relative-risk (HRR) method for analysis of association in nuclear families. Am J Hum Genet 1993;52:1085-93.
5. Schaid DJ, Sommer SS. Genotype risk ratio: methods for design and analysis of candidate-gene association studies. Am J Hum Genet 1993;53:1114-26.
6. Knapp M, Wassmer G, Baur MP. The relative efficiency of the Hardy-Weinberg equilibrium-likelihood and the conditional on parental genotype-likelihood methods for candidate-gene association studies. Am J Hum Genet 1995;57:1476-85.
7. Khoury MJ. Case-parental control method in the search for disease susceptibility genes. Am J Hum Genet 1994;55:414-15.
8. Flanders WD, Khoury MJ. Analysis of case-parental control studies: method for the study of associations between disease and genetic markers. Am J Epidemiol 1996;144:696-701.