Genetic Epidemiology 19:323–332 (2000)

Use of Classification Trees for Association Studies

Heping Zhang1* and George Bonney2

1 Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, Connecticut
2 National Human Genome Center, Howard University, Washington, DC

We propose the use of classification trees for association studies. This approach is applied to a data set from Genetic Analysis Workshop 9 (GAW9), and our analysis precisely identified two disease alleles. Our purpose is to demonstrate the great potential of tree-based analyses for genetic studies and to discuss some issues that warrant further investigation. Genet. Epidemiol. 19:323–332, 2000. © 2000 Wiley-Liss, Inc.

Key words: classification trees; recursive partitioning; association studies; genome scan; mode of inheritance

Contract grant sponsor: NIH; Contract grant numbers: HD30712, AG16996, GM31575.
*Correspondence to: Heping Zhang, Department of Epidemiology and Public Health, Yale University School of Medicine, New Haven, CT 06520-8034. E-mail: [email protected]
Received for publication 20 September 1999; revision accepted 11 December 1999

INTRODUCTION

We present classification trees for association studies and conduct a genome-wide screening using the data set from GAW9: Problem 1. As a non-parametric statistical method, classification trees are conventionally constructed for classification purposes [Breiman et al., 1984] and risk factor analyses [Zhang and Bracken, 1995, 1996; Zhang and Singer, 1999]. Recently, tree-based methods have attracted attention for genetic studies [Rao, 1998; Shannon et al., personal communication]. We describe a simple, innovative use of classification trees for identifying disease genes and susceptibility alleles and for inferring the mode of inheritance through appropriate interpretations of tree structures.

In the Data Set section, we briefly describe the data set and its underlying simulation model. Then, we present two major steps (i.e., tree growing and pruning) for constructing classification trees. In the Association Analysis section, we illustrate the use of classification trees for association studies, including the data preparation and tree presentation and interpretation. Finally, we discuss methodological and practical issues that warrant further investigation.

DATA SET

The GAW9-Problem 1 data set was provided to us by the Southwest Foundation. Hodge [1995a,b] and Speer et al. [1995] described in detail the development and simulation of the data set, and Hodge [1995a] summarized the results from the participants of GAW9 who analyzed the data set using a variety of association and linkage analysis approaches. The data set includes 200 nuclear families with at least one affected child and 100 control families with no affected members, resulting in a total of 1,484 individuals. Sixty highly polymorphic markers, spaced 2 cM apart, were provided on each of the six chromosomes, resulting in a total of 360 markers. Each marker has between 3 and 9 alleles, randomly assigned by an algorithm [Speer et al., 1995]. Recombination frequencies have been converted to map distances allowing for a Sturt level of interference, and there was no linkage disequilibrium among markers [Hodge, 1995a].

An oligogenic model with four “disease susceptibility” loci was used to generate the phenotype with a penetrance of 20%. The first disease locus is identical to marker locus D1G31, the 31st marker on chromosome 1, and the eighth allele at this locus is a disease allele with frequency 0.05. The second disease locus is identical to marker locus D5G23, the 23rd marker on chromosome 5, whose disease allele is number 7 with frequency 0.2. The remaining two loci are on chromosomes 4 and 6.
They were designed to be virtually undetectable in this data set and were indeed missed by the GAW9 participants [Hodge, 1995a]. Individuals who inherited at least four disease alleles were considered to be “susceptible” and had an equal risk (20%) of being affected across all susceptible genotypes.

Markers will be referred to in the format DcGmAa, denoting allele a of marker m on chromosome c, e.g., D1G31A8. This format is similar to the one used in Almasy et al. [1995] with adjustments to conform with the GAW format.

CLASSIFICATION TREES

Figure 1 depicts a representative tree resulting from a recursive partitioning process followed by a bottom-up pruning process. Zhang and Singer [1999] offer a detailed technical description for constructing classification trees. Here, we describe only those elements that are particularly relevant to our tree-based analysis and interpretation.

As in Figure 1, a tree consists of internal and terminal nodes, represented by circles and boxes, respectively. The top node is called the root node and contains the entire study sample; all other nodes are subsets of the study sample, which are some of the 1,484 individuals in our case. The tree divides each internal node into two offspring nodes (e.g., node 1 into nodes 2 and 3 in Fig. 1), whereas the terminal nodes (e.g., node 5 in Fig. 1) do not have offspring. The partition of an internal node into two offspring nodes is carried out by the values of one of the covariates, e.g., the alleles on the 360 markers, and it is aimed at improving the distribution homogeneity of the outcome, i.e., the affection status. For instance, in Figure 1, node 1 is split into nodes 2 and 3 according to the number of D5G23A7 alleles because this partition offers the “best possible” performance by attempting to send more unaffected individuals to node 2 and more affected individuals to node 3, as compared to any single binary split allowed by all alleles on the 360 markers. Specifically, one performance measure of a split s, called goodness-of-split [Zhang and Singer, 1999], can be defined as

$$ i(s) = p_{t_L}\,[\,a_{t_L}\log(a_{t_L}) + (1 - a_{t_L})\log(1 - a_{t_L})\,] + p_{t_R}\,[\,a_{t_R}\log(a_{t_R}) + (1 - a_{t_R})\log(1 - a_{t_R})\,], $$

where $t_L$ and $t_R$ are the two offspring nodes of node $t$ (e.g., nodes 2 and 3 from node 1), $p_{t_L}$ and $p_{t_R}$ are, respectively, the proportions of individuals in nodes $t_L$ and $t_R$, and $a_{t_L}$ and $a_{t_R}$ are the proportions of affected individuals within nodes $t_L$ and $t_R$, respectively. For example, in Figure 1, $p_2$ = 654/1484 = 0.44, $p_3$ = 830/1484 = 0.56, $a_2$ = 54/654 = 0.083, and $a_3$ = 193/830 = 0.233.

Fig. 1. The pruned tree at significance level 0.001. Inside each node are the node number (top) and the numbers of affected (middle) and unaffected (bottom) individuals. Under each internal node is the split based on the genotype. For example, node 1 is split based on the number of alleles of D5G23A7. The arrows guide the assignments to nodes 3 and 2 according to whether or not an individual has the allele.

Two central steps are involved in tree construction. The first step is a recursive partitioning process that splits the root node into two offspring nodes. Then, it divides the two offspring nodes into another generation of four nodes. This process continues recursively and eventually leads to a generally large tree. Whenever a split is made, it is intended to maximize $i(s)$ as defined above. There is no need to intervene and stop this recursive partitioning process because the process stops itself: we have only a finite sample and hence only a finite number of possible splits. A natural concern with this recursive partitioning is that the splits are based on smaller and smaller sample sizes. This concern will be addressed in the second step.

It is usually difficult to interpret a large tree.
Thus, the large tree initially produced by the recursive partitioning process is not really useful. This is why the second step, called pruning, is warranted. Heuristically, the pruning step is similar to backward deletion in ordinary linear regression, and it removes from the bottom up those splits that may be “superficial” or based on an unreliably small sample. This can be validated by sample reuse methods [Breiman et al., 1984] or assessed with a χ² test for 2 × 2 tables as described in Zhang and Singer [1999]. We adopt the χ² test to follow the tradition in linkage and association analyses. A split is regarded as unnecessary if the χ² tests from this split as well as its further splits are not significant at a prespecified level. All nodes resulting from unnecessary splits are then removed.

To illustrate this process, let us begin with the tree in Figure 2 and explain how the nodes are pruned. Under each internal node (represented by a circle), we list a raw χ² statistic. For example, we have “raw: 59.3” under node 1, indicating that the χ² statistic from a 2 × 2 table with cell values of 54 and 600 (from node 2) and 193 and 637 (from node 3) equals 59.3. Under each internal node, we also report a maximum χ² statistic, obtained as follows. Let us take two representative nodes (3 and 5) from Figure 2 and show how their maximum χ² statistics are derived. Node 3 has a raw χ² statistic of 13.7. It has two offspring internal nodes (6 and 12), and their raw χ² statistics are 10.9 and 4.8. Then, the maximum χ² statistic for node 3 is the maximum of 13.7, 10.9, and 4.8, which is 13.7 and turns out to be the same as the raw χ² statistic of node 3. For node 5, however, the raw χ² statistic is 4.6 and its offspring node (10) has a larger raw χ² statistic of 8.4. Thus, the maximum χ² statistic for node 5 becomes 8.4, which is the maximum of 4.6 and 8.4. Likewise, a maximum χ² statistic can be assigned to any internal node, as displayed in Figure 2.
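As a rough numerical check of the statistics introduced in this section, the following Python sketch recomputes the goodness-of-split and the raw χ² statistic for the split of node 1, and the maximum χ² statistics of nodes 3 and 5, using only the counts and values quoted in the text. The function names and the dictionary layout are illustrative assumptions and are not taken from the authors' software. The χ² statistic is assumed to be the ordinary Pearson statistic without continuity correction, which reproduces the reported 59.3. The cutoff 10.83 is the standard χ² critical value with 1 degree of freedom at the 0.001 level.

```python
import math

def goodness_of_split(left, right):
    """i(s) for a binary split; each argument is (affected, unaffected)."""
    n = sum(left) + sum(right)
    value = 0.0
    for affected, unaffected in (left, right):
        size = affected + unaffected
        a = affected / size  # proportion affected within the offspring node
        # Terms with a proportion of 0 are skipped (0 * log 0 is taken as 0).
        value += (size / n) * sum(q * math.log(q) for q in (a, 1 - a) if q > 0)
    return value

def chi2_2x2(left, right):
    """Pearson chi-square statistic (assumed uncorrected) for a 2 x 2 table."""
    a, b = left
    c, d = right
    n = a + b + c + d
    return n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))

def max_chi2(node, raw, internal_descendants):
    """Largest raw statistic among a node and the internal nodes below it."""
    below = internal_descendants.get(node, [])
    return max([raw[node]] + [raw[k] for k in below])

# Counts (affected, unaffected) for nodes 2 and 3, as quoted in the text.
node2, node3 = (54, 600), (193, 637)
root_i = goodness_of_split(node2, node3)
root_chi2 = chi2_2x2(node2, node3)
print(round(root_chi2, 1))  # 59.3, the raw statistic under node 1

# Raw statistics quoted in the text for nodes 3, 6, 12, 5, and 10.
raw = {3: 13.7, 6: 10.9, 12: 4.8, 5: 4.6, 10: 8.4}
internal_descendants = {3: [6, 12], 5: [10]}
CRITICAL = 10.83  # chi-square (1 df) critical value at the 0.001 level

for node in (3, 5):
    m = max_chi2(node, raw, internal_descendants)
    action = "keep split" if m >= CRITICAL else "make terminal"
    print(node, m, action)
```

Under these assumptions node 3 keeps its split (13.7 is above the cutoff) while node 5 becomes terminal (8.4 is below it).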
After the maximum χ² statistics are computed for all internal nodes, we set a critical χ² level, e.g., 10.83 at the significance level of 0.001. An internal node becomes a terminal node (in other words, its offspring nodes are pruned) if its maximum χ² statistic is less than the critical level. Consequently, nodes 4, 5, and 12 become terminal nodes because their maximum χ² statistics are less than 10.83, and nodes 8 through 11 and 13 through 19 are pruned. It is useful to note that the pruned internal nodes (e.g., node 10) cannot have maximum χ² statistics greater than or equal to the critical level because of the way in which the maximum χ² statistics are defined. This explains how we obtained the tree in Figure 1 at the significance level of 0.001. For the significance level of 0.005 or any other level, the pruning is done in the same way.

Fig. 2. Illustration of tree pruning. Inside each node are the node number (top) and the numbers of affected (middle) and unaffected (bottom) individuals. Under each internal node are the raw and maximum χ² statistics, as described in detail in the text.

ASSOCIATION ANALYSIS

Before conducting the tree-based analysis, we need to prepare the data set in a regression format. In particular, the covariates and the response variable must be defined as in logistic regression. The response variable is whether or not an individual is affected, i.e., the phenotype. The covariates in the present application include gender, the parental phenotypes, and the marker information. Gender is considered to accommodate potential sex differences. The parental phenotypes are included to control for potential familial correlations that cannot be explained by identifiable genetic information, which is the idea in Bonney’s [1986, 1987] regressive models. If a marker, DcGm, has $n_{cm}$ alleles, $n_{cm}$ covariates are created for this marker.
The a-th of these $n_{cm}$ covariates records the number of DcGmAa alleles, thus taking a value of 0, 1, or 2. For example, the 23rd marker on chromosome 5 has 7 possible alleles. Thus, 7 covariates are created for this marker. If an individual has genotype 46 at this marker, the 7 covariates take the values of 0, 0, 0, 1, 0, 1, 0. Moreover, if a subject has genotype 77 at this marker, the 7 covariates take the values of 0, 0, 0, 0, 0, 0, 2. Thus, the covariates are created to indicate the number of allele copies for each marker. The total number of covariates is $3 + \sum_{c=1}^{6}\sum_{m=1}^{60} n_{cm}$. Even though the node split uses one covariate at a time, i.e., each split is based on the number of copies of a single allele, the effect of two distinct alleles at the same marker can be accommodated by another split.

Having prepared the appropriate data set, we used Heping Zhang’s RTREE program, available from his website (http://peace.med.yale.edu), for tree construction. As described above, the first step of tree construction is to build an initially large tree using recursive partitioning. In fact, we grew a tree with over 100 nodes, far exceeding the needed size. Then, we applied the pruning procedure at various significance levels. For example, Figure 1 presents the pruned tree at the significance level 0.001. Because we conduct a genome screening, it is reasonable to begin with the significance level 0.001 to correct for multiple comparisons [Almasy et al., 1995]. Each of the two disease genes, D1G31 and D5G23, is used twice in the tree. Not only does this tree identify the correct loci, but its splits are also precisely determined by the two disease alleles D1G31A8 and D5G23A7.

Although 0.001 is a reasonable threshold level of significance, its implication is not entirely clear, and it has not been explored under the tree paradigm.
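The allele-count coding described at the start of this section can be sketched in a few lines of Python. This is only an illustration of the coding rule stated in the text; the function name and the representation of a genotype as a pair of allele numbers are assumptions, not the input format of RTREE.

```python
from collections import Counter

def allele_count_covariates(genotype, n_alleles):
    """Return n_alleles covariates; the a-th counts copies of allele a (0, 1, or 2)."""
    counts = Counter(genotype)
    return [counts.get(a, 0) for a in range(1, n_alleles + 1)]

# The two examples from the text for marker D5G23, which has 7 alleles.
print(allele_count_covariates((4, 6), 7))  # [0, 0, 0, 1, 0, 1, 0]
print(allele_count_covariates((7, 7), 7))  # [0, 0, 0, 0, 0, 0, 2]
```

Applying the same function to every marker and concatenating the results, together with gender and the two parental phenotypes, would give the covariate vector for one individual.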
We plan to conduct an extensive simulation study in the future to examine the appropriate significance level for tree pruning that balances the false positive and false negative errors in declaring candidate genes.

For the present analysis, it is useful to examine what happens when a more stringent significance level is adopted. If the significance level is set at 0.0005, the 7-node tree in Figure 3 is derived because the χ² test of splitting node 6 is no longer significant. Furthermore, if the level is 0.0001, the 5-node tree in Figure 4 is obtained, in which node 3 becomes a terminal node. The two splits in the last tree have $\chi^2_1$ values of 59.3 and 32.7, which are significant by any practical standard even though their precise, globally adjusted significance levels require further investigation. Thus, further pruning in Figure 4 is not justified. Even in the smallest tree, in Figure 4, the evidence is overwhelming that D1G31A8 and D5G23A7 are associated with the disease. The three trees in Figures 1, 3, and 4 all use the same two alleles, but the tree in Figure 1 is scientifically the most revealing to interpret.

Fig. 3. The pruned tree at significance level 0.0005. Inside each node are the node number (top) and the numbers of affected (middle) and unaffected (bottom) individuals. Under each internal node is the split based on the genotype. For example, node 1 is split based on the number of alleles of D5G23A7. The arrows guide the assignments to nodes 3 and 2 according to whether or not an individual has the allele.

It is useful to notice the order in which the loci D1G31 and D5G23 are used in the tree. Because allele D5G23A7 has a frequency of 0.2 whereas allele D1G31A8 has a frequency of 0.05, it is no coincidence that D5G23A7 precedes D1G31A8. In this case, we have better power to detect the more frequent disease allele.

How do we infer the mode of inheritance from Figure 1? The D5G23A7 allele is used twice, and it alone divides our sample into three disjoint groups (node 2 without D5G23A7, node 6 with one D5G23A7, and node 7 with two D5G23A7’s). According to the within-node risks reported in Table I, it is evident that D5G23A7 has an additive effect. Locus D1G31 is also used twice, each time following D5G23A7. Note that nodes 5 and 9 are not further split by D1G31. Individuals with one or two alleles of D1G31A8 are in the same terminal nodes. Thus, there is no significant evidence that an extra D1G31A8 increases the risk. However, our ability to separate the effect of one D1G31A8 from that of two D1G31A8’s is restricted by the sample size.

From Table I, we see that individuals in node 7 (with two D5G23A7’s) have the highest risk, followed by those in node 9, who have one D5G23A7 and at least one D1G31A8. Individuals in node 5 inherited at least one D1G31A8 but had no D5G23A7. On the other hand, those in node 8 inherited only one D5G23A7 but not D1G31A8. The risks in these two nodes are comparable, although the estimate in node 5 is slightly higher. This trend appears to suggest that the total number of alleles of D1G31A8 and D5G23A7 is associated with the disease. This conforms with the underlying model, although the remaining two disease alleles were not detected.

Fig. 4. The pruned tree at significance level <0.0001. Inside each node are the node number (top) and the numbers of affected (middle) and unaffected (bottom) individuals. Under each internal node is the split based on the genotype. For example, node 1 is split based on the number of alleles of D5G23A7. The arrows guide the assignments to nodes 3 and 2 according to whether or not an individual has the allele.

DISCUSSION

We have presented a tree-based association study using GAW9-Problem 1 for illustration. This approach is particularly convenient to use and makes no assumption with regard to the genetic model.
In our analysis, we were able to identify two disease alleles. After examining the contributions of the identified alleles, we found that our conclusions are largely consistent with the underlying simulation model.

The tedious step in conducting a tree-based analysis is preparing a data set in an appropriate format, which is equally true of other methods. Depending on the number of markers and the polymorphism of the markers, the number of covariates created as above could be so large that the data processing becomes prohibitive, depending on the available computing facilities. However, this problem can be solved easily using a preliminary screening step prior to a formal analysis. Instead of creating as many covariates as there are different alleles for each marker, we may use only two nominal covariates to indicate the genotype at the marker. Applying the RTREE program to this smaller data set will enable us to select a subset (say 20 to 50) of the markers. This subset could include any markers that are slightly associated with the disease or simply the markers that are used in the top 20 splits. Restarting from this subset of markers, we can perform the final tree-based analysis as presented above. We should note that this inconvenience is not the fault of the tree-based methods, and it will become less and less of an issue in the future. The present tree programs such as RTREE are not designed specifically for genome-wide screening, so we plan to redesign the input procedure in RTREE in order to read the marker data more efficiently.

TABLE I. Within-Node Risk Estimates Based on Figure 1

Node number    Risk estimate    Standard error
1              0.166            0.010
2              0.083            0.011
3              0.233            0.015
4              0.055            0.010
5              0.220            0.040
6              0.208            0.015
7              0.352            0.040
8              0.184            0.016
9              0.317            0.042

It is noteworthy that the covariates are created based on the copies of a single allele.
This coding of the data may not always give rise to the most powerful split when distinct alleles at the same marker have some joint effects. A simple approach is to create some composite variables, for example, by adding together two covariates created from the same marker. A more ambitious approach is to create a nominal covariate for each marker whose level represents the genotype and to use the nominal variable during the split. Thus, if a marker has 4 possible alleles, the corresponding nominal covariate has 4 × 3/2 = 6 levels. This approach is desirable for markers having no more than 5 possible alleles, because the number of possible splits using a nominal variable increases exponentially as the number of nominal levels increases. Otherwise, the simpler approach should be considered.

Although the data set analyzed here, GAW9-Problem 1, consists of genotypes and a binary phenotype only, tree-based methods have great potential for controlling environmental factors and accounting for gene-gene and gene-environment interactions. For this purpose, we will consider and investigate in a future study the two-stage tree-based strategy described by Zhang and Bracken [1996]. This strategy also appears useful for linkage analysis in the presence of genetic heterogeneity because we can use the trees to stratify the sample into genetically relatively homogeneous groups [Shannon et al., personal communication]. The implementation of this idea, however, requires revising the node splitting criterion.

The candidate genes identified by our tree-based analysis are associated with the phenotype, but they are not necessarily disease-causing genes or in linkage disequilibrium with the disease-causing genes. Post hoc statistical testing can be performed using the transmission/disequilibrium test [Spielman and Ewens, 1996]. In other words, the tree approach can be viewed as a means of screening the genome data.
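For readers unfamiliar with the transmission/disequilibrium test mentioned above, a minimal Python sketch of its usual McNemar-type statistic follows. It assumes the standard form (b − c)²/(b + c), where b and c are the numbers of heterozygous parents who did and did not transmit the candidate allele to an affected child, referred to a χ² distribution with 1 degree of freedom. The counts are hypothetical and are not taken from the GAW9 data, and the function name is an illustrative assumption.

```python
def tdt_statistic(transmitted, not_transmitted):
    """TDT statistic (b - c)^2 / (b + c) from heterozygous-parent transmissions."""
    b, c = transmitted, not_transmitted
    if b + c == 0:
        raise ValueError("no informative (heterozygous) parents")
    return (b - c) ** 2 / (b + c)

# Hypothetical counts for one candidate allele (not from the GAW9 data).
stat = tdt_statistic(60, 30)
CRITICAL = 10.83  # chi-square (1 df) critical value at the 0.001 level
print(stat, stat >= CRITICAL)
```

A candidate allele flagged by the tree would be carried forward only if a statistic of this kind also exceeded the chosen critical value.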
ACKNOWLEDGMENTS

This research was supported in part by NIH grants HD30712 (to H.Z.), AG16996 (to G.B. and H.Z.), and GM31575 (to the GAW9 data sets).

REFERENCES

Almasy L, Tierney C, Risch N. 1995. Use of sibling risk ratios and components of genetic variance in the characterization of a simulated oligogenic disease. Genet Epidemiol 12:565–70.
Bonney GE. 1986. Regressive logistic models for familial disease and other binary traits. Biometrics 42:611–25.
Bonney GE. 1987. Logistic regression for dependent binary observations. Biometrics 43:951–73.
Breiman L, Friedman J, Stone C, Olshen R. 1984. Classification and regression trees. New York: Chapman and Hall.
Hodge SE. 1995a. An oligogenic disease displaying weak marker associations: A summary of contributions to Problem 1 of GAW9. Genet Epidemiol 12:545–54.
Hodge SE. 1995b. Genetic Analysis Workshop 9: Development of problem 1. Genet Epidemiol 12:555–60.
Rao DC. 1998. CAT scans, PET scans, and genomic scans. Genet Epidemiol 15:1–18.
Speer MC, Terwilliger JD, Ott J. 1995. Data simulation for GAW9 problems 1 and 2. Genet Epidemiol 12:561–4.
Spielman RS, Ewens WJ. 1996. The TDT and other family-based tests for linkage disequilibrium and association. Am J Hum Genet 59:983–9.
Zhang HP, Bracken M. 1995. Tree-based risk factor analysis of preterm delivery and small-for-gestational-age birth. Am J Epidemiol 141:70–8.
Zhang HP, Bracken M. 1996. A tree-based two-stage risk factor analysis of spontaneous abortion. Am J Epidemiol 144:989–96.
Zhang HP, Singer B. 1999. Recursive partitioning in the health sciences. New York: Springer.
Zhang HP, Holford T, Bracken M. 1996. A tree-based method in prospective studies. Stat Med 15:37–50.