Discovery of Predictive Models in an Injury Surveillance Database: An Application of Data Mining in Clinical Research

John H. Holmes, Ph.D.(1), Dennis R. Durbin, M.D., M.S.(1), Flaura K. Winston, M.D., Ph.D.(2)
(1) University of Pennsylvania Medical Center, Philadelphia, PA
(2) The Children's Hospital of Philadelphia, Philadelphia, PA

ABSTRACT

A new, evolutionary computation-based approach to discovering prediction models in surveillance data was developed and evaluated. This approach was operationalized in EpiCS, a type of learning classifier system specially adapted to model clinical data. In applying EpiCS to a large, prospective injury surveillance database, EpiCS was found to create accurate predictive models quickly; these models were highly robust, classifying >99% of cases early during training. After training, EpiCS classified novel data more accurately (p<0.001) than either logistic regression or decision tree induction (C4.5), two traditional methods for discovering or building predictive models.

INTRODUCTION

Data mining can be defined as the methods and processes used to perform knowledge discovery, which in turn can be defined as the identification of meaningful and useful patterns in databases [1]. While these patterns may be of interest for such tasks as hypothesis generation, an especially important role of data mining is the discovery of models that predict the class membership of previously unseen "cases," or database records. These predictive models are in common use throughout many clinical domains for diagnosis, risk assessment, and determining the appropriateness of tests and interventions. In addition, predictive models may be used as a knowledge base in a decision support system. Traditional methods of building predictive models include logistic and other regression procedures applied to population-based data.
While often effective, the process of deriving models in this way can be cumbersome, and is often hampered by pre-existing investigator biases as to the candidate variables to be entered into an exploratory model. Even before this, it is often tedious to derive a set of candidate variables when the data comprise many records (cases) and/or fields, which can delay the model-building process. Finally, sparse distributions of variables in the data will affect the successful building of a model: missing data on even one candidate term will cause an entire case to be discarded from a model. Solutions to this problem have included imputation, an arguably controversial method, or simply excluding the variable from the list of candidates. These issues argue for a different approach to developing predictive models.

The domain of knowledge discovery in databases (KDD) is rich with such approaches. One in particular, decision tree induction, embodied in the software C4.5 [2], has been used extensively and successfully to discover prediction models in data. Unfortunately, this approach has several faults that argue against its use in large clinical databases. For example, conflicting or contradictory data, poorly separated classes, and small cell sizes for specific attributes are not adequately addressed by this approach. Such problems are common in clinical data, prompting the search for yet another approach to building predictive models.

One such approach is found in evolutionary computation, which uses one or more techniques that reflect a Darwinian metaphor. In the evolutionary computation paradigm, the overarching goal is to evolve solutions to problems. This has been accomplished with genetic algorithms [3], genetic programming [4], and learning classifier systems [5].
The learning classifier system (LCS) is an especially attractive evolutionary computation method for prediction model discovery in that it integrates a knowledge base as part of its design, making it something like a "learning expert system." We have embraced this approach, modifying the LCS paradigm to facilitate its use in large databases for epidemiologic surveillance. Specifically, we applied a new LCS, called EpiCS, to the domain of population-based head injury surveillance. This investigation reports on this application and its success at discovering predictive models compared to logistic regression and decision tree induction.

METHODS

System: EpiCS

Knowledge representation. Like other learning classifier systems, EpiCS uses a knowledge representation scheme centered on condition-action rules, called classifiers. The condition side of a classifier is referred to as a taxon, while the action side is simply its action. Each classifier has a strength, which indicates its accuracy relative to others in the population. Typically, classifiers are encoded as bit strings, such that integer and real numbers are represented in base-2 notation. This notation is extended with a third character in the alphabet, "*", a "wild card" that can take the value of either 0 or 1. This representation facilitates the operations of the genetic algorithm, discussed below.

Classifiers are held in a static array of constant size, called a population, which is the knowledge base of EpiCS. When EpiCS is initialized, the population is filled with a predetermined number of randomly generated classifiers. The unique classifiers in the population are referred to as macrostate classifiers; each macrostate classifier represents a separate rule. At initialization, the population is composed of many unique classifiers; however, because they were randomly generated, these classifiers cannot be expected to represent plausible hypotheses for a given problem.
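The matching rule implied by this ternary alphabet can be sketched in a few lines of Python. This is an illustrative fragment under our own naming, not the original EpiCS implementation:

```python
# Illustrative sketch of ternary taxon matching (not the original EpiCS code).
# A taxon is a string over {0, 1, *}; '*' is a wild card matching either bit.

def matches(taxon: str, input_bits: str) -> bool:
    """True when every taxon position equals the input bit or is a wild card."""
    return len(taxon) == len(input_bits) and all(
        t == b or t == '*' for t, b in zip(taxon, input_bits))
```

For example, the taxon "1*0" matches the inputs "100" and "110" but not "101"; the more wild cards a taxon carries, the more general the classifier it encodes.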
As the system learns, the classifiers in the population are increasingly refined, to the point where, out of an entire population, a relatively small set of unique classifiers emerges to define the system's knowledge base. The change in composition of the population, from many unique classifiers to increasingly numerous instances of non-unique, highly general classifiers, reflects a fundamental process of inductive learning: generalization. The consequences of generalization are a reduction in the resources required to search the population for matching classifiers, as well as the potential for increased accuracy when the system is applied to novel cases from the testing set.

Components. There are three functional components to EpiCS: performance, reinforcement, and discovery (Figure 1). The performance component creates a subset of all classifiers in the population whose taxa match a stream of data received as input from the environment. In this way, the performance component is analogous to a forward-chaining rule-based system. All classifiers whose taxa match the input comprise the Match Set [M], even though some of these classifiers may advocate different actions. This process is equivalent to the triggering of rules, and [M] is analogous to an agenda in an expert system. From [M], the classifier with the highest strength is selected, and its action is used as the output of the system; this process is analogous to the firing of a rule in an expert system. When EpiCS is used to classify cases, as would be done during testing, the operation cycle stops here. During training, however, two additional components are used by EpiCS to reinforce classifiers according to their performance and to discover new, yet plausible, classifiers as a form of hypothesis generation.

[Figure 1. Schematic diagram of EpiCS: input flows through the performance component to output, with the classifier population shared by the reinforcement and discovery components.]
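The performance cycle just described can be sketched as follows, assuming a simple Python representation of classifiers (the names and structure are ours, not the EpiCS source):

```python
from dataclasses import dataclass

@dataclass
class Classifier:
    taxon: str       # condition over the ternary alphabet {0, 1, *}
    action: int      # advocated classification (e.g., 1 = head injury)
    strength: float  # accuracy-based strength

def matches(taxon: str, input_bits: str) -> bool:
    return len(taxon) == len(input_bits) and all(
        t == b or t == '*' for t, b in zip(taxon, input_bits))

def classify(population, input_bits):
    """Form the Match Set [M] from all classifiers whose taxa match the input,
    then output the action of the strongest member of [M]. Returns (None, [])
    when no classifier matches -- an indeterminate decision."""
    match_set = [c for c in population if matches(c.taxon, input_bits)]
    if not match_set:
        return None, []
    best = max(match_set, key=lambda c: c.strength)
    return best.action, match_set
```

The `None` return corresponds to a case that cannot be classified, the situation tallied by the indeterminant rate reported later in the paper.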
In supervised learning, the true classification of a training case is known to the system, and this information is used by the reinforcement component to adjust the strengths of all classifiers in the system according to the following scheme. First, a Correct Set [C] is created from the classifiers in [M] whose action bits match the decision output by the system, and the remaining classifiers in [M] form the set [notC]. This assumes that the decision advocated by the system is correct; if the decision was not correct, then only [notC] is formed. Next, a tax is applied to [C], reducing the strength of each classifier in [C] by 10 percent. The purpose of this tax is to inhibit premature convergence: the accurate classifiers in [C] at one time step may not be accurate at another. Often, this premature convergence is due to overly general classifiers in the population. The tax helps to "smooth" the asymptotic ascent to an accurate, yet optimally general, population of classifiers. A reward, R, is then distributed among the classifiers in [C], adjusted so that a higher fraction is apportioned to more general classifiers. The strength of each classifier in [notC] is diminished proportionally by a penalty, typically 50%. The effect of this reward scheme is to exert some degree of selection pressure on the population, such that classifiers are chosen in the discovery component for reproduction based on their strength relative to other classifiers in the population.

The discovery component employs the genetic algorithm, a method of optimization and discovery predicated on Darwinian evolution. The role of the genetic algorithm in EpiCS is to discover new classifiers by applying genetic operators such as reproduction, crossover, and mutation to the strongest classifiers in the population.
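The reinforcement scheme and the genetic operators just described can be sketched as follows. The 10% tax and 50% penalty are the paper's stated values, but the reward magnitude, the generality weighting (here, by wild-card count), and all names are illustrative assumptions rather than the published EpiCS parameters:

```python
import random
from dataclasses import dataclass

@dataclass
class Classifier:
    taxon: str       # condition over {0, 1, *}
    action: int
    strength: float

def reinforce(match_set, true_action, reward=300.0, tax=0.10, penalty=0.50):
    """Split [M] into [C] and [notC], penalize [notC], tax [C], then
    distribute the reward R with a higher fraction to more general classifiers."""
    correct = [c for c in match_set if c.action == true_action]    # [C]
    incorrect = [c for c in match_set if c.action != true_action]  # [notC]
    for c in incorrect:
        c.strength *= (1.0 - penalty)
    if not correct:  # system's decision was wrong: only [notC] is formed
        return
    for c in correct:
        c.strength *= (1.0 - tax)  # the tax inhibits premature convergence
    weights = [1 + c.taxon.count('*') for c in correct]  # generality weighting (assumed)
    total = sum(weights)
    for c, w in zip(correct, weights):
        c.strength += reward * w / total

def crossover(taxon_a: str, taxon_b: str, rng: random.Random) -> str:
    """One-point crossover: the child takes a prefix from one parent and a
    suffix from the other."""
    point = rng.randrange(1, len(taxon_a))
    return taxon_a[:point] + taxon_b[point:]

def mutate(taxon: str, rng: random.Random, rate: float = 0.01) -> str:
    """Replace each position with a random ternary character at the mutation rate."""
    return "".join(rng.choice("01*") if rng.random() < rate else t for t in taxon)
```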
The newly formed classifiers "inherit" traits from the strongest classifiers (their "parents"), yet they contain different "genetic material" obtained via crossover and mutation. Since the population is steady-state, an equal number of classifiers, typically the weakest, are deleted to make room for the new, stronger ones. If these new classifiers prove accurate, their strengths will increase over time, and they will subsequently be selected for reproduction. Over time, the strongest, most accurate classifiers will prevail at the expense of the weakest, least accurate; hence the Darwinian metaphor.

Source of data

The data for this investigation were obtained from the Partners for Child Passenger Safety (PCPS) project, a five-year investigation into child occupant protection in automobile crashes [6,7]. Funded by the State Farm Insurance Companies, the research is being conducted at The Children's Hospital of Philadelphia and the University of Pennsylvania Medical Center. The goal of the PCPS project is to identify ways to reduce the morbidity and mortality of children involved in automobile crashes. We address this goal through a multidisciplinary approach that incorporates clinical researchers, epidemiologists, biomechanical engineers, automotive engineers, and informaticians to identify a spectrum of modifiable risk factors for pediatric injury in crash events. The PCPS project uses State Farm Insurance Companies claims data from 15 states and the District of Columbia on automobile crashes involving at least one child less than 16 years of age. Approximately 30,000 such claims are received each year at the University of Pennsylvania from State Farm. Of these, 20% are subjected to a telephone interview to obtain more specific details about the crash and any injuries incurred by children involved in the crash.
These two sources of data, claims records and telephone interviews, contribute to the richness of the PCPS surveillance database, which is reflected in the number of variables (over 500), as well as in the large number of records. This investigation focused solely on 47 numeric variables selected from a number of "modules" defined by their function. These included passenger restraint (characteristics of any restraint devices and their usage, pertaining to an individual passenger); crash (characteristics of the crash event, such as point of impact, type of object struck, and estimated speed); kinematics (ejection from vehicle, contact with surfaces such as windows or dashboard, and evidence of occupant motion during the crash); and demographics (age and injury-predisposing factors such as physical disability). A dichotomously coded element was used to indicate the presence or absence of head injury defined as "serious" or worse by the Abbreviated Injury Scale (AIS) [8].

Data preparation

A total of 8,334 records comprised the pool of data to be mined. A series of 20 study datasets were created from this pool to mimic 20 separate case-control studies, effectively implementing a bootstrap. All records with head injury (cases) were included in each dataset, while an equal number of non-head injury records (controls) were selected randomly from the pool without replacement. Thus, the controls were unique within each study dataset. No matching procedures were performed. Training and testing sets were created from each study dataset by selecting records at a sampling fraction of 0.50 without replacement. Thus, training and testing sets were equal in size (N=415) and mutually exclusive. In addition, head injury-positive and negative cases were distributed as equally as possible (N=207 and 208, respectively) in the training and testing sets. The data were used in their native coding for the comparison studies with LR and C4.5.
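The sampling design just described can be sketched as follows, using hypothetical record dictionaries with a binary `head_injury` field. This is a simplified illustration: the paper additionally balances outcome classes across the two halves, which this split does not enforce, and the encoding helper shown here anticipates the wild-card treatment of missing values used in the EpiCS trials:

```python
import random

def make_study_dataset(pool, rng):
    """One case-control dataset: all head-injury cases plus an equal number
    of controls sampled without replacement from the remaining records."""
    cases = [r for r in pool if r["head_injury"] == 1]
    controls = rng.sample([r for r in pool if r["head_injury"] == 0], len(cases))
    return cases + controls

def split_half(dataset, rng):
    """Split a study dataset into mutually exclusive training and testing
    sets at a sampling fraction of 0.50."""
    shuffled = dataset[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def encode_record(values):
    """Encode already-binarized field values as a ternary string, mapping
    missing values (None) to the wild card '*' to preserve their semantics."""
    return "".join('*' if v is None else str(v) for v in values)
```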
For the EpiCS trials, the data were encoded as bit strings using the ternary alphabet described above, with missing values coded as "wild cards" to preserve their original semantics.

Comparison methods

Two methods were used for comparison with the performance of EpiCS in classifying novel data.

Logistic regression. Logistic regression (LR) is commonly used in clinical research to create and validate prediction models from data, and is therefore an excellent method against which to compare EpiCS's prediction performance. To create the 20 logistic models for comparison, all 47 variables from each training set were entered into a separate stepwise logistic regression using forward stepping with a relaxed entry criterion (p to enter=0.95). Of the 47 candidate terms stepped in, between 10 and 12 were found to be at least marginally significant (p<0.07). Eleven variables were dropped due to sparse cell sizes, while the remaining variables were not statistically significant. The resulting prediction model was applied to the cases in the testing set to obtain an estimate of risk of outcome for each, and the area under the receiver operating characteristic curve (AUC) was calculated.

C4.5. A well-known program that creates decision trees, C4.5 [2] was chosen as the second comparison method. All 47 variables were used to create a decision tree from each of the 20 training sets. The tree was then used to classify the cases in the corresponding testing set. The cross-validation procedure built into the software was used to optimize the final decision tree, using a total of 10 blocks. Subsequently, the optimized tree was used to create sets of IF..THEN rules using the C4.5RULES procedure, which were in turn applied to the corresponding testing sets to ascertain their classification performance using the AUC.

Experimental procedure: EpiCS

The population in EpiCS was initialized with 5,000 randomly generated classifiers.
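Random initialization can be sketched as follows. The initial strength value and the uniform choice over {0, 1, *} are illustrative assumptions, and the taxon length is determined by the bit encoding of the study variables rather than the value used here:

```python
import random
from dataclasses import dataclass

@dataclass
class Classifier:
    taxon: str
    action: int
    strength: float

def init_population(size, taxon_length, rng, initial_strength=100.0):
    """Fill the population array with randomly generated ternary classifiers."""
    return [
        Classifier(
            taxon="".join(rng.choice("01*") for _ in range(taxon_length)),
            action=rng.choice([0, 1]),
            strength=initial_strength,
        )
        for _ in range(size)
    ]
```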
This population size was found empirically to provide the best performance in terms of classification accuracy and ability to classify cases. The EpiCS system was trained over a series of iterations, with a case drawn randomly from the training set and presented to the system at each iteration. As these cases were drawn with replacement, an individual training case could be presented to the system many times during a training phase, which was defined as 100 iterations.

At the 0th and every 100th iteration thereafter, the system moved from the training phase to the interim evaluation phase. During this phase, the learning ability of EpiCS was evaluated by presenting the taxon of every case in the training set to the system for classification. Since the purpose of the interim evaluation phase was the evaluation of the state of learning of EpiCS, the reinforcement and discovery components were disabled for its duration. The decision advocated by EpiCS for a given case was compared to the action bit of that case to determine the type of decision made by EpiCS. The decision type was classified into one of four categories: true positive, true negative, false positive, and false negative; these were tallied for all training cases. The evaluation metrics (sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and the indeterminant rate (IR), or the proportion of cases that could not be classified) were calculated and written to a file for analysis. The IR was used to correct the other metrics using the following equation:

    Corrected Metric = Metric / (1 - IR)

For example, an uncorrected metric of 0.85 with an IR of 0.05 would yield a corrected value of 0.85/0.95, or approximately 0.89.

After the completion of the designated number of iterations of the training epoch, EpiCS entered the testing epoch, in which the final learning state of the system was evaluated using every case in the testing set, presented in sequence. As in the interim evaluation phase, the reinforcement component and the genetic algorithm were disabled during the testing phase. At the completion of the testing phase, the evaluation metrics were calculated and written to a file for analysis, as was done during the interim evaluations. The entire cycle of training and testing comprised a single trial; a total of 20 trials were performed, one for each of the 20 study datasets.

RESULTS

The performance of EpiCS during the training phase is illustrated in Figure 2, in which the AUC and IR obtained during each 100th iteration are plotted over time. As can be seen in this figure, EpiCS converged quickly (within 3,000 iterations) to an accurate classification of the training set (AUC of 0.94). In addition, the IR decreased correspondingly with the increase in AUC. At the beginning of the training phase, there were 5,000 unique classifiers in the population. By the end of the training phase, this was reduced to 2,314 unique classifiers. This level of reduction (53.7%) in unique classifiers indicated that generalization took place. These classifiers were in turn used by EpiCS to predict the class membership (head injury/no head injury) of data in the testing set. In comparison, a total of 11 rules were created by C4.5, most of these containing single conjuncts.

[Figure 2. Performance of EpiCS during training: AUC and indeterminant rate plotted at each 100th iteration over 10,000 iterations.]

The performance of EpiCS on testing with unseen cases is shown in Table 1, compared with the results obtained with logistic regression and C4.5. EpiCS performed significantly better in classifying unseen cases than either of the two comparison methods (p<0.001).

DISCUSSION

This investigation focused on the application of an LCS, EpiCS, to the discovery of prediction models in a large prospective injury surveillance database.

Table 1. Area under the receiver operating characteristic curve (AUC) obtained on testing with unseen cases.
AUCs were averaged over the 20 case-control studies; one standard deviation is shown in parentheses.

    EpiCS                 0.97 (0.04)
    Logistic Regression   0.74 (0.03)
    C4.5                  0.79 (0.04)

By itself, EpiCS's performance was excellent, both in rapidly and accurately learning the models during its training phase and in classifying novel data in the testing phase. However, it is in comparison with LR and C4.5 that the superiority of EpiCS, as applied to these data, is demonstrated. There are several possible reasons for this. First, EpiCS is not bound by the statistical assumptions that constrain LR; for example, it is not hampered by multicollinearity. Furthermore, interactions between two or more variables do not need to be accounted for explicitly in the EpiCS model, as they are simply a part of the individual classifier's representation. Second, LR provides a single-rule model based on statistical significance criteria set arbitrarily by the investigator. Terms that fail to meet these criteria are not included in the model, and their potential contribution is never known as a result. Both EpiCS and C4.5 provide a suite of rules in their models, and these can be used to form the nucleus of a knowledge-based system. Third, the models of both LR and C4.5 are applied monolithically, in that one never knows when novel data cannot be classified. No indeterminant rate function is available with these approaches, whereas one is integrated into EpiCS, providing critical information about the robustness of the model. Finally, the rules derived by C4.5 are often overly sparse; this was clearly evident with the PCPS data, which are extremely rich in subtle patterns and rule interactions that are lost to C4.5's decision tree pruning procedures. EpiCS, on the other hand, provided a very large number of rules which, although showing strong evidence of generalization, would benefit from judicious pruning.
The number of rules in the macrostate population militates against using this knowledge base as a parsimonious source of rules that could be imported into another knowledge-based system. Although the content of the knowledge bases evolved by EpiCS and created by C4.5 is not a focus of this investigation, it is a line of research we are currently undertaking.

CONCLUSION

A new system for discovering prediction models was developed and applied to a large prospective injury surveillance database. This system, EpiCS, borrows from an evolutionary computation paradigm, the learning classifier system, which incorporates reinforcement learning and genetic algorithm-driven discovery in the context of a rule-based knowledge system. This investigation is the first reported use of an LCS to discover prediction models in this type of data. We hope to apply EpiCS to larger, more complex surveillance databases and to compare its performance in these domains with a larger array of methods, including k-nearest neighbors and naive Bayes classifiers. In addition, we plan to expand the knowledge representation to include integer and real-number data, thus obviating the need for potentially costly encoding and decoding procedures. Finally, we are investigating methods for pruning the rules contained in the macrostate population after training, with the goal of making the prediction model more parsimonious without losing efficiency or accuracy.

Acknowledgement. This project was funded by The State Farm Insurance Companies.

REFERENCES

1. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press; 1996.
2. Quinlan JR. C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann; 1992.
3. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley; 1989.
4. Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: The MIT Press; 1993.
5. Holland JH. Adaptation in Natural and Artificial Systems. Cambridge, MA: The MIT Press; 1992.
6. Holmes JH, Winston FK, Durbin DR, Bhatia E, Arbogast K, Werner J. The Partners for Child Passenger Safety Project: An Information Infrastructure for Injury Surveillance. In: Chute CG, editor. Proceedings of the Fall Symposium of the American Medical Informatics Association, November 1998. Philadelphia: Hanley and Belfus; p. 1016.
7. Durbin DR, Winston FK, Bhatia E, et al. Partners for Child Passenger Safety: A unique child-specific crash surveillance system. Accident and Injury Prevention, accepted for publication.
8. Association for the Advancement of Automotive Medicine. The Abbreviated Injury Scale, 1990 Revision. Des Plaines, IL: The Association; 1990.