Discovery of Predictive Models in an Injury Surveillance Database:
An Application of Data Mining in Clinical Research
John H. Holmes, Ph.D.1, Dennis R. Durbin, M.D., M.S.1,
Flaura K. Winston, M.D., Ph.D.2
1University of Pennsylvania Medical Center, Philadelphia, PA
2The Children’s Hospital of Philadelphia, Philadelphia, PA
ABSTRACT
A new, evolutionary computation-based approach to
discovering prediction models in surveillance data
was developed and evaluated. This approach was
operationalized in EpiCS, a type of learning classifier
system specially adapted to model clinical data. When applied to a large, prospective injury surveillance database, EpiCS quickly created accurate, highly robust predictive models, classifying >99% of cases early in training. After training, EpiCS classified novel
data more accurately (p<0.001) than either logistic
regression or decision tree induction (C4.5), two
traditional methods for discovering or building
predictive models.
INTRODUCTION
Data mining can be defined as the methods and
processes used to perform knowledge discovery,
which in turn can be defined as the identification of
meaningful and useful patterns in databases.1 While
these patterns may be of interest for such tasks as
hypothesis generation, an especially important role of
data mining is the discovery of models that predict the
class membership of previously unseen “cases” or
database records. These predictive models are in
common use throughout many clinical domains for
diagnosis, risk assessment, and determining
appropriateness of tests and interventions.
In addition, predictive models may be used as a knowledge base in a decision support system.
Traditional methods of building predictive
models include logistic and other regression
procedures applied to population-based data. While
often effective, the process for deriving models in this
way can be cumbersome, and is often hampered by
pre-existing investigator biases as to the candidate
variables to be entered into an exploratory model.
Even before this, however, it is often tedious to derive
a set of candidate variables if the data are composed
of many records (cases) and/or fields, which can
cause delays in the model building process. Finally,
sparse distributions of variables in the data will affect
the successful building of a model. For example,
missing data on even one candidate term will cause an
entire case to be discarded from a model. Solutions to
this problem have included imputation, an arguably
controversial method, or simply excluding the
variable from the list of candidates. These issues
argue for a different approach to developing
predictive models.
The domain of knowledge discovery in
databases (KDD) is rich with such approaches. One
in particular, decision tree induction, embodied in the
software C4.52, has been used extensively and
successfully to discover prediction models in data.
Unfortunately, this approach has several faults that
argue against its use in large clinical databases. For
example, conflicting or contradictory data, poorly
separated classes, and small cell sizes for specific
attributes are not adequately addressed by this
approach. Such problems are common in clinical
data, thereby causing one to continue to look for yet
another approach to building predictive models.
One such approach is found in evolutionary
computation, which uses one or more techniques that
reflect a Darwinistic metaphor. In the evolutionary
computation paradigm, the overarching goal is to
evolve solutions to problems.
This has been accomplished with genetic algorithms3, genetic
programming4, and learning classifier systems5. The
learning classifier system (LCS) is an especially
attractive evolutionary computation method for
prediction model discovery in that it integrates a
knowledge base as part of its design, making it
something like a “learning expert system.” We have
embraced this approach, modifying the LCS paradigm
to facilitate its use in large databases for
epidemiologic surveillance. Specifically, we applied
a new LCS, called EpiCS, to the domain of
population-based head injury surveillance.
This investigation reports on this application and its
success at discovering predictive models compared to
logistic regression and decision tree induction.
METHODS
System: EpiCS
Knowledge representation. Like other learning
classifier systems, EpiCS’s knowledge representation
scheme focuses on condition-action rules, called
classifiers. The condition side of a classifier is
referred to as a taxon, while the action side is simply
its action. Each classifier has a strength, which
indicates its accuracy relative to others in the
population. Typically, classifiers are encoded in bit
strings, such that integer and real numbers are
represented in base-2 notation. This notation is
extended to add a third character to the alphabet, the
“*”, which represents a “wild card” that can take the
value of either 0 or 1. This representation facilitates
the operations of the genetic algorithm, discussed
below. Classifiers are held in a static array of
constant size, called a population, which is the
knowledge base of EpiCS. When EpiCS is initialized,
the population is filled with a predetermined number
of randomly generated classifiers.
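As a concrete illustration, the following Python sketch shows one plausible encoding of classifiers and the wild-card matching rule. The dictionary fields, the taxon length of 16, and the initial strength are illustrative assumptions, not details taken from the EpiCS source.

```python
import random

ALPHABET = "01*"  # the ternary alphabet: 0, 1, and the "wild card" *

def random_classifier(taxon_length, initial_strength=10.0):
    # One classifier: a ternary taxon (condition side), an action bit,
    # and a strength reflecting its accuracy relative to its peers.
    taxon = "".join(random.choice(ALPHABET) for _ in range(taxon_length))
    action = random.choice("01")
    return {"taxon": taxon, "action": action, "strength": initial_strength}

def matches(taxon, case_bits):
    # A taxon matches an input bit string when every position agrees
    # exactly or is a wild card.
    return all(t == b or t == "*" for t, b in zip(taxon, case_bits))

# At initialization, the population is filled with randomly generated
# classifiers (the experiments below use 5,000).
population = [random_classifier(taxon_length=16) for _ in range(5000)]
```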
The unique classifiers in the population are
referred to as macrostate classifiers. Each macrostate
classifier represents a separate rule. At initialization,
the population is composed of many unique
classifiers; however, because they were randomly
generated, these classifiers cannot be expected to
represent plausible hypotheses for a given problem.
As the system learns, the classifiers in the population
will be increasingly refined, to the point where, out of
an entire population, a relatively small set of unique
classifiers will emerge to define the system’s
knowledge base. The change in composition of the
population from highly unique classifiers to
increasingly numerous instances of non-unique, highly
general, classifiers reflects a fundamental process of
inductive learning: generalization. Generalization both reduces the resources required to search the population for matching classifiers and offers the potential for increased accuracy when the system is applied to novel cases from the testing set.
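Continuing the sketch above, the macrostate can be computed by counting unique (taxon, action) pairs; tracking this count over training gives a simple measure of generalization.

```python
from collections import Counter

def macrostate(population):
    # Each unique (taxon, action) pair is one macrostate classifier;
    # its multiplicity grows as generalization concentrates the
    # population on fewer, more general rules.
    return Counter((c["taxon"], c["action"]) for c in population)

unique_rules = macrostate(population)
print(f"{len(unique_rules)} unique classifiers in a population of "
      f"{len(population)}")
```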
Components. There are three functional components to EpiCS: performance, reinforcement, and discovery (Figure 1). The performance component creates a subset of all classifiers in the population whose taxa match a stream of data
received as input from the environment. In this way,
the performance component is analogous to a forward-chaining rule-based system. All classifiers matching the input taxon comprise a Match Set [M],
even though some of these classifiers may advocate
different actions. The process is equivalent to the
triggering of rules, and [M] is analogous to an agenda
in an expert system. From [M], the classifier with the
highest strength is selected. The action of this
classifier is then used as the output of the system; this
process is analogous to the firing of a rule in an expert
system. When EpiCS is used to classify cases, as
would be done during testing, the operation cycle
stops here. During training, however, two additional
components are used by EpiCS to reinforce classifiers
according to their performance and to discover new,
yet plausible, classifiers as a form of hypothesis
generation.
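A minimal sketch of this performance cycle follows, reusing `matches` and `population` from the earlier fragment. Returning the action of the single highest-strength classifier follows the description above; other LCS variants use strength-weighted voting instead, so this is one reading of the design.

```python
def performance_cycle(population, case_bits):
    # Form the match set [M]: all classifiers whose taxa match the input.
    match_set = [c for c in population if matches(c["taxon"], case_bits)]
    if not match_set:
        return None, []  # no match: the case cannot be classified
    # "Fire" the single strongest classifier and output its action.
    best = max(match_set, key=lambda c: c["strength"])
    return best["action"], match_set
```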
Figure 1. Schematic diagram of EpiCS. (Input flows to the performance component, which produces the system's output; the reinforcement and discovery components both operate on the classifier population.)
In supervised learning, the true classification of
a training case is known to the system, and this
information is used by the reinforcement component
in adjusting the strengths of all classifiers in the
system according to the following scheme. First, a
Correct Set [C] is created from the classifiers in [M]
that have action bits matching the decision output by
the system, and the remaining classifiers in [M] form
the set [notC]. This assumes that the decision
advocated by the system is correct; if the decision was
not correct, then only [notC] is formed. Next, a tax is
applied to [C], reducing the strength of each classifier
in [C] by 10 percent. The purpose of this tax is to
inhibit premature convergence: the accurate classifiers
in [C] at one time step may not be accurate at another.
Often, this premature convergence is due to overly
general classifiers in the population. The tax helps to
“smooth” the asymptotic ascent to an accurate, yet
optimally general, population of classifiers.
A reward, R, is evenly distributed among the classifiers
in [C]. R is adjusted so that a higher fraction is
apportioned to more general classifiers. The strength
of each classifier in [notC] is diminished
proportionally by a penalty, typically 50%. The effect
of this reward scheme is to exert some degree of
selection pressure on the population, such that
classifiers are chosen in the discovery component for
reproduction based on their strength proportional to
other classifiers in the population.
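The reinforcement scheme might be sketched as follows, continuing the earlier fragments. The paper states only that more general classifiers receive a higher fraction of R; weighting reward shares by the fraction of wild cards is an assumption made here for concreteness.

```python
REWARD = 1.0    # R, the total reward shared by the correct set [C]
TAX = 0.10      # the 10% tax applied to [C]
PENALTY = 0.50  # the 50% penalty applied to [notC]

def generality(c):
    # Fraction of wild cards in the taxon.
    return c["taxon"].count("*") / len(c["taxon"])

def reinforce(match_set, true_action):
    correct = [c for c in match_set if c["action"] == true_action]
    not_correct = [c for c in match_set if c["action"] != true_action]
    if correct:
        # Generality-weighted shares are an assumption; the paper says
        # only that more general classifiers get a higher fraction of R.
        weights = [1.0 + generality(c) for c in correct]
        total = sum(weights)
        for c, w in zip(correct, weights):
            c["strength"] *= (1.0 - TAX)           # convergence tax
            c["strength"] += REWARD * (w / total)  # share of the reward
    for c in not_correct:
        c["strength"] *= (1.0 - PENALTY)           # proportional penalty
```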
The discovery component employs the genetic
algorithm, which is a method of optimization and
discovery predicated on Darwinian evolution. The
role of the genetic algorithm in EpiCS is to discover
new classifiers by applying genetic operators such as
reproduction, crossover, and mutation to the strongest
classifiers in the population. The newly formed
classifiers “inherit” traits from those that are strongest
(their “parents”), yet they contain different “genetic material” obtained via crossover and mutation. Since
the population is steady-state, an equal number of
classifiers, typically the weakest, are deleted to make
room for the new, stronger ones. If these new
classifiers prove accurate, their strengths will increase
over time, and they will subsequently be selected for
reproduction. Over time, the strongest, most accurate
classifiers will prevail at the expense of the weakest,
least accurate; hence the Darwinian metaphor.
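A sketch of one discovery step follows, again continuing the earlier fragments. The mutation rate and the initialization of the offspring's strength as its parents' average are typical genetic-algorithm choices assumed here, not parameters reported in the paper.

```python
def select_parent(population):
    # Roulette-wheel selection: probability proportional to strength.
    total = sum(c["strength"] for c in population)
    pick = random.uniform(0.0, total)
    running = 0.0
    for c in population:
        running += c["strength"]
        if running >= pick:
            return c
    return population[-1]

def crossover(p1, p2):
    # Single-point crossover of the parents' taxa.
    point = random.randrange(1, len(p1["taxon"]))
    return {"taxon": p1["taxon"][:point] + p2["taxon"][point:],
            "action": random.choice([p1["action"], p2["action"]]),
            "strength": (p1["strength"] + p2["strength"]) / 2.0}

def mutate(c, rate=0.01):
    # With small probability, replace a position by a different symbol.
    c["taxon"] = "".join(
        random.choice(ALPHABET.replace(ch, "")) if random.random() < rate
        else ch
        for ch in c["taxon"])

def discovery_step(population):
    # Steady-state replacement: one offspring in, the weakest out.
    child = crossover(select_parent(population), select_parent(population))
    mutate(child)
    population.remove(min(population, key=lambda c: c["strength"]))
    population.append(child)
```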
Source of data
The data for this investigation were obtained
from the Partners for Child Passenger Safety (PCPS),
a five-year investigation into child occupant
protection in automobile crashes6,7. Funded by the
State Farm Insurance Companies, the research is
being conducted at The Children’s Hospital of
Philadelphia and the University of Pennsylvania
Medical Center. The goal of the PCPS project is to
identify ways to reduce the morbidity and mortality of
children involved in automobile crashes. We address
this goal through a multidisciplinary approach that
incorporates clinical researchers, epidemiologists,
biomechanical engineers, automotive engineers, and
informaticians to identify a spectrum of modifiable
risk factors for pediatric injury in crash events.
The PCPS project uses State Farm Insurance
Companies claims data from 15 states and the District
of Columbia on automobile crashes involving at least
one child less than 16 years of age. Approximately
30,000 such claims are received each year at the
University of Pennsylvania from State Farm. Of
these, 20% are subjected to a telephone interview to
obtain more specific details about the crash and any
injuries incurred by children involved in the crash.
These two sources of data, claims records and
telephone interview, contribute to the richness of the
PCPS surveillance database, which is reflected in the
number of variables (over 500), as well as in the large
number of records. This investigation focused solely
on 47 numeric variables which were selected from a
number of “modules” defined by their function.
These included passenger restraint (characteristics of
any restraint devices and their usage, pertaining to an
individual passenger); crash (characteristics of the
crash event, such as point of impact, type of object
struck, and estimated speed); kinematics (ejection
from vehicle, contact with surfaces such as windows
or dashboard, and evidence of occupant motion during
the crash); and demographics (age and injury-predisposing factors such as physical disability). A
dichotomously-coded element was used to indicate the
presence or absence of head injury defined as
“serious” or worse by the Abbreviated Injury Scale
(AIS)8.
Data preparation
A total of 8,334 records comprised the pool of
data to be mined. A series of 20 study datasets was created from this pool to mimic 20 separate case-control studies, effectively implementing a bootstrap.
All records with head injury (cases) were included in
each dataset, while an equal number of non-head
injury records (controls) were selected randomly from
the pool without replacement. Thus, the controls were
unique within each study dataset. No matching
procedures were performed. Training and testing sets
were created from each study dataset by selecting
records at a sampling fraction of 0.50 without
replacement. Thus, training and testing sets were
equal in size (N=415) and mutually exclusive. In
addition, head injury-positive and negative cases were
distributed as equally as possible (N=207 and 208,
respectively) in the training and testing sets.
The data were used in their native coding for the
comparison studies with LR and C4.5. For the EpiCS
trials, the data were encoded as bit strings using the
ternary alphabet described above, with missing values
coded as “wild cards” to preserve their original semantics.
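The dataset construction and encoding might look like the following sketch. The helper names are hypothetical, and the stratified half-split is inferred from the statement that positive and negative cases were distributed as equally as possible across the training and testing sets.

```python
import random

def encode_value(value, width):
    # Base-2 encoding of a numeric field; a missing value becomes a run
    # of wild cards so that it matches any input.
    return "*" * width if value is None else format(value, f"0{width}b")

def split_half(records):
    # Sampling fraction of 0.50 without replacement: two mutually
    # exclusive halves of equal size.
    shuffled = random.sample(records, len(records))
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

def make_study_dataset(cases, control_pool):
    # One mimicked case-control study: every head-injury case plus an
    # equal number of controls drawn without replacement, split into
    # stratified training and testing halves.
    controls = random.sample(control_pool, len(cases))
    train_cases, test_cases = split_half(cases)
    train_controls, test_controls = split_half(controls)
    return train_cases + train_controls, test_cases + test_controls

# Twenty study datasets drawn from the 8,334-record pool:
# studies = [make_study_dataset(cases, control_pool) for _ in range(20)]
```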
Comparison methods
Two methods were used to compare the
performance of EpiCS in classifying novel data.
Logistic regression. Logistic regression (LR) is
commonly used in clinical research to create and
validate prediction models from data, and is therefore
an excellent method with which to compare EpiCS’s prediction performance. To create the 20 logistic models for comparison, all 47 variables from each training set were entered into a separate stepwise logistic regression using forward stepping with a relaxed entry criterion (p to enter = 0.95). Of the 47
candidate terms stepped in, between 10 and 12 were
found to be at least marginally significant (p<0.07).
Eleven variables were dropped due to sparse cell
sizes, while the remaining variables were not
statistically significant.
The resulting prediction model was applied to the cases in the testing set to
obtain an estimate of risk of outcome for each, and the
area under the receiver operating characteristic curve
(AUC) was calculated.
C4.5. A well-known program that creates decision
trees, C4.52 was chosen as the second comparison
method. All 47 variables were used to create a
decision tree from each of the 20 training sets. The
tree was then used to classify the cases in the
corresponding testing set.
The cross-validation procedure built into the software was used to optimize the final decision tree, using a total of 10 blocks.
Subsequently, the optimized tree was used to create
sets of IF..THEN rules using the C4.5RULES
procedure, which were in turn applied to the
corresponding testing sets to ascertain their
classification performance using the AUC.
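For readers who wish to approximate the comparison arms today, the sketch below uses scikit-learn, which offers neither stepwise logistic regression nor C4.5 itself; plain logistic regression and a CART-style decision tree stand in, so results would not exactly reproduce the procedures described here.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score

def test_set_auc(model, X_train, y_train, X_test, y_test):
    # Fit on the training half, estimate risk on the testing half,
    # and compute the area under the ROC curve (AUC).
    model.fit(X_train, y_train)
    risk = model.predict_proba(X_test)[:, 1]
    return roc_auc_score(y_test, risk)

# auc_lr = test_set_auc(LogisticRegression(max_iter=1000), ...)
# auc_tree = test_set_auc(DecisionTreeClassifier(), ...)
```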
Experimental procedure: EpiCS
The population in EpiCS was initialized with
5,000 randomly generated classifiers. This population
size was found empirically to provide the best
performance in terms of classification accuracy and
ability to classify cases. The EpiCS system was
trained over a series of iterations, with a case drawn
randomly from the training set and presented to the
system at each iteration. As these cases were drawn
with replacement, an individual training case could be
presented to the system many times during a training
phase, which was defined as 100 iterations.
At the 0th and every 100th iteration thereafter,
the system moved from the training phase to the
interim evaluation phase. During this phase, the
learning ability of EpiCS was evaluated by presenting
the taxon of every case in the training set to the
system for classification. Since the purpose of the
interim evaluation phase was the evaluation of the
state of learning of EpiCS, the reinforcement and
discovery components were disabled for its duration.
The decision advocated by EpiCS for a given case
was compared to the action bit of that case to
determine the type of decision made by EpiCS. The
decision type was classified in one of four categories:
true positive, true negative, false positive, and false
negative; these were tallied for all training cases. The evaluation metrics were then calculated and written to a file for analysis: sensitivity, specificity, area under the receiver operating characteristic curve (AUC), and the indeterminant rate (IR, the proportion of cases that could not be classified). The IR was used to correct the other metrics using the following equation:

    Corrected Metric = Metric / (1 − IR)
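A sketch of the tallying and correction follows. It assumes that indeterminant cases count against the raw denominators, so that dividing by 1 − IR renormalizes each metric to the classifiable fraction; the paper does not spell out this detail, so the denominator handling is an assumption.

```python
def corrected(metric, ir):
    # The paper's correction: Corrected Metric = Metric / (1 - IR).
    return metric / (1.0 - ir) if ir < 1.0 else 0.0

def interim_evaluation(decisions, truths):
    # decisions are "1", "0", or None (indeterminant); truths are "1"/"0".
    tp = tn = fp = fn = ind_pos = ind_neg = 0
    for d, t in zip(decisions, truths):
        if d is None and t == "1":
            ind_pos += 1
        elif d is None:
            ind_neg += 1
        elif d == "1" and t == "1":
            tp += 1
        elif d == "0" and t == "0":
            tn += 1
        elif d == "1" and t == "0":
            fp += 1
        else:
            fn += 1
    ir = (ind_pos + ind_neg) / len(truths)
    # Raw metrics are tallied over every presented case, so an
    # unclassified case counts against the denominator (an assumption);
    # the correction then renormalizes by the classifiable fraction.
    sens = tp / (tp + fn + ind_pos) if (tp + fn + ind_pos) else 0.0
    spec = tn / (tn + fp + ind_neg) if (tn + fp + ind_neg) else 0.0
    return {"IR": ir, "sensitivity": corrected(sens, ir),
            "specificity": corrected(spec, ir)}
```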
After the completion of the designated number
of iterations of the training epoch, EpiCS entered the
testing epoch, in which the final learning state of the
system was evaluated using every case in the testing
set, presented in sequence. As in the interim
evaluation phase, the reinforcement component and the genetic algorithm were disabled during the testing
phase. At the completion of the testing phase, the
evaluation metrics were calculated and written to a
file for analysis, as was done during the interim
evaluations. The entire cycle of training and testing
comprised a single trial; a total of 20 trials were
performed for each of the 20 study datasets.
RESULTS
The performance of EpiCS during the training
phase is illustrated in Figure 2, in which the AUC and
IR obtained during each 100th iteration are plotted
over time. As can be seen in this figure, EpiCS
converged quickly (within 3,000 iterations) to an
accurate classification of the training set (AUC ≈ 0.94).
In addition, the IR decreased correspondingly with the
increase in AUC.
At the beginning of the training phase, there
were 5,000 unique classifiers in the population. By
the end of the training phase, this was reduced to
2,314 unique classifiers. This level of reduction
(53.7%) in unique classifiers indicated that
generalization took place. These classifiers were in
turn used by EpiCS to predict the class membership
(head injury/no head injury) of data in the testing set.
In comparison, a total of 11 rules were created by C4.5, most of these containing single conjuncts.

Figure 2. Performance of EpiCS during training. (AUC and indeterminant rate plotted at every 100th iteration over 10,000 training iterations; the AUC rises as the indeterminant rate falls.)
The performance of EpiCS on testing with
unseen cases is shown in Table 1, compared with the
results obtained with logistic regression and C4.5.
EpiCS performed significantly better in classifying
unseen cases than either of the two comparison
methods (p<0.001).
DISCUSSION
This investigation focused on the application of
an LCS, EpiCS, to the discovery of prediction models
in a large prospective injury surveillance database.
Table 1. Area under the receiver operating characteristic curve (AUC) obtained on testing with unseen cases. AUCs were averaged over the 20 case-control studies; one standard deviation is shown in parentheses.

EpiCS                0.97 (0.04)
Logistic Regression  0.74 (0.03)
C4.5                 0.79 (0.04)
By itself, EpiCS’s performance was excellent in both
rapidly and accurately learning the models during its
training phase and in classifying novel data in the
testing phase. However, it is in comparison with LR
and C4.5 that the superiority of EpiCS, as applied to
these data, is demonstrated. There are several
possible reasons for this.
First, EpiCS is not bound by the statistical
assumptions that constrain LR. For example, EpiCS
is not hampered by multicollinearity. Furthermore,
interactions between two or more variables do not
need to be accounted for in the EpiCS model, as they
are simply a part of the individual classifier’s
representation.
Second, LR provides a single-rule model based on a statistical significance criterion set arbitrarily by the investigator. Terms that fail to meet statistical
significance criteria are not included in the model, and
their potential contribution is never known as a result.
Both EpiCS and C4.5 provide a suite of rules in their
models, and these can be used to form the nucleus of a
knowledge-based system.
Third, the models of both LR and C4.5 are
applied monolithically, in that one never knows when
novel data cannot be classified. There is no indeterminant rate function available with these
approaches, whereas this is integrated into EpiCS,
providing critical information about the robustness of
the model.
Finally, the rules derived by C4.5 are often
overly sparse; this was clearly evident in the PCPS
data, which is extremely rich in subtle patterns and
rule interactions that are lost due to C4.5’s decision
tree pruning procedures. EpiCS, on the other hand,
provided a very large number of rules which, although
showing strong evidence of generalization, would
benefit from judicious pruning. The number of rules
in the macrostate population militates against using
this knowledge base as a parsimonious source of rules
that could be imported into another knowledge-based
system. Although the content of the knowledge bases
evolved by EpiCS and created by C4.5 is not a focus
of this investigation, this is a line of research we are
currently undertaking.
CONCLUSION
A new system for discovering prediction models
was developed and applied to a large prospective
injury surveillance database. This system, EpiCS,
borrows from an evolutionary computation paradigm,
the learning classifier system, which incorporates
reinforcement learning and genetic algorithm-driven
discovery in the context of a rule-based knowledge
system. This investigation is the first reported use of
an LCS to discover prediction models in this type of
data.
We hope to apply EpiCS to larger, more
complex surveillance databases and to compare its
performance on these domains with a larger array of
methods, including k-nearest neighbors and naive
Bayes classifiers. In addition, we plan to expand the
knowledge representation to include integer and real-number data, thus obviating the need for potentially
costly encoding and decoding procedures. Finally, we
are investigating methods for pruning the rules
contained in the macrostate population after training,
with the goal of making the prediction model more
parsimonious without losing efficiency or accuracy.
Acknowledgement. This project was funded by The
State Farm Insurance Companies.
REFERENCES
1. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press, 1996.
2. Quinlan JR. C4.5: Programs for Machine Learning. San Francisco: Morgan Kaufmann, 1992.
3. Goldberg DE. Genetic Algorithms in Search, Optimization, and Machine Learning. New York: Addison-Wesley, 1989.
4. Koza JR. Genetic Programming: On the Programming of Computers by Means of Natural Selection. Cambridge, MA: The MIT Press, 1993.
5. Holland JH. Adaptation in Natural and Artificial Systems. Cambridge, MA: The MIT Press, 1992.
6. Holmes JH, Winston FK, Durbin DR, Bhatia E, Arbogast K, Werner J. The Partners for Child Passenger Safety Project: An Information Infrastructure for Injury Surveillance. In: Chute CG, editor. Proceedings of the Fall Symposium of the American Medical Informatics Association, November 1998. Philadelphia: Hanley and Belfus, p. 1016.
7. Durbin DR, Winston FK, Bhatia E, et al. Partners for Child Passenger Safety: A unique child-specific crash surveillance system. Accident and Injury Prevention, accepted for publication.
8. Association for the Advancement of Automotive Medicine. The Abbreviated Injury Scale, 1990 Revision. Des Plaines, IL: The Association, 1990.