Dental Data Mining: Potential Pitfalls and Practical Issues

S.A. Gansky
Center for Health and Community, Center to Address Disparities in Children's Oral Health, Department of Preventive and Restorative Dental Sciences, Division of Oral Epidemiology and Dental Public Health, University of California, San Francisco, CA 94143-1361, USA; [email protected]

Adv Dent Res 17:109-114, December, 2003
Abstract — Knowledge Discovery and Data Mining (KDD) have
become popular buzzwords. But what exactly is data mining?
What are its strengths and limitations? Classic regression,
artificial neural network (ANN), and classification and regression
tree (CART) models are common KDD tools. Some recent reports
(e.g., Kattan et al., 1998) show that ANN and CART models can
perform better than classic regression models: CART models
excel at covariate interactions, while ANN models excel at
nonlinear covariates. Model prediction performance is examined with validation procedures and by evaluating concordance, sensitivity, specificity, and likelihood ratios. To aid
interpretation, various plots of predicted probabilities are
utilized, such as lift charts, receiver operating characteristic
curves, and cumulative captured-response plots. A dental caries
study is used as an illustrative example. This paper compares the
performance of logistic regression with KDD methods of CART
and ANN in analyzing data from the Rochester caries study.
With careful analysis, such as validation with sufficient sample size and the use of proper competitors, the problems of naïve KDD analyses (Schwarzer et al., 2000) can be avoided.
Introduction
Informatics, in general, and dental informatics, in particular,
are disciplines encompassing a variety of research areas,
from molecular biology to library science to public health
surveillance. Many dental informatics application areas
utilize knowledge discovery and data mining (KDD)—semiautomatic pattern, association, anomaly, and statistically
significant structure discovery in data (Fayyad et al., 1996, p. 6).
KDD operates at the intersection of artificial intelligence,
machine learning, computer science, engineering,
and statistics. KDD has been named a Top Ten emerging
technology that will change the world (Waldrop, 2001).
However, KDD is not alchemy—it does not turn lead into gold
(i.e., bad data or flawed study designs into incredible, novel
insights)—but rather KDD is a discipline using modern
computing tools to solve problems.
In current business applications, KDD touches lives daily
when customers swipe supermarket savings cards, sending
buying habits to data warehouses. This provided retailers the
(apocryphal?) data mining discovery: diapers and beer sharing
men's late-night supermarket baskets. In the future, similar
encounters in clinicians' offices might collect health information
in data warehouses (subject to patient confidentiality protections), which can be mined to identify at-risk patients and
better treatment modalities. Such possibilities are gradually
becoming reality (e.g., Page et al., 2002). Some potential oral
health applications for KDD include: large surveys (e.g.,
NHANES), longitudinal cohort studies (e.g., Veterans
Administration Longitudinal Study on Aging), disease registries
(e.g., National Cancer Institute's Surveillance, Epidemiology and
End Results [SEER] program; birth defects registry; craniofacial
treatment outcomes registry), health services research (e.g.,
claims data, fraud detection), provider and workforce databases, digital diagnostics (e.g., radiology, microbiology), and molecular biology (e.g., polymerase chain reactions, microarrays).

KDD Methods
KDD learning methods (see the Glossary in the Appendix) can
be unsupervised (grouping into similar, heretofore
undetermined, classes based on similarities) or supervised
(prediction using already-determined classes, such as disease
status). Unsupervised methods include hierarchical cluster
analysis and k-means; supervised methods include regression,
tree models (e.g., classification and regression trees [CART],
boosting, bagging, and ensemble methods), multivariate
adaptive regression splines, artificial neural networks (ANNs),
support vector machines, and random forests (Hastie et al.,
2001). In oral health research, CART was used to predict caries
(Stewart and Stamm, 1991). ANN was used in clinical decision-making for third molar extraction (Brickley and Shepherd, 1996,
1997; Brickley et al., 1998; Goodey et al., 2000), oral cancer risk
assessment (Speight et al., 1995), predicting dental age from
photomicrographs (Amariti et al., 2000), predicting growth
classification from lateral cephalograms (Lux et al., 1998), and
assessing correlations of tooth enamel chemical elements
(Nilsson et al., 1996). However, KDD methods have not been
compared very much in oral health studies; compared with
regression models, ANN might better identify non-linearities,
while CART may better find interactions (Kattan et al., 1998).
Logistic regression models linear relationships between predictors (inputs) and a binary response (output) (e.g., Harrell, 2001). The binary logit model can be written as:

logit(π_i) = log_e [π_i / (1 − π_i)] = β_0 + β_1 x_i1 + ··· + β_P x_iP + e_i

where π_i is the probability of the i-th person having the response (y_i), the βs are the corresponding parameters for the P predictor variables, and e_i is the error for the i-th person. For example, if log10 MS and fluoride levels relate linearly to the log odds of developing caries, this model would fit well. Logistic regression coefficients (βs) are easy to interpret (as natural logarithms of odds ratios), a very desirable property. If the actual likelihood surface is not a hyperplane, logistic regression will not fit well, since it misses bumps or non-linearities.
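As a minimal sketch of the logit model above, consider simulated data with two predictors. The variable names, sample size, and effect sizes here are hypothetical, and scikit-learn stands in for whatever software was actually used in the study:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Hypothetical inputs: log10 MS counts and fluoride (ppm) for 200 children
log10_ms = rng.normal(5.0, 1.5, 200)
fluoride = rng.normal(0.5, 0.2, 200)
X = np.column_stack([log10_ms, fluoride])

# Simulate a binary caries outcome from a true logit-linear surface
true_logit = -4.0 + 0.6 * log10_ms - 1.0 * fluoride
y = rng.random(200) < 1.0 / (1.0 + np.exp(-true_logit))

model = LogisticRegression().fit(X, y)

# Coefficients are natural logs of odds ratios; exponentiate to recover ORs
odds_ratios = np.exp(model.coef_[0])
print(odds_ratios)
```

Exponentiating the fitted coefficients recovers the odds-ratio interpretation described above.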
CART models (e.g., Stewart and Stamm, 1991; Hastie et al.,
2001) adapt well to fit interactions, since they group
individuals with similar probabilities of caries (to produce
terminal nodes with the highest purity or homogeneity of
outcome classes). Unlike logit models, CART models are
robust to outliers and do not require specific data
transformations or hierarchical interaction specification.
Key Words
Models, statistical; decision support techniques; neural networks
(computer); dental caries; oral health.
Publication supported by Software of Excellence (Auckland, NZ)
Presented at "Dental Informatics & Dental Research: Making the
Connection", a conference held in Bethesda, MD, USA, June 12-13,
2003, sponsored by the University of Pittsburgh Center for Dental
Informatics and supported in part by award 1R13DE014611-01 from
the National Institute of Dental and Craniofacial Research/National
Library of Medicine.
CART models are step-function-type likelihood approximations
(analogous to Riemann sums approximating integrals); these
models are highly interpretable for easy clinician use.
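A minimal sketch of fitting such a tree, using scikit-learn's `DecisionTreeClassifier` (the variable names, cut-points, and data below are simulated, not the study's, and a depth limit stands in for pruning back the maximal tree):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

rng = np.random.default_rng(1)

# Hypothetical salivary inputs for 300 children
log10_ms = rng.normal(5.0, 1.5, 300)
log10_lb = rng.normal(3.0, 1.0, 300)
X = np.column_stack([log10_ms, log10_lb])

# Simulated outcome with an interaction: risk is high only when both counts are high
y = ((log10_ms > 7.0) & (log10_lb > 3.0)) | (rng.random(300) < 0.05)

# Gini splitting, as in the analysis described later; max_depth limits tree growth
tree = DecisionTreeClassifier(criterion="gini", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["log10_MS", "log10_LB"]))
```

The printed rules show how recursive partitioning expresses an interaction as nested splits, which a clinician can read directly.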
ANNs, extremely flexible weighted combinations of non-linear functions, use a hidden layer with hidden units/nodes/neurons and activation functions to link inputs to the hidden layer and the hidden layer to outputs. A feed-forward or multilayer perceptron ANN is:

g_0^{-1}(π_i) = w_0 + w_1 H_1 + w_2 H_2 + ··· + w_R H_R

where

H_r = tanh(w_0r + w_1r x_1i + w_2r x_2i + ··· + w_Pr x_Pi)

with r = 1, 2, ..., R indexing the neurons, H_r denoting the r-th neuron, w_pr denoting the coefficient of the p-th input x_pi for the r-th neuron, and g_0^{-1} denoting the inverse activation function (in this case, tanh^{-1}). In ANN terminology (Schwarzer et al., 2000), a P-R-S model has P inputs (predictors), 1 hidden layer with R neurons, and S outputs (outcomes). Neurons are a function of weighted sums of inputs plus a constant ("bias"), w_0. Similarly, outputs are a function of weighted sums of neurons plus bias; logistic and hyperbolic tangent functions are common activation functions. (Logistic regression is a P-0-1 feed-forward
ANN with logistic activation function.) Weight decay, a model-complexity penalty term included in the optimization, can be added to examine potential overfitting. In a simulation study of a 1-15-1
ANN, weight decays of 0, 0.002, and 0.005 were used with 0.005
stabilizing prediction (Schwarzer et al., 2000). Varying random
seeds, R, and weight decays stabilizes global optimization
(Ripley, 1996). ANNs are iteratively optimized with training
data, and the final model is fitted to validation data so that
future performance can be assessed. Training estimates weights,
but they have no clear interpretation; thus, ANNs have very
poor interpretability. Since ANNs with large R can fit an arbitrary surface, care must be taken not to overfit them to the training data.
Common mistakes with ANN are: too many parameters for the
sample size, not using validation, not using a model complexity
penalty, incorrect misclassification estimation, implausible
probability functions, incorrectly described network complexity,
inadequately flexible statistical competitors (e.g., CART), and
insufficient comparisons with statistical competitors (e.g.,
receiver operating characteristic curves) (Schwarzer et al., 2000).
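The feed-forward equations above can be written out directly. This NumPy sketch (with arbitrary random weights and no training step) simply shows the forward pass for a hypothetical 5-3-1 architecture like the one used later in this paper:

```python
import numpy as np

def ann_forward(X, W_hidden, b_hidden, w_out, b_out):
    """Feed-forward P-R-1 network: tanh hidden layer, logistic output.

    X: (n, P) inputs; W_hidden: (P, R); b_hidden: (R,); w_out: (R,); b_out: scalar.
    """
    # H_r = tanh(w_0r + sum_p w_pr * x_pi): hidden-neuron activations
    H = np.tanh(X @ W_hidden + b_hidden)
    # Output is a weighted sum of neurons plus bias, pushed through a
    # logistic activation to give a predicted probability
    logit = H @ w_out + b_out
    return 1.0 / (1.0 + np.exp(-logit))

# A hypothetical 5-3-1 network: 5 inputs, 3 hidden neurons, 1 output
rng = np.random.default_rng(2)
X = rng.normal(size=(10, 5))
p = ann_forward(X,
                W_hidden=rng.normal(size=(5, 3)),
                b_hidden=rng.normal(size=3),
                w_out=rng.normal(size=3),
                b_out=0.0)
print(p)
```

Counting parameters for this 5-3-1 network gives (5 + 1) × 3 hidden weights plus 3 + 1 output weights, the 22 degrees of freedom mentioned in the Methods.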
KDD Process
Although KDD is probably best known for analytic algorithms,
KDD is an iterative process with the following steps regarding
data: collect and store, pre-process, analyze, validate, and
implement (Fig. 1). Data collection and storage include study
design, sampling, merging, and warehousing. Pre-processing that may be needed includes cleaning, imputing missing
values, transforming (e.g., logarithm for skewed bacterial count
data), standardizing (centering at the mean and scaling by
standard deviation), and registering (e.g., aligning radiographs
for digital subtraction radiography). Analysis can involve
unsupervised and supervised learning techniques plus
visualization methods. Ideally, validation will be both internal
(these data) and external (separate data) (Altman and Royston,
2000). Internal validation can use a split sample (separate training and validation data, if sample size > 3000), cross-validation (if sample size < 3000), bootstrap (resampling with replacement), or jackknife (leave-one-out) methods. Finally,
implementation could involve changes in the KDD process,
new clinical interventions, or changes in health policy.
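As an illustration of the transforming and standardizing steps described above (the counts here are invented; log10 is the transform the text mentions for skewed bacterial count data):

```python
import numpy as np

# Hypothetical raw bacterial counts (CFU/mL), right-skewed
counts = np.array([1e3, 5e4, 2e5, 8e6, 1e7, 3e7])

# Transform: log10 tames the skew in bacterial count data
log_counts = np.log10(counts)

# Standardize: center at the mean and scale by the standard deviation
z = (log_counts - log_counts.mean()) / log_counts.std()

print(z.mean().round(6), z.std().round(6))  # ~0.0 and 1.0 after standardizing
```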
Data quality and study design issues remain paramount:
Limitations inherent in study designs remain when KDD is
used. For example, a tooth implant registry cannot examine the impact of bone quality on implant failure if bone quality is not
measured well (or at all). Similarly, a cross-sectional study of
stress and temporomandibular joint disorders still assesses
only associations, not causality.
Goals of this paper are to demystify knowledge discovery
and data mining (KDD) by explaining the process, to identify
possible pitfalls and practical issues, and to compare the
performance of KDD methods (logit, CART, and ANN) in
analyzing Rochester caries study data.
Materials and Methods
Rochester Caries Study
In upstate New York (the Rochester and Finger Lakes areas), first- and second-graders, caries-free at baseline, had stimulated saliva
collected and dental exams without radiographs performed every
six months for up to 6 years for a larger longitudinal caries risk
assessment study (Billings et al., 2003); this was a follow-up to the
same research team's earlier cross-sectional investigation (Leverett
et al., 1993a) and two-year longitudinal study (Leverett et al.,
1993b). This example analysis predicts primary tooth caries
(output) according to a subset of predictors (inputs) which may
have non-linearity or interactions. Salivary assays assessed
mutans streptococci (MS) and lactobacillus (LB) levels (colony-forming units per milliliter, CFU/mL), fluoride (F), calcium (Ca),
and phosphate (P) levels. Data for 466 children with 2 years of
follow-up were analyzed with input variables, selected based on
published discriminant analysis models (Leverett et al., 1993b) (log10
MS, log10 LB, F [parts per million, ppm], Ca [millimole per liter,
mmol/L], and P [mmol/L]). The output (response) variable was
caries incidence (at least one decayed or filled surface) on primary
teeth at 24 months of follow-up. Earlier analyses showed 18-month measures to be more predictive of 24-month caries incidence than baseline, six-month, or 12-month measures.
Fig. 1 — Knowledge discovery and data mining (KDD) steps. KDD involves several iterative steps, beyond analytic algorithms, to process scientific information.

KDD methods
Logistic regression, CART, and ANN
caries prediction methods were
compared. Logistic regression used
stepwise selection, with alpha = 0.05
to enter and 0.20 to stay, and the
Akaike Information Criterion to judge
the need for additional predictors.
CART used the Gini index-splitting
criterion and the proportion correctly
classified for pruning back the
maximal tree. A 5-3-1 multilayer perceptron ANN model (22 degrees of freedom) with inverse hyperbolic tangent activation functions,
Levenberg-Marquardt optimization, 5 preliminary runs,
average error selection, and no weight decay function was
used. Sensitivity analyses varied random seed (5 different
values), number of hidden neurons (2, 3, 4), and weight decay
parameter (0, 0.001, 0.005, 0.010, 0.250).
Training and validation were performed with a 70%/30%
randomly split sample stratified on primary dentition caries.
All methods used the same training data to develop the
prediction models and the hold-out (not used to develop the
models) validation data to score or validate the models.
Additionally, five-fold cross-validation [CV(5)] was performed,
randomly forming 5 groups leading to 5 analyses, each with
4/5 of the total data (i.e., each 5th was left out of one analysis).
Results were then aggregated to calculate the mean square error (MSE), also called the Brier score (B), between observed and expected output:

MSE = (1/n) Σ_{i=1}^{n} e_i² = B = (1/n) Σ_{i=1}^{n} (y_i − π̂_i)²,

where n is the sample size.

The resultant training classification tree is presented in Fig. 2. In this example, the overall prevalence of caries in the primary dentition was 15% (root node). Each input variable was searched to partition the root node. All children with log10 MS less than 7.08 (~10 million CFU/mL) were in the left node, with the remainder in the right node. Node-specific prevalences were 15% and 91%, respectively. Circles identify nodes with prevalence less than or equal to the overall prevalence of 15%, while squares identify nodes with prevalence greater than the overall prevalence. Continuing, the left node was split into two nodes according to log10 LB. The node with log10 LB less than 3.05 was further split with log10 MS for identification of a group with very low prevalence. This illustrates tree models' recursive nature, since predictors can be re-used. Next, the node with log10 LB greater than or equal to 3.05 was split with fluoride. Finally, the node with log10 MS greater than or equal to 7.08 was split with fluoride. There were 6 terminal nodes (3 high prevalence and 3 low prevalence); 1 high-prevalence node was very high, while 2 low-risk nodes were very low.
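The five-fold cross-validated Brier score can be sketched as follows. The data, model, and scikit-learn utilities here are stand-ins for the study's actual variables and software:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

rng = np.random.default_rng(3)

# Hypothetical data: 2 salivary predictors and a binary caries outcome
X = rng.normal(size=(466, 2))
y = rng.random(466) < 1.0 / (1.0 + np.exp(-(-1.7 + X[:, 0])))

# CV(5): each fifth is held out of one analysis, trained on the other 4/5
brier_terms = []
for train_idx, test_idx in KFold(n_splits=5, shuffle=True, random_state=0).split(X):
    model = LogisticRegression().fit(X[train_idx], y[train_idx])
    p_hat = model.predict_proba(X[test_idx])[:, 1]
    brier_terms.append((y[test_idx] - p_hat) ** 2)

# Aggregate squared errors across folds: B = (1/n) * sum (y_i - pi_hat_i)^2
brier = np.concatenate(brier_terms).mean()
print(round(brier, 3))
```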
Visualization
Area under the curve (AUC) from receiver operating characteristic (ROC) curves, plotting sensitivity vs. the false-positive fraction (1 − specificity), which is equivalent to concordance (c index), was also calculated. The (positive) likelihood ratio is sensitivity / (1 − specificity). ROC curves allow for balancing the sensitivity-specificity tradeoff.
Cumulative captured-response curves are similar to ROC curves, but graph sensitivity vs. the percent testing positive (identified as high-risk). Thus, sensitivity for KDD methods can be compared at a specific percent-positive cut-off, which may be useful when resources for those labeled high-risk might be limited. A related graph is the lift chart, which displays the gain each KDD method has over baseline vs. the percent testing positive.
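These quantities are straightforward to compute from predicted probabilities. A sketch with simulated predictions follows (the data are invented; the 30% cut-off mirrors the one discussed in the Results):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(4)

# Hypothetical true caries status and predicted probabilities for 300 children
y = rng.random(300) < 0.15
p_hat = np.clip(0.15 + 0.3 * y + rng.normal(0, 0.15, 300), 0.01, 0.99)

# AUC (c index): probability a random case outranks a random non-case
auc = roc_auc_score(y, p_hat)

# Captured response at a 30% high-risk cut-off: sensitivity among the
# 30% of children with the highest predicted probabilities
cut = np.quantile(p_hat, 0.70)
high_risk = p_hat >= cut
sensitivity = (y & high_risk).sum() / y.sum()

# Lift = captured response / percent labeled high-risk
lift = sensitivity / high_risk.mean()
print(round(auc, 2), round(sensitivity, 2), round(lift, 2))
```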
To visualize the input contributions, we divided ANN predicted probabilities into quintiles (fifths) and showed the distributions of the standardized predictors in each quintile via boxplots.
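Binning predicted probabilities into quintiles can be sketched without plotting; here medians stand in for full boxplots, and the data are simulated (an input deliberately correlated with the predicted output):

```python
import numpy as np

rng = np.random.default_rng(5)

# Hypothetical ANN predicted probabilities and standardized log10 MS inputs
p_hat = rng.random(500)
z_ms = 2.0 * p_hat + rng.normal(0, 0.5, 500)  # input correlated with output

# Bin predicted probabilities into quintiles (5 equal-sized groups)
edges = np.quantile(p_hat, [0.2, 0.4, 0.6, 0.8])
quintile = np.digitize(p_hat, edges)  # 0 = lowest fifth, 4 = highest

# Summarize each input's distribution per quintile (boxplot stand-in)
medians = [np.median(z_ms[quintile == q]) for q in range(5)]
print(np.round(medians, 2))
```

In this simulation the medians rise across quintiles, which is exactly the pattern the boxplots in Fig. 4 reveal for log10 MS.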
Fig. 2 — Classification and regression tree model to predict 24-month caries in primary dentition from salivary
bacterial and chemical measurement 6 months prior. Mutans streptococci counts (log10 colony-forming units
per milliliter, CFUs/mL), lactobacilli counts (log10 CFUs/mL), and fluoride (F) levels (parts per million) were used
to produce 6 terminal nodes—3 with risk higher than overall prevalence and 3 with risk lower than overall
prevalence.
Results
Logistic regression yielded a model with only the two bacterial level variables as significant predictors: the log10 MS odds ratio (OR) was 1.27 (95% confidence interval, 1.10-1.46), while the log10 LB OR was 1.36 (95% CI, 1.19-1.57), meaning that each log10 increase in MS related to 27% greater odds of having primary tooth caries 6 months later, and each log10 increase in LB related to 36% greater odds of carious primary dentition.
Fig. 3 — Percent of children identified with caries as the percent classified as high-risk changes for the different
KDD models. Cumulative captured-response graph comparing baseline, 5-3-1 artificial neural network (ANN),
logistic regression (logit), and classification and regression tree model performance in identifying children with
caries in the primary dentition (% captured response) vs. the percent labeled as high-risk. Percent captured
response is sensitivity, while percent high-risk is the percent testing positive.
Cumulative captured-response curves (Fig. 3) compare the performance of the 3 methods. These curves show the
percent of children with caries identified (vertical axis) as the
percent classified as high-risk (horizontal axis) increases—i.e.,
sensitivity on the vertical axis and percent with a positive
diagnostic test on the horizontal axis. Like ROC curves, this
graph shows performance vs. a 45-degree diagonal reference
line, which identifies the same percent of cases as the percent
labeled as high-risk. For example, if 30% of the children are
labeled high-risk, 30% of the cases would be expected to be
identified (i.e., at the same prevalence as the overall
prevalence). As with ROC curves, an ideal prediction method
would identify all 100% of the cases with a small percent of
the children being labeled high-risk. Here, if 10% of the
children were labeled high-risk, all 3 methods identified about
30% of the cases; if 20% were denoted high-risk, logistic
regression (logit) identified about 35% of the cases, CART
(tree) 40%, and ANN 50%; if 30% were classified high-risk,
logistic regression and CART each identified 55% of the cases,
while ANN found more than 60%. Corresponding values in a
lift chart (not shown) would be about 3 for all 3 methods if
10% of children are labeled high-risk, but about 1.75 for logit and tree and 2 for ANN if 30% of children are labeled high-risk. To be clinically useful, studies probably would not want to denote more than 30-40% of children as being high-risk
(Hausen, 1997). ANN may provide some advantages over
logit and CART. ROC curves (not shown) yield patterns very
similar to those shown in Fig. 3; however, Fig. 3 provides the
direct interpretation for the percent of children identified as
high-risk, which may be more useful to public health and
policy planners.
Although ANN weights do not provide direct interpretations (e.g., ORs), the predicted probability from an ANN model can be categorized (binned), and the distribution of predictors in each category can be graphed. Fig. 4 shows the distribution of standardized log10 MS (standard normal distribution) with boxplots for quintiles of ANN model predicted probability. The lowest quintile (lowest predicted chance of caries in the primary dentition) has the lowest log10 MS levels.
CV(5) results showed extremely similar root MSE values
for the 3 methods: 0.365 for logit, 0.363 for CART, and 0.362 for
ANN. AUC (c index) from ROC curves differed somewhat:
0.553 for CART, 0.680 for logit, and 0.707 for ANN. The AUC is the probability that a randomly chosen child with caries would be assigned a higher predicted probability than a randomly chosen child without caries.
Discussion
Limitations of the example presented included the relatively
small number of predictors utilized. Moreover, other factors
potentially related to caries, such as salivary flow rate and pH,
were not included. Sufficient salivary flow rate was an inclusion criterion for the study, and buffering capacity (pH) was not an important predictor in earlier work for the precursor study. However, investigators thought that the relationships
between predictors and response might be non-linear and
include interactions (earlier analyses showed interactions
between bacterial counts and salivary chemistry measures).
Additionally, the logistic regression models considered did not include interaction or non-linear terms; including such terms may have produced logistic regression models approaching the predictive accuracy of artificial neural networks. Boosting (re-weighting
to emphasize misclassifications) or bootstrap aggregation
(bagging) could have improved the performance of the tree
models.
Conclusions
Knowledge discovery and data mining (KDD) is not a
panacea but rather a process with useful tools; KDD does not
obviate the need for careful monitoring of data quality and
study design issues. Multiple methods should be used to
assess sensitivity to one particular method; prediction results
from various methods should be compared according to
receiver operating characteristic (ROC) curves, cumulative
captured-response curves, or lift charts. Care should be taken
to avoid common mistakes made with artificial neural
networks (ANNs) (e.g., Schwarzer et al., 2000). Validation
(internal and external) is essential: "The major cause of
unreliable models is overfitting the
data" (Harrell, 2001, p. 249).
Graphic displays can greatly help
interpretations and demystify the
"black box" nature of some KDD
methods, such as ANNs. KDD
methods may provide advantages
over traditional statistical methods
in dental data.
Fig. 4 — Distribution of standardized log10 mutans streptococci (MS) counts for 5-3-1 artificial neural network (ANN) model predicted probability quintiles. Predicted probability of caries in the primary dentition is split into 5 equal-sized groups (quintiles). Log10 MS counts (standardized to a standard normal distribution by centering at the mean and scaling by the standard deviation) were much lower in the first quintile (i.e., the 20% of children who had the lowest predicted probability of caries in their primary dentition). Thus, even though ANN model weights are not interpretable for individual input (predictor) variables, visualization techniques can show how inputs relate to the predicted output (response).

Acknowledgments
The author is grateful to Dr. John
Featherstone for his insights about
caries risk, to Dr. Jin Whan Jung for
his inspiration to utilize KDD tools,
to Dr. Ronald Billings for generously
providing the Rochester caries
study data, and to Dr. Jane
Weintraub for her helpful suggestions and comments on an earlier
draft of the paper. Any ambiguities,
omissions, or errors that remain are
solely my own. This research was
supported in part by cooperative
agreement US DHHS/NIH/NIDCR,
NCMHD U54DE14251-01. The
Rochester caries study was performed with support from US
DHHS/NIH/NIDCR R01DE08946
(R.J. Billings, Principal Investigator).
Appendix: Glossary of Knowledge Discovery
and Data Mining Methods and Related Terms
(italicized words are cross-referenced)
Artificial neural network (ANN) model — multilayer non-linear
"black box" mathematical model to predict output from inputs
Bagging (Bootstrap aggregation) — ensemble tree model method to
reduce misclassification error using bootstrap (with
replacement) resampling
Boosting (e.g., adaptive resampling and combining (ARCing) or
adaptive boosting (AdaBoost)) — ensemble tree model method
to reduce misclassification error using increased weights for
misclassified observations to allow for better prediction in
subsequent trees on those records
Bootstrap — drawing (resampling) a large number (e.g., 500 to
10,000) of new sets of data with the original sample size (n)
from the original data with replacement and re-analyzing
those bootstrap resamples to simulate variability and assess
robustness
Classification and regression tree (CART) model — recursive
partitioning method (re-assessing all inputs at each stage) to
split the data into 2 groups at each stage based on inputs that
minimize the output class misclassification error
Cross-validation or K-fold cross-validation [CV(K)] — randomly
dividing the data into K mutually exclusive and exhaustive
subsets (e.g., 5 or 10), re-analyzing each subset, and
aggregating across the K subsets to estimate robustness
Ensemble tree model or committee of trees — classifier using
majority vote (modal) class assignment or mean predicted
probability from a group of tree models grown under different
conditions to reduce classification error
Hierarchical clustering — groups records together based on
closeness/similarity, starting with each record in its own
cluster and ending with all records in one cluster (or vice versa)
and allowing the reader to choose classification from those in
the middle; formed in either a step-down (divisive) or step-up
(agglomerative) direction
Input — predictor, explanatory, or independent variable
Jackknife — assessing analysis robustness by leaving out one
observation (i.e., sample size is n - 1), analyzing the data again,
repeating n times until each observation has been left out once,
and then comparing with the original analysis with all the
data; equivalent to n-fold cross-validation [CV(n)]
K-means clustering — iteratively determines K groups based on
closeness/similarity to group center (mean) and minimal
within-group variability
k-nearest neighbor (knn) clustering — iteratively identifies groups
based on the k closest neighbors to each point, assigning
modal or majority class among the k neighbors
Multivariate adaptive regression splines (MARS) — iterative
modeling method using combinations of linear basis functions
of inputs (predictors) to fit non-linear relationships smoothly
Output — response, outcome, or dependent variable
Random Forests — ensemble tree method using randomly selected
subsets of inputs while also providing interpretability through
summary measures of input variable importance
Regression model (linear or logistic) — classic statistical model to
predict output value or probability (linearly or log-linearly)
from inputs
Split sample — randomly grouping data into training and testing
samples, stratifying on output, building prediction models
with the training sample, and testing that resultant model
with the holdout testing sample to provide an unbiased error
estimate and assess robustness
Supervised learning — modeling in which the output class is known
Support vector machines (SVM) — computationally intensive
"black box" method to find the non-linear multidimensional
boundary (hyperplane) transformed as a linear hyperplane
that best splits classes
Unsupervised learning — modeling in which the output class is
not known; data are clustered according to similar input
variables
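The bootstrap and jackknife entries above can be sketched side by side. The data are simulated, and the jackknife standard error uses the usual sqrt((n−1)/n · Σ(θ̂_(i) − θ̄)²) rescaling:

```python
import numpy as np

rng = np.random.default_rng(6)
data = rng.normal(5.0, 1.5, 50)  # hypothetical log10 MS counts, n = 50

# Bootstrap: resample n observations with replacement, many times
boot_means = np.array([
    rng.choice(data, size=data.size, replace=True).mean()
    for _ in range(1000)
])

# Jackknife: leave one observation out at a time (n analyses of size n - 1)
jack_means = np.array([
    np.delete(data, i).mean() for i in range(data.size)
])

# Both resampling schemes estimate the variability of the statistic;
# the jackknife SE rescales the leave-one-out spread by sqrt(n - 1)
print(round(boot_means.std(), 3), round(jack_means.std() * np.sqrt(49), 3))
```

The two printed standard-error estimates for the mean should be close, illustrating why either method can be used to assess robustness.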
References
Altman DG, Royston P (2000). What do we mean by validating a
prognostic model? Stat Med 19:453-473.
Amariti ML, Restori M, De Ferrari F, Paganelli C, Faglia R,
Legnani G (2000). A histological procedure to determine dental
age. J Forensic Odontostomatol 18:1-5.
Billings RJ, Gansky SA, Mundorff-Shrestha SA, Leverett DH,
Featherstone JDB (2003). Pathological and protective caries
risk factors in a children's longitudinal study (abstract). Caries
Res 37:277-278.
Brickley MR, Shepherd JP (1996). Performance of a neural network
trained to make third-molar treatment-planning decisions.
Med Decis Making 16:153-160.
Brickley MR, Shepherd JP (1997). Comparisons of the abilities of a
neural network and three consultant oral surgeons to make
decisions about third molar removal. Br Dent J 182(2):59-63.
Brickley MR, Shepherd JP, Armstrong RA (1998). Neural networks:
a new technique for development of decision support systems
in dentistry. J Dent 26:305-309.
Fayyad UM, Piatetsky-Shapiro G, Smyth P (1996). From data
mining to knowledge discovery: an overview. In: Advances in
knowledge discovery and data mining. Fayyad UM, Piatetsky-Shapiro G, Smyth P, Uthurusamy R, editors. Menlo Park, CA:
AAAI Press, pp. 1-36.
Goodey RD, Brickley MR, Hill CM, Shepherd JP (2000). A
controlled trial of three referral methods for patients with
third molars. Br Dent J 189:556-560.
Harrell FE Jr (2001). Regression modeling strategies with
applications to linear models, logistic regression and survival
analysis. New York: Springer Verlag.
Hastie T, Tibshirani R, Friedman JH (2001). The elements of
statistical learning: data mining, inference, and prediction.
New York: Springer Verlag.
Hausen H (1997). Caries prediction—state of the art. Community
Dent Oral Epidemiol 25:87-96.
Kattan MW, Hess KR, Beck JR (1998). Experiments to determine
whether recursive partitioning (CART) or an artificial neural
network overcomes theoretical limitations of Cox proportional
hazards regression. Comput Biomed Res 31:363-373.
Leverett DH, Featherstone JDB, Proskin HM, Adair SM, Eisenberg
AD, Mundorff-Shrestha SA, et al. (1993a). Caries risk
assessment by a cross-sectional discrimination model. J Dent
Res 72:529-537.
Leverett DH, Proskin HM, Featherstone JDB, Adair SM, Eisenberg
AD, Mundorff-Shrestha SA, et al. (1993b). Caries risk assessment
in a longitudinal discrimination study. J Dent Res 72:538-543.
Lux CJ, Stellzig A, Volz D, Jager W, Richardson A, Komposch G
(1998). A neural network approach to the analysis and
classification of human craniofacial growth. Growth Dev Aging
62(3):95-106.
Nilsson T, Lundgren T, Odelius H, Sillen R, Noren JG (1996). A
computerized induction analysis of possible co-variations
among different elements in human tooth enamel. Artif Intell Med 8:515-526.
Page RC, Krall EA, Martin J, Mancl L, Garcia RI (2002). Validity
and accuracy of a risk calculator in predicting periodontal
disease. J Am Dent Assoc 133:569-576.
Ripley BD (1996). Pattern recognition and neural networks. New
York: Cambridge University Press.
Schwarzer G, Vach W, Schumacher M (2000). On the misuses of
artificial neural networks for prognostic and diagnostic
classification in oncology. Stat Med 19:541-561.
Speight PM, Elliott AE, Jullien JA, Downer MC, Zakrzewska JM (1995). The use of artificial intelligence to identify people at risk of oral cancer and precancer. Br Dent J 179(10):382-387.
Stewart PW, Stamm JW (1991). Classification tree prediction
models for dental caries from clinical, microbiological, and
interview data. J Dent Res 70:1239-1251.
Waldrop MM (2001). Data mining. MIT Technology Review—Ten emerging technologies that will change the world. January/February. Internet Web site accessed 23 March 2004. <http://www.technologyreview.com/articles/mag_toc_jan01.asp>