Data Mining
Bayesian Networks
Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering
2 epidemiological models
2. Data Mining
Busselton Study data
Bayesian network
software (Netica)
+ Other learners
Medical Experts
3. Evaluation
Knowledge Engineering BNs
from the medical literature
The Australian Busselton Study
every 3 years, 1966-1981, > 8,000 participants
mortality followup via WA death register + manually
Cox proportional-hazards model, 2,258 from 1978 cohort
CHD event base rates: 23% for men, 14% for women
The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
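For reference, both published models build on the Cox proportional-hazards form. The notation below is the generic textbook one, not taken from either study: h_0 is the baseline hazard, S_0 the baseline survival function, x the vector of predictors and beta the fitted coefficients.

```latex
h(t \mid x) = h_0(t)\,\exp\!\left(\beta^{\top} x\right),
\qquad
P(\text{CHD event within 10 years} \mid x) = 1 - S_0(10)^{\exp\left(\beta^{\top} x\right)}
```

A scoring model such as PROCAM's can be read as a rounded, tabulated version of the linear term beta^T x.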
General question: are models transferable across
The Busselton BN: nodes
The Busselton BN: arcs
All nodes have an associated
conditional prob. distribution
P(S, B, Al, At) = P(S) P(B|S) P(Al|S) P(At|S)
predictor variables
10-year risk of CHD event
BNs summarize the joint probability distribution
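The factorization on this slide can be checked numerically. The sketch below is illustrative only: the slide does not say what S, B, Al and At stand for, so they are treated as anonymous binary nodes with S as the sole parent of the other three, and every probability is invented.

```python
from itertools import product

# Hypothetical CPTs (all numbers invented). S is the only parent of B, Al, At.
P_S = 0.3                                   # P(S = true)
P_B_GIVEN_S = {True: 0.6, False: 0.2}       # P(B = true | S)
P_AL_GIVEN_S = {True: 0.5, False: 0.1}      # P(Al = true | S)
P_AT_GIVEN_S = {True: 0.4, False: 0.3}      # P(At = true | S)


def bern(p_true, value):
    """Probability of a binary value given P(value is true)."""
    return p_true if value else 1.0 - p_true


def joint(s, b, al, at):
    """P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)."""
    return (bern(P_S, s)
            * bern(P_B_GIVEN_S[s], b)
            * bern(P_AL_GIVEN_S[s], al)
            * bern(P_AT_GIVEN_S[s], at))


# The 16 joint entries sum to 1, so the product of CPTs is a valid distribution.
total = sum(joint(*values) for values in product([True, False], repeat=4))
```

With binary nodes this structure needs 1 + 2 + 2 + 2 = 7 free parameters, against 15 for an unstructured joint table over the same four variables.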
The Busselton BN: discretization
binary nodes
discretization choices
The Busselton BN: reasoning
Bad cholesterol
Heavy smoking
The Busselton BN: reasoning
More risk factors
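The reasoning slides show the risk estimate changing as findings are entered. A minimal sketch of that updating, by brute-force enumeration, follows. The three-node network (heavy smoking and bad cholesterol as parents of a 10-year CHD node) and all of its numbers are hypothetical; they are not the Busselton BN or its parameters.

```python
from itertools import product

# Hypothetical parameters, invented for illustration.
P_SMOKE = 0.25                      # P(heavy smoking)
P_CHOL = 0.30                       # P(bad cholesterol)
P_CHD = {                           # P(CHD event | smoke, chol)
    (False, False): 0.05,
    (True, False): 0.15,
    (False, True): 0.12,
    (True, True): 0.30,
}


def joint(smoke, chol, chd):
    """Joint probability of one complete assignment."""
    p = (P_SMOKE if smoke else 1 - P_SMOKE) * (P_CHOL if chol else 1 - P_CHOL)
    p_event = P_CHD[(smoke, chol)]
    return p * (p_event if chd else 1 - p_event)


def risk(evidence):
    """P(chd = True | evidence); evidence maps 'smoke'/'chol' to booleans."""
    numerator = denominator = 0.0
    for smoke, chol, chd in product([False, True], repeat=3):
        world = {"smoke": smoke, "chol": chol}
        if any(world[name] != value for name, value in evidence.items()):
            continue                # inconsistent with the findings entered
        p = joint(smoke, chol, chd)
        denominator += p
        if chd:
            numerator += p
    return numerator / denominator


baseline = risk({})                                 # nothing observed
smoker = risk({"smoke": True})                      # one finding entered
smoker_bad_chol = risk({"smoke": True, "chol": True})
```

With these invented numbers the risk rises from about 0.10 with nothing observed, to about 0.20 for a heavy smoker, to 0.30 once bad cholesterol is also entered. Tools such as Netica use more efficient propagation algorithms; enumeration is used here only because it is easy to verify by hand.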
A risk assessment tool for clinicians
Previous tool: TAKEHEART
Combine risk assessment (probability) with
Risk Assessment Tool: example
Young, predictor not observed – don’t treat
Old, predictor not observed – treat
Young, predictor observed – don’t treat
Not so old, predictor not observed – treat
CaMML: a causal learner
Developed at Monash University
Data mines BNs from epidemiological data
Minimum message length (MML) metric:
Trades off complexity against goodness of fit
MCMC search over model space
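The complexity-versus-fit trade-off can be illustrated with a crude two-part code length: bits to state the model plus bits to state the data given the model. The sketch below uses a BIC-style approximation (half of log2 N bits per free parameter) as a stand-in. It is not the MML metric CaMML uses, and it scores only two fixed candidate structures rather than searching by MCMC. The function names and the example counts are invented.

```python
import math


def data_bits(counts):
    """Negative log2-likelihood of binary outcomes at the ML estimate."""
    n = sum(counts)
    return -sum(c * math.log2(c / n) for c in counts if c > 0)


def two_part_bits(groups, n_total):
    """Model cost plus data cost.

    groups: one (count_y0, count_y1) pair per parent configuration of Y.
    Each group contributes one free parameter, charged 0.5 * log2(N) bits.
    """
    model_cost = 0.5 * len(groups) * math.log2(n_total)
    return model_cost + sum(data_bits(g) for g in groups)


def compare(table):
    """table[x] = (count_y0, count_y1) for x in {0, 1}.

    Returns (bits without an arc X -> Y, bits with the arc).
    """
    n = sum(sum(row) for row in table)
    pooled = (table[0][0] + table[1][0], table[0][1] + table[1][1])
    return two_part_bits([pooled], n), two_part_bits(list(table), n)


# Invented counts: X barely matters in the first table, strongly in the second.
weak_no_arc, weak_arc = compare([(45, 5), (44, 6)])
strong_no_arc, strong_arc = compare([(45, 5), (10, 40)])
```

On the weak table the extra parameter costs more bits than it saves, so the arc-free model gets the shorter message; on the strong table the arc pays for itself.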
CaMML: example BN
Predicting 10-year risk of CHD using
Busselton data
» ROC Curves (area under curve)
» Bayesian Information Reward (BIR)
Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
Experiment 2:
» Compare CaMML and other standard machine
learners (from Weka)
» 90-10 training/testing split, 10-fold cross-validation
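A 90-10 split repeated ten times corresponds to standard 10-fold cross-validation. A minimal sketch of the index bookkeeping follows; the function name is invented, and whether the original experiments stratified folds by outcome is not stated, so no stratification is done here.

```python
import random


def k_fold_indices(n, k=10, seed=0):
    """Yield (train, test) index lists; each case is tested exactly once."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]       # k near-equal folds
    for i, test in enumerate(folds):
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, test


splits = list(k_fold_indices(100))
```

With 100 cases each split has 90 training and 10 test cases. Each learner is trained on the training indices and scored on the held-out ones, and the scores are averaged over the ten folds.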
Experiment 1: ROC Results
Everyone at risk!
Area under curve (AUC)
No-one at risk!
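The two labelled extremes are the ends of the ROC curve: a threshold so low that everyone is flagged at risk gives the point (1, 1), and one so high that no one is flagged gives (0, 0). A minimal sketch of tracing the curve and integrating the area under it, using invented scores and outcomes:

```python
def roc_points(scores, labels):
    """(false-positive rate, true-positive rate) pairs, threshold high to low.

    labels are 1 for a CHD event and 0 otherwise; tied scores move together.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: -pair[0])
    points = [(0.0, 0.0)]                   # "no one at risk"
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / negatives, tp / positives))
    return points                           # last point is "everyone at risk"


def auc(points):
    """Trapezoidal area under the ROC curve."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Invented predicted risks and outcomes for six people.
example = roc_points([0.9, 0.8, 0.7, 0.6, 0.55, 0.4], [1, 1, 0, 1, 0, 0])
example_auc = auc(example)
```

Here the AUC is 8/9: of the nine (event, non-event) pairs, eight are ranked correctly. AUC depends only on the ranking of cases, which is why the summary slide calls it a "relative" risk measure.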
Experiment 2: ROC Results
Experiment 2: Bayesian Info Reward
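A minimal sketch of Bayesian Information Reward for a binary outcome, assuming the formulation in which each class earns log(p'/p) when it is the true class and log((1 - p')/(1 - p)) when it is not, with p the prior and p' the predicted probability, averaged over classes. This is a reading of the metric as stated here, not code from the study, and the function names are invented.

```python
import math


def bir(pred, prior, event):
    """Information reward in bits for one case.

    pred:  predicted P(event), strictly between 0 and 1
    prior: base-rate P(event), strictly between 0 and 1
    event: True if the CHD event occurred
    """
    classes = [(pred, prior, event), (1 - pred, 1 - prior, not event)]
    total = 0.0
    for p_hat, p0, is_true in classes:
        if is_true:
            total += math.log2(p_hat / p0)
        else:
            total += math.log2((1 - p_hat) / (1 - p0))
    return total / len(classes)


def mean_bir(preds, prior, events):
    """Average reward over a test set."""
    return sum(bir(p, prior, e) for p, e in zip(preds, events)) / len(preds)


# Using the 23% base rate quoted for Busselton men as the prior.
no_information = bir(0.23, 0.23, True)      # repeating the prior earns 0 bits
confident_hit = bir(0.46, 0.23, True)       # doubling the prior earns 1 bit
confident_miss = bir(0.46, 0.23, False)     # and is penalized if wrong
```

Unlike AUC, the reward depends on the predicted probabilities themselves, not just their ordering, so a model that ranks cases well but is badly calibrated scores poorly.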
Summary of Results
Experiment 1 (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
Experiment 2 (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary
Busselton & PROCAM models appear to perform
equally well on Busselton data, using an absolute risk
measure (BIR) from the literature
CaMML results suggest the data have high variance
and are too weak to support inference to complex
models. Combining data would help.
Future directions
Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined,
and combining them if so
Investigate combination of continuous and discrete
variables in data mining and modeling
Develop new TAKEHEART model using BNs (taking
the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians
G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for
calculating the risk of acute coronary events based on the 10-year
follow-up of the Prospective Cardiovascular Munster (PROCAM)
study. Circulation, 105(3):310-315, 2002.
M.W. Knuiman, H.T. Vu and H. C. Bartholomew. Multivariate risk
estimation for coronary heart disease: the Busselton Health Study.
Australian & New Zealand Journal of Public Health, 22:747-753,
C.S. Wallace and K.B. Korb. Learning Linear Causal Models by
MML Sampling. In A. Gammerman, editor, Causal Models and
Intelligent Data Management, pages 89-111. Springer-Verlag, 1999.
C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering
cardiovascular Bayesian networks from the literature. Technical
Report 2005/170, School of CSSE, Monash University, 2005.