Data Mining Cardiovascular Bayesian Networks

Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi

Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering
» 2 epidemiological models
2. Data Mining
» Busselton Study data
» Bayesian network software (Netica)
» Causal discovery (CaMML) + other learners
» Medical experts
3. Evaluation

Knowledge Engineering BNs from the medical literature
The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality follow-up via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women
The German PROCAM Study
» 1979-1985, follow-up every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
General question: are models transferable across populations?

The Busselton BN: nodes
[Figure: nodes of the Busselton BN]

The Busselton BN: arcs
[Figure: arcs of the Busselton BN; labels: uninformative, predictor variables, 10-year risk of CHD event]
» All nodes have an associated conditional probability distribution
» P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
» BNs summarize the joint distribution

The Busselton BN: discretization
[Figure: Busselton BN; labels: binary nodes, discretization choices]

The Busselton BN: reasoning
[Figures: the network under four evidence scenarios: Normal; Bad cholesterol; Heavy smoking; More risk factors!]

A risk assessment tool for clinicians
» Previous tool: TAKEHEART
» Combine risk assessment (probability) with costs.
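One way to read "combine risk assessment (probability) with costs" is as an expected-cost comparison between treating and not treating. The sketch below is only an illustration of that idea, not the TAKEHEART tool or the authors' method: the function names and the cost and risk-reduction figures are hypothetical placeholders, not values from the Busselton or PROCAM studies.

```python
def expected_costs(p_event, cost_event, cost_treatment, risk_reduction):
    """Expected cost of each action, given a 10-year CHD event probability.

    All arguments other than p_event are hypothetical illustration values.
    """
    # Without treatment, the only cost is the event itself, weighted by its risk.
    no_treat = p_event * cost_event
    # With treatment, we always pay for treatment and the risk is scaled down.
    treat = cost_treatment + p_event * (1.0 - risk_reduction) * cost_event
    return {"treat": treat, "don't treat": no_treat}


def recommend(p_event, cost_event=100.0, cost_treatment=5.0, risk_reduction=0.3):
    """Return the action with the lower expected cost (made-up default costs)."""
    costs = expected_costs(p_event, cost_event, cost_treatment, risk_reduction)
    return min(costs, key=costs.get)


if __name__ == "__main__":
    for p in (0.05, 0.30):
        print(f"risk={p:.2f} -> {recommend(p)}")
```

With these placeholder numbers, treatment is preferred once the risk exceeds cost_treatment / (risk_reduction × cost_event), roughly 0.17. In a BN-based tool the risk would come from the network's posterior for the CHD node given whatever predictors have been observed.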
Risk Assessment Tool: example
» Young, predictor not observed – don’t treat
» Old, predictor not observed – treat
» Young, predictor observed – don’t treat
» Not so old, predictor not observed – treat

CaMML: a causal learner
» Developed at Monash University
» Data mines BNs from epidemiological data
» Minimum message length (MML) metric: trades off complexity vs goodness of fit
» MCMC search over model space

CaMML: example BN
[Figures: example BNs learned by CaMML]

Evaluation
Predicting 10-year risk of CHD using Busselton data
Metrics:
» ROC curves (area under curve)
» Bayesian Information Reward (BIR)
Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
Experiment 2:
» Compare CaMML and other standard machine learners (from Weka)
» 90-10 training/testing split, 10-fold cross-validation

Experiment 1: ROC Results
[Figure: ROC curves; labels: Extremes: Everyone at risk! / No-one at risk!; Area under curve (AUC); priors]

Experiment 2: ROC Results
[Figure: ROC results for the learners]

Experiment 2: Bayesian Info Reward
[Figure: BIR results for the learners]

Summary of Results
Experiment I (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
Experiment II (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary

Conclusions
» Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature
» CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.
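The area under the ROC curve used as the "relative" risk metric in the evaluation can be computed without plotting: it equals the probability that a randomly chosen case with a CHD event receives a higher predicted risk than a randomly chosen case without one, with ties counted as half. The sketch below is a minimal pairwise implementation of that standard definition, with a hypothetical function name and toy data; it is not the evaluation code used in the study.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (ties count 0.5).

    scores: predicted risks; labels: 1 = event, 0 = no event.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive case ranked above negative case
            elif p == n:
                wins += 0.5   # tie: half credit
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data only: two event cases and two non-event cases.
    print(roc_auc([0.8, 0.4, 0.6, 0.2], [1, 1, 0, 0]))
```

Because AUC depends only on how cases are ranked, a model that assigns every case the same risk scores 0.5, however well or badly calibrated that risk is. That is why a calibration-sensitive measure such as BIR is reported alongside it.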
Future directions
Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so
Investigate combination of continuous and discrete variables in data mining and modeling
Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians

References
G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study. Circulation, 105(3):310-315, 2002.
M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.
C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999.
www.datamining.monash.edu.au/software/camml
C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature. Technical Report 2005/170, School of CSSE, Monash University, 2005.