Data Mining Cardiovascular Bayesian Networks
Charles Twardy†, Ann Nicholson†, Kevin Korb†, John McNeil‡
(Danny Liew‡, Sophie Rogers‡, Lucas Hope†)
†School of Computer Science & Software Engineering
‡Dept. of Epidemiology & Preventive Medicine
Monash University
www.datamining.monash.edu.au/bnepi

Overview
Problem: assessment of risk for coronary heart disease (CHD)
1. Knowledge Engineering
» 2 epidemiological models
2. Data Mining
» Busselton Study data
» Bayesian network software (Netica)
» Causal discovery (CaMML) + other learners
» Medical experts
3. Evaluation

Knowledge Engineering
BNs from the medical literature
The Australian Busselton Study
» every 3 years, 1966-1981, > 8,000 participants
» mortality followup via WA death register + manually
» Cox proportional-hazards model, 2,258 from 1978 cohort
» CHD event base rates: 23% for men, 14% for women
The German PROCAM Study
» 1979-1985, followup every 2 years, > 25,000 participants
» Scoring model (based on Cox), ~5,000 men
» CHD event base rates: ~6%
General question: are models transferable across populations?

The Busselton BN: nodes
[Figure: network nodes]

The Busselton BN: arcs
[Figure: network arcs. Labels: uninformative; predictor variables; 10-year risk of CHD event]
All nodes have an associated conditional prob. distribution
P(S,B,Al,At) = P(S) P(B|S) P(Al|S) P(At|S)
BNs summarize the joint distribution

The Busselton BN: discretization
[Figure: network. Labels: binary nodes; discretization choices]

The Busselton BN: reasoning
[Figures: the network under four evidence scenarios: Normal; Bad cholesterol; Heavy smoking; More risk factors]

A risk assessment tool for clinicians
Previous tool: TAKEHEART
Combine risk assessment (probability) with costs.
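One way to read "combine risk assessment (probability) with costs" is as an expected-cost comparison. The sketch below is an illustrative assumption, not the TAKEHEART method: the function names, the cost figures, and the relative risk reduction are all hypothetical. Only the two risk values (23% and 6%) come from the base rates quoted earlier.

```python
def expected_cost(p_event: float, event_cost: float, treatment_cost: float,
                  relative_risk_reduction: float, treat: bool) -> float:
    """Expected cost for one patient under a treat / don't-treat decision.

    All cost parameters are hypothetical placeholders, not values from the slides.
    """
    if not 0.0 <= p_event <= 1.0:
        raise ValueError("p_event must be a probability")
    if not 0.0 <= relative_risk_reduction <= 1.0:
        raise ValueError("relative_risk_reduction must lie in [0, 1]")
    if treat:
        # Treatment is assumed to scale the event probability down.
        p_treated = p_event * (1.0 - relative_risk_reduction)
        return treatment_cost + p_treated * event_cost
    return p_event * event_cost


def should_treat(p_event: float, event_cost: float, treatment_cost: float,
                 relative_risk_reduction: float) -> bool:
    """Treat only when treating has the lower expected cost."""
    cost_if_treated = expected_cost(p_event, event_cost, treatment_cost,
                                    relative_risk_reduction, treat=True)
    cost_if_untreated = expected_cost(p_event, event_cost, treatment_cost,
                                      relative_risk_reduction, treat=False)
    return cost_if_treated < cost_if_untreated


if __name__ == "__main__":
    # Hypothetical costs; risks taken from the base rates quoted above.
    for risk in (0.23, 0.06):
        print(risk, should_treat(risk, event_cost=100_000.0,
                                 treatment_cost=5_000.0,
                                 relative_risk_reduction=0.3))
```

Under these made-up costs the higher-risk patient is treated and the lower-risk patient is not, which is the kind of threshold behaviour a risk-plus-cost tool would expose to a clinician.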
Risk Assessment Tool: example
» Young, predictor not observed – don’t treat
» Old, predictor not observed – treat
» Young, predictor observed – don’t treat
» Not so old, predictor not observed – treat

CaMML: a causal learner
» Developed at Monash University
» Data mines BNs from epidemiological data
» Minimum message length (MML) metric: trades off complexity vs goodness of fit
» MCMC search over model space

CaMML: example BN
[Figures: two example BNs]

Evaluation
Predicting 10-year risk of CHD using Busselton data
Split data 90-10 training/testing
10-fold cross validation
Metrics:
» Predictive Accuracy
» ROC Curves (area under curve): correct classification vs false positives
» Bayesian Information Reward (BIR)
Using Weka: Java environment for machine learning tools and techniques

Predictive accuracy
» Examine each joint observation in the sample
» Add any available evidence for the other nodes
» Update the network
» Use value with highest probability as predicted value
» Compare predicted value with the actual value

Information Reward
» Rewards calibration of probabilities
» Zero reward for just reporting priors
» Unbounded below for a bad prediction
» Bounded above by a maximum that depends on priors
    IR = 0
    Repeat for each state i:
        If i == correct state
            IR += log( p[i] / prior[i] )
        else
            IR += log( (1 - p[i]) / (1 - prior[i]) )

Experimental Evaluation
Experiment 1:
» Compare Busselton, PROCAM and CaMML BNs
Experiment 2:
» Compare CaMML and other standard machine learners (from Weka)

Evaluation: Weka learners
» Naïve Bayes
» J48 (version of C4.5)
» CaMML – Causal BN learner, using MML metric
» AODE
» TAN
» Logistic
[Figure labels: Pr=1/3, Pr=1/3, Pr=1/3]

Experiment 1: ROC Results
[Figure: ROC curves. Annotations: Area under curve (AUC); priors; extremes: "Everyone at risk!" and "No-one at risk!"]

Experiment 2: ROC Results
[Figure: ROC curves]

Experiment 2: Bayesian Info Reward
[Figure: BIR results]

Summary of Results
Experiment I (Models of whole data)
PROCAM model does at least as well as Busselton
» On Busselton data
» For both "relative" (ROC) and "absolute" (BIR) risk
CaMML models do as well
» But much simpler: only 4 nodes matter to CHD10!
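The "absolute" (BIR) measure used in these results can be sketched in code. This is a minimal sketch under stated assumptions, not the authors' implementation: it assumes each term is taken relative to the prior (which is what makes reporting the priors score zero), uses base-2 logarithms, and sums over states for a single test case. The function name `information_reward` is hypothetical.

```python
import math
from typing import Sequence


def information_reward(probs: Sequence[float], priors: Sequence[float],
                       correct: int) -> float:
    """Information reward for one prediction over the states of one variable.

    probs[i]  : predicted probability of state i
    priors[i] : prior probability of state i (assumed strictly between 0 and 1)
    correct   : index of the state that actually occurred
    """
    if len(probs) != len(priors):
        raise ValueError("probs and priors must have the same length")
    if not 0 <= correct < len(probs):
        raise ValueError("correct must index a state")
    if any(not 0.0 < q < 1.0 for q in priors):
        raise ValueError("priors must lie strictly between 0 and 1")

    reward = 0.0
    for i, (p, prior) in enumerate(zip(probs, priors)):
        if i == correct:
            numerator, denominator = p, prior
        else:
            numerator, denominator = 1.0 - p, 1.0 - prior
        if numerator <= 0.0:
            # A certain but wrong prediction: the reward is unbounded below.
            return -math.inf
        reward += math.log2(numerator / denominator)
    return reward


if __name__ == "__main__":
    priors = [0.8, 0.2]  # hypothetical priors for (no event, event)
    print(information_reward(priors, priors, correct=1))      # reporting priors
    print(information_reward([0.3, 0.7], priors, correct=1))  # informative, right
    print(information_reward([0.99, 0.01], priors, correct=1))  # confident, wrong
```

Averaging this quantity over all test cases gives a per-case score that can be compared across learners; whether to also normalise by the number of states is a design choice this sketch leaves out.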
Experiment II (Cross-validation of learners)
Logistic regression does best on both metrics
» Statistically powerful: only 1 parameter per arc
» No search required: structure is given
» No discretization necessary

Conclusions
Busselton & PROCAM models appear to perform equally well on Busselton data, using an absolute risk measure (BIR) from the literature
CaMML results suggest the data have high variance and are too weak to support inference to complex models. Combining data would help.

Future directions
Improve data mining by
» Adding prior knowledge to search
» Assessing whether data sources can be combined; if so, do so
Investigate combination of continuous and discrete variables in data mining and modeling
Develop new TAKEHEART model using BNs (taking the best from experts, literature, data mining)
» with intervention modeling (Causal Reckoner)
» with decision support
» with GUI, usable by clinicians

References
G. Assmann, P. Cullen and H. Schulte. Simple scoring scheme for calculating the risk of acute coronary events based on the 10-year follow-up of the Prospective Cardiovascular Munster (PROCAM) study. Circulation, 105(3):310-315, 2002.
M.W. Knuiman, H.T. Vu and H.C. Bartholomew. Multivariate risk estimation for coronary heart disease: the Busselton Health Study. Australian & New Zealand Journal of Public Health, 22:747-753, 1998.
C.S. Wallace and K.B. Korb. Learning Linear Causal Models by MML Sampling. In A. Gammerman, editor, Causal Models and Intelligent Data Management, pages 89-111. Springer-Verlag, 1999.
www.datamining.monash.edu.au/software/camml
C.R. Twardy, A.E. Nicholson, K.B. Korb and J. McNeil. Data Mining Cardiovascular Bayesian Networks. Technical Report 2004/165, School of Computer Science and Software Engineering, Monash University, 2004.
C.R. Twardy, A.E. Nicholson and K.B. Korb. Knowledge engineering cardiovascular Bayesian networks from the literature. Technical Report 2005/170, School of CSSE, Monash University, 2005.