Algorithms to Investigate Causal Paths to Explain the Incidence of Cardiovascular Disease

Simon Thornley, MPH, MBChB, FAFPHM
Professional Teaching Fellow, Research Fellow, PhD candidate
The University of Auckland, New Zealand

Summary
• Background to study
• Directed Acyclic Graphs (DAGs)
– What are they?
– What can they be used for?
• How do computers draw DAGs?
• A look at a case study including risk factors for CVD…

My PhD
• Cardiovascular risk prediction
– Screen healthy adults
– Put high risk ones on drugs
– Distortion of natural history of disease
– How to deal with it when analysing CVD risk?

Primary prevention
• In the 70s, risk factors identified for the treatment of CVD, from cohort studies.
– Raised blood pressure
– Diabetes status
– Cigarette smoking
– LDL cholesterol level
– Age
• Targets for drug treatment.

Assumption
• Not just risk factors, but on the causal pathway to disease.
• Are they canaries or the miner??

Drug treatment: a summary
[Figure: summary measure of effect (OR/HR/RR) by drug type; above 1.0 = harm, 1.0 = no effect, below 1.0 = benefit]

Drug effects in observational studies
• Being on a drug indicates ↑, rather than ↓ risk, after adjustment for all other factors??!!!
• Explanations:
– Unmeasured confounding
– Measurement error
– Drug does harm
For example: Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P. Derivation and validation of QRISK, a new cardiovascular disease risk score for the United Kingdom: prospective open cohort study. BMJ 2007;335(7611):136.

Sydney: Professorial fellow
• “I've worked a lot with blood pressure epidemiology, and blood pressure-lowering drug use is always associated with higher risk in all observational studies…”
• “That is because people who get treated differ from those who don't in too many respects to be able to capture post-hoc. That's why observational studies can never replace randomised trials….”
• “Estimating causal effect [sic] can only be attempted under very special circumstances in observational studies.”

Continued
• After much flogging of the analyst…
• [If you followed my advice about the design of the study]…
• “you would probably find some evidence of a protective effect of statins (unless all RCTs of statins are wrong)”

Statistics and causality
Statistics
• Assesses parameters of a distribution from samples.
• Infers associations
• Estimates probabilities of past and future events...
• If... experimental conditions remain the same.
Causal analysis
• Infers probabilities under conditions that are changing
– e.g. treatments or interventions

The problem: variable selection
• Association with outcome
– Based on relationship with outcome variable (p-value)
• Minimising information metric (AIC, BIC, Mallows' Cp)
– fit of data to model; ↑ joint probability of data given model, penalised for model complexity
• Causal relationship
– What about causal relationships between variables?
– Confounding: “shared common cause of exposure and disease”.

What are DAGs?
• Graphic: A picture of nodes (variables) and arcs or edges (causal influence)
• Directed: directed causal effects shown
• Acyclic: No arrows from ‘effects’ to ‘causes’

Why use DAGs?
• Encode expert knowledge
• Make assumptions about research question explicit; allow debate
• Link causal to statistical model for causal inference
• “What could give rise to an observed association between exposure and disease?”

What do we use DAGs for?
EXPLAINING OBSERVED ASSOCIATIONS

Confounding
• E and D share a common cause (confounding)
[Diagram: Exposure ← Confounder → Disease]

Collider
• Induced by conditioning on common effect of Exposure and Disease (e.g. selection bias, collider).
[Diagram: Exposure → Hospitalisation ← Disease]

True causal association?
[Diagram: Exposure → Disease]

Researcher drawn DAG: Serum urate and CVD
[Diagram: nodes Diabetes, Creatinine, HbA1c, BP meds, Obesity, Sex, BP t-1, BP, Nutrition, Propensity to take preventive treatment, Urate, Gout, HDL, Trigs, LDL t-1, Ethnic group, Statin therapy, CVD, HDL, Trigs, LDL t, Smoking]

A computer can do it for us…
• Several algorithms available (from computer science, artificial intelligence).
• Starts with Chi-square tests of independence
– Conditional tests (similar to Mantel-Haenszel test)

Aim
• Use algorithm to draw DAG for variables used to assess CVD risk
• Inform structure of regression model for causal enquiry and prediction

Technical details may induce somnolence, so do not attempt to drive or operate large machinery after listening to this section.
HOW THE ARTIFICIAL INTELLIGENCE ALGORITHM WORKS…

Chi-square tests
• Null: P(smoke, CVD) = P(smoke)P(CVD)
– No relationship
• Alt: P(smoke, CVD) ≠ P(smoke)P(CVD)
– Yes, a relationship exists (association)
• The Chi-square distribution gives the distribution of the test statistic assuming independence (null); if the observed statistic is in the tail (P<0.05), then assume the null is false.

Conditional Chi-square test
• Null: P(smoke, CVD|age) = P(smoke|age) P(CVD|age)
– No relationship
• Alt: P(smoke, CVD|age) ≠ P(smoke|age) P(CVD|age)
– Yes, a relationship exists (causal, if alternative hypothesis supported for all subsets of conditioning variables).
• Equivalent to MH chi-squared test…

Simplified
THE ALGORITHM

1: Determine causal neighbours
• Start with arcs (dependence) between all variables
• Let set of variables = U
• For each pair of nodes X (e.g. smoke) and Y (e.g. CVD), determine if X is independent of Y, given any subset of U (excluding X and Y).
• If so, drop the edge between X and Y
• Repeat for all pairs

2: Causal direction of triplets
• Find colliders:
– For each triplet X, Y, Z, if X – Z and Y – Z, but not X – Y (X – Z – Y), and if for all subsets S of U-{X,Y,Z}, X is dependent on Y|(S ∪ {Z}), then orient the arcs so that X → Z ← Y.
• Repeat for all triplets.
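Sketch: the tests behind steps 1 and 2
The independence tests and the edge-dropping step above can be sketched in a few lines of Python. This is only an illustration, not the bnlearn implementation. It assumes an asymptotic chi-square p-value (the worked example later uses a Monte-Carlo version), sums the statistic over strata of the conditioning variables, and searches all subsets exhaustively. The function names (chi2_sf, ci_test, skeleton) and the toy counts are hypothetical, chosen so that age drives both smoking and CVD.

```python
import math
from collections import Counter, defaultdict
from itertools import combinations


def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variable with integer df >= 1."""
    if x <= 0:
        return 1.0
    h = x / 2.0
    if df % 2 == 0:
        a, q = 1.0, math.exp(-h)               # Q(1, h)
    else:
        a, q = 0.5, math.erfc(math.sqrt(h))    # Q(1/2, h)
    while a < df / 2.0:
        # Recurrence: Q(a + 1, h) = Q(a, h) + h^a e^(-h) / Gamma(a + 1)
        q += math.exp(a * math.log(h) - h - math.lgamma(a + 1.0))
        a += 1.0
    return min(q, 1.0)


def ci_test(rows, x, y, cond=()):
    """P-value for 'x independent of y given cond' (stratified chi-square)."""
    strata = defaultdict(list)
    for r in rows:
        strata[tuple(r[c] for c in cond)].append(r)
    stat, df = 0.0, 0
    for group in strata.values():
        n = len(group)
        cell = Counter((r[x], r[y]) for r in group)
        rx = Counter(r[x] for r in group)
        ry = Counter(r[y] for r in group)
        for a in rx:
            for b in ry:
                expected = rx[a] * ry[b] / n
                stat += (cell[(a, b)] - expected) ** 2 / expected
        df += (len(rx) - 1) * (len(ry) - 1)
    return chi2_sf(stat, df) if df > 0 else 1.0


def skeleton(rows, variables, alpha=0.05):
    """Step 1: start fully connected, drop X - Y if any subset separates them."""
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        subsets = (s for k in range(len(others) + 1)
                   for s in combinations(others, k))
        if any(ci_test(rows, x, y, s) > alpha for s in subsets):
            edges.discard(frozenset((x, y)))
    return edges


def make_rows():
    """Hypothetical counts: smoke and cvd are independent within each age group."""
    counts = {(0, 1, 1): 2, (0, 1, 0): 18, (0, 0, 1): 8, (0, 0, 0): 72,
              (1, 1, 1): 30, (1, 1, 0): 30, (1, 0, 1): 20, (1, 0, 0): 20}
    rows = []
    for (age, smoke, cvd), n in counts.items():
        rows += [{"age": age, "smoke": smoke, "cvd": cvd}] * n
    return rows


if __name__ == "__main__":
    data = make_rows()
    print("marginal p:", ci_test(data, "smoke", "cvd"))
    print("p given age:", ci_test(data, "smoke", "cvd", ("age",)))
    print("edges kept:", sorted(sorted(e) for e in skeleton(data, ["age", "smoke", "cvd"])))
```

In the toy data, smoking and CVD are associated marginally but not within age strata, so the smoke – CVD edge is dropped while both edges to age are kept. Practical implementations usually limit conditioning sets to current neighbours rather than trying every subset.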
3: Avoid cycles
• Then orientate other edges so as not to introduce cycles (‘effect’ causes ‘cause’)
• Note – not all directions may be determined, since X → Y → Z and X ← Y → Z are equivalent patterns of conditional dependence.

AI and CVD risk prediction.
A WORKED EXAMPLE

Predict cohort study
• Population
– 30 to 80 year old patients
– free of CVD and heart failure
– CVD risk assessment at GP between ‘06 to ‘09
– At least 2 years of follow-up

Variables
• Combined CVD events (death or hospital admission)
– Cumulative incidence
• Age-at-enrolment
• Sex
• Diabetes
• Smoking
• Ethnic group
• Statin and antihypertensive drug use
• Systolic blood pressure
• Family history
• Total to high-density-lipoprotein cholesterol ratio

Software
• bnlearn with R (M. Scutari)
• False positive proportion: 5%
• Tests option:
– Monte-Carlo chi-square, due to small cell counts
• Categorical data only:
– Continuous variables categorised into deciles.

Banned list
• Sex, ethnic group and age must not be caused by any other variable.
• Family history must not be caused by drug treatment variables.
• The outcome, fatal and nonfatal CVD, must not cause any other variable.

The populations
CRUDE ASSOCIATIONS WITH CVD

Variable | CVD (col%) | No CVD (col%) | Total (col%) | Test | P-value
Total | 101 | 6155 | 6256 | |
Gender: Men | 61 (60.4) | 3395 (55.2) | 3456 (55.2) | Chisq. | 0.343
Age at enrolment: Mean (SD) | 61.7 (10.2) | 54.1 (10.5) | 54.2 (10.5) | T-test | < 0.001
Ethnic group: Other | 62 (61.4) | 4348 (70.6) | 4410 (70.5) | Fisher’s exact | 0.036
Ethnic group: Maori | 22 (21.8) | 826 (13.4) | 848 (13.6) | |
Ethnic group: Pacific | 16 (15.8) | 773 (12.6) | 789 (12.6) | |
Ethnic group: Indian | 1 (1.0) | 208 (3.4) | 209 (3.3) | |
Smoking status: Yes | 28 (27.7) | 1082 (17.6) | 1110 (17.7) | Chisq. | 0.012
Systolic blood pressure (mmHg): Median (IQR) | 140 (130, 150) | 130 (120, 142) | 130 (120, 143) | Rank sum test | < 0.001
Diagnosis of diabetes? Yes | 24 (23.8) | 896 (14.6) | 920 (14.7) | Chisq. | 0.0143
Total to HDL-cholesterol ratio: Median (IQR) | 3.7 (3.1, 4.8) | 3.8 (3.1, 4.7) | 3.8 (3.1, 4.7) | Rank sum | 0.744
Statin treatment at baseline? Yes | 20 (19.8) | 860 (14.0) | 880 (14.1) | Chisq. | 0.127
Antihypertensive treatment at baseline? Yes | 48 (47.5) | 1637 (26.6) | 1685 (26.9) | Chisq. | < 0.001

LET RIP!

The DAG…
[Diagram: DAG learned from the data; nodes Ethnic group, Systolic blood pressure, Sex, CVD, Diabetes, Age, Family history of CVD, TC:HDL ratio, Smoking status, Statin use, Antihypertensive use]

Use regression
HOW STRONG ARE THE ARCS?

Arc strength
• Software reports p-values
– ✗ Dependent on sample size
• Instead use regression.
– Cause = independent var. (x)
– Effect = dependent var. (y)
– If effect binary: logistic
– Derive odds ratios
– If effect continuous: linear
– For continuous vars: compare 16th and 84th centiles (≈ binary var. comparison)
– Adjust for confounders and effect modifiers (e.g. age)

Arc strength

Cause | Effect | Low | High | Beta-coeff. (95% CI) | Odds ratio (95% CI)
Age | CVD | 43.4 | 65.2 | 1.54 (1.11 to 1.97) | 4.65 (3.03 to 7.14)
Age | Statin use | 43.4 | 65.2 | 0.84 (0.69 to 0.99) | 2.31 (1.99 to 2.69)
Age | Anti-hypertensive | 43.4 | 65.2 | 1.44 (1.31 to 1.57) | 4.23 (3.72 to 4.82)
Age | Family history of CVD | 43.4 | 65.2 | -0.31 (-0.43 to -0.20) | 0.73 (0.65 to 0.82)
Age | Systolic blood pressure | 43.4 | 65.2 | 10.42 (9.5 to 11.34) | N/A
Ethnic group (adj. for age) | Smoker | Other | Indian | -0.66 (-1.16 to -0.17) | 0.51 (0.31 to 0.84)
Ethnic group (adj. for age) | Smoker | Other | Maori | 1.19 (1.02 to 1.35) | 3.28 (2.78 to 3.88)
Ethnic group (adj. for age) | Smoker | Other | Pacific | 0.64 (0.46 to 0.83) | 1.91 (1.58 to 2.30)
Ethnic group (adj. for age) | Family history of CVD | Other | Indian | 0.02 (-0.28 to 0.32) | 1.02 (0.75 to 1.37)
Ethnic group (adj. for age) | Family history of CVD | Other | Maori | -0.24 (-0.41 to -0.07) | 0.79 (0.67 to 0.93)
Ethnic group (adj. for age) | Family history of CVD | Other | Pacific | -1.03 (-1.24 to -0.82) | 0.36 (0.29 to 0.44)
Diabetes (adj. for age) | Statin use | No | Yes | 1.94 (1.77 to 2.10) | 6.94 (5.90 to 8.16)
Diabetes (adj. for age) | Antihypertensive use | No | Yes | 1.68 (1.53 to 1.84) | 5.38 (4.60 to 6.28)
Statin use (adj. for age) | Antihypertensive use | No | Yes | 1.70 (1.55 to 1.86) | 5.49 (4.69 to 6.42)
Antihypertensive (adj. for age) | Systolic blood pressure | No | Yes | 7.30 (6.28 to 8.33) | N/A
Ethnic group (adj. for age) | Diabetes | Other | Indian | 1.89 (1.57 to 2.21) | 6.64 (4.83 to 9.12)
Ethnic group (adj. for age) | Diabetes | Other | Maori | 1.21 (1.01 to 1.41) | 3.36 (2.74 to 4.11)
Ethnic group (adj. for age) | Diabetes | Other | Pacific | 2.04 (1.85 to 2.22) | 7.65 (6.36 to 9.22)
Smoker (no adj.) | CVD | No | Yes | 0.59 (0.15 to 1.03) | 1.80 (1.16 to 2.79)
Smoker (no adj.) | Total:HDL-cholesterol ratio | No | Yes | 0.51 (0.43 to 0.59) | N/A
Sex (adj. for age) | TC:HDL | Female | Male | 0.55 (0.49 to 0.61) | N/A

So… what?
• DAG seems plausible
– Cigarette smoking and age only causal influences on CVD.
– Many causal influences on drug use
– Drugs do not influence CVD risk?
– Cigarette smoking mediator of ethnic group effects
• Is researcher drawn DAG compatible with data?
– If not, why not?
• Only age and smoking necessary to adjust for when testing causal hypotheses?

Barren proxy
• Variable that has no influence on exposure and outcome (not true confounder), but influenced by (proxy for) one.
• Here, TC:HDL ratio, when considering smoking → CVD relationship
[Diagram: TC/HDL ← Smoke → CVD]

Limitations
• Limited by sample size
– type-2 error rate likely to be high (e.g. only 101 CVD events).
– 5% type-1 error rate.
– With this algorithm, early errors in statistical tests can propagate through algorithm.
• Cross-sectional relationships
– may be prone to survival bias.
• Assumptions:
– No hidden or latent variables, independent subject data, no errors in tests.

‘Assumption free’ regression
[Diagram: nodes Diabetes, Age, Blood pressure, CVD, TC:HDL, Gender, Smoking]

Summary
• DAGs are useful when considering variable selection for regression modelling
• Possible to draw DAGs either from data or from informed scientific knowledge.
• Useful to compare researcher drawn DAG with that from data.
• Can help ‘visualise’ relationships between variables.
• Software available and relatively easy to use.

THE STORY CONTINUES…
“The main reason we take so many drugs is that drug companies don’t sell drugs, they sell lies about drugs. This is what makes drugs so different from anything else in life… Virtually everything we know about drugs is what the companies have chosen to tell us and our doctors…”

Publication bias
• Ioannidis JPA, Trikalinos TA. An exploratory test for an excess of significant findings. Clin. Trials 2007;4(3):245-53.
• Calculate expected number of positive studies, given:
– Sample size of individual studies
– Number of events in controls
– Summary effect (assumed true)

Statin meta-analysis

Further reading
• Pearl, Judea (2010) "An Introduction to Causal Inference," The International Journal of Biostatistics: Vol. 6: Iss. 2, Article 7.
• DOI: 10.2202/1557-4679.1203
• Available at: http://www.bepress.com/ijb/vol6/iss2/7