Algorithms to Investigate Causal Paths to Explain the Incidence of Cardiovascular Disease
Simon Thornley, MPH, MBChB, FAFPHM.
Professional Teaching Fellow, Research Fellow,
PhD candidate.
The University of Auckland, New Zealand.
Summary
• Background to study
• Directed Acyclic Graphs (DAGs)
– What are they?
– What can they be used for?
• How do computers draw DAGs?
• A look at a case study including risk factors for
CVD…
My PhD
• Cardiovascular risk prediction
– Screen healthy adults
– Put high risk ones on drugs
– Distortion of natural history of disease
– How to deal with it when analysing CVD risk?
Primary prevention
• In the 70s, risk factors identified for the
treatment of CVD, from cohort studies.
– Raised blood pressure
– Diabetes status
– Cigarette smoking
– LDL cholesterol level
– Age
• Targets for drug treatment.
Assumption
• Not just risk factors, but on the causal
pathway to disease.
• Are they the canaries or the miners?
Drug treatment: a summary
[Figure: summary measure of effect (OR/HR/RR) plotted by drug type; the vertical axis runs from 0.5 (benefit) through 1.0 (no effect) to 1.5 (harm).]
Drug effects in observational studies
• Being on a drug indicates higher (↑), rather than lower (↓),
risk, after adjustment for all other measured factors?!
• Explanations:
– Unmeasured confounding
– Measurement error
– Drug does harm
For example: Hippisley-Cox J, Coupland C, Vinogradova Y, Robson J, May M, Brindle P.
Derivation and validation of QRISK, a new cardiovascular disease risk score for the
United Kingdom: prospective open cohort study. BMJ 2007;335(7611):136.
Sydney: Professorial fellow
• “I've worked a lot with blood pressure epidemiology,
and blood pressure-lowering drug use is always
associated with higher risk in all observational
studies…”
• “That is because people who get treated differ from
those who don't in too many respects to be able to
capture post-hoc. That's why observational studies can
never replace randomised trials…”
• “Estimating causal effect [sic] can only be attempted
under very special circumstances in observational
studies.”
Continued
• After much flogging of the analyst…
• [If you followed my advice about the design of
the study]…
• “you would probably find some evidence of a
protective effect of statins (unless all RCTs of
statins are wrong)”
Statistics and causality
Statistics
• Estimates parameters of a
distribution from samples.
• Infers associations
• Estimates probabilities of
past and future events...
• If... experimental conditions
remain the same.
Causal analysis
• Infers probabilities under
conditions that are changing
– e.g. treatments or
interventions
The problem: variable selection
• Association with outcome
– Based on relationship with outcome variable (p-value)
• Minimising information metric (AIC, BIC, Mallows's Cp)
– fit of data to model; ↑ joint probability of data given
model, penalised for model complexity
• Causal relationship
– What about causal relationships between variables?
– Confounding: “shared common cause of exposure and
disease”.
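The information metrics mentioned above can be sketched in a few lines of Python. The log-likelihoods and parameter counts below are hypothetical placeholders, not results from this study; only the sample size (6256) comes from the cohort described later. The sketch shows that these criteria rank models purely on fit and complexity and carry no information about causal structure.

```python
import math

def aic(log_likelihood: float, n_params: int) -> float:
    """Akaike information criterion: lower is better."""
    return 2 * n_params - 2 * log_likelihood

def bic(log_likelihood: float, n_params: int, n_obs: int) -> float:
    """Bayesian information criterion: heavier penalty per parameter."""
    return n_params * math.log(n_obs) - 2 * log_likelihood

# Hypothetical fitted models: (maximised log-likelihood, number of parameters).
candidates = {
    "age + smoking": (-480.0, 3),
    "age + smoking + statin + SBP": (-478.5, 5),
}
n_obs = 6256  # cohort size reported in the worked example

scores = {
    name: (aic(ll, k), bic(ll, k, n_obs))
    for name, (ll, k) in candidates.items()
}
best_by_aic = min(scores, key=lambda name: scores[name][0])
best_by_bic = min(scores, key=lambda name: scores[name][1])
```

Neither score asks whether a variable is a confounder, a mediator or a collider, which is the gap the DAG is meant to fill.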
What are DAGs?
• Graph: a picture of nodes (variables) and
arcs or edges (causal influence)
• Directed: each arc has an arrow pointing from cause to effect
• Acyclic: no path of arrows leads from an ‘effect’ back to its ‘cause’
Why use DAGs?
• Encode expert knowledge
• Make assumptions about research question
explicit; allow debate
• Link causal to statistical model for causal
inference
• “What could give rise to an observed association
between exposure and disease?”
What do we use DAGs for?
EXPLAINING OBSERVED
ASSOCIATIONS
Confounding
• E and D share a common cause (confounding)
[Diagram: Confounder → Exposure; Confounder → Disease]
Collider
• Induced by conditioning on common effect of
Exposure and Disease (e.g. selection bias,
collider).
[Diagram: Exposure → Hospitalisation ← Disease]
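A small exact calculation illustrates the collider slide. All probabilities here are hypothetical and chosen only for illustration: exposure and disease are independent in the population, but both raise the chance of hospitalisation. Restricting to hospitalised patients (conditioning on the common effect) produces an association that does not exist in the full population.

```python
from itertools import product

# Hypothetical marginal probabilities; exposure and disease are independent.
p_exposure = 0.3
p_disease = 0.2

def p_hospital(exposed: int, diseased: int) -> float:
    """Hypothetical chance of hospitalisation, raised by either factor."""
    return 0.05 + 0.40 * exposed + 0.40 * diseased

def odds_ratio(cells: dict) -> float:
    """Odds ratio from a dict keyed by (exposed, diseased)."""
    return (cells[(1, 1)] * cells[(0, 0)]) / (cells[(1, 0)] * cells[(0, 1)])

population = {}
hospitalised = {}
for e, d in product((0, 1), repeat=2):
    p_e = p_exposure if e else 1 - p_exposure
    p_d = p_disease if d else 1 - p_disease
    population[(e, d)] = p_e * p_d
    # Unnormalised mass among the hospitalised; the odds ratio is scale free.
    hospitalised[(e, d)] = p_e * p_d * p_hospital(e, d)

or_population = odds_ratio(population)      # no association
or_hospitalised = odds_ratio(hospitalised)  # spurious inverse association
```

With these numbers the population odds ratio is 1, while among the hospitalised it falls to about 0.21.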
True causal association?
[Diagram: Exposure → Disease]
Researcher drawn DAG: Serum urate and CVD
[Diagram nodes: Diabetes, Creatinine, HbA1c, BP meds, Obesity, Sex, BP(t-1), BP, Nutrition, Propensity to take preventive treatment, Urate, Gout, HDL, Trigs, LDL(t-1), LDL(t), Ethnic group, Statin therapy, Smoking, CVD. HDL and Trigs each appear twice; arrows not shown.]
A computer can do it for us…
• Several algorithms available (from computer
science, artificial intelligence).
• Starts with Chi-square tests of independence
– Conditional tests (similar to Mantel-Haenszel test)
Aim
• Use algorithm to draw DAG for variables used
to assess CVD risk
• Inform structure of regression model for
causal enquiry and prediction
Technical details may induce somnolence, so do not attempt to drive or
operate large machinery after listening to this section.
HOW THE ARTIFICIAL INTELLIGENCE
ALGORITHM WORKS…
Chi-square tests
• Null: P(smoke, CVD) = P(smoke)P(CVD)
– No relationship
• Alt: P(smoke, CVD) ≠ P(smoke)P(CVD)
– Yes, a relationship exists (association)
• The chi-square distribution describes the test
statistic assuming independence (null); if the observed
statistic lies in the upper tail (P<0.05), reject the null.
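As a minimal sketch of the test above, the following Python computes a chi-square statistic for a 2×2 table using the smoking counts from the crude-association table later in the deck (28 smokers among 101 CVD cases, 1082 among 6155 without CVD). The non-smoker counts are obtained by subtraction, and the Yates continuity correction is an assumption; the slides do not say which variant was used.

```python
import math

def chi_square_2x2(a: int, b: int, c: int, d: int, yates: bool = True):
    """Chi-square test of independence for the table [[a, b], [c, d]].

    Returns (statistic, p_value) with 1 degree of freedom.
    """
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        diff = max(0.0, diff - n / 2)  # continuity correction
    stat = n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    # Upper tail of a chi-square with 1 df equals erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Rows: CVD / no CVD. Columns: smoker / non-smoker.
smokers_cvd, total_cvd = 28, 101
smokers_no_cvd, total_no_cvd = 1082, 6155

stat, p_value = chi_square_2x2(
    smokers_cvd, total_cvd - smokers_cvd,
    smokers_no_cvd, total_no_cvd - smokers_no_cvd,
)
```

Under these assumptions the p-value comes out close to the 0.012 reported for smoking status, so the null of independence between smoking and CVD would be rejected at the 5% level.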
Conditional Chi-square test
• Null: P(smoke, CVD|age) = P(smoke|age) P(CVD|age)
– No relationship within levels of age
• Alt: P(smoke, CVD|age) ≠ P(smoke|age) P(CVD|age)
– Yes, a relationship exists (treated as causal if the alternative hypothesis
is supported for all subsets of conditioning variables).
• Equivalent to the Mantel-Haenszel (MH) chi-squared test…
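The conditional test can be sketched with a Cochran-Mantel-Haenszel statistic over age strata (no continuity correction). The counts are hypothetical and constructed so that smoking and CVD are exactly independent within each age band, yet strongly associated when the bands are pooled; they are not data from this study.

```python
def cmh_statistic(strata):
    """Cochran-Mantel-Haenszel chi-square (1 df) for a list of 2x2 tables.

    Each table is (a, b, c, d) = (smoker & CVD, smoker & no CVD,
    non-smoker & CVD, non-smoker & no CVD) within one stratum.
    """
    observed = expected = variance = 0.0
    for a, b, c, d in strata:
        n = a + b + c + d
        observed += a
        expected += (a + b) * (a + c) / n
        variance += (a + b) * (c + d) * (a + c) * (b + d) / (n ** 2 * (n - 1))
    return (observed - expected) ** 2 / variance

# Hypothetical strata: older people smoke more AND have more CVD.
young = (5, 95, 20, 380)     # CVD risk 5% in smokers and non-smokers
old = (160, 240, 40, 60)     # CVD risk 40% in smokers and non-smokers

conditional_stat = cmh_statistic([young, old])

# Ignoring age: collapse the two strata into one table.
pooled = tuple(y + o for y, o in zip(young, old))
marginal_stat = cmh_statistic([pooled])
```

Here the pooled statistic is far above the 5% critical value of 3.84, while the age-conditional statistic is zero: the crude association is fully explained by age.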
Simplified
THE ALGORITHM
1: Determine causal neighbours
• Start with arcs (dependence) between all
variables
• Let set of variables = U
• For each pair of nodes X (e.g. smoke) and Y
(e.g. CVD), test whether X is independent of Y
given at least one subset of the other variables, U − {X, Y}.
• If so, drop the edge between X and Y
• Repeat for all pairs
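Step 1 can be sketched as follows. The independence "oracle" is a hand-written lookup standing in for the chi-square tests, and it encodes a hypothetical four-variable structure (ethnic → smoke → cvd ← age) chosen for illustration; it is not the learned DAG from this study, nor bnlearn's implementation.

```python
from itertools import combinations

variables = ["ethnic", "smoke", "age", "cvd"]

# Hypothetical test results: pairs found independent given these sets.
INDEPENDENT = {
    frozenset({"ethnic", "cvd"}): [{"smoke"}, {"smoke", "age"}],
    frozenset({"ethnic", "age"}): [set(), {"smoke"}, {"smoke", "cvd"}],
    frozenset({"smoke", "age"}): [set(), {"ethnic"}],
}

def is_independent(x, y, given):
    """Stand-in for a (conditional) chi-square test of independence."""
    return set(given) in INDEPENDENT.get(frozenset({x, y}), [])

def find_skeleton(variables, is_independent):
    """Start fully connected; drop X-Y if some subset separates them."""
    edges = {frozenset(pair) for pair in combinations(variables, 2)}
    separating_sets = {}
    for x, y in combinations(variables, 2):
        others = [v for v in variables if v not in (x, y)]
        for size in range(len(others) + 1):
            found = next(
                (set(s) for s in combinations(others, size)
                 if is_independent(x, y, s)),
                None,
            )
            if found is not None:
                edges.discard(frozenset({x, y}))
                separating_sets[frozenset({x, y})] = found
                break
    return edges, separating_sets

edges, separating_sets = find_skeleton(variables, is_independent)
```

Three undirected edges survive (ethnic-smoke, smoke-cvd, age-cvd); the separating set recorded for each dropped pair is what the next step uses to find colliders.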
2: Causal direction of triplets
• Find colliders:
– For each triplet X, Y, Z with edges X-Z and Y-Z, but no
edge X-Y (X-Z-Y): if, for all subsets S of U − {X, Y, Z}, X is
dependent on Y | (S ∪ {Z}), then orient the arcs so
that X → Z ← Y.
• Repeat for all triplets.
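The collider rule can be sketched in the same style. The skeleton and the independence lookup are hypothetical (the same illustrative structure ethnic → smoke → cvd ← age), and the function applies the rule exactly as worded on the slide rather than any particular library's variant.

```python
from itertools import combinations

variables = ["ethnic", "smoke", "age", "cvd"]

# Hypothetical undirected skeleton left after step 1.
skeleton = {
    frozenset({"ethnic", "smoke"}),
    frozenset({"smoke", "cvd"}),
    frozenset({"age", "cvd"}),
}

# Hypothetical test results: pairs found independent given these sets.
INDEPENDENT = {
    frozenset({"ethnic", "cvd"}): [{"smoke"}, {"smoke", "age"}],
    frozenset({"ethnic", "age"}): [set(), {"smoke"}, {"smoke", "cvd"}],
    frozenset({"smoke", "age"}): [set(), {"ethnic"}],
}

def is_independent(x, y, given):
    """Stand-in for a (conditional) chi-square test of independence."""
    return set(given) in INDEPENDENT.get(frozenset({x, y}), [])

def orient_colliders(variables, skeleton, is_independent):
    """Return arcs (cause, effect) for every X -> Z <- Y that is found."""
    arcs = set()
    for z in variables:
        neighbours = [v for v in variables if frozenset({v, z}) in skeleton]
        for x, y in combinations(neighbours, 2):
            if frozenset({x, y}) in skeleton:
                continue  # X and Y are linked, so the triplet is shielded
            rest = [v for v in variables if v not in (x, y, z)]
            always_dependent = all(
                not is_independent(x, y, set(s) | {z})
                for size in range(len(rest) + 1)
                for s in combinations(rest, size)
            )
            if always_dependent:
                arcs.update({(x, z), (y, z)})
    return arcs

arcs = orient_colliders(variables, skeleton, is_independent)
```

Only smoke → cvd ← age is oriented: ethnic and cvd become independent once smoke is conditioned on, so smoke is not a collider on that path and ethnic-smoke stays undirected.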
3: Avoid cycles
• Then orientate other edges so as not to
introduce cycles (‘effect’ causes ‘cause’)
• Note
– not all directions may be determined, since
X → Y → Z and X ← Y → Z are equivalent patterns of
conditional dependence.
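A minimal sketch of step 3, under simplifying assumptions: it only checks that a candidate direction is not banned and does not close a directed cycle, and leaves an edge unresolved when both directions remain admissible. Real implementations apply further rules (for example, not creating new colliders). The example arcs and the banned pair are hypothetical.

```python
def reaches(arcs, start, goal):
    """Depth-first search: is there a directed path from start to goal?"""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(child for parent, child in arcs if parent == node)
    return False

def orient_without_cycles(arcs, undirected, banned=frozenset()):
    """Orient each undirected edge if exactly one direction is admissible."""
    arcs = set(arcs)
    unresolved = []
    for x, y in undirected:
        options = [
            (a, b) for a, b in ((x, y), (y, x))
            # a -> b closes a cycle if b already leads back to a
            if (a, b) not in banned and not reaches(arcs, b, a)
        ]
        if len(options) == 1:
            arcs.add(options[0])
        else:
            unresolved.append((x, y))
    return arcs, unresolved

# Hypothetical state after the collider step.
collider_arcs = {("smoke", "cvd"), ("age", "cvd")}
banned = {("smoke", "ethnic")}  # ethnic group must not be caused by anything

final_arcs, unresolved = orient_without_cycles(
    collider_arcs, [("ethnic", "smoke")], banned
)
# With no constraints at all, a lone edge cannot be given a direction.
_, ambiguous = orient_without_cycles(set(), [("x", "y")])
```

The second call illustrates the note above: when the data cannot distinguish the two directions, the edge is left undirected.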
AI and CVD risk prediction.
A WORKED EXAMPLE
Predict cohort study
• Population
– 30 to 80 year old patients
– free of CVD and heart failure
– CVD risk assessment at GP between 2006 and 2009
– At least 2 years of follow-up
Variables
• Combined CVD events (death or hospital admission)
– Cumulative incidence
• Age-at-enrolment
• Sex
• Diabetes
• Smoking
• Ethnic group
• Statin and antihypertensive drug use
• Systolic blood pressure
• Family history
• Total to high-density-lipoprotein cholesterol ratio
Software
• bnlearn with R (M. Scutari)
• False positive proportion: 5%
• Tests option:
– Monte-Carlo chi-square, due to small cell counts
• Categorical data only:
– Continuous variables categorised into deciles.
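The decile step can be sketched in Python with the standard library (the analysis itself was run in R). The blood pressure values below are hypothetical, and the exact cut-point convention used in the study is not stated, so the default of `statistics.quantiles` is an assumption.

```python
from bisect import bisect_right
from statistics import quantiles

def to_deciles(values):
    """Map each value to a decile label 0..9 using sample cut points."""
    cut_points = quantiles(values, n=10)  # nine cut points
    return [bisect_right(cut_points, v) for v in values]

# Hypothetical systolic blood pressures, one per patient.
sbp = list(range(101, 201))
sbp_decile = to_deciles(sbp)
```

Each continuous variable then enters the chi-square tests as a ten-level categorical variable, at the cost of discarding information within each decile.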
Banned list
• Sex, ethnic group and age must not be caused
by any other variable.
• Family history must not be caused by drug
treatment variables.
• The outcome, fatal and nonfatal CVD, must
not cause any other variable.
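The banned list amounts to a set of forbidden (cause, effect) pairs. A minimal sketch follows, with variable names invented for illustration; bnlearn accepts such pairs through its blacklist argument, but the code below only builds the list and does not call bnlearn.

```python
# Hypothetical short names for the study variables.
variables = [
    "sex", "ethnic", "age", "family_history", "diabetes", "smoking",
    "statin", "antihypertensive", "sbp", "tc_hdl", "cvd",
]
exogenous = ["sex", "ethnic", "age"]     # must not be caused by anything
drugs = ["statin", "antihypertensive"]
outcome = "cvd"                          # must not cause anything

banned = set()
# No arc may point INTO sex, ethnic group or age.
for target in exogenous:
    banned |= {(source, target) for source in variables if source != target}
# Drug treatment must not cause family history.
banned |= {(drug, "family_history") for drug in drugs}
# No arc may point OUT of the outcome.
banned |= {(outcome, target) for target in variables if target != outcome}
```

The search is then free to choose among the remaining directions, so an arc such as age → CVD stays possible while CVD → age is excluded.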
The populations
CRUDE ASSOCIATIONS WITH CVD
Variable                                      CVD (col%)       No CVD (col%)    Total (col%)     Test stat.       P-value
Total                                         101              6155             6256
Gender: men                                   61 (60.4)        3395 (55.2)      3456 (55.2)      Chisq.           0.343
Age at enrolment: mean (SD)                   61.7 (10.2)      54.1 (10.5)      54.2 (10.5)      T-test           < 0.001
Ethnic group                                                                                     Fisher’s exact   0.036
  Other                                       62 (61.4)        4348 (70.6)      4410 (70.5)
  Maori                                       22 (21.8)        826 (13.4)       848 (13.6)
  Pacific                                     16 (15.8)        773 (12.6)       789 (12.6)
  Indian                                      1 (1.0)          208 (3.4)        209 (3.3)
Smoking status: yes                           28 (27.7)        1082 (17.6)      1110 (17.7)      Chisq.           0.012
Systolic blood pressure (mmHg): median (IQR)  140 (130, 150)   130 (120, 142)   130 (120, 143)   Rank sum test    < 0.001
Diagnosis of diabetes: yes                    24 (23.8)        896 (14.6)       920 (14.7)       Chisq.           0.0143
Total to HDL-cholesterol ratio: median (IQR)  3.7 (3.1, 4.8)   3.8 (3.1, 4.7)   3.8 (3.1, 4.7)   Rank sum         0.744
Statin treatment at baseline: yes             20 (19.8)        860 (14.0)       880 (14.1)       Chisq.           0.127
Antihypertensive treatment at baseline: yes   48 (47.5)        1637 (26.6)      1685 (26.9)      Chisq.           < 0.001
LET RIP!
The DAG…
[Diagram: DAG learned by the algorithm. Nodes: Ethnic group, Sex, Age, Family history of CVD, Diabetes, Smoking status, TC:HDL ratio, Statin use, Antihypertensive use, Systolic blood pressure, CVD. Arrows not shown.]
Use regression
HOW STRONG ARE THE ARCS?
Arc strength
Software reports p-values, but these depend on sample size.
Instead use regression.
• Cause=independent var. (x)
• Effect=dependent var. (y)
• If effect binary: logistic
– Derive odds ratios
• If effect continuous: linear
• For continuous vars: compare
16th and 84th centiles (≈binary
var. comparison)
• Adjust for confounders and
effect modifiers (e.g. age)
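The arithmetic behind this approach can be sketched as follows. For a roughly normal variable, the 16th and 84th centiles sit about one standard deviation either side of the mean, and a logistic beta-coefficient for that contrast exponentiates to an odds ratio. The beta and confidence limits used are those shown for age → statin use in the arc-strength table; the per-year coefficient is derived here only for illustration.

```python
import math
from statistics import NormalDist

# Centiles corresponding to mean - 1 SD and mean + 1 SD of a normal variable.
low_centile = NormalDist().cdf(-1) * 100   # about 16
high_centile = NormalDist().cdf(1) * 100   # about 84

def odds_ratio_between(beta_per_unit: float, low: float, high: float) -> float:
    """Odds ratio comparing `high` with `low` for a per-unit log-odds slope."""
    return math.exp(beta_per_unit * (high - low))

def odds_ratio_from_beta(beta: float, lower: float, upper: float):
    """Exponentiate a log-odds estimate and its confidence limits."""
    return tuple(math.exp(v) for v in (beta, lower, upper))

# Age -> statin use, comparing ages 43.4 and 65.2 (values from the table).
estimate, ci_low, ci_high = odds_ratio_from_beta(0.84, 0.69, 0.99)

# The same contrast written as a hypothetical per-year slope.
beta_per_year = 0.84 / (65.2 - 43.4)
same_estimate = odds_ratio_between(beta_per_year, 43.4, 65.2)
```

Small differences from the tabulated odds ratios are expected because the published beta-coefficients are rounded to two decimal places.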
Arc strength
Cause                              Effect                        Low+     High+     Beta-coeff. (95% CI)       Odds ratio (95% CI)
Age                                CVD                           43.4     65.2      1.54 (1.11 to 1.97)        4.65 (3.03 to 7.14)
Age                                Statin use                    43.4     65.2      0.84 (0.69 to 0.99)        2.31 (1.99 to 2.69)
Age                                Anti-hypertensive             43.4     65.2      1.44 (1.31 to 1.57)        4.23 (3.72 to 4.82)
Age                                Family history of CVD         43.4     65.2      -0.31 (-0.43 to -0.20)     0.73 (0.65 to 0.82)
Age                                Systolic blood pressure       43.4     65.2      10.42 (9.5 to 11.34)       N/A
Ethnic group (adj. for age)        Smoker                        Other    Indian    -0.66 (-1.16 to -0.17)     0.51 (0.31 to 0.84)
                                                                 Other    Maori     1.19 (1.02 to 1.35)        3.28 (2.78 to 3.88)
                                                                 Other    Pacific   0.64 (0.46 to 0.83)        1.91 (1.58 to 2.30)
Ethnic group (adj. for age)        Family history of CVD         Other    Indian    0.02 (-0.28 to 0.32)       1.02 (0.75 to 1.37)
                                                                 Other    Maori     -0.24 (-0.41 to -0.07)     0.79 (0.67 to 0.93)
                                                                 Other    Pacific   -1.03 (-1.24 to -0.82)     0.36 (0.29 to 0.44)
Ethnic group (adj. for age)        Diabetes                      Other    Indian    1.89 (1.57 to 2.21)        6.64 (4.83 to 9.12)
                                                                 Other    Maori     1.21 (1.01 to 1.41)        3.36 (2.74 to 4.11)
                                                                 Other    Pacific   2.04 (1.85 to 2.22)        7.65 (6.36 to 9.22)
Diabetes (adj. for age)            Statin use                    No       Yes       1.94 (1.77 to 2.10)        6.94 (5.90 to 8.16)
Diabetes (adj. for age)            Antihypertensive use          No       Yes       1.68 (1.53 to 1.84)        5.38 (4.60 to 6.28)
Statin use (adj. for age)          Antihypertensive use          No       Yes       1.70 (1.55 to 1.86)        5.49 (4.69 to 6.42)
Anti-hypertensive (adj. for age)   Systolic blood pressure       No       Yes       7.30 (6.28 to 8.33)        N/A
Smoker (no adj.)                   CVD                           No       Yes       0.59 (0.15 to 1.03)        1.80 (1.16 to 2.79)
Smoker (no adj.)                   Total:HDL-cholesterol ratio   No       Yes       0.51 (0.43 to 0.59)        N/A
Sex (adj. for age)                 TC:HDL                        Female   Male      0.55 (0.49 to 0.61)        N/A
So… what?
• DAG seems plausible
– Cigarette smoking and age only causal influences on
CVD.
– Many causal influences on drug use
– Drugs do not influence CVD risk?
– Cigarette smoking mediator of ethnic group effects
• Is researcher drawn DAG compatible with data?
– If not, why not?
• Only age and smoking necessary to adjust for
when testing causal hypotheses?
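The reading of the DAG above can be checked mechanically. The arc list below is transcribed from the cause and effect pairs in the arc-strength tables, so it may omit arcs that appeared only in the diagram; the variable names are shortened for illustration. The sketch confirms the arcs form no cycle and reads off the direct causes (parents) of CVD.

```python
from graphlib import TopologicalSorter

# (cause, effect) pairs as listed in the arc-strength tables.
arcs = [
    ("age", "cvd"), ("age", "statin"), ("age", "antihypertensive"),
    ("age", "family_history"), ("age", "sbp"),
    ("ethnic", "smoker"), ("ethnic", "family_history"), ("ethnic", "diabetes"),
    ("diabetes", "statin"), ("diabetes", "antihypertensive"),
    ("statin", "antihypertensive"), ("antihypertensive", "sbp"),
    ("smoker", "cvd"), ("smoker", "tc_hdl"), ("sex", "tc_hdl"),
]

# Map every node to the set of its direct causes.
parents = {}
for cause, effect in arcs:
    parents.setdefault(effect, set()).add(cause)
    parents.setdefault(cause, set())

# static_order() raises graphlib.CycleError if the arcs contain a cycle.
causal_order = list(TopologicalSorter(parents).static_order())

cvd_parents = parents["cvd"]
```

On this arc list the parents of CVD are age and smoking only, matching the first bullet above; neither drug variable has an arc into CVD.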
Barren proxy
• Variable that has no influence on exposure
and outcome (not true confounder), but
influenced by (proxy for) one.
• Here, TC:HDL ratio, when considering the
smoking → CVD relationship
[Diagram: Smoke → TC/HDL; Smoke → CVD]
Limitations
• Limited by sample size
– type-2 error rate likely to be high (e.g. only 101 CVD
events).
– 5% type-1 error rate.
– With this algorithm, early errors in statistical tests can
propagate through algorithm.
• Cross-sectional relationships
– may be prone to survival bias.
• Assumptions:
– No hidden or latent variables, independent subject
data, no errors in tests.
‘Assumption free’ regression
[Diagram nodes: Diabetes, Age, Blood pressure, TC:HDL, Gender, Smoking, CVD]
Summary
• DAGs are useful when considering variable
selection for regression modelling
• Possible to draw DAGs either from data or from
informed scientific knowledge.
• Useful to compare researcher drawn DAG with
that from data.
• Can help ‘visualise’ relationships between
variables.
• Software available and relatively easy to use.
THE STORY CONTINUES…
"The main reason we take so many drugs
is that drug companies don’t sell drugs,
they sell lies about drugs. This is what
makes drugs so different from anything
else in life… Virtually everything we
know about drugs is what the companies
have chosen to tell us and our doctors…”
Publication bias
• Ioannidis JPA, Trikalinos TA. An
exploratory test for an excess of
significant findings. Clin. Trials
2007;4(3):245-53.
• Calculate expected number of
positive studies, given:
– Sample size of individual studies
– Number of events in controls
– Summary effect (assumed true)
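The calculation outlined above can be sketched as follows. The power formula is a normal approximation for comparing two proportions, the chi-square comparison of observed with expected counts is one way to formalise the test, and every study in the example is hypothetical. None of this reproduces the statin meta-analysis itself.

```python
import math
from statistics import NormalDist

def study_power(n_control, n_treated, events_control, relative_risk, alpha=0.05):
    """Approximate power of one trial to detect the assumed true effect."""
    p_c = events_control / n_control
    p_t = relative_risk * p_c
    se = math.sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treated)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(abs(p_c - p_t) / se - z_crit)

def excess_significance(studies, observed_positive, relative_risk):
    """Compare observed positive studies with the number expected."""
    expected = sum(study_power(*s, relative_risk) for s in studies)
    n = len(studies)
    diff = observed_positive - expected
    statistic = diff ** 2 / expected + diff ** 2 / (n - expected)
    p_value = math.erfc(math.sqrt(statistic / 2))  # chi-square, 1 df
    return expected, statistic, p_value

# Five hypothetical trials: (n control, n treated, events in controls).
studies = [(1000, 1000, 100)] * 5
expected, statistic, p_value = excess_significance(
    studies, observed_positive=5, relative_risk=0.8
)
```

In this made-up example fewer than two of the five trials would be expected to reach significance, so five positive results would point to an excess of significant findings.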
Statin meta-analysis
Further reading
• Pearl, Judea (2010) "An Introduction to Causal
Inference," The International Journal of
Biostatistics: Vol. 6: Iss. 2, Article 7.
• DOI: 10.2202/1557-4679.1203
• Available at:
http://www.bepress.com/ijb/vol6/iss2/7