Reproducibility: Obstacles and Opportunities
Big Data and Healthcare Analytics – A Path to Personalized Medicine
April 14, 2016
Roger Day
OUTLINE
1) Reproducibility in data analysis
a) Failure to reproduce data analysis
i) A historic example: the Duke experience with personalized cancer medicine
ii) What went wrong
b) Benefits of reproducible analysis
i) Makes tech transfer easy
ii) Makes data updates easier
iii) Encourages planning and documentation
iv) Guards against incompetence and fraud
v) The stakes are high
c) Solutions
i) The role of individuals: tools for reproducible analysis
ii) The role of journals
iii) The role of institutions
2) Reproducibility: achieving internal and external validity
a) Bias, variance, sample size, and personalized medicine
b) Example: Mean squared error in regression; effect of model complexity
c) Example: hypothetical medical study demonstrating lumping versus splitting
d) Consequences of splitting:
i) Decreased bias
ii) Increased variance
iii) Hopefully greatly increased effect sizes
iv) Risks of multiple testing
e) Optimizing the lump/split compromise
i) For internal validity
ii) For external validity
3) The replication and reproducibility crises
a) Failure to "reproduce" (replicate) study results: the "decline effect".
b) Failure to "reproduce" (replicate) study results: explanations.
i) Explanations from "Why Most Published Research Findings Are False"
ii) Publication bias
iii) Regression to the mean
c) Efforts at remediation
Preamble:
Terminology… Replication? Or Reproducibility?
1) Reproducibility in data analysis
1.a) Failure to reproduce data analysis
1.a.i) A historic example: personalized cancer medicine & the Duke experience.
Keith Baggerly, "The Importance of Reproducible Research in High-Throughput Biology":
https://www.youtube.com/watch?v=7gYIs7uYbMo
Deriving Chemosensitivity From Cell Lines: Forensic Bioinformatics And Reproducible
Research In High-Throughput Biology", K. Baggerly & K. Coombes, Annals of Applied
Statistics, 2009.
"A Biostatistic Paper Alleges Potential Harm To Patients In Two Duke Clinical Studies",
Paul Goldberg, The Cancer Letter, 2009.
The setting: predicting which cancer patients should/should not get which chemotherapy drugs.
The technique:
• Using drug sensitivity data for a panel of cell lines (the NCI60), choose the lines that are most sensitive and most resistant to a drug.
• Using array profiles of the chosen cell lines, select the most differentially expressed genes.
• Using the selected genes, build a model that takes an array profile and returns a classification.
• Use this model to predict patient response.
1.a.ii) What went wrong -- a SELECTED list
• Using Excel: lack of care in pasting data.
• Miscoding: 0 = responder, 1 = non-responder. Oops, it was the opposite.
Giving an agent ONLY to patients who the model says will NOT benefit.
• Machine learning methods using random number generators.
The k-means clustering method relies on randomly chosen starting points in feature space.
NCI could not reproduce the prediction model results.
NCI could not even reproduce its own results five minutes later.
Listen to Lisa McShane's testimony: http://www.cancerletter.com/downloads/20110128_1 .
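A minimal R sketch (toy data, not the Duke group's code) of why unseeded k-means runs are irreproducible, and how recording a seed repairs that:

    # k-means starts from randomly chosen centers, so two runs on the same data
    # can land in different local optima unless the random seed is fixed.
    x <- matrix(rnorm(200), ncol = 2)        # toy feature matrix (hypothetical data)

    fit1 <- kmeans(x, centers = 3)
    fit2 <- kmeans(x, centers = 3)
    identical(fit1$cluster, fit2$cluster)    # often FALSE: assignments drift between runs

    set.seed(20160414)                       # record the seed alongside the analysis
    fit3 <- kmeans(x, centers = 3)
    set.seed(20160414)
    fit4 <- kmeans(x, centers = 3)
    identical(fit3$cluster, fit4$cluster)    # TRUE: the run is now reproducible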
• Failures of integrity
"Independent validation" that was not.
Re-using the same heat map in articles reporting different studies.
Reporting genes as significant that were not even assayed on the array… but fit the narrative.
…
• Failures of leadership
o Senior research leadership rarely checks details.
o Prestigious journals failed to publish critical letters.
o Administrative self-interest led to:
- Burying the career of a (polite) whistle-blower.
- Creating a non-transparent "report" that hid the issues,
and led to some trials re-starting.
- “We have been yelling about the science for three years….
So I find it ironic that [revelations about Potti’s fake Rhodes Scholarship]
got things rolling,” said Baggerly.
• Light penalties encourage future abuses:
"Department Of Health and Human Services’ Office of Research Integrity has concluded
that a five-year ban on federal research funding for one individual researcher is a
sufficient response to a case involving millions of taxpayer dollars, completely fabricated
data, and hundreds to thousands of patients in invasive clinical trials".
Penalty Too Light: A Guest Editorial by Keith Baggerly and C.K. Gunsalus, The
Cancer Letter, 2015.
• A scientific culture that encourages little frauds.
"Want a letter? You write it for me", Roger Day, Science, 2016.
o Offloading letter-writing to the supplicant.
o Courtesy authorships for the highly placed.
o Covering up data problems.
o Ultimately, falsifying data.
1.b) Benefits of reproducible analysis
1.b.i) Makes tech transfer easy
Predictive models can be accurately applied to future data.
1.b.ii) Makes data updates easier
Rerunning the same analysis when data are corrected or augmented with new
results becomes easy.
1.b.iii) Encourages planning and documentation
The "literate programming" paradigm provides a convenient space for describing the
data analysis plan.
1.b.iv) Guards against incompetence and fraud
Others can reproduce an analysis easily, so incompetence and fraud become easier
to detect.
Easy detection of mistakes and cheats will encourage care and discourage fraud.
1.b.v) The stakes are high:
This is a HOT FIELD!
Professional advancement, grants, …, and poor replication
(J. Ioannidis, PLoS Medicine 2007)
If done well, cancer patients will receive more effective PERSONALIZED medicines.
If done poorly:
“Opportunity cost” of better ideas not tested.
Cancer patients mistreated on clinical trials.
1.c) Solutions
1.c.i) The role of individuals: tools for reproducible analysis
• Documented archived data freeze
• Scripts instead of interactive interfaces
• Literate programming integrating code into reports: Sweave and Rmarkdown.
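A minimal R Markdown sketch of these three habits working together; the file name, path, and chunk contents are hypothetical placeholders:

    ---
    title: "Chemosensitivity prediction: reproducible analysis"
    output: html_document
    ---

    Data freeze: data/nci60_freeze_2016-04-01.csv (read-only archive);
    its checksum is recorded in the report so everyone knows which file was analyzed.

    ```{r read-data}
    freeze <- "data/nci60_freeze_2016-04-01.csv"
    tools::md5sum(freeze)     # document exactly which frozen file was used
    expr <- read.csv(freeze)
    ```

    ```{r analysis}
    set.seed(1)               # make any randomized steps repeatable
    summary(expr)             # the real modeling code, scripted rather than clicked, goes here
    ```

Re-knitting this file reruns the whole analysis from the frozen data, which is what makes corrections and updates (1.b.ii above) cheap.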
1.c.ii) The role of journals
Data sharing policies for improving reproducibility.
See Gary King's site on Data Sharing and Replication:
http://gking.harvard.edu/pages/data-sharing-and-replication
Public Library of Science (2014):
Authors must provide "the dataset used to reach the conclusions drawn in the manuscript
with related metadata and methods, and any additional data required to replicate the reported
study findings in their entirety."
"Authors need to indicate where the data are housed, at the time of submission."
"Journals unite for reproducibility", Nature 2014.
NIH + Science + Nature + editors + other funders and science leaders:
"Principles and Guidelines in Reporting Preclinical Research" go.nature.com/ezjl1p
1.c.iii) The role of institutions and their leaders
"It's the integrity, stupid."
2) Reproducibility: achieving internal and external validity
2.a) Bias, variance, sample size, and personalized medicine.
An idea of WIDESPREAD application and CRITICAL IMPORTANCE in personalized medicine:
Some decision (model complexity, number of parameters, drilling down etc) triggers a tradeoff
between reliability (e.g. low variance) and validity (e.g. low bias).
As you make a model more complex and "free", it fits better, but eventually overfits.
2.b) Example: Mean squared error in regression; effect of model complexity

    lump      <-------------------->   split

    simple        complexity           complex
    few           #parameters          many
    few           #components          many
    large         penalty              small
    heavy         prior weight         light
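A small R sketch of the tradeoff (simulated data, with polynomial degree standing in for "model complexity"): the fit to the training half keeps improving as the degree grows, but mean squared error on the held-out half eventually turns back up, the signature of overfitting.

    set.seed(1)
    n   <- 100
    dat <- data.frame(x = runif(n, -2, 2))
    dat$y <- sin(dat$x) + rnorm(n, sd = 0.4)   # a smooth true curve plus noise
    train <- dat[1:50, ]                       # half for fitting...
    test  <- dat[51:100, ]                     # ...half held out

    test_mse <- function(deg) {
      fit <- lm(y ~ poly(x, deg), data = train)
      mean((test$y - predict(fit, newdata = test))^2)
    }

    sapply(1:10, test_mse)   # typically falls, then creeps back up as degree grows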
2.c) Example: hypothetical medical study demonstrating lumping versus splitting
The Problem: A new treatment is given to 100 patients. Of them, only 8 respond. But there is a
subgroup of 5 in which 3 patients respond, yielding a response rate of 60%! Should the treatment be
recommended for people in the subgroup?
                  Group D   Group L   TOTAL
    Responder         3         5        8
    Nonresponder      2        90       92
    TOTAL             5        95      100
What if D and L are:
• 2 alleles of a gene known to affect this drug's pharmacodynamics
• 2 alleles of one gene out of a hundred known to affect this drug's pharmacodynamics
• 2 alleles of one gene out of a hundred thousand; nothing known
• D = dark hair, L = light hair
• D = dark hair, L = light hair; hair color is strongly tied to ethnicity… which is strongly tied to a key enzyme
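A quick R check of the 2-by-2 table above (a sketch; the study is hypothetical). The last line illustrates how the "one gene out of a hundred thousand" scenario changes the answer through a Bonferroni-style multiplicity correction.

    # 2 x 2 table from the hypothetical study above
    tab <- matrix(c(3, 2,     # Group D: 3 responders, 2 nonresponders
                    5, 90),   # Group L: 5 responders, 90 nonresponders
                  nrow = 2,
                  dimnames = list(c("Responder", "Nonresponder"),
                                  c("Group D", "Group L")))

    fisher.test(tab)$p.value                  # about 0.003: convincing in isolation

    # If the D/L split were one of 100,000 candidate gene splits examined:
    min(1, fisher.test(tab)$p.value * 1e5)    # 1: nothing survives the correction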
2.d) Consequences of splitting, good and bad:
i. Decreased bias
ii. Increased variance (see the sketch after this list)
   • due to smaller sample sizes
   • due to decreased variation in treatments delivered (if not randomized)
iii. Hopefully greatly increased effect sizes
iv. Risks of multiple testing
2.e) Optimizing the lump/split compromise
i) For internal validity
Internal validity: the answer is sufficiently correct to apply to new patients "similar"
to those in this study.
It will keep on working well here, for patients like these, even if we don't know why.
ii) For external validity
External validity: the answer is sufficiently correct to apply even to new patients
from a different sampling catchment
(age, location, ethnicity, socio-economic, …).
The science is well grounded enough to generalize.
iii) Techniques:
Meaningful Bayes priors, Bayesian networks, hierarchical models, empirical Bayes, …
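As one illustration of how such techniques mediate the lump/split compromise, here is a minimal empirical-Bayes-style sketch in R that shrinks the subgroup's 60% response rate toward the overall 8%. The Beta prior parameters are illustrative assumptions, not values from the talk.

    # Prior Beta(a, b) centered near the overall 8% response rate; its total
    # weight (a + b) plays the role of the "prior weight" row in the table above.
    a <- 0.8; b <- 9.2          # prior mean 0.08, prior weight of 10 "patients"
    x <- 3;   n <- 5            # subgroup data: 3 responders out of 5

    (a + x) / (a + b + n)                        # posterior mean ~0.25: between 8% and 60%
    qbeta(c(0.025, 0.975), a + x, b + (n - x))   # posterior 95% interval, still wide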
3) The replication and reproducibility crises:
3.a) Failure to replicate study results: The "decline effect".
Subsequent studies intended to replicate and confirm
frequently show decreased effect sizes.
PSYCHOLOGY
"Over half of psychology studies fail reproducibility test", M. Baker
Nature 2015.
"An effort to reproduce 100 psychology fingings found that only 39 held up."
15/100 were classified as "not at all similar".
The Truth Wears Off: Is there something wrong with the scientific method?
J. Lehrer, The New Yorker, Annals of Science, 2010.
"Something strange was happening: the therapeutic power of the [anti-depression] drugs appeared to
be steadily waning. A recent study showed an effect that was less than half of that documented in the
first trials, in the early nineteen-nineties."
CANCER
"Drug development: Raise standards for preclinical cancer research",
Begley, C. G. & Ellis, L. M. Nature, 2012.
Amgen could confirm only 6 of 53 preclinical results (11%).
"Repeatability of published microarray gene expression analyses",
Ioannidis et al, Nature Genetics 2009.
An evaluation of 18 quantitative papers published in Nature Genetics over a two-year period found that reproducibility was not achievable even in principle in 10 cases. "One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced."
3.b) Failure to "reproduce" (replicate) study results: explanations
3.b.i) "Why Most Published Research Findings Are False",
John Ioannidis, PLoS Medicine 2005.
R = pre-study odds of a true relationship. α = Type I error, β = Type II error. c = total # of P values.
u = "bias" = proportion of probed analyses that would not have been “research findings,” but
nevertheless end up presented and reported as such, because of bias.
R* = post-data (posterior) odds of a true relationship, given Research Finding = Yes. (Yes/Total).
R* is decreased by:
• Small sample size
• Small effect size
• Multiplicity of testing
• Non-selectivity of hypotheses
• Flexibility in designs, definitions, outcomes, and analytical modes
• Financial and other interests and prejudices
• Hot scientific field -- like personalized medicine.
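A small R sketch of the posterior-odds calculation under the definitions above, following Ioannidis; the example numbers are illustrative, and folding multiplicity or analytic flexibility into an inflated effective alpha is a simplification.

    # Posterior odds of a true relationship, given "research finding = yes"
    #   R     = pre-study odds of a true relationship
    #   alpha = Type I error, beta = Type II error
    #   u     = bias: fraction of would-be non-findings reported as findings anyway
    post_odds <- function(R, alpha, beta, u = 0) {
      R * (1 - beta + u * beta) / (alpha + u * (1 - alpha))
    }

    post_odds(R = 0.25, alpha = 0.05, beta = 0.2)            # well powered, no bias: 4
    post_odds(R = 0.25, alpha = 0.05, beta = 0.6)            # underpowered: drops to 2
    post_odds(R = 0.25, alpha = 0.05, beta = 0.6, u = 0.3)   # add bias: below 1
    post_odds(R = 0.01, alpha = 0.05, beta = 0.6, u = 0.3)   # hot field full of long shots: ~0.02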
Estimating bias: use of "null fields", scientific questions where ALL relationships are false.
" Too large and too highly significant effects may actually be more likely to be signs of large bias in
most fields of modern research."
Friendly critique: Goodman & Greenland, PLoS Medicine 2007.
3.b.ii) The Role of Chance: Publication bias
Publication bias: NOT publishing insignificant results.
(Almost the opposite of Ioannidis's bias term u, which covers publishing as significant something that shouldn't be.)
"Unpublished results hide the decline effect", J.Schooler, Nature 2011.
To illustrate the effects of publication bias, we use P values from a breast cancer gene expression study of Hedenfalk et al. (2001).
Pretend each gene test is a separate paper. 19% were significant.
"66% of null hypotheses are true".
(Estimated from Storey's qvalue method.)
R/(R+1) = 0.16+0.06+0.12 = 34%.
We combine this with Dickersin, K. et al. (1987), "Publication bias and clinical trials", Controlled Clinical Trials 8(4):
"Statistically significant results have been shown to be three times more likely to be published compared to papers with null results."
So here, assume:
Pr(reported | NOT signif) = 33%
Pr(reported | signif) = 100%
CONCLUSIONS:
Publication bias is NOT a great explanation for high failure to replicate:
failing to publish 2/3 of negative results increased the Type I error (from 5% to 14%),
but not the false discovery rate (17%).
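The arithmetic behind that conclusion, as an R sketch using the assumptions above (5% nominal Type I error; non-significant results published one-third as often as significant ones):

    alpha     <- 0.05    # nominal Type I error
    p_rep_sig <- 1.00    # Pr(reported | signif), from the assumption above
    p_rep_ns  <- 0.33    # Pr(reported | NOT signif)

    # Among REPORTED results on true null hypotheses, the fraction that are "significant":
    alpha * p_rep_sig / (alpha * p_rep_sig + (1 - alpha) * p_rep_ns)   # ~0.14

    # So the effective Type I error rises from 5% to about 14%.  The false discovery
    # rate among reported significant results is untouched, because significant results
    # are reported at the same (100%) rate whether the underlying null is true or false.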
But careful… studies with bias can convert a "P>0.05" non-significant study into a "P<0.05" significant study.
3.b.iii) The Role of Chance: Regression to the mean
Here the decline effect is due to statistical self-correction of initially exaggerated outcomes:
an initially exciting study with a strong and significant “effect size” tends to be followed by
estimates that fall back toward the true value.
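A small simulation sketch in R (all numbers illustrative): when only studies that clear the significance threshold attract follow-up, the replications regress back toward the true effect, which reads as a "decline".

    set.seed(2016)
    true_effect <- 0.2                 # modest real effect (standardized units)
    n  <- 20                           # small per-study sample size
    se <- 1 / sqrt(n)

    first  <- rnorm(5000, true_effect, se)       # estimates from 5000 initial studies
    sig    <- first > 1.96 * se                  # only the "exciting" ones get followed up
    second <- rnorm(sum(sig), true_effect, se)   # independent replications of those

    mean(first[sig])   # ~0.55: exaggerated, because selection favors big estimates
    mean(second)       # ~0.20: replications fall back to the true effect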
3.c) Efforts at remediation for poor replication and reproducibility
"First results from psychology’s largest reproducibility test", Monya Baker, Nature 2015.
"Estimating the reproducibility of psychological science", Open Science Collaboration, Science 2015.
"Disclose all data in publications", Keith Baggerly, Nature 2010.
At MD Anderson, by policy all analyses are reproducible, using "literate programming" techniques.