Reproducibility: Obstacles and Opportunities
Big Data and Healthcare Analytics – A Path to Personalized Medicine
April 14, 2016
Roger Day

OUTLINE
1) Reproducibility in data analysis
   a) Failure to reproduce data analysis
      i) A historic example: the Duke experience with personalized cancer medicine
      ii) What went wrong
   b) Benefits of reproducible data analysis
      i) Makes tech transfer easy
      ii) Makes data updates easier
      iii) Encourages planning and documentation
      iv) Guards against incompetence and fraud
      v) The stakes are high
   c) Solutions
      i) The role of individuals: tools for reproducible analysis
      ii) The role of journals
      iii) The role of institutions
2) Reproducibility: achieving internal and external validity
   a) Bias, variance, sample size, and personalized medicine
   b) Example: mean squared error in regression; effect of model complexity
   c) Example: a hypothetical medical study demonstrating lumping versus splitting
   d) Consequences of splitting: i) decreased bias; ii) increased variance; iii) hopefully greatly increased effect sizes; iv) risks of multiple testing
   e) Optimizing the lump/split compromise: i) for internal validity; ii) for external validity
3) The replication and reproducibility crises
   a) Failure to "reproduce" (replicate) study results: the "decline effect"
   b) Failure to "reproduce" (replicate) study results: explanations
      i) Explanations from "Why Most Published Research Findings Are False"
      ii) Publication bias
      iii) Regression to the mean
   c) Efforts at remediation

Preamble: Terminology… Replication? Or Reproducibility?

1) Reproducibility in data analysis

1.a) Failure to reproduce data analysis
1.a.i) A historic example: personalized cancer medicine & the Duke experience
Keith Baggerly, "The Importance of Reproducible Research in High-Throughput Biology": https://www.youtube.com/watch?v=7gYIs7uYbMo
"Deriving Chemosensitivity From Cell Lines: Forensic Bioinformatics and Reproducible Research in High-Throughput Biology", K. Baggerly & K. Coombes, Annals of Applied Statistics, 2009.
"A Biostatistics Paper Alleges Potential Harm to Patients in Two Duke Clinical Studies", Paul Goldberg, The Cancer Letter, 2009.

The setting: predicting which cancer patients should (and should not) get which chemotherapy drugs.

The technique:
• Using drug sensitivity data for a panel of cell lines (the NCI60), choose those that are most sensitive and most resistant to a drug.
• Using array profiles of the chosen cell lines, select the most differentially expressed genes.
• Using the selected genes, build a model that takes an array profile and returns a classification.
• Use this model to predict patient response.

1.a.ii) What went wrong -- a SELECTED list

o Using Excel: lack of care in pasting data.
o Miscoding: 0 = responder, 1 = non-responder. Oops -- it was the opposite. The result: giving an agent ONLY to patients who the model says will NOT benefit.
o Machine learning methods using random number generators: the k-means clustering method relies on randomly chosen starting points in feature space. NCI could not reproduce the prediction model results; NCI could not even reproduce its own results five minutes later. Listen to Lisa McShane's testimony: http://www.cancerletter.com/downloads/20110128_1

Failures of integrity:
o "Independent validation" that was not.
o Re-using the same heat map in articles reporting different studies.
o Reporting genes as significant that were not even assayed on the array… but fit the narrative.
o …

Failures of leadership:
o Senior research leadership rarely checks details.
o Prestigious journals failed to publish critical letters.
o Administrative self-interest led to:
  - Burying the career of a (polite) whistle-blower.
  - Creating a non-transparent "report" that hid the issues and led to some trials re-starting.

"We have been yelling about the science for three years…. So I find it ironic that [revelations about Potti's fake Rhodes Scholarship] got things rolling," said Baggerly.

Light penalties encourage future abuses: "Department of Health and Human Services' Office of Research Integrity has concluded that a five-year ban on federal research funding for one individual researcher is a sufficient response to a case involving millions of taxpayer dollars, completely fabricated data, and hundreds to thousands of patients in invasive clinical trials." ("Penalty Too Light: A Guest Editorial", Keith Baggerly and C.K. Gunsalus, The Cancer Letter, 2015.)

A scientific culture that encourages little frauds ("Want a letter? You write it for me", Roger Day, Science, 2016):
o Offloading letter-writing to the supplicant.
o Courtesy authorships for the highly placed.
o Covering up data problems.
o Ultimately, falsifying data.

1.b) Benefits of reproducible analysis

1.b.i) Makes tech transfer easy
Predictive models can be accurately applied to future data.

1.b.ii) Makes data updates easier
Rerunning the same analysis when data are corrected or augmented with new results becomes easy.

1.b.iii) Encourages planning and documentation
The "literate programming" paradigm provides a convenient space for describing the data analysis plan.

1.b.iv) Guards against incompetence and fraud
Others can reproduce an analysis easily, so incompetence and fraud become easier to detect. Easy detection of mistakes and cheats will encourage care and discourage fraud.

1.b.v) The stakes are high
This is a HOT FIELD! Professional advancement, grants, …, and poor replication (J. Ioannidis, PLoS Medicine 2005).
If done well: cancer patients will receive more effective PERSONALIZED medicines.
If done poorly: the "opportunity cost" of better ideas not tested; cancer patients mistreated on clinical trials.

1.c) Solutions

1.c.i) The role of individuals: tools for reproducible analysis
o A documented, archived data freeze.
o Scripts instead of interactive interfaces.
o Literate programming, integrating code into reports: Sweave and Rmarkdown.

1.c.ii) The role of journals
Data-sharing policies for improving reproducibility. See Gary King's site on Data Sharing and Replication: http://gking.harvard.edu/pages/data-sharing-and-replication
Public Library of Science (2014): Authors must provide "the dataset used to reach the conclusions drawn in the manuscript with related metadata and methods, and any additional data required to replicate the reported study findings in their entirety." "Authors need to indicate where the data are housed, at the time of submission."
"Journals unite for reproducibility", Nature 2014. NIH + Science + Nature + editors + other funders and science leaders: "Principles and Guidelines in Reporting Preclinical Research", go.nature.com/ezjl1p

1.c.iii) The role of institutions and their leaders
"It's the integrity, stupid."

2) Reproducibility: achieving internal and external validity

2.a) Bias, variance, sample size, and personalized medicine
An idea of WIDESPREAD application and CRITICAL IMPORTANCE in personalized medicine: some decision (model complexity, number of parameters, drilling down, etc.) triggers a tradeoff between reliability (e.g. low variance) and validity (e.g. low bias). As you make a model more complex and "free", it fits the data at hand better, but eventually overfits.
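This tradeoff can be made concrete with a minimal simulation (hypothetical data; a piecewise-constant "binned means" model stands in for model complexity, where more bins = more splitting): adding bins always reduces training error, but the model that predicts new data best is an intermediate one.

```python
import math
import random

def binned_mean_fit(xs, ys, k):
    """Fit a piecewise-constant model on [0, 1) using k equal-width bins."""
    sums, counts = [0.0] * k, [0] * k
    for x, y in zip(xs, ys):
        b = min(int(x * k), k - 1)
        sums[b] += y
        counts[b] += 1
    overall = sum(ys) / len(ys)  # fallback prediction for empty bins
    means = [s / c if c else overall for s, c in zip(sums, counts)]
    return lambda x: means[min(int(x * k), k - 1)]

def mse(model, xs, ys):
    return sum((model(x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

rng = random.Random(1)  # fixed seed: the run itself is reproducible

def sample(n):
    xs = [rng.random() for _ in range(n)]
    ys = [math.sin(2 * math.pi * x) + rng.gauss(0, 0.3) for x in xs]
    return xs, ys

train_x, train_y = sample(200)
test_x, test_y = sample(200)

results = []  # (bins, training MSE, test MSE)
for k in [1, 2, 4, 8, 16, 32, 64]:  # doubling k gives nested partitions
    fit = binned_mean_fit(train_x, train_y, k)
    results.append((k, mse(fit, train_x, train_y), mse(fit, test_x, test_y)))

for k, tr, te in results:
    print(f"bins={k:2d}  train MSE={tr:.3f}  test MSE={te:.3f}")
```

Because each partition refines the previous one, training MSE can only fall as k grows; test MSE falls at first (bias shrinks) and then deteriorates as each bin's mean is estimated from very few points (variance grows).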
2.b) Example: mean squared error in regression; the effect of model complexity

                 LUMP       SPLIT
complexity       simple     complex
#parameters      few        many
#components      few        many
penalty          large      small
prior weight     heavy      light

2.c) Example: a hypothetical medical study demonstrating lumping versus splitting

The problem: A new treatment is given to 100 patients. Of them, only 8 respond. But there is a subgroup of 5 in which 3 patients respond, yielding a response rate of 60%! Should the treatment be recommended for people in the subgroup?

               Group D   Group L   TOTAL
Responder          3         5        8
Nonresponder       2        90       92
TOTAL              5        95      100

What if D and L are:
o 2 alleles of a gene known to affect this drug's pharmacodynamics?
o 2 alleles of one gene out of a hundred known to affect this drug's pharmacodynamics?
o 2 alleles of one gene out of a hundred thousand, with nothing known?
o D = dark hair, L = light hair?
o D = dark hair, L = light hair; hair color is strongly tied to ethnicity… which is strongly tied to a key enzyme?

2.d) Consequences of splitting, good and bad:
i. Decreased bias.
ii. Increased variance, due to smaller sample sizes and to decreased variation in treatments delivered (if not randomized).
iii. Hopefully, greatly increased effect sizes.
iv. Risks of multiple testing.

2.e) Optimizing the lump/split compromise

i) For internal validity
Internal validity: the answer is sufficiently correct to apply to new patients "similar" to those in this study. It will keep on working well here, for patients like these, even if we don't know why.

ii) For external validity
External validity: the answer is sufficiently correct to apply even to new patients from a different sampling catchment (age, location, ethnicity, socio-economic status, …). The science is well grounded enough to generalize.
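The 2×2 table's lessons can be checked with a few lines of arithmetic (a sketch using the table's counts and normal-approximation standard errors): splitting trades bias for variance, and the multiple-testing risk in 2.d.iv dominates once many candidate subgroups are screened.

```python
from math import comb, sqrt

def binom_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, j) * p**j * (1 - p)**(n - j) for j in range(k, n + 1))

# Lumped analysis: 8/100 respond.  Split analysis: subgroup D has 3/5.
p_lump, n_lump = 8 / 100, 100
p_split, n_split = 3 / 5, 5

# Normal-approximation standard errors of the estimated response rates.
se_lump = sqrt(p_lump * (1 - p_lump) / n_lump)     # small n -> roughly 0.03
se_split = sqrt(p_split * (1 - p_split) / n_split) # n = 5 -> roughly 0.22

# If the true rate were 8% everywhere, one pre-chosen subgroup showing
# >= 3/5 responders is unlikely...
p_one = binom_tail(5, 3, 0.08)

# ...but screening ~100,000 candidate genes, chance alone yields many such "hits".
expected_hits = 100_000 * p_one

print(f"SE lumped = {se_lump:.3f}, SE split = {se_split:.3f}")
print(f"P(>=3 of 5 | rate 8%) = {p_one:.4f}; "
      f"expected chance hits in 100,000 tests = {expected_hits:.0f}")
```

The subgroup estimate is nearly an order of magnitude noisier than the lumped one, and a "significant-looking" subgroup is expected hundreds of times over by chance alone in a hundred-thousand-gene screen -- which is why the interpretation depends so strongly on what D and L are.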
iii) Techniques: meaningful Bayes priors, Bayesian networks, hierarchical models, empirical Bayes, …

3) The replication and reproducibility crises

3.a) Failure to replicate study results: the "decline effect"
Subsequent studies intended to replicate and confirm frequently show decreased effect sizes.

PSYCHOLOGY
"Over half of psychology studies fail reproducibility test", M. Baker, Nature 2015. "An effort to reproduce 100 psychology findings found that only 39 held up." 15/100 were classified as "not at all similar".
"The Truth Wears Off: Is there something wrong with the scientific method?", J. Lehrer, The New Yorker, Annals of Science, 2010. "Something strange was happening: the therapeutic power of the [anti-depression] drugs appeared to be steadily waning. A recent study showed an effect that was less than half of that documented in the first trials, in the early nineteen-nineties."

CANCER
"Drug development: Raise standards for preclinical cancer research", Begley, C. G. & Ellis, L. M., Nature, 2012. Amgen could confirm only 6 of 53 preclinical results (11%).
"Repeatability of published microarray gene expression analyses", Ioannidis et al., Nature Genetics 2009. An evaluation of 18 quantitative papers published in Nature Genetics in the preceding two years found that in 10 cases reproducibility was not achievable even in principle. "One table or figure from each article was independently evaluated by two teams of analysts. We reproduced two analyses in principle and six partially or with some discrepancies; ten could not be reproduced."

3.b) Failure to "reproduce" (replicate) study results: explanations

3.b.i) "Why Most Published Research Findings Are False", John Ioannidis, PLoS Medicine 2005.
R = pre-study odds of a true relationship. α = Type I error rate; β = Type II error rate. c = total number of P values (relationships tested).
u = "bias" = the proportion of probed analyses that would not have been "research findings", but nevertheless end up presented and reported as such, because of bias.
R* = post-data (posterior) odds of a true relationship, given Research Finding = Yes; the corresponding probability (true Yes / total Yes) is the positive predictive value.

R* is decreased by:
o Small sample size
o Small effect size
o Multiplicity of testing
o Non-selectivity of hypotheses
o Flexibility in designs, definitions, outcomes, and analytical modes
o Financial and other interests and prejudices
o A hot scientific field -- like personalized medicine

Estimating bias: use "null fields", scientific questions where ALL relationships are false. "Too large and too highly significant effects may actually be more likely to be signs of large bias in most fields of modern research."
Friendly critique: Goodman & Greenland, PLoS Medicine 2007.

3.b.ii) The Role of Chance: Publication bias
Publication bias: NOT publishing insignificant results. (Almost the opposite of Ioannidis's bias term u, which is publishing as significant something that shouldn't be.)
"Unpublished results hide the decline effect", J. Schooler, Nature 2011.
To illustrate the effects of publication bias, we use P values from a breast cancer gene expression study (Hedenfalk et al., 2001) and pretend each gene's test is a separate paper. 19% were significant, and "66% of null hypotheses are true" (estimated using Storey's q-value method), so R/(R+1) = 0.16 + 0.06 + 0.12 = 34%.
We combine this with Dickersin, K. et al. (1987), "Publication bias and clinical trials", Controlled Clinical Trials 8 (4): "Statistically significant results have been shown to be three times more likely to be published compared to papers with null results."
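The quantities defined in 3.b.i combine into Ioannidis's positive-predictive-value formula. A minimal sketch (the expression is the bias-adjusted PPV from the 2005 paper; worth checking against the original before reuse):

```python
def ppv(R, alpha=0.05, beta=0.2, u=0.0):
    """Post-study probability that a claimed finding is true (Ioannidis 2005).

    R:     pre-study odds of a true relationship
    alpha: Type I error rate
    beta:  Type II error rate (power = 1 - beta)
    u:     "bias" -- proportion of analyses reported as findings that
           would not otherwise have qualified
    """
    num = (1 - beta) * R + u * beta * R
    den = R + alpha - beta * R + u - u * alpha + u * beta * R
    return num / den

# 1:1 pre-study odds, 80% power, no bias: PPV ~ 0.94
print(round(ppv(1.0), 3))
# Long-shot hypotheses (R = 1:1000), 50% power, some bias:
# most claimed "findings" are false.
print(round(ppv(0.001, beta=0.5, u=0.1), 3))
```

Plugging in small R (non-selective hypotheses), low power (small samples or effects), or nonzero u reproduces the bullet list above: each one drags the post-study probability down.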
So here, assume:
Pr(reported | NOT significant) = 33%
Pr(reported | significant) = 100%

CONCLUSIONS: Publication bias is NOT a great explanation for the high failure to replicate. Failing to publish 2/3 of negative results increased the Type I error (from 5% to 14%), but not the false discovery rate (17%). But be careful… studies with bias can convert a "P > 0.05" non-significant study into a "P < 0.05" significant study.

3.b.iii) The Role of Chance: Regression to the mean
Here the decline effect is due to statistical self-correction of initially exaggerated outcomes: an initially exciting study, with a strong and significant "effect size", is subject to statistical self-correction when others try to replicate it.

3.c) Efforts at remediation for poor replication and reproduction
"First results from psychology's largest reproducibility test", Monya Baker, Nature 2015.
"Estimating the reproducibility of psychological science", Open Science Collaboration, Science 2015.
"Disclose all data in publications", Keith Baggerly, Nature 2010.
At MD Anderson, by policy all analyses are reproducible, using "literate programming" techniques.
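As a closing illustration, the regression-to-the-mean mechanism behind the decline effect (3.b.iii) can be simulated in a few lines (a sketch with hypothetical numbers: every study estimates the same true effect, but we look only at the "exciting" ones):

```python
import random

rng = random.Random(42)   # fixed seed for reproducibility

TRUE_EFFECT = 0.2   # hypothetical true effect size, identical in every study
SE = 0.15           # standard error of each study's estimate

# Simulate many original studies and one independent replication of each.
originals = [rng.gauss(TRUE_EFFECT, SE) for _ in range(10_000)]
replications = [rng.gauss(TRUE_EFFECT, SE) for _ in range(10_000)]

# Keep only "exciting" originals: estimates clearing roughly z = 1.96.
threshold = 1.96 * SE
selected = [(o, r) for o, r in zip(originals, replications) if o > threshold]

mean_orig = sum(o for o, _ in selected) / len(selected)
mean_repl = sum(r for _, r in selected) / len(selected)
print(f"selected originals average {mean_orig:.3f}; "
      f"their replications average {mean_repl:.3f} (truth: {TRUE_EFFECT})")
```

The selected originals overstate the truth (they were selected partly for their lucky noise), while their replications cluster near the true effect, so the literature "declines" with no change in the underlying science.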