Bioinformatics in Oncology Clinical Trials
Yu Shyr, PhD

My 44-hour trip!!

Highlights
• Understanding high-density biomarker data
• Repeatability and reproducibility of the pre-clinical high-density biomarker data
• ROC curve vs. PPV
• Pathway of drug development based on high-density biomarker data

Bioinformatics
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It is conceptualizing biology in terms of molecules and applying "informatics techniques" (derived from disciplines such as applied math, computer science and statistics) to understand and organize the information associated with these molecules, on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.

Omics biomedical research
• Microarray: cDNA (about 5,000 variables), Affymetrix U133 Plus 2.0 (about 45,000 variables)
• SNPs (about 500,000 – 2,000,000 variables)
• Next Generation Sequencing (?)

Storage of the Data?
• cDNA, Microarray, SNPs
• NGseq raw imaging data: > 2 TB per sample
• RNA seq or Exome seq data: 10 GB per sample (raw data), 30-50 GB during the processing
• Whole genome seq: 200 GB per sample (raw data), 400-600 GB during the processing

Cost per Human Genome
• 13 years, ~$3,000,000,000
• <2 weeks, ~$1,000
[Figure labels: SOLID platform; Moore’s Law]

Data Analysis/Mining?

NGS – Data Analysis
NATURE REVIEWS | CANCER VOLUME 13 | NOVEMBER 2013

Our ability to translate cancer research to clinical success has been remarkably low. Sadly, clinical trials in oncology have the highest failure rate compared with other therapeutic areas. Issues related to clinical-trial design (such as uncontrolled phase II studies, a reliance on standard criteria for evaluating tumor response and the challenges of selecting patients prospectively) also play a significant part in the dismal success rate.
29 March 2012 | Vol 483 | Nature | 531

In studies for which findings could be reproduced, authors had paid close attention to controls, reagents, investigator bias and describing the complete data set. For results that could not be reproduced, however, data were not routinely analyzed by investigators blinded to the experimental versus control groups.
29 March 2012 | Vol 483 | Nature | 531

• Massively parallel sequencing approaches are beginning to be used clinically to characterize individual patient tumors and to select therapies based on the identified mutations.
• A major question in these analyses is the extent to which these methods identify clinically actionable alterations and whether the examination of the tumor tissue alone is sufficient or whether matched normal DNA should also be analyzed to accurately identify tumor-specific (somatic) alterations.
• To address these issues, we comprehensively evaluated 815 tumor-normal paired samples from patients of 15 tumor types.
• We identified genomic alterations using next-generation sequencing of whole exomes or 111 targeted genes that were validated with sensitivities >95% and >99%, respectively, and specificities >99.99%.
• This research suggests that matched tumor-normal sequencing analyses are essential for precise identification and interpretation of somatic and germline alterations and have important implications for the diagnostic and therapeutic management of cancer patients.

• Method: In patients with/without prostate cancer, measure peptide patterns through mass spectrometry
• Results: ~100% sensitive, specific for prostate cancer
• Limitation: Groups being compared are different:
  − Cancer: mean age 67, 100% male
  − Control: mean age 35, 58% female
J Clin Invest. 2006;116(1):271–284. doi:10.1172/JCI26022.

How Science Goes Wrong
Oct 19th 2013

RNA Sequencing

RNA Seq – Sample size estimation
Li, Shyr.
Sample size calculation for differential expression analysis of RNA-seq data under Poisson distribution. Int J Comput Biol Drug Des. 2013;6(4):358-75
Li, Shyr. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinformatics 2013, 14:357, Dec 6, 2013
Shyr, Li. Sample size calculation for RNA sequencing experiment – a simulation base approach of TCGA data.
http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/

Testing Hypothesis
• We are interested in identifying differential gene expression between two groups. The testing hypothesis is
$$H_0: \gamma_1 - \gamma_0 = 0 \qquad H_1: \gamma_1 - \gamma_0 \neq 0$$
• Research papers focusing on testing the equality of two Poisson rates
  – Thode (1997)
  – Krishnamoorthy and Thomson (2004)
  – Ng and Tang (2005)
  – Gu et al. (2008)
Li, Shyr. Int J Comput Biol Drug Des. 2013;6(4):358-75

Test Statistics
• Wald and score test
$$Z_W = \frac{X_1 - wX_0}{\sqrt{X_1 + w^2 X_0}}, \qquad Z_S = \frac{X_1 - wX_0}{\sqrt{(X_1 + X_0)\,w}}, \qquad w = d_1/d_0$$
• Log transformation of Wald and score is usually adopted for skewness correction and variance stabilization
$$Z_{lw} = \frac{\ln(X_1/X_0) - \ln w}{\sqrt{1/X_1 + 1/X_0}}, \qquad Z_{ls} = \frac{\ln(X_1/X_0) - \ln w}{\sqrt{(2 + w + 1/w)/(X_1 + X_0)}}$$
• Transformation of Poisson to increase the rate of convergence to normality (Huffman, 1984)
$$Z_{tp} = \frac{2\left(\sqrt{X_1 + 3/8} - \sqrt{w\,(X_0 + 3/8)}\right)}{\sqrt{1 + w}}$$
• Likelihood ratio test
$$Z_{lr} = \sqrt{2\left[X_0 \ln\frac{X_0\,(1 + w)}{X_1 + X_0} + X_1 \ln\frac{X_1\,(1 + 1/w)}{X_1 + X_0}\right]}$$

Sample Size Determination for False Discovery Rate
• Thousands of genes are examined in an RNA-seq experiment, and those genes are tested simultaneously
• Benjamini (1995) and Storey (2002) proposed the use of false discovery rate (FDR)
$$\mathrm{FDR} = E\left[\frac{R_0}{R} \,\middle|\, R > 0\right]$$
where $R_0$ is the number of false discoveries and $R$ is the number of results declared significant.

RNA-Seq: Sample Size Estimation
Li & Shyr BMC Bioinformatics 2013, 14, 357.
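The test statistics listed on the Test Statistics slide can be sketched in Python as follows. This is a minimal illustration, not the authors' code: it assumes X1 and X0 are positive read counts for one gene in the two groups and d1, d0 are the corresponding normalization factors (so w = d1/d0). The function name is hypothetical, and the transformed-Poisson and likelihood-ratio forms follow one reading of the slide (the likelihood-ratio statistic is returned unsigned).

```python
import math


def poisson_two_sample_stats(x1, x0, d1, d0):
    """Return six test statistics for comparing two Poisson rates.

    x1, x0: positive read counts in the two groups (assumed > 0).
    d1, d0: normalization factors; w = d1 / d0.
    """
    w = d1 / d0
    total = x1 + x0
    diff = x1 - w * x0                           # numerator of Wald and score
    log_diff = math.log(x1 / x0) - math.log(w)   # numerator of the log versions

    z_w = diff / math.sqrt(x1 + w ** 2 * x0)                  # Wald
    z_s = diff / math.sqrt(total * w)                         # score
    z_lw = log_diff / math.sqrt(1 / x1 + 1 / x0)              # log Wald
    z_ls = log_diff / math.sqrt((2 + w + 1 / w) / total)      # log score
    # Variance-stabilizing square-root transformation (assumed form).
    z_tp = 2 * (math.sqrt(x1 + 3 / 8) - math.sqrt(w * (x0 + 3 / 8))) / math.sqrt(1 + w)
    # Likelihood ratio: observed counts against their expected values under H0.
    lr = 2 * (x0 * math.log(x0 * (1 + w) / total)
              + x1 * math.log(x1 * (1 + 1 / w) / total))
    z_lr = math.sqrt(max(lr, 0.0))               # guard against tiny negative rounding

    return {"wald": z_w, "score": z_s, "log_wald": z_lw,
            "log_score": z_ls, "transformed": z_tp, "lr": z_lr}


# Hypothetical counts: 150 reads vs. 100 reads with equal normalization factors.
example = poisson_two_sample_stats(x1=150, x0=100, d1=1.0, d0=1.0)
for name, value in example.items():
    print(f"{name}: {value:.3f}")
```

With equal normalization factors the Wald and score statistics coincide, and all six values should be of similar magnitude for moderately large counts.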
https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/

Published online: 09 March 2014 | doi:10.1038/nm.3466
We selected 53 participants with either aMCI or AD for metabolomic and lipidomic biomarker discovery. Included in this aMCI/AD group were 18 Converters. We also selected 53 matched cognitively normal control (NC) participants.

We discovered and validated a set of ten lipids from peripheral blood that predicted phenoconversion to either amnestic mild cognitive impairment or Alzheimer’s disease within a 2–3 year timeframe with over 90% accuracy.

• During pregnancy, cell-free DNA fragments (short DNA fragments) of the mother and the fetus circulate in maternal blood
• Harmony analyses fragments from specific chromosomes, rather than all chromosomes
• Targeted analysis results in higher throughput and accurate trisomy risk assessment

PPV > 0.8 if Specificity > 0.998

            Disease (D+)            No Disease (D-)            Total
Test (+)    30                      854                        884 (PPV 3.4%)
Test (-)    8                       14949                      14957 (NPV 99.9%)
Total       38 (sensitivity 79%)    15803 (specificity 95%)    15841

            Disease (D+)            No Disease (D-)            Total
Test (+)    38                      9                          47 (PPV 81%)
Test (-)    0                       15794                      15794 (NPV 100%)
Total       38 (sensitivity 100%)   15803 (specificity 99.9%)  15841

Experiment Design for High-Throughput Biomarker Assays
Study Objectives
– Target Therapy
– Personalized Medicine
• Multiple data sets

Data Mining
Biology
Trials

[Figure: NCI - The Cancer Genome Atlas (TCGA), statistics of 21 cancer types. Downloadable tumor samples per cancer type (CESC, PAAD, LIHC, BLCA, LGG, KIRP, STAD, PRAD, READ, LAML, THCA, SKCM, LUSC, HNSC, LUAD, UCEC, COAD, KIRC, GBM, OV, BRCA); vertical axis from 0 to 800.]

Identification of Six Biological Subtypes of Triple Negative Breast Cancer and Insights to Therapeutic Strategies
• Patients with triple negative breast cancers (TNBCs) have limited targeted therapy options.
• To better understand this heterogeneous group, we determined if TNBCs could be subclassified, a first step in aligning patients with appropriate therapies.
• Using gene expression (GE) profiles from 21 breast cancer datasets, we identified six TNBC subtypes that display unique GE patterns and gene ontologies.
• The GE signatures of these subtypes allowed the designation of TNBC cell lines as in vitro models of TNBC subtypes.
• Predicted ‘driver’ signaling pathways from these cell lines were pharmacologically targeted as proof-of-concept that distinct GE signatures can inform response to targeted therapies.
• These data have significant value for drug discovery, clinical trial design and biomarker selection that will enable alignment of TNBC patients to appropriate targeted therapies.
Journal of Clinical Investigation 2011;121(7):2750-2767

Criteria for the use of omics-based prediction in clinical trials
This paper provides a checklist of criteria that can be used to determine the readiness of omics-based tests for guiding patient care in clinical trials. The checklist criteria cover issues relating to specimens, assays, mathematical modeling, clinical trial design, and ethical, legal and regulatory aspects. The checklist will be used to evaluate proposals for NCI-sponsored clinical trials in which omics tests will be used to guide therapy.

Model development, and preliminary performance evaluation
• Evaluate data used in developing and validating the predictor model to check for accuracy, completeness, and outliers. Perform retrospective verification of the data quality if necessary.
• Assess the developmental data sets for technical artefacts, focusing particular attention on whether any artefacts could potentially influence the observed association between the omics profiles and clinical outcomes.
• Evaluate the appropriateness of the statistical methods used to build the predictor model and to assess its performance.
• Establish that the predictor algorithm, including all data preprocessing steps, cutpoints applied to continuous variables (if any), and methods for assigning confidence measures for predictions, are completely locked down (that is, fully specified) and identical to prior versions for which performance claims were made.
• Document sources of variation that affect the reproducibility of the final predictions, and provide an estimate of the overall variability along with verification that the prediction algorithm can be applied to one case at a time.
• Evaluate whether clinical validations of the predictor were analytically and statistically rigorous and unequivocally blinded.
• Search public sources, including literature and citation databases, journal correspondence, and retraction notices, to determine whether any questions have been raised about the data or methods used to develop the predictor or assess its performance, and ensure that all questions have been adequately addressed.

Conclusions
• Reproducible research - from pre-clinical data, biomarker data to clinical data
• Focus on pathway, gene set analysis and systems biology
• Need a team of quantitative scientists

We need a team!!
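As a worked check of the two 2x2 tables in the PPV slides earlier in the deck, the sketch below recomputes sensitivity, specificity, PPV and NPV from the cell counts. The function names are hypothetical, and the second helper simply restates Bayes' rule to show how strongly PPV depends on prevalence and specificity when the disease is rare.

```python
def two_by_two_metrics(tp, fp, fn, tn):
    """Diagnostic accuracy measures from the four cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "prevalence": (tp + fn) / (tp + fp + fn + tn),
    }


def ppv_from_rates(sensitivity, specificity, prevalence):
    """PPV via Bayes' rule: P(D+ | T+)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Cell counts taken from the two tables on the slides.
first = two_by_two_metrics(tp=30, fp=854, fn=8, tn=14949)
second = two_by_two_metrics(tp=38, fp=9, fn=0, tn=15794)

for label, m in (("first table", first), ("second table", second)):
    print(label, {k: round(v, 4) for k, v in m.items()})
```

Both tables share the same prevalence (38 of 15841), so the jump in PPV between them is driven almost entirely by the drop in false positives from 854 to 9.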