Bioinformatics in Oncology Clinical Trials
Yu Shyr, PhD
My 44-hour trip!!
Highlights
• Understanding high-density biomarker data
• Repeatability and reproducibility of the pre-clinical high-density biomarker data
• ROC curve vs. PPV
• Pathway of drug development based on high-density biomarker data
Bioinformatics
Bioinformatics is an interdisciplinary field that develops methods and software tools for understanding biological data. It conceptualizes biology in terms of molecules and applies "informatics techniques" (derived from disciplines such as applied math, computer science and statistics) to understand and organize the information associated with these molecules on a large scale. In short, bioinformatics is a management information system for molecular biology and has many practical applications.
Omics biomedical research
• Microarray: cDNA (about 5,000 variables), Affymetrix U133 Plus 2.0 (about 45,000 variables)
• SNPs (about 500,000 – 2,000,000 variables)
• Next Generation Sequencing (?)
Storage of the Data?

• cDNA, Microarray, SNPs
• NGseq raw imaging data: > 2 TB per sample
• RNA seq or Exome seq data: 10 GB per sample (raw data), 30-50 GB during processing
• Whole genome seq: 200 GB per sample (raw data), 400-600 GB during processing
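To make the per-sample figures concrete, here is a minimal sketch that scales them to a cohort. The function name `cohort_storage_tb`, the dictionary keys, the choice of the upper end of each quoted range as the peak, and the 100-sample cohort are all illustrative assumptions, not part of the original slides.

```python
# Per-sample figures quoted on the slide, in GB: (raw, peak during processing).
# The peak uses the upper end of each quoted range (an assumption).
PER_SAMPLE_GB = {
    "rna_or_exome": (10, 50),
    "whole_genome": (200, 600),
}


def cohort_storage_tb(assay: str, n_samples: int) -> tuple[float, float]:
    """Return (raw, peak) storage in TB for n_samples, taking 1 TB = 1000 GB."""
    raw_gb, peak_gb = PER_SAMPLE_GB[assay]
    return n_samples * raw_gb / 1000, n_samples * peak_gb / 1000


# Hypothetical cohort of 100 whole genomes.
raw_tb, peak_tb = cohort_storage_tb("whole_genome", 100)
print(f"100 whole genomes: {raw_tb:.0f} TB raw, up to {peak_tb:.0f} TB while processing")
```

The point of the sketch is that processing headroom, not raw data, tends to drive the storage budget.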
Cost per Human Genome
[Figure: cost per human genome compared with Moore's Law. Labels: 13 years, ~$3,000,000,000; SOLiD platform; <2 weeks, ~$1,000.]
Data Analysis/Mining?
NGS – Data Analysis
NATURE REVIEWS | CANCER VOLUME 13 | NOVEMBER 2013
• Our ability to translate cancer research to clinical success has been remarkably low.
• Sadly, clinical trials in oncology have the highest failure rate compared with other therapeutic areas.
• Issues related to clinical-trial design, such as uncontrolled phase II studies, a reliance on standard criteria for evaluating tumor response and the challenges of selecting patients prospectively, also play a significant part in the dismal success rate.
29 March 2012 | Vol 483 | Nature | 531
• In studies for which findings could be reproduced, authors had paid close attention to controls, reagents, investigator bias and describing the complete data set. For results that could not be reproduced, however, data were not routinely analyzed by investigators blinded to the experimental versus control groups.
29 March 2012 | Vol 483 | Nature | 531
• Massively parallel sequencing approaches are beginning to be
used clinically to characterize individual patient tumors and to
select therapies based on the identified mutations.
• A major question in these analyses is the extent to which these
methods identify clinically actionable alterations and whether
the examination of the tumor tissue alone is sufficient or
whether matched normal DNA should also be analyzed to
accurately identify tumor-specific (somatic) alterations.
• To address these issues, we comprehensively evaluated 815
tumor-normal paired samples from patients of 15 tumor types.
• We identified genomic alterations using next-generation
sequencing of whole exomes or 111 targeted genes that were
validated with sensitivities >95% and >99%, respectively, and
specificities >99.99%.
• This research suggests that matched tumor-normal sequencing analyses are essential for precise identification and interpretation of somatic and germline alterations and have important implications for the diagnostic and therapeutic management of cancer patients.
• Method: In patients with and without prostate cancer, measure peptide patterns through mass spectrometry
• Results: ~100% sensitive and specific for prostate cancer
• Limitation: The groups being compared differ in age and sex:
− Cancer: mean age 67, 100% male
− Control: mean age 35, 58% female
J Clin Invest. 2006;116(1):271–284. doi:10.1172/JCI26022.
How Science Goes Wrong
Oct 19th 2013
RNA Sequencing
RNA Seq – Sample size estimation
• Li, Shyr. Sample size calculation for differential expression analysis of RNA-seq data under Poisson distribution. Int J Comput Biol Drug Des. 2013;6(4):358-75
• Li, Shyr. Sample size calculation based on exact test for assessing differential expression analysis in RNA-seq data. BMC Bioinformatics 2013, 14:357, Dec 6, 2013
• Shyr, Li. Sample size calculation for RNA sequencing experiment – a simulation-based approach of TCGA data.
• http://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
Testing Hypothesis
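• We are interested in identifying differential gene expression between two groups. The testing hypothesis is
$$H_0: \gamma_1 - \gamma_0 = 0 \quad \text{vs.} \quad H_1: \gamma_1 - \gamma_0 \neq 0$$
where $\gamma_1$ and $\gamma_0$ are the Poisson rates of the two groups.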
• Research papers focusing on testing the equality of two
Poisson rates
– Thode (1997)
– Krishnamoorthy and Thomson (2004)
– Ng and Tang (2005)
– Gu et al. (2008)
Li, Shyr. Int J Comput Biol Drug Des. 2013;6(4):358-75
Test Statistics
• Wald and score test
$$Z_W = \frac{X_1 - wX_0}{\sqrt{X_1 + w^2 X_0}}, \qquad Z_S = \frac{X_1 - wX_0}{\sqrt{(X_1 + X_0)w}}, \qquad w = d_1/d_0$$
• Log transformation of Wald and score is usually adopted for skewness correction and variance stabilization
$$Z_{lw} = \frac{\ln(X_1/X_0) - \ln w}{\sqrt{1/X_1 + 1/X_0}}, \qquad Z_{ls} = \frac{\ln(X_1/X_0) - \ln w}{\sqrt{(2 + w + 1/w)/(X_1 + X_0)}}$$
• Transformation of Poisson to increase the rate of convergence to normality (Huffman, 1984)
$$Z_{tp} = \frac{2\left(\sqrt{X_1 + 3/8} - \sqrt{w(X_0 + 3/8)}\right)}{\sqrt{1 + w}}$$
• Likelihood ratio test
$$Z_{lr} = \left[\frac{X_1 + X_0}{X_1(1 + 1/w)}\right]^{X_1} \left[\frac{X_1 + X_0}{X_0(1 + w)}\right]^{X_0}$$
Li, Shyr. Int J Comput Biol Drug Des. 2013;6(4):358-75
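As a rough illustration of the Wald, score, log-transformed and square-root-transformed statistics for comparing two Poisson counts, the sketch below evaluates them for a single gene. It assumes X1 and X0 are positive total counts in the two groups and w = d1/d0 is the ratio of their normalization factors; the function name `poisson_two_sample_stats`, the dictionary keys and the example counts are hypothetical, and the likelihood ratio statistic is left out.

```python
import math


def poisson_two_sample_stats(x1: int, x0: int, d1: float = 1.0, d0: float = 1.0) -> dict[str, float]:
    """Statistics for H0: equal Poisson rates, given counts x1, x0 and w = d1 / d0."""
    if x1 <= 0 or x0 <= 0:
        # The log-transformed statistics are undefined for zero counts.
        raise ValueError("both counts must be positive")
    w = d1 / d0
    diff = x1 - w * x0                           # numerator of Wald and score
    log_diff = math.log(x1 / x0) - math.log(w)   # numerator of the log versions
    return {
        "wald": diff / math.sqrt(x1 + w**2 * x0),
        "score": diff / math.sqrt((x1 + x0) * w),
        "log_wald": log_diff / math.sqrt(1 / x1 + 1 / x0),
        "log_score": log_diff / math.sqrt((2 + w + 1 / w) / (x1 + x0)),
        "sqrt_transform": 2 * (math.sqrt(x1 + 3 / 8) - math.sqrt(w * (x0 + 3 / 8)))
        / math.sqrt(1 + w),
    }


# Hypothetical counts for one gene, equal normalization factors (w = 1).
for name, value in poisson_two_sample_stats(x1=30, x0=15).items():
    print(f"{name}: {value:.3f}")
```

With w = 1 the Wald and score statistics coincide, because both variance estimates reduce to X1 + X0.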
Sample Size Determination for False Discovery Rate
• Thousands of genes are examined in an RNA-seq
experiment, and those genes are tested simultaneously
• Benjamini and Hochberg (1995) and Storey (2002) proposed the use of the false discovery rate (FDR)
$$\mathrm{FDR} = E\left[\frac{R_0}{R} \,\middle|\, R > 0\right]$$
where $R_0$ is the number of false discoveries and $R$ is the number of results declared significant.
Li, Shyr. Int J Comput Biol Drug Des. 2013;6(4):358-75
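One common way to control the FDR across many simultaneously tested genes is the Benjamini-Hochberg step-up procedure. The sketch below is a minimal version of that procedure; the slides do not say which procedure the cited sample size method relies on, and the function name and the p-values are made up for illustration.

```python
def benjamini_hochberg(p_values: list[float], q: float = 0.05) -> list[bool]:
    """Return, for each p-value, whether it is declared significant at FDR level q."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Step-up rule: find the largest rank k (1-based) with p_(k) <= k * q / m.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * q / m:
            k_max = rank
    # Reject every hypothesis ranked at or below k_max.
    rejected = [False] * m
    for idx in order[:k_max]:
        rejected[idx] = True
    return rejected


# Hypothetical per-gene p-values.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
flags = benjamini_hochberg(p, q=0.05)
print(sum(flags), "of", len(p), "genes declared significant")
```

Because the rule looks for the largest qualifying rank, a p-value can be rejected even when smaller p-values miss their own thresholds.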
RNA-Seq: Sample Size Estimation
Li & Shyr BMC Bioinformatics 2013, 14, 357.
https://cqs.mc.vanderbilt.edu/shiny/RnaSeqSampleSize/
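For intuition about how effect size and expression level drive the required sample size, the following is a back-of-the-envelope sketch based on a normal approximation to the Wald statistic for one gene. It is not the exact-test or simulation-based method of the cited papers. It assumes equal normalization factors (w = 1), a mean count per sample of `gamma0` in the control group and a fold change `rho`, which gives n = (z_{1-alpha/2} + z_{power})^2 (1 + rho) / (gamma0 (rho - 1)^2) samples per group. All names and example values are hypothetical.

```python
import math
from statistics import NormalDist


def wald_sample_size(gamma0: float, rho: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate samples per group to detect fold change rho for one gene."""
    if gamma0 <= 0 or rho <= 0 or rho == 1:
        raise ValueError("need gamma0 > 0, rho > 0 and rho != 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = std_normal.inv_cdf(power)           # quantile for the target power
    n = (z_alpha + z_beta) ** 2 * (1 + rho) / (gamma0 * (rho - 1) ** 2)
    return math.ceil(n)


# Hypothetical gene: mean count 5 per control sample, 2-fold change, and a
# stringent per-gene alpha standing in for a multiple-testing adjustment.
print(wald_sample_size(gamma0=5, rho=2, alpha=0.001, power=0.8))
```

In a genome-wide setting the per-gene alpha would be chosen to meet a target FDR, which is why it is much smaller than 0.05 in the example.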
Published online:
09 March 2014 | doi:10.1038/nm.3466
We selected 53 participants with either aMCI or AD for
metabolomic and lipidomic biomarker discovery. Included in this
aMCI/AD group were 18 Converters. We also selected 53 matched
cognitively normal control (NC) participants.
Published online:
09 March 2014 | doi:10.1038/nm.3466
We discovered and validated a set of ten
lipids from peripheral blood that predicted
phenoconversion to either amnestic mild
cognitive impairment or Alzheimer’s disease
within a 2–3 year timeframe with over 90%
accuracy.
• During pregnancy, cell-free DNA fragments (short DNA fragments) of the mother and the fetus circulate in maternal blood
• Harmony analyses fragments from specific chromosomes, rather than all chromosomes
• Targeted analysis results in higher throughput and accurate trisomy risk assessment
PPV > 0.8 if Specificity > 0.998
              Disease (D+)         No Disease (D-)          Total
Test (+)      30                   854                      884 (PPV 3.4%)
Test (-)      8                    14949                    14957 (NPV 99.9%)
Total         38 (sens. 79%)       15803 (spec. 95%)        15841

              Disease (D+)         No Disease (D-)          Total
Test (+)      38                   9                        47 (PPV 81%)
Test (-)      0                    15794                    15794 (NPV 100%)
Total         38 (sens. 100%)      15803 (spec. 99.9%)      15841
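The percentages in the two tables can be recomputed from the four cell counts, which shows why PPV depends so strongly on specificity when the condition is rare (38 affected out of 15,841 in both tables). The sketch below is illustrative; the function names are not from the slides.

```python
def screening_metrics(tp: int, fp: int, fn: int, tn: int) -> dict[str, float]:
    """Sensitivity, specificity, PPV and NPV from the cells of a 2x2 table."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def ppv_from_rates(sensitivity: float, specificity: float, prevalence: float) -> float:
    """PPV via Bayes' rule from sensitivity, specificity and prevalence."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


# Cell counts taken from the two tables.
first = screening_metrics(tp=30, fp=854, fn=8, tn=14949)
second = screening_metrics(tp=38, fp=9, fn=0, tn=15794)
for name, metrics in (("first table", first), ("second table", second)):
    print(name, {key: round(value, 4) for key, value in metrics.items()})
```

Both tests could look acceptable on an ROC curve, yet the first produces 854 false positives against 30 true positives, so fewer than 4 in 100 positive results are correct.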
Experiment Design for High-Throughput Biomarker Assays
Study Objectives – Target Therapy – Personalized Medicine
• Multiple data sets → Data Mining → Biology → Trials
[Figure: NCI - The Cancer Genome Atlas (TCGA), statistics of 21 cancer types. Bar chart of downloadable tumor samples (axis from 0 to 800) for CESC, PAAD, LIHC, BLCA, LGG, KIRP, STAD, PRAD, READ, LAML, THCA, SKCM, LUSC, HNSC, LUAD, UCEC, COAD, KIRC, GBM, OV and BRCA.]
Identification of Six Biological Subtypes of Triple Negative
Breast Cancer and Insights to Therapeutic Strategies
• Patients with triple negative breast cancers (TNBCs) have
limited targeted therapy options.
• To better understand this heterogeneous group, we
determined if TNBCs could be subclassified, a first step in
aligning patients with appropriate therapies.
• Using gene expression (GE) profiles from 21 breast cancer
datasets, we identified six TNBC subtypes that display
unique GE patterns and gene ontologies.
Journal of Clinical Investigation 2011;121(7):2750-2767
Identification of Six Biological Subtypes of Triple Negative
Breast Cancer and Insights to Therapeutic Strategies
• The GE signatures of these subtypes allowed the designation of
TNBC cell lines as in vitro models of TNBC subtypes
• Predicted ‘driver’ signaling pathways from these cell lines were
pharmacologically targeted as proof-of-concept that distinct GE
signatures can inform response to targeted therapies
• These data have significant value for drug discovery, clinical trial
design and biomarker selection that will enable alignment of
TNBC patients to appropriate targeted therapies
Journal of Clinical Investigation 2011;121(7):2750-2767
Criteria for the use of omics-based prediction in clinical trials
• This paper provides a checklist of criteria that can be used to determine the readiness of omics-based tests for guiding patient care in clinical trials.
• The checklist criteria cover issues relating to specimens, assays, mathematical modeling, clinical trial design, and ethical, legal and regulatory aspects.
• The checklist will be used to evaluate proposals for NCI-sponsored clinical trials in which omics tests will be used to guide therapy.
Model development, and preliminary performance evaluation
• Evaluate data used in developing and validating the predictor model to check for accuracy, completeness, and outliers. Perform retrospective verification of the data quality if necessary.
• Assess the developmental data sets for technical artefacts, focusing particular attention on whether any artefacts could potentially influence the observed association between the omics profiles and clinical outcomes.
Model development, and preliminary performance evaluation
• Evaluate the appropriateness of the statistical methods used to build the predictor model and to assess its performance.
• Establish that the predictor algorithm, including all data pre-processing steps, cutpoints applied to continuous variables (if any), and methods for assigning confidence measures for predictions, are completely locked down (that is, fully specified) and identical to prior versions for which performance claims were made.
Model development, and preliminary performance evaluation
• Document sources of variation that affect the reproducibility of the final predictions, and provide an estimate of the overall variability along with verification that the prediction algorithm can be applied to one case at a time.
• Evaluate whether clinical validations of the predictor were analytically and statistically rigorous and unequivocally blinded.
Model development, and preliminary performance evaluation
• Search public sources, including literature and citation databases, journal correspondence, and retraction notices, to determine whether any questions have been raised about the data or methods used to develop the predictor or assess its performance, and ensure that all questions have been adequately addressed.
Conclusions
• Reproducible research: from pre-clinical data and biomarker data to clinical data
• Focus on pathway, gene set analysis and systems biology
• Need a team of quantitative scientists
We need a team!!