High-dimensional Prognosis:
Developing a gene signature from a very large
number of potential predictors
Ulrich Mansmann
IBE, LMU München
1
Is it important to decipher the heterogeneity of
"normal karyotype AML"?
Metzeler KH et al. (2008) Blood
Almost half of adult acute myelogenous leukemia (AML) is normal cytogenetically,
and this subgroup shows a remarkable heterogeneity of genetic mutations at the
molecular level and an intermediate response to therapy.
The finding of recurrent cytogenetic abnormalities has influenced, in a primary
way, the understanding and treatment of leukemias. Yet "normal karyotype AML"
lacks such obvious abnormalities, but has a variety of prognostically important
genetic (submicroscopic) abnormalities.
NPM1 and FLT3 mutations are established factors which influence prognosis.
Is it possible to detect patterns of genetic activities
with strong influence on prognosis additional to the
known genetic mutations?
Oncologists need improved tools for selecting treatments
for individual patients.
2
What can be done?
• There is hardly any guidance from biologists on how to disentangle cellular
processes with regard to their effects on the disease course → black box
• There is no established cellular paradigm of certain tumors which can be
represented in a prognostic system.
• There is no thorough statistical experience on which algorithm should be used
when developing a prognostic gene signature.
• There is a lot of arbitrariness in setting up a specific strategy for the project.
• Principles which shield the data analyst from failing are not common knowledge.
• Biotechnologies with different concepts can produce data.
[Diagram: data types (mutations, copy number changes, translocations, expression profile) and prognosis]
3
What is a gene signature?
1. A set of genes
2. An algorithm which transforms measured gene expression into a prognostic
statement.
In general, the gene set is published and no information is available about the
algorithm.
People generally ignore the algorithm and do not have a clear perception of
its nature.
4
Project road map
Developing the gene signature
• Normalization
• Preprocessing of data: yes/no/which
• Choice of prognostic algorithm
• How to avoid overfitting?
Applying it to new patients
• Complexity of algorithm and measurement process
• Normalization
Functional interpretation
• Interpretation in terms of the disease process
• What are useful strategies?
5
Elementary blunders to be avoided:
• Lack of specification of the process used to derive the model. Without such
specification, it is difficult to judge the appropriateness of the process
→ reproducible statistics
• Small sample sizes: the importance of having an adequate number of
subjects is still not well understood.
• Convenience samples: use a typical clinical patient population
with delineated patient selection criteria instead.
6
Validation
Justice AC, Covinsky CE, Berlin JA (1999) Assessing the
Generalizability of Prognostic Information,
Ann Intern Med. 1999;130:515-524.
The purpose of validation is to show that the procedure
is fit for purpose.
7
Normalisation
A specific step to remove systematic biases which are inherent to the production of
microarray data.
Broad question: How do we compare results across chips?
Focused goal:
Getting numbers (quantifications) from one chip to mean the
same as numbers from another chip.
• Normalization acts on a group of arrays. Derived gene signatures are only valid
within the normalization setting.
• Information on the normalization process has to be communicated to allow future
data to be put into the context of the normalization which is the basis of the
derived gene signature.
• In general, this information is not communicated in published gene signature
papers. People only communicate the gene set.
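A minimal sketch of what "putting future data into the context of the normalization" could look like. It assumes quantile normalization against a stored reference distribution, which is only one possible choice and is not stated on the slide; the function names and the toy intensities are hypothetical, and all chips are assumed to carry the same number of probes.

```python
def fit_reference(training_arrays):
    """Reference distribution: mean of the sorted intensities over the training chips."""
    sorted_arrays = [sorted(a) for a in training_arrays]
    n = len(sorted_arrays)
    return [sum(values) / n for values in zip(*sorted_arrays)]


def apply_reference(array, reference):
    """Replace each intensity by the reference value of the same rank."""
    order = sorted(range(len(array)), key=lambda i: array[i])
    normalized = [0.0] * len(array)
    for rank, idx in enumerate(order):
        normalized[idx] = reference[rank]
    return normalized


# Toy training chips (hypothetical values); the reference must be stored
# and communicated together with the signature.
training = [[5.0, 2.0, 3.0], [4.0, 1.0, 6.0]]
reference = fit_reference(training)

# A future chip is mapped onto the stored reference, not re-normalized on its own.
new_chip = [10.0, 30.0, 20.0]
normalized_new = apply_reference(new_chip, reference)
print(reference, normalized_new)
```

The point of the design is that `reference` is part of the signature: without it, a new patient's chip cannot be brought onto the scale on which the algorithm was trained.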
8
Normalisation
9
Preprocessing
Procedures used in reducing an unmanageably high set of molecular data to a
more manageable, but still perhaps quite large, number of (summary) features to
be used in further development:
• Metagenes (Mike West): collapse genes with similar expression profiles
to an artificial metagene by K-means
• Univariate tests
• Use genes with large variability
• Use of subject knowledge
• and much more …
In general, researchers do not see preprocessing as part of prognostic research,
but it has a profound effect on the later high-level analyses.
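As an illustration of one of the preprocessing options listed above (keeping genes with large variability), here is a small sketch. The gene names and values are hypothetical. The filter is computed from the training patients only and then applied unchanged to new patients, so that it is treated as part of the prognostic algorithm.

```python
from statistics import variance


def top_variance_genes(expression, n_keep):
    """expression maps gene name -> expression values over the training patients."""
    ranked = sorted(expression, key=lambda g: variance(expression[g]), reverse=True)
    return ranked[:n_keep]


def restrict(patient_profile, kept_genes):
    """Apply the stored gene filter to a single (new) patient profile."""
    return {gene: patient_profile[gene] for gene in kept_genes}


# Hypothetical training data: three genes measured on four patients.
train = {
    "geneA": [1.0, 1.1, 0.9, 1.0],
    "geneB": [0.0, 4.0, 0.0, 4.0],
    "geneC": [2.0, 3.0, 2.0, 3.0],
}
kept = top_variance_genes(train, n_keep=2)

new_patient = {"geneA": 1.2, "geneB": 3.5, "geneC": 2.1}
print(kept, restrict(new_patient, kept))
```

If such a filter is chosen on the full data set before cross-validation, the held-out patients have already influenced the model, which is one way preprocessing can bias the later analysis.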
10
High-level analyses: Choice of central algorithm
Use algorithms with inbuilt regularization features:
• Elastic nets: combination of ridge and lasso regression
Zou, Hastie (2005) JRSS B, 67:301-320
Practical algorithms for Cox-Regression and GLMs by J. Goeman
http://cran.r-project.org/web/packages/penalized/index.html
• PCA: Semi-Supervised Methods to Predict Patient Survival from Gene
Expression Data
Bair E, Tibshirani R (2004) PLoS Biol. 2:E108.
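For orientation, the elastic net can be written as a penalized likelihood problem. The notation is an assumption of this sketch: \ell(\beta) is the log-likelihood of the GLM (or the partial log-likelihood for Cox regression), p is the number of genes, and \lambda_1, \lambda_2 \ge 0 are the lasso and ridge tuning parameters. Scaling conventions for the penalties differ between papers and software.

```latex
\hat{\beta} \;=\; \arg\max_{\beta}
\left\{ \ell(\beta) \;-\; \lambda_1 \sum_{j=1}^{p} |\beta_j|
\;-\; \lambda_2 \sum_{j=1}^{p} \beta_j^{2} \right\}
```

Setting \lambda_2 = 0 gives the lasso and \lambda_1 = 0 gives ridge regression. Both tuning parameters are internal model parameters that have to be determined from the data.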
11
Internal validation: How to avoid overfitting?
• The algorithm is a composite procedure.
There is a lack of understanding of how the components influence each other as
well as the quality of the final result.
• The choice of its elements is quite subjective and arbitrary.
• Need for a multi-layer cross-validation approach:
1.) Determination of internal model parameters
2.) Selection from a set of suitable algorithms
3.) Validating the chosen candidate
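The three layers can be sketched as a nested cross-validation. This is a simplified illustration, not the procedure used in any cited study: the `fit` and `score` callbacks, the candidate settings and the toy outcome values are hypothetical, and folds are formed round-robin without shuffling or stratification.

```python
def k_fold_indices(indices, k):
    """Split a list of sample indices into k disjoint folds (round-robin)."""
    return [indices[i::k] for i in range(k)]


def nested_cv(n_samples, candidates, fit, score, outer_k=3, inner_k=2):
    """fit(candidate, train_idx) -> model; score(model, test_idx) -> float (higher is better)."""
    all_idx = list(range(n_samples))
    outer_scores = []
    for test_idx in k_fold_indices(all_idx, outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in all_idx if i not in held_out]

        def inner_score(candidate):
            # Layers 1 and 2: tuning and selection see the outer training part only.
            total = 0.0
            for val_idx in k_fold_indices(train_idx, inner_k):
                val = set(val_idx)
                fit_idx = [i for i in train_idx if i not in val]
                total += score(fit(candidate, fit_idx), val_idx)
            return total / inner_k

        best = max(candidates, key=inner_score)
        # Layer 3: refit the chosen candidate and assess it on untouched patients.
        outer_scores.append(score(fit(best, train_idx), test_idx))
    return outer_scores


# Toy example: the "model" is a shrunken mean, candidates are shrinkage factors.
y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]


def fit(shrink, idx):
    return shrink * sum(y[i] for i in idx) / len(idx)


def score(prediction, idx):
    return -sum((y[i] - prediction) ** 2 for i in idx) / len(idx)


scores = nested_cv(len(y), candidates=[0.5, 1.0], fit=fit, score=score)
print(scores)
```

The outer scores estimate the performance of the whole composite procedure, including the selection step, rather than of one fixed model.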
12
Optimal (unique) gene signatures?
RASHOMON AND THE MULTIPLICITY OF GOOD MODELS
Leo Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science, 16: 199–231
…We showed that, in fact, the resulting set of genes is not unique; it is strongly influenced by the
subset of patients used for gene selection. Many equally predictive lists could have been produced
from the same analysis. Three main properties of the data explain this sensitivity: (1) many genes are
correlated with survival; (2) the differences between these correlations are small; (3) the correlations
fluctuate strongly when measured over different subsets of patients.
Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, Eytan Domany (2005) Bioinformatics, 21: 171–178
Knowledge about the disease processes is too sparse to propose a
comprehensive model.
It is necessary to compare the predictive quality of competing prognostic
models.
Large data sets: Most gene signatures are developed with fewer than 300 patients. Large
trials are under way.
13
Choice of strategy
Reanalysis of Huang et al. (2003) Lancet, 361:1590–1596
Ruschhaupt et al. (2004) SAGMB, Vol. 3, Article 37
SVM – support vector machine
RF – random forest
PAM – shrunken centroids
PLR – penalized logistic regression
BBT – Bayesian binary trees
M – metagenes, method for dimension reduction
Patient without recurrence
Patient with recurrence
14
External validation: Transportability
Training data: HGU 133 A&B, 163 patients (Munich)
Validation data (I): HGU 133 plus, 79 patients, different study (Munich)
Validation data (II): HGU 133 A&B, 64 patients, different study group (Cleveland)
No convenience samples!
Metzeler KH et al. (2008) Blood
15
External validation: Transportability
[Figure: overall survival, validation data (I): HGU 133 plus, 79 patients, different study (Munich)]
[Figure: overall survival, validation data (II): HGU 133 A&B, 64 patients, different study group (Cleveland)]
Metzeler KH et al. (2008) Blood
16
Functional interpretation
Biological information on features of the disease process is hidden in the gene
signature.
Naïve interpretation may not be helpful:
… The connection between the metagene predictors and genes for interferons is intriguing in view of the
role of interferons as mediators of the antitumour response and the fact that many genes involved in T-cell
function (TCRA, CD3D, IL2R, MHC) are also included within the group that predict lymph-node metastasis.
Huang et al. (2003), The Lancet, 361: 1590-1596
More systematic approach:
Hummel et al. (2008) Association between a Prognostic Gene Signature and Functional
Gene Sets, Bioinformatics and Biological Insights.
17
Functional interpretation
KEGG pathway ’acute myeloid leukemia’
(hsa05221).
Red boxes mark involved genes that
correlate significantly with at least one of
the signature genes.
Blue boxes mark genes that show a
significant partial correlation (in the gene
association network) to at least one of the
signature genes.
Result of hierarchical variable selection
for 15 cancer-specific KEGG pathways.
Meinshausen N. (2008). Hierarchical
testing of variable importance.
Biometrika, 95(2): 265-278.
Rows indicate pathways; columns show
the 67 signature genes. Squares are
dark gray rather than light gray if there is
a significant influence of that signature
gene on that pathway (adjusted p-value = 0.0067).
18
Statistical Modeling: The Two Cultures
Breiman L (2001) Statistical Science, 16:199–231
There are two cultures in the use of statistical modeling to reach conclusions
from data. One assumes that the data are generated by a given stochastic data
model. The other uses algorithmic models and treats the data mechanism as
unknown.
• The primary goal is not interpretability, but accurate information for a specific
purpose.
• Interpretability is a way of getting information. But a model does not have to
be simple to provide reliable information about the relation between predictor
and response variables.
• There are measures which quantify predictive quality. Competing predictive
tools can be compared. Predictive practice for a specified purpose can be
improved.
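One commonly used measure of predictive quality for survival outcomes is the concordance index. The sketch below is a simplified version of Harrell's C, given as an illustration: the two risk scores and the follow-up data are hypothetical, and pairs with tied times are simply ignored.

```python
def concordance_index(times, events, risk_scores):
    """
    times: observed follow-up times; events: 1 if the event was observed, 0 if censored;
    risk_scores: higher score means worse predicted prognosis.
    """
    concordant = 0.0
    usable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is usable if patient i had an observed event before patient j's time.
            if events[i] == 1 and times[i] < times[j]:
                usable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / usable


# Hypothetical validation data: four patients, the third one censored.
times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
signature_a = [0.9, 0.7, 0.4, 0.1]  # risk scores from one competing tool
signature_b = [0.1, 0.7, 0.4, 0.9]  # risk scores from another competing tool

c_a = concordance_index(times, events, signature_a)
c_b = concordance_index(times, events, signature_b)
print(c_a, c_b)
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, so two competing signatures can be compared on the same validation patients.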
19
Reproducible statistical analyses
Ruschhaupt M, Huber W, Poustka A, Mansmann U. (2004) A compendium to ensure computational
reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol. 3:Article 37.
• In statistics, the ability to document both programming language coding as
well as mathematical thought is critical to understandable, explainable, and
reproducible data analysis.
• Publishing results in the traditional paper-based way in a journal hides too
much information. Compendia can provide the insights needed to plan
future projects.
• For a scientist planning a prognostic study on a molecular signature, the
compendium offers a complete framework for the design, analysis and
reporting of the study. A compendium allows sensitivity analyses of a given
problem and helps in planning new project steps.
There is a tendency to accept seemingly realistic computational results, as presented by figures and
tables, without any proof of correctness.
Leisch / Rossini (2003) Chance, 16:41-46
20
How to report data?
• High-ranking journals request authors to publish their microarray data.
• Two prominent repositories: Gene Expression Omnibus (GEO, NIH), ArrayExpress (EBI)
• Several ncAML prognostic studies with microarrays have been reported.
• Data in the repositories are deficient:
- deficient ZIP files
- no original microarray data (only normalized version)
- no relevant clinical data (established prognostic factors)
• Data in the repositories are useless for validation purposes.
• Direct contact with study groups is needed.
21
Transfer programs for gene signatures in clinical
prognosis
• Simon R. Development and validation of therapeutically relevant multi-gene biomarker
classifiers, Journal of the National Cancer Institute 97:866-7, 2005.
• Simon R. Bioinformatics in cancer therapeutics hype or hope? Nature Clinical Practice
Oncology 2:223, 2005.
• Simon R. Roadmap for Developing and Validating Therapeutically Relevant Genomic
Classifiers J Clin Oncol.2005; 23: 7332-7341
• Dupuy A, Simon R. Critical Review of Published Microarray Studies for Cancer
Outcome and Guidelines on Statistical Analysis and Reporting,
Journal of the National Cancer Institute 99:147-157, 2007.
22
Superstitions
• The gene signature is the direct image of the biological reality governing a
disease process.
• Forget about the algorithm, the gene set is the focus!
I can build a prognostic tool from the gene set, but it will be different from the tool
which was the starting point! The algorithms are not compared!
• The proposed signature is optimal
• Heuristic dimension reduction does not bias the gene signature
• Forget about standard prognostic factors! Microarray information is enough!
23
Summary
• The association between patient characteristics and outcome must be expressed
through an explicit algorithm.
• Awareness of the complex algorithmic task is needed.
• Comparing the results between different algorithmic strategies helps to gain
confidence in the proposed solution of the complex task.
• The functional interpretation of a gene signature is a complex statistical task of
its own. No experience exists so far on how to proceed.
• Need to compare the predictive quality of competing proposals.
• There is enough methodological guidance to produce a credible candidate
as a starting point for a transfer into clinical use.
• Need to delineate transfer programs for complex gene signatures into clinical
prognosis. Transfer the prognostic finding to an easy-to-use routine technology
and demonstrate reproducibility.
• Need for Phase III prognostic studies which assess the benefit of using the
signatures to adapt individual treatment.
24