High-dimensional Prognosis: Developing a gene signature from a very large number of potential predictors
Ulrich Mansmann
IBE, LMU München
[email protected]

Is it important to decipher the heterogeneity of "normal karyotype AML"?
Metzeler KH et al. (2008) Blood
Almost half of adult acute myelogenous leukemia (AML) is cytogenetically normal, and this subgroup shows a remarkable heterogeneity of genetic mutations at the molecular level and an intermediate response to therapy.
The finding of recurrent cytogenetic abnormalities has been a primary influence on the understanding and treatment of leukemias. "Normal karyotype AML" lacks such obvious abnormalities, but has a variety of prognostically important submicroscopic genetic abnormalities.
NPM1 and FLT3 mutations are established factors which influence prognosis.
Is it possible to detect patterns of genetic activity with a strong influence on prognosis in addition to the known genetic mutations?
Oncologists need improved tools for selecting treatments for individual patients.

What can be done?
• There is hardly any guidance from biologists on how to disentangle cellular processes with regard to their effects on the disease course → black box
• There is no established cellular paradigm of certain tumors which can be represented in a prognostic system.
• There is no thorough statistical experience as to which algorithm should be used when developing a prognostic gene signature.
• There is a lot of arbitrariness in setting up a specific strategy for the project.
• Principles which shield the data analyst from failing are not common knowledge.
• Biotechnologies with different concepts can produce data.
[Diagram labels: Mutations, Copy number changes, Translocations, Expression profile, Prognosis]

What is a gene signature?
1. A set of genes
2. An algorithm which transforms measured gene expression into a prognostic statement.
In general, the gene set is published and no information is available about the algorithm.
People generally ignore the algorithm and do not have a clear perception of its nature.

Project road map
Developing the gene signature / Applying it to new patients / Functional interpretation
• Normalization
• Preprocessing of data: yes/no/which
• Choice of prognostic algorithm
• How to avoid overfitting?
• Complexity of algorithm and measurement process
• Normalization
• Interpretation in terms of the disease process
• What are useful strategies?

Elementary blunders to be avoided:
• Lack of specification of the process used to derive the model. Without such specification, it is difficult to judge the appropriateness of the process → reproducible statistics
• Small sample sizes: the importance of having an adequate number of subjects is still not well understood.
• Do not use a convenience sample; use a typical clinical patient population with delineated patient selection criteria.

Validation
Justice AC, Covinsky KE, Berlin JA (1999) Assessing the Generalizability of Prognostic Information, Ann Intern Med. 130:515-524.
The purpose of validation is to show that the procedure is fit for purpose.

Normalisation
A specific step to remove the systematic biases which are inherent to the production of microarray data.
Broad question: How do we compare results across chips?
Focused goal: Getting numbers (quantifications) from one chip to mean the same as numbers from another chip.
• Normalization acts on a group of arrays. Derived gene signatures are only valid within the normalization setting.
• Information on the normalization process has to be communicated so that future data can be put into the context of the normalization on which the derived gene signature is based.
• In general, this information is not communicated in published gene signature papers. People only communicate the gene set.
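
A minimal sketch of why the normalization setting has to travel with the signature. It assumes quantile normalization as the method, which the slides do not name; the function names `fit_reference` and `normalize` and the toy intensities are hypothetical. The reference distribution is estimated once on the training chips and then reused unchanged for any new chip.

```python
from statistics import mean

def fit_reference(training_chips):
    """Estimate the frozen reference: mean of the sorted intensities across training chips."""
    sorted_chips = [sorted(chip) for chip in training_chips]
    return [mean(values) for values in zip(*sorted_chips)]

def normalize(chip, reference):
    """Replace each intensity by the reference value of the same rank.

    Ties are broken by position; a real implementation would average tied ranks.
    """
    order = sorted(range(len(chip)), key=lambda i: chip[i])
    normalized = [0.0] * len(chip)
    for rank, idx in enumerate(order):
        normalized[idx] = reference[rank]
    return normalized

# Toy training chips (4 probes each); values are invented for illustration.
training_chips = [
    [2.0, 4.0, 6.0, 8.0],
    [3.0, 9.0, 5.0, 7.0],
]
reference = fit_reference(training_chips)

# A new patient's chip, measured on a different intensity scale.
new_chip = [40.0, 10.0, 30.0, 20.0]
normalized_new_chip = normalize(new_chip, reference)
```

Without the stored `reference`, the new chip could only be normalized against other new chips, and its numbers would no longer mean the same as those the signature was trained on.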
Normalisation
[Figure]

Preprocessing
Procedures used in reducing an unmanageably large set of molecular data to a more manageable, but perhaps still quite large, number of (summary) features to be used in further development:
• Metagenes (Mike West): Collapse genes with similar expression profiles into an artificial metagene by K-means
• Univariate tests
• Use genes with large variability
• Use of subject knowledge
• and much more …
In general, researchers do not see preprocessing as part of the prognostic research. But these steps have a profound effect on the later high-level analyses.

High-level analyses: Choice of central algorithm
Use algorithms with inbuilt regularization features:
• Elastic nets: combination of ridge and lasso regression
Zou, Hastie (2005) JRSS B, 67:301-320
Practical algorithms for Cox regression and GLMs by J. Goeman
http://cran.r-project.org/web/packages/penalized/index.html
• PCA: Semi-Supervised Methods to Predict Patient Survival from Gene Expression Data
Bair E, Tibshirani R (2004) PLoS Biol. 2:E108.

Internal validation: How to avoid overfitting?
• The algorithm is a composite procedure. There is a lack of understanding of how the components influence each other and how they influence the quality of the final result.
• The choice of its elements is quite subjective and arbitrary.
• A multi-layer cross-validation approach is needed:
1.) Determination of internal model parameters
2.) Selection from a set of suitable algorithms
3.) Validating the chosen candidate

Optimal (unique) gene signatures?
RASHOMON AND THE MULTIPLICITY OF GOOD MODELS
Leo Breiman (2001) Statistical Modeling: The Two Cultures, Statistical Science, 16: 199–231
…We showed that, in fact, the resulting set of genes is not unique; it is strongly influenced by the subset of patients used for gene selection. Many equally predictive lists could have been produced from the same analysis.
Three main properties of the data explain this sensitivity: (1) many genes are correlated with survival; (2) the differences between these correlations are small; (3) the correlations fluctuate strongly when measured over different subsets of patients.
Liat Ein-Dor, Itai Kela, Gad Getz, David Givol, Eytan Domany (2005) Bioinformatics, 21: 171–178
Knowledge about the disease processes is too sparse to propose a comprehensive model.
It is necessary to compare the predictive quality of competing prognostic models.
Large data sets: Most gene signatures are developed with fewer than 300 patients. Large trials are under way.

Choice of strategy
Reanalysis of Huang et al. (2003) Lancet, 361:1590–1596
Ruschhaupt et al. (2004) SAGMB, Vol. 3, Article 37
SVM – support vector machine
RF – random forest
PAM – shrunken centroids
PLR – penalized logistic regression
BBT – Bayesian binary trees
M – metagenes, method for dimension reduction
[Figure legend: Patient without recurrence, Patient with recurrence]

External validation: Transportability
Training data: HGU 133 A&B, 163 patients (Munich)
Validation data (I): HGU 133 plus, 79 patients, different study (Munich)
Validation data (II): HGU 133 A&B, 64 patients, different study group (Cleveland)
No convenience samples!
Metzeler KH et al. (2008) Blood

External validation: Transportability
[Figure: Overall survival. Validation data (I): HGU 133 plus, 79 patients, different study (Munich)]
[Figure: Overall survival. Validation data (II): HGU 133 A&B, 64 patients, different study group (Cleveland)]
Metzeler KH et al. (2008) Blood

Functional interpretation
Biological information on features of the disease process is hidden in the gene signature.
Naïve interpretation may not be helpful:
… The connection between the metagene predictors and genes for interferons is intriguing in view of the role of interferons as mediators of the antitumour response and the fact that many genes involved in T-cell function (TCRA, CD3D, IL2R, MHC) are also included within the group that predict lymph-node metastasis.
Huang et al. (2003), The Lancet, 361: 1590-1596
More systematic approach: Hummel et al. (2008) Association between a Prognostic Gene Signature and Functional Gene Sets, Bioinformatics and Biological Insights.

Functional interpretation
[Figure: KEGG pathway 'acute myeloid leukemia' (hsa05221). Red boxes mark involved genes that correlate significantly with at least one of the signature genes. Blue boxes mark genes that show a significant partial correlation (in the gene association network) to at least one of the signature genes.]
[Figure: Result of hierarchical variable selection for 15 cancer-specific KEGG pathways. Rows indicate pathways; columns show the 67 signature genes. Squares are dark gray rather than light gray if there is a significant influence of that signature gene on that pathway (adjusted p-value = 0.0067).]
Meinshausen N. (2008). Hierarchical testing of variable importance. Biometrika, 95(2): 265-278.

Statistical Modeling: The Two Cultures
Breiman L (2001) Statistical Science, 16:199–231
There are two cultures in the use of statistical modeling to reach conclusions from data. One assumes that the data are generated by a given stochastic data model. The other uses algorithmic models and treats the data mechanism as unknown.
• The primary goal is not interpretability, but accurate information for a specific purpose.
• Interpretability is a way of getting information. But a model does not have to be simple to provide reliable information about the relation between predictor and response variables.
• There are measures which quantify predictive quality. Competing predictive tools can be compared.
Predictive practice for a specified purpose can be improved.

Reproducible statistical analyses
Ruschhaupt M, Huber W, Poustka A, Mansmann U. (2004) A compendium to ensure computational reproducibility in high-dimensional classification tasks. Stat Appl Genet Mol Biol. 3:Article 37.
• In statistics, the ability to document both the program code and the mathematical reasoning is critical to understandable, explainable, and reproducible data analysis.
• Publishing results in the traditional paper-based way in a journal hides too much information. Compendia can provide the insights needed to plan future projects.
• For a scientist planning a prognostic study on a molecular signature, the compendium offers a complete framework for the design, analysis and reporting of the study. A compendium allows sensitivity analyses of a given problem and helps in planning new project steps.
There is a tendency to accept seemingly realistic computational results, as presented by figures and tables, without any proof of correctness.
Leisch / Rossini (2003) Chance, 16:41-46

How to report data?
• High-ranking journals request authors to publish their microarray data.
• Two prominent repositories: Gene Expression Omnibus (NIH), ArrayExpress (EBI)
• Several ncAML prognostic studies with microarrays have been reported.
• Data in the repositories are deficient:
- deficient ZIP files
- no original microarray data (only the normalized version)
- no relevant clinical data (established prognostic factors)
• Data in the repositories are useless for validation purposes.
• Direct contact with the study groups is needed.

Transfer programs for gene signatures in clinical prognosis
• Simon R. Development and validation of therapeutically relevant multi-gene biomarker classifiers, Journal of the National Cancer Institute 97:866-7, 2005.
• Simon R. Bioinformatics in cancer therapeutics: hype or hope? Nature Clinical Practice Oncology 2:223, 2005.
• Simon R.
Roadmap for Developing and Validating Therapeutically Relevant Genomic Classifiers, J Clin Oncol. 23:7332-7341, 2005.
• Dupuy A, Simon R. Critical Review of Published Microarray Studies for Cancer Outcome and Guidelines on Statistical Analysis and Reporting, Journal of the National Cancer Institute 99:147-157, 2007.

Superstitions
• The gene signature is the direct image of the biological reality governing a disease process.
• Forget about the algorithm, the gene set is the focus! I can build a prognostic tool from the gene set, but it will be different from the tool which was the starting point! The algorithms are not compared!
• The proposed signature is optimal.
• Heuristic dimension reduction does not bias the gene signature.
• Forget about standard prognostic factors! Microarray information is enough!

Summary
• The association between patient characteristics and outcome must be expressed through an explicit algorithm.
• Awareness of the complex algorithmic task is needed.
• Comparing the results of different algorithmic strategies helps to gain confidence in the proposed solution of the complex task.
• The functional interpretation of a gene signature is a complex statistical task of its own. So far there is no experience on how to proceed.
• The predictive quality of competing proposals needs to be compared.
• There is enough methodological guidance to produce a credible candidate as a starting point for a transfer into clinical use.
• Transfer programs for complex gene signatures into clinical prognosis need to be delineated. Transfer the prognostic finding to an easy-to-use routine technology and demonstrate reproducibility.
• Phase III prognostic studies are needed which assess the benefit of using the signatures to adapt individual treatment.
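
The multi-layer cross-validation called for under internal validation can be sketched as follows. This is a minimal illustration on synthetic data, not the procedure used in the cited studies: the nearest-centroid rule, the ranking of genes by difference in class means, and all function names are assumptions chosen to keep the example self-contained. Gene selection is repeated inside every training fold, the inner loop tunes the number of genes, and only the outer loop is used to judge the chosen candidate.

```python
import random
from statistics import mean

def select_genes(X, y, k):
    """Rank genes by absolute difference of class means; keep the top k."""
    def score(j):
        m1 = mean(x[j] for x, label in zip(X, y) if label == 1)
        m0 = mean(x[j] for x, label in zip(X, y) if label == 0)
        return abs(m1 - m0)
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def fit(X, y, k):
    """Gene selection plus class centroids, both learned from (X, y) only."""
    genes = select_genes(X, y, k)
    centroids = {
        c: [mean(x[j] for x, label in zip(X, y) if label == c) for j in genes]
        for c in (0, 1)
    }
    return genes, centroids

def predict(model, x):
    """Assign the class whose centroid is closest on the selected genes."""
    genes, centroids = model
    def dist(c):
        return sum((x[j] - m) ** 2 for j, m in zip(genes, centroids[c]))
    return min((0, 1), key=dist)

def folds(n, n_folds):
    """Deterministic interleaved folds: fold i holds indices i, i + n_folds, ..."""
    return [list(range(i, n, n_folds)) for i in range(n_folds)]

def cv_accuracy(X, y, k, n_folds):
    """Plain cross-validated accuracy for a fixed number of genes k."""
    hits = 0
    for test in folds(len(X), n_folds):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        model = fit([X[i] for i in train], [y[i] for i in train], k)
        hits += sum(predict(model, X[i]) == y[i] for i in test)
    return hits / len(X)

def nested_cv(X, y, candidate_ks, outer_folds=5, inner_folds=4):
    """Outer loop validates; inner loop (training part only) chooses k."""
    hits = 0
    for test in folds(len(X), outer_folds):
        held_out = set(test)
        train = [i for i in range(len(X)) if i not in held_out]
        X_tr, y_tr = [X[i] for i in train], [y[i] for i in train]
        best_k = max(candidate_ks,
                     key=lambda k: cv_accuracy(X_tr, y_tr, k, inner_folds))
        model = fit(X_tr, y_tr, best_k)
        hits += sum(predict(model, X[i]) == y[i] for i in test)
    return hits / len(X)

# Synthetic data: 40 patients, 50 genes, only the first 3 genes carry signal.
rng = random.Random(0)
n, p = 40, 50
y = [i % 2 for i in range(n)]
X = [[rng.gauss(3.0 if (label == 1 and j < 3) else 0.0, 1.0) for j in range(p)]
     for label in y]
accuracy = nested_cv(X, y, candidate_ks=[2, 5, 10])
```

Selecting genes once on all patients and cross-validating only the classifier would let the held-out patients influence the gene set, so the resulting accuracy would be optimistic.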