General Data Analysis Issues and Approaches in Metabolomics

Bruce S. Kristal, Ph.D.
Department of Neurosurgery, Brigham and Women’s Hospital
Department of Surgery, Harvard Medical School (Pending)
Secretary, Metabolomics Society

“…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue.”
R.A. Fisher

Working Definitions
• Statistics: What is the probability that what was observed occurred by chance?
• Informatics: What was observed?

Data vs Information
[Figure labels: Data; Information]

Can you group these?
Partitional Clustering

Can you group these?
Hierarchical Clustering

How much information is enough?
Principal Components Analysis

Given experience, what can we know about unknowns?
[Figure labels: Probably Sad; Probably Happy]
Pattern Recognition

Megavariate Analysis
• Clustering
• Principal components
• Pattern recognition

HUMANS DO MEGAVARIATE ANALYSIS INNATELY

What we don’t do so well…

What is Multi-/Megavariate Analysis?
• Simplifying large data sets for human consideration
  – Clustering and Principal Components
• Pattern Recognition:
  – Classifying unknowns into previously defined groups

What is Multi-/Megavariate Analysis?
• Data-mining
  – How many customers who buy pretzels also buy potato chips?
• Estimation and prediction
  – Multivariate regression
• Which variables are most important?
• Mathematical modeling
• Outlier diagnostics
• Enables data-driven approaches

Why do it?
Omics datasets are otherwise beyond human comprehension

Informatics in Metabolomics
[Figure. Panel titles: Sample Collection; Sample Analysis (chromatogram, Response (µA) vs. Retention time (minutes)); Database Curation; Objectively Defining Class Identity; Computational Modeling of Metabolic Serotypes (Observed Values vs. Predicted Values, samples AL1–AL8 and DR1–DR8, with 2 SD and 3 SD bands); Following Biochemical Pathways; Bioinformatics; Modeling Metabolic Interactions. Applications listed: Mechanistic Insight, Drug Development, Toxicology, Classification, Prediction, Functional genomics, Sub-threshold studies, Others.]

Informatics: An example classification workflow
• Data Validation, Data Normalization, Missing Data Decisions, Inclusion/Exclusion Criteria
• Subgroups, Class-specific models
• Outlier removal, scaling, transformations
• Unsupervised: Clustering, SOMs, PCA
• Supervised: kNN, SIMCA, PLS, PLS-DA, Random Forest
• Machine learning: Neural Nets, GAs, GPs
• Overfit tests, Internal validation, optimization, External validation, optimization, 2° validation

Practicality important – not theory

Multivariate Analysis is Easy But…
Art – Not Science
Multiple Approaches
• Mathematical robustness
• Megavariate analysis is not word processing
• Different algorithms see different things!
• Different answers can be both right, or both wrong

Multivariate Analysis can be easy – or too easy

“…the statistician’s task, in fact, is limited to the extraction of the whole of the available information on any particular issue.”
R.A. Fisher

“THE” Problem: Overfitting
• Beware the power of today’s tools
  – PLS-DA/O-PLS
  – GAs/GPs, neural nets, machine learning
• Try to understand your tools
  – At least conceptually
  – PCA and selective reporting
    • Choosing components is not objective
    • Beware of “low value” components
  – Clustering and rotations
    • DO NOT search until you like what you see
  – Choosing multiple tools/conditions is fine – in the model building phase

“Solutions”
• Data analysis is not word processing
• Permutation testing is a step in the right direction
• The gold standard is biological replication
• Training sets and test sets should have no members in common
  – Rarely recognized
  – Not always possible…
• Set up design as rigorously as possible
  – In advance…
• Our definition:
  – Training sets are proof of principle
  – Test sets are, theoretically, validation

Three “final” thoughts
• There is an inherent statistical and informatics minefield that arises when the number of variables queried far exceeds the number of observations (“N vs P problem”)
• Caution: mathematical validation is NOT biological validation
• Report what you do

Informatics: An example classification workflow (repeated)
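
The overfitting and “Solutions” slides can be illustrated with a small simulation. The sketch below is not taken from the slides: it uses a hypothetical nearest-centroid classifier as a stand-in for tools such as PLS-DA, simulated pure-noise data with far more variables than samples, and class labels "AL" and "DR" borrowed from the sample names in the figure. All function names are assumptions made for illustration. It fits the model on a training set, scores it on a test set with no members in common, and runs a permutation test by refitting on shuffled training labels.

```python
import random


def fit_centroids(X, y):
    """Nearest-centroid 'model': the mean vector of each class in the training set."""
    centroids = {}
    for label in sorted(set(y)):
        rows = [x for x, lab in zip(X, y) if lab == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared Euclidean distance)."""
    def sqdist(label):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[label]))
    return min(centroids, key=sqdist)


def accuracy(centroids, X, y):
    """Fraction of rows in X whose predicted class matches y."""
    hits = sum(predict(centroids, x) == lab for x, lab in zip(X, y))
    return hits / len(y)


def permutation_test(X_train, y_train, X_test, y_test, n_perm=200, seed=0):
    """Compare test-set accuracy against models trained on shuffled labels."""
    rng = random.Random(seed)
    observed = accuracy(fit_centroids(X_train, y_train), X_test, y_test)
    as_good = 0
    for _ in range(n_perm):
        shuffled = list(y_train)
        rng.shuffle(shuffled)  # breaks any real link between labels and variables
        if accuracy(fit_centroids(X_train, shuffled), X_test, y_test) >= observed:
            as_good += 1
    # Add-one correction so the estimated p-value is never exactly zero.
    return observed, (as_good + 1) / (n_perm + 1)


def noise_dataset(n_per_class, n_vars, rng):
    """Simulated pure noise: the labels carry no information about the variables."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_vars)] for _ in range(2 * n_per_class)]
    y = ["AL"] * n_per_class + ["DR"] * n_per_class
    return X, y


if __name__ == "__main__":
    rng = random.Random(1)
    # 16 training samples, 200 variables: the "N vs P problem" in miniature.
    X_train, y_train = noise_dataset(8, 200, rng)
    X_test, y_test = noise_dataset(8, 200, rng)  # disjoint from the training set
    model = fit_centroids(X_train, y_train)
    print("training accuracy:", accuracy(model, X_train, y_train))
    obs, p = permutation_test(X_train, y_train, X_test, y_test)
    print("test accuracy:", obs, "permutation p-value:", p)
```

With this many variables and so few samples, accuracy on the training set itself is typically close to 1 even though the data are noise, which is the overfitting the slides warn about. The disjoint test set and the shuffled-label comparison are what expose it. As the slides caution, passing such a check is mathematical validation only, not biological validation.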