Strategies for Metabolomic Data Analysis
Dmitry Grapov, PhD

Goals?

Metabolomics

Analytical Dimensions
[Figure: axes labeled variables and samples]

Analyzing Metabolomic Data
• Pre-analysis
• Data properties
• Statistical approaches
• Multivariate approaches
• Systems approaches

Pre-analysis
Data quality metrics
• precision
• accuracy
Remedies
• normalization
• outlier detection
• missing value imputation

Normalization
• sample-wise
  – sum, adjusted
• measurement-wise
  – transformation (normality)
  – encoding (trigonometric, etc.)
[Figure: labels mean, standard deviation]

Outliers
• single measurements (univariate)
• two compounds (bivariate)

Outliers
• univariate/bivariate vs. multivariate
[Figure annotations: outliers? mixed-up samples]

Transformation
• logarithm (shifted)
• power (Box-Cox)
• inverse
Quantile-quantile (Q-Q) plots give a useful visual overview of variable normality.
[Figure: Q-Q plots for X and X^-0.5]

Missing Values Imputation
Why is it missing?
• random
• systematic
  – analytical
  – biological
Imputation methods
• single value (mean, min, etc.)
• multiple
• multivariate
[Figure panels: mean, PCA]

Goals for Data Analysis
Exploration
• Are there any trends in my data?
  – analytical sources
  – meta data/covariates
• Useful Methods
  – matrix decomposition (PCA, ICA, NMF)
  – cluster analysis
Classification
• Differences/similarities between groups?
  – discrimination, classification, significant changes
• Useful Methods
  – analysis of variance (ANOVA)
  – partial least squares discriminant analysis (PLS-DA)
  – Others: random forest, CART, SVM, ANN
Prediction
• What is related to or predictive of my variable(s) of interest?
• Useful Methods
  – correlation
  – regression

Data Structure
• univariate: a single variable (1-D)
• bivariate: two variables (2-D)
• multivariate: more than two variables (m-D)
• Data Types
  – continuous
  – discrete
    – binary

Data Complexity
• Experimental design = complexity
• Variable # = dimensionality
[Figure: Data and Meta Data matrices labeled n, m, samples, variables; 1-D, 2-D, m-D]

Univariate Analyses
univariate properties
• length
• center (mean, median, geometric mean)
• dispersion (variance, standard deviation)
• range (min / max)
[Figure: labels mean, standard deviation]

Univariate Analyses
• sensitive to distribution shape
• parametric = assumes normality
• error in Y, not in X (Y = mX + error)
• optimal for long data
• assumed independence
• false discovery rate
[Figure: data shapes labeled wide, long, n-of-one]

False Discovery Rate (FDR)
Univariate approaches do not scale well.
• Type I Error: False Positives
• Type II Error: False Negatives
• Type I risk = 1 - (1 - p.value)^m, where m = number of variables tested

FDR correction
Example:
• Design: 30 samples, 300 variables
• Test: t-test
• FDR method: Benjamini and Hochberg (fdr) correction at q = 0.05
Results
• FDR adjusted p-values (fdr) or estimate of FDR (Fdr, q-value)
Bioinformatics (2008) 24 (12):1461-1462

Achieving “significance” is a function of:
• significance level (α) and power (1-β)
• effect size (standardized difference in means)
• sample size (n)

Bivariate Data
relationship between two variables
• correlation (strength)
• regression (predictive)
[Figure panels: regression, correlation]

Correlation
• Parametric (Pearson) or rank-order (Spearman, Kendall)
• correlation is covariance scaled between -1 and 1

Correlation vs. Regression
Regression describes the least-squares or best-fit line for the relationship (Y = m*X + b).

Bivariate Example
Goal: Don’t miss the eruption!
Data (Old Faithful, Yellowstone, WY)
• time between eruptions: 70 ± 14 min
• duration of eruption: 3.5 ± 1 min
Azzalini, A. and Bowman, A. W. (1990). A look at some data on the Old Faithful geyser. Applied Statistics 39, 357–365

Bivariate Example
Two-cluster pattern for both duration and frequency

Bivariate Example
Noted deviations from the two-cluster pattern
– Outliers?
– Covariates?

Covariates
Trends in data which mask primary goals can be accounted for using covariate adjustment and appropriate modeling strategies.

Bivariate Example
Noted deviations from the two-cluster pattern can be explained by a covariate: hydrofracking.
Covariate adjustment is an integral aspect of statistical analyses (e.g. ANCOVA).

Summary
Data exploration and pre-analysis:
• increase robustness of results
• guard against spurious findings
• can greatly improve primary analyses
Univariate statistics:
• are useful for identification of statistically significant changes or relationships
• are sub-optimal for wide data
• are best when combined with advanced multivariate techniques

Resources
Web-based data analysis platforms
• MetaboAnalyst (http://www.metaboanalyst.ca/MetaboAnalyst/faces/Home.jsp)
• MeltDB (https://meltdb.cebitec.uni-bielefeld.de/cgi-bin/login.cgi)
Programming tools
• The R Project for Statistical Computing (http://www.r-project.org/)
• Bioconductor (http://www.bioconductor.org/)
GUI tools
• imDEV (http://sourceforge.net/projects/imdev/?source=directory)
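
Worked example: Type I risk and FDR correction
The Type I risk formula and the Benjamini and Hochberg correction from the False Discovery Rate material can be checked numerically. The following is a minimal sketch in plain Python, not the presenter's code; the function names and the five demo p-values are hypothetical and chosen only for illustration, and the risk formula assumes independent tests.

```python
def family_wise_type1_risk(alpha, m):
    """Chance of at least one false positive across m independent tests."""
    return 1.0 - (1.0 - alpha) ** m


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# 300 variables tested at alpha = 0.05, as in the example design.
risk = family_wise_type1_risk(0.05, 300)

# Hypothetical p-values, made up for this illustration only.
p_demo = [0.001, 0.008, 0.039, 0.041, 0.60]
q_demo = benjamini_hochberg(p_demo)

print("Type I risk for 300 tests:", risk)
print("BH-adjusted p-values:", q_demo)
```

With 300 tests the chance of at least one false positive is close to 1, which is why uncorrected univariate testing does not scale to wide data.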
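
Worked example: correlation vs. regression
The statements that correlation is covariance scaled between -1 and 1, and that regression describes the least-squares line Y = m*X + b, can be sketched in a few lines of plain Python. This is an illustrative sketch only; the helper names and the small x/y data set are hypothetical and are not the Old Faithful measurements.

```python
import math


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sample covariance of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def pearson(x, y):
    """Pearson correlation: covariance scaled by both standard deviations."""
    return covariance(x, y) / math.sqrt(covariance(x, x) * covariance(y, y))


def least_squares_line(x, y):
    """Slope m and intercept b of the least-squares fit Y = m*X + b."""
    m = covariance(x, y) / covariance(x, x)
    b = mean(y) - m * mean(x)
    return m, b


# Hypothetical data, made up for this illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]

r = pearson(x, y)
slope, intercept = least_squares_line(x, y)
print("correlation (strength):", r)
print("regression (predictive): Y = %.3f*X + %.3f" % (slope, intercept))
```

The correlation summarizes only the strength of the relationship, while the slope and intercept allow a new Y to be predicted from X.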
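
Worked example: pre-analysis remedies
Three of the pre-analysis remedies listed earlier (single-value mean imputation, sample-wise sum normalization, and a shifted logarithm transformation) can be sketched as follows. This is a minimal plain-Python illustration under stated assumptions: missing values are encoded as None, the shift of 1.0 is an arbitrary choice, and the function names and demo values are hypothetical.

```python
import math


def impute_mean(values):
    """Replace missing entries (None) in one variable with the observed mean."""
    observed = [v for v in values if v is not None]
    fill = sum(observed) / len(observed)
    return [fill if v is None else v for v in values]


def sum_normalize(sample):
    """Sample-wise normalization: divide each measurement by the sample total."""
    total = sum(sample)
    return [v / total for v in sample]


def shifted_log(values, shift=1.0):
    """Shifted logarithm, so that zero values remain defined."""
    return [math.log(v + shift) for v in values]


# Hypothetical measurements, made up for this illustration only.
one_variable = [12.0, None, 18.0, 15.0]   # one metabolite across samples
one_sample = [20.0, 30.0, 50.0]           # one sample across metabolites

print("imputed:", impute_mean(one_variable))
print("sum-normalized:", sum_normalize(one_sample))
print("shifted log:", shifted_log(one_sample))
```

Mean imputation is the simplest of the listed options; whether it is appropriate depends on why the value is missing (random vs. systematic).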