Canadian Bioinformatics Workshops
www.bioinformatics.ca

Module 6: Backgrounder in Statistical Methods
David Wishart

Schedule
             Day 1                                                  Day 2
8:30-10:30   Mod. 1: Introduction to Metabolomics                   Mod. 5: LC-MS Spectral Processing using XCMS
10:30-11:00  Coffee                                                 Coffee
11:00-12:30  Mod. 2: Metabolite ID and Quantification Pt. I + Lab   Mod. 6: Backgrounder in Statistical Methods
12:30-1:30   Lunch                                                  Lunch
1:30-3:00    Mod. 3: Metabolite ID and Annotation – Part II         Mod. 7: Metabolomic Data Analysis w. MetaboAnalyst
3:00-3:30    Coffee                                                 Coffee
3:30-5:00    Mod. 4: Databases for Chemical/Spectral Data           Mod. 8: Data Integration & Applications
5:00-6:30    Dinner                                                 Survey & Close Out
6:30-9:00    Integrated Assignment

Learning Objectives
• Learn about distributions and significance
• Learn about univariate statistics (t-tests and ANOVA)
• Learn about correlation and clustering
• Learn about multivariate statistics (PCA and PLS-DA)

Statistics
• "There are three kinds of lies: lies, damned lies, and statistics" - Benjamin Disraeli
• "98% of all statistics are made up" - Unknown
• "Statistics are like bikinis. What they reveal is suggestive, but what they conceal is vital" - Aaron Levenstein
• Statistics is the mathematics of impressions

Distributions & Significance

Univariate Statistics
• Univariate means a single variable
• If you measure a population using some single measure such as height, weight, test score, or IQ, you are measuring a single variable
• If you plot that single variable over the whole population, recording the frequency with which each value occurs, you get a bell curve

A Bell Curve
• Also called a Gaussian or Normal distribution
[Figure: frequency ("# of each") vs. height]

Features of a Normal Distribution
• Symmetric distribution
• Has an average or mean value (μ) at the centre
• Has a characteristic width called the standard deviation (σ)
• Most common type of distribution known

Normal Distribution
• Almost any set of biological or physical measurements will display some variation, and these will almost always follow a Normal distribution
• The larger the set of measurements, the more "normal" the curve
• The minimum set of measurements needed to get a normal-looking distribution is roughly 30-40

Gaussian Distribution
P(x) = 1/(σ√(2π)) · e^(−(x − μ)²/(2σ²))

Some Equations
• Mean: μ = Σxᵢ/N
• Variance: σ² = Σ(xᵢ − μ)²/N
• Standard deviation: σ = √(Σ(xᵢ − μ)²/N)

Standard Deviations (Z-values)
μ ± 1.0 S.D. covers 0.683 of the area; P(> μ + 1.0 S.D.) = 0.158
μ ± 2.0 S.D. covers 0.954;               P(> μ + 2.0 S.D.) = 0.023
μ ± 3.0 S.D. covers 0.9972;              P(> μ + 3.0 S.D.) = 0.0014
μ ± 4.0 S.D. covers 0.99994;             P(> μ + 4.0 S.D.) = 0.00003
μ ± 5.0 S.D. covers 0.999998;            P(> μ + 5.0 S.D.) = 0.000001

Significance
• Based on the Normal distribution, the probability that something is >1 SD away (larger or smaller) from the mean is about 32%
• The probability that something is >2 SD away (larger or smaller) from the mean is about 5%
• The probability that something is >3 SD away (larger or smaller) from the mean is about 0.3%

Significance
• In a test with a class of 400 students, if you score the average you typically receive a "C"
• If you score 1 SD above the average you typically receive a "B"
• If you score 2 SD above the average you typically receive an "A"

The P-value
• The p-value is the probability of obtaining a test statistic (a score, a set of events, a height) at least as extreme as the one that was actually observed
• One "rejects the null hypothesis" when the p-value is less than the significance level α, which is often 0.05 or 0.01
• When the null hypothesis is rejected, the result is said to be statistically significant

P-value
• If the average height of an adult (M+F) human is 5'7" and the standard deviation is 5", what is the probability of finding someone who is more than 6'10" (i.e. more than 3 SD above the mean)?
• If you choose an α of 0.05, is a 6'11" individual a member of the human species?
• If you choose an α of 0.01, is a 6'11" individual a member of the human species?

P-value
• If you flip a coin 20 times and the coin turns up heads 14/20 times, the probability of seeing at least this many heads from a fair coin is 60,460/1,048,576 ≈ 0.058
• If you choose an α of 0.05, is this a fair coin?
• If you choose an α of 0.10, is this a fair coin?
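The two worked p-value examples above can be checked numerically. A minimal sketch using only the Python standard library (the height and coin numbers follow the slides; `normal_tail` is a helper defined here, not a library call):

```python
import math

def normal_tail(z):
    """P(Z > z) for a standard normal variable, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Height example: mean 5'7" (67 in), SD 5"; 6'10" is 82 in, i.e. 3 SD above the mean.
z = (82 - 67) / 5
p_height = normal_tail(z)

# Coin example: P(at least 14 heads in 20 fair flips) = 60,460 / 1,048,576
p_coin = sum(math.comb(20, k) for k in range(14, 21)) / 2**20

print(round(p_height, 4), round(p_coin, 3))  # 0.0013 0.058
```

Both results fall below the one-sided 3-SD tail (0.0014) and the 0.058 figure quoted on the slide, respectively.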
Mean, Median & Mode
[Figure: skewed distribution with the mode, median, and mean marked]
• In a Normal distribution the mean, mode and median are all equal
• In skewed distributions they are unequal
• Mean - the average value; affected by extreme values in the distribution
• Median - the "middlemost" value; usually halfway between the mode and the mean
• Mode - the most common value

Different Distributions
• Unimodal vs. bimodal

Other Distributions
• Binomial distribution
• Poisson distribution
• Extreme value distribution
• Skewed or exponential distribution

Binomial Distribution
P(x) = C(n,x) pˣ qⁿ⁻ˣ, where the coefficients C(n,x) come from the expansion of (p + q)ⁿ (Pascal's triangle)

Poisson Distribution
P(x) = μˣ e^(−μ) / x!
[Figure: proportion of samples P(x) vs. x for μ = 0.1, 1, 2, 3, 10]

Extreme Value Distribution
• Arises from sampling the extreme end of a normal distribution
• A distribution which is "skewed" due to its selective sampling
• Skew can be either right or left

Skewed Distribution
• Resembles an exponential or Poisson-like distribution
• Lots of extreme values (outliers) far from the mean or mode
• Hard to do useful statistical tests with this type of distribution

Fixing a Skewed Distribution
• A skewed or exponentially decaying distribution can be transformed into a "normal" or Gaussian distribution by applying a log transformation
• This brings the outliers closer to the mean because it rescales the x-variable; it also makes the distribution much more Gaussian

Log Transformation
[Figure: the same data on a linear scale (skewed) and after log transformation (normal)]

Log Transformation on Real Data

Distinguishing 2 Populations
• Normals vs. Leprechauns
[Figure: two clearly separated height distributions] Are they different?

What about these 2 Populations?
[Figure: two heavily overlapping height distributions] Are they different?
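The log-transformation fix can be demonstrated with simulated data. A sketch using only the standard library (the log-normal parameters and the hand-rolled `skewness` helper are illustrative choices, not from the slides):

```python
import math
import random

random.seed(0)
# Simulated metabolite concentrations: log-normally distributed, i.e. right-skewed
raw = [random.lognormvariate(2.0, 1.0) for _ in range(1000)]
logged = [math.log(x) for x in raw]

def skewness(xs):
    """Sample skewness: near 0 for a symmetric (e.g. normal) distribution."""
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / n)
    return sum(((x - m) / s) ** 3 for x in xs) / n

# Strong positive skew before the transform, roughly symmetric after
print(skewness(raw), skewness(logged))
```

Taking the log turns the long right tail into a roughly Gaussian shape, which is exactly the "fix" the slide describes.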
Student's t-Test
• Often just called the t-test
• Used to determine if 2 populations are different
• Formally, it allows you to calculate the probability that 2 sample means are the same
• If the t-test gives you p = 0.4 and α is 0.05, you cannot conclude that the 2 populations are different
• If the t-test gives you p = 0.04 and α is 0.05, the 2 populations are significantly different
• Paired and unpaired t-tests are available: paired is used for "before & after" experiments, while unpaired is for 2 independently chosen samples

Student's t-Test
• A t-test can also be used to determine whether 2 clusters are different, if the clusters follow a normal distribution
[Figure: two clusters in a Variable 1 vs. Variable 2 scatter plot]

What if the Distributions are not Normal?

Mann-Whitney U-Test
• Also called the Wilcoxon Rank Sum Test
• Used to determine if 2 non-normally distributed populations are different
• More robust than the t-test when the data are not normal (though somewhat less powerful when they are)
• Formally, it allows you to calculate the probability that 2 sample medians are the same
• If the U-test gives you p = 0.4 and α is 0.05, you cannot conclude that the 2 populations are different
• If the U-test gives you p = 0.04 and α is 0.05, the 2 populations are significantly different

Distinguishing 3+ Populations
• Normals vs. Leprechauns vs. Elves
[Figure: three clearly separated height distributions] Are they different?

Distinguishing 3+ Populations
[Figure: three heavily overlapping height distributions] Are they different?
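Both two-sample tests are a few lines in practice. A sketch assuming SciPy and NumPy are installed (`ttest_ind` and `mannwhitneyu` are the actual SciPy function names; the height numbers are made up for illustration):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
normals = rng.normal(loc=170, scale=7, size=50)      # heights in cm
leprechauns = rng.normal(loc=60, scale=5, size=50)

t_stat, p_t = stats.ttest_ind(normals, leprechauns)   # unpaired t-test on the means
u_stat, p_u = stats.mannwhitneyu(normals, leprechauns)  # rank-based, no normality assumed

alpha = 0.05
print(p_t < alpha, p_u < alpha)  # both True: these populations clearly differ
```

For a paired "before & after" design you would use `stats.ttest_rel` (or `stats.wilcoxon` for the non-parametric version) instead.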
ANOVA
• Short for Analysis of Variance
• Used to determine if 3 or more populations are different; it is a generalization of the t-test
• Formally, ANOVA provides a statistical test (by looking at group variance) of whether or not the means of several groups are all equal
• Uses an F-statistic to test for significance
• 1-way, 2-way, 3-way and n-way ANOVAs exist; the most common is 1-way, which is only concerned with whether any of the 3+ populations are different, not which pair is different

ANOVA
• ANOVA can also be used to determine whether 3+ clusters are different, if the clusters follow a normal distribution
[Figure: three clusters in a Variable 1 vs. Variable 2 scatter plot]

Distinguishing N Populations (False Discovery Rate)
• Suppose you performed 100 different t-tests and found 20 results with a p-value < 0.05
• How many of these 20 "significant" results are likely to be false positives? At α = 0.05, about 100 × 0.05 = 5 of the 100 tests are expected to reach significance by chance alone
• To correct for this, a Bonferroni-style adjustment divides α by the number of tests, accepting only results with p < 0.05/100 = 0.0005

Example (Some Weather Predictions)
• P = 0.08  It will rain
• P = 0.05  It will be sunny
• P = 0.06  It will be foggy
• P = 0.02  It'll be cloudy
• P = 0.05  It will snow
• P = 0.07  It will be windy
• P = 0.06  It will be calm
• P = 0.09  It will hail
• P = 0.02  Lightning
• P = 0.16  Thunder
• P = 0.001 Eclipse
• P = 0.09  Tornado
• P = 0.18  Hurricane
• P = 0.05  Sleet
• 100% certainty it will do something tomorrow
• Only one prediction is significant with FDR or Bonferroni correction (Eclipse)

Normalization/Scaling
• What if we measured the top population using a ruler that was miscalibrated or biased (inches were short by 10%)?
We would get the following result:
[Figure: the two height distributions artificially shifted relative to each other]

Normalization
• Normalization adjusts for systematic bias in the measurement tool
• After normalization we would get:
[Figure: the two height distributions correctly aligned]

Normalization
• Normalization also has other meanings in statistics
• When working with univariate and multivariate statistics, normalization can also mean making the distribution look normal or Gaussian
• A key assumption in most statistical modeling is that the population is "normal" or Gaussian

Log Transformation for Normalization
[Figure: skewed distribution on a linear scale vs. normal distribution after log transformation]

Data Comparisons & Dependencies

Data Comparisons
• In many kinds of experiments we want to know what happened to a population "before" and "after" some treatment or intervention
• In other situations we want to measure the dependency of one variable against another
• In still others we want to assess how an observed property matches a predicted property
• In all cases we measure multiple samples or work with a population of subjects
• The best way to view this kind of data is through a scatter plot

A Scatter Plot

Scatter Plots
• If there is some dependency between the two variables, or a relationship between the predicted and observed variable, or the "before" and "after" treatments led to some effect, then it is possible to see clear patterns in the scatter plot
• This pattern or relationship is called correlation

Correlation
[Figure: scatter plots showing "+" correlation, no correlation, and "−" correlation; high, low, and perfect correlation]

Correlation Coefficient
r = Σ(xᵢ − μₓ)(yᵢ − μᵧ) / √( Σ(xᵢ − μₓ)² · Σ(yᵢ − μᵧ)² )
[Figure: scatter plots with r = 0.85, r = 0.4, r = 1.0]

Correlation Coefficient
• Sometimes called the coefficient of linear correlation or the Pearson product-moment correlation coefficient
• A quantitative way of determining what model (or equation or type of line) best fits a set of data
• Commonly used to assess most kinds of predictions, simulations, comparisons or dependencies

Correlation Coefficient vs. Coefficient of Determination
• R (correlation coefficient) vs. R² (coefficient of determination)
• R and R² are very different; do not confuse R with R²
• Do not call R² a correlation coefficient – THIS IS WRONG
• Avoid using R² in discussions or comparisons in scientific papers

Significance of Correlation
• r = 0.85 – is this significant? r = 0.99 – is this significant?

Significance & Correlation
• Add 2 more points to the plot and r = 0.99 can collapse to r = 0.05

Tricks to Getting Good (but Meaningless) Correlation Coefficients
• Use only data at the extreme ends of the curve or line (r = 0.95 – is this significant?)
• Use only a small number of "good" data points (r = 0.95 – is this significant?)

Student's t-Test (Again)
• The t-test can also be used to assess the statistical significance of a correlation
• It specifically determines whether the slope of the regression line is statistically different from 0
• As might be expected, more points in a scatter plot lead to more confidence in the correlation

Correlation and Outliers
• Experimental error or something important? A single "bad" point can destroy a good correlation

Outliers
• Can be both "good" and "bad"
• When modeling data you don't like to see outliers (suggests the model is bad)
• Often a good indicator of experimental or measurement errors – only you can know!
• When plotting metabolite concentration data you do like to see outliers
• A good indicator of something significant

Detecting Clusters
[Figure: height vs. weight scatter plot showing two apparent clusters]

Is it Right to Calculate a Correlation Coefficient?
• r = 0.73 for the combined height vs. weight data

Or is There More to This?
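The correlation coefficient formula above translates directly into code. A standard-library sketch (the height/weight values are made-up illustration data, and `pearson_r` is a helper defined here):

```python
import math

def pearson_r(xs, ys):
    """Pearson product-moment correlation, following the formula above."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mx) ** 2 for x in xs) *
                    sum((y - my) ** 2 for y in ys))
    return num / den

heights = [150, 160, 165, 172, 180, 185]   # cm
weights = [52, 60, 63, 70, 80, 86]         # kg
r = pearson_r(heights, weights)
print(round(r, 3))  # close to 1: strong positive correlation
```

Adding a single discordant point to these lists and recomputing `r` shows the slide's warning about outliers in action.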
[Figure: the same height vs. weight data, with separate male and female clusters]

Clustering Applications in Bioinformatics
• Metabolomics and Cheminformatics
• Microarray or GeneChip Analysis
• 2D Gel or ProteinChip Analysis
• Protein Interaction Analysis
• Phylogenetic and Evolutionary Analysis
• Structural Classification of Proteins
• Protein Sequence Families

Clustering
• Definition - a process by which objects that are logically similar in characteristics are grouped together
• Clustering is different from classification
• In classification the objects are assigned to pre-defined classes; in clustering the classes are yet to be defined
• Clustering helps in classification

Clustering Requires...
• A method to measure similarity (a similarity matrix) or dissimilarity (a dissimilarity coefficient) between objects
• A threshold value with which to decide whether an object belongs with a cluster
• A way of measuring the "distance" between two clusters
• A cluster seed (an object to begin the clustering process)

Clustering Algorithms
• K-means or partitioning methods - divides a set of N objects into M clusters, with or without overlap
• Hierarchical methods - produces a set of nested clusters in which each pair of objects is progressively nested into a larger cluster until only one cluster remains
• Self-organizing feature maps - produces a cluster set through iterative "training"

K-means or Partitioning Methods
• Make the first object the centroid for the first cluster
• For the next object, calculate its similarity to each existing centroid
• If the similarity is greater than a threshold, add the object to the existing cluster and redetermine the centroid; otherwise use the object to start a new cluster
• Return to step 2 and repeat until done
[Figure: choosing centroids and joining objects under a threshold rule such as λT = λcentroid ± 50 nm]

Hierarchical Clustering
• Find the two closest objects and merge them into a cluster
• Find and merge the next two closest objects (or an object and a cluster, or two clusters) using some similarity measure and a predefined threshold
• If more than one cluster remains, return to step 2 until finished
[Figure: pairwise comparison and successive merging under a threshold rule such as λT = λobs ± 50 nm]

Hierarchical Clustering
• Find the 2 most similar metabolite expression levels or curves, then the next closest pair, and iterate
• The result is typically displayed as a dendrogram alongside a heat map

Multivariate Statistics

Multivariate Statistics
• Multivariate means multiple variables
• If you measure a population using multiple measures at the same time, such as height, weight, hair colour, clothing colour, eye colour, etc., you are performing multivariate statistics
• Multivariate statistics requires more complex, multidimensional analyses or dimensional reduction methods

A Typical Metabolomics Experiment

A Metabolomics Experiment
• Metabolomics experiments typically measure many metabolites at once; in other words, the instruments measure multiple variables, so metabolomic data are inherently multivariate
• Metabolomics therefore requires multivariate statistics

Multivariate Statistics – The Trick
• The key trick in multivariate statistics is to find a way to effectively reduce the multivariate data to univariate data
• Once done, you can apply the same univariate concepts, such as p-values, t-tests and ANOVA tests, to the data
• The trick is dimensional reduction

Dimension Reduction & PCA
• PCA – Principal Component Analysis
• A process that transforms a number of possibly correlated variables into a smaller number of uncorrelated variables called principal components
• Reduces 1000's of variables to 2-3 key features

Principal Component Analysis
• Hundreds of peaks are reduced to 2 components
[Figure: scores plot (PC1 vs. PC2) separating PAP, ANIT and Control samples]
• PCA captures what should be visually detectable: if you can't see it, PCA probably won't help

Visualizing PCA
• PCA of a "bagel": one projection produces a wiener, another projection produces an "O"
• The "O" projection captures most of the variation and has the largest eigenvalue (PC1)
• The wiener projection is PC2 and gives depth information

PCA - The Details
• PCA involves the calculation of the eigenvalue (singular value) decomposition of a data covariance matrix
• PCA is an orthogonal linear transformation
• PCA transforms the data to a new coordinate system so that the greatest variance of the data lies along the first coordinate (1st PC), the second greatest variance along the 2nd PC, and so on
• The scores (t) are the projections of the samples onto the uncorrelated, orthogonal principal axes; the loadings (p) are the weights of the original variables: scores = loadings × data
• t₁ = p₁x₁ + p₂x₂ + p₃x₃ + … + pₙxₙ

Visualizing PCA
• Airport data from the USA: 5000 "samples", with X1 = latitude, X2 = longitude, X3 = altitude
• What should you expect? (Data from Roy Goodacre, U of Manchester)

Visualizing PCA
• The groupings PCA reveals closely resemble those found by K-means clustering

PCA Clusters
• Once dimensional reduction has been achieved, you obtain clusters of data that are mostly normally distributed, with means and variances (in PCA space)
• It is then possible to use t-tests and ANOVA tests to determine whether these clusters or their means are significantly different

PCA and ANOVA
• ANOVA can also be used to determine whether 3+ clusters are different, if the clusters follow a normal distribution
[Figure: clusters in PC1 vs. PC2 space]

PCA Plot Nomenclature
• PCA generates 2 kinds of plots: the scores plot and the loadings plot
• The scores plot displays the samples using the main principal components

PCA Loadings Plot
• The loadings plot shows how much each of the variables (metabolites) contributed to the different principal components
• Variables at the extreme corners contribute most to the scores plot separation

PCA Details/Advice
• In some cases PCA will not succeed in
identifying any clear clusters or obvious groupings, no matter how many components are used. If this is the case, it is wise to accept the result and assume that the presumptive classes or groups cannot be distinguished
• As a general rule, if a PCA analysis fails to achieve even a modest separation of classes, it is probably not worthwhile using other statistical techniques to try to separate them

PCA vs. PLS-DA
• PLS-DA – Partial Least Squares Discriminant Analysis
• PLS-DA is a supervised classification technique, while PCA is an unsupervised clustering technique
• PLS-DA uses "labeled" data while PCA uses no prior knowledge
• PLS-DA enhances the separation between groups of observations by rotating PCA components such that a maximum separation among classes is obtained

PLS-DA Validation
• PLS-DA results are essentially prediction models or class predictors
• These models need to be validated and assessed to make sure they are not over-trained or over-fitted
• There are several routes to assessing the quality and robustness of the model – R²/Q² assessments and permutation testing

Validating PLS-DA with Q² & R²
• The performance of a PLS-DA model can be quantitatively evaluated in terms of an R² and/or a Q² value
• R² is the correlation index and refers to the goodness of fit or the explained variation (range = 0-1)
• Q² refers to the predicted variation or quality of prediction (range = 0-1)
• Typically Q² and R² track very closely together

PLS-DA R²
• R² is a quantitative measure (with a maximum value of 1) that indicates how well the PLS-DA model is able to mathematically reproduce the data in the data set
• A poorly fit model will have an R² of 0.2 or 0.3, while a well-fit model will have an R² of 0.7 or 0.8

PLS-DA Q²
• To guard against over-fitting, the value Q² is commonly determined
• Q² is usually estimated by cross-validation or permutation testing, to assess the predictive ability of the model relative to the number of principal components used in the model
• Generally a Q² > 0.5 is considered good, while a Q² of 0.9 is outstanding

Validating PLS-DA (Permutation)
• Build the model on the labeled data, then rebuild it many times on data with randomly permuted labels; a real model should achieve a far better separation score than the permuted ones
[Figure: separation scores for PCA and PLS-DA/SVM on labeled vs. permuted data]

Other Supervised Classification Methods
• SIMCA – Soft Independent Modeling of Class Analogy
• OPLS – Orthogonal Projection of Latent Structures
• Support Vector Machines
• Random Forest
• Naïve Bayes Classifiers
• Neural Networks

Breaching the Data Barrier
• Unsupervised methods: PCA, K-means clustering, Factor Analysis
• Supervised methods: PLS-DA, LDA, PLS-Regression
• Machine learning: Neural Networks, Support Vector Machines, Bayesian Belief Nets

Data Analysis Progression
• Unsupervised methods – PCA or cluster to see if natural clusters form or if the data separate well; the data are "unlabeled" (no prior knowledge)
• Supervised methods/machine learning – the data are labeled (prior knowledge); used to see if the data can be classified; helps separate less obvious clusters or features
• Statistical significance – supervised methods always generate clusters, and this can be very misleading; check whether clusters are real by label permutation

Note of Caution
• Supervised classification methods are powerful: they learn from experience, generalize from previous examples, and perform pattern recognition
• Too many people skip the PCA or clustering steps and jump straight to supervised methods
• Some get great separation and think the job is done – this is where the errors begin…
• Too many don't assess significance using permutation testing or n-fold cross-validation
• If separation isn't at least partially obvious by eye-balling your data, you may be treading on thin ice
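The dimension-reduction and permutation-testing ideas above can be sketched with plain NumPy: PCA via SVD of the mean-centred data, plus a simple between/within separation score compared against label-shuffled versions. The separation score here is an illustrative choice for the sketch, not a standard PLS-DA statistic, and the simulated "metabolite" data are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
# Two classes of 30 "samples", differing in 3 of 20 "metabolites"
a = rng.normal(0, 1, size=(30, 20)); a[:, :3] += 2.5
b = rng.normal(0, 1, size=(30, 20))
X = np.vstack([a, b])
labels = np.array([0] * 30 + [1] * 30)

def pc_scores(X, k=2):
    """Project samples onto the first k principal components (PCA via SVD)."""
    Xc = X - X.mean(axis=0)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def separation(scores, labels):
    """Illustrative score: distance between group means over within-group spread."""
    g0, g1 = scores[labels == 0], scores[labels == 1]
    between = np.linalg.norm(g0.mean(axis=0) - g1.mean(axis=0))
    within = g0.std() + g1.std()
    return between / within

scores = pc_scores(X)
real = separation(scores, labels)
# Permutation test: shuffle the class labels many times; a real separation
# should beat essentially all of the permuted ones
perm = [separation(scores, rng.permutation(labels)) for _ in range(200)]
p_value = np.mean([s >= real for s in perm])
print(real, p_value)
```

If the labels carried no real information, the observed separation would fall inside the permuted distribution and `p_value` would be large — which is precisely the check the "Note of Caution" slide asks for.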