Lecture 3 - Tresch Group
Achim Tresch, UoC / MPIPZ Cologne
Statistics, treschgroup.de/OmicsModule1415.html
[email protected]

Clustering = Partitioning into groups

K-means clustering, example with k=2
[Figure series: a two-dimensional point cloud; the panels show the initial cluster centers at iteration 1, then alternately the updated memberships and boundary and the updated cluster centers, through iteration 4. Taken from Padhraic Smyth, University of California Irvine, 2007.]
The algorithm alternates two steps: assign each point to its nearest cluster center, then move each center to the mean of the points assigned to it; it stops once the memberships no longer change. (A code sketch follows at the end of this part.)

Example (image compression)
[Figures: an original image and its k-means segmentations with k=2, k=3 and k=8, shown in pseudocolor display.]

Hierarchical clustering (on the blackboard)
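As announced above, a minimal k-means sketch. This is an illustration, not the lecture's own code; the toy data and the random initialization are assumptions made here for demonstration:

```python
import numpy as np

def kmeans(X, k, n_iter=20, seed=0):
    """Lloyd's algorithm: alternate between assigning points to the
    nearest center and moving each center to the mean of its points."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]  # random initial centers
    for _ in range(n_iter):
        # assignment step: index of the nearest center for every point
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # update step: each center becomes the mean of its assigned points
        # (assumes no cluster loses all of its points, fine for this toy data)
        new_centers = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_centers, centers):  # converged: memberships stable
            break
        centers = new_centers
    return centers, labels

# toy data resembling the two-cluster example in the figures
rng = np.random.default_rng(1)
X = np.vstack([rng.normal([10, 9], 1, (50, 2)), rng.normal([16, 12], 1, (50, 2))])
centers, labels = kmeans(X, k=2)
```

Each loop iteration corresponds to one pair of panels in the figure series: a membership update followed by a center update.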
Classification

Expression profile of Ms. Smith: a microarray of Ms. Smith yields her expression profile, the 30,000 properties of Ms. Smith. The expression profile is
- a list of 30,000 numbers,
- some of which reflect her health problem (e.g., cancer);
- the profile is an image of Ms. Smith's physiology.
How can these numbers tell us (predict) whether Ms. Smith has tumor type A or tumor type B?

Looking for similarities? Compare her profile to the profiles of patients with tumor type A and to patients with tumor type B.

Training and prediction
There are patients of known class, the training samples, and there are patients of unknown class, the "new" samples. Use the training samples to learn how to predict the "new" samples.

Prediction using one gene
[Figure: color-coded expression levels of the training samples in classes A and B, with Ms. Smith's value in three scenarios: clearly type A, clearly type B, and borderline.] Which color shade is a good decision boundary?

Optimal decision rule
Use the cutoff with the fewest misclassifications on the training samples, i.e. the smallest training error. [Figure: the distributions of expression values in type A and in type B overlap; the decision boundary is placed where the overlap region, the training error, is smallest.] (A code sketch of this rule follows at the end of this part.)

The decision boundary was chosen to minimize the training error. In a set of new cases, the test set, the two distributions of expression values for type A and type B will be similar, but not identical, to those in the training set. We cannot adjust the decision boundary, because we do not know the class of the new samples. Test errors are therefore usually larger than training errors. This phenomenon is called overfitting.

Combining information across genes
Taking means across genes: [Figure: the top gene vs. the average of the top 10 genes; ALL vs. AML, Golub et al.] Using a weighted average
  y = β1·x1 + β2·x2 + ... + βn·xn,
where x1,...,xn are the expression values and β1,...,βn are the weights: with "good" weights you get an improved separation.

The geometry of weighted averages: calculating a weighted average is identical to projecting the expression profiles orthogonally onto the line defined by the weight vector (of length 1).

Linear decision rules: hyperplanes
  y = β0 + β1·x1 + β2·x2 + ... + βn·xn.
Together with the offset β0, the weight vector defines a hyperplane that cuts the data into two groups (a line for 2 genes, a plane for 3 genes). This yields a linear signature: if y ≥ 0, predict disease A; if y < 0, predict disease B.

Nearest centroids and linear discriminant analysis
Diagonal linear discriminant analysis (DLDA): rescale the axes according to the variances of the genes. The data often shows evidence of non-identical covariances of the genes in the two groups; hence using LDA, DLDA or nearest centroids introduces a model bias (= wrong model assumptions, here due to oversimplification). (A DLDA-style sketch also follows below.)

Feature reduction: gene filtering
- Rank genes according to a score.
- Choose the top n genes.
- Build a signature with these genes only.
There are still 30,000 weights, but most of them are zero; note that the data decides which ones are zero and which are not. Limitation: a single-gene score has no(?) chance to find two genes that separate the classes only in combination among 30,000 noninformative genes.

How many genes? Is this a biological or a statistical question? Biology: how many genes are (causally) involved in the biological process? Statistics: how many genes should we use for classification? Gene expression measurements provide ~30,000 individual expression values per sample.

Finding the needle in the haystack: a common myth says that classification information in gene expression signatures is restricted to a small number of genes, and that the challenge is to find them. The avalanche: aggressive lymphomas with and without a MYC breakpoint. [Figure: expression profiles of MYC-negative vs. MYC-positive samples.]
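The single-gene rule ("the cutoff with the fewest misclassifications") announced above, as a minimal sketch; the data here are invented for illustration, and only the orientation "predict B when expression is high" is tried (the mirrored rule would be checked the same way):

```python
import numpy as np

def best_cutoff(x, y):
    """Return the threshold on one gene with the smallest training error,
    for the rule: predict class 'B' whenever x >= cutoff."""
    xs = np.sort(np.unique(x))
    candidates = (xs[:-1] + xs[1:]) / 2          # midpoints between sorted values
    errors = [np.mean((x >= c) != (y == "B")) for c in candidates]
    i = int(np.argmin(errors))                   # fewest training misclassifications
    return candidates[i], errors[i]

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(5, 1, 20), rng.normal(7, 1, 20)])  # toy gene
y = np.array(["A"] * 20 + ["B"] * 20)
cutoff, training_error = best_cutoff(x, y)
```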
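The DLDA-style rule can be sketched just as briefly. This is an illustrative reading of the slides, not a reference implementation: nearest centroids after rescaling every gene by a pooled within-class standard deviation (the "diagonal" covariance assumption):

```python
import numpy as np

def dlda_fit(X, y):
    """X: samples x genes, y: class labels. Returns the per-class centroids
    and the pooled within-class standard deviation of every gene."""
    classes = np.unique(y)
    centroids = {c: X[y == c].mean(axis=0) for c in classes}
    within = np.vstack([X[y == c] - centroids[c] for c in classes])
    s = within.std(axis=0) + 1e-8   # small constant guards against zero variance
    return centroids, s

def dlda_predict(x_new, centroids, s):
    """Assign the class whose centroid is nearest in rescaled coordinates."""
    d = {c: (((x_new - m) / s) ** 2).sum() for c, m in centroids.items()}
    return min(d, key=d.get)
```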
Cross validation
Validation of a signature requires independent test data: the accuracy of a signature on the data it was learned from (the training error) is biased because of the overfitting phenomenon. Independent validation on a test set yields the test error.

Generating test sets: split the data randomly into test data and training data. Learn the classifier on the training data, and apply it to the test data, counting correct and mistaken predictions.

Problem: the test error cannot be measured directly. Idea: generate artificial test data by splitting the data into k partitions D1,...,Dk of roughly equal size, e.g. k = 5. In round i, find the regression function / the classifier using the k-1 partitions other than Di; measure the training error TRi on those k-1 partitions, and the cross-validation error CVi on the remaining partition Di. After all k rounds,
  TR = mean(TR1,...,TR5)   (training error),
  CV = mean(CV1,...,CV5)   (cross-validation error).
CV is a good estimate of the test error. (A code sketch of this loop follows at the end of this part.)

Bootstrap
Problem, again: the test error cannot be measured directly. Idea: generate artificial (sub)samples from the sample at hand. (Baron von Münchhausen pulled himself out of the swamp by his own hair; in the English literature, he does the same using his own bootstraps. The method is due to Bradley Efron, *1938, Stanford University.)

Draw a bootstrap (sub)sample B from the whole sample S of N cases, allowing repetitions. Find a regression function fB on the bootstrap sample B, and calculate the bootstrap error E of fB on S - B, the cases not drawn into B. Repeat this process many times (bootstrap samples B1,...,Bk, e.g. k = 1000). The test error can then be estimated as the average bootstrap error,
  V(f) = mean(E1,...,Ek).
This is one of the best methods to estimate the test error. Drawback: it is computationally expensive, since many regressions need to be calculated. (A sketch follows below as well.)
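A minimal sketch of the k-fold loop described above. `fit` and `error` stand for whichever classifier and loss are in use; they are placeholders assumed for this illustration:

```python
import numpy as np

def k_fold_cv(X, y, fit, error, k=5, seed=0):
    """Split the samples into k partitions; in each round train on k-1
    partitions and evaluate on the held-out one. Returns (TR, CV)."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    tr, cv = [], []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = fit(X[train_idx], y[train_idx])
        tr.append(error(model, X[train_idx], y[train_idx]))  # training error TRi
        cv.append(error(model, X[test_idx], y[test_idx]))    # CV error CVi
    return np.mean(tr), np.mean(cv)   # TR and CV as defined above
```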
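And the bootstrap estimate of the test error, again as a sketch under the same assumptions about `fit` and `error`:

```python
import numpy as np

def bootstrap_error(X, y, fit, error, k=1000, seed=0):
    """Draw k bootstrap samples with replacement; each model is evaluated
    on the cases that were not drawn, i.e. on S - B."""
    rng = np.random.default_rng(seed)
    n, errs = len(y), []
    for _ in range(k):
        b = rng.integers(0, n, size=n)        # bootstrap sample B (with repetitions)
        oob = np.setdiff1d(np.arange(n), b)   # S - B: cases not drawn into B
        if len(oob) == 0:
            continue                          # rare: every case was drawn
        model = fit(X[b], y[b])
        errs.append(error(model, X[oob], y[oob]))
    return np.mean(errs)                      # V(f) = mean(E1, ..., Ek)
```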
Cross validation, continued
Estimators of performance have a variance, which can be high: the chance that a meaningless signature produces 100% accuracy on test data is high if the test data comprises only a few patients. [Figure: nested 10-fold CV, variance across 100 random partitions.]

Bias & overfitting
The gap between training error and test error becomes wider as more genes enter the model. Overfitting is a good reason for not including hundreds of genes in a model, even if they are biologically affected.

Centroid shrinkage
The shrunken centroid method and the PAM package (Prediction Analysis of Microarrays; Tibshirani et al. 2002). [Figure: gene-wise class centroids before and after shrinkage by an amount D.] Each class centroid is moved towards the overall centroid by D; genes whose centroids are shrunken all the way to the overall centroid drop out of the classifier. (A sketch of the shrinkage step follows at the end of this part.)

How much shrinkage is good in PAM? [Figure: cross validation, where each fold serves once as the selection set while the others are used for training.] Compute the CV performance for several values of D and pick the D that gives the smallest number of CV misclassifications. Small D: many genes, poor performance due to overfitting. High D: few genes, poor performance due to lack of information (underfitting). The optimal D is somewhere in the middle.

Adaptive model selection: PAM does this routinely.

Selection bias
The test data must not be used for gene selection or for adaptive model selection; otherwise the observed (cross-validation-based) accuracy is biased. This is the selection bias.

Predictive genes are not causal genes
Assume protein A binds to protein B and inhibits it, and the clinical phenotype is caused by active protein A. The predictive information then lies in the expression of A minus the expression of B. Calling signature genes markers for a certain disease is misleading!

Optimal decision rules
Naïve idea: don't calculate weights from single-gene scores, but optimize over all possible hyperplanes. Problem 1: no separating line exists. Problem 2: many separating lines exist. Only one of these problems actually occurs here (with many more genes than patients, a separating hyperplane always exists). But why are many separating lines a problem at all? This is related to overfitting ... more soon.

The p > N problem
With the microarray we have more genes than patients. Think about this in three dimensions: there are three genes, two patients with known diagnosis (red and yellow), and Ms. Smith (green). There is always one plane separating red from yellow with Ms. Smith on the yellow side, and a second separating plane with Ms. Smith on the red side. (If all points fall onto one line this does not always work; for measured values, however, that is very unlikely and never happens in practice.)

The overfitting disaster: from the data alone we can neither decide which genes are important for the diagnosis nor give a reliable diagnosis for a new patient. This has little to do with medicine; it is a geometrical problem. If you find a separating signature, it does not (yet) mean that you have a top publication. In most cases it means nothing.

Finding meaningful signatures
There always exist separating signatures caused by overfitting: meaningless signatures. Hopefully there is also a separating signature caused by a disease mechanism, or at least one that is predictive for the disease: a meaningful signature. We need to learn how to find and validate meaningful signatures.
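The shrinkage step announced above, in a deliberately simplified sketch. The published method additionally standardizes each gene by a pooled within-class standard deviation; that refinement is omitted here, so this illustrates soft thresholding rather than reproducing PAM exactly:

```python
import numpy as np

def shrunken_centroids(X, y, delta):
    """Move each class centroid towards the overall centroid by delta
    (soft thresholding); genes whose offsets are shrunken to zero for
    all classes no longer contribute to the classifier."""
    overall = X.mean(axis=0)
    shrunk = {}
    for c in np.unique(y):
        d = X[y == c].mean(axis=0) - overall                 # gene-wise offset of class c
        d = np.sign(d) * np.maximum(np.abs(d) - delta, 0.0)  # soft threshold by delta
        shrunk[c] = overall + d
    return shrunk

# Choosing delta: compute the CV misclassifications for several values of
# delta and pick the one with the fewest errors (adaptive model selection).
```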
Separating hyperplanes
Which hyperplane is the best?

Support Vector Machines (SVMs)
Fat planes: with an infinitely thin plane the data can always be separated correctly, but not necessarily with a fat one. Again, if a large-margin separation exists, chances are good that we have found something relevant.

Large margin classifiers: SVMs find the maximal margin hyperplane. There are theoretical results showing that the size of the margin correlates with the test (!) error (V. Vapnik). SVMs are thus not only optimized to fit the training data, but for predictive performance directly. If the training set is not separable, errors are penalized by their distance to the hyperplane, multiplied by a parameter c, which balances over- and underfitting. (A sketch follows at the end of this part.)

External validation and documentation
Documenting a signature is conceptually different from giving a list of genes, although a gene list is what most publications give you. In order to validate a signature on external data, or to apply it in practice:
- all model parameters need to be specified;
- the scale of the normalized data to which the model refers needs to be specified.

Establishing a signature
Cross validation (split the data into training and test data): select genes, find the optimal number of genes. Training data only (machine learning): learn the model parameters.

Cookbook for good classifiers
1. Decide on your diagnosis model (PAM, SVM, etc.) and don't change your mind later on.
2. Split your profiles randomly into a training set and a test set.
3. Put the data in the test set away ... far away!
4. Train your model using only the data in the training set (select genes, define centroids, calculate normal vectors for large-margin separators, perform adaptive model selection ...); don't even think of touching the test data at this time.
5. Apply the model to the test data; don't even think of changing the model at this time.
6. Do steps 1-5 only once and accept the result; don't even think of optimizing this procedure.

Acknowledgements
Rainer Spang, University of Regensburg
Florian Markowetz, Cancer Research UK, Cambridge

Regression
Estimation of one quantitative endpoint by a function of the covariates: in the population there is an unknown functional relation between Yi and Xi, Yi = f(Xi) + εi, and from a sample we estimate the regression function f.

Specify a (parametric) family of functions which describes the type of dependence you want to model, e.g. linear dependence, f(x) = ax + b, or quadratic dependence, f(x) = ax² + bx + c.

Goodness-of-fit measures
Specify the loss function, i.e. the measure for the goodness of fit, the target function to be minimized. E.g. quadratic loss (= residual sum of squares, RSS): for each observation (Xj, Yj), the residual is the difference between the true value Yj and the prediction f(Xj), and
  RSS = Σj (jth true value - jth predicted value)² = Σj (Yj - f(Xj))².

More generally, specify a loss function L which accounts for the difference of the predictions from the observed "true" values. With y the true value and f(x) the prediction:
- for continuous data: L(y, f(x)) = (y - f(x))² (quadratic loss), or L(y, f(x)) = |y - f(x)| (linear loss);
- for binary data: L(y, f(x)) = 0 if y = f(x), 1 if y ≠ f(x) (0-1 loss).

Find the function from the specified family of functions (i.e. find the parameters defining this function) which fits the data best. [Figure: several candidate curves through the same data with RSS = 8.0, 1.1, 1.7 and 3; the fit with the smallest RSS wins.] The two sketches below show the soft-margin SVM announced earlier, and a least-squares fit that minimizes the RSS.
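First, the SVM. A minimal sketch that assumes scikit-learn is available; the toy data are invented, and the loop merely shows how the penalty parameter (called C in scikit-learn, c on the slides) changes the fit:

```python
import numpy as np
from sklearn.svm import SVC  # assumes scikit-learn is installed

# toy two-class data; not separable, so the penalty parameter matters
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (30, 2)), rng.normal(1.5, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

# Large C punishes training errors hard (risking overfitting); small C
# allows a wider, "fatter" margin at the cost of more training errors
# (risking underfitting).
for C in (0.01, 1.0, 100.0):
    clf = SVC(kernel="linear", C=C).fit(X, y)
    print(C, clf.score(X, y))  # training accuracy; in practice, judge C by CV
```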
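Second, fitting by least squares, i.e. picking from the family of functions the member with the smallest RSS. A sketch on synthetic data, comparing the linear and the quadratic family:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 60, 40)
y = 0.02 * x**2 + x + rng.normal(0, 8, size=x.size)  # noisy quadratic truth

for degree in (1, 2):                  # linear f(x)=ax+b vs. quadratic f(x)=ax²+bx+c
    coeffs = np.polyfit(x, y, degree)  # least squares: minimizes the RSS
    rss = np.sum((y - np.polyval(coeffs, x)) ** 2)
    print(degree, round(rss, 1))       # the quadratic family fits this data better
```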
Univariate linear regression
Example: the relation between body weight and brain weight. [Figures: body weight vs. brain weight of 62 mammals, on the raw scale and on the log10-log10 scale, with the fitted regression line and its residuals; a marked outlier is Chironectes minimus (the water opossum).]

Linear regression: goals
- Find a good predictor of the endpoint Y, given the covariates.
- Identify covariates that are relevant (have prognostic value) for the prediction of Y.
Example: an increase of the body weight by 1 unit (on the log scale) results in an average increase of the (log) brain weight by 0.75 units. Body weight seems to exert a major influence on brain weight.

Univariate nonlinear regression
Nonlinear dependencies can also be modeled by a regression, e.g. with the "true" regression function Y = aX² + bX + c.

Multiple regression
Univariate regression uses only one covariate; multiple regression uses several (up to thousands of) covariates. Example: Y = oxygen consumption (mmol O2/min), X1 = body temperature, X2 = physical performance; multivariate linear regression function Y = a1·X1 + a2·X2 + c. (A sketch follows at the end of this part.)

From the treasury of statistics
Olympics 2156: women win the men's 100 m race. Nature 431, 525 (30 September 2004), doi:10.1038/431525a. "Athletics: Momentous sprint at the 2156 Olympics?" Andrew J. Tatem, Carlos A. Guerra, Peter M. Atkinson & Simon I. Hay: "The 2004 Olympic women's 100-metre sprint champion, Yuliya Nesterenko, is assured of fame and fortune. But we show here that — if current trends continue — it is the winner of the event in the 2156 Olympics whose name will be etched in sporting history forever, because this may be the first occasion on which the women's race is won in a faster time than the men's event."

[Figure: the world's best 100 m times of each year, women and men, with linear trends extrapolated far beyond the observed years.] Note: interpolation is much more reliable than extrapolation! One needs to be clear about the range of values to which the regression model can sensibly be applied.

Training vs. test error
Problem: the loss function, applied to the training data, measures the wrong thing. The regression function f is estimated from the sample. The training error T(f) measures how well f applies to the sample; the test error V(f) measures how well f applies to the population, which is what we actually want to know.
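The multiple regression announced above, as an ordinary least-squares sketch; the oxygen-consumption, temperature and performance numbers are invented for illustration:

```python
import numpy as np

# covariates: X1 = body temperature, X2 = physical performance (invented data)
X1 = np.array([36.5, 37.0, 38.2, 36.8, 39.0, 37.5])
X2 = np.array([1.0, 2.5, 4.0, 1.5, 5.0, 3.0])
Y = np.array([10.1, 12.3, 15.8, 10.9, 18.2, 13.7])   # oxygen consumption (mmol O2/min)

# design matrix with an intercept column for c: Y = a1*X1 + a2*X2 + c
A = np.column_stack([X1, X2, np.ones_like(X1)])
(a1, a2, c), *_ = np.linalg.lstsq(A, Y, rcond=None)  # least-squares solution
print(a1, a2, c)
```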
Biases in RNA-Seq data
Aim: to provide a brief overview of the literature on biases in RNA-seq data, so that you become aware of this potential problem.

Bias and variance
[Figure: four point clouds around a target value x, combining strong vs. weak noise with bias vs. no bias.]

Experimental (and computational) biases affect expression estimates and, therefore, subsequent data analysis:
- differential expression analysis,
- study of alternative splicing,
- transcript assembly,
- gene set enrichment analysis,
- other downstream analyses.
We must attempt to avoid, detect and correct these biases.

Sources of bias and variance
Systematic (bias): similar effects on many (all) data points of one sample; a correction can be estimated and removed from the data (normalization).
Stochastic (variance): effects on single data points of a sample; a correction cannot be estimated, the noise can only be quantified and taken into account (error model).
[Figure: examples of such sources, e.g. efficiency of RNA extraction and reverse transcription, amplification efficiency, background fluorescence, tissue contamination, DNA quality, RNA degradation, signal detection.]

Bias-variance tradeoff
Which factors influence the quality of predictions? Example: binary classification of points in the plane. [Figure: a flexible vs. a stable decision boundary; flexibility risks overfitting, stability risks bias. From Hastie, Tibshirani, Friedman: The Elements of Statistical Learning.]

[Figure: training error T(f) and test error V(f) as a function of model complexity.] Even though increasing flexibility ("complexity") reduces the training error, the test error rises again at some point. This phenomenon is called overfitting. (From Hastie, Tibshirani, Friedman: The Elements of Statistical Learning.) A simulation sketch follows below.
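A small simulation of this tradeoff: polynomials of increasing flexibility are fitted to the same noisy training data (all data here are synthetic). The training error keeps falling, while the error on fresh test data eventually rises again:

```python
import numpy as np

rng = np.random.default_rng(2)

def sample(n):                        # noisy observations of a smooth truth
    x = rng.uniform(0, 1, n)
    return x, np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)

x_tr, y_tr = sample(30)               # small training sample
x_te, y_te = sample(1000)             # large independent test sample

for degree in (1, 3, 9, 15):          # increasing model flexibility
    # high degrees may trigger a conditioning warning; the error trend is the point
    coeffs = np.polyfit(x_tr, y_tr, degree)
    t = np.mean((y_tr - np.polyval(coeffs, x_tr)) ** 2)  # training error T(f)
    v = np.mean((y_te - np.polyval(coeffs, x_te)) ** 2)  # test error V(f)
    print(degree, round(t, 3), round(v, 3))
```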