Bioinformatics: Other Data Reduction Techniques
Kristel Van Steen, PhD, ScD ([email protected])
Université de Liège - Institut Montefiore, 2008-2009

Acknowledgements
Material based on work from Pradeep Mummidi and on class notes from Christine Steinhoff.

Outline
• Intuition behind PCA
• Theory behind PCA
• Applications of PCA
• Extensions of PCA
• Multidimensional scaling: MDS (not to be confused with MDR)

Intuition behind PCA

Introduction
Most scientific and industrial data are multivariate, and the data sets are huge. Is all of the data useful? If not, how do we quickly extract only the useful information?

Problem
With traditional techniques it is not easy to extract useful information from multivariate data:
1) Many bivariate plots are needed.
2) Bivariate plots, however, mainly represent correlations between variables (not samples).

Visualization problem
It is not easy to visualize multivariate data:
- 1D: dot
- 2D: bivariate plot (i.e. X-Y plane)
- 3D: X-Y-Z plot
- 4D: ternary plot with a color code / tetrahedron
- 5D, 6D, etc.: ???
As the number of variables increases, the data space becomes harder to visualize.

Basics of PCA
PCA is useful when we need to extract information from multivariate data sets. The technique is based on reducing the dimensionality of the data, so that trends in multivariate data are easily visualized.

Variable reduction procedure
Principal component analysis is a variable reduction procedure. It is useful when you have obtained data on a number of variables (possibly a large number) and believe that there is some redundancy in those variables. Redundancy means that some of the variables are correlated with one another, possibly because they measure the same construct. Because of this redundancy, it should be possible to reduce the observed variables to a smaller number of principal components (artificial variables) that account for most of the variance in the observed variables.

What is a principal component?
A principal component can be defined as a linear combination of optimally weighted observed variables, where "optimally weighted" refers to how subject scores on a principal component are computed. Running example: a 7-item measure of job satisfaction.

General formula
The general form of the formula to compute scores on the first component extracted (created) in a principal component analysis is

C1 = b11(X1) + b12(X2) + ... + b1p(Xp)

where
C1 = the subject's score on principal component 1 (the first component extracted)
b1p = the regression coefficient (or weight) for observed variable p, as used in creating principal component 1
Xp = the subject's score on observed variable p.

For example, assume that component 1 in the present study is the "satisfaction with supervision" component. You could determine each subject's score on principal component 1 using the following fictitious formula:

C1 = .44(X1) + .40(X2) + .47(X3) + .32(X4) + .02(X5) + .01(X6) + .03(X7)

Obviously, a different equation, with different regression weights, would be used to compute subject scores on component 2 (the "satisfaction with pay" component). Below is a fictitious illustration of this formula:

C2 = .01(X1) + .04(X2) + .02(X3) + .02(X4) + .48(X5) + .31(X6) + .39(X7)

Number of components extracted
From this example one might expect that a principal component analysis performed on data from the 7-item job satisfaction questionnaire would create only two components. Such an impression would not be entirely correct.
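Before continuing, here is a minimal R sketch of the score computation above. The weights are the fictitious ones from the two formulas; the subject responses are invented purely for illustration.

```r
# Fictitious component weights from the two formulas above:
# rows = items X1..X7, columns = components C1 and C2.
B <- matrix(c(0.44, 0.40, 0.47, 0.32, 0.02, 0.01, 0.03,   # weights for C1
              0.01, 0.04, 0.02, 0.02, 0.48, 0.31, 0.39),  # weights for C2
            nrow = 7, ncol = 2)

# Invented responses of three subjects to the 7 items (rows = subjects).
X <- rbind(c(7, 6, 7, 5, 2, 3, 2),
           c(2, 3, 2, 3, 6, 7, 6),
           c(4, 4, 5, 4, 4, 4, 5))

# Component scores: C[i, j] is subject i's score on component j,
# i.e. the weighted sum b_j1*X1 + ... + b_j7*X7 from the formulas above.
C <- X %*% B
print(C)
```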
In reality, the number of components extracted in a principal component analysis is equal to the number of observed variables being analyzed. However, in most analyses only the first few components account for meaningful amounts of variance, so only these first few components are retained, interpreted, and used in subsequent analyses (such as multiple regression analyses).

Characteristics of principal components
The first component extracted in a principal component analysis accounts for a maximal amount of total variance in the observed variables. Under typical conditions this means that the first component will be correlated with at least some of the observed variables, and possibly with many.

The second component extracted has two important characteristics. First, it accounts for a maximal amount of the variance in the data set that was not accounted for by the first component; under typical conditions this means that the second component will be correlated with some of the observed variables that did not display strong correlations with component 1. Second, it is uncorrelated with the first component: if you were to compute the correlation between components 1 and 2, that correlation would literally be zero.

The remaining components extracted in the analysis display the same two characteristics: each component accounts for a maximal amount of variance in the observed variables that was not accounted for by the preceding components, and is uncorrelated with all of the preceding components.

Generalization
A principal component analysis proceeds in this fashion, with each new component accounting for progressively smaller amounts of variance (this is why only the first few components are usually retained and interpreted). When the analysis is complete, the resulting components display varying degrees of correlation with the observed variables, but are completely uncorrelated with one another.

Theory behind PCA: Linear Algebra

Outline
What do we need from "linear algebra" to understand principal component analysis?
• Standard deviation, variance, covariance
• The covariance matrix
• Symmetric matrices and orthogonality
• Eigenvalues and eigenvectors
• Properties

Motivation
(Figure: scatterplot of proteins 1 and 2 measured for 200 patients.)
(Figure: a microarray experiment yielding a matrix of 22,000 genes by 200 patients.) How do we visualize such data? Which genes are important, and for which subgroup of patients?
(Figure: the same genes-by-patients matrix reduced to far fewer dimensions.)

Basics for principal component analysis
• Orthogonal/orthonormal vectors
• Some theorems
• Standard deviation, variance, covariance
• The covariance matrix
• Eigenvalues and eigenvectors

Standard deviation
The standard deviation is (roughly) the average distance from the mean of the data set to a point:

SD = sqrt( sum_i (Xi - mean)^2 / (n-1) ),  with  mean = (sum_i Xi) / n.

Example:
Measurement 1 (M1): 0, 8, 12, 20  ->  mean 10, SD 8.33
Measurement 2 (M2): 8, 9, 11, 12  ->  mean 10, SD 1.83

Variance
The variance is the squared standard deviation: Var = sum_i (Xi - mean)^2 / (n-1). For the same example, M1 has Var 69.33 and M2 has Var 3.33.

Covariance
Standard deviation and variance are 1-dimensional. How much do the dimensions vary from the mean with respect to each other? Covariance measures this between 2 dimensions:

cov(X,Y) = sum_i (Xi - mean(X))(Yi - mean(Y)) / (n-1).

We easily see that if X = Y we end up with the variance.
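These quantities are easy to verify in R with the example measurements above (a quick check, not part of the original slides):

```r
m1 <- c(0, 8, 12, 20)   # Measurement 1
m2 <- c(8, 9, 11, 12)   # Measurement 2

mean(m1); mean(m2)      # both 10
sd(m1); sd(m2)          # 8.33 and 1.83 (note the n-1 denominator)
var(m1); var(m2)        # 69.33 and 3.33, i.e. sd squared

cov(m1, m2)             # covariance between the two measurements
cov(m1, m1) == var(m1)  # TRUE: covariance of X with itself is the variance
```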
Covariance matrix
Let X = (X1, ..., Xp)^t be a random vector. The covariance matrix of X, denoted Cov(X), is the matrix whose (i,j) entry is cov(Xi, Xj). The diagonal entries of Cov(X) are the variances var(Xi). The covariance matrix is symmetric.

Symmetric matrix
Let A = (a_ij) be a square matrix of size n x n. The matrix A is symmetric if a_ij = a_ji for all i, j.

Orthogonality/orthonormality
Example: <v1, v2> = <(1 0), (0 1)> = 0. Two vectors v1 and v2 for which <v1, v2> = 0 holds are said to be orthogonal. Unit vectors that are orthogonal are said to be orthonormal.

Eigenvalues/eigenvectors
Let A be an n x n square matrix and x an n x 1 column vector. A (right) eigenvector of A is a nonzero vector x such that

A x = lambda x

for some scalar lambda (lambda is the eigenvalue, x the eigenvector).

Procedure:
1. Find the eigenvalues lambda by solving det(A - lambda*I) = 0.
2. Find the corresponding eigenvectors by solving (A - lambda*I) x = 0.
In R: eigen(matrix); in Matlab: eig(matrix).

Some remarks
If A and B are matrices whose sizes are such that the given operations are defined, and c is any scalar, then:
(A^t)^t = A
(A + B)^t = A^t + B^t
(cA)^t = c A^t
(AB)^t = B^t A^t

Now we have enough definitions to go through the procedure of performing a principal component analysis.

Theory behind PCA: Linear Algebra Applied

Outline
What is principal component analysis good for?
• The basic idea of principal component analysis
• The idea of transformation
• How to get there? The mathematics part
• Some remarks
• Basic algorithmic procedure

Idea of PCA
PCA was introduced by Pearson (1901) and Hotelling (1933) to describe the variation in a set of multivariate data in terms of a set of uncorrelated variables. We typically have a data matrix of n observations on p correlated variables x1, x2, ..., xp. PCA looks for a transformation of the xi into p new variables yi that are uncorrelated.

Idea
Consider the matrix X of genes x1, ..., xp (rows) measured on patients 1, ..., n (columns). The dimension is high, so how can we reduce it? The simplest way would be to take the first one, two, or three variables, plot them, and discard the rest: obviously a very bad idea.

Transformation
We want a transformation that involves ALL columns, not only the first ones. So we look for a new basis, ordered such that the first component carries almost all of the information in the whole dataset. That is, we look for a transformation of the data matrix X (p x n) of the form

Y = a^t X = a1 X1 + a2 X2 + ... + ap Xp.

What is a reasonable choice for the coefficients a? Remember: we wanted a transformation that maximizes "information", that is, one that captures the variance in the data. So we maximize the variance of the projection of the observations on the Y variables: find a such that Var(a^t X) is maximal, where the matrix C = Var(X) is the covariance matrix of the Xi variables.

Can we see that intuitively in a picture? (Figure: the same point cloud projected onto two candidate directions, one "good" and one "better"; the better direction captures more of the variance. PC1 and PC2 are orthogonal directions in the cloud.)

How do we get there?
X is a real-valued p x n matrix (genes x1, ..., xp by patients 1, ..., n). Cov(X) is a real-valued p x p matrix or n x n matrix: decide whether you want to analyse patient groups or gene groups.

Let's decide for genes:

Cov(X) =
| v(x1)      c(x1,x2)   ...   c(x1,xp) |
| c(x2,x1)   v(x2)      ...   c(x2,xp) |
| ...        ...        ...   ...      |
| c(xp,x1)   c(xp,x2)   ...   v(xp)    |

Some features of Cov(X):
• Cov(X) is a symmetric p x p matrix.
• The diagonal terms of Cov(X) are the variances of the genes across patients.
• The off-diagonal terms of Cov(X) are the covariances between gene vectors.
• Cov(X) captures the correlations between all possible pairs of measurements.
• In the diagonal terms, by assumption, large values correspond to interesting dynamics.
• In the off-diagonal terms, large values correspond to high redundancy.
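As a small illustration of these definitions (made-up numbers, not from the slides): build the covariance matrix for three hypothetical genes measured on five patients, check its symmetry, and verify the defining property A x = lambda x with eigen().

```r
# Made-up expression values: 5 patients (rows) x 3 genes (columns);
# cov() treats columns as variables, giving a 3 x 3 covariance matrix.
X <- cbind(g1 = c(1.0, 2.1, 2.9, 4.2, 5.0),
           g2 = c(2.0, 1.8, 3.1, 3.9, 5.2),
           g3 = c(5.1, 3.8, 3.0, 2.2, 0.9))

A <- cov(X)
isSymmetric(A)          # TRUE: the covariance matrix is symmetric
diag(A)                 # diagonal terms: variances of the genes
A[upper.tri(A)]         # off-diagonal terms: covariances between gene pairs

e <- eigen(A)           # R: eigen(matrix); Matlab: eig(matrix)
e$values                # eigenvalues, in decreasing order
e$vectors               # corresponding eigenvectors (as columns)

# Verify the defining property A x = lambda x for the first pair:
A %*% e$vectors[, 1] - e$values[1] * e$vectors[, 1]   # ~ zero vector
```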
How do we get there?
The principal components of X are the eigenvectors of Cov(X). Assume we can "manipulate" X a bit; let us call the result Y. Y should be manipulated so that it is a bit more optimal than X was. What does optimal mean? The off-diagonal entries of Cov(Y) (the covariances) should be SMALL, ideally zero, while the diagonal entries (the variances) should be LARGE. In other words, Cov(Y) should be diagonal, with large values on the diagonal.

The manipulation is a change of basis with orthonormal vectors, ordered such that the most important one comes first (hence "principal"). In mathematical terms: find an orthonormal matrix P such that Y = PX with Cov(Y) diagonal. Then the rows of P are the principal components of X.

With Y = PX and Cov(Y) = 1/(n-1) Y Y^t:

Cov(Y) = 1/(n-1) (PX)(PX)^t
       = 1/(n-1) P X X^t P^t
       = 1/(n-1) P A P^t,    where A := X X^t.

A is symmetric; therefore there is a matrix E of eigenvectors and a diagonal matrix D such that A = E D E^t. Now define P to be the transpose of the matrix E of eigenvectors, P := E^t. Then we can write A = P^t D P.

Going back to our covariance expression:

Cov(Y) = 1/(n-1) P A P^t
       = 1/(n-1) P (P^t D P) P^t
       = 1/(n-1) (P P^t) D (P P^t).

The inverse of an orthogonal matrix is its transpose (by its definition): P^{-1} = P^t. In our context that means

Cov(Y) = 1/(n-1) (P P^{-1}) D (P P^{-1}) = 1/(n-1) D.

So P diagonalizes Cov(Y), where P is the transpose of the matrix of eigenvectors of X X^t. The principal components of X are the eigenvectors of X X^t (that is, the rows of P), and the i-th diagonal value of Cov(Y) is the variance of X along p_i (along the i-th principal component). Essentially, we need to compute the eigenvalues (explained variance) and eigenvectors (principal components) of the covariance matrix of the original matrix X.

Some remarks
• If you multiply one variable by a scalar you get different results. This is because PCA uses the covariance matrix (and not the correlation matrix).
• PCA should therefore be applied to data that have approximately the same scale in each variable.
• The relative variance explained by each PC is given by eigenvalue/sum(eigenvalues).
• When to stop? For example: keep enough PCs to reach a cumulative variance explained by the PCs of >50-70%.
• Kaiser criterion: keep PCs with eigenvalues >1.

If the variables have very heterogeneous variances, we standardize them. The standardized variables are Xi* = (Xi - mean)/standard deviation; the new variables all have the same variance, so each variable gets the same weight.

Remarks
• PCA is useful for finding new, more informative, uncorrelated features; it reduces dimensionality by rejecting low-variance features.
• PCA is only powerful if the biological question is related to the highest variance in the dataset.

Algorithm (a runnable sketch follows below)
1. Standardize: Data = (Data.old - mean) / sqrt(variance)
2. Covariance: Cov(Data) = 1/(N-1) Data * Data^t
3. Find the eigenvectors and eigenvalues (function eigen in R, eig in Matlab) and sort them by decreasing eigenvalue; the sorted eigenvectors, as rows, form the projection matrix P.
4. Project the original data: P * Data.
5. Plot as many components as necessary.
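Here is a runnable R sketch of the algorithm above, on a small simulated data matrix with variables in rows and observations in columns, as in the slides. It follows the recipe literally; in practice you would call R's built-in prcomp(), which is used at the end as a cross-check.

```r
set.seed(1)
p <- 4; n <- 50                        # p variables (rows), n observations (columns)
raw <- matrix(rnorm(p * n), nrow = p)
raw[2, ] <- raw[1, ] + 0.1 * rnorm(n)  # make variable 2 nearly redundant with variable 1

# 1. Standardize each variable: (Data.old - mean) / sqrt(variance)
data <- t(scale(t(raw)))               # scale() works per column, hence the transposes

# 2. Covariance matrix: 1/(N-1) * Data %*% t(Data)
C <- data %*% t(data) / (n - 1)

# 3. Eigenvectors and eigenvalues; eigen() already sorts by decreasing eigenvalue
e <- eigen(C)
P <- t(e$vectors)                      # rows of P = principal components
e$values / sum(e$values)               # relative variance explained by each PC

# 4.-5. Project the original data and plot the first two components
Y <- P %*% data
plot(Y[1, ], Y[2, ], xlab = "PC1", ylab = "PC2")

# Cross-check with the built-in routine (prcomp expects observations in rows)
pr <- prcomp(t(raw), scale. = TRUE)
pr$sdev^2 / sum(pr$sdev^2)             # same explained-variance ratios
```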
Applications of PCA

Applications include:
• Image processing
• Microarray experiments
• Pattern recognition

Outline: principal component analysis in bioinformatics.

Example 1: Lefkovits et al.
X is a real-valued p x n matrix of spots x1, ..., xp by clones 1, ..., n. They want to analyse the relatedness of clones, so Cov(X) is a real-valued n x n matrix. They use the correlation matrix (which is, on top of the covariance, a division by the standard deviations).

Example 2: Yang et al.
(Figure: PCA of microarray experiments from the babo, tkv and control groups.)

Example 3: Ulloa-Montoya et al.
(Figure: PCA separating multipotent adult progenitor cells, pluripotent embryonic stem cells, and mesenchymal stem cells.)

But in such plots we only see the different experiments. If we do it the other way round, analysing the genes rather than the experiments, we see groupings of genes; yet we never see both together. So, can we somehow relate the experiments and the genes, that is, group the genes whose expression might be explained by the respective experimental group (tkv, babo, control)? This leads into "correspondence analysis".

Extensions of PCA
(Figure: a difficult, non-linearly structured example that plain PCA cannot handle.)
• Non-linear PCA
• Kernel PCA (http://research.microsoft.com/users/Cambridge/nicolasl/papers/eigen_dimred.pdf): PCA carried out in a feature space into which the data are mapped via a kernel, rather than in the original input space (a runnable sketch appears at the end of this section).

Multidimensional scaling (MDS)
MDS looks for a low-dimensional configuration of points whose pairwise distances reproduce the observed dissimilarities as closely as possible. The remaining mismatch is measured by a stress function, and several common stress functions are in use.
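The deck ends here, so the following is an illustration rather than slide material: classical (metric) MDS is available in base R as cmdscale(), and one common stress function, Kruskal's stress-1, can be written as sqrt( sum_(i<j) (d_ij - dhat_ij)^2 / sum_(i<j) d_ij^2 ), where d are the observed dissimilarities and dhat the distances in the low-dimensional configuration.

```r
# Classical (metric) MDS on distances between European cities,
# using the eurodist data set shipped with base R.
d <- eurodist
fit <- cmdscale(d, k = 2)              # k = target dimension

plot(fit[, 1], -fit[, 2], type = "n", xlab = "", ylab = "", asp = 1)
text(fit[, 1], -fit[, 2], labels = rownames(fit), cex = 0.7)

# Stress-1: mismatch between the original dissimilarities d and the
# distances dhat of the 2-D configuration (0 would mean a perfect fit).
dhat <- dist(fit)
sqrt(sum((d - dhat)^2) / sum(d^2))
```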
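Finally, as promised in the extensions outline above, a minimal self-contained sketch of kernel PCA. The Gaussian (RBF) kernel, its width sigma, and the two-rings toy data are assumptions chosen for illustration; the essential steps are only the centering and eigendecomposition of the kernel matrix. On the two rings, the leading kernel components should essentially encode the radius, which linear PCA cannot do.

```r
# Minimal kernel PCA sketch: perform PCA in feature space without ever
# computing the feature map explicitly; only the kernel matrix K is needed.
rbf_kernel <- function(X, sigma) {
  sq <- as.matrix(dist(X))^2                 # squared Euclidean distances
  exp(-sq / (2 * sigma^2))                   # Gaussian (RBF) kernel matrix
}

kernel_pca <- function(X, k = 2, sigma = 1) {
  n <- nrow(X)
  K <- rbf_kernel(X, sigma)
  # Center the kernel matrix, i.e. center the data in feature space:
  one <- matrix(1 / n, n, n)
  Kc <- K - one %*% K - K %*% one + one %*% K %*% one
  e <- eigen(Kc, symmetric = TRUE)
  # Normalize eigenvectors so the feature-space components have unit
  # length; the scores of the training points are then Kc %*% alpha.
  alpha <- e$vectors[, 1:k] %*% diag(1 / sqrt(e$values[1:k]), nrow = k)
  Kc %*% alpha                               # n x k matrix of kPC scores
}

# Toy non-linear data: two concentric rings that linear PCA cannot separate.
set.seed(2)
t1 <- runif(100, 0, 2 * pi); t2 <- runif(100, 0, 2 * pi)
X <- rbind(cbind(cos(t1), sin(t1)),          # inner ring, radius 1
           cbind(3 * cos(t2), 3 * sin(t2)))  # outer ring, radius 3
X <- X + matrix(rnorm(400, sd = 0.05), ncol = 2)

Z <- kernel_pca(X, k = 2, sigma = 1)
plot(Z, col = rep(1:2, each = 100), xlab = "kPC1", ylab = "kPC2")
```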