EECS 800 Research Seminar
Mining Biological Data
Instructor: Luke Huan
Fall, 2006
The University of Kansas

Administrative
• Project design is due Oct 30th (~3 weeks from now)
• Include the following items in the document:
  • The goal of the project
  • A brief introduction of the overall project
  • A list of background materials that will be covered in the final report
  • A high-level design of your project
  • A testing plan

9/25/2006 Data Types Mining Biological Data KU EECS 800, Luke Huan, Fall’06 slide2

Overview
• Gain insight into high-dimensional space by projection pursuit (feature reduction)
• PCA: principal component analysis
  • A data analysis tool
  • Mathematical background
  • PCA and gene expression profile analysis, briefly

A Group of Related Techniques
• Unsupervised
  • Principal Component Analysis (PCA)
  • Latent Semantic Indexing (LSI): truncated SVD
  • Independent Component Analysis (ICA)
  • Canonical Correlation Analysis (CCA)
• Supervised
  • Linear Discriminant Analysis (LDA)
• Semi-supervised
  • Research topic

Rediscovery – Renaming of PCA
• Statistics: Principal Component Analysis (PCA)
• Social Sciences: Factor Analysis (PCA is a subset)
• Probability / Electrical Eng: Karhunen-Loève expansion
• Applied Mathematics: Proper Orthogonal Decomposition (POD)
• Geo-Sciences: Empirical Orthogonal Functions (EOF)

An Interesting Historical Note
• The 1st (?) application of PCA to Functional Data Analysis: Rao, C. R. (1958) Some statistical methods for comparison of growth curves, Biometrics, 14, 1-17.
• 1st paper with the “Curves as Data” viewpoint

What is Principal Component Analysis?
Principal component analysis (PCA)
• Reduces the dimensionality of a data set by finding a new set of variables, smaller than the original set of variables
• Retains most of the sample's information
• Useful for the compression and classification of data
By information we mean the variation present in the sample, given by the correlations between the original variables. The new variables, called principal components (PCs), are uncorrelated and are ordered by the fraction of the total information each retains.

A Geometric Picture
• The 1st PC, $z_1$, is the line in the space such that the “projected” data set has the largest total variance
• The 2nd PC, $z_2$, is the line, orthogonal to $z_1$, that captures the largest share of the remaining variance
• PCs are a series of linear fits to a sample, each orthogonal to all the previous ones

Connect Math to Graphics
2-d Toy Example (figure panels: Feature Space, Object Space)
• Data points (curves) are columns of the data matrix, $X$

Connect Math to Graphics (Cont.)
2-d Toy Example: sample mean, $\bar{X}$

Connect Math to Graphics (Cont.)
2-d Toy Example: residuals from mean = data - mean

Connect Math to Graphics (Cont.)
2-d Toy Example: recentered data = mean residuals, shifted to 0

Connect Math to Graphics (Cont.)
2-d Toy Example: PC1 direction = $\eta$ = eigenvector (with the biggest $\lambda$)

Connect Math to Graphics (Cont.)
2-d Toy Example (figure panels: Centered Data, PC1 Projection, Residual)

Connect Math to Graphics (Cont.)
2-d Toy Example: PC2 direction = $\eta$ = eigenvector (with the 2nd biggest $\lambda$)

Connect Math to Graphics (Cont.)
2-d Toy Example (figure panels: Centered Data, PC2 Projection, Residual)

Connect Math to Graphics (Cont.)
Note for this 2-d example:
• PC1 residuals = PC2 projections
• PC2 residuals = PC1 projections
(i.e. colors are common across these pictures)

PCA and Complex Data Analysis
• Data set is a set of curves
• How to find clusters?
• Treat curves as points in a high-dimensional space
• Applications in gene expression profile analysis
Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808

N-D Toy Example
• Upper left shows the mean.
• Upper right is residuals from the mean.
• Lower left is projections of the mean residuals in the PC1 direction.
• Lower right is further residuals from the PC1 projections.

Yeast Cell Cycle Data
• Central question: which genes are “periodic” over 2 cell cycles?

Yeast Cell Cycle Data, PCA analysis
• Periodic genes?
• Naïve approach: simple PCA

Yeast Cell Cycle Data, FDA View
• Central question: which genes are “periodic” over 2 cell cycles?
• Naïve approach: simple PCA
• Doesn't work
• No apparent (2 cycle) periodic structure?
• Eigenvalues suggest a large amount of “variation”
• PCA finds “directions of maximal variation”
• Often, but not always, the same as “interesting directions”
• Here we need a better approach to study periodicities

Yeast Cell Cycles, Freq. 2 Proj.
• PCA on the frequency 2 periodic component of the data

PCA for 2D Surfaces
• 2-d M-Rep example: corpus callosum (figure labels: Atoms, Spokes, Implied Boundary)

Pros and Cons
• PCA works for a multi-dimensional Gaussian distribution
• It doesn't work for:
  • Gaussian mixtures
  • Data in non-Euclidean spaces

Detailed Look at PCA
Three important (and interesting) viewpoints:
• Mathematics
• Numerics
• Statistics
1st: review linear algebra and multivariate probability.

Review of Linear Algebra
Vector space: a set of “vectors”, $x$, and “scalars” (coefficients, i.e. elements of a field), $a$, “closed” under “linear combination” ($\sum_i a_i x_i$ is in the space).
For example: $\mathbb{R}^d = \{ x = (x_1, \dots, x_d)^t : x_1, \dots, x_d \in \mathbb{R} \}$, “$d$-dimensional Euclidean space”.

Subspace
Subspace: a subset that is again a vector space, i.e. closed under linear combination.
Examples:
• lines through the origin
• planes through the origin
• all linear combinations of a subset of vectors (= a hyperplane through the origin)

Basis
Basis of a subspace: a set of vectors that
• span, i.e. everything is a linear combination of them
• are linearly independent, i.e. the linear combination is unique
Example: the “unit vector basis” in $\mathbb{R}^d$: $(1, 0, \dots, 0)^t, (0, 1, \dots, 0)^t, \dots, (0, 0, \dots, 1)^t$, e.g.
$x = (x_1, x_2, \dots, x_d)^t = x_1 (1, 0, \dots, 0)^t + x_2 (0, 1, \dots, 0)^t + \dots + x_d (0, 0, \dots, 1)^t$

Basis Matrix
Basis matrix of a subspace of $\mathbb{R}^d$. Given a basis $v_1, \dots, v_n$, create the matrix of columns:
$B = (v_1 \; \cdots \; v_n) = \begin{pmatrix} v_{11} & \cdots & v_{n1} \\ \vdots & & \vdots \\ v_{1d} & \cdots & v_{nd} \end{pmatrix}_{d \times n}$
Then a “linear combination” is a matrix multiplication:
$\sum_{i=1}^{n} a_i v_i = B a$, where $a = (a_1, \dots, a_n)^t$

Linear Transformation
Aside on matrix multiplication (linear transformation): for matrices
$A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & & \vdots \\ a_{k,1} & \cdots & a_{k,m} \end{pmatrix}$, $B = \begin{pmatrix} b_{1,1} & \cdots & b_{1,n} \\ \vdots & & \vdots \\ b_{m,1} & \cdots & b_{m,n} \end{pmatrix}$
define the “matrix product”
$AB = \begin{pmatrix} \sum_{i=1}^{m} a_{1,i} b_{i,1} & \cdots & \sum_{i=1}^{m} a_{1,i} b_{i,n} \\ \vdots & & \vdots \\ \sum_{i=1}^{m} a_{k,i} b_{i,1} & \cdots & \sum_{i=1}^{m} a_{k,i} b_{i,n} \end{pmatrix}$
(“inner products” of the rows of $A$ with the columns of $B$; composition of linear transformations)

Matrix Trace
• For a square matrix $A = \begin{pmatrix} a_{1,1} & \cdots & a_{1,m} \\ \vdots & & \vdots \\ a_{m,1} & \cdots & a_{m,m} \end{pmatrix}$, define $\mathrm{tr}(A) = \sum_{i=1}^{m} a_{i,i}$
• Trace commutes with matrix multiplication: $\mathrm{tr}(AB) = \mathrm{tr}(BA)$

Dimension
Dimension of a subspace (a notion of “size”): the number of elements in a basis (unique).
• $\dim(\mathbb{R}^d) = d$ (use the unit vector basis)
• Example: the dimension of a line is 1; the dimension of a plane is 2
• Dimension is “degrees of freedom”

Vector Norm
• In $\mathbb{R}^d$: $\|x\| = \left( \sum_{j=1}^{d} x_j^2 \right)^{1/2} = (x^t x)^{1/2}$
• Idea: “length” of the vector
• “Length normalized vector”: $x / \|x\|$ (has length one, thus lies on the surface of the unit sphere)
• Get “distance” as: $d(x, y) = \|x - y\| = \sqrt{(x - y)^t (x - y)}$

Inner Product
Inner (dot, scalar) product:
• for vectors $x$ and $y$: $\langle x, y \rangle = \sum_{j=1}^{d} x_j y_j = x^t y$
• related to the norm via $\|x\| = \sqrt{\langle x, x \rangle} = \sqrt{x^t x}$
• measures the “angle between $x$ and $y$” as: $\mathrm{angle}(x, y) = \cos^{-1}\left( \frac{\langle x, y \rangle}{\|x\| \, \|y\|} \right) = \cos^{-1}\left( \frac{x^t y}{\sqrt{x^t x \cdot y^t y}} \right)$
• key to “orthogonality”, i.e.
“perpendicularity”: $x \perp y$ if and only if $\langle x, y \rangle = 0$

Orthonormal Basis
Orthonormal basis $v_1, \dots, v_n$:
• All orthogonal to each other, i.e. $\langle v_i, v_{i'} \rangle = 0$ for $i \ne i'$
• All have length 1, i.e. $\langle v_i, v_i \rangle = 1$ for $i = 1, \dots, n$
• “Spectral representation”: $x = \sum_{i=1}^{n} a_i v_i$, where $a_i = \langle x, v_i \rangle$
  check: $\langle x, v_i \rangle = \left\langle \sum_{i'=1}^{n} a_{i'} v_{i'}, v_i \right\rangle = \sum_{i'=1}^{n} a_{i'} \langle v_{i'}, v_i \rangle = a_i$
• Matrix notation: $x = B a$, where $a^t = x^t B$, i.e. $a = B^t x$
• $a$ is called the “transform (e.g. Fourier, wavelet) of $x$”

Vector Projection
Projection of a vector $x$ onto a subspace $V$:
• Idea: the member of $V$ that is closest to $x$ (i.e. an “approximation”)
• Find $P_V x \in V$ that solves $\min_{v \in V} \|x - v\|$ (“least squares”)
• General solution in $\mathbb{R}^d$: for basis matrix $B_V$, $P_V x = B_V (B_V^t B_V)^{-1} B_V^t x$
• So the “projection operator” is a “matrix multiplication”: $P_V = B_V (B_V^t B_V)^{-1} B_V^t$ (thus projection is another linear operation)

Vector Projection (cont)
Projection using an orthonormal basis $v_1, \dots, v_n$:
• The basis matrix is “orthonormal”:
  $B_V^t B_V = \begin{pmatrix} v_1^t \\ \vdots \\ v_n^t \end{pmatrix} (v_1 \; \cdots \; v_n) = \begin{pmatrix} \langle v_1, v_1 \rangle & \cdots & \langle v_1, v_n \rangle \\ \vdots & & \vdots \\ \langle v_n, v_1 \rangle & \cdots & \langle v_n, v_n \rangle \end{pmatrix} = \begin{pmatrix} 1 & & 0 \\ & \ddots & \\ 0 & & 1 \end{pmatrix} = I_{n \times n}$
• So $P_V x = B_V B_V^t x$ = Recon(Coeffs of $x$ “in $V$ direction”)
• For the “orthogonal complement”, $V^\perp$: $x = P_V x + P_{V^\perp} x$ and $\|x\|^2 = \|P_V x\|^2 + \|P_{V^\perp} x\|^2$
• Parseval inequality: $\|x\|^2 \ge \|P_V x\|^2 = \sum_{i=1}^{n} \langle x, v_i \rangle^2 = \sum_{i=1}^{n} a_i^2 = \|a\|^2$

Random Vectors
Given a “random vector” $X = (X_1, \dots, X_d)^t$:
• A “center” of the distribution is the mean vector, $EX = (EX_1, \dots, EX_d)^t$
• A “measure of spread” is the covariance matrix:
  $\mathrm{cov}(X) = \begin{pmatrix} \mathrm{var}(X_1) & \cdots & \mathrm{cov}(X_1, X_d) \\ \vdots & \ddots & \vdots \\ \mathrm{cov}(X_d, X_1) & \cdots & \mathrm{var}(X_d) \end{pmatrix}$

Empirically
Given a random sample $X_1, \dots, X_n$, estimate the theoretical mean $\mu$ with the sample mean:
$\hat{\mu} = \bar{X} = (\bar{X}_1, \dots, \bar{X}_d)^t = \frac{1}{n} \sum_{i=1}^{n} X_i$
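
The orthonormal-basis projection and the norm decomposition from the Vector Projection slides can be checked numerically. The following is only a minimal sketch in plain Python: the helper names `dot` and `project`, the example plane, and the vector `x` are illustrative assumptions, not part of the lecture.

```python
import math

def dot(x, y):
    """Inner product <x, y> = sum_j x_j * y_j."""
    return sum(a * b for a, b in zip(x, y))

def project(x, basis):
    """Project x onto span(basis), ASSUMING the basis vectors are orthonormal.

    Returns the coefficients a_i = <x, v_i> and the projection sum_i a_i v_i,
    i.e. B B^t x for the basis matrix B with columns v_i.
    """
    coeffs = [dot(x, v) for v in basis]
    proj = [sum(a * v[j] for a, v in zip(coeffs, basis)) for j in range(len(x))]
    return coeffs, proj

# Hypothetical example: an orthonormal basis of a plane V in R^3.
s = 1.0 / math.sqrt(2.0)
basis = [[s, s, 0.0], [0.0, 0.0, 1.0]]
x = [3.0, 1.0, 2.0]

coeffs, proj = project(x, basis)
resid = [xi - pi for xi, pi in zip(x, proj)]  # component in the orthogonal complement

print("coefficients a:", coeffs)
print("projection P_V x:", proj)
print("||x||^2:", dot(x, x))
print("||P_V x||^2 + ||resid||^2:", dot(proj, proj) + dot(resid, resid))
```

Because the basis is orthonormal, no matrix inverse is needed; with a general basis matrix the factor $(B_V^t B_V)^{-1}$ from the general solution would have to be included.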
Empirically (Cont.)
And estimate the “theoretical cov.” $\Sigma$ with the “sample cov.”:
$\hat{\Sigma} = \frac{1}{n-1} \begin{pmatrix} \sum_{i=1}^{n} (X_{i1} - \bar{X}_1)^2 & \cdots & \sum_{i=1}^{n} (X_{i1} - \bar{X}_1)(X_{id} - \bar{X}_d) \\ \vdots & \ddots & \vdots \\ \sum_{i=1}^{n} (X_{id} - \bar{X}_d)(X_{i1} - \bar{X}_1) & \cdots & \sum_{i=1}^{n} (X_{id} - \bar{X}_d)^2 \end{pmatrix}$

With Linear Algebra
Outer product representation:
$\hat{\Sigma} = \frac{1}{n-1} \sum_{i=1}^{n} \begin{pmatrix} (X_{i1} - \bar{X}_1)^2 & \cdots & (X_{i1} - \bar{X}_1)(X_{id} - \bar{X}_d) \\ \vdots & \ddots & \vdots \\ (X_{id} - \bar{X}_d)(X_{i1} - \bar{X}_1) & \cdots & (X_{id} - \bar{X}_d)^2 \end{pmatrix} = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \bar{X})(X_i - \bar{X})^t = \tilde{X} \tilde{X}^t$,
where $\tilde{X} = \frac{1}{\sqrt{n-1}} \left( X_1 - \bar{X} \; \cdots \; X_n - \bar{X} \right)_{d \times n}$

PCA as an Optimization Problem
Find the “direction of greatest variability”:

Applications of PCA
• Eigenfaces for recognition. Turk and Pentland. 1991.
• Principal Component Analysis for clustering gene expression data. Yeung and Ruzzo. 2001.
• Probabilistic Disease Classification of Expression-Dependent Proteomic Data from Mass Spectrometry of Human Serum. Lilien. 2003.
• Zhao, X., Marron, J.S. and Wells, M.T. (2004) The Functional Data View of Longitudinal Data, Statistica Sinica, 14, 789-808
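
To tie the sample covariance to the “direction of greatest variability”, here is a minimal sketch in plain Python. It makes several assumptions that are not from the slides: observations are stored as rows (the slides store them as columns of $X$), the leading eigenvector is found by simple power iteration (one of several possible numerical methods), and the toy data and the function names `sample_covariance` and `first_pc` are purely illustrative.

```python
import math

def sample_covariance(rows):
    """Return (covariance matrix with 1/(n-1) scaling, centered data).

    Assumes each row is one observation X_i in R^d.
    """
    n, d = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - mean[j] for j in range(d)] for r in rows]
    cov = [[sum(c[j] * c[k] for c in centered) / (n - 1) for k in range(d)]
           for j in range(d)]
    return cov, centered

def first_pc(cov, iters=200):
    """Leading eigenpair of a symmetric covariance matrix via power iteration.

    Assumes the start vector is not orthogonal to the leading eigenvector
    and that the largest eigenvalue is strictly larger than the others.
    """
    d = len(cov)
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = [sum(cov[j][k] * v[k] for k in range(d)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient v^t S v gives the eigenvalue for the unit vector v.
    lam = sum(v[j] * cov[j][k] * v[k] for j in range(d) for k in range(d))
    return lam, v

# Hypothetical 2-d toy data, one observation per row.
data = [[2.0, 1.0], [3.0, 3.0], [4.0, 5.0], [5.0, 4.0], [6.0, 7.0]]

cov, centered = sample_covariance(data)
lam1, pc1 = first_pc(cov)

# PC1 scores: projections of the centered data onto the PC1 direction.
scores = [sum(c[j] * pc1[j] for j in range(len(pc1))) for c in centered]

print("sample covariance:", cov)
print("largest eigenvalue:", lam1)
print("PC1 direction:", pc1)
```

The sample variance of `scores` equals the largest eigenvalue, which is the sense in which the first PC captures the greatest variability. For real data a library eigen-decomposition or SVD routine would normally replace the power iteration.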