Chem452: Lecture 15
Week 8: Complex Data Analysis in Biology
Halil Bayraktar

What is common in all?

Singular Value Decomposition (SVD analysis)
SVD is a matrix decomposition method used to analyze complex data and answer important questions. It is used extensively for
a) Gene expression analysis
b) Image analysis, compression and matching
c) Sound analysis and matching
d) Chemical sample analysis etc.
e) Time series analysis, signal processing and automation control
f) Google web site ranking
It is easy to use, and rich information can be obtained from the data.

Gene expression
You would like to answer the following questions.
What is the best way of analyzing your data?
Which genes are expressed at the same time in cancer cells?
Which genes are expressed in cancer cells but not in normal cells?
How dramatically does the expression change over time?

[Figure: gene expression intensity map. Axes: gene type and time; colors define the intensity values.]
N genes are expressed at M different time points, which gives an N by M matrix.

Yeast cell cycle

PCA for DNA Microarrays
Genes with the same pattern might have similar function.

Fig. 2 Two pattern-discovery techniques. Data for both figures measure expression for 11 genes characterizing sensitivity to compound cytochalasin D in 60 cancer cell lines (ref. 97). a, The first three principal components, plotted using Matlab software (Mathworks). Apparent features include a tight cluster of leukemia samples (red dots, nearly superimposed) and the more scattered outlying cluster of CNS tumors (black dots). A single lung cancer sample (NSCLC-NCIH226) also appears as an outlier: the solitary orange dot at the top. b, Hierarchical clustering of the same data, using Cluster/TreeView (http://rana.lbl.gov/EisenSoftware.htm). Names of samples extremely sensitive or resistant to cytochalasin D (see Supplementary information) are prefixed ‘S’ and ‘R’ respectively.
The samples fall into two main clusters, roughly, but not perfectly, separating the sensitive and resistant samples. As in a, fine structure shows a tight leukemia cluster (underlined in green) and a tight CNS cluster (underlined in red), but does not suggest that the CNS cluster or NSCLC-NCIH226 (underlined in blue) are outliers. Apparent in both a and b is the relative heterogeneity of the breast cancer cell lines.

If genes are expressed at the same time, they have similar scores and lie close together in the 3D map.

[...] merging the two closest clusters is repeated until a single cluster remains. This arranges the data into a tree structure that can be broken into the desired number of clusters by cutting across the tree at a particular height. Tree structures are easily viewed and understood (Fig. 2b), and the hierarchical structure provides potentially useful information about the relationships between clusters. Trees are known to reveal close relationships very well. However, as [...]

What important information can be obtained?
1. Prediction of the class from these genes: the key genes can be identified. The expression of these genes can be used to predict the type (cancer or normal) of a cell sample.
2. Building a large network of genes: complex data analysis is performed to learn the gene expression and construct a graph that shows the dependency among the expressed genes.
[Figure: 3D plot with axes PC1, PC2 and PC3.]

Gene expression can also be probed for the effect of toxic compounds, ions, different peptides etc.
[Figure: intensity versus time for different conditions: toxic compounds, different ions, small molecules, different peptides, cell types, etc.]

SVD in biology
Which genes are most important? Are they the ones with the largest eigenvalues?
What is the relative significance of these genes?
Can we identify unknown genes?

Computer Science
Identifying faces: we need a training matrix, that is, some initial data to start our analysis.
Is there anything special in these pictures?
A face database can be analyzed the same way. Questions such as the following can be asked.
Eigenfaces for Recognition
Matthew Turk and Alex Pentland, MIT Media Lab
Can we find the important and good features when we analyze faces?

PCA analysis on faces: Eigenfaces
[Figure: original image (1906x1372) and its first principal component; 400 principal components in total.]

Computer Science
GOOGLE WEB PAGE RANKING
Larry Page, Sergey Brin
They used an eigenvector calculation, closely related to singular value decomposition, to solve their complex data problem. The result, the dominant eigenvector, is used to rank the pages. Google calculates the eigenvectors of a large matrix that represents the internet's links in order to rank which pages users will most likely (and probably should) end up on.

PCA in social science: Sociology
[Figure: variables include reading score, job quality and years of education; categories Cate 1, Cate 2, Cate 3.]

Eigenvalue Decomposition
$\lambda_i$ = eigenvalue; $b_i$, $x_i$ = eigenvectors
$A b_i = \lambda_i x_i$
Remember from linear algebra that the eigenvalue decomposition does not always exist.
The columns $x_i$ and $b_i$ of $X$ and $B$ are called the left and right singular vectors respectively, and the diagonal elements $\lambda_i$ of $\Lambda$ are called the singular values. $A b_i$ is in the direction of $x_i$.
In matrix form, $A b_i = \lambda_i x_i$ becomes $AV = US$.

Singular Value Decomposition (SVD)
The SVD of a rectangular matrix $A$ is a decomposition of the form $A = U S V^T$, where $U$ and $V$ have orthonormal columns and $S$ is a diagonal matrix.
U is m×n and orthonormal
S is n×n and diagonal
V is n×n and orthonormal

$AV = US$
$A V V^T = U S V^T$
$A = U S V^T$ (singular value decomposition of $A$)
The SVD can always be written for $A$.
$V V^T = I$
$S = \mathrm{diag}(\lambda_1, \lambda_2, \dots, \lambda_n)$
The singular values $\lambda_i$ are the square roots of the eigenvalues of $A^T A$; $A A^T$ has the same nonzero eigenvalues.

Principal Component Analysis (PCA)
PCA uses the SVD in its calculation. In PCA, we basically find the eigenvalues and eigenvectors of the covariance matrix.
$C = A A^T / N$
Covariance of two variables a and b:
σaσb = 0: uncorrelated
σaσb = σa²: highly correlated

$$A = U \begin{pmatrix} s_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & s_n \end{pmatrix} V^T$$

Remember that the $s_i$ on the diagonal (written $\lambda_i$ above) are called the singular values of $A$.
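The relations above can be checked numerically. The following is a minimal sketch in Python with NumPy, which is an assumption of this example and not a tool used in the lecture. The matrix `A` is a small random stand-in for real data. The sketch checks that $A = U S V^T$, that the columns of $U$ and $V$ are orthonormal, that $AV = US$, and that the squared singular values match the eigenvalues of $A^T A$.

```python
import numpy as np

rng = np.random.default_rng(0)       # fixed seed so the sketch is reproducible
A = rng.normal(size=(6, 4))          # hypothetical m x n data matrix (m = 6, n = 4)

# Thin SVD: U is m x n, s holds the n singular values, Vt is V transposed (n x n)
U, s, Vt = np.linalg.svd(A, full_matrices=False)
S = np.diag(s)
V = Vt.T

# A = U S V^T
reconstruction_ok = np.allclose(A, U @ S @ Vt)

# Orthonormal columns: U^T U = I and V V^T = I
orthonormal_ok = np.allclose(U.T @ U, np.eye(4)) and np.allclose(V @ V.T, np.eye(4))

# AV = US, i.e. A v_i = s_i u_i for every singular triplet
triplet_ok = np.allclose(A @ V, U @ S)

# Squared singular values equal the eigenvalues of A^T A (sorted in descending order)
eigvals = np.sort(np.linalg.eigvalsh(A.T @ A))[::-1]
eigen_ok = np.allclose(s**2, eigvals)

print(reconstruction_ok, orthonormal_ok, triplet_ok, eigen_ok)
```

The thin form (`full_matrices=False`) is chosen here because it matches the m×n, n×n, n×n shapes listed on the slide.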
The rank R of A is at most its smallest dimension; more precisely, the number of nonzero singular values determines the rank R of A.

1. Eigenvectors and eigenvalues always come in pairs.
2. An eigenvalue is the scaling factor of its vector.
3. Every matrix has an SVD.
4. The singular values can be determined and ordered: $s_1 \ge s_2 \ge s_3 \ge \dots \ge s_n \ge 0$

EIGENVALUES
Why are eigenvalues important?
They can be considered a characteristic of the matrix. For example, you can tell whether a large set of genes is expressed at a certain time but not at another. Or you can say which web sites are more important than others just by looking at their eigenvalues.
Briefly, the eigenvalue for a given factor measures the variance in all the variables that is accounted for by that factor.
The largest eigenvalue gives the principal axis, the axis along which the variance is largest.
The ratio of eigenvalues is extremely important. If a factor has a low eigenvalue, it explains little of the variance in the variables.

The main idea in PCA is to reduce the dimensionality of our data A by approximating A as a sum of rank-one matrices:
$$A_n = \sum_{i=1}^{n} u_i \lambda_i v_i^T$$
Each term $u_i \lambda_i v_i^T$ is a rank-one matrix.

PCA by SVD
We can use SVD to perform PCA. We decompose A using SVD:
$A = U S V^T$
PCA seeks a linear combination of variables such that the maximum variance is extracted from the variables.
It approximates a high-dimensional data set with a smaller, lower-dimensional linear set that still contains most of the information in the large set.
[Figure: data plotted on the original axes, with Principal Component 1 and Principal Component 2 drawn in.]
Genes with the same pattern might have similar function.

Figure 3: A spectrum of possible redundancies in data from the two separate recordings r1 and r2 (e.g. xA, yB). The best-fit line r2 = k r1 is indicated by the dashed line.

Take home message.
SVD is a great mathematical method to analyze complex data.
Covariance can be considered a measure of how well correlated two variables are.
The correlations among the observations become clear after diagonalization of the covariance matrix.
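To make the take-home message concrete, here is a minimal sketch of PCA by SVD in Python with NumPy (an assumption of this example; the lecture does not prescribe a tool). The data are synthetic and purely illustrative: three variables in rows and N observations in columns, with the second variable nearly a multiple of the first, like r2 = k r1. The sketch forms the covariance matrix $C = A A^T / N$, shows that it becomes diagonal in the basis of the left singular vectors, and uses the ratio of eigenvalues to judge how much variance each component explains.

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-in data (not from the lecture): 3 variables (rows), N observations (columns).
# r2 is almost a multiple of r1, so these two variables are highly correlated.
N = 200
r1 = rng.normal(size=N)
r2 = 2.0 * r1 + 0.1 * rng.normal(size=N)
r3 = rng.normal(size=N)
A = np.vstack([r1, r2, r3])
A = A - A.mean(axis=1, keepdims=True)   # center each variable

# Covariance matrix as written on the slide: C = A A^T / N
C = A @ A.T / N

# PCA by SVD: A = U S V^T; the columns of U are the principal axes
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Eigenvalues of C are s_i^2 / N; their ratios give the share of variance per component
eigenvalues = s**2 / N
explained = eigenvalues / eigenvalues.sum()

# Diagonalization: in the principal-axis basis the covariance matrix is diagonal
C_diag = U.T @ C @ U
off_diagonal = C_diag - np.diag(np.diag(C_diag))

# Rank-k approximation: A_k = sum over i <= k of u_i s_i v_i^T
k = 2
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
relative_error = np.linalg.norm(A - A_k) / np.linalg.norm(A)

print("explained variance ratio:", np.round(explained, 3))
print("largest off-diagonal covariance after rotation:", np.abs(off_diagonal).max())
print("relative error of the rank-2 approximation:", relative_error)
```

Because r1 and r2 carry almost the same information, two components are enough to reproduce the three-variable data closely, which is the dimensionality reduction described in the lecture.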