ICA-based Clustering of Genes from Microarray Expression Data

Su-In Lee (1), Serafim Batzoglou (2)
(1) Department of Electrical Engineering, (2) Department of Computer Science, Stanford University

1. ABSTRACT

To cluster genes from DNA microarray data, we propose an unsupervised methodology based on independent component analysis (ICA). Based on an ICA mixture model of genomic expression patterns, linear and nonlinear ICA find components that are specific to certain biological processes. Genes that exhibit significant up-regulation or down-regulation within each component are grouped into clusters. We test the statistical significance of the enrichment of gene annotations within each cluster. ICA-based clustering outperformed other leading methods in constructing functionally coherent clusters on various datasets. This result supports our model of genomic expression data as a composite effect of independent biological processes. A comparison of clustering performance among various ICA algorithms, including a kernel-based nonlinear ICA algorithm, shows that nonlinear ICA performed best on small datasets and that natural-gradient maximum-likelihood estimation worked well on all the datasets.

2. GENE EXPRESSION MODEL

The expression pattern of genes in a given condition is a composite effect of the independent biological processes that are active in that condition. For example, suppose that there are 9 genes and 3 biological processes taking place inside a cell.

3. Microarray Data

Microarray data display the expression levels of a set of genes measured under various experimental conditions.
[Figure: microarray data matrix with genes G1, G2, ..., GN-1, GN on one axis and experiments Exp 1, Exp 2, Exp 3, ..., Exp i, ..., Exp M on the other. One slice shows the expression patterns of genes under an experimental condition Exp i; the other shows the expression levels of a gene Gi across experimental conditions. Examples of conditions: heat shock, G phase in cell cycle, etc. Examples of samples: liver cancer patient, normal person, etc.]

[Figure (gene expression model): the genome (Gene 1 to Gene 9) and its messenger RNA under three biological processes: ribosome biosynthesis, cell cycle regulation, and oxidative phosphorylation.]

Each biological process becomes active by turning on the genes associated with it. The observed genomic expression pattern in an experimental condition can be seen as a combined effect of the genomic expression programs of the biological processes that are active in that condition.

4. Mathematical Modeling

The expression measurements of K genes observed in three conditions, denoted by x1, x2 and x3, can be expressed as linear combinations of the genomic expression programs of three biological processes, denoted by s1, s2 and s3, through an unknown mixing system:

\[
x = As, \qquad
\begin{pmatrix} x_1 \\ \vdots \\ x_m \end{pmatrix}
=
\begin{pmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & & \vdots \\ a_{m1} & \cdots & a_{mn} \end{pmatrix}
\begin{pmatrix} s_1 \\ \vdots \\ s_n \end{pmatrix}
\]

[Figure: the genomic expression programs of biological processes (ribosome biogenesis, oxidative phosphorylation, cell cycle regulation) pass through the unknown mixing system to give the genomic expression patterns in certain experimental conditions (heat shock, starvation, hyper-osmotic shock).]

Given a microarray dataset, can we recover the genomic expression programs of the biological processes? In other words, can we decompose a matrix X into A and S so that each row of S represents a genomic expression program of a biological process?

6. ICA-based Clustering

Step 1: Apply ICA to the microarray data X to obtain Y.
Step 2: Cluster genes based on the independent components, the rows of Y.

Based on our gene expression model, the independent components y1, ..., yn are assumed to be the expression programs of biological processes.
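As a minimal sketch of the gene expression model behind these components, the toy example below mixes three hypothetical expression programs over 9 genes into three observed conditions, X = AS. The numeric values of A and S and the function name `mix` are invented for illustration; they are not taken from the poster.

```python
# Toy illustration of X = A S with 9 genes and 3 biological processes.
# All numbers are hypothetical; only the shapes follow the poster's example.

PROCESSES = ["ribosome biogenesis", "oxidative phosphorylation", "cell cycle regulation"]
CONDITIONS = ["heat shock", "starvation", "hyper-osmotic shock"]

# S: one row per biological process, one column per gene (expression program).
S = [
    [2.0, 1.5, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 1.0, 2.0, 1.5, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 1.5, 1.0, 2.0],
]

# A: one row per condition, one column per process (how active each process is).
A = [
    [1.0, 0.5, 0.0],
    [0.0, 1.0, 0.5],
    [0.5, 0.0, 1.0],
]


def mix(A, S):
    """Return X = A S: each observed row is a weighted sum of the programs."""
    n_genes = len(S[0])
    return [
        [sum(a_row[p] * S[p][g] for p in range(len(S))) for g in range(n_genes)]
        for a_row in A
    ]


X = mix(A, S)
for name, row in zip(CONDITIONS, X):
    print(f"{name:20s}", row)
```

In this sketch ICA would be given only X and asked to estimate rows resembling S, up to the usual ICA ambiguities of ordering and scaling.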
For each yi, genes are ordered by their activity levels on yi, and the C% (C = 7.5) of genes showing significantly high or low levels are grouped into a cluster.

[Figure: Gene 1 to Gene 9. We can measure the expression levels of genes using a microarray.]

5. ICA Algorithm

Using the log-likelihood maximization approach, we can find the W that maximizes the log-likelihood L(y, W). The yi's are assumed to be statistically independent:

\[
y = Wx, \qquad p(x) = |\det(W)|\, p(y), \qquad p(y) = \prod_{i=1}^{n} p_i(y_i)
\]

\[
L(y, W) = \log p(x) = \log |\det(W)| + \sum_{i=1}^{n} \log p_i(y_i)
\]

Prior information on y: super-Gaussian or sub-Gaussian?

\[
\Delta W \propto \frac{\partial L(y, W)}{\partial W} = (W^{T})^{-1} + \varphi(y)\, x^{T},
\qquad
\varphi(y) = \frac{\partial p(y) / \partial y}{p(y)}
= \left( \frac{\partial p(y_1) / \partial y_1}{p(y_1)}, \ldots, \frac{\partial p(y_n) / \partial y_n}{p(y_n)} \right)^{T}
\]

7. Measuring significance of ICA-based clusters

The statistical significance of the biological coherence of clusters was measured using gene annotation databases such as Gene Ontology (GO).

[Figure: clusters from ICA (Cluster 1, Cluster 2, Cluster 3, ..., Cluster i, ..., Cluster n) matched against GO categories (GO 1, GO 2, ..., GO m).]

For every combination of one of our clusters and a GO category, we calculated the p-value, the chance probability that the two sets share the observed number of genes, based on the hypergeometric distribution:

\[
p_{i,j} = 1 - \sum_{m=0}^{k-1} \frac{\binom{f}{m} \binom{g-f}{n-m}}{\binom{g}{n}}
\]

g: number of genes in all clusters and GO categories
f: number of genes in GO category j
n: number of genes in Cluster i
k: number of genes that GO category j and Cluster i share

9. Results

For each method, the minimum p-values (< 10^-7) corresponding to each GO functional class were collected and compared.

8. Microarray Datasets

For testing, five microarray datasets were used; for each dataset, the clustering performance of our approach was compared with that of another approach applied to the same dataset.
ID   Description                        Genes   Exps   Compared with
D1   Yeast during cell cycle            5679    22     PCA
D2   Yeast during cell cycle            6616    17     k-means clustering
D3   Yeast under stressful conditions   6152    173    Bayesian approach, Plaid model
D4   C. elegans in various conditions   17817   553    Topomap approach
D5   19 kinds of normal human tissue    7070    59     PCA
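The cluster selection of Section 6 and the hypergeometric p-value of Section 7 can be sketched together in a few lines. This is an illustrative sketch, not the authors' implementation. The function names, the toy loads, and the toy GO category are hypothetical, and the code assumes that the highest C% and the lowest C% of genes form two separate clusters per component. The p-value is computed as the upper tail of the hypergeometric distribution, which is mathematically equal to one minus the lower sum and avoids losing precision for very small values.

```python
from math import comb


def extreme_clusters(component, c_percent=7.5):
    """Return (high, low) gene sets for one independent component.

    component maps gene name -> load on that component.
    Assumption: top C% and bottom C% are treated as two separate clusters.
    """
    ranked = sorted(component, key=component.get)  # ascending by load
    size = max(1, round(len(ranked) * c_percent / 100))
    return set(ranked[-size:]), set(ranked[:size])


def hypergeom_pvalue(g, f, n, k):
    """Probability of sharing at least k genes by chance.

    g: genes in total, f: genes in the GO category,
    n: genes in the cluster, k: genes shared by both.
    Equals 1 - sum_{m=0}^{k-1} C(f,m) C(g-f,n-m) / C(g,n).
    """
    tail = sum(comb(f, m) * comb(g - f, n - m) for m in range(k, min(f, n) + 1))
    return tail / comb(g, n)


# Hypothetical component: 40 genes with loads -20, -19, ..., 19.
loads = {f"g{i}": float(i - 20) for i in range(40)}
high, low = extreme_clusters(loads)  # 7.5% of 40 genes = 3 genes each

# Hypothetical GO category containing the three top genes plus one other.
go_category = {"g37", "g38", "g39", "g5"}
shared = len(high & go_category)
p = hypergeom_pvalue(g=len(loads), f=len(go_category), n=len(high), k=shared)
print(sorted(high), sorted(low), p)
```

Using exact integer binomials from `math.comb` keeps the sketch dependency-free; for datasets with thousands of genes a statistics library's survival function would be the more usual choice.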