4. Gene Expression Data Analysis
EECS 600: Systems Biology & Bioinformatics
Instructor: Mehmet Koyuturk

Analyzing Gene Expression Data
Clustering
  How are genes related in terms of their expression under different conditions?
Differential gene expression
  Which genes are affected by a change in condition, tissue, or disease?
Classification (supervised analysis)
  Given the expression profile of a gene, can we assign a function?
  Given the expression levels of several genes in a sample, can we characterize the type of sample (e.g., cancerous or normal)?
Regulatory network inference
  How do genes regulate each other's expression to orchestrate cellular function?

Clustering
Group similar items together
Clustering genes based on their expression profiles
  We can measure the expression of multiple genes in multiple samples
  Genes that are functionally related should have similar expression profiles
Gene expression profile
  A vector (or a point) in multi-dimensional space, where each dimension corresponds to a sample
  Clustering of multi-dimensional real-valued data is a well-studied problem

Motivating Example
Expression levels of 2,000 genes in 22 normal and 40 tumor colon tissues (Alon et al., PNAS, 1999)
Applications of Clustering
Functional annotation
  If a gene with unknown function is clustered together with genes that perform a particular function, then that gene is likely to be associated with that function
Identification of regulatory motifs
  If a group of genes is co-regulated, then their regulation is likely modulated by similar transcription factors; by looking for common elements in the neighborhood of the coding sequences of genes in a cluster, we can identify regulatory motifs and their locations (promoters)
Modular analysis

Gene Expression Matrix
n samples, m genes
Generally, m >> n: $m = O(10^3)$, $n = O(10^1)$
Each row is an n-dimensional vector, the expression profile of a gene
\[ E = [e_{ij}], \quad 1 \le i \le m, \; 1 \le j \le n \]
\[ e_i = [e_{i1}, e_{i2}, \ldots, e_{in}]^T \]

Proximity Measures
How do we decide which genes are similar to each other?
Euclidean distance
\[ \mathrm{Euclidean}(e_i, e_j) = \|e_i - e_j\|_2 = \sqrt{\sum_{k=1}^{n} (e_{ik} - e_{jk})^2} \]
Manhattan distance
\[ \mathrm{Manhattan}(e_i, e_j) = \|e_i - e_j\|_1 = \sum_{k=1}^{n} |e_{ik} - e_{jk}| \]

Distance
Minkowski distance
  General version of Euclidean, Manhattan, etc.
  p is a parameter
\[ \mathrm{Minkowski}(e_i, e_j) = \|e_i - e_j\|_p = \left( \sum_{k=1}^{n} |e_{ik} - e_{jk}|^p \right)^{1/p} \]
\[ \|e_i - e_j\|_\infty = \max_{1 \le k \le n} |e_{ik} - e_{jk}| \]

Normalization
If we want to measure the distance between directions rather than absolute magnitudes, it may be necessary to standardize the mean and variation of expression levels for each gene
\[ \mu_i = \mu(e_i) = \frac{1}{n} \sum_{k=1}^{n} e_{ik} \]
\[ \sigma_i = \sigma(e_i) = \sqrt{\frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)^2} \]
\[ e'_{ik} = \frac{e_{ik} - \mu_i}{\sigma_i}, \qquad e'_i = [e'_{i1}, e'_{i2}, \ldots, e'_{in}]^T \]
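The proximity measures and the per-gene normalization described above can be tried out on toy profiles. This is a minimal sketch in plain Python; the function names and the two example profiles `g1` and `g2` are illustrative assumptions, not part of the lecture.

```python
import math

def minkowski(x, y, p):
    """Minkowski distance with parameter p >= 1 (p=1 Manhattan, p=2 Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def chebyshev(x, y):
    """Limit of the Minkowski distance as p grows: the largest coordinate gap."""
    return max(abs(a - b) for a, b in zip(x, y))

def standardize(x):
    """Shift a profile to zero mean and scale it to unit (population) std."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((a - mu) ** 2 for a in x) / n)
    return [(a - mu) / sigma for a in x]

# Hypothetical profiles: g2 follows the same trend as g1 at ten times the level.
g1 = [1.0, 2.0, 3.0, 4.0]
g2 = [10.0, 20.0, 30.0, 40.0]

manhattan = minkowski(g1, g2, 1)
euclidean = minkowski(g1, g2, 2)
# After standardization the two profiles coincide, so their distance vanishes.
normalized_gap = minkowski(standardize(g1), standardize(g2), 2)
```

The raw distances between `g1` and `g2` are large even though the two genes vary in exactly the same way across samples; after standardization the gap is zero up to rounding, which is the motivation for normalizing when direction matters more than magnitude.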
Correlation
The similarity between the variation of two random variables
  A vector is treated as a sampling of a random variable
Covariance
\[ \mathrm{Cov}[e_i, e_j] = \frac{1}{n} \sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j) \]
\[ \mathrm{Var}[e_i] = \mathrm{Cov}[e_i, e_i] = \sigma_i^2 \]

Pearson Correlation Coefficient
\[ \mathrm{Pearson}(e_i, e_j) = \frac{\mathrm{Cov}[e_i, e_j]}{\sqrt{\mathrm{Var}[e_i]\,\mathrm{Var}[e_j]}} = \frac{\sum_{k=1}^{n} (e_{ik} - \mu_i)(e_{jk} - \mu_j)}{n\,\sigma_i\,\sigma_j} \]
Pearson correlation is equal to the cosine of the angle between the normalized expression profiles (their inner product, scaled by their lengths)
\[ -1 \le \mathrm{Pearson}(e_i, e_j) \le 1 \]
Pearson correlation is normalized
\[ \mathrm{Pearson}(e'_i, e'_j) = \mathrm{Pearson}(e_i, e_j) \]

Euclidean Distance & Correlation
Euclidean distance (normalized) and Pearson correlation coefficient are closely related
\[ \mathrm{Euclidean}(e'_i, e'_j) = \sqrt{2n\,\bigl(1 - \mathrm{Pearson}(e_i, e_j)\bigr)} \]
These are the two most commonly used proximity measures in gene expression data analysis
Without loss of generality, we will use $\delta_{ij} = \delta(e_i, e_j)$ to denote the distance between two expression profiles

Other Measures of Correlation
Pearson is vulnerable to outliers
  If two genes have very high expression in a single sample, that sample might dominate and make the two expression profiles appear highly correlated
  Jackknife correlation: Estimate n correlations by leaving out each dimension (sample) in turn, and take the minimum among them
Pearson is not robust for non-Gaussian distributions
  Spearman's rank order correlation coefficient: Rank expression levels, replace each expression level with its rank
  More robust against outliers
  A lot of loss of information
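The correlation measures above, and the link between normalized Euclidean distance and Pearson correlation, can be checked numerically. The sketch below is plain Python under stated assumptions: the profiles `x` and `y` are made up so that a single outlier sample dominates, and the rank function assumes there are no ties.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length profiles."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)

def standardize(x):
    """Zero mean, unit population standard deviation."""
    n = len(x)
    mu = sum(x) / n
    sigma = math.sqrt(sum((a - mu) ** 2 for a in x) / n)
    return [(a - mu) / sigma for a in x]

def jackknife(x, y):
    """Minimum Pearson correlation over all leave-one-sample-out profiles."""
    return min(pearson(x[:k] + x[k + 1:], y[:k] + y[k + 1:])
               for k in range(len(x)))

def ranks(x):
    """Rank of each value (1 = smallest); assumes no ties for simplicity."""
    order = sorted(range(len(x)), key=lambda k: x[k])
    r = [0.0] * len(x)
    for rank, k in enumerate(order, start=1):
        r[k] = float(rank)
    return r

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

# Hypothetical profiles that agree only in the last (outlier) sample.
x = [1.0, 2.0, 1.5, 2.5, 50.0]
y = [2.0, 1.0, 2.5, 1.2, 60.0]

rho = pearson(x, y)      # dominated by the outlier sample
jk = jackknife(x, y)     # drops once the outlier is left out
sp = spearman(x, y)      # ranks damp the outlier

# Squared Euclidean distance of standardized profiles equals 2n(1 - rho).
d2 = sum((a - b) ** 2 for a, b in zip(standardize(x), standardize(y)))
```

On this example the plain Pearson value is close to 1, while the jackknife and Spearman values are much lower, which illustrates the outlier sensitivity noted above.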
Clustering Methods
Hierarchical clustering
  Group genes into a tree (a.k.a. dendrogram), so that each branch of the tree corresponds to a cluster
  Higher branches correspond to coarser clusters
Partitioning
  Partition genes into several groups so that similar genes will be in the same partition

Hierarchical Clustering
Direction of clustering
  Bottom-up (agglomerative): Start from individual genes, join them into groups until only one group is left
  Top-down (divisive): Start with one group consisting of all genes, keep partitioning groups until each group contains exactly one gene
  Agglomerative clustering is computationally less expensive. Why?
Hierarchical clustering methods are greedy
  Once a decision is made, it cannot be undone

Agglomerative Clustering
Start with m clusters: Each cluster contains one gene
At each step, choose the two clusters that are closest (or most correlated) and merge them
How do we evaluate the distance between two clusters?
Single linkage: If two clusters contain two very close genes, then the clusters are close to each other
\[ \delta(C_k, C_l) = \min_{i \in C_k,\, j \in C_l} \delta_{ij} \]
Complete linkage: Two clusters are close to each other only if all genes inside them are close to each other
\[ \delta(C_k, C_l) = \max_{i \in C_k,\, j \in C_l} \delta_{ij} \]
Group average: Two clusters are close to each other if their centers are close to each other
\[ \delta(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij} \]
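The agglomerative procedure and the three linkage rules can be sketched as follows. This is a deliberately naive illustration that recomputes every cluster distance at every step. The `stop_at` parameter, the function names, and the one-dimensional toy data are assumptions made for the example.

```python
def linkage_distance(ck, cl, dist, method):
    """Distance between clusters ck and cl (lists of gene indices)."""
    pairwise = [dist[i][j] for i in ck for j in cl]
    if method == "single":
        return min(pairwise)
    if method == "complete":
        return max(pairwise)
    if method == "average":
        return sum(pairwise) / len(pairwise)
    raise ValueError("unknown linkage: " + method)

def agglomerate(dist, method="single", stop_at=1):
    """Merge the two closest clusters until stop_at clusters remain."""
    clusters = [[i] for i in range(len(dist))]
    merges = []  # (cluster a, cluster b, distance), in merge order
    while len(clusters) > stop_at:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage_distance(clusters[a], clusters[b], dist, method)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]  # greedy: never undone
        del clusters[b]
    return clusters, merges

# Hypothetical one-dimensional "profiles"; the distance is the absolute difference.
positions = [0.0, 1.0, 2.5, 6.0, 7.0]
dist = [[abs(p - q) for q in positions] for p in positions]

single_clusters, single_merges = agglomerate(dist, "single", stop_at=2)
complete_clusters, _ = agglomerate(dist, "complete", stop_at=2)
```

The list `single_merges` records the merge order and merge distances, which is the information a dendrogram draws. Running to `stop_at=1` yields the full tree.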
Divisive Clustering
Recursive bipartitioning
  Find an "optimal" partitioning of the genes into two clusters
  Recursively work on each partition
  Since the number of clusters is an issue for partitioning-based clustering algorithms, the magic number 2 solves a lot of problems
May be computationally expensive
  The problem is "global"
  At every level of the tree, we have to work on all of the genes
  If the tree is imbalanced, there might be as many as m levels
With a reasonable stopping criterion, it may be considered a partition-based clustering method as well

Partition-Based Clustering
Find groups of genes such that genes in each group are similar to each other, while being somewhat less similar to those in other clusters
Easily interpretable
  Especially for large datasets (as compared to hierarchical clustering)

Number of Clusters
Clustering is "unsupervised", so generally we do not have prior knowledge of how many clusters underlie the data
It is very difficult to partition data into an "unknown" number of clusters
Most algorithms assume that K (the number of clusters) is known
Try different values of K, find the one that results in the best clustering
  Very expensive

Overlapping vs. Disjoint Clusters
Genes do not have a single function
  Most genes might be involved in different processes, so their expression profiles might demonstrate similarities with different genes in different contexts
  Can we allow a gene to be included in more than one cluster?
Allowing overlaps between clusters poses additional challenges
  To what extent do we allow overlaps? (We definitely don't want to identify two identical clusters)
Fuzzy Clustering
Assign weights to each gene-cluster pair, showing the extent (or likelihood) of the gene belonging to the cluster
  Difficult interpretation
Partitioning is a special case of fuzzy clustering, where the weights are restricted to binary values
Hierarchical clustering is also "fuzzy" in some sense
Continuous relaxation might alleviate computational complexity as well

K-Means Clustering
The most famous clustering algorithm
Given K, find K disjoint clusters such that the total intracluster variation is minimized
Cluster mean:
\[ \mu_k = \frac{1}{|C_k|} \sum_{i \in C_k} e_i \]
Intracluster variation of cluster k:
\[ \sum_{i \in C_k} \delta(e_i, \mu_k) \]
Total intracluster variation:
\[ \sum_{k=1}^{K} \sum_{i \in C_k} \delta(e_i, \mu_k) \]

K-Means Algorithm
K-Means is an iterative algorithm that alternately updates cluster assignments and cluster centers, each based on the other's current value, until no improvement is possible
1. Choose K expression profiles randomly, designate each of them as the center of one of the K clusters
2. Assign each gene to a cluster
  2.1. Each gene is assigned to the cluster whose center is closest to its profile
3. Redetermine cluster centers
4. If any gene was moved, go back to Step 2, else stop

Sample Run of K-Means
[figure]

Self Organizing Maps
Just like K-means, we have K clusters, but this time they are organized into a map
  Often a 2D grid
Just like K-means, each cluster is associated with a weight vector
  It was the cluster center in K-means
We want to organize clusters so that similar clusters will be in proximity in the map
  A way of visualizing in low-dimensional (2D) space
Each weight vector is first initialized randomly to some gene's expression profile
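The four numbered steps of the K-Means Algorithm slide can be sketched in plain Python. Squared Euclidean distance is assumed as the proximity measure. The `seed` and `max_iter` parameters and the toy profiles are additions for the example, not part of the lecture.

```python
import random

def sq_dist(x, y):
    """Squared Euclidean distance between two profiles."""
    return sum((a - b) ** 2 for a, b in zip(x, y))

def centroid(profiles):
    """Coordinate-wise mean of a non-empty list of profiles."""
    n = len(profiles)
    return [sum(col) / n for col in zip(*profiles)]

def k_means(profiles, k, seed=0, max_iter=100):
    rng = random.Random(seed)
    # Step 1: pick K profiles at random as the initial centers.
    centers = [list(p) for p in rng.sample(profiles, k)]
    assignment = [None] * len(profiles)
    for _ in range(max_iter):
        moved = False
        # Step 2: assign each gene to the cluster with the closest center.
        for i, p in enumerate(profiles):
            nearest = min(range(k), key=lambda c: sq_dist(p, centers[c]))
            if nearest != assignment[i]:
                assignment[i] = nearest
                moved = True
        # Step 4: stop as soon as no gene changes cluster.
        if not moved:
            break
        # Step 3: recompute each center as the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(profiles, assignment) if a == c]
            if members:  # an emptied cluster keeps its previous center
                centers[c] = centroid(members)
    return assignment, centers

# Hypothetical data: two well-separated groups of three profiles each.
profiles = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
            [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]]
assignment, centers = k_means(profiles, k=2)
```

Because the result depends on the random initial centers, practical use typically repeats the run with several seeds and keeps the solution with the lowest total intracluster variation.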
SOM Algorithm
At each step, a gene is selected at random
The distance between the gene's expression profile and each cluster's weight vector is calculated, and the cluster with the closest weight vector becomes the winner
The weight vectors of the winner and of its neighbors (according to the 2D mapping) are adjusted to represent the gene's expression profile better
\[ w_k(t+1) = w_k(t) - \alpha(t)\,\theta(C_k, C_j)\,\bigl(w_k(t) - e_i\bigr) \]
  $C_j$ is the winner cluster for gene i at time t
  α is a decreasing function of time, θ is the neighborhood function

Sample SOM Output
[figure]

Gene Co-expression Network
Nodes represent genes
Weighted edges between nodes represent proximity (correlation) between genes' expression profiles
This is indeed a way of predicting interactions between genes

Graph Theoretical Clustering
Partition the graph into heavy subgraphs
  Maximize total weight (number of edges) inside a cluster
  Minimize total weight (number of edges) between clusters
Heuristic algorithms
  CLICK: Recursive min-cut
  CAST: Iterative improvement, one by one for each cluster
Loss of information?
Model-Based Clustering
Generating model
  Each cluster is associated with a distribution (that generates expression profiles for associated genes) specified by model parameters
  The probability that a gene belongs to a cluster is specified by hidden parameters
Expectation Maximization (EM) algorithm
  Start with a guess of model parameters
  E-step: Compute expected values of hidden parameters based on model parameters
  M-step: Based on hidden parameters, estimate model parameters to maximize the likelihood of observing the data at hand
  Iterate
K-means is a special case

Evaluation of Clusters
In general, we want to maximize intra-cluster similarity, while minimizing inter-cluster similarity
Homogeneity, separation
  Based on the proximity metric
Reference partition
  Information on "true clusters" that comes from a different source (apart from expression data)
  Molecular annotation (e.g., Gene Ontology)
  Jaccard coefficient, sensitivity, specificity
Cluster annotation
  Processes that are significantly enriched in a cluster

Homogeneity & Separation
Heterogeneity (or homogeneity in the reverse direction)
  How similar are the genes in one cluster?
\[ H(C_k) = \frac{2}{|C_k|\,(|C_k| - 1)} \sum_{i, j \in C_k,\; i < j} \delta_{ij} \]
Separation
  How dissimilar are different clusters?
\[ S(C_k, C_l) = \frac{1}{|C_k|\,|C_l|} \sum_{i \in C_k} \sum_{j \in C_l} \delta_{ij} \]
Good clustering (with δ measuring distance): low heterogeneity, high separation

Overall Quality
Overall heterogeneity
\[ H = \frac{1}{m} \sum_{C_k} |C_k|\, H(C_k) \]
Overall separation
\[ S = \frac{1}{\sum_{C_k \ne C_l} |C_k|\,|C_l|} \sum_{C_k \ne C_l} |C_k|\,|C_l|\; S(C_k, C_l) \]
How do these change with respect to the number of clusters?
Can we optimize these values to choose the best number of clusters?
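The heterogeneity and separation measures from the Homogeneity & Separation and Overall Quality slides can be computed directly from a distance matrix. The sketch below assumes that δ is a distance (smaller means more similar) and uses made-up one-dimensional data. The function names are illustrative.

```python
def heterogeneity(cluster, dist):
    """Average pairwise distance inside one cluster, H(C_k)."""
    pairs = [(i, j) for a, i in enumerate(cluster) for j in cluster[a + 1:]]
    if not pairs:
        return 0.0  # a singleton cluster has no internal pairs
    return sum(dist[i][j] for i, j in pairs) / len(pairs)

def separation(ck, cl, dist):
    """Average distance between genes of two different clusters, S(C_k, C_l)."""
    return sum(dist[i][j] for i in ck for j in cl) / (len(ck) * len(cl))

def overall_quality(clusters, dist):
    """Size-weighted overall heterogeneity H and overall separation S."""
    m = sum(len(c) for c in clusters)
    h = sum(len(c) * heterogeneity(c, dist) for c in clusters) / m
    num = den = 0.0
    for a in range(len(clusters)):
        for b in range(a + 1, len(clusters)):
            w = len(clusters[a]) * len(clusters[b])
            num += w * separation(clusters[a], clusters[b], dist)
            den += w
    return h, num / den

# Hypothetical one-dimensional profiles with two natural groups.
positions = [0.0, 1.0, 2.5, 6.0, 7.0]
dist = [[abs(p - q) for q in positions] for p in positions]

h_good, s_good = overall_quality([[0, 1, 2], [3, 4]], dist)
h_bad, s_bad = overall_quality([[0, 3], [1, 2, 4]], dist)
```

With a distance as the proximity measure, the natural grouping has the lower H and the higher S of the two partitions. If δ were a similarity such as correlation, the preferred directions would be reversed.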
Bayesian Information Criterion
A statistical criterion for evaluating a model
Penalizes model complexity (number of free parameters to be estimated)
  k is the number of free parameters in the model, which increases with the number of clusters
  RSS is the "total error" in the model
Trade off the number of clusters against the optimization function to choose the best number of clusters

Reference Partitioning
If there is information about "ground truth" from an independent source, we can compare our clustering to such a reference partitioning
Pairwise assessment
  Let $C_{ij} = 1$ if gene i and gene j are assigned to the same cluster by the clustering algorithm, 0 otherwise
  Let $R_{ij} = 1$ if gene i and gene j are in the same cluster according to the reference partition, 0 otherwise
\[ n_{11} = \sum_{i,j} C_{ij} R_{ij}, \qquad n_{00} = \sum_{i,j} (1 - C_{ij})(1 - R_{ij}) \]
\[ n_{10} = \sum_{i,j} C_{ij} (1 - R_{ij}), \qquad n_{01} = \sum_{i,j} (1 - C_{ij}) R_{ij} \]

Comparing Partitions
Rand index (symmetric)
\[ \mathrm{Rand} = \frac{n_{11} + n_{00}}{n_{11} + n_{00} + n_{10} + n_{01}} \]
Jaccard coefficient (sparse)
\[ \mathrm{Jaccard} = \frac{n_{11}}{n_{11} + n_{10} + n_{01}} \]
Minkowski measure (sparse)
\[ \mathrm{Minkowski} = \sqrt{\frac{n_{10} + n_{01}}{n_{11} + n_{01}}} \]

Cluster Annotation
Clustering results in groups of genes that are coexpressed (or co-regulated)
We have partial knowledge of the function of many individual genes
  Gene Ontology, COG (Clusters of Orthologous Groups), PFAM (Protein Domain Families)
For each group, can we tell something about the biological phenomena that underlie our observation (their coexpression)?
Taking a statistical approach, we can assign function to each group of genes
  A function popular in a cluster is associated with that cluster
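The pairwise counts from the Reference Partitioning slide and the three indices from Comparing Partitions can be sketched as follows. The first index of each count refers to the clustering and the second to the reference, which is an assumption about the slide's convention. The label lists are made up, and each unordered gene pair is counted once.

```python
import math
from itertools import combinations

def pair_counts(clustering, reference):
    """Count gene pairs by agreement; inputs are per-gene cluster labels."""
    n11 = n10 = n01 = n00 = 0
    for i, j in combinations(range(len(clustering)), 2):
        same_c = clustering[i] == clustering[j]
        same_r = reference[i] == reference[j]
        if same_c and same_r:
            n11 += 1
        elif same_c:
            n10 += 1
        elif same_r:
            n01 += 1
        else:
            n00 += 1
    return n11, n10, n01, n00

def rand_index(n11, n10, n01, n00):
    return (n11 + n00) / (n11 + n00 + n10 + n01)

def jaccard(n11, n10, n01, n00):
    return n11 / (n11 + n10 + n01)

def minkowski_measure(n11, n10, n01, n00):
    # Lower is better; 0 means perfect agreement with the reference.
    return math.sqrt((n10 + n01) / (n11 + n01))

# Hypothetical labels for six genes.
clustering = [0, 0, 0, 1, 1, 1]
reference = [0, 0, 1, 1, 2, 2]

counts = pair_counts(clustering, reference)
rand = rand_index(*counts)
jac = jaccard(*counts)
mink = minkowski_measure(*counts)
```

The Rand index rewards pairs that are separated in both partitions (`n00`), which usually dominate when there are many clusters. The Jaccard coefficient and Minkowski measure ignore `n00`, which is why the slide labels them as suited to the sparse case.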
Gene Ontology
Ontology: Study of being (e.g., conceptualization)
Gene Ontology is an attempt to develop a standardized library of cellular function
Unified view of life: Processes, structures, and functions recur in diverse organisms
Three concepts of Gene Ontology
  Biological process: A recognized series of events or molecular functions (e.g., cell cycle, development, metabolism)
  Molecular function: What does a gene's product do? (e.g., binding, enzyme activity, receptor activity)
  Cellular component: Localization within the cell (e.g., membrane, nucleus, ubiquitin ligase complex)

Hierarchy in Gene Ontology
Gene Ontology is hierarchical
  A process might have subprocesses
    Seed maturation is part of seed development
  A process might be described at different levels of detail
    Seed dormation is a(n example of) seed maturation
  Same for function and component
Gene Ontology terms are related to each other via "is a" and "part of" relationships
  If process A is part of process B, then A is B's child (B is A's parent); B involves A
  If function C is a function D, then C is D's child; C is a more detailed specification of D

[figure]

GO Hierarchy is a DAG
Gene Ontology is hierarchical, but the hierarchy is not represented by a tree; it is represented by a directed acyclic graph (DAG)
  A GO term can have multiple parents (and obviously a GO term might (should?) have multiple children)
Annotation
GO-based annotation assigns GO terms to a gene
  A gene might have multiple functions and can be involved in multiple processes
  Multiple genes might be associated with the same function; multiple genes take part in a process
True-path rule
  If a gene is annotated with a term, then it is also annotated with that term's parents (consequently, all ancestors)
  How does the number of genes associated with each term change as we go down the GO DAG?

GO Annotation of Gene Clusters
There are |C| genes in a cluster C
|T| genes are associated with GO term t
|C ∩ T| genes are in C and are associated with t
What is the association between cluster C and term t?
  If we chose random clusters, would we be able to observe that at least this many (|C ∩ T|) of the |C| genes in C are associated with t?
  What is the probability of this observation?
Statistical significance based on the hypergeometric distribution

Hypergeometric Distribution
We have n items, m of which are good
If we choose r items from the entire set of items at random, what is the probability that at least k of them will be good?
\[ p = P[K \ge k] = \sum_{i=k}^{\min(m, r)} \frac{\binom{m}{i} \binom{n-m}{r-i}}{\binom{n}{r}} \]
n is the number of genes in the organism; m = |T|, r = |C|, k = |C ∩ T|
The lower p is, the more likely that there is an underlying association between the term and the cluster (the term is significantly enriched in the cluster)

GO Hierarchy & Cluster Annotation
How specific (general) is the annotation we attach to a cluster?
  If a cluster is larger, then it might correspond to a more general process
  Some processes might be over-represented in the study set
How do we find the best location of a cluster in the GO hierarchy?
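The enrichment p-value from the Hypergeometric Distribution slide can be evaluated exactly with integer binomial coefficients. This is a minimal sketch: the function name and the numbers in the example are hypothetical, and `math.comb` requires Python 3.8 or later.

```python
from math import comb

def enrichment_p_value(n, m, r, k):
    """P[K >= k] when r genes are drawn at random from n genes, m of them annotated.

    n: number of genes in the organism (or in the chosen gene space)
    m: |T|, genes annotated with the term
    r: |C|, genes in the cluster
    k: |C ∩ T|, annotated genes observed in the cluster
    """
    tail = sum(comb(m, i) * comb(n - m, r - i)
               for i in range(k, min(m, r) + 1))
    return tail / comb(n, r)

# Hypothetical numbers: 20 genes, 5 annotated, a cluster of 4 with 2 annotated.
p = enrichment_p_value(n=20, m=5, r=4, k=2)
# Requiring at least zero annotated genes is always satisfied.
p_trivial = enrichment_p_value(n=20, m=5, r=4, k=0)
```

Restricting the gene space, for instance to the genes annotated with a term's parents, amounts to passing a smaller `n` and recounting `m`, `r`, and `k` within that space.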
Parent-child annotation
  Condition the probability of enrichment of a term in a cluster on the enrichment of its parent terms in the cluster
  The gene space is defined as the set of genes that are associated with t's parents

Parent-Child Annotation
[figure]

Multiple Hypotheses Testing
The p-value for a single term provides an estimate of the probability of having at least the observed number of genes attached to that particular term by chance
We have many terms; even if the likelihood of enrichment is small for a particular term, it might be very probable that some term will be enriched as much as observed in the cluster
We have to account for all hypotheses being tested simultaneously
  Bonferroni correction: Apply the union bound, i.e., add up the p-values (in its standard form, multiply each p-value by the number of hypotheses tested)
  Which terms should we consider while correcting for multiple hypotheses for a single term?

Representativity of Terms
How well does a significantly enriched term represent a cluster?
  How many of the genes in the cluster are attached to the term?
  How many of the genes attached to the term are in the cluster?
For a term t that is significantly enriched in cluster C
  Specificity: |C ∩ T|/|C|, a.k.a. precision
  Sensitivity: |C ∩ T|/|T|, a.k.a. recall

Biclustering
A particular process might be active only in certain conditions
  A group of genes might be expressed (or up-regulated, suppressed, co-regulated, etc.) in only a subset of samples
  They might behave almost independently under other conditions

Clustering vs. Biclustering
Clustering is a global approach
  Each gene is a point in the space defined by all samples
  How about points that are clustered in a subspace?
Biclustering: While clustering genes, also choose a set of dimensions (samples) that provides the best clustering, and vice versa
  a.k.a. co-clustering, subspace clustering, ...
  This is a much harder problem, because you are not only trying to find groups of points that are close to each other in multidimensional space, but also trying to identify a subspace in which groups are more evident

Biclustering Applications
Sample/tissue classification for diagnosis
  The samples with leukemia show specific characteristics for a subset of genes
Identification of co-regulated genes
  Certain sets of genes exhibit coherent activations under specific conditions (while behaving more or less arbitrarily with respect to each other under other conditions)
Functional annotation
  Biological processes and functional classes are overlapping
  Different sets of samples reveal different functional relationships

Biclustering Principles
A cluster of genes is defined with respect to a cluster of samples and vice versa
The clusters are not necessarily exclusive or exhaustive
  A gene/condition may belong to more than one cluster
  A gene/condition may not belong to any cluster at all
Biclusters are not "perfect"
  Noise
  Statistical inference becomes particularly important

Biclustering Formulation
Given a gene expression matrix A with gene set G and sample set S, a bicluster is defined by a subset of genes I and a subset of samples J
General idea: A bicluster is a "good" one if $A_{IJ}$, the submatrix defined by I and J, has some coherence (low variance, low rank, similar ordering of rows, etc.)
The biclustering problem can be defined as one of finding a single bicluster in the entire gene expression matrix, or as one of extracting all biclusters (with some restriction on the relationship between biclusters)

Coherence of a Submatrix
[figure]

Distribution of Biclusters
[figure]

Bipartite Graph Model
Just like symmetric matrices, which can be modeled as arbitrary graphs, rectangular matrices can be modeled using bipartite graphs
With proper definition of edge weights, biclustering can be posed as the problem of finding "heavy" subgraphs

Row, Column, Matrix Means
[equations]

Objective Function
Low-variance (constant) bicluster
  Ideal bicluster: [equation]
  Minimize bicluster variance
Low-rank (constant row, constant column, coherent values) bicluster
  Ideal constant row: [equation]
  Ideal constant column: [equation]
  General rank-one bicluster: [equation]
  Define residue for each value: [equation]
  Minimize mean squared residue

Missing Values
Not all expression levels are available for each gene/sample pair
A solution is to replace missing values (random values, gene mean, sample mean, regression)
Generalize the definitions of row, column, and bicluster means to handle missing values implicitly
Occupancy threshold: A bicluster is one with an adequate number of (non-missing) values in each row and column
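The mean squared residue objective named on the Objective Function slide can be illustrated with a small computation. The residue used here, entry minus row mean minus column mean plus submatrix mean, is the form commonly used in the biclustering literature and is an assumption, since the slide's own formulas are not available in the text. The matrix `a` and the function name are made up.

```python
def mean_squared_residue(a, rows, cols):
    """Mean squared residue of the submatrix of a given by rows and cols.

    Assumed residue of entry (i, j):
        a_ij - (row mean of i) - (column mean of j) + (submatrix mean),
    with all means taken inside the submatrix.
    """
    sub = [[a[i][j] for j in cols] for i in rows]
    row_mean = [sum(r) / len(cols) for r in sub]
    col_mean = [sum(c) / len(rows) for c in zip(*sub)]
    all_mean = sum(row_mean) / len(rows)
    total = 0.0
    for x, r in enumerate(sub):
        for y, v in enumerate(r):
            total += (v - row_mean[x] - col_mean[y] + all_mean) ** 2
    return total / (len(rows) * len(cols))

# Hypothetical matrix: rows 0-2 are additive (row offset + column effect),
# while row 3 does not follow the same column pattern.
a = [[1.0, 2.0, 4.0],
     [3.0, 4.0, 6.0],
     [6.0, 7.0, 9.0],
     [5.0, 0.0, 2.0]]

msr_coherent = mean_squared_residue(a, [0, 1, 2], [0, 1, 2])
msr_with_odd_row = mean_squared_residue(a, [0, 1, 2, 3], [0, 1, 2])
```

A perfectly additive submatrix scores zero, and adding a row that breaks the pattern raises the score. This is the quantity that node deletion and node addition heuristics try to keep small.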
Overlapping Biclusters
The expression of a gene in one sample may be thought of as a superposition of contributions from multiple biclusters
Plaid model: [equation]
  One term gives the contribution of bicluster k to the expression value of the ith gene in the jth sample
  Two (generally binary) indicators specify the membership of row i and column j in the kth bicluster, respectively
  Minimize: [equation]
  The contribution term is defined to reflect the "bicluster type": [equations]

Discrete Coherence
A bicluster is defined to be one with coherent ordering of the values on rows and/or columns (as compared to the values themselves)
Order-preserving submatrix (OPSM)
  A submatrix is order preserving if there is an ordering of its columns such that the sequence of values in every row is increasing
Gene expression motifs (xMOTIFs)
  The expression level of a gene is conserved across a subset of conditions if the gene is in the same "state" in each of the conditions
  An xMOTIF is a subset of genes that are simultaneously conserved across a subset of samples

Binary Biclusters
Quantize the gene expression matrix to binary values
  SAMBA: A 1 corresponds to a significant change in the expression value
  PROXIMUS: A 1 means that the gene is "expressed" in the corresponding sample
A bicluster is a "dense submatrix", i.e., one with significantly more 1's than one would expect
  Bipartite graph model: Bicliques, heavy subgraphs
It is possible to statistically quantify the density of a submatrix
  Log-likelihood: [equation]
  p-value: [equation]

Biclustering Algorithms
Enumeration
  Go for it!
Greedy algorithms
  Make a locally optimal choice at every step
Divide and conquer
  Solve the problem recursively
Alternating iterative heuristics
  Fix one dimension, solve for the other, alternate iteratively
Model based
  Parameter estimation, e.g., EM algorithm

Enumerating Biclusters
m rows, n columns in the matrix
  $2^m \times 2^n$ possible biclusters in total
  Not doable in realistic amounts of time
Is it really necessary?
  Put some restriction on the size of biclusters
SAMBA models the problem as one of finding heavy subgraphs in a bipartite graph
  Key assumption is sparsity: Nodes of the bipartite graph have bounded degree
  Find K heavy bipartite subgraphs (biclusters) with bounded degree by enumeration
  Refine them to optimize overlap and add/remove nodes that improve bicluster quality

Greedy Algorithms
Basic idea: Refine existing biclusters by adding/removing genes/samples to improve the objective function
  Generally, quite fast
  How to choose initial biclusters?
  How to jump over bad local optima? (Global awareness, hill climbing)
Optimization function: mean squared residue
  Node deletion: Start with a large bicluster, keep removing the genes/samples that contribute most to the total residue
  Node addition: Start with a small bicluster, keep adding the genes/samples that contribute least to the total residue
  Alternate between these to improve global awareness

Finding All Biclusters
If biclusters are identified one by one, we should make sure that we do not identify the same bicluster again and again
  Masking discovered biclusters: Fill the bicluster with random values
  First identify disjoint biclusters, then grow them to capture overlaps
Flexible Overlapped Biclustering (FLOC)
  Generate K initial biclusters
  Make decisions from the gene/sample perspective (as compared to the bicluster perspective): Choose the best (maximum gain) action for each gene
Generalizing K-Means to Biclustering
Assume K gene clusters, L sample clusters
  R: m x K gene clustering matrix, C: n x L sample clustering matrix
  R(i,k) = 1 if gene i belongs to cluster k (actually, columns are normalized to have unit norm)
Notice that this is a little counter-intuitive: we do not have well-defined biclusters; rather, we have clusters of genes and clusters of samples, and each pair of gene and sample clusters defines a bicluster
Minimize total residue: [equation]

KL-Means Algorithm
We can show that [equation]
Batch iteration
  Given R, compute an m x L matrix that serves as a prototype for column clusters
  For each column, find the column of the prototype matrix that is closest to that column, and update the corresponding entry of C accordingly
  Once C is fixed, repeat the same for rows to compute R from the corresponding row prototype matrix
Converges to a local minimum of the objective function

OPSM Algorithm
Recall that an order-preserving submatrix (OPSM) is one such that all rows have their entries in the same order
Growing partial models
  Fix the extremes first
  The idea: Columns with very high or low values are more informative for identifying rows that support the assumed linear order
  Start with all (1,1) partial models, i.e., only consider the preservation of the first and last elements, and keep the best ones
  Expand these to obtain (2,1) models, then (2,2), and so on, until we have (s/2, s/2) models, s being the number of columns in the target bicluster
Divide and Conquer Algorithms
Block clustering (a.k.a. direct clustering)
  Recursive bipartitioning
  Sort rows according to their mean, choose a row such that the total variance above and below the row is minimized
  Do the same for columns
  Pick the row or column that results in minimum intra-cluster variance, split the matrix into two based on that row or column
  Continue splitting recursively
One problem is that once two rows/columns go to different biclusters, they can never come together
  Gap statistics: Find a large number of biclusters, then recombine

Binormalization
Normalize the matrix on both dimensions
Independent scaling of rows and columns: [equation]
  Here, R and C are diagonal matrices that contain row and column means, respectively
Bistochastization
  Goal: Rows will add up to a constant (or will have constant norm), columns will add up to a separate constant
  Repeat independent scaling of rows and columns until stability is reached
The residual of the entire matrix is also normalized in the sense that both rows and columns have zero mean

Spectral Biclustering
Singular value decomposition
  The nonzero eigenvalues of the matrices $A^T A$ and $A A^T$ (say, $\sigma^2$) are the same
  Each σ is called a singular value of A, and the corresponding eigenvectors of $A A^T$ and $A^T A$ are called the left and right singular vectors
  If $\sigma_1$ is the largest singular value of A, with $A^T A v_1 = \sigma_1^2 v_1$ and $A A^T u_1 = \sigma_1^2 u_1$, then $\sigma_1 u_1 v_1^T$ is the best rank-one approximation to A, i.e., $\|A - \sigma u v^T\|_2$ is minimized by $\sigma_1$, $u_1$, and $v_1$ (over all pairs of unit-norm vectors u and v)
Consequently, the entries of u and v are ordered in such a way that similar rows have similar values on u and similar columns have similar values on v
  Split the matrix based on u and v
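The rank-one idea behind the Spectral Biclustering slide can be sketched without a linear algebra library. Power iteration is used here as a stand-in for a full singular value decomposition, and splitting rows and columns at the mean entry of u and v is one simple choice. Both are assumptions for illustration, as are the function name and the two toy matrices.

```python
import math

def top_singular_triplet(a, iters=200):
    """Estimate the largest singular value and its singular vectors.

    Power iteration on A^T A, starting from a uniform positive vector.
    Assumes the start vector is not orthogonal to the top right singular vector.
    """
    m, n = len(a), len(a[0])
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        u = [sum(a[i][j] * v[j] for j in range(n)) for i in range(m)]  # u = A v
        norm_u = math.sqrt(sum(x * x for x in u))
        u = [x / norm_u for x in u]
        v = [sum(a[i][j] * u[i] for i in range(m)) for j in range(n)]  # v = A^T u
        sigma = math.sqrt(sum(x * x for x in v))
        v = [x / sigma for x in v]
    return sigma, u, v

# Exact rank-one matrix: the outer product of [1, 2, 3] and [4, 5].
rank_one = [[4.0, 5.0], [8.0, 10.0], [12.0, 15.0]]
sigma_r1, _, _ = top_singular_triplet(rank_one)

# Hypothetical expression matrix with a high block in rows 0-1, columns 0-1.
a = [[5.0, 5.0, 1.0, 1.0],
     [5.0, 5.0, 1.0, 1.0],
     [1.0, 1.0, 1.0, 1.0],
     [1.0, 1.0, 1.0, 1.0]]
sigma, u, v = top_singular_triplet(a)

# Split rows and columns at the mean entry of u and v, respectively.
high_rows = [i for i, x in enumerate(u) if x > sum(u) / len(u)]
high_cols = [j for j, x in enumerate(v) if x > sum(v) / len(v)]
```

For the rank-one matrix the estimated singular value equals the product of the norms of the two generating vectors. For the block matrix, rows and columns inside the block receive equal and larger entries in u and v, so thresholding recovers the block.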