Download ge04_bic2

Analysis of Gene Expression and Gene Networks Biclustering 2 On this lecture • Two current biclustering methodologies • Iterative Signature Algorithm (ISA) – Simple – Randomized • SAMBA – Combinatorial Roots – Fast • And maybe a little more What makes a biclustering algorithm? • Score/Define what is a bicluster • Algorithm for finding one bicluster in the data • Algorithm for finding all (many) biclusters in the data • Important themes: – Normalization – Redundencies Previously in GE: • What is a bicluster: – Cheng church – CTWC • How to search for a bicluster – Cheng church – CTWC • Normalization • Redundancies The Iterative Signature Algorithm • Developed at Naama Barkai’s Lab at WIS (I. Ihmels, S. Bergman) • Motivation: – A bicluster is a “stable” set of genes and conditions – It is possible to refine approximate set of genes by “stabalizing” them Normalization: ISA • Can we normalize for both gene and condition dependent trends? • In the ISA we are not trying to.. • Given a gene expression matrix E one conditions U and genes V form: – EC : normalize each column to 0 mean, 1 std – EG : normalize each gene to 0 mean, 1 std What is a bicluster: ISA • Observe: assume all columns are independent, what is the distribution of (j in U’) eGij for a random condition set U’ and gene i? • Mean = 0, Std=sqrt(|U’|) • Same for (i in V’) eGij and gene set V’. • In a bicluster, we like independence not to hold. What is a bicluster: ISA • Given a set of genes U’ define: – ISA(U’) = {v in V s.t. (j in U’) eGvj > TGσU’} • Given a set of genes V’ define: – ISA(V’) = {u in U s.t. (j in V’) eCiu > TCσV’} • TG ,TC – threshold parameters, σU’ ,σV’ standard deviations • A (perfect) bicluster is a pair (U’,V’) s.t. ISA(V’) = U’ ISA(U’) = V’ Searching for biclusters: ISA • ISA – defining a directed graph on the set of condition and genes subsets. • A bicluster is a cycle of two nodes U’ • An approximated bicluster is a larger cycle but not too large. • The algorithm: start from a random or known gene set, compute ISA until converging to an approximated bicluster: – Ui = ISA(Vi) , Vi = ISA(Ui-1) – Converge at i when for all j > i-m, |Ui-Uj|/|Ui+Uj| < 1-ε Redundancies: ISA • Starting from different seeds yield different fixed points (Biclusters) • Using different threshold changes the graph structure and give more fixed points. • Need to filter similar solutions and report a short list of significant biclusters ISA - applications • Starting from genes with a known functional annotation and refine them to a bicluster • Starting from genes with known transcription factor binding sites • Starting from a set of sequence orthologs • See: Ihmels et al. Nat Gen 2002, Bergman et al. Phy Rev Letter 2003, Bergman et al. PLoS 2004. ISA – Pros/Cons • Pros – Simple, Quite fast – Elegant solution to the normalization problem – Good empirical results in several cases • Cons – – – – Thresholds setting Finding good seeds Redundencies Non normal behaviors • Assignment 3 will give you more insights SAMBA • Developed here • Motivation: – Harvest efficient combinatorial techniques for biclustering large datasets. – Couple a statistical model to the biclusters – Allow integration of heterogeneous data The SAMBA model edge conditions no edge Goal : Find high similarity submatrices Goal : Find dense G=(U,V,E) subgraphs The SAMBA approach • Normalization: translate GE matrix to a weighted bipartite graph using a statistical model for the data • Bicluster model: Heavy subgraphs • How to find biclusters: Combined hashing and local optimization • Redundancies: Find many biclusters at once, filter them in post process From a statistical model to edge weights – a simple example • Background model: Independent edges, each present with prob. p. • H – subgraph of n genes, m conds, k edges • P-value = tail of binomial distribution:  nm  k '  p (1  p) nm k '  2 nm p k (1  p) nm k p( H )    k ' k  k '  • Weight the graph – edges: (-1-log p) – non-edges: (-1-log(1-p)). then subgraph weight  log p-value. Limitations of the uniform probability model • Not all dense subgraphs are statistically significant. • Different genes/conds have typical noise characteristics. • Noisy genes/conds have high probability of forming dense subgraphs. • An extended likelihood ratio model: Bicluster Random Subgraph Model Background Random Graph Model = Likelihood model translates to sum of weights over edges and non edges A Degree Based Random Graph Model low-prob edges medium-prob edges high-prob edges • An edge between (u,v) occurs independently with prob p(u,v). • p(u,v) depends on both u and v degrees • P(u,v) = Pr((u,v) in E’ | all G=(U,V,E’) such that deg(w, E’)=deg(w,E) for all w in U,V) • Approximated using a hyper-geometric calculation Model Likelihood Ratio • Model assumption - bicluster edges occur independently with prob pc • Likelihood ratio score: L( B )   ( u ,v )E ' pc 1  pc  p (u , v) (u ,v )E ' 1  p (u, v) pc 1  pc log L( B)   log   log p(u, v) (u ,v )E ' 1  p (u , v) ( u ,v )E ' Subgraph weight = log likelihood ratio Heaviest bipartite subgraph • NPC (Dawande et al. 97, Hochbaum 98) • (Recall: node blicque is polynomial!) • Assumption: degree on V side bounded by d: • Start by finding heavy bicliques. • Alg: use hashing to discover heavy subsets of conds. Takes O(n2d) time and space. Finding Heaviest Biclique 4 3 6 2 4 42 2 4 3 2 2 2 4 Using bicliques to find the heaviest biclusters Assume edge weight = 1, non-edge weight = -1 Note that: w((U ',V '))   w(((u '),V ') uU ' Lemma: If B=(U’,V’) is maximal and XU’ then v s.t. |N(v)X|>=|X|/2. Pf: 0  w(( X ,V '))   | N (v)  2 | N (v ) X | | N (v) vV ' X | | X | vV ' Corrolary: If B=(U’,V’) is maximal then |U’|<= 2d X | Using bicliques to find the heaviest biclusters A set of conditions in a maximal bicluster is the union of up to log(2D) subsets of gene neighborhoods. U’ u’’ u’’’ … • Exhaustive O((n2D)log(2D)) time alg: •Hash bicliques •enumerate all log(2D) size N(v) combinations. • Can be generalized to handle arbitrary edge/nonedge weights. SAMBA’s implementation • Phase I: find heavy bicliques - hash for each gene of deg<d all subsets of neighbors of size 4-6. • Phase II: greedy expansion of heaviest bicliques containing each gene/cond • Phase III: filter overlapping biclusters. Heterogeneous information sources Transcription Level mRNA profiling ChIP Chip Protein Level 2-Hybrid Protein Complexes Identification using Mass Spec and so many more… Phenotype Level Barcoded deletion libraries 1 + 1 = 0 Synthetic lethality From experiments to properties p1 p2 p3 p2 Strong complex binding to protein P p1 Medium complex binding to Protein P p4 Strong Medium Medium Strong Induction Induction Repression Repression p1 p2 Strong Medium Binding to Binding to TF T TF T p1 gene g p2 High Medium Sensitivity Sensitivity p1 p2 High Confidence Medium Confidence Interaction Interaction A Heterogeneous Collection of Yeast Genomic Information • Gene expression: ~1000 conditions, 27 publications • TF binding profiles: 110 profiles from growth on YPD (Lee et al.) • Phenotype profiles: 6 (30) profiles (Giaever et al.) • Two hybrid interactions: ~1000 (Uetz et al.) • Protein Complex interaction: ~4000 (Ho et al.) • MIPS interactions: ~1000 A SAMBA module Genes Properties CPA1 GO annotations CPA2 log likelihood Statistical Model Provides High Specificity + Lymphoma data (Alizadeh et.al) x Shuffled Data log p-value Global View of modular organization in yeast Inferring functional annotations • Using SAMBA results for annotating uncharacterized yeast genes • Performing “guilt by association” • Same procedure for properties (which reflects poorly characterized conditions) Mating Genes Putative Mating Over X% Uncharacterized Predictions are highly specific 5 mating predictions were tested experimentally 4 mutants failed to mate SAMBA as a universal language for functional genomics databases User SAMBA Query Gene expression TF location Proteomics Phenotypes ….. Updated Relevant Modules SAMBA – Pros/Cons • Pros – Fast – Allow simultaneous normalization of genes and conditions – Allow integration of hetergenous data – Well suited for query based usage • Cons – Discretization Two words on: Probabilistic Models for Biclsutering • Bicluster model: each subcolumn have a typical normal distribution ,different from the background • Model the entire matrix: tile the matrix by biclusters • Model score: likelihood based • Avoid overfitting by standard techinuqes Two words on: Probabilistic Models for Biclsutering • How to find the biclusters: Start by clustering and refine them using an EM algorithm: – Given a clustering calculate the model parameters (distirubtions per bicluster) – Given the distributions, reassign the biclusters Biclustering - Summary • A general data mining problem • The key point: defining what is a bicluster • Algorithms vary depending on the nature of bicluster model • The future problem: search for biclusters in a really huge matrices.

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Top subcategories

Download ge04_bic2