Download JointCluster

Presented by: Tal Saiag
Seminar in Algorithmic Challenges in Analyzing Big Data* in Biology and Medicine; with Prof. Ron Shamir @TAU

• Basic Terminology
• Introduction
• JointCluster: A simultaneous clustering algorithm
• Results
• Discussion
• Conclusion

• Cell – building block of life
• Contains nucleus
• Chromosome – genetic material
• Forms the genome – DNA
• Gene – stretch of DNA
• Proteins – multi-functional workers

• Gene expression: from gene to protein
• Transcription
• RNA
• Translation
• Transcription factor

• DNA microarray / chips
• Measures expression of genes
• Condition specific

• Genome-wide datasets provide different views of the biology of a cell.
• Physical interactions (protein-protein) and regulatory interactions (protein-DNA) maintain and regulate the cell's processes.
• Expression of molecules (proteins or transcripts of genes) provides a snapshot of the cell's state.
• Researchers have exploited the complementarity of both.

• Integrating the physical and expression datasets.
• Developing an efficient solution for combined analysis of multiple networks.
• Goal: find common clusters of genes supported by all of the networks of interest.
• Computationally intractable for large networks.
• Theoretical guarantees (reasonably approximates the optimal clustering).

• A cut refers to a partition of nodes in a graph into two sets.
• A cut is called sparse-enough in a graph if the ratio of edges crossing the cut in the graph to the edges incident at the smaller side of the cut is smaller than a threshold specific to the graph.
• Inter-cluster edges: edges with endpoints in different clusters.
• Connectedness of a cluster in a graph: the cost of a set of edges is the ratio of their weight to the total edge weight in the graph.

• Approximate the sparsest cut in each input graph using a spectral method.
• Among these cuts, choose any one that is sparse-enough in the graph that yielded it.
• Recurse on the two node sets of the chosen cut.
• Stop when well-connected node sets with no sparse-enough cuts are obtained.

• Graph $G = (V, a)$
• $a(u, v) \ge 0$ for any node pair $(u, v) \in V \times V$
• Total weight of any edge set $Y$: $a(Y) = \sum_{(u, v) \in Y} a(u, v)$
• For any node sets $S, T \subseteq V$: $a(S, T) = \sum_{u \in S,\, v \in T} a(u, v)$
• Define $a(S) = a(S, V)$
• Total edge weight in the graph: $a(V)/2$
• Singletons: $a(u) = a(\{u\})$

• The conductance of a cut $(S, T = C \setminus S)$ in a node set $C$: $\frac{a(S, T)}{\min(a(S), a(T))}$
• The inter-cluster edges $X$: pairs $(u, v)$ where $u$ and $v$ belong to different clusters.
• An $(\alpha, \varepsilon)$ clustering of $G$ is a partition of its nodes into clusters such that:
  • The conductance of the clustering is at least $\alpha$, and
  • The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in the graph; i.e., $a(X) \le \frac{\varepsilon}{2} a(V)$.

• Since finding the sparsest cut in a graph is an NP-hard problem, an approximation algorithm for the problem is used.
• Efficient spectral techniques.
• Spectral Algorithm:
  • Find the top $k$ right singular vectors $v_1, v_2, \dots, v_k$ (using SVD).
  • Let $C$ be the matrix whose $j$th column is given by $A v_j$.
  • Place row $i$ in cluster $j$ if $C_{ij}$ is the largest entry in the $i$th row of $C$.

• Consider $p$ graphs $\{G_i = (V, a_i)\}_{i=1}^{p}$
• An $(\alpha_i, \varepsilon)$ simultaneous clustering:
  • The conductance of the clustering is at least $\alpha_i$ in graph $G_i$ for all $i$, and
  • The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the total edge weight in all graphs; i.e., $\sum_i a_i(X) \le \frac{\varepsilon}{2} \sum_i a_i(V)$.
• Inter-cluster edge cost: $\frac{2 \sum_i a_i(X)}{\sum_i a_i(V)}$.
• A cut in $G_i$ is sparse-enough if the conductance of the cut is at most $\alpha_i^*$.

• Mixture graph $H_i^k$ for graph $G_i$ at scale $k$ has the weight function: $b_i^k(u, v) = a_i(u, v) + 2^{-k} \sum_{j \ne i} a_j(u, v)$
• The heuristic finds sparsest cuts in mixture graphs.
• The heuristic starts with the sum graph to control edges lost in all graphs, and transitions through a series of mixture graphs that approach the individual graphs to refine the clusters.
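
The conductance and mixture-graph definitions above can be illustrated with a minimal Python sketch. It assumes each graph is stored as a dense symmetric NumPy weight matrix. The function names (`weight`, `conductance`, `mixture_weights`) and the toy graphs are hypothetical, chosen for illustration only; this is not the JointCluster C++ implementation.

```python
import numpy as np

def weight(a, S, T):
    """a(S, T): total weight a[u, v] over u in S and v in T."""
    return a[np.ix_(sorted(S), sorted(T))].sum()

def conductance(a, S, C):
    """Conductance of the cut (S, C \\ S) in node set C, using a(S) = a(S, V)."""
    S = set(S)
    T = set(C) - S
    V = range(a.shape[0])
    return weight(a, S, T) / min(weight(a, S, V), weight(a, T, V))

def mixture_weights(graphs, i, k):
    """b_i^k = a_i + 2^{-k} * sum_{j != i} a_j (k = 0 gives the sum graph)."""
    others = sum(g for j, g in enumerate(graphs) if j != i)
    return graphs[i] + 2.0 ** (-k) * others

# Toy graph (assumption): two unit-weight triangles joined by the edge 2-3.
a1 = np.zeros((6, 6))
for u, v in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]:
    a1[u, v] = a1[v, u] = 1.0

# Second toy graph (assumption) on the same nodes: a single edge 0-5.
a2 = np.zeros((6, 6))
a2[0, 5] = a2[5, 0] = 1.0

phi = conductance(a1, {0, 1, 2}, range(6))     # cut separating the two triangles
sum_graph = mixture_weights([a1, a2], i=0, k=0)
finer = mixture_weights([a1, a2], i=0, k=3)    # closer to a1 alone
```

As `k` grows, the contribution of the other graphs shrinks by a factor of two per scale, so `mixture_weights` moves from the sum graph toward the individual graph, matching the transition described in the heuristic.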
• Combine with a cut selection heuristic.
• Choose the cut that is sparse-enough in the largest number of input graphs.

• The modularity score of a cluster in a graph is the fraction of edges contained within the cluster minus the fraction expected by chance.
• The partition of $V$ that respects the clustering tree and optimizes the min-modularity score can be found by dynamic programming, where $D_l$ and $D_r$ are the children of node $D$ in the clustering tree:
  $\mathrm{OPT}(D) = \operatorname{argmax} \{ \text{min-modularity}(D),\ \text{min-modularity}(\mathrm{OPT}(D_l) \cup \mathrm{OPT}(D_r)) \}$
• Clusters are ordered by their min-modularity scores.

• We desire an unsupervised method for learning the conductance threshold $\alpha_i^*$ for each network of interest $G_i$.
• Algorithm for each graph $G_i$:
  • Cluster only $G_i$ using JointCluster; without loss of generality, set $\alpha_i^*$ to its maximum possible value, 1.
  • Set $\alpha_i^*$ to the minimum conductance threshold that would result in the same set of clusters $C_t$.
• Goal: automatically choose a threshold that is sufficiently low and sufficiently high.

• Alternative algorithms:
  • $G_i$ Tree: Choose one of the input graphs $G_i$ as a reference, cluster this single graph using an efficient spectral clustering method to obtain a clustering tree, and parse this tree into clusters using the min-modularity score computed from all graphs.
  • Coassociation: Cluster each graph separately using a spectral method, combine the resulting clusters from different graphs into a coassociation graph, and cluster this graph using the same spectral method.
• $k_{out}$ parameter

• A pair of elements is intra-cluster if both elements belong to a single cluster.
• $\text{Jaccard Index} = \frac{\#\,\text{element pairs that are intra-cluster w.r.t. both clusterings}}{\#\,\text{element pairs that are intra-cluster w.r.t. either of the clusterings}}$

• Two yeast strains grown under two conditions where glucose or ethanol was the predominant carbon source.
• Coexpression networks using all 4,482 profiled genes as nodes.
• The weight of an edge is the absolute value of Pearson's correlation coefficient between the expression profiles of the two genes.
• Physical gene-protein interactions (from various interaction databases): a total of 41,660 non-redundant interactions.

• GO Process: Genes in each reference set in this class are annotated to the same GO Biological Process term.
• TF (Transcription Factor) Perturbations: Genes in each set have altered expression when a TF is deleted or overexpressed.
• Compendium of Perturbations: Genes in each set have altered expression under deletions of specific genes, or chemical perturbations.
• TF Binding Sites: Genes in a set have binding sites of the same TF in their upstream genomic regions, with sites predicted using ChIP binding data.
• eQTL Hotspots: Certain genomic regions exhibit a significant excess of linkages of expression traits to genotypic variations.

• A pair of elements is intra-cluster if both elements belong to a single cluster.
• $\text{Jaccard Index} = \frac{\#\,\text{element pairs that are intra-cluster w.r.t. both clusterings}}{\#\,\text{element pairs that are intra-cluster w.r.t. either of the clusterings}}$
• Sensitivity: the fraction of reference sets that are enriched for genes belonging to some cluster output by the method. [coverage]
• Specificity: the fraction of clusters that are enriched for genes belonging to some reference set. [accuracy]

• Comparing JointCluster against methods that integrate only a single coexpression network with a physical network.
• Combined <glucose+ethanol> coexpression network and the physical network.
• Comparing on fair terms for all algorithms:
  • Setting the minimum cluster size parameter in Matisse to 10.
  • Size limit of 100 genes for JointCluster.
  • Co-clustering didn't have a parameter to directly limit cluster size.

• Heterogeneous large-scale datasets are accumulating at a rapid pace.
• Efforts to integrate them are intensifying.
• JointCluster provides a versatile approach to integrating any number of heterogeneous datasets.
• Natural progression from clustering of single to multiple datasets.
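
The Jaccard index used to compare clusterings in the evaluation slides can be sketched in a few lines of Python. This is only an illustration of the pair-counting definition. The function names and the toy clusterings are hypothetical, and returning 1.0 when neither clustering has any intra-cluster pair is an assumption not stated in the slides.

```python
from itertools import combinations

def intra_cluster_pairs(clustering):
    """Set of unordered element pairs that fall inside a single cluster."""
    pairs = set()
    for cluster in clustering:
        for u, v in combinations(sorted(cluster), 2):
            pairs.add((u, v))
    return pairs

def jaccard_index(clustering_a, clustering_b):
    """Pairs intra-cluster in both clusterings / pairs intra-cluster in either."""
    pairs_a = intra_cluster_pairs(clustering_a)
    pairs_b = intra_cluster_pairs(clustering_b)
    either = pairs_a | pairs_b
    if not either:
        # Assumption: two clusterings with no intra-cluster pairs agree trivially.
        return 1.0
    return len(pairs_a & pairs_b) / len(either)

# Toy clusterings (assumption) over the elements 1..5.
clustering_a = [{1, 2, 3}, {4, 5}]
clustering_b = [{1, 2}, {3, 4, 5}]
score = jaccard_index(clustering_a, clustering_b)
```

Enumerating all pairs is quadratic in cluster size, which is fine for a sketch; for thousands of genes one would typically count pairs from a contingency table instead.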
• Testing the JointCluster algorithm on simulated datasets.
• Testing JointCluster on yeast empirical datasets.
• More flexible than two-network clustering methods.
• Consistent with known biology, and extends our knowledge.
• JointCluster can handle multiple heterogeneous networks.
• Enables better coverage of genes, especially when knowledge of physical interactions is less complete.
• Unsupervised and exploratory approach to data integration.

• The challenge: integrating multiple datasets in order to study different aspects of biological systems.
• Proposed simultaneous clustering of multiple networks.
• Efficient solution that permits certain theoretical guarantees
• Effective scaling heuristic
• Flexibility to handle multiple heterogeneous networks
• Results of JointCluster:
  • More robust, and can handle high false positive rates.
  • More consistently enriched for various reference classes.
  • Yield better coverage.
  • Agree with known biology of yeast.

Bibliography:
• Narayanan M, Vetta A, Schadt EE, Zhu J (2010), Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Computational Biology.
  • Supplementary Text for “Simultaneous clustering of multiple gene expression and physical interaction datasets”.
  • CPP source code
• Kannan R, Vempala S, Vetta A (2000), On clusterings - good, bad and spectral. Proceedings Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp 367–377.
• Shi J, Malik J (2000), Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22: 888–905.
• Andersen R, Lang KJ (2008), An algorithm for improving graph partitions. Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). pp 651–660.
