Presented by: Tal Saiag
Seminar in Algorithmic Challenges in Analyzing Big Data in Biology and Medicine, with Prof. Ron Shamir @TAU
• Basic Terminology
• Introduction
• JointCluster: A simultaneous clustering algorithm
• Results
• Discussion
• Conclusion
• Cell – building block of life
• Contains nucleus
• Chromosome – genetic material
• Forms the genome – DNA
• Gene – stretch of DNA
• Proteins – multifunctional workers
• Gene expression: from gene to protein
• Transcription
• RNA
• Translation
• Transcription factor
• DNA microarray / chips
• Measures expression of genes
• Condition specific
• Genome-wide datasets provide different views of the biology
of a cell.
• Physical interactions (protein-protein) and regulatory
interactions (protein-DNA) maintain and regulate the cell's
processes.
• Expression of molecules (proteins or transcripts of genes)
provides a snapshot of the cell’s state.
• Researchers have exploited the complementarity of the two data types.
• Integrating the physical and expression datasets.
• Developing an efficient solution for combined analysis of multiple
networks.
• Goal: find common clusters of genes supported by all of the
networks of interest.
• Computationally intractable for large networks.
• Theoretical guarantees (reasonably approximates the optimal
clustering).
• A cut refers to a partition of nodes in a graph into two sets.
• A cut is called sparse-enough in a graph if the ratio of edges
crossing the cut in the graph to the edges incident at the smaller
side of the cut is smaller than a threshold specific to the graph.
• Inter-cluster edges: edges with endpoints in different clusters.
• Connectedness of a cluster in a graph: the cost of a set of
edges is the ratio of their weight to the total edge weight in the
graph.
• Approximate the sparsest cut in each input graph using a
spectral method.
• Choose among them any cut that is sparse-enough in the
corresponding graph yielding the cut.
• Recurse on the two node sets of the chosen cut until well-connected
node sets with no sparse-enough cuts are obtained.
• Graph $G = (V, a)$
• $a(u, v) \ge 0$ for any node pair $(u, v) \in V \times V$
• Total weight of any edge set $Y$: $a(Y) = \sum_{(u,v) \in Y} a(u, v)$
• For any node sets $S, T \subseteq V$: $a(S, T) = \sum_{u \in S,\, v \in T} a(u, v)$
• Define $a(S) = a(S, V)$
• Total edge weight in the graph: $a(V)/2$
• Singletons: $a(u) = a(\{u\})$
• The conductance of a cut $(S, T = C \setminus S)$ in a node set $C$:
$\frac{a(S, T)}{\min(a(S), a(T))}$
• The inter-cluster edges $X$: pairs $(u, v)$ where $u$ and $v$ belong to
different clusters.
• An $(\alpha, \varepsilon)$ clustering of $G$ is a partition of its nodes into clusters
such that:
• The conductance of the clustering is at least $\alpha$, and
• The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the
total edge weight in the graph; i.e., $a(X) \le \frac{\varepsilon}{2}\, a(V)$.
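The quantities on this slide can be computed directly on a small example. The following is a minimal sketch, assuming the graph is stored as a dictionary of symmetric pair weights; the helper names and the toy graph are hypothetical and not part of JointCluster. It evaluates the conductance of one cut and the inter-cluster edge cost, and does not search over all cuts, which is what the full conductance of a clustering would require.

```python
from itertools import product

def symmetric_weights(edges):
    """Store a(u, v) for both orientations of each undirected weighted edge."""
    a = {}
    for u, v, w in edges:
        a[(u, v)] = a[(v, u)] = w
    return a

def a_between(a, S, T):
    """a(S, T): total weight over pairs with u in S and v in T."""
    return sum(a.get((u, v), 0.0) for u, v in product(S, T))

def cut_conductance(a, S, T, V):
    """a(S, T) / min(a(S), a(T)), using a(S) = a(S, V)."""
    return a_between(a, S, T) / min(a_between(a, S, V), a_between(a, T, V))

def intercluster_cost(a, clusters, V):
    """2 a(X) / a(V): share of the total edge weight that crosses clusters."""
    label = {u: i for i, cluster in enumerate(clusters) for u in cluster}
    # Every inter-cluster edge is stored in both orientations, so halve the sum.
    a_X = sum(w for (u, v), w in a.items() if label[u] != label[v]) / 2
    return 2 * a_X / a_between(a, V, V)

# Toy graph (hypothetical): two unit-weight triangles joined by one weak edge.
edges = [(0, 1, 1.0), (1, 2, 1.0), (0, 2, 1.0),
         (3, 4, 1.0), (4, 5, 1.0), (3, 5, 1.0), (2, 3, 0.5)]
a = symmetric_weights(edges)
V = set(range(6))
S, T = {0, 1, 2}, {3, 4, 5}

phi = cut_conductance(a, S, T, V)        # 0.5 / 6.5
cost = intercluster_cost(a, [S, T], V)   # 2 * 0.5 / 13
epsilon = 0.1                            # hypothetical tolerance
within_epsilon = cost <= epsilon         # same test as a(X) <= (epsilon / 2) * a(V)
```

Here the cut between the two triangles has low conductance, and the clustering it induces loses under a tenth of the total edge weight.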
• Since finding the sparsest cut in a graph is an NP-hard problem,
an approximation algorithm for the problem is used.
• Efficient spectral techniques.
• Spectral Algorithm:
• Find the top 𝑘 right singular vectors 𝑣1 , 𝑣2 , … , 𝑣𝑘 (using SVD).
• Let 𝐶 be the matrix whose 𝑗th column is given by 𝐴𝑣𝑗 .
• Place row 𝑖 in cluster 𝑗 if 𝐶𝑖𝑗 is the largest entry in the 𝑖th row of 𝐶.
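The three steps of the spectral algorithm can be sketched with NumPy. This is an illustration under stated assumptions, not the implementation used by JointCluster: the function name and the toy matrix are hypothetical, and the sign normalization of the singular vectors is an added assumption (singular vectors are only defined up to sign, which would otherwise make the "largest entry" rule unstable).

```python
import numpy as np

def spectral_clusters(A, k):
    """Assign each row of A to one of k clusters using the top-k right singular vectors."""
    A = np.asarray(A, dtype=float)
    # Rows of Vt are right singular vectors, ordered by decreasing singular value.
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    V_k = Vt[:k].T                         # columns v_1, ..., v_k
    # Assumption: flip each vector so that its entries sum to a non-negative value.
    signs = np.sign(V_k.sum(axis=0))
    signs[signs == 0] = 1.0
    C = A @ (V_k * signs)                  # j-th column of C is A v_j
    return C.argmax(axis=1)                # row i goes to the column with the largest entry

# Toy matrix (hypothetical): two disconnected blocks with different weights.
A = np.zeros((5, 5))
A[:3, :3] = 2.0
A[3:, 3:] = 1.0
labels = spectral_clusters(A, k=2)
```

On this block-diagonal input the first three rows land in one cluster and the last two in the other.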
• Consider $p$ graphs $\{G_i = (V, a_i)\}_{i=1}^{p}$
• An $(\{\alpha_i\}, \varepsilon)$ simultaneous clustering:
• The conductance of the clustering is at least $\alpha_i$ in graph $G_i$ for all $i$, and
• The total weight of the inter-cluster edges $X$ is at most an $\varepsilon$ fraction of the
total edge weight in all graphs; i.e., $\sum_i a_i(X) \le \frac{\varepsilon}{2} \sum_i a_i(V)$.
• Inter-cluster edge cost: $\frac{2 \sum_i a_i(X)}{\sum_i a_i(V)}$.
• A cut in $G_i$ is sparse enough if the conductance of the cut is at
most $\alpha_i^*$.
• Mixture graph $H_i^k$ for graph $G_i$ at scale $k$ has a weight
function:
$b_i^k(u, v) = a_i(u, v) + 2^{-k} \sum_{j \ne i} a_j(u, v)$
• The heuristic finds sparsest cuts in mixture graphs.
• The heuristic starts with the sum graph to control edges lost in all
graphs, and transitions through a series of mixture graphs that
approach the individual graphs to refine the clusters.
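The mixture-graph weights can be illustrated on two tiny graphs. This is a minimal sketch, assuming each graph is a dictionary from node pairs (written in one consistent order) to weights; the function name, the node names and the weights are hypothetical.

```python
def mixture_weights(graphs, i, k):
    """Weights of the mixture graph for graph i at scale k.

    Implements b(u, v) = a_i(u, v) + 2**(-k) * sum over j != i of a_j(u, v).
    """
    pairs = set().union(*(g.keys() for g in graphs))
    scale = 2.0 ** (-k)
    return {
        pair: graphs[i].get(pair, 0.0)
        + scale * sum(g.get(pair, 0.0) for j, g in enumerate(graphs) if j != i)
        for pair in pairs
    }

# Two hypothetical input graphs over nodes A, B, C.
g0 = {("A", "B"): 1.0, ("B", "C"): 0.2}
g1 = {("A", "B"): 0.4, ("A", "C"): 0.8}

sum_graph = mixture_weights([g0, g1], i=0, k=0)    # scale 0: plain sum of all graphs
refined = mixture_weights([g0, g1], i=0, k=2)      # other graphs down-weighted by 1/4
near_g0 = mixture_weights([g0, g1], i=0, k=30)     # large k: approaches g0 itself
```

At scale 0 every graph contributes equally, and as the scale grows the mixture converges to the individual graph, which matches the transition described on the slide.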
• Combine with a cut selection heuristic.
• Choose the cut that is sparse-enough in the largest number of input
graphs.
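The cut selection heuristic can be sketched as a simple vote. This is an illustration under assumptions: graphs are dictionaries of symmetric pair weights, the candidate cuts are given (in JointCluster they would come from the spectral step), and all names, weights and thresholds are hypothetical.

```python
from itertools import product

def symmetric_weights(edges):
    """Store a(u, v) for both orientations of each undirected weighted edge."""
    a = {}
    for u, v, w in edges:
        a[(u, v)] = a[(v, u)] = w
    return a

def conductance(a, S, T, V):
    """a(S, T) / min(a(S, V), a(T, V)); infinite if one side carries no weight."""
    total = lambda A, B: sum(a.get((u, v), 0.0) for u, v in product(A, B))
    denom = min(total(S, V), total(T, V))
    return total(S, T) / denom if denom > 0 else float("inf")

def select_cut(graphs, thresholds, candidate_cuts, V):
    """Pick the candidate cut that is sparse-enough in the most graphs."""
    best_cut, best_votes = None, 0
    for S, T in candidate_cuts:
        # A cut gets one vote per graph in which its conductance is within the threshold.
        votes = sum(conductance(a, S, T, V) <= alpha
                    for a, alpha in zip(graphs, thresholds))
        if votes > best_votes:
            best_cut, best_votes = (S, T), votes
    return best_cut, best_votes   # best_cut stays None if no cut qualifies in any graph

V = {0, 1, 2, 3}
graphs = [
    symmetric_weights([(0, 1, 1.0), (2, 3, 1.0), (1, 2, 0.1)]),
    symmetric_weights([(0, 1, 2.0), (2, 3, 2.0), (1, 2, 0.2), (0, 3, 0.2)]),
]
thresholds = [0.2, 0.2]                               # hypothetical thresholds
candidates = [({0}, {1, 2, 3}), ({0, 1}, {2, 3})]
chosen, votes = select_cut(graphs, thresholds, candidates, V)
```

In this example the cut separating {0, 1} from {2, 3} is sparse-enough in both graphs, while isolating node 0 is not sparse-enough in either, so the first is selected.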
• The modularity score of a cluster in a graph is: fraction of
edges contained within the cluster minus the fraction expected
by chance.
• The partition of 𝑉 that respects the clustering tree and
optimizes the min–modularity score can be found by dynamic
programming.
$OPT(D) = \arg\max \{\, \text{min-modularity}(D),\ \text{min-modularity}(OPT(D_l) \cup OPT(D_r)) \,\}$
• Ordered by the min-modularity scores of clusters.
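The dynamic program over the clustering tree can be sketched as a bottom-up recursion. The sketch rests on assumptions that the slides do not spell out: the min-modularity of a cluster is taken as its minimum modularity over all graphs, the score of a partition is taken as the sum over its clusters, and the tree is a nested tuple whose leaves are frozensets. All names and the toy graphs are hypothetical.

```python
from itertools import product

def symmetric_weights(edges):
    """Store a(u, v) for both orientations of each undirected weighted edge."""
    a = {}
    for u, v, w in edges:
        a[(u, v)] = a[(v, u)] = w
    return a

def modularity(a, C, V):
    """Fraction of edge weight inside C minus the fraction expected by chance."""
    total = sum(a.values())                                      # a(V)
    inside = sum(a.get((u, v), 0.0) for u, v in product(C, C))   # a(C, C)
    degree = sum(a.get((u, v), 0.0) for u, v in product(C, V))   # a(C)
    return inside / total - (degree / total) ** 2

def min_modularity(graphs, C, V):
    """Assumed definition: worst modularity of cluster C over all graphs."""
    return min(modularity(a, C, V) for a in graphs)

def nodes_of(tree):
    """All nodes under a tree node; leaves are frozensets, internal nodes are pairs."""
    if isinstance(tree, frozenset):
        return tree
    left, right = tree
    return nodes_of(left) | nodes_of(right)

def opt(tree, graphs, V):
    """Return (partition, score): keep the node whole or use the children's optima."""
    whole = nodes_of(tree)
    whole_score = min_modularity(graphs, whole, V)
    if isinstance(tree, frozenset):
        return [whole], whole_score
    left, right = tree
    left_part, left_score = opt(left, graphs, V)
    right_part, right_score = opt(right, graphs, V)
    if whole_score >= left_score + right_score:
        return [whole], whole_score
    return left_part + right_part, left_score + right_score

def two_triangles(w):
    """Two triangles of weight w joined by a weak edge (hypothetical toy graph)."""
    return symmetric_weights([(0, 1, w), (1, 2, w), (0, 2, w),
                              (3, 4, w), (4, 5, w), (3, 5, w), (2, 3, 0.5)])

V = set(range(6))
graphs = [two_triangles(1.0), two_triangles(2.0)]
tree = ((frozenset({0}), frozenset({1, 2})), frozenset({3, 4, 5}))
partition, score = opt(tree, graphs, V)
```

For this tree the recursion keeps each triangle as a cluster: splitting the root improves the score, while splitting the left triangle further does not.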
• We desire an unsupervised method for learning the related
conductance threshold 𝛼𝑖∗ for each network of interest 𝐺𝑖 .
• Algorithm for each graph 𝐺𝑖 :
• Cluster only 𝐺𝑖 using JointCluster, with 𝛼𝑖∗ set (without loss of generality) to the
maximum possible value, 1.
• Set 𝛼𝑖∗ to the minimum conductance threshold that would result in the same
set of clusters 𝐶𝑡 .
• Goal: automatically choose a threshold that is sufficiently low
and sufficiently high.
• Alternative algorithms:
• 𝐺𝑖 Tree: Choose one of the input graphs 𝐺𝑖 as a reference, cluster this
single graph using an efficient spectral clustering method to obtain a
clustering tree, and parse this tree into clusters using the min-modularity
score computed from all graphs.
• Coassociation: Cluster each graph separately using a spectral method,
combine the resulting clusters from different graphs into a coassociation
graph, and cluster this graph using the same spectral method.
• 𝑘𝑜𝑢𝑡 parameter
• An intra-cluster pair is a pair of elements that belong to a single
cluster.
• $\text{Jaccard Index} = \frac{\#\,\text{element pairs that are intra-cluster w.r.t. both clusterings}}{\#\,\text{element pairs that are intra-cluster w.r.t. either of the clusterings}}$
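The Jaccard index between two clusterings follows directly from the definition of intra-cluster pairs. A minimal sketch, with hypothetical function names and example clusterings; returning 1.0 when neither clustering has any intra-cluster pair is an assumption made here to avoid dividing by zero.

```python
from itertools import combinations

def intra_pairs(clusters):
    """All unordered element pairs that fall inside a single cluster."""
    return {frozenset(pair) for cluster in clusters
            for pair in combinations(sorted(cluster), 2)}

def jaccard_index(clustering_1, clustering_2):
    """Pairs intra-cluster in both clusterings over pairs intra-cluster in either."""
    pairs_1, pairs_2 = intra_pairs(clustering_1), intra_pairs(clustering_2)
    either = pairs_1 | pairs_2
    # Assumption: two clusterings made only of singletons count as identical.
    return len(pairs_1 & pairs_2) / len(either) if either else 1.0

reference = [{1, 2, 3}, {4, 5}]
predicted = [{1, 2}, {3, 4, 5}]
score = jaccard_index(reference, predicted)   # 2 shared pairs out of 6 in total
```

A score of 1 means the two clusterings group exactly the same pairs together, and a score near 0 means they share almost none.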
• Two yeast strains grown under two conditions where glucose or
ethanol was the predominant carbon source.
• Coexpression networks using all 4,482 profiled genes as nodes.
• Weight of an edge as the absolute value of the Pearson's correlation
coefficient between the expression profiles of the two genes.
• Physical gene-protein interactions (from various interaction databases):
total of 41,660 non-redundant interactions.
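The coexpression edge weights described above can be sketched in plain Python. This is an illustration only: the gene names and expression profiles are made up, and the function names are hypothetical. A gene with a constant profile has zero variance and would need special handling that is not shown here.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson's correlation coefficient of two equally long profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    norm_x = sqrt(sum((xi - mean_x) ** 2 for xi in x))
    norm_y = sqrt(sum((yi - mean_y) ** 2 for yi in y))
    return cov / (norm_x * norm_y)   # undefined for constant profiles

def coexpression_network(profiles):
    """Edge weight = |Pearson correlation| for every pair of genes."""
    return {(g, h): abs(pearson(profiles[g], profiles[h]))
            for g, h in combinations(sorted(profiles), 2)}

# Hypothetical expression profiles over four samples.
profiles = {
    "gA": [1.0, 2.0, 3.0, 4.0],
    "gB": [2.0, 4.0, 6.0, 8.0],    # perfectly correlated with gA
    "gC": [4.0, 3.0, 2.0, 1.0],    # perfectly anti-correlated with gA
    "gD": [1.0, -1.0, -1.0, 1.0],  # uncorrelated with gA
}
network = coexpression_network(profiles)
```

Taking the absolute value means that strongly anti-correlated genes such as gA and gC receive the same high weight as strongly correlated ones.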
• GO Process: Genes in each reference set in this class are
annotated to the same GO Biological Process term.
• TF (Transcription Factor) Perturbations: Genes in each set have
altered expression when a TF is deleted or overexpressed.
• Compendium of Perturbations: Genes in each set have altered
expression under deletions of specific genes, or chemical
perturbations.
• TF Binding Sites: Genes in a set have binding sites of the same TF
in their upstream genomic regions, with sites predicted using ChIP
binding data.
• eQTL Hotspots: Certain genomic regions exhibit a significant
excess of linkages of expression traits to genotypic variations.
• An intra-cluster pair is a pair of elements that belong to a single
cluster.
• $\text{Jaccard Index} = \frac{\#\,\text{element pairs that are intra-cluster w.r.t. both clusterings}}{\#\,\text{element pairs that are intra-cluster w.r.t. either of the clusterings}}$
• Sensitivity: the fraction of reference sets that are enriched for
genes belonging to some cluster output by the method. [coverage]
• Specificity: the fraction of clusters that are enriched for genes
belonging to some reference set. [accuracy]
• Comparing JointCluster against methods that integrate only a
single coexpression network with a physical network.
• Combined <glucose+ethanol> coexpression network and the physical
network.
• Comparing on fair terms for all algorithms:
• Setting minimum cluster size parameter in Matisse to 10.
• Size limit of 100 genes for JointCluster.
• Co-clustering didn't have a parameter to directly limit cluster size.
• Heterogeneous large-scale datasets are accumulating at a
rapid pace.
• Efforts to integrate them are intensifying.
• JointCluster provides a versatile approach to integrating any
number of heterogeneous datasets.
• Natural progression from clustering of single to multiple datasets.
• Testing JointCluster algorithm on simulated datasets.
• Testing JointCluster on yeast empirical datasets.
• More flexible than two-network clustering methods.
• Consistent with known biology, extend our knowledge.
• JointCluster can handle multiple heterogeneous networks.
• Enables better coverage of genes, especially when knowledge of physical
interactions is less complete.
• Unsupervised and exploratory approach to data integration.
• The challenge: integrating multiple datasets in order to study
different aspects of biological systems.
• Proposed simultaneous clustering of multiple networks.
• Efficient solution that permits certain theoretical guarantees
• Effective scaling heuristic
• Flexibility to handle multiple heterogeneous networks
• Results of JointCluster:
• More robust, and can handle high false positive rates.
• More consistently enriched for various reference classes.
• Yielding better coverage.
• Agree with known biology of yeast.
Bibliography:
• Manikandan Narayanan, Adrian Vetta, Eric E. Schadt, Jun Zhu (2010). Simultaneous clustering of multiple gene expression and physical interaction datasets. PLoS Computational Biology.
• Supplementary Text for “Simultaneous clustering of multiple gene expression and physical interaction datasets”.
• CPP source code.
• Kannan R, Vempala S, Vetta A (2000). On clusterings - good, bad and spectral. Proceedings Annual IEEE Symposium on Foundations of Computer Science (FOCS). pp 367–377.
• Shi J, Malik J (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI) 22: 888–905.
• Andersen R, Lang KJ (2008). An algorithm for improving graph partitions. Proceedings Annual ACM-SIAM Symposium on Discrete Algorithms (SODA). pp 651–660.