Iclust: information based clustering
Noam Slonim, The Lewis-Sigler Institute for Integrative Genomics, Princeton University
Joint work with Gurinder Atwal, Gasper Tkacik, and Bill Bialek

Running example
Gene expression data: a matrix of K genes by N conditions. Each entry is the (log) ratio of the mRNA expression level of a gene in a specific condition; some entries are missing.
[Figure: K-by-N expression matrix with missing values marked "??".]
Relations between genes? Relations between experimental conditions?

Information as a correlation/similarity measure
Some nice features of the information measure:
- Model independent
- Responsive to any type of dependency
- Captures more than just pairwise relations
- Suitable for both continuous and discrete data
- Independent of the measurement scale
- Axiomatic

Mutual information - definition
We have some "uncertainty" about the state of gene-A; but now someone tells us the state of gene-B. The resulting reduction in the uncertainty about the state of gene-A is called the mutual information between the two variables:

I[p(a,b)] = H[p(a)] - H[p(a|b)] = \sum_{a \in A, b \in B} p(a,b) \log \frac{p(a,b)}{p(a)\,p(b)}

How much can we learn from the state of gene-B about the state of gene-A (and vice versa)?

Model independence & responsiveness to "complicated" relations
[Figure: four scatter plots of gene-B vs. gene-A expression levels, showing that MI and correlation can disagree: MI ~ 1 bit with Corr. ~ 0.9; MI ~ 2 bits with Corr. ~ 0.6; MI ~ 1.3 bits with Corr. ~ 0; MI ~ 0 bits with Corr. ~ 0.]

Capturing more than just pairwise relations
[Figure: gene-A/gene-B expression vs. experiment index, with pairwise MI ~ 0 bits and Corr. ~ 0, while the gene-A/gene-B/gene-C triplet-information is ~ 1.0 bits.]
Using a model-dependent correlation measure might result in missing significant dependencies in our data.
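The definition above suggests a direct plug-in estimate of I from data. Below is a minimal sketch of such an estimator for two continuous expression profiles, discretized by histogram binning; the function name, the bin count, and the use of a naive plug-in estimate with no bias correction are illustrative choices, not part of the original method.

```python
import numpy as np

def mutual_information(x, y, bins=8):
    """Plug-in estimate of I(X;Y) in bits from paired samples.

    Discretizes the two continuous profiles by histogram binning and
    evaluates sum_{a,b} p(a,b) log2[ p(a,b) / (p(a) p(b)) ].
    Bin count and lack of bias correction are illustrative choices.
    """
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = joint / joint.sum()                 # joint distribution p(a, b)
    p_x = p_xy.sum(axis=1, keepdims=True)      # marginal p(a)
    p_y = p_xy.sum(axis=0, keepdims=True)      # marginal p(b)
    nz = p_xy > 0                              # avoid log(0); marginals are
                                               # nonzero wherever p_xy is
    return float((p_xy[nz] * np.log2(p_xy[nz] / (p_x @ p_y)[nz])).sum())
```

Because the estimate is a KL divergence between the empirical joint and the product of its own marginals, it is always nonnegative, and a profile is maximally informative about itself.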
Mutual information vs. Pearson correlation in bacterial gene-expression data
[Figure: Mycobacterium tuberculosis, 81 experiments; mutual-information relations compared with Pearson-correlation relations.]

Information relations between gene expression profiles
Given the expression of gene-A, how much information do we have about the expression of gene-B (when averaging over all conditions)? (Sample size: the number of conditions - 173 in the Gasch data.)
Once we find these information relations, we often want to apply cluster analysis. Numerous clustering methods are available - but typically they assume a particular model. For example, K-means corresponds to the modeling assumption that each cluster can be described by a spherical Gaussian. Back in square one...?

Iclust - information based clustering
What is a "good" cluster? A simple proposal: given a cluster, we pick two items at random, and we want them to be as "similar" to each other as possible. Formally, we wish to maximize

\langle S \rangle = \sum_c q(c) \sum_{i_1, i_2} q(i_1|c) q(i_2|c) \, s(i_1, i_2)

or, more generally,

\langle S \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c) \cdots q(i_r|c) \, s(i_1, \ldots, i_r)

or, taking information itself as the similarity measure,

\langle I \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c) \cdots q(i_r|c) \, I(i_1, \ldots, i_r).

Namely, we wish to maximize the average information relations in our clusters, or to find clusters such that in each cluster all items are highly informative about each other.

Iclust - information based clustering (cont.)
As in rate-distortion theory, we add a penalty term that we wish to minimize:

I(C; i) = \sum_{c,i} p(i) \, q(c|i) \log \frac{q(c|i)}{q(c)}

When the penalty term is maximal (no compression), \langle S \rangle is maximized as well; when the penalty term is small (maximal compression), \langle S \rangle is minimized as well. The intermediate cases are the interesting ones.

Iclust - information based clustering (cont.)
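The pairwise (r = 2) objective ⟨S⟩ above can be computed directly from soft assignments q(c|i), a prior p(i), and a similarity matrix s(i1, i2). A minimal sketch; the function name and the use of Bayes' rule to obtain q(i|c) are the only assumptions, and every cluster is assumed to carry nonzero mass:

```python
import numpy as np

def average_intra_cluster_similarity(q_c_given_i, p_i, s):
    """<S> = sum_c q(c) sum_{i1,i2} q(i1|c) q(i2|c) s(i1,i2)  (pairwise, r=2).

    q_c_given_i : (N, C) soft assignments q(c|i), rows sum to 1
    p_i         : (N,)   prior over items p(i)
    s           : (N, N) pairwise similarity, e.g. mutual information
    """
    q_c = p_i @ q_c_given_i                            # q(c) = sum_i p(i) q(c|i)
    # Bayes: q(i|c) = p(i) q(c|i) / q(c)  (assumes q(c) > 0 for every c)
    q_i_given_c = (p_i[:, None] * q_c_given_i) / q_c
    # per cluster: sum_{i1,i2} q(i1|c) s(i1,i2) q(i2|c)
    per_cluster = np.einsum('ic,ij,jc->c', q_i_given_c, s, q_i_given_c)
    return float(q_c @ per_cluster)
```

For a hard partition whose similarity matrix is 1 within clusters and 0 across, the average intra-cluster similarity is exactly 1, which is a convenient sanity check.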
The intuitive clustering problem can thus be turned into a general mathematical optimization problem:

F[q(c|i)] = \langle I(c) \rangle - T \, I(C; i)

where q(c|i) are the clustering parameters, \langle I(c) \rangle is the expected information relations among data items, I(C;i) is the information between data items and clusters, and T is the tradeoff parameter. Clustering is formulated as trading bits of similarity against bits of descriptive power, without any further assumptions.

Relations with classical rate distortion
Written with a distortion d, Iclust minimizes

F[q(c|i)] = I(C;i) + \frac{1}{T} \langle D \rangle, \quad \langle D \rangle = \sum_c q(c) \sum_{i_1, \ldots, i_r} q(i_1|c) \cdots q(i_r|c) \, d(\Phi^{(i_1)}, \ldots, \Phi^{(i_r)})

while classical rate distortion minimizes

\tilde{F}[q(c|i)] = I(C;i) + \frac{1}{T} \langle \tilde{D} \rangle, \quad \langle \tilde{D} \rangle = \sum_c q(c) \sum_{i_1} q(i_1|c) \, d(\Phi^{(i_1)}, \Phi^{(c)}).

For the special case of pairwise relations:

\langle D \rangle = \sum_c q(c) \sum_{i_1} q(i_1|c) \sum_{i_2} q(i_2|c) \, d(\Phi^{(i_1)}, \Phi^{(i_2)})
\langle \tilde{D} \rangle = \sum_c q(c) \sum_{i_1} q(i_1|c) \, d\Big(\Phi^{(i_1)}, \sum_{i_2} q(i_2|c) \Phi^{(i_2)}\Big)

The difference is whether the sum over i_2 is taken before or after d is computed. If the distortion/similarity matrix is a kernel matrix, the two formulations are equivalent.

And yet - some important differences
- Iclust is applicable when the raw data is given directly as pairwise relations.
- Iclust does not require the definition of a "prototype" (or "centroid").
- Iclust can handle more than just pairwise correlations.
- The two formulations induce different decoding schemes.

A sender observes a pattern \Phi^{(i)}, but is allowed to send only the cluster index, c. In classical rate distortion the receiver is assumed to decode by \tilde{\Phi}^{(i)} = \Phi^{(c)}: deterministic decoding with vocabulary size N_c. In Iclust the receiver is assumed to decode by \tilde{\Phi}^{(i)} = \Phi^{(i')} with i' drawn from q(i|c): stochastic decoding with vocabulary size N.
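The "sum over i2 before vs. after d" distinction can be made concrete. The sketch below uses squared Euclidean distance as an illustrative choice of d; for that particular d the pairwise distortion equals exactly twice the centroid distortion, consistent with the remark that kernel-like distortions make the two formulations equivalent.

```python
import numpy as np

def iclust_distortion(x, w):
    """D(c) = sum_{i1,i2} q(i1|c) q(i2|c) d(x_i1, x_i2): sum over i2 outside d.

    x : (N, D) data patterns for one cluster; w : (N,) weights q(i|c), sum to 1.
    d is squared Euclidean distance here (an illustrative choice).
    """
    d = ((x[:, None, :] - x[None, :, :]) ** 2).sum(-1)
    return float(w @ d @ w)

def rd_distortion(x, w):
    """D~(c) = sum_{i1} q(i1|c) d(x_i1, centroid): sum over i2 moved inside d."""
    centroid = w @ x                              # sum_{i2} q(i2|c) x_i2
    return float(w @ ((x - centroid) ** 2).sum(-1))
```

For squared Euclidean d, the identity E||X - X'||^2 = 2 E||X - mu||^2 (X, X' i.i.d. with mean mu) gives D(c) = 2 D~(c), so the two objectives coincide up to a constant factor; for a general d they need not.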
Iclust vs. classical rate-distortion decoding
[Figure: original image with 220 gray levels, compressed to 2 clusters; Iclust (stochastic) decoding vs. RD (deterministic) decoding.]

Iclust algorithm - freely available web implementation
The algorithm iterates the self-consistent equations

q(c|i) \propto q(c) \exp\Big\{ \frac{1}{T} \big[ r \, s(c;i) - (r-1) \, s(c) \big] \Big\}

where s(c;i) is the average "similarity" of i to the members of c, and s(c) is the average "similarity" among the members of c.
- Responsive to any type of dependency among the data
- Invariant to changes in the data representation
- Allows clustering based on more than pairwise relations
See www.princeton.edu/~nslonim
For more details: Slonim, Atwal, Tkacik, and Bialek (2005), Information based clustering, PNAS, in press.

Iclust - cluster examples
Data: gene expression. Clusters of genes:
- C18: RPS10A, RPS10B, RPS11A, RPS11B, RPS12, ... - proteins of the small ribosomal subunit
- C15: FRS1, KRS1, SES1, TYS1, VAS1, ... - enzymes that attach amino acids to tRNA
- C4: PGM2, UGP1, TSL1, TPS1, TPS2, ... - enzymes involved in the trehalose anabolism pathway

Data: dynamics of stock prices. Clusters of stocks:
- C17: Wal-Mart, Target, Home Depot, Best Buy, Staples, ...
- C12: Microsoft, Apple Comp., Dell, HP, Motorola, ...
- C2: NY Times, Tribune Co., Meredith Corp., Dow Jones & Co., Knight-Ridder Inc., ...
Given the price of stock-A, how much information do we have about the price of stock-B (when averaging over many days)?

Data: rating by viewers. Clusters of movies:
- C12: Snow White, Cinderella, Dumbo, Pinocchio, Aladdin, ...
- C1: Psycho, Apocalypse Now, The Godfather, Taxi Driver, Pulp Fiction, ...
- C7: Star Wars, Return of the Jedi, The Terminator, Alien, Apollo 13, ...
Given the rating of movie-A, how much information do we have about the rating of movie-B (when averaging over many viewers)?
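The self-consistent equations above suggest a simple fixed-point iteration for the pairwise (r = 2) case. The sketch below is not the Princeton web implementation: the uniform prior p(i), the default temperature, the fixed iteration count, and the optional q_init are all illustrative choices.

```python
import numpy as np

def iclust(s, n_clusters, T=0.1, n_iter=100, q_init=None, seed=0):
    """Fixed-point iteration for pairwise (r=2) Iclust:
    q(c|i) proportional to q(c) exp{ (1/T) [2 s(c;i) - s(c)] }.

    s : (N, N) symmetric similarity matrix (e.g. pairwise mutual information).
    Returns soft assignments q(c|i) of shape (N, n_clusters).
    """
    n = s.shape[0]
    if q_init is None:
        q = np.random.default_rng(seed).dirichlet(np.ones(n_clusters), size=n)
    else:
        q = q_init.copy()
    p_i = np.full(n, 1.0 / n)                    # uniform prior (a choice)
    for _ in range(n_iter):
        q_c = np.clip(p_i @ q, 1e-12, None)      # q(c), floored to avoid log(0)
        q_i_c = (p_i[:, None] * q) / q_c         # q(i|c) by Bayes' rule
        s_ci = s @ q_i_c                         # s(c;i): avg similarity of i to c
        s_c = np.einsum('ic,ij,jc->c', q_i_c, s, q_i_c)  # s(c): avg within c
        logits = np.log(q_c) + (2.0 * s_ci - s_c) / T
        logits -= logits.max(axis=1, keepdims=True)      # numerical stability
        q = np.exp(logits)
        q /= q.sum(axis=1, keepdims=True)
    return q
```

At low temperature the iteration sharpens the assignments toward a hard partition; on a block-structured similarity matrix it recovers the blocks.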
Coherence results - comparison to alternative algorithms
[Figure: bar chart of cluster coherence (0-100) for Iclust compared with hierarchical clustering, K-medians, and K-means on three datasets: S&P 500, ESR, and EachMovie.]

Quick summary
- Information as the core measure of data analysis, with many appealing features.
- Iclust: a novel information-theoretic formulation of clustering, with some intriguing relations to classical rate-distortion clustering.
- Validations: finding coherent gene clusters based on information relations in gene-expression data...
- ...and finding coherent stock clusters and coherent movie clusters...
- ...and genotype-phenotype association in bacteria, based on phylogenetic data: Slonim, Elemento & Tavazoie (2005), Mol. Systems Biol., in press.
- ...and more?