Clustering Gene Expression Data
DNA Microarrays Workshop, Feb. 26 – Mar. 2, 2001, UNIL & EPFL, Lausanne
Gaddy Getz, Weizmann Institute, Israel
• Gene Expression Data
• Clustering of Genes and Conditions
• Methods
– Agglomerative Hierarchical: Average Linkage
– Centroids: K-Means
– Physically motivated: Super-Paramagnetic Clustering
• Coupled Two-Way Clustering
Feb 2001 (GG)

Gene Expression Technologies
• DNA Chips (Affymetrix) and MicroArrays can measure the mRNA concentration of thousands of genes simultaneously
• General scheme: extract RNA, synthesize labeled cDNA, hybridize with DNA on the chip.

Single Experiment
• After hybridization
– Scan the chip and obtain an image file
– Image analysis (find spots, measure signal and noise). Tools: ScanAlyze, Affymetrix, …
• Output file
– Affymetrix chips: for each gene, a reading proportional to the concentration and a present/absent call. (Average Difference, Absent Call)
– cDNA MicroArrays: competing hybridization of target and control. For each gene, the log ratio of target and control. (CH1I-CH1B, CH2I-CH2B)

Preprocessing: From one experiment to many
• Chip and Channel Normalization
– Aim: bring the readings of all experiments onto the same scale
– Cause: different RNA amounts, labeling efficiency and image acquisition parameters
– Method: multiply the readings of each array/channel by a scaling factor:
• chosen such that the sum of the scaled readings is the same for all arrays
• found by a linear fit of the highly expressed genes
– Note: in multi-channel experiments, normalize each channel separately.

Preprocessing: From one experiment to many
[Figure: Colon cancer data (Alon et al.)]
• Filtering of Genes
– Remove genes that are absent in most experiments
– Remove genes that are constant in all experiments
– Remove genes with low readings, which are not reliable.
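The normalization and filtering steps above can be sketched in plain Python as follows. This is only an illustration under stated assumptions: the function names, the choice of the mean column sum as the common target, and the thresholds `min_present_frac`, `min_reading` and `const_tol` are hypothetical and do not come from the slides.

```python
def normalize_arrays(readings, target_sum=None):
    """Scale each array (column) so that all arrays have the same sum.

    `readings` is a list of rows (one per gene); each row holds one
    reading per array/experiment.
    """
    n_arrays = len(readings[0])
    col_sums = [sum(row[j] for row in readings) for j in range(n_arrays)]
    if target_sum is None:
        # Assumed choice: use the mean of the column sums as the common target.
        target_sum = sum(col_sums) / n_arrays
    factors = [target_sum / s for s in col_sums]
    return [[row[j] * factors[j] for j in range(n_arrays)] for row in readings]


def filter_genes(readings, present, min_present_frac=0.5,
                 min_reading=10.0, const_tol=1e-9):
    """Return the indices of genes that pass the three filters.

    `present[g][j]` is True when gene g has a "present" call in array j.
    All thresholds are illustrative assumptions.
    """
    kept = []
    for g, (row, calls) in enumerate(zip(readings, present)):
        if sum(calls) / len(calls) < min_present_frac:
            continue  # absent in most experiments
        if max(row) - min(row) <= const_tol:
            continue  # constant in all experiments
        if max(row) < min_reading:
            continue  # low readings, not reliable
        kept.append(g)
    return kept


# Small usage example with made-up numbers.
raw = [[10.0, 40.0], [30.0, 20.0], [60.0, 140.0]]
normalized = normalize_arrays(raw)
calls = [[True, True], [True, False], [False, False]]
kept = filter_genes(normalized, calls)
```

In a multi-channel experiment, `normalize_arrays` would be applied to each channel separately, as the note on the normalization slide suggests.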
[Figure axes: Genes vs. Experiments]

Noise and Repeats
[Figure: log – log plot]
• >90% 2 to 3 fold
• Multiplicative noise
• Repeat experiments
• Log scale: dist(4,2)=dist(2,1)

We can ask many questions
Supervised Methods (use predefined labels)
• Which genes are expressed differently in two known types of conditions?
• What is the minimal set of genes needed to distinguish one type of condition from the others?
Unsupervised Methods (use only the data)
• Which genes behave similarly in the experiments?
• How many different types of conditions are there?

Unsupervised Analysis
• Goal A: Find groups of genes that have correlated expression profiles. These genes are believed to belong to the same biological process and/or to be co-regulated.
• Goal B: Divide conditions into groups with similar gene expression profiles. Example: divide drugs according to their effect on gene expression.
Clustering Methods

What is clustering?
• Input: N data points, Xi, i=1,2,…,N, in a D-dimensional space.
• Goal: Find “natural” groups or clusters. Data points of the same cluster are “more similar”.
• Note: the number of clusters is also to be determined

Clustering is ill-posed
• Problem-specific definitions
• Similarity: which points should be considered close?
– Correlation coefficient
– Euclidean distance
• Resolution: specify/hierarchical results
• Shape of clusters: general, spherical.

Similarity Measure
• Similarity measures
– Centered Correlation
– Uncentered Correlation
– Absolute correlation
– Euclidean

Agglomerative Hierarchical Clustering
• Need to define the distance between the new cluster and the other clusters.
– Single Linkage: distance between closest pair.
– Complete Linkage: distance between farthest pair.
– Average Linkage: average distance between all pairs, or distance between cluster centers.
[Figure: Dendrogram; vertical axis: distance between joined clusters]
• The dendrogram induces a linear ordering of the data points

Agglomerative Hierarchical Clustering
• Results depend on distance update method
– Single Linkage: elongated clusters
– Complete Linkage: sphere-like clusters
• Greedy iterative process
• NOT robust against noise
• No inherent measure to choose the clusters

Centroid Methods - K-means
• Start with random positions of K centroids.
• Iterate until centroids are stable
– Assign points to centroids
– Move centroids to the center of the assigned points
[Figures: Iteration = 0 through Iteration = 3]

Centroid Methods - K-means
• Result depends on initial centroids’ position
• Fast algorithm: compute distances from data points to centroids
• No way to choose K.
• Example: 3 clusters / K=2, 3, 4
• Breaks long clusters

Super-Paramagnetic Clustering (SPC)
M. Blatt, S. Weisman and E. Domany (1996) Neural Computation
• The idea behind SPC is based on the physical properties of dilute magnets.
• Calculating correlation between magnet orientations at different temperatures (T).
[Figures: T=Low, T=High, T=Intermediate]

Super-Paramagnetic Clustering (SPC)
• The algorithm simulates the magnets' behavior at a range of temperatures and calculates their correlation
• The temperature (T) controls the resolution
• Example: N=4800 points in D=2

Output of SPC
• A function of T that peaks when stable clusters break
• Size of largest clusters as function of T
• Dendrogram
• Stable clusters “live” for large T

Choosing a value for T
[Figure]

Advantages of SPC
• Scans all resolutions (T)
• Robust against noise and initialization: calculates collective correlations.
• Identifies “natural” and stable clusters (T)
• No need to pre-specify number of clusters
• Clusters can be any shape

Many clustering methods applied to expression data
• Agglomerative Hierarchical
– Average Linkage (Eisen et al., PNAS 1998)
• Centroid (representative)
– K-Means (Golub et al., Science 1999)
– Self Organized Maps (Tamayo et al., PNAS 1999)
• Physically motivated
– Deterministic Annealing (Alon et al., PNAS 1999)
– Super-Paramagnetic Clustering (Getz et al., Physica A 2000)

Available Tools
• M. Eisen’s programs for clustering and display of results (Cluster, TreeView)
– Predefined set of normalizations and filtering
– Agglomerative, K-means, 1D SOM
• Matlab
– Agglomerative, public m-files.
• Dedicated software packages (SPC)
• Web sites: e.g.
http://ep.ebi.ac.uk/EP/EPCLUST/
• Statistical programs (SPSS, SAS, S-plus)

Back to gene expression data
[Figure: Colon cancer data (normalized genes); axes: Genes vs. Experiments]
• 2 Goals: Cluster Genes and Conditions
• 2 independent clusterings:
– Genes are represented as vectors of expression in all conditions
– Conditions are represented as vectors of expression of all genes

First clustering - Experiments
1. Identify tissue classes (tumor/normal)

Second Clustering - Genes
2. Find Differentiating And Correlated Genes
[Figure labels: Ribosomal proteins, Cytochrome C metabolism, HLA2]

Two-way Clustering
[Figure]

Coupled Two-Way Clustering (CTWC)
G. Getz, E. Levine and E. Domany (2000) PNAS
• Why use all the genes to represent conditions and all conditions to represent genes? Different structures emerge when clustering sub-matrices.
• New Goal: Find significant structure in subsets of the data matrix.
• A non-trivial task: exponential number of subsets.
• Recently we proposed a heuristic to solve this problem.

CTWC of colon cancer data
[Figure: panels (A) and (B)]

Biological Work
• Literature search for the genes
• Genomics: search for common regulatory signal upstream of the genes
• Proteomics: infer functions.
• Design next experiment: get more data to validate result.
• Find what is in common with sets of experiments/conditions.

Summary
• Clustering methods are used to
– find genes from the same biological process
– group the experiments into similar conditions
• Different clustering methods can give different results. The physically motivated ones are more robust.
• Focusing on subsets of the genes and conditions can uncover structure that is masked when using all genes and conditions
www.weizmann.ac.il/physics/complex/compphys
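As a companion to the K-means slides earlier in the deck, the loop "assign points to centroids, move centroids to the center of the assigned points, iterate until stable" can be sketched in plain Python. This is an illustrative sketch, not the workshop's code: the function name `kmeans`, the initialization (K data points drawn at random), the squared Euclidean distance, and the stopping rule (assignments no longer change, which implies the centroids no longer move) are assumptions.

```python
import random


def kmeans(points, k, seed=0, max_iter=100):
    """Cluster `points` (a list of equal-length tuples) into k groups.

    Returns (centroids, assignment), where assignment[i] is the index
    of the centroid that point i belongs to.
    """
    rng = random.Random(seed)
    # Assumed initialization: K distinct data points chosen at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignment = None
    for _ in range(max_iter):
        # Step 1: assign every point to its nearest centroid.
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2
                                  for x, y in zip(p, centroids[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # stable: no point changed cluster
        assignment = new_assignment
        # Step 2: move each centroid to the mean of its assigned points.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # keep the old position if a cluster is empty
                centroids[c] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, assignment


# Two well-separated groups of made-up points.
data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, labels = kmeans(data, k=2)
```

Changing `seed` changes the initial centroids. On less clearly separated data this can change the result, which is the dependence on initial positions noted on the K-means slide.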