Download Making Sense of Complicated Microarray Data Part II - MGH-PGA

Making Sense of Complicated Microarray Data
Part II: Gene Clustering and Data Analysis
Gabriel Eichler, Boston University
Some slides adapted from: MeV documentation slides

Why Cluster?
• Clustering is a process by which you can explore your data efficiently.
• Visualizing the data can help you review its quality.
• Assumption: guilt by association. Similar gene expression patterns may indicate a biological relationship.

Expression Vectors
Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types. The same vector can be displayed in three ways:
• Numeric vector: -0.8 1.5 1.8 0.5 -0.4 -1.3 0.8 1.5
• Line graph (expression axis from -2 to 2, conditions 1 to 8)
• Heatmap (color scale from -2 to 2)

Expression Vectors As Points in ‘Expression Space’

        t1     t2     t3
G1     -0.8   -0.3   -0.7
G2     -0.4   -0.8   -0.7
G3     -0.6   -0.8   -0.4
G4      0.9    1.2    1.3
G5      1.3    0.9   -0.6

[Figure: the genes plotted as points on three axes (Experiment 1, Experiment 2, Experiment 3), with a group of nearby points labeled "Similar Expression".]

Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- The distance between vectors is the basis on which decisions are made when grouping similar patterns of expression.
- The choice of distance metric defines the concept of distance.

Distance: a measure of how dissimilar two gene expression vectors are.

          Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene A    x1A     x2A     x3A     x4A     x5A     x6A
Gene B    x1B     x2B     x3B     x4B     x5B     x6B

Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation

Clustering Algorithms
• Be wary: confounding computational artifacts are associated with all clustering algorithms. You should always understand the basic concepts behind an algorithm before using it.
• Anything will cluster! Garbage in means garbage out.
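To make the three distances from the slide above concrete, here is a minimal Python sketch using only the standard library. The two six-experiment vectors are made-up illustration values, not data from the slides, and turning the Pearson correlation r into a distance as 1 - r is one common convention assumed here; the slides do not say how MeV defines it.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - r, where r is the Pearson correlation (assumed convention)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression vectors over six experiments.
# gene_b has the same shape as gene_a at twice the amplitude.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

print("Euclidean:", euclidean(gene_a, gene_b))
print("Manhattan:", manhattan(gene_a, gene_b))
print("Pearson distance:", pearson_distance(gene_a, gene_b))
```

The two genes are far apart under the Euclidean and Manhattan metrics but essentially identical under the Pearson-based one, which is what "the choice of distance metric defines the concept of distance" means in practice.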
Hierarchical Clustering
• IDEA: Iteratively combine genes into groups based on similar patterns of observed expression.
• By combining genes with genes, or genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
• Display the data as a heatmap and dendrogram.
• Cluster genes, samples or both.
(HCL-1)

[Figure sequence: eight genes (Gene 1 to Gene 8) are merged step by step into a dendrogram. The leaf order changes from Gene 1, 2, 3, 4, 5, 6, 7, 8 to Gene 1, 2, 4, 5, 3, 8, 6, 7 and finally to Gene 7, 1, 2, 4, 5, 3, 8, 6.]

[Figure: heatmap with dendrograms over genes and samples; color scale labeled H to L.]

Hierarchical Clustering
The Leaf Ordering Problem:
• Find the ‘optimal’ layout of branches for a given dendrogram architecture.
• There are $2^{N-1}$ possible orderings of the branches, where N is the number of leaves.
• For a small microarray dataset of 500 genes there are about $1.6 \times 10^{150}$ branch configurations.

Hierarchical Clustering
• Pros:
– Commonly used algorithm
– Simple and quick to calculate
• Cons:
– Real genes probably do not have a hierarchical organization

Self-Organizing Maps (SOMs)
IDEA: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.
[Figure: a grid of squares labeled A, B, C and D, with genes a, b, c and d placed on it.]
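The SOM idea above can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the implementation used by MeV or GEDI: the grid size, learning-rate schedule, Gaussian neighborhood and the toy expression profiles are all assumptions chosen for the example.

```python
import math
import random

def best_node(weights, vector):
    """Grid position whose weight vector is closest to `vector`."""
    return min(weights, key=lambda node: math.dist(weights[node], vector))

def train_som(vectors, rows=2, cols=2, steps=300, seed=0):
    """Train a small SOM; returns {(row, col): weight vector}."""
    rng = random.Random(seed)
    nodes = [(r, c) for r in range(rows) for c in range(cols)]
    # Start each node as a copy of a randomly chosen input vector.
    weights = {node: list(rng.choice(vectors)) for node in nodes}
    start_radius = max(rows, cols) / 2.0
    for step in range(steps):
        remaining = 1.0 - step / steps
        rate = 0.5 * remaining                    # learning rate decays linearly
        radius = start_radius * remaining + 0.01  # neighborhood shrinks over time
        vector = rng.choice(vectors)
        winner = best_node(weights, vector)
        for node, w in weights.items():
            grid_d2 = (node[0] - winner[0]) ** 2 + (node[1] - winner[1]) ** 2
            # Nodes near the winner on the grid are pulled harder.
            pull = rate * math.exp(-grid_d2 / (2.0 * radius ** 2))
            for i in range(len(w)):
                w[i] += pull * (vector[i] - w[i])
    return weights

def assign(weights, named_vectors):
    """Map each gene name to the grid square of its closest node."""
    return {name: best_node(weights, v) for name, v in named_vectors.items()}

# Hypothetical toy profiles: two rising, two falling, two flat.
profiles = {
    "rise_1": [1, 2, 4, 5, 7, 9],
    "rise_2": [1, 2, 5, 7, 8, 9],
    "fall_1": [9, 7, 5, 3, 2, 1],
    "fall_2": [8, 7, 7, 6, 5, 3],
    "flat_1": [4, 4, 5, 5, 4, 4],
    "flat_2": [4, 4, 4, 4, 5, 4],
}

som = train_som(list(profiles.values()), rows=2, cols=2, seed=1)
for name, square in assign(som, profiles).items():
    print(name, "->", square)
```

Because each update also moves the winner's grid neighbors, similar profiles tend to end up on the same or adjacent squares. The exact layout depends on the random seed and the schedule parameters.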
Self-Organizing Maps (SOMs)

Gene     1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16
a_1hr    1  2  4  3  1  8  4  5  3  2  1  1  4  9  1  1
a_2hr    2  3  4  4  2  7  4  6  3  4  5  3  3  7  2  2
a_3hr    4  7  5  3  3  7  4  5  1  8  6  5  3  5  2  5
b_1hr    5  7  5  4  4  6  4  4  3  5  9  8  4  3  3  7
b_2hr    7  6  4  3  5  5  5  3  6  4  8  8  5  2  4  8
b_3hr    9  3  4  3  6  3  4  2  8  2  7  6  6  1  4  9

[Figure sequence: a 3 × 3 grid of nodes labeled A to I, shown rearranging over several steps.]

Self-Organizing Maps (SOMs)
[Figure: the 3 × 3 grid of nodes A to I with the 16 genes assigned to nine squares in these groups:]
• Genes 11 and 12
• Genes 1, 16 and 5
• Gene 10
• Gene 15
• Gene 8
• Genes 6 and 14
• Genes 9 and 13
• Gene 3
• Genes 4, 7 and 2

The Gene Expression Dynamics Inspector – GEDI
[Table: expression values for Gene 1 to Gene 6, … (rows) across the samples of Group A, Group B and Group C (columns). Values in source order:]
1.5 1.4 1.7 1.2 .85 .65 .50 .55 2.5 2.8 2.7 2.1 .78 .95 .75 .45 1.1 1.2 1.0 1.3 .56 .62 .78 .89 .45 .23 .15 .05 .82 .71 .62 .49 .11 .16 .11 .95 2.2 4.5 6.7 6.2 2.2 2.5 2.8 2.9 .48 .90 1.5 1.8 2.1 2.0 1.9 1.6 4.2 4.8 5.2 5.5 2.5 2.6 2.0 1.9 1.2 1.1 1.6 2.9 1.1 1.8 1.9 1.4 1.7 1.2 1.1 1.6

GEDI’s Features:
• Allows simultaneous analysis of several time courses or datasets
• Displays the data in an intuitive, comparable, mathematically driven visualization
• The same genes map to the same tiles
[Figure: GEDI mosaics for Group A, Group B and Group C in panels numbered 1 to 4; color scale labeled H to L.]

Software Demonstrations
MeV available at http://www.tigr.org/software/tm4/mev.html
GEDI available at http://www.chip.org/~ge/gedihome.htm

Comparison of GEDI vs. Hierarchical Clustering
• Hierarchical clustering of random data (GIGO)
• G.E.D.I. allows the direct visual assessment of the quality of conventional cluster analysis.
From: CreateGEP_Journal.wpd, random_A

Questions
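The "garbage in, garbage out" point about hierarchical clustering of random data can be reproduced with a short sketch. The code below is a plain standard-library agglomerative clustering using Euclidean distance and average linkage. Both choices are assumptions, since the slides do not say which metric or linkage was used, and the random "genes" are generated here purely for illustration.

```python
import math
import random

def average_distance(vectors, group_a, group_b):
    """Mean Euclidean distance over all pairs drawn from the two groups."""
    total = sum(math.dist(vectors[p], vectors[q])
                for p in group_a for q in group_b)
    return total / (len(group_a) * len(group_b))

def hierarchical_cluster(vectors):
    """Average-linkage agglomerative clustering.

    Returns the merges in the order performed, each as
    (members_of_first_group, members_of_second_group, distance).
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Find the closest pair of current clusters.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = average_distance(vectors, clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return merges

# Eight hypothetical "genes" made of pure noise over six conditions.
rng = random.Random(0)
random_genes = [[rng.gauss(0.0, 1.0) for _ in range(6)] for _ in range(8)]

merges = hierarchical_cluster(random_genes)
for first, second, distance in merges:
    print(first, "+", second, "at distance", round(distance, 2))
```

Even though the input contains no structure, the algorithm still returns a complete hierarchy of seven merges for eight genes. The dendrogram alone cannot tell you whether the clusters are meaningful.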

