Making Sense of Complicated Microarray Data
Part II: Gene Clustering and Data Analysis
Gabriel Eichler, Boston University
Some slides adapted from: MeV documentation slides

Why Cluster?
- Clustering is a process by which you can explore your data in an efficient manner.
- Visualization of data can help you review the data quality.
- Assumption: guilt by association. Similar gene expression patterns may indicate a biological relationship.

Expression Vectors
Gene expression vectors encapsulate the expression of a gene over a set of experimental conditions or sample types.
Numeric vector (conditions 1 to 8): -0.8  1.5  1.8  0.5  -0.4  -1.3  0.8  1.5
[Figure: the same vector drawn as a line graph and as a heatmap, both on a scale from -2 to 2]

Expression Vectors as Points in 'Expression Space'

        t1     t2     t3
G1    -0.8   -0.3   -0.7
G2    -0.4   -0.8   -0.7
G3    -0.6   -0.8   -0.4
G4     0.9    1.2    1.3
G5     1.3    0.9   -0.6

[Figure: the genes plotted as points on the axes Experiment 1, Experiment 2, and Experiment 3; genes with similar expression lie close together]

Distance and Similarity
- The ability to calculate a distance (or similarity, its inverse) between two expression vectors is fundamental to clustering algorithms.
- Distance between vectors is the basis upon which decisions are made when grouping similar patterns of expression.
- Selection of a distance metric defines the concept of distance.

Distance: a measure of how dissimilar two gene expression profiles are.

          Exp 1   Exp 2   Exp 3   Exp 4   Exp 5   Exp 6
Gene A    x1A     x2A     x3A     x4A     x5A     x6A
Gene B    x1B     x2B     x3B     x4B     x5B     x6B

Some distances (MeV provides 11 metrics):
1. Euclidean: $\sqrt{\sum_{i=1}^{6} (x_{iA} - x_{iB})^2}$
2. Manhattan: $\sum_{i=1}^{6} |x_{iA} - x_{iB}|$
3. Pearson correlation

Clustering Algorithms
Be wary: confounding computational artifacts are associated with all clustering algorithms.
- You should always understand the basic concepts behind an algorithm before using it.
- Anything will cluster! Garbage in means garbage out.
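The three distances named on the Distance and Similarity slide can be sketched in a few lines of Python. This is a minimal illustration, not MeV's implementation. The function names and the two example vectors are made up for this sketch, and turning Pearson correlation into a distance as 1 - r is one common convention that the slide itself does not specify.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences across conditions."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute differences across conditions."""
    return sum(abs(x - y) for x, y in zip(a, b))

def pearson_distance(a, b):
    """1 - Pearson correlation (an assumed convention, not taken from the slide).

    Raises ZeroDivisionError if either vector is constant (zero variance).
    """
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return 1.0 - cov / math.sqrt(var_a * var_b)

# Hypothetical expression vectors over six experiments:
# gene_b has the same shape as gene_a but twice the amplitude.
gene_a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
gene_b = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]

d_euclid = euclidean(gene_a, gene_b)
d_manhattan = manhattan(gene_a, gene_b)
d_pearson = pearson_distance(gene_a, gene_b)
```

The example shows why the choice of metric "defines the concept of distance": Euclidean and Manhattan distance treat the two genes as far apart because their magnitudes differ, while the correlation-based distance treats them as identical because their shapes match.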
Hierarchical Clustering
• IDEA: Iteratively combine genes into groups based on similar patterns of observed expression.
• By combining genes with genes OR genes with groups, the algorithm produces a dendrogram of the hierarchy of relationships.
• Display the data as a heatmap and dendrogram.
• Cluster genes, samples, or both.
(HCL-1)

[Figure sequence: Gene 1 to Gene 8 are merged step by step into a dendrogram; the final leaf order is Gene 7, Gene 1, Gene 2, Gene 4, Gene 5, Gene 3, Gene 8, Gene 6]
[Figure: heatmap with dendrograms, genes on one axis and samples on the other, color scale labeled H to L]

The Leaf Ordering Problem:
• Find the 'optimal' layout of branches for a given dendrogram architecture.
• There are 2^(N-1) possible orderings of the branches.
• For a small microarray dataset of 500 genes there are about 1.6 × 10^150 branch configurations.

Hierarchical Clustering
Pros:
- Commonly used algorithm
- Simple and quick to calculate
Cons:
- Real genes probably do not have a hierarchical organization

Self-Organizing Maps (SOMs)
IDEA: Place genes onto a grid so that genes with similar patterns of expression are placed on nearby squares.
[Figure: grid nodes A, B, C, D with genes a, b, c, d placed onto them]
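The SOM idea can be sketched as a small training loop. This is a minimal sketch under stated assumptions, not the algorithm used by MeV or GEDI: the 2 x 2 grid, the learning-rate and radius schedules, the Gaussian neighborhood, and the names `best_node`, `train_som`, and `profiles` are all choices made for this illustration. The four profiles are the values for Genes 1, 16, 6, and 14 in the deck's SOM example data (two rising, two falling).

```python
import math
import random

def best_node(nodes, v):
    """Grid position whose weight vector is closest (squared Euclidean) to v."""
    return min(nodes, key=lambda pos: sum((w - x) ** 2 for w, x in zip(nodes[pos], v)))

def train_som(vectors, rows=2, cols=2, epochs=200, seed=0):
    """Train a tiny SOM; returns {(row, col): weight_vector}."""
    rng = random.Random(seed)
    dim = len(vectors[0])
    # Initialise every node from a randomly chosen input vector (copied).
    nodes = {(r, c): list(rng.choice(vectors))
             for r in range(rows) for c in range(cols)}
    for epoch in range(epochs):
        frac = 1.0 - epoch / epochs                      # decays from 1 toward 0
        rate = 0.5 * frac                                # assumed learning-rate schedule
        radius = max(rows, cols) / 2.0 * frac + 0.01     # assumed neighborhood radius
        for v in rng.sample(vectors, len(vectors)):      # shuffled pass over the genes
            bmu = best_node(nodes, v)
            for pos, weights in nodes.items():
                grid_d2 = (pos[0] - bmu[0]) ** 2 + (pos[1] - bmu[1]) ** 2
                pull = rate * math.exp(-grid_d2 / (2 * radius ** 2))
                for i in range(dim):
                    # Move the node toward the gene; nearby nodes move more.
                    weights[i] += pull * (v[i] - weights[i])
    return nodes

profiles = {
    "Gene 1":  [1, 2, 4, 5, 7, 9],
    "Gene 16": [1, 2, 5, 7, 8, 9],
    "Gene 6":  [8, 7, 7, 6, 5, 3],
    "Gene 14": [9, 7, 5, 3, 2, 1],
}
som = train_som(list(profiles.values()))
placement = {name: best_node(som, v) for name, v in profiles.items()}
```

After training, `placement` maps each gene to one square of the grid. Because neighboring nodes are pulled together during training, genes with similar profiles tend to land on the same or adjacent squares.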
Self-Organizing Maps (SOMs)

Gene       a_1hr   a_2hr   a_3hr   b_1hr   b_2hr   b_3hr
Gene 1       1       2       4       5       7       9
Gene 2       2       3       7       7       6       3
Gene 3       4       4       5       5       4       4
Gene 4       3       4       3       4       3       3
Gene 5       1       2       3       4       5       6
Gene 6       8       7       7       6       5       3
Gene 7       4       4       4       4       5       4
Gene 8       5       6       5       4       3       2
Gene 9       3       3       1       3       6       8
Gene 10      2       4       8       5       4       2
Gene 11      1       5       6       9       8       7
Gene 12      1       3       5       8       8       6
Gene 13      4       3       3       4       5       6
Gene 14      9       7       5       3       2       1
Gene 15      1       2       2       3       4       4
Gene 16      1       2       5       7       8       9

[Figure: 3 × 3 SOM grid with nodes A to I]

Self-Organizing Maps (SOMs)
Genes grouped by node on the 3 × 3 grid (nodes A to I):
• Genes 11 and 12
• Genes 1, 16, and 5
• Gene 10
• Gene 15
• Gene 8
• Genes 6 and 14
• Genes 9 and 13
• Gene 3
• Genes 4, 7, and 2

The Gene Expression Dynamics Inspector (GEDI)
[Figure: data matrix with genes (Gene 1 to Gene 6, …) as rows and samples as columns, the samples split into Group A, Group B, and Group C]
Values shown in the matrix: 1.5 1.4 1.7 1.2 .85 .65 .50 .55 2.5 2.8 2.7 2.1 .78 .95 .75 .45 1.1 1.2 1.0 1.3 .56 .62 .78 .89 .45 .23 .15 .05 .82 .71 .62 .49 .11 .16 .11 .95 2.2 4.5 6.7 6.2 2.2 2.5 2.8 2.9 .48 .90 1.5 1.8 2.1 2.0 1.9 1.6 4.2 4.8 5.2 5.5 2.5 2.6 2.0 1.9 1.2 1.1 1.6 2.9 1.1 1.8 1.9 1.4 1.7 1.2 1.1 1.6

GEDI's Features:
• Allows for simultaneous analysis of several time courses or datasets
• Displays the data in an intuitive, comparable, mathematically driven visualization
• The same genes map to the same tiles
[Figure: GEDI mosaics for Group A, Group B, and Group C, samples 1 to 4, color scale labeled H to L]

Software Demonstrations
MeV available at http://www.tigr.org/software/tm4/mev.html
GEDI available at http://www.chip.org/~ge/gedihome.htm

Comparison of GEDI vs. Hierarchical Clustering
Hierarchical clustering of random data (GIGO)
G.E.D.I. allows the direct visual assessment of the quality of conventional cluster analysis
From: CreateGEP_Journal.wpd, random_A

Questions
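The GIGO point from the comparison slide ("anything will cluster") can be illustrated with a small agglomerative clustering run on pure noise. This is a sketch under assumptions, not the procedure used in the deck: the slides do not say which linkage rule was used, so average linkage is an arbitrary choice here, and `average_linkage`, `random_genes`, and the data sizes are hypothetical. The last lines also check the leaf-ordering count for 500 genes quoted in the Leaf Ordering Problem slide, reading the count as 2 to the power N-1.

```python
import math
import random

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def average_linkage(vectors):
    """Agglomerative clustering with average linkage (an assumed choice).

    Returns the merges in order as (members_left, members_right, distance),
    where members are indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]
    merges = []
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # Mean pairwise distance between the two groups.
                total = sum(euclidean(vectors[a], vectors[b])
                            for a in clusters[i] for b in clusters[j])
                d = total / (len(clusters[i]) * len(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges

# Eight "genes" of pure random noise over six conditions (no real structure).
rng = random.Random(0)
random_genes = [[rng.uniform(-2, 2) for _ in range(6)] for _ in range(8)]
merges = average_linkage(random_genes)

# Leaf-ordering count for N = 500 genes, read as 2 ** (N - 1).
orderings_500 = 2 ** (500 - 1)
orderings_text = f"{orderings_500:.1e}"
```

The algorithm always returns a complete tree of N - 1 merges, even though the input contains no structure at all. A dendrogram by itself is therefore no evidence that the clusters are real.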