Comparisons and validation of statistical clustering techniques for microarray gene expression data
Susmita Datta and Somnath Datta
Presented by: Jenni Dietrich
Assisted by: Jeffrey Kidd and Kristin Wheeler
Mentor: Dr. Takis Benos
26 June 2003

Outline
• Brief microarray overview
• Purpose of the paper
• Discuss clustering algorithms
• Experiment and Results
• Conclusions

Microarrays
• Allow for monitoring of gene expression at the transcript level
• Slide with single-stranded DNA molecules attached at fixed positions (probes)
• Exploit the complementary binding of single-stranded DNA sequences
• Result in a large data set containing expression levels of thousands of genes
• Microarray experiments are often used to track the changes in gene expression
  – Over time
  – In the presence of various agents
[Figure: Microarray Slide]
[Figure: Gene Expression Matrix (numerical values encoded by color)]
• Gene expression profiles characterize the dynamic functioning of each gene in the genome
• Expression data can be represented as a matrix where the rows are genes and the columns are samples
• The values in the cells of the matrix represent the expression levels

Experiment
• Experiment design
  – Decide on probes and genes
  – Type of microarray
• Data normalization
• Data analysis
  – Identify differentially expressed genes
  – Cluster genes based on expression patterns

Clustering
• Goal of microarray data analysis
  – Identify changing levels of gene expression
  – Correlate the changes to identify sets of genes with similar profiles
• Clustering: group objects into subsets
• Clustering algorithms can be used to group genes that have similar expression patterns

Clustering Algorithms
• Hierarchical clustering with correlation
  – UPGMA (most commonly used algorithm)
• Clustering by K-means
• Diana
• Fanny
• Model-based clustering
• Hierarchical clustering with partial least squares

Purpose of the Paper
• Currently, there are no clear guidelines for choosing a clustering algorithm to group genes based on their expression profiles
• This paper evaluated the performance of six different algorithms using a microarray data set on sporulation of budding yeast

Clustering Algorithms
• Algorithms differ in the measure of similarity used when grouping the objects and in the grouping technique
• Some require prior knowledge of the suspected number of clusters

Hierarchical clustering
• Produces a hierarchy of clusters rather than a predefined number of clusters (agglomerative approach)
• Initially, each observation is in its own cluster
• Subsequently, the two 'closest' clusters are combined into a single cluster
• The similarity (distance) measure used is the average method
  – Distance between clusters is the average of the distances between the points in one cluster and those in the other cluster

K-means Clustering
• Uses advance knowledge about the number of clusters to be formed
  – k clusters
• Initially, all objects are randomly assigned to one of k clusters
• Objects are moved between clusters in an attempt to minimize the distance between each object and its cluster

Diana
• Divisive clustering method
  – All objects start in one cluster and are broken into smaller groups
• Genes with larger dissimilarity are put in different clusters
• Uses the standard Euclidean distance measure

Fanny
• Uses fuzzy logic and produces a probability vector for each observation
• Hard cluster is formed by assigning an observation to the group with the highest probability
• Uses the Manhattan distance measurement
  – d = |x - u| + |y - v|, where (x, y) and (u, v) are two points
• Needs a predefined number of clusters (k)

Model-based Clustering
• Treats data as a mixture distribution
  – Often based on a Gaussian distribution
• Describes each cluster using a probabilistic model
• No predefined number of clusters

Experiment
• Run each of the six clustering algorithms with the sporulation of yeast data set
• Used three validation measurements to compare the results
  – Average proportion of non-overlap measure
  – Average distance between means measure
  – Average distance measure

Results
[Figure: Average proportion of non-overlap & average distance between means measures]
• Based on the average proportion of non-overlap and average distance between means measures:
  – Model-based clustering appears to be worst
  – Hierarchical clustering with correlation and Fanny seem to be the best
• Based on the average distance measure:
  – Hierarchical clustering performed worst
• Overall, Diana performed consistently well for all three measurements
[Figure: Average distance between mean measures]

Comparison of model profiles
• The smaller the distance from the model profile, the closer the results of that algorithm are to the results of the model

Average Temporal Profiles
[Figure: average temporal profiles by group category; x-axis: Time (hours), 0 to 12. Chu et al. (1998) Science 282: 699-705]
Group category (number of genes): I (52), II (62), III (47), IV (95), V (158), VI (61), VII (5)

Guidelines for choosing a clustering algorithm
1. Produce a visual plot of the first two principal components to determine the method that gives the most separation between groups
2. Check for consistency of the method with temporal observations
3. Compare the average group temporal profiles with the model profiles produced from a known, hand-picked set of genes (training set)

In conclusion…
• The clustering algorithm directly affects the interpretation and analysis of the data
• Therefore, careful consideration of the algorithms is necessary before a choice is made
• For this data set, Diana performed consistently well when compared with the model profile and when looking at the three validation measures

Other aspects
• It may be important to inspect the computational stability as well as the computational time of an algorithm before making a choice

References
• Datta, S, and S Datta. (2003) Comparisons and validation of statistical clustering techniques for microarray gene expression data. Bioinformatics 19: 459-466
• Brazma, A, and J Vilo. (2000) Gene expression data analysis. FEBS Letters 480: 17-24
• Quackenbush, J. (2001) Computational Analysis of Microarray Data. Nature Reviews 2: 418-427
• Hastie, T, R Tibshirani, and J Friedman. (2001) The Elements of Statistical Learning, 453-480
• http://www.austinlinks.com/Fuzzy/overview.html
• http://www.maths.lth.se/help/R/.R/library/cluster/html/fanny.html
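
Illustration (not part of the original slides)
The algorithm slides name three distance ideas: the Manhattan distance used by Fanny, the Euclidean distance used by Diana, and the average method used in hierarchical clustering. Fanny also forms a hard cluster from a probability vector. The Python sketch below is one possible way to write these down. The function names and the sample values are hypothetical and are not taken from the paper or from any clustering package.

```python
from itertools import product
from math import dist  # Euclidean distance between two points (Python 3.8+)


def manhattan(p, q):
    """Manhattan distance: sum of absolute coordinate differences."""
    return sum(abs(a - b) for a, b in zip(p, q))


def average_linkage(cluster_a, cluster_b, metric=dist):
    """Average of all pairwise distances between points of two clusters."""
    pairs = list(product(cluster_a, cluster_b))
    return sum(metric(p, q) for p, q in pairs) / len(pairs)


def hard_assignment(memberships):
    """Index of the cluster with the highest membership probability."""
    return max(range(len(memberships)), key=lambda i: memberships[i])


# Hypothetical expression profiles over two samples (made-up numbers).
cluster_a = [(0.0, 0.0), (0.0, 2.0)]
cluster_b = [(3.0, 4.0)]

print(manhattan((1.0, 2.0), (4.0, 6.0)))                      # |1-4| + |2-6| = 7.0
print(average_linkage(cluster_a, cluster_b))                   # Euclidean average
print(average_linkage(cluster_a, cluster_b, metric=manhattan)) # (7 + 5) / 2 = 6.0
print(hard_assignment([0.2, 0.7, 0.1]))                        # 1
```

Passing the metric as a parameter keeps the average method separate from the choice of distance, which mirrors the point on the slides that algorithms differ both in similarity measure and in grouping technique.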