Clustering approaches for temporal microarray gene expression data

Irsal Jasebel Alsanea
Electrical Engineering and Computer Science Department, Northwestern University

Abstract: Clustering is the division of data into groups of similar objects. Temporal gene expression data, measured with microarray technology, has the potential to generate a great deal of biological knowledge. In this paper, we explore and implement different gene clustering methods for temporal microarray expression data, and propose a combined method to improve upon previous methods. We propose that pre-clustering with a Transitional State Discrimination algorithm (Template-Based) and a TC_linkage_infer algorithm (Shape-Based), followed by clustering with a Pointwise Similarity algorithm, will reduce information loss in the processing of genomic data.

Keywords: Clustering, time-series, microarray, bioinformatics, fuzzy c-means algorithm, k-means clustering algorithm

Received December 8, 2012

For a compiled list of software used, go to http://collablab.northwestern.edu/irsal/bioinformatics/

1. Introduction

Recent developments in microarray technology have yielded revolutionary contributions to genomics. Microarrays allow the expression of tens of thousands of genes to be monitored in parallel. The analysis of microarray data is increasingly becoming a major bottleneck in the utilization of the technology [1]. Microarray experiments can be divided into two main types: static and time-series. In static experiments, gene expression measurements are taken once from each of a number of samples. In time-series experiments, gene expression levels are measured in a single sample at a number of points in time [4].

Clustering is used to make sense of microarray data. Similar to clustering approaches in the social and physical sciences, it divides large sets of gene expressions into smaller sets of comparably similar gene expressions, grouped by different distance or correlation measures.

Time-series microarray experiments have far greater applications than static experiments [4]. First, they are used to discover the dynamics behind various biological systems. Second, they are used to study the development of different controlling genes (for example, does gene 1 express or suppress gene 2?). Third, they allow scientists to study disease progression (such as cancer) over time and in greater depth. Fourth, they enable novel methods of drug discovery by allowing the observation of genetic responses to varied cues.

Expression levels in microarrays are measured by the intensity and frequency of the fluorescence tags (or dye). The tags (depending on the experiment) reveal genes that have been inhibited (typically red fluorescence) or activated (typically green fluorescence). Temporal data is taken at time points, which can be in units of hours, minutes or seconds, depending on the temporal activity of the genes.

In this paper, we describe different time-series clustering algorithms and propose a combined method that improves upon existing methods.

2. Discriminative Algorithms

Time-series clustering approaches in microarray data can be broken down into two types: discriminative and generative algorithms. Discriminative algorithms define a pairwise similarity function and then apply that function to cluster similar data points together.

2.1 Pointwise Similarity

Pointwise Similarity algorithms are the simplest and easiest to implement of all clustering algorithms. K-means is a well-known pointwise algorithm and partitioning method [1]. Genes are classified as belonging to one of k groups, with k chosen a priori. Cluster membership is determined by calculating the centroid of each group, finding the proximity (via Euclidean distance, Manhattan distance, the Pearson correlation coefficient, etc.) of a gene to each centroid, and assigning the gene to the closest centroid [2]. The algorithm seeks to minimize the total distance from each gene to its assigned centroid.

(a) Pseudocode

Input: a set S of objects and an integer number of clusters k.
Output: a partition of S into S1, S2, …, Sk.

Program:
• Choose an integer k as the number of clusters.
• Initialize the codebook vectors of the k clusters randomly.
• For every new sample vector:
  o Compute the Pearson correlation coefficient between the new vector and every cluster’s centroid.
  o Assign each gene to the closest centroid.

k-means works by calculating the centroid of each cluster S_i, denoted \bar{x}_i, and optimizing the cost function, where d is the chosen distance measure:

\[ \mathrm{cost}(S_i) = \sum_{x \in S_i} d(x, \bar{x}_i) \tag{1} \]

The goal is to minimize the total cost:

\[ \mathrm{cost}(S_1, \dots, S_k) = \sum_{i=1}^{k} \mathrm{cost}(S_i) \tag{2} \]

Here we use the Pearson correlation rather than other distance formulas and coefficient functions, as a control in the comparison with other algorithms. The Pearson correlation coefficient between any two series of numbers X = {X1, X2, …, XN} and Y = {Y1, Y2, …, YN}, with means \bar{X} and \bar{Y}, is defined as

\[ r(X, Y) = \frac{\sum_{i=1}^{N} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{N} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{N} (Y_i - \bar{Y})^2}} \tag{3} \]

(b) Analysis

The k-means algorithm is a popular choice among clustering algorithms. Its time complexity is O(nkl), where n is the number of patterns, k is the number of clusters, and l is the number of iterations taken by the algorithm to converge. Its space complexity is O(k+n). Pointwise algorithms are the base of all the following methods. The following three algorithms all improve upon the Pointwise Similarity method by using other models or vectors to compare gene pairs (as opposed to using raw data, as Pointwise algorithms do).

(c) Implementation

We used Cluster and TreeView to implement a k-means algorithm. A yeast dataset was filtered and run through the k-means algorithm.
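The pseudocode in Section 2.1(a) can be sketched in a few lines of Python. This is a minimal illustration, not the Cluster implementation used above: the function names `pearson` and `kmeans_pearson`, the optional `init` argument, the iteration cap, and the toy profiles are all assumptions made for the example. Each gene is assigned to the centroid with which it has the highest Pearson correlation, and each centroid is then recomputed as the mean profile of its members.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: a flat profile is treated as uncorrelated
    return cov / (sd_x * sd_y)


def kmeans_pearson(profiles, k, init=None, iterations=50, seed=0):
    """Cluster expression profiles, using Pearson correlation as similarity.

    `init` (hypothetical convenience argument) lets the caller fix the
    starting centroids; otherwise k profiles are picked at random.
    """
    rng = random.Random(seed)
    start = init if init is not None else rng.sample(profiles, k)
    centroids = [list(c) for c in start]
    labels = [0] * len(profiles)
    for _ in range(iterations):
        # Assignment step: the closest centroid is the most correlated one.
        new_labels = [
            max(range(k), key=lambda c: pearson(p, centroids[c]))
            for p in profiles
        ]
        # Update step: each centroid becomes the mean profile of its members.
        for c in range(k):
            members = [p for p, lab in zip(profiles, new_labels) if lab == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
    return labels, centroids


# Toy data: three rising and three falling profiles over four time points.
profiles = [
    [1, 2, 3, 4], [2, 4, 6, 8], [0, 1, 2, 3.5],
    [4, 3, 2, 1], [8, 6, 4, 2], [5, 4, 2, 1],
]
labels, centroids = kmeans_pearson(profiles, k=2, init=[profiles[0], profiles[3]])
print(labels)  # prints [0, 0, 0, 1, 1, 1]
```

Because correlation ignores scale, the profiles [1, 2, 3, 4] and [2, 4, 6, 8] land in the same cluster, which is the behaviour one usually wants when grouping genes by the shape of their response rather than its magnitude.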
TreeView was then used to visualize the clustering of genes in a structure similar to a phylogenetic tree. (See http://rana.lbl.gov/manuals/ClusterTreeView.pdf for more information.)

Figure 1: TreeView interface of raw data passed through the Cluster algorithm (Pointwise method only). A Pearson correlation coefficient can be seen as the ‘Selected Array Node Correlation.’

2.2 Feature-Based Similarity

Feature-Based Similarity algorithms are more complex than Pointwise Similarity algorithms. Whereas Pointwise Similarity algorithms compare raw temporal expression data, Feature-Based Similarity algorithms extract a set of features from the data and compare those features instead.

(a) Pseudocode

In general, Feature-Based methods first transform each gene expression vector into a feature vector [8], which encodes the time and direction of step-wise temporal transitions (which the authors consider to be most important); they then apply traditional clustering algorithms, as in the Pointwise Similarity methods, such as k-means and hierarchical clustering.

Input: a determinant feature, a set S of objects and an integer number of clusters k.
Output: a partition of S into S1, S2, …, Sk.

Definitions (see Figure 2):
• One-step (Up): gene expressions transition from a low value to a high value
• One-step (Down): gene expressions transition from a high value to a low value
• Binary two-step (Up-Down): gene expressions transition from low to high and return to the same low value
• Binary two-step (Down-Up): gene expressions transition from high to low and return to the same high value
• F1 statistic: represents how well the one-step model fits the data
• F2 statistic: represents how well the two-step model fits the data
• F12 statistic: represents the relative goodness of fit of a one-step versus a two-step pattern

Program (StepMiner):
• Find the one- or two-step function that best fits the expression profile over n time points, X1, X2, …, Xn, using binary transitions.
  o Compute F1 and F2, which follow an F-distribution with (m-1, n-m) degrees of freedom, for the one- and two-step functions respectively.
• Find the best feature vector:
    SelectBestModels() {
      oneStep = F-Significant(F1) && Not-F-Significant(F12)
      twoStep = F-Significant(F2) && NotIn(oneStep)
      other = NotIn(oneStep, twoStep)
    }
• Use a Pointwise Similarity algorithm (k-means) to compute clusters by comparing the similarity or dissimilarity of the feature vectors found (Section 2.1).

(b) Analysis

The strength of this method lies in the use of statistical parameters in creating the feature vector. By transforming the data into one- and two-step functions, StepMiner [8] creates a temporal ordering of measurements. Whereas Pointwise Similarity methods treat time as just another parameter, Feature-Based methods take time into account and use it to pre-order temporal data. StepMiner is ideal for users interested in binary models of gene expression time courses. The downside is that binary models abstract away other features: essentially, only one feature is isolated, namely the change in expression level from low to high or vice versa. This is the case for most Feature-Based implementations. In StepMiner, however, the creators did take into account how well the binary model fits the temporal variation in gene expression by fixing a p-value.

Table 1: GO annotations of different gene groups [8]. The extracted binary patterns from the StepMiner algorithm correspond to different cellular functions (by gene groups), all with low p-values.

(c) Implementation

In the implementation of this method, we used StepMiner, which relies on the assumption that transitions in expression levels are the most important features of an expression profile.
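As an illustration of the one-step model defined in Section 2.2(a), the sketch below fits a single step to one expression profile and computes a regression F statistic with (m-1, n-m) degrees of freedom. It is not the StepMiner code: the function names, the default m = 3 for the one-step model, and the sample profile are assumptions made for the example, and the significance test against the F-distribution is left out.

```python
def fit_one_step(values):
    """Try every step position and keep the one with the smallest squared error.

    The profile is modelled as one constant level before the step and another
    constant level from the step onwards.
    """
    best = None
    for i in range(1, len(values)):  # the step happens just before index i
        left, right = values[:i], values[i:]
        mean_left = sum(left) / len(left)
        mean_right = sum(right) / len(right)
        sse = (sum((v - mean_left) ** 2 for v in left)
               + sum((v - mean_right) ** 2 for v in right))
        if best is None or sse < best["sse"]:
            best = {
                "step": i,
                "before": mean_left,
                "after": mean_right,
                "sse": sse,
                "direction": "up" if mean_right > mean_left else "down",
            }
    return best


def f_statistic(values, sse, m=3):
    """Regression F statistic: (SSR / (m - 1)) / (SSE / (n - m)).

    m is the number of model parameters; m = 3 is an assumption here.
    """
    n = len(values)
    mean = sum(values) / n
    ss_total = sum((v - mean) ** 2 for v in values)
    ss_regression = ss_total - sse
    if sse == 0:
        return float("inf")  # perfect fit
    return (ss_regression / (m - 1)) / (sse / (n - m))


# Hypothetical profile: low for four time points, then high for four.
profile = [0.1, 0.0, 0.2, 0.1, 2.0, 2.1, 1.9, 2.0]
fit = fit_one_step(profile)
f1 = f_statistic(profile, fit["sse"])
print(fit["step"], fit["direction"], round(f1, 1))
```

The pair (step position, direction) is the kind of feature vector that would then be handed to a Pointwise Similarity algorithm in place of the raw expression values.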
Each function group at time n (one- and two-step, both up, down and other combinations) corresponds to a gene group responsible for a particular cellular process (or other function). For example, genes whose expression profiles fit the one-step (Down) function at time 9.25 h are associated with protein biosynthesis, with a p-value of 3.4E-51 (see Table 1).

Figure 2: Image on the left shows raw data that has been clustered using a Pointwise Similarity method only. Image on the right shows data that has been passed through the StepMiner (Feature-Based) algorithm, then a Pointwise Similarity method [8].

2.3 Shape-Based Similarity

Shape-Based Similarity algorithms use the shape of expression profiles over time to compare gene pairs. Shape-Based algorithms are commonly based on the popular Smith-Waterman algorithm for local sequence alignment, which assigns to each pair of gene expression profiles a score and a relationship: simultaneous, time-delayed, inverted, or inverted time-delayed [7]. This is analogous to the StepMiner algorithm (which uses one feature: one-step up, one-step down, two-step up-down or two-step down-up), except that it captures a broader pattern of change in the expression profile over time. The score can then be taken as a measure of similarity and used for clustering the genes. Below, we show a more sophisticated method that relies on the basic principle behind Shape-Based Similarity.

(a) Pseudocode

The following algorithm improves upon previous Shape-Based Similarity algorithms by transforming gene expression vectors into “change trend” vectors, which contain the direction of change in the gene expression levels for successive time points, prior to local shape alignment [6].

Input: a set S of objects and an integer number of clusters k.
Output: a partition of S into S1, S2, …, Sk.

Assumptions:
• Functional relationships (in gene expression profiles over time) with high statistical significance must be possible.
• Said functional relationships should have high biological significance.

Definitions:
• sc: maximal local alignment of change trend between each gene pair
• cc: correlation coefficient between the maximal alignment

Program (TC_linkage_infer):
• Randomize a dataset by shuffling normalized gene expression levels at different time points among each gene expression profile in the original dataset.
• Calculate sc between each pair in the random dataset.
• Calculate cc for each gene pair in the random dataset.
• Find the frequency of sc, f(sc), as a function of sc.
• Find the distribution of cc for gene pairs that have the same sc.
• Calculate p-values for the two scores sc and cc (Psc(s ≥ sc), Pcc(c ≥ cc)) by integrating the frequency distribution.
• Calculate the sc and cc between each gene pair in the original dataset.
• Extract gene pairs with significantly high sc values at a certain preset p-value.
  o The correlation coefficient cc is regarded as a second index when gene pairs have the same score sc.
• Extract gene pairs with a statistically significant high value of the combined scores sc and cc.
• Find the local alignment of gene expression profiles over time.
• Use a Pointwise Similarity algorithm (k-means) to compute clusters by comparing the similarity or dissimilarity of the feature vectors found (Section 2.1).

(b) Analysis

The major advantage of Shape-Based algorithms is their ability to identify as similar two expression profiles that are shifted, inverted, or both (see Figure 3). Biologically, a time shift would mean that one gene is regulating another. An inverted shape would mean that a particular mechanism is activating one gene of the pair and inhibiting the other (for that time interval). The major disadvantage of this method is its speed: the slow process of finding the best local sequence alignment must be performed many times. The best local sequence alignment algorithm has a time complexity of O(mn), where m is the number of genes and n is the number of time points.
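To make the idea of aligning change trends concrete, here is a deliberately simplified sketch. It is not TC_linkage_infer: it converts each profile into a change-trend vector of +1, 0 and -1, then scores a gene pair by the longest gap-free local match between the two trend vectors, a stripped-down form of Smith-Waterman scoring with no gap or mismatch penalties and no significance testing. All function names, the `tol` parameter and the sample profiles are assumptions made for the example.

```python
def change_trend(profile, tol=0.0):
    """Direction of change between successive time points: +1, 0 or -1."""
    trend = []
    for prev, cur in zip(profile, profile[1:]):
        diff = cur - prev
        trend.append(1 if diff > tol else -1 if diff < -tol else 0)
    return trend


def local_alignment_score(a, b):
    """Length of the longest contiguous match between a and b, plus its shift.

    The shift is (position in b) - (position in a); a positive value means
    the matching segment occurs later in b than in a.
    """
    best, best_shift = 0, 0
    prev_row = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        row = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                row[j] = prev_row[j - 1] + 1  # extend the diagonal run
                if row[j] > best:
                    best, best_shift = row[j], j - i
        prev_row = row
    return best, best_shift


def best_relationship(x, y, tol=0.0):
    """Classify a gene pair by comparing direct and inverted alignments."""
    trend_x, trend_y = change_trend(x, tol), change_trend(y, tol)
    inverted_y = [-t for t in trend_y]
    score, shift = local_alignment_score(trend_x, trend_y)
    inv_score, inv_shift = local_alignment_score(trend_x, inverted_y)
    inverted = inv_score > score
    if inverted:
        score, shift = inv_score, inv_shift
    if shift == 0:
        kind = "inverted" if inverted else "simultaneous"
    else:
        kind = "inverted time-delayed" if inverted else "time-delayed"
    return score, shift, kind


# Hypothetical profiles: `late` repeats `early` one time point later,
# and `mirror` is `early` turned upside down.
early = [0, 1, 2, 1, 0, 0, 0]
late = [0, 0, 1, 2, 1, 0, 0]
mirror = [2, 1, 0, 1, 2, 2, 2]
print(best_relationship(early, late))
print(best_relationship(early, mirror))
```

Working on trend vectors rather than raw values is what lets a shifted or inverted copy of a profile score as highly as an exact copy; the price is that every gene pair needs its own alignment.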
Because each best local alignment costs O(mn), the overall algorithm has a time complexity of O(kmn), where k is the number of clusters defined. Heuristic approaches have been taken in this method. Since one of the first steps in the algorithm above is randomization of the dataset (and thus of the time points), the inclusion of penalized gaps would reduce problems caused by non-uniform sampling of time points.

(c) Implementation

This algorithm was implemented using a C program and CLUTO, a clustering software package.

Figure 3: (A) Expression profiles that match simultaneously. (B) Expression profiles that are time-delayed. (C) Expression profiles that are inverted [7].

3. Generative Algorithms

Generative algorithms pre-process data to determine optimal parameters for grouping clusters, then use said parameters to identify similar profiles generated by models.

3.1 Template-Based Similarity

A Template-Based algorithm transforms expression vectors into pattern vectors. Pattern vectors show the change between consecutive expression levels.

(a) Pseudocode

Below we describe a Template-Based algorithm called fuzzy c-varieties clustering with Transitional State Discrimination pre-clustering (FCV-TSD). This is a two-step approach which identifies groups of points ordered linearly in temporal locations, and orientations of the data-space that correspond to similar expressions in the time domain [6].

Input: a determinant feature, a set S of objects and an integer number of clusters k.
Output: a partition of S into S1, S2, …, Sk.

Definitions:
• ng: number of genes
• nt: number of time points
• xg = [xg1, xg2, …, xgnt]: gene expression vector for the gth gene, where 1 ≤ g ≤ ng
• ns: number of defined states for the pattern vector function

Program (FCV-TSD):
• Define the pattern vector function pg(xg, t) with ns states.
• For all genes and time points, evaluate the pattern vector function pg(xg, t).
• Generate pattern vector functions, or prototypes.
• For all prototypes and all genes, match each gene to the corresponding pattern vector via a fuzzy c-means algorithm.
  o A gene g belongs to the cluster represented by prototype p.
• Fuzzy c-means algorithm:
  o Randomly initialize the membership matrix (U), subject to the constraint in Equation 4, where u_{ig} is the membership of gene g in cluster i:

\[ \sum_{i=1}^{k} u_{ig} = 1, \quad \forall g = 1, \dots, n_g \tag{4} \]

  o Calculate centroids (c_i) by using Equation 5, where m > 1 is the fuzziness exponent:

\[ c_i = \frac{\sum_{g=1}^{n_g} u_{ig}^{m} x_g}{\sum_{g=1}^{n_g} u_{ig}^{m}} \tag{5} \]

  o Compute the dissimilarity between centroids and data points using Equation 6, where d_{ig} is the distance between centroid c_i and gene expression vector x_g:

\[ J = \sum_{i=1}^{k} \sum_{g=1}^{n_g} u_{ig}^{m} d_{ig}^{2} \tag{6} \]

  o Compute a new membership matrix (U’).

(b) Analysis

The template-matching algorithm above does not require researchers to choose a candidate profile, because it includes every possible pattern vector as a template profile (compare to Feature-Based Similarity algorithms, which isolate one feature). The downside of this flexibility, however, is that as the number of time points increases (longer time series), the number of template profiles, and therefore the number of clusters, becomes large compared to (and at times larger than) the number of genes. Template-Based algorithms’ use of pattern vectors rather than raw data (as in Pointwise methods) makes them robust to noise. These algorithms work well with short time series, which is important because over eighty percent of all time series in the Stanford Microarray Database contain fewer than nine time points. A major disadvantage of Template-Based Similarity methods (similar to Feature-Based methods) is the loss of information from the transformation of raw data into a pattern vector. In the next section (Section 4), we propose a method to overcome this disadvantage.

(c) Implementation

MATLAB was used to implement the FCV-TSD algorithm.

Figure 4: Template-Based method for an artificial dataset in MATLAB. The horizontal axis denotes time, and the vertical axis denotes expression level (fluorescence) [6].

4. Proposed Method

The latter three methods (Feature-Based, Shape-Based and Template-Based) each involve a pre-clustering algorithm followed by a Pointwise Similarity algorithm (k-means or fuzzy c-means). For our proposed method, we plan on running two algorithms in the pre-clustering phase, then using a Pointwise algorithm (in this case, fuzzy c-means) to cluster the gene expressions.

4.1 Template-Shape-Based

Both Template-Based and Shape-Based Similarity methods have their drawbacks. The proposed combined method is an attempt to remove the respective disadvantages of both methods.

(a) Pseudocode

Input: a set S of objects and an integer number of clusters k.
Output: a partition of S into S1, S2, …, Sk.

Assumptions (modified from Section 2.3a):
• Functional relationships with high statistical significance must be possible after the initial pre-clustering.
• Said functional relationships (again, after clustering) should have high biological significance.

Program (Combined Method):
• Pre-clustering:
  o Run the raw gene expression data through the TSD algorithm (Section 3.1a).
  o Run the pre-clustered data through the TC_linkage_infer algorithm (Section 2.3a).
• Clustering:
  o Run the pre-processed data through a fuzzy c-means (Section 3.1a) or k-means (Section 2.1a) algorithm.

(b) Analysis

The approach above would first group the genes based on very general shared patterns and then further distinguish within any individual group based on the more complex features of the expression profiles. The major advantages of this combined approach are reduced information loss (a weakness of Template-Based Similarity methods) and reduced processing time (a weakness of the Shape-Based method). Slowness would be less of an issue because pre-clustering with the Template-Based algorithm partitions the dataset into smaller subsets, so only a small subset would need to pass through the Shape-Based algorithm. This combined approach would also be robust to noise because of its two-step pre-clustering (first with a Template-Based algorithm, then with a Shape-Based algorithm). Additionally, the use of a fuzzy c-means algorithm for the Pointwise clustering portion of the method would ensure that every input point yields a membership value in each of the clusters [5].

(c) Implementation

In future work, we would implement this method in MATLAB. We would convert the C program TC_linkage_infer to MATLAB code, and run raw artificial temporal data via a MATLAB script through three MATLAB files: tsd.m, timeshift.m, and finally fcv.m (see http://collablab.northwestern.edu/irsal/bioinformatics/Template-Shape-Based%20Method/ for the files and script).

5. Conclusion

This paper explored and compared different data clustering algorithms in the bioinformatics space, particularly for the gene clustering of temporal microarray data. The most popular algorithms for general clustering (hierarchical, k-means, etc.) are also the most popular algorithms for gene clustering (Section 2.1), which we categorized as the Pointwise Similarity method. The rest of the methods we have described also use these popular clustering algorithms, but only as the last step. Feature-Based, Shape-Based and Template-Based methods use a pre-clustering step that transforms the raw data into vectors of processed data, and a clustering step that uses a similarity scale or matrix to group similar genes over time. We have proposed a new method that aims to minimize information loss and to reduce the processing time of the Shape-Based method. We plan on implementing this combined method in a later iteration, and currently have a framework for how it will be implemented. Further comparisons of these algorithms could be performed by evaluating them against a common set of factors and testing the differences in effectiveness for statistical significance.
It is important to note that each algorithm is used at a different point in microarray experiments (Pointwise is used in early analyses, while Shape-Based is used in later analyses), so there is no single best method. Rather, we have methods that work for shorter time series (Shape-Based), methods that are restrictive to scientists (Feature-Based), and methods that are prone to information loss (Template-Based and Pointwise). Our proposed method is intended to address these problems.

Acknowledgements

We would like to thank Professor Kao for his excellent teaching of a very difficult subject matter (Bioinformatics Algorithms), and for his guidance on this paper.

References

[1] O. Abu Abbas. Comparisons between data clustering algorithms. The International Arab Journal of Information Technology, 5:3-320, 2008.
[2] M. B. Eisen, P. T. Spellman, P. O. Brown, and D. Botstein. Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA, 95:14863–14868, 1998.
[3] J. Ernst, G. J. Nau, and Z. Bar-Joseph. Clustering short time series gene expression data. Bioinformatics, 21 Suppl 1:i159–168, 2005.
[4] L. Kuenzel. Gene clustering methods for time series microarray data. Biochemistry 215, 2010.
[5] M. A. Amin, N. V. Afzulpurkar, M. N. Dailey, V. E. Esichaikul, and D. N. Batanov. Fuzzy-c-mean determines the principle component pairs to estimate the degree of emotion from facial expressions. In International Conference on Natural Computation and International Conference on Fuzzy Systems and Knowledge Discovery, pages 484–493, 2005.
[6] C. S. Moller-Levet, K. H. Cho, and O. Wolkenhauer. Microarray data clustering based on temporal variation: FCV with TSD preclustering. Appl Bioinformatics, 2:35–45, 2003.
[7] J. Qian, M. Dolled-Filhart, J. Lin, H. Yu, and M. Gerstein. Beyond synexpression relationships: Local clustering of time-shifted and inverted gene expression profiles identifies new, biologically relevant interactions. J Mol Biol, 314:1053–1066, 2001.
[8] D. Sahoo, D. L. Dill, R. Tibshirani, and S. K. Plevritis. Extracting binary signals from microarray time-course data. Nucleic Acids Res, 35:3705–3712, 2007.