Analysis of Gene expression data using MATLAB Software

R. Priscilla#1, C.N Prashantha*2, S. Swamynathan#3
# Department of Computer Science and Engineering, Anna University Chennai, Guindy, Chennai, India
1 [email protected] 3 [email protected]
* Center for Bioinformatics Research Institute, Chennai, 203/1, Arcot Road, Vadapalani, Chennai, India
2 [email protected]

Abstract- In recent years, there have been various efforts to overcome the limitations of standard clustering approaches for the analysis of gene expression data by grouping genes and samples simultaneously. The underlying concept, often referred to as biclustering, makes it possible to identify sets of genes sharing compatible expression patterns across subsets of samples, and its usefulness has been demonstrated for different organisms and datasets. Several biclustering methods have been proposed in the literature; however, it is not clear how the different techniques compare with each other with respect to the biological relevance of the clusters or to other characteristics such as robustness and sensitivity to noise. Accordingly, no guidelines concerning the choice of biclustering method are currently available. MATLAB offers several options for importing microarray data, normalizing and standardizing it, and clustering it, but no information is available on biclustering with visualization based on parallel coordinate plots.

Results- First, this paper describes clustering and biclustering algorithms and compares and validates them using simple binary reference models. The bi-clustering analysis works on subsets of groups: the data are first classified with a clustering method, and the two groups of results are then compared on normalized and standardized data. The clustering and bi-clustering groups are compared in MATLAB.
The software is implemented for automated, robust data import, normalization, standardization and visualization with parallel coordinates, using bi-clustering programs and algorithms.

Conclusion- Once microarray data are imported into MATLAB, cluster groups and bi-cluster groups are easy to calculate, and parallel coordinate plots make the resulting gene expression patterns easy to understand.

Key words- Clustering, Bi-Clustering, Parallel coordination, MATLAB implementation.

I. INTRODUCTION

Clustering has rapidly become one of the most widely used and long-established methods for analyzing microarray gene expression data. The literature describes many clustering algorithms, but little attention has been paid to uncertainty in the results obtained. In clustering, the patterns of expression of different genes across time, treatments, tissues and color intensity are grouped into distinct clusters (for example by hierarchical or k-means methods), in which genes in the same cluster are assumed to be potentially functionally related or to be influenced by a common upstream factor. Such cluster structure is often used to aid the elucidation of regulatory networks. Agglomerative hierarchical clustering [1] is one of the most frequently used methods for clustering gene expression profiles. However, commonly used methods for agglomerative hierarchical clustering rely on setting a score threshold to distinguish members of a particular cluster from non-members, which makes the determination of the number of clusters arbitrary and subjective. The algorithm provides no guide to choosing the "correct" number of clusters or the level at which to prune the tree. It is often difficult to know which distance metric to choose, especially for structured data such as gene expression profiles.
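The dependence on a score threshold described above can be illustrated with a minimal sketch. It is written in Python using only the standard library, not in MATLAB, and it is not the implementation used in this paper; the function name `single_linkage`, the toy expression profiles and the threshold values are all assumptions chosen for illustration.

```python
import math

def euclidean(a, b):
    """Euclidean distance between two expression profiles."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def single_linkage(profiles, threshold):
    """Agglomerative single-linkage clustering (hypothetical helper).

    Repeatedly merges the two closest clusters and stops as soon as the
    closest pair is farther apart than `threshold`. Returns a list of
    clusters, each a list of row indices into `profiles`.
    """
    clusters = [[i] for i in range(len(profiles))]
    while len(clusters) > 1:
        best = None  # (distance, index of cluster a, index of cluster b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(euclidean(profiles[i], profiles[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # pruning level reached: no pair is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters

# Toy profiles (made-up values): five genes measured under three conditions.
profiles = [
    [0.0, 0.1, 0.2],
    [0.1, 0.2, 0.3],
    [1.0, 1.1, 1.2],
    [1.2, 1.3, 1.4],
    [5.0, 5.1, 5.2],
]

# The same data yield 4, 3, 2 or 1 clusters depending on the threshold.
for threshold in (0.2, 0.5, 2.0, 10.0):
    print(threshold, len(single_linkage(profiles, threshold)))
```

Nothing in the data favors one of these cut levels over another, which is the arbitrariness the paragraph above points to.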
Moreover, these approaches do not provide a measure of uncertainty about the clustering, making it difficult to compute the predictive quality of the clustering and to make comparisons between clusterings based on different model assumptions (e.g. numbers of clusters, shapes of clusters, etc.). Attempts to address these problems in a classical statistical framework have focused on the use of bootstrapping [4,5] or the use of permutation procedures to calculate local p-values for the significance of branching in a dendrogram produced by agglomerative hierarchical clustering [6,7]. This paper studies the clustering and bi-clustering algorithms already present in the literature; these algorithms can be used to implement the MATLAB software. We list seven different types of clustering algorithms: single linkage (SL), complete linkage (CL), average linkage (AL), k-means (KM), mixture of multivariate Gaussians (FMG), spectral clustering (SPC) and shared nearest neighbor-based clustering (SNN). When applicable, we use four proximity measures together with these methods: Pearson's correlation coefficient (P), cosine (C), Spearman's correlation coefficient (SP) and Euclidean distance (E).
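The four proximity measures named above can be sketched as follows. This is an illustrative Python example using only the standard library, not the MATLAB code of this work; the function names and the toy gene profiles are assumptions.

```python
import math

def pearson(x, y):
    """Pearson's correlation coefficient (P)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def cosine(x, y):
    """Cosine similarity (C)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def ranks(values):
    """Ranks starting at 1; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average
        i = j + 1
    return result

def spearman(x, y):
    """Spearman's correlation coefficient (SP): Pearson on the ranks."""
    return pearson(ranks(x), ranks(y))

def euclidean(x, y):
    """Euclidean distance (E)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Toy profiles (made-up values): gene_b is a scaled copy of gene_a,
# gene_c is a monotone but non-linear transform of gene_a.
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [2.0, 4.0, 6.0, 8.0]
gene_c = [1.0, 4.0, 9.0, 16.0]

print(pearson(gene_a, gene_b), cosine(gene_a, gene_b), euclidean(gene_a, gene_b))
print(pearson(gene_a, gene_c), spearman(gene_a, gene_c))
```

In this toy case the correlation-based measures treat gene_a and gene_b as identical in shape although their Euclidean distance is large, and Spearman's coefficient also treats gene_c as perfectly concordant with gene_a while Pearson's does not. This is one reason the choice of proximity measure affects the clusters obtained.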
Regarding Euclidean distance, we employ the data in four different versions: original (Z0), standardized (Z1), scaled (Z2) and ranked (Z3). Researchers have developed many clustering and biclustering algorithms (Table I); implementing these algorithms in MATLAB supports a more advanced and automated understanding of gene expression from microarray data analysis [8].

Table I List of clustering and Bi-clustering Algorithms

Clustering Algorithms                           Bi-Clustering Algorithms
k-Center                                        Block clustering
k-Median/k-Median-squared/Facility Location     CTWC
Hierarchical Clustering                         ITWC
Clustering Large Data Sets                      δ-bicluster
Clustering Data Streams                         δ-pCluster
Spectral Clustering                             δ-pattern
Conceptual Clustering                           FLOC
Bi-clustering                                   OPC
Correlation Clustering                          Plaid Model
Clustering with Outliers                        OPSMs
Clustering Moving Points                        Gibbs
SVM Clustering                                  SAMBA
Catalog Segmentation                            Robust Biclustering Algorithm (RoBA)
Community Discovery                             Crossing Minimization, cMonkey
Axioms of Clustering                            PRMs
Cluster Evaluation                              DCC
Model-based Clustering                          LEB (Localize and Extract Biclusters)
Categorical Clustering                          QUBIC (QUalitative BIClustering)
Projective Clustering                           BCCA (BiCorrelation Clustering Algorithm)
Dimension Reduction                             ZBDD
Scatter/Gather
Text Clustering

Cheng and Church introduced a measure called the mean squared residue score to evaluate the quality of a bicluster, and it has become one of the most popular measures used to search for biclusters. Later work [9] reviewed the basic concepts of the metaheuristic Greedy Randomized Adaptive Search Procedure (GRASP), namely its construction and local search phases, and proposed a new method, a variant of GRASP called Reactive Greedy Randomized Adaptive Search Procedure (Reactive GRASP), to detect significant biclusters in large microarray datasets. The method has two major steps. First, high-quality bicluster seeds are generated by means of k-means [10] clustering.
In the second step, these seeds are grown using Reactive GRASP, in which the basic parameter that defines the restrictiveness of the candidate list is self-adjusted, depending on the quality of the solutions found previously. All these bi-clustering algorithms belong to a distinct class of clustering algorithms that perform simultaneous clustering of both rows and columns of the gene expression matrix, and they can be a very useful analysis tool when some genes have multiple functions and experimental conditions are diverse [11]. K-means clustering and hierarchical clustering, together with statistical calculations such as the mean, standard deviation, correlation and regression, can be used to build the bi-clustering algorithms in MATLAB and to visualize the results as graphs based on parallel coordinate methods.

From this literature survey, the following methodology is proposed for MATLAB [13]: mainly a bi-clustering algorithm and visualization of the data based on parallel, antiparallel and neural network analysis and coordinate analysis. The microarray data are preprocessed by taking the logarithm of the gene expression values and applying k-means clustering; the detected biclusters are then filtered according to specified requirements such as the minimum number of rows, minimum number of columns, maximum number of biclusters and maximum overlap [14,15] to obtain the differential gene expression values. These values are compared using regression and correlation calculations. Results based on gene expression difference and ratio matrices can be defined in MATLAB. Other common functions write the biclustering results to text files.

II. MATERIALS AND METHODS

A. Data Collection
The microarray data were collected from GEO (Gene Expression Omnibus) and SMD (Stanford Microarray Database). The example data set is Diabetes Nephropathy, with GEO entry GDS961, parent platform ID GPL91 and reference series GSE1009. The data set values are downloaded and imported into MATLAB for further analysis.

B. Data import
The selected data from GEO can be imported into MATLAB through Microsoft Excel and an image analysis process. The data can be updated in the command prompt and workspace window. The analysis of this large volume of data with a number of algorithms and calculations is explained in the following steps.

C. Normalization
The Affymetrix GeneChip microarray uses a single-label scheme and consists of several tens of thousands of probe sets. A probe set is a collection of probe pairs that interrogates the same sequence, or set of sequences, and contains 11-20 probe pairs of 25-mer oligonucleotides. Each pair contains the complementary sequence to the gene of interest, the so-called perfect match (PM), and a specificity control, called the mismatch (MM). MM probes are designed to discriminate non-specific hybridization. To analyze GeneChip data with multiple arrays, data preprocessing at the probe level is a critical step. Global background correction uses a signal and noise (background) convolution model in which the PM intensity distribution is modeled by an exponentially distributed signal component S with parameter λ and a normally distributed background component B with mean μ and standard deviation σ:

$$PM = S + B, \qquad S \sim \mathrm{Exp}(\lambda), \qquad B \sim N(\mu, \sigma^2)$$

E(S|PM) represents the background-corrected value of each PM. In the standard RMA formulation, with λ taken as the rate of the exponential, it is

$$E(S \mid PM = pm) = a + b\,\frac{\phi(a/b) - \phi\big((pm - a)/b\big)}{\Phi(a/b) + \Phi\big((pm - a)/b\big) - 1}, \qquad a = pm - \mu - \sigma^2\lambda, \qquad b = \sigma$$

where φ and Φ are the normal density and cumulative distribution function, respectively. Positive signal components are estimated after adjustment of the background components. This background correction is implemented in the robust multi-array average (RMA).

D. Clustering
1) Hierarchical clustering
The hierarchical clustering of these data is evaluated using three scores: node score, level score and tree score.

Node score: each node of the tree specifies a cluster, and enrichment p-values can be calculated to associate the given node with one of the classes in the data. The significance p-value of observing k instances assigned by the algorithm to a given category in a set of n instances is given by the hypergeometric tail probability

$$p = \sum_{x=k}^{n} \frac{\binom{K}{x}\binom{N-K}{n-x}}{\binom{N}{n}}$$

where K is the total number of instances assigned to the class (the category) and N is the number of instances in the dataset. The p-values for all nodes and all classes may be viewed as dependent set estimations.

Level score: a level l of the tree contains all nodes that are separated by l edges from the root. Each level specifies a partition of the data into clusters. For each node, the class for which the node has a significant node score is chosen, and the J-score is computed as J = tp/(tp + fn + fp), where tp is the number of true positive cases, fn the number of false negative cases and fp the number of false positive cases. If the node in question has been judged non-significant by the enrichment criterion, its J-score is set to null. The level score is defined as the average of all J-scores at the given level.

Tree score: the tree score is defined as the weighted best-J score

$$\text{Tree score} = \sum_{i=1}^{c} \frac{n_i}{N}\, J^*_i$$

where J*_i is the best J-score for class i in the tree, n_i is the number of instances in class i, c is the number of classes and N is the number of instances in the dataset.

2) K-means Clustering
K-means clustering is used to compute cluster means for noisy data. In its fuzzy form, the objective function is

$$J(K, m) = \sum_{k=1}^{K} \sum_{i=1}^{N} (u_{ki})^m\, d^2(x_i, c_k)$$

where K and N are the number of clusters and genes in the data set, m is a parameter that relates to the "fuzziness" of the resulting clusters, u_ki is the degree of membership of gene x_i in cluster k, and d^2(x_i, c_k) is the squared distance from gene x_i to centroid c_k.

E. Bi-Clustering
The ZBDD algorithm is used to identify biclusters in binary data, in which the rows and columns contain 0s and 1s. Zero-suppressed BDDs (ZBDDs) are a variant of ROBDDs that represent a set of combinations. A combination of n elements is an n-bit vector (x1, x2, ..., xn) ∈ B^n, where B = {0,1}. The i-th bit reports whether the i-th element is contained in the combination.
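The bit-vector encoding of a combination can be sketched as follows. This is a plain Python illustration of the encoding only: it stores the vectors in an ordinary set and is not a ZBDD, which would share structure between combinations in a decision diagram. The helper names and the example family of combinations are assumptions.

```python
def to_bits(combination, n):
    """Encode a combination (a set of 1-based element indices) as an n-bit tuple.

    Bit i is 1 exactly when element i is contained in the combination.
    """
    return tuple(1 if i in combination else 0 for i in range(1, n + 1))

def from_bits(bits):
    """Decode an n-bit tuple back into the set of contained element indices."""
    return {i for i, bit in enumerate(bits, start=1) if bit == 1}

# A small, made-up set of combinations over n = 4 elements.
n = 4
family = [{1, 3}, {2}, {1, 2, 4}]
encoded = {to_bits(c, n) for c in family}

def contains(bits):
    """Membership test: is the combination encoded by `bits` in the set?"""
    return tuple(bits) in encoded

print(to_bits({1, 3}, n))      # (1, 0, 1, 0)
print(from_bits((1, 1, 0, 1)))
print(contains((0, 1, 0, 0)), contains((1, 1, 1, 1)))
```

For binarized expression data, each element would correspond to a column (condition) and each combination to the set of columns in which a gene's value is 1.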
Thus, a set of combinations can be represented by a Boolean function f : B^n → B. A combination given by the input vector (x1, x2, ..., xn) is contained in the set if and only if f(x1, x2, ..., xn) = 1.

F. Parallel Co-ordination
One way to visualize the high-dimensional data calculated in MATLAB is the parallel coordinate (PC) plot. All axes are arranged parallel to each other on a plane. An additive-related bicluster shows a number of lines with the same slope across the conditions. Thus a bicluster over columns {C2-C1, C3-C1} with rows R1, R3, R5, R9 and R11 can be visualized with this type of arrangement in PC plots.

III. RESULTS
Various biclustering algorithms are used to identify gene expression in Diabetes Nephropathy. The data set contains six samples, with log-normalized data from 6 experiments on 5040 genes. Many online resources are available for gene expression data; important ones include the Stanford Microarray Database website [10] and the Gene Expression Omnibus website. The input data for this work were obtained from the GEO website (Figure 1.1a and 1.1b). Of the six samples, 3 are controls and 3 are diabetes nephropathy disease samples (Figure 1.2). The data samples are normalized and processed with the hierarchical clustering method, and the resulting data are represented by the node score, level score and tree score (Figure 1.3 and Figure 1.4). The raw data are summarized statistically by the mean and standard deviation in k-means clustering (Figure 1.5a and 1.5b). The two clustering results are compared using the ZBDD bi-clustering algorithm to observe the gene expression based on rigidity of sample. Down-regulated genes are assigned to level 0 and up-regulated genes to level 1. The expression of these genes shows (a) the response time spent by each method to find all the embedded biclusters in synthetic data sets of various sizes and (b) the number of biclusters found by each method within the same time spent as our method (Figure 1.6a, 1.6b, 1.6c and 1.6d). The parallel coordinate plot is used to visualize the different clustering results (Figure 1.7).

Figure 1.1a. Diabetes Nephropathy sample data
Figure 1.1b. Sample subsets
Figure 1.2. Import data into MATLAB software
Figure 1.3. Normalized data
Figure 1.4. Hierarchical clustering
Figure 1.5a. k-means clustering
Figure 1.5b. k-means clustering of subset data
Figure 1.6a. Bi-clustering
Figure 1.6b. Bi-clustering complete
Figure 1.6c. Bi-clustering up regulation of genes
Figure 1.6d. Bi-clustering down regulation of genes
Figure 1.7. Parallel co-ordination plot

IV. CONCLUSION
Different gene expression levels in diabetes nephropathy were identified using clustering techniques followed by bi-clustering methods. Clustering is the assignment of a set of observations into subsets (called clusters) so that observations in the same cluster are similar in some sense; it is a method of unsupervised learning and a common technique for statistical data analysis. The clustering results show up-regulated and down-regulated genes that overlap heavily, whereas the ZBDD bi-clustering method gives a clear interpretation of neural network clusters. These results provide a comparative study [12] of the two methods. The work implemented in MATLAB is visualized using parallel coordinate plots.

ACKNOWLEDGEMENT
The authors express their sincere thanks to the Department of Computer Science and Engineering, Anna University Chennai, and the Department of Bioinformatics, Centre for Bioinformatics Research Institute, Chennai, for providing the necessary facilities to conduct the research work.
REFERENCES
[1] Eisen M, Spellman P, Brown P, Botstein D: Cluster Analysis and Display of Genome-wide Expression. PNAS 1998, 95:14863-14868.
[2] Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad Patterns of Gene Expression Revealed by Clustering Analysis of Tumor and Normal Colon Tissues Probed by Oligonucleotide Arrays. Proc Natl Acad Sci 1999, 96:6745-6750.
[3] McLachlan G, Bean R, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics 2002, 18(3):413-422.
[4] Kerr M, Churchill G: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments. Proceedings of the National Academy of Sciences 2001, 98(16):8961.
[5] Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data. Funct Integr Genomics 2000, 1:156-173.
[6] Hughes T, Marton M, Jones A, Roberts C, Stoughton R, Armour C, Bennett H, Coffey E, Dai H, He Y, Kidd M, King A, Meyer M, Slade D, Lum P, Stepaniants S, Shoemaker D, Gachotte D, Chakraburtty K, Simon J, Bard M, Friend S: Functional Discovery via a Compendium of Expression Profiles. Cell 2000, 102:109-126.
[7] Levenstien M, Yang Y, Ott J: Statistical significance for hierarchical clustering in genetic association and microarray expression studies. BMC Bioinformatics 2003, 4:62.
[8] Richard S Savage, Katherine Heller, Yang Xu, Zoubin Ghahramani, William M Truman, Murray Grant, Katherine J Denby and David L Wild, R/BHC: fast Bayesian hierarchical clustering for microarray data. Systems Biology Centre, University of Warwick, published 6 August 2009.
[9] Michael B. Eisen, Paul T. Spellman, Patrick O. Brown and David Botstein, Cluster analysis and display of genome-wide expression patterns. Department of Genetics and Department of Biochemistry and Howard Hughes Medical Institute, Stanford University School of Medicine, 300 Pasteur Avenue, December 1998.
[10] Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Department of Molecular Biology, Princeton University, Princeton, NJ 08540, USA. 1999.
[11] Johannes M Freudenberg, Vineet K Joshi, Zhen Hu and Mario Medvedovic, CLEAN: CLustering Enrichment ANalysis. Laboratory for Statistical Genomics and Systems Biology, Department of Environmental Health, University of Cincinnati College of Medicine, 2009.
[12] Marcilio CP de Souto, Ivan G Costa, Daniel SA de Araujo, Teresa B Ludermir and Alexander Schliep, Clustering cancer gene expression data: a comparative study. Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany, Brazil, 2008.
[13] Fernando A. Beltrán, José R. Beltrán, Nicolas Holzem, Adrian Gogu, Matlab Implementation of Reverberation Algorithms. Department of Electronic Engineering and Communications, University of Zaragoza (Spain), 2009.
[14] Smitha Dharan and Achuthsankar S Nair, Biclustering of gene expression data using reactive greedy randomized adaptive search procedure. Centre for Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, 695 581, India.
[15] Afolabi Olomola and Sumeet Dua, Biclustering of Gene Expression Data Using Conditional Entropy. Data Mining Research Laboratory (DMRL), Department of Computer Science, Louisiana Tech University, Ruston, LA, U.S.A.; School of Medicine, Louisiana State University Health Sciences. 2009.