Sentamarai Kannan, et al International Journal of Computer and Electronics Research [Volume 2, Issue 2, April 2013]
ISSN: 2278-5795 © http://ijcer.org

MUTUAL INFORMATION-BASED SUPERVISED ATTRIBUTE CLUSTERING FOR LARGE MICROARRAY SAMPLE CLASSIFICATION

Dr. Senthamarai Kannan M.E., Ph.D.1, Sherin Mariam John2
1 Head of the Department, M.E (CSE), Sethu Institute of Technology, Kariapatti, India
2 Final year M.E student, Sethu Institute of Technology, Kariapatti, India
1 [email protected], [email protected]

Abstract - This paper investigates the application of the mutual information criterion to evaluate a set of attributes and to select an informative subset to be used as input data for microarray classification. A microarray is a multiplex lab-on-a-chip: a 2D array on a solid substrate. Among the large number of genes it measures, only a small fraction is effective for performing a certain task. One of the major tasks with gene expression data is to find groups of coregulated genes whose collective expression is strongly associated with the sample categories or response variables. In this regard, a new supervised attribute clustering algorithm is proposed to find such groups of genes. It directly incorporates the information of sample categories into the attribute clustering process. A new quantitative measure, based on mutual information, is introduced which incorporates the information of sample categories to measure the similarity between attributes. This similarity measure is useful for reducing the redundancy among the attributes. The clustering algorithm is effective for identifying biologically significant gene clusters with excellent predictive capability. A fuzzy classification algorithm is then applied to classify the selected gene set. The proposed algorithm also avoids the noise sensitivity problem of existing supervised gene clustering algorithms.

Keywords - Attribute Clustering, Microarray, Gene Selection, Mutual Information, Classification.

I. INTRODUCTION

A microarray gene expression data set consists of the measured expression level of each gene in a sample. Among the large number of genes in such data, only a small number are effective for performing a given task. The algorithm proposed here attempts to find this effective gene set by means of a supervised clustering algorithm.

An important application of gene expression data in functional genomics is to classify samples according to their gene expression profiles. A microarray gene expression data set can be represented by an expression table, where each row corresponds to one particular gene, each column to a sample, and each entry is the measured expression level of a particular gene in a sample. For most gene expression data, however, the number of training samples is still very small compared to the large number of genes involved in the experiments, and among this large number of genes only a small fraction is effective for performing a certain task. Also, a small subset of genes is desirable in developing gene expression-based diagnostic tools for delivering precise, reliable, and interpretable results. With the gene selection results, the cost of biological experiments and decisions can be greatly reduced by analyzing only the marker genes. Hence, identifying a reduced set of the most relevant genes is the goal of gene selection.

The small number of training samples and the large number of genes make gene selection a more relevant and challenging problem in gene expression-based classification. As this is a feature selection problem, a clustering method can be used, which partitions the given gene set into subgroups, each of which should be as homogeneous as possible. Different criteria may lead to different clustering results.
However, every criterion tries to measure the similarity among the subset of genes present in a cluster. While tree harvesting uses an unsupervised similarity measure to group a set of coregulated genes, other supervised algorithms such as supervised gene clustering, gene shaving, and the partial least squares procedure do not use any similarity measure to cluster genes.

A new supervised attribute clustering algorithm is proposed to find coregulated clusters of genes whose collective expression is strongly associated with the sample categories or class labels. A cluster is formed around each relevant attribute by adding attributes to it one after the other. The growth of the cluster is repeated until the cluster stabilizes.

II. RELATED WORK

Model-Based Bayesian Clustering (MBBC) assumes that genes within each cluster follow a Bayesian linear mixed model with cluster-specific and gene-specific random effect terms. To find the optimal partition with respect to the Bayesian objective function, a number of options can be considered: Metropolis-Hastings, Gibbs sampling, biased random walk, and “heating” the chain by methods such as simulated tempering. All of these alternatives have been considered, and a well-tuned split-merge algorithm, the basis of MBBC, was found to be simple and to converge quite rapidly.

2.1 Measuring mRNA levels

Compared with the traditional approach to genomic research, which has focused on the local examination and collection of data on single genes, microarray technologies have now made it possible to monitor the expression levels of tens of thousands of genes in parallel. The two major types of microarray experiments are cDNA microarrays and oligonucleotide arrays. An experiment involves the following steps.

Chip manufacture: A microarray is a small chip onto which tens of thousands of DNA molecules are attached in fixed grids. Each grid cell relates to a DNA sequence.
Target preparation, labeling and hybridization: Typically, two mRNA samples are reverse-transcribed into cDNA, labeled using either fluorescent dyes or radioactive isotopes, and then hybridized with the probes on the surface of the chip.

The scanning process: Chips are scanned to read the signal intensity that is emitted from the labeled and hybridized targets.

2.2 Pre-processing of gene expression data

A microarray experiment typically assesses a large number of DNA sequences under multiple conditions. These conditions may be a time series during a biological process or a collection of different tissue samples. This paper focuses on the cluster analysis of gene expression data without making a distinction among DNA sequences, which will uniformly be called “genes”. Similarly, all kinds of experimental conditions will uniformly be referred to as “samples” if no confusion will be caused.

2.3 Parametric Bootstrap Model Selection

The parametric bootstrap, unlike cross-validation and the non-parametric bootstrap, requires a more detailed model for the underlying process that generated the data y = ( ). A parametric bootstrap model is used for more accurate estimation of the prediction error; it is tailored to microarray data by borrowing from the extensive research on identifying differentially expressed genes, especially the local false discovery rate.

III. EXISTING SYSTEM

Existing work on gene expression data analysis uses Bayesian clustering, hierarchical clustering, k-means clustering, etc. All of these algorithms use unsupervised similarity measures for gene grouping. They do not use the sample categories or response variables when performing the clustering. Other supervised algorithms, such as gene shaving, supervised gene clustering, and partial least squares, do not use any similarity measure.
Instead of a similarity measure, they use other predictive measures such as the Wilcoxon test and the Cox model score test. Existing works that depend on an unsupervised similarity measure have the disadvantage that they do not use the mutual information about the gene expression data. Methods that do not use any similarity measure have the following drawbacks: redundancy among attributes cannot be removed properly, they are sensitive to noise, and their results may contain outliers.

IV. PROPOSED SYSTEM

A new supervised attribute clustering algorithm is presented for grouping coregulated genes with strong association to the class labels.

4.1 Load Data

The gene data, with its attributes, is taken as the input. An example is a cancer data set that has cell size, cell shape, etc. as its attributes. The sample file is split into two different samples according to their class labels. The attributes present in the samples are then identified.

4.2 Similarity Computation

This step calculates the similarity between attributes based on mutual information. Prior to computing the similarity, three basic quantities are calculated from the gene expression values: entropy, conditional entropy, and mutual information. Computing these three quantities requires the probability density function to be computed first. The supervised similarity between attributes is then computed.

Figure 1: Representation of system architecture.

4.3 Coarse Cluster Formation

Initially, the attribute with the highest relevance value is taken as the representative. The initial cluster is formed by selecting, from the whole attribute set, the attributes related to this representative. Hence, the coarse cluster consists of a set of attributes whose relevance values are close to that of the representative. This process is repeated iteratively.

4.4 Finer Cluster

The representative selected in the previous module is refined incrementally. The current representative is compared with the various clusters and merged to form a single augmented cluster representative, which increases its relevance value. The merging is repeated until the representative no longer improves. Finally, a finer cluster with effective attributes is obtained.

4.5 Classification

Classification is a data mining function that assigns items or attributes in a collection to target categories or classes. When classification is applied to biomedical data, it predicts the outcome of risk factors. In this work, the Cost-Effective Fuzzy Classification Algorithm is used to produce the disease diagnosis result.

V. CONCLUSION

In this paper we have demonstrated the feasibility and effectiveness of the proposed method. A new quantitative measure, based on mutual information, is defined to calculate the similarity between two genes. It incorporates the information of sample categories or class labels. A new supervised attribute clustering algorithm has been developed to find coregulated clusters of genes. The performance of the proposed method and some existing methods has been compared using the class separability index and the predictive accuracy of the support vector machine, K-nearest neighbor rule, and naive Bayes classifier. The proposed method is capable of identifying coregulated clusters of genes whose average expression is strongly associated with the sample categories. The identified gene clusters may contribute to revealing underlying class structures, providing a useful tool for the exploratory analysis of biological data.

VI. REFERENCES

[1] T.R. Golub, D.K. Slonim, P. Tamayo, C. Huard, M. Gaasenbeek, J.P. Mesirov, H. Coller, M.L. Loh, J.R. Downing, M.A. Caligiuri, C.D. Bloomfield, and E.S.
Lander, “Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring,” Science, vol. 286, no. 5439, pp. 531-537, 1999.
[2] E. Domany, “Cluster Analysis of Gene Expression Data,” J. Statistical Physics, vol. 110, nos. 3-6, pp. 1117-1139, 2003.
[3] J.G. Liao and K.-V. Chin, “Logistic Regression for Disease Classification Using Microarray Data: Model Selection in a Large p and Small n Case,” Bioinformatics, vol. 23, no. 15, pp. 1945-1951, 2007.
[4] L. Wang, F. Chu, and W. Xie, “Accurate Cancer Classification Using Expressions of Very Few Genes,” IEEE/ACM Trans. Computational Biology and Bioinformatics, vol. 4, no. 1, pp. 40-53, Jan.-Mar. 2007.
[5] P.A. Devijver and J. Kittler, Pattern Recognition: A Statistical Approach. Prentice Hall, 1982.
[6] D. Koller and M. Sahami, “Toward Optimal Feature Selection,” Proc. Int’l Conf. Machine Learning, pp. 284-292, 1996.
[7] R. Kohavi and G.H. John, “Wrappers for Feature Subset Selection,” Artificial Intelligence, vol. 97, nos. 1/2, pp. 273-324, 1997.
[8] A.K. Jain and R.C. Dubes, Algorithms for Clustering Data. Prentice Hall, 1988.
[9] R.O. Duda, P.E. Hart, and D.G. Stork, Pattern Classification and Scene Analysis. John Wiley and Sons, 1999.
[10] D. Jiang, C. Tang, and A. Zhang, “Cluster Analysis for Gene Expression Data: A Survey,” IEEE Trans. Knowledge and Data Eng., vol. 16, no. 11, pp. 1370-1386, Nov. 2004.
[11] W.-H. Au, K.C.C. Chan, A.K.C. Wong, and Y. Wang, “Attribute Clustering for Grouping, Selection, and Classification of Gene Expression Data,”
[12] A. Thalamuthu, I. Mukhopadhyay, X. Zheng, and G.C. Tseng, “Evaluation and Comparison of Gene Clustering Methods in Microarray Analysis,” Bioinformatics, vol. 22, no. 19, pp. 2405-2412, 2006.
[13] M. Medvedovic and S.
Sivaganesan, “Bayesian Infinite Mixture Model Based Clustering of Gene Expression Profiles,” Bioinformatics, vol. 18, no. 9, pp. 1194-1206, 2002.
[14] Y. Joo, J.G. Booth, Y. Namkoong, and G. Casella, “Model-Based Bayesian Clustering (MBBC),” Bioinformatics, vol. 24, no. 6, pp. 874-875, 2008.
[15] J. Herrero, A. Valencia, and J. Dopazo, “A Hierarchical Unsupervised Growing Neural Network for Clustering Gene Expression Patterns,” Bioinformatics, vol. 17, pp. 126-136, 2001.
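
As an illustration of the quantities named in Section 4.2 (entropy, conditional entropy, and mutual information), the following is a minimal sketch and not the authors' implementation. It assumes that the probability estimates come from equal-width binning of the expression values, which the paper does not specify. The function names, the bin count, and the toy data are all hypothetical. The sketch scores the relevance of a single attribute against the class labels; the paper's supervised similarity between pairs of attributes is not reproduced here.

```python
import math
from collections import Counter


def discretize(values, bins=2):
    """Map continuous expression values to equal-width bin indices.

    Assumption: binning stands in for the probability estimation step,
    which the paper does not describe in detail.
    """
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def entropy(symbols):
    """Shannon entropy H(X) in bits, from empirical frequencies."""
    n = len(symbols)
    return -sum((c / n) * math.log2(c / n) for c in Counter(symbols).values())


def conditional_entropy(x, y):
    """H(X | Y) = H(X, Y) - H(Y)."""
    return entropy(list(zip(x, y))) - entropy(y)


def mutual_information(x, y):
    """I(X; Y) = H(X) - H(X | Y)."""
    return entropy(x) - conditional_entropy(x, y)


# Toy data, invented for illustration: six samples in two classes.
labels = [0, 0, 0, 1, 1, 1]
gene_a = [0.10, 0.20, 0.15, 0.90, 0.95, 0.85]  # tracks the class label
gene_b = [0.10, 0.90, 0.10, 0.90, 0.10, 0.90]  # unrelated to the class label

relevance_a = mutual_information(discretize(gene_a), labels)
relevance_b = mutual_information(discretize(gene_b), labels)
print(f"relevance of gene_a: {relevance_a:.3f}")
print(f"relevance of gene_b: {relevance_b:.3f}")
```

In this toy case the attribute that separates the two classes receives a much higher relevance than the alternating one, which is the property Section 4.3 relies on when it picks the attribute with the highest relevance value as the initial representative.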
