Short original article

Use of Entropy and Shrinkage method for Gene Expression Data Analysis

Jiří Haman 1,2

1 First Faculty of Medicine, Charles University in Prague, Czech Republic
2 State Institute for Drug Control, Prague, Czech Republic

Correspondence to: Jiří Haman, Státní ústav pro kontrolu léčiv
Address: Šrobárova 48, 100 41 Prague 10
Email: [email protected]

Aims of Research

The aim of my research is to identify new approaches to dimension reduction and classification of high-dimensional gene expression data. Dimension reduction means reducing the large number of genes to only the important genes, which are somehow bundled together. Classification means finding, among a large number of genes, a few genes that are important for assigning an unknown sample to a certain group (such as a disease). I would like to base both tasks, i.e. dimension reduction and classification, on methods of information theory (see [1] and [2]). The best-known quantities in information theory are Shannon entropy (hereinafter referred to as entropy) and mutual information. I would like to combine methods of information theory with the shrinkage method, which is a generalization of the concept of the James-Stein estimate (see [3] and [4]).

State of the Art

Gene expression data from a biological sample are obtained from a microarray (see [5]). The data form the gene expression matrix: rows represent individual genes (or other gene elements), and columns represent samples applied to the microarray. The problem with this type of data is the large dimension, since the number of genes is far greater than the number of samples. High-dimensional data require different statistical procedures for their evaluation than the methods of classical statistics. Descriptions of some of these techniques can be found e.g. in [6]. The main problems are basic exploratory data analysis, clustering and classification.
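To make the quantities named in the aims concrete before turning to these problems, the following Python sketch shows one possible way to estimate entropy with James-Stein-type shrinkage and to derive mutual information from it. It follows the general idea of shrinking the observed cell frequencies toward a uniform target, as in [8]. The function names (shrink_freqs, entropy, mutual_information), the use of NumPy, and the assumption that expression values have already been discretized into bins are illustrative choices and are not taken from the original text.

```python
import numpy as np


def shrink_freqs(counts):
    """Shrink observed cell frequencies toward the uniform distribution.

    Assumption: `counts` holds non-negative counts of discretized values
    (any shape), with at least one observation in total.
    """
    counts = np.asarray(counts, dtype=float)
    n = counts.sum()
    if n <= 0:
        raise ValueError("counts must contain at least one observation")
    ml = counts / n                                # maximum likelihood frequencies
    target = np.full(ml.shape, 1.0 / ml.size)      # uniform shrinkage target
    denom = (n - 1.0) * np.sum((target - ml) ** 2)
    if denom == 0:
        lam = 1.0                                  # data equal the target, or n == 1
    else:
        lam = (1.0 - np.sum(ml ** 2)) / denom      # estimated shrinkage intensity
        lam = min(1.0, max(0.0, lam))              # keep the intensity in [0, 1]
    return lam * target + (1.0 - lam) * ml


def entropy(freqs):
    """Shannon entropy (in nats) of a frequency vector or table."""
    f = np.asarray(freqs, dtype=float).ravel()
    f = f[f > 0]                                   # 0 * log(0) is treated as 0
    return float(-np.sum(f * np.log(f)))


def mutual_information(joint_freqs):
    """Mutual information I(X;Y) = H(X) + H(Y) - H(X,Y) from a joint table."""
    joint = np.asarray(joint_freqs, dtype=float)
    h_x = entropy(joint.sum(axis=1))               # row marginal
    h_y = entropy(joint.sum(axis=0))               # column marginal
    return h_x + h_y - entropy(joint)
```

For a pair of genes, the input would be the contingency table of their discretized expression values across samples; calling mutual_information(shrink_freqs(table)) then gives a shrinkage-based estimate, whereas mutual_information(table / table.sum()) gives the plain plug-in estimate for comparison.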
Basic exploratory data analysis mainly deals with methods for finding genes that are differentially expressed between two or more groups.

IJBH Volume 2 (2014), Issue 1. IJBH 2014; 2(1):6-8. Received: July 4, 2014; accepted: July 15, 2014; published: August 15, 2014.

Classification deals with assigning an unknown sample to a specific group (healthy vs. diseased) on the basis of a classification rule. The classification rule is created from samples that are of the same type as the unknown sample and whose group membership is known. Clustering deals with finding structure among genes when knowledge of group membership cannot be used.

In my previous review article I described several applications of shrinkage methods to gene expression data analysis (see [7]). The article [7] also mentions the application of the shrinkage method to the calculation of entropy. The shrunken entropy value enters the estimate of mutual information, which is calculated for all pairs of genes. Based on mutual information, a gene association network is constructed: a value of mutual information that exceeds a certain threshold indicates a relationship between the two genes. The entropy estimate based on the shrinkage method is comparable to, and sometimes even better than, other entropy estimates in terms of estimation performance and computational complexity (see [8]). The algorithm used to create the gene association network is called ARACNE (Algorithm for Reconstruction of Accurate Cellular NEtworks). The original version of ARACNE is described in [9]. To my knowledge, the shrinkage method has not been applied to algorithms similar to ARACNE (see [10] or [11]), which are also based on calculating mutual information for all pairs of genes.

It is also possible to use shrinkage for the Minimum Redundancy Maximum Relevance (MRMR) criterion. The MRMR method is introduced in [12]. The idea of minimum redundancy is to select genes that are mutually different to the maximum extent.
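The MRMR idea can be illustrated with a small greedy selection sketch. It assumes discretized expression values and uses a plain plug-in estimate of mutual information (a shrinkage-based estimate could be substituted). Each candidate gene is scored by its mutual information with the group label minus its mean mutual information with the genes already selected, i.e. the difference form of the criterion. The function names plugin_mi and mrmr_select and the data layout (genes in rows, samples in columns) are hypothetical and serve only as an illustration.

```python
import numpy as np


def plugin_mi(x, y):
    """Plug-in mutual information (in nats) between two discrete vectors."""
    _, xi = np.unique(np.asarray(x).ravel(), return_inverse=True)
    _, yi = np.unique(np.asarray(y).ravel(), return_inverse=True)
    xi, yi = xi.ravel(), yi.ravel()
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)                # contingency table of counts
    joint /= joint.sum()                           # joint relative frequencies
    px = joint.sum(axis=1, keepdims=True)          # marginal of x (column vector)
    py = joint.sum(axis=0, keepdims=True)          # marginal of y (row vector)
    nz = joint > 0                                 # skip empty cells
    return float(np.sum(joint[nz] * np.log(joint[nz] / (px @ py)[nz])))


def mrmr_select(expr, labels, k):
    """Greedily pick k genes: high relevance to labels, low mutual redundancy.

    Assumption: `expr` is a (genes x samples) array of discretized values and
    `labels` holds one group label per sample.
    """
    expr = np.asarray(expr)
    n_genes = expr.shape[0]
    relevance = np.array([plugin_mi(expr[g], labels) for g in range(n_genes)])
    selected = [int(np.argmax(relevance))]         # start with the most relevant gene
    redundancy_sum = np.zeros(n_genes)
    while len(selected) < min(k, n_genes):
        last = selected[-1]
        redundancy_sum += np.array(
            [plugin_mi(expr[g], expr[last]) for g in range(n_genes)]
        )
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf                  # never pick the same gene twice
        selected.append(int(np.argmax(score)))
    return selected
```

In this sketch an exact copy of an already selected gene gets a low score despite its high relevance, because its redundancy with the selected gene cancels the relevance term.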
The idea of maximum relevance is to select genes that are related to the group to the maximum extent. Liu et al. [13] use a modified MRMR method in which the mutual information is standardised. Meyer et al. [14] use the Minimum Redundancy NETwork (MRNET) method for detecting a gene association network from microarray data. The MRNET method reuses the MRMR method. The relationship between two genes is determined using the MRMR method, which is applied to each of the two genes separately. Each gene receives a value that expresses its relevance relative to the rest of the genes. The maximum of these two values indicates whether or not the genes are related, depending on the choice of threshold value.

The shrinkage method may also be used in the methods described in the following two articles. In [15], a clustering algorithm is introduced that is based on the Havrda-Charvát structural α-entropy (see [16]), which leads to the recursive property of entropy. In cluster analysis the recursive property is suitable if gene clusters are nested in each other. In [17], an improved classification method is based on artificial neural networks using multistage mutual information, which means that the mutual information is calculated in several steps.

Application in Biomedicine and Healthcare

Microarray technology has applications in the pharmaceutical industry (see [18]). In 2005, the Food and Drug Administration (the medicines agency in the USA) approved the use of microarray technology in the evaluation of clinical trials relating to the registration of a new drug and in testing samples obtained from patients for diagnostic purposes. The primary goal is the development of new biomarkers that should tell the physician which treatment should be applied to the patient. Microarrays can be used e.g.
in oncology to distinguish early stages of cancer from advanced stages or to determine the specific type of chemotherapy for the patient.

Another example is the development of drugs. Drug development generally has several stages, beginning with a pre-clinical phase (drug testing on animals) and ending with large clinical trials that involve hundreds or thousands of patients. Drug development typically takes several years and is very expensive. It is possible that only in the later stages of development does it emerge that the drug is ineffective or, on the contrary, very dangerous. Microarray technology can detect this problem early and thus save money or lives. This is especially important when a completely new chemical compound is used for drug development.

Acknowledgement

The present study was supported by the project SVV 260034 of Charles University in Prague.

Keywords

Gene Expression Matrix
Definition: Matrix with columns representing biological samples and rows representing genes.
Synonyms: Gene Expression Data
Reference: [6]
UMLS: SNOMED CT: MeSH: no matching results found

Gene Expression Microarray
Definition: A gene expression microarray is used for simultaneous measurement of gene expression based on RNA obtained from a biological sample. It is either a silicon or glass chip which contains different segments of single-stranded DNA in precisely set positions. The length of these segments is in the order of tens of nucleotides.
Synonyms: Gene Chip
Reference: [18]
UMLS: C1709015
SNOMED CT: no matching results found
MeSH: no matching results found

High Dimensional Data
Definition: Data in which the number of variables measured is severalfold larger than the number of samples obtained by measuring these variables.
Reference: [19]
UMLS: no matching results found
SNOMED CT: no matching results found
MeSH: no matching results found

Mutual Information
Definition: A measure of the amount of information one random variable contains about another.
This is a concept from information theory.
Reference: [2]
UMLS: no matching results found
SNOMED CT: no matching results found
MeSH: no matching results found

Gene Association Network
Definition: Graph whose vertices correspond to individual genes and in which an edge between two vertices indicates a relationship between the genes.
Reference: [8]
UMLS: SNOMED CT: MeSH: no matching results found

References

[1] Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948 Oct; 27(3):379-423
[2] Cover TM, Thomas JA. Elements of Information Theory, Second Edition. Hoboken (NJ): John Wiley & Sons, Inc.; 2006
[3] Stein C. Inadmissibility of the Usual Estimator for the Mean of Multivariate Normal Distribution. Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability. 1956; 1:197-206
[4] Richards J. An Introduction to James-Stein Estimation. 1999 M.I.T. EECS Area Exam Report, 1999. http://www.yaroslavvb.com/papers/richards-introduction.ps (30th June 2014, date last accessed)
[5] Schena M, Shalon D, Davis RW, Brown PO. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science (New York, N.Y.). 1995 Oct; 270(5235):467-470
[6] Dziuda DM. Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data. Hoboken (NJ): John Wiley & Sons, Inc.; 2010
[7] Haman J, Valenta Z. Shrinkage Approach for Gene Expression Data Analysis. European Journal for Biomedical Informatics. 2013 Nov; 9(3):2-8
[8] Hausser J, Strimmer K. Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. Journal of Machine Learning Research. 2009 Jul; 10:1469-1484
[9] Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al. ARACNE: an Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics. 2006 Mar; 7(Supplement 1):S7
[10] Butte AJ, Kohane IS. Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. Pacific Symposium on Biocomputing. 2000; 5:415-426
[11] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-Scale Mapping and Validation of Escherichia Coli Transcriptional Regulation From a Compendium of Expression Profiles. PLOS Biology. 2007 Jan; 5(1):e8
[12] Ding C, Peng H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology. 2005 Apr; 3(2):185-205
[13] Liu X, Krishnan A, Mondry A. An Entropy-Based Gene Selection Method for Cancer Classification Using Microarray Data. BMC Bioinformatics. 2005 Mar; 6:76
[14] Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-Theoretic Inference of Large Transcriptional Regulatory Networks. EURASIP Journal On Bioinformatics & Systems Biology. 2007:79879
[15] Li H, Zhang K, Jiang T. Minimum Entropy Clustering and Applications to Gene Expression Analysis. Proceedings / IEEE Computational Systems Bioinformatics Conference. 2004:142-151
[16] Havrda J, Charvát F. Quantification Method of Classification Processes: Concept of structural α-entropy. Kybernetika. 1967; 3(1):30-34
[17] Kumar PG, Victoire TAA. Multistage Mutual Information for Informative Gene Selection. Journal of Biological Systems. 2011; 19(4):725-746
[18] Göhlmann H, Talloen W. Gene Expression Studies Using Affymetrix Microarrays. Boca Raton: Chapman & Hall / CRC; 2009
[19] Schäfer J, Strimmer K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology. 2005; 4(1):Article32

© 2014 EuroMISE s.r.o.
