Short original article
Use of Entropy and Shrinkage method
for Gene Expression Data Analysis
Jiří Haman 1,2
1 First Faculty of Medicine, Charles University in Prague, Czech Republic
2 State Institute for Drug Control, Prague, Czech Republic
Correspondence to:
Jiří Haman
Státní ústav pro kontrolu léčiv
Address: Šrobárova 48, 100 41 Prague 10
Email: [email protected]
Aims of Research
The aim of my research is to identify new approaches to dimension reduction and classification of
high-dimensional gene expression data. Dimension reduction means reducing the large number of
genes to only the important genes which are somehow
bundled together. Classification means
finding a few genes, out of a large number of genes, which are
important for assigning an unknown sample to a
certain group (such as a disease).
I would like to base both tasks, i.e. dimension
reduction and classification, on methods of information
theory (see [1] and [2]). The best-known quantities in
information theory are Shannon entropy (hereinafter referred to as entropy) and mutual information. I would like to
combine methods of information theory with the shrinkage method, which is a generalization of the concept of the
James-Stein estimate (see [3] and [4]).
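As a minimal sketch of the two quantities named above, entropy and mutual information can be computed from a table of joint counts. This assumes that the expression values of two genes have already been discretised into a small number of levels; the article does not fix a discretisation, and the function names and example counts are illustrative assumptions only.

```python
import numpy as np

def entropy(probs):
    """Shannon entropy (in bits) of a discrete probability distribution."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]                      # the term 0 * log(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

def mutual_information(joint):
    """I(X;Y) = H(X) + H(Y) - H(X,Y), computed from a joint probability table."""
    joint = np.asarray(joint, dtype=float)
    p_x = joint.sum(axis=1)           # marginal distribution of the row variable
    p_y = joint.sum(axis=0)           # marginal distribution of the column variable
    return entropy(p_x) + entropy(p_y) - entropy(joint)

# Hypothetical counts for two genes, each discretised into low/high expression.
counts = np.array([[8, 2],
                   [2, 8]])
joint = counts / counts.sum()         # plug-in (maximum likelihood) frequencies
h_x = entropy(joint.sum(axis=1))
mi = mutual_information(joint)
print(f"H(X) = {h_x:.3f} bits, I(X;Y) = {mi:.3f} bits")
```

The frequencies here are plain relative counts; with few samples per cell this plug-in estimate is unreliable, which is the situation the shrinkage method is meant to address.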
State of the Art
Gene expression data from a biological sample are obtained from a microarray (see [5]). The data form the
gene expression matrix. Rows represent individual genes
(or other gene elements); columns represent samples applied to the microarray. The problem with this type of
data is its high dimension, since the number of genes is
far greater than the number of samples.
High-dimensional data require different statistical procedures for their evaluation than the methods of classical statistics. A description of some of these techniques can
be found e.g. in [6]. The main problems are basic exploratory data
analysis, clustering and classification. Basic exploratory
data analysis mainly deals with methods for finding differentially expressed genes between two or more groups.
IJBH Volume 2 (2014), Issue 1
IJBH 2014; 2(1):6-8
received: July 4, 2014
accepted: July 15, 2014
published: August 15, 2014
Classification deals with assigning an unknown sample to a specific group (healthy vs. diseased) on the basis
of a classification rule. The classification rule is created
on the basis of samples which are of the same type as the
unknown sample and for which I know the group membership. Clustering deals with finding structure among
genes when I cannot use the knowledge of group membership.
In my previous review article I described several applications of shrinkage methods for gene expression data
analysis (see [7]). The application of the shrinkage method to the calculation of entropy is also mentioned in [7].
The shrunken value of entropy enters the estimation of mutual information, which is calculated for all pairs of genes.
Based on mutual information, a gene association network
is constructed so that a value of mutual information
which crosses a certain threshold indicates
a relationship between the genes. The entropy estimate based
on the shrinkage method is comparable to, and sometimes even
better than, other estimates of entropy in terms of estimation and computational complexity (see [8]). The algorithm used to create the gene association network is called
ARACNE (Algorithm for Reconstruction of Accurate Cellular NEtworks). The original version of ARACNE is described in [9]. To my knowledge, the shrinkage method has not been
applied to other algorithms of the ARACNE type (see [10] or [11]),
which are also based on the calculation of mutual information
for all pairs of genes.
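The following is a minimal sketch of how a shrunken entropy estimate can enter mutual information and a thresholded network. It assumes the James-Stein-type frequency estimator with a uniform target as I recall it from [8]; the article itself does not state the formula, so the shrinkage intensity below should be checked against [8]. The function names and example counts are hypothetical, and only the thresholding step is shown, not the complete ARACNE algorithm.

```python
import numpy as np

def shrink_freqs(counts):
    """Shrink cell frequencies towards the uniform distribution (assumed form of [8])."""
    counts = np.asarray(counts, dtype=float).ravel()
    n = counts.sum()                          # number of observations
    ml = counts / n                           # maximum likelihood frequencies
    target = np.full(counts.size, 1.0 / counts.size)
    denom = (n - 1.0) * np.sum((target - ml) ** 2)
    # Estimated shrinkage intensity; full shrinkage if ML already equals the target.
    lam = 1.0 if denom == 0.0 else (1.0 - np.sum(ml ** 2)) / denom
    lam = min(1.0, max(0.0, lam))             # truncate to [0, 1]
    return lam * target + (1.0 - lam) * ml

def entropy_bits(freqs):
    """Shannon entropy in bits of a vector of frequencies."""
    f = np.asarray(freqs, dtype=float).ravel()
    f = f[f > 0]
    return float(-np.sum(f * np.log2(f)))

def shrink_mi(joint_counts):
    """Mutual information computed from the shrunken joint frequency table."""
    joint_counts = np.asarray(joint_counts, dtype=float)
    joint = shrink_freqs(joint_counts).reshape(joint_counts.shape)
    h_x = entropy_bits(joint.sum(axis=1))
    h_y = entropy_bits(joint.sum(axis=0))
    return h_x + h_y - entropy_bits(joint)

def association_network(mi_matrix, threshold):
    """Boolean adjacency matrix: an edge wherever mutual information exceeds the threshold."""
    adjacency = np.asarray(mi_matrix, dtype=float) > threshold
    np.fill_diagonal(adjacency, False)        # no self-loops
    return adjacency

# Hypothetical joint counts for one pair of discretised genes (8 samples).
pair_counts = np.array([[4, 0],
                        [1, 3]])
print(shrink_freqs(pair_counts), shrink_mi(pair_counts))
```

Note that the empty cell receives a positive shrunken frequency, so the estimate stays well defined even when the number of samples is small relative to the number of cells.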
It is also possible to use shrinkage for the Minimum Redundancy Maximum Relevance (MRMR) criterion. The
MRMR method is introduced in [12]. The idea of minimum
redundancy is to select genes which are mutually different
to the maximum extent. The idea of maximum relevance is
to select genes which are related to the group to the maximum extent. Liu et al. in [13] use a modified MRMR method
in which the mutual information is standardised.
Meyer et al. in [14] use the Minimum Redundancy NETwork (MRNET) method
for detecting a gene association network
based on microarray data. The MRNET method reuses
the MRMR method. The relationship between two genes is determined using the MRMR method, which is applied to each
of the two genes separately. Each gene receives a certain
value that describes the relevance of the given
gene relative to the rest of the genes. The maximum of
these two values indicates whether the genes are related
or not, depending on the choice of the threshold value.
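The selection idea described above can be sketched as a greedy procedure. This is an illustrative sketch under stated assumptions, not the exact procedure of [12] or [14]: it assumes the mutual information values are already available, it uses the difference between relevance and mean redundancy as the score, and the names `mrmr_select` and `mrnet_weights` are hypothetical.

```python
import numpy as np

def mrmr_select(relevance, redundancy, k):
    """Greedily pick k genes with high relevance and low mutual redundancy.

    relevance[i]     : mutual information between gene i and the group label
    redundancy[i, j] : mutual information between gene i and gene j
    """
    relevance = np.asarray(relevance, dtype=float)
    redundancy = np.asarray(redundancy, dtype=float)
    selected = [int(np.argmax(relevance))]           # start with the most relevant gene
    while len(selected) < min(k, relevance.size):
        candidates = [i for i in range(relevance.size) if i not in selected]
        # Score: relevance minus mean redundancy with the genes already selected.
        scores = [relevance[i] - redundancy[i, selected].mean() for i in candidates]
        selected.append(candidates[int(np.argmax(scores))])
    return selected

def mrnet_weights(scores):
    """Symmetric edge weights: the maximum of the two directed MRMR scores of a gene pair."""
    scores = np.asarray(scores, dtype=float)
    return np.maximum(scores, scores.T)

# Hypothetical values: genes 0 and 1 are both relevant but largely redundant.
relevance = [0.90, 0.85, 0.40]
redundancy = np.array([[0.00, 0.80, 0.05],
                       [0.80, 0.00, 0.10],
                       [0.05, 0.10, 0.00]])
print(mrmr_select(relevance, redundancy, 2))
```

In this example the second gene chosen is gene 2 rather than gene 1, because gene 1 carries almost the same information as the gene already selected.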
The shrinkage method may also be used in the methods described in the following two articles. In [15] a
clustering algorithm is introduced which is based on the Havrda-Charvát structural
α-entropy (see [16]), which leads to the recursive property
of entropy. In cluster analysis, the recursive property is suitable if gene clusters are nested in each other. In [17] an
improved classification method is based on artificial neural networks using multistage mutual information, which
means that the mutual information is calculated in several
steps.
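For orientation, the structural α-entropy can be sketched as follows. The article does not write out the formula, so the definition used here, H_α(p) = (Σ p_i^α − 1) / (2^(1−α) − 1) for α > 0 and α ≠ 1, is my assumption of the standard form from [16]; it tends to Shannon entropy in bits as α approaches 1. The function name is hypothetical.

```python
import numpy as np

def structural_alpha_entropy(probs, alpha):
    """Havrda-Charvat structural alpha-entropy (assumed standard definition, alpha > 0)."""
    p = np.asarray(probs, dtype=float).ravel()
    p = p[p > 0]
    if np.isclose(alpha, 1.0):
        # Limiting case alpha -> 1: Shannon entropy in bits.
        return float(-np.sum(p * np.log2(p)))
    return float((np.sum(p ** alpha) - 1.0) / (2.0 ** (1.0 - alpha) - 1.0))

# Hypothetical cluster membership proportions.
proportions = [0.7, 0.2, 0.1]
for alpha in (0.5, 1.0, 2.0):
    print(alpha, structural_alpha_entropy(proportions, alpha))
```

With this normalisation a fair two-way split has entropy 1 for every α, which makes values for different α comparable.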
Application in Biomedicine and
Healthcare
Microarray technology has applications in the pharmaceutical industry (see [18]). In 2005, the
Food and Drug Administration (the medicines agency in the
USA) approved the use of microarray technology in the
evaluation of clinical trials relating to the registration of a
new drug or in testing samples obtained from patients for diagnostic purposes. The primary goal is the development of new
biomarkers that should tell the physician which treatment should be applied to the patient.
Microarrays can be used e.g. in oncology to distinguish
early stages of cancer from advanced stages, or
to determine the specific type of chemotherapy for the
patient.
Another example is the development of drugs. Drug
development generally has several stages, beginning with
a pre-clinical phase (drug testing on animals) and ending
with large clinical trials that involve hundreds or thousands of patients. Drug development typically takes several years and is very expensive. It is possible that only in the
later stages of development does it emerge that the drug is
ineffective or, on the contrary, very dangerous. Microarray technology can detect this problem at an early stage
and thus save money or lives. This is especially important
when a completely new chemical compound is used for drug
development.
Acknowledgement
The present study was supported by the project SVV
260034 of Charles University in Prague.
© 2014 EuroMISE s.r.o.
Keywords
Gene Expression Matrix
Definition: Matrix with columns representing biological samples and rows representing genes.
Synonyms: Gene Expression Data
Reference: [6]
UMLS:
SNOMED CT:
MeSH: no matching results found
Gene Expression Microarray
Definition: A Gene Expression Microarray is used for simultaneous measurement of gene expression based on RNA obtained from a biological sample. It is either a silicon or glass chip which contains different parts of single-stranded DNA in precisely set positions. The length of these parts is in the order of tens of nucleotides.
Synonyms: Gene Chip
Reference: [18]
UMLS: C1709015
SNOMED CT: no matching results found
MeSH: no matching results found
High Dimensional Data
Definition: Data where the number of variables measured is severalfold larger than the number of samples obtained by measuring these variables.
Reference: [19]
UMLS: no matching results found
SNOMED CT: no matching results found
MeSH: no matching results found
Mutual Information
Definition: A measure of the amount of information one random variable contains about another. This is a concept from information theory.
Reference: [2]
UMLS: no matching results found
SNOMED CT: no matching results found
MeSH: no matching results found
Gene Association Network
Definition: Graph whose vertices correspond to individual genes and in which an edge between two vertices indicates a relationship between the genes.
Reference: [8]
UMLS:
SNOMED CT:
MeSH: no matching results found
References
[1] Shannon CE. A Mathematical Theory of Communication. Bell System Technical Journal. 1948 Oct; 27(3):379-423
[2] Cover TM, Thomas JA. Elements of Information Theory, Second Edition. Hoboken (NJ): John Wiley & Sons, Inc.; 2006
[3] Stein C. Inadmissibility of the Usual Estimator for the Mean of Multivariate Normal Distribution. Proceedings of the 3rd Berkeley Symposium on Mathematical Statistics and Probability. 1956; 1:197-206
[4] Richards J. An Introduction to James-Stein Estimation. 1999 M.I.T. EECS Area Exam Report, 1999. http://www.yaroslavvb.com/papers/richards-introduction.ps (30th June 2014, date last accessed)
[5] Schena M, Shalon D, Davis RW, Brown PO. Quantitative Monitoring of Gene Expression Patterns with a Complementary DNA Microarray. Science (New York, N.Y.). 1995 Oct; 270(5235):467-470
[6] Dziuda DM. Data Mining for Genomics and Proteomics: Analysis of Gene and Protein Expression Data. Hoboken (NJ): John Wiley & Sons, Inc.; 2010
[7] Haman J, Valenta Z. Shrinkage Approach for Gene Expression Data Analysis. European Journal for Biomedical Informatics. 2013 Nov; 9(3):2-8
[8] Hausser J, Strimmer K. Entropy Inference and the James-Stein Estimator, with Application to Nonlinear Gene Association Networks. Journal of Machine Learning Research. 2009 Jul; 10:1469-1484
[9] Margolin AA, Nemenman I, Basso K, Wiggins C, Stolovitzky G, Favera RD, et al. ARACNE: an Algorithm for the Reconstruction of Gene Regulatory Networks in a Mammalian Cellular Context. BMC Bioinformatics. 2006 Mar; 7(Supplement 1):S7
[10] Butte AJ, Kohane IS. Mutual Information Relevance Networks: Functional Genomic Clustering Using Pairwise Entropy Measurements. Pacific Symposium on Biocomputing. 2000; 5:415-426
[11] Faith JJ, Hayete B, Thaden JT, Mogno I, Wierzbowski J, Cottarel G, et al. Large-Scale Mapping and Validation of Escherichia Coli Transcriptional Regulation From a Compendium of Expression Profiles. PLOS Biology. 2007 Jan; 5(1):e8
[12] Ding C, Peng H. Minimum Redundancy Feature Selection from Microarray Gene Expression Data. Journal of Bioinformatics and Computational Biology. 2005 Apr; 3(2):185-205
[13] Liu X, Krishnan A, Mondry A. An Entropy-Based Gene Selection Method for Cancer Classification Using Microarray Data. BMC Bioinformatics. 2005 Mar; 6:76
[14] Meyer PE, Kontos K, Lafitte F, Bontempi G. Information-Theoretic Inference of Large Transcriptional Regulatory Networks. EURASIP Journal On Bioinformatics & Systems Biology. 2007:79879
[15] Li H, Zhang K, Jiang T. Minimum Entropy Clustering and Applications to Gene Expression Analysis. Proceedings / IEEE Computational Systems Bioinformatics Conference. 2004:142-151
[16] Havrda J, Charvát F. Quantification Method of Classification Processes: Concept of structural α-entropy. Kybernetika. 1967; 3(1):30-34
[17] Kumar PG, Victoire TAA. Multistage Mutual Information for Informative Gene Selection. Journal of Biological Systems. 2011; 19(4):725-746
[18] Göhlmann H, Talloen W. Gene Expression Studies Using Affymetrix Microarrays. Boca Raton: Chapman & Hall / CRC; 2009
[19] Schäfer J, Strimmer K. A Shrinkage Approach to Large-Scale Covariance Matrix Estimation and Implications for Functional Genomics. Statistical Applications in Genetics and Molecular Biology. 2005; 4(1):Article32