Sentamarai Kannan, et al International Journal of Computer and Electronics Research [Volume 2, Issue 2, April 2013]
MUTUAL INFORMATION-BASED SUPERVISED
ATTRIBUTE CLUSTERING FOR LARGE MICROARRAY
SAMPLE CLASSIFICATION
Dr. Senthamarai Kannan, M.E., Ph.D.¹, Sherin Mariam John²
¹ Head of the Department, M.E (CSE), Sethu Institute of Technology, Kariapatti, India
² Final year M.E student, Sethu Institute of Technology, Kariapatti, India
[email protected], [email protected]
Abstract - This paper investigates the application of the
mutual information criterion to evaluate a set of
attributes and to select an informative subset to be used
as input data for microarray classification. A microarray
is a multiplex lab-on-a-chip: a 2D array on a solid
substrate. Among the large number of genes it measures,
only a small fraction is effective for performing a certain
task. One of the major tasks with gene expression data is
to find groups of co-regulated genes whose collective
expression is strongly associated with the sample
categories or response variables. In this regard, a new
supervised attribute clustering algorithm is proposed to
find such groups of genes. It directly incorporates the
information of sample categories into the attribute
clustering process. A new quantitative measure, based on
mutual information, is introduced that incorporates the
information of sample categories to measure the similarity
between attributes. This similarity measure is useful for
reducing the redundancy among the attributes. The
clustering algorithm is effective at identifying
biologically significant gene clusters with excellent
predictive capability. A fuzzy classification algorithm is
then applied to classify the selected gene set. The
proposed algorithm also avoids the noise sensitivity
problem of existing supervised gene clustering algorithms.
Keywords - Attribute Clustering, Microarray, Gene
Selection, Mutual Information, Classification.
I. INTRODUCTION
Microarray gene expression data consist of the
measured expression level of each gene in each
sample. Among the large number of genes measured,
only a small number are effective for performing a
given task. The proposed algorithm finds this
effective gene set by means of a supervised
clustering algorithm.
© http://ijcer.org
An important application of gene expression data
in functional genomics is to classify samples
according to their gene expression profiles. A
microarray gene expression data set can be
represented by an expression table, where each row
corresponds to one particular gene, each column to
a sample, and each entry of the matrix is the
measured expression level of a particular gene in a
sample. However, for most gene
expression data, the number of training samples is
still very small compared to the large number of
genes involved in the experiments.
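As a minimal sketch of this layout, the expression table can be held as a NumPy array with one row per gene and one column per sample. The gene names, sample names, and expression values below are made up for illustration and are not taken from the paper.

```python
import numpy as np

# Hypothetical toy data: 5 genes measured in 3 samples.
genes = ["g1", "g2", "g3", "g4", "g5"]
samples = ["s1", "s2", "s3"]
expression = np.array([
    [2.1, 1.9, 7.8],
    [0.4, 0.5, 0.3],
    [5.5, 5.1, 1.2],
    [3.3, 3.0, 3.1],
    [0.9, 1.1, 6.4],
])


def expression_level(gene, sample):
    """Look up the measured expression level of one gene in one sample."""
    return expression[genes.index(gene), samples.index(sample)]


# Rows are genes, columns are samples.
n_genes, n_samples = expression.shape
print(n_genes, n_samples)
print(expression_level("g3", "s2"))
```

Even this toy table has more genes than samples; real microarray data sets are far more lopsided, which is what makes gene selection necessary.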
Among this large number of genes, only
a small fraction is effective for performing a certain
task. Also, a small subset of genes is desirable in
developing gene expression-based diagnostic tools
for delivering precise, reliable, and interpretable
results. With the gene selection results, the cost of
biological experiment and decision can be greatly
reduced by analyzing only the marker genes. Hence,
identifying a reduced set of most relevant genes is
the goal of gene selection. The small number of
training samples and a large number of genes make
gene selection a more relevant and challenging
problem in gene expression-based classification. As
this is a feature selection problem, the clustering
method can be used, which partitions the given gene
ISSN: 2278-5795
set into subgroups, each of which should be as
homogeneous as possible.
Different criteria may lead to different clustering
results. However, every criterion tries to measure
the similarity among the subset of genes presented
in a cluster. While tree harvesting uses an
unsupervised similarity measure to group a set of
co-regulated genes, other supervised algorithms such
as supervised gene clustering, gene shaving, and the
partial least squares procedure do not use any
similarity measure to cluster genes.
A new supervised attribute clustering algorithm is
proposed to find co-regulated clusters of genes
whose collective expression is strongly associated
with the sample categories or class labels. A
cluster is formed around each relevant attribute by
adding further attributes to it one after the other.
The growth of the cluster is repeated until the
cluster stabilizes.
II. RELATED WORK
Model-based Bayesian clustering (MBBC) assumes that
genes within each cluster follow a Bayesian linear
mixed model with cluster-specific and gene-specific
random effect terms. To find the optimal partition
with respect to the Bayesian objective function, a
number of options can be considered:
Metropolis-Hastings, Gibbs sampling, biased random
walk, and “heating” the chain by methods such as
simulated tempering. All of these alternatives have
been considered, and a well-tuned split-merge
algorithm, the basis of MBBC, was found to be simple
and to converge quite rapidly.
2.1 Measuring mRNA levels
Compared with the traditional approach to
genomic research, which has focused on the local
examination and collection of data on single genes,
microarray technologies have now made it possible
to monitor the expression levels for tens of
thousands of genes in parallel. The two major types
of microarray experiments are the cDNA
microarray and oligonucleotide arrays.
Chip manufacture: A microarray is a small chip,
onto which tens of thousands of DNA molecules are
attached in fixed grids. Each grid cell relates to a
DNA sequence.
Target preparation, labeling and hybridization:
Typically, two mRNA samples are reverse-transcribed
into cDNA, labeled using either fluorescent dyes or
radioactive isotopes, and then hybridized with the
probes on the surface of the chip.
The scanning process: Chips are scanned to read
the signal intensity that is emitted from the labeled
and hybridized targets.
2.2 Pre-processing of gene expression data
A microarray experiment typically assesses a
large number of DNA sequences under multiple
conditions. These conditions may be a time series
during a biological process or a collection of
different tissue samples. This paper focuses on the
cluster analysis of gene expression data without
making a distinction among DNA sequences, which are
uniformly called “genes”. Likewise, all kinds of
experimental conditions are uniformly referred to as
“samples” where this causes no confusion.
2.3 Parametric Bootstrap Model Selection
The parametric bootstrap, unlike cross-validation
and the non-parametric bootstrap, requires a more
detailed model for the underlying process that
generated the data y. A parametric bootstrap model
gives a more accurate estimate of the prediction
error; it is tailored to microarray data by
borrowing from the extensive research on identifying
differentially expressed genes, especially the local
false discovery rate.
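To illustrate only the resampling idea (not the prediction-error estimator or the local false discovery rate model mentioned above), the sketch below assumes a Gaussian model for one gene's expression values. The model choice, the function name `parametric_bootstrap_se`, and the toy values are assumptions made for illustration. The model is fitted to the data, new data sets are simulated from the fitted model rather than resampled from the observations, and the statistic is recomputed on each simulated data set.

```python
import numpy as np


def parametric_bootstrap_se(y, statistic=np.mean, n_boot=2000, seed=0):
    """Standard error of `statistic` under an assumed Gaussian model for y."""
    rng = np.random.default_rng(seed)
    y = np.asarray(y, dtype=float)
    # Maximum-likelihood fit of the assumed model.
    mu_hat, sigma_hat = y.mean(), y.std()
    # Simulate n_boot data sets of the same size from the fitted model.
    simulated = rng.normal(mu_hat, sigma_hat, size=(n_boot, y.size))
    # Recompute the statistic on every simulated data set.
    replicates = np.apply_along_axis(statistic, 1, simulated)
    return float(replicates.std(ddof=1))


# Hypothetical expression values of a single gene.
y = np.array([2.1, 1.9, 2.4, 2.0, 2.6, 1.8, 2.2, 2.3])
se = parametric_bootstrap_se(y)
print(se)
```

For the sample mean under this model, the result should be close to the analytical value of the fitted standard deviation divided by the square root of the sample size, which gives a quick sanity check.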
III. EXISTING SYSTEM
Existing work on gene expression data analysis uses
Bayesian clustering, hierarchical clustering,
k-means clustering, etc. All these existing
algorithms use unsupervised similarity measures
for gene grouping. They do not use the sample
categories or response variables when clustering.
Other supervised algorithms, such as gene shaving,
supervised gene clustering, and partial least
squares, do not use any similarity measure. Instead,
they use other predictive measures such as the
Wilcoxon test, the Cox model score test, etc.
Existing works that depend on an unsupervised
similarity measure have the disadvantage that they
do not use mutual information about the gene
expression data. Methods that use no similarity
measure at all have the following drawbacks:
redundancy among attributes cannot be removed
properly, and the resulting clusters are sensitive
to noise and may contain outliers.
IV. PROPOSED SYSTEM
A new supervised attribute clustering algorithm is
presented for grouping co-regulated genes with
strong association to the class labels.
4.1 Load Data
The gene data, with a number of attributes, are
taken as the input. For example, a cancer dataset
contains cell size, cell shape, etc. as its
attributes. The sample file is split into two
different sets of samples according to their class
labels. Then the attributes present in the samples
are identified.
4.2 Similarity Computation
This step calculates the similarity between the
attributes based on mutual information. Prior to
computing the similarity, three basic quantities are
calculated from the gene expression values: entropy,
conditional entropy, and mutual information.
Computing these three quantities requires the
probability density function to be estimated first.
Then the supervised similarity between attributes is
computed.
4.3 Coarse Cluster Formation
Initially, the attribute with the highest relevance
value is taken as the representative. The initial
cluster is formed by selecting, from the whole
attribute set, the attributes that are related to
this representative. Hence, the coarse cluster
consists of a set of attributes whose relevance
values are close to that of the representative. This
process is repeated iteratively.
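The three quantities named in Section 4.2 can be sketched as follows. This is a minimal illustration, not the authors' implementation: equal-width binning stands in for the probability density estimation step, and the bin count, function names, and toy values are assumptions. The supervised similarity itself is not reproduced because its formula is not given here.

```python
import numpy as np


def discretize(values, bins=2):
    """Map continuous expression values to bin indices (equal-width bins)."""
    values = np.asarray(values, dtype=float)
    edges = np.linspace(values.min(), values.max(), bins + 1)[1:-1]
    return np.digitize(values, edges)


def entropy(x):
    """Shannon entropy H(X) in bits of a discrete 1-D array."""
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())


def conditional_entropy(x, y):
    """H(X | Y): entropy of X within each value of Y, weighted by p(y)."""
    x, y = np.asarray(x), np.asarray(y)
    total = 0.0
    for value in np.unique(y):
        mask = y == value
        total += mask.mean() * entropy(x[mask])
    return total


def mutual_information(x, y):
    """I(X; Y) = H(X) - H(X | Y), in bits."""
    return entropy(x) - conditional_entropy(x, y)


# Hypothetical class labels and two hypothetical genes over 8 samples.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
informative = np.array([0.10, 0.20, 0.15, 0.30, 0.90, 0.80, 0.95, 0.70])
uninformative = np.array([0.1, 0.9, 0.2, 0.8, 0.1, 0.9, 0.2, 0.8])

mi_informative = mutual_information(discretize(informative), labels)
mi_uninformative = mutual_information(discretize(uninformative), labels)
print(mi_informative, mi_uninformative)
```

In this toy case the first gene separates the two classes perfectly and reaches 1 bit of mutual information with the labels, while the second gene is balanced within each class and carries 0 bits.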
Figure 1: Representation of system architecture.
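The coarse cluster formation of Section 4.3 can be sketched as a greedy loop. The selection rule is not given in formula form here, so thresholding a precomputed attribute-to-attribute similarity at a hypothetical `delta` is only one possible reading. The relevance values, the similarity matrix, and the function name `coarse_clusters` are assumptions for illustration.

```python
import numpy as np


def coarse_clusters(relevance, similarity, delta=0.7):
    """Greedy coarse clustering of attributes.

    relevance:  (m,) relevance of each attribute to the class labels.
    similarity: (m, m) symmetric attribute-to-attribute similarity.
    delta:      assumed similarity threshold for joining a cluster.
    Returns a list of (representative, members) pairs.
    """
    remaining = set(range(len(relevance)))
    clusters = []
    while remaining:
        # The most relevant unassigned attribute becomes the representative.
        rep = max(remaining, key=lambda i: relevance[i])
        # Attributes similar enough to it form the coarse cluster.
        members = sorted(
            i for i in remaining if i != rep and similarity[rep, i] >= delta
        )
        clusters.append((rep, members))
        # Remove the cluster and repeat on the attributes that are left.
        remaining -= {rep, *members}
    return clusters


# Hypothetical relevance values and similarity matrix for 4 attributes.
relevance = np.array([0.9, 0.8, 0.3, 0.7])
similarity = np.array([
    [1.00, 0.85, 0.10, 0.20],
    [0.85, 1.00, 0.15, 0.30],
    [0.10, 0.15, 1.00, 0.75],
    [0.20, 0.30, 0.75, 1.00],
])
clusters = coarse_clusters(relevance, similarity)
print(clusters)
```

Removing each cluster before picking the next representative is what keeps redundant attributes from being selected twice.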
4.4 Finer Cluster
The representative selected in the previous
module is refined incrementally. By comparing it
with the various clusters, the current
representative is merged to form a single augmented
cluster representative, which increases the
relevance value of this attribute. The merging of
the representative is repeated until its relevance
no longer improves. Finally, a finer cluster with
effective attributes is obtained.
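A minimal sketch of this refinement step follows. It assumes that merging means averaging the expression profiles and that relevance is the mutual information between the binned profile and the class labels. The averaging rule, the bin count, the function names, and the toy data are all assumptions rather than details taken from the paper.

```python
import numpy as np


def mutual_information(x, y):
    """I(X; Y) in bits for two discrete 1-D arrays."""
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    joint = np.zeros((xi.max() + 1, yi.max() + 1))
    np.add.at(joint, (xi, yi), 1.0)
    joint /= joint.sum()
    outer = joint.sum(axis=1, keepdims=True) * joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / outer[nz])).sum())


def relevance(values, labels, bins=2):
    """Relevance of a profile: MI between its binned values and the labels."""
    edges = np.linspace(values.min(), values.max(), bins + 1)[1:-1]
    return mutual_information(np.digitize(values, edges), labels)


def refine_representative(data, labels, rep, coarse_members):
    """Merge coarse-cluster members into the representative while relevance improves."""
    members = [rep]
    profile = data[rep].astype(float)
    best = relevance(profile, labels)
    improved = True
    while improved:
        improved = False
        for i in coarse_members:
            if i in members:
                continue
            # Candidate augmented representative: running average of profiles.
            merged = (profile * len(members) + data[i]) / (len(members) + 1)
            score = relevance(merged, labels)
            if score > best:
                members.append(i)
                profile, best = merged, score
                improved = True
    return members, profile, best


# Hypothetical data: genes 0 and 1 each misplace one sample; gene 2 does not help.
labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
data = np.array([
    [0.0, 0.0, 0.0, 4.0, 6.0, 6.0, 6.0, 6.0],
    [0.0, 0.0, 4.0, 0.0, 6.0, 6.0, 6.0, 6.0],
    [9.0, 9.0, 9.0, 9.0, 0.0, 0.0, 0.0, 0.0],
])
members, profile, best = refine_representative(data, labels, 0, [1, 2])
print(members, best)
```

In the toy data, averaging genes 0 and 1 cancels their individual errors, so the augmented representative is more relevant than either gene alone, while gene 2 is rejected because adding it lowers the relevance.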
4.5 Classification
Classification is a data mining function that
assigns items or attributes in a collection to target
categories or classes. When classification is
applied to biomedical data, it can predict the
results of risk factors. The Cost-Effective Fuzzy
Classification Algorithm is used here to produce
the diagnosis result of diseases.
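The fuzzy classifier is not described in enough detail here to reproduce, so the sketch below substitutes a plain k-nearest-neighbor rule purely to show how samples described by a few selected attributes are assigned to classes. The function name `knn_predict`, the value of `k`, and the toy data are assumptions for illustration.

```python
from collections import Counter

import numpy as np


def knn_predict(train_x, train_y, query, k=3):
    """Assign `query` to the majority class among its k nearest training samples."""
    distances = np.linalg.norm(train_x - query, axis=1)
    nearest = np.argsort(distances)[:k]
    votes = Counter(train_y[nearest].tolist())
    return votes.most_common(1)[0][0]


# Hypothetical samples described by two cluster representatives each.
train_x = np.array([
    [0.1, 0.2], [0.0, 0.3], [0.2, 0.1],
    [0.9, 0.8], [1.0, 0.7], [0.8, 0.9],
])
train_y = np.array([0, 0, 0, 1, 1, 1])

prediction_a = knn_predict(train_x, train_y, np.array([0.15, 0.20]))
prediction_b = knn_predict(train_x, train_y, np.array([0.90, 0.90]))
print(prediction_a, prediction_b)
```

Any classifier that accepts a samples-by-attributes matrix could be swapped in at this point; the clustering steps only determine which attributes are passed to it.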
V. CONCLUSION
In this paper we have demonstrated the feasibility
and effectiveness of the proposed method. A new
quantitative measure, based on mutual information,
is defined to calculate the similarity between two
genes. It incorporates the information of sample
categories or class labels. A new supervised
attribute clustering algorithm has been developed to
find co-regulated clusters of genes. The performance
of the proposed method and some existing methods has
been compared using the class separability index and
the predictive accuracy of the support vector
machine, the K-nearest neighbor rule, and the naive
Bayes classifier. The proposed method is capable of
identifying co-regulated clusters of genes whose
average expression is strongly associated with the
sample categories. The identified gene clusters may
contribute to revealing underlying class structures,
providing a useful tool for the exploratory analysis
of biological data.