Download Hybrid Methods to Select Informative Gene Sets in Microarray Data

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the work of artificial intelligence, which forms the content of this project

Document related concepts
no text concepts found
Transcript
Hybrid Methods to Select Informative Gene Sets
in Microarray Data Classification
Pengyi Yang1 and Zili Zhang1,2
1
2
Intelligent Software and Software Engineering Laboratory
Faculty of Computer and Information Science
Southwest University, Chongqing 400715, China
School of Engineering and Information Technology, Deakin University
Geelong, VIC 3217, Australia
[email protected]
Abstract. One of the key applications of microarray studies is to select and classify gene expression profiles of cancer and normal subjects.
In this study, two hybrid approaches–genetic algorithm with decision
tree (GADT) and genetic algorithm with neural network (GANN)–are
utilized to select optimal gene sets which contribute to the highest classification accuracy. Two benchmark microarray datasets were tested, and
the most significant disease related genes have been identified. Furthermore, the selected gene sets achieved comparably high sample classification accuracy (96.79% and 94.92% in colon cancer dataset, 98.67% and
98.05% in leukemia dataset) compared with those obtained by mRMR
algorithm. The study results indicate that these two hybrid methods are
able to select disease related genes and improve classification accuracy.
1
Introduction
One popular application of microarray studies is to extract and classify the gene
expression profiles between biological types. However, relatively few samples
with a large dimension of attribute space and a high signal-to-noise ratio [1]
are the nature of microarray data. Thus, the information contained in the gene
expression profiles needs to be extracted by robust computational strategies.
Decision tree and artificial neural networks (ANN) are nonlinear classifiers
which are suited in microarray data classification. However, decision tree algorithm is deterministic and it only uses the top ranked gene to split the data. This
leads to only one tree being created and it may be locally optimal. As to ANN,
the challenge is that usually the dataset been analyzed contains so many genes
that if all used as inputs, the computation complexity will increase significantly
and more noise will be introduced to the process [2].
Many hybrid methods are also used in the microarray data analysis. For instance, Leping Li et al. combined genetic algorithm (GA) and K-Nearest Neighbor (KNN) to select a subset of predictive genes [3]. Keedwell K. and Narayanan
A. proposed a neural-genetic approach for gene expression analysis [4]. However,
when applying these hybrid approaches to microarray data analysis, there are
M.A. Orgun and J. Thornton (Eds.): AI 2007, LNAI 4830, pp. 810–814, 2007.
c Springer-Verlag Berlin Heidelberg 2007
Hybrid Methods to Select Informative Gene Sets
811
k Candidates
Gene 1
Gene 2
. . .
Gene k−1 Gene k
Decision Tree
ANN
Fitness = Classification Accuracy
GA
iteration
Selection
Crossover
Mutation
Fig. 1. Hybrid model of GADT & GANN
still some issues like too many genes are included in the result set and gene
interaction information is ignored. To this end, we explore two other hybrid
approaches namely GADT-hybrid of genetic algorithm and decision tree, and
GANN-hybrid of genetic algorithm and artificial neural networks. These two
hybrid methods were tested on two benchmark microarray datasets [5,6]. Furthermore, the results are compared with those obtained by mRMR algorithm,
which is a feature selection algorithm based on mutual information theory [7].
2
Hybrid Approaches
GA and decision tree hybrid was first designed by J. Bala et al. [8] as a methodological tool for pattern classification. In this work, a standard GA is used to
search the space of all possible subsets and invoke decision tree to classify data.
Our GANN algorithm is a variation of Neural-genetic approach proposed
by Keedwell K. and Narayanan A. [4]. Different from Neural-genetic algorithm
which used a step-function network, a back-propagation NN with a hidden layer
was employed in our algorithm to capture nonlinear gene profiles.
The architecture of these two hybrid methods are similar in which they both
use GA as the gene selector. As illustrated in Figure 1, GA is utilized to create
various gene sets and select important combinations. ANN or decision tree were
used as the classifiers in the process, and the fitness function of GA is defined
as the classifiers’ classification accuracy of different gene combinations.
3
3.1
Methods
Data Preprocessing
Two benchmark microarray datasets, which have been published by Alon et
al. [5] and Golub et al. [6] respectively, have been used to evaluate our hybrid
812
P. Yang and Z. Zhang
approaches. In data preprocessing, the expression values of genes in colon dataset
were logarithmically transformed (with base e) and then normalized into 0-1.
The tumor tissues were marked as 1 and the normal ones were marked as 0. For
leukemia dataset, we adopted the preprocessing method proposed by Dudoit S
et al. [9]. Furthermore, every gene was marked with an ID and the chromosome
of GA was coded by a string of gene IDs, which specify a candidate gene profile.
3.2
Hybrid Model Construction and Evaluation
The starting population size of GA has been set to 500. The probability of
crossover and probability of mutation have been assigned to 0.7 and 0.03 respectively. The single point mutation and crossover were used in genetic operation
parts, and the binary tournament selection method was adopted. The termination condition is that GA reaches the 50th generation. For GADT, one of the
most popular decision tree package C5.0 [10] was used to build trees and make
data classification. A back propagation NN with 2-10 hidden nodes and 2-20 input nodes was utilized as the GANN’s fitness calculator to evaluate various gene
combinations. 1000 processing epoches, learning rate of 0.1-0.4 and momentum
of 0.4 have been used in ANN training and evaluation.
Cross validation and multiple random validation are two common strategies of
performance evaluation. In gene sets selection phase, multiple random validation
was employed to re-divide data in every generation. In gene sets evaluation phase,
5-fold cross validation was used to evaluate the classification accuracy. The top
five frequently selected gene combinations in both GANN and GADT were used
to compare with previous studies [1,11] to find overlapped genes.
4
Results
Using GANN, the highest evaluation accuracy 94.92% for colon data and 98.05%
for leukemia data are achieved by a 15-gene combination and a 5-gene combination respectively. The gene profiles of these two combinations are (U02020
M26481 T58861 H66786 H40108 X57351 M33210 R36977 L34657 T47645 R80400
H08393 R99907 T57882 U04241) and (M23197 M27891 D88270 D26156 U65011).
With GADT, we obtained the highest classification accuracy 96.79% for colon
data and 98.67% for leukemia data using two 5-gene combinations respectively,
and the gene profiles selected by GADT are (L25851 X12671 M91463 M81840
T67897) and (M29960 M12959 M28827 L35263 X61587).
The results are compared with those obtained by mRMR algorithm [7]. Figure
3 summarized the comparison. With mRMR and ANN, the best results are
87.74% for colon data and 97.82% for leukemia data, using a 5-gene and a 20gene combinations. When using mRMR with DT, the best results are 87.05%
for colon data and 93.05% for leukemia data, with two 2-gene combinations. As
can be seen from Figure 2, our hybrid approaches are superior in both cases.
The rank of the 10 highly significant colon tumor and leukemia relevant genes
selected by GANN and GADT are shown in Table 1. The genes are ranked
Hybrid Methods to Select Informative Gene Sets
a
Colon
b
100
98.67%
98.05%
98
96
94.92%
Prediciton accuracy %
Prediciton accuracy %
Leukemia
100
96.79%
98
94
92
90
88
86
84
GANN
GADT
mRMR+NN
mRMR+DT
82
80
78
2
4
6
8
10
12
14
16
18
813
96
94
GANN
GADT
mRMR+NN
mRMR+DT
92
90
88
20
Number of gene selected by hybrid algorithms
86
2
4
6
8
10
12
14
16
18
20
Number of gene selected by hybrid algorithms
Fig. 2. Evaluation accuracy of the best gene profiles selected by GANN, GADT,
mRMR+NN and mRMR+DT. (a) Classification accuracy comparison of colon dataset.
(b) Classification accuracy comparison of leukemia dataset.
Table 1. The rank of the top 10 disease associated genes selected by GANN & GADT
GANN
Colon Selec. Freq. Leukemia Selec. Freq.
U14631
0.277
M23197
0.287
D00860
0.216
M27891
0.224
H08393
0.195
U46499
0.209
R36977
0.187
X95735
0.160
L05485
0.183
M13690
0.140
L27841
0.176
D88270
0.139
T57882
0.162
M83667
0.135
R80400
0.162
L09209
0.129
H66786
0.162
M63138
0.120
H40108
0.162
U70063
0.120
GADT
Colon Selec. Freq. Leukemia Selec. Freq.
X12671
0.600
M27749
0.400
M91463
0.400
M27891
0.200
U31248
0.255
U46499
0.200
H08393
0.200
Y10807
0.200
T62947
0.200
Z49155
0.200
M26383
0.200
L07633
0.200
H11460
0.200
X61755
0.200
R46502
0.200
M92287
0.200
T60155
0.200
X95735
0.200
R99907
0.200
M12959
0.200
Table 2. Genes selected by GANN & GADT that overlapped with studies [1,11]
GANN
Colon Selec. Freq. Leukemia Selec. Freq.
H08393
0.195
M23197
0.287
T58861
0.139
M27891
0.224
U30825
0.139
L09209
0.129
M63391
0.053
U46499
0.209
Z24727
0.053
X95735
0.160
R87126
0.053
U70063
0.120
H87465
0.033
M83652
0.116
T62947
0.027
M12959
0.032
GADT
Colon Selec. Freq. Leukemia Selec. Freq.
H08393
0.200
U46499
0.200
M26383
0.200
M27891
0.200
T62947
0.200
L09209
0.200
H87465
0.200
X95735
0.200
U37012
0.094
M12959
0.200
–
–
–
–
–
–
–
–
–
–
–
–
by their frequency of selection in GANN and GADT results. Furthermore, the
selected colon tumor relevant genes and leukemia relevant genes which are overlapped with previous related studies [1,11] are presented in Table 2. Genes
H08393, T62947 from colon dataset and M27891, U46499 from leukemia dataset,
which have been considered as the most important disease related factors, have
been identified.
814
5
P. Yang and Z. Zhang
Conclusion
In this study, two hybrid methods, GANN and GADT are employed to select
optimal gene sets in microarray data classification. Appreciably high classification accuracy in two benchmark datasets were achieved. In addition, the most
important disease related genes (Table 2) which have been reported by other alternative methods have also been identified by our hybrid methods. The results
indicate that GANN and GADT are able to select and optimize informative gene
profile in order to improve the classification accuracy of the microarray dataset.
References
1. Li, X., Rao, S.Q., Wang, Y.D., Gong, B.S.: Gene mining: a novel and powerful ensemble decision approach to hunting for disease genes using microarray expression
profiling. Nucleic Acids Research 32(9), 2685–2694 (2004)
2. Yang, P.Y., Zhang, Z.L.: A Hybrid Approach to Selecting Susceptible Single Nucleotide Polymorphisms in Age-Related Macular Degeneration Diagnosis. submitted to Artificial Intelligence in Medicine
3. Li, L., Weinberg, C.R., Darden, T.A., Pedersen, L.G.: Gene selection for sample
classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 17, 1131–1142 (2001)
4. Keedwell, E., Narayanan, A.: Genetic Algorithms for Gene Expression Analysis.
In: Raidl, G.R., Cagnoni, S., Cardalda, J.J.R., Corne, D.W., Gottlieb, J., Guillot,
A., Hart, E., Johnson, C.G., Marchiori, E., Meyer, J.-A., Middendorf, M. (eds.)
EvoBIO 2003. LNCS, vol. 2611, pp. 76–86. Springer, Heidelberg (2003)
5. Alon, U., Barkai, N., Notterman, D.A., Gish, K., Ybarra, S., Mack, D., Levine,
A.J.: Broad patterns of gene expression revealed by clustering analysis of tumor and
normal colon tissues probed by oligonucleotide arrays. Proc. Natl. Acad. Sci. 96,
6745–6750 (1999)
6. Golub, T.R., Slonim, D.K., Tamayo, P., Huard, C., Gaasenbeek, M., Mesirov, J.P.,
Coller, H., Loh, M.L., Downing, J.R., Caligiuri, M.A., Bloomfield, C.D., Lander,
E.S.: Molecular Classification of Cancer: Class Discovery and Class Prediction by
Gene Expression Monitoring. Science 286, 531–537 (1999)
7. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria
of max-dependency, max-relevance, and min-redundancy. IEEE transactions on
pattern analysis and machine intelligence 27, 1226–1238 (2005)
8. Bala, J., Huang, J., Vafaie, H., DeJong, K., Wechsler, H.: Hybrid Learning Using Genetic Algorithms and Decision Trees for Pattern Classification. In: IJCAI
conference, Montreal (August 19-25, 1995)
9. Dudoit, S., Fridlyand, J., Speed, T.P.: Comparison of Discrimination Methods for
the Classification of Tumors Using Gene Expression Data. Journal of the the American Statistical Association 97, 77–87 (2002)
10. Quinlan, J.R.: C4.5: Programs for machine learning. Morgan Kaufmann Publishers,
San Francisco (1993)
11. Cho, S.B., Won, H.H.: Machine Learning in DNA Microarray Analysis for Cancer
Classification. In: Conferences in Research and Practice in Information Technology,
vol. 19 (2003)