2011 International Conference on Environment Science and Engineering
IPCBEE vol.8 (2011) © (2011) IACSIT Press, Singapore
Human Ovarian Carcinoma Microarray Data Analysis Based on Support Vector Machines with Different Kernel Functions
Meng-Hsiun Tsai 3,4, Shih-Huei Wang 4, Kun-Cheng Wu 3, Jui-Ming Chen 1,2,*, Sheng-Hsiung Chiu 5
1 Department of Endocrinology and Metabolism, Tungs' Taichung MetroHarbor Hospital, Taiwan, R.O.C.
2 Department of Biomedical Informatics, Asia University, Taiwan, R.O.C.
3 Department of Management Information Systems, National Chung Hsing University, Taiwan, R.O.C.
4 Institute of Genomics and Bioinformatics, National Chung Hsing University, Taiwan, R.O.C.
5 Troilus Biotechnology Co., Ltd., Taiwan
[email protected]
[email protected]
[email protected]
[email protected]
[email protected]
Abstract—According to statistics from the Department of Health, the incidence and mortality rates of ovarian cancer both rank in the top ten in Taiwan. Moreover, the mortality rate of ovarian cancer ranks first among gynecologic cancers. In this research, we use ovarian cancer gene chips as the basis of a database analysis, in order to address problems such as the large number of gene-chip variables and the insufficient number of samples. The research uses a gene-chip database containing 9,600 ovarian DNA expressions from 41 samples. We apply linear regression and analysis of variance (ANOVA) in data pre-processing to reduce the number of genes and find the most valuable ones. Finally, the resulting database is examined and classified by a Support Vector Machine (SVM), and the results obtained with different kernel functions are compared. Our research finds that the SVM classifies the data well and that the results change with the kernel function used. Finally, we discuss the results to identify the most precise and efficient kernel function.

Keywords-ovarian cancer; gene chip; linear regression; ANOVA; SVM

I. INTRODUCTION

A gene chip can store a large number of genes and analyze thousands of genes at the same time. It has therefore developed into a mature bio-chip technology for the study of gene expression. The chip is usually used as a tool for large-scale screening and parallel analysis. It can be applied to gene-expression comparison and gene-sequence analysis, because with bio-chip technology researchers can observe thousands, or even tens of thousands, of gene-expression values at the same time [1]. Furthermore, for cases with too many variables or too few samples, we use linear regression and analysis of variance to solve the problem.

In Taiwan, cancer ranks first among the causes of death. In this paper, we study ovarian cancer, one of the common gynecological cancers in women; it is also the fifth leading cause of cancer death for women in the Western world [2,3]. The incidence of ovarian cancer is second only to that of cervical cancer, while its fatality rate ranks first. In addition, ovarian cancer is difficult to diagnose, and it is hard to estimate when a benign tumor becomes malignant. It is therefore necessary to investigate methods of distinguishing benign tumors from malignant tissue. In this paper, we collect samples from 41 patients, covering benign ovarian tumors, borderline cancers, ovarian cancer at stage I, and ovarian cancer at stage III, for the ovarian DNA expression database, with a total of 9,600 sample genes.

In the last step of data pre-processing, we use linear regression and ANOVA to analyze the ovarian cancer gene chips. The Support Vector Machine (SVM) is then used to classify each pathogenic stage. Afterwards, the results of four kernel functions are compared and discussed. Finally, we determine the best parameter values and the optimal kernel function. The research flow is shown in Fig. 1.

*Corresponding author. Tel: +886-4-26581919 ext 4305.
Email address: [email protected] (J. M. Chen).

Fig. 1 Research Flow
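The pre-processing and classification flow described above can be sketched in code. This is a minimal illustration with synthetic data, not the authors' pipeline: the expression matrix, group sizes, ANOVA threshold, and SVM settings are all assumptions, and scikit-learn (which wraps LIBSVM) stands in for the tools used in the paper.

```python
import numpy as np
from scipy import stats
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic stand-in for the 41-sample, 9,600-gene expression matrix
# (rows = samples, columns = genes); labels follow the four stages.
X = rng.normal(size=(41, 9600))
y = np.repeat([0, 1, 2, 3], [13, 6, 7, 15])  # OVT, BOT, OVCA-I, OVCA-III

# One-way ANOVA per gene across the four groups; keep genes whose
# p-value falls below the screening threshold (0.05 here).
groups = [X[y == g] for g in np.unique(y)]
pvals = np.array([stats.f_oneway(*(g[:, j] for g in groups)).pvalue
                  for j in range(X.shape[1])])
selected = np.flatnonzero(pvals < 0.05)

# Classify the reduced data with an SVM (RBF kernel as a default).
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X[:, selected], y)
acc = clf.score(X[:, selected], y)
print(len(selected), round(acc, 3))
```

On real data, the linear-regression residual ranking described later would run before the ANOVA screen; it is omitted here to keep the sketch short.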
II. GENE DATA AND RESEARCH METHODS

This section first introduces the ovarian cancer gene chips used in the study and then discusses the analysis of the data. Since the gene chip has too many variables and too few samples, after loading the gene database we reduce its dimensionality. Once the high-dimensional data has been reduced, we identify the genes with high significance. Finally, the selected genes are classified and put into the SVM for the final analysis.

A. Gene Chip Data Analysis

The ovarian DNA expression database used in this paper consists of clinical data from China Medical University Hospital, Taichung, Taiwan, provided by Meng-Hsiun Tsai, Ph.D., of the Department of Management Information Systems and the Institute of Bioinformatics, National Chung Hsing University. The database was constructed during 2001-2003 and contains ovarian tissues from 41 patients at different pathological stages of ovarian cancer: benign ovarian tumor (OVT), borderline tumor (BOT), ovarian cancer at stage I (OVCA-I), and ovarian cancer at stage III (OVCA-III). The numbers of samples of OVT, BOT, OVCA-I, and OVCA-III are 13, 6, 7, and 15, respectively, with a total of 9,600 sample genes. The genes are then loaded into the program to reduce the dimensionality and identify highly significant genes for the subsequent analysis.

B. Linear Regression

Linear regression analysis is a statistical method for predicting the relationship between an independent variable set X and a dependent variable set Y. The relationship between the independent and dependent variables can be a positive correlation, a negative correlation, or no correlation, and it can be linear or nonlinear. The linear regression model is:

yi = β0 + β1 xi + εi (1)

where β0 is the intercept, β1 is the regression coefficient, yi is the dependent variable, xi is the independent variable, and εi is the residual. In this paper, linear regression analysis is used to reduce the dimensionality of the huge amount of data and to find the high-expression and low-expression genes in each sample. The residual is the error term and can be used to assess the estimates of the linear regression [4]. This study sums the residual values of each gene and sorts the genes by this sum, then selects the 100 genes with the smallest residual values and the 100 genes with the largest residual values as the target genes.

C. Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical method that provides the variation value of a data set. By observing the differences in values between groups, we can determine whether there are significant differences and whether a value has an obvious influence on the content of the data. This research uses the probability value (p-value) as the screening criterion to test whether each gene in the samples shows significant differences. In this paper, the p-value threshold is set between 0.05 and 0.00001. If the testing results show significant differences, the gene chip database can be compared and analyzed further.

D. Support Vector Machine (SVM)

The Support Vector Machine (SVM), proposed by Vapnik and collaborators, is a supervised learning algorithm used for classification and regression [6]. Because of its significant learning and classification ability, the SVM is widely applied in bioinformatics, image recognition, and text mining, and it has become a frequently used machine-learning method in recent years. The main elements of the SVM are introduced as follows.

1) Hyperplane

To build the support vector machine, we first use a simple linear hyperplane to perform the classification. The separating hyperplane is bounded by two parallel hyperplanes:

H1: w^T x + b = 1 (2)
H2: w^T x + b = -1 (3)

where w is the vector perpendicular to the hyperplane, b determines the offset of the hyperplane from the origin, and x is the input data. The structure is shown in Fig. 2.

Fig. 2 SVM Structural Graph

Linearly separable data can be classified by many different hyperplanes; the margin is the distance from the training points to the hyperplane. The task is to select the most suitable hyperplane, the one with the greatest margin, as defined by Vapnik and Lerner [7] and Vapnik and Chervonenkis [8]. The SVM maps the input vectors into a higher-dimensional space in which a maximal separating hyperplane is constructed, together with two parallel hyperplanes on either side of it. Every training point must satisfy

yi (w^T xi + b) ≥ 1 (4)

and the margin between the two parallel hyperplanes is

2 / ||w|| (5)
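As a concrete illustration of the hyperplane and margin just defined, the sketch below fits a linear SVM on two toy point clouds and recovers w, b, and the margin 2/||w||. The data and parameters are invented for illustration, and scikit-learn's SVC stands in for the SVM described here.

```python
import numpy as np
from sklearn.svm import SVC

# Two linearly separable point clouds standing in for two classes.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-2.0, 0.5, (20, 2)),
               rng.normal(2.0, 0.5, (20, 2))])
y = np.array([-1] * 20 + [1] * 20)

# A nearly hard-margin linear SVM (large C suppresses slack).
clf = SVC(kernel="linear", C=1e3).fit(X, y)
w, b = clf.coef_[0], clf.intercept_[0]

# Each training point satisfies y_i (w^T x_i + b) >= 1 up to slack,
# and the distance between H1 and H2 is the margin 2 / ||w||.
margins_ok = bool(np.all(y * (X @ w + b) >= 1 - 1e-3))
margin = 2.0 / np.linalg.norm(w)
print(round(margin, 3), margins_ok)
```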
The goal is to find, among all the hyperplanes, the one with the maximum margin, i.e. the optimized hyperplane, as illustrated in Fig. 3. There, two different hyperplanes, A and B, both classify the data; since the margin of hyperplane B is larger than that of hyperplane A, hyperplane B is the better of the two.

Fig. 3 Hyperplane Comparison

2) Kernel Function

With the previous method, data can be classified only in the linear case. In most cases, however, the data are non-linear yet still separable. To classify non-linear data, we adopt different kernel functions to change the data representation: by transforming the raw data into another, higher-dimensional space through a kernel function K(xi, xj) and performing linear classification in that feature space, we may obtain a better classification rate.

In Fig. 4, the map on the left shows the raw data, which is transformed into the higher-dimensional map on the right. The main idea, proposed by Aizerman et al. [9], is to transform the input data from the low-dimensional space into a higher-dimensional space through a mapping function φ; in the high-dimensional space, data that are not linearly separable can be split. Fig. 4 shows both linearly and non-linearly separable cases using different kernel functions.

Fig. 4 Different graphics of Kernel Function

When choosing the analysis data, it is the characteristics of the data that should be selected. The high-dimensional space is also known as the feature space, and the more significant the representative characteristics are, the more precise the data performance will be; the classification then proceeds faster and the complexity of the computation is reduced. We use a kernel function to transform the input data into the feature space, and we choose different kernel functions for different input data; accordingly, the classification results differ as well. Choosing the right kernel function is therefore an important element of the SVM. The typical kernel functions include the following:

Linear: k(xi, xj) = xi^T xj (6)
Polynomial: k(xi, xj) = (1 + xi^T xj)^d (7)
Radial basis function (RBF): k(xi, xj) = exp(-γ ||xi - xj||^2) (8)
Sigmoid: k(xi, xj) = tanh(κ xi·xj - δ) (9)

The characteristics of the four typical kernel functions are as follows:

a) Radial Basis Function (RBF) kernel: the RBF kernel is a non-linear, high-dimensional function whose value depends on the distance between the points. It combines the advantages of the other three kernels, such as convenient parameter settings, a high usage rate, and excellent analysis results, and is therefore applicable to most data.

b) Polynomial kernel: its parameter setting is complex. The polynomial kernel is suitable for face recognition or data with high complexity, and it usually produces good analysis results.

c) Sigmoid kernel: the sigmoid kernel is an unstable kernel function with more constraints and little flexibility in the data it handles. Its output is bounded between -1 and 1, and it is used only in specific data analyses.

d) Linear kernel: the simplest kernel, a plain inner product in the input space; it performs well in linear classification and on multi-class, multi-feature data.

In this paper, the overall steps in constructing the Support Vector Machine for prediction are shown in Fig. 5, including data selection, the choice of classification method and kernel function, the optimization of parameter adjustment, and the analysis of the results.
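The four kernels above can be compared directly in code. The sketch below uses scikit-learn's C-SVC (a wrapper around LIBSVM) on a synthetic four-class data set; the data and the C and gamma values are illustrative assumptions, not the settings used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic four-class data standing in for the filtered gene set.
X, y = make_classification(n_samples=41, n_features=200, n_informative=20,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# C-SVC with each of the four kernel functions discussed above,
# scored by 5-fold cross-validation.
results = {}
for kernel in ["linear", "poly", "rbf", "sigmoid"]:
    scores = cross_val_score(SVC(kernel=kernel, C=1.0, gamma="scale"),
                             X, y, cv=5)
    results[kernel] = scores.mean()
    print(kernel, round(results[kernel], 3))
```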
Fig. 5 Overall SVM Process

a) Choice of classification method and kernel function:
There are five commonly used classification methods: C-SVC, nu-SVC, one-class SVM, epsilon-SVR, and nu-SVR. They can be divided into three major groups of functions:
• Support vector classification: C-SVC, nu-SVC.
• Regression: epsilon-SVR, nu-SVR.
• Interval estimation: one-class SVM.
In this paper, after the database is tested, we identify the main functions among the three groups above and choose the appropriate Support Vector Machine function. C-SVC is judged more appropriate than nu-SVC by the parameter settings, because the nu parameter is usually set below 0.00001 [10]; since such a slight parameter is not applicable to this paper, we adopt C-SVC as the classification method. Finally, we compare the four kernel functions in order to find the optimal analysis and the best kernel function for the research data.

b) Optimization of parameter adjustment:
Different parameter settings yield different forecast results, and manual adjustment takes time without guaranteeing optimal parameters. Therefore, before adjusting the parameters, we run the cross-validation tool of LIBSVM (grid.py) and take the resulting parameters into account, which saves trial-and-error time and improves the accuracy rate. Training the data with the resulting parameter combination then produces good forecast results.

III. EXPERIMENTAL RESULTS

We adopted LIBSVM (Chang and Lin [11]) for the data analysis; for user instructions, please refer to Hsu, Chang, and Lin [12]. The experimental results are split into cross-validation and kernel-function comparison. To ensure the accuracy of the experiment, we compare the nine data sets obtained with p-value thresholds between 0.05 and 0.0001; the results form the basis of our experiment. The cross-validation results are shown below:

TABLE I. RESULTS OF CROSS-VALIDATION

Fig. 6 Cross Validation Results – Ovarian Cancer

The next adjustment is based on the parameters from the cross-validation results. Fig. 6 shows the cross-validation results for the ovarian cancer gene chip database. With these reference statistics, the research progress and the accuracy of the results can be improved. Table II compares the results of the four different kernel functions:
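The parameter search that grid.py performs can be sketched as follows. Here scikit-learn's GridSearchCV plays the role of LIBSVM's grid.py, and the exponentially spaced C and gamma ranges are illustrative, not the values used in the paper.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic multi-class data standing in for the ovarian gene chip set.
X, y = make_classification(n_samples=41, n_features=100, n_informative=15,
                           n_classes=4, n_clusters_per_class=1,
                           random_state=0)

# Exponentially spaced grid over C and gamma, in the spirit of LIBSVM's
# grid.py; cross-validation picks the best-scoring pair.
param_grid = {"C": [2.0 ** k for k in range(-3, 6, 2)],
              "gamma": [2.0 ** k for k in range(-9, 0, 2)]}
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```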
TABLE II. RESULTS OF KERNEL FUNCTION

Fig. 7 Results of Kernel Function – Ovarian Cancer

The analysis results for the ovarian cancer gene chips are better than we predicted. There are significant differences among the four kernel functions, which indicate that the Radial Basis Function is the most suitable function and the sigmoid kernel the least applicable. Through this series of comparisons, we can see that the performance of the Radial Basis Function is the most outstanding, owing to the data type of the gene chip database.

IV. CONCLUSION AND DISCUSSION

In the past few decades, cancer has become a deadly threat to human life, ranking among the top ten causes of death. In this paper, we used an ovarian cancer microarray database to conduct the analysis and to provide a standard operating procedure for gene chip analysis. The accuracy of the results obtained from the Support Vector Machine is about 90%, approximately the same as the results of the literature [3] we refer to, which confirms the capability of the SVM. With parameter adjustment, the performance becomes more stable and can serve as a foundation for future research.

In this paper, we focused on the kernel function of the SVM. Our gene chip database is a high-dimensional database of statistical data. We used linear regression and ANOVA to select the significant genes, but the data remain non-linear; the statistical differences are not significant and the complexity of the classification features is low. After a series of comparisons, we found that the RBF kernel function is the best option for most of the data analysis, because it combines the advantages of the other kernel functions. However, we also found that the RBF kernel does not apply to all situations. During one experiment, a mistake in the feature settings caused a sudden growth in the number of features and categories, which not only produced poor results but also wasted a great deal of time. We therefore tested the faulty database with the linear kernel function; surprisingly, both the results and the analysis time were better than with the RBF kernel, with an accuracy rate of 70%, which was beyond our expectation. Nevertheless, we still recommend the RBF function as the first choice when using a kernel function.

REFERENCES

[1] D. J. Duggan, M. Bittner, Y. Chen, P. Meltzer, and J. M. Trent, "Expression profiling using cDNA microarrays," Nat. Genet., vol. 21, 1999, pp. 10-14.
[2] A. Jemal, A. Thomas, T. Murray, and M. Thun, "Cancer statistics," CA Cancer J. Clin., vol. 52, no. 2, 2002, pp. 22-47.
[3] M. H. Tsai, C. H. Lai, S. S. Yu, and J. D. Chang, "Statistical and SVM-Based Oncogene Detection of Human cDNA Expressions for Ovarian Carcinoma," International Journal of Innovative Computing, Information and Control, vol. 5, no. 9, Sep. 2009.
[4] S. Chatterjee and A. S. Hadi, "Influential Observations, High Leverage Points, and Outliers in Linear Regression," Statistical Science, 1986.
[5] R. V. Hogg and J. Ledolter, Engineering Statistics, Macmillan Publishing Company, 1987.
[6] V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
[7] V. Vapnik and A. Lerner, "Pattern Recognition using Generalized Portrait Method," Automation and Remote Control, vol. 24, 1963.
[8] V. Vapnik and A. Chervonenkis, "A Note on One Class of Perceptrons," Automation and Remote Control, vol. 25, 1964.
[9] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning," Automation and Remote Control, vol. 25, 1964.
[10] B. Schölkopf, A. Smola, R. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, Dec. 2000, pp. 1207-1245.
[11] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[12] C. W. Hsu, C. C. Chang, and C. J. Lin, "A Practical Guide to Support Vector Classification," available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2003.