2011 International Conference on Environment Science and Engineering
IPCBEE vol. 8 (2011) © (2011) IACSIT Press, Singapore

Human Ovarian Carcinoma Microarray Data Analysis Based on Support Vector Machines with Different Kernel Functions

Meng-Hsiun Tsai 3,4, Shih-Huei Wang 4, Kun-Cheng Wu 3, Jui-Ming Chen 1,2*, Sheng-Hsiung Chiu 5

1 Department of Endocrinology and Metabolism, Tungs' Taichung MetroHarbor Hospital, Taiwan, R.O.C.
2 Department of Biomedical Informatics, Asia University, Taiwan, R.O.C.
3 Department of Management Information Systems, National Chung Hsing University, Taiwan, R.O.C.
4 Institute of Genomics and Bioinformatics, National Chung Hsing University, Taiwan, R.O.C.
5 Troilus Biotechnology Co., Ltd, Taiwan

*Corresponding author: J. M. Chen. Tel: +886-4-26581919 ext. 4305.

Abstract: According to statistics from the Department of Health, both the incidence and the mortality rate of ovarian cancer rank in the top ten in Taiwan. Moreover, ovarian cancer has the highest mortality rate among gynecologic cancers. In this research, we use an ovarian cancer gene chip as the basis of the analysis and address two typical problems of gene chip data: a large number of variables and an insufficient number of samples. The gene chip database contains 9,600 ovarian DNA expressions from 41 samples. We use linear regression and analysis of variance (ANOVA) for data pre-processing in order to reduce the number of genes and find the most valuable ones. The resulting data are then classified by a Support Vector Machine (SVM), and the results obtained with different kernel functions are compared. Our research finds that the SVM classifies the data well and that the results change with the kernel function. Finally, we discuss the results in order to identify the most precise and efficient kernel function.

Keywords-ovarian cancer; gene chip; linear regression; ANOVA; SVM

I. INTRODUCTION

A gene chip can store a large number of genes and analyze thousands of them at the same time. It has therefore developed into a mature bio-chip technology for research on gene expression. The chip is usually used as a tool for large-scale screening and parallel analysis. It can be applied to gene expression comparison and gene sequence analysis, because with bio-chip technology researchers can observe thousands or even tens of thousands of gene expression values at the same time [1]. Furthermore, for cases with too many variables or too few samples, we use linear regression and analysis of variance to solve the problem.

In Taiwan, cancer ranks first among the causes of death. In this paper, we study ovarian cancer, one of the common gynecological cancers. It is also the fifth leading cause of cancer deaths for women in the western world [2,3]. The incidence of ovarian cancer is second only to that of cervical cancer, and its fatality rate ranks first. In addition, ovarian cancer is difficult to diagnose from its symptoms, and it is hard to estimate when a benign tumor becomes malignant. It is therefore necessary to study methods for distinguishing benign tumors from malignant tissue.

In this paper, we collect samples from 41 patients, including benign ovarian tumors, borderline tumors, ovarian cancer at stage I and ovarian cancer at stage III, for the ovarian DNA expression database, with a total of 9,600 genes. As the last step of data pre-processing, we use linear regression and ANOVA to analyze the ovarian cancer gene chips. The Support Vector Machine (SVM) is then used to classify each pathological stage. Afterwards, the results of four kernel functions are compared and discussed. Finally, we determine the best parameter values and the most suitable kernel function. The flowchart is shown in Fig. 1.

Fig. 1 Research Flow

II. GENE DATA AND RESEARCH METHODS

This section first introduces the ovarian cancer gene chips used in the study and then discusses the analysis of the data. Since the gene chip has too many variables and too few samples, we reduce the dimension after loading the gene database. Once the dimension of the data has been lowered, we identify the genes with high significance. Finally, the selected genes are passed to the SVM for classification and final analysis.

A. Gene Chip Data Analysis

The ovarian DNA expression database used in this paper consists of clinical data from China Medical University Hospital, Taichung, Taiwan, provided by Meng-Hsiun Tsai, Ph.D., of the Department of Management Information Systems and the Institute of Bioinformatics, National Chung Hsing University. The database was constructed between 2001 and 2003 and contains ovarian tissues of 41 patients at different pathological stages of ovarian cancer: benign ovarian tumor (OVT), borderline tumor (BOT), ovarian cancer at stage I (OVCA-I), and ovarian cancer at stage III (OVCA-III). The numbers of OVT, BOT, OVCA-I and OVCA-III samples are 13, 6, 7 and 15, respectively, and each sample covers 9,600 genes. The genes are loaded into the program to reduce the dimension and identify highly significant genes for the subsequent analysis.

B. Linear Regression

Linear regression analysis is a statistical method for modeling the relationship between a set of independent variables X and a set of dependent variables Y. The relationship between the independent and dependent variables can be a positive correlation, a negative correlation or no correlation. In addition, the relationship can be linear or nonlinear. The linear regression model is as follows:

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i \tag{1}$$

where $\beta_0$ is the intercept, $\beta_1$ is the regression coefficient, $y_i$ is the dependent variable, $x_i$ is the independent variable, and $\varepsilon_i$ is the residual. In this paper, linear regression analysis is used to reduce the dimension of the large amount of data and to find the high-performance and low-performance genes in each sample. The residual is the error term and can be used to assess the estimate produced by the linear regression [4]. This study sums the residual values of each gene and ranks the genes by this sum. The 100 genes with the smallest residual values and the 100 genes with the largest residual values are then selected as the target genes.

C. Analysis of Variance (ANOVA)

Analysis of variance (ANOVA) is a statistical method that measures the variation in a data set. By observing the differences between groups of data, we can determine whether the differences are significant and whether a value has an obvious influence on the content of the data. This research uses the probability value (p-value) as the screening criterion to test whether each gene differs significantly across the samples. In this paper, the p-value threshold is set between 0.05 and 0.00001. If the test results show significant differences, the gene chip database can be compared and analyzed further.

D. Support Vector Machine (SVM)

The Support Vector Machine (SVM), proposed by Vapnik and collaborators, is a supervised learning algorithm used for classification and regression [6]. Because of its strong learning and classification ability, the SVM is widely applied in bioinformatics, image recognition and text mining. Machine learning in general has also become a frequently used method in research in recent years. The main elements of the SVM are introduced as follows.

1) Hyperplane

To build a support vector machine, we use a simple linear hyperplane to perform the classification. The hyperplanes are defined by the following formulas:

$$H_1: w^T x + b = 1 \tag{2}$$

$$H_2: w^T x + b = -1 \tag{3}$$

Here $w$ is the vector perpendicular to the hyperplane, $b$ is the bias term that determines the offset of the hyperplane from the origin (the perpendicular distance is $|b| / \lVert w \rVert$), and $x$ is the input data. The structure is shown in Fig. 2.

Fig. 2 SVM Structural Graph

Linearly separable data can be separated by many different hyperplanes. The margin is the distance from the hyperplane to the nearest training point. The task is to select the most suitable hyperplane, the one with the greatest margin, as defined by Vapnik and Lerner [7] and Vapnik and Chervonenkis [8]. The SVM maps the input vectors to a higher-dimensional space, where a maximal separating hyperplane is constructed. Two parallel hyperplanes are constructed, one on each side of the hyperplane that separates the data. The distance between them and the constraint on the training points are:

$$\text{margin} = \frac{2}{\lVert w \rVert} \tag{4}$$

$$y_i \left( w^T x_i + b \right) \ge 1 \tag{5}$$

The goal is to find, among all the hyperplanes, the one with the maximum margin, that is, the optimal hyperplane, as illustrated in Fig. 3. Two different hyperplanes, A and B, classify the data. Since the margin of hyperplane B is larger than that of hyperplane A, hyperplane B is the better of the two.

Fig. 3 Hyperplane Comparison

2) Kernel Function

With the method described so far, data can only be classified when they are linearly separable. In most cases, however, the data are non-linear but still separable. For such data, we may obtain a better classification rate by transforming the raw data into another, higher-dimensional space through a kernel function $K(x_i, x_j)$ and performing the linear classification in that feature space. In Fig. 4, the map on the left is the raw data, which is transformed into the map on the right in a higher-dimensional space. The main concept is to transform the input data from the low-dimensional space into a higher-dimensional space by a function $\varphi$. In the high-dimensional space, data that are not linearly separable in the original space can be split linearly, an idea proposed by Aizerman et al. [9]. Fig. 4 shows the results for both linearly and non-linearly separable cases using different kernel functions.

Fig. 4 Different graphics of Kernel Function

When choosing the data for analysis, it is the characteristics (features) of the data that should be selected. The high-dimensional space is also known as the feature space, and the more significant the representative features are, the more precisely the data are described. As a result, classification is faster and the computation is simpler. We use the kernel function to transform the input data into the feature space and choose different kernel functions for different input data. The classification results differ accordingly, so choosing the right kernel function is an important part of using an SVM. The types of kernel function include the following:

Linear:
$$k(x_i, x_j) = x_i^T x_j \tag{6}$$

Polynomial:
$$k(x_i, x_j) = \left( 1 + x_i^T x_j \right)^d \tag{7}$$

Radial basis function (RBF):
$$k(x_i, x_j) = \exp\left( -\gamma \lVert x_i - x_j \rVert^2 \right) \tag{8}$$

Sigmoid:
$$k(x_i, x_j) = \tanh\left( \kappa \, x_i \cdot x_j - \delta \right) \tag{9}$$

The characteristics of the four typical kernel functions are as follows:

a) Radial Basis Function (RBF) kernel: The RBF kernel is a non-linear, high-dimensional mapping whose value depends on the distance between the two input vectors. It shares the advantages of the other three kernel functions, such as convenient parameter setting, wide use and excellent analysis results, and is therefore applicable to most data.

b) Polynomial kernel: The parameter setting is complex. The polynomial kernel is suitable for face recognition and other highly complex data, and it usually produces good analysis results.

c) Sigmoid kernel: The sigmoid kernel is an unstable kernel function with more constraints and little flexibility. It is applied to data ranging from 0 to 1 and is used only in specific analyses.

d) Linear kernel: The linear kernel is the plain inner product of the input vectors. It performs well in linear classification and on multi-class data with many features.

In this paper, the overall steps for constructing a Support Vector Machine for prediction are shown in Fig. 5: data selection, choice of classification method and kernel function, optimization of the parameters, and analysis of the results.

Fig. 5 Overall SVM Process

a) Choice of classification method and kernel function: There are five commonly used methods: C-SVC, nu-SVC, one-class SVM, epsilon-SVR and nu-SVR. They fall into three major groups:

• Support vector classification: C-SVC, nu-SVC.
• Regression: epsilon-SVR, nu-SVR.
• Distribution estimation: one-class SVM.

After testing the database, we identify which of these three groups matches the task and choose the appropriate SVM method. Between C-SVC and nu-SVC, C-SVC was judged more appropriate because the nu parameter is usually set below 0.00001 [9], and such a small value is not applicable to this study. We therefore adopt C-SVC as the classification method. Finally, we compare four kernel functions in order to find the best analysis and the best kernel function for the research data.

b) Optimization of parameter adjustment: Different parameter settings lead to different prediction results. Tuning the parameters by hand takes time, and the parameters obtained may still not be optimal. Therefore, before adjusting the parameters, we run the cross-validation tool of LIBSVM (grid.py) and take the resulting parameters into account, which saves trial-and-error time and improves the accuracy rate. Training the data with the parameter combination found by this search gives good prediction results.

III. EXPERIMENTAL RESULTS

We adopted LIBSVM (Chang and Lin [11]) for data analysis. For user instructions, please refer to Hsu, Chang, and Lin [12]. The experimental results are divided into cross-validation results and kernel function results. To ensure the accuracy of the experiment, we compare 9 data sets selected with p-values between 0.05 and 0.0001. The results obtained form the basis of our experiment. Table I shows the results of the cross-validation.

TABLE I. RESULTS OF CROSS-VALIDATION

Fig. 6 Cross Validation Results – Ovarian Cancer

The next adjustment is based on the parameters from the cross-validation results. Fig. 6 shows the cross-validation results for the ovarian cancer gene chip database. With these reference statistics, both the progress of the research and the accuracy of the results can be improved. Table II compares the results of the four kernel functions.

TABLE II. RESULTS OF KERNEL FUNCTION

Fig. 7 Results of Kernel Function – Ovarian Cancer

The analysis results for the ovarian cancer gene chips are better than we predicted. There are significant differences between the four kernel functions, which indicate that the radial basis function is the most suitable and the sigmoid kernel function is the least applicable. Throughout the comparisons, the radial basis function performs best, which we attribute to the data type of the gene chip database.

IV. CONCLUSION AND DISCUSSION

In the past few decades, cancer has become a deadly threat to human life and has ranked among the top ten causes of death. In this paper, we used an ovarian cancer microarray as the database for the analysis. We also aimed to provide a standard operating procedure for gene chip analysis. The accuracy rate obtained with the Support Vector Machine is about 90%, approximately the same as the results in the literature [3] we refer to, which confirms the capability of the SVM. With parameter adjustment, the performance becomes more stable and can serve as the foundation of future research.

In this paper, we focused on the kernel function of the SVM. Our gene chip database is a high-dimensional database of statistical data. We used linear regression and ANOVA to select the significant genes, but the data remain non-linear. The statistical differences are not significant, and the complexity of the classification features is low. After a series of comparisons, we found that the RBF kernel function is the best option for most data analysis, because it combines the advantages of the other kernel functions.

However, we also found that the RBF kernel function does not suit all situations. During one experiment, a mistake in the feature settings caused a sudden increase in the number of features and categories, which not only produced poor results but also wasted a lot of time. We therefore tested this erroneous database with the linear kernel function. Surprisingly, both the results and the analysis time were better than with the RBF kernel function, with an accuracy rate of 70%, which was beyond our expectation. Nevertheless, we still recommend the RBF function as the first choice when selecting a kernel function.

REFERENCES

[1] D. J. Duggan, M. Bittner, Y. Chen, P. Meltzer, and J. M. Trent, "Expression profiling using cDNA microarrays," Nat Genet, vol. 21, 1999, pp. 10-14.
[2] A. Jemal, A. Thomas, T. Murray, and M. Thun, "Cancer statistics," CA. Cancer J. Clin., vol. 52, no. 2, 2002, pp. 22-47.
[3] M. H. Tsai, C. H. Lai, S. S. Yu, and J. D. Chang, "Statistical and SVM-Based Oncogene Detection of Human cDNA Expressions for Ovarian Carcinoma," International Journal of Innovative Computing, Information and Control, vol. 5, no. 9, Sep 2009.
[4] S. Chatterjee and A. S. Hadi, "Influential Observations, High Leverage Points, and Outliers in Linear Regression," Statistical Science, 1986.
[5] R. V. Hogg and J. Ledolter, "Engineering Statistics," MacMillan Publishing Company, 1987.
[6] V. N. Vapnik, "The Nature of Statistical Learning Theory," Springer, New York, 1995.
[7] V. Vapnik and A. Lerner, "Pattern Recognition using Generalized Portrait Method," Automation and Remote Control, vol. 24, 1963.
[8] V. Vapnik and A. Chervonenkis, "A Note on one Class of Perceptrons," Automation and Remote Control, vol. 25, 1964.
[9] M. A. Aizerman, E. M. Braverman, and L. I. Rozonoer, "Theoretical Foundations of the Potential Function Method in Pattern Recognition Learning," Autom. Remote Control, vol. 25, 1964.
[10] B. Schölkopf, A. Smola, R. Williamson, and P. L. Bartlett, "New support vector algorithms," Neural Computation, Dec 2000, pp. 1207-1245.
[11] C. C. Chang and C. J. Lin, "LIBSVM: a library for support vector machines," Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm, 2001.
[12] C. W. Hsu, C. C. Chang, and C. J. Lin, "A Practical Guide to Support Vector Classification," Available at http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf, 2003.
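
As a supplementary illustration, the four kernel functions compared in this paper can be sketched in a few lines of Python. This is a minimal sketch for orientation only, not the code used in the experiments. The function names are hypothetical, and the parameter names `gamma`, `coef0` and `degree` are assumptions borrowed from LIBSVM's parameterization; with `gamma=1` and `coef0=1` the polynomial form reduces to (1 + x_i^T x_j)^d, and a negative `coef0` plays the role of the offset subtracted inside the sigmoid kernel.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """Linear kernel: the plain inner product u^T v."""
    return dot(u, v)


def polynomial_kernel(u, v, degree=3, gamma=1.0, coef0=1.0):
    """Polynomial kernel: (gamma * u^T v + coef0) ** degree."""
    return (gamma * dot(u, v) + coef0) ** degree


def rbf_kernel(u, v, gamma=0.5):
    """RBF kernel: exp(-gamma * ||u - v||^2)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * squared_distance)


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    """Sigmoid kernel: tanh(gamma * u^T v + coef0)."""
    return math.tanh(gamma * dot(u, v) + coef0)


def gram_matrix(samples, kernel):
    """Kernel value for every pair of samples (the matrix an SVM trains on)."""
    return [[kernel(a, b) for b in samples] for a in samples]


if __name__ == "__main__":
    # Two made-up expression vectors, used only to exercise the functions.
    samples = [[0.2, 1.5, 0.7], [0.9, 0.1, 1.1]]
    for kernel in (linear_kernel, polynomial_kernel, rbf_kernel, sigmoid_kernel):
        print(kernel.__name__, gram_matrix(samples, kernel))
```

For any set of samples, `gram_matrix(samples, rbf_kernel)` is symmetric with ones on the diagonal, because the RBF kernel of a vector with itself is exp(0) = 1. The other kernels have no such fixed scale, which is one reason scaling the inputs before training matters more for them.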