Bioinformatics System for Gene Diagnostics and Expression Studies

Justin Chan Shao Ling
National University of Singapore, 10 Kent Ridge Crescent, Singapore 119260
[email protected]

Abstract -- This report is based on work undertaken on the CADRA (Coronary Artery Disease Risk Assessment) project. Pattern recognition techniques were employed in the classification of CAD (Coronary Artery Disease) cases using various risk factors and information from microarray gene-expression data. The machine learning algorithms used are the Naïve Bayes Classifier (NBC), C4.5 Decision Tree and Support Vector Machines (SVM). The Chinese samples in the dataset were segregated and investigated using NBC and C4.5, allowing for comparison with 'all-race' classifiers. Lastly, the implementation of a web-based Bioinformatics system using Microsoft technology, undertaken with my colleagues, is described.

1 Introduction

Advances in the Life Sciences and various related technologies have fueled much interest among researchers in the study of genes and diseases. The Biosensors-Focused Interest Group (BFIG) of the National University of Singapore (NUS) was formed with the focus of harnessing DNA microarrays for genetic diagnosis and gene-expression studies.

CADRA (Coronary Artery Disease Risk Assessment) was one of the projects undertaken, involving the study of 30 genes associated with atherosclerosis. This project involves the development and implementation of a web-based front-end, built with Microsoft technologies, for integrating the DNA information, as well as the understanding and implementation of machine learning algorithms for classifying CAD patients.

Coronary Artery Disease (CAD) is a major cause of death in developed countries. It is described as a complex disease owing to its multifactorial nature, and it is further complicated by gene-environment and gene-gene interactions that result in risk-factor variability. CADRA researchers are utilising microarray technology to study various candidate genes involved in CAD. Having obtained this large quantity of candidate gene-expression data, the bioinformatics group of the CADRA project has been engaged in data mining and analysis in relation to CAD. The algorithms used for classification are the Naïve Bayes Classifier, the C4.5 Decision Tree and Support Vector Machines.

2 Experiment Issues and Methodology

Jonatrim 2 is the dataset used for classification. It consists of 29 independent attributes and 1 dependent attribute (i.e. the CAD status). 19 of the independent attributes are categorical and pertain to genotype/phenotype information from microarray analysis (henceforth known as A attributes). Several issues and complications have arisen in the dataset. These include possible noise in the genotype/phenotype classification process, contaminated risk factors (TC, HDL, LDL and TG), missing data and class imbalance.

2.1 Conduct of Training and Testing

The training and testing of the Naïve Bayes Classifier and C4.5 Decision Tree models were carried out as described by Russell and Norvig [1]. The classification results on the test set are quoted for each algorithm in its respective chapter. Average results are given across the entire range of training-set sizes (except for SVM). The following steps and measures were taken in the training and testing of the models.

Sampling was carried out based on a uniform distribution. Test sets were sampled without replacement, while the training set was sampled with replacement when insufficient examples remained. Training-set sizes ranging from 100 to 800 instances (in increments of 100) and from 1000 to 2000 instances (in increments of 200) were used. The test-set size was fixed at 100 samples throughout. SVM calls for a different method of training and testing.
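The sampling scheme described above (test set drawn without replacement; training set drawn with replacement only when too few examples remain) can be sketched as follows. This is a minimal illustration, not the project's actual code; the function name and the representation of examples as tuples are assumptions.

```python
import random

def split_train_test(examples, train_size, test_size=100, seed=0):
    """Sample a test set without replacement, then draw the training set
    from the remaining examples -- falling back to sampling with
    replacement only when the pool is too small, as described above."""
    rng = random.Random(seed)
    test = rng.sample(examples, test_size)            # without replacement
    remaining = [e for e in examples if e not in test]
    if train_size <= len(remaining):
        train = rng.sample(remaining, train_size)     # without replacement
    else:
        train = rng.choices(remaining, k=train_size)  # with replacement
    return train, test
```

Because the training set is drawn only from examples outside the test set, the two never overlap, mirroring the uniform-distribution scheme used for the NBC and C4.5 runs.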
The test set is still fixed at 100 samples, but no resampling is carried out for the training set, whose size is fixed at 500. The Chinese training sets were of size 280 and the corresponding test sets of size 100.

To address the issue of the contaminated attributes TC, HDL, LDL and TG, and for comparative studies with previous investigations [3], the Jonatrim 2 dataset was split into several categorical types:

Type  Attributes
A     All attributes
B     All attributes except A
C     All attributes except TC, HDL, LDL, TG
D     All attributes except A, TC, HDL, LDL, TG
E     A from Type A
F     A
G     All attributes except Race
H     All attributes except A, Race
I     All attributes except TC, HDL, LDL, TG, Race
J     All attributes except A, TC, HDL, LDL, TG, Race
K     A from Type A with Race

Table 1: Description of dataset types used

Within each run of the algorithm concerned, the same test set is used throughout. However, each run using Type F uses a different test set. The training and test sets also differ between algorithms.

2.2 Performance Measure

To evaluate the performance of the algorithms, the various measures in the confusion matrix, as well as others (e.g. sensitivity, predictive value positive), are presented in tabular form for each algorithm. It should be noted that in the medical diagnosis of a condition like CAD, not all classification errors carry the same penalty [4]. Specifically, one would like to avoid classifying CAD cases as non-CAD cases.

3 Introduction to Naïve Bayes Classifier

The Naïve Bayes Classifier (NBC) is a simple yet effective method for pattern classification in cases involving both discrete and continuous attributes. Its basis is rooted in Probability Theory, particularly Bayes' Rule. Although it is less expressive and assumes attributes to be equally important and independent of one another, its simplicity has been known to rival and outperform more complicated classifiers such as decision trees and instance-based learners [5].

The concept of conditional probability in Probability Theory provides the link in obtaining Bayes' Rule, which is

  Pr(Y|X) = Pr(Y) Pr(X|Y) / Pr(X)

Bayes' rule of conditional probability shows how to predict the class of a previously unseen example, given a training sample. The chosen class, y, should be the one that maximises

  Pr(Y=y | X=x) = Pr(Y=y) Pr(X=x | Y=y) / Pr(X=x)

where y represents the target class out of all possible classes (i.e. y ∈ Y) and x = {x1, x2, ..., xn} the attributes used for classification. The probabilities Pr(Y), Pr(X) and the conditional probability Pr(X|Y) are obtained from the training sample. The attributes X1, X2, ..., Xn making up X are statistically independent iff

  Pr(∩_{i∈I} Xi) = ∏_{i∈I} Pr(Xi)   for every subset I of {1, ..., n}   (3.6)

Under this assumption, NBC predicts the class y that maximises

  Pr(Y=y | X=x) = Pr(Y=y) ∏_{i=1}^{n} Pr(Xi=xi | Y=y) / Pr(X=x)   (3.7)

4 Introduction to C4.5 Decision Trees

The divide-and-conquer approach to decision-tree induction, sometimes called top-down induction of decision trees, was developed and refined over many years by Ross Quinlan [6]. Its expressiveness and its ability to generate rules are its major attractions to researchers.

A decision tree is a collection of branches (paths from the root to the leaves), leaves (indicating the class) and nodes (specifying the test to be carried out). Decision trees classify cases as identified through attributes (or features); essentially, they recursively partition regions of the attribute space into sub-regions according to the most 'informative' attribute [12]. As the criterion for selecting the most 'informative' attribute, C4.5 uses the information gain ratio. Originally, only the information gain criterion was used (i.e. in ID3), but this measure was observed to have a strong bias in favour of attributes with many outcomes. These criteria are largely based on Information Theory.

4.1 Information Measures

The following are some of the information measures used in C4.5. The entropy H(X) of a discrete random variable X is defined as

  H(X) = Σ_{x: pX(x)>0} pX(x) log2 (1 / pX(x))

The conditional entropy H(X|Y) of a random variable X given the random variable Y is

  H(X|Y) = Σ_{y: pY(y)>0} pY(y) H(X | Y=y)

The information gain, or mutual information, between random variables X and Y is given by

  I(X;Y) = H(X) − H(X|Y)

The information gain ratio criterion, or gain ratio as it is more commonly known, is thus defined as

  Gain ratio = (H(X) − H(X|Y)) / H(X)

5 Introduction to Support Vector Machines

Support Vector Machines (SVM) are learning systems that use a hypothesis space of linear functions in a high-dimensional feature space, trained with a learning algorithm from Optimisation Theory that implements a learning bias derived from Statistical Learning Theory [7]. The motivation for using a higher-dimensional feature space is the higher probability of encountering linearly separable patterns there. SVM incorporates many ideas, including quadratic programming and the use of kernels.

5.1 Linearly Separable Patterns

Consider a training sample S = ((x1, y1), ..., (xℓ, yℓ)) that is linearly separable in the feature space implicitly defined by the kernel K(xi, xj), and suppose the parameters α* and b* solve the following quadratic optimisation problem:

  maximise  W(α) = Σ_{i=1}^{ℓ} αi − (1/2) Σ_{i,j=1}^{ℓ} yi yj αi αj K(xi, xj)
  subject to  Σ_{i=1}^{ℓ} yi αi = 0,  αi ≥ 0,  i = 1, ..., ℓ

Then the decision rule given by sgn(f(x)), where

  f(x) = Σ_{i=1}^{ℓ} yi αi* K(xi, x) + b*,

is equivalent to the maximal-margin hyperplane in the feature space implicitly defined by the kernel K(xi, xj).

5.2 Linearly Non-Separable Patterns

In actual real-life situations, the data obtained may be contaminated or corrupted with noise, causing patterns to be linearly non-separable. SVM provides a 'soft' approach to classification in such cases through the use of slack variables

  ξi ≥ 0,  i = 1, ..., ℓ

along with the relaxed constraints

  yi (w · xi + b) ≥ 1 − ξi,  i = 1, ..., ℓ

A classifier that generalises well is then found by controlling both the classifier complexity (via w · w) and the number of training errors, minimising the objective function

  (1/2) w · w + C Σ_{i=1}^{ℓ} ξi

subject to the slack-variable and relaxed constraints, with C > 0. The only modification in the dual formulation is the constraint on the αi values, namely

  0 ≤ αi ≤ C,  i = 1, ..., ℓ

instead of αi ≥ 0. The parameter C controls the trade-off between the complexity of the machine and the number of non-separable points.

6 Software Used

The Weka package (version 3.0.1), from the University of Waikato, was used to run NBC and C4.5. For NBC, the package offers kernel density estimation for modelling continuous attributes. For C4.5, the confidence interval was set to 0.1 for pruning. The Jonatrim 2 dataset was preprocessed into the required ARFF format before running the software. References regarding the ARFF format and software features are available [8].

An application was developed to create the models required for SVM classification. The software uses the CSV (comma-separated values) data file format, which is readily supported by major spreadsheet programs such as Microsoft Excel. The polynomial kernel of degree 2 is the kernel used in building the classifier (i.e. K(xi, xj) = (xi · xj + 1)^2).

7 Comparing Results from Different Algorithms

The classification results from the 3 algorithms explored, based on the Type A to Type F datasets, are summarised in the tables below. Note that, for the NBC and C4.5 classifiers, the run that gave the better results for the particular dataset type is quoted here. The results stated here are also still subject to the various issues and methodologies involved in training and testing the classifiers, and serve to indicate the estimated potential of the classifiers.
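The derived statistics quoted in the tables that follow are standard functions of the four confusion-matrix counts. As a quick reference, here is a minimal Python sketch (the function name is illustrative, not from the project code):

```python
def diagnostic_measures(tp, tn, fp, fn):
    """Derive the summary statistics reported in Tables 2-7 from the
    confusion-matrix counts (counts may be raw numbers or percentages)."""
    total = tp + tn + fp + fn
    return {
        "sensitivity": tp / (tp + fn),                # true-positive rate
        "specificity": tn / (tn + fp),                # true-negative rate
        "predictive value positive": tp / (tp + fp),
        "predictive value negative": tn / (tn + fn),
        "success rate": (tp + tn) / total,
        "failure rate": (fp + fn) / total,
    }
```

For example, the NBC column of the Type A table (TP 44.64%, TN 42.36%, FP 7.64%, FN 5.36%) yields a sensitivity of 44.64/50.00 ≈ 0.893 and a success rate of 0.87, matching the tabulated values.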
Statistics                  NBC       C4.5      SVM
True Positive (%)           44.64     44.57     36.00
True Negative (%)           42.36     43.57     33.00
False Positive (%)           7.64      6.43     17.00
False Negative (%)           5.36      5.43     14.00
Sensitivity                 0.8929    0.8914    0.72
Specificity                 0.8471    0.8714    0.66
Predictive Value Positive   0.8548    0.8762    0.679245
Predictive Value Negative   0.8884    0.8916    0.702128
Success Rate                0.87      0.8814    0.69
Failure Rate                0.13      0.1186    0.31

Table 2: Classification result comparison on Type A

Statistics                  NBC       C4.5      SVM
True Positive (%)           45.29     44.64     47.00
True Negative (%)           41.93     44.57     39.00
False Positive (%)           8.07      5.43     11.00
False Negative (%)           4.71      5.36      3.00
Sensitivity                 0.9057    0.8929    0.94
Specificity                 0.8386    0.8914    0.78
Predictive Value Positive   0.8488    0.8938    0.810345
Predictive Value Negative   0.8996    0.8945    0.928571
Success Rate                0.8721    0.8921    0.86
Failure Rate                0.1279    0.1079    0.14

Table 3: Classification result comparison on Type B

Statistics                  NBC       C4.5      SVM
True Positive (%)           41.36     42.57     36.00
True Negative (%)           43.07     44.86     35.00
False Positive (%)           6.93      5.14     15.00
False Negative (%)           8.64      7.43     14.00
Sensitivity                 0.8271    0.85143   0.72
Specificity                 0.8614    0.89714   0.7
Predictive Value Positive   0.8629    0.8969    0.705882
Predictive Value Negative   0.8397    0.86299   0.714286
Success Rate                0.8443    0.87429   0.71
Failure Rate                0.1557    0.12571   0.29

Table 4: Classification result comparison on Type C

Statistics                  NBC       C4.5      SVM
True Positive (%)           44.00     41.36     44.00
True Negative (%)           37.21     43.07     45.00
False Positive (%)          12.79      6.93      5.00
False Negative (%)           6.00      8.64      6.00
Sensitivity                 0.88      0.8271    0.88
Specificity                 0.7443    0.8614    0.9
Predictive Value Positive   0.7751    0.8629    0.897959
Predictive Value Negative   0.8624    0.8397    0.882353
Success Rate                0.8121    0.8443    0.89
Failure Rate                0.1879    0.1557    0.11

Table 5: Classification result comparison on Type D

Statistics                  NBC       C4.5      SVM
True Positive (%)           28.57     26.79     24.00
True Negative (%)           31.71     27.29     31.00
False Positive (%)          18.29     22.71     19.00
False Negative (%)          21.43     23.21     26.00
Sensitivity                 0.5714    0.5357    0.48
Specificity                 0.6343    0.5457    0.62
Predictive Value Positive   0.6101    0.5398    0.55814
Predictive Value Negative   0.5982    0.5471    0.54386
Success Rate                0.6029    0.5407    0.55
Failure Rate                0.3971    0.4593    0.45

Table 6: Classification result comparison on Type E

Statistics                  NBC       C4.5      SVM
True Positive (%)           30.07     31.07     39.00
True Negative (%)           25.29     23.50     22.00
False Positive (%)          24.71     26.50     28.00
False Negative (%)          19.93     18.93     11.00
Sensitivity                 0.601429  0.621429  0.78
Specificity                 0.505714  0.47      0.44
Predictive Value Positive   0.549736  0.54034   0.58209
Predictive Value Negative   0.55879   0.557516  0.666667
Success Rate                0.553571  0.545714  0.61
Failure Rate                0.446429  0.454286  0.39

Table 7: Classification result comparison on Type F

As the tables above show, the learning algorithms generally perform better on the Type A, B, C and D datasets. The exception is SVM, which performs only averagely on Types A and C. Performance on Types E and F, which use only the A attributes, is poor. The highest success rate and the lowest false-negative rate for each type are underlined for reference.

8 Classification Based on Chinese Samples

It was shown that, for both algorithms explored (i.e. NBC and C4.5), better classification was generally observed when Chinese-trained classifiers were used on Chinese test cases. This may be due to smaller genetic and environmental variation within the race. The best result was obtained by a Type G C4.5 decision tree, with a success rate of 95%.

9 CADRA's Web-Based Bioinformatics System

A web-based Bioinformatics system providing a patient database, data mining and the delivery of microarray gene-expression data has been proposed, and its prototype has been implemented by my colleagues and myself using various Microsoft technologies.

A web server was set up at the Clinical Research Centre to house the website. The server runs the Windows 2000 operating system with Internet Information Server 5. Microsoft's Active Server Pages 3.0 (ASP 3.0) was used, with Access 2000 as the prototype database. Microsoft BackOffice 4.5 and SQL Server 7 have also been installed. Entry into the website is at http://bfigcad.nus.edu.sg/cad_ra/login.asp.

9.1 Variations on the CADRA Prototype

The prototype of the CADRA web-based system is a generic template for more elaborate structures. One such possible structure is described in the figure below.

[Figure 1: Alternative structure of service delivery with the Application and Internet Service Provider separate from the Microarray Lab. Actors: Clinic (Clinic Assistant, Doctor), Application and Internet Service Provider (Administrator), Microarray Lab (Lab Technician), connected via the Internet.]

It was required that other technologies and products be explored for the project, and several discoveries were made. Java technology from Sun Microsystems offers a viable alternative to Microsoft's products. In combination with the Linux operating system, the Apache web server and the MySQL database, this solution offers a low-cost yet robust option for the Internet and Application Service Provider.

10 Conclusion

Good classification results were generally obtained for all 3 algorithms used, except when only microarray gene-expression data were used. It is felt that, for SVM, better results could be obtained using other kernels. Furthermore, better classification results were generally observed when Chinese test sets were shown to Chinese-trained classifiers as opposed to classifiers trained on all races, for NBC and C4.5. Lastly, a web-based bioinformatics system implemented with my colleagues for both medical information retrieval and CAD risk assessment has been described.

References

[1] Stuart Russell & Peter Norvig, Artificial Intelligence: A Modern Approach, Prentice Hall, 1995, p. 538.
[2] Gary M. Weiss & Foster Provost, The Effects of Class Distribution on Classifier Learning, Rutgers University, 2001.
[3] Chin W.C., Gene Identification of DNA Microarray Patterns Using Neural Network Techniques, National University of Singapore, 2000.
[4] Congdon, C. B., A Comparison of Genetic Algorithms and Other Machine Learning Systems on a Complex Classification Task from Common Disease Research, University of Michigan, 1995.
[5] Domingos, P. & Pazzani, M., Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier, Proceedings of the 13th International Conference on Machine Learning, Morgan Kaufmann, San Francisco, CA, 1996, pp. 105-112.
[6] Ross Quinlan, C4.5: Programs for Machine Learning, Morgan Kaufmann, 1993, p. vii.
[7] Nello Cristianini & John Shawe-Taylor, An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000, pp. 7, 30-31, 93-110.
[8] Witten, I. & Frank, E., Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, Morgan Kaufmann, 2000, pp. 265-286.