ISSN: 2249-5789. V Subha et al, International Journal of Computer Science & Communication Networks, Vol 5(6), 386-390. IJCSCN | Dec 2015.

COMPARATIVE ANALYSIS OF SUPPORT VECTOR MACHINE ENSEMBLES FOR HEART DISEASE PREDICTION

V. Subha, M. Revathi, D. Murugan
Department of Computer Science and Engineering, Manonmaniam Sundaranar University, Tirunelveli-12.
[email protected], [email protected], [email protected]

Abstract

A heart attack occurs when the blood flow to a part of the heart is blocked by a blood clot. If the clot cuts off the blood flow completely, the part of the heart muscle supplied by that artery begins to die. There is currently no cure for heart attack, but the risk can be controlled by quitting smoking, lowering cholesterol, controlling high blood pressure, maintaining a healthy weight, and exercising. Diagnosis generally involves many tests that require clustering or classification of large-scale data. In this work, heart disease is predicted using support vector machine ensembles implemented in MATLAB. The aim of this paper is to analyze the performance of the Support Vector Machine (SVM) classifier and of ensemble classifier methods such as Bagging, Boosting and Random Subspace for heart disease prediction. The accuracies of the different classification algorithms are compared to identify the most effective algorithm for heart disease prediction.

Keywords: Data Mining, Statlog heart dataset, Support Vector Machine, Ensemble Classifiers.

1. Introduction

Data mining is the process of extracting information from data; it is also called knowledge discovery. Data mining has become more and more popular for analyzing large amounts of data in the past few years. The most important and popular data mining techniques are classification and clustering. In this paper, the SVM classifier and ensemble methods such as Bagging, Boosting and Random Subspace are investigated for heart disease prediction.

The term heart disease is often used interchangeably with cardiovascular disease. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels that can lead to a heart attack. Symptoms of a heart attack include discomfort, pressure, heaviness, or pain in the chest, arm, or below the breastbone; discomfort radiating to the back, jaw, throat, or arm; indigestion; sweating; nausea; vomiting; extreme weakness; anxiety; shortness of breath; and rapid or irregular heartbeats. Initial symptoms may start as mild discomfort that progresses to significant pain. In general, numerous tests must be conducted on a patient to detect a disease. Data mining techniques are used in disease prediction to reduce the number of tests and increase the accuracy of detection.

Figure 1. Block Diagram (Dataset → Classification Techniques: SVM and Ensemble Classifiers (Bagging, Boosting, Random Subspace) → Performance Evaluation)

2. Related work

Abdulkadir Sengur [1] proposed Support Vector Machine ensembles for intelligent diagnosis of valvular heart disease. The model employs ensemble learning to improve Support Vector Machine classifiers. Sellappan Palaniappan and Rafiah Awang [2] proposed a Heart Disease Decision Support System (HDDSS) using data mining classification modeling techniques. The model employs three data mining techniques, namely Decision Tree, Naïve Bayes and Neural Network. Tzung-I Tang, et al. [3] compared decision tree and system reconstruction analysis as applied to heart disease medical data mining. Sumit Bhatia, et al. [4] proposed a decision support system for heart disease classification based on the support vector machine (SVM) and an integer-coded genetic algorithm (GA).
For selecting the important and relevant features and discarding the irrelevant and redundant ones, an integer-coded genetic algorithm is used, which also maximizes the SVM's classification accuracy. Asha Rajkumar and Sophia Reena [5] used the Tanagra tool for classification and compared the results: the accuracy of Naïve Bayes is 52.33%, of Decision List 52%, and of KNN 45.67%. Leo Breiman [6] proposed bagging predictors, which generate multiple versions of a predictor and use them to obtain an aggregated predictor with improved accuracy. Marina Skurichina and Robert P. W. Duin [7] proposed bagging, boosting and the random subspace method for improving weak classifiers. Abdulkadir Sengur [8] proposed ensemble learning classifiers for the diagnosis of valvular heart disorders; evaluated on a data set containing 215 samples, the ensemble methods achieved a 95.9% sensitivity and 96% specificity rate. Subha, et al. [9] applied a genetic algorithm and SVM to find relevant features for cardiotocogram classification. Resul Das, et al. [10] proposed a neural network ensemble method that creates new models by combining the posterior probabilities of the predicted values from multiple predecessor models; it obtained 97.4% classification accuracy in experiments on a data set containing 215 samples, with 100% sensitivity and 96% specificity in valvular heart disease diagnosis.

3. Dataset description

In this work, the Statlog Heart Dataset [11] from the UCI machine learning repository is used. The dataset contains information concerning heart disease diagnosis: 270 instances and 14 attributes, the last of which is the class attribute with 2 classes. The 14 attributes are as follows:

Table 1. Description of attributes

  Sl. No.  Attribute
  1        Age
  2        Gender
  3        Chest Pain (cp)
  4        trestbps: resting blood pressure
  5        Cholesterol
  6        fbs: fasting blood sugar
  7        restecg: resting electrocardiographic results
  8        thalach: maximum heart rate
  9        exang: exercise induced angina
  10       Oldpeak
  11       Slope
  12       ca: number of major vessels
  13       Thal
  14       Class variable

The 10-fold cross-validation technique is used to split the data. In 10-fold cross-validation, the data is divided into 10 parts of approximately equal size that together form the full dataset. The learning procedure executes 10 times on the training sets, and the accuracy rates of the 10 test sets are averaged to yield an overall accuracy rate. A confusion matrix is used to present the accuracy of the classifiers obtained through classification.

4. Methodology

4.1 SVM Classifier

SVM is a commonly used technique for data classification. SVM produces a model that predicts the target values of data instances in the testing set. The Support Vector Machine (SVM) is used when the data has exactly two classes. An SVM classifier classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane is the one with the largest margin between the two classes, where the margin is the maximal width of the slab parallel to the hyperplane that has no interior data points.

The SVM formulation is as follows. Given a set of training data {(a₁, b₁), ..., (a_l, b_l)}, where each aᵢ ∈ S denotes an input-space sample, bᵢ denotes the corresponding target value, i = 1, 2, ..., l, and l is the size of the training data, the primal problem is

  min J(D, ξ) = (1/2) ||D||² + C ∑ᵢ₌₁ˡ ξᵢ

where C is the capacity-control constant and ξᵢ is the slack factor. The optimization problem can be rewritten in dual form as

  max M(α) = ∑ᵢ₌₁ˡ αᵢ − (1/2) ∑ᵢ,ⱼ₌₁ˡ αᵢ αⱼ bᵢ bⱼ K(aᵢ, aⱼ)

subject to ∑ᵢ₌₁ˡ αᵢ bᵢ = 0 and αᵢ ∈ [0, C], i = 1, 2, ..., l, where K(aᵢ, aⱼ) is the kernel function. The optimal hyperplane with maximal margin is

  ∑_sv αᵢ bᵢ K(a, aᵢ) + b = 0

and the SVM decision function for nonlinear classification in the input space is

  f(a) = sgn( ∑_sv αᵢ bᵢ K(aᵢ, a) + b )

where the sums run over the support vectors.

4.2 Ensemble Classifier (EC)

Ensemble data mining methods, also known as committee methods or model combiners, are machine learning methods that use the power of many models to achieve better accuracy than the individual models. The following ensemble methods, which have been commonly used for combining weak classifiers, are used in this work: bagging, boosting and the random subspace method.

4.2.1 Bagging

Bootstrap aggregation, or bagging, is a technique that can be used with many classification and regression methods to improve prediction by reducing the variance associated with it. It is a simple technique: many bootstrap samples are drawn from the available data, a prediction method is applied to each bootstrap sample, and the results are combined, by averaging for regression and simple voting for classification, to obtain the overall prediction, with the variance reduced by the averaging.

4.2.2 Boosting

The AdaBoost family of algorithms, also known as boosting, is another category of powerful ensemble methods. Boosting changes the distribution of sample weights. Initially the weights are uniform over all training samples; they are adjusted after the training of each classifier is completed. The weights of misclassified samples are increased, while those of correctly classified samples are decreased. The final ensemble is constructed by combining the individual classifiers according to their own accuracies.

4.2.3 Random Subspace

In the random subspace method, feature subspaces are picked at random from the original feature space, and individual classifiers are built using the original training set restricted to the attributes in the chosen feature subspaces. The outputs of the individual classifiers are combined by uniform majority voting to give the final prediction.
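The 10-fold split and the bagging procedure described above can be sketched in code. The paper's experiments were run in MATLAB with SVM base classifiers; the following Python sketch only illustrates the mechanics, and it substitutes a trivial nearest-centroid learner for the SVM (an assumption made here so the example stays self-contained):

```python
import random
from collections import Counter

def k_fold_splits(n, k=10, seed=0):
    # Shuffle the indices 0..n-1 and deal them into k nearly equal folds;
    # each fold serves once as the test set, the rest as the training set.
    rng = random.Random(seed)
    idx = list(range(n))
    rng.shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    return [(sorted(set(idx) - set(fold)), fold) for fold in folds]

def nearest_centroid_fit(X, y):
    # Stand-in base learner (the paper uses SVMs): store each class centroid.
    centroids = {}
    for label in set(y):
        rows = [x for x, t in zip(X, y) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def nearest_centroid_predict(centroids, x):
    # Choose the class whose centroid is nearest in squared Euclidean distance.
    return min(centroids,
               key=lambda c: sum((a - b) ** 2 for a, b in zip(x, centroids[c])))

def bagging_fit(X, y, n_models=25, seed=0):
    # Bagging: draw bootstrap samples (with replacement) from the training
    # data and fit one base model on each sample.
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(nearest_centroid_fit([X[i] for i in idx],
                                           [y[i] for i in idx]))
    return models

def bagging_predict(models, x):
    # Combine the individual predictions by simple majority voting.
    votes = Counter(nearest_centroid_predict(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Boosting and the random subspace method differ only in how the per-model training sets are formed: boosting reweights the samples after each round, while random subspace draws a random subset of feature indices instead of a bootstrap sample of rows.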
5. Result Analysis

Different metrics can be used for evaluating the performance of classifiers. In this work, the performance metrics Accuracy, Sensitivity, Specificity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are used to evaluate the classifiers. The formulas for these metrics are given below:

  Sensitivity(%) = TP / (TP + FN) × 100
  Specificity(%) = TN / (TN + FP) × 100
  PPV(%) = TP / (TP + FP) × 100
  NPV(%) = TN / (TN + FN) × 100
  Accuracy(%) = (TP + TN) / (TP + TN + FP + FN) × 100

where
  TP - number of positive instances correctly classified as positive,
  TN - number of negative instances correctly classified as negative,
  FP - number of negative instances misclassified as positive,
  FN - number of positive instances misclassified as negative.

5.1 Experimental Results

The performance measures of the Support Vector Machine (SVM) and the ensemble classifiers are given in Table 2. The results are also shown graphically in Figure 2.

Table 2. Performance Analysis of SVM and Ensemble Classifier Methods

  Metric        SVM     Bagging   Boosting   Random Subspace
  Accuracy      73.70   81.85     83.22      80.00
  Sensitivity   73.78   81.49     83.00      77.90
  Specificity   73.70   81.02     82.12      77.20
  PPV           74.05   81.67     82.40      80.08
  NPV           73.50   80.56     84.00      80.00

Figure 2. Performance Analysis of SVM and Ensemble Classifier Methods (bar chart of the values in Table 2)

The experimental results show that the SVM classifier achieved a classification accuracy of 73.7%, Bagging 81.85%, Random Subspace 80% and Boosting 83.22%. It is clear that the Boosting method performs better than the other techniques in terms of accuracy, sensitivity, specificity, PPV and NPV.
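The metric formulas above translate directly into code. A minimal sketch (the counts passed in below are illustrative, not the paper's actual confusion-matrix entries):

```python
def classification_metrics(tp, tn, fp, fn):
    # Compute the five evaluation metrics (in %) from confusion-matrix counts.
    total = tp + tn + fp + fn
    return {
        "sensitivity": 100.0 * tp / (tp + fn),  # true positive rate
        "specificity": 100.0 * tn / (tn + fp),  # true negative rate
        "ppv": 100.0 * tp / (tp + fp),          # positive predictive value
        "npv": 100.0 * tn / (tn + fn),          # negative predictive value
        "accuracy": 100.0 * (tp + tn) / total,
    }
```

For example, classification_metrics(tp=40, tn=45, fp=5, fn=10) gives a sensitivity of 80% and an accuracy of 85%.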
6. Conclusion

In this work, SVM and three ensemble methods (bagging, boosting and the random subspace method) have been implemented and tested on the Statlog heart dataset. 10-fold cross-validation was used to measure the accuracy of the algorithms. The comparative analysis shows that the Boosting ensemble method performed better than the other methods. This work can be further extended to other datasets; ensemble methods with other base classifiers can be applied, and feature selection techniques can be adopted to further improve performance.

7. References

[1] Abdulkadir Sengur, "Support Vector Machine Ensembles for Intelligent Diagnosis of Valvular Heart Disease", J Med Syst, 36 (4), pp 2649-2655, 2012.
[2] Sellappan Palaniappan and Rafiah Awang, "Intelligent Heart Disease Prediction System Using Data Mining Techniques", International Journal of Computer Science and Network Security, 8 (8), pp 343-350, 2008.
[3] Tzung-I Tang, Gang Zheng, Yalou Huang, Guangfu Shu and Pengtao Wang, "A Comparative Study of Medical Data Classification Methods Based on Decision Tree and System Reconstruction Analysis", IEMS, 4 (1), pp 102-108, 2005.
[4] Sumit Bhatia, Praveen Prakash and G. N. Pillai, "SVM Based Decision Support System for Heart Disease Classification with Integer-Coded Genetic Algorithm to Select Critical Features", World Congress on Engineering and Computer Science (WCECS), October 2008.
[5] Asha Rajkumar and G. Sophia Reena, "Diagnosis of Heart Disease Using Datamining Algorithm", Global Journal of Computer Science and Technology, 10 (10), pp 38-43, 2010.
[6] Leo Breiman, "Bagging Predictors", Machine Learning, Kluwer Academic Publishers, 24, pp 123-140, 1996.
[7] Marina Skurichina and Robert P. W. Duin, "Bagging, Boosting and the Random Subspace Method for Linear Classifiers", Pattern Analysis & Applications, 5, pp 121-135, 2002.
[8] Resul Das and Abdulkadir Sengur, "Evaluation of ensemble methods for diagnosing of valvular heart disease", Expert Systems with Applications, 37 (7), pp 5110-5115, 2010.
[9] V. Subha, D. Murugan, S. Prabha and A. Manivanna Boopathi, "Genetic algorithm based feature subset selection for fetal state classification", Journal of Communications Technology, Electronics and Computer Science, 2, pp 1317, 2015.
[10] Resul Das, Ibrahim Turkoglu and Abdulkadir Sengur, "Diagnosis of valvular heart disease through neural networks ensembles", Computer Methods and Programs in Biomedicine, 93, pp 185-191, 2009.
[11] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html/statlog/Heart.