ISSN:2249-5789
V Subha et al, International Journal of Computer Science & Communication Networks,Vol 5(6),386-390
COMPARATIVE ANALYSIS OF SUPPORT VECTOR MACHINE
ENSEMBLES FOR HEART DISEASE PREDICTION
V.Subha, M.Revathi, D.Murugan
Department of Computer Science and Engineering
Manonmaniam Sundaranar University
Tirunelveli-12.
[email protected], [email protected], [email protected]
Abstract

A heart attack occurs when the blood flow to a part of the heart is blocked by a blood clot. If this clot cuts off the blood flow completely, the part of the heart muscle supplied by that artery begins to die. There is currently no cure for heart attack, but the risk can be controlled by quitting smoking, lowering cholesterol, controlling high blood pressure, maintaining a healthy weight, and exercising. Generally, many tests are done that involve clustering or classification of large-scale data. In this work, heart disease is predicted using support vector machine ensembles implemented in MATLAB. The aim of this paper is to analyze the performance of the Support Vector Machine (SVM) classifier and ensemble classifier methods such as Bagging, Boosting and Random Subspace for heart disease prediction. The accuracies of the different classification algorithms are compared to bring out the most effective algorithm for heart disease prediction.
Keywords: Data Mining, Statlog Heart Dataset, Support Vector Machine, Ensemble Classifiers.

1. Introduction

Data mining is the process of extracting information from data; it is also called knowledge discovery. Data mining has become more and more popular for analyzing large amounts of data in the past few years. The most important and popular data mining techniques are classification and clustering. In this paper, the SVM classifier and ensemble methods such as Bagging, Boosting and Random Subspace are investigated for heart disease prediction.

The term heart disease is often used interchangeably with cardiovascular disease. Cardiovascular disease generally refers to conditions that involve narrowed or blocked blood vessels and can lead to a heart attack. Symptoms of a heart attack include discomfort, pressure, heaviness, or pain in the chest, arm, or below the breastbone; discomfort radiating to the back, jaw, throat, or arm; indigestion; sweating; nausea; vomiting; extreme weakness; anxiety; shortness of breath; and rapid or irregular heartbeats. Initial symptoms may start as a mild discomfort that progresses to significant pain. In general, numerous tests must be conducted on a patient to detect a disease. Data mining techniques are used in disease prediction to reduce the number of tests and increase the accuracy of detection.

Figure 1. Block Diagram (Dataset -> Classification Techniques -> SVM and Ensemble Classifiers: Bagging, Boosting, Random Subspace -> Performance Evaluation)

IJCSCN | Dec 2015
Available [email protected]
2. Related work
Abdulkadir Sengur [1] proposed Support Vector Machine ensembles for the intelligent diagnosis of valvular heart disease. The model employs ensemble learning to improve Support Vector Machine classifiers.
Sellappan Palaniappan and Rafiah Awang [2] proposed a Heart Disease Decision Support System (HDDSS) using data mining classification modeling techniques. The model employs three data mining techniques, namely Decision Tree, Naïve Bayes and Neural Network. Tzung-I Tang et al. [3] compared decision tree and system reconstruction analysis as applied to heart disease medical data mining. Sumit Bhatia et al. [4] proposed a decision support system for heart disease classification based on the support vector machine (SVM) and an integer-coded genetic algorithm (GA). The integer-coded genetic algorithm selects the important and relevant features and discards the irrelevant and redundant ones, while also maximizing the SVM's classification accuracy. Asha Rajkumar and G. Sophia Reena [5] used the Tanagra tool for classification and compared the results: the accuracy of Naïve Bayes is 52.33%, of Decision List 52%, and of KNN 45.67%. Leo Breiman [6] proposed bagging predictors for generating multiple versions of a predictor and using these to obtain an aggregated predictor with improved accuracy. Marina Skurichina and Robert P. W. Duin [7] studied bagging, boosting and the random subspace method for improving weak classifiers.
Resul Das and Abdulkadir Sengur [8] evaluated ensemble learning classifiers for the diagnosis of valvular heart disorders. The performance of the ensemble methodology was evaluated on a data set containing 215 samples, achieving a 95.9% sensitivity and a 96% specificity rate.
Subha et al. [9] applied a genetic algorithm and SVM to find relevant features for cardiotocogram classification. Resul Das et al. [10] proposed a neural network ensemble method that creates new models by combining the posterior probabilities of the predicted values from multiple predecessor models. In experiments on a data set containing 215 samples, it obtained 97.4% classification accuracy, with sensitivity and specificity values of 100% and 96% in valvular heart disease diagnosis.
3. Dataset Description

In this work, the Statlog Heart Dataset [11] from the UCI Machine Learning Repository is used. The dataset contains 270 instances and 14 attributes, including a two-valued class attribute, and holds information concerning heart disease diagnosis. The 14 attributes are as follows:

Table 1. Description of attributes

Sl. No.   Attribute
1         Age
2         Gender
3         cp: chest pain type
4         trestbps: resting blood pressure
5         Cholesterol
6         fbs: fasting blood sugar
7         restecg: resting electrocardiographic results
8         thalach: maximum heart rate achieved
9         exang: exercise induced angina
10        Oldpeak
11        Slope
12        ca: number of major vessels
13        Thal
14        Class variable

Ten-fold cross-validation is used to split the data: the data is divided into 10 approximately equal parts that together form the full dataset. The learning procedure executes 10 times, each time on a different set of training folds, and the accuracy rates of the 10 runs are averaged to yield an overall accuracy rate. A confusion matrix is used to present the accuracy of the classifiers obtained through classification.

4. Methodology

4.1 SVM Classifier
SVM is a commonly used technique for data classification. An SVM produces a model that predicts the target values of data instances in the testing set. The Support Vector Machine (SVM) is used when the data has exactly two classes. An SVM classifier classifies data by finding the best hyperplane that separates all data points of one class from those of the other class. The best hyperplane is the one with the largest margin between the two classes, where the margin is the maximal width of the slab, parallel to the hyperplane, that has no interior data points.
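The margin-maximizing classifier described above is normally obtained with a quadratic-programming solver (the paper's experiments were run in MATLAB). As an illustration only, a linear SVM can also be trained by stochastic sub-gradient descent on the regularized hinge loss; the sketch below is a Pegasos-style approximation in Python, not the solver used in this work, and all function names are our own.

```python
import random

def train_linear_svm(X, y, lam=0.01, epochs=300):
    """Minimal linear SVM trained by stochastic sub-gradient descent
    on the regularized hinge loss (Pegasos-style). Labels must be +/-1."""
    rng = random.Random(0)
    w = [0.0] * len(X[0])
    b = 0.0
    t = 0
    for _ in range(epochs):
        order = list(range(len(X)))
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * (sum(wj * xj for wj, xj in zip(w, X[i])) + b)
            if margin < 1:
                # sample violates the margin: move the hyperplane towards it
                w = [(1 - eta * lam) * wj + eta * y[i] * xj
                     for wj, xj in zip(w, X[i])]
                b += eta * y[i]
            else:
                # only shrink the weights (regularization term)
                w = [(1 - eta * lam) * wj for wj in w]
    return w, b

def predict(w, b, x):
    """Sign of the decision function."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1
```

On a small linearly separable toy set, this sketch recovers a separating hyperplane; a production SVM (with kernels and soft margins as in the dual formulation) would use a dedicated solver.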



The formulae for the Support Vector Machine are as follows. Let the set of training data be {(a_1, b_1), ..., (a_l, b_l)}, where each a_i ∈ S^n denotes an input sample, b_i denotes the target value, i = 1, 2, ..., l, and l is the size of the training data. The primal optimization problem is

min J(D, ξ) = (1/2)||D||^2 + C ∑_{i=1}^{l} ξ_i

where C is the constant of capacity control and ξ_i is the slack factor.

The optimization problem can be rewritten in dual form as

max M(α) = ∑_{i=1}^{l} α_i − (1/2) ∑_{i,j=1}^{l} α_i α_j b_i b_j K(a_i, a_j)

subject to ∑_{i=1}^{l} α_i b_i = 0 and α_i ∈ [0, C], i = 1, 2, ..., l, where K(a_i, a_j) is the kernel function.

The optimal hyperplane with maximal margin is

∑_{sv} α_i b_i K(a, a_i) + b = 0

and the SVM decision function for nonlinear classification of a new sample a in the input space is

φ(a) = sgn[∑_{sv} b_i α_i K(a_i, a) + b].

4.2 Ensemble Classifier (EC)

Ensemble data mining methods, also known as committee methods or model combiners, are machine learning methods that use the power of many models to achieve better accuracy than the individual models. The following ensemble methods are used in this work:

- Bagging
- Boosting
- Random Subspace

Bagging, boosting and the random subspace method are commonly used for combining weak classifiers.

4.2.1 Bagging

Bootstrap aggregation, or bagging, is a technique that can be used with many classification and regression methods to improve prediction by reducing the variance associated with it. Many bootstrap samples are drawn from the available data, the prediction method is applied to each bootstrap sample, and the results are combined, by averaging for regression and simple voting for classification, to obtain the overall prediction; the variance is reduced by the averaging.

4.2.2 Boosting

The AdaBoost family of algorithms, also known as boosting, is another category of powerful ensemble methods. Boosting changes the distribution of sample weights. Initially the weights are uniform across all training samples; they are adjusted after the training of each classifier is completed. The weights of misclassified samples are increased, while those of correctly classified samples are decreased. The final ensemble is constructed by combining the individual classifiers according to their accuracies.

4.2.3 Random Subspace

In the random subspace method, feature subspaces are picked at random from the original feature space, and individual classifiers are built on the original training set restricted to the attributes in the chosen subspaces. The outputs of the individual classifiers are combined by uniform majority voting to give the final prediction.
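The three resampling ideas above (bootstrap samples for bagging, sample reweighting for boosting, random feature subsets for the random subspace method, all combined by voting) can be summarized in a few helper functions. This is a minimal illustrative Python sketch (the paper's experiments were run in MATLAB), and the function names are our own:

```python
import math
import random

def bootstrap_sample(data, rng):
    """Bagging: draw n items with replacement from n items."""
    return [rng.choice(data) for _ in data]

def adaboost_reweight(weights, correct, error):
    """Boosting: raise the weights of misclassified samples, lower the
    weights of correctly classified ones, then renormalize.
    correct[i] is True when sample i was classified correctly;
    error is the weighted error of the current classifier."""
    alpha = 0.5 * math.log((1 - error) / error)  # classifier's vote weight
    new = [w * math.exp(-alpha if ok else alpha)
           for w, ok in zip(weights, correct)]
    total = sum(new)
    return [w / total for w in new], alpha

def random_subspace(n_features, k, rng):
    """Random subspace: pick k distinct feature indices at random."""
    return sorted(rng.sample(range(n_features), k))

def majority_vote(predictions):
    """Combine individual classifier outputs by uniform majority voting."""
    return max(set(predictions), key=predictions.count)
```

Each ensemble then only differs in how the training data seen by each base classifier is produced, and in whether the final vote is uniform (bagging, random subspace) or weighted by alpha (boosting).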
5. Result Analysis
Different metrics can be used to evaluate the performance of classifiers. In this work, the performance metrics Accuracy, Sensitivity, Specificity, Positive Predictive Value (PPV) and Negative Predictive Value (NPV) are used to evaluate the classifiers.
The formulas for these metrics are given below:
Sensitivity(%) = TP / (TP + FN) × 100

Specificity(%) = TN / (TN + FP) × 100

PPV(%) = TP / (TP + FP) × 100

NPV(%) = TN / (TN + FN) × 100

Accuracy(%) = (TP + TN) / (TP + TN + FP + FN) × 100

where:

TP - number of correctly classified positive instances (true positives).
TN - number of correctly classified negative instances (true negatives).
FP - number of negative instances misclassified as positive (false positives).
FN - number of positive instances misclassified as negative (false negatives).

Figure 2. Performance Analysis of SVM and Ensemble Classifier Methods
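These definitions translate directly into code. The following small sketch (in Python rather than the MATLAB used in the paper; the function name is our own) computes all five metrics from the confusion-matrix counts:

```python
def metrics(tp, tn, fp, fn):
    """Confusion-matrix based performance metrics, in percent."""
    return {
        "sensitivity": 100 * tp / (tp + fn),
        "specificity": 100 * tn / (tn + fp),
        "ppv":         100 * tp / (tp + fp),
        "npv":         100 * tn / (tn + fn),
        "accuracy":    100 * (tp + tn) / (tp + tn + fp + fn),
    }

# Hypothetical counts for illustration: 40 true positives, 45 true
# negatives, 5 false positives, 10 false negatives.
m = metrics(40, 45, 5, 10)
```

With these hypothetical counts, sensitivity is 40/50 = 80% and accuracy is 85/100 = 85%.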
5.1 Experimental Results
The performance measures of the Support Vector Machine (SVM) and ensemble classifiers are given in Table 2, and the results are shown graphically in Figure 2.
Table 2. Performance Analysis of SVM and Ensemble Classifier Methods

Metric        SVM     Bagging   Boosting   Random Subspace
Accuracy      73.70   81.85     83.22      80.00
Sensitivity   73.78   81.49     83.00      77.90
Specificity   73.70   81.02     82.12      77.20
PPV           74.05   81.67     82.40      80.08
NPV           73.50   80.56     84.00      80.00
The experimental results show that the SVM classifier achieved a classification accuracy of 73.7%, Bagging 81.85%, Random Subspace 80%, and Boosting 83.22%. It is clear that the Boosting method performs better than the other techniques in terms of accuracy, sensitivity, specificity, PPV and NPV.
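The accuracy figures above are averages over the 10 cross-validation folds described earlier. Fold construction for the 270-instance dataset can be sketched as follows (illustrative Python rather than the MATLAB used in the paper; the function name is our own):

```python
def kfold_indices(n, k=10):
    """Split indices 0..n-1 into k approximately equal folds;
    each fold serves once as the test set, the rest as training data."""
    # Distribute any remainder over the first n % k folds.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    folds, start = [], 0
    for size in fold_sizes:
        folds.append(list(range(start, start + size)))
        start += size
    splits = []
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        splits.append((train, test))
    return splits
```

For 270 instances and k = 10, every fold contains exactly 27 instances; the per-fold accuracies are then averaged to give the overall accuracy rate.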
6. Conclusion
In this work, SVM and three ensemble methods, namely bagging, boosting and random subspace, have been implemented and tested on the Statlog Heart Dataset. Ten-fold cross-validation was used to measure the accuracy of the algorithms. The final comparative analysis shows that the Boosting ensemble method performed better than the other methods. This work can be further extended using other datasets, and ensemble methods with other base classifiers can be applied for classification in the future. Feature selection techniques can also be adopted to further improve performance.
7. References
[1] Abdulkadir Sengur “Support Vector Machine
Ensembles for Intelligent Diagnosis of Valvular Heart
Disease”, J Med Syst, 36 (4) pp 2649-2655, 2012.
[2] Sellappan Palaniappan and Rafiah Awang, “Intelligent
Heart Disease Prediction System Using Data Mining
Techniques”, International Journal of Computer Science
and Network Security, 8 (8), pp 343-350, 2008.
[3] Tzung-I Tang, Gang Zheng, Yalou Huang, Guangfu Shu
and Pengtao Wang, “A Comparative Study of Medical Data
Classification Methods Based on Decision Tree and System
Reconstruction Analysis”, IEMS, 4 (1), pp. 102-108, 2005.
[4] Sumit Bhatia, Praveen Prakash and G.N. Pillai, “SVM
Based Decision Support System for Heart Disease
Classification with Integer-Coded Genetic Algorithm to
Select Critical Features”, World Congress on Engineering
and Computer Science (WCECS), October, 2008.
[5] Asha Rajkumar and G.Sophia Reena, “Diagnosis of
Heart Disease Using Datamining Algorithm”, Global
Journal of Computer Science and Technology, 10 (10), pp
38-43, 2010.
[6] Leo Breiman, “Bagging Predictors”, Machine
Learning, Kluwer Academic Publishers, 24, pp 123-140,
1996.
[7] Marina Skurichina and Robert P. W. Duin, “Bagging,
Boosting and the Random Subspace Method for Linear
Classifiers”, Pattern Analysis & Applications, 5, pp 121–
135, 2002.
[8] Resul Das and Abdulkadir Sengur, “Evaluation of
ensemble methods for diagnosing of valvular heart
disease”, Expert Systems with Applications, 37 (7), pp
5110–5115, 2010.
[9] V.Subha, D.Murugan, S.Prabha and A.Manivanna Boopathi, "Genetic algorithm based feature subset selection for fetal state classification", Journal of Communications Technology, Electronics and Computer Science, 2, pp 13-17, 2015.
[10] Resul Das, Ibrahim Turkoglu and Abdulkadir Sengur,
“Diagnosis of valvular heart disease through neural
networks ensembles”, Computer methods and programs in
biomedicine, 93, pp 185–191, 2009.
[11] UCI Machine Learning Repository, http://archive.ics.uci.edu/ml/datasets.html/statlog/Heart.