Application of Data Mining Techniques
on Heart Disease Prediction: A Survey
Ritika Chadha, Shubhankar Mayank, Anurag Vardhan
and Tribikram Pradhan
Abstract Globally, the medical industry is regarded as “information rich” yet
“knowledge poor”. Knowledge discovery from data (KDD) is therefore applied to
extract interesting patterns from medical datasets using different data mining techniques.
The massive volume of available data is essential for extracting useful information and
for deriving relationships among the attributes. The aim of this paper is to compile,
tabulate and analyze the different data mining techniques that have been applied
and implemented in recent years for heart disease prediction. Each previous
paper exhibits a set of strengths and limitations in terms of the data types used in the
dataset, accuracy, ease of interpretation, reliability and generalization ability. This
paper compares the techniques and highlights the pros and cons of each. So far,
the observations reveal that Neural Networks perform well compared to Naive Bayes
and Decision Tree under appropriate conditions.
Keywords Heart disease · Decision tree · Naive Bayes · Classification · Neural
networks · Genetic algorithm
1 Introduction
Heart disease is the single largest cause of death in developed countries and one of
the main contributors to disease burden in developing countries. The shortage
of doctors and experts, together with frequent neglect of patients’ symptoms, calls for
data mining as an analysis tool to discover hidden relationships and
patterns in heart disease (HD) medical data. Detecting a disease requires
numerous tests that a patient has to go through. Added to this is the
large amount of complex data about patients, hospital resources, disease diagnosis,
electronic patient records, medical devices etc. To prevent this cost-consuming,
R. Chadha (&) S. Mayank A. Vardhan T. Pradhan
Department of ICT, Manipal University, Manipal 576104, India
© Springer India 2016
N.R. Shetty et al. (eds.), Emerging Research in Computing, Information,
Communication and Applications, DOI 10.1007/978-81-322-2553-9_38
Fig. 1 KDD
cumbersome task, data mining offers an efficient and cost-effective alternative.
Data mining techniques are the result of a long process of experimentation and R&D.
Data mining is divided into two kinds of tasks: predictive tasks and descriptive
tasks. It involves several steps, from raw data collection to some form of
interesting pattern. The process, which is iterative, includes Data
Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining,
Pattern Evaluation, and Knowledge Discovery, as shown in Fig. 1.
Our work presents an overall view of the tasks performed to extract such data,
with the aim of increasing the prediction rate for heart disease through the
application of various essential data mining techniques and processes.
2 Related Work
Bhatla et al. [1], in An Analysis of Heart Disease Prediction using Different Data
Mining Techniques, ran experiments with the data mining tool Weka 3.6.6. They
report an accuracy of 100 % for Neural Networks, compared to 99.62 % for
Decision Tree and 90.74 % for Naïve Bayes. A method combining Fuzzy Logic and
Genetic Algorithm is also used, which couples genetic algorithms for feature
selection with fuzzy expert systems, with experiments run in Matlab using its fuzzy tool.
Two kinds of algorithms are used. Earlier work used 13 attributes for this prediction,
but this research reduced the number of attributes to six using
Genetic Algorithm and Feature Subset Selection. The prototype IHDPS (Intelligent
Heart Disease Prediction System) had been developed using Decision Trees,
Naive Bayes and Neural Networks. The analysis shows that Neural Network has the
highest accuracy so far, at 100 %, while Decision Tree also performed well, with
99.62 % accuracy using 15 attributes.
Amin et al. [2], in Genetic Neural Network Based Data Mining in
Prediction of Heart Disease Using Risk Factors, developed an intelligent data
mining system based on a genetic algorithm. To transform the data into a useful form,
values were encoded in the range [−1, 1]. The system, Neural Network Weight
Optimization by Genetic Algorithm, uses the back-propagation algorithm to
learn and train the neural network. The drawbacks of this approach were
addressed by optimizing the initial weights of the neural network with
a genetic algorithm, which is specialized for global search. The
accuracy of heart disease prediction was 89 % on the training data
and 96.2 % on the validation data.
In HDPS: Heart Disease Prediction System [3], A.H. Chen et al.
applied an ANN technique, using a learning vector quantization (LVQ) system to
represent the feature space of the observed data using prototypes W = (w(i), …, w(n)).
A winner-take-all training algorithm is applied, in which the so-called winner
prototype is moved closer to a data point if it classifies that point correctly, and
moved away otherwise. The classification accuracy is near 80 %, with 85 % sensitivity and
70 % specificity. Their approach consists of three steps. The first is the selection of 13
important clinical features: age, sex, chest pain type, trestbps, cholesterol, fasting
blood sugar, resting ecg, max heart rate, exercise induced angina, old peak, slope,
number of vessels colored, and thal. The second is the development of
an artificial neural network algorithm, which obtains an 80 % prediction rate. The
third is a user-friendly heart disease prediction system (HDPS) that generates
prediction results using artificial neural network (ANN) techniques in a C and C#
environment.
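The winner-take-all update described above can be sketched as follows. This is a minimal illustration of the basic LVQ1 rule, not the implementation used in [3]; the function name `lvq1_step`, the learning rate and the toy two-dimensional prototypes are assumptions made for the example.

```python
import numpy as np

def lvq1_step(prototypes, labels, x, y, lr=0.1):
    """Apply one LVQ1 update in place; return the index of the winning prototype."""
    # The winner is the prototype closest to x (Euclidean distance).
    winner = int(np.argmin(np.linalg.norm(prototypes - x, axis=1)))
    direction = x - prototypes[winner]
    if labels[winner] == y:
        prototypes[winner] += lr * direction  # correct class: move closer to x
    else:
        prototypes[winner] -= lr * direction  # wrong class: move away from x
    return winner

# Toy example (hypothetical values): one prototype per class.
prototypes = np.array([[0.0, 0.0], [1.0, 1.0]])
labels = np.array([0, 1])
winner = lvq1_step(prototypes, labels, np.array([0.2, 0.0]), y=0, lr=0.5)
```

Only the winning prototype moves on each step, which is what makes the learned prototypes easy to inspect afterwards.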
In Heart Disease Prediction using Lazy Associative Classification [4],
M. Akhil Jabbar et al. apply a lazy data mining approach to heart disease classification.
An information-centric attribute measure, PCA, is applied to generate class
association rules, which are then used to predict the occurrence of
heart disease. On non-medical data sets, this approach improved accuracy by 10.8 %
over J4.8 and by 19.8 % over naïve Bayes. On the heart disease data set, it achieved
an improvement of 10.26 % over J4.8 and 8.6 % over naïve Bayes.
In Early Prediction of Heart Diseases Using Data Mining Techniques [5],
three classifiers, ID3, CART and DT, are used. CART is the
most accurate at 83.49 %, and takes 0.23 s to build the model. The most
important attributes for heart disease are cp (chest pain), slope (the slope of the
peak exercise segment), exang (exercise induced angina), and restecg (resting
electrocardiographic results). These attributes were identified using three tests for the
assessment of input variables: the Chi-square test, the Info Gain test and the Gain Ratio test.
Improved Study of Heart Disease Prediction System using Data
Mining Classification Techniques [6] by Chaitrali S. Dangare adds two more
input attributes, obesity and smoking, to improve the overall prediction rate.
Decision Trees, Naive Bayes and Neural Networks are used, with Neural
Networks providing the most accurate results: 99.25 %, as opposed to 94.44 % for
Naive Bayes and 96.66 % for Decision Tree.
In Predictive Data Mining for Medical Diagnosis [7], which covers techniques such as
ANN, Time Series, Clustering, Association Rules and soft computing approaches,
Jyoti Soni et al. concluded that Decision Tree outperforms the other methods, that
Bayesian classification sometimes has similar accuracy to Decision Tree, and that
other predictive methods such as KNN, Neural Networks and classification based on
clustering do not perform well. After the application of a genetic algorithm, the
accuracy of Decision Tree and Bayesian classification improves further.
In Intelligent Heart Disease Prediction System Using Data Mining
Techniques [8], Sellappan Palaniappan et al. used three data mining techniques.
The work involves extracting hidden knowledge from a historical heart disease database,
building and accessing models through the DMX query language and functions, and
training and validating the models against a test dataset. Effectiveness is assessed using
Lift Chart and Classification Matrix. The most effective model for predicting patients with
heart disease appears to be Naïve Bayes, followed by Neural Network and Decision
Trees.
In Prediction of Heart Disease using Classification Algorithms [9], Hlaudi Daniel
Masethe performed an experiment to predict heart attacks and compared methods to
find the best one. Such a system can act as an
important tool for physicians to predict risky cases in practice and advise
accordingly. The predictive accuracy obtained with the J48, REPTREE and
SIMPLE CART algorithms suggests that the parameters used are reliable indicators for
predicting the presence of heart disease.
K. Sudhakar et al. [10], in Study of Heart Disease Prediction using
Data Mining, present the different techniques that have been deployed in recent years
for calculating the prediction rate in heart disease. These techniques include ANN,
BN, Decision Trees and Classification Algorithms.
In Performance Analysis of Classification Data mining Techniques Over Heart
Disease Data Base [11], Aditya Sundar et al. describe a prototype, developed after
experimentation, that uses two data mining techniques: Naïve
Bayes and WAC (weighted associative classifier). It bridges
significant data and knowledge, e.g. patterns and relationships between medical
symptoms, and serves as a tool to train nurses and medical interns to treat patients
with heart disease.
Abhishek Taneja [12], in Heart Disease Prediction System Using
Data Mining Techniques, conducts four experiments
employing selected classification algorithms on a full training dataset containing
7339 instances. KDD is used to develop a model that can
predict heart disease cases.
In Applications of Data Mining Techniques in Healthcare and
Prediction of Heart Attacks [13], K. Srinivas et al. apply data mining techniques
such as Rule Based, Decision Tree, Naïve Bayes and Artificial Neural Network to
massive volumes of medical care data.
3 Objective
Our paper highlights the advantages and disadvantages of the
different data mining techniques used for the prediction of heart disease. It also reports
the prediction rate of each technique, thereby enabling a comparison
between them.
4 Methodology
Our methodology consisted of examining recent publications,
journals and reviews in the fields of computer science and engineering, data mining
and cardiovascular disease.
4.1 Data Mining and Neural Networks
An artificial neural network (ANN), or “neural network” (NN) for short,
is a mathematical or computational model based on the neural networks found
in human anatomy. In the surveyed work, the Heart Disease Prediction
System was developed using 15 attributes, with a reported accuracy of 100 %; a
few papers used 13 attributes instead. Weka 3.6.6 is used for the neural network
experiments, while a few researchers implemented heart disease classification and
prediction trained via ANN using C. A Multi-layer Perceptron Neural Network (MLPNN)
with back-propagation is used. The structure of the MLPNN is shown in Fig. 2.
Framework of ANN model: It maps a set of input data onto a set of appropriate
output data. It consists of three layers, namely the input layer, the hidden layer and the
output layer. A weight is assigned to each connection (branch) leaving a neuron.
Fig. 2 Structure of MLPNN
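The three-layer structure and back-propagation training described above can be sketched as follows. This is a minimal illustration in NumPy, not the Weka or C implementations used in the surveyed papers; the hidden layer size, learning rate, number of epochs and the toy two-feature dataset are assumptions chosen only to keep the example small.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, W1, b1, W2, b2):
    """Input layer -> hidden layer -> output layer."""
    hidden = sigmoid(X @ W1 + b1)
    output = sigmoid(hidden @ W2 + b2)
    return hidden, output

def train(X, y, n_hidden=4, lr=0.5, epochs=500, seed=0):
    rng = np.random.default_rng(seed)
    # One weight per connection between consecutive layers.
    W1 = rng.normal(0.0, 0.5, (X.shape[1], n_hidden))
    b1 = np.zeros(n_hidden)
    W2 = rng.normal(0.0, 0.5, (n_hidden, 1))
    b2 = np.zeros(1)
    losses = []
    for _ in range(epochs):
        hidden, out = forward(X, W1, b1, W2, b2)
        err = out - y
        losses.append(float(np.mean(err ** 2)))  # mean squared error
        # Back-propagation: push the error back through each sigmoid layer.
        # (The constant factor 2 of the MSE gradient is absorbed into lr.)
        d_out = err * out * (1.0 - out)
        d_hidden = (d_out @ W2.T) * hidden * (1.0 - hidden)
        W2 -= lr * hidden.T @ d_out / len(X)
        b2 -= lr * d_out.mean(axis=0)
        W1 -= lr * X.T @ d_hidden / len(X)
        b1 -= lr * d_hidden.mean(axis=0)
    return (W1, b1, W2, b2), losses

# Toy data (hypothetical): two binary inputs, target is their logical OR.
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([[0.0], [1.0], [1.0], [1.0]])
params, losses = train(X, y)
```

A real heart disease model would take the 13 or 15 clinical attributes as inputs in place of the two toy features.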
4.2 Genetic Algorithm
Genetic Algorithm (GA) is a heuristic that imitates the process of Darwinian natural
evolution, as depicted in Fig. 3. The algorithm is used to generate optimized solutions
and is inspired by mechanisms such as inheritance, mutation, selection, and crossover. For
instance, in heart disease prediction, GA is used with Feature Subset Selection
to reduce the number of attributes. Candidate sets of input values
are repeatedly evaluated by a fitness function, which is a
flexible expression of the modelling criteria.
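A minimal sketch of GA-based feature subset selection is given below. Each chromosome is a bit mask over 13 attributes. The fitness function here is a toy stand-in, not a classifier accuracy: the `USEFUL` index set, the size penalty, and all parameter values are hypothetical and chosen purely for illustration.

```python
import random

N_FEATURES = 13                   # the surveyed papers start from 13 attributes
USEFUL = {2, 8, 9, 10, 11, 12}    # hypothetical "informative" indices, illustration only

def fitness(mask):
    """Toy stand-in for model quality: reward useful features, penalise subset size."""
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in USEFUL)
    return hits - 0.3 * sum(mask)

def evolve(pop_size=20, generations=30, p_mut=0.05, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(N_FEATURES)] for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        history.append(fitness(pop[0]))
        parents = pop[: pop_size // 2]          # selection: keep the fitter half
        children = [pop[0][:]]                  # elitism: carry the best one forward
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, N_FEATURES)  # single-point crossover
            child = a[:cut] + b[cut:]
            # Mutation: flip each bit with probability p_mut.
            child = [bit ^ (rng.random() < p_mut) for bit in child]
            children.append(child)
        pop = children
    best = max(pop, key=fitness)
    history.append(fitness(best))
    return best, history

best_mask, history = evolve()
selected = [i for i, bit in enumerate(best_mask) if bit]
```

In practice the fitness function would train and score a classifier on the attributes selected by the mask, which is far more expensive than this toy score.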
4.3 Decision Tree (DT)
A decision tree, built using classification or regression techniques, takes the form
of a tree structure as shown in Fig. 4. It splits a dataset into progressively smaller subsets.
The final outcome is a tree with decision nodes and leaf nodes. A decision node
has two or more branches, each corresponding to an outcome of a test, while a leaf node
holds the result of the decision-making process. The topmost decision node in a tree is
known as the root node. Each leaf is assigned to one class representing the most
appropriate target value; a leaf may also hold a probability vector. Classification is
top-down: a record is routed from the root to a leaf according to the results of the
tests along the path.
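The root-to-leaf navigation can be sketched as follows. The tree below is hand-built for illustration: the attribute names follow the clinical features listed earlier, but the thresholds and class labels are hypothetical, not learned from data and not clinical guidance.

```python
from dataclasses import dataclass
from typing import Union

@dataclass
class Leaf:
    label: str                       # class assigned to records reaching this leaf

@dataclass
class Decision:
    attribute: str                   # attribute tested at this node
    threshold: float
    left: Union["Decision", Leaf]    # branch taken when value <= threshold
    right: Union["Decision", Leaf]   # branch taken when value > threshold

def classify(node, record):
    """Walk from the root to a leaf, following the outcome of each test."""
    while isinstance(node, Decision):
        if record[node.attribute] <= node.threshold:
            node = node.left
        else:
            node = node.right
    return node.label

# Hypothetical tree: the root tests chest pain type, then old peak.
tree = Decision(
    "cp", 2.5,
    Leaf("no disease"),
    Decision("oldpeak", 1.0, Leaf("no disease"), Leaf("disease")),
)

prediction = classify(tree, {"cp": 4, "oldpeak": 2.0})
```

Learning algorithms such as ID3, CART or J48 differ in how they choose the test at each decision node, but classification of a new record always follows this traversal.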
Fig. 3 Overall model of genetic algorithm
Fig. 4 Structure of a decision tree
4.4 Naive Bayes
This data mining classifier is based on Bayes' theorem and is well suited to cases
where the dimensionality of the inputs is very high.
Despite its simplicity, Naive Bayes can outperform more sophisticated
classification methods. Bayes' theorem provides a way to calculate the
posterior probability P(c|x) from P(c), P(x) and P(x|c). The Naive Bayes classifier
assumes that the effect of the value of a predictor x on a given class c is independent
of the values of the other predictors; this assumption is called
class conditional independence. The theorem states:
$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$
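As a minimal illustration of how the posterior P(c|x) is computed under class conditional independence, the sketch below implements a categorical Naive Bayes classifier in plain Python. The add-one (Laplace) smoothing, the function names and the five toy records are assumptions made for the example; they do not come from any of the surveyed papers.

```python
from collections import Counter, defaultdict

def fit(records, labels):
    """Count class frequencies and per-class value frequencies for each feature."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (feature index, class) -> value counts
    values = defaultdict(set)            # feature index -> values seen in training
    for rec, c in zip(records, labels):
        for i, v in enumerate(rec):
            value_counts[(i, c)][v] += 1
            values[i].add(v)
    return class_counts, value_counts, values

def posterior(model, rec):
    """Return P(c | x) for every class c, using Bayes' theorem."""
    class_counts, value_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total                              # prior P(c)
        for i, v in enumerate(rec):
            # P(x_i | c) with add-one smoothing; features are multiplied
            # together because they are assumed independent given the class.
            p *= (value_counts[(i, c)][v] + 1) / (n_c + len(values[i]))
        scores[c] = p                                # P(x | c) * P(c)
    evidence = sum(scores.values())                  # P(x)
    return {c: s / evidence for c, s in scores.items()}

# Toy data (hypothetical): (chest pain type, exercise induced angina).
records = [("typical", "yes"), ("typical", "yes"), ("atypical", "no"),
           ("atypical", "no"), ("typical", "no")]
labels = ["disease", "disease", "healthy", "healthy", "healthy"]
model = fit(records, labels)
post = posterior(model, ("typical", "yes"))
```

Dividing by the evidence P(x) only normalizes the scores, so the predicted class is simply the one with the largest value of P(x|c) P(c).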
Table 1 Comparison of the prediction rate using data mining techniques in recent years

Author: Nidhi Bhatla, Kiran Jyoti
Title: An analysis of heart disease prediction using different data mining techniques
Year: 2013
Dataset: The dataset from UCI machine learning repository is used.
Type: Neural networks, fuzzy logic and genetic algorithm, supervised machine learning, genetic algorithm, IHDPS
Advantages: ANN requires less formal statistical training. Effective classification.
Disadvantage: ANN requires more fine tuning. GAs are slow.
Prediction result: Naive Bayes 96.5 %, Decision Trees 99.62 %, Neural Networks 100 %, KNN 45.67 %, Classification via Clustering 88.3 %

Author: Syed Umar Amin, Kavita Agarwal, Dr. Rizwan Beg
Title: Genetic neural network based data mining in prediction of heart disease using risk factors
Year: 2013
Dataset: The data for 50 people was collected from surveys done by the American Heart Association.
Type: Data analysis and encoding, neural network weight optimization by genetic algorithm, neural networks
Advantages: Data analysis was needed for correct data preprocessing.
Disadvantage: Back propagation algorithm is very slow, and “black box” nature of ANN.
Prediction result: The accuracy of prediction of heart disease on the training data was calculated as 89 % and accuracy on validation data was 96.2 %. The least mean square error (MSE) achieved was 0.034683.

Author: AH Chen, SY Huang, PS Hong, CH Cheng, EJ Lin
Title: HDPS: heart disease prediction system
Year: 2011
Dataset: Heart disease data from machine learning repository of UCI. We have total 303 instances of which 164 instances belonged to the healthy and 139 instances belonged to the heart disease.
Type: Artificial neural network
Advantages: One benefit of LVQ is that it creates prototypes that are easy to interpret for experts in the respective application domain.
Disadvantage: Its “black box” nature, greater computational burden, proneness to overfitting, and the empirical nature of model development.
Prediction result: The accuracy of classification is near 80 % as well as 85 % sensitivity and 70 % specificity. To confirm the goodness of this model, a ROC curve is also displayed in Fig. 4.

Author: M. Akhil Jabbar, Dr. B.L. Deekshatulu, Dr. Priti Chandra
Title: Heart disease prediction using lazy associative classification
Year: 2013
Dataset: N.A.
Type: Associative classification, principal component analysis, lazy associative classification method
Advantages: It is possible to build more accurate classifier. Reduced complexity in images’ grouping with the use of PCA.
Disadvantage: The covariance matrix is difficult to be evaluated in an accurate manner. Lazy classifiers typically require more work to classify all test instances.
Prediction result: Accuracy of classification is 90 % for the proposed system.

Author: Vikas Chaurasia, Saurabh Pal
Title: Early prediction of heart diseases using data mining techniques
Year: 2013
Dataset: Heart disease data set available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes.
Type: CART, ID3, decision tree
Advantages: CART is easily accessible to beginning users and does not require a high level of technical expertise to operate.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: 83.49 % in CART, 72.93 % in ID3, 82.50 % in Decision tree

Author: Chaitrali S. Dangare, Sulabha S. Apte, Ph.D.
Title: Improved study of heart disease prediction system using data mining classification techniques
Year: 2012
Dataset: The publicly available heart disease database is used. The Cleveland heart disease database is used.
Type: Decision trees, Naive Bayes, ANN
Advantages: More powerful for classification problems. Easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; “black box” nature of ANN, greater computational burden, proneness to overfitting, and the empirical nature of model development.
Prediction result: Decision Tree: 96.66 % for 13 attributes, 99.62 % for 15 attributes. Naive Bayes: 94.44 % for 13 attributes, 90.74 % for 15 attributes. Neural Networks: 99.25 % for 13 attributes, 100 % for 15 attributes.

Author: Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni
Title: Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction
Year: 2011
Dataset: Total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland heart disease database.
Type: Data mining and artificial neural network, genetic algorithm, association rule discovery
Advantages: It is very comfortable and efficient way of problem solving.
Disadvantage: “Black Box” nature of ANN
Prediction result: Accuracy of ANN is 85.53 %, for decision trees it is 89 %, and 86.53 % for Naive Bayes

Author: K. Srinivas, B. Kavihta Rani, Dr. A. Govrdhan
Title: Applications of data mining techniques in healthcare and prediction of heart attacks
Year: 2010
Dataset: N.A.
Type: Rule set classifiers, decision trees, ANN, neuro fuzzy, Bayesian network structure discovery
Advantages: Easy to interpret; easy to generate; can classify new instances rapidly; decision trees are powerful for classification problems.
Disadvantage: They can be extremely sensitive to small perturbations in the data, and “black box” nature of ANN
Prediction result: (not given)

Author: Abhishek Taneja
Title: Heart disease prediction system using data mining techniques
Year: 2013
Dataset: The patient data set is compiled from data collected from medical practitioners in South Africa.
Type: Decision Tree Classification, Naive Bayes, ANN
Advantages: More powerful for classification problems. Naive Bayes is easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: J48 unpruned with all attributes 94.29 %, J48 pruned with all attributes 95.41 %, J48 unpruned with selected attributes 95.52 %, J48 pruned with selected attributes 95.56 %, Naive Bayes with all attributes 91.96 %, Naive Bayes with selected attributes 92.42 %, Neural Network with all attributes 93.83 %, Neural Network with selected attributes 94.85 %

Author: Ms. Ishtake S.H., Prof. Sanap S.A.
Title: Intelligent heart disease prediction system using data mining techniques
Year: 2013
Dataset: A total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland Heart Disease database.
Type: Decision tree classification, Naive Bayes, ANN
Advantages: More powerful for classification problems. Naive Bayes is easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: Accuracy is 94.93 % for decision trees, 95 % for Naive Bayes, 93.54 % for artificial neural networks.

Author: N. Aditya Sundar, P. Pushpa Latha, M. Rama Chandra
Title: Performance analysis of classification data mining techniques over heart disease data base
Year: 2012
Dataset: N.A.
Type: Naive Bayes, weighted association classifier, a priori algorithm
Advantages: Easy to implement. Good results obtained in most of the cases. Easily parallelized, easy to implement.
Disadvantage: Apriori can be very slow.
Prediction result: 78 % for Naive Bayes and 84 % for Weighted Association classifier.

Author: K. Sudhakar, Dr. M. Manimekalai
Title: Study of heart disease prediction using data mining
Year: 2014
Dataset: N.A.
Type: Neural Networks, Decision trees, Naive Bayes, associative classification
Advantages: More powerful for classification problems, easy to implement
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; Black Box nature of ANN
Prediction result: (not given)

Author: Hlaudi Daniel Masethe, Mosima Anna Masethe
Title: Prediction of heart disease using classification algorithms
Year: 2014
Dataset: Compiled from data collected from medical practitioners in South Africa
Type: Decision tree
Advantages: More powerful for classification problems, easy to implement
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; Black Box nature of ANN
Prediction result: J48: 99.0741, Reptree: 99.0741, Naive Bayes: 97.222, Bayes net: 98.1481, Simple CART: 99.0741
5 Comparison of the Recent Years
Table 1 was constructed from a study of papers written in recent years.
It brings out a stark contrast in the prediction rates obtained with different
techniques.
6 Conclusion and Future Work
For clarity, the results and prediction rates of each paper are summarized in tabular
form, along with the best prediction rate obtained by each technique or methodology,
based on a survey and analysis of recent papers. Our observations show
that in a few cases the same classifier produces different accuracy
depending on the number of attributes chosen and the kind of
algorithm applied. Several classifiers are analyzed for the required prediction.
More varied parameters need to be considered in the dataset to improve the
accuracy of the prediction system. The intent is to develop more intelligent heart
disease prediction models that employ more of the data mining techniques.
References
1. Bhatla, N., Jyoti, K.: An analysis of heart disease prediction using different data mining
techniques. Int. J. Eng. Res. Technol. (IJERT)
2. Amin, S.U., Agarwal, K., Beg, R.: Genetic neural network based data mining in prediction of
heart disease using risk factors. In: IEEE Conference on Information and Communication
Technologies (2013)
3. Chen, A.H., Huang, S.Y., Hong, P.S., Cheng, C.H., Lin, E.J.: HDPS: heart disease prediction
system. In: Computers in Cardiology Conference
4. Jabbar, M.A., Deekshatulu, B.L., Chandra, P.: Heart Disease Prediction using Lazy
Associative Classification, 2013 IEEE (2013)
5. Chaurasia, V., Pal S.: Early prediction of heart diseases using data mining techniques. Caribb.
J. Sci. Technol.
6. Dangare, C.S., Apte, S.S.: Improved study of heart disease prediction system using data
mining classification techniques. Int. J. Comput. Appl. (0975–888) 47(10), June 2012
7. Soni, J., Ansari, U., Sharma, D., Soni, S.: Predictive data mining for medical diagnosis. Int.
J. Comput. Appl. 17(8), 0975–8887 (2011)
8. Ishtake, S.H., Sanap, S.A.: Intelligent heart disease prediction system using data mining
techniques. Int. J. Healthc. Biomed. Res. 1(3), 94–101 (2013)
9. Masethe, H.D., Masethe, M.A.: Prediction of heart disease using classification algorithms. In:
World Congress on Engineering and Computer Science 2014 Vol II WCECS 2014, San
Francisco, USA, 22–24 Oct 2014
10. Sudhakar, K., Manimekalai, M.: Study of heart disease prediction using data mining. Int.
J. Adv. Res. Comput. Sci. Softw. Eng.
11. Sundar, N.A., Latha, P.P., Chandra, M.R.: Performance analysis of classification data mining
techniques over heart disease data base. [IJESAT] Int. J. Eng. Sci. Adv. Technol.
12. Taneja, A.: Heart disease prediction system using data mining techniques. Orient. J. Comput.
Sci. Technol.
13. Srinivas, K., Kavihta Rani, B., Govrdhan, A.: Applications of data mining techniques in
healthcare and prediction of heart attacks. (IJCSE) Int. J. Comput. Sci. Eng. 02(02), 250–255
(2010)