Survey
Application of Data Mining Techniques on Heart Disease Prediction: A Survey

Ritika Chadha, Shubhankar Mayank, Anurag Vardhan and Tribikram Pradhan
Department of ICT, Manipal University, Manipal 576104, India

© Springer India 2016. N.R. Shetty et al. (eds.), Emerging Research in Computing, Information, Communication and Applications, DOI 10.1007/978-81-322-2553-9_38

Abstract
Globally, the medical industry is "information rich" but "knowledge poor". Knowledge discovery from data (KDD) is therefore applied to extract interesting patterns from datasets using different data mining techniques. The massive amount of available data is essential for extracting useful information and for deriving relationships among the attributes. The aim of this paper is to compile, tabulate and analyze the data mining techniques that have been applied to heart disease prediction in recent years. Each previous paper exhibits a set of strengths and limitations in terms of the data types used in the dataset, accuracy, ease of interpretation, reliability and generalization ability. This paper compares the techniques and highlights the pros and cons of each. So far, the observations reveal that Neural Networks perform well compared to Naive Bayes and Decision Trees under appropriate conditions.

Keywords
Heart disease, Decision tree, Neural networks, Genetic algorithm, Naive Bayes, Classification

1 Introduction
Heart disease is the single largest cause of death in developed countries and one of the main contributors to the disease burden in developing countries. The shortage of doctors and experts, together with the frequent neglect of patients' symptoms, calls for data mining as an analysis tool to discover hidden relationships and patterns in heart disease (HD) medical data. Detecting a disease requires numerous tests that a patient has to go through. Added to this is the large amount of complex data about patients, hospital resources, disease diagnosis, electronic patient records, medical devices, etc. To avoid this costly and cumbersome task, data mining techniques come into play, as they are efficient and cost-effective.

Data mining techniques are the result of a long process of experimentation and research and development. Data mining is divided into two kinds of tasks: predictive tasks and descriptive tasks. It involves several steps, from raw data collection to some form of interesting pattern. The process, which is iterative, includes Data Cleaning, Data Integration, Data Selection, Data Transformation, Data Mining, Pattern Evaluation, and Knowledge Discovery, as shown in Fig. 1. Our work presents an overall view of the tasks performed to extract information from data in order to increase the prediction rate for heart disease through the application of various data mining techniques and processes.

Fig. 1 KDD

2 Related Work
Nidhi Bhatla et al. [1], in their work An Analysis of Heart Disease Prediction using Different Data Mining Techniques, performed an experiment using the data mining tool Weka 3.6.6. This research reports an accuracy of 100 % for Neural Networks, compared to 99.62 % and 90.74 % for Decision Tree and Naïve Bayes respectively. A method based on fuzzy logic and genetic algorithms is also used, which combines genetic algorithms for feature selection with fuzzy expert systems, with experiments carried out in Matlab using its fuzzy tool. Two kinds of algorithms are used. Earlier, 13 attributes were used for this prediction, but this research work reduced the number of attributes to six using Genetic Algorithm and Feature Subset Selection. The prototype Intelligent Heart Disease Prediction System (IHDPS) had been developed using techniques such as Decision Trees, Naive Bayes and Neural Networks.
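The attribute reduction reported in [1] relies on a genetic algorithm for feature subset selection. The following is a minimal sketch of that idea, not the implementation used in [1]. Each individual is a 0/1 mask over 13 attributes. The names ga_feature_selection, toy_fitness and INFORMATIVE are hypothetical, and the toy fitness function is an assumption that merely stands in for a real criterion such as the cross-validated accuracy of a classifier trained on the selected attributes.

```python
import random

def ga_feature_selection(fitness, n_features, pop_size=20, generations=30,
                         crossover_rate=0.8, mutation_rate=0.05, seed=0):
    """Search for a 0/1 mask over the attributes that maximizes `fitness`."""
    rng = random.Random(seed)
    # Each individual is a list of 0/1 flags, one per attribute.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)

    def tournament():
        # Selection: the fitter of two random individuals becomes a parent.
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = [best[:]]  # elitism: carry over the best mask found so far
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            if rng.random() < crossover_rate:
                # Single-point crossover between the two parents.
                cut = rng.randint(1, n_features - 1)
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Mutation: flip each bit with a small probability.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best

# Hypothetical stand-in for a real fitness function: pretend that six of the
# 13 attributes are informative and penalize every other selected attribute.
INFORMATIVE = {0, 2, 7, 8, 11, 12}

def toy_fitness(mask):
    hits = sum(1 for i, bit in enumerate(mask) if bit and i in INFORMATIVE)
    extras = sum(1 for i, bit in enumerate(mask) if bit and i not in INFORMATIVE)
    return hits - 0.5 * extras

best_mask = ga_feature_selection(toy_fitness, n_features=13)
selected = [i for i, bit in enumerate(best_mask) if bit]
print("selected attribute indices:", selected)
```

Elitism keeps the best mask of each generation, so the best fitness never decreases from one generation to the next. With a real dataset, the fitness evaluation dominates the running time, which is consistent with the remark in Table 1 that GAs are slow.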
The analysis in [1] shows that Neural Network has the highest accuracy so far, i.e. 100 %. Decision Tree has also performed well, with 99.62 % accuracy using 15 attributes.

The work of Amin et al. [2], Genetic Neural Network Based Data Mining in Prediction of Heart Disease Using Risk Factors, developed an intelligent data mining system based on a genetic algorithm. To transform the data into a useful form, values were encoded in the range [−1, 1]. The Neural Network Weight Optimization by Genetic Algorithm system uses the back-propagation algorithm for learning and training the neural network. The disadvantages of back-propagation were addressed in this paper by optimizing the initial weights of the neural network. For this, a genetic algorithm, which is specialized for global search, was used. The accuracy of heart disease prediction on the training data was calculated as 89 % and the accuracy on validation data was 96.2 %.

In HDPS: Heart Disease Prediction System [3], A.H. Chen et al. applied an ANN technique, using a learning vector quantization (LVQ) system to represent the feature space of the observed data with prototypes W = (w(1), …, w(n)). A winner-take-all training algorithm is applied, in which the so-called winner prototype is moved closer to a data point if it classifies the point correctly and moved away if it does not. The classification accuracy is near 80 %, with 85 % sensitivity and 70 % specificity. Their approach consists of three steps. The first is the selection of 13 important clinical features: age, sex, chest pain type, trestbps, cholesterol, fasting blood sugar, resting ECG, max heart rate, exercise induced angina, old peak, slope, number of vessels colored, and thal. An artificial neural network algorithm is then developed, which obtains a prediction rate of 80 %.
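The winner-take-all rule described for the LVQ system of [3] can be sketched as follows. This is an illustrative LVQ1 update under stated assumptions, not the code of the HDPS system: the function names, the linearly decaying learning rate, the two-feature toy samples and the initial prototypes are all hypothetical.

```python
import math

def nearest_prototype(x, prototypes):
    """Return the index of the prototype closest to x (the 'winner')."""
    return min(range(len(prototypes)),
               key=lambda i: math.dist(x, prototypes[i][0]))

def lvq1_train(samples, prototypes, lr=0.3, epochs=20):
    """LVQ1 training; samples and prototypes are lists of (vector, label)."""
    protos = [(list(w), label) for w, label in prototypes]
    for epoch in range(epochs):
        rate = lr * (1 - epoch / epochs)  # assumed linear decay of the step size
        for x, y in samples:
            k = nearest_prototype(x, protos)
            w, label = protos[k]
            # Move the winner toward x if its label matches, away otherwise.
            sign = 1.0 if label == y else -1.0
            protos[k] = ([wi + sign * rate * (xi - wi) for wi, xi in zip(w, x)],
                         label)
    return protos

def lvq_predict(x, prototypes):
    """Classify x with the label of its nearest prototype."""
    return prototypes[nearest_prototype(x, prototypes)][1]

# Hypothetical toy data: two features scaled to [0, 1]; label 1 = disease.
samples = [([0.1, 0.2], 0), ([0.2, 0.1], 0), ([0.3, 0.3], 0),
           ([0.8, 0.9], 1), ([0.9, 0.7], 1), ([0.7, 0.8], 1)]
initial = [([0.4, 0.4], 0), ([0.6, 0.6], 1)]

trained = lvq1_train(samples, initial)
print(lvq_predict([0.15, 0.25], trained), lvq_predict([0.85, 0.75], trained))
```

Each trained prototype is a point in the same feature space as the patient records, which is why Table 1 lists interpretability of the prototypes as a benefit of LVQ.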
The final step of the approach of Chen et al. [3] is a user-friendly heart disease prediction system (HDPS) that generates prediction results using artificial neural network (ANN) techniques in a C and C# environment.

In Heart Disease Prediction using Lazy Associative Classification [4], M. Akhil Jabbar et al. apply a lazy data mining approach to heart disease classification. An information-centric attribute measure, PCA, is applied to generate class association rules. These class association rules are used to predict the occurrence of heart disease. The approach achieved an improvement of 10.8 % over J4.8 and 19.8 % over naïve Bayes on non-medical data sets, and an improvement of 10.26 % over J4.8 and 8.6 % over naïve Bayes on the heart disease data set.

In Early Prediction of Heart Diseases Using Data Mining Techniques [5], three classifiers, ID3, CART and DT, are used, of which CART is the most accurate at 83.49 %, taking 0.23 s to build the model. The most important attributes for heart disease are cp (chest pain), slope (the slope of the peak exercise segment), exang (exercise induced angina), and restecg (resting electrocardiographic results). These attributes were found using three tests for the assessment of input variables: the Chi-square test, the Info Gain test and the Gain Ratio test.

Improved Study of Heart Disease Prediction System using Data Mining Classification Techniques [6] by Chaitrali S. Dangare and Sulabha S. Apte includes two more input attributes, obesity and smoking, to improve the overall prediction rate. Decision Trees, Naive Bayes and Neural Networks are used, with Neural Networks providing more accurate results than the other two: 99.25 % as opposed to 94.44 % for Naive Bayes and 96.66 % for Decision Trees.
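The Info Gain and Gain Ratio tests used in [5] to rank input attributes can be illustrated with a short calculation. The sketch below assumes categorical attributes, and the eight patient records are invented purely for illustration: they are not taken from any of the surveyed datasets. The attribute names cp, exang and fbs (fasting blood sugar) follow the feature names quoted above.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def info_gain(values, labels):
    """Reduction in class entropy after splitting on a categorical attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [y for x, y in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(values, labels):
    """Information gain normalized by the entropy of the attribute itself."""
    split_info = entropy(values)
    return info_gain(values, labels) / split_info if split_info > 0 else 0.0

# Hypothetical records: 1 = heart disease present, 0 = absent.
labels = [1, 1, 1, 1, 0, 0, 0, 0]
attributes = {
    "cp":    [1, 1, 1, 1, 0, 0, 0, 0],  # separates the classes perfectly
    "exang": [1, 1, 1, 0, 0, 0, 0, 1],  # partly informative
    "fbs":   [1, 0, 1, 0, 1, 0, 1, 0],  # carries no information here
}

ranking = sorted(attributes,
                 key=lambda name: info_gain(attributes[name], labels),
                 reverse=True)
for name in ranking:
    print(name, round(info_gain(attributes[name], labels), 4),
          round(gain_ratio(attributes[name], labels), 4))
```

Gain ratio divides the gain by the entropy of the attribute's own value distribution, which penalizes attributes that split the data into many small groups. For the binary, evenly split attributes in this toy example the two measures coincide.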
In Predictive Data Mining for Medical Diagnosis [7], Jyoti Soni et al. considered techniques such as ANN, time series, clustering, association rules and soft computing approaches. They concluded that Decision Tree outperforms the other methods and that Bayesian classification sometimes achieves similar accuracy, whereas other predictive methods such as KNN, Neural Networks and classification based on clustering do not perform well. In addition, after the application of a genetic algorithm, the accuracy of Decision Tree and Bayesian classification improves further.

In Intelligent Heart Disease Prediction System Using Data Mining Techniques [8], Sellappan Palaniappan et al. used three data mining techniques. The work involves extracting hidden knowledge from a historical heart disease database, building and accessing models through the DMX query language and functions, and training and validating the models against a test dataset. Effectiveness is assessed using a Lift Chart and a Classification Matrix. The most effective model for predicting patients with heart disease appears to be Naïve Bayes, followed by Neural Network and Decision Trees.

In Prediction of Heart Disease using Classification Algorithms [9], Hlaudi Daniel Masethe et al. performed an experiment on the prediction of heart attacks and compared methods to find the best predictor. This can act as an important tool for physicians to identify risky cases in practice and advise accordingly. The predictive accuracy determined by the J48, REPTree and Simple CART algorithms suggests that the parameters used are reliable indicators for predicting the presence of heart disease.

The work of K. Sudhakar et al. [10], Study of Heart Disease Prediction using Data Mining, presents the different techniques that have been deployed in recent years for calculating the prediction rate in heart disease. These techniques include ANN, BN, Decision Trees and Classification Algorithms.

In the work of Aditya Sundar et al., Performance Analysis of Classification Data mining Techniques Over Heart Disease Data Base [11], a prototype is described that uses the data mining techniques Naïve Bayes and WAC (weighted associative classifier). It creates a bridge between significant data and knowledge, e.g. patterns and relationships between medical symptoms. It serves as a tool to train nurses and medical interns to treat patients with heart disease.

The work of Abhishek Taneja [12], Heart Disease Prediction System Using Data Mining Techniques, describes four experiments that employ selected classification algorithms on a full training dataset containing 7339 instances. KDD has been used to develop a model that can predict heart disease cases.

In Applications of Data Mining Techniques in Healthcare and Prediction of Heart Attacks [13], K. Srinivas et al. apply data mining techniques such as Rule Based, Decision Tree, Naïve Bayes, and Artificial Neural Network to massive volumes of healthcare data.

3 Objective
Our paper highlights the advantages and disadvantages of using the different data mining techniques for the prediction of heart disease. It also reports the prediction rate for the different techniques, thereby bringing out the comparison between them.

4 Methodology
The main methodology used for our work was to examine recent publications, journals and reviews in the fields of computer science and engineering, data mining and cardiovascular disease.

4.1 Data Mining and Neural Networks
An artificial neural network (ANN), or "neural network" (NN) for short, is a mathematical or computational model based on the biological neural networks found in human anatomy. In the surveyed work, it is observed that a Heart Disease Prediction System developed using 15 attributes achieved 100 % accuracy.
However, in a few papers, 13 attributes have also been used. Weka 3.6.6 is used for the neural network experiments, while a few researchers implemented heart disease classification and prediction with an ANN using C. A Multi-layer Perceptron Neural Network (MLPNN) with back-propagation is used. The structure of the MLPNN is shown in Fig. 2.

Fig. 2 Structure of MLPNN

Framework of the ANN model: it maps a set of input data onto a set of appropriate output data. It consists of three layers, namely the input layer, the hidden layer and the output layer. A weight is assigned to each connection or branch leaving a neuron.

4.2 Genetic Algorithm
Genetic Algorithm (GA) is a heuristic that imitates the process of Darwin's natural evolution, as illustrated in Fig. 3. The algorithm is used to generate optimized solutions. It is inspired by mechanisms such as inheritance, mutation, selection, and crossover. For instance, in heart disease prediction, GA is used with Feature Subset Selection to reduce the number of attributes. Candidate sets of input values are repeatedly evaluated through a fitness function, which is a flexible expression of the modelling criteria.

4.3 Decision Tree (DT)
A decision tree, built using classification or regression techniques, takes the form of a tree structure as shown in Fig. 4. It splits a dataset into progressively smaller subsets. The final outcome is a tree with decision nodes and leaf nodes. A decision node has branches, one per outcome of a test, while a leaf node holds the result of the decision-making process. The topmost decision node in a tree is known as the root node. Each leaf is assigned to one class representing the most appropriate target value. A leaf may also hold a probability vector. A top-down approach is used, in which navigation proceeds from the root to a leaf according to the results of the tests along the path.

Fig.
3 Overall model of genetic algorithm

Fig. 4 Structure of a decision tree

4.4 Naive Bayes
This data mining classifier is based on Bayes' theorem and is particularly suited to cases where the dimensionality of the inputs is very high. Despite its simplicity, Naive Bayes can outperform more sophisticated classification methods. Bayes' theorem provides a method for calculating the posterior probability P(c|x) from P(c), P(x), and P(x|c). The Naive Bayes classifier assumes that the effect of the value of a predictor x on a given class c is independent of the values of the other predictors. This assumption is called class conditional independence. The theorem states:

$$P(c \mid x) = \frac{P(x \mid c)\, P(c)}{P(x)}$$

5 Comparison of the Recent Years
Table 1 has been constructed from a study of papers written in recent years. The table brings out a stark contrast in the prediction rates obtained using different techniques.

Table 1 Comparison of the prediction rate using data mining techniques in recent years

Author: Nidhi Bhatla, Kiran Jyoti
Title: An analysis of heart disease prediction using different data mining techniques
Year: 2013
Dataset: The dataset from the UCI machine learning repository is used.
Type: Neural networks, fuzzy logic and genetic algorithm, supervised machine learning, genetic algorithm, IHDPS
Advantages: ANN requires less formal statistical training. Effective classification.
Disadvantage: ANN requires more fine tuning. GAs are slow.
Prediction result: Naive Bayes 96.5 %, Decision Trees 99.62 %, Neural Networks 100 %, KNN 45.67 %, Classification via Clustering 88.3 %

Author: Syed Umar Amin, Kavita Agarwal, Dr. Rizwan Beg
Title: Genetic neural network based data mining in prediction of heart disease using risk factors
Year: 2013
Dataset: The data for 50 people was collected from surveys done by the American Heart Association.
Type: Data analysis and encoding, neural network weight optimization by genetic algorithm, neural networks
Advantages: Data analysis was needed for correct data preprocessing.
Disadvantage: Back propagation algorithm is very slow, and "black box" nature of ANN.
Prediction result: The accuracy of prediction of heart disease on the training data was calculated as 89 % and accuracy on validation data was 96.2 %. The least mean square error (MSE) achieved was 0.034683.

Author: AH Chen, SY Huang, PS Hong, CH Cheng, EJ Lin
Title: HDPS: heart disease prediction system
Year: 2011
Dataset: Heart disease data from the machine learning repository of UCI. There are 303 instances in total, of which 164 belonged to the healthy and 139 belonged to the heart disease.
Type: Artificial neural network
Advantages: One benefit of LVQ is that it creates prototypes that are easy to interpret for experts in the respective application domain.
Disadvantage: Its "black box" nature, greater computational burden, proneness to overfitting, and the empirical nature of model development.
Prediction result: The accuracy of classification is near 80 %, with 85 % sensitivity and 70 % specificity. To confirm the goodness of this model, a ROC curve is also displayed in Fig. 4 of that paper.

Author: M. Akhil Jabbar, Dr. B.L. Deekshatulu, Dr. Priti Chandra
Title: Heart disease prediction using lazy associative classification
Year: 2013
Dataset: N.A.
Type: Associative classification, principal component analysis, lazy associative classification method
Advantages: It is possible to build a more accurate classifier. Reduced complexity in images' grouping with the use of PCA.
Disadvantage: The covariance matrix is difficult to evaluate accurately. Lazy classifiers typically require more work to classify all test instances.
Prediction result: Accuracy of classification is 90 % for the proposed system.

Author: Vikas Chaurasia, Saurabh Pal
Title: Early prediction of heart diseases using data mining techniques
Year: 2013
Dataset: Heart disease data set available at http://archive.ics.uci.edu/ml/datasets/Heart+Disease. The data set has 76 raw attributes.
Type: CART, ID3, decision tree
Advantages: CART is easily accessible to beginning users and does not require a high level of technical expertise to operate.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: 83.49 % in CART, 72.93 % in ID3, 82.50 % in Decision tree

Author: Chaitrali S. Dangare, Sulabha S. Apte, Ph.D.
Title: Improved study of heart disease prediction system using data mining classification techniques
Year: 2012
Dataset: The publicly available heart disease database is used. The Cleveland heart disease database is used.
Type: Decision trees, Naive Bayes, ANN
Advantages: More powerful for classification problems. Easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; "black box" nature of ANN.
Prediction result: Decision Tree: 96.66 % for 13 attributes, 99.62 % for 15 attributes. Naive Bayes: 94.44 % for 13 attributes, 90.74 % for 15 attributes. Neural Networks: 99.25 % for 13 attributes, 100 % for 15 attributes.

Author: Jyoti Soni, Ujma Ansari, Dipesh Sharma, Sunita Soni
Title: Predictive Data Mining for Medical Diagnosis: An Overview of Heart Disease Prediction
Year: 2011
Dataset: N.A.
Type: Data mining and artificial neural network, genetic algorithm, association rule discovery
Advantages: Easy to interpret; easy to generate; can classify new instances rapidly; decision trees are powerful for classification problems.
Disadvantage: They can be extremely sensitive to small perturbations in the data; "black box" nature of ANN, greater computational burden, proneness to overfitting, and the empirical nature of model development.
Prediction result: Accuracy of ANN is 85.53 %, for decision trees it is 89 %, and 86.53 % for Naive Bayes.

Author: K. Srinivas, B. Kavihta Rani, Dr. A. Govrdhan
Title: Applications of data mining techniques in healthcare and prediction of heart attacks
Year: 2010
Dataset: A total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland heart disease database.
Type: Rule set classifiers, decision trees, ANN, neuro fuzzy, Bayesian network structure discovery
Advantages: It is a very comfortable and efficient way of problem solving.
Disadvantage: "Black box" nature of ANN.

Author: Abhishek Taneja
Title: Heart disease prediction system using data mining techniques
Year: 2013
Dataset: The patient data set is compiled from data collected from medical practitioners in South Africa.
Type: Decision tree classification, Naive Bayes, ANN
Advantages: More powerful for classification problems. Naive Bayes is easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: J48 unpruned with all attributes 94.29 %, J48 pruned with all attributes 95.41 %, J48 unpruned with selected attributes 95.52 %, J48 pruned with selected attributes 95.56 %, Naive Bayes with all attributes 91.96 %, Naive Bayes with selected attributes 92.42 %, Neural Network with all attributes 93.83 %, Neural Network with selected attributes 94.85 %

Author: Ms. Ishtake S.H., Prof. Sanap S.A.
Title: Intelligent heart disease prediction system using data mining techniques
Year: 2013
Dataset: A total of 909 records with 15 medical attributes (factors) were obtained from the Cleveland Heart Disease database.
Type: Decision tree classification, Naive Bayes, ANN
Advantages: More powerful for classification problems. Naive Bayes is easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data.
Prediction result: Accuracy is 94.93 % for decision trees, 95 % for Naive Bayes, 93.54 % for artificial neural networks.

Author: N. Aditya Sundar, P. Pushpa Latha, M. Rama Chandra
Title: Performance analysis of classification data mining techniques over heart disease data base
Year: 2012
Dataset: N.A.
Type: Naive Bayes, weighted association classifier, Apriori algorithm
Advantages: Easy to implement. Good results obtained in most of the cases. Easily parallelized, easy to implement.
Disadvantage: Apriori can be very slow.
Prediction result: 78 % for Naive Bayes and 84 % for Weighted Association classifier.

Author: K. Sudhakar, Dr. M. Manimekalai
Title: Study of heart disease prediction using data mining
Year: 2014
Dataset: N.A.
Type: Neural Networks, Decision trees, Naive Bayes, associative classification
Advantages: More powerful for classification problems, easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; "black box" nature of ANN.

Author: Hlaudi Daniel Masethe, Mosima Anna Masethe
Title: Prediction of heart disease using classification algorithms
Year: 2014
Dataset: Compiled from data collected from medical practitioners in South Africa.
Type: Decision tree
Advantages: More powerful for classification problems, easy to implement.
Disadvantage: Trees can be extremely sensitive to small perturbations in the data; "black box" nature of ANN.
Prediction result: J48: 99.0741, REPTree: 99.0741, Naive Bayes: 97.222, Bayes net: 98.1481, Simple CART: 99.0741

6 Conclusion and Future Work
For clarity, the results and prediction rates of each paper are summarized in tabular form, and the best prediction rate obtained with each technique is reported, based on a survey and analysis of recent papers. Our observations show that, in a few cases, the same classifier produces different accuracies depending on the number of attributes chosen and the kind of algorithm that is applied. Several classifiers are analyzed for the required prediction. More varied parameters need to be considered in the dataset to improve the accuracy of the prediction system. The intent is to develop more intelligent heart disease prediction models that employ more of the data mining techniques.

References

1.
Bhatla, N., Jyoti, K.: An analysis of heart disease prediction using different data mining techniques. Int. J. Eng. Res. Technol. (IJERT)
2. Amin, S.U., Agarwal, K., Beg, R.: Genetic neural network based data mining in prediction of heart disease using risk factors. In: IEEE Conference on Information and Communication Technologies (2013)
3. Chen, A.H., Huang, S.Y., Hong, P.S., Cheng, C.H., Lin, E.J.: HDPS: heart disease prediction system. In: Computers in Cardiology Conference
4. Jabbar, M.A., Deekshatulu, B.L., Chandra, P.: Heart disease prediction using lazy associative classification. IEEE (2013)
5. Chaurasia, V., Pal, S.: Early prediction of heart diseases using data mining techniques. Caribb. J. Sci. Technol.
6. Dangare, C.S., Apte, S.S.: Improved study of heart disease prediction system using data mining classification techniques. Int. J. Comput. Appl. (0975–888) 47(10), June 2012
7. Soni, J., Ansari, U., Sharma, D., Soni, S.: Predictive data mining for medical diagnosis. Int. J. Comput. Appl. 17(8), 0975–8887 (2011)
8. Ishtake, S.H., Sanap, S.A.: Intelligent heart disease prediction system using data mining techniques. Int. J. Healthc. Biomed. Res. 1(3), 94–101 (2013)
9. Masethe, H.D., Masethe, M.A.: Prediction of heart disease using classification algorithms. In: World Congress on Engineering and Computer Science 2014, Vol II, WCECS 2014, San Francisco, USA, 22–24 Oct 2014
10. Sudhakar, K., Manimekalai, M.: Study of heart disease prediction using data mining. Int. J. Adv. Res. Comput. Sci. Softw. Eng.
11. Sundar, N.A., Latha, P.P., Chandra, M.R.: Performance analysis of classification data mining techniques over heart disease data base. [IJESAT] Int. J. Eng. Sci. Adv. Technol.
12. Taneja, A.: Heart disease prediction system using data mining techniques. Orient. J. Comput. Sci. Technol.
13. Srinivas, K., Kavihta Rani, B., Govrdhan, A.: Applications of data mining techniques in healthcare and prediction of heart attacks. (IJCSE) Int. J. Comput. Sci. Eng. 02(02), 250–255 (2010)