Survey
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
ISSN 2321 3361 © 2016 IJESC Research Article Volume 6 Issue No. 9 Performance Comparison for C4.5 and K-NN Techniques on Diabetics in Heart Problem Viswanathan K 1 , Dr. Mayilvahanam P 2 , R. Christy PushpaLeela 3 Technical Arch itect 1 , Head 2 , Assistant Professor3 Cognizant Technology Solutions, Chennai, India1 , Vels Un iversity Chennai, India2 , Women’s Christian Co llege, Chennai, India 3 Abstract: The aim of th is research paper is to study and discuss the various classification algorith ms applied on different kinds of da tasets in terms of medical and compare the dataset performance. The classificat ion algorith ms with maximu m accuracies o n various kinds of med ical datasets are taken for performance analysis. The result of the performance analysis shows the most frequently used algorith ms on particular med ical dataset.This research paper also discusses comparison of C4.5 and K-NN. This research paper summarizes various reviewsand technical articles which focus on the current research on Medical diagnosis. Keywords: Classificat ion, Medical Diagnosis, C4.5,K-NN, Dataset andUCI, W EKA tool, Data-preprocessing. I. INTRODUCTION Data mining is the process of discovering actionable informat ion fro m large sets of data. Data mining, the ext raction of hidden predictive information fro m large databases. Data mining is a powerful new technology with great potential to help companies focus on the most important information in their data warehouses. Data min ing uses mathematical analysis to apply the patterns and movements that exist in data. These patterns cannot be discovered by traditional data explorat ion because the relationships are too complex or because there is too much data. This process can be defined by using the following six basic steps: 1. Problem Defin ition 2. Data prepare 3. Exp loring data 4. Apply the model 5. Walk around & authenticating models 6. Deploying & updating models Data-Preprocessing: Is a data mining technique that involves transforming raw data into a reasonable format. Real-world data is frequently incomp lete, inconsistent and lacking in certain actions and is likely to contain many erro rs. Data preprocessing is a supported method of resolving these issue called data-preprocessing. International Journal of Engineering Science and Computing, September 2016 Fig1: Data Mini ng – Process Flow II. DATA-CLASSIFICATION Data classification is one of the data min ing methodologies used to forecast and classify the predetermined data for the specific class. There are different classifications methods and some of the methods are below. 1. Support Vector Machine (for linear and nonlinear dataSVM) 2. K-nearest neighbor classifier(KNN) 3. C4.5 4. Bayesian classification (Statistical classifier) 5. Decision tree 6. Rule based classification (IF THEN Ru le) 7. Classification - Association Rule 3095 http://ijesc.org/ The k-medoid algorith m is similar to k-means except that the centroids have to belong to the data set being clustered. Fuzzy c-means is also similar, except that it co mputes fuzzy membership functions for each clusters rather than a hard one. Fig 2: Data Mi ning- Classification Method 2.11SVM–Support Vector Machine SVM is a classificat ion method.SVM supports data min ing and test mining. It was developed during the year of 1992. SVM Generally is capable to delivered h igh performance in terms of classification accuracy compare to other classification. 2.2C4.5 C4.5 algorithm is a greedy algorithm and was developed by Ross Quinlan and its used for the induction of decision trees.C4.5 is often referred to as a statistical classifier. The decision tree algorithm C4.5 is developed from ID3 in the following ways 2.3K-NN It is the nearest neighbour algorithm. The k-nearest neighbour’s algorith m is a technique for classifying objects based on the next train ing data in the feature space. The algorithm operates on a set of d-dimensional vectors, D = {xi | i = 1. . . N}, where xi ∈ k d denotes the ith data point. Selection k points in k d as the initial k cluster representatives . Algorithm iterates between two steps till junction. Step 1: Data Assignment each data point is assign to its adjoining centroid, with ties broken arbitrarily. Th is results in a partitioning of the data. Step 2: Relocation of “means”. Each group representative is relocating to the center (mean) of all data points assign to it. If the data points come with a possibility measure (Weights), then the relocation is to the expectations (weighted mean) of the data partitions. “Kernelize” k-means though margins between clusters are still linear in the embedded high-dimensional space, they can become nonlinear when projected back to the original space, thus allowing kernel k-means to deal with more co mp lex clusters. International Journal of Engineering Science and Computing, September 2016 III. Heart Disease Diagnosis of heart disease is a significant and tedious task in med icine. Heart disease occurs when the arteries which normally provide o xygen and blood to the heart choked completely. Various types of heart diseases are, Heart failure Card io myopathy Coronary heart disease Hypertensive heart disease Inflammatory heart disease Heart disease Co mmon risk factors are below Age Gender High blood pressure Smoking Habits Abnormal b lood lipids Fatness Diabetes Family antiquity No Physical task Data Set (Heart Patients Data Set) IV. The first step is the pre-processing to remove the inconsistent data. Apply the algorithm is used to classify the data. Sl no Attri bute Data-Type 1 Age Nu meric 2 Gender No minal 3 BP Systolic No minal 4 Diabetic Nu meric 5 BP Dialic Nu meric 6 Height Nu meric 7 Weight Nu meric 8 BMI Nu meric 9 Hypertension No minal 10 Rural No minal 11 Urban No minal 12 Disease Table 1: Data Set No minal V.WEKA TOOL WEKA is a data mining tool. It provides the facility to classify the data through various algorithms. WEKA stands for Waikato Environ ment for Kno wledge Learning and it was developed by the University of Waikato, New Zealand Software. 3096 http://ijesc.org/ Data Set- Heart Problem 84.5 83.2 90 77.3476.3 73.8 80 70 60 Heart Problem Fig4: Performance comparison bar -chart Fig3: WEKA tools VI. Cl assification Matrix The basic phenomenon used to classify the Heart disease classification using classifier is its performance and accuracy. The performance of a chosen classifier is validated based on error rate and co mputation time. The classificat ion accuracy is predicted in terms of Sensitivity and Specificity. It compares the actual values in the test dataset with the predicted values in the trained model. In this example, the test dataset contained 208 patients with heart disease and 246 patients without heart disease. Predicted Classified as a Healthy(0) Classified as a not Healthy(1) Actual Healthy (0) TP FN Actual not Healthy (1) FP TN VII. Conclusion Main objective of med ical data mining algorith m is to get best algorith ms that define given data from mu ltip le features. C4.5 is the best classifier for classificat ion and it shows better results in operational. Accuracy provided by C4.5 in binary classification is more reasonable. We have observed that C4.5 accuracy is good. VIII.REFER ENCES [1] International Journal of Co mputer Science & Information Technology (IJCSIT) Vo l 3, No 4, August 2011”PERFORMANCE ANA LYSIS OF VA RIOUS DATAMINING CLA SSIFICATION TECHNIQUES ON HEA LTHCA RE DATA” Shelly Gupta1, Dharminder Ku mar2 and Anand Sharma3 1AIM & ACT, Banasthali Un iversity, Banasthali, India shelly.gupta24@g mail.co m 2Depart ment of CSE, GJUS&T, Hisar, Indiadr_dk_ ku [email protected] 3Depart ment of CSE, GJUS&T, Hisar, Indiaandz24@g mail.co m. [2] International Journal of Innovative Technology and Exp loring Engineering (IJITEE) ISSN: 2278-3075, Volu me -2, Issue-6, May 2013 A Study on WEKA Tool for Data Preprocessing, Classification and Clustering SwastiSinghal, Monika Jena. Table 2: confusion matrix Where TP- True Positi ve TN - true Neg ati ve FP - False Positi ve FN - false Negati ve CT- Computing Ti me For measuring accuracy rate and Error Rate, the following mathematical model is used. Accuracy = TP+TN/ TP+FP+ TN +FN CVError Rate = FP + FN / TP+FP+ TN +FN Performance Comparison (C 4.5 and K-NN) Dataset SMO Naï ve Bayes C4.5 J48 kNN Heart Problem 77.34 76.3 84.5 73.8 83.2 [3] PayalDhakate, SuvarnaPatil, K. Rajeswari, Dr. V.Vaithiyananthan, DeepaAbin, - Preprocessing and Classification in WEKA using different classifiers‖, Journal of Engineering Research and Applications www.ijera.co m ISSN : 2248-9622, Vol. 4, Issue 8( Version 1), August 2014. [4] Remco R. Bouckaert, Eibe Frank, Mark A. HallGeoffrey Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten, ―WEKA—Experiences with a Java Open-Source Pro ject‖, Journal of Machine Learn ing Research, November 2010. [5] Performance Analysis of various Data mining classification (C4.5 Vs SVM) Techniques on Diabetics in Heart Problem. Viswanathan K, Dr.Mayilvahanan K, R.ChristyPushpaleela. [6] Thair Nu Phyu,”Survey of Classificat ion Techniques in Data Mining”IM ECS 2009 Table 3: Performance Comparison International Journal of Engineering Science and Computing, September 2016 3097 http://ijesc.org/ [7] Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy”, Advances in Knowledge Discovery and Data Mining”, (Chapter 1), AAAI/MIT Press 1996. [8] Witten,I. and Eibe,F. Data mining practical mach ine learning tools and techniques.2nded,Sanfrancisco:Morgan Kaufmann series in data management systems.,2005. [9] PardhaRepalli, “Pred iction on Diabetes Using Datamining Approach”. [10] Joseph L. Breault., “Data M ining Diabetic Databases:Are Rough Sets aUseful Addition”. [11] G. Parthiban, A. Rajesh, S.K.Srivats a, “Diagnosis of Heart Disease for Diabetic Patients using Naive Bayes Method “, International Journal ofCo mputer Applications (0975 – 8887) Vo lu me 24– No.3, June 2011. [12] P. Pad maja, “Characteristic evaluation of diabetes data using clustering techniques”, IJCSNS International Journal of Co mputer Science and NetworkSecurity, VOL.8 No.11, November 2008. [13] P.Yasodha, M. Kannan, ―Analysis of a Population of Diabetic Pat ientsDatabases in Weka Tool‖, Research Vol 2, Issue 5, May-2011. [14] VikasChaurasia, Saurabh Pal, ―Data M ining Approach to Detect Heart Dieses‖, International Journal of Advanced Co mputer Science and Information Technology Vol. 2. [15] D.Lavanya and Dr.K.Usha Rani, ―Ensemble decision tree classifier for breast cancer data. International Journal of Information Technology Convergence and Services, Vol.2, No.1. February 2011. International Journal of Engineering Science and Computing, September 2016 3098 http://ijesc.org/