CLASSIFICATION OF CHRONIC KIDNEY DISEASE WITH MOST KNOWN DATA MINING METHODS

1 MURAT KOKLU, 2 KEMAL TUTUNCU

1 Department of Computer Engineering, Selcuk University, Konya, TURKEY
2 Department of Electrical – Electronics Engineering, Selcuk University, Konya, TURKEY
E-mail: 1 [email protected], 2 [email protected]

Proceedings of ISER 45th International Conference, Rabat, Morocco, 8th-9th December 2016, ISBN: 978-93-86291-60-8

Abstract- Data mining, a step of the knowledge discovery process, has brought together statistics, database, machine learning and artificial intelligence research in recent years. When investigating large amounts of data, it is important to use an effective search method to find patterns. Statistical and machine learning techniques are used to determine the models used for data mining predictions. Today, data mining is used in many different areas such as science and engineering, health, commerce, shopping, banking and finance, education and the internet. The objective of this study is to classify the Chronic Kidney Disease dataset using 4 different data mining methods, namely Naive Bayes, the C4.5 algorithm, Support Vector Machine (SVM) and Multilayer Perceptron. Correctly classified instances were found to be 95.00%, 99.00%, 97.75% and 99.75% for Naive Bayes, the C4.5 algorithm, SVM and Multilayer Perceptron, respectively.

Keywords- Chronic Kidney Disease, Data Mining, Naive Bayes, C4.5 Algorithm, SVM, Multilayer Perceptron.

I. INTRODUCTION

Knowledge must be managed effectively because the organizations that produce and use it are advancing rapidly. Since the mid-1990s, a great deal of research has been conducted to create techniques, methods and tools that support the discovery of useful information [1]. In the information age, value is created by using resources efficiently rather than through physical assets. For this purpose, many methods and techniques have been developed for information management. Data mining is one technique used to reach this goal. It determines the relationships between data by separating meaningful information from meaningless data. In general terms, data mining analyzes the data, extracts the meaningful information in it and summarizes this information [2]. In addition, data mining means using advanced analysis tools to discover previously unknown information and existing patterns in large datasets. The tools of data mining, which involves more than collecting and managing data, are statistical models, mathematical algorithms and machine learning methods [3]. Today, data mining is used in many different areas such as science and engineering, health, commerce, shopping, banking and finance, education and the internet.

Classification is a method frequently used in data mining to uncover hidden patterns in a database. Classification assigns each data object to one of several predefined classes. Well-defined characteristics play a key role in the performance of the classifier. Classification is based on a learning algorithm. Training is not performed on all of the data; it is performed on a sample drawn from the data collection. The purpose of learning is the creation of a classification model. In other words, classification is the process of determining the class of an unknown record [4,5].

Various data mining classification approaches and machine learning algorithms are applied to the prediction of chronic diseases. Here we are concerned with chronic kidney disease (CKD), also known as chronic renal disease, which is an abnormal function of the kidney or a progressive failure of renal function over a period of months or years. Often, CKD is diagnosed as a result of screening people known to be at risk of kidney problems, such as those with high blood pressure or diabetes and those with a blood relative with CKD. It is differentiated from acute kidney disease in that the reduction in kidney function must be present for over 3 months. This work focuses predominantly on the prediction of chronic kidney disease, which is predicted using classification techniques of data mining [6]. Chronic kidney disease prediction is one of the most central problems in medical decision making because CKD is one of the leading causes of death. An automated tool for early prediction of this disease would therefore be useful for treatment [7].

In this study, we experimented on the chronic kidney disease dataset to find the best-performing data mining algorithm for this domain. The rest of the paper is organized as follows: Section 2 describes the theoretical background of the classifiers used in this study, Section 3 describes the experimental studies and Section 4 concludes the paper.

II. THEORETICAL BACKGROUND

2.1. Data Mining Techniques

In this study, 4 different classification techniques were used to diagnose chronic kidney disease. Brief information about each technique, namely Naive Bayes, the C4.5 algorithm, SVM and Multilayer Perceptron, is given in the following paragraphs.

Naive Bayes: The Naive Bayes algorithm is a simple probabilistic classifier that calculates a set of probabilities by counting the frequency and combinations of values in a given dataset. The algorithm uses Bayes' theorem and assumes all attributes to be independent given the value of the class variable. This conditional independence assumption rarely holds in real-world applications, hence the characterization as naive, yet the algorithm tends to perform well and learn rapidly in various supervised classification problems [8, 9].

C4.5 Algorithm: A decision tree is a classifier that recursively partitions the instance space.
A typical decision tree is composed of internal nodes, edges and leaf nodes. Each internal node is called a decision node and represents a test on an attribute or a subset of attributes, and each edge is labeled with a specific value or range of values of the input attributes. In this way, internal nodes together with their edges split the instance space into two or more partitions. Each leaf node is a terminal node of the tree with a class label. Given a set of training data, a measurement function is applied to all attributes to find the best splitting attribute. Once the splitting attribute is determined, the instance space is partitioned into several parts. Within each partition, if all training instances belong to a single class, the algorithm terminates. Otherwise, the splitting process is performed recursively until every partition is assigned to a single class. Once a decision tree is built, classification rules can easily be generated and used to classify new instances with unknown class labels [10].

Support Vector Machine (SVM): Each vector in the data matrix (in the cited work, a gene expression matrix) may be thought of as a point in an m-dimensional space. A simple way to build a binary classifier is to construct a hyperplane separating class members from non-members in this space. This is the approach taken by perceptrons, also known as single-layer neural networks. Unfortunately, most real-world problems involve non-separable data, for which no hyperplane exists that successfully separates the class members from the non-members in the training set. One solution to the inseparability problem is to map the data into a higher-dimensional space and define a separating hyperplane there. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training examples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable [11, 12].
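The feature-space idea can be illustrated with a small sketch. The code below is not an SVM solver: it trains a plain perceptron (the single-layer network mentioned above) on the four XOR points, first in the 2-D input space and then after a hand-picked map (x1, x2) -> (x1, x2, x1*x2). The names `train_perceptron` and `feature_map`, the epoch limit and the choice of map are assumptions made for this illustration only and are not part of the study.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def train_perceptron(samples: Sequence[Point], labels: Sequence[int],
                     max_epochs: int = 1000) -> Tuple[List[float], bool]:
    """Classic perceptron. Returns (weights with bias last, converged flag)."""
    w = [0.0] * (len(samples[0]) + 1)  # last entry acts as the bias
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(samples, labels):
            xa = list(x) + [1.0]  # append constant input for the bias
            score = sum(wi * xi for wi, xi in zip(w, xa))
            if y * score <= 0:  # misclassified (or on the hyperplane)
                w = [wi + y * xi for wi, xi in zip(w, xa)]
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: data is separated
            return w, True
    return w, False


def feature_map(x: Point) -> Point:
    """Hypothetical map from input space (2-D) to feature space (3-D)."""
    x1, x2 = x
    return (x1, x2, x1 * x2)


# XOR: no hyperplane in the 2-D input space separates the two classes.
xor_points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
xor_labels = [-1, 1, 1, -1]

_, separable_in_input = train_perceptron(xor_points, xor_labels)
w_feat, separable_in_feature = train_perceptron(
    [feature_map(p) for p in xor_points], xor_labels)

print("separated in input space:  ", separable_in_input)
print("separated in feature space:", separable_in_feature)
```

Failing to converge within `max_epochs` is only evidence of non-separability in general, although for XOR it is a known fact. An SVM would typically obtain the same effect through a kernel function rather than an explicit map, and would additionally choose the hyperplane with the largest margin.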
Multilayer Perceptron: Multilayer Perceptrons (MLPs) constitute an important class of feed-forward Artificial Neural Networks (ANNs), developed to replicate the learning and generalization abilities of humans by modeling the functions of biological neural networks. They have many potential applications in the areas of Artificial Intelligence (AI) and Pattern Recognition (PR). Handwritten numeral recognition is a benchmark problem of PR. It has a clearly defined commercial importance and a level of difficulty that makes it challenging, yet it is not so large as to be completely intractable. Optical Character Recognition (OCR) of handwritten numerals is central to many commercial applications, such as reading amounts from bank cheques, extracting numeric data from filled-in forms and interpreting handwritten pin codes on mail pieces. The work in [13] mainly aims to establish the usefulness of the MLP as a pattern classifier compared with the Nearest Neighbor (NN) classifier, which is used as a suboptimal traditional classifier.

2.2. Commonly-Accepted Performance Evaluation Measures

Classification performance without focusing on a class is the most general way of comparing algorithms. It does not favor any particular application. This is the case we focus on in this study. The introduction of a new learning problem inevitably concentrates on its domain but omits a detailed analysis. Thus, the most used empirical measure, accuracy, does not distinguish between the number of correct labels of different classes [14]. The measures are defined in terms of the following counts:

TP = true positives: number of examples predicted positive that are actually positive
FP = false positives: number of examples predicted positive that are actually negative
TN = true negatives: number of examples predicted negative that are actually negative
FN = false negatives: number of examples predicted negative that are actually positive

Accuracy: It refers to the proportion of all records that are correctly classified by the classifier.
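The four counts above are enough to compute the usual summary measures. The following is a minimal sketch using the standard definitions of accuracy, TP rate (recall), FP rate, precision and F-measure. The class name `ConfusionCounts` and its method names are assumptions made for this illustration. The example counts are the C4.5 figures reported in the results of this paper, with "CKD" taken as the positive class (249 and 147 records correctly assigned, 1 and 3 misassigned).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Counts of a two-class confusion matrix, relative to the positive class."""
    tp: int  # predicted positive, actually positive
    fn: int  # predicted negative, actually positive
    fp: int  # predicted positive, actually negative
    tn: int  # predicted negative, actually negative

    def accuracy(self) -> float:
        total = self.tp + self.tn + self.fp + self.fn
        return 100.0 * (self.tp + self.tn) / total

    def tp_rate(self) -> float:
        """Also called recall or sensitivity."""
        return 100.0 * self.tp / (self.tp + self.fn)

    def fp_rate(self) -> float:
        return 100.0 * self.fp / (self.fp + self.tn)

    def precision(self) -> float:
        return 100.0 * self.tp / (self.tp + self.fp)

    def f_measure(self) -> float:
        """Harmonic mean of precision and recall, reported as a percentage."""
        p = self.precision() / 100.0
        r = self.tp_rate() / 100.0
        return 100.0 * 2.0 * p * r / (p + r)


# C4.5 counts from this paper, positive class = "CKD".
c45_ckd = ConfusionCounts(tp=249, fn=1, fp=3, tn=147)

for name in ("accuracy", "tp_rate", "fp_rate", "precision", "f_measure"):
    print(f"{name}: {getattr(c45_ckd, name)():.2f}%")
```

With these counts the accuracy evaluates to 99.00%, which agrees with the 396 of 400 correctly classified samples reported for C4.5. The sketch has no guard against zero denominators, so it assumes each class occurs at least once and at least one positive prediction is made.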
The accuracy of a classifier is defined as the percentage of test set tuples that are correctly classified by the model [15]:

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \times 100\%$$

True Positive Rate (TP Rate): It corresponds to the proportion of positive examples that have been correctly predicted by the classification model.

False Positive Rate (FP Rate): It corresponds to the proportion of negative examples that have been wrongly predicted as positive by the classification model.

Precision: The fraction of retrieved instances that are relevant.

$$\text{Precision} = \frac{TP}{TP + FP} \times 100\%$$

Recall: Refers to the true positive rate, that is, the proportion of positive tuples that were correctly identified. It is also known as the sensitivity of the classifier.

$$\text{Recall} = \frac{TP}{TP + FN} \times 100\%$$

F-measure: The F-measure combines Precision and Recall as their harmonic mean.

$$\text{F-measure} = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \times 100\%$$

ROC Curve: The Receiver Operating Characteristic (ROC) curve shows both the sensitivity and the specificity of the test. The ROC curve is defined as the comparison of TPR (True Positive Rate) and FPR (False Positive Rate). The TPR is the proportion of positive tuples that are correctly labeled by the model, whereas the FPR is the proportion of negative tuples misclassified as positive [15], i.e.

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$

Confusion Matrix: A confusion matrix contains information about the actual and predicted classifications made by a classification system. The performance of such systems is commonly evaluated using the data in the matrix. Table 1 shows the confusion matrix for a two-class classifier [16]. The entries in the confusion matrix have the following meaning in the context of our study: TP is the number of correct predictions that an instance is positive, FN is the number of incorrect predictions that an instance is negative, FP is the number of incorrect predictions that an instance is positive and TN is the number of correct predictions that an instance is negative.

Table 1: Confusion Matrix

                     Predicted Positive    Predicted Negative
Actual Positive      TP                    FN
Actual Negative      FP                    TN

III. EXPERIMENTAL STUDY

3.1. Dataset Description

This study uses the Chronic Kidney Disease dataset from the UCI Machine Learning Repository, uploaded in 2015. The dataset was collected at the Apollo hospital (Tamilnadu) over a period of nearly 2 months and has 25 attributes, 11 numeric and 14 nominal. The attributes and their descriptions are given in Table 2 [17, 18]. A total of 400 instances of the dataset are used to train the prediction algorithms, of which 250 are labeled chronic kidney disease (ckd) and 150 are labeled not chronic kidney disease (notckd) [7, 17].

Table 2: Description of Attributes in the Chronic Kidney Disease Dataset

3.2. Performance Evaluation

Naive Bayes, SVM, C4.5 and Multilayer Perceptron were used to predict chronic kidney disease, and the results obtained are shown in Table 3.

Naive Bayes: as can be seen from Table 3, 380 of 400 samples were classified correctly. Thus, the correct classification ratio is 95.00%.

SVM: as can be seen from Table 3, 391 of 400 samples were classified correctly. Thus, the correct classification ratio is 97.75%.

C4.5: as can be seen from Table 3, 396 of 400 samples were classified correctly. Thus, the correct classification ratio is 99.00%.

Multilayer Perceptron: as can be seen from Table 3, 399 of 400 samples were classified correctly.
Thus, the correct classification ratio is 99.75%.

Table 3: Accuracy Ratio of Application

In this work, 400 records were classified. There are 2 possible classes, namely “CKD” and “NOTCKD”. The confusion matrix obtained on the test data shows to which classes the records belonging to a given class were assigned by the algorithm. For instance, with the C4.5 algorithm, 249 of the records belonging to the “CKD” class were assigned to the “CKD” class correctly (true positive for “CKD”). One of the records belonging to the “CKD” class was assigned to the “NOTCKD” class mistakenly (false positive for “NOTCKD”). 147 of the records belonging to the “NOTCKD” class were assigned to the “NOTCKD” class correctly (true positive for “NOTCKD”). 3 of the records belonging to the “NOTCKD” class were assigned to the “CKD” class mistakenly (false positive for “CKD”). Confusion matrices for each algorithm are presented in Table 4. TP Rate, FP Rate, Precision and Recall values for each algorithm are given in Table 5.

Table 4: Confusion Matrix

Table 5: Algorithms Detailed Accuracy by Class

CONCLUSIONS

In this study, the performances of the Naive Bayes, SVM, C4.5 and Multilayer Perceptron methods were evaluated in terms of classification accuracy for diagnosing chronic kidney disease. Comparing the performances of the algorithms, Multilayer Perceptron (99.75%) had the highest accuracy, whereas Naive Bayes (95.00%) had the lowest. A future direction is testing with datasets from domains other than the standard UCI repository, for example real-life data or data obtained from surveys in different domains.

ACKNOWLEDGMENTS

This study has been supported by the Scientific Research Project Coordinatorship of Selcuk University.

REFERENCES

1. Coşkun, C. and Baykal, A. (2011). Veri Madenciliğinde Sınıflandırma Algoritmalarının Bir Örnek Üzerinde Karşılaştırılması. Akademik Bilişim'11 Konferansı, February 2011, Inönü University, Malatya, TURKEY.
2. Gorunescu, F. (2011). Data Mining: Concepts, Models and Techniques (Vol. 12), pp. 1-43. Springer Science & Business Media.
3. Seifert, J. W. (2004). Data mining: An overview. http://www.fas.org/irp/crs/RL31798.pdf, Date accessed: January 2016.
4. Langaas, M. (1995). Discrimination and classification. Technical report, Department of Mathematical Sciences, The Norwegian Institute of Technology. http://www.math.ntnu.no/preprint/statistics/ps/S1-1995.
5. Han, J., Pei, J. and Kamber, M. (2011). Data Mining: Concepts and Techniques. Elsevier.
6. Sinha, P. and Sinha, P. (2015). Comparative Study of Chronic Kidney Disease Prediction using KNN and SVM. International Journal of Engineering Research and Technology, Vol. 4, No. 12, ESRSA Publications.
7. Kumar, M. (2016). Prediction of Chronic Kidney Disease Using Random Forest Machine Learning Algorithm. International Journal of Computer Science and Mobile Computing, Vol. 5, Issue 2, pp. 24-33.
8. Kohavi, R. and Provost, F. (1998). On Applied Research in Machine Learning. In Editorial for the Special Issue on Applications of Machine Learning and the Knowledge Discovery Process, Columbia University, New York, volume 30.
9. Blake, A. C. L. and Merz, C. J. (1998). University of California at Irvine Repository of Machine Learning Databases. https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Disease, Last Access: 20.10.2016.
10. Jena, L. and Kamila, N. K. (2015). Distributed Data Mining Classification Algorithms for Prediction of Chronic-Kidney-Disease. International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Vol. 4, Issue 11, pp. 110-118.
11. Sokolova, M., Japkowicz, N. and Szpakowicz, S. (2006). Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation. In Australasian Joint Conference on Artificial Intelligence (pp. 1015-1021). Springer Berlin Heidelberg.
12. Patil, T. R. and Sherekar, S. S. (2013). Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International Journal of Computer Science and Applications, 6(2), 256-261.
13. Dimitoglou, G., Adams, J. A. and Jim, C. M. (2012). Comparison of the C4.5 and a Naïve Bayes classifier for the prediction of lung cancer survivability. arXiv preprint arXiv:1206.1121.
14. Dai, W. and Ji, W. (2014). A mapreduce implementation of C4.5 decision tree algorithm. International Journal of Database Theory and Application, 7(1), 49-60.
15. Vapnik, V. N. and Vapnik, V. (1998). Statistical Learning Theory (Vol. 1). New York: Wiley.
16. Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C., Ares, M. and Haussler, D. (1999). Support vector machine classification of microarray gene expression data. University of California, Santa Cruz, Technical Report UCSC-CRL-99-09.
17. Roy, K., Chaudhuri, C., Kundu, M., Nasipuri, M. and Basu, D. K. (2005). Comparison of the Multi Layer Perceptron and the Nearest Neighbor Classifier for Handwritten Numeral Recognition. Journal of Information Science and Engineering, 21, pp. 1247-1259.
18. Maharjan, D. (2013). Performance Analysis of Mlp, C4.5 and Naïve Bayes Classification Algorithms Using Income and Iris Datasets.