CLASSIFICATION OF CHRONIC KIDNEY DISEASE WITH MOST KNOWN DATA MINING METHODS

1MURAT KOKLU, 2KEMAL TUTUNCU

1Department of Computer Engineering, Selcuk University, Konya, TURKEY
2Department of Electrical-Electronics Engineering, Selcuk University, Konya, TURKEY
E-mail: 1 [email protected], 2 [email protected]
Abstract- Data mining, a step of the knowledge discovery process, has gathered together statistical, database, machine learning and artificial intelligence studies in recent research. When investigating large amounts of data, it is important to use an effective search method for the occurrence of patterns. Statistical and machine learning techniques are used to determine the models to be used for data mining predictions. Today, data mining is used in many different areas such as science and engineering, health, commerce, shopping, banking and finance, education and the internet. The objective of this study is to classify the Chronic Kidney Disease dataset using 4 different data mining methods, namely Naive Bayes, the C4.5 algorithm, Support Vector Machine (SVM) and Multilayer Perceptron. Correctly classified instances were found to be 95.00%, 97.75%, 99.00% and 99.75% for Naive Bayes, SVM, C4.5 and Multilayer Perceptron respectively.
Keywords- Chronic Kidney Disease, Data Mining, Naive Bayes, C4.5 Algorithm, SVM, Multilayer Perceptron.
Various data mining classification approaches and machine learning algorithms are applied for the prediction of chronic diseases. Here we are concerned with chronic kidney disease (CKD), also known as chronic renal disease: an abnormal function of the kidney, or a progressive failure of renal function, over a period of months or years. Often, chronic kidney disease is diagnosed as a result of screening of people known to be at risk of kidney problems, such as those with high blood pressure or diabetes and those with a blood relative with CKD. It is differentiated from acute kidney disease in that the reduction in kidney function must be present for over 3 months. This work predominantly focuses on the prediction of chronic kidney disease using classification techniques of data mining [6]. Chronic kidney disease prediction is one of the most central problems in medical decision making because it is one of the leading causes of death, so an automated tool for early prediction of this disease would be useful for treatment [7].
I. INTRODUCTION
Knowledge should be managed effectively because the institutions producing and using knowledge are advancing rapidly. Since the mid-1990s, much research has been conducted to create techniques, methods and tools that support the discovery of useful information [1]. In the information age, value is created by using resources efficiently rather than through physical assets. For this purpose, many methods and techniques have been developed for information management. Data mining is a technique used to reach this goal: it determines the relationships between data by choosing the meaningful information from meaningless data. In general terms, data mining analyzes the data, extracts the meaningful information within it and summarizes this information [2].
Besides, data mining means using advanced data mining instruments to discover unknown information and existing patterns in large datasets. The instruments of data mining, which involve more than collecting and managing data, are statistical models, mathematical algorithms and machine learning methods [3]. Today, data mining is used in many different areas such as science and engineering, health, commerce, shopping, banking and finance, education and the internet.
Classification is a method frequently used in data mining to uncover hidden patterns in a database. Classification is used to assign data objects to several predefined classes, and well-defined characteristics play a key role in the performance of the classifier. Classification is based on a learning algorithm. Training is not done using all the data; it is performed on a sample belonging to the data collection. The purpose of learning is the creation of a classification model; in other words, classification is a class determination process for an unknown record [4, 5].
In this study, we experimented on the chronic kidney disease dataset to explore which data mining algorithm performs best in our considered domain. The rest of the paper is organized as follows: Section 2 describes the theoretical background of the classifiers used in this study for chronic kidney disease, Section 3 describes the experimental studies and Section 4 concludes the paper.
II. THEORETICAL BACKGROUND
2.1. Data Mining Techniques
In this study, 4 different classification techniques were used to diagnose chronic kidney disease. Short information about each of these techniques, namely Naive Bayes, the C4.5 algorithm, SVM and Multilayer Perceptron, is given in the following paragraphs.

Proceedings of ISER 45th International Conference, Rabat, Morocco, 8th-9th December 2016, ISBN: 978-93-86291-60-8
Naive Bayes: The Naive Bayes algorithm is a simple probabilistic classifier that calculates a set of probabilities by counting the frequency and combinations of values in a given data set. The algorithm uses Bayes' theorem and assumes all attributes to be independent given the value of the class variable. This conditional independence assumption rarely holds true in real-world applications, hence the characterization as naive; yet the algorithm tends to perform well and learn rapidly in various supervised classification problems [8, 9].
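The counting-and-multiplying idea above can be sketched in a few lines. The snippet below is a minimal categorical Naive Bayes with Laplace smoothing; the attribute values and labels are invented toy data for illustration, not the CKD dataset:

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and per-class attribute-value frequencies."""
    class_counts = Counter(labels)
    # feature_counts[class][feature_index][value] -> count
    feature_counts = defaultdict(lambda: defaultdict(Counter))
    for row, y in zip(rows, labels):
        for i, v in enumerate(row):
            feature_counts[y][i][v] += 1
    return class_counts, feature_counts

def predict(row, class_counts, feature_counts):
    """Pick the class maximizing P(y) * product of P(x_i | y)."""
    total = sum(class_counts.values())
    best, best_p = None, -1.0
    for y, cy in class_counts.items():
        p = cy / total                      # prior P(y)
        for i, v in enumerate(row):         # independence assumption
            counts = feature_counts[y][i]
            p *= (counts[v] + 1) / (cy + len(counts) + 1)  # Laplace smoothing
        if p > best_p:
            best, best_p = y, p
    return best

# Hypothetical records: (blood pressure level, diabetes) -> class
rows = [("high", "yes"), ("high", "no"), ("normal", "no"), ("normal", "yes")]
labels = ["ckd", "ckd", "notckd", "ckd"]
model = train(rows, labels)
print(predict(("high", "yes"), *model))   # -> ckd
```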
C4.5 Algorithm: A decision tree is a classifier which conducts recursive partitioning over the instance space. A typical decision tree is composed of internal nodes, edges and leaf nodes. Each internal node is a decision node representing a test on an attribute or a subset of attributes, and each edge is labeled with a specific value or range of values of the input attributes. In this way, internal nodes and their associated edges split the instance space into two or more partitions. Each leaf node is a terminal node of the tree with a class label. Given a set of training data, a measurement function is applied to all attributes to find the best splitting attribute. Once the splitting attribute is determined, the instance space is partitioned into several parts. Within each partition, if all training instances belong to a single class, the algorithm terminates; otherwise, the splitting process is recursively performed until each partition is assigned to a single class. Once a decision tree is built, classification rules can easily be generated and used for classification of new instances with unknown class labels [10].
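The measurement function that distinguishes C4.5 from its predecessor ID3 is the gain ratio: information gain normalized by split information. A minimal sketch of that computation, on invented toy values rather than the real CKD attributes:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label sequence, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(values, labels):
    """C4.5 split criterion: information gain / split information."""
    n = len(labels)
    parts = {}
    for v, y in zip(values, labels):
        parts.setdefault(v, []).append(y)
    gain = entropy(labels) - sum(len(p) / n * entropy(p) for p in parts.values())
    split_info = entropy(values)            # penalizes many-valued attributes
    return gain / split_info if split_info else 0.0

# Hypothetical attribute (hypertension) vs. class label:
htn    = ["yes", "yes", "no", "no"]
labels = ["ckd", "ckd", "notckd", "notckd"]
print(gain_ratio(htn, labels))  # 1.0: this attribute separates the classes perfectly
```

The attribute with the highest gain ratio becomes the decision node, and the procedure recurses on each partition.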
Support Vector Machine (SVM): Each vector in the data matrix may be thought of as a point in an m-dimensional space. A simple way to build a binary classifier is to construct a hyperplane separating class members from non-members in this space. This is the approach taken by perceptrons, also known as single-layer neural networks. Unfortunately, most real-world problems involve non-separable data for which there does not exist a hyperplane that successfully separates the class members from non-class members in the training set. One solution to the inseparability problem is to map the data into a higher-dimensional space and define a separating hyperplane there. This higher-dimensional space is called the feature space, as opposed to the input space occupied by the training examples. With an appropriately chosen feature space of sufficient dimensionality, any consistent training set can be made separable [11, 12].
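The feature-space idea can be illustrated with a one-dimensional toy example: the points below cannot be separated by any single threshold in the input space, but become linearly separable after an explicit quadratic feature map (real SVMs achieve this implicitly through kernel functions):

```python
# 1-D points labeled +1 when |x| >= 2: no single threshold on x separates them.
points = [-2.0, -1.0, 1.0, 2.0]
labels = [+1, -1, -1, +1]

def phi(x):
    """Explicit feature map into 2-D: phi(x) = (x, x^2)."""
    return (x, x * x)

def classify(x, threshold=2.5):
    """In feature space, the hyperplane x2 = 2.5 separates the classes."""
    _, x2 = phi(x)
    return +1 if x2 > threshold else -1

print(all(classify(x) == y for x, y in zip(points, labels)))  # True
```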
Multilayer Perceptron: Multilayer Perceptrons (MLPs) constitute an important class of feed-forward Artificial Neural Networks (ANNs), developed to replicate the learning and generalization abilities of humans in an attempt to model the functions of biological neural networks. They have many potential applications in the areas of Artificial Intelligence (AI) and Pattern Recognition (PR). Handwritten numeral recognition is a benchmark PR problem with clearly defined commercial importance and a level of difficulty that makes it challenging, yet not completely intractable: Optical Character Recognition (OCR) of handwritten numerals is central to applications such as reading amounts from bank cheques, extracting numeric data from filled-in forms and interpreting handwritten pin codes from mail pieces. In that domain the MLP has been shown to be useful as a pattern classifier compared to the Nearest Neighbor (NN) classifier used as a suboptimal traditional classifier [13].
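The representational advantage of a hidden layer over a single-layer perceptron can be shown on XOR, a function no single-layer perceptron can compute. The sketch below uses hand-chosen weights for clarity; a real MLP would learn such weights by backpropagation:

```python
def step(z):
    """Threshold activation: 1 if the weighted sum is positive, else 0."""
    return 1 if z > 0 else 0

def mlp_xor(x1, x2):
    # Hidden layer: two threshold units computing OR and AND of the inputs.
    h_or  = step(1.0 * x1 + 1.0 * x2 - 0.5)
    h_and = step(1.0 * x1 + 1.0 * x2 - 1.5)
    # Output layer combines them: "OR and not AND" equals XOR,
    # which is not linearly separable in the original input space.
    return step(1.0 * h_or - 2.0 * h_and - 0.5)

print([mlp_xor(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```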
2.2. Commonly Accepted Performance Evaluation Measures
Classification performance without focusing on a particular class is the most general way of comparing algorithms, and it is the case we focus on in this study, since it does not favor any particular application. The introduction of a new learning problem inevitably concentrates on its domain but often omits a detailed analysis. Thus, the most used empirical measure, accuracy, does not distinguish between the numbers of correct labels of different classes [14]:
 TP = true positives: number of examples predicted positive that are actually positive
 FP = false positives: number of examples predicted positive that are actually negative
 TN = true negatives: number of examples predicted negative that are actually negative
 FN = false negatives: number of examples predicted negative that are actually positive
Accuracy: It refers to the total number of records that are correctly classified by the classifier. Accuracy of a classifier is defined as the percentage of test set tuples that are correctly classified by the model [15]:

Accuracy = ((TP + TN) / (TP + TN + FP + FN)) × 100%
True Positive Rate (TP Rate): It corresponds to the
number of positive examples that have been correctly
predicted by the classification model.
False Positive Rate (FP Rate): It
corresponds to the number of negative examples that
have been wrongly predicted by the classification
model.
Precision: The fraction of retrieved instances that are relevant:

Precision = (TP / (TP + FP)) × 100%
Recall: Refers to the true positive rate, that is, the proportion of positive tuples that were correctly identified. It is also known as the sensitivity of the classifier:

Recall = (TP / (TP + FN)) × 100%

F-measure: The F-measure combines both Precision and Recall as their harmonic mean:

F-measure = (2 × Precision × Recall) / (Precision + Recall)

In total, 400 instances of the dataset were used for training the prediction algorithms, of which 250 carry the label chronic kidney disease (ckd) and 150 the label non chronic kidney disease (notckd) [7, 17]. Table 2 shows the description of the attributes in the Chronic Kidney Disease dataset [17, 18].

Table 2: Description of Attributes in the Chronic Kidney Disease Dataset
ROC Curve: The Receiver Operating Characteristics curve shows both the sensitivity and the specificity of the test. The plot of the TPR (True Positive Rate) against the FPR (False Positive Rate) is the ROC curve. The TPR is the proportion of positive tuples that are correctly labeled by the model, whereas the FPR is the proportion of negative tuples misclassified as positive [15]:

TPR = TP / (TP + FN) and FPR = FP / (FP + TN)
Confusion Matrix: A confusion matrix contains information about actual and predicted classifications done by a classification system. Performance of such systems is commonly evaluated using the data in the matrix. The following table shows the confusion matrix for a two-class classifier [16].
The entries in the confusion matrix have the following meaning in the context of our study:
 TP is the number of correct predictions that an instance is positive,
 FN is the number of incorrect predictions that an instance is negative,
 FP is the number of incorrect predictions that an instance is positive, and
 TN is the number of correct predictions that an instance is negative.
Table 1: Confusion Matrix

                  Predicted Positive   Predicted Negative
Actual Positive   TP                   FN
Actual Negative   FP                   TN
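As a worked example of the measures defined above, the sketch below plugs the confusion-matrix counts reported later in this paper for the C4.5 algorithm (TP = 249, FN = 1, FP = 3, TN = 147 over 400 records) into the formulas:

```python
# C4.5 confusion-matrix counts for the "CKD" class, taken from this study.
TP, FN, FP, TN = 249, 1, 3, 147

accuracy  = (TP + TN) / (TP + TN + FP + FN) * 100   # overall correctness
precision = TP / (TP + FP) * 100                    # retrieved that are relevant
recall    = TP / (TP + FN) * 100                    # TP rate / sensitivity
f_measure = 2 * precision * recall / (precision + recall)
fp_rate   = FP / (FP + TN) * 100                    # negatives labeled positive

print(round(accuracy, 2))   # 99.0 -- matches the 99.00% reported for C4.5
```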
3.2. Performance Evaluation
Naive Bayes, SVM, C4.5 and Multilayer Perceptron were used to predict chronic kidney disease, and the results obtained are shown in Table 3.
With Naive Bayes, as can be seen from Table 3, 380 of the 400 samples were classified correctly; thus, the correct classification ratio is 95.00%.
With SVM, as can be seen from Table 3, 391 of the 400 samples were classified correctly; thus, the correct classification ratio is 97.75%.
III. EXPERIMENTAL STUDY
3.1. Dataset Description
This study makes use of the Chronic Kidney Disease dataset from the UCI Machine Learning Repository, uploaded in 2015. This dataset was collected from the Apollo hospital (Tamilnadu) over a period of nearly 2 months and has 25 attributes, 11 numeric and 14 nominal.
With C4.5, as can be seen from Table 3, 396 of the 400 samples were classified correctly; thus, the correct classification ratio is 99.00%.
With Multilayer Perceptron, as can be seen from Table 3, 399 of the 400 samples were classified correctly; thus, the correct classification ratio is 99.75%.

Table 3: Accuracy Ratio of Application

Table 5: Algorithms Detailed Accuracy by Class
CONCLUSIONS
In this study the performances of the Naive Bayes, SVM, C4.5 and Multilayer Perceptron methods were evaluated in terms of classification accuracy in diagnosing chronic kidney disease. When comparing the performances of the algorithms, it was found that Multilayer Perceptron (99.75%) had the highest accuracy, whereas Naive Bayes (95.00%) had the worst.
A future direction could be testing with data sets from domains other than the standard UCI repository, such as real-life data or data obtained from surveys in different domains.
In this work, 400 records were classified. There are 2 possible classes, namely "CKD" and "NOTCKD". The confusion matrix on the test data gives information about which classes the records belonging to a certain class were assigned to by the algorithm. For instance, with the C4.5 algorithm, 249 of the records belonging to the "CKD" class were assigned to the "CKD" class correctly (true positives for "CKD"). One of the records belonging to the "CKD" class was assigned to the "NOTCKD" class mistakenly (a false positive for "NOTCKD"). 147 of the records belonging to the "NOTCKD" class were assigned to the "NOTCKD" class correctly (true positives for "NOTCKD"). 3 of the records belonging to the "NOTCKD" class were assigned to the "CKD" class mistakenly (false positives for "CKD").
Confusion matrices for each algorithm are presented in Table 4.
TP Rate, FP Rate, Precision and Recall values for each algorithm are given in Table 5.
ACKNOWLEDGMENTS
This study has been supported by the Scientific Research Project Coordinatorship of Selcuk University.
REFERENCES
1. Coşkun, C. and Baykal, A. (2011). Veri Madenciliğinde Sınıflandırma Algoritmalarının Bir Örnek Üzerinde Karşılaştırılması (Comparison of Classification Algorithms in Data Mining on an Example), Akademik Bilişim'11 Konferansı, February 2011, İnönü University, Malatya, TURKEY.
2. Gorunescu, F. (2011). Data Mining: Concepts, Models and Techniques (Vol. 12), p. 1-43. Springer Science & Business Media.
3. Seifert, J. W. (2004). Data mining: An overview, http://www.fas.org/irp/crs/RL31798.pdf, date accessed: January 2016.
4. Langaas, M. (1995). Discrimination and classification. Technical report, Department of Mathematical Sciences, The Norwegian Institute of Technology. http://www.math.ntnu.no/preprint/statistics/ps/S1-1995.
5. Han, J., Pei, J., and Kamber, M. (2011).Data Mining: Concepts
and Techniques. Elsevier.
6. Sinha, P., & Sinha, P. (2015). Comparative Study of Chronic
Kidney Disease Prediction using KNN and SVM. In
International Journal of Engineering Research and
Technology, Vol. 4, No. 12, ESRSA Publications.
7. Kumar M. (2016). Prediction of Chronic Kidney Disease
Using Random Forest Machine Learning Algorithm,
International Journal of Computer Science and Mobile
Computing, Vol.5 Issue.2, p. 24-33.
8. Kohavi, R., and Provost, F. 1998. On Applied Research in
Machine Learning. In Editorial for the Special Issue on
Applications of Machine Learning and the Knowledge
Discovery Process, Columbia University, New York, volume
30.
Table 4: Confusion Matrix
Proceedings of ISER 45th International Conference, Rabat, Morocco, 8th -9th December 2016, ISBN: 978-93-86291-60-8
24
Classification of Chronic Kidney Disease With Most Known Data Mining Methods
9. Blake A.C.L. and Merz C.J. (1998). University of California at
Irvine Repository of Machine Learning Databases,
https://archive.ics.uci.edu/ml/datasets/Chronic_Kidney_Diseas
e, Last Access: 20.10.2016.
10. Jena, L., & Kamila, N. K. (2015). Distributed Data Mining Classification Algorithms for Prediction of Chronic Kidney Disease. International Journal of Emerging Research in Management & Technology, ISSN: 2278-9359, Vol. 4, Issue 11, p. 110-118.
11. Sokolova, M., Japkowicz, N. and Szpakowicz, S. (2006).
Beyond accuracy, F-score and ROC: a family of discriminant
measures for performance evaluation. In Australasian Joint
Conference on Artificial Intelligence (pp. 1015-1021).
Springer Berlin Heidelberg.
12. Patil, T. R. and Sherekar, S. S. (2013). Performance analysis of Naive Bayes and J48 classification algorithm for data classification. International Journal of Computer Science and Applications, 6(2), 256-261.
13. Dimitoglou, G., Adams, J. A. and Jim, C. M. (2012). Comparison of the C4.5 and a Naïve Bayes classifier for the prediction of lung cancer survivability. arXiv preprint arXiv:1206.1121.
14. Dai, W. and Ji, W. (2014). A MapReduce implementation of C4.5 decision tree algorithm. International Journal of Database Theory and Application, 7(1), 49-60.
15. Vapnik, V. N. (1998). Statistical Learning Theory (Vol. 1). New York: Wiley.
16. Brown, M. P., Grundy, W. N., Lin, D., Cristianini, N., Sugnet, C., Ares, M. and Haussler, D. (1999). Support vector machine classification of microarray gene expression data. University of California, Santa Cruz, Technical Report UCSC-CRL-99-09.
17. Roy, K., Chaudhuri, C., Kundu, M., Nasipuri, M. and Basu, D. K. (2005). "Comparison of the Multi Layer Perceptron and the Nearest Neighbor Classifier for Handwritten Numeral Recognition", Journal of Information Science and Engineering, 21, pp. 1247-1259.
18. Maharjan, D. (2013). Performance Analysis of MLP, C4.5 and Naïve Bayes Classification Algorithms Using Income and Iris Datasets.