Download Performance Comparison for C4.5 and K-NN Techniques on

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Cluster analysis wikipedia , lookup

Nonlinear dimensionality reduction wikipedia , lookup

K-nearest neighbors algorithm wikipedia , lookup

Transcript
ISSN 2321 3361 © 2016 IJESC
Research Article
Volume 6 Issue No. 9
Performance Comparison for C4.5 and K-NN Techniques on
Diabetics in Heart Problem
Viswanathan K 1 , Dr. Mayilvahanam P 2 , R. Christy PushpaLeela 3
Technical Arch itect 1 , Head 2 , Assistant Professor3
Cognizant Technology Solutions, Chennai, India1 , Vels Un iversity Chennai, India2 , Women’s Christian Co llege, Chennai, India 3
Abstract:
The aim of th is research paper is to study and discuss the various classification algorith ms applied on different kinds of da tasets in
terms of medical and compare the dataset performance. The classificat ion algorith ms with maximu m accuracies o n various kinds of
med ical datasets are taken for performance analysis. The result of the performance analysis shows the most frequently used
algorith ms on particular med ical dataset.This research paper also discusses comparison of C4.5 and K-NN. This research paper
summarizes various reviewsand technical articles which focus on the current research on Medical diagnosis.
Keywords: Classificat ion, Medical Diagnosis, C4.5,K-NN, Dataset andUCI, W EKA tool, Data-preprocessing.
I.
INTRODUCTION
Data mining is the process of discovering actionable
informat ion fro m large sets of data. Data mining, the ext raction
of hidden predictive information fro m large databases. Data
mining is a powerful new technology with great potential to
help companies focus on the most important information in
their data warehouses. Data min ing uses mathematical analysis
to apply the patterns and movements that exist in data. These
patterns cannot be discovered by traditional data explorat ion
because the relationships are too complex or because there is
too much data. This process can be defined by using the
following six basic steps:
1.
Problem Defin ition
2.
Data prepare
3.
Exp loring data
4.
Apply the model
5.
Walk around & authenticating models
6.
Deploying & updating models
Data-Preprocessing:
Is a data mining technique that involves transforming raw data
into a reasonable format. Real-world data is frequently
incomp lete, inconsistent and lacking in certain actions and is
likely to contain many erro rs. Data preprocessing is a supported
method of resolving these issue called data-preprocessing.
International Journal of Engineering Science and Computing, September 2016
Fig1: Data Mini ng – Process Flow
II.
DATA-CLASSIFICATION
Data classification is one of the data min ing methodologies
used to forecast and classify the predetermined data for the
specific class. There are different classifications methods and
some of the methods are below.
1.
Support Vector Machine (for linear and nonlinear dataSVM)
2.
K-nearest neighbor classifier(KNN)
3.
C4.5
4.
Bayesian classification (Statistical classifier)
5.
Decision tree
6.
Rule based classification (IF THEN Ru le)
7.
Classification - Association Rule
3095
http://ijesc.org/
The k-medoid algorith m is similar to k-means except that the
centroids have to belong to the data set being clustered. Fuzzy
c-means is also similar, except that it co mputes fuzzy
membership functions for each clusters rather than a hard one.
Fig 2: Data Mi ning- Classification Method
2.11SVM–Support Vector Machine
SVM is a classificat ion method.SVM supports data min ing and
test mining. It was developed during the year of 1992. SVM
Generally is capable to delivered h igh performance in terms of
classification accuracy compare to other classification.
2.2C4.5
C4.5 algorithm is a greedy algorithm and was developed by
Ross Quinlan and its used for the induction of decision
trees.C4.5 is often referred to as a statistical classifier. The
decision tree algorithm C4.5 is developed from ID3 in the
following ways
2.3K-NN
It is the nearest neighbour algorithm. The k-nearest neighbour’s
algorith m is a technique for classifying objects based on the
next train ing data in the feature space. The algorithm operates
on a set of d-dimensional vectors, D = {xi | i = 1. . . N}, where
xi ∈ k d denotes the ith data point. Selection k points in k d as
the initial k cluster representatives . Algorithm iterates between
two steps till junction.
Step 1: Data Assignment each data point is assign to its
adjoining centroid, with ties broken arbitrarily. Th is results in a
partitioning of the data.
Step 2: Relocation of “means”. Each group representative is
relocating to the center (mean) of all data points assign to it. If
the data points come with a possibility measure (Weights), then
the relocation is to the expectations (weighted mean) of the data
partitions.
“Kernelize” k-means though margins between clusters are still
linear in the embedded high-dimensional space, they can
become nonlinear when projected back to the original space,
thus allowing kernel k-means to deal with more co mp lex
clusters.
International Journal of Engineering Science and Computing, September 2016
III.
Heart Disease
Diagnosis of heart disease is a significant and tedious task in
med icine. Heart disease occurs when the arteries which
normally provide o xygen and blood to the heart choked
completely. Various types of heart diseases are,
 Heart failure
 Card io myopathy
 Coronary heart disease
 Hypertensive heart disease
 Inflammatory heart disease
Heart disease Co mmon risk factors are below
 Age
 Gender
 High blood pressure
 Smoking Habits
 Abnormal b lood lipids
 Fatness
 Diabetes
 Family antiquity
 No Physical task
Data Set (Heart Patients Data Set)
IV.
The first step is the pre-processing to remove the inconsistent
data. Apply the algorithm is used to classify the data.
Sl no
Attri bute
Data-Type
1
Age
Nu meric
2
Gender
No minal
3
BP Systolic
No minal
4
Diabetic
Nu meric
5
BP Dialic
Nu meric
6
Height
Nu meric
7
Weight
Nu meric
8
BMI
Nu meric
9
Hypertension
No minal
10
Rural
No minal
11
Urban
No minal
12
Disease
Table 1: Data Set
No minal
V.WEKA TOOL
WEKA is a data mining tool. It provides the facility to classify
the data through various algorithms. WEKA stands for
Waikato Environ ment for Kno wledge Learning and it was
developed by the University of Waikato, New Zealand
Software.
3096
http://ijesc.org/
Data Set- Heart
Problem
84.5
83.2
90
77.3476.3
73.8
80
70
60
Heart
Problem
Fig4: Performance comparison bar -chart
Fig3: WEKA tools
VI. Cl assification Matrix
The basic phenomenon used to classify the Heart disease
classification using classifier is its performance and accuracy.
The performance of a chosen classifier is validated based on
error rate and co mputation time. The classificat ion accuracy is
predicted in terms of Sensitivity and Specificity. It compares
the actual values in the test dataset with the predicted values in
the trained model. In this example, the test dataset contained
208 patients with heart disease and 246 patients without heart
disease.
Predicted
Classified as a
Healthy(0)
Classified as a not
Healthy(1)
Actual Healthy (0)
TP
FN
Actual not Healthy
(1)
FP
TN
VII. Conclusion
Main objective of med ical data mining algorith m is to get best
algorith ms that define given data from mu ltip le features. C4.5
is the best classifier for classificat ion and it shows better results
in operational. Accuracy provided by C4.5 in binary
classification is more reasonable. We have observed that C4.5
accuracy is good.
VIII.REFER ENCES
[1] International Journal of Co mputer Science & Information
Technology
(IJCSIT)
Vo l
3,
No
4,
August
2011”PERFORMANCE
ANA LYSIS
OF
VA RIOUS
DATAMINING CLA SSIFICATION TECHNIQUES ON
HEA LTHCA RE DATA” Shelly Gupta1, Dharminder Ku mar2
and Anand Sharma3 1AIM & ACT, Banasthali Un iversity,
Banasthali, India shelly.gupta24@g mail.co m 2Depart ment of
CSE, GJUS&T, Hisar, Indiadr_dk_ ku [email protected]
3Depart ment
of
CSE,
GJUS&T,
Hisar,
Indiaandz24@g mail.co m.
[2] International Journal of Innovative Technology and
Exp loring Engineering (IJITEE) ISSN: 2278-3075, Volu me -2,
Issue-6, May 2013 A Study on WEKA Tool for Data
Preprocessing, Classification and Clustering SwastiSinghal,
Monika Jena.
Table 2: confusion matrix
Where
 TP- True Positi ve
 TN - true Neg ati ve
 FP - False Positi ve
 FN - false Negati ve
 CT- Computing Ti me
For measuring accuracy rate and Error Rate, the following
mathematical model is used.
Accuracy = TP+TN/ TP+FP+ TN +FN
CVError Rate = FP + FN / TP+FP+ TN +FN
Performance Comparison (C 4.5 and K-NN)
Dataset
SMO
Naï ve
Bayes
C4.5
J48
kNN
Heart
Problem
77.34
76.3
84.5
73.8
83.2
[3] PayalDhakate, SuvarnaPatil,
K.
Rajeswari,
Dr.
V.Vaithiyananthan, DeepaAbin, - Preprocessing
and
Classification in WEKA using different classifiers‖, Journal of
Engineering Research and Applications www.ijera.co m ISSN :
2248-9622, Vol. 4, Issue 8( Version 1), August 2014.
[4] Remco R. Bouckaert, Eibe Frank, Mark A. HallGeoffrey
Holmes, Bernhard Pfahringer, Peter Reutemann, Ian H. Witten,
―WEKA—Experiences with a Java Open-Source Pro ject‖,
Journal of Machine Learn ing Research, November 2010.
[5] Performance Analysis of various Data mining
classification (C4.5 Vs SVM) Techniques on Diabetics in Heart
Problem.
Viswanathan
K,
Dr.Mayilvahanan
K,
R.ChristyPushpaleela.
[6] Thair Nu Phyu,”Survey of Classificat ion Techniques in
Data Mining”IM ECS 2009
Table 3: Performance Comparison
International Journal of Engineering Science and Computing, September 2016
3097
http://ijesc.org/
[7] Fayyad, Piatetsky-Shapiro, Smyth and Uthurusamy”,
Advances in Knowledge Discovery and Data Mining”,
(Chapter 1), AAAI/MIT Press 1996.
[8] Witten,I. and Eibe,F. Data mining practical mach ine
learning tools and techniques.2nded,Sanfrancisco:Morgan
Kaufmann series in data management systems.,2005.
[9] PardhaRepalli, “Pred iction on Diabetes Using Datamining
Approach”.
[10] Joseph L. Breault., “Data M ining Diabetic Databases:Are
Rough Sets aUseful Addition”.
[11] G. Parthiban, A. Rajesh, S.K.Srivats a, “Diagnosis of
Heart Disease for Diabetic Patients using Naive Bayes Method
“, International Journal ofCo mputer Applications (0975 – 8887)
Vo lu me 24– No.3, June 2011.
[12] P. Pad maja, “Characteristic evaluation of diabetes data
using clustering techniques”, IJCSNS International Journal of
Co mputer Science and NetworkSecurity, VOL.8 No.11,
November 2008.
[13] P.Yasodha, M. Kannan, ―Analysis of a Population of
Diabetic Pat ientsDatabases in Weka Tool‖, Research Vol 2,
Issue 5, May-2011.
[14] VikasChaurasia, Saurabh Pal, ―Data M ining Approach to
Detect Heart Dieses‖, International Journal of Advanced
Co mputer Science and Information Technology Vol. 2.
[15] D.Lavanya and Dr.K.Usha Rani, ―Ensemble decision
tree classifier for breast cancer data. International Journal of
Information Technology Convergence and Services, Vol.2,
No.1. February 2011.
International Journal of Engineering Science and Computing, September 2016
3098
http://ijesc.org/