Int. J. Computational Intelligence Studies, Vol. X, No. Y, XXXX
Effective framework for prediction
of disease outcome using medical
datasets: clustering and classification
B.M. Patil, Ramesh C. Joshi and
Durga Toshniwal*
Department of Electronics and Computer Engineering,
Indian Institute of Technology Roorkee,
Roorkee 247667, Uttarakhand, India
Email: [email protected]
Email: [email protected]
Email: [email protected]
*Corresponding author
Abstract: The method of processing two algorithms within a single workflow,
and hence the combined method, is called hybrid computing. We propose
a data mining framework comprising two stages, namely clustering and
classification. The first stage applies the k-means algorithm to the data and generates
two clusters, namely cluster-0 and cluster-1. Instances in cluster-0 do not have
disease symptoms and cluster-1 consists of instances with disease symptoms.
The validity of the grouping is then verified by referring to the
class labels in the original datasets. Incorrectly classified instances
are removed and the remaining instances are used to build the classifier using the C4.5
decision-tree algorithm with the k-fold cross validation method. The framework
was tested using eight datasets from the machine learning repository of the
UCI. The proposed framework was evaluated for accuracy, sensitivity and
specificity measures. Our framework obtained promising classification accuracy
as compared to other methods found in the literature.
Keywords: clustering; classification; effective framework; hybrid computing;
disease outcome; computational intelligence.
Reference to this paper should be made as follows: Patil, B.M., Joshi, R.C. and
Toshniwal, D. (20XX) ‘Effective framework for prediction of disease outcome
using medical datasets: clustering and classification’, Int. J. Computational
Intelligence Studies, Vol. X, No. Y, pp.xx–xx.
Biographical notes: B.M. Patil is currently a PhD student in Electronics
and Computer Engineering at the Indian Institute of Technology, Roorkee, India.
He received his Bachelor's degree in Computer Engineering from Gulbarga
University in 1993 and his MTech in Computer Science from Mysore University
in 1999. His current research interests include data mining, medical decision
support systems, artificial intelligence and artificial neural networks.
Ramesh C. Joshi is currently a Professor in the Department of Electronics and
Computer Engineering, IIT, Roorkee, India. He received ME and PhD degrees
in Electronics and Computer Engineering from University of Roorkee in 1970
and 1980, respectively. His research interests include Parallel & Distributed
Processing, AI, Databases and Information Security. He has guided about
25 PhD theses, 150 MTech dissertations and 115 ME/BE projects, and has
completed four sponsored projects as Principal Investigator. He has published about
150 papers in national/international journals and conferences. He received a Gold Medal
from the Institution of Engineers (India) in 1978.
Copyright © 200X Inderscience Enterprises Ltd.
Durga Toshniwal is working as an Assistant Professor at the Department of
Electronics and Computer Engineering in IIT, Roorkee. She completed her
PhD in Computer Science Engineering from IIT, Roorkee, India, in 2005. She
obtained BE and MTech from NIT, Kurukshetra, India. She has authored
several papers in various international journals and international and national
conferences of repute. She was awarded IBM Faculty Award 2008 and IBM
Shared Research University Award 2009 for her research contributions. Her
areas of research interest include time-series data mining, privacy preserving
data mining, applying soft computing techniques in mining applications,
mining data streams and unstructured text mining.
1 Introduction
Over the last few years, data mining has been increasingly applied to solve problems in
the medical domain. Data mining has been applied with high rate of success in various
fields like marketing, banking, customer relationship management, engineering and
various other areas of science. However, its application to the analysis of medical data is
comparatively limited. This is particularly true of applications that predict
disease outcomes. The goal of researchers working on prediction of disease outcome is to
develop a model that can use patient-related data, predict the significant result and
thereby support decision-making. Data mining methods are applied to build the classifier
model for prognosis and diagnosis. There is a critical need to develop medical decision
support systems, which can assist practitioners in their diagnostic process.
The research area of data mining and Knowledge Discovery in Databases (KDD) has
evolved from a combination of statistics, machine learning, artificial intelligence, pattern
recognition, expert systems, databases and information retrieval (Fayyad et al., 1996). Hand et al.
(2001) defined data mining as the ‘analysis of (often large) observational datasets to
find unsuspected relationships to summarize the data in novel ways that are both
understandable and useful to the data owner’. Han and Kamber (2006) defined data
mining as an extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from a huge amount of data. Data mining is
explained in details in Jain and Chen (2003).
Data mining problems are often solved using diverse approaches taken from computer
science (multi-dimensional databases, machine learning and soft computing) and
from statistics (clustering, classification and regression techniques). Classification using
decision-trees can reveal useful patterns from observational data on brain injury
(Andrews et al., 2002). Data mining techniques were also used in improving birth
outcomes (Goodwin et al., 1997) and automated detection of hereditary syndromes
(Evans et al., 1997). Roychowdhury et al. (2004) proposed GA-fuzzy-based approach for
diagnosis of the diseases, namely pneumonia and jaundice. Patil et al. (2010) used four
data mining algorithms for prediction of survival of burn patients. Chattopadhyay et al.
(2008) proposed a novel attempt to develop fuzzy logic-based Expert Systems (ESs) that
are able to reason like doctors for screening adult psychosis. Zharkova and Jain (2007)
presented a method for classification of medical images that enables extraction of
quantified information in terms of texture, intensity and shape, enabling improved
diagnosis of human organs.
The method of using two different computational algorithms working together, either
sharing a task or operating in cascade one after the other, is called hybrid computing. Our proposed
framework makes use of two stages, namely clustering and classification. The first stage
employs the k-means algorithm on the data and generates two clusters, namely cluster-0 and
cluster-1. Cluster-0 consists of those instances which do not have disease symptoms
and cluster-1 consists of instances with disease symptoms (here k = 2 is based on the
number of outcomes). The validity of the grouping is then verified by referring
to the class labels in the original datasets. Incorrectly classified instances
are removed and the remaining instances are used to build the classifier using the C4.5
decision-tree algorithm with the k-fold cross validation method. The framework was tested using eight
datasets from the machine learning repository of the University of California at Irvine
(UCI) (Newman et al., 1998).
The rest of the paper is organised as follows: in Section 2, we briefly discuss works
related to classification of medical data. We propose our framework in Section 3. The
k-means and decision-tree algorithms are explained. The implementation and analysis are
given in Section 4. Various measures used for performance analysis are also defined.
The conclusion follows in Section 5.
2 Related work
Heart disease is a broad term that can refer to any condition that affects the heart. A large
number of people die every year due to heart disease all over the world, and it is one of
the leading causes of death in the USA (Arialdi et al.,
2007). There are many different forms of heart disease. The most common is
coronary artery disease caused by narrowing or blockage of the coronary arteries. While
many people with heart disease have symptoms, such as chest pain and fatigue, as many
as 50% have no symptoms until a heart attack occurs (http://www.ivillage.com/health,
accessed on September 2009). However, correct diagnosis at an early stage followed by
an appropriate treatment can result in significant life saving (Yan et al., 2003).
Different classification algorithms are used to detect the presence of heart disease in
the dataset from the UCI machine learning repository. Serpen et al. (1997) proposed a
probabilistic potential function neural network classifier algorithm for the Cleveland heart
disease data. The classification accuracy achieved by this method was 58.28%. Tang et al.
(2004) developed a new model called Granular Support Vector Machines (GSVM) for
data classification problems. It systematically combines two theories, namely statistical
learning and granular computing. It works by building a sequence of information
granules and then building a Support Vector Machine (SVM) in each information granule.
SVM and GSVM obtained accuracies of 83.04% and 84.04%, respectively.
Polat et al. (2007a) proposed a method that uses Artificial Immune Recognition
System (AIRS) and obtained 87% classification accuracy. Polat and Gunes (2007a)
suggested a new approach, combining feature selection, fuzzy weighted pre-processing
and AIRS classifier to classify the heart disease dataset and obtained a classification
accuracy of 92.59%. Humar and Novruz (2008) developed a hybrid system, which used
Artificial Neural Network (ANN) and Fuzzy Neural Network (FNN) and achieved an
accuracy of 86.8%. Das et al. (2009) proposed a system called Neural Networks
Ensemble, which creates new models by combining the posterior probabilities or the
predicted values from multiple predecessor models using SAS data miner tool and
achieved an accuracy of 89.01%.
Diagnosis of cardiac disorders can be based on SPECT (Single Photon Emission Computed
Tomography) images. Bakirci and Yildirim (2004) used a feed-forward ANN and achieved
an accuracy of 90.04%. Polat et al. (2007c) proposed an ensemble classifier
system based on different feature subsets and AIRS classifier to detect the cardiac
disorders from SPECT images and obtained accuracy of 97.74% by dividing the data in
approximately equal size for training and testing purposes.
The liver is a vital organ located in the upper right-hand side of the abdomen.
It performs numerous functions for the body: converting nutrients derived from food into
essential blood components, storing vitamins and minerals, regulating blood clotting,
producing proteins and enzymes, maintaining hormone balances, and metabolising and
detoxifying substances that would otherwise be harmful to the body
(http://www.medlineplus.gov). Cheung et al. (2001) used a number of classification algorithms on the
BUPA liver disorder dataset. They obtained an accuracy of 65.50% using C4.5
algorithm, 63.39% using Naive Bayes classifier, 61.83% using Bayesian Network
with Naive Dependence (BNND) and 61.42% using Bayesian Network with Naive
Dependence and Feature Selection (BNNF) classifier. Polat et al. (2007b) proposed a
Fuzzy-AIRS method that classified the dataset with an accuracy of 83.38%, and
compared the result with the AIRS classification algorithm, which obtained a
classification accuracy of 81%.
Breast cancer is the most common cancer in women today in many countries. It is
considered to be the second leading cause of cancer deaths among women between
40 and 55 years of age (Delen et al., 2005). Cancer is a group of diseases in which cells
in the body grow, change and multiply in uncontrolled fashion. Breast cancer refers to
the erratic growth and propagation of cells that originate in the breast tissue. A group of
rapidly dividing cells may form a lump or mass of extra tissue called tumours. Tumours
can either be cancerous (malignant) or non-cancerous (benign). Malignant tumours
penetrate and destroy healthy body tissues. A group of cells within a tumour may also
break away and spread to other parts of the body (http://www.imaginis.com/breasthealth, accessed on 15 April 2010). A lot of work has been done on WBCD (Wisconsin
Breast Cancer Dataset) in the literature and high classification accuracy has been
achieved. Quinlan (1996) used C4.5 classification algorithm and obtained accuracy of
94.74% using tenfold cross-validation. Hamilton et al. (1996) proposed a method RIAC
(Rule Induction Algorithm based on Approximate Classification) and achieved accuracy
of 96% with tenfold cross validation. Pena-Reyes and Sipper (1999) proposed fuzzy-GA
method and obtained a classification accuracy of 97.36%. Polat and Gunes (2007b) used
least square SVM and an accuracy of 98.53% was obtained.
Hepatitis means an inflammation of the liver. It can be caused by many things,
including a bacterial infection, liver injury caused by a toxin (poison) and even an attack
on the liver by the body’s own immune system (http://www.netdoctor.co.uk/diseases/
facts/hepatitis.htm). A number of classification methods were proposed and achieved
high classification accuracies on the dataset taken from UCI machine learning repository.
Bascil and Temurtas (2009) used a multi-layer neural network with the tenfold cross-validation technique and compared their results with those of previous studies on the
same dataset (www.is.umk.pl/projects/datasets.html, accessed on 15 April 2009).
However, none of the above-mentioned techniques used proper validation of
class labels, which affects predictive performance. Our study proposes a framework
which uses the k-means clustering algorithm to validate the chosen class labels of
given data. The decision-tree classification algorithm is then applied to the resulting pattern
to build the final classifier model using the k-fold cross validation method. This model is
evaluated on various datasets taken from the UCI machine learning repository.
3 Proposed method
The classification framework for the medical datasets is shown in Figure 1. It consists of
two stages: clustering and classification. In the first stage, we apply a clustering method
for pattern extraction to validate the class label associated with the given data and delete
the misclassified data. In the second stage, a decision-tree is applied for classification using
k-fold cross validation (k = 10).
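As an illustrative sketch of this workflow (not the authors' implementation), the two stages can be approximated in scikit-learn, assuming a feature matrix X and binary 0/1 class labels y:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

def two_stage_framework(X, y, n_folds=10, seed=0):
    """Stage 1: k-means (k = 2) groups the data; instances whose cluster
    disagrees with their class label are removed. Stage 2: a decision
    tree is evaluated with k-fold cross validation on what remains."""
    km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(X)
    labels = km.labels_
    # Cluster ids are arbitrary: flip them if they disagree with the
    # majority of the class labels.
    if np.mean(labels == y) < 0.5:
        labels = 1 - labels
    keep = labels == y                        # correctly clustered instances
    tree = DecisionTreeClassifier(criterion="entropy", random_state=seed)
    scores = cross_val_score(tree, X[keep], y[keep], cv=n_folds)
    return scores.mean(), int(keep.sum())
```

Note that scikit-learn's tree is CART with an entropy criterion, which only approximates C4.5's gain-ratio splitting and pruning.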
Figure 1  Effective framework for prediction of disease outcome
3.1 Datasets
In this study, we have taken the medical datasets of eight diseases from the UCI machine
learning repository (Newman et al., 1998), details of which are given hereafter.
3.1.1 Cleveland heart disease dataset
The dataset contains a total number of 303 samples and has 13 attributes like age,
sex, chest pain type (cp), resting blood pressure (trestbps, mm/dl), serum cholesterol
(chol, mg/dl), fasting blood sugar (fbs, >120 mg/dl), resting electrocardiographic results
(restecg), maximum heart rate achieved (thalach), exercise induced angina (exang), ST
depression induced by exercise relative to rest (oldpeak), the slope of the peak exercise
ST segment (slope), ca: number of major vessels (0–3) coloured by fluoroscopy (ca) and
the heart status (thal, 3 = normal defect; 6 = fixed defect; 7 = reversible defect). The
dataset is divided into five classes, 0 corresponding to the absence of any disease and
1, 2, 3, 4 corresponding to four different types of disease. Many researchers have used
this dataset to differentiate between the absence (0) and presence (1, 2, 3 or 4) of a
disease. The two classes are coded as ‘0’ for absence and ‘1’ for presence.
3.1.2 Statlog heart disease dataset
The Statlog heart disease dataset contains 270 samples, of which 120 belong to patients
with heart problems while the remaining 150 samples are of healthy persons. The samples
taken from patients and healthy persons include 13 attributes which are given below:
1 age
2 sex
3 chest pain type (four values)
4 resting blood pressure
5 serum cholesterol (mg/dl)
6 fasting blood sugar (>120 mg/dl)
7 resting electrocardiographic results (values 0, 1, 2)
8 maximum heart rate achieved
9 exercise induced angina
10 oldpeak = ST depression induced by exercise relative to rest
11 the slope of the peak exercise ST segment
12 ca: number of major vessels (0–3) coloured by fluoroscopy
13 thal: 3 = normal defect; 6 = fixed defect; 7 = reversible defect.
The class information is included in the dataset as 1 and 2 regarding absence and
presence of disease, respectively.
3.1.3 Single photon emission computed tomography images dataset
The SPECT images dataset is concerned with the diagnosis of cardiac disorders. The
dataset describes the diagnosis of cardiac SPECT images. Each patient is classified
into one of two categories: normal and abnormal. The database of 267 SPECT image sets
(patients) was processed to extract features that summarise the original SPECT images.
A pattern in the SPECT image dataset is represented by 22 binary features that take the
value 0 or 1. There are 55 normal (0) and 212 abnormal (1) subjects in the SPECT image
dataset.
3.1.4 BUPA liver disorder dataset
The liver disorders dataset prepared by the BUPA Medical Research Company includes
345 samples consisting of six attributes and two classes. Two hundred samples are of
healthy persons, while the remaining 145 samples belong to patients. Each
sample has six attributes, all of which are real valued, and are as follows:
1 mean corpuscular volume (mcv)
2 alkaline phosphatase (alkphos)
3 alanine aminotransferase (sgpt)
4 aspartate aminotransferase (sgot)
5 gamma-glutamyl transpeptidase (gammagt)
6 number of half-pint equivalents of alcoholic beverages drunk per day (drinks).
3.1.5 Wisconsin breast cancer dataset-1
WBCD-1 consists of 699 samples that were collected by Dr W.H. Wolberg at the
University of Wisconsin–Madison Hospitals taken from needle aspirates from human
breast cancer tissue. The WBCD consists of nine features obtained from fine needle
aspirates, each of which is ultimately represented as an integer value between 1 and 10.
The measured variables are the following:
1 Clump Thickness (x1)
2 Uniformity of Cell Size (x2)
3 Uniformity of Cell Shape (x3)
4 Marginal Adhesion (x4)
5 Single Epithelial Cell Size (x5)
6 Bare Nuclei (x6)
7 Bland Chromatin (x7)
8 Normal Nucleoli (x8)
9 Mitoses (x9).
The dataset consists of 699 samples, out of which 458 belong to the benign group and the
remaining 241 are of malignant nature.
3.1.6 Wisconsin breast cancer dataset-2
WBCD-2 was obtained from a repository having 32 attributes and 569 instances, out of
which 357 instances belong to benign class and the remaining 212 are of malignant class.
3.1.7 Wisconsin prognostic breast cancer dataset
The Wisconsin Prognostic Breast Cancer (WPBC) dataset concerns 198 patients
and a binary decision class: non-recurrent events totalling 151 instances and
recurrent events totalling 47 instances. The test diagnostic accuracy, the main
performance measure of the classifier, was about 74.24%, in line with the
performance of other well-known machine learning techniques.
3.1.8 Hepatitis dataset
The dataset which consists of hepatitis disease measurements contains two classes and
155 samples. The class distribution is Class 1: die (32) and Class 2: live (123). All samples
have 19 features. These features are: age, sex, steroid, antivirals, fatigue, malaise, anorexia,
liver big, liver firm, spleen palpable, spiders, ascites, varices, bilirubin, alk phosphate,
sgot, albumin, pro time and histology.
3.2 Clustering
Our motivation in this paper is based on the assumption (Han and Kamber, 2006) that
instances with similar attribute values are more likely to have the same class label. Similarity
is measured based on Euclidean distance. Therefore, the misclassified instances after
clustering are deleted and the correctly classified instances are considered for further
classification using a decision-tree classifier. Many researchers have used clustering
methods on unlabelled data to assign class labels, and later used supervised
learning methods for classification (Dhillon et al., 2003; Li et al., 2004; Kyriakopoulou
and Kalamboukis, 2007).
In our study, we use the k-means clustering algorithm to validate the
class labels associated with the dataset. The reason for choosing k-means is that Lange
et al. (2004) showed that the validation result obtained by k-means clustering is better
than that of other methods for k = 2 or 3. We also tried clustering with the k-medoid
algorithm, but its misclassification rate was 50%. We retained the result of k-means
because its misclassification rate was lower. The k-means algorithm takes an input
parameter, k, and partitions a set of N points into k clusters, so that the resulting
intracluster similarity is high but the intercluster similarity is low.
The steps in the k-means method are the following (Shekhar et al., 2007):
1 Select k random instances from the training data subset as the centroids of the
clusters C1, C2, …, Ck.
2 For each training instance X:
a Compute the Euclidean distance D(Ci, X), i = 1 … k, and find the cluster Cq
that is closest to X.
b Assign X to Cq and update the centroid of Cq. (The centroid of a cluster is the
arithmetic mean of the instances in the cluster.)
3 Repeat Step 2 until the centroids of clusters C1, C2, …, Ck stabilise in terms of the
mean-squared error criterion.
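The steps above can be sketched in plain NumPy (a toy illustration, not the implementation used in the study):

```python
import numpy as np

def kmeans(X, k=2, max_iter=100, seed=0):
    """k-means as listed: random instances as initial centroids, nearest-
    centroid assignment by Euclidean distance, centroid update by the
    arithmetic mean, repeated until the centroids stabilise."""
    rng = np.random.default_rng(seed)
    # Step 1: select k random instances as centroids C1, ..., Ck
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Step 2a: Euclidean distance D(Ci, X) to every centroid
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        # Step 2b: assign each instance to its closest cluster Cq
        assign = dist.argmin(axis=1)
        # Update each centroid as the mean of its member instances
        new_centroids = np.array([X[assign == i].mean(axis=0)
                                  if np.any(assign == i) else centroids[i]
                                  for i in range(k)])
        # Step 3: stop once the centroids no longer move
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return assign, centroids
```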
3.3 Decision-tree algorithm
Decision-tree is among the most popular classification methods. The rules produced
by decision-tree are easy to interpret and understand, and hence can greatly help in
appreciating the underlying mechanisms that separate samples in different classes.
Among many decision-tree based classifiers, C4.5 is a well-established and widely used
algorithm. C4.5 is a supervised learning classification algorithm used to construct
decision-trees from the data (Quinlan, 1993). It uses a divide-and-conquer approach to
growing decision trees (Benjamin et al., 2000; Ture et al., 2009).
Let the attributes be denoted by A = {a1, a2, …, am}, cases be represented by
D = {d1, d2, …, dn}, and classes be indicated by C = {c1, c2, …, ck}. For a set of cases
D, a test Ti is a split of D based on attribute ai. It splits D into mutually exclusive subsets
D1, D2, …, Dp. These subsets of cases are single-class collections of cases. If a test T is
chosen, the decision-tree for D consists of a node identifying the test T, and one branch
for each possible subset Di. For each subset Di, a new test is then chosen for further splitting.
If Di satisfies a stopping criterion, the tree for Di is a leaf associated with the most
frequent class in Di. One reason for stopping is that all cases in Di belong to one class. The C4.5
decision-tree algorithm uses a modified splitting criterion called gain ratio. It uses
arg max(gain(D, T)) or arg max(gain ratio(D, T)) to choose tests for splits.
Info(D) = − Σ_{i=1}^{k} p(ci, D) log2(p(ci, D))                        (1)

Split(D, T) = − Σ_{i=1}^{p} (|Di| / |D|) log2(|Di| / |D|)              (2)

Gain(D, T) = Info(D) − Σ_{i=1}^{p} (|Di| / |D|) Info(Di)               (3)

Gain Ratio(D, T) = Gain(D, T) / Split(D, T)                            (4)

where p(ci, D) denotes the proportion of cases in D that belong to the ith class.
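Equations (1)-(4) can be checked numerically with a small sketch (class labels given as a list, and a test T given as a partition of those labels; illustrative code, not from the paper):

```python
import math
from collections import Counter

def info(labels):
    """Equation (1): entropy of the class distribution over cases D."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def split_info(splits, n):
    """Equation (2): entropy of the subset sizes produced by test T."""
    return -sum((len(s) / n) * math.log2(len(s) / n) for s in splits)

def gain(labels, splits):
    """Equation (3): information gain of test T on cases D."""
    n = len(labels)
    return info(labels) - sum((len(s) / n) * info(s) for s in splits)

def gain_ratio(labels, splits):
    """Equation (4): gain normalised by the split information."""
    return gain(labels, splits) / split_info(splits, len(labels))
```

For a perfectly separating binary split of four cases, the info, gain and gain ratio all equal 1 bit.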
C4.5 selects the test that maximises gain ratio value (Benjamin et al., 2000). Once the
initial decision-tree is constructed, a pruning procedure is initiated to decrease the overall
tree size and decrease the estimated error rate of the tree (Quinlan, 1993). C4.5 uses the
information gain ratio criterion to determine the most discriminatory feature at each step
of its decision-tree induction process. In each round of selection, the information gain
ratio criterion chooses, from those features with an average-or-better information gain,
the feature that maximises the ratio of its gain divided by its entropy. C4.5 stops
recursively building sub-trees when:
1 an obtained data subset contains samples of only one class (then the leaf node is
labelled by this class), or
2 there is no available feature (then the leaf node is labelled by the majority class), or
3 the number of samples in the obtained subset is less than a specified threshold
(then the leaf node is labelled by the majority class) (Quinlan, 1993).
4 Experimental results
In the proposed framework, the first stage involves k-means clustering. It is used for
pattern extraction as given below.
4.1 k-means implementation
The first stage employs the k-means algorithm on the data and generates two clusters,
namely cluster-0 and cluster-1. Cluster-0 consists of those instances which do not have
disease symptoms and cluster-1 consists of instances with disease symptoms (for k = 2,
based on the number of outcomes). The validity of the grouping is then verified
by referring to the class labels in the original datasets. If they are found to
be the same, then the instance is correctly classified. Incorrectly classified instances are
removed. The data were randomly re-sampled and this process was repeated ten
times. Among the ten experiments, the one with the minimum number of misclassified
instances was taken for validation of class labels (this was done in order to include the
maximum number of instances for building the classifier). From the chosen dataset, we
removed the misclassified instances. It is known that a clustering process clusters the
data and assigns cluster labels based on the intrinsic properties of the data without
considering the actual class labels. We discuss one such case pertaining to the Cleveland
heart disease dataset.
In the case of the Cleveland heart disease data, 61 instances were misclassified. These
were deleted, and the 242 correctly classified instances were retained and used for
classification. The effect of this deletion resulted in building the decision-tree with
around 10–12 fewer instances in each leaf node as compared to the whole
dataset. This reduction in instances did not introduce any substantial loss in the
classification capability of the classifier. Polat et al. (2007a) deleted six instances due to
missing values and 27 instances due to disputed values, and used 270 instances for
analysis. In our case, we used the 242 instances selected using the clustering method
mentioned above. A similar procedure was applied to all the other seven datasets, and the
misclassification rate of each dataset is given in Table 1.
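The selection step described above — ten random re-samplings, keeping the run with the fewest cluster/label mismatches — can be sketched as follows (illustrative code with assumed arrays X and binary labels y; not the authors' implementation):

```python
import numpy as np
from sklearn.cluster import KMeans

def validate_labels(X, y, n_runs=10):
    """Run k-means (k = 2) on ten random re-orderings of the data and
    return the boolean keep-mask of the run with the fewest instances
    whose cluster disagrees with their original class label."""
    best_mask, best_errors = None, None
    for run in range(n_runs):
        order = np.random.default_rng(run).permutation(len(X))
        km = KMeans(n_clusters=2, n_init=10, random_state=run).fit(X[order])
        labels, yo = km.labels_, y[order]
        if np.mean(labels == yo) < 0.5:     # align arbitrary cluster ids
            labels = 1 - labels
        mask = np.zeros(len(X), dtype=bool)
        mask[order] = labels == yo          # map back to the original order
        errors = len(X) - int(mask.sum())
        if best_errors is None or errors < best_errors:
            best_mask, best_errors = mask, errors
    return best_mask, best_errors
```

The misclassified instances are then dropped with `X[best_mask]`, `y[best_mask]` before building the classifier.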
4.2 Building decision-tree classifier
Decision-tree classifier model was built for all eight medical datasets, in order to perform
the classification task.
4.2.1 Feature selection
One of the most important aspects of data mining is feature selection. Feature selection
refers to selecting relevant features from the data based on their importance. It follows
one of two basic models: the wrapper model and the filter model. The wrapper model
makes use of automatic feature selection, whereas in the filter model the size of the data
can be considerably reduced by deleting irrelevant features. The main goal of feature
selection as a part of this study was to generate a dataset containing the smallest number
of non-redundant features in order to obtain the best results. An example of this type of
algorithm is the decision-tree algorithm. It uses important features and ignores the
irrelevant ones. This concept is useful when the choices of features are many and more
effort is required to reduce the features (Wieschaus and Schultz, 2003). Sebban et al.
(2002) reduced the features in order to reduce the cost and complexity of the
classification algorithm as well as to improve the classifier efficiency.
Table 1  Clustering classes for disease data

S. No.  Type of database                            Cluster attribute                                Instances  Incorrectly classified  Error (%)
1       Cleveland heart disease dataset             Cluster-1 (present)/Cluster-0 (absent)           303        61                      20.13
2       Statlog heart disease dataset               Cluster-1 (present)/Cluster-0 (absent)           270        34                      12.59
3       SPECT dataset                               Cluster-1 (abnormal)/Cluster-0 (normal)          267        82                      30.71
4       BUPA liver disorder dataset                 Cluster-1 (S1)/Cluster-0 (S2)                    345        88                      25.32
5       WBCD-1 (Wisconsin breast cancer dataset)    Cluster-1 (malignant)/Cluster-0 (benign)         699        29                      4.15
6       WBCD-2 (Wisconsin breast cancer dataset)    Cluster-1 (benign)/Cluster-0 (malignant)         569        39                      6.85
7       WPBC (Wisconsin prognostic breast cancer)   Cluster-1 (recurrent)/Cluster-0 (non-recurrent)  198        43                      21.72
8       Hepatitis dataset                           Cluster-1 (die)/Cluster-0 (live)                 155        30                      19.36
In the Cleveland heart disease dataset, the total number of features available is 13, and
the decision-tree used just four of them, viz. thal, exang, sex and cp, as shown in
Figure 2.
Figure 2  Pruned decision-tree of Cleveland heart disease data
The decision-tree was obtained using the J48 implementation of the C4.5 decision-tree
algorithm. The resulting pruned decision-tree, shown in Figure 2, is based on all the
training data, with a total size of nine nodes including five leaf nodes. The evaluation
result is based on tenfold cross validation.
The WBCD (Wisconsin Breast Cancer Dataset) consists of nine features and
decision-tree classifier built uses only four features as shown in Figure 3.
Figure 3  Pruned decision-tree of WBCD (Wisconsin Breast Cancer Dataset-1)
The rules generated from these trees, by considering each path, are given in Tables 2
and 3. The rules thus obtained can be applied to similar kinds of data for which the class
labels are unknown.
Table 2  Rules generated from pruned decision-tree

Rule-1:  IF thal <= 3 and exang <= 0
         THEN record is pre-classified as 'absnt'
Rule-2:  ELSE IF thal <= 3 and exang > 0 and sex <= 0
         THEN record is pre-classified as 'absnt'
Rule-3:  ELSE IF thal <= 3 and exang > 0 and sex > 0 and cp <= 3
         THEN record is pre-classified as 'absnt'
Rule-4:  ELSE IF thal <= 3 and exang > 0 and sex > 0 and cp > 3
         THEN record is pre-classified as 'prsnt'
Rule-5:  ELSE IF thal > 3
         THEN record is pre-classified as 'prsnt'
Default: ELSE 'Ignore the record'
END IF
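The rules in Table 2 translate mechanically into code (attribute names as in the Cleveland dataset; shown for illustration):

```python
def classify_heart(thal, exang, sex, cp):
    """Pre-classify a Cleveland heart disease record using the rules
    generated from the pruned decision-tree (Table 2)."""
    if thal <= 3:
        if exang <= 0:
            return 'absnt'                        # Rule-1
        if sex <= 0:
            return 'absnt'                        # Rule-2
        return 'absnt' if cp <= 3 else 'prsnt'    # Rules 3 and 4
    return 'prsnt'                                # Rule-5: thal > 3
```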
Table 3  Rules generated from the pruned decision-tree for WBCD (Wisconsin Breast Cancer Dataset-1)

Rule-1:  IF UnifmtSize <= 3 and BarNclei <= 6
         THEN record is pre-classified as ‘Begn’
Rule-2:  ELSE IF UnifmtSize <= 3 and BarNclei > 6 and Clumpth <= 2
         THEN record is pre-classified as ‘Begn’
Rule-3:  ELSE IF UnifmtSize <= 3 and BarNclei > 6 and Clumpth > 2
         THEN record is pre-classified as ‘malign’
Rule-4:  ELSE IF UnifmtSize > 3 and MargAdsim <= 1 and Clumpth <= 6 and UnifmtSize <= 5
         THEN record is pre-classified as ‘Begn’
Rule-5:  ELSE IF UnifmtSize > 3 and MargAdsim <= 1 and Clumpth <= 6 and UnifmtSize > 5
         THEN record is pre-classified as ‘malign’
Rule-6:  ELSE IF UnifmtSize > 3 and MargAdsim <= 1 and Clumpth > 6
         THEN record is pre-classified as ‘malign’
Rule-7:  ELSE IF UnifmtSize > 3 and MargAdsim > 1
         THEN record is pre-classified as ‘malign’
Default: ELSE ignore the record
         END IF
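The path-tracing procedure used to derive Tables 2 and 3 can itself be sketched in code: every root-to-leaf walk yields one rule. The nested-tuple tree encoding below, and the collapsing of the UnifmtSize > 3 subtree into a single leaf, are illustrative assumptions for the sketch and not the actual j48 output format.

```python
# Sketch of rule extraction by enumerating root-to-leaf paths.
# Internal nodes: (attribute, threshold, left_subtree, right_subtree);
# leaves: class-label strings. This encoding is hypothetical.

def extract_rules(node, conditions=()):
    """Yield (condition_list, class_label) pairs, one per leaf."""
    if isinstance(node, str):                 # leaf: a class label
        yield list(conditions), node
        return
    attr, threshold, left, right = node
    yield from extract_rules(left, conditions + (f"{attr} <= {threshold}",))
    yield from extract_rules(right, conditions + (f"{attr} > {threshold}",))

# A fragment of the Figure 3 tree covering Rules 1-3 of Table 3
# (the UnifmtSize > 3 side is collapsed to one leaf for brevity):
tree = ("UnifmtSize", 3,
        ("BarNclei", 6,
         "Begn",
         ("Clumpth", 2, "Begn", "malign")),
        "malign")

for conds, label in extract_rules(tree):
    print("IF", " and ".join(conds), "THEN", label)
```

Each printed line corresponds to one rule of Table 3 in order, which is exactly the "one rule per path" construction described above.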
4.3 Performance measures
The performance of the proposed framework was evaluated using the tenfold cross validation method. The dataset was divided into ten equal subsets, and the process of training and testing was repeated ten times: in each iteration one subset is used for testing and the remaining nine are combined for training, so that every subset serves as the test set exactly once (Delen et al., 2005).
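As a minimal sketch of the splitting step described above (pure Python; the dataset size of 297, the size of the Cleveland heart disease data, is just an example value):

```python
import random

def ten_fold_indices(n, seed=0):
    """Partition indices 0..n-1 into ten disjoint, near-equal test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)        # fixed seed for reproducibility
    return [idx[i::10] for i in range(10)]

folds = ten_fold_indices(297)               # e.g., Cleveland-sized dataset
for test in folds:
    # The other nine folds form the training set for this iteration.
    train = [i for f in folds if f is not test for i in f]
    # ... build the classifier on `train`, evaluate on `test` ...
```

The ten folds are disjoint and jointly cover every instance, which is what guarantees that each record is tested exactly once over the ten iterations.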
A confusion matrix was calculated for the classifier to interpret the results and is given in Table 4. The confusion matrix is a square matrix that shows the various classifications and misclassifications of the model: the columns correspond to the predicted classes and the rows to the actual classes. True Positives (TP) and True Negatives (TN) are correct classifications. A False Positive (FP) occurs when the outcome is incorrectly predicted as YES (positive) when it is actually NO (negative); a False Negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.
Table 4  Confusion matrix measures

                          Predicted class
                          Yes                    No
Actual class   Yes        true positive (TP)     false negative (FN)
               No         false positive (FP)    true negative (TN)
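A minimal sketch of tallying the four cells of Table 4 from paired actual and predicted labels (the list-based label encoding is illustrative):

```python
def confusion_counts(actual, predicted, positive="yes"):
    """Tally TP, FN, FP, TN as laid out in Table 4."""
    tp = fn = fp = tn = 0
    for a, p in zip(actual, predicted):
        if a == positive:                 # actual class YES
            if p == positive: tp += 1
            else:             fn += 1
        else:                             # actual class NO
            if p == positive: fp += 1
            else:             tn += 1
    return tp, fn, fp, tn

actual    = ["yes", "yes", "no", "no",  "yes"]
predicted = ["yes", "no",  "no", "yes", "yes"]
print(confusion_counts(actual, predicted))   # (2, 1, 1, 1)
```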
The actual results obtained by our framework on the eight datasets are shown in Table 5: the diagonal of each confusion matrix gives the True Positive (TP) and True Negative (TN) counts, and the off-diagonal entries give the False Negative (FN) and False Positive (FP) counts.
Table 5  Comparison of medical data accuracy, sensitivity and specificity

                                       Confusion matrix       Proposed method
Types of data                          TP   FN   FP   TN      Acc.%  Sen.%  Spec.%
Hepatitis dataset                      122  01   02   20      97.93  99.18  90.90
WPBC (Wisconsin Prognostic
  Breast Cancer) dataset               27   04   01   123     96.78  87.09  99.19
WBCD-2 (Wisconsin Breast
  Cancer Dataset)                      177  01   03   347     99.25  99.43  99.14
WBCD-1 (Wisconsin Breast
  Cancer Dataset)                      451  02   04   213     99.11  99.55  98.15
BUPA liver dataset                     186  00   01   60      99.59  100    98.36
SPECT images dataset                   52   00   02   131     98.92  100    98.49
Statlog heart disease dataset          43   02   03   188     97.88  95.55  98.49
Cleveland heart disease                126  03   05   108     96.70  97.67  95.57

                                       Previous method
Types of data                          Name                                                 Acc.%  Sen.%  Spec.%
Hepatitis dataset                      Levenberg–Marquardt BP (Bascil and Temurtas, 2009)   91.51  –      –
WPBC dataset                           Naive Bayesian (Dumitru, 2009)                       74.24  27.78  91.67
WBCD-2                                 AIRS (Polat et al., 2007c)                           98.51  –      –
WBCD-1                                 LSSVM (Polat and Gunes, 2007b)                       97.08  97.87  97.77
BUPA liver dataset                     GA-AWAIS (Ozsen and Gunes, 2009)                     85.21  –      –
SPECT images dataset                   ICA-AIRS (Bascil and Temurtas, 2009)                 97.74  99.04  92.85
Statlog heart disease dataset          Kernel F-score (Polat and Gunes, 2009)               87.43  –      –
Cleveland heart disease                ANN-fuzzy (Polat et al., 2007a)                      87.4   93     78.5
The performance is measured using the accuracy, sensitivity and specificity measures. They are calculated using equations (5), (6) and (7) as given below:

Accuracy = (TP + TN) / (TP + TN + FN + FP)    (5)

Sensitivity = TP / (TP + FN)    (6)

Specificity = TN / (TN + FP)    (7)
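Equations (5)–(7) are straightforward to evaluate. As a check, the sketch below applies them to one set of confusion-matrix counts for the Cleveland heart disease data (TP = 126, FN = 3, FP = 5, TN = 108), which reproduces the reported percentages to within rounding:

```python
def metrics(tp, tn, fp, fn):
    """Equations (5)-(7): accuracy, sensitivity and specificity, in percent."""
    accuracy    = 100 * (tp + tn) / (tp + tn + fn + fp)   # equation (5)
    sensitivity = 100 * tp / (tp + fn)                    # equation (6)
    specificity = 100 * tn / (tn + fp)                    # equation (7)
    return accuracy, sensitivity, specificity

# Cleveland heart disease counts: TP=126, FN=3, FP=5, TN=108
acc, sen, spec = metrics(tp=126, tn=108, fp=5, fn=3)
print(round(acc, 2), round(sen, 2), round(spec, 2))   # 96.69 97.67 95.58
```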
The proposed framework provides very promising results in comparison to the previous methods applied to each of the eight datasets. The accuracy, sensitivity and specificity values for each dataset are compared in Table 5. For the Cleveland heart disease data, the accuracy, sensitivity and specificity obtained were 96.70%, 97.67% and 95.57%, respectively, which are comparable to the recent study by Polat et al. (2007a). For the Statlog heart disease data the corresponding values were 97.88%, 95.55% and 98.49%, which are better than the kernel F-score feature selection method proposed by Polat and Gunes (2009). On the SPECT images dataset our framework obtained 98.92%, 100% and 98.49%, which are better than ICA-AIRS (Bascil and Temurtas, 2009). The classification accuracy obtained with GA-AWAIS (Ozsen and Gunes, 2009) on the BUPA liver disorders data using tenfold cross validation was 85.21%, whereas our method achieved 99.59%. On WBCD-1 the proposed method obtained accuracy, sensitivity and specificity of 99.11%, 99.55% and 98.15%, which are better than LSSVM (Polat and Gunes, 2007b). On WBCD-2 the proposed method produced an accuracy of 99.25%, better than the AIRS result of Polat et al. (2007b). Classification on the WPBC dataset was used to predict whether a cancer was recurrent or non-recurrent; compared with the results of Dumitru (2009), our framework achieved a higher accuracy of 96.78%. On the hepatitis disease data the accuracy obtained by our framework was 97.93%, which is better than the method proposed by Bascil and Temurtas (2009). It is evident that the results obtained by our framework are better than those of the other classification methods proposed to date.
5  Conclusions
In the proposed work, k-means clustering is used to validate the class labels associated with the data. Data instances are clustered into k disjoint clusters (here k = 2: cluster-0 and cluster-1). It has been observed that removing the misclassified data after clustering significantly improves the performance of the classifier. The classification accuracy, sensitivity and specificity obtained by the proposed framework are found to be better than those obtained by other competing techniques on all eight medical datasets from the UCI machine learning repository. The results indicate that the proposed framework can be routinely used for decision support by medical practitioners.
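The two-stage procedure summarised above (k-means label validation followed by classification on the filtered data) can be sketched as follows. This is a toy, one-dimensional sketch on made-up data, assuming labelled instances as (feature, label) pairs; it is not the authors' implementation.

```python
import random

def kmeans_2(points, iters=20, seed=0):
    """Plain two-cluster k-means on 1-D feature values (toy sketch)."""
    rng = random.Random(seed)
    c0, c1 = rng.sample(points, 2)                  # initial centroids
    for _ in range(iters):
        g0 = [p for p in points if abs(p - c0) <= abs(p - c1)]
        g1 = [p for p in points if abs(p - c0) > abs(p - c1)]
        if not g0 or not g1:                        # degenerate split: stop
            break
        c0, c1 = sum(g0) / len(g0), sum(g1) / len(g1)
    return c0, c1

def filter_consistent(data):
    """Stage 1: drop instances whose cluster disagrees with the class label."""
    xs = [x for x, _ in data]
    c0, c1 = kmeans_2(xs)
    lo, hi = sorted((c0, c1))                       # cluster-0 = lower centroid
    return [(x, y) for x, y in data
            if (y == 0) == (abs(x - lo) <= abs(x - hi))]

# Toy data: label 0 near 1.0, label 1 near 9.0, one mislabelled point.
data = [(1.0, 0), (1.2, 0), (0.8, 0), (9.0, 1), (8.5, 1), (1.1, 1)]
clean = filter_consistent(data)
print(len(clean))   # 5: the mislabelled (1.1, 1) instance is removed
```

Stage 2 would then train the C4.5 classifier on `clean` rather than on the full dataset, which is the filtering effect the conclusions attribute the accuracy improvement to.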
Acknowledgements
The authors are highly thankful to the reviewers and the guest editor for their fruitful comments and suggestions, which helped us improve earlier versions of this paper, and to B.I. Khdakbhavi, Director of MBE Society’s College of Engineering, Ambajogai, and AICTE for their sponsorship.
References
Andrews, P.J., Sleeman, D.H., Statham, P.F., McQuatt, A., Corruble,V., Jones, P.A., Howells, T.P.
and Macmillan, C.S. (2002) ‘Predicting recovery in patients suffering brain injury by using
admission variables and physiological data: a comparison between decision tree analysis and
logistic regression’, Journal of Neurosurgery, Vol. 97, pp.326–336.
Miniño, A.M., Heron, M.P., Murphy, S.L. and Kochanek, K.D. (2007) National Vital Statistics Reports, Vol. 55, No. 19, p.7.
Bakirci, U. and Yildirim, T. (2004) ‘Diagnosis of cardiac problems from SPECT images by feedforward networks’, IEEE 12th Signal Processing and Communication Application Conference, pp.103–105.
Bascil, M.S. and Temurtas, F. (2009) ‘A study on hepatitis disease diagnosis using multilayer neural network with Levenberg Marquardt training algorithm’, Journal of Medical Systems.
Benjamin, K.T., Tom, B.Y.L., Samuel, W.K.C., Weijun, G. and Xuegang, Z. (2000) ‘Enhancement
of a Chinese discourse marker tagger with C4.5’, Proceedings of the Second Workshop on
Chinese Language Processing held in Conjunction with the 38th Annual Meeting of the
Association for Computational Linguistics, Association for Computational Linguistics,
Morristown, NJ, USA, Vol. 12, pp.38–45.
Chattopadhyay, S., Pratihar, D.K. and Sarkar, S. (2008) ‘Developing fuzzy classifiers to predict
the chance of occurrence of adult psychoses’, Knowledge-Based Systems, Vol. 21, No. 6,
pp.479–497.
Cheung, N. (2001) Machine Learning Techniques for Medical Analysis, Bsc Thesis, School of
Information Technology and Electrical Engineering, University of Queensland.
Das, R., Turkoglu, I. and Sengur, A. (2009) ‘Effective diagnosis of heart disease through neural
networks ensembles’, Expert Systems with Applications, Vol. 36, pp.7675–7680.
Delen, D., Walker, G. and Kadam, A. (2005) ‘Predicting breast cancer survivability: a comparison
of three data mining methods’, Artificial Intelligence in Medicine, Vol. 34, No. 2, pp.113–127.
Dhillon, I.S., Mallela, S. and Kumar, R. (2003) ‘A divisive information-theoretic feature
clustering algorithm for text classification’, Journal of Machine Learning Research, Vol. 3,
pp.1265–1287.
Dumitru, D. (2009) ‘Prediction of recurrent events in breast cancer using the Naive Bayesian
classification’, Computer Science Series, Vol. 36, No. 2, pp.92–96.
Evans, S., Lemon, S., Deters, C., Fusaro, R. and Lynch, H. (1997) ‘Automated detection of
hereditary syndromes using data mining’, Computers and Biomedical Research, Vol. 30,
pp.337–348.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) ‘Data mining and knowledge discovery in databases’, Communications of the ACM, Vol. 39, No. 11, pp.24–26.
Goodwin, L., Prather, J., Schlitz, K., Iannacchione, M.A., Hammond, W. and Grzymala, J. (1997)
‘Data mining issues for improved birth outcomes’, Biomedical Science Instrumentation,
Vol. 34, No. 19, pp.291–296.
Hamilton, H.J., Shan, N. and Cercone, N. (1996) RIAC: A Rule Induction Algorithm Based on
Approximate Classification, Technical Report CS 96-06, University of Regina.
Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques, 2nd ed., Morgan
Kaufmann Publishers.
Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, MIT Press, Cambridge,
MA.
Humar, K. and Novruz, A. (2008) ‘Design of a hybrid system for the diabetes and heart diseases’,
Expert Systems with Application, Vol. 35, pp.82–89.
Jain, L.C. and Chen, Z. (2003) ‘Industry, artificial intelligence in, encyclopedia of information
systems’, Elsevier Science, Vol. 2, pp.583–597.
Kyriakopoulou, A. and Kalamboukis, T. (2007) ‘Using clustering to enhance text classification’,
Proceedings of the 30th Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval, pp.805–806.
Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004) ‘Stability-based validation of
clustering solutions’, Neural Computation, Vol. 16, No. 6, pp.1299–1323.
Li, M., Cheng, Y. and Zhao, H. (2004) ‘Unlabeled data classification via support vector machine
and k-means clustering’, Proceedings of the Conference on Computer Graphics, Imaging and
Visualization (CGIV04), Penang, Malaysia, pp.183–186.
Newman, D., Hettich, J.S., Blake, C.L.S. and Merz, C.J. (1998) UCI Repository of machine
learning databases, Department of Information and Computer Science, University of
California, Irvine, CA. Available online at: www.ics.uci.edu/~mlearn/MLRepository.html
(accessed on 10 August 2009).
Ozsen, S. and Gunes, S. (2009) ‘Attribute weighting via genetic algorithms for attribute weighted artificial immune system (AWAIS) and its application to heart disease and liver disorders problems’, Expert Systems with Applications, Vol. 36, pp.386–392.
Patil, B.M., Joshi, R.C., Toshniwal, D. and Biradar, S. (2010) ‘A new approach: role of data mining in prediction of survival of burn patients’, Journal of Medical Systems. Available online at: www.springerlink.com/index/8pnh75n137t99892.pdf
Pena-Reyes, C.A. and Sipper, M. (1999) ‘A fuzzy-genetic approach to breast cancer diagnosis’,
Artificial Intelligence in Medicine, Vol. 17, pp.131–155.
Polat, K. and Gunes, S. (2007a) ‘A hybrid approach to medical decision support systems:
combining feature selection, fuzzy weighted pre-processing and AIRS’, Computer Methods
and Programs in Biomedicine, Vol. 88, No. 2, pp.164–174.
Polat, K. and Gunes, S. (2007b) ‘Breast cancer diagnosis using least square support vector
machine’, Digital Signal Processing, Vol. 17, No. 4, pp.694–701.
Polat, K. and Gunes, S. (2009) ‘A new feature selection method on classification of medical
datasets: Kernel F-score feature selection’, Expert Systems with Applications, Vol. 36,
pp.10367–10373.
Polat, K., Sahan, S. and Gunes, S. (2007a) ‘Automatic detection of heart disease using an artificial
immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn
(nearest neighbour) based weighting preprocessing’, Expert Systems with Applications,
Vol. 32, No. 2, pp.625–631.
Polat, K., Sahan, S., Kodaz, H. and Gunes, S. (2007b) ‘Breast cancer and liver disorders
classification using artificial immune recognition system (AIRS) with performance
evaluation by fuzzy resource allocation mechanism’, Expert Systems with Applications,
Vol. 32, pp.172–183.
Polat, K., Sekerci, R. and Gunes, S. (2007c) ‘Artificial immune recognition system based classifier ensemble on the different feature subsets for detecting the cardiac disorders from SPECT images’, DEXA LNCS, Vol. 4653, pp.45–53.
Quinlan, J.R. (1993) C4.5 Programs for Machine Learning, Morgan Kaufmann Publishers,
San Mateo, CA.
Quinlan, J.R. (1996) ‘Improved use of continuous attributes in C4.5’, Journal of Artificial
Intelligence Research, Vol. 4, pp.77–90.
Roychowdhury, A., Pratihar, D.K., Bose, N., Sankaranarayana, K.P. and Sudhahar, N. (2004)
‘Diagnosis of the diseases using a GA-fuzzy approach’, Information Sciences, Vol. 162,
No. 2, pp.105–120.
Sebban, M. and Nock, R. (2002) ‘A hybrid filter wrapper approach of feature selection using
information theory’, Pattern Recognition, Vol. 35, No. 4, pp.835–846.
Serpen, G., Jiang, H. and Allred, L.G. (1997) ‘Performance analysis of probabilistic potential
function neural network classifier’, Proceedings of Artificial Neural Networks in Engineering
Conference, Vol. 7, pp.471–476.
Shekhar, R., Gaddam, V., Phoha, V. and Kiran, S. (2007) ‘K-Means+ID3: a novel method for
supervised anomaly detection by cascading K-means clustering and ID3 decision tree learning
methods’, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, pp.1–10.
Tang, Y., Jin, B., Sun, Y. and Zhang, Y. (2004) ‘Granular support vector machines for medical
binary classification problems’, IEEE Symposium on Computational Intelligence in
Bioinformatics and Computational Biology, 7–8 October, San Diego, CA, pp.73–78.
Ture, M., Tokatli, F. and Kurt, I. (2009) ‘Using Kaplan–Meier analysis together with decision tree
methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of
breast cancer patients’, Expert Systems with Applications, Vol. 36, pp.2017–2026.
Wieschaus, E. and Schultz, M.A. (2003) A Comparison of Methods for the Reduction of Attributes
Before Classification in Data Mining, Yale University, Yale, CT.
Yan, H., Zheng, J., Jiang, Y., Peng, C. and Li, Q. (2003) ‘Development of a decision support
system for heart disease diagnosis using multilayer perception’, IEEE Symposium on Circuits
and Systems, Vol. 5, pp.V709–V712.
Zharkova, V. and Jain, L.C. (2007) ‘Artificial intelligence in recognition and classification of
astrophysical and medical images’, Springer Studies in Computational Intelligence, Vol. 46.