Int. J. Computational Intelligence Studies, Vol. X, No. Y, XXXX

Effective framework for prediction of disease outcome using medical datasets: clustering and classification

B.M. Patil, Ramesh C. Joshi and Durga Toshniwal*
Department of Electronics and Computer Engineering, Indian Institute of Technology Roorkee, Roorkee 247667, Uttarakhand, India
Email: [email protected]
Email: [email protected]
Email: [email protected]
*Corresponding author

Abstract: The method of combining two algorithms within a single workflow is called hybrid computing. We propose a data mining framework comprising two stages, namely clustering and classification. The first stage applies the k-means algorithm to the data and generates two clusters, cluster-0 and cluster-1: instances in cluster-0 do not have disease symptoms and cluster-1 consists of instances with disease symptoms. Valid grouping is then verified by comparing cluster membership with the class labels in the original datasets. Incorrectly classified instances are removed and the remaining instances are used to build the classifier using the C4.5 decision-tree algorithm with the k-fold cross validation method. The framework was tested using eight datasets from the UCI machine learning repository and evaluated using accuracy, sensitivity and specificity measures. It obtained promising classification accuracy as compared to other methods found in the literature.

Keywords: clustering; classification; effective framework; hybrid computing; disease outcome; computational intelligence.

Reference to this paper should be made as follows: Patil, B.M., Joshi, R.C. and Toshniwal, D. (20XX) 'Effective framework for prediction of disease outcome using medical datasets: clustering and classification', Int. J. Computational Intelligence Studies, Vol. X, No. Y, pp.xx-xx.

Biographical notes: B.M.
Patil is currently a PhD student in Electronics and Computer Engineering, Indian Institute of Technology, Roorkee, India. He received his Bachelor's degree in Computer Engineering from Gulbarga University in 1993 and his MTech in Computer Science from Mysore University in 1999. His current research interests include data mining, medical decision support systems, artificial intelligence and artificial neural networks.

Ramesh C. Joshi is currently a Professor in the Department of Electronics and Computer Engineering, IIT, Roorkee, India. He received his ME and PhD degrees in Electronics and Computer Engineering from the University of Roorkee in 1970 and 1980, respectively. His research interests include parallel and distributed processing, AI, databases and information security. He has guided about 25 PhD theses, 150 MTech dissertations and 115 ME/BE projects, and completed four sponsored projects as PI. He has published about 150 papers in national/international journals and conferences. He received a Gold Medal from the Institute of Engineers (I) in 1978.

Durga Toshniwal is working as an Assistant Professor at the Department of Electronics and Computer Engineering in IIT, Roorkee. She completed her PhD in Computer Science Engineering from IIT, Roorkee, India, in 2005. She obtained her BE and MTech from NIT, Kurukshetra, India. She has authored several papers in various international journals and international and national conferences of repute. She was awarded the IBM Faculty Award 2008 and the IBM Shared University Research Award 2009 for her research contributions. Her areas of research interest include time-series data mining, privacy preserving data mining, applying soft computing techniques in mining applications, mining data streams and unstructured text mining.

Copyright © 200X Inderscience Enterprises Ltd.

1 Introduction

Over the last few years, data mining has been increasingly applied to solve problems in the medical domain.
Data mining has been applied with a high rate of success in various fields such as marketing, banking, customer relationship management and engineering, as well as in various other areas of science. However, its application to the analysis of medical data is comparatively restricted. This is particularly true for applications that predict disease outcomes. The goal of researchers working on prediction of disease outcome is to develop a model that can use patient-related data to predict the significant result and thereby support decision-making. Data mining methods are applied to build classifier models for prognosis and diagnosis, and there is a critical need for medical decision support systems that can assist practitioners in their diagnostic process. The research area of data mining and Knowledge Discovery in Databases (KDD) has evolved from a combination of statistics, machine learning, artificial intelligence, pattern recognition, expert systems, databases and information (Fayyad et al., 1996). Hand et al. (2001) defined data mining as the 'analysis of (often large) observational datasets to find unsuspected relationships and to summarise the data in novel ways that are both understandable and useful to the data owner'. Han and Kamber (2006) defined data mining as the extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data. Data mining is explained in detail in Jain and Chen (2003). Data mining problems are often solved using diverse approaches taken from computer science (multi-dimensional databases, machine learning and soft computing) and from statistics (clustering, classification and regression techniques). Classification using decision-trees can reveal useful patterns in observational data of brain injury (Andrews et al., 2002).
Data mining techniques have also been used in improving birth outcomes (Goodwin et al., 1997) and automated detection of hereditary syndromes (Evans et al., 1997). Roychowdhury et al. (2004) proposed a GA-fuzzy-based approach for diagnosis of two diseases, namely pneumonia and jaundice. Patil et al. (2010) used four data mining algorithms for prediction of survival of burn patients. Chattopadhyay et al. (2008) proposed a novel attempt to develop fuzzy logic-based Expert Systems (ESs) that are able to reason like doctors for screening adult psychosis. Zharkova and Jain (2007) presented a method for classification of medical images that enables extraction of quantified information in terms of texture, intensity and shape, enabling improved diagnosis of human organs.

The method of using two different computation algorithms working together, either sharing a task or in cascade one after the other, is called hybrid computing. Our proposed framework makes use of two stages, namely clustering and classification. The first stage applies the k-means algorithm to the data and generates two clusters, namely cluster-0 and cluster-1. Cluster-0 consists of those instances which do not have disease symptoms and cluster-1 consists of instances with disease symptoms (here k = 2 is based on the number of outcomes). Valid grouping is then verified by comparing cluster membership with the class labels in the original datasets. Incorrectly classified instances are removed and the remaining instances are used to build the classifier using the C4.5 decision-tree algorithm with the k-fold cross validation method. The framework was tested using eight datasets from the machine learning repository of the University of California at Irvine (UCI) (Newman et al., 1998).

The rest of the paper is organised as follows: in Section 2, we briefly discuss work related to classification of medical data. We propose our framework in Section 3.
The k-means and decision-tree algorithms are explained there. The implementation and analysis are given in Section 4, where the measures used for performance analysis are also defined. The conclusion follows in Section 5.

2 Related work

Heart disease is a broad term that can refer to any condition that affects the heart. A large number of people die every year due to heart disease all over the world, and it is one of the leading causes of death in the USA (Arialdi et al., 2007). There are many different forms of heart disease. The most common is coronary artery disease, caused by narrowing or blockage of the coronary arteries. While many people with heart disease have symptoms such as chest pain and fatigue, as many as 50% have no symptoms until a heart attack occurs (http://www.ivillage.com/health, accessed on September 2009). However, correct diagnosis at an early stage followed by appropriate treatment can result in significant life saving (Yan et al., 2003). Different classification algorithms have been used to detect the presence of heart disease in the dataset from the UCI machine learning repository. Serpen et al. (1997) proposed a probabilistic potential function neural network classifier for the Cleveland heart disease data; the classification accuracy achieved by this method was 58.28%. Tang et al. (2004) developed a new model called Granular Support Vector Machines (GSVM) for data classification problems. It systematically combines two theories, namely statistical learning and granular computing: it builds a sequence of information granules and then builds a Support Vector Machine (SVM) in each information granule. SVM and GSVM obtained accuracies of 83.04% and 84.04%, respectively. Polat et al. (2007a) proposed a method that uses an Artificial Immune Recognition System (AIRS) and obtained 87% classification accuracy.
Polat and Gunes (2007a) suggested a new approach combining feature selection, fuzzy weighted pre-processing and an AIRS classifier to classify the heart disease dataset, and obtained a classification accuracy of 92.59%. Humar and Novruz (2008) developed a hybrid system using an Artificial Neural Network (ANN) and a Fuzzy Neural Network (FNN) and achieved an accuracy of 86.8%. Das et al. (2009) proposed a system called Neural Networks Ensemble, which creates new models by combining the posterior probabilities or the predicted values from multiple predecessor models using the SAS data miner tool, and achieved an accuracy of 89.01%. Diagnosis of cardiac disorders can be based on SPECT (Single Photon Emission Computed Tomography) images. Bakirci and Yildirim (2004) used a feed-forward ANN and achieved an accuracy of 90.04%. Polat et al. (2007c) proposed an ensemble classifier system based on different feature subsets and an AIRS classifier to detect cardiac disorders from SPECT images, and obtained an accuracy of 97.74% by dividing the data into approximately equal parts for training and testing purposes.

The liver is a vital organ located in the upper right-hand side of the abdomen. It performs numerous functions for the body: converting nutrients derived from food into essential blood components, storing vitamins and minerals, regulating blood clotting, producing proteins and enzymes, maintaining hormone balances, and metabolising and detoxifying substances that would otherwise be harmful to the body (http://www.medlineplus.gov). Cheung et al. (2001) used a number of classification algorithms on the BUPA liver disorder dataset. They obtained an accuracy of 65.50% using the C4.5 algorithm, 63.39% using a Naive Bayes classifier, 61.83% using a Bayesian Network with Naive Dependence (BNND) and 61.42% using a Bayesian Network with Naive Dependence and Feature Selection (BNNF) classifier. Polat et al.
(2007b) proposed a Fuzzy-AIRS method and classified the dataset with an accuracy of 83.38%, compared with the AIRS classification algorithm, which obtained a classification accuracy of 81%.

Breast cancer is the most common cancer in women today in many countries. It is considered to be the second leading cause of cancer deaths among women between 40 and 55 years of age (Delen et al., 2005). Cancer is a group of diseases in which cells in the body grow, change and multiply in an uncontrolled fashion. Breast cancer refers to the erratic growth and propagation of cells that originate in the breast tissue. A group of rapidly dividing cells may form a lump or mass of extra tissue called a tumour. Tumours can either be cancerous (malignant) or non-cancerous (benign). Malignant tumours penetrate and destroy healthy body tissues, and a group of cells within a tumour may also break away and spread to other parts of the body (http://www.imaginis.com/breasthealth, accessed on 15 April 2010). Much work has been done on the WBCD (Wisconsin Breast Cancer Dataset) in the literature and high classification accuracy has been achieved. Quinlan (1996) used the C4.5 classification algorithm and obtained an accuracy of 94.74% using tenfold cross validation. Hamilton et al. (1996) proposed a method called RIAC (Rule Induction Algorithm based on Approximate Classification) and achieved an accuracy of 96% with tenfold cross validation. Pena-Reyes and Sipper (1999) proposed a fuzzy-GA method and obtained a classification accuracy of 97.36%. Polat and Gunes (2007b) used a least squares SVM and obtained an accuracy of 98.53%.

Hepatitis means an inflammation of the liver. It can be caused by many things, including a bacterial infection, liver injury caused by a toxin (poison) and even an attack on the liver by the body's own immune system (http://www.netdoctor.co.uk/diseases/facts/hepatitis.htm).
A number of classification methods have been proposed that achieve high classification accuracies on the dataset taken from the UCI machine learning repository. Bascil and Temurtas (2009) used a multi-layer neural network with the tenfold cross-validation technique and compared their results with those of previous studies on the same dataset (www.is.umk.pl/projects/datasets.html, accessed on 15 April 2009). However, none of the above-mentioned techniques used proper validation of class labels, which affects predictive performance. Our study proposes a framework which uses the k-means clustering algorithm to validate the chosen class labels of the given data. The decision-tree classification algorithm is then applied to the resulting pattern to build the final classifier model using the k-fold cross validation method. This model is evaluated on various datasets taken from the UCI machine learning repository.

3 Proposed method

The classification framework for the medical datasets is shown in Figure 1. It consists of two stages: clustering and classification. In the first stage, we apply a clustering method for pattern extraction to validate the class label associated with the given data and delete the misclassified instances. In the second stage, a decision-tree is applied for classification using k-fold cross validation (k = 10).

Figure 1 Effective framework for prediction of disease outcome

3.1 Datasets

In this study, we have taken the medical datasets of eight diseases from the UCI machine learning repository (Newman et al., 1998), details of which are given hereafter.
3.1.1 Cleveland heart disease dataset

The dataset contains a total of 303 samples and has 13 attributes: age; sex; chest pain type (cp); resting blood pressure (trestbps, mm Hg); serum cholesterol (chol, mg/dl); fasting blood sugar (fbs, >120 mg/dl); resting electrocardiographic results (restecg); maximum heart rate achieved (thalach); exercise induced angina (exang); ST depression induced by exercise relative to rest (oldpeak); the slope of the peak exercise ST segment (slope); number of major vessels (0-3) coloured by fluoroscopy (ca); and heart status (thal, 3 = normal; 6 = fixed defect; 7 = reversible defect). The dataset is divided into five classes: 0 corresponds to absence of any disease and 1, 2, 3 and 4 correspond to four different types of disease. Many researchers have used this dataset to differentiate between the absence (0) and presence (1, 2, 3 or 4) of a disease; the two classes are coded as '0' for absence and '1' for presence.

3.1.2 Statlog heart disease dataset

The Statlog heart disease dataset contains 270 samples, of which 120 belong to patients with heart problems while the remaining 150 are of healthy persons. The samples include 13 attributes, which are given below:
1 age
2 sex
3 chest pain type (four values)
4 resting blood pressure
5 serum cholesterol (mg/dl)
6 fasting blood sugar (>120 mg/dl)
7 resting electrocardiographic results (values 0, 1, 2)
8 maximum heart rate achieved
9 exercise induced angina
10 oldpeak = ST depression induced by exercise relative to rest
11 the slope of the peak exercise ST segment
12 ca: number of major vessels (0-3) coloured by fluoroscopy
13 thal: 3 = normal; 6 = fixed defect; 7 = reversible defect.

The class information is included in the dataset as 1 and 2, denoting absence and presence of disease, respectively.
3.1.3 Single photon emission computed tomography images dataset

The SPECT images dataset is concerned with the diagnosis of cardiac disorders from cardiac SPECT images. Each patient is classified into one of two categories: normal and abnormal. The database of 267 SPECT image sets (patients) was processed to extract features that summarise the original SPECT images. A pattern in the SPECT image dataset is represented by 22 binary features that take either the value 0 or 1. There are 55 normal (0) and 212 abnormal (1) subjects in the SPECT image dataset.

3.1.4 BUPA liver disorder dataset

The liver disorders data prepared by the BUPA Medical Research Company includes 345 samples consisting of six attributes and two classes. Two hundred samples are of healthy persons, while the remaining 145 belong to patients. Each sample has six attributes, all of which are real valued:
1 mean corpuscular volume (mcv)
2 alkaline phosphatase (alkphos)
3 alanine aminotransferase (sgpt)
4 aspartate aminotransferase (sgot)
5 gamma-glutamyl transpeptidase (gammagt)
6 number of half-pint equivalents of alcoholic beverages drunk per day (drinks).

3.1.5 Wisconsin breast cancer dataset-1

WBCD-1 consists of 699 samples that were collected by Dr W.H. Wolberg at the University of Wisconsin-Madison Hospitals, taken from fine needle aspirates of human breast cancer tissue. The WBCD consists of nine features obtained from the aspirates, each of which is ultimately represented as an integer value between 1 and 10. The measured variables are the following:
1 Clump Thickness (x1)
2 Uniformity of Cell Size (x2)
3 Uniformity of Cell Shape (x3)
4 Marginal Adhesion (x4)
5 Single Epithelial Cell Size (x5)
6 Bare Nuclei (x6)
7 Bland Chromatin (x7)
8 Normal Nucleoli (x8)
9 Mitoses (x9).
The dataset consists of 699 samples, out of which 458 belong to the benign group and the remaining 241 are of a malignant nature.

3.1.6 Wisconsin breast cancer dataset-2

WBCD-2 was obtained from the repository and has 32 attributes and 569 instances, of which 357 instances belong to the benign class and the remaining 212 to the malignant class.

3.1.7 Wisconsin prognostic breast cancer dataset

The Wisconsin Prognostic Breast Cancer (WPBC) dataset concerns 198 patients and a binary decision class: non-recurrent events totalling 151 instances and recurrent events totalling 47 instances. The reported testing accuracy, the main performance measure of the classifier, was about 74.24%, in line with the performance of other well-known machine learning techniques.

3.1.8 Hepatitis dataset

The dataset, which consists of hepatitis disease measurements, contains two classes and 155 samples. The class distribution is Class 1: die (32) and Class 2: live (123). All samples have 19 features: age, sex, steroid, antivirals, fatigue, malaise, anorexia, liver big, liver firm, spleen palpable, spiders, ascites, varices, bilirubin, alk phosphate, sgot, albumin, pro time and histology.

3.2 Clustering

Our motivation in this paper is based on the assumption (Han and Kamber, 2006) that instances with similar attribute values are more likely to have the same class label. Similarity is measured using Euclidean distance. The instances misclassified by clustering are therefore deleted, and the correctly classified instances are retained for further classification using the decision-tree classifier. Many researchers have used clustering on unlabelled data to assign class labels, and have then used supervised learning methods for classification (Dhillon et al., 2003; Li et al., 2004; Kyriakopoulou and Kalamboukis, 2007).
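The similarity assumption above can be made concrete with a short sketch; the attribute vectors below are hypothetical and serve only to illustrate the distance used to compare instances:

```python
import math

def euclidean(a, b):
    """Euclidean distance between two equal-length attribute vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Under the stated assumption, instances that are close in attribute space
# are more likely to share a class label.
print(euclidean([3.0, 4.0], [0.0, 0.0]))  # 5.0
```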
In our study, we use the k-means clustering algorithm to validate the class label associated with each instance of the dataset. The reason for choosing k-means is that Lange et al. (2004) showed that the validation result obtained by k-means clustering is better than that of other methods for k = 2 or 3. We also tried clustering with the k-medoid algorithm, but its misclassification rate was 50%; we retained the k-means result because its misclassification rate was lower. The k-means algorithm takes an input parameter, k, and partitions a set of N points into k clusters so that the resulting intracluster similarity is high but the intercluster similarity is low. The steps of the k-means method are the following (Shekhar et al., 2007):
1 Select k random instances from the training data subset as the centroids of the clusters C1, C2, ..., Ck.
2 For each training instance X:
  a Compute the Euclidean distance D(Ci, X), i = 1 ... k, and find the cluster Cq that is closest to X.
  b Assign X to Cq and update the centroid of Cq. (The centroid of a cluster is the arithmetic mean of the instances in the cluster.)
3 Repeat Step 2 until the centroids of the clusters C1, C2, ..., Ck stabilise in terms of the mean-squared error criterion.

3.3 Decision-tree algorithm

Decision-trees are among the most popular classification methods. The rules produced by a decision-tree are easy to interpret and understand, and hence can greatly help in appreciating the underlying mechanisms that separate samples into different classes. Among the many decision-tree based classifiers, C4.5 is a well-established and widely used algorithm. C4.5 is a supervised learning classification algorithm used to construct decision-trees from data (Quinlan, 1993). It uses a divide-and-conquer approach to growing decision-trees (Benjamin et al., 2000; Ture et al., 2009).
Let the attributes be denoted by A = {a1, a2, ..., am}, cases by D = {d1, d2, ..., dn}, and classes by C = {c1, c2, ..., ck}. For a set of cases D, a test Ti is a split of D based on attribute ai. It splits D into mutually exclusive subsets D1, D2, ..., Dp; ideally, these subsets are single-class collections of cases. If a test T is chosen, the decision-tree for D consists of a node identifying the test T and one branch for each possible subset Di. For each subset Di, a new test is then chosen for a further split. If Di satisfies a stopping criterion, the tree for Di is a leaf associated with the most frequent class in Di; one reason for stopping is that all cases in Di belong to one class. The C4.5 decision-tree algorithm uses a modified splitting criterion called gain ratio, choosing tests by arg max(Gain(D, T)) or arg max(GainRatio(D, T)):

Info(D) = -Σ_{i=1}^{k} p(ci, D) log2(p(ci, D))   (1)

Split(D, T) = -Σ_{i=1}^{p} (|Di| / |D|) log2(|Di| / |D|)   (2)

Gain(D, T) = Info(D) - Σ_{i=1}^{p} (|Di| / |D|) Info(Di)   (3)

GainRatio(D, T) = Gain(D, T) / Split(D, T)   (4)

where p(ci, D) denotes the proportion of cases in D that belong to the ith class. C4.5 selects the test that maximises the gain ratio value (Benjamin et al., 2000). Once the initial decision-tree is constructed, a pruning procedure is initiated to decrease the overall tree size and the estimated error rate of the tree (Quinlan, 1993). C4.5 uses the information gain ratio criterion to determine the most discriminatory feature at each step of its decision-tree induction process. In each round of selection, the information gain ratio criterion chooses, from those features with an average-or-better information gain, the feature that maximises the ratio of its gain divided by its entropy.
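Equations (1)-(4) can be checked with a small sketch; the class labels below are hypothetical and the helper functions are illustrative, not the authors' implementation:

```python
import math
from collections import Counter

def info(labels):
    """Info(D): entropy of the class distribution in D, equation (1)."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(labels, subsets):
    """Gain ratio of a test that splits D (labels) into subsets D1..Dp,
    equations (2)-(4)."""
    n = len(labels)
    split = -sum((len(s) / n) * math.log2(len(s) / n) for s in subsets)
    gain = info(labels) - sum((len(s) / n) * info(s) for s in subsets)
    return gain / split

# A perfectly separating split of a balanced two-class set:
# gain = 1 bit and split information = 1 bit, so the gain ratio is 1.0.
labels = ['yes'] * 4 + ['no'] * 4
print(gain_ratio(labels, [['yes'] * 4, ['no'] * 4]))  # 1.0
```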
C4.5 stops recursively building sub-trees when:
1 the obtained data subset contains samples of only one class (the leaf node is then labelled with this class), or
2 there is no available feature (the leaf node is then labelled with the majority class), or
3 the number of samples in the obtained subset is less than a specified threshold (the leaf node is then labelled with the majority class) (Quinlan, 1993).

4 Experimental results

In the proposed framework, the first stage involves k-means clustering, used for pattern extraction as described below.

4.1 k-means implementation

The first stage applies the k-means algorithm to the data and generates two clusters, namely cluster-0 and cluster-1. Cluster-0 consists of those instances which do not have disease symptoms and cluster-1 consists of instances with disease symptoms (k = 2, based on the number of outcomes). Valid grouping is then verified by comparing cluster membership with the class labels in the original datasets: if they are found to be the same, the instance is correctly classified; incorrectly classified instances are removed. The data were randomly re-sampled and this process was repeated ten times. Among the ten experiments, the run with the minimum number of misclassified instances was taken for validation of class labels (this is done in order to retain the maximum number of instances for building the classifier). From the chosen dataset, we removed the misclassified instances. Note that clustering groups the data and assigns cluster labels based on the intrinsic properties of the data, without considering the actual class labels. We discuss one such case pertaining to the Cleveland heart disease dataset: 61 instances are misclassified; these are deleted, and the 242 correctly classified instances are retained and used for classification.
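The validation step in this stage can be sketched as follows; aligning each cluster with its majority class is our reading of the 'valid grouping' check, not code from the paper:

```python
from collections import Counter

def validate_by_clustering(cluster_ids, labels):
    """Return indices of instances whose cluster's majority class matches
    their own label; the rest are the 'misclassified' instances removed
    before building the classifier."""
    majority = {}
    for cid in set(cluster_ids):
        members = [lab for c, lab in zip(cluster_ids, labels) if c == cid]
        majority[cid] = Counter(members).most_common(1)[0][0]
    return [i for i, (c, lab) in enumerate(zip(cluster_ids, labels))
            if majority[c] == lab]

# Toy run: cluster 0 is mostly 'absent', cluster 1 mostly 'present';
# the two instances that disagree with their cluster are dropped.
clusters = [0, 0, 0, 1, 1, 1]
labels = ['absent', 'absent', 'present', 'present', 'present', 'absent']
print(validate_by_clustering(clusters, labels))  # [0, 1, 3, 4]
```

In the full framework, the cluster assignments would come from a k = 2 k-means run and the retained instances would feed the C4.5 stage.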
The effect of this deletion was a decision-tree with about 10-12 fewer instances in each leaf node as compared to one built on the whole dataset. This reduction in instances did not introduce any substantial loss in the classification capability of the classifier. Polat et al. (2007a) deleted six instances due to missing values and 27 instances due to disputed values, using 270 instances for analysis; in our case, we used the 242 instances selected using the clustering method mentioned above. A similar procedure was applied to all the other seven datasets, and the misclassification rate of each dataset is given in Table 1.

4.2 Building decision-tree classifier

A decision-tree classifier model was built for all eight medical datasets in order to perform the classification task.

4.2.1 Feature selection

One of the most important aspects of data mining is feature selection, which refers to selecting relevant features from the data based on their importance. It follows one of two basic models: the wrapper model and the filter model. The wrapper model makes use of automatic feature selection, whereas in the filter model the size of the data can be considerably reduced by deleting irrelevant features. The main goal of feature selection in this study was to generate the dataset containing the smallest number of non-redundant features in order to obtain the best results. The decision-tree algorithm is an example of this type of algorithm: it uses important features and ignores the irrelevant ones. This is useful when there are many candidate features and more effort is required to reduce them (Wieschaus and Schultz, 2003). Sebban et al. (2002) reduced the number of features in order to reduce the cost and complexity of the classification algorithm as well as to improve classifier efficiency.

Table 1 Clustering classes for disease data
S. No. | Type of database | Cluster attribute | Instances | Incorrectly classified | Error (%)
1 | Cleveland heart disease dataset | Cluster-1 (present)/Cluster-0 (absent) | 303 | 61 | 20.13
2 | Statlog heart disease dataset | Cluster-1 (present)/Cluster-0 (absent) | 270 | 34 | 12.59
3 | SPECT dataset | Cluster-1 (abnormal)/Cluster-0 (normal) | 267 | 82 | 30.71
4 | BUPA liver disorder dataset | Cluster-1 (S1)/Cluster-0 (S2) | 345 | 88 | 25.32
5 | WBCD-1 (Wisconsin breast cancer dataset) | Cluster-1 (malignant)/Cluster-0 (benign) | 699 | 29 | 4.15
6 | WBCD-2 (Wisconsin breast cancer dataset) | Cluster-1 (benign)/Cluster-0 (malignant) | 569 | 39 | 6.85
7 | WPBC (Wisconsin prognostic breast cancer) | Cluster-1 (recurrent)/Cluster-0 (non-recurrent) | 198 | 43 | 21.72
8 | Hepatitis dataset | Cluster-1 (die)/Cluster-0 (live) | 155 | 30 | 19.36

In the Cleveland heart disease dataset, 13 features are available in total, of which the decision-tree used just four, viz. thal, exang, sex and cp, as shown in Figure 2.

Figure 2 Pruned decision-tree of Cleveland heart disease data

The decision-tree was obtained using the j48 algorithm (an implementation of the C4.5 decision-tree algorithm). The resulting j48 pruned decision-tree, based on all the training data, has a total size of nine nodes, including five leaf nodes. The evaluation result is based on tenfold cross validation. The WBCD-1 (Wisconsin Breast Cancer Dataset) consists of nine features, of which the decision-tree classifier uses only four, as shown in Figure 3.

Figure 3 Pruned decision-tree of WBCD-1 (Wisconsin Breast Cancer Dataset-1)

The rules are generated from these trees by considering each root-to-leaf path, as given in Tables 2 and 3. The rules thus obtained can be applied to similar data for which class labels are unknown.
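As an illustration, the four features selected for the Cleveland tree translate directly into nested conditions; this is a sketch of the rule paths, not the authors' code, with attribute encodings as in the UCI dataset:

```python
def classify_cleveland(thal, exang, sex, cp):
    """Root-to-leaf paths of the pruned Cleveland tree written as
    nested conditions; returns the pre-classified outcome."""
    if thal <= 3:
        if exang <= 0:
            return 'absent'
        if sex <= 0:
            return 'absent'
        return 'absent' if cp <= 3 else 'present'
    return 'present'

# A male patient with exercise-induced angina and chest pain type 4.
print(classify_cleveland(thal=3, exang=1, sex=1, cp=4))  # present
```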
Table 2 Rules generated from the pruned decision-tree (Cleveland heart disease)
Rule-1:  IF thal <= 3 AND exang <= 0 THEN record is pre-classified as 'absnt'
Rule-2:  ELSE IF thal <= 3 AND exang > 0 AND sex <= 0 THEN record is pre-classified as 'absnt'
Rule-3:  ELSE IF thal <= 3 AND exang > 0 AND sex > 0 AND cp <= 3 THEN record is pre-classified as 'absnt'
Rule-4:  ELSE IF thal <= 3 AND exang > 0 AND sex > 0 AND cp > 3 THEN record is pre-classified as 'prsnt'
Rule-5:  ELSE IF thal > 3 THEN record is pre-classified as 'prsnt'
Default: ELSE ignore the record
END IF

Table 3 Rules generated from the pruned decision-tree for WBCD-1 (Wisconsin Breast Cancer Dataset-1)
Rule-1:  IF UnifmtSize <= 3 AND BarNclei <= 6 THEN record is pre-classified as 'Begn'
Rule-2:  ELSE IF UnifmtSize <= 3 AND BarNclei > 6 AND Clumpth <= 2 THEN record is pre-classified as 'Begn'
Rule-3:  ELSE IF UnifmtSize <= 3 AND BarNclei > 6 AND Clumpth > 2 THEN record is pre-classified as 'malign'
Rule-4:  ELSE IF UnifmtSize > 3 AND MargAdsim <= 1 AND Clumpth <= 6 AND UnifmtSize <= 5 THEN record is pre-classified as 'Begn'
Rule-5:  ELSE IF UnifmtSize > 3 AND MargAdsim <= 1 AND Clumpth <= 6 AND UnifmtSize > 5 THEN record is pre-classified as 'malign'
Rule-6:  ELSE IF UnifmtSize > 3 AND MargAdsim <= 1 AND Clumpth > 6 THEN record is pre-classified as 'malign'
Rule-7:  ELSE IF UnifmtSize > 3 AND MargAdsim > 1 THEN record is pre-classified as 'malign'
Default: ELSE ignore the record
END IF

4.3 Performance measures

The performance of our proposed framework was evaluated using the tenfold cross validation method. The dataset was divided into ten equal subsets. In each round, one subset is used for testing and the other nine subsets are placed together for training; the process of training and testing is repeated ten times, each time using a different testing subset (Delen et al., 2005).
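The tenfold partitioning described above can be sketched as follows; in practice a library routine (e.g. a stratified k-fold splitter) would be used, and this minimal version ignores class stratification:

```python
def tenfold_indices(n, k=10):
    """Yield (train, test) index lists: indices 0..n-1 are split into k
    near-equal folds and each fold serves once as the test set."""
    folds = [list(range(i, n, k)) for i in range(k)]
    for test in folds:
        held_out = set(test)
        train = [i for i in range(n) if i not in held_out]
        yield train, test

# Every instance appears in exactly one test fold across the ten rounds.
tested = [i for _, test in tenfold_indices(25) for i in test]
print(sorted(tested) == list(range(25)))  # True
```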
A confusion matrix was calculated for the classifier to interpret the results, and its layout is given in Table 4. The confusion matrix is a square matrix in which the columns correspond to the number of instances predicted as a particular class and the rows correspond to the number of instances actually belonging to that class. True Positives (TP) and True Negatives (TN) are correct classifications. A False Positive (FP) occurs when the outcome is incorrectly predicted as yes (positive) when it is actually no (negative). A False Negative (FN) occurs when the outcome is incorrectly predicted as negative when it is actually positive.

Table 4  Confusion matrix measures

                   Predicted: Yes           Predicted: No
Actual: Yes        true positive (TP)       false negative (FN)
Actual: No         false positive (FP)      true negative (TN)

The actual results obtained by our framework on the eight datasets are shown in Table 5. For each dataset the four confusion-matrix cells are listed (the diagonal gives TP and TN, the off-diagonal entries give FN and FP), followed by the derived measures, with the previously reported method for comparison.

Table 5  Comparison of accuracy, sensitivity and specificity on the medical datasets

Proposed method:

Dataset                    TP    FN   FP   TN    Acc. (%)   Sen. (%)   Spec. (%)
Cleveland heart disease    126   3    5    108   96.70      97.67      95.57
Statlog heart disease      43    2    3    188   97.88      95.55      98.49
SPECT images               52    0    2    131   98.92      100        98.49
BUPA liver disorders       186   0    1    60    99.59      100        98.36
WBCD-1                     451   2    4    213   99.11      99.55      98.15
WBCD-2                     177   1    3    347   99.25      99.43      99.14
WPBC                       27    4    1    123   96.78      87.09      99.19
Hepatitis                  122   1    2    20    97.93      99.18      90.90

Previous methods:

Dataset                    Method                                               Acc. (%)   Sen. (%)   Spec. (%)
Cleveland heart disease    ANN-fuzzy AIRS (Polat et al., 2007a)                 87.4       93         78.5
Statlog heart disease      Kernel F-score (Polat and Gunes, 2009)               87.43      –          –
SPECT images               AIRS ensemble (Polat et al., 2007c)                  97.74      99.04      92.85
BUPA liver disorders       GA-AWAIS (Ozsen and Gunes, 2009)                     85.21      –          –
WBCD-1                     LSSVM (Polat and Gunes, 2007b)                       97.08      97.87      97.77
WBCD-2                     AIRS (Polat et al., 2007b)                           98.51      –          –
WPBC                       Naive Bayesian (Dumitru, 2009)                       74.24      27.78      91.67
Hepatitis                  Levenberg–Marquardt BP (Bascil and Temurtas, 2009)   91.51      –          –

The performance is measured using the accuracy, sensitivity and specificity measures, calculated using equations (5), (6) and (7):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (5)
Sensitivity = TP / (TP + FN)                  (6)
Specificity = TN / (TN + FP)                  (7)

The proposed framework gives very promising results in comparison with previous methods on each of the eight datasets; the accuracy, sensitivity and specificity values are compared in Table 5. For the Cleveland heart disease data, the accuracy, sensitivity and specificity obtained were 96.70%, 97.67% and 95.57%, respectively, which compare favourably with the recent study by Polat et al. (2007a). For Statlog heart disease they were 97.88%, 95.55% and 98.49%, better than the kernel F-score feature selection method proposed by Polat and Gunes (2009). On the SPECT images dataset the framework obtained 98.92%, 100% and 98.49%, better than the AIRS ensemble of Polat et al. (2007c). The classification accuracy obtained with GA-AWAIS (Ozsen and Gunes, 2009) for the BUPA liver disorders dataset using tenfold cross validation was 85.21%, whereas our method gave 99.59%. On WBCD-1 the proposed method obtained 99.11%, 99.55% and 98.15%, better than LSSVM (Polat and Gunes, 2007b), and on WBCD-2 it achieved an accuracy of 99.25%, again better than AIRS (Polat et al., 2007b).
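Equations (5)–(7) are straightforward to express as code. The sketch below uses the Cleveland heart disease confusion-matrix counts (TP = 126, FN = 3, FP = 5, TN = 108) as a worked example; the function and variable names are ours.

```python
# Accuracy, sensitivity and specificity from the confusion-matrix cells.

def metrics(tp, fn, fp, tn):
    """Return (accuracy, sensitivity, specificity) in percent."""
    accuracy = 100 * (tp + tn) / (tp + tn + fp + fn)   # equation (5)
    sensitivity = 100 * tp / (tp + fn)                 # equation (6)
    specificity = 100 * tn / (tn + fp)                 # equation (7)
    return accuracy, sensitivity, specificity

# Cleveland heart disease: TP=126, FN=3, FP=5, TN=108
acc, sen, spec = metrics(126, 3, 5, 108)
print(round(acc, 2), round(sen, 2), round(spec, 2))  # 96.69 97.67 95.58
```

The computed values agree with the figures reported for Cleveland heart disease to within last-digit rounding.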
The classification on the WPBC dataset was used to predict whether the cancer was recurrent or non-recurrent; compared with the results of Dumitru (2009), our framework achieved a higher accuracy of 96.78%. On the hepatitis data, the accuracy obtained by our framework was 97.93%, better than that of the method proposed by Bascil and Temurtas (2009). It is evident that the results obtained by our framework are better than those of the other classification methods reported in the literature.

5 Conclusions

In the proposed work, k-means clustering is used to validate the class labels associated with the data. The data instances are clustered into k disjoint clusters (here k = 2: cluster-0 and cluster-1). It has been observed that removing the misclassified instances after clustering significantly improves the performance of the classifier. The classification accuracy, sensitivity and specificity obtained by the proposed framework are better than those obtained by the competing techniques on all eight medical datasets from the UCI machine learning repository. The results indicate that the proposed framework can be routinely used for decision support by medical practitioners.

Acknowledgements

The authors are highly thankful to the reviewers and the guest editor for their fruitful comments and suggestions, which helped to improve earlier versions of this paper, and to B.I. Khdakbhavi, Director of MBE Society’s College of Engineering, Ambajogai, and AICTE for their sponsorship.

References

Andrews, P.J., Sleeman, D.H., Statham, P.F., McQuatt, A., Corruble, V., Jones, P.A., Howells, T.P. and Macmillan, C.S. (2002) ‘Predicting recovery in patients suffering brain injury by using admission variables and physiological data: a comparison between decision tree analysis and logistic regression’, Journal of Neurosurgery, Vol. 97, pp.326–336.
Minino, A.M., Heron, M.P., Murphy, S.L. and Kochanek, K.D. (2007) National Vital Statistics Reports, Vol. 55, No. 19, p.7.
Bakirci, U. and Yildirim, T. (2004) ‘Diagnosis of cardiac problems from SPECT images by feedforward networks’, IEEE 12th Signal Processing and Communication Applications Conference, pp.103–105.
Bascil, M.S. and Temurtas, F. (2009) ‘A study on hepatitis disease diagnosis using multilayer neural network with Levenberg Marquardt training algorithm’, Journal of Medical Systems.
Benjamin, K.T., Tom, B.Y.L., Samuel, W.K.C., Weijun, G. and Xuegang, Z. (2000) ‘Enhancement of a Chinese discourse marker tagger with C4.5’, Proceedings of the Second Workshop on Chinese Language Processing held in Conjunction with the 38th Annual Meeting of the Association for Computational Linguistics, Association for Computational Linguistics, Morristown, NJ, USA, Vol. 12, pp.38–45.
Chattopadhyay, S., Pratihar, D.K. and Sarkar, S. (2008) ‘Developing fuzzy classifiers to predict the chance of occurrence of adult psychoses’, Knowledge-Based Systems, Vol. 21, No. 6, pp.479–497.
Cheung, N. (2001) Machine Learning Techniques for Medical Analysis, BSc Thesis, School of Information Technology and Electrical Engineering, University of Queensland.
Das, R., Turkoglu, I. and Sengur, A. (2009) ‘Effective diagnosis of heart disease through neural networks ensembles’, Expert Systems with Applications, Vol. 36, pp.7675–7680.
Delen, D., Walker, G. and Kadam, A. (2005) ‘Predicting breast cancer survivability: a comparison of three data mining methods’, Artificial Intelligence in Medicine, Vol. 34, No. 2, pp.113–127.
Dhillon, I.S., Mallela, S. and Kumar, R. (2003) ‘A divisive information-theoretic feature clustering algorithm for text classification’, Journal of Machine Learning Research, Vol. 3, pp.1265–1287.
Dumitru, D. (2009) ‘Prediction of recurrent events in breast cancer using the Naive Bayesian classification’, Computer Science Series, Vol. 36, No. 2, pp.92–96.
Evans, S., Lemon, S., Deters, C., Fusaro, R. and Lynch, H. (1997) ‘Automated detection of hereditary syndromes using data mining’, Computers and Biomedical Research, Vol. 30, pp.337–348.
Fayyad, U., Piatetsky-Shapiro, G. and Smyth, P. (1996) ‘Data mining and knowledge discovery in databases’, Communications of the ACM, Vol. 39, No. 11, pp.24–26.
Goodwin, L., Prather, J., Schlitz, K., Iannacchione, M.A., Hammond, W. and Grzymala, J. (1997) ‘Data mining issues for improved birth outcomes’, Biomedical Science Instrumentation, Vol. 34, No. 19, pp.291–296.
Hamilton, H.J., Shan, N. and Cercone, N. (1996) RIAC: A Rule Induction Algorithm Based on Approximate Classification, Technical Report CS 96-06, University of Regina.
Han, J. and Kamber, M. (2006) Data Mining: Concepts and Techniques, 2nd ed., Morgan Kaufmann Publishers.
Hand, D., Mannila, H. and Smyth, P. (2001) Principles of Data Mining, MIT Press, Cambridge, MA.
Humar, K. and Novruz, A. (2008) ‘Design of a hybrid system for the diabetes and heart diseases’, Expert Systems with Applications, Vol. 35, pp.82–89.
Jain, L.C. and Chen, Z. (2003) ‘Industry, artificial intelligence in’, Encyclopedia of Information Systems, Elsevier Science, Vol. 2, pp.583–597.
Kyriakopoulou, A. and Kalamboukis, T. (2007) ‘Using clustering to enhance text classification’, Proceedings of the 30th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.805–806.
Lange, T., Roth, V., Braun, M.L. and Buhmann, J.M. (2004) ‘Stability-based validation of clustering solutions’, Neural Computation, Vol. 16, No. 6, pp.1299–1323.
Li, M., Cheng, Y. and Zhao, H. (2004) ‘Unlabeled data classification via support vector machine and k-means clustering’, Proceedings of the Conference on Computer Graphics, Imaging and Visualization (CGIV04), Penang, Malaysia, pp.183–186.
Newman, D.J., Hettich, S., Blake, C.L. and Merz, C.J. (1998) UCI Repository of Machine Learning Databases, Department of Information and Computer Science, University of California, Irvine, CA. Available online at: www.ics.uci.edu/~mlearn/MLRepository.html (accessed on 10 August 2009).
Ozsen, S. and Gunes, S. (2009) ‘Attribute weighting via genetic algorithms for attribute weighted artificial immune system (AWAIS) and its application to heart disease and liver disorders problems’, Expert Systems with Applications, Vol. 36, pp.386–392.
Patil, B.M., Joshi, R.C., Toshniwal, D. and Biradar, S. (2010) ‘A new approach: role of data mining in prediction of survival of burn patients’, Journal of Medical Systems. Available online at: www.springerlink.com/index/8pnh75n137t99892.pdf
Pena-Reyes, C.A. and Sipper, M. (1999) ‘A fuzzy-genetic approach to breast cancer diagnosis’, Artificial Intelligence in Medicine, Vol. 17, pp.131–155.
Polat, K. and Gunes, S. (2007a) ‘A hybrid approach to medical decision support systems: combining feature selection, fuzzy weighted pre-processing and AIRS’, Computer Methods and Programs in Biomedicine, Vol. 88, No. 2, pp.164–174.
Polat, K. and Gunes, S. (2007b) ‘Breast cancer diagnosis using least square support vector machine’, Digital Signal Processing, Vol. 17, No. 4, pp.694–701.
Polat, K. and Gunes, S. (2009) ‘A new feature selection method on classification of medical datasets: Kernel F-score feature selection’, Expert Systems with Applications, Vol. 36, pp.10367–10373.
Polat, K., Sahan, S. and Gunes, S. (2007a) ‘Automatic detection of heart disease using an artificial immune recognition system (AIRS) with fuzzy resource allocation mechanism and k-nn (nearest neighbour) based weighting preprocessing’, Expert Systems with Applications, Vol. 32, No. 2, pp.625–631.
Polat, K., Sahan, S., Kodaz, H. and Gunes, S. (2007b) ‘Breast cancer and liver disorders classification using artificial immune recognition system (AIRS) with performance evaluation by fuzzy resource allocation mechanism’, Expert Systems with Applications, Vol. 32, pp.172–183.
Polat, K., Sekerci, R. and Gunes, S. (2007c) ‘Artificial immune recognition system based classifier ensemble on the different feature subsets for detecting the cardiac disorders from SPECT images’, DEXA LNCS, Vol. 4653, pp.45–53.
Quinlan, J.R. (1993) C4.5: Programs for Machine Learning, Morgan Kaufmann Publishers, San Mateo, CA.
Quinlan, J.R. (1996) ‘Improved use of continuous attributes in C4.5’, Journal of Artificial Intelligence Research, Vol. 4, pp.77–90.
Roychowdhury, A., Pratihar, D.K., Bose, N., Sankaranarayana, K.P. and Sudhahar, N. (2004) ‘Diagnosis of the diseases using a GA-fuzzy approach’, Information Sciences, Vol. 162, No. 2, pp.105–120.
Sebban, M. and Nock, R. (2002) ‘A hybrid filter wrapper approach of feature selection using information theory’, Pattern Recognition, Vol. 35, No. 4, pp.835–846.
Serpen, G., Jiang, H. and Allred, L.G. (1997) ‘Performance analysis of probabilistic potential function neural network classifier’, Proceedings of Artificial Neural Networks in Engineering Conference, Vol. 7, pp.471–476.
Shekhar, R., Gaddam, V., Phoha, V. and Kiran, S. (2007) ‘K-Means+ID3: a novel method for supervised anomaly detection by cascading K-means clustering and ID3 decision tree learning methods’, IEEE Transactions on Knowledge and Data Engineering, Vol. 19, No. 3, pp.1–10.
Tang, Y., Jin, B., Sun, Y. and Zhang, Y. (2004) ‘Granular support vector machines for medical binary classification problems’, IEEE Symposium on Computational Intelligence in Bioinformatics and Computational Biology, 7–8 October, San Diego, CA, pp.73–78.
Ture, M., Tokatli, F. and Kurt, I. (2009) ‘Using Kaplan–Meier analysis together with decision tree methods (C&RT, CHAID, QUEST, C4.5 and ID3) in determining recurrence-free survival of breast cancer patients’, Expert Systems with Applications, Vol. 36, pp.2017–2026.
Wieschaus, E. and Schultz, M.A. (2003) A Comparison of Methods for the Reduction of Attributes Before Classification in Data Mining, Yale University, New Haven, CT.
Yan, H., Zheng, J., Jiang, Y., Peng, C. and Li, Q. (2003) ‘Development of a decision support system for heart disease diagnosis using multilayer perceptron’, IEEE Symposium on Circuits and Systems, Vol. 5, pp.V709–V712.
Zharkova, V. and Jain, L.C. (2007) ‘Artificial intelligence in recognition and classification of astrophysical and medical images’, Springer Studies in Computational Intelligence, Vol. 46.