Journal of Babylon University/Pure and Applied Sciences/ No.(4)/ Vol.(23): 2015

Using Data Mining Technique to Classify Medical Data Set

Mohammed Abdullah Naser, Department of Computer, College of Sciences for Women, University of Babylon, [email protected]
Fryal Jassim Abd Al_Razaq, Software Department, College of Information Technology, University of Babylon, [email protected]

Abstract
Data mining is an iterative, multistage process for extracting useful knowledge. Real-life data mining applications are interesting because they often present a different set of problems for data miners; one such application, carried out here, concerns diabetic patient databases. Medical data mining is the process of extracting hidden patterns from medical data. In this research, a data mining method using an enhanced RIPPER classification algorithm has been designed and implemented. RIPPER is a rule-based classifier built on rule growing and pruning. It is applied to the Pima Indians dataset of the National Institute of Diabetes and to the Mirjan hospital diabetes dataset in order to achieve an acceptable classification accuracy compared with other existing classification algorithms. The data set is loaded into the system and split into patient and non-patient records based on the class attribute; the initial rules are then applied to obtain the first classification percentage. Information gain is applied at each classification step as an indicator of rule importance. In the final step, rules are grown by a unification method applied at each step, in which the rules with the highest rank are merged to obtain a new rule. The results obtained were small in the first growing process, and accuracy increased during the rule-merging process. In the final iteration, the detection accuracy obtained is higher than that of the standard existing algorithm.
Keywords: Data Mining Technique, Medical Data Set

1. Introduction
Data mining (DM) refers to the process of finding interesting patterns and useful information in databases. The interesting patterns can be used to make predictions [Fayy96]. The process of data mining is composed of several steps, including selecting the data to analyze, preparing the data, applying the data mining algorithms, and then interpreting and evaluating the results [Witt05]. The concept of data mining is well known in different areas such as business, engineering, communications, transport, medicine, and education [Mao11].
Data from various activity fields are produced, stored, processed, and transmitted daily across different locations without taking their meaning into account. Managers focus mainly on finding methods and techniques to organize the huge volumes of data produced by transactions and other activities, and to extract useful patterns, relations, and associations from them [Andr99, Tudo09].
Classification is one of the popular techniques of data mining. It is used to predict group membership for data instances, and involves finding rules that partition the data into disjoint groups [Chri12, Geor04]. Repeated Incremental Pruning to Produce Error Reduction (RIPPER) is a classification algorithm designed to generate a rule set directly from the training dataset. The name reflects the fact that the rules are learned incrementally: a new rule associated with a class value covers various attributes of that class. The algorithm was designed to be fast and effective when dealing with large and noisy datasets, compared to decision trees [Duma11].
In this paper we propose several enhancements to the RIPPER data mining algorithm in order to classify two diabetes datasets by mining useful patterns. The enhanced classification algorithm is applied to the two diabetes datasets in order to achieve an acceptable classification accuracy compared with other existing classification algorithms.

2. Rule-based Classification Algorithms
Rule-based classifiers classify data by using a collection of "if ... then ..." rules. The rule antecedent, or condition, is an expression made of attribute conjunctions. The rule consequent is a positive or negative classification. A rule r covers a given instance x if the attributes of the instance satisfy the rule condition.
The coverage of a rule is the fraction of records that satisfy its antecedent; its accuracy is the fraction of records that satisfy both the antecedent and the consequent. Finally, a classifier has exhaustive rules if they account for every possible combination of attribute values, i.e. each record is covered by at least one rule. RIPPER and CN2 are good examples of such algorithms. The advantages of rule-based classifiers are that they are extremely expressive, since they are symbolic and operate on the attributes of the data without any transformation. Rule-based classifiers, and by extension decision trees, are easy to interpret, easy to generate, and can classify new instances efficiently [Amat11].

3. RIPPER Classification Algorithm
The name reflects the fact that the rules are learned incrementally. A new rule associated with a class value covers various attributes of that class. The algorithm was designed to be fast and effective when dealing with large and noisy datasets, compared to decision trees [Duma11]. For two-class problems, RIPPER chooses the majority class as its default class and learns the rules for detecting the minority class. For multiclass problems, the classes are ordered according to their frequencies. The algorithm is illustrated in figure (1):

Algorithm: RIPPER Algorithm
Input: Training dataset S with n instances and m attributes
Output: Rule set
Begin
  Sort classes in order from the least prevalent class to the most prevalent class
  Create a new rule set
  While iterating from the least prevalent class to the most prevalent class
    Split S into Spos and Sneg
    While Spos is not empty
      Split Spos and Sneg into Gpos and Gneg subsets and Ppos and Pneg subsets
      Create and prune a new rule
      If the error rate of the new rule is very large then
        End while
      Else
        Add the new rule to the rule set
        Compute the total description length l
        If l > d then
          End while
    End while
  End while
End

Figure (1): RIPPER algorithm.

4. The Proposed Classification System
The structure of the proposed system combines the stages of the data mining process with the enhanced RIPPER algorithm. This integration results in the system presented in figure (2): the selected medical database feeds a preprocessed-data stage, a feature and rule selection stage, and a mining-process stage, followed by model evaluation on a held-out part of the data to produce the classification results.

Figure (2): Block diagram of the proposed classification system.

4.1 Preprocessed Data Stage
Two medical data sets are used in this study: the Pima Indian Diabetes data set and the Mirjan hospital diabetes data set. The Pima data set consists of 768 instances with no missing attribute values, and has 9 attributes plus the class attribute; all attribute values are numeric. The Mirjan hospital data set consists of 160 instances with no missing attribute values, and has 11 attributes plus the class attribute, which may be 1 or 2. These attributes are described in the following sections. For each data set, the pre-processing activities consist of a data-cleaning phase that involves checking for any missing values, smoothing noisy data, identifying or removing outliers, and removing inconsistencies. For our diabetes data sets, data cleaning is not necessary because the data are already clean.
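As a minimal sketch of the pre-processing stage just described (the record layout and field names below are illustrative assumptions, not the actual attribute schema), the data can be checked for missing values and split into patient (positive) and non-patient (negative) subsets by the class attribute:

```python
# Sketch of the pre-processing stage: check for missing values and split
# records into positive (patient) and negative (non-patient) subsets by
# the class attribute. Field names and sample values are illustrative.

def has_missing(record):
    """True if any attribute value is absent."""
    return any(v is None for v in record.values())

def split_by_class(records, class_attr="class", positive=2):
    """Split records into (Spos, Sneg) by the class attribute."""
    spos = [r for r in records if r[class_attr] == positive]
    sneg = [r for r in records if r[class_attr] != positive]
    return spos, sneg

records = [
    {"glucose": 148, "bmi": 33.6, "age": 50, "class": 2},  # patient
    {"glucose": 85,  "bmi": 26.6, "age": 31, "class": 1},  # non-patient
    {"glucose": 183, "bmi": 23.3, "age": 32, "class": 2},  # patient
]

assert not any(has_missing(r) for r in records)  # data set is already clean
spos, sneg = split_by_class(records)
print(len(spos), len(sneg))  # → 2 1
```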
Also, two transformation steps are used: feature selection, which keeps 6 out of 9 attributes for the Pima data and 5 out of 11 attributes for the Mirjan data for further processing by the enhanced RIPPER algorithm, and an aggregation step for pruning and constructing new rules.

4.2 Features and Rules Selection Stage
In the previous stage, the visualization charts indicate how to choose the features most relevant to the mining algorithm. According to the World Health Organization (WHO), the 6 most relevant features out of 9 are: Glucose concentration, 2-Hour serum insulin, Body mass index, Diabetes PF, Age, and Pregnant. The 5 most relevant features out of 11 selected for the Mirjan medical data set are: hbA1c, waist, TG, HDL, and microalbuminurea. The rules are selected through a manual trial-and-error process of data analysis.

Table (1): The initial rules for Pima-Indians-Diabetes

Rule #1: IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is low) THEN Class 2
Rule #2: IF (Glucose concentration is medium) & (2-Hour serum insulin is low) & (Body mass index is medium) & (Diabetes PF is high) & (Age is high) THEN Class 2
Rule #3: IF (Pregnant is high) & (Glucose concentration is medium) & (Body mass index is high) & (Diabetes PF is high) & (Age is high) THEN Class 2
Rule #4: IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is low) & (Age is medium) THEN Class 2
Rule #5: IF (Glucose concentration is high) & (Body mass index is low) & (Age is medium) THEN Class 2
Rule #6: IF (Glucose concentration is medium) & (Body mass index is high) & (Age is low) THEN Class 2
Rule #7: IF (Pregnant is high) & (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is high) THEN Class 2
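Rules of the kind listed in Table (1) can be represented as conjunctions over discretized attribute levels. A sketch, assuming record attributes have already been discretized to low/medium/high levels (the short attribute keys and the sample record are illustrative):

```python
# Sketch: initial rules as conjunctions over discretized attribute
# levels, each mapping to the action Class 2. Attribute keys are
# shorthand for the Table (1) features; only two rules are shown.

RULES = [
    # (condition: attribute -> required level, action class)
    ({"glucose": "high", "insulin": "low", "bmi": "high",
      "dpf": "high", "age": "low"}, 2),                      # Rule #1
    ({"glucose": "medium", "insulin": "low", "bmi": "medium",
      "dpf": "high", "age": "high"}, 2),                     # Rule #2
]

def rule_covers(condition, record):
    """A rule covers a record if the record satisfies every conjunct."""
    return all(record.get(attr) == level for attr, level in condition.items())

def classify(record, rules, default=1):
    """Return the action of the first covering rule, else the default class."""
    for condition, action in rules:
        if rule_covers(condition, record):
            return action
    return default

record = {"glucose": "high", "insulin": "low", "bmi": "high",
          "dpf": "high", "age": "low"}
print(classify(record, RULES))  # → 2
```

Uncovered records fall through to the default (non-patient) class, matching the way rule-based classifiers assign a default class to everything their rules do not cover.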
The low, high and medium values for each attribute are given in the following table.

Table (2): The low, high and medium values for each attribute of Pima-Indians-Diabetes

Feature name                  | Low    | High         | Medium
Number of times pregnant      | 0      | 17           | 8
Plasma glucose concentration  | 100    | 126          | 70
Diastolic blood pressure      | 70     | >80          | 80
Triceps skin fold thickness   | <24    | >30          | 24
2-Hour serum insulin          | <100   | >=200        | [100-199]
Body mass index               | <24.9  | >25          | [24.2 - 18.5]
Diabetes pedigree             | <0.088 | >2.42000008  | 2.329
Age                           | <21    | >30          | [21-29]

The same process is used with the Mirjan Hospital diabetes data set to produce its initial rules and the low, high and medium values for each attribute.

4.3 Mining Process Stage
This stage consists of three steps.

1. Enhanced RIPPER Classification Algorithm
The enhanced RIPPER algorithm is illustrated in figure (3):

Algorithm: Enhanced RIPPER Algorithm
Input: Training dataset S with n instances and m attributes
Output: Rule sets
Begin
  Sort classes in order from the least prevalent class to the most prevalent class
  While iterating over all frequencies of the initial rule sets
    Split S into Spos and Sneg
    While Spos is not empty
      Split Spos into Ppos and Pneg
      Create and prune a new rule
      If (the m's are equal in the two selected rules)
        Continue the classification
      Else
        Select the unified attributes in the two specified rules,
        return the highest-rank attribute
      Check the error rate (information gain) of each rule as an indicator
      Add the new rule
    End while
  End while
  Return new rules
End

Figure (3): Enhanced RIPPER algorithm.

In the first step, the algorithm takes its input and output parameters, and iterations over all rule sets are performed for the whole data set. The data set S is split into Spositive and Snegative (Spos and Sneg). The next step shows how the rule-pruning process and the stopping condition are constructed from these enhancement ideas.
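The "information gain as indicator" check in figure (3) can be sketched as follows. Reading the indicator as the entropy reduction achieved by a rule's split of the records is an assumption here; the counts are illustrative:

```python
# Sketch of an information-gain indicator for ranking rules (figure 3).
# Interpreting the indicator as the entropy reduction achieved by a
# rule's covered / not-covered split is an assumption of this sketch.
import math

def entropy(pos, neg):
    """Binary entropy of a pos/neg split, in bits."""
    total = pos + neg
    if total == 0 or pos == 0 or neg == 0:
        return 0.0
    p = pos / total
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

def information_gain(pos, neg, covered_pos, covered_neg):
    """Gain of splitting (pos, neg) records into covered / not covered."""
    total = pos + neg
    rest_pos, rest_neg = pos - covered_pos, neg - covered_neg
    covered = covered_pos + covered_neg
    remainder = (covered / total) * entropy(covered_pos, covered_neg) \
        + ((total - covered) / total) * entropy(rest_pos, rest_neg)
    return entropy(pos, neg) - remainder

# A rule covering mostly positives outranks an uninformative split:
print(information_gain(50, 50, 40, 5) > information_gain(50, 50, 25, 25))  # → True
```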
2. Rule Pruning
This is the most important step of the enhanced RIPPER algorithm. Each rule's attributes are checked against those of the other initial rules. In the aggregation process, rules are aggregated based on the high rank of the attributes extracted above the main diagonal. This is done because the initial rules vary in their attribute values; consequently the pruned rules vary as well, and the pruned rules achieve a high detection accuracy. The following table shows the pruning process between each pair of rules for the Pima-Indians-Diabetes data set.

Table (4): Pruning process of the enhanced RIPPER algorithm for the Pima-Indians-Diabetes data set

Rule n,m | Rule#1 | Rule#2 | Rule#3 | Rule#4 | Rule#5 | Rule#6 | Rule#7
Rule#1   | R 1,1  | R 1,2  | R 1,3  | R 1,4  | R 1,5  | R 1,6  | R 1,7
Rule#2   | R 2,1  | R 2,2  | R 2,3  | R 2,4  | R 2,5  | R 2,6  | R 2,7
Rule#3   | R 3,1  | R 3,2  | R 3,3  | R 3,4  | R 3,5  | R 3,6  | R 3,7
Rule#4   | R 4,1  | R 4,2  | R 4,3  | R 4,4  | R 4,5  | R 4,6  | R 4,7
Rule#5   | R 5,1  | R 5,2  | R 5,3  | R 5,4  | R 5,5  | R 5,6  | R 5,7
Rule#6   | R 6,1  | R 6,2  | R 6,3  | R 6,4  | R 6,5  | R 6,6  | R 6,7
Rule#7   | R 7,1  | R 7,2  | R 7,3  | R 7,4  | R 7,5  | R 7,6  | R 7,7

The initial rules depend on six attributes (1, 2, 5, 6, 7 and 8): Pregnant, Glucose concentration, 2-Hour serum insulin, Body mass index, Diabetes PF, and Age. For example, the aggregation process between rule #1 and rule #2:

Rule #1: IF (Glucose concentration is high) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is low).
Rule #2: IF (Glucose concentration is medium) & (2-Hour serum insulin is low) & (Body mass index is medium) & (Diabetes PF is high) & (Age is high).
Pruned rule: IF (Glucose concentration is medium) & (2-Hour serum insulin is low) & (Body mass index is high) & (Diabetes PF is high) & (Age is low).

The pruning in this step yields the best pruning and aggregation for deriving the new rules.
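The aggregation step above can be sketched as an attribute-by-attribute merge of two rules. How the higher-ranked level is chosen per attribute is not fully specified, so the preference table below is an illustrative stand-in, chosen to reproduce the pruned rule of the example:

```python
# Sketch of the rule-aggregation (unification) step: two rules over the
# same attributes are merged by keeping, for each attribute, the level
# from the higher-ranked side. The preference table is an illustrative
# stand-in for the attribute ranking used by the enhanced algorithm.

def merge_rules(rule_a, rule_b, prefer_a):
    """Merge two conditions attribute by attribute.

    prefer_a[attr] is True when rule_a's level outranks rule_b's."""
    return {attr: (rule_a[attr] if prefer_a.get(attr, True) else rule_b[attr])
            for attr in rule_a}

rule1 = {"glucose": "high", "insulin": "low", "bmi": "high",
         "dpf": "high", "age": "low"}
rule2 = {"glucose": "medium", "insulin": "low", "bmi": "medium",
         "dpf": "high", "age": "high"}
# Illustrative ranking reproducing the pruned rule of the example above:
prefer_rule1 = {"glucose": False, "insulin": True, "bmi": True,
                "dpf": True, "age": True}

merged = merge_rules(rule1, rule2, prefer_rule1)
print(merged["glucose"], merged["bmi"], merged["age"])  # → medium high low
```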
Also, the same processing is performed on the rules of the Mirjan Hospital diabetes data set.

4.4 Model Evaluation and Classification Results
In this stage, we evaluate the suggested enhanced RIPPER algorithm by testing 500 non-patient class records against the newly derived rules. The detection accuracy was 75.37% for the training stage of the pruning process. Accuracy in the testing process of the evaluation model was 94.4%, with a false-negative rate of 0.056 for the 28 remaining misclassified records. Accuracy is calculated as:

Accuracy = (TP + TN) / (TP + TN + FN + FP)   .... (1)

where TP is the true positives, TN the true negatives, FN the false negatives and FP the false positives.

Accuracy (Training) = (202 + 0) / (202 + 0 + 66 + 0) = 75.37%
Accuracy (Testing) = (472 + 0) / (472 + 0 + 28 + 0) = 94.4%

For the Mirjan data set, the detection accuracy was 88.33% for the training stage of the pruning process, and the testing process in the evaluation model reached 100% with a 0.0 false-negative rate on the remaining data:

Accuracy (Training) = (106 + 0) / (106 + 0 + 0 + 14) = 88.33%
Accuracy (Testing) = (40 + 0) / (40 + 0 + 0 + 0) = 100%

5. Performance Comparison with Other Data Mining Algorithms
In this section, the accuracy of the proposed algorithm is compared with several algorithms that have used the Pima Indians Diabetes data set: RIPPER, JRIP, Belief Network (Laplace), Decision Tree, Naïve Bayes, C4.5 and FURIA. Table (5) and figure (4) show the estimated classification accuracies on the testing data.

Table (5): A comparison of enhanced RIPPER with other algorithms on the Pima-Indians-Diabetes data set

Algorithm                | Accuracy
Enhanced RIPPER          | 94.4%
RIPPER                   | 74.56%
JRIP                     | 72.00%
Belief Network (Laplace) | 72.50%
Decision Tree            | 72.00%
Naïve Bayes              | 71.50%
C4.5                     | 73.43%
FURIA                    | 74.71%

Figure (4): Accuracy of enhanced RIPPER compared with other algorithms on the Pima-Indians-Diabetes data set.
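As a check on equation (1), the reported Pima accuracies can be reproduced directly from the confusion counts given above:

```python
# Sketch: the accuracy measure of equation (1), reproduced with the
# confusion counts reported for the Pima data set.

def accuracy(tp, tn, fp, fn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

# Training: 202 correctly classified, 66 misclassified -> 75.37%
print(round(100 * accuracy(202, 0, 0, 66), 2))  # → 75.37
# Testing: 472 of 500 non-patient records correct -> 94.4%
print(round(100 * accuracy(472, 0, 0, 28), 2))  # → 94.4
```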
6. Conclusions
Designing and implementing the enhanced RIPPER algorithm and applying it to a medical data set leads to several conclusions:
1. Rule-based classifiers such as RIPPER or enhanced RIPPER are well suited to the data sets with unbalanced class distributions on which they operate.
2. Concerning the benefits of the system:
a. The designed system facilitates classifying the medical dataset into patient and non-patient records. Non-patient records are classified accurately in the testing phase; patient records are classified in the training phase.
b. The proposed system is designed to work entirely on behalf of the medical staff and does not need user interference.
c. Data mining classification methods bring many benefits in terms of automation, simplicity and performance.
d. The proposed system can be applied to a medical dataset without excessive time and memory consumption.
3. Concerning the implementation:
a. For the enhanced RIPPER algorithm, the resulting pruned rules are very effective and sufficient. These rules were applied to the Mirjan medical data and gave a high detection rate.
b. The performance comparison with other classification methods, such as decision trees, shows that the obtained results surpass the existing ones.

7. References
[Amat11] Xavier Amatriain, Alejandro Jaimes, Nuria Oliver, and Josep M. Pujol, Data Mining Methods for Recommender Systems, book, 2011.
[Andr99] E. Andrássoyá and J. Paralič, Knowledge Discovery in Databases - A Comparison of Different Views, International Conference on Information and Intelligent Systems, 1999.
[Chri12] Angeline Christobel Y. and P. SivaPrakasam, The Negative Impact of Missing Value Imputation in Classification of Diabetes Dataset and Solution for Improvement, IOSR Journal of Computer Engineering (IOSRJCE), Volume 7, Issue 4, pp. 16-23, Nov.-Dec. 2012.
[Duma11] Mlungisi Duma, Bhekisipho Twala, Tshilidzi Marwala, and Fulufhelo V. Nelwamondo, Improving the Performance of the RIPPER in Insurance Risk Classification: A Comparative Study Using Feature Selection, 8th International Conference on Informatics in Control, Automation and Robotics, 2011.
[Fayy96] Usama Fayyad, Gregory Piatetsky, and Padhraic Smyth, From Data Mining to Knowledge Discovery in Databases, AI Magazine, vol. 17, no. 3, pp. 37-54, 1996.
[Geor04] Arun George Eapen, Application of Data Mining in Medical Applications, Master thesis, University of Waterloo, 2004.
[Folo03] S.O. Folorunso and A.B. Adeyemo, Alleviating Classification Problem of Imbalanced Dataset, African Journal of Computing & ICT, Vol. 6, No. 2, June 2013.
[Hühn09] Jens Hühn and Eyke Hüllermeier, FURIA: An Algorithm for Unordered Fuzzy Rule Induction, Springer US, Volume 19, Issue 3, pp. 293-319, December 2009.
[Mao11] Yi Mao, Yixin Chen, Gregory Hackmann, Minmin Chen, Chenyang Lu, Marin Kollef, and Thomas C. Bailey, Medical Data Mining for Early Deterioration Warning in General Hospital Wards, Proc. ICDM Workshop on Biological Data Mining and its Applications in Healthcare (BioDM-11), 2011.
[Tudo09] Irina Tudor and Liviu Ionita, Intelligent Agents as Data Mining Techniques Used in Academic Environment, The 4th International Conference on Virtual Learning ICVL, 2009.
[Vija11] S. Vijayarani and M. Divya, An Efficient Algorithm for Generating Classification Rules Hiding, International Journal of Computer Science and Technology, Vol. 2, Issue 4, Oct.-Dec. 2011.
[Witt05] Ian H. Witten and Eibe Frank, Data Mining: Practical Machine Learning Tools and Techniques, Elsevier, 2005.